From Regex to LLMs: My Journey Extracting Web Data
I built a price comparison tool. I needed to pull product data from dozens of e-commerce sites. Every site had a different structure. Some used random CSS classes. Some used JavaScript to load content.
I tried the classic way first. I used Regex and BeautifulSoup.
It worked for two sites. Then one site changed its layout. My code broke. Another site loaded its content dynamically. I spent more time fixing scrapers than using the data.
Then I tried AI. I fed raw HTML into an LLM.
The results were bad. The output was inconsistent. Sometimes I got JSON. Sometimes I got paragraphs. The model hallucinated data. The cost was too high because I sent too many tokens.
I found a middle ground. I now use a hybrid approach.
Here is my process:
- Preprocess the HTML. I strip scripts, styles, and navigation bars. I only keep the visible text. This reduces token counts and costs.
- Use JSON mode or function calling. This pushes the LLM to return structured data instead of free text.
- Add retry logic. If the response is not valid JSON or is missing a field, the code tries again.
- Cache results. I save successful extractions per URL to avoid paying for the same page twice.
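The four steps above can be sketched roughly like this. It is a minimal illustration, not my production code. The names `visible_text`, `extract_product`, and `SKIPPED_TAGS` are made up for this example, and `call_llm` is a hypothetical function you supply that wraps your provider's JSON mode or function calling and returns the raw response string. The cache here is a plain dict. A real setup would likely persist it to disk or a database.

```python
import json
from html.parser import HTMLParser

# Tags whose contents are dropped entirely (an assumed list; tune it per site).
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "noscript"}


class VisibleTextParser(HTMLParser):
    """Collects text that is not nested inside any skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Step 1: strip scripts, styles, and navigation; keep the remaining text."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def extract_product(url, html, call_llm, cache, max_attempts=3):
    """Return a dict with 'name' and 'price', or None if every attempt fails.

    call_llm is a hypothetical callable: cleaned text in, raw string out.
    """
    if url in cache:  # Step 4: never pay for the same page twice.
        return cache[url]
    text = visible_text(html)
    for _ in range(max_attempts):  # Step 3: retry on bad output.
        raw = call_llm(text)  # Step 2: the provider call is expected to return JSON.
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # The model answered with prose or broken JSON.
        if isinstance(data, dict) and {"name", "price"} <= data.keys():
            cache[url] = data
            return data
    return None
```

Passing `call_llm` in as an argument keeps the sketch independent of any one provider and makes it easy to test with a fake function.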
Regex and BeautifulSoup are still best for static, well-structured sites. They are fast and free.
LLMs are better for messy, unpredictable sites. But you must clean the data first to keep costs low.
My lessons learned:
- Monitor your costs and success rates.
- Watch your latency. LLM calls take seconds, not milliseconds.
- Validate the output. Check that the price parses as a number and falls in a plausible range.
- Use a hybrid system. Use a parser for easy sites and an LLM for the messy ones.
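Here is a rough sketch of the validation and hybrid routing ideas from the list above. Everything in it is illustrative: `parse_price`, `extract`, the `max_price` ceiling, and the `parsers` mapping (hostname to a site-specific parser function) are names and values I made up for the example. It also assumes prices use a dot as the decimal separator, which will not hold for every store.

```python
import re
from decimal import Decimal, InvalidOperation
from urllib.parse import urlparse


def parse_price(raw, max_price=Decimal("100000")):
    """Return the price as a Decimal, or None if it does not look like a real price.

    Assumes a dot as the decimal separator; max_price is an arbitrary ceiling.
    """
    cleaned = re.sub(r"[^\d.]", "", str(raw))  # drop currency symbols, commas, spaces
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None  # empty string or something like "1.2.3"
    if value <= 0 or value > max_price:
        return None
    return value


def extract(url, html, parsers, llm_extract):
    """Try a site-specific parser first, then fall back to the LLM.

    parsers maps a hostname to a function html -> dict or None (hypothetical).
    llm_extract is a function html -> dict or None (hypothetical).
    Returns (result, source) so success rates can be tracked per source.
    """
    parser = parsers.get(urlparse(url).hostname)
    if parser is not None:
        result = parser(html)
        if result is not None and parse_price(result.get("price")) is not None:
            return result, "parser"
    result = llm_extract(html)
    if result is not None and parse_price(result.get("price")) is not None:
        return result, "llm"
    return None, "failed"
```

Returning the source label alongside the result makes it simple to count how often the cheap parser path succeeds versus how often the LLM fallback is needed.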
This setup saves me hours of work. I can add a new store in under an hour.
How do you handle messy web data? Do you prefer parsers or LLMs?
Optional learning community: https://t.me/GyaanSetuAi