𝗙𝗿𝗼𝗺 𝗥𝗲𝗴𝗲𝘅 𝘁𝗼 𝗟𝗟𝗠𝘀: 𝗠𝘆 𝗝𝗼𝘂𝗿𝗻𝗲𝘆 𝗘𝘅𝘁𝗿𝗮𝗰𝘁𝗶𝗻𝗴 𝗨𝗻𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲𝗱 𝗪𝗲𝗯 𝗗𝗮𝘁𝗮
I tried to build a price comparison tool. It needed product data from dozens of e-commerce sites.
Every site used different HTML structures. Some used random CSS classes. Others used JavaScript to load content.
My first plan used Regex and BeautifulSoup. It worked for two sites. Then everything broke. One site changed its layout. Another site started using dynamic content. I spent more time fixing scrapers than using data.
I tried using an LLM next. I sent raw HTML to an AI and asked for data. This failed too. The output was inconsistent. The AI hallucinated values. My API costs went up because raw HTML is mostly markup, and all of it counts as tokens.
I found a middle ground. I now use a hybrid approach.
Here is my process:
- Preprocess the HTML. I strip scripts, styles, and navigation bars. I only keep the visible text. This keeps token counts low.
- Ask for structured output. I use JSON mode or function calling so the LLM returns fixed fields instead of free text.
- Add retry logic. I retry up to three times if the response is not valid JSON.
- Cache results. I save successful extractions by URL to avoid repeat costs.
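A minimal sketch of how these four steps could fit together, using only the Python standard library. The names here are assumptions for illustration: `call_llm` is a hypothetical function that wraps whatever LLM API you use and returns a JSON string, and the `name` and `price` fields are an example schema, not a fixed one. The built-in `html.parser` stands in for BeautifulSoup to keep the example dependency-free.

```python
import json
from html.parser import HTMLParser

# Tags whose contents are never visible product text.
SKIPPED_TAGS = {"script", "style", "nav", "noscript"}


class VisibleTextParser(HTMLParser):
    """Collects text that is not inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Step 1: strip scripts, styles, and navigation; keep visible text."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def extract_product(url, html, call_llm, cache, max_attempts=3):
    """Steps 2-4: structured output, retries, and a cache keyed by URL.

    `call_llm` is a hypothetical callable: it takes the cleaned text and
    returns a JSON string (for example via JSON mode or function calling).
    """
    if url in cache:                      # Step 4: skip repeat costs
        return cache[url]
    text = visible_text(html)
    for _ in range(max_attempts):         # Step 3: retry up to three times
        raw = call_llm(text)
        try:
            data = json.loads(raw)        # Step 2: expect structured JSON
        except json.JSONDecodeError:
            continue
        # Assumed example schema: a dict with "name" and "price".
        if isinstance(data, dict) and "name" in data and "price" in data:
            cache[url] = data
            return data
    return None                           # Caller decides what to do on failure
```

In this sketch the cache is a plain dict, so it only lasts for one run. Swapping in a file or database is a separate choice that does not change the flow.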
Regex and BeautifulSoup are still best for static, well-structured pages. They are fast and free.
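For a static page, the traditional route can be as small as the sketch below. The `<span class="price">` markup and the dollar format are assumptions about one hypothetical site. That is the catch: the pattern is fast and free, but it stops matching as soon as the site renames the class.

```python
import re

# Assumed markup for one hypothetical store: <span class="price">$1,299.00</span>
PRICE_PATTERN = re.compile(r'<span class="price">\s*\$([\d,]+\.\d{2})\s*</span>')


def static_price(html):
    """Return the first price on the page as a float, or None if not found."""
    match = PRICE_PATTERN.search(html)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return float(match.group(1).replace(",", ""))
```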
LLMs are best for messy or changing sites. They are not magic. You must clean your input first to save money.
My current workflow:
- Use a lightweight parser for easy sites.
- Use an LLM for unpredictable sites.
- Monitor cost per extraction.
- Validate the output to ensure prices look real.
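The last two points of the workflow can be sketched in a few lines. Everything here is an assumption for illustration: the price bounds, the US-style number format, and the idea that you know the cost of each LLM call and can pass it to `record`.

```python
import re
from dataclasses import dataclass


@dataclass
class CostTracker:
    """Running total of API spend, to watch cost per extraction."""
    total_cost: float = 0.0
    extractions: int = 0

    def record(self, cost):
        self.total_cost += cost
        self.extractions += 1

    @property
    def cost_per_extraction(self):
        return self.total_cost / self.extractions if self.extractions else 0.0


def parse_price(value):
    """Turn 19.99 or '$1,299.00' into a float; return None if that fails.

    Assumes US-style formatting (comma for thousands, dot for decimals).
    """
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        cleaned = re.sub(r"[^\d.\-]", "", value)
        try:
            return float(cleaned)
        except ValueError:
            return None
    return None


def price_looks_real(value, low=0.01, high=100_000.0):
    """Reject missing, zero, negative, or absurdly large prices."""
    price = parse_price(value)
    return price is not None and low <= price <= high
```

A range check like this will not catch a wrong but plausible price. It only filters out the obvious failures, such as a hallucinated zero or a phone number parsed as a price.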
This change helped me add new stores in one hour instead of one day.
What is your strategy for messy web data? Do you use LLMs or stick to traditional scrapers?