Stop Fighting HTML Parsing
I spent a weekend fighting HTML parsing.
I needed product specs from 12 e-commerce sites. BeautifulSoup and regex both failed. Some sites buried the specs in messy divs. Others rendered them with JavaScript. Some showed them only as images. I wrote a 200-line function. It still missed half the data.
I stopped fighting the HTML structure. The HTML changes. The meaning stays the same.
I switched to this workflow:
- Scrape the raw body text.
- Define a JSON schema.
- Use GPT-4o-mini to extract values.
- Parse the JSON result.
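Here is a minimal sketch of those four steps, not my exact code. It uses only the Python standard library. The schema fields (name, weight_kg, price_usd) are made up for illustration. The model call is left as a hypothetical `complete` function that takes a prompt and returns the model's reply as a string, so you can plug in GPT-4o-mini or any other client.

```python
import json
from html.parser import HTMLParser
from typing import Callable

# Hypothetical schema. The field names are illustrative, not from a real site.
SPEC_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "weight_kg": {"type": ["number", "null"]},
        "price_usd": {"type": ["number", "null"]},
    },
    "required": ["name", "weight_kg", "price_usd"],
}


class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def body_text(html: str) -> str:
    """Step 1: reduce the page to raw text, ignoring the markup structure."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_prompt(text: str) -> str:
    """Step 2: hand the model the schema together with the page text."""
    return (
        "Extract the product specs from the page text below. "
        "Reply with one JSON object matching this schema and nothing else. "
        "Use null for any value that is not on the page.\n"
        f"Schema: {json.dumps(SPEC_SCHEMA)}\n"
        f"Page text: {text}"
    )


def extract_specs(html: str, complete: Callable[[str], str]) -> dict:
    """Steps 3 and 4: call the model, then parse and sanity-check its JSON."""
    raw = complete(build_prompt(body_text(html)))
    data = json.loads(raw)  # raises json.JSONDecodeError on a non-JSON reply
    missing = [key for key in SPEC_SCHEMA["required"] if key not in data]
    if missing:
        raise ValueError(f"model reply is missing fields: {missing}")
    return data
```

The check on required fields is deliberately shallow. It catches a reply that skips a field. It does not check types or values.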
Why this method works:
- It does not care whether the specs sit in tables or lists.
- It survives site redesigns.
- It corrects typos and normalizes units.
The results:
- 85% accuracy.
- Cost: 10 to 50 dollars for 10,000 products.
- Speed: 2 to 5 seconds per page.
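To put those numbers in per-product terms, here is the arithmetic as a short script. It assumes pages are processed one at a time. Parallel requests would cut the wall-clock time.

```python
PRODUCTS = 10_000
COST_USD = (10, 50)        # low and high total cost from the results above
SECONDS_PER_PAGE = (2, 5)  # low and high latency from the results above

# Cost per product in dollars: 0.001 to 0.005, a tenth to half of a cent.
cost_per_product = tuple(cost / PRODUCTS for cost in COST_USD)

# Hours to process every product sequentially: roughly 5.6 to 13.9.
sequential_hours = tuple(sec * PRODUCTS / 3600 for sec in SECONDS_PER_PAGE)

print(cost_per_product, sequential_hours)
```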
Warnings:
- Do not send sensitive data to a hosted API. Use a local model like Llama 3 instead.
- It is too slow for real-time page loads.
- It is not a fit for jobs that need 99% accuracy.
Tips for better results:
- Cache results by URL hash.
- Use regex for simple prices.
- Use a second model to verify values.
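The first two tips can be sketched like this. It is a rough illustration, not production code. The cache is a plain dict keyed by a SHA-256 hash of the URL, and you would swap in disk or Redis for real use. The price pattern only handles dollar amounts like $19.99 or $1,299.00. The `extract` argument is a hypothetical function that runs the LLM pipeline for one URL.

```python
import hashlib
import re
from typing import Callable, MutableMapping, Optional

# Matches simple dollar prices such as "$19.99" or "$1,299.00".
PRICE_RE = re.compile(r"\$\s?(\d[\d,]*(?:\.\d{1,2})?)")


def url_key(url: str) -> str:
    """Stable cache key: the SHA-256 hex digest of the URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def cached_extract(
    url: str,
    cache: MutableMapping[str, dict],
    extract: Callable[[str], dict],
) -> dict:
    """Run the expensive extraction only when the URL has not been seen."""
    key = url_key(url)
    if key not in cache:
        cache[key] = extract(url)
    return cache[key]


def simple_price(text: str) -> Optional[float]:
    """Cheap regex path: return the first dollar price, or None if absent."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))
```

Try `simple_price` first. Call the model only when it returns None or when you need more than the price.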
Interwest Info is another option. It returns structured JSON from a URL.
How do you handle messy data? Do you use LLMs or a hybrid pipeline?