𝗦𝘁𝗼𝗽 𝗪𝗿𝗶𝘁𝗶𝗻𝗴 𝗕𝗿𝗶𝘁𝘁𝗹𝗲 𝗦𝗰𝗿𝗮𝗽𝗲𝗿𝘀
I scraped 5,000 product pages. Every page had different HTML. My CSS selectors broke. I tried XPath. I tried regex. Nothing worked.
I built a scoring system. It failed during A/B tests. I used a headless browser. I wrote 500 lines of error handling. It was too fragile.
I stopped fighting the HTML structure. I extracted all visible text. I asked an LLM to find the data. I asked for JSON.
The process is simple.
- Use Playwright to render the page.
- Extract visible text.
- Prompt the LLM for specific fields.
- Store the result.
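Here is a minimal sketch of those four steps. It is not my exact code. The field list, the prompt wording, the 20,000-character cap, and the function names are all assumptions for illustration. `ask_llm` is a hypothetical callable that takes a prompt string and returns the model's reply, so you can wrap whatever client you use.

```python
import json
from typing import Callable

# Hypothetical field list; replace with the fields you need.
FIELDS = ["name", "price", "currency"]


def render_visible_text(url: str) -> str:
    """Render the page in headless Chromium and return the visible body text."""
    # Imported here so the rest of the sketch runs without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            return page.inner_text("body")
        finally:
            browser.close()


def build_prompt(text: str, fields: list[str], max_chars: int = 20_000) -> str:
    """Ask for the named fields as one JSON object, with null for missing values."""
    return (
        "Extract these fields from the product page text below: "
        + ", ".join(fields)
        + ".\nReply with a single JSON object. "
        + "Use null for any field you cannot find.\n\n"
        + text[:max_chars]  # assumed cap to keep the prompt small
    )


def parse_fields(raw: str, fields: list[str]) -> dict:
    """Pull the JSON object out of the reply and keep only the requested fields."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in LLM reply")
    data = json.loads(raw[start : end + 1])
    return {field: data.get(field) for field in fields}


def scrape(
    url: str,
    ask_llm: Callable[[str], str],
    fields: list[str] = FIELDS,
    render: Callable[[str], str] = render_visible_text,
) -> dict:
    """Render, extract text, prompt the LLM, and return the parsed fields."""
    text = render(url)
    return parse_fields(ask_llm(build_prompt(text, fields)), fields)


def append_jsonl(record: dict, path: str) -> None:
    """Store one result per line in a JSON Lines file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Nothing in this sketch depends on the page's HTML structure. Only the visible text reaches the prompt. Missing fields come back as `None` instead of raising, so one odd page does not stop the run.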
Here are the results.
- Cost: $1.50 for 5,000 pages using GPT-4o-mini.
- Speed: Each call takes 2 to 5 seconds.
- Accuracy: 97% for prices. 85% for rare fields.
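The cost and speed numbers are easier to judge per page and per run. This is a quick back-of-the-envelope sketch using only the figures above. The worker count in `wall_clock_hours` is a hypothetical parameter, and the formula assumes calls parallelize cleanly with no rate limits.

```python
PAGES = 5_000
TOTAL_COST_USD = 1.50
SECONDS_PER_CALL = (2, 5)  # fastest and slowest call time

# Cost of a single page.
cost_per_page = TOTAL_COST_USD / PAGES

# Hours for the whole run if calls happen one after another.
sequential_hours = tuple(PAGES * s / 3600 for s in SECONDS_PER_CALL)


def wall_clock_hours(pages: int, seconds_per_call: float, workers: int) -> float:
    """Estimated run time with `workers` calls in flight at once (idealized)."""
    return pages * seconds_per_call / workers / 3600


print(f"${cost_per_page:.4f} per page")
print(f"{sequential_hours[0]:.1f} to {sequential_hours[1]:.1f} hours sequentially")
```

That works out to $0.0003 per page and roughly 2.8 to 6.9 hours if the calls run one at a time.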
Use this method for messy sites.
Avoid it in these cases.
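- Stable sites. Selectors are cheaper and faster there.
- Real-time data. 2 to 5 seconds per call is too slow.
- Private data. The page text goes to a third-party API.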
Treat web pages as text. Stop treating them as structured documents.