๐ฆ๐๐ผ๐ฝ ๐๐ถ๐ด๐ต๐๐ถ๐ป๐ด ๐๐ต๐ฒ ๐๐ข๐
I tried to scrape 5,000 product pages. It failed. Each page had different HTML. One price stayed in a span. Another hid in a div. My selectors broke.
I tried XPath and regex. Neither worked reliably. I wrote 500 lines of code to handle errors. The code was brittle.
I changed my approach. I stopped fighting the structure. I treated the page as natural language.
The new process:
- Get visible text with Playwright.
- Send text to an LLM.
- Request JSON output.
This removes fragile selectors. The LLM finds the data regardless of the HTML tag.
Lessons learned:
- Cost: GPT-4o-mini is cheap.
- Speed: API calls take 2 to 5 seconds.
- Accuracy: High for titles. Lower for rare specs.
Use this for messy sites. Avoid this for stable sites. Avoid this for real-time needs. Use local models for private data.
Stop treating pages as structured documents. Treat them as text.
Source: https://dev.to/__c1b9e06dc90a7e0a676b/why-css-selectors-failed-me-using-llms-to-scrape-inconsistent-web-pages-40ap Optional learning community: https://t.me/GyaanSetuAi