My Web Scraping Nightmare Ended With LLMs
I built scrapers for years. I used CSS selectors. They worked for one site. They failed across ten. Every page had a different layout. The code became a mess.
I tried a new way. I fed raw HTML to an LLM. I told the model which data I wanted. An LLM reads the text for meaning. It does not depend on specific tags or class names.
Here is how you do it:
- Use a cheap model like GPT-4o-mini.
- Write a prompt for specific fields.
- Ask for a JSON object.
- Trim HTML to save money.
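
Here is a minimal sketch of those four steps. It is not the original author's code. The field names, `build_prompt`, `extract`, and the `max_chars` cutoff are all assumptions for illustration. `call_model` is a hypothetical wrapper around whichever LLM client you use, and `fake_model` stands in for it so the example runs without an API key.

```python
import json
from typing import Callable

# Hypothetical fields for a job posting; swap in whatever you need.
FIELDS = ["title", "company", "location"]


def build_prompt(html: str, fields: list[str], max_chars: int = 8000) -> str:
    """Name the fields, ask for JSON only, and cap the HTML length."""
    trimmed = html[:max_chars]  # crude trim to limit input cost (assumed cutoff)
    field_list = ", ".join(fields)
    return (
        f"Extract the following fields from the HTML below: {field_list}.\n"
        "Reply with a single JSON object using exactly those keys. "
        "Use null for any field that is not present.\n\n"
        f"HTML:\n{trimmed}"
    )


def extract(html: str, fields: list[str], call_model: Callable[[str], str]) -> dict:
    """Send the prompt through call_model and parse the reply as JSON."""
    reply = call_model(build_prompt(html, fields))
    return json.loads(reply)


def fake_model(prompt: str) -> str:
    # Stand-in for a real API call; returns a canned reply for demonstration.
    return '{"title": "Data Engineer", "company": "Acme", "location": null}'


result = extract("<html>...</html>", FIELDS, fake_model)
print(result)
```

Passing the model call in as a function keeps the prompt logic testable. You can swap the stand-in for a real client later without touching the rest.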
I improved the process:
- Remove script and style tags.
- Cache results by URL.
- Validate JSON output.
- Use a hybrid system. Try selectors first. Use AI as a backup.
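
A rough sketch of how these four improvements could fit together, under stated assumptions. `strip_noise`, `validate`, and `scrape` are hypothetical names. `try_selectors` and `call_model` are callables you supply. The cache is a plain in-memory dict. The regex cleanup is a simplification; a real HTML parser would handle malformed markup better.

```python
import json
import re
from typing import Callable, Optional

# Matches <script>...</script> and <style>...</style> blocks across lines.
_NOISE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)


def strip_noise(html: str) -> str:
    """Drop script and style blocks, then collapse runs of whitespace."""
    return re.sub(r"\s+", " ", _NOISE.sub("", html)).strip()


def validate(reply: str, fields: list[str]) -> Optional[dict]:
    """Return the parsed object if it is a JSON object with every field, else None."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(f not in data for f in fields):
        return None
    return data


_cache: dict[str, dict] = {}  # in-memory cache keyed by URL (assumption)


def scrape(
    url: str,
    html: str,
    fields: list[str],
    try_selectors: Callable[[str], Optional[dict]],
    call_model: Callable[[str], str],
) -> Optional[dict]:
    """Hybrid flow: cache, then selectors, then the model as a backup."""
    if url in _cache:
        return _cache[url]
    result = try_selectors(html)  # cheap path first
    if result is None:
        # Selectors failed, so fall back to the model on cleaned HTML.
        result = validate(call_model(strip_noise(html)), fields)
    if result is not None:
        _cache[url] = result  # only cache results that passed validation
    return result
```

`scrape` returns `None` when both paths fail, so the caller decides whether to retry or log the page. Failed results are never cached.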
Use this for:
- Sites with shifting layouts.
- Small projects.
- Fast prototypes.
Avoid this for:
- Millions of pages.
- Real-time needs.
- Private data you should not send to a third-party API.
Try this for job boards or news. Start small. Measure accuracy.
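
One simple way to measure accuracy is to hand-check a small sample and compare field by field. This is a sketch, not a standard metric. `field_accuracy` is a hypothetical helper, and the sample records below are made up for illustration.

```python
def field_accuracy(
    predicted: list[dict], expected: list[dict], fields: list[str]
) -> dict[str, float]:
    """Fraction of pages where each field exactly matches the hand-checked value."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two non-empty lists of equal length")
    return {
        f: sum(p.get(f) == e.get(f) for p, e in zip(predicted, expected)) / len(expected)
        for f in fields
    }


# Made-up sample: two hand-checked pages and what the extractor returned.
expected = [
    {"title": "Data Engineer", "company": "Acme"},
    {"title": "Analyst", "company": "Globex"},
]
predicted = [
    {"title": "Data Engineer", "company": "Acme"},
    {"title": "Analyst", "company": None},
]

scores = field_accuracy(predicted, expected, ["title", "company"])
print(scores)  # {'title': 1.0, 'company': 0.5}
```

Exact match is strict. If whitespace or casing differs, normalize both sides before comparing.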
What is your setup? Do you use AI or traditional scrapers?
Source: https://dev.to/__c1b9e06dc90a7e0a676b/my-web-scraping-nightmare-ended-when-i-let-an-llm-read-the-html-1bj4