LLMs for Better Web Scraping
I spent years writing scrapers. I used CSS selectors and regex. It worked until the website changed.
One layout update broke my code. I spent days fixing it. I lost the battle against changing HTML.
I tried a new way: LLMs. I stopped guessing selectors. Now I send the page text to the model.
My process is simple:
- Clean HTML with BeautifulSoup.
- Remove scripts and styles.
- Prompt the LLM for a JSON object.
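A minimal sketch of those three steps, not the exact code from my scraper. The names `clean_html`, `build_prompt`, and `extract`, the JSON keys, and the `call_llm` callable are all assumptions for illustration; plug in whatever client function sends a prompt to your model and returns its text reply.

```python
import json
from typing import Callable

from bs4 import BeautifulSoup


def clean_html(html: str) -> str:
    """Drop <script> and <style> tags and return the visible page text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Collapse runs of whitespace so the prompt stays compact.
    return " ".join(soup.get_text(separator=" ").split())


def build_prompt(page_text: str) -> str:
    """Ask for a single JSON object. The keys here are hypothetical."""
    return (
        "Extract the product described in the page text below. "
        'Reply with only a JSON object with the keys "name", "price", '
        'and "in_stock".\n\n'
        f"Page text:\n{page_text}"
    )


def extract(html: str, call_llm: Callable[[str], str]) -> dict:
    """Clean the page, prompt the model, and parse its JSON reply.

    call_llm is a placeholder for your own model client (an assumption,
    not a real library function).
    """
    prompt = build_prompt(clean_html(html))
    return json.loads(call_llm(prompt))
```

Passing the model call in as a function keeps the sketch independent of any one provider and makes it easy to test with a fake reply.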
The results are better. The same prompt works across different layouts, and the model recognizes prices and stock status without site-specific rules.
There are trade-offs:
- Higher cost per request.
- Slower speed.
- Occasional mistakes.
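One way to contain the occasional mistakes is to validate every reply before trusting it and to retry when it fails. This is a sketch under assumptions: the field names and types in `REQUIRED`, the helper names, and the `call_llm` callable are hypothetical, and a retry adds to the cost and latency listed above.

```python
import json
from typing import Callable, Optional

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED = {"name": str, "price": (int, float), "in_stock": bool}


def parse_product(raw: str) -> Optional[dict]:
    """Return the parsed object, or None if the reply is not usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for key, expected in REQUIRED.items():
        value = data.get(key)
        if not isinstance(value, expected):
            return None
        # bool is a subclass of int, so reject True/False as a price.
        if key == "price" and isinstance(value, bool):
            return None
    return data


def extract_with_retry(
    prompt: str, call_llm: Callable[[str], str], attempts: int = 2
) -> Optional[dict]:
    """Call the model up to `attempts` times until a reply validates."""
    for _ in range(attempts):
        result = parse_product(call_llm(prompt))
        if result is not None:
            return result
    return None
```

A `None` result is a signal to log the page and review it by hand instead of storing bad data.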
Choose your tool based on your needs:
- Use regex and selectors for stable sites where speed matters.
- Use LLMs for chaotic sites and semi-structured data.
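The two tools can also be combined: try the cheap regex first and pay for the model only when the pattern finds nothing. A rough sketch, assuming US-style dollar prices; `PRICE_RE`, `extract_price`, and the `llm_price` callable are hypothetical names, not part of any library.

```python
import re
from typing import Callable, Optional

# Matches prices such as "$9.99" or "$1,299.00" (an assumed format).
PRICE_RE = re.compile(r"\$\s?(\d[\d,]*(?:\.\d{1,2})?)")


def extract_price(
    page_text: str, llm_price: Callable[[str], Optional[float]]
) -> Optional[float]:
    """Fast path: regex. Fallback: ask the model via llm_price."""
    match = PRICE_RE.search(page_text)
    if match:
        return float(match.group(1).replace(",", ""))
    # The pattern found nothing, so hand the text to the slower path.
    return llm_price(page_text)
```

On a stable site the fallback never runs, so the cost stays near zero. On a chaotic one the model picks up what the pattern misses.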
I have not touched my code in three weeks. The LLM handles the fragile DOM for me.
How do you handle website changes? Do you use selectors or models?
Source: https://dev.to/__c1b9e06dc90a7e0a676b/regex-broke-my-scraper-using-llms-for-robust-data-extraction-5bef
Optional learning community: https://t.me/GyaanSetuAi