Why I Switched to AI for Web Scraping
I scraped websites for years. I used CSS selectors and XPath. It worked until sites changed their layout. Then my scripts broke. I spent more time fixing code than using data.
I tried BeautifulSoup. I tried regex. I tried OCR. Nothing lasted. Small changes in the HTML broke everything.
Now I use AI models. I send the HTML to an LLM. I ask for a JSON object with the fields I need. The model finds the price and name. It does not depend on the HTML structure. It looks at the meaning of the text.
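Here is a minimal sketch of that flow. It assumes a hypothetical `call_llm` function that takes a prompt string and returns the model's text reply; swap in whatever client you use. The field names and the prompt wording are illustrative, not a tested prompt.

```python
import json
from typing import Callable

FIELDS = ["name", "price"]  # illustrative field names


def build_prompt(html: str, fields: list[str]) -> str:
    """Ask for a single JSON object containing the requested fields."""
    return (
        "Extract the following fields from the HTML below and reply with "
        "a single JSON object and nothing else. Use null for any field "
        "you cannot find.\n"
        f"Fields: {', '.join(fields)}\n\n"
        f"HTML:\n{html}"
    )


def extract(
    html: str,
    call_llm: Callable[[str], str],  # hypothetical stand-in for a real API client
    fields: list[str] = FIELDS,
) -> dict:
    """Send the page to the model and parse its reply as JSON."""
    reply = call_llm(build_prompt(html, fields))
    # Replies are sometimes wrapped in extra prose, so keep only the outermost {...}.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("model reply contained no JSON object")
    data = json.loads(reply[start : end + 1])
    # Return only the fields that were asked for, with None for anything missing.
    return {field: data.get(field) for field in fields}
```

Adding a new field means adding one entry to `FIELDS`. No selector has to change.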
Why this works:
- It survives layout changes.
- It is easy to add new fields.
- One prompt can work for many sites.
The trade-offs:
- Each request costs money.
- It is slower than a selector-based script.
- The model sometimes makes mistakes, such as returning a wrong or made-up value.
- Large pages hit the model's context limit.
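Cost and page size can both be softened by trimming the page before sending it. A rough sketch using only Python's standard `html.parser`: skip `script` and `style` contents, keep the visible text, and cap the length. The `max_chars` value is an arbitrary illustrative number. Real limits are counted in tokens and depend on the model.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def shrink_html(html: str, max_chars: int = 8000) -> str:
    """Return the page's visible text, one chunk per line, cut at max_chars."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)[:max_chars]
```

This throws away tag attributes too. If the value you need lives in an attribute, this sketch will lose it.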
My strategy:
- Use simple rules first.
- Use AI as a fallback.
- Validate every result.
- Cache your results so you never pay for the same page twice.
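The four steps above can be sketched for a single field. Everything here is an assumption made for illustration: the price regex, the in-memory cache, and the hypothetical `ask_llm` callable that returns the model's answer for a page as a string (or None).

```python
import hashlib
import math
import re
from typing import Callable, Optional

# Illustrative rule: a dollar amount such as $12.99 or $1,299.00.
PRICE_RE = re.compile(r"\$\s?(\d[\d,]*(?:\.\d{1,2})?)")

# In-memory cache keyed by page hash; a real scraper would persist this.
_cache: dict[str, Optional[float]] = {}


def price_by_rule(html: str) -> Optional[str]:
    """Step 1: the cheap rule."""
    match = PRICE_RE.search(html)
    return match.group(1) if match else None


def validate_price(raw: Optional[str]) -> Optional[float]:
    """Step 3: accept only values that parse to a positive, finite number."""
    if raw is None:
        return None
    try:
        value = float(str(raw).replace("$", "").replace(",", "").strip())
    except ValueError:
        return None
    return value if math.isfinite(value) and value > 0 else None


def get_price(html: str, ask_llm: Callable[[str], Optional[str]]) -> Optional[float]:
    key = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if key in _cache:  # Step 4: identical pages are answered from the cache
        return _cache[key]
    price = validate_price(price_by_rule(html))
    if price is None:  # Step 2: call the model only when the rule fails
        price = validate_price(ask_llm(html))
    _cache[key] = price  # failures (None) are cached as well
    return price
```

The model is only called when the rule finds nothing valid. Most pages never cost a request.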
Stop fighting with HTML tags. Focus on your data.
Source: https://dev.to/__c1b9e06dc90a7e0a676b/why-i-gave-up-on-regex-and-started-using-ai-for-web-scraping-339d