๐๐ ๐ช๐ฒ๐ฏ ๐ฆ๐ฐ๐ฟ๐ฎ๐ฝ๐ถ๐ป๐ด ๐ฉ๐ ๐ง๐ฟ๐ฎ๐ฑ๐ถ๐๐ถ๐ผ๐ป๐ฎ๐น ๐ฆ๐ฒ๐น๐ฒ๐ฐ๐๐ผ๐ฟ๐
Traditional web scraping breaks often. You write CSS selectors. The site updates. Your code fails. I tried a new way using AI.
I built a price tool. I used XPath and regex. Site redesigns broke my scrapers. Regex picked up wrong numbers. I needed a tool to understand meaning.
I first sent raw HTML to an LLM. It cost too much. It hallucinated data. I tried removing too much text. The model lost the context.
I changed my process. First, I cleaned the HTML. I removed scripts and footers. I kept only headings and prices. This cut token use by 70%.
Second, I gave the AI examples. I showed it what a price looks like. Third, I set temperature to 0. This made the output stable.
There are trade-offs.
- Cost: Each page costs a few cents.
- Speed: Calls take 1 to 3 seconds.
- Errors: AI sometimes makes mistakes. You need a check to validate numbers.
Skip AI if:
- The site has an API.
- You scrape millions of pages.
- The selectors never change.
Respect robots.txt. Start with AI to save time. Add validation for safety.
What does your scraping stack look like?