Stop Writing Fragile Web Scrapers
I spent years building scrapers. I used CSS selectors and BeautifulSoup. This worked for simple sites.
Then I hit a wall. E-commerce sites change layouts often. Some use randomly generated class names. My 300-line scraper kept breaking. Headless browsers used too much memory.
I tried a new method. I fed raw HTML to an AI. I asked it to find the data.
The model reads the page for meaning. It does not care whether a price sits in a span or a div. You write a prompt instead of a selector.
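Here is a minimal sketch of what "a prompt instead of a selector" can look like. It is an illustration, not a fixed recipe. The names build_prompt, extract, and call_model are my own placeholders. call_model stands for whatever client you use: it takes a prompt string and returns the model's text reply. The sketch assumes the reply is valid JSON.

```python
import json
from typing import Callable


def build_prompt(html: str, fields: dict[str, str]) -> str:
    """Describe the wanted fields in plain language instead of writing selectors."""
    wanted = "\n".join(f"- {name}: {desc}" for name, desc in fields.items())
    return (
        "Extract the following fields from the HTML below.\n"
        f"{wanted}\n"
        "Reply with a single JSON object using exactly these keys. "
        "Use null for anything you cannot find.\n\n"
        f"HTML:\n{html}"
    )


def extract(html: str, fields: dict[str, str], call_model: Callable[[str], str]) -> dict:
    """Send the prompt to the model and parse its JSON reply.

    call_model is a hypothetical callable (prompt in, text out).
    json.loads raises ValueError if the reply is not valid JSON.
    """
    reply = call_model(build_prompt(html, fields))
    data = json.loads(reply)
    # Keep only the keys we asked for; missing keys become None.
    return {name: data.get(name) for name in fields}
```

Note that the field descriptions carry the work a selector used to do. If the layout changes, the prompt stays the same.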
How to make it work:
- Remove script and style tags. This saves tokens.
- Cache results so the same page is never sent twice. This saves money.
- Use a cheap model first.
- Escalate to a strong model only for pages the cheap one cannot handle.
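The four steps above can be sketched together. This is one possible arrangement under stated assumptions, not the only way to do it. strip_noise, cache_key, is_complete, and extract_cached are names I made up. cheap and strong are hypothetical callables that take cleaned HTML plus a list of field names and return a dict or None. The regex is a rough tool; an HTML parser would be sturdier on messy pages. The cache is an in-memory dict, so it is lost when the process exits.

```python
import hashlib
import json
import re
from typing import Callable, Optional

Model = Callable[[str, list[str]], Optional[dict]]

# Drop <script> and <style> blocks together with their contents.
_NOISE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)

# In-memory cache: key -> extracted record.
_cache: dict[str, dict] = {}


def strip_noise(html: str) -> str:
    """Remove script and style blocks so fewer tokens are sent."""
    return _NOISE.sub("", html)


def cache_key(html: str, fields: list[str]) -> str:
    """Hash the cleaned HTML and the requested fields into one key."""
    payload = json.dumps({"html": html, "fields": sorted(fields)})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_complete(result: Optional[dict], fields: list[str]) -> bool:
    """True when every requested field came back with a value."""
    return result is not None and all(result.get(f) is not None for f in fields)


def extract_cached(html: str, fields: list[str], cheap: Model, strong: Model) -> Optional[dict]:
    cleaned = strip_noise(html)
    key = cache_key(cleaned, fields)
    if key in _cache:
        return _cache[key]  # no model call, no cost
    result = cheap(cleaned, fields)
    if not is_complete(result, fields):
        # The cheap model missed something; pay for the strong one.
        result = strong(cleaned, fields)
    if is_complete(result, fields):
        _cache[key] = result  # only cache records worth reusing
    return result
```

Hashing the cleaned HTML rather than the URL means an unchanged page is a cache hit even if it is fetched again, and a changed page is a miss.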
Use this for:
- Sites with changing layouts.
- Small-scale tasks.
- Fast prototyping.
Avoid this for:
- Millions of pages. The cost is too high.
- Real-time data. Model calls are too slow.
- Private data. If you use a hosted model, the page content leaves your machine.
The best setup is a hybrid. Use CSS selectors for stable sites. Use AI when selectors fail.
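A rough sketch of the hybrid follows. The names valid and hybrid_extract are placeholders of mine. by_selector and by_model are hypothetical callables that take HTML and return a dict or None. In practice by_selector would wrap your existing CSS selector code, and by_model would wrap a model call. I assume a selector failure shows up either as an exception or as a missing field.

```python
from typing import Callable, Optional

Extractor = Callable[[str], Optional[dict]]


def valid(record: Optional[dict], required: list[str]) -> bool:
    """A record counts only if every required field is present and non-empty."""
    return record is not None and all(
        record.get(key) not in (None, "") for key in required
    )


def hybrid_extract(
    html: str,
    required: list[str],
    by_selector: Extractor,
    by_model: Extractor,
) -> tuple[Optional[dict], str]:
    """Try the selector path first; fall back to the model when it fails.

    Returns the record and the name of the path that produced it.
    """
    try:
        record = by_selector(html)
    except Exception:
        # A layout change often surfaces as an AttributeError or IndexError.
        record = None
    if valid(record, required):
        return record, "selector"

    record = by_model(html)
    if valid(record, required):
        return record, "model"
    return None, "failed"
```

Returning the path name lets you count how often the fallback fires. A rising count is a sign the selectors need an update.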
Describe the data you need. Let the model handle the structure.