𝗦𝘁𝗼𝗽 𝗪𝗿𝗶𝘁𝗶𝗻𝗴 𝗙𝗿𝗮𝗴𝗶𝗹𝗲 𝗪𝗲𝗯 𝗦𝗰𝗿𝗮𝗽𝗲𝗿𝘀

📅1 week ago⏱1 min read

I spent years building scrapers. I used CSS selectors and BeautifulSoup. This worked for simple sites.

Then I hit a wall. E-commerce sites change layouts often. Some use random class names. My 300-line scraper kept breaking. Headless browsers used too much memory.

I tried a new method. I fed raw HTML to an AI. I asked it to find the data.

The model reads for meaning, not markup. It does not care if a price is in a span or a div. You write a prompt instead of a selector.
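Here is a minimal sketch of what "a prompt instead of a selector" can look like. The helper names `build_prompt` and `parse_reply`, the field descriptions, and the JSON reply format are assumptions for illustration, not the original author's code. No model is called here; the sketch only builds the prompt and parses a reply.

```python
import json


def build_prompt(fields, html):
    """Describe the wanted fields in plain language and attach the HTML."""
    field_list = "\n".join(f"- {name}: {desc}" for name, desc in fields.items())
    return (
        "Extract the following fields from the HTML below.\n"
        f"{field_list}\n"
        "Reply with a single JSON object using exactly these keys. "
        "Use null for any field that is not on the page.\n\n"
        f"HTML:\n{html}"
    )


def parse_reply(reply, fields):
    """Parse the model's JSON reply and keep only the requested keys."""
    data = json.loads(reply)
    return {name: data.get(name) for name in fields}


# Hypothetical fields and pages. The same prompt text covers both layouts.
fields = {"title": "product name", "price": "current price with currency"}
span_page = '<h1>Blue Mug</h1><span class="x9f2">$19.99</span>'
div_page = '<h1>Blue Mug</h1><div class="price">$19.99</div>'

prompt_a = build_prompt(fields, span_page)
prompt_b = build_prompt(fields, div_page)
```

The field descriptions stay the same when the markup changes. Only the attached HTML differs between the two prompts.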

How to make it work:

Remove script and style tags. This saves tokens.
Store results in a cache. This saves money.
Use a cheap model first.
Use a strong model for hard pages.
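The four steps above can be sketched as one small pipeline. This is an illustration under assumptions: `cheap_model` and `strong_model` are hypothetical callables that take HTML and return a JSON string, the cache is a plain in-memory dict, and the regex is a quick stand-in for a proper HTML parser (BeautifulSoup's `decompose()` would be the sturdier choice).

```python
import hashlib
import json
import re

# Matches <script>...</script> and <style>...</style> together with their contents.
_SCRIPT_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)

_cache = {}  # in-memory only; a real setup might persist this to disk


def strip_noise(html):
    """Drop script and style elements so fewer tokens reach the model."""
    return _SCRIPT_STYLE.sub("", html)


def try_model(model, html, required):
    """Call one model; return its parsed dict, or None if the reply is unusable."""
    try:
        data = json.loads(model(html))
    except (json.JSONDecodeError, TypeError):
        return None
    if isinstance(data, dict) and all(data.get(key) for key in required):
        return data
    return None


def extract(html, cheap_model, strong_model, required=("price",)):
    """Clean the page, check the cache, then try cheap before strong."""
    cleaned = strip_noise(html)
    key = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
    if key in _cache:
        return _cache[key]  # no model call, no cost
    result = try_model(cheap_model, cleaned, required)
    if result is None:
        # The cheap model failed or missed a required field: escalate.
        result = try_model(strong_model, cleaned, required)
    if result is not None:
        _cache[key] = result
    return result
```

The cache key is a hash of the cleaned HTML, so an unchanged page never triggers a second model call. The strong model only runs when the cheap one returns invalid JSON or misses a required field.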

Use this for:

Sites with changing layouts.
Small-scale tasks.
Fast prototyping.

Avoid this for:

Millions of pages. The cost is too high.
Real-time data. Model calls are too slow.
Private data. With a hosted model, the page contents leave your machine.

The best setup is a hybrid. Use CSS selectors for stable sites. Use AI when selectors fail.
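A minimal sketch of that hybrid, assuming two hypothetical callables: `selector_extract` for the fast path and `llm_extract` for the fallback. The `price_by_selector` helper uses a regex only to keep the example free of dependencies; in practice it would be a CSS selector lookup such as BeautifulSoup's `select_one(".price")`.

```python
import re


def price_by_selector(html):
    """Stand-in for a CSS selector like ".price" (stdlib-only for this sketch)."""
    match = re.search(r'class="price"[^>]*>([^<]+)<', html)
    return match.group(1).strip() if match else None


def hybrid_extract(html, selector_extract, llm_extract):
    """Try the selector first; call the model only when the selector fails."""
    try:
        value = selector_extract(html)
    except Exception:
        # A layout change often shows up as an exception in selector code.
        value = None
    if value:
        return value, "selector"
    return llm_extract(html), "llm"
```

Returning the path taken ("selector" or "llm") alongside the value makes it easy to log how often the fallback fires. A rising fallback rate is a hint that the selector needs an update.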

Describe the data you need. Let the model handle the structure.

Source: https://dev.to/__c1b9e06dc90a7e0a676b/my-web-scraping-nightmare-ended-when-i-let-an-llm-read-the-html-1bj4
