𝗠𝘆 𝗪𝗲𝗯 𝗦𝗰𝗿𝗮𝗽𝗶𝗻𝗴 𝗡𝗶𝗴𝗵𝘁𝗺𝗮𝗿𝗲 𝗘𝗻𝗱𝗲𝗱 𝗪𝗶𝘁𝗵 𝗟𝗟𝗠𝘀

📅1 week ago⏱1 min read

I built scrapers for years with CSS selectors. That worked for one site. It failed across ten, because every page had a different layout. The code became a mess.

I tried a new way. I fed raw HTML to an LLM and told it which data I wanted. The model reads the text for meaning, so it depends far less on the exact tag structure.

Here is how you do it:

Use a cheap model like GPT-4o-mini.
Write a prompt for specific fields.
Ask for a JSON object.
Trim the HTML to cut token costs.
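
A minimal sketch of these four steps, under some assumptions. The `call_model` argument is a hypothetical function that sends a prompt to whichever model you pick (GPT-4o-mini or another) and returns its text reply; the field names and the 20,000-character budget are illustrative, not from a real project.

```python
import json
from typing import Callable, Sequence

MAX_HTML_CHARS = 20_000  # assumed budget; tune it for your model and pages


def build_prompt(html: str, fields: Sequence[str], max_chars: int = MAX_HTML_CHARS) -> str:
    """Name the wanted fields, ask for one JSON object, and trim the HTML."""
    trimmed = html[:max_chars]  # crude trim: fewer characters means fewer tokens billed
    field_list = ", ".join(fields)
    return (
        "Extract the following fields from the HTML below: " + field_list + ".\n"
        "Reply with a single JSON object using exactly those keys. "
        "Use null for any field that is not present.\n\n"
        "HTML:\n" + trimmed
    )


def extract(html: str, fields: Sequence[str], call_model: Callable[[str], str]) -> dict:
    """Send the prompt through call_model (hypothetical) and parse the JSON reply."""
    raw_reply = call_model(build_prompt(html, fields))
    return json.loads(raw_reply)


# Stand-in for a real model call so the sketch runs offline.
def fake_model(prompt: str) -> str:
    return '{"title": "Backend Engineer", "company": "Acme"}'


job = extract("<h1>Backend Engineer</h1><p>Acme</p>", ["title", "company"], fake_model)
print(job)
```

Swapping `fake_model` for a real API call is the only change needed to try this on live pages.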

I improved the process:

Remove script and style tags.
Cache results by URL.
Validate JSON output.
Use a hybrid system. Try selectors first. Use AI as a backup.
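
These improvements can be sketched together as follows. This is one possible arrangement, not the exact code behind this post: `try_selectors` and `call_model` are hypothetical callables you supply, `REQUIRED_FIELDS` is an assumed schema, the regex cleanup is deliberately crude, and the cache is a plain in-memory dict.

```python
import json
import re
from typing import Callable, Dict, Optional

REQUIRED_FIELDS = ("title", "company")  # hypothetical schema for illustration

# Matches <script>...</script> and <style>...</style> blocks, case-insensitively.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.DOTALL | re.IGNORECASE)

_cache: Dict[str, dict] = {}  # results keyed by URL


def strip_script_and_style(html: str) -> str:
    """Drop script and style blocks; they cost tokens and carry no visible text."""
    return _SCRIPT_STYLE.sub("", html)


def validate(raw_reply: str, required=REQUIRED_FIELDS) -> Optional[dict]:
    """Return the parsed object, or None if it is not JSON or lacks a required key."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(key not in data for key in required):
        return None
    return data


def scrape(
    url: str,
    html: str,
    try_selectors: Callable[[str], Optional[dict]],
    call_model: Callable[[str], str],
) -> Optional[dict]:
    """Hybrid flow: cache first, then selectors, then the model as a backup."""
    if url in _cache:
        return _cache[url]
    result = try_selectors(html)  # cheap path; returns None when selectors miss
    if result is None:
        cleaned = strip_script_and_style(html)
        result = validate(call_model(cleaned))  # None if the reply is unusable
    if result is not None:
        _cache[url] = result
    return result
```

The model only runs when the selectors return nothing, so pages that still match the old layout cost no tokens.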

Use this for:

Sites with shifting layouts.
Small projects.
Fast prototypes.

Avoid this for:

Millions of pages.
Real-time needs.
Private data.

Try this for job boards or news. Start small. Measure accuracy.
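
One simple way to measure accuracy is to hand-label a few pages and compare the extracted fields against them. The sketch below assumes exact-match scoring per field; the sample records are made up for illustration.

```python
from typing import Dict, List


def field_accuracy(predictions: List[dict], expected: List[dict]) -> Dict[str, float]:
    """Share of pages where each field exactly matches the hand-labeled value."""
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for predicted, labeled in zip(predictions, expected):
        for field, labeled_value in labeled.items():
            totals[field] = totals.get(field, 0) + 1
            if predicted.get(field) == labeled_value:
                correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / totals[field] for field in totals}


# Made-up sample: two labeled pages and what the extractor returned for them.
labeled = [
    {"title": "Backend Engineer", "company": "Acme"},
    {"title": "Data Analyst", "company": "Globex"},
]
extracted = [
    {"title": "Backend Engineer", "company": "Acme"},
    {"title": "Data Analyst", "company": "Globex Inc."},
]
print(field_accuracy(extracted, labeled))  # {'title': 1.0, 'company': 0.5}
```

Exact match is strict. Near misses like "Globex Inc." count as errors, so you may want to normalize values before comparing.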

What is your setup? Do you use AI or traditional scrapers?

Source: https://dev.to/__c1b9e06dc90a7e0a676b/my-web-scraping-nightmare-ended-when-i-let-an-llm-read-the-html-1bj4
