𝗪𝗵𝗲𝗻 𝗛𝗧𝗠𝗟 𝗣𝗮𝗿𝘀𝗶𝗻𝗴 𝗙𝗮𝗶𝗹𝘀: 𝗨𝘀𝗶𝗻𝗴 𝗟𝗟𝗠𝘀 𝗳𝗼𝗿 𝗪𝗲𝗯 𝗗𝗮𝘁𝗮
I scraped websites for years. I used BeautifulSoup and Scrapy. I hit a wall.
A client needed product data. One site had messy HTML. The layout changed every week. My selectors broke.
I spent two days on fixes. A colleague suggested LLMs.
I thought LLMs were slow. I thought they cost too much. I tried it anyway.
Traditional tools are fragile. CSS selectors break when a class name changes. The HTML structure is unpredictable.
I built a script. It sends raw HTML to GPT-4o. It asks for a JSON object.
I do not teach the computer where data sits. I teach it what the data looks like.
Benefits:
- It handles layout changes.
- You set up new sites in minutes.
- It ignores noise.
Downsides:
- API calls cost money.
- It takes a few seconds per page.
- LLMs sometimes invent data.
Use a hybrid approach. Use traditional parsers for stable sites. Use LLMs for tricky ones.
Validate your output. Check if prices look like prices. Use few-shot prompting for better accuracy.
Optional learning community: https://t.me/GyaanSetuAi