𝗪𝗵𝗲𝗻 𝗛𝗧𝗠𝗟 𝗣𝗮𝗿𝘀𝗶𝗻𝗴 𝗙𝗮𝗶𝗹𝘀: 𝗨𝘀𝗶𝗻𝗴 𝗟𝗟𝗠𝘀 𝗳𝗼𝗿 𝗪𝗲𝗯 𝗗𝗮𝘁𝗮

I scraped websites for years. I used BeautifulSoup and Scrapy. I hit a wall.

A client needed product data. One site had messy HTML. The layout changed every week. My selectors broke.

I spent two days on fixes. A colleague suggested LLMs.

I thought LLMs were slow. I thought they cost too much. I tried it anyway.

Traditional tools are fragile. CSS selectors break when a class name changes. The HTML structure is unpredictable.

I built a script. It sends raw HTML to GPT-4o. It asks for a JSON object.

I do not teach the computer where data sits. I teach it what the data looks like.

Benefits:

Downsides:

Use a hybrid approach. Use traditional parsers for stable sites. Use LLMs for tricky ones.

Validate your output. Check if prices look like prices. Use few-shot prompting for better accuracy.

Source: https://dev.to/__c1b9e06dc90a7e0a676b/when-html-parsing-fails-using-llms-to-extract-messy-web-data-20ab

Optional learning community: https://t.me/GyaanSetuAi