๐—˜๐˜…๐˜๐—ฟ๐—ฎ๐—ฐ๐˜๐—ถ๐—ป๐—ด ๐— ๐—ฒ๐˜€๐˜€๐˜† ๐—ช๐—ฒ๐—ฏ ๐——๐—ฎ๐˜๐—ฎ ๐—ช๐—ถ๐˜๐—ต ๐—Ÿ๐—Ÿ๐— ๐˜€

I scraped websites for years. I used BeautifulSoup and Scrapy. One site broke my process. The HTML was a mess. The layout changed every week. My selectors broke.

I tried an LLM. I gave it raw HTML. I asked for JSON.

Traditional tools rely on structure. LLMs rely on meaning. I describe the data. The AI finds it.
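
Here is a minimal sketch of that idea: describe each field by meaning, hand over the raw HTML, and ask for JSON back. It is not tied to any particular model or vendor. The names `build_prompt`, `extract`, and `fake_llm`, the prompt wording, and the sample HTML are all assumptions for illustration. `complete` stands in for whatever LLM client you actually call.

```python
import json
from typing import Callable, Dict


def build_prompt(html: str, fields: Dict[str, str]) -> str:
    """Describe the wanted fields by meaning, then append the raw HTML."""
    wanted = "\n".join(f"- {name}: {desc}" for name, desc in fields.items())
    return (
        "Extract the following fields from the HTML below.\n"
        f"{wanted}\n"
        "Reply with a single JSON object and nothing else. "
        "Use null for any field you cannot find.\n\n"
        f"HTML:\n{html}"
    )


def extract(html: str, fields: Dict[str, str],
            complete: Callable[[str], str]) -> dict:
    """Run the prompt through `complete` (hypothetical LLM call) and parse the reply."""
    reply = complete(build_prompt(html, fields))
    # Models sometimes wrap the JSON in extra text, so keep only the
    # outermost {...} span before parsing.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(reply[start:end + 1])
    if not isinstance(data, dict):
        raise ValueError("model reply was not a JSON object")
    # Return exactly the requested keys; anything missing becomes None.
    return {name: data.get(name) for name in fields}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    return 'Here is the data: {"title": "Blue Widget", "price": "19.99"}'


# Made-up messy markup with meaningless class names.
html = '<div class="x9"><b>Blue Widget</b><span class="q">$19.99</span></div>'
fields = {
    "title": "the product name",
    "price": "the price as digits, without the currency symbol",
    "sku": "the product identifier, if shown",
}
result = extract(html, fields, fake_llm)
print(result)
```

The field descriptions never mention tags or class names, so they do not need to change when the layout does. Whether the model finds the right values is a separate question. Check the parsed output before you trust it.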

Pros:

Cons:

My strategy:

Do you use AI for scraping? Do you prefer XPath?

Source: https://dev.to/__c1b9e06dc90a7e0a676b/when-html-parsing-fails-using-llms-to-extract-messy-web-data-20ab