๐—ช๐—›๐—˜๐—ก ๐—ฅ๐—˜๐—š๐—˜๐—ซ ๐—™๐—”๐—œ๐—Ÿ๐—ฆ ๐—™๐—ข๐—ฅ ๐——๐—”๐—ง๐—” ๐—˜๐—ซ๐—ง๐—ฅ๐—”๐—–๐—ง๐—œ๐—ข๐—ก

I scraped websites for years. I used BeautifulSoup and regex. These tools work well when a site has a simple, consistent template.

Then I faced a hard task. I needed data from hundreds of e-commerce sites. Every site had different HTML. My selectors broke. I spent more time fixing code than using data.

I tried several fixes. I wrote more complex parsing code. I tried classifiers. I tried headless browsers. None of them held up across that many layouts.

Then I tried a new way. I sent raw HTML to a language model. I asked it to extract data. It worked.

Use this method for your project:
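
The post does not show its code, so here is only a minimal sketch of the idea: trim the HTML, ask a model for JSON, and parse the reply. The field names in `FIELDS`, the prompt wording, and the `call_model` callable are all assumptions made for illustration. `call_model` stands in for whatever LLM client you use and is not a real library function.

```python
import json
import re
from typing import Callable

# Hypothetical schema for illustration; the original post does not list fields.
FIELDS = ("name", "price", "currency")


def shrink_html(html: str, max_chars: int = 20000) -> str:
    """Remove script/style blocks and comments, collapse whitespace, truncate."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    html = re.sub(r"(?s)<!--.*?-->", " ", html)
    html = re.sub(r"\s+", " ", html).strip()
    return html[:max_chars]


def build_prompt(html: str) -> str:
    """Build an extraction prompt that asks for a single JSON object."""
    return (
        "Extract the product data from the HTML below. "
        f"Reply with one JSON object with exactly these keys: {', '.join(FIELDS)}. "
        "Use null for any value that is not present on the page.\n\n"
        f"HTML:\n{html}"
    )


def parse_reply(reply: str) -> dict:
    """Take the outermost {...} span from the reply and keep only known keys."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("model reply contained no JSON object")
    data = json.loads(match.group(0))  # raises ValueError on malformed JSON
    return {key: data.get(key) for key in FIELDS}


def extract_with_llm(html: str, call_model: Callable[[str], str]) -> dict:
    """call_model is an assumed wrapper: it takes a prompt and returns text."""
    return parse_reply(call_model(build_prompt(shrink_html(html))))
```

Passing the model call in as a function keeps the sketch independent of any one provider. It also lets you test the parsing with a canned reply before spending money on real requests.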

This approach has trade-offs:

My advice: Use CSS selectors for consistent sites. Use LLMs for messy pages. A hybrid approach works best.
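
One way to wire up the hybrid approach, as a sketch under assumptions rather than the author's actual code: run the cheap selector-based extractor first and call the LLM only when that result is missing required fields. The `REQUIRED` field names and both extractor callables are hypothetical placeholders.

```python
from typing import Callable, Optional, Tuple

# Hypothetical required fields; adjust to whatever your records need.
REQUIRED = ("name", "price")

Extractor = Callable[[str], Optional[dict]]


def is_complete(record: Optional[dict]) -> bool:
    """A record counts as usable only if every required field has a value."""
    if record is None:
        return False
    return all(record.get(key) not in (None, "") for key in REQUIRED)


def extract_hybrid(
    html: str,
    selector_extractor: Extractor,
    llm_extractor: Extractor,
) -> Tuple[Optional[dict], str]:
    """Return (record, method), where method is 'selectors' or 'llm'."""
    try:
        record = selector_extractor(html)
    except Exception:
        # Broken selectors often surface as AttributeError or IndexError;
        # this sketch treats any failure as "selectors did not work".
        record = None
    if is_complete(record):
        return record, "selectors"
    # Fall back to the slower, costlier model call only when needed.
    return llm_extractor(html), "llm"
```

Returning the method name alongside the record lets you log how often each site falls through to the model, which shows where a hand-written selector is worth maintaining.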

This saves weeks of effort.

Source: https://dev.to/__c1b9e06dc90a7e0a676b/when-regex-fails-using-llms-to-extract-structured-data-from-messy-pages-2mbg