When Regex Fails: Using LLMs for Data Extraction

I scraped web data for years. I used BeautifulSoup and regex. It worked for simple templates.

Then I had to scrape hundreds of e-commerce sites. Every site had a different structure. My selectors broke.

I tried writing more parsing code. I tried classifiers. I tried headless browsers. Each approach was slow or fragile.

I tried a new idea. I sent raw HTML to a language model. I used a prompt with a few examples. It worked.
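
Here is a minimal sketch of that idea, not the exact code from the article. It builds a few-shot prompt around the raw HTML and parses the JSON in the reply. The example snippets, the field names (name, price, currency), the max_chars cutoff, and the call_llm function are all assumptions for illustration. call_llm stands in for whatever model client you use: it takes a prompt string and returns the reply text.

```python
import json
from typing import Callable

# Hypothetical few-shot examples: (HTML snippet, fields we expect back).
FEW_SHOT = [
    ('<div class="p"><h1>Red Mug</h1><span>$12.50</span></div>',
     {"name": "Red Mug", "price": "12.50", "currency": "USD"}),
    ('<li><b>Desk Lamp</b> <i>EUR 30</i></li>',
     {"name": "Desk Lamp", "price": "30", "currency": "EUR"}),
]


def build_prompt(html: str, max_chars: int = 8000) -> str:
    """Assemble instruction, examples, and the (truncated) target HTML."""
    parts = [
        "Extract the product as JSON with keys name, price, currency.",
        "Reply with JSON only.",
    ]
    for snippet, fields in FEW_SHOT:
        parts.append(f"HTML:\n{snippet}\nJSON:\n{json.dumps(fields)}")
    # Truncate long pages so the prompt stays within an assumed size budget.
    parts.append(f"HTML:\n{html[:max_chars]}\nJSON:")
    return "\n\n".join(parts)


def parse_reply(reply: str) -> dict:
    """Pull the outermost JSON object out of the reply text."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in model reply")
    return json.loads(reply[start:end + 1])


def extract(html: str, call_llm: Callable[[str], str]) -> dict:
    """call_llm is an assumed wrapper around your model client."""
    return parse_reply(call_llm(build_prompt(html)))
```

Passing the model call in as a function keeps the sketch independent of any one provider. It also lets you test the prompt and parsing logic with a fake model.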

Follow these steps:

This method has trade-offs:

Use a hybrid system. Use CSS selectors for known sites. Use LLMs for unknown layouts.
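
One way to wire up the hybrid is a lookup by hostname. This is a sketch under assumptions, not the article's implementation. known_sites maps a hostname to a hand-written extractor (in practice, a function wrapping your CSS selectors). llm_extract is the model-based path. Both names are hypothetical, and so is the "source" key. Falling back to the LLM when a known-site extractor returns nothing is my own assumption about how to handle a broken selector.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urlparse

# A site extractor returns the fields, or None/empty if its selectors miss.
Extractor = Callable[[str], Optional[dict]]


def extract_hybrid(
    url: str,
    html: str,
    known_sites: Dict[str, Extractor],
    llm_extract: Callable[[str], dict],
) -> dict:
    """Try the cheap selector path for known hosts, else use the LLM."""
    host = urlparse(url).hostname or ""
    site_extractor = known_sites.get(host)
    if site_extractor is not None:
        fields = site_extractor(html)
        if fields:
            return {**fields, "source": "selector"}
    # Unknown layout, or the known-site selectors came back empty.
    return {**llm_extract(html), "source": "llm"}
```

Tagging each result with its source makes it easy to see how often the LLM path runs, and which known sites have selectors that stopped matching.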

Stop writing fragile parsers. Use a prompt.

Source: https://dev.to/__c1b9e06dc90a7e0a676b/when-regex-fails-using-llms-to-extract-structured-data-from-messy-pages-2mbg