๐—˜๐˜…๐˜๐—ฟ๐—ฎ๐—ฐ๐˜ ๐——๐—ฎ๐˜๐—ฎ ๐—ช๐—ถ๐˜๐—ต ๐—Ÿ๐—Ÿ๐— ๐˜€

I tried to scrape 50 e-commerce sites. I used BeautifulSoup and regex. It failed.

Prices looked different on every site. Some used dollars. Some used euros. Sizes were in dropdowns or radio buttons.

My regex patterns broke. I spent hours fixing small bugs. It did not scale.

I switched to LLMs. I used GPT-4o-mini.

I gave the model raw HTML. I used function calling to get structured JSON. I used Pydantic to check the data.

This method works for messy data. Here is how you do it:

There are trade-offs:

Use regex for stable sites. Use LLMs for messy sites.

This turned a nightmare into a two-day task.

Source: https://dev.to/__c1b9e06dc90a7e0a676b/when-regex-isnt-enough-extracting-structured-data-with-llms-3lo6