𝗘𝘅𝘁𝗿𝗮𝗰𝘁 𝗗𝗮𝘁𝗮 𝗪𝗶𝘁𝗵 𝗟𝗟𝗠𝘀

📅2 weeks ago⏱1 min read

I tried to scrape 50 e-commerce sites. I used BeautifulSoup and regex. It failed.

Prices looked different on every site. Some used dollars. Some used euros. Sizes were in dropdowns or radio buttons.

My regex patterns broke. I spent hours fixing small bugs. It did not scale.

I switched to LLMs. I used GPT-4o-mini.

I gave the model raw HTML. I used function calling to get structured JSON. I used Pydantic to check the data.

This method works for messy data. Here is how you do it:

There are trade-offs:

Use regex for stable sites. Use LLMs for messy sites.

This turned a nightmare into a two-day task.

Continue reading