๐๐ ๐๐ฟ๐ฎ๐ฐ๐ ๐๐ฎ๐๐ฎ ๐ช๐ถ๐๐ต ๐๐๐ ๐
I tried to scrape 50 e-commerce sites. I used BeautifulSoup and regex. It failed.
Prices looked different on every site. Some used dollars. Some used euros. Sizes were in dropdowns or radio buttons.
My regex patterns broke. I spent hours fixing small bugs. It did not scale.
I switched to LLMs. I used GPT-4o-mini.
I gave the model raw HTML. I used function calling to get structured JSON. I used Pydantic to check the data.
This method works for messy data. Here is how you do it:
- Clean your HTML first.
- Define a strict schema.
- Use tool calling functions.
- Validate the result.
There are trade-offs:
- Cost: You pay per token.
- Speed: It is slower than regex.
- Accuracy: Models sometimes hallucinate.
Use regex for stable sites. Use LLMs for messy sites.
This turned a nightmare into a two-day task.