When Regex Fails: LLMs for Messy HTML

📅4 days ago⏱1 min read

I took over a project last month. The HTML was a mess. No classes. No patterns. Bad tags.

I tried regex. I spent six hours writing patterns. They broke every time the page changed.

BeautifulSoup worked for 80% of the pages. The last 20% failed. I wrote custom rules for every case. The list grew too long.

I tried GPT-4. It worked. It cost $0.03 per item. For 10,000 items, that came to $300. It was too slow. It was too expensive.

I switched to a small local model. I used Llama 3.1 8B via Ollama. I asked for JSON output.
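A minimal sketch of what that call can look like, not my exact script. It assumes Ollama is running on its default local port and that the model was pulled as `llama3.1:8b`. The field names in `FIELDS` and the helper names are hypothetical; swap in whatever your pages contain.

```python
import json
import urllib.request

# Assumptions: Ollama's default local endpoint and the "llama3.1:8b" model tag.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.1:8b"

# Hypothetical schema; replace with the fields your pages actually contain.
FIELDS = ("title", "price", "description")


def build_payload(html: str) -> dict:
    """Build the request body for Ollama's /api/generate endpoint."""
    prompt = (
        "Extract the following fields from the HTML below and reply with a "
        f"single JSON object with exactly these keys: {', '.join(FIELDS)}. "
        "Use null for anything that is missing.\n\n"
        f"HTML:\n{html}"
    )
    return {
        "model": MODEL,
        "prompt": prompt,
        "format": "json",  # ask Ollama to constrain the output to valid JSON
        "stream": False,   # return one complete response instead of chunks
        "options": {"temperature": 0},  # keep extraction as repeatable as possible
    }


def parse_reply(body: bytes) -> dict:
    """Decode Ollama's response envelope, then the JSON the model wrote inside it."""
    envelope = json.loads(body)
    data = json.loads(envelope["response"])
    if not isinstance(data, dict):
        raise ValueError("model did not return a JSON object")
    # Keep only the expected keys; anything the model left out becomes None.
    return {key: data.get(key) for key in FIELDS}


def extract(html: str, timeout: float = 120.0) -> dict:
    """Send one HTML fragment to the local model and return the parsed fields."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(html)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_reply(response.read())
```

With the server running, `extract("<tr><td>Widget</td><td>9.99</td></tr>")` returns a dict with the three keys. Splitting the request and the parsing into separate functions means the parsing can be tested without a model running.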

Here is the method:

Rules for success:

Avoid LLMs when:

This approach stopped the cycle of fragile code. For messy HTML like this, it is the best way I have found.

What do you use when scraping fails?
