๐—ช๐—ต๐—ฒ๐—ป ๐—ฅ๐—ฒ๐—ด๐—ฒ๐˜… ๐—™๐—ฎ๐—ถ๐—น๐˜€: ๐—Ÿ๐—Ÿ๐— ๐˜€ ๐—ณ๐—ผ๐—ฟ ๐— ๐—ฒ๐˜€๐˜€๐˜† ๐—›๐—ง๐— ๐—Ÿ

I took over a project last month. The HTML was a mess. No classes. No patterns. Bad tags.

I tried regex. I spent six hours writing patterns. They broke every time the page changed.

BeautifulSoup worked for 80% of the pages. The last 20% failed. I wrote custom rules for every case. The list grew too long.

I tried GPT-4. It worked. It cost $0.03 per item. For 10,000 items, that came to $300. It was too slow. It was too expensive.
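The arithmetic is worth running before committing to a batch. A minimal sketch using the per-item price quoted above; the function name `batch_cost` is made up for illustration.

```python
def batch_cost(items: int, cost_per_item: float) -> float:
    """Total API spend for a batch, assuming a flat per-item price."""
    return items * cost_per_item


# The numbers from this post: 10,000 items at $0.03 each.
total = batch_cost(10_000, 0.03)
print(f"${total:,.2f}")
```

Swap in your own item count and price to see where a hosted model stops making sense.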

I switched to a small local model. I used Llama 3.1 8B via Ollama. I asked for JSON output.

Here is the method:
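A minimal sketch of what such a setup might look like, not necessarily the exact code I ran. It assumes Ollama is running locally on its default port with the `llama3.1:8b` model pulled. The field names `title` and `price` and the helper names `build_payload`, `parse_reply`, and `extract` are hypothetical.

```python
import json
import urllib.request
from typing import Optional

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "llama3.1:8b"                               # assumed model tag
FIELDS = ("title", "price")                         # hypothetical fields to extract


def build_payload(html: str) -> dict:
    """Build the request body: prompt plus a request for JSON-only output."""
    prompt = (
        "Extract the following fields from the HTML below and reply with "
        f"a single JSON object with keys {list(FIELDS)}. "
        "Use null for anything that is missing.\n\n" + html
    )
    return {
        "model": MODEL,
        "prompt": prompt,
        "format": "json",               # ask Ollama to constrain output to JSON
        "stream": False,                # one response object instead of chunks
        "options": {"temperature": 0},  # keep output as repeatable as possible
    }


def parse_reply(reply_text: str) -> Optional[dict]:
    """Parse the model's reply; return None if it is not a JSON object."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Keep only the expected keys; missing ones become None.
    return {key: data.get(key) for key in FIELDS}


def extract(html: str, timeout: float = 60.0) -> Optional[dict]:
    """Send one HTML fragment to the local model and return the parsed fields."""
    body = json.dumps(build_payload(html)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        envelope = json.load(response)
    # Ollama puts the generated text in the "response" field.
    return parse_reply(envelope.get("response", ""))
```

Calling `extract("<div><b>Widget</b> 9.99</div>")` returns a dict with the two fields, or `None` when the reply is not usable. Treating a bad reply as `None` lets the caller retry or set the item aside instead of crashing the batch.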

Rules for success:

Avoid LLMs when:

This approach stopped the cycle of fragile code. For messy HTML like this, it is the most reliable method I have found.

What do you use when scraping fails?

Source: https://dev.to/__c1b9e06dc90a7e0a676b/when-regex-fails-llms-for-messy-html-data-3j7f