When Regex Fails: LLMs for Messy HTML Data
I took over a project last month. The HTML was a mess. No classes. No patterns. Bad tags.
I tried regex. I spent six hours writing patterns. They broke every time the page changed.
BeautifulSoup worked for 80% of the pages. The last 20% failed. I wrote custom rules for every case. The list grew too long.
I tried GPT-4. It worked. It cost $0.03 per item. For 10,000 items, that is $300. It was too slow. It was too expensive.
I switched to a small local model. I used Llama 3.1 8B via Ollama. I asked for JSON output.
Here is the method:
- Get the raw HTML of the product card.
- Create a prompt with a JSON schema.
- Add a few examples.
- Get the JSON from the local LLM.
- Validate the data.
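The steps above can be sketched roughly as follows. This is a minimal sketch, not the exact code from the project. It assumes Ollama is running locally on its default port and that the model tag is `llama3.1:8b`. The schema fields (`name`, `price`, `in_stock`), the few-shot example, and the helper names `build_prompt`, `call_ollama`, and `extract` are all hypothetical.

```python
import json
import urllib.request

# Assumed defaults: Ollama's local REST endpoint and the Llama 3.1 8B tag.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.1:8b"

# Hypothetical schema and few-shot example; adapt both to your own fields.
SCHEMA = {"name": "string", "price": "float", "in_stock": "boolean"}
EXAMPLES = [
    (
        "<div><b>Blue Mug</b><i>$12.50</i><span>In stock</span></div>",
        {"name": "Blue Mug", "price": 12.5, "in_stock": True},
    ),
]


def build_prompt(card_html: str) -> str:
    """Combine the schema, the few-shot examples and the card into one prompt."""
    parts = [
        "Extract product data from the HTML below.",
        "Reply with one JSON object that matches this schema and nothing else:",
        json.dumps(SCHEMA),
    ]
    for html, expected in EXAMPLES:
        parts.append(f"HTML: {html}")
        parts.append(f"JSON: {json.dumps(expected)}")
    parts.append(f"HTML: {card_html}")
    parts.append("JSON:")
    return "\n".join(parts)


def call_ollama(prompt: str) -> str:
    """Send the prompt to a local Ollama server and return the raw model text."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "format": "json",   # ask Ollama to constrain the reply to valid JSON
        "stream": False,    # one complete response instead of chunks
        "options": {"temperature": 0},
    }
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]


def extract(card_html: str, ask=call_ollama) -> dict:
    """Run the pipeline for one card; `ask` can be swapped out for testing."""
    data = json.loads(ask(build_prompt(card_html)))
    missing = set(SCHEMA) - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data
```

Passing the model call in as `ask` lets you test the prompt and the parsing without a running server. Call `extract(card_html)` with the default to hit the local model.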
Rules for success:
- Set temperature to 0. This makes output more consistent between runs.
- Keep context small. Send only the product card.
- Be clear about data types. Say whether each field is a string, a float, or a boolean.
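Being clear about types only pays off if you check them afterwards. Here is one possible check, as a sketch. The field names in `EXPECTED_TYPES` and the helper `validate_types` are hypothetical, and the rule that whole numbers are accepted for float fields is my assumption.

```python
# Hypothetical field-to-type map; the names are illustrative only.
EXPECTED_TYPES = {"name": str, "price": float, "in_stock": bool}


def validate_types(record: dict) -> dict:
    """Return a cleaned copy of `record`, or raise ValueError on a type mismatch."""
    cleaned = {}
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        value = record[field]
        # JSON has a single number type, so accept 12 where 12.0 is expected.
        # bool is a subclass of int in Python, so exclude it explicitly.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if type(value) is not expected:
            raise ValueError(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
        cleaned[field] = value
    return cleaned
```

A price returned as the string "12.50" or a stock flag returned as "yes" is rejected here. You can then retry the item or send it to manual review.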
Avoid LLMs when:
- HTML is clean. Use a parser.
- You need real-time speed. LLMs are slow.
- Data is secret and the LLM is a hosted API. Local models are safer.
This approach stopped the cycle of fragile code. It is the best way I have found to handle messy HTML.
What do you use when scraping fails?
Source: https://dev.to/__c1b9e06dc90a7e0a676b/when-regex-fails-llms-for-messy-html-data-3j7f