𝗦𝘁𝗼𝗽 𝗙𝗶𝗴𝗵𝘁𝗶𝗻𝗴 𝗛𝗧𝗠𝗟 𝗣𝗮𝗿𝘀𝗶𝗻𝗴

📅5 days ago⏱1 min read

I spent a weekend fighting HTML parsing.

I needed product specs from 12 e-commerce sites. BeautifulSoup and regex both failed. Some sites buried the specs in messy divs. Others rendered them with JavaScript. Some put them in images. I wrote a 200-line function. It still missed half the data.

I stopped fighting the HTML structure. The HTML changes. The meaning stays the same.

I switched to this workflow:

Scrape the raw body text.
Define a JSON schema.
Use GPT-4o-mini to extract values.
Parse the JSON result.
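
Here is a minimal sketch of those four steps, not my exact code. The field names in SCHEMA are made up for illustration, and the schema is a simple key-to-type hint map rather than formal JSON Schema. The model call is passed in as a plain function (call_model, a hypothetical name) so the sketch does not assume any particular SDK.

```python
import json
from html.parser import HTMLParser
from typing import Callable

# Hypothetical schema: these field names are illustrative assumptions.
SCHEMA = {
    "name": "string",
    "weight_kg": "number or null",
    "color": "string or null",
}


class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def body_text(html: str) -> str:
    """Step 1: reduce the page to raw text, ignoring its structure."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def build_prompt(text: str, schema: dict) -> str:
    """Steps 2 and 3: embed the schema in the extraction prompt."""
    return (
        "Extract the product specs from the text below.\n"
        "Reply with one JSON object with exactly these keys:\n"
        f"{json.dumps(schema, indent=2)}\n"
        "Use null for anything the text does not state.\n\n"
        f"TEXT:\n{text}"
    )


def extract_specs(
    html: str,
    call_model: Callable[[str], str],
    schema: dict = SCHEMA,
) -> dict:
    """Step 4: parse the reply; json.loads raises if it is not valid JSON."""
    raw = call_model(build_prompt(body_text(html), schema))
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("model reply was not a JSON object")
    # Keep only the schema keys; anything missing becomes None.
    return {key: data.get(key) for key in schema}
```

In practice call_model would wrap whatever client sends the prompt to GPT-4o-mini and returns the reply text. Keeping it as a parameter also makes the pipeline easy to test with a fake model.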

This method works:

It does not care whether the specs sit in tables or lists.
It survives site redesigns.
It normalizes typos and units.

The results:

Accuracy: 85%.
Cost: 10 to 50 dollars for 10,000 products.
Speed: 2 to 5 seconds per page.
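
To put those numbers in per-product terms, here is a rough back-of-the-envelope calculation. It assumes one page per product and strictly sequential requests, neither of which the figures above actually state.

```python
PRODUCTS = 10_000


def per_product_cost(total_usd: float, products: int = PRODUCTS) -> float:
    """Cost of one extraction in USD."""
    return total_usd / products


def sequential_hours(seconds_per_page: float, pages: int = PRODUCTS) -> float:
    """Wall-clock hours if pages are processed one after another."""
    return seconds_per_page * pages / 3600


for total in (10, 50):
    print(f"${total} total -> ${per_product_cost(total):.3f} per product")

for secs in (2, 5):
    print(f"{secs} s/page -> {sequential_hours(secs):.1f} hours for the full run")
```

That works out to roughly a tenth of a cent to half a cent per product, and somewhere between about 5.6 and 13.9 hours for the full run if nothing is parallelized.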

Warnings:

Do not send sensitive data to a hosted API. Use a local model like Llama 3 instead.
It is too slow for real-time page loads.
It is not a fit when you need 99% accuracy.

Tips for better results:

Cache results by URL hash.
Use regex for simple fields like prices.
Use a second model to verify values.
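
The first two tips can be sketched like this. It is one possible shape, not a definitive implementation: the on-disk JSON cache, the SHA-256 key, the price pattern, and the extract callable are all assumptions made for illustration.

```python
import hashlib
import json
import re
from pathlib import Path
from typing import Callable, Optional

# Assumed pattern: a currency symbol followed by digits, optional thousands
# separators, and optional cents. Real sites will need more cases.
PRICE_RE = re.compile(r"[$€£]\s?(\d[\d,]*(?:\.\d{1,2})?)")


def simple_price(text: str) -> Optional[float]:
    """Return the first plainly formatted price in the text, or None."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


def cache_key(url: str) -> str:
    """Hash the URL so it is safe to use as a file name."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def cached_extract(
    url: str,
    extract: Callable[[str], dict],
    cache_dir: Path,
) -> dict:
    """Run extract(url) once per URL and reuse the stored JSON afterwards."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(url)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = extract(url)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

A hybrid pipeline could try simple_price first and only pay for a model call when it returns None. Note that this cache never expires, so a price change on the page would go unnoticed until the cache file is deleted.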

Interwest Info is another option. It returns structured JSON from a URL.

How do you handle messy data? Do you use LLMs or a hybrid pipeline?

Source: https://dev.to/__c1b9e06dc90a7e0a676b/i-spent-a-weekend-fighting-html-parsing-heres-what-finally-worked-3pgn
