From Regex to LLMs: My Journey Extracting Web Data


AI-assisted draft.

GyaanSetu Editorial, 14 hours ago, 2 min read

I built a price comparison tool. I needed to pull product data from dozens of e-commerce sites. Every site had a different structure. Some used random CSS classes. Some used JavaScript to load content.

I tried the classic way first. I used Regex and BeautifulSoup.

It worked for two sites. Then, one site changed its layout. My code broke. Another site used dynamic content. I spent more time fixing scrapers than using data.

Then I tried AI. I fed raw HTML into an LLM.

The results were bad. The output was inconsistent. Sometimes I got JSON. Sometimes I got paragraphs. The model hallucinated data. The cost was too high because I sent too many tokens.

I found a middle ground. I now use a hybrid approach.

Here is my process:

Preprocess the HTML. I strip scripts, styles, and navigation bars. I only keep the visible text. This reduces token counts and costs.
Use JSON mode or function calling. This forces the LLM to return structured data.
Add retry logic. If the JSON is malformed, the code tries again.
Cache results. I save successful extractions per URL to avoid paying for the same page twice.
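
The four steps above can be sketched roughly as follows. This is a minimal illustration, not my production code. It uses only the Python standard library, and `call_llm` is a hypothetical callable that you would wire to your provider's JSON mode or function-calling API. The names `visible_text` and `extract_product`, the list of skipped tags, and the `name`/`price` keys are assumptions made for the example.

```python
import json
from html.parser import HTMLParser

# Tags whose contents are dropped before prompting (an assumed list).
SKIPPED_TAGS = {"script", "style", "nav", "noscript"}


class VisibleTextExtractor(HTMLParser):
    """Collects text that is not nested inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Step 1: strip scripts, styles, and navigation; keep the text."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def extract_product(url, html, call_llm, cache, max_attempts=3):
    """Return a dict with 'name' and 'price', or None if every attempt fails."""
    # Step 4: never pay twice for the same URL.
    if url in cache:
        return cache[url]
    prompt = (
        "Return a JSON object with keys 'name' and 'price' "
        "for this product page:\n" + visible_text(html)
    )
    # Step 3: retry when the reply is not the JSON we asked for.
    for _ in range(max_attempts):
        raw = call_llm(prompt)  # Step 2: hypothetical JSON-mode call
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and {"name", "price"} <= data.keys():
            cache[url] = data
            return data
    return None
```

Here `cache` is a plain dict for brevity. A file or database keyed by URL would play the same role across runs.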

Regex and BeautifulSoup are still best for static, well-structured sites. They are fast and free.

LLMs are better for messy, unpredictable sites. But you must clean the data first to keep costs low.
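
One way to combine the two is to try a cheap site-specific parser first and fall back to the LLM only when no parser exists or the parser finds nothing. The sketch below assumes a hypothetical store at `shop.example.com` with a made-up layout, and `llm_extract` stands in for whatever LLM pipeline you use. None of these names come from a real library.

```python
import re
from urllib.parse import urlparse


def parse_example_shop(html):
    """Hand-written parser for one hypothetical, stable site layout."""
    name = re.search(r'<h1 class="title">(.*?)</h1>', html, re.S)
    price = re.search(r'<span class="price">\$([\d.,]+)</span>', html)
    if not (name and price):
        return None  # layout changed or fields missing
    return {"name": name.group(1).strip(), "price": price.group(1)}


# Registry of sites that are static and well-structured enough for a parser.
PARSERS = {"shop.example.com": parse_example_shop}


def extract(url, html, llm_extract):
    """Try the free parser first, then fall back to the LLM.

    Returns (result, source) so you can track how often each path is used.
    """
    parser = PARSERS.get(urlparse(url).hostname)
    if parser is not None:
        result = parser(html)
        if result is not None:
            return result, "parser"
    return llm_extract(html), "llm"
```

The fallback also covers the case where a known site changes its layout: the parser returns `None` and the page goes to the LLM instead of breaking the run.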

My lessons learned:

Monitor your costs and success rates.
Watch your latency. LLM calls take seconds, not milliseconds.
Validate the output. Check that the price looks like a real price, for example a positive number in a plausible range.
Use a hybrid system. Use a parser for easy sites and an LLM for the messy ones.
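
A price check can be as small as the sketch below. The function name `parse_price`, the bounds, and the assumption that a dot is the decimal separator are illustrative choices, so adjust them for your stores and currencies.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_price(raw, low=Decimal("0.01"), high=Decimal("100000")):
    """Return a Decimal if raw looks like a plausible price, else None."""
    if isinstance(raw, bool) or not isinstance(raw, (str, int, float)):
        return None
    # Drop currency symbols, thousands commas, and spaces.
    # Assumes "." is the decimal separator.
    cleaned = re.sub(r"[^\d.\-]", "", str(raw))
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None  # e.g. "Call for price" or "1.2.3"
    return value if low <= value <= high else None
```

Anything that returns `None` can be counted as a failed extraction, which feeds directly into the success rate you monitor.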

This setup saves me hours of work. I can add a new store in under an hour.

How do you handle messy web data? Do you prefer parsers or LLMs?

Source: https://dev.to/__c1b9e06dc90a7e0a676b/from-regex-to-llms-my-journey-extracting-unstructured-web-data-5gmh

Optional learning community: https://t.me/GyaanSetuAi
