𝗪𝗛𝗘𝗡 𝗥𝗘𝗚𝗘𝗫 𝗙𝗔𝗜𝗟𝗦 𝗙𝗢𝗥 𝗗𝗔𝗧𝗔 𝗘𝗫𝗧𝗥𝗔𝗖𝗧𝗜𝗢𝗡

📅1 week ago⏱1 min read

I scraped websites for years. I used BeautifulSoup and regex. These tools work for simple templates.

Then I faced a hard task. I needed data from hundreds of e-commerce sites. Every site had different HTML. My selectors broke. I spent more time fixing code than using data.

I tried several fixes. I wrote more complex code. I tried classifiers. I tried headless browsers. None of them held up across that many layouts.

Then I tried a new way. I sent raw HTML to a language model. I asked it to extract data. It worked.

Use this method for your project:

Trim the HTML to keep tokens low.
Use a prompt with 2 to 3 examples.
Parse the JSON response.
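
Here is a minimal sketch of those three steps. It is an illustration, not the code from the original project. It uses only the Python standard library so it runs as-is; BeautifulSoup would work for the trimming step too. The names SKIP_TAGS, FEW_SHOT, trim_html, build_prompt, parse_reply and extract_with_llm are made up for this example, the field names (name, price, currency) are assumptions, and call_model stands in for whatever LLM client you use (prompt string in, reply string out).

```python
import json
from html.parser import HTMLParser

# Tags whose contents rarely carry product data (an assumption; tune per project).
SKIP_TAGS = {"script", "style", "noscript", "svg", "head"}


class TextTrimmer(HTMLParser):
    """Collects visible text and drops everything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def trim_html(html, max_chars=6000):
    """Step 1: reduce the page to visible text and cap its length."""
    parser = TextTrimmer()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)[:max_chars]


# Step 2: two hand-written examples (hypothetical data) shown to the model.
FEW_SHOT = [
    ("Acme Kettle\n$24.99\nIn stock",
     {"name": "Acme Kettle", "price": "24.99", "currency": "USD"}),
    ("Bouilloire Luxe\n39,00 €",
     {"name": "Bouilloire Luxe", "price": "39.00", "currency": "EUR"}),
]


def build_prompt(page_text):
    parts = ["Extract the product as JSON with keys name, price, currency. "
             "Reply with JSON only."]
    for text, answer in FEW_SHOT:
        parts.append(f"Page:\n{text}\nJSON:\n{json.dumps(answer)}")
    parts.append(f"Page:\n{page_text}\nJSON:")
    return "\n\n".join(parts)


def parse_reply(reply):
    """Step 3: models may wrap JSON in extra text, so take the outermost {...}."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    return json.loads(reply[start:end + 1])


def extract_with_llm(html, call_model):
    # call_model is a hypothetical callable: prompt string in, reply string out.
    return parse_reply(call_model(build_prompt(trim_html(html))))
```

Passing the model call in as a function keeps the sketch independent of any one provider and makes it easy to test with a fake reply.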

This approach has trade-offs:

Costs: API calls cost money.
Speed: Requests take a few seconds.
Errors: LLMs sometimes hallucinate.
Limits: You have token caps.
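
Two of these trade-offs can be checked cheaply before and after each call. The sketch below is one possible guard, not a complete solution. The four-characters-per-token figure is a rough rule of thumb and an assumption here; use your provider's tokenizer for real numbers. The function names and the 4000-token default are hypothetical.

```python
def estimate_tokens(text):
    # Rough heuristic (an assumption): about 4 characters per token for English text.
    return len(text) // 4 + 1


def fits_budget(text, max_tokens=4000):
    """Check the trimmed page against a token cap before sending it."""
    return estimate_tokens(text) <= max_tokens


def ungrounded_fields(record, page_text):
    """Return the keys whose values do not appear verbatim in the page text.

    A value the page never contained is a likely hallucination.
    Whitespace and case are normalized before comparing.
    """
    haystack = " ".join(page_text.split()).lower()
    missing = []
    for key, value in record.items():
        needle = " ".join(str(value).split()).lower()
        if needle and needle not in haystack:
            missing.append(key)
    return missing
```

This check is strict on purpose. A value the model normalized, such as "USD" inferred from a "$" sign, will be flagged too, so treat the result as a list of fields to review, not as proof of an error.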

My advice: Use CSS selectors for consistent sites. Use LLMs for messy pages. A hybrid approach works best.
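
One way to wire up the hybrid is a small router: try a site-specific selector rule first, and fall back to the model only when there is no rule or the rule returns an incomplete record. This is a sketch under assumptions. REQUIRED, is_complete and extract are hypothetical names, selector_extractors is an assumed mapping from hostname to a function that takes HTML and returns a dict, and llm_extract is any function that does the same through a model.

```python
from urllib.parse import urlparse

# Fields a record must have before we trust the selector result (an assumption).
REQUIRED = ("name", "price")


def is_complete(record):
    return record is not None and all(record.get(key) for key in REQUIRED)


def extract(url, html, selector_extractors, llm_extract):
    """Return (record, method), where method is "selectors" or "llm"."""
    host = urlparse(url).netloc.lower()
    rule = selector_extractors.get(host)
    if rule is not None:
        try:
            record = rule(html)
        except Exception:
            # A broken selector should trigger the fallback, not crash the run.
            record = None
        if is_complete(record):
            return record, "selectors"
    # No rule for this site, or the rule broke: pay for one model call.
    return llm_extract(html), "llm"
```

Returning the method alongside the record makes it easy to count how often the fallback fires, which tells you which selector rules need repair and what the model calls are costing.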

The hybrid setup can save weeks of selector maintenance.

Source: https://dev.to/__c1b9e06dc90a7e0a676b/when-regex-fails-using-llms-to-extract-structured-data-from-messy-pages-2mbg
Optional learning community: https://t.me/GyaanSetuAi
