WHEN REGEX FAILS FOR DATA EXTRACTION
I scraped websites for years with BeautifulSoup and regex. These tools work well when pages follow a simple, consistent template.
Then I hit a harder task: I needed data from hundreds of e-commerce sites, and every site had different HTML. My selectors kept breaking, and I spent more time fixing code than using the data.
I tried several fixes: more complex parsing code, classifiers, and headless browsers. None of them solved the problem.
Then I tried something different: I sent the raw HTML to a language model and asked it to extract the data. It worked.
To use this method in your own project:
- Trim the HTML to keep the token count low.
- Use a prompt with 2 to 3 examples.
- Parse the JSON response.
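
The three steps above can be sketched in Python as follows. This is a minimal illustration using only the standard library, not the original implementation. The tag list in `SKIP_TAGS`, the `max_chars` cutoff, the sample products in `FEW_SHOT`, and the field names `name`, `price`, and `currency` are all assumptions invented for the example. The model call itself is left out because it depends on your provider.

```python
import json
import re
from html.parser import HTMLParser

# Tags whose contents rarely carry product data (an assumption; tune per site).
SKIP_TAGS = {"script", "style", "noscript", "svg", "iframe"}


class HTMLTrimmer(HTMLParser):
    """Collects visible text, dropping markup and the contents of SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def trim_html(html, max_chars=8000):
    """Step 1: reduce a page to whitespace-normalized visible text."""
    parser = HTMLTrimmer()
    parser.feed(html)
    parser.close()
    text = re.sub(r"\s+", " ", " ".join(parser.parts))
    return text[:max_chars]  # crude cap to stay under a token budget


# Hypothetical few-shot examples; replace with pages from your own targets.
FEW_SHOT = [
    ("Acme Kettle 1.7L Now $24.99 In stock",
     {"name": "Acme Kettle 1.7L", "price": "24.99", "currency": "USD"}),
    ("Luna Chair Price: EUR 89.00 Ships in 3 days",
     {"name": "Luna Chair", "price": "89.00", "currency": "EUR"}),
]


def build_prompt(page_text):
    """Step 2: build a prompt that shows the model 2 worked examples."""
    lines = [
        "Extract name, price, and currency from the page text.",
        "Reply with one JSON object and nothing else.",
        "",
    ]
    for text, expected in FEW_SHOT:
        lines += [f"Page: {text}", f"JSON: {json.dumps(expected)}", ""]
    lines += [f"Page: {page_text}", "JSON:"]
    return "\n".join(lines)


def parse_response(raw):
    """Step 3: pull the JSON object out of a reply that may include extra text."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in model response")
    return json.loads(raw[start:end + 1])
```

Pass the string from `build_prompt(trim_html(html))` to whichever model client you use, then run its reply through `parse_response`. Here trimming goes all the way down to visible text; on pages where the structure carries meaning, keeping some tags may work better.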
This approach has trade-offs:
- Cost: Every API call costs money.
- Speed: Each request takes a few seconds.
- Errors: LLMs sometimes hallucinate values that are not on the page.
- Limits: Models cap the number of tokens per request, so large pages may not fit.
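
Two of these trade-offs, errors and limits, can be partly guarded against in code. The sketch below is one possible approach, not something from the original workflow: `estimate_tokens` uses a rough 4-characters-per-token rule of thumb (an assumption; real tokenizers vary by model and language), and `ungrounded_fields` is a hypothetical helper that flags extracted values that never appear in the page text.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough size check before sending a page; a heuristic, not a real tokenizer."""
    return (len(text) + chars_per_token - 1) // chars_per_token


def ungrounded_fields(record, page_text):
    """Return the field names whose values do not appear verbatim in page_text."""
    # Normalize case and whitespace on both sides before comparing.
    haystack = " ".join(page_text.lower().split())
    missing = []
    for key, value in record.items():
        needle = " ".join(str(value).lower().split())
        if needle and needle not in haystack:
            missing.append(key)
    return missing
```

For example, `ungrounded_fields({"name": "Acme Kettle", "sku": "ZX-9"}, "Acme Kettle now $24.99")` flags `sku`, because that value is nowhere on the page. Fields the model is asked to normalize, such as turning `$` into `USD`, will also be flagged, so treat the result as a prompt for review and not as proof of an error.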
My advice: Use CSS selectors for consistent sites. Use LLMs for messy pages. A hybrid approach works best.
This saves weeks of effort.
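
The hybrid approach can be sketched as a small router that tries selectors first and falls back to the model only when they fail. Everything named here is hypothetical: `selector_extractor` and `llm_extractor` stand for functions you would write yourself, and `REQUIRED_FIELDS` is an assumed schema.

```python
from typing import Callable, Optional, Tuple

Extractor = Callable[[str], Optional[dict]]

# Assumed schema: a record counts as complete when these fields are non-empty.
REQUIRED_FIELDS = ("name", "price")


def is_complete(record):
    """True when the record exists and every required field has a value."""
    return record is not None and all(record.get(f) for f in REQUIRED_FIELDS)


def extract_hybrid(html: str,
                   selector_extractor: Extractor,
                   llm_extractor: Extractor) -> Tuple[Optional[dict], str]:
    """Try the cheap selector path first; fall back to the LLM path if it fails."""
    try:
        record = selector_extractor(html)
    except Exception:
        # A broken selector on an unfamiliar layout is treated as a miss.
        record = None
    if is_complete(record):
        return record, "selectors"
    return llm_extractor(html), "llm"
```

Returning the path name alongside the record makes it easy to log how often the fallback fires, which in turn shows how much of the API cost comes from messy pages.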
Source: https://dev.to/__c1b9e06dc90a7e0a676b/when-regex-fails-using-llms-to-extract-structured-data-from-messy-pages-2mbg