𝗜 𝗦𝘁𝗼𝗽𝗽𝗲𝗱 𝗪𝗿𝗶𝘁𝗶𝗻𝗴 𝗥𝗲𝗴𝗲𝘅 𝗙𝗼𝗿 𝗪𝗲𝗯 𝗦𝗰𝗿𝗮𝗽𝗶𝗻𝗴
I spent years scraping the web. Sites changed designs. My scrapers broke. I spent hours fixing regex and XPath. It felt like a fight.
Then I tried telling a computer what I wanted in plain English. I used Large Language Models (LLMs) for data extraction.
Old scripts needed long lists of rules. One site used one tag. Another used a different tag. A third used dynamic classes. This was hard to maintain.
The new way is simple. Send a snippet of HTML to an LLM. Ask for data as JSON. The LLM finds patterns for you.
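Here is a minimal sketch of that flow in Python. It is not tied to any provider: `call_llm` is a hypothetical function that takes a prompt and returns the model's text reply, and `fake_llm` is a stand-in so the example runs offline. The field names and the HTML snippet are made up for illustration.

```python
import json
from typing import Callable


def build_prompt(html_snippet: str, fields: list[str]) -> str:
    """Ask for the named fields as a single JSON object."""
    return (
        "Extract the following fields from the HTML below.\n"
        f"Fields: {', '.join(fields)}\n"
        "Reply with one JSON object and nothing else. "
        "Use null for any field that is not present.\n\n"
        f"HTML:\n{html_snippet}"
    )


def extract(html_snippet: str, fields: list[str],
            call_llm: Callable[[str], str]) -> dict:
    """Send the snippet to the model and parse its reply as JSON."""
    reply = call_llm(build_prompt(html_snippet, fields))
    data = json.loads(reply)  # raises ValueError if the reply is not JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    # Keep only the requested fields; missing ones become None.
    return {name: data.get(name) for name in fields}


# Hypothetical stand-in for a real model call (assumption, not a real API).
def fake_llm(prompt: str) -> str:
    return '{"title": "Blue Widget", "price": "19.99", "extra": "ignored"}'


snippet = '<div class="x9f"><h2>Blue Widget</h2><span>$19.99</span></div>'
result = extract(snippet, ["title", "price"], fake_llm)
print(result)
```

Parsing the reply with `json.loads` and keeping only the requested keys means a chatty or malformed reply fails loudly instead of leaking into your data.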
LLMs have issues.
- Cost: Hosted APIs bill per token, so thousands of pages add up.
- Speed: Requests take seconds.
- Errors: LLMs sometimes make up data.
- Privacy: Third party APIs see your data.
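For the made-up data problem, one cheap guard is to check that every extracted value actually appears in the page text. This is a sketch under assumptions, not a complete fix: `ungrounded_fields` is a hypothetical helper, it only catches values that are absent verbatim, and it will also flag values the model legitimately reformatted.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every text node in the document (includes script/style text)."""

    def __init__(self):
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the page text with whitespace collapsed to single spaces."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(" ".join(collector.chunks).split())


def ungrounded_fields(record: dict, html: str) -> list[str]:
    """Names of fields whose value does not appear verbatim in the page text."""
    text = visible_text(html).casefold()
    missing = []
    for name, value in record.items():
        if value is None:
            continue  # the model said "not present"; nothing to verify
        needle = " ".join(str(value).split()).casefold()
        if needle not in text:
            missing.append(name)
    return missing


page = '<div><h2>Blue Widget</h2><span>$19.99</span></div>'
record = {"title": "Blue Widget", "price": "19.99", "sku": "BW-100"}
suspect = ungrounded_fields(record, page)
print(suspect)
```

Fields that come back in `suspect` are candidates for a retry or a manual look.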
Use a mix of tools.
- Use selectors for stable IDs.
- Use LLMs for messy text.
- Cache results to save money.
- Set a budget cap.
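The four rules above can be sketched together as one small class. Everything here is illustrative: `HybridExtractor`, `select_by_id`, and `fake_llm` are hypothetical names, the selector is a bare-bones id lookup built on the standard library, and the budget cap is a simple count of LLM calls standing in for a real money limit.

```python
import hashlib
from html.parser import HTMLParser
from typing import Callable, Optional

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never have a close tag


class IdTextFinder(HTMLParser):
    """Collect the text inside the first element with a given id."""

    def __init__(self, target_id: str):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while inside the target element
        self.done = False
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def select_by_id(html: str, element_id: str) -> Optional[str]:
    finder = IdTextFinder(element_id)
    finder.feed(html)
    return "".join(finder.parts).strip() or None


class BudgetExceeded(RuntimeError):
    pass


class HybridExtractor:
    def __init__(self, call_llm: Callable[[str], str], max_llm_calls: int):
        self.call_llm = call_llm
        self.max_llm_calls = max_llm_calls  # call count as a proxy for spend
        self.llm_calls = 0
        self.cache: dict[str, str] = {}

    def get(self, html: str, element_id: str, question: str) -> str:
        # 1. Stable id present: use the selector, which is fast and free.
        value = select_by_id(html, element_id)
        if value is not None:
            return value
        # 2. Same question on the same HTML: reuse the cached answer.
        key = hashlib.sha256((question + "\n" + html).encode("utf-8")).hexdigest()
        if key in self.cache:
            return self.cache[key]
        # 3. Refuse to spend past the cap.
        if self.llm_calls >= self.max_llm_calls:
            raise BudgetExceeded(f"cap of {self.max_llm_calls} LLM calls reached")
        # 4. Messy page: fall back to the LLM and remember the answer.
        self.llm_calls += 1
        answer = self.call_llm(f"{question}\n\nHTML:\n{html}")
        self.cache[key] = answer
        return answer


def fake_llm(prompt: str) -> str:  # hypothetical stand-in for a real model
    return "19.99"


extractor = HybridExtractor(fake_llm, max_llm_calls=1)
stable = '<span id="price">19.99</span>'
messy = '<div class="a8x">Now only 19.99 while stocks last</div>'
a = extractor.get(stable, "price", "What is the price?")  # selector only
b = extractor.get(messy, "price", "What is the price?")   # one LLM call
c = extractor.get(messy, "price", "What is the price?")   # served from cache
```

In this run only one of the three lookups reaches the model. A further uncached messy page would raise `BudgetExceeded` instead of quietly spending more.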
Selectors are fast and free. Plain-English prompts work well for messy data. Try the mix on your next project.
What is your tool for messy web data?
Source: https://dev.to/__c1b9e06dc90a7e0a676b/i-stopped-writing-regex-for-web-scraping-heres-what-i-do-instead-584a