𝗜 𝗦𝘁𝗼𝗽𝗽𝗲𝗱 𝗪𝗿𝗶𝘁𝗶𝗻𝗴 𝗥𝗲𝗴𝗲𝘅 𝗙𝗼𝗿 𝗪𝗲𝗯 𝗦𝗰𝗿𝗮𝗽𝗶𝗻𝗴


I spent years scraping the web. Sites changed their designs. My scrapers broke. I spent hours on regex and XPath. It felt like a fight.

I tried telling a computer what I wanted in English. I used Large Language Models for data extraction.

Old scripts needed long lists of rules. One site used one tag. Another used a different tag. A third used dynamic classes. This was hard to maintain.

The new way is simple. Send a snippet of HTML to an LLM. Ask for data as JSON. The LLM finds patterns for you.
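Here is a minimal sketch of that idea in Python. It is not tied to any provider: `call_llm` is a hypothetical function you supply that takes a prompt string and returns the model's reply as text. The field names and the prompt wording are also just assumptions for illustration.

```python
import json
from typing import Callable


def build_prompt(html_snippet: str, fields: list[str]) -> str:
    """Describe the wanted fields in plain English and attach the HTML."""
    field_list = ", ".join(fields)
    return (
        f"Extract the following fields from the HTML below: {field_list}.\n"
        "Reply with a single JSON object and nothing else. "
        "Use null for any field that is not present.\n\n"
        f"HTML:\n{html_snippet}"
    )


def extract_with_llm(
    html_snippet: str,
    fields: list[str],
    call_llm: Callable[[str], str],  # hypothetical: prompt in, reply text out
) -> dict:
    """Ask the model for JSON and return only the requested fields."""
    reply = call_llm(build_prompt(html_snippet, fields))
    # Replies may wrap the JSON in extra text, so keep only the span
    # from the first "{" to the last "}".
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in the model reply")
    data = json.loads(reply[start:end + 1])
    # Missing fields come back as None; unrequested keys are dropped.
    return {name: data.get(name) for name in fields}
```

To use it, pass a small piece of the page (not the whole document) plus the field names, for example `extract_with_llm(snippet, ["title", "price"], call_llm)`. Keeping the snippet small keeps the request short.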

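LLMs have issues. They are slower than selectors. They cost money per call. They can return values that are not on the page.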

Use a mix of tools. Use selectors where the page structure is stable. Use an LLM where it is not.
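One way to mix them, sketched in Python with only the standard library: try a cheap selector first, and call the LLM only when the selector finds nothing. The class name `price` and the `call_llm` function are hypothetical stand-ins, and the tiny class-based lookup below is an assumption in place of whatever selector library you already use.

```python
from html.parser import HTMLParser
from typing import Callable, Optional

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class FirstTextByClass(HTMLParser):
    """Collect the text of the first element carrying a given class name."""

    def __init__(self, class_name: str) -> None:
        super().__init__()
        self.class_name = class_name
        self.depth = 0        # > 0 while inside the matched element
        self.done = False     # True once the matched element has closed
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def select_text(html: str, class_name: str) -> Optional[str]:
    """Return the matched element's text, or None if nothing matched."""
    parser = FirstTextByClass(class_name)
    parser.feed(html)
    text = "".join(parser.parts).strip()
    return text or None


def extract_price(
    html: str,
    call_llm: Callable[[str], str],  # hypothetical: HTML in, extracted value out
) -> tuple[Optional[str], str]:
    """Try the selector first; fall back to the LLM only on a miss."""
    value = select_text(html, "price")
    if value is not None:
        return value, "selector"
    return call_llm(html), "llm"
```

The second item in the returned tuple records which path produced the value, so you can count how often the fallback fires.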

Selectors are fast and free. English is effective for messy data. Try it on your next project.

What is your tool for messy web data?