When Regex Fails: Using LLMs for Data Extraction

📅1 week ago⏱1 min read

I scraped web data for years. I used BeautifulSoup and regex. It worked for simple templates.

Then I had to scrape hundreds of e-commerce sites. Every site had a different structure. My selectors broke.

I tried more code. I tried classifiers. I tried headless browsers. Each approach was slow or fragile.

I tried a new idea. I sent raw HTML to a language model. I used a prompt with a few examples. It worked.
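Here is a minimal sketch of that idea, not the exact code I ran. It builds a few-shot prompt around the raw HTML and parses the reply as JSON. The field names, the two example snippets, and the `call_model` callable are all assumptions made for illustration. `call_model` stands in for whatever client you use to send a prompt string to a model and get text back.

```python
import json
from typing import Callable

# Hypothetical target fields; adjust to whatever you need to extract.
FIELDS = ["name", "price", "currency"]

# Hypothetical few-shot examples: (HTML snippet, expected JSON output).
FEW_SHOT_EXAMPLES = [
    (
        '<div class="p"><h1>Blue Mug</h1><span>$12.50</span></div>',
        {"name": "Blue Mug", "price": "12.50", "currency": "USD"},
    ),
    (
        "<section><b>Desk Lamp</b> <i>EUR 30</i></section>",
        {"name": "Desk Lamp", "price": "30", "currency": "EUR"},
    ),
]


def build_prompt(html: str, max_chars: int = 20000) -> str:
    """Assemble a few-shot extraction prompt around the raw HTML."""
    parts = [
        "Extract the product fields " + ", ".join(FIELDS) + " from the HTML.",
        "Reply with one JSON object and nothing else. Use null for missing fields.",
        "",
    ]
    for example_html, example_output in FEW_SHOT_EXAMPLES:
        parts.append("HTML:\n" + example_html)
        parts.append("JSON:\n" + json.dumps(example_output))
        parts.append("")
    # Truncate very long pages so the prompt stays within a size budget.
    parts.append("HTML:\n" + html[:max_chars])
    parts.append("JSON:")
    return "\n".join(parts)


def extract_product(html: str, call_model: Callable[[str], str]) -> dict:
    """Send the prompt to the model and parse its reply as JSON.

    `call_model` is a placeholder: any function that takes a prompt
    string and returns the model's text reply.
    """
    reply = call_model(build_prompt(html))
    # Replies sometimes wrap the JSON in extra text; keep the outermost braces.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("model reply contained no JSON object")
    data = json.loads(reply[start : end + 1])
    # Keep only the expected fields, filling missing ones with None.
    return {field: data.get(field) for field in FIELDS}
```

Passing the model in as a plain function keeps the sketch independent of any one provider. It also lets you test the prompt and parsing logic with a fake model.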

Follow these steps:

This method has trade-offs:

Use a hybrid system. Use CSS selectors for known sites. Use LLMs for unknown layouts.
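One way to wire up that hybrid is a small router keyed by domain. This is a sketch under assumptions, not a definitive design. The `HybridExtractor` class, the per-domain extractor functions, and `llm_extractor` are hypothetical names. Falling back to the LLM when a registered selector extractor returns nothing is my own choice here, made so that a layout change on a known site degrades gracefully.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urlparse

# An extractor takes raw HTML and returns a dict, or None if it found nothing.
Extractor = Callable[[str], Optional[dict]]


def _normalize(domain: str) -> str:
    """Lowercase the domain and drop a leading 'www.'."""
    domain = domain.lower()
    return domain[4:] if domain.startswith("www.") else domain


class HybridExtractor:
    """Route known domains to selector-based extractors, everything else to an LLM."""

    def __init__(self, llm_extractor: Extractor):
        self._by_domain: Dict[str, Extractor] = {}
        self._llm_extractor = llm_extractor

    def register(self, domain: str, extractor: Extractor) -> None:
        """Register a hand-written (e.g. CSS selector) extractor for one site."""
        self._by_domain[_normalize(domain)] = extractor

    def extract(self, url: str, html: str) -> dict:
        domain = _normalize(urlparse(url).hostname or "")
        selector_extractor = self._by_domain.get(domain)
        if selector_extractor is not None:
            try:
                result = selector_extractor(html)
            except Exception:
                # Treat a selector failure as a possible layout change.
                result = None
            if result:
                return {"source": "selectors", "data": result}
        # Unknown site, or the selectors came back empty: use the LLM path.
        return {"source": "llm", "data": self._llm_extractor(html)}
```

The `source` key in the result records which path produced the data. That makes it easy to count how often known sites fall through to the LLM.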

Stop writing fragile parsers. Use a prompt.
