𝗦𝘁𝗼𝗽 𝗪𝗿𝗶𝘁𝗶𝗻𝗴 𝗕𝗿𝗶𝘁𝘁𝗹𝗲 𝗦𝗰𝗿𝗮𝗽𝗲𝗿𝘀

📅 1 week ago ⏱ 1 min read

I scraped 5,000 product pages. Every page had different HTML. My CSS selectors broke. I tried XPath. I tried regex. Nothing worked.

I built a scoring system. It failed during A/B tests. I used a headless browser. I wrote 500 lines of error handling. The scraper was still too fragile.

I stopped fighting the HTML structure. I extracted all visible text. I asked an LLM to find the data. I asked for JSON.
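One way to do the "extract all visible text" step is sketched below. It uses only Python's standard-library `html.parser`. The `HIDDEN_TAGS` set and the `visible_text` helper are illustrative names, not part of the original setup. The sketch skips tags that never render, but it does not detect elements hidden by CSS.

```python
from html.parser import HTMLParser

# Tags whose contents never render as visible text.
# Assumption: this list is illustrative, not exhaustive.
HIDDEN_TAGS = {"script", "style", "noscript", "template", "head"}


class VisibleTextParser(HTMLParser):
    """Collects text nodes that sit outside hidden tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._hidden_depth = 0  # > 0 while inside any hidden tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._hidden_depth == 0:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the page's text nodes, one per line, ignoring the markup."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

The output ignores tag names, classes, and nesting entirely, so a layout change that keeps the same words on the page should leave it unchanged.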

The process is simple.
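A minimal sketch of the "ask an LLM for JSON" step follows. It is provider-agnostic: `call_llm` is a hypothetical callable that takes a prompt string and returns the model's reply, so you would plug in whichever client you use. The prompt wording and the field names in the usage note are assumptions, not the original prompt.

```python
import json
from typing import Callable, Dict, List

# Assumption: illustrative prompt wording, not the original prompt.
PROMPT_TEMPLATE = (
    "Extract the product data from the page text below.\n"
    "Reply with a single JSON object with these keys: {fields}.\n"
    "Use null for anything that is not present.\n\n"
    "PAGE TEXT:\n{text}"
)


def build_prompt(page_text: str, fields: List[str]) -> str:
    return PROMPT_TEMPLATE.format(fields=", ".join(fields), text=page_text)


def parse_json_reply(reply: str) -> Dict:
    """Parse the reply, tolerating extra text or code fences around the JSON."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    return data


def extract_fields(
    page_text: str,
    fields: List[str],
    call_llm: Callable[[str], str],  # hypothetical: prompt in, reply text out
) -> Dict:
    """Ask the model for the fields and keep only the keys we requested."""
    reply = call_llm(build_prompt(page_text, fields))
    data = parse_json_reply(reply)
    return {name: data.get(name) for name in fields}
```

Usage would look like `extract_fields(text, ["name", "price", "sku"], call_llm)`, where the field list is whatever you need from the page. Missing fields come back as `None`, and unrequested keys are dropped, so downstream code always sees the same shape.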

Here are the results.

Use this method for messy sites.

Avoid this method for these cases.

Treat web pages as text. Stop treating them as structured documents.