𝗦𝘁𝗼𝗽 𝗙𝗶𝗴𝗵𝘁𝗶𝗻𝗴 𝘁𝗵𝗲 𝗗𝗢𝗠

📅 1 week ago · ⏱ 1 min read

I tried to scrape 5,000 product pages. The scrape failed. Every page used different HTML. One price sat in a span. Another hid in a div. My selectors broke.

I tried XPath and regex. Neither worked reliably. I wrote 500 lines of code to handle errors. The code was brittle.

I changed my approach. I stopped fighting the structure. I treated the page as natural language.

The new process:

This removes fragile selectors. The LLM finds the data regardless of the HTML tag.
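
Here is a minimal sketch of what such a pipeline could look like. It is an illustration under assumptions, not the exact code behind this post. It flattens the page to plain text with Python's standard library, wraps that text in a prompt, and parses a JSON reply. The names `html_to_text`, `extract_product`, `PROMPT`, and the `call_llm` hook are hypothetical; `call_llm` stands in for whatever model client you use, local or hosted.

```python
import json
from html.parser import HTMLParser
from typing import Callable


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    _SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Flatten a page to plain text so tag names no longer matter."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Assumed prompt wording; tune it for your own fields.
PROMPT = (
    "Extract the product name and price from the page text below. "
    'Reply with JSON only, shaped like {"name": string, "price": string}.\n\n'
    "Page text:\n"
)


def extract_product(html: str, call_llm: Callable[[str], str]) -> dict:
    """call_llm is a hypothetical hook: prompt string in, model reply out."""
    reply = call_llm(PROMPT + html_to_text(html))
    data = json.loads(reply)  # raises ValueError if the reply is not valid JSON
    return {"name": data.get("name"), "price": data.get("price")}
```

To try it, pass any function that takes a prompt string and returns the model's reply, for example `extract_product(page_html, my_client)`. The model only ever sees text, so a price in a span and a price in a div look the same to it. Validate the parsed reply before trusting it, because a model can return malformed JSON or a wrong value.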

Lessons learned:

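Use this for messy sites with inconsistent markup. Avoid it for stable sites, where plain selectors are cheaper and faster. Avoid it for real-time needs, because every LLM call adds latency. Use local models for private data.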

Stop treating pages as structured documents. Treat them as text.
