𝗧𝗵𝗲 𝗘𝗻𝗱 𝗼𝗳 𝗣𝗲𝗿𝗳𝗲𝗰𝘁 𝗦𝗲𝗹𝗲𝗰𝘁𝗼𝗿𝘀

📅5 days ago⏱2 min read

𝗧𝗵𝗲 𝗘𝗻𝗱 𝗼𝗳 𝗣𝗲𝗿𝗳𝗲𝗰𝘁 𝗦𝗲𝗹𝗲𝗰𝘁𝗼𝗿𝘀 I spent years building scrapers with CSS selectors, XPath, and regex. But every new site meant new selectors. When the HTML changed, my script would break. It was exhausting.

I needed to monitor prices across 30 online stores with different DOM structures. I tried BeautifulSoup and Selenium, but they were slow and required site-specific selectors. I even tried heuristic approaches, but they only worked 60% of the time.

Then I tried something new: feeding raw HTML to a large language model (LLM) and asking it to return the data I needed. I used OpenAI's API and was shocked by the results. With a good prompt, the LLM could extract product name, price, and availability from an entire page of HTML without any selectors.

Here's how it works:

Send a snippet of HTML to an LLM with a system prompt that explains the schema you want back
The LLM returns the data in valid JSON

I call this function on a small, cleaned-up version of the page HTML. The results are surprisingly consistent. For a test set of 10 product pages, it got the price right 9 out of 10 times.

This approach has some limitations:

Token limits: full pages can be huge, so you need to trim the HTML aggressively
Cost: each request costs ~$0.02-0.05, so it's not cheap
Hallucination: the LLM can invent data if it's not present, so you need to add validation

This technique is great for:

Pages with highly variable structure
When you only need a handful of fields
Prototyping or small-scale projects

It's not great for:

High-volume scraping
When you need perfect accuracy
Scraping behind login or CAPTCHAs

What's your go-to method for extracting data from wildly different HTML structures? Source: https://dev.to/__c1b9e06dc90a7e0a676b/why-i-gave-up-on-perfect-selectors-and-asked-gpt-to-extract-my-data-5efn