𝗧𝗵𝗲 𝗘𝗻𝗱 𝗼𝗳 𝗣𝗲𝗿𝗳𝗲𝗰𝘁 𝗦𝗲𝗹𝗲𝗰𝘁𝗼𝗿𝘀 I spent years building scrapers with CSS selectors, XPath, and regex. But every new site meant new selectors. When the HTML changed, my script would break. It was exhausting.
I needed to monitor prices across 30 online stores with different DOM structures. I tried BeautifulSoup and Selenium, but they were slow and required site-specific selectors. I even tried heuristic approaches, but they only worked 60% of the time.
Then I tried something new: feeding raw HTML to a large language model (LLM) and asking it to return the data I needed. I used OpenAI's API and was shocked by the results. With a good prompt, the LLM could extract product name, price, and availability from an entire page of HTML without any selectors.
Here's how it works:
- Send a snippet of HTML to an LLM with a system prompt that explains the schema you want back
- The LLM returns the data in valid JSON
I call this function on a small, cleaned-up version of the page HTML. The results are surprisingly consistent. For a test set of 10 product pages, it got the price right 9 out of 10 times.
This approach has some limitations:
- Token limits: full pages can be huge, so you need to trim the HTML aggressively
- Cost: each request costs ~$0.02-0.05, so it's not cheap
- Hallucination: the LLM can invent data if it's not present, so you need to add validation
This technique is great for:
- Pages with highly variable structure
- When you only need a handful of fields
- Prototyping or small-scale projects
It's not great for:
- High-volume scraping
- When you need perfect accuracy
- Scraping behind login or CAPTCHAs
What's your go-to method for extracting data from wildly different HTML structures? Source: https://dev.to/__c1b9e06dc90a7e0a676b/why-i-gave-up-on-perfect-selectors-and-asked-gpt-to-extract-my-data-5efn