𝗧𝗵𝗲 𝗘𝗻𝗱 𝗼𝗳 𝗣𝗲𝗿𝗳𝗲𝗰𝘁 𝗦𝗲𝗹𝗲𝗰𝘁𝗼𝗿𝘀 I spent years building scrapers with CSS selectors, XPath, and regex. But every new site meant new selectors. When the HTML changed, my script would break. It was exhausting.

I needed to monitor prices across 30 online stores with different DOM structures. I tried BeautifulSoup and Selenium, but they were slow and required site-specific selectors. I even tried heuristic approaches, but they only worked 60% of the time.

Then I tried something new: feeding raw HTML to a large language model (LLM) and asking it to return the data I needed. I used OpenAI's API and was shocked by the results. With a good prompt, the LLM could extract product name, price, and availability from an entire page of HTML without any selectors.

Here's how it works:

I call this function on a small, cleaned-up version of the page HTML. The results are surprisingly consistent. For a test set of 10 product pages, it got the price right 9 out of 10 times.

This approach has some limitations:

This technique is great for:

It's not great for:

What's your go-to method for extracting data from wildly different HTML structures? Source: https://dev.to/__c1b9e06dc90a7e0a676b/why-i-gave-up-on-perfect-selectors-and-asked-gpt-to-extract-my-data-5efn