When Regex Fails: Using LLMs for Data Extraction
I scraped web data for years. I used BeautifulSoup and regex. It worked for simple templates.
Then I had to scrape hundreds of e-commerce sites. Every site had a different structure. My selectors broke.
I tried more code. I tried classifiers. I tried headless browsers. These tools were slow or fragile.
I tried a new idea. I sent raw HTML to a language model. I used a prompt with a few examples. It worked.
Follow these steps:
- Cut the page down to the small part of the HTML that holds the data.
- Give the model 2 or 3 examples, each pairing an HTML snippet with the JSON you want.
- Send the trimmed HTML after the examples.
- Parse the JSON response.
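Here is a minimal sketch of those steps in Python. It is an illustration, not the exact code I ran. The example snippets, the field names `name` and `price`, and the `call_model` callable are all hypothetical. `call_model` stands in for whatever client sends a prompt to your model and returns its text reply.

```python
import json
from typing import Callable

# Hypothetical few-shot examples: (HTML snippet, JSON we want back).
EXAMPLES = [
    ('<div class="p"><h2>Red Mug</h2><span>$12.50</span></div>',
     {"name": "Red Mug", "price": "12.50"}),
    ('<li><b>Desk Lamp</b> <i>USD 34.00</i></li>',
     {"name": "Desk Lamp", "price": "34.00"}),
]


def build_prompt(html_snippet: str) -> str:
    """Assemble a few-shot prompt: instruction, examples, then the new HTML."""
    parts = ["Extract the product as JSON with keys name and price. "
             "Reply with JSON only."]
    for html, wanted in EXAMPLES:
        parts.append(f"HTML:\n{html}\nJSON:\n{json.dumps(wanted)}")
    parts.append(f"HTML:\n{html_snippet}\nJSON:")
    return "\n\n".join(parts)


def parse_response(text: str) -> dict:
    """Parse the reply, tolerating extra prose around the JSON object."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in model response")
    return json.loads(text[start:end + 1])


def extract(html_snippet: str, call_model: Callable[[str], str]) -> dict:
    """call_model is a placeholder: it takes a prompt and returns text."""
    return parse_response(call_model(build_prompt(html_snippet)))
```

Passing the model call in as a function keeps the sketch independent of any provider. You can swap in a hosted API or a local model without touching the prompt or the parser.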
This method has trade-offs:
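- Cost: You pay per request. 10,000 products cost $200 to $500, or $0.02 to $0.05 each. The exact price depends on the model and how much HTML you send.
- Speed: Each request takes 1 to 3 seconds. Run one at a time, 10,000 requests take about 3 to 8 hours.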
- Errors: Models make mistakes. Validate numeric fields such as prices with a regex check before you store them.
- Tokens: Do not send the whole page. Trim the HTML first.
- Privacy: Use local models like Llama 3 for private data, so the HTML never leaves your machine.
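Two of these trade-offs are easy to handle in code: trimming the HTML and checking numbers. Below is a rough sketch under my own assumptions. The function names, the 4,000-character cap (a crude stand-in for a token budget), and the price format of plain decimals with up to two places are all illustrative choices.

```python
import re

# Whole <script>/<style> elements and comments rarely hold product data.
_NOISE = re.compile(r"<(script|style)\b.*?</\1\s*>|<!--.*?-->",
                    re.DOTALL | re.IGNORECASE)
# Assumed price format: digits, optionally a dot and 1 or 2 decimals.
_PRICE = re.compile(r"\d+(\.\d{1,2})?")


def trim_html(html: str, max_chars: int = 4000) -> str:
    """Remove noise, collapse whitespace, and cap the length."""
    cleaned = _NOISE.sub("", html)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned[:max_chars]


def is_valid_price(value: str) -> bool:
    """Accept '12' or '12.50'; reject anything else, e.g. '$12.50'."""
    return _PRICE.fullmatch(value) is not None
```

Regex is fine here because the job is narrow. It strips obvious noise and checks a single field. It does not try to understand the page layout.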
Use a hybrid system. Use CSS selectors for known sites. Use LLMs for unknown layouts.
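A hybrid system can be as small as a lookup by hostname. This is one possible sketch, not a full design. `make_hybrid`, the `known` mapping, and `llm_fallback` are hypothetical names. Each extractor is assumed to be a function that takes HTML and returns a dict.

```python
from typing import Callable, Dict
from urllib.parse import urlparse

Extractor = Callable[[str], dict]


def make_hybrid(known: Dict[str, Extractor],
                llm_fallback: Extractor) -> Callable[[str, str], dict]:
    """Route known hosts to their selector-based extractor, the rest to the LLM."""
    def extract(url: str, html: str) -> dict:
        host = urlparse(url).hostname or ""
        handler = known.get(host, llm_fallback)
        return handler(html)
    return extract
```

Known sites stay fast and cheap. Only pages from unknown hosts pay the per-request cost of the model.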
Stop writing fragile parsers. Use a prompt.