𝗥𝗲𝗴𝗲𝘅 𝗕𝗿𝗼𝗸𝗲 𝗠𝘆 𝗦𝗰𝗿𝗮𝗽𝗲𝗿
I built scrapers for years. I used CSS selectors and regex. It worked until a website changed.
I managed 200 supplier sites. One site changed its layout. My code broke. I spent days fixing it. I tried headless browsers. They were too slow.
I tried a new way. I used an LLM. I sent the cleaned page text to GPT. I asked for JSON back.
My process:
- Clean HTML with BeautifulSoup.
- Remove scripts and styles.
- Send text to the LLM.
- Get name, price, and SKU.
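A rough sketch of the cleaning steps above. It uses Python's built-in html.parser instead of BeautifulSoup so it runs with no installs (the BeautifulSoup equivalent is in a comment). The LLM call itself is left out. The names clean_html and build_prompt, the prompt wording, and the sample page are made up for illustration.

```python
from html.parser import HTMLParser

# With BeautifulSoup the same cleaning step would look roughly like:
#   soup = BeautifulSoup(html, "html.parser")
#   for tag in soup(["script", "style"]):
#       tag.decompose()
#   text = soup.get_text(separator=" ", strip=True)


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style blocks.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def clean_html(html):
    """Return the page's visible text as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def build_prompt(page_text):
    """Build the text sent to the LLM (hypothetical wording)."""
    return (
        "Extract the product from the page text below. "
        'Reply with JSON only, using the keys "name", "price", "sku". '
        "Use null for anything missing.\n\n" + page_text
    )


# Made-up sample page.
html = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Blue Widget</h1><p>Price: $19.99</p><p>SKU: BW-1001</p></body></html>"
)
text = clean_html(html)
prompt = build_prompt(text)
print(text)
```

The prompt string is what goes to the model. Stripping scripts and styles first keeps the input short, which helps with both cost and speed.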
Pros:
- It tolerates layout changes, because it reads the text and not the page structure.
- It reads prices in different currencies and formats.
- It interprets stock wording, not just numbers.
Cons:
- It costs money.
- It is slower than regex.
- It can make mistakes, so the output needs checking.
My tips for you:
- Use local models like Phi-3 to save money.
- Use regex to check LLM output.
- Use regex for stable APIs or logs.
- Use LLMs for messy pages.
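Here is one way the "regex checks the LLM" tip could look. This is a sketch, not a full validator. The patterns are assumptions: a price as a plain number with up to two decimals, and a SKU as uppercase letters, digits, and dashes. Adjust them to your own suppliers. The function name validate_product is made up.

```python
import json
import re

# Assumed formats. Real supplier data may need looser or stricter rules.
PRICE_RE = re.compile(r"\d+(\.\d{1,2})?")
SKU_RE = re.compile(r"[A-Z0-9][A-Z0-9-]{2,31}")


def validate_product(raw_reply):
    """Return a list of problems. An empty list means the reply passed."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return ["reply is not valid JSON"]
    if not isinstance(data, dict):
        return ["reply is not a JSON object"]

    problems = []
    name = data.get("name")
    if not isinstance(name, str) or not name.strip():
        problems.append("name is missing")
    # fullmatch rejects extra characters such as currency symbols.
    if not PRICE_RE.fullmatch(str(data.get("price", ""))):
        problems.append("price is not a plain number")
    if not SKU_RE.fullmatch(str(data.get("sku", ""))):
        problems.append("sku has an unexpected format")
    return problems


good = '{"name": "Blue Widget", "price": "19.99", "sku": "BW-1001"}'
bad = '{"name": "Blue Widget", "price": "$19.99", "sku": "BW-1001"}'
print(validate_product(good))
print(validate_product(bad))
```

When the list is not empty, retry the page or send it to a human. The regex does not extract anything here. It only catches replies that look wrong.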
How do you handle website changes?