𝗦𝘁𝗼𝗽 𝗙𝗶𝗴𝗵𝘁𝗶𝗻𝗴 𝗪𝗶𝘁𝗵 𝗗𝘆𝗻𝗮𝗺𝗶𝗰 𝗪𝗲𝗯 𝗦𝗰𝗿𝗮𝗽𝗶𝗻𝗴
I scraped websites for years. Static pages were easy. Modern sites are harder. They use React or Vue. Data hides behind API calls. It loads after the page opens.
I tried requests and BeautifulSoup. I got empty results. I tried mimicking API calls. CSRF tokens and rate limits broke the script. I tried Playwright. CSS selectors changed every week. Maintenance became a nightmare.
A colleague gave me a new idea. Stop parsing HTML. Let the browser render the page. Ask an AI what it sees.
The process is simple. Use a headless browser. Wait for the page to load. Take a screenshot. Send the image to a vision model. Tell the AI what data you need.
The AI finds the product name and price. It does not need fragile selectors. It understands the layout.
This method has trade-offs.
- Cost: API credits add up for large jobs.
- Speed: Each page takes seconds to process.
- Accuracy: AI sometimes misreads numbers.
Do not use this if:
- A public API exists.
- You need to scrape millions of pages.
- The site has CAPTCHAs.
Start with cheap options. Use vision models as a last resort. Cache your results to save money. Try local models like LLaVA to cut costs.
Hybrid setups work best. Use selectors for stable parts. Use AI for the hard parts.
Source: https://dev.to/__c1b9e06dc90a7e0a676b/i-thought-i-knew-web-scraping-until-i-hit-javascript-5e9g Optional learning community: https://t.me/GyaanSetuAi