𝗘𝘅𝘁𝗿𝗮𝗰𝘁𝗶𝗻𝗴 𝗠𝗲𝘀𝘀𝘆 𝗪𝗲𝗯 𝗗𝗮𝘁𝗮 𝗪𝗶𝘁𝗵 𝗟𝗟𝗠𝘀

📅1 week ago⏱1 min read

I scraped websites for years. I used BeautifulSoup and Scrapy. One site broke my process. The HTML was a mess. The layout changed every week. My selectors broke.

I tried an LLM. I gave it raw HTML. I asked for JSON.

Traditional tools rely on structure. LLMs rely on meaning. I describe the data. The AI finds it.
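
Here is a minimal sketch of "describe the data, ask for JSON". It is not the author's exact code. `call_llm` is a hypothetical stand-in for whatever client you use (any function that takes a prompt string and returns the model's text reply), and the field names are made up for illustration.

```python
import json
from typing import Callable


def build_prompt(html: str, fields: dict[str, str]) -> str:
    """Describe the wanted fields in plain language and attach the HTML."""
    described = "\n".join(f"- {name}: {meaning}" for name, meaning in fields.items())
    return (
        "Extract the following fields from the HTML below.\n"
        f"{described}\n"
        "Reply with a single JSON object using exactly these keys. "
        "Use null for any field that is not present in the HTML.\n\n"
        f"HTML:\n{html}"
    )


def extract(html: str, fields: dict[str, str], call_llm: Callable[[str], str]) -> dict:
    """Send the prompt through call_llm (hypothetical) and parse the JSON reply."""
    reply = call_llm(build_prompt(html, fields))
    # Models sometimes wrap JSON in prose or code fences,
    # so keep only the outermost {...} span.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in model reply")
    data = json.loads(reply[start:end + 1])
    # Drop any keys that were not asked for.
    return {name: data.get(name) for name in fields}
```

Passing the client in as a function keeps the sketch independent of any one provider and makes it easy to test with a fake reply.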

Pros:

Layout changes do not stop it.
Setup takes minutes.
It ignores noise.

Cons:

API calls cost money.
It is slower than a local parser.
It sometimes makes up data.
Huge HTML needs cleaning before you send it.
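
Cleaning can be as simple as dropping scripts, styles, and comments and keeping the visible text. The sketch below uses only Python's standard `html.parser`. The `TextOnly` class and `clean_html` function are illustrative names, and the list of skipped tags is an assumption you would tune per site.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text and skip tags that only add noise."""

    SKIP = {"script", "style", "noscript", "svg", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(" ".join(data.split()))


def clean_html(html: str) -> str:
    """Return the page's visible text, one text fragment per line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

This throws away attributes such as `href`, so if you need links or image URLs, keep those tags instead of reducing the page to text.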

My strategy:

Use traditional tools for stable sites.
Use AI for hard sites.
Validate your data.
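
One cheap way to validate is to check that every value the model returned actually appears in the page text. It will not catch everything, but it flags made-up values. This is a sketch under that assumption. `validate` and its parameters are illustrative names, and the exact-substring match is deliberately strict.

```python
def validate(record: dict, source_text: str, required: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    # Normalise whitespace and case so layout differences do not matter.
    haystack = " ".join(source_text.split()).lower()

    for name in required:
        if record.get(name) in (None, ""):
            problems.append(f"missing: {name}")

    for name, value in record.items():
        if value in (None, ""):
            continue
        needle = " ".join(str(value).split()).lower()
        # A value that is nowhere in the source was probably invented.
        if needle not in haystack:
            problems.append(f"not found in source: {name}")

    return problems
```

The strict match will also flag values the model reformatted, such as a date rewritten in another format, so treat a hit as "review this", not as proof of an error.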

Do you use AI for scraping? Or do you prefer XPath?

Source: https://dev.to/__c1b9e06dc90a7e0a676b/when-html-parsing-fails-using-llms-to-extract-messy-web-data-20ab
