Async Scraping Is Better for RAG Ingestion

RAG systems often fail because of stale data. The page changes but your index stays the same. Your AI then gives wrong answers with high confidence.

Many people try to fix this with simple synchronous scrapers. You fetch a page, extract data, and update your vector store. This approach creates problems in production.

The main issues with synchronous scraping:

Page loads take a long time due to JavaScript or cookie banners.
Your API waits for the scraper to finish, which slows down your users.
You run out of memory or open sockets when running tasks in parallel.
Errors like timeouts or rate limits are hard to manage.

Async scraping uses a submit, poll, and retrieve flow. You submit a task, get a job ID, and check for the result later. This keeps your application fast.
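The submit, poll, and retrieve flow can be sketched in a few lines. This is a minimal illustration, not a production design: `ScrapeJobs`, its method names, and `fake_fetch` are hypothetical, and the in-memory dictionary stands in for whatever durable queue or scraping service you actually use.

```python
import time
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Optional


class ScrapeJobs:
    """In-memory submit/poll/retrieve queue (hypothetical; jobs are lost on restart)."""

    def __init__(self, fetch: Callable[[str], str], workers: int = 4) -> None:
        self._fetch = fetch
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs: Dict[str, Future] = {}

    def submit(self, url: str) -> str:
        """Queue the scrape and return a job ID without waiting for the page."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(self._fetch, url)
        return job_id

    def poll(self, job_id: str) -> str:
        """Report 'pending', 'done', or 'failed' without blocking."""
        future = self._jobs[job_id]
        if not future.done():
            return "pending"
        return "failed" if future.exception() is not None else "done"

    def retrieve(self, job_id: str) -> Optional[str]:
        """Return the scraped content once the job has finished successfully."""
        future = self._jobs[job_id]
        if future.done() and future.exception() is None:
            return future.result()
        return None


def fake_fetch(url: str) -> str:
    """Stand-in for a slow browser-based scrape (assumption for the example)."""
    time.sleep(0.05)
    return f"<html>content of {url}</html>"
```

The caller gets a job ID back right away from `submit`, so a request handler can return immediately and a separate worker or scheduler can call `poll` and `retrieve` later.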

How to build a reliable ingestion pipeline:

Separate scraping from request handling. Your app should not wait for a browser to load.
Store job states in a database. Track the URL, status, and errors.
Use content hashes. If the page content has not changed, do not re-embed it. This saves money and time.
Use dead-letter queues. If a job fails three times, stop retrying. Move it to a visible list so you can fix it.
Validate your data. Use a schema to check the extracted data before it reaches your vector store. An empty string is worse than a failed job, because a failed job is visible while an empty record silently enters the index.
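The job-state, content-hash, dead-letter, and validation steps above can be combined in one small worker function. The sketch below assumes SQLite for the job table and SHA-256 for the content hash; the table layout, the status names, and the `fetch` and `embed` callables are all hypothetical choices for illustration, and the validation is only an emptiness check rather than a full schema.

```python
import hashlib
import sqlite3
from typing import Callable

MAX_ATTEMPTS = 3  # from the text: stop retrying after three failures

# Hypothetical job table: one row per URL with status, errors, and last hash.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    url TEXT PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'pending',
    attempts INTEGER NOT NULL DEFAULT 0,
    last_error TEXT,
    content_hash TEXT
)
"""


def validate(text: str) -> str:
    """Minimal check: reject empty or whitespace-only extractions."""
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("extracted text is empty")
    return cleaned


def process(
    db: sqlite3.Connection,
    url: str,
    fetch: Callable[[str], str],
    embed: Callable[[str, str], None],
) -> str:
    """Run one scrape attempt for `url` and return the resulting job status."""
    db.execute("INSERT OR IGNORE INTO jobs (url) VALUES (?)", (url,))
    status, attempts, old_hash = db.execute(
        "SELECT status, attempts, content_hash FROM jobs WHERE url = ?", (url,)
    ).fetchone()
    if status == "dead_letter":
        return status  # parked for a human to inspect; do not retry

    try:
        text = validate(fetch(url))
    except Exception as exc:
        attempts += 1
        status = "dead_letter" if attempts >= MAX_ATTEMPTS else "retry"
        db.execute(
            "UPDATE jobs SET status = ?, attempts = ?, last_error = ? WHERE url = ?",
            (status, attempts, str(exc), url),
        )
        db.commit()
        return status

    new_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if new_hash == old_hash:
        status = "unchanged"  # same content: skip the embedding call
    else:
        embed(url, text)
        status = "embedded"
    db.execute(
        "UPDATE jobs SET status = ?, attempts = 0, last_error = NULL, "
        "content_hash = ? WHERE url = ?",
        (status, new_hash, url),
    )
    db.commit()
    return status
```

Rows with status `dead_letter` form the visible list: a simple `SELECT url, last_error FROM jobs WHERE status = 'dead_letter'` shows what needs attention.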

Async scraping works best for background updates and scheduled refreshes. It is not for real-time needs where a user waits for a fresh page.

If a user needs data immediately, show them cached content and update the index in the background.
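One way to serve cached content while refreshing in the background is a stale-while-revalidate style cache. The sketch below is an assumption about how that could look: `CachedPages`, `max_age_s`, and the `refresh` callable (which would submit the scrape and update the index) are hypothetical names, and a plain thread stands in for a real job queue.

```python
import threading
import time
from typing import Callable, Dict, Optional, Set, Tuple


class CachedPages:
    """Serve the cached copy at once and refresh stale entries off the request path."""

    def __init__(self, refresh: Callable[[str], str], max_age_s: float = 3600.0) -> None:
        self._refresh = refresh  # hypothetical: scrapes the page and updates the index
        self._max_age_s = max_age_s
        self._cache: Dict[str, Tuple[str, float]] = {}
        self._in_flight: Set[str] = set()
        self._lock = threading.Lock()

    def get(self, url: str) -> Optional[str]:
        """Return whatever is cached (possibly None); never wait for a scrape."""
        with self._lock:
            entry = self._cache.get(url)
            stale = entry is None or time.monotonic() - entry[1] > self._max_age_s
            if stale and url not in self._in_flight:
                # Only one background refresh per URL at a time.
                self._in_flight.add(url)
                threading.Thread(target=self._update, args=(url,), daemon=True).start()
        return entry[0] if entry else None

    def _update(self, url: str) -> None:
        try:
            content = self._refresh(url)
            with self._lock:
                self._cache[url] = (content, time.monotonic())
        except Exception:
            pass  # keep serving the old copy; a real system would record the failure
        finally:
            with self._lock:
                self._in_flight.discard(url)
```

The first `get` for an unseen URL returns `None`, so the caller needs a fallback for a cold cache; later calls return the cached copy while any refresh runs in the background.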

Source: https://dev.to/anakin_writers/async-scraping-jobs-are-usually-a-better-fit-for-rag-ingestion-than-blocking-requests-12k1
