𝗜 𝗕𝘂𝗶𝗹𝘁 𝗮 𝗖𝗼𝗱𝗲 𝗤&𝗔 𝗕𝗼𝘁 𝗪𝗶𝘁𝗵 𝗥𝗔𝗚: 𝗪𝗵𝗮𝘁 𝗪𝗼𝗿𝗸𝗲𝗱 𝗮𝗻𝗱 𝗪𝗵𝗮𝘁 𝗙𝗮𝗶𝗹𝗲𝗱
Our developers spent days searching through Slack and old docs to understand our microservices. I decided to build a chatbot to answer these questions using RAG.
I made many mistakes along the way. Here is what I learned.
𝗧𝗵𝗲 𝗙𝗮𝗶𝗹𝘂𝗿𝗲𝘀
- I tried putting all docs into one prompt. It hit token limits, caused hallucinations, and cost too much.
- I used a basic TF-IDF index. It failed when users used synonyms or different terms.
- I tried simple 500-character chunks. The results were random because chunks often cut off mid-sentence.
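To make the synonym failure concrete, here is a minimal sketch of TF-IDF scoring in plain Python. It is not the index I actually ran. The sample docs, the queries, and the helper names (build_idf, vectorize, cosine) are all made up for illustration. The point is only that a query sharing no words with a doc scores exactly zero, even when it means the same thing.

```python
import math
from collections import Counter

def build_idf(docs):
    """Compute a smoothed inverse document frequency for every term in docs."""
    n = len(docs)
    df = Counter(term for d in docs for term in set(d.lower().split()))
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

def vectorize(text, idf):
    """Map text to a sparse TF-IDF vector; terms outside the vocabulary are dropped."""
    counts = Counter(text.lower().split())
    return {t: tf * idf[t] for t, tf in counts.items() if t in idf}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical mini corpus, not our real docs.
docs = ["authentication uses oauth tokens", "deploys run through the ci pipeline"]
idf = build_idf(docs)
auth_doc = vectorize(docs[0], idf)

exact_score = cosine(vectorize("oauth tokens", idf), auth_doc)
synonym_score = cosine(vectorize("how do users sign in", idf), auth_doc)
print(exact_score > 0, synonym_score)
```

The second query is about the same topic but shares no terms with the doc, so lexical scoring cannot find it. Embeddings are meant to close that gap.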
𝗧𝗵𝗲 𝗦𝗼𝗹𝘂𝘁𝗶𝗼𝗻
I stopped treating the LLM as a search engine. I turned it into a reading engine for a dedicated search index.
Here is the pipeline that worked:
- Chunk docs into 300-token pieces with a 50-token overlap.
- Embed each chunk into a vector.
- Store vectors in a similarity search index.
- At query time, find the top 5 most similar chunks.
- Feed only those chunks into the LLM to generate an answer.
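The steps above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not my production code. In particular, toy_embed is a hypothetical stand-in (a hashed bag of words) for a real embedding model, the linear scan stands in for a real similarity index, and the prompt wording is invented.

```python
import math
import zlib

def chunk_tokens(tokens, size=300, overlap=50):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end
    return chunks

def toy_embed(text, dim=256):
    """ASSUMPTION: placeholder for a real embedding model (hashed bag of words)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def top_k(query, chunks, k=5):
    """Return the k chunks whose vectors have the highest cosine similarity to the query."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, toy_embed(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]

def build_prompt(question, context_chunks):
    """Assemble an LLM prompt that contains only the retrieved chunks."""
    context = "\n---\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical chunks, already split and stored as text.
chunks = [
    "the billing service retries failed payments",
    "auth service issues jwt tokens",
    "deploys run through the ci pipeline",
]
question = "which service issues jwt tokens"
prompt = build_prompt(question, top_k(question, chunks, k=1))
```

With size=300 and overlap=50, each window starts 250 tokens after the previous one, so a 700-token doc becomes three chunks. In a real setup you would embed each chunk once at index time rather than on every query.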
This change reduced hallucinations by 80% and cut costs to under $0.01 per query.
𝗞𝗲𝘆 𝗧𝗮𝗸𝗲𝗮𝘄𝗮𝘆𝘀
- Chunk size is vital. For our docs, 150 tokens gave too little context and 1000 tokens added too much noise. 300 tokens was the sweet spot.
- Overlap is mandatory. It keeps text that straddles a chunk boundary together in at least one chunk, so context is not lost between chunks.
- Use small models for speed. A small embedding model worked well for our internal needs.
- Test your retrieval. Do not rely on manual checks. Build a test set to measure accuracy.
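A retrieval test can be very small. Below is a minimal sketch that assumes each test question is labeled with the ID of the chunk that should answer it, and that you have some retrieve(question, k) function returning chunk IDs. Both the labels and the fake retriever here are hypothetical.

```python
def hit_rate_at_k(test_set, retrieve, k=5):
    """Fraction of questions whose expected chunk ID appears in the top-k results."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    hits = sum(1 for question, expected_id in test_set
               if expected_id in retrieve(question, k))
    return hits / len(test_set)

# Hypothetical canned results standing in for a real retriever.
fake_results = {
    "how do I rotate keys": ["c7", "c2"],
    "where are logs": ["c1"],
}

def fake_retrieve(question, k):
    return fake_results[question][:k]

test_set = [("how do I rotate keys", "c2"), ("where are logs", "c9")]
print(hit_rate_at_k(test_set, fake_retrieve, k=5))
```

Rerunning a check like this after every change to chunk size, overlap, or embedding model tells you whether retrieval got better or worse, without reading answers by hand.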
RAG is not magic. It is an engineering puzzle. If your chunks are bad, your retrieval is bad. If your retrieval is bad, your answers are bad.
The bot now answers 80% of onboarding questions correctly. This is much faster than waiting for a human to reply on Slack.
How do you build AI assistants for your documentation?