𝗜 𝗦𝗽𝗲𝗻𝘁 $𝟱𝟬𝟬 𝗼𝗻 𝗥𝗔𝗚 𝗜𝗻𝗳𝗿𝗮𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲 𝗕𝗲𝗳𝗼𝗿𝗲 𝗙𝗶𝘅𝗶𝗻𝗴 𝗧𝗵𝗲𝘀𝗲 𝟳 𝗠𝗶𝘀𝘁𝗮𝗸𝗲𝘀
I built a RAG pipeline for private document search. It cost me $500 in compute and weeks of debugging. The results were bad. Users got wrong answers and queries were slow.
I audited the pipeline. I found 7 mistakes. Fixing them changed everything.
Fixed Token Chunking: I split documents into fixed 512-token chunks. This destroyed context. An API explanation would be cut mid-sentence. The LLM received fragments and gave garbage answers. The fix: use semantic chunking.
- Split by natural boundaries like paragraphs or headers.
- Use parent-document retrieval.
- Create small child chunks for search.
- Return the full parent document to the LLM.
- Add 10-20% overlap between chunks.
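The chunking steps above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: it assumes paragraphs are separated by blank lines, counts whitespace-separated words instead of real tokens, and the names `split_paragraphs`, `child_chunks`, and `build_index` are hypothetical.

```python
import re


def split_paragraphs(text):
    """Split on blank lines (assumed paragraph boundary), dropping empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def child_chunks(parent, size=40, overlap=6):
    """Slide a window of `size` words over the parent, stepping by size - overlap.

    With the defaults, overlap is 6/40 = 15% of each chunk.
    Words stand in for tokens here; a real tokenizer would replace str.split().
    """
    words = parent.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the parent
    return chunks


def build_index(text, size=40, overlap=6):
    """Return (child_text, parent_id) pairs plus a parent_id -> parent_text table."""
    parents = dict(enumerate(split_paragraphs(text)))
    children = [
        (chunk, pid)
        for pid, parent in parents.items()
        for chunk in child_chunks(parent, size, overlap)
    ]
    return children, parents
```

At query time, the search runs over the child texts, and the matching `parent_id` is used to look up the full parent in `parents` before it is sent to the LLM.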
Bad Hybrid Search Weights: I used a 50/50 split between vector and keyword search. This failed for technical docs. Technical users need exact keyword matches. The fix: use dynamic weights based on the query type.
- Factual queries: 35% vector, 65% keyword.
- Semantic queries: 75% vector, 25% keyword.
- General queries: 60% vector, 40% keyword.
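Here is one way the weighting above could look in code. The weights come from the list above; the query classifier is a deliberately crude, hypothetical heuristic (a real system might use a small model), and both scores are assumed to be normalized to the range 0 to 1 before blending.

```python
# (vector_weight, keyword_weight) per query type, taken from the list above.
WEIGHTS = {
    "factual": (0.35, 0.65),
    "semantic": (0.75, 0.25),
    "general": (0.60, 0.40),
}


def classify_query(query):
    """Hypothetical heuristic classifier, for illustration only.

    Queries with digits, quotes, or identifier-like characters are treated
    as factual lookups; queries starting with why/how/explain as semantic.
    """
    if any(ch.isdigit() or ch in '_./:"' for ch in query):
        return "factual"
    tokens = query.lower().split()
    if tokens and tokens[0] in {"why", "how", "explain"}:
        return "semantic"
    return "general"


def hybrid_score(vector_score, keyword_score, query_type):
    """Blend two scores (assumed normalized to [0, 1]) with per-type weights."""
    w_vec, w_kw = WEIGHTS[query_type]
    return w_vec * vector_score + w_kw * keyword_score
```

Keeping the weights in one table makes them easy to tune against a retrieval test set instead of guessing.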
Over-tuning HNSW Parameters: I set ef_construction to the maximum. Index building used all available RAM and crashed my server. The fix: use moderate parameters.
- Set M between 8 and 32.
- Set ef_construction to 200.
- Set ef_search to 50.
- This cut memory usage by 70%.
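The settings above can be pinned down in a small config object so nobody maxes them out by accident. This is a sketch under assumptions: `HnswConfig` and `estimate_index_bytes` are hypothetical names, the default `m=16` is just a value inside the 8 to 32 range, and the memory formula is a common rule of thumb for the stored index (float32 vectors plus base-layer links) that does not account for build-time overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HnswConfig:
    """Hypothetical container for the HNSW settings listed above."""
    m: int = 16                 # assumed default, inside the 8-32 range
    ef_construction: int = 200
    ef_search: int = 50

    def __post_init__(self):
        if not 8 <= self.m <= 32:
            raise ValueError("M should be between 8 and 32")
        if self.ef_construction < 1 or self.ef_search < 1:
            raise ValueError("ef values must be positive")


def estimate_index_bytes(num_vectors, dim, m):
    """Rough rule of thumb: 4 bytes per float32 dimension plus
    2 * M four-byte neighbor links per vector on the base layer."""
    return num_vectors * (dim * 4 + m * 2 * 4)
```

Running the estimate before building the index gives an early warning when a parameter choice will not fit in RAM.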
General Embedding Models: I used a model trained on Wikipedia. My documents were technical engineering runbooks. The model did not understand my domain. The fix: use a model fine-tuned for technical or code content.
No Query Rewriting: Users ask natural questions. Technical docs use formal terms. They do not match. The fix: add a lightweight LLM step to rewrite queries.
- User asks: "why is my build slow"
- System rewrites to: "CI pipeline performance optimization"
- This improved recall by 40%.
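A rewrite step like the one above can be a single prompt and a thin wrapper. In this sketch the LLM is any callable that takes a prompt string and returns text; the prompt wording, the `rewrite_query` name, and the choice to keep the original question as a second search query are all assumptions for illustration.

```python
from typing import Callable, List

# Hypothetical prompt; the exact wording would need tuning per corpus.
REWRITE_PROMPT = (
    "Rewrite the user question as a short search query using the formal "
    "terminology found in technical documentation. Return only the query.\n"
    "Question: {question}\nQuery:"
)


def rewrite_query(question: str, llm: Callable[[str], str]) -> List[str]:
    """Return the queries to search with: the rewrite first, then the original.

    Falls back to the original alone if the LLM returns nothing useful.
    """
    rewritten = llm(REWRITE_PROMPT.format(question=question)).strip()
    if not rewritten or rewritten.lower() == question.lower():
        return [question]
    return [rewritten, question]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return "CI pipeline performance optimization"


queries = rewrite_query("why is my build slow", fake_llm)
```

Searching with both the rewritten and the original query is a hedge: if the rewrite drifts from what the user meant, the original wording can still match.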
Redundant Results: Retrieving the top 10 chunks often returned the same paragraph three times. The LLM repeated itself. The fix: use Maximal Marginal Relevance (MMR) to ensure diversity in results.
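For reference, here is a minimal MMR sketch in plain Python. It assumes embeddings are plain lists of floats and uses cosine similarity for both relevance and redundancy; `lambda_` trades relevance (1.0) against diversity (0.0), and the default of 0.5 is an arbitrary starting point, not a tuned value.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, doc_vecs, k=5, lambda_=0.5):
    """Greedily pick up to k document indices by Maximal Marginal Relevance.

    Each step picks the candidate maximizing
        lambda_ * sim(query, doc) - (1 - lambda_) * max sim(doc, already selected).
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected = []

    def score(i):
        redundancy = max(
            (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
        )
        return lambda_ * relevance[i] - (1 - lambda_) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

In practice this runs over a larger candidate pool (for example the top 30 by similarity) and keeps the top 10 after re-ranking, so duplicates of an already selected chunk get pushed down.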
Testing the Wrong Thing: I tested the whole pipeline at once. I could not tell whether a bad answer came from retrieval or from the LLM. The fix: evaluate retrieval separately.
- Track hit rate.
- Track Mean Reciprocal Rank (MRR).
- Build a test set of 100 query-document pairs.
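Both metrics above fit in one short function. This sketch assumes each test query has exactly one expected document id and that the retriever output is a ranked list of ids per query; the function name and data shapes are hypothetical.

```python
def hit_rate_and_mrr(results, expected, k=10):
    """Compute hit rate@k and Mean Reciprocal Rank over a test set.

    results:  dict of query -> ranked list of retrieved document ids
    expected: dict of query -> the single document id that should be retrieved
    """
    hits = 0
    reciprocal_rank_sum = 0.0
    for query, doc_id in expected.items():
        ranked = results.get(query, [])[:k]
        if doc_id in ranked:
            hits += 1
            # Rank is 1-based: a hit in first position contributes 1.0.
            reciprocal_rank_sum += 1.0 / (ranked.index(doc_id) + 1)
    n = len(expected)
    if n == 0:
        return 0.0, 0.0
    return hits / n, reciprocal_rank_sum / n
```

Because this only looks at retrieved ids, it can run without calling the LLM at all, which keeps retrieval failures separate from generation failures.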
Results after the fixes:
- Answer relevance: 45% to 85%
- Latency: 3.2s to 1.8s
- Monthly cost: $180 to $95
Fix chunking first. Then weights. Then embedding quality.