RAG in 8 Layers: From Tokens to Production
You ship a RAG system. A week later, it breaks. The answers are confident, the citations look real, but the conclusions are wrong. Your logs show nothing.
I hit this wall many times. I realized RAG is not one step. It is eight layers. Each layer is a place where things go wrong.
If you build an AI assistant for engineers, bad answers cost time and money. Use this framework to build systems that actually work.
Layer 1: Tokenization
Before a model reads your text, a tokenizer converts it to tokens. Tokens are small units, often sub-words. A 512-token chunk is not 512 words; it is usually fewer, because technical jargon fragments into many tokens. If you exceed the model's input limit, many pipelines silently cut the end of your text. If the fix was in the last paragraph, you lose it.
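Here is a minimal sketch of that failure. The tokenizer is a toy stand-in (it just cuts words into 4-character pieces), not a real sub-word vocabulary, and the names `toy_tokenize` and `check_budget` are made up for illustration. The point is the guard: count tokens before you send text, and look at what would be dropped.

```python
import re

def toy_tokenize(text: str, piece_len: int = 4) -> list[str]:
    """Toy stand-in for a sub-word tokenizer (NOT a real BPE vocabulary)."""
    tokens = []
    # Split into alphanumeric runs and single punctuation marks.
    for word in re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text):
        # Long words fragment into several pieces, like jargon does.
        tokens.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return tokens

def check_budget(text: str, max_tokens: int) -> dict:
    """Report what a hard token limit would keep and what it would drop."""
    tokens = toy_tokenize(text)
    return {
        "words": len(text.split()),
        "tokens": len(tokens),
        "kept": tokens[:max_tokens],
        "dropped": tokens[max_tokens:],
    }

report = check_budget("ERR_CONNECTION_RESET: restart kubelet", max_tokens=8)
if report["dropped"]:
    print(f"{report['words']} words became {report['tokens']} tokens; "
          f"dropped: {report['dropped']}")
```

In this example the error code eats the whole budget and the fix ("restart kubelet") is what gets cut. With a real tokenizer the counts differ, but the check is the same.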
Layer 2: Chunking
Bad chunks ruin everything. If you split a table in half, neither half makes sense on its own, and the model cannot answer from it.
- Use overlap so meaning stays intact.
- Use recursive splitting to keep sentences together.
- Use parent-child chunking. Index small chunks for precision, but give the LLM the large parent chunk for context.
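A rough sketch of overlap and parent-child chunking, under simplifying assumptions: chunks are measured in words rather than tokens, and the function names and sizes are hypothetical. Recursive splitting on sentence boundaries is not shown.

```python
def split_with_overlap(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Sliding window: each chunk repeats the last `overlap` words of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):  # this window already reached the end
            break
    return chunks

def build_parent_child(words, parent_size, child_size, child_overlap):
    """Return (parents, children); each child carries the id of its parent."""
    parents = split_with_overlap(words, parent_size, 0)
    children = []
    for parent_id, parent in enumerate(parents):
        for child in split_with_overlap(parent, child_size, child_overlap):
            children.append((" ".join(child), parent_id))
    return [" ".join(p) for p in parents], children

words = "a b c d e f g h i j".split()
parents, children = build_parent_child(words, parent_size=6, child_size=3, child_overlap=1)

# Search over the small children, then hand the LLM the matching parent.
child_text, parent_id = children[3]
context_for_llm = parents[parent_id]
```

You index `children` for precise matching and keep `parents` in a plain lookup. When a child matches, you pass its parent to the model.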
Layer 3: Embeddings
Embeddings turn text into vectors of numbers.
- Sparse representations (for example BM25 term weights) are great for exact keywords like error codes.
- Dense embeddings are great for meaning and synonyms.
- Use both.
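To see why the sparse side is good at exact keywords, here is a small BM25 scorer in plain Python. It is a sketch: whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant are all assumptions. The dense side needs an embedding model and is not shown.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms (like a specific error code) get a high IDF.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization keeps long documents from winning by size.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "disk failure error E1042 on node",
    "the node is slow and unresponsive",
    "error while reading config",
]
scores = bm25_scores("E1042", docs)
```

Only the first document contains the code, so only it scores above zero. A dense model might rank all three as "something about errors"; BM25 does not.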
Layer 4: Vector Indexing
Comparing a query against millions of vectors one by one takes too long. You need Approximate Nearest Neighbor (ANN) indexing. Use HNSW to trade a small amount of recall for a large speedup. Aim for sub-100ms responses.
Layer 5: Retrieval Strategy
Do not rely on one method. Use hybrid search: run BM25 and dense retrieval, then merge the two result lists. This catches both the exact error code and the general symptom.
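One common way to combine the two result lists is Reciprocal Rank Fusion (RRF). The post does not name a fusion method, so treat this as one reasonable option, not the method. The doc ids are made up, and `k=60` is a conventional default.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_err", "doc_a", "doc_b"]     # exact error-code match on top
dense_ranking = ["doc_sym", "doc_err", "doc_a"]  # symptom description on top
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

RRF uses only ranks, so you do not have to rescale BM25 scores and cosine similarities onto the same range. Documents that both retrievers like rise to the top.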
Layer 6: Reranking
This is often the biggest quality jump. Retrieve about 20 candidates with a fast first-stage retriever. Then use a cross-encoder to score each query-document pair precisely and keep the best few. This turns a "maybe" into a "correct" answer.
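The two-stage shape looks roughly like this. The scorer is pluggable: in a real system `score_pair` would call a cross-encoder model, while `toy_pair_score` below is only a word-overlap stand-in so the sketch runs without a model. Both names are hypothetical.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    """Score every (query, document) pair and keep the top_n documents."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

def toy_pair_score(query: str, doc: str) -> float:
    """Stand-in scorer (Jaccard word overlap). NOT a cross-encoder."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

# Pretend these came back from the fast first-stage retriever.
candidates = [
    "restart the service after timeout",
    "timeout error on login service",
    "billing page layout",
]
best = rerank("login timeout error", candidates, toy_pair_score, top_n=2)
```

The expensive scorer only sees the short candidate list, never the whole corpus. That is what makes a slow, precise model affordable here.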
Layer 7: Query Rewriting
Users ask messy questions.
- Multi-query: Generate several versions of the question to find more matches.
- HyDE: Generate a hypothetical answer first, then search for documents that look like that answer.
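A sketch of the multi-query half. Generating the rewrites is the LLM's job and is not shown; the `variants` list is written by hand here. `toy_search` and `CORPUS` are hypothetical stand-ins for your real retriever and index.

```python
CORPUS = {
    "d1": "pod keeps restarting crashloop",
    "d2": "container exits with OOMKilled",
    "d3": "how to rotate tls certificates",
}

def toy_search(query: str) -> list[str]:
    """Stand-in retriever: rank documents by number of shared words."""
    q = set(query.lower().split())
    hits = [(len(q & set(text.lower().split())), doc_id) for doc_id, text in CORPUS.items()]
    return [doc_id for overlap, doc_id in sorted(hits, key=lambda h: -h[0]) if overlap > 0]

def multi_query_search(variants: list[str], search, k: int = 3) -> list[str]:
    """Run every rewrite through the retriever and merge results without duplicates."""
    merged = {}
    for variant in variants:
        for doc_id in search(variant)[:k]:
            merged.setdefault(doc_id, None)  # dict keeps first-seen order
    return list(merged)

# In practice an LLM would produce these rewrites of the user's question.
variants = ["pod keeps restarting", "container exits repeatedly"]
results = multi_query_search(variants, toy_search)
```

Neither phrasing alone finds both documents. Together they do, which is the whole point of multi-query.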
Layer 8: Evaluation
Do not wait for a failure in production to test. Use RAGAS to measure:
- Faithfulness: Is the answer supported by the retrieved context?
- Relevancy: Does it answer the question?
- Recall: Did you find the right data?
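RAGAS computes its metrics with an LLM judge, so its API is not reproduced here. As a simpler starting point for the recall question, here is a plain recall@k check against a hand-labeled set of relevant documents. The doc ids and labels are made up for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

retrieved = ["d3", "d1", "d7", "d2"]  # what the retriever returned, in order
relevant = {"d1", "d2", "d9"}         # what a human marked as correct

recall_3 = recall_at_k(retrieved, relevant, k=3)
recall_4 = recall_at_k(retrieved, relevant, k=4)
```

Run this over a small fixed set of questions on every change. If recall drops, the problem is in retrieval, and no prompt tweak will fix it.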
If your RAG is failing, check your chunking, your hybrid search, and your evaluation first.
Source: https://dev.to/aashna_mahajan/rag-in-8-layers-from-tokens-to-production-39kf