How I Stopped Dumping PDFs And Started Chatting With Documentation
My team had hundreds of pages of internal guides. Nobody read them. The same questions filled our Slack channels every week.
I tried a basic search index. It failed. People asked about staging databases and received results about production credentials. Context was lost.
I spent two weekends building a RAG system. Here is what I learned from my mistakes.
My first attempt used a simple recipe: PDFs, text splitting, OpenAI embeddings, and Pinecone. It worked for one question. For everything else, it returned junk.
The problem was chunking. I used a fixed 512-token size. This split sentences and code blocks in half. The retriever found text that looked similar but made no sense to the model.
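A minimal sketch of that failure mode, assuming whitespace-separated words stand in for real tokens and using a made-up guide snippet (the text and the window size are illustrative only, not taken from the real docs):

```python
def fixed_window_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of `size` whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

# Hypothetical guide text: one sentence, one command, one warning.
doc = (
    "To reset the staging database, run the following command. "
    "pg_restore --clean --dbname staging_db latest.dump "
    "Never run this against production."
)

# A tiny window makes the problem visible on a short example.
chunks = fixed_window_chunks(doc, size=11)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

The window boundary lands in the middle of the command, so one chunk ends with half of it and the next begins with the other half. Neither chunk is useful on its own, even if the retriever ranks it highly.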
I tried larger chunks and better embedding models. This helped a little, but the model got distracted by too much text.
I eventually settled on a two-layer index (summaries and chunks) with hybrid retrieval on top:
- Document summaries: I use an LLM to create a short summary for every document.
- Logical chunks: I split documents by headings, then break long sections into 256-token chunks with a 50-token overlap.
- Hybrid retrieval: I search summaries first. Then I use a mix of dense and sparse (BM25) search.
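The logical-chunk layer can be sketched roughly as follows. This is an illustration under stated assumptions, not the production code: it assumes the guides are available as markdown with ATX (`#`) headings, counts whitespace-separated words instead of model tokens, and uses hypothetical names (`Chunk`, `split_by_headings`, `window`, `chunk_document`). The summary layer is not shown.

```python
import re
from dataclasses import dataclass

FENCE = "`" * 3  # markdown code-fence marker

@dataclass
class Chunk:
    title: str    # document title, kept so answers can cite the source
    section: str  # nearest heading above the text
    text: str

def split_by_headings(markdown: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs, ignoring '#' lines inside code fences."""
    sections: list[tuple[str, list[str]]] = [("", [])]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            sections.append((match.group(1).strip(), []))
        else:
            sections[-1][1].append(line)
    result = []
    for heading, lines in sections:
        body = "\n".join(lines).strip()
        if body:  # drop empty sections
            result.append((heading, body))
    return result

def window(tokens: list[str], size: int = 256, overlap: int = 50) -> list[list[str]]:
    """Windows of at most `size` tokens; neighbours share `overlap` tokens."""
    if len(tokens) <= size:
        return [tokens]
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens) - overlap, step)]

def chunk_document(title: str, markdown: str, size: int = 256, overlap: int = 50) -> list[Chunk]:
    chunks = []
    for heading, body in split_by_headings(markdown):
        # Keep trailing whitespace on each token so code formatting survives.
        tokens = re.findall(r"\S+\s*", body)
        for piece in window(tokens, size, overlap):
            chunks.append(Chunk(title, heading, "".join(piece).strip()))
    return chunks

# Made-up example document.
doc = """# Staging database
Reset it with the script below.

## Credentials
Ask the platform team for access.
"""
for chunk in chunk_document("Database guide", doc):
    print(chunk.section, "->", chunk.text)
```

Short sections stay whole, and only sections longer than the cap are windowed, so a chunk never straddles two headings. Each chunk also carries its title and section, which is what makes source citations possible later.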
This system now runs for my team of 20. It handles 50 questions a day. It cut repeated questions in Slack by 70%.
My main takeaways for you:
- Chunking is the hardest part. Use logical splits like markdown headings instead of fixed token windows.
- Use metadata. Store the title, section, and URL to cite your sources.
- Retrieval strategy matters more than the embedding model.
- Do not rely on vector search alone. BM25 finds keywords that embeddings miss.
- Use tools like LangChain or LlamaIndex. They handle edge cases like tables and code blocks for you.
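To illustrate the point about not relying on vector search alone, here is a small sketch of hybrid ranking. Everything in it is an assumption for illustration: the post does not say how the dense and sparse results are mixed, so this uses reciprocal rank fusion with the conventional constant k = 60. The BM25 scorer is a plain-Python version of the Okapi formula, the documents are made up, and the dense ranking is a hard-coded placeholder where a real embedding search would go.

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every tokenized doc against the query with Okapi BM25."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def rank(scores: list[float]) -> list[int]:
    """Doc indices ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge rankings; each doc earns 1 / (k + position) per list it appears in."""
    fused: Counter = Counter()
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + position)
    return sorted(fused, key=lambda d: fused[d], reverse=True)

# Made-up chunks, tokenized on whitespace.
docs = [
    "rotate production credentials every quarter".split(),
    "reset the staging database with pg_restore".split(),
    "staging environments mirror production settings".split(),
    "restore a database backup after an outage".split(),
]
query = "pg_restore staging".split()

sparse = rank(bm25_scores(query, docs))
dense = [3, 2, 1, 0]  # placeholder: pretend an embedding search returned this order
fused = reciprocal_rank_fusion([sparse, dense])
print("sparse:", sparse)
print("fused: ", fused)
```

In this toy case the pretend dense ranking prefers the generic "restore a backup" chunk, while BM25 rewards the exact keyword `pg_restore`. Fusing the two lists brings the chunk with the literal match back to the top.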
What chunking strategies work for your technical docs?