𝗛𝗼𝘄 𝗜 𝗦𝘁𝗼𝗽𝗽𝗲𝗱 𝗗𝘂𝗺𝗽𝗶𝗻𝗴 𝗣𝗗𝗙𝘀 𝗔𝗻𝗱 𝗦𝘁𝗮𝗿𝘁𝗲𝗱 𝗖𝗵𝗮𝘁𝘁𝗶𝗻𝗴 𝗪𝗶𝘁𝗵 𝗗𝗼𝗰𝘂𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻

📅4 hours ago⏱2 min read

My team had hundreds of pages of internal guides. Nobody read them. The same questions filled our Slack channels every week.

I tried a basic search index. It failed. People asked about staging databases and received results about production credentials. Context was lost.

I spent two weekends building a RAG system. Here is what I learned from my mistakes.

My first attempt used a simple recipe: PDFs, text splitting, OpenAI embeddings, and Pinecone. It worked for one question. For everything else, it returned junk.

The problem was chunking. I used a fixed 512-token size. This split sentences and code blocks in half. The retriever found text that looked similar but made no sense to the model.
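A minimal sketch of that failure mode. It assumes whitespace-separated words as a stand-in for model tokens and a window of 10 instead of 512 so the cut is easy to see. The `fixed_chunks` helper and the sample sentence are made up for illustration.

```python
def fixed_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of `size` whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


# Hypothetical doc sentence with an inline command.
doc = (
    "To reset the staging database, run the command "
    "pg_restore --clean --dbname staging backup.dump "
    "and then restart the API service."
)

# The window boundary lands in the middle of the command, so neither
# chunk contains the full instruction.
for chunk in fixed_chunks(doc, size=10):
    print(repr(chunk))
```

The first chunk ends with half of the command and the second starts with its flags. Each half can still match a query about staging, but neither is useful on its own.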

I tried larger chunks and better embedding models. This helped a little, but the model got distracted by too much text.

I eventually settled on a two-layer index with hybrid retrieval on top:

Document summaries: I use an LLM to create a short summary for every document.
Logical chunks: I split documents by headings, then cut each section into 256-token chunks with a 50-token overlap.
Hybrid retrieval: I search the summaries first. Then I run a mix of dense and sparse (BM25) search over the chunks.
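The logical-chunk layer could be sketched roughly like this. It is not my production code. It assumes markdown input, counts whitespace-separated words instead of real model tokens, and the `Chunk`, `split_by_headings`, `window`, and `chunk_document` names are invented for the example.

```python
import re
from dataclasses import dataclass

FENCE = "`" * 3  # marker that opens or closes a markdown code block


@dataclass
class Chunk:
    title: str    # document title, kept so answers can cite it
    section: str  # nearest heading above the chunk
    text: str


def split_by_headings(markdown: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs; '#' lines inside code fences are not headings."""
    sections, heading, lines, in_fence = [], "", [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else re.match(r"#{1,6}\s+(.*)", line)
        if match:
            if lines or heading:
                sections.append((heading, "\n".join(lines).strip()))
            heading, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    if lines or heading:
        sections.append((heading, "\n".join(lines).strip()))
    return sections


def window(tokens: list[str], size: int = 256, overlap: int = 50) -> list[list[str]]:
    """Sliding windows of `size` tokens; neighbours share `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step, windows = size - overlap, []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return windows


def chunk_document(title: str, markdown: str, size: int = 256, overlap: int = 50) -> list[Chunk]:
    """Split by heading first, then window each section so no chunk crosses a heading."""
    chunks = []
    for heading, body in split_by_headings(markdown):
        tokens = re.findall(r"\S+\s*", body)  # keep trailing whitespace so newlines survive
        for piece in window(tokens, size, overlap):
            chunks.append(Chunk(title, heading, "".join(piece).strip()))
    return chunks
```

Sections shorter than the window stay in one piece, so a short how-to with its code block is never cut. Only long sections fall back to the overlapping windows.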

This system now runs for my team of 20. It handles 50 questions a day and has cut repeated questions in Slack by 70%.

My main takeaways for you:

Chunking is the hardest part. Use logical splits like markdown headings instead of fixed token windows.
Use metadata. Store the title, section, and URL to cite your sources.
Retrieval strategy matters more than the embedding model.
Do not rely on vector search alone. BM25 finds keywords that embeddings miss.
Use tools like LangChain or LlamaIndex. They handle edge cases like tables and code blocks for you.
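Here is a small sketch of the hybrid idea from the takeaways above. It scores chunks with a plain Okapi BM25 function and merges that ranking with a dense ranking using reciprocal rank fusion. The fusion method is an assumption for the example, one common way to mix the two, not necessarily the best one for your docs. The sample chunks and the hard-coded dense ranking are hypothetical. In a real system the dense ranking would come from your vector store.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of every tokenized doc for a tokenized query."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(scores: list[float]) -> list[int]:
    """Doc indices ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge rankings: each doc earns 1 / (k + position) per ranking it appears in."""
    fused: dict[int, float] = {}
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + position)
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical chunks, tokenized by lowercasing and splitting on whitespace.
chunks = [
    "reset the staging database with pg_restore",
    "production database credentials live in the team vault",
    "request vacation days in the hr portal",
    "staging deploys run from the release branch",
]
docs = [c.lower().split() for c in chunks]

sparse_ranking = rank(bm25_scores("staging database".split(), docs))
dense_ranking = [1, 0, 3, 2]  # assumed vector-store result, not computed here

for doc_id in reciprocal_rank_fusion([sparse_ranking, dense_ranking]):
    print(chunks[doc_id])
```

In this toy case the dense ranking puts the production chunk first, but the keyword match pulls the staging chunk back to the top of the fused list.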

What chunking strategies work for your technical docs?

Source: https://dev.to/__c1b9e06dc90a7e0a676b/how-i-stopped-dumping-pdfs-and-started-chatting-with-my-documentation-2c8j
