Prefix Caching at Scale
Your LLM prefill is slow. You use RAG. You send 6,000 tokens with every request. Time to first token jumps from 180 ms to 1.4 seconds.
Most of these tokens are the same. Your system prompt and documents do not change. The model computes the same state over and over. Then it throws it away.
Prefix caching stops this waste. It saves the KV cache for the start of your prompt. If the next request has the same start, the model skips the work.
Two main tools do this:
- vLLM uses blocks of 16 tokens by default. It hashes each block to find matches.
- SGLang uses a radix tree. It matches one token at a time.
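SGLang wins on small edits, because it reuses every token up to the first change. vLLM wins on long shared prefixes, where losing a partial block of up to 15 tokens hardly matters.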
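Here is a toy comparison of the two matching styles. It is a sketch, not either project's real code. The function names, the SHA-256 chaining, and the integer "tokens" are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 16  # block size named in the post


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chained hashes over full blocks only.

    Each hash covers its own block plus the hash of the block before it,
    so two equal hashes imply the whole prefix up to that block is equal.
    """
    hashes = []
    parent = b""
    full = len(tokens) - len(tokens) % block_size  # drop the partial tail
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


def block_level_match(cached, new, block_size=BLOCK_SIZE):
    """Reusable tokens when only whole hashed blocks can match."""
    matched = 0
    for a, b in zip(block_hashes(cached, block_size),
                    block_hashes(new, block_size)):
        if a != b:
            break
        matched += block_size
    return matched


def token_level_match(cached, new):
    """Reusable tokens when matching one token at a time."""
    matched = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        matched += 1
    return matched


# Hypothetical token ids: the new prompt diverges at position 40.
cached = list(range(100))
new = list(range(40)) + [999] + list(range(41, 100))

print(block_level_match(cached, new))  # 32: two full blocks of 16
print(token_level_match(cached, new))  # 40: everything before the change
```

The gap is never more than one block. It matters when prompts are short or change often. It is noise on a 6,000-token shared prefix.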
Watch your hit rate. If it is below 30%, you have a problem.
Memory pressure is a silent killer. Your GPU fills up. The system deletes old cache blocks. Your 80% savings drop to 5%.
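A small simulation shows how sharp this cliff can be. It assumes least-recently-used eviction and counts whole prefixes instead of blocks. Both are simplifications. The workload and capacities are made up.

```python
from collections import OrderedDict


def simulate_hit_rate(requests, capacity):
    """Replay prefix ids through an LRU cache that holds `capacity` prefixes."""
    cache = OrderedDict()
    hits = 0
    for prefix_id in requests:
        if prefix_id in cache:
            hits += 1
            cache.move_to_end(prefix_id)  # mark as most recently used
        else:
            cache[prefix_id] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(requests)


# Hypothetical workload: 10 distinct prefixes, requested round-robin.
requests = [i % 10 for i in range(100)]

roomy = simulate_hit_rate(requests, capacity=10)  # every prefix fits
tight = simulate_hit_rate(requests, capacity=5)   # only half of them fit

print(roomy)  # 0.9
print(tight)  # 0.0
```

Half the memory does not mean half the hits. In this round-robin pattern each prefix is evicted just before it is needed again, so the hit rate falls to zero.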
Raise your GPU memory limit to keep more prefixes warm: --gpu-memory-utilization in vLLM, --mem-fraction-static in SGLang.
Use prefix caching for:
- RAG with stable documents.
- Multi-turn chat.
- Long document QA.
Skip it for:
- Unique prompts.
- One-off analysis.
- Mid-prompt changes. Everything after the first changed token misses the cache.
Measure your hit rate first. Then enable the feature.
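One way to measure before you enable anything: replay logged prompts offline. The sketch below assumes hit rate means reused tokens divided by total prompt tokens, with 16-token blocks and a cache that never evicts. That makes it an upper bound. The function name and the sample data are hypothetical.

```python
def estimate_hit_rate(prompts, block_size=16):
    """Upper-bound prefix cache hit rate over tokenized prompts.

    A block counts as reused if the exact prefix ending at that block
    boundary was seen in an earlier prompt. Partial tail blocks never hit.
    """
    seen = set()
    total_tokens = 0
    reused_tokens = 0
    for tokens in prompts:
        total_tokens += len(tokens)
        full = len(tokens) - len(tokens) % block_size
        for end in range(block_size, full + 1, block_size):
            key = tuple(tokens[:end])  # whole prefix up to this boundary
            if key in seen:
                reused_tokens += block_size
            else:
                seen.add(key)
    return reused_tokens / total_tokens if total_tokens else 0.0


# Hypothetical log: a 64-token system prompt plus a 16-token unique question.
system_prompt = list(range(64))
prompts = [system_prompt + [1000 + i] * 16 for i in range(4)]

rate = estimate_hit_rate(prompts)
print(rate)  # 0.6: the first request misses, the next three reuse 64 of 80 tokens
```

Feed it token ids from your real tokenizer and a day of traffic. If the number is already below 30% with unlimited memory, caching will not save you.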