𝗣𝗿𝗲𝗳𝗶𝘅 𝗖𝗮𝗰𝗵𝗶𝗻𝗴 𝗔𝘁 𝗦𝗰𝗮𝗹𝗲

📅1 week ago⏱1 min read

Your LLM prefill is slow. You use RAG. You send 6,000 tokens every time. Time to first token jumps from 180ms to 1.4 seconds.

Most of these tokens are the same. Your system prompt and documents do not change. The model computes the same state over and over. Then it throws it away.

Prefix caching stops this waste. It saves the KV cache for the start of your prompt. If the next request has the same start, the model skips the work.

Two main tools do this:

vLLM uses blocks of 16 tokens. It uses hashes to find matches.
SGLang uses a radix tree. It matches one token at a time.

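SGLang wins when prompts diverge after short or irregular shared parts, because it can reuse the cache right up to the first token that differs. vLLM only reuses full blocks, so it loses the partial block at the point of divergence. That loss matters little on long shared parts, where vLLM wins.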
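To make the granularity difference concrete, here is a minimal sketch. It is not how either project is implemented. The helper names (block_hashes, block_match, token_match) are made up for illustration, and the chained SHA-256 hash is an assumption standing in for whatever hash a real engine uses.

```python
from hashlib import sha256

BLOCK = 16  # block size quoted in the text; treat it as a tunable


def block_hashes(tokens, block=BLOCK):
    """Chained hash per FULL block: each hash covers its block and all earlier ones."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % block  # a trailing partial block is ignored
    for i in range(0, full, block):
        chunk = ",".join(map(str, tokens[i:i + block])).encode()
        prev = sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes


def block_match(cached, new, block=BLOCK):
    """Reusable tokens when matching is done one whole block at a time."""
    matched = 0
    for a, b in zip(block_hashes(cached, block), block_hashes(new, block)):
        if a != b:
            break
        matched += block
    return matched


def token_match(cached, new):
    """Reusable tokens when matching is done one token at a time
    (the result a radix-tree lookup would give for a single cached sequence)."""
    matched = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        matched += 1
    return matched


cached = list(range(100))
new = list(range(40)) + [999] + list(range(41, 100))  # diverges at position 40

print(block_match(cached, new), token_match(cached, new))  # 32 40
```

The prompt diverges at token 40. Block matching keeps two full blocks (32 tokens). Token matching keeps all 40. The gap is never more than one block, so it matters most when shared prefixes are short.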

Watch your hit rate: the share of prompt tokens served from cache instead of recomputed. If it is below 30%, you have a problem.
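One way to compute that number from your own request log is sketched below. It assumes a token-level definition of hit rate and that you can record, per request, how many prompt tokens came from cache. The log format and the prefix_hit_rate helper are hypothetical.

```python
def prefix_hit_rate(requests):
    """requests: list of (prompt_tokens, cached_tokens) pairs, one per request."""
    total = sum(prompt for prompt, _ in requests)
    cached = sum(hit for _, hit in requests)
    return cached / total if total else 0.0


# Hypothetical log: four 6,000-token requests, one of which missed entirely.
log = [(6000, 5600), (6000, 5600), (6000, 0), (6000, 5600)]

rate = prefix_hit_rate(log)
print(f"hit rate: {rate:.0%}")  # hit rate: 70%
if rate < 0.30:
    print("below the 30% threshold")
```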

Memory pressure is a silent killer. Your GPU fills up. The system evicts old cache blocks to make room. The next request for that prefix recomputes everything. Your 80% savings drop to 5%.
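The cliff is easy to reproduce in a toy simulation. The sketch below assumes plain LRU eviction and treats each prefix as one cache entry. Real engines evict at block or tree-node granularity and may use other policies, so read it as an illustration of the failure mode, not a model of any specific server.

```python
from collections import OrderedDict


def simulate(requests, capacity):
    """Hit rate of an LRU cache that can hold `capacity` whole prefixes."""
    cache, hits = OrderedDict(), 0
    for prefix_id in requests:
        if prefix_id in cache:
            hits += 1
            cache.move_to_end(prefix_id)   # mark as most recently used
        else:
            cache[prefix_id] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(requests)


# 1,000 requests cycling through 10 distinct prefixes.
requests = [i % 10 for i in range(1000)]

print(simulate(requests, capacity=10))  # 0.99 (everything stays warm)
print(simulate(requests, capacity=8))   # 0.0  (each prefix is evicted just before reuse)
```

Losing room for two prefixes out of ten does not cost 20% of the hits here. It costs all of them, because the access pattern keeps evicting the entry that is needed next.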

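Increase the share of GPU memory your server may use for the KV cache to keep more prefixes warm. In vLLM this is the --gpu-memory-utilization setting. In SGLang it is --mem-fraction-static.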
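A rough estimate shows how much headroom a given budget buys. The model shape below (32 layers, 8 KV heads, head dimension 128, 16-bit cache) and the 20 and 30 GiB budgets are illustrative assumptions, not figures from the article. The estimate ignores block rounding and the memory that in-flight requests need.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # Factor 2: one key vector and one value vector per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def prefixes_that_fit(cache_gib, prefix_tokens, per_token_bytes):
    """How many whole prefixes of `prefix_tokens` fit in `cache_gib` GiB."""
    return int(cache_gib * 2**30) // (prefix_tokens * per_token_bytes)


# Assumed model shape, chosen only to make the arithmetic concrete.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

for gib in (20, 30):
    print(gib, prefixes_that_fit(gib, 6000, per_token))
# 20 27
# 30 40
```

Under these assumptions, each 6,000-token prefix costs about 750 MiB of cache. Compare the number that fit against the number of distinct prefixes you actually serve.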

Use prefix caching for:

RAG with stable documents.
Multi-turn chat.
Long document QA.

Skip it for:

Unique prompts.
One-off analysis.
Mid-prompt changes. Everything after the first changed token is recomputed.

Measure your hit rate first. Then enable the feature.

Source: https://dev.to/tech_nuggets/prefix-caching-at-scale-when-it-saves-you-80-of-prefill-cost-and-the-eviction-policies-that-5e8

