𝗣𝗮𝗴𝗲𝗱𝗔𝘁𝘁𝗲𝗻𝘁𝗶𝗼𝗻 𝘃𝘀 𝗧𝗿𝗮𝗱𝗶𝘁𝗶𝗼𝗻𝗮𝗹 𝗞𝗩 𝗖𝗮𝗰𝗵𝗲


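Every token you generate adds key and value tensors to the KV cache, and that cache lives in GPU memory. Traditional KV caching wastes much of this memory. vLLM fixed this with PagedAttention, an idea borrowed from virtual memory paging in operating systems.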

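Older serving systems reserve one large contiguous block of memory for every request, sized for the maximum sequence length. You reserve space for 2048 tokens even if the request only uses 50. The unused space is internal fragmentation: it is allocated but holds nothing. You fit fewer requests in each batch, and the GPU's compute sits underused.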
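To put numbers on the 2048-versus-50 example, here is a rough calculation. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) is an assumption chosen for illustration, roughly a 7B-class model. It is not taken from the source.

```python
# Hypothetical model shape, assumed for illustration only.
NUM_LAYERS = 32
NUM_KV_HEADS = 32
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16


def kv_bytes_per_token() -> int:
    # Each token stores one key vector and one value vector
    # per head, per layer.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def reservation_report(reserved_tokens: int, used_tokens: int) -> dict:
    # Compare what a fixed up-front reservation holds against
    # what the request actually fills.
    per_token = kv_bytes_per_token()
    reserved = reserved_tokens * per_token
    used = used_tokens * per_token
    return {
        "reserved_mib": reserved / 2**20,
        "used_mib": used / 2**20,
        "wasted_mib": (reserved - used) / 2**20,
        "utilization": used / reserved,
    }


report = reservation_report(reserved_tokens=2048, used_tokens=50)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions one request reserves 1024 MiB and fills 25 MiB of it, so utilization is about 2.4 percent.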

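PagedAttention divides the KV cache into small fixed-size blocks, each holding a fixed number of tokens. A per-sequence block table maps logical blocks to physical blocks, which need not be contiguous. Memory is allocated one block at a time as the sequence grows. Nothing is reserved in advance.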
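The block table idea can be sketched in a few lines of plain Python. This is a toy model of the bookkeeping only, not vLLM's implementation. The class name `PagedKVCache`, its methods, and the block size of 16 are all hypothetical choices for this sketch.

```python
BLOCK_SIZE = 16  # tokens per block; an assumed value for this sketch


class PagedKVCache:
    """Toy bookkeeping for a paged KV cache (hypothetical, not vLLM's code)."""

    def __init__(self, num_blocks: int, block_size: int = BLOCK_SIZE):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> tokens stored so far

    def append_token(self, seq_id: str):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        offset = length % self.block_size
        if offset == 0:
            # The last block is full (or there is none yet):
            # allocate one more block on demand.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], offset

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8)
for _ in range(50):
    cache.append_token("req-a")
blocks_used = len(cache.block_tables["req-a"])
print(blocks_used, "blocks used,", len(cache.free_blocks), "still free")
```

A 50-token request here holds 4 blocks of 16 slots. The only unused slots are the 14 left in its last block, and the other blocks stay free for other requests.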

This change brings big wins:

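KV cache memory utilization reaches about 96 percent, because waste is confined to the last block of each sequence.
You fit more sequences in one batch.
Throughput increases up to 24 times over HuggingFace Transformers in vLLM's reported benchmarks.
You share system prompt blocks across requests instead of storing one copy per request.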
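Prompt sharing falls out of the block design: two requests can point at the same physical blocks if you count references. Here is a minimal sketch, assuming a hypothetical `SharedBlockPool` class. It leaves out copy-on-write, which a real system needs when a request writes into a shared block.

```python
class SharedBlockPool:
    """Toy reference-counted block pool (hypothetical names, illustration only)."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.ref_count = {}  # physical block id -> number of users

    def allocate(self) -> int:
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, blocks):
        """Let another request reference the same physical blocks."""
        for block in blocks:
            self.ref_count[block] += 1
        return list(blocks)

    def release(self, blocks) -> None:
        """Drop one reference per block; free a block when nobody uses it."""
        for block in blocks:
            self.ref_count[block] -= 1
            if self.ref_count[block] == 0:
                del self.ref_count[block]
                self.free.append(block)


pool = SharedBlockPool(num_blocks=10)

# Request A fills 4 blocks with the system prompt, plus 1 private block.
prompt_blocks = [pool.allocate() for _ in range(4)]
table_a = prompt_blocks + [pool.allocate()]

# Request B reuses the prompt blocks and only allocates its own block.
table_b = pool.share(prompt_blocks) + [pool.allocate()]
blocks_in_use = len(pool.ref_count)

# When A finishes, the shared prompt blocks survive because B still uses them.
pool.release(table_a)
still_cached = all(block in pool.ref_count for block in prompt_blocks)
print(blocks_in_use, still_cached)
```

Two requests with a 4-block prompt and 1 private block each occupy 6 physical blocks here. Without sharing they would occupy 10.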

There are trade-offs:

You need custom CUDA kernels, because attention must read keys and values from non-contiguous blocks.
You must pick the right block size.
It helps memory-bound workloads more than compute-bound ones.
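The block size trade-off is easy to see with arithmetic. Larger blocks leave more empty slots in each sequence's last block. Smaller blocks mean more block table entries to track and look up. The sketch below uses hypothetical helper names and an assumed sequence length of 50 tokens.

```python
import math


def blocks_needed(seq_len: int, block_size: int) -> int:
    # A sequence always occupies a whole number of blocks.
    return math.ceil(seq_len / block_size)


def wasted_slots(seq_len: int, block_size: int) -> int:
    # Empty token slots left in the sequence's last block.
    return blocks_needed(seq_len, block_size) * block_size - seq_len


SEQ_LEN = 50  # assumed example length
summary = {
    block_size: (blocks_needed(SEQ_LEN, block_size), wasted_slots(SEQ_LEN, block_size))
    for block_size in (8, 16, 32, 128)
}
for block_size, (blocks, wasted) in summary.items():
    print(f"block_size={block_size:>3}  table_entries={blocks}  wasted_slots={wasted}")
```

For this 50-token sequence, a block size of 8 needs 7 table entries and wastes 6 slots, while a block size of 128 needs 1 entry and wastes 78 slots.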

Use vLLM for production serving. It handles high traffic better than HuggingFace Transformers.

Source: https://dev.to/murali8k/pagedattention-vs-traditional-kv-cache-how-vllm-reinvented-gpu-memory-for-llm-inference-3ncc
Optional learning community: https://t.me/GyaanSetuAi
