PagedAttention vs Traditional KV Cache
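Every token you generate stores key and value tensors in GPU memory. This store is the KV cache. Traditional KV caching wastes much of this memory. vLLM fixed this with PagedAttention, an idea borrowed from virtual memory paging in operating systems.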
Old systems reserve one big contiguous block of memory for every request, sized for the maximum sequence length. You reserve space for 2048 tokens even if you use 50. This creates fragmentation. You fit fewer requests in a batch. Your GPU sits underused.
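To put numbers on the 2048-versus-50 case, here is a minimal sketch of the arithmetic. The model dimensions (32 layers, 32 KV heads, head size 128, fp16) are assumptions chosen to resemble a 7B-class model. They are not taken from the post.

```python
# Hypothetical 7B-class model dimensions (assumptions, not from the post).
NUM_LAYERS = 32
NUM_KV_HEADS = 32
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16


def kv_bytes_per_token() -> int:
    """Bytes of KV cache one token needs: a key and a value per layer and head."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def reserved_vs_used(max_len: int, actual_len: int):
    """Compare an up-front reservation with what the request really uses."""
    per_token = kv_bytes_per_token()
    reserved = max_len * per_token
    used = actual_len * per_token
    return reserved, used, used / reserved


reserved, used, utilization = reserved_vs_used(max_len=2048, actual_len=50)
print(f"reserved:    {reserved / 2**20:.0f} MiB")
print(f"used:        {used / 2**20:.0f} MiB")
print(f"utilization: {utilization:.1%}")
```

Under these assumptions, one request reserves 1024 MiB and uses 25 MiB of it.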
PagedAttention divides the KV cache into small fixed-size blocks. A block table maps each sequence's logical blocks to physical blocks, which do not need to be contiguous. It allocates blocks as you need them. It does not reserve space in advance.
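The block table idea can be sketched in a few lines of Python. This is a toy model, not vLLM's code. The names `BlockAllocator`, `Sequence`, and `slot_for` are hypothetical, and the block size of 16 is an assumption.

```python
class BlockAllocator:
    """Toy pool of fixed-size physical KV blocks (hypothetical, for illustration)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """One request. Its block table maps logical block index to physical block id."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new block only when the current last block is full.
        if self.num_tokens % self.allocator.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def slot_for(self, token_index: int):
        """Translate a token position into (physical block id, offset in block)."""
        block_size = self.allocator.block_size
        return self.block_table[token_index // block_size], token_index % block_size

    def release(self) -> None:
        for block_id in self.block_table:
            self.allocator.free(block_id)
        self.block_table.clear()
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=64, block_size=16)
seq = Sequence(allocator)
for _ in range(50):
    seq.append_token()
print(len(seq.block_table), "blocks hold 50 tokens")
```

With 16 tokens per block, 50 tokens occupy 4 blocks. The other 60 blocks stay free for other requests.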
This change brings big wins:
- KV cache memory utilization reaches about 96 percent.
- You fit more sequences in one batch.
- Throughput increases up to 24 times compared with HuggingFace Transformers.
- Requests with the same system prompt share its KV blocks instead of storing copies.
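Prompt sharing falls out of the block table. Several tables can point at the same physical blocks. Here is a minimal sketch using reference counts. `SharedBlockPool` is a hypothetical name, the 64-token prompt is an assumption, and only full blocks are shared here.

```python
from collections import Counter

BLOCK_SIZE = 16  # tokens per block (assumption)


class SharedBlockPool:
    """Toy block pool with reference counts (hypothetical, for illustration)."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.ref_counts = Counter()

    def allocate(self) -> int:
        block_id = self.free_blocks.pop()
        self.ref_counts[block_id] = 1
        return block_id

    def share(self, block_id: int) -> int:
        # A second table points at the same physical block.
        self.ref_counts[block_id] += 1
        return block_id

    def release(self, block_id: int) -> None:
        self.ref_counts[block_id] -= 1
        if self.ref_counts[block_id] == 0:
            del self.ref_counts[block_id]
            self.free_blocks.append(block_id)


pool = SharedBlockPool(num_blocks=32)

# Hypothetical system prompt of 64 tokens: 4 full blocks, cached once.
prompt_tokens = 64
prompt_blocks = [pool.allocate() for _ in range(prompt_tokens // BLOCK_SIZE)]

# Each request reuses the prompt blocks and adds one private block of its own.
request_tables = []
for _ in range(3):
    table = [pool.share(block_id) for block_id in prompt_blocks]
    table.append(pool.allocate())
    request_tables.append(table)

blocks_in_use = 32 - len(pool.free_blocks)
print("blocks in use with sharing:", blocks_in_use)
print("blocks needed without sharing:", 3 * (len(prompt_blocks) + 1))
```

Three requests use 7 blocks instead of 15 in this sketch.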
There are trade-offs:
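- You need custom CUDA kernels, because attention must read blocks that are not contiguous.
- You must pick the right block size. Small blocks mean more table entries and lookups. Large blocks leave more empty slots in the last block.
- It helps memory-bound workloads more than compute-bound ones.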
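The block size trade-off is easy to see with a little arithmetic. This sketch counts the blocks and the empty slots in the last block for a 50-token sequence. The candidate block sizes are assumptions for illustration.

```python
import math


def last_block_waste(seq_len: int, block_size: int):
    """Return (unused slots in the last block, number of blocks allocated)."""
    num_blocks = math.ceil(seq_len / block_size)
    wasted_slots = num_blocks * block_size - seq_len
    return wasted_slots, num_blocks


for block_size in (8, 16, 32, 128):
    wasted, num_blocks = last_block_waste(seq_len=50, block_size=block_size)
    print(f"block_size={block_size:>3}: {num_blocks} blocks, {wasted} unused slots")
```

Waste is always less than one block per sequence. Larger blocks raise that ceiling. Smaller blocks lower it but give you more block table entries to track.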
Use vLLM for production serving. It handles high traffic better than HuggingFace Transformers.
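A minimal sketch of offline batch generation with vLLM's Python API looks like this. The wrapper `generate_with_vllm` is a hypothetical helper. The model name and the sampling values are assumptions you should replace with your own. It needs vLLM installed and a supported GPU.

```python
def generate_with_vllm(prompts, model="facebook/opt-125m", max_tokens=64):
    """Run offline batch generation with vLLM and return the generated texts.

    Hypothetical helper for illustration. Model name and sampling values
    are placeholders, not recommendations.
    """
    # Imported inside the function so this file loads without vLLM installed.
    from vllm import LLM, SamplingParams

    sampling_params = SamplingParams(
        temperature=0.8, top_p=0.95, max_tokens=max_tokens
    )
    # gpu_memory_utilization is the fraction of GPU memory vLLM may claim
    # for model weights and the paged KV cache.
    llm = LLM(model=model, gpu_memory_utilization=0.90)
    outputs = llm.generate(prompts, sampling_params)
    return [output.outputs[0].text for output in outputs]
```

Call it with a list of prompt strings, for example `generate_with_vllm(["Hello, my name is"])`. vLLM batches the prompts and manages the KV blocks for you.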
Source: https://dev.to/murali8k/pagedattention-vs-traditional-kv-cache-how-vllm-reinvented-gpu-memory-for-llm-inference-3ncc