FlashMemory Cuts DeepSeek-V4 KV Cache to 13.5%
Long-context models face a massive problem. Memory, not math, is the limit.
As you add tokens, the KV cache grows linearly with the context length. At 500,000 tokens, the cache becomes huge. It can eat up most of the GPU memory. This makes serving long context expensive and slow.
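To make the growth concrete, here is a minimal sketch of the usual dense KV cache size formula. The layer count, head count, and head size below are hypothetical round numbers chosen for illustration. They are not DeepSeek-V4's real dimensions, and models with compressed KV layouts will store less per token.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of a dense KV cache for one sequence, in bytes."""
    # The leading 2 counts one key vector plus one value vector per token.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value

# Hypothetical configuration (assumption, NOT DeepSeek-V4's actual shape).
cfg = dict(num_layers=60, num_kv_heads=8, head_dim=128, bytes_per_value=2)

for n in (8_000, 128_000, 500_000):
    gib = kv_cache_bytes(n, **cfg) / 2**30
    print(f"{n:>7} tokens -> {gib:6.1f} GiB")
```

Every term is a plain multiplier, so doubling the token count doubles the cache.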
A new paper called FlashMemory-DeepSeek-V4 tackles this with Lookahead Sparse Attention (LSA).
Here is how it works:
Traditional models use a dense KV cache. They hold every single piece of past information in memory. This is like hauling an entire library to your desk just to read one sentence.
LSA works differently. It uses a Neural Memory Indexer. This indexer acts like an assistant. It predicts which specific parts of the past you need right now. It only brings those specific parts to the desk.
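The idea of "fetch only what you need" can be sketched in a few lines. This is a toy illustration, not the paper's method. The real Neural Memory Indexer is a learned model, and the post does not describe its internals. Here a simple stand-in scores each block of past keys by its mean key, and the names block_summaries, select_blocks, and sparse_attention are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_summaries(keys, block_size):
    """Mean key per block. A cheap stand-in for a learned indexer (assumption)."""
    out = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        dim = len(block[0])
        out.append([sum(k[d] for k in block) / len(block) for d in range(dim)])
    return out

def select_blocks(query, summaries, top_k):
    """Indices of the top_k blocks with the highest query-summary score."""
    ranked = sorted(range(len(summaries)),
                    key=lambda i: dot(query, summaries[i]), reverse=True)
    return sorted(ranked[:top_k])

def sparse_attention(query, keys, values, block_size, top_k):
    """Softmax attention computed only over the selected blocks."""
    chosen = select_blocks(query, block_summaries(keys, block_size), top_k)
    idx = [i for b in chosen
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, keys[i]) * scale for i in idx]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
           for d in range(dim)]
    return out, chosen

# Two blocks of four tokens each. The query lines up with the second block.
keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
values = [[1.0]] * 4 + [[5.0]] * 4
out, chosen = sparse_attention([0.0, 3.0], keys, values, block_size=4, top_k=1)
print(chosen, out)
```

Only the chosen blocks are ever read when computing the output. That is what lets the rest of the cache stay off the desk.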
The results on DeepSeek-V4 are impressive:
- Physical memory footprint drops to 13.5% of the original size (an 86.5% reduction).
- At 500,000 tokens, the reported reduction reaches 90%.
- Accuracy actually increases by 0.6%.
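A quick way to read the 13.5% figure is to ask how much more context fits in the same memory. The sketch below assumes the resident cache scales linearly with tokens and that the retained fraction stays fixed at 13.5% for every length. Both are simplifying assumptions. The 40 GiB budget and the bytes-per-token value are hypothetical.

```python
def max_context_tokens(budget_bytes, bytes_per_token, retained_fraction=1.0):
    """Largest token count whose resident KV cache fits in budget_bytes."""
    return int(budget_bytes // (bytes_per_token * retained_fraction))

budget = 40 * 2**30        # hypothetical 40 GiB reserved for the KV cache
bytes_per_token = 245_760  # hypothetical dense KV bytes per token

dense = max_context_tokens(budget, bytes_per_token)
sparse = max_context_tokens(budget, bytes_per_token, retained_fraction=0.135)
print(dense, sparse, round(sparse / dense, 1))
```

Under these assumptions the same budget holds roughly 1 / 0.135 times as many tokens, a bit over seven times the dense limit.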
Why is this better than previous methods?
Other sparse attention methods save compute time. They still keep the whole cache in memory. LSA saves actual gigabytes of space. It avoids holding most of the cache in GPU memory.
Also, training this indexer is cheap. The team uses backbone-free training. They do not need to load the trillion-parameter model to train the small indexer.
This makes ultra-long-context models more affordable to run.
Summary of approaches:
- Full KV Cache: Exact but uses massive memory.
- Sliding Window: Low memory but forgets old information.
- Block-Sparse: Saves compute but the cache stays large.
- LSA: Saves massive memory and keeps accuracy high.
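The difference between the four approaches comes down to two numbers per decoding step: how many tokens stay resident in memory, and how many tokens attention actually reads. Here is a rough sketch of that bookkeeping. The window size and the number of selected tokens are hypothetical. Only the 13.5% retained fraction comes from the post, and treating it as constant across context lengths is an assumption.

```python
def per_step_cost(strategy, context_len, window=4096, selected=8192,
                  retained_fraction=0.135):
    """Return (resident_tokens, attended_tokens) for one decoding step."""
    if strategy == "full":
        # Everything is stored and everything is read.
        return context_len, context_len
    if strategy == "sliding_window":
        # Only the most recent tokens survive. Older ones are dropped.
        kept = min(window, context_len)
        return kept, kept
    if strategy == "block_sparse":
        # Reads a subset, but the whole cache stays resident.
        return context_len, min(selected, context_len)
    if strategy == "lsa":
        # Assumption: only the retained fraction is resident at any time.
        resident = round(context_len * retained_fraction)
        return resident, min(selected, resident)
    raise ValueError(f"unknown strategy: {strategy}")

for name in ("full", "sliding_window", "block_sparse", "lsa"):
    resident, attended = per_step_cost(name, 500_000)
    print(f"{name:<15} resident={resident:>7} attended={attended:>7}")
```

Block-sparse and LSA read the same number of tokens in this sketch. They differ only in the resident column, which is the memory saving the post is about.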
Source: https://dev.to/pueding/flashmemory-cuts-deepseek-v4s-kv-cache-to-135-lookahead-sparse-attention-5coe