FlashMemory Cuts DeepSeek V4 KV Cache to 13.5%


AI-assisted draft.

3 days ago · 1 min read

FlashMemory Cuts DeepSeek-V4 KV Cache to 13.5%

Long-context models face a massive problem. Memory, not compute, is the limit.

As you add tokens, a dense KV cache grows linearly with context length. At 500,000 tokens, the cache becomes huge. It can eat up all the GPU memory. This makes serving long-context models expensive and slow.
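To make that growth concrete, here is a rough sizing sketch for a dense KV cache. The layer count, KV-head count, head dimension, and 2-byte precision are hypothetical placeholders, not DeepSeek-V4's actual configuration, and the formula assumes a standard layout with one key and one value vector per token, layer, and KV head.

```python
def kv_cache_bytes(num_tokens, num_layers=60, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Dense KV cache size in bytes.

    All default values are hypothetical and chosen only for illustration.
    """
    # Factor 2: one key vector and one value vector per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


# The cache grows linearly with the number of tokens held in context.
for n in (8_000, 128_000, 500_000):
    gib = kv_cache_bytes(n) / 2**30
    print(f"{n:>7} tokens -> {gib:6.1f} GiB")
```

Whatever the real model dimensions are, the shape of the curve is the same: doubling the context doubles the cache.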

A new paper called FlashMemory-DeepSeek-V4 addresses this with Lookahead Sparse Attention (LSA).

Here is how it works:

Traditional models use a dense KV cache. They hold every single piece of past information in memory. This is like hauling an entire library to your desk just to read one sentence.

LSA works differently. It uses a Neural Memory Indexer. This indexer acts like an assistant. It predicts which specific parts of the past you need right now. It only brings those specific parts to the desk.
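The idea can be sketched as a toy block selector. This is not the paper's Neural Memory Indexer: the scoring function here is a plain dot product against per-block summary vectors, standing in for a learned predictor, and the names `score`, `select_blocks`, and `block_summaries` are hypothetical.

```python
def score(query, summary):
    """Stand-in relevance score: dot product of the query and a block summary."""
    return sum(q * s for q, s in zip(query, summary))


def select_blocks(query, block_summaries, budget):
    """Return the ids of the `budget` highest-scoring blocks, in original order."""
    ranked = sorted(range(len(block_summaries)),
                    key=lambda i: score(query, block_summaries[i]),
                    reverse=True)
    return sorted(ranked[:budget])


# Four past blocks, each summarised by a small vector (toy values).
block_summaries = [
    [0.9, 0.1],
    [0.0, 1.0],
    [0.8, 0.3],
    [0.1, 0.2],
]
query = [1.0, 0.0]

# Only the selected blocks would be brought "to the desk" (GPU memory).
resident = select_blocks(query, block_summaries, budget=2)
print(resident)
```

The point of the sketch is the budget: memory use is bounded by how many blocks the selector is allowed to fetch, not by how long the history is.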

The results on DeepSeek-V4 are impressive:

Physical memory footprint drops to 13.5% of the original size.
That is an 86.5% reduction, or roughly 90%, at 500,000 tokens.
Accuracy actually increases by 0.6%.
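As a quick check of the footprint figure, the arithmetic below converts a retained percentage into a reduction and into gigabytes kept. The 100 GiB dense cache is a hypothetical size used only for illustration.

```python
def reduction_percent(retained_percent):
    """Reduction implied by keeping `retained_percent` of the original footprint."""
    return 100.0 - retained_percent


def retained_gib(dense_gib, retained_percent):
    """Footprint left after shrinking a dense cache of `dense_gib` GiB."""
    return dense_gib * retained_percent / 100.0


print(reduction_percent(13.5))    # 86.5
print(retained_gib(100.0, 13.5))  # 13.5 (GiB left from a hypothetical 100 GiB cache)
```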

Why is this better than previous methods?

Other sparse attention methods save compute time. They still keep the whole cache in memory. LSA saves actual gigabytes of space. It avoids holding most of the cache in GPU memory in the first place.

Also, training this indexer is cheap. The team used backbone-free training. They do not need to load the trillion-parameter model to train the small indexer.

This could make ultra-long-context models much more affordable to run.

Summary of approaches:

Full KV Cache: Exact but uses massive memory.
Sliding Window: Low memory but forgets old information.
Block-Sparse: Saves compute but the cache stays large.
LSA: Saves massive memory and keeps accuracy high.
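The memory side of this comparison can be sketched as a count of tokens whose keys and values stay resident on the GPU. The 4,096-token window is a hypothetical setting, and applying the 13.5% footprint directly to token count is a simplifying assumption rather than something the paper is known to state.

```python
def resident_tokens(strategy, context_len, window=4_096, retained_fraction=0.135):
    """Number of past tokens whose keys/values stay in GPU memory."""
    if strategy in ("full", "block_sparse"):
        # Block-sparse attention skips compute but still stores every token.
        return context_len
    if strategy == "sliding_window":
        # Only the most recent `window` tokens are kept; older ones are dropped.
        return min(context_len, window)
    if strategy == "lsa":
        # Assumption: the reported footprint maps one-to-one onto token count.
        return round(context_len * retained_fraction)
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("full", "sliding_window", "block_sparse", "lsa"):
    print(f"{name:>14}: {resident_tokens(name, 500_000):>7} tokens resident")
```

The sketch says nothing about accuracy. It only shows why a method that saves compute, such as block-sparse attention, does not by itself reduce the memory bill.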

Source: https://dev.to/pueding/flashmemory-cuts-deepseek-v4s-kv-cache-to-135-lookahead-sparse-attention-5coe

Optional learning community: https://t.me/GyaanSetuAi
