KV Cache Quantization for On-Device LLMs

Running Llama 3.2 3B on an Android device with 2 GB of RAM is hard. Most developers focus on model weights. This is a mistake. The real memory killer is the KV cache.

The KV cache grows with every token you add to the conversation. At standard FP16 precision it reaches hundreds of megabytes, and on a 2 GB device that can crash your app after only a few turns.
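To see how fast the cache grows, here is a rough back-of-the-envelope sketch. The layer count, KV head count, and head dimension are assumptions taken from the published Llama 3.2 3B config, not from this post, and the function names are made up for illustration.

```python
# Assumed Llama 3.2 3B attention shape (from the published model config,
# not stated in this post): 28 layers, 8 KV heads, head dimension 128.
NUM_LAYERS = 28
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_FP16 = 2


def kv_cache_bytes(num_tokens, bytes_per_element=BYTES_FP16):
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # Factor 2 covers one key vector and one value vector per head per layer.
    elements_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM
    return int(num_tokens * elements_per_token * bytes_per_element)


def to_mib(num_bytes):
    """Convert bytes to mebibytes."""
    return num_bytes / (1024 * 1024)


if __name__ == "__main__":
    for tokens in (256, 1024, 2048):
        print(f"{tokens:>5} tokens: {to_mib(kv_cache_bytes(tokens)):.0f} MiB")
```

Under these assumptions each token costs 112 KiB at FP16, so a 2048-token context comes to 224 MiB.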

You can fix this with three specific steps.

  1. Use Mixed-Precision Quantization. Keys and values do not need the same precision. Key caches handle low precision well. Value caches do not, so quantize keys more aggressively than values.

This approach reduces your cache size by about 62%. For a 2048-token context, you drop from 224 MB to 84 MB. The model weights stay untouched.
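A minimal sketch of the idea, assuming 4-bit keys and 8-bit values. The post does not state the bit widths; this split is simply one that reproduces the 224 MB to 84 MB figure. The symmetric per-group scheme and all function names here are illustrative, not the author's implementation.

```python
# Assumed bit widths (not stated in the post): 4-bit keys, 8-bit values.
KEY_BITS = 4
VALUE_BITS = 8


def quantize(values, bits):
    """Symmetric quantization of one group to signed `bits`-bit integer codes."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit, 127 for 8-bit
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floats."""
    return [c * scale for c in codes]


def max_abs_error(original, restored):
    """Largest absolute round-trip error in the group."""
    return max(abs(a - b) for a, b in zip(original, restored))


def size_ratio(key_bits, value_bits, baseline_bits=16):
    """Quantized cache size relative to the baseline, ignoring scale metadata."""
    return (key_bits + value_bits) / (2 * baseline_bits)


if __name__ == "__main__":
    vector = [0.12, -0.87, 0.45, 0.03, -0.31, 0.99, -0.56, 0.20]
    k_codes, k_scale = quantize(vector, KEY_BITS)
    v_codes, v_scale = quantize(vector, VALUE_BITS)
    print("4-bit max error:", max_abs_error(vector, dequantize(k_codes, k_scale)))
    print("8-bit max error:", max_abs_error(vector, dequantize(v_codes, v_scale)))
    print("size ratio vs FP16:", size_ratio(KEY_BITS, VALUE_BITS))
```

With 4 and 8 bits the ratio is 0.375, which matches the drop from 224 MB to 84 MB. A real cache also stores the per-group scales, so the actual saving is slightly smaller.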

  2. Implement Sliding Window Eviction. You cannot keep every token in active memory. Use a sliding window to keep only the most recent 1536 tokens. Keep the first 64 tokens as anchors to preserve the system prompt.
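The eviction rule can be sketched as a small bookkeeping class. This version assumes the 64 anchors are kept in addition to the 1536-token window (the post does not say which), and it tracks token positions only, not the tensors themselves. The class name is hypothetical.

```python
from collections import deque


class SlidingWindowCache:
    """Tracks which token positions stay resident in memory.

    The first `num_anchors` positions are never evicted. After that, only
    the most recent `window` positions are kept.
    """

    def __init__(self, window=1536, num_anchors=64):
        self.window = window
        self.num_anchors = num_anchors
        self.anchors = []        # first tokens (system prompt), pinned
        self.recent = deque()    # most recent non-anchor tokens

    def append(self, position):
        """Add a token position. Returns the evicted position, or None."""
        if len(self.anchors) < self.num_anchors:
            self.anchors.append(position)
            return None
        self.recent.append(position)
        if len(self.recent) > self.window:
            return self.recent.popleft()   # oldest non-anchor token leaves
        return None

    def resident(self):
        """Positions currently held in memory, oldest first."""
        return self.anchors + list(self.recent)


if __name__ == "__main__":
    cache = SlidingWindowCache()
    evicted = [p for p in (cache.append(i) for i in range(2048)) if p is not None]
    print("resident tokens:", len(cache.resident()))
    print("evicted tokens:", len(evicted))
```

The value returned by `append` is the hook for the next step: whatever gets evicted is what you hand to storage.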

  3. Use Flash Spilling. When tokens leave the sliding window, move them to flash storage. Use memory-mapped files on Android. Modern UFS 4.0 storage is fast enough to page this data back into memory without noticeable lag.
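Here is a rough sketch of the spilling idea using Python's `mmap` module. The fixed-size slot layout and the `FlashSpill` name are assumptions made for illustration. On Android you would likely reach for `FileChannel.map` and a `MappedByteBuffer` instead, but the pattern is the same.

```python
import mmap
import os
import tempfile


class FlashSpill:
    """Fixed-size slots in a memory-mapped file, one slot per evicted token."""

    def __init__(self, path, slot_bytes, num_slots):
        self.slot_bytes = slot_bytes
        self.num_slots = num_slots
        self._file = open(path, "w+b")
        self._file.truncate(slot_bytes * num_slots)   # reserve space up front
        self._map = mmap.mmap(self._file.fileno(), slot_bytes * num_slots)

    def _offset(self, slot):
        if not 0 <= slot < self.num_slots:
            raise IndexError("slot out of range")
        return slot * self.slot_bytes

    def write(self, slot, payload):
        """Store one token's quantized K/V bytes in the given slot."""
        if len(payload) != self.slot_bytes:
            raise ValueError("payload must fill exactly one slot")
        start = self._offset(slot)
        self._map[start:start + self.slot_bytes] = payload

    def read(self, slot):
        """Page one token's bytes back in from the mapped file."""
        start = self._offset(slot)
        return bytes(self._map[start:start + self.slot_bytes])

    def close(self):
        self._map.flush()
        self._map.close()
        self._file.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        spill = FlashSpill(os.path.join(tmp, "kv_spill.bin"), slot_bytes=16, num_slots=4)
        spill.write(2, bytes(range(16)))
        print(spill.read(2) == bytes(range(16)))
        spill.close()
```

Fixed-size slots keep the offset math trivial: a token's position in the file follows directly from its slot index, so no separate index structure is needed.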

The results are significant. On a Snapdragon 8 Gen 3:

Avoid these mistakes:

Build your memory budget before you build your features.

Source: https://dev.to/software_mvp-factory/kv-cache-quantization-for-on-device-llms-kf