KV Cache Quantization for On-Device LLMs
Running Llama 3.2 3B on an Android device with 2 GB of RAM is hard. Most developers focus on model weights. This is a mistake. The real memory killer is the KV cache.
The KV cache grows with every token as you chat. If you use standard FP16 precision, the cache eats hundreds of megabytes by the time the context reaches 2048 tokens. On a 2 GB device, this can crash your app after only a few turns.
You can fix this with three specific steps.
Use Mixed-Precision Quantization
Keys and values do not need the same precision. Key caches handle low precision well. Value caches do not.
- Use INT4 for keys.
- Use INT8 for values.
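A minimal sketch of what the two precisions mean in practice, assuming symmetric quantization with one scale per group of values. The function names and the grouping are illustrative assumptions, not the implementation behind this post, and production runtimes typically use more elaborate schemes.

```python
def quantize_group(values, bits):
    """Symmetric quantization of one group of floats to signed `bits`-bit codes.

    Returns (codes, scale); each value is approximated by code * scale.
    """
    qmax = (1 << (bits - 1)) - 1              # 7 for INT4, 127 for INT8
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize_group(codes, scale):
    """Recover approximate floats from codes and their shared scale."""
    return [c * scale for c in codes]

def pack_int4(codes):
    """Pack signed 4-bit codes two per byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]                   # pad to an even count
    out = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(out)

def unpack_int4(packed, count):
    """Inverse of pack_int4; returns the first `count` signed codes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]

# Hypothetical rows from one attention head.
key_row = [0.5, -1.0, 0.25, 0.0]
value_row = [0.5, -1.0, 0.25, 0.0]
k_codes, k_scale = quantize_group(key_row, bits=4)
k_packed = pack_int4(k_codes)                          # two keys per byte
v_codes, v_scale = quantize_group(value_row, bits=8)   # one value per byte
```

The INT8 path keeps the rounding error per element much smaller than the INT4 path, which is the trade the post makes in favor of values.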
This approach reduces your cache size by about 62.5%. For a 2048-token context, you drop from 224 MB down to 84 MB, not counting the small overhead of the quantization scales. This happens without changing the model weights.
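These figures can be reproduced with simple arithmetic. The sketch below assumes the published Llama 3.2 3B configuration (28 layers, 8 KV heads, head dimension 128), which the post does not state, treats MB as MiB, and ignores the storage needed for quantization scales.

```python
def kv_cache_bytes(tokens, key_bits, value_bits,
                   layers=28, kv_heads=8, head_dim=128):
    """Bytes needed to cache keys and values for `tokens` positions.

    The default shape is an assumption taken from the Llama 3.2 3B config.
    """
    elements_per_token = layers * kv_heads * head_dim   # for K, and again for V
    bits_per_token = elements_per_token * (key_bits + value_bits)
    return tokens * bits_per_token // 8

MIB = 1024 * 1024
fp16 = kv_cache_bytes(2048, key_bits=16, value_bits=16)
mixed = kv_cache_bytes(2048, key_bits=4, value_bits=8)

print(fp16 / MIB)        # 224.0
print(mixed / MIB)       # 84.0
print(1 - mixed / fp16)  # 0.625
```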
Implement Sliding Window Eviction
You cannot keep every token in active memory. Use a sliding window to keep only the most recent 1536 tokens. Keep the first 64 tokens as anchors to preserve the system prompt.
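The eviction policy can be sketched as follows. The class name is hypothetical, and the sketch assumes the 64 anchor tokens are kept in addition to the 1536-token window (up to 1600 resident tokens); the post does not say whether the anchors count toward the window.

```python
from collections import deque

class SlidingWindowCache:
    """Keeps the first `anchors` entries plus the most recent `window` entries."""

    def __init__(self, window=1536, anchors=64):
        self.window = window
        self.anchors = anchors
        self.anchor_entries = []        # never evicted (system prompt)
        self.recent_entries = deque()   # sliding window, oldest on the left

    def append(self, entry):
        """Add one token's KV entry; return the evicted entry, or None."""
        if len(self.anchor_entries) < self.anchors:
            self.anchor_entries.append(entry)
            return None
        self.recent_entries.append(entry)
        if len(self.recent_entries) > self.window:
            return self.recent_entries.popleft()
        return None

    def active(self):
        """Entries currently resident in memory, in token order."""
        return self.anchor_entries + list(self.recent_entries)

# Small sizes to make the behavior visible.
cache = SlidingWindowCache(window=4, anchors=2)
evicted = [cache.append(token) for token in range(10)]
print(cache.active())                           # [0, 1, 6, 7, 8, 9]
print([e for e in evicted if e is not None])    # [2, 3, 4, 5]
```

Returning the evicted entry from `append` gives the caller a natural place to hand it to slower storage instead of discarding it.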
Use Flash Spilling
When tokens leave the sliding window, move them to flash storage instead of discarding them. Use memory-mapped files on Android. Modern UFS 4.0 storage is fast enough to page this data back into memory without noticeable lag.
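A rough illustration of the spilling idea, written in Python for brevity rather than as Android code. It assumes every evicted entry serializes to a fixed number of bytes, so a slot index maps directly to a file offset; the class name and layout are hypothetical.

```python
import mmap
import os
import tempfile

class SpillFile:
    """Fixed-size record store backed by a memory-mapped file."""

    def __init__(self, path, record_bytes, capacity):
        self.record_bytes = record_bytes
        self.capacity = capacity
        self._file = open(path, "w+b")
        self._file.truncate(record_bytes * capacity)   # reserve space up front
        self._map = mmap.mmap(self._file.fileno(), record_bytes * capacity)

    def _offset(self, slot):
        if not 0 <= slot < self.capacity:
            raise IndexError("slot out of range")
        return slot * self.record_bytes

    def write(self, slot, record):
        """Store one evicted entry; the OS flushes pages to flash as needed."""
        if len(record) != self.record_bytes:
            raise ValueError("record has the wrong size")
        start = self._offset(slot)
        self._map[start:start + self.record_bytes] = record

    def read(self, slot):
        """Page one entry back in and return a copy of its bytes."""
        start = self._offset(slot)
        return bytes(self._map[start:start + self.record_bytes])

    def close(self):
        self._map.close()
        self._file.close()

spill_path = os.path.join(tempfile.mkdtemp(), "kv_spill.bin")
spill = SpillFile(spill_path, record_bytes=8, capacity=4)
spill.write(2, b"ABCDEFGH")
print(spill.read(2))   # b'ABCDEFGH'
spill.close()
```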
The results are significant. On a Snapdragon 8 Gen 3:
- Peak memory drops below the 2 GB limit.
- Max conversation turns increase from 4 to over 12.
- Token speed increases because decoding is limited by memory bandwidth, and a smaller cache means fewer bytes to read for each generated token.
- Model quality stays almost the same.
Avoid these mistakes:
- Do not quantize keys and values to the same level. You will lose quality.
- Do not ignore thermal throttling. Sustained inference gets hot. Check the Android thermal APIs, which are backed by the Thermal HAL (for example, PowerManager.getCurrentThermalStatus()), to manage performance.
- Do not forget the cache lifecycle. Always tie mapped buffers to a proper scope to avoid memory leaks.
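One way to tie a mapped buffer to a scope, shown in Python as a language-neutral sketch. The helper name is hypothetical, and on Android the same idea would be expressed with the platform's own scoped-cleanup constructs. Note that opening with "w+b" recreates the file each time, which suits a per-session scratch file.

```python
import mmap
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def mapped_buffer(path, size):
    """Map `size` bytes of `path`; always unmap and close on exit."""
    with open(path, "w+b") as f:
        f.truncate(size)
        buf = mmap.mmap(f.fileno(), size)
        try:
            yield buf
        finally:
            buf.close()   # released even if the body raises

scratch_path = os.path.join(tempfile.mkdtemp(), "kv_scratch.bin")
with mapped_buffer(scratch_path, 4096) as buf:
    buf[0:4] = b"\x01\x02\x03\x04"
print(buf.closed)   # True
```

Because the mapping is released in a `finally` block, an exception during inference cannot leave it open.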
Build your memory budget before you build your features.
Source: https://dev.to/software_mvp-factory/kv-cache-quantization-for-on-device-llms-kf