𝗞𝗩 𝗖𝗮𝗰𝗵𝗲 𝗤𝘂𝗮𝗻𝘁𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗳𝗼𝗿 𝗢𝗻-𝗗𝗲𝘃𝗶𝗰𝗲 𝗟𝗟𝗠𝘀

Running Llama 3.2 3B on an Android device with 2 GB of RAM is hard. Most developers focus on model weights. This is a mistake. The real memory killer is the KV cache.

The KV cache grows as you chat. If you use standard FP16 precision, the cache eats hundreds of megabytes. This causes your app to crash after only a few turns.

You can fix this with three specific steps.

1. Use Mixed-Precision Quantization

Keys and values do not need the same precision. Key caches handle low precision well. Value caches do not.

Use INT4 for keys.
Use INT8 for values.
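
A minimal sketch of the two precisions, written in plain Python for readability. It assumes symmetric per-group quantization with one float scale per group; the group size of 32 and the function names are illustrative and not taken from any particular runtime. A production kernel would pack two INT4 codes per byte and run in native code.

```python
import math


def quantize(values, bits, group_size=32):
    """Symmetric per-group quantization. Returns (integer codes, per-group scales)."""
    qmax = (1 << (bits - 1)) - 1  # 7 for INT4, 127 for INT8
    codes, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        peak = max(abs(v) for v in group)
        scale = peak / qmax if peak > 0 else 1.0
        scales.append(scale)
        # Round to the nearest code and clamp to the representable range.
        codes.extend(max(-qmax, min(qmax, round(v / scale))) for v in group)
    return codes, scales


def dequantize(codes, scales, group_size=32):
    """Map integer codes back to floats using the scale of their group."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def max_error(values, bits):
    """Largest absolute round-trip error for the given bit width."""
    codes, scales = quantize(values, bits)
    restored = dequantize(codes, scales)
    return max(abs(a - b) for a, b in zip(values, restored))


# Stand-in data for one cache row; real keys and values come from the model.
row = [math.sin(0.37 * i) for i in range(128)]
key_error = max_error(row, bits=4)    # coarser, as used for keys
value_error = max_error(row, bits=8)  # finer, as used for values
```

The round-trip error is bounded by half a scale step, so the INT8 path is much tighter than the INT4 path on the same data.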

This approach reduces your cache size by roughly 62%. For a 2048-token context, the cache drops from 224 MB to 84 MB: keys shrink to a quarter of their FP16 size and values to half. The model weights stay unchanged.
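
The figures above can be reproduced with a few lines of arithmetic. This sketch assumes Llama 3.2 3B's published configuration (28 layers, 8 key/value heads, head dimension 128), counts sizes in binary megabytes, and ignores the small overhead of quantization scales, so treat the results as approximate.

```python
def kv_cache_bytes(tokens, key_bits, value_bits,
                   layers=28, kv_heads=8, head_dim=128):
    """Total KV cache size in bytes for the given context length and bit widths."""
    # Elements stored per token for keys alone (values have the same count).
    elems_per_token = layers * kv_heads * head_dim
    key_bytes = tokens * elems_per_token * key_bits // 8
    value_bytes = tokens * elems_per_token * value_bits // 8
    return key_bytes + value_bytes


MIB = 1024 * 1024
fp16_mib = kv_cache_bytes(2048, key_bits=16, value_bits=16) / MIB  # 224.0
mixed_mib = kv_cache_bytes(2048, key_bits=4, value_bits=8) / MIB   # 84.0
saving = 1 - mixed_mib / fp16_mib                                  # 0.625
```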

2. Implement Sliding Window Eviction

You cannot keep every token in active memory. Use a sliding window to keep only the most recent 1536 tokens. Keep the first 64 tokens as anchors to preserve the system prompt.
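
A minimal sketch of the eviction rule. It assumes the 64 anchor tokens are held in addition to the 1536-token window (1600 resident tokens in total); the function name is hypothetical, and position handling inside the attention kernel is out of scope here.

```python
def split_resident_and_evicted(seq_len, window=1536, anchors=64):
    """Return (resident, evicted) token positions for a sequence of seq_len tokens."""
    if seq_len <= anchors + window:
        # Everything still fits, so nothing is evicted.
        return list(range(seq_len)), []
    recent_start = seq_len - window
    resident = list(range(anchors)) + list(range(recent_start, seq_len))
    evicted = list(range(anchors, recent_start))
    return resident, evicted


resident, evicted = split_resident_and_evicted(2048)
# 1600 positions stay in memory; positions 64..511 leave the window.
```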

3. Use Flash Spilling

When tokens leave the sliding window, move them to flash storage. Use memory-mapped files on Android. Modern UFS 4.0 storage is fast enough to page this data back into memory without noticeable lag.
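
A minimal sketch of the spill path, using Python's mmap module to show the idea; on Android the closest equivalent is a MappedByteBuffer obtained from FileChannel.map(). The SpillFile class, its slot layout, and the file name are hypothetical. The slot size assumes the Llama 3.2 3B shape with INT4 keys and INT8 values and no stored scales. The context manager ties the mapping's lifetime to a scope so it is always released.

```python
import mmap
import os
import tempfile


class SpillFile:
    """Fixed-size slots for evicted tokens, backed by a memory-mapped file."""

    def __init__(self, path, slot_bytes, max_slots):
        self.slot_bytes = slot_bytes
        self.max_slots = max_slots
        size = slot_bytes * max_slots
        self._file = open(path, "w+b")
        self._file.truncate(size)  # reserve the full file up front
        self._map = mmap.mmap(self._file.fileno(), size)

    def _offset(self, slot):
        if not 0 <= slot < self.max_slots:
            raise IndexError("slot out of range")
        return slot * self.slot_bytes

    def write(self, slot, payload):
        if len(payload) != self.slot_bytes:
            raise ValueError("payload must be exactly one slot")
        start = self._offset(slot)
        self._map[start:start + self.slot_bytes] = payload

    def read(self, slot):
        start = self._offset(slot)
        return bytes(self._map[start:start + self.slot_bytes])

    def close(self):
        self._map.close()
        self._file.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()


# One token: 28 layers * 8 KV heads * 128 dims, 0.5 byte per key + 1 byte per value.
SLOT_BYTES = 28 * 8 * 128 * 3 // 2  # 43008

with tempfile.TemporaryDirectory() as tmp_dir:
    with SpillFile(os.path.join(tmp_dir, "kv_spill.bin"), SLOT_BYTES, max_slots=448) as spill:
        spill.write(0, bytes([1]) * SLOT_BYTES)  # evict one token to storage
        restored = spill.read(0)                 # page it back in on demand
```

Reading a slot only touches the pages that back it, so the operating system decides how much of the file is resident at any moment.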

The results are significant. On a Snapdragon 8 Gen 3:

Peak memory drops below the 2 GB limit.
Max conversation turns increase from 4 to over 12.
Token speed increases because decoding is limited by memory bandwidth. Every new token reads the whole cache, so a smaller cache moves fewer bytes per token.
Model quality stays almost the same.

Avoid these mistakes:

Do not quantize keys and values to the same level. You will lose quality.
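Do not ignore thermal throttling. Sustained inference gets hot. Apps cannot query the Android Thermal HAL directly, so read its state through the platform thermal APIs, such as PowerManager.getCurrentThermalStatus() or a thermal status listener, and scale back work as the status rises.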
Do not forget the cache lifecycle. Always tie mapped buffers to a proper scope to avoid memory leaks.

Build your memory budget before you build your features.

Source: https://dev.to/software_mvp-factory/kv-cache-quantization-for-on-device-llms-kf
