Custom Vulkan Kernels for Android LLMs

📅 2 weeks ago ⏱ 1 min read

Stop using NNAPI and TFLite for LLMs on Android. These frameworks add too much overhead. You can double your tokens per second with custom Vulkan compute kernels.

Here is the data from Snapdragon 8 Gen 4:

Follow these steps for better performance:

Use tiled matrix multiplication. Match the tile size to your GPU's subgroup width (the Vulkan term for a warp or wave).
Fuse softmax and attention. Combine them into one kernel so the intermediate attention scores never make an extra trip through global memory. This recovers 40% of the lost speed.
Map weights directly into GPU-visible memory with AHardwareBuffer. This avoids a slow deserialization step at load time.
Tune for each GPU. No single setting works everywhere: Adreno 750 needs 16x16 tiles, Mali-G720 needs 8x8 tiles. Ship SPIR-V variants for different hardware.
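
As a rough illustration of the tiling, fusion, and per-GPU tuning steps, here is a CPU reference sketch in plain Python. It is not a Vulkan kernel. It shows the loop structure a tiled matmul shader follows, a single-pass ("online") softmax that folds normalization into the attention accumulation, and a per-GPU tile lookup. The two tile sizes come from the list above; `pick_tile`, `DEFAULT_TILE`, and the function names are hypothetical, and the online-softmax formulation is one common way to fuse the two stages, assumed here rather than taken from the original kernels.

```python
import math

# Tile sizes quoted in the text. The fallback is an assumption for illustration.
TILE_BY_GPU = {"Adreno 750": 16, "Mali-G720": 8}
DEFAULT_TILE = 8  # hypothetical fallback for GPUs not listed above


def pick_tile(gpu_name):
    """Return the tile edge length for a GPU, or the fallback if unknown."""
    return TILE_BY_GPU.get(gpu_name, DEFAULT_TILE)


def tiled_matmul(a, b, tile):
    """Compute C = A @ B tile by tile. A and B are lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # On a GPU, a workgroup would stage these A and B tiles
                # in shared memory before the inner loops run.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


def fused_attention(q, keys, values):
    """softmax(q . K^T / sqrt(d)) @ V for one query, in a single pass.

    The score vector is never stored: each score updates a running max,
    a running denominator, and the weighted sum of value rows.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    denom = 0.0
    out = [0.0] * len(values[0])
    for key, value in zip(keys, values):
        score = scale * sum(qi * ki for qi, ki in zip(q, key))
        new_max = max(running_max, score)
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        weight = math.exp(score - new_max)
        denom = denom * correction + weight
        out = [o * correction + weight * v for o, v in zip(out, value)]
        running_max = new_max
    return [o / denom for o in out]
```

A reference like this is mainly useful for checking a shader's output on small inputs before tuning tile sizes on a real device.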

Profile your dispatch overhead first. The bottleneck is often the dispatch, not the math.
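
One simple way to check this is to time an empty dispatch against a real one and compare. The sketch below does that with a host-side clock in plain Python; `time_per_call` and `overhead_fraction` are hypothetical helper names, and the two callables stand in for whatever submits and waits on your compute work. On a device you would more likely rely on GPU timestamp queries, which this sketch does not attempt to model.

```python
import time


def time_per_call(fn, repeats=100, clock=time.perf_counter):
    """Average seconds per call of fn, measured over `repeats` calls."""
    start = clock()
    for _ in range(repeats):
        fn()
    return (clock() - start) / repeats


def overhead_fraction(dispatch_empty, dispatch_real, repeats=100,
                      clock=time.perf_counter):
    """Estimate the share of a real dispatch's time that is fixed overhead.

    dispatch_empty should submit a kernel that does no work, so its cost
    approximates the per-dispatch overhead. The result is capped at 1.0.
    """
    empty = time_per_call(dispatch_empty, repeats, clock)
    real = time_per_call(dispatch_real, repeats, clock)
    if real <= 0:
        return 0.0
    return min(empty / real, 1.0)
```

If the fraction comes out close to 1.0, batching work into fewer, larger dispatches is likely to help more than optimizing the kernel's arithmetic.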
