MiniMax M3 Attention Upgrades
LLM speed is not about raw compute. It is about memory bandwidth.
Your GPU has fast on-chip SRAM but much slower off-chip HBM. The speed gap between them is about 300 times. Every trip to HBM slows down your AI.
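A rough way to see this: compare the time to read the weights and KV cache with the time to do the math for one decoded token. Every number below (model size, cache size, bandwidth, peak FLOP/s) is a hypothetical placeholder, not a measurement of any real GPU or of MiniMax M3.

```python
def decode_step_time(param_bytes, kv_bytes, flops, mem_bw, peak_flops):
    """Return (memory_time, compute_time) in seconds for one decoded token."""
    memory_time = (param_bytes + kv_bytes) / mem_bw  # time to stream data from HBM
    compute_time = flops / peak_flops                # time to do the arithmetic
    return memory_time, compute_time


# Hypothetical placeholders, chosen only to illustrate the imbalance.
params = 7e9              # assumed parameter count
param_bytes = params * 2  # assumed 2 bytes per parameter (16-bit weights)
kv_bytes = 4e9            # assumed KV cache size for a long context
flops = 2 * params        # roughly 2 FLOPs per parameter per token
mem_bw = 2e12             # assumed HBM bandwidth, bytes per second
peak_flops = 4e14         # assumed peak compute, FLOP per second

mem_t, comp_t = decode_step_time(param_bytes, kv_bytes, flops, mem_bw, peak_flops)
print(f"memory: {mem_t * 1e3:.3f} ms, compute: {comp_t * 1e3:.3f} ms")
```

Under these assumed numbers the memory time is far larger than the compute time. That is why moving less data matters more than adding FLOPs.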
MiniMax M3 solves this with Sparse Attention.
Here is how it works:
- GQA (Grouped-Query Attention): Groups of query heads share one KV Cache, so less data moves.
- Top-K: It computes attention only for the K most relevant tokens.
- Tiling: It groups tokens into blocks of 100. This makes reading data faster.
- Outer Loop: It reads each block once for all queries. This minimizes memory I/O.
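The four ideas above can be sketched together in plain Python. This is a minimal illustration, not MiniMax's actual kernel. The function names, the mean-key block scoring rule, and the `loads` counter are assumptions made for this example. All queries passed in are treated as one GQA group sharing a single KV head.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_blocks(q, blocks_k, top_k):
    """Top-K: pick the top_k key blocks for q, scored by each block's mean key
    (an assumed heuristic, used here only for illustration)."""
    scores = []
    for b, keys in enumerate(blocks_k):
        mean_key = [sum(col) / len(keys) for col in zip(*keys)]
        scores.append((dot(q, mean_key), b))
    return {b for _, b in sorted(scores, reverse=True)[:top_k]}


def sparse_attention(queries, keys, values, block_size, top_k):
    """Block-sparse attention for one GQA group (all queries share keys/values)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    # Tiling: split the shared KV cache into fixed-size blocks.
    blocks_k = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    blocks_v = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    chosen = [select_blocks(q, blocks_k, top_k) for q in queries]

    # Running softmax state per query: max score, normalizer, weighted sum.
    m = [-math.inf] * len(queries)
    l = [0.0] * len(queries)
    acc = [[0.0] * len(values[0]) for _ in queries]
    loads = 0  # how many KV blocks were read

    # Outer loop over KV blocks: each block is read once, then reused
    # by every query that selected it.
    for b, (kb, vb) in enumerate(zip(blocks_k, blocks_v)):
        users = [i for i, c in enumerate(chosen) if b in c]
        if not users:
            continue  # no query selected this block, so it is never read
        loads += 1
        for i in users:
            s = [dot(queries[i], k) * scale for k in kb]
            new_m = max(m[i], max(s))
            corr = math.exp(m[i] - new_m)  # rescale earlier partial results
            p = [math.exp(x - new_m) for x in s]
            l[i] = l[i] * corr + sum(p)
            acc[i] = [a * corr + sum(pj * v[c] for pj, v in zip(p, vb))
                      for c, a in enumerate(acc[i])]
            m[i] = new_m

    out = [[a / l[i] for a in acc[i]] for i in range(len(queries))]
    return out, loads


# Toy data: 8 tokens, 4 dimensions, 2 query heads sharing one KV head.
keys = [[math.sin(i + j) for j in range(4)] for i in range(8)]
values = [[math.cos(i * j) for j in range(4)] for i in range(8)]
queries = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]

out, loads = sparse_attention(queries, keys, values, block_size=2, top_k=2)
print(loads)  # at most 4: blocks that no query selected are never read
```

With `top_k` set to the total number of blocks, this sketch reduces to ordinary full attention. A smaller `top_k` skips blocks entirely, which is where the savings come from.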
The results are clear:
- Compute costs dropped to 1/20 for long text.
- Pre-fill speed increased by 9.7 times.
- Decoding speed increased by 15.6 times.
- Quality matches full attention models.
AI needs more throughput to reach more people. Lower costs and faster speeds are the goal.
Source: https://dev.to/cognitalk/minimax-m3-da-mo-xing-zhu-yi-li-ji-zhi-shang-suo-zuo-de-zhong-da-dian-fu-yu-you-hua-1dcg
Optional learning community: https://t.me/GyaanSetuAi