𝗠𝗶𝗻𝗶𝗠𝗮𝘅 𝗠𝟯 𝗔𝘁𝘁𝗲𝗻𝘁𝗶𝗼𝗻 𝗨𝗽𝗴𝗿𝗮𝗱𝗲𝘀

LLM inference speed is often limited by memory bandwidth, not raw compute.

Your GPU has fast on-chip SRAM but much slower off-chip HBM. The bandwidth gap is roughly 300 times. Time spent moving data across this gap, not arithmetic, is what slows down your AI.
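
To make the bandwidth argument concrete, here is a rough back-of-envelope sketch in Python. It estimates how many bytes of key/value cache a decoder reads to generate one token, and the minimum time that read takes at a given memory bandwidth. The function names and every number in the example (layer count, head count, context length, bandwidth) are illustrative assumptions, not MiniMax M3 specifications.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes of cached keys and values read to decode one token with full attention."""
    # The leading 2 counts keys and values; bytes_per_value=2 assumes 16-bit storage.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def min_read_time_s(num_bytes, bandwidth_bytes_per_s):
    """Lower bound on read time: the data cannot arrive faster than the bandwidth allows."""
    return num_bytes / bandwidth_bytes_per_s


# Hypothetical model and hardware, chosen only to show the arithmetic.
cache = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=100_000)
per_token = min_read_time_s(cache, bandwidth_bytes_per_s=2e12)
print(f"{cache / 1e9:.1f} GB read per token, at least {per_token * 1e3:.2f} ms")
```

With these assumed numbers, reading the cache alone costs about 6.6 ms per generated token, no matter how fast the arithmetic units are. Anything that shrinks the bytes read shrinks that floor directly.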

MiniMax M3 solves this with Sparse Attention.

Here is how it works:

GQA: Groups of query heads share one set of keys and values, so the KV cache is smaller and less data moves.
Top-K: It computes attention only over the most relevant tokens.
Tiling: It groups tokens into blocks of 100. This makes reading data faster.
Outer Loop: It reads each block once and reuses it for all queries. This maximizes I/O efficiency.
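
The four ideas above can be sketched together in plain Python. This is a toy illustration, not MiniMax M3's actual kernel: scoring blocks by the mean of their keys, the tiny block size, and all function names are assumptions made for the example. The queries passed in stand for the query heads that share one KV head under GQA.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_topk_attention(queries, keys, values, block_size=4, top_k=2):
    """Toy block-sparse attention for one KV head shared by several queries."""
    dim = len(keys[0])
    # Tiling: split token positions into fixed-size blocks.
    blocks = [range(s, min(s + block_size, len(keys)))
              for s in range(0, len(keys), block_size)]
    # Assumption: summarize each block by its mean key for cheap scoring.
    summaries = [[sum(keys[i][j] for i in blk) / len(blk) for j in range(dim)]
                 for blk in blocks]
    # Top-K: each query keeps only its highest-scoring blocks.
    chosen = []
    for q in queries:
        ranked = sorted(range(len(blocks)),
                        key=lambda b: dot(q, summaries[b]), reverse=True)
        chosen.append(set(ranked[:top_k]))
    # Running softmax state per query, so blocks can be processed one at a time.
    run_max = [-math.inf] * len(queries)
    run_sum = [0.0] * len(queries)
    acc = [[0.0] * len(values[0]) for _ in queries]
    # Outer loop over KV blocks: each block is fetched once, then reused
    # by every query that selected it. Unselected blocks are never read.
    for b, blk in enumerate(blocks):
        users = [qi for qi in range(len(queries)) if b in chosen[qi]]
        if not users:
            continue
        k_blk = [keys[i] for i in blk]
        v_blk = [values[i] for i in blk]
        for qi in users:
            scores = [dot(queries[qi], k) / math.sqrt(dim) for k in k_blk]
            new_max = max(run_max[qi], max(scores))
            scale = math.exp(run_max[qi] - new_max)  # rescale earlier partial sums
            weights = [math.exp(s - new_max) for s in scores]
            run_sum[qi] = run_sum[qi] * scale + sum(weights)
            acc[qi] = [a * scale + sum(w * v[j] for w, v in zip(weights, v_blk))
                       for j, a in enumerate(acc[qi])]
            run_max[qi] = new_max
    return [[a / run_sum[qi] for a in acc[qi]] for qi in range(len(queries))]
```

Setting top_k to the total number of blocks reproduces ordinary full attention, which is a handy correctness check for the sparse path. A real kernel would do the same bookkeeping on GPU tiles held in SRAM rather than on Python lists.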

The results are clear:

Compute costs dropped to 1/20 for long text.
Pre-fill speed increased by 9.7 times.
Decoding speed increased by 15.6 times.
Quality matches that of full-attention models.

AI needs more throughput to reach more people. Lower costs and faster speeds are the goal.

Source: https://dev.to/cognitalk/minimax-m3-da-mo-xing-zhu-yi-li-ji-zhi-shang-suo-zuo-de-zhong-da-dian-fu-yu-you-hua-1dcg
Optional learning community: https://t.me/GyaanSetuAi