𝗠𝗶𝗻𝗶𝗠𝗮𝘅 𝗠𝟯: 𝗔 𝗡𝗲𝘄 𝗪𝗮𝘆 𝗧𝗼 𝗛𝗮𝗻𝗱𝗹𝗲 𝗟𝗼𝗻𝗴 𝗖𝗼𝗻𝘁𝗲𝘅𝘁
MiniMax M3 is a new multimodal model from Shanghai. It features a 1-million-token context window. Most models become too slow and expensive at this length. MiniMax M3 uses a new method called MiniMax Sparse Attention (MSA) to solve this.
How MSA works:
• Index branch: A fast pass finds the most relevant parts of the data. • Sparse branch: The model only looks at those specific parts. • Memory efficiency: It groups queries together to stop GPU memory bottlenecks.
This makes the model 9x faster at processing data and 15x faster at generating text compared to previous versions.
Performance scores:
• SWE-Bench Pro: 59.0% • Terminal-Bench 2.1: 66.0% • BrowseComp: 83.5% • KernelBench Hard: 28.8% • MCP Atlas: 74.2%
The SWE-Bench Pro score is higher than GPT-5.5 and Gemini 3.1 Pro. However, Claude Opus 4.8 still leads with 69.2%. You should note that MiniMax ran these tests on their own hardware.
Technical details:
M3 is trained on text, images, and video together. It can operate desktop computers. In tests, it optimized a CUDA kernel on NVIDIA hardware.
You can use M3 in three ways:
- MiniMax Platform API: It works with existing OpenAI code.
- OpenRouter: Good if you do not want a direct MiniMax account.
- Self-hosting: You need vLLM or SGLang support for the MSA architecture.
Pricing:
The cost is $0.60 per million input tokens and $2.40 per million output tokens. A launch discount brings these prices down to $0.30 and $1.20. This is much cheaper than Claude Opus.
Three things to remember:
- Context is not memory. You still need external memory for long-term agent tasks.
- Verify benchmarks. Wait for third-party tests before you switch your entire system.
- Data privacy. MiniMax is based in Shanghai. Consider this if you handle sensitive data.
Optional learning community: https://t.me/GyaanSetuAi