𝗠𝗶𝘅𝘁𝘂𝗿𝗲 𝗼𝗳 𝗘𝘅𝗽𝗲𝗿𝘁𝘀 (𝗠𝗼𝗘): 𝗛𝗼𝘄 𝗜𝘁 𝗪𝗼𝗿𝗸𝘀 𝗮𝗻𝗱 𝗪𝗵𝗲𝗻 𝘁𝗼 𝗨𝘀𝗲 𝗜𝘁
You want to scale from a 7B model to a 70B model without buying four more GPUs.
Someone suggests Mixture of Experts (MoE). They claim you get 70B performance with only 7B compute.
It sounds like a free lunch. But there is a catch.
How does it work?
A dense transformer like Llama 3.2 uses 100 percent of its parameters for every token. If you scale from 7B to 70B, both memory and compute grow by roughly 10x.
MoE decouples the two. The model stores more parameters (higher memory cost) but uses only a fraction of them for each token (lower compute cost).
The Trade-off:
• Dense 7B: 7B total params | 7B active | 7B compute | 14 GB memory
• Dense 70B: 70B total params | 70B active | 70B compute | 140 GB memory
• MoE 45B: 45B total params | ~13B active | ~14B compute | ~90 GB memory
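The catch: You still pay the memory cost of a large model. The figures above assume 16-bit weights, about 2 bytes per parameter. At that precision you cannot run Mixtral 8x7B on a single 24 GB GPU. You need enough VRAM to hold all the experts, even the ones not being used for the current token.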
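The memory numbers in the table follow from simple arithmetic. Here is a minimal sketch, assuming 16-bit weights (2 bytes per parameter) and counting only the weights, not activations or KV cache. The function name `estimate_cost` is made up for illustration.

```python
def estimate_cost(total_params_b, active_params_b, bytes_per_param=2):
    """Rough weight-memory estimate for a model.

    total_params_b and active_params_b are in billions of parameters.
    bytes_per_param=2 is an assumption (16-bit weights).
    """
    # 1 billion params * N bytes each = N GB (decimal gigabytes)
    memory_gb = total_params_b * bytes_per_param
    # Share of the stored parameters that actually run for each token
    active_fraction = active_params_b / total_params_b
    return memory_gb, active_fraction


# (total params, active params) in billions, taken from the table above
models = {
    "Dense 7B": (7, 7),
    "Dense 70B": (70, 70),
    "MoE 45B": (45, 13),
}

for name, (total, active) in models.items():
    memory_gb, fraction = estimate_cost(total, active)
    print(f"{name}: {memory_gb} GB of weights, {fraction:.0%} of params active per token")
```

The MoE row is the interesting one: fewer than a third of its parameters run per token, but all of them must sit in memory.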
The Architecture:
In a sparse MoE, the standard Feed-Forward Network (FFN) is replaced by multiple "expert" FFNs and a learned router.
- The router takes a token.
- It assigns a score to each expert.
- It selects the top-k experts (for Mixtral, k=2).
- It runs the token through those experts only.
- It combines the results.
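The router is not a manual scheduler. It is a learned layer, trained together with the experts. In principle it can learn to send math tokens to one expert and code tokens to another, although in practice the specialization is often less clean than that.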
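The routing steps can be sketched in a few lines of plain Python. This is a toy illustration, not any library's implementation: the router is a single weight matrix, the "experts" are stand-in functions that just scale the input, and the gate weights come from a softmax over the selected experts' scores only (one common choice, assumed here). All names are hypothetical.

```python
import math


def softmax(xs):
    # Subtract the max for numerical stability
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(token, router_weights, experts, k=2):
    """Route one token vector through the top-k experts."""
    # 1. The router assigns a score (logit) to each expert
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # 2. Select the k highest-scoring experts
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # 3. Normalize the selected scores into gate weights
    gates = softmax([logits[i] for i in top])
    # 4. Run the token through the selected experts only, then combine
    out = [0.0] * len(token)
    for gate, i in zip(gates, top):
        expert_out = experts[i](token)
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out, top


# Toy setup: 4 experts, each one scales the token by a different constant
experts = [lambda t, s=s: [s * v for v in t] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [-1.0, 0.0]]

output, chosen = moe_forward([1.0, 0.0], router_weights, experts, k=2)
print("chosen experts:", chosen)
print("combined output:", output)
```

Experts 0 and 3 are never called for this token, which is where the compute saving comes from. In a real model the router weights are learned and each expert is a full FFN.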
The Training Challenge:
The biggest risk is router collapse. Without help, the router might send every token to the same two experts. Those experts get better, so the router sends even more traffic to them. The other experts become useless.
Engineers use an auxiliary load-balancing loss to fix this. It penalizes the model if it does not use all experts equally.
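The post does not say which loss is meant. One widely used form is the Switch Transformer style loss: the number of experts times the sum, over experts, of (fraction of tokens sent to that expert) times (average router probability for that expert). A minimal sketch under that assumption, with top-1 assignments and hypothetical names:

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Switch-Transformer-style auxiliary loss (illustrative sketch).

    router_probs: one probability list per token (each sums to 1)
    assignments:  index of the expert each token was sent to
    """
    num_tokens = len(router_probs)
    # f[i]: fraction of tokens dispatched to expert i
    f = [0.0] * num_experts
    for expert in assignments:
        f[expert] += 1.0 / num_tokens
    # p[i]: mean router probability given to expert i
    p = [sum(row[i] for row in router_probs) / num_tokens
         for i in range(num_experts)]
    # Equals 1.0 when perfectly balanced, grows toward num_experts on collapse.
    # In training this is usually scaled by a small coefficient.
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Balanced: 4 tokens spread evenly over 4 experts
balanced = load_balancing_loss([[0.25] * 4] * 4, [0, 1, 2, 3], 4)

# Collapsed: every token goes to expert 0 with full confidence
collapsed = load_balancing_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [0, 0, 0, 0], 4)

print("balanced:", balanced, "collapsed:", collapsed)
```

The collapsed router scores four times worse than the balanced one here, so adding this term to the training loss pushes traffic back toward the idle experts.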
When to avoid MoE:
• You need consistent latency: MoE has higher variance in response times.
• You have limited VRAM: If you only have one GPU under 48 GB, stick to dense models.
• You are building tiny models: If your model is under 3B parameters, the overhead is too high.
• You need simple infrastructure: MoE requires complex expert parallelism and custom kernels.
MoE is best when you target a dense baseline above 30B parameters and have the memory to support it.
Optional learning community: https://t.me/GyaanSetuAi