Mixture of Experts (MoE): How It Works and When to Use It

AI-assisted draft.

GyaanSetu Editorial, 3 weeks ago, 2 min read

You want to scale from a 7B model to a 70B model without buying four more GPUs.

Someone suggests Mixture of Experts (MoE). They claim you get 70B performance with only 7B compute.

It sounds like a free lunch. But there is a catch.

How does it work?

A dense transformer like Llama 3.2 uses all of its parameters for every token. If you scale a dense model from 7B to 70B parameters, memory and per-token compute both grow by roughly 10x.

MoE decouples these two costs. The model stores more parameters (higher memory cost) but uses only a fraction of them for each token (lower compute cost).
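The split between stored and used parameters can be made concrete with a rough count. This is a minimal sketch under stated assumptions: the function name `moe_param_counts` and all sizes below are hypothetical, attention and embedding weights are lumped into `shared_params`, and the small router weights are ignored.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k, num_layers):
    """Return (total, active) parameter counts for a sparse MoE model.

    total  -> parameters that must be held in memory
    active -> parameters actually used to process one token
    """
    total = shared_params + num_layers * num_experts * params_per_expert
    active = shared_params + num_layers * top_k * params_per_expert
    return total, active


# Hypothetical configuration, chosen only for illustration.
total, active = moe_param_counts(
    shared_params=2_000_000_000,    # attention, embeddings, norms (assumed)
    params_per_expert=175_000_000,  # one expert FFN in one layer (assumed)
    num_experts=8,
    top_k=2,
    num_layers=32,
)
print(f"total:  {total / 1e9:.1f}B parameters held in memory")
print(f"active: {active / 1e9:.1f}B parameters used per token")
```

With these assumed sizes, memory scales with all 8 experts per layer while per-token compute scales with only the 2 that are selected.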

The Trade-off:

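The catch: You still pay the memory cost of a large model. Mixtral 8x7B has about 47B parameters in total, even though only about 13B are active per token, so at 16-bit precision it does not fit on a single 24 GB GPU. You need enough VRAM to hold all the experts, even the ones a given token never touches.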
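A back-of-the-envelope estimate shows why. The sketch below counts weight memory only, assuming roughly 46.7B total parameters for Mixtral 8x7B; the helper name `weight_memory_gb` is hypothetical, and real deployments also need room for the KV cache, activations, and any quantization overhead.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


MIXTRAL_8X7B_PARAMS = 46.7e9  # approximate total parameter count (assumed)

# Bytes per parameter for a few common storage precisions.
for name, nbytes in [("16-bit", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    gb = weight_memory_gb(MIXTRAL_8X7B_PARAMS, nbytes)
    print(f"{name:>6}: about {gb:.1f} GB for weights alone")
```

Even at 4 bits per parameter, the weights alone come close to 24 GB, which leaves almost nothing for the KV cache and activations.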

The Architecture:

In a sparse MoE, the standard Feed-Forward Network (FFN) is replaced by multiple "expert" FFNs and a learned router.

The router takes a token's hidden state.
It assigns a score to each expert.
It selects the top-k experts (for Mixtral, k=2).
Only those experts process the token.
Their outputs are combined as a weighted sum, using the router's normalized scores as weights.
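The steps above can be sketched in NumPy. This is an illustrative toy, not Mixtral's implementation: the names `moe_layer` and `make_expert` are hypothetical, the experts are plain ReLU FFNs with random weights, and the gate weights come from a softmax over the selected experts' scores, which is one common choice.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def make_expert(rng, d_model, d_hidden):
    """A toy expert: a two-layer ReLU FFN with random weights (assumed shape)."""
    w1 = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
    w2 = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)
    return lambda x: np.maximum(x @ w1, 0.0) @ w2


def moe_layer(tokens, router_w, experts, top_k=2):
    """tokens: (n_tokens, d_model); router_w: (d_model, n_experts)."""
    logits = tokens @ router_w                         # one score per expert
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of the top-k experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    gates = softmax(top_logits, axis=-1)               # weights over chosen experts
    out = np.zeros_like(tokens)
    for e, expert in enumerate(experts):
        token_ids, slot = np.nonzero(top_idx == e)     # tokens routed to expert e
        if token_ids.size == 0:
            continue                                   # unused expert: no compute
        weights = gates[token_ids, slot][:, None]
        out[token_ids] += weights * expert(tokens[token_ids])
    return out, top_idx


rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 16, 32, 8
tokens = rng.normal(size=(5, d_model))
router_w = rng.normal(size=(d_model, n_experts))
experts = [make_expert(rng, d_model, d_hidden) for _ in range(n_experts)]

out, top_idx = moe_layer(tokens, router_w, experts, top_k=2)
print(out.shape)  # same shape as the input tokens
print(top_idx)    # which two experts each token used
```

Each expert only sees the tokens routed to it, which is where the compute saving comes from. All eight experts still have to exist in memory.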

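The router is not a manual scheduler. It is a learned layer, trained together with the rest of the model. In principle it can learn to send math tokens to one expert and code tokens to another, although in practice the learned specialization is often less clean than that and may follow syntax more than topic.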

The Training Challenge:

The biggest risk is router collapse. Without help, the router might send every token to the same two experts. Those experts get better, so the router sends even more traffic to them. The other experts become useless.

Engineers use an auxiliary load-balancing loss to fix this. It is added to the training objective and penalizes the router when it concentrates tokens on a few experts instead of spreading them roughly evenly.
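One widely used formulation, in the style of the Switch Transformer, multiplies each expert's share of routed tokens by its average router probability and sums over experts. The sketch below assumes that formulation. The source does not say which variant is meant, and the function name `load_balancing_loss` is hypothetical.

```python
import numpy as np


def load_balancing_loss(router_probs, top_idx):
    """Switch-style auxiliary loss (assumed formulation).

    router_probs: (n_tokens, n_experts) softmax output of the router
    top_idx:      (n_tokens, top_k) experts actually selected per token
    """
    n_tokens, n_experts = router_probs.shape
    counts = np.bincount(top_idx.ravel(), minlength=n_experts)
    f = counts / top_idx.size      # fraction of routing slots given to each expert
    p = router_probs.mean(axis=0)  # mean router probability per expert
    return n_experts * float(np.sum(f * p))


n_experts = 4

# Balanced routing: uniform probabilities, tokens spread evenly.
balanced_probs = np.full((8, n_experts), 1 / n_experts)
balanced_idx = np.arange(8).reshape(8, 1) % n_experts

# Collapsed routing: every token goes to expert 0 with high probability.
collapsed_probs = np.tile([0.97, 0.01, 0.01, 0.01], (8, 1))
collapsed_idx = np.zeros((8, 1), dtype=int)

print("balanced: ", load_balancing_loss(balanced_probs, balanced_idx))
print("collapsed:", load_balancing_loss(collapsed_probs, collapsed_idx))
```

Perfectly even routing gives a loss of 1.0, and the value grows toward the number of experts as routing collapses. In real training this term is scaled by a small coefficient and added to the main loss.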

When to avoid MoE:

• You need consistent latency: MoE has higher variance in response times.
• You have limited VRAM: If you only have one GPU under 48 GB, stick to dense models.
• You are building tiny models: If your model is under 3B parameters, the overhead is too high.
• You need simple infrastructure: MoE requires complex expert parallelism and custom kernels.

MoE is best when you target a dense baseline above 30B parameters and have the memory to support it.

Source: https://dev.to/tech_nuggets/mixture-of-experts-moe-what-it-actually-does-under-the-hood-and-when-it-pays-off-alb

Optional learning community: https://t.me/GyaanSetuAi