DiffusionGemma: 1,000 Tokens Per Second
Most language models generate text one token at a time, from left to right. This creates a speed limit: each token depends on the ones before it, so the model cannot start the next token until the current one is finished.
Google DeepMind changed this with DiffusionGemma.
Instead of writing sequentially, it uses a denoising process. It takes a block of up to 256 tokens and refines them all in parallel over several passes. This approach achieves over 1,000 tokens per second on a single NVIDIA H100, about four times faster than standard autoregressive models.
How it works:
- The model starts with a block of placeholder tokens.
- It runs multiple passes to clean up these placeholders.
- Every token looks at every other token in the block at the same time.
- This bidirectional view helps the model understand context from both sides.
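The steps above can be sketched with a toy example. This is not DiffusionGemma's actual algorithm or API: `toy_denoiser` is a hypothetical stand-in that already knows the answer, and the confidence rule and the "commit an equal share per pass" schedule are assumptions chosen only to show the shape of the loop.

```python
import random

MASK = "<mask>"

def toy_denoiser(block, target, rng):
    """Hypothetical stand-in for the model.

    For every masked slot it proposes a token plus a confidence score.
    Confidence rises when a neighbour on either side is already filled in,
    which mimics the bidirectional view described above.
    """
    proposals = {}
    for i, tok in enumerate(block):
        if tok != MASK:
            continue
        left_known = i > 0 and block[i - 1] != MASK
        right_known = i < len(block) - 1 and block[i + 1] != MASK
        confidence = rng.random() * 0.2 + 0.4 * left_known + 0.4 * right_known
        proposals[i] = (target[i], confidence)
    return proposals

def denoise_block(target, passes=4, seed=0):
    """Fill a fully masked block in a fixed number of parallel passes."""
    rng = random.Random(seed)
    block = [MASK] * len(target)          # start from placeholders only
    history = [list(block)]
    for step in range(passes):
        proposals = toy_denoiser(block, target, rng)
        if not proposals:
            break
        # Commit an equal share of the remaining slots, most confident first.
        remaining_passes = passes - step
        keep = -(-len(proposals) // remaining_passes)  # ceiling division
        most_confident = sorted(
            proposals, key=lambda i: proposals[i][1], reverse=True
        )[:keep]
        for i in most_confident:
            block[i] = proposals[i][0]
        history.append(list(block))
    return block, history

target = "the quick brown fox jumps over the lazy".split()
final, history = denoise_block(target, passes=4)
for step, state in enumerate(history):
    print(step, " ".join(state))
```

In this toy run, eight tokens are filled in four passes instead of eight sequential steps. A real model would predict the tokens itself rather than read them from `target`.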
Hardware performance:
• NVIDIA H100: 1,000+ tokens/second
• NVIDIA DGX Station: up to 2,000 tokens/second
• GeForce RTX 5090: ~700 tokens/second
• VRAM needed: ~18 GB when quantized
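To put these rates in everyday terms, here is a small calculation of how long a response would take on each device. The throughput values are the ones quoted above, treated as flat rates. The 1,024-token response length is an assumption (four 256-token blocks), so read the results as rough estimates.

```python
# Throughput figures quoted in the list above (tokens per second).
THROUGHPUT = {
    "NVIDIA H100": 1000,
    "NVIDIA DGX Station": 2000,
    "GeForce RTX 5090": 700,
}

RESPONSE_TOKENS = 1024  # assumed response length: four 256-token blocks

def seconds_to_generate(tokens, tokens_per_second):
    """Time to produce `tokens` at a constant generation rate."""
    return tokens / tokens_per_second

for name, rate in THROUGHPUT.items():
    print(f"{name}: {seconds_to_generate(RESPONSE_TOKENS, rate):.2f} s")
```

Under these assumptions, a 1,024-token answer takes about 1.0 s on the H100, 0.5 s on the DGX Station, and 1.5 s on the RTX 5090.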
Where to use it:
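DiffusionGemma excels in local settings. In the cloud, providers batch many users together to keep their GPUs busy. On your own computer there is only one user, so the GPU spends most of each step waiting for model weights to arrive from memory instead of computing. DiffusionGemma addresses this by processing a whole block of tokens per pass, which turns a memory-bound workload into a compute-bound one.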
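The memory-versus-compute point can be made concrete with a rough upper-bound estimate. It rests on loud assumptions: every pass must read all model weights from memory once, the ~18 GB VRAM figure is treated as pure weights, 1,000 GB/s is a round illustrative bandwidth rather than the spec of any card, and 16 passes per block is a hypothetical value (the post does not state the real number).

```python
def autoregressive_ceiling(bandwidth_gb_s, weights_gb):
    """Memory-bound limit for one-token-at-a-time decoding with one user.

    Each token needs one full read of the weights, so the rate cannot
    exceed (bandwidth / weight size) tokens per second.
    """
    return bandwidth_gb_s / weights_gb

def block_ceiling(bandwidth_gb_s, weights_gb, block_tokens, passes):
    """Memory-bound limit when each weight read refines a whole block.

    `passes` full reads of the weights produce `block_tokens` tokens.
    Compute limits are ignored here, so real throughput will be lower.
    """
    passes_per_second = bandwidth_gb_s / weights_gb
    return passes_per_second * block_tokens / passes

BANDWIDTH_GB_S = 1000.0  # assumed round number, not a hardware spec
WEIGHTS_GB = 18.0        # from the ~18 GB quantized figure, treated as weights
BLOCK_TOKENS = 256       # block size quoted in the post
PASSES = 16              # hypothetical number of denoising passes

ar = autoregressive_ceiling(BANDWIDTH_GB_S, WEIGHTS_GB)
blk = block_ceiling(BANDWIDTH_GB_S, WEIGHTS_GB, BLOCK_TOKENS, PASSES)
print(f"autoregressive ceiling: {ar:.1f} tokens/s")
print(f"block-wise ceiling:     {blk:.1f} tokens/s")
```

With these made-up inputs, the memory-bound ceiling rises from roughly 56 to roughly 890 tokens per second. In practice the ceiling would not be reached, because once memory stops being the limit, the GPU's compute capacity becomes the limit instead.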
Use it for:
- Code infilling: Adding code to the middle of a function.
- Text editing: Changing a sentence inside a paragraph.
- Constraint tasks: Solving puzzles or math where the whole block must fit together.
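Infilling and editing share one idea: the surrounding text stays fixed while only the gap is denoised. Here is a minimal sketch of that setup. The function names, the `<mask>` placeholder, and the dictionary of proposals are all hypothetical and do not come from any DiffusionGemma API.

```python
MASK = "<mask>"

def make_infill_block(prefix, suffix, gap):
    """Build a block with fixed context on both sides of a masked gap."""
    block = prefix + [MASK] * gap + suffix
    editable = set(range(len(prefix), len(prefix) + gap))
    return block, editable

def apply_proposals(block, editable, proposals):
    """Write proposed tokens into the gap, leaving the fixed context alone."""
    out = list(block)
    for i, tok in proposals.items():
        if i in editable and out[i] == MASK:
            out[i] = tok
    return out

prefix = ["def", "area", "(", "r", ")", ":"]
suffix = ["return", "a"]
block, editable = make_infill_block(prefix, suffix, gap=3)

# Hypothetical model output: position 0 lies outside the gap and is ignored.
proposals = {0: "class", 6: "a", 7: "=", 8: "3.14*r*r"}
filled = apply_proposals(block, editable, proposals)
print(" ".join(filled))
```

Because the suffix is visible while the gap is filled, the model can choose a variable name that matches the `return a` that follows. A strictly left-to-right model cannot see that far ahead.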
The trade-off is quality. Benchmarks show DiffusionGemma scores lower than standard Gemma 4 in reasoning and coding. Language is harder to diffuse than images: a slightly wrong pixel goes unnoticed, but one wrong word can ruin a whole sentence.
The verdict:
Use DiffusionGemma if you need speed on local hardware. Use standard Gemma 4 if you need the highest accuracy and deep reasoning.