Speculative Decoding: Faster LLM Inference
Your GPU utilization is high. Your latency is still bad. You think you need a bigger box.
You are wrong. Decoding is memory-bandwidth-bound. For every new token the GPU rereads the full weights and KV cache, so it spends more time moving data than computing.
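Speculative decoding fixes this. It turns a one-token-per-pass pipeline into a multi-token one. The source article reports p50 TTFT falling from 380 ms to 140 ms on a 70B model on the same hardware. Expect the gain mainly in per-token decode latency, since speculation does not speed up prefill.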
How it works:
- A small draft model guesses K tokens.
- The large target model verifies them in one forward pass.
- The target accepts or rejects the guesses.
- You get the same output quality. You get more speed.
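The steps above can be sketched as a toy loop. This is a minimal illustration of the greedy variant only, not a production implementation: the `NextToken` interface and the toy models are hypothetical, and a real engine verifies all K guesses in one batched forward pass instead of the Python loop used here.

```python
from typing import Callable, List

# Hypothetical model interface (an assumption for illustration):
# given a token sequence, return the greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-and-verify cycle and return the tokens it emits."""
    # 1. The draft model guesses k tokens, one after another.
    guesses: List[int] = []
    for _ in range(k):
        guesses.append(draft(context + guesses))

    # 2. The target checks each position. A real engine gets all of these
    #    from a single forward pass; this loop only mimics the result.
    emitted: List[int] = []
    for i in range(k):
        expected = target(context + guesses[:i])
        if guesses[i] != expected:
            # 3. First mismatch: keep the target's own token and stop.
            emitted.append(expected)
            return emitted
        emitted.append(guesses[i])

    # All k guesses accepted: the same pass also yields one bonus token.
    emitted.append(target(context + guesses))
    return emitted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             k: int, max_new_tokens: int) -> List[int]:
    """Repeat cycles until max_new_tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + max_new_tokens]


def greedy(target: NextToken, prompt: List[int], n: int) -> List[int]:
    """Baseline: plain one-token-at-a-time decoding with the target."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out


# Toy models: the target counts upward mod 10; the draft agrees
# except that it guesses wrong right after a 4.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10
```

Because every emitted token is either confirmed or supplied by the target, `generate(toy_target, toy_draft, ...)` produces the same sequence as `greedy(toy_target, ...)`. With sampling, engines use a rejection-sampling rule in place of the exact-match check.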
You trade VRAM and engineering work for speed.
Best methods:
- EAGLE: Best for general use. It predicts hidden states.
- MTP: Built into the model. No separate draft model to serve, but the model must ship with multi-token prediction heads.
- N-gram: Fast prompt lookup. Best for code and JSON.
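To make the n-gram method concrete, here is a rough sketch of prompt lookup. The function name `ngram_draft` and its parameters are hypothetical, chosen for illustration. It finds the most recent earlier occurrence of the last `n` tokens and proposes whatever followed it.

```python
from typing import List


def ngram_draft(tokens: List[int], n: int = 3, k: int = 5) -> List[int]:
    """Propose up to k tokens by copying what followed the most recent
    earlier occurrence of the last n tokens. Returns [] when no match."""
    if len(tokens) <= n:
        return []
    tail = tokens[-n:]
    # Scan backwards, starting just before the tail itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            return tokens[start + n:start + n + k]
    return []
```

This is why the method suits code and JSON: identifiers, keys, and boilerplate repeat, so the lookup often hits. When it misses, the proposal is empty and the step falls back to normal decoding.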
The one number to track is mean accepted tokens per cycle (mu).
- Mu 4 to 5: High speedup.
- Mu below 2: Not worth the cost.
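A back-of-envelope way to see why these thresholds matter. This sketch is not from the source. It assumes mu counts every token a cycle emits, that one cycle costs one target pass plus K draft passes, and that `draft_cost` (a hypothetical parameter) is the draft pass time as a fraction of a target pass. Real numbers depend on batch size and the serving engine.

```python
from typing import Sequence


def mean_accepted(tokens_per_cycle: Sequence[int]) -> float:
    """mu: average number of tokens emitted per draft-and-verify cycle."""
    return sum(tokens_per_cycle) / len(tokens_per_cycle)


def estimated_speedup(mu: float, k: int, draft_cost: float) -> float:
    """Rough decode speedup over one-token-at-a-time decoding.

    Assumption: cycle time = 1 target pass + k draft passes, with each
    draft pass costing draft_cost target passes.
    """
    return mu / (1.0 + k * draft_cost)
```

Under these assumptions, a low mu paired with an expensive draft can push the estimate below 1, which means speculation makes decoding slower.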
Avoid these traps:
- Tokenizer mismatch between models.
- Different chat templates.
- Too many speculative tokens. Every rejected guess is wasted draft and verify work.
- Benchmarking only with pure greedy decoding at temperature 0. Acceptance usually drops once you sample at production temperatures.
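The tokenizer trap is cheap to check before deployment. A minimal sketch, assuming you can export each model's vocabulary as a token-to-id dict (with Hugging Face tokenizers this would typically come from `get_vocab()`). The helper name `vocab_mismatches` is hypothetical.

```python
from typing import Dict, List


def vocab_mismatches(draft_vocab: Dict[str, int],
                     target_vocab: Dict[str, int]) -> List[str]:
    """Return tokens whose ids differ or that exist in only one vocab."""
    all_tokens = set(draft_vocab) | set(target_vocab)
    return sorted(t for t in all_tokens
                  if draft_vocab.get(t) != target_vocab.get(t))
```

An empty result does not prove the pair is compatible (chat templates and special-token handling can still differ), but a non-empty one is a clear warning sign.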
Skip this if:
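- You need high throughput over low latency. Large batches are already compute-bound, so draft and verify work competes with real requests.
- You lack a compatible draft model and your workload does not suit n-gram lookup.
- Your outputs are short or high-entropy, so the draft rarely guesses right.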
Measure your acceptance rate before you go to production.
Source: https://dev.to/tech_nuggets/speculative-decoding-when-and-why-it-actually-speeds-up-inference-5pl
Optional learning community: https://t.me/GyaanSetuAi