I Benchmarked Speculative Decoding — a = 3.5 Wasn't Enough
Speculative Decoding (SD) relies on a simple math rule: a > 1 + α + β
The acceptance length (a) must beat 1 plus the compute ratio (α) and the verification overhead (β). If it does, SD wins. If not, it loses.
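The rule is easy to check in code. This is a minimal sketch: the function name `sd_wins` and the α and β values below are hypothetical, chosen only to show both outcomes, and are not measurements from this benchmark.

```python
def sd_wins(a: float, alpha: float, beta: float) -> bool:
    """Return True when the rule a > 1 + alpha + beta holds."""
    return a > 1.0 + alpha + beta


# Hypothetical overheads, NOT measured values from the benchmark.
print(sd_wins(a=3.5, alpha=0.1, beta=0.2))  # cheap draft model: rule holds
print(sd_wins(a=3.5, alpha=1.5, beta=1.2))  # costly draft model: rule fails
```

The same a = 3.5 passes or fails depending entirely on the two overhead terms.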
I tested this theory on a real machine. I used a 12th Gen Intel CPU with 64GB RAM. I paired a small Qwen2.5-0.5B draft model with a larger Qwen2.5-1.5B target model.
The results were surprising. SD was 49% to 62% slower than raw generation.
Here is how acceptance length (a) varied by task:
- JSON (Structured): a = 3.50. The draft model predicted the format well.
- Code (Semi-structured): a = 3.00. Good, but naming patterns varied.
- Story (Creative): a = 2.11. The draft model struggled with word choices.
Even when "a" was high, SD failed on the CPU. Why?
The biggest issue was the zero-accept rate. Between 15% and 30% of the rounds accepted zero tokens.
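In these rounds, the draft model runs, the target model verifies, and every drafted token is rejected. You pay for a draft pass and a verification pass, yet you get only the single token the target model would have produced on its own. For those rounds, SD costs about 2x as much for the same output.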
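If your SD loop logs how many draft tokens were accepted in each round, the zero-accept rate is a one-line calculation. This is a rough sketch: the helper name `zero_accept_rate` and the sample log are made up for illustration and are not data from the benchmark.

```python
def zero_accept_rate(accepted_per_round: list[int]) -> float:
    """Fraction of rounds in which no draft token was accepted."""
    if not accepted_per_round:
        raise ValueError("need at least one round")
    zero_rounds = sum(1 for n in accepted_per_round if n == 0)
    return zero_rounds / len(accepted_per_round)


# Made-up log: accepted draft tokens for ten consecutive rounds.
rounds = [4, 0, 3, 5, 0, 2, 4, 0, 3, 4]
print(f"zero-accept rate: {zero_accept_rate(rounds):.0%}")
```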
This highlights why SD is a GPU optimization.
On a GPU, the draft model is nearly free. The compute ratio (α) is tiny. On a CPU, the draft model competes for memory bandwidth. It is not free. The inequality collapses on CPU.
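Rearranging the rule gives an overhead budget: α + β must stay below a - 1. The sketch below applies that arithmetic to the three acceptance lengths reported above. The helper name `overhead_budget` is hypothetical, and the post does not report measured α or β values, so this only shows how much room each task leaves.

```python
# Acceptance lengths reported in the benchmark, by task type.
ACCEPTANCE_LENGTH = {"JSON": 3.50, "Code": 3.00, "Story": 2.11}


def overhead_budget(a: float) -> float:
    """Upper bound on alpha + beta for a > 1 + alpha + beta to hold."""
    return a - 1.0


for task, a in ACCEPTANCE_LENGTH.items():
    print(f"{task}: alpha + beta must stay below {overhead_budget(a):.2f}")
```

If the rule holds as stated, SD losing even on JSON suggests the combined overhead on this CPU exceeded 2.50.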
Do not use SD on a CPU setup like this one. The numbers do not work.
Key takeaways for your deployments:
- Measure your own "a" value. Do not trust vendor claims.
- Split your data by task type. Code and chat have different acceptance rates.
- Watch the zero-accept rate. High variance ruins your p99 latency.
- Use SD on GPUs where the draft model cost is minimal.
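The first three takeaways can be combined into one small report. This is a sketch under assumptions: it presumes you can log one (task, accepted tokens) pair per SD round, and both `summarize_by_task` and the sample records are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_task(records):
    """records: iterable of (task, accepted_tokens) pairs, one per round."""
    by_task = defaultdict(list)
    for task, accepted in records:
        by_task[task].append(accepted)
    return {
        task: {
            "rounds": len(counts),
            "mean_accepted": mean(counts),
            "zero_accept_rate": counts.count(0) / len(counts),
        }
        for task, counts in by_task.items()
    }


# Made-up sample; replace with logs from your own workload.
records = [("json", 4), ("json", 3), ("json", 0), ("json", 5),
           ("story", 0), ("story", 2), ("story", 0), ("story", 2)]
for task, stats in summarize_by_task(records).items():
    print(task, stats)
```

Keeping the tasks separate matters because a single blended average can hide a task type where the draft model rarely agrees with the target.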
The best optimization is knowing when to turn it off.
Source: https://dev.to/zxpmail/i-benchmarked-speculative-decoding-a-35-wasnt-enough-1geb
