Llama.cpp Now Matches vLLM Speed
vLLM used to lead multi-GPU speed. llama.cpp was slower. Build b9455 changes this.
You now get about 70 tokens per second on two RTX 3090 GPUs. That figure was measured with Qwen 27B at UQ8 quantization.
The -sm tensor flag is the key. It distributes work across both GPUs. The old way left one card idle. This new way uses both cards at once.
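To make the idea of splitting one layer's work across two cards concrete, here is a toy sketch in Python. It is not how llama.cpp implements -sm tensor. Plain lists stand in for GPU memory, and the helper names (matvec, split_rows) are made up for illustration. It only shows that two devices can each compute half of one matrix product and still give the same result when the halves are joined.

```python
# Toy illustration of tensor splitting; plain Python lists stand in for GPU memory.
# The helper names are hypothetical and do not come from llama.cpp.

def matvec(weights, x):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def split_rows(weights, shares):
    """Cut the rows of one weight matrix into consecutive slices sized by shares."""
    total = sum(shares)
    slices, start = [], 0
    for i, share in enumerate(shares):
        if i == len(shares) - 1:
            end = len(weights)  # the last slice takes whatever is left
        else:
            end = start + round(len(weights) * share / total)
        slices.append(weights[start:end])
        start = end
    return slices

weights = [[1, 2], [3, 4], [5, 6], [7, 8]]  # one layer's weight matrix
x = [10, 1]                                 # input activations

gpu0, gpu1 = split_rows(weights, [50, 50])  # mirrors a 50,50 split

# Each "device" computes its slice of the output, then the slices are joined.
combined = matvec(gpu0, x) + matvec(gpu1, x)

print(combined == matvec(weights, x))
```

Both halves of the work are independent, so on real hardware they can run at the same time. That is the intuition behind using both cards at once.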
Use these settings:
- --tensor-split 50,50 -sm tensor
- --flash-attn on
- --cache-type-k q8_0 --cache-type-v q8_0
- --spec-type draft-mtp --spec-draft-n-max 3
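The settings above can be assembled into one command line. The sketch below builds it in Python and prints it, so you can check it before running anything. Several parts are assumptions and not from the post: the llama-server binary as the entry point, the model file name, the context size, and the -ngl 99 offload setting. Only the flags in the list above come from the post.

```python
import shlex

# Hypothetical values: adjust these to your own setup.
MODEL_PATH = "models/qwen-27b-uq8.gguf"  # placeholder file name, not from the post
CONTEXT_SIZE = 32768                     # assumed; the post only mentions a 27K prompt

def build_command(model_path, context_size):
    """Return a llama-server argument list using the flags listed in the post."""
    return [
        "llama-server",                  # assumed entry point
        "-m", model_path,
        "-c", str(context_size),
        "-ngl", "99",                    # assumed: offload all layers to the GPUs
        # Flags below are taken from the post as written.
        "--tensor-split", "50,50", "-sm", "tensor",
        "--flash-attn", "on",
        "--cache-type-k", "q8_0", "--cache-type-v", "q8_0",
        "--spec-type", "draft-mtp", "--spec-draft-n-max", "3",
    ]

command = build_command(MODEL_PATH, CONTEXT_SIZE)
print(shlex.join(command))
```

Keeping the arguments in a list makes it easy to change one setting at a time when you compare speeds.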
Why this matters for you:
- Generation speed stays between 67 and 81 tokens per second.
- Prefill is fast. A 27K-token context is processed in 18.8 seconds.
- Quality is better. Q8 quantization avoids the coding errors that lower-bit quants make.
Lower-bit quants cause bugs in generated code. They produce wrong variable names. They create loop errors. Q8 avoids these failures.
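The speed and prefill figures in the list above can be turned into a rough wait time. The sketch below does the arithmetic. It assumes "27K" means 27,000 tokens, and the 1,000-token reply length is a made-up example. Under those assumptions, prefill works out to roughly 1,400 tokens per second.

```python
PROMPT_TOKENS = 27_000   # assumed: "27K" read as 27,000 tokens
PREFILL_SECONDS = 18.8   # from the post
GEN_SPEED_LOW = 67.0     # tokens per second, from the post
GEN_SPEED_HIGH = 81.0    # tokens per second, from the post
REPLY_TOKENS = 1_000     # hypothetical reply length

def prefill_rate(tokens, seconds):
    """Prompt tokens processed per second during prefill."""
    return tokens / seconds

def total_seconds(prefill_s, reply_tokens, gen_speed):
    """Prefill time plus the time to generate the reply at a given speed."""
    return prefill_s + reply_tokens / gen_speed

rate = prefill_rate(PROMPT_TOKENS, PREFILL_SECONDS)
slowest = total_seconds(PREFILL_SECONDS, REPLY_TOKENS, GEN_SPEED_LOW)
fastest = total_seconds(PREFILL_SECONDS, REPLY_TOKENS, GEN_SPEED_HIGH)

print(f"prefill rate: {rate:.0f} tokens/s")
print(f"full response time: {fastest:.1f} to {slowest:.1f} s")
```

For long prompts, prefill takes more time than generation, so both numbers matter when you judge real-world speed.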
Try b9455 if you need speed and quality. The speed gap is gone.
Source: https://dev.to/yiqinumber1/llamacpp-b9455-finally-caught-vllm-70ts-on-2x3090-qwen-27b-uq8-1m74