Llama.cpp Now Matches vLLM Speed
vLLM used to lead on multi-GPU inference speed, and llama.cpp trailed it. Build b9455 changes this.
You now get about 70 tokens per second on two RTX 3090 GPUs running Qwen 27B UQ8.
The -sm tensor flag is the key. It splits the work across both GPUs. The old way left one card idle while the other worked. The new way keeps both cards busy at once.
Use these settings:
- --tensor-split 50,50 -sm tensor
- --flash-attn on
- --cache-type-k q8_0 --cache-type-v q8_0
- --spec-type draft-mtp --spec-draft-n-max 3
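
The settings above can be combined into one launch command. Here is a minimal sketch that assembles it. The flags are copied from the list as-is and are not independently checked against b9455. The binary name llama-server and the model path are assumptions for illustration, not from the original post.

```python
import shlex

# Flags copied verbatim from the settings list in the post.
POST_FLAGS = [
    "--tensor-split", "50,50", "-sm", "tensor",
    "--flash-attn", "on",
    "--cache-type-k", "q8_0", "--cache-type-v", "q8_0",
    "--spec-type", "draft-mtp", "--spec-draft-n-max", "3",
]


def build_command(model_path, binary="llama-server"):
    """Return the argument list for a launch using the post's flags.

    `binary` and `model_path` are assumptions; adjust them to your setup.
    """
    return [binary, "-m", model_path, *POST_FLAGS]


if __name__ == "__main__":
    # Hypothetical model path, for illustration only.
    cmd = build_command("/path/to/qwen-27b-q8.gguf")
    # Print a shell-safe command line that can be pasted into a terminal.
    print(shlex.join(cmd))
```

Keeping the flags in one list makes it easy to drop a single option, for example the speculative decoding pair, and compare speeds.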
Why this matters for you:
- Speed stays between 67 and 81 tokens per second.
- Prefill is fast. 27K context loads in 18.8 seconds.
- Quality is better. Q8 quantization avoids the coding errors seen with lower-precision quants.
Lower-precision quants cause bugs. They produce wrong variable names. They introduce loop errors. Q8 avoids these failures.
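
To put the speed numbers above in perspective, here is a rough back-of-the-envelope sketch. It assumes "27K" means 27,000 tokens, and the 1,000-token reply length is a made-up example, not a figure from the post.

```python
def prefill_rate(context_tokens, seconds):
    """Tokens processed per second while loading the prompt."""
    return context_tokens / seconds


def generation_seconds(n_tokens, tokens_per_second):
    """Time needed to generate n_tokens at a steady decode speed."""
    return n_tokens / tokens_per_second


if __name__ == "__main__":
    # Assumption: "27K context" is read as 27,000 tokens.
    rate = prefill_rate(27_000, 18.8)
    print(f"prefill: about {rate:.0f} tokens per second")

    # Hypothetical 1,000-token reply at the slow and fast ends of the range.
    for tps in (67, 81):
        secs = generation_seconds(1_000, tps)
        print(f"1,000 tokens at {tps} t/s: about {secs:.1f} s")
```

Under these assumptions, prefill runs at roughly 1,400 tokens per second, and a 1,000-token reply takes between about 12 and 15 seconds.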
Try b9455 if you need speed and quality. The speed gap is gone.
Source: https://dev.to/yiqinumber1/llamacpp-b9455-finally-caught-vllm-70ts-on-2x3090-qwen-27b-uq8-1m74