Llama.cpp Now Matches vLLM Speed

📅2 weeks ago⏱1 min read

vLLM used to lead on multi-GPU speed, and llama.cpp was slower. Build b9455 changes this.

You now get 70 tokens per second on two RTX 3090 GPUs running Qwen 27B UQ8.

The -sm tensor flag is the key. It distributes the work across both GPUs. The old way left one card idle, while the new way keeps both cards busy at once.

Use these settings:
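
As a rough illustration of how the flag might be passed, here is a minimal Python sketch that assembles a llama-server command line. It is not the exact configuration behind the numbers above. Only -sm tensor comes from this post. The model filename, the -ngl 99 value, and the helper name build_llama_server_cmd are assumptions made up for the example.

```python
import shlex


def build_llama_server_cmd(model_path, split_mode="tensor", gpu_layers=99):
    """Assemble a llama-server argument list (hypothetical helper)."""
    return [
        "llama-server",
        "-m", model_path,         # path to the GGUF model file (assumed name)
        "-ngl", str(gpu_layers),  # number of layers to offload to GPU (assumed value)
        "-sm", split_mode,        # split mode named in this post
    ]


# The filename below is a placeholder, not a real release artifact.
cmd = build_llama_server_cmd("qwen-27b-q8.gguf")
print(shlex.join(cmd))
```

To launch the server, you could pass the list to subprocess.run(cmd) on a machine where a llama-server binary from build b9455 is on the PATH.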

Why this matters for you:

Lower-quality quants introduce bugs in generated code, such as wrong variable names and loop errors. Q8 avoids these failures.

Try b9455 if you need speed and quality. The speed gap is gone.
