
AI-assisted draft.

2 days ago · 1 min read

𝗜 𝗦𝘁𝗼𝗽𝗽𝗲𝗱 𝗖𝗵𝗮𝘀𝗶𝗻𝗴 𝗠𝗧𝗣 𝗧𝗣𝗦 𝗔𝗻𝗱 𝗚𝗼𝘁 𝗔 𝗟𝗼𝗰𝗮𝗹 𝟮𝟳𝗕 𝗔𝗴𝗲𝗻𝘁 𝗧𝗵𝗮𝘁 𝗪𝗼𝗿𝗸𝘀 𝗼𝗻 𝟮𝟰𝗚𝗕 𝗩𝗥𝗔𝗠

I do not care about single prompt benchmarks.

I care about the loop.

A coding agent needs to work for hours. It needs to handle edits, terminal calls, retries, and growing context. If the model fails after ten prompts, it is useless.

I wanted to see if I could run a 27B model on a single 24GB GPU. I tested Qwopus3.6-27B-v2 and created a 4-bit GPTQ quantization of it: XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1.

Here is my setup for a stable 24GB agent loop:

Model: Qwopus3.6-27B GPTQ-Pro 4-bit
Engine: vLLM with GPTQ-Marlin
Context: 131k tokens
KV Cache: FP8 (fp8_e5m2)
Strategy: Prefix caching enabled
Constraint: max_num_seqs=1
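
The list above maps onto vLLM server flags. Here is a minimal sketch that builds the launch command. It assumes the served model is the repo named earlier and that "131k" means 131072 tokens. It is not the exact command from my machine, and it leaves out anything the list does not mention, such as memory utilization.

```python
import shlex

# Assumption: the served model is the quantized repo named in the post.
MODEL = "XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1"


def build_vllm_command(model: str = MODEL) -> list[str]:
    """Return the argv for a single-GPU `vllm serve` launch."""
    return [
        "vllm", "serve", model,
        "--quantization", "gptq_marlin",  # GPTQ-Marlin kernels for 4-bit weights
        "--max-model-len", "131072",      # assumption: "131k" = 128 * 1024 tokens
        "--kv-cache-dtype", "fp8_e5m2",   # FP8 KV cache
        "--enable-prefix-caching",        # reuse KV blocks for repeated prefixes
        "--max-num-seqs", "1",            # one request at a time
    ]


if __name__ == "__main__":
    # Print a copy-pasteable shell command.
    print(shlex.join(build_vllm_command()))
```

Keeping the flags in one function makes it easy to change one knob at a time and rerun the same loop.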

Why max_num_seqs=1?

On a single 24GB card, parallelism is not free. If you run multiple requests, they fight for memory. I want one request to finish cleanly. I would rather have one useful answer than two broken ones.
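
To see why a second request hurts, here is a rough sketch of the standard KV cache size formula. The layer count, KV head count, and head dimension are hypothetical placeholders, not the real Qwopus3.6-27B config. Only the 1 byte per element (FP8) and the context length come from my setup, with 131072 tokens assumed for "131k".

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    """KV cache size for one sequence: a key and a value vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def to_gib(num_bytes: int) -> float:
    return num_bytes / 2**30


# Hypothetical architecture values, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 16, 4, 256
FP8_BYTES = 1            # fp8_e5m2 stores one byte per element
CONTEXT_TOKENS = 131072  # assumption: "131k" = 128 * 1024

one_seq = kv_cache_bytes(CONTEXT_TOKENS, LAYERS, KV_HEADS, HEAD_DIM, FP8_BYTES)
two_seqs = 2 * one_seq   # a second full-context request doubles the cost

if __name__ == "__main__":
    print(f"one sequence:  {to_gib(one_seq):.1f} GiB")
    print(f"two sequences: {to_gib(two_seqs):.1f} GiB")
```

With these placeholder values, one full-context sequence costs 4 GiB of KV cache and two cost 8 GiB. On a 24GB card that already holds the weights, that difference decides whether the request finishes.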

I also skipped MTP (multi-token prediction) speculative decoding. On a single 3090, it added memory pressure and complexity without improving end-to-end speed at long context.

The metrics that matter:

Prefix cache hit ratio: ~83%
Average TTFT: ~5.7s at 33k tokens
Prefill throughput: ~1917 tok/s
Decode speed: ~43 tok/s

When the prefix cache hits, your latency drops. When you change tasks, the cache gets cold and latency rises. That is normal. The goal is to return to high cache reuse once the task stabilizes.
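
Here is a first-order sketch of that effect. It assumes TTFT is dominated by prefilling the uncached part of the prompt, and it ignores scheduling and tokenization overhead. The function name and the model are my simplification, not something vLLM reports.

```python
def estimate_ttft_seconds(prompt_tokens: int, cache_hit_ratio: float,
                          prefill_tok_per_s: float) -> float:
    """Rough TTFT: only the tokens that miss the prefix cache need prefill."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    if prefill_tok_per_s <= 0:
        raise ValueError("prefill_tok_per_s must be positive")
    uncached_tokens = prompt_tokens * (1.0 - cache_hit_ratio)
    return uncached_tokens / prefill_tok_per_s


if __name__ == "__main__":
    # Numbers from the metrics above: 33k-token prompt, ~1917 tok/s prefill.
    cold = estimate_ttft_seconds(33_000, 0.0, 1917)
    warm = estimate_ttft_seconds(33_000, 0.83, 1917)
    print(f"cold cache: {cold:.1f}s, 83% hit: {warm:.1f}s")
```

With the numbers above, this gives roughly 17 seconds cold and roughly 3 seconds at an 83% hit ratio. It will not reproduce my measured 5.7s average, presumably because that average mixes warm and cold turns. It does show why cache reuse matters more than raw decode speed.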

If you only test one prompt, you are testing the wrong thing. For coding agents, you must test long-run stability.
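
A minimal sketch of what testing the loop can look like. The `send` callable is a hypothetical stand-in for whatever client talks to your server. This is not my actual harness, only the shape of it: the context grows every turn and you record latency per turn.

```python
import time
from typing import Callable, Dict, List


def run_agent_loop(send: Callable[[str], str], first_prompt: str,
                   turns: int) -> List[Dict]:
    """Run a multi-turn loop with growing context and record per-turn stats."""
    context = first_prompt
    records = []
    for turn in range(turns):
        start = time.perf_counter()
        reply = send(context)  # hypothetical client call, supplied by the caller
        elapsed = time.perf_counter() - start
        records.append({
            "turn": turn,
            "context_chars": len(context),
            "latency_s": elapsed,
            "ok": bool(reply.strip()),  # an empty reply counts as a failed turn
        })
        # Append the reply so the next turn sees a longer context.
        context = context + "\n" + reply
    return records


if __name__ == "__main__":
    # Fake backend for a dry run; swap in a real client to test a server.
    for row in run_agent_loop(lambda ctx: "ok", "start", turns=3):
        print(row)
```

Run it for hundreds of turns, not three. Watch whether latency and failures drift as the context grows.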

Are you running agent loops on a single GPU? What tricks do you use for KV cache or prefix caching?

Source: https://dev.to/xreyrobertibm/i-stopped-chasing-mtp-tps-and-got-a-local-27b-agent-that-actually-stayed-usable-on-24gb-vram-5897

