𝗜 𝗦𝘁𝗼𝗽𝗽𝗲𝗱 𝗖𝗵𝗮𝘀𝗶𝗻𝗴 𝗠𝗧𝗣 𝗧𝗣𝗦 𝗔𝗻𝗱 𝗚𝗼𝘁 𝗔 𝗟𝗼𝗰𝗮𝗹 𝟮𝟳𝗕 𝗔𝗴𝗲𝗻𝘁 𝗧𝗵𝗮𝘁 𝗪𝗼𝗿𝗸𝘀 𝗼𝗻 𝟮𝟰𝗚𝗕 𝗩𝗥𝗔𝗠
I do not care about single prompt benchmarks.
I care about the loop.
A coding agent needs to work for hours. It needs to handle edits, terminal calls, retries, and growing context. If the model fails after ten prompts, it is useless.
I wanted to see if I could run a 27B model on a single 24GB GPU. I tested Qwopus3.6-27B-v2 and created a new version: XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1.
Here is my setup for a stable 24GB agent loop:
- Model: Qwopus3.6-27B GPTQ-Pro 4-bit
- Engine: vLLM with GPTQ-Marlin
- Context: 131k tokens
- KV Cache: FP8 (fp8_e5m2)
- Strategy: Prefix caching enabled
- Constraint: max_num_seqs=1
Why max_num_seqs=1?
On a single 24GB card, parallelism is not free. If you run multiple requests, they fight for memory. I want one request to finish cleanly. I would rather have one useful answer than two broken ones.
I also skipped speculative decoding (MTP). On a single 3090, MTP added memory pressure and complexity without increasing end-to-end speed for long contexts.
The real metrics that matter:
- Prefix cache hit ratio: ~83%
- Average TTFT: ~5.7s at 33k tokens
- Prefill throughput: ~1917 tok/s
- Decode speed: ~43 tok/s
When the prefix cache hits, your latency drops. When you change tasks, the cache gets cold and latency rises. That is normal. The goal is to return to high cache reuse once the task stabilizes.
If you only test one prompt, you are testing the wrong thing. For coding agents, you must test the long-run stability.
Are you running agent loops on a single GPU? What tricks do you use for KV cache or prefix caching?
Optional learning community: https://t.me/GyaanSetuAi