𝗤𝘄𝗲𝗻𝟯.𝟲-𝟮𝟳𝗕 + 𝘃𝗟𝗟𝗠 + 𝗛𝗲𝗿𝗺𝗲𝘀 𝗼𝗻 𝟮𝟰𝗚𝗕 𝗩𝗥𝗔𝗠
You want to run a local coding agent on a 24GB GPU. You need stability. You need long context. You need to avoid crashes.
This setup uses Qwen3.6-27B-GPTQ-Pro-4bit via vLLM. I focus on text only. Multimodal models consume too much memory for this specific goal.
The Strategy: • Use one local coding agent. • Disable all child agents. • Prevent side tasks from stealing memory. • Prioritize stable sessions over raw speed.
The vLLM Configuration: Run vLLM with the gptq_marlin quantization. This provides the best balance for long context and prefix caching on an RTX 3090.
Key flags to use:
- --max-num-seqs 1: This is vital. Parallelism steals KV cache from your main task. I prefer one successful request over two failing ones.
- --max-model-len 131072: This allows a massive context. If you hit memory errors, lower this to 110k or 80k.
- --enable-prefix-caching: This makes repeated long prompts much faster.
- --language-model-only: Keep it simple to save VRAM.
Hermes Settings: Point Hermes to your vLLM endpoint. Use these specific settings for the best results: • Enable thinking and preserve thinking. • Set a long request timeout. Use 1800 seconds. Large contexts take time to prefill. • Disable delegation and child agents. • Remove hard max_tokens caps to prevent truncated answers.
Why this works: Prefix caching is not magic. It is an optimization. If you keep your inputs boring and repeatable, the model stops paying the full prefill cost for every turn.
My results on 24GB VRAM: • Small prompt (41 tokens): 0.29s TTFT. • Large prompt (41,985 tokens): 38.6s TTFT. • Cached prompt (41,985 tokens): 1.59s TTFT.
The model is not the bottleneck. The bottleneck is your serving discipline. Control your context size, your request sequence, and your concurrency.
Stop testing if a model answers one prompt. Test if the agent survives a loop.
Source: https://dev.to/xreyrobertibm/qwen36-27b-vllm-hermes-on-24gb-vram-may-2026-recipe-5452
Optional learning community: https://t.me/GyaanSetuAi