𝗤𝘄𝗲𝗻𝟯.𝟲 𝟮𝟳𝗕 + 𝘃𝗟𝗟𝗠 + 𝗛𝗲𝗿𝗺𝗲𝘀 𝗼𝗻 𝟮𝟰𝗚𝗕 𝗩𝗥𝗔𝗠

Translated for your language. Read the original.

AI-assisted draft.

przedwczoraj2min read

𝗤𝘄𝗲𝗻𝟯.𝟲-𝟮𝟳𝗕 + 𝘃𝗟𝗟𝗠 + 𝗛𝗲𝗿𝗺𝗲𝘀 𝗼𝗻 𝟮𝟰𝗚𝗕 𝗩𝗥𝗔𝗠

You want to run a local coding agent on a 24GB GPU. You need stability. You need long context. You need to avoid crashes.

This setup uses Qwen3.6-27B-GPTQ-Pro-4bit via vLLM. I focus on text only. Multimodal models consume too much memory for this specific goal.

The Strategy: • Use one local coding agent. • Disable all child agents. • Prevent side tasks from stealing memory. • Prioritize stable sessions over raw speed.

The vLLM Configuration: Run vLLM with the gptq_marlin quantization. This provides the best balance for long context and prefix caching on an RTX 3090.

Key flags to use:

--max-num-seqs 1: This is vital. Parallelism steals KV cache from your main task. I prefer one successful request over two failing ones.
--max-model-len 131072: This allows a massive context. If you hit memory errors, lower this to 110k or 80k.
--enable-prefix-caching: This makes repeated long prompts much faster.
--language-model-only: Keep it simple to save VRAM.

Hermes Settings: Point Hermes to your vLLM endpoint. Use these specific settings for the best results: • Enable thinking and preserve thinking. • Set a long request timeout. Use 1800 seconds. Large contexts take time to prefill. • Disable delegation and child agents. • Remove hard max_tokens caps to prevent truncated answers.

Why this works: Prefix caching is not magic. It is an optimization. If you keep your inputs boring and repeatable, the model stops paying the full prefill cost for every turn.

My results on 24GB VRAM: • Small prompt (41 tokens): 0.29s TTFT. • Large prompt (41,985 tokens): 38.6s TTFT. • Cached prompt (41,985 tokens): 1.59s TTFT.

The model is not the bottleneck. The bottleneck is your serving discipline. Control your context size, your request sequence, and your concurrency.

Stop testing if a model answers one prompt. Test if the agent survives a loop.

Source: https://dev.to/xreyrobertibm/qwen36-27b-vllm-hermes-on-24gb-vram-may-2026-recipe-5452

Optional learning community: https://t.me/GyaanSetuAi

𝗤𝘄𝗲𝗻𝟯.𝟲 𝟮𝟳𝗕 + 𝘃𝗟𝗟𝗠 + 𝗛𝗲𝗿𝗺𝗲𝘀 𝗼𝗻 𝟮𝟰𝗚𝗕 𝗩𝗥𝗔𝗠

Continue reading

𝗟𝗹𝗮𝗺𝗮.𝗰𝗽𝗽 𝗡𝗼𝘄 𝗠𝗮𝘁𝗰𝗵𝗲𝘀 𝘃𝗟𝗟𝗠 𝗦𝗽𝗲𝗲𝗱

𝗤𝘄𝗲𝗻 𝟯.𝟲 𝟮𝟳𝗕: 𝗙𝗿𝗼𝗻𝘁𝗶𝗲𝗿 𝗖𝗼𝗱𝗶𝗻𝗴 𝗼𝗻 𝗮 𝟮𝟰𝗚𝗕 𝗚𝗣𝗨

𝗥𝘂𝗻𝗻𝗶𝗻𝗴 𝗧𝘄𝗼 𝗠𝗼𝗱𝗲𝗹𝘀 𝗼𝗻 𝗢𝗻𝗲 𝗚𝗣𝗨: 𝗧𝗵𝗲 𝗠𝗮𝘁𝗵 𝗕𝗲𝗵𝗶𝗻𝗱 𝗟𝗼𝗰𝗮𝗹 𝗟𝗟𝗠𝘀

𝗜 𝗦𝘁𝗼𝗽𝗽𝗲𝗱 𝗖𝗵𝗮𝘀𝗶𝗻𝗴 𝗠𝗧𝗣 𝗧𝗣𝗦 𝗔𝗻𝗱 𝗚𝗼𝘁 𝗔 𝗟𝗼𝗰𝗮𝗹 𝟮𝟳𝗕 𝗔𝗴𝗲𝗻𝘁 𝗧𝗵𝗮𝘁 𝗪𝗼𝗿𝗸𝘀 𝗼𝗻 𝟮

KV Cache i PagedAttention: Dlaczego Twój serwer LLM zwalnia