એક GPU પર બે મોડલ ચલાવવા: લોકલ LLMs પાછળનું ગણિત

📅3 hours ago⏱2 min read

𝗥𝘂𝗻𝗻𝗶𝗻𝗴 𝗧𝘄𝗼 𝗠𝗼𝗱𝗲𝗹𝘀 𝗼𝗻 𝗢𝗻𝗲 𝗚𝗣𝗨: 𝗧𝗵𝗲 𝗠𝗮𝘁𝗵 𝗕𝗲𝗵𝗶𝗻𝗱 𝗟𝗼𝗰𝗮𝗹 𝗟𝗟𝗠𝘀

I run an agent stack on a workstation. The models live on a DGX Spark via a LAN. I use vLLM instead of Ollama to manage memory better.

The goal is to run two models at once:

Qwen3-Next-80B for heavy reasoning.
Qwen3-4B for fast turns.

Both models hit one URL through a LiteLLM proxy. This setup failed several times before I found the right math.

Here are the lessons from the struggle.

𝗧𝗵𝗲 𝗠𝗲𝗺𝗼𝗿𝘆 𝗧𝗿𝗮𝗽 The setting gpu_memory_utilization is not a target for free memory. It is a fraction of the total GPU memory.

If you have a 120 GB card and set utilization to 0.80, vLLM tries to claim 96 GB of the total capacity. It does not look at what is currently free. If you try to run two processes, their percentages must sum to less than 0.95. You must leave room for the CUDA framework overhead.

𝗪𝗵𝗮𝘁 𝗛𝗮𝗽𝗽𝗲𝗻𝗲𝗱 𝗪𝗶𝘁𝗵 𝗧𝗵𝗲 𝗠𝗼𝗱𝗲𝗹𝘀 I tried using the Thinking version of the 80B model. It failed. The model would reason inside tags but never actually trigger a tool call. It would just stop.

I had to swap the 80B backbone to the Instruct version. This allowed the agent to use tools properly.

𝗧𝗵𝗲 𝗔𝗰𝘁𝘂𝗮𝗹 𝗠𝗮𝘁𝗵 After testing, I found these numbers work for my setup:

• Qwen3-Next-80B (at 0.80 target): Uses ~87.8 GiB actual memory. • Qwen3-4B (at 0.10 target): Uses ~13.8 GiB actual memory. • Total usage: ~101.6 GiB. • Free headroom: ~18 GiB.

If I pushed the 80B to 0.85, the 4B model could not start. The 80B would claim too much, leaving no room for the 4B's minimum needs.

𝗠𝘆 𝗣𝗹𝗮𝘆𝗯𝗼𝗼𝗸 𝗳𝗼𝗿 𝗖𝗼-𝗿𝗲𝘀𝗶𝗱𝗲𝗻𝘁 𝗠𝗼𝗱𝗲𝗹𝘀

Load the largest model first.
Let it settle.
Run nvidia-smi to see the actual memory used.
Size the smaller model based on the remaining free memory minus 5 GB for overhead.
Restart both models twice to ensure stability.

Do not guess your memory settings. Use this command to see your reality: nvidia-smi --query-gpu=memory.used --format=csv

If your target allocation and your actual usage differ by more than 10%, your math is wrong. Fix it before you deploy your agent stack.

Source: https://dev.to/ric03uec/two-qwen3-models-on-one-dgx-spark-the-residency-math-for-local-llm-coding-5bpj

Optional learning community: https://t.me/GyaanSetuAi

એક GPU પર બે મોડલ ચલાવવા: લોકલ LLMs પાછળનું ગણિત

Continue reading

𝗟𝗹𝗮𝗺𝗮.𝗰𝗽𝗽 𝗡𝗼𝘄 𝗠𝗮𝘁𝗰𝗵𝗲𝘀 𝘃𝗟𝗟𝗠 𝗦𝗽𝗲𝗲𝗱

𝗤𝘄𝗲𝗻 𝟯.𝟲 𝟮𝟳𝗕: 𝗙𝗿𝗼𝗻𝘁𝗶𝗲𝗿 𝗖𝗼𝗱𝗶𝗻𝗴 𝗼𝗻 𝗮 𝟮𝟰𝗚𝗕 𝗚𝗣𝗨

𝗡𝘃𝗶𝗱𝗶𝗮 𝗗𝗚𝗫 𝗦𝗽𝗮𝗿𝗸: 𝗔 𝗧𝗼𝗼𝗹 𝗙𝗼𝗿 𝗗𝗲𝘃𝗲𝗹𝗼𝗽𝗲𝗿𝘀

𝗟𝗼𝗰𝗮𝗹 𝗟𝗟𝗠𝘀 𝗶𝗻 𝟮𝟬𝟮𝟲 𝗯𝘂𝘁 𝗗𝗲𝘃 𝗘𝘅𝗽𝗲𝗿𝗶𝗲𝗻𝗰𝗲 𝗶𝗻 𝟮𝟬𝟭𝟬

RAM એ નવું GPU છે