𝗥𝘂𝗻𝗻𝗶𝗻𝗴 𝗧𝘄𝗼 𝗠𝗼𝗱𝗲𝗹𝘀 𝗼𝗻 𝗢𝗻𝗲 𝗚𝗣𝗨: 𝗧𝗵𝗲 𝗠𝗮𝘁𝗵 𝗕𝗲𝗵𝗶𝗻𝗱 𝗟𝗼𝗰𝗮𝗹 𝗟𝗟𝗠𝘀

I run an agent stack on a workstation. The models are served from a DGX Spark over the LAN. I use vLLM instead of Ollama because it lets me manage GPU memory more precisely.

The goal is to run two models at once: Qwen3-Next-80B and Qwen3-4B.

Both models are reached through one URL behind a LiteLLM proxy. This setup failed several times before I found the right math.

Here are the lessons from the struggle.

𝗧𝗵𝗲 𝗠𝗲𝗺𝗼𝗿𝘆 𝗧𝗿𝗮𝗽

The setting gpu_memory_utilization is not a target for free memory. It is a fraction of the total GPU memory.

If you have a 120 GB card and set utilization to 0.80, vLLM tries to claim 96 GB of the total capacity. It does not look at what is currently free. If you run two processes, their fractions must sum to less than 0.95. You must leave room for the CUDA framework overhead.
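The rule above is simple enough to check before launching anything. Here is a minimal sketch; the helper names claimed_gb and fractions_fit are hypothetical, and the 0.95 ceiling is the rule of thumb from this post, not a vLLM constant.

```python
def claimed_gb(total_gb: float, fraction: float) -> float:
    """Memory vLLM will try to claim: a fraction of TOTAL capacity, not of free memory."""
    return total_gb * fraction


def fractions_fit(fractions: list[float], ceiling: float = 0.95) -> bool:
    """True when the combined fractions stay strictly below the ceiling.

    The sum is rounded so floating-point noise does not flip the comparison.
    """
    return round(sum(fractions), 6) < ceiling


print(claimed_gb(120, 0.80))        # what one process at 0.80 asks for on a 120 GB card
print(fractions_fit([0.80, 0.10]))  # two processes sharing the same card
```

Running the check with your planned fractions before starting the second process is cheaper than finding out from a failed launch.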

𝗪𝗵𝗮𝘁 𝗛𝗮𝗽𝗽𝗲𝗻𝗲𝗱 𝗪𝗶𝘁𝗵 𝗧𝗵𝗲 𝗠𝗼𝗱𝗲𝗹𝘀

I first tried the Thinking version of the 80B model. It failed. The model would reason inside its <think> tags but never actually trigger a tool call. It would just stop.

I had to swap the 80B backbone to the Instruct version. This allowed the agent to use tools properly.

𝗧𝗵𝗲 𝗔𝗰𝘁𝘂𝗮𝗹 𝗠𝗮𝘁𝗵

After testing, I found these numbers work for my setup:

• Qwen3-Next-80B (at 0.80 target): Uses ~87.8 GiB actual memory.
• Qwen3-4B (at 0.10 target): Uses ~13.8 GiB actual memory.
• Total usage: ~101.6 GiB.
• Free headroom: ~18 GiB.

If I pushed the 80B to 0.85, the 4B model could not start. The 80B would claim too much, leaving no room for the 4B's minimum needs.
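The arithmetic behind those figures can be replayed in a few lines. This is a sketch under two assumptions: the card is treated as 120 units of capacity, and GB and GiB are used interchangeably as the post does. The last two lines show that 0.85 plus 0.10 lands exactly on the 0.95 ceiling, which is consistent with the 4B failing to start.

```python
TOTAL = 120.0     # card capacity used in the post
USED_80B = 87.8   # measured usage, Qwen3-Next-80B at a 0.80 target
USED_4B = 13.8    # measured usage, Qwen3-4B at a 0.10 target

total_used = round(USED_80B + USED_4B, 1)
headroom = round(TOTAL - total_used, 1)
print(total_used, headroom)

# Combined fractions when the 80B is pushed to 0.85.
combined = round(0.85 + 0.10, 2)
print(combined, combined < 0.95)
```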

𝗠𝘆 𝗣𝗹𝗮𝘆𝗯𝗼𝗼𝗸 𝗳𝗼𝗿 𝗖𝗼-𝗿𝗲𝘀𝗶𝗱𝗲𝗻𝘁 𝗠𝗼𝗱𝗲𝗹𝘀

  1. Load the largest model first.
  2. Let it settle.
  3. Run nvidia-smi to see the actual memory used.
  4. Size the smaller model based on the remaining free memory minus 5 GB for overhead.
  5. Restart both models twice to ensure stability.
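Step 4 can be turned into a number. The sketch below converts the remaining budget into an upper bound for the second model's gpu_memory_utilization. The function name is hypothetical, and dividing by total capacity is my reading of the step, since the setting is a fraction of total memory.

```python
def max_fraction_for_second(total_gb: float,
                            used_by_first_gb: float,
                            overhead_gb: float = 5.0) -> float:
    """Upper bound on the second model's fraction of TOTAL memory.

    free   = total - what the first model actually uses (measured, not targeted)
    budget = free - a fixed overhead reserve (5 GB in the playbook)
    """
    free = total_gb - used_by_first_gb
    budget = free - overhead_gb
    if budget <= 0:
        raise ValueError("no room left for a second model")
    return budget / total_gb


# With the 80B measured at 87.8 on a 120 GB card:
print(round(max_fraction_for_second(120, 87.8), 3))
```

The result is a ceiling, not a recommendation. Picking a value well below it, as with the 0.10 used for the 4B here, leaves more headroom.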

Do not guess your memory settings. Use this command to see your reality:

nvidia-smi --query-gpu=memory.used --format=csv
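If you want that reading inside a script, the CSV output is easy to parse. This is a minimal sketch that assumes the usual layout: a header line followed by one line per GPU such as "89907 MiB". Both function names are hypothetical. Some platforms report a non-numeric value for this field, so the parser returns None in that case instead of guessing.

```python
import subprocess


def parse_memory_used(csv_text: str) -> list:
    """Return used memory in MiB per GPU, or None where the value is not numeric."""
    lines = [ln.strip() for ln in csv_text.strip().splitlines() if ln.strip()]
    values = []
    for line in lines[1:]:  # the first line is the CSV header
        token = line.split()[0]
        values.append(int(token) if token.isdigit() else None)
    return values


def query_memory_used() -> list:
    """Run the nvidia-smi query from the post and parse its output."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_used(result.stdout)
```

Call query_memory_used() after the large model has settled and feed the result into your sizing math for the second model.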

If your target allocation and your actual usage differ by more than 10%, your math is wrong. Fix it before you deploy your agent stack.
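The 10% check is one line of arithmetic. In this sketch the gap is measured relative to the target, which is an assumption on my part since the rule does not say which side is the baseline. The function names are hypothetical.

```python
def allocation_gap(target_gb: float, actual_gb: float) -> float:
    """Relative difference between the target allocation and measured usage."""
    return abs(actual_gb - target_gb) / target_gb


def within_tolerance(target_gb: float, actual_gb: float,
                     tolerance: float = 0.10) -> bool:
    """True when measured usage is within the tolerance of the target."""
    return allocation_gap(target_gb, actual_gb) <= tolerance


# The 80B: a 0.80 target on a 120 GB card is 96, measured usage was 87.8.
print(round(allocation_gap(96, 87.8), 3))
print(within_tolerance(96, 87.8))
```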

Source: https://dev.to/ric03uec/two-qwen3-models-on-one-dgx-spark-the-residency-math-for-local-llm-coding-5bpj
