𝗤𝘂𝗮𝗻𝘁𝗶𝘇𝗶𝗻𝗴 𝗠𝗼𝗱𝗲𝗹𝘀 𝗼𝗻 𝗮 𝟲 𝗚𝗕 𝗟𝗮𝗽𝘁𝗼𝗽 𝗚𝗣𝗨

I tried to fit large language models onto an RTX 3050 laptop GPU. This card has only 6 GB of VRAM. I wanted to see which models work with 4-bit quantization and which ones fail.

I used a single script to quantize three models:

  • Phi-3.5-mini (3.8B)
  • Llama-3.2-3B
  • Qwen2.5-3B (VibeThinker)

The Results: All three models worked. Phi went from 7.6 GB to 2.27 GB in 34 minutes, about a 3.3x reduction. Llama and VibeThinker followed a similar path, and all three fit easily in 6 GB.
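
The size drop follows from simple arithmetic on bits per weight. The sketch below is a rough estimate, not the script used for these runs. It assumes 16-bit weights before quantization and decimal gigabytes, and `weight_size_gb` is a hypothetical helper name.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


phi_params = 3.8e9  # Phi-3.5-mini parameter count, from the model list above

fp16_gb = weight_size_gb(phi_params, 16)  # assumed 16-bit starting point
int4_gb = weight_size_gb(phi_params, 4)   # floor for a purely 4-bit file

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit floor:    {int4_gb:.1f} GB")
print(f"Reported reduction: {7.6 / 2.27:.2f}x")  # sizes reported in this post
```

The 2.27 GB reported for Phi sits above the 1.9 GB floor. That is expected: a quantized checkpoint also stores per-group scales, and some tensors (commonly the embeddings and output head) usually stay in higher precision.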

Then I tried Qwen2.5-7B. It failed. The process crashed on the second layer with an Out of Memory error.

Why it failed: GPTQ builds a Hessian matrix for every linear layer it quantizes, and that matrix grows with the square of the layer's input width. For a 7B model, this working memory exceeds what a 6 GB card provides. I tried several fixes:

  • Smaller calibration datasets: No change.
  • Offloading Hessians to CPU: It lasted longer but still crashed.
  • Using AWQ instead of GPTQ: It crashed in the same place.
  • Using CPU only: It works but is too slow, at about 16 minutes per layer.
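
To get a feel for why the 7B model is so much harder, here is a back-of-the-envelope estimate of the largest per-layer Hessian. It is a sketch under assumptions, not a measurement from these runs. It assumes GPTQ accumulates one float32 Hessian of shape (in_features, in_features) per linear layer. The MLP intermediate sizes are my reading of the public model configs, so verify them against each model's config.json. `hessian_bytes` is a hypothetical helper name.

```python
def hessian_bytes(in_features: int, bytes_per_element: int = 4) -> int:
    """Memory for one square (in_features x in_features) Hessian."""
    return in_features * in_features * bytes_per_element


# Assumed MLP intermediate sizes, i.e. the input width of the widest
# linear layer (the down projection). Check these against config.json.
intermediate_size = {
    "Phi-3.5-mini": 8192,
    "Llama-3.2-3B": 8192,
    "Qwen2.5-7B": 18944,
}

for name, width in intermediate_size.items():
    mib = hessian_bytes(width) / 2**20
    print(f"{name}: {mib:,.0f} MiB for the largest Hessian")
```

Under these assumptions the widest Hessian grows from 256 MiB to 1,369 MiB, roughly 5.3 times larger, before counting the layer weights and calibration activations that must also be in memory.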

Key Takeaways for Small GPUs:

  • Expect a 3x reduction in model size.
  • Aim for a 3 to 4 billion parameter limit for GPU quantization.
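  • Watch your KV cache budget. Models with similar file sizes can need very different amounts of memory during inference, because the cache grows with layer count, KV head count, and context length.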
  • Quantization uses more memory than serving. Monitor your system RAM during the process.
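
The KV budget point can be made concrete with the standard cache-size formula: each layer stores two tensors (keys and values), and each holds kv_heads × head_dim values per token. The sketch below assumes a 16-bit cache and a 4096-token context. The layer and head counts are my reading of the public model configs, not figures from this experiment, so verify them before relying on the numbers. `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `tokens` positions."""
    # The factor of 2 covers the separate key and value tensors per layer.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


# Assumed architecture values. Check them against each model's config.json.
configs = {
    "Phi-3.5-mini": dict(layers=32, kv_heads=32, head_dim=96),
    "Llama-3.2-3B": dict(layers=28, kv_heads=8, head_dim=128),
}

context = 4096  # assumed context length, for illustration only
for name, cfg in configs.items():
    mib = kv_cache_bytes(tokens=context, **cfg) / 2**20
    print(f"{name}: {mib:,.0f} MiB of KV cache at {context} tokens")
```

Under these assumptions, two models with nearly identical file sizes (2.27 GB and 2.26 GB) differ by more than 3x in cache memory, because Llama-3.2-3B shares each KV head across several query heads (grouped-query attention).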

Model Comparison (W4A16):

  • Phi-3.5-mini: 2.27 GB | 68.7 tok/s
  • Llama-3.2-3B: 2.26 GB | 66.0 tok/s
  • VibeThinker-3B: 2.07 GB | 43.9 tok/s

All three models handled basic math and prime number logic correctly after quantization.

Source: https://dev.to/syedazeez/quantizing-three-models-to-fit-a-6-gb-laptop-gpu-and-the-one-that-wouldnt-4pjl
