𝗤𝘂𝗮𝗻𝘁𝗶𝘇𝗶𝗻𝗴 𝗠𝗼𝗱𝗲𝗹𝘀 𝗼𝗻 𝗮 𝟲 𝗚𝗕 𝗟𝗮𝗽𝘁𝗼𝗽 𝗚𝗣𝗨
I tried to fit large language models onto an RTX 3050 laptop GPU. This card has only 6 GB of VRAM. I wanted to see which models work with 4-bit quantization and which ones fail.
I used a single script to quantize three models:
- Phi-3.5-mini (3.8B)
- Llama-3.2-3B
- Qwen2.5-3B (VibeThinker)
The Results: All three models quantized successfully. Phi went from 7.6 GB to 2.27 GB in 34 minutes. Llama and VibeThinker followed a similar path. All three fit comfortably in 6 GB.
Then I tried Qwen2.5-7B. It failed: the process crashed on the second layer with an out-of-memory (OOM) error.
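A rough sketch of why 16-bit to 4-bit lands nearer 3x than 4x. It assumes (not stated in the post) that the embedding and output-head matrices stay in 16-bit, that the remaining weights use 4 bits plus one 16-bit scale per group of 128, and that Phi-3.5-mini has a 32064-token vocabulary and a hidden size of 3072. The function name and defaults are illustrative only.

```python
def estimate_w4a16_gb(total_params, unquantized_params, group_size=128):
    """Estimate W4A16 checkpoint size in decimal GB (illustrative only)."""
    quantized_params = total_params - unquantized_params
    # 4-bit weight plus a shared 16-bit scale amortized over each group.
    bits_per_quantized_weight = 4 + 16 / group_size
    quantized_bytes = quantized_params * bits_per_quantized_weight / 8
    # Weights left unquantized are assumed to stay at 2 bytes each.
    unquantized_bytes = unquantized_params * 2
    return (quantized_bytes + unquantized_bytes) / 1e9


# Assumed Phi-3.5-mini figures: 3.8B parameters, of which the input
# embedding and output head (vocab 32064 x hidden 3072 each) stay in 16-bit.
phi_unquantized = 2 * 32064 * 3072
phi_estimate_gb = estimate_w4a16_gb(3.8e9, phi_unquantized)
print(f"Estimated W4A16 size: {phi_estimate_gb:.2f} GB")
print(f"Reduction from 7.6 GB: {7.6 / phi_estimate_gb:.1f}x")
```

Under these assumptions the estimate comes out close to the measured size, which suggests most of the gap from a clean 4x is the 16-bit embedding and head weights.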
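Why it failed: GPTQ builds a Hessian matrix for each linear layer it quantizes, and that matrix grows with the square of the layer's input width. For a 7B model, the Hessians together with the weights and calibration activations need more memory than a 6 GB card provides. I tried several fixes: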
- Smaller calibration datasets: No change.
- Offloading Hessians to CPU: It lasted longer but still crashed.
- Using AWQ instead of GPTQ: It crashed in the same place.
- Using CPU only: It works, but at about 16 minutes per layer it is too slow to be practical.
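To put a number on the Hessian cost, here is a minimal sketch. It assumes one fp32 Hessian of shape (in_features, in_features) per linear layer, and it uses layer widths I am taking from the published model configs (Qwen2.5-7B: hidden 3584, MLP intermediate 18944; Llama-3.2-3B: MLP intermediate 8192). Real GPTQ implementations usually hold extra working copies during the inverse, so treat these as lower bounds.

```python
def hessian_bytes(in_features, dtype_bytes=4):
    """Memory for one square Hessian over a linear layer's input dimension."""
    return in_features * in_features * dtype_bytes


GIB = 1024 ** 3

# Assumed input widths of the widest linear layer (the MLP down projection).
layers = {
    "Qwen2.5-7B attention/gate/up (in=3584)": 3584,
    "Qwen2.5-7B down_proj (in=18944)": 18944,
    "Llama-3.2-3B down_proj (in=8192)": 8192,
}

for name, width in layers.items():
    print(f"{name}: {hessian_bytes(width) / GIB:.2f} GiB")
```

Under these assumptions a single down-projection Hessian on the 7B model is several times larger than on a 3B model, and it has to share 6 GB with the layer weights and calibration activations.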
Key Takeaways for Small GPUs:
- Expect roughly a 3x reduction in model size when going from 16-bit to 4-bit weights.
- Treat 3 to 4 billion parameters as the practical ceiling for quantizing on a 6 GB GPU.
- Watch your KV cache budget. Two models with similar file sizes can use very different amounts of memory during inference.
- Quantization uses more memory than serving. Monitor your system RAM during the process.
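The KV cache point can be sketched with the standard per-token formula: 2 (keys and values) x layers x KV heads x head dimension x bytes per value. The architecture numbers below are assumptions taken from the published configs (Phi-3.5-mini: 32 layers, 32 KV heads, head dim 96; Llama-3.2-3B: 28 layers, 8 KV heads, head dim 128), with a 16-bit cache.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV cache bytes needed for each token in the context."""
    # Factor of 2 covers one key vector and one value vector per head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


GIB = 1024 ** 3
CONTEXT_TOKENS = 4096  # hypothetical context length for comparison

# Assumed configs; check each model's config.json before relying on them.
models = {
    "Phi-3.5-mini": kv_bytes_per_token(32, 32, 96),
    "Llama-3.2-3B": kv_bytes_per_token(28, 8, 128),
}

for name, per_token in models.items():
    total_gib = per_token * CONTEXT_TOKENS / GIB
    print(f"{name}: {per_token // 1024} KiB/token, "
          f"{total_gib:.2f} GiB at {CONTEXT_TOKENS} tokens")
```

With these assumed configs, two checkpoints of nearly identical size differ by more than 3x in cache memory per token, because one model shares KV heads across query heads and the other does not.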
Model Comparison (W4A16):
- Phi-3.5-mini: 2.27 GB | 68.7 tok/s
- Llama-3.2-3B: 2.26 GB | 66.0 tok/s
- VibeThinker-3B: 2.07 GB | 43.9 tok/s
All three models handled basic math and prime number logic correctly after quantization.
Optional learning community: https://t.me/GyaanSetuAi