𝗤𝘂𝗮𝗻𝘁𝗶𝘇𝗶𝗻𝗴 𝗠𝗼𝗱𝗲𝗹𝘀 𝗼𝗻 𝗮 𝟲 𝗚𝗕 𝗟𝗮𝗽𝘁𝗼𝗽 𝗚𝗣𝗨

AI-assisted draft.

I tried to fit large language models onto an RTX 3050 laptop GPU. This card has only 6 GB of VRAM. I wanted to see which models work with 4-bit quantization and which ones fail.

I used a single script to quantize three models:

Phi-3.5-mini (3.8B)
Llama-3.2-3B
Qwen2.5-3B (VibeThinker)

The Results: All three models quantized successfully. Phi-3.5-mini shrank from 7.6 GB to about 2.2 GB in 34 minutes. Llama-3.2-3B and VibeThinker followed a similar path. All three fit easily in 6 GB of VRAM.
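As a rough sanity check on those sizes, here is a minimal sketch of the arithmetic. It assumes 16-bit source weights and a 4-bit format that stores one 16-bit scale per group of 128 weights; the group size and the helper name `weight_size_gb` are illustrative assumptions, not details from the original script.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


PHI_PARAMS = 3.8e9  # parameter count from the model list above

# 16-bit source checkpoint: 2 bytes per parameter.
fp16_gb = weight_size_gb(PHI_PARAMS, 16)

# Assumption: 4-bit weights plus one 16-bit scale per group of 128 weights.
w4_bits = 4 + 16 / 128
w4_gb = weight_size_gb(PHI_PARAMS, w4_bits)

print(f"FP16 estimate: {fp16_gb:.2f} GB")
print(f"W4 estimate:   {w4_gb:.2f} GB")
print(f"Ideal ratio:   {fp16_gb / w4_gb:.2f}x")
```

The 16-bit estimate matches the 7.6 GB starting size. The 4-bit estimate comes out below the measured file, which would be consistent with some tensors (often the embeddings and output head) staying in 16-bit, though the source does not say which tensors were left unquantized.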

Then I tried Qwen2.5-7B. It failed. The process crashed on the second layer with an out-of-memory (OOM) error.

Why it failed: GPTQ quantization builds a Hessian matrix for each layer it quantizes, and that matrix grows with the square of the layer's input width. For a 7B model, this math requires more memory than a 6 GB card provides. I tried several fixes:

Smaller calibration datasets: No change.
Offloading Hessians to CPU: It lasted longer but still crashed.
Using AWQ instead of GPTQ: It crashed in the same place.
Using CPU only: It works but is too slow, at about 16 minutes per layer.
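To get a feel for the Hessian cost, here is a minimal sketch that sizes one Hessian per linear layer as a square float32 matrix over the layer's input width. The layer widths below are assumptions taken from the publicly listed model configs and should be checked against each model's config.json; the helper `hessian_bytes` is hypothetical and not part of any quantization library.

```python
def hessian_bytes(d_in: int, dtype_bytes: int = 4) -> int:
    """Bytes for one (d_in x d_in) Hessian, float32 by default."""
    return d_in * d_in * dtype_bytes


# Assumed layer widths; verify against each model's config.json.
WIDTHS = {
    "Qwen2.5-3B": {"hidden_size": 2048, "intermediate_size": 11008},
    "Qwen2.5-7B": {"hidden_size": 3584, "intermediate_size": 18944},
}

GIB = 1024 ** 3
report = {}
for name, w in WIDTHS.items():
    # Attention and gate/up projections take hidden_size inputs;
    # the MLP down projection takes intermediate_size inputs.
    report[name] = {
        "hidden_in": hessian_bytes(w["hidden_size"]) / GIB,
        "down_proj_in": hessian_bytes(w["intermediate_size"]) / GIB,
    }
    print(
        f"{name}: hidden-input Hessian {report[name]['hidden_in']:.3f} GiB, "
        f"down_proj Hessian {report[name]['down_proj_in']:.3f} GiB"
    )
```

Under these assumptions, the largest single Hessian for the 7B model is about 1.3 GiB, against roughly 0.45 GiB for the 3B model. That is before any working copies needed for the matrix inverse, and it sits on top of the layer weights and calibration activations. This only illustrates the Hessian term and does not account for everything that was resident on the GPU at the time of the crash.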

Key Takeaways for Small GPUs:

Expect roughly a 3x reduction in model size at 4 bits.
Treat 3 to 4 billion parameters as the practical limit for quantizing on a 6 GB GPU.
Watch your KV cache budget. Even when file sizes are similar, memory use during inference varies between models.
Quantization uses more memory than serving. Monitor your system RAM during the process.
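The KV cache point can be made concrete with a small estimate. This is a sketch under stated assumptions: the layer counts, KV head counts, and head dimensions below come from the publicly listed model configs rather than from the original post, and the cache is assumed to be stored in 16-bit. The function name `kv_bytes_per_token` is illustrative.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache added per generated or prompt token."""
    # Two tensors (K and V) are cached in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


# Assumed architecture values; verify against each model's config.json.
CONFIGS = {
    "Phi-3.5-mini": {"n_layers": 32, "n_kv_heads": 32, "head_dim": 96},
    "Llama-3.2-3B": {"n_layers": 28, "n_kv_heads": 8, "head_dim": 128},
}

CONTEXT_TOKENS = 4096  # example context length, chosen for illustration
GIB = 1024 ** 3

kv_gib = {}
for name, cfg in CONFIGS.items():
    per_token = kv_bytes_per_token(**cfg)
    kv_gib[name] = per_token * CONTEXT_TOKENS / GIB
    print(f"{name}: {per_token / 1024:.0f} KiB per token, "
          f"{kv_gib[name]:.2f} GiB at {CONTEXT_TOKENS} tokens")
```

Under these assumptions, two models with nearly identical file sizes differ by more than 3x in cache memory at the same context length, because one caches every attention head while the other shares KV heads across groups of query heads.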

Model Comparison (W4A16):

• Phi-3.5-mini: 2.27 GB | 68.7 tok/s
• Llama-3.2-3B: 2.26 GB | 66.0 tok/s
• VibeThinker-3B: 2.07 GB | 43.9 tok/s

All three models handled basic math and prime number logic correctly after quantization.

Source: https://dev.to/syedazeez/quantizing-three-models-to-fit-a-6-gb-laptop-gpu-and-the-one-that-wouldnt-4pjl
