๐๐ผ๐ ๐ ๐๐ฐ๐ต ๐ฅ๐๐ ๐๐ผ ๐ฌ๐ผ๐ ๐ก๐ฒ๐ฒ๐ฑ ๐ณ๐ผ๐ฟ ๐๐๐ ๐?
Stop asking if a model will run on your machine. Use a formula instead.
A model memory footprint follows this rule: RAM = (parameters in billions) * (bytes per parameter) + overhead
For most users, Q4 quantization is the sweet spot. It uses about 0.6 GB per billion parameters. A 7B model needs roughly 4.2 GB to 4.7 GB of space.
Understand the difference between RAM and VRAM:
โข RAM is your system memory. Your CPU runs the math here. It is slow. โข VRAM is your GPU memory. The GPU runs the math here. It is 10x to 30x faster.
If a model fits half in VRAM and half in RAM, it runs at the speed of the slow half. Your goal is to fit the entire model in VRAM. This is why a cheap GPU often beats an expensive laptop with massive RAM.
Quantization guide:
- FP16: Full quality, uses 2.0 GB per 1B params.
- Q8_0: Nearly lossless, uses 1.1 GB per 1B params.
- Q4_K_M: The sweet spot, uses 0.6 GB per 1B params.
- Q2_K: Too low quality to trust.
Do not go below Q4 unless you have no choice. Lower bits save space but ruin logic and code quality.
Hardware tiers:
RAM-only (8GB total): Run 1.5B models. They are fast on CPU. Avoid 8B models or your system will crawl.
RAM-only (16GB total): Run 7B models comfortably. You can keep your browser and IDE open.
GPU-enabled (8GB to 12GB VRAM): Everything flies. A 7B model feels like a paid API. This is the best setup for developers.
High-end GPU (24GB VRAM): You can run 32B models. This rivals cloud quality.
Pro tips:
- Use "ollama ps" to check your split. You want 100% GPU.
- Leave 4GB of RAM for your operating system.
- Long prompts increase memory use. Plan for extra headroom.
- Benchmark the second time you run a model. The first run is always slow due to loading.
You need less RAM than you think, but more VRAM than you have.
Source: https://dev.to/pavelespitia/how-much-ram-do-you-really-need-to-run-llms-locally-2026-benchmarks-3kd2
Optional learning community: https://t.me/GyaanSetuAi