NVIDIA Blackwell Speeds Up JAX Training
NVIDIA released NVFP4. This 4-bit format runs on Blackwell GPUs. It speeds up JAX training in MaxText.
You get 1.8x faster training over FP8. The format packs two 4-bit values into the 8 bits that hold one FP8 value. This doubles the math density.
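To make the packing idea concrete, here is a minimal sketch in plain Python. It treats each 4-bit value as an opaque integer code from 0 to 15 and stores two codes per byte. The function names and the nibble order (first code in the low 4 bits) are assumptions for illustration, not NVIDIA's actual memory layout.

```python
def pack_nibbles(codes):
    """Pack 4-bit codes (integers 0..15) into bytes, two codes per byte.

    Assumed layout for illustration: the first code of each pair goes in
    the low 4 bits, the second in the high 4 bits.
    """
    if len(codes) % 2 != 0:
        raise ValueError("need an even number of 4-bit codes")
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("each code must fit in 4 bits (0..15)")
    return bytes((hi << 4) | lo for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_nibbles(packed):
    """Inverse of pack_nibbles: recover the list of 4-bit codes."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)  # low nibble: first code of the pair
        codes.append(byte >> 4)    # high nibble: second code of the pair
    return codes


codes = [1, 2, 15, 0]
packed = pack_nibbles(codes)
print(len(codes), "codes stored in", len(packed), "bytes")
print(unpack_nibbles(packed) == codes)
```

Four codes fit in two bytes, which is the 2x density gain over an 8-bit format.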
Here are the key facts:
- 1.8x training speedup over FP8.
- No reported accuracy loss for models up to 70B parameters.
- Native support in Google MaxText.
- Dedicated FP4 tensor cores in Blackwell hardware.
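NVFP4 also saves memory. The source reports half the memory of FP16 and 1.5x less than FP8. Those figures presumably cover the whole training footprint, since raw 4-bit values alone take a quarter of the storage of FP16 and half that of FP8. The savings let you use larger batch sizes.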
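As a back-of-the-envelope check on raw weight storage only, the sketch below estimates gigabytes for a 70B-parameter model at each width. The helper name is hypothetical. The 4-bit case assumes one 8-bit scale shared by every 16 values, which is how NVFP4 is commonly described; treat that as an assumption. Activations, gradients, and optimizer state are ignored, so these numbers are not the end-to-end figures quoted above.

```python
def weight_gigabytes(num_params, bits_per_value, block_size=None, scale_bits=0):
    """Estimate raw weight storage in GB (1 GB = 1e9 bytes).

    block_size and scale_bits model an optional shared scale factor per
    block of values (an assumption for illustration).
    """
    total_bits = num_params * bits_per_value
    if block_size:
        # One shared scale of `scale_bits` bits per block of values.
        total_bits += (num_params / block_size) * scale_bits
    return total_bits / 8 / 1e9


N = 70e9  # 70B parameters, the model size mentioned in the post

fp16 = weight_gigabytes(N, 16)
fp8 = weight_gigabytes(N, 8)
fp4 = weight_gigabytes(N, 4, block_size=16, scale_bits=8)

print(f"FP16: {fp16:.1f} GB")
print(f"FP8:  {fp8:.1f} GB")
print(f"4-bit with block scales: {fp4:.1f} GB")
print(f"ratio vs FP16: {fp16 / fp4:.2f}x, vs FP8: {fp8 / fp4:.2f}x")
```

Under these assumptions the shared scales add half a bit per value, so the saving over FP8 is a little under 2x.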
The H100 lacks these FP4 cores. This gives Blackwell a clear edge for pre-training.
NVIDIA says accuracy stays high. That claim still needs testing on models over 100B parameters. We need to see real benchmark scores to be sure.
Source: https://dev.to/gentic_news/nvidia-nvfp4-on-blackwell-cuts-jax-training-by-18x-in-maxtext-373a