Flash Attention: Why It Matters
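You pay for expensive GPUs. Your model trains slowly. Your GPU compute units sit idle. The problem is memory bandwidth. Standard attention builds a score matrix with one entry for every pair of tokens, so it grows with the square of the sequence length. This matrix is too big for fast on-chip memory (SRAM). The GPU writes it to slower high-bandwidth memory (HBM) and reads it back. This wastes time.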
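To put a number on "too big", here is a rough back-of-the-envelope sketch. The shapes (8192 tokens, 32 heads, fp16) and the helper name `attention_matrix_bytes` are illustrative assumptions, not figures from any specific model.

```python
def attention_matrix_bytes(seq_len: int, num_heads: int,
                           batch: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes needed to store the full seq_len x seq_len score matrix for every head."""
    return batch * num_heads * seq_len * seq_len * bytes_per_elem


# Hypothetical shapes: 8192 tokens, 32 heads, fp16 (2 bytes per element).
size = attention_matrix_bytes(seq_len=8192, num_heads=32)
print(f"{size / 2**30:.1f} GiB")  # prints "4.0 GiB"
```

That is for a single layer and a batch of one. On-chip SRAM is orders of magnitude smaller, so the matrix has to live in slow memory unless it is never built at all.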
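Flash Attention solves this. It breaks the queries, keys, and values into small tiles. These tiles fit in fast SRAM. It uses online softmax, which keeps a running maximum and a running sum for each row so the softmax can be computed one tile at a time. The full matrix is never written to slow memory.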
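The idea can be sketched in NumPy. This is a simplified illustration, not the real kernel: it tiles only over keys and values (the actual implementation also tiles over queries and runs in fused GPU code), and the names `tiled_attention`, `naive_attention`, and `tile` are assumptions made for this example.

```python
import numpy as np


def naive_attention(q, k, v):
    """Reference version: builds the full (n, n) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def tiled_attention(q, k, v, tile=64):
    """Processes keys/values one tile at a time using online softmax."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[-1]))      # unnormalized running output
    row_max = np.full((n, 1), -np.inf)    # running max of each score row
    row_sum = np.zeros((n, 1))            # running softmax denominator
    for start in range(0, k.shape[0], tile):
        k_t = k[start:start + tile]
        v_t = v[start:start + tile]
        s = (q @ k_t.T) * scale           # only an (n, tile) block exists
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)  # rescales earlier tiles
        p = np.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_t
        row_max = new_max
    return out / row_sum


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 32)) for _ in range(3))
assert np.allclose(tiled_attention(q, k, v), naive_attention(q, k, v))
```

The correction factor is the key step. Whenever a new tile raises the row maximum, everything accumulated so far is rescaled, so the final result matches the ordinary softmax.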
The results show:
- 2x to 4x speedup on A100 GPUs.
- 3x to 7x speedup on H100 GPUs.
- No loss in accuracy. It computes exact attention, not an approximation.
- No model changes needed.
Versions:
- v1: Added tiling.
- v2: Better parallelism.
- v3: Added FP8 and optimizations specific to H100 (Hopper) GPUs.
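How to use it: Use PyTorch 2.0 or later. Call torch.nn.functional.scaled_dot_product_attention (usually written F.scaled_dot_product_attention). It picks the fastest available backend for your inputs, including Flash Attention when they qualify.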
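A minimal sketch of that call, assuming PyTorch 2.0 or later is installed. The tensor shapes are arbitrary examples. Which backend actually runs depends on your hardware, dtype, and PyTorch version, so treat this as an illustration and not a guarantee that the Flash Attention kernel is selected.

```python
import torch
import torch.nn.functional as F

# Fused GPU kernels generally expect half precision on CUDA;
# on CPU this falls back to a standard implementation.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Example shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(2, 8, 128, 64, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# is_causal=True applies the usual decoder mask without building it by hand.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```

The output has the same shape as the query tensor, so the call can replace a hand-written softmax(QK^T)V block without changes to the rest of the model.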
Avoid it if:
- You use a CPU.
- Your sequences exceed 100k tokens.
- Your bottleneck is MLP layers.
- Your head dimension exceeds 256.
Source: https://dev.to/tech_nuggets/flash-attention-what-it-does-and-why-it-matters-59b8