𝗙𝗹𝗮𝘀𝗵 𝗔𝘁𝘁𝗲𝗻𝘁𝗶𝗼𝗻: 𝗪𝗵𝘆 𝗜𝘁 𝗠𝗮𝘁𝘁𝗲𝗿𝘀

📅5 days ago⏱1 min read

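You pay for expensive GPUs. Your model trains slowly. Your GPU compute units sit idle. The problem is memory bandwidth. Standard attention builds a score matrix with one entry per pair of tokens. For a sequence of N tokens, that is N x N entries per head. This matrix is too big for fast on-chip memory (SRAM). The GPU writes it to slower off-chip memory (HBM) and reads it back. This wastes time.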
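To get a feel for the scale, here is a rough back-of-the-envelope sketch. The workload numbers (batch 8, 32 heads, 8192 tokens, fp16) and the helper name are assumptions picked for illustration, not figures from the source.

```python
def attention_scores_bytes(batch, heads, seq_len, bytes_per_elem=2):
    """Bytes needed to store the full seq_len x seq_len score matrix."""
    return batch * heads * seq_len * seq_len * bytes_per_elem

# Hypothetical workload: batch 8, 32 heads, 8192 tokens, fp16 (2 bytes each).
size = attention_scores_bytes(batch=8, heads=32, seq_len=8192)
size_gib = size / 2**30
print(f"score matrix: {size_gib:.1f} GiB")  # score matrix: 32.0 GiB
```

On-chip SRAM is orders of magnitude smaller than that, so the full matrix has to live in HBM unless the computation is restructured.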

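Flash Attention solves this. It breaks the queries, keys, and values into small tiles. These tiles fit in fast SRAM. It uses online softmax, which keeps a running maximum and a running sum so the softmax can be computed one tile at a time. The big matrix is never written to slow memory.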
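The online softmax idea can be sketched in plain Python for a single query row. This is a simplified illustration, not the actual GPU kernel: the values are scalars (head dimension 1), and the function names and sample numbers are made up for the example.

```python
import math

def naive_attention_row(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

def tiled_attention_row(scores, values, tile_size=4):
    """Same result, but only one tile of scores is visible at a time."""
    running_max = -math.inf  # largest score seen so far
    running_sum = 0.0        # softmax denominator, relative to running_max
    running_out = 0.0        # unnormalized output, relative to running_max
    for start in range(0, len(scores), tile_size):
        s_tile = scores[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        new_max = max(running_max, max(s_tile))
        # Rescale earlier partial results so they are relative to new_max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out *= scale
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum

scores = [0.5, 2.0, -1.0, 3.5, 0.0, 1.25, 4.0, -2.5, 0.75, 2.5]
values = [1.0, -2.0, 0.5, 3.0, 4.0, -1.0, 2.0, 0.0, 1.5, -0.5]
reference = naive_attention_row(scores, values)
tiled = tiled_attention_row(scores, values, tile_size=4)
print(abs(reference - tiled) < 1e-9)
```

The tiled version never holds more than tile_size scores at once, yet it agrees with the reference up to floating-point rounding. That is why the technique needs no model changes.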

The reported results (actual gains depend on sequence length, head dimension, and the baseline you compare against):

2x to 4x speedup on A100 GPUs.
3x to 7x speedup on H100 GPUs.
Zero loss in accuracy. It computes exact attention, not an approximation.
No model changes needed.

Versions:

v1: Added tiling and online softmax.
v2: Better parallelism and work partitioning across the GPU.
v3: Tuned for Hopper GPUs such as the H100. Added FP8.

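How to use it: Use PyTorch 2.0 or later. Call F.scaled_dot_product_attention. It picks the fastest available backend for your inputs, including a Flash Attention kernel when your hardware and dtype support it.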
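A minimal sketch of the call, assuming PyTorch 2.0 or later. The tensor shapes are placeholders. Which backend actually runs depends on your GPU, dtype, and PyTorch build, so treat this as a starting point and profile your own workload.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: (batch, heads, seq_len, head_dim).
batch, heads, seq_len, head_dim = 2, 4, 128, 64

# Fused GPU kernels generally expect fp16 or bf16 inputs on CUDA.
# Fall back to fp32 on CPU so the sketch still runs there.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# is_causal=True applies the lower-triangular mask used by decoder models.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # same shape as q
```

On a CPU this still returns the correct result, but it will not use a Flash Attention kernel.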

Avoid it if:

You use a CPU.
Your sequences exceed 100k tokens.
Your bottleneck is MLP layers.
Your head dimension exceeds 256.

Source: https://dev.to/tech_nuggets/flash-attention-what-it-does-and-why-it-matters-59b8

Optional learning community: https://t.me/GyaanSetuAi
