FlashAttention: Fast and Memory-Efficient Exact Attention

Introduction

FlashAttention is a library from Dao-AILab that implements FlashAttention, FlashAttention-2, and a beta release of FlashAttention-3. It computes exact attention (the same result as standard attention, not an approximation) while running faster and using less memory. Attention is a core component of transformer architectures, and by easing its memory and speed bottlenecks FlashAttention has been widely adopted for training larger models on longer sequences.

FlashAttention-2 is a complete rewrite that is up to 2x faster than the original FlashAttention, while the FlashAttention-3 beta is further optimized for Hopper GPUs (e.g., H100). The library supports multi-query and grouped-query attention (MQA/GQA), sliding window local attention, ALiBi (attention with linear biases), paged KV cache, and softcapping.

Installation

To get started with FlashAttention, you can install it via pip:

pip install flash-attn --no-build-isolation

Alternatively, you can compile from source by running the following from the root of a clone of the repository:

python setup.py install

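If your machine has less than 96GB of RAM and many CPU cores, ninja may launch too many parallel compilation jobs and exhaust the available RAM. To cap the number of parallel jobs, set the MAX_JOBS environment variable: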

MAX_JOBS=4 pip install flash-attn --no-build-isolation

Requirements:

CUDA toolkit (12.0+) or ROCm toolkit (6.0+)
PyTorch 2.2 and above
Python packages: packaging, psutil, ninja (ensure ninja is correctly installed for faster compilation)
Linux (Windows support is experimental)

FlashAttention-3 beta specifically requires H100 / H800 GPUs and CUDA >= 12.3 (CUDA 12.8 recommended for best performance).
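Before starting a long build, it can help to confirm that the helpers in the requirements list are visible from the current interpreter. The following is a minimal sketch using only the Python standard library. The function name `check_build_prerequisites` is hypothetical and not part of FlashAttention. Finding `nvcc` or `hipcc` on PATH is only a rough hint that a CUDA or ROCm toolkit is installed, and the sketch does not check toolkit or PyTorch versions.

```python
import importlib.util
import os
import platform
import shutil


def check_build_prerequisites():
    """Report which build prerequisites are visible from this interpreter.

    Hypothetical helper for illustration; it only inspects the local
    environment and does not install or import anything heavy.
    """
    report = {}
    # Python packages named in the requirements list, plus PyTorch itself.
    for package in ("packaging", "psutil", "ninja", "torch"):
        report[package] = importlib.util.find_spec(package) is not None
    # The ninja executable must be on PATH for the fast parallel build.
    report["ninja_on_path"] = shutil.which("ninja") is not None
    # A compiler driver on PATH is a rough hint, not proof, of a toolkit install.
    report["nvcc_on_path"] = shutil.which("nvcc") is not None
    report["hipcc_on_path"] = shutil.which("hipcc") is not None
    report["is_linux"] = platform.system() == "Linux"
    # Many cores with limited RAM is the case where MAX_JOBS is worth setting.
    report["cpu_count"] = os.cpu_count()
    return report


if __name__ == "__main__":
    for name, value in check_build_prerequisites().items():
        print(f"{name}: {value}")
```

If `ninja` or `ninja_on_path` comes back False, the build may fall back to a much slower compile, so it is worth fixing that before running pip.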

Examples

FlashAttention provides several functions for implementing scaled dot product attention. Here are the main interfaces:

1. flash_attn_qkvpacked_func for pre-stacked QKV: This function is faster when Q, K, V are already stacked into a single tensor, as it avoids explicit concatenation of gradients in the backward pass.

from flash_attn import flash_attn_qkvpacked_func

flash_attn_qkvpacked_func(qkv, dropout_p=0.0, softmax_scale=None, causal=False,
                          window_size=(-1, -1), alibi_slopes=None, deterministic=False)
# Arguments:
#     qkv: (batch_size, seqlen, 3, nheads, headdim)
#     dropout_p: float. Dropout probability. Set to 0.0 during evaluation.
#     softmax_scale: float. The scaling of QK^T before applying softmax. Defaults to 1 / sqrt(headdim).
#     causal: bool. Whether to apply a causal attention mask (e.g., for autoregressive modeling).
#     window_size: (left, right). If not (-1, -1), implements sliding window local attention.
#     alibi_slopes: (nheads,) or (batch_size, nheads), fp32. Slopes for the ALiBi bias.
#     deterministic: bool. Whether to use the deterministic backward pass (slightly slower, uses more memory).
# Return:
#     out: (batch_size, seqlen, nheads, headdim).
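The causal and window_size arguments together decide which keys each query may attend to. The pure-Python sketch below models that rule as I understand it from the library's docstrings: query i attends to keys in the inclusive range [i + seqlen_k - seqlen_q - left, i + seqlen_k - seqlen_q + right], a value of -1 leaves that side unbounded, and causal=True behaves like right = 0 with the mask aligned to the bottom-right corner. The names `can_attend` and `mask_rows` are hypothetical helpers, not library functions, and the sketch performs no attention computation.

```python
def can_attend(i, j, seqlen_q, seqlen_k, window_size=(-1, -1), causal=False):
    """Return True if query position i may attend to key position j.

    Illustrative model of the masking rule only (assumed semantics,
    see the lead-in); not part of the flash_attn API.
    """
    left, right = window_size
    if causal:
        # Assumption: a causal mask is a window with no lookahead.
        right = 0
    # Align the last query with the last key (bottom-right alignment).
    center = i + seqlen_k - seqlen_q
    if left >= 0 and j < center - left:
        return False
    if right >= 0 and j > center + right:
        return False
    return True


def mask_rows(seqlen_q, seqlen_k, window_size=(-1, -1), causal=False):
    """Render the mask as strings: 'x' = attended, '.' = masked out."""
    return [
        "".join(
            "x" if can_attend(i, j, seqlen_q, seqlen_k, window_size, causal) else "."
            for j in range(seqlen_k)
        )
        for i in range(seqlen_q)
    ]


if __name__ == "__main__":
    # Each query sees itself and one key to its left.
    print("\n".join(mask_rows(4, 4, window_size=(1, 0))))
    # Causal mask with more keys than queries (e.g., decoding with a prefix).
    print("\n".join(mask_rows(2, 4, causal=True)))
```

Printing small masks like these is a quick way to sanity-check a window_size choice before passing it to the real kernel.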

2. flash_attn_func for separate Q, K, V: This function supports multi-query and grouped-query attention (MQA/GQA) by allowing K and V to have fewer heads than Q. The number of heads in Q must be divisible by the number of heads in K and V.

from flash_attn import flash_attn_func

flash_attn_func(q, k, v, dropout_p=0.0, softmax_scale=None, causal=False,
                window_size=(-1, -1), alibi_slopes=None, deterministic=False)
# Arguments:
#     q: (batch_size, seqlen, nheads, headdim)
#     k: (batch_size, seqlen, nheads_k, headdim)
#     v: (batch_size, seqlen, nheads_k, headdim)
#     (Other arguments similar to flash_attn_qkvpacked_func)
# Return:
#     out: (batch_size, seqlen, nheads, headdim).
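To make the MQA/GQA head sharing concrete, the sketch below first maps each query head to the K/V head it would read, assuming consecutive query heads share one K/V head (so with 6 query heads and 2 K/V heads, query heads 0 to 2 use K/V head 0 and heads 3 to 5 use K/V head 1). It then attempts a real flash_attn_func call, but only if PyTorch, flash-attn, and a CUDA device are all available; otherwise it returns None. Both function names are hypothetical helpers, and the tensor sizes are arbitrary. The tensors are created in fp16 on the GPU because the kernels expect fp16 or bf16 CUDA inputs.

```python
def kv_head_for_query_head(q_head, nheads, nheads_k):
    """Map a query head index to the K/V head it shares (illustrative helper)."""
    if nheads % nheads_k != 0:
        raise ValueError("nheads must be divisible by nheads_k")
    group_size = nheads // nheads_k
    return q_head // group_size


def try_flash_attn_gqa(batch_size=2, seqlen=128, nheads=6, nheads_k=2, headdim=64):
    """Run flash_attn_func with grouped-query shapes if the stack is available.

    Returns the output shape as a tuple, or None when torch, flash_attn,
    or a CUDA device is missing.
    """
    try:
        import torch
        from flash_attn import flash_attn_func
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # Inputs use the (batch_size, seqlen, nheads, headdim) layout described above.
    q = torch.randn(batch_size, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
    k = torch.randn(batch_size, seqlen, nheads_k, headdim, device="cuda", dtype=torch.float16)
    v = torch.randn_like(k)
    out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)
    return tuple(out.shape)


if __name__ == "__main__":
    print([kv_head_for_query_head(h, 6, 2) for h in range(6)])
    print(try_flash_attn_gqa())
```

The output keeps the query's head count, so a successful call here returns a shape of (2, 128, 6, 64) even though K and V carry only 2 heads.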

3. flash_attn_with_kvcache for inference and incremental decoding: This function is optimized for inference. It can update the KV cache in place with new keys and values and supports rotary embeddings.

from flash_attn import flash_attn_with_kvcache

flash_attn_with_kvcache(
    q, k_cache, v_cache, k=None, v=None, rotary_cos=None, rotary_sin=None,
    cache_seqlens=None, cache_batch_idx=None, block_table=None,
    softmax_scale=None, causal=False, window_size=(-1, -1),
    rotary_interleaved=True, alibi_slopes=None,
)
# Arguments include:
#     q: (batch_size, seqlen, nheads, headdim)
#     k_cache, v_cache: Cached keys/values, updated in place when k and v are given.
#     k, v: Optional new keys/values to append to the cache.
#     rotary_cos, rotary_sin: For applying rotary embeddings.
#     cache_seqlens: Sequence lengths of the KV cache.
#     block_table: For paged KV cache.
#     (Other arguments similar to flash_attn_func)
# Note: Does not support backward pass.
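The interplay of k_cache, cache_seqlens, and block_table is easier to see in a toy model. The sketch below uses plain Python lists in place of tensors and is not how the kernel is implemented. `append_to_cache` writes new entries into preallocated slots starting at each sequence's current length, which is my understanding of the in-place update. `locate_in_paged_cache` shows the usual paged-cache lookup, where a row of the block table maps a logical token position to a physical block and an offset. Both function names are hypothetical, and the toy returns the new lengths instead of making any claim about how the library treats cache_seqlens.

```python
def append_to_cache(k_cache, v_cache, cache_seqlens, k_new, v_new):
    """Toy in-place KV cache append using lists instead of tensors.

    k_cache[b] and v_cache[b] are fixed-length lists of slots for sequence b.
    New entries are written starting at cache_seqlens[b]. Returns the new
    per-sequence lengths; the caches themselves are modified in place.
    """
    new_seqlens = []
    for b, start in enumerate(cache_seqlens):
        n_new = len(k_new[b])
        if start + n_new > len(k_cache[b]):
            raise ValueError(f"cache for sequence {b} is too small")
        # Equal-length slice assignment overwrites slots without resizing.
        k_cache[b][start:start + n_new] = k_new[b]
        v_cache[b][start:start + n_new] = v_new[b]
        new_seqlens.append(start + n_new)
    return new_seqlens


def locate_in_paged_cache(block_table_row, position, page_block_size):
    """Translate a logical token position into (physical_block, offset)."""
    logical_block = position // page_block_size
    return block_table_row[logical_block], position % page_block_size


if __name__ == "__main__":
    k_cache = [["k0", None, None, None], ["k0", "k1", "k2", None]]
    v_cache = [["v0", None, None, None], ["v0", "v1", "v2", None]]
    lengths = append_to_cache(k_cache, v_cache, [1, 3], [["k1"], ["k3"]], [["v1"], ["v3"]])
    print(lengths, k_cache)
    print(locate_in_paged_cache([7, 2, 5], position=300, page_block_size=256))
```

Preallocating the cache and writing into it in place avoids reallocating and copying the whole cache on every decoding step.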

For a full multi-head attention layer implementation, including QKV and output projections, refer to the official MHA implementation in the repository.

Why Use FlashAttention

FlashAttention offers significant advantages for deep learning practitioners:

Exceptional Performance: FlashAttention-2 is up to 2x faster than the original FlashAttention, and the project's benchmarks report speedups for the combined forward and backward pass over standard PyTorch attention. This translates to shorter training times and more efficient inference.
Memory Efficiency: Standard attention uses memory quadratic in sequence length, whereas FlashAttention uses memory linear in sequence length, so the savings grow with sequence length: 10X at sequence length 2K and 20X at 4K. This allows training with longer sequences and larger batch sizes that would otherwise not fit in GPU memory.
Broad GPU Support: The library is optimized for a wide range of GPUs, including NVIDIA Ampere, Ada, and Hopper architectures (e.g., A100, RTX 3090, RTX 4090, H100), and AMD ROCm-enabled GPUs (MI200x, MI300x).
Advanced Features: It incorporates state-of-the-art attention features such as multi-query/grouped-query attention (MQA/GQA), sliding window local attention, ALiBi, paged KV cache (PagedAttention), and softcapping, making it versatile for various model architectures.
Full Model Integration: FlashAttention is more than an attention kernel. The repository includes full GPT model implementations and training scripts that use FlashAttention and other optimized layers, reaching up to 225 TFLOPs/sec per A100, which is equivalent to 72% model FLOPs utilization.
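The memory-efficiency point above comes down to simple arithmetic: standard attention materializes a seqlen x seqlen score matrix per head, while the inputs and output grow only linearly with seqlen. The sketch below compares the two sizes under simplifying assumptions (fp16 elements, forward pass only, no dropout mask or softmax statistics). It illustrates why the savings grow with sequence length and is not intended to reproduce the measured 10X and 20X figures. The function name and the example sizes are hypothetical.

```python
def attention_memory_estimate(batch_size, nheads, seqlen, headdim, bytes_per_element=2):
    """Return (score_matrix_bytes, qkvo_bytes) for one attention layer.

    score_matrix_bytes: the seqlen x seqlen matrix standard attention stores.
    qkvo_bytes: Q, K, V and the output, which any implementation must hold.
    Simplified back-of-the-envelope estimate, not a measurement.
    """
    score_matrix_bytes = batch_size * nheads * seqlen * seqlen * bytes_per_element
    qkvo_bytes = 4 * batch_size * nheads * seqlen * headdim * bytes_per_element
    return score_matrix_bytes, qkvo_bytes


if __name__ == "__main__":
    for seqlen in (2048, 4096):
        scores, qkvo = attention_memory_estimate(1, 16, seqlen, 64)
        print(f"seqlen={seqlen}: scores={scores / 2**20:.0f} MiB, "
              f"q/k/v/out={qkvo / 2**20:.0f} MiB, ratio={scores / qkvo:.0f}x")
```

Doubling the sequence length quadruples the score matrix but only doubles the linear terms, so the ratio between them doubles. This matches the trend in the reported savings.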