{"name":"FlashAttention: Fast and Memory-Efficient Exact Attention","description":"FlashAttention is a cutting-edge library from Dao-AILab, designed to provide fast and memory-efficient exact attention for deep learning models. It significantly accelerates transformer training and inference by optimizing memory usage and computational speed. This makes it an essential tool for researchers and developers working with large-scale AI models.","github":"https://github.com/Dao-AILab/flash-attention","url":"https://osrepos.com/repo/dao-ailab-flash-attention","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/dao-ailab-flash-attention","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/dao-ailab-flash-attention.md","json":"https://osrepos.com/repo/dao-ailab-flash-attention.json","topics":["Python","Deep Learning","Attention Mechanism","GPU Optimization","Transformers","PyTorch","AI Research","High Performance Computing"],"keywords":["Python","Deep Learning","Attention Mechanism","GPU Optimization","Transformers","PyTorch","AI Research","High Performance Computing"],"stars":null,"summary":"FlashAttention is a cutting-edge library from Dao-AILab, designed to provide fast and memory-efficient exact attention for deep learning models. It significantly accelerates transformer training and inference by optimizing memory usage and computational speed. This makes it an essential tool for researchers and developers working with large-scale AI models.","content":"## Introduction\nFlashAttention is a highly optimized library developed by Dao-AILab, providing implementations of FlashAttention, FlashAttention-2, and a beta release of FlashAttention-3. Its core purpose is to deliver fast and memory-efficient exact attention, a critical component in modern deep learning architectures like transformers. 
By addressing the memory and speed bottlenecks of traditional attention mechanisms, FlashAttention has become widely adopted across the AI community, enabling the training of larger models and longer sequence lengths.\n\nFlashAttention-2 offers a complete rewrite, resulting in up to 2x faster performance, while FlashAttention-3 beta is further optimized for Hopper GPUs (e.g., H100). The library supports various advanced features including multi-query and grouped-query attention (MQA/GQA), sliding window local attention, ALiBi (attention with linear bias), paged KV cache, and softcapping.\n\n## Installation\nTo get started with FlashAttention, you can install it via pip:\n\n```sh\npip install flash-attn --no-build-isolation\n```\n\nAlternatively, you can compile from source:\n\n```sh\npython setup.py install\n```\n\nFor systems with less than 96GB of RAM and many CPU cores, you might need to limit parallel compilation jobs:\n\n```sh\nMAX_JOBS=4 pip install flash-attn --no-build-isolation\n```\n\n**Requirements:**\n*   CUDA toolkit (12.0+) or ROCm toolkit (6.0+)\n*   PyTorch 2.2 and above\n*   Python packages: `packaging`, `psutil`, `ninja` (ensure `ninja` is correctly installed for faster compilation)\n*   Linux (Windows support is experimental)\n\nFlashAttention-3 beta specifically requires H100 / H800 GPUs and CUDA >= 12.3 (CUDA 12.8 recommended for best performance).\n\n## Examples\nFlashAttention provides several functions for implementing scaled dot product attention. Here are the main interfaces:\n\n**1. 
`flash_attn_qkvpacked_func` for pre-stacked QKV:**\nThis function is faster when Q, K, V are already stacked into a single tensor, as it avoids explicit concatenation of gradients in the backward pass.\n\n```python\nfrom flash_attn import flash_attn_qkvpacked_func\n\nflash_attn_qkvpacked_func(qkv, dropout_p=0.0, softmax_scale=None, causal=False,\n                          window_size=(-1, -1), alibi_slopes=None, deterministic=False)\n# Arguments:\n#     qkv: (batch_size, seqlen, 3, nheads, headdim)\n#     dropout_p: float. Dropout probability.\n#     softmax_scale: float. The scaling of QK^T before applying softmax.\n#     causal: bool. Whether to apply causal attention mask.\n#     window_size: (left, right). For sliding window local attention.\n#     alibi_slopes: (nheads,) or (batch_size, nheads), fp32. Bias for ALiBi.\n#     deterministic: bool. Whether to use deterministic backward pass.\n# Return:\n#     out: (batch_size, seqlen, nheads, headdim).\n```\n\n**2. `flash_attn_func` for separate Q, K, V:**\nThis function supports multi-query and grouped-query attention (MQA/GQA) by allowing K and V to have fewer heads than Q.\n\n```python\nfrom flash_attn import flash_attn_func\n\nflash_attn_func(q, k, v, dropout_p=0.0, softmax_scale=None, causal=False,\n                window_size=(-1, -1), alibi_slopes=None, deterministic=False)\n# Arguments:\n#     q: (batch_size, seqlen, nheads, headdim)\n#     k: (batch_size, seqlen, nheads_k, headdim)\n#     v: (batch_size, seqlen, nheads_k, headdim)\n#     (Other arguments similar to flash_attn_qkvpacked_func)\n# Return:\n#     out: (batch_size, seqlen, nheads, headdim).\n```\n\n**3. 
`flash_attn_with_kvcache` for inference and incremental decoding:**\nThis function is optimized for inference, allowing for inplace updates of KV cache and supporting rotary embeddings.\n\n```python\nfrom flash_attn import flash_attn_with_kvcache\n\nflash_attn_with_kvcache(\n    q, k_cache, v_cache, k=None, v=None, rotary_cos=None, rotary_sin=None,\n    cache_seqlens=None, cache_batch_idx=None, block_table=None,\n    softmax_scale=None, causal=False, window_size=(-1, -1),\n    rotary_interleaved=True, alibi_slopes=None,\n)\n# Arguments include:\n#     q: (batch_size, seqlen, nheads, headdim)\n#     k_cache, v_cache: Cached keys/values, updated inplace.\n#     k, v: New keys/values to update the cache.\n#     rotary_cos, rotary_sin: For applying rotary embeddings.\n#     cache_seqlens: Sequence lengths of the KV cache.\n#     block_table: For paged KV cache.\n#     (Other arguments similar to flash_attn_func)\n# Note: Does not support backward pass.\n```\n\nFor a full multi-head attention layer implementation, including QKV and output projections, refer to the official MHA implementation in the repository.\n\n## Why Use FlashAttention\nFlashAttention offers significant advantages for deep learning practitioners:\n\n*   **Exceptional Performance**: FlashAttention-2 provides up to 2x speedup in combined forward and backward passes compared to standard PyTorch attention. This translates to faster training times and more efficient inference.\n*   **Memory Efficiency**: It drastically reduces memory footprint, achieving 10X memory savings at sequence length 2K and 20X at 4K. 
This allows for training models with much longer sequence lengths and larger batch sizes that would otherwise be impossible due to memory constraints.\n*   **Broad GPU Support**: The library is optimized for a wide range of GPUs, including NVIDIA Ampere, Ada, and Hopper architectures (e.g., A100, RTX 3090, RTX 4090, H100), and AMD ROCm-enabled GPUs (MI200x, MI300x).\n*   **Advanced Features**: It incorporates state-of-the-art attention features such as multi-query/grouped-query attention (MQA/GQA), sliding window local attention, ALiBi, paged KV cache (PagedAttention), and softcapping, making it versatile for various model architectures.\n*   **Full Model Integration**: FlashAttention is more than an attention kernel; it is part of a broader optimization effort. The repository includes full GPT model implementations and training scripts that leverage FlashAttention and other optimized layers, achieving high model FLOPs utilization (up to 225 TFLOPs/sec per A100).\n\n## Links\n*   **GitHub Repository**: [https://github.com/Dao-AILab/flash-attention](https://github.com/Dao-AILab/flash-attention){:target=\"_blank\"}\n*   **FlashAttention Paper**: [FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness](https://arxiv.org/abs/2205.14135){:target=\"_blank\"}\n*   **FlashAttention-2 Paper**: [FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning](https://tridao.me/publications/flash2/flash2.pdf){:target=\"_blank\"}\n*   **FlashAttention-3 Beta Blogpost**: [FlashAttention-3 Blogpost](https://tridao.me/blog/2024/flash3/){:target=\"_blank\"}\n*   **FlashAttention-3 Beta Paper**: [FlashAttention-3 Paper](https://tridao.me/publications/flash3/flash3.pdf){:target=\"_blank\"}","metrics":{"detailViews":7,"githubClicks":3},"dates":{"published":null,"modified":"2026-02-18T12:01:21.000Z"}}