FlashAttention: Fast and Memory-Efficient Exact Attention

This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary

FlashAttention is a library from Dao-AILab that provides fast, memory-efficient exact attention for deep learning models. It speeds up transformer training and inference by reducing both the memory footprint and the run time of the attention computation, which makes it useful for researchers and developers working with large-scale AI models.

Repository Information

Analyzed by OSRepos on February 18, 2026

Use at your own risk

OSRepos shares public repositories for knowledge and discovery only. Any installation, execution, configuration, or use of code from these repositories is the user's own responsibility. Always review the repository, source code, dependencies, licenses, and security implications before running or installing anything. OSRepos is not responsible for issues, damages, or losses resulting from third-party repositories.

Introduction

FlashAttention is a highly optimized library developed by Dao-AILab, providing implementations of FlashAttention, FlashAttention-2, and a beta release of FlashAttention-3. Its core purpose is to deliver fast and memory-efficient exact attention, a critical component in modern deep learning architectures like transformers. By addressing the memory and speed bottlenecks of traditional attention mechanisms, FlashAttention has become widely adopted across the AI community, enabling the training of larger models and longer sequence lengths.

FlashAttention-2 is a complete rewrite that runs about 2x faster than the original FlashAttention, while the FlashAttention-3 beta is further optimized for Hopper GPUs (e.g., H100). The library supports multi-query and grouped-query attention (MQA/GQA), sliding window local attention, ALiBi (attention with linear bias), paged KV cache, and softcapping.
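
"Exact" here means the output matches standard softmax attention rather than an approximation of it. The sketch below is a plain-Python illustration for a single query vector and a single head: a reference version that computes all scores at once, and a blockwise version that visits keys and values a few at a time while keeping a running softmax. It illustrates the general idea behind memory-efficient exact attention and is not the library's GPU kernel; the function names and the block size are hypothetical, and the default scale of 1 / sqrt(headdim) is assumed.

```python
import math

def attention_row(q, keys, values, scale=None):
    """Reference softmax attention for one query vector (single head)."""
    scale = 1.0 / math.sqrt(len(q)) if scale is None else scale
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def attention_row_blockwise(q, keys, values, block=2, scale=None):
    """Same result, visiting keys/values one block at a time."""
    scale = 1.0 / math.sqrt(len(q)) if scale is None else scale
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

The blockwise version never holds more than one block of scores at a time, yet it returns the same values as the reference up to floating-point rounding.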

Installation

To get started with FlashAttention, you can install it via pip:

pip install flash-attn --no-build-isolation

Alternatively, you can compile from source:

python setup.py install

On machines with less than 96GB of RAM and many CPU cores, ninja may launch enough parallel compilation jobs to exhaust memory. In that case, limit the number of jobs with the MAX_JOBS environment variable:

MAX_JOBS=4 pip install flash-attn --no-build-isolation

Requirements:

  • CUDA toolkit (12.0+) or ROCm toolkit (6.0+)
  • PyTorch 2.2 and above
  • Python packages: packaging, psutil, ninja (ensure ninja is correctly installed for faster compilation)
  • Linux (Windows support is experimental)

FlashAttention-3 beta specifically requires H100 / H800 GPUs and CUDA >= 12.3 (CUDA 12.8 recommended for best performance).
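
Before installing, it can help to compare the local environment against the minimum versions listed above. The following is a minimal sketch, not part of the library: `version_tuple`, `meets_minimum`, and `check_environment` are hypothetical helper names, and the check only covers the PyTorch and CUDA minimums for a CUDA build.

```python
import re

def version_tuple(text):
    """Extract leading numeric components, e.g. '2.3.1+cu121' -> (2, 3, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))

def meets_minimum(found, minimum):
    """True if version string `found` is at least `minimum`."""
    a, b = version_tuple(found), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so '2.2' compares equal to '2.2.0'
    b += (0,) * (width - len(b))
    return a >= b

def check_environment():
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        import torch
    except ImportError:
        return ["PyTorch is not installed"]
    problems = []
    if not meets_minimum(torch.__version__, "2.2"):
        problems.append(f"PyTorch {torch.__version__} is older than 2.2")
    cuda = torch.version.cuda  # None on CPU-only and ROCm builds
    if cuda is not None and not meets_minimum(cuda, "12.0"):
        problems.append(f"CUDA {cuda} is older than 12.0")
    return problems
```

Calling `check_environment()` and printing the result gives a quick pass/fail summary; it does not verify the GPU architecture or the ROCm version.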

Examples

FlashAttention provides several functions for implementing scaled dot product attention. Here are the main interfaces:

1. flash_attn_qkvpacked_func for pre-stacked QKV: When Q, K, V are already stacked into a single tensor, this function is faster than calling flash_attn_func on separate Q, K, V, because the backward pass avoids explicitly concatenating their gradients.

from flash_attn import flash_attn_qkvpacked_func

flash_attn_qkvpacked_func(qkv, dropout_p=0.0, softmax_scale=None, causal=False,
                          window_size=(-1, -1), alibi_slopes=None, deterministic=False)
# Arguments:
#     qkv: (batch_size, seqlen, 3, nheads, headdim)
#     dropout_p: float. Dropout probability.
#     softmax_scale: float. The scaling of QK^T before applying softmax.
#     causal: bool. Whether to apply causal attention mask.
#     window_size: (left, right). For sliding window local attention.
#     alibi_slopes: (nheads,) or (batch_size, nheads), fp32. Bias for ALiBi.
#     deterministic: bool. Whether to use deterministic backward pass.
# Return:
#     out: (batch_size, seqlen, nheads, headdim).
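
The `causal` and `window_size` arguments determine which keys each query may attend to. The sketch below shows that rule in plain Python under two assumptions: queries and keys have the same length, and a window of `(left, right)` lets query `i` see keys in `[i - left, i + right]` inclusive, with `-1` meaning unbounded. The helper names `allowed` and `mask_rows` are hypothetical and exist only for illustration.

```python
def allowed(i, j, causal=False, window_size=(-1, -1)):
    """Whether query position i may attend to key position j."""
    left, right = window_size
    if causal:
        right = 0  # a causal mask hides every key to the right of the query
    if left >= 0 and j < i - left:
        return False
    if right >= 0 and j > i + right:
        return False
    return True

def mask_rows(seqlen, causal=False, window_size=(-1, -1)):
    """Render the mask as text: 'x' = attended, '.' = masked out."""
    return [
        "".join("x" if allowed(i, j, causal, window_size) else "."
                for j in range(seqlen))
        for i in range(seqlen)
    ]

for row in mask_rows(5, window_size=(1, 1)):
    print(row)
# xx...
# xxx..
# .xxx.
# ..xxx
# ...xx
```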

2. flash_attn_func for separate Q, K, V: This function supports multi-query and grouped-query attention (MQA/GQA) by allowing K and V to have fewer heads than Q. The number of heads in Q must be divisible by the number of heads in K and V.

from flash_attn import flash_attn_func

flash_attn_func(q, k, v, dropout_p=0.0, softmax_scale=None, causal=False,
                window_size=(-1, -1), alibi_slopes=None, deterministic=False)
# Arguments:
#     q: (batch_size, seqlen, nheads, headdim)
#     k: (batch_size, seqlen, nheads_k, headdim)
#     v: (batch_size, seqlen, nheads_k, headdim)
#     (Other arguments similar to flash_attn_qkvpacked_func)
# Return:
#     out: (batch_size, seqlen, nheads, headdim).
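
With MQA/GQA, several query heads share one key/value head. A minimal sketch of that mapping follows, assuming consecutive query heads are grouped together (for example, with 6 query heads and 2 KV heads, query heads 0 to 2 use KV head 0 and query heads 3 to 5 use KV head 1). The function name `kv_head_for` is hypothetical.

```python
def kv_head_for(q_head, nheads, nheads_k):
    """Index of the K/V head that a given query head attends with."""
    if nheads % nheads_k != 0:
        raise ValueError("nheads must be divisible by nheads_k")
    group_size = nheads // nheads_k  # query heads sharing one K/V head
    return q_head // group_size

print([kv_head_for(h, nheads=6, nheads_k=2) for h in range(6)])
# [0, 0, 0, 1, 1, 1]
```

Setting `nheads_k=1` gives multi-query attention, where every query head maps to KV head 0.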

3. flash_attn_with_kvcache for inference and incremental decoding: This function is optimized for inference. It updates the KV cache in place and supports rotary embeddings.

from flash_attn import flash_attn_with_kvcache

flash_attn_with_kvcache(
    q, k_cache, v_cache, k=None, v=None, rotary_cos=None, rotary_sin=None,
    cache_seqlens=None, cache_batch_idx=None, block_table=None,
    softmax_scale=None, causal=False, window_size=(-1, -1),
    rotary_interleaved=True, alibi_slopes=None,
)
# Arguments include:
#     q: (batch_size, seqlen, nheads, headdim)
#     k_cache, v_cache: Cached keys/values, updated inplace.
#     k, v: New keys/values to update the cache.
#     rotary_cos, rotary_sin: For applying rotary embeddings.
#     cache_seqlens: Sequence lengths of the KV cache.
#     block_table: For paged KV cache.
#     (Other arguments similar to flash_attn_func)
# Note: Does not support backward pass.
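
The cache-related arguments can be easier to follow with a small model of the bookkeeping. The sketch below uses plain Python lists in place of tensors and covers only the cache update, not the attention itself: new entries are written in place starting at each sequence's current length, and a block table maps a logical position to a slot in a paged cache. Both helpers (`append_to_cache`, `locate_in_pages`) and the `page_block_size` parameter are hypothetical, and returning the updated lengths is a choice made for this sketch rather than documented library behavior.

```python
def append_to_cache(cache, new, cache_seqlens):
    """Write new entries into a preallocated cache, in place.

    cache: [batch][max_seqlen] slots, new: [batch][seqlen_new] entries,
    cache_seqlens: current number of valid entries per sequence.
    Returns the updated lengths.
    """
    updated = []
    for b, start in enumerate(cache_seqlens):
        end = start + len(new[b])
        if end > len(cache[b]):
            raise ValueError(f"cache for sequence {b} is full")
        cache[b][start:end] = new[b]  # same-length slice: no reallocation
        updated.append(end)
    return updated

def locate_in_pages(position, block_table_row, page_block_size):
    """Map a logical token position to (physical block, offset in block)."""
    return block_table_row[position // page_block_size], position % page_block_size

k_cache = [["a", "b", None, None], ["c", None, None, None]]
lengths = append_to_cache(k_cache, [["x"], ["y"]], cache_seqlens=[2, 1])
print(lengths)   # [3, 2]
print(k_cache)   # [['a', 'b', 'x', None], ['c', 'y', None, None]]
```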

For a full multi-head attention layer implementation, including QKV and output projections, refer to the official MHA implementation in the repository.

Why Use FlashAttention

FlashAttention offers significant advantages for deep learning practitioners:

  • Exceptional Performance: FlashAttention-2 provides up to 2x speedup in combined forward and backward passes compared to standard PyTorch attention. This translates to faster training times and more efficient inference.
  • Memory Efficiency: Compared with standard attention, whose memory use grows quadratically with sequence length, FlashAttention's memory use grows linearly. The reported savings are 10X at sequence length 2K and 20X at 4K. This allows training with longer sequences and larger batch sizes than memory constraints would otherwise permit.
  • Broad GPU Support: The library is optimized for a wide range of GPUs, including NVIDIA Ampere, Ada, and Hopper architectures (e.g., A100, RTX 3090, RTX 4090, H100), and AMD ROCm-enabled GPUs (MI200x, MI300x).
  • Advanced Features: It incorporates state-of-the-art attention features such as multi-query/grouped-query attention (MQA/GQA), sliding window local attention, ALiBi, paged KV cache (PagedAttention), and softcapping, making it versatile for various model architectures.
  • Full Model Integration: FlashAttention is not just an attention kernel; it is part of a broader optimization effort. The repository includes full GPT model implementations and training scripts that use FlashAttention and other optimized layers, achieving high model FLOPs utilization (up to 225 TFLOPs/sec per A100).
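
To see why memory matters at long sequence lengths, consider the score matrix that a standard attention implementation materializes: one seqlen x seqlen matrix per head and per batch element. The following back-of-the-envelope sketch assumes 2-byte (fp16/bf16) elements and counts only that matrix; it does not reproduce the savings figures quoted above, and `score_matrix_bytes` is a hypothetical helper.

```python
def score_matrix_bytes(batch, nheads, seqlen, bytes_per_element=2):
    """Bytes needed to store the full attention score matrix."""
    return batch * nheads * seqlen * seqlen * bytes_per_element

for seqlen in (2048, 4096):
    mib = score_matrix_bytes(batch=1, nheads=16, seqlen=seqlen) / 2**20
    print(f"seqlen={seqlen}: {mib:.0f} MiB")
# seqlen=2048: 128 MiB
# seqlen=4096: 512 MiB
```

Doubling the sequence length quadruples this term, while a method with memory linear in sequence length only doubles, which is consistent with the savings ratio growing as sequences get longer.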
