Wan2.2: Open and Advanced Large-Scale Video Generative Models

Summary

Wan2.2 is an open-source suite of advanced, large-scale video generative models. It introduces a Mixture-of-Experts (MoE) architecture for enhanced model capacity and achieves cinematic-level aesthetics through meticulously curated training data. The suite offers efficient high-definition video generation across text-to-video, image-to-video, speech-to-video, and character animation tasks, and is designed for both industrial and academic applications, pushing the boundaries of AI-driven video creation.

Introduction

Wan2.2 is an open and advanced suite of large-scale video generative models, pushing the boundaries of AI-driven video creation. This major upgrade builds upon its predecessor, Wan2.1, by incorporating several key innovations designed to enhance generation quality, model capability, and computational efficiency. It introduces a Mixture-of-Experts (MoE) architecture, achieves cinematic-level aesthetics through meticulously curated data, and significantly improves complex motion generation with a larger training dataset. Furthermore, Wan2.2 offers an efficient high-definition hybrid Text-Image-to-Video (TI2V) model, capable of generating 720P videos at 24fps on consumer-grade GPUs.

Installation

To get started with Wan2.2, follow these simple steps:

First, clone the repository:

git clone https://github.com/Wan-Video/Wan2.2.git
cd Wan2.2

Next, install the required dependencies. Ensure you have torch >= 2.4.0. If the flash_attn installation fails, try installing the other packages first and then flash_attn (a workaround sketch follows the commands below).

pip install -r requirements.txt
# If you want to use CosyVoice for Speech-to-Video Generation, install additional requirements:
pip install -r requirements_s2v.txt
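
If flash_attn refuses to build, one workaround, sketched below under the assumption that flash_attn is the only troublesome entry in requirements.txt, is to install the remaining dependencies first and then build flash_attn on its own:

# Workaround sketch: install everything except flash_attn, then flash_attn separately
grep -ivE '^flash[-_]attn' requirements.txt > requirements_no_fa.txt
pip install -r requirements_no_fa.txt
pip install flash_attn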

Examples

Wan2.2 supports various video generation tasks, including Text-to-Video (T2V), Image-to-Video (I2V), Text-Image-to-Video (TI2V), Speech-to-Video (S2V), and character animation with Wan-Animate.

You can download the models using huggingface-cli or modelscope-cli. For instance, to download the T2V-A14B model:

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./Wan2.2-T2V-A14B
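
If you prefer modelscope-cli instead, a minimal sketch, assuming the checkpoint is mirrored under the same Wan-AI/Wan2.2-T2V-A14B identifier on ModelScope, would be:

# Assumed ModelScope mirror of the same checkpoint
pip install modelscope
modelscope download Wan-AI/Wan2.2-T2V-A14B --local_dir ./Wan2.2-T2V-A14B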

Here's an example of running Text-to-Video generation without prompt extension on a single GPU:

python generate.py --task t2v-A14B --size 1280*720 --ckpt_dir ./Wan2.2-T2V-A14B --offload_model True --convert_model_dtype --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."

For multi-GPU inference or other generation tasks like Image-to-Video, Speech-to-Video, or Wan-Animate, please refer to the comprehensive documentation in the repository's README.
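
As a rough illustration of what multi-GPU inference looks like, the sketch below drives generate.py through torchrun with FSDP and sequence-parallel options; the flag names (--dit_fsdp, --t5_fsdp, --ulysses_size) are assumptions carried over from the Wan family's documented invocation style and should be verified against the README:

# Hedged sketch: 8-GPU Text-to-Video inference; verify flag names against the README
torchrun --nproc_per_node=8 generate.py --task t2v-A14B --size 1280*720 --ckpt_dir ./Wan2.2-T2V-A14B --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."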

Why Use Wan2.2?

Wan2.2 stands out for its cutting-edge features and performance:

  • Mixture-of-Experts (MoE) Architecture: Enhances model capacity and generation quality while maintaining efficient computational costs.
  • Cinematic-level Aesthetics: Generates videos with precise and controllable cinematic styles, thanks to meticulously curated aesthetic data.
  • Complex Motion Generation: Trained on significantly larger datasets, leading to superior generalization across motions, semantics, and aesthetics.
  • Efficient High-Definition TI2V: The 5B model, powered by an advanced Wan2.2-VAE, supports 720P video generation at 24fps, even on consumer-grade GPUs like the RTX 4090 (a usage sketch follows this list).
  • Versatile Applications: Supports Text-to-Video, Image-to-Video, Speech-to-Video, and Character Animation, making it suitable for a wide range of creative and industrial uses.
  • Open-Source: Provides an accessible and powerful tool for researchers and developers in the generative AI community.
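
For the efficient high-definition TI2V point above, a minimal single-GPU sketch is shown here; the task name (ti2v-5B), output size (1280*704), checkpoint directory, and prompt are assumptions modeled on the generate.py invocation earlier and should be checked against the repository README:

# Assumed single-GPU invocation of the 5B TI2V model at 720P (e.g. on an RTX 4090)
python generate.py --task ti2v-5B --size 1280*704 --ckpt_dir ./Wan2.2-TI2V-5B --offload_model True --convert_model_dtype --prompt "A cinematic shot of waves crashing against a rocky coastline at sunset."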
