Wan2.2: Open and Advanced Large-Scale Video Generative Models

Summary
Wan2.2 is an open-source and advanced suite of large-scale video generative models, introducing innovations like a Mixture-of-Experts (MoE) architecture for enhanced capacity and cinematic-level aesthetics. It offers efficient high-definition video generation capabilities, including text-to-video, image-to-video, speech-to-video, and character animation, and is designed for both industrial and academic applications.
Introduction
Wan2.2 is an open and advanced suite of large-scale video generative models, pushing the boundaries of AI-driven video creation. This major upgrade builds upon its predecessor, Wan2.1, by incorporating several key innovations designed to enhance generation quality, model capability, and computational efficiency. It introduces a Mixture-of-Experts (MoE) architecture, achieves cinematic-level aesthetics through meticulously curated data, and significantly improves complex motion generation with a larger training dataset. Furthermore, Wan2.2 offers an efficient high-definition hybrid Text-Image-to-Video (TI2V) model, capable of generating 720P videos at 24fps on consumer-grade GPUs.
Installation
To get started with Wan2.2, follow these simple steps:
First, clone the repository:
git clone https://github.com/Wan-Video/Wan2.2.git
cd Wan2.2
Next, install the required dependencies. Ensure you have torch >= 2.4.0. If the flash_attn installation fails, try installing the other packages first and flash_attn last (a possible workaround is sketched below).
pip install -r requirements.txt
# If you want to use CosyVoice for Speech-to-Video Generation, install additional requirements:
pip install -r requirements_s2v.txt
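If flash_attn fails to build, one common workaround is to install torch and flash_attn's build prerequisites first and then build flash_attn without build isolation. This ordering is an assumption based on flash_attn's usual installation advice, not an official instruction from the repository:
# Hypothetical workaround: install torch and flash_attn's build prerequisites,
# build flash_attn without build isolation, then install the remaining requirements.
pip install "torch>=2.4.0" ninja packaging
pip install flash_attn --no-build-isolation
pip install -r requirements.txt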
Examples
Wan2.2 supports various video generation tasks, including Text-to-Video (T2V), Image-to-Video (I2V), Text-Image-to-Video (TI2V), Speech-to-Video (S2V), and character animation with Wan-Animate.
You can download the models using huggingface-cli or modelscope-cli. For instance, to download the T2V-A14B model:
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./Wan2.2-T2V-A14B
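Alternatively, the same model can be fetched with the ModelScope CLI. The command below is a sketch assuming the standard modelscope download interface:
# Hypothetical ModelScope alternative to the huggingface-cli download above.
pip install modelscope
modelscope download Wan-AI/Wan2.2-T2V-A14B --local_dir ./Wan2.2-T2V-A14B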
Here's an example of running Text-to-Video generation without prompt extension on a single GPU:
python generate.py --task t2v-A14B --size 1280*720 --ckpt_dir ./Wan2.2-T2V-A14B --offload_model True --convert_model_dtype --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
For multi-GPU inference or other generation tasks like Image-to-Video, Speech-to-Video, or Wan-Animate, please refer to the comprehensive documentation in the repository's README.
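As a rough sketch, multi-GPU inference is typically launched through torchrun with FSDP and Ulysses sequence parallelism. The flag names below (--dit_fsdp, --t5_fsdp, --ulysses_size) are assumptions carried over from the Wan2.1 interface, so verify them against the README before relying on them:
# Assumed 8-GPU launch; flag names are carried over from Wan2.1 and may differ.
torchrun --nproc_per_node=8 generate.py --task t2v-A14B --size 1280*720 --ckpt_dir ./Wan2.2-T2V-A14B --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."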
Why Use Wan2.2?
Wan2.2 stands out for its cutting-edge features and performance:
- Mixture-of-Experts (MoE) Architecture: Enhances model capacity and generation quality while maintaining efficient computational costs.
- Cinematic-level Aesthetics: Generates videos with precise and controllable cinematic styles, thanks to meticulously curated aesthetic data.
- Complex Motion Generation: Trained on significantly larger datasets, leading to superior generalization across motions, semantics, and aesthetics.
- Efficient High-Definition TI2V: The 5B model, powered by an advanced Wan2.2-VAE, supports 720P video generation at 24fps, even on consumer-grade GPUs like the RTX 4090 (an example invocation is sketched after this list).
- Versatile Applications: Supports Text-to-Video, Image-to-Video, Speech-to-Video, and Character Animation, making it suitable for a wide range of creative and industrial uses.
- Open-Source: Provides an accessible and powerful tool for researchers and developers in the generative AI community.
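For reference, a single-GPU run of the 5B TI2V model might look like the following. The ti2v-5B task name, the 1280*704 resolution, and the --t5_cpu flag are assumptions based on the repository's documented defaults, so check the README before using them:
# Assumed TI2V-5B invocation; task name, size, and flags may differ from the official README.
python generate.py --task ti2v-5B --size 1280*704 --ckpt_dir ./Wan2.2-TI2V-5B --offload_model True --convert_model_dtype --t5_cpu --prompt "A close-up of a hummingbird hovering over a red flower in soft morning light."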
Links
- GitHub Repository: Wan-Video/Wan2.2
- Official Website: Wan.video
- Research Paper: arXiv:2503.20314
- Hugging Face: Wan-AI
- Discord Community: Join Discord