Wan2.2: Open and Advanced Large-Scale Video Generative Models

Summary

Wan2.2 is an open-source suite of advanced, large-scale video generative models. It introduces a Mixture-of-Experts (MoE) architecture for enhanced model capacity and achieves cinematic-level aesthetics through meticulously curated training data. The suite offers efficient high-definition video generation across text-to-video, image-to-video, speech-to-video, and character animation tasks, and is designed for both industrial and academic applications, pushing the boundaries of AI-driven video creation.

Introduction

Wan2.2 is an open and advanced suite of large-scale video generative models, pushing the boundaries of AI-driven video creation. This major upgrade builds upon its predecessor, Wan2.1, by incorporating several key innovations designed to enhance generation quality, model capability, and computational efficiency. It introduces a Mixture-of-Experts (MoE) architecture, achieves cinematic-level aesthetics through meticulously curated data, and significantly improves complex motion generation with a larger training dataset. Furthermore, Wan2.2 offers an efficient high-definition hybrid Text-Image-to-Video (TI2V) model, capable of generating 720P videos at 24fps on consumer-grade GPUs.

Installation

To get started with Wan2.2, follow these simple steps:

First, clone the repository:

git clone https://github.com/Wan-Video/Wan2.2.git
cd Wan2.2

Next, install the required dependencies. Ensure you have torch >= 2.4.0. If the flash_attn installation fails, try installing the other packages first and then flash_attn (a workaround sketch follows the commands below).

pip install -r requirements.txt
# If you want to use CosyVoice for Speech-to-Video Generation, install additional requirements:
pip install -r requirements_s2v.txt
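
If flash_attn refuses to build, one workaround, sketched below under the assumption that flash_attn is the only troublesome entry in requirements.txt, is to install the remaining dependencies first and then build flash_attn on its own:

# Workaround sketch: install everything except flash_attn, then flash_attn separately
grep -ivE '^flash[-_]attn' requirements.txt > requirements_no_fa.txt
pip install -r requirements_no_fa.txt
pip install flash_attn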

Examples

Wan2.2 supports various video generation tasks, including Text-to-Video (T2V), Image-to-Video (I2V), Text-Image-to-Video (TI2V), Speech-to-Video (S2V), and character animation with Wan-Animate.

You can download the models using huggingface-cli or modelscope-cli. For instance, to download the T2V-A14B model:

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./Wan2.2-T2V-A14B
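
If you prefer modelscope-cli instead, a minimal sketch, assuming the checkpoint is mirrored under the same Wan-AI/Wan2.2-T2V-A14B identifier on ModelScope, would be:

# Assumed ModelScope mirror of the same checkpoint
pip install modelscope
modelscope download Wan-AI/Wan2.2-T2V-A14B --local_dir ./Wan2.2-T2V-A14B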

Here's an example of running Text-to-Video generation without prompt extension on a single GPU:

python generate.py --task t2v-A14B --size 1280*720 --ckpt_dir ./Wan2.2-T2V-A14B --offload_model True --convert_model_dtype --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."

For multi-GPU inference or other generation tasks like Image-to-Video, Speech-to-Video, or Wan-Animate, please refer to the comprehensive documentation in the repository's README.
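
As a rough illustration of what multi-GPU inference looks like, the sketch below drives generate.py through torchrun with FSDP and sequence-parallel options; the flag names (--dit_fsdp, --t5_fsdp, --ulysses_size) are assumptions carried over from the Wan family's documented invocation style and should be verified against the README:

# Hedged sketch: 8-GPU Text-to-Video inference; verify flag names against the README
torchrun --nproc_per_node=8 generate.py --task t2v-A14B --size 1280*720 --ckpt_dir ./Wan2.2-T2V-A14B --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."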

Why Use Wan2.2?

Wan2.2 stands out for its cutting-edge features and performance:

  • Mixture-of-Experts (MoE) Architecture: Enhances model capacity and generation quality while maintaining efficient computational costs.
  • Cinematic-level Aesthetics: Generates videos with precise and controllable cinematic styles, thanks to meticulously curated aesthetic data.
  • Complex Motion Generation: Trained on significantly larger datasets, leading to superior generalization across motions, semantics, and aesthetics.
  • Efficient High-Definition TI2V: The 5B model, powered by an advanced Wan2.2-VAE, supports 720P video generation at 24fps, even on consumer-grade GPUs like the RTX 4090 (a usage sketch follows this list).
  • Versatile Applications: Supports Text-to-Video, Image-to-Video, Speech-to-Video, and Character Animation, making it suitable for a wide range of creative and industrial uses.
  • Open-Source: Provides an accessible and powerful tool for researchers and developers in the generative AI community.
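
For the efficient high-definition TI2V point above, a minimal single-GPU sketch is shown here; the task name (ti2v-5B), output size (1280*704), checkpoint directory, and prompt are assumptions modeled on the generate.py invocation earlier and should be checked against the repository README:

# Assumed single-GPU invocation of the 5B TI2V model at 720P (e.g. on an RTX 4090)
python generate.py --task ti2v-5B --size 1280*704 --ckpt_dir ./Wan2.2-TI2V-5B --offload_model True --convert_model_dtype --prompt "A cinematic shot of waves crashing against a rocky coastline at sunset."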
