Step-Video-T2V: State-of-the-Art Text-to-Video Generation Model

Summary

Step-Video-T2V is a state-of-the-art text-to-video pre-trained model with 30 billion parameters that can generate videos of up to 204 frames. A deep-compression Video-VAE makes training and inference more efficient, and Direct Preference Optimization (DPO) improves visual quality. The authors evaluate the model on a new benchmark they introduce, Step-Video-T2V-Eval, and report state-of-the-art text-to-video quality.

Introduction

Step-Video-T2V is a text-to-video pre-trained model developed by stepfun-ai, with 30 billion parameters and the ability to generate videos of up to 204 frames. Its deep-compression Video-VAE achieves 16x16 spatial and 8x temporal compression, which improves both training and inference efficiency. Direct Preference Optimization (DPO) is applied in the final training stage to improve the visual quality of generated videos, with the aim of smoother and more realistic output. The architecture is a DiT with 3D full attention, trained with Flow Matching, and the model accepts both English and Chinese prompts.
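
To make the compression ratios concrete, the Python sketch below computes the size of the latent grid for a clip. It is arithmetic only and not the project's VAE code. The 544x992 resolution is a hypothetical value chosen for illustration, and rounding partial blocks up is an assumption about how sizes that do not divide evenly are handled.

```python
import math

SPATIAL_FACTOR = 16   # 16x16 spatial compression, as stated above
TEMPORAL_FACTOR = 8   # 8x temporal compression, as stated above


def latent_shape(frames: int, height: int, width: int) -> tuple[int, int, int]:
    """Return (latent_frames, latent_height, latent_width).

    Assumption: partial blocks are rounded up. The real Video-VAE may pad
    or treat boundaries differently.
    """
    return (
        math.ceil(frames / TEMPORAL_FACTOR),
        math.ceil(height / SPATIAL_FACTOR),
        math.ceil(width / SPATIAL_FACTOR),
    )


def positions_per_latent() -> int:
    """Pixel positions (per channel) covered by one latent position."""
    return SPATIAL_FACTOR * SPATIAL_FACTOR * TEMPORAL_FACTOR


if __name__ == "__main__":
    # 544x992 is a hypothetical resolution used only for illustration.
    print(latent_shape(204, 544, 992))
    print(positions_per_latent())
```

Under these assumptions, each latent position stands in for 16 * 16 * 8 pixel positions, which is why the DiT attends over far fewer tokens than there are pixels in the clip.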

Installation

To get started with Step-Video-T2V, follow these steps. An NVIDIA GPU with CUDA support is required, and GPUs with 80GB of memory are recommended for optimal generation quality. The text-encoder (step_llm) self-attention supports CUDA capabilities sm_80, sm_86, and sm_90.

Requirements:

Python >= 3.10.0
PyTorch >= 2.3-cu121
CUDA Toolkit
FFmpeg
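
Before creating the environment, it can help to confirm the two requirements that do not depend on PyTorch. The sketch below uses only the Python standard library. The helper names `python_ok` and `ffmpeg_path` are made up for this example and are not part of the repository.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # from the requirements list above


def python_ok(version: tuple = tuple(sys.version_info[:2])) -> bool:
    """True if the given (major, minor) version meets the stated minimum."""
    return tuple(version[:2]) >= MIN_PYTHON


def ffmpeg_path():
    """Return the path of the ffmpeg executable on PATH, or None if absent."""
    return shutil.which("ffmpeg")


if __name__ == "__main__":
    print("Python >= 3.10:", python_ok())
    print("FFmpeg found at:", ffmpeg_path())
```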

git clone https://github.com/stepfun-ai/Step-Video-T2V.git
conda create -n stepvideo python=3.10
conda activate stepvideo

cd Step-Video-T2V
pip install -e .
pip install flash-attn --no-build-isolation  ## flash-attn is optional
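
Once PyTorch is installed, the GPU requirements stated earlier (compute capability sm_80, sm_86, or sm_90, with 80GB of memory recommended) can be checked with a short script. This is a minimal sketch rather than a tool shipped with the project. The function names are hypothetical, and the capability set simply mirrors the list in the text.

```python
# Compute capabilities named in the installation notes: sm_80, sm_86, sm_90.
SUPPORTED_CAPABILITIES = {(8, 0), (8, 6), (9, 0)}


def capability_supported(capability: tuple) -> bool:
    """True if a (major, minor) compute capability is in the listed set."""
    return tuple(capability) in SUPPORTED_CAPABILITIES


def describe_gpu(index: int = 0) -> str:
    """Summarize one GPU. torch is imported lazily so that
    capability_supported() can be used without PyTorch installed."""
    import torch

    if not torch.cuda.is_available():
        return "No CUDA device is visible to PyTorch."
    capability = torch.cuda.get_device_capability(index)
    total_gib = torch.cuda.get_device_properties(index).total_memory / 1024**3
    status = "listed" if capability_supported(capability) else "not listed"
    return (
        f"GPU {index}: sm_{capability[0]}{capability[1]} ({status}), "
        f"{total_gib:.1f} GiB total memory"
    )


if __name__ == "__main__":
    print(describe_gpu(0))
```

The script reports memory rather than enforcing a threshold, since a card sold as 80GB typically reports slightly less usable memory.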

Examples

Video demonstrations of Step-Video-T2V's output are available in the GitHub repository's README.

For inference, the project provides scripts for multi-GPU parallel deployment and for single-GPU inference. Single-GPU inference is supported through ModelScope's DiffSynth-Studio, which reduces VRAM usage.

Multi-GPU Parallel Deployment Example:

python api/call_remote_server.py --model_dir where_you_download_dir &
parallel=4
url='127.0.0.1'
model_dir=where_you_download_dir

tp_degree=2
ulysses_degree=2

torchrun --nproc_per_node $parallel run_parallel.py --model_dir $model_dir --vae_url $url --caption_url $url --ulysses_degree $ulysses_degree --tensor_parallel_degree $tp_degree --prompt "一名宇航员在月球上发现一块石碑，上面印有“stepfun”字样，闪闪发光" --infer_steps 50 --cfg_scale 9.0 --time_shift 13.0
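
In the example above, the process count (4) equals ulysses_degree times tp_degree (2 x 2). The Python sketch below assembles the same torchrun command and checks that relationship before launching. Treating the product as a hard constraint is an assumption inferred from the example, not something stated here. The function name `build_torchrun_command` is hypothetical, and only flags that appear in the command above are used.

```python
import shlex


def build_torchrun_command(
    model_dir: str,
    url: str,
    prompt: str,
    parallel: int = 4,
    ulysses_degree: int = 2,
    tp_degree: int = 2,
    infer_steps: int = 50,
    cfg_scale: float = 9.0,
    time_shift: float = 13.0,
) -> list[str]:
    """Build the argument list for the multi-GPU launch shown above."""
    # Assumption: the two parallel degrees must multiply to the process count.
    if ulysses_degree * tp_degree != parallel:
        raise ValueError(
            f"ulysses_degree * tp_degree = {ulysses_degree * tp_degree}, "
            f"expected {parallel}"
        )
    return [
        "torchrun", "--nproc_per_node", str(parallel), "run_parallel.py",
        "--model_dir", model_dir,
        "--vae_url", url,
        "--caption_url", url,
        "--ulysses_degree", str(ulysses_degree),
        "--tensor_parallel_degree", str(tp_degree),
        "--prompt", prompt,
        "--infer_steps", str(infer_steps),
        "--cfg_scale", str(cfg_scale),
        "--time_shift", str(time_shift),
    ]


if __name__ == "__main__":
    cmd = build_torchrun_command(
        "where_you_download_dir", "127.0.0.1", "An astronaut on the moon"
    )
    # shlex.join quotes the prompt so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the job without going through a shell, which avoids quoting problems with prompts that contain spaces or quotation marks.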

Refer to the official repository for recommended best-practice inference settings and for single-GPU usage via DiffSynth-Studio.

Why Use Step-Video-T2V?

Step-Video-T2V stands out as a leading solution for text-to-video generation due to several key advantages:

State-of-the-Art Performance: On the Step-Video-T2V-Eval benchmark, the authors report video quality that compares favorably with both open-source and commercial engines.
Efficiency: The innovative deep compression Video-VAE significantly reduces computational overhead during training and inference.
Advanced Architecture: It uses a DiT with 3D full attention and applies Direct Preference Optimization (DPO) to improve visual consistency and realism.
Multilingual Support: The model utilizes bilingual text encoders, supporting both English and Chinese prompts.
Community & Integration: The project is actively developed, with integration into Hugging Face Diffusers planned, and it benefits from collaboration with teams such as xDiT and FastVideo.