# Step-Video-T2V: State-of-the-Art Text-to-Video Generation Model

This repository profile is provided by osrepos.com, an open source repository discovery platform.

Source: osrepos.com
Repository profile: https://osrepos.com/repo/stepfun-ai-step-video-t2v
Generated for open source discovery and AI-assisted research.

Step-Video-T2V is a state-of-the-art text-to-video pre-trained model with 30 billion parameters that can generate videos of up to 204 frames. A deep-compression Video-VAE makes training and inference more efficient, and Direct Preference Optimization (DPO) is used to improve visual quality. The authors evaluate the model on a new benchmark they introduce, Step-Video-T2V-Eval, and report state-of-the-art text-to-video quality.

GitHub: https://github.com/stepfun-ai/Step-Video-T2V

## Topics

- Python
- Text-to-Video
- AI
- Machine Learning
- Video Generation
- Diffusion Models
- Generative AI

## Repository Information

Last analyzed by OSRepos: Wed Oct 29 2025 20:01:32 GMT+0000 (Western European Standard Time)

## Safety Notice

OSRepos shares public repositories for knowledge and discovery only. Review source code, dependencies, licenses, and security implications before running or installing anything.

## Content

## Introduction

Step-Video-T2V is a text-to-video pre-trained model developed by stepfun-ai, with 30 billion parameters and the ability to generate videos of up to 204 frames. Its deep-compression Video-VAE achieves 16x16 spatial and 8x temporal compression, which improves both training and inference efficiency. Direct Preference Optimization (DPO) is applied in the final stage to improve the visual quality of the generated videos, yielding smoother and more realistic outputs. The architecture is a DiT with 3D full attention, trained with Flow Matching, and the model supports both English and Chinese prompts.
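
To make the compression ratios concrete, the sketch below computes the latent grid size that 16x16 spatial and 8x temporal compression would give for a clip. It is illustrative arithmetic only: the `latent_shape` and `compression_factor` helpers, the 544x992 resolution in the usage example, and the ceiling-division rounding are all assumptions rather than details stated here, and the real Video-VAE may pad or treat boundary frames differently.

```python
import math


def latent_shape(frames, height, width, spatial_ratio=16, temporal_ratio=8):
    """Return (latent_frames, latent_height, latent_width).

    Assumption: partial blocks are rounded up (ceiling division).
    """
    return (
        math.ceil(frames / temporal_ratio),
        math.ceil(height / spatial_ratio),
        math.ceil(width / spatial_ratio),
    )


def compression_factor(spatial_ratio=16, temporal_ratio=8):
    """Number of pixel positions covered by one latent position."""
    return spatial_ratio * spatial_ratio * temporal_ratio


if __name__ == "__main__":
    # Hypothetical clip: 204 frames at 544x992 (resolution is an assumption).
    print(latent_shape(204, 544, 992))
    print(compression_factor())
```

Under these assumptions, a 204-frame 544x992 clip maps to a 26x34x62 latent grid, and each latent position covers 16 * 16 * 8 = 2048 pixel positions. That reduction in sequence length is what makes 3D full attention over the whole clip more tractable.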

## Installation

To get started with Step-Video-T2V, follow these steps. An NVIDIA GPU with CUDA support is required, and GPUs with 80GB of memory are recommended for optimal generation quality. The text-encoder (step_llm) self-attention supports CUDA capabilities sm_80, sm_86, and sm_90.
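
A quick way to see whether a GPU falls into the listed capability set is sketched below. The helper names are hypothetical and not part of the Step-Video-T2V code base; the only PyTorch calls used are `torch.cuda.is_available()` and `torch.cuda.get_device_capability()`, and the check degrades to `None` when PyTorch or CUDA is missing.

```python
# sm_80, sm_86 and sm_90, as listed in the text above.
SUPPORTED_CAPABILITIES = {(8, 0), (8, 6), (9, 0)}


def is_supported_capability(major, minor):
    """True when (major, minor) is one of the listed compute capabilities."""
    return (major, minor) in SUPPORTED_CAPABILITIES


def check_current_gpu(device_index=0):
    """Return True/False for the given GPU, or None if it cannot be queried."""
    try:
        import torch  # optional dependency for this sketch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(device_index)
    return is_supported_capability(major, minor)


if __name__ == "__main__":
    print("text-encoder attention supported:", check_current_gpu())
```

Keeping the pure lookup (`is_supported_capability`) separate from the hardware query makes the check easy to test on machines without a GPU.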

**Requirements:**
*   Python >= 3.10.0
*   PyTorch >= 2.3-cu121
*   CUDA Toolkit
*   FFmpeg
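
Some of these requirements can be checked from Python before installing. The following is a minimal sketch using only the standard library; `check_prerequisites` is a hypothetical helper, and it deliberately does not verify the PyTorch version or the CUDA Toolkit, which need tool-specific checks.

```python
import shutil
import sys
from importlib.util import find_spec


def check_prerequisites(min_python=(3, 10)):
    """Report which prerequisites appear to be present on this machine."""
    return {
        # Interpreter version meets the stated minimum.
        "python": sys.version_info[:2] >= min_python,
        # An ffmpeg executable is discoverable on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # A torch package is importable (version is not checked here).
        "torch": find_spec("torch") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```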

```bash
git clone https://github.com/stepfun-ai/Step-Video-T2V.git
conda create -n stepvideo python=3.10
conda activate stepvideo

cd Step-Video-T2V
pip install -e .
pip install flash-attn --no-build-isolation  # flash-attn is optional
```

## Examples

Step-Video-T2V generates high-fidelity, dynamic videos from text prompts; video demonstrations are available in the GitHub repository's README.

For inference, the project provides scripts for both multi-GPU parallel deployment and single-GPU inference, with the latter supported by ModelScope's DiffSynth-Studio for VRAM reduction.

**Multi-GPU Parallel Deployment Example:**
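```bash
# Start the remote server for the caption and VAE APIs in the background.
python api/call_remote_server.py --model_dir where_you_download_dir &

parallel=4
url='127.0.0.1'
model_dir=where_you_download_dir

tp_degree=2
ulysses_degree=2

# Here tp_degree * ulysses_degree = parallel (2 * 2 = 4).
# Prompt in English: An astronaut discovers a stone tablet on the moon bearing the word "stepfun", glittering.
torchrun --nproc_per_node $parallel run_parallel.py --model_dir $model_dir --vae_url $url --caption_url $url --ulysses_degree $ulysses_degree --tensor_parallel_degree $tp_degree --prompt "一名宇航员在月球上发现一块石碑，上面印有“stepfun”字样，闪闪发光" --infer_steps 50 --cfg_scale 9.0 --time_shift 13.0
```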

Refer to the official repository for recommended inference settings and for single-GPU usage via DiffSynth-Studio.
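
When launching from a Python script rather than a shell, the same `torchrun` invocation can be assembled as an argument list. The sketch below reuses only the flags shown in the multi-GPU example; `build_torchrun_command` is a hypothetical helper, and deriving the process count as `tp_degree * ulysses_degree` is an assumption based on the example values (2 * 2 = 4), not a documented rule.

```python
import shlex


def build_torchrun_command(model_dir, url, prompt, tp_degree=2, ulysses_degree=2,
                           infer_steps=50, cfg_scale=9.0, time_shift=13.0):
    """Build the torchrun argument list for run_parallel.py."""
    # Assumption: one process per (tensor-parallel, Ulysses) rank pair.
    parallel = tp_degree * ulysses_degree
    return [
        "torchrun", "--nproc_per_node", str(parallel), "run_parallel.py",
        "--model_dir", model_dir,
        "--vae_url", url,
        "--caption_url", url,
        "--ulysses_degree", str(ulysses_degree),
        "--tensor_parallel_degree", str(tp_degree),
        "--prompt", prompt,
        "--infer_steps", str(infer_steps),
        "--cfg_scale", str(cfg_scale),
        "--time_shift", str(time_shift),
    ]


if __name__ == "__main__":
    cmd = build_torchrun_command("where_you_download_dir", "127.0.0.1",
                                 "An astronaut walking on the moon")
    # shlex.join quotes the prompt so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with prompts that contain spaces or quotation marks.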

## Why Use Step-Video-T2V?

Step-Video-T2V offers several advantages for text-to-video generation:
*   **State-of-the-Art Performance**: On the Step-Video-T2V-Eval benchmark, the authors report state-of-the-art video quality compared with both open-source and commercial engines.
*   **Efficiency**: The deep-compression Video-VAE (16x16 spatial, 8x temporal) reduces computational overhead during training and inference.
*   **Advanced Architecture**: It uses a DiT with 3D full attention and applies Direct Preference Optimization (DPO) to improve visual consistency and realism.
*   **Multilingual Support**: The model uses bilingual text encoders and supports both English and Chinese prompts.
*   **Community & Integration**: The project is actively developed, with integration into Huggingface/Diffusers planned, and it benefits from collaboration with teams such as xDiT and FastVideo.
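
For readers unfamiliar with DPO, the sketch below shows the generic preference loss as it is usually written for language models: the loss falls when the trained model favors the preferred sample more strongly than a frozen reference model does. This is an illustration of the general idea only. How Step-Video-T2V adapts DPO to video diffusion is not described here, and the function name, the scalar log-probability inputs, and the `beta` value are assumptions.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss for one preference pair: -log(sigmoid(margin))."""
    # How much more the policy prefers the winner than the reference does.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy shifted toward the preferred sample: loss is lower.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```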

## Links

*   **GitHub Repository**: [https://github.com/stepfun-ai/Step-Video-T2V](https://github.com/stepfun-ai/Step-Video-T2V)
*   **Hugging Face Models**:
    *   Step-Video-T2V: [https://huggingface.co/stepfun-ai/stepvideo-t2v](https://huggingface.co/stepfun-ai/stepvideo-t2v)
    *   Step-Video-T2V-Turbo: [https://huggingface.co/stepfun-ai/stepvideo-t2v-turbo](https://huggingface.co/stepfun-ai/stepvideo-t2v-turbo)
*   **Technical Report (Arxiv)**: [https://arxiv.org/abs/2502.10248](https://arxiv.org/abs/2502.10248)
*   **Online Engine (跃问视频)**: [https://yuewen.cn/videos](https://yuewen.cn/videos)