Step-Video-T2V: State-of-the-Art Text-to-Video Generation Model
This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary
Step-Video-T2V is a state-of-the-art text-to-video pre-trained model with 30 billion parameters, capable of generating videos up to 204 frames long. It achieves high efficiency through a deep-compression Video-VAE and improves visual quality using Direct Preference Optimization (DPO). The model is evaluated on a new benchmark, Step-Video-T2V-Eval, on which its authors report state-of-the-art text-to-video quality.
Introduction
Step-Video-T2V is a text-to-video pre-trained model developed by stepfun-ai, with 30 billion parameters and the ability to generate videos up to 204 frames long. It introduces a deep-compression Video-VAE that achieves 16x16 spatial and 8x temporal compression ratios, which significantly improves both training and inference efficiency. Direct Preference Optimization (DPO) is applied in the final stage to refine the visual quality of the generated videos, yielding smoother and more realistic outputs. The architecture is a Diffusion Transformer (DiT) with 3D full attention, trained using Flow Matching, and it supports both English and Chinese prompts.
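To make the compression ratios concrete, the sketch below computes the latent grid that 16x16 spatial and 8x temporal compression would imply. The function names, the 544x992 example resolution, and the ceiling-division handling of partial blocks are assumptions made for illustration; the actual Video-VAE may pad or chunk frames differently.

```python
import math

SPATIAL_RATIO = 16   # per-axis spatial compression stated in the text
TEMPORAL_RATIO = 8   # temporal compression stated in the text


def latent_shape(frames: int, height: int, width: int) -> tuple[int, int, int]:
    """Return (latent_frames, latent_height, latent_width).

    Assumption: partial blocks are rounded up (ceiling division); the real
    Video-VAE may treat boundaries differently.
    """
    return (
        math.ceil(frames / TEMPORAL_RATIO),
        math.ceil(height / SPATIAL_RATIO),
        math.ceil(width / SPATIAL_RATIO),
    )


def position_reduction() -> int:
    """Nominal factor by which the number of spatio-temporal positions shrinks."""
    return SPATIAL_RATIO * SPATIAL_RATIO * TEMPORAL_RATIO


if __name__ == "__main__":
    # 204 frames comes from the text; 544x992 is a hypothetical resolution.
    print(latent_shape(204, 544, 992))
    print(position_reduction())
```

The nominal 16 x 16 x 8 = 2048-fold reduction in positions matters for a model with full attention, whose cost grows quadratically with sequence length.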
Installation
To get started with Step-Video-T2V, follow these steps. An NVIDIA GPU with CUDA support is required, and GPUs with 80GB of memory are recommended for optimal generation quality. The text-encoder (step_llm) self-attention supports CUDA capabilities sm_80, sm_86, and sm_90.
Requirements:
- Python >= 3.10.0
- PyTorch >= 2.3-cu121
- CUDA Toolkit
- FFmpeg
git clone https://github.com/stepfun-ai/Step-Video-T2V.git
conda create -n stepvideo python=3.10
conda activate stepvideo
cd Step-Video-T2V
pip install -e .
pip install flash-attn --no-build-isolation ## flash-attn is optional
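Once the steps above are done, a quick pre-flight check can catch an unsupported setup before a long run. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and it only checks the Python version, the presence of FFmpeg on the PATH, and whether the first GPU reports one of the compute capabilities listed above.

```python
import shutil
import sys

# Compute capabilities listed above for the step_llm self-attention.
SUPPORTED_CAPABILITIES = {(8, 0), (8, 6), (9, 0)}
MIN_PYTHON = (3, 10, 0)


def python_ok(version=None) -> bool:
    """True if the given (or running) Python version meets the minimum."""
    if version is None:
        version = sys.version_info[:3]
    return tuple(version) >= MIN_PYTHON


def capability_supported(major: int, minor: int) -> bool:
    """True if sm_<major><minor> is one of the listed capabilities."""
    return (major, minor) in SUPPORTED_CAPABILITIES


def check_environment() -> dict:
    """Collect a small report; 'gpu' is None when PyTorch is not installed."""
    report = {
        "python": python_ok(),
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "gpu": None,
    }
    try:
        import torch  # imported lazily so the script runs without PyTorch
    except ImportError:
        return report
    if torch.cuda.is_available():
        report["gpu"] = capability_supported(*torch.cuda.get_device_capability(0))
    else:
        report["gpu"] = False
    return report


if __name__ == "__main__":
    print(check_environment())
```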
Examples
Step-Video-T2V offers robust performance in generating high-fidelity and dynamic videos. You can explore various video demonstrations directly on the GitHub repository's README.
For inference, the project provides scripts for both multi-GPU parallel deployment and single-GPU inference, with the latter supported by ModelScope's DiffSynth-Studio for VRAM reduction.
Multi-GPU Parallel Deployment Example:
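# Start the caption and VAE API server in the background.
python api/call_remote_server.py --model_dir where_you_download_dir &
parallel=4                        # number of processes passed to torchrun
url='127.0.0.1'                   # address of the API server started above
model_dir=where_you_download_dir  # directory containing the downloaded weights
tp_degree=2                       # tensor-parallel degree
ulysses_degree=2                  # Ulysses sequence-parallel degree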
torchrun --nproc_per_node $parallel run_parallel.py \
  --model_dir $model_dir --vae_url $url --caption_url $url \
  --ulysses_degree $ulysses_degree --tensor_parallel_degree $tp_degree \
  --prompt "一名宇航员在月球上发现一块石碑，上面印有“stepfun”字样，闪闪发光" \
  --infer_steps 50 --cfg_scale 9.0 --time_shift 13.0
Refer to the official repository for recommended inference settings and for single-GPU usage via DiffSynth-Studio.
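When the launch is scripted, it can help to assemble the torchrun arguments programmatically so that the parallelism settings are checked before any process starts. The sketch below is a hypothetical helper, not part of the repository; it uses only the flags shown in the example above. The rule that the process count equals the tensor-parallel degree times the Ulysses degree is an assumption inferred from the example values (2 x 2 = 4), and the English prompt is a placeholder.

```python
import shlex


def build_torchrun_command(
    model_dir: str,
    prompt: str,
    url: str = "127.0.0.1",
    parallel: int = 4,
    tp_degree: int = 2,
    ulysses_degree: int = 2,
    infer_steps: int = 50,
    cfg_scale: float = 9.0,
    time_shift: float = 13.0,
) -> list[str]:
    """Return the torchrun invocation as an argument list."""
    # Assumption: process count = tp_degree * ulysses_degree, as in the
    # example (2 * 2 = 4). Remove this check if your setup differs.
    if parallel != tp_degree * ulysses_degree:
        raise ValueError(
            f"parallel={parallel} does not equal "
            f"tp_degree*ulysses_degree={tp_degree * ulysses_degree}"
        )
    return [
        "torchrun", "--nproc_per_node", str(parallel), "run_parallel.py",
        "--model_dir", model_dir,
        "--vae_url", url,
        "--caption_url", url,
        "--ulysses_degree", str(ulysses_degree),
        "--tensor_parallel_degree", str(tp_degree),
        "--prompt", prompt,
        "--infer_steps", str(infer_steps),
        "--cfg_scale", str(cfg_scale),
        "--time_shift", str(time_shift),
    ]


if __name__ == "__main__":
    # Placeholder directory and prompt; replace with your own values.
    cmd = build_torchrun_command(
        "where_you_download_dir", "An astronaut walking on the moon"
    )
    print(shlex.join(cmd))  # shell-quoted, ready to paste into a terminal
```

Passing the list to `subprocess.run` instead of joining it into a string avoids shell-quoting problems with prompts that contain quotes or non-ASCII characters.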
Why Use Step-Video-T2V?
Step-Video-T2V has several notable strengths for text-to-video generation:
- State-of-the-Art Performance: On the Step-Video-T2V-Eval benchmark, its authors report that it outperforms both open-source and commercial engines in video quality.
- Efficiency: The deep-compression Video-VAE (16x16 spatial, 8x temporal) reduces computational overhead during training and inference.
- Advanced Architecture: It uses a DiT with 3D full attention and applies Direct Preference Optimization (DPO) to improve visual consistency and realism.
- Multilingual Support: The model uses bilingual text encoders, supporting both English and Chinese prompts.
- Community & Integration: The project is actively developed, with code planned for integration into Huggingface/Diffusers, and benefits from collaborations with teams such as xDiT and FastVideo.
Links
- GitHub Repository: https://github.com/stepfun-ai/Step-Video-T2V
- Hugging Face Models:
  - Step-Video-T2V: https://huggingface.co/stepfun-ai/stepvideo-t2v
  - Step-Video-T2V-Turbo: https://huggingface.co/stepfun-ai/stepvideo-t2v-turbo
- Technical Report (Arxiv): https://arxiv.org/abs/2502.10248
- Online Engine (跃问视频): https://yuewen.cn/videos