{"name":"Step-Video-T2V: State-of-the-Art Text-to-Video Generation Model","description":"Step-Video-T2V is a state-of-the-art text-to-video pre-trained model capable of generating videos up to 204 frames with 30 billion parameters. It achieves high efficiency through a deep compression Video-VAE and enhances visual quality using Direct Preference Optimization (DPO). The model's performance is validated on its novel benchmark, Step-Video-T2V-Eval, demonstrating superior text-to-video quality.","github":"https://github.com/stepfun-ai/Step-Video-T2V","url":"https://osrepos.com/repo/stepfun-ai-step-video-t2v","source":"osrepos.com","sourceDescription":"This repository profile is provided by osrepos.com, an open source repository discovery platform.","repositoryProfile":"https://osrepos.com/repo/stepfun-ai-step-video-t2v","generatedFor":"open source discovery and AI-assisted research","markdown":"https://osrepos.com/repo/stepfun-ai-step-video-t2v.md","json":"https://osrepos.com/repo/stepfun-ai-step-video-t2v.json","topics":["Python","Text-to-Video","AI","Machine Learning","Video Generation","Diffusion Models","Generative AI"],"keywords":["Python","Text-to-Video","AI","Machine Learning","Video Generation","Diffusion Models","Generative AI"],"stars":null,"summary":"Step-Video-T2V is a state-of-the-art text-to-video pre-trained model capable of generating videos up to 204 frames with 30 billion parameters. It achieves high efficiency through a deep compression Video-VAE and enhances visual quality using Direct Preference Optimization (DPO). The model's performance is validated on its novel benchmark, Step-Video-T2V-Eval, demonstrating superior text-to-video quality.","content":"## Introduction\n\nStep-Video-T2V is a groundbreaking text-to-video pre-trained model developed by stepfun-ai, featuring 30 billion parameters and the capability to generate videos up to 204 frames. 
It introduces a deep compression Video-VAE, achieving 16x16 spatial and 8x temporal compression ratios, which significantly boosts both training and inference efficiency. Furthermore, Direct Preference Optimization (DPO) is applied in the final stage to refine and enhance the visual quality of the generated videos, ensuring smoother and more realistic outputs. The model's architecture leverages a DiT with 3D full attention, trained using Flow Matching, and supports both English and Chinese prompts.\n\n## Installation\n\nTo get started with Step-Video-T2V, follow these steps. An NVIDIA GPU with CUDA support is required, and GPUs with 80GB of memory are recommended for optimal generation quality. The text-encoder (step_llm) self-attention supports CUDA capabilities sm_80, sm_86, and sm_90.\n\n**Requirements:**\n*   Python >= 3.10.0\n*   PyTorch >= 2.3-cu121\n*   CUDA Toolkit\n*   FFmpeg\n\n```bash\ngit clone https://github.com/stepfun-ai/Step-Video-T2V.git\nconda create -n stepvideo python=3.10\nconda activate stepvideo\n\ncd Step-Video-T2V\npip install -e .\npip install flash-attn --no-build-isolation  ## flash-attn is optional\n```\n\n## Examples\n\nStep-Video-T2V offers robust performance in generating high-fidelity and dynamic videos. 
You can explore various video demonstrations directly on the GitHub repository's README.\n\nFor inference, the project provides scripts for both multi-GPU parallel deployment and single-GPU inference, with the latter supported by ModelScope's DiffSynth-Studio for VRAM reduction.\n\n**Multi-GPU Parallel Deployment Example:**\n```bash\npython api/call_remote_server.py --model_dir where_you_download_dir &\nparallel=4\nurl='127.0.0.1'\nmodel_dir=where_you_download_dir\n\ntp_degree=2\nulysses_degree=2\n\ntorchrun --nproc_per_node $parallel run_parallel.py --model_dir $model_dir --vae_url $url --caption_url $url --ulysses_degree $ulysses_degree --tensor_parallel_degree $tp_degree --prompt \"一名宇航员在月球上发现一块石碑，上面印有“stepfun”字样，闪闪发光\" --infer_steps 50 --cfg_scale 9.0 --time_shift 13.0\n```\n\nRefer to the official repository for detailed best-practice inference settings and single-GPU usage via DiffSynth-Studio.\n\n## Why Use Step-Video-T2V?\n\nStep-Video-T2V stands out as a leading solution for text-to-video generation due to several key advantages:\n*   **State-of-the-Art Performance**: It delivers exceptional video quality, outperforming both open-source and commercial engines, as validated by the Step-Video-T2V-Eval benchmark.\n*   **Efficiency**: The innovative deep compression Video-VAE significantly reduces computational overhead during training and inference.\n*   **Advanced Architecture**: It incorporates a sophisticated DiT with 3D full attention and leverages Direct Preference Optimization (DPO) for superior visual consistency and realism.\n*   **Multilingual Support**: The model utilizes bilingual text encoders, supporting both English and Chinese prompts.\n*   **Community & Integration**: The project is actively developed, with code planned for integration into Huggingface/Diffusers, and benefits from collaborations with teams like xDiT and FastVideo.\n\n## Links\n\n*   **GitHub Repository**: 
[https://github.com/stepfun-ai/Step-Video-T2V](https://github.com/stepfun-ai/Step-Video-T2V){target=\"_blank\"}\n*   **Hugging Face Models**:\n    *   Step-Video-T2V: [https://huggingface.co/stepfun-ai/stepvideo-t2v](https://huggingface.co/stepfun-ai/stepvideo-t2v){target=\"_blank\"}\n    *   Step-Video-T2V-Turbo: [https://huggingface.co/stepfun-ai/stepvideo-t2v-turbo](https://huggingface.co/stepfun-ai/stepvideo-t2v-turbo){target=\"_blank\"}\n*   **Technical Report (Arxiv)**: [https://arxiv.org/abs/2502.10248](https://arxiv.org/abs/2502.10248){target=\"_blank\"}\n*   **Online Engine (跃问视频)**: [https://yuewen.cn/videos](https://yuewen.cn/videos){target=\"_blank\"}","metrics":{"detailViews":7,"githubClicks":5},"dates":{"published":null,"modified":"2025-10-29T20:01:32.000Z"}}