MuseTalk: Real-Time High-Fidelity Lip Synchronization for Virtual Humans
Summary
MuseTalk, developed by Lyra Lab at Tencent Music Entertainment, is an innovative real-time lip-syncing model designed for high-fidelity video dubbing. It enables seamless synchronization of facial movements with audio in various languages, making it a powerful tool for virtual human solutions. The latest MuseTalk 1.5 version offers significant performance enhancements, including improved clarity, identity consistency, and precise lip-speech synchronization.
Introduction
MuseTalk, developed by Lyra Lab at Tencent Music Entertainment, is a cutting-edge, real-time lip synchronization model designed for high-fidelity video dubbing. This innovative project enables seamless synchronization of facial movements with audio, making it a powerful tool for virtual human solutions. MuseTalk operates by inpainting in the latent space, offering real-time performance at 30 frames per second or more on an NVIDIA Tesla V100. It supports various languages, including Chinese, English, and Japanese, and can modify face regions of 256x256 pixels.
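To make the latent-inpainting formulation concrete, the sketch below illustrates the masking step it implies. Every tensor, shape, and the input layout here is a hypothetical stand-in, not code from the repository; in MuseTalk the operation happens on VAE latents rather than raw pixels.

```python
import torch

# Minimal, hypothetical sketch of the masking idea behind latent-space
# inpainting: the lower half of the 256x256 face crop is blanked out so the
# generator must resynthesize the mouth region from audio conditioning.
# MuseTalk's actual mask and input layout may differ -- illustrative only.
face = torch.randn(1, 3, 256, 256)      # stand-in for a cropped face frame
masked = face.clone()
masked[:, :, 128:, :] = 0.0             # hide the lower half (mouth region)
model_input = torch.cat([masked, face], dim=1)  # masked frame + reference
print(model_input.shape)                # torch.Size([1, 6, 256, 256])
```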
The latest MuseTalk 1.5 version introduces significant advancements, integrating perceptual loss, GAN loss, and sync loss during training. This, combined with a two-stage training strategy and spatio-temporal data sampling, dramatically boosts overall performance, leading to enhanced clarity, improved identity consistency, and precise lip-speech synchronization. Both inference and training codes, along with model weights, are now fully open-sourced, inviting developers to explore and build upon this technology.
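As a rough illustration of the training objective, the sketch below combines the three losses named above into a single weighted sum. The weighting coefficients are placeholders, not values from the MuseTalk 1.5 training code; consult the open-sourced training code for the actual implementations.

```python
import torch

def combined_loss(recon_loss, perceptual_loss, gan_loss, sync_loss,
                  w_perceptual=0.1, w_gan=0.01, w_sync=0.05):
    """Illustrative weighted sum of the loss terms mentioned in the text.

    The weights are placeholders; the released training code defines the
    real values and the individual loss implementations.
    """
    return (recon_loss
            + w_perceptual * perceptual_loss
            + w_gan * gan_loss
            + w_sync * sync_loss)

# Example with dummy scalar losses:
loss = combined_loss(torch.tensor(0.50), torch.tensor(1.20),
                     torch.tensor(0.80), torch.tensor(0.30))
print(loss)  # tensor(0.6430)
```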
Installation
Getting started with MuseTalk involves setting up the environment and installing necessary dependencies. Follow these steps to prepare your system:
- Build Environment:
  It is recommended to use Python 3.10 and CUDA 11.7.
  ```bash
  conda create -n MuseTalk python==3.10
  conda activate MuseTalk
  ```
- Install PyTorch 2.0.1:
  Choose one of the following installation methods:
  ```bash
  # Option 1: Using pip
  pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118

  # Option 2: Using conda
  conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
  ```
- Install Dependencies:
  Install the remaining required packages:
  ```bash
  pip install -r requirements.txt
  ```
- Install MMLab Packages:
  Install the MMLab ecosystem packages:
  ```bash
  pip install --no-cache-dir -U openmim
  mim install mmengine
  mim install "mmcv==2.0.1"
  mim install "mmdet==3.1.0"
  mim install "mmpose==1.1.0"
  ```
- Setup FFmpeg:
  Download the ffmpeg-static package and configure it for your operating system. Refer to the official GitHub repository for detailed instructions on setting FFMPEG_PATH on Linux or adding ffmpeg to your system's PATH on Windows.
- Download Weights:
  Weights can be downloaded using the provided scripts for Linux or Windows, or manually from Hugging Face and the other links listed in the repository. Ensure the weights are organized in the ./models/ directory as described there; a hedged programmatic alternative is sketched after this list.
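For the weights hosted on Hugging Face, a programmatic download is also possible. The sketch below uses huggingface_hub; the repo_id is assumed from the project's Hugging Face namespace, and any components hosted elsewhere still need to be fetched separately and arranged in the documented layout.

```python
from huggingface_hub import snapshot_download

# Hedged sketch: pull the MuseTalk weights from Hugging Face into ./models.
# The repo_id is assumed from the project's Hugging Face namespace; weights
# hosted outside Hugging Face (see the repository README) must still be
# downloaded separately and placed in the documented ./models/ layout.
snapshot_download(repo_id="TMElyralab/MuseTalk", local_dir="./models")
```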
Examples
MuseTalk provides compelling examples showcasing its capabilities, particularly the improvements in version 1.5. The project's GitHub repository features comparison tables demonstrating the gains in visual quality and synchronization from MuseTalk 1.0 to 1.5. These examples highlight the model's ability to generate realistic lip movements from various input videos and audio.
For an interactive experience, a Gradio demo is available on Hugging Face Spaces. It lets users fine-tune lip-sync parameters, test the model with their own inputs, and observe its real-time inference behavior firsthand.
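The Space can also be queried programmatically with the gradio_client package. The sketch below deliberately avoids guessing the demo's endpoint signatures and instead lists what it actually exposes:

```python
from gradio_client import Client

# Hedged sketch: connect to the public MuseTalk Space. The Space name is
# taken from the Hugging Face link below; endpoint names and parameters
# are not assumed here -- view_api() prints what the demo actually exposes.
client = Client("TMElyralab/MuseTalk")
client.view_api()
```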
Why Use MuseTalk
- Real-time Performance: Achieve high-speed video dubbing with over 30 frames per second on an NVIDIA Tesla V100, crucial for live applications and efficient workflows.
- High-Fidelity Output: Version 1.5 delivers superior visual quality, maintaining identity consistency and achieving precise lip-speech synchronization, making generated content highly realistic.
- Multi-language Support: The model is trained to support various languages, including Chinese, English, and Japanese, broadening its applicability for global content creation.
- Comprehensive Virtual Human Integration: MuseTalk can be seamlessly integrated with other virtual human generation tools, such as MuseV, to create complete and dynamic virtual characters.
- Open-Source Accessibility: With open-sourced training and inference codes, along with pre-trained models, developers and researchers can easily implement, customize, and contribute to the project.
Links
- GitHub Repository: https://github.com/TMElyralab/MuseTalk
- Hugging Face Space: https://huggingface.co/spaces/TMElyralab/MuseTalk
- Technical Report: https://arxiv.org/abs/2410.10122