Spark-TTS: Efficient LLM-Based Text-to-Speech with Zero-Shot Voice Cloning
Summary
Spark-TTS is a text-to-speech system that uses a large language model (LLM) to produce accurate, natural-sounding speech. Built on Qwen2.5, it offers an efficient single-model pipeline, high-quality zero-shot voice cloning, bilingual support for Chinese and English, and controllable speech generation, which makes it suitable for both research and production.
Introduction
Spark-TTS is a text-to-speech (TTS) system developed by SparkAudio, designed for accurate and natural-sounding voice synthesis. This repository provides the official PyTorch inference code for the model, which uses a large language model (LLM) for efficient and flexible speech generation. Spark-TTS is built entirely on Qwen2.5 and reconstructs audio directly from the codes the LLM predicts, which removes a separate generation stage and reduces complexity.
Key features of Spark-TTS include:
- Simplicity and Efficiency: Built entirely on Qwen2.5, Spark-TTS eliminates the need for additional generation models like flow matching. It directly reconstructs audio from LLM-predicted codes, streamlining the process and improving efficiency.
- High-Quality Zero-Shot Voice Cloning: Supports zero-shot voice cloning, so it can replicate a speaker's voice without any training data for that speaker. This is well suited to cross-lingual and code-switching scenarios.
- Bilingual Support: Seamlessly synthesizes speech in both Chinese and English, with capabilities for zero-shot voice cloning across languages.
- Controllable Speech Generation: Enables the creation of virtual speakers by adjusting parameters such as gender, pitch, and speaking rate.
The project's paper, "Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens", has been published and describes the approach in detail.
Installation
Getting Spark-TTS up and running is straightforward. Follow these steps to set up the environment and download the necessary models.
1. Clone the Repository
git clone https://github.com/SparkAudio/Spark-TTS.git
cd Spark-TTS
2. Create Conda Environment and Install Dependencies
Ensure you have Conda installed. For installation instructions, refer to the Miniconda installation guide.
conda create -n sparktts -y python=3.12
conda activate sparktts
pip install -r requirements.txt
# For users in mainland China, you can use a mirror:
# pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
For Windows installation, refer to the Windows Installation Guide.
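After installing the dependencies, a quick sanity check can confirm that the environment is usable before downloading the model. The following is a minimal sketch, not part of the repository: the function name `check_environment` and the `min_python` threshold are illustrative assumptions, and the CUDA check assumes that `--device 0` in the later examples refers to a GPU index.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 10)):
    """Summarise whether the current environment looks ready for inference.

    min_python is an illustrative threshold (assumption), not a documented
    requirement of Spark-TTS.
    """
    report = {
        "python_ok": sys.version_info[:2] >= tuple(min_python),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch  # imported lazily so the check works without torch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except (ImportError, OSError):
            # A broken install is reported as "installed but no CUDA".
            report["cuda_available"] = False
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `torch_installed` is false, re-run the `pip install -r requirements.txt` step inside the activated `sparktts` environment.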
3. Download Pre-trained Models
Via Python:
from huggingface_hub import snapshot_download
snapshot_download("SparkAudio/Spark-TTS-0.5B", local_dir="pretrained_models/Spark-TTS-0.5B")
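If the download step is run from a setup script, it can be convenient to skip the network call when the checkpoint is already present. The sketch below wraps the same `snapshot_download` call; the helper name `ensure_model` is hypothetical, and treating any non-empty directory as "already downloaded" is a simplifying assumption that does not verify the download is complete.

```python
from pathlib import Path


def ensure_model(local_dir="pretrained_models/Spark-TTS-0.5B",
                 repo_id="SparkAudio/Spark-TTS-0.5B"):
    """Download the checkpoint only if local_dir is missing or empty.

    Assumption: a non-empty directory means the model was fetched earlier.
    This does not detect partial downloads.
    """
    target = Path(local_dir)
    if target.is_dir() and any(target.iterdir()):
        return target  # already populated, so skip the network call

    # Imported lazily so the check above works without huggingface_hub.
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id, local_dir=str(target))
    return target
```

Calling `ensure_model()` with no arguments uses the same repository ID and target directory as the command above.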
Via Git LFS:
mkdir -p pretrained_models
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/SparkAudio/Spark-TTS-0.5B pretrained_models/Spark-TTS-0.5B
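A common failure with the Git route is cloning before `git-lfs` is set up, which leaves small text pointer files in place of the real weights. Git LFS pointer files begin with the line `version https://git-lfs.github.com/spec/v1`, so they can be detected with a short script. This is a sketch rather than a tool shipped with the repository, and `find_lfs_pointers` is a hypothetical name.

```python
from pathlib import Path

# First bytes of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir):
    """Return files under model_dir that are still Git LFS pointer stubs."""
    root = Path(model_dir)
    pointers = []
    for path in sorted(root.rglob("*")):
        # Skip directories and Git's own metadata.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        with path.open("rb") as handle:
            head = handle.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            pointers.append(path)
    return pointers


if __name__ == "__main__":
    stubs = find_lfs_pointers("pretrained_models/Spark-TTS-0.5B")
    if stubs:
        print("Unresolved LFS pointers (try `git lfs pull` in the model directory):")
        for stub in stubs:
            print(f"  {stub}")
    else:
        print("No LFS pointer stubs found.")
```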
Examples
Spark-TTS offers both a command-line interface (CLI) and a web interface for inference.
Basic CLI Usage
Run the provided demo script:
cd example
bash infer.sh
Or execute directly:
python -m cli.inference \
--text "text to synthesize." \
--device 0 \
--save_dir "path/to/save/audio" \
--model_dir pretrained_models/Spark-TTS-0.5B \
--prompt_text "transcript of the prompt audio" \
--prompt_speech_path "path/to/prompt_audio"
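To call the CLI from another Python program, for example to synthesize several sentences in a loop, the same command can be assembled and launched with `subprocess`. The sketch below uses only the flags shown above; the helper names `build_command` and `synthesize` are hypothetical, and it assumes the script is run from the repository root so that `cli.inference` is importable.

```python
import subprocess
import sys


def build_command(text, save_dir, model_dir, prompt_text,
                  prompt_speech_path, device=0):
    """Assemble the argument list for `python -m cli.inference`."""
    return [
        sys.executable, "-m", "cli.inference",
        "--text", text,
        "--device", str(device),
        "--save_dir", save_dir,
        "--model_dir", model_dir,
        "--prompt_text", prompt_text,
        "--prompt_speech_path", prompt_speech_path,
    ]


def synthesize(**kwargs):
    """Run the CLI once; raises CalledProcessError if inference fails."""
    # Assumption: the current working directory is the Spark-TTS repo root.
    return subprocess.run(build_command(**kwargs), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so this is safe to try anywhere.
    print(" ".join(build_command(
        text="text to synthesize.",
        save_dir="path/to/save/audio",
        model_dir="pretrained_models/Spark-TTS-0.5B",
        prompt_text="transcript of the prompt audio",
        prompt_speech_path="path/to/prompt_audio",
    )))
```

Passing the arguments as a list avoids shell quoting problems when the text contains spaces or punctuation.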
Web UI Usage
You can start the web UI by running python webui.py --device 0. It supports both Voice Cloning and Voice Creation. For Voice Cloning, you can either upload reference audio or record it directly in the interface.
Demos
Experience the high-quality zero-shot voice cloning capabilities of Spark-TTS by visiting the official demo page. You can hear examples of various voices, including public figures and fictional characters, in both English and Chinese.
Why Use Spark-TTS
Spark-TTS represents a significant advancement in text-to-speech technology, offering a powerful and efficient solution for generating natural-sounding speech. Its LLM-based architecture, built on Qwen2.5, simplifies the synthesis pipeline while delivering exceptional results. With features like zero-shot voice cloning, comprehensive bilingual support, and fine-grained control over speech characteristics, Spark-TTS is an invaluable tool for researchers, developers, and anyone looking to integrate cutting-edge TTS capabilities into their projects. Whether for personalized speech synthesis, assistive technologies, or linguistic research, Spark-TTS provides a robust and versatile platform.