Spark-TTS: Efficient LLM-Based Text-to-Speech with Zero-Shot Voice Cloning

Introduction

Spark-TTS is a text-to-speech (TTS) system developed by SparkAudio for accurate, natural-sounding voice synthesis. This repository provides the official PyTorch inference code for the model, which uses a large language model (LLM) for efficient and flexible speech generation. Spark-TTS is built entirely on Qwen2.5 and reconstructs audio directly from LLM-predicted codes, which removes the need for a separate generation model and reduces complexity.

Key features of Spark-TTS include:

Simplicity and Efficiency: Built entirely on Qwen2.5, Spark-TTS eliminates the need for additional generation models like flow matching. It directly reconstructs audio from LLM-predicted codes, streamlining the process and improving efficiency.
High-Quality Zero-Shot Voice Cloning: Replicates a speaker's voice without any training data for that specific voice, which makes it well suited to cross-lingual and code-switching scenarios.
Bilingual Support: Seamlessly synthesizes speech in both Chinese and English, with capabilities for zero-shot voice cloning across languages.
Controllable Speech Generation: Enables the creation of virtual speakers by adjusting parameters such as gender, pitch, and speaking rate.

The project's paper, "Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens", has been published and describes the approach in detail.

Installation

Getting Spark-TTS up and running is straightforward. Follow these steps to set up the environment and download the necessary models.

1. Clone the Repository

git clone https://github.com/SparkAudio/Spark-TTS.git
cd Spark-TTS

2. Create Conda Environment and Install Dependencies

Ensure you have Conda installed. For installation instructions, refer to the Miniconda installation guide.

conda create -n sparktts -y python=3.12
conda activate sparktts
pip install -r requirements.txt
# For users in mainland China, you can use a mirror:
# pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

For Windows installation, refer to the Windows Installation Guide.
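
After installing the dependencies, a quick sanity check can confirm that the active interpreter is the one from the sparktts environment and that PyTorch is importable. The following is a minimal sketch, not part of the repository: the helper name check_environment is hypothetical, and it assumes that requirements.txt installs the torch package.

```python
import importlib.util
import sys


def check_environment(required=("torch",)):
    """Report the Python version and whether each required package is importable.

    `required` defaults to ("torch",) on the assumption that requirements.txt
    installs PyTorch; pass other top-level package names as needed.
    """
    report = {"python": "%d.%d" % sys.version_info[:2]}
    for name in required:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If the reported Python version is not 3.12, the sparktts environment is probably not active in the current shell.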

3. Download Pre-trained Models

Via Python:

from huggingface_hub import snapshot_download

snapshot_download("SparkAudio/Spark-TTS-0.5B", local_dir="pretrained_models/Spark-TTS-0.5B")
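
In scripts that run more than once, it can be convenient to skip the download call when the model directory is already populated. The sketch below wraps the snapshot_download call shown above; the helper name ensure_model and the "directory is non-empty" check are assumptions made for illustration, and the check does not detect a partial download.

```python
from pathlib import Path


def ensure_model(repo_id="SparkAudio/Spark-TTS-0.5B",
                 local_dir="pretrained_models/Spark-TTS-0.5B"):
    """Download the checkpoint unless local_dir already contains files."""
    target = Path(local_dir)
    if target.is_dir() and any(target.iterdir()):
        # Assumption: a non-empty directory means an earlier download finished.
        return target
    # Imported lazily so the check above works even without huggingface_hub.
    from huggingface_hub import snapshot_download
    snapshot_download(repo_id, local_dir=str(target))
    return target


if __name__ == "__main__":
    print(ensure_model())
```

If a download may have been interrupted, delete the directory or call snapshot_download directly so that missing files are fetched.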

Via Git LFS:

mkdir -p pretrained_models

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install

git clone https://huggingface.co/SparkAudio/Spark-TTS-0.5B pretrained_models/Spark-TTS-0.5B

Examples

Spark-TTS offers both a command-line interface (CLI) and a web interface for inference.

Basic CLI Usage

Run the provided demo script:

cd example
bash infer.sh

Or run inference directly from the repository root:

python -m cli.inference \
    --text "text to synthesize." \
    --device 0 \
    --save_dir "path/to/save/audio" \
    --model_dir pretrained_models/Spark-TTS-0.5B \
    --prompt_text "transcript of the prompt audio" \
    --prompt_speech_path "path/to/prompt_audio"
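
To synthesize several sentences with the same reference voice, the command above can be driven from Python. This is a minimal sketch under stated assumptions: build_inference_command and synthesize_all are hypothetical helper names, only the flags shown in the command above are used, and the script is assumed to run from the repository root so that the cli.inference module resolves.

```python
import subprocess
import sys


def build_inference_command(text, prompt_text, prompt_speech_path,
                            save_dir="path/to/save/audio",
                            model_dir="pretrained_models/Spark-TTS-0.5B",
                            device=0):
    """Return the argument list for `python -m cli.inference`."""
    return [
        sys.executable, "-m", "cli.inference",
        "--text", text,
        "--device", str(device),
        "--save_dir", save_dir,
        "--model_dir", model_dir,
        "--prompt_text", prompt_text,
        "--prompt_speech_path", prompt_speech_path,
    ]


def synthesize_all(texts, **kwargs):
    """Run one inference process per text and stop on the first failure."""
    for text in texts:
        # Passing a list (not a shell string) avoids quoting problems in the text.
        subprocess.run(build_inference_command(text, **kwargs), check=True)


if __name__ == "__main__":
    synthesize_all(
        ["First sentence.", "Second sentence."],
        prompt_text="transcript of the prompt audio",
        prompt_speech_path="path/to/prompt_audio",
    )
```

Each call starts a new process and reloads the model, so this approach favors simplicity over throughput.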

Web UI Usage

You can start the web UI by running python webui.py --device 0. It supports Voice Cloning and Voice Creation; for Voice Cloning, you can either upload reference audio or record it directly.

Demos

Experience the high-quality zero-shot voice cloning capabilities of Spark-TTS by visiting the official demo page. You can hear examples of various voices, including public figures and fictional characters, in both English and Chinese.

Why Use Spark-TTS

Spark-TTS offers an efficient way to generate natural-sounding speech. Its LLM-based architecture, built on Qwen2.5, simplifies the synthesis pipeline. With zero-shot voice cloning, Chinese and English support, and control over gender, pitch, and speaking rate, it is useful to researchers and developers who want to integrate TTS into their projects, whether for personalized speech synthesis, assistive technologies, or linguistic research.

Links

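Explore Spark-TTS further through these official resources:

GitHub repository: https://github.com/SparkAudio/Spark-TTS
Pre-trained model on Hugging Face: https://huggingface.co/SparkAudio/Spark-TTS-0.5B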
