chatterbox-vllm: Accelerating Chatterbox TTS with vLLM for Enhanced Performance

Summary

chatterbox-vllm is a port of the Chatterbox Text-to-Speech (TTS) model to vLLM, intended to improve generation speed and GPU memory efficiency. It is a personal project that aims to make speech synthesis faster and easier to integrate, and early benchmarks show substantial speedups over the original implementation. It is usable today, but it relies on internal vLLM APIs and hacky workarounds, and further refactoring is planned.

Use at your own risk

OSRepos shares public repositories for knowledge and discovery only. Any installation, execution, configuration, or use of code from these repositories is the user's own responsibility. Always review the repository, source code, dependencies, licenses, and security implications before running or installing anything. OSRepos is not responsible for issues, damages, or losses resulting from third-party repositories.

Introduction

chatterbox-vllm ports the Chatterbox Text-to-Speech (TTS) model to vLLM, a high-performance inference engine. Developed by randombk as a personal project, it aims to make Chatterbox faster and more memory-efficient on GPUs. Early benchmarks indicate significant speedups over the original implementation.

Installation

This project primarily supports Linux and WSL2 with NVIDIA hardware. AMD GPUs may work with minor adjustments, but this has not been tested.

Prerequisites: Ensure git and uv (a fast Python package installer and resolver) are installed on your system.
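
One quick way to confirm both prerequisites are on your PATH before cloning is a small standard-library check. This is only an illustrative sketch: the helper name missing_tools is hypothetical and is not part of chatterbox-vllm.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # git and uv are the two prerequisites named above.
    missing = missing_tools(["git", "uv"])
    if missing:
        print("Missing prerequisites:", ", ".join(missing))
    else:
        print("git and uv are both available.")
```

If either tool is reported missing, install it before running the commands below.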

git clone https://github.com/randombk/chatterbox-vllm.git
cd chatterbox-vllm
uv venv
source .venv/bin/activate
uv sync

The necessary model weights should be automatically downloaded from the Hugging Face Hub. If you encounter CUDA-related issues, try resetting your virtual environment and using uv pip install -e . instead of uv sync.

Examples

To quickly generate audio samples, you can run the provided example-tts.py script. This example demonstrates how to generate speech for multiple prompts using different voices.

import torchaudio as ta
from chatterbox_vllm.tts import ChatterboxTTS


if __name__ == "__main__":
    model = ChatterboxTTS.from_pretrained(
        gpu_memory_utilization = 0.4,
        max_model_len = 1000,

        # Disable CUDA graphs to reduce startup time for one-off generation.
        enforce_eager = True,
    )

    # The same batch of prompts is synthesized once per voice.
    prompts = [
        "You are listening to a demo of the Chatterbox TTS model running on VLLM.",
        "This is a separate prompt to test the batching implementation.",
        "And here is a third prompt. It's a bit longer than the first one, but not by much.",
    ]

    # None is assumed here to select the default voice; the other entries are reference audio clips.
    for i, audio_prompt_path in enumerate([None, "docs/audio-sample-01.mp3", "docs/audio-sample-03.mp3"]):
        audios = model.generate(prompts, audio_prompt_path=audio_prompt_path, exaggeration=0.8)
        # One output file per (voice, prompt) pair, saved at the model's sample rate.
        for audio_idx, audio in enumerate(audios):
            ta.save(f"test-{i}-{audio_idx}.mp3", audio, model.sr)
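
Because the example passes a list of prompts in a single call, longer passages can be broken into several shorter prompts first. The sketch below shows one possible way to do that with sentence-aligned chunks. The helper split_into_prompts and the 200-character default are assumptions made for illustration; they are not part of chatterbox-vllm and do not reflect any limit enforced by the library.

```python
import re


def split_into_prompts(text, max_chars=200):
    """Split text into sentence-aligned chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut mid-word.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    prompts = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the cap, so start a new chunk.
            prompts.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        prompts.append(current)
    return prompts
```

The resulting list could then be passed as the prompts argument to model.generate, in place of the hard-coded list in the example above.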

Why Use It

The primary motivation behind chatterbox-vllm is to overcome performance bottlenecks and improve GPU memory utilization of the original Chatterbox TTS model. By porting it to vLLM, the project achieves:

Improved Performance: Early benchmarks show significant speedups, with generation tokens/s increasing by approximately 4x without batching and over 10x with batching. This is a substantial improvement over the original implementation, which was often bottlenecked by CPU-GPU synchronization.
Efficient GPU Memory Use: vLLM's optimized inference infrastructure allows for more efficient use of GPU memory, enabling higher throughput and potentially larger batch sizes.
Easier Integration: The vLLM port facilitates easier integration with modern, high-performance inference systems, streamlining deployment and scaling of TTS applications.
High Throughput: The T3 Llama token generation component is now fast enough that it is no longer the bottleneck in the TTS pipeline.
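
To compare throughput on your own hardware, the tokens/s and speedup figures quoted above can be computed with a few lines of standard-library Python. This is a minimal sketch: the helper names are hypothetical, and it assumes you can obtain the number of generated tokens for a run yourself.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def tokens_per_second(num_tokens, seconds):
    """Generation throughput for one run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_tokens / seconds


def speedup(baseline_tps, new_tps):
    """How many times faster new_tps is than baseline_tps."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return new_tps / baseline_tps
```

For example, a run producing 1000 tokens in 4 seconds gives 250 tokens/s, and going from 100 to 400 tokens/s is a 4x speedup. These numbers are purely illustrative and are not measurements from the project.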