chatterbox-vllm: Accelerating Chatterbox TTS with vLLM for Enhanced Performance

This repository profile is provided by osrepos.com, an open source repository discovery platform.

Summary

chatterbox-vllm is a port of the Chatterbox Text-to-Speech (TTS) model to vLLM, intended to improve generation speed and GPU memory efficiency. It is a personal project that aims to make Chatterbox faster and easier to integrate into inference systems, and it reports substantial speedups over the original implementation. The port is usable today, but it relies on internal vLLM APIs and hacky workarounds, and refactoring is planned.

Repository Information

Analyzed by OSRepos on October 11, 2025

Use at your own risk

OSRepos shares public repositories for knowledge and discovery only. Any installation, execution, configuration, or use of code from these repositories is the user's own responsibility. Always review the repository, source code, dependencies, licenses, and security implications before running or installing anything. OSRepos is not responsible for issues, damages, or losses resulting from third-party repositories.

Introduction

chatterbox-vllm ports the Chatterbox Text-to-Speech (TTS) model to vLLM, a high-performance inference engine. Developed by randombk as a personal project, it aims to make Chatterbox faster and more memory-efficient on GPUs. Early benchmarks indicate significant speedups over the original implementation.

Installation

This project primarily supports Linux and WSL2 with NVIDIA hardware. AMD GPUs might work with minor adjustments, but this has not been tested.

Prerequisites: Ensure git and uv (a fast Python package installer and resolver) are installed on your system.

git clone https://github.com/randombk/chatterbox-vllm.git
cd chatterbox-vllm
uv venv
source .venv/bin/activate
uv sync

The necessary model weights should be automatically downloaded from the Hugging Face Hub. If you encounter CUDA-related issues, try resetting your virtual environment and using uv pip install -e . instead of uv sync.
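
Before running the steps above, a quick preflight check can confirm that the prerequisites are visible on the current machine. The following is a minimal sketch, not part of the repository: the function name preflight_report is hypothetical, and treating nvidia-smi on the PATH as a sign of a working NVIDIA driver is an assumption of this sketch. It only inspects the OS name and the PATH, and does not verify CUDA or driver versions.

```python
import platform
import shutil


def preflight_report(required_tools=("git", "uv"), gpu_tool="nvidia-smi"):
    """Return a dict describing whether the basic prerequisites look present.

    Hypothetical helper: it checks the OS name and the PATH only.
    """
    report = {"os": platform.system()}
    # The project targets Linux and WSL2; WSL2 also reports "Linux".
    report["os_supported"] = report["os"] == "Linux"
    for tool in required_tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    # Assumption: nvidia-smi is installed alongside the NVIDIA driver.
    report[gpu_tool] = shutil.which(gpu_tool) is not None
    return report
```

Calling preflight_report() and printing the result shows at a glance which of git, uv, and nvidia-smi were found. A False entry is a hint about what to install, not a definitive diagnosis.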

Examples

To quickly generate audio samples, you can run the provided example-tts.py script. This example demonstrates how to generate speech for multiple prompts using different voices.

import torchaudio as ta
from chatterbox_vllm.tts import ChatterboxTTS


if __name__ == "__main__":
    model = ChatterboxTTS.from_pretrained(
        gpu_memory_utilization = 0.4,
        max_model_len = 1000,

        # Disable CUDA graphs to reduce startup time for one-off generation.
        enforce_eager = True,
    )

    for i, audio_prompt_path in enumerate([None, "docs/audio-sample-01.mp3", "docs/audio-sample-03.mp3"]):
        prompts = [
            "You are listening to a demo of the Chatterbox TTS model running on VLLM.",
            "This is a separate prompt to test the batching implementation.",
            "And here is a third prompt. It's a bit longer than the first one, but not by much.",
        ]

        # One call generates audio for all prompts as a batch.
        audios = model.generate(prompts, audio_prompt_path=audio_prompt_path, exaggeration=0.8)
        for audio_idx, audio in enumerate(audios):
            ta.save(f"test-{i}-{audio_idx}.mp3", audio, model.sr)
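
Because model.generate in the example above accepts a list of prompts, longer text could be split into sentence-sized prompts and submitted as one batch. The following is a rough sketch under stated assumptions: split_into_prompts is a hypothetical helper, not part of chatterbox-vllm, and the 200-character default is an arbitrary illustration rather than a limit documented by the project.

```python
import re


def split_into_prompts(text, max_chars=200):
    """Split text into sentence-aligned chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut mid-word.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    prompts, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the current chunk.
            prompts.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        prompts.append(current)
    return prompts
```

The returned list could then be passed as the prompts argument of model.generate. Whether a given chunk size fits within max_model_len depends on the tokenizer, so the character limit should be treated as a starting point to tune.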

Why Use It

The primary motivation behind chatterbox-vllm is to overcome performance bottlenecks and improve GPU memory utilization of the original Chatterbox TTS model. By porting it to vLLM, the project achieves:

  • Improved Performance: Early benchmarks show generation throughput (tokens/s) increasing by approximately 4x without batching and by over 10x with batching. The original implementation was often bottlenecked by CPU-GPU synchronization.
  • Efficient GPU Memory Use: vLLM's optimized inference infrastructure allows for more efficient use of GPU memory, enabling higher throughput and potentially larger batch sizes.
  • Easier Integration: The vLLM port facilitates easier integration with modern, high-performance inference systems, streamlining deployment and scaling of TTS applications.
  • High Token-Generation Throughput: Throughput is strongest in the T3 Llama token generation component, which is no longer the bottleneck in the TTS pipeline.
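
To compare batched and unbatched generation on your own hardware, a small timing helper is one option. This is a minimal sketch, not the project's benchmark: measure_throughput and speedup are hypothetical names, and the helper counts returned outputs per second rather than tokens per second, since token counts are not exposed in the example shown earlier.

```python
import time


def measure_throughput(generate_fn, prompts, count_units=len):
    """Time one call to generate_fn(prompts).

    Returns (units, seconds, units_per_second), where units is
    count_units(outputs), by default the number of returned items.
    """
    start = time.perf_counter()
    outputs = generate_fn(prompts)
    elapsed = time.perf_counter() - start
    units = count_units(outputs)
    rate = units / elapsed if elapsed > 0 else float("inf")
    return units, elapsed, rate


def speedup(baseline_rate, new_rate):
    """Ratio of two throughput figures; 4.0 means four times faster."""
    return new_rate / baseline_rate
```

For a batched measurement, generate_fn could wrap a single model.generate call over all prompts. For an unbatched one, it could loop over the prompts and call model.generate once per prompt. Running each case a few times and discarding the first run helps exclude warm-up cost from the comparison.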
