Chatterbox: State-of-the-Art Open-Source Text-to-Speech by Resemble AI

Introduction

Chatterbox is a family of state-of-the-art, open-source text-to-speech (TTS) models developed by Resemble AI. The models generate high-fidelity speech for applications ranging from real-time voice agents to creative content production.

The latest addition, Chatterbox-Turbo, is Resemble AI's most efficient model to date. Built on a streamlined 350M-parameter architecture, Turbo delivers high-quality speech with significantly less compute and VRAM than its predecessors. Its key change is a distilled speech-token-to-mel decoder, which cuts generation from 10 steps to one while maintaining audio fidelity. Turbo also natively supports paralinguistic tags, so input text can include realistic non-speech sounds such as [cough], [laugh], and [chuckle]. Although it is optimized for low-latency voice agents, Turbo also suits narration and other creative workflows.

The Chatterbox family also includes a Multilingual model, which supports over 23 languages for global applications and localization, and the original Chatterbox model, which offers creative controls such as classifier-free guidance (CFG) and exaggeration tuning.

Installation

Install Chatterbox from PyPI with pip:

pip install chatterbox-tts

Alternatively, you can install from source for more control:

# Optional: create and activate a fresh conda environment first
# conda create -yn chatterbox python=3.11
# conda activate chatterbox

git clone https://github.com/resemble-ai/chatterbox.git
cd chatterbox
pip install -e .

Chatterbox was developed and tested with Python 3.11 on Debian 11, and its dependency versions are pinned for consistency.
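
Because the project was tested on a specific Python version, it can help to confirm what is actually running before loading a model. The following is a minimal sketch using only the standard library. The helper name check_environment is hypothetical and not part of Chatterbox; the only names taken from this document are the package name chatterbox-tts and the tested version 3.11.

```python
import sys
from importlib import metadata


def check_environment(package="chatterbox-tts", tested_python=(3, 11)):
    """Report the running Python version and whether `package` is installed.

    Hypothetical helper for illustration; not part of the Chatterbox API.
    """
    try:
        installed_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        installed_version = None  # package is not installed in this environment

    running = tuple(sys.version_info[:2])
    return {
        "python": running,
        "matches_tested_python": running == tuple(tested_python),
        "installed_version": installed_version,
    }


if __name__ == "__main__":
    print(check_environment())
```

A mismatch with Python 3.11 does not necessarily mean the package will fail; it only indicates that the environment differs from the one the project was tested on.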

Examples

The examples below show how to generate speech with each Chatterbox model.

Chatterbox-Turbo

import torchaudio as ta
from chatterbox.tts_turbo import ChatterboxTurboTTS

# Load the Turbo model
model = ChatterboxTurboTTS.from_pretrained(device="cuda")

# Generate with Paralinguistic Tags
text = "Hi there, Sarah here from MochaFone calling you back [chuckle], have you got one minute to chat about the billing issue?"

# Generate audio (requires a reference clip for voice cloning)
wav = model.generate(text, audio_prompt_path="your_10s_ref_clip.wav")

ta.save("test-turbo.wav", wav, model.sr)
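
Since paralinguistic tags are written inline as bracketed words, a misspelled tag is easy to miss. The sketch below scans text for bracketed tags and flags any that are not in a known set. It is an illustration only: the helper names are hypothetical, the known set contains just the three tags named in this document, and the full set that Turbo accepts may be larger.

```python
import re

# Tags named in this document; the full set Turbo supports may be larger.
KNOWN_TAGS = {"cough", "laugh", "chuckle"}

# Matches a single pair of square brackets with no nested brackets inside.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_tags(text):
    """Return every bracketed tag in `text`, lowercased, in order of appearance."""
    return [m.group(1).strip().lower() for m in TAG_PATTERN.finditer(text)]


def unknown_tags(text, known=KNOWN_TAGS):
    """Return the tags in `text` that are not in `known`, preserving order."""
    return [tag for tag in find_tags(text) if tag not in known]


text = "Hi there, Sarah here from MochaFone calling you back [chuckle], have you got one minute?"
print(find_tags(text))                                  # ['chuckle']
print(unknown_tags("Sorry [cuogh], go on [laugh]."))    # ['cuogh']
```

Running a check like this before calling generate catches typos early, at the cost of maintaining the known-tag set by hand.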

Chatterbox and Chatterbox-Multilingual


import torchaudio as ta
from chatterbox.tts import ChatterboxTTS
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

# English example
model = ChatterboxTTS.from_pretrained(device="cuda")

text = "Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill."
wav = model.generate(text)
ta.save("test-english.wav", wav, model.sr)

# Multilingual examples
multilingual_model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

french_text = "Bonjour, comment ça va? Ceci est le modèle de synthèse vocale multilingue Chatterbox, il prend en charge 23 langues."
wav_french = multilingual_model.generate(french_text, language_id="fr")
ta.save("test-french.wav", wav_french, multilingual_model.sr)

chinese_text = "你好，今天天气真不错，希望你有一个愉快的周末。"
wav_chinese = multilingual_model.generate(chinese_text, language_id="zh")
ta.save("test-chinese.wav", wav_chinese, multilingual_model.sr)

# To synthesize in a different voice, pass a reference clip as the audio prompt
AUDIO_PROMPT_PATH = "YOUR_FILE.wav"
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save("test-2.wav", wav, model.sr)
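
The loading calls above pass a device string such as "cuda". On machines without an NVIDIA GPU, one option is to choose the device at runtime. The following sketch uses standard PyTorch availability checks; the helper name pick_device is hypothetical, and this document does not state how well each model runs on "mps" or "cpu", so treat those fallbacks as assumptions to verify.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports.

    Hypothetical helper for illustration; not part of the Chatterbox API.
    """
    try:
        import torch
    except ImportError:
        # Assumption: fall back to "cpu" when PyTorch is not importable.
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"

    # torch.backends.mps exists only in newer PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


device = pick_device()
print(device)

# The result can then be passed to a loader, for example:
# model = ChatterboxTTS.from_pretrained(device=device)
```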

Why Use Chatterbox?

Chatterbox offers a compelling solution for text-to-speech needs due to several key advantages:

State-of-the-Art Performance: Delivers high-quality, natural-sounding speech across its model family.
Exceptional Efficiency: Chatterbox-Turbo reduces compute and VRAM requirements and cuts the speech-token-to-mel decoder from 10 generation steps to one.
Realistic Paralinguistic Tags: Enhance expressiveness and realism in generated audio with built-in tags such as [cough], [laugh], and [chuckle].
Broad Multilingual Support: The Multilingual model supports over 23 languages, making it suitable for global applications and diverse content.
Voice Cloning Capabilities: Easily generate speech in different voices by providing an audio prompt.
Responsible AI: Integrates Resemble AI's PerTh (Perceptual Threshold) Watermarker, embedding imperceptible neural watermarks for ethical AI use.
Active Community: Join the official Discord for support and collaboration.
