VoxCPM: Tokenizer-Free TTS for Multilingual Speech, Voice Design, and Cloning

Summary

VoxCPM2 is a tokenizer-free Text-to-Speech (TTS) system that synthesizes natural, expressive speech in 30 languages. It can design new voices from natural-language descriptions and clone existing voices with controllable style. The 2B-parameter model outputs 48kHz studio-quality audio.

Repository Info

Updated on June 1, 2026

Introduction

VoxCPM is a tokenizer-free Text-to-Speech (TTS) system developed by OpenBMB. Instead of converting speech into discrete tokens, it generates continuous speech representations directly with an end-to-end diffusion autoregressive architecture, an approach aimed at more natural and expressive synthesis. The latest major release, VoxCPM2, is a 2B-parameter model trained on over 2 million hours of multilingual speech data and supports 30 languages. It adds Voice Design and Controllable Voice Cloning and produces 48kHz studio-quality audio. VoxCPM2 is built on a MiniCPM-4 backbone and is fully open source under the Apache-2.0 license, which permits commercial use.

Installation

Getting started with VoxCPM is straightforward. Ensure you have Python ≥ 3.10 (<3.13), PyTorch ≥ 2.5.0, and CUDA ≥ 12.0 installed.

pip install voxcpm

For more detailed installation instructions, refer to the official documentation.
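Before installing, it can help to confirm that the environment matches the version requirements stated above. The following is a minimal sketch, not part of VoxCPM: the helper names `parse_version` and `check_environment` are hypothetical, and the thresholds are copied from the requirements in this section.

```python
import re
import sys


def parse_version(text):
    """Turn a version string such as '2.5.1+cu121' into a 3-part tuple like (2, 5, 1)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad so that '12.1' compares correctly against (12, 0, 0).
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def check_environment(python_version, torch_version=None, cuda_version=None):
    """Return a list of problems; an empty list means the stated requirements are met."""
    problems = []
    if not ((3, 10) <= tuple(python_version[:2]) < (3, 13)):
        problems.append("Python must be >= 3.10 and < 3.13")
    if torch_version is None:
        problems.append("PyTorch is not installed")
    elif parse_version(torch_version) < (2, 5, 0):
        problems.append("PyTorch must be >= 2.5.0")
    if cuda_version is None:
        problems.append("no CUDA-enabled PyTorch build detected")
    elif parse_version(cuda_version) < (12, 0, 0):
        problems.append("CUDA must be >= 12.0")
    return problems


if __name__ == "__main__":
    try:
        import torch
        torch_version, cuda_version = torch.__version__, torch.version.cuda
    except ImportError:
        torch_version = cuda_version = None
    for problem in check_environment(sys.version_info, torch_version, cuda_version):
        print("WARNING:", problem)
```

Run as a script, it prints one warning per unmet requirement and nothing when the environment looks compatible. It checks the CUDA version that PyTorch was built against, which is only an approximation of what is installed on the system.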

Examples

VoxCPM provides a flexible Python API for various speech synthesis tasks.

Text-to-Speech

Generate speech from text with ease:

from voxcpm import VoxCPM
import soundfile as sf

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)

wav = model.generate(
    text="VoxCPM2 is a powerful multilingual speech synthesis system.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("demo.wav", wav, model.tts_model.sample_rate)
print("saved: demo.wav")

Voice Design

Create a brand-new voice from a natural-language description; no reference audio is needed. Put the description in parentheses at the start of your text:

# Reuses `model` and `sf` from the Text-to-Speech example above.
wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
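When descriptions and texts come from separate sources (a form field, a config file), the parenthesized prefix can be assembled programmatically. The sketch below is illustrative only: `with_voice_prompt` is a hypothetical helper, not part of the VoxCPM API, and rejecting nested parentheses is a precaution assumed here rather than a documented rule.

```python
def with_voice_prompt(description, text):
    """Prefix `text` with a parenthesized voice description: '(description)text'."""
    description = description.strip()
    if not description:
        # No description given: return the text unchanged.
        return text
    if "(" in description or ")" in description:
        # Assumption: nested parentheses could confuse the prefix format.
        raise ValueError("description must not contain parentheses")
    return f"({description}){text}"


prompt = with_voice_prompt(
    "A young woman, gentle and sweet voice",
    "Hello, welcome to VoxCPM2!",
)
print(prompt)
```

The resulting string can be passed as the `text` argument of `model.generate`, exactly as in the example above.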

Controllable Voice Cloning

Clone any voice from a short reference clip, with optional style guidance to steer emotion, pace, and expression while preserving the original timbre:

# Reuses `model` and `sf` from the Text-to-Speech example above.
wav = model.generate(
    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
    reference_wav_path="path/to/voice.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)

Why Use VoxCPM?

VoxCPM2 stands out with several key advantages for speech generation:

  • 30-Language Multilingual Support: Synthesize speech from text in any of the 30 supported languages without needing language tags.
  • Creative Voice Design: Generate unique voices purely from natural-language descriptions, eliminating the need for reference audio.
  • Controllable Voice Cloning: Clone voices from short clips and fine-tune style, emotion, and pace while maintaining the original timbre.
  • 48kHz High-Quality Audio: Output studio-quality audio directly, with built-in super-resolution.
  • Real-Time Streaming: Achieve low Real-Time Factor (RTF) for efficient, high-throughput serving, especially with Nano-vLLM or vLLM-Omni.
  • Fully Open-Source & Commercial-Ready: Released under the Apache-2.0 license, making it free for commercial use.
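The Real-Time Factor mentioned in the list is the ratio of synthesis time to the duration of the audio produced; values below 1 mean speech is generated faster than it plays back. A minimal sketch for measuring it follows. The function name `real_time_factor` is hypothetical, and the commented usage assumes `wav` is a one-dimensional mono array whose length is the sample count.

```python
def real_time_factor(synthesis_seconds, num_samples, sample_rate):
    """RTF = time spent generating / duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("generated audio must have a positive duration")
    return synthesis_seconds / audio_seconds


# Example: 5 seconds of 48 kHz audio generated in 1.5 seconds.
rtf = real_time_factor(1.5, num_samples=48_000 * 5, sample_rate=48_000)
print(f"RTF = {rtf:.2f}")

# Assumed usage with the model from the examples above:
#   import time
#   start = time.perf_counter()
#   wav = model.generate(text="...", cfg_value=2.0, inference_timesteps=10)
#   elapsed = time.perf_counter() - start
#   print(real_time_factor(elapsed, len(wav), model.tts_model.sample_rate))
```

Measured values depend on hardware, text length, and settings such as `inference_timesteps`, so a single run is only a rough indication.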

Links

Explore VoxCPM further through these official resources: