CSM: A Conversational Speech Generation Model by SesameAILabs

Summary
CSM (Conversational Speech Model) is a speech generation model from SesameAILabs that generates Residual Vector Quantization (RVQ) audio codes from text and audio inputs. It leverages a Llama backbone and a smaller audio decoder that produces Mimi audio codes, enabling high-quality, context-aware speech synthesis. The model is now natively available in Hugging Face Transformers, making it accessible to researchers and developers.
Introduction
CSM, or Conversational Speech Model, is a cutting-edge speech generation model developed by SesameAILabs. This model specializes in generating RVQ audio codes from both text and audio inputs, making it highly versatile for various speech synthesis applications. Its architecture is built upon a robust Llama backbone, complemented by a smaller audio decoder that produces Mimi audio codes.
CSM is the technology powering the interactive voice demo showcased in Sesame's research blog post. As of Transformers version 4.52.1, CSM is natively available in Hugging Face Transformers, simplifying its integration into existing projects. For quick testing, a hosted Hugging Face Space is also available.
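As a quick illustration of the Transformers route, here is a minimal sketch based on the Transformers CSM documentation; the class and helper names (CsmForConditionalGeneration, output_audio, save_audio) should be verified against your installed version.

# Minimal sketch: running CSM through Hugging Face Transformers (>= 4.52.1).
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# "[0]" selects speaker id 0 for the text that follows.
inputs = processor("[0]Hello from Sesame.", add_special_tokens=True).to(device)

# Generate RVQ codes and decode them to a waveform in one call.
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example.wav")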
Installation
To get started with CSM, ensure you meet the following requirements:
- A CUDA-compatible GPU.
- The code has been tested on CUDA 12.4 and 12.6, but may work on other versions.
- Python 3.10 is recommended, though newer versions might also be compatible.
- ffmpeg may be required for certain audio operations.
- Access to the following Hugging Face models:
  - Llama-3.2-1B
  - CSM-1B
Follow these steps for setup:
git clone git@github.com:SesameAILabs/csm.git
cd csm
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Disable lazy compilation in Mimi
export NO_TORCH_COMPILE=1
# You will need access to CSM-1B and Llama-3.2-1B
huggingface-cli login
For Windows users, please note that the triton package cannot be installed directly. Instead, use pip install triton-windows.
Examples
Quickstart
To generate a conversation between two characters using prompts, run the following script:
python run_csm.py
Generate a Sentence
This example demonstrates how to generate a single sentence using a random speaker identity, as no prompt or context is provided.
from generator import load_csm_1b
import torchaudio
import torch
# Select the best available device for inference.
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
generator = load_csm_1b(device=device)
audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
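The generator returns a one-dimensional waveform tensor at generator.sample_rate; unsqueeze(0) adds the channel dimension that torchaudio.save expects.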
Generate with Context
CSM performs best when provided with context. You can prompt or provide context to the model by passing a Segment for each speaker's utterance. The following example is illustrative and assumes the referenced audio files exist; it continues from the previous example, reusing the loaded generator and the torchaudio import.
from generator import Segment
speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]
def load_audio(audio_path):
    # Load an utterance and resample it to the generator's sample rate.
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor
segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]
audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
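A natural follow-up, sketched below using only the Segment and generate APIs shown above, is to fold the generated turn back into the context so the conversation can continue; the follow-up text is an illustrative placeholder.

# Sketch: extend the conversation by appending the generated turn to the
# context before generating the next utterance.
segments.append(
    Segment(text="Me too, this is some cool stuff huh?", speaker=1, audio=audio)
)
next_audio = generator.generate(
    text="Definitely, glad it worked so well.",
    speaker=0,
    context=segments,
    max_audio_length_ms=10_000,
)
torchaudio.save("audio_next.wav", next_audio.unsqueeze(0).cpu(), generator.sample_rate)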
Why Use CSM?
CSM offers a powerful solution for high-quality conversational speech generation, making it an excellent tool for research and educational purposes. Its ability to leverage context allows for more natural and coherent speech synthesis, which is crucial for creating engaging audio experiences.
It's important to note that the open-sourced model is a base generation model. While it can produce a variety of voices, it has not been fine-tuned on any specific voice. CSM is designed as an audio generation model, not a general-purpose multimodal LLM, meaning it does not generate text. For text generation, a separate LLM is recommended. Although the model may have some capacity for non-English languages due to training data contamination, its performance in such languages is likely to be limited.
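To make that division of labor concrete, here is a hedged sketch pairing a separate text LLM with CSM; the choice of text model (meta-llama/Llama-3.2-1B-Instruct) and the prompt are illustrative assumptions, not part of CSM.

# Sketch: let a separate LLM write the text, then let CSM speak it.
# The text model below is an illustrative choice, not a requirement.
from transformers import pipeline
from generator import load_csm_1b
import torchaudio

chat = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")
reply = chat(
    "Write one short friendly greeting.",
    max_new_tokens=40,
    return_full_text=False,
)[0]["generated_text"]

generator = load_csm_1b(device="cuda")
audio = generator.generate(text=reply, speaker=0, context=[], max_audio_length_ms=10_000)
torchaudio.save("reply.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)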
Misuse and Abuse ⚠️
This project provides a high-quality speech generation model for research and educational purposes. While responsible and ethical use is encouraged, the following activities are explicitly prohibited:
- Impersonation or Fraud: Do not use this model to generate speech that mimics real individuals without their explicit consent.
- Misinformation or Deception: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
- Illegal or Harmful Activities: Do not use this model for any illegal, harmful, or malicious purposes.
By using this model, you agree to comply with all applicable laws and ethical guidelines. We are not responsible for any misuse, and we strongly condemn unethical applications of this technology.
Links
- GitHub Repository: https://github.com/SesameAILabs/csm
- Hugging Face Transformers Documentation: https://huggingface.co/docs/transformers/main/en/model_doc/csm
- CSM-1B Model on Hugging Face: https://huggingface.co/sesame/csm-1b
- Sesame Interactive Voice Demo: https://www.sesame.com/voicedemo
- Sesame Research Blog Post: https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice
- Hugging Face Space: https://huggingface.co/spaces/sesame/csm-1b