MOSS-TTS Family: Open-Source High-Fidelity Speech and Sound Generation

Introduction

MOSS-TTS Family, developed by MOSI.AI and the OpenMOSS team, is an open-source collection of models dedicated to advanced speech and sound generation. It is engineered to meet the demands of high-fidelity, high-expressiveness, and complex real-world applications. This family of models addresses various needs, including stable long-form speech, multi-speaker dialogue, voice and character design, environmental sound effects, and real-time streaming text-to-speech (TTS).

A single piece of audio often has to do several things at once: sound like a real person, pronounce words accurately, switch speaking styles, stay stable over long durations, and support dialogue and real-time interaction. The MOSS-TTS Family splits these requirements across five production-ready models that can be used independently or combined into a complete pipeline:

MOSS-TTS: The flagship model for high fidelity and optimal zero-shot voice cloning, supporting long-speech generation, fine-grained control, and multilingual synthesis.
MOSS-TTSD: A spoken dialogue generation model for expressive, multi-speaker, and ultra-long dialogues.
MOSS-VoiceGenerator: An open-source voice design model capable of generating diverse voices and styles directly from text prompts, without reference speech.
MOSS-TTS-Realtime: A multi-turn context-aware model for real-time voice agents, ensuring natural and coherent replies with low latency.
MOSS-SoundEffect: A content creation model specialized in sound effect generation with wide category coverage and controllable duration.
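
As a rough orientation aid, the split above can be written down as a lookup from task to model. The task keys and the `pick_model` helper are hypothetical names invented for this sketch. Only the model names come from the list above, and this is not an API the project provides.

```python
# Hypothetical task keys mapped to the model names listed above.
MODEL_FOR_TASK = {
    "tts": "MOSS-TTS",
    "voice_cloning": "MOSS-TTS",
    "dialogue": "MOSS-TTSD",
    "voice_design": "MOSS-VoiceGenerator",
    "realtime_agent": "MOSS-TTS-Realtime",
    "sound_effect": "MOSS-SoundEffect",
}


def pick_model(task: str) -> str:
    """Return the family member suited to a task, or raise on unknown tasks."""
    try:
        return MODEL_FOR_TASK[task]
    except KeyError:
        known = ", ".join(sorted(MODEL_FOR_TASK))
        raise ValueError(f"Unknown task {task!r}; expected one of: {known}") from None


if __name__ == "__main__":
    print(pick_model("dialogue"))
```

A pipeline that chains several models, for example designing a voice and then generating a dialogue with it, would call such a lookup once per stage.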

Installation

A clean, isolated Python environment is recommended to avoid dependency conflicts. Transformers 5.0.0 is required.

Using Conda

Create and activate a new Conda environment:

conda create -n moss-tts python=3.12 -y
conda activate moss-tts

Clone the repository and install dependencies:

git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"

Using `uv`

Install uv if it is not already installed (see the uv documentation).

Clone the repository and create a virtual environment:

git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
uv venv --python 3.12 .venv
source .venv/bin/activate

Install dependencies:

uv pip install --torch-backend cu128 -e ".[torch-runtime]"

(Optional) Install FlashAttention 2

For improved speed and reduced GPU memory usage, FlashAttention 2 can be installed if your hardware supports it.

For Conda/pip:

pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,flash-attn]"

For uv:

uv pip install --torch-backend cu128 -e ".[torch-runtime,flash-attn]"
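
After installing, a quick check of the environment can catch a mismatched setup before any model is downloaded. The sketch below uses only the Python standard library. The function names `installed_version` and `environment_report` are illustrative and not part of MOSS-TTS. It reports versions rather than enforcing them, so compare the output against the requirements stated above.

```python
import importlib.metadata
import importlib.util
import sys
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None


def environment_report() -> dict:
    """Collect the facts the installation steps depend on."""
    return {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "torch": installed_version("torch"),
        "transformers": installed_version("transformers"),
        # FlashAttention 2 is optional, so only its presence is checked.
        "flash_attn_available": importlib.util.find_spec("flash_attn") is not None,
    }


if __name__ == "__main__":
    for name, value in environment_report().items():
        print(f"{name}: {value}")
```

A value of `None` for `torch` or `transformers` means the package is not installed in the active environment, which usually indicates that the wrong environment is activated.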

Examples

MOSS-TTS provides a straightforward generate interface. Here's a basic example demonstrating direct TTS and voice cloning:

from pathlib import Path
import importlib.util
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor

# Disable the broken cuDNN SDPA backend
torch.backends.cuda.enable_cudnn_sdp(False)
# Keep these enabled as fallbacks
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)

pretrained_model_name_or_path = "OpenMOSS-Team/MOSS-TTS-v1.5"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

def resolve_attn_implementation() -> str:
    if (
        device == "cuda"
        and importlib.util.find_spec("flash_attn") is not None
        and dtype in {torch.float16, torch.bfloat16}
    ):
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            return "flash_attention_2"
    if device == "cuda":
        return "sdpa"
    return "eager"

attn_implementation = resolve_attn_implementation()
print(f"[INFO] Using attn_implementation={attn_implementation}")

processor = AutoProcessor.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

text_chinese = "?????\n????\n\n???????????????????????????"
text_english = "We stand on the threshold of the AI era.\nArtificial intelligence is no longer just a concept in laboratories."

# Use audio from ./assets/audio to avoid downloading from the cloud.
ref_audio_chinese = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_zh.wav"
ref_audio_english = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_en.m4a"

conversations = [
    # Direct TTS (no reference). Language tags are recommended in v1.5.
    [processor.build_user_message(text=text_chinese)],
    [processor.build_user_message(text=text_english)],
    # Voice cloning (with reference)
    [processor.build_user_message(text=text_chinese, reference=[ref_audio_chinese])],
    [processor.build_user_message(text=text_english, reference=[ref_audio_english])],
]

model = AutoModel.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    attn_implementation=attn_implementation,
    torch_dtype=dtype,
).to(device)
model.eval()

save_dir = Path("inference_root")
save_dir.mkdir(exist_ok=True, parents=True)
sample_idx = 0
with torch.no_grad():
    for batch_conversations in conversations:
        batch = processor([batch_conversations], mode="generation")
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=4096,
        )

        for message in processor.decode(outputs):
            audio = message.audio_codes_list[0]
            out_path = save_dir / f"sample{sample_idx}.wav"
            sample_idx += 1
            torchaudio.save(out_path, audio.unsqueeze(0), processor.model_config.sampling_rate)
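
The comment in the example suggests using audio from ./assets/audio instead of downloading the reference clips. One way to do that without editing the URLs by hand is sketched below. `prefer_local_reference` is a hypothetical helper, and it assumes that local copies keep the same file name as the remote file and that the `reference` argument accepts a local path string. Neither assumption is confirmed by the text above.

```python
from pathlib import Path
from urllib.parse import urlparse


def prefer_local_reference(url: str, assets_dir: Path = Path("assets/audio")) -> str:
    """Return a local file path if one matches the URL's file name, else the URL.

    Assumption: local copies keep the same file name as the remote file.
    """
    file_name = Path(urlparse(url).path).name
    candidate = assets_dir / file_name
    # Fall back to the original URL when no matching local file exists.
    return str(candidate) if candidate.is_file() else url
```

With this in place, `ref_audio_chinese` and `ref_audio_english` could be passed through the helper before building the user messages, so the same script works with or without a local checkout of the assets.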

Why Use MOSS-TTS Family?

MOSS-TTS Family covers audio generation with a suite of specialized models rather than a single general-purpose one, and the project reports state-of-the-art results for the family as a whole. Key advantages include:

High Fidelity and Expressiveness: The models are designed to produce highly realistic and emotionally rich speech and sound, crucial for immersive user experiences.
Versatility for Complex Scenarios: From long-form narratives and multi-speaker dialogues to unique voice design and real-time conversational agents, MOSS-TTS Family provides tailored solutions for diverse and challenging applications.
Multilingual Support: MOSS-TTS-v1.5 supports 31 languages, offering robust multilingual synthesis and voice cloning capabilities, making it suitable for global applications.
Optimized Performance: With features like FlashAttention 2 support and a torch-free llama.cpp backend, the models are optimized for speed, lower GPU memory usage, and efficient deployment on various hardware, including CPU-only environments.
Modular Architecture: The family's modular design allows developers to use individual models for specific tasks or combine them into powerful pipelines, offering flexibility and scalability.
Strong Evaluation Results: According to the project's reported evaluations, MOSS-TTS models lead on objective and subjective benchmarks and often rival or surpass top closed-source systems in areas such as speaker attribution accuracy, voice similarity, and overall quality.
