MOSS-TTS Family: Open-Source High-Fidelity Speech and Sound Generation

Summary

The MOSS-TTS Family is an open-source suite of models from MOSI.AI and the OpenMOSS team for high-fidelity, highly expressive speech and sound generation. Designed for complex real-world scenarios, it covers stable long-form speech, multi-speaker dialogue, voice design, environmental sound effects, and real-time streaming TTS.

Repository Info

Updated on May 31, 2026

Introduction

The MOSS-TTS Family, developed by MOSI.AI and the OpenMOSS team, is an open-source collection of models dedicated to advanced speech and sound generation. It is built for applications that demand high fidelity and expressiveness under complex real-world conditions. The family addresses stable long-form speech, multi-speaker dialogue, voice and character design, environmental sound effects, and real-time streaming text-to-speech (TTS).

A single piece of audio often has to satisfy several requirements at once: sounding like a real person, pronouncing words accurately, switching speaking styles, staying stable over long durations, and supporting dialogue and real-time interaction. The MOSS-TTS Family splits this complexity across five production-ready models that can be used independently or combined into a complete pipeline:

  • MOSS-TTS: The flagship model for high fidelity and optimal zero-shot voice cloning, supporting long-speech generation, fine-grained control, and multilingual synthesis.
  • MOSS-TTSD: A spoken dialogue generation model for expressive, multi-speaker, and ultra-long dialogues.
  • MOSS-VoiceGenerator: An open-source voice design model capable of generating diverse voices and styles directly from text prompts, without reference speech.
  • MOSS-TTS-Realtime: A multi-turn context-aware model for real-time voice agents, ensuring natural and coherent replies with low latency.
  • MOSS-SoundEffect: A content creation model specialized in sound effect generation with wide category coverage and controllable duration.
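
As a rough illustration of how the five models divide the work, the sketch below maps a task label to one of the model names listed above. The task labels and the `pick_model` helper are hypothetical conveniences invented for this example; they are not part of the MOSS-TTS codebase, and the strings are display names rather than checkpoint identifiers.

```python
# Hypothetical task labels mapped to the model names from the list above.
MODEL_FOR_TASK = {
    "narration": "MOSS-TTS",
    "voice_cloning": "MOSS-TTS",
    "dialogue": "MOSS-TTSD",
    "voice_design": "MOSS-VoiceGenerator",
    "realtime_agent": "MOSS-TTS-Realtime",
    "sound_effect": "MOSS-SoundEffect",
}


def pick_model(task: str) -> str:
    """Return the family member suited to `task`; raise on an unknown label."""
    try:
        return MODEL_FOR_TASK[task]
    except KeyError:
        known = ", ".join(sorted(MODEL_FOR_TASK))
        raise ValueError(f"unknown task {task!r}; expected one of: {known}") from None


print(pick_model("dialogue"))
```

A pipeline that narrates a script, voices its dialogue scenes, and adds ambience would call `pick_model` once per stage and load only the models it needs.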

Installation

To get started with the MOSS-TTS Family, use a clean, isolated Python environment to avoid dependency conflicts. Transformers 5.0.0 is required.

Using Conda

  1. Create and activate a new Conda environment:
    conda create -n moss-tts python=3.12 -y
    conda activate moss-tts
    
  2. Clone the repository and install dependencies:
    git clone https://github.com/OpenMOSS/MOSS-TTS.git
    cd MOSS-TTS
    pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"
    

Using uv

  1. Install uv if it is not already installed (see the uv documentation).
  2. Clone the repository and create a virtual environment:
    git clone https://github.com/OpenMOSS/MOSS-TTS.git
    cd MOSS-TTS
    uv venv --python 3.12 .venv
    source .venv/bin/activate
    
  3. Install dependencies:
    uv pip install --torch-backend cu128 -e ".[torch-runtime]"
    

(Optional) Install FlashAttention 2

For improved speed and reduced GPU memory usage, FlashAttention 2 can be installed if your hardware supports it.

For Conda/pip:

pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,flash-attn]"

For uv:

uv pip install --torch-backend cu128 -e ".[torch-runtime,flash-attn]"
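
After installing, it can help to confirm what actually landed in the environment. The following is a minimal sketch using only the Python standard library; it assumes the Transformers requirement stated above is a minimum version, and the helper names (`parse_version`, `meets_minimum`, `report`) are made up for this example rather than taken from the repository.

```python
from importlib import metadata, util


def parse_version(text: str) -> tuple:
    """Turn '5.0.0' or '2.8.0+cu128' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings numerically, padding the shorter with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def report() -> dict:
    """Collect installed versions without importing the heavy packages."""
    status = {}
    for dist in ("torch", "torchaudio", "transformers"):
        try:
            status[dist] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            status[dist] = "missing"
    # FlashAttention 2 is optional, so only its presence is recorded.
    status["flash_attn"] = "present" if util.find_spec("flash_attn") else "absent"
    return status


MIN_TRANSFORMERS = "5.0.0"  # assumption: the stated requirement is a minimum

info = report()
for name, found in info.items():
    print(f"{name:>12}: {found}")
if info["transformers"] == "missing" or not meets_minimum(info["transformers"], MIN_TRANSFORMERS):
    print(f"transformers >= {MIN_TRANSFORMERS} not found; re-run the install step")
```

Reading versions through `importlib.metadata` avoids importing `torch` itself, so the check stays fast and works even when the CUDA runtime is misconfigured.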

Examples

MOSS-TTS provides a straightforward generate interface. Here's a basic example demonstrating direct TTS and voice cloning:

from pathlib import Path
import importlib.util
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor

# Disable the broken cuDNN SDPA backend
torch.backends.cuda.enable_cudnn_sdp(False)
# Keep these enabled as fallbacks
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)

pretrained_model_name_or_path = "OpenMOSS-Team/MOSS-TTS-v1.5"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

def resolve_attn_implementation() -> str:
    if (
        device == "cuda"
        and importlib.util.find_spec("flash_attn") is not None
        and dtype in {torch.float16, torch.bfloat16}
    ):
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            return "flash_attention_2"
    if device == "cuda":
        return "sdpa"
    return "eager"

attn_implementation = resolve_attn_implementation()
print(f"[INFO] Using attn_implementation={attn_implementation}")

processor = AutoProcessor.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

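# The Chinese sample text was lost to an encoding error ("?" placeholders);
# replace it with real Chinese text before running.
text_chinese = "?????\n????\n\n???????????????????????????"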
text_english = "We stand on the threshold of the AI era.\nArtificial intelligence is no longer just a concept in laboratories."

# Reference audio is fetched from these URLs; to avoid the download,
# use the local files under ./assets/audio instead.
ref_audio_chinese = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_zh.wav"
ref_audio_english = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_en.m4a"

conversations = [
    # Direct TTS (no reference). Language tags are recommended in v1.5.
    [processor.build_user_message(text=text_chinese)],
    [processor.build_user_message(text=text_english)],
    # Voice cloning (with reference)
    [processor.build_user_message(text=text_chinese, reference=[ref_audio_chinese])],
    [processor.build_user_message(text=text_english, reference=[ref_audio_english])],
]

model = AutoModel.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    attn_implementation=attn_implementation,
    torch_dtype=dtype,
).to(device)
model.eval()

save_dir = Path("inference_root")
save_dir.mkdir(exist_ok=True, parents=True)
sample_idx = 0
with torch.no_grad():
    for batch_conversations in conversations:
        batch = processor([batch_conversations], mode="generation")
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=4096,
        )

        for message in processor.decode(outputs):
            audio = message.audio_codes_list[0]
            out_path = save_dir / f"sample{sample_idx}.wav"
            sample_idx += 1
            torchaudio.save(out_path, audio.unsqueeze(0), processor.model_config.sampling_rate)
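
The `resolve_attn_implementation` function in the example reads module-level variables and queries the GPU directly, which makes its decision hard to check on a machine without CUDA. The sketch below restates the same decision rule as a pure function; `choose_attn_implementation` and its parameters are hypothetical names for this illustration, not an API from the repository. The threshold of compute capability 8 corresponds to Ampere or newer GPUs.

```python
from typing import Optional


def choose_attn_implementation(
    device: str,
    dtype_name: str,
    has_flash_attn: bool,
    cc_major: Optional[int],
) -> str:
    """Mirror the example's rule: FlashAttention 2, then SDPA, then eager."""
    half_precision = dtype_name in {"float16", "bfloat16"}
    if (
        device == "cuda"
        and has_flash_attn
        and half_precision
        and cc_major is not None
        and cc_major >= 8
    ):
        return "flash_attention_2"
    if device == "cuda":
        return "sdpa"
    return "eager"


# Walk through a few hardware configurations without touching a GPU.
for case in [
    ("cuda", "bfloat16", True, 8),   # recent GPU with flash-attn installed
    ("cuda", "bfloat16", True, 7),   # older GPU: falls back to SDPA
    ("cuda", "bfloat16", False, 9),  # flash-attn not installed
    ("cpu", "float32", False, None), # CPU-only machine
]:
    print(case, "->", choose_attn_implementation(*case))
```

In the real script, the arguments would come from `torch.cuda.is_available()`, `importlib.util.find_spec("flash_attn")`, and `torch.cuda.get_device_capability()`, exactly as in the example above.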

Why Use MOSS-TTS Family?

The MOSS-TTS Family takes a broad approach to audio generation, offering a suite of specialized models rather than a single general-purpose one. Key advantages include:

  • High Fidelity and Expressiveness: The models are designed to produce highly realistic and emotionally rich speech and sound, crucial for immersive user experiences.
  • Versatility for Complex Scenarios: From long-form narratives and multi-speaker dialogues to unique voice design and real-time conversational agents, MOSS-TTS Family provides tailored solutions for diverse and challenging applications.
  • Multilingual Support: MOSS-TTS-v1.5 supports 31 languages, offering robust multilingual synthesis and voice cloning capabilities, making it suitable for global applications.
  • Optimized Performance: With features like FlashAttention 2 support and a torch-free llama.cpp backend, the models are optimized for speed, lower GPU memory usage, and efficient deployment on various hardware, including CPU-only environments.
  • Modular Architecture: The family's modular design allows developers to use individual models for specific tasks or combine them into powerful pipelines, offering flexibility and scalability.
  • Strong Evaluation Results: The project reports leading results for MOSS-TTS models on objective and subjective benchmarks, rivaling or surpassing top closed-source systems in areas such as speaker attribution accuracy, voice similarity, and overall quality.

Links