AudioSep: Foundation Model for Open-Domain Sound Separation with Language Queries

Summary

AudioSep is a foundation model for open-domain sound separation that lets users isolate specific sounds using natural language descriptions. It demonstrates strong separation performance and impressive zero-shot generalization across tasks including audio event separation, musical instrument separation, and speech enhancement, making complex audio processing accessible through intuitive text-based queries.

Repository Info

Updated on March 30, 2026

Introduction

AudioSep is the official implementation of the paper "Separate Anything You Describe," introducing a novel foundation model for open-domain sound separation. This model leverages natural language queries to perform highly accurate sound isolation, making complex audio processing tasks more accessible. AudioSep showcases robust separation performance and remarkable zero-shot generalization capabilities across a wide array of tasks, such as audio event separation, musical instrument separation, and speech enhancement. Explore its capabilities and listen to separated audio examples on the official Demo Page.

Installation

To get started with AudioSep, follow these steps to clone the repository and set up your environment:

  1. Clone the repository and navigate into it:

    git clone https://github.com/Audio-AGI/AudioSep.git && \
    cd AudioSep
    
  2. Create and activate the Conda environment:

    conda env create -f environment.yml && \
    conda activate AudioSep
    
  3. Download model weights:

    Obtain the necessary model weights from the Hugging Face checkpoint directory and place them in the checkpoint/ folder within your cloned repository.

Examples

AudioSep offers flexible methods for inference and training. Here are some common use cases:

Basic Inference

Perform sound separation using a local model checkpoint:

from pipeline import build_audiosep, inference
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = build_audiosep(
      config_yaml='config/audiosep_base.yaml', 
      checkpoint_path='checkpoint/audiosep_base_4M_steps.ckpt', 
      device=device)

audio_file = 'path_to_audio_file'
text = 'textual_description'
output_file = 'separated_audio.wav'

# AudioSep processes audio at 32 kHz sampling rate  
inference(model, audio_file, text, output_file, device)

Inference from Hugging Face

Load the model directly from Hugging Face for convenience:

from models.audiosep import AudioSep
from pipeline import inference
from utils import get_ss_model
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

ss_model = get_ss_model('config/audiosep_base.yaml')

model = AudioSep.from_pretrained("nielsr/audiosep-demo", ss_model=ss_model)

audio_file = 'path_to_audio_file'
text = 'textual_description'
output_file = 'separated_audio.wav'

# AudioSep processes audio at 32 kHz sampling rate  
inference(model, audio_file, text, output_file, device)

Chunk-based Inference

For memory efficiency, especially with longer audio files, use chunk-based inference:

inference(model, audio_file, text, output_file, device, use_chunk=True)
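Conceptually, chunk-based inference splits the input into fixed-length segments, separates each one, and concatenates the results, so only one chunk needs to be held in memory at a time. The helper below is a hypothetical sketch of that pattern, not AudioSep's actual implementation; `separate_fn` stands in for the model's per-chunk separation call:

```python
# Hypothetical sketch of chunk-based processing (not AudioSep's internal code).
# `separate_fn` stands in for the model's per-chunk separation call.

def process_in_chunks(samples, chunk_size, separate_fn):
    """Split `samples` into chunks of `chunk_size`, apply `separate_fn`
    to each chunk, and concatenate the results."""
    output = []
    for start in range(0, len(samples), chunk_size):
        chunk = samples[start:start + chunk_size]
        output.extend(separate_fn(chunk))
    return output

# Example: a trivial "separator" that halves every sample.
signal = [1.0, 2.0, 3.0, 4.0, 5.0]
result = process_in_chunks(signal, chunk_size=2,
                           separate_fn=lambda c: [x * 0.5 for x in c])
print(result)  # [0.5, 1.0, 1.5, 2.0, 2.5]
```

Real chunked separation may additionally overlap and cross-fade chunk boundaries to avoid audible seams, but the memory-saving idea is the same.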

Training

AudioSep can be trained from scratch or fine-tuned with your own audio-text paired datasets. Refer to the repository's datafiles/template.json for the required data format and update config/audiosep_base.yaml to include your data files.
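The authoritative schema is the one in `datafiles/template.json`; purely as an illustration, an audio-text paired datafile is typically a list of entries that pair an audio path with a caption, along these hypothetical lines (field names are assumptions, not the confirmed schema):

```json
{
  "data": [
    {
      "wav": "path/to/audio_1.wav",
      "caption": "a dog barking in the distance"
    },
    {
      "wav": "path/to/audio_2.wav",
      "caption": "acoustic guitar strumming"
    }
  ]
}
```

Once your datafiles match the template, list them in `config/audiosep_base.yaml` so the training pipeline can find them.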

Why Use AudioSep?

AudioSep stands out for several compelling reasons:

  • Open-Domain Separation: Its core strength lies in its ability to separate anything you can describe with natural language, offering unparalleled flexibility.
  • Intuitive Interface: By using text queries, it makes advanced audio separation accessible to users without deep technical knowledge of signal processing.
  • Foundation Model Capabilities: As a foundation model, it exhibits strong generalization, performing well on diverse and unseen audio separation tasks without specific retraining.
  • Versatility: It effectively handles a broad spectrum of audio challenges, from isolating specific sound events and musical instruments to improving speech clarity.
  • Community and Integration: With integrations like Colab, Hugging Face Spaces, and Replicate, experimenting and deploying AudioSep is straightforward.

Links