AudioSep: Foundation Model for Open-Domain Sound Separation with Language Queries

Summary
AudioSep is a groundbreaking foundation model for open-domain sound separation, allowing users to isolate specific sounds using natural language descriptions. It demonstrates strong performance and impressive zero-shot generalization across various tasks, including audio event, musical instrument, and speech separation. This powerful tool simplifies complex audio processing with intuitive text-based queries.
Introduction
AudioSep is the official implementation of the paper "Separate Anything You Describe," introducing a novel foundation model for open-domain sound separation. This innovative model leverages natural language queries to perform highly accurate sound isolation, making complex audio processing tasks more accessible. AudioSep showcases robust separation performance and remarkable zero-shot generalization capabilities across a wide array of tasks, such as separating audio events, musical instruments, and enhancing speech. Explore its capabilities and listen to separated audio examples on the official Demo Page.
Installation
To get started with AudioSep, follow these steps to clone the repository and set up your environment:
Clone the repository and navigate into it:
git clone https://github.com/Audio-AGI/AudioSep.git
cd AudioSep
Create and activate the Conda environment:
conda env create -f environment.yml
conda activate AudioSep
Download model weights:
Obtain the necessary model weights from the Hugging Face checkpoint directory and place them in the checkpoint/ folder within your cloned repository.
Examples
AudioSep offers flexible methods for inference and training. Here are some common use cases:
Basic Inference
Perform sound separation using a local model checkpoint:
from pipeline import build_audiosep, inference
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = build_audiosep(
config_yaml='config/audiosep_base.yaml',
checkpoint_path='checkpoint/audiosep_base_4M_steps.ckpt',
device=device)
audio_file = 'path_to_audio_file'
text = 'textual_description'
output_file = 'separated_audio.wav'
# AudioSep processes audio at 32 kHz sampling rate
inference(model, audio_file, text, output_file, device)
Inference from Hugging Face
Load the model directly from Hugging Face for convenience:
from models.audiosep import AudioSep
from pipeline import inference
from utils import get_ss_model
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
ss_model = get_ss_model('config/audiosep_base.yaml')
model = AudioSep.from_pretrained("nielsr/audiosep-demo", ss_model=ss_model)
audio_file = 'path_to_audio_file'
text = 'textual_description'
output_file = 'separated_audio.wav'
# AudioSep processes audio at 32 kHz sampling rate
inference(model, audio_file, text, output_file, device)
Chunk-based Inference
For memory efficiency, especially with longer audio files, use chunk-based inference:
inference(model, audio_file, text, output_file, device, use_chunk=True)
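Conceptually, chunk-based inference bounds peak memory by running the model over fixed-size windows of the signal and stitching the results back together. The sketch below illustrates that idea only; it is not the repository's implementation of use_chunk=True, and `process_in_chunks` is a hypothetical helper.

```python
# Hedged sketch of chunked processing: apply a per-chunk function over a
# long signal and concatenate the outputs. Illustration only, not the
# actual use_chunk=True code path in AudioSep.
import numpy as np

def process_in_chunks(audio: np.ndarray, chunk_size: int, fn) -> np.ndarray:
    """Apply `fn` to consecutive fixed-size windows of `audio`."""
    outputs = []
    for start in range(0, len(audio), chunk_size):
        outputs.append(fn(audio[start:start + chunk_size]))
    return np.concatenate(outputs)

# Example: an identity "separator" leaves the signal unchanged,
# regardless of how it is chunked.
x = np.arange(10.0)
y = process_in_chunks(x, 4, lambda chunk: chunk)
```

In the real model, each window would pass through the separation network, so only one chunk's activations are held in GPU memory at a time.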
Training
AudioSep can be trained from scratch or fine-tuned with your own audio-text paired datasets. Refer to the repository's datafiles/template.json for the required data format and update config/audiosep_base.yaml to include your data files.
Why Use AudioSep?
AudioSep stands out for several compelling reasons:
- Open-Domain Separation: Its core strength lies in its ability to separate anything you can describe with natural language, offering unparalleled flexibility.
- Intuitive Interface: By using text queries, it makes advanced audio separation accessible to users without deep technical knowledge of signal processing.
- Foundation Model Capabilities: As a foundation model, it exhibits strong generalization, performing well on diverse and unseen audio separation tasks without specific retraining.
- Versatility: It effectively handles a broad spectrum of audio challenges, from isolating specific sound events and musical instruments to improving speech clarity.
- Community and Integration: With integrations like Colab, Hugging Face Spaces, and Replicate, experimenting and deploying AudioSep is straightforward.
Links
- GitHub Repository: Audio-AGI/AudioSep
- Demo Page: Separate Anything You Describe
- arXiv Paper: Separate Anything You Describe
- Colab Notebook: AudioSep Colab
- Hugging Face Spaces: Audio-AGI/AudioSep
- Replicate: cjwbw/audiosep