index-tts-lora: High-Quality Speech Synthesis with LoRA Fine-tuning

Summary
index-tts-lora applies LoRA fine-tuning to the index-tts framework to improve the prosody and naturalness of synthesized speech, for both single-speaker and multi-speaker voices. The repository provides scripts for data preparation, training, and inference.
Introduction
The index-tts-lora project builds on Bilibili's index-tts and applies LoRA (Low-Rank Adaptation) fine-tuning to improve prosody and naturalness in generated audio. LoRA keeps the pretrained weights frozen and trains only small low-rank update matrices, which keeps the number of trainable parameters low. The repository supports both single-speaker and multi-speaker setups.
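To make the low-rank idea concrete, the sketch below compares trainable-parameter counts for one linear layer under full fine-tuning and under LoRA. It illustrates the general LoRA technique, not this repository's code: the function name `lora_param_counts` and the layer sizes are assumptions chosen for illustration and are not taken from index-tts.

```python
from typing import Tuple


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one linear layer.

    Full fine-tuning updates the whole d_out x d_in weight matrix W.
    LoRA freezes W and trains two small matrices, B (d_out x rank) and
    A (rank x d_in), so the adapted weight is W + B @ A.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Illustrative sizes only (hypothetical, not the index-tts configuration).
full, lora = lora_param_counts(d_out=1024, d_in=1024, rank=8)
print(f"full: {full}, LoRA: {lora}, fraction trained: {lora / full:.4f}")
```

With a small rank, the adapter trains only a small fraction of the layer's parameters, which is one reason LoRA can work with a modest amount of audio.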
Installation and Usage
To get started with index-tts-lora, follow these steps for audio processing, training, and inference.
1. Audio token and speaker condition extraction
First, extract audio tokens and speaker conditions from your audio list.
# Extract tokens and speaker conditions
python tools/extract_codec.py --audio_list ${audio_list} --extract_condition
# audio_list format: one entry per line, audio path and transcript separated by a tab (\t)
/path/to/audio.wav<TAB><transcript of the audio>
After extraction, processed files and speaker_info.json will be generated under the finetune_data/processed_data/ directory.
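An audio list in the tab-separated format described above could be prepared with a small helper like the following. This is a minimal sketch, not part of the repository: the function names `write_audio_list` and `read_audio_list`, the output file name, and the sample paths are hypothetical. Only the "audio path, tab, transcript" layout comes from the project's instructions.

```python
import os
import tempfile


def write_audio_list(pairs, out_path):
    """Write (audio_path, transcript) pairs, one tab-separated entry per line."""
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        for audio_path, transcript in pairs:
            # Collapse tabs, newlines, and repeated whitespace inside the
            # transcript so each entry stays on one line with a single tab.
            clean = " ".join(transcript.split())
            f.write(f"{audio_path}\t{clean}\n")


def read_audio_list(path):
    """Read the list back and check every line has exactly two fields."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue
            parts = line.split("\t")
            if len(parts) != 2:
                raise ValueError(
                    f"line {line_no}: expected 2 tab-separated fields, got {len(parts)}"
                )
            entries.append((parts[0], parts[1]))
    return entries


# Hypothetical sample data written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    list_path = os.path.join(tmp_dir, "audio_list.txt")
    write_audio_list(
        [
            ("/path/to/audio_001.wav", "First transcript."),
            ("/path/to/audio_002.wav", "Second transcript."),
        ],
        list_path,
    )
    print(read_audio_list(list_path))
```

The resulting file path is what would be passed as `${audio_list}` to `tools/extract_codec.py`. Validating the list before extraction catches entries with a missing or extra tab early.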
2. Training
Initiate the training process using the provided script.
python train.py
3. Inference
Once trained, you can perform inference to generate speech.
python indextts/infer.py
Fine-tuning Results and Examples
The fine-tuning results shown here use Chinese audio data from Kai Shu Tells Stories: approximately 30 minutes of audio in 270 clips, split into 244 training samples and 26 validation samples.
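A split of this size can be reproduced with a seeded shuffle, as sketched below. The source does not say how the 244/26 split was actually made, so the shuffling approach, the `split_train_val` helper, the seed, and the clip names are all assumptions for illustration.

```python
import random


def split_train_val(items, n_val, seed=0):
    """Shuffle deterministically and hold out n_val items for validation."""
    if not 0 <= n_val <= len(items):
        raise ValueError("n_val must be between 0 and len(items)")
    shuffled = list(items)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical clip names standing in for the 270 audio clips.
clips = [f"clip_{i:03d}.wav" for i in range(270)]
train, val = split_train_val(clips, n_val=26)
print(len(train), len(val))  # 244 26
```

Fixing the seed keeps the validation set stable across runs, so validation loss stays comparable between experiments.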
Here are some speech synthesis examples:
| Text | Audio |
|---|---|
| ??????????????????????????????????????????????? | kaishu_cn_1.wav |
| ?????????????????????????????????????????????? | kaishu_cn_2.wav |
| ??Java????????M??????????????????Java Script?????????????? | kaishu_cn_en_mix_1.wav |
| ?? financial report ??????????????? revenue performance ? expenditure trends? | kaishu_cn_en_mix_2.wav |
| ???????????????????????????????????????????????????? | kaishu_raokouling.wav |
| A thin man lies against the side of the street with his shirt and a shoe off and bags nearby. | kaishu_en_1.wav |
| As research continued, the protective effect of fluoride against dental decay was demonstrated. | kaishu_en_2.wav |
Why Use index-tts-lora?
index-tts-lora is aimed at developers and researchers who want to adapt index-tts to a specific voice. LoRA fine-tuning adapts the model with relatively little data (the example above uses about 30 minutes of audio) while targeting better prosody and naturalness. Support for both single-speaker and multi-speaker setups makes it usable across a range of TTS projects.
Links
- GitHub Repository: https://github.com/asr-pub/index-tts-lora
- Original index-tts project: https://github.com/index-tts/index-tts
- finetune-index-tts: https://github.com/yrom/finetune-index-tts