🎸 SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline
📄 Paper | 🎧 Audio Samples | 🚀 Space Demo | 💻 Colab Demo | 🤗 Models
🎯 SoloSpeech is a novel cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction processes. SoloSpeech achieves state-of-the-art intelligibility and quality in target speech extraction and speech separation tasks while demonstrating exceptional generalization on out-of-domain data.
Based on the valuable comments on the Issues page, we plan to explore the following directions:
- Improve efficiency
- Add reranking
- Train on more realistic conditions
- Train on vocal mixtures in music
- Train on multiple languages
📝 Feel free to add more comments to the Issues page. Your feedback really helps us build the next version of SoloSpeech!
If you find this work useful, please consider contributing to this repo and citing our work:
@misc{wang2025solospeechenhancingintelligibilityquality,
title={SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline},
author={Helin Wang and Jiarui Hai and Dongchao Yang and Chen Chen and Kai Li and Junyi Peng and Thomas Thebaud and Laureano Moro Velazquez and Jesus Villalba and Najim Dehak},
year={2025},
eprint={2505.19314},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2505.19314},
}
@inproceedings{wang2025soloaudio,
title={SoloAudio: Target sound extraction with language-oriented audio diffusion transformer},
author={Wang, Helin and Hai, Jiarui and Lu, Yen-Ju and Thakkar, Karan and Elhilali, Mounya and Dehak, Najim},
booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--5},
year={2025},
organization={IEEE}
}
All listening samples, source code, pretrained checkpoints, and the evaluation toolkit are licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
See the LICENSE file for details.
This implementation is based on SoloAudio, EzAudio, DPM-TSE, and stable-audio-tools. We appreciate their awesome work.
If you find this repo helpful or interesting, consider dropping a ⭐ — it really helps and means a lot!
