GitHub - WangHelin1997/SoloSpeech: SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline

🎸 SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline

📄 Paper | 🎧 Audio Samples | 🚀 Space Demo | 💻 Colab Demo | 🤗 Models

Introduction

🎯 SoloSpeech is a novel cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction stages. It achieves state-of-the-art intelligibility and quality on target speech extraction and speech separation tasks, and it generalizes strongly to out-of-domain data.
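A cascade like this can be pictured as a chain of stages, each consuming the previous stage's output. The sketch below is a toy illustration in plain Python, not SoloSpeech's implementation: the class `CascadedPipeline` and the stage functions `compress`, `extract`, `reconstruct`, and `correct` are hypothetical, and their bodies (frame averaging, a cue-level gain, sample repetition, clipping) are stand-ins for learned models. It also assumes, for illustration only, that extraction is conditioned on a compressed enrollment cue of the target speaker and that the correction stage can see the original mixture.

```python
from dataclasses import dataclass
from typing import Callable, List

Signal = List[float]


@dataclass
class CascadedPipeline:
    """Toy cascade: each stage consumes the previous stage's output.

    Stage names mirror the README (compression, extraction, reconstruction,
    correction); the stage bodies below are hypothetical stand-ins.
    """
    compress: Callable[[Signal], Signal]
    extract: Callable[[Signal, Signal], Signal]  # (mixture latent, cue latent) -> target latent
    reconstruct: Callable[[Signal], Signal]
    correct: Callable[[Signal, Signal], Signal]  # (estimate, mixture) -> refined estimate

    def __call__(self, mixture: Signal, cue: Signal) -> Signal:
        z_mix = self.compress(mixture)          # compress the mixture
        z_cue = self.compress(cue)              # compress the enrollment cue
        z_target = self.extract(z_mix, z_cue)   # extract the target in the compressed space
        estimate = self.reconstruct(z_target)   # decode back to the sample domain
        return self.correct(estimate, mixture)  # refine the decoded estimate


def compress(x: Signal, factor: int = 2) -> Signal:
    # Hypothetical compressor: average non-overlapping frames of `factor` samples.
    return [sum(x[i:i + factor]) / len(x[i:i + factor]) for i in range(0, len(x), factor)]


def extract(z_mix: Signal, z_cue: Signal) -> Signal:
    # Hypothetical extractor: scale the mixture latent by the cue's mean absolute level.
    level = sum(abs(v) for v in z_cue) / max(len(z_cue), 1)
    return [level * v for v in z_mix]


def reconstruct(z: Signal, factor: int = 2) -> Signal:
    # Hypothetical decoder: repeat each latent value `factor` times.
    return [v for v in z for _ in range(factor)]


def correct(estimate: Signal, mixture: Signal) -> Signal:
    # Hypothetical corrector: trim to the mixture length and clip to [-1, 1].
    return [max(-1.0, min(1.0, v)) for v in estimate[: len(mixture)]]


pipeline = CascadedPipeline(compress, extract, reconstruct, correct)
out = pipeline([0.2, 0.4, -0.6, 0.8, 1.0], [0.5, 0.5])
print(len(out))  # 5, the same length as the mixture
```

Keeping each stage a plain callable makes it easy to swap one stage (for example, the corrector) without touching the others, which is one practical appeal of a cascaded design.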


Quick Start

Future Work

Based on the valuable comments on the Issues page, we plan to explore the following directions:

📝 Feel free to add more comments to the Issues page. They really help us build the next version of SoloSpeech!

Citations

If you find this work useful, please consider contributing to this repo and citing our work:

@misc{wang2025solospeechenhancingintelligibilityquality,
  title={SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline},
  author={Helin Wang and Jiarui Hai and Dongchao Yang and Chen Chen and Kai Li and Junyi Peng and Thomas Thebaud and Laureano Moro Velazquez and Jesus Villalba and Najim Dehak},
  year={2025},
  eprint={2505.19314},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2505.19314}
}

@inproceedings{wang2025soloaudio,
  title={SoloAudio: Target sound extraction with language-oriented audio diffusion transformer},
  author={Wang, Helin and Hai, Jiarui and Lu, Yen-Ju and Thakkar, Karan and Elhilali, Mounya and Dehak, Najim},
  booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2025},
  organization={IEEE}
}

License

All listening samples, source code, pretrained checkpoints, and the evaluation toolkit are licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
See the LICENSE file for details.

Acknowledgements

This implementation is based on SoloAudio, EzAudio, DPM-TSE, and stable-audio-tools. We appreciate their awesome work.

🌟 Like This Project?

If you find this repo helpful or interesting, consider dropping a ⭐ — it really helps and means a lot!
