WangHelin1997/SoloSpeech

🎸 SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline

📄 Paper  |  🎧 Audio Samples  |  🚀 Space Demo  |  💻 Colab Demo  |  🤗 Models

Introduction

🎯 SoloSpeech is a novel cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction processes. SoloSpeech achieves state-of-the-art intelligibility and quality in target speech extraction and speech separation tasks while demonstrating exceptional generalization on out-of-domain data.

solospeech-demo.mp4

Quick Start

Future work

Based on the valuable comments on the Issues page, we plan to explore the following directions:

  • Improve efficiency
  • Add reranking
  • Train on more realistic conditions
  • Train on vocal mixtures in music
  • Train on multiple languages

📝 Feel free to add more comments on the Issues page. They really help us build the next version of SoloSpeech!

Citations

If you find this work useful, please consider contributing to this repo and citing our work:

@misc{wang2025solospeechenhancingintelligibilityquality,
      title={SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline}, 
      author={Helin Wang and Jiarui Hai and Dongchao Yang and Chen Chen and Kai Li and Junyi Peng and Thomas Thebaud and Laureano Moro Velazquez and Jesus Villalba and Najim Dehak},
      year={2025},
      eprint={2505.19314},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2505.19314}, 
}
@inproceedings{wang2025soloaudio,
  title={SoloAudio: Target sound extraction with language-oriented audio diffusion transformer},
  author={Wang, Helin and Hai, Jiarui and Lu, Yen-Ju and Thakkar, Karan and Elhilali, Mounya and Dehak, Najim},
  booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2025},
  organization={IEEE}
}

License

All listening samples, source code, pretrained checkpoints, and the evaluation toolkit are licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
See the LICENSE file for details.

Acknowledgements

This implementation is based on SoloAudio, EzAudio, DPM-TSE, and stable-audio-tools. We appreciate their awesome work.

🌟 Like This Project?

If you find this repo helpful or interesting, consider dropping a ⭐ — it really helps and means a lot!
