feat: add step1 audio tts #121
Merged
feat:
support TTS mode:
text + ref audio waveform -> tokenizer -> text + audio token ids -> step1 LM -> audio token ids (wav_code) -> flow (CFM) -> mel -> vocoder (HiFT) -> waveform
src + ref audio waveform -> speech tokenizer -> audio token ids (wav_code) -> flow (CFM) -> mel -> vocoder (HiFT) -> waveform cloned in the ref audio voice
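The two pipelines above can be sketched as plain function composition. This is a minimal illustration, not the code in this PR: `Stages`, `tts`, `voice_clone`, and `toy_stages` are hypothetical names, the prompt ordering (ref text, ref audio tokens, target text) is an assumption, and so is passing the ref audio to the flow stage as conditioning.

```python
from dataclasses import dataclass
from typing import Callable, List

Waveform = List[float]
TokenIds = List[int]
Mel = List[List[float]]


@dataclass
class Stages:
    """Hypothetical stand-ins for the components named in this PR."""
    tokenize_text: Callable[[str], TokenIds]         # text -> text token ids
    tokenize_speech: Callable[[Waveform], TokenIds]  # waveform -> audio token ids
    lm_generate: Callable[[TokenIds], TokenIds]      # step1 LM -> wav_code
    flow: Callable[[TokenIds, Waveform], Mel]        # CFM: wav_code (+ ref) -> mel
    vocoder: Callable[[Mel], Waveform]               # HiFT: mel -> waveform


def tts(stages: Stages, text: str, ref_text: str, ref_audio: Waveform) -> Waveform:
    """text + ref audio -> token ids -> LM -> wav_code -> flow -> mel -> vocoder."""
    # Assumed prompt layout: reference text, reference audio tokens, target text.
    prompt = (
        stages.tokenize_text(ref_text)
        + stages.tokenize_speech(ref_audio)
        + stages.tokenize_text(text)
    )
    wav_code = stages.lm_generate(prompt)
    mel = stages.flow(wav_code, ref_audio)
    return stages.vocoder(mel)


def voice_clone(stages: Stages, src_audio: Waveform, ref_audio: Waveform) -> Waveform:
    """src audio -> speech tokenizer -> wav_code -> flow -> mel -> vocoder (no LM)."""
    wav_code = stages.tokenize_speech(src_audio)
    mel = stages.flow(wav_code, ref_audio)
    return stages.vocoder(mel)


def toy_stages() -> Stages:
    """Toy stages that only exercise the data flow; they produce no real audio."""
    return Stages(
        tokenize_text=lambda s: [ord(c) % 100 for c in s],
        tokenize_speech=lambda w: [round(abs(x) * 10) for x in w],
        lm_generate=lambda ids: [i + 1 for i in ids],
        flow=lambda code, ref: [[float(c)] for c in code],
        vocoder=lambda mel: [row[0] for row in mel],
    )
```

The point of the sketch is the difference between the modes: voice clone skips the LM and feeds the source audio's token ids directly to the flow stage.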
    python -m unittest test.modules.speech.tts.test_step.TestStepTTS.test_get_voices

    REF_AUDIO_PATH=./test/audio_files/asr_example_zh.wav \
    REF_TEXT="欢迎大家来体验达摩院推出的语音识别模型" \
    python -m unittest test.modules.speech.tts.test_step.TestStepTTS.test_set_voice

    python -m unittest test.modules.speech.tts.test_step.TestStepTTS.test_synthesize
    python -m unittest test.modules.speech.tts.test_step.TestStepTTS.test_synthesize_speak

    # ref audio
    TTS_STREAM_FACTOR=4 \
    REF_AUDIO_PATH=./test/audio_files/asr_example_zh.wav \
    REF_TEXT="欢迎大家来体验达摩院推出的语音识别模型" \
    TTS_TEXT="万物之始,大道至简,衍化至繁。君不见黄河之水天上来,奔流到海不复回。君不见高堂明镜悲白发,朝如青丝暮成雪。人生得意须尽欢,莫使金樽空对月。天生我材必有用,千金散尽还复来。" \
    python -m unittest test.modules.speech.tts.test_step.TestStepTTS.test_synthesize

    TTS_STREAM_FACTOR=4 \
    REF_AUDIO_PATH=./test/audio_files/asr_example_zh.wav \
    REF_TEXT="欢迎大家来体验达摩院推出的语音识别模型" \
    TTS_TEXT="万物之始,大道至简,衍化至繁。君不见黄河之水天上来,奔流到海不复回。君不见高堂明镜悲白发,朝如青丝暮成雪。人生得意须尽欢,莫使金樽空对月。天生我材必有用,千金散尽还复来。" \
    python -m unittest test.modules.speech.tts.test_step.TestStepTTS.test_synthesize_speak

    # ---- TTS_MODE: voice_clone ----
    # src audio + default ref audio
    SRC_AUDIO_PATH=./test/audio_files/asr_example_zh.wav \
    python -m unittest test.modules.speech.tts.test_step.TestStepTTS.test_synthesize

Colab notes:
Step-Audio TTS, from Step-Audio (speech decoder)
step1 LM 3B + flow (code from CosyVoice) + HiFT (code from CosyVoice)
speech tokenizer
a dual-codebook speech tokenizer framework, similar to ARCON (from the StepFun team);
the linguistic tokenizer uses the FunASR Paraformer (NAR) model;
the semantic tokenizer uses the CosyVoice speech tokenizer (from SenseVoice)
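A rough sketch of how two codebook streams might be merged into a single token sequence for the LM. It is an illustration under assumptions, not the tokenizer code used here: the function name `interleave_dual_codebook` is hypothetical, and the group sizes (2 linguistic tokens per 3 semantic tokens) and the id offset of 1024 are illustrative defaults that should be checked against the Step-Audio release.

```python
from typing import List


def interleave_dual_codebook(
    linguistic: List[int],
    semantic: List[int],
    n_ling: int = 2,         # assumed: linguistic tokens per group
    n_sem: int = 3,          # assumed: semantic tokens per group
    sem_offset: int = 1024,  # assumed: shifts semantic ids past the linguistic codebook
) -> List[int]:
    """Merge two token streams into one sequence, group by group."""
    merged: List[int] = []
    li = si = 0
    while li < len(linguistic) or si < len(semantic):
        # Take the next group of linguistic tokens unchanged.
        merged.extend(linguistic[li:li + n_ling])
        li += n_ling
        # Take the next group of semantic tokens, offset so ids do not collide.
        merged.extend(t + sem_offset for t in semantic[si:si + n_sem])
        si += n_sem
    return merged
```

Offsetting one codebook keeps the two id ranges disjoint, so a single LM vocabulary can hold both streams.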
step1 LM 3B, distilled from Step-Audio 130B
flow (CFM)
see:
HiFT vocoder
see: