feat: add audio step chat LM #122

weedge · 2025-02-21T11:56:09Z

feat:

add step-audio LM
add modal run step tts/voice inference

PS:

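Step-Audio is a component of Step-Omni. The paper describes the training process in sections 4 and 5, but Step-Omni itself has not been released (reason unknown; it may still be in training and tuning).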
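LM: aligns speech with text by training an existing text model to add speech understanding and generation (start from the pretrained 130B-parameter text model step1 LM -> train a 130B-parameter unified image/speech/text multimodal model (Step-Omni), which includes training the speech-text understanding model (Step-Audio) -> RLHF -> Chat model). The base LM is the 130B step1 LM, with the training data scaled up so the model learns more multimodal knowledge and adjusts its parameters. Per scaling laws it performs better than SOTA models with fewer parameters, but the hardware requirements are relatively high; inference can reuse existing Transformer attention optimizations. (On-device deployment suits a single integrated omni model, with distillation and quantization customized to the device chip, for example INT4, which suits on-device inference. Server-side deployment suits a disaggregated setup: the overall model weights are large, so the precision is chosen among FP32 (float), BF16 (bfloat16), FP16 (float16) and low-precision quantization such as INT4, INT8, and FP8 (float8) to reduce inference cost, together with deeper optimization, often kernel/operator tuning matched to the GPU's memory capacity. The text base model's inference deployment can be reused.)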
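To make the precision trade-off concrete, here is a rough, weights-only estimate for a 130B-parameter model. It is a back-of-the-envelope sketch: the bytes-per-parameter table is the nominal size of each format, and it ignores the KV cache, activations, and the scale/zero-point overhead that real quantization schemes add. The function name `weight_memory_gb` is made up for illustration.

```python
# Nominal bytes per parameter for the weight formats named above.
# Real quantized checkpoints are slightly larger (scales, zero points).
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "BF16": 2.0,
    "FP16": 2.0,
    "FP8": 1.0,
    "INT8": 1.0,
    "INT4": 0.5,
}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gb(130e9, dtype):.0f} GB")
```

Even at BF16 the weights alone exceed the memory of a single current GPU, which is why server-side deployment leans on multi-GPU serving and low-precision formats.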
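When the speech-text chat model outputs both text and speech in response to the prompt, it connects directly to the flow and hift modules of the TTS stack to generate audio (the LM inside the TTS is not needed).
When the model, given the prompt, outputs text only, it is decoupled from synthesis and connects to a standalone TTS.
The generated text contains special vocal/tone markers that need to be extracted. The downstream TTS must understand those markers; if it does not, it needs to be fine-tuned on suitable data.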
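A minimal sketch of the marker-extraction step described above. The marker syntax here (a lowercase tag in square brackets, such as `[laugh]`) is an assumption for illustration only; the actual markers emitted by the model may look different, and `split_markers` is a hypothetical helper, not part of Step-Audio or achatbot.

```python
import re

# ASSUMPTION: markers look like "[laugh]" or "[sigh]". Adjust the pattern
# to whatever the model really emits.
MARKER_RE = re.compile(r"\[([a-z_]+)\]")


def split_markers(text: str) -> tuple[str, list[str]]:
    """Return (plain_text, markers) with markers removed from the text."""
    markers = MARKER_RE.findall(text)
    plain = MARKER_RE.sub("", text)
    # Collapse the whitespace left behind by removed markers.
    plain = re.sub(r"\s+", " ", plain).strip()
    return plain, markers


if __name__ == "__main__":
    print(split_markers("[laugh] That is great. [sigh] Anyway."))
```

The plain text can go to a TTS that does not understand markers, while the marker list can drive a TTS that does.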

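After speech-text alignment, the model (LM) can in principle support the text/speech modes listed below, but the public code examples only cover A1-T2 and the original T1->T2 mode. The other capabilities are left to explore later when time allows.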

    - A1-T1: (speech)-to-(text) (asr)
    - A1-T2: (speech)-to-(text) (audio gen/chat to text)
    - T1-T2A2: (text)-to-(speech and text) ((text-llm)+tts) (text gen/chat to text/audio)
    - A1-T2A2: (speech)-to-(speech and text) (asr+(text-llm)+tts) (audio gen/chat to text/audio)
    - T1-A1: (text)-to-(speech) (tts)
    - T1-A2: (text)-to-(speech) (text gen/chat to audio)
    - T1-T2: (text)-to-(text) (text gen/chat to text), an existing capability of the text model
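The mode list above can be kept as a small lookup table, for example to decide which processors a bot needs. This is only an illustrative sketch: the `MODES` table and `public_modes` helper are hypothetical names, and the `in_public_examples` flag simply records the note above that only A1-T2 and T1-T2 have public code examples.

```python
# Each entry: mode name -> (input modality, output modalities, in_public_examples)
MODES = {
    "A1-T1":   ("speech", ("text",),          False),  # asr
    "A1-T2":   ("speech", ("text",),          True),   # audio chat to text
    "T1-T2A2": ("text",   ("text", "speech"), False),  # text chat to text/audio
    "A1-T2A2": ("speech", ("text", "speech"), False),  # audio chat to text/audio
    "T1-A1":   ("text",   ("speech",),        False),  # tts
    "T1-A2":   ("text",   ("speech",),        False),  # text chat to audio
    "T1-T2":   ("text",   ("text",),          True),   # text chat to text
}


def public_modes() -> list[str]:
    """Modes that have public code examples, per the note above."""
    return [name for name, (_, _, public) in MODES.items() if public]


def needs_vocoder(mode: str) -> bool:
    """True when the mode produces speech, so flow + hift must be loaded."""
    return "speech" in MODES[mode][1]


if __name__ == "__main__":
    print(public_modes())
```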

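The system engineering code for the Real-time Inference described in the paper has not been open-sourced, and the paper gives no RTF (real-time factor) comparison.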
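Since no RTF numbers are published, they have to be measured locally. RTF is the processing time divided by the duration of the audio produced, so a value below 1 means faster than real time. The sketch below assumes a hypothetical `synthesize` callable that returns `(num_samples, sample_rate)`; it stands in for whatever TTS or voice inference entry point is being timed.

```python
import time
from typing import Callable, Tuple


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(synthesize: Callable[[str], Tuple[int, int]], text: str) -> float:
    """Time one call of a hypothetical synthesize(text) -> (num_samples, sample_rate)."""
    start = time.perf_counter()
    num_samples, sample_rate = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, num_samples / sample_rate)
```

For streaming inference, time-to-first-chunk is worth recording alongside RTF, since a low RTF alone does not guarantee low perceived latency.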

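The system engineering code for the function call (tool call) feature described in the paper has not been open-sourced either. (It is similar to how daily_describe_vision_tools_bot is implemented: two side branches, one running the tool-call chain and the other running speech synthesis, while the main path runs the model to generate the tool text and the spoken text.)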
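The two-branch layout described above can be sketched with `asyncio`. This is not the paper's implementation (which is not public) nor the code of daily_describe_vision_tools_bot; `run_tool`, `synthesize`, and `handle_turn` are hypothetical stand-ins, and the model output is assumed to already be split into a tool call and a spoken sentence.

```python
import asyncio


async def run_tool(call: str) -> str:
    """Side branch 1 (hypothetical): execute the tool-call chain."""
    await asyncio.sleep(0.01)  # stands in for network or tool latency
    return f"result of {call}"


async def synthesize(text: str) -> bytes:
    """Side branch 2 (hypothetical): speech synthesis for the spoken text."""
    await asyncio.sleep(0.01)  # stands in for TTS latency
    return text.encode("utf-8")


async def handle_turn(model_output: dict) -> dict:
    """Main path: fan out to both side branches and wait for both."""
    tool_task = asyncio.create_task(run_tool(model_output["tool_call"]))
    tts_task = asyncio.create_task(synthesize(model_output["speech_text"]))
    tool_result, audio = await asyncio.gather(tool_task, tts_task)
    return {"tool_result": tool_result, "audio": audio}


if __name__ == "__main__":
    turn = {"tool_call": "get_weather", "speech_text": "Let me check."}
    print(asyncio.run(handle_turn(turn)))
```

Running the branches concurrently lets the user hear a spoken acknowledgement while the tool call is still in flight.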

⭐️ Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction | paper code
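The paper is only a technical report with no novel contribution (the overall architecture is similar to glm-4-voice); it trains an existing text base model on added multimodal data and focuses on engineering delivery.
- Builds on the earlier CosyVoice work, swapping in the pretrained 130B step1 LLM as the base model, adding multimodal understanding, and outputting audio + text tokens.
- Tokenizer: a dual codebook speech tokenizer framework. The idea borrows from ARCON (from stepfun team);
  - the linguistic tokenizer uses the FunASR Paraformer (NAR) model;
  - the semantic tokenizer uses the CosyVoice speech tokenizer (from SenseVoice);
- Multimodal training and inference after scaling up the model weights (tuning the ratio of multimodal tokens), plus engineered real-time inference.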
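A dual-codebook tokenizer yields two token streams that must be merged into one sequence for the LM. The sketch below interleaves them in fixed-size groups. The default 2:3 grouping (two linguistic tokens, then three semantic tokens) reflects my reading of the technical report and should be treated as an assumption; the `("L", id)` / `("S", id)` tagging and the `interleave` helper are purely illustrative.

```python
def interleave(linguistic, semantic, n_ling=2, n_sem=3):
    """Merge two token streams in repeating groups of n_ling and n_sem tokens.

    ASSUMPTION: the 2:3 default mirrors the two tokenizers' frame rates;
    verify against the released tokenizer code before relying on it.
    """
    merged = []
    i = j = 0
    while i < len(linguistic) or j < len(semantic):
        merged.extend(("L", tok) for tok in linguistic[i:i + n_ling])
        merged.extend(("S", tok) for tok in semantic[j:j + n_sem])
        i += n_ling
        j += n_sem
    return merged


if __name__ == "__main__":
    print(interleave([1, 2, 3, 4], [10, 11, 12, 13, 14, 15]))
```

In a real pipeline the two codebooks would be mapped to disjoint ID ranges in the LM vocabulary instead of tagged tuples.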
For the TTS part, see: feat: add step1 audio tts #121
achatbot + glm-4-voice: feat: add daily_asr_glm_voice_bot daily_glm_voice_bot and deploy modal #95
CosyVoice walkthrough: https://weedge.github.io/post/multimoding/voices/cosyvoice/

By comparison, the models the paper compares against have similar architectures:

⭐️ 2025.1 Minmo: A multimodal large language model for seamless voice interaction. (supports full-duplex interactions)
- Architecture: Voice Encoder (SenseVoice-large) + Input Projector (two-layer Transformer + downsampling CNN) + Backbone LLM (Qwen2.5-7B-instruct) + Output Projector (single-layer NN.Linear) + Voice Token LM (CosyVoice 2) + Token2wav Synthesizer (Flow + HiFT from CosyVoice 2) + Full Duplex Predictor (single-layer Transformer and a linear softmax output layer). This borrows from moshi (whereas Step-Audio is a system-engineering implementation).

⭐️ 2025.1 Lucy: Linguistic understanding and control yielding early stage of her. (from VITA)| paper code
- Audio agent: Function-Calling + Linguistic Emotion Control + Acoustic Emotion Control
- Architecture: Audio Encoder (whisper) + Adapter (24 Transformer blocks + 4x downsampling CNN) + Backbone LLM (Qwen2.5-7B-instruct) + Codec (SNAC: encodes speech into discrete tokens with 7 codebooks; decoding follows the parallel modeling paradigm of "Simple and Controllable Music Generation" to decode text and speech tokens simultaneously, with eight language-model heads predicting one text token and seven audio tokens at each decoding step)

deploy:

lm:
- vllm
- sglang
- TensorRT-LLM
asr, tts(flow, hift):
- https://developer.nvidia.com/zh-cn/blog/deploy-speech-ai-model-on-gpu/
- https://github.com/NVIDIA/NeMo

Signed-off-by: weedge <[email protected]>

feat: add step-audio voice LM

859ee9c

Signed-off-by: weedge <[email protected]>

weedge added the voice label Feb 21, 2025

feat: add audio step lm

6ef35f0

Signed-off-by: weedge <[email protected]>

weedge changed the title ~~feat: add step-audio voice LM~~ feat: add audio step chat LM Feb 21, 2025

Merge branch 'main' into feat/voice

a456295

weedge added AR Flow VQ A1-T2A2 (speech)-to-(text and speech) labels Feb 22, 2025

weedge added 6 commits February 22, 2025 18:14

change step tts

e104750

Signed-off-by: weedge <[email protected]>

feat: add modal run step tts/voice

0ede884

Signed-off-by: weedge <[email protected]>

feat: add tts_inference tts_inference tts_inference_stream run on modal

36a77f0

Signed-off-by: weedge <[email protected]>

fix

d8ea61f

Signed-off-by: weedge <[email protected]>

Merge branch 'main' into feat/voice

61aff99

feat: add daily_asr_step_voice_bot daily_step_voice_bot step_voice_pr…

dfe382f

…ocessor Signed-off-by: weedge <[email protected]>

weedge merged commit f473c7d into main Feb 24, 2025

weedge removed the A1-T2A2 (speech)-to-(text and speech) label Feb 24, 2025

weedge mentioned this pull request Feb 24, 2025

[achatbot] add audio step chat LM (T1-T2, A1-T2) stepfun-ai/Step-Audio#97

Open

weedge mentioned this pull request Sep 13, 2025

feat: add step_audio2 bots(e.g.: AQAA) with function call #190

Merged
