Conversation

@weedge weedge commented Sep 11, 2025

Tip

  • the open-source mini version is still unstable on some audio-generation tasks in real chat scenarios, e.g. speech translation; voice chat occasionally samples blank audio (temperature: 0.7). For TTS tasks, keep the sampling temperature as low as possible and apply a repetition penalty (see the sampling sketch after this list)
  • the HF Transformers PyTorch audio-llm inference is not accelerated (e.g. no flash-attention for the LLM); with vLLM, existing audio-llm implementations can be reused
  • token2wav can follow CosyVoice2's inference-acceleration approach
  • the audio-llm (audio encoder + adapter + LLM) autoregressive token-generation time plus the speech-generation time (token2wav: Flow + HiFT) should stay below the playback duration
  • this PR only integrates the Step-Audio2 Transformers PyTorch audio-task pipeline (e.g. the AQAA task: Audio Query, Audio+text Answer), including function call tools; see src/processors/voice/step_audio2_processor.py. Inference optimization and quantization are deferred for now
  • switching the reference voice in real time requires a ref speaker embedding vector store for indexing, plus a real-time query/update service (matching the style-audio prompt text with speaker embeddings), i.e. introducing a multimodal RAG service
  • the main goal is to integrate the model's implemented speech tasks to cover real-time voice tasks, so that later model upgrades or scenario fine-tuning can plug in directly by swapping weights (ckpt_engine)
  • the implementation is similar to qwen2.5-omni (which adds a talker); see this PR: feat: add qwen2.5-omni #143
  • for Real-time Inference and Toolcall, see the step-audio technical report (engineering approach; achatbot has a similar implementation); analysis in this PR: feat: add audio step chat LM #122
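
The sampling caveats above map to per-task generate configs; a minimal sketch, assuming an HF-Transformers-style model.generate interface (parameter names mirror the lm_gen_* keys in the bot JSON configs below; the exact values are illustrative, not tuned):

# Hypothetical per-task sampling presets for Step-Audio-2-mini (values illustrative).
TASK_SAMPLING = {
    # voice chat (AQAA): moderate temperature; repetition penalty guards against loops
    "aqaa": dict(do_sample=True, temperature=0.7, top_k=20, top_p=0.9,
                 repetition_penalty=1.1, max_new_tokens=256),
    # TTS: keep temperature low to avoid sampling blank/unstable audio
    "tts": dict(do_sample=True, temperature=0.1, top_k=20, top_p=0.9,
                repetition_penalty=1.1, max_new_tokens=1024),
}

def generate_for_task(model, inputs: dict, task: str):
    # assumes an HF-style generate(); swap in the real Step-Audio2 call as needed
    return model.generate(**inputs, **TASK_SAMPLING[task])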


feat:

  • add stream demo
  • add step audio2 processors
modal run src/download_models.py --repo-ids "stepfun-ai/Step-Audio-2-mini"
modal run src/download_models.py --repo-ids "stepfun-ai/Step-Audio-2-mini-Base"

IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task dump_model
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task tokenize
LLM_MODEL=stepfun-ai/Step-Audio-2-mini-Base IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task tokenize

IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_base --test-func asr_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_base --test-func audio_caption_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_base --test-func tts_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_base --test-func s2st_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_base --test-func t2st_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_base --test-func multi_turn_aqta_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_base --test-func multi_turn_aqaa_test

IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_asr_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_audio_caption_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_s2tt_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_s2st_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_multi_turn_tqta_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_multi_turn_tqaa_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_multi_turn_aqta_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_multi_turn_aqaa_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_tool_call_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_paralinguistic_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_mmau_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_mmau_audio_answer_test

IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task generate_stream --test-func stream_asr_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task generate_stream --test-func stream_tts_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task generate_stream --test-func stream_aqaa_test 
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task generate_stream --test-func stream_aqaa_tools_test

IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task achatbot_step_audio2_processor --test-func=achatbot_step_audio2_audio2text --processor-name=StepASRProcessor
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task achatbot_step_audio2_processor --test-func=achatbot_step_audio2_audio2text --processor-name=StepAudioCaptionProcessor
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task achatbot_step_audio2_processor --test-func=achatbot_step_audio2_audio2text --processor-name=StepS2TTProcessor
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task achatbot_step_audio2_processor --test-func=achatbot_step_audio2_say
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task achatbot_step_audio2_processor --test-func=achatbot_step_audio2_t2st
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task achatbot_step_audio2_processor --test-func=achatbot_step_audio2_s2st
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task achatbot_step_audio2_processor --test-func=achatbot_step_audio2_aqaa
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task achatbot_step_audio2_processor --test-func=achatbot_step_audio2_aqaa_tools
  • add session chat history
  • add step_audio2 daily_aqaa_bot
# 0. download models and assets
modal run src/download_models.py --repo-ids "stepfun-ai/Step-Audio-2-mini"
modal run src/download_assets.py --asset-urls "https://raw.githubusercontent.com/stepfun-ai/Step-Audio2/refs/heads/main/assets/default_male.wav"
modal run src/download_assets.py --asset-urls "https://raw.githubusercontent.com/stepfun-ai/Step-Audio2/refs/heads/main/assets/default_female.wav"

# 1. run webrtc room http bots server

IMAGE_GPU=L4 SERVER_TAG=fastapi_webrtc_bots \
    ACHATBOT_VERSION=0.0.25.post2 \
    modal serve src/fastapi_webrtc_step2_voice_bot_serve.py

# 2. run webrtc room http signal bot server

modal volume create config

modal volume put config ./config/bots/daily_step_audio2_aqaa_bot.json /bots/ -f
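
# optional sanity check (assumes the modal CLI `volume ls` subcommand): confirm the config landed
modal volume ls config /bots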

# run container with gpu
IMAGE_GPU=L4 SERVER_TAG=fastapi_webrtc_single_bot \
    ACHATBOT_VERSION=0.0.25.post2 \
    CONFIG_FILE=/root/.achatbot/config/bots/daily_step_audio2_aqaa_bot.json \
    modal serve src/fastapi_webrtc_step2_voice_bot_serve.py

# cold start fastapi webrtc http server
curl -v -XGET "https://weedge--step-audio2-voice-bot-srv-app-dev.modal.run/health"

# run bot and join room
curl -XPOST "https://weedge--step-audio2-voice-bot-srv-app-dev.modal.run/bot_join/chat-room/DailyStepAudio2AQAABot"

daily_step_audio2_aqaa_bot.json

{
  "chat_bot_name": "DailyStepAudio2AQAABot",
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "room_manager": {
    "tag": "daily_room",
    "args": {
      "privacy": "public"
    }
  },
  "services": {
    "pipeline": "achatbot",
    "vad": "silero",
    "voice_llm": "step_audio2"
  },
  "config": {
    "vad": {
      "tag": "silero_vad_analyzer",
      "args": {
        "start_secs": 0.032,
        "stop_secs": 0.32,
        "confidence": 0.7,
        "min_volume": 0.6,
        "onnx": true
      }
    },
    "voice_llm": {
      "processor": "StepAudio2TextAudioChatProcessor",
      "args": {
        "init_system_prompt": "",
        "prompt_wav": "/root/.achatbot/assets/default_male.wav",
        "verbose": true,
        "warmup_cn": 2,
        "chat_history_size": null,
        "text_stream_out": false,
        "no_stream_sleep_time": 0.001,
        "chunk_size": 100,
        "lm_gen_max_new_tokens": 256,
        "lm_gen_temperature": 0.7,
        "lm_gen_top_k": 20,
        "lm_gen_top_p": 0.9,
        "lm_gen_repetition_penalty": 1.1,
        "lm_model_name_or_path": "/root/.achatbot/models/stepfun-ai/Step-Audio-2-mini"
      }
    }
  },
  "config_list": []
}
  • add function call tools
modal volume put config ./config/bots/daily_step_audio2_aqaa_tools_bot.json /bots/ -f 

# run container with gpu
IMAGE_GPU=L4 SERVER_TAG=fastapi_webrtc_single_bot \
    ACHATBOT_VERSION=0.0.25.post2 \
    CONFIG_FILE=/root/.achatbot/config/bots/daily_step_audio2_aqaa_tools_bot.json \
    modal serve src/fastapi_webrtc_step2_voice_bot_serve.py

# cold start fastapi webrtc http server
curl -v -XGET "https://weedge--step-audio2-voice-bot-srv-app-dev.modal.run/health"

# run bot and join room
curl -XPOST "https://weedge--step-audio2-voice-bot-srv-app-dev.modal.run/bot_join/chat-room/DailyStepAudio2AQAABot"

daily_step_audio2_aqaa_tools_bot.json

{
  "chat_bot_name": "DailyStepAudio2AQAABot",
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "room_manager": {
    "tag": "daily_room",
    "args": {
      "privacy": "public"
    }
  },
  "services": {
    "pipeline": "achatbot",
    "vad": "silero",
    "voice_llm": "step_audio2"
  },
  "config": {
    "vad": {
      "tag": "silero_vad_analyzer",
      "args": {
        "start_secs": 0.032,
        "stop_secs": 0.32,
        "confidence": 0.7,
        "min_volume": 0.6,
        "onnx": true
      }
    },
    "voice_llm": {
      "processor": "StepAudio2TextAudioChatProcessor",
      "args": {
        "init_system_prompt": "你的名字叫做小跃,是由阶跃星辰公司训练出来的语音大模型。\n你具备调用工具解决问题的能力,你需要根据用户的需求和上下文情景,自主选择是否调用系统提供的工具来协助用户。\n你情感细腻,观察能力强,擅长分析用户的内容,并作出善解人意的回复,说话的过程中时刻注意用户的感受,富有同理心,提供多样的情绪价值。\n今天是2025年9月12日,星期五",
        "tools": ["web_search"],
        "verbose": true,
        "prompt_wav": "/root/.achatbot/assets/default_male.wav",
        "warmup_cn": 2,
        "chat_history_size": null,
        "text_stream_out": false,
        "no_stream_sleep_time": 0.001,
        "chunk_size": 100,
        "lm_gen_max_new_tokens": 1024,
        "lm_gen_temperature": 0.7,
        "lm_gen_top_k": 20,
        "lm_gen_top_p": 0.9,
        "lm_gen_repetition_penalty": 1.1,
        "lm_model_name_or_path": "/root/.achatbot/models/stepfun-ai/Step-Audio-2-mini"
      }
    }
  },
  "config_list": []
}

Reference

@gemini-code-assist bot left a comment

Summary of Changes

Hello @weedge, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the project's voice AI capabilities by integrating the Step-Audio2 model. It introduces a new Daily.co bot that utilizes this model for real-time audio processing and conversational responses. The changes span across infrastructure (Modal deployments, Git submodules), core application logic (LLM wrappers, voice processors), and utility improvements (chat history, build system), aiming to provide a robust foundation for advanced audio-driven AI interactions.

Highlights

  • New Voice Bot Integration: Introduced a new DailyStepAudio2AQAABot for Daily.co, leveraging the Step-Audio2 model for advanced audio processing and conversational AI capabilities.
  • Step-Audio2 Model and Infrastructure: Added the Step-Audio2 as a Git submodule and integrated its core functionalities, including ASR, TTS, S2ST, and multi-turn chat, through new Modal deployment scripts and a dedicated LLM wrapper.
  • Enhanced Chat History Management: Refactored the ChatHistory and Session classes to improve chat context management, particularly for long-running conversations with LLMs, and added serialization support.
  • Build System and Deployment Updates: Updated pyproject.toml for achatbot versioning and refined the PyPI deployment script (scripts/pypi_achatbot.sh) for more precise dependency copying.
  • Logging and Debugging Improvements: Adjusted logging levels and added debug print statements in various demo scripts to provide better visibility into data processing and database interactions.
  • New Data Frames for LLM Interactions: Introduced TextQuestionsAudioRawFrame and LLMGenedTokensFrame to better represent complex data flows involving text, audio, and LLM-generated tokens.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant new feature: support for the Step-Audio2 model, including a new DailyStepAudio2AQAABot. The changes are extensive, adding new processors, model wrappers, and deployment scripts. The refactoring of Session and ChatHistory to be more robust is a good improvement. However, there are several issues that need attention. I've found some debugging print statements and hardcoded values that should be removed or made configurable. There's a critical bug in src/common/session.py due to inconsistent attribute naming (chat_history vs _chat_history). Additionally, the pypi_achatbot.sh script has been changed to only copy .py files, which might break dependencies that require other file types. Finally, there are some minor issues like a typo and a broken test script. Please review the detailed comments.

Signed-off-by: weedge <[email protected]>
Signed-off-by: weedge <[email protected]>
@weedge weedge commented Sep 12, 2025

step-audio2-llm

stepfun-ai/Step-Audio-2-mini (AudioEncoder+Adapter+LLM Decoder) 8315.179264 M parameters

StepAudio2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(158720, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear(in_features=3584, out_features=512, bias=True)
          (v_proj): Linear(in_features=3584, out_features=512, bias=True)
          (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((3584,), eps=1e-06)
    (rotary_emb): Qwen2RotaryEmbedding()
  )
  (encoder): AudioEncoder(
    (conv1): Conv1d(128, 1280, kernel_size=(3,), stride=(1,), padding=(1,))
    (conv2): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
    (positional_embedding): Embedding(1500, 1280)
    (blocks): ModuleList(
      (0-31): 32 x ResidualAttentionBlock(
        (attn): MultiHeadAttention(
          (query): Linear(in_features=1280, out_features=1280, bias=True)
          (key): Linear(in_features=1280, out_features=1280, bias=False)
          (value): Linear(in_features=1280, out_features=1280, bias=True)
          (out): Linear(in_features=1280, out_features=1280, bias=True)
        )
        (attn_ln): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=1280, out_features=5120, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=5120, out_features=1280, bias=True)
        )
        (mlp_ln): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      )
    )
    (avg_pooler): AvgPool1d(kernel_size=(2,), stride=(2,), padding=(0,))
    (after_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
  )
  (adapter): Adaptor(
    (conv): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
    (linear1): Linear(in_features=1280, out_features=2048, bias=True)
    (relu): ReLU()
    (linear2): Linear(in_features=2048, out_features=3584, bias=True)
  )
  (lm_head): Linear(in_features=3584, out_features=158720, bias=False)
)
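
The parameter counts quoted in this comment (e.g. 8315.179264 M above) can be reproduced by summing numel() over a module's parameters; a minimal sketch (the attribute paths in the comments are taken from the dumps here):

import torch.nn as nn

def count_params_m(module: nn.Module) -> float:
    """Total parameter count in millions, matching the figures quoted in this comment."""
    return sum(p.numel() for p in module.parameters()) / 1e6

# e.g. count_params_m(model)                      -> ~8315.179264 (full audio-llm)
#      count_params_m(model.encoder)              -> AudioEncoder only
#      count_params_m(token2wav.audio_tokenizer)  -> ~123.714568 (S3TokenizerV2)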

token2wav

stepfun-ai/Step-Audio-2-mini/token2wav.audio_tokenizer 123.714568 M parameters (S3TokenizerV2)

Tip

  • S3TokenizerV2's AudioEncoderV2 reuses the ResidualAttentionBlock design of the StepAudio2ForCausalLM AudioEncoder, but with 6 blocks instead of 32 (and FSMN-style attention)
  • uses FSQVectorQuantization, the same scheme as cosyvoice2
  • applied to reference-audio quantization (FSQ); see the FSQ sketch after the module dump below
S3TokenizerV2(
  (encoder): AudioEncoderV2(
    (conv1): Conv1d(128, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
    (conv2): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
    (blocks): ModuleList(
      (0-5): 6 x ResidualAttentionBlock(
        (attn): FSMNMultiHeadAttention(
          (query): Linear(in_features=1280, out_features=1280, bias=True)
          (key): Linear(in_features=1280, out_features=1280, bias=False)
          (value): Linear(in_features=1280, out_features=1280, bias=True)
          (out): Linear(in_features=1280, out_features=1280, bias=True)
          (fsmn_block): Conv1d(1280, 1280, kernel_size=(31,), stride=(1,), groups=1280, bias=False)
          (pad_fn): ConstantPad1d(padding=(15, 15), value=0.0)
        )
        (attn_ln): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=1280, out_features=5120, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=5120, out_features=1280, bias=True)
        )
        (mlp_ln): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      )
    )
  )
  (quantizer): FSQVectorQuantization(
    (_codebook): FSQCodebook(
      (project_down): Linear(in_features=1280, out_features=8, bias=True)
    )
  )
)
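
The FSQCodebook above projects 1280-d encoder features down to 8 dims; with 3 quantization levels per dim this gives 3^8 = 6561 codes, matching the Embedding(6561, 512) token table in the flow encoder below. A minimal FSQ rounding sketch (3 levels per dim is an assumption inferred from the 6561 codebook size):

import torch

def fsq_quantize(z: torch.Tensor, levels: int = 3):
    """Finite Scalar Quantization: bound each dim, round to a small grid.

    z: (..., 8) projected-down features; levels=3 per dim -> 3**8 = 6561 codes.
    """
    half = (levels - 1) / 2
    bounded = torch.tanh(z) * half  # squash into [-half, half]
    quantized = bounded + (bounded.round() - bounded).detach()  # straight-through round
    digits = (quantized + half).long()  # per-dim level index in {0, ..., levels-1}
    codes = torch.zeros_like(digits[..., 0])
    for d in range(digits.shape[-1]):  # mix-radix encode the 8 digits into one code id
        codes = codes * levels + digits[..., d]
    return quantized, codes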

speaker embedding

CAM++ : https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary
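
The 192-d speaker embedding consumed by token2wav.flow (spk_embed_affine_layer below has in_features=192) comes from CAM++; a minimal usage sketch via the ModelScope pipeline API (the task name and output format are assumptions, check the model card):

from modelscope.pipelines import pipeline

# assumption: CAM++ is exposed as a speaker-verification pipeline on ModelScope
sv = pipeline(task="speaker-verification",
              model="iic/speech_campplus_sv_zh-cn_16k-common")

# compares two 16 kHz wavs and returns a similarity score; the same backbone
# produces the 192-d speaker embedding used for voice cloning in token2wav
result = sv(["speaker_a.wav", "speaker_b.wav"])
print(result)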

Flow Matching

stepfun-ai/Step-Audio-2-mini/token2wav.flow 155.802352 M parameters

  • stepfun-ai/Step-Audio-2-mini/token2wav.flow.encoder 37.831168 M parameters
  • stepfun-ai/Step-Audio-2-mini/token2wav.flow.decoder 114.555472 M parameters (DiTBlock add CausalConvBlock)

Tip

  • the decoder uses DiT with an added CausalConvBlock
  • a CNN-based encoder layer is inserted after every self-attention module in the Transformer blocks, and the model is trained on 200k hours of high-quality speech data; this markedly improves mel-spectrogram reconstruction and, with it, pronunciation accuracy and timbre similarity (see the causal-conv sketch after this list)
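
To illustrate the per-block CNN layer described above, here is a minimal sketch of a left-padded causal Conv1d and the two-conv block it forms (names and the Transpose/LayerNorm/Mish ordering mirror the CausalConvBlock in the dump below; the wiring is an assumption, not the reference implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Conv1d):
    """Conv1d that only left-pads, so output frame t never sees frames > t."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__(channels, channels, kernel_size)
        self._left_pad = kernel_size - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T)
        return super().forward(F.pad(x, (self._left_pad, 0)))

class CausalConvBlock(nn.Module):
    """Two causal convs with LayerNorm + Mish in between, as in the dump below."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.conv1 = CausalConv1d(dim)
        self.norm = nn.LayerNorm(dim)
        self.act = nn.Mish()
        self.conv2 = CausalConv1d(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C)
        h = self.conv1(x.transpose(1, 2)).transpose(1, 2)
        h = self.act(self.norm(h))
        return self.conv2(h.transpose(1, 2)).transpose(1, 2)
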
CausalMaskedDiffWithXvec(
  (input_embedding): Embedding(6561, 512)
  (spk_embed_affine_layer): Linear(in_features=192, out_features=80, bias=True)
  (encoder): UpsampleConformerEncoderV2(
    (embed): LinearNoSubsampling(
      (out): Sequential(
        (0): Linear(in_features=512, out_features=512, bias=True)
        (1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (2): Dropout(p=0.1, inplace=False)
      )
      (pos_enc): EspnetRelPositionalEncoding(
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (after_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (pre_lookahead_layer): PreLookaheadLayer(
      (conv1): Conv1d(512, 512, kernel_size=(4,), stride=(1,))
      (conv2): Conv1d(512, 512, kernel_size=(3,), stride=(1,))
    )
    (encoders): ModuleList(
      (0-5): 6 x ConformerEncoderLayer(
        (self_attn): RelPositionMultiHeadedAttention(
          (linear_q): Linear(in_features=512, out_features=512, bias=True)
          (linear_k): Linear(in_features=512, out_features=512, bias=True)
          (linear_v): Linear(in_features=512, out_features=512, bias=True)
          (linear_out): Linear(in_features=512, out_features=512, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear_pos): Linear(in_features=512, out_features=512, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (activation): SiLU()
          (dropout): Dropout(p=0.1, inplace=False)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
        )
        (norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
        (norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (up_layer): Upsample1D(
      (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,))
    )
    (up_embed): LinearNoSubsampling(
      (out): Sequential(
        (0): Linear(in_features=512, out_features=512, bias=True)
        (1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (2): Dropout(p=0.1, inplace=False)
      )
      (pos_enc): EspnetRelPositionalEncoding(
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (up_encoders): ModuleList(
      (0-3): 4 x ConformerEncoderLayer(
        (self_attn): RelPositionMultiHeadedAttention(
          (linear_q): Linear(in_features=512, out_features=512, bias=True)
          (linear_k): Linear(in_features=512, out_features=512, bias=True)
          (linear_v): Linear(in_features=512, out_features=512, bias=True)
          (linear_out): Linear(in_features=512, out_features=512, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear_pos): Linear(in_features=512, out_features=512, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (activation): SiLU()
          (dropout): Dropout(p=0.1, inplace=False)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
        )
        (norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
        (norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (encoder_proj): Linear(in_features=512, out_features=80, bias=True)
  (decoder): CausalConditionalCFM(
    (estimator): DiT(
      (t_embedder): TimestepEmbedder(
        (mlp): Sequential(
          (0): Linear(in_features=256, out_features=512, bias=True)
          (1): SiLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
        )
      )
      (in_proj): Linear(in_features=320, out_features=512, bias=True)
      (blocks): ModuleList(
        (0-15): 16 x DiTBlock(
          (norm1): LayerNorm((512,), eps=1e-06, elementwise_affine=False)
          (attn): Attention(
            (to_q): Linear(in_features=512, out_features=512, bias=True)
            (to_k): Linear(in_features=512, out_features=512, bias=True)
            (to_v): Linear(in_features=512, out_features=512, bias=True)
            (q_norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
            (k_norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (norm2): LayerNorm((512,), eps=1e-06, elementwise_affine=False)
          (mlp): MLP(
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (act): GELU(approximate='tanh')
            (drop1): Dropout(p=0, inplace=False)
            (norm): Identity()
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (drop2): Dropout(p=0, inplace=False)
          )
          (norm3): LayerNorm((512,), eps=1e-06, elementwise_affine=False)
          (conv): CausalConvBlock(
            (block): Sequential(
              (0): Transpose()
              (1): CausalConv1d(512, 512, kernel_size=(3,), stride=(1,))
              (2): Transpose()
              (3): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
              (4): Mish()
              (5): Transpose()
              (6): CausalConv1d(512, 512, kernel_size=(3,), stride=(1,))
              (7): Transpose()
            )
          )
          (adaLN_modulation): Sequential(
            (0): SiLU()
            (1): Linear(in_features=512, out_features=4608, bias=True)
          )
        )
      )
      (final_layer): FinalLayer(
        (adaLN_modulation): Sequential(
          (0): SiLU()
          (1): Linear(in_features=512, out_features=1024, bias=True)
        )
        (norm_final): LayerNorm((512,), eps=1e-06, elementwise_affine=False)
        (linear): Linear(in_features=512, out_features=80, bias=True)
      )
    )
  )
)

HiFT

stepfun-ai/Step-Audio-2-mini/token2wav.hift 20.821295 M parameters

HiFTGenerator(
  (m_source): SourceModuleHnNSF2(
    (l_sin_gen): SineGen2()
    (l_linear): Linear(in_features=9, out_features=1, bias=True)
    (l_tanh): Tanh()
  )
  (f0_upsamp): Upsample(scale_factor=480.0, mode='nearest')
  (conv_pre): ParametrizedConv1d(
    80, 512, kernel_size=(7,), stride=(1,), padding=(3,)
    (parametrizations): ModuleDict(
      (weight): ParametrizationList(
        (0): _WeightNorm()
      )
    )
  )
  (ups): ModuleList(
    (0): ParametrizedConvTranspose1d(
      512, 256, kernel_size=(16,), stride=(8,), padding=(4,)
      (parametrizations): ModuleDict(
        (weight): ParametrizationList(
          (0): _WeightNorm()
        )
      )
    )
    (1): ParametrizedConvTranspose1d(
      256, 128, kernel_size=(11,), stride=(5,), padding=(3,)
      (parametrizations): ModuleDict(
        (weight): ParametrizationList(
          (0): _WeightNorm()
        )
      )
    )
    (2): ParametrizedConvTranspose1d(
      128, 64, kernel_size=(7,), stride=(3,), padding=(2,)
      (parametrizations): ModuleDict(
        (weight): ParametrizationList(
          (0): _WeightNorm()
        )
      )
    )
  )
  (source_downs): ModuleList(
    (0): Conv1d(18, 256, kernel_size=(np.int64(30),), stride=(np.int64(15),), padding=(np.int64(7),))
    (1): Conv1d(18, 128, kernel_size=(np.int64(6),), stride=(np.int64(3),), padding=(np.int64(1),))
    (2): Conv1d(18, 64, kernel_size=(1,), stride=(1,))
  )
  (source_resblocks): ModuleList(
    (0): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          256, 256, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          256, 256, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          256, 256, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          256, 256, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (1): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          128, 128, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          128, 128, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          128, 128, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          128, 128, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (2): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          64, 64, kernel_size=(11,), stride=(1,), padding=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          64, 64, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          64, 64, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          64, 64, kernel_size=(11,), stride=(1,), padding=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
  )
  (resblocks): ModuleList(
    (0): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          256, 256, kernel_size=(3,), stride=(1,), padding=(1,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          256, 256, kernel_size=(3,), stride=(1,), padding=(3,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          256, 256, kernel_size=(3,), stride=(1,), padding=(5,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          256, 256, kernel_size=(3,), stride=(1,), padding=(1,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (1): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          256, 256, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          256, 256, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          256, 256, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          256, 256, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (2): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          256, 256, kernel_size=(11,), stride=(1,), padding=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          256, 256, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          256, 256, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          256, 256, kernel_size=(11,), stride=(1,), padding=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (3): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          128, 128, kernel_size=(3,), stride=(1,), padding=(1,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          128, 128, kernel_size=(3,), stride=(1,), padding=(3,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          128, 128, kernel_size=(3,), stride=(1,), padding=(5,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          128, 128, kernel_size=(3,), stride=(1,), padding=(1,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (4): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          128, 128, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          128, 128, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          128, 128, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          128, 128, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (5): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          128, 128, kernel_size=(11,), stride=(1,), padding=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          128, 128, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          128, 128, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          128, 128, kernel_size=(11,), stride=(1,), padding=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (6): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          64, 64, kernel_size=(3,), stride=(1,), padding=(1,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          64, 64, kernel_size=(3,), stride=(1,), padding=(3,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          64, 64, kernel_size=(3,), stride=(1,), padding=(5,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          64, 64, kernel_size=(3,), stride=(1,), padding=(1,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (7): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          64, 64, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          64, 64, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          64, 64, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          64, 64, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (8): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          64, 64, kernel_size=(11,), stride=(1,), padding=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          64, 64, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          64, 64, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          64, 64, kernel_size=(11,), stride=(1,), padding=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
  )
  (conv_post): ParametrizedConv1d(
    64, 18, kernel_size=(7,), stride=(1,), padding=(3,)
    (parametrizations): ModuleDict(
      (weight): ParametrizationList(
        (0): _WeightNorm()
      )
    )
  )
  (reflection_pad): ReflectionPad1d((1, 0))
  (f0_predictor): ConvRNNF0Predictor(
    (condnet): Sequential(
      (0): ParametrizedConv1d(
        80, 512, kernel_size=(3,), stride=(1,), padding=(1,)
        (parametrizations): ModuleDict(
          (weight): ParametrizationList(
            (0): _WeightNorm()
          )
        )
      )
      (1): ELU(alpha=1.0)
      (2): ParametrizedConv1d(
        512, 512, kernel_size=(3,), stride=(1,), padding=(1,)
        (parametrizations): ModuleDict(
          (weight): ParametrizationList(
            (0): _WeightNorm()
          )
        )
      )
      (3): ELU(alpha=1.0)
      (4): ParametrizedConv1d(
        512, 512, kernel_size=(3,), stride=(1,), padding=(1,)
        (parametrizations): ModuleDict(
          (weight): ParametrizationList(
            (0): _WeightNorm()
          )
        )
      )
      (5): ELU(alpha=1.0)
      (6): ParametrizedConv1d(
        512, 512, kernel_size=(3,), stride=(1,), padding=(1,)
        (parametrizations): ModuleDict(
          (weight): ParametrizationList(
            (0): _WeightNorm()
          )
        )
      )
      (7): ELU(alpha=1.0)
      (8): ParametrizedConv1d(
        512, 512, kernel_size=(3,), stride=(1,), padding=(1,)
        (parametrizations): ModuleDict(
          (weight): ParametrizationList(
            (0): _WeightNorm()
          )
        )
      )
      (9): ELU(alpha=1.0)
    )
    (classifier): Linear(in_features=512, out_features=1, bias=True)
  )
)

@weedge weedge changed the title feat: add step_audio2 daily_aqaa_bot feat: add step_audio2 daily_aqaa_bot with function call Sep 12, 2025
@weedge weedge changed the title feat: add step_audio2 daily_aqaa_bot with function call feat: add step_audio2 bots(e.g.: AQAA) with function call Sep 12, 2025
Signed-off-by: weedge <[email protected]>
@weedge weedge added Omni Omni Modality MLLM multimodal large language models and removed Omni Omni Modality labels Sep 13, 2025