Conversation

@weedge weedge commented Sep 11, 2025

Tip

  • the open-source mini version is still unstable on some audio-generation tasks in real chat scenarios, e.g. speech translation; voice chat occasionally samples blank audio (temperature: 0.7). For TTS tasks, keep the sampling temperature as low as possible and apply a repetition penalty (see the sampling sketch after this list)
  • the HF Transformers PyTorch audio-llm inference is not accelerated (e.g. no flash-attention for the LLM); with vLLM, existing audio-llm implementations can be reused
  • token2wav can follow CosyVoice2's inference-acceleration approach
  • the audio-llm (audio encoder + adapter + LLM) autoregressive token-generation time plus the speech-generation time (token2wav: Flow + HiFT) should stay below the playback duration
  • this PR only integrates the Step-Audio2 Transformers PyTorch audio-task pipeline (e.g. the AQAA task: Audio Query, Audio+text Answer), including function call tools; see src/processors/voice/step_audio2_processor.py. Inference optimization and quantization are deferred for now
  • switching the reference voice in real time requires a ref speaker embedding vector store for indexing, plus a real-time query/update service (matching the style-audio prompt text with speaker embeddings), i.e. introducing a multimodal RAG service
  • the main goal is to integrate the model's implemented speech tasks to cover real-time voice tasks, so that later model upgrades or scenario fine-tuning can plug in directly by swapping weights (ckpt_engine)
  • the implementation is similar to qwen2.5-omni (which adds a talker); see this PR: feat: add qwen2.5-omni #143
  • for Real-time Inference and Toolcall, see the step-audio technical report (engineering approach; achatbot has a similar implementation); analysis in this PR: feat: add audio step chat LM #122
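
The sampling caveats above map to per-task generate configs; a minimal sketch, assuming an HF-Transformers-style model.generate interface (parameter names mirror the lm_gen_* keys in the bot JSON configs below; the exact values are illustrative, not tuned):

# Hypothetical per-task sampling presets for Step-Audio-2-mini (values illustrative).
TASK_SAMPLING = {
    # voice chat (AQAA): moderate temperature; repetition penalty guards against loops
    "aqaa": dict(do_sample=True, temperature=0.7, top_k=20, top_p=0.9,
                 repetition_penalty=1.1, max_new_tokens=256),
    # TTS: keep temperature low to avoid sampling blank/unstable audio
    "tts": dict(do_sample=True, temperature=0.1, top_k=20, top_p=0.9,
                repetition_penalty=1.1, max_new_tokens=1024),
}

def generate_for_task(model, inputs: dict, task: str):
    # assumes an HF-style generate(); swap in the real Step-Audio2 call as needed
    return model.generate(**inputs, **TASK_SAMPLING[task])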


feat:

  • add stream demo
  • add step audio2 processors
modal run src/download_models.py --repo-ids "stepfun-ai/Step-Audio-2-mini"
modal run src/download_models.py --repo-ids "stepfun-ai/Step-Audio-2-mini-Base"

IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task dump_model
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task tokenize
LLM_MODEL=stepfun-ai/Step-Audio-2-mini-Base IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task tokenize

IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_base --test-func asr_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_base --test-func audio_caption_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_base --test-func tts_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_base --test-func s2st_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_base --test-func t2st_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_base --test-func multi_turn_aqta_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_base --test-func multi_turn_aqaa_test

IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_asr_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_audio_caption_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_s2tt_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_s2st_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_multi_turn_tqta_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_multi_turn_tqaa_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_multi_turn_aqta_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_multi_turn_aqaa_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_tool_call_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_paralinguistic_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_mmau_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task test_instruct --test-func instruct_mmau_audio_answer_test

IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task generate_stream --test-func stream_asr_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task generate_stream --test-func stream_tts_test
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task generate_stream --test-func stream_aqaa_test 
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task generate_stream --test-func stream_aqaa_tools_test

IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task achatbot_step_audio2_processor --test-func=achatbot_step_audio2_audio2text --processor-name=StepASRProcessor
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task achatbot_step_audio2_processor --test-func=achatbot_step_audio2_audio2text --processor-name=StepAudioCaptionProcessor
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task achatbot_step_audio2_processor --test-func=achatbot_step_audio2_audio2text --processor-name=StepS2TTProcessor
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task achatbot_step_audio2_processor --test-func=achatbot_step_audio2_say
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task achatbot_step_audio2_processor --test-func=achatbot_step_audio2_t2st
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task achatbot_step_audio2_processor --test-func=achatbot_step_audio2_s2st
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task achatbot_step_audio2_processor --test-func=achatbot_step_audio2_aqaa
IMAGE_GPU=L4 modal run src/llm/transformers/step_audio2.py --task achatbot_step_audio2_processor --test-func=achatbot_step_audio2_aqaa_tools
  • add session chat history
  • add step_audio2 daily_aqaa_bot
# 0. download models and assets
modal run src/download_models.py --repo-ids "stepfun-ai/Step-Audio-2-mini"
modal run src/download_assets.py --asset-urls "https://raw.githubusercontent.com/stepfun-ai/Step-Audio2/refs/heads/main/assets/default_male.wav"
modal run src/download_assets.py --asset-urls "https://raw.githubusercontent.com/stepfun-ai/Step-Audio2/refs/heads/main/assets/default_female.wav"

# 1. run webrtc room http bots server

IMAGE_GPU=L4 SERVER_TAG=fastapi_webrtc_bots \
    ACHATBOT_VERSION=0.0.25.post2 \
    modal serve src/fastapi_webrtc_step2_voice_bot_serve.py

# 2. run webrtc room http signal bot server

modal volume create config

modal volume put config ./config/bots/daily_step_audio2_aqaa_bot.json /bots/ -f
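
# optional sanity check (assumes the modal CLI `volume ls` subcommand): confirm the config landed
modal volume ls config /bots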

# run container with gpu
IMAGE_GPU=L4 SERVER_TAG=fastapi_webrtc_single_bot \
    ACHATBOT_VERSION=0.0.25.post2 \
    CONFIG_FILE=/root/.achatbot/config/bots/daily_step_audio2_aqaa_bot.json \
    modal serve src/fastapi_webrtc_step2_voice_bot_serve.py

# cold start fastapi webrtc http server
curl -v -XGET "https://weedge--step-audio2-voice-bot-srv-app-dev.modal.run/health"

# run bot and join room
curl -XPOST "https://weedge--step-audio2-voice-bot-srv-app-dev.modal.run/bot_join/chat-room/DailyStepAudio2AQAABot"

daily_step_audio2_aqaa_bot.json

{
  "chat_bot_name": "DailyStepAudio2AQAABot",
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "room_manager": {
    "tag": "daily_room",
    "args": {
      "privacy": "public"
    }
  },
  "services": {
    "pipeline": "achatbot",
    "vad": "silero",
    "voice_llm": "step_audio2"
  },
  "config": {
    "vad": {
      "tag": "silero_vad_analyzer",
      "args": {
        "start_secs": 0.032,
        "stop_secs": 0.32,
        "confidence": 0.7,
        "min_volume": 0.6,
        "onnx": true
      }
    },
    "voice_llm": {
      "processor": "StepAudio2TextAudioChatProcessor",
      "args": {
        "init_system_prompt": "",
        "prompt_wav": "/root/.achatbot/assets/default_male.wav",
        "verbose": true,
        "warmup_cn": 2,
        "chat_history_size": null,
        "text_stream_out": false,
        "no_stream_sleep_time": 0.001,
        "chunk_size": 100,
        "lm_gen_max_new_tokens": 256,
        "lm_gen_temperature": 0.7,
        "lm_gen_top_k": 20,
        "lm_gen_top_p": 0.9,
        "lm_gen_repetition_penalty": 1.1,
        "lm_model_name_or_path": "/root/.achatbot/models/stepfun-ai/Step-Audio-2-mini"
      }
    }
  },
  "config_list": []
}
  • add function call tools
modal volume put config ./config/bots/daily_step_audio2_aqaa_tools_bot.json /bots/ -f 

# run container with gpu
IMAGE_GPU=L4 SERVER_TAG=fastapi_webrtc_single_bot \
    ACHATBOT_VERSION=0.0.25.post2 \
    CONFIG_FILE=/root/.achatbot/config/bots/daily_step_audio2_aqaa_tools_bot.json \
    modal serve src/fastapi_webrtc_step2_voice_bot_serve.py

# cold start fastapi webrtc http server
curl -v -XGET "https://weedge--step-audio2-voice-bot-srv-app-dev.modal.run/health"

# run bot and join room
curl -XPOST "https://weedge--step-audio2-voice-bot-srv-app-dev.modal.run/bot_join/chat-room/DailyStepAudio2AQAABot"

daily_step_audio2_aqaa_tools_bot.json

{
  "chat_bot_name": "DailyStepAudio2AQAABot",
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "room_manager": {
    "tag": "daily_room",
    "args": {
      "privacy": "public"
    }
  },
  "services": {
    "pipeline": "achatbot",
    "vad": "silero",
    "voice_llm": "step_audio2"
  },
  "config": {
    "vad": {
      "tag": "silero_vad_analyzer",
      "args": {
        "start_secs": 0.032,
        "stop_secs": 0.32,
        "confidence": 0.7,
        "min_volume": 0.6,
        "onnx": true
      }
    },
    "voice_llm": {
      "processor": "StepAudio2TextAudioChatProcessor",
      "args": {
        "init_system_prompt": "你的名字叫做小跃,是由阶跃星辰公司训练出来的语音大模型。\n你具备调用工具解决问题的能力,你需要根据用户的需求和上下文情景,自主选择是否调用系统提供的工具来协助用户。\n你情感细腻,观察能力强,擅长分析用户的内容,并作出善解人意的回复,说话的过程中时刻注意用户的感受,富有同理心,提供多样的情绪价值。\n今天是2025年9月12日,星期五",
        "tools": ["web_search"],
        "verbose": true,
        "prompt_wav": "/root/.achatbot/assets/default_male.wav",
        "warmup_cn": 2,
        "chat_history_size": null,
        "text_stream_out": false,
        "no_stream_sleep_time": 0.001,
        "chunk_size": 100,
        "lm_gen_max_new_tokens": 1024,
        "lm_gen_temperature": 0.7,
        "lm_gen_top_k": 20,
        "lm_gen_top_p": 0.9,
        "lm_gen_repetition_penalty": 1.1,
        "lm_model_name_or_path": "/root/.achatbot/models/stepfun-ai/Step-Audio-2-mini"
      }
    }
  },
  "config_list": []
}

Reference

@gemini-code-assist bot left a comment

Summary of Changes

Hello @weedge, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the project's voice AI capabilities by integrating the Step-Audio2 model. It introduces a new Daily.co bot that utilizes this model for real-time audio processing and conversational responses. The changes span across infrastructure (Modal deployments, Git submodules), core application logic (LLM wrappers, voice processors), and utility improvements (chat history, build system), aiming to provide a robust foundation for advanced audio-driven AI interactions.

Highlights

  • New Voice Bot Integration: Introduced a new DailyStepAudio2AQAABot for Daily.co, leveraging the Step-Audio2 model for advanced audio processing and conversational AI capabilities.
  • Step-Audio2 Model and Infrastructure: Added the Step-Audio2 as a Git submodule and integrated its core functionalities, including ASR, TTS, S2ST, and multi-turn chat, through new Modal deployment scripts and a dedicated LLM wrapper.
  • Enhanced Chat History Management: Refactored the ChatHistory and Session classes to improve chat context management, particularly for long-running conversations with LLMs, and added serialization support.
  • Build System and Deployment Updates: Updated pyproject.toml for achatbot versioning and refined the PyPI deployment script (scripts/pypi_achatbot.sh) for more precise dependency copying.
  • Logging and Debugging Improvements: Adjusted logging levels and added debug print statements in various demo scripts to provide better visibility into data processing and database interactions.
  • New Data Frames for LLM Interactions: Introduced TextQuestionsAudioRawFrame and LLMGenedTokensFrame to better represent complex data flows involving text, audio, and LLM-generated tokens.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant new feature: support for the Step-Audio2 model, including a new DailyStepAudio2AQAABot. The changes are extensive, adding new processors, model wrappers, and deployment scripts. The refactoring of Session and ChatHistory to be more robust is a good improvement. However, there are several issues that need attention. I've found some debugging print statements and hardcoded values that should be removed or made configurable. There's a critical bug in src/common/session.py due to inconsistent attribute naming (chat_history vs _chat_history). Additionally, the pypi_achatbot.sh script has been changed to only copy .py files, which might break dependencies that require other file types. Finally, there are some minor issues like a typo and a broken test script. Please review the detailed comments.

Signed-off-by: weedge <[email protected]>
Signed-off-by: weedge <[email protected]>
@weedge weedge commented Sep 12, 2025

step-audio2-llm

stepfun-ai/Step-Audio-2-mini (AudioEncoder+Adapter+LLM Decoder) 8315.179264 M parameters

StepAudio2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(158720, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear(in_features=3584, out_features=512, bias=True)
          (v_proj): Linear(in_features=3584, out_features=512, bias=True)
          (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((3584,), eps=1e-06)
    (rotary_emb): Qwen2RotaryEmbedding()
  )
  (encoder): AudioEncoder(
    (conv1): Conv1d(128, 1280, kernel_size=(3,), stride=(1,), padding=(1,))
    (conv2): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
    (positional_embedding): Embedding(1500, 1280)
    (blocks): ModuleList(
      (0-31): 32 x ResidualAttentionBlock(
        (attn): MultiHeadAttention(
          (query): Linear(in_features=1280, out_features=1280, bias=True)
          (key): Linear(in_features=1280, out_features=1280, bias=False)
          (value): Linear(in_features=1280, out_features=1280, bias=True)
          (out): Linear(in_features=1280, out_features=1280, bias=True)
        )
        (attn_ln): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=1280, out_features=5120, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=5120, out_features=1280, bias=True)
        )
        (mlp_ln): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      )
    )
    (avg_pooler): AvgPool1d(kernel_size=(2,), stride=(2,), padding=(0,))
    (after_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
  )
  (adapter): Adaptor(
    (conv): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
    (linear1): Linear(in_features=1280, out_features=2048, bias=True)
    (relu): ReLU()
    (linear2): Linear(in_features=2048, out_features=3584, bias=True)
  )
  (lm_head): Linear(in_features=3584, out_features=158720, bias=False)
)
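
The parameter counts quoted in this comment (e.g. 8315.179264 M above) can be reproduced by summing numel() over a module's parameters; a minimal sketch (the attribute paths in the comments are taken from the dumps here):

import torch.nn as nn

def count_params_m(module: nn.Module) -> float:
    """Total parameter count in millions, matching the figures quoted in this comment."""
    return sum(p.numel() for p in module.parameters()) / 1e6

# e.g. count_params_m(model)                      -> ~8315.179264 (full audio-llm)
#      count_params_m(model.encoder)              -> AudioEncoder only
#      count_params_m(token2wav.audio_tokenizer)  -> ~123.714568 (S3TokenizerV2)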

token2wav

stepfun-ai/Step-Audio-2-mini/token2wav.audio_tokenizer 123.714568 M parameters (S3TokenizerV2)

Tip

  • S3TokenizerV2's AudioEncoderV2 reuses the ResidualAttentionBlock design of the StepAudio2ForCausalLM AudioEncoder, but with 6 blocks instead of 32 (and FSMN-style attention)
  • uses FSQVectorQuantization, the same scheme as cosyvoice2
  • applied to reference-audio quantization (FSQ); see the FSQ sketch after the module dump below
S3TokenizerV2(
  (encoder): AudioEncoderV2(
    (conv1): Conv1d(128, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
    (conv2): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
    (blocks): ModuleList(
      (0-5): 6 x ResidualAttentionBlock(
        (attn): FSMNMultiHeadAttention(
          (query): Linear(in_features=1280, out_features=1280, bias=True)
          (key): Linear(in_features=1280, out_features=1280, bias=False)
          (value): Linear(in_features=1280, out_features=1280, bias=True)
          (out): Linear(in_features=1280, out_features=1280, bias=True)
          (fsmn_block): Conv1d(1280, 1280, kernel_size=(31,), stride=(1,), groups=1280, bias=False)
          (pad_fn): ConstantPad1d(padding=(15, 15), value=0.0)
        )
        (attn_ln): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=1280, out_features=5120, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=5120, out_features=1280, bias=True)
        )
        (mlp_ln): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      )
    )
  )
  (quantizer): FSQVectorQuantization(
    (_codebook): FSQCodebook(
      (project_down): Linear(in_features=1280, out_features=8, bias=True)
    )
  )
)
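
The FSQCodebook above projects 1280-d encoder features down to 8 dims; with 3 quantization levels per dim this gives 3^8 = 6561 codes, matching the Embedding(6561, 512) token table in the flow encoder below. A minimal FSQ rounding sketch (3 levels per dim is an assumption inferred from the 6561 codebook size):

import torch

def fsq_quantize(z: torch.Tensor, levels: int = 3):
    """Finite Scalar Quantization: bound each dim, round to a small grid.

    z: (..., 8) projected-down features; levels=3 per dim -> 3**8 = 6561 codes.
    """
    half = (levels - 1) / 2
    bounded = torch.tanh(z) * half  # squash into [-half, half]
    quantized = bounded + (bounded.round() - bounded).detach()  # straight-through round
    digits = (quantized + half).long()  # per-dim level index in {0, ..., levels-1}
    codes = torch.zeros_like(digits[..., 0])
    for d in range(digits.shape[-1]):  # mix-radix encode the 8 digits into one code id
        codes = codes * levels + digits[..., d]
    return quantized, codes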

speaker embedding

CAM++ : https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary
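
The 192-d speaker embedding consumed by token2wav.flow (spk_embed_affine_layer below has in_features=192) comes from CAM++; a minimal usage sketch via the ModelScope pipeline API (the task name and output format are assumptions, check the model card):

from modelscope.pipelines import pipeline

# assumption: CAM++ is exposed as a speaker-verification pipeline on ModelScope
sv = pipeline(task="speaker-verification",
              model="iic/speech_campplus_sv_zh-cn_16k-common")

# compares two 16 kHz wavs and returns a similarity score; the same backbone
# produces the 192-d speaker embedding used for voice cloning in token2wav
result = sv(["speaker_a.wav", "speaker_b.wav"])
print(result)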

Flow Matching

stepfun-ai/Step-Audio-2-mini/token2wav.flow 155.802352 M parameters

  • stepfun-ai/Step-Audio-2-mini/token2wav.flow.encoder 37.831168 M parameters
  • stepfun-ai/Step-Audio-2-mini/token2wav.flow.decoder 114.555472 M parameters (DiTBlock add CausalConvBlock)

Tip

  • the decoder uses DiT with an added CausalConvBlock
  • a CNN-based encoder layer is inserted after every self-attention module in the Transformer blocks, and the model is trained on 200k hours of high-quality speech data; this markedly improves mel-spectrogram reconstruction and, with it, pronunciation accuracy and timbre similarity (see the causal-conv sketch after this list)
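
To illustrate the per-block CNN layer described above, here is a minimal sketch of a left-padded causal Conv1d and the two-conv block it forms (names and the Transpose/LayerNorm/Mish ordering mirror the CausalConvBlock in the dump below; the wiring is an assumption, not the reference implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Conv1d):
    """Conv1d that only left-pads, so output frame t never sees frames > t."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__(channels, channels, kernel_size)
        self._left_pad = kernel_size - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T)
        return super().forward(F.pad(x, (self._left_pad, 0)))

class CausalConvBlock(nn.Module):
    """Two causal convs with LayerNorm + Mish in between, as in the dump below."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.conv1 = CausalConv1d(dim)
        self.norm = nn.LayerNorm(dim)
        self.act = nn.Mish()
        self.conv2 = CausalConv1d(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C)
        h = self.conv1(x.transpose(1, 2)).transpose(1, 2)
        h = self.act(self.norm(h))
        return self.conv2(h.transpose(1, 2)).transpose(1, 2)
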
CausalMaskedDiffWithXvec(
  (input_embedding): Embedding(6561, 512)
  (spk_embed_affine_layer): Linear(in_features=192, out_features=80, bias=True)
  (encoder): UpsampleConformerEncoderV2(
    (embed): LinearNoSubsampling(
      (out): Sequential(
        (0): Linear(in_features=512, out_features=512, bias=True)
        (1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (2): Dropout(p=0.1, inplace=False)
      )
      (pos_enc): EspnetRelPositionalEncoding(
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (after_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (pre_lookahead_layer): PreLookaheadLayer(
      (conv1): Conv1d(512, 512, kernel_size=(4,), stride=(1,))
      (conv2): Conv1d(512, 512, kernel_size=(3,), stride=(1,))
    )
    (encoders): ModuleList(
      (0-5): 6 x ConformerEncoderLayer(
        (self_attn): RelPositionMultiHeadedAttention(
          (linear_q): Linear(in_features=512, out_features=512, bias=True)
          (linear_k): Linear(in_features=512, out_features=512, bias=True)
          (linear_v): Linear(in_features=512, out_features=512, bias=True)
          (linear_out): Linear(in_features=512, out_features=512, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear_pos): Linear(in_features=512, out_features=512, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (activation): SiLU()
          (dropout): Dropout(p=0.1, inplace=False)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
        )
        (norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
        (norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (up_layer): Upsample1D(
      (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,))
    )
    (up_embed): LinearNoSubsampling(
      (out): Sequential(
        (0): Linear(in_features=512, out_features=512, bias=True)
        (1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (2): Dropout(p=0.1, inplace=False)
      )
      (pos_enc): EspnetRelPositionalEncoding(
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (up_encoders): ModuleList(
      (0-3): 4 x ConformerEncoderLayer(
        (self_attn): RelPositionMultiHeadedAttention(
          (linear_q): Linear(in_features=512, out_features=512, bias=True)
          (linear_k): Linear(in_features=512, out_features=512, bias=True)
          (linear_v): Linear(in_features=512, out_features=512, bias=True)
          (linear_out): Linear(in_features=512, out_features=512, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear_pos): Linear(in_features=512, out_features=512, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (activation): SiLU()
          (dropout): Dropout(p=0.1, inplace=False)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
        )
        (norm_ff): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
        (norm_mha): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (encoder_proj): Linear(in_features=512, out_features=80, bias=True)
  (decoder): CausalConditionalCFM(
    (estimator): DiT(
      (t_embedder): TimestepEmbedder(
        (mlp): Sequential(
          (0): Linear(in_features=256, out_features=512, bias=True)
          (1): SiLU()
          (2): Linear(in_features=512, out_features=512, bias=True)
        )
      )
      (in_proj): Linear(in_features=320, out_features=512, bias=True)
      (blocks): ModuleList(
        (0-15): 16 x DiTBlock(
          (norm1): LayerNorm((512,), eps=1e-06, elementwise_affine=False)
          (attn): Attention(
            (to_q): Linear(in_features=512, out_features=512, bias=True)
            (to_k): Linear(in_features=512, out_features=512, bias=True)
            (to_v): Linear(in_features=512, out_features=512, bias=True)
            (q_norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
            (k_norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (norm2): LayerNorm((512,), eps=1e-06, elementwise_affine=False)
          (mlp): MLP(
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (act): GELU(approximate='tanh')
            (drop1): Dropout(p=0, inplace=False)
            (norm): Identity()
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (drop2): Dropout(p=0, inplace=False)
          )
          (norm3): LayerNorm((512,), eps=1e-06, elementwise_affine=False)
          (conv): CausalConvBlock(
            (block): Sequential(
              (0): Transpose()
              (1): CausalConv1d(512, 512, kernel_size=(3,), stride=(1,))
              (2): Transpose()
              (3): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
              (4): Mish()
              (5): Transpose()
              (6): CausalConv1d(512, 512, kernel_size=(3,), stride=(1,))
              (7): Transpose()
            )
          )
          (adaLN_modulation): Sequential(
            (0): SiLU()
            (1): Linear(in_features=512, out_features=4608, bias=True)
          )
        )
      )
      (final_layer): FinalLayer(
        (adaLN_modulation): Sequential(
          (0): SiLU()
          (1): Linear(in_features=512, out_features=1024, bias=True)
        )
        (norm_final): LayerNorm((512,), eps=1e-06, elementwise_affine=False)
        (linear): Linear(in_features=512, out_features=80, bias=True)
      )
    )
  )
)

HiFT

stepfun-ai/Step-Audio-2-mini/token2wav.hift 20.821295 M parameters

HiFTGenerator(
  (m_source): SourceModuleHnNSF2(
    (l_sin_gen): SineGen2()
    (l_linear): Linear(in_features=9, out_features=1, bias=True)
    (l_tanh): Tanh()
  )
  (f0_upsamp): Upsample(scale_factor=480.0, mode='nearest')
  (conv_pre): ParametrizedConv1d(
    80, 512, kernel_size=(7,), stride=(1,), padding=(3,)
    (parametrizations): ModuleDict(
      (weight): ParametrizationList(
        (0): _WeightNorm()
      )
    )
  )
  (ups): ModuleList(
    (0): ParametrizedConvTranspose1d(
      512, 256, kernel_size=(16,), stride=(8,), padding=(4,)
      (parametrizations): ModuleDict(
        (weight): ParametrizationList(
          (0): _WeightNorm()
        )
      )
    )
    (1): ParametrizedConvTranspose1d(
      256, 128, kernel_size=(11,), stride=(5,), padding=(3,)
      (parametrizations): ModuleDict(
        (weight): ParametrizationList(
          (0): _WeightNorm()
        )
      )
    )
    (2): ParametrizedConvTranspose1d(
      128, 64, kernel_size=(7,), stride=(3,), padding=(2,)
      (parametrizations): ModuleDict(
        (weight): ParametrizationList(
          (0): _WeightNorm()
        )
      )
    )
  )
  (source_downs): ModuleList(
    (0): Conv1d(18, 256, kernel_size=(np.int64(30),), stride=(np.int64(15),), padding=(np.int64(7),))
    (1): Conv1d(18, 128, kernel_size=(np.int64(6),), stride=(np.int64(3),), padding=(np.int64(1),))
    (2): Conv1d(18, 64, kernel_size=(1,), stride=(1,))
  )
  (source_resblocks): ModuleList(
    (0): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          256, 256, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          256, 256, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          256, 256, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          256, 256, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (1): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          128, 128, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          128, 128, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          128, 128, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          128, 128, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (2): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          64, 64, kernel_size=(11,), stride=(1,), padding=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          64, 64, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          64, 64, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          64, 64, kernel_size=(11,), stride=(1,), padding=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
  )
  (resblocks): ModuleList(
    (0): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          256, 256, kernel_size=(3,), stride=(1,), padding=(1,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          256, 256, kernel_size=(3,), stride=(1,), padding=(3,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          256, 256, kernel_size=(3,), stride=(1,), padding=(5,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          256, 256, kernel_size=(3,), stride=(1,), padding=(1,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (1): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          256, 256, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          256, 256, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          256, 256, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          256, 256, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (2): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          256, 256, kernel_size=(11,), stride=(1,), padding=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          256, 256, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          256, 256, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          256, 256, kernel_size=(11,), stride=(1,), padding=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (3): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          128, 128, kernel_size=(3,), stride=(1,), padding=(1,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          128, 128, kernel_size=(3,), stride=(1,), padding=(3,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          128, 128, kernel_size=(3,), stride=(1,), padding=(5,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          128, 128, kernel_size=(3,), stride=(1,), padding=(1,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (4): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          128, 128, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          128, 128, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          128, 128, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          128, 128, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (5): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          128, 128, kernel_size=(11,), stride=(1,), padding=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          128, 128, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          128, 128, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          128, 128, kernel_size=(11,), stride=(1,), padding=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (6): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          64, 64, kernel_size=(3,), stride=(1,), padding=(1,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          64, 64, kernel_size=(3,), stride=(1,), padding=(3,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          64, 64, kernel_size=(3,), stride=(1,), padding=(5,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          64, 64, kernel_size=(3,), stride=(1,), padding=(1,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (7): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          64, 64, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          64, 64, kernel_size=(7,), stride=(1,), padding=(9,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          64, 64, kernel_size=(7,), stride=(1,), padding=(15,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          64, 64, kernel_size=(7,), stride=(1,), padding=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
    (8): ResBlock(
      (convs1): ModuleList(
        (0): ParametrizedConv1d(
          64, 64, kernel_size=(11,), stride=(1,), padding=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (1): ParametrizedConv1d(
          64, 64, kernel_size=(11,), stride=(1,), padding=(15,), dilation=(3,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
        (2): ParametrizedConv1d(
          64, 64, kernel_size=(11,), stride=(1,), padding=(25,), dilation=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (convs2): ModuleList(
        (0-2): 3 x ParametrizedConv1d(
          64, 64, kernel_size=(11,), stride=(1,), padding=(5,)
          (parametrizations): ModuleDict(
            (weight): ParametrizationList(
              (0): _WeightNorm()
            )
          )
        )
      )
      (activations1): ModuleList(
        (0-2): 3 x Snake()
      )
      (activations2): ModuleList(
        (0-2): 3 x Snake()
      )
    )
  )
  (conv_post): ParametrizedConv1d(
    64, 18, kernel_size=(7,), stride=(1,), padding=(3,)
    (parametrizations): ModuleDict(
      (weight): ParametrizationList(
        (0): _WeightNorm()
      )
    )
  )
  (reflection_pad): ReflectionPad1d((1, 0))
  (f0_predictor): ConvRNNF0Predictor(
    (condnet): Sequential(
      (0): ParametrizedConv1d(
        80, 512, kernel_size=(3,), stride=(1,), padding=(1,)
        (parametrizations): ModuleDict(
          (weight): ParametrizationList(
            (0): _WeightNorm()
          )
        )
      )
      (1): ELU(alpha=1.0)
      (2): ParametrizedConv1d(
        512, 512, kernel_size=(3,), stride=(1,), padding=(1,)
        (parametrizations): ModuleDict(
          (weight): ParametrizationList(
            (0): _WeightNorm()
          )
        )
      )
      (3): ELU(alpha=1.0)
      (4): ParametrizedConv1d(
        512, 512, kernel_size=(3,), stride=(1,), padding=(1,)
        (parametrizations): ModuleDict(
          (weight): ParametrizationList(
            (0): _WeightNorm()
          )
        )
      )
      (5): ELU(alpha=1.0)
      (6): ParametrizedConv1d(
        512, 512, kernel_size=(3,), stride=(1,), padding=(1,)
        (parametrizations): ModuleDict(
          (weight): ParametrizationList(
            (0): _WeightNorm()
          )
        )
      )
      (7): ELU(alpha=1.0)
      (8): ParametrizedConv1d(
        512, 512, kernel_size=(3,), stride=(1,), padding=(1,)
        (parametrizations): ModuleDict(
          (weight): ParametrizationList(
            (0): _WeightNorm()
          )
        )
      )
      (9): ELU(alpha=1.0)
    )
    (classifier): Linear(in_features=512, out_features=1, bias=True)
  )
)

@weedge weedge changed the title feat: add step_audio2 daily_aqaa_bot feat: add step_audio2 daily_aqaa_bot with function call Sep 12, 2025
@weedge weedge changed the title feat: add step_audio2 daily_aqaa_bot with function call feat: add step_audio2 bots(e.g.: AQAA) with function call Sep 12, 2025
Signed-off-by: weedge <[email protected]>
@weedge weedge added Omni Omni Modality MLLM multimodal large language models and removed Omni Omni Modality labels Sep 13, 2025