feat: add step_audio2 bots(e.g.: AQAA) with function call #190
Signed-off-by: weedge <[email protected]>
Summary of Changes
Hello @weedge, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the project's voice AI capabilities by integrating the Step-Audio2 model. It introduces a new Daily.co bot that utilizes this model for real-time audio processing and conversational responses. The changes span across infrastructure (Modal deployments, Git submodules), core application logic (LLM wrappers, voice processors), and utility improvements (chat history, build system), aiming to provide a robust foundation for advanced audio-driven AI interactions.
Highlights
- New Voice Bot Integration: Introduced a new DailyStepAudio2AQAABot for Daily.co, leveraging the Step-Audio2 model for advanced audio processing and conversational AI capabilities.
- Step-Audio2 Model and Infrastructure: Added Step-Audio2 as a Git submodule and integrated its core functionalities, including ASR, TTS, S2ST, and multi-turn chat, through new Modal deployment scripts and a dedicated LLM wrapper.
- Enhanced Chat History Management: Refactored the ChatHistory and Session classes to improve chat context management, particularly for long-running conversations with LLMs, and added serialization support.
- Build System and Deployment Updates: Updated pyproject.toml for achatbot versioning and refined the PyPI deployment script (scripts/pypi_achatbot.sh) for more precise dependency copying.
- Logging and Debugging Improvements: Adjusted logging levels and added debug print statements in various demo scripts to provide better visibility into data processing and database interactions.
- New Data Frames for LLM Interactions: Introduced TextQuestionsAudioRawFrame and LLMGenedTokensFrame to better represent complex data flows involving text, audio, and LLM-generated tokens.
Code Review
This pull request introduces a significant new feature: support for the Step-Audio2 model, including a new DailyStepAudio2AQAABot. The changes are extensive, adding new processors, model wrappers, and deployment scripts. The refactoring of Session and ChatHistory to be more robust is a good improvement. However, there are several issues that need attention. I've found some debugging print statements and hardcoded values that should be removed or made configurable. There's a critical bug in src/common/session.py due to inconsistent attribute naming (chat_history vs _chat_history). Additionally, the pypi_achatbot.sh script has been changed to only copy .py files, which might break dependencies that require other file types. Finally, there are some minor issues like a typo and a broken test script. Please review the detailed comments.
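The attribute-naming issue mentioned above (chat_history vs _chat_history) is the kind of bug sketched below. This is a minimal illustration only: BrokenSession and FixedSession are hypothetical stand-ins, not the classes in src/common/session.py, and the property-based fix is one possible remedy rather than the one this PR necessarily adopts.

```python
class BrokenSession:
    """Hypothetical stand-in: stores the history under one name, reads it under another."""

    def __init__(self):
        # The history is assigned to the private attribute...
        self._chat_history = []

    def add(self, message):
        # ...but accessed through the public name, which was never defined,
        # so this line raises AttributeError at call time.
        self.chat_history.append(message)


class FixedSession:
    """Hypothetical stand-in: a read-only property keeps both names consistent."""

    def __init__(self):
        self._chat_history = []

    @property
    def chat_history(self):
        # Single public access point backed by the private attribute.
        return self._chat_history

    def add(self, message):
        self.chat_history.append(message)


if __name__ == "__main__":
    try:
        BrokenSession().add("hello")
    except AttributeError as err:
        print(f"broken: {err}")

    session = FixedSession()
    session.add("hello")
    print(f"fixed: {session.chat_history}")
```

Because Python only resolves attribute names when the line runs, a mismatch like this passes import and construction and fails on first use, which is why it is easy to miss without a test that exercises the method.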
| Component | Model | Parameters | Notes |
|---|---|---|---|
| step-audio2-llm | stepfun-ai/Step-Audio-2-mini | 8315.179264 M | AudioEncoder + Adapter + LLM Decoder |
| token2wav | stepfun-ai/Step-Audio-2-mini/token2wav.audio_tokenizer | 123.714568 M | S3TokenizerV2 |
| Flow Matching | stepfun-ai/Step-Audio-2-mini/token2wav.flow | 155.802352 M | |
| HiFT | stepfun-ai/Step-Audio-2-mini/token2wav.hift | 20.821295 M | HiFTGenerator |

Tip
speaker embedding CAM++ : https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary
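For a rough sense of the overall footprint, the per-component counts listed above can be summed. The sketch below is plain arithmetic over those four figures; the dictionary keys are labels chosen here for illustration, and the CAM++ speaker-embedding model is left out because no parameter count is given for it.

```python
from decimal import Decimal

# Parameter counts in millions, copied from the component list above.
# Decimal avoids binary floating-point rounding noise in the sums.
PARAMS_M = {
    "step-audio2-llm": Decimal("8315.179264"),
    "token2wav.audio_tokenizer": Decimal("123.714568"),
    "token2wav.flow": Decimal("155.802352"),
    "token2wav.hift": Decimal("20.821295"),
}

# Everything under token2wav (tokenizer + flow matching + HiFT vocoder).
token2wav_m = sum(
    (count for name, count in PARAMS_M.items() if name.startswith("token2wav.")),
    Decimal(0),
)

# All listed components together.
total_m = sum(PARAMS_M.values(), Decimal(0))

if __name__ == "__main__":
    print(f"token2wav: {token2wav_m} M parameters")
    print(f"total:     {total_m} M parameters")
```

With the listed figures, the token2wav stack comes to 300.338215 M parameters and the listed components together to 8615.517479 M.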
Tip
feat:
daily_step_audio2_aqaa_bot.json
{
  "chat_bot_name": "DailyStepAudio2AQAABot",
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "room_manager": {
    "tag": "daily_room",
    "args": {
      "privacy": "public"
    }
  },
  "services": {
    "pipeline": "achatbot",
    "vad": "silero",
    "voice_llm": "step_audio2"
  },
  "config": {
    "vad": {
      "tag": "silero_vad_analyzer",
      "args": {
        "start_secs": 0.032,
        "stop_secs": 0.32,
        "confidence": 0.7,
        "min_volume": 0.6,
        "onnx": true
      }
    },
    "voice_llm": {
      "processor": "StepAudio2TextAudioChatProcessor",
      "args": {
        "init_system_prompt": "",
        "prompt_wav": "/root/.achatbot/assets/default_male.wav",
        "verbose": true,
        "warmup_cn": 2,
        "chat_history_size": null,
        "text_stream_out": false,
        "no_stream_sleep_time": 0.001,
        "chunk_size": 100,
        "lm_gen_max_new_tokens": 256,
        "lm_gen_temperature": 0.7,
        "lm_gen_top_k": 20,
        "lm_gen_top_p": 0.9,
        "lm_gen_repetition_penalty": 1.1,
        "lm_model_name_or_path": "/root/.achatbot/models/stepfun-ai/Step-Audio-2-mini"
      }
    }
  },
  "config_list": []
}
daily_step_audio2_aqaa_tools_bot.json
{
  "chat_bot_name": "DailyStepAudio2AQAABot",
  "room_name": "chat-room",
  "room_url": "",
  "token": "",
  "room_manager": {
    "tag": "daily_room",
    "args": {
      "privacy": "public"
    }
  },
  "services": {
    "pipeline": "achatbot",
    "vad": "silero",
    "voice_llm": "step_audio2"
  },
  "config": {
    "vad": {
      "tag": "silero_vad_analyzer",
      "args": {
        "start_secs": 0.032,
        "stop_secs": 0.32,
        "confidence": 0.7,
        "min_volume": 0.6,
        "onnx": true
      }
    },
    "voice_llm": {
      "processor": "StepAudio2TextAudioChatProcessor",
      "args": {
        "init_system_prompt": "你的名字叫做小跃,是由阶跃星辰公司训练出来的语音大模型。\n你具备调用工具解决问题的能力,你需要根据用户的需求和上下文情景,自主选择是否调用系统提供的工具来协助用户。\n你情感细腻,观察能力强,擅长分析用户的内容,并作出善解人意的回复,说话的过程中时刻注意用户的感受,富有同理心,提供多样的情绪价值。\n今天是2025年9月12日,星期五",
        "tools": ["web_search"],
        "verbose": true,
        "prompt_wav": "/root/.achatbot/assets/default_male.wav",
        "warmup_cn": 2,
        "chat_history_size": null,
        "text_stream_out": false,
        "no_stream_sleep_time": 0.001,
        "chunk_size": 100,
        "lm_gen_max_new_tokens": 1024,
        "lm_gen_temperature": 0.7,
        "lm_gen_top_k": 20,
        "lm_gen_top_p": 0.9,
        "lm_gen_repetition_penalty": 1.1,
        "lm_model_name_or_path": "/root/.achatbot/models/stepfun-ai/Step-Audio-2-mini"
      }
    }
  },
  "config_list": []
}
English translation of init_system_prompt: "Your name is Xiaoyue (小跃), a large voice model trained by StepFun (阶跃星辰). You are able to call tools to solve problems; based on the user's needs and the context, you decide on your own whether to call the tools provided by the system to assist the user. You are emotionally perceptive and observant, good at analyzing what the user says and replying with understanding; while speaking, you stay attentive to the user's feelings, show empathy, and offer varied emotional value. Today is Friday, September 12, 2025."
Reference