Added support for gpt-4o-realtime models for speech-to-speech interactions #659

Draft: wants to merge 2 commits into main

sharananurag998

This PR adds a real-time voice pipeline for OpenAI’s gpt-4o-realtime-preview model, enabling low-latency speech-to-speech interactions in the Speect framework. It adds a streaming audio interface, automatic tool execution, and handlers for the Realtime API’s server events, while remaining compatible with the existing STT/TTS pipeline.


Key Features & Changes

  • RealtimeVoicePipeline:

    • New pipeline for direct, continuous audio-to-audio conversations with OpenAI’s real-time models.
    • Handles streaming microphone input and speaker output at 24kHz, as required by the API.
    • Supports push-to-talk and half-duplex operation to prevent echo/feedback.
  • Integrated Tool Calls:

    • Tools are registered with the pipeline and executed automatically when the model requests a function call.
    • Tool results are sent back to the model using the correct OpenAI Realtime API protocol.
  • Event Handling & Debugging:

    • Full support for all major OpenAI Realtime API events, including:
      • Audio and text deltas
      • Tool call arguments (streamed and completed)
      • Transcription events (conversation.item.input_audio_transcription.delta and .completed)
      • Session and rate limit updates
    • The example logs every transcription event, which makes it easy to debug what the model “hears.”
  • Echo & Feedback Mitigation:

    • Implements a buffer window after assistant audio playback to prevent microphone echo from triggering new turns.
    • Optionally enables server-side noise/echo reduction via input_audio_noise_reduction in the session config.
  • Sample Rate Fixes:

    • Ensures both input and output audio are always 24 kHz PCM, as required by the OpenAI API (fixes the “slow motion” playback bug caused by a sample-rate mismatch).
  • Backwards Compatibility:

    • All changes are fully compatible with the existing STT/TTS pipeline and configuration.
    • Legacy examples and workflows continue to work without modification.
  • Documentation & Examples:

    • Updated docs/voice/pipeline.md with new real-time usage, configuration, and troubleshooting sections.
    • New example: continuous_realtime_assistant.py demonstrates push-to-talk, tool calls, and event handling.
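
For reviewers unfamiliar with the tool-call round trip, here is a minimal sketch of the flow described under "Integrated Tool Calls". It is not the code in this PR. It assumes server events arrive as already-parsed dicts, that a completed function call is reported as a `response.output_item.done` event whose item has type `function_call`, and that the result is returned as a `function_call_output` item followed by `response.create`. `handle_event` and the `tools` registry are hypothetical names used only for illustration.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical registry: tool name -> callable that accepts keyword arguments.
ToolRegistry = Dict[str, Callable[..., Any]]


def handle_event(event: Dict[str, Any], tools: ToolRegistry) -> List[Dict[str, Any]]:
    """Return the client events to send in reply to one server event.

    Only completed function calls produce a reply here; audio deltas,
    transcripts, and other events are assumed to be handled elsewhere.
    """
    item = event.get("item") or {}
    if event.get("type") != "response.output_item.done" or item.get("type") != "function_call":
        return []

    try:
        # Arguments arrive as a JSON-encoded string.
        args = json.loads(item.get("arguments") or "{}")
        result = tools[item["name"]](**args)
        output = json.dumps(result)
    except Exception as exc:  # unknown tool, bad arguments, or tool failure
        output = json.dumps({"error": str(exc)})

    return [
        # 1. Attach the tool result to the conversation, keyed by call_id.
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": item["call_id"],
                "output": output,
            },
        },
        # 2. Ask the model to continue now that the result is available.
        {"type": "response.create"},
    ]
```

Returning the outgoing events instead of sending them keeps the sketch independent of any particular WebSocket client; a caller would serialize each dict with `json.dumps` and send it over its own connection.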

🛠️ How to Use

  • Realtime Pipeline:
    See the new example and documentation for how to use RealtimeVoicePipeline with your OpenAI API key and tools.
  • Classic Pipeline:
    No changes required—existing STT/TTS flows are unaffected.
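
The echo mitigation listed under Key Features can be pictured with the sketch below. It is an illustration under stated assumptions, not the implementation in this PR: `EchoGate`, its 0.5 s default hold-off, and the `SESSION_UPDATE` constant are hypothetical, and the session fields shown (`pcm16` audio formats and `input_audio_noise_reduction` with type `near_field`) reflect my understanding of the Realtime API session config.

```python
import time
from typing import Callable, Optional

# Assumed session.update payload: pcm16 is 24 kHz mono PCM, and
# input_audio_noise_reduction enables server-side noise/echo reduction.
SESSION_UPDATE = {
    "type": "session.update",
    "session": {
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_noise_reduction": {"type": "near_field"},
    },
}


class EchoGate:
    """Half-duplex gate: drop mic audio during playback and for a short
    hold-off window afterwards, so speaker echo cannot start a new turn."""

    def __init__(self, holdoff_s: float = 0.5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._holdoff_s = holdoff_s
        self._clock = clock  # injectable so the gate can be tested without sleeping
        self._playing = False
        self._last_playback_end: Optional[float] = None

    def playback_started(self) -> None:
        self._playing = True

    def playback_finished(self) -> None:
        self._playing = False
        self._last_playback_end = self._clock()

    def mic_open(self) -> bool:
        """True when microphone chunks may be forwarded to the model."""
        if self._playing:
            return False
        if self._last_playback_end is None:
            return True  # nothing has been played yet
        return self._clock() - self._last_playback_end >= self._holdoff_s
```

A capture loop would call `mic_open()` before forwarding each microphone chunk and discard the chunk when it returns `False`.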

…ions

- Added detailed documentation for the new `RealtimeVoicePipeline`, including usage examples and event handling for real-time audio interaction.
- Introduced a new example script demonstrating the `RealtimeVoicePipeline` with continuous audio streaming and tool execution.
sharananurag998 force-pushed the main branch 2 times, most recently from ba7af6d to 8bcb389 (May 7, 2025 11:04)