Added support for gpt-4o-realtime models for speech-to-speech interactions
- Added detailed documentation for the new `RealtimeVoicePipeline`, including usage examples and event handling for real-time audio interaction.
- Introduced a new example script demonstrating the `RealtimeVoicePipeline` with continuous audio streaming and tool execution.
`docs/voice/pipeline.md` (+167 lines)
### Interruptions
The Agents SDK currently does not have built-in interruption support for [`StreamedAudioInput`][agents.voice.input.StreamedAudioInput]. Instead, for every detected turn it will trigger a separate run of your workflow. If you want to handle interruptions inside your application, you can listen to the [`VoiceStreamEventLifecycle`][agents.voice.events.VoiceStreamEventLifecycle] events. `turn_started` indicates that a new turn was transcribed and processing is beginning. `turn_ended` triggers after all the audio for the respective turn has been dispatched. You could use these events to mute the speaker's microphone when the model starts a turn and unmute it after you have flushed all the related audio for that turn.
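
For example, a minimal sketch of this mute/unmute pattern might look like the following, where `mute_microphone()` and `unmute_microphone()` are placeholders for your own audio handling:

```python
from agents.voice import VoiceStreamEventLifecycle

async for event in result.stream():
    if isinstance(event, VoiceStreamEventLifecycle):
        if event.event == "turn_started":
            mute_microphone()    # placeholder: stop capturing while the model is speaking
        elif event.event == "turn_ended":
            unmute_microphone()  # placeholder: resume capturing once the turn's audio is flushed
```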
Once the pipeline is done processing all turns, the `stream()` method will complete and the context manager will exit.
## Real-time Voice Pipeline
The SDK includes a `RealtimeVoicePipeline` designed for direct, bidirectional voice interaction with newer, real-time capable models like OpenAI's `gpt-4o-realtime-preview`. This pipeline differs significantly from the standard `VoicePipeline`:
- **Direct Voice-to-Voice:** It sends your audio directly to the real-time LLM and receives audio back from the LLM. There are no separate STT (Speech-to-Text) or TTS (Text-to-Speech) steps managed by this pipeline. The LLM handles both transcription and speech generation internally.
- **Integrated Tool Calls:** If the LLM decides to use a tool, the pipeline will automatically execute it using the tools you provided during initialization and send the result back to the LLM. The pipeline emits `VoiceStreamEventToolCall` events so your application can log or display information about tool usage, but it does not need to perform any action in response to these events.
- **Continuous Streaming:** It's designed for continuous audio input and output, facilitating more natural conversational turn-taking.
### Usage
The `RealtimeVoicePipeline` follows a similar pattern to the standard `VoicePipeline`:
1. Create a `StreamedAudioInput` instance
2. Configure a `VoicePipelineConfig` with real-time specific settings
3. Initialize the pipeline with a real-time model and any tools
4. Call `run()` to get a result that can be streamed
5. Process the events from the stream
#### Basic example:
```python
from agents.voice import (
    RealtimeVoicePipeline,
    StreamedAudioInput,
    VoicePipelineConfig,
    VoiceStreamEventAudio,
    VoiceStreamEventToolCall,
)
from agents.voice.models.sdk_realtime import SDKRealtimeLLM

# Create the input, config, and model
input_stream = StreamedAudioInput()
config = VoicePipelineConfig(
    realtime_settings={
        "turn_detection": "server_vad",  # Use server-side voice activity detection
        "system_message": "You are a helpful assistant.",
    }
)
model = SDKRealtimeLLM(model_name="gpt-4o-realtime-preview")

# Create the pipeline with tools
pipeline = RealtimeVoicePipeline(
    model=model,
    tools=[get_weather, get_time],  # function tools defined elsewhere in your app
    config=config,
)

# Start the pipeline
result = await pipeline.run(input_stream)

# Process events from the pipeline
# (in a real application, run this loop and the audio-sending loop below as separate asyncio tasks)
async for event in result.stream():
    # Handle different event types
    if isinstance(event, VoiceStreamEventAudio):
        # Play this audio to the user
        play_audio(event.data)
    elif isinstance(event, VoiceStreamEventToolCall):
        # Log tool usage (execution is automatic)
        log_tool_call(event.tool_name, event.arguments)
    # Handle other event types...

# Continuously send audio chunks to the pipeline
# There's no need to signal "end of audio" - the model handles turn-taking
while True:
    audio_chunk = record_audio_chunk()
    await input_stream.queue.put(audio_chunk)

    # If the application is closing, close the input
    if stopping:
        await input_stream.close()
        break
```
### Turn Detection Modes
The realtime models can operate in different turn detection modes, controlled via the `turn_detection` setting:
- `"server_vad"` (default): The server automatically detects when the user has stopped speaking using Voice Activity Detection and starts responding.
- `"manual"`: Your application explicitly signals when the user has finished speaking by calling `await llm_session.commit_audio_buffer()` (see the sketch after this list).
- `None`: Same as `"server_vad"` - the server handles turn detection automatically.
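
For manual turn detection, a sketch might look like the following; `llm_session` refers to the realtime session object mentioned above, and how you obtain it depends on your integration:

```python
# Select manual turn detection (same config pattern as the basic example above)
config = VoicePipelineConfig(
    realtime_settings={
        "turn_detection": "manual",
        "system_message": "You are a helpful assistant.",
    }
)

# Stream the user's audio while they are speaking...
await input_stream.queue.put(audio_chunk)

# ...then explicitly signal the end of the turn:
await llm_session.commit_audio_buffer()
```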
### Implementing Push-to-Talk
In push-to-talk mode, the application sends real microphone audio only while the user holds a button or key, and streams continuous silent audio the rest of the time to keep the connection alive.
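
A minimal sketch of this pattern, where `is_talk_key_pressed()`, `record_audio_chunk()`, and `make_silence_chunk()` are placeholder helpers for your own audio and input handling (not SDK APIs):

```python
import asyncio

CHUNK_DURATION_S = 0.02  # send ~20 ms of audio per chunk (illustrative value)

async def stream_push_to_talk(input_stream: StreamedAudioInput) -> None:
    while True:
        if is_talk_key_pressed():
            # Forward real microphone audio while the key is held
            chunk = record_audio_chunk()
        else:
            # Keep sending silence so the realtime connection stays open
            chunk = make_silence_chunk(CHUNK_DURATION_S)
        await input_stream.queue.put(chunk)
        await asyncio.sleep(CHUNK_DURATION_S)
```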
### Event Types

The realtime pipeline emits events on the result's `stream()`, including:

- `VoiceStreamEventToolCall`: Provides information about tool calls being executed by the pipeline
- `VoiceStreamEventError`: Indicates an error condition
### Key Differences & Important Notes
- **Continuous Audio**: The realtime pipeline expects continuous audio input, not discrete turns ending with a `None` sentinel. Use `input_stream.close()` only when shutting down the pipeline entirely.
- **Event Types**: You'll receive `VoiceStreamEventToolCall` events for informational purposes when tools are used. The pipeline automatically executes registered tools and sends results back to the LLM - no action is needed from your application.
- **No Separate STT/TTS Events**: You will receive `VoiceStreamEventAudio` directly from the LLM. There are no separate events indicating STT transcription completion or explicit text-to-speech stages within this pipeline's event stream.
- **Configuration**: Real-time model specific settings (like assistant voice, system message, or turn detection mode) are passed via the `realtime_settings` dictionary within `VoicePipelineConfig`.
- **Audio Format**: The OpenAI realtime models currently require **16-bit PCM at a 24 kHz sample rate, mono, little-endian** for both _input_ and _output_ when you use the default `pcm16` format. Make sure your microphone capture (`StreamedAudioInput`) and speaker playback are configured for **24 kHz** to avoid chipmunk or slow-motion artefacts.
```python
INPUT_SAMPLE_RATE = 24_000   # 24 kHz for mic capture
OUTPUT_SAMPLE_RATE = 24_000  # 24 kHz for TTS playback
```
Failing to match this sample rate is the most common cause of distorted or "slow" audio.
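
As an illustration only (the SDK does not mandate a particular audio library), opening 24 kHz, mono, 16-bit PCM streams with the third-party `sounddevice` package might look like this:

```python
import sounddevice as sd

SAMPLE_RATE = 24_000              # required by the realtime models' default pcm16 format
CHUNK_FRAMES = SAMPLE_RATE // 50  # ~20 ms of audio per chunk

# Raw 16-bit PCM, mono streams for the default microphone and speaker
mic = sd.RawInputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16")
speaker = sd.RawOutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16")
mic.start()
speaker.start()

data, _overflowed = mic.read(CHUNK_FRAMES)  # pcm16 bytes to send to the pipeline
speaker.write(data)                         # play pcm16 bytes received from the pipeline
```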
For complete working examples, see:
- [`realtime_assistant.py`](https://github.com/openai/openai-agents-python/blob/main/examples/voice/realtime_assistant.py) - Basic example with simulated audio
- [`continuous_realtime_assistant.py`](https://github.com/openai/openai-agents-python/blob/main/examples/voice/continuous_realtime_assistant.py) - Example showing continuous streaming with push-to-talk simulation
Note that these examples require approved access to the OpenAI `gpt-4o-realtime-preview` model.
### New transcription events
When you enable `input_audio_transcription` in the session configuration (the realtime pipeline does this automatically), the server can stream _your_ microphone audio back as text, and the SDK surfaces two new event types so you can inspect what the model thinks it heard. These are invaluable for debugging cases where echo or background noise is being misinterpreted by the model.
### Echo & feedback mitigation
If you hear the assistant repeatedly greeting you ("Hello again!"), it usually means your microphone is re-capturing the speaker audio. Combine these techniques:
1. Enable the built-in echo / noise suppression.
2. In push-to-talk interfaces, _pause_ mic streaming for ~300 ms after the last assistant audio chunk. See `ASSISTANT_AUDIO_SILENCE_BUFFER_S` in `continuous_realtime_assistant.py` and the sketch below.
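
A rough sketch of the pause logic from item 2, reusing the placeholder helpers from the push-to-talk example above:

```python
import time

ASSISTANT_AUDIO_SILENCE_BUFFER_S = 0.3  # illustrative value: ~300 ms
last_assistant_audio_at = 0.0

def on_assistant_audio(event):
    """Call this from the event loop whenever assistant audio arrives."""
    global last_assistant_audio_at
    play_audio(event.data)
    last_assistant_audio_at = time.monotonic()

def next_chunk_to_send():
    """Call this from the mic loop to decide what to put on the input stream."""
    if time.monotonic() - last_assistant_audio_at < ASSISTANT_AUDIO_SILENCE_BUFFER_S:
        # Still inside the buffer window: send silence to avoid re-capturing speaker output
        return make_silence_chunk(CHUNK_DURATION_S)
    return record_audio_chunk()
```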