
Commit b205f83

Added support for gpt-4o-realtime models for Speech-to-Speech interactions

- Added detailed documentation for the new `RealtimeVoicePipeline`, including usage examples and event handling for real-time audio interaction.
- Introduced a new example script demonstrating the `RealtimeVoicePipeline` with continuous audio streaming and tool execution.

1 parent f976349 commit b205f83

16 files changed, +2247 -16 lines changed

docs/voice/pipeline.md

+167
@@ -73,3 +73,170 @@ async for event in result.stream():

### Interruptions

The Agents SDK currently does not provide built-in interruption support for [`StreamedAudioInput`][agents.voice.input.StreamedAudioInput]. Instead, for every detected turn it triggers a separate run of your workflow. If you want to handle interruptions inside your application, you can listen for [`VoiceStreamEventLifecycle`][agents.voice.events.VoiceStreamEventLifecycle] events. `turn_started` indicates that a new turn was transcribed and processing is beginning. `turn_ended` fires after all the audio for the respective turn has been dispatched. You could use these events to mute the speaker's microphone when the model starts a turn and unmute it after you have flushed all the related audio for that turn.
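
For example, a minimal sketch of this mute/unmute pattern, where `mute_microphone` and `unmute_microphone` are hypothetical helpers in your application:

```python
from agents.voice import VoiceStreamEventLifecycle

async for event in result.stream():
    if isinstance(event, VoiceStreamEventLifecycle):
        if event.event == "turn_started":
            mute_microphone()    # the model began a turn; stop capturing
        elif event.event == "turn_ended":
            unmute_microphone()  # all audio for this turn has been dispatched
```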

Once the pipeline is done processing all turns, the `stream()` method will complete and the context manager will exit.

## Real-time Voice Pipeline
The SDK includes a `RealtimeVoicePipeline` designed for direct, bidirectional voice interaction with newer, real-time capable models like OpenAI's `gpt-4o-realtime-preview`. This pipeline differs significantly from the standard `VoicePipeline`:

- **Direct Voice-to-Voice:** It sends your audio directly to the real-time LLM and receives audio back from the LLM. There are no separate STT (Speech-to-Text) or TTS (Text-to-Speech) steps managed by this pipeline; the LLM handles both transcription and speech generation internally.
- **Integrated Tool Calls:** If the LLM decides to use a tool, the pipeline automatically executes it using the tools you provided during initialization and sends the result back to the LLM. The pipeline emits `VoiceStreamEventToolCall` events so your application can log or display information about tool usage, but it does not need to perform any action in response to these events.
- **Continuous Streaming:** It's designed for continuous audio input and output, facilitating more natural conversational turn-taking.

### Usage
The `RealtimeVoicePipeline` follows a similar pattern to the standard `VoicePipeline`:

1. Create a `StreamedAudioInput` instance
2. Configure a `VoicePipelineConfig` with real-time specific settings
3. Initialize the pipeline with a real-time model and any tools
4. Call `run()` to get a result that can be streamed
5. Process the events from the stream

#### Basic example:

```python
from agents.voice import (
    RealtimeVoicePipeline,
    StreamedAudioInput,
    VoicePipelineConfig,
    VoiceStreamEventAudio,
    VoiceStreamEventToolCall,
)
from agents.voice.models.sdk_realtime import SDKRealtimeLLM

# Create the input, config, and model
input_stream = StreamedAudioInput()
config = VoicePipelineConfig(
    realtime_settings={
        "turn_detection": "server_vad",  # Use server-side voice activity detection
        "system_message": "You are a helpful assistant.",
    }
)
model = SDKRealtimeLLM(model_name="gpt-4o-realtime-preview")

# Create the pipeline with tools
pipeline = RealtimeVoicePipeline(
    model=model,
    tools=[get_weather, get_time],
    config=config,
)

# Start the pipeline
result = await pipeline.run(input_stream)

# Process events from the pipeline (in a real application, run this loop
# and the audio-sending loop below concurrently, e.g. as asyncio tasks)
async for event in result.stream():
    # Handle different event types
    if isinstance(event, VoiceStreamEventAudio):
        # Play this audio to the user
        play_audio(event.data)
    elif isinstance(event, VoiceStreamEventToolCall):
        # Log tool usage (execution is automatic)
        log_tool_call(event.tool_name, event.arguments)
    # Handle other event types...

# Continuously send audio chunks to the pipeline.
# There's no need to signal "end of audio" - the model handles turn-taking.
while True:
    audio_chunk = record_audio_chunk()
    await input_stream.queue.put(audio_chunk)

    # If the application is closing, close the input
    if stopping:
        await input_stream.close()
        break
```
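
The `get_weather` and `get_time` tools passed to the pipeline above are ordinary function tools. A minimal sketch of what they might look like, assuming the SDK's standard `function_tool` decorator is accepted here:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

from agents import function_tool


@function_tool
def get_weather(city: str) -> str:
    """Return a short weather report for the given city."""
    # A real implementation would call a weather API here.
    return f"The weather in {city} is sunny."


@function_tool
def get_time(timezone: str) -> str:
    """Return the current time in the given IANA timezone."""
    return datetime.now(ZoneInfo(timezone)).strftime("%H:%M")
```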
### Turn Detection Modes

The realtime models can operate in different turn detection modes, controlled via the `turn_detection` setting:

- `"server_vad"` (default): The server automatically detects when the user has stopped speaking using Voice Activity Detection and starts responding.
- `"manual"`: Your application explicitly signals when the user has finished speaking by calling `await llm_session.commit_audio_buffer()`.
- `None`: Same as `"server_vad"` - the server handles turn detection automatically.
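
For example, a sketch of the manual mode, assuming `llm_session` is the realtime session handle (as in the push-to-talk example below):

```python
config = VoicePipelineConfig(
    realtime_settings={
        "turn_detection": "manual",
        "system_message": "You are a helpful assistant.",
    }
)

# ... stream audio chunks while the user is speaking, then explicitly
# signal that the user's turn is over:
await llm_session.commit_audio_buffer()
```
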
### Implementing Push-to-Talk

In push-to-talk mode, the application sends audio only when the user activates a button or key:

```python
import asyncio

import numpy as np

# Stream continuously; silence keeps the connection alive between presses
async def send_continuous_audio():
    while True:
        if push_to_talk_active:
            # Send real audio while the button is pressed
            audio = get_microphone_audio()
        else:
            # Send silence while the button is not pressed
            audio = np.zeros(CHUNK_SIZE, dtype=np.int16)

        await input_stream.queue.put(audio)
        await asyncio.sleep(CHUNK_DURATION)  # Simulate real-time pacing

# When the user releases the push-to-talk button
async def on_push_to_talk_released():
    # Optional: for manual turn detection, commit the buffer
    if turn_detection == "manual":
        await llm_session.commit_audio_buffer()
```

### Event Handling

When processing events from a `RealtimeVoicePipeline`, you'll handle these event types:

- `VoiceStreamEventAudio`: Contains audio data from the LLM to play back to the user
- `VoiceStreamEventLifecycle`: Indicates session lifecycle events (e.g., "turn_started", "turn_ended", "session_ended")
- `VoiceStreamEventToolCall`: Provides information about tool calls being executed by the pipeline
- `VoiceStreamEventError`: Indicates an error condition
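
A dispatch loop covering all four event types might look like the following sketch, where `play_audio` and `log_tool_call` are hypothetical application helpers:

```python
import logging

async for event in result.stream():
    if isinstance(event, VoiceStreamEventAudio):
        play_audio(event.data)  # assistant speech
    elif isinstance(event, VoiceStreamEventLifecycle):
        if event.event == "session_ended":
            break  # the session is over; stop consuming events
    elif isinstance(event, VoiceStreamEventToolCall):
        log_tool_call(event.tool_name, event.arguments)
    elif isinstance(event, VoiceStreamEventError):
        logging.error("Pipeline error: %s", event.error)
```
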
### Key Differences & Important Notes

- **Continuous Audio**: The realtime pipeline expects continuous audio input, not discrete turns ending with a `None` sentinel. Use `input_stream.close()` only when shutting down the pipeline entirely.
- **Event Types**: You'll receive `VoiceStreamEventToolCall` events for informational purposes when tools are used. The pipeline automatically executes registered tools and sends results back to the LLM; no action is needed from your application.
- **No Separate STT/TTS Events**: You will receive `VoiceStreamEventAudio` directly from the LLM. There are no separate events indicating STT transcription completion or explicit text-to-speech stages within this pipeline's event stream.
- **Configuration**: Real-time model specific settings (like assistant voice, system message, or turn detection mode) are passed via the `realtime_settings` dictionary within `VoicePipelineConfig`.
- **Audio Format**: The OpenAI realtime models currently require **16-bit PCM at a 24 kHz sample rate, mono, little-endian** for both _input_ and _output_ when you use the default `pcm16` format. Make sure your microphone capture (`StreamedAudioInput`) and speaker playback are configured for **24 kHz** to avoid chipmunk or slow-motion artefacts.

```python
INPUT_SAMPLE_RATE = 24_000   # 24 kHz for mic capture
OUTPUT_SAMPLE_RATE = 24_000  # 24 kHz for TTS playback
```

Failing to match this sample rate is the most common cause of distorted or "slow" audio.
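
For instance, a capture/playback setup with the `sounddevice` library (any audio I/O library works; the 24 kHz, mono, int16 format is what matters):

```python
import sounddevice as sd

INPUT_SAMPLE_RATE = 24_000
OUTPUT_SAMPLE_RATE = 24_000

# Both streams use 16-bit PCM, mono, 24 kHz to match the model's audio format
mic = sd.InputStream(samplerate=INPUT_SAMPLE_RATE, channels=1, dtype="int16")
speaker = sd.OutputStream(samplerate=OUTPUT_SAMPLE_RATE, channels=1, dtype="int16")
```
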
For complete working examples, see:

- [`realtime_assistant.py`](https://github.com/openai/openai-agents-python/blob/main/examples/voice/realtime_assistant.py) - Basic example with simulated audio
- [`continuous_realtime_assistant.py`](https://github.com/openai/openai-agents-python/blob/main/examples/voice/continuous_realtime_assistant.py) - Example showing continuous streaming with push-to-talk simulation

Note that these examples require approved access to the OpenAI `gpt-4o-realtime-preview` model.

### New transcription events

When you enable `input_audio_transcription` in the session configuration (the realtime pipeline does this automatically), the server can stream _your_ microphone audio back as text. Two new event types are surfaced by the SDK so you can inspect what the model thinks it heard:

- `RealtimeEventInputAudioTranscriptionDelta` – incremental partial transcripts
- `RealtimeEventInputAudioTranscriptionCompleted` – the final transcript for that user turn

```python
# Inside the event-processing loop:
elif isinstance(event, RealtimeEventInputAudioTranscriptionDelta):
    print("you (partial):", event.delta)
elif isinstance(event, RealtimeEventInputAudioTranscriptionCompleted):
    print("you (final):", event.transcript)
```

These are invaluable for debugging cases where echo or background noise is being misinterpreted by the model.

### Echo & feedback mitigation

If you hear the assistant repeatedly greeting you ("Hello again!"), it usually means your microphone is re-capturing the speaker audio. Combine these techniques:

1. Enable the built-in echo / noise suppression with:

   ```python
   realtime_settings={"input_audio_noise_reduction": {}}
   ```

2. In push-to-talk interfaces, _pause_ mic streaming for ~300 ms after the last assistant audio chunk, as sketched after this list. See `ASSISTANT_AUDIO_SILENCE_BUFFER_S` in `continuous_realtime_assistant.py`.

3. Use headphones for the cleanest experience.
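
A sketch of technique 2, assuming your event loop records `last_assistant_audio_time` (e.g. via `time.monotonic()`) whenever a `VoiceStreamEventAudio` arrives; the 0.3 s value mirrors `ASSISTANT_AUDIO_SILENCE_BUFFER_S` in the example script:

```python
import time

ASSISTANT_AUDIO_SILENCE_BUFFER_S = 0.3

def mic_should_stream(last_assistant_audio_time: float) -> bool:
    # Hold the mic closed briefly after the assistant stops speaking so the
    # tail of its audio is not re-captured and treated as user speech.
    return time.monotonic() - last_assistant_audio_time > ASSISTANT_AUDIO_SILENCE_BUFFER_S
```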
