A modular framework for building VoIP-Agent applications.
Dialog is an orchestration layer for VoIP-Agent applications. Two common VoIP-Agent models exist today: the Speech-to-Speech (S2S) model and the Speech-to-Text with Text-to-Speech (STT–TTS) model.
The S2S model converts spoken input into spoken output, while the STT–TTS model first converts speech into text, which is processed by an Agent; the Agent’s textual response is then converted back into speech. Both approaches involve tradeoffs.
Dialog adopts the STT–TTS model. It orchestrates communication between the VoIP, STT, TTS, and Agent components. The framework provides concrete implementations of VoIP, STT, and TTS classes, along with abstract Agent classes designed for subclassing.
- Extensible, modular framework
- Concrete implementations for VoIP, STT, and TTS, plus abstract Agent classes for extension
- Multithreaded deployments
- Event-driven architecture
- Isolated state — components exchange objects but never mutate objects held by other components
NB: Dialog is a well-architected implementation; however, it is still undergoing active refactoring. Prior to 1.0.0, public interfaces may change on minor version increments and commit messages will be minimal.
- Installation
- Usage
- Examples
- Architecture
- Implementations
- Custom Implementations
- Multithreading
- API
- Troubleshooting
- Alternatives
- Support
These instructions describe how to clone the Dialog repository and build the package.
git clone https://github.com/faranalytics/dialog.git
cd dialog
npm install && npm update

You can use the clean:build script in order to do a clean build.

npm run clean:build

Alternatively, you can use the watch script in order to watch and build the package. This will build the package each time you make a change to a file in ./src. If you use the watch script, you will need to open a new terminal in order to build and run your application.

npm run watch

In your package directory, install Dialog as a dependency.

npm install <path-to-the-dialog-repository> --save

You should now be able to import Dialog artifacts into your package.
When a call is initiated, a Gateway (e.g., a Twilio Gateway) emits a voip event. The voip handler is called with a VoIP instance as its single argument. The VoIP instance handles the WebSocket connection that is set on it by the Gateway. In the voip handler, an instance of an Agent is constructed by passing a VoIP, STT, and TTS implementation into its constructor. The Agent is started by calling its activate method. The activate method of the Agent instance connects the interfaces that comprise the application.
An important characteristic of the architecture is that new instances (i.e., a VoIP, STT, TTS, and Agent) are created for every call. This allows each instance to maintain state specific to its call.
Excerpted from src/main.ts.
...
const gateway = new TwilioGateway({
httpServer,
webSocketServer,
webhookURL: new URL(WEBHOOK_URL),
authToken: TWILIO_AUTH_TOKEN,
accountSid: TWILIO_ACCOUNT_SID
});
gateway.on("voip", (voip: TwilioVoIP) => {
const agent = new TwilioVoIPOpenAIAgent({
voip: voip,
stt: new DeepgramSTT({ apiKey: DEEPGRAM_API_KEY, liveSchema: DEEPGRAM_LIVE_SCHEMA }),
tts: new CartesiaTTS({ apiKey: CARTESIA_API_KEY, speechOptions: CARTESIA_SPEECH_OPTIONS }),
apiKey: OPENAI_API_KEY,
system: OPENAI_SYSTEM_MESSAGE,
greeting: OPENAI_GREETING_MESSAGE,
model: OPENAI_MODEL,
twilioAccountSid: TWILIO_ACCOUNT_SID,
twilioAuthToken: TWILIO_AUTH_TOKEN
});
agent.activate();
});
...

Example implementations are provided in the examples subpackages.
In the Custom Twilio VoIP, Deepgram STT, Cartesia TTS, and OpenAI Agent example you will create a simple hypothetical Agent that prepends its messages with a timestamp and manages its conversation history.
In the Twilio VoIP, Deepgram STT, Cartesia TTS, and OpenAI Agent (Threading) example you will run each call session and Agent instance in a worker thread.
A Dialog orchestration typically consists of one or more of: an Agent component, an STT component, a TTS component, and a VoIP component.
An Agent component is essential to assembling the external LLM and the VoIP, STT, and TTS implementations into a working whole. Dialog, as the orchestration layer, does not provide a concrete Agent implementation. Instead, you are provided with an interface and an abstract class that you can implement or subclass with your custom tool-calling logic. For example, an Agent will decide when to transfer a call; if the LLM determines that the User intends to be transferred, the Agent can carry out this intent by calling the VoIP.transferTo method, or it could circumvent the provided call transfer facilities entirely and call the VoIP provider's API (e.g., Twilio, Telnyx, etc.) directly. The point here is that very few architectural constraints should be imposed on the Agent; this ensures the extensibility of the architecture.
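For instance, a subclass might implement call transfer as a tool call. The sketch below is hypothetical — the "transfer_call" tool and its argument shape are application-defined, not part of Dialog — and simply shows an Agent delegating the transfer to its VoIP implementation.

import { TwilioVoIP } from "@farar/dialog";

interface ToolCall { name: string; arguments: string; }

// Hypothetical tool-call handler: if the LLM requested the application-defined
// "transfer_call" tool, carry out the intent via the VoIP implementation.
function handleToolCall(voip: TwilioVoIP, toolCall: ToolCall): void {
  if (toolCall.name === "transfer_call") {
    const { tel } = JSON.parse(toolCall.arguments) as { tel: string };
    voip.transferTo(tel); // an E.164 telephone number
  }
}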
The STT component transcribes the User's speech into text. The STT emits message events containing finalized utterances and vad (voice activity detection) events that may be consumed by the Agent.
The TTS component synthesizes the text received from the Agent and/or LLM. The TTS emits message events that may be consumed by the Agent.
The VoIP component handles the incoming call, transcriptions, recordings, and streams audio into the STT.
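To make the flow concrete, the following sketch shows the kind of event wiring an Agent's activate method performs, expressed as the body of an Agent subclass's activate method (event names per the API section below; the provided implementations add error and dispatch handling as well):

// Inside an Agent subclass; this.voip, this.stt, and this.tts are set by the constructor.
this.voip.on("message", (message) => this.stt.post(message)); // caller audio → STT
this.stt.on("message", (message) => this.post(message));      // finalized transcript → Agent inference
this.stt.on("vad", () => this.abort());                       // barge-in: cancel active assistant speech
this.tts.on("message", (message) => this.voip.post(message)); // synthesized audio → caller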
Dialog favors simplicity and accessibility over feature richness. Its architecture should meet all the requirements of a typical VoIP-Agent application where many Users interact with a set of Agents. Although Dialog doesn't presently support concepts like "rooms", the simplicity and extensibility of its architecture should lend itself to even more advanced implementations.
Each component in a Dialog orchestration must not directly mutate the state of another component (e.g., VoIP, STT, TTS, or Agent). Components may emit messages, consume the messages of other components, and hold references to each other; however, mutating an object held by one component should never mutate the state of an object held by another component. This is an important characteristic of Dialog components: they exhibit isolated state. Each component may exchange objects with other components but never mutate them. For example, a VoIP component may emit a Metadata object that contains information about a given incoming call and that is consumed by other components; however, a subsequent mutation of the VoIP's Metadata must not mutate the Metadata held by another component.
This strict separation of concerns ensures that component state remains predictable and easy for a human to reason about. Likewise, the architecture is expected to be easy for LLMs to consume, as the LLM's attention can be focused on the pattern that is exhibited by the relevant component.
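The updateMetadata handler in the custom Agent example below illustrates the convention: consume an emitted object by copying it, so that later changes in the producer cannot leak into the consumer. A minimal sketch:

// Merge emitted metadata into a fresh object; never retain a shared, mutable reference.
this.voip.on("metadata", (metadata: TwilioMetadata) => {
  this.metadata = { ...this.metadata, ...metadata };
});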
+-------------+  audio (base64)  +------------+      transcripts     +----------+   text   +-------------+
|   Twilio    | ---------------> |    STT     | -------------------> |  Agent   | -------> |     TTS     |
|    VoIP     | -metadata/events | (Deepgram  | --metadata/events--> | (OpenAI) |          | (11Labs or  |
| (WS in/out) |                  | or OpenAI) |                      |          |          |  Cartesia)  |
+-------------+                  +------------+                      +----------+          +-------------+
       ^                                                                                          v
       +------------------------------------------------------------------------------------------+
                                           audio (base64)
Dialog provides example implementations for each of the artifacts that comprise a VoIP-Agent application. You can use a packaged implementation as-is, subclass it, or implement your own. If you implement your own, you can use one of the provided interfaces.
Twilio VoIP
- Twilio request validation
- Recording status
- Transcript status
- Speech interruption

Telnyx VoIP
An implementation similar to Twilio is planned. A placeholder exists under src/implementations/voip/telnyx/.

Deepgram STT
- Voice activity detection (VAD) events

OpenAI Realtime STT
- Voice activity detection (VAD) events
- Semantic VAD

ElevenLabs TTS
- Configurable voice

Cartesia TTS
- Configurable voice

OpenAI Agent
- An abstract Agent implementation is provided that uses the OpenAI API.
Dialog provides concrete VoIP, STT, and TTS implementations and an abstract Agent implementation. You can use a provided implementation as-is, subclass it, or choose an interface and implement your own. If you plan to implement your own VoIP, STT, Agent, or TTS, interfaces are provided for each component of the application.
A custom Agent implementation will allow you to facilitate tool calling, conversation history, and other nuances.
You can extend the provided OpenAIAgent class, as in the example below, or implement the Agent interface directly. The straightforward openai_agent.ts implementation can be used as a guide.
This hypothetical custom Agent implementation adds a timestamp to each user message and maintains conversation history.
import { once } from "node:events";
import { randomUUID } from "node:crypto";
import {
log,
Message,
OpenAIAgent,
OpenAIAgentOptions,
TwilioMetadata,
TwilioVoIP,
OpenAIConversationHistory,
} from "@farar/dialog";
export interface TwilioCustomAgentOptions extends OpenAIAgentOptions<TwilioVoIP> {
twilioAccountSid: string;
twilioAuthToken: string;
system?: string;
greeting?: string;
}
export class TwilioCustomAgent extends OpenAIAgent<TwilioVoIP> {
protected metadata?: TwilioMetadata;
protected twilioAccountSid: string;
protected twilioAuthToken: string;
protected history: OpenAIConversationHistory;
protected transcript: unknown[];
protected system: string;
protected greeting: string;
constructor(options: TwilioCustomAgentOptions) {
super(options);
this.twilioAccountSid = options.twilioAccountSid;
this.twilioAuthToken = options.twilioAuthToken;
this.transcript = [];
this.system = options.system ?? "";
this.greeting = options.greeting ?? "";
if (this.system) {
this.history = [
{
role: "system",
content: this.system,
},
];
} else {
this.history = [];
}
}
public inference = async (message: Message): Promise<void> => {
try {
const content = `${new Date().toISOString()}\n${message.data}`;
log.notice(`User message: ${content}`);
this.history.push({ role: "user", content });
const stream = await this.openAI.chat.completions.create({
model: this.model,
messages: this.history,
temperature: 1,
stream: true,
});
const assistantMessage = await this.dispatchStream(message.uuid, stream);
log.notice(`Assistant message: ${assistantMessage}`);
this.history.push({ role: "assistant", content: assistantMessage });
} catch (err) {
this.dispose(err);
}
};
public updateMetadata = (metadata: TwilioMetadata): void => {
if (!this.metadata) {
this.metadata = metadata;
} else {
this.metadata = { ...this.metadata, ...metadata };
}
};
public activate = (): void => {
super.activate();
this.voip.on("streaming_started", this.dispatchInitialMessage);
this.voip.on("streaming_started", this.startDisposal);
this.voip.on("metadata", this.updateMetadata);
};
public deactivate = (): void => {
super.deactivate();
this.voip.off("streaming_started", this.dispatchInitialMessage);
this.voip.off("streaming_started", this.startDisposal);
this.voip.off("metadata", this.updateMetadata);
};
public dispatchInitialMessage = (): void => {
const uuid = randomUUID();
this.activeMessages.add(uuid);
this.history.push({ role: "assistant", content: this.greeting });
this.dispatchMessage({ uuid: uuid, data: this.greeting, done: true }, false).catch(this.dispose);
};
protected startDisposal = (): void => {
void (async () => {
try {
await once(this.voip, "streaming_stopped");
this.dispose();
} catch (err) {
log.error(err);
}
})();
};
}

Dialog provides a simple multithreading implementation you can use. An example is provided that demonstrates a multithreaded deployment.
A Worker is spun up for each call. VoIP events are propagated over a MessageChannel using the Port Agent RPC-like facility. This approach ensures that a fault in handling one call will not interfere with other concurrent calls. Another notable aspect of this approach is that it permits hot changes to the Agent (and the STT and TTS) code without interrupting calls that are already underway; new calls will pick up changes each time a Worker is spun up.
In the excerpt below, a TwilioVoIPWorker is instantiated on each call.
Excerpted from ./src/main.ts.
const gateway = new TwilioGateway({
httpServer,
webSocketServer,
webhookURL: new URL(WEBHOOK_URL),
authToken: TWILIO_AUTH_TOKEN,
accountSid: TWILIO_ACCOUNT_SID,
requestSizeLimit: 1e6,
});
gateway.on("voip", (voip: TwilioVoIP) => {
new TwilioVoIPWorker({ voip, worker: new Worker("./dist/worker.js") });
});

Over in worker.js the Agent is instantiated as usual, except that a TwilioVoIPProxy instance, which implements the VoIP interface, is used in place of a TwilioVoIP.
Excerpted from ./src/worker.ts.
const voip = new TwilioVoIPProxy();
const agent = new Agent({
voip: voip,
stt: new DeepgramSTT({
apiKey: DEEPGRAM_API_KEY,
liveSchema: DEEPGRAM_LIVE_SCHEMA,
}),
tts: new CartesiaTTS({
apiKey: CARTESIA_API_KEY,
speechOptions: CARTESIA_SPEECH_OPTIONS,
}),
apiKey: OPENAI_API_KEY,
system: OPENAI_SYSTEM_MESSAGE,
greeting: OPENAI_GREETING_MESSAGE,
model: OPENAI_MODEL,
twilioAccountSid: TWILIO_ACCOUNT_SID,
twilioAuthToken: TWILIO_AUTH_TOKEN,
});
agent.activate();

Dialog provides building blocks to create real-time, voice-driven agents that integrate telephony (VoIP), speech-to-text (STT), text-to-speech (TTS), and LLM agents. It includes interfaces, utility classes, and concrete implementations for Twilio VoIP, Deepgram STT, OpenAI Realtime STT, ElevenLabs TTS, Cartesia TTS, and an OpenAI-based agent.
The API is organized by component. You can mix and match implementations by wiring them through the provided interfaces.
The logging utilities are thin wrappers around streams-logger for structured, backpressure‑aware logging.
- log <Logger> An initialized Logger pipeline emitting to the console via the included formatter and consoleHandler.
- formatter <Formatter<unknown, string>> Formats log records into human-readable strings.
- consoleHandler <ConsoleHandler<string>> A console sink with level set to DEBUG.
- SyslogLevel <enum> The syslog-style levels exported from streams-logger.
Use these exports in order to emit structured logs across the library. See streams-logger for details on usage and configuration.
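For example (a minimal sketch; log is the same logger used in the Agent examples above):

import { log } from "@farar/dialog";

// Emits through the included formatter and consoleHandler.
log.notice("Call session started.");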
- options <StreamBufferOptions>
  - bufferSizeLimit <number> Optionally specify a maximum buffer size in bytes. Default: 1e6
  - writableOptions <stream.WritableOptions> Optional Node.js stream options; use to customize highWaterMark, etc.
Use a StreamBuffer in order to buffer incoming stream chunks into a single in‑memory Buffer with an upper bound. If the buffer exceeds the limit, an error is emitted.
public streamBuffer.buffer
<Buffer>
The accumulated buffer contents.
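A minimal sketch, assuming a StreamBuffer is a Writable (per its writableOptions) and using an http.IncomingMessage as the source stream:

import * as http from "node:http";
import { StreamBuffer } from "@farar/dialog";

http.createServer((req, res) => {
  const streamBuffer = new StreamBuffer({ bufferSizeLimit: 1e6 });
  streamBuffer.on("error", console.error); // emitted if the limit is exceeded
  req.pipe(streamBuffer);
  req.on("end", () => {
    const body = streamBuffer.buffer.toString("utf-8"); // the accumulated contents
    res.end(body);
  });
}).listen(3000);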
- options <MutexOptions>
  - queueSizeLimit <number> A hard limit imposed on all mark queues. mutex.call will throw if this limit is exceeded.
Use a Mutex in order to serialize asynchronous calls by key.
public mutex.call(mark, fn, ...args)
- mark <string> A key identifying the critical section.
- fn <(...args: unknown[]) => Promise<unknown>> An async function to execute exclusively per key.
- ...args <unknown[]> Arguments forwarded to fn.
Returns: <Promise<unknown>>
Acquire the mutex for mark, invoke fn, and release the mutex, even on error.
public mutex.acquire(mark)
- mark <string> A key identifying the critical section.
Returns: <Promise<void>>
Wait until the mutex for mark is available and acquire it.
public mutex.release(mark)
- mark <string> A key identifying the critical section.
Returns: <void>
Release a previously acquired mutex for mark. Throws if called without a corresponding acquire.
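A minimal sketch (the mark value is illustrative):

import { Mutex } from "@farar/dialog";

const mutex = new Mutex({ queueSizeLimit: 100 });

// Invocations that share a mark run one at a time; distinct marks do not block each other.
await mutex.call("call-session-1", async () => {
  // critical section for this key
});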
These interfaces define the contracts between VoIP, STT, TTS, and Agent components.
- uuid <UUID> A unique identifier for correlation across components.
- data <DataT> The payload: audio (base64) or text, depending on the context.
- done <boolean> Whether the message is complete (end of stream/utterance).
- inference (message: Message) => Promise<void> Implement the main inference loop for a message.
- activate () => void Begin wiring events between components.
- deactivate () => void Remove event wiring.
Extends: EventEmitter<STTEvents>
Events (STTEvents):
"message":[Message]Emitted when a finalized transcription is available."vad":[]Emitted on voice activity boundary events (start/stop cues)."error":[unknown]Emitted on errors.
Methods:
- post (media: Message) => void Post audio media into the recognizer (typically base64 payloads).
- dispose () => void Dispose resources and listeners.
Extends: EventEmitter<TTSEvents>
Events (TTSEvents):
"message":[Message]Emitted with encoded audio output chunks, and a terminal chunk withdone: true."error":[unknown]Emitted on errors.
Methods:
- post (message: Message) => void Post text to synthesize. When done is true, the provider should flush and emit the terminal chunk.
- abort (uuid: UUID) => void Cancel a previously posted message stream.
- dispose () => void Dispose resources and listeners.
Extends: EventEmitter<VoIPEvents<MetadataT, TranscriptT>>
Events (VoIPEvents):
"metadata":[MetadataT]Emitted for call/session metadata updates."message":[Message]Emitted for inbound audio media frames (base64 payloads)."message_dispatched":[UUID]Emitted when a downstream consumer has finished dispatching a message identified by the UUID."transcript":[TranscriptT]Emitted for transcription webhook updates, when supported."recording_url":[string]Emitted with a URL for completed recordings, when supported."streaming_started":[]Emitted when the media stream starts."streaming_stopped":[]Emitted when the media stream ends."error":[unknown]Emitted on errors.
Methods:
- post (message: Message) => void Post synthesized audio back to the call/session.
- abort (uuid: UUID) => void Cancel an in-flight TTS dispatch and clear provider state if needed.
- hangup () => void Terminate the call/session, when supported by the provider.
- transferTo (tel: string) => void Transfer the call to the specified telephone number, when supported.
- dispose () => void Dispose resources and listeners.
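To implement a custom component, extend EventEmitter with the documented event map and provide the contract's methods. A minimal sketch of a custom STT (the provider wiring is left as an assumption):

import { EventEmitter } from "node:events";
import { Message } from "@farar/dialog";

type STTEvents = { "message": [Message], "vad": [], "error": [unknown] };

export class MySTT extends EventEmitter<STTEvents> {
  public post = (media: Message): void => {
    // Forward the base64 audio payload to your recognizer here (provider-specific).
    // When a finalized transcript arrives, emit it as a Message, e.g.:
    // this.emit("message", { uuid: media.uuid, data: transcript, done: true });
  };

  public dispose = (): void => {
    // Close provider connections, then remove listeners.
    this.removeAllListeners();
  };
}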
Twilio implementations provide inbound call handling, WebSocket media streaming, call control, recording, and transcription via Twilio.
- options <TwilioGatewayOptions>
  - httpServer <http.Server> An HTTP/HTTPS server for Twilio webhooks.
  - webSocketServer <ws.Server> A WebSocket server to receive Twilio Media Streams.
  - webhookURL <URL> The public URL for the voice webhook (full origin and path).
  - accountSid <string> Twilio Account SID.
  - authToken <string> Twilio Auth Token.
  - recordingStatusURL <URL> Optional recording status callback URL. If omitted, a unique URL on the same origin is generated.
  - transcriptStatusURL <URL> Optional transcription status callback URL. If omitted, a unique URL on the same origin is generated.
  - requestSizeLimit <number> Optional limit (bytes) for inbound webhook bodies. Default: 1e6
  - webSocketMessageSizeLimit <number> Optional limit (bytes) for inbound WebSocket messages. Default: 1e6
Use a TwilioGateway in order to accept Twilio voice webhooks, validate signatures, respond with a TwiML Connect <Stream> response, and manage the associated WebSocket connection and callbacks. On each new call, a TwilioVoIP instance is created and emitted.
Events:
"voip":[TwilioVoIP]Emitted when a new call is established and itsTwilioVoIPinstance is ready.
- options
<{ webSocket: ws.WebSocket, twilioGateway: TwilioGateway, callSidToTwilioVoIP: Map<string, TwilioVoIP> }>
Use a WebSocketListener in order to translate Twilio Media Stream messages into VoIP events for the associated TwilioVoIP instance. This class is managed by TwilioGateway and not typically constructed directly.
public webSocketListener.webSocket
<ws.WebSocket> The underlying WebSocket connection.
public webSocketListener.startMessage
<StartWebSocketMessage | undefined> The initial "start" message, when received.
- options <TwilioVoIPOptions>
  - metadata <TwilioMetadata> Initial call/stream metadata.
  - accountSid <string> Twilio Account SID.
  - authToken <string> Twilio Auth Token.
  - recordingStatusURL <URL> Recording status callback URL.
  - transcriptStatusURL <URL> Transcription status callback URL.
Use a TwilioVoIP in order to send synthesized audio back to Twilio, emit inbound media frames, and control the call (transfer, hangup, recording, and transcription).
public twilioVoIP.post(message)
- message <Message> Post base64-encoded audio media back to Twilio over the Media Stream. When done is true, a marker is sent to allow downstream dispatch tracking.
Returns: <void>
public twilioVoIP.abort(uuid)
- uuid <UUID> A message UUID to cancel. Sends a cancel marker and clears state; when no active messages remain, a clear control message is sent.
Returns: <void>
public twilioVoIP.transferTo(tel)
- tel <string> A destination telephone number in E.164 format.
Returns: <void>
Transfer the active call to tel using TwiML.
public twilioVoIP.hangup()
Returns: <void>
End the active call using TwiML.
public twilioVoIP.startTranscript()
Returns: <Promise<void>>
Start Twilio call transcription (Deepgram engine) with both_tracks.
public twilioVoIP.startRecording()
Returns: <Promise<void>>
Begin dual‑channel call recording with status callbacks.
public twilioVoIP.stopRecording()
Returns: <Promise<void>>
Stop the in‑progress recording when applicable.
public twilioVoIP.removeRecording()
Returns: <Promise<void>>
Remove the last recording via the Twilio API.
public twilioVoIP.dispose()
Returns: <void>
Close the media WebSocket and clean up listener maps.
Helper types and type guards for Twilio webhook and Media Stream payloads.
- Body <Record<string, string | string[] | undefined>> A generic Twilio form-encoded body map.
- CallMetadata Extends Body with required Twilio voice webhook fields.
- isCallMetadata(message) Returns: <message is CallMetadata>
- RecordingStatus Extends Body with Twilio recording status fields.
- isRecordingStatus(message) Returns: <message is RecordingStatus>
- TranscriptStatus Extends Body with Twilio transcription status fields.
- isTranscriptStatus(message) Returns: <message is TranscriptStatus>
- WebSocketMessage { event: "start" | "media" | "stop" | "mark" }
- StartWebSocketMessage, MediaWebSocketMessage, StopWebSocketMessage, MarkWebSocketMessage Specific Twilio Media Stream messages.
- isStartWebSocketMessage / isMediaWebSocketMessage / isStopWebSocketMessage / isMarkWebSocketMessage Type guards for the above.
- TwilioMetadata Partial<StartWebSocketMessage> & Partial<CallMetadata> A merged, partial metadata shape for convenience.
- options <OpenAIAgentOptions<VoIPT>>
  - voip <VoIPT> The telephony transport.
  - stt <STT> The speech-to-text provider.
  - tts <TTS> The text-to-speech provider.
  - apiKey <string> OpenAI API key.
  - model <string> OpenAI Chat Completions model identifier.
  - queueSizeLimit <number> A queueSizeLimit to be passed to the implementation's Mutex constructor.
Use an OpenAIAgent as a base class in order to build streaming, interruptible LLM agents that connect STT input, TTS output, and a VoIP transport. Subclasses implement inference to call OpenAI APIs and stream back responses.
public (abstract) openAIAgent.inference(message)
- message <Message> A transcribed user message to process.
Returns: <Promise<void>>
Implement this to call OpenAI and generate/stream the assistant’s reply.
public openAIAgent.post(message)
- message <Message> Push a user message into the agent. Ignored if message.data is empty. The message UUID is tracked for cancellation.
Returns: <void>
public openAIAgent.dispatchStream(uuid, stream, allowInterrupt?)
- uuid <UUID> The message correlation identifier.
- stream <Stream<OpenAI.Chat.Completions.ChatCompletionChunk>> The OpenAI streaming iterator.
- allowInterrupt <boolean> Whether to allow VAD-driven interruption. Default: true
Returns: <Promise<string>>
Stream assistant tokens to TTS. When allowInterrupt is false, waits for a downstream "message_dispatched" before returning.
public openAIAgent.dispatchMessage(message, allowInterrupt?)
- message <Message> A pre-composed assistant message to play via TTS.
- allowInterrupt <boolean> Whether to allow VAD-driven interruption. Default: true
Returns: <Promise<string>>
Dispatch a complete assistant message to TTS with optional interruption handling.
public openAIAgent.abort()
Returns: <void>
Abort all active messages that are not currently being dispatched; cancels TTS and instructs the VoIP transport to clear state.
public openAIAgent.dispose(err?)
- err <unknown> Optional error to log.
Returns: <void>
Abort any in‑flight OpenAI stream and dispose TTS, STT, and VoIP transports.
public openAIAgent.setTTS(tts)
- tts <TTS> Replacement TTS implementation.
Returns: <void>
Swap the current TTS implementation, updating event wiring.
public openAIAgent.setSTT(stt)
- stt <STT> Replacement STT implementation.
Returns: <void>
Swap the current STT implementation, updating event wiring.
public openAIAgent.activate()
Returns: <void>
Wire up voip → stt (media), stt → agent (messages, vad), and tts → voip (audio). Also subscribes to error and dispatch events.
public openAIAgent.deactivate()
Returns: <void>
Remove event wiring.
- options <TwilioVoIPOpenAIAgentOptions> Extends OpenAIAgentOptions<TwilioVoIP>
  - twilioAccountSid <string> Twilio Account SID used for authenticated media fetch.
  - twilioAuthToken <string> Twilio Auth Token used for authenticated media fetch.
  - system <string> Optional system prompt for conversation history. Default: ""
  - greeting <string> Optional initial assistant greeting. Default: ""
Use a TwilioVoIPOpenAIAgent in order to run an OpenAI‑driven assistant over a Twilio call. It records the call, starts transcription, streams a greeting on connect, collects conversation history, and disposes once recording and transcription are complete.
public twilioVoIPOpenAIAgent.updateMetadata(metadata)
- metadata <TwilioMetadata> Merge updated Twilio metadata.
Returns: <void>
public twilioVoIPOpenAIAgent.activate()
Returns: <void>
Extends OpenAIAgent.activate() by wiring Twilio‑specific events (stream start/stop, recording, transcript) and dispatching the initial greeting.
public twilioVoIPOpenAIAgent.deactivate()
Returns: <void>
Remove Twilio‑specific wiring in addition to base wiring.
- options <DeepgramSTTOptions>
  - apiKey <string> Deepgram API key.
  - liveSchema <LiveSchema> Deepgram live connection options.
  - queueSizeLimit <number> A queueSizeLimit to be passed to the implementation's Mutex constructor.
Use a DeepgramSTT in order to stream audio to Deepgram Live and emit final transcripts. Emits vad on speech boundary messages. Automatically reconnects when needed.
public deepgramSTT.post(message)
- message <Message> Base64-encoded (PCM/Telephony) audio chunk.
Returns: <void>
public deepgramSTT.dispose()
Returns: <void>
Close the underlying connection and remove listeners.
- options <OpenAISTTOptions>
  - apiKey <string> OpenAI API key.
  - session <Session> Realtime transcription session configuration.
  - queueSizeLimit <number> A queueSizeLimit to be passed to the implementation's Mutex constructor.
Use an OpenAISTT in order to stream audio to OpenAI Realtime STT and emit message on completed transcriptions and vad on speech boundary events.
public openaiSTT.post(message)
- message <Message> Base64-encoded audio chunk.
Returns: <void>
public openaiSTT.dispose()
Returns: <void>
Close the WebSocket and remove listeners.
- options <ElevenlabsTTSOptions>
  - voiceId <string> Optional voice identifier. Default: "JBFqnCBsd6RMkjVDRZzb"
  - apiKey <string> ElevenLabs API key.
  - headers <Record<string, string>> Optional additional headers.
  - url <string> Optional override URL for the WebSocket endpoint.
  - queryParameters <Record<string, string>> Optional query parameters appended to the endpoint.
  - timeout <number> Optional timeout in milliseconds to wait for finalization when done is set. If the timeout elapses, a terminal empty chunk is emitted. Default: undefined
  - queueSizeLimit <number> A queueSizeLimit to be passed to the implementation's Mutex constructor.
Use an ElevenlabsTTS in order to stream synthesized audio back as it’s generated. Supports message contexts (UUIDs), incremental text updates, flushing on done, and cancellation.
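A minimal construction sketch using the documented options (the environment variable name is an assumption):

import { ElevenlabsTTS } from "@farar/dialog";

const tts = new ElevenlabsTTS({
  apiKey: process.env.ELEVENLABS_API_KEY ?? "",
  voiceId: "JBFqnCBsd6RMkjVDRZzb", // the documented default voice
  timeout: 5000, // emit a terminal empty chunk if finalization takes longer
});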
public elevenlabsTTS.post(message)
- message <Message> Assistant text to synthesize. When done is true, the current context is closed and finalization is awaited (with optional timeout).
Returns: <void>
public elevenlabsTTS.abort(uuid)
- uuid <UUID> The context to cancel; sends a flush and close if initialized.
Returns: <void>
public elevenlabsTTS.dispose()
Returns: <void>
Close the WebSocket.
- options <CartesiaTTSOptions>
  - apiKey <string> Cartesia API key.
  - speechOptions <Record<string, unknown>> Provider options merged into each request.
  - url <string> Optional override URL for the WebSocket endpoint. Default: "wss://api.cartesia.ai/tts/websocket"
  - headers <Record<string, string>> Optional additional headers merged with required headers.
  - timeout <number> Optional timeout in milliseconds to wait for finalization when done is set. If the timeout elapses, a terminal empty chunk is emitted. Default: undefined
  - queueSizeLimit <number> A queueSizeLimit to be passed to the implementation's Mutex constructor.
Use a CartesiaTTS in order to stream synthesized audio chunks for a given context UUID. Supports cancellation and optional finalization timeouts.
public cartesiaTTS.post(message)
- message <Message> Assistant text to synthesize; when done is true, the provider is instructed to flush and complete the context.
Returns: <void>
public cartesiaTTS.abort(uuid)
- uuid <UUID> The context to cancel.
Returns: <void>
public cartesiaTTS.dispose()
Returns: <void>
Close the WebSocket and remove listeners.
The following classes enable running VoIP handling in a worker thread using the port_agent library.
- options <TwilioVoIPWorkerOptions>
  - worker <Worker> The target worker thread to communicate with.
  - voip <TwilioVoIP> The local TwilioVoIP instance whose events and methods will be bridged.
Use a TwilioVoIPWorker in order to expose TwilioVoIP events and actions to a worker thread. It forwards VoIP events to the worker and registers callables that invoke the corresponding TwilioVoIP methods.
Use a TwilioVoIPProxy in order to consume VoIP events and call VoIP methods from inside a worker thread. It mirrors the VoIP interface and delegates the work to a host TwilioVoIP via the port_agent channel.
public twilioVoIPProxy.post(message)
- message <Message> Post synthesized audio.
Returns: <void>
public twilioVoIPProxy.abort(uuid)
- uuid <UUID> The context to cancel.
Returns: <void>
public twilioVoIPProxy.hangup()
Returns: <void>
public twilioVoIPProxy.transferTo(tel)
- tel <string> A destination telephone number in E.164 format.
Returns: <void>
public twilioVoIPProxy.startRecording()
Returns: <Promise<void>>
public twilioVoIPProxy.stopRecording()
Returns: <Promise<void>>
public twilioVoIPProxy.startTranscript()
Returns: <Promise<void>>
public twilioVoIPProxy.dispose()
Returns: <void>
Helper types for configuring OpenAI Realtime STT sessions and message discrimination.
public Session <object>
- input_audio_format <"pcm16" | "g711_ulaw" | "g711_alaw">
- input_audio_noise_reduction { type: "near_field" | "far_field" } Optional noise reduction.
- input_audio_transcription { model: "gpt-4o-transcribe" | "gpt-4o-mini-transcribe", prompt?: string, language?: string }
- turn_detection { type: "semantic_vad" | "server_vad", threshold?: number, prefix_padding_ms?: number, silence_duration_ms?: number, eagerness?: "low" | "medium" | "high" | "auto" }
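A minimal sketch of a Session configuration for OpenAISTT, using only the fields documented above (the values are illustrative, and the Session type export is an assumption):

import { OpenAISTT, Session } from "@farar/dialog";

const session: Session = {
  input_audio_format: "g711_ulaw", // telephony audio, e.g., from Twilio Media Streams
  input_audio_transcription: { model: "gpt-4o-mini-transcribe", language: "en" },
  turn_detection: { type: "semantic_vad", eagerness: "auto" },
};

const stt = new OpenAISTT({ apiKey: process.env.OPENAI_API_KEY ?? "", session });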
Discriminated unions for WebSocket messages are also provided with type guards:
WebSocketMessage and isCompletedWebSocketMessage, isSpeechStartedWebSocketMessage, isConversationItemCreatedWebSocketMessage.
public OpenAIConversationHistory
<{ role: "system" | "assistant" | "user" | "developer", content: string }[]>
A conversation history array suitable for OpenAI chat APIs.
There are a lot of great VoIP-Agent orchestration implementations out there. This is a selection of implementations that I have experience with.
If you have a feature request or run into any issues, feel free to submit an issue or start a discussion. You’re also welcome to reach out directly to one of the authors.