Merged
Changes from 1 commit
87 changes: 33 additions & 54 deletions .agents/tools/voice/voice-ai-models.md
@@ -21,6 +21,27 @@ tools:

<!-- AI-CONTEXT-END -->

## Decision Flow

```text
Need voice AI?
├── Generate speech (TTS)
│ ├── Voice cloning? → Qwen3-TTS (local) or ElevenLabs (cloud)
│ ├── Lowest latency? → Cartesia Sonic 3 (cloud) or EdgeTTS (free)
│ ├── Offline? → Piper (CPU) or Qwen3-TTS (GPU)
│ └── Default → EdgeTTS (free, good quality)
├── Transcribe speech (STT)
│ ├── Real-time? → Deepgram Nova (cloud) or faster-whisper (local)
│ ├── Best accuracy? → ElevenLabs Scribe (cloud) or Large v3 (local)
│ ├── Free? → Groq free tier (cloud) or any local model
│ └── Default → Whisper Large v3 Turbo (local)
└── Conversational (S2S)
├── Cloud OK? → GPT-4o Realtime (see cloud-voice-agents.md)
├── Enterprise/on-prem? → NVIDIA Riva (Parakeet + LLM + Magpie)
├── Local/private? → MiniCPM-o 2.6 or cascaded pipeline
└── Default → speech-to-speech.md cascaded pipeline
```
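For scripting, the tree reads naturally as a pair of selector functions. A sketch that mirrors the branches above; the returned strings are labels from the tree, not package names or APIs:

```python
def pick_tts(voice_clone=False, low_latency=False, offline=False):
    """Mirror the TTS branch of the decision tree above."""
    if voice_clone:
        return ["Qwen3-TTS (local)", "ElevenLabs (cloud)"]
    if low_latency:
        return ["Cartesia Sonic 3 (cloud)", "EdgeTTS (free)"]
    if offline:
        return ["Piper (CPU)", "Qwen3-TTS (GPU)"]
    return ["EdgeTTS (free, good quality)"]


def pick_stt(realtime=False, best_accuracy=False, free=False):
    """Mirror the STT branch of the decision tree above."""
    if realtime:
        return ["Deepgram Nova (cloud)", "faster-whisper (local)"]
    if best_accuracy:
        return ["ElevenLabs Scribe (cloud)", "Whisper Large v3 (local)"]
    if free:
        return ["Groq free tier (cloud)", "any local model"]
    return ["Whisper Large v3 Turbo (local)"]
```

Flags are checked in the same order as the tree, so the first matching constraint wins.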

## TTS (Text-to-Speech)

### Cloud
@@ -33,20 +54,16 @@ tools:
| NVIDIA Magpie TTS | ~200ms | Great | Yes (zero-shot) | 17+ | NIM API (free tier) |
| Google Cloud TTS | ~200ms | Good | No (custom) | 50+ | $4-16/1M chars |

Pick: ElevenLabs → quality/cloning. Cartesia → lowest latency. NVIDIA Magpie → enterprise/self-hosted. Google → language breadth.

### Local

| Model | Params | License | Languages | Voice Clone | VRAM |
|-------|--------|---------|-----------|-------------|------|
| Qwen3-TTS 0.6B | 0.6B | Apache-2.0 | 10 | Yes (5s ref) | 2GB |
| Qwen3-TTS 1.7B | 1.7B | Apache-2.0 | 10 | Yes (5s ref) | 4GB |
| Bark (Suno) | 1.0B | MIT | 13+ | Yes (prompt) | 6GB (stale, expressive: laughter/music) |
| Coqui TTS | varies | MPL-2.0 | 20+ | Yes | 2-6GB |
| Piper | <100M | MIT | 30+ | No | CPU only |

Pick: Qwen3-TTS → quality + cloning. Piper → CPU-only/embedded. Bark → expressiveness (laughter, music).

Also available: EdgeTTS (free, 300+ voices), macOS Say (zero deps), FacebookMMS (1100+ languages). See `voice-models.md`.
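As a concrete starting point, a minimal EdgeTTS sketch. It assumes the `edge-tts` package; EdgeTTS expects speaking rate as a signed percent string (e.g. `"+10%"`), which the pure helper below produces from a plain multiplier:

```python
def edge_rate(multiplier: float) -> str:
    """Convert a speed multiplier to the signed-percent string edge-tts
    expects for its `rate` option, e.g. 1.1 -> "+10%", 0.85 -> "-15%"."""
    return f"{round((multiplier - 1) * 100):+d}%"


async def speak(text: str, out_path: str = "out.mp3",
                voice: str = "en-US-AriaNeural", speed: float = 1.0) -> None:
    """Synthesize `text` to an mp3 file via Microsoft's free EdgeTTS service."""
    # Lazy import so the helper above stays dependency-free.
    import edge_tts  # pip install edge-tts

    communicate = edge_tts.Communicate(text, voice, rate=edge_rate(speed))
    await communicate.save(out_path)

# Run with (requires network):
#   import asyncio; asyncio.run(speak("Hello from EdgeTTS"))
```

`speak` is a thin wrapper; swapping `voice` is how you pick among the 300+ available voices.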

## STT (Speech-to-Text)
@@ -61,8 +78,6 @@
| Deepgram | Nova-2 / Nova-3 | 9.5-9.6 | Yes | Per minute |
| Soniox | stt-async-v3 | 9.6 | Yes | Per minute |

Pick: Groq → free/fast batch. ElevenLabs Scribe → accuracy. NVIDIA Parakeet → enterprise/self-hosted. Deepgram → real-time streaming.

### Local

| Model | Size | Accuracy | Speed | VRAM |
@@ -76,8 +91,6 @@
| NVIDIA Parakeet V3 | 0.6B | 9.6 | Fastest | 2GB |
| Apple Speech | Built-in | 9.0 | Fast | On-device |

Pick: Large v3 Turbo → best balance. Parakeet V3 → multilingual speed (25 langs). Parakeet V2 → English-only. Apple Speech → zero-setup macOS 26+.
> **Review comment (medium):** Removing this summary line loses information not captured elsewhere: Parakeet V2 is English-only, Parakeet V3 is multilingual (25 languages), and Apple Speech requires macOS 26+. Neither the local STT table above nor the new Decision Flow carries these details. To honor "zero knowledge loss" and the project's practice of keeping detailed explanations for key technical components, fold them into the local STT table before removing the summary.

Backends: `faster-whisper` (4x speed, recommended), `whisper.cpp` (C++ native, Apple Silicon optimized). See `transcription.md`.
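A minimal faster-whisper sketch that writes SRT output. It assumes the `faster-whisper` package, and the `"large-v3-turbo"` model name is an assumption — substitute any size the backend supports; the timestamp helper is plain stdlib:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3.5 -> "00:00:03,500"."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def transcribe_to_srt(audio_path: str, model_size: str = "large-v3-turbo") -> str:
    """Transcribe an audio file and return SRT-formatted subtitles."""
    # Lazy import: pip install faster-whisper
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="auto")
    segments, _info = model.transcribe(audio_path, beam_size=5)
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(blocks)
```

`device="auto"` lets faster-whisper pick CUDA when available and fall back to CPU otherwise.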

## S2S (Speech-to-Speech)
@@ -94,51 +107,17 @@

### NVIDIA Riva Composable Pipelines

| Component | Model | Role | Languages | NIM Available |
|-----------|-------|------|-----------|---------------|
| ASR | Parakeet TDT 0.6B v2 | Speech-to-text | English | HF (research) |
| ASR | Parakeet CTC 1.1B | Speech-to-text | English | Yes |
| ASR | Parakeet RNNT 1.1B | Speech-to-text | 25 languages | Yes |
| TTS | Magpie TTS Multilingual | Text-to-speech | 17+ languages | Yes |
| TTS | Magpie TTS Zero-Shot | Voice cloning TTS | English+ | API |
| Enhancement | StudioVoice | Noise removal | Any | Yes |
| Translation | Riva Translate | NMT | 36 languages | Yes |

Compose as: `Audio -> [Parakeet ASR] -> [Any LLM] -> [Magpie TTS] -> Audio`. See `cloud-voice-agents.md`.

Pick: GPT-4o Realtime → production cloud (lowest latency, GA). MiniCPM-o 2.6 → self-hosted/private (Apache-2.0, multimodal). NVIDIA Riva → enterprise on-prem (composable, 25+ languages). Cascaded S2S (VAD+STT+LLM+TTS): see `speech-to-speech.md`.
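Whatever the stage choices, a cascaded pipeline is just function composition. A dependency-free skeleton with swappable stages (the lambdas are stand-ins, not real engines):

```python
from typing import Callable


def cascaded_s2s(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """Audio -> [ASR] -> [LLM] -> [TTS] -> Audio, each stage injectable
    (e.g. Parakeet / any LLM / Magpie, or faster-whisper / local LLM / Piper)."""
    transcript = stt(audio)
    reply = llm(transcript)
    return tts(reply)


# Wiring check with stand-in stages:
out = cascaded_s2s(
    b"\x00\x01",
    stt=lambda a: "hello",
    llm=lambda t: t.upper(),
    tts=lambda t: t.encode(),
)
# out == b"HELLO"
```

Because each stage is a plain callable, the same skeleton serves cloud, local, or mixed deployments.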

## Selection by Priority

| Priority | TTS | STT | S2S |
|----------|-----|-----|-----|
| Quality | ElevenLabs / Qwen3-TTS 1.7B | ElevenLabs Scribe / Large v3 | GPT-4o Realtime |
| Speed | Cartesia Sonic 3 / EdgeTTS | Groq / Parakeet V3 | GPT-4o Realtime / Cascaded |
| Cost | EdgeTTS (free) / Piper | Local Whisper ($0) / Groq free | MiniCPM-o 2.6 (local) |
| Privacy | Piper / Qwen3-TTS | faster-whisper / whisper.cpp | MiniCPM-o 2.6 |
| Enterprise | NVIDIA Magpie / ElevenLabs | NVIDIA Parakeet / Scribe | NVIDIA Riva pipeline |
| Voice clone | ElevenLabs / Qwen3-TTS | N/A | MiniCPM-o 2.6 |

> **Review comment on lines +110 to +120 — ⚠️ Potential issue | 🟠 Major**
>
> This conflicts with the reference-corpus strategy from the linked issue. The section is compressed in-place, but the issue objective for reference corpora asks for extraction into chapter files plus a slim index, not content compression. Please align this section (and likely the doc structure) with that strategy before merge.
>
> Proposed structural direction:
>
> ```diff
> -## NVIDIA Riva Composable Pipelines
> -| Component | Model | Languages | NIM |
> -...
> -Pipeline: `Audio -> [Parakeet ASR] -> [Any LLM] -> [Magpie TTS] -> Audio`. See `cloud-voice-agents.md`. Cascaded S2S (VAD+STT+LLM+TTS): see `speech-to-speech.md`.
> +## NVIDIA Riva Composable Pipelines
> +High-level index only. Detailed matrix moved to `tools/voice/voice-ai-models-riva.md`.
> +Pipeline overview: `Audio -> [Parakeet ASR] -> [Any LLM] -> [Magpie TTS] -> Audio`.
> +See:
> +- `tools/voice/voice-ai-models-riva.md` (full component matrix)
> +- `tools/voice/cloud-voice-agents.md`
> +- `tools/voice/speech-to-speech.md`
> ```


## GPU Planning
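A rough co-residency check against the VRAM figures quoted in the tables above. A sketch only: the Whisper figure is an assumed ballpark (it is not in the visible tables), and real usage varies with batch size and precision:

```python
# VRAM figures (GB) taken from the model tables above, except
# whisper-large-v3-turbo, which is an assumed ballpark.
VRAM_GB = {
    "qwen3-tts-0.6b": 2,
    "qwen3-tts-1.7b": 4,
    "bark": 6,
    "parakeet-v3": 2,
    "whisper-large-v3-turbo": 6,
}


def fits(models: list[str], gpu_gb: float, headroom: float = 1.2) -> bool:
    """Do these models fit on one GPU together, with ~20% headroom
    for activations and audio buffers?"""
    need = sum(VRAM_GB[m] for m in models) * headroom
    return need <= gpu_gb
```

For example, pairing Qwen3-TTS 1.7B with Parakeet V3 fits an 8GB card, while adding Bark to the mix does not.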
