GH#14044: tighten voice-ai-models.md (147→126 lines) #14092
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -21,6 +21,27 @@ tools: | |
|
|
||
| <!-- AI-CONTEXT-END --> | ||
|
|
||
| ## Decision Flow | ||
|
|
||
| ```text | ||
| Need voice AI? | ||
| ├── Generate speech (TTS) | ||
| │ ├── Voice cloning? → Qwen3-TTS (local) or ElevenLabs (cloud) | ||
| │ ├── Lowest latency? → Cartesia Sonic 3 (cloud) or EdgeTTS (free) | ||
| │ ├── Offline? → Piper (CPU) or Qwen3-TTS (GPU) | ||
| │ └── Default → EdgeTTS (free, good quality) | ||
| ├── Transcribe speech (STT) | ||
| │ ├── Real-time? → Deepgram Nova (cloud) or faster-whisper (local) | ||
| │ ├── Best accuracy? → ElevenLabs Scribe (cloud) or Large v3 (local) | ||
| │ ├── Free? → Groq free tier (cloud) or any local model | ||
| │ └── Default → Whisper Large v3 Turbo (local) | ||
| └── Conversational (S2S) | ||
| ├── Cloud OK? → GPT-4o Realtime (see cloud-voice-agents.md) | ||
| ├── Enterprise/on-prem? → NVIDIA Riva (Parakeet + LLM + Magpie) | ||
| ├── Local/private? → MiniCPM-o 2.6 or cascaded pipeline | ||
| └── Default → speech-to-speech.md cascaded pipeline | ||
| ``` | ||
|
|
||
| ## TTS (Text-to-Speech) | ||
|
|
||
| ### Cloud | ||
|
|
@@ -33,20 +54,16 @@ tools: | |
| | NVIDIA Magpie TTS | ~200ms | Great | Yes (zero-shot) | 17+ | NIM API (free tier) | | ||
| | Google Cloud TTS | ~200ms | Good | No (custom) | 50+ | $4-16/1M chars | | ||
|
|
||
| Pick: ElevenLabs → quality/cloning. Cartesia → lowest latency. NVIDIA Magpie → enterprise/self-hosted. Google → language breadth. | ||
|
|
||
| ### Local | ||
|
|
||
| | Model | Params | License | Languages | Voice Clone | VRAM | | ||
| |-------|--------|---------|-----------|-------------|------| | ||
| | Qwen3-TTS 0.6B | 0.6B | Apache-2.0 | 10 | Yes (5s ref) | 2GB | | ||
| | Qwen3-TTS 1.7B | 1.7B | Apache-2.0 | 10 | Yes (5s ref) | 4GB | | ||
| | Bark (Suno) | 1.0B | MIT | 13+ | Yes (prompt) | 6GB (stale) | | ||
| | Bark (Suno) | 1.0B | MIT | 13+ | Yes (prompt) | 6GB (stale, expressive: laughter/music) | | ||
| | Coqui TTS | varies | MPL-2.0 | 20+ | Yes | 2-6GB | | ||
| | Piper | <100M | MIT | 30+ | No | CPU only | | ||
|
|
||
| Pick: Qwen3-TTS → quality + cloning. Piper → CPU-only/embedded. Bark → expressiveness (laughter, music). | ||
|
|
||
| Also available: EdgeTTS (free, 300+ voices), macOS Say (zero deps), FacebookMMS (1100+ languages). See `voice-models.md`. | ||
|
|
||
| ## STT (Speech-to-Text) | ||
|
|
@@ -61,8 +78,6 @@ Also available: EdgeTTS (free, 300+ voices), macOS Say (zero deps), FacebookMMS | |
| | Deepgram | Nova-2 / Nova-3 | 9.5-9.6 | Yes | Per minute | | ||
| | Soniox | stt-async-v3 | 9.6 | Yes | Per minute | | ||
|
|
||
| Pick: Groq → free/fast batch. ElevenLabs Scribe → accuracy. NVIDIA Parakeet → enterprise/self-hosted. Deepgram → real-time streaming. | ||
|
|
||
| ### Local | ||
|
|
||
| | Model | Size | Accuracy | Speed | VRAM | | ||
|
|
@@ -76,8 +91,6 @@ Pick: Groq → free/fast batch. ElevenLabs Scribe → accuracy. NVIDIA Parakeet | |
| | NVIDIA Parakeet V3 | 0.6B | 9.6 | Fastest | 2GB | | ||
| | Apple Speech | Built-in | 9.0 | Fast | On-device | | ||
|
|
||
| Pick: Large v3 Turbo → best balance. Parakeet V3 → multilingual speed (25 langs). Parakeet V2 → English-only. Apple Speech → zero-setup macOS 26+. | ||
|
|
||
| Backends: `faster-whisper` (4x speed, recommended), `whisper.cpp` (C++ native, Apple Silicon optimized). See `transcription.md`. | ||
|
|
||
| ## S2S (Speech-to-Speech) | ||
|
|
@@ -94,51 +107,17 @@ Backends: `faster-whisper` (4x speed, recommended), `whisper.cpp` (C++ native, A | |
|
|
||
| ### NVIDIA Riva Composable Pipelines | ||
|
|
||
| | Component | Model | Role | Languages | NIM Available | | ||
| |-----------|-------|------|-----------|---------------| | ||
| | ASR | Parakeet TDT 0.6B v2 | Speech-to-text | English | HF (research) | | ||
| | ASR | Parakeet CTC 1.1B | Speech-to-text | English | Yes | | ||
| | ASR | Parakeet RNNT 1.1B | Speech-to-text | 25 languages | Yes | | ||
| | TTS | Magpie TTS Multilingual | Text-to-speech | 17+ languages | Yes | | ||
| | TTS | Magpie TTS Zero-Shot | Voice cloning TTS | English+ | API | | ||
| | Enhancement | StudioVoice | Noise removal | Any | Yes | | ||
| | Translation | Riva Translate | NMT | 36 languages | Yes | | ||
|
|
||
| Compose as: `Audio -> [Parakeet ASR] -> [Any LLM] -> [Magpie TTS] -> Audio`. See `cloud-voice-agents.md`. | ||
|
|
||
| Pick: GPT-4o Realtime → production cloud (lowest latency, GA). MiniCPM-o 2.6 → self-hosted/private (Apache-2.0, multimodal). NVIDIA Riva → enterprise on-prem (composable, 25+ languages). Cascaded S2S (VAD+STT+LLM+TTS): see `speech-to-speech.md`. | ||
|
|
||
| ## Selection by Priority | ||
|
|
||
| | Priority | TTS | STT | S2S | | ||
| |----------|-----|-----|-----| | ||
| | Quality | ElevenLabs / Qwen3-TTS 1.7B | ElevenLabs Scribe / Large v3 | GPT-4o Realtime | | ||
| | Speed | Cartesia Sonic 3 / EdgeTTS | Groq / Parakeet V3 | GPT-4o Realtime / Cascaded | | ||
| | Cost | EdgeTTS (free) / Piper | Local Whisper ($0) / Groq free | MiniCPM-o 2.6 (local) | | ||
| | Privacy | Piper / Qwen3-TTS | faster-whisper / whisper.cpp | MiniCPM-o 2.6 | | ||
| | Enterprise | NVIDIA Magpie / ElevenLabs | NVIDIA Parakeet / Scribe | NVIDIA Riva pipeline | | ||
| | Voice clone | ElevenLabs / Qwen3-TTS | N/A | MiniCPM-o 2.6 | | ||
|
|
||
| ### Decision Flow | ||
|
|
||
| ```text | ||
| Need voice AI? | ||
| ├── Generate speech (TTS) | ||
| │ ├── Voice cloning? → Qwen3-TTS (local) or ElevenLabs (cloud) | ||
| │ ├── Lowest latency? → Cartesia Sonic 3 (cloud) or EdgeTTS (free) | ||
| │ ├── Offline? → Piper (CPU) or Qwen3-TTS (GPU) | ||
| │ └── Default → EdgeTTS (free, good quality) | ||
| ├── Transcribe speech (STT) | ||
| │ ├── Real-time? → Deepgram Nova (cloud) or faster-whisper (local) | ||
| │ ├── Best accuracy? → ElevenLabs Scribe (cloud) or Large v3 (local) | ||
| │ ├── Free? → Groq free tier (cloud) or any local model | ||
| │ └── Default → Whisper Large v3 Turbo (local) | ||
| └── Conversational (S2S) | ||
| ├── Cloud OK? → GPT-4o Realtime (see cloud-voice-agents.md) | ||
| ├── Enterprise/on-prem? → NVIDIA Riva (Parakeet + LLM + Magpie) | ||
| ├── Local/private? → MiniCPM-o 2.6 or cascaded pipeline | ||
| └── Default → speech-to-speech.md cascaded pipeline | ||
| ``` | ||
| | Component | Model | Languages | NIM | | ||
| |-----------|-------|-----------|-----| | ||
| | ASR | Parakeet TDT 0.6B v2 | English | HF (research) | | ||
| | ASR | Parakeet CTC 1.1B | English | Yes | | ||
| | ASR | Parakeet RNNT 1.1B | 25 | Yes | | ||
| | TTS | Magpie Multilingual | 17+ | Yes | | ||
| | TTS | Magpie Zero-Shot | English+ | API | | ||
| | Enhancement | StudioVoice | Any | Yes | | ||
| | Translation | Riva Translate | 36 | Yes | | ||
|
|
||
| Pipeline: `Audio -> [Parakeet ASR] -> [Any LLM] -> [Magpie TTS] -> Audio`. See `cloud-voice-agents.md`. Cascaded S2S (VAD+STT+LLM+TTS): see `speech-to-speech.md`. | ||
|
Comment on lines +110 to +120
Contributor
Line 110–120: This conflicts with the reference-corpus strategy from the linked issue. The section is compressed in-place, but the issue objective for reference corpora asks for extraction into chapter files plus a slim index rather than content compression. Please align this section (and likely the doc structure) to that strategy before merge.
Proposed structural direction:
-## NVIDIA Riva Composable Pipelines
-| Component | Model | Languages | NIM |
-...
-Pipeline: `Audio -> [Parakeet ASR] -> [Any LLM] -> [Magpie TTS] -> Audio`. See `cloud-voice-agents.md`. Cascaded S2S (VAD+STT+LLM+TTS): see `speech-to-speech.md`.
+## NVIDIA Riva Composable Pipelines
+High-level index only. Detailed matrix moved to `tools/voice/voice-ai-models-riva.md`.
+Pipeline overview: `Audio -> [Parakeet ASR] -> [Any LLM] -> [Magpie TTS] -> Audio`.
+See:
+- `tools/voice/voice-ai-models-riva.md` (full component matrix)
+- `tools/voice/cloud-voice-agents.md`
+- `tools/voice/speech-to-speech.md` |
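For readers of the slim index, the one-line pipeline overview may be easier to follow as code. The sketch below only illustrates the composition order `Audio -> ASR -> LLM -> TTS -> Audio`. The type aliases, `compose_pipeline`, and the `fake_*` stages are hypothetical stand-ins invented for this example, not Riva, NIM, or Parakeet/Magpie APIs.

```python
from typing import Callable

# Hypothetical stage signatures (assumptions for illustration only;
# these are not Riva or NIM client interfaces).
Asr = Callable[[bytes], str]   # audio in, transcript out
Llm = Callable[[str], str]     # transcript in, reply text out
Tts = Callable[[str], bytes]   # reply text in, audio out


def compose_pipeline(asr: Asr, llm: Llm, tts: Tts) -> Callable[[bytes], bytes]:
    """Chain the three stages into one Audio -> Audio callable."""
    def run(audio_in: bytes) -> bytes:
        transcript = asr(audio_in)   # speech-to-text stage
        reply = llm(transcript)      # any LLM can sit in the middle
        return tts(reply)            # text-to-speech stage
    return run


# Stand-in stages so the sketch runs without any model or network access.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"echo: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = compose_pipeline(fake_asr, fake_llm, fake_tts)
```

Because each stage is just a callable, swapping one component for another (for example a different ASR model from the component matrix) would not change the composition itself.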
||
|
|
||
| ## GPU Planning | ||
|
|
||
|
|
||
Removing this line causes a loss of important information about language support for `Parakeet V2` (English-only) and `Parakeet V3` (multilingual), as well as the OS dependency for `Apple Speech`. This information is not present in the local STT table above or in the new Decision Flow. To adhere to the goal of 'Zero knowledge loss' and the project's practice of maintaining detailed explanations for key technical components, please consider adding this information to the local STT table before removing this summary.
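To make concrete what the removed "Pick:" line encoded, here is a minimal sketch of the local STT choice as a function. The model names and constraints (25 languages for Parakeet V3, English-only for Parakeet V2, macOS 26+ for Apple Speech, Large v3 Turbo as the balanced default) come from the removed line; the function name `pick_local_stt`, its flags, and the order in which the conditions are checked are assumptions made for illustration.

```python
def pick_local_stt(
    *,
    need_speed: bool = False,
    multilingual: bool = False,
    zero_setup_macos: bool = False,
) -> str:
    """Return a local STT model name for the given constraints.

    Hypothetical helper: the precedence of the checks is an assumption,
    not something the document specifies.
    """
    if zero_setup_macos:
        # Apple Speech is built in but requires macOS 26+.
        return "Apple Speech"
    if need_speed:
        # Parakeet V3 covers 25 languages; Parakeet V2 is English-only.
        return "Parakeet V3" if multilingual else "Parakeet V2"
    # Default from the removed line: best balance.
    return "Whisper Large v3 Turbo"
```

The language and OS conditions in the comments are exactly the details that neither the local STT table nor the new Decision Flow currently carries.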