From b473f3188cfd66470515f135cbd368ca3028a11f Mon Sep 17 00:00:00 2001 From: Alexey <1556417+alex-solovyev@users.noreply.github.com> Date: Mon, 30 Mar 2026 12:36:15 +0200 Subject: [PATCH 1/2] =?UTF-8?q?docs:=20tighten=20voice-ai-models.md=20(147?= =?UTF-8?q?=E2=86=92126=20lines)=20=E2=80=94=20deduplicate=20selection=20g?= =?UTF-8?q?uidance?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Remove 4 Pick: lines and Selection by Priority table, both fully redundant with the Decision Flow tree. Move Decision Flow to top (primacy effect). Compress Riva table (remove redundant Role column, shorten headers). Preserve Bark expressiveness note in table. Zero knowledge loss: all 28 models, all specs, all cross-references, GPU planning, and decision paths retained. --- .agents/tools/voice/voice-ai-models.md | 87 ++++++++++---------------- 1 file changed, 33 insertions(+), 54 deletions(-) diff --git a/.agents/tools/voice/voice-ai-models.md b/.agents/tools/voice/voice-ai-models.md index 52977b33ab..7a367b781b 100644 --- a/.agents/tools/voice/voice-ai-models.md +++ b/.agents/tools/voice/voice-ai-models.md @@ -21,6 +21,27 @@ tools: +## Decision Flow + +```text +Need voice AI? +├── Generate speech (TTS) +│ ├── Voice cloning? → Qwen3-TTS (local) or ElevenLabs (cloud) +│ ├── Lowest latency? → Cartesia Sonic 3 (cloud) or EdgeTTS (free) +│ ├── Offline? → Piper (CPU) or Qwen3-TTS (GPU) +│ └── Default → EdgeTTS (free, good quality) +├── Transcribe speech (STT) +│ ├── Real-time? → Deepgram Nova (cloud) or faster-whisper (local) +│ ├── Best accuracy? → ElevenLabs Scribe (cloud) or Large v3 (local) +│ ├── Free? → Groq free tier (cloud) or any local model +│ └── Default → Whisper Large v3 Turbo (local) +└── Conversational (S2S) + ├── Cloud OK? → GPT-4o Realtime (see cloud-voice-agents.md) + ├── Enterprise/on-prem? → NVIDIA Riva (Parakeet + LLM + Magpie) + ├── Local/private? → MiniCPM-o 2.6 or cascaded pipeline + └── Default → speech-to-speech.md cascaded pipeline +``` + ## TTS (Text-to-Speech) ### Cloud @@ -33,20 +54,16 @@ tools: | NVIDIA Magpie TTS | ~200ms | Great | Yes (zero-shot) | 17+ | NIM API (free tier) | | Google Cloud TTS | ~200ms | Good | No (custom) | 50+ | $4-16/1M chars | -Pick: ElevenLabs → quality/cloning. Cartesia → lowest latency. NVIDIA Magpie → enterprise/self-hosted. Google → language breadth. - ### Local | Model | Params | License | Languages | Voice Clone | VRAM | |-------|--------|---------|-----------|-------------|------| | Qwen3-TTS 0.6B | 0.6B | Apache-2.0 | 10 | Yes (5s ref) | 2GB | | Qwen3-TTS 1.7B | 1.7B | Apache-2.0 | 10 | Yes (5s ref) | 4GB | -| Bark (Suno) | 1.0B | MIT | 13+ | Yes (prompt) | 6GB (stale) | +| Bark (Suno) | 1.0B | MIT | 13+ | Yes (prompt) | 6GB (stale, expressive: laughter/music) | | Coqui TTS | varies | MPL-2.0 | 20+ | Yes | 2-6GB | | Piper | <100M | MIT | 30+ | No | CPU only | -Pick: Qwen3-TTS → quality + cloning. Piper → CPU-only/embedded. Bark → expressiveness (laughter, music). - Also available: EdgeTTS (free, 300+ voices), macOS Say (zero deps), FacebookMMS (1100+ languages). See `voice-models.md`. ## STT (Speech-to-Text) @@ -61,8 +78,6 @@ Also available: EdgeTTS (free, 300+ voices), macOS Say (zero deps), FacebookMMS | Deepgram | Nova-2 / Nova-3 | 9.5-9.6 | Yes | Per minute | | Soniox | stt-async-v3 | 9.6 | Yes | Per minute | -Pick: Groq → free/fast batch. ElevenLabs Scribe → accuracy. NVIDIA Parakeet → enterprise/self-hosted. Deepgram → real-time streaming. - ### Local | Model | Size | Accuracy | Speed | VRAM | @@ -76,8 +91,6 @@ Pick: Groq → free/fast batch. ElevenLabs Scribe → accuracy. NVIDIA Parakeet | NVIDIA Parakeet V3 | 0.6B | 9.6 | Fastest | 2GB | | Apple Speech | Built-in | 9.0 | Fast | On-device | -Pick: Large v3 Turbo → best balance. Parakeet V3 → multilingual speed (25 langs). Parakeet V2 → English-only. Apple Speech → zero-setup macOS 26+. - Backends: `faster-whisper` (4x speed, recommended), `whisper.cpp` (C++ native, Apple Silicon optimized). See `transcription.md`. ## S2S (Speech-to-Speech) @@ -94,51 +107,17 @@ Backends: `faster-whisper` (4x speed, recommended), `whisper.cpp` (C++ native, A ### NVIDIA Riva Composable Pipelines -| Component | Model | Role | Languages | NIM Available | -|-----------|-------|------|-----------|---------------| -| ASR | Parakeet TDT 0.6B v2 | Speech-to-text | English | HF (research) | -| ASR | Parakeet CTC 1.1B | Speech-to-text | English | Yes | -| ASR | Parakeet RNNT 1.1B | Speech-to-text | 25 languages | Yes | -| TTS | Magpie TTS Multilingual | Text-to-speech | 17+ languages | Yes | -| TTS | Magpie TTS Zero-Shot | Voice cloning TTS | English+ | API | -| Enhancement | StudioVoice | Noise removal | Any | Yes | -| Translation | Riva Translate | NMT | 36 languages | Yes | - -Compose as: `Audio -> [Parakeet ASR] -> [Any LLM] -> [Magpie TTS] -> Audio`. See `cloud-voice-agents.md`. - -Pick: GPT-4o Realtime → production cloud (lowest latency, GA). MiniCPM-o 2.6 → self-hosted/private (Apache-2.0, multimodal). NVIDIA Riva → enterprise on-prem (composable, 25+ languages). Cascaded S2S (VAD+STT+LLM+TTS): see `speech-to-speech.md`. - -## Selection by Priority - -| Priority | TTS | STT | S2S | -|----------|-----|-----|-----| -| Quality | ElevenLabs / Qwen3-TTS 1.7B | ElevenLabs Scribe / Large v3 | GPT-4o Realtime | -| Speed | Cartesia Sonic 3 / EdgeTTS | Groq / Parakeet V3 | GPT-4o Realtime / Cascaded | -| Cost | EdgeTTS (free) / Piper | Local Whisper ($0) / Groq free | MiniCPM-o 2.6 (local) | -| Privacy | Piper / Qwen3-TTS | faster-whisper / whisper.cpp | MiniCPM-o 2.6 | -| Enterprise | NVIDIA Magpie / ElevenLabs | NVIDIA Parakeet / Scribe | NVIDIA Riva pipeline | -| Voice clone | ElevenLabs / Qwen3-TTS | N/A | MiniCPM-o 2.6 | - -### Decision Flow - -```text -Need voice AI? -├── Generate speech (TTS) -│ ├── Voice cloning? → Qwen3-TTS (local) or ElevenLabs (cloud) -│ ├── Lowest latency? → Cartesia Sonic 3 (cloud) or EdgeTTS (free) -│ ├── Offline? → Piper (CPU) or Qwen3-TTS (GPU) -│ └── Default → EdgeTTS (free, good quality) -├── Transcribe speech (STT) -│ ├── Real-time? → Deepgram Nova (cloud) or faster-whisper (local) -│ ├── Best accuracy? → ElevenLabs Scribe (cloud) or Large v3 (local) -│ ├── Free? → Groq free tier (cloud) or any local model -│ └── Default → Whisper Large v3 Turbo (local) -└── Conversational (S2S) - ├── Cloud OK? → GPT-4o Realtime (see cloud-voice-agents.md) - ├── Enterprise/on-prem? → NVIDIA Riva (Parakeet + LLM + Magpie) - ├── Local/private? → MiniCPM-o 2.6 or cascaded pipeline - └── Default → speech-to-speech.md cascaded pipeline -``` +| Component | Model | Languages | NIM | +|-----------|-------|-----------|-----| +| ASR | Parakeet TDT 0.6B v2 | English | HF (research) | +| ASR | Parakeet CTC 1.1B | English | Yes | +| ASR | Parakeet RNNT 1.1B | 25 | Yes | +| TTS | Magpie Multilingual | 17+ | Yes | +| TTS | Magpie Zero-Shot | English+ | API | +| Enhancement | StudioVoice | Any | Yes | +| Translation | Riva Translate | 36 | Yes | + +Pipeline: `Audio -> [Parakeet ASR] -> [Any LLM] -> [Magpie TTS] -> Audio`. See `cloud-voice-agents.md`. Cascaded S2S (VAD+STT+LLM+TTS): see `speech-to-speech.md`. ## GPU Planning From 25eed294083f7e403d03b7993f31d46d8f6bb8c5 Mon Sep 17 00:00:00 2001 From: Alexey <1556417+alex-solovyev@users.noreply.github.com> Date: Mon, 30 Mar 2026 12:41:19 +0200 Subject: [PATCH 2/2] docs: restore Parakeet language counts and Apple Speech OS note in STT table Address Gemini review: the removed Pick line had details (Parakeet V3 25 langs, Parakeet V2 English-only, Apple Speech macOS 26+) not present in the table. Add these as parenthetical notes in the VRAM column. --- .agents/tools/voice/voice-ai-models.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/.agents/tools/voice/voice-ai-models.md b/.agents/tools/voice/voice-ai-models.md index 7a367b781b..3646834349 100644 --- a/.agents/tools/voice/voice-ai-models.md +++ b/.agents/tools/voice/voice-ai-models.md @@ -87,9 +87,9 @@ Also available: EdgeTTS (free, 300+ voices), macOS Say (zero deps), FacebookMMS | Whisper Small | 461MB | 8.5 | Medium | 2GB | | Whisper Large v3 | 2.9GB | 9.8 | Slow | 10GB | | Whisper Large v3 Turbo | 1.5GB | 9.7 | Fast | 5GB | -| NVIDIA Parakeet V2 | 0.6B | 9.4 | Fastest | 2GB | -| NVIDIA Parakeet V3 | 0.6B | 9.6 | Fastest | 2GB | -| Apple Speech | Built-in | 9.0 | Fast | On-device | +| NVIDIA Parakeet V2 | 0.6B | 9.4 | Fastest | 2GB (English-only) | +| NVIDIA Parakeet V3 | 0.6B | 9.6 | Fastest | 2GB (25 langs) | +| Apple Speech | Built-in | 9.0 | Fast | On-device (macOS 26+) | Backends: `faster-whisper` (4x speed, recommended), `whisper.cpp` (C++ native, Apple Silicon optimized). See `transcription.md`.