diff --git a/.agents/tools/voice/voice-ai-models.md b/.agents/tools/voice/voice-ai-models.md
index 52977b33ab..3646834349 100644
--- a/.agents/tools/voice/voice-ai-models.md
+++ b/.agents/tools/voice/voice-ai-models.md
@@ -21,6 +21,27 @@ tools:
+## Decision Flow
+
+```text
+Need voice AI?
+├── Generate speech (TTS)
+│   ├── Voice cloning? → Qwen3-TTS (local) or ElevenLabs (cloud)
+│   ├── Lowest latency? → Cartesia Sonic 3 (cloud) or EdgeTTS (free)
+│   ├── Offline? → Piper (CPU) or Qwen3-TTS (GPU)
+│   └── Default → EdgeTTS (free, good quality)
+├── Transcribe speech (STT)
+│   ├── Real-time? → Deepgram Nova (cloud) or faster-whisper (local)
+│   ├── Best accuracy? → ElevenLabs Scribe (cloud) or Large v3 (local)
+│   ├── Free? → Groq free tier (cloud) or any local model
+│   └── Default → Whisper Large v3 Turbo (local)
+└── Conversational (S2S)
+    ├── Cloud OK? → GPT-4o Realtime (see cloud-voice-agents.md)
+    ├── Enterprise/on-prem? → NVIDIA Riva (Parakeet + LLM + Magpie)
+    ├── Local/private? → MiniCPM-o 2.6 or cascaded pipeline
+    └── Default → speech-to-speech.md cascaded pipeline
+```
+
 ## TTS (Text-to-Speech)

 ### Cloud
@@ -33,20 +54,16 @@ tools:
 | NVIDIA Magpie TTS | ~200ms | Great | Yes (zero-shot) | 17+ | NIM API (free tier) |
 | Google Cloud TTS | ~200ms | Good | No (custom) | 50+ | $4-16/1M chars |

-Pick: ElevenLabs → quality/cloning. Cartesia → lowest latency. NVIDIA Magpie → enterprise/self-hosted. Google → language breadth.
-
 ### Local

 | Model | Params | License | Languages | Voice Clone | VRAM |
 |-------|--------|---------|-----------|-------------|------|
 | Qwen3-TTS 0.6B | 0.6B | Apache-2.0 | 10 | Yes (5s ref) | 2GB |
 | Qwen3-TTS 1.7B | 1.7B | Apache-2.0 | 10 | Yes (5s ref) | 4GB |
-| Bark (Suno) | 1.0B | MIT | 13+ | Yes (prompt) | 6GB (stale) |
+| Bark (Suno) | 1.0B | MIT | 13+ | Yes (prompt) | 6GB (stale, expressive: laughter/music) |
 | Coqui TTS | varies | MPL-2.0 | 20+ | Yes | 2-6GB |
 | Piper | <100M | MIT | 30+ | No | CPU only |

-Pick: Qwen3-TTS → quality + cloning. Piper → CPU-only/embedded. Bark → expressiveness (laughter, music).
-
 Also available: EdgeTTS (free, 300+ voices), macOS Say (zero deps), FacebookMMS (1100+ languages). See `voice-models.md`.

 ## STT (Speech-to-Text)
@@ -61,8 +78,6 @@ Also available: EdgeTTS (free, 300+ voices), macOS Say (zero deps), FacebookMMS
 | Deepgram | Nova-2 / Nova-3 | 9.5-9.6 | Yes | Per minute |
 | Soniox | stt-async-v3 | 9.6 | Yes | Per minute |

-Pick: Groq → free/fast batch. ElevenLabs Scribe → accuracy. NVIDIA Parakeet → enterprise/self-hosted. Deepgram → real-time streaming.
-
 ### Local

 | Model | Size | Accuracy | Speed | VRAM |
@@ -72,11 +87,9 @@ Pick: Groq → free/fast batch. ElevenLabs Scribe → accuracy. NVIDIA Parakeet
 | Whisper Small | 461MB | 8.5 | Medium | 2GB |
 | Whisper Large v3 | 2.9GB | 9.8 | Slow | 10GB |
 | Whisper Large v3 Turbo | 1.5GB | 9.7 | Fast | 5GB |
-| NVIDIA Parakeet V2 | 0.6B | 9.4 | Fastest | 2GB |
-| NVIDIA Parakeet V3 | 0.6B | 9.6 | Fastest | 2GB |
-| Apple Speech | Built-in | 9.0 | Fast | On-device |
-
-Pick: Large v3 Turbo → best balance. Parakeet V3 → multilingual speed (25 langs). Parakeet V2 → English-only. Apple Speech → zero-setup macOS 26+.
+| NVIDIA Parakeet V2 | 0.6B | 9.4 | Fastest | 2GB (English-only) |
+| NVIDIA Parakeet V3 | 0.6B | 9.6 | Fastest | 2GB (25 langs) |
+| Apple Speech | Built-in | 9.0 | Fast | On-device (macOS 26+) |

 Backends: `faster-whisper` (4x speed, recommended), `whisper.cpp` (C++ native, Apple Silicon optimized). See `transcription.md`.

@@ -94,51 +107,17 @@ Backends: `faster-whisper` (4x speed, recommended), `whisper.cpp` (C++ native, A

 ### NVIDIA Riva Composable Pipelines

-| Component | Model | Role | Languages | NIM Available |
-|-----------|-------|------|-----------|---------------|
-| ASR | Parakeet TDT 0.6B v2 | Speech-to-text | English | HF (research) |
-| ASR | Parakeet CTC 1.1B | Speech-to-text | English | Yes |
-| ASR | Parakeet RNNT 1.1B | Speech-to-text | 25 languages | Yes |
-| TTS | Magpie TTS Multilingual | Text-to-speech | 17+ languages | Yes |
-| TTS | Magpie TTS Zero-Shot | Voice cloning TTS | English+ | API |
-| Enhancement | StudioVoice | Noise removal | Any | Yes |
-| Translation | Riva Translate | NMT | 36 languages | Yes |
-
-Compose as: `Audio -> [Parakeet ASR] -> [Any LLM] -> [Magpie TTS] -> Audio`. See `cloud-voice-agents.md`.
-
-Pick: GPT-4o Realtime → production cloud (lowest latency, GA). MiniCPM-o 2.6 → self-hosted/private (Apache-2.0, multimodal). NVIDIA Riva → enterprise on-prem (composable, 25+ languages). Cascaded S2S (VAD+STT+LLM+TTS): see `speech-to-speech.md`.
-
-## Selection by Priority
-
-| Priority | TTS | STT | S2S |
-|----------|-----|-----|-----|
-| Quality | ElevenLabs / Qwen3-TTS 1.7B | ElevenLabs Scribe / Large v3 | GPT-4o Realtime |
-| Speed | Cartesia Sonic 3 / EdgeTTS | Groq / Parakeet V3 | GPT-4o Realtime / Cascaded |
-| Cost | EdgeTTS (free) / Piper | Local Whisper ($0) / Groq free | MiniCPM-o 2.6 (local) |
-| Privacy | Piper / Qwen3-TTS | faster-whisper / whisper.cpp | MiniCPM-o 2.6 |
-| Enterprise | NVIDIA Magpie / ElevenLabs | NVIDIA Parakeet / Scribe | NVIDIA Riva pipeline |
-| Voice clone | ElevenLabs / Qwen3-TTS | N/A | MiniCPM-o 2.6 |
-
-### Decision Flow
-
-```text
-Need voice AI?
-├── Generate speech (TTS)
-│   ├── Voice cloning? → Qwen3-TTS (local) or ElevenLabs (cloud)
-│   ├── Lowest latency? → Cartesia Sonic 3 (cloud) or EdgeTTS (free)
-│   ├── Offline? → Piper (CPU) or Qwen3-TTS (GPU)
-│   └── Default → EdgeTTS (free, good quality)
-├── Transcribe speech (STT)
-│   ├── Real-time? → Deepgram Nova (cloud) or faster-whisper (local)
-│   ├── Best accuracy? → ElevenLabs Scribe (cloud) or Large v3 (local)
-│   ├── Free? → Groq free tier (cloud) or any local model
-│   └── Default → Whisper Large v3 Turbo (local)
-└── Conversational (S2S)
-    ├── Cloud OK? → GPT-4o Realtime (see cloud-voice-agents.md)
-    ├── Enterprise/on-prem? → NVIDIA Riva (Parakeet + LLM + Magpie)
-    ├── Local/private? → MiniCPM-o 2.6 or cascaded pipeline
-    └── Default → speech-to-speech.md cascaded pipeline
-```
+| Component | Model | Languages | NIM |
+|-----------|-------|-----------|-----|
+| ASR | Parakeet TDT 0.6B v2 | English | HF (research) |
+| ASR | Parakeet CTC 1.1B | English | Yes |
+| ASR | Parakeet RNNT 1.1B | 25 | Yes |
+| TTS | Magpie Multilingual | 17+ | Yes |
+| TTS | Magpie Zero-Shot | English+ | API |
+| Enhancement | StudioVoice | Any | Yes |
+| Translation | Riva Translate | 36 | Yes |
+
+Pipeline: `Audio -> [Parakeet ASR] -> [Any LLM] -> [Magpie TTS] -> Audio`. See `cloud-voice-agents.md`.
 Cascaded S2S (VAD+STT+LLM+TTS): see `speech-to-speech.md`.

 ## GPU Planning