Open-source text-to-speech for European languages
Powered by an AR + Diffusion architecture
Open-source text-to-speech models for European languages lag significantly behind their English counterparts. While English TTS has seen remarkable progress, speakers of German, French, Spanish, Polish, and dozens of other European languages have been underserved by the open-source community.
KugelAudio aims to change this. Building on the excellent foundation laid by the VibeVoice team at Microsoft, we've trained a model specifically focused on European language coverage, using approximately 200,000 hours of highly pre-processed and enhanced speech data from the YODAS2 dataset.
In rigorous human preference testing on German speech samples, KugelAudio ranked first, ahead of industry leaders including ElevenLabs. This result shows that open-source models can now rival, and in this evaluation surpass, the best commercial TTS systems.
We conducted extensive A/B testing with 339 human evaluations to compare KugelAudio against leading TTS models. Participants listened to a reference voice sample, then compared outputs from two models and selected which sounded more human and closer to the original voice.
The evaluation focused specifically on German-language samples with diverse emotional expressions and speaking styles:
- Neutral Speech: Standard conversational tones
- Shouting: High-intensity, elevated volume speech
- Singing: Melodic and rhythmic speech patterns
- Drunken Voice: Slurred and irregular speech characteristics
These diverse test cases demonstrate the model's capability to handle a wide range of speaking styles beyond standard narration.
| Rank | Model | Score | Record | Win Rate |
|---|---|---|---|---|
| 🥇 1 | KugelAudio | 26 | 71W / 20L / 23T | 78.0% |
| 🥈 2 | ElevenLabs Multi v2 | 25 | 56W / 34L / 22T | 62.2% |
| 🥉 3 | ElevenLabs v3 | 21 | 64W / 34L / 16T | 65.3% |
| 4 | Cartesia | 21 | 55W / 38L / 19T | 59.1% |
| 5 | VibeVoice | 10 | 30W / 74L / 8T | 28.8% |
| 6 | CosyVoice v3 | 9 | 15W / 91L / 8T | 14.2% |
Based on 339 evaluations using a Bayesian skill-rating system (OpenSkill).
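The Win Rate column is consistent with wins divided by decided comparisons (wins plus losses), with ties left out. The sketch below recomputes it from the Record column under that assumption. The win_rate helper is illustrative and not part of KugelAudio, and it does not reproduce the OpenSkill score.

```python
# Records copied from the leaderboard above: (wins, losses, ties).
RECORDS = {
    "KugelAudio": (71, 20, 23),
    "ElevenLabs Multi v2": (56, 34, 22),
    "ElevenLabs v3": (64, 34, 16),
    "Cartesia": (55, 38, 19),
    "VibeVoice": (30, 74, 8),
    "CosyVoice v3": (15, 91, 8),
}


def win_rate(wins: int, losses: int) -> float:
    """Share of decided comparisons won (assumption: ties are excluded)."""
    decided = wins + losses
    if decided == 0:
        raise ValueError("no decided comparisons")
    return wins / decided


for model, (wins, losses, ties) in RECORDS.items():
    total = wins + losses + ties
    print(f"{model}: {win_rate(wins, losses):.1%} over {total} comparisons")
```

The per-model totals add up to 678, two entries per pairwise comparison, which matches the 339 evaluations reported.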
Listen to KugelAudio's diverse voice capabilities across different speaking styles and languages:
| Sample | Description | Audio Player |
|---|---|---|
| Whispering | Soft whispering voice | |
| Female Narrator | Professional female reader voice | |
| Angry Voice | Irritated and frustrated speech | |
| Radio Announcer | Professional radio broadcast voice | |
All samples are generated using pre-encoded voice embeddings.
- Base Model: Microsoft VibeVoice
- Training Data: ~200,000 hours from YODAS2
- Hardware: 8x NVIDIA H100 GPUs
- Training Duration: 5 days
KugelAudio supports 24 major European languages with varying levels of quality based on dataset representation:
| Language | Code | Flag | Language | Code | Flag | Language | Code | Flag |
|---|---|---|---|---|---|---|---|---|
| English | en | 🇺🇸 | German | de | 🇩🇪 | French | fr | 🇫🇷 |
| Spanish | es | 🇪🇸 | Italian | it | 🇮🇹 | Portuguese | pt | 🇵🇹 |
| Dutch | nl | 🇳🇱 | Polish | pl | 🇵🇱 | Russian | ru | 🇷🇺 |
| Ukrainian | uk | 🇺🇦 | Czech | cs | 🇨🇿 | Romanian | ro | 🇷🇴 |
| Hungarian | hu | 🇭🇺 | Swedish | sv | 🇸🇪 | Danish | da | 🇩🇰 |
| Finnish | fi | 🇫🇮 | Norwegian | no | 🇳🇴 | Greek | el | 🇬🇷 |
| Bulgarian | bg | 🇧🇬 | Slovak | sk | 🇸🇰 | Croatian | hr | 🇭🇷 |
| Serbian | sr | 🇷🇸 | Turkish | tr | 🇹🇷 | | | |
📊 Language Coverage Disclaimer: Quality varies significantly by language. Of the ~200,000 hours of YODAS2 training data, Spanish, French, English, and German have the strongest representation. Other languages may have reduced quality, prosody, or vocabulary coverage depending on their availability in the training dataset.
Get started with KugelAudio quickly using our documentation:
| Guide | Description |
|---|---|
| 📥 Installation | Set up KugelAudio on your machine |
| 🎯 Quick Start | Generate your first speech in minutes |
| 🎭 Voices | Use pre-encoded voices for different speakers |
| ☁️ Hosted API | Use our cloud API for zero-setup inference |
| 🔒 Watermarking | Verify AI-generated audio |
| 📦 Models | Available model variants and benchmarks |
- 🏆 State-of-the-Art Performance: Outperforms ElevenLabs and other leading TTS models in human evaluations
- 🌍 European Language Focus: Trained specifically for 24 major European languages
- High-Quality TTS: State-of-the-art speech synthesis using AR + Diffusion
- 🎭 Pre-encoded Voices: Select from a set of pre-encoded speaker voices
- Audio Watermarking: All generated audio is watermarked using Facebook's AudioSeal
- 🎭 Emotional Range: Supports various speaking styles including shouting, singing, and expressive speech
- Web Interface: Easy-to-use Gradio UI for non-technical users
- HuggingFace Integration: Seamless loading from HuggingFace Hub
- Python 3.10 or higher
- PyTorch 2.0 or higher
- CUDA (recommended for GPU acceleration)
- uv (recommended package manager)
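To check these prerequisites before installing, a small script along the following lines may help. It is a sketch, not something shipped with KugelAudio: check_environment is a hypothetical helper, and it only inspects the Python version, whether PyTorch can be imported, CUDA availability as reported by PyTorch, and whether a uv executable is on the PATH.

```python
import importlib.util
import shutil
import sys


def check_environment() -> dict:
    """Report which of the listed prerequisites appear to be satisfied."""
    checks = {
        "python>=3.10": sys.version_info >= (3, 10),
        "torch installed": importlib.util.find_spec("torch") is not None,
        "uv on PATH": shutil.which("uv") is not None,
    }
    if checks["torch installed"]:
        import torch  # imported lazily so the script still runs without PyTorch

        major = int(torch.__version__.split(".")[0])
        checks["torch>=2.0"] = major >= 2
        checks["cuda available"] = torch.cuda.is_available()
    return checks


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

CUDA and uv are only recommended, so a MISSING line for either of them does not necessarily block installation.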
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
# Or via pip
pip install uv

# Clone the repository
git clone https://github.com/Kugelaudio/kugelaudio-open.git
cd kugelaudio-open
# Run directly with uv (recommended - handles all dependencies automatically)
uv run python start.py

That's it! The uv run command will automatically create a virtual environment and install all dependencies.
# Quick start with uv (recommended)
uv run python start.py
# With a public share link
uv run python start.py ui --share
# Custom host and port
uv run python start.py ui --host 0.0.0.0 --port 8080

Then open http://127.0.0.1:7860 in your browser (the default address; adjust it if you passed a custom host or port).
# Generate speech from text
uv run python start.py generate "Hello, this is KugelAudio!" -o hello.wav
# With a specific pre-encoded voice
uv run python start.py generate "Hello in a warm voice!" --voice warm -o warm.wav
# Using the default model for higher quality
uv run python start.py generate "Premium quality speech" --model kugelaudio/kugelaudio-0-open -o premium.wav
# Check if audio contains watermark
uv run python start.py verify audio.wav

from kugelaudio_open import (
    KugelAudioForConditionalGenerationInference,
    KugelAudioProcessor,
)
import torch
# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = KugelAudioForConditionalGenerationInference.from_pretrained(
    "kugelaudio/kugelaudio-0-open",
    torch_dtype=torch.bfloat16,
).to(device)
model.eval()
# Strip encoder weights to save VRAM (only decoders needed for inference)
model.model.strip_encoders()
processor = KugelAudioProcessor.from_pretrained("kugelaudio/kugelaudio-0-open")
# See available voices
print(processor.get_available_voices()) # ["default", "warm", "clear"]
# Generate speech (watermark is automatically applied)
inputs = processor(text="Hello world!", voice="default", return_tensors="pt")
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
with torch.no_grad():
    outputs = model.generate(**inputs, cfg_scale=3.0)
# Save audio
processor.save_audio(outputs.speech_outputs[0], "output.wav")

KugelAudio provides pre-encoded voices that can be selected by name. The voices are stored as .pt files in the model repository and are automatically downloaded from HuggingFace when needed.
# List available voices
voices = processor.get_available_voices()
print(voices) # ["default", "warm", "clear"]
# Generate with a specific voice
inputs = processor(text="Hello world!", voice="warm", return_tensors="pt")
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
with torch.no_grad():
    outputs = model.generate(**inputs, cfg_scale=3.0)
processor.save_audio(outputs.speech_outputs[0], "warm_voice_output.wav")

Note: Voice cloning from raw audio is not supported in this open-source release. Only the pre-encoded voices listed in voices/voices.json are available.
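Because only pre-encoded voices are available, it can help to validate a requested voice name before passing it to the processor. The following is a minimal sketch: resolve_voice is a hypothetical helper, not part of kugelaudio_open, and the list below stands in for the result of processor.get_available_voices().

```python
def resolve_voice(requested, available, fallback="default"):
    """Return `requested` if it is available, otherwise the fallback voice.

    Raises ValueError when neither the requested voice nor the fallback exists.
    """
    if requested in available:
        return requested
    if fallback in available:
        return fallback
    raise ValueError(
        f"voice {requested!r} not found; available voices: {sorted(available)}"
    )


# Stand-in for processor.get_available_voices() from the example above.
available_voices = ["default", "warm", "clear"]

print(resolve_voice("warm", available_voices))     # warm
print(resolve_voice("robotic", available_voices))  # default
```

Falling back silently is a design choice; raising immediately on an unknown name may be preferable when the voice matters to the caller.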
Don't want to run your own infrastructure? Use our hosted API at kugelaudio.com:
- ⚡ Ultra-low latency: Inference as fast as 39ms, end-to-end latency <100ms including network
- 🌍 Global edge deployment: Low latency worldwide
- 🔧 Zero setup: No GPU required, just API calls
- 📈 Auto-scaling: Handle any traffic volume
uv pip install kugelaudio

from kugelaudio import KugelAudio
# Initialize the client
client = KugelAudio(api_key="your_api_key")
# Generate speech
audio = client.tts.generate(
    text="Hello from KugelAudio!",
    model="kugel-1-turbo",
)
# Save to file
audio.save("output.wav")
print(f"Generated {audio.duration_seconds:.2f}s in {audio.generation_ms:.0f}ms")

All audio generated by KugelAudio contains an imperceptible watermark using Facebook's AudioSeal technology. This helps identify AI-generated content and prevent misuse.
from kugelaudio_open.watermark import AudioWatermark
watermark = AudioWatermark()
# Check if audio is watermarked
result = watermark.detect(audio, sample_rate=24000)
print(f"Detected: {result.detected}")
print(f"Confidence: {result.confidence:.1%}")

- Imperceptible: No audible difference in audio quality
- Robust: Survives compression, resampling, and editing
- Fast Detection: Real-time capable detection
- Sample-Level: Detection at 1/16,000-second resolution
| Model | Parameters | Quality | RTF | Speed | VRAM |
|---|---|---|---|---|---|
| kugelaudio-0-open | 7B | Best | 1.00 | 1.0x realtime | ~19GB |
RTF = Real-Time Factor (generation time / audio duration). Lower is faster.
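As a worked example of the definition above, the sketch below computes RTF from a generation time and an audio duration. The real_time_factor function is illustrative, and the numbers are placeholders rather than measurements.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = generation time / audio duration; below 1.0 is faster than realtime."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds


# Illustrative numbers only: 5 s of audio generated in 5 s.
rtf = real_time_factor(generation_seconds=5.0, audio_seconds=5.0)
print(f"RTF {rtf:.2f} -> {1 / rtf:.1f}x realtime")  # RTF 1.00 -> 1.0x realtime
```

With the hosted API client, the same calculation could presumably be fed from the audio.generation_ms and audio.duration_seconds fields shown earlier, as real_time_factor(audio.generation_ms / 1000, audio.duration_seconds).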
KugelAudio uses a hybrid AR + Diffusion architecture:
- Text Encoder: Qwen2-based language model encodes input text
- TTS Backbone: Upper transformer layers generate speech representations
- Diffusion Head: Predicts speech latents using denoising diffusion
- Acoustic Decoder: Converts latents to audio waveforms
# Use specific GPU
export CUDA_VISIBLE_DEVICES=0
# Enable TF32 for faster computation on Ampere and newer GPUs
export TORCH_ALLOW_TF32=1

outputs = model.generate(
    **inputs,
    cfg_scale=3.0,        # Guidance scale (1.0-10.0)
    max_new_tokens=4096,  # Maximum generation length
)

This technology is intended for legitimate purposes:
✅ Appropriate Uses:
- Accessibility (TTS for visually impaired users)
- Content creation (podcasts, videos, audiobooks)
- Voice assistants and chatbots
- Language learning applications
- Creative projects with consent
❌ Prohibited Uses:
- Creating deepfakes or misleading content
- Impersonating individuals without consent
- Fraud or deception
- Any illegal activities
All generated audio is watermarked to enable detection of AI-generated content.
MIT License - see LICENSE for details.
This model would not have been possible without the contributions of many individuals and organizations:
- Microsoft VibeVoice Team: For the excellent foundation architecture that this model builds upon
- YODAS2 Dataset: For providing the large-scale multilingual speech data
- Qwen Team: For the powerful language model backbone
- Facebook AudioSeal: For the audio watermarking technology
- HuggingFace: For model hosting and the transformers library
- Carlos Menke: For his invaluable efforts in gathering the first datasets and extensive work benchmarking the model
- AI Service Center Berlin-Brandenburg (KI-Servicezentrum): For providing the GPU resources (8x H100) that made training this model possible
Kajo Kratzenstein
📧 kajo@kugelaudio.com
🌐 kugelaudio.com
Carlos Menke
@software{kugelaudio2026,
  title = {KugelAudio: Open-Source Text-to-Speech for European Languages},
  author = {Kratzenstein, Kajo and Menke, Carlos},
  year = {2026},
  institution = {Hasso-Plattner-Institut},
  url = {https://github.com/kugelaudio/kugelaudio}
}

Funding Notice
Das zugrunde liegende Vorhaben wurde mit Mitteln des Bundesministeriums für Forschung, Technologie und Raumfahrt unter dem Förderkennzeichen »KI-Servicezentrum Berlin-Brandenburg« 16IS22092 gefördert. Die Verantwortung für den Inhalt dieser Seite liegt bei der Autorin/beim Autor.
This project was funded by the German Federal Ministry of Research, Technology and Space under the funding code "AI Service Center Berlin-Brandenburg" 16IS22092. The responsibility for the content of this publication lies with the author.