
🎙️ KugelAudio

Open-source text-to-speech for European languages
Powered by an AR + Diffusion architecture


Motivation

Open-source text-to-speech models for European languages lag significantly behind their English counterparts. While English TTS has seen remarkable progress, speakers of German, French, Spanish, Polish, and dozens of other European languages remain underserved by the open-source community.

KugelAudio aims to change this. Building on the excellent foundation laid by the VibeVoice team at Microsoft, we've trained a model specifically focused on European language coverage, using approximately 200,000 hours of highly pre-processed and enhanced speech data from the YODAS2 dataset.

🏆 Benchmark Results: Outperforming ElevenLabs

KugelAudio achieves state-of-the-art performance, beating industry leaders including ElevenLabs in rigorous human preference testing. This result demonstrates that open-source models can now rival, and even surpass, the best commercial TTS systems.

Human Preference Benchmark (A/B Testing)

We conducted extensive A/B testing with 339 human evaluations to compare KugelAudio against leading TTS models. Participants listened to a reference voice sample, then compared outputs from two models and selected which sounded more human and closer to the original voice.

German Language Evaluation

The evaluation specifically focused on German language samples with diverse emotional expressions and speaking styles:

  • Neutral Speech: Standard conversational tones
  • Shouting: High-intensity, elevated volume speech
  • Singing: Melodic and rhythmic speech patterns
  • Drunken Voice: Slurred and irregular speech characteristics

These diverse test cases demonstrate the model's capability to handle a wide range of speaking styles beyond standard narration.

OpenSkill Ranking Results

| Rank | Model | Score | Record | Win Rate |
|------|-------|-------|--------|----------|
| 🥇 1 | KugelAudio | 26 | 71W / 20L / 23T | 78.0% |
| 🥈 2 | ElevenLabs Multi v2 | 25 | 56W / 34L / 22T | 62.2% |
| 🥉 3 | ElevenLabs v3 | 21 | 64W / 34L / 16T | 65.3% |
| 4 | Cartesia | 21 | 55W / 38L / 19T | 59.1% |
| 5 | VibeVoice | 10 | 30W / 74L / 8T | 28.8% |
| 6 | CosyVoice v3 | 9 | 15W / 91L / 8T | 14.2% |

Based on 339 evaluations, rated with the OpenSkill Bayesian skill-rating system.
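The win-rate column above can be reproduced directly from the W/L/T records; ties are excluded from the denominator. The sketch below is a reconstruction of the table's arithmetic only, not the OpenSkill rating computation itself:

```python
def win_rate(wins: int, losses: int) -> float:
    """Win rate in percent, with ties excluded: 100 * W / (W + L)."""
    return 100 * wins / (wins + losses)

# (W, L, T) records from the ranking table above.
records = {
    "KugelAudio": (71, 20, 23),
    "ElevenLabs Multi v2": (56, 34, 22),
    "ElevenLabs v3": (64, 34, 16),
    "Cartesia": (55, 38, 19),
    "VibeVoice": (30, 74, 8),
    "CosyVoice v3": (15, 91, 8),
}

for model, (w, l, t) in records.items():
    print(f"{model}: {win_rate(w, l):.1f}%")  # matches the Win Rate column
```

Note that the ranking is not a pure win-rate ordering (ElevenLabs v3 has a higher raw win rate than Multi v2 but a lower score): OpenSkill weighs the strength of the opponents in each pairing, not just the outcome counts.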

Audio Samples

Listen to KugelAudio's diverse voice capabilities across different speaking styles and languages:

German Voice Samples

| Sample | Description |
|--------|-------------|
| Whispering | Soft whispering voice |
| Female Narrator | Professional female reader voice |
| Angry Voice | Irritated and frustrated speech |
| Radio Announcer | Professional radio broadcast voice |

All samples are generated using pre-encoded voice embeddings.

Training Details

  • Base Model: Microsoft VibeVoice
  • Training Data: ~200,000 hours from YODAS2
  • Hardware: 8x NVIDIA H100 GPUs
  • Training Duration: 5 days

Supported Languages

KugelAudio supports 24 major European languages with varying levels of quality based on dataset representation:

| Language | Code | Flag | Language | Code | Flag | Language | Code | Flag |
|----------|------|------|----------|------|------|----------|------|------|
| English | en | 🇺🇸 | German | de | 🇩🇪 | French | fr | 🇫🇷 |
| Spanish | es | 🇪🇸 | Italian | it | 🇮🇹 | Portuguese | pt | 🇵🇹 |
| Dutch | nl | 🇳🇱 | Polish | pl | 🇵🇱 | Russian | ru | 🇷🇺 |
| Ukrainian | uk | 🇺🇦 | Czech | cs | 🇨🇿 | Romanian | ro | 🇷🇴 |
| Hungarian | hu | 🇭🇺 | Swedish | sv | 🇸🇪 | Danish | da | 🇩🇰 |
| Finnish | fi | 🇫🇮 | Norwegian | no | 🇳🇴 | Greek | el | 🇬🇷 |
| Bulgarian | bg | 🇧🇬 | Slovak | sk | 🇸🇰 | Croatian | hr | 🇭🇷 |
| Serbian | sr | 🇷🇸 | Turkish | tr | 🇹🇷 | | | |

📊 Language Coverage Disclaimer: Quality varies significantly by language. Spanish, French, English, and German have the strongest representation in our training data (~200,000 hours from YODAS2). Other languages may have reduced quality, prosody, or vocabulary coverage depending on their availability in the training dataset.
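For scripting, the language table above can be captured as a plain code-to-name mapping. This is a convenience helper for callers, not an official library API; the actual package may expose its own language list:

```python
# Language codes as listed in the table above (illustrative helper only).
SUPPORTED_LANGUAGES = {
    "en": "English", "de": "German", "fr": "French", "es": "Spanish",
    "it": "Italian", "pt": "Portuguese", "nl": "Dutch", "pl": "Polish",
    "ru": "Russian", "uk": "Ukrainian", "cs": "Czech", "ro": "Romanian",
    "hu": "Hungarian", "sv": "Swedish", "da": "Danish", "fi": "Finnish",
    "no": "Norwegian", "el": "Greek", "bg": "Bulgarian", "sk": "Slovak",
    "hr": "Croatian", "sr": "Serbian", "tr": "Turkish",
}

def check_language(code: str) -> str:
    """Validate a language code before generation; returns the language name."""
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return SUPPORTED_LANGUAGES[code]

print(check_language("de"))  # German
```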

📖 Start Here

Get started with KugelAudio quickly using our documentation:

  • 📥 Installation: Set up KugelAudio on your machine
  • 🎯 Quick Start: Generate your first speech in minutes
  • 🎭 Voices: Use pre-encoded voices for different speakers
  • ☁️ Hosted API: Use our cloud API for zero-setup inference
  • 🔒 Watermarking: Verify AI-generated audio
  • 📦 Models: Available model variants and benchmarks

Features

  • 🏆 State-of-the-Art Performance: Outperforms ElevenLabs and other leading TTS models in human evaluations
  • 🌍 European Language Focus: Trained specifically for 24 major European languages
  • High-Quality TTS: Natural, expressive speech synthesis from the AR + Diffusion architecture
  • 🎭 Pre-encoded Voices: Select from a set of pre-encoded speaker voices
  • Audio Watermarking: All generated audio is watermarked using Facebook's AudioSeal
  • 🎭 Emotional Range: Supports various speaking styles including shouting, singing, and expressive speech
  • Web Interface: Easy-to-use Gradio UI for non-technical users
  • HuggingFace Integration: Seamless loading from HuggingFace Hub

Quick Start

Installation

Prerequisites

  • Python 3.10 or higher
  • PyTorch 2.0 or higher
  • CUDA (recommended for GPU acceleration)
  • uv (recommended package manager)

Install uv

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

# Or via pip
pip install uv

Install KugelAudio

# Clone the repository
git clone https://github.com/Kugelaudio/kugelaudio-open.git
cd kugelaudio-open

# Run directly with uv (recommended - handles all dependencies automatically)
uv run python start.py

That's it! The uv run command will automatically create a virtual environment and install all dependencies.

Launch Web Interface

# Quick start with uv (recommended)
uv run python start.py

# With a public share link
uv run python start.py ui --share

# Custom host and port
uv run python start.py ui --host 0.0.0.0 --port 8080

Then open http://127.0.0.1:7860 in your browser.

Command Line Usage

# Generate speech from text
uv run python start.py generate "Hello, this is KugelAudio!" -o hello.wav

# With a specific pre-encoded voice
uv run python start.py generate "Hello in a warm voice!" --voice warm -o warm.wav

# Explicitly select the model checkpoint
uv run python start.py generate "Premium quality speech" --model kugelaudio/kugelaudio-0-open -o premium.wav

# Check if audio contains watermark
uv run python start.py verify audio.wav

Python API

from kugelaudio_open import (
    KugelAudioForConditionalGenerationInference,
    KugelAudioProcessor,
)
import torch

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = KugelAudioForConditionalGenerationInference.from_pretrained(
    "kugelaudio/kugelaudio-0-open",
    torch_dtype=torch.bfloat16,
).to(device)
model.eval()

# Strip encoder weights to save VRAM (only decoders needed for inference)
model.model.strip_encoders()

processor = KugelAudioProcessor.from_pretrained("kugelaudio/kugelaudio-0-open")

# See available voices
print(processor.get_available_voices())  # ["default", "warm", "clear"]

# Generate speech (watermark is automatically applied)
inputs = processor(text="Hello world!", voice="default", return_tensors="pt")
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, cfg_scale=3.0)

# Save audio
processor.save_audio(outputs.speech_outputs[0], "output.wav")

Voices

KugelAudio provides pre-encoded voices that can be selected by name. The voices are stored as .pt files in the model repository and are automatically downloaded from HuggingFace when needed.

# List available voices
voices = processor.get_available_voices()
print(voices)  # ["default", "warm", "clear"]

# Generate with a specific voice
inputs = processor(text="Hello world!", voice="warm", return_tensors="pt")
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, cfg_scale=3.0)

processor.save_audio(outputs.speech_outputs[0], "warm_voice_output.wav")

Note: Voice cloning from raw audio is not supported in this open-source release. Only the pre-encoded voices listed in voices/voices.json are available.

Hosted API

Don't want to run your own infrastructure? Use our hosted API at kugelaudio.com:

  • Ultra-low latency: Inference as fast as 39ms, end-to-end latency <100ms including network
  • 🌍 Global edge deployment: Low latency worldwide
  • 🔧 Zero setup: No GPU required, just API calls
  • 📈 Auto-scaling: Handle any traffic volume

Python SDK

uv pip install kugelaudio

from kugelaudio import KugelAudio

# Initialize the client
client = KugelAudio(api_key="your_api_key")

# Generate speech
audio = client.tts.generate(
    text="Hello from KugelAudio!",
    model="kugel-1-turbo",
)

# Save to file
audio.save("output.wav")
print(f"Generated {audio.duration_seconds:.2f}s in {audio.generation_ms:.0f}ms")

📚 Full SDK Documentation →

Audio Watermarking

All audio generated by KugelAudio contains an imperceptible watermark using Facebook's AudioSeal technology. This helps identify AI-generated content and prevent misuse.

Verify Watermark

from kugelaudio_open.watermark import AudioWatermark

watermark = AudioWatermark()

# Check if audio is watermarked
result = watermark.detect(audio, sample_rate=24000)

print(f"Detected: {result.detected}")
print(f"Confidence: {result.confidence:.1%}")

Watermark Features

  • Imperceptible: No audible difference in audio quality
  • Robust: Survives compression, resampling, and editing
  • Fast Detection: Real-time capable detection
  • Sample-Level: detection at 1/16,000-second resolution

Models

| Model | Parameters | Quality | RTF | Speed | VRAM |
|-------|------------|---------|-----|-------|------|
| kugelaudio-0-open | 7B | Best | 1.00 | 1.0x realtime | ~19GB |

RTF = Real-Time Factor (generation time / audio duration). Lower is faster.
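Applying the definition above, RTF is a simple ratio; a value below 1.0 means generation is faster than playback:

```python
def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor as defined above: generation time / audio duration.
    RTF < 1.0 means faster than real time."""
    return generation_seconds / audio_seconds

# kugelaudio-0-open at RTF 1.00: 10 s of audio takes ~10 s to generate.
print(rtf(10.0, 10.0))  # 1.0
```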

Architecture

KugelAudio uses a hybrid AR + Diffusion architecture:

  1. Text Encoder: Qwen2-based language model encodes input text
  2. TTS Backbone: Upper transformer layers generate speech representations
  3. Diffusion Head: Predicts speech latents using denoising diffusion
  4. Acoustic Decoder: Converts latents to audio waveforms
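As a structural illustration of how the four stages compose, here is a toy data-flow sketch. Every function name, shape, and computation below is a hypothetical stand-in, not the real implementation (which lives in the model classes shown earlier); the diffusion head in particular is reduced to a trivial iterative-denoising loop:

```python
import random

def encode_text(text):
    # 1. Text encoder: map characters to integer "tokens" (stand-in for Qwen2).
    return [ord(c) % 256 for c in text]

def tts_backbone(tokens):
    # 2. AR backbone: produce one hidden "speech representation" per token (toy).
    return [(t * 31 + 7) % 97 / 97.0 for t in tokens]

def diffusion_head(reps, steps=8):
    # 3. Diffusion head: start from noise and iteratively denoise toward
    #    the target representations (grossly simplified denoising loop).
    rng = random.Random(0)
    latents = [rng.random() for _ in reps]
    for _ in range(steps):
        latents = [x + (target - x) / 2 for x, target in zip(latents, reps)]
    return latents

def acoustic_decoder(latents, upsample=4):
    # 4. Acoustic decoder: upsample latents into a waveform-like sequence.
    return [x for x in latents for _ in range(upsample)]

wave = acoustic_decoder(diffusion_head(tts_backbone(encode_text("Hallo"))))
print(len(wave))  # 20: 5 tokens x 4 upsampling
```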

Configuration

Environment Variables

# Use specific GPU
export CUDA_VISIBLE_DEVICES=0

# Enable TF32 for faster computation on Ampere GPUs
export TORCH_ALLOW_TF32=1

Advanced Generation Parameters

outputs = model.generate(
    **inputs,
    cfg_scale=3.0,                  # Guidance scale (1.0-10.0)
    max_new_tokens=4096,            # Maximum generation length
)
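The cfg_scale parameter presumably controls standard classifier-free guidance, where the model's conditional and unconditional predictions are blended at each denoising step. A sketch of that standard combination, assuming KugelAudio follows the usual formulation (its actual internals are not shown in this README):

```python
def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: move the unconditional prediction toward
    the conditional one by `scale`. scale=1.0 reproduces the conditional
    prediction; larger values follow the text conditioning more strongly."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

# With scale 1.0 the result equals the conditional prediction:
print(cfg_combine([0.0, 0.0], [1.0, 2.0], 1.0))  # [1.0, 2.0]
# scale 3.0 (the default above) extrapolates past it:
print(cfg_combine([0.0, 0.0], [1.0, 2.0], 3.0))  # [3.0, 6.0]
```

Higher scales typically trade some naturalness for stricter adherence to the input text, which is why the documented range tops out at 10.0.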

Responsible Use

This technology is intended for legitimate purposes:

Appropriate Uses:

  • Accessibility (TTS for visually impaired)
  • Content creation (podcasts, videos, audiobooks)
  • Voice assistants and chatbots
  • Language learning applications
  • Creative projects with consent

Prohibited Uses:

  • Creating deepfakes or misleading content
  • Impersonating individuals without consent
  • Fraud or deception
  • Any illegal activities

All generated audio is watermarked to enable detection of AI-generated content.

License

MIT License - see LICENSE for details.

Acknowledgments

This model would not have been possible without the contributions of many individuals and organizations:

Special Thanks

  • Carlos Menke: For his invaluable efforts in gathering the first datasets and extensive work benchmarking the model
  • AI Service Center Berlin-Brandenburg (KI-Servicezentrum): For providing the GPU resources (8x H100) that made training this model possible

Authors

Kajo Kratzenstein
📧 kajo@kugelaudio.com
🌐 kugelaudio.com

Carlos Menke

Citation

@software{kugelaudio2026,
  title = {KugelAudio: Open-Source Text-to-Speech for European Languages},
  author = {Kratzenstein, Kajo and Menke, Carlos},
  year = {2026},
  institution = {Hasso-Plattner-Institut},
  url = {https://github.com/kugelaudio/kugelaudio}
}

Funding Notice

Das zugrunde liegende Vorhaben wurde mit Mitteln des Bundesministeriums für Forschung, Technologie und Raumfahrt unter dem Förderkennzeichen »KI-Servicezentrum Berlin-Brandenburg« 16IS22092 gefördert. Die Verantwortung für den Inhalt dieser Seite liegt bei der Autorin/beim Autor.

This project was funded by the German Federal Ministry of Research, Technology and Space under the funding code "AI Service Center Berlin-Brandenburg" 16IS22092. The responsibility for the content of this publication lies with the author.

