WhisperLiveKit: Ultra-low-latency, self-hosted speech-to-text with speaker identification
- Simul-Whisper/Streaming (SOTA 2025) - Ultra-low latency transcription using AlignAtt policy
- NLLW (2025), based on distilled NLLB (2022, 2024) - Simultaneous translation from & to 200 languages.
- WhisperStreaming (SOTA 2023) - Low latency transcription using LocalAgreement policy
- Streaming Sortformer (SOTA 2025) - Advanced real-time speaker diarization
- Diart (SOTA 2021) - Real-time speaker diarization
- Silero VAD (2024) - Enterprise-grade Voice Activity Detection
Why not just run a simple Whisper model on every audio batch? Whisper is designed for complete utterances, not real-time chunks. Processing small segments loses context, cuts off words mid-syllable, and produces poor transcription. WhisperLiveKit uses state-of-the-art simultaneous speech research for intelligent buffering and incremental processing.
The backend supports multiple concurrent users. Voice Activity Detection reduces overhead when no voice is detected.
- `pip install whisperlivekit`. You can also clone the repo and `pip install -e .` for the latest version.
- Start the transcription server: `wlk --model base --language en`
- Open your browser and navigate to `http://localhost:8000`. Start speaking and watch your words appear in real-time!
- See here for the list of all available languages.
- Check the troubleshooting guide for step-by-step fixes collected from recent GPU setup/env issues.
- The CLI entry point is exposed as both `wlk` and `whisperlivekit-server`; they are equivalent.
- For HTTPS requirements, see the Parameters section for SSL configuration options.
Go to chrome-extension for instructions.
| Optional | pip install |
|---|---|
| Windows/Linux optimizations | faster-whisper |
| Apple Silicon optimizations | mlx-whisper |
| Translation | nllw |
| Speaker diarization | git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr] |
| OpenAI API | openai |
| [Not recommended] Speaker diarization with Diart | diart |
See Parameters & Configuration below on how to use them.
Command-line Interface: Start the transcription server with various options:
# Large model, translating from French to Danish
wlk --model large-v3 --language fr --target-language da
# Diarization and server listening on */80
wlk --host 0.0.0.0 --port 80 --model medium --diarization --language fr
Python API Integration: Check basic_server for a more complete example of how to use the functions and classes.
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.responses import HTMLResponse
from whisperlivekit import AudioProcessor, TranscriptionEngine, parse_args

transcription_engine = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    global transcription_engine
    # Load the models once at startup; the engine is shared by all connections
    transcription_engine = TranscriptionEngine(model="medium", diarization=True, lan="en")
    yield

app = FastAPI(lifespan=lifespan)

async def handle_websocket_results(websocket: WebSocket, results_generator):
    # Forward transcription results to the client as they become available
    async for response in results_generator:
        await websocket.send_json(response)
    await websocket.send_json({"type": "ready_to_stop"})

@app.websocket("/asr")
async def websocket_endpoint(websocket: WebSocket):
    global transcription_engine
    # Create a new AudioProcessor for each connection, passing the shared engine
    audio_processor = AudioProcessor(transcription_engine=transcription_engine)
    results_generator = await audio_processor.create_tasks()
    results_task = asyncio.create_task(handle_websocket_results(websocket, results_generator))
    await websocket.accept()
    try:
        while True:
            message = await websocket.receive_bytes()
            await audio_processor.process_audio(message)
    except WebSocketDisconnect:
        results_task.cancel()
Frontend Implementation: The package includes an HTML/JavaScript implementation here. You can also import it using `from whisperlivekit import get_inline_ui_html` and `page = get_inline_ui_html()`.
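If you want the same app to serve that bundled UI, a minimal sketch (reusing the `app` object from the example above) looks like this:

```python
from fastapi.responses import HTMLResponse
from whisperlivekit import get_inline_ui_html

@app.get("/")
async def web_interface():
    # Serve the packaged HTML/JS frontend, which talks to the /asr WebSocket endpoint above
    return HTMLResponse(content=get_inline_ui_html())
```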
| Parameter | Description | Default |
|---|---|---|
| `--model` | Whisper model size. List and recommendations here | `small` |
| `--model-path` | Local .pt file/directory or Hugging Face repo ID containing the Whisper model. Overrides `--model`. Recommendations here | `None` |
| `--language` | List here. If you use `auto`, the model attempts to detect the language automatically, but it tends to bias towards English. | `auto` |
| `--target-language` | If set, translates using NLLW (200 languages available). To translate to English, you can also use `--direct-english-translation`, which makes the STT model try to output the translation directly. | `None` |
| `--diarization` | Enable speaker identification | `False` |
| `--backend-policy` | Streaming strategy: `simulstreaming` uses the AlignAtt policy (SimulStreaming), `localagreement` uses the LocalAgreement policy | `simulstreaming` |
| `--backend` | Whisper implementation selector. `auto` picks MLX on macOS (if installed), otherwise Faster-Whisper, otherwise vanilla Whisper. You can also force `mlx-whisper`, `faster-whisper`, `whisper`, or `openai-api` (LocalAgreement only) | `auto` |
| `--no-vac` | Disable the Voice Activity Controller. NOT ADVISED | `False` |
| `--no-vad` | Disable Voice Activity Detection. NOT ADVISED | `False` |
| `--warmup-file` | Audio file path for model warmup | `jfk.wav` |
| `--host` | Server host address | `localhost` |
| `--port` | Server port | `8000` |
| `--ssl-certfile` | Path to the SSL certificate file (for HTTPS support) | `None` |
| `--ssl-keyfile` | Path to the SSL private key file (for HTTPS support) | `None` |
| `--forwarded-allow-ips` | IP(s) allowed to reverse proxy the whisperlivekit-server. Supported values are IP addresses (e.g. 127.0.0.1), IP networks (e.g. 10.100.0.0/16), or literals (e.g. /path/to/socket.sock) | `None` |
| `--pcm-input` | Raw PCM (s16le) data is expected as input and FFmpeg is bypassed. The frontend uses AudioWorklet instead of MediaRecorder | `False` |
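As an illustration of the WebSocket protocol used above (binary audio frames in, JSON results out), here is a minimal client sketch. It assumes the server was started with `--pcm-input` so that raw s16le audio is accepted, and that `audio.wav` is a 16-bit mono WAV file; the URL, chunk size, and pacing are arbitrary choices for the sketch, not part of the package's API.

```python
import asyncio
import json
import wave

import websockets  # pip install websockets


async def send_audio(ws, path):
    # Stream raw s16le PCM frames, roughly paced like live capture
    with wave.open(path, "rb") as wav:
        frames_per_chunk = wav.getframerate() // 2  # ~0.5 s of audio per message
        while True:
            pcm = wav.readframes(frames_per_chunk)
            if not pcm:
                break
            await ws.send(pcm)  # requires the server to run with --pcm-input
            await asyncio.sleep(0.5)


async def main(path="audio.wav", url="ws://localhost:8000/asr"):
    async with websockets.connect(url) as ws:
        sender = asyncio.create_task(send_audio(ws, path))
        try:
            # Print transcription updates as they arrive
            async for message in ws:
                response = json.loads(message)
                print(response)
                if response.get("type") == "ready_to_stop":
                    break
        finally:
            sender.cancel()


if __name__ == "__main__":
    asyncio.run(main())
```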
| Translation options | Description | Default |
|---|---|---|
| `--nllb-backend` | `transformers` or `ctranslate2` | `ctranslate2` |
| `--nllb-size` | `600M` or `1.3B` | `600M` |
| Diarization options | Description | Default |
|---|---|---|
| `--diarization-backend` | `diart` or `sortformer` | `sortformer` |
| `--disable-punctuation-split` | [NOT FUNCTIONAL IN 0.2.15 / 0.2.16] Disable punctuation-based splits. See #214 | `False` |
| `--segmentation-model` | Hugging Face model ID for the Diart segmentation model. Available models | `pyannote/segmentation-3.0` |
| `--embedding-model` | Hugging Face model ID for the Diart embedding model. Available models | `speechbrain/spkrec-ecapa-voxceleb` |
| WhisperStreaming backend options | Description | Default |
|---|---|---|
| `--confidence-validation` | Use confidence scores for faster validation | `False` |
| `--buffer_trimming` | Buffer trimming strategy (`sentence` or `segment`) | `segment` |
For diarization using Diart, you need to accept the user conditions here for the `pyannote/segmentation` model, here for the `pyannote/segmentation-3.0` model, and here for the `pyannote/embedding` model. Then log in to Hugging Face: `huggingface-cli login`
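If you prefer to authenticate from Python instead of the CLI, the `huggingface_hub` package provides an equivalent helper (shown only as an optional alternative to `huggingface-cli login`):

```python
from huggingface_hub import login

# Prompts for (or accepts) your Hugging Face access token, same effect as `huggingface-cli login`
login()
```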
To deploy WhisperLiveKit in production:
- Server Setup: Install a production ASGI server and launch with multiple workers: `pip install uvicorn gunicorn`, then `gunicorn -k uvicorn.workers.UvicornWorker -w 4 your_app:app`
- Frontend: Host your customized version of the html example & ensure the WebSocket connection points to the correct URL
- Nginx Configuration (recommended for production):
  server {
      listen 80;
      server_name your-domain.com;
      location / {
          proxy_pass http://localhost:8000;
          proxy_set_header Upgrade $http_upgrade;
          proxy_set_header Connection "upgrade";
          proxy_set_header Host $host;
      }
  }
- HTTPS Support: For secure deployments, use "wss://" instead of "ws://" in the WebSocket URL
Deploy the application easily using Docker with GPU or CPU support.
- Docker installed on your system
- For GPU support: NVIDIA Docker runtime installed
With GPU acceleration (recommended):
docker build -t wlk .
docker run --gpus all -p 8000:8000 --name wlk wlk
CPU only:
docker build -f Dockerfile.cpu -t wlk .
docker run -p 8000:8000 --name wlk wlk
Custom configuration:
# Example with custom model and language
docker run --gpus all -p 8000:8000 --name wlk wlk --model large-v3 --language fr
- Large models: Ensure your Docker runtime has sufficient memory allocated
`--build-arg` options:
- `EXTRAS="whisper-timestamped"` - Add extras to the image's installation (no spaces). Remember to set the necessary container options!
- `HF_PRECACHE_DIR="./.cache/"` - Pre-load a model cache for faster first-time start
- `HF_TKN_FILE="./token"` - Add your Hugging Face Hub access token to download gated models
Capture discussions in real time for meeting transcription, help hearing-impaired users follow conversations through accessibility tools, automatically transcribe podcasts or videos for content creation, or transcribe support calls with speaker identification for customer service...



