WhisperLiveKit: Ultra-low-latency, self-hosted speech-to-text with speaker identification

WhisperLiveKit Demo


Powered by Leading Research: streaming transcription builds on SimulStreaming (AlignAtt policy) and WhisperStreaming (LocalAgreement policy), speaker diarization on Sortformer (NVIDIA NeMo) and Diart, and translation on NLLB.

Why not just run a simple Whisper model on every audio batch? Whisper is designed for complete utterances, not real-time chunks. Processing small segments loses context, cuts off words mid-syllable, and produces poor transcription. WhisperLiveKit uses state-of-the-art simultaneous speech research for intelligent buffering and incremental processing.

Architecture


The backend supports multiple concurrent users. Voice Activity Detection reduces overhead when no voice is detected.

Installation & Quick Start

pip install whisperlivekit

You can also clone the repo and pip install -e . for the latest version.

Quick Start

  1. Start the transcription server:

    wlk --model base --language en
  2. Open your browser and navigate to http://localhost:8000. Start speaking and watch your words appear in real-time!

  • See here for the list of all available languages.
  • Check the troubleshooting guide for step-by-step fixes collected from recent GPU setup/env issues.
  • The CLI entry point is exposed as both wlk and whisperlivekit-server; they are equivalent.
  • For HTTPS requirements, see the Parameters section for SSL configuration options.
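
If you would rather drive the server from code than from the browser page, the sketch below streams an audio file to the /asr WebSocket endpoint (the same endpoint used by the bundled frontend and by the Python API example further down) and prints the JSON results. It is only a sketch: the websockets package, the file name, and the chunk size/pacing are illustrative choices, and since the server decodes incoming audio with FFmpeg by default, any common audio container should work unless you run with --pcm-input.

import asyncio
import json

import websockets  # assumption: pip install websockets

async def stream_file(path: str, url: str = "ws://localhost:8000/asr"):
    async with websockets.connect(url) as ws:

        async def receiver():
            # Print each JSON result the server pushes back
            try:
                async for message in ws:
                    print(json.loads(message))
            except websockets.ConnectionClosed:
                pass

        recv_task = asyncio.create_task(receiver())

        # Send the encoded audio in small chunks, roughly paced like a live microphone
        with open(path, "rb") as f:
            while chunk := f.read(4096):
                await ws.send(chunk)
                await asyncio.sleep(0.05)

        await asyncio.sleep(5)  # give the server time to flush remaining results
        recv_task.cancel()

asyncio.run(stream_file("sample.wav"))  # "sample.wav" is a placeholder path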

Browser extension: use it to capture audio directly from web pages.

Go to chrome-extension for instructions.

WhisperLiveKit Demo

Optional Dependencies

Optional | pip install
Windows/Linux optimizations | faster-whisper
Apple Silicon optimizations | mlx-whisper
Translation | nllw
Speaker diarization | git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
OpenAI API | openai
[Not recommended] Speaker diarization with Diart | diart

See Parameters & Configuration below on how to use them.

Usage Examples

Command-line Interface: Start the transcription server with various options:

# Large model, translating from French to Danish
wlk --model large-v3 --language fr --target-language da

# Diarization and server listening on */80 
wlk --host 0.0.0.0 --port 80 --model medium --diarization --language fr

Python API Integration: Check basic_server for a more complete example of how to use the functions and classes.

import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

from whisperlivekit import AudioProcessor, TranscriptionEngine

transcription_engine = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model once at startup; the engine is shared by all connections
    global transcription_engine
    transcription_engine = TranscriptionEngine(model="medium", diarization=True, lan="en")
    yield

app = FastAPI(lifespan=lifespan)

async def handle_websocket_results(websocket: WebSocket, results_generator):
    # Push transcription results to the client as they become available
    async for response in results_generator:
        await websocket.send_json(response)
    await websocket.send_json({"type": "ready_to_stop"})

@app.websocket("/asr")
async def websocket_endpoint(websocket: WebSocket):
    global transcription_engine

    # Create a new AudioProcessor for each connection, passing the shared engine
    audio_processor = AudioProcessor(transcription_engine=transcription_engine)
    results_generator = await audio_processor.create_tasks()
    results_task = asyncio.create_task(handle_websocket_results(websocket, results_generator))
    await websocket.accept()
    try:
        while True:
            message = await websocket.receive_bytes()
            await audio_processor.process_audio(message)
    except WebSocketDisconnect:
        results_task.cancel()

Frontend Implementation: The package includes an HTML/JavaScript implementation here. You can also load it programmatically: from whisperlivekit import get_inline_ui_html, then page = get_inline_ui_html().
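
As a small sketch of that second option, the inline UI can be served from the same FastAPI app used in the example above (the root path and handler name below are illustrative choices, not part of the package):

from fastapi.responses import HTMLResponse
from whisperlivekit import get_inline_ui_html

@app.get("/")
async def web_interface():
    # Serve the bundled HTML/JavaScript frontend at the root URL
    return HTMLResponse(get_inline_ui_html())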

Parameters & Configuration

Parameter | Description | Default
--model | Whisper model size. List and recommendations here | small
--model-path | Local .pt file/directory or Hugging Face repo ID containing the Whisper model. Overrides --model. Recommendations here | None
--language | List here. If you use auto, the model attempts to detect the language automatically, but it tends to bias towards English. | auto
--target-language | If set, translates using NLLB (200 languages available). To translate to English, you can also use --direct-english-translation, which makes the STT model output the translation directly. | None
--diarization | Enable speaker identification | False
--backend-policy | Streaming strategy: 1/ simulstreaming uses the AlignAtt policy from SimulStreaming, 2/ localagreement uses the LocalAgreement policy | simulstreaming
--backend | Whisper implementation selector. auto picks MLX on macOS (if installed), otherwise Faster-Whisper, otherwise vanilla Whisper. You can also force mlx-whisper, faster-whisper, whisper, or openai-api (LocalAgreement only) | auto
--no-vac | Disable the Voice Activity Controller. NOT ADVISED | False
--no-vad | Disable Voice Activity Detection. NOT ADVISED | False
--warmup-file | Audio file path for model warmup | jfk.wav
--host | Server host address | localhost
--port | Server port | 8000
--ssl-certfile | Path to the SSL certificate file (for HTTPS support) | None
--ssl-keyfile | Path to the SSL private key file (for HTTPS support) | None
--forwarded-allow-ips | IP(s) allowed to reverse-proxy the whisperlivekit-server. Supported forms are IP addresses (e.g. 127.0.0.1), IP networks (e.g. 10.100.0.0/16), or literals (e.g. /path/to/socket.sock) | None
--pcm-input | Expect raw PCM (s16le) input and bypass FFmpeg. The frontend uses AudioWorklet instead of MediaRecorder | False

Translation options | Description | Default
--nllb-backend | transformers or ctranslate2 | ctranslate2
--nllb-size | 600M or 1.3B | 600M

Diarization options | Description | Default
--diarization-backend | diart or sortformer | sortformer
--disable-punctuation-split | [NOT FUNCTIONAL IN 0.2.15 / 0.2.16] Disable punctuation-based splits. See #214 | False
--segmentation-model | Hugging Face model ID for the Diart segmentation model. Available models | pyannote/segmentation-3.0
--embedding-model | Hugging Face model ID for the Diart embedding model. Available models | speechbrain/spkrec-ecapa-voxceleb

SimulStreaming backend options | Description | Default
--disable-fast-encoder | Disable the Faster-Whisper or MLX Whisper encoder backends (if installed). Inference can be slower, but this helps when GPU memory is limited | False
--custom-alignment-heads | Use your own alignment heads, useful when --model-dir is used. Use scripts/determine_alignment_heads.py to extract them | None
--frame-threshold | AlignAtt frame threshold (lower = faster, higher = more accurate) | 25
--beams | Number of beams for beam search (1 = greedy decoding) | 1
--decoder | Force decoder type (beam or greedy) | auto
--audio-max-len | Maximum audio buffer length (seconds) | 30.0
--audio-min-len | Minimum audio length to process (seconds) | 0.0
--cif-ckpt-path | Path to CIF model for word boundary detection | None
--never-fire | Never truncate incomplete words | False
--init-prompt | Initial prompt for the model | None
--static-init-prompt | Static prompt that doesn't scroll | None
--max-context-tokens | Maximum context tokens | Depends on the model used, but usually 448

WhisperStreaming backend options | Description | Default
--confidence-validation | Use confidence scores for faster validation | False
--buffer_trimming | Buffer trimming strategy (sentence or segment) | segment
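
As an illustration of --pcm-input, the client then sends raw 16-bit little-endian PCM frames instead of an encoded container. The sketch below is a variant of the Quick Start client, reading frames from a WAV file with the standard wave module; the 16 kHz mono assumption is typical for Whisper pipelines but is not taken from this documentation.

import asyncio
import wave

import websockets  # assumption: pip install websockets

async def stream_pcm(path: str, url: str = "ws://localhost:8000/asr"):
    # Result handling is omitted here; see receiver() in the Quick Start sketch
    with wave.open(path, "rb") as wav:
        assert wav.getnchannels() == 1 and wav.getsampwidth() == 2  # mono, s16le
        async with websockets.connect(url) as ws:
            while frames := wav.readframes(1600):  # ~0.1 s of audio at 16 kHz
                await ws.send(frames)
                await asyncio.sleep(0.1)

asyncio.run(stream_pcm("sample_16k_mono.wav"))  # placeholder file name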

For diarization using Diart, you need to accept the user conditions here for the pyannote/segmentation model, here for the pyannote/segmentation-3.0 model, and here for the pyannote/embedding model. Then, log in to Hugging Face: huggingface-cli login
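
If you prefer to authenticate from Python rather than the CLI, the huggingface_hub package exposes an equivalent login call (the token below is a placeholder):

from huggingface_hub import login

login(token="hf_xxx")  # paste your own Hugging Face access token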

🚀 Deployment Guide

To deploy WhisperLiveKit in production:

  1. Server Setup: Install a production ASGI server and launch it with multiple workers

    pip install uvicorn gunicorn
    gunicorn -k uvicorn.workers.UvicornWorker -w 4 your_app:app
  2. Frontend: Host your customized version of the HTML example and ensure the WebSocket connection points to the correct URL


  3. Nginx Configuration (recommended for production):

    server {
        listen 80;
        server_name your-domain.com;

        location / {
            proxy_pass http://localhost:8000;
            proxy_http_version 1.1;  # required for the WebSocket upgrade
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_set_header Host $host;
        }
    }
  4. HTTPS Support: For secure deployments, use "wss://" instead of "ws://" in the WebSocket URL

๐Ÿ‹ Docker

Deploy the application easily using Docker with GPU or CPU support.

Prerequisites

  • Docker installed on your system
  • For GPU support: NVIDIA Docker runtime installed

Quick Start

With GPU acceleration (recommended):

docker build -t wlk .
docker run --gpus all -p 8000:8000 --name wlk wlk

CPU only:

docker build -f Dockerfile.cpu -t wlk .
docker run -p 8000:8000 --name wlk wlk

Advanced Usage

Custom configuration:

# Example with custom model and language
docker run --gpus all -p 8000:8000 --name wlk wlk --model large-v3 --language fr

Memory Requirements

  • Large models: Ensure your Docker runtime has sufficient memory allocated

Customization

  • --build-arg Options:
    • EXTRAS="whisper-timestamped" - Add extras to the image's installation (no spaces). Remember to set necessary container options!
    • HF_PRECACHE_DIR="./.cache/" - Pre-load a model cache for faster first-time start
    • HF_TKN_FILE="./token" - Add your Hugging Face Hub access token to download gated models

🔮 Use Cases

  • Meeting transcription: capture discussions in real time
  • Accessibility tools: help hearing-impaired users follow conversations
  • Content creation: automatically transcribe podcasts or videos
  • Customer service: transcribe support calls with speaker identification
