Local audio/video transcription with speaker diarization, speaker labeling, and AI-powered meeting analysis. Built with WhisperX and Claude.
Works in two ways:
- Web app -- upload recordings, view synced transcripts, rename speakers, and generate AI analyses from your browser
- CLI -- transcribe files directly from the terminal
All commands below are run in the Terminal app.
- macOS: Open Finder → Applications → Utilities → Terminal (or press Cmd + Space, type "Terminal", and hit Enter)
Homebrew is a package manager that makes it easy to install developer tools on macOS. Skip this step if you already have it (run `brew --version` to check).

```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```

Follow the on-screen instructions. When it finishes, close and reopen Terminal so the `brew` command is available.
This project requires Python 3.12 specifically (not 3.11, not 3.13 -- some dependencies only work with 3.12).
```bash
# macOS
brew install python@3.12 ffmpeg

# Ubuntu / Debian
sudo apt update
sudo apt install python3.12 python3.12-venv ffmpeg
```

Verify both are installed:

```bash
python3.12 --version   # should print Python 3.12.x
ffmpeg -version        # should print version info
```

Then clone and set up the project:

```bash
cd ~/Desktop
git clone <this-repo-url>
cd audio-transcription
make setup
```

Don't have `git`? Run `brew install git` (macOS) or `sudo apt install git` (Linux) first, or download the project as a ZIP from GitHub and unzip it.
`make setup` creates an isolated Python environment and installs all dependencies. This may take a few minutes (PyTorch is a large download).
Speaker identification requires a free HuggingFace account and token.
- Create an account at https://huggingface.co
- Go to https://huggingface.co/settings/tokens and create a token (choose "Read" access)
- Accept the license agreement on each of these model pages (click "Agree and access repository" on each):
```bash
cp .env.example .env
```

Open `.env` in any text editor (TextEdit on macOS, or `nano .env` in Terminal) and paste your token:

```
HF_TOKEN=hf_your_token_here
```
The `.env` file is gitignored, so your keys stay private.
```bash
make run
```

Open http://localhost:8000 in your browser. You can now upload audio/video files and start transcribing.

Use `localhost`, not `0.0.0.0`. Some browser features like desktop notifications require a secure context. Chrome treats `localhost` as secure, but not `0.0.0.0`.
The web app lets you upload recordings, track transcription progress, view transcripts synchronized with audio playback, and generate LLM-ready prompts for analysis.
- Upload audio/video files (mp3, mp4, m4a, wav, webm) up to 500 MB
- Real-time progress tracking as files are transcribed
- Audio player synced with the transcript -- click any line to jump to that moment
- Playback speed control (0.5x to 2x)
- Speaker renaming -- click a speaker label to assign a real name; recent names are remembered
- LLM-ready analysis prompts -- pick a template (interview, sales, client, general), and the app combines it with your transcript into a prompt you can paste into any LLM
- Desktop notifications -- get a browser notification when transcription finishes (or fails), even if the tab is in the background
- Retry failed transcriptions
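The analysis-prompt feature can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the template wording, the `TEMPLATES` dict, and the `build_prompt` function are all assumptions made for the example.

```python
# Hypothetical sketch: combine an analysis template with a transcript
# into a single prompt you could paste into any LLM. Template names
# mirror the meeting types mentioned above; the wording is invented.
TEMPLATES = {
    "interview": "Summarize this interview: candidate strengths, concerns, and open questions.",
    "general": "Summarize this meeting: key decisions, action items, and open questions.",
}

def build_prompt(meeting_type: str, transcript: str) -> str:
    # Unknown meeting types fall back to the general template.
    template = TEMPLATES.get(meeting_type, TEMPLATES["general"])
    return f"{template}\n\n--- TRANSCRIPT ---\n{transcript}"

prompt = build_prompt("general", "[00:00:15] Julien: So let's start with the Q4 roadmap.")
```

The meeting type chosen at upload time (see the options table below) selects the template; everything else comes straight from the stored transcript.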
When uploading, you can optionally specify:
| Option | Default | What it does |
|---|---|---|
| Title | Filename | Display name for the meeting |
| Meeting type | Other | Determines which analysis template is used |
| Language | Auto-detect | Set explicitly for faster transcription |
| Number of speakers | Auto-detect | Set explicitly for better speaker identification |
The CLI is useful for batch processing or scripting.
Basic transcription:
```bash
python transcriber.py meeting.mp3
```

Supports mp3, mp4, wav, m4a, and other ffmpeg-compatible formats. Output is saved alongside the input file (e.g., `meeting.mp3` -> `meeting.txt`).
With speaker names:
```bash
# First, identify who's who
python transcriber.py meeting.mp3 --identify-speakers

# Then run with names mapped
python transcriber.py meeting.mp3 --speakers "Julien,Alice,Bob"

# Or use interactive mode to name speakers as you go
python transcriber.py meeting.mp3 --interactive
```

Options:
```bash
# Specify language (faster -- skips auto-detection)
python transcriber.py meeting.mp3 --language en

# Use smaller model (faster, less accurate)
python transcriber.py meeting.mp3 --model medium

# Output as SRT subtitles
python transcriber.py meeting.mp3 --format srt

# Output as JSON
python transcriber.py meeting.mp3 --format json -o transcript.json

# Help with diarization accuracy
python transcriber.py meeting.mp3 --num-speakers 3

# Lower batch size if you get out-of-memory errors
python transcriber.py meeting.mp3 --batch-size 4

# Skip diarization entirely
python transcriber.py meeting.mp3 --no-diarization

# Quiet mode (suppress progress messages)
python transcriber.py meeting.mp3 --quiet
```

Example output:

```
[00:00:15] Julien: So let's start with the Q4 roadmap discussion.
[00:00:42] Alice: I think we should prioritize the API work first.
[00:01:03] Julien: Makes sense. What's the timeline looking like?
[00:01:15] Bob: I'd estimate about three weeks for the core functionality.
```
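Lines in this format are easy to post-process. A small parsing sketch, assuming only the `[HH:MM:SS] Speaker: text` shape shown above:

```python
import re

# Matches lines like "[00:01:15] Bob: I'd estimate about three weeks..."
LINE_RE = re.compile(r"\[(\d{2}):(\d{2}):(\d{2})\]\s+([^:]+):\s+(.*)")

def parse_line(line: str):
    """Return {'seconds', 'speaker', 'text'} for a transcript line, or None."""
    m = LINE_RE.match(line)
    if not m:
        return None
    h, mnt, s = (int(m.group(i)) for i in (1, 2, 3))
    return {
        "seconds": h * 3600 + mnt * 60 + s,
        "speaker": m.group(4),
        "text": m.group(5),
    }

entry = parse_line("[00:01:15] Bob: I'd estimate about three weeks for the core functionality.")
# entry["seconds"] == 75, entry["speaker"] == "Bob"
```

For structured output without parsing, use `--format json` instead.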
All settings are configured via environment variables (in your `.env` file or exported in your shell).
| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | (none) | HuggingFace token for speaker diarization |
| `WHISPER_MODEL` | `large-v3` | Whisper model size (`large-v3`, `medium`, `small`, `base`) |
| `WHISPER_DEVICE` | `auto` | Compute device (`auto`, `cuda`, `cpu`) |
| `WHISPER_BATCH_SIZE` | `16` | Batch size for transcription (lower if out of memory) |
| `DATA_DIR` | `./data` | Where meeting data is stored |
| `MAX_UPLOAD_SIZE` | `500MB` | Maximum upload file size |
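Put together, a complete `.env` might look like this (the values below are illustrative, not recommendations):

```
HF_TOKEN=hf_your_token_here
WHISPER_MODEL=medium
WHISPER_DEVICE=auto
WHISPER_BATCH_SIZE=8
DATA_DIR=./data
MAX_UPLOAD_SIZE=500MB
```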
- Language: Use `--language en` (or `fr`, `de`, etc.) to skip auto-detection and speed up transcription.
- Model choice: `large-v3` is best quality but slower. `medium` is a good balance. `small` or `base` for quick drafts.
- GPU memory: If you run out, reduce `--batch-size` (CLI) or set `WHISPER_BATCH_SIZE` in `.env` (web app) to 8 or 4.
- Known speaker count: Specifying the number of speakers improves diarization accuracy.
- No GPU? It still works on CPU, just slower. Set `WHISPER_DEVICE=cpu` if auto-detection has issues.