This project combines advanced speech recognition and speaker diarization techniques to transcribe and identify speakers in audio recordings. We use OpenAI's Whisper model for transcription and NVIDIA's NeMo MSDD model for speaker diarization. The project can process various types of audio, including telephonic, meeting, and general conversations, with high accuracy and efficiency.
- Python 3.10
- `ffmpeg` installed on your machine
- CUDA-enabled GPU (optional but recommended for faster processing)
- Clone the repository:

```bash
git clone https://github.com/thibaudbrg/whisper-diarization.git
cd whisper-diarization
```
- Fetch the Uroman submodule:

```bash
git config --global --add safe.directory /path/to/whisper-diarization
git submodule update --init --recursive
```
- Install dependencies using Poetry:

```bash
poetry install
```
- Configure your environment variables by creating a `.env` file (if necessary):

```bash
cp .env.example .env
```
- Build the `ctc_forced_aligner` C++ library:

```bash
cd whisper_diarization/ctc_forced_aligner/
python setup.py build_ext --inplace
```
Run the main script `concurrent_diarize.py` with the required arguments:

```bash
python whisper_diarization/concurrent_diarize.py -a audios/<audio_file.wav> --whisper-model <model_name>
```
- `-a, --audio`: Name of the target audio file (required).
- `--no-stem`: Disables source separation. This helps with long files that don't contain a lot of music.
- `--suppress_numerals`: Converts all numerical digits into written text, improving diarization accuracy.
- `--whisper-model`: Name of the Whisper model to use (default: `medium.en`).
- `--batch-size`: Batch size for batched inference. Reduce if you run out of memory; set to 0 for non-batched inference (default: 8).
- `--language`: Language spoken in the audio. Specify `None` to perform language detection.
- `--device`: Device to run the model on. Use `cuda` if you have a GPU, otherwise `cpu`.
The `concurrent_diarize.py` script orchestrates the entire audio transcription and speaker diarization process. Below is a high-level overview of the steps involved:
- **Parsing Command Line Arguments**: The script accepts various arguments to customize the transcription and diarization process.
- **Vocal Isolation**: Uses Demucs to separate vocals from background music if the `--no-stem` flag is not set (see the first sketch after this list).
- **Transcription**: Utilizes the Whisper model for audio transcription (sketched below).
- **Forced Alignment**: Aligns the transcribed text with the audio using Wav2Vec2 (sketched below).
- **Mono Audio Conversion**: Converts audio to mono for compatibility with NeMo MSDD (sketched below, together with diarization).
- **Speaker Diarization**: Performs speaker diarization using the NeMo MSDD model.
- **Restoring Punctuation**: Restores punctuation in the transcribed text using a deep learning model (sketched below).
- **Writing Output Files**: Generates and saves the final speaker-aware transcript and SRT files.
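For illustration, here is a minimal sketch of the vocal-isolation step, assuming Demucs is installed and invoked through its CLI. The model name `htdemucs`, the paths, and the output directory are illustrative assumptions, not necessarily this repository's choices:

```python
import subprocess
import sys

# Hypothetical paths; the repository's actual layout may differ.
audio_path = "audios/example.wav"
output_dir = "temp_outputs"

# Run Demucs in a subprocess, keeping only the vocals stem.
subprocess.run(
    [
        sys.executable, "-m", "demucs.separate",
        "--two-stems=vocals",   # split into vocals / no_vocals only
        "-n", "htdemucs",       # pretrained model name (assumption)
        "-o", output_dir,
        audio_path,
    ],
    check=True,
)
# The isolated vocals end up under <output_dir>/htdemucs/<track>/vocals.wav.
```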
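The transcription step could look like the following, assuming a faster-whisper backend (the `--batch-size` flag suggests batched inference; treat the exact backend and API as assumptions rather than this repository's code):

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

# Model name, device, and batch size mirror the CLI flags above.
model = WhisperModel("medium.en", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# batch_size corresponds to --batch-size; 0 would mean falling back
# to plain model.transcribe() without batching.
segments, info = batched_model.transcribe("audios/example.wav", batch_size=8)

full_transcript = "".join(segment.text for segment in segments)
print(info.language, full_transcript[:200])
```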
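The repository builds its own `ctc_forced_aligner` extension for the alignment step; as a conceptual stand-in only (an assumption, not the project's code), torchaudio ships a comparable Wav2Vec2-based forced aligner:

```python
import torch
import torchaudio

# torchaudio's MMS forced-alignment bundle, used here as a stand-in
# for the project's own ctc_forced_aligner extension.
bundle = torchaudio.pipelines.MMS_FA
model = bundle.get_model(with_star=False).eval()
tokenizer = bundle.get_tokenizer()
aligner = bundle.get_aligner()

waveform, sr = torchaudio.load("audios/example.wav")  # hypothetical path
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

transcript = "hello world how are you".split()
with torch.inference_mode():
    emission, _ = model(waveform)
    # One span list per word, carrying frame-level start/end indices
    # that map the transcript back onto the audio timeline.
    token_spans = aligner(emission[0], tokenizer(transcript))
```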
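Mono conversion plus diarization might be sketched as follows. `NeuralDiarizer` and `diarize()` are real NeMo APIs, but the config file name and the wiring here are assumptions; in practice the config must also point at a manifest describing the input audio:

```python
import torchaudio
from omegaconf import OmegaConf
from nemo.collections.asr.models.msdd_models import NeuralDiarizer

# NeMo MSDD expects mono input, so downmix the channels first.
waveform, sample_rate = torchaudio.load("audios/example.wav")
mono = waveform.mean(dim=0, keepdim=True)
torchaudio.save("temp_outputs/mono_file.wav", mono, sample_rate)

# Load one of the scenario configs from the config directory
# (file name is an assumption) and run the diarizer.
cfg = OmegaConf.load("config/diar_infer_telephonic.yaml")
msdd_model = NeuralDiarizer(cfg=cfg).to("cuda")
msdd_model.diarize()
# Speaker segments are written as an RTTM file under the
# output directory named in the config.
```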
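Punctuation restoration is commonly done with the `deepmultilingualpunctuation` package; whether this repository uses that exact package and model is an assumption:

```python
from deepmultilingualpunctuation import PunctuationModel

# Downloads a multilingual punctuation model on first use.
punct_model = PunctuationModel(model="kredor/punctuate-all")

raw_text = "hello how are you i am fine thanks"
# Prints the text with predicted punctuation inserted.
print(punct_model.restore_punctuation(raw_text))
```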
For better efficiency, the script runs Whisper transcription and NeMo MSDD diarization concurrently, as sketched below.
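The concurrency can be pictured as two worker processes, one per model; this is a minimal stdlib sketch, not the script's actual structure:

```python
from concurrent.futures import ProcessPoolExecutor

def transcribe(audio_path: str):
    """Whisper transcription (see the sketch above)."""
    ...

def diarize(audio_path: str):
    """NeMo MSDD diarization (see the sketch above)."""
    ...

if __name__ == "__main__":
    audio = "audios/example.wav"  # hypothetical path
    # Run both models at the same time in separate processes.
    with ProcessPoolExecutor(max_workers=2) as pool:
        transcript_future = pool.submit(transcribe, audio)
        diarization_future = pool.submit(diarize, audio)
        transcript = transcript_future.result()
        speaker_segments = diarization_future.result()
```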
Configuration files for different diarization scenarios (general, meeting, telephonic) are stored in the `config` directory. You can customize these YAML files based on your specific needs.

Processed outputs, including transcribed text files and SRT subtitle files, are saved in the `outputs` directory.
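For reference, the SRT format the output files follow can be produced with a few lines of Python; the segment tuples and file names below are hypothetical, not this repository's writer:

```python
import os

def format_timestamp(seconds: float) -> str:
    """Render seconds as the HH:MM:SS,mmm timestamp SRT expects."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

# Hypothetical speaker-aware segments: (start, end, speaker, text).
segments = [(0.0, 2.5, "Speaker 0", "Hello there."),
            (2.5, 5.0, "Speaker 1", "Hi, how are you?")]

os.makedirs("outputs", exist_ok=True)
with open("outputs/example.srt", "w", encoding="utf-8") as srt:
    for i, (start, end, speaker, text) in enumerate(segments, start=1):
        srt.write(f"{i}\n{format_timestamp(start)} --> {format_timestamp(end)}\n")
        srt.write(f"{speaker}: {text}\n\n")
```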
- Ensure your audio files are in a supported format (e.g., WAV).
- Verify that you have the correct versions of all dependencies installed.
- For CUDA-related issues, make sure your GPU drivers and CUDA toolkit are correctly installed.
This project is licensed under the MIT License. For more information, see the `LICENSE` file.