This project combines advanced speech recognition and speaker diarization techniques to transcribe and identify speakers in audio recordings. We use OpenAI's Whisper model for transcription and NVIDIA's NeMo MSDD model for speaker diarization. The project can process various types of audio, including telephonic, meeting, and general conversations, with high accuracy and efficiency.
- Python 3.10
- `ffmpeg` installed on your machine
- CUDA-enabled GPU (optional but recommended for faster processing)
- Clone the repository:

```bash
git clone https://github.com/thibaudbrg/whisper-diarization.git
cd whisper-diarization
```
- Fetch the Uroman submodule:

```bash
git config --global --add safe.directory /path/to/whisper-diarization
git submodule update --init --recursive
```
- Install dependencies using Poetry:

```bash
poetry install
```
- Configure your environment variables by creating a `.env` file (if necessary):

```bash
cp .env.example .env
```
- Build the `ctc_forced_aligner` C++ library:

```bash
cd whisper_diarization/ctc_forced_aligner/
python setup.py build_ext --inplace
```
Run the main script `concurrent_diarize.py` with the required arguments:

```bash
python whisper_diarization/concurrent_diarize.py -a audios/<audio_file.wav> --whisper-model <model_name>
```
- `-a, --audio`: Name of the target audio file (required).
- `--no-stem`: Disables source separation. This helps with long files that don't contain a lot of music.
- `--suppress_numerals`: Converts all numerical digits into written text, improving diarization accuracy.
- `--whisper-model`: Name of the Whisper model to use (default: `medium.en`).
- `--batch-size`: Batch size for batched inference. Reduce if you run out of memory; set to 0 for non-batched inference (default: 8).
- `--language`: Language spoken in the audio. Specify `None` to perform language detection.
- `--device`: Device to run the model on. Use `cuda` if you have a GPU, otherwise `cpu`.
The `concurrent_diarize.py` script orchestrates the entire audio transcription and speaker diarization process. Below is a high-level overview of the steps involved:
- **Parsing Command Line Arguments**: The script accepts various arguments to customize the transcription and diarization process.
- **Vocal Isolation**: Uses Demucs to separate vocals from background music if the `--no-stem` flag is not set (see the first sketch after this list).
- **Transcription**: Utilizes the Whisper model for audio transcription (sketched below).
- **Forced Alignment**: Aligns the transcribed text with the audio using Wav2Vec2 (sketched below).
- **Mono Audio Conversion**: Converts audio to mono for compatibility with NeMo MSDD (sketched below, together with diarization).
- **Speaker Diarization**: Performs speaker diarization using the NeMo MSDD model.
- **Restoring Punctuation**: Restores punctuation in the transcribed text using a deep learning model (sketched below).
- **Writing Output Files**: Generates and saves the final speaker-aware transcript and SRT files.
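For illustration, here is a minimal sketch of the vocal-isolation step, assuming Demucs is installed and invoked through its CLI. The model name `htdemucs`, the paths, and the output directory are illustrative assumptions, not necessarily this repository's choices:

```python
import subprocess
import sys

# Hypothetical paths; the repository's actual layout may differ.
audio_path = "audios/example.wav"
output_dir = "temp_outputs"

# Run Demucs in a subprocess, keeping only the vocals stem.
subprocess.run(
    [
        sys.executable, "-m", "demucs.separate",
        "--two-stems=vocals",   # split into vocals / no_vocals only
        "-n", "htdemucs",       # pretrained model name (assumption)
        "-o", output_dir,
        audio_path,
    ],
    check=True,
)
# The isolated vocals end up under <output_dir>/htdemucs/<track>/vocals.wav.
```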
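The transcription step could look like the following, assuming a faster-whisper backend (the `--batch-size` flag suggests batched inference; treat the exact backend and API as assumptions rather than this repository's code):

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

# Model name, device, and batch size mirror the CLI flags above.
model = WhisperModel("medium.en", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# batch_size corresponds to --batch-size; 0 would mean falling back
# to plain model.transcribe() without batching.
segments, info = batched_model.transcribe("audios/example.wav", batch_size=8)

full_transcript = "".join(segment.text for segment in segments)
print(info.language, full_transcript[:200])
```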
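The repository builds its own `ctc_forced_aligner` extension for the alignment step; as a conceptual stand-in only (an assumption, not the project's code), torchaudio ships a comparable Wav2Vec2-based forced aligner:

```python
import torch
import torchaudio

# torchaudio's MMS forced-alignment bundle, used here as a stand-in
# for the project's own ctc_forced_aligner extension.
bundle = torchaudio.pipelines.MMS_FA
model = bundle.get_model(with_star=False).eval()
tokenizer = bundle.get_tokenizer()
aligner = bundle.get_aligner()

waveform, sr = torchaudio.load("audios/example.wav")  # hypothetical path
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

transcript = "hello world how are you".split()
with torch.inference_mode():
    emission, _ = model(waveform)
    # One span list per word, carrying frame-level start/end indices
    # that map the transcript back onto the audio timeline.
    token_spans = aligner(emission[0], tokenizer(transcript))
```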
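Mono conversion plus diarization might be sketched as follows. `NeuralDiarizer` and `diarize()` are real NeMo APIs, but the config file name and the wiring here are assumptions; in practice the config must also point at a manifest describing the input audio:

```python
import torchaudio
from omegaconf import OmegaConf
from nemo.collections.asr.models.msdd_models import NeuralDiarizer

# NeMo MSDD expects mono input, so downmix the channels first.
waveform, sample_rate = torchaudio.load("audios/example.wav")
mono = waveform.mean(dim=0, keepdim=True)
torchaudio.save("temp_outputs/mono_file.wav", mono, sample_rate)

# Load one of the scenario configs from the config directory
# (file name is an assumption) and run the diarizer.
cfg = OmegaConf.load("config/diar_infer_telephonic.yaml")
msdd_model = NeuralDiarizer(cfg=cfg).to("cuda")
msdd_model.diarize()
# Speaker segments are written as an RTTM file under the
# output directory named in the config.
```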
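Punctuation restoration is commonly done with the `deepmultilingualpunctuation` package; whether this repository uses that exact package and model is an assumption:

```python
from deepmultilingualpunctuation import PunctuationModel

# Downloads a multilingual punctuation model on first use.
punct_model = PunctuationModel(model="kredor/punctuate-all")

raw_text = "hello how are you i am fine thanks"
# Prints the text with predicted punctuation inserted.
print(punct_model.restore_punctuation(raw_text))
```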
For better efficiency, the script runs Whisper transcription and NeMo MSDD diarization concurrently, as sketched below.
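The concurrency can be pictured as two worker processes, one per model; this is a minimal stdlib sketch, not the script's actual structure:

```python
from concurrent.futures import ProcessPoolExecutor

def transcribe(audio_path: str):
    """Whisper transcription (see the sketch above)."""
    ...

def diarize(audio_path: str):
    """NeMo MSDD diarization (see the sketch above)."""
    ...

if __name__ == "__main__":
    audio = "audios/example.wav"  # hypothetical path
    # Run both models at the same time in separate processes.
    with ProcessPoolExecutor(max_workers=2) as pool:
        transcript_future = pool.submit(transcribe, audio)
        diarization_future = pool.submit(diarize, audio)
        transcript = transcript_future.result()
        speaker_segments = diarization_future.result()
```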
Configuration files for different diarization scenarios (general, meeting, telephonic) are stored in the `config` directory. You can customize these YAML files based on your specific needs.

Processed outputs, including transcribed text files and SRT subtitle files, are saved in the `outputs` directory.
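For reference, the SRT format the output files follow can be produced with a few lines of Python; the segment tuples and file names below are hypothetical, not this repository's writer:

```python
import os

def format_timestamp(seconds: float) -> str:
    """Render seconds as the HH:MM:SS,mmm timestamp SRT expects."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

# Hypothetical speaker-aware segments: (start, end, speaker, text).
segments = [(0.0, 2.5, "Speaker 0", "Hello there."),
            (2.5, 5.0, "Speaker 1", "Hi, how are you?")]

os.makedirs("outputs", exist_ok=True)
with open("outputs/example.srt", "w", encoding="utf-8") as srt:
    for i, (start, end, speaker, text) in enumerate(segments, start=1):
        srt.write(f"{i}\n{format_timestamp(start)} --> {format_timestamp(end)}\n")
        srt.write(f"{speaker}: {text}\n\n")
```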
- Ensure your audio files are in a supported format (e.g., WAV).
- Verify that you have the correct versions of all dependencies installed.
- For CUDA-related issues, make sure your GPU drivers and CUDA toolkit are correctly installed.
This project is licensed under the MIT License. For more information, see the `LICENSE` file.