Add earnings config by nithinraok · Pull Request #130 · NVIDIA/NeMo-speech-data-processor

nithinraok · 2025-06-17T17:35:14Z

Add Earnings21/22 Dataset Processing Pipeline with Forced Alignment

Overview

This PR adds a 7-step pipeline that converts the Earnings21 and Earnings22 datasets to NeMo manifest format and uses forced alignment to produce word-level timestamps. The pipeline can process either the full datasets or the evaluation subset, and speaker segmentation is optional.

High-Level Changelog

New Features

Core Pipeline Processors:

CreateInitialAudioAndManifest: Initial audio manifest creation with automatic audio conversion (MP3 → WAV, multi-channel → mono, any sample rate → 16kHz)
CreateFullAudioManifestEarnings21: Ground truth text reconstruction from NLP token files with punctuation/capitalization preservation
NeMoForcedAligner: Word-level forced alignment using NeMo ASR models with CTC heads
CreateSentenceSegmentedManifest: Intelligent sentence-level segmentation based on CTM files with punctuation-aware splitting
SpeakerSegmentedManifest: Speaker-change detection and segmentation with optional metadata mapping

Dataset Support:

Earnings21 support (full dataset + eval10 subset)
Earnings22 support
Dual NLP file location handling for flexible dataset structures
Speaker metadata CSV integration for name mapping

Audio Processing:

Automatic audio format conversion (MP3/WAV → 16kHz mono WAV)
Accurate duration calculation from audio files
Batch processing with configurable test mode
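
The conversion target described above (any input to 16 kHz mono WAV) can be illustrated with a small sketch. The PR description does not say which tool the processor uses internally, so the ffmpeg command line here is an assumption chosen for illustration, and `build_ffmpeg_command` and `convert_to_mono_wav` are hypothetical helper names, not part of the PR.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src: str, dst: str, sample_rate: int = 16000) -> list:
    """Build an ffmpeg argument list that writes a mono WAV at `sample_rate` Hz."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it already exists
        "-i", src,                # input file (MP3, WAV, ...)
        "-ac", "1",               # downmix to a single channel
        "-ar", str(sample_rate),  # resample to the target rate
        dst,
    ]


def convert_to_mono_wav(src: str, dst: str, sample_rate: int = 16000) -> None:
    """Run the conversion. Assumes the ffmpeg binary is available on PATH."""
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_command(src, dst, sample_rate), check=True)
```

Keeping the command construction separate from the subprocess call makes the arguments easy to inspect or log without needing ffmpeg installed.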

Pipeline Configuration

7-Step Processing Workflow:

Initial Audio Manifest → Full audio files with duration
Text Population → Add ground truth transcripts from NLP files
Text Cleaning → Remove artifacts, brackets, special characters
Forced Alignment → Generate word-level CTM files with timestamps
Sentence Segmentation → Create sentence-level segments from CTM data
Speaker Segmentation → Create speaker-level segments (optional)
Field Filtering → Keep only required manifest fields
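
As a rough illustration of the text cleaning step in the workflow above, the sketch below removes bracketed annotations and unexpected characters with regular expressions. The exact rules used in the PR's config are not shown in this description, so the patterns and the `clean_transcript` name are assumptions.

```python
import re


def clean_transcript(text: str) -> str:
    """Remove bracketed artifacts and special characters from a transcript."""
    # Drop annotations such as "[laughter]" or "<inaudible>".
    text = re.sub(r"\[[^\]]*\]|<[^>]*>", " ", text)
    # Keep letters, digits, whitespace, and basic punctuation; replace the rest.
    text = re.sub(r"[^A-Za-z0-9\s.,?!'$%-]", " ", text)
    # Collapse the runs of whitespace left behind by the removals.
    text = re.sub(r"\s+", " ", text).strip()
    # Re-attach punctuation that ended up separated from its word.
    text = re.sub(r"\s+([.,?!])", r"\1", text)
    return text
```

Punctuation and capitalization are left untouched here, in line with the preserve_punctuation and preserve_capitalization options listed in the configuration.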

Key Configuration Options:

dataset_type: "earnings21" | "earnings22"
subset: "full" | "eval10" (earnings21 only)
forced_alignment_model: Configurable NeMo ASR model
preserve_punctuation / preserve_capitalization: Text processing options
include_speaker_info / include_tags: Optional metadata inclusion
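
The constraints among these options (for example, eval10 being valid only for earnings21) can be expressed as a small validation sketch. The `EarningsConfig` class and its default values are hypothetical and exist only to illustrate the option set; the actual config is a YAML file whose defaults are not shown in this description.

```python
from dataclasses import dataclass

# Allowed subsets per dataset, as stated in the option list above.
VALID_SUBSETS = {
    "earnings21": {"full", "eval10"},
    "earnings22": {"full"},
}


@dataclass
class EarningsConfig:
    # All default values below are assumptions chosen for illustration.
    dataset_type: str = "earnings21"
    subset: str = "full"
    forced_alignment_model: str = "nvidia/parakeet-tdt_ctc-1.1b"
    preserve_punctuation: bool = True
    preserve_capitalization: bool = True
    include_speaker_info: bool = True
    include_tags: bool = False

    def __post_init__(self):
        if self.dataset_type not in VALID_SUBSETS:
            raise ValueError(f"Unknown dataset_type: {self.dataset_type!r}")
        if self.subset not in VALID_SUBSETS[self.dataset_type]:
            raise ValueError(
                f"subset {self.subset!r} is not available for {self.dataset_type}"
            )
```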

Output Formats

Sentence-Level Segments (Primary Output):

{
  "audio_filepath": "/path/to/audio.wav",
  "duration": 15.2,
  "offset": 45.3,
  "text": "This is a complete sentence with proper punctuation.",
  "alignment": [
    {"word": "This", "start": 45.3, "end": 45.6},
    {"word": "is", "start": 45.6, "end": 45.8}
  ]
}
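
A minimal sketch of how word-level alignments might be grouped into sentence-level entries of this shape, assuming a sentence ends at any word whose last character is `.`, `?`, or `!`. The real CreateSentenceSegmentedManifest processor reads CTM files and may apply different rules; `segment_sentences` and `_make_segment` are hypothetical names.

```python
def _make_segment(words, audio_filepath):
    """Build one manifest entry from a run of aligned words."""
    start, end = words[0]["start"], words[-1]["end"]
    return {
        "audio_filepath": audio_filepath,
        "duration": round(end - start, 3),
        "offset": start,
        "text": " ".join(w["word"] for w in words),
        "alignment": list(words),
    }


def segment_sentences(words, audio_filepath, terminators=".?!"):
    """Split aligned words into segments at sentence-final punctuation."""
    segments, current = [], []
    for w in words:
        current.append(w)
        if w["word"] and w["word"][-1] in terminators:
            segments.append(_make_segment(current, audio_filepath))
            current = []
    if current:  # trailing words without final punctuation
        segments.append(_make_segment(current, audio_filepath))
    return segments
```

Here `offset` is the start time of the first word and `duration` spans to the end of the last word, which matches the relationship between the fields in the example entry.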

Speaker-Level Segments (Optional):

{
  "audio_filepath": "/path/to/audio.wav", 
  "duration": 0,
  "text": "Speaker segment text...",
  "speaker": "speaker_1",
  "segment_id": 0
}
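
Speaker-change segmentation can be sketched as grouping consecutive tokens that share a speaker ID. The token dictionaries below (keys `token` and `speaker`) are an assumed in-memory representation, not the NLP file format itself, and `segment_by_speaker` is a hypothetical name. Durations are left out because the description does not say how they are computed at this stage.

```python
from itertools import groupby


def segment_by_speaker(tokens, audio_filepath, speaker_names=None):
    """Group consecutive tokens with the same speaker into segments."""
    speaker_names = speaker_names or {}  # optional ID -> display name mapping
    segments = []
    runs = groupby(tokens, key=lambda t: t["speaker"])
    for segment_id, (speaker, run) in enumerate(runs):
        segments.append({
            "audio_filepath": audio_filepath,
            "text": " ".join(t["token"] for t in run),
            "speaker": speaker_names.get(speaker, f"speaker_{speaker}"),
            "segment_id": segment_id,
        })
    return segments
```

Because `groupby` only merges adjacent items, a speaker who talks, is interrupted, and talks again yields two separate segments, which is the behavior a change-detection step needs.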

Usage Examples

# Process Earnings21 full dataset
python main.py --config-path=dataset_configs/english/earnings21 --config-name=config \
  dataset_type=earnings21 \
  dataset_root=/path/to/earnings21 \
  output_directory=/path/to/output

# Process Earnings22 with custom model
python main.py --config-path=dataset_configs/english/earnings21 --config-name=config \
  dataset_type=earnings22 \
  forced_alignment_model=nvidia/parakeet-tdt_ctc-1.1b \
  dataset_root=/path/to/earnings22 \
  output_directory=/path/to/output
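
After a run, the resulting manifest can be inspected with a few lines of Python. NeMo manifests store one JSON object per line; the file name under the output directory is not given in this description, so any path passed to `read_manifest` is an assumption, and the helper names are hypothetical.

```python
import json


def parse_manifest_lines(lines):
    """Parse an iterable of JSON-lines strings, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def read_manifest(path):
    """Read a manifest file with one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return parse_manifest_lines(f)


def total_hours(entries):
    """Sum the `duration` field (in seconds) and convert to hours."""
    return sum(e.get("duration", 0.0) for e in entries) / 3600.0
```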

lilithgrigoryan

Hi @nithinraok. Thanks!

Overall looks good to me. I left some comments about docs and docstrings that need to be fixed, and I will review the code once more. Also, please consider adding end-to-end tests.

sdp/processors/datasets/earnings21/__init__.py

sdp/processors/datasets/earnings21/apply_normalizations.py

sdp/processors/datasets/earnings21/create_initial_manifest.py

dataset_configs/english/earnings21/config.yaml

sdp/processors/datasets/earnings21/create_initial_manifest.py

lilithgrigoryan · 2025-06-18T08:34:09Z

sdp/processors/datasets/earnings21/create_initial_manifest.py

+
+# Step 2: Populate Full Text for Manifest
+class CreateFullAudioManifestEarnings21(BaseParallelProcessor):
+    """


Same here. Please add proper docstrings and update api.rst.

sdp/processors/datasets/earnings21/create_initial_manifest.py

docker/Dockerfile

nithinraok · 2025-06-26T20:53:33Z

Updated based on comments. @lilithgrigoryan pls have a look

lilithgrigoryan

@nithinraok Thanks for the code and great docs!
Minor question: can we rename the earnings21 folders to just earnings? From what I understand, the configs cover both earnings21 and earnings22, right?

Otherwise LGTM

nithinraok mentioned this pull request Jun 17, 2025

Add earnings config #123

Closed

nithinraok requested review from Jorjeous and lilithgrigoryan June 17, 2025 17:36

lilithgrigoryan requested changes Jun 18, 2025

lilithgrigoryan reviewed Jun 18, 2025

docker/Dockerfile: review thread resolved

nithinraok force-pushed the earnings_pc branch from 8ff5143 to a4dd69f Compare June 26, 2025 20:52

nithinraok requested a review from lilithgrigoryan June 26, 2025 20:59

lilithgrigoryan requested changes Jul 1, 2025

nithinraok added 3 commits July 17, 2025 13:21

Add earnings config

11d6fc1

Signed-off-by: Nithin Rao Koluguri <nithinrao.koluguri@gmail.com>

address comments

cece201

Signed-off-by: Nithin Rao Koluguri <nithinrao.koluguri@gmail.com>

move to earnings instead of earnings21

db9a70b

Signed-off-by: Nithin Rao Koluguri <nithinrao.koluguri@gmail.com>

nithinraok force-pushed the earnings_pc branch from 596d386 to db9a70b Compare July 17, 2025 20:21

Merge branch 'main' of github.com:NVIDIA/NeMo-speech-data-processor i…

84700a5

…nto earnings_pc

karpnv merged commit 64f9d13 into main Jul 19, 2025
9 of 10 checks passed

karpnv deleted the earnings_pc branch July 19, 2025 00:32

ssh-meister mentioned this pull request Jul 21, 2025

Add RemoveFiles and ExtractTar, reorganize audio converters #139

Merged

karpnv merged 4 commits into main from earnings_pc
