Add earnings config #130

Merged
karpnv merged 4 commits into main from earnings_pc
Jul 19, 2025

@nithinraok
Collaborator

Add Earnings21/22 Dataset Processing Pipeline with Forced Alignment

Overview

This PR introduces a 7-step processing pipeline that converts the Earnings21 and Earnings22 datasets to NeMo manifest format and adds word-level forced alignment. The pipeline supports full-dataset processing and the Earnings21 eval10 evaluation subset, with optional speaker segmentation.

High-Level Changelog

New Features

Core Pipeline Processors:

  • CreateInitialAudioAndManifest: Initial audio manifest creation with automatic audio conversion (MP3 → WAV, multi-channel → mono, any sample rate → 16kHz)
  • CreateFullAudioManifestEarnings21: Ground truth text reconstruction from NLP token files with punctuation/capitalization preservation
  • NeMoForcedAligner: Word-level forced alignment using NeMo ASR models with CTC heads
  • CreateSentenceSegmentedManifest: Intelligent sentence-level segmentation based on CTM files with punctuation-aware splitting
  • SpeakerSegmentedManifest: Speaker-change detection and segmentation with optional metadata mapping

Dataset Support:

  • Earnings21 support (full dataset + eval10 subset)
  • Earnings22 support
  • Dual NLP file location handling for flexible dataset structures
  • Speaker metadata CSV integration for name mapping

Audio Processing:

  • Automatic audio format conversion (MP3/WAV → 16kHz mono WAV)
  • Accurate duration calculation from audio files
  • Batch processing with configurable test mode
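
The conversion and duration steps listed above could look roughly like the following. This is a minimal sketch, not the PR's implementation: it assumes `ffmpeg` is available on the system, and the helper names `build_ffmpeg_cmd` and `wav_duration_seconds` are hypothetical.

```python
import wave


def build_ffmpeg_cmd(src_path, dst_path, sample_rate=16000):
    """Build an ffmpeg command that converts any input to mono WAV.

    Hypothetical helper: the PR's own conversion code is not shown here.
    """
    return [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-i", src_path,            # input (MP3, WAV, ...)
        "-ac", "1",                # downmix to a single channel
        "-ar", str(sample_rate),   # resample to the target rate
        dst_path,
    ]


def wav_duration_seconds(path):
    """Compute duration from the WAV header: frame count / frame rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

The command list can be passed to `subprocess.run(cmd, check=True)`. Reading the duration from the converted file, rather than trusting dataset metadata, keeps the manifest consistent with the audio that is actually aligned.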

Pipeline Configuration

7-Step Processing Workflow:

  1. Initial Audio Manifest → Full audio files with duration
  2. Text Population → Add ground truth transcripts from NLP files
  3. Text Cleaning → Remove artifacts, brackets, special characters
  4. Forced Alignment → Generate word-level CTM files with timestamps
  5. Sentence Segmentation → Create sentence-level segments from CTM data
  6. Speaker Segmentation → Create speaker-level segments (optional)
  7. Field Filtering → Keep only required manifest fields
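
Steps 3 and 7 are simple enough to illustrate in a few lines. The sketch below is an assumption about what such steps might do, not the rules used in this PR: the bracket patterns, the allowed character set, and the function names `clean_text` and `keep_fields` are all hypothetical.

```python
import re

# Assumed artifact patterns: bracketed or angle-bracketed tags such as [laugh] or <noise>.
BRACKETED = re.compile(r"\[[^\]]*\]|<[^>]*>")
# Assumed allow-list: letters, digits, whitespace, and basic punctuation.
DISALLOWED = re.compile(r"[^A-Za-z0-9\s.,?!'%$-]")


def clean_text(text):
    """Step 3 sketch: drop bracketed artifacts and special characters."""
    text = BRACKETED.sub(" ", text)
    text = DISALLOWED.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def keep_fields(entry, fields):
    """Step 7 sketch: keep only the requested manifest fields."""
    return {key: entry[key] for key in fields if key in entry}
```

For example, `clean_text("Revenue grew 5% [laugh] this  quarter.")` yields `"Revenue grew 5% this quarter."` under these assumed rules.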

Key Configuration Options:

  • dataset_type: "earnings21" | "earnings22"
  • subset: "full" | "eval10" (earnings21 only)
  • forced_alignment_model: Configurable NeMo ASR model
  • preserve_punctuation / preserve_capitalization: Text processing options
  • include_speaker_info / include_tags: Optional metadata inclusion
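
The constraint that `eval10` applies only to Earnings21 can be checked before the pipeline starts. The following is a hedged sketch of such a check; `validate_options` is a hypothetical helper and is not part of the PR.

```python
VALID_DATASET_TYPES = ("earnings21", "earnings22")
VALID_SUBSETS = ("full", "eval10")


def validate_options(dataset_type, subset):
    """Reject option combinations that the description above rules out."""
    if dataset_type not in VALID_DATASET_TYPES:
        raise ValueError(f"unknown dataset_type: {dataset_type!r}")
    if subset not in VALID_SUBSETS:
        raise ValueError(f"unknown subset: {subset!r}")
    # The eval10 subset exists only for Earnings21.
    if subset == "eval10" and dataset_type != "earnings21":
        raise ValueError("subset 'eval10' is available for earnings21 only")
    return dataset_type, subset
```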

Output Formats

Sentence-Level Segments (Primary Output):

{
  "audio_filepath": "/path/to/audio.wav",
  "duration": 15.2,
  "offset": 45.3,
  "text": "This is a complete sentence with proper punctuation.",
  "alignment": [
    {"word": "This", "start": 45.3, "end": 45.6},
    {"word": "is", "start": 45.6, "end": 45.8}
  ]
}
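
To show how word-level alignments can produce entries of this shape, here is a minimal sketch of punctuation-based splitting. It is an illustration under stated assumptions, not the logic of `CreateSentenceSegmentedManifest`: the function names are hypothetical, and it splits on any word ending in `.`, `?`, or `!`, so abbreviations such as "Inc." would be split incorrectly.

```python
SENTENCE_END = (".", "?", "!")


def make_segment(audio_filepath, words):
    """Build one manifest entry from a time-ordered run of aligned words."""
    start, end = words[0]["start"], words[-1]["end"]
    return {
        "audio_filepath": audio_filepath,
        "duration": round(end - start, 3),
        "offset": start,
        "text": " ".join(w["word"] for w in words),
        "alignment": words,
    }


def segment_sentences(audio_filepath, words):
    """Split aligned words into sentences at sentence-final punctuation."""
    segments, current = [], []
    for word in words:
        current.append(word)
        if word["word"].endswith(SENTENCE_END):
            segments.append(make_segment(audio_filepath, current))
            current = []
    if current:  # trailing words without final punctuation
        segments.append(make_segment(audio_filepath, current))
    return segments
```

Each segment's `offset` is the start time of its first word and its `duration` spans to the end of its last word, which matches the relationship between `offset`, `duration`, and `alignment` in the example above.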

Speaker-Level Segments (Optional):

{
  "audio_filepath": "/path/to/audio.wav", 
  "duration": 0,
  "text": "Speaker segment text...",
  "speaker": "speaker_1",
  "segment_id": 0
}
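
Speaker-change segmentation can be sketched as grouping consecutive tokens that share a speaker label. The code below is an assumption-laden illustration, not the `SpeakerSegmentedManifest` implementation: the token layout, the function name, and the optional `speaker_names` mapping are hypothetical, and it leaves out `duration`.

```python
from itertools import groupby


def segment_by_speaker(audio_filepath, tokens, speaker_names=None):
    """Group consecutive tokens with the same speaker into one segment.

    tokens: list of {"word": str, "speaker": str} in transcript order.
    speaker_names: optional mapping from speaker label to display name.
    """
    speaker_names = speaker_names or {}
    segments = []
    runs = groupby(tokens, key=lambda token: token["speaker"])
    for segment_id, (speaker, run) in enumerate(runs):
        segments.append({
            "audio_filepath": audio_filepath,
            "text": " ".join(token["word"] for token in run),
            "speaker": speaker_names.get(speaker, speaker),
            "segment_id": segment_id,
        })
    return segments
```

A speaker who talks, is interrupted, and then resumes produces two separate segments, because grouping is by consecutive runs and not by speaker identity.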

Usage Examples

# Process Earnings21 full dataset
python main.py --config-path=dataset_configs/english/earnings21 --config-name=config \
  dataset_type=earnings21 \
  dataset_root=/path/to/earnings21 \
  output_directory=/path/to/output

# Process Earnings22 with custom model
python main.py --config-path=dataset_configs/english/earnings21 --config-name=config \
  dataset_type=earnings22 \
  forced_alignment_model=nvidia/parakeet-tdt_ctc-1.1b \
  dataset_root=/path/to/earnings22 \
  output_directory=/path/to/output

Collaborator

@lilithgrigoryan lilithgrigoryan left a comment

Hi @nithinraok. Thanks!

Overall looks good to me. I left some comments about docs and docstrings that need to be fixed, and I will review the code once more. Also, please consider adding end-to-end tests.


# Step 2: Populate Full Text for Manifest
class CreateFullAudioManifestEarnings21(BaseParallelProcessor):
"""
Collaborator

@lilithgrigoryan lilithgrigoryan Jun 18, 2025

Same here. Please add proper docstrings and update api.rst.

@nithinraok
Collaborator Author

Updated based on comments. @lilithgrigoryan pls have a look

Collaborator

@lilithgrigoryan lilithgrigoryan left a comment

@nithinraok Thanks for the code and great docs!
Minor question: can we rename the earnings21 folders to just earnings? From what I understand, the configs cover both earnings21 and earnings22, right?

Otherwise LGTM

Signed-off-by: Nithin Rao Koluguri <nithinrao.koluguri@gmail.com>
@karpnv karpnv merged commit 64f9d13 into main Jul 19, 2025
9 of 10 checks passed
@karpnv karpnv deleted the earnings_pc branch July 19, 2025 00:32