
LS-EEND Speaker Pre-Enrollment Bugfixes#729

Open
SGD2718 wants to merge 6 commits into main from fix/lseend-enrollment-timestamp-offset

@SGD2718 SGD2718 commented Jun 23, 2026

Collaborator

Summary:

Fixed two issues with LS-EEND speaker pre-enrollment:

Issue 1: Pre-Enrollment Offsets DiarizerTimeline Timestamps

Cause:

Pre-enrollment offsets timestamps. The silence frames that pad the pre-enrollment audio are counted as valid frames in the DiarizerTimeline sample count, pushing all actual audio frames about 1 second into the future.

Solution:

Reset LS-EEND's sample counter after pre-enrollment to make it exclude the warmup audio from the output. When the actual session starts, the silent frames will be skipped. This will not corrupt the recurrent state because it simply affects which frames are selected from the model's output. Additionally, when pre-enrolling back-to-back speakers, their audio is separated by a silence gap that is double the right context size (i.e., twice the number of frames it will skip), ensuring that no part of the enrollment output is cut off.
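
To illustrate why resetting the counter only changes which output frames are selected, here is a minimal sketch. `OutputStripper` and `keep(_:)` are hypothetical names, not the FluidAudio types; the sketch assumes the output strip is armed whenever fewer than `convDelay` frames have been fed, and uses the 9-frame, 10 Hz figures from the commit message.

```swift
/// Hypothetical stand-in for the output-selection step; not the FluidAudio type.
struct OutputStripper {
    let convDelay: Int            // right-context look-ahead, in frames
    var framesFedToModel = 0      // counter named after the one in this PR

    /// Drops the leading look-ahead frames of a stream and returns the rest.
    mutating func keep(_ chunk: [Int]) -> [Int] {
        let skip = min(chunk.count, max(0, convDelay - framesFedToModel))
        framesFedToModel += chunk.count
        return Array(chunk.dropFirst(skip))
    }
}

let live = Array(0..<20)          // 20 output frames of live audio

// Baseline: no enrollment, so the first 9 frames are stripped.
var baseline = OutputStripper(convDelay: 9)
let baselineFrames = baseline.keep(live)

// Bug: the counter stays advanced after enrollment, so nothing is stripped.
var stale = OutputStripper(convDelay: 9)
_ = stale.keep(Array(repeating: -1, count: 30))   // enrollment audio + drain silence
let staleFrames = stale.keep(live)

// Fix: reset the counter after enrollment to re-arm the strip.
var fixed = OutputStripper(convDelay: 9)
_ = fixed.keep(Array(repeating: -1, count: 30))
fixed.framesFedToModel = 0
let fixedFrames = fixed.keep(live)

// Extra frames that leak into the timeline, converted to seconds at 10 Hz.
let offsetSeconds = Double(staleFrames.count - baselineFrames.count) / 10.0
```

In this model the stale stream keeps 9 more frames than the baseline, which is the roughly 0.9 s shift, while the reset stream selects exactly the same frames as the baseline.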

Issue 2: Pre-Enrollment Fails When No Segments Can Be Saved

Cause:

Disabling segment saving in the DiarizerTimeline breaks pre-enrollment, because enrollment identifies the speaker by counting the newly saved segments for each speaker.

Solution:

Use the DiarizerUpdate object emitted by the flush instead, since it will contain the detected segments regardless of whether the timeline can save segments. Also removed the unit test enforcing this incorrect behavior.
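
A rough sketch of the difference, with simplified stand-in types (`Segment`, `Update`, and `Timeline` here are illustrative and are not the real DiarizerTimeline API): the update returned from processing carries the detected segments even when the timeline stores nothing.

```swift
// Simplified stand-ins for illustration only; not the FluidAudio types.
struct Segment {
    let slot: Int
    let startFrame: Int
    let endFrame: Int
}

struct Update {
    let segments: [Segment]
}

final class Timeline {
    let storeSegments: Bool
    private(set) var stored: [Segment] = []

    init(storeSegments: Bool) {
        self.storeSegments = storeSegments
    }

    /// Persists segments only when configured to, but always emits them.
    func process(_ detected: [Segment]) -> Update {
        if storeSegments {
            stored.append(contentsOf: detected)
        }
        return Update(segments: detected)
    }
}

let timeline = Timeline(storeSegments: false)
let update = timeline.process([Segment(slot: 2, startFrame: 0, endFrame: 40)])

// Reading back from storage finds nothing; reading the update finds the slot.
let slotFromTimeline = timeline.stored.first?.slot
let slotFromUpdate = update.segments.first?.slot
```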

Note: this was also an issue for Sortformer and was fixed in a similar way.

SGD2718 added 2 commits June 22, 2026 17:40
enrollSpeaker reset the timeline origin to frame 0 but left
framesFedToModel advanced, so the first live chunk skipped the
convDelay output strip. The encoder's right-context look-ahead
(holding the enrollment drain silence) then leaked into the
timeline, shifting every reported segment timestamp later by
convDelay (~9 frames, ~0.9s at 10Hz).

Reset framesFedToModel to 0 on successful enrollment to re-arm the
strip, and restore the captured value on the rollback paths so a
failed enrollment leaves the counter consistent with the
rolled-back session. warmupFrames only slices model output, so the
primed recurrent state stays continuous.

Add a regression test asserting an enrolled stream finalizes the
same frame count as a baseline stream of identical live audio.

When a DiarizerTimeline is configured with storeSegments=false, the
_speakers map is never populated, so enrollment can't read the detected
speaker back to map it to a slot and always returns nil for both
LSEENDDiarizer and SortformerDiarizer.

Add a lock-guarded DiarizerTimeline.setStoreSegments(_:) that returns the
previous value, and have both enrollSpeaker paths force storage on for the
duration of enrollment and restore the configured value via defer on every
exit path (success keeps the enrolled speaker; rollback drops it).

Add regression tests for both diarizers asserting enrollment succeeds with
storeSegments=false and the config falls back afterward.
@SGD2718 SGD2718 requested a review from Alex-Wengg June 23, 2026 02:20
@SGD2718 SGD2718 self-assigned this Jun 23, 2026
@SGD2718 SGD2718 added the bug Something isn't working label Jun 23, 2026
Copilot AI review requested due to automatic review settings June 23, 2026 02:20
@SGD2718 SGD2718 added the speaker-diarization Issues related to speaker diarization label Jun 23, 2026

Copilot AI left a comment

Contributor


Pull request overview

Fixes two regressions in speaker pre-enrollment for the diarization pipeline (Sortformer + LS-EEND): (1) enrollment was causing LS-EEND streaming timestamps to be offset by the model’s right-context latency, and (2) enrollment could fail when DiarizerTimeline was configured not to persist segments.

Changes:

  • LS-EEND: snapshot/restore framesFedToModel on enrollment failure and reset it on success to prevent right-context timestamp offset.
  • Sortformer + LS-EEND: force DiarizerTimeline segment storage on during enrollment, then restore the caller’s configured storeSegments setting.
  • Add regression tests for the above behaviors.
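
The snapshot/restore behavior in the first bullet can be sketched as follows. This is a simplified model, not the code in LSEENDDiarizer: `StreamState`, `enroll`, and the `detectSlot` closure are hypothetical names standing in for the real session state and model call.

```swift
// Hypothetical session state; only the counter and enrolled names are modeled.
struct StreamState {
    var framesFedToModel = 0
    var enrolledNames: [Int: String] = [:]
}

/// `detectSlot` stands in for running the model on the enrollment audio
/// and returns nil when no speaker slot could be identified.
func enroll(
    _ name: String,
    frames: Int,
    state: inout StreamState,
    detectSlot: () -> Int?
) -> Bool {
    let snapshot = state                 // capture before touching anything
    state.framesFedToModel += frames     // enrollment audio advances the counter

    guard let slot = detectSlot() else {
        state = snapshot                 // rollback: counter matches the old session
        return false
    }

    state.enrolledNames[slot] = name
    state.framesFedToModel = 0           // success: re-arm the output strip
    return true
}

var state = StreamState(framesFedToModel: 42)

let failed = enroll("alice", frames: 30, state: &state) { nil }
let counterAfterFailure = state.framesFedToModel

let succeeded = enroll("alice", frames: 30, state: &state) { 1 }
let counterAfterSuccess = state.framesFedToModel
```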

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| Tests/FluidAudioTests/Diarizer/SpeakerEnrollmentTests.swift | Adds regression tests for enrollment with storeSegments=false and LS-EEND timestamp alignment after enrollment. |
| Sources/FluidAudio/Diarizer/Sortformer/SortformerDiarizer.swift | Temporarily enables timeline segment storage during enrollment so the enrolled speaker can be read back reliably. |
| Sources/FluidAudio/Diarizer/LS-EEND/LSEENDDiarizer.swift | Prevents right-context (convDelay) timestamp shift after enrollment and restores framesFedToModel on enrollment rollback. |
| Sources/FluidAudio/Diarizer/DiarizerTimeline.swift | Makes timeline config mutable (private-set) and adds setStoreSegments(_:) to support temporary segment persistence overrides. |

Comment thread Sources/FluidAudio/Diarizer/DiarizerTimeline.swift Outdated
github-actions Bot commented Jun 23, 2026

Supertonic3 Smoke Test ✅

| Check | Result |
| --- | --- |
| Build | |
| Model download (incl. VectorEstimatorVariants/ int4 buckets) | |
| Model load | |
| Synthesis pipeline (--ve-variant int4) | |
| Output WAV | ✅ (364.7 KB) |

Runtime: 0m24s

Note: CI VMs lack a physical Neural Engine; the ANE-bucketed VectorEstimator falls back to CPU here. This validates download + variant resolution + synthesis, not ANE residency/perf.

github-actions Bot commented Jun 23, 2026

PocketTTS Smoke Test ❌

| Check | Result |
| --- | --- |
| Build | |
| Model download | |
| Model load | |
| Synthesis pipeline | |
| Output WAV | ❌ (NaN KB) |

Runtime:

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality and performance may differ from Apple Silicon.

github-actions Bot commented Jun 23, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
| --- | --- | --- |
| WER (Avg) | 7.03% | Average Word Error Rate |
| WER (Med) | 4.17% | Median Word Error Rate |
| RTFx | 11.08x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 44.9s | Total processing time |

Streaming Metrics

| Metric | Value | Description |
| --- | --- | --- |
| Avg Chunk Time | 0.045s | Average chunk processing time |
| Max Chunk Time | 0.090s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 0m52s • 06/23/2026, 12:22 AM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

github-actions Bot commented Jun 23, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
| --- | --- | --- | --- | --- |
| DER | 15.1% | <30% | | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% | | Jaccard Error Rate |
| RTFx | 31.85x | >1.0x | | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | 8.917 | 27.1 | Fetching diarization models |
| Model Compile | 3.821 | 11.6 | CoreML compilation |
| Audio Load | 0.045 | 0.1 | Loading audio file |
| Segmentation | 9.881 | 30.0 | Detecting speech regions |
| Embedding | 16.469 | 50.0 | Extracting speaker voices |
| Clustering | 6.587 | 20.0 | Grouping same speakers |
| Total | 32.942 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
| --- | --- | --- |
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): Runs at 150 RTFx real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 32.9s diarization time • Test runtime: 1m 48s • 06/23/2026, 12:28 AM EST

github-actions Bot commented Jun 23, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
| --- | --- | --- | --- | --- | --- | --- |
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 746.0x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 761.1x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

github-actions Bot commented Jun 23, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target | Status |
| --- | --- | --- | --- |
| DER | 30.3% | <35% | |
| Miss Rate | 28.2% | - | - |
| False Alarm | 0.9% | - | - |
| Speaker Error | 1.2% | - | - |
| RTFx | 14.3x | >1.0x | |
| Speakers | 4/4 | - | - |

Sortformer High-Latency • ES2004a • Runtime: 2m 38s • 2026-06-23T04:22:53.718Z

github-actions Bot commented Jun 23, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
| --- | --- | --- | --- | --- |
| DER | 10.4% | <20% | | Diarization Error Rate (lower is better) |
| RTFx | 13.76x | >1.0x | | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | 13.548 | 17.8 | Fetching diarization models |
| Model Compile | 5.806 | 7.6 | CoreML compilation |
| Audio Load | 0.031 | 0.0 | Loading audio file |
| Segmentation | 21.055 | 27.6 | VAD + speech detection |
| Embedding | 76.061 | 99.7 | Speaker embedding extraction |
| Clustering (VBx) | 0.096 | 0.1 | Hungarian algorithm + VBx clustering |
| Total | 76.281 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
| --- | --- | --- | --- |
| FluidAudio (Offline) | 10.4% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 97.2s processing • Test runtime: 1m 44s • 06/23/2026, 12:26 AM EST

github-actions Bot commented Jun 23, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx | Status |
| --- | --- | --- | --- | --- |
| test-clean | 0.57% | 0.00% | 4.79x | |
| test-other | 1.19% | 0.00% | 3.09x | |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx | Status |
| --- | --- | --- | --- | --- |
| test-clean | 0.80% | 0.00% | 4.59x | |
| test-other | 1.00% | 0.00% | 2.93x | |

Streaming (v3)

| Metric | Value | Description |
| --- | --- | --- |
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.51x | Streaming real-time factor |
| Avg Chunk Time | 1.752s | Average time to process each chunk |
| Max Chunk Time | 2.127s | Maximum chunk processing time |
| First Token | 2.073s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
| --- | --- | --- |
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.57x | Streaming real-time factor |
| Avg Chunk Time | 1.587s | Average time to process each chunk |
| Max Chunk Time | 1.798s | Maximum chunk processing time |
| First Token | 1.627s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 7m35s • 06/23/2026, 12:29 AM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

SGD2718 and others added 4 commits June 22, 2026 20:23
… segments

The previous fix forced the timeline to store segments during enrollment so the
speaker could be read back from timeline.speakers. Instead, derive the enrolled
slot from the DiarizerTimelineUpdate (whose segments are emitted regardless of
storeSegments) and register the speaker identity directly via upsertSpeaker,
which is no longer gated by storeSegments.

- DiarizerTimeline.upsertSpeaker(named:atIndex:) registers identity regardless of
  storeSegments; removed the now-redundant setStoreSegments toggle and reverted
  config back to a let.
- LSEENDDiarizer: upsert the speaker at the slot found from the update.
- SortformerDiarizer: accumulate per-slot speech frames from the updates (plus the
  trailing tentative segment) to pick the dominant slot.
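
One way to picture the dominant-slot selection described in the last bullet is the sketch below. The types and the `dominantSlot(in:)` helper are hypothetical, and two details are assumptions rather than facts from this PR: only the final update's tentative segments are counted, and ties go to the lower slot index.

```swift
// Illustrative stand-ins; not the real DiarizerTimelineUpdate type.
struct Segment {
    let slot: Int
    let startFrame: Int
    let endFrame: Int
    var frameCount: Int { endFrame - startFrame }
}

struct Update {
    let finalized: [Segment]
    let tentative: [Segment]
}

/// Sums speech frames per slot and returns the slot with the most speech.
func dominantSlot(in updates: [Update]) -> Int? {
    var framesPerSlot: [Int: Int] = [:]

    for update in updates {
        for segment in update.finalized {
            framesPerSlot[segment.slot, default: 0] += segment.frameCount
        }
    }

    // Assumption: only the trailing tentative segments (last update) count.
    if let last = updates.last {
        for segment in last.tentative {
            framesPerSlot[segment.slot, default: 0] += segment.frameCount
        }
    }

    // Most frames wins; on a tie the lower slot index wins.
    return framesPerSlot.max(by: { lhs, rhs in
        (lhs.value, -lhs.key) < (rhs.value, -rhs.key)
    })?.key
}

let updates = [
    Update(finalized: [Segment(slot: 0, startFrame: 0, endFrame: 10)], tentative: []),
    Update(
        finalized: [Segment(slot: 1, startFrame: 10, endFrame: 14)],
        tentative: [Segment(slot: 1, startFrame: 14, endFrame: 30)]
    ),
]

let withTentative = dominantSlot(in: updates)
let withoutTentative = dominantSlot(in: updates.map { Update(finalized: $0.finalized, tentative: []) })
```

In this example slot 1 only wins because its still-open tentative segment is counted, which is the case the commit addresses.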

Enrollment now works with storeSegments=false without temporarily mutating the
timeline configuration. Tests assert the enrolled identity persists.

- Tentative segments in the diarizer update are now inspected
- Shortened long comments

Speaker enrollment must be able to register a named speaker even when the
timeline is configured not to store segments (storeSegments=false). The
DiarizerTimeline.upsertSpeaker(_:atIndex:) overload was still gated by
storeSegments, leaving it inconsistent with the named: overload and causing
the emit-only enrollment path to behave differently than intended.

- DiarizerTimeline: ungate the speaker-object upsertSpeaker overload so both
  overloads register identity independent of storeSegments. Auto-tracking from
  processing (commitSegment) stays gated.
- Update the emit-only timeline test to assert upsert is allowed (was asserting
  it is refused, which encoded the wrong contract).
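
The contract described in this commit message, sketched with a toy class. `SketchTimeline` is illustrative and much simpler than DiarizerTimeline; only the method names `upsertSpeaker(named:atIndex:)` and `commitSegment` are taken from the messages above, and their real signatures may differ.

```swift
// Toy model of the gating contract; not the real DiarizerTimeline.
final class SketchTimeline {
    private let storeSegments: Bool
    private(set) var speakerNames: [Int: String] = [:]
    private(set) var storedSegmentCounts: [Int: Int] = [:]

    init(storeSegments: Bool) {
        self.storeSegments = storeSegments
    }

    /// Explicit identity registration: allowed regardless of storeSegments.
    func upsertSpeaker(named name: String, atIndex index: Int) {
        speakerNames[index] = name
    }

    /// Auto-tracking from processing: stays gated by storeSegments.
    func commitSegment(slot: Int) {
        guard storeSegments else { return }
        storedSegmentCounts[slot, default: 0] += 1
    }
}

let emitOnly = SketchTimeline(storeSegments: false)
emitOnly.commitSegment(slot: 0)                       // ignored: storage is off
emitOnly.upsertSpeaker(named: "alice", atIndex: 0)    // accepted: identity only
```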

Labels

bug Something isn't working speaker-diarization Issues related to speaker diarization


2 participants