LS-EEND Speaker Pre-Enrollment Bugfixes by SGD2718 · Pull Request #729 · FluidInference/FluidAudio

SGD2718 · 2026-06-23T02:20:05Z

Summary:

Fixed two issues with LS-EEND speaker pre-enrollment:

Issue 1: Pre-Enrollment Offsets `DiarizerTimeline` Timestamps

Cause:

Pre-enrollment offsets timestamps. The silence frames that pad the pre-enrollment audio are counted as valid frames in the DiarizerTimeline sample count, pushing all actual audio frames about 1 second into the future.

Solution:

Reset LS-EEND's sample counter after pre-enrollment to make it exclude the warmup audio from the output. When the actual session starts, the silent frames will be skipped. This will not corrupt the recurrent state because it simply affects which frames are selected from the model's output. Additionally, when pre-enrolling back-to-back speakers, their audio is separated by a silence gap that is double the right context size (i.e., twice the number of frames it will skip), ensuring that no part of the enrollment output is cut off.
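A minimal sketch of why resetting the counter re-arms the skip without touching the recurrent state. `OutputStrip`, `select`, and `reset` are hypothetical names rather than the FluidAudio API, and the right-context size of 9 frames is assumed for illustration.

```swift
/// Hypothetical stand-in for the streaming wrapper; models only the counter.
struct OutputStrip {
    /// Leading output frames to discard (the right-context delay).
    let warmupFrames: Int
    /// Frames fed since the last reset.
    private(set) var framesFed = 0

    init(warmupFrames: Int) {
        self.warmupFrames = warmupFrames
    }

    /// Returns the slice of `output` that should reach the timeline.
    mutating func select(_ output: [Int]) -> [Int] {
        let toSkip = max(0, warmupFrames - framesFed)
        framesFed += output.count
        return Array(output.dropFirst(toSkip))
    }

    /// Re-arms the skip; the model state itself is not touched.
    mutating func reset() {
        framesFed = 0
    }
}

let rightContext = 9                // assumed value, in frames
let gapFrames = 2 * rightContext    // silence gap between back-to-back enrollments

// Without the reset, the first live chunk keeps all of its frames,
// including the ones that still describe the enrollment silence.
var stale = OutputStrip(warmupFrames: rightContext)
_ = stale.select(Array(repeating: 0, count: 40))
let leaked = stale.select(Array(0..<20)).count

// With the reset, the first `rightContext` frames of the live chunk are skipped.
var rearmed = OutputStrip(warmupFrames: rightContext)
_ = rearmed.select(Array(repeating: 0, count: 40))
rearmed.reset()
let kept = rearmed.select(Array(0..<20)).count
```

Only the selection of output frames changes here, which is why the primed model state can stay as it is.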

Issue 2: Pre-Enrollment Fails When No Segments Can Be Saved

Cause:

Disabling segment saving in the DiarizerTimeline prevents pre-enrollment from working because it counts the number of saved new segments for each speaker.

Solution:

Use the DiarizerUpdate object emitted by the flush instead, since it will contain the detected segments regardless of whether the timeline can save segments. Also removed the unit test enforcing this incorrect behavior.
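As a rough illustration of the idea, the sketch below reads the enrolled slot from the update returned by a flush rather than from stored segments. `Segment`, `Update`, and `MiniTimeline` are hypothetical simplifications, not the real FluidAudio types.

```swift
/// Hypothetical, simplified types for illustration only.
struct Segment {
    let slot: Int
    let startFrame: Int
    let endFrame: Int
}

struct Update {
    let segments: [Segment]
}

final class MiniTimeline {
    let storeSegments: Bool
    private(set) var stored: [Segment] = []

    init(storeSegments: Bool) {
        self.storeSegments = storeSegments
    }

    /// Emits the detected segments in the update even when nothing is stored.
    func flush(detected: [Segment]) -> Update {
        if storeSegments {
            stored.append(contentsOf: detected)
        }
        return Update(segments: detected)
    }
}

let timeline = MiniTimeline(storeSegments: false)
let update = timeline.flush(detected: [Segment(slot: 2, startFrame: 0, endFrame: 30)])

// Counting stored segments would find nothing; the update still carries the slot.
let storedCount = timeline.stored.count
let enrolledSlot = update.segments.first?.slot
```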

Note: this was also an issue for Sortformer and was fixed in a similar way.

enrollSpeaker reset the timeline origin to frame 0 but left framesFedToModel advanced, so the first live chunk skipped the convDelay output strip. The encoder's right-context look-ahead (holding the enrollment drain silence) then leaked into the timeline, shifting every reported segment timestamp later by convDelay (~9 frames, ~0.9s at 10Hz). Reset framesFedToModel to 0 on successful enrollment to re-arm the strip, and restore the captured value on the rollback paths so a failed enrollment leaves the counter consistent with the rolled-back session. warmupFrames only slices model output, so the primed recurrent state stays continuous. Add a regression test asserting an enrolled stream finalizes the same frame count as a baseline stream of identical live audio.
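The snapshot, reset, and rollback handling described in this commit message could look roughly like the following. `EnrollmentCounter` and its methods are hypothetical; only the name `framesFedToModel` and the figures (9 frames at 10 Hz) come from the text above.

```swift
/// Hypothetical model of the counter handling during enrollment.
final class EnrollmentCounter {
    private(set) var framesFedToModel = 0

    func feed(frames: Int) {
        framesFedToModel += frames
    }

    /// `detectedSlot` stands in for whatever the real enrollment detects.
    func enroll(frames: Int, detectedSlot: Int?) -> Int? {
        let snapshot = framesFedToModel
        feed(frames: frames)
        guard let slot = detectedSlot else {
            // Rollback: restore the captured value so the counter matches
            // the rolled-back session.
            framesFedToModel = snapshot
            return nil
        }
        // Success: reset to 0 so the next live chunk strips convDelay frames.
        framesFedToModel = 0
        return slot
    }
}

// Size of the shift that the missing strip caused.
let convDelayFrames = 9
let frameRateHz = 10.0
let offsetSeconds = Double(convDelayFrames) / frameRateHz

let counter = EnrollmentCounter()
counter.feed(frames: 25)
let failed = counter.enroll(frames: 40, detectedSlot: nil)
let afterFailure = counter.framesFedToModel
let enrolled = counter.enroll(frames: 40, detectedSlot: 1)
let afterSuccess = counter.framesFedToModel
```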

When a DiarizerTimeline is configured with storeSegments=false, the _speakers map is never populated, so enrollment can't read the detected speaker back to map it to a slot and always returns nil for both LSEENDDiarizer and SortformerDiarizer. Add a lock-guarded DiarizerTimeline.setStoreSegments(_:) that returns the previous value, and have both enrollSpeaker paths force storage on for the duration of enrollment and restore the configured value via defer on every exit path (success keeps the enrolled speaker; rollback drops it). Add regression tests for both diarizers asserting enrollment succeeds with storeSegments=false and the config falls back afterward.
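A sketch of the toggle-and-restore pattern this commit message describes, assuming an `NSLock` as the guard. `ToggleTimeline` and the free function `enroll` are hypothetical; only the `setStoreSegments(_:)` name and its return-the-previous-value behaviour are taken from the text.

```swift
import Foundation

/// Hypothetical timeline exposing only the storage flag.
final class ToggleTimeline {
    private let lock = NSLock()
    private var storeSegments: Bool

    init(storeSegments: Bool) {
        self.storeSegments = storeSegments
    }

    /// Sets the flag under the lock and returns the previous value.
    @discardableResult
    func setStoreSegments(_ newValue: Bool) -> Bool {
        lock.lock()
        defer { lock.unlock() }
        let previous = storeSegments
        storeSegments = newValue
        return previous
    }

    var isStoring: Bool {
        lock.lock()
        defer { lock.unlock() }
        return storeSegments
    }
}

/// Forces storage on for the duration of the call; `defer` restores the
/// caller's configured value on every exit path.
func enroll(on timeline: ToggleTimeline, detectedSlot: Int?) -> (slot: Int?, wasStoring: Bool) {
    let previous = timeline.setStoreSegments(true)
    defer { timeline.setStoreSegments(previous) }
    let wasStoring = timeline.isStoring
    return (detectedSlot, wasStoring)
}

let toggled = ToggleTimeline(storeSegments: false)
let result = enroll(on: toggled, detectedSlot: 0)
let storingAfterwards = toggled.isStoring
```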

Copilot

Pull request overview

Fixes two regressions in speaker pre-enrollment for the diarization pipeline (Sortformer + LS-EEND): (1) enrollment was causing LS-EEND streaming timestamps to be offset by the model’s right-context latency, and (2) enrollment could fail when DiarizerTimeline was configured not to persist segments.

Changes:

LS-EEND: snapshot/restore framesFedToModel on enrollment failure and reset it on success to prevent right-context timestamp offset.
Sortformer + LS-EEND: force DiarizerTimeline segment storage on during enrollment, then restore the caller’s configured storeSegments setting.
Add regression tests for the above behaviors.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
Tests/FluidAudioTests/Diarizer/SpeakerEnrollmentTests.swift	Adds regression tests for enrollment with `storeSegments=false` and LS-EEND timestamp alignment after enrollment.
Sources/FluidAudio/Diarizer/Sortformer/SortformerDiarizer.swift	Temporarily enables timeline segment storage during enrollment so the enrolled speaker can be read back reliably.
Sources/FluidAudio/Diarizer/LS-EEND/LSEENDDiarizer.swift	Prevents right-context (convDelay) timestamp shift after enrollment and restores `framesFedToModel` on enrollment rollback.
Sources/FluidAudio/Diarizer/DiarizerTimeline.swift	Makes timeline config mutable (private-set) and adds `setStoreSegments(_:)` to support temporary segment persistence overrides.

github-actions · 2026-06-23T02:24:21Z

Supertonic3 Smoke Test ✅

Check	Result
Build	✅
Model download (incl. `VectorEstimatorVariants/` int4 buckets)	✅
Model load	✅
Synthesis pipeline (`--ve-variant int4`)	✅
Output WAV	✅ (364.7 KB)

Runtime: 0m24s

Note: CI VMs lack a physical Neural Engine; the ANE-bucketed VectorEstimator falls back to CPU here. This validates download + variant resolution + synthesis, not ANE residency/perf.

github-actions · 2026-06-23T02:24:40Z

PocketTTS Smoke Test ❌

Check	Result
Build	✅
Model download	❌
Model load	❌
Synthesis pipeline	❌
Output WAV	❌ (NaN KB)

Runtime:

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. The CI VM lacks a physical GPU, so audio quality and performance may differ from Apple Silicon.

github-actions · 2026-06-23T02:25:56Z

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

Metric	Value	Description
WER (Avg)	7.03%	Average Word Error Rate
WER (Med)	4.17%	Median Word Error Rate
RTFx	11.08x	Real-time factor (higher = faster)
Total Audio	470.6s	Total audio duration processed
Total Time	44.9s	Total processing time

Streaming Metrics

Metric	Value	Description
Avg Chunk Time	0.045s	Average chunk processing time
Max Chunk Time	0.090s	Maximum chunk processing time
EOU Detections	0	Total End-of-Utterance detections

Test runtime: 0m52s • 06/23/2026, 12:22 AM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

github-actions · 2026-06-23T02:27:40Z

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

Metric	Value	Target	Status	Description
DER	15.1%	<30%	✅	Diarization Error Rate (lower is better)
JER	24.9%	<25%	✅	Jaccard Error Rate
RTFx	31.85x	>1.0x	✅	Real-Time Factor (higher is faster)

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

Stage	Time (s)	%	Description
Model Download	8.917	27.1	Fetching diarization models
Model Compile	3.821	11.6	CoreML compilation
Audio Load	0.045	0.1	Loading audio file
Segmentation	9.881	30.0	Detecting speech regions
Embedding	16.469	50.0	Extracting speaker voices
Clustering	6.587	20.0	Grouping same speakers
Total	32.942	100	Full pipeline

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

Method	DER	Notes
FluidAudio	15.1%	On-device CoreML
Research baseline	18-30%	Standard dataset performance

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

M2 MacBook Air (2022): Runs at 150 RTFx real-time
Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 32.9s diarization time • Test runtime: 1m 48s • 06/23/2026, 12:28 AM EST

github-actions · 2026-06-23T02:30:23Z

VAD Benchmark Results

Performance Comparison

Dataset	Accuracy	Precision	Recall	F1-Score	RTFx	Files
MUSAN	92.0%	86.2%	100.0%	92.6%	746.0x faster	50
VOiCES	92.0%	86.2%	100.0%	92.6%	761.1x faster	50

Dataset Details

MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

github-actions · 2026-06-23T02:31:39Z

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

Metric	Value	Target	Status
DER	30.3%	<35%	✅
Miss Rate	28.2%	-	-
False Alarm	0.9%	-	-
Speaker Error	1.2%	-	-
RTFx	14.3x	>1.0x	✅
Speakers	4/4	-	-

Sortformer High-Latency • ES2004a • Runtime: 2m 38s • 2026-06-23T04:22:53.718Z

github-actions · 2026-06-23T02:33:24Z

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

Metric	Value	Target	Status	Description
DER	10.4%	<20%	✅	Diarization Error Rate (lower is better)
RTFx	13.76x	>1.0x	✅	Real-Time Factor (higher is faster)

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

Stage	Time (s)	%	Description
Model Download	13.548	17.8	Fetching diarization models
Model Compile	5.806	7.6	CoreML compilation
Audio Load	0.031	0.0	Loading audio file
Segmentation	21.055	27.6	VAD + speech detection
Embedding	76.061	99.7	Speaker embedding extraction
Clustering (VBx)	0.096	0.1	Hungarian algorithm + VBx clustering
Total	76.281	100	Full VBx pipeline

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

Method	DER	Mode	Description
FluidAudio (Offline)	10.4%	VBx Batch	On-device CoreML with optimal clustering
FluidAudio (Streaming)	17.7%	Chunk-based	First-occurrence speaker mapping
Research baseline	18-30%	Various	Standard dataset performance

Pipeline Details:

Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
Segmentation: VAD-based voice activity detection
Embeddings: WeSpeaker-compatible speaker embeddings
Clustering: PowerSet with VBx refinement
Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 97.2s processing • Test runtime: 1m 44s • 06/23/2026, 12:26 AM EST

github-actions · 2026-06-23T02:36:55Z

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

Dataset	WER Avg	WER Med	RTFx	Status
test-clean	0.57%	0.00%	4.79x	✅
test-other	1.19%	0.00%	3.09x	✅

Parakeet v2 (English-optimized)

Dataset	WER Avg	WER Med	RTFx	Status
test-clean	0.80%	0.00%	4.59x	✅
test-other	1.00%	0.00%	2.93x	✅

Streaming (v3)

Metric	Value	Description
WER	0.00%	Word Error Rate in streaming mode
RTFx	0.51x	Streaming real-time factor
Avg Chunk Time	1.752s	Average time to process each chunk
Max Chunk Time	2.127s	Maximum chunk processing time
First Token	2.073s	Latency to first transcription token
Total Chunks	31	Number of chunks processed

Streaming (v2)

Metric	Value	Description
WER	0.00%	Word Error Rate in streaming mode
RTFx	0.57x	Streaming real-time factor
Avg Chunk Time	1.587s	Average time to process each chunk
Max Chunk Time	1.798s	Maximum chunk processing time
First Token	1.627s	Latency to first transcription token
Total Chunks	31	Number of chunks processed

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 7m35s • 06/23/2026, 12:29 AM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)
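
For reference, the RTFx definition above can be written as a small helper. The function name `rtfx` and its optional return for a non-positive processing time are illustrative choices, not part of the benchmark tooling.

```swift
/// Real-time factor: total audio duration divided by total processing time.
/// Returns nil when the processing time is not positive.
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double? {
    guard processingSeconds > 0 else { return nil }
    return audioSeconds / processingSeconds
}

// The worked example from the note: 10 seconds of audio processed in 5 seconds.
let example = rtfx(audioSeconds: 10, processingSeconds: 5)
let invalid = rtfx(audioSeconds: 10, processingSeconds: 0)
```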

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

… segments

The previous fix forced the timeline to store segments during enrollment so the speaker could be read back from timeline.speakers. Instead, derive the enrolled slot from the DiarizerTimelineUpdate (whose segments are emitted regardless of storeSegments) and register the speaker identity directly via upsertSpeaker, which is no longer gated by storeSegments.

- DiarizerTimeline.upsertSpeaker(named:atIndex:) registers identity regardless of storeSegments; removed the now-redundant setStoreSegments toggle and reverted config back to a let.
- LSEENDDiarizer: upsert the speaker at the slot found from the update.
- SortformerDiarizer: accumulate per-slot speech frames from the updates (plus the trailing tentative segment) to pick the dominant slot.

Enrollment now works with storeSegments=false without temporarily mutating the timeline configuration. Tests assert the enrolled identity persists.
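
One way to picture the dominant-slot selection described for SortformerDiarizer is the sketch below. The `Segment` and `Update` types and the `dominantSlot` function are hypothetical, and counting only the last update's tentative segments is one reading of "the trailing tentative segment".

```swift
/// Hypothetical, simplified types for illustration only.
struct Segment {
    let slot: Int
    let startFrame: Int
    let endFrame: Int

    var frameCount: Int { max(0, endFrame - startFrame) }
}

struct Update {
    let finalized: [Segment]
    let tentative: [Segment]
}

/// Sums speech frames per slot and returns the slot with the most frames.
/// Ties are not resolved in any particular order in this sketch.
func dominantSlot(in updates: [Update]) -> Int? {
    var framesPerSlot: [Int: Int] = [:]
    for update in updates {
        for segment in update.finalized {
            framesPerSlot[segment.slot, default: 0] += segment.frameCount
        }
    }
    // Only the trailing tentative segments are counted, so a segment that was
    // tentative in one update and finalized later is not counted twice.
    if let last = updates.last {
        for segment in last.tentative {
            framesPerSlot[segment.slot, default: 0] += segment.frameCount
        }
    }
    return framesPerSlot.max { $0.value < $1.value }?.key
}

let updates = [
    Update(finalized: [Segment(slot: 0, startFrame: 0, endFrame: 5)], tentative: []),
    Update(finalized: [Segment(slot: 1, startFrame: 5, endFrame: 20)],
           tentative: [Segment(slot: 1, startFrame: 20, endFrame: 30)]),
]
let slot = dominantSlot(in: updates)
let none = dominantSlot(in: [])
```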

- Tentative segments in the diarizer update are now inspected
- Shortened long comments

Speaker enrollment must be able to register a named speaker even when the timeline is configured not to store segments (storeSegments=false). The DiarizerTimeline.upsertSpeaker(_:atIndex:) overload was still gated by storeSegments, leaving it inconsistent with the named: overload and causing the emit-only enrollment path to behave differently than intended.

- DiarizerTimeline: ungate the speaker-object upsertSpeaker overload so both overloads register identity independent of storeSegments. Auto-tracking from processing (commitSegment) stays gated.
- Update the emit-only timeline test to assert upsert is allowed (was asserting it is refused, which encoded the wrong contract).
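
The contract described here, with explicit upserts always allowed and automatic tracking gated, might be sketched as follows. `EmitOnlyTimeline` is a hypothetical reduction and its method signatures are simplified from the names mentioned above.

```swift
/// Hypothetical reduction of the timeline to the two code paths in question.
final class EmitOnlyTimeline {
    let storeSegments: Bool
    private(set) var speakers: [Int: String] = [:]
    private(set) var committedSegments = 0

    init(storeSegments: Bool) {
        self.storeSegments = storeSegments
    }

    /// Explicit registration: not gated by storeSegments.
    func upsertSpeaker(named name: String, atIndex index: Int) {
        speakers[index] = name
    }

    /// Automatic tracking from processing: stays gated by storeSegments.
    func commitSegment(slot: Int) {
        guard storeSegments else { return }
        committedSegments += 1
        if speakers[slot] == nil {
            speakers[slot] = "Speaker \(slot)"
        }
    }
}

let emitOnly = EmitOnlyTimeline(storeSegments: false)
emitOnly.commitSegment(slot: 3)                       // ignored: storage is off
emitOnly.upsertSpeaker(named: "Alice", atIndex: 0)    // allowed regardless
let registered = emitOnly.speakers
let committed = emitOnly.committedSegments
```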

SGD2718 added 2 commits June 22, 2026 17:40

SGD2718 requested a review from Alex-Wengg June 23, 2026 02:20

SGD2718 self-assigned this Jun 23, 2026

SGD2718 added the bug Something isn't working label Jun 23, 2026

Copilot AI review requested due to automatic review settings June 23, 2026 02:20

SGD2718 added the speaker-diarization Issues related to speaker diarization label Jun 23, 2026

Copilot started reviewing on behalf of SGD2718 June 23, 2026 02:20

Copilot AI reviewed Jun 23, 2026

Comment thread Sources/FluidAudio/Diarizer/DiarizerTimeline.swift Outdated

SGD2718 and others added 4 commits June 22, 2026 20:23

Fix LS-EEND Pre-enrollment

9381e5c

- Tentative segments in the diarizer update are now inspected
- Shortened long comments

Merge branch 'main' into fix/lseend-enrollment-timestamp-offset

cdd769c

SGD2718 wants to merge 6 commits into main from fix/lseend-enrollment-timestamp-offset
