LS-EEND Speaker Pre-Enrollment Bugfixes #729
enrollSpeaker reset the timeline origin to frame 0 but left framesFedToModel advanced, so the first live chunk skipped the convDelay output strip. The encoder's right-context look-ahead (holding the enrollment drain silence) then leaked into the timeline, shifting every reported segment timestamp later by convDelay (~9 frames, ~0.9s at 10Hz). Reset framesFedToModel to 0 on successful enrollment to re-arm the strip, and restore the captured value on the rollback paths so a failed enrollment leaves the counter consistent with the rolled-back session. warmupFrames only slices model output, so the primed recurrent state stays continuous. Add a regression test asserting an enrolled stream finalizes the same frame count as a baseline stream of identical live audio.
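The counter behavior described above can be sketched with a toy model. This is a minimal illustration, not the LSEENDDiarizer code: `ToyStreamer`, `feed`, `enroll`, and `timelineFrames` are hypothetical names, and the strip is modeled as simply dropping the first `convDelay` output frames after the counter is at zero.

```swift
// Toy model of an output strip that hides the encoder's right-context frames.
// All names are illustrative assumptions, not taken from LSEENDDiarizer.
struct ToyStreamer {
    let convDelay: Int        // frames of right-context look-ahead
    var framesFedToModel = 0  // counter that arms the strip while below convDelay
    var timelineFrames = 0    // frames that reached the timeline

    init(convDelay: Int) { self.convDelay = convDelay }

    /// Feeds `count` frames and returns how many of them reach the timeline.
    mutating func feed(_ count: Int) -> Int {
        // Strip whatever part of the first `convDelay` frames has not been fed yet.
        let strip = min(max(convDelay - framesFedToModel, 0), count)
        framesFedToModel += count
        let emitted = count - strip
        timelineFrames += emitted
        return emitted
    }

    /// Feeds enrollment audio, then either commits or rolls back.
    mutating func enroll(frames: Int, succeeds: Bool) {
        let savedFed = framesFedToModel      // captured for the rollback path
        let savedTimeline = timelineFrames
        _ = feed(frames)
        if succeeds {
            timelineFrames = 0               // timeline origin back to frame 0
            framesFedToModel = 0             // re-arm the strip for the first live chunk
        } else {
            timelineFrames = savedTimeline   // rollback: restore both values
            framesFedToModel = savedFed
        }
    }
}

var baseline = ToyStreamer(convDelay: 9)
var enrolled = ToyStreamer(convDelay: 9)
enrolled.enroll(frames: 50, succeeds: true)
print(baseline.feed(100), enrolled.feed(100)) // prints "91 91"
```

If `framesFedToModel` were left at 50 after a successful enrollment, the first live chunk in this model would emit all 100 frames, which is the extra `convDelay` frames the commit describes.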
When a DiarizerTimeline is configured with storeSegments=false, the _speakers map is never populated, so enrollment can't read the detected speaker back to map it to a slot and always returns nil for both LSEENDDiarizer and SortformerDiarizer. Add a lock-guarded DiarizerTimeline.setStoreSegments(_:) that returns the previous value, and have both enrollSpeaker paths force storage on for the duration of enrollment and restore the configured value via defer on every exit path (success keeps the enrolled speaker; rollback drops it). Add regression tests for both diarizers asserting enrollment succeeds with storeSegments=false and the config falls back afterward.
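A minimal sketch of the toggle-and-restore pattern this commit describes, assuming a simplified timeline. `ToyTimeline`, `record`, `speakerName(at:)`, `isStoringSegments`, and the free function `enroll` are hypothetical stand-ins; only the shape of `setStoreSegments(_:)` (lock-guarded, returns the previous value) and the `defer` restore follow the description.

```swift
import Foundation

// Hypothetical stand-in for a timeline; not the real DiarizerTimeline.
final class ToyTimeline {
    private let lock = NSLock()
    private var storeSegments: Bool
    private var speakers: [Int: String] = [:]

    init(storeSegments: Bool) { self.storeSegments = storeSegments }

    /// Sets the flag under the lock and returns the previous value.
    @discardableResult
    func setStoreSegments(_ newValue: Bool) -> Bool {
        lock.lock()
        defer { lock.unlock() }
        let previous = storeSegments
        storeSegments = newValue
        return previous
    }

    var isStoringSegments: Bool {
        lock.lock()
        defer { lock.unlock() }
        return storeSegments
    }

    /// Records a detected speaker only while storage is enabled.
    func record(slot: Int, name: String) {
        lock.lock()
        defer { lock.unlock() }
        if storeSegments { speakers[slot] = name }
    }

    func speakerName(at slot: Int) -> String? {
        lock.lock()
        defer { lock.unlock() }
        return speakers[slot]
    }
}

/// Forces storage on for the duration of enrollment and restores the
/// caller's setting on every exit path via `defer`.
func enroll(name: String, detectedSlot: Int?, on timeline: ToyTimeline) -> Int? {
    let previous = timeline.setStoreSegments(true)
    defer { timeline.setStoreSegments(previous) }
    guard let slot = detectedSlot else { return nil } // failure path
    timeline.record(slot: slot, name: name)
    return timeline.speakerName(at: slot) == nil ? nil : slot
}
```

Returning the previous value from the setter lets the caller capture and restore the configuration without a separate read, so the read and the write happen under one lock acquisition.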
Pull request overview
Fixes two regressions in speaker pre-enrollment for the diarization pipeline (Sortformer + LS-EEND): (1) enrollment was causing LS-EEND streaming timestamps to be offset by the model’s right-context latency, and (2) enrollment could fail when DiarizerTimeline was configured not to persist segments.
Changes:
- LS-EEND: snapshot/restore framesFedToModel on enrollment failure and reset it on success to prevent right-context timestamp offset.
- Sortformer + LS-EEND: force DiarizerTimeline segment storage on during enrollment, then restore the caller’s configured storeSegments setting.
- Add regression tests for the above behaviors.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| Tests/FluidAudioTests/Diarizer/SpeakerEnrollmentTests.swift | Adds regression tests for enrollment with storeSegments=false and LS-EEND timestamp alignment after enrollment. |
| Sources/FluidAudio/Diarizer/Sortformer/SortformerDiarizer.swift | Temporarily enables timeline segment storage during enrollment so the enrolled speaker can be read back reliably. |
| Sources/FluidAudio/Diarizer/LS-EEND/LSEENDDiarizer.swift | Prevents right-context (convDelay) timestamp shift after enrollment and restores framesFedToModel on enrollment rollback. |
| Sources/FluidAudio/Diarizer/DiarizerTimeline.swift | Makes timeline config mutable (private-set) and adds setStoreSegments(_:) to support temporary segment persistence overrides. |
Supertonic3 Smoke Test ✅
Runtime: 0m24s
Note: CI VMs lack a physical Neural Engine; the ANE-bucketed VectorEstimator falls back to CPU here. This validates download + variant resolution + synthesis, not ANE residency/perf.
PocketTTS Smoke Test ❌
Runtime:
Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. The CI VM lacks a physical GPU, so audio quality and performance may differ from Apple Silicon.
Parakeet EOU Benchmark Results ✅
Status: Benchmark passed
Performance Metrics (table not recovered)
Streaming Metrics (table not recovered)
Test runtime: 0m52s • 06/23/2026, 12:22 AM EST
RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O
Speaker Diarization Benchmark Results
Speaker Diarization Performance: evaluating "who spoke when" detection accuracy (table not recovered)
Diarization Pipeline Timing Breakdown: time spent in each stage of speaker diarization (table not recovered)
Speaker Diarization Research Comparison: research baselines typically achieve 18-30% DER on standard datasets (table not recovered)
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 32.9s diarization time • Test runtime: 1m 48s • 06/23/2026, 12:28 AM EST
VAD Benchmark Results
Performance Comparison (table not recovered)
Dataset Details (table not recovered)
✅: Average F1-Score above 70%
Sortformer High-Latency Benchmark Results
ES2004a Performance, 30.4s latency config (table not recovered)
Sortformer High-Latency • ES2004a • Runtime: 2m 38s • 2026-06-23T04:22:53.718Z
Offline VBx Pipeline Results
Speaker Diarization Performance (VBx Batch Mode): optimal clustering with Hungarian algorithm for maximum accuracy (table not recovered)
Offline VBx Pipeline Timing Breakdown: time spent in each stage of batch diarization (table not recovered)
Speaker Diarization Research Comparison: offline VBx achieves competitive accuracy with batch processing (table not recovered)
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 97.2s processing • Test runtime: 1m 44s • 06/23/2026, 12:26 AM EST
ASR Benchmark Results ✅
Status: All benchmarks passed
Parakeet v3 (multilingual) (table not recovered)
Parakeet v2 (English-optimized) (table not recovered)
Streaming (v3) (table not recovered)
Streaming (v2) (table not recovered)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming
25 files per dataset • Test runtime: 7m35s • 06/23/2026, 12:29 AM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Expected RTFx Performance on Physical M1 Hardware:
• M1 Mac: ~28x (clean), ~25x (other)
Testing methodology follows HuggingFace Open ASR Leaderboard
… segments
The previous fix forced the timeline to store segments during enrollment so the speaker could be read back from timeline.speakers. Instead, derive the enrolled slot from the DiarizerTimelineUpdate (whose segments are emitted regardless of storeSegments) and register the speaker identity directly via upsertSpeaker, which is no longer gated by storeSegments.
- DiarizerTimeline.upsertSpeaker(named:atIndex:) registers identity regardless of storeSegments; removed the now-redundant setStoreSegments toggle and reverted config back to a let.
- LSEENDDiarizer: upsert the speaker at the slot found from the update.
- SortformerDiarizer: accumulate per-slot speech frames from the updates (plus the trailing tentative segment) to pick the dominant slot.
Enrollment now works with storeSegments=false without temporarily mutating the timeline configuration. Tests assert the enrolled identity persists.
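The dominant-slot selection described for SortformerDiarizer could look roughly like the sketch below. The types `ToySegment` and `ToyUpdate` and the function `dominantSlot(in:)` are hypothetical; the real DiarizerTimelineUpdate may be shaped differently. The sketch assumes tentative segments are re-emitted with each update, so only the last update's tentative segments are counted, and it breaks ties by lowest slot index, which is an arbitrary choice made here for determinism.

```swift
// Hypothetical shapes; the real DiarizerTimelineUpdate may differ.
struct ToySegment {
    let slot: Int
    let startFrame: Int
    let endFrame: Int // exclusive
    var frameCount: Int { max(endFrame - startFrame, 0) }
}

struct ToyUpdate {
    var finalized: [ToySegment] = []
    var tentative: [ToySegment] = []
}

/// Sums speech frames per slot across the updates and returns the slot
/// with the most frames, or nil when no speech was detected.
func dominantSlot(in updates: [ToyUpdate]) -> Int? {
    var framesPerSlot: [Int: Int] = [:]
    for update in updates {
        for segment in update.finalized {
            framesPerSlot[segment.slot, default: 0] += segment.frameCount
        }
    }
    // Count only the trailing tentative segments (assumption: earlier
    // tentative segments were later finalized or replaced).
    if let last = updates.last {
        for segment in last.tentative {
            framesPerSlot[segment.slot, default: 0] += segment.frameCount
        }
    }
    // Most frames wins; ties go to the lowest slot index.
    return framesPerSlot
        .filter { $0.value > 0 }
        .min { a, b in a.value != b.value ? a.value > b.value : a.key < b.key }?.key
}

let updates = [
    ToyUpdate(finalized: [ToySegment(slot: 0, startFrame: 0, endFrame: 5)]),
    ToyUpdate(finalized: [ToySegment(slot: 1, startFrame: 5, endFrame: 10)],
              tentative: [ToySegment(slot: 1, startFrame: 10, endFrame: 20)]),
]
print(dominantSlot(in: updates) ?? -1) // prints "1"
```

Because the slot is derived from the emitted updates rather than from stored timeline state, this selection does not depend on storeSegments.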
- Tentative segments in the diarizer update are now inspected
- Shortened long comments
Speaker enrollment must be able to register a named speaker even when the timeline is configured not to store segments (storeSegments=false). The DiarizerTimeline.upsertSpeaker(_:atIndex:) overload was still gated by storeSegments, leaving it inconsistent with the named: overload and causing the emit-only enrollment path to behave differently than intended.
- DiarizerTimeline: ungate the speaker-object upsertSpeaker overload so both overloads register identity independent of storeSegments. Auto-tracking from processing (commitSegment) stays gated.
- Update the emit-only timeline test to assert upsert is allowed (was asserting it is refused, which encoded the wrong contract).
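The contract this commit describes (explicit upserts always register, automatic tracking stays gated) can be illustrated with a small sketch. The overload labels `upsertSpeaker(named:atIndex:)` and `upsertSpeaker(_:atIndex:)` and the name `commitSegment` come from the commit message; the bodies, the `ToySpeaker` type, and the `EmitOnlyTimeline` class are assumptions made for illustration.

```swift
// Illustrative types; not the real DiarizerTimeline implementation.
struct ToySpeaker {
    let name: String
    var segmentCount = 0
}

final class EmitOnlyTimeline {
    let storeSegments: Bool
    private(set) var speakers: [Int: ToySpeaker] = [:]

    init(storeSegments: Bool) { self.storeSegments = storeSegments }

    /// Explicit registration by name: not gated by storeSegments.
    func upsertSpeaker(named name: String, atIndex index: Int) {
        let count = speakers[index]?.segmentCount ?? 0
        speakers[index] = ToySpeaker(name: name, segmentCount: count)
    }

    /// Explicit registration with a speaker object: also not gated.
    func upsertSpeaker(_ speaker: ToySpeaker, atIndex index: Int) {
        speakers[index] = speaker
    }

    /// Auto-tracking from processing: stays gated by storeSegments.
    func commitSegment(slot: Int) {
        guard storeSegments else { return }
        speakers[slot, default: ToySpeaker(name: "Speaker \(slot)")].segmentCount += 1
    }
}

let timeline = EmitOnlyTimeline(storeSegments: false)
timeline.commitSegment(slot: 0)                    // ignored: storage is off
timeline.upsertSpeaker(named: "alice", atIndex: 1) // registered anyway
print(timeline.speakers.count) // prints "1"
```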
Summary:
Fixed two issues with LS-EEND speaker pre-enrollment:
Issue 1: Pre-Enrollment Offsets DiarizerTimeline Timestamps
Cause:
Pre-enrollment offsets timestamps. The silence frames that pad the pre-enrollment audio are counted as valid frames in the DiarizerTimeline sample count, pushing all actual audio frames about 1 second into the future.
Solution:
Reset LS-EEND's sample counter after pre-enrollment to make it exclude the warmup audio from the output. When the actual session starts, the silent frames will be skipped. This will not corrupt the recurrent state because it simply affects which frames are selected from the model's output. Additionally, when pre-enrolling back-to-back speakers, their audio is separated by a silence gap that is double the right context size (i.e., twice the number of frames it will skip), ensuring that no part of the enrollment output is cut off.
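As a worked example of the gap size, the sketch below uses the approximate figures quoted earlier in this thread (a right context of about 9 frames at 10 Hz). The function names are hypothetical, and the exact values in the model may differ.

```swift
/// Silence gap, in frames, inserted between back-to-back enrollments:
/// twice the right-context size, per the description above.
func enrollmentGapFrames(rightContextFrames: Int) -> Int {
    return 2 * rightContextFrames
}

/// Converts a frame count to seconds at the given output frame rate.
func seconds(forFrames frames: Int, frameRateHz: Double) -> Double {
    return Double(frames) / frameRateHz
}

let rightContext = 9  // approximate figure quoted in this thread
let frameRate = 10.0  // output frames per second
let gap = enrollmentGapFrames(rightContextFrames: rightContext)
print(gap, seconds(forFrames: gap, frameRateHz: frameRate))
```

With these figures the gap is 18 frames, so after the 9 skipped frames there are still 9 frames of silence between the previous speaker's output and the next speaker's audio.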
Issue 2: Pre-Enrollment Fails When No Segments Can Be Saved
Cause:
Disabling segment saving in the DiarizerTimeline prevents pre-enrollment from working, because enrollment identifies the speaker by counting the newly saved segments for each speaker.
Solution:
Use the DiarizerUpdate object emitted by the flush instead, since it will contain the detected segments regardless of whether the timeline can save segments. Also removed the unit test enforcing this incorrect behavior.
Note: this was also an issue for Sortformer and was fixed in a similar way.