## Background

PR #2085 (merged) reworked dataflow restart into a two-phase, event-driven flow to fix the cross-daemon Zenoh race in #2082:

- **Phase 1** (`initiate_restart`) sends `StopDataflow` to all daemons, persists `Stopping`, and parks a `PendingRestart` (descriptor + name + uv + the caller's `reply_sender`) in `pending_restarts`.
- **Phase 2** completes inside the `DataflowFinishedOnDaemon` handler when the *last* daemon reports finish, calling `start_dataflow` and firing the parked reply.

The core fix is correct. This issue tracks the follow-up gaps that PR #2085 left behind (one of them already flagged as an inline TODO).

## Problem 1: caller hangs forever if a daemon dies mid-restart (primary)

If a daemon crashes/disconnects after receiving `StopDataflow` but before sending `DataflowFinishedOnDaemon`, the `PendingRestart` entry is never removed and its `reply_sender` never fires. The restart caller (CLI) blocks indefinitely with no error.

This is a behavioral regression relative to the old synchronous `restart_dataflow`, which surfaced stop/start failures immediately. The two-phase flow trades a visible error for a silent hang.

The PR documents this as an inline TODO in `initiate_restart`:

```rust
// TODO: If a daemon crashes after receiving StopDataflow but before
// sending DataflowFinishedOnDaemon, this entry (and its reply_sender)
// will never be cleaned up, causing the restart caller to hang
// indefinitely. A timeout or a daemon-disconnect cleanup path should
// be added to evict stale PendingRestart entries.
```

**Proposed fix:** evict stale `PendingRestart` entries and reply `Err` when their restart can no longer complete. This can be done either via a timeout, or (preferred) by draining matching `pending_restarts` entries in the daemon-disconnect cleanup path so the caller gets a clear error.
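
A minimal sketch of the preferred disconnect-driven eviction, to make the proposal concrete. It is not the coordinator's actual code: `evict_restarts_for_daemon`, the `waiting_on` field, and the `DataflowId`/`DaemonId`/`RestartReply` aliases are hypothetical names introduced for illustration, and `std::sync::mpsc::Sender` stands in for whatever reply channel `reply_sender` really is. The descriptor and uv fields are omitted.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::mpsc::Sender;

// Hypothetical stand-ins for the coordinator's real ID and reply types.
type DataflowId = u64;
type DaemonId = String;
type RestartReply = Result<(), String>;

// Simplified PendingRestart: only the fields this sketch needs.
struct PendingRestart {
    name: Option<String>,
    // Assumption: the daemons that have not yet reported
    // DataflowFinishedOnDaemon are tracked per pending restart.
    waiting_on: HashSet<DaemonId>,
    reply_sender: Sender<RestartReply>,
}

/// Removes every pending restart that is still waiting on `lost` and
/// replies `Err` to its caller. Returns the evicted dataflow IDs, sorted.
fn evict_restarts_for_daemon(
    pending_restarts: &mut HashMap<DataflowId, PendingRestart>,
    lost: &str,
) -> Vec<DataflowId> {
    // Collect the affected IDs first so the map is not mutated while iterating.
    let mut stale: Vec<DataflowId> = pending_restarts
        .iter()
        .filter(|(_, pending)| pending.waiting_on.contains(lost))
        .map(|(id, _)| *id)
        .collect();
    stale.sort_unstable();

    for id in &stale {
        if let Some(pending) = pending_restarts.remove(id) {
            let label = pending.name.as_deref().unwrap_or("<unnamed>");
            // The caller may already be gone; a failed send is ignored.
            let _ = pending.reply_sender.send(Err(format!(
                "restart of dataflow {id} ({label}) aborted: daemon `{lost}` \
                 disconnected before reporting DataflowFinishedOnDaemon"
            )));
        }
    }
    stale
}
```

Calling such a helper from the daemon-disconnect cleanup path would turn the silent hang into an immediate error for the caller. A timeout-based sweep could reuse the same remove-then-reply step, selecting entries by age instead of by daemon.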

## Problem 2: double-restart guard runs after `StopDataflow` (minor)

In `initiate_restart`, the `pending_restarts.contains_key(...)` guard is checked *after* `stop_dataflow` has already broadcast `StopDataflow` to every daemon. A duplicate `Restart` for an already-restarting dataflow therefore incurs a redundant stop round-trip before being rejected.

The guard should be moved ahead of the `stop_dataflow` call (right after extracting the descriptor) so duplicates are rejected before any daemon I/O.

## Problem 3: no coordinator-side test for the new restart path (minor)

The new two-phase restart shipped without a coordinator test. The race itself is hard to reproduce deterministically, but the double-restart guard and the deferred-completion → reply wiring are both testable in `binaries/coordinator/tests/` without real daemons. Adding coverage there would lock in the guard behavior and the cleanup path from Problem 1.

## References

- PR #2085 (the two-phase restart fix)
- #2082 (the original flaky cluster-e2e race)