coordinator: stale PendingRestart leaks reply_sender and hangs caller if a daemon dies mid-restart

## Background

PR #2085 (merged) reworked dataflow restart into a two-phase, event-driven flow to fix the cross-daemon Zenoh race in #2082:

- **Phase 1** (`initiate_restart`) sends `StopDataflow` to all daemons, persists `Stopping`, and parks a `PendingRestart` (descriptor + name + uv + the caller's `reply_sender`) in `pending_restarts`.
- **Phase 2** completes inside the `DataflowFinishedOnDaemon` handler when the *last* daemon reports finish, calling `start_dataflow` and firing the parked reply.
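
The two phases can be sketched roughly as below. This is a simplified model, not the coordinator's actual code: `DataflowId`, the `awaiting` set, `on_finished_on_daemon`, and the use of `std::sync::mpsc` in place of the real reply channel are all assumptions made for illustration, and the descriptor, name, and uv fields are omitted.

```rust
use std::collections::{BTreeSet, HashMap};
use std::sync::mpsc::{channel, Receiver, Sender};

// Hypothetical stand-ins for the coordinator's real types.
type DataflowId = u32;
type RestartReply = Result<DataflowId, String>;

/// Parked state for a restart whose stop phase is still in flight.
struct PendingRestart {
    /// Daemons that have not yet reported finish (assumed bookkeeping).
    awaiting: BTreeSet<String>,
    /// The caller's reply channel, fired only in phase 2.
    reply_sender: Sender<RestartReply>,
}

#[derive(Default)]
struct Coordinator {
    pending_restarts: HashMap<DataflowId, PendingRestart>,
}

impl Coordinator {
    /// Phase 1: the real code sends StopDataflow here; this sketch only parks the reply.
    fn initiate_restart(&mut self, id: DataflowId, daemons: &[&str]) -> Receiver<RestartReply> {
        let (reply_sender, reply) = channel();
        let awaiting = daemons.iter().map(|d| d.to_string()).collect();
        self.pending_restarts
            .insert(id, PendingRestart { awaiting, reply_sender });
        reply
    }

    /// Phase 2: called once per DataflowFinishedOnDaemon event.
    fn on_finished_on_daemon(&mut self, id: DataflowId, daemon: &str) {
        let Some(pending) = self.pending_restarts.get_mut(&id) else {
            return;
        };
        pending.awaiting.remove(daemon);
        if !pending.awaiting.is_empty() {
            return; // not the last daemon yet
        }
        if let Some(done) = self.pending_restarts.remove(&id) {
            // Stand-in for start_dataflow succeeding and returning the id.
            let _ = done.reply_sender.send(Ok(id));
        }
    }
}
```

In this model the reply fires only once every daemon in `awaiting` has reported, so a daemon that never reports leaves the entry parked indefinitely.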

The core fix is correct. This issue tracks the follow-up gaps that PR #2085 left behind (one of them already flagged as an inline TODO).

## Problem 1 — caller hangs forever if a daemon dies mid-restart (primary)

If a daemon crashes/disconnects after receiving `StopDataflow` but before sending `DataflowFinishedOnDaemon`, the `PendingRestart` entry is never removed and its `reply_sender` never fires. The restart caller (CLI) blocks indefinitely with no error.

This is a behavioral regression relative to the old synchronous `restart_dataflow`, which surfaced stop/start failures immediately. The two-phase flow trades a visible error for a silent hang.

The PR documents this as an inline TODO in `initiate_restart`:

```rust
// TODO: If a daemon crashes after receiving StopDataflow but before
// sending DataflowFinishedOnDaemon, this entry (and its reply_sender)
// will never be cleaned up, causing the restart caller to hang
// indefinitely. A timeout or a daemon-disconnect cleanup path should
// be added to evict stale PendingRestart entries.
```

**Proposed fix:** evict stale `PendingRestart` entries and reply with `Err` once their restart can no longer complete. This can be done via a timeout or (preferred) by draining matching `pending_restarts` entries in the daemon-disconnect cleanup path, so the caller gets a clear error instead of hanging.
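
A minimal sketch of the disconnect-driven variant, under stated assumptions: `evict_on_daemon_disconnect` is a hypothetical helper, the `awaiting` set is an assumed way of knowing which daemons a restart still depends on, and `std::sync::mpsc::Sender` stands in for whatever reply channel the coordinator really uses.

```rust
use std::collections::{BTreeSet, HashMap};
use std::sync::mpsc::Sender;

// Hypothetical stand-ins for the coordinator's real types.
type DataflowId = u32;
type RestartReply = Result<DataflowId, String>;

struct PendingRestart {
    /// Daemons that still owe a DataflowFinishedOnDaemon (assumed field).
    awaiting: BTreeSet<String>,
    reply_sender: Sender<RestartReply>,
}

/// Drains every pending restart that still waits on `daemon` and replies
/// `Err` to its caller. Returns the number of evicted entries.
fn evict_on_daemon_disconnect(
    pending_restarts: &mut HashMap<DataflowId, PendingRestart>,
    daemon: &str,
) -> usize {
    // Collect ids first so the map is not mutated while iterating over it.
    let stale: Vec<DataflowId> = pending_restarts
        .iter()
        .filter(|(_, pending)| pending.awaiting.contains(daemon))
        .map(|(id, _)| *id)
        .collect();

    for id in &stale {
        if let Some(pending) = pending_restarts.remove(id) {
            // The caller may already be gone; a failed send is not an error here.
            let _ = pending.reply_sender.send(Err(format!(
                "restart of dataflow {id} aborted: daemon `{daemon}` disconnected before reporting finish"
            )));
        }
    }
    stale.len()
}
```

A timeout-based variant would follow the same shape, filtering on a stored start time rather than on the disconnected daemon.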

## Problem 2 — double-restart guard runs after `StopDataflow` (minor)

In `initiate_restart`, the `pending_restarts.contains_key(...)` guard is checked *after* `stop_dataflow` has already broadcast `StopDataflow` to every daemon. A duplicate `Restart` for an already-restarting dataflow therefore incurs a redundant stop round-trip before being rejected. The guard should be moved ahead of the `stop_dataflow` call (right after extracting the descriptor) so duplicates are rejected before any daemon I/O.
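
The intended ordering might look like the sketch below. It is a toy model rather than the real `initiate_restart`: the `stop_broadcasts` counter is a hypothetical stand-in for daemon I/O, the map value is elided to `()`, and the error text is illustrative.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the coordinator's dataflow id type.
type DataflowId = u32;

#[derive(Default)]
struct Coordinator {
    /// Value elided; the real entry holds the parked PendingRestart.
    pending_restarts: HashMap<DataflowId, ()>,
    /// Counts StopDataflow broadcasts so the ordering is observable.
    stop_broadcasts: usize,
}

impl Coordinator {
    /// Stand-in for the real stop_dataflow, which talks to every daemon.
    fn stop_dataflow(&mut self, _id: DataflowId) {
        self.stop_broadcasts += 1;
    }

    fn initiate_restart(&mut self, id: DataflowId) -> Result<(), String> {
        // Guard first: reject duplicates before any daemon I/O happens.
        if self.pending_restarts.contains_key(&id) {
            return Err(format!("dataflow {id} is already restarting"));
        }
        self.stop_dataflow(id);
        self.pending_restarts.insert(id, ());
        Ok(())
    }
}
```

With the guard in this position, a second `initiate_restart` for the same id returns `Err` while `stop_broadcasts` stays at 1.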

## Problem 3 — no coordinator-side test for the new restart path (minor)

The new two-phase restart shipped without a coordinator test. The race itself is hard to reproduce deterministically, but the double-restart guard and the deferred-completion → reply wiring are both testable in `binaries/coordinator/tests/` without real daemons. Adding coverage there would lock in the guard behavior and the cleanup path from Problem 1.

## References

- PR #2085 (the two-phase restart fix)
- #2082 (the original flaky cluster-e2e race)
