@yuandrew (Contributor) commented Dec 16, 2025

What was changed

Always send the shutdown_worker RPC, and decouple disabling eager workflow start from worker heartbeat unregistration during worker shutdown.

Why?

The shutdown_worker RPC doesn't indicate that the worker has fully shut down, only that shutdown has started. The server (and other observers) can tell that a worker has fully shut down by checking that no heartbeat has arrived within the heartbeat interval after the ShuttingDown status was reported.
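
As a rough illustration of that rule, here is a minimal sketch of the detection check; the type and field names (WorkerInfo, last_heartbeat_at, heartbeat_interval) are hypothetical and not taken from this PR or the server:

```rust
use std::time::{Duration, Instant};

// Hypothetical types for illustration only; not part of this PR.
enum WorkerStatus {
    Running,
    ShuttingDown,
}

struct WorkerInfo {
    status: WorkerStatus,
    last_heartbeat_at: Instant,
    heartbeat_interval: Duration,
}

impl WorkerInfo {
    /// A worker counts as fully shut down once it has reported ShuttingDown
    /// and no heartbeat has arrived within one heartbeat interval.
    fn is_fully_shut_down(&self, now: Instant) -> bool {
        matches!(self.status, WorkerStatus::ShuttingDown)
            && now.duration_since(self.last_heartbeat_at) > self.heartbeat_interval
    }
}
```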

Checklist

  1. Closes

  2. How was this tested:

  3. Any docs updates needed?

Note

Always send the shutdown_worker RPC and refactor worker unregistration into two steps (disable eager start, then finalize heartbeat cleanup), updating APIs and tests accordingly.

  • Worker shutdown behavior
    • Always calls shutdown_worker RPC during shutdown; sets status to WorkerStatus::ShuttingDown.
    • Removes client-side mutation of heartbeat status in shutdown_worker; client only fills common heartbeat fields.
    • finalize_shutdown now calls workers().finalize_unregister(...) after shutdown completes.
  • Client worker registry (slot providers/heartbeat)
    • Replaces unregister_worker with a two-step API (see the sketch after this list):
      • unregister_slot_provider(worker_instance_key) to disable eager workflow start early.
      • finalize_unregister(worker_instance_key) to remove the worker from all_workers and the heartbeat manager; errors if the worker is still present in slot_providers.
    • Worker::initiate_shutdown and replace_client updated to use the new two-step flow.
  • Tests and mocks
    • Update unit/integration tests to use unregister_slot_provider then finalize_unregister; add worker_unregister_order test for enforcement.
    • Expect shutdown_worker to be invoked once (success or best-effort failure tolerated); heartbeat status expectations changed to ShuttingDown.
    • Minor formatting updates in poll_buffer.rs log asserts.
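
To make the two-step registry flow concrete, here is a minimal sketch assuming a simplified registry; the real sdk-core signatures (key types, return types, the heartbeat manager) may differ:

```rust
use std::collections::{HashMap, HashSet};

// Simplified sketch of the two-step unregistration; not the actual sdk-core code.
struct WorkerRegistry {
    slot_providers: HashSet<u64>,  // workers still eligible for eager workflow start
    all_workers: HashMap<u64, ()>, // worker_instance_key -> worker handle (simplified)
}

impl WorkerRegistry {
    /// Step 1 (from initiate_shutdown): stop offering this worker for eager
    /// workflow start, but keep it registered and heartbeating.
    fn unregister_slot_provider(&mut self, worker_instance_key: u64) {
        self.slot_providers.remove(&worker_instance_key);
    }

    /// Step 2 (from finalize_shutdown, after shutdown_worker was sent): drop
    /// the worker from the registry and heartbeat tracking. Errors if step 1
    /// was never performed.
    fn finalize_unregister(&mut self, worker_instance_key: u64) -> Result<(), String> {
        if self.slot_providers.contains(&worker_instance_key) {
            return Err("slot provider still registered; call unregister_slot_provider first".into());
        }
        self.all_workers.remove(&worker_instance_key);
        Ok(())
    }
}
```

The ordering mirrors the description above: initiate_shutdown disables eager start early, and finalize_shutdown only removes the worker (and its heartbeat) once the shutdown_worker RPC has gone out.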

Written by Cursor Bugbot for commit 1849b16. This will update automatically on new commits.

@yuandrew marked this pull request as ready for review December 17, 2025 19:46
@yuandrew requested a review from a team as a code owner December 17, 2025 19:46
     .workers()
-    .unregister_worker(self.worker_instance_key);
+    .unregister_slot_provider(self.worker_instance_key);
 }
Cursor Bugbot commented:

Bug: Shutdown status not set on initiation

initiate_shutdown no longer updates self.status to WorkerStatus::ShuttingDown. Callers that use initiate_shutdown to begin shutdown (before awaiting shutdown/finalize_shutdown) will keep sending heartbeats with Running, delaying/obscuring shutdown signaling and breaking server-side “seen ShuttingDown then no heartbeat” detection.


@yuandrew (Contributor, Author) replied:

We want the ShuttingDown state to be set when we send the shutdown_worker RPC call.

// This is a best-effort call and we can still shut down the worker if it fails
match self.client.shutdown_worker(sticky_name, heartbeat).await {
    Err(err)
        if !matches!(
Cursor Bugbot commented:

Bug: Empty sticky queue sent on shutdown

shutdown() now always calls shutdown_worker and uses unwrap_or_default() for sticky_name, which becomes an empty string when no sticky queue is used (e.g., max_cached_workflows == 0 or workflow polling is disabled). If the server treats an empty sticky_task_queue as invalid once implemented, this can cause noisy warnings and failed shutdown signaling.
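
For reference, a tiny self-contained illustration of the behavior being flagged; the variable is hypothetical, standing in for the Option the diff unwraps:

```rust
fn main() {
    // With no sticky queue (e.g. the workflow cache is disabled), the sticky name is None...
    let sticky_name: Option<String> = None;
    // ...so unwrap_or_default() yields an empty string, which is what the RPC would then carry.
    let sent = sticky_name.unwrap_or_default();
    assert_eq!(sent, "");
}
```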


@yuandrew (Contributor, Author) replied:

This is intentional; we want to always send shutdown_worker, not just when a sticky queue is in use.

@Sushisource (Member) left a comment:

Looking good to me (aside from a few tests that look like they need updating), just one minor thing:

slot_vec.retain(|info| info.worker_id != worker_instance_key);
if slot_vec.is_empty() {
    self.slot_providers.remove(&slot_key);
if let Some(slot_vec) = self.slot_providers.get(&slot_key) {
@Sushisource (Member) commented:

We just did this check above, no? Could this ever happen?
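
For context, one way the second lookup could be folded into the first (a sketch only, not the PR's code; names mirror the snippet above):

```rust
use std::collections::{hash_map::Entry, HashMap};

struct SlotInfo {
    worker_id: u64,
}

// Sketch: resolve the key once via the entry API, then retain and
// conditionally remove. Not the PR's code; names mirror the snippet above.
fn remove_worker_slot(
    slot_providers: &mut HashMap<String, Vec<SlotInfo>>,
    slot_key: &str,
    worker_instance_key: u64,
) {
    if let Entry::Occupied(mut slot_vec) = slot_providers.entry(slot_key.to_owned()) {
        slot_vec.get_mut().retain(|info| info.worker_id != worker_instance_key);
        if slot_vec.get().is_empty() {
            slot_vec.remove();
        }
    }
}
```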
