Always send shutdown_worker RPC, fix WorkerStatus state when shutting down worker #1082
      .workers()
-     .unregister_worker(self.worker_instance_key);
+     .unregister_slot_provider(self.worker_instance_key);
  }
Bug: Shutdown status not set on initiation
initiate_shutdown no longer updates self.status to WorkerStatus::ShuttingDown. Callers that use initiate_shutdown to begin shutdown (before awaiting shutdown/finalize_shutdown) will keep sending heartbeats with Running, delaying/obscuring shutdown signaling and breaking server-side “seen ShuttingDown then no heartbeat” detection.
We want the ShuttingDown state to be set when we send the shutdown_worker RPC call.
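For reference, the "seen ShuttingDown then no heartbeat" rule mentioned in this thread could be sketched roughly as follows. This is only an illustration of the idea, not the server's implementation: `LastSeen`, `is_fully_shut_down`, and the two-variant `WorkerStatus` enum are hypothetical names made up for the example.

```rust
use std::time::{Duration, Instant};

/// Hypothetical subset of the statuses a worker reports in its heartbeat.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkerStatus {
    Running,
    ShuttingDown,
}

/// Hypothetical record of the most recent heartbeat seen for one worker.
#[derive(Debug, Clone, Copy)]
pub struct LastSeen {
    pub status: WorkerStatus,
    pub at: Instant,
}

/// A worker is treated as fully shut down only when its last reported status
/// was `ShuttingDown` and more than one heartbeat interval has passed since.
pub fn is_fully_shut_down(last: LastSeen, now: Instant, heartbeat_interval: Duration) -> bool {
    last.status == WorkerStatus::ShuttingDown
        && now.saturating_duration_since(last.at) > heartbeat_interval
}
```

Under this rule a worker that keeps heartbeating with Running is never classified as shut down, which is why the status has to flip to ShuttingDown before the heartbeats stop.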
  // This is a best effort call and we can still shutdown the worker if it fails
  match self.client.shutdown_worker(sticky_name, heartbeat).await {
      Err(err)
          if !matches!(
Bug: Empty sticky queue sent on shutdown
shutdown() now always calls shutdown_worker and uses unwrap_or_default() for sticky_name, which becomes an empty string when no sticky queue is used (e.g., max_cached_workflows == 0 or workflow polling disabled). If the server treats an empty sticky_task_queue as invalid when implemented, this can cause noisy warnings and failed shutdown signaling.
This is intentional; we want to always send shutdown_worker, not just when a sticky queue is in use.
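To make the behavior under discussion concrete, here is a minimal synchronous sketch of a best-effort shutdown call that falls back to an empty sticky queue name. It is not the code from this PR: the real call is async and also takes a heartbeat, and `RpcError`, `ShutdownClient`, `ShutdownSignal`, `StubClient`, and `signal_shutdown` are hypothetical names. Which errors count as "expected" is also an assumption.

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for an RPC failure.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RpcError {
    Unimplemented,
    Unavailable,
    Other(String),
}

/// Hypothetical client interface (the real one is async).
pub trait ShutdownClient {
    fn shutdown_worker(&self, sticky_name: String) -> Result<(), RpcError>;
}

/// What happened to the best-effort call; shutdown proceeds in every case.
#[derive(Debug, PartialEq, Eq)]
pub enum ShutdownSignal {
    Sent,
    Ignored(RpcError),
    Warned(RpcError),
}

pub fn signal_shutdown<C: ShutdownClient>(client: &C, sticky_queue: Option<String>) -> ShutdownSignal {
    // With no sticky queue the name becomes an empty string.
    let sticky_name = sticky_queue.unwrap_or_default();
    match client.shutdown_worker(sticky_name) {
        Ok(()) => ShutdownSignal::Sent,
        // Unexpected failures are logged, but they do not abort shutdown.
        Err(err) if !matches!(err, RpcError::Unimplemented | RpcError::Unavailable) => {
            eprintln!("shutdown_worker failed: {err:?}");
            ShutdownSignal::Warned(err)
        }
        // Assumed "expected" failures (old or unreachable server) stay quiet.
        Err(err) => ShutdownSignal::Ignored(err),
    }
}

/// Test double that records the sticky names it receives.
pub struct StubClient {
    pub reply: Result<(), RpcError>,
    pub seen_sticky_names: RefCell<Vec<String>>,
}

impl ShutdownClient for StubClient {
    fn shutdown_worker(&self, sticky_name: String) -> Result<(), RpcError> {
        self.seen_sticky_names.borrow_mut().push(sticky_name);
        self.reply.clone()
    }
}
```

If the server ever rejects an empty sticky queue name, a sketch like this would surface it as a warning rather than a failed shutdown, which matches the "best effort" intent stated in the code comment.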
Looking good to me (aside from it looks like a few tests need updating), just one minor thing
crates/client/src/worker/mod.rs
  slot_vec.retain(|info| info.worker_id != worker_instance_key);
  if slot_vec.is_empty() {
      self.slot_providers.remove(&slot_key);
  if let Some(slot_vec) = self.slot_providers.get(&slot_key) {
We just did this check above, no? Could this ever happen?
What was changed
Always send shutdown_worker RPC, decouple disabling eager workflow start and worker heartbeat unregistration for worker shutdown
Why?
The shutdown_worker RPC doesn't indicate that the worker has fully shut down, only that shutdown has started. The server and other observers can tell that a worker has fully shut down by checking whether there has been a heartbeat within the "heartbeat interval" after receiving the ShuttingDown status.
Checklist
Closes
How was this tested:
Note
Always send shutdown_worker RPC and refactor worker unregistration into two steps (disable eager start, then finalize heartbeat cleanup), updating APIs and tests.
- Always send shutdown_worker RPC during shutdown; sets status to WorkerStatus::ShuttingDown.
- … status in shutdown_worker; client only fills common heartbeat fields.
- finalize_shutdown now calls workers().finalize_unregister(...) after shutdown completes.
- Replace unregister_worker with two-step API:
  - unregister_slot_provider(worker_instance_key) to disable eager workflow start early.
  - finalize_unregister(worker_instance_key) to remove from all_workers and heartbeat manager; errors if still present in slot_providers.
- Worker::initiate_shutdown and replace_client updated to use the new two-step flow.
- … unregister_slot_provider then finalize_unregister; add worker_unregister_order test for enforcement.
- … shutdown_worker to be invoked once (success or best-effort failure tolerated); heartbeat status expectations changed to ShuttingDown.
- … poll_buffer.rs log asserts.

Written by Cursor Bugbot for commit 1849b16.
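The two-step unregistration order summarized above can be illustrated with a small self-contained sketch. It only mirrors the described behavior (step one disables eager workflow start, step two fails if step one was skipped). The `Registry` type, the `u64` worker key, the `register` helper, and the error enum are assumptions for the example and are not the crate's actual types.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical key type; the real worker_instance_key type is not shown in this PR page.
pub type WorkerKey = u64;

#[derive(Debug, PartialEq, Eq)]
pub enum UnregisterError {
    StillProvidingSlots(WorkerKey),
    UnknownWorker(WorkerKey),
}

#[derive(Default)]
pub struct Registry {
    /// Workers eligible for eager workflow start, grouped by a slot key.
    slot_providers: HashMap<String, Vec<WorkerKey>>,
    /// Every registered worker; stands in for heartbeat tracking as well.
    all_workers: HashSet<WorkerKey>,
}

impl Registry {
    pub fn register(&mut self, slot_key: &str, worker: WorkerKey) {
        self.slot_providers
            .entry(slot_key.to_string())
            .or_default()
            .push(worker);
        self.all_workers.insert(worker);
    }

    /// Step 1: stop offering this worker for eager workflow start.
    pub fn unregister_slot_provider(&mut self, worker: WorkerKey) {
        for slots in self.slot_providers.values_mut() {
            slots.retain(|w| *w != worker);
        }
        // Drop slot keys that no longer have any provider.
        self.slot_providers.retain(|_, slots| !slots.is_empty());
    }

    /// Step 2: remove the worker entirely; errors if step 1 has not run yet.
    pub fn finalize_unregister(&mut self, worker: WorkerKey) -> Result<(), UnregisterError> {
        if self.slot_providers.values().any(|slots| slots.contains(&worker)) {
            return Err(UnregisterError::StillProvidingSlots(worker));
        }
        if self.all_workers.remove(&worker) {
            Ok(())
        } else {
            Err(UnregisterError::UnknownWorker(worker))
        }
    }

    pub fn is_registered(&self, worker: WorkerKey) -> bool {
        self.all_workers.contains(&worker)
    }
}
```

Keeping the worker in `all_workers` between the two steps is what would let heartbeats carrying the ShuttingDown status continue until shutdown has actually completed.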