feat(job): cooperative job cancellation via unified NOTIFY router #61
Closed
bodymindarts wants to merge 6 commits into main
Force-pushed from 2a8188b to 2896a1d
Force-pushed from 2896a1d to 7dd6434
Allow job runners to attach a result value that callers receive through await_completion. The result flows from runner → OnceLock → entity event → JobCompletionResult without requiring any new migrations (backward-compatible serde(default) on the JobCompleted event variant). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The new await_completion tests registered pollers for the shared "test-job" type. Since nextest runs each test as a separate process sharing the same Postgres database, pollers from different processes competed for the same jobs. When a process exited before completing a stolen job (tokio runtime drops, cancelling the shutdown task), the job was left orphaned in 'running' state — causing test_cancel_already_completed_job_is_idempotent to time out waiting for its job to complete. Fix: give each await_completion test its own job type via a new AwaitTestJobInitializer so cross-process pollers never interfere. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ss poller interference Each test now gets its own unique job type via TestJobInitializer so that nextest parallel processes don't steal each other's jobs from the shared database. This extends the await_completion fix to all remaining tests that shared the "test-job" type, which caused flaky timeouts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
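A minimal sketch of why a per-test job type stops pollers from stealing each other's work. Everything here is hypothetical and in-memory: `SharedQueue` stands in for the shared Postgres jobs table, and `job_type_for` is an illustrative helper, not the crate's `TestJobInitializer`.

```rust
/// Hypothetical in-memory stand-in for the jobs table shared by all test processes.
#[derive(Default)]
pub struct SharedQueue {
    pending: Vec<(String, u32)>, // (job type, job id)
}

impl SharedQueue {
    pub fn enqueue(&mut self, job_type: &str, id: u32) {
        self.pending.push((job_type.to_string(), id));
    }

    /// A poller only claims jobs whose type it registered for.
    pub fn claim(&mut self, job_type: &str) -> Option<u32> {
        let pos = self
            .pending
            .iter()
            .position(|(t, _)| t.as_str() == job_type)?;
        Some(self.pending.remove(pos).1)
    }
}

/// Illustrative helper (an assumption): derive a distinct job type from the test name.
pub fn job_type_for(test_name: &str) -> String {
    format!("test-job-{test_name}")
}
```

With a single shared "test-job" type, whichever poller calls `claim` first would take the job regardless of which test spawned it. With distinct types, each poller only ever sees its own jobs.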
Replace OnceLock with Mutex<Option> so callers can call set_result multiple times. The last value set before completion or error is persisted, enabling incremental progress tracking in batch jobs. Partial results are preserved on error so callers can see how far a job got before failing.
- Remove ResultAlreadySet error variant (multiple calls now allowed)
- Update dispatcher to use Mutex-based result holder
- Add tests for incremental set_result and partial progress on error
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
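A rough sketch of the last-write-wins behaviour described in this commit, assuming a string result for simplicity. `ResultHolder` and `run_batch` are hypothetical names used for illustration; the crate's actual holder type and its `set_result` signature may differ.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical result holder shared between a runner and the dispatcher.
/// Unlike OnceLock, a Mutex<Option<_>> can be overwritten any number of times.
#[derive(Clone, Default)]
pub struct ResultHolder {
    inner: Arc<Mutex<Option<String>>>,
}

impl ResultHolder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Overwrites any previously stored value.
    pub fn set_result(&self, value: impl Into<String>) {
        *self.inner.lock().unwrap() = Some(value.into());
    }

    /// Takes the last value set, leaving None behind.
    pub fn take(&self) -> Option<String> {
        self.inner.lock().unwrap().take()
    }
}

/// Simulated batch runner: records progress after each item and
/// optionally fails at index `fail_at`.
pub fn run_batch(
    holder: &ResultHolder,
    items: &[u32],
    fail_at: Option<usize>,
) -> Result<(), String> {
    for (i, _item) in items.iter().enumerate() {
        if Some(i) == fail_at {
            return Err(format!("item {i} failed"));
        }
        holder.set_result(format!("processed {}", i + 1));
    }
    Ok(())
}
```

If the runner fails on the third of three items, the holder still contains "processed 2", which is the partial progress a caller would see after the error.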
Add cancel_job() API that cooperatively cancels both pending and running jobs. Pending jobs are cancelled synchronously by deleting the execution row and recording cancel events. Running jobs are cancelled via a PG NOTIFY signal through the existing unified job_events channel, which triggers a tokio-util CancellationToken that the runner can observe.
Key changes:
- DB trigger fires job_cancel notification when cancelled_at transitions from NULL to non-NULL
- CancellationTokens store (DashMap) maps running job IDs to tokens
- NotificationRouter routes job_cancel events to cancel tokens
- CurrentJob exposes cancellation_requested() / cancellation_notified()
- JobCompletion::Cancelled variant for cooperative cancellation
- Force-cancel monitor aborts JoinHandle after cancel_timeout (default 30s)
- Keep-alive sweep as safety net for missed NOTIFY signals
- CancelResult enum: Cancelled | AlreadyCompleted | NotFound
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
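The token-store idea can be sketched with the standard library alone. This is not the PR's code: `CancelFlag` is a stand-in for tokio-util's CancellationToken, `CancelRegistry` uses a `Mutex<HashMap>` where the PR uses DashMap, job IDs are plain `u64`, and `Outcome` and `run_steps` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Stand-in for a cancellation token: a shared boolean flag.
#[derive(Clone, Default)]
pub struct CancelFlag(Arc<AtomicBool>);

impl CancelFlag {
    pub fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Stand-in for the store that maps running job IDs to their tokens.
#[derive(Default)]
pub struct CancelRegistry {
    tokens: Mutex<HashMap<u64, CancelFlag>>,
}

impl CancelRegistry {
    /// Called when a job starts running on this process.
    pub fn register(&self, job_id: u64) -> CancelFlag {
        let flag = CancelFlag::default();
        self.tokens.lock().unwrap().insert(job_id, flag.clone());
        flag
    }

    /// Called when a cancel notification arrives.
    /// Returns false if the job is not running here.
    pub fn cancel(&self, job_id: u64) -> bool {
        let tokens = self.tokens.lock().unwrap();
        match tokens.get(&job_id) {
            Some(flag) => {
                flag.cancel();
                true
            }
            None => false,
        }
    }

    /// Called when the job finishes, however it finished.
    pub fn remove(&self, job_id: u64) {
        self.tokens.lock().unwrap().remove(&job_id);
    }
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Complete,
    Cancelled,
}

/// Cooperative runner: checks the flag before each unit of work.
pub fn run_steps(flag: &CancelFlag, steps: usize, mut on_step: impl FnMut(usize)) -> Outcome {
    for step in 0..steps {
        if flag.is_cancelled() {
            return Outcome::Cancelled;
        }
        on_step(step);
    }
    Outcome::Complete
}
```

Cancellation here is only observed between steps, which is why the PR pairs it with a force-cancel timeout for runners that never check.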
Force-pushed from 7dd6434 to f44b140
Closing: reverting the cancel_job approach. Command-center will handle cancellation at its own layer instead of relying on an upstream job crate cancel API.
Summary
Implements cooperative job cancellation that works for both pending and running jobs, building on the unified PG NOTIFY router from PR #59.
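- Running jobs: a job_cancel notification through the existing job_events PG NOTIFY channel → CancellationToken → runner observes and returns JobCompletion::Cancelled
- Force-cancel: aborts the JoinHandle after cancel_timeout (default 30s) if the runner doesn't cooperate
- Keep-alive sweep: safety net for missed NOTIFY signals on jobs with cancelled_at set
Key changes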
- DB trigger on the cancelled_at column fires a job_cancel notification through the unified router
- CancellationTokens store (DashMap<JobId, CancellationToken>) for running jobs
- CurrentJob exposes cancellation_requested() / cancellation_notified() for runners
- JobCompletion::Cancelled variant and cancel_running_job dispatcher method
- CancelResult enum (Cancelled | AlreadyCompleted | NotFound) for cancel_job() API
- cancel_timeout config option (default 30s)
Test plan
- test_cancel_pending_job: cancels a pending job, asserts CancelResult::Cancelled
- test_cancel_running_job_succeeds: cancels a running job cooperatively, asserts CancelResult::Cancelled
- test_cancel_already_completed_job_is_idempotent: cancels a completed job, asserts CancelResult::AlreadyCompleted
- nix flake check passes (fmt, clippy --deny warnings, audit, deny)
🤖 Generated with Claude Code
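For reviewers, the three outcomes in the test plan can be modelled with a small in-memory sketch. `JobTable`, `JobState` and the `u64` IDs are hypothetical; the real cancel_job() works against Postgres. Treating a cancelled pending job as finished, and returning NotFound for an unknown ID, are assumptions inferred from the variant names.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobState {
    Pending,
    Running,
    Finished,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CancelResult {
    Cancelled,
    AlreadyCompleted,
    NotFound,
}

/// Hypothetical in-memory model of the jobs table.
#[derive(Default)]
pub struct JobTable {
    states: HashMap<u64, JobState>,
    /// IDs of running jobs that were sent a cancel signal.
    signalled: Vec<u64>,
}

impl JobTable {
    pub fn insert(&mut self, id: u64, state: JobState) {
        self.states.insert(id, state);
    }

    pub fn cancel_job(&mut self, id: u64) -> CancelResult {
        let state = self.states.get(&id).copied();
        match state {
            None => CancelResult::NotFound,
            Some(JobState::Finished) => CancelResult::AlreadyCompleted,
            Some(JobState::Pending) => {
                // Pending jobs are cancelled synchronously (assumed terminal here).
                self.states.insert(id, JobState::Finished);
                CancelResult::Cancelled
            }
            Some(JobState::Running) => {
                // Running jobs are only signalled; the runner stops on its own.
                self.signalled.push(id);
                CancelResult::Cancelled
            }
        }
    }

    pub fn signalled(&self) -> &[u64] {
        &self.signalled
    }
}
```

In this model a running job stays in the Running state after cancel_job returns, which mirrors the cooperative design: the caller gets Cancelled as soon as the request is recorded, not when the runner has actually stopped.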