Commit 05ad870
Support ML async job cancellation, fail jobs on redis errors (#1162)
* merge
* fix: PSv2 follow-up fixes from integration tests (#1135)
* fix: prevent NATS connection flooding and stale job task fetching
- Add connect_timeout=5, allow_reconnect=False to NATS connections to
prevent leaked reconnection loops from blocking Django's event loop
- Guard /tasks endpoint against terminal-status jobs (return empty tasks
instead of attempting NATS reserve)
- IncompleteJobFilter now excludes jobs by top-level status in addition
to progress JSON stages
- Add stale worker cleanup to integration test script
Found during PSv2 integration testing where stale ADC workers with
default DataLoader parallelism overwhelmed the single uvicorn worker
thread by flooding /tasks with concurrent NATS reserve requests.
Co-Authored-By: Claude <noreply@anthropic.com>
* docs: PSv2 integration test session notes and NATS flooding findings
Session notes from 2026-02-16 integration test including root cause
analysis of stale worker task competition and NATS connection issues.
Findings doc tracks applied fixes and remaining TODOs with priorities.
Co-Authored-By: Claude <noreply@anthropic.com>
* docs: update session notes with successful test run #3
PSv2 integration test passed end-to-end (job 1380, 20/20 images).
Identified ack_wait=300s as cause of ~5min idle time when GPU
processes race for NATS tasks.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: batch NATS task fetch to prevent HTTP timeouts
Replace N×1 reserve_task() calls with single reserve_tasks() batch
fetch. The previous implementation created a new pull subscription per
message (320 NATS round trips for batch=64), causing the /tasks endpoint
to exceed HTTP client timeouts. The new approach uses one psub.fetch()
call for the entire batch.
Co-Authored-By: Claude <noreply@anthropic.com>
* docs: add next session prompt
* feat: add pipeline__slug__in filter for multi-pipeline job queries
Workers that handle multiple pipelines can now fetch jobs for all of them
in a single request: ?pipeline__slug__in=slug1,slug2
Co-Authored-By: Claude <noreply@anthropic.com>
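As a rough illustration of the query format above, the sketch below parses a comma-separated `pipeline__slug__in` value and filters an in-memory list. The `/jobs/` path, the job dicts, and the `filter_jobs_by_slugs` helper are hypothetical; the real filter presumably runs against the Django ORM rather than a Python list.

```python
from urllib.parse import parse_qs, urlparse


def filter_jobs_by_slugs(url, jobs):
    """Keep jobs whose pipeline slug is listed in ?pipeline__slug__in=a,b (hypothetical helper)."""
    query = parse_qs(urlparse(url).query)
    raw = query.get("pipeline__slug__in", [""])[0]
    slugs = {s for s in raw.split(",") if s}
    if not slugs:
        return list(jobs)  # no filter supplied: return everything
    return [job for job in jobs if job["pipeline_slug"] in slugs]


# Made-up data for illustration only.
jobs = [
    {"id": 1, "pipeline_slug": "slug1"},
    {"id": 2, "pipeline_slug": "slug2"},
    {"id": 3, "pipeline_slug": "slug3"},
]
matched = filter_jobs_by_slugs("/jobs/?pipeline__slug__in=slug1,slug2", jobs)
```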
* chore: remove local-only docs and scripts from branch
These files are session notes, planning docs, and test scripts that
should stay local rather than be part of the PR.
Co-Authored-By: Claude <noreply@anthropic.com>
* feat: set job dispatch_mode at creation time based on project feature flags
ML jobs with a pipeline now get dispatch_mode set during setup() instead
of waiting until run() is called by the Celery worker. This lets the UI
show the correct mode immediately after job creation.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: add timeouts to all JetStream operations and restore reconnect policy
Add NATS_JETSTREAM_TIMEOUT (10s) to all JetStream metadata operations
via asyncio.wait_for() so a hung NATS connection fails fast instead of
blocking the caller's thread indefinitely. Also restore the intended
reconnect policy (2 attempts, 1s wait) that was lost in a prior force push.
Co-Authored-By: Claude <noreply@anthropic.com>
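A minimal sketch of the fail-fast pattern described above, assuming only the standard library. The `with_jetstream_timeout` helper and the `hung_stream_info` coroutine are hypothetical stand-ins for the real JetStream calls; only the 10s value and the use of `asyncio.wait_for()` come from the commit message.

```python
import asyncio

NATS_JETSTREAM_TIMEOUT = 10  # seconds, as stated in the commit message


async def with_jetstream_timeout(coro, timeout=NATS_JETSTREAM_TIMEOUT):
    """Await a coroutine, raising asyncio.TimeoutError if it takes too long."""
    return await asyncio.wait_for(coro, timeout=timeout)


async def hung_stream_info():
    # Hypothetical stand-in for a metadata call on a hung connection.
    await asyncio.sleep(3600)


async def demo():
    try:
        # A short timeout keeps the demo fast; production would use the default.
        await with_jetstream_timeout(hung_stream_info(), timeout=0.05)
    except asyncio.TimeoutError:
        return "timed out"
    return "ok"


result = asyncio.run(demo())
```

Because `asyncio.wait_for()` cancels the wrapped coroutine on timeout, the caller gets control back instead of waiting on the hung call.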
* fix: propagate NATS timeouts as 503 instead of swallowing them
asyncio.TimeoutError from _ensure_stream() and _ensure_consumer() was
caught by the broad `except Exception` in reserve_tasks(), silently
returning [] and making NATS outages indistinguishable from empty queues.
Workers would then poll immediately, recreating the flooding problem.
- Add explicit `except asyncio.TimeoutError: raise` in reserve_tasks()
- Catch TimeoutError and OSError in the /tasks view, return 503
- Restore allow_reconnect=False (fail-fast on connection issues)
- Add return type annotation to get_connection()
Co-Authored-By: Claude <noreply@anthropic.com>
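The control flow described above can be sketched as follows. The function names and the `(status, tasks)` return shape are hypothetical simplifications of the real view; the point is only that the timeout is re-raised ahead of the broad `except Exception`, so the caller can tell an outage from an empty queue.

```python
import asyncio


async def reserve_tasks_sketch(fetch):
    """Return fetched tasks; let timeouts escape instead of hiding them."""
    try:
        return await fetch()
    except asyncio.TimeoutError:
        raise  # must come before the broad handler below
    except Exception:
        return []  # other errors are still treated as "no tasks"


def tasks_view_sketch(fetch):
    """Hypothetical view: map timeouts and OS-level errors to a 503."""
    try:
        tasks = asyncio.run(reserve_tasks_sketch(fetch))
    except (asyncio.TimeoutError, OSError):
        return 503, []
    return 200, tasks


async def nats_down():
    raise asyncio.TimeoutError()


async def empty_queue():
    return []
```

With this shape, `tasks_view_sketch(nats_down)` yields a 503 while `tasks_view_sketch(empty_queue)` yields a 200 with an empty list, so a polling worker can back off only in the first case.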
* fix: address review comments (log level, fetch timeout, docstring)
- Downgrade reserve_tasks log to DEBUG when zero tasks reserved (avoid
log spam from frequent polling)
- Pass timeout=0.5 from /tasks endpoint to avoid blocking the worker
for 5s on empty queues
- Fix docstring examples using string 'job123' for int-typed job_id
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: catch nats.errors.Error in /tasks endpoint for proper 503 responses
NoServersError, ConnectionClosedError, and other NATS exceptions inherit
from nats.errors.Error (not OSError), so they escaped the handler and
returned 500 instead of 503.
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
* PSv2: Improve task fetching & web worker concurrency configuration (#1142)
* feat: configurable NATS tuning and gunicorn worker management
Rebase onto main after #1135 merge. Keep only the additions unique to
this branch:
- Make TASK_TTR configurable via NATS_TASK_TTR Django setting (default 30s)
- Make max_ack_pending configurable via NATS_MAX_ACK_PENDING setting (default 100)
- Local dev: switch to gunicorn+UvicornWorker by default for production
parity, with USE_UVICORN=1 escape hatch for raw uvicorn
- Production: auto-detect WEB_CONCURRENCY from CPU cores (capped at 8)
when not explicitly set in the environment
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: address PR review comments
- Fix max_ack_pending falsy-zero guard (use `is not None` instead of `or`)
- Update TaskQueueManager docstring with Args section
- Simplify production WEB_CONCURRENCY fallback (just use nproc)
Co-Authored-By: Claude <noreply@anthropic.com>
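The falsy-zero issue mentioned above can be shown in isolation. Both helper names are hypothetical, and the default of 100 is taken from the `NATS_MAX_ACK_PENDING` description earlier in this message; the actual constructor may look different.

```python
DEFAULT_MAX_ACK_PENDING = 100


def resolve_with_or(value):
    # Buggy pattern: an explicit 0 is falsy, so it is replaced by the default.
    return value or DEFAULT_MAX_ACK_PENDING


def resolve_with_is_not_none(value):
    # Fixed pattern: only a missing value (None) falls back to the default.
    return value if value is not None else DEFAULT_MAX_ACK_PENDING
```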
---------
Co-authored-by: Michael Bunsen <notbot@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
* fix: include pipeline_slug in MinimalJobSerializer (#1148)
* fix: include pipeline_slug in MinimalJobSerializer (ids_only response)
The ADC worker fetches jobs with ids_only=1 and expects pipeline_slug in
the response to know which pipeline to run. Without it, Pydantic
validation fails and the worker skips the job.
Co-Authored-By: Claude <noreply@anthropic.com>
* Update ami/jobs/serializers.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Avoid redis based locking by using atomic updates
* Test concurrency
* Increase max ack pending
* update comment
* CR feedback
* Cancel jobs if Redis state is missing
* Add chaos monkey
* CR feedback
* CR 2
* fix: OrderedEnum comparisons now override str MRO in subclasses
JobState(str, OrderedEnum) was using str's lexicographic __gt__
instead of OrderedEnum's definition-order __gt__, because str
comes first in the MRO. This caused max(FAILURE, SUCCESS) to
return SUCCESS, silently discarding failure state in concurrent
job progress updates.
Fix: __init_subclass__ injects comparison methods directly onto
each subclass so they take MRO priority over data-type mixins.
Also preserve FAILURE status through the progress ternary when
progress < 1.0, so early failure detection isn't overwritten.
Co-Authored-By: Claude <noreply@anthropic.com>
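A self-contained sketch of the MRO problem and the `__init_subclass__` fix described above. This is an illustration, not the project's implementation: `NaiveState`, `DemoState`, the helper method names, and the member order are assumptions chosen so that FAILURE is defined after SUCCESS.

```python
from enum import Enum


class NaiveState(str, Enum):
    # str comes first in the MRO, so comparisons are lexicographic.
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"


class OrderedEnum(Enum):
    """Compare members by definition order (sketch only)."""

    def _rank(self):
        return list(type(self)).index(self)

    def _lt(self, other):
        return self._rank() < other._rank() if type(other) is type(self) else NotImplemented

    def _le(self, other):
        return self._rank() <= other._rank() if type(other) is type(self) else NotImplemented

    def _gt(self, other):
        return self._rank() > other._rank() if type(other) is type(self) else NotImplemented

    def _ge(self, other):
        return self._rank() >= other._rank() if type(other) is type(self) else NotImplemented

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Store the comparisons in the subclass's own namespace so they are
        # found before str.__lt__ and friends during attribute lookup.
        cls.__lt__ = OrderedEnum._lt
        cls.__le__ = OrderedEnum._le
        cls.__gt__ = OrderedEnum._gt
        cls.__ge__ = OrderedEnum._ge


class DemoState(str, OrderedEnum):
    # Assumed order for illustration: later members rank higher.
    CREATED = "CREATED"
    STARTED = "STARTED"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"


naive_max = max(NaiveState.FAILURE, NaiveState.SUCCESS)
ordered_max = max(DemoState.FAILURE, DemoState.SUCCESS)
```

Here `naive_max` is `NaiveState.SUCCESS` because `"SUCCESS" > "FAILURE"` as strings, while `ordered_max` is `DemoState.FAILURE` because the injected methods compare by definition order.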
* fix: correct misleading error log about NATS redelivery
The NATS message is ACK'd at line 145, before update_state() and
_update_job_progress(). If either of those raises, the except
block was logging "NATS will redeliver" when it won't.
Co-Authored-By: Claude <noreply@anthropic.com>
* Use job.logger
* Use job.logger
* Integrate cancellation support
* merge, update tests
* Remove pause support in monkey
* fix: cancel async jobs by cleaning up NATS/Redis and stopping task delivery
For async_api jobs, the Celery task completes after queuing images to NATS,
so task.revoke() has no effect. The worker kept pulling tasks via the /tasks
endpoint because it only checked final_states(), not CANCELING.
- Add JobState.active_states() (STARTED, RETRY) for positive task-serving check
- /tasks endpoint returns empty unless job is in active_states()
- Job.cancel() for async_api jobs: clean up NATS/Redis, then set REVOKED
Co-Authored-By: Claude <noreply@anthropic.com>
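The difference between the old negative check and the new positive check can be sketched as below. `DemoJobState` is a hypothetical cut-down enum; the `active_states()` membership (STARTED, RETRY) comes from the commit message, while the `final_states()` membership shown here is an assumption made for illustration.

```python
from enum import Enum


class DemoJobState(str, Enum):
    STARTED = "STARTED"
    RETRY = "RETRY"
    CANCELING = "CANCELING"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"
    REVOKED = "REVOKED"

    @classmethod
    def active_states(cls):
        return [cls.STARTED, cls.RETRY]

    @classmethod
    def final_states(cls):
        # Assumed membership, for illustration only.
        return [cls.SUCCESS, cls.FAILURE, cls.REVOKED]


def serves_tasks_old(status):
    # Negative check: anything not final still receives tasks.
    return status not in DemoJobState.final_states()


def serves_tasks_new(status):
    # Positive check: only explicitly active jobs receive tasks.
    return status in DemoJobState.active_states()
```

Under the old check a CANCELING job still qualifies for task delivery; under the new one it does not, which matches the behavior the fix is after.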
* fix(ui): hide Retry button while job is in CANCELING state
canRetry now excludes CANCELING so the Retry button stays hidden
during the drain period, matching the backend's transitional state.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: downgrade Redis-missing log to warning for canceled jobs
When a job is canceled, NATS/Redis cleanup runs before in-flight results
finish processing. The resulting "Redis state missing" message is expected,
not an error.
Co-Authored-By: Claude <noreply@anthropic.com>
* docs: add async job monitoring reference
Covers all monitoring points for NATS async jobs: Django ORM, REST API,
tasks endpoint, NATS consumer state, Redis counters, Docker logs, and
AMI worker logs. Linked from CLAUDE.md and the test_ml_job_e2e command.
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: update tests for active_states() guard on /tasks endpoint
Tests need to set job status to STARTED since the /tasks endpoint
now only serves tasks for jobs in active_states() (STARTED, RETRY).
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: improve job cancel ordering, fail status sync, and log handler safety
- Reorder cancel(): revoke Celery task before cleaning up async resources
to prevent a theoretical race where a worker recreates state after cleanup
- Remove redundant self.save() after task.revoke() (no fields changed)
- Use update_status() in _fail_job() to keep progress.summary.status in
sync with job.status
- Wrap entire log handler emit() DB sequence (refresh_from_db + mutations +
save) in try/except so a DB failure during logging cannot crash callers
Co-Authored-By: Claude <noreply@anthropic.com>
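The log handler safety item above can be sketched with the standard `logging` module. `SafeJobLogHandler`, its `persist` callable, and the `dropped` counter are hypothetical; the real handler wraps a `refresh_from_db` + mutations + `save` sequence rather than a plain callable.

```python
import logging


class SafeJobLogHandler(logging.Handler):
    """Hypothetical handler: persists log lines and never raises from emit()."""

    def __init__(self, persist):
        super().__init__()
        self.persist = persist  # stand-in for the DB write sequence
        self.dropped = 0

    def emit(self, record):
        try:
            self.persist(self.format(record))
        except Exception:
            # A storage failure must not propagate into the code that logged.
            self.dropped += 1


def broken_persist(line):
    raise RuntimeError("database unavailable")


logger = logging.getLogger("job-demo")
logger.propagate = False  # keep the demo output quiet
handler = SafeJobLogHandler(broken_persist)
logger.addHandler(handler)
logger.warning("progress update")  # does not raise despite the failing write
```

Without the try/except, an exception raised inside `emit()` would surface at the `logger.warning(...)` call site, which is the failure mode the fix guards against.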
* fix: restore timeout on _stream_exists and use settings for NATS_URL
- Add asyncio.wait_for() wrapper to _stream_exists() stream_info call,
accidentally dropped during refactor from _ensure_stream
- Read NATS_URL from Django settings in chaos_monkey command instead of
hardcoding, consistent with TaskQueueManager
Co-Authored-By: Claude <noreply@anthropic.com>
* fix(ui): block retry button while job is in RETRY state
RETRY is an active processing state; allowing another retry while one
is already running could cause duplicate execution.
Co-Authored-By: Claude <noreply@anthropic.com>
* docs: clarify _stream_exists timeout propagation design
Add docstring explaining that TimeoutError is deliberately not caught —
an unreachable NATS server should be a hard failure, not a "stream
missing" false negative. Multiple reviewers questioned this behavior.
Co-Authored-By: Claude <noreply@anthropic.com>
* docs: add language tag to fenced code block in monitoring guide
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Carlos Garcia Jurado Suarez <carlos@irreverentlabs.com>
Co-authored-by: Michael Bunsen <notbot@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
1 parent 12a3c70 commit 05ad870
12 files changed
Lines changed: 430 additions & 54 deletions
File tree
- .agents
- ami
- jobs
- management/commands
- ml/orchestration
- tests
- docs/claude/reference
- ui/src/data-services/models