
feat: resilient background job retry & monitoring (#130) #389

Closed

sinatragian wants to merge 7 commits into rohitdash08:main from sinatragian:feat/resilient-job-retry-monitoring

Conversation


sinatragian commented Mar 13, 2026

Adds resilient background job retry & monitoring to fix #130.

The reminder runner now retries failed reminders up to a configurable max, records every execution in a JobRun audit table, and exposes monitoring endpoints. Race conditions in multi-worker deployments are prevented via .with_for_update(skip_locked=True).

Backend

  • POST /reminders/run — dispatches due reminders with retry semantics; returns {processed, failed, retried, status}
  • GET /reminders/job-runs — lists recent job execution records (JWT-protected)
  • JobRun model: id, job_name, status, started_at, finished_at, processed, failed, retried, error_message
  • Migration 006_job_runs.sql
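The status field returned by POST /reminders/run (and surfaced as colored badges in the UI) takes the values success/partial/failed/no_work. A hypothetical sketch of how it could be derived from the run counters; the PR's actual derive_status logic may differ, and the assumption here is that processed counts successful dispatches only:

```python
def derive_status(processed: int, failed: int) -> str:
    """Map a run's counters to one of the four statuses shown in the Job Monitor.

    Assumption (not confirmed by the PR): `processed` counts successful
    dispatches and `failed` counts reminders that errored this run.
    """
    if processed == 0 and failed == 0:
        return "no_work"   # nothing was due when the runner fired
    if failed == 0:
        return "success"   # every due reminder dispatched cleanly
    if processed > 0:
        return "partial"   # a mix of successes and failures
    return "failed"        # every due reminder errored
```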

Frontend

  • /jobs — Job Monitor page: live table of job execution history with colored status badges (success/partial/failed/no_work), auto-refreshes every 30 s
  • "Run Due Now" button: triggers POST /reminders/run and shows a toast with {processed, failed, retried} counts
  • Wired into the app navbar and React Router
  • Files: app/src/api/reminders.ts (JobRun type + listJobRuns), app/src/pages/JobMonitor.tsx, route in App.tsx, nav link in Navbar.tsx

Design note

Extended the existing reminder runner rather than building a parallel job system to minimize surface area and stay backward-compatible with existing schedulers. .with_for_update(skip_locked=True) at job_runner.py:73 prevents double-dispatch in multi-worker deployments.

Closes #130

- Add retry fields to Reminder model: retry_count, max_retries,
  next_retry_at, last_error, failed_permanently
- Add JobRun model for job execution audit log
- Implement exponential backoff retry (2^n minutes, capped at 60 min)
  in new services/job_runner.py
- POST /reminders/run now returns full stats dict and uses job_runner
- Add GET /reminders/job-runs monitoring endpoint
- Add SQL migration 002_resilient_job_retry.sql (IF NOT EXISTS safe)
- Add 16 unit/integration tests in tests/test_job_runner.py
- Add docs/resilient-job-retry.md

Closes rohitdash08#130
sinatragian and gianpaolo-oc added 3 commits March 16, 2026 03:56
…var, derive_status logic, test email uniqueness
…User, limit type=int, no_work status, job-runs auth filter, cleanup annotations
sinatragian (Author) commented:

Design rationale — extending the existing reminder runner vs. a parallel job system

This PR extends the existing reminder runner (/reminders/run) rather than introducing a parallel job scheduler for two reasons:

  1. Minimal surface area — a separate job system would require its own queue, worker process, and failure model. Reusing the reminder runner keeps the retry and audit logic in one place and stays backward-compatible with any existing scheduler (cron, Celery beat, etc.) that already calls the endpoint.

  2. Race-condition safety in multi-worker deployments: job_runner.py line 73 uses .with_for_update(skip_locked=True) so that concurrent workers each claim a disjoint set of due reminders. Workers that cannot acquire the lock skip those rows rather than blocking, which prevents double-dispatch without a distributed lock service.

The JobRun audit table records every execution (status, duration, retry count) so ops teams can monitor failure rates without grepping logs.
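The claim query described in point 2 can be sketched with SQLAlchemy. The Reminder model and column names below are illustrative assumptions, not the PR's actual schema; compiling against the PostgreSQL dialect shows the locking clause that makes concurrent workers skip each other's rows:

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, select
from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Reminder(Base):  # minimal stand-in for the real model
    __tablename__ = "reminders"
    id = Column(Integer, primary_key=True)
    due_at = Column(DateTime)

# Each worker runs this inside its own transaction. Rows already locked by
# another worker are skipped instead of blocked on, so a due reminder is
# claimed by exactly one worker.
stmt = (
    select(Reminder)
    .where(Reminder.due_at <= datetime.utcnow())
    .with_for_update(skip_locked=True)
    .limit(50)
)

sql = str(stmt.compile(dialect=postgresql.dialect()))
print(sql)  # rendered SQL includes: FOR UPDATE SKIP LOCKED
```

Note that SKIP LOCKED only has an effect on databases that support it (e.g. PostgreSQL 9.5+, MySQL 8.0+); on SQLite the clause is not available.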

Gianpaolo and others added 3 commits March 17, 2026 02:50
Also adds JobMonitor page with live refresh and run-now button.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
