
feat: resilient background job retry & monitoring (#130) #389

Closed

sinatragian wants to merge 7 commits into rohitdash08:main from sinatragian:feat/resilient-job-retry-monitoring

Conversation


sinatragian commented Mar 13, 2026

Adds resilient background job retry & monitoring to fix #130.

The reminder runner now retries failed reminders up to a configurable max, records every execution in a JobRun audit table, and exposes monitoring endpoints. Race conditions in multi-worker deployments are prevented via .with_for_update(skip_locked=True).

Backend

  • POST /reminders/run — dispatches due reminders with retry semantics; returns {processed, failed, retried, status}
  • GET /reminders/job-runs — lists recent job execution records (JWT-protected)
  • JobRun model: id, job_name, status, started_at, finished_at, processed, failed, retried, error_message
  • Migration 006_job_runs.sql
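The status field returned by POST /reminders/run (and surfaced as colored badges in the UI) takes the values success/partial/failed/no_work. A hypothetical sketch of how it could be derived from the run counters; the PR's actual derive_status logic may differ, and the assumption here is that processed counts successful dispatches only:

```python
def derive_status(processed: int, failed: int) -> str:
    """Map a run's counters to one of the four statuses shown in the Job Monitor.

    Assumption (not confirmed by the PR): `processed` counts successful
    dispatches and `failed` counts reminders that errored this run.
    """
    if processed == 0 and failed == 0:
        return "no_work"   # nothing was due when the runner fired
    if failed == 0:
        return "success"   # every due reminder dispatched cleanly
    if processed > 0:
        return "partial"   # a mix of successes and failures
    return "failed"        # every due reminder errored
```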

Frontend

  • /jobs — Job Monitor page: live table of job execution history with colored status badges (success/partial/failed/no_work), auto-refreshes every 30 s
  • "Run Due Now" button: triggers POST /reminders/run and shows a toast with {processed, failed, retried} counts
  • Wired into the app navbar and React Router
  • Files: app/src/api/reminders.ts (JobRun type + listJobRuns), app/src/pages/JobMonitor.tsx, route in App.tsx, nav link in Navbar.tsx

Design note

Extended the existing reminder runner rather than building a parallel job system to minimize surface area and stay backward-compatible with existing schedulers. .with_for_update(skip_locked=True) at job_runner.py:73 prevents double-dispatch in multi-worker deployments.

Closes #130

- Add retry fields to Reminder model: retry_count, max_retries,
  next_retry_at, last_error, failed_permanently
- Add JobRun model for job execution audit log
- Implement exponential backoff retry (2^n minutes, capped at 60 min)
  in new services/job_runner.py
- POST /reminders/run now returns full stats dict and uses job_runner
- Add GET /reminders/job-runs monitoring endpoint
- Add SQL migration 002_resilient_job_retry.sql (IF NOT EXISTS safe)
- Add 16 unit/integration tests in tests/test_job_runner.py
- Add docs/resilient-job-retry.md

Closes rohitdash08#130
sinatragian and gianpaolo-oc added 3 commits March 16, 2026 03:56
…var, derive_status logic, test email uniqueness
…User, limit type=int, no_work status, job-runs auth filter, cleanup annotations
sinatragian (Author) commented:

Design rationale — extending the existing reminder runner vs. a parallel job system

This PR extends the existing reminder runner (/reminders/run) rather than introducing a parallel job scheduler for two reasons:

  1. Minimal surface area — a separate job system would require its own queue, worker process, and failure model. Reusing the reminder runner keeps the retry and audit logic in one place and stays backward-compatible with any existing scheduler (cron, Celery beat, etc.) that already calls the endpoint.

  2. Race-condition safety in multi-worker deployments: job_runner.py line 73 uses .with_for_update(skip_locked=True) so that concurrent workers each claim a disjoint set of due reminders. Workers that cannot acquire the lock skip those rows rather than blocking, which prevents double-dispatch without a distributed lock service.

The JobRun audit table records every execution (status, duration, retry count) so ops teams can monitor failure rates without grepping logs.
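The claim query described in point 2 can be sketched with SQLAlchemy. The Reminder model and column names below are illustrative assumptions, not the PR's actual schema; compiling against the PostgreSQL dialect shows the locking clause that makes concurrent workers skip each other's rows:

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, select
from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Reminder(Base):  # minimal stand-in for the real model
    __tablename__ = "reminders"
    id = Column(Integer, primary_key=True)
    due_at = Column(DateTime)

# Each worker runs this inside its own transaction. Rows already locked by
# another worker are skipped instead of blocked on, so a due reminder is
# claimed by exactly one worker.
stmt = (
    select(Reminder)
    .where(Reminder.due_at <= datetime.utcnow())
    .with_for_update(skip_locked=True)
    .limit(50)
)

sql = str(stmt.compile(dialect=postgresql.dialect()))
print(sql)  # rendered SQL includes: FOR UPDATE SKIP LOCKED
```

Note that SKIP LOCKED only has an effect on databases that support it (e.g. PostgreSQL 9.5+, MySQL 8.0+); on SQLite the clause is not available.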

Gianpaolo and others added 3 commits March 17, 2026 02:50
Also adds JobMonitor page with live refresh and run-now button.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
