feat: resilient background job retry & monitoring #401

Open
TallowX92 wants to merge 1 commit into rohitdash08:main from TallowX92:feat/background-job-retry-monitoring

TallowX92 commented Mar 14, 2026

/claim #130

Demo

demo2026-03-14.19-10-07.mp4

Scheduler started, job firing every 60 seconds, executing successfully with next run scheduled.


Summary

This PR adds production-grade background job infrastructure for asynchronous reminder dispatch, with exponential-backoff retry and a live monitoring API.

What's included

Scheduler — app/services/scheduler.py

  • APScheduler BackgroundScheduler with MemoryJobStore
  • Runs process_due_reminders() every 60 seconds
  • Exponential backoff: 5 min → 15 min → 45 min between retries
  • Permanently marks reminders failed=True after 3 retries, captures last_error
  • Auto-disabled in test environment (FLASK_ENV=testing)
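The backoff schedule above triples the delay on each retry. A minimal sketch of that calculation follows; the names `backoff_delay`, `BASE_MINUTES`, `BACKOFF_FACTOR`, and `MAX_MINUTES` are illustrative and not taken from app/services/scheduler.py, and the 45-minute cap is an assumption based on the "capped at max" test case.

```python
from datetime import timedelta

BASE_MINUTES = 5     # first retry delay, from the schedule above
BACKOFF_FACTOR = 3   # 5 -> 15 -> 45 implies tripling each time
MAX_MINUTES = 45     # assumed cap; the PR only says "capped at max"


def backoff_delay(retry_count: int) -> timedelta:
    """Return the wait before the next attempt, given retries so far."""
    if retry_count < 0:
        raise ValueError("retry_count must be non-negative")
    # Work in plain integers so a large retry_count cannot overflow timedelta.
    minutes = min(BASE_MINUTES * BACKOFF_FACTOR ** retry_count, MAX_MINUTES)
    return timedelta(minutes=minutes)
```

With these assumed constants, `backoff_delay(0)` is 5 minutes, `backoff_delay(1)` is 15 minutes, and any count of 2 or more yields 45 minutes.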

New fields on Reminder model

| Field | Type | Purpose |
| --- | --- | --- |
| retry_count | Integer | Attempts so far |
| last_error | String | Last exception message |
| next_retry_at | DateTime | When to next attempt |
| failed | Boolean | Permanently failed flag |
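To illustrate how these four fields might move together when a delivery fails, here is a hedged sketch. `ReminderRetryState` is a plain dataclass standing in for the real Reminder model, and `record_failure` is a hypothetical helper; the exact order of updates in the PR's code may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MAX_RETRIES = 3  # from the PR description


@dataclass
class ReminderRetryState:
    """Stand-in for the four new Reminder columns (not the real ORM model)."""
    retry_count: int = 0
    last_error: Optional[str] = None
    next_retry_at: Optional[datetime] = None
    failed: bool = False


def record_failure(state: ReminderRetryState, exc: Exception, now: datetime) -> None:
    """Update retry bookkeeping after one failed delivery attempt."""
    state.last_error = str(exc)
    if state.retry_count >= MAX_RETRIES:
        # Retries exhausted: mark permanently failed, schedule nothing further.
        state.failed = True
        state.next_retry_at = None
        return
    # Assumed schedule: 5, 15, 45 minutes for retry_count 0, 1, 2.
    state.next_retry_at = now + timedelta(minutes=5 * 3 ** state.retry_count)
    state.retry_count += 1
```

Under this reading, a reminder gets its initial attempt plus three retries, and the fourth failure sets `failed=True`.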

Monitoring endpoints — GET/POST /jobs

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /jobs/status | Scheduler running state + job list |
| GET | /jobs/reminders/stats | Counts: sent / pending / overdue / retrying / permanently_failed |
| POST | /jobs/reminders/run | Manual trigger (admin) |
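The five stats buckets could be derived from reminder rows roughly as sketched below. The bucket definitions are assumptions, since the PR lists only the bucket names; `ReminderRow`, `send_at`, `sent`, `classify`, and `reminder_stats` are hypothetical names, and the real endpoint presumably uses database queries rather than in-memory rows.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable

STAT_KEYS = ("sent", "pending", "overdue", "retrying", "permanently_failed")


@dataclass
class ReminderRow:
    """Hypothetical in-memory view of a reminder; field names are assumed."""
    send_at: datetime
    sent: bool = False
    failed: bool = False
    retry_count: int = 0


def classify(row: ReminderRow, now: datetime) -> str:
    """Assign a reminder to exactly one bucket (assumed definitions)."""
    if row.sent:
        return "sent"
    if row.failed:
        return "permanently_failed"
    if row.retry_count > 0:
        return "retrying"
    # Unsent, never retried: pending if scheduled for the future, else overdue.
    return "pending" if row.send_at > now else "overdue"


def reminder_stats(rows: Iterable[ReminderRow], now: datetime) -> Dict[str, int]:
    """Count reminders per bucket, always returning all five keys."""
    counts = Counter(classify(r, now) for r in rows)
    return {key: counts.get(key, 0) for key in STAT_KEYS}
```

Returning every key even when its count is zero keeps the response shape stable for dashboards and for a "stats shape" style test.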

Tests — tests/test_jobs.py (17 tests)

  • Backoff delta (4): 5 min at retry 0, 15 min at retry 1, 45 min at retry 2, capped at max
  • ProcessDueReminders (8): dispatches due, skips future-dated, increments retry, sets next_retry_at, marks permanently failed, respects retry window, skips sent, skips failed
  • Endpoints (5): status 200, stats shape, manual trigger, auth required

Note: the 5 endpoint tests require Redis because the auth_header fixture stores a refresh token; the rest of the test suite has the same constraint. The 12 core scheduler tests pass without Redis.
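The skip guards exercised by the ProcessDueReminders tests (skips sent, skips failed, skips future-dated, respects retry window) can be summarised as a single predicate. This is a sketch only: `is_due` and the parameter names `send_at` and `sent` are assumptions, and the PR's process_due_reminders() may apply these filters in a query instead.

```python
from datetime import datetime
from typing import Optional


def is_due(
    send_at: datetime,
    now: datetime,
    *,
    sent: bool = False,
    failed: bool = False,
    next_retry_at: Optional[datetime] = None,
) -> bool:
    """Return True if a reminder should be dispatched on this scheduler tick."""
    if sent or failed:
        return False  # already delivered, or permanently failed
    if send_at > now:
        return False  # future-dated reminder
    if next_retry_at is not None and next_retry_at > now:
        return False  # still inside the backoff window
    return True
```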

Implements production-grade background job infrastructure for async
reminder dispatch with exponential-backoff retry and a monitoring API.

Scheduler:
- APScheduler (BackgroundScheduler) initialized in create_app(), skipped
  in TESTING mode to avoid side effects in tests
- process_due_reminders() runs every 60 seconds via interval trigger
- Graceful shutdown registered via atexit

Retry logic:
- Failed deliveries are retried up to MAX_RETRIES (3) attempts
- Exponential backoff: 5min -> 15min -> 45min between attempts
- Reminders exceeding MAX_RETRIES are marked failed=True (no further attempts)
- retry_count, last_error, next_retry_at, failed columns added to Reminder model
- Schema compat ALTERs added for existing PostgreSQL deployments

Monitoring endpoints (JWT-protected):
- GET  /jobs/status              — scheduler health + registered job list
- GET  /jobs/reminders/stats     — sent/pending/retrying/failed counts
- POST /jobs/reminders/run       — manual trigger for ops/debugging

Tests (17 tests, 12 pass without Redis, 5 require Redis for auth):
- Backoff delta unit tests
- process_due_reminders: success, retry, backoff window, max retries, skip guards
- Endpoint auth, stats, manual trigger