Fix APScheduler Deadlocks with Dual Database Job Coordinators #15162

danielaskdd · 2025-10-03T11:28:04Z

Fix APScheduler Deadlocks with Dual Database Job Coordinators

Current Problem

The proxy server periodically deadlocked during startup, primarily because multiple APScheduler jobs attempted to access the database concurrently. The process became unresponsive, stalling inside get_next_fire_time() and effectively blocking all event loops.

Root Cause

Multiple database jobs (update_spend, reset_budget, add_deployment, get_credentials, spend_log_cleanup, check_batch_cost) were scheduled independently, leading to:

Concurrent database access causing deadlocks
Event loop blocking during startup
System instability with multiple workers
APScheduler scheduler getting stuck during initialization

Solution

Implemented a dual-coordinator architecture that separates database jobs by execution frequency while maintaining sequential execution within each coordinator to prevent database conflicts.

Architecture Changes

1. DatabaseJobsCoordinator Class (Lines 3624-3665)

Tracks last execution time for each database job
Ensures jobs respect their configured intervals
Acts as a state tracker shared by both coordinators, centralizing timing management
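
To illustrate the interval tracking described above, here is a minimal sketch. It is not the code from this PR: the class name JobIntervalTracker, its method names, and the injectable clock are all hypothetical, chosen only to show how a shared tracker can decide whether a job is due.

```python
import time
from typing import Callable, Dict


class JobIntervalTracker:
    """Hypothetical stand-in for DatabaseJobsCoordinator: remembers when each job last ran."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        # The clock is injectable so the tracker can be tested without sleeping.
        self._clock = clock
        self._last_run: Dict[str, float] = {}

    def should_run(self, job_name: str, interval_seconds: float) -> bool:
        """Return True if the job has never run or its interval has elapsed."""
        last = self._last_run.get(job_name)
        return last is None or (self._clock() - last) >= interval_seconds

    def mark_run(self, job_name: str) -> None:
        """Record the current time as the job's last execution."""
        self._last_run[job_name] = self._clock()
```

A coordinator that wakes up on a fixed schedule could call should_run() for each job and mark_run() after executing it, so a job with a one-hour interval is skipped on most wake-ups.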

2. High-Frequency Coordinator (Lines 3668-3702)

Runs every 10 seconds and handles time-sensitive configuration updates:

add_deployment: Refresh model configuration from database
get_credentials: Refresh credentials from database

Rationale: These jobs need frequent execution to keep the proxy's model configuration in sync with database changes.

3. Low-Frequency Coordinator (Lines 3705-3844)

Runs every 60 seconds and handles maintenance tasks:

update_spend: Update spend logs (~60s with randomization to avoid worker collision)
reset_budget: Reset budget (3600-7200s with randomization)
spend_log_cleanup: Clean old logs (configurable interval)
check_batch_cost: Check batch costs (configurable interval)

Rationale: These tasks are less time-critical and can run at longer intervals without impacting system responsiveness.
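
The randomized intervals listed above could be drawn once per process, as in the sketch below. The function name build_job_intervals is hypothetical, and the 55-65s jitter window for update_spend is an assumption (the description only says "~60s"); the 3600-7200s range for reset_budget is taken from the list above.

```python
import random
from typing import Dict, Optional


def build_job_intervals(rng: Optional[random.Random] = None) -> Dict[str, int]:
    """Draw each randomized interval once so it stays stable for the process lifetime."""
    rng = rng or random.Random()
    return {
        # Assumed jitter window around 60s; the exact bounds are not given in the PR text.
        "update_spend": rng.randint(55, 65),
        # Range stated in the PR description.
        "reset_budget": rng.randint(3600, 7200),
    }
```

Because each worker draws its own values, workers started at the same moment are unlikely to hit the database in lockstep.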

4. Improved Scheduler Lifecycle Management (Lines 3947-4044)

Assigns _scheduler_instance immediately after creation for proper shutdown handling
Removes the duplicate assignment in _initialize_spend_tracking_background_jobs
Ensures the scheduler can be shut down properly even if initialization fails

Key Benefits

✅ Eliminates Deadlocks: Sequential execution within each coordinator prevents concurrent database access
✅ Prevents Blocking: High-frequency tasks (10s) run independently from slow low-frequency tasks (60s+)
✅ Fault Tolerance: One failed job doesn't stop other jobs in the coordinator
✅ Improved Reliability: Enhanced error handling with descriptive logging and fallback to defaults
✅ Better Performance: Predictable execution patterns with separated high/low frequency paths
✅ Proper Cleanup: Improved scheduler shutdown lifecycle prevents resource leaks

Technical Details

Error Handling Improvements

Added fallback to default values (86400s = 1 day) for invalid spend log cleanup intervals
Each job failure is logged but doesn't prevent other jobs from running
Enhanced logging with ✓/✗ symbols for easy debugging
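
The error-handling behaviour above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the function names, the logger name, and the rule that only positive whole-second values are valid are all hypothetical.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Dict, List, Tuple

logger = logging.getLogger("db_jobs_sketch")  # hypothetical logger name

DEFAULT_CLEANUP_INTERVAL_SECONDS = 86400  # 1 day, the fallback named above


def parse_cleanup_interval(raw: object) -> int:
    """Return a positive interval in seconds, or the 1-day default if the value is invalid."""
    try:
        value = int(raw)  # type: ignore[call-overload]
    except (TypeError, ValueError):
        value = 0
    if value <= 0:
        logger.warning("Invalid cleanup interval %r, using default", raw)
        return DEFAULT_CLEANUP_INTERVAL_SECONDS
    return value


async def run_jobs_sequentially(
    jobs: List[Tuple[str, Callable[[], Awaitable[None]]]],
) -> Dict[str, bool]:
    """Await each job in order; a failure is logged and the loop continues."""
    results: Dict[str, bool] = {}
    for name, job in jobs:
        try:
            await job()
            logger.info("✓ %s completed", name)
            results[name] = True
        except Exception:
            # One failed job must not prevent the remaining jobs from running.
            logger.exception("✗ %s failed", name)
            results[name] = False
    return results
```

Awaiting the jobs one after another, rather than gathering them, is what keeps database access sequential within a coordinator.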

Execution Flow

Startup:
1. Create AsyncIOScheduler instance
2. Assign to global _scheduler_instance immediately
3. Load models/credentials if store_model_in_db=True
4. Schedule high-frequency coordinator (10s interval)
5. Schedule low-frequency coordinator (60s interval)
6. Schedule other jobs (alerting, reporting, etc.)
7. Start scheduler

Shutdown:
1. Stop scheduler (wait=False to avoid hanging)
2. Disconnect database
3. Close other connections
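
The ordering in the startup and shutdown steps above can be sketched like this. The helper names start_scheduler and shutdown_all, and the callables passed to them, are hypothetical; the scheduler is assumed to expose start(), shutdown(wait=...), and a running attribute, as APScheduler 3.x schedulers do.

```python
from typing import Any, Callable, Optional

_scheduler_instance: Optional[Any] = None  # mirrors the global named in the PR


def start_scheduler(
    make_scheduler: Callable[[], Any],
    register_jobs: Callable[[Any], None],
) -> Any:
    """Create the scheduler, publish it globally right away, then register jobs and start."""
    global _scheduler_instance
    scheduler = make_scheduler()
    # Assign before anything can fail, so shutdown can always find the instance.
    _scheduler_instance = scheduler
    register_jobs(scheduler)
    scheduler.start()
    return scheduler


def shutdown_all(disconnect_db: Callable[[], None]) -> None:
    """Stop the scheduler without waiting for running jobs, then disconnect the database."""
    global _scheduler_instance
    if _scheduler_instance is not None:
        if getattr(_scheduler_instance, "running", False):
            _scheduler_instance.shutdown(wait=False)
        _scheduler_instance = None
    disconnect_db()
```

Stopping the scheduler first means no job can be dispatched against a database connection that is already closed.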

Backward Compatibility

Maintains all existing job functionality
Preserves randomization for budget reset and spend updates
All non-database jobs (alerting, reporting) continue to run independently
No configuration changes required
No database schema changes needed

Code Changes Summary

Modified Files

litellm/proxy/proxy_server.py

Lines Changed

Lines 3624-3665: Added DatabaseJobsCoordinator class
Lines 3668-3702: Implemented high_frequency_database_jobs_coordinator
Lines 3705-3844: Implemented low_frequency_database_jobs_coordinator
Lines 3947-4044: Refactored initialize_scheduled_background_jobs to use dual coordinators
Lines 3717-3732: Improved error handling for invalid configuration values
Lines 4049-4051: Improved scheduler lifecycle management

Why is this change important?

This change directly addresses a potential source of critical database errors (deadlocks), which can freeze proxy operations and require a manual restart. It hardens the proxy's stability, making it more reliable for production deployments.

vercel · 2025-10-03T11:28:11Z

@danielaskdd is attempting to deploy a commit to the CLERKIEAI Team on Vercel.

A member of the Team first needs to authorize it.

danielaskdd · 2025-10-05T13:31:03Z

Add new commit:

Optimize database job scheduling to eliminate "maximum instances reached" warnings

Separate high-frequency (10s) and low-frequency (30min) tasks
Configure misfire_grace_time: 5s for high-freq, 20min for low-freq
Set coalesce=False to skip missed runs instead of queuing
Eliminate "maximum instances reached" warnings

This reduces unnecessary scheduling overhead for long-running tasks
while maintaining proper execution timing for all database operations.
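
For readers unfamiliar with these options, the sketch below shows how the settings from this commit might be passed to APScheduler's add_job(). The function name and the job ids are hypothetical, and the exact call in the PR may differ; misfire_grace_time, coalesce, and the interval trigger arguments are standard add_job() parameters in APScheduler 3.x.

```python
from typing import Any, Awaitable, Callable

AsyncJob = Callable[[], Awaitable[None]]


def register_db_coordinators(
    scheduler: Any,
    high_freq_job: AsyncJob,
    low_freq_job: AsyncJob,
) -> None:
    """Register both coordinators; `scheduler` is assumed to be an APScheduler scheduler."""
    scheduler.add_job(
        high_freq_job,
        "interval",
        seconds=10,
        misfire_grace_time=5,  # seconds a run may be late before it is dropped
        coalesce=False,
        id="high_frequency_db_jobs",  # hypothetical id
    )
    scheduler.add_job(
        low_freq_job,
        "interval",
        minutes=30,
        misfire_grace_time=20 * 60,  # 20 minutes, expressed in seconds
        coalesce=False,
        id="low_frequency_db_jobs",  # hypothetical id
    )
```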

danielaskdd · 2025-10-07T14:09:17Z

This PR has been running in my personal production environment for four days without any issues. Most importantly, the service has been restarted more than 20 times without freezing once, whereas previously approximately one-third of startups resulted in a hang.

danielaskdd · 2025-10-14T09:19:26Z

@krrishdholakia This PR has been under test and running in my production environment for over a week with no issues observed. Do you recommend any additional testing or validation?

danielaskdd added 3 commits October 3, 2025 17:36

Fix APScheduler shutdown to prevent hanging during proxy shutdown

9af5298

• Track scheduler instance globally
• Shutdown scheduler before database
• Use wait=False for non-blocking shutdown

Refactor database jobs into coordinated high/low frequency schedulers

f952084

* Prevent concurrent DB access deadlocks
* Split 10s vs 60s+ task frequencies
* Add unified job state tracking
* Improve error isolation per job

Move scheduler instance assignment to prevent lifecycle issues

c14a8ab

* Move global assignment earlier
* Fix scheduler initialization order
* Improve lifecycle management

Add noqa comment to suppress complexity warning

e06f3dc

* Add PLR0915 noqa comment

krrishdholakia reviewed Oct 3, 2025

danielaskdd added 5 commits October 4, 2025 17:05

Optimize database job coordinator to pre-calculate intervals at init

3685c38

- Pre-calc random intervals once at init
- Increased scheduler frequency of low frequency db job for more precise triggering of jobs
- Add comprehensive test coverage

Fix async mock for embedding test and improve assertion handling

9ec164c

- Remove unused mock_patch_aembedding function
- Replace decorator with inline AsyncMock
- Use kwargs for parameter verification
- Check mock was called before assertions

Fix linting

0d674aa

fix(tests): Stabilize proxy embedding test by mocking logging hooks

f2251eb

- Add proxy_logging_obj mocking to prevent interference between unit tests
- Add premium_user mock set to True to bypass enterprise validation

Merge branch 'main' into fix-sigint-exit

e1d019a

danielaskdd added 2 commits October 11, 2025 17:35

Merge branch 'main' into fix-sigint-exit

d429c70

Merge branch 'main' into fix-sigint-exit

74c143e
