Fix APScheduler Deadlocks with Dual Database Job Coordinators #15162
• Track scheduler instance globally • Shutdown scheduler before database • Use wait=False for non-blocking shutdown
* Prevent concurrent DB access deadlocks * Split 10s vs 60s+ task frequencies * Add unified job state tracking * Improve error isolation per job
* Move global assignment earlier * Fix scheduler initialization order * Improve lifecycle management
* Add PLR0915 noqa comment
- Pre-calc random intervals once at init - Increased scheduler frequency of low frequency db job for more precise triggering of jobs - Add comprehensive test coverage
- Remove unused mock_patch_aembedding function - Replace decorator with inline AsyncMock - Use kwargs for parameter verification - Check mock was called before assertions
- Add proxy_logging_obj mocking to prevent interference between unit tests - Add premium_user mock set to True to bypass enterprise validation
…stances reached" warnings - Separate high-frequency (10s) and low-frequency (30min) tasks - Configure misfire_grace_time: 5s for high-freq, 20min for low-freq - Set coalesce=False to skip missed runs instead of queuing - Eliminate "maximum instances reached" warnings This reduces unnecessary scheduling overhead for long-running tasks while maintaining proper execution timing for all database operations.
Add new commit: Optimize database job scheduling to eliminate "maximum instances reached" warnings
This reduces unnecessary scheduling overhead for long-running tasks.
This PR has been running in my personal production environment for four days without any issues. Most importantly, the service has gone through more than 20 restarts without freezing once, whereas previously roughly one-third of startups would hang.
Fix APScheduler Deadlocks with Dual Database Job Coordinators
Current Problem
The proxy server experienced periodic deadlocks during startup, primarily due to concurrent database access attempts by multiple APScheduler jobs. This led to system unresponsiveness, specifically a stall within the get_next_fire_time() function, effectively blocking all event loops.
Root Cause
Multiple database jobs (update_spend, reset_budget, add_deployment, get_credentials, spend_log_cleanup, check_batch_cost) were scheduled independently, leading to:
Solution
Implemented a dual-coordinator architecture that separates database jobs by execution frequency while maintaining sequential execution within each coordinator to prevent database conflicts.
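The sequential-execution idea can be illustrated with a minimal sketch. This is not the code from the PR: the function name run_jobs_sequentially and the (name, coroutine function) job shape are assumptions made for illustration. It only shows the pattern described above, where jobs are awaited one at a time so they never touch the database concurrently, and a failure in one job is logged without stopping the rest.

```python
import asyncio
import logging
from typing import Awaitable, Callable, List, Tuple

logger = logging.getLogger("coordinator_sketch")

# Hypothetical job shape: a display name plus an async callable with no arguments.
Job = Tuple[str, Callable[[], Awaitable[None]]]


async def run_jobs_sequentially(jobs: List[Job]) -> List[str]:
    """Await each job in order and return the names of the jobs that failed."""
    failed: List[str] = []
    for name, job in jobs:
        try:
            # Only one job is awaited at a time, so jobs in the same
            # coordinator never overlap with each other.
            await job()
        except Exception:
            # Error isolation: log the failure and continue with the next job.
            logger.exception("database job %s failed; continuing", name)
            failed.append(name)
    return failed
```

Under this sketch, each coordinator would be a single scheduled job that calls run_jobs_sequentially with its own job list, so the scheduler sees two jobs instead of six.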
Architecture Changes
1. DatabaseJobsCoordinator Class (Lines 3624-3665)
2. High-Frequency Coordinator (Lines 3668-3702)
Runs every 10 seconds and handles time-sensitive configuration updates:
- add_deployment: Refresh model configuration from database
- get_credentials: Refresh credentials from database
Rationale: These jobs need frequent execution to keep the proxy's model configuration in sync with database changes.
3. Low-Frequency Coordinator (Lines 3705-3844)
Runs every 60 seconds and handles maintenance tasks:
- update_spend: Update spend logs (~60s with randomization to avoid worker collision)
- reset_budget: Reset budget (3600-7200s with randomization)
- spend_log_cleanup: Clean old logs (configurable interval)
- check_batch_cost: Check batch costs (configurable interval)
Rationale: These tasks are less time-critical and can run at longer intervals without impacting system responsiveness.
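Because the low-frequency coordinator ticks on one schedule while its jobs each have their own interval, something has to decide on every tick which jobs are due. One way to do that is sketched below. The class JobIntervalTracker, its methods, and the 50-70s range used for update_spend in the usage note are hypothetical, not taken from the PR. The sketch draws each randomized interval once at construction, in line with the "pre-calc random intervals once at init" commit, and treats a job that has never run as due.

```python
import random
import time
from typing import Dict, List, Optional, Tuple


class JobIntervalTracker:
    """Tracks when each job last ran and reports which jobs are due."""

    def __init__(
        self,
        interval_ranges: Dict[str, Tuple[float, float]],
        rng: Optional[random.Random] = None,
    ) -> None:
        rng = rng or random.Random()
        # Draw one randomized interval per job, once, at construction time.
        self.intervals: Dict[str, float] = {
            name: rng.uniform(low, high)
            for name, (low, high) in interval_ranges.items()
        }
        self.last_run: Dict[str, float] = {}

    def due_jobs(self, now: Optional[float] = None) -> List[str]:
        """Return the jobs whose interval has elapsed (or that never ran)."""
        now = time.monotonic() if now is None else now
        due = []
        for name, interval in self.intervals.items():
            last = self.last_run.get(name)
            if last is None or now - last >= interval:
                due.append(name)
        return due

    def mark_ran(self, name: str, now: Optional[float] = None) -> None:
        """Record that a job just ran."""
        self.last_run[name] = time.monotonic() if now is None else now
```

A coordinator tick could then call due_jobs(), run those jobs one after another, and call mark_ran() for each, for example with JobIntervalTracker({"update_spend": (50, 70), "reset_budget": (3600, 7200)}).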
4. Improved Scheduler Lifecycle Management (Lines 3947-4044)
- Set _scheduler_instance immediately after creation for proper shutdown handling
- _initialize_spend_tracking_background_jobs
Key Benefits
✅ Eliminates Deadlocks: Sequential execution within each coordinator prevents concurrent database access
✅ Prevents Blocking: High-frequency tasks (10s) run independently from slow low-frequency tasks (60s+)
✅ Fault Tolerance: One failed job doesn't stop other jobs in the coordinator
✅ Improved Reliability: Enhanced error handling with descriptive logging and fallback to defaults
✅ Better Performance: Predictable execution patterns with separated high/low frequency paths
✅ Proper Cleanup: Improved scheduler shutdown lifecycle prevents resource leaks
Technical Details
Error Handling Improvements
Execution Flow
Backward Compatibility
Code Changes Summary
Modified Files
litellm/proxy/proxy_server.py
Lines Changed
- DatabaseJobsCoordinator class
- high_frequency_database_jobs_coordinator
- low_frequency_database_jobs_coordinator
- initialize_scheduled_background_jobs to use dual coordinators
Why is this change important?
This change directly addresses a potential source of critical database errors (deadlocks), which can freeze proxy operations and require a manual restart. It hardens the proxy's stability, making it more reliable for production deployments.