Description
Is your feature request related to a problem? Please describe.
We need to handle millions of user requests on our platform across different LLM providers and leading models such as Mistral, Claude, OpenAI, Google, DeepSeek, and Qwen. These providers enforce strict API rate limits, which can disrupt our services if not handled properly.
Minimal feature set to implement
Central throttling per provider/model/key with token-bucket or sliding-window limits, plus hard concurrency caps.
Retry with exponential backoff and jitter, respect Retry-After, and standardize “Rate Limited Upstream” errors.
Circuit breaker per provider/model to stop hammering when 429s/timeouts spike; probe to recover.
Fallback chain per task: primary model → smaller/cheaper same provider → equivalent alternate provider → cached/short response.
Deadline-aware requests: if queue wait exceeds budget, auto-route to fallback or async job.
Token budgeting: cap input/output tokens, trim context, and cache repeated results to reduce calls.
Transparent UX: inline notice when deferred, ETA shown, safe “continue in background,” and state preserved.
Priority queues with weighted fair scheduling: interactive first, workflows next, background and batch last.
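To make the first two items concrete, here is a minimal sketch of a token-bucket limiter and a backoff delay calculation. The names `TokenBucket` and `backoff_delay`, and the defaults for `base` and `cap`, are illustrative assumptions and not part of any existing API. The sketch assumes `Retry-After` has already been parsed into seconds and treats it as a lower bound on the wait.

```python
import random
import time
from typing import Callable, Optional


class TokenBucket:
    """Token-bucket limiter: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the bucket can be tested
        self.tokens = capacity      # start full to allow an initial burst
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; never blocks."""
        now = self.clock()
        # Refill for the elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter; Retry-After acts as a floor."""
    ceiling = min(cap, base * (2 ** attempt))
    delay = random.uniform(0.0, ceiling)
    if retry_after is not None:
        # Never retry sooner than the upstream asked us to.
        delay = max(delay, retry_after)
    return delay
```

One bucket per provider/model/key would be kept in a shared registry. In a multi-process deployment the bucket state would need to live in a shared store, which this single-process sketch does not cover.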
Describe alternatives you've considered
No response
Additional context
No response
Describe the thing to improve
Configuration that matters
Per provider/model: RPM/TPM limits, max concurrency, and burst size aligned to published caps.
Per tenant/tier: distinct ceilings to ensure fairness and monetize higher capacity.
Timeouts: strict per-call timeouts; hedged requests only when safe and within budgets.
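As a rough illustration of how these settings could fit together, the sketch below models per provider/model limits and scales them by a tenant tier. `ModelLimits`, `TierPolicy`, `effective_limits`, and the `share` field are hypothetical names, and the numbers in the usage note are placeholders, not any provider's published caps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelLimits:
    rpm: int               # requests per minute
    tpm: int               # tokens per minute
    max_concurrency: int   # hard cap on in-flight calls
    burst: int             # bucket capacity for short spikes
    timeout_s: float       # strict per-call timeout


@dataclass(frozen=True)
class TierPolicy:
    share: float  # fraction of the provider ceiling this tier may use, in (0, 1]


def effective_limits(model: ModelLimits, tier: TierPolicy) -> ModelLimits:
    """Scale a provider/model ceiling down to a tenant tier's share."""
    if not 0.0 < tier.share <= 1.0:
        raise ValueError("share must be in (0, 1]")
    return ModelLimits(
        rpm=max(1, int(model.rpm * tier.share)),
        tpm=max(1, int(model.tpm * tier.share)),
        max_concurrency=max(1, int(model.max_concurrency * tier.share)),
        burst=max(1, int(model.burst * tier.share)),
        timeout_s=model.timeout_s,  # timeouts are not scaled by tier
    )
```

For example, `effective_limits(ModelLimits(600, 100_000, 20, 10, 30.0), TierPolicy(0.25))` gives a tenant a quarter of the provider ceiling while keeping the same per-call timeout.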
Operational guardrails
Real-time metrics: queue length, wait time, 429 rate, latency P50/P95, tokens, and cost per provider/model.
Adaptive limits: temporarily lower or raise ceilings based on load and error rates.
Clear client signals: return remaining-limit and reset-time headers to enable smarter clients.
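The client-signal item could look something like the sketch below. The `X-RateLimit-*` header names follow a common convention rather than a standard and are an assumption here, as is the fixed-window accounting. `Retry-After` is the standard HTTP header, given in seconds. The function name `rate_limit_headers` is hypothetical.

```python
import math
from typing import Dict


def rate_limit_headers(limit: int, used: int, window_start: float,
                       window_s: float, now: float) -> Dict[str, str]:
    """Build response headers describing the caller's remaining quota."""
    remaining = max(0, limit - used)
    # Seconds until the current window resets, rounded up so clients
    # never retry slightly too early.
    reset_in = max(0, math.ceil(window_start + window_s - now))
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_in),
    }
    if remaining == 0:
        # Tell exhausted clients exactly how long to wait.
        headers["Retry-After"] = str(reset_in)
    return headers
```

A client that reads these values can slow down before hitting the ceiling instead of discovering it through a 429.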