[FEAT] Implement rate limiting and backoff strategy for AI agents with different architecture patterns (supervisor pattern, hierarchical pattern, etc.) #655

@LishuaiJing3

Description

Is your feature request related to a problem? Please describe.

We need to handle millions of user requests on our platform across different LLM providers such as Mistral, Claude, OpenAI, Google, DeepSeek, and Qwen. These models and platforms enforce strict API rate limits, which can disrupt our services if not handled properly.

Minimal feature set to implement
Central throttling per provider/model/key with token-bucket or sliding-window limits, plus hard concurrency caps.
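
The central throttling above could be sketched as a per-key token bucket plus a semaphore for the hard concurrency cap. This is a minimal illustration; the key structure and numbers are placeholders, not published caps.

```python
import threading
import time

class TokenBucket:
    """Token-bucket limiter: allows `rate` requests/sec with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self, n: int = 1) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at the burst size.
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= n:
                self.tokens -= n
                return True
            return False

# One bucket per (provider, model, key); a semaphore enforces the hard concurrency cap.
buckets = {("openai", "gpt-4o", "key-1"): TokenBucket(rate=10, burst=20)}
concurrency = threading.Semaphore(8)
```

A sliding-window counter would work similarly; the token bucket is usually simpler to reason about for burst allowances.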

Retry with exponential backoff and jitter, respect Retry-After, and standardize “Rate Limited Upstream” errors.

Circuit breaker per provider/model to stop hammering when 429s/timeouts spike; probe to recover.
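
The breaker described above might look like the following sketch: open after N consecutive failures, then let a single probe through after a cooldown (half-open). Thresholds are placeholders.

```python
import time

class CircuitBreaker:
    """Per provider/model breaker: opens after `threshold` consecutive failures,
    then allows a probe once `cooldown` seconds have elapsed (half-open)."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Open: only allow a probe once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()
```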

Fallback chain per task: primary model → smaller/cheaper same provider → equivalent alternate provider → cached/short response.
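
The fallback chain could be driven by a simple ordered-candidate loop; the callables stand in for real client calls and the exception handling is deliberately coarse here.

```python
def call_with_fallback(chain, request):
    """Try each candidate in order: primary model, cheaper same-provider model,
    alternate provider, then a cached/short response. `chain` is a list of
    callables that raise on rate limiting or unavailability (illustrative)."""
    errors = []
    for candidate in chain:
        try:
            return candidate(request)
        except Exception as e:  # in practice: RateLimited, Timeout, CircuitOpen
            errors.append(e)
    raise RuntimeError(f"all fallbacks exhausted: {errors}")
```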

Deadline-aware requests: if queue wait exceeds budget, auto-route to fallback or async job.
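
Deadline-aware routing as described above reduces to comparing the estimated queue wait against the request's latency budget. All callables and the `fallback_ok` flag below are hypothetical names for illustration.

```python
def route(request: dict, queue_wait_estimate: float, budget: float,
          run_sync, run_fallback, enqueue_async):
    """Dispatch based on the latency budget: serve in-line when the queue is
    short enough, otherwise fall back to a faster model or defer to an async job."""
    if queue_wait_estimate <= budget:
        return run_sync(request)
    if request.get("fallback_ok"):
        return run_fallback(request)  # cheaper/faster model still within budget
    return enqueue_async(request)     # defer; client polls or receives a callback
```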

Token budgeting: cap input/output tokens, trim context, and cache repeated results to reduce calls.

Transparent UX: inline notice when deferred, ETA shown, safe “continue in background,” and state preserved.

Priority queues with weighted fair scheduling: interactive first, workflows next, background and batch last.
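
The weighted fair scheduling above can be approximated with virtual finish times: each class advances a virtual clock inversely to its weight, so background work still drains instead of starving. Class names and weights are placeholders.

```python
import heapq
import itertools

class WeightedFairQueue:
    """Three-class queue: interactive > workflows > background/batch.
    Virtual finish times keep low-priority work from starving."""

    WEIGHTS = {"interactive": 8, "workflow": 3, "background": 1}

    def __init__(self):
        self.heap = []
        self.vtime = {c: 0.0 for c in self.WEIGHTS}
        self.seq = itertools.count()  # tie-breaker for equal virtual times

    def push(self, job, cls: str, cost: float = 1.0):
        # Heavier weight -> smaller virtual-time increment -> scheduled sooner.
        self.vtime[cls] += cost / self.WEIGHTS[cls]
        heapq.heappush(self.heap, (self.vtime[cls], next(self.seq), job))

    def pop(self):
        return heapq.heappop(self.heap)[2]
```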

Describe alternatives you've considered

No response

Additional context

No response

Describe the thing to improve

Configuration that matters
Per provider/model: RPM/TPM limits, max concurrency, and burst size aligned to published caps.

Per tenant/tier: distinct ceilings to ensure fairness and monetize higher capacity.

Timeouts: strict per-call timeouts; hedged requests only when safe and within budgets.
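
The configuration surface above might be expressed as a structure like this. Every number is a placeholder for illustration, not a published provider cap, and the model names are examples only.

```python
# Illustrative configuration shape; numbers are placeholders, not real caps.
LIMITS = {
    ("openai", "gpt-4o"): {
        "rpm": 500, "tpm": 300_000, "max_concurrency": 32, "burst": 50,
        "timeout_s": 30, "hedge_after_s": None,  # hedging disabled by default
    },
    ("anthropic", "claude-3-5-sonnet"): {
        "rpm": 400, "tpm": 200_000, "max_concurrency": 24, "burst": 40,
        "timeout_s": 30, "hedge_after_s": 2.0,   # hedge only when safe and budgeted
    },
}

TENANT_TIERS = {  # per-tenant ceilings as a fraction of the provider limit
    "free": 0.05, "pro": 0.25, "enterprise": 1.0,
}
```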

Operational guardrails
Real-time metrics: queue length, wait time, 429 rate, latency P50/P95, tokens, and cost per provider/model.

Adaptive limits: temporarily lower or raise ceilings based on load and error rates.

Clear client signals: return remaining-limit and reset-time headers to enable smarter clients.
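
Those client signals could follow the header names from the IETF RateLimit header fields draft (an assumption; `X-RateLimit-*` variants are equally common):

```python
def rate_limit_headers(limit: int, remaining: int, reset_s: int) -> dict:
    """Headers that let clients self-throttle before they ever see a 429."""
    return {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(remaining),
        "RateLimit-Reset": str(reset_s),  # seconds until the window resets
    }
```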

Metadata

Labels: enhancement (New feature or request)