diff --git a/README.md b/README.md index 0122bbe47..5eab90745 100644 --- a/README.md +++ b/README.md @@ -56,6 +56,105 @@ Get 10% OFF GLM CODING PLAN:https://z.ai/subscribe?ic=8JVLJQFSKB - OpenAI-compatible upstream providers via config (e.g., OpenRouter) - Reusable Go SDK for embedding the proxy (see `docs/sdk-usage.md`) +## Operational Enhancements + +This fork includes additional "proxy ops" features beyond the mainline release to improve third-party provider integrations: + +### Core Features +- Environment-based secret loading via `os.environ/NAME` +- Strict YAML parsing via `strict-config` / `CLIPROXY_STRICT_CONFIG` +- Optional encryption-at-rest for `auth-dir` credentials + atomic/locked writes +- Prometheus metrics endpoint (configurable `/metrics`) + optional auth gate (`metrics.require-auth`) +- In-memory response cache (LRU+TTL) for non-streaming JSON endpoints +- Rate limiting (global / per-key parallelism + per-key RPM + per-key TPM) +- Request/response size limits (`limits.max-*-size-mb`) +- Request body guardrail (reject `api_base` / `base_url` by default) +- Virtual keys (managed client keys) + budgets + pricing-based spend tracking +- Fallback chains (`fallback-chains`) + exponential backoff retries (`retry-policy`) +- Pass-through endpoints (`pass-through.endpoints[]`) for forwarding extra routes upstream +- Health endpoints (`/health/liveness`, `/health/readiness`) + optional background probes +- Sensitive-data masking (request logs + redacted management config view) + +### Health-Based Routing & Smart Load Balancing + +CLIProxyAPIPlus now includes intelligent routing and health tracking based on production-grade proxy patterns: + +#### Features + +**Health Tracking System** +- Automatic monitoring of credential health based on failure rates and response latency +- Four health status levels: HEALTHY, DEGRADED, COOLDOWN, ERROR +- Rolling window metrics (configurable 60-second default) +- Per-credential and per-model statistics tracking +- P95/P99 
latency percentile calculations +- Automatic cooldown integration + +**Advanced Routing Strategies** +- **`fill-first`**: Drain one credential to rate limit/cooldown before moving to the next to stagger rolling windows +- **`round-robin`**: Sequential credential rotation (default) +- **`random`**: Random credential selection +- **`least-busy`**: Select credential with fewest active requests (load balancing) +- **`lowest-latency`**: Select credential with best P95 latency (performance optimization) + +**Health-Aware Routing** +- Automatically filter out COOLDOWN and ERROR credentials +- Prefer HEALTHY credentials over DEGRADED when `prefer-healthy: true` +- Graceful fallback to all credentials when no healthy ones available + +#### Configuration Example + +```yaml +# Health tracking configuration +health-tracking: + enable: true + window-seconds: 60 # Rolling window for failure rate calculation + failure-threshold: 0.5 # 50% failure rate triggers ERROR status + degraded-threshold: 0.1 # 10% failure rate triggers DEGRADED status + min-requests: 5 # Minimum requests before tracking + cleanup-interval: 300 # Cleanup old data every 5 minutes + +# Enhanced routing configuration +routing: + strategy: "least-busy" # fill-first, round-robin, random, least-busy, lowest-latency + health-aware: true # Filter unhealthy credentials (COOLDOWN, ERROR) + prefer-healthy: true # Prioritize HEALTHY over DEGRADED credentials +``` + +#### Routing Strategy Comparison + +| Strategy | Best For | How It Works | +|----------|----------|--------------| +| `fill-first` | Staggering rolling caps | Uses the first available credential (by ID) until it hits rate limit/cooldown, then moves to the next | +| `round-robin` | Even distribution, predictable | Cycles through credentials sequentially | +| `random` | Simple load balancing | Randomly selects from available credentials | +| `least-busy` | Optimal load distribution | Selects credential with fewest active requests | +| `lowest-latency` | 
Performance-critical apps | Selects credential with best P95 latency | + +#### Health Status Levels + +- **HEALTHY**: Normal operation, low failure rates +- **DEGRADED**: Elevated failure rates (above degraded-threshold but below failure-threshold) +- **COOLDOWN**: Temporarily unavailable due to errors or rate limits +- **ERROR**: High failure rates (above failure-threshold) or persistent errors + +#### Benefits + +- **Improved reliability** by avoiding unhealthy credentials when `health-aware` routing is enabled +- **Better tail latency** when `lowest-latency` is enabled and health tracking has enough data +- **Smarter load balancing** with `least-busy` using in-flight request counts +- **Automatic recovery** from cooldown windows as health improves + +See: +- `docs/operations.md` + +### Future work + +These are high-value ideas that remain on the roadmap: +- OpenTelemetry tracing + external integrations (Langfuse/Sentry/webhooks) +- Redis-backed distributed cache/rate limits for multi-instance deployments +- DB-backed virtual key store + async spend log writer +- Broader endpoint coverage via native translators (beyond pass-through) + ## Getting Started CLIProxyAPI Guides: [https://help.router-for.me/](https://help.router-for.me/) diff --git a/docs/operations.md b/docs/operations.md new file mode 100644 index 000000000..b943ff32e --- /dev/null +++ b/docs/operations.md @@ -0,0 +1,370 @@ +# Operations (Security + Observability) + +This proxy borrows operational patterns from production-grade systems: environment-based secret loading, safe credential storage, guardrails (rate limits / cooldowns), response caching, and Prometheus metrics. + +## Environment-Sourced Secrets (`os.environ/`) + +Any string value in `config.yaml` can be set from an environment variable by using the prefix: + +```yaml +some-key: os.environ/MY_ENV_VAR +``` + +The config loader resolves these references after YAML unmarshal (works for nested structs, slices, and maps). 
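The resolution step described above can be sketched as a recursive walk. This is a minimal illustration, not the fork's loader: `resolveEnvRefs` and its `lookup` parameter are hypothetical names, the walk runs over a generically decoded YAML tree rather than typed config structs, and only whole-value references are handled.

```go
package main

import (
	"fmt"
	"strings"
)

// envPrefix marks a string value that should be read from the environment.
const envPrefix = "os.environ/"

// resolveEnvRefs walks a decoded YAML tree (maps, slices, strings) and
// replaces every "os.environ/NAME" string with the value of NAME.
// Hypothetical helper: lookup is injected (os.LookupEnv in production) so the
// walk can be tested without touching the real environment.
func resolveEnvRefs(v any, lookup func(string) (string, bool)) (any, error) {
	switch t := v.(type) {
	case string:
		if !strings.HasPrefix(t, envPrefix) {
			return t, nil // ordinary string, left untouched
		}
		name := strings.TrimPrefix(t, envPrefix)
		val, found := lookup(name)
		if !found {
			// Mirrors the documented behavior: a missing variable is an error.
			return nil, fmt.Errorf("environment variable %q is not set", name)
		}
		return val, nil
	case map[string]any:
		out := make(map[string]any, len(t))
		for k, child := range t {
			r, err := resolveEnvRefs(child, lookup)
			if err != nil {
				return nil, err
			}
			out[k] = r
		}
		return out, nil
	case []any:
		out := make([]any, len(t))
		for i, child := range t {
			r, err := resolveEnvRefs(child, lookup)
			if err != nil {
				return nil, err
			}
			out[i] = r
		}
		return out, nil
	default:
		return v, nil // numbers, booleans, nil
	}
}
```

Passing `os.LookupEnv` as `lookup` gives the production behavior. Returning a new tree instead of mutating the input keeps the unresolved `os.environ/...` strings available to anything that later writes the config back to disk.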
+ +If the env var is missing, startup fails (unless running in optional/cloud-deploy mode). + +- Keeps secrets out of `config.yaml` by referring to env vars instead of hard-coding secrets. +- Makes it easier to run the same config across machines/environments. + +### Safety note (no “secret persistence”) +When `os.environ/` references are resolved, config normalization steps that would normally write back to disk are skipped to avoid accidentally writing the resolved secret into `config.yaml`. + +## Strict Config Parsing (Reject Unknown YAML Fields) + +Strongly typed proxies typically surface unknown fields quickly. In Go/YAML it’s easy to silently ignore typos, so CLIProxyAPI supports strict parsing: + +```yaml +strict-config: true +``` + +You can also force strict parsing via env: +- `CLIPROXY_STRICT_CONFIG=true` + +## Encrypted Auth Storage (auth-dir) + +Auth JSON files under `auth-dir` can be encrypted-at-rest and are always written using: +- file locking +- atomic replace +- `0600` permissions + +Config: + +```yaml +auth-storage: + encrypt: true + encryption-key: os.environ/CLIPROXY_AUTH_ENCRYPTION_KEY + allow-plaintext-fallback: true +``` + +Also supported via env: `CLIPROXY_AUTH_ENCRYPTION_KEY` (or legacy `CLI_PROXY_API_AUTH_ENCRYPTION_KEY`). + +### What gets encrypted +- Files under `auth-dir` (typically `*.json`) created by login flows or uploaded via management endpoints. +- The stored format is an **envelope JSON** (AES-256-GCM). The plaintext JSON is only recovered in-memory. + +### Migration behavior +If encryption is enabled and `allow-plaintext-fallback: true`, legacy plaintext auth files are still readable and will be best-effort rewritten into the encrypted envelope format. + +### Remote stores (Postgres/Object store) +If you mirror auth files to Postgres/S3-backed stores, the raw bytes are stored as-is. When encryption is enabled, those remote payloads remain encrypted envelopes. 
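To make the envelope idea concrete, here is a minimal AES-256-GCM sketch using only the Go standard library. The `envelope` field names and the `sealAuth` / `openAuth` helpers are assumptions for illustration; the actual on-disk layout used by this fork is not specified in this document.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/json"
	"errors"
)

// envelope is a hypothetical on-disk layout. encoding/json stores the
// []byte fields as base64 strings.
type envelope struct {
	Alg        string `json:"alg"`
	Nonce      []byte `json:"nonce"`
	Ciphertext []byte `json:"ciphertext"`
}

// newGCM builds an AES-256-GCM AEAD from a 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("AES-256 requires a 32-byte key")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealAuth encrypts plaintext auth JSON into an envelope with a fresh nonce.
func sealAuth(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return json.Marshal(envelope{
		Alg:        "AES-256-GCM",
		Nonce:      nonce,
		Ciphertext: gcm.Seal(nil, nonce, plaintext, nil),
	})
}

// openAuth decrypts an envelope. It returns an error for input that is not
// an envelope, which a caller could treat as a legacy plaintext file.
func openAuth(key, data []byte) ([]byte, error) {
	var env envelope
	if err := json.Unmarshal(data, &env); err != nil {
		return nil, err
	}
	if env.Alg != "AES-256-GCM" {
		return nil, errors.New("not an encrypted envelope")
	}
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(env.Nonce) != gcm.NonceSize() {
		return nil, errors.New("unexpected nonce length")
	}
	return gcm.Open(nil, env.Nonce, env.Ciphertext, nil)
}
```

Because GCM authenticates the ciphertext, a wrong key or a tampered file fails in `openAuth` rather than yielding corrupt JSON. The same bytes can be mirrored to a remote store unchanged.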
+ +## Prometheus Metrics + +Enable the metrics endpoint: + +```yaml +metrics: + enable: true + endpoint: "/metrics" + require-auth: false +``` + +Metrics include request counts/latency, token totals, cache hits/misses, rate-limit rejections, and cooldown counters. + +Key metric names: +- `cliproxy_requests_total` +- `cliproxy_request_duration_ms` +- `cliproxy_tokens_input_total` / `cliproxy_tokens_output_total` +- `cliproxy_cache_hits_total` / `cliproxy_cache_misses_total` +- `cliproxy_ratelimit_rejections_total` +- `cliproxy_cooldowns_triggered_total` + +## Response Cache + +Enable in-memory response caching: + +```yaml +cache: + enable: true + max-size: 1000 + ttl: 300 +``` + +### What is cached +- Only **non-streaming** requests. +- Only JSON responses with **2xx** status. +- Applies to: + - `POST /v1/chat/completions` + - `POST /v1/completions` + - `POST /v1/responses` (OpenAI Responses API) + - `POST /v1/messages` + +### Cache key +Cache keys include the authenticated `apiKey` + method + path + query + request body, so different users/inputs do not collide. + +### Response header +Cached requests return `X-CLIProxy-Cache: HIT` (and uncached attempts return `X-CLIProxy-Cache: MISS`). + +## Rate Limits + +Configure concurrency + RPM limits: + +```yaml +rate-limits: + enable: true + max-parallel-requests: 100 + max-per-key: 10 + max-rpm: 60 + max-tpm: 120000 +``` + +Rate-limited requests return HTTP `429` with `{"error":"rate_limited", ...}` and increment `cliproxy_ratelimit_rejections_total`. + +### Token-Per-Minute (TPM) + +TPM limits protect upstream quotas from a small number of very large requests. + +Notes: +- TPM is tracked per authenticated principal (`cfg:` for static `api-keys`, `vk:` for virtual keys). +- Tokens are recorded after request completion (usage plugin), so enforcement is best-effort and may allow brief bursts. 
+ +## Request/Response Size Limits + +CLIProxyAPI supports request/response size caps: + +```yaml +limits: + max-request-size-mb: 10 + max-response-size-mb: 50 +``` + +Behavior: +- Request bodies above the cap return HTTP `413`. +- When `max-response-size-mb` is set, non-streaming upstream responses larger than the cap return HTTP `502`. + +## Cooldown Override + +Optionally apply a fixed cooldown window for specific HTTP status codes: + +```yaml +cooldown: + enable: true + duration: 60 + trigger-on: [429, 500, 502, 503, 504] +``` + +This is a simple “guardrail cooldown” that prevents immediate re-selection of a credential after repeated error codes. If the upstream returns `Retry-After`, that value is honored/extended. + +Note: quota backoff for 429 is still controlled separately via `disable-cooling`. + +## Fallback Chains (Cross-Provider Failover) + +Fallback chains provide model/provider failover on transient failures (network, 408, 429, 5xx): + +```yaml +fallback-chains: + enable: true + chains: + - primary-model: "gpt-4o" + fallbacks: + - model: "claude-3-5-sonnet-20241022" + provider: "claude" +``` + +When a fallback succeeds, responses include `X-CLIProxy-Fallback` headers for debugging. + +## Retry Policy (Exponential Backoff) + +`retry-policy` adds exponential backoff retries for transient failures (network, 408, 5xx): + +```yaml +retry-policy: + enable: true + max-retries: 3 + initial-delay-ms: 1000 + max-delay-ms: 30000 + multiplier: 2.0 + jitter: 0.1 +``` + +Notes: +- 429 is intentionally not retried via backoff; prefer cooldown/Retry-After. +- This is additive to the existing cooldown-based `request-retry` behavior. +- For OpenAI-compatible upstreams, you can pass `Idempotency-Key` to reduce duplicate charges when retries occur. 
+ +## Routing Strategy + +When multiple credentials match, you can choose a selection strategy: + +```yaml +routing: + strategy: "fill-first" # fill-first, round-robin (default), random, least-busy, lowest-latency + health-aware: true # Filter unhealthy credentials (COOLDOWN, ERROR) + prefer-healthy: true # Prefer HEALTHY over DEGRADED when health-aware + fill-first-max-inflight-per-auth: 4 # 0 = unlimited + fill-first-spillover: "next-auth" # next-auth (default), least-busy +``` + +Notes: +- `least-busy` uses in-flight request counts; `lowest-latency` requires `health-tracking.enable: true`. +- `fill-first` drains one account to rate limit/cooldown, then moves to the next to stagger rolling windows; spillover prevents overload under bursty concurrency. +- `next-auth` preserves deterministic “drain first”; `least-busy` maximizes throughput. + +### Fill-first spillover (recommended for “many creds”) + +`fill-first` intentionally drains one account to its rate limit/cooldown, then moves to the next to keep throughput going by staggering rolling windows across accounts. With many concurrent terminals it can also overload a single credential, leading to avoidable `429` errors. Use `fill-first-max-inflight-per-auth` and `fill-first-spillover` to keep the intent while enabling safe spillover. + +- When the preferred credential is at capacity (`max-inflight`), selection spills over to another credential instead of overloading one. +- `next-auth` preserves deterministic “drain first”; `least-busy` maximizes throughput under bursty load. + +Health-aware filtering uses `health-aware` and `prefer-healthy` (requires `health-tracking.enable: true`). + +## Streaming (Keep-Alives + Safe Bootstrap Retries) + +Streaming failures are only safe to “retry/fail over” **before any bytes are written** to the client. After that, a retry would duplicate/diverge output. 
+ +```yaml +streaming: + keepalive-seconds: 15 # SSE heartbeats (: keep-alive\n\n); <= 0 disables + bootstrap-retries: 2 # retries allowed before first byte; 0 disables +``` + +Notes: +- Keep-alives reduce idle timeouts (Cloudflare/Nginx/proxies) during long pauses between chunks. +- Bootstrap retries/fallbacks only run if the stream fails before producing any payload (safe failover). + +## “10 Terminals / Many Subscriptions” Recommended Defaults + +This configuration biases toward **predictable** routing (burn one account first) while reducing avoidable interruptions under bursty concurrency. Start with the routing block above and add: + +```yaml +health-tracking: + enable: true + +cooldown: + enable: true + duration: 60 + trigger-on: [429, 500, 502, 503, 504] + +retry-policy: + enable: true + max-retries: 3 + initial-delay-ms: 1000 + max-delay-ms: 30000 + multiplier: 2.0 + jitter: 0.1 + +streaming: + keepalive-seconds: 15 + bootstrap-retries: 2 +``` + +## Request Body Guardrails (Client-Side Upstream Targets) + +To prevent redirect attacks, CLIProxyAPI blocks `api_base` / `base_url` in request bodies by default: + +```yaml +security: + allow-client-side-credentials: false +``` + +When disabled (default), requests containing `api_base` or `base_url` are rejected with HTTP `400`. + +## Virtual Keys (Managed Client Keys) + +This pattern generates per-user/team keys without editing `config.yaml`. 
+ +Enable: + +```yaml +virtual-keys: + enable: true +``` + +Management endpoints (require management key): +- `GET /v0/management/virtual-keys` +- `POST /v0/management/virtual-keys` (returns plaintext key once) +- `DELETE /v0/management/virtual-keys/:selector` +- `GET /v0/management/virtual-keys/:selector/budget` + +Policy enforcement (automatic for `vk:*` principals): +- Budget caps (tokens and/or USD) with fixed windows +- Model allowlists (wildcards) +- Per-key model aliases (`model_aliases`) applied by rewriting the request JSON `model` + +## Pricing (Spend Tracking) + +Virtual-key cost budgets require pricing rules: + +```yaml +pricing: + enable: true + models: + - match: "gpt-4o*" + input-per-1k: 5.0 + output-per-1k: 15.0 +``` + +When `pricing.enable: false`, virtual keys can still enforce token budgets, but cost budgets will return `cost_unknown`. + +## Pass-Through Endpoints + +Pass-through routes forward requests to an upstream base URL without writing a full translator. + +```yaml +pass-through: + enable: true + endpoints: + - path: "/v1/rerank" + method: "POST" + base-url: "https://api.openai.com" + timeout: 60 + headers: + Authorization: "Bearer os.environ/OPENAI_API_KEY" +``` + +Security behavior: +- Hop-by-hop headers are stripped. +- Proxy auth headers (`Authorization`, `X-Goog-Api-Key`, `X-Api-Key`) are stripped and must be provided via `headers`. +- If the proxy key was provided via query (`?key=` / `?auth_token=`), that parameter is removed from the forwarded query string. + +## Health Endpoints + Background Probes + +Endpoints: +- `GET /health/liveness` (fast, no upstream calls) +- `GET /health/readiness` (feature status + optional probe summary) +- `GET /health` (alias for readiness) + +Optional background probes: + +```yaml +health: + background-checks: + enable: true + interval: 300 +``` + +Probes are lightweight TCP connectivity checks to configured provider base URLs (no auth, no quota usage). 
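A TCP-only probe of this kind can be approximated with the standard library. The sketch below is illustrative: `probeTarget` and `probe` are hypothetical helpers, and defaulting to port 443 unless the scheme is `http` is an assumption.

```go
package main

import (
	"context"
	"errors"
	"net"
	"net/url"
	"time"
)

// probeTarget derives the host:port to dial from a provider base URL.
func probeTarget(baseURL string) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	host := u.Hostname()
	if host == "" {
		return "", errors.New("base URL has no host")
	}
	port := u.Port()
	if port == "" {
		port = "443" // assumed default for https and unknown schemes
		if u.Scheme == "http" {
			port = "80"
		}
	}
	return net.JoinHostPort(host, port), nil
}

// probe opens a TCP connection and closes it immediately. No HTTP request is
// sent, so no credentials or upstream quota are involved.
func probe(ctx context.Context, baseURL string, timeout time.Duration) error {
	addr, err := probeTarget(baseURL)
	if err != nil {
		return err
	}
	d := net.Dialer{Timeout: timeout}
	conn, err := d.DialContext(ctx, "tcp", addr)
	if err != nil {
		return err
	}
	return conn.Close()
}
```

A successful dial only shows that the host is reachable at the network level. It says nothing about whether credentials are valid or quota remains, which is why it is cheap to run on an interval.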
+ +## Management API Hardening + +- Auth file downloads are blocked for non-local clients by default. +- To allow it, set: + ```yaml + remote-management: + allow-auth-file-download: true + ``` + +### Auth file download behavior +- By default, downloads return the stored bytes (encrypted envelope if encryption is enabled). +- `GET /v0/management/auth-files/download?name=...&decrypt=1` is **localhost-only** and returns plaintext JSON (requires encryption key when files are encrypted). + +New endpoints: +- `GET /v0/management/auth-files/errors` +- `GET /v0/management/auth-providers` +- `GET /v0/management/virtual-keys` (+ create/revoke/budget) + +### Config Redaction + +`GET /v0/management/config` returns a redacted config view (API keys/tokens masked). Use `GET /v0/management/config.yaml` to fetch the raw file (preserves comments). diff --git a/docs/sdk-advanced.md b/docs/sdk-advanced.md index 3a9d3e500..9020eaf90 100644 --- a/docs/sdk-advanced.md +++ b/docs/sdk-advanced.md @@ -60,6 +60,7 @@ func (Executor) Refresh(ctx context.Context, a *coreauth.Auth) (*coreauth.Auth, Register the executor with the core manager before starting the service: ```go +// nil selector uses the default "round-robin" selection strategy. 
core := coreauth.NewManager(coreauth.NewFileStore(cfg.AuthDir), nil, nil) core.RegisterExecutor(myprov.Executor{}) svc, _ := cliproxy.NewBuilder().WithConfig(cfg).WithConfigPath(cfgPath).WithCoreAuthManager(core).Build() @@ -135,4 +136,3 @@ The embedded server calls this automatically for built‑in providers; for custo - Enable request logging: Management API GET/PUT `/v0/management/request-log` - Toggle debug logs: Management API GET/PUT `/v0/management/debug` - Hot reload changes in `config.yaml` and `auths/` are picked up automatically by the watcher - diff --git a/docs/sdk-advanced_CN.md b/docs/sdk-advanced_CN.md index 25e6e83c9..22cd7b87d 100644 --- a/docs/sdk-advanced_CN.md +++ b/docs/sdk-advanced_CN.md @@ -55,6 +55,7 @@ func (Executor) Refresh(ctx context.Context, a *coreauth.Auth) (*coreauth.Auth, 在启动服务前将执行器注册到核心管理器: ```go +// selector 传 nil 时默认使用 "round-robin" 选择策略。 core := coreauth.NewManager(coreauth.NewFileStore(cfg.AuthDir), nil, nil) core.RegisterExecutor(myprov.Executor{}) svc, _ := cliproxy.NewBuilder().WithConfig(cfg).WithConfigPath(cfgPath).WithCoreAuthManager(core).Build() @@ -128,4 +129,3 @@ cliproxy.GlobalModelRegistry().RegisterClient(authID, "myprov", models) - 启用请求日志:管理 API GET/PUT `/v0/management/request-log` - 切换调试日志:管理 API GET/PUT `/v0/management/debug` - 热更新:`config.yaml` 与 `auths/` 变化会自动被侦测并应用 - diff --git a/docs/sdk-usage.md b/docs/sdk-usage.md index 55e7d5f9a..ddb061da1 100644 --- a/docs/sdk-usage.md +++ b/docs/sdk-usage.md @@ -81,6 +81,7 @@ These options mirror the internals used by the CLI server. The service uses a core `auth.Manager` for selection, execution, and auto‑refresh. When embedding, you can provide your own manager to customize transports or hooks: ```go +// nil selector uses the default "round-robin" selection strategy. 
core := coreauth.NewManager(coreauth.NewFileStore(cfg.AuthDir), nil, nil) core.SetRoundTripperProvider(myRTProvider) // per‑auth *http.Transport diff --git a/docs/sdk-usage_CN.md b/docs/sdk-usage_CN.md index b87f9aa1f..ba808a8ef 100644 --- a/docs/sdk-usage_CN.md +++ b/docs/sdk-usage_CN.md @@ -81,6 +81,7 @@ svc, _ := cliproxy.NewBuilder(). 服务内部使用核心 `auth.Manager` 负责选择、执行、自动刷新。内嵌时可自定义其传输或钩子: ```go +// selector 传 nil 时默认使用 "round-robin" 选择策略。 core := coreauth.NewManager(coreauth.NewFileStore(cfg.AuthDir), nil, nil) core.SetRoundTripperProvider(myRTProvider) // 按账户返回 *http.Transport @@ -161,4 +162,3 @@ _ = svc.Shutdown(ctx) - 热更新:`config.yaml` 与 `auths/` 变化会被自动侦测并应用。 - 请求日志可通过管理 API 在运行时开关。 - `gemini-web.*` 相关配置在内嵌服务器中会被遵循。 -