diff --git a/.github/ISSUE_TEMPLATE/feature_request.yml b/.github/ISSUE_TEMPLATE/feature_request.yml index 6dfb20f1..10552a47 100644 --- a/.github/ISSUE_TEMPLATE/feature_request.yml +++ b/.github/ISSUE_TEMPLATE/feature_request.yml @@ -5,9 +5,9 @@ body: - type: markdown attributes: value: | - Before proposing a feature, please read the [Manifesto](https://github.com/zentinelproxy/zentinel/blob/main/MANIFESTO.md). + Before proposing a feature, please read the [Manifesto](https://github.com/zentinelproxy/zentinel/blob/main/MANIFESTO.md) and the [design rationale documents](https://github.com/zentinelproxy/zentinel/tree/main/doc/design). - Zentinel values **predictability over flexibility** and **calm operation over feature breadth**. + Zentinel values **predictability over flexibility** and **calm operation over feature breadth**. The design documents explain why key architectural decisions were made and when they might be revisited. - type: textarea id: problem @@ -72,3 +72,5 @@ body: required: true - label: I have read the [Manifesto](https://github.com/zentinelproxy/zentinel/blob/main/MANIFESTO.md) required: true + - label: I have read the [design rationale documents](https://github.com/zentinelproxy/zentinel/tree/main/doc/design) and my proposal does not conflict with existing architectural decisions + required: true diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index ce5bd8cb..3bcaaf60 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -23,6 +23,7 @@ ## Checklist - [ ] I have read [CONTRIBUTING.md](CONTRIBUTING.md) +- [ ] I have read the [design rationale documents](doc/design/) and my changes align with existing architectural decisions - [ ] My code follows the project's coding standards - [ ] I have added tests that prove my fix/feature works - [ ] All new and existing tests pass locally diff --git a/doc/design/why-bounded-resources.md b/doc/design/why-bounded-resources.md new file mode 
100644 index 00000000..04106c4c --- /dev/null +++ b/doc/design/why-bounded-resources.md @@ -0,0 +1,61 @@ +# Why Bounded Resources + +## The Decision + +Every resource in Zentinel has an explicit upper bound: connections, request body size, header count, header size, agent concurrency, cache size, decompression ratios, connection pool depth. Nothing grows without limit. Nothing is "unlimited by default." + +## Alternatives Considered + +**Unbounded by default, limit when needed.** Most proxies start with no limits and let operators add them when problems arise. This is reactive: you discover the limit you needed after the outage. A single client opening 100,000 connections, a request with a 2 GB body, or a zip bomb expanding to fill all available memory—these are not edge cases, they are Tuesday. + +**Dynamic auto-scaling.** Automatically grow buffers, pools, and queues based on demand. This works until it doesn't: auto-scaling under a DDoS attack means the proxy consumes all available memory trying to accommodate malicious traffic. The system that was supposed to protect your backend becomes the mechanism of its destruction. + +**OS-level limits only.** Rely on `ulimit`, cgroups, and OOM killer for resource boundaries. These are blunt instruments: the OOM killer does not distinguish between a proxy handling legitimate traffic and one being abused. When the OS enforces the limit, recovery is a process restart, not a graceful rejection. 
+ +## Why Bounded + +**Predictable memory usage.** An operator can look at the configuration and calculate the worst-case memory footprint: + +| Resource | Default Limit | Purpose | +|----------|--------------|---------| +| Max body size | 1 MB | Prevents memory exhaustion from large uploads | +| Max header size | 8,192 bytes | Prevents header-based DoS | +| Max header count | 100 | Prevents header inflation attacks | +| Max connections per client | 100 | Prevents single-client monopolization | +| Agent concurrency | 100 per agent | Prevents agent overload | +| Cache size | 100 MB | Bounded memory for cached responses | +| Upstream connection pool | 100 per upstream | Prevents upstream connection exhaustion | +| Decompression ratio | 100x | Zip bomb protection | +| Decompression output | 10 MB | Absolute decompression ceiling | + +These are not hidden safety nets. They are explicit configuration values, logged at startup, observable in metrics. + +**Graceful degradation.** When a bound is reached, Zentinel rejects the specific request that would exceed it—with an appropriate HTTP status code and a log entry—rather than degrading the entire system. The 101st connection from a single client gets rejected; the other 100 continue normally. The request with a 2 MB body gets a 413; all other requests are unaffected. + +**Noisy neighbor prevention.** Per-agent concurrency semaphores ensure that a slow agent cannot starve other agents. If the WAF agent is processing slowly, it uses its own semaphore budget. The authentication agent continues at full speed with its own independent semaphore. One misbehaving component cannot cascade into system-wide degradation. + +**Zip bomb defense.** Decompression is double-bounded: by ratio (output/input must stay below the configured maximum, default 100x) and by absolute size (output must stay below the configured ceiling, default 10 MB). A 1 KB payload that decompresses to 1 GB is caught by the ratio check. 
A legitimate but large compressed payload is caught by the absolute limit. Both are configurable per deployment. + +**Circuit breakers.** Each agent has a three-state circuit breaker (closed → open → half-open) with configurable thresholds. When an agent fails repeatedly, the circuit opens and requests are handled according to the configured failure mode (block or pass-through) without waiting for the agent to time out on every request. Recovery is automatic: after the timeout period, a probe request tests the agent, and on success, the circuit closes. + +## Trade-offs + +**Operators must size limits.** There is no "unlimited" escape hatch. An operator deploying Zentinel must decide: how large can a request body be? How many connections per client? How much memory for the cache? This requires understanding the workload. We provide documented defaults that work for common cases, but operators should review them. + +**Legitimate traffic can be rejected.** A bound that is too tight will reject valid requests. A 1 MB body limit will reject a 2 MB file upload. This is by design: the operator must explicitly raise the limit for endpoints that need it, rather than having no limit and hoping for the best. + +**Configuration surface.** Every bound is a configuration knob. More knobs means more to understand, more to review, more to get wrong. We mitigate this with sensible defaults and validation that warns about unusual values, but the complexity is real. 
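The three-state circuit breaker described above can be sketched as a small state machine. This is an illustrative sketch, not Zentinel's implementation; the type names, fields, and the consecutive-failure policy are assumptions made for the example:

```rust
// Illustrative sketch of the closed -> open -> half-open circuit breaker
// described above. Names and the failure-counting policy are assumptions
// for this example, not Zentinel's actual API.

#[derive(Debug, Clone, Copy, PartialEq)]
enum CircuitState {
    Closed,
    Open,
    HalfOpen,
}

struct CircuitBreaker {
    state: CircuitState,
    consecutive_failures: u32,
    failure_threshold: u32,
}

impl CircuitBreaker {
    fn new(failure_threshold: u32) -> Self {
        Self {
            state: CircuitState::Closed,
            consecutive_failures: 0,
            failure_threshold,
        }
    }

    /// Record the outcome of one agent call and transition accordingly.
    fn record(&mut self, success: bool) {
        match (self.state, success) {
            // Repeated failures while closed eventually open the circuit.
            (CircuitState::Closed, false) => {
                self.consecutive_failures += 1;
                if self.consecutive_failures >= self.failure_threshold {
                    self.state = CircuitState::Open;
                }
            }
            (CircuitState::Closed, true) => self.consecutive_failures = 0,
            // A successful probe in half-open closes the circuit again.
            (CircuitState::HalfOpen, true) => {
                self.state = CircuitState::Closed;
                self.consecutive_failures = 0;
            }
            // A failed probe sends it straight back to open.
            (CircuitState::HalfOpen, false) => self.state = CircuitState::Open,
            // While open, calls are short-circuited; nothing to record.
            (CircuitState::Open, _) => {}
        }
    }

    /// Invoked when the recovery timeout elapses: allow a single probe.
    fn timeout_elapsed(&mut self) {
        if self.state == CircuitState::Open {
            self.state = CircuitState::HalfOpen;
        }
    }
}
```

While the circuit is open, requests are handled immediately according to the configured failure mode rather than waiting for the agent to time out.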
+ +## When to Revisit + +- If adaptive limiting (learning from traffic patterns to suggest bounds) proves reliable enough to supplement—not replace—explicit limits +- If a deployment pattern emerges where the defaults are consistently wrong, we should change the defaults rather than expecting every operator to override them +- If per-route or per-endpoint limits become necessary (currently most limits are global or per-agent), the configuration model may need to evolve + +## Manifesto Alignment + +> *"Infrastructure should be calm. [...] It should have clear limits, predictable timeouts, and failure modes you can explain to another human."* — Manifesto, principle 1 + +> *"A feature that cannot be bounded, observed, tested, and rolled back does not belong in the core."* — Manifesto, principle 6 + +Bounded resources are how Zentinel ensures that the proxy behaves predictably under any load condition. The operator sets the bounds. The proxy enforces them. The metrics show when they are reached. There are no surprises. diff --git a/doc/design/why-explicit-config.md b/doc/design/why-explicit-config.md new file mode 100644 index 00000000..5d060e9f --- /dev/null +++ b/doc/design/why-explicit-config.md @@ -0,0 +1,64 @@ +# Why Explicit Configuration + +## The Decision + +Zentinel requires all operational parameters—limits, timeouts, failure modes, TLS settings—to be explicitly stated in configuration. There are no hidden defaults that silently shape behavior. Every default value is documented, logged on startup, and observable in metrics. + +The proxy's failure mode defaults to `closed`: if something is ambiguous or misconfigured, Zentinel rejects rather than guesses. + +## Alternatives Considered + +**Convention over configuration.** Many frameworks minimize configuration by assuming sensible defaults. Ruby on Rails popularized this: if you follow the convention, things "just work." For a web framework, this reduces boilerplate. 
For a reverse proxy handling production traffic, invisible conventions become invisible failure modes. An operator debugging a 3 AM outage should not have to know that the default timeout was 30 seconds because the documentation said so three versions ago. + +**Auto-detection / smart defaults.** Automatically detect the number of CPU cores, available memory, and network interfaces, then configure accordingly. This sounds helpful but creates non-reproducible behavior: the same configuration file produces different behavior on different machines. When you move from a 4-core dev box to a 64-core production server, the proxy silently changes its concurrency model. + +**Fail-open by default.** Many proxies default to permissive behavior: if a WAF agent is unreachable, pass the request through. This prioritizes availability over security. It means that the moment your security infrastructure fails, you have no security—precisely when you need it most. + +## Why Explicit + +**Debuggability.** When every parameter is stated in configuration, an operator can look at the config file and know exactly what the proxy will do. No need to check documentation for default values, no need to wonder whether a parameter was auto-detected or explicitly set. The configuration file is the source of truth. + +**Reproducibility.** The same configuration file produces the same behavior on any machine. If `worker-threads=4` is in the config, there are 4 worker threads—on a laptop and on a 128-core server. The only exception is `worker-threads=0`, which explicitly means "auto-detect," and this choice is logged on startup. 
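The `worker-threads` rule above is small enough to sketch. The function name and log line here are hypothetical, but the behavior follows the text: an explicit value is used verbatim on any machine, and only the documented `0` triggers auto-detection, which is logged rather than hidden:

```rust
// Hypothetical sketch of the worker-thread resolution described above.
// The function name and log format are illustrative, not Zentinel's API.

/// An explicit value is used exactly as configured, on every machine.
/// `0` is the one documented escape hatch meaning "auto-detect", and
/// that choice is surfaced in the startup log.
fn resolve_worker_threads(configured: usize, detected_cores: usize) -> usize {
    if configured == 0 {
        println!("worker-threads=0: auto-detected {detected_cores} worker threads");
        detected_cores
    } else {
        configured
    }
}
```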
+ +**Fail-closed security.** Zentinel defaults to rejecting ambiguous or broken states: + +| Scenario | Default Behavior | +|----------|-----------------| +| Agent unreachable | Block request (fail closed) | +| TLS cert missing | Refuse to start | +| Unknown config key | Validation error | +| Cross-reference to nonexistent upstream | Validation error | + +An operator can override any of these to fail-open, but they must do so explicitly. The configuration records that decision for auditing. + +**Startup validation.** Zentinel validates configuration at startup with four phases: + +1. **Parse-time**: Syntax correctness (valid KDL) +2. **Schema**: Required fields present, types correct +3. **Semantic**: Cross-references valid (routes reference existing upstreams, filters reference existing agents) +4. **Runtime**: External resources exist (TLS cert files, agent socket paths) + +A misconfigured proxy fails loudly at startup, not silently at 3 AM when a particular code path is first exercised. + +**Audit trail.** Explicit configuration means you can diff two config versions and see exactly what changed. No implicit state to track, no auto-detected values that shifted between deployments. Code review of config changes is meaningful because the config contains the full picture. + +## Trade-offs + +**More configuration to write.** Operators must specify values that other proxies would assume. This is intentional friction: it forces the operator to make conscious decisions about timeouts, limits, and failure modes. But it does increase the initial setup effort. + +**Steeper onboarding.** A new user cannot start with an empty config file and have everything work. They must understand what the proxy needs: at minimum, a listener, a route, and an upstream. We mitigate this with example configurations and clear validation error messages that tell you what's missing. 
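Those clear validation errors fall out of the staged checks. The semantic phase (routes must reference existing upstreams), for example, might look like this sketch, with hypothetical types standing in for the real configuration model:

```rust
// Hypothetical sketch of the semantic validation phase described above:
// every route must reference an upstream that actually exists. Type and
// field names are illustrative, not Zentinel's real config model.

use std::collections::HashSet;

struct Route {
    name: String,
    upstream: String,
}

/// Phase 3 (semantic): run after parse-time and schema checks pass.
/// A dangling cross-reference fails loudly at startup, not at 3 AM
/// when the route is first exercised.
fn validate_cross_references(
    routes: &[Route],
    upstreams: &HashSet<String>,
) -> Result<(), String> {
    for route in routes {
        if !upstreams.contains(&route.upstream) {
            return Err(format!(
                "route {:?} references nonexistent upstream {:?}",
                route.name, route.upstream
            ));
        }
    }
    Ok(())
}
```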
+ +**Verbose for simple cases.** A proxy that serves a single backend on port 80 requires more configuration in Zentinel than in proxies that assume defaults. This is an acceptable cost: simple cases should still be explicit, because simple deployments eventually become complex deployments, and the configuration should grow predictably rather than revealing hidden assumptions. + +## When to Revisit + +- If the configuration burden becomes a significant barrier to adoption, we could offer a `zentinel init` command that generates an explicit config with documented defaults—but never hide those defaults from the running config +- If a particular default proves universally correct (never needs changing across deployments), it could be promoted to a documented, logged implicit default—but this bar should be very high + +## Manifesto Alignment + +> *"Security must be explicit. [...] There is no 'magic'. There is no implied policy. If Zentinel is protecting something, you should be able to point to where and why."* — Manifesto, principle 2 + +> *"Infrastructure should be calm. [...] It should have clear limits, predictable timeouts, and failure modes you can explain to another human."* — Manifesto, principle 1 + +Explicit configuration is how Zentinel delivers on both promises: every limit is visible, every failure mode is a conscious choice, and the configuration file tells the full story. diff --git a/doc/design/why-external-agents.md b/doc/design/why-external-agents.md new file mode 100644 index 00000000..9209523c --- /dev/null +++ b/doc/design/why-external-agents.md @@ -0,0 +1,61 @@ +# Why External Agents + +## The Decision + +Zentinel processes complex request logic—WAF inspection, authentication, custom business rules—in external agent processes that communicate with the proxy over Unix domain sockets or gRPC. Agents are separate OS processes, not embedded plugins or in-process modules. 
+ +## Alternatives Considered + +**Embedded plugins (shared libraries / dynamic loading).** Load `.so`/`.dylib` files at runtime. Fast (no IPC), but a bug in any plugin can corrupt proxy memory or crash the entire process. No language flexibility—plugins must be written in Rust or C. Upgrading a plugin requires restarting the proxy. + +**WASM filters.** Sandboxed execution within the proxy process. Better isolation than shared libraries, but WASM has limited access to system resources (networking, filesystem), restricted language support (not all languages compile well to WASM), and the sandbox adds overhead for every call. Debugging WASM in production is painful. + +**Lua scripting (NGINX/OpenResty model).** Flexible and fast for simple transformations. But Lua's type system is weak, error handling is ad hoc, and complex logic (WAF rule evaluation, ML model inference) does not belong in an embedded scripting language. Lua scripts share the proxy's address space—a runaway script blocks the event loop. + +**HTTP callouts (ext_proc / ext_authz).** External services over HTTP. Good isolation, but HTTP adds serialization overhead, connection management complexity, and latency. Every request becomes at least one additional HTTP round-trip. The protocol is generic rather than purpose-built for proxy integration. + +## Why External Processes + +**Crash isolation.** If a WAF agent segfaults or panics, the proxy keeps serving traffic. The circuit breaker trips, the agent restarts, and recovery is automatic. A bug in request inspection must never take down the proxy. + +**Language flexibility.** Agents can be written in any language: Rust, Go, Python, Java. The protocol is documented and SDK libraries are provided. Teams can extend Zentinel without learning Rust or understanding proxy internals. + +**Independent deployment.** Agents have their own release cycle. You can upgrade a WAF agent without restarting the proxy. 
You can roll back an agent without touching the proxy binary. This matters in production where the proxy handles all traffic. + +**Resource isolation.** Each agent has its own memory space, CPU allocation, and concurrency limits. A slow authentication agent cannot starve a fast header-transformation agent. Per-agent semaphores enforce concurrency bounds. Circuit breakers prevent cascading failures. + +**Noisy neighbor prevention.** Per-agent concurrency semaphores ensure that one slow agent cannot consume all available processing capacity. If Agent A is slow, Agent B continues processing at full speed with its own independent semaphore. + +## The Protocol + +Agents communicate over a binary protocol with length-prefixed JSON messages: + +- **Transport**: Unix domain sockets (primary), gRPC (remote agents), reverse connections (NAT traversal) +- **Message frame**: 4-byte big-endian length + 1-byte type prefix + JSON payload +- **Lifecycle events**: `RequestHeaders`, `RequestBody`, `ResponseHeaders`, `ResponseBody`, `RequestComplete`, `WebSocketFrame`, `GuardrailInspect` +- **Decisions**: `ALLOW` (continue), `BLOCK` (reject with status), `MODIFY` (transform headers/body) +- **Connection pooling**: Persistent connections with 4 load-balancing strategies (round-robin, least-connections, health-based, random) + +The protocol is purpose-built for proxy integration. It exposes exactly the request lifecycle phases that matter, with no unnecessary abstraction. + +## Trade-offs + +**IPC overhead.** Every agent call crosses a process boundary. For the hot path (every request), this adds latency—typically sub-millisecond over UDS, but nonzero. We mitigate this with connection pooling, persistent connections, and batched communication where possible. + +**Operational complexity.** External agents are additional processes to deploy, monitor, and manage. Each agent needs health checking, log collection, and lifecycle management. This is more complex than a single-binary approach. 
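The message frame from the protocol section above can be sketched as an encoder/decoder pair. Whether the length field covers the type byte as well as the payload is an assumption made for this example; the protocol documentation is authoritative on that detail:

```rust
// Illustrative encoder/decoder for the agent-protocol frame described
// above: 4-byte big-endian length, 1-byte message type, JSON payload.
// Assumption: the length field counts the type byte plus the payload;
// the real wire format may define it differently.

fn encode_frame(msg_type: u8, json_payload: &[u8]) -> Vec<u8> {
    let len = (json_payload.len() + 1) as u32; // type byte + payload
    let mut frame = Vec::with_capacity(4 + 1 + json_payload.len());
    frame.extend_from_slice(&len.to_be_bytes()); // 4-byte big-endian length
    frame.push(msg_type);                        // 1-byte type prefix
    frame.extend_from_slice(json_payload);       // JSON body
    frame
}

fn decode_frame(buf: &[u8]) -> Option<(u8, &[u8])> {
    let len_bytes: [u8; 4] = buf.get(..4)?.try_into().ok()?;
    let len = u32::from_be_bytes(len_bytes) as usize;
    let body = buf.get(4..4 + len)?;
    let (msg_type, payload) = body.split_first()?;
    Some((*msg_type, payload))
}
```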
+ +**Protocol versioning.** The agent protocol is a contract. Breaking changes require coordinated updates across proxy and agents. We version the protocol and maintain backward compatibility where feasible. + +## When to Revisit + +- If WASM matures to support full system access, rich debugging, and broad language support, some lightweight agents could move in-process +- If the IPC overhead becomes measurable in latency-critical paths (sub-100μs budgets), a hybrid model with in-process fast-path and external slow-path could be considered +- If the operational burden of managing agent processes proves too high for small deployments, an embedded mode could be offered as an option + +## Manifesto Alignment + +> *"Complexity must be isolated. [...] The agent architecture is not a workaround or a plugin system bolted on as an afterthought. It is a fundamental design choice."* — Manifesto, principle 4 + +> *"A broken extension must never take the whole system down with it. Agents can crash, restart, be upgraded, or be disabled—independently of the proxy."* — Manifesto, principle 4 + +The external agent model is how Zentinel keeps the core small and the blast radius of complexity contained. diff --git a/doc/design/why-kdl.md b/doc/design/why-kdl.md new file mode 100644 index 00000000..56439388 --- /dev/null +++ b/doc/design/why-kdl.md @@ -0,0 +1,61 @@ +# Why KDL + +## The Decision + +Zentinel uses [KDL](https://kdl.dev/) (KDL Document Language) as its configuration format. All proxy configuration—listeners, routes, upstreams, agents, filters, TLS, limits—is expressed in KDL files. + +## Alternatives Considered + +**YAML.** The most common configuration format for infrastructure tools. But YAML has well-documented pitfalls: implicit type coercion (`yes` becomes `true`, `3.10` becomes `3.1`), significant whitespace that breaks on copy-paste, the Norway problem (`NO` becomes `false`), and multiple ways to express the same thing (block vs flow style). 
These ambiguities cause real production incidents. + +**TOML.** Explicit and well-typed. Good for flat or shallow configuration. But TOML becomes unwieldy for deeply nested structures—Zentinel's config has routes containing filters containing parameters, agents with circuit breaker settings, and upstreams with health check configurations. Deeply nested TOML requires verbose `[section.subsection.sub-subsection]` headers that obscure the structure. + +**JSON.** Unambiguous parsing, universal support. But JSON has no comments, no trailing commas, and no multiline strings. A configuration format that does not support comments is hostile to operators who need to document why a setting exists or temporarily disable a block. + +**HCL (HashiCorp Configuration Language).** Purpose-built for infrastructure. Good block syntax. But HCL is tightly associated with the HashiCorp ecosystem, has complex interpolation semantics, and its specification has changed between versions (HCL1 vs HCL2) in breaking ways. + +**Custom DSL.** Maximum expressiveness for our domain. But a custom language means custom tooling (syntax highlighting, linting, formatting), a learning curve for every new user, and maintenance burden for the parser. The configuration language should be a solved problem, not a project unto itself. + +## Why KDL Fits + +**Node-based structure.** KDL's fundamental unit is a node with optional arguments, properties, and children. This maps naturally to proxy configuration: + +```kdl +route "api" { + match path="/api/*" methods=["GET" "POST"] + upstream "api-backend" + filter "rate-limit" requests-per-second=100 +} +``` + +The hierarchy is visually clear. Nesting is explicit via braces, not indentation. + +**No type coercion surprises.** Strings are strings, numbers are numbers, booleans are `true` or `false`. `"yes"` is always the string `"yes"`, never silently converted to a boolean. `3.10` stays `3.10`. 
+ +**Comments are first-class.** Line comments (`//`) and block comments (`/* */`) are part of the language. Operators can document why a rate limit is set to a specific value, or comment out an agent block for debugging. + +**Diff-friendly.** Each node is typically one line. Adding a route, changing a limit, or adding a filter produces clean, reviewable diffs. No ambiguity about whether a change affected surrounding blocks. + +**Consistent syntax.** There is one way to express a configuration block. No choice between block style and flow style, no alternative quoting mechanisms, no optional colons. This consistency means configuration looks the same regardless of who wrote it. + +## Trade-offs + +**Smaller ecosystem.** KDL is newer than YAML, TOML, or JSON. Fewer editors have syntax highlighting out of the box. Fewer developers have seen it before. There is a learning curve, though the syntax is simple enough that most people read it correctly on first encounter. + +**Fewer libraries.** KDL parsing libraries exist for major languages (Rust, JavaScript, Go, Python), but the ecosystem is smaller than YAML or JSON. If we need KDL support in an unusual language for an agent SDK, we may need to contribute to or write a parser. + +**Unfamiliarity.** Operators evaluating Zentinel may see KDL as a barrier. "Why not just use YAML like everything else?" is a reasonable question. The answer is that YAML's ambiguities cause real incidents, and the cost of learning KDL is lower than the cost of debugging YAML type coercion in production. + +## When to Revisit + +- If KDL development stalls and the specification does not reach stability +- If a future configuration format emerges that solves the same problems with broader adoption +- If the KDL Rust parser becomes unmaintained (currently well-maintained via the `kdl` crate) + +## Manifesto Alignment + +> *"Security must be explicit. [...] 
Every limit, timeout, and decision in Zentinel is meant to be: visible in configuration, observable in metrics and logs, and explainable after the fact."* — Manifesto, principle 2 + +> *"There is no 'magic'. There is no implied policy."* — Manifesto, principle 2 + +KDL supports this principle by being an unambiguous format. What you read in the configuration file is what the proxy does. No implicit type conversions, no hidden inheritance, no surprising defaults. diff --git a/doc/design/why-pingora.md b/doc/design/why-pingora.md new file mode 100644 index 00000000..dabcdcd4 --- /dev/null +++ b/doc/design/why-pingora.md @@ -0,0 +1,49 @@ +# Why Pingora + +## The Decision + +Zentinel is built on [Pingora](https://github.com/cloudflare/pingora), Cloudflare's open-source HTTP proxy framework written in Rust. + +We use Pingora as the core dataplane: connection handling, HTTP parsing, TLS termination, load balancing, and upstream connection pooling. Zentinel adds configuration, the agent architecture, observability, and operational semantics on top. + +## Alternatives Considered + +**Hyper (raw)**. Rust's de facto HTTP library. Gives you maximum control but requires building connection management, load balancing, graceful shutdown, hot restart, and TLS from scratch. Writing a production proxy on raw hyper means reimplementing what Pingora already provides—and getting it wrong in subtle ways under load. + +**Envoy**. Battle-tested C++ proxy with a large ecosystem. But extending Envoy means writing C++ or using WASM filters, both of which add friction. Envoy's configuration surface is enormous (xDS, Lua, WASM, ext_proc), and the operational model assumes a control plane. Zentinel wants to be a single binary you can reason about. + +**NGINX**. Proven, fast, widely deployed. 
But NGINX's module system is C-based, its configuration language is its own DSL with implicit inheritance rules, and its architecture (worker processes, shared memory zones) makes certain patterns—like per-request external callouts—awkward. + +**Building from scratch**. Full control, no dependency risk. But HTTP proxy correctness is deceptively hard: connection reuse, keepalive management, upgrade handling, graceful shutdown with drain, hot restart without dropping connections. These are solved problems. Solving them again is a poor use of time. + +## Why Pingora Fits + +**Proven at scale.** Pingora handles trillions of requests at Cloudflare. The connection lifecycle, memory management, and failure handling have been tested under conditions we cannot reproduce in a lab. + +**Rust-native.** Same language as Zentinel. No FFI boundary, no serialization overhead for the hot path. The `ProxyHttp` trait gives us typed hooks into the request lifecycle—request filter, upstream peer selection, response filter—without fighting a C API. + +**Right abstraction level.** Pingora gives us the plumbing (connection pools, health checks, load balancing algorithms, TLS, HTTP/1 and HTTP/2) while letting us own the policy layer. We implement `ProxyHttp` and control what happens at each phase. It does not impose a configuration format, a control plane, or an extension model. + +**Operational primitives.** Graceful shutdown, hot restart (upgrading the binary without dropping connections), and worker thread management come built in. These are hard to get right and critical for zero-downtime operation. + +## Trade-offs + +**External dependency.** We depend on a project maintained by Cloudflare. If Pingora's direction diverges from ours, we carry the cost. We mitigate this by maintaining a fork with security patches rebased, and by keeping our integration surface narrow (primarily the `ProxyHttp` trait). 
+ +**Abstraction leakage.** Pingora's APIs occasionally expose internal assumptions (session lifecycle, error types). We work around these where needed rather than fighting the framework. + +**Upgrade friction.** Tracking upstream Pingora means periodic rebasing. Breaking changes in Pingora's trait signatures require updates across our proxy implementation. + +## When to Revisit + +- If Pingora is abandoned or development stalls significantly +- If our requirements diverge from HTTP proxying (e.g., raw TCP/UDP as a primary use case) +- If Pingora's abstraction becomes a bottleneck for features we need (unlikely given the trait-based design) + +## Manifesto Alignment + +> *"We build on proven foundations."* — Manifesto, introduction + +> *"Production correctness beats feature breadth."* — Manifesto, principle 6 + +Building on Pingora means we inherit correctness for the hard parts (HTTP parsing, connection management, TLS) and spend our time on what makes Zentinel different: the agent architecture, KDL configuration, and explicit operational semantics.