- Status: Accepted
- Date: 2026-06-03
- Related: code-style-typescript.md, architectural-principles.md, logging-and-observability.md, security-review.md
How Relavium represents, classifies, and surfaces failures. The rules apply to every
package; the LLM layer (@relavium/llm) and the engine (packages/core) are the
highest-stakes consumers because a misclassified error there breaks the fallback chains
and cost accounting that are product features.
- Errors are typed and discriminated, never bare `Error` and never thrown strings (see code-style-typescript.md). Each error type carries a stable discriminant (a `kind`/`code`) that callers narrow on, not `error.message`, which is for humans and may change.
- Errors carry structured context (the node id, run id, provider id, attempt number) as fields, not interpolated into the message string, so logging can emit them as fields and tests can assert on them.
- Wrap-and-rethrow preserves the cause (`{ cause }`); we never swallow a root cause to re-throw a vaguer one.
- The engine surfaces this with typed, code-discriminated classes: `EngineStateError` (the `WorkflowEngine` API-boundary faults) and `RunLoopInvariantError` (the run-loop substrate's internal-invariant breaches: a draft with both/neither correlation key, a concurrent stream consumer). Each is defined and documented at its source under `packages/core/src/engine/` and exported from `@relavium/core` so callers and tests narrow on `.code`, never the message. (The cross-cutting `LlmError` seam contract is the one error type detailed in full below, because the fallback chains depend on its classification.)

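
The rules above can be sketched as follows. This is a minimal illustration, not the engine's actual code: `NodeRunError`, its two codes, and the `NodeRunContext` fields are hypothetical names chosen for the example, and the sketch assumes an ES2022 target so that `Error` accepts the `{ cause }` option.

```typescript
// Illustrative only: NodeRunError, its codes, and its context fields are
// hypothetical and are not the actual @relavium/core exports.
type NodeRunErrorCode = 'node_handler_threw' | 'node_timed_out';

interface NodeRunContext {
  readonly nodeId: string;
  readonly runId: string;
  readonly attempt: number;
}

class NodeRunError extends Error {
  readonly code: NodeRunErrorCode;
  readonly context: NodeRunContext;

  constructor(
    code: NodeRunErrorCode,
    message: string,
    context: NodeRunContext,
    cause?: unknown,
  ) {
    // The ES2022 `{ cause }` option keeps the root cause on the error chain.
    super(message, { cause });
    this.name = 'NodeRunError';
    this.code = code;
    this.context = context;
  }
}

// Wrap-and-rethrow helper: the root cause rides along, and the ids stay in
// structured fields instead of being interpolated into the message.
function wrapHandlerFailure(cause: unknown, context: NodeRunContext): NodeRunError {
  return new NodeRunError('node_handler_threw', 'Node handler failed', context, cause);
}

// Logging emits the context as fields rather than parsing the message.
function toLogFields(err: NodeRunError): Record<string, unknown> {
  return {
    code: err.code,
    nodeId: err.context.nodeId,
    runId: err.context.runId,
    attempt: err.context.attempt,
  };
}

// Callers narrow on the stable discriminant, never on the message text.
function isTimeout(err: NodeRunError): boolean {
  return err.code === 'node_timed_out';
}
```

Because the message stays generic, a test can assert on `err.code` and `err.context.nodeId` directly, and rewording the message later does not break any caller.
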
Every failure inside an @relavium/llm adapter is normalized to a single LlmError type
before it crosses the seam — no vendor SDK error shape ever escapes the adapter (see the
seam boundary rule).
LlmError is classified so the FallbackChain runner can make a policy decision without
knowing which provider produced it:
- Retryable (transient, worth moving to the next provider in the chain or retrying with backoff): rate limits (HTTP 429), server/overloaded errors (5xx), timeouts, and transport/connection resets. The fallback runner advances to the next provider on a retryable `LlmError` and records the failed attempt's usage so cost stays accurate across failover.
- Fatal (not worth retrying anywhere; surface it and stop): authentication/permission failures (401/403, a bad or missing key), malformed requests (400, an unsupported model id, a tool schema a provider rejected), content-policy refusals, and request cancellation (`AbortSignal`). A fatal error does not silently fall through the chain to mask a real bug.

The runner — not the adapter — owns the retry/fallback policy (ADR-0011); adapters stay dumb and only classify. The classification mapping (per-provider status/code → retryable vs fatal) is covered by the per-provider conformance suite.
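
The split between a classifying adapter and a policy-owning runner might look like the sketch below. Every name here (`Provider`, `AttemptResult`, `isRetryableStatus`, `runFallbackChain`) is hypothetical, and the real `LlmError` and `FallbackChain` shapes in `@relavium/llm` may differ. For brevity the adapter returns a result union instead of throwing, and any status the text does not list is treated as fatal; both are assumptions of this sketch.

```typescript
// Hypothetical shapes for illustration; not the real @relavium/llm types.
interface Usage {
  readonly inputTokens: number;
  readonly outputTokens: number;
}

interface LlmError {
  readonly retryable: boolean;
  readonly providerId: string;
  readonly status: number;
  readonly usage: Usage; // tokens consumed before the attempt failed
}

type AttemptResult =
  | { readonly ok: true; readonly text: string; readonly usage: Usage }
  | { readonly ok: false; readonly error: LlmError };

interface Provider {
  readonly id: string;
  call(prompt: string): Promise<AttemptResult>;
}

interface LedgerEntry {
  readonly providerId: string;
  readonly usage: Usage;
}

interface ChainOutcome {
  readonly result: AttemptResult;
  readonly usageLedger: ReadonlyArray<LedgerEntry>;
}

// Adapter side: classify only. 429 and 5xx are retryable; in this sketch
// every other status (400, 401, 403, and anything unlisted) is fatal.
function isRetryableStatus(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Runner side: owns the policy. Advance on a retryable error, stop on a
// fatal one, and record usage for every attempt, including failed ones.
async function runFallbackChain(
  providers: ReadonlyArray<Provider>,
  prompt: string,
): Promise<ChainOutcome> {
  const usageLedger: LedgerEntry[] = [];
  let lastFailure: LlmError | undefined;

  for (const provider of providers) {
    const result = await provider.call(prompt);
    if (result.ok) {
      usageLedger.push({ providerId: provider.id, usage: result.usage });
      return { result, usageLedger };
    }
    usageLedger.push({ providerId: provider.id, usage: result.error.usage });
    lastFailure = result.error;
    if (!result.error.retryable) {
      break; // fatal: do not fall through the chain and mask a real bug
    }
  }

  if (lastFailure === undefined) {
    throw new Error('fallback chain needs at least one provider');
  }
  return { result: { ok: false, error: lastFailure }, usageLedger };
}
```

With a chain of two providers where the first returns a 429, the runner moves on and the ledger holds two entries; if the first returns a 401 instead, the runner stops and the second provider is never called.
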
- A `catch` block must do one of: handle the error meaningfully, enrich and rethrow it, or classify it. An empty catch, or `catch { /* ignore */ }`, is forbidden: it is how failures become invisible.
- Never catch broadly and continue as if nothing happened. If a failure is genuinely ignorable, that is a deliberate decision recorded in a comment and, where it matters, logged at `warn`.
- No floating promises: an unhandled rejection is a silent catch with extra steps (lint-enforced, see code-style-typescript.md).

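
Two of the permitted `catch` shapes from the list above, sketched under assumptions: `ConfigLoadError`, `parseConfig`, `readOptionalCache`, and the `Logger` interface are hypothetical names used only for this example, and the `{ cause }` option assumes an ES2022 target.

```typescript
// Hypothetical names for illustration only.
interface Logger {
  warn(message: string, fields: Record<string, unknown>): void;
}

class ConfigLoadError extends Error {
  readonly code: 'config_load_failed' = 'config_load_failed';
  readonly path: string;

  constructor(path: string, cause: unknown) {
    super('Failed to load config', { cause });
    this.name = 'ConfigLoadError';
    this.path = path;
  }
}

// Enrich and rethrow: the parse failure gains a code and structured context,
// and the original error is kept as the cause.
function parseConfig(path: string, raw: string): unknown {
  try {
    return JSON.parse(raw);
  } catch (cause) {
    throw new ConfigLoadError(path, cause);
  }
}

// Deliberately ignorable failure: the decision is written down and logged.
function readOptionalCache(read: () => string, logger: Logger): string | undefined {
  try {
    return read();
  } catch (cause) {
    // An unreadable cache only costs a recomputation, so we continue without it.
    logger.warn('cache read failed; continuing without cache', { cause: String(cause) });
    return undefined;
  }
}
```

Neither function has a path where the error disappears: one rethrows a richer error, the other records why continuing is safe.
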
We distinguish the two and never leak one as the other:
- User-facing errors are actionable and safe: "No API key found for provider `anthropic`; add one in Settings", "Workflow `x.relavium.yaml` failed validation at node `summarize`: missing `model`". They are phrased for the person, name the next step, and never contain secrets, raw provider payloads, stack traces, or internal file paths (see security-review.md).
- Internal errors carry the full detail (stack, cause chain, provider `raw`) for diagnosis and go to structured logs, never to the frontend verbatim. The mapping from internal → user-facing happens at the surface boundary, once.
- Errors surfaced through the run-event stream use the canonical `node:failed` and `run:failed` events (see the SSE event schema); they carry a user-safe message plus an internal correlation id, not a raw exception.
- Content-policy rejections are fatal, with their own cause. A provider content-filter block (a text turn or a media generation) carries `content_filter` (1.AG, ADR-0045): fatal, and distinct from `validation` (an authoring/shape error) so a surface shows the right cause/remediation; re-issuing the same blocked content just re-blocks, so it is never retried.
- Resource/limit codes are fatal-without-user-action, never silent. `budget_exceeded`, `run_timeout`, and `turn_limit` (a hard agent/session turn/round cap, distinct from the `[chat].max_messages` history-trim threshold of config-spec.md, which continues the session and emits no error) end the work with a typed event carrying that code; the engine never loops past a cap or quietly stops under one. They are not retryable by policy: continuing past a limit is an explicit user decision (raise the cap, resume the session), not something a runner retries into. Note that `budget_exceeded` is the fail-path code only: ADR-0028's `on_exceed: warn`/`pause_for_approval` branches emit `budget:warning`/`budget:paused` events and do not use this code.
- Tool-dispatch codes split on policy vs execution. A tool policy / grant denial (tool-registry.md, ADR-0029) carries `tool_denied` and is fatal: a denied call is deterministic, never retried (re-issuing it just re-denies). A tool execution failure (the host capability threw a transient/runtime error) carries `tool_failed` and is retryable within the node retry budget. An absent host capability is `internal` (a host/config gap, not the model's fault). A tool aborted by the run's `AbortSignal` surfaces on the cancellation path (`cancelled`), never `tool_failed`, so it composes with the ADR-0036 cancel-precedence rule. Messages stay scrubbed to the code + a user-safe string (the tool id / field, never an argument value or a host stack).
- `sandbox_error` splits on determinism. An expression-sandbox failure (ADR-0027, expression-sandbox-spec.md) carries the closed `sandbox_error` code. The deterministic causes (a syntax error, a runtime `Reference`/`TypeError`, a memory/stack overflow, or a non-conforming result) are fatal, since a retry repeats them; only a wall-clock-timeout trip is retryable (a non-idempotent safety net that may pass on re-execution, bounded by the node retry budget). The message is scrubbed to the code + a generic string, never the expression source, a variable name, or a scope value.
- The catch-all is `internal`, never a silent stop. An uncaught throw from a node handler with no more specific classification maps to `internal` with `retryable: false` (a tool throw is `tool_failed`, a sandbox throw is `sandbox_error` per its determinism split, above). The run loop (ADR-0036) catches it, emits a single `run:failed{internal}` rather than hanging, and stamps a secret-free `correlationId` joining the user-safe message to the internal log, so an unexpected engine fault is always loud and attributable, never a zombie run.

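
The retryability rules and the `internal` catch-all described above can be condensed into a small table plus one mapping function. This is a sketch under assumptions: the type names, the `SandboxCause` labels, the event shape, and the wording of the user-safe message are all hypothetical (the canonical event shape is the SSE event schema), and the cancellation path is left out.

```typescript
// Hypothetical types for illustration; the code strings come from the text,
// everything else (names, shapes, message wording) is assumed.
type ErrorCode =
  | 'content_filter'
  | 'validation'
  | 'budget_exceeded'
  | 'run_timeout'
  | 'turn_limit'
  | 'tool_denied'
  | 'tool_failed'
  | 'sandbox_error'
  | 'internal';

type StaticCode = Exclude<ErrorCode, 'sandbox_error'>;
type SandboxCause = 'syntax' | 'runtime' | 'overflow' | 'non_conforming_result' | 'timeout';

type Failure =
  | { readonly code: StaticCode }
  | { readonly code: 'sandbox_error'; readonly cause: SandboxCause };

// Codes whose retryability does not depend on the cause.
const STATIC_RETRYABLE: Record<StaticCode, boolean> = {
  content_filter: false, // re-issuing blocked content just re-blocks
  validation: false,
  budget_exceeded: false, // continuing past a cap is a user decision
  run_timeout: false,
  turn_limit: false,
  tool_denied: false, // a denial is deterministic
  tool_failed: true, // retryable within the node retry budget
  internal: false,
};

function isRetryable(failure: Failure): boolean {
  if (failure.code === 'sandbox_error') {
    // Only the wall-clock timeout may pass on re-execution.
    return failure.cause === 'timeout';
  }
  return STATIC_RETRYABLE[failure.code];
}

interface RunFailedEvent {
  readonly type: 'run:failed';
  readonly code: ErrorCode;
  readonly message: string; // user-safe
  readonly correlationId: string; // joins the event to the internal log
}

interface InternalLogEntry {
  readonly correlationId: string;
  readonly detail: string;
}

// Catch-all: full detail goes to the log, and the event carries only the
// code, a generic message, and the correlation id.
function toInternalRunFailed(
  thrown: unknown,
  correlationId: string,
  log: (entry: InternalLogEntry) => void,
): RunFailedEvent {
  const detail = thrown instanceof Error ? (thrown.stack ?? thrown.message) : String(thrown);
  log({ correlationId, detail });
  return {
    type: 'run:failed',
    code: 'internal',
    message: 'The run failed because of an unexpected internal error.',
    correlationId,
  };
}
```

Keeping the table exhaustive over `StaticCode` means adding a new code without deciding its retryability is a compile error rather than a silent default.
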
Untrusted input (parsed YAML, IPC payloads, provider responses, env/config) is validated with a Zod schema at the boundary and fails with a typed, user-facing validation error that names the offending field. We validate once, at the edge, then trust the typed value inside the core.

A node's `output_schema` is enforced NODE-SIDE. The seam's `LlmRequest.responseFormat` (ADR-0030)
is a request-side hint only: no adapter validates the response against it, and DeepSeek degrades
to bare `json_object` with no schema. So the AgentRunner (1.O) lowers `output_schema` to
`responseFormat` and validates the returned content node-side. Phase-1 scope is parse-as-JSON
only: an output that does not parse as valid JSON maps to `code: 'validation'` (`retryable: false`;
a re-ask is a node-retry/authoring concern, not a node-level loop); a schema-violating but
valid-JSON output is not yet rejected. Deep JSON-Schema conformance is a deferred follow-up (it
needs a JSON-Schema validator dependency behind an ADR, since Zod cannot consume an arbitrary
JSON-Schema). See agent-runner.md.
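
The Phase-1 parse-as-JSON check could be sketched as below. `checkNodeOutput` and the `OutputCheck` shape are hypothetical names for illustration, and the actual AgentRunner logic is described in agent-runner.md; only the `validation` code and `retryable: false` come from the text.

```typescript
// Hypothetical sketch of the Phase-1 node-side check; names are illustrative.
type OutputCheck =
  | { readonly ok: true; readonly value: unknown }
  | {
      readonly ok: false;
      readonly code: 'validation';
      readonly retryable: false;
      readonly message: string; // user-safe, no raw model output
      readonly cause: unknown; // kept for the internal log only
    };

function checkNodeOutput(content: string): OutputCheck {
  try {
    // Phase 1 only asks "does it parse?"; the parsed value is not yet
    // checked against the node's output_schema.
    return { ok: true, value: JSON.parse(content) };
  } catch (cause) {
    return {
      ok: false,
      code: 'validation',
      retryable: false,
      message: 'Node output is not valid JSON.',
      cause,
    };
  }
}
```

Under this sketch, `checkNodeOutput('42')` succeeds even when the schema expects an object, which matches the stated Phase-1 gap: valid JSON that violates the schema is not yet rejected.
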