Agent Runtime Architecture Audit

Date: 2026-02-08 Auditor: Automated deep-analysis (4 parallel research agents) Scope: Full cuervo-cli codebase — 701 tests, 9 crates, ~30k LOC, 5.1MB binary Status: Phase 10 UX complete, pre-Phase 11 (Agent Runtime Hardening)

Executive Summary
Runtime Execution Model
Observability & Tracing
Safety & Security
User Experience Layer
Consolidated Issue Registry
Recommendations

1. Executive Summary

Cuervo CLI's agent runtime is architecturally sound with strong foundations: round-based execution with budget enforcement, multi-provider resilience, TBAC authorization, hash-chained audit trail, and NO_COLOR-aware UX. However, critical gaps exist in four areas:

Area	Critical	Major	Minor	Total
Runtime	4	7	5	16
Observability	6	14	5	25
Safety	4	6	6	16
UX	3	5	6	14
Total	17	32	22	71

Top 5 Critical Findings:

Provider invocation has no timeout — process hangs on network failure
Tool results not sanitized — prompt injection via tool output
13 of 21 domain events defined but never emitted — 38% audit coverage
Tool execution metrics not persisted — cannot identify slow/failing tools
Default budget is unlimited (0) — unbounded API spend without opt-in limits

2. Runtime Execution Model

2.1 Agent Loop Architecture

File: crates/cuervo-cli/src/repl/agent.rs (1,886 lines)

The agent loop is a sequential state machine with explicit rounds:

for round in 0..limits.max_rounds {
    Build ModelRequest (messages, tools, system prompt)
    → Check pre-invocation guardrails
    → Attempt response cache lookup
    → If miss: invoke provider with resilience routing
    → Stream response via tokio::select! { chunk, ctrl_c }
    → Accumulate text + tool uses
    → Track usage (tokens, cost, latency)
    → Check post-invocation guardrails
    → Check token/duration budgets
    → If ToolUse: execute tools, append results, continue
    → If EndTurn: append message, break
}

Stop conditions (checked in order):

EndTurn — provider signals completion (line 872)
TokenBudget — cumulative tokens >= limit (line 807)
DurationBudget — elapsed wall-clock >= limit (line 840)
GuardrailBlock — pre/post guardrail blocks (lines 475, 764)
MaxRounds — loop exhausts max_rounds (line 1134, default 25)
ProviderError — invoke_with_fallback returns Err (line 611)
Interrupted — Ctrl+C via tokio::signal (line 585)

Determinism: Mostly deterministic with two non-determinism sources:

Parallel tool execution uses futures::join_all() — results sorted by tool_use_id for deterministic output order, but execution timing varies
Speculative routing uses futures::select_ok() — first provider to respond wins

2.2 Concurrency Model

Primitive	Location	Purpose
`tokio::select!`	agent.rs:548	Stream polling + Ctrl+C cancellation
`tokio::time::timeout`	agent.rs:359	15s compaction timeout
`futures::join_all`	executor.rs:194	Parallel tool execution
`futures::select_ok`	speculative.rs	Speculative provider racing
`tokio::sync::Semaphore`	backpressure.rs	Per-provider permit limiting
`tokio::broadcast`	cuervo-core	Domain event bus (256 buffer)
`spawn_blocking`	executor.rs:345	stdin/stderr permission I/O

Blocking I/O points: Permission prompts (spawn_blocking), all DB operations (AsyncDatabase wrapper).

2.3 Error Handling

Provider errors: invoke_with_fallback Err → stop spinner → log → record failed metric → user_error() → return ProviderError. No retry on promoted-fallback failure.

Tool errors: Wrapped in ToolResult with is_error: true → appended to messages → triggers reflexion (if enabled) → triggers adaptive replanning (first failure only, once per round).

DB errors: Fire-and-forget with tracing::warn!. Failures never propagate or affect loop execution.

Panics in production: None found. All unwrap()/expect() confined to test code or safe contexts.

2.4 Memory & Token Management

Message history growth: Unbounded unless compaction or budget enforcement active. Per-round: 1-3 messages appended (assistant text, tool uses, tool results). No hard limit on message count.

Context compaction: Optional, 15s timeout, invokes same provider for summarization. If timeout or disabled, messages accumulate indefinitely.

Token budget: Checked AFTER response received (not before) — can overshoot by one full round. Default: 0 (unlimited).

Cost tracking: Estimated only (provider.estimate_cost()), not validated against actual API billing. Tracked but not enforced as a budget.

2.5 Runtime Issues

ID	Issue	Severity	Location	Description
RT-1	Provider invocation has no timeout	CRITICAL	agent.rs:530	invoke_with_fallback() can hang indefinitely on network failure. Only Ctrl+C or OS kill terminates.
RT-2	Unbounded message history growth	CRITICAL	agent.rs:513,913,1117	No hard limit on message count. Compaction is best-effort (15s timeout, can be skipped). OOM on long sessions.
RT-3	Token budget checked post-response	CRITICAL	agent.rs:807	Budget enforced AFTER response, not before. Can overshoot by 10k+ tokens in one round.
RT-4	Tool result accumulation unbounded	CRITICAL	agent.rs:963	All tool results held in memory before append. OOM on parallel batch with large outputs (e.g., file_read 100MB).
RT-5	Speculative routing non-determinism	MAJOR	speculative.rs	Different provider used on each run. Cost/latency varies unpredictably. Not logged which provider won.
RT-6	No fallback retry chain	MAJOR	agent.rs:138	If promoted fallback also fails, no further fallback attempted. Single point of failure.
RT-7	Spinner may race with first chunk	MAJOR	agent.rs:526	Spinner started as fire-and-forget. May attempt stop() before start() on fast responses.
RT-8	Resilience pre-filters but no "all unhealthy" fallback	MAJOR	agent.rs:71	If ALL providers unhealthy, returns error instead of trying anyway.
RT-9	Tool permissions lack granularity	MAJOR	executor.rs:284	TBAC checks tool name + params but no context-aware argument validation.
RT-10	No per-tool invocation limit across rounds	MAJOR	agent.rs	Same tool can be called up to max_rounds times (25 default).
RT-11	Event bus has no backpressure	MAJOR	cuervo-core	broadcast::channel(256) drops events if subscriber can't keep up.
RT-12	Message history fully cloned per round	MINOR	agent.rs:418	request.messages.clone() copies entire history. Use Arc or message IDs instead.
RT-13	Reflexion always runs on non-Success	MINOR	agent.rs:987	May pollute reflection storage with transient error noise.
RT-14	Trace step index incremented on error	MINOR	agent.rs:229	Trace steps may have gaps if DB writes fail.
RT-15	Cache miss-then-hit in same round impossible	MINOR	agent.rs:481	Cache checked before invocation. Next message has different messages, so cache ineffective for streaks.
RT-16	Cost estimates only, not enforced	MINOR	agent.rs:676	estimated_cost_usd tracked but never used to stop loop (unlike tokens).

3. Observability & Tracing

3.1 Structured Logging

Framework: tracing + tracing-subscriber (stderr, --verbose flag)

Agent loop spans:

run_agent_loop() [#[instrument(skip_all)]]
  └─ gen_ai.agent.round (info_span!)
     └─ [no child spans — flat hierarchy]

Gap: No spans for tool execution, provider invocation, permission checks, or resilience decisions. Only top-level round span exists.

Ad-hoc logging: 11 eprintln! calls in agent.rs bypass structured logging (lines 357, 474, 483, 544, 593, 646, 763, 815, 848, 888, 1136). These critical operational messages are not queryable, timestamped, or captured in log files.

3.2 Trace Recording

Schema: trace_steps table with session_id, step_index, step_type, data_json, duration_ms, timestamp.

Step types: ModelRequest, ModelResponse, Error, ToolCall, ToolResult (5 types).

Coverage: Complete for model invocation + tool execution. Missing: plan generation, compaction, reflection, resilience decisions.

Replay: Designed for replay (append-only, ordered by step_index) but no replay implementation exists.

3.3 Metrics

Persisted: InvocationMetric per model round — provider, model, latency_ms, tokens, cost, success, stop_reason, session_id.

Missing metrics:

Metric	Impact
Tool execution time	Cannot identify slow tools or set tool-level SLOs
Tool invocation count	Unknown tool usage patterns
Tool success/failure rate	Cannot identify problematic tools
Context compaction overhead	Cannot tune compaction thresholds
Reflection cost (tokens, latency)	Conflated with round cost
Guard rejection rates	Cannot assess guard effectiveness
Cache hit/miss ratio by model	Cannot recommend cache tuning per model
Provider fallback frequency	Opaque fallback patterns

3.4 Domain Event Bus

21 EventPayload variants defined. Only 8 actively emitted (38%).

Event	Status	Emitted From
ModelInvoked	Emitted	agent.rs:665
ToolExecuted	Emitted	executor.rs:208,415
PermissionRequested	Emitted	executor.rs:334
PermissionGranted	Emitted	executor.rs:382
PermissionDenied	Emitted	executor.rs:363
PlanGenerated	Emitted	agent.rs:292
PlanStepCompleted	Emitted	agent.rs:1057
GuardrailTriggered	Emitted	agent.rs:467,756
ReflectionGenerated	Emitted	agent.rs:999
AgentStarted	NOT EMITTED	—
AgentCompleted	NOT EMITTED	—
PiiDetected	NOT EMITTED	—
SessionStarted	NOT EMITTED	—
SessionEnded	NOT EMITTED	—
ConfigChanged	NOT EMITTED	—
CircuitBreakerTripped	NOT EMITTED	—
HealthChanged	NOT EMITTED	—
BackpressureSaturated	NOT EMITTED	—
ProviderFallback	NOT EMITTED	—
PolicyDecision	NOT EMITTED	—
EpisodeCreated	NOT EMITTED	—
MemoryRetrieved	NOT EMITTED	—

Resilience events bypass the event bus entirely — written directly to resilience_events table, never broadcast as DomainEvent.

3.5 Audit Trail

Schema: audit_log table with hash chain (SHA-256, previous_hash → hash).

Coverage: 8 of 21 event types actually persisted.

Gap: session_id is inside JSON payload, not a column — cannot efficiently query "all events for session X" without JSON parsing.

3.6 OpenTelemetry Readiness

Status: Not ready.

Only 1 span uses GenAI conventions (gen_ai.agent.round)
Missing: gen_ai.response.input_tokens, gen_ai.response.output_tokens, gen_ai.response.finish_reason
No OTLP exporter configured
No span hierarchy for tool/provider/permission/resilience operations

3.7 Observability Issues

ID	Issue	Severity	Location	Description
OB-1	Tool execution duration not persisted	CRITICAL	executor.rs	duration_ms captured locally but never inserted to metrics DB
OB-2	Resilience events bypass event bus	CRITICAL	resilience.rs	Circuit breaker/health/fallback events not broadcast
OB-3	13 domain events never emitted	CRITICAL	event.rs	Security audit blind spots (PiiDetected, PolicyDecision not tracked)
OB-4	No cost validation against actual API spend	CRITICAL	metrics.rs	Estimates may diverge from actual billing
OB-5	No tool-level metrics table	CRITICAL	—	Cannot optimize tool performance or set SLOs
OB-6	Session ID not indexed in audit_log	CRITICAL	audit.rs	Cannot efficiently reconstruct session history
OB-7	11 eprintln! bypass structured logging	MAJOR	agent.rs	Critical messages not queryable or timestamped
OB-8	Stop reason format inconsistent	MAJOR	agent.rs:777	Debug format "EndTurn" vs serde "end_turn" in different tables
OB-9	No span context for provider invocation	MAJOR	agent.rs	Provider latency invisible to distributed tracing
OB-10	Reflection cost not tracked separately	MAJOR	agent.rs:990	Conflated with round cost
OB-11	Cache not linked to model	MAJOR	response_cache.rs	Cannot analyze cache effectiveness per model
OB-12	Permission events only for Destructive	MAJOR	executor.rs:333	ReadOnly execution invisible to audit
OB-13	Guardrail violations not queryable	MAJOR	agent.rs:750	No aggregate table
OB-14	Plan generation not persisted to trace	MAJOR	agent.rs:299	Plan cost not in replay trace
OB-15	Compaction overhead not measured	MAJOR	agent.rs:359	Cannot identify compaction as bottleneck
OB-16	Trace recording failures not monitored	MAJOR	agent.rs:228	Silent trace data loss
OB-17	No OTLP export configured	MAJOR	—	Not compatible with cloud observability
OB-18	Cost estimation accuracy not validated	MAJOR	metrics.rs	No reconciliation against API billing
OB-19	Round outcomes not persisted	MAJOR	executor.rs	Cannot identify round success/failure patterns
OB-20	Health score formula duplicated	MAJOR	resilience.rs, doctor.rs	Two independent implementations may diverge
OB-21	EventPayload match not exhaustive-enforced	MINOR	audit.rs	New variants silently missed
OB-22	Trace step JSON schema not documented	MINOR	trace.rs	Consumers must reverse-engineer format
OB-23	Memory retrieval not tracked as event	MINOR	memory.rs	Cannot audit knowledge base queries
OB-24	Episode creation not tracked as event	MINOR	episodes.rs	Cannot audit episodic memory lifecycle
OB-25	Cost display precision truncates micro-txns	MINOR	agent.rs:687	format!("${:.4}") rounds < $0.0001

4. Safety & Security

4.1 Permission Model

Two-layer authorization:

Layer 1: TBAC (Task-Based Authorization Control)

TaskContext with allowed_tools (HashSet), parameter_constraints (PathRestriction, CommandAllowlist, ValueAllowlist), max_invocations, expires_at
Child contexts enforce narrowing (intersection of parent + child tools)
Disabled by default (tbac_enabled: false in config)
Returns AuthzDecision: Allowed, ToolNotAllowed, ParamViolation, ContextInvalid, NoContext

Layer 2: Legacy Permission Check

ReadOnly/ReadWrite: auto-allowed (no prompt)
Destructive: prompt if confirm_destructive=true AND tool not in always_allowed
"Always allow" is session-scoped, in-memory only

Wiring: TBAC check runs BEFORE legacy check (executor.rs:284). If TBAC returns NoContext, falls through to legacy.

4.2 Sandboxing

Unix rlimit-based only:

RLIMIT_CPU: max CPU seconds (default 60s)
RLIMIT_FSIZE: max file size bytes (default 50MB)
Output truncation: 60% head + 30% tail (max_output_bytes)

Not enforced:

RLIMIT_AS (memory) — omitted for macOS/container compatibility
RLIMIT_NPROC (subprocess limits) — not implemented
Network isolation — no rlimit, namespace, or eBPF firewall
Filesystem isolation — no chroot or mount namespace
System call filtering — no seccomp

Platform: Unix only (#[cfg(unix)]). No-op on Windows.

4.3 Input Validation

PII Detection: 14-pattern RegexSet (SIMD) detector exists in cuervo-security but is NOT wired into the agent loop. Manual configuration required.

Tool Input: Minimal validation. Bash commands passed directly to Command::new("bash").arg("-c").arg(command). No escaping, no length limits, no sanitization.

Blocked Patterns: Config defines patterns (.env, *.key, credentials.json) but they are NOT enforced by any tool. Config-only with no implementation.

4.4 Guardrails

Built-in (always-on when enabled):

PromptInjectionGuardrail — 4 patterns, warn-only
CodeInjectionGuardrail — 5 destructive patterns (rm -rf, fork bomb, mkfs, dd, curl|bash), warn-only

Checkpoints:

Pre-invocation: scans last user message before sending to model
Post-invocation: scans model output after response

Default action: Warn only. Built-in guardrails do not block or redact. Only custom config-based guardrails with explicit "block" action will halt execution.

4.5 Tool Output Injection

Attack flow: Tool results are appended directly to message history without sanitization. Malicious tool output (e.g., "ignore previous instructions") reaches the model in the next round as a legitimate ToolResult block. Post-invocation guardrails only warn after the model has already been influenced.

4.6 Safety Issues

ID	Issue	Severity	Location	Description
SF-1	Default budget unlimited (0)	CRITICAL	config.rs:530	Unbounded API spend without opt-in limits. Config warns but doesn't enforce.
SF-2	No tool input validation/escaping	CRITICAL	bash.rs:52	Command injection via model-generated bash commands. `$(...)` expansion not prevented.
SF-3	Tool results not sanitized	CRITICAL	agent.rs:972	Prompt injection via tool output influences next round. Guardrails only warn post-hoc.
SF-4	confirm_destructive=false bypasses all prompts	CRITICAL	permissions.rs:126	User misconfiguration allows bash/file_write without any interaction.
SF-5	TBAC disabled by default	MAJOR	config.rs:375	Parameter constraints (command allowlist, path restrictions) not enforced unless explicitly enabled.
SF-6	Blocked patterns not enforced	MAJOR	config.rs:311	.env, *.key patterns defined but NOT validated by file tools.
SF-7	Post-invocation guardrails warn-only	MAJOR	guardrails.rs:233	Model output with dangerous patterns continues to next round.
SF-8	No network isolation for bash	MAJOR	bash.rs	Commands can curl/wget, exfiltrate data, connect to external servers.
SF-9	No memory limit on subprocess	MAJOR	sandbox.rs:97	RLIMIT_AS omitted. Bash can consume all available memory.
SF-10	Unbounded parallel tool execution	MAJOR	executor.rs:194	No max concurrent limit on join_all. Could spawn hundreds of processes.
SF-11	PII detector not wired	MINOR	pii.rs	14-pattern detector exists but never called in agent loop.
SF-12	Prompt injection patterns basic	MINOR	guardrails.rs:207	Only 4 hardcoded patterns. Sophisticated injection bypasses detection.
SF-13	No per-tool rate limiting	MINOR	tools/*	Same tool invocable unlimited times per session without TBAC.
SF-14	Backpressure semaphore no timeout	MINOR	backpressure.rs	acquire() blocks indefinitely if saturated.
SF-15	Compaction timeout hardcoded 15s	MINOR	agent.rs:360	Not configurable. Slow models may always timeout.
SF-16	No parallel batch concurrency limit	MINOR	executor.rs:194	join_all spawns all at once. No batching (e.g., max 10).

5. User Experience Layer

5.1 Streaming UX

StreamRenderer (render/stream.rs) uses a two-state machine:

Prose: Text deltas printed immediately per chunk
CodeBlock: Accumulated in buffer until closing ``` fence, then syntax-highlighted and printed all at once

Gap: Complete code blocks wait for closing fence — no incremental rendering feedback during long code blocks.

Spinner: "Thinking..." with elapsed time. 200ms delay before display (avoids flicker). Stops on first TextDelta, ToolUseStart, or Error chunk.

5.2 Tool Execution Display

Start: ╭─ name(args) with summarized args (path, command first 50 chars)
Result: ╰─ [OK/ERROR 42ms] with formatted duration
Output: Truncated to 50 lines max with "... (N more lines)"
Permission: [y]es [n]o [a]lways prompt for destructive tools
Denied: ╰─ [DENIED] name

All box-drawing chars respect NO_COLOR (color.rs fallback to ASCII).

5.3 Error & Warning Display

Standardized via render/feedback.rs:

Error: {message}
  Hint: {hint}

Warning: {message}
  Hint: {hint}

Coverage: Provider errors, config validation, MCP failures, session save failures. All written to stderr.

Gap: Silent failures via let _ = ... in 6+ places (event sends, trace recording, auto-save). User unaware of data loss.

5.4 Progress & State Visibility

Signal	Location	Display
Per-message	mod.rs:463	`[N tokens \| Xs \| $cost \| M tool rounds]`
Per-session (exit)	mod.rs:300	`Session: N rounds \| Xs \| $cost \| M tools \| session_id`
Cache hit	agent.rs:483	`[cached]`
Compaction	agent.rs:357	`[compacting context...]`
Guardrail block	agent.rs:474	`[blocked by guardrail]`
Token budget	agent.rs:815	`[token budget exceeded: X / Y tokens]`
Duration budget	agent.rs:848	`[duration budget exceeded: Xs / Ys]`
Max rounds	agent.rs:1136	`[max rounds reached: X]`
Round separator	agent.rs:888	`--- round N ---` (only if rounds > 1)

5.5 Accessibility

NO_COLOR env var + TERM=dumb detection (OnceLock at startup)
Unicode fallback: ╭→+, ╰→+, ─→-, │→|
Syntax highlighting disabled when NO_COLOR set
No structured/JSON output mode for script consumption

5.6 UX Issues

ID	Issue	Severity	Location	Description
UX-1	First round has no separator	CRITICAL	agent.rs:887	User can't tell when tool execution starts. Separator only shown for rounds > 1.
UX-2	Auto-save failures silent	CRITICAL	mod.rs:513	Fire-and-forget with tracing::warn! only. User unaware of session data loss.
UX-3	Code blocks not rendered incrementally	CRITICAL	stream.rs:135	Entire code block waits for closing fence. No feedback during long code blocks.
UX-4	No per-round cost display	MAJOR	agent.rs:684	Cost only at debug log level. Users can't track per-round spend.
UX-5	Round separator timing confusing	MAJOR	agent.rs:888	Shown AFTER round completes, not BEFORE. Confuses round boundaries.
UX-6	TBAC denial no user-friendly message	MAJOR	executor.rs:303	Generic error, doesn't call render_tool_denied() for consistency.
UX-7	No progress during long tool execution	MAJOR	executor.rs:70	Tools can take seconds with no feedback. CLI appears hung.
UX-8	Config validation hints incomplete	MAJOR	chat.rs:152	Suggestions missing for many validation errors.
UX-9	Stream error doesn't set done flag	MINOR	stream.rs:69	Error chunk doesn't terminate rendering. Must wait for Done.
UX-10	stdout/stderr mixed	MINOR	tool.rs:23	Tool status → stderr, responses → stdout. Inconsistent when combined.
UX-11	Cache miss silent	MINOR	agent.rs:480	Only hit shows [cached]. No indication of miss.
UX-12	Spinner brackets inconsistent	MINOR	spinner.rs	Uses () for elapsed time. Other status uses [].
UX-13	Language label detection fragile	MINOR	stream.rs:99	Language label can be split across TextDelta chunks.
UX-14	No JSON output mode	MINOR	—	All output human-formatted. Not script-parseable.

6. Consolidated Issue Registry

Critical Issues (17 total)

ID	Category	Issue	Impact	Complexity
RT-1	Runtime	Provider invoke no timeout	Process hangs on network failure	Low
RT-2	Runtime	Unbounded message history	OOM on long sessions	Medium
RT-3	Runtime	Token budget post-check	Budget overshoot by 10k+ tokens	Low
RT-4	Runtime	Tool results unbounded	OOM on large parallel batch	Medium
OB-1	Observability	Tool metrics not persisted	Cannot optimize tools	Low
OB-2	Observability	Resilience events bypass bus	Unauditable resilience decisions	Medium
OB-3	Observability	13 events never emitted	38% audit coverage	Medium
OB-4	Observability	No cost validation	Estimates may diverge from billing	High
OB-5	Observability	No tool metrics table	Cannot set tool SLOs	Low
OB-6	Observability	Session ID not in audit_log	Cannot reconstruct sessions	Low
SF-1	Safety	Default budget unlimited	Unbounded API spend	Low
SF-2	Safety	No tool input validation	Command injection via bash	High
SF-3	Safety	Tool output not sanitized	Prompt injection via results	Medium
SF-4	Safety	confirm_destructive bypass	Silent destructive execution	Low
UX-1	UX	First round invisible	Tool start unclear	Low
UX-2	UX	Auto-save failures silent	Data loss undetected	Low
UX-3	UX	Code blocks not incremental	No feedback on long blocks	Medium

Major Issues (32 total)

Runtime: RT-5 through RT-11 (7 issues) Observability: OB-7 through OB-20 (14 issues) Safety: SF-5 through SF-10 (6 issues) UX: UX-4 through UX-8 (5 issues)

Minor Issues (22 total)

Runtime: RT-12 through RT-16 (5 issues) Observability: OB-21 through OB-25 (5 issues) Safety: SF-11 through SF-16 (6 issues) UX: UX-9 through UX-14 (6 issues)

7. Recommendations

Phase 1: Critical Safety (Blocks production use)

Priority: Immediate

Add provider invoke timeout (RT-1)
- Wrap invoke_with_fallback() with tokio::time::timeout(30s)
- Configurable per-provider via HttpConfig
Add max_message_count limit (RT-2)
- Default 100 messages. Mandatory compaction when exceeded.
- New config field: agent.limits.max_messages
Check token budget before invoke (RT-3)
- Move budget check to BEFORE provider invocation, not after
Enforce non-zero budget defaults or mandatory warning (SF-1)
- Change default to max_total_tokens=200000 or make startup block on zero budget
Add tool result sanitization (SF-3)
- Run guardrail patterns on tool results before appending to messages
- Redact detected patterns (not just warn)
Change code_injection guardrail to Block (SF-7)
- Built-in guardrails should block by default, not just warn

Phase 2: Observability Foundation

Priority: Next sprint

Create tool_execution_metrics table (OB-1, OB-5)
- Schema: tool_name, duration_ms, success, session_id, created_at
- Insert per tool execution in executor.rs
Emit missing domain events (OB-3)
- AgentStarted/Completed at loop boundaries
- SessionStarted/Ended in Repl
- PiiDetected in security module (when wired)
Route resilience events through event bus (OB-2)
- Emit DomainEvent BEFORE persisting to resilience_events table
Add session_id column to audit_log (OB-6)
- New migration: ALTER TABLE audit_log ADD COLUMN session_id TEXT
- Index for session queries
Replace eprintln! with structured logging (OB-7)
- All 11 instances → tracing::info!/warn! with structured fields
- Separate render layer for user-facing messages

Phase 3: Runtime Hardening

Priority: Following sprint

Add provider invocation spans (OB-9)
- #[instrument] on invoke_with_fallback(), resilience.pre_invoke()
- Add GenAI span attributes (tokens, cost, stop_reason)
Implement fallback retry chain (RT-6)
- Try primary, then each fallback sequentially (not just first)
Cap parallel tool concurrency (SF-10)
- futures::stream::buffer_unordered(max_concurrent) instead of join_all
- Default max_concurrent=10
Wire PII detector (SF-11)
- Call PiiDetector::redact() on tool results and model output when pii_detection=true
Enforce blocked_patterns in file tools (SF-6)
- Validate file paths against blocked_patterns before tool execution

Phase 4: UX Hardening

Priority: Subsequent sprint

Fix round separator timing (UX-1, UX-5)
- Print separator BEFORE model invocation, not after
- Show for all rounds including first
Surface auto-save failures (UX-2)
- Show user_warning() on session save failure, not just tracing::warn!
Add per-round cost display (UX-4)
- Print cost to stderr after each round (not just debug log)
Add tool execution spinner (UX-7)
- Show "Executing tool..." spinner during long tool execution

Phase 5: Advanced Observability

Priority: When cloud deployment needed

OTLP export (OB-17)
- Add tracing-opentelemetry + OTLP HTTP exporter
- Span hierarchy: session → round → provider.invoke → tool.execute
Cost reconciliation (OB-4)
- Periodic validation against Anthropic API usage endpoint
Tool-level cost attribution (OB-10)
- Track reflection cost separately from round cost
Trace replay implementation
- Execute agent loop from recorded trace steps for deterministic testing

Attack Surface Summary

Vector	Exploitability	Current Mitigation	Risk
confirm_destructive=false	High (single config)	Config warning on startup	CRITICAL
Bash command injection	High (model-generated)	Permission prompt (if enabled)	CRITICAL
Tool output prompt injection	High (any tool output)	Post-invocation guardrails (warn only)	CRITICAL
Path traversal via file tools	Medium (TBAC disabled)	blocked_patterns (NOT enforced)	MAJOR
Network exfiltration via bash	Medium (requires bash tool)	No network isolation	MAJOR
Memory exhaustion via bash	Medium (no RLIMIT_AS)	CPU+FSIZE limits only	MAJOR
Unbounded parallel fork	Low (model must request)	No concurrency cap	MAJOR
Unbounded API spend	Low (requires long session)	Budget warning (not enforced)	MAJOR

Files Analyzed

Core Runtime

crates/cuervo-cli/src/repl/agent.rs — 1,886 lines (agent loop)
crates/cuervo-cli/src/repl/mod.rs — 1,247 lines (REPL lifecycle)
crates/cuervo-cli/src/repl/executor.rs — 680 lines (tool execution)
crates/cuervo-cli/src/repl/accumulator.rs — 295 lines (stream accumulation)
crates/cuervo-cli/src/repl/permissions.rs — 450+ lines (TBAC + legacy)
crates/cuervo-cli/src/repl/circuit_breaker.rs — circuit breaker FSM
crates/cuervo-cli/src/repl/backpressure.rs — semaphore-based limiting
crates/cuervo-cli/src/repl/resilience.rs — resilience manager facade

Observability

crates/cuervo-core/src/types/event.rs — 21 domain event variants
crates/cuervo-storage/src/db/traces.rs — trace recording
crates/cuervo-storage/src/db/metrics_repo.rs — invocation metrics
crates/cuervo-storage/src/db/audit.rs — hash-chained audit log
crates/cuervo-storage/src/db/resilience_repo.rs — resilience events

Safety & Security

crates/cuervo-security/src/pii.rs — 14-pattern PII detector
crates/cuervo-security/src/guardrails.rs — prompt/code injection guards
crates/cuervo-tools/src/sandbox.rs — rlimit sandboxing
crates/cuervo-tools/src/bash.rs — bash tool execution
crates/cuervo-core/src/types/auth.rs — TBAC TaskContext
crates/cuervo-core/src/types/config.rs — security configuration

UX Layer

crates/cuervo-cli/src/render/stream.rs — streaming state machine
crates/cuervo-cli/src/render/tool.rs — tool execution display
crates/cuervo-cli/src/render/spinner.rs — inference spinner
crates/cuervo-cli/src/render/feedback.rs — error/warning formatting
crates/cuervo-cli/src/render/color.rs — NO_COLOR + accessibility
crates/cuervo-cli/src/render/syntax.rs — syntax highlighting
crates/cuervo-cli/src/commands/chat.rs — chat command entry point
crates/cuervo-cli/src/commands/doctor.rs — diagnostic command

Total lines analyzed: ~8,000+ across 20+ files

FilesExpand file tree

Agent_Runtime_Audit.md

Latest commit

History