Sandboxed delegates for safely processing untrusted input #7159
tlongwell-block
started this conversation in
Ideas
Replies: 2 comments
Draft of |
Maybe one suggestion: should we separate the summarisation out from the delegate agent? For this we'd have two sub-agents.
This can help protect the main agent "slightly" more in the compromised-summary case, as we can ensure the review sub-agent is pretty paranoid.
When goose encounters untrusted text — a skill from a URL, a response from an MCP server, a file in a cloned repo — it reads that content directly into the agent's context with full tool access. If the content contains a prompt injection, the attacker inherits the agent's tools: shell, file writes, network.
The existing `PromptInjectionScanner` focuses on tool requests and is opt-in via config. It doesn't change the agent's privileges, so if something slips through, the blast radius is unchanged.
Proposed solution
Process untrusted content in a delegate that can read but not write, execute, or network.
Goose already supports spawning delegates with filtered extensions (PR #6964). A delegate with `extensions: ["reader"]` gets a single read-only tool: it can explore files and directories within the working directory, but has no shell, no file writes, and no network access. This is enforced in Rust by the extension filtering chain, not by prompt engineering.
If the delegate gets prompt-injected, the worst case is that it reads files in the working directory and produces misleading text output. No code execution, no exfiltration beyond its response text, no writes.
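As a rough illustration of what "enforced by the extension filtering chain" means in practice, the sketch below decides tool visibility with an allowlist check in plain code, before the model is offered any tool. The `Tool` struct, `filter_tools`, and the example tool names are hypothetical stand-ins, not goose's actual types or API.

```rust
// Hypothetical sketch: not goose's actual types or API.
// Each tool is tagged with the extension that provides it.
#[derive(Debug, Clone, PartialEq)]
pub struct Tool {
    pub extension: String,
    pub name: String,
}

/// Return only the tools whose extension is on the delegate's allowlist.
/// An empty allowlist yields no tools at all.
pub fn filter_tools(all: &[Tool], allowed_extensions: &[&str]) -> Vec<Tool> {
    all.iter()
        .filter(|t| allowed_extensions.contains(&t.extension.as_str()))
        .cloned()
        .collect()
}

/// Illustrative tool set for a full-privilege agent (names are assumptions).
pub fn example_tools() -> Vec<Tool> {
    let mk = |e: &str, n: &str| Tool {
        extension: e.to_string(),
        name: n.to_string(),
    };
    vec![
        mk("developer", "shell"),
        mk("developer", "write_file"),
        mk("reader", "read"),
    ]
}
```

Calling `filter_tools(&example_tools(), &["reader"])` leaves only the `read` tool. Because the decision is made outside the model, an injected prompt cannot talk its way back to `shell`.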
Reasoning is separated from capability, with a deterministic enforcement layer at the boundary. The schema validation uses the existing `FinalOutputTool` + `jsonschema` crate, already shipped for recipes; it just needs plumbing to `delegate()`. This follows capability-mediation patterns described in CaMeL and related agent-safety research.
The `reader` extension
A new platform extension (`default_enabled: false`) that provides a single `read` tool:
`../` traversal
The reader is never available to the main goose agent (which has the full `developer` extension). It exists solely for sandboxed delegates: the coordinator explicitly passes `extensions: ["reader"]` when spawning a delegate to process untrusted content.
Example: canary delegate for skill evaluation
This pattern is already used in production by a skill security evaluator agent internally at Block. Before the evaluator reads untrusted skill content itself, it sends a sandboxed delegate to read it first, as a canary. If the skill contains a prompt injection, the canary gets hit instead, but with only the read-only tool it can't do anything dangerous.
The key: the delegate's instructions tell it what file to read. The untrusted content is never loaded into the delegate's system prompt via `source:`; it stays at arm's length, accessed only through the read-only `reader` tool.
The evaluator then inspects the canary output, which is itself treated as untrusted data.
If the canary looks compromised, the evaluator stops immediately and reports CRITICAL without ever reading the skill content itself.
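A minimal sketch of that stop-on-compromise check, treating the canary's reply purely as data to scan. The marker strings, the size limit, and the `Verdict` type are illustrative assumptions, not the checks the evaluator at Block actually uses.

```rust
// Hypothetical sketch: the markers and size limit below are illustrative
// assumptions, not the checks used by the evaluator described above.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    /// Canary output looks like a plain description; evaluation may continue.
    Proceed,
    /// Canary output looks compromised; stop and report without reading the skill.
    Critical(String),
}

const MAX_CANARY_BYTES: usize = 4_000;
const SUSPICIOUS_MARKERS: [&str; 3] = [
    "ignore previous instructions",
    "system prompt",
    "you must now",
];

/// Treat the canary's reply as untrusted data: scan it, never act on it.
pub fn inspect_canary(output: &str) -> Verdict {
    if output.len() > MAX_CANARY_BYTES {
        return Verdict::Critical("canary output exceeds size limit".to_string());
    }
    let lowered = output.to_lowercase();
    for marker in SUSPICIOUS_MARKERS.iter() {
        if lowered.contains(*marker) {
            return Verdict::Critical(format!("canary output contains {:?}", marker));
        }
    }
    Verdict::Proceed
}
```

String matching like this is heuristic and easy to evade, which is the motivation for moving to schema-validated structured output.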
With `FinalOutputTool` plumbed to delegates (see below), the delegate would be forced to return structured JSON validated against a schema, making compromise detection deterministic rather than heuristic.
Known limitations
The sandboxed delegate can still produce misleading summaries that influence the coordinator's subsequent decisions. Structured JSON output with deterministic schema validation mitigates this (the coordinator validates fields against source data rather than trusting prose), but it's not a complete defense against semantic manipulation. This pattern reduces blast radius from "arbitrary code execution" to "potentially misleading text" — a significant improvement, not a silver bullet. Each delegate call is an extra LLM round-trip, so it's best suited for one-time gates (skill install, PR review) rather than high-frequency streams.
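To make "validates fields against source data rather than trusting prose" concrete, here is one way a coordinator-side check might look. The `DelegateReport` fields are invented for illustration; the point is only that each field is re-derived or confirmed in plain code, with no LLM involved.

```rust
// Hypothetical sketch: field names are assumptions for illustration only.
/// Fields a delegate might return as structured output.
pub struct DelegateReport {
    pub line_count: usize,
    /// Verbatim excerpts the delegate claims appear in the file.
    pub quoted_snippets: Vec<String>,
}

/// Check the report against the source in plain code.
/// Returns the first mismatch found, if any.
pub fn verify_report(report: &DelegateReport, source: &str) -> Result<(), String> {
    let actual_lines = source.lines().count();
    if report.line_count != actual_lines {
        return Err(format!(
            "line_count mismatch: reported {}, actual {}",
            report.line_count, actual_lines
        ));
    }
    for snippet in &report.quoted_snippets {
        if !source.contains(snippet.as_str()) {
            return Err(format!("snippet not found in source: {:?}", snippet));
        }
    }
    Ok(())
}
```

A check like this catches fabricated facts, but not a report that is accurate field by field yet misleading overall, which is the residual risk described above.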
What already exists
`extensions: []` or `extensions: ["reader"]`
`FinalOutputTool` + `jsonschema`: `crates/goose/src/agents/final_output_tool.rs`, deterministic JSON schema validation with retry. Already works for recipes via `Recipe.response.json_schema`.
`ToolInspector` chain: `crates/goose/src/tool_inspection.rs`
`PromptInjectionScanner`: `crates/goose/src/security/scanner.rs`
`default_enabled: false` (#6193)
What's needed
Plumb `FinalOutputTool` to delegates. The `DelegateParams` struct in `summon_extension.rs` doesn't expose a `response`/`json_schema` field yet, but `subagent_handler.rs` already checks `recipe.response.is_some()` and wires up `FinalOutputTool` when present. The change is small: add `json_schema: Option<Value>` to `DelegateParams`, set `recipe.response` in `build_adhoc_recipe()`/`build_source_recipe()`, done. The delegate gets a `recipe__final_output` tool, a system prompt telling it to use that tool, and deterministic schema validation on its output.
Land the reader extension. Read-only platform extension with `default_enabled: false`.
MVP integration: Pick one ingestion point; skill install is the natural first target. Coordinator spawns a sandboxed delegate with `extensions: ["reader"]` and a `json_schema`, delegate reads the file and calls `final_output`, coordinator gets validated JSON.
Future work: Content demarcation delimiters, response inspection for high-frequency streams (MCP responses), trust-level metadata for graduated scanning.
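The `FinalOutputTool` plumbing step above might look roughly like the following. The structs are simplified stand-ins rather than the real goose definitions: the actual field sets and the signature of `build_adhoc_recipe()` are not shown in this post and are assumed here, and `Value` is aliased to `String` only to keep the sketch dependency-free (the real code would presumably use `serde_json::Value`).

```rust
// Simplified stand-ins, NOT the real goose definitions. The String alias
// replaces serde_json::Value so the sketch has no external dependencies.
pub type Value = String;

#[derive(Default)]
pub struct DelegateParams {
    pub instructions: String,
    pub extensions: Vec<String>,
    /// Proposed new field: optional JSON schema for the delegate's final output.
    pub json_schema: Option<Value>,
}

pub struct Response {
    pub json_schema: Option<Value>,
}

pub struct Recipe {
    pub instructions: String,
    pub response: Option<Response>,
}

/// Copy the schema onto the recipe so that a downstream check such as
/// `recipe.response.is_some()` can decide whether to wire up output validation.
/// The signature is assumed for this sketch.
pub fn build_adhoc_recipe(params: &DelegateParams) -> Recipe {
    Recipe {
        instructions: params.instructions.clone(),
        response: params
            .json_schema
            .clone()
            .map(|schema| Response { json_schema: Some(schema) }),
    }
}
```

With this shape, delegates spawned without a schema behave as they do today, since `response` stays `None` and the existing handler path is untouched.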
How this complements containers
SECURITY.md already recommends containers as a mitigation. Containers isolate processes (filesystem, network). The sandboxed delegate isolates capability (what tools the LLM can invoke). A containerized agent with full tools is still vulnerable to prompt injection via the LLM API. These are complementary layers — containers for process boundaries, sandboxed delegates for capability boundaries.
Non-goals
Replacing `PromptInjectionScanner`: this complements it, doesn't replace it
References