Skip to content

Feature Request: Agent-recordable, verifiable procedural skills ("executable memory") #28999

@Haruhiyuki

Description

@Haruhiyuki

What variant of Codex are you using?

CLI

What feature would you like to see?

Summary

Codex can already read procedural knowledge (skills) and recall context (memories), but it has no first-class way for the agent to turn one of its own successful, repeatable runs into a reusable, executable, verifiable procedure — and to repair that procedure cheaply when the environment drifts, instead of re-deriving the whole thing from scratch.

I'd like to propose an agent-recordable procedural skill capability (call it "executable memory"): a small, surface-consistent extension that sits between today's static skills and background-extracted memories.

Additional information

The gap (grounded in the current tree)

Three adjacent primitives already exist, but none closes the loop:

  • Skills are static and human-authored. SKILL.md is parsed at load time (codex-rs/core-skills/src/loader.rs), and SkillMetadata is a read-only snapshot (codex-rs/core-skills/src/model.rs). There is no agent-facing path to write or update a skill from a successful run.
  • Memories are unstructured context, not executable procedure. The agent can append a free-form markdown note via AddAdHocNote (codex-rs/ext/memories/src/tools/ad_hoc_note.rs), and memories are auto-injected into context by a ContextContributor. But a note is prose, not a replayable step sequence with success criteria.
  • Traces are diagnostic, not re-executable. rollout-trace records everything and exposes replay_bundle (codex-rs/rollout-trace/src/lib.rs), but explicitly for offline reconstruction / debugging — there is no production pathway to replay a recorded sequence against new inputs.

So the agent re-discovers the same multi-step procedure (a chain of shell/tool calls) on every repeat of a recurring task, paying full exploration cost in tokens and latency each time.

Proposed capability (minimal slice)

Reuse what already exists; add only the missing loop. Concretely:

  1. Record — an agent-callable tool (e.g. skill.record) that, after a successful repeatable task, persists the procedure as a SKILL.md under ./.codex/skills/<name> (existing format), where the body captures steps = the tool-call sequence plus postconditions = how to verify success. This is the agent-write counterpart to today's human-authored skills, and a structured counterpart to AddAdHocNote.
  2. Recall + replay — surface matching recorded skills the way memories already auto-surface (ContextContributor), and let the agent replay the stored steps instead of re-exploring.
  3. Verify — on replay, check the recorded postconditions rather than assuming success.
  4. Minimal repair — when a postcondition fails, patch the failing step (and persist the patch) instead of discarding and re-deriving the whole procedure. Keeping runtime patches separate from the trusted baseline keeps a recorded skill auditable.
  5. Safety gate — route any step the agent flags as high-risk through the existing approval flow before replay, so a recorded procedure can't silently execute a destructive step.

Capture could optionally be assisted by existing PostToolUse / Stop hook events (codex-rs/hooks/src/lib.rs) and the rollout-trace reducer as the source of the step sequence — no new recording infrastructure required.

Why this fits Codex (and not a separate product)

  • Surface-consistent. It's pure text/tool-call procedure persisted as files. It works headless and applies equally to CLI, IDE, and cloud surfaces — no new platform-specific dependencies, no display requirement.
  • Directly on the existing roadmap surface. It's an incremental extension of core-skills + memories, not a new subsystem. The novel pieces are narrow: an agent write path for skills, postcondition verification, and minimal repair.
  • Concrete payoff. Recurring tasks (repeated build/test/debug loops, multi-step setup/migration chores, fixed tool pipelines) stop paying full re-exploration cost on every run — fewer tokens, lower latency, more determinism.

Open design questions (happy to discuss before any code)

  • Should a recorded skill be a SKILL.md superset (add an optional executable/verify block) or a sibling artifact type, to avoid confusing human-authored vs. agent-recorded skills?
  • What's the right trigger for record — an explicit tool the model calls, a user command, or a Stop-hook prompt at end of a successful task?
  • How should postconditions be expressed so they're cheap to evaluate and safe (no side effects)?
  • What's the minimal-repair contract — patch-and-persist vs. propose-patch-for-confirmation?
  • Where does this sit relative to the memories Phase-1 background pipeline — complementary, or should recording feed that pipeline?

I'm glad to iterate on the design in this thread before anything is implemented; identifying the right shape here seems like the hard part.

Metadata

Metadata

Assignees

No one assigned

    Labels

    CLIIssues related to the Codex CLIenhancementNew feature or requestmemoryskillsIssues related to skills

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions