fix: tighten /fix and refine spec/prd/create-skill/setup-rules skills

maxritter · maxritter · commit 03f945a65564 · 2026-04-30T10:25:24.000+02:00
- /fix: rationalizations table, bisect/perf/stack-trace investigation hooks, fast-signal lock-in, single-site defense-in-depth, pre-report checklist, optional architectural post-mortem flag.
- /spec-plan: code-first rule before user questions, conditional File Structure section for 4+ task plans, no-placeholders self-check before reviewer launch.
- /spec-implement: anti-pattern callout against horizontal slicing.
- /prd: HARD-GATE against pre-approval implementation, code-first rule.
- /create-skill: behavioural-vs-cosmetic edit taxonomy, rationalization-resistant design pattern (scoped to discipline skills).
- /setup-rules: explainer-first prompt guideline.
- approval flows: collapse two No options into one No, I have feedback (annotations auto-save, no ready handshake).
diff --git a/pilot/skills/create-skill/orchestrator.md b/pilot/skills/create-skill/orchestrator.md
@@ -8,3 +8,14 @@ model: opus
 # /create-skill — Skill Creator
 
 **Create a reusable skill.** Provide a topic or workflow description, and this command explores the codebase, gathers relevant patterns, and builds a well-structured skill interactively with you. If no topic is given, it evaluates the current session for extractable knowledge.
+
+## Editing existing skills
+
+For NEW skills, Step 6 runs the with-skill vs baseline subagent comparison — no skill ships without it.
+
+For EDITS, classify the change first:
+
+- **Behavioural** — adds/removes a step, changes a rule, reorders critical sections, edits the description, changes Iron Laws / red flags / rationalization tables, modifies trigger keywords. **Re-run Step 6 (or write 2–3 prompts if none exist).** These changes shift trigger accuracy and step compliance.
+- **Cosmetic** — typo, prose polish, link fix, formatting, example clarification with no semantic shift. Skip the test re-run.
+
+When uncertain, treat as behavioural. Skill changes that go unverified are how skill quality drifts.
diff --git a/pilot/skills/create-skill/steps/07-anti-patterns.md b/pilot/skills/create-skill/steps/07-anti-patterns.md
@@ -20,3 +20,13 @@
 | **Overtriggering** | Skill loads for irrelevant queries, users disabling it | Add negative triggers ("Do NOT use for..."), be more specific |
 | **Instructions not followed** | Claude loads skill but ignores steps | Instructions too verbose (condense), buried (move critical to top), or ambiguous (use exact commands) |
 | **Large context issues** | Skill seems slow or responses degraded | Move detailed docs to `references/`, keep SKILL.md under 5,000 words |
+
+### Rationalization-Resistant Design (for discipline skills only)
+
+For skills that enforce discipline (TDD, verification, root-cause-before-fix), use baseline testing (Step 6.2) to capture verbatim agent rationalizations and counter them with three patterns:
+
+- **Excuse → Reality table** — each rationalization the agent produced in baseline, paired with the technical counter.
+- **Iron Laws + Red Flags list** — short numbered laws and a symptom-checklist that points back to "STOP, return to step N".
+- **Closed loopholes** — don't just state the rule, forbid specific workarounds (e.g. "Don't keep code as 'reference'. Delete means delete.").
+
+Skip these for technique/reference skills — they don't have rationalization pressure to defend against.
diff --git a/pilot/skills/fix/orchestrator.md b/pilot/skills/fix/orchestrator.md
@@ -30,6 +30,17 @@ Bugfix workflow with TDD. Investigate the bug, write the failing test, fix at th
 5. STOP WHEN OVER YOUR HEAD — multi-component / architectural bugs need /spec.
 ```
 
+### Common rationalizations
+
+| Excuse | Reality |
+|--------|---------|
+| "Emergency, no time for process" | Systematic is faster than guess-and-check thrashing. |
+| "Issue is simple, don't need to trace" | Simple bugs have root causes too — tracing one file is fast. |
+| "I'll write the test after confirming the fix" | Tests written after pass immediately and prove nothing. RED first or it isn't TDD. |
+| "One more fix attempt" (after 2 failures) | Two failed quick-lane attempts means the lane is wrong. Bail to /spec. |
+| "Looks fixed, tests pass" | E2E evidence required. Unit-pass is not user-pass. |
+| "Quick patch now, investigate later" | This IS the quick lane. If you can't trace it now, bail to /spec — don't patch blind. |
+
 ---
 
 ## Critical Constraints
diff --git a/pilot/skills/fix/steps/01-investigate.md b/pilot/skills/fix/steps/01-investigate.md
@@ -16,6 +16,8 @@ git log --oneline -10 -- <suspected_file_or_dir> 2>/dev/null
 
 Look for the commit that introduced the bug. If recent, read that diff. If nothing obvious, skip.
 
+**Bisect when `git log` doesn't reveal it.** If the bug appeared between two known-good and known-bad states and the suspect commit isn't obvious, run `git bisect start <bad> <good>` then `git bisect run <test-cmd>` against the reproducing test you'll write in Step 2 — this pinpoints the introducing commit automatically. Skip when the surface area is small enough that a single read finds it.
+
 ### 1.3 Trace to root cause
 
 **Start with `codegraph_context(task="<bug description>")`** — single call, returns entry points and related symbols. Read it carefully.
@@ -37,6 +39,17 @@ For bugs that don't surface clearly through stack traces or static reading — U
 
 **Mark every temporary log with `SPEC-DEBUG:`** (e.g. `console.log("SPEC-DEBUG: filters=", filters)`). Step 3.5 greps for this marker — any unremoved match fails the diff sanity check.
 
+**Unknown caller?** If the bug is in shared code reached from many sites and you don't know which caller triggers it, capture the call chain inline:
+
+```js
+// SPEC-DEBUG: who is calling this with bad input?
+console.error("SPEC-DEBUG:", { args, stack: new Error().stack });
+```
+
+Run, read the stack, identify the offending caller, then trace upward to the original trigger.
+
+**Performance regressions are different.** Value-logging is the wrong tool — replace it with baseline measurement: timing harness, `performance.now()`, profiler, or query plan / `EXPLAIN`. Establish current vs expected timing first, then bisect against that signal (Step 1.2). Measure first, fix second.
+
 Skip 1.4 when the stack trace already names the failing line, or when a static read of the file is enough to see the bug. Skip is the default.
 
 ### 1.5 State the root cause
@@ -47,6 +60,18 @@ Out loud, in one sentence to the user, before writing any test:
 
 If confidence is Low: bail out. Don't guess in the quick lane.
 
+### 1.6 Lock in a fast signal before Step 2
+
+Your reproducing signal — what runs in <30s and definitively shows fail/pass — must be **fast and deterministic** before you write the fix. For most bugs that's the unit test you're about to write in Step 2. For UI / integration / async bugs the unit-test seam may be wrong, and your real signal is a `curl`, CLI invocation, or headless-browser command (the same one you'll run in Step 4 E2E).
+
+Whichever it is, sharpen it now:
+
+- **Slow loop (>30s)?** Narrow the test scope, skip unrelated setup, cache fixtures. A flaky 30s loop is the slowest path to a fix.
+- **Flaky?** Pin time, seed RNG, isolate filesystem, freeze network. For non-deterministic bugs, raise the reproduction rate (loop the trigger 50–100×, parallelise) until it's debuggable — a 1% flake is not.
+- **Wrong symptom?** The signal must fail with the **user's** reported symptom, not a different failure that happens to be nearby. Wrong bug = wrong fix.
+
+If you cannot get a fast deterministic signal after one pass of sharpening, that's a bail-out trigger — tell the user to use `/spec`. Don't hypothesise into a flaky loop.
+
 ### Red flags — STOP and re-investigate
 
 - "Quick fix for now, investigate later" → STOP, this IS the quick lane.
diff --git a/pilot/skills/fix/steps/03-fix.md b/pilot/skills/fix/steps/03-fix.md
@@ -14,7 +14,7 @@ If the buggy data flows from upstream, fix upstream. Red flags in the diff:
 - Early return that hides wrong state from the caller.
 - Renamed/suppressed log lines that previously surfaced the bug.
 
-A defensive layer is occasionally legitimate (defense-in-depth). When it is, document it explicitly in the Step 6 summary. Otherwise revert it.
+**Quick-lane fixes are single-site.** If you find yourself adding validation at 2+ layers (entry + business + storage, etc.) to make the bug "structurally impossible," you're in `/spec` territory — bail out. Defense-in-depth is a feature of `/spec`, not `/fix`. The one exception: a single boundary check that prevents the same root cause from re-entering through one other obvious path — document it in the Step 6 summary.
 
 ### 3.3 Run the reproducing test — it MUST pass
 
diff --git a/pilot/skills/fix/steps/06-finalise.md b/pilot/skills/fix/steps/06-finalise.md
@@ -44,7 +44,20 @@ Handle:
 
 Best-effort — don't block on failure.
 
-### 6.4 Report
+### 6.4 Pre-report verification checklist
+
+⛔ Walk every box before writing the report. Missing any one = not done — return to the relevant step.
+
+- [ ] Reproducing test passes (Step 3.3 fresh run, this message).
+- [ ] Full anti-regression suite green (Step 5.2 fresh run).
+- [ ] E2E executed against the actual program with concrete evidence captured (Step 4).
+- [ ] `git diff | grep -E "SPEC-DEBUG|^\\+.*\\b(console\\.log|print\\()"` returns nothing (no leftover instrumentation).
+- [ ] Diff is small and every changed line traces to the bug (lineage rule).
+- [ ] Worktree mode: single bundled `fix:` commit. Non-worktree: changes ready, no commit yet.
+
+If any box is unchecked, do not write the report and do not ask for approval — fix the gap first.
+
+### 6.5 Report
 
 ```
 Bugfix complete — <bug>.
@@ -57,4 +70,14 @@ Run /clear before starting new work — this resets context while keeping projec
 
 The `E2E:` line is **mandatory** — it documents that the actual program was exercised, not just the unit tests.
 
+### 6.6 Post-mortem flag (optional, one line)
+
+Ask once, now that you have more information than when you started: **what would have prevented this bug?** If the answer is architectural — no clean test seam, hidden coupling between modules, validation absent at the boundary the bad data crossed, repeated near-miss in the same area — name it as a `/spec` follow-up candidate in one line:
+
+```
+Follow-up (architectural): <one-line description> — candidate for /spec.
+```
+
+Skip when the answer is "nothing structural, it was a one-line typo / off-by-one / wrong default." Don't manufacture follow-ups.
+
 ARGUMENTS: $ARGUMENTS
diff --git a/pilot/skills/prd/orchestrator.md b/pilot/skills/prd/orchestrator.md
@@ -8,6 +8,12 @@ model: opus
 
 # /prd - Generate Product Requirements Documents
 
+<HARD-GATE>
+Do NOT invoke `/spec`, `/spec-plan`, `/spec-implement`, write any code, scaffold any project, or take any implementation action until you have written a PRD and the user has approved it. This applies to EVERY idea regardless of perceived simplicity.
+
+`/prd`'s output is a written PRD at the path determined in Step 6 (write-prd). The terminal state is offering hand-off to `/spec` and waiting for the user. The skill does not invoke implementation skills directly.
+</HARD-GATE>
+
 **Strategic thought partner and brainstorming surface** — turns vague ideas into concrete Product Requirements Documents (PRDs) through one-on-one conversation, with optional research. Produces a PRD that can be handed off directly to `/spec` for implementation.
 
 **Use `/prd` when:**
diff --git a/pilot/skills/prd/steps/03-clarify.md b/pilot/skills/prd/steps/03-clarify.md
@@ -8,6 +8,8 @@
 
 **⛔ Skip obvious questions.** Do not ask anything already answered by the one-line idea, the codebase exploration in Step 1, or earlier answers in this conversation. The goal is to surface what the user hasn't thought about yet, not to collect a standard intake form.
 
+**⛔ Code-first rule.** Before each question, ask "can the codebase answer this?" If yes — read the code. Use `codegraph_context`, `codegraph_search`, `codegraph_explore`, or Probe to resolve "how does X currently work / where does Y live / what's the existing pattern for Z". Only ask the user about things code can't tell you: purpose, priorities, audience, constraints, scope boundaries, behavioural expectations not yet encoded. Asking the user about facts the code already encodes wastes their time and signals you didn't explore.
+
 Coverage areas (ask only where genuinely unclear):
 - **Purpose** — what's the core outcome the user wants?
 - **Users** — who benefits from this? What's their workflow today?
diff --git a/pilot/skills/setup-rules/steps/01-reference.md b/pilot/skills/setup-rules/steps/01-reference.md
@@ -5,6 +5,7 @@
 ### Guidelines
 
 - **Always use AskUserQuestion** when asking the user anything
+- **Explainer-first prompts** — when prompting the user to choose between options, lead with one short paragraph that names what the term means, why these skills need it, and what changes if they pick differently. Then show the choices and the recommended default. Assume the user does not know the term — never present `paths` frontmatter, MCP scoping, or rule-vs-skill distinctions as a multiple-choice without context. Walk them through one section at a time, not all sections at once.
 - **Read before writing** — check existing rules before creating
 - **Write concise rules** — every word costs tokens in context
 - **Idempotent** — running multiple times produces consistent results
diff --git a/pilot/skills/spec-bugfix-plan/steps/07-approval.md b/pilot/skills/spec-bugfix-plan/steps/07-approval.md
@@ -11,6 +11,11 @@
    ~/.pilot/bin/pilot notify plan_approval "Bugfix Plan Ready" "<plan-slug> — annotate in Console or approve here" --plan-path "<plan_path>" 2>/dev/null || true
    ```
 1. Summarize: symptom → root cause → fix approach → task structure
-2. AskUserQuestion: "Yes, proceed" | "No, let me edit" | "No, I'll annotate in the Console"
+2. AskUserQuestion:
+   - "Yes, proceed" — Approve as-is and start spec-implement
+   - "No, I have feedback" — I've annotated in the Console or edited the plan file; process my feedback
+
+   The user can pause at this prompt, annotate in the Console's Specifications tab (auto-saves), or edit the plan file directly, then pick option 2. No "ready" handshake required.
 3. **Yes:** Set `Approved: Yes`, invoke `Skill(skill='spec-implement', args='<plan-path>')`
-   **No (edit or annotate):** Tell user to edit the plan or annotate in the Console Specifications tab — annotations auto-save. Say "ready" when done. Re-run Step 6 (check for annotation feedback), re-read plan, ask again. **Other:** Incorporate, re-ask.
+   **No, I have feedback:** Re-run Step 6 (process Console annotations), re-read the plan file (in case the user edited it), then return to Step 7 and ask again.
+   **Other free-text feedback:** Incorporate the changes into the plan, then re-ask with a fresh AskUserQuestion.
diff --git a/pilot/skills/spec-implement/steps/04-tdd-loop.md b/pilot/skills/spec-implement/steps/04-tdd-loop.md
@@ -2,6 +2,16 @@
 
 ## Step 4: TDD Loop
 
+### Anti-pattern: horizontal slicing
+
+⛔ **Do NOT write all RED tests for the plan first, then all GREEN implementations.** That is "horizontal slicing" — it produces tests that test imagined behaviour and shapes, not actual behaviour. Tests written in bulk anchor on the function signatures you imagined, not the ones you'd actually want after writing the implementation.
+
+The TDD loop runs **per task**, vertical slice: RED for this task → GREEN for this task → refactor → next task. One test → one implementation → repeat. Each cycle informs the next.
+
+If you find yourself queuing 3+ RED tests before any GREEN, stop and complete the first cycle.
+
+---
+
 **For EVERY task (generic flow — features use this as-is; bugfixes use it via the Bugfix Lane overrides below):**
 
 1. **Read plan's implementation steps** — list files to create/modify/delete
diff --git a/pilot/skills/spec-plan/steps/05-task-understanding.md b/pilot/skills/spec-plan/steps/05-task-understanding.md
@@ -13,7 +13,9 @@
    | CLI/scripts | Output format, flags, exit codes                   |
    | Data/config | Schema, migration, validation, defaults            |
 
-3. **Ask Batch 1 questions** → notify, then use `AskUserQuestion` with each question as a separate entry with predefined options:
+4. **⛔ Code-first rule: before each question, ask "can I answer this from the codebase?"** If yes, do that instead. Use `codegraph_context`, `codegraph_search`, `codegraph_explore`, or Probe to resolve "how does X currently work / where does Y live / what's the current pattern for Z". Only ask the user about decisions the code can't make — purpose, priority trade-offs, scope boundaries, behavioral expectations not yet encoded. Asking the user about facts already in the codebase is the single biggest source of unnecessary friction in planning.
+
+5. **Ask Batch 1 questions** → notify, then use `AskUserQuestion` with each question as a separate entry with predefined options:
 
    ```bash
    ~/.pilot/bin/pilot notify plan_approval "Input Needed" "<plan_name> — clarification questions" --plan-path "<plan_path>" 2>/dev/null || true
diff --git a/pilot/skills/spec-plan/steps/08-implementation-planning.md b/pilot/skills/spec-plan/steps/08-implementation-planning.md
@@ -1,5 +1,21 @@
 ## Step 8: Implementation Planning
 
+### 8.0: File Structure (when 4+ tasks expected, otherwise inline per task)
+
+When the plan will have 4+ tasks, write a `## File Structure` section before tasks listing every file with one-line responsibility — decomposition decisions get locked in here. For 1–3 task plans skip this; the per-task `Files:` block already gives the same view.
+
+```markdown
+## File Structure
+
+- `src/foo/bar.ts` (create) — pure function: `parseFoo(input) → Foo`. No I/O.
+- `src/foo/loader.ts` (create) — fetches and caches Foo from API. Wraps `parseFoo`.
+- `tests/foo/bar.test.ts` (create) — unit tests for `parseFoo`.
+```
+
+One responsibility per file. Files that change together live together. In existing codebases, follow established patterns — don't restructure unrelated code.
+
+### 8.1: Task Granularity
+
 **Task Granularity:** Each task: independently testable, focused (2-4 files max), verifiable. Split if multiple unrelated DoD criteria; merge if one can't be tested without the other. Don't create tasks for setup/boilerplate with no standalone value — fold into the first task that uses them.
 
 **Task Structure:**
@@ -40,7 +56,7 @@
 
 **Assumptions:** After creating tasks, write the `## Assumptions` section — one bullet per assumption: what you assume, which finding supports it, which task numbers depend on it. When implementation hits a surprise, this list tells the implementer which tasks are affected.
 
-#### Step 8.1: Goal Verification Criteria
+#### Step 8.2: Goal Verification Criteria
 
 After creating tasks, derive for the `## Goal Verification` section:
 
diff --git a/pilot/skills/spec-plan/steps/11-plan-verification.md b/pilot/skills/spec-plan/steps/11-plan-verification.md
@@ -1,6 +1,30 @@
 ## Step 11: Plan Verification
 
-**⛔ If `PILOT_SPEC_REVIEW_ENABLED` is `"false"` (from Step 0),** skip this step entirely and proceed to Step 13.
+### 11.0: No-Placeholders Self-Check (always — before launching reviewers)
+
+⛔ Walk the plan file once, fresh-eyed, and grep for the patterns below. **Every match is a plan failure** — fix inline before sending the plan to a reviewer or asking for approval.
+
+**Forbidden placeholder patterns:**
+
+- `TBD`, `TODO`, `FIXME`, "implement later", "fill in details", "details below"
+- "add appropriate error handling", "add validation", "handle edge cases" — without specifying which cases
+- "write tests for the above" — tasks must specify the actual test cases, not a meta-instruction
+- "similar to Task N" — implementers may read tasks out of order; repeat the relevant content
+- Steps that describe *what* to do without showing *how* (code blocks required for code steps)
+- References to types, functions, methods, files, or env vars not defined in any task
+- Bracketed angle-brackets like `<your-code-here>`, `<insert-X>` outside of header literal placeholders
+- Goal Verification truths that are not falsifiable ("works correctly", "is fast enough")
+
+```bash
+# Quick grep (run in worktree or repo root):
+grep -nEi "TBD|TODO|FIXME|implement later|fill in details|appropriate error handling|similar to Task" "<plan_path>"
+```
+
+If anything matches, fix it inline (no new round-trip needed). Then proceed to spec-review launch below.
+
+---
+
+**⛔ If `PILOT_SPEC_REVIEW_ENABLED` is `"false"` (from Step 0),** skip the rest of this step and proceed to Step 13.
 
 **When enabled:** Run spec-review for every feature spec. Small plans benefit from a second pair of eyes just as much as large ones — missing edge cases and unclear DoD criteria are size-independent.
 
diff --git a/pilot/skills/spec-plan/steps/13-approval.md b/pilot/skills/spec-plan/steps/13-approval.md