Commit fa41d6e

fix: Quality Improvements to Agents, Rules and Spec-Commands
1 parent 3510821 commit fa41d6e

10 files changed: 116 additions & 17 deletions

pilot/agents/plan-reviewer.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -37,6 +37,7 @@ Compare plan against user request and clarifications:
 5. **DoD Quality** — Are Definition of Done criteria measurable and verifiable? ("tests pass" is not verifiable; "API returns 404 for nonexistent resources" is)
 6. **Risk Quality** — Are risk mitigations concrete implementable behaviors? ("handle edge cases" is not acceptable; "reset to null when selected project not in list" is)
 7. **Runtime Environment** — If project has a running service/API/UI, does the plan document how to start, test, and verify it?
+8. **Problem Statement Quality** — Does the plan state invariants (what must always be true) and edge cases, not just what to build? A plan that only says "add X feature" without stating behavioral expectations and invariants leaves implementers guessing at correctness boundaries.
 
 ### Step 3: Adversarial Challenge
 
@@ -48,6 +49,7 @@ Verify assumptions against actual code using Grep/Glob/Read. Challenge every ass
 4. **Question optimism** — Where is the plan overly optimistic about complexity or feasibility?
 5. **Identify architectural weaknesses** — What design decisions create risk? What alternatives were ignored?
 6. **Test scope boundaries** — What happens at the edges? What's excluded that should be included?
+7. **Ghost constraint check** *(suggestion level)* — Are any constraints in the plan inherited from assumptions or prior patterns rather than the current stated requirements? Look for: constraints nobody can attribute to a specific requirement, scope restrictions that appear copied from similar prior work without re-validation, or scope limitations that may reflect a historical context that no longer applies. Flag as `suggestion` — these are speculative on initial plans.
 
 ### Step 4: Compose Output
 
@@ -90,6 +92,7 @@ For EVERY plan, ask:
 - [ ] What happens at the boundaries of "in scope" vs "out of scope"?
 - [ ] What failure modes from similar features in the codebase could apply here?
 - [ ] What concurrent access or race condition scenarios exist?
+- [ ] Are any constraints ghost constraints — inherited from prior context, not from current requirements?
 
 ## Alignment Checklist
 
@@ -104,6 +107,7 @@ For EVERY plan, verify:
 - [ ] Each DoD criterion is verifiable against code or runtime behavior
 - [ ] Runtime Environment section exists if the project has a running service
 - [ ] Architecture aligns with any stated user preferences
+- [ ] Plan states invariants (what must always be true) and edge cases, not just what to build
 
 ## Output Persistence
 
```
pilot/agents/spec-reviewer.md

Lines changed: 33 additions & 0 deletions
````diff
@@ -120,6 +120,21 @@ Apply all rules read in Step 1. Focus on real issues, not style preferences.
 - Silently swallowed errors with no fallback or notification → **should_fix**
 - External calls without timeout handling → **should_fix**
 
+#### Complexity Anti-Patterns (should_fix)
+
+Scan changed files for patterns that add complexity instead of solving the problem:
+
+| Pattern | Looks Like | Severity |
+|---------|-----------|---------|
+| Wrapper cascade | Function wraps a broken function with extra handling instead of fixing it | should_fix |
+| Config toggle | Flag switching between old and new behavior (`if USE_NEW_PATH:`) | should_fix |
+| Defensive copy-paste | Function duplicated with minor modification "to be safe" | should_fix |
+| Exception swallowing | `except: pass` or `except Exception: pass` with no logging | should_fix |
+| Type escape | `as any`, `# type: ignore`, `@ts-ignore`, `@SuppressWarnings` | should_fix |
+| Adapter layer | New class/module created purely to translate between two things you control | should_fix |
+
+These patterns compound over time. Flag them so the implementer can simplify before the complexity accumulates.
+
 ### Step 5: Phase C — Goal Achievement
 
 **Is the overall goal actually achieved?**
@@ -199,6 +214,16 @@ For each truth: **verified** = artifacts exist, substantive, wired, no critical
 
 **Overall goal_score**: `achieved` = all truths verified; `partial` = some verified; `not_achieved` = majority failed.
 
+#### C7: Prediction Accuracy
+
+Compare the plan's predictions against actuals from `git diff --stat`:
+
+```bash
+git diff --stat HEAD~1  # or appropriate base commit
+```
+
+Count: tasks predicted (from plan Progress Tracking), tasks actual (from `[x]` checkboxes), files predicted (from plan task Files sections), files changed actual (from git diff). If plan contains no numeric predictions, set `prediction_accuracy` to `null`. Record in output JSON for future calibration.
+
 ### Step 6: Compose and Persist Output
 
 Merge all findings from Phases A, B, and C. Deduplicate overlapping issues. Write to output_path.
@@ -224,6 +249,14 @@ Output ONLY valid JSON (no markdown wrapper, no explanation outside JSON):
   "goal_score": "achieved | partial | not_achieved",
   "truths_verified": 5,
   "truths_total": 7,
+  "prediction_accuracy": {
+    "tasks_predicted": 6,
+    "tasks_actual": 7,
+    "files_predicted": 4,
+    "files_changed_actual": 5,
+    "notes": "One extra task added during implementation for edge case handling"
+  },
+  // When plan contains no numeric predictions: "prediction_accuracy": null
   "issues": [
     {
       "severity": "must_fix | should_fix | suggestion",
````
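Of the anti-pattern rows above, "exception swallowing" is the easiest to detect mechanically rather than by eye. A minimal sketch of such a detector, a hypothetical helper that is not part of the reviewer agent, using Python's `ast` module:

```python
import ast

def find_swallowed_exceptions(source: str) -> list[int]:
    """Return line numbers of `except` blocks whose body is only `pass`."""
    tree = ast.parse(source)
    hits: list[int] = []
    for node in ast.walk(tree):
        # ExceptHandler covers both bare `except:` and `except Exception:`
        if isinstance(node, ast.ExceptHandler):
            if all(isinstance(stmt, ast.Pass) for stmt in node.body):
                hits.append(node.lineno)
    return hits

snippet = """try:
    risky()
except Exception:
    pass
"""
print(find_swallowed_exceptions(snippet))  # [3]
```

A handler that logs or re-raises has a non-`pass` body and is not flagged, which matches the table's "with no logging" qualifier.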

pilot/commands/spec-implement.md

Lines changed: 13 additions & 11 deletions
```diff
@@ -84,21 +84,23 @@ All subsequent work happens inside the worktree directory.
 **For EVERY task:**
 
 1. **Read plan's implementation steps** — list files to create/modify/delete
-2. **Call chain analysis:** Trace callers (upwards), callees (downwards), side effects
-3. **Mark in_progress:** `TaskUpdate(taskId, status="in_progress")`
-4. **TDD Flow:**
+2. **Pre-Mortem check:** Scan plan's `## Pre-Mortem` section — if any trigger condition is observably true for this task, note it in the plan and adapt your approach autonomously (e.g., adjust the implementation strategy, add a defensive check, or reorder steps). Handle per Deviation rules — only escalate to user if it's an architectural-level change.
+3. **Call chain analysis:** Trace callers (upwards), callees (downwards), side effects
+4. **Mark in_progress:** `TaskUpdate(taskId, status="in_progress")`
+5. **TDD Flow:**
    - **RED:** Write failing test → verify it fails (feature missing, not syntax error)
    - **GREEN:** Implement minimal code to pass
    - **REFACTOR:** Improve while keeping tests green
    - Skip TDD for: docs, config, IaC, formatting-only changes
-5. **Verify tests pass** — run test suite
-6. **Run actual program** — use plan's Runtime Environment. Check port: `lsof -i :<port>`. If using playwright-cli: `-s="${PILOT_SESSION_ID:-default}"`
-7. **Check diagnostics** — zero errors
-8. **Validate Definition of Done** — all criteria from plan
-9. **Self-review:** Completeness? Names clear? YAGNI? Tests verify behavior not implementation?
-10. **Per-task commit (worktree only):** `git add <files> && git commit -m "{type}(spec): {task-name}"`
-11. **Mark completed:** `TaskUpdate(taskId, status="completed")`
-12. **Update plan file immediately** (Step 2.4)
+   - **Surprise discovery:** If something contradicts how you expected it to work, check plan's `## Assumptions` section — identify which task numbers are affected and note the invalidated assumption in the plan before continuing.
+6. **Verify tests pass** — run test suite
+7. **Run actual program** — use plan's Runtime Environment. Check port: `lsof -i :<port>`. If using playwright-cli: `-s="${PILOT_SESSION_ID:-default}"`
+8. **Check diagnostics** — zero errors
+9. **Validate Definition of Done** — all criteria from plan
+10. **Self-review:** Completeness? Names clear? YAGNI? Tests verify behavior not implementation?
+11. **Per-task commit (worktree only):** `git add <files> && git commit -m "{type}(spec): {task-name}"`
+12. **Mark completed:** `TaskUpdate(taskId, status="completed")`
+13. **Update plan file immediately** (Step 2.4)
 
 ---
 
```
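The RED requirement above, that the test must fail because the feature is missing and not because of a syntax error, can be shown in a minimal sketch (the `slugify` feature here is hypothetical, not from the codebase):

```python
def slugify(title: str) -> str:
    raise NotImplementedError  # RED: stub only, so the test fails for the right reason

def test_slugify():
    assert slugify("Hello World") == "hello-world"

try:
    test_slugify()          # RED: fails because the feature is missing
except NotImplementedError:
    pass

def slugify(title: str) -> str:   # GREEN: minimal code to make the test pass
    return title.lower().replace(" ", "-")

test_slugify()                    # passes; REFACTOR next while keeping it green
```

A failure type other than the expected one (e.g. `NameError` from a typo) would escape the `try` block, which is exactly the "not syntax error" check the step calls for.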

pilot/commands/spec-plan.md

Lines changed: 25 additions & 0 deletions
````diff
@@ -158,6 +158,8 @@ For each: document hypotheses, note full file paths, track unanswered questions.
 ~/.pilot/bin/pilot notify plan_approval "Design Decisions" "<plan_name> — architecture choices" --plan-path "<plan_path>" 2>/dev/null || true
 ```
 
+Frame each decision as **"X at the cost of Y"** — never recommend without stating what it costs.
+
 Incorporate user choices into plan design, proceed to Step 1.5.
 
 ### Step 1.5: Implementation Planning
@@ -193,6 +195,8 @@ Incorporate user choices into plan design, proceed to Step 1.5.
 
 **Zero-context assumption:** Assume implementer knows nothing. Provide exact file paths, explain domain concepts, reference similar patterns.
 
+**Assumptions:** After creating tasks, write the `## Assumptions` section — one bullet per assumption: what you assume, which finding supports it, which task numbers depend on it. When implementation hits a surprise, this list tells the implementer which tasks are affected.
+
 #### Step 1.5.1: Goal Verification Criteria
 
 After creating tasks, derive for the `## Goal Verification` section:
@@ -201,6 +205,18 @@ After creating tasks, derive for the `## Goal Verification` section:
 3. For each truth, identify supporting artifacts (files with real implementation)
 4. Identify 2-5 key links (critical component connections)
 
+#### Step 1.5.2: Pre-Mortem & Falsification Signals
+
+**Assume this plan failed after full execution. Why?** Write 2-3 failure scenarios with observable trigger conditions checked during implementation.
+
+**This is distinct from Risks** (external dependencies outside your control) and from **Goal Verification truths** (what success looks like). Pre-Mortem covers *internal approach validity* — where your own design choices or assumptions could be wrong.
+
+Example: Risk = "Redis is unavailable" | Pre-Mortem = "We assumed sessions are stateless but they're not — trigger: session data can't round-trip through the new format in first integration test"
+
+**During implementation**, these triggers are handled autonomously — the implementer adapts the approach rather than stopping the workflow.
+
+Write these to the `## Pre-Mortem` section of the plan.
+
 ### Step 1.6: Write Full Plan
 
 **Save to:** `docs/plans/YYYY-MM-DD-<feature-name>.md`
@@ -246,6 +262,10 @@ Type: Feature
 ## Implementation Tasks
 [Tasks from Step 1.5]
 
+## Assumptions
+- [What you assume] — supported by [finding/file:line] — Tasks N, M depend on this
+- [What you assume] — supported by [finding/file:line] — Task N depends on this
+
 ## Testing Strategy
 - Unit / Integration / Manual verification
 
@@ -254,6 +274,11 @@ Type: Feature
 ⚠️ Mitigations are commitments — verification checks they're implemented.
 ✅ "Reset to null when project not in list" ❌ "Handle edge cases"
 
+## Pre-Mortem
+*Assume this plan failed. Most likely internal reasons (approach validity, not external deps):*
+1. **[Failure scenario]** (Task N) → Trigger: [observable condition during implementation]
+2. **[Failure scenario]** (Task N) → Trigger: [observable condition during implementation]
+
 ## Goal Verification
 ### Truths
 ### Artifacts
````
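Because the `## Assumptions` bullet format fixes the phrase "Tasks N, M depend on this", the mapping from invalidated assumption to affected tasks can be recovered mechanically. A sketch (hypothetical helper; the regex assumes the template wording above):

```python
import re

def tasks_depending(assumptions_md: str) -> dict[str, list[int]]:
    """Map each '## Assumptions' bullet to the task numbers that depend on it."""
    deps: dict[str, list[int]] = {}
    for line in assumptions_md.splitlines():
        if not line.startswith("- "):
            continue
        # Template phrasing: "... — Tasks 2, 5 depend on this" / "... — Task 4 depends on this"
        m = re.search(r"Tasks?\s+([\d,\s]+)\s+depends?\s+on\s+this", line)
        if m:
            assumption = line[2:].split("—")[0].strip()
            deps[assumption] = [int(n) for n in re.findall(r"\d+", m.group(1))]
    return deps

section = """- Sessions are stateless — supported by auth.py:42 — Tasks 2, 5 depend on this
- Config is YAML — supported by loader.py:10 — Task 4 depends on this"""
print(tasks_depending(section))
# {'Sessions are stateless': [2, 5], 'Config is YAML': [4]}
```

This is what the implementer's "surprise discovery" step consumes: given one falsified assumption, the dependent task numbers fall out directly.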

pilot/commands/spec-verify.md

Lines changed: 18 additions & 1 deletion
````diff
@@ -214,6 +214,16 @@ For EACH task, verify its Definition of Done criteria against the running progra
 
 If any criterion unmet: fix inline if possible, or add task and loop back.
 
+### Step 3.8b: Not Verified Acknowledgment
+
+List what was **NOT** verified and why. Include in the verification report (Step 3.13). Every gap must have a reason:
+
+| Not Verified | Reason |
+|-------------|--------|
+| [criterion or scenario] | No test environment / Out of scope / Untestable statically / Deferred |
+
+"None — all criteria have automated verification" is a valid answer if true. Do not omit this section: absence of acknowledged gaps ≠ absence of real gaps.
+
 ### Step 3.9: E2E Verification (Full profile only)
 
 **If runtime profile is not Full:** Skip.
@@ -311,7 +321,14 @@ If any fails: fix on base branch, re-run, commit fix separately (e.g., `fix: res
 
 1. Set `Status: VERIFIED` in plan
 2. Register: `~/.pilot/bin/pilot register-plan "<plan_path>" "VERIFIED" 2>/dev/null || true`
-3. Report completion with summary
+3. Report completion with summary:
+```
+## Verification Complete
+**Issues Found:** X
+### Goal Achievement: N/M truths verified
+### Must Fix (N) | Should Fix (N) | Suggestions (N)
+### Not Verified: [list items from Step 3.8b, or "None"]
+```
 
 **When verification FAILS (missing features, serious bugs):**
 
````
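The rule that every unverified item must carry a reason can be enforced mechanically before sign-off. A sketch, assuming the rows are parsed from the acknowledgment table above (the reason vocabulary is taken from its example row):

```python
VALID_REASONS = {"No test environment", "Out of scope", "Untestable statically", "Deferred"}

def unexplained_gaps(rows: list[tuple[str, str]]) -> list[str]:
    """Return criteria whose 'Reason' cell is empty or not an allowed value."""
    return [item for item, reason in rows if reason.strip() not in VALID_REASONS]

report = [
    ("UI renders on Safari", "No test environment"),
    ("Load test at 10k rps", ""),  # gap with no stated reason: must be fixed before sign-off
]
print(unexplained_gaps(report))  # ['Load test at 10k rps']
```

An empty result list corresponds to the "None — all criteria have automated verification" answer being honest.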

pilot/rules/development-practices.md

Lines changed: 16 additions & 0 deletions
```diff
@@ -37,6 +37,12 @@ probe search "database connection setup" ./
 
 **Red Flags → STOP:** "Quick fix for now", multiple changes at once, proposing fixes before tracing data flow, 2+ failed fixes.
 
+**Revert-First:** When something breaks during implementation, default response = simplify, not add more code.
+1. **Revert** — undo the change that broke it. Clean state.
+2. **Delete** — can the broken thing be removed entirely?
+3. **One-liner** — minimal targeted fix only.
+4. **None of the above** → stop, reconsider the approach. 3+ failed fixes = the approach is wrong, not the fix.
+
 **Meta-Debugging:** Treat your own code as foreign. Your mental model is a guess — the code's behavior is truth.
 
 #### Defense-in-Depth & Root-Cause Tracing
@@ -76,6 +82,16 @@ result = await wait_for(lambda: get_result() is not None, timeout=5.0)
 
 **Rules:** Poll every 10ms (not 1ms — wastes CPU). Always include timeout with clear error message. Call getter inside loop for fresh data (no stale cache).
 
+### Constraint Classification
+
+When exploring a problem or codebase, classify constraints you encounter:
+
+- **Hard** — non-negotiable (physical limits, external contracts, security requirements, deadlines)
+- **Soft** — preferences or conventions — negotiable if trade-off is stated explicitly
+- **Ghost** — past constraints baked into the current approach that **no longer apply**
+
+Ghost constraints are the most valuable to find: they lock out options nobody thinks are available. Ask "why can't we do X?" — if nobody can point to a current requirement, it may be a ghost.
+
 ### Git Operations
 
 **Read git state freely. NEVER execute write commands without EXPLICIT user permission.**
```
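The polling rules in the hunk above (10ms interval, a timeout with a clear error, a getter called fresh on each iteration) imply an implementation along these lines. This is a sketch only, since the project's actual `wait_for` is not shown in this diff:

```python
import asyncio
import time

async def wait_for(condition, timeout: float = 5.0, interval: float = 0.01):
    """Poll `condition` every 10ms until truthy; raise with a clear message on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()  # getter called inside the loop, so data is always fresh
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition {condition!r} not met within {timeout}s")
        await asyncio.sleep(interval)  # 10ms default: responsive without burning CPU
```

Used as in the quoted example: `result = await wait_for(lambda: get_result() is not None, timeout=5.0)`.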

pilot/rules/standards-python.md

Lines changed: 2 additions & 2 deletions
````diff
@@ -28,7 +28,7 @@ uv run pytest -q --cov=src --cov-fail-under=80 # Coverage
 
 ruff format . # Format
 ruff check . --fix # Lint
-basedpyright src # Type check
+basedpyright src # Type check (adapt to your source dirs)
 ```
 
 ### Code Style
@@ -55,7 +55,7 @@ basedpyright src # Type check
 - [ ] `uv run pytest` — tests pass
 - [ ] `ruff format .` — formatted
 - [ ] `ruff check .` — clean
-- [ ] `basedpyright src` — clean
+- [ ] `basedpyright src` — clean (adapt to your source dirs)
 - [ ] Coverage ≥ 80%
 - [ ] No unused imports
 - [ ] Production files ideally under 800 lines (1000+ = consider splitting)
````

pilot/rules/verification.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -25,6 +25,8 @@ Unit tests with mocks prove nothing about real-world behavior. After tests pass:
 
 ### Evidence Before Claims
 
+**Before proceeding:** Ask "Do these tests verify what matters, or only what was easy to test?" If important edge cases go untested, acknowledge the gap explicitly — don't claim full coverage when you only have partial coverage.
+
 1. **Identify** — What command proves this claim?
 2. **Execute** — Run the full command (not cached)
 3. **Read output** — Check exit code, count failures
```
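The three steps above (identify the proving command, execute it fresh, read the real output) can be sketched as a tiny helper; the helper name and the claims shown are illustrative, not part of the rules:

```python
import subprocess
import sys

def prove(claim: str, cmd: list[str]) -> bool:
    """Execute the proving command fresh and read its real exit code, not memory."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    ok = result.returncode == 0
    print(f"{claim}: exit={result.returncode} -> {'verified' if ok else 'refuted'}")
    return ok

prove("interpreter starts", [sys.executable, "-c", "pass"])
```

The point is the discipline, not the helper: a claim of "tests pass" is backed by a fresh exit code 0, never by a remembered earlier run.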

pilot/scripts/mcp-server.cjs

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

pilot/scripts/worker-service.cjs

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.
