**pilot/agents/plan-reviewer.md** (4 additions, 0 deletions)
@@ -37,6 +37,7 @@ Compare plan against user request and clarifications:
5. **DoD Quality** — Are Definition of Done criteria measurable and verifiable? ("tests pass" is not verifiable; "API returns 404 for nonexistent resources" is)
6. **Risk Quality** — Are risk mitigations concrete implementable behaviors? ("handle edge cases" is not acceptable; "reset to null when selected project not in list" is)
7. **Runtime Environment** — If project has a running service/API/UI, does the plan document how to start, test, and verify it?
8. **Problem Statement Quality** — Does the plan state invariants (what must always be true) and edge cases, not just what to build? A plan that only says "add X feature" without stating behavioral expectations and invariants leaves implementers guessing at correctness boundaries.

### Step 3: Adversarial Challenge
@@ -48,6 +49,7 @@ Verify assumptions against actual code using Grep/Glob/Read. Challenge every assumption
4. **Question optimism** — Where is the plan overly optimistic about complexity or feasibility?
5. **Identify architectural weaknesses** — What design decisions create risk? What alternatives were ignored?
6. **Test scope boundaries** — What happens at the edges? What's excluded that should be included?
7. **Ghost constraint check** *(suggestion level)* — Are any constraints in the plan inherited from assumptions or prior patterns rather than the current stated requirements? Look for: constraints nobody can attribute to a specific requirement, scope restrictions that appear copied from similar prior work without re-validation, or scope limitations that may reflect a historical context that no longer applies. Flag as `suggestion` — these are speculative on initial plans.

### Step 4: Compose Output
@@ -90,6 +92,7 @@ For EVERY plan, ask:
- [ ] What happens at the boundaries of "in scope" vs "out of scope"?
- [ ] What failure modes from similar features in the codebase could apply here?
- [ ] What concurrent access or race condition scenarios exist?
- [ ] Are any constraints ghost constraints — inherited from prior context, not from current requirements?

## Alignment Checklist
@@ -104,6 +107,7 @@ For EVERY plan, verify:
- [ ] Each DoD criterion is verifiable against code or runtime behavior
- [ ] Runtime Environment section exists if the project has a running service
- [ ] Architecture aligns with any stated user preferences
- [ ] Plan states invariants (what must always be true) and edge cases, not just what to build

| Adapter layer | New class/module created purely to translate between two things you control | should_fix |

These patterns compound over time. Flag them so the implementer can simplify before the complexity accumulates.

### Step 5: Phase C — Goal Achievement

**Is the overall goal actually achieved?**
@@ -199,6 +214,16 @@ For each truth: **verified** = artifacts exist, substantive, wired, no critical
**Overall goal_score**: `achieved` = all truths verified; `partial` = some verified; `not_achieved` = majority failed.

#### C7: Prediction Accuracy

Compare the plan's predictions against actuals from `git diff --stat`:

```bash
git diff --stat HEAD~1 # or appropriate base commit
```

Count: tasks predicted (from plan Progress Tracking), tasks actual (from `[x]` checkboxes), files predicted (from plan task Files sections), files changed actual (from git diff). If plan contains no numeric predictions, set `prediction_accuracy` to `null`. Record in output JSON for future calibration.
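
For example, the two "actual" counts can be pulled mechanically. A minimal sketch, where the plan path and base commit are assumptions rather than values fixed by this step:

```bash
PLAN="plan.md"    # assumed location of the plan under review
BASE="HEAD~1"     # or the appropriate base commit

tasks_actual=$(grep -c '\[x\]' "$PLAN")                        # completed checkboxes in Progress Tracking
files_changed_actual=$(git diff --name-only "$BASE" | wc -l)   # files actually changed

echo "tasks_actual=$tasks_actual files_changed_actual=$files_changed_actual"
```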

### Step 6: Compose and Persist Output

Merge all findings from Phases A, B, and C. Deduplicate overlapping issues. Write to output_path.
@@ -224,6 +249,14 @@ Output ONLY valid JSON (no markdown wrapper, no explanation outside JSON):
2. **Pre-Mortem check:** Scan plan's `## Pre-Mortem` section — if any trigger condition is observably true for this task, note it in the plan and adapt your approach autonomously (e.g., adjust the implementation strategy, add a defensive check, or reorder steps). Handle per Deviation rules — only escalate to user if it's an architectural-level change.
3. **Call chain analysis:** Trace callers (upwards), callees (downwards), side effects
- **Surprise discovery:** If something contradicts how you expected it to work, check plan's `## Assumptions` section — identify which task numbers are affected and note the invalidated assumption in the plan before continuing.
6. **Verify tests pass** — run test suite
7. **Run actual program** — use plan's Runtime Environment. Check port: `lsof -i :<port>`. If using playwright-cli: `-s="${PILOT_SESSION_ID:-default}"`
8. **Check diagnostics** — zero errors
9. **Validate Definition of Done** — all criteria from plan
10. **Self-review:** Completeness? Names clear? YAGNI? Tests verify behavior not implementation?
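
A rough shell sketch of steps 6 and 7 (the test command and port are assumptions; substitute the project's real test runner and the port documented in the plan's Runtime Environment):

```bash
npm test            # 6. run the project's test suite (command assumed)
lsof -i :3000       # 7. confirm the documented service is listening (port assumed)
```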
Frame each decision as **"X at the cost of Y"** — never recommend without stating what it costs.

Incorporate user choices into plan design, proceed to Step 1.5.

### Step 1.5: Implementation Planning

@@ -193,6 +195,8 @@ Incorporate user choices into plan design, proceed to Step 1.5.

**Zero-context assumption:** Assume implementer knows nothing. Provide exact file paths, explain domain concepts, reference similar patterns.

**Assumptions:** After creating tasks, write the `## Assumptions` section — one bullet per assumption: what you assume, which finding supports it, which task numbers depend on it. When implementation hits a surprise, this list tells the implementer which tasks are affected.
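
For illustration, entries might look like the following (the findings and task numbers are hypothetical):

```markdown
## Assumptions
- The session store is the only writer to the session file (supported by finding F2); tasks 3 and 5 depend on this.
- The `/health` endpoint returns 200 when the service is up (supported by finding F5); task 7 depends on this.
```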

#### Step 1.5.1: Goal Verification Criteria

After creating tasks, derive for the `## Goal Verification` section:
@@ -201,6 +205,18 @@ After creating tasks, derive for the `## Goal Verification` section:
3. For each truth, identify supporting artifacts (files with real implementation)

**Assume this plan failed after full execution. Why?** Write 2-3 failure scenarios with observable trigger conditions checked during implementation.

**This is distinct from Risks** (external dependencies outside your control) and from **Goal Verification truths** (what success looks like). Pre-Mortem covers *internal approach validity* — where your own design choices or assumptions could be wrong.

Example: Risk = "Redis is unavailable" | Pre-Mortem = "We assumed sessions are stateless but they're not — trigger: session data can't round-trip through the new format in first integration test"

**During implementation**, these triggers are handled autonomously — the implementer adapts the approach, not stops the workflow.

Write these to the `## Pre-Mortem` section of the plan.
**pilot/commands/spec-verify.md** (18 additions, 1 deletion)
@@ -214,6 +214,16 @@ For EACH task, verify its Definition of Done criteria against the running program
If any criterion unmet: fix inline if possible, or add task and loop back.

### Step 3.8b: Not Verified Acknowledgment

List what was **NOT** verified and why. Include in the verification report (Step 3.13). Every gap must have a reason:

| Not Verified | Reason |
|--------------|--------|
| [criterion or scenario] | No test environment / Out of scope / Untestable statically / Deferred |

"None — all criteria have automated verification" is a valid answer if true. Do not omit this section: absence of acknowledged gaps ≠ absence of real gaps.
**Red Flags → STOP:** "Quick fix for now", multiple changes at once, proposing fixes before tracing data flow, 2+ failed fixes.

**Revert-First:** When something breaks during implementation, default response = simplify, not add more code.

1. **Revert** — undo the change that broke it. Clean state.
2. **Delete** — can the broken thing be removed entirely?
3. **One-liner** — minimal targeted fix only.
4. **None of the above** → stop, reconsider the approach. 3+ failed fixes = the approach is wrong, not the fix.

**Meta-Debugging:** Treat your own code as foreign. Your mental model is a guess — the code's behavior is truth.

#### Defense-in-Depth & Root-Cause Tracing
@@ -76,6 +82,16 @@ result = await wait_for(lambda: get_result() is not None, timeout=5.0)
**Rules:** Poll every 10ms (not 1ms — wastes CPU). Always include timeout with clear error message. Call getter inside loop for fresh data (no stale cache).
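
A minimal sketch of such a helper that follows these rules; the name and signature mirror the usage above, but this is illustrative rather than the repository's actual implementation:

```python
import asyncio
import time

async def wait_for(predicate, timeout: float = 5.0, interval: float = 0.01):
    """Poll `predicate` until it returns a truthy value, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        value = predicate()            # call the getter inside the loop: fresh data, no stale cache
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        await asyncio.sleep(interval)  # 10 ms between polls: responsive without wasting CPU
```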

### Constraint Classification

When exploring a problem or codebase, classify constraints you encounter:

- **Soft** — preferences or conventions — negotiable if trade-off is stated explicitly
- **Ghost** — past constraints baked into the current approach that **no longer apply**

Ghost constraints are the most valuable to find: they lock out options nobody thinks are available. Ask "why can't we do X?" — if nobody can point to a current requirement, it may be a ghost.

### Git Operations

**Read git state freely. NEVER execute write commands without EXPLICIT user permission.**
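
For example, read-only inspection like the following needs no permission, while anything that mutates history or the working tree (commit, push, rebase, restore) does:

```bash
git status
git log --oneline -n 20
git diff --stat
```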
**pilot/rules/verification.md** (2 additions, 0 deletions)
@@ -25,6 +25,8 @@ Unit tests with mocks prove nothing about real-world behavior. After tests pass:
### Evidence Before Claims

**Before proceeding:** Ask "Do these tests verify what matters, or only what was easy to test?" If important edge cases go untested, acknowledge the gap explicitly — don't claim full coverage when you only have partial coverage.