Commit fa41d6e

fix: Quality Improvements to Agents, Rules and Spec-Commands
1 parent 3510821 commit fa41d6e

10 files changed: 116 additions & 17 deletions

pilot/agents/plan-reviewer.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -37,6 +37,7 @@ Compare plan against user request and clarifications:
 5. **DoD Quality** — Are Definition of Done criteria measurable and verifiable? ("tests pass" is not verifiable; "API returns 404 for nonexistent resources" is)
 6. **Risk Quality** — Are risk mitigations concrete implementable behaviors? ("handle edge cases" is not acceptable; "reset to null when selected project not in list" is)
 7. **Runtime Environment** — If project has a running service/API/UI, does the plan document how to start, test, and verify it?
+8. **Problem Statement Quality** — Does the plan state invariants (what must always be true) and edge cases, not just what to build? A plan that only says "add X feature" without stating behavioral expectations and invariants leaves implementers guessing at correctness boundaries.
 
 ### Step 3: Adversarial Challenge
 
@@ -48,6 +49,7 @@ Verify assumptions against actual code using Grep/Glob/Read. Challenge every ass
 4. **Question optimism** — Where is the plan overly optimistic about complexity or feasibility?
 5. **Identify architectural weaknesses** — What design decisions create risk? What alternatives were ignored?
 6. **Test scope boundaries** — What happens at the edges? What's excluded that should be included?
+7. **Ghost constraint check** *(suggestion level)* — Are any constraints in the plan inherited from assumptions or prior patterns rather than the current stated requirements? Look for: constraints nobody can attribute to a specific requirement, scope restrictions that appear copied from similar prior work without re-validation, or scope limitations that may reflect a historical context that no longer applies. Flag as `suggestion` — these are speculative on initial plans.
 
 ### Step 4: Compose Output
 
@@ -90,6 +92,7 @@ For EVERY plan, ask:
 - [ ] What happens at the boundaries of "in scope" vs "out of scope"?
 - [ ] What failure modes from similar features in the codebase could apply here?
 - [ ] What concurrent access or race condition scenarios exist?
+- [ ] Are any constraints ghost constraints — inherited from prior context, not from current requirements?
 
 ## Alignment Checklist
 
@@ -104,6 +107,7 @@ For EVERY plan, verify:
 - [ ] Each DoD criterion is verifiable against code or runtime behavior
 - [ ] Runtime Environment section exists if the project has a running service
 - [ ] Architecture aligns with any stated user preferences
+- [ ] Plan states invariants (what must always be true) and edge cases, not just what to build
 
 ## Output Persistence
 
```
pilot/agents/spec-reviewer.md

Lines changed: 33 additions & 0 deletions
````diff
@@ -120,6 +120,21 @@ Apply all rules read in Step 1. Focus on real issues, not style preferences.
 - Silently swallowed errors with no fallback or notification → **should_fix**
 - External calls without timeout handling → **should_fix**
 
+#### Complexity Anti-Patterns (should_fix)
+
+Scan changed files for patterns that add complexity instead of solving the problem:
+
+| Pattern | Looks Like | Severity |
+|---------|-----------|---------|
+| Wrapper cascade | Function wraps a broken function with extra handling instead of fixing it | should_fix |
+| Config toggle | Flag switching between old and new behavior (`if USE_NEW_PATH:`) | should_fix |
+| Defensive copy-paste | Function duplicated with minor modification "to be safe" | should_fix |
+| Exception swallowing | `except: pass` or `except Exception: pass` with no logging | should_fix |
+| Type escape | `as any`, `# type: ignore`, `@ts-ignore`, `@SuppressWarnings` | should_fix |
+| Adapter layer | New class/module created purely to translate between two things you control | should_fix |
+
+These patterns compound over time. Flag them so the implementer can simplify before the complexity accumulates.
+
 ### Step 5: Phase C — Goal Achievement
 
 **Is the overall goal actually achieved?**
@@ -199,6 +214,16 @@ For each truth: **verified** = artifacts exist, substantive, wired, no critical
 
 **Overall goal_score**: `achieved` = all truths verified; `partial` = some verified; `not_achieved` = majority failed.
 
+#### C7: Prediction Accuracy
+
+Compare the plan's predictions against actuals from `git diff --stat`:
+
+```bash
+git diff --stat HEAD~1  # or appropriate base commit
+```
+
+Count: tasks predicted (from plan Progress Tracking), tasks actual (from `[x]` checkboxes), files predicted (from plan task Files sections), files changed actual (from git diff). If plan contains no numeric predictions, set `prediction_accuracy` to `null`. Record in output JSON for future calibration.
+
 ### Step 6: Compose and Persist Output
 
 Merge all findings from Phases A, B, and C. Deduplicate overlapping issues. Write to output_path.
@@ -224,6 +249,14 @@ Output ONLY valid JSON (no markdown wrapper, no explanation outside JSON):
   "goal_score": "achieved | partial | not_achieved",
   "truths_verified": 5,
   "truths_total": 7,
+  "prediction_accuracy": {
+    "tasks_predicted": 6,
+    "tasks_actual": 7,
+    "files_predicted": 4,
+    "files_changed_actual": 5,
+    "notes": "One extra task added during implementation for edge case handling"
+  },
+  // When plan contains no numeric predictions: "prediction_accuracy": null
   "issues": [
     {
       "severity": "must_fix | should_fix | suggestion",
````
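Of the anti-pattern rows above, "exception swallowing" is the easiest to detect mechanically rather than by eye. A minimal sketch of such a detector, a hypothetical helper that is not part of the reviewer agent, using Python's `ast` module:

```python
import ast

def find_swallowed_exceptions(source: str) -> list[int]:
    """Return line numbers of `except` blocks whose body is only `pass`."""
    tree = ast.parse(source)
    hits: list[int] = []
    for node in ast.walk(tree):
        # ExceptHandler covers both bare `except:` and `except Exception:`
        if isinstance(node, ast.ExceptHandler):
            if all(isinstance(stmt, ast.Pass) for stmt in node.body):
                hits.append(node.lineno)
    return hits

snippet = """try:
    risky()
except Exception:
    pass
"""
print(find_swallowed_exceptions(snippet))  # [3]
```

A handler that logs or re-raises has a non-`pass` body and is not flagged, which matches the table's "with no logging" qualifier.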

pilot/commands/spec-implement.md

Lines changed: 13 additions & 11 deletions
```diff
@@ -84,21 +84,23 @@ All subsequent work happens inside the worktree directory.
 **For EVERY task:**
 
 1. **Read plan's implementation steps** — list files to create/modify/delete
-2. **Call chain analysis:** Trace callers (upwards), callees (downwards), side effects
-3. **Mark in_progress:** `TaskUpdate(taskId, status="in_progress")`
-4. **TDD Flow:**
+2. **Pre-Mortem check:** Scan plan's `## Pre-Mortem` section — if any trigger condition is observably true for this task, note it in the plan and adapt your approach autonomously (e.g., adjust the implementation strategy, add a defensive check, or reorder steps). Handle per Deviation rules — only escalate to user if it's an architectural-level change.
+3. **Call chain analysis:** Trace callers (upwards), callees (downwards), side effects
+4. **Mark in_progress:** `TaskUpdate(taskId, status="in_progress")`
+5. **TDD Flow:**
    - **RED:** Write failing test → verify it fails (feature missing, not syntax error)
    - **GREEN:** Implement minimal code to pass
    - **REFACTOR:** Improve while keeping tests green
    - Skip TDD for: docs, config, IaC, formatting-only changes
-5. **Verify tests pass** — run test suite
-6. **Run actual program** — use plan's Runtime Environment. Check port: `lsof -i :<port>`. If using playwright-cli: `-s="${PILOT_SESSION_ID:-default}"`
-7. **Check diagnostics** — zero errors
-8. **Validate Definition of Done** — all criteria from plan
-9. **Self-review:** Completeness? Names clear? YAGNI? Tests verify behavior not implementation?
-10. **Per-task commit (worktree only):** `git add <files> && git commit -m "{type}(spec): {task-name}"`
-11. **Mark completed:** `TaskUpdate(taskId, status="completed")`
-12. **Update plan file immediately** (Step 2.4)
+   - **Surprise discovery:** If something contradicts how you expected it to work, check plan's `## Assumptions` section — identify which task numbers are affected and note the invalidated assumption in the plan before continuing.
+6. **Verify tests pass** — run test suite
+7. **Run actual program** — use plan's Runtime Environment. Check port: `lsof -i :<port>`. If using playwright-cli: `-s="${PILOT_SESSION_ID:-default}"`
+8. **Check diagnostics** — zero errors
+9. **Validate Definition of Done** — all criteria from plan
+10. **Self-review:** Completeness? Names clear? YAGNI? Tests verify behavior not implementation?
+11. **Per-task commit (worktree only):** `git add <files> && git commit -m "{type}(spec): {task-name}"`
+12. **Mark completed:** `TaskUpdate(taskId, status="completed")`
+13. **Update plan file immediately** (Step 2.4)
 
 ---
 
```
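The RED requirement above, that the test must fail because the feature is missing and not because of a syntax error, can be shown in a minimal sketch (the `slugify` feature here is hypothetical, not from the codebase):

```python
def slugify(title: str) -> str:
    raise NotImplementedError  # RED: stub only, so the test fails for the right reason

def test_slugify():
    assert slugify("Hello World") == "hello-world"

try:
    test_slugify()          # RED: fails because the feature is missing
except NotImplementedError:
    pass

def slugify(title: str) -> str:   # GREEN: minimal code to make the test pass
    return title.lower().replace(" ", "-")

test_slugify()                    # passes; REFACTOR next while keeping it green
```

A failure type other than the expected one (e.g. `NameError` from a typo) would escape the `try` block, which is exactly the "not syntax error" check the step calls for.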

pilot/commands/spec-plan.md

Lines changed: 25 additions & 0 deletions
````diff
@@ -158,6 +158,8 @@ For each: document hypotheses, note full file paths, track unanswered questions.
 ~/.pilot/bin/pilot notify plan_approval "Design Decisions" "<plan_name> — architecture choices" --plan-path "<plan_path>" 2>/dev/null || true
 ```
 
+Frame each decision as **"X at the cost of Y"** — never recommend without stating what it costs.
+
 Incorporate user choices into plan design, proceed to Step 1.5.
 
 ### Step 1.5: Implementation Planning
@@ -193,6 +195,8 @@ Incorporate user choices into plan design, proceed to Step 1.5.
 
 **Zero-context assumption:** Assume implementer knows nothing. Provide exact file paths, explain domain concepts, reference similar patterns.
 
+**Assumptions:** After creating tasks, write the `## Assumptions` section — one bullet per assumption: what you assume, which finding supports it, which task numbers depend on it. When implementation hits a surprise, this list tells the implementer which tasks are affected.
+
 #### Step 1.5.1: Goal Verification Criteria
 
 After creating tasks, derive for the `## Goal Verification` section:
@@ -201,6 +205,18 @@ After creating tasks, derive for the `## Goal Verification` section:
 3. For each truth, identify supporting artifacts (files with real implementation)
 4. Identify 2-5 key links (critical component connections)
 
+#### Step 1.5.2: Pre-Mortem & Falsification Signals
+
+**Assume this plan failed after full execution. Why?** Write 2-3 failure scenarios with observable trigger conditions checked during implementation.
+
+**This is distinct from Risks** (external dependencies outside your control) and from **Goal Verification truths** (what success looks like). Pre-Mortem covers *internal approach validity* — where your own design choices or assumptions could be wrong.
+
+Example: Risk = "Redis is unavailable" | Pre-Mortem = "We assumed sessions are stateless but they're not — trigger: session data can't round-trip through the new format in first integration test"
+
+**During implementation**, these triggers are handled autonomously — the implementer adapts the approach rather than stopping the workflow.
+
+Write these to the `## Pre-Mortem` section of the plan.
+
 ### Step 1.6: Write Full Plan
 
 **Save to:** `docs/plans/YYYY-MM-DD-<feature-name>.md`
@@ -246,6 +262,10 @@ Type: Feature
 ## Implementation Tasks
 [Tasks from Step 1.5]
 
+## Assumptions
+- [What you assume] — supported by [finding/file:line] — Tasks N, M depend on this
+- [What you assume] — supported by [finding/file:line] — Task N depends on this
+
 ## Testing Strategy
 - Unit / Integration / Manual verification
 
@@ -254,6 +274,11 @@ Type: Feature
 ⚠️ Mitigations are commitments — verification checks they're implemented.
 ✅ "Reset to null when project not in list" ❌ "Handle edge cases"
 
+## Pre-Mortem
+*Assume this plan failed. Most likely internal reasons (approach validity, not external deps):*
+1. **[Failure scenario]** (Task N) → Trigger: [observable condition during implementation]
+2. **[Failure scenario]** (Task N) → Trigger: [observable condition during implementation]
+
 ## Goal Verification
 ### Truths
 ### Artifacts
````
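Because the `## Assumptions` bullet format fixes the phrase "Tasks N, M depend on this", the mapping from invalidated assumption to affected tasks can be recovered mechanically. A sketch (hypothetical helper; the regex assumes the template wording above):

```python
import re

def tasks_depending(assumptions_md: str) -> dict[str, list[int]]:
    """Map each '## Assumptions' bullet to the task numbers that depend on it."""
    deps: dict[str, list[int]] = {}
    for line in assumptions_md.splitlines():
        if not line.startswith("- "):
            continue
        # Template phrasing: "... — Tasks 2, 5 depend on this" / "... — Task 4 depends on this"
        m = re.search(r"Tasks?\s+([\d,\s]+)\s+depends?\s+on\s+this", line)
        if m:
            assumption = line[2:].split("—")[0].strip()
            deps[assumption] = [int(n) for n in re.findall(r"\d+", m.group(1))]
    return deps

section = """- Sessions are stateless — supported by auth.py:42 — Tasks 2, 5 depend on this
- Config is YAML — supported by loader.py:10 — Task 4 depends on this"""
print(tasks_depending(section))
# {'Sessions are stateless': [2, 5], 'Config is YAML': [4]}
```

This is what the implementer's "surprise discovery" step consumes: given one falsified assumption, the dependent task numbers fall out directly.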

pilot/commands/spec-verify.md

Lines changed: 18 additions & 1 deletion
````diff
@@ -214,6 +214,16 @@ For EACH task, verify its Definition of Done criteria against the running progra
 
 If any criterion unmet: fix inline if possible, or add task and loop back.
 
+### Step 3.8b: Not Verified Acknowledgment
+
+List what was **NOT** verified and why. Include in the verification report (Step 3.13). Every gap must have a reason:
+
+| Not Verified | Reason |
+|-------------|--------|
+| [criterion or scenario] | No test environment / Out of scope / Untestable statically / Deferred |
+
+"None — all criteria have automated verification" is a valid answer if true. Do not omit this section: absence of acknowledged gaps ≠ absence of real gaps.
+
 ### Step 3.9: E2E Verification (Full profile only)
 
 **If runtime profile is not Full:** Skip.
@@ -311,7 +321,14 @@ If any fails: fix on base branch, re-run, commit fix separately (e.g., `fix: res
 
 1. Set `Status: VERIFIED` in plan
 2. Register: `~/.pilot/bin/pilot register-plan "<plan_path>" "VERIFIED" 2>/dev/null || true`
-3. Report completion with summary
+3. Report completion with summary:
+```
+## Verification Complete
+**Issues Found:** X
+### Goal Achievement: N/M truths verified
+### Must Fix (N) | Should Fix (N) | Suggestions (N)
+### Not Verified: [list items from Step 3.8b, or "None"]
+```
 
 **When verification FAILS (missing features, serious bugs):**
 
````
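The rule that every unverified item must carry a reason can be enforced mechanically before sign-off. A sketch, assuming the rows are parsed from the acknowledgment table above (the reason vocabulary is taken from its example row):

```python
VALID_REASONS = {"No test environment", "Out of scope", "Untestable statically", "Deferred"}

def unexplained_gaps(rows: list[tuple[str, str]]) -> list[str]:
    """Return criteria whose 'Reason' cell is empty or not an allowed value."""
    return [item for item, reason in rows if reason.strip() not in VALID_REASONS]

report = [
    ("UI renders on Safari", "No test environment"),
    ("Load test at 10k rps", ""),  # gap with no stated reason: must be fixed before sign-off
]
print(unexplained_gaps(report))  # ['Load test at 10k rps']
```

An empty result list corresponds to the "None — all criteria have automated verification" answer being honest.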

pilot/rules/development-practices.md

Lines changed: 16 additions & 0 deletions
```diff
@@ -37,6 +37,12 @@ probe search "database connection setup" ./
 
 **Red Flags → STOP:** "Quick fix for now", multiple changes at once, proposing fixes before tracing data flow, 2+ failed fixes.
 
+**Revert-First:** When something breaks during implementation, default response = simplify, not add more code.
+1. **Revert** — undo the change that broke it. Clean state.
+2. **Delete** — can the broken thing be removed entirely?
+3. **One-liner** — minimal targeted fix only.
+4. **None of the above** → stop, reconsider the approach. 3+ failed fixes = the approach is wrong, not the fix.
+
 **Meta-Debugging:** Treat your own code as foreign. Your mental model is a guess — the code's behavior is truth.
 
 #### Defense-in-Depth & Root-Cause Tracing
@@ -76,6 +82,16 @@ result = await wait_for(lambda: get_result() is not None, timeout=5.0)
 
 **Rules:** Poll every 10ms (not 1ms — wastes CPU). Always include timeout with clear error message. Call getter inside loop for fresh data (no stale cache).
 
+### Constraint Classification
+
+When exploring a problem or codebase, classify constraints you encounter:
+
+- **Hard** — non-negotiable (physical limits, external contracts, security requirements, deadlines)
+- **Soft** — preferences or conventions — negotiable if trade-off is stated explicitly
+- **Ghost** — past constraints baked into the current approach that **no longer apply**
+
+Ghost constraints are the most valuable to find: they lock out options nobody thinks are available. Ask "why can't we do X?" — if nobody can point to a current requirement, it may be a ghost.
+
 ### Git Operations
 
 **Read git state freely. NEVER execute write commands without EXPLICIT user permission.**
```
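The polling rules in the hunk above (10ms interval, a timeout with a clear error, a getter called fresh on each iteration) imply an implementation along these lines. This is a sketch only, since the project's actual `wait_for` is not shown in this diff:

```python
import asyncio
import time

async def wait_for(condition, timeout: float = 5.0, interval: float = 0.01):
    """Poll `condition` every 10ms until truthy; raise with a clear message on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()  # getter called inside the loop, so data is always fresh
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition {condition!r} not met within {timeout}s")
        await asyncio.sleep(interval)  # 10ms default: responsive without burning CPU
```

Used as in the quoted example: `result = await wait_for(lambda: get_result() is not None, timeout=5.0)`.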

pilot/rules/standards-python.md

Lines changed: 2 additions & 2 deletions
````diff
@@ -28,7 +28,7 @@ uv run pytest -q --cov=src --cov-fail-under=80 # Coverage
 
 ruff format . # Format
 ruff check . --fix # Lint
-basedpyright src # Type check
+basedpyright src # Type check (adapt to your source dirs)
 ```
 
 ### Code Style
@@ -55,7 +55,7 @@ basedpyright src # Type check
 - [ ] `uv run pytest` — tests pass
 - [ ] `ruff format .` — formatted
 - [ ] `ruff check .` — clean
-- [ ] `basedpyright src` — clean
+- [ ] `basedpyright src` — clean (adapt to your source dirs)
 - [ ] Coverage ≥ 80%
 - [ ] No unused imports
 - [ ] Production files ideally under 800 lines (1000+ = consider splitting)
````

pilot/rules/verification.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -25,6 +25,8 @@ Unit tests with mocks prove nothing about real-world behavior. After tests pass:
 
 ### Evidence Before Claims
 
+**Before proceeding:** Ask "Do these tests verify what matters, or only what was easy to test?" If important edge cases go untested, acknowledge the gap explicitly — don't claim full coverage when you only have partial coverage.
+
 1. **Identify** — What command proves this claim?
 2. **Execute** — Run the full command (not cached)
 3. **Read output** — Check exit code, count failures
```
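The three steps above (identify the proving command, execute it fresh, read the real output) can be sketched as a tiny helper; the helper name and the claims shown are illustrative, not part of the rules:

```python
import subprocess
import sys

def prove(claim: str, cmd: list[str]) -> bool:
    """Execute the proving command fresh and read its real exit code, not memory."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    ok = result.returncode == 0
    print(f"{claim}: exit={result.returncode} -> {'verified' if ok else 'refuted'}")
    return ok

prove("interpreter starts", [sys.executable, "-c", "pass"])
```

The point is the discipline, not the helper: a claim of "tests pass" is backed by a fresh exit code 0, never by a remembered earlier run.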

pilot/scripts/mcp-server.cjs

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

pilot/scripts/worker-service.cjs

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.
