From 4193e3c2520421fa22e9cc01fb2413c2821ed4c4 Mon Sep 17 00:00:00 2001 From: alex-solovyev <1556417+alex-solovyev@users.noreply.github.com> Date: Sun, 29 Mar 2026 21:25:06 -0400 Subject: [PATCH 1/2] =?UTF-8?q?docs:=20tighten=20production-audio.md=20(15?= =?UTF-8?q?1=E2=86=92139=20lines,=20remove=20redundant=20pipeline=20block/?= =?UTF-8?q?checklist)=20(#13490)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit From 168d0c47fa18bdd4c55bb3272a103d8e793d26ab Mon Sep 17 00:00:00 2001 From: AI DevOps Date: Sun, 29 Mar 2026 21:24:46 -0600 Subject: [PATCH 2/2] =?UTF-8?q?docs:=20tighten=20graduated-learnings.md=20?= =?UTF-8?q?(138=E2=86=9261=20lines,=20preserve=20all=20refs)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .agents/aidevops/graduated-learnings.md | 147 ++++++------------------ 1 file changed, 35 insertions(+), 112 deletions(-) diff --git a/.agents/aidevops/graduated-learnings.md b/.agents/aidevops/graduated-learnings.md index 9376456e13..350c2eba3e 100644 --- a/.agents/aidevops/graduated-learnings.md +++ b/.agents/aidevops/graduated-learnings.md @@ -13,126 +13,49 @@ tools: # Graduated Learnings -Validated learnings promoted from local memory databases into shared documentation. -These patterns have been confirmed through repeated use across sessions. +Validated patterns promoted from local memory. Qualify at high confidence or 3+ accesses. +Managed by `memory-graduate-helper.sh graduate` / `memory-helper.sh graduate [candidates|graduate|status]`. -**How memories graduate**: Memories qualify when they reach high confidence or are -accessed frequently (3+ times). The `memory-graduate-helper.sh` script identifies -candidates and appends them here. Each graduation batch is timestamped. 
- -**Categories**: - -- **Solutions & Fixes**: Working solutions to real problems -- **Anti-Patterns**: Approaches that failed (avoid repeating) -- **Patterns & Best Practices**: Proven approaches -- **Architecture Decisions**: Key design choices and rationale -- **Configuration & Preferences**: Tool and workflow settings -- **Context & Background**: Important background information - -**Usage**: `memory-helper.sh graduate [candidates|graduate|status]` - -## Graduated: 2026-02-08 - -### Anti-Patterns (What NOT to Do) - -- **[FAILED_APPROACH]** Tried using PostgreSQL for memory but it adds deployment complexity - SQLite FTS5 is simpler - *(confidence: high, validated: 9x)* - -- **[FAILURE_PATTERN]** [task:refactor] Haiku missed edge cases when refactoring complex shell scripts with many conditionals [model:haiku] - *(confidence: high, validated: 3x)* - -### Architecture Decisions - -- **[ARCHITECTURAL_DECISION]** YAML handoffs are more token-efficient than markdown (~400 vs ~2000 tokens) - *(confidence: high, validated: 0x)* - -- **[DECISION]** Mailbox uses SQLite (`mailbox.db`) not TOON files. Prune shows storage report by default, `--force` to delete. Migration from TOON runs automatically on `aidevops update` via `setup.sh`. - *(confidence: medium, validated: 8x)* - -- **[DECISION]** Agent lifecycle uses three tiers: `draft/` (R&D, orchestration-created), `custom/` (private, permanent), shared (`.agents/` via PR). Both `draft/` and `custom/` survive `setup.sh` deployments. Orchestration agents (Build+, Ralph loop, runners) know they can create drafts for reusable parallel processing context and propose them for inclusion in aidevops. 
- *(confidence: medium, validated: 3x)* - -### Configuration & Preferences - -- **[USER_PREFERENCE]** Prefer conventional commits with scope: feat(memory): description - *(confidence: medium, validated: 4x)* - -### Patterns & Best Practices - -- **[SUCCESS_PATTERN]** [task:feature] Breaking task into 4 phases with separate commits worked well for Claude-Flow feature adoption [model:sonnet] - *(confidence: high, validated: 3x)* - -- **[SUCCESS_PATTERN]** [task:bugfix] Opus identified root cause of race condition by reasoning through concurrent execution paths [model:opus] - *(confidence: high, validated: 2x)* - -- **[CODEBASE_PATTERN]** Memory daemon should auto-extract learnings from thinking blocks when sessions end - *(confidence: medium, validated: 5x)* - -## Graduated: 2026-02-11 - -### Anti-Patterns (What NOT to Do) - -- **[FAILURE_PATTERN]** Session anti-pattern: mentioning issues in summary text without logging them as TODOs or fixing them. This creates an illusion of thoroughness while actually losing the improvements. The fix is mechanical: every time you type a sentence describing a bug or limitation, STOP and either (1) fix it now, or (2) add a TODO entry. Then continue writing the summary. Do not batch issue logging to the end. - *(confidence: high, validated: 1x)* - -### Architecture Decisions - -- **[DECISION]** When discovering bugs or issues during a task, log them as TODOs IMMEDIATELY — do not defer until the end of the session or until the user asks. This session discovered 8 issues but only logged 2 until prompted. The development lifecycle rule is clear: issues discovered during work must be fixed on the fly or logged as TODOs. Deferring loses context and risks forgetting entirely. - *(confidence: high, validated: 1x)* - -- **[DECISION]** For content generation tasks (images, video, UGC, ads), ALWAYS read domain subagents BEFORE generating. 
`content/production-image.md` has Nanobanana Pro JSON prompt templates that produce dramatically better results than freehand prompts. `tools/video/video-prompt-design.md` has the 7-component video prompt format. `content/story.md` has hook frameworks. Using structured templates from subagents vs freehand: the difference was visible in output quality during the Trinity Windows UGC test. - *(confidence: high, validated: 1x)* - -- **[DECISION]** UGC content generation needs a complete assembled sequence, not just individual assets. When storyboarding multi-shot content: (1) generate video for ALL shots not just the hero, (2) assemble into a single sequence with transitions using `ffmpeg`, (3) output the final assembled video as the primary deliverable. Individual clips are intermediates, not the final product. - *(confidence: high, validated: 1x)* - -- **[DECISION]** CRITICAL self-improvement: The supervisor needs a post-evaluation orphaned PR scanner. Pattern observed across 47+ tasks: workers create PRs but the supervisor records `task_only` or `no_pr` because (1) worker emits `TASK_COMPLETE` instead of `FULL_LOOP_COMPLETE`, (2) PR creation happens after the signal, or (3) `evaluate_worker` fails to parse the PR URL from logs. Fix: add a Phase 3c to the pulse cycle that runs `gh pr list --state open --head feature/tXXX` for all tasks in complete/deployed/failed states with `task_only`/`no_pr`/NULL `pr_url`, and links any found PRs. This would have caught t199.2 (PR #849), t199.3 (PR #846), t199.5 (PR #872) automatically instead of requiring manual intervention. - *(confidence: high, validated: 0x)* - -### Configuration & Preferences - -- **[USER_PREFERENCE]** User-facing generated assets (images, videos, documents) should be output to `~/Downloads/` so the user can immediately review them in Finder. Do NOT bury outputs in `~/.aidevops/.agent-workspace/` for interactive sessions — that path is invisible to the user. 
Reserve `.agent-workspace` for headless/pipeline runs only. - *(confidence: high, validated: 0x)* - -- **[USER_PREFERENCE]** Runtime identity hazard: misidentifying as Claude Code when running in OpenCode wastes cycles investigating wrong config paths, wrong CLI commands, wrong prompt loading. The AGENTS.md rule says `use the app name from the version check output — do not guess`. Enforce this strictly — wrong identity leads to wrong assumptions about how `build.txt` loads, where configs live, and which CLI to use for dispatch. - *(confidence: high, validated: 0x)* - -### Patterns & Best Practices - -- **[CODEBASE_PATTERN]** OpenCode system prompt override: the agent `prompt` field in `opencode.json` replaces `anthropic_default` (not appends). Code path: `input.agent.prompt ? [input.agent.prompt] : SystemPrompt.provider(input.model)`. The `{file:path}` syntax is resolved by template matching. All active agents must have `build.txt` set or they fall back to upstream `anthropic.txt`, losing all aidevops overrides. Verified: all 12 active agents have it; 4 disabled agents (build, plan, Plan+, AI-DevOps) don't need it. - *(confidence: high, validated: 1x)* - -- **[CODEBASE_PATTERN]** Task ID collision: t264 was assigned by two sessions simultaneously. Another session used t264 for memory monitoring (PR #1040), while this session used t264 for version-manager unbound variable fix. The pre-dispatch check caught it (`t264 is marked [x] in TODO.md`). Prevention: always `git pull` and re-read TODO.md before assigning IDs. The collision prevention rule in AGENTS.md exists but needs enforcement during monitoring sessions that create tasks. - *(confidence: high, validated: 1x)* - -- **[CODEBASE_PATTERN]** Stale TODO.md pattern: tasks completed in previous sessions (t231 via PR #955, t247 via subtask PRs, t259 via PR #1020) remain open in TODO.md because the supervisor's `update_todo_on_complete()` only runs during the post-PR lifecycle. 
When a monitoring session manually merges PRs or when tasks are completed across session boundaries, TODO.md falls out of sync. Fix: run `supervisor-helper.sh reconcile-todo` periodically, and before dispatching tasks, check if the work is already done (workers now do this and report `task_obsolete`). - *(confidence: high, validated: 0x)* +--- -- **[SUCCESS_PATTERN]** [task:feature] Supervisor task t136.5 completed successfully | PR: https://github.com/marcusquinn/aidevops/pull/792 | Task: Scaffold aidevops-pro and aidevops-anon repos - create initial plugin repos with proper structure [model:opus] [duration:1206s] - *(confidence: medium, validated: 51x)* +## Anti-Patterns -### Solutions & Fixes +- **PostgreSQL for memory** adds deployment complexity — SQLite FTS5 is simpler. *(9x)* +- **Haiku on complex shell refactors** misses edge cases with many conditionals. Use sonnet+. *(3x)* +- **Mentioning issues in summary text** without logging them as TODOs loses improvements. Fix: every sentence describing a bug → either fix it now or add a TODO entry. Do not batch to end of session. *(1x)* -- **[ERROR_FIX]** Deploying auto-recovery infinite loop: when a task is stuck in `deploying` and its PR is already merged, `cmd_pr_lifecycle` Step 4b runs `cleanup_after_merge` (worktree already gone) and `update_todo_on_complete` (already marked [x]). The `retry_count` variable was LOCAL and reset every pulse cycle, allowing infinite recovery attempts across pulses. Additionally, if `cmd_transition` to `deployed` fails AND the fallback `cmd_transition` to `failed` also fails, the task stays in `deploying` forever. Fixed by t263 (PR #1036): persistent `deploying_recovery_attempts` DB column, max 10 attempts across all pulses, fallback direct SQL UPDATE. 
- *(confidence: high, validated: 0x)* +## Architecture Decisions -- **[ERROR_FIX]** Pulse silent failure pattern: with `set -euo pipefail`, Phase 3 (`process_post_pr_lifecycle`) can fail silently because it's called with `2>/dev/null || true`. If the function crashes internally (e.g., infinite loop in deploying auto-recovery), the pulse exits with code 1 but produces no output after the header line. The `|| true` prevents the error from propagating, but the exit code still leaks through. Symptom: pulse prints `=== Supervisor Pulse ===` and nothing else. Diagnosis: check `post-pr.log` for repeated entries, check exit code of manual pulse run. - *(confidence: high, validated: 0x)* +- **YAML handoffs** are more token-efficient than markdown (~400 vs ~2000 tokens). +- **Mailbox** uses SQLite (`mailbox.db`) not TOON files. Prune shows storage report by default; `--force` to delete. Migration from TOON runs automatically on `aidevops update`. *(8x)* +- **Agent lifecycle tiers**: `draft/` (R&D, orchestration-created), `custom/` (private, permanent), shared (`.agents/` via PR). Both `draft/` and `custom/` survive `setup.sh`. *(3x)* +- **Bugs found during a task** must be logged as TODOs IMMEDIATELY — not deferred to end of session. Deferring loses context. *(1x)* +- **Content generation tasks** (images, video, UGC, ads): ALWAYS read domain subagents BEFORE generating. `content/production-image.md` has JSON prompt templates; `tools/video/video-prompt-design.md` has the 7-component format; `content/story.md` has hook frameworks. *(1x)* +- **UGC multi-shot content**: generate ALL shots (not just hero), assemble with `ffmpeg`, output assembled video as primary deliverable. Individual clips are intermediates. *(1x)* +- **Orphaned PR scanner needed**: workers create PRs but supervisor records `task_only`/`no_pr` when (1) worker emits `TASK_COMPLETE` instead of `FULL_LOOP_COMPLETE`, (2) PR created after signal, or (3) `evaluate_worker` fails to parse PR URL. 
Fix: add a Phase 3c to the pulse cycle that runs `gh pr list --state open --head feature/tXXX` for tasks in complete/deployed/failed with `task_only`/`no_pr`/NULL `pr_url`, and links any PRs found. This would have caught t199.2 (PR #849), t199.3 (PR #846), t199.5 (PR #872) automatically; they required manual intervention. +- **OpenCode system prompt override**: `prompt` field in `opencode.json` replaces `anthropic_default` rather than appending to it. All active agents must have `build.txt` set or they fall back to upstream `anthropic.txt`. *(1x)* -- **[WORKING_SOLUTION]** Bash associative arrays (`declare -A`) + `set -u` = unbound variable on empty arrays and subscript access. Use newline-delimited string + grep instead for portable `set -u`-safe lookups. Fixed in `issue-sync-helper.sh` PR #1086. - *(confidence: high, validated: 0x)* +## Configuration & Preferences - -- **[WORKING_SOLUTION]** Worker PRs dispatched in parallel for tasks with dependency chains (blocked-by) will create merge conflicts. t008.1-4 and t012.3-5 all conflicted because workers ran simultaneously on overlapping files. Solution: dispatch sequentially respecting blocked-by dependencies, or use a single worker for the entire plan. - *(confidence: high, validated: 0x)* +- **Conventional commits with scope**: `feat(memory): description` *(4x)* +- **User-facing generated assets** → `~/Downloads/` for interactive sessions. Do NOT use `~/.aidevops/.agent-workspace/`; that path is invisible to the user in Finder. Reserve `.agent-workspace` for headless/pipeline runs. +- **Runtime identity**: always use the app name from version-check output. Misidentifying (e.g., Claude Code vs OpenCode) leads to wrong config paths, CLI commands, and prompt loading assumptions. -- **[WORKING_SOLUTION]** Decomposition workers marking parent #plan tasks [x] is a known bug (t278). Parents t008 and t012 were falsely completed while subtasks were still [ ]. Always verify subtask completion before marking parent done. 
- *(confidence: high, validated: 0x)* +## Patterns & Best Practices - -- **[WORKING_SOLUTION]** issue-sync `find_closing_pr()` bug pattern: when TODO.md uses a different format (`pr:#NNN`) than what the code searches for (`PR #NNN`), close comments silently omit the PR reference. Always check that regex patterns match the actual data format in TODO.md. Fixed in t291/PR#1129. - *(confidence: high, validated: 0x)* +- **Phase-based task breakdown** (4 phases, separate commits) worked well for complex feature adoption. *(3x)* +- **Opus for race condition root cause**: reasoning through concurrent execution paths identified the issue. *(2x)* +- **Memory daemon** should auto-extract learnings from thinking blocks when sessions end. *(5x)* +- **Task ID collision** (t264 assigned twice): always `git pull` and re-read TODO.md before assigning IDs. The pre-dispatch check catches it. *(1x)* +- **Stale TODO.md**: `update_todo_on_complete()` only runs during post-PR lifecycle. Cross-session merges leave TODO.md out of sync. Fix: run `supervisor-helper.sh reconcile-todo` periodically; before starting, workers check whether the work is already done and report `task_obsolete`. -- **[WORKING_SOLUTION]** CRITICAL FIX: Cron supervisor pulse requires three things to work on macOS: (1) `/usr/sbin` in PATH for `sysctl`, (2) `GH_TOKEN` cached to file since macOS keyring is inaccessible from cron - `supervisor-helper.sh` now auto-caches token from interactive sessions to `~/.aidevops/.agent-workspace/supervisor/.gh-token-cache`, (3) `get_aidevops_identity` must validate `gh api` output is not JSON error. Fixed in PR #780. - *(confidence: medium, validated: 52x)* +## Solutions & Fixes -- **[WORKING_SOLUTION]** SYSTEMIC: After merging PRs that modify `supervisor-helper.sh` or other scripts in `.agents/scripts/`, the deployed copy at `~/.aidevops/agents/scripts/` is NOT automatically updated. The cron pulse runs the deployed copy. 
Must run `rsync -a --exclude=loop-state/ --exclude=custom/ --exclude=draft/ ~/Git/aidevops/.agents/ ~/.aidevops/agents/` or `aidevops update` after merging script changes. `setup.sh` `deploy_aidevops_agents()` handles this but may not run to completion in all modes. Consider adding auto-deploy to the supervisor's post-merge hook. - *(confidence: medium, validated: 37x)* +- **Deploying auto-recovery infinite loop** (t263/PR #1036): `retry_count` was LOCAL and reset every pulse cycle. Fixed with persistent `deploying_recovery_attempts` DB column, max 10 attempts, fallback direct SQL UPDATE. +- **Pulse silent failure** with `set -euo pipefail`: Phase 3 called with `2>/dev/null || true` — crashes silently, pulse exits after printing only the header. Diagnosis: check `post-pr.log` for repeated entries. +- **Bash `declare -A` + `set -u`** = unbound variable on empty arrays. Use newline-delimited string + grep for portable `set -u`-safe lookups. Fixed in `issue-sync-helper.sh` PR #1086. +- **Parallel worker PRs on dependency chains** create merge conflicts. Dispatch sequentially respecting `blocked-by`, or use a single worker for the plan. (t008.1-4, t012.3-5) +- **Decomposition workers marking parent `#plan` tasks `[x]`** is a known bug (t278). Always verify subtask completion before marking parent done. +- **`issue-sync find_closing_pr()` format mismatch**: `pr:#NNN` in TODO.md vs `PR #NNN` in regex → close comments silently omit PR reference. Fixed in t291/PR#1129. +- **Cron supervisor pulse on macOS** (PR #780): requires (1) `/usr/sbin` in PATH for `sysctl`, (2) `GH_TOKEN` cached to file (`~/.aidevops/.agent-workspace/supervisor/.gh-token-cache`) since macOS keyring is inaccessible from cron, (3) `get_aidevops_identity` must validate `gh api` output is not a JSON error. *(52x)* +- **Script deploy lag** (37x): after merging PRs that modify `.agents/scripts/`, the deployed copy at `~/.aidevops/agents/scripts/` is NOT auto-updated. 
Run `aidevops update` or `rsync -a --exclude=loop-state/ --exclude=custom/ --exclude=draft/ ~/Git/aidevops/.agents/ ~/.aidevops/agents/` after merging script changes.
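The `declare -A` + `set -u` workaround listed under Solutions & Fixes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from `issue-sync-helper.sh`: the `task_prs` variable, the `|key|value` record layout, and the `map_set`/`map_get` helpers are hypothetical names chosen for the example, and values are assumed not to contain newlines.

```shell
#!/bin/sh
set -eu

# Hypothetical store: one "|key|value" record per line.
# A plain string is always "set", so `set -u` never trips on an empty map.
task_prs=""

# map_set KEY VALUE: append a record (later records shadow earlier ones).
map_set() {
  task_prs="${task_prs}|$1|$2
"
}

# map_get KEY: print the most recent value for KEY, or return 1 if absent.
map_get() {
  # grep -F matches the key literally, so the dot in IDs like t199.2
  # is not treated as a regex wildcard.
  _line="$(printf '%s' "$task_prs" | grep -F -- "|$1|" | tail -n 1)"
  [ -n "$_line" ] || return 1
  # Fields are: empty, key, value; print everything from field 3 on.
  printf '%s\n' "$_line" | cut -d '|' -f 3-
}

# Illustrative data only (task IDs and PR numbers taken from the notes above).
map_set "t199.2" "849"
map_set "t199.3" "846"

map_get "t199.2"                          # prints 849
map_get "t999" || echo "t999: no entry"   # missing key returns 1
```

Wrapping keys in `|` delimiters is one way to stop a lookup for `t19` from matching `t199.2`; any delimiter that cannot appear in a key would serve the same purpose.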