feat: agent eval mode for loop exit conditions #126
Conversation
- Stale state guard now allows PAUSED loops through, consistent with the early guard (a graceful pause should record iteration results)
- Add cleanup of in-memory tracking maps (taskToLoop, pipelineTasks) when the stale state guard returns early, preventing memory leaks
- Add evalMode/evalPrompt to the MCP LoopStatus loop response
- Add evalFeedback to the MCP LoopStatus iteration history
- Add CLI tests for --eval-mode, --eval-prompt, and --strategy flags
- Add MCP adapter tests for evalMode/evalPrompt schema acceptance
- Extract a TaskCompletionStatus type alias; replace a nested ternary with a switch
- Deduplicate the buildEvalPrompt template structure
- Remove redundant comments restating adjacent code
- Remove an empty else block in loop-manager validation
- Fix a stale dependency-count comment in handler-setup
…cope stale guard to agent mode

- Extract `cleanupIterationTracking(taskId, loopId, iteration)` from three duplicated cleanup sequences in handleTaskTerminal, eliminating the repetition
- Guard the stale-state re-fetch block with `if (loop.evalMode === 'agent')` so shell eval (milliseconds) bypasses the unnecessary DB round-trips
- Update the stale-guard test to use `evalMode: 'agent'`, matching the scope of the guard it exercises

Co-Authored-By: Claude <noreply@anthropic.com>
…itCondition

createScheduledLoop() unconditionally rejected an empty exitCondition, blocking agent eval mode, which uses LLM review instead of a shell command. Guard the check with `evalMode === 'shell'` (the default), matching the loop-manager validation.

Co-Authored-By: Claude <noreply@anthropic.com>
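The guard described above might look like the following minimal sketch (the request shape and function name are assumptions for illustration, not the real API):

```typescript
// Only shell mode requires a non-empty exitCondition; agent mode uses
// LLM review, so an empty condition is valid there.
enum EvalMode { SHELL = "shell", AGENT = "agent" }

interface CreateLoopRequest {
  evalMode?: EvalMode;
  exitCondition?: string;
}

// Returns an error message, or null when the request is valid.
function validateExitCondition(req: CreateLoopRequest): string | null {
  const mode = req.evalMode ?? EvalMode.SHELL; // shell is the default
  if (mode === EvalMode.SHELL && !req.exitCondition?.trim()) {
    return "exitCondition is required for shell eval mode";
  }
  return null;
}
```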
AgentExitConditionEvaluator now subscribes to LoopCancelled inside waitForTaskCompletion and immediately emits TaskCancellationRequested for the in-flight eval task. Previously, the eval task was deliberately not tracked in LoopHandler.taskToLoop, so handleLoopCancelled could not reach it, leaving it running as an orphan that consumed a worker slot until evalTimeout. The existing stale-state guard in LoopHandler already discarded the result correctly; this fix adds the missing cancellation signal so the worker slot is freed immediately.

Two tests added: one verifies TaskCancellationRequested is emitted on loop cancellation, and one verifies that cancellation of a different loop is ignored.

Co-Authored-By: Claude <noreply@anthropic.com>
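The fix can be sketched with a plain Node EventEmitter standing in for the event bus (event names match the PR; the function shape and bus wiring are assumptions for illustration):

```typescript
import { EventEmitter } from "node:events";

const bus = new EventEmitter();

// While an eval task is in flight, listen for LoopCancelled and forward a
// TaskCancellationRequested for that task so the worker slot is freed
// immediately instead of waiting for evalTimeout.
function waitForEvalTask(loopId: string, evalTaskId: string): Promise<string> {
  return new Promise((resolve) => {
    const onCancelled = (cancelledLoopId: string) => {
      if (cancelledLoopId !== loopId) return; // cancellation of another loop: ignore
      bus.emit("TaskCancellationRequested", evalTaskId);
      cleanup();
      resolve("cancelled");
    };
    const onCompleted = (taskId: string) => {
      if (taskId !== evalTaskId) return;
      cleanup();
      resolve("completed");
    };
    const cleanup = () => {
      bus.off("LoopCancelled", onCancelled);
      bus.off("TaskCompleted", onCompleted);
    };
    bus.on("LoopCancelled", onCancelled);
    bus.on("TaskCompleted", onCompleted);
  });
}
```

Note the unsubscribe in `cleanup()`: without it, every eval would leak a listener, the same class of bug as the tracking-map leaks fixed elsewhere in this PR.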
- FEATURES.md: add agent eval to Loop Strategies, evalMode/evalPrompt to Configuration, agent eval CLI examples, and Migration 15 to Database Schema - README.md: add two agent eval examples in the Eval Loops section - CHANGELOG.md: add [Unreleased] entry for agent eval mode feature - CLAUDE.md: add agent-exit-condition-evaluator.ts and composite-exit-condition-evaluator.ts to File Locations table - domain.ts: clarify exitCondition comment (empty string for agent mode) Fix: convert EvalMode from string literals to enum throughout codebase so CompositeExitConditionEvaluator switch exhaustiveness check compiles. Affected: domain.ts, loop.ts, mcp-adapter.ts, loop-repository.ts, schedule-repository.ts, loop-handler.ts, loop-manager.ts, composite-exit-condition-evaluator.ts, format.ts, orchestrator-prompt.ts Co-Authored-By: Claude <noreply@anthropic.com>
…itch - schedule-manager.ts: replace 'shell' string literal with EvalMode.SHELL - mcp-adapter.ts: remove toEvalMode helper import (unused after nativeEnum Zod schemas) and remove unnecessary EvalMode casts on parsed data - utils/format.ts: remove toEvalMode() helper — now superseded by z.nativeEnum(EvalMode) at MCP/CLI boundaries EvalMode enum is now consistently used at all comparison sites across domain, handlers, services, adapters, and CLI. The toEvalMode boundary parser is no longer needed since Zod nativeEnum validates and types the input directly. Co-Authored-By: Claude <noreply@anthropic.com>
Linter preference: keep explicit .default(EvalMode.SHELL) on nativeEnum schemas for CreateLoop and ScheduleLoop — makes the default explicit at the protocol boundary rather than relying solely on the domain factory. Co-Authored-By: Claude <noreply@anthropic.com>
validateCreateRequest now explicitly rejects evalMode values that are neither EvalMode.SHELL nor EvalMode.AGENT, returning an INVALID_INPUT error with the received value. While TypeScript enforces the EvalMode enum at compile time for typed callers, the guard protects against deserialized or cast inputs that bypass type checking at runtime. Co-Authored-By: Claude <noreply@anthropic.com>
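A hedged sketch of that runtime guard (the exact error shape is an assumption; only the INVALID_INPUT code and the two enum members come from the PR):

```typescript
// TypeScript checks the enum at compile time, but deserialized JSON or
// `as` casts can smuggle arbitrary strings in, so validate again at runtime.
enum EvalMode { SHELL = "shell", AGENT = "agent" }

type ValidationResult =
  | { ok: true }
  | { ok: false; code: "INVALID_INPUT"; message: string };

function validateEvalMode(value: unknown): ValidationResult {
  if (value === EvalMode.SHELL || value === EvalMode.AGENT) {
    return { ok: true };
  }
  // Echo the received value back so the caller can see what was rejected.
  return {
    ok: false,
    code: "INVALID_INPUT",
    message: `Invalid evalMode: ${JSON.stringify(value)}`,
  };
}
```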
…ilerplate

- Issue 22: Define a named IterationResultFields type in loop-handler.ts, replacing the inline object type on recordAndContinue's evalResult param
- Issue 23: Extract an evaluateWithCompletion helper in the agent-exit-condition-evaluator tests, reducing 12 repeated spy-capture-simulate blocks to 3-line call sites
- Issue 24: Move the createAndEmitLoop call above mockImplementation in the stale state guard test to eliminate the forward reference to loop.id in the closure
- Issue 25: Add a test verifying result processing is skipped when the iteration status changes to a terminal state during agent eval (stale iteration guard)
- Issue 26: Add a test verifying shell mode rejects evalTimeout: 300001

Co-Authored-By: Claude <noreply@anthropic.com>
Greptile Summary

This PR introduces Agent Eval Mode for loop exit conditions, allowing an AI agent (instead of a shell command) to evaluate each iteration's output and return pass/fail (retry) or a numeric score (optimize).

Key changes:
Confidence Score: 4/5

Not safe to merge as-is: agent optimize loops are completely non-functional from the CLI due to contradictory validation logic between the parser and the service layer. One P1 defect: the CLI unconditionally rejects --minimize/--maximize in agent mode, while the service layer unconditionally requires evalDirection for the optimize strategy, so every attempt to create an agent+optimize loop from the CLI will error. The README example documents this exact broken path. The MCP path is unaffected. All other changes (migration, repository, shell guard, stale-state guard, composite dispatcher, event-driven completion) look correct and well-tested.

Files needing attention: src/cli/commands/loop.ts (parseAgentModeArgs, missing evalDirection propagation) and README.md (documents the broken --maximize example)

Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI_MCP as CLI / MCP Adapter
    participant LM as LoopManagerService
    participant LH as LoopHandler
    participant CEE as CompositeEvaluator
    participant AECE as AgentEvaluator
    participant EB as EventBus
    participant Repo as LoopRepository
    CLI_MCP->>LM: createLoop({ evalMode: agent, strategy, evalPrompt })
    LM->>LM: validate(evalMode, evalDirection if optimize)
    LM->>Repo: save(loop)
    LM-->>CLI_MCP: Loop created
    Note over LH: On task completion event
    LH->>CEE: evaluate(loop, taskId)
    CEE->>AECE: evaluate(loop, taskId) [evalMode=agent]
    AECE->>Repo: findIterationByTaskId(taskId)
    AECE->>AECE: buildEvalPrompt(loop, taskId)
    AECE->>EB: subscribe TaskCompleted/Failed/Cancelled/Timeout/LoopCancelled
    AECE->>EB: emit TaskDelegated(evalTask)
    EB-->>AECE: TaskCompleted(evalTaskId)
    AECE->>AECE: outputRepo.get(evalTaskId)
    AECE->>AECE: parseEvalOutput(lines, strategy)
    AECE-->>CEE: EvalResult { passed, score?, feedback? }
    CEE-->>LH: EvalResult
    LH->>LH: stale-state guard (re-fetch loop + iteration)
    LH->>LH: handleIterationResult(freshLoop, freshIteration, evalResult)
    LH->>Repo: updateIteration(evalFeedback, score, ...)
```
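The parseEvalOutput step in the diagram can be sketched as follows. This is a hedged reconstruction from the prompt contract and error messages quoted in this PR; the real signature and optimize-mode semantics are assumptions:

```typescript
type EvalResult = { passed: boolean; score?: number; feedback?: string };

// Retry mode: the final non-empty line must be exactly PASS or FAIL.
// Optimize mode: the final non-empty line must be a numeric score 0-100.
function parseEvalOutput(lines: string[], strategy: "retry" | "optimize"): EvalResult {
  const last = [...lines].reverse().find((l) => l.trim().length > 0)?.trim();
  if (strategy === "retry") {
    if (last === "PASS") return { passed: true };
    if (last === "FAIL") return { passed: false };
    throw new Error(`Eval agent output did not end with PASS or FAIL (got: ${last})`);
  }
  const score = Number(last);
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new Error(`Eval agent output did not end with a 0-100 score (got: ${last})`);
  }
  return { passed: true, score };
}
```

Scanning backward past trailing blank lines matters here: agent output frequently ends with a newline, and a naive `lines[lines.length - 1]` check would reject an otherwise valid PASS.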
```ts
const defaultInstructions = isRetry
  ? `Review the code changes. ${gitDiffInstruction} Use \`beat logs ${taskId}\` to read the worker's output. Output PASS if the changes are acceptable, FAIL if not. The LAST LINE of your response must be exactly PASS or FAIL.`
  : `Score the code change quality 0-100. ${gitDiffInstruction} Use \`beat logs ${taskId}\` to read the worker's output. Provide your analysis, then on the LAST LINE output a single numeric score between 0 and 100.`;
const instructions = loop.evalPrompt ?? defaultInstructions;

return `${header}

IMPORTANT: Do NOT modify any files. You are an evaluator — read and assess only.

Working directory: ${loop.workingDirectory}
Iteration: ${loop.currentIteration}
Task ID: ${taskId}

${instructions}`;
}

/**
 * Wait for eval task to reach a terminal state.
 * Subscribes to LoopCancelled so that if the parent loop is cancelled while this
 * eval is in-flight, the eval task gets a TaskCancellationRequested immediately
```
**Custom evalPrompt silently drops format requirements and git diff instructions**
When loop.evalPrompt is set, it completely replaces defaultInstructions, including the critical format directives the eval agent must follow:
- Retry: the LAST LINE must be exactly `PASS` or `FAIL`.
- Optimize: the LAST LINE must be a numeric score.
- Both modes: the eval agent should run `git diff` and `beat logs <taskId>` to inspect the work.
None of these requirements appear in the built prompt when a custom evalPrompt is used. The eval agent doesn't know it needs to access the git diff, read task logs, or produce a PASS/FAIL token, so it will return free-form text and the evaluation will fail with "Eval agent output did not end with PASS or FAIL (got: ...)".
Consider injecting the gitDiffInstruction, the beat logs command, and the output format requirement unconditionally, and letting evalPrompt only override the evaluation criteria section.
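The suggested restructuring might look like this sketch (helper shape and default criteria wording are assumptions; the always-injected sections come from the review comment above):

```typescript
// Format directive, git diff instruction, and logs command are always
// injected; a custom evalPrompt only replaces the evaluation criteria.
function buildEvalPrompt(opts: {
  isRetry: boolean;
  taskId: string;
  gitDiffInstruction: string;
  evalPrompt?: string;
}): string {
  const { isRetry, taskId, gitDiffInstruction, evalPrompt } = opts;
  const criteria =
    evalPrompt ??
    (isRetry
      ? "Review the code changes and decide if they are acceptable."
      : "Score the code change quality on correctness and efficiency.");
  const format = isRetry
    ? "The LAST LINE of your response must be exactly PASS or FAIL."
    : "On the LAST LINE output a single numeric score between 0 and 100.";
  return [
    gitDiffInstruction,
    `Use \`beat logs ${taskId}\` to read the worker's output.`,
    criteria,
    format, // unconditional: custom prompts cannot drop the output contract
  ].join("\n");
}
```

With this shape, a custom evalPrompt can change *what* the agent judges without being able to break *how* the result is reported back to the parser.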
**README example documents a command that the code rejects**

The documented example:

```shell
beat loop "Optimize the algorithm" --eval-mode agent --strategy optimize --maximize \
  --eval-prompt "Score the solution on correctness and efficiency (0-100)"
```

`--maximize` is explicitly rejected in parseAgentModeArgs with the error "--minimize/--maximize are not valid with --eval-mode agent", so this command errors before ever reaching the service. This should be corrected once the CLI bug above is fixed.
- Allow --minimize/--maximize with --eval-mode agent + --strategy optimize (previously rejected unconditionally, making agent+optimize loops broken from the CLI)
- Restructure buildEvalPrompt so the format directive (PASS/FAIL or numeric score), git diff instruction, and beat logs command are always injected; a custom evalPrompt now only overrides the evaluation criteria section
- Add 7 new tests covering both fixes
## Summary

- Bump version 1.0.0 → 1.1.0
- Add release notes, changelog, features, and roadmap entries for v1.1.0
- Update the CLAUDE.md MCP tools list (add 8 missing tools: RetryTask, ResumeTask, CreateOrchestrator, OrchestratorStatus, ListOrchestrators, CancelOrchestrator, ListAgents, ConfigureAgent)
- Update the stale RELEASE_NOTES.md index (was pointing to v0.4.0)

## What's in v1.1.0

| PR | Type | Description |
|----|------|-------------|
| #126 | feat | Agent eval mode for loop exit conditions |
| #127 | feat | Agent orchestration skill and skill installer |
| #128 | fix | Populate nextRunAt on CRON schedule creation |

## Test plan

- [x] `npm run build` — clean
- [x] `npm run test:core` — 359 passed
- [x] `npm run test:services` — 175 passed
- [x] `npm run test:cli` — 295 passed
- [x] `npm run test:adapters` — 100 passed
- [x] `npx biome check src/ tests/` — clean
- [ ] CI passes on PR
- [ ] Squash merge → trigger Release workflow

---

Co-authored-by: Dean Sharon <deanshrn@gmain.com>
Summary

- `evalMode: 'agent'` (MCP) or `--eval-mode agent` (CLI). The agent reads iteration output and returns pass/fail (retry) or a numeric score (optimize).
- `evalPrompt` / `--eval-prompt` allows customizing the agent evaluator's instructions.
- The composite evaluator dispatches on the loop's `evalMode` setting.
- `eval_mode`, `eval_prompt` columns on the `loops` table and `eval_feedback` on `loop_iterations`.
- `exitCondition` clarified in the shell evaluator, JSDoc added on the `ExitConditionEvaluator` interface, stale FEATURES.md version label removed.

Key Files

- `src/core/domain.ts`, `src/core/interfaces.ts`
- `src/services/agent-exit-condition-evaluator.ts`
- `src/services/composite-exit-condition-evaluator.ts`
- `src/services/exit-condition-evaluator.ts`
- `src/services/handlers/loop-handler.ts`
- `src/adapters/mcp-adapter.ts`
- `src/cli/commands/loop.ts`
- `src/implementations/database.ts`

Stats
Test plan

- `npm run test:core` — 359 passed
- `npm run test:handlers` — 168 passed
- `npm run test:services` — 175 passed (includes agent evaluator, composite evaluator, shell guard tests)
- `npm run test:repositories` — 202 passed
- `npm run test:adapters` — 99 passed
- `npm run test:cli` — 260 passed
- `npm run test:implementations` — 328 passed
- `npm run test:integration` — 70 passed
- `npm run build` — clean
- `npx biome check src/` — 86 files, 0 issues