feat: agent eval mode for loop exit conditions #126
Conversation
- Stale state guard now allows PAUSED loops through, consistent with the early guard (a graceful pause should record iteration results)
- Add cleanup of in-memory tracking maps (taskToLoop, pipelineTasks) when the stale state guard returns early, preventing memory leaks
- Add evalMode/evalPrompt to the MCP LoopStatus loop response
- Add evalFeedback to the MCP LoopStatus iteration history
- Add CLI tests for --eval-mode, --eval-prompt, and --strategy flags
- Add MCP adapter tests for evalMode/evalPrompt schema acceptance
- Extract a TaskCompletionStatus type alias; replace a nested ternary with a switch
- Deduplicate the buildEvalPrompt template structure
- Remove redundant comments restating adjacent code
- Remove an empty else block in loop-manager validation
- Fix a stale dependency-count comment in handler-setup
…cope stale guard to agent mode

- Extract `cleanupIterationTracking(taskId, loopId, iteration)` from three duplicated cleanup sequences in handleTaskTerminal, eliminating the repetition
- Guard the stale-state re-fetch block with `if (loop.evalMode === 'agent')` so shell eval (milliseconds) bypasses the unnecessary DB round-trips
- Update the stale-guard test to use `evalMode: 'agent'`, matching the scope of the guard it exercises

Co-Authored-By: Claude <noreply@anthropic.com>
…itCondition

createScheduledLoop() unconditionally rejected an empty exitCondition, blocking agent eval mode, which uses LLM review instead of a shell command. Guard the check with `evalMode === 'shell'` (the default), matching the loop-manager validation.

Co-Authored-By: Claude <noreply@anthropic.com>
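The guard described above might look like the following minimal sketch (the request shape and function name are assumptions for illustration, not the real API):

```typescript
// Only shell mode requires a non-empty exitCondition; agent mode uses
// LLM review, so an empty condition is valid there.
enum EvalMode { SHELL = "shell", AGENT = "agent" }

interface CreateLoopRequest {
  evalMode?: EvalMode;
  exitCondition?: string;
}

// Returns an error message, or null when the request is valid.
function validateExitCondition(req: CreateLoopRequest): string | null {
  const mode = req.evalMode ?? EvalMode.SHELL; // shell is the default
  if (mode === EvalMode.SHELL && !req.exitCondition?.trim()) {
    return "exitCondition is required for shell eval mode";
  }
  return null;
}
```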
AgentExitConditionEvaluator now subscribes to LoopCancelled inside waitForTaskCompletion and immediately emits TaskCancellationRequested for the in-flight eval task. Previously, the eval task was deliberately not tracked in LoopHandler.taskToLoop, so handleLoopCancelled could not reach it, leaving it running as an orphan that consumed a worker slot until evalTimeout. The existing stale-state guard in LoopHandler already discarded the result correctly; this fix adds the missing cancellation signal so the worker slot is freed immediately.

Two tests added: one verifies TaskCancellationRequested is emitted on loop cancellation, and one verifies that cancellation of a different loop is ignored.

Co-Authored-By: Claude <noreply@anthropic.com>
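The fix can be sketched with a plain Node EventEmitter standing in for the event bus (event names match the PR; the function shape and bus wiring are assumptions for illustration):

```typescript
import { EventEmitter } from "node:events";

const bus = new EventEmitter();

// While an eval task is in flight, listen for LoopCancelled and forward a
// TaskCancellationRequested for that task so the worker slot is freed
// immediately instead of waiting for evalTimeout.
function waitForEvalTask(loopId: string, evalTaskId: string): Promise<string> {
  return new Promise((resolve) => {
    const onCancelled = (cancelledLoopId: string) => {
      if (cancelledLoopId !== loopId) return; // cancellation of another loop: ignore
      bus.emit("TaskCancellationRequested", evalTaskId);
      cleanup();
      resolve("cancelled");
    };
    const onCompleted = (taskId: string) => {
      if (taskId !== evalTaskId) return;
      cleanup();
      resolve("completed");
    };
    const cleanup = () => {
      bus.off("LoopCancelled", onCancelled);
      bus.off("TaskCompleted", onCompleted);
    };
    bus.on("LoopCancelled", onCancelled);
    bus.on("TaskCompleted", onCompleted);
  });
}
```

Note the unsubscribe in `cleanup()`: without it, every eval would leak a listener, the same class of bug as the tracking-map leaks fixed elsewhere in this PR.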
- FEATURES.md: add agent eval to Loop Strategies, evalMode/evalPrompt to Configuration, agent eval CLI examples, and Migration 15 to Database Schema - README.md: add two agent eval examples in the Eval Loops section - CHANGELOG.md: add [Unreleased] entry for agent eval mode feature - CLAUDE.md: add agent-exit-condition-evaluator.ts and composite-exit-condition-evaluator.ts to File Locations table - domain.ts: clarify exitCondition comment (empty string for agent mode) Fix: convert EvalMode from string literals to enum throughout codebase so CompositeExitConditionEvaluator switch exhaustiveness check compiles. Affected: domain.ts, loop.ts, mcp-adapter.ts, loop-repository.ts, schedule-repository.ts, loop-handler.ts, loop-manager.ts, composite-exit-condition-evaluator.ts, format.ts, orchestrator-prompt.ts Co-Authored-By: Claude <noreply@anthropic.com>
…itch - schedule-manager.ts: replace 'shell' string literal with EvalMode.SHELL - mcp-adapter.ts: remove toEvalMode helper import (unused after nativeEnum Zod schemas) and remove unnecessary EvalMode casts on parsed data - utils/format.ts: remove toEvalMode() helper — now superseded by z.nativeEnum(EvalMode) at MCP/CLI boundaries EvalMode enum is now consistently used at all comparison sites across domain, handlers, services, adapters, and CLI. The toEvalMode boundary parser is no longer needed since Zod nativeEnum validates and types the input directly. Co-Authored-By: Claude <noreply@anthropic.com>
Linter preference: keep explicit .default(EvalMode.SHELL) on nativeEnum schemas for CreateLoop and ScheduleLoop — makes the default explicit at the protocol boundary rather than relying solely on the domain factory. Co-Authored-By: Claude <noreply@anthropic.com>
validateCreateRequest now explicitly rejects evalMode values that are neither EvalMode.SHELL nor EvalMode.AGENT, returning an INVALID_INPUT error with the received value. While TypeScript enforces the EvalMode enum at compile time for typed callers, the guard protects against deserialized or cast inputs that bypass type checking at runtime. Co-Authored-By: Claude <noreply@anthropic.com>
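A hedged sketch of that runtime guard (the exact error shape is an assumption; only the INVALID_INPUT code and the two enum members come from the PR):

```typescript
// TypeScript checks the enum at compile time, but deserialized JSON or
// `as` casts can smuggle arbitrary strings in, so validate again at runtime.
enum EvalMode { SHELL = "shell", AGENT = "agent" }

type ValidationResult =
  | { ok: true }
  | { ok: false; code: "INVALID_INPUT"; message: string };

function validateEvalMode(value: unknown): ValidationResult {
  if (value === EvalMode.SHELL || value === EvalMode.AGENT) {
    return { ok: true };
  }
  // Echo the received value back so the caller can see what was rejected.
  return {
    ok: false,
    code: "INVALID_INPUT",
    message: `Invalid evalMode: ${JSON.stringify(value)}`,
  };
}
```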
…ilerplate

- Issue 22: Define a named IterationResultFields type in loop-handler.ts, replacing the inline object type on recordAndContinue's evalResult param
- Issue 23: Extract an evaluateWithCompletion helper in the agent-exit-condition-evaluator tests, reducing 12 repeated spy-capture-simulate blocks to 3-line call sites
- Issue 24: Move the createAndEmitLoop call above mockImplementation in the stale state guard test to eliminate the forward reference to loop.id in the closure
- Issue 25: Add a test verifying result processing is skipped when the iteration status changes to a terminal state during agent eval (stale iteration guard)
- Issue 26: Add a test verifying shell mode rejects evalTimeout: 300001

Co-Authored-By: Claude <noreply@anthropic.com>
Greptile Summary

This PR introduces Agent Eval Mode for loop exit conditions, allowing an AI agent (instead of a shell command) to evaluate each iteration's output and return pass/fail (retry) or a numeric score (optimize).

Key changes:
Confidence Score: 4/5

Not safe to merge as-is: agent optimize loops are completely non-functional from the CLI due to contradictory validation logic between the parser and the service layer. One P1 defect: the CLI unconditionally rejects --minimize/--maximize in agent mode, while the service layer unconditionally requires evalDirection for the optimize strategy, so every attempt to create an agent+optimize loop from the CLI will error. The README example documents this exact broken path. The MCP path is unaffected. All other changes (migration, repository, shell guard, stale-state guard, composite dispatcher, event-driven completion) look correct and well-tested.

Files needing attention: src/cli/commands/loop.ts (parseAgentModeArgs, missing evalDirection propagation) and README.md (documents the broken --maximize example)

Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI_MCP as CLI / MCP Adapter
    participant LM as LoopManagerService
    participant LH as LoopHandler
    participant CEE as CompositeEvaluator
    participant AECE as AgentEvaluator
    participant EB as EventBus
    participant Repo as LoopRepository
    CLI_MCP->>LM: createLoop({ evalMode: agent, strategy, evalPrompt })
    LM->>LM: validate(evalMode, evalDirection if optimize)
    LM->>Repo: save(loop)
    LM-->>CLI_MCP: Loop created
    Note over LH: On task completion event
    LH->>CEE: evaluate(loop, taskId)
    CEE->>AECE: evaluate(loop, taskId) [evalMode=agent]
    AECE->>Repo: findIterationByTaskId(taskId)
    AECE->>AECE: buildEvalPrompt(loop, taskId)
    AECE->>EB: subscribe TaskCompleted/Failed/Cancelled/Timeout/LoopCancelled
    AECE->>EB: emit TaskDelegated(evalTask)
    EB-->>AECE: TaskCompleted(evalTaskId)
    AECE->>AECE: outputRepo.get(evalTaskId)
    AECE->>AECE: parseEvalOutput(lines, strategy)
    AECE-->>CEE: EvalResult { passed, score?, feedback? }
    CEE-->>LH: EvalResult
    LH->>LH: stale-state guard (re-fetch loop + iteration)
    LH->>LH: handleIterationResult(freshLoop, freshIteration, evalResult)
    LH->>Repo: updateIteration(evalFeedback, score, ...)
```
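The parseEvalOutput step in the diagram can be sketched as follows. This is a hedged reconstruction from the prompt contract and error messages quoted in this PR; the real signature and optimize-mode semantics are assumptions:

```typescript
type EvalResult = { passed: boolean; score?: number; feedback?: string };

// Retry mode: the final non-empty line must be exactly PASS or FAIL.
// Optimize mode: the final non-empty line must be a numeric score 0-100.
function parseEvalOutput(lines: string[], strategy: "retry" | "optimize"): EvalResult {
  const last = [...lines].reverse().find((l) => l.trim().length > 0)?.trim();
  if (strategy === "retry") {
    if (last === "PASS") return { passed: true };
    if (last === "FAIL") return { passed: false };
    throw new Error(`Eval agent output did not end with PASS or FAIL (got: ${last})`);
  }
  const score = Number(last);
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new Error(`Eval agent output did not end with a 0-100 score (got: ${last})`);
  }
  return { passed: true, score };
}
```

Scanning backward past trailing blank lines matters here: agent output frequently ends with a newline, and a naive `lines[lines.length - 1]` check would reject an otherwise valid PASS.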
```ts
const defaultInstructions = isRetry
  ? `Review the code changes. ${gitDiffInstruction} Use \`beat logs ${taskId}\` to read the worker's output. Output PASS if the changes are acceptable, FAIL if not. The LAST LINE of your response must be exactly PASS or FAIL.`
  : `Score the code change quality 0-100. ${gitDiffInstruction} Use \`beat logs ${taskId}\` to read the worker's output. Provide your analysis, then on the LAST LINE output a single numeric score between 0 and 100.`;
const instructions = loop.evalPrompt ?? defaultInstructions;

return `${header}

IMPORTANT: Do NOT modify any files. You are an evaluator — read and assess only.

Working directory: ${loop.workingDirectory}
Iteration: ${loop.currentIteration}
Task ID: ${taskId}

${instructions}`;
}

/**
 * Wait for eval task to reach a terminal state.
 * Subscribes to LoopCancelled so that if the parent loop is cancelled while this
 * eval is in-flight, the eval task gets a TaskCancellationRequested immediately
```
**Custom evalPrompt silently drops format requirements and git diff instructions**
When loop.evalPrompt is set, it completely replaces defaultInstructions, including the critical format directives the eval agent must follow:
- Retry: the LAST LINE must be exactly `PASS` or `FAIL`.
- Optimize: the LAST LINE must be a numeric score.
- Both modes: the eval agent should run `git diff` and `beat logs <taskId>` to inspect the work.
None of these requirements appear in the built prompt when a custom evalPrompt is used. The eval agent doesn't know it needs to access the git diff, read task logs, or produce a PASS/FAIL token, so it will return free-form text and the evaluation will fail with "Eval agent output did not end with PASS or FAIL (got: ...)".
Consider injecting the gitDiffInstruction, the beat logs command, and the output format requirement unconditionally, and letting evalPrompt only override the evaluation criteria section.
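The suggested restructuring might look like this sketch (helper shape and default criteria wording are assumptions; the always-injected sections come from the review comment above):

```typescript
// Format directive, git diff instruction, and logs command are always
// injected; a custom evalPrompt only replaces the evaluation criteria.
function buildEvalPrompt(opts: {
  isRetry: boolean;
  taskId: string;
  gitDiffInstruction: string;
  evalPrompt?: string;
}): string {
  const { isRetry, taskId, gitDiffInstruction, evalPrompt } = opts;
  const criteria =
    evalPrompt ??
    (isRetry
      ? "Review the code changes and decide if they are acceptable."
      : "Score the code change quality on correctness and efficiency.");
  const format = isRetry
    ? "The LAST LINE of your response must be exactly PASS or FAIL."
    : "On the LAST LINE output a single numeric score between 0 and 100.";
  return [
    gitDiffInstruction,
    `Use \`beat logs ${taskId}\` to read the worker's output.`,
    criteria,
    format, // unconditional: custom prompts cannot drop the output contract
  ].join("\n");
}
```

With this shape, a custom evalPrompt can change *what* the agent judges without being able to break *how* the result is reported back to the parser.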
**README example documents a command that the code rejects**

The documented example:

```shell
beat loop "Optimize the algorithm" --eval-mode agent --strategy optimize --maximize \
  --eval-prompt "Score the solution on correctness and efficiency (0-100)"
```

`--maximize` is explicitly rejected in parseAgentModeArgs with the error "--minimize/--maximize are not valid with --eval-mode agent", so this command errors before ever reaching the service. This should be corrected once the CLI bug above is fixed.
- Allow --minimize/--maximize with --eval-mode agent + --strategy optimize (previously rejected unconditionally, making agent+optimize loops broken from the CLI)
- Restructure buildEvalPrompt so the format directive (PASS/FAIL or numeric score), git diff instruction, and beat logs command are always injected; a custom evalPrompt now only overrides the evaluation criteria section
- Add 7 new tests covering both fixes
## Summary

- Bump version 1.0.0 → 1.1.0
- Add release notes, changelog, features, and roadmap entries for v1.1.0
- Update the CLAUDE.md MCP tools list (add 8 missing tools: RetryTask, ResumeTask, CreateOrchestrator, OrchestratorStatus, ListOrchestrators, CancelOrchestrator, ListAgents, ConfigureAgent)
- Update the stale RELEASE_NOTES.md index (was pointing to v0.4.0)

## What's in v1.1.0

| PR | Type | Description |
|----|------|-------------|
| #126 | feat | Agent eval mode for loop exit conditions |
| #127 | feat | Agent orchestration skill and skill installer |
| #128 | fix | Populate nextRunAt on CRON schedule creation |

## Test plan

- [x] `npm run build` — clean
- [x] `npm run test:core` — 359 passed
- [x] `npm run test:services` — 175 passed
- [x] `npm run test:cli` — 295 passed
- [x] `npm run test:adapters` — 100 passed
- [x] `npx biome check src/ tests/` — clean
- [ ] CI passes on PR
- [ ] Squash merge → trigger Release workflow

---

Co-authored-by: Dean Sharon <deanshrn@gmain.com>
Summary

- `evalMode: 'agent'` (MCP) or `--eval-mode agent` (CLI). The agent reads iteration output and returns pass/fail (retry) or a numeric score (optimize).
- `evalPrompt` / `--eval-prompt` allows customizing the agent evaluator's instructions.
- The composite evaluator dispatches on the loop's `evalMode` setting.
- `eval_mode`, `eval_prompt` columns on the `loops` table and `eval_feedback` on `loop_iterations`.
- `exitCondition` clarified in the shell evaluator, JSDoc added on the `ExitConditionEvaluator` interface, stale FEATURES.md version label removed.

Key Files

- `src/core/domain.ts`, `src/core/interfaces.ts`
- `src/services/agent-exit-condition-evaluator.ts`
- `src/services/composite-exit-condition-evaluator.ts`
- `src/services/exit-condition-evaluator.ts`
- `src/services/handlers/loop-handler.ts`
- `src/adapters/mcp-adapter.ts`
- `src/cli/commands/loop.ts`
- `src/implementations/database.ts`

Stats
Test plan

- `npm run test:core` — 359 passed
- `npm run test:handlers` — 168 passed
- `npm run test:services` — 175 passed (includes agent evaluator, composite evaluator, shell guard tests)
- `npm run test:repositories` — 202 passed
- `npm run test:adapters` — 99 passed
- `npm run test:cli` — 260 passed
- `npm run test:implementations` — 328 passed
- `npm run test:integration` — 70 passed
- `npm run build` — clean
- `npx biome check src/` — 86 files, 0 issues