Conversation
- Stale state guard now allows PAUSED loops through, consistent with the early guard (graceful pause should record iteration results)
- Add cleanup of in-memory tracking maps (taskToLoop, pipelineTasks) when stale state guard returns early, preventing memory leaks
- Add evalMode/evalPrompt to MCP LoopStatus loop response
- Add evalFeedback to MCP LoopStatus iteration history
- Add CLI tests for --eval-mode, --eval-prompt, --strategy flags
- Add MCP adapter tests for evalMode/evalPrompt schema acceptance
- Extract TaskCompletionStatus type alias, replace nested ternary with switch
- Deduplicate buildEvalPrompt template structure
- Remove redundant comments restating adjacent code
- Remove empty else block in loop-manager validation
- Fix stale dependency count comment in handler-setup
```typescript
const lastLine = lines[lines.length - 1].trim();
// Everything before the last line (if any) as feedback
const feedbackLines = lines.slice(0, -1);
```
SECURITY: Unbounded evalFeedback stored in SQLite (85% confidence)
The evalFeedback field is constructed from the entire eval agent output and stored directly in the loop_iterations.eval_feedback column with no size cap. A misbehaving eval agent could produce megabytes of output, causing database bloat and performance degradation.
Fix: Truncate feedback before storage:

```typescript
const MAX_FEEDBACK_LENGTH = 16000; // ~16KB reasonable limit
const feedback = feedbackLines.length > 0
  ? feedbackLines.join('\n').slice(0, MAX_FEEDBACK_LENGTH)
  : undefined;
```

This is consistent with the 8000-char cap on evalPrompt in the Zod schema.
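The cap behaves as expected even for pathological output; a standalone sketch (the constant and `buildFeedback` helper are illustrative, not the project's actual code):

```typescript
// Standalone sketch of the proposed feedback cap; names are illustrative.
const MAX_FEEDBACK_LENGTH = 16000;

function buildFeedback(feedbackLines: string[]): string | undefined {
  return feedbackLines.length > 0
    ? feedbackLines.join('\n').slice(0, MAX_FEEDBACK_LENGTH)
    : undefined;
}

// An eval agent emitting ~10MB of lines is capped at 16000 chars before storage.
const huge = Array.from({ length: 100000 }, () => 'x'.repeat(100));
console.log(buildFeedback(huge)?.length); // 16000
console.log(buildFeedback([])); // undefined
```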
```typescript
} else {
  // Task COMPLETED — run exit condition evaluation
  // Note: agent eval mode can take a long time; re-fetch state afterwards to guard stale data
  const evalResult = await this.exitConditionEvaluator.evaluate(loop, taskId);
```
ARCHITECTURE: Eval task not cancelled when loop is cancelled (85% confidence)
When a loop is cancelled, the handleLoopCancelled method updates the loop status and cleans up tracking, but the eval task spawned by AgentExitConditionEvaluator is NOT tracked in taskToLoop. This means the eval agent task continues running as an orphan, wasting compute resources.
Fix: Track the in-flight eval task ID so handleLoopCancelled can emit a TaskCancelled event for any running eval task. Consider passing an AbortSignal that LoopHandler can trigger on cancellation:

```typescript
async evaluate(loop: Loop, taskId: TaskId, signal?: AbortSignal): Promise<EvalResult> {
  // ...spawn eval task...
  if (signal?.aborted) {
    await this.eventBus.emit('TaskCancelled', { taskId: evalTaskId, reason: 'loop-cancelled' });
    return { passed: false, error: 'Eval cancelled — loop was cancelled' };
  }
}
```
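The suggested signal wiring can be exercised standalone with Node's built-in `AbortController` (the `runEval` function here is an illustrative stand-in for the evaluator, not the project's code):

```typescript
// Illustrative sketch: a long-running eval that checks an AbortSignal on completion.
async function runEval(signal?: AbortSignal): Promise<{ passed: boolean; error?: string }> {
  await new Promise((resolve) => setTimeout(resolve, 10)); // stand-in for the eval agent task
  if (signal?.aborted) {
    return { passed: false, error: 'Eval cancelled — loop was cancelled' };
  }
  return { passed: true };
}

const controller = new AbortController();
const pending = runEval(controller.signal);
controller.abort(); // loop cancelled while the eval is in flight
pending.then((result) => console.log(result.error)); // logs the cancellation error
```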
```typescript
const freshLoop = freshLoopResult.value;

const freshIterationResult = await this.loopRepo.findIterationByTaskId(taskId);
if (!freshIterationResult.ok || !freshIterationResult.value || freshIterationResult.value.status !== 'running') {
```
MAINTENANCE: Duplicated cleanup triplet (82% confidence)
The three-line cleanup sequence (cleanupPipelineTaskTracking, taskToLoop.delete, cleanupPipelineTasks) appears identically at lines 294-296, 313-315, and 324-326. This violates DRY and creates a maintenance hazard.
Fix: Extract into a helper method:

```typescript
private cleanupIterationTracking(iteration: LoopIteration, taskId: TaskId, loopId: string): void {
  this.cleanupPipelineTaskTracking(iteration);
  this.taskToLoop.delete(taskId);
  this.cleanupPipelineTasks(loopId, iteration.iterationNumber);
}
```

Then use a finally block or restructure the early returns to call this once.
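The finally-based shape can be sketched standalone (the maps and handler below are illustrative stand-ins for the handler's tracking state, not the project's code):

```typescript
// Illustrative stand-ins for the handler's tracking maps.
const taskToLoop = new Map<string, string>();
const pipelineTasks = new Map<string, string[]>();

function cleanupIterationTracking(taskId: string, loopId: string): void {
  taskToLoop.delete(taskId);
  pipelineTasks.delete(loopId);
}

function handleTaskCompleted(taskId: string, loopId: string): string {
  try {
    // ...stale-state guards and iteration handling with multiple early returns...
    return 'handled';
  } finally {
    // Runs on every exit path, so the cleanup triplet appears exactly once.
    cleanupIterationTracking(taskId, loopId);
  }
}

taskToLoop.set('t1', 'loop1');
pipelineTasks.set('loop1', ['t1']);
handleTaskCompleted('t1', 'loop1');
console.log(taskToLoop.size, pipelineTasks.size); // 0 0
```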
```typescript
readonly exitCondition: string; // Shell command to evaluate iteration result (empty string for agent mode)
readonly evalDirection?: OptimizeDirection; // Optimize strategy only
readonly evalTimeout: number; // Milliseconds for exit condition evaluation
readonly evalMode: 'shell' | 'agent'; // Evaluation mode: shell command or agent review
```
TYPE SAFETY: evalMode string literal instead of enum (83% confidence)
The evalMode field is typed as 'shell' | 'agent' string literal union with no single source of truth. The MCP adapter and repository use as 'shell' | 'agent' casts that bypass TypeScript's exhaustiveness checking. The project already uses enums for similar concepts (LoopStrategy, OptimizeDirection).
Fix: Define an EvalMode enum:

```typescript
export enum EvalMode {
  SHELL = 'shell',
  AGENT = 'agent',
}
```

Then use EvalMode everywhere and add an exhaustive switch in CompositeExitConditionEvaluator:

```typescript
switch (loop.evalMode) {
  case EvalMode.AGENT:
    return this.agentEvaluator.evaluate(loop, taskId);
  case EvalMode.SHELL:
    return this.shellEvaluator.evaluate(loop, taskId);
  default:
    return loop.evalMode satisfies never;
}
```
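A minimal self-contained demonstration of the exhaustiveness pattern (the enum mirrors the suggestion; `describeMode` is an illustrative stand-in for the composite evaluator, and `satisfies never` requires TypeScript 4.9+):

```typescript
// Illustrative enum mirroring the suggested EvalMode.
enum EvalMode {
  SHELL = 'shell',
  AGENT = 'agent',
}

function describeMode(mode: EvalMode): string {
  switch (mode) {
    case EvalMode.AGENT:
      return 'agent review';
    case EvalMode.SHELL:
      return 'shell command';
    default:
      // If a third member is added to EvalMode, this line becomes a compile error,
      // because `mode` is no longer narrowed to `never` here.
      return mode satisfies never;
  }
}

console.log(describeMode(EvalMode.AGENT)); // 'agent review'
```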
```typescript
const gitDiffInstruction = preIterationCommitSha
  ? `Use \`git diff ${preIterationCommitSha}..HEAD\` to see what changed in this iteration.`
  : 'Use `git diff HEAD~1..HEAD` to see what changed in this iteration.';
```
PERFORMANCE: Unbounded output concatenation (85% confidence)
The evaluator concatenates full stdout and stderr arrays into a single string before parsing, creating two full copies in memory (array + joined string), then a third filtered copy from lines.filter(). For a 100KB agent output, this is significant.
Fix: Only parse from the filtered lines array, without the join-then-split round-trip:

```typescript
const allLines = [...output.stdout, ...output.stderr];
const nonEmptyLines = allLines.filter((line) => line.trim().length > 0);
return this.parseEvalOutputFromLines(nonEmptyLines, loop.strategy);
```
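A standalone check that the direct-filter path yields the same non-empty lines as the join-then-split round-trip, given that stdout/stderr are already line arrays (the `output` shape is illustrative):

```typescript
// Illustrative output shape: separate stdout/stderr line arrays.
const output = { stdout: ['RESULT: pass', '', 'note'], stderr: ['warn', '  '] };

// Round-trip: join everything into one string, then split and filter.
const roundTrip = [...output.stdout, ...output.stderr]
  .join('\n')
  .split('\n')
  .filter((line) => line.trim().length > 0);

// Direct: filter the arrays without materializing the joined string.
const direct = [...output.stdout, ...output.stderr]
  .filter((line) => line.trim().length > 0);

console.log(JSON.stringify(roundTrip) === JSON.stringify(direct)); // true
```

The equivalence holds because each array element is a single line with no embedded newlines, so joining and re-splitting is a pure round-trip.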
```typescript
// Strategy inference from flags
// Validate --eval-prompt requires --eval-mode agent
```
COMPLEXITY: parseLoopCreateArgs exceeds maintainability threshold (88% confidence)
This function is 238 lines with ~25 decision points. The agent eval mode additions (lines 153-206) pushed it past the threshold. The agent and shell branches each construct nearly identical shared objects.
Fix: Extract the two code paths into helper functions:

```typescript
if (evalMode === 'agent') {
  return parseAgentModeArgs({ promptWords, untilCmd, evalCmd, strategyFlag, ... });
}
return parseShellModeArgs({ promptWords, untilCmd, evalCmd, strategyFlag, ... });
```

This makes each path easier to understand and modify independently.
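The split can be sketched standalone, with the shared fields factored into one helper (all argument shapes and names here are illustrative; the real parseLoopCreateArgs handles many more flags):

```typescript
// Illustrative arg shapes; the real parser handles many more flags.
interface CommonArgs { promptWords: string[]; strategyFlag?: string }

function parseShared(args: CommonArgs) {
  return { prompt: args.promptWords.join(' '), strategy: args.strategyFlag ?? 'fixed' };
}

function parseAgentModeArgs(args: CommonArgs & { evalPrompt?: string }) {
  return { ...parseShared(args), evalMode: 'agent' as const, evalPrompt: args.evalPrompt };
}

function parseShellModeArgs(args: CommonArgs & { evalCmd?: string }) {
  return { ...parseShared(args), evalMode: 'shell' as const, exitCondition: args.evalCmd ?? '' };
}

console.log(parseAgentModeArgs({ promptWords: ['fix', 'tests'] }).evalMode); // 'agent'
```

Each mode-specific helper stays small, and the shared-object construction lives in exactly one place.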
**Greptile Summary**

This PR adds an agent-based eval mode (`evalMode: 'agent'`).

Notable strengths:

Issues found:

**Confidence Score: 4/5**

Safe to merge after addressing the stdout/stderr ordering issue and the custom evalPrompt format-requirement gap, which would otherwise silently cause agent evals to fail in practice. All four findings are P2, but two of them (stdout/stderr ordering and custom evalPrompt missing format constraints) describe conditions where agent evaluation silently produces incorrect results on the primary happy path — enough to warrant fixing before wide use. The subscription leak and missing agent-mode exitCondition rejection are lower-urgency but still worth a quick fix. src/services/agent-exit-condition-evaluator.ts requires the most attention (subscription leak, output parsing, and evalPrompt construction); src/services/loop-manager.ts needs a minor guard for agent mode + exitCondition.

**Important Files Changed**
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant LH as LoopHandler
    participant CE as CompositeEvaluator
    participant SE as ShellEvaluator
    participant AE as AgentEvaluator
    participant EB as EventBus
    participant OR as OutputRepository
    participant LR as LoopRepository
    LH->>CE: evaluate(loop, taskId)
    alt evalMode == 'shell'
        CE->>SE: evaluate(loop, taskId)
        SE-->>CE: EvalResult (exitCode)
    else evalMode == 'agent'
        CE->>AE: evaluate(loop, taskId)
        AE->>LR: findIterationByTaskId(taskId)
        LR-->>AE: preIterationCommitSha
        AE->>AE: buildEvalPrompt(loop, taskId)
        AE->>EB: subscribe(TaskCompleted / Failed / Cancelled / Timeout)
        AE->>EB: emit(TaskDelegated, evalTask)
        Note over EB: Claude Code eval agent runs
        EB-->>AE: TaskCompleted(evalTaskId)
        AE->>OR: get(evalTaskId)
        OR-->>AE: stdout + stderr lines
        AE->>AE: parseEvalOutput(fullText, strategy)
        AE-->>CE: EvalResult (passed/score/feedback)
    end
    CE-->>LH: EvalResult
    Note over LH: Stale-state guard: re-fetch loop + iteration
    LH->>LR: findById(loopId)
    LH->>LR: findIterationByTaskId(taskId)
    LH->>LH: handleIterationResult(freshLoop, freshIteration, evalResult)
```
```diff
- evalTimeout: z.number().min(1000).optional().default(60000).describe('Eval script timeout in ms'),
+ evalTimeout: z.number().min(1000).optional().default(60000).describe('Eval timeout in ms (max: shell=300s, agent=600s)'),
  workingDirectory: z.string().optional().describe('Working directory for task and eval'),
  maxIterations: z.number().min(0).optional().default(10).describe('Max iterations (0 = unlimited)'),
```
SECURITY: Missing .max() upper-bound on evalTimeout Zod schema (82% confidence)
The evalTimeout field (within the CreateLoopSchema starting at line 237) has min(1000) but no .max(). While LoopManagerService enforces 300s/600s limits downstream, the MCP schema should enforce the boundary (validates-at-boundaries principle).
Fix: Add .max(600000) to the evalTimeout definition:

```typescript
evalTimeout: z
  .number()
  .min(1000)
  .max(600000)
  .optional()
  .default(60000)
  .describe('Eval timeout in ms (max: shell=300s, agent=600s)'),
```

Apply the same fix to ScheduleLoopSchema.
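The same validates-at-boundaries idea, sketched without the zod dependency (the function and error text are illustrative; only the 1000/600000/60000 bounds come from the review above):

```typescript
// Illustrative stand-alone bounds check mirroring .min(1000).max(600000).default(60000).
function parseEvalTimeout(input: number | undefined): number {
  if (input === undefined) return 60000; // default when the caller omits the field
  if (!Number.isFinite(input) || input < 1000 || input > 600000) {
    throw new RangeError(`evalTimeout must be between 1000 and 600000 ms, got ${input}`);
  }
  return input;
}

console.log(parseEvalTimeout(undefined)); // 60000
console.log(parseEvalTimeout(5000)); // 5000
```

Rejecting out-of-range values at the schema boundary means downstream services never see a timeout the system cannot honor.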
**Code Review Summary: Lower-Confidence Issues & Documentation Gaps**

Reviewer: Claude Code

**Documentation Issues (Not Inline - Require File Updates)**

HIGH PRIORITY (95-90% confidence):
**Design Issues to Address (60-80% confidence, summary consolidated)**

Duplicated cleanup triplet (lines 294-296, 313-315, 324-326 in loop-handler.ts):
Cascading validation gaps:
Repository field naming inconsistency:
Type safety improvements:
**Tests: High-Confidence Issues**

Test boilerplate pattern (90% confidence):
Missing test coverage:
Pre-Existing Issues (Not Your Responsibility, but Context)
Files Requiring Updates (Outside of Inline Comments)
```typescript
exitCondition: data.exit_condition,
evalDirection: data.eval_direction ? this.toOptimizeDirection(data.eval_direction) : undefined,
evalTimeout: data.eval_timeout,
evalMode: data.eval_mode as 'shell' | 'agent',
```
TYPE SAFETY: Unsafe as cast on eval_mode bypasses Zod validation (85% confidence)
The LoopRowSchema at line 609 validates eval_mode as bare z.string() (not z.enum()), then the repository casts with as 'shell' | 'agent' bypassing type safety. If a third eval mode is added or data corruption occurs, this cast silently lies about the type.
Fix: Change the Zod schema to use z.enum():

```typescript
// In LoopRowSchema:
eval_mode: z.enum(['shell', 'agent']).default('shell'),

// In toDomain():
evalMode: data.eval_mode, // z.enum already narrows the type
```

This removes the need for the unsafe cast and validates at the repository boundary.
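If pulling zod into this layer is undesirable, a plain type guard achieves the same boundary validation without the `as` cast (a hedged alternative sketch; the guard and error text are illustrative):

```typescript
type EvalMode = 'shell' | 'agent';

// Type guard: narrows an untyped DB string to EvalMode without an `as` cast.
function isEvalMode(value: string): value is EvalMode {
  return value === 'shell' || value === 'agent';
}

function toEvalMode(value: string): EvalMode {
  if (!isEvalMode(value)) {
    // Surfaces data corruption or a forgotten migration instead of silently lying about the type.
    throw new Error(`Corrupt eval_mode in loops table: ${JSON.stringify(value)}`);
  }
  return value;
}

console.log(toEvalMode('agent')); // 'agent'
```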
Closing: agent eval mode is already on main. AgentExitConditionEvaluator and evalMode: 'agent' are fully implemented.
Summary
Adds agent-based loop evaluation (`evalMode: 'agent'`) — enables spawning a Claude Code instance to evaluate loop iteration quality instead of shell commands. Comprehensive implementation including domain types, database migration, repository layer, validation, agent evaluator component, composite evaluator pattern, handler wiring, CLI flags, MCP adapter integration, orchestrator prompt enhancement, and full test coverage.

Changes
- `EvalMode` union type (`'shell' | 'agent'`), `AgentEvaluatorConfig` for agent setup
- `eval_mode` and `agent_evaluator_config` columns added to the `loops` table
- `LoopRepository` with `evalMode` and `agentEvaluatorConfig` property handling
- `LoopHandler` extended to support both evaluation modes
- `--eval-mode` and agent config flags added to `beat loop create`
- `CreateLoop` tool with eval mode parameters

Breaking Changes
None. The `evalMode` defaults to `'shell'` for backward compatibility.

Testing
Related Issues
Closes #[issue-number] (if applicable)
Reviewed via Claude Code