
feat: agent-based loop evaluation mode#125

Closed
dean0x wants to merge 5 commits into main from feat/agent-eval-mode

Conversation


@dean0x dean0x commented Mar 30, 2026

Summary

Adds agent-based loop evaluation (evalMode: 'agent'), which spawns a Claude Code instance to judge loop iteration quality instead of running a shell command. The implementation covers domain types, a database migration, the repository layer, validation, an agent evaluator component, a composite evaluator pattern, handler wiring, CLI flags, MCP adapter integration, an orchestrator prompt enhancement, and full test coverage.

Changes

  • Domain Types: Added EvalMode union type ('shell' | 'agent'), AgentEvaluatorConfig for agent setup
  • Migration v15: Added eval_mode and agent_evaluator_config columns to loops table
  • Repository: Extended LoopRepository with evalMode and agentEvaluatorConfig property handling
  • Validation: Added schema validation for eval mode and agent config via Zod
  • AgentEvaluator: New component to spawn Claude Code instance and delegate evaluation
  • CompositeEvaluator: Patterns for composing shell and agent evaluators
  • Handler Wiring: LoopHandler extended to support both evaluation modes
  • CLI Flags: Added --eval-mode and agent config flags to beat loop create
  • MCP Adapter: Updated CreateLoop tool with eval mode parameters
  • Orchestrator Prompt: Enhanced to guide agent-based loop evaluation
  • Tests: Comprehensive test coverage across domain, handler, evaluator, and integration tests

Breaking Changes

None. The evalMode defaults to 'shell' for backward compatibility.

Testing

  • Unit tests: Domain types, validators, agent evaluator
  • Integration tests: Full loop workflow with agent evaluation
  • Handler tests: Event handling with both eval modes
  • All test groups passing: `npm run test:core`, `test:handlers`, etc.

Related Issues

Closes #[issue-number] (if applicable)


Reviewed via Claude Code

Dean Sharon added 5 commits March 30, 2026 11:33
- Stale state guard now allows PAUSED loops through, consistent with
  the early guard (graceful pause should record iteration results)
- Add cleanup of in-memory tracking maps (taskToLoop, pipelineTasks)
  when stale state guard returns early, preventing memory leaks
- Add evalMode/evalPrompt to MCP LoopStatus loop response
- Add evalFeedback to MCP LoopStatus iteration history
- Add CLI tests for --eval-mode, --eval-prompt, --strategy flags
- Add MCP adapter tests for evalMode/evalPrompt schema acceptance
- Extract TaskCompletionStatus type alias, replace nested ternary with switch
- Deduplicate buildEvalPrompt template structure
- Remove redundant comments restating adjacent code
- Remove empty else block in loop-manager validation
- Fix stale dependency count comment in handler-setup
}

const lastLine = lines[lines.length - 1].trim();
// Everything before the last line (if any) as feedback

SECURITY: Unbounded evalFeedback stored in SQLite (85% confidence)

The evalFeedback field is constructed from the entire eval agent output and stored directly in the loop_iterations.eval_feedback column with no size cap. A misbehaving eval agent could produce megabytes of output, causing database bloat and performance degradation.

Fix: Truncate feedback before storage:

const MAX_FEEDBACK_LENGTH = 16000; // ~16KB reasonable limit
const feedback = feedbackLines.length > 0
  ? feedbackLines.join('\n').slice(0, MAX_FEEDBACK_LENGTH)
  : undefined;

This is consistent with the 8000-char cap on evalPrompt in the Zod schema.
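As a self-contained sketch of that fix (the helper name and surrounding structure are illustrative, not the evaluator's actual code):

```typescript
// ~16KB cap, mirroring the 8000-char cap on evalPrompt in the Zod schema.
const MAX_FEEDBACK_LENGTH = 16_000;

// Hypothetical helper: join feedback lines and truncate before storage.
function buildFeedback(feedbackLines: string[]): string | undefined {
  return feedbackLines.length > 0
    ? feedbackLines.join('\n').slice(0, MAX_FEEDBACK_LENGTH)
    : undefined;
}
```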

} else {
// Task COMPLETED — run exit condition evaluation
// Note: agent eval mode can take a long time; re-fetch state afterwards to guard stale data
const evalResult = await this.exitConditionEvaluator.evaluate(loop, taskId);

ARCHITECTURE: Eval task not cancelled when loop is cancelled (85% confidence)

When a loop is cancelled, the handleLoopCancelled method updates the loop status and cleans up tracking, but the eval task spawned by AgentExitConditionEvaluator is NOT tracked in taskToLoop. This means the eval agent task continues running as an orphan, wasting compute resources.

Fix: Track the in-flight eval task ID so handleLoopCancelled can emit a TaskCancelled event for any running eval task. Consider passing an AbortSignal that LoopHandler can trigger on cancellation:

async evaluate(loop: Loop, taskId: TaskId, signal?: AbortSignal): Promise<EvalResult> {
  // ...spawn eval task...
  if (signal?.aborted) {
    await this.eventBus.emit('TaskCancelled', { taskId: evalTaskId, reason: 'loop-cancelled' });
    return { passed: false, error: 'Eval cancelled — loop was cancelled' };
  }
}

const freshLoop = freshLoopResult.value;

const freshIterationResult = await this.loopRepo.findIterationByTaskId(taskId);
if (!freshIterationResult.ok || !freshIterationResult.value || freshIterationResult.value.status !== 'running') {

MAINTENANCE: Duplicated cleanup triplet (82% confidence)

The three-line cleanup sequence (cleanupPipelineTaskTracking, taskToLoop.delete, cleanupPipelineTasks) appears identically at lines 294-296, 313-315, and 324-326. This violates DRY and creates a maintenance hazard.

Fix: Extract into a helper method:

private cleanupIterationTracking(iteration: LoopIteration, taskId: TaskId, loopId: string): void {
  this.cleanupPipelineTaskTracking(iteration);
  this.taskToLoop.delete(taskId);
  this.cleanupPipelineTasks(loopId, iteration.iterationNumber);
}

Then call this helper from a finally block, or restructure the early returns so it runs exactly once.
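A minimal sketch of that restructure, with Map fields standing in for the handler's real tracking state (the names here are assumptions based on this review, not LoopHandler's actual code):

```typescript
// Hypothetical stand-ins for LoopHandler's in-memory tracking maps.
const taskToLoop = new Map<string, string>();
const pipelineTasks = new Map<string, Set<string>>();

// The extracted helper: one place owns the cleanup triplet.
function cleanupIterationTracking(taskId: string, loopId: string): void {
  taskToLoop.delete(taskId);
  pipelineTasks.delete(loopId);
}

function handleTaskCompleted(taskId: string): 'stale' | 'processed' {
  const loopId = taskToLoop.get(taskId);
  try {
    if (!loopId) return 'stale'; // early return still flows through finally
    // ...evaluate the iteration result here...
    return 'processed';
  } finally {
    if (loopId) cleanupIterationTracking(taskId, loopId); // runs exactly once on every path
  }
}
```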

readonly exitCondition: string; // Shell command to evaluate iteration result (empty string for agent mode)
readonly evalDirection?: OptimizeDirection; // Optimize strategy only
readonly evalTimeout: number; // Milliseconds for exit condition evaluation
readonly evalMode: 'shell' | 'agent'; // Evaluation mode: shell command or agent review

TYPE SAFETY: evalMode string literal instead of enum (83% confidence)

The evalMode field is typed as 'shell' | 'agent' string literal union with no single source of truth. The MCP adapter and repository use as 'shell' | 'agent' casts that bypass TypeScript's exhaustiveness checking. The project already uses enums for similar concepts (LoopStrategy, OptimizeDirection).

Fix: Define an EvalMode enum:

export enum EvalMode {
  SHELL = 'shell',
  AGENT = 'agent',
}

Then use EvalMode everywhere and add exhaustive switch in CompositeExitConditionEvaluator:

case EvalMode.AGENT:
  return this.agentEvaluator.evaluate(loop, taskId);
case EvalMode.SHELL:
  return this.shellEvaluator.evaluate(loop, taskId);
default:
  return loop.evalMode satisfies never;

}

const gitDiffInstruction = preIterationCommitSha
? `Use \`git diff ${preIterationCommitSha}..HEAD\` to see what changed in this iteration.`

PERFORMANCE: Unbounded output concatenation (85% confidence)

The evaluator concatenates full stdout and stderr arrays into a single string before parsing, creating two full copies in memory (array + joined string), then a third filtered copy from lines.filter(). For a 100KB agent output, this is significant.

Fix: Only parse from the filtered lines array without the join-then-split round-trip:

const allLines = [...output.stdout, ...output.stderr];
const nonEmptyLines = allLines.filter((line) => line.trim().length > 0);
return this.parseEvalOutputFromLines(nonEmptyLines, loop.strategy);

}

// Strategy inference from flags
// Validate --eval-prompt requires --eval-mode agent

COMPLEXITY: parseLoopCreateArgs exceeds maintainability threshold (88% confidence)

This function is 238 lines with ~25 decision points. The agent eval mode additions (lines 153-206) pushed it past the threshold. The agent and shell branches each construct nearly identical shared objects.

Fix: Extract the two code paths into helper functions:

if (evalMode === 'agent') {
  return parseAgentModeArgs({ promptWords, untilCmd, evalCmd, strategyFlag, ... });
}
return parseShellModeArgs({ promptWords, untilCmd, evalCmd, strategyFlag, ... });

This makes each path easier to understand and modify independently.


greptile-apps bot commented Mar 30, 2026

Greptile Summary

This PR adds an evalMode: 'agent' evaluation path for loops, enabling a dedicated Claude Code instance to judge iteration quality through code comprehension rather than shell exit codes. The implementation is comprehensive — domain types, DB migration (v15), repository layer, validation, AgentExitConditionEvaluator, CompositeExitConditionEvaluator dispatcher, handler wiring, CLI flags, MCP schema changes, orchestrator prompt guidance, and test coverage are all included. The feature is fully backward-compatible (evalMode defaults to 'shell').

Notable strengths:

  • Stale-state guard in LoopHandler correctly re-fetches loop and iteration after the potentially long agent eval, preventing races when the loop is cancelled mid-evaluation
  • completionPromise is set up before emit is awaited — correctly avoiding the window where a fast-completing task could fire before subscriptions are registered
  • The CompositeExitConditionEvaluator pattern is clean and keeps callers oblivious to which evaluator is in use

Issues found:

  • Subscription leak on emit failure (agent-exit-condition-evaluator.ts ~line 583): When emit('TaskDelegated') fails the function returns early without awaiting or cancelling completionPromise. The four event-bus subscriptions remain active until the fallback timer fires (evalTimeout + 5000 ms later).
  • stdout/stderr ordering corrupts last-line parsing (agent-exit-condition-evaluator.ts ~line 632): [...output.stdout, ...output.stderr].join('\n') puts all stderr after all stdout. Any stderr output from the eval agent's runtime (tool traces, warnings) will become the last line, causing PASS/FAIL/score parsing to fail.
  • Custom evalPrompt silently drops the git diff instruction and format requirement (agent-exit-condition-evaluator.ts ~line 659): When evalPrompt is set it fully replaces defaultInstructions, discarding the git diff <sha>..HEAD command and the strict "last line must be exactly PASS or FAIL" constraint. The orchestrator prompt example compounds this by showing "FAIL with an explanation" phrasing which would produce a non-matching last line.
  • Agent mode silently accepts exitCondition (loop-manager.ts): The service layer only rejects exitCondition for shell mode; MCP callers can pass a non-empty exitCondition with evalMode: 'agent', which is stored but never executed.
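The stdout/stderr ordering finding can be illustrated with a small sketch (parseDecision is a hypothetical helper showing the last-line rule, not the evaluator's actual parser):

```typescript
// Hypothetical last-line parser: the verdict must be the final non-empty line.
function parseDecision(lines: string[]): 'PASS' | 'FAIL' | null {
  const nonEmpty = lines.filter((line) => line.trim().length > 0);
  const last = nonEmpty[nonEmpty.length - 1]?.trim();
  return last === 'PASS' || last === 'FAIL' ? last : null;
}

const stdout = ['Reviewed the diff; tests pass.', 'PASS'];
const stderr = ['warning: experimental runtime flag enabled'];

// stdout-then-stderr: the warning becomes the last line, so parsing fails.
parseDecision([...stdout, ...stderr]); // null

// Parsing stdout alone preserves the agent's deliberate final line.
parseDecision(stdout); // 'PASS'
```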

Confidence Score: 4/5

Safe to merge after addressing the stdout/stderr ordering issue and the custom evalPrompt format-requirement gap, which would silently cause agent evals to fail in practice

All four findings are P2, but two of them (stdout/stderr ordering and custom evalPrompt missing format constraints) describe conditions where agent evaluation silently produces incorrect results on the primary happy path — enough to warrant fixing before wide use. The subscription leak and missing agent-mode exitCondition rejection are lower-urgency but still worth a quick fix.

src/services/agent-exit-condition-evaluator.ts requires the most attention (subscription leak, output parsing, and evalPrompt construction); src/services/loop-manager.ts needs a minor guard for agent mode + exitCondition

Important Files Changed

Filename Overview
src/services/agent-exit-condition-evaluator.ts New agent evaluator — has subscription leak on emit failure, stdout/stderr ordering issue, and custom evalPrompt loses critical format/git-diff instructions
src/services/loop-manager.ts Validation logic updated — shell mode correctly requires exitCondition, but agent mode does not reject a non-empty exitCondition passed via MCP
src/services/handlers/loop-handler.ts Stale-state guard added after slow agent eval — correctly re-fetches loop and iteration, propagates evalFeedback to all iteration outcome paths
src/implementations/database.ts Migration v15 adds eval_mode (NOT NULL DEFAULT 'shell'), eval_prompt, and eval_feedback columns — backward compatible and correctly structured
src/adapters/mcp-adapter.ts CreateLoop and ScheduleLoop schemas updated to accept optional evalMode/evalPrompt; evalFeedback surfaced in iteration output; no cross-field validation at schema level (deferred to service)
src/services/orchestrator-prompt.ts Agent eval mode documented in orchestrator prompt, but the --eval-prompt example uses 'FAIL with an explanation' phrasing which would not produce a valid bare-FAIL last line

Sequence Diagram

sequenceDiagram
    participant LH as LoopHandler
    participant CE as CompositeEvaluator
    participant SE as ShellEvaluator
    participant AE as AgentEvaluator
    participant EB as EventBus
    participant OR as OutputRepository
    participant LR as LoopRepository

    LH->>CE: evaluate(loop, taskId)
    alt evalMode == 'shell'
        CE->>SE: evaluate(loop, taskId)
        SE-->>CE: EvalResult (exitCode)
    else evalMode == 'agent'
        CE->>AE: evaluate(loop, taskId)
        AE->>LR: findIterationByTaskId(taskId)
        LR-->>AE: preIterationCommitSha
        AE->>AE: buildEvalPrompt(loop, taskId)
        AE->>EB: subscribe(TaskCompleted / Failed / Cancelled / Timeout)
        AE->>EB: emit(TaskDelegated, evalTask)
        Note over EB: Claude Code eval agent runs
        EB-->>AE: TaskCompleted(evalTaskId)
        AE->>OR: get(evalTaskId)
        OR-->>AE: stdout + stderr lines
        AE->>AE: parseEvalOutput(fullText, strategy)
        AE-->>CE: EvalResult (passed/score/feedback)
    end
    CE-->>LH: EvalResult

    Note over LH: Stale-state guard: re-fetch loop + iteration
    LH->>LR: findById(loopId)
    LH->>LR: findIterationByTaskId(taskId)
    LH->>LH: handleIterationResult(freshLoop, freshIteration, evalResult)

Comments Outside Diff (4)

  1. src/services/agent-exit-condition-evaluator.ts, line 583-595 (link)

    P2 Subscription leak when TaskDelegated emit fails

    completionPromise (and its four event-bus subscriptions) is created before the emit and is never awaited or cancelled when the emit fails. The subscriptions will remain alive until the fallback timer fires evalTimeout + 5000 ms later. In a long-running server that repeatedly fails to delegate eval tasks this leaks subscriptions on every failure.

    A cancel() helper returned from waitForTaskCompletion (or exposing the cleanup function) would let the early-return path clean up immediately:

    const completionPromise = this.waitForTaskCompletion(evalTaskId, loop.evalTimeout);
    
    const emitResult = await this.eventBus.emit('TaskDelegated', { task: evalTask });
    if (!emitResult.ok) {
      completionPromise.cancel(); // clean up subscriptions immediately
      this.logger.error('Failed to emit TaskDelegated for eval task', emitResult.error, { ... });
      return { passed: false, error: `Failed to spawn eval agent: ${emitResult.error.message}` };
    }

    Alternatively, use an AbortController or expose the resolveOnce as a cancel signal so the early-return path can force the promise to settle.

  2. src/services/agent-exit-condition-evaluator.ts, line 632-635 (link)

    P2 stdout/stderr interleaving may corrupt last-line parsing

    The eval output is assembled as [...output.stdout, ...output.stderr].join('\n'). This places all stderr lines after all stdout lines regardless of real write order. If the eval agent emits even a single stderr line (tool trace, warning, debug message from the Claude Code runtime), that line becomes the new "last line" — causing every PASS/FAIL or numeric score to appear on a non-last line and fail validation.

    Consider reading only stdout for the decision line, or at minimum place stderr first so the agent's deliberate final stdout line remains last:

    // Prefer stdout for the decision; stderr is informational
    const fullText = [...output.stderr, ...output.stdout].join('\n');

    Or, better yet, parse only stdout for the exit decision and use stderr solely for diagnostic logging.

  3. src/services/agent-exit-condition-evaluator.ts, line 659-669 (link)

    P2 Custom evalPrompt silently drops the git diff instruction and format requirements

    When loop.evalPrompt is set, instructions is set to the raw custom string and the gitDiffInstruction (the exact git diff <sha>..HEAD command needed to view this iteration's changes) and the strict last-line format requirement are both omitted from the final prompt.

    The example in orchestrator-prompt.ts reinforces this problem:

    --eval-prompt "Review the changes and output PASS if all tests pass and code quality is high, otherwise FAIL with an explanation."
    

    An LLM following this wording would likely output "FAIL: tests are still failing" as its final line, which does not match the expected exact string "FAIL" and will always be treated as a parse error.

    The cleanest fix is to append the git diff instruction and format requirement even for custom prompts:

    const formatReminder = isRetry
      ? `\n\nIMPORTANT: ${gitDiffInstruction} The LAST LINE of your response must be exactly PASS or FAIL.`
      : `\n\nIMPORTANT: ${gitDiffInstruction} The LAST LINE must be a single numeric score (e.g. 85).`;
    const instructions = (loop.evalPrompt ?? defaultInstructions) + (loop.evalPrompt ? formatReminder : '');
  4. src/services/loop-manager.ts, line 1033-1051 (link)

    P2 Agent mode silently accepts and persists a non-empty exitCondition

    The service only validates exitCondition when evalMode === 'shell'. When evalMode === 'agent', a caller (e.g. via MCP) can pass any exitCondition value and it will be stored in the database without warning. The stored value is never used by the agent evaluator, creating a confusing state where the DB record has a non-empty exitCondition that is never executed.

    Add a symmetric guard for agent mode:

    if (evalMode === 'agent') {
      if (request.exitCondition && request.exitCondition.trim().length > 0) {
        return err(
          new AutobeatError(ErrorCode.INVALID_INPUT, 'exitCondition is not valid with evalMode: agent', {
            field: 'exitCondition',
          }),
        );
      }
    }

Reviews (1): Last reviewed commit: "refactor: simplify agent evaluator and c..."

evalTimeout: z.number().min(1000).optional().default(60000).describe('Eval script timeout in ms'),
evalTimeout: z.number().min(1000).optional().default(60000).describe('Eval timeout in ms (max: shell=300s, agent=600s)'),
workingDirectory: z.string().optional().describe('Working directory for task and eval'),
maxIterations: z.number().min(0).optional().default(10).describe('Max iterations (0 = unlimited)'),

SECURITY: Missing .max() upper-bound on evalTimeout Zod schema (82% confidence)

The evalTimeout field (within the CreateLoopSchema starting at line 237) has min(1000) but no .max(). While LoopManagerService enforces 300s/600s limits downstream, the MCP schema should enforce the boundary (validates-at-boundaries principle).

Fix: Add .max(600000) to the evalTimeout definition:

evalTimeout: z
  .number()
  .min(1000)
  .max(600000)
  .optional()
  .default(60000)
  .describe('Eval timeout in ms (max: shell=300s, agent=600s)'),

Apply the same fix to ScheduleLoopSchema.


dean0x commented Mar 30, 2026

Code Review Summary: Lower-Confidence Issues & Documentation Gaps

Reviewer: Claude Code
Timestamp: 2026-03-30 12:14 UTC
Branch: feat/agent-eval-mode → main


Documentation Issues (Not Inline - Require File Updates)

HIGH PRIORITY (95-90% confidence):

  1. FEATURES.md not updated for agent eval mode (95% confidence)

    • The PR adds significant new feature but docs/FEATURES.md does not mention agent eval mode, new eval timeout differences (300s vs 600s), or Migration v15
    • Fix: Add agent eval mode to Loop Strategies section, Configuration section (evalMode, evalPrompt), and CLI Commands section with examples
  2. README.md omits agent eval mode examples (92% confidence)

    • Users discovering the feature won't know it exists; no mention of --eval-mode agent --strategy retry pattern
    • Fix: Add "Agent eval" example after shell eval examples
  3. CHANGELOG.md [Unreleased] is empty (90% confidence)

    • No entries for agent eval mode, custom eval prompts, stale state guard, or Migration v15
    • Fix: Add [Unreleased] section with Features and Database subsections
  4. MCP JSON Schema out of sync with Zod schema (95% confidence)

    • JSON Schema still requires exitCondition (line 1043 for CreateLoop, line 1180 for ScheduleLoop), but Zod schema makes it optional
    • Fix: Remove from required array; MCP clients will fail JSON Schema validation before Zod even runs
    • JSON Schema also missing evalMode and evalPrompt properties (lines 966-1044 for CreateLoop, 1146-1181 for ScheduleLoop)
    • Fix: Add properties and update exitCondition description

Design Issues to Address (60-80% confidence, summary consolidated)

Duplicated cleanup triplet (lines 294-296, 313-315, 324-326 in loop-handler.ts):

  • The three-line sequence appears 4 times; violates DRY
  • Suggested fix: Extract a `cleanupIterationTracking()` helper

Cascading validation gaps:

  • schedule-manager.ts:485 unconditionally requires exitCondition (blocks agent-mode scheduled loops)
  • schedule.ts ParsedLoopConfig interface missing evalMode/evalPrompt fields
  • Suggested fix: Guard validation with evalMode check
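A sketch of that guard, assuming a ParsedLoopConfig shape extended with evalMode (field names are inferred from this PR's shell-mode validation, not taken from schedule.ts):

```typescript
type EvalMode = 'shell' | 'agent';

// Assumed shape of the parsed scheduled-loop config after adding evalMode.
interface ParsedLoopConfig {
  evalMode: EvalMode;
  exitCondition?: string;
}

// Only shell mode requires a shell exit condition; agent mode evaluates via an agent.
function validateScheduledLoop(config: ParsedLoopConfig): string | null {
  if (config.evalMode === 'shell' && !config.exitCondition?.trim()) {
    return 'exitCondition is required when evalMode is "shell"';
  }
  return null;
}
```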

Repository field naming inconsistency:

  • AgentExitConditionEvaluator uses outputRepository (full name) while handlers use loopRepo (abbreviated)
  • Suggested fix: Rename to `outputRepo` for consistency with `loopRepo`

Type safety improvements:

  • loop-repository.ts:609: Change LoopRowSchema from z.string() to z.enum(['shell', 'agent']) to remove unsafe as cast
  • loop-manager.ts:57: Add explicit validation rejecting invalid evalMode values (defense-in-depth)
  • loop-handler.ts:1121: Inline type for evalResult should be named type (IterationEvalFields) for reuse

Tests: High-Confidence Issues

Test boilerplate pattern (90% confidence):

  • The vi.spyOn + capturedEvalTaskId + setImmediate + simulateEvalTaskComplete pattern repeats 12+ times in agent-exit-condition-evaluator.test.ts
  • Suggested fix: Extract an `evaluateWithCompletion()` shared helper to reduce boilerplate

Missing test coverage:

  • Stale iteration guard path (one of two branches uncovered)
  • Shell mode evalTimeout boundary validation (evalTimeout > 300000ms should reject)
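A hedged sketch of the boundary check such a test would exercise (limits taken from this PR's 300s/600s description; the validator name is illustrative):

```typescript
const MIN_EVAL_TIMEOUT_MS = 1_000;
const SHELL_EVAL_TIMEOUT_MAX_MS = 300_000; // 300s shell-mode ceiling, per the PR
const AGENT_EVAL_TIMEOUT_MAX_MS = 600_000; // 600s agent-mode ceiling, per the PR

// Illustrative validator: true when timeoutMs is within the mode's bounds.
function isValidEvalTimeout(evalMode: 'shell' | 'agent', timeoutMs: number): boolean {
  const max = evalMode === 'shell' ? SHELL_EVAL_TIMEOUT_MAX_MS : AGENT_EVAL_TIMEOUT_MAX_MS;
  return timeoutMs >= MIN_EVAL_TIMEOUT_MS && timeoutMs <= max;
}
```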

Pre-Existing Issues (Not Your Responsibility, but Context)

  • PF-005: getResetTargetSha O(n) iteration scan will exercise more with agent eval
  • PF-006: 4 sequential git spawns per iteration (agent eval adds eval spawn overhead)
  • Pre-existing shell evaluator has no guard for empty exitCondition (should fail vs default to true)

Files Requiring Updates (Outside of Inline Comments)

  • docs/FEATURES.md - Add agent eval mode documentation
  • README.md - Add agent eval mode examples
  • CHANGELOG.md - Add [Unreleased] section
  • CLAUDE.md - Add new service files to File Locations table
  • src/adapters/mcp-adapter.ts - Fix MCP JSON Schemas (lines 1043, 1180, 966-1044, 1146-1181)
  • src/services/schedule-manager.ts:485 - Guard exitCondition validation with evalMode check
  • src/cli/commands/schedule.ts:32-44 - Add evalMode/evalPrompt to ParsedLoopConfig

Recommendation: 7 blocking issues (inline), 11 high-confidence findings (above), 6 documentation updates required


const lastLine = lines[lines.length - 1].trim();
// Everything before the last line (if any) as feedback
const feedbackLines = lines.slice(0, -1);

SECURITY: Unbounded evalFeedback stored in SQLite (85% confidence)

The feedback field at line 235 is constructed from the entire eval agent output (minus the last line) and stored directly in the loop_iterations.eval_feedback column with no size cap. A misbehaving eval agent could produce megabytes of output, causing database bloat and performance degradation.

Fix: Truncate feedback before storage, consistent with the 8000-char cap on evalPrompt in the Zod schema:

const MAX_FEEDBACK_LENGTH = 16000; // ~16KB reasonable limit
const feedback = feedbackLines.length > 0
  ? feedbackLines.join('\n').slice(0, MAX_FEEDBACK_LENGTH)
  : undefined;

const gitDiffInstruction = preIterationCommitSha
? `Use \`git diff ${preIterationCommitSha}..HEAD\` to see what changed in this iteration.`
: 'Use `git diff HEAD~1..HEAD` to see what changed in this iteration.';


PERFORMANCE: Unbounded output concatenation (85% confidence)

The evaluator at line 115-116 concatenates full stdout and stderr arrays into a single string, creating two full copies in memory (array + joined string), then a third filtered copy from lines.filter(). For a 100KB agent output, this is significant memory duplication.

Fix: Parse directly from the filtered lines array without the join-then-split round-trip:

const allLines = [...output.stdout, ...output.stderr];
const nonEmptyLines = allLines.filter((line) => line.trim().length > 0);
return this.parseEvalOutputFromLines(nonEmptyLines, loop.strategy);

readonly exitCondition: string; // Shell command to evaluate iteration result (empty string for agent mode)
readonly evalDirection?: OptimizeDirection; // Optimize strategy only
readonly evalTimeout: number; // Milliseconds for exit condition evaluation
readonly evalMode: 'shell' | 'agent'; // Evaluation mode: shell command or agent review

TYPE SAFETY: evalMode string literal instead of enum (83% confidence)

The evalMode field in the Loop domain is typed as 'shell' | 'agent' string literal union with no single source of truth. The MCP adapter and repository use as 'shell' | 'agent' casts that bypass TypeScript's exhaustiveness checking. The project already uses enums for similar concepts (LoopStrategy, OptimizeDirection).

Fix: Define an EvalMode enum in core/domain.ts:

export enum EvalMode {
  SHELL = 'shell',
  AGENT = 'agent',
}

Then use EvalMode everywhere and add exhaustive switch in CompositeExitConditionEvaluator:

switch (loop.evalMode) {
  case EvalMode.AGENT:
    return this.agentEvaluator.evaluate(loop, taskId);
  case EvalMode.SHELL:
    return this.shellEvaluator.evaluate(loop, taskId);
  default:
    return loop.evalMode satisfies never;
}

exitCondition: data.exit_condition,
evalDirection: data.eval_direction ? this.toOptimizeDirection(data.eval_direction) : undefined,
evalTimeout: data.eval_timeout,
evalMode: data.eval_mode as 'shell' | 'agent',

TYPE SAFETY: Unsafe as cast on eval_mode bypasses Zod validation (85% confidence)

The LoopRowSchema at line 609 validates eval_mode as bare z.string() (not z.enum()), then the repository casts with as 'shell' | 'agent' bypassing type safety. If a third eval mode is added or data corruption occurs, this cast silently lies about the type.

Fix: Change the Zod schema to use z.enum():

// In LoopRowSchema:
eval_mode: z.enum(['shell', 'agent']).default('shell'),

// In toDomain():
evalMode: data.eval_mode, // z.enum already narrows the type

This removes the need for the unsafe cast and validates at the repository boundary.


dean0x commented Mar 31, 2026

Closing: agent eval mode already on main. AgentExitConditionEvaluator and evalMode: agent fully implemented.

@dean0x dean0x closed this Mar 31, 2026
@dean0x dean0x deleted the feat/agent-eval-mode branch March 31, 2026 19:56