Conversation
- Stale state guard now allows PAUSED loops through, consistent with the early guard (graceful pause should record iteration results)
- Add cleanup of in-memory tracking maps (taskToLoop, pipelineTasks) when stale state guard returns early, preventing memory leaks
- Add evalMode/evalPrompt to MCP LoopStatus loop response
- Add evalFeedback to MCP LoopStatus iteration history
- Add CLI tests for --eval-mode, --eval-prompt, --strategy flags
- Add MCP adapter tests for evalMode/evalPrompt schema acceptance
- Extract TaskCompletionStatus type alias, replace nested ternary with switch
- Deduplicate buildEvalPrompt template structure
- Remove redundant comments restating adjacent code
- Remove empty else block in loop-manager validation
- Fix stale dependency count comment in handler-setup
```typescript
const lastLine = lines[lines.length - 1].trim();
// Everything before the last line (if any) as feedback
const feedbackLines = lines.slice(0, -1);
```
SECURITY: Unbounded evalFeedback stored in SQLite (85% confidence)
The evalFeedback field is constructed from the entire eval agent output and stored directly in the loop_iterations.eval_feedback column with no size cap. A misbehaving eval agent could produce megabytes of output, causing database bloat and performance degradation.
Fix: Truncate feedback before storage:

```typescript
const MAX_FEEDBACK_LENGTH = 16000; // ~16KB reasonable limit
const feedback = feedbackLines.length > 0
  ? feedbackLines.join('\n').slice(0, MAX_FEEDBACK_LENGTH)
  : undefined;
```

This is consistent with the 8000-char cap on evalPrompt in the Zod schema.
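The cap behaves as expected even for pathological output; a standalone sketch (the constant and `buildFeedback` helper are illustrative, not the project's actual code):

```typescript
// Standalone sketch of the proposed feedback cap; names are illustrative.
const MAX_FEEDBACK_LENGTH = 16000;

function buildFeedback(feedbackLines: string[]): string | undefined {
  return feedbackLines.length > 0
    ? feedbackLines.join('\n').slice(0, MAX_FEEDBACK_LENGTH)
    : undefined;
}

// An eval agent emitting ~10MB of lines is capped at 16000 chars before storage.
const huge = Array.from({ length: 100000 }, () => 'x'.repeat(100));
console.log(buildFeedback(huge)?.length); // 16000
console.log(buildFeedback([])); // undefined
```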
```typescript
} else {
  // Task COMPLETED — run exit condition evaluation
  // Note: agent eval mode can take a long time; re-fetch state afterwards to guard stale data
  const evalResult = await this.exitConditionEvaluator.evaluate(loop, taskId);
```
ARCHITECTURE: Eval task not cancelled when loop is cancelled (85% confidence)
When a loop is cancelled, the handleLoopCancelled method updates the loop status and cleans up tracking, but the eval task spawned by AgentExitConditionEvaluator is NOT tracked in taskToLoop. This means the eval agent task continues running as an orphan, wasting compute resources.
Fix: Track the in-flight eval task ID so handleLoopCancelled can emit a TaskCancelled event for any running eval task. Consider passing an AbortSignal that LoopHandler can trigger on cancellation:

```typescript
async evaluate(loop: Loop, taskId: TaskId, signal?: AbortSignal): Promise<EvalResult> {
  // ...spawn eval task...
  if (signal?.aborted) {
    await this.eventBus.emit('TaskCancelled', { taskId: evalTaskId, reason: 'loop-cancelled' });
    return { passed: false, error: 'Eval cancelled — loop was cancelled' };
  }
}
```
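The suggested signal wiring can be exercised standalone with Node's built-in `AbortController` (the `runEval` function here is an illustrative stand-in for the evaluator, not the project's code):

```typescript
// Illustrative sketch: a long-running eval that checks an AbortSignal on completion.
async function runEval(signal?: AbortSignal): Promise<{ passed: boolean; error?: string }> {
  await new Promise((resolve) => setTimeout(resolve, 10)); // stand-in for the eval agent task
  if (signal?.aborted) {
    return { passed: false, error: 'Eval cancelled — loop was cancelled' };
  }
  return { passed: true };
}

const controller = new AbortController();
const pending = runEval(controller.signal);
controller.abort(); // loop cancelled while the eval is in flight
pending.then((result) => console.log(result.error)); // logs the cancellation error
```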
```typescript
const freshLoop = freshLoopResult.value;

const freshIterationResult = await this.loopRepo.findIterationByTaskId(taskId);
if (!freshIterationResult.ok || !freshIterationResult.value || freshIterationResult.value.status !== 'running') {
```
MAINTENANCE: Duplicated cleanup triplet (82% confidence)
The three-line cleanup sequence (cleanupPipelineTaskTracking, taskToLoop.delete, cleanupPipelineTasks) appears identically at lines 294-296, 313-315, and 324-326. This violates DRY and creates a maintenance hazard.
Fix: Extract into a helper method:

```typescript
private cleanupIterationTracking(iteration: LoopIteration, taskId: TaskId, loopId: string): void {
  this.cleanupPipelineTaskTracking(iteration);
  this.taskToLoop.delete(taskId);
  this.cleanupPipelineTasks(loopId, iteration.iterationNumber);
}
```

Then use a finally block or restructure the early returns to call this once.
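The finally-based shape can be sketched standalone (the maps and handler below are illustrative stand-ins for the handler's tracking state, not the project's code):

```typescript
// Illustrative stand-ins for the handler's tracking maps.
const taskToLoop = new Map<string, string>();
const pipelineTasks = new Map<string, string[]>();

function cleanupIterationTracking(taskId: string, loopId: string): void {
  taskToLoop.delete(taskId);
  pipelineTasks.delete(loopId);
}

function handleTaskCompleted(taskId: string, loopId: string): string {
  try {
    // ...stale-state guards and iteration handling with multiple early returns...
    return 'handled';
  } finally {
    // Runs on every exit path, so the cleanup triplet appears exactly once.
    cleanupIterationTracking(taskId, loopId);
  }
}

taskToLoop.set('t1', 'loop1');
pipelineTasks.set('loop1', ['t1']);
handleTaskCompleted('t1', 'loop1');
console.log(taskToLoop.size, pipelineTasks.size); // 0 0
```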
```typescript
readonly exitCondition: string; // Shell command to evaluate iteration result (empty string for agent mode)
readonly evalDirection?: OptimizeDirection; // Optimize strategy only
readonly evalTimeout: number; // Milliseconds for exit condition evaluation
readonly evalMode: 'shell' | 'agent'; // Evaluation mode: shell command or agent review
```
TYPE SAFETY: evalMode string literal instead of enum (83% confidence)
The evalMode field is typed as 'shell' | 'agent' string literal union with no single source of truth. The MCP adapter and repository use as 'shell' | 'agent' casts that bypass TypeScript's exhaustiveness checking. The project already uses enums for similar concepts (LoopStrategy, OptimizeDirection).
Fix: Define an EvalMode enum:

```typescript
export enum EvalMode {
  SHELL = 'shell',
  AGENT = 'agent',
}
```

Then use EvalMode everywhere and add an exhaustive switch in CompositeExitConditionEvaluator:

```typescript
switch (loop.evalMode) {
  case EvalMode.AGENT:
    return this.agentEvaluator.evaluate(loop, taskId);
  case EvalMode.SHELL:
    return this.shellEvaluator.evaluate(loop, taskId);
  default:
    return loop.evalMode satisfies never;
}
```
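A minimal self-contained demonstration of the exhaustiveness pattern (the enum mirrors the suggestion; `describeMode` is an illustrative stand-in for the composite evaluator, and `satisfies never` requires TypeScript 4.9+):

```typescript
// Illustrative enum mirroring the suggested EvalMode.
enum EvalMode {
  SHELL = 'shell',
  AGENT = 'agent',
}

function describeMode(mode: EvalMode): string {
  switch (mode) {
    case EvalMode.AGENT:
      return 'agent review';
    case EvalMode.SHELL:
      return 'shell command';
    default:
      // If a third member is added to EvalMode, this line becomes a compile error,
      // because `mode` is no longer narrowed to `never` here.
      return mode satisfies never;
  }
}

console.log(describeMode(EvalMode.AGENT)); // 'agent review'
```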
```typescript
const gitDiffInstruction = preIterationCommitSha
  ? `Use \`git diff ${preIterationCommitSha}..HEAD\` to see what changed in this iteration.`
  : 'Use `git diff HEAD~1..HEAD` to see what changed in this iteration.';
```
PERFORMANCE: Unbounded output concatenation (85% confidence)
The evaluator concatenates full stdout and stderr arrays into a single string before parsing, creating two full copies in memory (array + joined string), then a third filtered copy from lines.filter(). For a 100KB agent output, this is significant.
Fix: Only parse from the filtered lines array, without the join-then-split round-trip:

```typescript
const allLines = [...output.stdout, ...output.stderr];
const nonEmptyLines = allLines.filter((line) => line.trim().length > 0);
return this.parseEvalOutputFromLines(nonEmptyLines, loop.strategy);
```
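A standalone check that the direct-filter path yields the same non-empty lines as the join-then-split round-trip, given that stdout/stderr are already line arrays (the `output` shape is illustrative):

```typescript
// Illustrative output shape: separate stdout/stderr line arrays.
const output = { stdout: ['RESULT: pass', '', 'note'], stderr: ['warn', '  '] };

// Round-trip: join everything into one string, then split and filter.
const roundTrip = [...output.stdout, ...output.stderr]
  .join('\n')
  .split('\n')
  .filter((line) => line.trim().length > 0);

// Direct: filter the arrays without materializing the joined string.
const direct = [...output.stdout, ...output.stderr]
  .filter((line) => line.trim().length > 0);

console.log(JSON.stringify(roundTrip) === JSON.stringify(direct)); // true
```

The equivalence holds because each array element is a single line with no embedded newlines, so joining and re-splitting is a pure round-trip.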
```typescript
// Strategy inference from flags
// Validate --eval-prompt requires --eval-mode agent
```
COMPLEXITY: parseLoopCreateArgs exceeds maintainability threshold (88% confidence)
This function is 238 lines with ~25 decision points. The agent eval mode additions (lines 153-206) pushed it past the threshold. The agent and shell branches each construct nearly identical shared objects.
Fix: Extract the two code paths into helper functions:

```typescript
if (evalMode === 'agent') {
  return parseAgentModeArgs({ promptWords, untilCmd, evalCmd, strategyFlag, ... });
}
return parseShellModeArgs({ promptWords, untilCmd, evalCmd, strategyFlag, ... });
```

This makes each path easier to understand and modify independently.
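The split can be sketched standalone, with the shared fields factored into one helper (all argument shapes and names here are illustrative; the real parseLoopCreateArgs handles many more flags):

```typescript
// Illustrative arg shapes; the real parser handles many more flags.
interface CommonArgs { promptWords: string[]; strategyFlag?: string }

function parseShared(args: CommonArgs) {
  return { prompt: args.promptWords.join(' '), strategy: args.strategyFlag ?? 'fixed' };
}

function parseAgentModeArgs(args: CommonArgs & { evalPrompt?: string }) {
  return { ...parseShared(args), evalMode: 'agent' as const, evalPrompt: args.evalPrompt };
}

function parseShellModeArgs(args: CommonArgs & { evalCmd?: string }) {
  return { ...parseShared(args), evalMode: 'shell' as const, exitCondition: args.evalCmd ?? '' };
}

console.log(parseAgentModeArgs({ promptWords: ['fix', 'tests'] }).evalMode); // 'agent'
```

Each mode-specific helper stays small, and the shared-object construction lives in exactly one place.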
**Greptile Summary**

This PR adds an agent-based eval mode (`evalMode: 'agent'`).

Notable strengths:

Issues found:

**Confidence Score: 4/5**

Safe to merge after addressing the stdout/stderr ordering issue and the custom evalPrompt format-requirement gap, which would otherwise silently cause agent evals to fail in practice. All four findings are P2, but two of them (stdout/stderr ordering and custom evalPrompt missing format constraints) describe conditions where agent evaluation silently produces incorrect results on the primary happy path — enough to warrant fixing before wide use. The subscription leak and missing agent-mode exitCondition rejection are lower-urgency but still worth a quick fix. src/services/agent-exit-condition-evaluator.ts requires the most attention (subscription leak, output parsing, and evalPrompt construction); src/services/loop-manager.ts needs a minor guard for agent mode + exitCondition.

**Important Files Changed**
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant LH as LoopHandler
    participant CE as CompositeEvaluator
    participant SE as ShellEvaluator
    participant AE as AgentEvaluator
    participant EB as EventBus
    participant OR as OutputRepository
    participant LR as LoopRepository
    LH->>CE: evaluate(loop, taskId)
    alt evalMode == 'shell'
        CE->>SE: evaluate(loop, taskId)
        SE-->>CE: EvalResult (exitCode)
    else evalMode == 'agent'
        CE->>AE: evaluate(loop, taskId)
        AE->>LR: findIterationByTaskId(taskId)
        LR-->>AE: preIterationCommitSha
        AE->>AE: buildEvalPrompt(loop, taskId)
        AE->>EB: subscribe(TaskCompleted / Failed / Cancelled / Timeout)
        AE->>EB: emit(TaskDelegated, evalTask)
        Note over EB: Claude Code eval agent runs
        EB-->>AE: TaskCompleted(evalTaskId)
        AE->>OR: get(evalTaskId)
        OR-->>AE: stdout + stderr lines
        AE->>AE: parseEvalOutput(fullText, strategy)
        AE-->>CE: EvalResult (passed/score/feedback)
    end
    CE-->>LH: EvalResult
    Note over LH: Stale-state guard: re-fetch loop + iteration
    LH->>LR: findById(loopId)
    LH->>LR: findIterationByTaskId(taskId)
    LH->>LH: handleIterationResult(freshLoop, freshIteration, evalResult)
```
```diff
- evalTimeout: z.number().min(1000).optional().default(60000).describe('Eval script timeout in ms'),
+ evalTimeout: z.number().min(1000).optional().default(60000).describe('Eval timeout in ms (max: shell=300s, agent=600s)'),
  workingDirectory: z.string().optional().describe('Working directory for task and eval'),
  maxIterations: z.number().min(0).optional().default(10).describe('Max iterations (0 = unlimited)'),
```
SECURITY: Missing .max() upper-bound on evalTimeout Zod schema (82% confidence)
The evalTimeout field (within the CreateLoopSchema starting at line 237) has min(1000) but no .max(). While LoopManagerService enforces 300s/600s limits downstream, the MCP schema should enforce the boundary (validates-at-boundaries principle).
Fix: Add .max(600000) to the evalTimeout definition:

```typescript
evalTimeout: z
  .number()
  .min(1000)
  .max(600000)
  .optional()
  .default(60000)
  .describe('Eval timeout in ms (max: shell=300s, agent=600s)'),
```

Apply the same fix to ScheduleLoopSchema.
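The same validates-at-boundaries idea, sketched without the zod dependency (the function and error text are illustrative; only the 1000/600000/60000 bounds come from the review above):

```typescript
// Illustrative stand-alone bounds check mirroring .min(1000).max(600000).default(60000).
function parseEvalTimeout(input: number | undefined): number {
  if (input === undefined) return 60000; // default when the caller omits the field
  if (!Number.isFinite(input) || input < 1000 || input > 600000) {
    throw new RangeError(`evalTimeout must be between 1000 and 600000 ms, got ${input}`);
  }
  return input;
}

console.log(parseEvalTimeout(undefined)); // 60000
console.log(parseEvalTimeout(5000)); // 5000
```

Rejecting out-of-range values at the schema boundary means downstream services never see a timeout the system cannot honor.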
**Code Review Summary: Lower-Confidence Issues & Documentation Gaps**

Reviewer: Claude Code

**Documentation Issues (Not Inline - Require File Updates)**

HIGH PRIORITY (95-90% confidence):
**Design Issues to Address (60-80% confidence, summary consolidated)**

Duplicated cleanup triplet (lines 294-296, 313-315, 324-326 in loop-handler.ts):
Cascading validation gaps:
Repository field naming inconsistency:
Type safety improvements:
**Tests: High-Confidence Issues**

Test boilerplate pattern (90% confidence):
Missing test coverage:
Pre-Existing Issues (Not Your Responsibility, but Context)
Files Requiring Updates (Outside of Inline Comments)
```typescript
exitCondition: data.exit_condition,
evalDirection: data.eval_direction ? this.toOptimizeDirection(data.eval_direction) : undefined,
evalTimeout: data.eval_timeout,
evalMode: data.eval_mode as 'shell' | 'agent',
```
TYPE SAFETY: Unsafe as cast on eval_mode bypasses Zod validation (85% confidence)
The LoopRowSchema at line 609 validates eval_mode as bare z.string() (not z.enum()), then the repository casts with as 'shell' | 'agent' bypassing type safety. If a third eval mode is added or data corruption occurs, this cast silently lies about the type.
Fix: Change the Zod schema to use z.enum():

```typescript
// In LoopRowSchema:
eval_mode: z.enum(['shell', 'agent']).default('shell'),

// In toDomain():
evalMode: data.eval_mode, // z.enum already narrows the type
```

This removes the need for the unsafe cast and validates at the repository boundary.
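If pulling zod into this layer is undesirable, a plain type guard achieves the same boundary validation without the `as` cast (a hedged alternative sketch; the guard and error text are illustrative):

```typescript
type EvalMode = 'shell' | 'agent';

// Type guard: narrows an untyped DB string to EvalMode without an `as` cast.
function isEvalMode(value: string): value is EvalMode {
  return value === 'shell' || value === 'agent';
}

function toEvalMode(value: string): EvalMode {
  if (!isEvalMode(value)) {
    // Surfaces data corruption or a forgotten migration instead of silently lying about the type.
    throw new Error(`Corrupt eval_mode in loops table: ${JSON.stringify(value)}`);
  }
  return value;
}

console.log(toEvalMode('agent')); // 'agent'
```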
Closing: agent eval mode is already on main. AgentExitConditionEvaluator and evalMode: 'agent' are fully implemented.
Summary
Adds agent-based loop evaluation (`evalMode: 'agent'`) — enables spawning a Claude Code instance to evaluate loop iteration quality instead of shell commands. Comprehensive implementation including domain types, database migration, repository layer, validation, agent evaluator component, composite evaluator pattern, handler wiring, CLI flags, MCP adapter integration, orchestrator prompt enhancement, and full test coverage.

Changes
- `EvalMode` union type (`'shell' | 'agent'`), `AgentEvaluatorConfig` for agent setup
- `eval_mode` and `agent_evaluator_config` columns added to the `loops` table
- `LoopRepository` with `evalMode` and `agentEvaluatorConfig` property handling
- `LoopHandler` extended to support both evaluation modes
- `--eval-mode` and agent config flags added to `beat loop create`
- `CreateLoop` tool with eval mode parameters

Breaking Changes
None. The `evalMode` defaults to `'shell'` for backward compatibility.

Testing
Related Issues
Closes #[issue-number] (if applicable)
Reviewed via Claude Code