fix(step-ops): kill gateway session when step is failed (issue #225) by paralizeer · Pull Request #283 · snarktank/antfarm

paralizeer · 2026-03-06T02:13:25Z

When a workflow step is manually failed via 'antfarm step fail', any active gateway subagent session continues running, burning tokens until the 60-minute hard timeout.

This fix:

Adds session_key to the step query in failStep()
Calls killSession() to terminate the associated gateway session
Best-effort cleanup - logs warning on failure but doesn't block step failure

Also adds session_key to cleanupAbandonedSteps() query for future use.

Impact: Eliminates token waste when steps are manually failed. A failed step now properly cleans up its associated session.

Testing: Built successfully.

Risk: Low - session cleanup is best-effort, doesn't block on failure

Refs: #225

Add dryRunWorkflow() function that: - Validates workflow YAML via loadWorkflowSpec() - Builds execution context with placeholder values - Resolves all step input templates using resolveTemplate() - Prints execution plan showing all steps with agent assignments - Returns without creating DB entries or spawning crons Update CLI to call dryRunWorkflow when --dry-run flag is passed to 'workflow run' command. Tested with coding-sprint and bug-fix workflows.

…ntfarm workflow CLI. When passed to 'workflow run', it should validate the workflow YAML, resolve all template variables with placeholder values, print the execution plan (steps, agents, order), and exit without actually creating a run or spawning any agents. Should work for all workflows (coding-sprint, bug-fix, feature-dev, security-audit).

- Add safety reset in claimStep: if step is running but has no current_story_id, reset to pending - Add current_story.* context keys for template usage - Set defaults for reviewer template keys (commit, test_result) - Add logging to checkLoopContinuation for debugging - Update all workflow YAMLs from 'default' to 'minimax/MiniMax-M2.5' - Add memory access to developer/planner/reviewer/tester agents - Add new prospector workflows: eps-prospector, local-prospector, job-scout, gran-concepcion-prospector Addresses: snarktank#272 (story loop stuck), snarktank#266 (stall after Story 1) Auto-generated by Openclaw AutoDev

The workflow YAMLs were updated to use 'minimax/MiniMax-M2.5' instead of 'default' (commit 021244b), but the tests still expected 'default'. This caused 4 test failures in the polling configuration tests. Updated test expectations in: - tests/bug-fix-polling.test.ts - tests/feature-dev-polling.test.ts - tests/security-audit-polling.test.ts - tests/polling-timeout-sync.test.ts Auto-generated by Openclaw AutoDev

The DEFAULT_POLLING_MODEL was set to 'default' which is not a valid model identifier for sessions_spawn. This caused agent cron jobs to fail silently - they would fire but the sessions would not complete because the model was invalid. Changed both occurrences of 'default' to 'minimax/MiniMax-M2.5' which matches the default model in OpenClaw config and the workflow YAMLs. Fixes issue snarktank#217 - Agent cron jobs spawn sessions but work does not complete

Add validation in completeStep to check that step output contains all required keys specified in the workflow's 'expects' field. When a step outputs KEY: value pairs, we now validate that all keys listed in expects are present. If any required keys are missing, the step fails with a descriptive error message. This prevents incomplete step output from propagating to downstream steps and causing confusing failures later. Issue: snarktank#270 - Workflow may accept incomplete step output and advance with missing required context keys Auto-generated by Openclaw AutoDev

After 5 consecutive errors, the medic now auto-disables cron jobs to prevent wasted tokens on persistently failing jobs (issue snarktank#218). Changes: - gateway-api.ts: extract consecutiveErrors and lastStatus from cron list - gateway-api.ts: add disableCronJob() function for circuit breaker action - checks.ts: add checkFailingCrons() to detect crons exceeding error threshold - checks.ts: add disable_cron action type - medic.ts: handle disable_cron action to auto-disable failing cron jobs This is part of Resilience Week - making the system handle failure as elegantly as it handles success. Auto-generated by Openclaw AutoDev

…NTS.md The developer/coder/fixer agents were outputting STATUS: done but not calling the step complete CLI, causing steps to get stuck in 'running' state indefinitely. This happened because the polling prompt had the instruction but the agent AGENTS.md did not. Added explicit step complete instructions to: - feature-dev/agents/developer/AGENTS.md - coding-sprint/agents/coder/AGENTS.md - bug-fix/agents/fixer/AGENTS.md Each now includes: - ⚠️ CRITICAL warning header - Exact command to write output to temp file and pipe to step complete - Explanation that session will end after this call This should fix issue snarktank#272 where developer agent sessions exit after each story without completing the step. Refs: snarktank#272

The completeStep function referenced 'sessionKey' which is not a parameter of this function. Fixed by: 1. Adding session_key to the step SELECT query 2. Using step.session_key to preserve the existing session key This bug was causing TypeScript compilation failures. Auto-generated by Openclaw AutoDev

…ank#225) When a workflow step is manually failed via 'antfarm step fail', any active gateway subagent session continues running, burning tokens until the 60-minute hard timeout. This fix: - Adds session_key to the step query in failStep() - Calls killSession() to terminate the associated gateway session - Best-effort cleanup - logs warning on failure but doesn't block step failure Also adds session_key to cleanupAbandonedSteps() query for future use. Impact: Eliminates token waste when steps are manually failed. A failed step now properly cleans up its associated session. Refs: snarktank#225 Auto-generated by Openclaw AutoDev

vercel · 2026-03-06T02:13:29Z

@paralizeer is attempting to deploy a commit to the Ryan Team on Vercel.

A member of the Team first needs to authorize it.

Claw and others added 11 commits March 2, 2026 01:50

sprint: Add --dry-run flag parsing in CLI

b2fb943

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(step-ops): kill gateway session when step is failed (issue #225)#283

fix(step-ops): kill gateway session when step is failed (issue #225)#283
paralizeer wants to merge 11 commits intosnarktank:mainfrom
paralizeer:auto/fix/session-kill-on-step-fail-20260306_021154

paralizeer commented Mar 6, 2026

Uh oh!

vercel bot commented Mar 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

paralizeer commented Mar 6, 2026

Uh oh!

vercel bot commented Mar 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant