AutoResearchClaw v0.4.0 transforms the pipeline from a purely autonomous system into a human-AI collaborative research engine. This guide covers everything you need to know.
- Why Co-Pilot?
- Quick Start
- Intervention Modes
- The Co-Pilot Workflow
- CLI Commands
- Stage-by-Stage Intervention Guide
- Workshops
- Detached Operation
- Safety & Guardrails
- Intelligence Layer
- Pipeline Branching
- Adapters (CLI / WebSocket / MCP)
- Configuration Reference
- FAQ
## Why Co-Pilot?

Fully autonomous research pipelines produce papers fast, but testing reveals consistent quality gaps:
| Problem | Root Cause |
|---|---|
| Weak research ideas | AI lacks taste for what's truly novel and impactful |
| Missing baselines | AI doesn't know which comparisons reviewers expect |
| Fragile experiment code | No human sanity check before execution |
| Thin analysis | AI draws superficial conclusions from results |
| Generic paper writing | AI produces correct-but-bland academic prose |
The HITL Co-Pilot system solves this by letting you intervene exactly where your expertise matters most, while the AI handles the heavy lifting everywhere else.
The result: papers that combine AI speed with human judgment.
## Quick Start

```bash
researchclaw run --topic "Your research idea" --mode co-pilot
```

The pipeline will run automatically and pause at key decision points for your input. At each pause, you'll see an interactive prompt with available actions.

```bash
researchclaw run --topic "Your research idea" --mode express
```

Express mode only pauses at three critical gates: hypothesis approval (Stage 8), experiment design (Stage 9), and the final quality check (Stage 20).

```bash
researchclaw run --topic "Your research idea" --auto-approve
```

Full auto: no human intervention, identical to pre-v0.4.0 behavior.
## Intervention Modes

| Mode | Flag | Pauses At | Best For |
|---|---|---|---|
| Full Auto | `--auto-approve` | Never | Quick exploration, low-stakes experiments |
| Gate Only | `--mode gate-only` | 3 gate stages (5, 9, 20) | Light oversight |
| Checkpoint | `--mode checkpoint` | End of each phase (8 points) | Phase-level review |
| Co-Pilot | `--mode co-pilot` | Critical stages + SmartPause triggers | Recommended for production |
| Step-by-Step | `--mode step-by-step` | After every stage (23 pauses) | Learning the pipeline |
| Express | `--mode express` | 3 most critical gates only | Experienced users |
| Custom | `--mode custom` | User-defined per-stage policies | Advanced configuration |
- First time using the pipeline? Start with `step-by-step` to understand each stage.
- Publishing a real paper? Use `co-pilot` for the best quality.
- Running overnight? Use `gate-only` or `express` — fewer interruptions.
- Batch processing many topics? Use `full-auto`.
## The Co-Pilot Workflow

When the pipeline pauses, you'll see an interactive panel:

```text
──────────────────────────────────────────────────────────
 HITL | Stage 08: HYPOTHESIS_GEN
 Post-stage review
──────────────────────────────────────────────────────────
Stage 8 (HYPOTHESIS_GEN) — done

Hypotheses generated. This is a CRITICAL decision point —
review each hypothesis for novelty, feasibility, and significance.

Outputs:
  hypotheses.md (1,247 bytes)
    → ## Hypothesis 1: Quantum gate noise as structured regularization
  novelty_report.json (892 bytes)
    Novelty score: 0.72 (moderate)

Available actions:
  [a] Approve and continue
  [r] Reject and rollback
  [e] Edit stage output
  [c] Start collaborative chat
  [i] Inject guidance / direction
  [s] Skip this stage
  [q] Abort pipeline
  [v] View full stage output

Action >
```
| Key | Action | What Happens |
|---|---|---|
| `a` | Approve | Accept the output and continue to the next stage |
| `r` | Reject | Reject the output; the pipeline rolls back to an earlier stage |
| `e` | Edit | Opens the output file in your `$EDITOR` (vim, nano, VS Code, etc.) |
| `c` | Collaborate | Start a multi-turn chat with the AI to refine the output together |
| `i` | Inject Guidance | Provide direction that will be incorporated into subsequent stages |
| `s` | Skip | Skip this stage entirely (use with caution) |
| `b` | Rollback | Jump back to a specific earlier stage |
| `q` | Abort | Stop the pipeline entirely |
| `v` | View | Display the full contents of output files |
## CLI Commands

```bash
# Co-Pilot mode
researchclaw run --topic "Quantum noise as neural network regularization" --mode co-pilot

# With explicit config
researchclaw run --config config.arc.yaml --topic "..." --mode co-pilot

# Resume a previous run in co-pilot mode
researchclaw run --config config.arc.yaml --resume --mode co-pilot
```

These commands let you interact with a paused pipeline from a separate terminal:
```bash
# Check status
researchclaw status artifacts/rc-2026-0328-abc123

# Attach interactively (full TUI)
researchclaw attach artifacts/rc-2026-0328-abc123

# Quick approve (non-interactive)
researchclaw approve artifacts/rc-2026-0328-abc123 --message "Looks good"

# Quick reject
researchclaw reject artifacts/rc-2026-0328-abc123 --reason "Missing ResNet baseline"

# Inject guidance for a specific stage
researchclaw guide artifacts/rc-2026-0328-abc123 --stage 9 --message "Add Dropout as baseline"
```

## Stage-by-Stage Intervention Guide

| Stage | Name | Co-Pilot Behavior | Your Role |
|---|---|---|---|
| 1-2 | Scoping | Pause after | Confirm research direction and scope |
| 3 | Search Strategy | Pause after | Add missing search terms or sources |
| 5 | Literature Screen | Approval required | Verify important papers aren't filtered out |
| 7 | Synthesis | Pause after | Check if the identified gaps match your understanding |
| 8 | Hypothesis Gen | Collaboration | Review, discuss, and refine the core research idea |
| 9 | Experiment Design | Collaboration + Approval | Verify baselines, benchmarks, metrics, ablations |
| 10 | Code Generation | Pause after | Spot-check code quality |
| 12 | Experiment Run | Stream output | Monitor training metrics in real-time |
| 13 | Iterative Refine | Pause after | Decide if refinement should continue |
| 15 | Research Decision | Approval required | Choose PROCEED, PIVOT, or REFINE |
| 16 | Paper Outline | Pause after | Adjust section structure |
| 17 | Paper Draft | Collaboration | Co-write key sections |
| 18 | Peer Review | Pause after | Prioritize which review comments to address |
| 20 | Quality Gate | Approval required | Final publication decision |
| 23 | Citation Verify | Pause after | Review flagged citations |
You can inject guidance for any stage at any time, even before it runs:
```bash
researchclaw guide artifacts/rc-xxx --stage 17 --message "Emphasize the theoretical contribution in the Introduction. Keep the Method section concise."
```

Guidance is automatically included in the LLM's context when that stage executes.
## Workshops

Workshops are specialized collaboration tools for the most critical stages.
### Idea Workshop (Stage 8)

When you enter collaboration mode at Stage 8, the Idea Workshop activates:
```text
You > c (start collaboration)

AI  > I generated 3 candidate hypotheses. Here's my evaluation:
      1. "Quantum gate noise as structured regularization" — Novelty: 8/10, Feasibility: 6/10
      2. "Entanglement-based feature selection" — Novelty: 7/10, Feasibility: 4/10
      3. "Quantum sampling for data augmentation" — Novelty: 5/10, Feasibility: 8/10
      Which direction interests you most?

You > Hypothesis 1 is the most promising, but we need to be more specific.
      The noise should be modeled as a stochastic regularizer, and we need
      to compare against Dropout, Label Smoothing, and MixUp.

AI  > Updated hypothesis:
      "We theoretically characterize quantum gate noise as a structured
      stochastic regularizer and empirically demonstrate it outperforms
      Dropout, Label Smoothing, and MixUp on small-sample classification."
      Baselines updated: + Dropout, + Label Smoothing, + MixUp, + CutMix
      Shall I proceed with this direction?

You > Yes, approve.
```
The workshop automatically:

- Saves your refined hypothesis to `hypotheses.md`
- Records the conversation in `hitl/chat_stage_08.jsonl`
- Persists candidate evaluations in `hitl/idea_workshop.json`
### Baseline Navigator (Stage 9)

The Baseline Navigator helps you build a rigorous experiment design:
```text
AI  > Here are the suggested baselines:
      [AI] ResNet-50 (standard image classification baseline)
      [AI] ViT-B/16 (transformer baseline)
      [AI] Dropout (regularization baseline)

      Experiment Design Checklist:
      [+] Baselines: 3
      [ ] Benchmarks: CIFAR-10, CIFAR-100
      [ ] Metrics: accuracy, F1

You > Add Label Smoothing and MixUp as baselines. Also add STL-10.

AI  > Updated. Current checklist:
      [+] Baselines: 5 (ResNet-50, ViT-B/16, Dropout, Label Smoothing, MixUp)
      [+] Benchmarks: 3 (CIFAR-10, CIFAR-100, STL-10)
      [+] Metrics: accuracy, F1

      Review questions:
      [x] Are all standard baselines for this domain included?
      [ ] Is there at least one ablation planned? — suggest: "noise magnitude sweep"
```
### Paper Co-Writer (Stage 17)

The Paper Co-Writer supports three workflows:
- AI-first (default): AI writes the draft, you edit and refine
- Human-first: You write key paragraphs, AI expands and polishes
- Interleaved: You alternate — you write the Method, AI writes Related Work
```text
AI  > Paper draft completed (5,200 words, 7 sections).
      Which sections would you like to review or co-edit?

You > The Introduction is too generic. Let me rewrite the first paragraph.

AI  > Opening your editor... (saves backup to hitl/snapshots/)
      [Editor opens with Introduction section]

You > [Saves edited file]

AI  > Your edits look great. I'll polish the transitions and ensure
      the notation is consistent with the Method section.
      Updated Introduction: 420 words → 380 words (tighter, more specific).
      Change summary: +3 added, -5 deleted, ~8 changed, 22 unchanged
```
## Detached Operation

Research runs can take hours. You don't need to sit and watch.
- Pipeline pauses → writes `hitl/waiting.json`
- Pipeline enters file-polling mode (checks every 2 seconds for `response.json`)
- You respond whenever you're ready via `attach`, `approve`, or the web dashboard (or by writing `response.json` directly; see the sketch below)
- Pipeline picks up your response and resumes
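For scripted workflows, you can answer a pause without the CLI by writing `response.json` yourself. A minimal sketch follows; the `action` and `message` fields and the contents of `waiting.json` are assumptions for illustration, so inspect `hitl/waiting.json` from a real paused run to confirm the actual schema.

```python
import json
from pathlib import Path

run_dir = Path("artifacts/rc-2026-xxx")
waiting = run_dir / "hitl" / "waiting.json"

# Only respond if the pipeline is actually paused.
if waiting.exists():
    pause = json.loads(waiting.read_text())
    print("Pipeline is paused:", pause)

    # Assumed response schema; the pipeline polls for this file every 2 seconds.
    response = {"action": "approve", "message": "Reviewed offline, looks good"}
    (run_dir / "hitl" / "response.json").write_text(json.dumps(response))
```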
A typical overnight flow:

```bash
# Start the run at 6 PM
researchclaw run --topic "..." --mode co-pilot &
# Pipeline runs Stages 1-7, pauses at Stage 8...
# You go home

# Next morning, check status
researchclaw status artifacts/rc-2026-xxx
# Output: "WAITING for input at Stage 8 — HYPOTHESIS_GEN (since 18:42)"

# Review and approve
researchclaw attach artifacts/rc-2026-xxx
# Interactive review → approve → pipeline resumes
```

By default, the pipeline waits 24 hours for a response. You can configure this:
```yaml
hitl:
  timeouts:
    default_human_timeout_sec: 86400   # 24h (default)
    auto_proceed_on_timeout: false     # true = auto-approve after timeout
```

## Safety & Guardrails

### Cost Budget

Set a spending limit to prevent runaway API costs:
```yaml
hitl:
  cost_budget_usd: 50.0   # Pipeline pauses at 50%, 80%, and 100% of budget
```

When a threshold is breached, the pipeline pauses with a cost summary:

```text
Cost budget alert: Cost: $42.50 / $50.00 [████████████████░░░░] 85%
```
### Claim Verifier

The Claim Verifier automatically checks AI-generated text against your collected literature:
- Citation claims: Are cited papers in your shortlist? Or fabricated?
- Numerical claims: Do reported numbers match actual experiment data?
- Factual claims: Are "it has been shown that..." statements grounded?
Unverified claims are flagged in the review summary, letting you decide what to keep.
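To make the citation check concrete, here is a purely illustrative sketch, not the shipped verifier: it flags citation keys missing from a literature shortlist. The shortlist file format and the `key` field are assumptions.

```python
import json
import re
from pathlib import Path

def flag_unknown_citations(draft_path: str, shortlist_path: str) -> list[str]:
    """Return citation keys in the draft that aren't in the literature shortlist."""
    draft = Path(draft_path).read_text()
    # Assumed shortlist format: a JSON list of papers, each with a "key" field.
    shortlist = {p["key"] for p in json.loads(Path(shortlist_path).read_text())}
    # Collect LaTeX-style \cite{a,b,...} keys from the draft.
    cited = {
        key.strip()
        for group in re.findall(r"\\cite\w*\{([^}]+)\}", draft)
        for key in group.split(",")
    }
    return sorted(cited - shortlist)
```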
### Artifact Manifests

Every stage output gets a SHA256 manifest (`manifest.json`) for reproducibility. If an artifact is modified outside the pipeline, verification will detect it.
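You can re-verify a stage yourself in a few lines of Python. This sketch assumes `manifest.json` is a flat mapping of relative file paths to hex digests; the real schema may carry extra metadata.

```python
import hashlib
import json
from pathlib import Path

def verify_manifest(stage_dir: str) -> list[str]:
    """Return files whose current SHA256 no longer matches the manifest."""
    root = Path(stage_dir)
    # Assumed layout: {"relative/path": "<sha256 hex>", ...}
    manifest = json.loads((root / "manifest.json").read_text())
    return [
        rel
        for rel, expected in manifest.items()
        if hashlib.sha256((root / rel).read_bytes()).hexdigest() != expected
    ]
```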
### Notification Escalation

For team/production use, configure tiered notification escalation:
```yaml
hitl:
  escalation:
    levels:
      - delay_sec: 0        # Immediate terminal notification
        channel: terminal
      - delay_sec: 1800     # After 30 min → Slack
        channel: slack
        message: "Pipeline needs attention"
      - delay_sec: 7200     # After 2h → email
        channel: email
      - delay_sec: 86400    # After 24h → auto-abort
        channel: terminal
        auto_action: abort
```

### Stage Hooks

Run custom scripts before/after any stage:
```bash
# Create a hook script
cat > artifacts/rc-xxx/hooks/post_stage_10.sh << 'EOF'
#!/bin/sh
echo "Running linter on generated code..."
cd "$RC_RUN_DIR/stage-10/experiment" && python -m py_compile main.py
EOF
chmod +x artifacts/rc-xxx/hooks/post_stage_10.sh
```

Hooks receive environment variables: `RC_STAGE_NUM`, `RC_STAGE_NAME`, `RC_RUN_DIR`, `RC_HOOK_NAME`.
## Intelligence Layer

### SmartPause

SmartPause goes beyond fixed gate stages. It dynamically decides whether to pause based on:
- Quality score (from PRM or heuristics): Low quality → pause for review
- Stage criticality: High-impact stages (hypotheses, experiment design) have lower thresholds
- Historical rejection rate: Stages you frequently reject get paused more often
- Confidence: When the AI is uncertain, it asks for help
You don't need to configure SmartPause — it works automatically in co-pilot mode.
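As a purely illustrative sketch of how these signals could combine (the thresholds and weighting below are invented for the example, not the shipped logic):

```python
def should_pause(quality: float, criticality: float,
                 rejection_rate: float, confidence: float) -> bool:
    """Toy SmartPause rule. All inputs are scores in [0, 1]."""
    # High-impact stages tolerate less quality risk.
    quality_threshold = 0.5 + 0.3 * criticality
    if quality < quality_threshold:
        return True
    # Stages the user has frequently rejected get paused more often.
    if rejection_rate > 0.3:
        return True
    # When the model itself is uncertain, ask for help.
    return confidence < 0.4
```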
### Preference Learning

Every time you approve, reject, or edit, the system learns:
- Stages you always approve → future runs auto-approve them
- Stages you frequently reject → future runs pause more aggressively
- Your edit patterns → inform SmartPause thresholds
After 5+ runs, the system adapts to your review style.
### Quality Forecast

At any pause point, the system estimates the final paper quality based on current artifacts:
- Literature coverage (number and diversity of papers)
- Hypothesis specificity and falsifiability
- Experiment design completeness (baselines, ablations, metrics)
- Result strength (improvement over baselines)
- Draft quality (length, structure, section coverage)
- Citation integrity
Risk factors are highlighted so you know where to focus your attention.
## Pipeline Branching

When you're unsure which research direction to pursue, branch the pipeline:
```text
# At Stage 8, you see 3 promising hypotheses
Action > b (branch)

# Fork to explore Hypothesis A
researchclaw branch create --run-dir artifacts/rc-xxx --name "quantum-noise" --stage 8

# Fork to explore Hypothesis B
researchclaw branch create --run-dir artifacts/rc-xxx --name "entanglement" --stage 8
```
Each branch gets its own copy of the pipeline state. Run them independently, then compare:
```bash
# Compare branches at Stage 14 (after experiments)
researchclaw branch compare --run-dir artifacts/rc-xxx --stage 14
```

```text
Branch Comparison — Stage 14: RESULT_ANALYSIS

main:
  artifacts: 3, quality: 0.72
  → Best accuracy: 78.3%

quantum-noise:
  artifacts: 3, quality: 0.85
  → Best accuracy: 82.1%

entanglement:
  artifacts: 2, quality: 0.61
  → Best accuracy: 74.5%
```
Merge the winner:
```bash
researchclaw branch merge --run-dir artifacts/rc-xxx --branch "quantum-noise" --from-stage 9
```

## Adapters (CLI / WebSocket / MCP)

The HITL system supports three interaction channels:
### CLI Adapter

Terminal-based interaction with ANSI colors, `$EDITOR` integration, and multi-line input. Works over SSH.
### WebSocket Adapter

Backs the web dashboard and provides real-time updates via WebSocket:

```text
Browser → WebSocket → ws_adapter.py → waiting.json / response.json → Pipeline
```

Message types: `get_status`, `approve`, `reject`, `edit`, `inject_guidance`, `chat_message`.
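A minimal client sketch using the `websockets` package is shown below. The port and the exact message fields are assumptions built around the documented message types; check `ws_adapter.py` for the actual endpoint and schema.

```python
import asyncio
import json

import websockets  # pip install websockets

async def approve_if_waiting(uri: str = "ws://localhost:8765") -> None:  # assumed port
    async with websockets.connect(uri) as ws:
        # Ask the adapter whether the pipeline is paused (assumed message shape).
        await ws.send(json.dumps({"type": "get_status"}))
        status = json.loads(await ws.recv())
        if status.get("waiting"):
            await ws.send(json.dumps({"type": "approve", "message": "LGTM"}))

asyncio.run(approve_if_waiting())
```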
### MCP Adapter

External AI agents (Claude, OpenClaw) can interact with the HITL system via MCP tool calls:

- `hitl_get_status` — Check if the pipeline is waiting
- `hitl_approve_stage` — Approve the current gate
- `hitl_reject_stage` — Reject with a reason
- `hitl_inject_guidance` — Provide direction
- `hitl_view_output` — Read stage artifacts
This enables agent-in-the-loop workflows where another AI system reviews and approves the pipeline's work.
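On the wire, such a review reduces to standard MCP `tools/call` requests. Sketched here as a Python dict; the tool name comes from the list above, but the argument names are assumptions.

```python
# Hypothetical JSON-RPC payload an MCP client would send to approve a gate.
approve_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "hitl_approve_stage",
        "arguments": {"message": "Design reviewed, baselines look complete"},
    },
}
```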
## Configuration Reference

```yaml
hitl:
  enabled: true                        # Master switch (default: false)
  mode: co-pilot                       # Intervention mode (see table above)
  cost_budget_usd: 0.0                 # Cost limit in USD (0 = unlimited)

  notifications:
    on_pause: true                     # Notify on pipeline pause
    on_quality_drop: true              # Notify on quality issues
    on_error: true                     # Notify on stage errors
    channels: ["terminal"]             # terminal | slack | email | webhook

  collaboration:
    llm_model: ""                      # Model for chat (default: primary model)
    max_chat_turns: 50                 # Max turns per collaboration session
    save_chat_history: true            # Persist chat logs to hitl/

  timeouts:
    default_human_timeout_sec: 86400   # Wait time for human input (24h)
    auto_proceed_on_timeout: false     # Auto-approve on timeout

  # Per-stage policies (for 'custom' mode)
  stage_policies:
    8:
      require_approval: true           # Must approve before continuing
      enable_collaboration: true       # Enable chat mode
      pause_before: false              # Pause before execution
      pause_after: true                # Pause after execution
      allow_edit_output: true          # Allow editing output files
      allow_inject_prompt: true        # Allow guidance injection
      stream_output: false             # Stream LLM output in real-time
      min_quality_score: 0.0           # Pause if quality below threshold
      max_auto_retries: 2              # Auto-retry count before pausing
      human_timeout_sec: 86400         # Per-stage timeout override
      auto_proceed_on_timeout: false   # Per-stage auto-proceed override
```

### Environment Variables

| Variable | Purpose |
|---|---|
| `EDITOR` | Editor for file editing (default: `nano` on Unix, `notepad` on Windows) |
| `RESEARCHCLAW_SLACK_WEBHOOK` | Slack webhook URL for notifications |
| `RESEARCHCLAW_WEBHOOK_URL` | Generic webhook URL for notifications |
## FAQ

**Do I have to watch the pipeline the whole time?**

Only at the stages where you choose to intervene. In co-pilot mode, ~15 of 23 stages run automatically. Typical human time is 30-60 minutes per run, compared to 2-4 hours of autonomous execution.
**Can I change intervention modes mid-run?**

Not currently, but you can resume a paused run with a different mode:

```bash
researchclaw run --resume --output artifacts/rc-xxx --mode step-by-step
```

**What if I don't understand a stage's output?**

Press `v` to view the full output, then `c` to chat with the AI about it. The AI can explain what it did and why, and suggest what to focus on.
**Can another AI agent act as the reviewer?**

Yes. The MCP adapter exposes HITL tools that any MCP-compatible agent can call. OpenClaw can automatically review and approve gates.
**Where does the HITL system store its data?**

Everything goes in `{run_dir}/hitl/`:

- `session.json` — Session state
- `interventions.jsonl` — All interventions (append-only log)
- `chat_stage_NN.jsonl` — Chat histories
- `snapshots/` — File backups before edits
- `guidance/` — Injected guidance per stage
- `notifications.jsonl` — Notification log
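Because `interventions.jsonl` is append-only JSON Lines, auditing a run takes only a few lines of Python. The field names below are assumptions about the record schema.

```python
import json
from pathlib import Path

log = Path("artifacts/rc-xxx/hitl/interventions.jsonl")
for line in log.read_text().splitlines():
    event = json.loads(line)
    # Assumed fields; adjust to the actual record schema.
    print(event.get("timestamp"), event.get("stage"), event.get("action"))
```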
**Is v0.4.0 backward compatible?**

Yes. Without `hitl.enabled: true` or `--mode`, the pipeline behaves identically to v0.3.x. The `--auto-approve` flag still works and takes precedence over HITL settings.