
Human-in-the-Loop Co-Pilot Guide

AutoResearchClaw v0.4.0 turns the pipeline from a purely autonomous system into a human-AI collaborative research engine. This guide covers the intervention modes, workflows, CLI commands, and configuration you need to run it as a co-pilot.


Table of Contents

  1. Why Co-Pilot?
  2. Quick Start
  3. Intervention Modes
  4. The Co-Pilot Workflow
  5. CLI Commands
  6. Stage-by-Stage Intervention Guide
  7. Workshops
  8. Detached Operation
  9. Safety & Guardrails
  10. Intelligence Layer
  11. Pipeline Branching
  12. Adapters (CLI / WebSocket / MCP)
  13. Configuration Reference
  14. FAQ

1. Why Co-Pilot?

Fully autonomous research pipelines produce papers fast, but testing reveals consistent quality gaps:

| Problem | Root Cause |
| --- | --- |
| Weak research ideas | AI lacks taste for what's truly novel and impactful |
| Missing baselines | AI doesn't know which comparisons reviewers expect |
| Fragile experiment code | No human sanity check before execution |
| Thin analysis | AI draws superficial conclusions from results |
| Generic paper writing | AI produces correct-but-bland academic prose |

The HITL Co-Pilot system solves this by letting you intervene exactly where your expertise matters most, while the AI handles the heavy lifting everywhere else.

The result: papers that combine AI speed with human judgment.


2. Quick Start

Option A: Co-Pilot Mode (Recommended)

researchclaw run --topic "Your research idea" --mode co-pilot

The pipeline will run automatically and pause at key decision points for your input. At each pause, you'll see an interactive prompt with available actions.

Option B: Express Mode (Minimal Interruption)

researchclaw run --topic "Your research idea" --mode express

Only pauses at 3 critical gates: hypothesis approval (Stage 8), experiment design (Stage 9), and final quality check (Stage 20).

Option C: Full Auto (Original Behavior)

researchclaw run --topic "Your research idea" --auto-approve

No human intervention. Identical to pre-v0.4.0 behavior.
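
The same choice can live in your config file instead of on the command line. A minimal sketch using the hitl keys from the Configuration Reference (section 13):

hitl:
  enabled: true        # master switch for HITL
  mode: co-pilot       # same values as --mode (see section 3)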


3. Intervention Modes

| Mode | Flag | Pauses At | Best For |
| --- | --- | --- | --- |
| Full Auto | --auto-approve | Never | Quick exploration, low-stakes experiments |
| Gate Only | --mode gate-only | 3 gate stages (5, 9, 20) | Light oversight |
| Checkpoint | --mode checkpoint | End of each phase (8 points) | Phase-level review |
| Co-Pilot | --mode co-pilot | Critical stages + SmartPause triggers | Recommended for production |
| Step-by-Step | --mode step-by-step | After every stage (23 pauses) | Learning the pipeline |
| Express | --mode express | 3 most critical gates only | Experienced users |
| Custom | --mode custom | User-defined per-stage policies | Advanced configuration |

How to Choose

  • First time using the pipeline? Start with step-by-step to understand each stage.
  • Publishing a real paper? Use co-pilot for the best quality.
  • Running overnight? Use gate-only or express — fewer interruptions.
  • Batch processing many topics? Use full-auto.

4. The Co-Pilot Workflow

When the pipeline pauses, you'll see an interactive panel:

──────────────────────────────────────────────────────────
  HITL | Stage 08: HYPOTHESIS_GEN
  Post-stage review
──────────────────────────────────────────────────────────

  Stage 8 (HYPOTHESIS_GEN) — done

  Hypotheses generated. This is a CRITICAL decision point —
  review each hypothesis for novelty, feasibility, and significance.

  Outputs:
    hypotheses.md (1,247 bytes)
      → ## Hypothesis 1: Quantum gate noise as structured regularization
    novelty_report.json (892 bytes)

  Novelty score: 0.72 (moderate)

  Available actions:
    [a] Approve and continue
    [r] Reject and rollback
    [e] Edit stage output
    [c] Start collaborative chat
    [i] Inject guidance / direction
    [s] Skip this stage
    [q] Abort pipeline
    [v] View full stage output

Action >

Available Actions at Every Pause

| Key | Action | What Happens |
| --- | --- | --- |
| a | Approve | Accept the output and continue to the next stage |
| r | Reject | Reject the output; pipeline rolls back to an earlier stage |
| e | Edit | Opens the output file in your $EDITOR (vim, nano, VS Code, etc.) |
| c | Collaborate | Start a multi-turn chat with the AI to refine the output together |
| i | Inject Guidance | Provide direction that will be incorporated into subsequent stages |
| s | Skip | Skip this stage entirely (use with caution) |
| b | Rollback | Jump back to a specific earlier stage |
| q | Abort | Stop the pipeline entirely |
| v | View | Display the full contents of output files |

5. CLI Commands

Starting a Run

# Co-Pilot mode
researchclaw run --topic "Quantum noise as neural network regularization" --mode co-pilot

# With explicit config
researchclaw run --config config.arc.yaml --topic "..." --mode co-pilot

# Resume a previous run in co-pilot mode
researchclaw run --config config.arc.yaml --resume --mode co-pilot

Detached Interaction

These commands let you interact with a paused pipeline from a separate terminal:

# Check status
researchclaw status artifacts/rc-2026-0328-abc123

# Attach interactively (full TUI)
researchclaw attach artifacts/rc-2026-0328-abc123

# Quick approve (non-interactive)
researchclaw approve artifacts/rc-2026-0328-abc123 --message "Looks good"

# Quick reject
researchclaw reject artifacts/rc-2026-0328-abc123 --reason "Missing ResNet baseline"

# Inject guidance for a specific stage
researchclaw guide artifacts/rc-2026-0328-abc123 --stage 9 --message "Add Dropout as baseline"

6. Stage-by-Stage Intervention Guide

Where Your Input Matters Most

| Stage | Name | Co-Pilot Behavior | Your Role |
| --- | --- | --- | --- |
| 1-2 | Scoping | Pause after | Confirm research direction and scope |
| 3 | Search Strategy | Pause after | Add missing search terms or sources |
| 5 | Literature Screen | Approval required | Verify important papers aren't filtered out |
| 7 | Synthesis | Pause after | Check if the identified gaps match your understanding |
| 8 | Hypothesis Gen | Collaboration | Review, discuss, and refine the core research idea |
| 9 | Experiment Design | Collaboration + Approval | Verify baselines, benchmarks, metrics, ablations |
| 10 | Code Generation | Pause after | Spot-check code quality |
| 12 | Experiment Run | Stream output | Monitor training metrics in real-time |
| 13 | Iterative Refine | Pause after | Decide if refinement should continue |
| 15 | Research Decision | Approval required | Choose PROCEED, PIVOT, or REFINE |
| 16 | Paper Outline | Pause after | Adjust section structure |
| 17 | Paper Draft | Collaboration | Co-write key sections |
| 18 | Peer Review | Pause after | Prioritize which review comments to address |
| 20 | Quality Gate | Approval required | Final publication decision |
| 23 | Citation Verify | Pause after | Review flagged citations |

Guidance Injection

You can inject guidance for any stage at any time, even before it runs:

researchclaw guide artifacts/rc-xxx --stage 17 --message "Emphasize the theoretical contribution in the Introduction. Keep the Method section concise."

Guidance is automatically included in the LLM's context when that stage executes.


7. Workshops

Workshops are specialized collaboration tools for the most critical stages.

Idea Workshop (Stage 7-8)

When you enter collaboration mode at Stage 8, the Idea Workshop activates:

You > c  (start collaboration)

AI > I generated 3 candidate hypotheses. Here's my evaluation:
     1. "Quantum gate noise as structured regularization" — Novelty: 8/10, Feasibility: 6/10
     2. "Entanglement-based feature selection" — Novelty: 7/10, Feasibility: 4/10
     3. "Quantum sampling for data augmentation" — Novelty: 5/10, Feasibility: 8/10

     Which direction interests you most?

You > Hypothesis 1 is the most promising, but we need to be more specific.
     The noise should be modeled as a stochastic regularizer, and we need
     to compare against Dropout, Label Smoothing, and MixUp.

AI > Updated hypothesis:
     "We theoretically characterize quantum gate noise as a structured
     stochastic regularizer and empirically demonstrate it outperforms
     Dropout, Label Smoothing, and MixUp on small-sample classification."

     Baselines updated: + Dropout, + Label Smoothing, + MixUp, + CutMix

     Shall I proceed with this direction?

You > Yes, approve.

The workshop automatically:

  • Saves your refined hypothesis to hypotheses.md
  • Records the conversation in hitl/chat_stage_08.jsonl
  • Persists candidate evaluations in hitl/idea_workshop.json

Baseline Navigator (Stage 9)

The Baseline Navigator helps you build a rigorous experiment design:

AI > Here are the suggested baselines:
     [AI] ResNet-50 (standard image classification baseline)
     [AI] ViT-B/16 (transformer baseline)
     [AI] Dropout (regularization baseline)

     Experiment Design Checklist:
     [+] Baselines: 3
     [ ] Benchmarks: CIFAR-10, CIFAR-100
     [ ] Metrics: accuracy, F1

You > Add Label Smoothing and MixUp as baselines. Also add STL-10.

AI > Updated. Current checklist:
     [+] Baselines: 5 (ResNet-50, ViT-B/16, Dropout, Label Smoothing, MixUp)
     [+] Benchmarks: 3 (CIFAR-10, CIFAR-100, STL-10)
     [+] Metrics: accuracy, F1

     Review questions:
     [ ] Are all standard baselines for this domain included? ✓
     [ ] Is there at least one ablation planned? — suggest: "noise magnitude sweep"

Paper Co-Writer (Stage 16-17)

The Paper Co-Writer supports three workflows:

  1. AI-first (default): AI writes the draft, you edit and refine
  2. Human-first: You write key paragraphs, AI expands and polishes
  3. Interleaved: You alternate — you write the Method, AI writes Related Work

AI > Paper draft completed (5,200 words, 7 sections).
     Which sections would you like to review or co-edit?

You > The Introduction is too generic. Let me rewrite the first paragraph.

AI > Opening your editor... (saves backup to hitl/snapshots/)

     [Editor opens with Introduction section]

You > [Saves edited file]

AI > Your edits look great. I'll polish the transitions and ensure
     the notation is consistent with the Method section.

     Updated Introduction: 420 words → 380 words (tighter, more specific).
     Change summary: +3 added, -5 deleted, ~8 changed, 22 unchanged

8. Detached Operation

Research runs can take hours. You don't need to sit and watch.

How It Works

  1. Pipeline pauses → writes hitl/waiting.json
  2. Pipeline enters file-polling mode (checks every 2 seconds for response.json)
  3. You respond whenever you're ready via attach, approve, or web dashboard
  4. Pipeline picks up your response and resumes

Scenario: Overnight Run

# Start the run at 6 PM
researchclaw run --topic "..." --mode co-pilot &

# Pipeline runs Stages 1-7, pauses at Stage 8...
# You go home

# Next morning, check status
researchclaw status artifacts/rc-2026-xxx
# Output: "WAITING for input at Stage 8 — HYPOTHESIS_GEN (since 18:42)"

# Review and approve
researchclaw attach artifacts/rc-2026-xxx
# Interactive review → approve → pipeline resumes
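
If you would rather be notified than check manually, a small watcher can wrap the status command. This is a sketch that assumes the status output contains the word WAITING when the pipeline is blocked, as in the example above; swap the printf for whatever notification mechanism you prefer:

#!/bin/sh
# Poll a run every 5 minutes; ring the terminal bell when it needs input.
RUN_DIR=artifacts/rc-2026-xxx
while true; do
  if researchclaw status "$RUN_DIR" | grep -q "WAITING"; then
    printf '\a%s is waiting for your input\n' "$RUN_DIR"
    break
  fi
  sleep 300
done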

Timeout Behavior

By default, the pipeline waits 24 hours for a response. You can configure this:

hitl:
  timeouts:
    default_human_timeout_sec: 86400   # 24h (default)
    auto_proceed_on_timeout: false     # true = auto-approve after timeout

9. Safety & Guardrails

Cost Budget

Set a spending limit to prevent runaway API costs:

hitl:
  cost_budget_usd: 50.0   # Pipeline pauses at 50%, 80%, and 100% of budget

When a threshold is breached, the pipeline pauses with a cost summary:

Cost budget alert: Cost: $42.50 / $50.00 [████████████████░░░░] 85%

Claim Verification

The Claim Verifier automatically checks AI-generated text against your collected literature:

  • Citation claims: Are cited papers in your shortlist? Or fabricated?
  • Numerical claims: Do reported numbers match actual experiment data?
  • Factual claims: Are "it has been shown that..." statements grounded?

Unverified claims are flagged in the review summary, letting you decide what to keep.

SHA256 Artifact Checksums

Every stage output gets a SHA256 manifest (manifest.json) for reproducibility. If an artifact is modified outside the pipeline, verification will detect it.
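
If you want to spot-check a manifest yourself, outside the pipeline's built-in verification, a short script can recompute the digests. This sketch assumes manifest.json maps relative file paths to hex SHA256 digests; the actual schema may differ, so treat it as illustrative:

import hashlib, json, pathlib, sys

# Usage: python verify_manifest.py artifacts/rc-xxx/stage-10
stage_dir = pathlib.Path(sys.argv[1])
manifest = json.loads((stage_dir / "manifest.json").read_text())

for rel_path, expected in manifest.items():   # assumed layout: {"relative/path": "sha256hex", ...}
    actual = hashlib.sha256((stage_dir / rel_path).read_bytes()).hexdigest()
    print(("OK      " if actual == expected else "MODIFIED"), rel_path)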

Escalation Policy

For team/production use, configure tiered notification escalation:

hitl:
  escalation:
    levels:
      - delay_sec: 0       # Immediate terminal notification
        channel: terminal
      - delay_sec: 1800    # After 30 min → Slack
        channel: slack
        message: "Pipeline needs attention"
      - delay_sec: 7200    # After 2h → email
        channel: email
      - delay_sec: 86400   # After 24h → auto-abort
        channel: terminal
        auto_action: abort

Extensible Hooks

Run custom scripts before/after any stage:

# Create a hook script
cat > artifacts/rc-xxx/hooks/post_stage_10.sh << 'EOF'
#!/bin/sh
echo "Running linter on generated code..."
cd "$RC_RUN_DIR/stage-10/experiment" && python -m py_compile main.py
EOF
chmod +x artifacts/rc-xxx/hooks/post_stage_10.sh

Hooks receive environment variables: RC_STAGE_NUM, RC_STAGE_NAME, RC_RUN_DIR, RC_HOOK_NAME.
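
Because hooks receive these variables, a hook script doesn't need to hard-code its stage. For example, a minimal post-stage hook that appends to a run-local log (the hook.log path is just an example):

#!/bin/sh
# Record which stage finished and when.
echo "$(date -u +%FT%TZ) stage $RC_STAGE_NUM ($RC_STAGE_NAME) done via $RC_HOOK_NAME" \
  >> "$RC_RUN_DIR/hooks/hook.log"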


10. Intelligence Layer

SmartPause

SmartPause goes beyond fixed gate stages. It dynamically decides whether to pause based on:

  • Quality score (from PRM or heuristics): Low quality → pause for review
  • Stage criticality: High-impact stages (hypotheses, experiment design) have lower thresholds
  • Historical rejection rate: Stages you frequently reject get paused more often
  • Confidence: When the AI is uncertain, it asks for help

You don't need to configure SmartPause — it works automatically in co-pilot mode.
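
To make the decision concrete, here is an illustrative pause heuristic combining the four signals above. It is not the actual implementation; the thresholds and weights are invented for the example:

def should_pause(quality: float, criticality: float,
                 rejection_rate: float, confidence: float) -> bool:
    """Illustrative SmartPause-style decision; all inputs in [0, 1]."""
    # High-impact stages get a stricter quality bar.
    quality_bar = 0.5 + 0.3 * criticality
    if quality < quality_bar:
        return True
    # Stages you historically reject are paused pre-emptively.
    if rejection_rate > 0.3:
        return True
    # When the model reports low confidence, ask the human.
    return confidence < 0.4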

Intervention Learning (ALHF)

Every time you approve, reject, or edit, the system learns:

  • Stages you always approve → future runs auto-approve them
  • Stages you frequently reject → future runs pause more aggressively
  • Your edit patterns → inform SmartPause thresholds

After 5+ runs, the system adapts to your review style.

Quality Predictor

At any pause point, the system estimates the final paper quality based on current artifacts:

  • Literature coverage (number and diversity of papers)
  • Hypothesis specificity and falsifiability
  • Experiment design completeness (baselines, ablations, metrics)
  • Result strength (improvement over baselines)
  • Draft quality (length, structure, section coverage)
  • Citation integrity

Risk factors are highlighted so you know where to focus your attention.


11. Pipeline Branching

When you're unsure which research direction to pursue, branch the pipeline:

# At Stage 8, you see 3 promising hypotheses
Action > b  (branch)

# Fork to explore Hypothesis A
researchclaw branch create --run-dir artifacts/rc-xxx --name "quantum-noise" --stage 8

# Fork to explore Hypothesis B
researchclaw branch create --run-dir artifacts/rc-xxx --name "entanglement" --stage 8

Each branch gets its own copy of the pipeline state. Run them independently, then compare:

# Compare branches at Stage 14 (after experiments)
researchclaw branch compare --run-dir artifacts/rc-xxx --stage 14
Branch Comparison — Stage 14: RESULT_ANALYSIS

  main:
    artifacts: 3, quality: 0.72
    → Best accuracy: 78.3%

  quantum-noise:
    artifacts: 3, quality: 0.85
    → Best accuracy: 82.1%

  entanglement:
    artifacts: 2, quality: 0.61
    → Best accuracy: 74.5%

Merge the winner:

researchclaw branch merge --run-dir artifacts/rc-xxx --branch "quantum-noise" --from-stage 9

12. Adapters

The HITL system supports three interaction channels:

CLI Adapter (Default)

Terminal-based interaction with ANSI colors, $EDITOR integration, and multi-line input. Works over SSH.

WebSocket Adapter

For the web dashboard. Provides real-time updates via WebSocket:

Browser → WebSocket → ws_adapter.py → waiting.json / response.json → Pipeline

Message types: get_status, approve, reject, edit, inject_guidance, chat_message.
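
A minimal exchange might look like the following. The type values are the ones listed above; the other field names are illustrative, not a documented schema:

# client → ws_adapter.py
{"type": "approve", "stage": 8, "message": "Looks good"}

# ws_adapter.py → client (shape assumed)
{"type": "status", "waiting": false, "current_stage": 9}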

MCP Adapter

External AI agents (Claude, OpenClaw) can interact with the HITL system via MCP tool calls:

  • hitl_get_status — Check if the pipeline is waiting
  • hitl_approve_stage — Approve the current gate
  • hitl_reject_stage — Reject with reason
  • hitl_inject_guidance — Provide direction
  • hitl_view_output — Read stage artifacts

This enables agent-in-the-loop workflows where another AI system reviews and approves the pipeline's work.
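
As a sketch, a reviewer agent built on any MCP client could loop over these tools. Here call_tool stands in for whatever tool-invocation helper your client provides, looks_reasonable is your own review logic, and the argument names are illustrative:

def review_gate(call_tool, looks_reasonable):
    """Agent-in-the-loop sketch using the HITL MCP tools listed above."""
    status = call_tool("hitl_get_status", {})
    if not status.get("waiting"):
        return                                   # nothing to review right now
    output = call_tool("hitl_view_output", {"stage": status["stage"]})
    if looks_reasonable(output):
        call_tool("hitl_approve_stage", {"message": "Automated review passed"})
    else:
        call_tool("hitl_reject_stage", {"reason": "Automated review flagged issues"})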


13. Configuration Reference

hitl:
  enabled: true                        # Master switch (default: false)
  mode: co-pilot                       # Intervention mode (see table above)
  cost_budget_usd: 0.0                 # Cost limit in USD (0 = unlimited)

  notifications:
    on_pause: true                     # Notify on pipeline pause
    on_quality_drop: true              # Notify on quality issues
    on_error: true                     # Notify on stage errors
    channels: ["terminal"]             # terminal | slack | email | webhook

  collaboration:
    llm_model: ""                      # Model for chat (default: primary model)
    max_chat_turns: 50                 # Max turns per collaboration session
    save_chat_history: true            # Persist chat logs to hitl/

  timeouts:
    default_human_timeout_sec: 86400   # Wait time for human input (24h)
    auto_proceed_on_timeout: false     # Auto-approve on timeout

  # Per-stage policies (for 'custom' mode)
  stage_policies:
    8:
      require_approval: true           # Must approve before continuing
      enable_collaboration: true       # Enable chat mode
      pause_before: false              # Pause before execution
      pause_after: true                # Pause after execution
      allow_edit_output: true          # Allow editing output files
      allow_inject_prompt: true        # Allow guidance injection
      stream_output: false             # Stream LLM output in real-time
      min_quality_score: 0.0           # Pause if quality below threshold
      max_auto_retries: 2              # Auto-retry count before pausing
      human_timeout_sec: 86400         # Per-stage timeout override
      auto_proceed_on_timeout: false   # Per-stage auto-proceed override

Environment Variables

| Variable | Purpose |
| --- | --- |
| EDITOR | Editor for file editing (default: nano on Unix, notepad on Windows) |
| RESEARCHCLAW_SLACK_WEBHOOK | Slack webhook URL for notifications |
| RESEARCHCLAW_WEBHOOK_URL | Generic webhook URL for notifications |
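
For example, in your shell profile (the webhook URL is a placeholder):

export EDITOR=vim
export RESEARCHCLAW_SLACK_WEBHOOK="https://hooks.slack.com/services/T000/B000/XXXX"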

14. FAQ

Does HITL slow down the pipeline?

Only at the stages where you choose to intervene. In co-pilot mode, roughly 15 of the 23 stages run fully automatically. Typical human time is 30-60 minutes per run, versus the 2-4 hours the pipeline spends executing on its own.

Can I switch modes mid-run?

Not currently, but you can resume a paused run with a different mode:

researchclaw run --resume --output artifacts/rc-xxx --mode step-by-step

What if I'm not sure what to do at a pause?

Press v to view the full output, then c to chat with the AI about it. The AI can explain what it did and why, and suggest what to focus on.

Does HITL work with ACP/OpenClaw?

Yes. The MCP adapter exposes HITL tools that any ACP-compatible agent can call. OpenClaw can automatically review and approve gates.

What data does HITL store?

Everything goes in {run_dir}/hitl/:

  • session.json — Session state
  • interventions.jsonl — All interventions (append log)
  • chat_stage_NN.jsonl — Chat histories
  • snapshots/ — File backups before edits
  • guidance/ — Injected guidance per stage
  • notifications.jsonl — Notification log

Is it backward compatible?

Yes. Without hitl.enabled: true or --mode, the pipeline behaves identically to v0.3.x. The --auto-approve flag still works and takes precedence over HITL settings.