
feat: CLI analysis tool for experiment results — structured feedback for the autonomous agent #476

@ravyg

Description


Problem

I've been running autoresearch autonomously on my local setup for extended sessions (50-100+ experiments overnight). The agent logs everything to results.tsv as instructed by program.md, but has no programmatic way to analyze those results.

Currently the agent has to:

  1. Manually cat results.tsv and reason about raw tab-separated data
  2. Try to mentally track which experiments improved things and by how much
  3. Guess whether it's plateauing or still making progress

This becomes a real bottleneck during long autonomous runs. The agent wastes experiments retrying approaches that are clearly stuck in a local minimum, because it can't easily see the big picture of what has been tried and what worked.

Proposed Solution

A lightweight analysis.py CLI script that mirrors the existing analysis.ipynb but is callable by the agent (or by the human checking in on a run):

uv run analysis.py                          # text report to stdout
uv run analysis.py --json                   # machine-readable JSON for the agent
uv run analysis.py --plot progress.png      # save progress chart
uv run analysis.py --tsv path/to/results.tsv  # custom TSV path
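The flag surface above could be wired up with stdlib argparse; this is a sketch of the parser only, with defaults chosen for illustration:

```python
import argparse


def build_parser():
    # CLI surface mirroring the invocations above; --tsv defaults to the
    # file program.md tells the agent to write.
    p = argparse.ArgumentParser(description="Analyze autoresearch results.tsv")
    p.add_argument("--json", action="store_true",
                   help="emit machine-readable JSON instead of a text report")
    p.add_argument("--plot", metavar="PATH",
                   help="save a progress chart to PATH")
    p.add_argument("--tsv", default="results.tsv",
                   help="path to the results TSV (default: results.tsv)")
    return p
```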

The --json mode is the key addition — the agent can call it and get structured data:

{
  "total_experiments": 52,
  "kept": 11,
  "discarded": 38,
  "crashed": 3,
  "keep_rate": 0.2245,
  "baseline_bpb": 0.9979,
  "best_bpb": 0.9612,
  "improvement": 0.0367,
  "improvement_pct": 3.68,
  "best_experiment": "increase batch size to 2**20",
  "top_hits": [...],
  "trajectory": "plateauing"
}
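A few of those fields could be computed with pandas along these lines. The column names (status, bpb) and status values here are assumptions for illustration; the real script would have to match the results.tsv schema that program.md specifies:

```python
import io

import pandas as pd


def summarize(tsv_text):
    """Compute summary fields like the JSON above from raw TSV text.

    Assumes columns named 'status' and 'bpb' — placeholders that must be
    replaced with the actual schema from program.md.
    """
    df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
    kept = int((df["status"] == "kept").sum())
    total = len(df)
    return {
        "total_experiments": total,
        "kept": kept,
        "keep_rate": round(kept / total, 4) if total else 0.0,
        "best_bpb": float(df["bpb"].min()),
    }
```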

The trajectory field (improving / plateauing / stuck) is especially useful — program.md could instruct the agent to check this periodically and switch strategies when progress stalls.
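One possible heuristic for deriving trajectory from the running-best bpb after each experiment — the window size, threshold, and the "assume improving on little data" fallback are all illustrative choices, not part of the proposal:

```python
def classify_trajectory(best_bpb_history, window=10, min_delta=1e-3):
    """Classify progress from a per-experiment running-best bpb series.

    Returns "improving", "plateauing", or "stuck". Window and threshold
    are illustrative defaults.
    """
    if len(best_bpb_history) < window + 1:
        # Too little data to judge a trend; assume progress (a choice,
        # not something the proposal specifies).
        return "improving"
    # Gain over the last `window` experiments (bpb decreases as it improves).
    recent_gain = best_bpb_history[-window - 1] - best_bpb_history[-1]
    if recent_gain > min_delta:
        return "improving"
    # No recent gain: distinguish a plateau after earlier wins from
    # a run that never improved at all.
    total_gain = best_bpb_history[0] - best_bpb_history[-1]
    return "plateauing" if total_gain > min_delta else "stuck"
```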

Why this matters for autonomous research

The whole point of autoresearch is that the agent runs independently for hours. Right now it's flying blind between experiments — it can see the last result but not the trend. This is like a researcher who records lab notes but never reads them back. Giving the agent a structured summary between experiments directly improves experiment selection quality.

Design constraints

  • Uses only existing dependencies (pandas, numpy, matplotlib — already in pyproject.toml)
  • Single file, no changes to prepare.py or train.py
  • Reads results.tsv in the exact format program.md specifies
  • Keeps the repo minimal — just one new file
