Problem
I've been running autoresearch autonomously on my local setup for extended sessions (50-100+ experiments overnight). The agent logs everything to `results.tsv` as instructed by `program.md`, but has no programmatic way to analyze those results.
Currently the agent has to:
- Manually `cat results.tsv` and reason about raw tab-separated data
- Try to mentally track which experiments improved things and by how much
- Guess whether it's plateauing or still making progress
This becomes a real bottleneck during long autonomous runs. The agent wastes experiments re-trying approaches that are clearly in a local minimum, because it can't easily see the big picture of what's been tried and what worked.
Proposed Solution
A lightweight `analysis.py` CLI script that mirrors the existing `analysis.ipynb` but is callable by the agent (or by the human checking in on a run):
```
uv run analysis.py                            # text report to stdout
uv run analysis.py --json                     # machine-readable JSON for the agent
uv run analysis.py --plot progress.png        # save progress chart
uv run analysis.py --tsv path/to/results.tsv  # custom TSV path
```
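A minimal `argparse` skeleton for this CLI surface might look like the following. This is only a sketch: the flag names come from the invocations above, but everything else (defaults, help text) is assumed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI surface matching the proposed invocations (sketch, not final)."""
    parser = argparse.ArgumentParser(
        description="Summarize autoresearch results.tsv"
    )
    parser.add_argument("--json", action="store_true",
                        help="emit machine-readable JSON instead of a text report")
    parser.add_argument("--plot", metavar="PNG",
                        help="save a progress chart to this path")
    parser.add_argument("--tsv", default="results.tsv",
                        help="path to the results TSV (default: results.tsv)")
    return parser


if __name__ == "__main__":
    # Demo parse of the proposed flags (no side effects)
    print(build_parser().parse_args(["--json", "--tsv", "runs/results.tsv"]))
```

Defaulting `--tsv` to `results.tsv` means the agent's common case needs no arguments at all.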
The `--json` mode is the key addition — the agent can call it and get structured data:
```json
{
  "total_experiments": 52,
  "kept": 11,
  "discarded": 38,
  "crashed": 3,
  "keep_rate": 0.2245,
  "baseline_bpb": 0.9979,
  "best_bpb": 0.9612,
  "improvement": 0.0367,
  "improvement_pct": 3.68,
  "best_experiment": "increase batch size to 2**20",
  "top_hits": [...],
  "trajectory": "plateauing"
}
```
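One way the script might derive these fields with pandas — a sketch under assumptions: the `status`/`bpb`/`description` column names and values are invented here, since the real schema lives in program.md. Note the keep rate excludes crashed runs (11 / 49 ≈ 0.2245 in the example above).

```python
import io
import pandas as pd

# Hypothetical TSV content -- the real column names come from program.md's spec.
SAMPLE = (
    "status\tbpb\tdescription\n"
    "baseline\t0.9979\tbaseline run\n"
    "kept\t0.9612\tincrease batch size\n"
    "discarded\t0.9990\ttweak lr schedule\n"
    "crashed\t\tbad config\n"
)


def summarize(df: pd.DataFrame) -> dict:
    """Build the summary dict sketched above from a results dataframe."""
    runs = df[df["status"] != "baseline"]  # the baseline row is not an experiment
    counts = runs["status"].value_counts()
    kept = int(counts.get("kept", 0))
    discarded = int(counts.get("discarded", 0))
    crashed = int(counts.get("crashed", 0))
    baseline = float(df.loc[df["status"] == "baseline", "bpb"].iloc[0])
    best = float(df["bpb"].min())  # min() skips the NaN bpb of crashed runs
    return {
        "total_experiments": len(runs),
        "kept": kept,
        "discarded": discarded,
        "crashed": crashed,
        # crashed runs have no score, so they are excluded from the keep rate
        "keep_rate": round(kept / max(kept + discarded, 1), 4),
        "baseline_bpb": baseline,
        "best_bpb": best,
        "improvement": round(baseline - best, 4),
        "improvement_pct": round(100 * (baseline - best) / baseline, 2),
    }


df = pd.read_csv(io.StringIO(SAMPLE), sep="\t")
print(summarize(df))
```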
The `trajectory` field (improving / plateauing / stuck) is especially useful — `program.md` could instruct the agent to check this periodically and switch strategies when progress stalls.
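One possible heuristic for classifying the trajectory — not specified in this proposal, so the window and thresholds below are illustrative — is to compare the running-best bpb now against the running best a fixed number of experiments ago:

```python
def classify_trajectory(running_best, window=10, eps=1e-4):
    """Classify progress from the running-best bpb recorded after each
    experiment. Hypothetical heuristic: measure the gain over the last
    `window` experiments. `eps` is the smallest gain worth calling progress."""
    if len(running_best) <= window:
        return "improving"  # too early to call a plateau
    recent_gain = running_best[-window - 1] - running_best[-1]
    if recent_gain > 10 * eps:
        return "improving"
    if recent_gain > 0:
        return "plateauing"
    return "stuck"
```

Because the input is a running best (non-increasing), `recent_gain` is never negative; zero gain over a full window means no experiment has beaten the incumbent and the agent should likely change tack.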
Why this matters for autonomous research
The whole point of autoresearch is that the agent runs independently for hours. Right now it's flying blind between experiments — it can see the last result but not the trend. This is like a researcher who records lab notes but never reads them back. Giving the agent a structured summary between experiments directly improves experiment selection quality.
Design constraints
- Uses only existing dependencies (pandas, numpy, matplotlib — already in `pyproject.toml`)
- Single file, no changes to `prepare.py` or `train.py`
- Reads `results.tsv` in the exact format `program.md` specifies
- Keeps the repo minimal — just one new file
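Since matplotlib is already a dependency, the `--plot` mode could be a few lines. A sketch, with the `Agg` backend so it works headless during unattended runs; the data shape (a list of per-experiment bpb values) is assumed:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: the script runs unattended
import matplotlib.pyplot as plt


def save_progress_chart(bpb_series, path="progress.png"):
    """Plot per-experiment bpb alongside the running best (sketch)."""
    running_best, best = [], float("inf")
    for value in bpb_series:
        best = min(best, value)
        running_best.append(best)
    fig, ax = plt.subplots()
    ax.plot(bpb_series, marker="o", alpha=0.4, label="per-experiment bpb")
    ax.plot(running_best, label="running best")
    ax.set_xlabel("experiment #")
    ax.set_ylabel("bpb")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
```

The running-best line makes plateaus visible at a glance, which is exactly the signal the human checking in on a run wants.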