autoresearch-mlx

An MLX port of karpathy/autoresearch for Apple Silicon.

An AI agent autonomously experiments with model architecture and data quality to minimize val_bpb within a fixed 5-minute training budget. This fork replaces the PyTorch/CUDA backend with MLX so the workflow runs natively on Mac.
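For readers new to the metric, bits per byte is conventionally the summed cross-entropy loss (in nats) converted to bits and divided by the number of raw text bytes, which makes it comparable across tokenizers. The sketch below shows that conversion only; the function name and inputs are illustrative, and the project's locked `evaluate_bpb` may differ in its details.

```python
import math


def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (nats) into bits per byte of raw text."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits.
    return total_nats / (math.log(2) * total_bytes)


if __name__ == "__main__":
    # Hypothetical numbers, not measured results.
    print(round(bits_per_byte(1288.5, 1000), 3))
```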

Quick start

Requirements: Apple Silicon Mac, Python 3.12+, uv.

# 1. Install dependencies
uv sync

# 2. Download data and train tokenizer (one-time, ~2 min)
uv run prepare.py

# 3. Run a single training experiment (~5 min)
uv run train.py

For TinyStories (smaller dataset, good for smaller machines):

uv run prepare.py --dataset tinystories
# Then set DATASET = "tinystories" in train.py before running

Architecture

The project is organized around two autonomous experiment programs that share a common training/evaluation infrastructure but operate on different parts of the codebase:

                  +-------------------+
                  |   evaluate_bpb    |  <-- ground truth metric (locked)
                  +-------------------+
                          |
            +-------------+-------------+
            |                           |
    +-------v-------+          +--------v--------+
    | Model Program |          | Data Program    |
    | (program.md)  |          | (program_data.md)|
    +-------+-------+          +--------+--------+
            |                           |
     edits train.py            edits prepare.py
     (architecture,            + data_sources.py
      optimizer,               (filtering, tokenizer,
      hyperparams)              curriculum, mixing)
            |                           |
            +-------------+-------------+
                          |
                  +-------v-------+
                  |  train.py     |  <-- 5-minute training run
                  +-------+-------+
                          |
              +-----------+-----------+
              |                       |
      +-------v-------+      +-------v-------+
      | run.log       |      | data/         |
      | (raw output   |      | last_run.json |
      |  for crashes) |      | run_*.json    |
      +---------------+      +---------------+

Separation of concerns

The two programs have non-overlapping edit scopes by design. This prevents an agent from making coupled changes across model and data simultaneously, which would be hard to attribute and hard to revert.

| | Model program | Data program |
| --- | --- | --- |
| Edits | `train.py` only | `prepare.py` + `data_sources.py` |
| Read-only | `prepare.py`, `data_sources.py` | `train.py` |
| Locked | `evaluate_bpb` | `evaluate_bpb` |
| Branch prefix | `autoresearch/<tag>` | `autoresearch-data/<tag>` |
| Typical cycle | ~6 min (5 train + 1 eval) | ~8 min (2 prepare + 5 train + 1 eval) |

Changes that span both (e.g., token-level loss weighting) are flagged for human-directed work.
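As an illustration of the edit scopes, the check below flags files that fall outside a program's scope. It is a minimal sketch, not part of the repository: `EDIT_SCOPE` and `out_of_scope` are hypothetical names, and the list of changed files is assumed to come from something like `git diff --name-only`.

```python
# Edit scopes as listed in this section; names here are hypothetical.
EDIT_SCOPE = {
    "model": {"train.py"},
    "data": {"prepare.py", "data_sources.py"},
}


def out_of_scope(program: str, changed_files: list[str]) -> list[str]:
    """Return the changed files that the given program is not allowed to edit."""
    allowed = EDIT_SCOPE[program]
    return sorted(f for f in changed_files if f not in allowed)


if __name__ == "__main__":
    # A model-program change that also touches prepare.py would be flagged.
    print(out_of_scope("model", ["train.py", "prepare.py"]))
```

An empty result means the change stays within the program's scope; anything else is a candidate for human-directed work.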

Dual-channel output

train.py produces two output channels:

- run.log -- raw stdout+stderr capture (> run.log 2>&1). Contains progress lines with \r carriage returns and the human-readable summary block. The agent uses this for crash diagnostics (tail -50 run.log). Overwritten each run, gitignored.
- data/last_run.json -- structured JSON written by log_utils.save_json. Contains all metrics in a machine-readable format. The agent reads this for results instead of grepping stdout. A timestamped copy (data/run_YYYYMMDD_HHMMSS.json) is archived alongside it.

This separates the human interface (stdout text) from the machine interface (structured JSON). The upstream karpathy/autoresearch uses only the grep-based approach; the structured JSON output is our addition.
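The JSON channel can be pictured roughly as follows. This is a sketch under assumptions, not the actual `log_utils.save_json`: the function names, signatures, and the `val_bpb` key used in the example are illustrative, and only the two file paths follow the naming described above.

```python
from __future__ import annotations

import json
from datetime import datetime
from pathlib import Path


def save_run(metrics: dict, data_dir: Path, now: datetime | None = None) -> Path:
    """Write metrics to last_run.json and to a timestamped archive copy."""
    now = now or datetime.now()
    data_dir.mkdir(parents=True, exist_ok=True)
    payload = json.dumps(metrics, indent=2)
    # Stable path: always holds the latest run.
    (data_dir / "last_run.json").write_text(payload)
    # Archive path: run_YYYYMMDD_HHMMSS.json
    archive = data_dir / f"run_{now:%Y%m%d_%H%M%S}.json"
    archive.write_text(payload)
    return archive


def load_last_run(data_dir: Path) -> dict | None:
    """Return the latest metrics, or None if no run has written them yet."""
    path = data_dir / "last_run.json"
    if not path.exists():
        return None
    return json.loads(path.read_text())
```

Reading a parsed dictionary avoids the brittleness of grepping progress lines that contain carriage returns.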

Data flow

train.py finishes
    |
    +-- stdout/stderr --> run.log (overwritten, gitignored)
    |                       \-- agent reads on crash: tail -50 run.log
    |
    +-- save_json() --> data/last_run.json (stable path, always latest)
    |               \-> data/run_20260315_142301.json (timestamped archive)
    |                       \-- agent reads for metrics
    |
    +-- agent logs --> results.tsv (append-only, gitignored)
    |                   \-- 6 columns: commit, val_bpb, memory_gb, avg_tok_sec, status, description
    |
    +-- analysis.py reads --> data/run_*.json + results.tsv
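The results.tsv step in the flow above could look like the sketch below. The six column names come from the diagram; the helper name `append_result` and the choice to write a header row when the file is first created are assumptions for illustration.

```python
import csv
from pathlib import Path

# Column order as listed in the data flow diagram.
COLUMNS = ["commit", "val_bpb", "memory_gb", "avg_tok_sec", "status", "description"]


def append_result(path: Path, row: dict) -> None:
    """Append one experiment row to a tab-separated log, creating it if needed."""
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
        )
        if is_new:
            writer.writeheader()  # assumption: the file starts with a header row
        writer.writerow(row)
```

Opening in append mode keeps earlier rows untouched, which matches the append-only intent.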

Experiment loop

Both programs follow the same loop (model program shown):

1. Edit train.py with an idea
2. git commit
3. uv run train.py > run.log 2>&1
4. Read data/last_run.json for metrics
5. Keep (advance the branch) or discard (git reset) based on the quality+throughput framework
6. Log to results.tsv
7. Repeat indefinitely

The agent runs autonomously on a dedicated branch. It never pushes to remote. The human can walk away and come back to a results.tsv full of experiments.
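The run-and-capture step of the loop can be sketched in Python as below. This is illustrative only: the agent itself uses the shell redirection shown in the loop, and `run_and_capture` and `tail` are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def run_and_capture(cmd: list[str], log_path: Path) -> int:
    """Run cmd with stdout and stderr both written to log_path; return the exit code."""
    with log_path.open("wb") as log:  # "wb" overwrites the previous run's log
        return subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT).returncode


def tail(log_path: Path, n: int = 50) -> str:
    """Return the last n newline-terminated lines, similar to `tail -n`."""
    text = log_path.read_bytes().decode(errors="replace")
    lines = text.rstrip("\n").split("\n")
    return "\n".join(lines[-n:])
```

A call such as `run_and_capture(["uv", "run", "train.py"], Path("run.log"))` mirrors step 3; a non-zero return code is the cue to inspect `tail(Path("run.log"))` rather than the JSON metrics.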

Running the agent

Experiment skills are provided via the experiment-plugin/ Claude Code plugin. Load it with --plugin-dir:

claude --plugin-dir ./experiment-plugin

Recommended alias: alias ar='claude --plugin-dir ./experiment-plugin'

Interactive mode

ar

Then:

/experiment:model mar15

Autonomous mode (unattended)

The project ships .claude/settings.json with a scoped allowlist. git push is explicitly denied.

# Model experiments
ar -p "/experiment:model mar15"

# Data experiments
ar -p "/experiment:data mar15-data"

Other skills

| Skill | Description |
| --- | --- |
| `/experiment:run` | Single experiment cycle (commit, train, extract, log) |
| `/experiment:compare [N]` | Compare recent training runs |
| `/experiment:review` | Pre-flight check on train.py changes |

See docs/guide.md for full details on permissions, safety, and tuning for smaller machines.

Project structure

train.py              - model, optimizer, training loop (model program edits)
prepare.py            - data prep, tokenizer, dataloader, evaluate_bpb (data program edits)
data_sources.py       - dataset registry and configuration (data program edits)
log_utils.py          - structured output, diagnostics, logging
program.md            - model experiment agent instructions
program_data.md       - data experiment agent instructions
experiment-plugin/    - Claude Code plugin (skills: model, data, run, compare, review; agent: experiment-reviewer)
bench.py              - performance profiling (compiled vs uncompiled)
analysis.py           - experiment results analysis (reads run_*.json + results.tsv)
docs/guide.md         - usage guide for all modes
tests/                - test suite
data/                 - run archives (last_run.json, run_*.json, bench_*.json)
internal/log/         - session-by-session development notes
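To show how the timestamped archives in data/ can be consumed, the sketch below lists them in chronological order by parsing the filename. It is not taken from analysis.py; the helper name `archived_runs` is hypothetical, and only the `run_YYYYMMDD_HHMMSS.json` naming comes from this README.

```python
from datetime import datetime
from pathlib import Path


def archived_runs(data_dir: Path) -> list[tuple[datetime, Path]]:
    """Return (timestamp, path) pairs for run_*.json archives, oldest first."""
    runs = []
    for path in data_dir.glob("run_*.json"):
        try:
            stamp = datetime.strptime(path.stem, "run_%Y%m%d_%H%M%S")
        except ValueError:
            continue  # skip files that do not match the timestamp pattern
        runs.append((stamp, path))
    return sorted(runs)
```

Files such as last_run.json and bench_*.json do not match the glob and are left out.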

Current results (v0.7.2)

Tested on an M2 Ultra with 192 GB unified memory. 5-minute training budget, 11.5M-parameter GPT with value embeddings.

| Metric | Value |
| --- | --- |
| val_bpb | 1.859 |
| Training steps | 641 |
| Avg throughput | 139,999 tok/sec |
| Peak memory | 10.6 GB |
| Total time (train + eval) | 350.7s |

Configuration: DEPTH=4, 5-group MultiOptimizer (Muon + AdamW), Muon momentum ramped from 0.85 to 0.95, weight decay linearly decayed to 0.
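The two schedules mentioned here can be sketched as simple linear interpolations. The endpoints 0.85, 0.95, and 0 come from the line above; everything else is an assumption for illustration, including the linear shape of the momentum ramp, the `ramp_steps` length, and the `base_wd` starting value. The actual schedules in train.py may differ.

```python
def linear_ramp(step: int, total_steps: int, start: float, end: float) -> float:
    """Linearly interpolate from start (step 0) to end (step >= total_steps)."""
    if total_steps <= 0:
        return end
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def muon_momentum(step: int, ramp_steps: int = 300) -> float:
    """Momentum ramp 0.85 -> 0.95; ramp_steps is a hypothetical value."""
    return linear_ramp(step, ramp_steps, 0.85, 0.95)


def weight_decay(step: int, total_steps: int, base_wd: float = 0.1) -> float:
    """Weight decay falling linearly to 0; base_wd is a hypothetical value."""
    return linear_ramp(step, total_steps, base_wd, 0.0)
```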

What changed from the original

- PyTorch/CUDA replaced with pure MLX for Apple Silicon
- mx.fast.* ops (SDPA, RoPE, RMS norm) instead of Flash Attention 3 and manual implementations
- Multi-dataset support via data_sources.py
- Autonomous data experiment loop alongside the model experiment loop
- Structured JSON output (data/last_run.json) for machine-readable results -- upstream uses grep-from-stdout only
- Scoped Claude Code permissions for safe unattended operation

Documentation

- docs/guide.md -- step-by-step for all modes (model, data, engineering, manual, autonomous)
- AGENTS.md -- accumulated technical knowledge (optimizer routing, MLX gotchas, training results)
- internal/data-investigations.md -- data quality investigation backlog
- internal/ane-integration.md -- ANE integration roadmap
- internal/analysis/ -- throughput regression and eval bottleneck analyses
- internal/log/ -- session-by-session development notes

Acknowledgements

- MLX
- karpathy/autoresearch -- the OG
- trevin-creator/autoresearch-mlx -- another MLX port whose findings on DEPTH=4 and hyperparameter tuning informed our baseline reset in v0.6.0

License

MIT
