9 changes: 9 additions & 0 deletions .dockerignore
@@ -0,0 +1,9 @@
.venv
karpathy
__pycache__
*.pyc
run.log
results.tsv
.git
progress.png
analysis.ipynb
27 changes: 27 additions & 0 deletions Dockerfile
@@ -0,0 +1,27 @@
FROM nvidia/cuda:12.8.0-runtime-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

# System deps
RUN apt-get update && apt-get install -y --no-install-recommends \
python3.10 python3.10-venv python3-pip curl git ca-certificates \
&& rm -rf /var/lib/apt/lists/*

# Install uv
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:$PATH"

WORKDIR /app

# Copy project files
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen

# Copy source
COPY prepare.py train.py agent.py program.md ./

EXPOSE 9090

# Default: run training
CMD ["uv", "run", "train.py"]
207 changes: 159 additions & 48 deletions README.md
@@ -1,91 +1,202 @@
# autoresearch

![teaser](progress.png)
> Fork of [karpathy/autoresearch](https://github.com/karpathy/autoresearch) by [DeepBlueDynamics](https://github.com/DeepBlueDynamics)

*One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of "group meeting". That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that's right or wrong as the "code" is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began. -@karpathy, March 2026*.
Give an AI agent a real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats. You wake up to a log of experiments and a better model.

The training code here is a simplified single-GPU implementation of [nanochat](https://github.com/karpathy/nanochat). The core idea is that you don't touch the Python files like you normally would as a researcher; instead, you program the `program.md` Markdown file that provides context to the AI agents and sets up your autonomous research org. The default `program.md` is intentionally a bare-bones baseline, though it's obvious how one would iterate on it over time to find the "research org code" that achieves the fastest research progress, add more agents to the mix, etc. A bit more context on this project is in this [tweet](https://x.com/karpathy/status/2029701092347630069).
## What's different in this fork

- **Agent harness** (`agent.py`) — structured tool-calling agent that works with Claude, GPT, or Gemini. 10 tools for autonomous experimentation, including persistent thermodynamic memory via [ferricula](https://github.com/DeepBlueDynamics/ferricula).
- **Weber electrodynamic optimizer** — applies Weber's force law bracket `W = 1 - v²/(2c²) + v·a/c²` to the learning rate, modifying effective step size based on parameter velocity and acceleration. Physics-inspired adaptive optimization.
- **SDR entropy seeding** — replaces the fixed `torch.manual_seed(42)` with true hardware randomness from an RTL-SDR radio receiver via [sdr-random](https://github.com/DeepBlueDynamics/sdr-random). Falls back to `os.urandom` if unavailable.
- **Broader GPU support** — auto-detects Flash Attention 3 (H100/Hopper) or falls back to PyTorch SDPA (consumer GPUs). Windows support with automatic `torch.compile` bypass.
- **Optimized defaults** — hyperparameters from 215 experiments across Karpathy's sessions ([Discussion #32](https://github.com/karpathy/autoresearch/discussions/32), [#43](https://github.com/karpathy/autoresearch/discussions/43)).
- **Docker** — container with NVIDIA GPU passthrough, compose stack with ferricula memory service.

## How it works

The repo is deliberately kept small and only really has three files that matter:

- **`prepare.py`** — fixed constants, one-time data prep (downloads training data, trains a BPE tokenizer), and runtime utilities (dataloader, evaluation). Not modified.
- **`train.py`** — the single file the agent edits. Contains the full GPT model, optimizer (Muon + AdamW), and training loop. Everything is fair game: architecture, hyperparameters, optimizer, batch size, etc. **This file is edited and iterated on by the agent.**
- **`program.md`** — baseline instructions for one agent. Point your agent here and let it go. **This file is edited and iterated on by the human.**

By design, training runs for a **fixed 5-minute time budget** (wall clock, excluding startup/compilation), regardless of the details of your compute. The metric is **val_bpb** (validation bits per byte) — lower is better, and vocab-size-independent, so architectural changes are fairly compared.

If you are new to neural networks, this ["Dummy's Guide"](https://x.com/hooeem/status/2030720614752039185) provides a lot more context.
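
The val_bpb metric can be derived from ordinary token-level cross-entropy. A minimal sketch (the `bits_per_byte` helper and its arguments are hypothetical; the actual evaluation in `prepare.py` may differ in detail):

```python
import math

def bits_per_byte(mean_nats_per_token: float, total_bytes: int, total_tokens: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits per byte of raw text."""
    bits_per_token = mean_nats_per_token / math.log(2)  # nats -> bits
    bytes_per_token = total_bytes / total_tokens        # avg compression of the tokenizer
    return bits_per_token / bytes_per_token
```

Dividing by bytes rather than tokens is what makes the number comparable across vocab sizes: a bigger vocab lowers loss per token but also raises bytes per token.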

## Quick start

**Requirements:** A single NVIDIA GPU, Python 3.10+, [uv](https://docs.astral.sh/uv/).

```bash
# 1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Install dependencies
uv sync

# 3. Download data + train tokenizer (one-time, ~2 min)
uv run prepare.py

# 4. Run a single training experiment (~5 min)
uv run train.py
```

If the above commands all work ok, your setup is working and you can go into autonomous research mode.
## Platform support

| Platform | Flash Attn | torch.compile | Notes |
|----------|-----------|---------------|-------|
| **H100 / Hopper** | FA3 (native) | Triton | Full speed, no changes needed |
| **RTX 3060/4090 / Ampere+** | PyTorch SDPA (auto-fallback) | Triton (Linux) | Tune DEPTH, BATCH_SIZE for VRAM |
| **Windows (any GPU)** | PyTorch SDPA (auto-fallback) | Eager mode (auto) | Triton unavailable, runs slower |

The script auto-detects everything. No manual flags needed — just tune hyperparameters for your VRAM.
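
The detection can be sketched roughly like this (assumed logic mirroring the table above, not the literal code in `train.py`):

```python
import platform

def detect_backend():
    """Pick attention kernel and compile mode from the platform (sketch)."""
    try:
        import torch
        has_cuda = torch.cuda.is_available()
        # Compute capability 9.x = Hopper (H100) -> Flash Attention 3
        hopper = has_cuda and torch.cuda.get_device_capability(0)[0] >= 9
    except ImportError:
        has_cuda, hopper = False, False
    attention = "flash-attn-3" if hopper else "torch-sdpa"
    # Triton (and hence torch.compile) is unavailable on Windows -> eager mode
    compile_mode = "eager" if platform.system() == "Windows" else "triton"
    return attention, compile_mode
```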

### Tuning for smaller GPUs

The defaults are optimized for H100 80GB. For consumer GPUs, edit the hyperparameters block in `train.py`:

```python
# RTX 3060 12GB
DEPTH = 4
DEVICE_BATCH_SIZE = 16
TOTAL_BATCH_SIZE = 2**16
WINDOW_PATTERN = "SL"

# RTX 4090 24GB
DEPTH = 6
DEVICE_BATCH_SIZE = 32
TOTAL_BATCH_SIZE = 2**17
WINDOW_PATTERN = "SSL"
```

## Running the agent

Install your provider's SDK, set the matching API key, and launch `agent.py`:
```bash
# Install your provider's SDK
uv pip install anthropic # or: openai, google-genai

# Set your API key
export ANTHROPIC_API_KEY=sk-ant-... # Linux/Mac
set ANTHROPIC_API_KEY=sk-ant-... # Windows cmd
$env:ANTHROPIC_API_KEY="sk-ant-..." # PowerShell

# Run with Claude
uv run python agent.py --provider anthropic --model claude-sonnet-4-20250514

# Run with GPT
uv run python agent.py --provider openai --model gpt-4o

# Run with Gemini
uv run python agent.py --provider gemini --model gemini-2.0-flash

# Limit experiments, use a named branch
uv run python agent.py --provider anthropic --model claude-sonnet-4-20250514 --tag mar18 --max-experiments 20

# With ferricula memory (persistent experiment memory across runs)
uv run python agent.py --provider anthropic --model claude-sonnet-4-20250514 --memory http://localhost:8765
```

### Agent tools

| Tool | What it does |
|------|-------------|
| `get_config` | Read current hyperparameters from train.py |
| `set_hyperparams` | Modify hyperparameters (batch size, LR, depth, etc.) |
| `edit_code` | Replace entire sections of train.py (model, optimizer, training loop) |
| `run_experiment` | Execute 5-min training run, return val_bpb + metrics |
| `get_history` | Read results.tsv — full experiment log |
| `keep` | Git commit + log improvement to results.tsv |
| `discard` | Revert changes + log failure to results.tsv |
| `read_code` | Inspect specific lines of train.py |
| `remember` | Store insight in persistent thermodynamic memory (ferricula) |
| `recall` | Search memory for similar past experiments |

The agent loops autonomously: check config, propose a change, run it, evaluate, keep or discard, repeat. Context auto-compresses so it can run indefinitely.
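
That loop can be sketched as follows (the tool names mirror the table; the actual control flow in `agent.py`, including context compression and memory, is more involved):

```python
def research_loop(llm, tools, max_experiments=20):
    """Greedy keep/discard loop over 5-minute experiments (sketch)."""
    best = float("inf")
    for _ in range(max_experiments):
        # Propose a change given current config and past results
        change = llm.propose(tools["get_config"](), tools["get_history"]())
        tools["set_hyperparams"](change)
        val_bpb = tools["run_experiment"]()  # 5-min training run
        if val_bpb < best:                   # lower bits-per-byte is better
            best = val_bpb
            tools["keep"](change, val_bpb)   # git commit + log
        else:
            tools["discard"](change, val_bpb)  # revert + log
    return best
```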

### Manual mode

You can also run experiments the original way — point Claude Code, Codex, or any coding agent at `program.md`:

```
Hi have a look at program.md and let's kick off a new experiment!
```

The `program.md` file is essentially a super lightweight "skill".
## Weber electrodynamic optimizer

Applies Weber's force law bracket to the optimizer step, modifying effective learning rate based on parameter velocity (momentum) and acceleration (change in momentum):

```
W = 1 - v²/(2c²) + v·a/c²
```
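
A scalar sketch of the bracket and of how it might scale a momentum step (illustrative toy only; the actual integration into AdamW and Muon lives in `train.py`, and the function names here are assumptions):

```python
def weber_bracket(v, a, c_sq=1.0):
    """W = 1 - v^2/(2c^2) + v*a/c^2 for velocity v and acceleration a."""
    return 1.0 - v * v / (2.0 * c_sq) + v * a / c_sq

def weber_sgd_step(param, grad, state, lr=0.1, beta=0.9, c_sq=1.0):
    """One toy momentum-SGD step with the Weber factor scaling the update."""
    v_prev = state.get("v", 0.0)
    v = beta * v_prev + grad   # velocity = momentum buffer
    a = v - v_prev             # acceleration = change in momentum
    state["v"] = v
    return param - lr * weber_bracket(v, a, c_sq) * v
```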

- **Stable momentum** (v small): W ≈ 1, normal update
- **Accelerating params** (v·a > 0): W > 1, larger step — leans into the acceleration
- **Decelerating params** (v·a < 0): W < 1, smaller step — eases off
- **Fast params** (v² large): the -v²/2c² term damps the step — a natural speed limit

Applied to both AdamW (per-element) and Muon (per-matrix). Controlled by the `WEBER_C_SQ` hyperparameter (default 1.0); larger values give a subtler correction.

## Design choices

- **Single file to modify.** The agent only touches `train.py`. This keeps the scope manageable and diffs reviewable.
- **Fixed time budget.** Training always runs for exactly 5 minutes, regardless of your specific platform, which means approx 12 experiments/hour and approx 100 experiments while you sleep. Two upsides: experiments are directly comparable regardless of what the agent changes (model size, batch size, architecture, etc.), and autoresearch finds the best model for your platform within that budget. The downside is that your runs (and results) are not comparable to other people's on other compute platforms.
- **Self-contained.** No external dependencies beyond PyTorch and a few small packages. No distributed training, no complex configs. One GPU, one file, one metric.

## SDR entropy seeding

Seeds PyTorch's RNG with true hardware randomness from an RTL-SDR radio receiver. Entropy comes from ADC quantization noise — physically random, not pseudorandom.

Requires [sdr-random](https://github.com/DeepBlueDynamics/sdr-random) running on a machine with an RTL-SDR dongle:

```bash
# On the SDR host
sdr-rand local --port 9090

# train.py auto-fetches from http://<host>:9090/api/entropy
# Falls back to os.urandom if unavailable
uv run train.py
```

Configure the SDR host IP in `train.py` (search for `192.168.86.24`).
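
The fallback logic can be sketched like this (the endpoint path matches the README; the response format and function name are assumptions, and `train.py`'s real implementation may differ):

```python
import os
import urllib.request

def entropy_seed(host="192.168.86.24", port=9090, timeout=2.0):
    """Fetch 8 bytes of SDR entropy, falling back to os.urandom (sketch)."""
    try:
        url = f"http://{host}:{port}/api/entropy"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            raw = resp.read(8)
    except OSError:
        raw = os.urandom(8)  # fallback: OS entropy pool
    return int.from_bytes(raw[:8], "little")
```

`torch.manual_seed(entropy_seed())` would then replace the fixed seed 42.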

## Docker

```bash
# One-time: download data
docker compose --profile setup run prepare

# Run training
docker compose run train

# Run the autonomous agent
ANTHROPIC_API_KEY=sk-ant-... docker compose run agent

# Full stack with ferricula memory
docker compose up ferricula -d
ANTHROPIC_API_KEY=sk-ant-... docker compose run agent
```

Requires `nvidia-container-toolkit` for GPU passthrough.

## Tuning for much smaller platforms

There is a lot of interest in tinkering with autoresearch on much smaller compute than an H100 (MacBooks etc.). If that's you, consider one of the notable forks below. On top of the GPU table above, some upstream recommendations for tuning the defaults down to much smaller models:

1. To get half-decent results, use a dataset with much less entropy, e.g. this [TinyStories dataset](https://huggingface.co/datasets/karpathy/tinystories-gpt4-clean) of GPT-4-generated short stories. Because the data is much narrower in scope, you will see reasonable results with much smaller models (if you sample from them after training).
2. Experiment with decreasing `vocab_size`, e.g. from 8192 down to 4096, 2048, 1024, or even a byte-level tokenizer with 256 possible bytes after UTF-8 encoding.
3. In `prepare.py`, lower `MAX_SEQ_LEN` a lot, even down to 256 depending on the computer. As you lower `MAX_SEQ_LEN`, you may want to increase `DEVICE_BATCH_SIZE` in `train.py` slightly to compensate; the number of tokens per fwd/bwd pass is the product of the two.
4. Also in `prepare.py`, decrease `EVAL_TOKENS` so that your validation loss is evaluated on much less data.
5. In `train.py`, the primary knob that controls model complexity is `DEPTH` (default 8 upstream, 9 in this fork). Many variables are functions of it, so lower it to e.g. 4.
6. You'll most likely want a `WINDOW_PATTERN` of just "L", because "SSSL" uses an alternating banded attention pattern that may be very inefficient on small hardware. Try it.
7. Lower `TOTAL_BATCH_SIZE` a lot, keeping it a power of 2, e.g. down to `2**14` (~16K).

Ask your favorite coding agent for help and paste it this guide along with the full source code.

## Project structure

```
train.py model, optimizer, training loop (agent modifies this)
prepare.py constants, data prep, evaluation (do not modify)
agent.py autonomous experiment agent (Claude / GPT / Gemini)
program.md manual-mode agent instructions
pyproject.toml dependencies
Dockerfile CUDA runtime + uv + PyTorch
docker-compose.yml train, agent, ferricula, prepare services
results.tsv experiment log (auto-generated)
```

## Notable forks

- [miolini/autoresearch-macos](https://github.com/miolini/autoresearch-macos) (macOS)
- [trevin-creator/autoresearch-mlx](https://github.com/trevin-creator/autoresearch-mlx) (macOS)
- [jsegov/autoresearch-win-rtx](https://github.com/jsegov/autoresearch-win-rtx) (Windows)
- [andyluo7/autoresearch](https://github.com/andyluo7/autoresearch) (AMD)

## Optimized defaults

Hyperparameters validated across 215 experiments on H100:

| Setting | Upstream | This fork | Impact |
|---------|----------|-----------|--------|
| Depth | 8 | 9 | -0.004 val_bpb |
| Aspect ratio | 64 | 57 | depth-over-width |
| Batch size | 524K | 262K | -0.012 (more steps in 5 min) |
| Window pattern | SSSL | SSSSL | -0.004 cumulative |
| Short window | seq_len/2 | seq_len/8 | narrower local attention |
| RoPE base | 10K | 200K | -0.001 |
| Embedding LR | 0.6 | 0.9 | -0.005 |
| Warmdown ratio | 0.5 | 0.75 | -0.001 to -0.027 |
| Final LR frac | 0.0 | 0.05 | -0.006 |
| Init scale | 1.0x | 0.68x | -0.016 cumulative |
| x0_lambda init | 0.1 | 0.05 | -0.001 |
| Embedding WD | 0.0 | 0.001 | regularization |
| VE WD | 0.0 | 0.003 | -0.003 cumulative |
| LM head WD | 0.0 | 0.01 | -0.009 |
| Softcap | float32 before tanh | bf16 tanh, then float32 | saves ~4GB VRAM |
| **Weber c²** | N/A | 1.0 | velocity-dependent LR bracket |
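
For reference, the fork-side values from the table as a Python dict (the variable names are assumptions for illustration; check the hyperparameter block at the top of `train.py` for the real names):

```python
# Fork defaults per the table above (names hypothetical)
FORK_DEFAULTS = dict(
    DEPTH=9,
    TOTAL_BATCH_SIZE=2**18,   # 262,144 tokens (262K)
    WINDOW_PATTERN="SSSSL",
    ROPE_BASE=200_000,
    EMBEDDING_LR=0.9,
    WARMDOWN_RATIO=0.75,
    FINAL_LR_FRAC=0.05,
    INIT_SCALE=0.68,
    X0_LAMBDA_INIT=0.05,
    EMBEDDING_WD=0.001,
    VE_WD=0.003,
    LM_HEAD_WD=0.01,
    WEBER_C_SQ=1.0,
)
```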

## License
