Running on multi-GPU nodes #55

karpathy (Maintainer) · 2026-03-08T23:37:49Z

There are many possible ways to parallelize autoresearch. Here is one that I've been playing with on my 8xH100 node. It uses a fan-out strategy. Instead of pushing this to the code, I thought a Discussion might work better so that other people can share their own setups and links.

autoresearch (multi-GPU)

This is an experiment to have the LLM do its own research, using all 8 GPUs in parallel.

Setup

To set up a new experiment, work with the user to:

Agree on a run tag: propose a tag based on today's date (e.g. mar5). The branch autoresearch/<tag> must not already exist — this is a fresh run.
Create the branch: git checkout -b autoresearch/<tag> from current master.
Read the in-scope files: The repo is small. Read these files for full context:
- README.md — repository context.
- prepare.py — fixed constants, data prep, tokenizer, dataloader, evaluation. Do not modify.
- train.py — the file you modify. Model architecture, optimizer, training loop.
Verify data exists: Check that ~/.cache/autoresearch/ contains data shards and a tokenizer. If not, tell the human to run uv run prepare.py.
Initialize results.tsv: Create results.tsv with just the header row. The baseline will be recorded after the first run.
Confirm and go: Confirm setup looks good.

Once you get confirmation, kick off the experimentation.
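The two mechanical setup checks (data present, results.tsv initialized) can be scripted. A minimal Python sketch, assuming the five header columns listed under Logging results. The helper names `init_results_tsv` and `cache_looks_ready` are hypothetical and not part of the repo, and the cache check only confirms the directory is non-empty, since the exact shard and tokenizer file names are not given here:

```python
from pathlib import Path

# Column order taken from the "Logging results" section.
HEADER = ["commit", "val_bpb", "memory_gb", "status", "description"]


def init_results_tsv(path: Path) -> bool:
    """Create results.tsv with only the header row; never overwrite an existing file."""
    if path.exists():
        return False
    path.write_text("\t".join(HEADER) + "\n")
    return True


def cache_looks_ready(cache_dir: Path) -> bool:
    """Rough check: the cache directory exists and contains at least one entry."""
    return cache_dir.is_dir() and any(cache_dir.iterdir())
```

Typical use would be `cache_looks_ready(Path.home() / ".cache" / "autoresearch")` followed by `init_results_tsv(Path("results.tsv"))`. If the first returns False, ask the human to run uv run prepare.py.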

Experimentation

Each experiment runs on a single GPU. The training script runs for a fixed time budget of 5 minutes (wall clock training time, excluding startup/compilation). With 8 GPUs available, you run 8 experiments in parallel per round, testing 8 different ideas simultaneously.

What you CAN do:

Modify train.py — this is the only file you edit. Everything is fair game: model architecture, optimizer, hyperparameters, training loop, batch size, model size, etc.

What you CANNOT do:

Modify prepare.py. It is read-only. It contains the fixed evaluation, data loading, tokenizer, and training constants (time budget, sequence length, etc).
Install new packages or add dependencies. You can only use what's already in pyproject.toml.
Modify the evaluation harness. The evaluate_bpb function in prepare.py is the ground truth metric.

The goal is simple: get the lowest val_bpb. Since the time budget is fixed, you don't need to worry about training time — it's always 5 minutes. Everything is fair game: change the architecture, the optimizer, the hyperparameters, the batch size, the model size. The only constraint is that the code runs without crashing and finishes within the time budget.

VRAM is a soft constraint. Some increase is acceptable for meaningful val_bpb gains, but it should not blow up dramatically.

Simplicity criterion: All else being equal, simpler is better. A small improvement that adds ugly complexity is not worth it. Conversely, removing something and getting equal or better results is a great outcome — that's a simplification win. When evaluating whether to keep a change, weigh the complexity cost against the improvement magnitude. A 0.001 val_bpb improvement that adds 20 lines of hacky code? Probably not worth it. A 0.001 val_bpb improvement from deleting code? Definitely keep. An improvement of ~0 but much simpler code? Keep.

The first run: Your very first round should always establish the baseline by running the training script as is on one GPU, so you can calibrate the baseline numbers for this specific platform.

Output format

Once the script finishes it prints a summary like this:

---
val_bpb:          0.997900
training_seconds: 300.1
total_seconds:    325.9
peak_vram_mb:     45060.2
mfu_percent:      39.80
total_tokens_M:   499.6
num_steps:        953
num_params_M:     50.3
depth:            8

Note that the script is configured to always stop after 5 minutes, so the numbers will differ depending on the hardware it runs on. You can extract the key metric from the log file:

grep "^val_bpb:" run.log
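If you would rather pull every summary field at once, the same extraction can be done in Python. A small sketch, assuming the summary lines keep the `key: number` shape shown above; `parse_summary` is a hypothetical helper, not something the repo provides:

```python
import re

# Matches lines such as "val_bpb:          0.997900" and nothing else.
SUMMARY_LINE = re.compile(r"^(\w+):[ \t]+(-?\d+(?:\.\d+)?)[ \t]*$", re.MULTILINE)


def parse_summary(log_text: str) -> dict:
    """Return every `key: number` line found in the log as a dict of floats."""
    return {key: float(value) for key, value in SUMMARY_LINE.findall(log_text)}
```

A crashed run prints no summary, so `parse_summary(text).get("val_bpb")` returning None is a reasonable crash signal.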

Logging results

When an experiment round is done, log ALL variants to results.tsv (tab-separated, NOT comma-separated — commas break in descriptions).

The TSV has a header row and 5 columns:

commit	val_bpb	memory_gb	status	description

git commit hash (short, 7 chars)
val_bpb achieved (e.g. 1.234567) — use 0.000000 for crashes
peak memory in GB, round to .1f (e.g. 12.3 — divide peak_vram_mb by 1024) — use 0.0 for crashes
status: keep, discard, or crash
short text description of what this experiment tried

Example:

commit	val_bpb	memory_gb	status	description
a1b2c3d	0.997900	44.0	keep	baseline
b2c3d4e	0.993200	44.2	keep	[R1 gpu0] increase LR to 0.04
c3d4e5f	1.005000	44.0	discard	[R1 gpu1] switch to GELU activation
d4e5f6a	0.000000	0.0	crash	[R1 gpu2] double model width (OOM)
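The formatting rules above (7-character hash, six decimals for val_bpb, MB to GB with one decimal, zeros for crashes) are easy to get subtly wrong by hand. A hedged sketch of a row formatter; `format_row` is a hypothetical helper name, and replacing tabs in the description is an extra precaution that the rules above do not explicitly ask for:

```python
from typing import Optional

VALID_STATUS = {"keep", "discard", "crash"}


def format_row(commit: str, val_bpb: Optional[float], peak_vram_mb: Optional[float],
               status: str, description: str) -> str:
    """Build one tab-separated results.tsv row following the column rules above."""
    if status not in VALID_STATUS:
        raise ValueError(f"unknown status: {status}")
    if status == "crash":
        # Crashes are logged with zeroed metrics.
        val_bpb, peak_vram_mb = 0.0, 0.0
    memory_gb = peak_vram_mb / 1024  # peak_vram_mb is reported in MB
    # Tabs or newlines inside the description would break the TSV layout.
    clean = description.replace("\t", " ").replace("\n", " ")
    return "\t".join([commit[:7], f"{val_bpb:.6f}", f"{memory_gb:.1f}", status, clean])
```

For the baseline numbers in the sample output, `format_row("a1b2c3d", 0.9979, 45060.2, "keep", "baseline")` yields the first data row of the example.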

The experiment loop (parallel, 8 GPUs)

The experiment runs on a dedicated branch (e.g. autoresearch/mar5).

LOOP FOREVER (until I wake up and come back in the morning):

1. Plan a round

Come up with 8 different ideas to test. These should be diverse — don't waste GPUs on near-identical variants. Good strategies:

Mix bold and conservative ideas (e.g., 3 hyperparameter tweaks, 2 architectural changes, 2 combinations of previous wins, 1 wild card)
If a previous round had partial wins (improved but not best), try combining them
If you're stuck, try more radical changes on some GPUs while doing fine-grained sweeps on others

2. Launch 8 experiments (fan out from BASE)

Save the current commit as BASE — this is the starting point for all 8 experiments.

BASE=$(git rev-parse HEAD)

For each idea i (0 through 7), repeat:

Edit train.py with idea i
Commit: git commit -am "experiment: <short description>"
Launch in background: CUDA_VISIBLE_DEVICES=$i uv run train.py > /tmp/run_gpu${i}.log 2>&1 &
Save the commit hash (e.g. in a shell array or just note it down)
Restore train.py from BASE: git show $BASE:train.py > train.py
Verify the file is clean before editing the next variant: git diff $BASE -- train.py should show no changes.

IMPORTANT: why git show instead of git checkout. Using git checkout $BASE to reset between experiments is fragile. It can fail silently due to index.lock files, dirty working trees, or detached HEAD issues, causing edits to stack on top of each other instead of fanning out from BASE. Using git show $BASE:train.py > train.py directly overwrites the file contents, which is reliable regardless of git state. Once the launched process has loaded train.py it holds the code in memory, so overwriting the file afterwards doesn't affect it. Just make sure the run has actually started (e.g. its log shows output) before you restore the file.

GPU 0 starts first and GPU 7 starts last. The stagger is just however long it takes to edit and commit each variant (~30-60s each). That's fine.

After all 8 are launched, wait for them to finish.
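For anyone driving the fan-out from a script rather than by hand, the launch and restore steps can be sketched in Python. This is only an illustration under stated assumptions: the helper names (`git`, `variant_env`, `log_path`, `launch`, `restore_from_base`) are hypothetical, and it assumes `git` and `uv` are on PATH and the working directory is the repo root:

```python
import os
import subprocess
from pathlib import Path


def git(*args: str) -> str:
    """Run a git command and return its stdout, raising if git reports an error."""
    result = subprocess.run(["git", *args], check=True, capture_output=True, text=True)
    return result.stdout


def variant_env(gpu: int, base_env: dict) -> dict:
    """Copy the environment and pin the child process to a single GPU."""
    return {**base_env, "CUDA_VISIBLE_DEVICES": str(gpu)}


def log_path(gpu: int) -> Path:
    """Same log location as the shell command above."""
    return Path("/tmp") / f"run_gpu{gpu}.log"


def launch(gpu: int) -> subprocess.Popen:
    """Start `uv run train.py` in the background, logging stdout and stderr to one file."""
    log_file = open(log_path(gpu), "w")
    return subprocess.Popen(["uv", "run", "train.py"], stdout=log_file,
                            stderr=subprocess.STDOUT,
                            env=variant_env(gpu, dict(os.environ)))


def restore_from_base(base: str, path: str = "train.py") -> None:
    """Overwrite the file with BASE's version, then confirm the diff against BASE is empty."""
    Path(path).write_text(git("show", f"{base}:{path}"))
    if git("diff", base, "--", path).strip():
        raise RuntimeError(f"{path} still differs from {base}")
```

Per idea, the sequence would be: edit train.py, `git("commit", "-am", msg)`, record `git("rev-parse", "--short=7", "HEAD").strip()`, call `launch(i)`, then `restore_from_base(base)`.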

3. Collect results

When all of them finish, collect the results:

for i in $(seq 0 7); do
  echo "=== GPU $i ==="
  grep -E "^(val_bpb|peak_vram_mb):" /tmp/run_gpu${i}.log
done

4. Pick the winner

Compare all 8 results against the current best val_bpb:

If a variant won (lowest val_bpb, better than current best):
- Copy the winner's train.py directly: git show <winner_commit>:train.py > train.py
- Verify only the intended change is present: git diff $BASE -- train.py
- Commit: git commit -am "keep: <short description>"
- This is now the new BASE for the next round
- IMPORTANT — why not git cherry-pick: Cherry-pick can silently merge in unintended changes if the commit's parent had a dirty working tree. Copying the file directly and verifying the diff is safer.
If nothing improved:
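- Stay at BASE and move on to the next round. Keep using the saved $BASE hash rather than re-reading it from HEAD, since HEAD now points at the last experiment commit.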
Note the runners-up: If multiple variants improved, remember them — try combining them next round.
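The numeric part of this selection can be sketched as follows. `pick_winner` is a hypothetical helper. It only ranks by val_bpb, so the simplicity criterion and the VRAM soft constraint still need a judgment call on top of it:

```python
from typing import Optional


def pick_winner(results: dict, best_so_far: float):
    """results maps commit hash -> val_bpb (None for a crash).

    Returns (winner, runners_up): the commit with the lowest val_bpb that beats
    best_so_far, plus any other commits that also beat it, best first.
    """
    improved = sorted(
        (bpb, commit) for commit, bpb in results.items()
        if bpb is not None and bpb < best_so_far
    )
    if not improved:
        return None, []
    winner: Optional[str] = improved[0][1]
    runners_up = [commit for _, commit in improved[1:]]
    return winner, runners_up
```

A `None` winner means nothing improved and the round's BASE carries over unchanged.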

5. Log all results

Record ALL 8 variants in results.tsv, not just the winner. Use the round and GPU number in the description for traceability (e.g., [R3 gpu2] increase depth to 12).

6. Repeat

Go back to step 1 with new ideas informed by what you just learned.

Strategy tips

Combine winners: If round N found that idea A (gpu2) and idea B (gpu5) both improved, try combining A+B in round N+1.
Sweep around wins: If increasing LR to 0.04 helped, next round try 0.035, 0.045, 0.05 on a few GPUs to find the optimum.
Ablate: If a complex change helped, try simpler versions of it to see what's actually driving the gain.
Don't repeat yourself: If you tested an idea and it didn't work, don't test the exact same thing again. Vary it or move on.
Throughput: Each round takes ~8 min wall clock (5 min training + staggered launch + eval). That's ~7 rounds/hour × 8 ideas = ~56 experiments/hour. In a 10-hour session, you can test ~560 ideas.
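The throughput arithmetic can be made explicit. A rough sketch with hypothetical helper names, modeling a round as the stagger before the last launch plus one run's startup overhead plus the training budget. The 25.8 s overhead in the usage note is simply total_seconds minus training_seconds from the sample output:

```python
def experiments_per_hour(round_minutes: float, gpus: int = 8) -> int:
    """Whole rounds that fit in an hour, times the number of parallel experiments."""
    rounds_per_hour = int(60 // round_minutes)
    return rounds_per_hour * gpus


def estimate_round_minutes(train_s: float, overhead_s: float, stagger_s: float,
                           gpus: int = 8) -> float:
    """The last GPU starts after (gpus - 1) staggers, then runs overhead + training."""
    return (stagger_s * (gpus - 1) + overhead_s + train_s) / 60
```

`experiments_per_hour(8)` reproduces the 56 figure. With a 30 s stagger, `estimate_round_minutes(300, 25.8, 30)` comes out near 9 minutes, so under this model the ~8 minute round is on the optimistic end and a slower stagger lowers throughput accordingly.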

Timeout: Each round should take ~8 minutes total. If any run exceeds 10 minutes, kill it and treat it as a crash.
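One way to enforce the per-run limit from a script is to poll the launched processes and kill any that overrun. A minimal sketch, assuming each run was started with `subprocess.Popen` and its start time recorded with `time.monotonic()`; the function name `wait_or_kill` and the status labels are hypothetical:

```python
import subprocess
import time


def wait_or_kill(procs: dict, limit_s: float = 600.0, poll_s: float = 1.0) -> dict:
    """procs maps gpu -> (Popen, start_time from time.monotonic()).

    Polls until every process has exited; any run older than limit_s is killed.
    Returns gpu -> "ok", "crash" (non-zero exit), or "timeout".
    """
    status = {}
    while len(status) < len(procs):
        for gpu, (proc, started) in procs.items():
            if gpu in status:
                continue
            code = proc.poll()
            if code is not None:
                status[gpu] = "ok" if code == 0 else "crash"
            elif time.monotonic() - started > limit_s:
                proc.kill()
                proc.wait()  # reap the killed process
                status[gpu] = "timeout"
        if len(status) < len(procs):
            time.sleep(poll_s)
    return status
```

Both "crash" and "timeout" would be logged with status crash in results.tsv. The separate label only helps when deciding whether the idea is worth retrying.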

Crashes: If a variant crashes (OOM, a bug, etc.), use your judgment. If it's something dumb and easy to fix (e.g. a typo, a missing import), consider retrying the idea in the next round with the fix. If the idea itself is fundamentally broken, just record it as a crash in the TSV and move on. Don't retry individual variants mid-round; the round has 7 other results to work with.

NEVER STOP: Once the experiment loop has begun (after the initial setup), do NOT pause to ask the human if you should continue. Do NOT ask "should I keep going?" or "is this a good stopping point?". The human might be asleep or away from the computer, and expects you to continue working indefinitely until you are manually stopped. You are autonomous. If you run out of ideas, think harder: read papers referenced in the code, re-read the in-scope files for new angles, try combining previous near-misses, try more radical architectural changes. The loop runs until the human interrupts you, period.

As an example use case, a user might leave you running while they sleep. Each round takes ~8 minutes and tests 8 ideas, so you can run ~7 rounds/hour for ~56 experiments/hour. Over a typical 8-hour sleep you'd test ~450 ideas. The user then wakes up to experimental results, all completed by you while they slept!

matt-langston · 2026-03-09T05:09:02Z

I have autoresearch running on multiple GPUs - two DGX Sparks - in a similar fashion using ssh and scp and a "shared bulletin board" (aka simple text files) of results and messages that I pass between the GPUs.

My writeup here: matt-langston#1
My fork here: https://github.com/matt-langston/autoresearch (branch dgx-spark - the default branch of my fork).

My method doesn't have any single point of failure, and if a GPU suddenly drops out of the swarm then it just continues on autonomously. The power of the swarm comes from the network effect of agents sharing results with one another so that all agents can see the 10,000-foot view and reason about the goal as a whole.

If any agent drops out of the swarm, then it is simply a loss of efficiency.

A blockchain may be the most resilient approach to record which agent has done what in an incorruptible ledger. But that is a bit complicated at this stage when we're all still trying different things and experimenting.

aniruddhaadak80 · 2026-03-10T08:14:27Z

Just chiming in here: I just submitted a PR (#117) to add native Distributed Data Parallel (DDP) support to the repo using torchrun and NCCL.

It sets up the process groups and scales gradient accumulation so you can run on multi-GPU nodes out of the box without drastically changing the core scripts. Hopefully that makes your deployment significantly easier going forward and helps fully saturate your node.
