I have my writeup here: matt-langston#1. My method doesn't have any single point of failure: if a GPU suddenly drops out of the swarm, the rest simply continue autonomously. The power of the swarm comes from the network effect of agents sharing results with one another, so that all agents can see the 10,000-foot view and reason about the goal as a whole. If any agent drops out of the swarm, the only cost is a loss of efficiency. A blockchain may be the most resilient way to record which agent has done what in an incorruptible ledger, but that is a bit complicated at this stage, when we're all still trying different things and experimenting.
Just chiming in here: I just submitted a PR (#117) to add native Distributed Data Parallel (DDP) support to the repo. It sets up the process groups and scales the gradient accumulation so you can run it on multi-GPU nodes out of the box, without drastically changing the core scripts. Hopefully that makes your deployment easier going forward and helps fully saturate your node.
There are many possible ways to parallelize autoresearch. Here is one that I've been playing with on my 8xH100 node. It uses a fan-out strategy. Instead of pushing this to the code, I thought a Discussion might work better so that other people can share their own setups and links.
autoresearch (multi-GPU)
This is an experiment to have the LLM do its own research, using all 8 GPUs in parallel.
Setup
To set up a new experiment, work with the user to:
- Agree on a run tag (e.g. `mar5`). The branch `autoresearch/<tag>` must not already exist; this is a fresh run.
- Create the branch: `git checkout -b autoresearch/<tag>` from current master.
- Read the in-scope files:
  - `README.md`: repository context.
  - `prepare.py`: fixed constants, data prep, tokenizer, dataloader, evaluation. Do not modify.
  - `train.py`: the file you modify. Model architecture, optimizer, training loop.
- Check that `~/.cache/autoresearch/` contains data shards and a tokenizer. If not, tell the human to run `uv run prepare.py`.
- Create `results.tsv` with just the header row. The baseline will be recorded after the first run.

Once you get confirmation, kick off the experimentation.
Experimentation
Each experiment runs on a single GPU. The training script runs for a fixed time budget of 5 minutes (wall clock training time, excluding startup/compilation). With 8 GPUs available, you run 8 experiments in parallel per round, testing 8 different ideas simultaneously.
What you CAN do:
- Modify `train.py`: this is the only file you edit. Everything is fair game: model architecture, optimizer, hyperparameters, training loop, batch size, model size, etc.

What you CANNOT do:
- Modify `prepare.py`. It is read-only. It contains the fixed evaluation, data loading, tokenizer, and training constants (time budget, sequence length, etc.).
- Change `pyproject.toml`.
- Change the evaluation. The `evaluate_bpb` function in `prepare.py` is the ground truth metric.

The goal is simple: get the lowest val_bpb. Since the time budget is fixed, you don't need to worry about training time; it's always 5 minutes. Everything is fair game: change the architecture, the optimizer, the hyperparameters, the batch size, the model size. The only constraint is that the code runs without crashing and finishes within the time budget.
VRAM is a soft constraint. Some increase is acceptable for meaningful val_bpb gains, but it should not blow up dramatically.
Simplicity criterion: All else being equal, simpler is better. A small improvement that adds ugly complexity is not worth it. Conversely, removing something and getting equal or better results is a great outcome — that's a simplification win. When evaluating whether to keep a change, weigh the complexity cost against the improvement magnitude. A 0.001 val_bpb improvement that adds 20 lines of hacky code? Probably not worth it. A 0.001 val_bpb improvement from deleting code? Definitely keep. An improvement of ~0 but much simpler code? Keep.
The first run: Your very first round should always establish the baseline by running the training script as is on one GPU, so you can calibrate the baseline numbers for this specific platform.
Output format
Once the script finishes it prints a summary like this:
Note that the script is configured to always stop after 5 minutes, so depending on the computing platform of this computer the numbers might look different. You can extract the key metric from the log file:
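A minimal sketch of pulling the metric out of a run log from Python. It assumes the summary contains a line of the form `val_bpb: <number>`; that format and the helper name `extract_val_bpb` are assumptions for illustration, not something the post confirms.

```python
import re
from typing import Optional

# Assumed (hypothetical) log format: a line such as "val_bpb: 0.9979".
_VAL_BPB = re.compile(r"^val_bpb:[ \t]*([0-9]*\.?[0-9]+)[ \t]*$", re.MULTILINE)


def extract_val_bpb(log_text: str) -> Optional[float]:
    """Return the last val_bpb found in the log text, or None if there is none.

    A missing value usually means the run crashed before printing its summary.
    """
    matches = _VAL_BPB.findall(log_text)
    return float(matches[-1]) if matches else None
```

For the per-GPU logs used later in this post, the text would come from files such as `/tmp/run_gpu0.log`.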
Logging results
When an experiment round is done, log ALL variants to `results.tsv` (tab-separated, NOT comma-separated; commas break in descriptions).

The TSV has a header row and 5 columns; the status is one of `keep`, `discard`, or `crash`.

Example:
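A minimal sketch of writing such rows from Python with the standard `csv` module. The column names in `COLUMNS` are hypothetical: the post specifies only that there are 5 columns and that the status is keep, discard, or crash.

```python
import csv
import io

# Hypothetical column names, chosen for illustration only.
COLUMNS = ["commit", "val_bpb", "memory_gb", "status", "description"]
VALID_STATUSES = {"keep", "discard", "crash"}


def rows_to_tsv(rows, include_header=False):
    """Render result rows (dicts keyed by COLUMNS) as tab-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    if include_header:
        writer.writerow(COLUMNS)
    for row in rows:
        if row["status"] not in VALID_STATUSES:
            raise ValueError(f"unknown status: {row['status']!r}")
        writer.writerow([row[name] for name in COLUMNS])
    return buf.getvalue()
```

Because the delimiter is a tab, commas inside a description pass through unquoted, which is the reason the post asks for TSV rather than CSV.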
The experiment loop (parallel, 8 GPUs)
The experiment runs on a dedicated branch (e.g. `autoresearch/mar5`).

LOOP FOREVER (until I wake up and come back in the morning):
1. Plan a round
Come up with 8 different ideas to test. These should be diverse — don't waste GPUs on near-identical variants. Good strategies:
2. Launch 8 experiments (fan out from BASE)
Save the current commit as BASE — this is the starting point for all 8 experiments.
For each idea `i` (0 through 7), repeat:
- Edit `train.py` with idea `i`.
- `git commit -am "experiment: <short description>"`
- `CUDA_VISIBLE_DEVICES=$i uv run train.py > /tmp/run_gpu${i}.log 2>&1 &`
- `git show $BASE:train.py > train.py`
- `git diff $BASE -- train.py` should show no changes.

IMPORTANT: why `git show` instead of `git checkout`. Using `git checkout $BASE` to reset between experiments is fragile. It can fail silently due to index.lock files, dirty working trees, or detached HEAD issues, causing edits to stack on top of each other instead of fanning out from BASE. Using `git show $BASE:train.py > train.py` directly overwrites the file contents, which is reliable regardless of git state. The running process already has the code in memory, so overwriting the file doesn't affect it.

GPU 0 starts first and GPU 7 starts last. The stagger is just however long it takes to edit and commit each variant (~30-60s each). That's fine.
After all 8 are launched, wait for them to finish.
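The launch step of the fan-out can also be sketched in Python. This is only an illustration of pinning one run to one GPU through `CUDA_VISIBLE_DEVICES` and sending its output to a per-GPU log, mirroring the shell command above. The helper names `launch_spec` and `launch` are hypothetical, and the edit/commit/reset steps are not covered.

```python
import os
import subprocess


def launch_spec(gpu: int) -> dict:
    """Describe one run: command, environment, and log path for a given GPU."""
    # Copy the current environment and expose only the chosen GPU to the child.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    return {
        "cmd": ["uv", "run", "train.py"],
        "env": env,
        "log_path": f"/tmp/run_gpu{gpu}.log",
    }


def launch(gpu: int) -> subprocess.Popen:
    """Start the run in the background, with stdout and stderr sent to its log."""
    spec = launch_spec(gpu)
    # The child keeps its own copy of the file descriptor, so the parent can
    # close its handle as soon as the process has been spawned.
    with open(spec["log_path"], "w") as log:
        return subprocess.Popen(
            spec["cmd"], env=spec["env"], stdout=log, stderr=subprocess.STDOUT
        )
```

Calling `launch(i)` right after committing variant `i` returns immediately, so the next variant can be prepared while the previous one trains.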
3. Collect results
When all of them finish, collect the results:
4. Pick the winner
Compare all 8 results against the current best `val_bpb`:
- `git show <winner_commit>:train.py > train.py`
- `git diff $BASE -- train.py`
- `git commit -am "keep: <short description>"`

`git cherry-pick`: Cherry-pick can silently merge in unintended changes if the commit's parent had a dirty working tree. Copying the file directly and verifying the diff is safer.

5. Log all results
Record ALL 8 variants in `results.tsv`, not just the winner. Use the round and GPU number in the description for traceability (e.g., `[R3 gpu2] increase depth to 12`).

6. Repeat
Go back to step 1 with new ideas informed by what you just learned.
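The comparison in step 4 can be sketched as a small function. This is a simplified illustration: it looks only at val_bpb (lower is better) and ignores the simplicity criterion, which still needs judgment. The function name and the use of `None` for crashed runs are assumptions.

```python
from typing import Dict, Optional


def pick_winner(results: Dict[int, Optional[float]], best_so_far: float) -> Optional[int]:
    """Return the GPU index whose run beat best_so_far, or None if none did.

    results maps GPU index -> val_bpb, with None standing in for a crashed run.
    """
    finished = {gpu: bpb for gpu, bpb in results.items() if bpb is not None}
    if not finished:
        return None
    gpu = min(finished, key=finished.get)
    # Keep the variant only if it strictly improves on the current best.
    return gpu if finished[gpu] < best_so_far else None
```

When this returns `None`, the round produced no improvement and the next round fans out from the same BASE.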
Strategy tips
Timeout: Each round should take ~8 minutes total. If any run exceeds 10 minutes, kill it and treat it as a crash.
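One way to enforce the timeout, sketched with `subprocess.Popen` handles. Note the simplification: the deadline here is measured from the moment the function is called, not from each run's own start time, so it is slightly more lenient toward runs that were launched earlier. The function name and return labels are illustrative.

```python
import subprocess
import time
from typing import Dict


def wait_with_deadline(
    procs: Dict[int, subprocess.Popen], timeout_s: float = 600.0, poll_s: float = 1.0
) -> Dict[int, str]:
    """Wait for all runs; kill any still alive at the deadline and mark it a crash."""
    deadline = time.monotonic() + timeout_s
    pending = dict(procs)
    outcome: Dict[int, str] = {}
    while pending:
        for gpu, proc in list(pending.items()):
            if proc.poll() is not None:  # the process has exited
                outcome[gpu] = "finished" if proc.returncode == 0 else "crash"
                del pending[gpu]
        if not pending:
            break
        if time.monotonic() >= deadline:
            for gpu, proc in pending.items():
                proc.kill()
                proc.wait()  # reap the killed process
                outcome[gpu] = "crash"
            break
        time.sleep(poll_s)
    return outcome
```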
Crashes: If a variant crashes (OOM, a bug, etc.), use your judgment. If it's something dumb and easy to fix (e.g. a typo, a missing import), consider retrying the idea in the next round with the fix. If the idea itself is fundamentally broken, just record it as a crash in the TSV and move on. Don't retry individual variants mid-round; the round has 7 other results to work with.
NEVER STOP: Once the experiment loop has begun (after the initial setup), do NOT pause to ask the human if you should continue. Do NOT ask "should I keep going?" or "is this a good stopping point?". The human might be asleep, or gone from a computer and expects you to continue working indefinitely until you are manually stopped. You are autonomous. If you run out of ideas, think harder — read papers referenced in the code, re-read the in-scope files for new angles, try combining previous near-misses, try more radical architectural changes. The loop runs until the human interrupts you, period.
As an example use case, a user might leave you running while they sleep. Each round takes ~8 minutes and tests 8 ideas, so you can run ~7 rounds/hour for ~56 experiments/hour. Over an 8-hour night you'd test roughly 450 to 500 ideas. The user then wakes up to experimental results, all completed by you while they slept!
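The throughput estimate is simple arithmetic and can be checked directly. The 8-minute round and 8 GPUs come from the text above; the 8-hour night is an assumption.

```python
def experiments(hours: float, round_minutes: int = 8, gpus: int = 8) -> int:
    """Number of experiments finished in `hours`, counting only whole rounds."""
    whole_rounds = int(hours * 60) // round_minutes
    return whole_rounds * gpus


per_hour = experiments(1)   # 7 whole rounds * 8 GPUs = 56
per_night = experiments(8)  # 60 whole rounds * 8 GPUs = 480 (assumed 8-hour night)
```

The hourly figure rounds 7.5 rounds down to 7, which is why multiplying 56 by 8 hours gives a slightly lower number (448) than counting rounds over the whole night.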