This guide is for AI2 researchers with access to Beaker, Weka, and internal infrastructure. External users should see Pretraining.md instead.
- Quick Start
- Beaker Setup
- Launch Methods
- Internal Datasets
- Beaker Information
- Internal-Specific Gotchas
If you're an AI2 researcher, follow these steps to get running quickly:
- Configure Beaker (see Beaker Setup)
- Use existing datasets on Weka (see Internal Datasets)
- Launch via Beaker or Sessions (see Launch Methods)
Create a GitHub token that can clone this repo on Beaker. Generate a token here with the following permissions:
- repo
- read:packages
- read:org
- write:org
- read:project
Important: Authorize this token for the allenai org by clicking on the "Configure SSO" dropdown here for the token you created.
beaker config set default_workspace ai2/earth-systems
beaker workspace set-budget ai2/earth-systems ai2/atec-olmoearth
ACCOUNT=$(beaker account whoami --format json | jq -r '.[0].name')
beaker secret write ${ACCOUNT}_WANDB_API_KEY <your_key>
beaker secret write ${ACCOUNT}_BEAKER_TOKEN <your_token>
beaker secret write ${ACCOUNT}_GITHUB_TOKEN <your_key>

Note: Make sure you have jq installed: https://stedolan.github.io/jq/
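The secret names above share one pattern: the Beaker account name, an underscore, then the environment variable name. As a minimal sketch (the helpers `secret_name` and `session_secret_args` are hypothetical and not part of this repo, and `alice` is a placeholder account), the same pattern gives both the secret names and the matching `--secret-env` arguments used when creating sessions:

```python
# Hypothetical helpers (not part of this repo): derive per-user Beaker secret names.
ENV_VARS_WRITTEN = ("WANDB_API_KEY", "BEAKER_TOKEN", "GITHUB_TOKEN")


def secret_name(account: str, env_var: str) -> str:
    """Mirror the shell pattern ${ACCOUNT}_<ENV_VAR> used when writing secrets."""
    return f"{account}_{env_var}"


def session_secret_args(account: str, env_vars=("WANDB_API_KEY", "BEAKER_TOKEN")) -> list:
    """Build the --secret-env arguments that map each env var to its secret."""
    args = []
    for env_var in env_vars:
        args += ["--secret-env", f"{env_var}={secret_name(account, env_var)}"]
    return args


# "alice" is a placeholder account name used only for illustration.
print([secret_name("alice", v) for v in ENV_VARS_WRITTEN])
print(" ".join(session_secret_args("alice")))
```

If a job or session cannot find a secret, a mismatch between the account name used here and the one returned by `beaker account whoami` is a likely first thing to check.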
To launch pre-emptible jobs, we use the main entrypoint in olmoearth_pretrain/internal/experiment.py and write Python configuration files in scripts/official/.
Scheduling priority: Jobs are launched at high priority by default. To change this, pass --launch.priority=<low,normal,urgent> as an additional override.
python3 scripts/official/base.py launch my_run_name ai2/saturn
This will launch a Beaker job and stream the logs to your console until you cancel. Add additional overrides as needed (see Pretraining.md for details).
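To see how the launch command and the priority override fit together, here is a minimal sketch that assembles the command line before running it. `build_launch_command` is a hypothetical helper, not part of this repo; it only concatenates the arguments shown in this guide and passes any extra overrides through unchanged.

```python
import shlex


def build_launch_command(run_name, cluster, priority=None, overrides=()):
    """Return the launch command as an argv list (hypothetical helper)."""
    cmd = ["python3", "scripts/official/base.py", "launch", run_name, cluster]
    if priority is not None:
        # Appended as an override, in the --launch.priority=<value> form.
        cmd.append(f"--launch.priority={priority}")
    cmd.extend(overrides)
    return cmd


cmd = build_launch_command("my_run_name", "ai2/saturn", priority="normal")
# Print a shell-ready string; alternatively pass cmd to subprocess.run.
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems when an override value contains spaces.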
For interactive development and debugging, you can use Beaker sessions.
See the VSCode/Cursor workflow setup document for detailed instructions.
When creating a session, include the following args:
--secret-env WANDB_API_KEY=<your_beaker_username>_WANDB_API_KEY \
--secret-env BEAKER_TOKEN=<your_beaker_username>_BEAKER_TOKEN

To use flash attention in a session:
- Use beaker://petew/olmo-core-tch270cu128 as your base Beaker image
- Set up a conda environment:
conda init
exec bash
conda shell.bash activate base
pip install -e '.[all]'
For debugging in sessions, use:
torchrun --nproc_per_node=8 scripts/official/base.py train test_run local
Add additional overrides as needed (see Pretraining.md for examples).
Internal datasets are stored on Weka at:
/weka/dfive-default/helios/dataset/
You can reference these paths directly in your launch commands with --dataset.h5py_dir.
See the main README.md for specific dataset paths and details.
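As an illustration of the `--dataset.h5py_dir` override, the sketch below joins the Weka dataset root with a dataset sub-directory. The sub-directory name `my_dataset` is a placeholder rather than a real dataset (see README.md for actual paths), `dataset_override` is a hypothetical helper, and the `--key=value` form is assumed to match the style of the `--launch.priority=...` override.

```python
import posixpath

WEKA_DATASET_ROOT = "/weka/dfive-default/helios/dataset/"


def dataset_override(subdir: str) -> str:
    """Build a --dataset.h5py_dir override for a directory under the Weka root."""
    # Strip slashes so an absolute-looking subdir does not replace the root in the join.
    path = posixpath.join(WEKA_DATASET_ROOT, subdir.strip("/"))
    return f"--dataset.h5py_dir={path}"


# "my_dataset" is a placeholder; substitute a real path from README.md.
print(dataset_override("my_dataset"))
```

The resulting string can be appended to a launch or train command as one more override.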
Evaluation datasets have default paths configured in olmoearth_pretrain/evals/datasets/paths.py that point to internal AI2 infrastructure. You typically don't need to override these.
Quick Reference:
- Budget: ai2/atec-olmoearth
- Workspace: ai2/earth-systems
- Weka: weka://dfive-default
When launching Beaker jobs, the code is cloned fresh. Always commit and push your changes before launching.
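Because the job clones the repo fresh, local edits that are uncommitted or unpushed will not be in the code that runs. A minimal pre-launch check could look like the sketch below. The helpers `launch_blockers` and `check_repo` are hypothetical and not part of this repo; the check assumes the current branch has an upstream configured, otherwise the `git rev-list` call fails with `CalledProcessError`.

```python
import subprocess


def git(*args: str) -> str:
    """Run a git command in the current directory and return its stdout."""
    result = subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def launch_blockers(status_porcelain: str, unpushed_commits: int) -> list:
    """Return human-readable reasons why a fresh clone would miss local work."""
    blockers = []
    if status_porcelain.strip():
        blockers.append("uncommitted changes in the working tree")
    if unpushed_commits > 0:
        blockers.append(f"{unpushed_commits} commit(s) not pushed to upstream")
    return blockers


def check_repo() -> list:
    """Collect blockers for the repo in the current working directory."""
    status = git("status", "--porcelain")
    # Counts commits on HEAD that the upstream branch does not have yet.
    ahead = int(git("rev-list", "--count", "@{u}..HEAD"))
    return launch_blockers(status, ahead)
```

Calling `check_repo()` from the repo root before launching returns an empty list when everything is committed and pushed.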
- Jobs: For long-running training, use pre-emptible jobs
- Sessions: For interactive debugging, use sessions with the torchrun command
If you're experiencing slow data loading, consider using the 128x128 tile versions of datasets (4x more samples, better for GB/s bottlenecks on Weka).
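The "4x more samples" figure follows from tile area. Here is a small worked example, assuming the default tiles are 256x256 pixels (an assumption; this guide does not state the default tile size) and using a hypothetical helper `samples_multiplier`:

```python
def samples_multiplier(original_side: int, tile_side: int) -> int:
    """How many smaller square tiles cover one original square tile."""
    if original_side % tile_side != 0:
        raise ValueError("tile_side must evenly divide original_side")
    per_side = original_side // tile_side
    return per_side * per_side


# Assumed 256x256 originals split into 128x128 tiles: 2 per side, 4 in total.
print(samples_multiplier(256, 128))  # 4
```

Under that assumption, each 128x128 sample holds a quarter of the pixels of an original tile, so each individual read is smaller.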
- Pretraining.md - Main training guide (launching, overrides, experiments)
- README.md - Project overview and dataset details