Internal Setup Guide (AI2 Researchers)

This guide is for AI2 researchers with access to Beaker, Weka, and internal infrastructure. External users should see Pretraining.md instead.

Table of Contents

Quick Start
Beaker Setup
Launch Methods
Internal Datasets
Beaker Information
Internal-Specific Gotchas

Quick Start (10 Minutes)

If you're an AI2 researcher, follow these steps to get running quickly:

1. Configure Beaker (see Beaker Setup)
2. Use existing datasets on Weka (see Internal Datasets)
3. Launch via Beaker or Sessions (see Launch Methods)

Beaker Setup

1. Create GitHub Token

Create a GitHub token that can clone this repo on Beaker. Generate the token in your GitHub account settings with the following permissions:

repo
read:packages
read:org
write:org
read:project

Important: Authorize this token for the allenai org using the "Configure SSO" dropdown next to the token you created.

2. Configure Beaker Workspace and Budget

beaker config set default_workspace ai2/earth-systems
beaker workspace set-budget ai2/earth-systems ai2/atec-olmoearth

3. Set Beaker Secrets

ACCOUNT=$(beaker account whoami --format json | jq -r '.[0].name')
beaker secret write ${ACCOUNT}_WANDB_API_KEY <your_key>
beaker secret write ${ACCOUNT}_BEAKER_TOKEN <your_token>
beaker secret write ${ACCOUNT}_GITHUB_TOKEN <your_key>

Note: Make sure you have jq installed: https://stedolan.github.io/jq/
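
If jq is not available, the account lookup above can also be sketched in Python. This is a minimal sketch, assuming the JSON printed by beaker account whoami --format json is a list whose first element has a name field (which is what the jq filter .[0].name implies). The helper name account_name_from_whoami and the sample payload are hypothetical.

```python
import json


def account_name_from_whoami(raw_json: str) -> str:
    """Return the account name, mirroring the jq filter '.[0].name'."""
    accounts = json.loads(raw_json)
    # Assumption: the payload is a non-empty list of objects with a "name" key.
    return accounts[0]["name"]


# Hypothetical payload for illustration only; real output has more fields.
sample = '[{"name": "jdoe"}]'
account = account_name_from_whoami(sample)
print(f"{account}_WANDB_API_KEY")
```

In practice the raw JSON would come from running the beaker command, for example via subprocess.run with capture_output=True and text=True.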

Launch Methods

Pre-emptible Jobs

To launch pre-emptible jobs, we use the main entrypoint in olmoearth_pretrain/internal/experiment.py and write Python configuration files in scripts/official/.

⚠️ Important: Before launching your script, MAKE SURE YOUR CODE IS COMMITTED AND PUSHED, because the job clones the code on top of a Docker image at launch.

Scheduling Priority

Jobs are launched at high priority by default. To change this, pass --launch.priority=<low,normal,urgent> as an additional override.

Launch Command

python3 scripts/official/base.py launch my_run_name ai2/saturn

This will launch a Beaker job and stream the logs to your console until you cancel. Add additional overrides as needed (see Pretraining.md for details).
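
When launching several runs, the command and its overrides can be assembled programmatically. The following is a minimal sketch, assuming overrides are passed as extra --key=value arguments after the cluster name, as in the command above. build_launch_command is a hypothetical helper and is not part of the repo.

```python
import shlex
from typing import Optional, Sequence


def build_launch_command(
    run_name: str,
    cluster: str,
    priority: Optional[str] = None,
    overrides: Sequence[str] = (),
    script: str = "scripts/official/base.py",
) -> list[str]:
    """Build the argv for a Beaker launch (hypothetical helper)."""
    cmd = ["python3", script, "launch", run_name, cluster]
    if priority is not None:
        # --launch.priority is the override named in this guide.
        cmd.append(f"--launch.priority={priority}")
    cmd.extend(overrides)
    return cmd


cmd = build_launch_command("my_run_name", "ai2/saturn", priority="normal")
print(shlex.join(cmd))
# python3 scripts/official/base.py launch my_run_name ai2/saturn --launch.priority=normal
```

The resulting list can be passed to subprocess.run unchanged, which avoids shell quoting issues when an override value contains spaces.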

Beaker Sessions

For interactive development and debugging, you can use Beaker sessions.

Setup Workflow

See the VSCode/Cursor workflow setup document for detailed instructions.

Session Creation

When creating a session, include the following args:

--secret-env WANDB_API_KEY=<your_beaker_username>_WANDB_API_KEY \
--secret-env BEAKER_TOKEN=<your_beaker_username>_BEAKER_TOKEN
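
The secret names follow the <your_beaker_username>_<NAME> pattern used in Set Beaker Secrets, so these arguments can be derived from the username. A minimal sketch follows. session_secret_args is a hypothetical helper, and jdoe is a placeholder username.

```python
def session_secret_args(username: str) -> list[str]:
    """Return the --secret-env arguments for a Beaker session."""
    args: list[str] = []
    for env_var in ("WANDB_API_KEY", "BEAKER_TOKEN"):
        # Map each environment variable to the user's prefixed Beaker secret.
        args += ["--secret-env", f"{env_var}={username}_{env_var}"]
    return args


print(" ".join(session_secret_args("jdoe")))
```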

Flash Attention Setup

To use flash attention in a session:

Use beaker://petew/olmo-core-tch270cu128 as your base Beaker image

Set up a conda environment:

conda init
exec bash
conda shell.bash activate base
pip install -e '.[all]'

Running in Sessions

For debugging in sessions, use:

torchrun --nproc_per_node=8 scripts/official/base.py train test_run local

Add additional overrides as needed (see Pretraining.md for examples).

Internal Datasets

Dataset Locations on Weka

Internal datasets are stored on Weka at:

/weka/dfive-default/helios/dataset/

You can reference these paths directly in your launch commands with --dataset.h5py_dir.

See the main README.md for specific dataset paths and details.
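
As an illustration of how the Weka root and the --dataset.h5py_dir override fit together, here is a minimal sketch. It assumes the override takes the same --key=value form as the other overrides in this guide. dataset_override is a hypothetical helper, and example_dataset is a placeholder rather than a real dataset name.

```python
import posixpath

WEKA_DATASET_ROOT = "/weka/dfive-default/helios/dataset/"


def dataset_override(dataset_subdir: str) -> str:
    """Return a --dataset.h5py_dir override for a dataset under the Weka root."""
    # Assumption: the override uses the same --key=value form as other overrides.
    path = posixpath.join(WEKA_DATASET_ROOT, dataset_subdir)
    return f"--dataset.h5py_dir={path}"


# "example_dataset" is a placeholder, not a real dataset name.
print(dataset_override("example_dataset"))
```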

Evaluation Datasets

Evaluation datasets have default paths configured in olmoearth_pretrain/evals/datasets/paths.py that point to internal AI2 infrastructure. You typically don't need to override these.

Beaker Information

Quick Reference:

Budget: ai2/atec-olmoearth
Workspace: ai2/earth-systems
Weka: weka://dfive-default

Internal-Specific Gotchas

1. Code Must Be Committed

When launching Beaker jobs, the code is cloned fresh. Always commit and push your changes before launching.
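
A pre-launch check can catch uncommitted or unpushed work before a job is submitted. The following is a minimal sketch, not part of the repo. It assumes git is on the PATH, the script runs inside the repo, and the current branch has an upstream configured. All function names are hypothetical.

```python
import subprocess


def git_output(*args: str) -> str:
    """Run a git subcommand in the current repo and return its stdout."""
    result = subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def is_clean(porcelain: str) -> bool:
    """True when 'git status --porcelain' printed nothing."""
    return porcelain.strip() == ""


def unpushed_commits(count_output: str) -> int:
    """Parse the output of 'git rev-list --count @{u}..HEAD'."""
    return int(count_output.strip() or 0)


def ready_to_launch() -> bool:
    """True when there are no local changes and no unpushed commits."""
    clean = is_clean(git_output("status", "--porcelain"))
    # @{u} is the upstream of the current branch; git exits non-zero
    # (raising CalledProcessError here) if no upstream is configured.
    ahead = unpushed_commits(git_output("rev-list", "--count", "@{u}..HEAD"))
    return clean and ahead == 0
```

Calling ready_to_launch() before the launch command returns False if anything is left to commit or push. Untracked files also appear in the porcelain output, so they count as local changes here.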

2. Beaker Sessions vs Jobs

Jobs: For long-running training, use pre-emptible jobs
Sessions: For interactive debugging, use sessions with the torchrun command

3. Weka Performance

If data loading is slow, consider using the 128x128 tile versions of the datasets. They contain 4x more samples and are better suited to GB/s throughput bottlenecks on Weka.
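
The 4x figure is consistent with splitting each larger tile into non-overlapping 128x128 tiles. The sketch below works through that arithmetic, assuming the full-size tiles are 256x256 pixels. This guide does not state the original tile size, so treat that value as an assumption. subtiles_per_tile is a hypothetical helper.

```python
def subtiles_per_tile(parent_size: int, child_size: int) -> int:
    """Number of non-overlapping child tiles that fit in one square parent tile."""
    if parent_size % child_size != 0:
        raise ValueError("parent_size must be a multiple of child_size")
    per_side = parent_size // child_size
    return per_side * per_side


# Assumed sizes: 256x256 parent tiles, 128x128 child tiles.
print(subtiles_per_tile(256, 128))  # 4
```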
