
Commit 8ef02a3

Maknee and Michaelvll authored
[Example] Full finetuning/LoRA gpt-oss 20b/120b (skypilot-org#6551)
* gpt-oss sft runining
* Format
* Add
* Add
* Add example
* Update README
* Comments
* REmove
* Add training
* Add training
* Add training
* Update README.md (Co-authored-by: Zhanghao Wu <[email protected]>)
* Update llm/gpt-oss-sft/README.md (Co-authored-by: Zhanghao Wu <[email protected]>)
* Update llm/gpt-oss-sft/README.md (Co-authored-by: Zhanghao Wu <[email protected]>)
* Add training
* Add training
* Add training
* Updated path
* Updated path
* remove the local images
* Updated path
* Updated path
* Removed to imgur
* Remove
* Update docs
* Update docs
* update the readme
* Update docs
* fix hyperbolic unit tests

---------

Co-authored-by: Maknee <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>
1 parent be23db4 commit 8ef02a3

12 files changed: +502 -2 lines changed

README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -39,7 +39,7 @@
 ----
 
 :fire: *News* :fire:
-- [Aug 2025] Run and serve **OpenAI GPT-OSS models** (gpt-oss-120b, gpt-oss-20b) with one command on any infra: [**example**](./llm/gpt-oss/)
+- [Aug 2025] Serve and finetune **OpenAI GPT-OSS models** (gpt-oss-120b, gpt-oss-20b) with one command on any infra: [**serve**](./llm/gpt-oss/) + [**LoRA and full finetuning**](./llm/gpt-oss-sft/)
 - [Jul 2025] Run distributed **RL training for LLMs** with Verl (PPO, GRPO) on any cloud: [**example**](./llm/verl/)
 - [Jul 2025] 🎉 SkyPilot v0.10.0 released! [**blog post**](https://blog.skypilot.co/announcing-skypilot-0.10.0/), [**release notes**](https://github.com/skypilot-org/skypilot/releases/tag/v0.10.0)
 - [Jul 2025] Finetune **Llama4** on any distributed cluster/cloud: [**example**](./llm/llama-4-finetuning/)
```
docs/source/examples/training/gpt-oss-finetuning.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -0,0 +1 @@
+../../../../llm/gpt-oss-finetuning/README.md
```

docs/source/examples/training/index.rst

Lines changed: 1 addition & 0 deletions
```diff
@@ -8,6 +8,7 @@ Training
 DeepSpeed <deepspeed.md>
 Distributed PyTorch <distributed-pytorch.md>
 Distributed TensorFlow <distributed-tensorflow.md>
+Finetuning GPT-OSS <gpt-oss-finetuning.md>
 Finetuning Llama 4 <llama-4-finetuning.md>
 Finetuning Llama 3 <llama-3_1-finetuning.md>
 Finetuning Llama 2 <llama-2-finetuning.md>
```

llm/gpt-oss-finetuning/README.md

Lines changed: 166 additions & 0 deletions
# Finetuning OpenAI gpt-oss Models with SkyPilot

![](https://i.imgur.com/TkoqCQK.png)

On August 5, 2025, OpenAI released [gpt-oss](https://openai.com/open-models/), including two state-of-the-art open-weight language models: `gpt-oss-120b` and `gpt-oss-20b`. These models deliver strong real-world performance at low cost and are available under the flexible Apache 2.0 license.

The `gpt-oss-120b` model achieves near-parity with OpenAI o4-mini on core reasoning benchmarks, while the `gpt-oss-20b` model delivers similar results to OpenAI o3-mini.

This guide walks through how to finetune both models, with either LoRA or full finetuning, using [🤗 Accelerate](https://github.com/huggingface/accelerate).

![Cloud Logos](https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/cloud-logos-light.png)

## Step 0: Set up infrastructure

SkyPilot is a framework for running AI and batch workloads on any infrastructure, offering unified execution, high cost savings, and high GPU availability.

### Install SkyPilot

```bash
pip install 'skypilot[all]'
```
For more details on how to set up your cloud credentials, see the [SkyPilot docs](https://docs.skypilot.co).
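
If you only plan to use a subset of clouds, you can install just those extras instead of `[all]`; for example:

```bash
# Install SkyPilot with support for a specific set of providers only.
pip install 'skypilot[aws,gcp,kubernetes]'
```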

### Choose your infrastructure

`sky check` verifies which infra (clouds or Kubernetes clusters) your credentials currently give you access to:

```bash
sky check
```
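
You can also check where the GPU types used in this guide are available and what they cost; the exact output depends on the infra you have enabled:

```bash
# List infra offering these GPU types, with on-demand prices.
sky show-gpus H100:8
sky show-gpus H200:8
```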

## Step 1: Finetune the gpt-oss models

### Full finetuning

**For `gpt-oss-20b` (smaller model):**
- Requirements: 1 node, 8x H100 GPUs

```bash
sky launch -c gpt-oss-20b-sft gpt-oss-20b-sft.yaml
```

**For `gpt-oss-120b` (larger model):**
- Requirements: 4 nodes, 8x H200 GPUs each

```bash
sky launch -c gpt-oss-120b-sft gpt-oss-120b-sft.yaml
```

The `gpt-oss-120b-sft.yaml` task definition looks like this:

```yaml
# gpt-oss-120b-sft.yaml
resources:
  accelerators: H200:8
  network_tier: best

file_mounts:
  /sft: ./sft

num_nodes: 4

setup: |
  conda install cuda -c nvidia
  uv venv ~/training --seed --python 3.10
  source ~/training/bin/activate
  uv pip install torch --index-url https://download.pytorch.org/whl/cu128
  uv pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0"
  uv pip install deepspeed
  uv pip install git+https://github.com/huggingface/accelerate.git@c0a3aefea8aa5008a0fbf55b049bd3f0efa9cbf2

  uv pip install nvitop

run: |
  source ~/training/bin/activate

  # The first IP in SKYPILOT_NODE_IPS is the head node.
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  # Total number of processes = GPUs per node x number of nodes.
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  accelerate launch \
    --config_file /sft/fsdp2_120b.yaml \
    --num_machines $SKYPILOT_NUM_NODES \
    --num_processes $NP \
    --machine_rank $SKYPILOT_NODE_RANK \
    --main_process_ip $MASTER_ADDR \
    --main_process_port 29500 \
    /sft/train.py --model_id openai/gpt-oss-120b
```
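
SkyPilot runs `setup` on every node and then executes `run` on each node with `SKYPILOT_NODE_RANK` set, which lets `accelerate` form the multi-node process group. While the job runs, you can SSH into any node to watch GPU utilization; SkyPilot creates SSH aliases named after the cluster (workers follow the `<cluster>-workerN` convention):

```bash
# SSH into the head node or a worker of the 120b cluster.
ssh gpt-oss-120b-sft
ssh gpt-oss-120b-sft-worker1
# On the node, activate the training venv and launch the GPU monitor.
source ~/training/bin/activate && nvitop
```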

### LoRA finetuning

**For `gpt-oss-20b` with LoRA:**
- Requirements: 1 node, 2x H100 GPUs

```bash
sky launch -c gpt-oss-20b-lora gpt-oss-20b-lora.yaml
```

**For `gpt-oss-120b` with LoRA:**
- Requirements: 1 node, 8x H100 GPUs

```bash
sky launch -c gpt-oss-120b-lora gpt-oss-120b-lora.yaml
```
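
The LoRA tasks run on a single node and finish quickly. If you want the cluster to clean itself up once it goes idle, you can add SkyPilot's autostop flags at launch time (the 30-minute value below is just an example):

```bash
# Tear the cluster down automatically after 30 idle minutes.
sky launch -c gpt-oss-20b-lora gpt-oss-20b-lora.yaml -i 30 --down
```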

## Step 2: Monitor and get results

Once your finetuning job is running, you can monitor its progress and then clean up:

```bash
# Check cluster and job status
sky status

# Stream the job's logs
sky logs <cluster-name>

# Terminate the cluster when you are done
sky down <cluster-name>
```
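
To pull trained checkpoints back to your machine before tearing the cluster down, you can copy them over SSH using the cluster name as the host. The remote path below is a placeholder; use whatever output directory your training run writes to:

```bash
# Copy checkpoints from the cluster; adjust the remote path to your output_dir.
rsync -avz gpt-oss-20b-sft:/path/to/output/ ./gpt-oss-checkpoints/
```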

### Example full finetuning progress

Here's what you can expect to see during training: the loss should decrease and the mean token accuracy should improve over time.

#### gpt-oss-20b training progress

```
Training Progress for gpt-oss-20b on Nebius:
  6%|▋         | 1/16 [01:18<19:31, 78.12s/it]
{'loss': 2.2344, 'grad_norm': 17.139, 'learning_rate': 0.0, 'num_tokens': 51486.0, 'mean_token_accuracy': 0.5436, 'epoch': 0.06}

 12%|█▎        | 2/16 [01:23<08:10, 35.06s/it]
{'loss': 2.1689, 'grad_norm': 16.724, 'learning_rate': 0.0002, 'num_tokens': 105023.0, 'mean_token_accuracy': 0.5596, 'epoch': 0.12}

 25%|██▌       | 4/16 [01:34<03:03, 15.26s/it]
{'loss': 2.1548, 'grad_norm': 3.983, 'learning_rate': 0.000192, 'num_tokens': 214557.0, 'mean_token_accuracy': 0.5182, 'epoch': 0.25}

 50%|█████     | 8/16 [01:56<00:59, 7.43s/it]
{'loss': 2.1323, 'grad_norm': 3.460, 'learning_rate': 0.000138, 'num_tokens': 428975.0, 'mean_token_accuracy': 0.5432, 'epoch': 0.5}

 75%|███████▌  | 12/16 [02:15<00:21, 5.50s/it]
{'loss': 1.4624, 'grad_norm': 0.888, 'learning_rate': 6.5e-05, 'num_tokens': 641021.0, 'mean_token_accuracy': 0.6522, 'epoch': 0.75}

100%|██████████| 16/16 [02:34<00:00, 4.88s/it]
{'loss': 1.1294, 'grad_norm': 0.713, 'learning_rate': 2.2e-05, 'num_tokens': 852192.0, 'mean_token_accuracy': 0.7088, 'epoch': 1.0}

Final Training Summary:
{'train_runtime': 298.36s, 'train_samples_per_second': 3.352, 'train_steps_per_second': 0.054, 'train_loss': 2.086, 'epoch': 1.0}
✓ Job finished (status: SUCCEEDED).
```

Memory and GPU utilization, monitored with [nvitop](https://github.com/XuehaiPan/nvitop):

![nvitop](https://i.imgur.com/pGqj9RD.png)

#### gpt-oss-120b training progress

```
Training Progress for gpt-oss-120b on 4 nodes:
  3%|▏         | 1/32 [03:45<116:23, 225.28s/it]
  6%|▋         | 2/32 [06:12<90:21, 181.05s/it]
  9%|▉         | 3/32 [08:45<71:22, 147.67s/it]
 12%|█▎        | 4/32 [11:18<59:44, 128.01s/it]
 25%|██▌       | 8/32 [22:36<67:48, 169.50s/it]
 44%|████▍     | 14/32 [29:03<43:37, 145.41s/it]
```

Memory and GPU utilization, monitored with [nvitop](https://github.com/XuehaiPan/nvitop):

![nvitop](https://i.imgur.com/dAov9ud.png)

## Configuration files

You can find the complete configurations in [the following directory](https://github.com/skypilot-org/skypilot/blob/master/llm/gpt-oss-finetuning/).
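
For example, to launch the 20b full finetuning task straight from a checkout of the repository:

```bash
# Clone the repo and launch from the example directory.
git clone https://github.com/skypilot-org/skypilot.git
cd skypilot/llm/gpt-oss-finetuning
sky launch -c gpt-oss-20b-sft gpt-oss-20b-sft.yaml
```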
llm/gpt-oss-finetuning/gpt-oss-120b-lora.yaml

Lines changed: 29 additions & 0 deletions

```yaml
resources:
  accelerators: H100:8
  network_tier: best

file_mounts:
  /sft: ./sft

num_nodes: 1

setup: |
  conda install cuda -c nvidia
  uv venv ~/training --seed --python 3.10
  source ~/training/bin/activate
  uv pip install torch --index-url https://download.pytorch.org/whl/cu128
  uv pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0"
  uv pip install deepspeed
  uv pip install git+https://github.com/huggingface/accelerate.git@c0a3aefea8aa5008a0fbf55b049bd3f0efa9cbf2

  uv pip install nvitop

run: |
  source ~/training/bin/activate

  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  python /sft/train.py --model_id openai/gpt-oss-120b --enable_lora
```

llm/gpt-oss-finetuning/gpt-oss-120b-sft.yaml

Lines changed: 28 additions & 0 deletions

```yaml
resources:
  accelerators: H200:8
  network_tier: best

file_mounts:
  /sft: ./sft

num_nodes: 4

setup: |
  conda install cuda -c nvidia
  uv venv ~/training --seed --python 3.10
  source ~/training/bin/activate
  uv pip install torch --index-url https://download.pytorch.org/whl/cu128
  uv pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0"
  uv pip install deepspeed
  uv pip install git+https://github.com/huggingface/accelerate.git@c0a3aefea8aa5008a0fbf55b049bd3f0efa9cbf2

  uv pip install nvitop

run: |
  source ~/training/bin/activate

  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  accelerate launch \
    --config_file /sft/fsdp2_120b.yaml \
    --num_machines $SKYPILOT_NUM_NODES \
    --num_processes $NP \
    --machine_rank $SKYPILOT_NODE_RANK \
    --main_process_ip $MASTER_ADDR \
    --main_process_port 29500 \
    /sft/train.py --model_id openai/gpt-oss-120b
```

llm/gpt-oss-finetuning/gpt-oss-20b-lora.yaml

Lines changed: 29 additions & 0 deletions

```yaml
resources:
  accelerators: H100:2
  network_tier: best

file_mounts:
  /sft: ./sft

num_nodes: 1

setup: |
  conda install cuda -c nvidia
  uv venv ~/training --seed --python 3.10
  source ~/training/bin/activate
  uv pip install torch --index-url https://download.pytorch.org/whl/cu128
  uv pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0"
  uv pip install deepspeed
  uv pip install git+https://github.com/huggingface/accelerate.git@c0a3aefea8aa5008a0fbf55b049bd3f0efa9cbf2

  uv pip install nvitop

run: |
  source ~/training/bin/activate

  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  python /sft/train.py --model_id openai/gpt-oss-20b --enable_lora
```

llm/gpt-oss-finetuning/gpt-oss-20b-sft.yaml

Lines changed: 29 additions & 0 deletions

```yaml
resources:
  accelerators: H100:8
  network_tier: best

file_mounts:
  /sft: ./sft

num_nodes: 1

setup: |
  conda install cuda -c nvidia
  uv venv ~/training --seed --python 3.10
  source ~/training/bin/activate
  uv pip install torch --index-url https://download.pytorch.org/whl/cu128
  uv pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0"
  uv pip install deepspeed
  uv pip install git+https://github.com/huggingface/accelerate.git@c0a3aefea8aa5008a0fbf55b049bd3f0efa9cbf2

  uv pip install nvitop

run: |
  source ~/training/bin/activate

  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  accelerate launch \
    --config_file /sft/fsdp2.yaml \
    --num_machines $SKYPILOT_NUM_NODES \
    --num_processes $NP \
    --machine_rank $SKYPILOT_NODE_RANK \
    --main_process_ip $MASTER_ADDR \
    --main_process_port 29500 \
    /sft/train.py --model_id openai/gpt-oss-20b
```

Lines changed: 31 additions & 0 deletions

```yaml
# Requires accelerate 1.7.0 or higher
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  # Recompute activations in the backward pass to save GPU memory.
  fsdp_activation_checkpointing: true
  # FULL_SHARD shards parameters, gradients, and optimizer state across ranks.
  fsdp_sharding_strategy: FULL_SHARD
  # Wrap each GptOssDecoderLayer as its own FSDP unit.
  fsdp_transformer_layer_cls_to_wrap: GptOssDecoderLayer
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: false
  fsdp_offload_params: false
  fsdp_reshard_after_forward: false
  fsdp_use_orig_params: true
  # fsdp_state_dict_type: FULL_STATE_DICT
  # Save checkpoints as sharded state dicts (one shard per rank).
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_forward_prefetch: true
  fsdp_version: 2
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: c10d
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

Lines changed: 29 additions & 0 deletions

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_transformer_layer_cls_to_wrap: GptOssDecoderLayer
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_use_orig_params: false
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_forward_prefetch: false
  fsdp_version: 2
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: c10d
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
