
Commit 8ef02a3

Maknee and Michaelvll authored
[Example] Full finetuning/LoRA gpt-oss 20b/120b (skypilot-org#6551)
* gpt-oss sft runining
* Format
* Add
* Add
* Add example
* Update README
* Comments
* REmove
* Add training
* Add training
* Add training
* Update README.md (Co-authored-by: Zhanghao Wu <[email protected]>)
* Update llm/gpt-oss-sft/README.md (Co-authored-by: Zhanghao Wu <[email protected]>)
* Update llm/gpt-oss-sft/README.md (Co-authored-by: Zhanghao Wu <[email protected]>)
* Add training
* Add training
* Add training
* Updated path
* Updated path
* remove the local images
* Updated path
* Updated path
* Removed to imgur
* Remove
* Update docs
* Update docs
* update the readme
* Update docs
* fix hyperbolic unit tests

---------

Co-authored-by: Maknee <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>
1 parent be23db4 commit 8ef02a3

12 files changed: +502 -2 lines changed

README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -39,7 +39,7 @@
 ----
 
 :fire: *News* :fire:
-- [Aug 2025] Run and serve **OpenAI GPT-OSS models** (gpt-oss-120b, gpt-oss-20b) with one command on any infra: [**example**](./llm/gpt-oss/)
+- [Aug 2025] Serve and finetune **OpenAI GPT-OSS models** (gpt-oss-120b, gpt-oss-20b) with one command on any infra: [**serve**](./llm/gpt-oss/) + [**LoRA and full finetuning**](./llm/gpt-oss-sft/)
 - [Jul 2025] Run distributed **RL training for LLMs** with Verl (PPO, GRPO) on any cloud: [**example**](./llm/verl/)
 - [Jul 2025] 🎉 SkyPilot v0.10.0 released! [**blog post**](https://blog.skypilot.co/announcing-skypilot-0.10.0/), [**release notes**](https://github.com/skypilot-org/skypilot/releases/tag/v0.10.0)
 - [Jul 2025] Finetune **Llama4** on any distributed cluster/cloud: [**example**](./llm/llama-4-finetuning/)
```
docs/source/examples/training/gpt-oss-finetuning.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -0,0 +1 @@
+../../../../llm/gpt-oss-finetuning/README.md
```

docs/source/examples/training/index.rst

Lines changed: 1 addition & 0 deletions
```diff
@@ -8,6 +8,7 @@ Training
 DeepSpeed <deepspeed.md>
 Distributed PyTorch <distributed-pytorch.md>
 Distributed TensorFlow <distributed-tensorflow.md>
+Finetuning GPT-OSS <gpt-oss-finetuning.md>
 Finetuning Llama 4 <llama-4-finetuning.md>
 Finetuning Llama 3 <llama-3_1-finetuning.md>
 Finetuning Llama 2 <llama-2-finetuning.md>
```

llm/gpt-oss-finetuning/README.md

Lines changed: 166 additions & 0 deletions
# Finetuning OpenAI gpt-oss Models with SkyPilot

![](https://i.imgur.com/TkoqCQK.png)

On August 5, 2025, OpenAI released [gpt-oss](https://openai.com/open-models/), including two state-of-the-art open-weight language models: `gpt-oss-120b` and `gpt-oss-20b`. These models deliver strong real-world performance at low cost and are available under the flexible Apache 2.0 license.

The `gpt-oss-120b` model achieves near-parity with OpenAI o4-mini on core reasoning benchmarks, while the `gpt-oss-20b` model delivers similar results to OpenAI o3-mini.

This guide walks through how to finetune both models, with either LoRA or full finetuning, using [🤗 Accelerate](https://github.com/huggingface/accelerate).

![Cloud Logos](https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/cloud-logos-light.png)

## Step 0: Set up infrastructure

SkyPilot is a framework for running AI and batch workloads on any infrastructure, offering unified execution, high cost savings, and high GPU availability.

### Install SkyPilot

```bash
pip install 'skypilot[all]'
```
For more details on how to set up your cloud credentials, see the [SkyPilot docs](https://docs.skypilot.co).
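
If you only plan to use a subset of clouds, you can install just those extras instead of `[all]`; for example:

```bash
# Install SkyPilot with support for a specific set of providers only.
pip install 'skypilot[aws,gcp,kubernetes]'
```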

### Choose your infrastructure

`sky check` verifies which infra (clouds or Kubernetes clusters) your credentials currently give you access to:

```bash
sky check
```
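
You can also check where the GPU types used in this guide are available and what they cost; the exact output depends on the infra you have enabled:

```bash
# List infra offering these GPU types, with on-demand prices.
sky show-gpus H100:8
sky show-gpus H200:8
```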

## Step 1: Finetune the gpt-oss models

### Full finetuning

**For `gpt-oss-20b` (smaller model):**
- Requirements: 1 node, 8x H100 GPUs

```bash
sky launch -c gpt-oss-20b-sft gpt-oss-20b-sft.yaml
```

**For `gpt-oss-120b` (larger model):**
- Requirements: 4 nodes, 8x H200 GPUs each

```bash
sky launch -c gpt-oss-120b-sft gpt-oss-120b-sft.yaml
```

The `gpt-oss-120b-sft.yaml` task definition looks like this:

```yaml
# gpt-oss-120b-sft.yaml
resources:
  accelerators: H200:8
  network_tier: best

file_mounts:
  /sft: ./sft

num_nodes: 4

setup: |
  conda install cuda -c nvidia
  uv venv ~/training --seed --python 3.10
  source ~/training/bin/activate
  uv pip install torch --index-url https://download.pytorch.org/whl/cu128
  uv pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0"
  uv pip install deepspeed
  uv pip install git+https://github.com/huggingface/accelerate.git@c0a3aefea8aa5008a0fbf55b049bd3f0efa9cbf2

  uv pip install nvitop

run: |
  source ~/training/bin/activate

  # The first IP in SKYPILOT_NODE_IPS is the head node.
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  # Total number of processes = GPUs per node x number of nodes.
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  accelerate launch \
    --config_file /sft/fsdp2_120b.yaml \
    --num_machines $SKYPILOT_NUM_NODES \
    --num_processes $NP \
    --machine_rank $SKYPILOT_NODE_RANK \
    --main_process_ip $MASTER_ADDR \
    --main_process_port 29500 \
    /sft/train.py --model_id openai/gpt-oss-120b
```
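
SkyPilot runs `setup` on every node and then executes `run` on each node with `SKYPILOT_NODE_RANK` set, which lets `accelerate` form the multi-node process group. While the job runs, you can SSH into any node to watch GPU utilization; SkyPilot creates SSH aliases named after the cluster (workers follow the `<cluster>-workerN` convention):

```bash
# SSH into the head node or a worker of the 120b cluster.
ssh gpt-oss-120b-sft
ssh gpt-oss-120b-sft-worker1
# On the node, activate the training venv and launch the GPU monitor.
source ~/training/bin/activate && nvitop
```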

### LoRA finetuning

**For `gpt-oss-20b` with LoRA:**
- Requirements: 1 node, 2x H100 GPUs

```bash
sky launch -c gpt-oss-20b-lora gpt-oss-20b-lora.yaml
```

**For `gpt-oss-120b` with LoRA:**
- Requirements: 1 node, 8x H100 GPUs

```bash
sky launch -c gpt-oss-120b-lora gpt-oss-120b-lora.yaml
```
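
The LoRA tasks run on a single node and finish quickly. If you want the cluster to clean itself up once it goes idle, you can add SkyPilot's autostop flags at launch time (the 30-minute value below is just an example):

```bash
# Tear the cluster down automatically after 30 idle minutes.
sky launch -c gpt-oss-20b-lora gpt-oss-20b-lora.yaml -i 30 --down
```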

## Step 2: Monitor and get results

Once your finetuning job is running, you can monitor its progress and then clean up:

```bash
# Check cluster and job status
sky status

# Stream the job's logs
sky logs <cluster-name>

# Terminate the cluster when you are done
sky down <cluster-name>
```
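
To pull trained checkpoints back to your machine before tearing the cluster down, you can copy them over SSH using the cluster name as the host. The remote path below is a placeholder; use whatever output directory your training run writes to:

```bash
# Copy checkpoints from the cluster; adjust the remote path to your output_dir.
rsync -avz gpt-oss-20b-sft:/path/to/output/ ./gpt-oss-checkpoints/
```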

### Example full finetuning progress

Here's what you can expect to see during training: the loss should decrease and the mean token accuracy should improve over time.

#### gpt-oss-20b training progress

```
Training Progress for gpt-oss-20b on Nebius:
  6%|▋         | 1/16 [01:18<19:31, 78.12s/it]
{'loss': 2.2344, 'grad_norm': 17.139, 'learning_rate': 0.0, 'num_tokens': 51486.0, 'mean_token_accuracy': 0.5436, 'epoch': 0.06}

 12%|█▎        | 2/16 [01:23<08:10, 35.06s/it]
{'loss': 2.1689, 'grad_norm': 16.724, 'learning_rate': 0.0002, 'num_tokens': 105023.0, 'mean_token_accuracy': 0.5596, 'epoch': 0.12}

 25%|██▌       | 4/16 [01:34<03:03, 15.26s/it]
{'loss': 2.1548, 'grad_norm': 3.983, 'learning_rate': 0.000192, 'num_tokens': 214557.0, 'mean_token_accuracy': 0.5182, 'epoch': 0.25}

 50%|█████     | 8/16 [01:56<00:59, 7.43s/it]
{'loss': 2.1323, 'grad_norm': 3.460, 'learning_rate': 0.000138, 'num_tokens': 428975.0, 'mean_token_accuracy': 0.5432, 'epoch': 0.5}

 75%|███████▌  | 12/16 [02:15<00:21, 5.50s/it]
{'loss': 1.4624, 'grad_norm': 0.888, 'learning_rate': 6.5e-05, 'num_tokens': 641021.0, 'mean_token_accuracy': 0.6522, 'epoch': 0.75}

100%|██████████| 16/16 [02:34<00:00, 4.88s/it]
{'loss': 1.1294, 'grad_norm': 0.713, 'learning_rate': 2.2e-05, 'num_tokens': 852192.0, 'mean_token_accuracy': 0.7088, 'epoch': 1.0}

Final Training Summary:
{'train_runtime': 298.36s, 'train_samples_per_second': 3.352, 'train_steps_per_second': 0.054, 'train_loss': 2.086, 'epoch': 1.0}
✓ Job finished (status: SUCCEEDED).
```

Memory and GPU utilization, monitored with [nvitop](https://github.com/XuehaiPan/nvitop):

![nvitop](https://i.imgur.com/pGqj9RD.png)

#### gpt-oss-120b training progress

```
Training Progress for gpt-oss-120b on 4 nodes:
  3%|▏         | 1/32 [03:45<116:23, 225.28s/it]
  6%|▋         | 2/32 [06:12<90:21, 181.05s/it]
  9%|▉         | 3/32 [08:45<71:22, 147.67s/it]
 12%|█▎        | 4/32 [11:18<59:44, 128.01s/it]
 25%|██▌       | 8/32 [22:36<67:48, 169.50s/it]
 44%|████▍     | 14/32 [29:03<43:37, 145.41s/it]
```

Memory and GPU utilization, monitored with [nvitop](https://github.com/XuehaiPan/nvitop):

![nvitop](https://i.imgur.com/dAov9ud.png)

## Configuration files

You can find the complete configurations in [the following directory](https://github.com/skypilot-org/skypilot/blob/master/llm/gpt-oss-finetuning/).
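
For example, to launch the 20b full finetuning task straight from a checkout of the repository:

```bash
# Clone the repo and launch from the example directory.
git clone https://github.com/skypilot-org/skypilot.git
cd skypilot/llm/gpt-oss-finetuning
sky launch -c gpt-oss-20b-sft gpt-oss-20b-sft.yaml
```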
llm/gpt-oss-finetuning/gpt-oss-120b-lora.yaml

Lines changed: 29 additions & 0 deletions

```yaml
resources:
  accelerators: H100:8
  network_tier: best

file_mounts:
  /sft: ./sft

num_nodes: 1

setup: |
  conda install cuda -c nvidia
  uv venv ~/training --seed --python 3.10
  source ~/training/bin/activate
  uv pip install torch --index-url https://download.pytorch.org/whl/cu128
  uv pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0"
  uv pip install deepspeed
  uv pip install git+https://github.com/huggingface/accelerate.git@c0a3aefea8aa5008a0fbf55b049bd3f0efa9cbf2

  uv pip install nvitop

run: |
  source ~/training/bin/activate

  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  python /sft/train.py --model_id openai/gpt-oss-120b --enable_lora
```

llm/gpt-oss-finetuning/gpt-oss-120b-sft.yaml

Lines changed: 28 additions & 0 deletions

```yaml
resources:
  accelerators: H200:8
  network_tier: best

file_mounts:
  /sft: ./sft

num_nodes: 4

setup: |
  conda install cuda -c nvidia
  uv venv ~/training --seed --python 3.10
  source ~/training/bin/activate
  uv pip install torch --index-url https://download.pytorch.org/whl/cu128
  uv pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0"
  uv pip install deepspeed
  uv pip install git+https://github.com/huggingface/accelerate.git@c0a3aefea8aa5008a0fbf55b049bd3f0efa9cbf2

  uv pip install nvitop

run: |
  source ~/training/bin/activate

  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  accelerate launch \
    --config_file /sft/fsdp2_120b.yaml \
    --num_machines $SKYPILOT_NUM_NODES \
    --num_processes $NP \
    --machine_rank $SKYPILOT_NODE_RANK \
    --main_process_ip $MASTER_ADDR \
    --main_process_port 29500 \
    /sft/train.py --model_id openai/gpt-oss-120b
```

llm/gpt-oss-finetuning/gpt-oss-20b-lora.yaml

Lines changed: 29 additions & 0 deletions

```yaml
resources:
  accelerators: H100:2
  network_tier: best

file_mounts:
  /sft: ./sft

num_nodes: 1

setup: |
  conda install cuda -c nvidia
  uv venv ~/training --seed --python 3.10
  source ~/training/bin/activate
  uv pip install torch --index-url https://download.pytorch.org/whl/cu128
  uv pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0"
  uv pip install deepspeed
  uv pip install git+https://github.com/huggingface/accelerate.git@c0a3aefea8aa5008a0fbf55b049bd3f0efa9cbf2

  uv pip install nvitop

run: |
  source ~/training/bin/activate

  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  python /sft/train.py --model_id openai/gpt-oss-20b --enable_lora
```

llm/gpt-oss-finetuning/gpt-oss-20b-sft.yaml

Lines changed: 29 additions & 0 deletions

```yaml
resources:
  accelerators: H100:8
  network_tier: best

file_mounts:
  /sft: ./sft

num_nodes: 1

setup: |
  conda install cuda -c nvidia
  uv venv ~/training --seed --python 3.10
  source ~/training/bin/activate
  uv pip install torch --index-url https://download.pytorch.org/whl/cu128
  uv pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0"
  uv pip install deepspeed
  uv pip install git+https://github.com/huggingface/accelerate.git@c0a3aefea8aa5008a0fbf55b049bd3f0efa9cbf2

  uv pip install nvitop

run: |
  source ~/training/bin/activate

  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  accelerate launch \
    --config_file /sft/fsdp2.yaml \
    --num_machines $SKYPILOT_NUM_NODES \
    --num_processes $NP \
    --machine_rank $SKYPILOT_NODE_RANK \
    --main_process_ip $MASTER_ADDR \
    --main_process_port 29500 \
    /sft/train.py --model_id openai/gpt-oss-20b
```

Lines changed: 31 additions & 0 deletions

```yaml
# Requires accelerate 1.7.0 or higher
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  # Recompute activations in the backward pass to save GPU memory.
  fsdp_activation_checkpointing: true
  # FULL_SHARD shards parameters, gradients, and optimizer state across ranks.
  fsdp_sharding_strategy: FULL_SHARD
  # Wrap each GptOssDecoderLayer as its own FSDP unit.
  fsdp_transformer_layer_cls_to_wrap: GptOssDecoderLayer
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: false
  fsdp_offload_params: false
  fsdp_reshard_after_forward: false
  fsdp_use_orig_params: true
  # fsdp_state_dict_type: FULL_STATE_DICT
  # Save checkpoints as sharded state dicts (one shard per rank).
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_forward_prefetch: true
  fsdp_version: 2
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: c10d
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

Lines changed: 29 additions & 0 deletions

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_transformer_layer_cls_to_wrap: GptOssDecoderLayer
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_use_orig_params: false
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_forward_prefetch: false
  fsdp_version: 2
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: c10d
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
