# Finetuning OpenAI gpt-oss Models with SkyPilot

On August 5, 2025, OpenAI released [gpt-oss](https://openai.com/open-models/), including two state-of-the-art open-weight language models: `gpt-oss-120b` and `gpt-oss-20b`. These models deliver strong real-world performance at low cost and are available under the flexible Apache 2.0 license.

The `gpt-oss-120b` model achieves near-parity with OpenAI o4-mini on core reasoning benchmarks, while the `gpt-oss-20b` model delivers similar results to OpenAI o3-mini.

This guide walks through finetuning both models, with either LoRA or full finetuning, using [🤗 Accelerate](https://github.com/huggingface/accelerate).

## Step 0: Setup infrastructure

SkyPilot is a framework for running AI and batch workloads on any infrastructure, offering unified execution, high cost savings, and high GPU availability.

### Install SkyPilot

```bash
pip install 'skypilot[all]'
```

For more details on how to set up your cloud credentials, see the [SkyPilot docs](https://docs.skypilot.co).

### Choose your infrastructure

`sky check` verifies which clouds and Kubernetes clusters your credentials give you access to:

```bash
sky check
```
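
To see which of your enabled infra offers the GPUs used in this guide, and at what price, you can also query SkyPilot's catalog (the output depends on which clouds you have enabled):

```bash
# List clouds/regions offering 8x H100, used for full finetuning of gpt-oss-20b
sky show-gpus H100:8

# And 8x H200, used for the multi-node gpt-oss-120b run
sky show-gpus H200:8
```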

## Step 1: Finetune gpt-oss models

### Full finetuning

**For `gpt-oss-20b` (smaller model):**
- Requirements: 1 node, 8x H100 GPUs
```bash
sky launch -c gpt-oss-20b-sft gpt-oss-20b-sft.yaml
```
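
The 20b task YAML is not reproduced here; the complete file lives in the repo linked at the end. As a rough sketch, it mirrors the 120b task shown in the next subsection, differing mainly in resources, node count, and model id (the `fsdp2_20b.yaml` config name below is an assumption):

```yaml
# gpt-oss-20b-sft.yaml (illustrative sketch; see the linked repo for the real file)
resources:
  accelerators: H100:8

file_mounts:
  /sft: ./sft

num_nodes: 1

# setup is the same as in the 120b task below

run: |
  source ~/training/bin/activate
  # Single node, so no multi-node rendezvous flags are needed
  accelerate launch \
    --config_file /sft/fsdp2_20b.yaml \
    --num_processes $SKYPILOT_NUM_GPUS_PER_NODE \
    /sft/train.py --model_id openai/gpt-oss-20b
```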

**For `gpt-oss-120b` (larger model):**
- Requirements: 4 nodes, 8x H200 GPUs each
```bash
sky launch -c gpt-oss-120b-sft gpt-oss-120b-sft.yaml
```

This task requests four H200 nodes, mounts the local `./sft` directory (the training script and Accelerate config) onto every node, and uses SkyPilot's injected environment variables to wire up the multi-node `accelerate` rendezvous:

```yaml
# gpt-oss-120b-sft.yaml
resources:
  accelerators: H200:8
  network_tier: best  # prefer the fastest available inter-node networking

file_mounts:
  /sft: ./sft  # synced to every node

num_nodes: 4

setup: |
  conda install -y cuda -c nvidia
  uv venv ~/training --seed --python 3.10
  source ~/training/bin/activate
  uv pip install torch --index-url https://download.pytorch.org/whl/cu128
  uv pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0"
  uv pip install deepspeed
  uv pip install git+https://github.com/huggingface/accelerate.git@c0a3aefea8aa5008a0fbf55b049bd3f0efa9cbf2

  uv pip install nvitop

run: |
  source ~/training/bin/activate

  # The first IP in SKYPILOT_NODE_IPS is the head node; all ranks rendezvous there
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  # One process per GPU across all nodes
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  accelerate launch \
    --config_file /sft/fsdp2_120b.yaml \
    --num_machines $SKYPILOT_NUM_NODES \
    --num_processes $NP \
    --machine_rank $SKYPILOT_NODE_RANK \
    --main_process_ip $MASTER_ADDR \
    --main_process_port 29500 \
    /sft/train.py --model_id openai/gpt-oss-120b
```
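
The run command references `/sft/fsdp2_120b.yaml`, an 🤗 Accelerate config that enables FSDP2 sharding. The actual file ships in the repo linked at the end; below is a minimal sketch of what such a config typically looks like. Exact keys depend on your Accelerate version, and the machine/process counts are overridden by the CLI flags above:

```yaml
# fsdp2_120b.yaml (illustrative sketch; the actual file is in the linked repo)
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
fsdp_config:
  fsdp_version: 2                           # FSDP2, per the file name
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_state_dict_type: SHARDED_STATE_DICT  # each rank saves its own shard
  fsdp_offload_params: false
main_training_function: main
# num_machines / num_processes / machine_rank are supplied on the
# `accelerate launch` command line, so they can be left at defaults here
```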

### LoRA finetuning

**For `gpt-oss-20b` with LoRA:**
- Requirements: 1 node, 2x H100 GPUs
```bash
sky launch -c gpt-oss-20b-lora gpt-oss-20b-lora.yaml
```

**For `gpt-oss-120b` with LoRA:**
- Requirements: 1 node, 8x H100 GPUs
```bash
sky launch -c gpt-oss-120b-lora gpt-oss-120b-lora.yaml
```
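
The LoRA tasks follow the same shape as the full-finetuning ones, just with lighter resource requests, since only the low-rank adapter weights are trained. A sketch of the 20b variant is below; the real YAMLs are in the repo linked at the end, and the `--lora` flag on the training script is a hypothetical argument, not confirmed from the source:

```yaml
# gpt-oss-20b-lora.yaml (illustrative sketch; see the linked repo for the real file)
resources:
  accelerators: H100:2  # LoRA trains only adapter weights, so 2 GPUs suffice

file_mounts:
  /sft: ./sft

num_nodes: 1

run: |
  source ~/training/bin/activate
  accelerate launch \
    --num_processes $SKYPILOT_NUM_GPUS_PER_NODE \
    /sft/train.py --model_id openai/gpt-oss-20b --lora  # hypothetical flag
```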

## Step 2: Monitor and get results

Once your finetuning job is running, you can monitor the progress and retrieve results:

```bash
# Check cluster and job status
sky status

# Stream the training logs
sky logs <cluster-name>

# Pull artifacts off the cluster when training completes
# (SkyPilot sets up SSH access, so the cluster name works as an SSH host)
rsync -Pavz <cluster-name>:/path/to/checkpoints ./checkpoints

# Terminate the cluster when you are done
sky down <cluster-name>
```
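
The cluster stays up until you run `sky down`, so you can iterate on the training script without reprovisioning. `sky exec` reruns a task's `run` section on an existing cluster, skipping provisioning and `setup`:

```bash
# Rerun training on the existing cluster (provisioning and setup are skipped)
sky exec gpt-oss-20b-sft gpt-oss-20b-sft.yaml
```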

### Example full finetuning progress

Here's what you can expect to see during training: the loss should decrease and the mean token accuracy should improve over time.

#### gpt-oss-20b training progress

```
Training Progress for gpt-oss-20b on Nebius:
  6%|▋         | 1/16 [01:18<19:31, 78.12s/it]
{'loss': 2.2344, 'grad_norm': 17.139, 'learning_rate': 0.0, 'num_tokens': 51486.0, 'mean_token_accuracy': 0.5436, 'epoch': 0.06}

 12%|█▎        | 2/16 [01:23<08:10, 35.06s/it]
{'loss': 2.1689, 'grad_norm': 16.724, 'learning_rate': 0.0002, 'num_tokens': 105023.0, 'mean_token_accuracy': 0.5596, 'epoch': 0.12}

 25%|██▌       | 4/16 [01:34<03:03, 15.26s/it]
{'loss': 2.1548, 'grad_norm': 3.983, 'learning_rate': 0.000192, 'num_tokens': 214557.0, 'mean_token_accuracy': 0.5182, 'epoch': 0.25}

 50%|█████     | 8/16 [01:56<00:59, 7.43s/it]
{'loss': 2.1323, 'grad_norm': 3.460, 'learning_rate': 0.000138, 'num_tokens': 428975.0, 'mean_token_accuracy': 0.5432, 'epoch': 0.5}

 75%|███████▌  | 12/16 [02:15<00:21, 5.50s/it]
{'loss': 1.4624, 'grad_norm': 0.888, 'learning_rate': 6.5e-05, 'num_tokens': 641021.0, 'mean_token_accuracy': 0.6522, 'epoch': 0.75}

100%|██████████| 16/16 [02:34<00:00, 4.88s/it]
{'loss': 1.1294, 'grad_norm': 0.713, 'learning_rate': 2.2e-05, 'num_tokens': 852192.0, 'mean_token_accuracy': 0.7088, 'epoch': 1.0}

Final Training Summary:
{'train_runtime': 298.36, 'train_samples_per_second': 3.352, 'train_steps_per_second': 0.054, 'train_loss': 2.086, 'epoch': 1.0}
✓ Job finished (status: SUCCEEDED).
```

Memory and GPU utilization, as reported by [nvitop](https://github.com/XuehaiPan/nvitop):

![Image](https://i-blog.csdnimg.cn/img_convert/c8dc8dd8bd1c24b5628fb0f3d770920b.png)

#### gpt-oss-120b training progress

```
Training Progress for gpt-oss-120b on 4 nodes:
  3%|▏         | 1/32 [03:45<116:23, 225.28s/it]
  6%|▋         | 2/32 [06:12<90:21, 181.05s/it]
  9%|▉         | 3/32 [08:45<71:22, 147.67s/it]
 12%|█▎        | 4/32 [11:18<59:44, 128.01s/it]
 25%|██▌       | 8/32 [22:36<67:48, 169.50s/it]
 44%|████▍     | 14/32 [29:03<43:37, 145.41s/it]
```

Memory and GPU utilization, as reported by [nvitop](https://github.com/XuehaiPan/nvitop):

![Image](https://i-blog.csdnimg.cn/img_convert/57d0fa1fb1351cc6eff6f0a283efd41a.png)

## Configuration files

You can find the complete configuration files in the [llm/gpt-oss-finetuning directory](https://github.com/skypilot-org/skypilot/blob/master/llm/gpt-oss-finetuning/) of the SkyPilot repository.