47 commits
4316425  wip (hjh0119, Aug 29, 2025)
5d46eae  init wip (hjh0119, Sep 1, 2025)
5828229  args wip (hjh0119, Sep 1, 2025)
a82cec4  Merge remote-tracking branch 'origin/main' into mega-grpo (hjh0119, Sep 2, 2025)
0689b76  reuse _prepare_rollout_engine (hjh0119, Sep 3, 2025)
46593cf  merge main (hjh0119, Sep 11, 2025)
3da8756  mega wip (hjh0119, Sep 12, 2025)
2ca7ac1  Merge remote-tracking branch 'origin' into mega-grpo (hjh0119, Sep 17, 2025)
d9ec029  wip (hjh0119, Sep 17, 2025)
7c56f9f  override train_step wip (hjh0119, Sep 17, 2025)
686fc74  remove override train_step to grpo (hjh0119, Sep 18, 2025)
095bcbd  Merge remote-tracking branch 'origin' into mega-grpo (hjh0119, Sep 18, 2025)
4d9457b  sync weight wip (hjh0119, Sep 18, 2025)
f52d5e1  rollout wip (hjh0119, Sep 19, 2025)
155d4fb  Merge remote-tracking branch 'origin' into mega-grpo (hjh0119, Sep 22, 2025)
3c69c39  modify mini_batch_size to generation batch size (hjh0119, Sep 22, 2025)
eebdd47  wip (hjh0119, Sep 24, 2025)
de6ecfe  loss wip (hjh0119, Sep 28, 2025)
4569e54  fix repeat n (hjh0119, Sep 28, 2025)
f118935  Merge remote-tracking branch 'origin' into mega-grpo (hjh0119, Sep 29, 2025)
9cb84e3  fix padding to multiple of tp_size (hjh0119, Sep 29, 2025)
8627aa3  compute loss (hjh0119, Sep 29, 2025)
2292cf8  fix logps (hjh0119, Sep 30, 2025)
bbe5f39  logging & patch VL (hjh0119, Sep 30, 2025)
6a2940c  fix rollout_group & rollout judgement (hjh0119, Oct 1, 2025)
486c3d4  fix step (hjh0119, Oct 6, 2025)
7e8e6b0  merge main (hjh0119, Oct 6, 2025)
c68d976  move old base trainer to newer (hjh0119, Oct 7, 2025)
6b1653c  fix (hjh0119, Oct 8, 2025)
d4a9dcc  offload utils (hjh0119, Oct 8, 2025)
9dc92a0  offload context (hjh0119, Oct 9, 2025)
7bc3d61  Resolve merge conflict in megatron_args.py by removing duplicate fiel… (hjh0119, Oct 9, 2025)
91f97ca  fix resolve (hjh0119, Oct 9, 2025)
59f436c  fix logps (hjh0119, Oct 9, 2025)
8dea6d7  fix old logps (hjh0119, Oct 9, 2025)
abac696  reduce redundancy (hjh0119, Oct 9, 2025)
3a3ff37  replace token (hjh0119, Oct 10, 2025)
2cd89dc  fix offload model (hjh0119, Oct 10, 2025)
50d5e6f  offload optimizer & ref (hjh0119, Oct 11, 2025)
e1a06c6  support cp (hjh0119, Oct 11, 2025)
ff9b667  fix pp+cp (hjh0119, Oct 11, 2025)
ba4bfbf  lora wip (hjh0119, Oct 11, 2025)
e5a6252  Merge remote-tracking branch 'origin' into mega-grpo (hjh0119, Oct 13, 2025)
e22c790  arguments document (hjh0119, Oct 13, 2025)
b3de262  wip lora&cp (hjh0119, Oct 14, 2025)
d5bd92c  merge origin (hjh0119, Oct 14, 2025)
fe3270f  remove unused patch (hjh0119, Oct 14, 2025)
2 changes: 1 addition & 1 deletion docs/source/Instruction/GRPO/AdvancedResearch/GSPO.md
@@ -54,7 +54,7 @@ importance_weights = torch.exp(log_importance_weights)
- `importance_sampling_level sequence` (GSPO)
- `importance_sampling_level sequence_token` (GSPO-token)

Here, sequence_token requires ms-swift > 3.7 (installed from source)
Here, sequence_token requires ms-swift >= 3.8
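A minimal sketch of how the three levels differ in aggregating the per-token log-ratios before exponentiation (tensor names and shapes are illustrative assumptions, not the ms-swift implementation):

```python
import torch

def importance_weights(per_token_logps, old_logps, mask, level="token"):
    """Aggregate per-token log-ratios according to importance_sampling_level.

    per_token_logps / old_logps: (batch, seq_len) log-probs of the sampled tokens
    under the current and the behaviour policy.
    mask: (batch, seq_len) with 1 for valid completion tokens, 0 for padding.
    """
    log_ratio = per_token_logps - old_logps
    if level == "token":
        # GRPO: one ratio per token
        log_weights = log_ratio
    elif level == "sequence":
        # GSPO: average the log-ratio over valid tokens -> one ratio per sequence
        log_weights = (log_ratio * mask).sum(-1) / mask.sum(-1).clamp(min=1)
        log_weights = log_weights.unsqueeze(-1)
    elif level == "sequence_token":
        # GSPO-token: sequence-level weight, but gradients flow through each token
        seq_log_w = ((log_ratio * mask).sum(-1) / mask.sum(-1).clamp(min=1)).unsqueeze(-1)
        log_weights = per_token_logps - per_token_logps.detach() + seq_log_w.detach()
    else:
        raise ValueError(f"unknown importance_sampling_level: {level}")
    return torch.exp(log_weights)
```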

Other hyperparameters from the paper
20 changes: 11 additions & 9 deletions docs/source/Instruction/命令行参数.md
@@ -505,6 +505,15 @@ Reward model arguments are used in PPO and GRPO.

#### GRPO Arguments
- beta: KL regularization coefficient; default is 0.04. When set to 0, the ref model is not loaded.
- epsilon: clip coefficient; default is 0.2.
- epsilon_high: upper clip coefficient; default is None. When set, it forms the clipping range [epsilon, epsilon_high] together with epsilon.
- delta: upper clipping bound for two-sided GRPO from the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291). If set, it is recommended to be greater than 1 + epsilon. Default is None. A sketch of how these clipping parameters interact follows this list.
- overlong_filter: skip samples that were truncated for being overlong so they do not contribute to the loss; default is False.
- dynamic_sample: filter out data whose within-group reward standard deviation is 0 and sample additional new data; default is False.
- max_resample_times: with dynamic_sample enabled, limits the number of resampling attempts; default is 3.
- top_entropy_quantile: only tokens whose entropy falls in the specified top quantile contribute to the loss; default is 1.0, i.e. low-entropy tokens are not filtered. See the [documentation](./GRPO/AdvancedResearch/entropy_mask.md) for details.
- log_entropy: log the entropy dynamics during training; default is False. See the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics) for details.
- importance_sampling_level: controls how the importance sampling ratio is computed; options are `token`, `sequence`, and `sequence_token`; default is `token`. See the [GSPO documentation](./GRPO/AdvancedResearch/GSPO.md) for details.
- per_device_train_batch_size: training batch size per device; in GRPO this refers to the batch size of completions.
- per_device_eval_batch_size: evaluation batch size per device; in GRPO this refers to the batch size of completions.
- generation_batch_size: batch size for sampling completions; must be a multiple of num_processes * per_device_train_batch_size. Defaults to per_device_train_batch_size * gradient_accumulation_steps * num_processes.
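As referenced in the delta item above, a minimal sketch of how epsilon, epsilon_high, and delta shape the per-token surrogate loss (an illustration under assumed tensor shapes, not the ms-swift implementation):

```python
import torch

def clipped_per_token_loss(ratio, advantages, epsilon=0.2, epsilon_high=None, delta=None):
    """PPO-style clipped surrogate used by GRPO-like objectives (before masking/averaging).

    ratio: importance sampling ratio exp(logp_new - logp_old), same shape as advantages.
    """
    eps_high = epsilon_high if epsilon_high is not None else epsilon
    if delta is not None:
        # two-sided GRPO (INTELLECT-2): additionally cap the unclipped term at delta (> 1 + epsilon)
        unclipped = torch.clamp(ratio, max=delta) * advantages
    else:
        unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + eps_high) * advantages
    return -torch.min(unclipped, clipped)
```

With epsilon_high unset, the clipping range collapses to the symmetric band [1 - epsilon, 1 + epsilon].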
@@ -542,22 +551,15 @@ Reward model arguments are used in PPO and GRPO.
- completion_length_limit_scope: the scope of the `max_completion_length` limit in multi-turn conversations.
`total` limits the total output length across all turns to at most `max_completion_length`; `per_round` limits the output length of each turn.
- num_iterations: number of updates per batch; default is 1.
- epsilon: clip coefficient; default is 0.2.
- epsilon_high: upper clip coefficient; default is None. When set, it forms the clipping range [epsilon, epsilon_high] together with epsilon.
- delta: upper clipping bound for two-sided GRPO from the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291). If set, it is recommended to be greater than 1 + epsilon. Default is None.
- sync_ref_model: whether to periodically synchronize the ref_model; default is False.
- ref_model_mixup_alpha: controls the mix between the model and the previous ref_model during updates. The update rule is $π_{ref} = α * π_θ + (1 - α) * π_{ref_{prev}}$. Default is 0.6.
- ref_model_sync_steps: synchronization frequency; default is 512. A soft-update sketch follows this list.
- move_model_batches: when moving model parameters to a fast inference framework such as vLLM, the number of batches the layers are split into. Default is None, meaning the model is not split; otherwise it is split into move_model_batches + 1 (non-layer parameters) + 1 (multimodal parameters) batches. Note: this parameter is only meaningful for LoRA (PEFT) training.
- multi_turn_scheduler: multi-turn GRPO parameter; pass the corresponding plugin name and add the matching implementation in plugin/multi_turn.py.
- max_turns: maximum number of turns for multi-turn GRPO. Default is None, i.e. no limit.
- dynamic_sample: filter out data whose within-group reward standard deviation is 0 and sample additional new data; default is False.
- max_resample_times: with dynamic_sample enabled, limits the number of resampling attempts; default is 3.
- overlong_filter: skip samples that were truncated for being overlong so they do not contribute to the loss; default is False.
- top_entropy_quantile: only tokens whose entropy falls in the specified top quantile contribute to the loss; default is 1.0, i.e. low-entropy tokens are not filtered. See the [documentation](./GRPO/AdvancedResearch/entropy_mask.md) for details.
- log_entropy: log the entropy dynamics during training; default is False. See the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics) for details.
- importance_sampling_level: controls how the importance sampling ratio is computed; options are `token` and `sequence`. In `token` mode, the raw per-token log-probability ratios are kept; in `sequence` mode, the log-probability ratios of all valid tokens in the sequence are averaged. The [GSPO paper](https://www.arxiv.org/abs/2507.18071) uses sequence-level computation to stabilize training. Default is `token`.
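As noted in the ref_model_sync_steps item, a minimal sketch of the periodic reference-model soft update described by sync_ref_model, ref_model_mixup_alpha, and ref_model_sync_steps (an illustration, not the ms-swift implementation):

```python
import torch

@torch.no_grad()
def maybe_sync_ref_model(policy_model, ref_model, global_step,
                         ref_model_mixup_alpha=0.6, ref_model_sync_steps=512):
    """Every ref_model_sync_steps steps: ref = alpha * policy + (1 - alpha) * previous ref."""
    if global_step % ref_model_sync_steps != 0:
        return
    for p_ref, p_policy in zip(ref_model.parameters(), policy_model.parameters()):
        p_ref.data.mul_(1.0 - ref_model_mixup_alpha).add_(p_policy.data, alpha=ref_model_mixup_alpha)
```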

##### Reward function parameters
For the built-in reward functions, see the [documentation](./GRPO/DeveloperGuide/奖励函数.md).
cosine reward arguments
- cosine_min_len_value_wrong: reward value corresponding to the minimum length when the generated answer is wrong. Default is -0.5.
- cosine_max_len_value_wrong: reward value corresponding to the maximum length when the generated answer is wrong. Default is 0.0.
31 changes: 30 additions & 1 deletion docs/source/Megatron-SWIFT/命令行参数.md
@@ -236,7 +236,7 @@ LoRA training:


**DPO Arguments**:
- ref_load: load path of the ref_model. Required when using the DPO/KTO algorithm with full-parameter training. Default is None, i.e. it is set to `load`.
- ref_load: load path of the ref_model. Required when using the DPO/GRPO/KTO algorithm with full-parameter training. Default is None, i.e. it is set to `load`.
- ref_adapter_load: path for loading the ref_adapter weights; default is None. To run DPO with LoRA weights produced by SFT, use "ms-swift>=3.8" and set `--adapter_load sft_ckpt --ref_adapter_load sft_ckpt --finetune true` during training. To resume training from a checkpoint in this scenario, set `--adapter_load rlhf_ckpt --ref_adapter_load sft_ckpt --finetune false`.
- beta: same meaning as in [TRL](https://huggingface.co/docs/trl/main/en/dpo_trainer#trl.DPOConfig). Controls the degree of deviation from the reference model; a higher beta means less deviation from the reference model. For the IPO loss (loss_type="ipo"), beta is the regularization parameter referred to in the [paper](https://huggingface.co/papers/2310.12036). Default is 0.1.
- 🔥rpo_alpha: parameter from the [RPO paper](https://huggingface.co/papers/2404.19733) that controls the weight of the NLL term (i.e. the SFT loss) in the loss, `loss = dpo_loss + rpo_alpha * sft_loss`; the paper recommends setting it to `1.`. Default is `None`, i.e. sft_loss is not added by default.
@@ -254,6 +254,35 @@
- desirable_weight: offsets the imbalance between the numbers of desirable and undesirable samples by weighting the desirable loss with this coefficient. Default is `1.`.
- undesirable_weight: offsets the imbalance between the numbers of desirable and undesirable samples by weighting the undesirable loss with this coefficient. Default is `1.`.

**GRPO Arguments**
- ref_load: same meaning as in DPO.
- ref_adapter_load: same meaning as in DPO.
- beta: KL regularization coefficient; default is 0.04. When set to 0, the ref model is not loaded.
- epsilon: clip coefficient; default is 0.2.
- epsilon_high: upper clip coefficient; default is None. When set, it forms the clipping range [epsilon, epsilon_high] together with epsilon.
- overlong_filter: skip samples that were truncated for being overlong so they do not contribute to the loss; default is False.
- importance_sampling_level: controls how the importance sampling ratio is computed; options are `token`, `sequence`, and `sequence_token`; default is `token`. See the [GSPO documentation](../Instruction/GRPO/AdvancedResearch/GSPO.md) for details.
- batch size related arguments (note: all of the following are completion-level; the sketch after this parameter list illustrates how they relate)
  - micro_batch_size: batch size per device; default is 1.
  - global_batch_size: total batch size, equivalent to `micro_batch_size * data parallel size * gradient accumulation steps`. Default is 16. This corresponds to the amount of training data consumed per weight update (mini_batch_size).
  - generation_batch_size: sampling batch size; must be a multiple of global_batch_size. Defaults to global_batch_size.
  - steps_per_generation: number of optimization steps per generation round, i.e. the ratio of generation_batch_size to global_batch_size; default is 1.
  - num_generations: number of completions sampled per prompt, the G value in the paper. generation_batch_size must be divisible by num_generations. Default is 8.
- reward_funcs: reward functions for the GRPO algorithm; options are `accuracy`, `format`, `cosine`, `repetition`, and `soft_overlong`, see swift/plugin/orm.py. You can also define custom reward functions in the plugin. Default is `[]`.
- reward_weights: weight of each reward function. Must match the total number of reward functions and reward models. If None, all rewards are weighted equally with `1.0`.
- loss_type: type of loss normalization; options are ['grpo', 'bnpo', 'dr_grpo']. Default is 'grpo'. See this [PR](https://github.com/huggingface/trl/pull/3256#discussion_r2033213348) for details.
- vllm_mode arguments
  - vllm_gpu_memory_utilization: passed through to vLLM; default is 0.9.
  - vllm_max_model_len: passed through to vLLM; default is None.
  - vllm_enforce_eager: passed through to vLLM; default is False.
  - vllm_limit_mm_per_prompt: passed through to vLLM; default is None.
  - vllm_enable_prefix_caching: passed through to vLLM; default is True.
  - sleep_level: release vLLM GPU memory during training; options are [0, 1]. Default is 0 (no release).
  - offload_optimizer: whether to offload optimizer parameters during vLLM inference; default is False.
  - offload_model: whether to offload the model during vLLM inference; default is False.
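As referenced in the batch size item above, a small sketch of the completion-level bookkeeping these arguments describe (the helper and the example numbers are illustrative assumptions, not part of ms-swift):

```python
def resolve_grpo_batch_sizes(micro_batch_size=1, data_parallel_size=8,
                             gradient_accumulation_steps=2,
                             steps_per_generation=1, num_generations=8):
    # mini-batch consumed by each weight update
    global_batch_size = micro_batch_size * data_parallel_size * gradient_accumulation_steps
    # completions sampled per rollout round; a multiple of global_batch_size
    generation_batch_size = global_batch_size * steps_per_generation
    assert generation_batch_size % num_generations == 0, \
        "generation_batch_size must be divisible by num_generations (G)"
    unique_prompts_per_round = generation_batch_size // num_generations
    return global_batch_size, generation_batch_size, unique_prompts_per_round

# With the defaults above: global_batch_size=16, generation_batch_size=16, 2 unique prompts per round.
print(resolve_grpo_batch_sizes())
```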

For the arguments of the built-in reward functions, see the [documentation](../Instruction/命令行参数.md#奖励函数参数).

## Training Arguments

Megatron training arguments inherit from the Megatron arguments and the basic arguments (**dataset, template, and related arguments are shared with ms-swift, and model-specific arguments from ms-swift are also supported**). For the basic arguments, see [here](../Instruction/命令行参数.md#基本参数). The following arguments are also included:
22 changes: 12 additions & 10 deletions docs/source_en/Instruction/Command-line-parameters.md
@@ -515,6 +515,15 @@ The meanings of the following parameters can be referenced [here](https://huggin

#### GRPO Arguments
- beta: KL regularization coefficient; default 0.04. Setting it to 0 disables the reference model.
- epsilon: epsilon value for clipping. Default is 0.2.
- epsilon_high: Upper clip coefficient, default is None. When set, it forms a clipping range of [epsilon, epsilon_high] together with epsilon.
- delta: Delta value for the upper clipping bound in two-sided GRPO. Recommended to be > 1 + epsilon. This method was introduced in the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291).
- overlong_filter: Skip overlong truncated samples, which will not be included in loss calculation. Default is False.
- dynamic_sample: Exclude data within the group where the reward standard deviation is 0 and sample additional new data. Default is False. A minimal sketch of this resampling loop follows this list.
- max_resample_times: Under the dynamic_sample setting, limits the number of resampling attempts. Default is 3.
- top_entropy_quantile: Only tokens whose entropy ranks within the specified top quantile are included in the loss calculation. The default is 1.0, which means low-entropy tokens are not filtered. For details, refer to the [documentation](./GRPO/AdvancedResearch/entropy_mask.md).
- log_entropy: Logs the entropy values during training. The default is False. For more information, refer to the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics).
- importance_sampling_level: Controls how the importance sampling ratio is computed. Options are `token` and `sequence`. In `token` mode, the raw per-token log-probability ratios are used. In `sequence` mode, the log-probability ratios of all valid tokens in the sequence are averaged to produce a single ratio per sequence. The [GSPO paper](https://www.arxiv.org/abs/2507.18071) uses sequence-level importance sampling to stabilize training. The default is `token`.
- per_device_train_batch_size: The training batch size per device. In GRPO, this refers to the batch size of completions during training.
- per_device_eval_batch_size: The evaluation batch size per device. In GRPO, this refers to the batch size of completions during evaluation.
- generation_batch_size: Batch size to use for generation. It defaults to the effective training batch size: `per_device_train_batch_size * num_processes * gradient_accumulation_steps`.
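As referenced in the dynamic_sample item above, a minimal sketch of the resampling loop that dynamic_sample and max_resample_times describe (the `sample_group` and `reward_fn` helpers are hypothetical placeholders, not ms-swift APIs):

```python
import statistics

def sample_informative_group(sample_group, reward_fn, num_generations=8, max_resample_times=3):
    """Resample a prompt's completion group while its rewards have zero standard deviation."""
    completions = sample_group(num_generations)
    rewards = reward_fn(completions)
    for _ in range(max_resample_times):
        if statistics.pstdev(rewards) > 0:   # group carries a learning signal; keep it
            break
        completions = sample_group(num_generations)
        rewards = reward_fn(completions)
    return completions, rewards
```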
@@ -556,23 +565,16 @@ The meanings of the following parameters can be referenced [here](https://huggin
- top_p: Default is 0.9.
- repetition_penalty: Repetition penalty term. Default is 1.
- num_iterations: number of iterations per batch. Default is 1.
- epsilon: epsilon value for clipping. Default is 0.2.
- epsilon_high: Upper clip coefficient, default is None. When set, it forms a clipping range of [epsilon, epsilon_high] together with epsilon.
- delta: Delta value for the upper clipping bound in two-sided GRPO. Recommended to be > 1 + epsilon. This method was introduced in the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291).

- sync_ref_model: Whether to periodically synchronize the reference model. Default is False.
- ref_model_mixup_alpha: Controls the mix between the current policy and the previous reference policy during updates. The reference policy is updated according to the equation: $π_{ref} = α * π_θ + (1 - α) * π_{ref_{prev}}$. Default is 0.6.
- ref_model_sync_steps: Determines how frequently the current policy is synchronized with the reference policy. Default is 512.
- move_model_batches: When moving model parameters to fast inference frameworks such as vLLM/LMDeploy, determines how many batches to divide the layers into. The default is `None`, which means the entire model is not split. Otherwise, the model is split into `move_model_batches + 1` (non-layer parameters) + `1` (multi-modal component parameters) batches. This parameter is only meaningful for LoRA (PEFT).
- multi_turn_scheduler: Multi-turn GRPO parameter; pass the corresponding plugin name, and make sure to implement it in plugin/multi_turn.py.
- max_turns: Maximum number of rounds for multi-turn GRPO. The default is None, which means there is no limit.
- dynamic_sample: Exclude data within the group where the reward standard deviation is 0, and additionally sample new data. Default is False.
- max_resample_times: Under the dynamic_sample setting, limit the number of resampling attempts to a maximum of 3. Default is 3 times.
- overlong_filter: Skip overlong truncated samples, which will not be included in loss calculation. Default is False.
The hyperparameters for the reward function can be found in the [Built-in Reward Functions section](#built-in-reward-functions).
- top_entropy_quantile: Only tokens whose entropy ranks within the specified top quantile are included in the loss calculation. The default is 1.0, which means low-entropy tokens are not filtered. For details, refer to the [documentation](./GRPO/AdvancedResearch/entropy_mask.md). A minimal sketch of this masking follows this list.
- log_entropy: Logs the entropy values during training. The default is False. For more information, refer to the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics).
- importance_sampling_level: Controls how the importance sampling ratio is computed. Options are `token` and `sequence`. In `token` mode, the raw per-token log-probability ratios are used. In `sequence` mode, the log-probability ratios of all valid tokens in the sequence are averaged to produce a single ratio per sequence. The [GSPO paper](https://www.arxiv.org/abs/2507.18071) uses sequence-level importance sampling to stabilize training. The default is `token`.
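As mentioned in the top_entropy_quantile item, a minimal sketch of the entropy-quantile mask (an illustration under assumed tensor shapes, not the ms-swift implementation):

```python
import torch

def entropy_mask(entropies, completion_mask, top_entropy_quantile=0.2):
    """Keep only tokens whose entropy is in the top quantile.

    entropies: (batch, seq_len) per-token entropy of the policy distribution.
    completion_mask: (batch, seq_len) with 1 for valid completion tokens.
    """
    if top_entropy_quantile >= 1.0:
        return completion_mask  # default: no filtering
    valid = entropies[completion_mask.bool()]
    # threshold at the (1 - q) quantile so the top-q fraction of tokens is kept
    threshold = torch.quantile(valid.float(), 1.0 - top_entropy_quantile)
    return completion_mask * (entropies >= threshold).to(completion_mask.dtype)
```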

##### Reward function parameters
Refer to the [documentation](./GRPO/DeveloperGuide/reward_function.md) for built-in reward functions.

cosine reward function arguments
- cosine_min_len_value_wrong (default: -0.5): Reward value corresponding to the minimum length when the answer is incorrect.