Releases · modelscope/ms-swift
v3.8.0
New Features
- Megatron-SWIFT
a. Supports multimodal model training including LoRA/full-parameter training (CPT/SFT/DPO). Currently supports Qwen2-VL, Qwen2.5-VL, Qwen2.5-Omni, InternVL3, InternVL3.5, GLM-4.5V, and Ovis2.5 series models. Documentation: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/Multimodal-Model.html . Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/multimodal
b. Supports Merge-LoRA, enabling LoRA-based SFT followed by DPO training via merged LoRA weights. Documentation: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/LoRA-Training.html#merge-lora
c. Supports channel loss via the --enable_channel_loss flag. Include a "channel" field in your dataset; ms-swift will group loss statistics by that field (a dataset sketch follows this group). Dataset preparation guide: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html#channel-loss
d. Supports LoRA training on MoE router components. Set --target_modules all-router all-linear.
e. Supports training launched via DeepSpeed launcher.
f. During weight conversion, supports offloading to CPU the parts of the Hugging Face model that do not fit in GPU memory.
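A minimal sketch of the channel-loss workflow from item c above; the model name and file paths are placeholders, not from the release notes:

```bash
# Hypothetical JSONL dataset: each row carries a "channel" field that
# ms-swift uses to group per-channel loss statistics.
cat > channel_demo.jsonl <<'EOF'
{"channel": "math", "messages": [{"role": "user", "content": "1+1=?"}, {"role": "assistant", "content": "2"}]}
{"channel": "chat", "messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}
EOF

# Enable per-channel loss logging during SFT.
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset channel_demo.jsonl \
    --enable_channel_loss true
```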
- GRPO
a. The GRPO multi-turn pipeline has been refactored, enabling more flexible multi-turn training. See the documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/DeveloperGuide/multi_turn.html
b. The --truncation_strategy delete parameter skips samples that fail to encode.
- Training
a. Supports DFT loss. Enable it with --enable_dft_loss true during SFT training, including Megatron-SWIFT (a command sketch follows this group). See experimental results in this PR: #5355
b. Datasets now support a "loss" field to control whether loss is computed for each conversation turn. Documentation: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html#supervised-fine-tuning
c. Automatically fills no_think_prefix during mixed-thinking model training (when samples lack a reasoning segment), e.g., inserts <think>\n\n</think>\n\n for Qwen/Qwen3-30B-A3B.
d. Supports the early_stop_interval parameter to terminate training if best_metric does not improve within early_stop_interval epochs.
e. The default of the MoE training parameter router_aux_loss_coef is now 0 instead of being read from config.json (also updated in Megatron-SWIFT).
f. Refactored channel loss: removed the --channels parameter in favor of --enable_channel_loss.
g. Added the ROOT_IMAGE_DIR environment variable to specify the root directory for image (multimodal) resources.
h. Supports DLRover flash checkpoint for asynchronous weight persistence (safetensors format not yet supported). Thanks to contributions from China Merchants Bank's tech team.
i. Qwen2.5-Omni now supports sequence classification tasks and training with mixed-modality data within a single sample.
j. Supports the target_parameters feature (requires "peft>=0.17.0").
k. Supports the GLM4.5 agent template.
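A hedged sketch combining items a and d above in one SFT run; the model and dataset names are placeholders, and early_stop_interval is assumed to take a plain integer:

```bash
# DFT loss plus early stopping: training stops once best_metric fails to
# improve for 5 consecutive intervals (value chosen for illustration).
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset my_sft_data.jsonl \
    --enable_dft_loss true \
    --early_stop_interval 5
```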
- RLHF
a. Supports LD-DPO: use the ld_alpha parameter to weight logps beyond the common prefix, suppressing length bias.
b. DPO supports packing to improve training throughput (including Megatron-SWIFT). Training script references:
https://github.com/modelscope/ms-swift/blob/main/examples/train/packing/dpo.sh ; https://github.com/modelscope/ms-swift/blob/main/examples/megatron/rlhf/dpo/packing.sh
c. Supports "rejected_messages" dataset format, offering greater extensibility than "rejected_response" (e.g., for multimodal/Agent scenarios). Documentation: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html#dpo-orpo-cpo-simpo-rm
d. Supports the ref_adapters parameter for chaining DPO/KTO/GRPO after LoRA SFT (named ref_adapter_load in Megatron-SWIFT).
e. The default of the DPO training parameter rpo_alpha changed from 1 to None, aligning with the TRL default (also updated in Megatron-SWIFT).
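A sketch of the "rejected_messages" format from item c; the exact schema is in the linked docs, and the field layout here is an assumption:

```bash
# Hypothetical DPO sample: "rejected_messages" mirrors "messages" as a
# full conversation, which is what enables multimodal/Agent rejects.
cat > dpo_demo.jsonl <<'EOF'
{"messages": [{"role": "user", "content": "What is 2+2?"}, {"role": "assistant", "content": "4"}], "rejected_messages": [{"role": "user", "content": "What is 2+2?"}, {"role": "assistant", "content": "5"}]}
EOF

swift rlhf \
    --rlhf_type dpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset dpo_demo.jsonl
```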
- End-to-End Capabilities
a. Upgraded the swift eval module to use "evalscope>=1.0".
b. Inference RequestConfig now supports return_details, returning the image dimensions after template resizing, which helps draw bounding boxes in grounding tasks. Example: https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_grounding.py
c. vLLM now supports more multimodal models: ovis2.5, interns1, internvl3.5.
d. vLLM adds disable_cascade_attn parameter support (a usage sketch follows).
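A hedged sketch of the new vLLM option; since the v3.7.0 refactor, vLLM arguments carry a vllm_ prefix, so the flag spelling below is an assumption, and the model name is illustrative:

```bash
# Assumes the option surfaces as --vllm_disable_cascade_attn, following
# the vllm_ prefix convention introduced in v3.7.0.
swift infer \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --infer_backend vllm \
    --vllm_disable_cascade_attn true
```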
New Models
- Text-only Models:
a. deepseek-ai/DeepSeek-V3.1 (including Megatron-SWIFT)
b. moonshotai/Kimi-K2-Instruct-0905
c. ByteDance-Seed/Seed-OSS-36B-Instruct
d. meituan-longcat/LongCat-Flash-Chat
e. google/gemma-3-270m-it series
- Multimodal Models:
a. AIDC-AI/Ovis2.5-2B series (supports padding-free/packing). Training script: https://github.com/modelscope/ms-swift/blob/main/examples/models/ovis2/train.sh
b. OpenGVLab/InternVL3_5-1B series (including Megatron-SWIFT, supports mixed-modality datasets and padding-free/packing). Training script: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/multimodal/moe/lora.sh
c. ZhipuAI/GLM-4.5V (including Megatron-SWIFT, supports mixed-modality datasets and padding-free/packing). Training script: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/multimodal/moe/glm4_5v.sh
d. OpenBMB/MiniCPM-V-4_5
e. rednote-hilab/dots.ocr
f. Shanghai_AI_Laboratory/Intern-S1-mini series
g. mispeech/midashenglm-7b
What's Changed
- support flash checkpoint by integrated with DLRover by @meichangsu1 in #5060
- update models shell by @Jintao-Huang in #5299
- [model] fix qwen2_5_vl fps by @Jintao-Huang in #5306
- [template] update gpt oss template by @Jintao-Huang in #5308
- update plot_images by @Jintao-Huang in https://github.com/modelscope/m...
Patch release v3.7.3
Full Changelog: v3.7.2...v3.7.3
Patch release v3.7.2
Full Changelog: v3.7.1...v3.7.2
Patch release v3.7.1
Full Changelog: v3.7.0...v3.7.1
v3.7.0
New Features
- GRPO
a. Added support for the GSPO algorithm: use --importance_sampling_level sequence during GRPO training (a command sketch follows this group). Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/GSPO.html
b. GRPO server mode now supports multi-node rollout; pass multiple vllm_server_host/vllm_server_port values. Example script: https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/server_multi_node.sh
c. GRPO rollout is now GYM-compatible (thanks to contributor Mouse). Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/DeveloperGuide/gym_env.html
d. Added entropy_mask to filter low-entropy tokens from loss computation; the logger now also tracks entropy dynamics. Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/entropy_mask.html
e. Added support for the multi-round DeepEyes algorithm. Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/deepeyes.html
f. GRPO supports --truncation_strategy delete: samples whose input length exceeds max_length are removed and resampled.
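A minimal GSPO-style GRPO sketch per item a above; the model, dataset, and reward function are placeholders:

```bash
# GSPO = GRPO with sequence-level importance sampling; over-long inputs
# are dropped and resampled via --truncation_strategy delete.
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset my_math_prompts.jsonl \
    --reward_funcs accuracy \
    --importance_sampling_level sequence \
    --truncation_strategy delete
```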
- Megatron-SWIFT
a. Added LoRA training (CPT/SFT/DPO), significantly accelerating MoE training (a command sketch follows this group).
- Docs: https://swift.readthedocs.io/en/latest/Instruction/Megatron-SWIFT-Training.html#lora-training
- Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/lora
b. Added loss-scaling to simplify Agent training. Script: https://github.com/modelscope/ms-swift/blob/main/examples/train/megatron/lora/loss_scale.sh
c. Default megatron-core version upgraded to 0.13.
d. Added the bshd tensor format to facilitate custom attention_mask.
e. Logging improvements: prints GPU memory, estimated remaining time, and writeslogging.jsonl.
f. Faster model loading and weight conversion, plus a loading progress bar.
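A sketch of the Megatron-SWIFT LoRA workflow from item a; flag names follow the linked examples, but the checkpoint paths and model name are placeholders:

```bash
# 1) One-off conversion of Hugging Face weights to the mcore format.
swift export \
    --model Qwen/Qwen3-30B-A3B \
    --to_mcore true \
    --output_dir Qwen3-30B-A3B-mcore

# 2) LoRA SFT with Megatron-SWIFT.
megatron sft \
    --load Qwen3-30B-A3B-mcore \
    --dataset my_sft_data.jsonl \
    --train_type lora \
    --target_modules all-linear
```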
- Training
a. Added Flash-Attention-3 support (including Megatron-SWIFT). Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/flash_attention_3
b. New --new_special_tokens flag for adding special tokens. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/new_special_tokens
c. New --cached_dataset flag for offline tokenization in CPT/SFT (a two-step sketch follows this group). Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset
d. Re-implemented the sequence-packing module for faster packing and better multimodal disk I/O.
e. Qwen2.5-VL hybrid-modal data (multiple modalities in a single sample) + DeepSpeed training supported.
f. Multimodal training now supports loss-scaling.
g. rope_scaling now accepts a dict; additionally, setting max_model_len auto-adjusts the rope_scaling factor.
h. Added DeepSpeed-AutoTP (not compatible with LoRA).
i. Multimodal packing is compatible with transformers ≥ 4.53; sequence parallelism with transformers ≥ 4.52.
j. Withresume_only_model, data skipping is enabled by default; control viaignore_data_skip.
k. MoE training supportsrouter_aux_loss_coef.
l. Templates gain a max_length truncation safeguard: image/video tokens are never truncated.
m. tuner_backend unsloth now supports MoE models, device_map, and DDP.
n. Embedding training supports liger_kernel.
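A two-step sketch of the --cached_dataset workflow from item c; the --to_cached_dataset export flag is an assumption based on the examples/export/cached_dataset directory, and all names are placeholders:

```bash
# 1) Tokenize offline (flag name assumed from the linked examples).
swift export \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset my_sft_data.jsonl \
    --to_cached_dataset true \
    --output_dir ./cached_sft

# 2) Train directly from the pre-tokenized cache.
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --cached_dataset ./cached_sft
```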
- RLHF
a. Added MPO training. Script: https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/mpo.sh
b. Multimodal DPO now supports rejected image inputs: add a rejected_images column to the dataset.
- Inference & Deployment
a. Added deployment for embedding models across the pt/vllm/sglang backends. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/deploy/embedding
b. InferEngine supports return_details to output prompt_token_ids and token_ids.
c. vLLM back-end now supports more multimodal models: ovis2, glm4_1v, keye-vl, kimi-vl, glm4v, phi4-multimodal, llama4.
d. vLLM arguments refactored: names now carry the vllm_ prefix (a before/after sketch follows). The GRPO module reuses the same options.
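An illustration of the renaming in item d, using gpu_memory_utilization as the example; the model name and values are placeholders:

```bash
# Before v3.7.0 the option was spelled without the prefix:
#   swift deploy --infer_backend vllm --gpu_memory_utilization 0.9
# From v3.7.0 on, vLLM options carry the vllm_ prefix:
swift deploy \
    --model Qwen/Qwen2.5-7B-Instruct \
    --infer_backend vllm \
    --vllm_gpu_memory_utilization 0.9 \
    --vllm_max_model_len 8192
```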
- Export
a. QLoRA now supports Merge-LoRA. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/qlora
b. Added FP8 / BNB quantization for MoE and multimodal models. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize
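A hedged export sketch for item b; --quant_method fp8 mirrors the linked fp8.sh example, and the model name and output path are illustrative:

```bash
# Quantize a model to FP8 and write the result to a local directory.
swift export \
    --model Qwen/Qwen2.5-7B-Instruct \
    --quant_method fp8 \
    --output_dir Qwen2.5-7B-Instruct-FP8
```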
New Models
- Text-only
a. Qwen/Qwen3-235B-A22B-[Instruct/Thinking]-2507, Qwen/Qwen3-Coder-480B-A35B-Instruct, and Qwen/Qwen3-4B-[Instruct/Thinking]-2507 (Megatron-SWIFT supported). Training script: #5033
b. openai-mirror/gpt-oss-20b family. Best-practice: #5277
c. ZhipuAI/GLM-4.5 family (Megatron-SWIFT supported). Training script: https://github.com/modelscope/ms-swift/blob/main/examples/train/megatron/lora/glm4_5_106b.sh
d. Hunyuan-7B-Instruct family. Best-practice: #5236
e. mistralai/Devstral-Small-2505
- Multimodal
a. OpenBMB/MiniCPM-V-4. Training script: https://github.com/modelscope/ms-swift/blob/main/examples/models/minicpmv/train.sh
What's Changed
- [grpo] fix server arg check by @hjh0119 in #4865
- [SP] clean up imports by @hjh0119 in #4878
- fix loss_scale sp by @tastelikefeet in #4880
- fix seq_cls generation_config by @Jintao-Huang in #4882
- optimize imports by @tastelikefeet in #4883
- [model] fix qwen eos_token by @Jintao-Huang in #4888
- Fix: Correct training hang for Keye-VL on DeepSpeed with mixed data by @0russwest0 in #4889
- [megatron] support LoRA & support loss_scale by @Jintao-Huang in #4812
- update framework.txt by @Jintao-Huang in #4896
- [megatron] fix pp mla by @Jintao-Huang in https://gi...
Patch release v3.6.4
Full Changelog: v3.6.3...v3.6.4
Patch release v3.6.3
Full Changelog: v3.6.2...v3.6.3
Patch release v3.6.2
Full Changelog: v3.6.1...v3.6.2
Patch release v3.6.1
Full Changelog: v3.6.0...v3.6.1
v3.6.0
New Features
- Megatron-SWIFT:
a. Support for more MoE model architectures, including: DeepseekV3ForCausalLM, Dots1ForCausalLM, and Ernie4_5_MoeForCausalLM. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/moe
b. Support for more Dense model architectures, including: MiMoForCausalLM, InternLM3ForCausalLM, and Ernie4_5_ForCausalLM. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/dense
c. DPO training supported. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/rlhf/dpo
d. FP8 training supported.
e. More rope scaling types supported, including: default, linear, yarn, dynamic, longrope, llama3, etc.
f. The --test_convert_precision parameter was improved for easier testing of weight-conversion precision between mcore and Hugging Face models.
- GRPO:
a. GRPO multi-turn training refactored, supporting accelerated multi-turn inference with AsyncEngine. Documentation: https://swift.readthedocs.io/zh-cn/latest/Instruction/GRPO/DeveloperGuide/%E5%A4%9A%E8%BD%AE%E8%AE%AD%E7%BB%83.html
b. The offload_model parameter now also offloads the reference model.
c. Optimized GPU memory management under sleep_level and offload_model parameters.
d. Added trainer_state as an input parameter to reward_funcs, making it easier to obtain the current and total training steps.
- Training:
a. Reranker training supported. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/reranker
b. CPT/SFT/DPO/GRPO pure-text large model training supports ring-attention sequence length partitioning, reducing memory usage. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text/ring_attention
c. Channel loss in CPT/SFT training is compatible with padding_free and packing. Thanks to the technical team at China Merchants Bank for their contribution.
d. Optimized remove_unused_columns parameter. When set to False, extra dataset columns are passed to the Trainer for custom loss functions.
e. The default value for split_dataset_ratio changed from 0.01 to 0, so no validation split is made by default; set --split_dataset_ratio or --val_dataset manually (a sketch follows this group).
f. Fixed loss alignment issue between packing/padding_free for multimodal models. For details, see this PR: #4838
g. SwanLab now supports a Feishu (Lark Suite) notification callback after training completes.
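Because of the default change in item e, splitting off a validation set now needs an explicit flag; a minimal sketch, with placeholder model and dataset names:

```bash
# Restore the old 1% validation split explicitly ...
swift sft --model Qwen/Qwen2.5-7B-Instruct --dataset train.jsonl --split_dataset_ratio 0.01
# ... or pass a dedicated validation set instead.
swift sft --model Qwen/Qwen2.5-7B-Instruct --dataset train.jsonl --val_dataset val.jsonl
```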
- RLHF:
a. Pure-text and multimodal models support GKD training, with padding_free and packing supported in some scenarios. Training scripts:
i. Large models: https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/gkd.sh
ii. Multimodal large models: https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/rlhf/gkd.sh
b. Reward model training now supports the margin parameter. Documentation: https://swift.readthedocs.io/zh-cn/latest/Instruction/%E4%BA%BA%E7%B1%BB%E5%AF%B9%E9%BD%90.html#rm
- Full Pipeline:
a. The SGLang inference engine can accelerate the ms-swift inference/deployment/evaluation/UI modules; set --infer_backend sglang (a sketch follows this group). Inference script reference: https://github.com/modelscope/ms-swift/tree/main/examples/infer/sglang
b. FP8 quantization supported. Quantization script reference: https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/fp8.sh
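A minimal sketch of the SGLang backend from item a; the model name is illustrative:

```bash
# Requires the sglang package to be installed in the environment.
swift infer \
    --model Qwen/Qwen2.5-7B-Instruct \
    --infer_backend sglang
```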
- Web-UI:
a. Supports SFT/RLHF/GRPO training on separate tabs and saving the generated training command line.
b. Web-UI interface supports data sampling.
New Models
- Multimodal Models:
a. ZhipuAI/GLM-4.1V-9B-Thinking series
b. Kwai-Keye/Keye-VL-8B-Preview
c. moonshotai/Kimi-VL-A3B-Thinking-2506
d. google/gemma-3n-E2B-it series
- Pure Text Models:
a. PaddlePaddle/ERNIE-4.5-21B-A3B-PT series
b. rednote-hilab/dots.llm1.inst series
c. Tencent-Hunyuan/Hunyuan-A13B-Instruct
d. MiniMax/MiniMax-M1-80k series (inference)
e. moonshotai/Kimi-Dev-72B
f. cognitivecomputations/DeepSeek-R1-0528-AWQ
What's Changed
- fix emb script and docs by @tastelikefeet in #4521
- [grpo] update doc about move_model_batches by @hjh0119 in #4523
- fix LoraModel by @Jintao-Huang in #4536
- support cognitivecomputations/DeepSeek-R1-0528-AWQ by @Jintao-Huang in #4537
- fix: handle INFONCE_HARD_NEGATIVES as integer if provided by @dlutwy in #4545
- fix qwen3 embedding saving by @tastelikefeet in #4548
- [megatron/dpo] fix megatron packing_cache & update DPOTrainer by @Jintao-Huang in #4556
- [megatron] support DPO by @Jintao-Huang in #4193
- support dots1 by @Jintao-Huang in #4560
- [grpo] support offloading reference model by @hjh0119 in #4554
- [grpo] fix the pickle data collator by @hjh0119 in #4562
- [dataset] fix toolbench (local) by @Jintao-Huang in #4563
- [Bug]Fix ulysses train steps, embedding negative sample length by @tastelikefeet in #4565
- fix args.json by @Jintao-Huang in #4566
- [model] fix ovis gradient_checkpointing vit no_grad by @Jintao-Huang in #4571
- [megatron] Fix megatron all_reduce warning by @Jintao-Huang in #4568
- [grpo] remove data collator to top-level to avoid pickle error in spawn mode by @hjh0119 in #4582
- [grpo] model weight synchronization before first turn rollout with async generation by @hjh0119 in #4584
- [megatron] support more rope_scaling & support deepseek-r1-qwen3-8b/internlm3/mimo-7b by @Jintao-Huang in #4576
- [grpo] restore num_generations check by @hjh0119 in #4590
- fix gc_kwargs by @Jintao-Huang in #4591
- Fix UI llm_train by @slin000111 in #4592
- [mirror] update swift mirror by @Jintao-Huang in #4601
- [megatron] compat megatron-core main branch by @Jintao-Huang in https://github.com/modelscope/ms-swift...