Releases: modelscope/ms-swift
v3.12.1
What's Changed
- [bugfix] fix glm4_7 agent_template by @Jintao-Huang in #7256
- [bugfix] fix DeepSeek-OCR vllm deploy by @hjh0119 in #7258
- [feat] add async reward function support for GRPO training by @hjh0119 in #7252
- [model] support medgemma by @slin000111 in #7261
- [megatron] Support MiniMaxAI/MiniMax-M2.1 by @Jintao-Huang in #7262
- Support muonclip optimizer by @vx120 in #7191
- add task_type by @slin000111 in #7265
- [bugfix] fix mtp save by @Jintao-Huang in #7267
- [feat] support megatron grpo entropy mask & log by @hjh0119 in #7263
- [model] support iquestcoder by @Jintao-Huang in #7271
- [bugfix] fix reward model adapters by @hjh0119 in #7293
- Fix the issue of repeated inference in multi-turn scheduler. by @Simon-ss7 in #7279
- [bugfix] auto-enable async engine for vLLM encode tasks by @hjh0119 in #7301
- [bugfix] fix vllm_engine load_format by @Jintao-Huang in #7302
- fix npu megatron cp by @addsubmuldiv in #7299
- [misc] Remove unnecessary clone operations during weight synchronization by @hjh0119 in #7308
- [model] support youtu-llm by @hjh0119 in #7306
- [megatron] fix gpt_bridge oom by @Jintao-Huang in #7310
- [misc] fix youtu agent template type-checking by @hjh0119 in #7311
- [bugfix] Fix duplicate 'load_format' argument being passed in rollout by @hjh0119 in #7312
New Contributors
- @Simon-ss7 made their first contribution in #7279
Full Changelog: v3.12.0...v3.12.1
v3.12.0
New Features
- Megatron-SWIFT
a. GKD algorithm supports Megatron training. Documentation reference: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/GKD.html
b. New model support: GLM4 Dense; GLM4.7; GLM4.6v-Flash, GLM-4.1V.
c. save_safetensors supports checkpoint resumption; Mcore-Bridge loading and saving is now the recommended approach.
d. Non-padding-free training mode supports more training stages: GRPO/DPO/KTO/RM/sequence classification.
e. Support for the group_by_length parameter, which groups dataset samples of similar length together (with some randomness) to speed up training in non-packing mode.
f. Support for the --report_to parameter to log and visualize training logs in wandb/swanlab.
g. Qwen3-Next uses Zero-Centered RMSNorm, aligned with transformers.
h. Support for the train_dataloader_shuffle parameter, which controls whether the training dataset is shuffled.
i. Added a retry mechanism to template.encode to prevent Megatron training from hanging when fetching images/videos fails due to network issues.
- RL
a. Added Off-Policy Sequence Masking (from DeepSeek-V3.2). Documentation reference: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/training_inference_mismatch.html#off-policy-sequence-masking
b. GRPO adds the num_generations_eval parameter to set the number of generations during the eval stage.
c. Optimized memory peak for GKD loss calculation.
d. GRPO/GKD server mode supports using ipv6 addresses.
e. Support for structured output sampling using structured_outputs_regex.
- Training
a. Embedding/reranker/sequence classification tasks support sequence packing and sequence parallelism. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel
b. Support for --fsdp fsdp2 to use the FSDP2 configuration file built into ms-swift.
c. loss_scale supports combining the 3 basic strategies ('default', 'last_round', 'all') with other strategies, e.g., 'last_round+ignore_empty_think'.
d. cached_dataset supports embedding/reranker/sequence classification training tasks. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/cached_dataset
e. Thinking template refactored: ThinkingTemplate functionality is merged into Template, and the enable_thinking and add_non_thinking_prefix parameters are added.
f. Added the SWIFT_PATCH_CONV3D environment variable to avoid slow conv3d execution in torch 2.9 environments.
g. Support for the swanlab_notification_method parameter to specify the swanlab notification method when training completes or an error occurs.
h. Default value of the dataloader_prefetch_factor parameter changed from 10 to 2.
- Domestic Hardware (Thanks to the Ascend and CMB technical teams)
a. Added more training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/ascend
b. Qwen3-VL hybrid operator support, see this PR: #7079
c. Updated Megatron-SWIFT NPU performance collection/accuracy collection documentation, reference: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/Ascend.html
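As an illustration of the idea behind group_by_length (grouping samples of similar length, with some randomness, to reduce padding in non-packing mode), here is a minimal sketch of a length-grouped batching scheme. It is not ms-swift's implementation; the function names and the mega_factor parameter are hypothetical and chosen only for this example.

```python
import random
from typing import List, Sequence


def length_grouped_batches(lengths: Sequence[int], batch_size: int,
                           mega_factor: int = 50, seed: int = 0) -> List[List[int]]:
    """Return batches of sample indices whose lengths are roughly similar.

    Hypothetical sketch: shuffle all indices, cut them into large chunks
    ("mega-batches"), sort each chunk by length, then slice it into batches.
    The initial shuffle and the final batch shuffle keep the order random.
    """
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)

    mega_size = batch_size * mega_factor
    batches: List[List[int]] = []
    for start in range(0, len(indices), mega_size):
        # Sort only within the chunk, so grouping stays local and partly random.
        mega = sorted(indices[start:start + mega_size],
                      key=lambda i: lengths[i], reverse=True)
        batches.extend(mega[j:j + batch_size]
                       for j in range(0, len(mega), batch_size))
    rng.shuffle(batches)
    return batches


def padding_tokens(lengths: Sequence[int], batches: List[List[int]]) -> int:
    """Count padding tokens when each batch is padded to its longest sample."""
    return sum(max(lengths[i] for i in b) * len(b) - sum(lengths[i] for i in b)
               for b in batches)
```

With alternating short and long samples, sequential batching pads every short sample up to the long length, while the grouped batches above contain samples of a single length and need no padding.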
New Models
- Text-only models:
a. ZhipuAI/GLM-4.7 series
b. iic/QwenLong-L1.5-30B-A3B
c. gongjy/MiniMind2 (Thanks to @PiggerZZM for the contribution)
- Multimodal models:
a. ZhipuAI/GLM-4.6V; ZhipuAI/GLM-4.6V-Flash series
b. Tencent-Hunyuan/HunyuanOCR
What's Changed
- [model] Support GLM4.6-V by @Jintao-Huang in #6948
- [model] support glm4_6v flash by @Jintao-Huang in #6959
- [bugfix] fix truncation_strategy left by @Jintao-Huang in #6961
- [bugfix] fix megatron save_checkpoint by @Jintao-Huang in #6963
- [feat] GKD support truncation strategy delete to resample by @hjh0119 in #6964
- [misc] megatron grpo check rollout_logps by @hjh0119 in #6970
- [misc] set default group_port for vllm client by @hjh0119 in #6972
- [grpo] support Off-Policy Sequence Masking by @hjh0119 in #6978
- [megatron, misc] support check_latest_model by @hjh0119 in #6988
- [bugfix] fix reranker_padding_free by @Jintao-Huang in #6989
- [megatron] fix eval_iters 1 by @Jintao-Huang in #6990
- Add dense_npu.sh for megatron lora training in huawei npu by @vx120 in #6976
- fix system swift pt by @Jintao-Huang in #7003
- [bugfix] fix qwen_vl_utils torchvision base64 by @Jintao-Huang in #7004
- [bugfix] fix liger_kernel flash_attn by @Jintao-Huang in #7005
- [bugfix] fix qwen3_vl bridge by @Jintao-Huang in #7006
- [bugfix] fix reranker padding_free & fix seq_cls omni padding_free by @Jintao-Huang in #7007
- [npu] add npu qwen3_omni sft example for mindspeed backend by @tongtong0613 in #7008
- [bugfix] qwen-omni3 vllm infer with USE_AUDIO_IN_VIDEO by @hjh0119 in #7009
- [bugfix] fix grpo sleep_level 2 causes gibberish outputs by @hjh0119 in #7017
- add npu vllm-ascend docs and examples by @addsubmuldiv in #7013
- [compat] fix mcore012 compat torch new by @Jintao-Huang in #7021
- [megatron] Megatron support random/non-random dataloader by @Jintao-Huang in #7016
- [bugfix] megatron add retry to avoid hang by @Jintao-Huang in #7023
- [trainer] refactor acc metrics by @Jintao-Huang in #7026
- [infer] update embedding/reranker demo by @Jintao-Huang in #7029
- [train] support embedding/reranker packing & support reranker/embedding cache_dataset by @Jintao-Huang in #6987
- update readme by @Jintao-Huang in #7033
- [misc] update swift image by @Jintao-Huang in #7039
- [bugfix] remove add_eos for rm in grpo by @hjh0119 in #7040
- [npu] Fix device mismatch in weight sync for HCCL communicator by @singing4you in #7036
- collect npu profiling data by @OneMondy in #6977
- [bugfix] fix null_ref_context by @Jintao-Huang in #7042
- [model] support hunyuan_ocr by @slin000111 in #7038
- update flash_attn version; fix mcore 0.15 hang by @Jintao-Huang in #7043
- [bugfix] fix grpo multi turn log_entropy by @hjh0119 in #7044
- [bugfix] fix dataloader megatron by @Jintao-Huang in #7050
- [grpo] support num_generations_eva...
Patch release v3.11.3
Full Changelog: v3.11.2...v3.11.3
Patch release v3.11.2
Full Changelog: v3.11.1...v3.11.2
Patch release v3.11.1
Full Changelog: v3.11.0...v3.11.1
v3.11.0
New Features
- Megatron-SWIFT
a. GRPO training support on Megatron, documentation: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/GRPO.html
b. FP8 blockwise training support, including FP8 weight loading and exporting. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/fp8
c. MTP training support, training script: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/lora/mtp.sh
d. New model support: GPT-OSS, Llama4, InternVL3.5-GPT-OSS, etc.
e. Support for saving checkpoints with --save_strategy epoch.
f. Compatible with megatron-core versions 0.12–0.15.
- RL
a. New algorithm SAPO supported, documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/SAPO.html
b. New algorithm CISPO supported, documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/CISPO.html
c. Algorithms for mitigating training–inference mismatch, including TIS/MIS and rollout off-policy metrics. Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/training_inference_mismatch.html
d. Tree-rollout support, docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/treepo.html (Thanks to CMB team @li2zhi for the contribution)
e. GKD training supports liger_kernel loss (--use_liger_kernel true).
f. New GRPO loss types added, docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/DeveloperGuide/loss_types.html
- Training
a. Cached dataset refactoring for better offline tokenization of large datasets. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/cached_dataset
b. Pretraining supports --truncation_strategy split, which splits long text into multiple samples to avoid wasting tokens.
c. Added support for the packing_num_proc parameter.
d. Qwen2.5-VL series models compatible with "qwen_vl_utils>=0.14".
e. MFU logging plugin support (Thanks to @y2logic).
- Domestic Hardware Support (Thanks to the Ascend and CMB technical teams)
a. Megatron-SWIFT supports Ascend NPU, documentation: https://swift.readthedocs.io/en/latest/BestPractices/NPU-support.html
b. Ascend NPU mixed operators support Qwen2, Qwen3, Qwen3-MoE series models, accelerating training.
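The split truncation strategy listed under Training above can be pictured with a minimal sketch: instead of discarding tokens beyond the maximum length, a long token sequence is cut into consecutive samples. The function name and the plain list-of-ints representation are assumptions for illustration only; how ms-swift handles details such as special tokens is not covered here.

```python
from typing import List


def split_into_samples(input_ids: List[int], max_length: int) -> List[List[int]]:
    """Cut one long token sequence into consecutive chunks of at most max_length.

    Hypothetical sketch of a "split" truncation strategy: every token ends up
    in exactly one sample, so none are thrown away.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [input_ids[i:i + max_length]
            for i in range(0, len(input_ids), max_length)]


def truncate_right(input_ids: List[int], max_length: int) -> List[int]:
    """For contrast: plain truncation keeps only the first max_length tokens."""
    return input_ids[:max_length]
```

For a 10-token document and max_length=4, truncation keeps 4 tokens and drops 6, while splitting yields three samples of 4, 4, and 2 tokens.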
New Models
- Text-only models:
a. moonshotai/Kimi-K2-Thinking
- Multimodal models:
a. SenseNova/SenseNova-SI-InternVL3-2B series
b. mistralai/Ministral-3-3B-Instruct-2512 series
c. mistralai/Mistral-Small-3.2-24B-Instruct-2506
What's Changed
- bump version 3.11.0.dev by @Jintao-Huang in #6560
- [model] support Kimi-K2 by @Jintao-Huang in #6562
- [bugfix] fix pp vit_lr by @Jintao-Huang in #6565
- [bugfix] fix tools parse in gkd/grpo server mode by @hjh0119 in #6568
- [bugfix] fix grpo with reward model by @hjh0119 in #6567
- [bugfix] fix mcore-bridge vpp by @Jintao-Huang in #6581
- qwen2.5-vl compat qwen_vl_utils version by @Jintao-Huang in #6584
- [bugfix] fix packing_length by @Jintao-Huang in #6594
- [dataset] support packing_num_proc by @Jintao-Huang in #6592
- Fix emb loss scale by @tastelikefeet in #6597
- [megatron] compat megatron-core 0.12-0.14 by @Jintao-Huang in #6599
- [kto] fix kto loss_type=apo_zero_unpaired by @Jintao-Huang in #6601
- Fix command line display for UI by @slin000111 in #6603
- Support Megatron GRPO by @hjh0119 in #6025
- [megatron] fix train_iters by @Jintao-Huang in #6611
- [bugfix] fix modelscope patch_hub by @Jintao-Huang in #6612
- [template] support add_eos by @Jintao-Huang in #6613
- [dataset] refactor cached_dataset by @Jintao-Huang in #6561
- [bugfix]fix add_eos in gkd/grpo for truncated sample encode by @hjh0119 in #6618
- Support GKD Liger Kernel Loss by @hjh0119 in #6619
- Support generative reranker right pad by @0russwest0 in #6573
- update swift image 3.10.1 by @Jintao-Huang in #6622
- [model] support mistral 2506 by @Jintao-Huang in #6624
- update peft version by @Jintao-Huang in #6621
- [bugfix] Fix multinode write conflict mcore-bridge (deepseek-v3) by @Jintao-Huang in #6626
- Initialize chord dataset after accelerator setup in GRPOTrainer by @tongchen126 in #6638
- [bugfix] fix megatron grpo max_epochs by @hjh0119 in #6646
- [bugfix] fix megatron grpo server mode sync weight by @hjh0119 in #6648
- [megatron] fix save barrier by @Jintao-Huang in #6653
- [bugfix] fix megatron grpo rollout_group by @hjh0119 in #6655
- [bugfix] fix chatml chat template by @Jintao-Huang in #6656
- [bugfix] fix train_type full freeze_llm by @Jintao-Huang in #6651
- [mcore-bridge] optimize gpt_bridge comm by @Jintao-Huang in #6659
- [algo] support cispo algorithm by @hjh0119 in #6572
- [model] Support SenseNova-SI by @hjh0119 in #6657
- [megatron] fix swift export merge_lora by @Jintao-Huang in #6664
- [bugfix] memory log is missing on Ascend NPU by @baymax591 in #6647
- update doc by @tastelikefeet in #6665
- [bugfix] Fix GKD with TRL >= 0.24 & GKD Liger by @hjh0119 in #6663
- [template] support truncation_strategy split (swift pt) by @Jintao-Huang in #6672
- [bugfix] fix qwen3_omni seq_cls by @Jintao-Huang in #6673
- [bugfix] getattr error for activation_offloading in RM training by @hjh0119 in #6677
- [bugfix] fix liger-kernel version check by @hjh0119 in #6679
- [bugfix] fix qwen3_vl image_list fps by @Jintao-Huang in #6696
- [bugfix] fix logprobs in vllm sampling params by @hjh0119 in #6698
- [megatron] support global_aux_loss by @Jintao-Huang in #6699
- [bugfix] fix megatron grpo local jsonl writer by @hjh0119 in #6700
- fix type_type=rm eval trl>=0.25 by @Jintao-Huang in #6701
- add npu fsdp example by @addsubmuldiv in #6697
- add npu deepspeed example by @addsubmuldiv in https://github.com/model...
Patch release v3.10.3
Full Changelog: v3.10.2...v3.10.3
Patch release v3.10.2
Full Changelog: v3.10.1...v3.10.2
Patch release v3.10.1
Full Changelog: v3.10.0...v3.10.1
v3.10.0
New Features
- Megatron-SWIFT
a. Mcore-Bridge Release. Supports direct loading and saving of model weights in safetensors format; supports bidirectional conversion of LoRA incremental weights; supports multi-node conversion. Documentation: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/Mcore-Bridge.html. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/mcore_bridge
b. Upgraded megatron-core version to 0.14.0.
c. Added support for the vit_lr and aligner_lr parameters for multimodal model training.
d. Added storage optimization parameters: async_save, save_retain_interval, etc.
e. Support for batched mrope to accelerate training of Qwen3-VL, Qwen2.5-VL, and other models.
- RL
a. GRPO LoRA training weight synchronization speed optimization. Details: https://swift.readthedocs.io/en/latest/Instruction/GRPO/GetStarted/GRPO.html#memory-optimization-solutions-in-colocate-mode
b. GRPO training memory optimization to reduce peak memory consumption.
c. New RLVR algorithm support: RLOO, documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/RLOO.html. REINFORCE++ Baseline, documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/REINFORCEPP.html
d. GKD supports using vLLM to accelerate policy model rollout, with new parameter teacher_deepspeed for additional control of teacher model sharding strategy. Documentation: https://swift.readthedocs.io/en/latest/Instruction/GKD.html
e. GSPO supports using liger_kernel to reduce memory usage.
- Training
a. RAY support added for PT/SFT/Sampling/Data Distillation, documentation: https://swift.readthedocs.io/en/latest/Instruction/Ray.html
b. Qwen3-VL and Qwen3-Omni support mixed modality data training; Qwen3-VL supports Ulysses sequence parallelism. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_vl
c. Support for YAML-based training parameter configuration, scripts: https://github.com/modelscope/ms-swift/tree/main/examples/yaml
d. Added FSDP2 training launch example, scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-gpu/fsdp2_lora
e. Added best practice for custom multimodal model registration: https://swift.readthedocs.io/en/latest/BestPractices/MLLM-Registration.html
f. InfoNCE loss in embedding training aligned with Qwen3-Embedding paper description. Documentation: https://swift.readthedocs.io/en/latest/BestPractices/Embedding.html
g. Added multi-label classification training example, scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/seq_cls/multi_label
h. agent_template supports seed-oss. Thanks to @hpsun1109 for the contribution.
- Full Pipeline
a. swift export supports GPTQ-v2 quantization, scripts: https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/gptq_v2.sh. Thanks to @zzc0430 for the contribution.
b. The swift deploy vLLM inference backend supports DP deployment via the --vllm_data_parallel_size parameter. Thanks to @YushunXiang for the contribution.
c. swift deploy added health/ping endpoints.
d. vLLM deployment added the parameters vllm_mm_processor_cache_gb and vllm_engine_kwargs.
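For readers unfamiliar with RLOO (listed under RL above), the core idea is a leave-one-out baseline: each sampled completion's reward is compared against the mean reward of the other completions for the same prompt. The sketch below shows only that arithmetic on plain Python floats; the function name is hypothetical, and it is not taken from the ms-swift code base.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for K completions sampled from one prompt.

    advantage_i = r_i - mean(r_j for j != i)
    Illustrative sketch only; requires at least two completions.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two completions per prompt")
    total = sum(rewards)
    # (total - r) / (k - 1) is the mean reward of all the other completions.
    return [r - (total - r) / (k - 1) for r in rewards]
```

With rewards [1.0, 0.0, 0.0, 1.0], each successful completion is compared with a baseline of 1/3 and each failed one with a baseline of 2/3, so the advantages are about +0.667 and -0.667 and sum to zero.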
New Models
- Text-only models:
a. Qwen/Qwen3Guard-Gen-0.6B series
b. MiniMax/MiniMax-M2
- Multimodal models:
a. Qwen/Qwen3-VL-2B-Instruct series
b. deepseek-ai/DeepSeek-OCR, training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/deepseek_ocr
c. PaddlePaddle/PaddleOCR-VL
d. ZhipuAI/Glyph
e. PaddlePaddle/ERNIE-4.5-VL-28B-A3B-Thinking series
f. lmms-lab/LLaVA-OneVision-1.5-4B-Instruct series
What's Changed
- [bugfix] fix image_list qwen2.5/3-omni by @Jintao-Huang in #6122
- [model] Support Qwen3-VL dense by @Jintao-Huang in #6120
- feat: support gptq_v2 quantization method by @zzc0430 in #6102
- [bugfix] fix gptq_v2 by @Jintao-Huang in #6126
- [bugfix] patch timeout & fix print_rich_table by @Jintao-Huang in #6137
- Add the support for vLLM data parallel configuration in SwiftDeploy by @YushunXiang in #6114
- [docs] update vllm deploy DP docs by @Jintao-Huang in #6139
- [model] Support Qwen/Qwen3-VL-4B-Instruct series by @Jintao-Huang in #6143
- Update loss_scale method call to pass through inputs.extra_kwargs by @CJack812 in #6160
- [bugfix] fix qwen3_vl videos by @Jintao-Huang in #6162
- Fix bug of sp/cp by @tastelikefeet in #6163
- [deploy] update vllm_enable_prefix_caching by @Jintao-Huang in #6165
- [bugfix] qwen3-vl support mixed data by @Jintao-Huang in #6161
- [template] add_retry by @Jintao-Huang in #6138
- [bugfix] Fix multimodal lazy_tokenize false by @Jintao-Huang in #6172
- [template] update qwen3_vl grounding dataset format by @Jintao-Huang in #6178
- [docs] update docs by @Jintao-Huang in #6180
- [bugfix] add tools fields in inputs2reqeusts by @hjh0119 in #6054
- [grpo] Optimize vLLM weight synchronization & update builtin accuracy reward by @hjh0119 in #5773
- [model] support Qwen/Qwen3Guard-Gen-0.6B series by @Jintao-Huang in #6189
- [template] Support qwen3 omni mixed data by @Jintao-Huang in #6196
- [docs] update qwen3_vl best practice by @Jintao-Huang in #6206
- [vllm] support vllm_mm_processor_cache_gb by @hjh0119 in #6210
- [megatron] fix qwen3_vl new_special_tokens by @Jintao-Huang in #6213
- [megatron] add mcore save_args by @Jintao-Huang in #6216
- [bugfix] fix dtype warning by @Jintao-Huang in #6219
- [bugfix] fix infer pt dp by @Jintao-Huang in #6222
- support training for multimodal reranker by @0russwest0 in #6192
- [bugfix] fix reward_trainer logger by @Jintao-Huang in #6240
- [model] Support deepseek-ocr by @Jintao-Huang in #6238
- [docs] update deepseek_ocr docs by @Jintao-Huang in #6242
- [bugfix] fi...