Releases: modelscope/ms-swift

v3.12.1

08 Jan 02:29

Full Changelog: v3.12.0...v3.12.1

v3.12.0

30 Dec 03:24

New Features

  1. Megatron-SWIFT
    a. GKD algorithm supports Megatron training. Documentation reference: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/GKD.html
    b. New model support: GLM4 Dense, GLM4.7, GLM4.6V-Flash, GLM-4.1V.
    c. save_safetensors supports resuming training from checkpoints; loading and saving weights via Mcore-Bridge is now the recommended approach.
    d. Non-padding-free training mode supports more training stages: GRPO/DPO/KTO/RM/sequence classification.
    e. Added group_by_length parameter, which groups dataset samples of roughly similar length together (with some randomness) to speed up training in non-packing mode.
    f. Added --report_to parameter to log and visualize training runs in wandb/swanlab.
    g. Qwen3-Next uses Zero-Centered RMSNorm, aligned with transformers.
    h. Added train_dataloader_shuffle parameter to control whether the training dataset is shuffled.
    i. Added a retry mechanism to template.encode so Megatron training no longer stalls when fetching images/videos fails due to network issues.
  2. RL
    a. Added Off-Policy Sequence Masking (from DeepSeek-V3.2). Documentation reference: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/training_inference_mismatch.html#off-policy-sequence-masking
    b. GRPO adds the num_generations_eval parameter to set the number of generations during the eval stage.
    c. Reduced peak GPU memory usage in GKD loss calculation.
    d. GRPO/GKD server mode supports IPv6 addresses.
    e. Support for structured output sampling via structured_outputs_regex.
  3. Training
    a. Embedding/reranker/sequence classification tasks support sequence packing and sequence parallelism. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel
    b. Support for --fsdp fsdp2 to use the FSDP2 configuration file built into ms-swift.
    c. loss_scale supports three basic strategies ('default', 'last_round', 'all'), which can be combined with other strategies, e.g. 'last_round+ignore_empty_think'; see the example after this list.
    d. cached_dataset supports embedding/reranker/sequence classification training tasks. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/cached_dataset
    e. Thinking template refactored: ThinkingTemplate functionality merged into Template; added enable_thinking and add_non_thinking_prefix parameters.
    f. Added SWIFT_PATCH_CONV3D environment variable to work around slow conv3d execution in torch 2.9 environments.
    g. Added swanlab_notification_method parameter to specify how swanlab sends notifications when training completes or errors occur.
    h. Changed the default value of dataloader_prefetch_factor from 10 to 2.
  4. Domestic Hardware (Thanks to the Ascend and China Merchants Bank (CMB) technical teams)
    a. Added more training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/ascend
    b. Qwen3-VL hybrid operator support, see this PR: #7079
    c. Updated Megatron-SWIFT NPU performance collection/accuracy collection documentation, reference: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/Ascend.html
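
A minimal sketch of the hybrid loss_scale usage in 3c above; the model, dataset, and other flags here are illustrative placeholders, not part of this release:

```bash
# Hypothetical example: combine the 'last_round' loss_scale strategy
# with 'ignore_empty_think', as described in 3c.
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset swift/self-cognition \
    --train_type lora \
    --loss_scale last_round+ignore_empty_think
```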

New Models

  1. Text-only models:
    a. ZhipuAI/GLM-4.7 series
    b. iic/QwenLong-L1.5-30B-A3B
    c. gongjy/MiniMind2 (Thanks to @PiggerZZM's contribution)
  2. Multimodal models:
    a. ZhipuAI/GLM-4.6V; ZhipuAI/GLM-4.6V-Flash series
    b. Tencent-Hunyuan/HunyuanOCR

Patch release v3.11.3

28 Dec 12:54

Patch release v3.11.2

21 Dec 02:59

Patch release v3.11.1

15 Dec 01:10

v3.11.0

09 Dec 02:44

New Features

  1. Megatron-SWIFT
    a. GRPO training support on Megatron, documentation: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/GRPO.html
    b. FP8 blockwise training support, including FP8 weight loading and exporting. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/fp8
    c. MTP training support, training script: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/lora/mtp.sh
    d. New model support: GPT-OSS, Llama4, InternVL3.5-GPT-OSS, etc.
    e. Support for saving checkpoints with --save_strategy epoch.
    f. Compatible with megatron-core versions 0.12–0.15.
  2. RL
    a. New algorithm SAPO supported, documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/SAPO.html
    b. New algorithm CISPO supported, documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/CISPO.html
    c. Algorithms for mitigating training–inference mismatch, including TIS/MIS and rollout off-policy metrics. Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/training_inference_mismatch.html
    d. Tree-rollout support, docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/treepo.html (Thanks to CMB team @li2zhi for the contribution)
    e. GKD training supports liger_kernel loss (--use_liger_kernel true).
    f. New GRPO loss types added, docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/DeveloperGuide/loss_types.html
  3. Training
    a. Cached dataset refactoring for better offline tokenization of large datasets. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/cached_dataset
    b. Added --truncation_strategy split for pretraining, which splits long texts into multiple samples instead of truncating them, avoiding wasted tokens; see the example after this list.
    c. Added packing_num_proc parameter support.
    d. Qwen2.5-VL series models compatible with "qwen_vl_utils>=0.14".
    e. MFU logging plugin support (Thanks to @y2logic).
  4. Domestic Hardware Support (Thanks to the Ascend and China Merchants Bank (CMB) technical teams)
    a. Megatron-SWIFT supports Ascend NPU, documentation: https://swift.readthedocs.io/en/latest/BestPractices/NPU-support.html
    b. Ascend NPU hybrid operators support the Qwen2, Qwen3, and Qwen3-MoE model series, accelerating training.
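
A minimal sketch of the pretraining truncation behavior in 3b above; the model and dataset are illustrative placeholders:

```bash
# Hypothetical example: split over-long pretraining texts into multiple
# samples (--truncation_strategy split) rather than truncating them.
swift pt \
    --model Qwen/Qwen2.5-7B \
    --dataset <your-pretraining-dataset> \
    --max_length 8192 \
    --truncation_strategy split
```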

New Models

  1. Text-only models:
    a. moonshotai/Kimi-K2-Thinking
  2. Multimodal models:
    a. SenseNova/SenseNova-SI-InternVL3-2B series
    b. mistralai/Ministral-3-3B-Instruct-2512 series
    c. mistralai/Mistral-Small-3.2-24B-Instruct-2506

Patch release v3.10.3

30 Nov 06:35

Patch release v3.10.2

23 Nov 09:58

Patch release v3.10.1

16 Nov 16:50

v3.10.0

11 Nov 12:14

New Features

  1. Megatron-SWIFT
    a. Mcore-Bridge released. Supports direct loading and saving of model weights in safetensors format, bidirectional conversion of LoRA incremental weights, and multi-node conversion. Documentation: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/Mcore-Bridge.html. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/mcore_bridge
    b. Upgraded megatron-core version to 0.14.0.
    c. Added vit_lr and aligner_lr parameter support for multimodal model training.
    d. Added storage optimization parameters: async_save, save_retain_interval, etc.
    e. Support for batched mrope, accelerating training of Qwen3-VL, Qwen2.5-VL, and other models.
  2. RL
    a. Optimized weight-synchronization speed for GRPO LoRA training. Details: https://swift.readthedocs.io/en/latest/Instruction/GRPO/GetStarted/GRPO.html#memory-optimization-solutions-in-colocate-mode
    b. Optimized GRPO training memory usage to reduce peak memory consumption.
    c. New RLVR algorithms supported: RLOO (documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/RLOO.html) and REINFORCE++ Baseline (documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/REINFORCEPP.html)
    d. GKD supports using vLLM to accelerate policy model rollout, with new parameter teacher_deepspeed for additional control of teacher model sharding strategy. Documentation: https://swift.readthedocs.io/en/latest/Instruction/GKD.html
    e. GSPO supports using liger_kernel to reduce memory usage.
  3. Training
    a. Ray support added for PT/SFT/sampling/data distillation, documentation: https://swift.readthedocs.io/en/latest/Instruction/Ray.html
    b. Qwen3-VL and Qwen3-Omni support mixed modality data training; Qwen3-VL supports Ulysses sequence parallelism. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_vl
    c. Support for YAML-based training parameter configuration, scripts: https://github.com/modelscope/ms-swift/tree/main/examples/yaml
    d. Added FSDP2 training launch example, scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-gpu/fsdp2_lora
    e. Added best practice for custom multimodal model registration: https://swift.readthedocs.io/en/latest/BestPractices/MLLM-Registration.html
    f. InfoNCE loss in embedding training aligned with Qwen3-Embedding paper description. Documentation: https://swift.readthedocs.io/en/latest/BestPractices/Embedding.html
    g. Added multi-label classification training example, scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/seq_cls/multi_label
    h. agent_template supports seed-oss. Thanks to @hpsun1109 for the contribution.
  4. Full Pipeline
    a. swift export supports GPTQ-v2 quantization, scripts: https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/gptq_v2.sh. Thanks to @zzc0430 for the contribution.
    b. The swift deploy vLLM inference backend supports data-parallel (DP) deployment via the --vllm_data_parallel_size parameter; see the example after this list. Thanks to @YushunXiang for the contribution.
    c. swift deploy now exposes health/ping endpoints.
    d. vLLM deployment adds the vllm_mm_processor_cache_gb and vllm_engine_kwargs parameters.
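
A minimal sketch of the DP deployment in 4b and the health endpoint in 4c above; the model and port are illustrative placeholders:

```bash
# Hypothetical example: deploy with the vLLM backend using two
# data-parallel replicas.
swift deploy \
    --model Qwen/Qwen2.5-7B-Instruct \
    --infer_backend vllm \
    --vllm_data_parallel_size 2 \
    --port 8000

# Probe the new health endpoint once the server is up.
curl http://127.0.0.1:8000/health
```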

New Models

  1. Text-only models:
    a. Qwen/Qwen3Guard-Gen-0.6B series
    b. MiniMax/MiniMax-M2
  2. Multimodal models:
    a. Qwen/Qwen3-VL-2B-Instruct series
    b. deepseek-ai/DeepSeek-OCR, training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/deepseek_ocr
    c. PaddlePaddle/PaddleOCR-VL
    d. ZhipuAI/Glyph
    e. PaddlePaddle/ERNIE-4.5-VL-28B-A3B-Thinking series
    f. lmms-lab/LLaVA-OneVision-1.5-4B-Instruct series
