Merged
51 commits
6301476
support only sync lora weight
hjh0119 Sep 8, 2025
c7be012
fix wip
hjh0119 Sep 8, 2025
22042fc
wip
hjh0119 Sep 8, 2025
1081caa
fix colocate lora
hjh0119 Sep 9, 2025
4c04d36
add lora for server wip
hjh0119 Sep 9, 2025
1982e9e
Merge branch 'lora+' of github.com:hjh0119/swift into lora+
hjh0119 Sep 9, 2025
5fe3690
fix import
hjh0119 Sep 9, 2025
161dac8
update extension path
hjh0119 Sep 9, 2025
0a14d20
override enable_lora for rollout
hjh0119 Sep 9, 2025
4574665
Merge branch 'lora+' of github.com:hjh0119/swift into lora+
hjh0119 Sep 9, 2025
efae3b2
catch rollout exception
hjh0119 Sep 9, 2025
f454598
fix lora request
hjh0119 Sep 9, 2025
d46bc1f
server wip
hjh0119 Sep 10, 2025
986ac8d
server add_lora wip
hjh0119 Sep 11, 2025
0dc8c6e
fix server tp
hjh0119 Sep 12, 2025
ba284ba
merge main
hjh0119 Sep 8, 2025
0f7ca2a
Merge branch 'lora+' of github.com:hjh0119/swift into lora+
hjh0119 Sep 12, 2025
849696f
doc wip
hjh0119 Sep 12, 2025
f70e827
doc
hjh0119 Sep 12, 2025
274f6db
check lora
hjh0119 Sep 12, 2025
f0b4de8
support only sync lora weight
hjh0119 Sep 12, 2025
6069888
add args for lora script
hjh0119 Sep 12, 2025
0cf5a62
update script
hjh0119 Sep 12, 2025
691a5df
fix
hjh0119 Sep 12, 2025
5cab78d
remove unused import
hjh0119 Sep 12, 2025
e43a0da
fix
hjh0119 Sep 12, 2025
688bf64
fix typo
hjh0119 Sep 12, 2025
4fa2d2f
fix unmerge
hjh0119 Sep 12, 2025
78f9473
wip
hjh0119 Sep 28, 2025
a50f756
Merge branch 'lora+' of github.com:hjh0119/swift into lora+
hjh0119 Sep 28, 2025
be00cf4
Merge remote-tracking branch 'origin' into lora+
hjh0119 Oct 9, 2025
6745666
bucket for full training in server mode
hjh0119 Oct 9, 2025
dfccf15
remove circle import
hjh0119 Oct 10, 2025
4652f77
fix TokenizerGroup removed in vllm 0.11.0
hjh0119 Oct 10, 2025
8c73590
rm comments
hjh0119 Oct 10, 2025
4389268
Merge branch 'lora+' of github.com:hjh0119/swift into lora+
hjh0119 Oct 10, 2025
cc56588
move model batches for full parameters
hjh0119 Oct 13, 2025
1360752
Merge branch 'lora+' of github.com:hjh0119/swift into lora+
hjh0119 Oct 13, 2025
1561e8f
fix lora training with rollout enable_lora
hjh0119 Oct 13, 2025
0f81605
Merge remote-tracking branch 'origin' into lora+
hjh0119 Oct 13, 2025
24220bf
check should merge adapter
hjh0119 Oct 14, 2025
c66a801
update accuracy & test accuracy & moe script
hjh0119 Oct 15, 2025
190c201
add test cases
hjh0119 Oct 16, 2025
c186507
colocate moe script & fix moe colocate lora training
hjh0119 Oct 16, 2025
b9a1827
add script
hjh0119 Oct 17, 2025
eb3e230
doc update
hjh0119 Oct 17, 2025
9b503a7
rm sricpt & update doc
hjh0119 Oct 17, 2025
67f36a6
streamline weight sync
hjh0119 Oct 17, 2025
a406f6d
clean comments and import
hjh0119 Oct 17, 2025
0380c08
add vllm_enable_lora for colocate
hjh0119 Oct 17, 2025
9f98473
fix zh comment in en doc
hjh0119 Oct 17, 2025
6 changes: 5 additions & 1 deletion docs/source/BestPractices/GRPO代码训练.md
@@ -42,7 +42,9 @@
```bash
CUDA_VISIBLE_DEVICES=7 \
swift rollout \
--model Qwen/Qwen2.5-7B-Instruct
--model Qwen/Qwen2.5-7B-Instruct \
--vllm_enable_lora true \
--vllm_max_lora_rank 16
```

```bash
@@ -61,6 +63,8 @@ swift rlhf \
--vllm_server_host 127.0.0.1 \
--vllm_server_port 8000 \
--train_type lora \
--lora_rank 16 \
--lora_alpha 32 \
--torch_dtype bfloat16 \
--dataset 'open-r1/verifiable-coding-problems-python-10k' \
--load_from_cache_file true \
26 changes: 25 additions & 1 deletion docs/source/Instruction/GRPO/GetStarted/GRPO.md
@@ -185,7 +185,7 @@ swift rollout \

For more rollout parameters, refer to [vLLM arguments](../../../Instruction/命令行参数.md#vllm参数) and [rollout arguments](../../../Instruction/命令行参数.md#rollout参数).

Note: When using use_async_engine, enabling only DP may cause errors; see this [vllm issue](https://github.com/vllm-project/vllm/issues/18567). If errors occur, try enabling both TP and DP.
Note: When using use_async_engine, enabling only DP may cause errors; see this [vllm issue](https://github.com/vllm-project/vllm/issues/18567). If errors occur, try enabling both TP and DP, or upgrade vLLM.


Configure the external vLLM server for training with the following parameters:
@@ -196,6 +196,30 @@ swift rollout \
--vllm_server_port <service_port> \
--vllm_server_timeout <timeout> \
```
#### Weight Sync Acceleration
swift 3.10 optimizes weight synchronization; setting the following parameters can further speed up weight synchronization for LoRA training.

```bash
# rollout(server mode)
swift rollout \
--vllm_enable_lora true \
--vllm_max_lora_rank xxx # must match lora_rank in the training script
...

# grpo(colocate mode)
swift rlhf \
--rlhf_type grpo \
--vllm_mode colocate \
--vllm_enable_lora true \
...
```

Note: this optimization cannot be used in the following cases:

- Training the ViT layers of multimodal models (freeze_vit false)
- MoE models

For implementation details of the optimization, refer to this [PR](https://github.com/modelscope/ms-swift/pull/5773).
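
As a concrete illustration, a minimal colocate-mode LoRA command with this optimization could look like the sketch below (model, dataset, and rank values are placeholders, not a recommended configuration):

```bash
# Sketch: colocate-mode GRPO LoRA training with LoRA-only weight sync enabled.
# Model, dataset, and LoRA rank are illustrative placeholders.
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --reward_funcs accuracy \
    --use_vllm true \
    --vllm_mode colocate \
    --vllm_enable_lora true \
    --train_type lora \
    --lora_rank 16 \
    --lora_alpha 32 \
    --dataset AI-MO/NuminaMath-TIR#1000
```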

## logged metrics
- completions/mean_length: The average length of generated completions.
11 changes: 8 additions & 3 deletions docs/source/Instruction/命令行参数.md
@@ -526,17 +529,20 @@ Reward model parameters are used in PPO and GRPO.
- vllm_mode: vLLM integration mode; options are `server` and `colocate`. In server mode, sampling uses a vLLM server launched with `swift rollout`; in colocate mode, vLLM is deployed inside the training process.
- vllm_mode server parameters
- vllm_server_base_url: Base URL of the vLLM server (e.g. http://local_host:8000). Default is None. When set, the host and port settings are ignored.
- vllm_server_host: The host address of the vLLM server. Default is None. This is used when connecting to an external vLLM server.
- vllm_server_host: The host address of the vLLM server. Default is None.
- vllm_server_port: The service port of the vLLM server. Default is 8000.
- vllm_server_timeout: The connection timeout for the vLLM server. Default is 240 s.
- vllm_server_pass_dataset: Pass additional dataset information through to the vLLM server, used for multi-turn training.
- async_generate: Use async rollout to improve training speed. Note that when enabled, sampling uses the model from the previous update; multi-turn scenarios are not supported. Default is `false`.
- SWIFT_UPDATE_WEIGHTS_BUCKET_SIZE: Environment variable controlling the transfer bucket size for weight synchronization, applicable to full-parameter training in server mode. Unit is MB; default is 512 MB.
- vllm_mode colocate parameters (for more supported parameters, see [vLLM arguments](#vLLM参数))
- vllm_gpu_memory_utilization: vLLM passthrough parameter. Default is 0.9.
- vllm_max_model_len: vLLM passthrough parameter. Default is None.
- vllm_enforce_eager: vLLM passthrough parameter. Default is False.
- vllm_limit_mm_per_prompt: vLLM passthrough parameter. Default is None.
- vllm_enable_prefix_caching: vLLM passthrough parameter. Default is True.
- vllm_tensor_parallel_size: Tensor-parallel size. Default is `1`.
- vllm_enable_lora: Enable the vLLM engine to load LoRA adapters. Default is False. Used to accelerate weight synchronization during LoRA training; see the [documentation](./GRPO/GetStarted/GRPO.md#权重同步加速) for details.
- sleep_level: Release vLLM GPU memory while the model is training. Options are [0, 1]; default is 0 (no release).
- offload_optimizer: Whether to offload optimizer parameters during vLLM inference. Default is False.
- offload_model: Whether to offload the model during vLLM inference. Default is False.
@@ -549,7 +552,7 @@ Reward model parameters are used in PPO and GRPO.
- sync_ref_model: Whether to periodically synchronize the ref_model. Default is False.
- ref_model_mixup_alpha: Controls the mix between the model and the previous ref_model during updates. The update formula is $π_{ref} = α * π_θ + (1 - α) * π_{ref_{prev}}$. Default is 0.6.
- ref_model_sync_steps: Synchronization frequency. Default is 512.
- move_model_batches: When moving model parameters to fast inference frameworks such as vLLM, how many batches to split the layers into. Default is None, meaning the whole model is not split; otherwise it is split into move_model_batches + 1 (non-layer parameters) + 1 (multimodal parameters) batches. Note: this parameter is only meaningful for LoRA (PEFT) training.
- move_model_batches: When moving model parameters to fast inference frameworks such as vLLM, how many batches to split the layers into. Default is None, meaning the whole model is not split; otherwise it is split into move_model_batches + 1 (non-layer parameters) + 1 (multimodal parameters) batches.
- multi_turn_scheduler: Multi-turn GRPO parameter; pass the corresponding plugin name and add the implementation in plugin/multi_turn.py.
- max_turns: Maximum number of turns for multi-turn GRPO. Default is None (no limit).
- dynamic_sample: Filter out data whose reward standard deviation within the group is 0 and sample additional new data. Default is False.
@@ -604,8 +607,10 @@ Soft overlong reward parameters

### Rollout Arguments
Rollout arguments inherit from the [deployment arguments](#部署参数).
- multi_turn_scheduler: Scheduler for multi-turn GRPO training; pass the corresponding plugin name and add the implementation in plugin/multi_turn.py. Default is None; see the [documentation](./GRPO/DeveloperGuide/多轮训练.md) for details.
- multi_turn_scheduler: Scheduler for multi-turn GRPO training; pass the corresponding plugin name and add the implementation in plugin/multi_turn.py. Default is None; see the [documentation](./GRPO/DeveloperGuide/多轮训练.md) for details.
- max_turns: Maximum number of turns for multi-turn GRPO training. Default is None (no limit).
- vllm_enable_lora: Enable the vLLM engine to load LoRA adapters. Default is False. Used to accelerate weight synchronization during LoRA training; see the [documentation](./GRPO/GetStarted/GRPO.md#权重同步加速) for details.
- vllm_max_lora_rank: LoRA parameter of the vLLM engine. Must be greater than or equal to the training lora_rank; setting them equal is recommended. Default is 16.

### Web-UI Arguments
- server_name: Host of the web UI. Default is '0.0.0.0'.
6 changes: 5 additions & 1 deletion docs/source_en/BestPractices/GRPO-Code-Training.md
@@ -46,7 +46,9 @@ launch external vLLM server using following script
```bash
CUDA_VISIBLE_DEVICES=7 \
swift rollout \
--model Qwen/Qwen2.5-7B-Instruct
--model Qwen/Qwen2.5-7B-Instruct \
--vllm_enable_lora true \
--vllm_max_lora_rank 16
```

```bash
Expand All @@ -65,6 +67,8 @@ swift rlhf \
--vllm_server_host 127.0.0.1 \
--vllm_server_port 8000 \
--train_type lora \
--lora_rank 16 \
--lora_alpha 32 \
--torch_dtype bfloat16 \
--dataset 'open-r1/verifiable-coding-problems-python-10k' \
--load_from_cache_file true \
9 changes: 7 additions & 2 deletions docs/source_en/Instruction/Command-line-parameters.md
@@ -535,17 +538,20 @@ The meanings of the following parameters can be referenced [here](https://huggin
- vllm_mode: Mode to use for vLLM integration when `use_vllm` is set to `True`. Must be one of `server` or `colocate`.
- vllm_mode server parameter
- vllm_server_base_url: Base URL for the vLLM server (e.g., 'http://localhost:8000'). If provided, `vllm_server_host` and `vllm_server_port` are ignored. Default is None.
- vllm_server_host: The host address of the vLLM server. Default is None. This is used when connecting to an external vLLM server.
- vllm_server_host: The host address of the vLLM server. Default is None.
- vllm_server_port: The service port of the vLLM server. Default is 8000.
- vllm_server_timeout: The connection timeout for the vLLM server. Default is 240 seconds.
- vllm_server_pass_dataset: pass additional dataset information through to the vLLM server for multi-turn training.
- async_generate: Use async rollout to improve train speed. Note that rollout will use the model updated in the previous round when enabled. Multi-turn scenarios are not supported. Default is `false`.
- SWIFT_UPDATE_WEIGHTS_BUCKET_SIZE: An environment variable that controls the bucket size (in MB) for weight synchronization during full-parameter training in Server Mode. Default is 512 MB.
- vllm_mode colocate parameter (For more parameter support, refer to the [vLLM Arguments](#vLLM-Arguments).)
- vllm_gpu_memory_utilization: vLLM passthrough parameter, default is 0.9.
- vllm_max_model_len: vLLM passthrough parameter, the total length limit of model, default is None.
- vllm_enforce_eager: vLLM passthrough parameter, default is False.
- vllm_limit_mm_per_prompt: vLLM passthrough parameter, default is None.
- vllm_enable_prefix_caching: A pass-through parameter for vLLM, default is True.
- vllm_tensor_parallel_size: the tensor parallel size of vLLM engine, default is 1.
- vllm_enable_lora: Enable the vLLM engine to load LoRA adapters; defaults to False. Used to accelerate weight synchronization during LoRA training. See the [documentation](./GRPO/GetStarted/GRPO.md#weight-sync-acceleration) for details.
- sleep_level: Make vLLM sleep while the model is training. Options are 0 or 1; default is 0 (no sleep).
- offload_optimizer: Whether to offload optimizer parameters during inference with vLLM. The default is `False`.
- offload_model: Whether to offload the model during inference with vLLM. The default is `False`.
@@ -563,7 +566,7 @@ The meanings of the following parameters can be referenced [here](https://huggin
- sync_ref_model: Whether to synchronize the reference model. Default is False.
- ref_model_mixup_alpha: This parameter controls the mix between the current policy and the previous reference policy during updates. The reference policy is updated according to the equation: $π_{ref} = α * π_θ + (1 - α) * π_{ref_{prev}}$. Default is 0.6.
- ref_model_sync_steps: The parameter determines how frequently the current policy is synchronized with the reference policy. Default is 512.
- move_model_batches: When moving model parameters to fast inference frameworks such as vLLM/LMDeploy, determines how many batches to divide the layers into. The default is `None`, which means the entire model is not split. Otherwise, the model is split into `move_model_batches + 1` (non-layer parameters) + `1` (multi-modal component parameters) batches. This parameter is only meaningful for LoRA (PEFT).
- move_model_batches: When moving model parameters to fast inference frameworks such as vLLM/LMDeploy, determines how many batches to divide the layers into. The default is `None`, which means the entire model is not split. Otherwise, the model is split into `move_model_batches + 1` (non-layer parameters) + `1` (multi-modal component parameters) batches.
- multi_turn_scheduler: Multi-turn GRPO parameter; pass the corresponding plugin name, and make sure to implement it in plugin/multi_turn.py.
- max_turns: Maximum number of rounds for multi-turn GRPO. The default is None, which means there is no limit.
- dynamic_sample: Exclude data within the group where the reward standard deviation is 0, and additionally sample new data. Default is False.
Expand Down Expand Up @@ -623,6 +626,8 @@ Deployment Arguments inherit from the [inference arguments](#inference-arguments
The rollout parameters inherit from the [deployment parameters](#deployment-arguments).
- multi_turn_scheduler: The scheduler for multi-turn GRPO training. Pass the corresponding plugin name, and ensure the implementation is added in `plugin/multi_turn.py`. Default is `None`. See [documentation](./GRPO/DeveloperGuide/multi_turn.md) for details.
- max_turns: Maximum number of turns in multi-turn GRPO training. Default is `None`, meaning no limit.
- vllm_enable_lora: Enable the vLLM engine to load LoRA adapters; defaults to False. Used to accelerate weight synchronization during LoRA training. See the [documentation](./GRPO/GetStarted/GRPO.md#weight-sync-acceleration) for details.
- vllm_max_lora_rank: LoRA parameter for the vLLM engine. Must be greater than or equal to the training lora_rank; it is recommended to set them equal. Defaults to 16.

### Web-UI Arguments
- server_name: Host for the web UI, default is '0.0.0.0'.
27 changes: 26 additions & 1 deletion docs/source_en/Instruction/GRPO/GetStarted/GRPO.md
@@ -183,7 +183,7 @@ swift rollout \
```
For more rollout parameters, refer to the [vllm arguments](../../../Instruction/Command-line-parameters.md#vllm-arguments) and [rollout arguments](../../../Instruction/Command-line-parameters.md#rollout-arguments)

Note: When `use_async_engine` is set, enabling only DP (Data Parallelism) may cause errors. [Related issue](https://github.com/vllm-project/vllm/issues/18567). If errors occur, try enabling both TP (Tensor Parallelism) and DP.
Note: When `use_async_engine` is set, enabling only DP (Data Parallelism) may cause errors. [Related issue](https://github.com/vllm-project/vllm/issues/18567). If errors occur, try enabling both TP (Tensor Parallelism) and DP, or upgrade vLLM.

To configure the external vLLM server during training, use the following parameters:

@@ -194,6 +194,31 @@ To configure the external vLLM server during training, use the following paramet
--vllm_server_port <service_port> \
--vllm_server_timeout <timeout> \
```

#### Weight-Sync Acceleration
Swift 3.10 optimizes weight synchronization, and setting the following parameters can further improve the weight synchronization speed for LoRA training:

```bash
# rollout(server mode)
swift rollout \
--vllm_enable_lora true \
--vllm_max_lora_rank xxx # match the lora_rank in the training script
...

# grpo(colocate mode)
swift rlhf \
--rlhf_type grpo \
--vllm_mode colocate \
--vllm_enable_lora true \
...
```
Note: This optimization cannot be used in the following cases:

- Training the ViT layers of multimodal models (freeze_vit set to false)
- MoE models

For implementation details, please refer to the [PR](https://github.com/modelscope/ms-swift/pull/5773)
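
As a concrete pairing, a server-mode LoRA setup with matching ranks could look like this sketch (the model name and rank value are illustrative; remaining training arguments follow your usual configuration):

```bash
# Sketch: rollout server plus GRPO training command with LoRA-only weight sync.
# --vllm_max_lora_rank on the rollout side must equal --lora_rank on the training side (16 here).
swift rollout \
    --model Qwen/Qwen2.5-7B-Instruct \
    --vllm_enable_lora true \
    --vllm_max_lora_rank 16

swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --use_vllm true \
    --vllm_mode server \
    --vllm_server_host 127.0.0.1 \
    --vllm_server_port 8000 \
    --train_type lora \
    --lora_rank 16 \
    --lora_alpha 32
```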

## logged metrics
- completions/mean_length: The average length of generated completions.
- completions/min_length: The minimum length among generated completions.
6 changes: 6 additions & 0 deletions examples/train/grpo/external/README.md
@@ -7,6 +7,12 @@
1. vLLM version 0.8.3 or higher.
2. trl version 0.17.0 or higher
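
One way to satisfy these requirements is to install pinned minimum versions from PyPI, for example:

```bash
# Install the minimum vLLM and trl versions required for external (server-mode) rollout.
pip install "vllm>=0.8.3" "trl>=0.17.0"
```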

For LoRA training, set the following parameters to speed up weight updates:
```bash
--vllm_enable_lora true
--vllm_max_lora_rank xxx # same as lora_rank in training script
```
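
On the training side, these rollout flags pair with a LoRA configuration whose rank matches `vllm_max_lora_rank`. For example, with an assumed rank of 16, the corresponding flags added to the usual `swift rlhf --rlhf_type grpo --vllm_mode server ...` command would be:

```bash
--train_type lora
--lora_rank 16 # must equal vllm_max_lora_rank on the rollout server
--lora_alpha 32
```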

## **Introduction**

The GRPO (Group Relative Policy Optimization) training framework supports high-performance inference engines like vLLM to accelerate the sampling process. The **External Mode** allows you to connect to an external vLLM inference server, separating the inference service from the training process. This mode is ideal for scenarios where you want to offload inference to dedicated hardware or servers, improving resource utilization and scalability.
44 changes: 44 additions & 0 deletions examples/train/grpo/external/moe_full.sh
@@ -0,0 +1,44 @@
# 8*80G

# CUDA_VISIBLE_DEVICES=0 \
# swift rollout \
# --model Qwen/Qwen3-30B-A3B-Instruct-2507 \
# --vllm_max_model_len 16384 \
# --vllm_enable_prefix_caching true

CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 \
NPROC_PER_NODE=7 \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--reward_funcs accuracy \
--use_vllm true \
--vllm_mode server \
--vllm_server_host 127.0.0.1 \
--vllm_server_port 8000 \
--train_type full \
--torch_dtype bfloat16 \
--dataset AI-MO/NuminaMath-TIR#1000 \
--max_length 12000 \
--max_completion_length 8192 \
--overlong_filter true \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--learning_rate 1e-6 \
--gradient_accumulation_steps 4 \
--save_strategy 'steps' \
--eval_strategy 'steps' \
--eval_steps 1000 \
--save_steps 1000 \
--save_total_limit 10 \
--logging_steps 1 \
--warmup_ratio 0.01 \
--dataloader_num_workers 4 \
--num_generations 14 \
--temperature 1.0 \
--deepspeed zero3_offload \
--log_completions true \
--report_to tensorboard swanlab \
--num_iterations 1 \
--beta 0.001 \
--move_model_batches 5
44 changes: 44 additions & 0 deletions examples/train/grpo/external/moe_lora.sh
@@ -0,0 +1,44 @@
# 8*80G

# CUDA_VISIBLE_DEVICES=0 \
# swift rollout \
# --model Qwen/Qwen3-30B-A3B-Instruct-2507 \
# --vllm_max_model_len 16384 \
# --vllm_enable_prefix_caching true

CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 \
NPROC_PER_NODE=7 \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--reward_funcs accuracy \
--use_vllm true \
--vllm_mode server \
--vllm_server_host 127.0.0.1 \
--vllm_server_port 8000 \
--train_type lora \
--torch_dtype bfloat16 \
--dataset AI-MO/NuminaMath-TIR#1000 \
--max_length 12000 \
--max_completion_length 8192 \
--overlong_filter true \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--learning_rate 1e-6 \
--gradient_accumulation_steps 4 \
--save_strategy 'steps' \
--eval_strategy 'steps' \
--eval_steps 1000 \
--save_steps 1000 \
--save_total_limit 10 \
--logging_steps 1 \
--warmup_ratio 0.01 \
--dataloader_num_workers 4 \
--num_generations 14 \
--temperature 1.0 \
--deepspeed zero3 \
--log_completions true \
--report_to tensorboard swanlab \
--num_iterations 1 \
--beta 0.001 \
--move_model_batches 5
40 changes: 40 additions & 0 deletions examples/train/grpo/internal/moe_full.sh
@@ -0,0 +1,40 @@
# 8*80G

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--reward_funcs accuracy \
--use_vllm true \
--vllm_mode colocate \
--vllm_gpu_memory_utilization 0.4 \
--vllm_tensor_parallel_size 2 \
--vllm_max_model_len 16384 \
--train_type full \
--torch_dtype bfloat16 \
--dataset AI-MO/NuminaMath-TIR#1000 \
--max_length 12000 \
--max_completion_length 8192 \
--overlong_filter true \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--learning_rate 1e-6 \
--gradient_accumulation_steps 4 \
--save_strategy 'steps' \
--eval_strategy 'steps' \
--eval_steps 1000 \
--save_steps 1000 \
--save_total_limit 10 \
--logging_steps 1 \
--warmup_ratio 0.01 \
--dataloader_num_workers 4 \
--num_generations 16 \
--temperature 1.0 \
--deepspeed zero3_offload \
--log_completions true \
--sleep_level 1 \
--report_to tensorboard swanlab \
--num_iterations 1 \
--beta 0.001 \
--move_model_batches 10
42 changes: 42 additions & 0 deletions examples/train/grpo/internal/moe_lora.sh
@@ -0,0 +1,42 @@
# 8*80G

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--reward_funcs accuracy \
--use_vllm true \
--vllm_mode colocate \
--vllm_gpu_memory_utilization 0.4 \
--vllm_tensor_parallel_size 2 \
--vllm_max_model_len 16384 \
--train_type lora \
--torch_dtype bfloat16 \
--dataset AI-MO/NuminaMath-TIR#1000 \
--max_length 12000 \
--max_completion_length 8192 \
--overlong_filter true \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--learning_rate 1e-6 \
--gradient_accumulation_steps 4 \
--save_strategy 'steps' \
--eval_strategy 'steps' \
--eval_steps 1000 \
--save_steps 1000 \
--save_total_limit 10 \
--logging_steps 1 \
--warmup_ratio 0.01 \
--dataloader_num_workers 4 \
--num_generations 16 \
--temperature 1.0 \
--deepspeed zero3 \
--log_completions true \
--sleep_level 1 \
--offload_model true \
--offload_optimizer true \
--report_to tensorboard swanlab \
--num_iterations 1 \
--beta 0.001 \
--move_model_batches 10
1 change: 0 additions & 1 deletion examples/train/grpo/internal/vllm_72b_4gpu.sh
@@ -36,7 +36,6 @@ swift rlhf \
--top_p 1.0 \
--top_k 80 \
--log_completions true \
--async_generate false \
--move_model_batches 16 \
--offload_optimizer true \
--offload_model true \