209 changes: 209 additions & 0 deletions docs/source/BestPractices/Elastic.md
@@ -0,0 +1,209 @@
# Elastic



## Installing Dependencies

Deploy Kubernetes on the cluster and deploy [DLRover](https://github.com/intelligent-machine-learning/dlrover) in it, then install the Python dependencies:

`pip install dlrover && pip install tornado && pip install kubernetes && pip install ms-swift`

Other dependencies and versions in the training image, verified through repeated testing:

- deepspeed 0.16.5 (apply the fix in https://github.com/deepspeedai/DeepSpeed/pull/7585/files to resolve the universal checkpoint issue)
- pytorch 2.6.0
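A quick way to sanity-check that the image actually contains these packages (plain pip/python introspection, nothing DLRover-specific):

```bash
# Confirm the required packages are installed and inspect their versions
pip show dlrover tornado kubernetes ms-swift deepspeed | grep -E '^(Name|Version)'
# Confirm the PyTorch version matches the tested 2.6.0
python -c "import torch; print(torch.__version__)"
```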


## How to Launch

A launch command is composed of `dlrover-run` + DLRover arguments + the swift entry point + swift arguments. Apart from its custom arguments, `dlrover-run` accepts the same arguments as `torchrun`. The `dlrover-run` arguments are:
```
usage: dlrover-run [-h] [--nnodes NNODES] [--nproc-per-node NPROC_PER_NODE]
                   [--rdzv-backend RDZV_BACKEND] [--rdzv-endpoint RDZV_ENDPOINT] [--rdzv-id RDZV_ID]
                   [--rdzv-conf RDZV_CONF] [--standalone] [--max-restarts MAX_RESTARTS]
                   [--monitor-interval MONITOR_INTERVAL] [--start-method {spawn,fork,forkserver}]
                   [--role ROLE] [-m] [--no-python] [--run-path] [--log-dir LOG_DIR] [-r REDIRECTS]
                   [-t TEE] [--local-ranks-filter LOCAL_RANKS_FILTER] [--node-rank NODE_RANK]
                   [--master-addr MASTER_ADDR] [--master-port MASTER_PORT] [--local-addr LOCAL_ADDR]
                   [--logs-specs LOGS_SPECS] [--precheck {0,1,2}] [--node_unit NODE_UNIT]
                   [--auto_config] [--auto_tunning] [--exclude-straggler] [--save_at_breakpoint]
                   [--accelerator {nvidia.com/gpu,ascend-npu}] [--training_port TRAINING_PORT]
                   [--switchbox-check] [--box-pairs PAIR [PAIR ...]] [--min-bandwidth MIN_BANDWIDTH]
                   [--min-channels MIN_CHANNELS] [--numa-affinity] [--network-check]
                   [--comm-perf-test] [--ucp_device_type UCP_DEVICE_TYPE]
                   training_script
```
The arguments to pay attention to for elastic training are:

```
--nnodes NNODES                    Number of nodes, or the range of nodes in form
                                   <minimum_nodes>:<maximum_nodes>.
--nproc-per-node NPROC_PER_NODE    Number of processes per node.
```

Example:

```bash
model=your_model_path            # path to your model
dataset=your_dataset             # your dataset
output=your_output_dir           # your output directory
export CUDA_VISIBLE_DEVICES=0    # set according to the GPUs actually in use
deepspeed_config_or_type=zero1   # DeepSpeed type or a config file path, e.g. zero1 or /xxx/ms-swift/swift/llm/ds_config/zero1.json

dlrover-run --nnodes 1:$NODE_NUM --nproc_per_node=1 \
/opt/conda/lib/python3.10/site-packages/swift/cli/sft.py --model $model \
--model_type qwen3 \
--train_type lora \
--torch_dtype bfloat16 \
--dataset $dataset \
--num_train_epochs 4 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--learning_rate 5e-7 \
--gradient_accumulation_steps 8 \
--eval_steps 500 \
--save_steps 10 \
--save_total_limit 20 \
--logging_steps 1 \
--output_dir $output \
--warmup_ratio 0.01 \
--dataloader_num_workers 4 \
--temperature 1.0 \
--system 'You are a helpful assistant.' \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--dataset_num_proc 1 \
--use_flash_ckpt true \
--deepspeed $deepspeed_config_or_type \
--elastic
```

> **Review comment (Contributor, severity: medium)** on the hardcoded `sft.py` path in the command above: The path to sft.py is hardcoded to a specific conda environment. This approach is brittle and not easily portable. It would be more robust to use the `-m` flag of `dlrover-run` to execute the module directly.
>
> Suggested change:
>
> ```diff
> -/opt/conda/lib/python3.10/site-packages/swift/cli/sft.py --model $model \
> +-m swift.cli.sft --model $model \
> ```

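The parameter documentation for `use_flash_ckpt` recommends pairing it with the expandable-segments allocator setting, so the example above is typically run with:

```bash
# Recommended together with --use_flash_ckpt true (see the use_flash_ckpt
# parameter docs) to avoid CUDA OOM during training.
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
```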
## Example Configuration File

By default, `zero1` corresponds to the following example configuration:

```json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": false,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "elasticity": {
        "ignore_non_elastic_batch_info": true,
        "enabled": true,
        "max_train_batch_size": 8,
        "micro_batch_sizes": [4, 2],
        "min_gpus": 1,
        "max_gpus": 4,
        "min_time": 20,
        "version": 0.1
    }
}
```

If you need a custom configuration, point `deepspeed_config_or_type` in the launch command at the path of your own zero1.json. The elasticity-related section is:
```json
...

"elasticity": {
    "ignore_non_elastic_batch_info": true,
    "enabled": true,
    "max_train_batch_size": 8,
    "micro_batch_sizes": [4, 2],
    "min_gpus": 1,
    "max_gpus": 4,
    "min_time": 20,
    "version": 0.1
}
```

- ignore_non_elastic_batch_info: the settings inside `elasticity` take precedence over the outer batch-size-related settings; during training, batch_size and related parameters are adjusted on the fly according to the actual number of training processes. The governing rule is:

  global-training-batch-size = micro-batch-size * gradient-accumulation-steps * world-size

- max_train_batch_size: the maximum global batch size
- micro_batch_sizes: candidate values for train_micro_batch_size_per_gpu
- min_gpus: the minimum number of GPUs
- max_gpus: the maximum number of GPUs

For more details, see [DeepSpeed](https://www.deepspeed.ai/docs/config-json/#elastic-training-config-v01-and-v02).
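To make the arithmetic concrete, here is a small sketch that enumerates valid combinations under the example configuration above. It only illustrates the formula; DeepSpeed's actual elasticity planner may apply additional constraints:

```python
# Enumerate (micro_batch, grad_accum, world_size) combinations whose global
# batch size stays within max_train_batch_size, following
#   global_batch = micro_batch * grad_accum * world_size
max_train_batch_size = 8       # values from the example elasticity config
micro_batch_sizes = [4, 2]
min_gpus, max_gpus = 1, 4

for world_size in range(min_gpus, max_gpus + 1):
    for micro in micro_batch_sizes:
        for accum in range(1, max_train_batch_size + 1):
            global_batch = micro * accum * world_size
            if global_batch <= max_train_batch_size:
                print(f"gpus={world_size} micro={micro} accum={accum} "
                      f"-> global_batch={global_batch}")
```

With 4 GPUs, for example, only micro batch 2 with a single accumulation step fits within the max_train_batch_size of 8.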


## Launching Training

```yaml
---
apiVersion: elastic.iml.github.io/v1alpha1
kind: ElasticJob
metadata:
  name: deepspeed-elastic-swift
  namespace: dlrover
spec:
  distributionStrategy: AllreduceStrategy
  optimizeMode: single-job
  replicaSpecs:
    worker:
      replicas: 1  # must match the maximum of --nnodes NNODES in the launch command
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              image:  # training image; deepspeed, dlrover and swift must be installed
              imagePullPolicy: IfNotPresent
              command:
                - /bin/bash
                - -c
                - sh start.sh  # launch script
              resources:
                limits:
                  cpu: '8'
                  memory: 16Gi
                  nvidia.com/gpu: '1'
              volumeMounts:
                - mountPath: /model
                  name: volume-model
                - mountPath: /dev/shm
                  name: volume-shm
          volumes:
            - hostPath:
                path: /model
                type: Directory
              name: volume-model
            - emptyDir:
                medium: Memory
                sizeLimit: 200Gi
              name: volume-shm
```
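The container above runs `start.sh`, which this document does not show; the following is only a sketch of what such a script might contain, reusing the launch command from earlier (the mount path, model location, and `NODE_NUM` value are assumptions, and the `-m swift.cli.sft` form follows the review suggestion above):

```bash
#!/bin/bash
# start.sh -- hypothetical launch script wrapping the dlrover-run command above.
# Assumes the model is mounted at /model and NODE_NUM matches the maximum
# worker replica count in the ElasticJob spec.
export NODE_NUM=1
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"

dlrover-run --nnodes 1:$NODE_NUM --nproc_per_node=1 \
    -m swift.cli.sft \
    --model /model/your_model \
    --train_type lora \
    --deepspeed zero1 \
    --use_flash_ckpt true \
    --elastic
```

The job itself is submitted with, e.g., `kubectl apply -f elastic-job.yaml -n dlrover`.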
7 changes: 6 additions & 1 deletion docs/source/Instruction/Command-line-parameters.md
@@ -188,6 +188,7 @@ gradient_checkpointing: true
- 🔥gradient_checkpointing: whether to use gradient checkpointing. Defaults to True. This parameter significantly reduces GPU memory usage at the cost of training speed.
- 🔥vit_gradient_checkpointing: whether to enable gradient checkpointing for the ViT part when training multimodal models. Defaults to None, i.e. it follows `gradient_checkpointing`. For an example see [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/vit_gradient_checkpointing.sh).
  - Note: for LoRA training of a multimodal model with `--freeze_vit false`, if the warning `UserWarning: None of the inputs have requires_grad=True. Gradients will be None` appears on the command line, set `--vit_gradient_checkpointing false` or open a related issue. Full-parameter training does not have this problem. (If the warning is raised by the ref_model during RLHF LoRA training, it is expected.)
- activation_cpu_offload: activation offloading is a memory-optimization technique in ms-swift that offloads the activation tensors saved during the forward pass to CPU memory and loads them back to the GPU when needed during the backward pass. It significantly reduces GPU memory usage, allowing you to train larger models or use larger batch sizes; a usage sketch follows this hunk.
- 🔥deepspeed: defaults to None. Can be set to 'zero0', 'zero1', 'zero2', 'zero3', 'zero2_offload', 'zero3_offload' to use the DeepSpeed configuration files built into ms-swift. You can also pass the path of a custom DeepSpeed configuration file.
- zero_hpz_partition_size: defaults to None. This is a ZeRO++ feature: model sharding within a node and data sharding across nodes. If you encounter grad_norm NaN, try `--torch_dtype float16`.
- deepspeed_autotp_size: DeepSpeed tensor-parallel size, defaults to 1. To use DeepSpeed AutoTP, set `--deepspeed` to 'zero0', 'zero1' or 'zero2'. (Note: this feature only supports full-parameter training.)
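As referenced in the activation_cpu_offload entry above, a sketch of turning on activation offloading together with gradient checkpointing (the model path is a placeholder, and the boolean flag values are assumptions based on the defaults described in this list):

```bash
# Hypothetical invocation: offload forward-pass activations to CPU memory and
# bring them back to the GPU during the backward pass, on top of gradient
# checkpointing and ZeRO-2.
swift sft \
    --model your_model_path \
    --train_type lora \
    --gradient_checkpointing true \
    --activation_cpu_offload true \
    --deepspeed zero2
```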
@@ -250,6 +251,9 @@ gradient_checkpointing: true
- 🔥neftune_noise_alpha: the noise coefficient added by NEFTune. Defaults to 0; values of 5, 10, or 15 are common.
- 🔥use_liger_kernel: whether to enable the [Liger](https://github.com/linkedin/Liger-Kernel) kernel to accelerate training and reduce GPU memory usage. Defaults to False. For an example shell script see [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/liger).
  - Note: liger_kernel does not support device_map; use DDP/DeepSpeed for multi-GPU training. liger_kernel currently only supports `task_type='causal_lm'`.
- use_cce: whether to enable the [cut-cross-entropy](https://github.com/apple/ml-cross-entropy) fused operator to reduce GPU memory usage and accelerate training. Defaults to False. For an example shell script see [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/cce).
- use_tiled_mlp: whether to enable Tiled MLP for memory-efficient long-sequence training. When enabled, MLP layers are replaced with a tiled implementation that splits the sequence into shards to reduce GPU memory usage. Defaults to False. A usage sketch follows this hunk.
- tiled_mlp_num_shards: the number of shards the sequence is split into for Tiled MLP computation. Defaults to None, i.e. 4. Larger values reduce GPU memory usage but may increase computation time.
- average_tokens_across_devices: whether to average the token count across devices. If True, `num_tokens_in_batch` is synchronized via all_reduce for accurate loss computation. Defaults to False.
- max_grad_norm: gradient clipping. Defaults to 1.
  - Note: the grad_norm recorded in the logs is the value before clipping.
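As referenced in the use_tiled_mlp entry above, a sketch combining the long-sequence memory savers from this list (the model path and sequence length are placeholders; treat the combination as illustrative rather than a tested recipe):

```bash
# Hypothetical long-sequence run: tile the MLP over 4 sequence shards and use
# the cut-cross-entropy fused loss to keep activation memory down.
swift sft \
    --model your_model_path \
    --max_length 32768 \
    --use_tiled_mlp true \
    --tiled_mlp_num_shards 4 \
    --use_cce true
```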
@@ -473,7 +477,8 @@ Vera uses the three parameters `target_modules`, `target_regex`, and `modules_to_save`,
- eval_dataset_args: evaluation dataset parameters in JSON format; parameters for multiple datasets can be set.
- eval_limit: the number of samples drawn from the evaluation dataset.
- eval_generation_config: model inference configuration during evaluation, in JSON format. Defaults to `{'max_tokens': 512}`.
- use_flash_ckpt: whether to enable the flash checkpoint of [DLRover Flash Checkpoint](https://github.com/intelligent-machine-learning/dlrover). Defaults to `false`. When enabled, weights are first saved to shared memory and then persisted asynchronously. It is recommended to use it together with `PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"` to avoid CUDA OOM during training.
- elastic: whether to enable elastic training. Depends on [DLRover](https://github.com/intelligent-machine-learning/dlrover): `pip install dlrover && pip install tornado && pip install kubernetes`. See the [example](../BestPractices/Elastic.md) for usage.
> **Review comment (Contributor, severity: medium)** on the elastic entry above: The formatting for the pip command is incorrect, which may cause rendering issues (a trailing space inside the backticks). Additionally, there are some punctuation inconsistencies, such as using English commas instead of Chinese ones. Suggested change: `pip install dlrover && pip install tornado && pip install kubernetes`.
- early_stop_interval: the early-stopping interval; training stops when best_metric has not improved for early_stop_interval cycles (based on `save_steps`; it is recommended to set `eval_steps` and `save_steps` to the same value). The implementation lives in the [callback plugin](https://github.com/modelscope/ms-swift/blob/main/swift/plugin/callback.py). For more complex early-stopping requirements, simply override the existing implementation in callback.py, as sketched below.
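A sketch of such an override, assuming the standard `transformers.TrainerCallback` interface that callback.py builds on (the class name and patience rule here are made up for illustration):

```python
from transformers import TrainerCallback

class PatienceEarlyStop(TrainerCallback):
    """Hypothetical early-stop rule: halt after `patience` consecutive
    evaluations without improvement in eval_loss."""

    def __init__(self, patience: int = 3):
        self.patience = patience
        self.best = None
        self.bad_rounds = 0

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        metric = (metrics or {}).get("eval_loss")
        if metric is None:
            return
        if self.best is None or metric < self.best:
            self.best, self.bad_rounds = metric, 0
        else:
            self.bad_rounds += 1
        if self.bad_rounds >= self.patience:
            # Signal the Trainer to stop at the end of the current step.
            control.should_training_stop = True
```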

#### SWANLAB