Support features of cut cross entropy, TiledMLP and activation_offload #7129
Closed
Changes from all commits
Commits (21)
6cdaad8
support deepspeed elastic
meichangsu1 7bb510a
code refactor
meichangsu1 5316bb2
code refactor
meichangsu1 8bbb327
[feat] Add support cut-cross-entropy
w1ida 73773fc
[misc] add use_cce test
w1ida ecbd2ae
[misc] fix code-assist bot problem
w1ida 5d3308b
Merge branch 'modelscope:main' into main
meichangsu1 af10ed4
[misc] add docs & example
w1ida 5be67fe
Merge branch 'modelscope:main' into main
meichangsu1 e5229ef
tiled mlp
kevssim d0cb8bd
lint fix
kevssim 4efc13a
feat: use Axolotl fork to support more models
w1ida 7d11e87
update docs
kevssim 3ec4180
Limit CCE model mapping to Liger-supported types
w1ida 4700709
npu support
kevssim 8901b38
Merge branch 'modelscope:main' into feat/tiled_mlp
vx120 9a4179d
[feat] support activation cpu offload in fsdp and fsdp2
meichangsu1 2dee5e5
Merge feature cut_cross_entropy
vx120 ab26acc
[feat] support activation cpu offload in fsdp and fsdp2 lint fix
meichangsu1 e64c462
Merge feature TiledMLP
vx120 8b8deb6
Merge feature activation_offload
vx120
@@ -0,0 +1,209 @@
# Elastic
## Installing Dependencies

Deploy a Kubernetes cluster and deploy [DLRover](https://github.com/intelligent-machine-learning/dlrover) in it, then install the Python dependencies:
`pip install dlrover && pip install tornado && pip install kubernetes && pip install ms-swift`

Other dependencies and versions in the training image that have been verified through repeated testing:
- deepspeed 0.16.5 (apply https://github.com/deepspeedai/DeepSpeed/pull/7585/files to fix a universal-checkpoint issue)
- pytorch 2.6.0
## How to Launch

The launch command is composed of `dlrover-run` + dlrover arguments + the swift launch command + swift arguments. Apart from its custom arguments, `dlrover-run` accepts the same arguments as `torchrun`.
The `dlrover-run` arguments are:
```
usage: dlrover-run [-h] [--nnodes NNODES] [--nproc-per-node NPROC_PER_NODE]
                   [--rdzv-backend RDZV_BACKEND] [--rdzv-endpoint RDZV_ENDPOINT] [--rdzv-id RDZV_ID]
                   [--rdzv-conf RDZV_CONF] [--standalone] [--max-restarts MAX_RESTARTS]
                   [--monitor-interval MONITOR_INTERVAL] [--start-method {spawn,fork,forkserver}]
                   [--role ROLE] [-m] [--no-python] [--run-path] [--log-dir LOG_DIR] [-r REDIRECTS]
                   [-t TEE] [--local-ranks-filter LOCAL_RANKS_FILTER] [--node-rank NODE_RANK]
                   [--master-addr MASTER_ADDR] [--master-port MASTER_PORT] [--local-addr LOCAL_ADDR]
                   [--logs-specs LOGS_SPECS] [--precheck {0,1,2}] [--node_unit NODE_UNIT]
                   [--auto_config] [--auto_tunning] [--exclude-straggler] [--save_at_breakpoint]
                   [--accelerator {nvidia.com/gpu,ascend-npu}] [--training_port TRAINING_PORT]
                   [--switchbox-check] [--box-pairs PAIR [PAIR ...]] [--min-bandwidth MIN_BANDWIDTH]
                   [--min-channels MIN_CHANNELS] [--numa-affinity] [--network-check]
                   [--comm-perf-test] [--ucp_device_type UCP_DEVICE_TYPE]
                   training_script
```
The arguments that matter for elastic training are:
--nnodes NNODES                    Number of nodes, or the range of nodes in the form
                                   <minimum_nodes>:<maximum_nodes>.
--nproc-per-node NPROC_PER_NODE    Number of processes per node.

Example:
```bash
model=/path/to/your/model         # your model path
dataset=/path/to/your/dataset     # your dataset
output=/path/to/your/output_dir   # your output dir
export CUDA_VISIBLE_DEVICES=0     # set according to the GPUs actually in use
# DeepSpeed type or config file path, e.g. zero1 or /xxx/ms-swift/swift/llm/ds_config/zero1.json
deepspeed_config_or_type=zero1

dlrover-run --nnodes 1:$NODE_NUM --nproc_per_node=1 \
    /opt/conda/lib/python3.10/site-packages/swift/cli/sft.py --model $model \
    --model_type qwen3 \
    --train_type lora \
    --torch_dtype bfloat16 \
    --dataset $dataset \
    --num_train_epochs 4 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 5e-7 \
    --gradient_accumulation_steps 8 \
    --eval_steps 500 \
    --save_steps 10 \
    --save_total_limit 20 \
    --logging_steps 1 \
    --output_dir $output \
    --warmup_ratio 0.01 \
    --dataloader_num_workers 4 \
    --temperature 1.0 \
    --system You\ are\ a\ helpful\ assistant. \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --dataset_num_proc 1 \
    --use_flash_ckpt true \
    --deepspeed $deepspeed_config_or_type \
    --elastic
```

## Configuration File Example

By default, zero1 corresponds to the following example configuration:
```json
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },

  "bf16": {
    "enabled": "auto"
  },

  "zero_optimization": {
    "stage": 1,
    "offload_optimizer": {
      "device": "none",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  },

  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false,
  "elasticity": {
    "ignore_non_elastic_batch_info": true,
    "enabled": true,
    "max_train_batch_size": 8,
    "micro_batch_sizes": [4, 2],
    "min_gpus": 1,
    "max_gpus": 4,
    "min_time": 20,
    "version": 0.1
  }
}
```
If you need a custom configuration, point `deepspeed_config_or_type` in the launch command to the path of your own zero1.json. The elasticity-related settings are:
```json
...

"elasticity": {
  "ignore_non_elastic_batch_info": true,
  "enabled": true,
  "max_train_batch_size": 8,
  "micro_batch_sizes": [4, 2],
  "min_gpus": 1,
  "max_gpus": 4,
  "min_time": 20,
  "version": 0.1
}
```
- ignore_non_elastic_batch_info: the settings inside `elasticity` override the batch-size settings at the top level of the config; during training, batch_size and related parameters are updated on the fly according to the actual number of training processes (a small worked example follows this list).
  The calculation rule is:
  global-training-batch-size = micro-batch-size * gradient-accumulation-steps * world-size
- max_train_batch_size: the maximum global batch size.
- micro_batch_sizes: the candidate values of train_micro_batch_size_per_gpu.
- min_gpus: the minimum number of GPUs.
- max_gpus: the maximum number of GPUs.

For more details, see [Deepspeed](https://www.deepspeed.ai/docs/config-json/#elastic-training-config-v01-and-v02).
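For instance, with the example configuration above: if DeepSpeed picks a micro batch size of 2 from `micro_batch_sizes` and the job is currently running on 4 GPUs with a gradient accumulation of 1, the global training batch size becomes 2 * 1 * 4 = 8, which stays within `max_train_batch_size`.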

## Launching Training
```yaml
---
apiVersion: elastic.iml.github.io/v1alpha1
kind: ElasticJob
metadata:
  name: deepspeed-elastic-swift
  namespace: dlrover
spec:
  distributionStrategy: AllreduceStrategy
  optimizeMode: single-job
  replicaSpecs:
    worker:
      replicas: 1  # must match the maximum value of --nnodes NNODES in the launch command
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              image:  # training image with deepspeed, dlrover and swift installed
              imagePullPolicy: IfNotPresent
              command:
                - /bin/bash
                - -c
                - sh start.sh  # launch script
              resources:
                limits:
                  cpu: '8'
                  memory: 16Gi
                  nvidia.com/gpu: '1'
              volumeMounts:
                - mountPath: /model
                  name: volume-model
                - mountPath: /dev/shm
                  name: volume-shm
          volumes:
            - hostPath:
                path: /model
                type: Directory
              name: volume-model
            - emptyDir:
                medium: Memory
                sizeLimit: 200Gi
              name: volume-shm
```
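The `start.sh` referenced by the job spec is not included in this page; a hypothetical sketch, simply reusing the `dlrover-run` command from the "How to Launch" section inside the container, could look like this (paths under `/model` are assumptions based on the volume mounted by the job spec):

```bash
#!/bin/bash
# Hypothetical start.sh (not part of this PR): run the dlrover-run command
# from the launch example above inside the container.
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"  # recommended when use_flash_ckpt is enabled

model=/model/your_model      # placeholder paths on the mounted /model volume
dataset=/model/your_dataset
output=/model/output

dlrover-run --nnodes 1:$NODE_NUM --nproc_per_node=1 \
    /opt/conda/lib/python3.10/site-packages/swift/cli/sft.py \
    --model $model --model_type qwen3 --train_type lora \
    --dataset $dataset --output_dir $output \
    --deepspeed zero1 --use_flash_ckpt true --elastic
```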
@@ -188,6 +188,7 @@ gradient_checkpointing: true
- 🔥gradient_checkpointing: whether to use gradient_checkpointing. Defaults to True. This significantly reduces GPU memory usage at the cost of training speed.
- 🔥vit_gradient_checkpointing: for multimodal model training, whether to enable gradient_checkpointing for the ViT part. Defaults to None, i.e. it follows `gradient_checkpointing`. See the example [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/vit_gradient_checkpointing.sh).
  - Note: for multimodal LoRA training, if `--freeze_vit false` is set and the warning `UserWarning: None of the inputs have requires_grad=True. Gradients will be None` appears on the command line, set `--vit_gradient_checkpointing false` or open a related issue. Full-parameter training does not have this problem. (If the warning is raised by the ref_model during RLHF LoRA training, it is expected.)
- activation_cpu_offload: activation offloading is a memory-optimization technique in ms-swift that offloads activation tensors saved during the forward pass to CPU memory and loads them back onto the GPU when they are needed in the backward pass. It significantly reduces GPU memory usage, allowing you to train larger models or use larger batch sizes (a sketch of the idea follows this list).
- 🔥deepspeed: defaults to None. Can be set to 'zero0', 'zero1', 'zero2', 'zero3', 'zero2_offload', 'zero3_offload' to use the DeepSpeed configuration files built into ms-swift. You can also pass the path to a custom DeepSpeed configuration file.
- zero_hpz_partition_size: defaults to None. This is a ZeRO++ feature: model sharding within a node and data sharding across nodes. If you run into grad_norm NaN, try `--torch_dtype float16`.
- deepspeed_autotp_size: DeepSpeed tensor-parallel size, defaults to 1. When using DeepSpeed AutoTP, `--deepspeed` must be set to 'zero0', 'zero1' or 'zero2'. (Note: this feature only supports full-parameter training.)
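A hedged sketch of the mechanism behind `activation_cpu_offload`, using PyTorch's built-in saved-tensor hook; ms-swift's actual integration with FSDP/FSDP2 may differ:

```python
import torch
import torch.nn as nn

# Toy model and input; shapes are illustrative only.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
x = torch.randn(2, 512, 1024, device="cuda", requires_grad=True)

# Every tensor saved for the backward pass is moved to pinned host memory...
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = model(x).float().pow(2).mean()

loss.backward()  # ...and copied back to the GPU only when backward needs it
```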
@@ -250,6 +251,9 @@ gradient_checkpointing: true
- 🔥neftune_noise_alpha: the noise coefficient added by NEFTune. Defaults to 0; common values are 5, 10, 15.
- 🔥use_liger_kernel: whether to enable the [Liger](https://github.com/linkedin/Liger-Kernel) kernel to speed up training and reduce memory usage. Defaults to False. Example shell scripts are [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/liger).
  - Note: liger_kernel does not support device_map; use DDP/DeepSpeed for multi-GPU training. liger_kernel currently only supports `task_type='causal_lm'`.
- use_cce: whether to enable the [cut-cross-entropy](https://github.com/apple/ml-cross-entropy) fused operator to reduce memory usage and speed up training. Defaults to False. Example shell scripts are [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/cce) (see the sketch after this list).
- use_tiled_mlp: whether to enable Tiled MLP for memory-efficient long-sequence training. When enabled, MLP layers are replaced by a tiled implementation that splits the sequence into shards to reduce memory usage (see the sketch after this list). Defaults to False.
- tiled_mlp_num_shards: the number of shards the sequence is split into for Tiled MLP computation. Defaults to None, which means 4. Larger values reduce memory further but may increase compute time.
- average_tokens_across_devices: whether to average the token count across devices. If set to True, `num_tokens_in_batch` is synchronized with all_reduce for accurate loss computation. Defaults to False.
- max_grad_norm: gradient clipping. Defaults to 1.
  - Note: grad_norm in the logs is the value before clipping.
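To make the `use_cce` description concrete, here is a hedged sketch of what the fused operator does, assuming the `linear_cross_entropy` entry point from apple/ml-cross-entropy and illustrative tensor shapes (ms-swift patches this into the model for you when `use_cce` is enabled):

```python
import torch
from cut_cross_entropy import linear_cross_entropy  # pip install cut-cross-entropy

# Hidden states from the model and the lm_head weight; shapes are illustrative.
hidden = torch.randn(2, 512, 2048, device="cuda", dtype=torch.bfloat16)   # [batch, seq, dim]
lm_head = torch.randn(151936, 2048, device="cuda", dtype=torch.bfloat16)  # [vocab, dim]
labels = torch.randint(0, 151936, (2, 512), device="cuda")

# The [batch, seq, vocab] logits tensor is never materialized: the lm_head
# matmul and the cross-entropy reduction are fused into one operator.
loss = linear_cross_entropy(hidden, lm_head, labels)
```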
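Likewise, a minimal sketch of the idea behind `use_tiled_mlp` and `tiled_mlp_num_shards`; this illustrates the technique, not the PR's actual implementation:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TiledMLP(nn.Module):
    """Run a token-wise MLP shard-by-shard along the sequence dimension so that
    only one shard's intermediate activations are alive at a time."""

    def __init__(self, mlp: nn.Module, num_shards: int = 4):
        super().__init__()
        self.mlp = mlp
        self.num_shards = num_shards

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, dim]; checkpointing each shard trades recomputation
        # in the backward pass for a smaller activation footprint.
        shards = x.chunk(self.num_shards, dim=1)
        outs = [checkpoint(self.mlp, shard, use_reentrant=False) for shard in shards]
        return torch.cat(outs, dim=1)

# Usage: wrap an existing MLP block.
mlp = nn.Sequential(nn.Linear(2048, 8192), nn.SiLU(), nn.Linear(8192, 2048))
tiled = TiledMLP(mlp, num_shards=4)
y = tiled(torch.randn(2, 4096, 2048, requires_grad=True))
```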
@@ -473,7 +477,8 @@ Vera使用`target_modules`、`target_regex`、`modules_to_save`三个参数, | |||||
| - eval_dataset_args: 评测数据集参数,json格式,可设置多个数据集的参数。 | ||||||
| - eval_limit: 评测数据集采样数。 | ||||||
| - eval_generation_config: 评测时模型推理配置,json格式,默认为`{'max_tokens': 512}`。 | ||||||
| - use_flash_ckpt: 是否启用[DLRover Flash Checkpoint](https://github.com/intelligent-machine-learning/dlrover)的flash checkpoint。默认为`false`,启用后,权重会先保存至共享内存,之后异步持久化,目前暂不支持safetensors格式;建议搭配`PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"` 一起使用,避免训练过程CUDA OOM。 | ||||||
| - use_flash_ckpt: 是否启用[DLRover Flash Checkpoint](https://github.com/intelligent-machine-learning/dlrover)的flash checkpoint。默认为`false`,启用后,权重会先保存至共享内存,之后异步持久化;建议搭配`PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"` 一起使用,避免训练过程CUDA OOM。 | ||||||
| - elastic: 是否启用弹性,依赖[DLRover](https://github.com/intelligent-machine-learning/dlrover),`pip install dlrover && pip install tornado && pip install kubernetes `,具体使用参考[示例](../BestPractices/Elastic.md) | ||||||
Contributor
The formatting for the
Suggested change
- early_stop_interval: the early-stopping interval. Training stops when best_metric has not improved for early_stop_interval checkpoints (based on `save_steps`; it is recommended to set `eval_steps` and `save_steps` to the same value). The code lives in the [callback plugin](https://github.com/modelscope/ms-swift/blob/main/swift/plugin/callback.py). For more complex early-stopping needs, simply override the existing implementation in callback.py.

#### SWANLAB
The path to `sft.py` is hardcoded to a specific conda environment. This approach is brittle and not easily portable. It would be more robust to use the `-m` flag of `dlrover-run` to execute the module directly.
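For illustration, the suggested launch would look roughly like the following (hedged sketch; `swift.cli.sft` is inferred from the hardcoded site-packages path, and the variables come from the example above):

```bash
# Sketch of the reviewer's suggestion: launch the module with -m instead of
# hardcoding /opt/conda/lib/python3.10/site-packages/swift/cli/sft.py.
dlrover-run --nnodes 1:$NODE_NUM --nproc_per_node=1 -m swift.cli.sft \
    --model $model --train_type lora --deepspeed $deepspeed_config_or_type --elastic
```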