Uneven GPU memory allocation when fine-tuning Qwen2.5-14B on multiple GPUs #7055

Open
1 task done
Jimmy-L99 opened this issue Feb 24, 2025 · 2 comments
Labels: bug (Something isn't working), pending (This problem is yet to be addressed)


Reminder

  • I have read the above rules and searched the existing issues.

System Info

environment:

llamafactory            0.9.1.dev0
liger_kernel            0.5.3
torch                   2.4.0
transformers            4.44.0
flash-attn              2.7.2.post1
deepspeed               0.14.4
A100 80GB *2

yaml:

### model
model_name_or_path: /root/sdb2/Qwen2.5-14B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
deepspeed: /root/LLaMA_Factory/LLaMA-Factory/examples/deepspeed/ds_z3_offload_config.json
enable_liger_kernel: True
flash_attn: fa2

### dataset
dataset: deepseek-110k
template: qwen
cutoff_len: 12800
max_samples: 27500
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /root/LoRA/LoRA_deepseek110k/qwen_model/deepseek110k_finetune_qwen2.5-14b-lora_lr1e-4_r16_alpha32_ld0.05_20250224
logging_steps: 500
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 4
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.03
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 4
eval_strategy: steps
eval_steps: 200

GPU memory usage:

1084564 root   1  Compute 100%  72782MiB  89%   100%  27464MiB /root/anaconda3/envs/llama-factory/bin/python -u /root/LLaMA_Factory/LLaMA-Factory/src/llamafactory/laun
1084563 root   0  Compute  99%  40408MiB  49%   101%  27486MiB /root/anaconda3/envs/llama-factory/bin/python -u /root/LLaMA_Factory/LLaMA-Factory/src/llamafactory/laun

One GPU's memory usage is always extremely high while the other only uses about half; after training for a while it automatically OOMs. How can I fix this?
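
For context, a hedged sketch of the memory-related knobs already present in the yaml above that are commonly lowered first when one rank runs out of memory; the values here are illustrative assumptions, not settings recommended in this thread:

### memory-reduction sketch (illustrative values)
cutoff_len: 8192                  # shorter sequences reduce activation memory
per_device_train_batch_size: 2    # smaller per-GPU batch
gradient_accumulation_steps: 4    # keep the effective batch size roughly constant
per_device_eval_batch_size: 1     # evaluation peaks can exceed training peaks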

Reproduction

Others

No response

Jimmy-L99 added the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on Feb 24, 2025
hiyouga (Owner) commented Feb 24, 2025

try use_unsloth_gc
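
A minimal sketch of how this could be added to the training yaml, assuming use_unsloth_gc is the LLaMA-Factory option for Unsloth-style gradient checkpointing (the section placement below is an assumption):

### method (excerpt, placement assumed)
use_unsloth_gc: true   # enable Unsloth-style gradient checkpointing to reduce activation memory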

Jimmy-L99 (Author) commented Feb 25, 2025

try use_unsloth_gc

With use_unsloth_gc it did stabilize somewhat, at around 39 GB. But there is still the question of why the validation phase uses much more memory than training: both 80 GB cards are nearly full even with per_device_train_batch_size: 2, gradient_accumulation_steps: 2, and per_device_eval_batch_size: 2.
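
If the evaluation peak is the bottleneck, one hedged mitigation is sketched below; it assumes LLaMA-Factory passes eval_accumulation_steps through to the underlying HuggingFace Seq2SeqTrainingArguments, and the values are illustrative only:

### eval (excerpt, assumed pass-through args)
per_device_eval_batch_size: 1   # smallest eval batch per GPU
eval_accumulation_steps: 1      # move prediction outputs to CPU after every eval step instead of accumulating them on GPU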
