Uneven GPU memory allocation when fine-tuning Qwen2.5-14B on multiple GPUs #7055

Open
1 task done
Jimmy-L99 opened this issue Feb 24, 2025 · 2 comments
Labels: bug (Something isn't working), pending (This problem is yet to be addressed)


Reminder

  • I have read the above rules and searched the existing issues.

System Info

environment:

llamafactory            0.9.1.dev0
liger_kernel            0.5.3
torch                   2.4.0
transformers            4.44.0
flash-attn              2.7.2.post1
deepspeed               0.14.4
A100 80GB *2

yaml:

### model
model_name_or_path: /root/sdb2/Qwen2.5-14B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
deepspeed: /root/LLaMA_Factory/LLaMA-Factory/examples/deepspeed/ds_z3_offload_config.json
enable_liger_kernel: True
flash_attn: fa2

### dataset
dataset: deepseek-110k
template: qwen
cutoff_len: 12800
max_samples: 27500
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /root/LoRA/LoRA_deepseek110k/qwen_model/deepseek110k_finetune_qwen2.5-14b-lora_lr1e-4_r16_alpha32_ld0.05_20250224
logging_steps: 500
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 4
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.03
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 4
eval_strategy: steps
eval_steps: 200

GPU memory usage:

1084564 root   1  Compute 100%  72782MiB  89%   100%  27464MiB /root/anaconda3/envs/llama-factory/bin/python -u /root/LLaMA_Factory/LLaMA-Factory/src/llamafactory/laun
1084563 root   0  Compute  99%  40408MiB  49%   101%  27486MiB /root/anaconda3/envs/llama-factory/bin/python -u /root/LLaMA_Factory/LLaMA-Factory/src/llamafactory/laun

One GPU's memory usage is always extremely high while the other only uses about half; after training for a while it automatically OOMs. How can I fix this?
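
For context, a hedged sketch of the memory-related knobs already present in the yaml above that are commonly lowered first when one rank runs out of memory; the values here are illustrative assumptions, not settings recommended in this thread:

### memory-reduction sketch (illustrative values)
cutoff_len: 8192                  # shorter sequences reduce activation memory
per_device_train_batch_size: 2    # smaller per-GPU batch
gradient_accumulation_steps: 4    # keep the effective batch size roughly constant
per_device_eval_batch_size: 1     # evaluation peaks can exceed training peaks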

Reproduction

Others

No response

Jimmy-L99 added the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on Feb 24, 2025
hiyouga (Owner) commented Feb 24, 2025

try use_unsloth_gc
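
A minimal sketch of how this could be added to the training yaml, assuming use_unsloth_gc is the LLaMA-Factory option for Unsloth-style gradient checkpointing (the section placement below is an assumption):

### method (excerpt, placement assumed)
use_unsloth_gc: true   # enable Unsloth-style gradient checkpointing to reduce activation memory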

Jimmy-L99 (Author) commented Feb 25, 2025

try use_unsloth_gc

With use_unsloth_gc it did stabilize somewhat, at around 39 GB. But there is still the question of why the validation phase uses much more memory than training: both 80 GB cards are nearly full even with per_device_train_batch_size: 2, gradient_accumulation_steps: 2, and per_device_eval_batch_size: 2.
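
If the evaluation peak is the bottleneck, one hedged mitigation is sketched below; it assumes LLaMA-Factory passes eval_accumulation_steps through to the underlying HuggingFace Seq2SeqTrainingArguments, and the values are illustrative only:

### eval (excerpt, assumed pass-through args)
per_device_eval_batch_size: 1   # smallest eval batch per GPU
eval_accumulation_steps: 1      # move prediction outputs to CPU after every eval step instead of accumulating them on GPU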
