Error when starting training with bash scripts/finetune.sh #22

Open
MahmoudElsayedMahmoud opened this issue Jan 7, 2025 · 6 comments

Comments

@MahmoudElsayedMahmoud

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████| 5/5 [00:06<00:00, 1.20s/it]
Installed CUDA version 12.0 does not match the version torch was compiled with 12.4 but since the APIs are compatible, accepting this combination
Using /home/mahmoud/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
Emitting ninja build file /home/mahmoud/.cache/torch_extensions/py310_cu124/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
[rank0]: Traceback (most recent call last):
[rank0]: File "/media/mahmoud/새 볼륨/Llama3.2-Vision-Finetune/src/training/train.py", line 225, in <module>
[rank0]: train()
[rank0]: File "/media/mahmoud/새 볼륨/Llama3.2-Vision-Finetune/src/training/train.py", line 200, in train
[rank0]: trainer.train()
[rank0]: File "/home/mahmoud/anaconda3/envs/llama3/lib/python3.10/site-packages/transformers/trainer.py", line 2122, in train
[rank0]: return inner_training_loop(
[rank0]: File "/home/mahmoud/anaconda3/envs/llama3/lib/python3.10/site-packages/transformers/trainer.py", line 2277, in _inner_training_loop
[rank0]: model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
[rank0]: File "/home/mahmoud/anaconda3/envs/llama3/lib/python3.10/site-packages/accelerate/accelerator.py", line 1318, in prepare
[rank0]: result = self._prepare_deepspeed(*args)
[rank0]: File "/home/mahmoud/anaconda3/envs/llama3/lib/python3.10/site-packages/accelerate/accelerator.py", line 1815, in _prepare_deepspeed
[rank0]: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank0]: File "/home/mahmoud/anaconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank0]: engine = DeepSpeedEngine(args=args,
[rank0]: File "/home/mahmoud/anaconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 313, in __init__
[rank0]: self._configure_optimizer(optimizer, model_parameters)
[rank0]: File "/home/mahmoud/anaconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1276, in _configure_optimizer
[rank0]: basic_optimizer = self._configure_basic_optimizer(model_parameters)
[rank0]: File "/home/mahmoud/anaconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1347, in _configure_basic_optimizer
[rank0]: optimizer = DeepSpeedCPUAdam(model_parameters,
[rank0]: File "/home/mahmoud/anaconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
[rank0]: self.ds_opt_adam = CPUAdamBuilder().load()
[rank0]: File "/home/mahmoud/anaconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
[rank0]: return self.jit_load(verbose)
[rank0]: File "/home/mahmoud/anaconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
[rank0]: op_module = load(name=self.name,
[rank0]: File "/home/mahmoud/anaconda3/envs/llama3/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1314, in load
[rank0]: return _jit_compile(
[rank0]: File "/home/mahmoud/anaconda3/envs/llama3/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1746, in _jit_compile
[rank0]: return _import_module_from_library(name, build_directory, is_python_module)
[rank0]: File "/home/mahmoud/anaconda3/envs/llama3/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2140, in _import_module_from_library
[rank0]: module = importlib.util.module_from_spec(spec)
[rank0]: File "<frozen importlib._bootstrap>", line 571, in module_from_spec
[rank0]: File "<frozen importlib._bootstrap_external>", line 1176, in create_module
[rank0]: File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
[rank0]: ImportError: /home/mahmoud/anaconda3/envs/llama3/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /home/mahmoud/.cache/torch_extensions/py310_cu124/cpu_adam/cpu_adam.so)
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7a8af0d48d30>
Traceback (most recent call last):
File "/home/mahmoud/anaconda3/envs/llama3/lib/python3.10/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'

@2U1
Owner

2U1 commented Jan 7, 2025

I think the CUDA version installed on your system doesn't match the one your environment's torch was built for. Can you reinstall torch for your current CUDA version and retry?

@MahmoudElsayedMahmoud
Author

Can you tell me which CUDA and torch versions I should use?

@MahmoudElsayedMahmoud
Author

I use:
Torch version: 2.5.1+cu124
CUDA available: True
CUDA version: 12.4

@2U1
Owner

2U1 commented Jan 7, 2025

The log says you have CUDA 12.0 installed. That's odd.
I think it's some kind of version issue. I'll try to figure it out.
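
The 12.0 in the log and the 12.4 reported by torch may describe different things: the log message compares the locally installed CUDA toolkit with the CUDA version torch was compiled with. A minimal Python sketch for printing both side by side is shown here. It assumes `nvcc` is on `PATH` and that its output contains a `release X.Y` string; the helper names are hypothetical and not part of this repository.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract 'major.minor' from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+\.\d+)", text)
    return match.group(1) if match else None


def system_cuda_version():
    """CUDA toolkit version according to nvcc, or None if nvcc is not on PATH."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def torch_cuda_version():
    """CUDA version torch was built with, or None if torch is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


if __name__ == "__main__":
    print("system toolkit (nvcc):", system_cuda_version())
    print("torch built with CUDA:", torch_cuda_version())
```

If the two values differ, that would be consistent with the warning near the top of the log, which the log itself describes as an accepted combination rather than the cause of the crash.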

@2U1
Owner

2U1 commented Jan 10, 2025

@MahmoudElsayedMahmoud It might be a problem with gcc.
See deepspeedai/DeepSpeed#4257.
You could follow the solution there.
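
The `ImportError` in the log says that the `libstdc++.so.6` inside the conda environment does not provide `GLIBCXX_3.4.32`, which the JIT-built `cpu_adam.so` requires. A minimal Python sketch for listing the `GLIBCXX` versions a given `libstdc++.so.6` provides is shown here. It scans the file for version strings, roughly like `strings <lib> | grep GLIBCXX`; the function names are hypothetical, and the library paths to check are supplied by the user.

```python
import re
import sys
from pathlib import Path

# Version tags appear in the shared library as plain ASCII, e.g. "GLIBCXX_3.4.32".
GLIBCXX_RE = re.compile(rb"GLIBCXX_(\d+(?:\.\d+)+)")


def glibcxx_versions(lib_path):
    """Return the GLIBCXX versions found in the file, sorted numerically."""
    data = Path(lib_path).read_bytes()
    found = {m.group(1).decode() for m in GLIBCXX_RE.finditer(data)}
    return sorted(found, key=lambda v: tuple(int(p) for p in v.split(".")))


def provides(lib_path, wanted):
    """True if the library contains the given version, e.g. '3.4.32'."""
    return wanted in glibcxx_versions(lib_path)


if __name__ == "__main__":
    # Usage: python check_glibcxx.py /path/to/libstdc++.so.6 [more paths...]
    for path in sys.argv[1:]:
        versions = glibcxx_versions(path)
        newest = versions[-1] if versions else "none found"
        print(f"{path}: newest GLIBCXX = {newest}, has 3.4.32 = {provides(path, '3.4.32')}")
```

Running it on both the environment's library (the log points at `/home/mahmoud/anaconda3/envs/llama3/lib/libstdc++.so.6`) and the system-wide one would show whether the environment ships an older libstdc++ than the compiler that built the extension.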

@MahmoudElsayedMahmoud
Author

OK, I will try it and let you know. Thanks for trying to help me.
