
[Megatron-DeepSpeed script] Fix memory usage not being printed correctly #34

Open · wants to merge 1 commit into master

Conversation

@kefeiyao commented Sep 21, 2023

Description

Fix memory usage not being printed correctly.

<Before>
[Rank 5] (after 10 iterations) memory (MB) | allocated: 0.0 | max allocated: 0.0 | reserved: 0.0 | max reserved: 0.0
[Rank 4] (after 10 iterations) memory (MB) | allocated: 0.0 | max allocated: 0.0 | reserved: 0.0 | max reserved: 0.0
[Rank 3] (after 10 iterations) memory (MB) | allocated: 0.0 | max allocated: 0.0 | reserved: 0.0 | max reserved: 0.0
[Rank 1] (after 10 iterations) memory (MB) | allocated: 0.0 | max allocated: 0.0 | reserved: 0.0 | max reserved: 0.0
steps: 10 loss: 12.4282 iter time (s): 11.992 samples/sec: 5.337
[Rank 0] (after 10 iterations) memory (MB) | allocated: 0.0 | max allocated: 0.0 | reserved: 0.0 | max reserved: 0.0
iteration       10/   10000 | consumed samples:          640 | consumed tokens:      1310720 | elapsed time per iteration (ms):
<After>
[Rank 4] (after 3 iterations) memory (MB) | allocated: 65639.96411132812 | max allocated: 96745.81176757812 | reserved: 96883.21484375 | max reserved: 96883.21484375
[Rank 3] (after 3 iterations) memory (MB) | allocated: 65257.65905761719 | max allocated: 96747.86450195312 | reserved: 96883.21484375 | max reserved: 96883.21484375
[Rank 6] (after 3 iterations) memory (MB) | allocated: 65621.44079589844 | max allocated: 96728.14770507812 | reserved: 96883.21484375 | max reserved: 96883.21484375
[Rank 0] (after 3 iterations) memory (MB) | allocated: 65329.9560546875 | max allocated: 96694.22534179688 | reserved: 96883.21484375 | max reserved: 96883.21484375

[Megatron-DeepSpeed script] Dispatch to the appropriate device APIs according to the current device type to correctly obtain memory usage info.

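The dispatch approach described above can be sketched as follows. All names here (`select_memory_backend`, `report_memory_mb`, the backend registry, and the stub values) are hypothetical illustrations, not the PR's actual code; in real code the registry would map `"cuda"` to `torch.cuda` and `"hpu"` to the Habana HPU memory API:

```python
from types import SimpleNamespace

MB = 1024.0 * 1024.0

def select_memory_backend(device_type, backends):
    """Pick the memory-stats API matching the current device.

    Unconditionally querying torch.cuda.* reports 0.0 on non-CUDA
    devices such as Gaudi2 HPUs; dispatching by device type avoids that.
    """
    try:
        return backends[device_type]
    except KeyError:
        raise ValueError(f"no memory backend registered for {device_type!r}")

def report_memory_mb(backend):
    # Convert the backend's byte counters into the MB figures that the
    # per-rank "memory (MB) | allocated: ..." log line prints.
    return {
        "allocated": backend.memory_allocated() / MB,
        "max allocated": backend.max_memory_allocated() / MB,
        "reserved": backend.memory_reserved() / MB,
        "max reserved": backend.max_memory_reserved() / MB,
    }

# A stub backend stands in here so the sketch runs without an accelerator;
# it exposes the same four counter methods (returning bytes).
fake_hpu = SimpleNamespace(
    memory_allocated=lambda: 64 * 1024**3,
    max_memory_allocated=lambda: 94 * 1024**3,
    memory_reserved=lambda: 94 * 1024**3,
    max_memory_reserved=lambda: 94 * 1024**3,
)
stats = report_memory_mb(select_memory_backend("hpu", {"hpu": fake_hpu}))
print(stats)  # allocated: 65536.0 MB, reserved: 96256.0 MB
```

With a registry like this, adding support for a new accelerator only requires registering its memory API under the matching device-type key.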

Type of changes


  • [x] Bug fix (changes which fix an issue)

Tests

Ran llama13b training on Gaudi2 and waited until the memory usage was printed.

  • To reproduce:
    MODEL_REFERENCES_ROOT=$your_model_refences_root HL_DATA_DIR_ROOT=$your_data_set ./run_llama13b.sh

Checklist

  • [x] I agree with the Developer Certificate of Origin.
  • [x] My code conforms to the following coding guidelines:
    • [x] Use Python 3
    • [x] Python code follows PEP 8 Coding Styles
    • [x] For TensorFlow models, use TensorFlow 2 high-level APIs
  • [x] I have performed a self code review.
  • [ ] I have made corresponding changes to the documentation.
  • [x] I have added tests that prove my fix is effective or that my feature works.

@kefeiyao changed the title from "[Megatron-DeepSpeed script] Dispatch to appropriate device APIs" to "[Megatron-DeepSpeed script] Fix memory usage not being printed correctly" on Sep 21, 2023