
[Megatron-DeepSpeed script] Fix memory usage not being printed correctly #34

Open · wants to merge 1 commit into master

Conversation

@kefeiyao commented Sep 21, 2023

Description

Fix memory usage not being printed correctly.

<Before>
[Rank 5] (after 10 iterations) memory (MB) | allocated: 0.0 | max allocated: 0.0 | reserved: 0.0 | max reserved: 0.0
[Rank 4] (after 10 iterations) memory (MB) | allocated: 0.0 | max allocated: 0.0 | reserved: 0.0 | max reserved: 0.0
[Rank 3] (after 10 iterations) memory (MB) | allocated: 0.0 | max allocated: 0.0 | reserved: 0.0 | max reserved: 0.0
[Rank 1] (after 10 iterations) memory (MB) | allocated: 0.0 | max allocated: 0.0 | reserved: 0.0 | max reserved: 0.0
steps: 10 loss: 12.4282 iter time (s): 11.992 samples/sec: 5.337
[Rank 0] (after 10 iterations) memory (MB) | allocated: 0.0 | max allocated: 0.0 | reserved: 0.0 | max reserved: 0.0
iteration       10/   10000 | consumed samples:          640 | consumed tokens:      1310720 | elapsed time per iteration (ms):
<After>
[Rank 4] (after 3 iterations) memory (MB) | allocated: 65639.96411132812 | max allocated: 96745.81176757812 | reserved: 96883.21484375 | max reserved: 96883.21484375
[Rank 3] (after 3 iterations) memory (MB) | allocated: 65257.65905761719 | max allocated: 96747.86450195312 | reserved: 96883.21484375 | max reserved: 96883.21484375
[Rank 6] (after 3 iterations) memory (MB) | allocated: 65621.44079589844 | max allocated: 96728.14770507812 | reserved: 96883.21484375 | max reserved: 96883.21484375
[Rank 0] (after 3 iterations) memory (MB) | allocated: 65329.9560546875 | max allocated: 96694.22534179688 | reserved: 96883.21484375 | max reserved: 96883.21484375

[Megatron-DeepSpeed script] Dispatch to the appropriate device APIs according to the current device type to correctly obtain memory usage info.

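The dispatch approach described above can be sketched as follows. All names here (`select_memory_backend`, `report_memory_mb`, the backend registry, and the stub values) are hypothetical illustrations, not the PR's actual code; in real code the registry would map `"cuda"` to `torch.cuda` and `"hpu"` to the Habana HPU memory API:

```python
from types import SimpleNamespace

MB = 1024.0 * 1024.0

def select_memory_backend(device_type, backends):
    """Pick the memory-stats API matching the current device.

    Unconditionally querying torch.cuda.* reports 0.0 on non-CUDA
    devices such as Gaudi2 HPUs; dispatching by device type avoids that.
    """
    try:
        return backends[device_type]
    except KeyError:
        raise ValueError(f"no memory backend registered for {device_type!r}")

def report_memory_mb(backend):
    # Convert the backend's byte counters into the MB figures that the
    # per-rank "memory (MB) | allocated: ..." log line prints.
    return {
        "allocated": backend.memory_allocated() / MB,
        "max allocated": backend.max_memory_allocated() / MB,
        "reserved": backend.memory_reserved() / MB,
        "max reserved": backend.max_memory_reserved() / MB,
    }

# A stub backend stands in here so the sketch runs without an accelerator;
# it exposes the same four counter methods (returning bytes).
fake_hpu = SimpleNamespace(
    memory_allocated=lambda: 64 * 1024**3,
    max_memory_allocated=lambda: 94 * 1024**3,
    memory_reserved=lambda: 94 * 1024**3,
    max_memory_reserved=lambda: 94 * 1024**3,
)
stats = report_memory_mb(select_memory_backend("hpu", {"hpu": fake_hpu}))
print(stats)  # allocated: 65536.0 MB, reserved: 96256.0 MB
```

With a registry like this, adding support for a new accelerator only requires registering its memory API under the matching device-type key.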

Type of changes


  • [x] Bug fix (changes which fix an issue)

Tests

Ran llama13b training on Gaudi2 and waited until the memory usage was printed.

  • To reproduce:
    MODEL_REFERENCES_ROOT=$your_model_refences_root HL_DATA_DIR_ROOT=$your_data_set ./run_llama13b.sh

Checklist

  • [x] I agree with the Developer Certificate of Origin.
  • [x] My code conforms to the following coding guidelines:
    • [x] Use Python 3
    • [x] Python code follows PEP 8 Coding Styles
    • [x] For TensorFlow models, use TensorFlow 2 high-level APIs
  • [x] I have performed a self code review.
  • [ ] I have made corresponding changes to the documentation.
  • [x] I have added tests that prove my fix is effective or that my feature works.

@kefeiyao changed the title from "[Megatron-DeepSpeed script] Dispatch to appropriate device APIs" to "[Megatron-DeepSpeed script] Fix memory usage not being printed correctly" on Sep 21, 2023