Using ndtimeline-tool to Monitor Megatron-GPT #53

Open
zmtttt opened this issue Sep 4, 2024 · 3 comments

Comments

zmtttt commented Sep 4, 2024

> Using ndtimeline-tool to Monitor Megatron-GPT
>
> I want to use ndtimeline-tool to monitor the computation and communication of each rank in Megatron-GPT. I have two concerns:

1. Initialization is required before calling init_ndtimeline. Would this conflict with Megatron's own initialize_megatron function? Both involve operations on process groups, so this could cause communication issues later on.

2. The interfaces of Megatron-LM and veScale are different. How can I integrate the computational interfaces, such as major-metrics, tp-stream-metrics, dp-stream-metrics, pp-batch-stream-metrics, and pp-forward-stream-metrics? Has anyone successfully used ndtimeline-tool with Megatron-GPT before?

thanks!

(1) My progress: I modified ndtimeline/init, p2p_communication.py, and schedule.py, but failed to get a correct timeline.
(2) Why is it necessary to register instructions in ndtimeline/pipedream_flush.py? I do not use instruction registration. Instead, I decorate the functions in megatron/core/pipeline_parallel/schedules.py directly, for example:
@ndtimer(SEND_BACKWARD)
def send_backward(input_tensor_grads, tensor_shapes, config)
All interfaces use the same method.
[Image: wrong-megatron]
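The decorator approach described above can be pictured with a small stand-in. This is not ndtimeline's `ndtimer`: `simple_timer`, `RECORDS`, and the metric string are hypothetical names used only for illustration, and the function body is a placeholder. The sketch measures host wall-clock time only, so it would not capture asynchronous GPU or communication work, which is one possible reason a naively decorated function can produce a misleading timeline.

```python
import functools
import time

# Hypothetical in-memory sink: one (metric, start_s, duration_s) per call.
RECORDS = []


def simple_timer(metric):
    """Return a decorator that records the wall-clock duration of each call."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Recorded even if the wrapped function raises.
                RECORDS.append((metric, start, time.perf_counter() - start))

        return wrapper

    return decorator


@simple_timer("send-backward")
def send_backward(input_tensor_grads, tensor_shapes, config):
    # Placeholder body; the real function would issue P2P sends.
    return len(input_tensor_grads)
```

Calling `send_backward([g1, g2], shapes, config)` appends one record tagged `"send-backward"` to `RECORDS`.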

Originally posted by @zmtttt in #51 (comment)

mzxcpp commented Sep 14, 2024

I'm also interested in using ndtimeline for my code, but I'm struggling to modify it.

zmtttt commented Sep 18, 2024

> I'm also interested in using ndtimeline for my code, but struggling to modify it.

I have solved it by adding ndtimeline_init and modifying schedules and p2p-communication with the correct metrics.
However, I have only got this working on a single machine. I wonder how to use it across multiple machines.
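For multi-machine runs, launchers such as `torchrun` export `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to every process. A minimal sketch, assuming that launcher convention, of tagging timing records with the global rank and hostname so that records from different nodes can be told apart. `rank_tag` is a hypothetical helper, and this thread does not confirm how ndtimeline itself handles multi-node collection.

```python
import os
import socket


def rank_tag(environ=None):
    """Build an identifying tag for the current process.

    Reads the variables a torchrun-style launcher is assumed to set and
    falls back to single-process defaults when they are missing.
    """
    env = os.environ if environ is None else environ
    return {
        "rank": int(env.get("RANK", "0")),              # global rank
        "local_rank": int(env.get("LOCAL_RANK", "0")),  # rank within the node
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "host": socket.gethostname(),
    }
```

Attaching such a tag to every record keeps two processes with the same local rank on different machines distinguishable when the per-rank outputs are merged.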

mzxcpp commented Sep 18, 2024

> > I'm also interested in using ndtimeline for my code, but struggling to modify it.
>
> I have solved it by adding ndtimeline_init and modifying schedules and p2p-communication with the correct metrics, but I have only got this working on a single machine. I wonder how to use it across multiple machines.

Great try! I'm running some experiments on Megatron's and DeepSpeed's PP and am struggling to monitor the communication as well. I hope the official team or you can provide an example for custom users. I think a more general, stand-alone toolkit would be better for the development of the community.
