> Using ndtimeline-tool to Monitor Megatron-GPT
#53
> I want to use the ndtimeline tool to monitor the computation and communication of each rank in Megatron-GPT. I have two concerns:
> 1. Before calling init_ndtimeline, some initialization is required. Would this conflict with Megatron's own initialize_megatron function? Both involve process-group-related operations, so this could potentially cause communication issues later on.
> 2. The interfaces of Megatron-LM and veScale are different. How can I integrate the computational interfaces, such as major-metrics, tp-stream-metrics, dp-stream-metrics, pp-batch-stream-metrics, and pp-forward-stream-metrics? Has anyone successfully used the ndtimeline tool with Megatron-GPT before?
> Thanks!
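On concern 1, one way to avoid a conflict is to let initialize_megatron build all process groups first and only then initialize the timeline. Below is a minimal sketch of that ordering; the vescale.ndtimeline import path and the assumption that init_ndtimeline does not create its own process groups are mine, not confirmed from the veScale sources.

```python
# Hypothetical ordering sketch: import paths and the "no new process groups"
# assumption are unverified against the actual veScale ndtimeline code.
from megatron.initialize import initialize_megatron  # module path varies across Megatron-LM versions

def setup_training():
    # Let Megatron create its TP/PP/DP process groups first.
    initialize_megatron(extra_args_provider=None, args_defaults={})

    # Only afterwards initialize ndtimeline on every rank, so it can reuse the
    # groups Megatron already set up instead of creating competing ones.
    from vescale.ndtimeline import init_ndtimeline  # assumed import path
    init_ndtimeline()
```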
(1) My progress: I modified the ndtimeline init, p2p_communication.py, and schedules.py, but failed to get a correct timeline.
(2) Why? I also wondered why instructions need to be registered in ndtimeline/pipedream_flush.py.
I do not use instruction registration; instead I decorate the functions, e.g. @ndtimer(SEND_BACKWARD) on
def send_backward(input_tensor_grads, tensor_shapes, config)
in megatron/core/pipeline_parallel/schedules.py, and all interfaces use the same method.
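For reference, the decorator approach described above would look roughly like this in megatron/core/pipeline_parallel/schedules.py. The ndtimer and SEND_BACKWARD import path is an assumption, and the function body is the stock Megatron implementation, which may differ slightly in your version.

```python
# megatron/core/pipeline_parallel/schedules.py (excerpt with the timer added)
from megatron.core.pipeline_parallel import p2p_communication
from vescale.ndtimeline import ndtimer, SEND_BACKWARD  # assumed import path

@ndtimer(SEND_BACKWARD)
def send_backward(input_tensor_grads, tensor_shapes, config):
    """Send gradient tensors to the previous pipeline stage (stock Megatron logic)."""
    if not isinstance(input_tensor_grads, list):
        input_tensor_grads = [input_tensor_grads]
    for input_tensor_grad, tensor_shape in zip(input_tensor_grads, tensor_shapes):
        if tensor_shape is None:
            continue
        p2p_communication.send_backward(input_tensor_grad, config)
```

send_forward, recv_forward, and recv_backward can be wrapped the same way with their corresponding metric names.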
> I'm also interested in using ndtimeline for my code, but I'm struggling to modify it.

I have solved it: just add ndtimeline_init and modify schedules and p2p_communication with the correct metrics.
However, I have only achieved this on a single machine, and I wonder how to use it with multiple machines.
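On the multi-machine part, the key points are presumably that the init has to run on every rank of every node after torch.distributed is up, and that every rank's trace needs to land somewhere collectable. Here is a rough sketch under those assumptions; only the torch.distributed calls are real API, while init_ndtimeline and the trace-directory convention are hypothetical.

```python
# Hypothetical multi-node sketch: only torch.distributed is real API here;
# init_ndtimeline and the trace-directory convention are assumptions.
import os
import torch.distributed as dist

def init_timeline_all_ranks(trace_root="./ndtimeline_traces"):
    # torchrun (or another launcher) exports RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT
    # on every node, so this same code path runs on every rank of every machine.
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")
    rank = dist.get_rank()

    from vescale.ndtimeline import init_ndtimeline  # assumed import path
    init_ndtimeline()  # must run on every rank, not only on the first node

    # Write each rank's trace under a shared filesystem (e.g. NFS), so the
    # per-rank timelines from all nodes can be merged into one view afterwards.
    rank_dir = os.path.join(trace_root, f"rank{rank}")
    os.makedirs(rank_dir, exist_ok=True)
    return rank_dir
```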
Great try! I'm doing some experiments on Megatron's and DeepSpeed's pipeline parallelism and am struggling to monitor the communication as well. I hope there will be an example for custom users from the official team or from you. I think a more general, stand-alone toolkit would be better for the development of the community.
Originally posted by @zmtttt in #51 (comment)