[Test] Testing the generalization of fused moe #167
base: main
Conversation
```diff
+ "--debug",
+ type=str_to_bool,
+ default=False,
+ help="define small bs on certain rank",
```
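`str_to_bool` is presumably a small argparse helper, since `type=bool` would treat any non-empty string (including `"False"`) as truthy. A minimal sketch; the exact accepted spellings and error type are assumptions, not the PR's actual implementation:

```python
def str_to_bool(value):
    # argparse passes the raw command-line string; map common
    # true/false spellings to real booleans instead of relying on
    # Python truthiness, which would make "--debug False" truthy.
    if isinstance(value, bool):
        return value
    v = value.strip().lower()
    if v in ("true", "t", "1", "yes", "y"):
        return True
    if v in ("false", "f", "0", "no", "n"):
        return False
    raise ValueError(f"expected a boolean string, got {value!r}")
```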
There is no need to add a debug mode.
Modified
```diff
  # ----- Compare Recv Count -----
- if not test_topk_minus1:
+ if args.topk_drop_col < 0 and args.topk_drop_prob == 0.0:
```
The pre-integration operator does not support topk=-1. You can compute the number of received tokens on the CPU side and then verify against it; see test_intranode.py or internode.py for reference.
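A CPU-side reference for the per-rank received-token counts, in the spirit of what the intranode/internode tests do, might look like the sketch below. The function name, the list-based `topk_idx` layout, and the contiguous expert-to-rank sharding are illustrative assumptions:

```python
def expected_recv_counts(topk_idx, num_experts, num_ranks):
    """Count, per destination rank, how many (token, expert) pairs it receives.

    topk_idx[t] holds the expert ids routed to by token t; an id of -1
    marks a dropped slot. Experts are assumed sharded contiguously, so
    expert e lives on rank e // (num_experts // num_ranks).
    """
    experts_per_rank = num_experts // num_ranks
    counts = [0] * num_ranks
    for row in topk_idx:
        for e in row:
            if e >= 0:  # skip dropped (-1) slots
                counts[e // experts_per_rank] += 1
    return counts
```

The fused operator's reported recv counts can then be asserted equal to this reference on every rank.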
Modified
```diff
  # ----- Routing(topk_idx) -----
- if args.active_ranks:
+ if args.debug and args.active_ranks:
```
Remove the debug mode; it affects the original test logic here.
Modified
```python
print(f"expected_recv: {expected_recv}")
print(f"fused_recv: {fused_recv}")

diff = (expected_recv - fused_recv).abs()
```
Lack of an assertion; it is recommended to add the following code:

```python
assert torch.all(diff == 0), (
    f"Recv count mismatch on rank {rank}. Max difference: {diff.max().item()}"
    f"\nExpected:\n{expected_recv}\nActual:\n{fused_recv}"
)
```
This check already exists below.
```python
print(
    f"[Rank {rank}] gbl_num_tokens_per_expert: {gbl_num_tokens_per_expert.tolist()}"
)
base_prefix_sum = num_tokens_per_expert.clone()
```
The name "base_prefix_sum" is too generic and does not say what it counts. If it counts the tokens sent to each expert, consider renaming it to something like "local_expert_send_counts" or "local_expert_token_counts" to make the code self-explanatory.
Changed to local_expert_token_counts.
```python
gbl_num_tokens_per_expert = num_tokens_per_expert.clone()
dist.all_reduce(gbl_num_tokens_per_expert, group=group)

print(f"[Rank {rank}] num_tokens_per_expert: {num_tokens_per_expert.tolist()}")
```
There are a large number of print statements in the code, including detailed tensor dumps on Rank 0. Before this is deployed, they should be removed or placed under a strict debugging condition (e.g. `if DEBUG_MODE:` or the logging system's `logging.debug`), since they affect performance and generate excessive output.
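One minimal way to gate the dumps, sketched here with a hypothetical `maybe_print` helper rather than the test's actual code:

```python
def maybe_print(debug, rank, name, values):
    """Format and print a per-rank tensor dump only when debug is on.

    Returns the printed line, or None when the dump is suppressed, so
    the default (quiet) path costs a single boolean check per call site.
    """
    if not debug:
        return None
    line = f"[Rank {rank}] {name}: {values}"
    print(line)
    return line
```

Call sites then become `maybe_print(args.debug, rank, "num_tokens_per_expert", num_tokens_per_expert.tolist())`, which keeps the default test run quiet without deleting the diagnostics.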
Added a debug mode.
Add tests for fused MoE:

- Generalization over hidden size.
- Test for EPLB. Because the small operator does not support topk=-1, the test cases compute EPLB from global_base_prefix_sum and compare the result against the fused operator. The fused operator's EPLB output is a one-dimensional tensor of shape [num_local_experts * num_ranks]:

  ```cpp
  auto ep_recv_count = at::empty({num_local_experts * num_ranks}, at::dtype(at::kInt).device(device));
  ```

  while global_base_prefix_sum has shape [num_local_experts, num_ranks], so we preprocess global_base_prefix_sum to match ep_recv_count.
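If the fused output is laid out local-expert-major (an assumption here, not confirmed by the PR), the preprocessing reduces to flattening the 2-D table; the helper below is an illustrative sketch using plain lists in place of tensors:

```python
def flatten_expected_counts(global_counts):
    """Flatten a [num_local_experts, num_ranks] table of expected token
    counts into the 1-D [num_local_experts * num_ranks] layout assumed
    for the fused operator's ep_recv_count (local-expert-major: all
    source-rank counts for expert 0 first, then expert 1, and so on)."""
    return [count for per_expert in global_counts for count in per_expert]
```

With torch tensors the equivalent would be `global_counts.flatten()` (or a transpose first, if the fused layout turns out to be rank-major instead).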