[Test] Testing the generalization of fused moe #167
base: main
Conversation
```diff
+ "--debug",
+ type=str_to_bool,
+ default=False,
+ help="define small bs on certain rank",
```
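`str_to_bool` is presumably a small argparse helper, since `type=bool` would treat any non-empty string (including `"False"`) as truthy. A minimal sketch; the exact accepted spellings and error type are assumptions, not the PR's actual implementation:

```python
def str_to_bool(value):
    # argparse passes the raw command-line string; map common
    # true/false spellings to real booleans instead of relying on
    # Python truthiness, which would make "--debug False" truthy.
    if isinstance(value, bool):
        return value
    v = value.strip().lower()
    if v in ("true", "t", "1", "yes", "y"):
        return True
    if v in ("false", "f", "0", "no", "n"):
        return False
    raise ValueError(f"expected a boolean string, got {value!r}")
```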
There is no need to add a debug mode.
Modified
```diff
  # ----- Compare Recv Count -----
- if not test_topk_minus1:
+ if args.topk_drop_col < 0 and args.topk_drop_prob == 0.0:
```
The pre-integration operator does not support topk=-1. You can compute the number of received tokens on the CPU side and then verify against it; see test_intranode.py or internode.py for reference.
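A CPU-side reference for the per-rank received-token counts, in the spirit of what the intranode/internode tests do, might look like the sketch below. The function name, the list-based `topk_idx` layout, and the contiguous expert-to-rank sharding are illustrative assumptions:

```python
def expected_recv_counts(topk_idx, num_experts, num_ranks):
    """Count, per destination rank, how many (token, expert) pairs it receives.

    topk_idx[t] holds the expert ids routed to by token t; an id of -1
    marks a dropped slot. Experts are assumed sharded contiguously, so
    expert e lives on rank e // (num_experts // num_ranks).
    """
    experts_per_rank = num_experts // num_ranks
    counts = [0] * num_ranks
    for row in topk_idx:
        for e in row:
            if e >= 0:  # skip dropped (-1) slots
                counts[e // experts_per_rank] += 1
    return counts
```

The fused operator's reported recv counts can then be asserted equal to this reference on every rank.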
Modified
```diff
  # ----- Routing(topk_idx) -----
- if args.active_ranks:
+ if args.debug and args.active_ranks:
```
Remove the debug mode; it affects the original test logic here.
Modified
```python
print(f"expected_recv: {expected_recv}")
print(f"fused_recv: {fused_recv}")

diff = (expected_recv - fused_recv).abs()
```
Lack of an assertion; it is recommended to add the following code:

```python
assert torch.all(diff == 0), (
    f"Recv count mismatch on rank {rank}. Max difference: {diff.max().item()}"
    f"\nExpected:\n{expected_recv}\nActual:\n{fused_recv}"
)
```
This check already exists below.
```python
print(
    f"[Rank {rank}] gbl_num_tokens_per_expert: {gbl_num_tokens_per_expert.tolist()}"
)
base_prefix_sum = num_tokens_per_expert.clone()
```
The name "base_prefix_sum" is too generic and does not say what it counts. If it counts the tokens sent to each expert, consider renaming it to something like "local_expert_send_counts" or "local_expert_token_counts" to make the code self-explanatory.
Changed to local_expert_token_counts.
```python
gbl_num_tokens_per_expert = num_tokens_per_expert.clone()
dist.all_reduce(gbl_num_tokens_per_expert, group=group)

print(f"[Rank {rank}] num_tokens_per_expert: {num_tokens_per_expert.tolist()}")
```
There are a large number of print statements in the code, including detailed tensor dumps on Rank 0. Before this is deployed, they should be removed or placed under a strict debugging condition (e.g. `if DEBUG_MODE:` or the logging system's `logging.debug`), since they affect performance and generate excessive output.
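One minimal way to gate the dumps, sketched here with a hypothetical `maybe_print` helper rather than the test's actual code:

```python
def maybe_print(debug, rank, name, values):
    """Format and print a per-rank tensor dump only when debug is on.

    Returns the printed line, or None when the dump is suppressed, so
    the default (quiet) path costs a single boolean check per call site.
    """
    if not debug:
        return None
    line = f"[Rank {rank}] {name}: {values}"
    print(line)
    return line
```

Call sites then become `maybe_print(args.debug, rank, "num_tokens_per_expert", num_tokens_per_expert.tolist())`, which keeps the default test run quiet without deleting the diagnostics.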
Added a debug mode.
Add tests for fused MoE:

- Generalization over hidden size.
- Test for EPLB. Because the small operator does not support topk=-1, the test cases compute EPLB from global_base_prefix_sum and compare the result against the fused operator. The fused operator's EPLB output is a one-dimensional tensor of shape [num_local_experts * num_ranks]:

  ```cpp
  auto ep_recv_count = at::empty({num_local_experts * num_ranks}, at::dtype(at::kInt).device(device));
  ```

  while global_base_prefix_sum has shape [num_local_experts, num_ranks], so we preprocess global_base_prefix_sum to match ep_recv_count.
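If the fused output is laid out local-expert-major (an assumption here, not confirmed by the PR), the preprocessing reduces to flattening the 2-D table; the helper below is an illustrative sketch using plain lists in place of tensors:

```python
def flatten_expected_counts(global_counts):
    """Flatten a [num_local_experts, num_ranks] table of expected token
    counts into the 1-D [num_local_experts * num_ranks] layout assumed
    for the fused operator's ep_recv_count (local-expert-major: all
    source-rank counts for expert 0 first, then expert 1, and so on)."""
    return [count for per_expert in global_counts for count in per_expert]
```

With torch tensors the equivalent would be `global_counts.flatten()` (or a transpose first, if the fused layout turns out to be rank-major instead).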