Fmoe_fp32_pro #1923
Conversation
Pull request overview
Adds an FP32-output (likely FP32-accumulation) path for the fused MoE g1u1 per-token int8 SiLU kernel on gfx942, wiring a new hsaco + CSV config into the existing heuristic kernel-selection pipeline and exposing it through the Python/C++ APIs.
Changes:
- Add the gfx942 FP32 per-token int8 g1u1 SiLU hsaco (`fmoe_fp32_pertokenInt8_g1u1_vs_smf_silu_1tg_32x384.co`) and its kernel list CSV.
- Extend `fmoe_g1u1_a16` (C++/HIP) to select an FP32 config map and launch with `T_O=float` when `out` is FP32.
- Update the Python `asm_moe` wrapper to optionally allocate `moe_buf` as FP32 and cast it back to the original dtype after execution.
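For orientation, here is a minimal sketch of the allocate-then-cast pattern the last bullet describes. It assumes a hypothetical `launch_fmoe_kernel` callable and simplified buffer shapes; the real `asm_moe` wrapper has additional arguments and heuristics.

```python
import torch

def moe_with_optional_fp32_buf(hidden_states: torch.Tensor,
                               enable_fp32: bool,
                               launch_fmoe_kernel) -> torch.Tensor:
    # Allocate the kernel's output buffer in FP32 when the FP32 path is taken,
    # otherwise in the caller-visible dtype (e.g. bf16/fp16).
    out_dtype = hidden_states.dtype
    moebuf_dtype = torch.float32 if enable_fp32 else out_dtype
    moe_buf = torch.zeros_like(hidden_states, dtype=moebuf_dtype)

    # The asm kernel writes its result into moe_buf in place.
    launch_fmoe_kernel(moe_buf)

    # Cast back so callers always receive the dtype they asked for.
    return moe_buf.to(out_dtype) if enable_fp32 else moe_buf
```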
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `hsa/gfx942/fmoe/silu/fmoe_fp32_pertokenInt8_g1u1_vs_smf_silu_1tg_32x384.co` | Adds the gfx942 FP32 SiLU g1u1 per-token int8 assembled kernel blob. |
| `hsa/gfx942/fmoe/silu/fmoe_fp32_pertokenInt8_g1u1_silu.csv` | Registers the new FP32 kernel in the asm config generation pipeline. |
| `csrc/py_itfs_cu/asm_fmoe.cu` | Enables selecting/launching the FP32-output kernel variant for `fmoe_g1u1_a16`. |
| `aiter/fused_moe_bf16_asm.py` | Adds a Python-side heuristic to allocate an FP32 output buffer and convert it back to the original dtype. |
In `aiter/fused_moe_bf16_asm.py`:

```python
is_g1u1 = (
    w2.shape[2] * 2 * lastdim_mul == w1.shape[1] and fc2_smooth_scale is not None
)
enable_fp32 = (
    fc2_smooth_scale is not None
    and is_g1u1
    and (inter_dim % 384 == 0)
    and w1.dtype == dtypes.i8
    and a16
)
moebuf_dtype = torch.float32 if enable_fp32 else dtype
```
Copilot AI commented on Jan 28, 2026:
`enable_fp32` can currently become `True` even when the activation is GELU or when running on non-gfx942 GPUs. In those cases `moe_buf` is allocated as FP32 and routed into `aiter.fmoe_g1u1_a16`, but the C++ side only wires up FP32 output support for (int8 weights, SiLU) and only lists an FP32 hsaco for gfx942; this will reliably throw at runtime (unsupported activation/dtype or "No suitable kernel found"). Gate `enable_fp32` at least on `activation == ActivationType.Silu` and the current arch (e.g. `get_gfx() == "gfx942"`), or add a safe fallback when the FP32 kernel isn't available.
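A sketch of the tightened gate the comment suggests, reusing the local variables from the hunk above; `ActivationType` and `get_gfx()` are the names the review itself proposes, and their exact import paths in AITER are assumed here.

```python
# Hypothetical tightening of the gate (sketch only): take the FP32 path only
# when the shipped gfx942 SiLU FP32 hsaco can actually serve the request.
enable_fp32 = (
    fc2_smooth_scale is not None
    and is_g1u1
    and (inter_dim % 384 == 0)               # only the 32x384 FP32 tile exists
    and w1.dtype == dtypes.i8                # FP32 output is wired only for int8 weights
    and a16
    and activation == ActivationType.Silu    # no FP32 kernel for GELU
    and get_gfx() == "gfx942"                # FP32 hsaco is shipped only for gfx942
)
```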
In `csrc/py_itfs_cu/asm_fmoe.cu`:

```cpp
else if(out.dtype() == at::ScalarType::Float && activation == ActivationType::Silu)
    config_map = &cfg_fmoe_fp32_pertokenInt8_g1u1_silu;
else
```
Copilot AI commented on Jan 28, 2026:
FP32 output is now selectable here, but the only FP32 hsaco shipped in this PR is for gfx942 and `subGU_n=384` (see `hsa/gfx942/fmoe/silu/fmoe_fp32_pertokenInt8_g1u1_silu.csv`). If callers pass `out` as FP32 on other arches or with an `inter_dim` not divisible by 384, this will fail later with a generic "No suitable kernel found". Consider adding an explicit early `TORCH_CHECK` documenting the FP32 constraints (arch + supported `inter_dim` tiles) so the failure mode is clearer.
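One possible shape for that check, placed before kernel selection; this is a sketch only, and it assumes `inter_dim` is available at this point in `fmoe_g1u1_a16` (an arch guard would use whatever GPU-arch query the file already relies on).

```cpp
// Hypothetical early validation of the FP32-output constraints (sketch only),
// so callers get a descriptive error instead of "No suitable kernel found".
if(out.dtype() == at::ScalarType::Float)
{
    TORCH_CHECK(activation == ActivationType::Silu,
                "fmoe_g1u1_a16: FP32 output currently requires SiLU activation");
    TORCH_CHECK(inter_dim % 384 == 0,
                "fmoe_g1u1_a16: FP32 output currently requires inter_dim to be a "
                "multiple of 384 (only the gfx942 32x384 FP32 hsaco is shipped)");
}
```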
Motivation
Technical Details
Test Plan
Test Result
Submission Checklist