topk_sigmoid: 1.66x faster DPP kernel with 256-expert and FP32 support #1909
+260
−15
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
topk_sigmoid: 1.66x faster DPP kernel with 256-expert and FP32 support
Summary
This PR adds a GFX9-optimized
topk_sigmoidkernel using DPP intrinsics while preserving the CK implementation as a fallback for other architectures.Performance Highlights
DPP vs CK direct comparison: avg 1.66x, median 1.65x, range 1.42x - 1.94x
Motivation
The existing CK-based implementation of topk_sigmoid has room for performance enhancements through the use of DPP intrinsics. This PR also adds support for 256 experts and FP32 dtype - the CK implementation silently returns garbage for these cases because
topk_softmax_api.cpphas no matching branch.Technical Approach
Architecture-Aware Dispatch
Runtime detection automatically routes to the optimal kernel:
Benchmark Results
Full Side-by-Side Comparison (40 configs, fp16/bf16, 64/128 experts)
64 Experts
128 Experts
Summary: 40/40 PASS | Avg: 1.66x | Median: 1.65x | Best: 1.98x | Worst: 1.31x
Test Plan
Used internal op test:
op_tests/test_moe_topk_sigmoid.pyReproduce Benchmarks
This PR:
docker run --rm \ --device=/dev/kfd --device=/dev/dri \ --group-add video --shm-size=16G \ -v /path/to/aiter:/aiter \ rocm/pytorch:latest \ bash -c " cd /aiter rm -rf aiter/jit/*.so aiter/jit/build pip install -e . python op_tests/test_moe_topk_sigmoid.py \ --num-experts 64,128,256 \ --num-tokens 256,512,1024,2048,4096 \ --topk 4,8 \ --dtype fp16,bf16,fp32 "Baseline (upstream aiter, CK kernel):
Checklist
Environment
Hardware: AMD MI300X (gfx942), ROCm 7.2.0,
rocm/pytorch:latest