topk_sigmoid: 1.66x faster DPP kernel with 256-expert and FP32 support #1909

stevenarellano · 2026-01-26T13:35:09Z

topk_sigmoid: 1.66x faster DPP kernel with 256-expert and FP32 support

Summary

This PR adds a GFX9-optimized topk_sigmoid kernel using DPP intrinsics while preserving the CK implementation as a fallback for other architectures.

Performance Highlights

Metric	CK Baseline	DPP Kernel	Improvement
Avg speedup (vs PyTorch)	4.32x	9.81x	+127%
Best case (vs PyTorch)	6.40x	15.81x	+147%
Worst case (vs PyTorch)	2.86x	5.99x	+109%
256 Experts	UNSUPPORTED	Works	SUPPORTED
FP32 Support	UNSUPPORTED	Works	SUPPORTED

DPP vs CK direct comparison: avg 1.66x, median 1.65x, range 1.42x - 1.94x

Motivation

The existing CK-based implementation of topk_sigmoid has room for performance enhancements through the use of DPP intrinsics. This PR also adds support for 256 experts and FP32 dtype - the CK implementation silently returns garbage for these cases because topk_softmax_api.cpp has no matching branch.

Technical Approach

Architecture-Aware Dispatch

Runtime detection automatically routes to the optimal kernel:

void topk_sigmoid(...) {
    if (isGPUArch({"gfx9"})) {
        topk_sigmoid_gfx9(...);  // DPP-optimized
    } else {
        topk_sigmoid_ck(...);    // CK fallback
    }
}

Benchmark Results

Full Side-by-Side Comparison (40 configs, fp16/bf16, 64/128 experts)

64 Experts

Tokens	TopK	DType	CK (μs)	DPP (μs)	Speedup	Status
256	4	fp16	4.19	2.57	1.63x	PASS
256	4	bf16	3.80	2.41	1.58x	PASS
256	8	fp16	5.74	3.17	1.81x	PASS
256	8	bf16	5.44	3.03	1.80x	PASS
512	4	fp16	3.98	2.52	1.58x	PASS
512	4	bf16	3.85	2.38	1.62x	PASS
512	8	fp16	5.65	3.21	1.76x	PASS
512	8	bf16	5.79	2.93	1.98x	PASS
1024	4	fp16	4.01	2.55	1.57x	PASS
1024	4	bf16	4.04	2.54	1.59x	PASS
1024	8	fp16	5.81	3.34	1.74x	PASS
1024	8	bf16	5.71	2.97	1.92x	PASS
2048	4	fp16	3.93	3.01	1.31x	PASS
2048	4	bf16	4.10	2.59	1.58x	PASS
2048	8	fp16	5.79	3.43	1.69x	PASS
2048	8	bf16	5.67	3.64	1.56x	PASS
4096	4	fp16	4.79	3.32	1.44x	PASS
4096	4	bf16	4.77	3.19	1.50x	PASS
4096	8	fp16	7.26	4.65	1.56x	PASS
4096	8	bf16	7.33	4.61	1.59x	PASS

128 Experts

Tokens	TopK	DType	CK (μs)	DPP (μs)	Speedup	Status
256	4	fp16	5.04	2.80	1.80x	PASS
256	4	bf16	4.84	2.86	1.69x	PASS
256	8	fp16	6.62	3.45	1.92x	PASS
256	8	bf16	6.62	3.59	1.84x	PASS
512	4	fp16	4.64	2.73	1.70x	PASS
512	4	bf16	4.82	2.96	1.63x	PASS
512	8	fp16	6.58	3.75	1.75x	PASS
512	8	bf16	6.66	3.67	1.81x	PASS
1024	4	fp16	4.78	2.90	1.65x	PASS
1024	4	bf16	4.93	3.05	1.62x	PASS
1024	8	fp16	6.60	3.48	1.90x	PASS
1024	8	bf16	6.66	3.80	1.75x	PASS
2048	4	fp16	4.93	2.96	1.67x	PASS
2048	4	bf16	4.76	3.10	1.54x	PASS
2048	8	fp16	6.65	4.03	1.65x	PASS
2048	8	bf16	6.60	3.93	1.68x	PASS
4096	4	fp16	5.92	3.69	1.60x	PASS
4096	4	bf16	5.85	3.89	1.50x	PASS
4096	8	fp16	8.48	5.53	1.53x	PASS
4096	8	bf16	8.28	5.57	1.49x	PASS

Summary: 40/40 PASS | Avg: 1.66x | Median: 1.65x | Best: 1.98x | Worst: 1.31x

Test Plan

Used internal op test: op_tests/test_moe_topk_sigmoid.py

Reproduce Benchmarks

This PR:

docker run --rm \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video --shm-size=16G \
    -v /path/to/aiter:/aiter \
    rocm/pytorch:latest \
    bash -c "
        cd /aiter
        rm -rf aiter/jit/*.so aiter/jit/build
        pip install -e .
        python op_tests/test_moe_topk_sigmoid.py \
            --num-experts 64,128,256 \
            --num-tokens 256,512,1024,2048,4096 \
            --topk 4,8 \
            --dtype fp16,bf16,fp32
    "

Baseline (upstream aiter, CK kernel):

# Note: Use 64,128 experts and fp16,bf16 only - CK silently fails on 256 experts and fp32
docker run --rm \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video --shm-size=16G \
    -v /path/to/upstream-aiter:/aiter \
    rocm/pytorch:latest \
    bash -c "
        cd /aiter
        rm -rf aiter/jit/*.so aiter/jit/build
        pip install -e .
        python op_tests/test_moe_topk_sigmoid.py \
            --num-experts 64,128 \
            --num-tokens 256,512,1024,2048,4096 \
            --topk 4,8 \
            --dtype fp16,bf16
    "

Checklist

Code builds successfully
All 144 test configurations pass (4 expert counts × 3 token counts × 4 topk × 3 dtypes)
Performance benchmarked on MI300X
Backward compatible (CK fallback for non-GFX9)
256/512 experts support added
FP32 dtype support added
Pre-commit hooks pass

Environment

Hardware: AMD MI300X (gfx942), ROCm 7.2.0, rocm/pytorch:latest

- 1.66x average speedup over CK implementation (range 1.42x - 1.94x) - Adds support for 256 experts (CK limited to 192) - Adds FP32 dtype support (CK only supports fp16/bf16)

Copilot

Pull request overview

This PR adds a GFX9‑optimized DPP implementation of topk_sigmoid and wires it into the existing CK-based interface with architecture-aware dispatch, extending support to 256+ experts and fp32 while keeping CK as a fallback for non‑gfx9.

Changes:

Implemented a warp‑level DPP topk_sigmoid_kernel with support for varying experts-per-thread (up to 512 experts) and fused sigmoid + top‑k selection.
Added a topk_sigmoid_gfx9 launcher and is_gfx9_arch-based runtime dispatch, with CK (topk_softmax) as a fallback for other architectures.
Updated the CK wrapper to route through topk_sigmoid_ck, clarifying expert-count limits and dtype support.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

csrc/py_itfs_ck/topk_sigmoid_kernels.cu

- Add TORCH_CHECK for topk_weights (float32) and topk_indices (int32) - Add TORCH_CHECK for topk <= num_experts - Extend test parametrization to cover 256/512 experts and fp32

topk_sigmoid: add GFX9 DPP kernel with 256-expert and FP32 support

62abe7c

- 1.66x average speedup over CK implementation (range 1.42x - 1.94x) - Adds support for 256 experts (CK limited to 192) - Adds FP32 dtype support (CK only supports fp16/bf16)

stevenarellano requested review from a team and Copilot January 26, 2026 13:35

Copilot started reviewing on behalf of stevenarellano January 26, 2026 13:35 View session

Merge branch 'main' into feat/topk-sigmoid-dpp-optimization

3f97b20

Copilot AI reviewed Jan 26, 2026

View reviewed changes

csrc/py_itfs_ck/topk_sigmoid_kernels.cu Show resolved Hide resolved

csrc/py_itfs_ck/topk_sigmoid_kernels.cu Show resolved Hide resolved

csrc/py_itfs_ck/topk_sigmoid_kernels.cu Show resolved Hide resolved

csrc/py_itfs_ck/topk_sigmoid_kernels.cu Show resolved Hide resolved

valarLip requested a review from junhaha666 January 26, 2026 13:52

stevenarellano and others added 3 commits January 26, 2026 22:37

Add input validation and extend test coverage

4162f30

- Add TORCH_CHECK for topk_weights (float32) and topk_indices (int32) - Add TORCH_CHECK for topk <= num_experts - Extend test parametrization to cover 256/512 experts and fp32

Merge branch 'main' into feat/topk-sigmoid-dpp-optimization

4d6250e

Merge branch 'main' into feat/topk-sigmoid-dpp-optimization

6e3f91d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

topk_sigmoid: 1.66x faster DPP kernel with 256-expert and FP32 support #1909

topk_sigmoid: 1.66x faster DPP kernel with 256-expert and FP32 support #1909

stevenarellano commented Jan 26, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

topk_sigmoid: 1.66x faster DPP kernel with 256-expert and FP32 support #1909

Are you sure you want to change the base?

topk_sigmoid: 1.66x faster DPP kernel with 256-expert and FP32 support #1909

Conversation

stevenarellano commented Jan 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

topk_sigmoid: 1.66x faster DPP kernel with 256-expert and FP32 support

Summary

Performance Highlights

Motivation

Technical Approach

Architecture-Aware Dispatch

Benchmark Results

Full Side-by-Side Comparison (40 configs, fp16/bf16, 64/128 experts)

64 Experts

128 Experts

Test Plan

Reproduce Benchmarks

Checklist

Environment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

stevenarellano commented Jan 26, 2026 •

edited

Loading