Conversation

@msinnha1

This PR adds a complete fusedMoE (Mixture of Experts) kernel implementation optimized for Intel XPU devices.

Key Features

  • Intel XPU/SYCL optimization with bfloat16 precision
  • PyTorch fallback implementations for compatibility
  • Comprehensive Python API with error handling
  • Multi-configuration testing and validation

Core Operations

  • fused_moe_forward(): Main MoE computation with expert routing (see the usage sketch after this list)
  • grouped_gemm_moe(): Efficient grouped GEMM for MoE workloads
  • silu_and_mul_moe(): Fused SiLU activation and gating
  • moe_align_block_size(): Token alignment for optimal processing

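A minimal usage sketch of the Python API listed above (illustrative only: the exact signatures live in python/sgl_kernel/moe.py and may differ; the tensor shapes, expert count, and the top_k keyword here are assumptions):

```python
import torch

# Hypothetical import path, based on the module layout described in this PR.
from sgl_kernel.moe import fused_moe_forward

device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"

num_tokens, hidden, inter, num_experts, top_k = 1024, 2048, 5632, 8, 2
x = torch.randn(num_tokens, hidden, dtype=torch.bfloat16, device=device)
# w1 holds the concatenated gate/up projections, w2 the down projection, per expert.
w1 = torch.randn(num_experts, 2 * inter, hidden, dtype=torch.bfloat16, device=device)
w2 = torch.randn(num_experts, hidden, inter, dtype=torch.bfloat16, device=device)
gating = torch.randn(num_tokens, num_experts, dtype=torch.bfloat16, device=device)

# Route each token to its top-k experts, run the grouped GEMMs and the fused
# SiLU-and-mul, and combine the expert outputs weighted by the router scores.
out = fused_moe_forward(x, w1, w2, gating, top_k=top_k)
print(out.shape)  # expected: (num_tokens, hidden)
```
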
Performance Results

  • Tested on Intel Arc B580 Graphics (xpu:0)
  • Latency: 0.03-1.27 ms for 1024 tokens (avg 0.10 ms); see the timing sketch after this list
  • Tested with 2-16 experts across a range of tensor sizes
  • Memory-efficient bfloat16 operations

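For context, latency figures like those above are typically collected with device-synchronized wall-clock timing; a minimal sketch (warm-up and iteration counts are arbitrary choices, and fused_moe_forward is assumed as in the earlier example, not necessarily the PR's benchmark harness):

```python
import time
import torch

def benchmark_ms(fn, *args, warmup=10, iters=100):
    # Warm up to exclude one-time compilation and allocation costs.
    for _ in range(warmup):
        fn(*args)
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.synchronize()  # wait for queued XPU work before stopping the clock
    return (time.perf_counter() - start) / iters * 1e3  # average ms per call

# e.g. avg_ms = benchmark_ms(fused_moe_forward, x, w1, w2, gating)
```
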
Architecture

  • C++ kernel interface: include/sgl_moe_kernel_ops.h
  • SYCL implementation: src/sycl/fused_moe.cpp
  • Python API: python/sgl_kernel/moe.py
  • PyTorch integration: src/torch_extension_sycl.cc

Provides a foundation for future CUTLASS/SYCL kernel optimization while ensuring immediate usability through robust PyTorch fallbacks.
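
The PyTorch fallback path mentioned above typically follows a try-the-compiled-op-first pattern; a minimal sketch for the SiLU-and-mul step (the sgl_kernel op namespace and the reference helper below are hypothetical illustrations, not the PR's actual code):

```python
import torch
import torch.nn.functional as F

def silu_and_mul_ref(x: torch.Tensor) -> torch.Tensor:
    # Pure-PyTorch reference: split the concatenated gate/up halves,
    # apply SiLU to the gate half, and multiply elementwise.
    gate, up = x.chunk(2, dim=-1)
    return F.silu(gate) * up

def silu_and_mul_moe(x: torch.Tensor) -> torch.Tensor:
    # Prefer the compiled SYCL kernel (registered via the torch extension);
    # fall back to the eager reference if the op is not available.
    try:
        return torch.ops.sgl_kernel.silu_and_mul_moe(x)  # hypothetical op name
    except (AttributeError, RuntimeError):
        return silu_and_mul_ref(x)
```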
