Conversation

@msinnha1

This PR adds a complete fusedMoE (Mixture of Experts) kernel implementation optimized for Intel XPU devices.

Key Features

  • Intel XPU/SYCL optimization with bfloat16 precision
  • PyTorch fallback implementations for compatibility
  • Comprehensive Python API with error handling
  • Multi-configuration testing and validation

Core Operations

  • fused_moe_forward(): Main MoE computation with expert routing (see the usage sketch after this list)
  • grouped_gemm_moe(): Efficient grouped GEMM for MoE workloads
  • silu_and_mul_moe(): Fused SiLU activation and gating
  • moe_align_block_size(): Token alignment for optimal processing

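A minimal usage sketch of the Python API listed above (illustrative only: the exact signatures live in python/sgl_kernel/moe.py and may differ; the tensor shapes, expert count, and the top_k keyword here are assumptions):

```python
import torch

# Hypothetical import path, based on the module layout described in this PR.
from sgl_kernel.moe import fused_moe_forward

device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"

num_tokens, hidden, inter, num_experts, top_k = 1024, 2048, 5632, 8, 2
x = torch.randn(num_tokens, hidden, dtype=torch.bfloat16, device=device)
# w1 holds the concatenated gate/up projections, w2 the down projection, per expert.
w1 = torch.randn(num_experts, 2 * inter, hidden, dtype=torch.bfloat16, device=device)
w2 = torch.randn(num_experts, hidden, inter, dtype=torch.bfloat16, device=device)
gating = torch.randn(num_tokens, num_experts, dtype=torch.bfloat16, device=device)

# Route each token to its top-k experts, run the grouped GEMMs and the fused
# SiLU-and-mul, and combine the expert outputs weighted by the router scores.
out = fused_moe_forward(x, w1, w2, gating, top_k=top_k)
print(out.shape)  # expected: (num_tokens, hidden)
```
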
Performance Results

  • Tested on Intel Arc B580 Graphics (xpu:0)
  • Latency: 0.03-1.27 ms for 1024 tokens (avg 0.10 ms); see the timing sketch after this list
  • Tested with 2-16 experts across a range of tensor sizes
  • Memory-efficient bfloat16 operations

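For context, latency figures like those above are typically collected with device-synchronized wall-clock timing; a minimal sketch (warm-up and iteration counts are arbitrary choices, and fused_moe_forward is assumed as in the earlier example, not necessarily the PR's benchmark harness):

```python
import time
import torch

def benchmark_ms(fn, *args, warmup=10, iters=100):
    # Warm up to exclude one-time compilation and allocation costs.
    for _ in range(warmup):
        fn(*args)
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.synchronize()  # wait for queued XPU work before stopping the clock
    return (time.perf_counter() - start) / iters * 1e3  # average ms per call

# e.g. avg_ms = benchmark_ms(fused_moe_forward, x, w1, w2, gating)
```
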
Architecture

  • C++ kernel interface: include/sgl_moe_kernel_ops.h
  • SYCL implementation: src/sycl/fused_moe.cpp
  • Python API: python/sgl_kernel/moe.py
  • PyTorch integration: src/torch_extension_sycl.cc

Provides a foundation for future CUTLASS/SYCL kernel optimization while ensuring immediate usability through robust PyTorch fallbacks.
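
The PyTorch fallback path mentioned above typically follows a try-the-compiled-op-first pattern; a minimal sketch for the SiLU-and-mul step (the sgl_kernel op namespace and the reference helper below are hypothetical illustrations, not the PR's actual code):

```python
import torch
import torch.nn.functional as F

def silu_and_mul_ref(x: torch.Tensor) -> torch.Tensor:
    # Pure-PyTorch reference: split the concatenated gate/up halves,
    # apply SiLU to the gate half, and multiply elementwise.
    gate, up = x.chunk(2, dim=-1)
    return F.silu(gate) * up

def silu_and_mul_moe(x: torch.Tensor) -> torch.Tensor:
    # Prefer the compiled SYCL kernel (registered via the torch extension);
    # fall back to the eager reference if the op is not available.
    try:
        return torch.ops.sgl_kernel.silu_and_mul_moe(x)  # hypothetical op name
    except (AttributeError, RuntimeError):
        return silu_and_mul_ref(x)
```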
