Create high-performance GPU kernels for state-of-the-art LLM architectures on NVIDIA Blackwell GPUs, with humans and/or AI agents.
FlashInfer-Bench is our official framework to evaluate your AI-generated kernels.
- 2026.02.05: The full dataset of definitions and workloads is released on HuggingFace
The competition features three tracks, each targeting a critical LLM operation:
| Track | Description |
|---|---|
| fused_moe | Fused Mixture-of-Experts kernel for efficient expert routing and computation |
| sparse_attention | Sparse attention mechanisms for long-context inference |
| gated_delta_net | Gated delta network operations for efficient state updates |
Fork this template once per track you want to compete in (separate repos for each track).
Click "Use this template" or fork this repository to create your solution repo.
```bash
conda create -n fi-bench python=3.12
conda activate fi-bench
pip install flashinfer-bench modal
```

We provide kernel definitions and workloads in FlashInfer-Trace format. Clone the competition dataset from HuggingFace:
```bash
git lfs install
git clone https://huggingface.co/datasets/flashinfer-ai/mlsys26-contest
```

Set the environment variable:

```bash
export FIB_DATASET_PATH=/path/to/flashinfer-trace
```

Edit `config.toml` to set your track and team info:
```toml
[solution]
name = "my-team-solution-v1"  # Solution name
definition = "fused_moe"      # Track: fused_moe | sparse_attention | gated_delta_net
author = "team-name"          # Team/author name

[build]
language = "triton"           # triton | cuda
entry_point = "kernel"        # Kernel function name
```

For Triton:
Edit `solution/triton/kernel.py` with your implementation.

For CUDA:

Edit `solution/cuda/kernel.cu` and `solution/cuda/binding.py` with your implementation.
Generate `solution.json` from your source files:

```bash
python scripts/pack_solution.py
```

Test your solution on your local GPU:

```bash
python scripts/run_local.py
```

Requires: a local CUDA-capable GPU and the `FIB_DATASET_PATH` environment variable.
Test your solution on NVIDIA B200 GPUs via Modal:
One-time setup:
```bash
modal setup
modal volume create flashinfer-trace
modal volume put flashinfer-trace /path/to/flashinfer-trace
```

Run the benchmark:

```bash
modal run scripts/run_modal.py
```

To submit your solution for evaluation:
- Ensure your implementation is complete and tested
- Run `python scripts/pack_solution.py` to generate `solution.json`
- Commit and push your changes
- Tag your commit for evaluation (e.g., `git tag submission-v1`)
```
flashinfer-bench-starter-kit/
├── README.md              # This file
├── config.toml            # Track configuration (edit this)
├── solution/              # Solution source files
│   ├── triton/            # Triton implementation
│   │   └── kernel.py      # Your Triton kernel
│   └── cuda/              # CUDA implementation
│       ├── kernel.cu      # Your CUDA kernel
│       └── binding.py     # TVM FFI bindings
├── scripts/               # Utility scripts
│   ├── run_local.py       # Local benchmark runner
│   ├── run_modal.py       # Modal cloud benchmark runner
│   └── pack_solution.py   # Pack source files into solution.json
└── images/                # Sponsor logos
```
FlashInfer Trace consists of multiple JSON objects (definitions, workloads, solutions, and traces), which can contain large code blocks. To easily visualize and inspect these objects, you can use the FlashInfer Trace Viewer. Simply paste any FlashInfer Trace JSON into the viewer to get a friendly, structured view of its contents.
```python
from flashinfer_bench import BuildSpec
from flashinfer_bench.agents import pack_solution_from_files, extract_solution_to_files

# Pack source files into a Solution object
spec = BuildSpec(
    language="triton",  # or "cuda"
    target_hardware=["cuda"],
    entry_point="my_kernel",
)
solution = pack_solution_from_files(
    path="./my_solution_dir",
    spec=spec,
    name="my_solution_v1",
    definition="fused_moe",
    author="your_name",
)

# Extract a Solution to files in a working directory
extract_solution_to_files(solution, "./output_dir")
```

```python
from flashinfer_bench.agents import flashinfer_bench_run_sanitizer

output = flashinfer_bench_run_sanitizer(
    solution=solution,
    workload=workload,
    sanitizer_types=["memcheck", "racecheck", "synccheck", "initcheck"],
    timeout=300,
)
print(output)
```

```python
from flashinfer_bench.agents import flashinfer_bench_run_ncu

output = flashinfer_bench_run_ncu(
    solution=solution,
    workload=workload,
    set="detailed",
    page="details",
    timeout=120,
)
print(output)
```

```python
from flashinfer_bench.agents import get_all_tool_schemas

schemas = get_all_tool_schemas()
# Returns a list of OpenAI-compatible function schemas
```

FlashInfer-Bench uses destination-passing style (DPS) by default, where both inputs and outputs are passed as function parameters. DPS avoids measuring tensor-allocation overhead, resulting in more accurate performance numbers. We recommend using DPS when possible, as it yields better benchmark results.
Important: Avoid using variadic input arguments in your kernel signatures, as they will fail the builder validation check.
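To make the two styles concrete, here is a minimal PyTorch sketch; the kernel names and the doubling operation are hypothetical, for illustration only:

```python
import torch

# DPS (the default): the harness pre-allocates `out` and the kernel
# writes its result into it in place, returning nothing.
def scale_kernel(x: torch.Tensor, out: torch.Tensor) -> None:
    out.copy_(x * 2.0)

# Value-returning style: the kernel allocates and returns its output.
# This requires "destination_passing_style": false in the solution spec.
def scale_kernel_vr(x: torch.Tensor) -> torch.Tensor:
    return x * 2.0
```

Both compute the same result; the difference is only in who allocates the output tensor and whether it appears in the parameter list.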
If your kernel uses value-returning style (i.e., it returns output tensors instead of writing to pre-allocated ones), set `destination_passing_style` to `false` in your solution's spec:

```json
{
  "name": "my_solution",
  "definition": "gdn_decode_qk4_v8_d128_k_last",
  "author": "my_name",
  "spec": {
    "language": "triton",
    "target_hardware": ["cuda"],
    "entry_point": "kernel.py::my_kernel",
    "dependencies": [],
    "destination_passing_style": false
  },
  "sources": [...]
}
```

A common error when DPS is mismatched:

```
Destination-passing style callable: expected xx parameters, but got xx
```

This can happen for two reasons: (1) your kernel function signature has the wrong number of parameters, or (2) your kernel uses value-returning style but the solution still has `destination_passing_style` set to `true` (the default). In the latter case, fix it by setting `destination_passing_style` to `false`.
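A quick way to see why the parameter counts differ: for a definition with N inputs and M outputs, a DPS kernel takes N + M parameters, while a value-returning kernel takes only N. A hypothetical sketch with plain-Python stand-ins (not real kernels):

```python
import inspect

# Hypothetical definition: 2 inputs, 1 output.
def add_kernel_dps(a, b, out):   # DPS: 2 inputs + 1 output = 3 parameters
    out[:] = [x + y for x, y in zip(a, b)]

def add_kernel_vr(a, b):         # value-returning: inputs only = 2 parameters
    return [x + y for x, y in zip(a, b)]

# The builder's validation effectively compares the signature's arity
# against the count expected for the declared style.
n_inputs, n_outputs = 2, 1
assert len(inspect.signature(add_kernel_dps).parameters) == n_inputs + n_outputs
assert len(inspect.signature(add_kernel_vr).parameters) == n_inputs
```

If the declared style and the actual signature disagree, the expected and observed counts in the error message will differ by the number of outputs.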
For CUDA kernel implementations, we recommend TVM FFI for the Python bindings. The `flashinfer_bench.agents` module provides TVM FFI agent instruction prompts to assist with development.

Set the `binding` field in your solution's spec to specify the C++ binding type. It defaults to `"tvm-ffi"` if not specified; supported values are `"tvm-ffi"` and `"torch"`:
```json
{
  "name": "my_cuda_solution",
  "definition": "gdn_decode_qk4_v8_d128_k_last",
  "author": "my_name",
  "spec": {
    "language": "cuda",
    "target_hardware": ["cuda"],
    "entry_point": "kernel.cu::my_kernel",
    "dependencies": [],
    "binding": "torch"
  },
  "sources": [...]
}
```

