An LLM agent baseline for the MLSys 2026 FlashInfer AI Kernel Generation Contest. See the flashinfer-bench-starter-kit to get started.
An LLM agent baseline that iteratively generates and refines Triton kernels for high-performance LLM operations on NVIDIA GPUs, evaluated via FlashInfer-Bench. For the benchmarking framework code, see the flashinfer-bench repo.
```
agent/
  main.py               # Entry point & task orchestration
  iterative_agent.py    # Iterative Agent: propose + refine loop
  evolve_agent.py       # Evolve Agent: elite pool evolution loop
  api.py                # LLM API client (OpenAI / Claude)
  eval.py               # Kernel evaluation via flashinfer-bench API
  modal_eval.py         # Remote kernel evaluation on Modal GPU
  utils.py              # Shared utilities & data helpers
prompt/
  proposer_prompt.py    # Kernel proposal prompt
  tuner_prompt.py       # Kernel tuning prompt (str_replace edits)
config/
  config_iterative.yaml # Iterative agent config
  config_evolve.yaml    # Evolve agent config
  config_mini_test.yaml # Quick smoke test config
  tasks_default.txt     # Default task list
  tasks_mini.txt        # Minimal task list for smoke test
datasets/               # FlashInfer-Trace / MLSys contest datasets
requirements.txt        # Python dependencies
```
```shell
pip install -r requirements.txt
mkdir datasets
git lfs install
git clone https://huggingface.co/datasets/flashinfer-ai/mlsys26-contest datasets/mlsys26-contest
export ANTHROPIC_API_KEY=...   # or OPENAI_API_KEY
```

Local GPU:

```shell
python3 -m agent.main --config config/config_mini_test.yaml
```

Remote GPU via Modal (no local GPU needed):

```shell
pip install modal
modal setup   # one-time auth
python3 -m agent.main --config config/config_mini_test.yaml \
    --eval_backend modal --modal_gpu B200
```

The dataset is automatically uploaded to a Modal Volume on the first run and cached for subsequent runs.
| Type | Description |
|---|---|
| `iterative` | Proposes an initial kernel, then repeatedly tunes it via `str_replace` edits |
| `evolve` | Proposes multiple kernels, maintains a recent + elite pool, samples and evolves |
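The `str_replace` edits used by the tuning loop can be illustrated with a minimal sketch; `apply_str_replace` here is a hypothetical helper for illustration, not the repo's actual implementation:

```python
def apply_str_replace(src: str, old: str, new: str) -> str:
    """Apply one str_replace edit; the old snippet must occur exactly once."""
    count = src.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match for edit, found {count}")
    return src.replace(old, new)

# Example: retune a block size in a kernel source string.
kernel_src = "BLOCK_SIZE = 64\n# ... rest of the Triton kernel ...\n"
kernel_src = apply_str_replace(kernel_src, "BLOCK_SIZE = 64", "BLOCK_SIZE = 128")
```

Requiring a unique match makes each edit unambiguous: if the LLM proposes a snippet that matches zero or multiple locations, the edit is rejected rather than applied somewhere unintended.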
Example (`config/config_iterative.yaml`):

```yaml
test_source: mlsys26-contest
agent_type: iterative
tasks_path: config/tasks_default.txt
gpu_name: B200
gpu_architecture: Blackwell
api_type: claude
model_name: claude-sonnet-4-5
total_steps: 25
eval_backend: local   # "local" or "modal"
modal_gpu: B200       # GPU type for Modal (ignored when eval_backend=local)
```

Available configs:
| Config | Agent Type |
|---|---|
| `config_iterative.yaml` | Iterative Agent |
| `config_evolve.yaml` | Evolve Agent |
| `config_mini_test.yaml` | Quick smoke test |
Key parameters:
- `test_source`: `mlsys26-contest` or `flashinfer-trace`
- `agent_type`: `iterative` or `evolve`
- `tasks_path`: file listing op types / problem IDs to solve
- `total_steps`: number of iterations per task
- `api_type`: `openai` or `claude`
- `model_name`: LLM model to use
- `eval_backend`: `local` (default) or `modal` for remote GPU evaluation
- `modal_gpu`: GPU type on Modal (e.g. `B200`)
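As a rough illustration of how the enumerated parameters above constrain a config, here is a hypothetical validation helper (illustrative only, not part of the repo):

```python
# Allowed values for the enumerated config fields, per the README.
ALLOWED = {
    "test_source": {"mlsys26-contest", "flashinfer-trace"},
    "agent_type": {"iterative", "evolve"},
    "api_type": {"openai", "claude"},
    "eval_backend": {"local", "modal"},
}

def validate_config(cfg: dict) -> dict:
    """Fill in the eval_backend default and check enumerated fields."""
    cfg = {"eval_backend": "local", **cfg}  # eval_backend defaults to local
    for key, allowed in ALLOWED.items():
        if cfg.get(key) not in allowed:
            raise ValueError(f"{key!r} must be one of {sorted(allowed)}")
    return cfg

cfg = validate_config({
    "test_source": "mlsys26-contest",
    "agent_type": "iterative",
    "api_type": "claude",
})
```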
One op type per line. Optionally specify kernel definition IDs after the op type:
```
dsa_paged
gdn
moe
gemm gemm_n128_k2048, gemm_n256_k4096
```
If no kernel definition IDs are given, all kernel definitions under that op type are loaded.
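This format can be parsed with a few lines of Python; the following is a sketch of the rules described above, not the repo's actual parser:

```python
def parse_tasks(text: str) -> dict:
    """Map each op type to its kernel-definition IDs ([] means 'all')."""
    tasks = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        op, _, rest = line.partition(" ")
        tasks[op] = [t.strip() for t in rest.split(",") if t.strip()]
    return tasks

tasks = parse_tasks("dsa_paged\ngemm gemm_n128_k2048, gemm_n256_k4096\n")
```

An op type with no trailing IDs maps to an empty list, which the loader treats as "load all kernel definitions for that op type".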
Results are saved under outputs/:
```
outputs/<agent_type>_<test_source>_<steps>_<timestamp>/
  config.yaml
  <op_type>_<problem_id>/
    reference_src.py
    proposal_0_1.py / tune_0_2.py / ...
    global_best_kernel_25.py
    global_best_metrics_25.json
```
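Given this layout, per-task results can be aggregated with a small helper. This sketch assumes each `global_best_metrics_*.json` file holds a JSON dict of metrics; the helper itself is hypothetical:

```python
import json
from pathlib import Path

def collect_best_metrics(run_dir: Path) -> dict:
    """Gather each task's latest global_best_metrics_*.json under a run dir."""
    results = {}
    for task_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()):
        metrics_files = sorted(task_dir.glob("global_best_metrics_*.json"))
        if metrics_files:
            # The highest-numbered file corresponds to the final step.
            results[task_dir.name] = json.loads(metrics_files[-1].read_text())
    return results
```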
```shell
python3 -m agent.main \
    --config config/config_iterative.yaml \
    --resume_from outputs/iterative_mlsys26-contest_25_20260208-121400
```

Tasks with existing results are skipped; incomplete tasks continue from where they left off.
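The skip/continue decision can be approximated with a completeness check based on the output naming above; this is a hypothetical sketch, not the repo's actual resume logic:

```python
from pathlib import Path

def is_task_complete(task_dir: Path, total_steps: int) -> bool:
    """Treat a task as finished once its final best-kernel file exists."""
    return (task_dir / f"global_best_kernel_{total_steps}.py").exists()
```

On resume, a task directory that passes this check would be skipped outright, while one with only intermediate `proposal_*` / `tune_*` files would continue from its last recorded step.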
See LICENSE.