[RFC] Helion Autotuning service for instantaneous auto-tuning times #1518

@msaroufim


I'm opening this RFC because it's a feature I personally really need: Helion, but without the waiting. I'd love to hear whether this is something the team is interested in maintaining in the repo, or whether you'd prefer it continue as a fork: msaroufim/helion

So I have a POC that uses Modal, a serverless provider, to dispatch every individual autotuning run to its own warm Modal machine. The benefit: I can autotune over 574 configs, with multiple rounds of dispatch, end to end in 70s over 10 machines at an estimated cost of 80c. It's also trivial to parallelize over 100 or 1,000 machines with the same setup, with the real bottleneck often being Triton compilation times.

And I did all of this from my local MacBook without access to a GPU at all! Truly webscale!! I can now use Helion to make submissions to gpumode.com, since I can do a large-scale autotune on my Mac and then submit the specific config I care about without autotuning in prod. I don't have to waste expensive cloud credits on GPUs, because Helion mostly spends its time on CPU coming up with candidate kernels and then compiling them. See #1416 for more context.

I suspect many people feel the same way; I did when I worked on #739, and it looks like @hinriksnaer is also trying to solve this problem in other ways, like early termination and overlapped compilation.

It's a big change from how Helion is pitched today, so I'm not sure opening a PR is appropriate quite yet, but you can see my fork at https://github.com/msaroufim/helion if you're curious to learn more.

You can try this on your own kernels by running `HELION_AUTOTUNER=ModalSearch python my_kernel.py`, or programmatically via `best_config = modal_autotune(my_kernel_fn, *args, gpu_type="H100", n_configs=20)`.

The logs make it clearer what's going on: whenever we dispatch configs, we parallelize them over N machines on Modal, and at the end we get the best config as a decorator we can use.

Autotuning helion matmul: 4096x4096 @ 4096x4096 (float16)
  Using ModalSearch -> dispatching to Modal H100 workers

  [0s] Autotune random seed: 1943058816
  [0s] Dispatching 20 configs to Modal (H100)
  [3s] Initial population:
  ok=20
  min=0.5067
  mid=5.0798
  max=34.9512
  best={'block_sizes': [128, 64, 128],
   'indexing': ['block_ptr', 'pointer', 'pointer'],
   'l2_groupings': [32],
   'load_eviction_policies': ['', 'first'],
   'loop_orders': [[1, 0]],
   'maxnreg': 64,
   'num_sm_multiplier': 32,
   'num_stages': 4,
   'num_warps': 32,
   'pid_type': 'persistent_blocked',
   'range_warp_specializes': []}
  [3s] Generation 1: exploring 77 neighbors
  [4s] Dispatching 77 configs to Modal (H100)
  [13s] Generation 1: improved 0.5067ms -> 0.3513ms (30.67%)
  [13s] Generation 2: exploring 75 neighbors
  [13s] Dispatching 75 configs to Modal (H100)
  [25s] Generation 2: improved 0.3513ms -> 0.3281ms (6.60%)
  [25s] Generation 3: exploring 73 neighbors
  [25s] Dispatching 73 configs to Modal (H100)
  [30s] Generation 3: improved 0.3281ms -> 0.2516ms (23.34%)
  [30s] Generation 4: exploring 75 neighbors
  [30s] Dispatching 75 configs to Modal (H100)
  [34s] Generation 4: improved 0.2516ms -> 0.2401ms (4.55%)
  [34s] Generation 5: exploring 74 neighbors
  [34s] Dispatching 74 configs to Modal (H100)
  [37s] Generation 5: improved 0.2401ms -> 0.2389ms (0.51%)
  [37s] Generation 6: exploring 73 neighbors
  [37s] Dispatching 73 configs to Modal (H100)
  [41s] Generation 6: improved 0.2389ms -> 0.2378ms (0.45%)
  [41s] Generation 7: exploring 68 neighbors
  [41s] Dispatching 68 configs to Modal (H100)
  [44s] Generation 7: improved 0.2378ms -> 0.2109ms (11.33%)
  [44s] Generation 8: exploring 71 neighbors
  [44s] Dispatching 71 configs to Modal (H100)
  [48s] Generation 8: improved 0.2109ms -> 0.1918ms (9.02%)
  [48s] Generation 9: exploring 73 neighbors
  [48s] Dispatching 73 configs to Modal (H100)
  [53s] Generation 9: improved 0.1918ms -> 0.1894ms (1.25%)
  [53s] Generation 10: exploring 50 neighbors
  [53s] Dispatching 50 configs to Modal (H100)
  [56s] Generation 10: improved 0.1894ms -> 0.1881ms (0.73%)
  [56s] Starting finishing phase with 3 rounds
  [56s] Dispatching 11 configs to Modal (H100)
  [59s] Dispatching 1 configs to Modal (H100)
  [61s] Finishing round 1: simplified to Config(block_sizes=[128, 256, 64], indexing=['block_ptr', 'block_ptr', 'pointer'],
  l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], maxnreg=256, num_sm_multiplier=1, num_stages=3,
  num_warps=8, pid_type='persistent_blocked', range_warp_specializes=[]), perf=0.1889ms
  [61s] Dispatching 9 configs to Modal (H100)
  [64s] Dispatching 1 configs to Modal (H100)
  [66s] Finishing round 2: simplified to Config(block_sizes=[128, 256, 64], indexing=['pointer', 'pointer', 'pointer'],
  l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], maxnreg=None, num_sm_multiplier=1, num_stages=3,
  num_warps=8, pid_type='persistent_blocked', range_warp_specializes=[]), perf=0.1882ms
  [66s] Dispatching 6 configs to Modal (H100)
  [69s] Finishing round 3: no simplification maintained performance, stopping early
  [69s] Finishing phase complete: final config=Config(block_sizes=[128, 256, 64], num_sm_multiplier=1, num_stages=3, num_warps=8,
  pid_type='persistent_blocked')
  [70s] Autotuning complete in 70.6s after searching 757 configs.
  One can hardcode the best config and skip autotuning with:
      @helion.kernel(config=helion.Config(block_sizes=[128, 256, 64], indexing=['block_ptr', 'block_ptr', 'pointer'],
  l2_groupings=[16], load_eviction_policies=['', 'first'], maxnreg=256, num_sm_multiplier=32, num_stages=3, num_warps=8,
  pid_type='persistent_blocked'), static_shapes=True)

And the end to end architecture looks something like

[Image: end-to-end architecture diagram]

Implementation

This won't work with slow cold starts, so here's what we did to fix that:

  1. Each autotuning run needs to receive the input tensors over the network, which for large tensors is prohibitive. Instead we pay the cost once by dropping those tensors into a distributed KV store that all workers can access.
  2. Each autotuning run potentially needs to instantiate a new container: a complete cold start can take anywhere from 30s to 1min, which is prohibitive, but if we warm up the containers, or accept a one-time warmup amortized over many autotuning runs, this overhead drops to roughly 1-2s.
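The first point can be sketched in plain Python. This is a minimal stand-in, not the actual fork's code: the dict below plays the role of the distributed KV store, and the function names are hypothetical. The idea is content-addressing: the driver uploads the serialized inputs once, keyed by hash, and every dispatched config then ships only the small key.

```python
import hashlib
import pickle

# Hypothetical stand-in for the distributed KV store that all Modal
# workers can reach; a plain dict so the sketch runs anywhere.
REMOTE_STORE: dict[str, bytes] = {}

def put_inputs_once(tensors) -> str:
    """Serialize the kernel inputs, key them by content hash, and
    upload only if the store doesn't already hold them."""
    blob = pickle.dumps(tensors)
    key = hashlib.sha256(blob).hexdigest()
    if key not in REMOTE_STORE:  # pay the network cost a single time
        REMOTE_STORE[key] = blob
    return key

def fetch_inputs(key: str):
    """What each warm worker does: fetch by key instead of receiving
    the full tensors with every dispatched config."""
    return pickle.loads(REMOTE_STORE[key])

# Driver uploads once; each of the hundreds of config dispatches
# afterwards carries only the 64-character key.
key = put_inputs_once({"a_shape": (4096, 4096), "b_shape": (4096, 4096)})
inputs = fetch_inputs(key)
```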

This also works nicely for Helion developers: you can make local changes, and when a new container spins up it will build Helion from source, capturing the current state of the repo, and run super fast autotuning for you.

It's also important to note that the Modal autotuner integrates natively into Helion, so any search algorithm (PatternSearch, LFBO, DE) will just work, and improvements to autotuning times there remain additive because we can then run even larger searches. And since Modal is serverless, it's easy to increase the number of concurrent GPU runs or change the GPU type for very speedy autotuning.
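The reason any search algorithm just works is that only the inner evaluation step changes: instead of benchmarking candidates one by one locally, each generation's candidates fan out over N workers. A minimal sketch of that pattern, with a simulated benchmark standing in for a Modal GPU worker (all names here are illustrative, not the fork's API):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def benchmark_on_worker(config: dict) -> float:
    # Stand-in for "run this config on one warm Modal GPU worker".
    # Deterministic fake latency (ms) so the sketch runs without a GPU.
    rng = random.Random(str(sorted(config.items())))
    return 0.2 + rng.random()

def dispatch_generation(configs: list[dict], n_workers: int = 10) -> dict:
    """One generation of any search algorithm: fan the candidate
    configs out over n_workers in parallel and keep the fastest."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        timings = list(pool.map(benchmark_on_worker, configs))
    perf, idx = min(zip(timings, range(len(configs))))
    return {"perf_ms": perf, "config": configs[idx]}

# The outer search (pattern search, DE, ...) proposes the population;
# the dispatcher only changes *where* candidates are timed.
population = [{"block_m": m, "num_warps": w} for m in (64, 128) for w in (4, 8)]
winner = dispatch_generation(population)
```

Scaling to 100 or 1,000 concurrent runs is then just a matter of raising `n_workers` on the serverless side.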

So the main bottleneck is now Triton compilation time, which is great, since that's the one thing we don't have full control over.

To close, it's quite beautiful seeing all the queued autotuning jobs on Modal pass successfully; just think of it as spawning many Python processes in parallel on separate machines.

[Image: queued autotuning jobs on the Modal dashboard]

Open questions and concerns

There are a few legitimate reasons why it might take more work to make this truly ready for prod use. I've listed the problems I know of, but I'm sure there are more. @charlesfrye might find the below fun to think through.

  • Modal dependency: Modal is popular, but not everyone uses it. I expect we could generalize the code into something like QueueBasedJobSubmissionAutotune for teams that do have multiple machines to parallelize over. There's obviously more work needed there, but I know this is something we're looking into to better manage PyTorch CI machines, and something @seemethere has been thinking a lot about. Granted, something outside the core repo doesn't really need to be generic, so being Modal-only is still probably fine.
  • Autotuning sensitivity: Modal is not a neocloud, so there's no guarantee all the machines we get are the exact same SKU. I'd argue this is fine: we should perhaps have a notion of an "autotune result within noise" and aim for autotuning results that are robust to small variations in machine type and temperature. I'm not sure exactly how to do this yet, but I think it's possible to think through. One idea @ngc92 proposed is to have the remote autotuning process return 5 candidate solutions, which we then run serially on the target machine.
  • Extend to more DSLs, not just Triton: @drisspg has an awesome PR making the Helion autotuner work for more DSLs. There's an explosion of DSLs that all need an autotuner, AIs writing kernels love to splat out a mini autotune run for some speedups, and this is the kind of service I suspect would be quite popular.
  • Building this as a standalone service: an obvious first step is telling Helion users to get a Modal key so they can autotune from their laptops, but ultimately I suspect a more powerful solution is for us to manage a backend service where people author Helion kernels and we autotune them on our end, ending up with a rich dataset of autotuning results. One obvious benefit is that we could cache results for important kernels like matmuls, but we could also use the dataset to train an even better autotuner and recursively self-improve Helion.
  • This is cool but ultimately out of scope and should be a separate repo: I would buy this argument. I wouldn't be too upset (well, a bit upset), but I'd understand, and I suspect we'd do something of this nature in a gpu-mode org project.
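On the sensitivity point, @ngc92's idea can be sketched quickly: the remote search hands back its top few candidates, and the target machine re-times each one serially several times, ranking by median so a single noisy sample can't crown the winner. Everything below is a simulated, hypothetical illustration, not the fork's implementation; the timings are faked so it runs without a GPU.

```python
import random
import statistics

def time_on_target(config: dict, reps: int = 5) -> float:
    """Serially re-time one candidate on the actual target machine and
    take the median across reps, damping run-to-run noise. Latency is
    simulated here: a pretend 'true' time plus Gaussian jitter."""
    base = 0.19 + 0.01 * config["rank"]  # pretend true latency in ms
    samples = [base + random.gauss(0, 0.005) for _ in range(reps)]
    return statistics.median(samples)

def pick_robust_winner(candidates: list[dict]) -> dict:
    # candidates: e.g. the 5 best configs the remote Modal search returned
    return min(candidates, key=time_on_target)

random.seed(0)
top5 = [{"rank": r, "num_warps": 8} for r in range(5)]
final = pick_robust_winner(top5)
```

Median-of-reps is only one possible definition of "within noise"; confidence intervals over repeated runs would be a more principled variant of the same idea.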

AI disclosure: the prototype was entirely AI-written; I evaluated the code heavily myself and spot-checked the results locally and on Modal. This issue was written with no AI assistance.
