Problem
C++ abort() calls during Triton compilation (e.g. assertion failures in LinearLayout::reshapeOuts, or any future internal Triton invariant failure) kill the parent Python process and take down the entire session. The autotuner has no chance to discard the offending config and try another. The user sees an unrecoverable crash that is reported as a "Shape Failure" for each shape.
Concrete recent example: rope-bwd autotune on H100/B200/MI350X aborts on the tt.reshape <1xAxBxbf16> -> <AxBxbf16> pattern emitted from emit_tl_dot_with_padding's squeeze path. Triton's assertion fires, parent dies, all 8 rope-bwd shapes are reported as Shape Failures. PR #2494 worked around the specific reshape, but the underlying robustness gap remains: any other compile-time C++ crash will reproduce the same outcome.
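The failure mode can be reproduced without Triton. The following is a minimal sketch, assuming a POSIX system, in which os.abort() stands in for a failed C++ assertion. CRASHING_COMPILE and run_isolated are hypothetical names used only for illustration. It shows that a try/except around the compile call cannot recover from the abort, while a parent that runs the compile in a separate process only observes a return code.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a compile step that trips a native assertion.
# os.abort() raises SIGABRT, the same signal a failed C/C++ assert() raises.
CRASHING_COMPILE = """
import os
try:
    os.abort()
except BaseException:
    print("recovered")  # never reached: SIGABRT is not a Python exception
"""


def run_isolated(source: str) -> subprocess.CompletedProcess:
    """Run `source` in a fresh interpreter so a crash cannot reach the caller."""
    return subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
    )


result = run_isolated(CRASHING_COMPILE)
# On POSIX, a negative return code means the child was killed by that signal.
child_aborted = result.returncode == -signal.SIGABRT
handler_ran = "recovered" in result.stdout
print(f"child aborted: {child_aborted}, except block ran: {handler_ran}")
```

The parent keeps running after the child dies, which is the property the autotuner needs in order to discard one config and move on.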
Proposal
Compile-phase precompile should be isolated from the parent process the same way the benchmark phase already is. The benchmark side (PR #2111, extended in #2487) uses a long-lived spawn worker so a hung or crashing benchmark kernel can be killed without losing autotune progress. The same pattern can be extended to precompile.
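Some prior attempts:
- [...] abort() from the runtime.
- autotune_precompile modes (fork default, spawn, None) trade off: fork is fast but inherits CUDA state; spawn isolates fully but is roughly 2x slower per the #2128 ([Autotuner] Long-lived worker pool for parallel precompile) measurements; None runs in-process and is what fails today.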
Constraint
Compile time must not regress versus the current fork default. Past experiments (see #2128) showed plain spawn is about 2x slower than fork on total compile time. Any new isolation mechanism needs to preserve fork-level speed, e.g. via a long-lived worker pool, reuse of imported modules across compile calls, or amortizing the spawn cost across many configs.
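One way to amortize the spawn cost is a worker that is started once, handles many configs, and is restarted only after it crashes. The sketch below illustrates that idea under stated assumptions: it uses a plain subprocess with line-delimited JSON instead of the project's actual worker machinery, CompileWorker and the "crash" key are hypothetical, and os.abort() again stands in for a C++ assertion failure. It does not handle hangs or timeouts.

```python
import json
import subprocess
import sys

# Source of the long-lived worker. Expensive imports (for example the compiler)
# would run once at the top of this script, not once per config.
WORKER_SOURCE = """
import json, os, sys
for line in sys.stdin:
    config = json.loads(line)
    if config.get("crash"):
        os.abort()  # stand-in for a native assertion failure during compile
    print(json.dumps({"ok": True}), flush=True)
"""


class CompileWorker:
    """Hypothetical long-lived compile worker that is respawned after a crash."""

    def __init__(self) -> None:
        self.proc = None
        self.spawn_count = 0

    def _ensure_started(self) -> None:
        if self.proc is None or self.proc.poll() is not None:
            self.proc = subprocess.Popen(
                [sys.executable, "-c", WORKER_SOURCE],
                stdin=subprocess.PIPE,
                stdout=subprocess.PIPE,
                stderr=subprocess.DEVNULL,
                text=True,
            )
            self.spawn_count += 1

    def compile(self, config: dict) -> bool:
        """Return True if the config compiled, False if the worker died."""
        self._ensure_started()
        try:
            self.proc.stdin.write(json.dumps(config) + "\n")
            self.proc.stdin.flush()
            reply = self.proc.stdout.readline()
        except BrokenPipeError:
            reply = ""
        if not reply:  # EOF: the worker crashed while handling this config
            self.proc.wait()
            self.proc = None  # the next call starts a fresh worker
            return False
        return json.loads(reply)["ok"]

    def close(self) -> None:
        if self.proc is not None and self.proc.poll() is None:
            self.proc.stdin.close()  # EOF ends the worker loop
            self.proc.wait()
            self.proc.stdout.close()


worker = CompileWorker()
configs = [{"block": 32}, {"block": 64, "crash": True}, {"block": 128}]
results = [worker.compile(config) for config in configs]
worker.close()
print(results, "workers spawned:", worker.spawn_count)
```

In this sketch the spawn cost is paid once up front and once more per crash, instead of once per config, which is the kind of amortization the constraint asks for.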
Acceptance criteria
- A compile-time C++ abort in any one config skips that config and lets autotuning continue with the remaining configs.
- Geomean compile time on the dashboard (https://helionlang.com/dashboard/) stays on par with the current fork default.
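The second criterion could be checked as a geomean of per-kernel compile-time ratios. The sketch below is one possible formulation, not the dashboard's actual computation: the kernel names and timings are made up for illustration, and geomean_ratio is a hypothetical helper.

```python
from statistics import geometric_mean


def geomean_ratio(baseline_s: dict, candidate_s: dict) -> float:
    """Geomean of candidate/baseline compile time over kernels present in both."""
    shared = sorted(baseline_s.keys() & candidate_s.keys())
    return geometric_mean(candidate_s[name] / baseline_s[name] for name in shared)


# Made-up compile times in seconds, for illustration only.
fork_baseline = {"kernel_a": 10.0, "kernel_b": 20.0}
isolated = {"kernel_a": 20.0, "kernel_b": 10.0}

ratio = geomean_ratio(fork_baseline, isolated)
print(f"geomean compile-time ratio vs fork: {ratio:.3f}")
```

A ratio close to 1.0 means no regression on geomean. A ratio near 2.0 would match the plain spawn slowdown reported in #2128.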
Related