Compile-phase subprocess isolation to survive Triton C++ aborts #2531

## Problem

C++ `abort()` calls during Triton compile (e.g. assertion failures in `LinearLayout::reshapeOuts`, or any future internal Triton invariant failure) kill the parent Python process and take down the entire session. The autotuner gets no chance to discard the offending config and try another, and the user sees an unrecoverable crash that is reported as a "Shape Failure" for every affected shape.

Concrete recent example: rope-bwd autotune on H100/B200/MI350X aborts on the `tt.reshape <1xAxBxbf16> -> <AxBxbf16>` pattern emitted from `emit_tl_dot_with_padding`'s squeeze path. Triton's assertion fires, parent dies, all 8 rope-bwd shapes are reported as Shape Failures. PR #2494 worked around the specific reshape, but the underlying robustness gap remains: any other compile-time C++ crash will reproduce the same outcome.

## Proposal

Compile-phase precompile should be isolated from the parent process the same way the benchmark phase already is. The benchmark side (PR #2111, extended in #2487) uses a long-lived spawn worker so a hung or crashing benchmark kernel can be killed without losing autotune progress. The same pattern can be extended to precompile.
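To illustrate why process isolation is the relevant lever, here is a minimal sketch that is not Helion's implementation. It assumes a hypothetical compile step, simulated by a child Python process that calls `os.abort()` for one config. A C-level abort raises no Python exception, so `try`/`except` in the parent cannot catch it, but a parent that runs the step in a child process only sees a failed exit status and can skip that config. The names `CHILD_SNIPPET`, `compile_isolated`, and `surviving_configs` are made up for this example.

```python
import subprocess
import sys

# Hypothetical stand-in for a compile step. The config named "bad" triggers a
# C-level abort, which simulates a C++ assertion failure inside the compiler.
CHILD_SNIPPET = """
import os, sys
if sys.argv[1] == "bad":
    os.abort()  # no Python exception is raised; the process just dies
print("compiled", sys.argv[1])
"""


def compile_isolated(config: str, timeout: float = 60.0) -> bool:
    """Run the simulated compile in a child process; return True on success."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_SNIPPET, config],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # On POSIX a negative return code means the child was killed by a signal
    # (SIGABRT for abort()). Any nonzero code is treated as a failed config.
    return proc.returncode == 0


def surviving_configs(configs: list[str]) -> list[str]:
    """Keep only the configs whose isolated compile succeeded."""
    return [c for c in configs if compile_isolated(c)]


if __name__ == "__main__":
    print(surviving_configs(["a", "bad", "b"]))
```

This version pays a full interpreter startup per config, which is the cost the Constraint section rules out; it only demonstrates the failure-containment property.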

Some prior attempts:
- #2142 catches Python exceptions thrown by the parent during fork-precompile setup, but doesn't help with crashes inside the forked child or with C++ `abort()` from the runtime.
- #2128 / #2287 / #2289 / #2291 / #2297 explored pool-based precompile workers that aimed to give spawn-level isolation at fork-level speed. They did not land because they increased compile time compared to fork.
- The current `autotune_precompile` modes (`fork` default, `spawn`, `None`) trade off: fork is fast but inherits CUDA state; spawn isolates fully but is roughly 2x slower per the #2128 measurements; `None` runs in-process and is what fails today.

## Constraint

Compile time must not regress versus the current `fork` default. Past experiments (see #2128) showed plain spawn is about 2x slower than fork on total compile time. Any new isolation mechanism needs to preserve fork-level speed, e.g. via a long-lived worker pool, reuse of imported modules across compile calls, or amortizing the spawn cost across many configs.
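One way to picture "amortizing the spawn cost across many configs" is a long-lived worker that is only respawned after it dies. The sketch below is an assumption-laden illustration, not the design from #2128 or any existing Helion code: the worker is a plain Python child driven over stdin/stdout, the "compile" is simulated, and `WORKER_SNIPPET`, `CompileWorker`, and `spawn_count` are hypothetical names. A real version would also need a timeout on the reply, which is omitted here.

```python
import subprocess
import sys

# Hypothetical worker loop: expensive imports would happen once at the top,
# then each line on stdin is one config to compile. "bad" simulates an abort.
WORKER_SNIPPET = """
import os, sys
for line in sys.stdin:
    cfg = line.strip()
    if cfg == "bad":
        os.abort()
    print("ok " + cfg, flush=True)
"""


class CompileWorker:
    """Long-lived child process that is restarted only after a crash."""

    def __init__(self) -> None:
        self.proc = None
        self.spawn_count = 0  # how many times startup cost was paid

    def _ensure_alive(self) -> None:
        if self.proc is None or self.proc.poll() is not None:
            self.proc = subprocess.Popen(
                [sys.executable, "-c", WORKER_SNIPPET],
                stdin=subprocess.PIPE,
                stdout=subprocess.PIPE,
                stderr=subprocess.DEVNULL,
                text=True,
            )
            self.spawn_count += 1

    def compile(self, cfg: str) -> bool:
        """Send one config to the worker; False means it crashed on it."""
        self._ensure_alive()
        try:
            self.proc.stdin.write(cfg + "\n")
            self.proc.stdin.flush()
        except OSError:
            self.proc.wait()
            return False
        reply = self.proc.stdout.readline()  # empty string if the worker died
        if reply.strip() == "ok " + cfg:
            return True
        self.proc.wait()  # reap the dead worker so the next call respawns it
        return False

    def close(self) -> None:
        if self.proc is not None and self.proc.poll() is None:
            self.proc.stdin.close()  # EOF ends the worker loop
            self.proc.wait()


if __name__ == "__main__":
    worker = CompileWorker()
    print([worker.compile(c) for c in ["a", "bad", "b", "c"]], worker.spawn_count)
    worker.close()
```

In this sketch four configs cost two process startups instead of four: one initial spawn and one respawn after the simulated abort. Whether that is enough to match fork-level speed in practice is exactly what the earlier pool-based attempts did not achieve.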

## Acceptance criteria

- A compile-time C++ abort in any one config skips that config and lets the process continue with the rest.
- Geomean compile time on the dashboard (https://helionlang.com/dashboard/) does not regress relative to the current `fork` default.
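The second criterion can be checked as a geometric mean of per-kernel compile-time ratios. The following is a small sketch of that arithmetic only; how the dashboard actually computes its geomean is not stated here, and the function name and the timing dictionaries are hypothetical placeholders rather than measured data.

```python
from statistics import geometric_mean


def geomean_compile_ratio(
    baseline_s: dict[str, float], candidate_s: dict[str, float]
) -> float:
    """Geomean of candidate/baseline compile time over the shared kernels.

    A value of 1.0 means no change, above 1.0 means the candidate is slower.
    """
    shared = sorted(baseline_s.keys() & candidate_s.keys())
    if not shared:
        raise ValueError("no kernels in common")
    return geometric_mean(candidate_s[k] / baseline_s[k] for k in shared)


if __name__ == "__main__":
    # Placeholder timings in seconds, for illustration only.
    fork_default = {"kernel_a": 2.0, "kernel_b": 8.0}
    isolated = {"kernel_a": 4.0, "kernel_b": 4.0}
    print(geomean_compile_ratio(fork_default, isolated))
```

With the placeholder values, one kernel is twice as slow and the other twice as fast, so the ratios 2.0 and 0.5 cancel in the geomean.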

## Related

- #2111, #2487: subprocess isolation for the benchmark phase (the pattern to extend).
- #2142: classify-and-skip for exceptions during fork precompile setup.
- #2128: long-lived worker pool precompile (closed).
- #1914: per-config repeated-run check inside the spawn precompile subprocess (open).
- #2494: the helion-side workaround for the rope-bwd reshape crash that motivated this issue.
