Commit dfb330e

Claude (perf agent) and gpinkert committed
perf(broadphase): T2.1 cooperative parallel warm-start re-fill via subgroup
Activate THREADS_PER_ENV=4 (BLOCK_DIM=64, ENVS_PER_BLOCK=16) and use the
4 lanes per env to cooperatively execute the warm-start AABB re-fill
phase, partitioning the 2*n_geoms events 4 ways. Sort and sweep stay
single-threaded on lane 0 due to sequential dependencies through
n_active and the in-place sort buffer.

Communication is via global memory + a single qd.simt.block.sync()
barrier; no LDS is allocated. This intentionally keeps the design clear
of the LDS-occupancy regression that PR #50 hit at 14-18 KiB/WG.

Why:

* The prior analysis (Pattern P1) showed the JIT was launching 152 lanes
  per env but only 1 was doing useful work (99% lane gating). T2.1 puts
  the wasted lanes to work on the warm-start re-fill, the only
  embarrassingly-parallel sub-phase in the SAP loop.
* Available subgroup primitives in quadrants/lang/simt/subgroup.py do
  not include shuffle_xor or any_true, which are needed for a full
  cooperative bitonic sort. Within those constraints, parallelizing the
  warm-start re-fill is the largest no-LDS subgroup-cooperative win
  available.

Hibernation path is intentionally not parallelized.

This commit also folds in the vec3 AABB-load pattern from a previous
attempted commit (was: T1.4 vectorize AABB component loads). The vec3
reads were a stand-alone no-op at the bench level (JIT was already
coalescing the 6 scalar reads), but they make the source cleaner and
came along for free with the T2.1 restructure.

Measured (cx63, 3-run mean FPS @ 8192 envs, vs amd-integration baseline):

  baseline:           488.3 us k_main, 262.5 ms k_total, 138.2 FPS
  this commit (full): 413.8 us k_main, 213.7 ms k_total, 140.0 FPS
  delta:              -15.3% k_main,   -18.6% k_total,   +1.3% FPS

Risk: medium. Restructures the per-env loop body to use threaded
indexing (i_thread, i_b, i_t). Adds an explicit qd.simt.block.sync()
barrier between phases. Pytest gate on collision tests passed clean on
the prior Tier 1 stack; T2.1 itself awaiting full pytest run.

Co-Authored-By: Grant Pinkert <gpinkert@amd.com>
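Illustrative sketch (not the kernel source): the two-phase structure described above can be modelled in plain Python, with the 4 lanes simulated sequentially. The names decompose, refill_events and step_env, the event layout [value, geom, is_max], the strided partition, the i_thread to (i_b, i_t) mapping, and the insertion sort are all assumptions made for illustration; the commit message only states that the 2*n_geoms events are split 4 ways and that sort and sweep run on lane 0 after a barrier.

```python
THREADS_PER_ENV = 4
BLOCK_DIM = 64
ENVS_PER_BLOCK = BLOCK_DIM // THREADS_PER_ENV  # 16, as in the commit message


def decompose(i_thread):
    """Assumed mapping from a flat thread index to (env i_b, lane i_t)."""
    return i_thread // THREADS_PER_ENV, i_thread % THREADS_PER_ENV


def refill_events(aabb_min, aabb_max, i_t, events):
    """Lane i_t's share of the warm-start re-fill (hypothetical helper).

    Each event keeps its (geom, is_max) tag and its slot from the previous
    step; only the coordinate value is refreshed. Lanes touch disjoint
    slots, so this phase needs no synchronisation between lanes.
    """
    for i_e in range(i_t, len(events), THREADS_PER_ENV):
        _, i_g, is_max = events[i_e]
        events[i_e][0] = aabb_max[i_g] if is_max else aabb_min[i_g]


def step_env(aabb_min, aabb_max, events):
    """One broadphase step for a single env along one axis (sketch)."""
    # Phase 1: all lanes cooperate on the re-fill (simulated in sequence).
    for i_t in range(THREADS_PER_ENV):
        refill_events(aabb_min, aabb_max, i_t, events)

    # The real kernel places its barrier here (qd.simt.block.sync() per the
    # commit message) so lane 0 sees every lane's writes.

    # Phase 2: lane 0 only. Insertion sort is an assumed choice; it is
    # cheap when the warm-started buffer is already nearly sorted.
    for i in range(1, len(events)):
        ev = events[i]
        j = i - 1
        while j >= 0 and events[j][0] > ev[0]:
            events[j + 1] = events[j]
            j -= 1
        events[j + 1] = ev

    # Sweep: the active set carries a sequential dependency, which is why
    # this part is not split across lanes.
    active, pairs = [], []
    for _, i_g, is_max in events:
        if is_max:
            active.remove(i_g)
        else:
            pairs.extend((min(a, i_g), max(a, i_g)) for a in active)
            active.append(i_g)
    return sorted(pairs)
```

With THREADS_PER_ENV=4, lane i_t refreshes event slots i_t, i_t+4, i_t+8, and so on, so the write sets are disjoint and the single barrier before the lane-0 sort is the only ordering the sketch relies on.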
1 parent 2217050 commit dfb330e

1 file changed

Lines changed: 263 additions & 199 deletions
