Commit dfb330e
perf(broadphase): T2.1 cooperative parallel warm-start re-fill via subgroup
Activate THREADS_PER_ENV=4 (BLOCK_DIM=64, ENVS_PER_BLOCK=16) and use the
4 lanes per env to cooperatively execute the warm-start AABB re-fill
phase, partitioning the 2*n_geoms events 4 ways. Sort and sweep stay
single-threaded on lane 0 due to sequential dependencies through
n_active and the in-place sort buffer.
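As a rough illustration of the indexing above, the following plain-Python sketch shows one way a flat thread index could be split into an environment index and a lane, and how the 2*n_geoms events could be divided among the 4 lanes. The meaning assigned to i_b and i_t, the strided split, and the helper names decompose and lane_events are assumptions for illustration; the commit message does not state which partitioning the kernel actually uses.

```python
THREADS_PER_ENV = 4
BLOCK_DIM = 64
ENVS_PER_BLOCK = BLOCK_DIM // THREADS_PER_ENV  # 16, matching the commit message


def decompose(i_thread):
    """Map a flat thread index to (i_b, i_t).

    Assumption: i_b is the environment index and i_t is the lane
    within that environment's group of THREADS_PER_ENV threads.
    """
    i_b = i_thread // THREADS_PER_ENV
    i_t = i_thread % THREADS_PER_ENV
    return i_b, i_t


def lane_events(i_t, n_geoms):
    """Event slots handled by lane i_t under an assumed strided split."""
    return list(range(i_t, 2 * n_geoms, THREADS_PER_ENV))


if __name__ == "__main__":
    n_geoms = 5
    for i_t in range(THREADS_PER_ENV):
        print(i_t, lane_events(i_t, n_geoms))
```

Whatever the real split is, it has to cover each of the 2*n_geoms slots exactly once; the strided form shown here keeps neighbouring lanes on neighbouring slots.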
Communication is via global memory + a single qd.simt.block.sync()
barrier; no LDS is allocated. This intentionally keeps the design clear
of the LDS-occupancy regression that PR #50 hit at 14-18 KiB/WG.
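The two-phase structure (cooperative fill, one barrier, then single-lane work) can be mimicked on the CPU as a minimal sketch. Here threading.Barrier stands in for qd.simt.block.sync(), a Python list stands in for the global-memory event buffer, and the (value, geom, is_max) event layout and the function name run_env are hypothetical; none of this is the kernel's actual code.

```python
import threading

THREADS_PER_ENV = 4


def run_env(aabb_min, aabb_max):
    """Simulate one environment: 4 lanes fill events, lane 0 sorts them."""
    n_geoms = len(aabb_min)
    events = [None] * (2 * n_geoms)  # stand-in for a global-memory buffer
    result = {}
    barrier = threading.Barrier(THREADS_PER_ENV)

    def lane(i_t):
        # Phase 1: every lane writes its share of the event slots.
        for slot in range(i_t, 2 * n_geoms, THREADS_PER_ENV):
            geom, is_max = divmod(slot, 2)
            value = aabb_max[geom] if is_max else aabb_min[geom]
            events[slot] = (value, geom, is_max)
        # Single barrier: all writes above complete before anyone continues.
        barrier.wait()
        # Phase 2: sequential work stays on lane 0.
        if i_t == 0:
            result["sorted"] = sorted(events)

    threads = [threading.Thread(target=lane, args=(i,))
               for i in range(THREADS_PER_ENV)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["sorted"]


if __name__ == "__main__":
    print(run_env([0.0, 2.0], [1.0, 3.0]))
```

Without the barrier, lane 0 could start sorting while other lanes are still writing their slots, which is the hazard the explicit sync is there to prevent.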
Why:
* The prior analysis (Pattern P1) showed the JIT was launching 152
lanes per env but only 1 was doing useful work (99% lane gating).
T2.1 puts the wasted lanes to work on the warm-start re-fill, the
only embarrassingly-parallel sub-phase in the SAP loop.
* Available subgroup primitives in quadrants/lang/simt/subgroup.py
do not include shuffle_xor or any_true, which are needed for a full
cooperative bitonic sort. Within those constraints, parallelizing
the warm-start re-fill is the largest no-LDS subgroup-cooperative
win available.
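To show why a cooperative bitonic sort depends on an XOR-style lane exchange, here is a small CPU simulation with one value per lane. At every step each lane needs the value held by lane i ^ j, which is the access pattern a shuffle_xor primitive would provide on a GPU. The function name bitonic_sort_lanes is hypothetical, and this is a textbook bitonic network, not code from the repository.

```python
def bitonic_sort_lanes(values):
    """Sort a power-of-two-length list, simulating one value per lane."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    lanes = list(values)
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            # Each lane reads its partner's value; on a GPU this read
            # is what a shuffle_xor(value, j) would supply.
            partner = [lanes[i ^ j] for i in range(n)]
            nxt = []
            for i in range(n):
                ascending = (i & k) == 0   # direction of this lane's subsequence
                is_lower = (i & j) == 0    # lower index of the compared pair
                keep_min = is_lower == ascending
                a, b = lanes[i], partner[i]
                nxt.append(min(a, b) if keep_min else max(a, b))
            lanes = nxt
            j //= 2
        k *= 2
    return lanes


if __name__ == "__main__":
    print(bitonic_sort_lanes([5, 2, 7, 1]))
```

Every compare-exchange step is a cross-lane read, so without such a primitive the exchange would have to go through LDS or global memory with a barrier per step.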
Hibernation path is intentionally not parallelized.
This commit also folds in the vec3 AABB-load pattern from a previous
attempted commit (was: T1.4 vectorize AABB component loads). The vec3
reads were a stand-alone no-op at the bench level (JIT was already
coalescing the 6 scalar reads), but they make the source cleaner and
came along for free with the T2.1 restructure.
Measured (cx63, 3-run mean FPS @ 8192 envs, vs amd-integration baseline):
  baseline:            488.3 us k_main, 262.5 ms k_total, 138.2 FPS
  this commit (full):  413.8 us k_main, 213.7 ms k_total, 140.0 FPS
  delta:               -15.3% k_main,   -18.6% k_total,   +1.3% FPS
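The delta row can be re-derived from the two measurement rows. The short check below uses only the numbers quoted above; the helper name pct_change is illustrative.

```python
def pct_change(baseline, new):
    """Relative change from baseline to new, in percent."""
    return 100.0 * (new - baseline) / baseline


k_main_delta = pct_change(488.3, 413.8)   # kernel time, us
k_total_delta = pct_change(262.5, 213.7)  # total kernel time, ms
fps_delta = pct_change(138.2, 140.0)      # frames per second

if __name__ == "__main__":
    print(f"k_main:  {k_main_delta:+.1f}%")
    print(f"k_total: {k_total_delta:+.1f}%")
    print(f"FPS:     {fps_delta:+.1f}%")
```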
Risk: medium. Restructures the per-env loop body to use threaded
indexing (i_thread, i_b, i_t). Adds an explicit qd.simt.block.sync()
barrier between phases. The pytest gate on collision tests passed clean
on the prior Tier 1 stack; T2.1 itself is still awaiting a full pytest run.
Co-Authored-By: Grant Pinkert <gpinkert@amd.com>
1 parent 2217050 commit dfb330e
1 file changed
Lines changed: 263 additions & 199 deletions