Commit 13dda1f

Claude (perf agent) authored and gpinkert committed
perf(broadphase): cooperative parallel warm-start AABB re-fill
Problem: the broadphase kernel was launching ~152 lanes per env, but each env's work was fundamentally serial -- only one lane did useful work and the rest gated off immediately. ~99 % of dispatched lanes were idle for the full lifetime of the kernel.

The warm-start AABB re-fill at the top of every broadphase step is the one sub-stage that is embarrassingly parallel: it re-reads aabb_min or aabb_max for each of the 2*n_geoms events into sort_buffer.value, with no inter-event dependency. The downstream insertion sort and SAP sweep have true sequential dependencies (n_active evolves left-to-right; the sort buffer is updated in place) and resist parallelization without LDS or atomics.

Fix: activate 4 threads per env on GPU backends and use them to run the warm-start re-fill cooperatively, partitioning the events four ways across lanes. The sort and sweep stay single-threaded on lane 0. A wave-wide barrier between the parallel re-fill and the serial sort ensures all writes are visible before they are read.

Implementation:

* Restructure func_broad_phase's outer loop from `for i_b in range(_B)` to `for i_thread in range(_B * THREADS_PER_ENV)`, with i_b (env index) and i_t (thread-in-env) derived from i_thread.
* On GPU backends, set THREADS_PER_ENV=4 and BLOCK_DIM=64 (= 1 wave on AMDGPU = 16 envs x 4 lanes per workgroup). A static config check (qd.static(backend != gs.cpu)) gates this. On the CPU backend (Arch.x64), THREADS_PER_ENV statically collapses to 1 and the cooperative re-fill becomes equivalent to the original serial loop; the qd.simt.block.sync barrier is also gated out, since CPU does not support it.
* All 4 lanes redundantly compute env_n_geoms (cheap; n_links is small, and after the first lane the reads hit cache).
* Each lane handles a contiguous slice of indices [i_t * N / 4, (i_t+1) * N / 4), where N = 2*env_n_geoms is the number of events. Writes are disjoint by construction, so no atomics are needed.
* qd.simt.block.sync() acts as the wave-wide barrier between the parallel re-fill and the lane-0 sort.
* Lane 0 then performs all the serial work: contact-clear, equality bound hoist, first-time-only event-buffer init, insertion sort, and the SAP sweep that emits broad pairs.
* Communication is via global memory only -- no LDS is allocated. This is intentional: previous attempts to add LDS to the broadphase in this codebase have shown that LDS allocation can crash async overlap of broadphase with downstream kernels and produce a net end-to-end regression even when the broadphase kernel itself gets faster.

Performance: saves ~35 ms of broadphase kernel time per 500-step run at 8192 envs (about -14 % of total broadphase kernel time, the dominant single-commit win in this stack). Drives most of the end-to-end FPS gain at 8192 envs.
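
For illustration only, the lane partitioning described under Implementation can be simulated in plain Python. This is a sketch, not the kernel code: the helper names (`split_thread_index`, `lane_slice`, `cooperative_refill`) are hypothetical, the `(i_geom, is_max)` event layout is an assumption, and the `//` and `%` mapping from i_thread to (i_b, i_t) is one plausible choice that the commit message does not spell out.

```python
THREADS_PER_ENV = 4  # value the commit uses on GPU backends; collapses to 1 on CPU


def split_thread_index(i_thread, threads_per_env=THREADS_PER_ENV):
    """Map a flat thread index to (env index i_b, thread-in-env i_t).

    Assumption: the lanes of one env are adjacent in the flat index space.
    """
    return i_thread // threads_per_env, i_thread % threads_per_env


def lane_slice(i_t, n_events, threads_per_env=THREADS_PER_ENV):
    """Half-open slice [i_t * N / T, (i_t + 1) * N / T) with integer division.

    Adjacent lanes share a boundary, so the slices are disjoint and
    together cover every index in range(n_events).
    """
    start = i_t * n_events // threads_per_env
    stop = (i_t + 1) * n_events // threads_per_env
    return start, stop


def cooperative_refill(aabb_min, aabb_max, events, threads_per_env=THREADS_PER_ENV):
    """Simulate the warm-start re-fill for one env.

    `events` is a list of (i_geom, is_max) pairs (hypothetical layout).
    Returns the refreshed per-event values.
    """
    values = [None] * len(events)
    # On the GPU these iterations would run concurrently, one per lane.
    for i_t in range(threads_per_env):
        start, stop = lane_slice(i_t, len(events), threads_per_env)
        for i in range(start, stop):
            i_geom, is_max = events[i]
            values[i] = aabb_max[i_geom] if is_max else aabb_min[i_geom]
    # In the kernel, a barrier sits here before lane 0 starts the serial sort.
    return values
```

With `threads_per_env=1` the single slice is the whole range, which mirrors the statement that the CPU path is equivalent to the original serial loop.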
1 parent 3d4da59 commit 13dda1f

1 file changed

Lines changed: 264 additions & 200 deletions
