Commit 68dd67c

Claude (perf agent) authored and yaoliu13 committed
perf(broadphase): cooperative parallel warm-start AABB re-fill
Problem: the broadphase kernel was launching many more lanes per env than were doing useful work. Most of the per-env work is fundamentally serial (only one lane has work to do; the rest gate off immediately after the launcher dispatches them). The warm-start AABB re-fill at the top of every broadphase step is the one sub-stage that is embarrassingly parallel: it re-reads aabb_min or aabb_max for each of 2*n_geoms events into sort_buffer.value, with no inter-event dependency. The downstream insertion sort and SAP sweep have true sequential dependencies (n_active evolves left-to-right; the sort buffer is updated in place) and resist parallelization without LDS or atomics.

Fix: activate 4 threads per env on GPU backends and use them to cooperatively run the warm-start re-fill, partitioning the events 4-ways across lanes. The sort and sweep stay single-threaded on lane 0. A wave-wide barrier between the parallel re-fill and the serial sort ensures all writes are visible before they are read.

Implementation:

* Restructure func_broad_phase's outer loop from `for i_b in range(_B)` to `for i_thread in range(_B * THREADS_PER_ENV)`, with i_b and i_t (thread-in-env) derived from i_thread.
* On GPU backends, set THREADS_PER_ENV=4 and BLOCK_DIM=64 (= 1 wave on AMDGPU = 16 envs x 4 lanes per workgroup). Use a static config check (qd.static(backend != gs.cpu)) to gate this. On CPU backend (Arch.x64), THREADS_PER_ENV statically collapses to 1 and the cooperative re-fill becomes equivalent to the original serial loop; the qd.simt.block.sync barrier is also gated out since CPU does not support it.
* All 4 lanes redundantly compute env_n_geoms (cheap; n_links is small, and after the first lane the reads hit cache).
* Each lane handles a contiguous slice of indices [i_t * N / 4, (i_t+1) * N / 4) of the 2*env_n_geoms events. Writes are disjoint by construction, so no atomics are needed.
* qd.simt.block.sync() acts as the wave-wide barrier between the parallel re-fill and the lane-0 sort.
* Lane 0 then performs all the serial work: contact-clear, equality bound hoist, first-time-only event-buffer init, insertion sort, and the SAP sweep that emits broad pairs.
* Communication is via global memory only -- no LDS is allocated. This is intentional: previous attempts to add LDS to the broadphase in this codebase have shown that LDS allocation can crash async overlap of broadphase with downstream kernels and produce a net end-to-end regression even when the broadphase kernel itself gets faster.
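
The index mapping and the two-phase structure described above can be illustrated with a small plain-Python sketch. This is not the kernel from this commit. It uses Python threads and threading.Barrier as stand-ins for GPU lanes and the block barrier, works on a single env with one-axis AABBs, and assumes that the lanes of one env are adjacent in the flat thread index. The helper names (decompose, lane_slice, refill_then_sort) and the parallel-list layout of the sort buffer are hypothetical, chosen only for illustration.

```python
import threading

THREADS_PER_ENV = 4  # GPU value from the commit message; 1 on the CPU backend


def decompose(i_thread, threads_per_env=THREADS_PER_ENV):
    # Assumed layout: the lanes of one env sit next to each other in the
    # flat index, so i_b is the quotient and i_t is the remainder.
    return i_thread // threads_per_env, i_thread % threads_per_env


def lane_slice(i_t, n_events, threads_per_env=THREADS_PER_ENV):
    # Half-open range [i_t * N / T, (i_t + 1) * N / T) with integer division.
    # Consecutive lanes share endpoints, so the slices are disjoint and
    # together cover every event exactly once.
    return (i_t * n_events // threads_per_env,
            (i_t + 1) * n_events // threads_per_env)


def refill_then_sort(aabb_min, aabb_max, sort_geom, sort_is_max, sort_value):
    n_events = len(sort_geom)
    # Stand-in for the GPU barrier: every lane must arrive before any proceeds.
    barrier = threading.Barrier(THREADS_PER_ENV)

    def lane(i_t):
        # Phase 1 (parallel): each lane re-reads the bounds for its own slice.
        start, stop = lane_slice(i_t, n_events)
        for i in range(start, stop):
            g = sort_geom[i]
            sort_value[i] = aabb_max[g] if sort_is_max[i] else aabb_min[g]
        barrier.wait()  # all re-fill writes are done before the sort reads them
        # Phase 2 (serial): only lane 0 runs the in-place insertion sort.
        if i_t == 0:
            for i in range(1, n_events):
                key = (sort_value[i], sort_geom[i], sort_is_max[i])
                j = i - 1
                while j >= 0 and sort_value[j] > key[0]:
                    sort_value[j + 1] = sort_value[j]
                    sort_geom[j + 1] = sort_geom[j]
                    sort_is_max[j + 1] = sort_is_max[j]
                    j -= 1
                sort_value[j + 1], sort_geom[j + 1], sort_is_max[j + 1] = key

    threads = [threading.Thread(target=lane, args=(t,))
               for t in range(THREADS_PER_ENV)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


# Three geoms on one axis; the event order is left over from a previous step
# and the stored values are stale, which is what the warm start re-fills.
aabb_min = [0.0, 2.5, 1.0]
aabb_max = [1.5, 3.0, 2.0]
sort_geom = [0, 0, 2, 2, 1, 1]
sort_is_max = [False, True, False, True, False, True]
sort_value = [-1.0] * 6

refill_then_sort(aabb_min, aabb_max, sort_geom, sort_is_max, sort_value)
print(sort_value)
print(sort_geom)
```

Because the event order carries over from the previous step, the insertion sort typically has little to move, which is why only the re-fill is worth spreading across lanes. With THREADS_PER_ENV set to 1, lane_slice returns the full range and the sketch degenerates to the serial loop, mirroring the CPU behaviour described in the message.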
1 parent 7c09468 commit 68dd67c

1 file changed

Lines changed: 265 additions & 201 deletions
