Commit 68dd67c
perf(broadphase): cooperative parallel warm-start AABB re-fill
Problem: the broadphase kernel was launching many more lanes per env
than were doing useful work. Most of the per-env work is fundamentally
serial (only one lane has work to do; the rest gate off immediately
after the launcher dispatches them).
The warm-start AABB re-fill at the top of every broadphase step is
the one sub-stage that is embarrassingly parallel: it re-reads
aabb_min or aabb_max for each of 2*n_geoms events into
sort_buffer.value, with no inter-event dependency. The downstream
insertion sort and SAP sweep have true sequential dependencies
(n_active evolves left-to-right; the sort buffer is updated in place)
and resist parallelization without LDS or atomics.
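To make the independence claim concrete, here is a minimal plain-Python sketch, not the kernel code. The single sweep axis and the per-event `(i_g, is_max)` layout are assumptions for illustration; only `aabb_min`, `aabb_max`, and the per-event value come from the description above. Visiting the events in any order yields the same re-filled values, which is what makes this stage safe to split across lanes.

```python
import random

# Hypothetical data for one env along one sweep axis.
aabb_min = [0.0, 2.5, 1.0]
aabb_max = [1.5, 4.0, 3.0]

# Hypothetical event layout: (geom index, is_max). The ordering is whatever
# the previous step's sort left behind.
events = [(0, False), (2, False), (0, True), (1, False), (2, True), (1, True)]


def refill(order):
    """Re-read the current bound for every event, visiting them in `order`."""
    value = [None] * len(events)
    for i in order:
        i_g, is_max = events[i]
        # The write for event i reads only event i's own geom bounds,
        # never another event's value.
        value[i] = aabb_max[i_g] if is_max else aabb_min[i_g]
    return value


in_order = refill(range(len(events)))

shuffled = list(range(len(events)))
random.shuffle(shuffled)
out_of_order = refill(shuffled)
```

An in-place insertion sort has no such freedom: where element j lands depends on the positions already settled for elements 0..j-1.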
Fix: activate 4 threads per env on GPU backends and use them to
cooperatively run the warm-start re-fill, partitioning the events
4-ways across lanes. The sort and sweep stay single-threaded on
lane 0. A wave-wide barrier between the parallel re-fill and the
serial sort ensures all writes are visible before they are read.
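The two-phase shape of the fix can be imitated on the CPU with four Python threads and `threading.Barrier`. This is only an analogy under stated assumptions: the data, the `(i_g, is_max)` event layout, and the `lane` function are hypothetical, and a Python barrier stands in for the GPU barrier. It shows the ordering guarantee the fix relies on: every lane finishes its slice before lane 0 starts sorting.

```python
import threading

THREADS_PER_ENV = 4  # same lane count the commit uses on GPU backends

# Hypothetical single-env data along one sweep axis.
aabb_min = [3.0, 0.0, 1.0, 5.0]
aabb_max = [4.0, 2.0, 6.0, 7.0]
events = [(i_g, is_max) for i_g in range(4) for is_max in (False, True)]

value = [None] * len(events)   # stands in for sort_buffer.value
sorted_events = []             # filled by lane 0 only
barrier = threading.Barrier(THREADS_PER_ENV)


def lane(i_t):
    n = len(events)
    # Phase 1: each lane re-fills its own contiguous, disjoint slice.
    lo = i_t * n // THREADS_PER_ENV
    hi = (i_t + 1) * n // THREADS_PER_ENV
    for i in range(lo, hi):
        i_g, is_max = events[i]
        value[i] = aabb_max[i_g] if is_max else aabb_min[i_g]

    # All lanes wait here, so every write above has completed
    # before any lane moves on.
    barrier.wait()

    # Phase 2: the serial work runs on lane 0 alone.
    if i_t == 0:
        order = list(range(n))
        for j in range(1, n):  # insertion sort of event indices by value
            k = order[j]
            m = j - 1
            while m >= 0 and value[order[m]] > value[k]:
                order[m + 1] = order[m]
                m -= 1
            order[m + 1] = k
        sorted_events.extend(order)


threads = [threading.Thread(target=lane, args=(i_t,))
           for i_t in range(THREADS_PER_ENV)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the barrier, lane 0 could begin sorting while other lanes were still writing their slices.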
Implementation:
* Restructure func_broad_phase's outer loop from
`for i_b in range(_B)` to `for i_thread in range(_B * THREADS_PER_ENV)`,
with i_b and i_t (thread-in-env) derived from i_thread.
* On GPU backends, set THREADS_PER_ENV=4 and BLOCK_DIM=64 (one wave
on AMDGPU, i.e. 16 envs x 4 lanes per workgroup). A static config
check (qd.static(backend != gs.cpu)) gates this. On the CPU backend
(Arch.x64), THREADS_PER_ENV statically collapses to 1 and the
cooperative re-fill becomes equivalent to the original serial loop;
the qd.simt.block.sync barrier is also gated out since the CPU
backend does not support it.
* All 4 lanes redundantly compute env_n_geoms (cheap; n_links is
small, and after the first lane the reads hit cache).
* Each lane handles a contiguous slice of indices [i_t * N / 4,
(i_t+1) * N / 4), where N = 2*env_n_geoms is the number of events.
Writes are disjoint by construction, so no atomics are needed.
* qd.simt.block.sync() acts as the wave-wide barrier between the
parallel re-fill and the lane-0 sort.
* Lane 0 then performs all the serial work: contact-clear, equality
bound hoist, first-time-only event-buffer init, insertion sort,
and the SAP sweep that emits broad pairs.
* Communication is via global memory only -- no LDS is allocated.
This is intentional: previous attempts to add LDS to the broadphase
in this codebase have shown that LDS allocation can crash async
overlap of broadphase with downstream kernels and produce a net
end-to-end regression even when the broadphase kernel itself gets
faster.
1 parent 7c09468 commit 68dd67c
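The index arithmetic in the implementation notes can be checked in isolation with a short plain-Python sketch. Two points are assumptions, since the message does not spell them out: that `i_b` and `i_t` come from floor division and remainder (so one env's lanes are adjacent in the flat index), and that the slice bounds multiply before dividing with integer floor division. `decompose`, `lane_slice`, and `covered` are hypothetical helper names.

```python
THREADS_PER_ENV = 4
BLOCK_DIM = 64

# With one 64-lane workgroup and 4 lanes per env, each workgroup serves 16 envs.
envs_per_workgroup = BLOCK_DIM // THREADS_PER_ENV


def decompose(i_thread):
    """Assumed mapping from the flat thread index to (i_b, i_t)."""
    return i_thread // THREADS_PER_ENV, i_thread % THREADS_PER_ENV


def lane_slice(i_t, n_events):
    """Half-open slice [lo, hi) of events owned by lane i_t."""
    lo = i_t * n_events // THREADS_PER_ENV
    hi = (i_t + 1) * n_events // THREADS_PER_ENV
    return lo, hi


def covered(n_events):
    """Concatenate every lane's slice, in lane order."""
    indices = []
    for i_t in range(THREADS_PER_ENV):
        lo, hi = lane_slice(i_t, n_events)
        indices.extend(range(lo, hi))
    return indices
```

Because lane i_t's upper bound is the same expression as lane i_t+1's lower bound, the slices tile [0, N) with no gaps or overlaps, even when N is not a multiple of 4. For N = 6, the lanes get [0, 1), [1, 3), [3, 4), and [4, 6).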
1 file changed
Lines changed: 265 additions & 201 deletions