Commit 13dda1f
perf(broadphase): cooperative parallel warm-start AABB re-fill
Problem: the broadphase kernel launched ~152 lanes per env, but each
env's work was fundamentally serial -- only one lane did useful work
and the rest gated off immediately. ~99 % of dispatched lanes were
idle for the full lifetime of the kernel.
The warm-start AABB re-fill at the top of every broadphase step is the
one sub-stage that is embarrassingly parallel: it re-reads aabb_min or
aabb_max for each of 2*n_geoms events into sort_buffer.value, with no
inter-event dependency. The downstream insertion sort and SAP sweep
have true sequential dependencies (n_active evolves left-to-right; the
sort buffer is updated in place) and resist parallelization without
LDS or atomics.
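
To make the contrast concrete, here is a minimal pure-Python sketch of the two stages along a single axis. It is not the kernel code: the event layout `[value, i_geom, is_max]` and the helper names `refill`, `insertion_sort`, and `make_events` are hypothetical stand-ins for `sort_buffer` and the real routines. The sketch shows that the re-fill gives the same result in any visiting order, whereas each insertion-sort step depends on all the steps before it.

```python
import random

def refill(events, aabb_min, aabb_max, order):
    """Warm-start re-fill: refresh each event's value from its geom's AABB.

    Event i reads only its own geom's bounds and writes only events[i][0],
    so the visiting order (and hence the lane assignment) does not matter.
    """
    for i in order:
        _, i_geom, is_max = events[i]
        events[i][0] = aabb_max[i_geom] if is_max else aabb_min[i_geom]

def insertion_sort(events):
    """In-place insertion sort by value; step i relies on steps 0..i-1."""
    for i in range(1, len(events)):
        key = events[i]
        j = i - 1
        while j >= 0 and events[j][0] > key[0]:
            events[j + 1] = events[j]
            j -= 1
        events[j + 1] = key

def make_events(n_geoms):
    # Two events per geom (min and max); 0.0 stands in for a stale value.
    return [[0.0, g, bool(k)] for g in range(n_geoms) for k in (0, 1)]

# Illustrative 1-D bounds for three geoms (made-up numbers).
aabb_min = [0.0, 2.0, 1.0]
aabb_max = [1.5, 3.0, 2.5]

forward = make_events(3)
refill(forward, aabb_min, aabb_max, range(len(forward)))

shuffled_order = list(range(6))
random.Random(0).shuffle(shuffled_order)
shuffled = make_events(3)
refill(shuffled, aabb_min, aabb_max, shuffled_order)

same_result = forward == shuffled           # re-fill is order-independent
insertion_sort(forward)                     # sort is inherently sequential
sorted_values = [e[0] for e in forward]
```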
Fix: activate 4 threads per env on GPU backends and use them to
cooperatively run the warm-start re-fill, partitioning the events
4 ways across lanes. The sort and sweep stay single-threaded on lane 0.
A wave-wide barrier between the parallel re-fill and the serial sort
ensures all writes are visible before they are read.
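
The shape of this fix can be sketched on the CPU with Python threads standing in for GPU lanes and `threading.Barrier` standing in for the block-level barrier. This is only an analogy under stated assumptions: the function `broad_phase_env` and the event layout `[value, i_geom, is_max]` are hypothetical, and only `THREADS_PER_ENV = 4` comes from the commit.

```python
import threading

THREADS_PER_ENV = 4  # value from the commit; the rest is illustrative

def broad_phase_env(events, aabb_min, aabb_max):
    """Simulate one env: parallel re-fill, barrier, then serial sort on lane 0."""
    n_events = len(events)
    barrier = threading.Barrier(THREADS_PER_ENV)

    def lane(i_t):
        # Parallel stage: each lane refreshes its own contiguous slice.
        lo = i_t * n_events // THREADS_PER_ENV
        hi = (i_t + 1) * n_events // THREADS_PER_ENV
        for i in range(lo, hi):
            _, i_geom, is_max = events[i]
            events[i][0] = aabb_max[i_geom] if is_max else aabb_min[i_geom]

        # Every lane must finish writing before lane 0 reads the buffer.
        barrier.wait()

        # Serial stage: only lane 0 continues; the other lanes are done.
        if i_t == 0:
            for i in range(1, n_events):
                key = events[i]
                j = i - 1
                while j >= 0 and events[j][0] > key[0]:
                    events[j + 1] = events[j]
                    j -= 1
                events[j + 1] = key

    threads = [threading.Thread(target=lane, args=(i_t,))
               for i_t in range(THREADS_PER_ENV)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events

# Three geoms, two events each; 0.0 stands in for a stale value.
demo_events = [[0.0, g, bool(k)] for g in range(3) for k in (0, 1)]
result = broad_phase_env(demo_events, [0.0, 2.0, 1.0], [1.5, 3.0, 2.5])
result_values = [e[0] for e in result]
```

Without the `barrier.wait()` call, lane 0 could begin sorting while other lanes are still writing their slices, which is the hazard the barrier in the kernel guards against.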
Implementation:
* Restructure func_broad_phase's outer loop from
`for i_b in range(_B)` to `for i_thread in range(_B * THREADS_PER_ENV)`,
with i_b and i_t (thread-in-env) derived from i_thread.
* On GPU backends, set THREADS_PER_ENV=4 and BLOCK_DIM=64 (= 1 wave
on AMDGPU = 16 envs x 4 lanes per workgroup). Use a static config
check (qd.static(backend != gs.cpu)) to gate this. On the CPU backend
(Arch.x64), THREADS_PER_ENV statically collapses to 1 and the
cooperative re-fill becomes equivalent to the original serial loop;
the qd.simt.block.sync barrier is also gated out since CPU does not
support it.
* All 4 lanes redundantly compute env_n_geoms (cheap; n_links is small,
and after the first lane the reads hit cache).
* Each lane handles a contiguous slice of indices [i_t * N / 4,
(i_t+1) * N / 4) of the events, where N = 2*env_n_geoms is the
event count. Writes are disjoint by construction, so no atomics
are needed.
* qd.simt.block.sync() acts as the wave-wide barrier between the
parallel re-fill and the lane-0 sort.
* Lane 0 then performs all the serial work: contact-clear, equality
bound hoist, first-time-only event-buffer init, insertion sort,
and the SAP sweep that emits broad pairs.
* Communication is via global memory only -- no LDS is allocated.
This is intentional: previous attempts to add LDS to the broadphase
in this codebase have shown that LDS allocation can break async
overlap of broadphase with downstream kernels and produce a net
end-to-end regression even when the broadphase kernel itself gets
faster.
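
The index arithmetic in the list above can be checked with a short plain-Python sketch. The commit only says that i_b and i_t are derived from i_thread; the floor-divide/modulo mapping below is an assumption about how, and the helpers `split_thread_index`, `lane_slice`, and `check_partition` are hypothetical names. The slice bounds use integer division so that the slices tile the range exactly even when N is not a multiple of 4.

```python
THREADS_PER_ENV = 4   # GPU value from the commit
BLOCK_DIM = 64        # GPU value from the commit

def split_thread_index(i_thread, threads_per_env=THREADS_PER_ENV):
    """Assumed mapping: consecutive threads belong to the same env."""
    i_b = i_thread // threads_per_env   # env index
    i_t = i_thread % threads_per_env    # thread-in-env index
    return i_b, i_t

def lane_slice(i_t, n_events, threads_per_env=THREADS_PER_ENV):
    """Half-open slice [lo, hi) of event indices owned by lane i_t."""
    lo = i_t * n_events // threads_per_env
    hi = (i_t + 1) * n_events // threads_per_env
    return lo, hi

def check_partition(n_events, threads_per_env=THREADS_PER_ENV):
    """True if the lane slices cover 0..n_events-1 exactly once, in order."""
    covered = []
    for i_t in range(threads_per_env):
        lo, hi = lane_slice(i_t, n_events, threads_per_env)
        covered.extend(range(lo, hi))
    return covered == list(range(n_events))

envs_per_block = BLOCK_DIM // THREADS_PER_ENV           # 16 envs x 4 lanes
slices_for_10 = [lane_slice(i_t, 10) for i_t in range(THREADS_PER_ENV)]
cpu_slice = lane_slice(0, 10, threads_per_env=1)        # CPU: one lane does all
all_partitions_ok = all(check_partition(n) for n in range(50))
```

With a single lane per env the slice degenerates to the whole range, which matches the statement that the CPU path is equivalent to the original serial loop.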
Performance: saves ~35 ms of broadphase kernel time per 500-step run
at 8192 envs (about 14 % of total broadphase kernel time, the
largest single-commit win in this stack). Drives most of the
end-to-end FPS gain at 8192 envs.
1 parent 3d4da59 commit 13dda1f
1 file changed
Lines changed: 264 additions & 200 deletions