perf(abd): skip j_pos/j_quat forward pass, row-major rank-1 update, widen constraint grid launches #65
Open
peizhang56 wants to merge 5 commits into
Author
|
/run-ci |
5016e34 to
628a6ce
Compare
Author
|
/run-ci |
Author
|
/run-ci |
8569f3b to
ae7b8de
Compare
Collaborator
|
/run-ci |
Restructures four grid-starved kernels in the AMDGPU constraint solver
init path. Each was launching too few workgroups to fill the MI300X CU
array because its only parallel axis was the per-env batch. This commit
widens the launch geometry of each kernel via a different lever; the
per-element work in each kernel body is unchanged.
1. tiled-wc block-shape constants (solver_amdgpu.py)
     _TWC_BLOCK_DIM       64 -> 128
     _TWC_ENVS_PER_BLOCK   8 -> 16
   Doubles threads-per-block and envs-per-block for the tiled
   wave-cooperative variant, so each block does more work and the
   per-env constraint loops draw from a larger lane pool.

2. initialize_Jaref (solver.py)
   Was a 1D loop `for i_b in range(_B)` with an inner serial
   `for i_c in range(n_constraints[i_b])`. Rewritten as a 2D ndrange
   `for i_c, i_b in qd.ndrange(len_constraints, _B)` with an
   `if i_c >= n_constraints[i_b]: continue` guard for the ragged
   tail. Grid width grows from _B to len_constraints * _B.

3. CG mass-solve in func_update_gradient_tiled (kernel_8)
   When hibernation is disabled (compile-time-known via the
   `use_hibernation` template flag), the inner serial loop over
   entities is promoted into the parallel grid: the old
   `for i_b in range(_B)` calling `func_solve_mass_batch` is
   replaced by `for i_e, i_b in qd.ndrange(n_entities_, _B)` calling
   `func_solve_mass_entity` directly. The entity body is already
   guarded by `mass_mat_mask[i_e, i_b]`, so zero-DOF entities (e.g.
   Plane) become near-no-op threads -- same total work, wider
   dispatch. Hibernation path keeps the original 1D form because
   n_awake_entities is dynamic.

4. Dense qfrc gather kernel_5 (solver.py)
   The dense `qfrc_constraint = J^T @ efc_force` gather was nested
   inside `func_update_constraint_batch`'s per-env loop. Split into
   a new `func_update_qfrc_constraint_dense` 2D kernel over
   (n_dofs, _B). Routing controlled by a new `defer_dense_qfrc`
   template flag on `func_update_constraint_batch`:
     - True (set by func_update_constraint init caller): skip the
       inline dense gather, follow up with the 2D kernel.
     - False (set by func_solve_iter and
       func_solve_iter_post_linesearch per-iter callers):
       keep the gather inline. Those callers run inside a
       per-env loop with no follow-up dispatch site, so
       deferring there would leave qfrc_constraint stale and
       NaN out the gradient.
   Sparse path is unchanged -- its scatter would race on
   `qfrc_constraint[i_d, i_b]` without atomic-add, so it stays
   inside the 1D per-env loop.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
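The 1D-to-2D rewrite described in item 2 can be illustrated in plain Python. This is only a sketch of the loop restructuring, not the solver code: `toy_body`, `fill_1d`, and `fill_2d` are hypothetical names, the body is a stand-in for the real per-constraint work, and `np.ndindex` plays the role of `qd.ndrange`.

```python
import numpy as np


def toy_body(i_c, i_b):
    # Stand-in for the real per-constraint computation (assumption).
    return 10 * i_b + i_c + 1


def fill_1d(n_constraints, len_constraints):
    """Old shape: one work item per env, serial loop over its constraints."""
    n_envs = len(n_constraints)
    out = np.zeros((len_constraints, n_envs))
    for i_b in range(n_envs):                  # parallel axis: n_envs items
        for i_c in range(n_constraints[i_b]):  # serial inside each item
            out[i_c, i_b] = toy_body(i_c, i_b)
    return out


def fill_2d(n_constraints, len_constraints):
    """New shape: one work item per (constraint, env) pair, with a guard."""
    n_envs = len(n_constraints)
    out = np.zeros((len_constraints, n_envs))
    for i_c, i_b in np.ndindex(len_constraints, n_envs):
        if i_c >= n_constraints[i_b]:          # ragged tail: nothing to do
            continue
        out[i_c, i_b] = toy_body(i_c, i_b)
    return out


n_constraints = [3, 0, 5, 2]   # active constraints per env (example values)
len_constraints = 5            # padded upper bound on constraints per env
assert np.array_equal(fill_1d(n_constraints, len_constraints),
                      fill_2d(n_constraints, len_constraints))
```

Both forms write the same entries; the 2D form exposes `len_constraints * n_envs` work items instead of `n_envs`, at the cost of guarded no-op items in the ragged tail.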
Bump _TWC_BLOCK_DIM 128->256 and _TWC_ENVS_PER_BLOCK 16->32 in
_kernel_solve_body_tiled_wc_amdgpu. Doubles L2 cache reuse per
workgroup with unchanged total launched waves. BLOCK_DIM > 256 is
gated by the Quadrants 100-iter unroll cap on the
qd.static(range(BLOCK_DIM)) act_red reduction.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
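The "unchanged total launched waves" claim follows from the fact that every block shape in this PR keeps the same ratio of threads to envs (64/8 = 128/16 = 256/32 = 8). A minimal arithmetic sketch, assuming a 64-lane wavefront and using 8192 envs purely as an example batch size; `launch_geometry` is a hypothetical helper, not part of the solver:

```python
WAVE_SIZE = 64  # assumed wavefront width (CDNA-class AMD GPUs)


def launch_geometry(n_envs, block_dim, envs_per_block):
    """Return (workgroups, threads, waves) for one tiled-wc launch."""
    n_blocks = -(-n_envs // envs_per_block)        # ceiling division
    threads = n_blocks * block_dim
    waves = n_blocks * (block_dim // WAVE_SIZE)
    return n_blocks, threads, waves


n_envs = 8192  # example value only; the real count is _B
for block_dim, envs_per_block in [(64, 8), (128, 16), (256, 32)]:
    print(block_dim, envs_per_block,
          launch_geometry(n_envs, block_dim, envs_per_block))
```

For this example batch, the workgroup count halves at each step (1024, 512, 256) while the thread and wave totals stay fixed, so each workgroup covers more envs without changing the amount of launched work.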
npoulad1
approved these changes
Jun 3, 2026
Collaborator
|
/run-ci |
ae7b8de to
26932d0
Compare
26932d0 to
234f8b2
Compare
Skip the per-link joint pose pass (Pass 6 of func_com_links_split and
the equivalent block in func_COM_links_entity) when
requires_grad=False. The j_pos/j_quat values in links_state are only
consumed by the backward adjoint cache (func_copy_cartesian_space in
diff.py) and are dead work in pure forward simulation. The guard is a
qd.static branch keyed on requires_grad, so the skip is resolved at
compile time with no runtime overhead.

Replace the flat-index rank-1 update in the tiled Cholesky
(func_factor_mass) with a row-major nested loop. The previous
implementation decoded each flat triangle index to (row, col) via a
sqrt + integer correction on every iteration of the update loop. The
new loop strides threads across rows and iterates columns sequentially
within each row, eliminating all sqrt calls. The pivot row value
sh_pivot[_r] is loaded once into a VGPR and reused across the inner
column loop, reducing LDS reads for the pivot row.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
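The difference between the two rank-1 update loops can be sketched in plain Python. This is an illustration under assumptions, not the code in func_factor_mass: `decode_tri`, `rank1_flat`, and `rank1_row_major` are hypothetical names, the update is applied to the lower triangle of a small dense matrix, and the loops run serially rather than being strided across GPU threads.

```python
import math
import numpy as np


def decode_tri(t):
    """Map a flat lower-triangle index t to (row, col) via sqrt + correction."""
    row = int((math.sqrt(8.0 * t + 1.0) - 1.0) * 0.5)
    # Integer correction guards against floating-point rounding of the sqrt.
    while row * (row + 1) // 2 > t:
        row -= 1
    while (row + 1) * (row + 2) // 2 <= t:
        row += 1
    return row, t - row * (row + 1) // 2


def rank1_flat(a, pivot):
    """Old shape: one flat loop, decoding (row, col) on every iteration."""
    m = len(pivot)
    out = a.copy()
    for t in range(m * (m + 1) // 2):
        r, c = decode_tri(t)
        out[r, c] -= pivot[r] * pivot[c]
    return out


def rank1_row_major(a, pivot):
    """New shape: outer loop over rows, sequential columns, no sqrt."""
    m = len(pivot)
    out = a.copy()
    for r in range(m):            # on the GPU, rows are strided across threads
        p_r = pivot[r]            # loaded once, reused for the whole row
        for c in range(r + 1):
            out[r, c] -= p_r * pivot[c]
    return out


rng = np.random.default_rng(0)
a = rng.standard_normal((6, 6))
pivot = rng.standard_normal(6)
assert np.array_equal(rank1_flat(a, pivot), rank1_row_major(a, pivot))
```

Both loops touch the same lower-triangle entries with the same arithmetic; the row-major form trades perfectly even work per flat index for rows of unequal length, which is the balance the commit accepts in exchange for dropping the per-iteration sqrt.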
Author
|
/run-ci |
|
A couple of things before this can merge:
same at :3262 ("the 8192-thread (128 wg) launch geometry") and :3322/:3326 ("8192/32=256 wgs", "2*8192 = 16384 threads"). Can you rephrase these in terms of _B / per-env so the count is not hardcoded?
Approach itself looks reasonable. |
… GPU names) Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Author
|
/run-ci |
Three performance changes: