Implement the new tuning API for deterministic (rfa) reduce dispatch #7346

Open

bernhardmgruber wants to merge 5 commits into NVIDIA:main from bernhardmgruber:tuning_determin_reduce

Conversation

bernhardmgruber (Contributor) commented Jan 25, 2026

copy-pr-bot (bot) commented Jan 25, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

bernhardmgruber (Contributor, Author) commented:

/ok to test 1962a21


bernhardmgruber marked this pull request as ready for review February 3, 2026 21:41
bernhardmgruber requested review from a team as code owners February 3, 2026 21:41
cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Feb 3, 2026
Comment on lines 47 to 53
#if !TUNE_BASE
cub::detail::rfa::dispatch<input_it_t, output_it_t, int, init_t, transform_t, accum_t, policy_selector_t>(
nullptr, temp_storage_bytes, d_in, d_out, elements, {}, 0);
#else
cub::detail::rfa::dispatch<input_it_t, output_it_t, int, init_t, transform_t, accum_t>(
nullptr, temp_storage_bytes, d_in, d_out, elements, {}, 0);
#endif
Contributor commented:

Perhaps use the preprocessor to define an rfa_dispatch alias differently based on TUNE_BASE, and use that throughout to reduce branching?

Suggested change
#if !TUNE_BASE
cub::detail::rfa::dispatch<input_it_t, output_it_t, int, init_t, transform_t, accum_t, policy_selector_t>(
nullptr, temp_storage_bytes, d_in, d_out, elements, {}, 0);
#else
cub::detail::rfa::dispatch<input_it_t, output_it_t, int, init_t, transform_t, accum_t>(
nullptr, temp_storage_bytes, d_in, d_out, elements, {}, 0);
#endif
#if !TUNE_BASE
auto rfa_dispatch = cub::detail::rfa::dispatch<input_it_t, output_it_t, int, init_t, transform_t, accum_t, policy_selector_t>;
#else
auto rfa_dispatch = cub::detail::rfa::dispatch<input_it_t, output_it_t, int, init_t, transform_t, accum_t>;
#endif
rfa_dispatch(nullptr, temp_storage_bytes, d_in, d_out, elements, {}, 0);

Then similarly for the actual invocation a few lines below.

Contributor Author replied:

That's an interesting suggestion, but if I take the address of the template instantiation of dispatch, I then need to provide all function arguments at the call site (default arguments no longer work). I didn't know that! I refactored it differently now.

The test was wrong but always passed because the rfa dispatcher ignored CUB_DETAIL_DEFAULT_KERNEL_LAUNCHER_FACTORY since it was hardcoded to triple_chevron.

github-actions bot commented Feb 4, 2026

😬 CI Workflow Results

🟥 Finished in 2h 44m: Pass: 98%/99 | Total: 22h 42m | Max: 2h 44m | Hits: 99%/108245

See results here.

Labels

None yet

Projects

Status: In Review

2 participants