Improve reduce implementation#7493

Merged
miscco merged 1 commit into NVIDIA:main from miscco:improve_reduce_implementation
Feb 4, 2026

Conversation

@miscco (Contributor) commented Feb 4, 2026

We weren't properly checking the errors returned by the CUB calls.

Also, the allocation guard no longer synchronizes in its destructor; we want to synchronize explicitly in the code so that the sync cannot be silently forgotten.

Pulled out of #7382

@miscco miscco requested a review from a team as a code owner February 4, 2026 08:47
@miscco miscco requested a review from pciolkosz February 4, 2026 08:47
@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 4, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Feb 4, 2026

@bernhardmgruber (Contributor) commented:

Found the issue. Here is the diff to be applied to main that recovers performance for PSTL (SASS is now identical to Thrust):

diff --git a/libcudacxx/include/cuda/std/__pstl/reduce.h b/libcudacxx/include/cuda/std/__pstl/reduce.h
index 803edcc584..be0d33fc71 100644
--- a/libcudacxx/include/cuda/std/__pstl/reduce.h
+++ b/libcudacxx/include/cuda/std/__pstl/reduce.h
@@ -86,7 +86,11 @@ _CCCL_REQUIRES(__has_forward_traversal<_Iter> _CCCL_AND is_execution_policy_v<_P
 [[nodiscard]] _CCCL_HOST_API _Tp reduce(const _Policy& __policy, _Iter __first, _Iter __last, _Tp __init)
 {
   return ::cuda::std::reduce(
-    __policy, ::cuda::std::move(__first), ::cuda::std::move(__last), ::cuda::std::move(__init), ::cuda::std::plus<>{});
+    __policy,
+    ::cuda::std::move(__first),
+    ::cuda::std::move(__last),
+    ::cuda::std::move(__init),
+    ::cuda::std::plus<_Tp>{});
 }
 
 _CCCL_TEMPLATE(class _Policy, class _Iter)
@@ -94,7 +98,11 @@ _CCCL_REQUIRES(__has_forward_traversal<_Iter> _CCCL_AND is_execution_policy_v<_P
 [[nodiscard]] _CCCL_HOST_API iter_value_t<_Iter> reduce(const _Policy& __policy, _Iter __first, _Iter __last)
 {
   return ::cuda::std::reduce(
-    __policy, ::cuda::std::move(__first), ::cuda::std::move(__last), iter_value_t<_Iter>{}, ::cuda::std::plus<>{});
+    __policy,
+    ::cuda::std::move(__first),
+    ::cuda::std::move(__last),
+    iter_value_t<_Iter>{},
+    ::cuda::std::plus<iter_value_t<_Iter>>{});
 }
 
 _CCCL_END_NAMESPACE_ARCH_DEPENDENT

The main insight is that cuda::std::plus<void> leads to the wrong accumulator type for any type that promotes during addition: reducing over short uses int as the accumulator type, while Thrust uses short. So we have to use cuda::std::plus<T>.

@miscco force-pushed the improve_reduce_implementation branch from 7f04ec8 to 3ac29ab on February 4, 2026 11:43
@miscco (Contributor, Author) commented Feb 4, 2026

New tunings look good:

['thrust_reduce.json', 'pstl_reduce.json']
# base

## [0] NVIDIA RTX A6000

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |    2^16    |  21.486 us |      17.39% |  21.118 us |      17.94% | -0.369 us |  -1.72% |   SAME   |
|   I8    |    2^20    |  24.545 us |       2.39% |  24.521 us |       2.60% | -0.023 us |  -0.10% |   SAME   |
|   I8    |    2^24    |  52.304 us |       3.42% |  52.112 us |       3.08% | -0.192 us |  -0.37% |   SAME   |
|   I8    |    2^28    | 411.529 us |       2.88% | 411.554 us |       2.90% |  0.025 us |   0.01% |   SAME   |
|   I16   |    2^16    |  21.979 us |      15.83% |  22.042 us |      15.79% |  0.063 us |   0.29% |   SAME   |
|   I16   |    2^20    |  25.677 us |       2.76% |  25.701 us |       3.79% |  0.024 us |   0.09% |   SAME   |
|   I16   |    2^24    |  74.401 us |       2.39% |  74.274 us |       2.42% | -0.127 us |  -0.17% |   SAME   |
|   I16   |    2^28    | 786.817 us |       2.30% | 787.486 us |       2.35% |  0.668 us |   0.08% |   SAME   |
|   I32   |    2^16    |  17.764 us |      16.98% |  17.478 us |      16.30% | -0.286 us |  -1.61% |   SAME   |
|   I32   |    2^20    |  27.802 us |       3.57% |  27.826 us |       3.19% |  0.024 us |   0.09% |   SAME   |
|   I32   |    2^24    | 119.648 us |       1.82% | 119.914 us |       2.02% |  0.266 us |   0.22% |   SAME   |
|   I32   |    2^28    |   1.541 ms |       1.71% |   1.540 ms |       1.64% | -0.555 us |  -0.04% |   SAME   |
|   I64   |    2^16    |  22.849 us |      13.40% |  22.647 us |      14.32% | -0.202 us |  -0.89% |   SAME   |
|   I64   |    2^20    |  33.601 us |       5.45% |  33.445 us |       4.90% | -0.156 us |  -0.46% |   SAME   |
|   I64   |    2^24    | 214.350 us |       1.10% | 214.274 us |       1.09% | -0.076 us |  -0.04% |   SAME   |
|   I64   |    2^28    |   3.046 ms |       1.06% |   3.047 ms |       1.19% |  1.244 us |   0.04% |   SAME   |
|  I128   |    2^16    |  25.115 us |       4.09% |  25.117 us |       3.54% |  0.002 us |   0.01% |   SAME   |
|  I128   |    2^20    |  52.146 us |       2.48% |  52.310 us |       2.91% |  0.165 us |   0.32% |   SAME   |
|  I128   |    2^24    | 420.804 us |       0.51% | 420.896 us |       0.50% |  0.092 us |   0.02% |   SAME   |
|  I128   |    2^28    |   6.094 ms |       0.67% |   6.094 ms |       0.74% |  0.603 us |   0.01% |   SAME   |
|   F32   |    2^16    |  18.340 us |      19.93% |  17.495 us |      17.58% | -0.845 us |  -4.61% |   SAME   |
|   F32   |    2^20    |  29.053 us |       3.75% |  29.054 us |       3.25% |  0.002 us |   0.01% |   SAME   |
|   F32   |    2^24    | 121.628 us |       3.01% | 122.183 us |       3.03% |  0.555 us |   0.46% |   SAME   |
|   F32   |    2^28    |   1.541 ms |       1.69% |   1.540 ms |       1.60% | -0.775 us |  -0.05% |   SAME   |
|   F64   |    2^16    |  23.936 us |       8.16% |  24.677 us |       7.63% |  0.741 us |   3.10% |   SAME   |
|   F64   |    2^20    |  35.803 us |       8.76% |  36.001 us |       8.97% |  0.198 us |   0.55% |   SAME   |
|   F64   |    2^24    | 220.234 us |       1.02% | 220.217 us |       0.98% | -0.017 us |  -0.01% |   SAME   |
|   F64   |    2^28    |   3.053 ms |       1.05% |   3.052 ms |       0.94% | -1.325 us |  -0.04% |   SAME   |

@miscco miscco enabled auto-merge (squash) February 4, 2026 12:57
@github-actions bot commented Feb 4, 2026

🥳 CI Workflow Results

🟩 Finished in 1h 30m: Pass: 100%/95 | Total: 16h 17m | Max: 1h 05m | Hits: 99%/248072

See results here.

@miscco miscco merged commit 7942d8c into NVIDIA:main Feb 4, 2026
113 checks passed
@miscco miscco deleted the improve_reduce_implementation branch February 4, 2026 14:06
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Feb 4, 2026
