Improve reduce implementation#7493

Merged
miscco merged 1 commit into NVIDIA:main from miscco:improve_reduce_implementation
Feb 4, 2026

Conversation

@miscco (Contributor) commented Feb 4, 2026

We weren't properly checking the errors returned by the CUB calls.

Also, the allocation guard no longer synchronizes in its destructor; we want to synchronize explicitly in the code so that the sync cannot be silently forgotten.

Pulled out of #7382

@miscco miscco requested a review from a team as a code owner February 4, 2026 08:47
@miscco miscco requested a review from pciolkosz February 4, 2026 08:47
@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 4, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Feb 4, 2026

@bernhardmgruber (Contributor) commented:

Found the issue. Here is the diff to be applied to main that recovers performance for PSTL (SASS is now identical to Thrust):

diff --git a/libcudacxx/include/cuda/std/__pstl/reduce.h b/libcudacxx/include/cuda/std/__pstl/reduce.h
index 803edcc584..be0d33fc71 100644
--- a/libcudacxx/include/cuda/std/__pstl/reduce.h
+++ b/libcudacxx/include/cuda/std/__pstl/reduce.h
@@ -86,7 +86,11 @@ _CCCL_REQUIRES(__has_forward_traversal<_Iter> _CCCL_AND is_execution_policy_v<_P
 [[nodiscard]] _CCCL_HOST_API _Tp reduce(const _Policy& __policy, _Iter __first, _Iter __last, _Tp __init)
 {
   return ::cuda::std::reduce(
-    __policy, ::cuda::std::move(__first), ::cuda::std::move(__last), ::cuda::std::move(__init), ::cuda::std::plus<>{});
+    __policy,
+    ::cuda::std::move(__first),
+    ::cuda::std::move(__last),
+    ::cuda::std::move(__init),
+    ::cuda::std::plus<_Tp>{});
 }
 
 _CCCL_TEMPLATE(class _Policy, class _Iter)
@@ -94,7 +98,11 @@ _CCCL_REQUIRES(__has_forward_traversal<_Iter> _CCCL_AND is_execution_policy_v<_P
 [[nodiscard]] _CCCL_HOST_API iter_value_t<_Iter> reduce(const _Policy& __policy, _Iter __first, _Iter __last)
 {
   return ::cuda::std::reduce(
-    __policy, ::cuda::std::move(__first), ::cuda::std::move(__last), iter_value_t<_Iter>{}, ::cuda::std::plus<>{});
+    __policy,
+    ::cuda::std::move(__first),
+    ::cuda::std::move(__last),
+    iter_value_t<_Iter>{},
+    ::cuda::std::plus<iter_value_t<_Iter>>{});
 }
 
 _CCCL_END_NAMESPACE_ARCH_DEPENDENT

The main insight is that cuda::std::plus<void> leads to the wrong accumulator type for any type that promotes during addition: reducing over short uses int as the accumulator type, while Thrust uses short. So we have to use cuda::std::plus<T>.

@miscco force-pushed the improve_reduce_implementation branch from 7f04ec8 to 3ac29ab on February 4, 2026 11:43
@miscco (Contributor, Author) commented Feb 4, 2026

New tunings look good:

['thrust_reduce.json', 'pstl_reduce.json']
# base

## [0] NVIDIA RTX A6000

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |    2^16    |  21.486 us |      17.39% |  21.118 us |      17.94% | -0.369 us |  -1.72% |   SAME   |
|   I8    |    2^20    |  24.545 us |       2.39% |  24.521 us |       2.60% | -0.023 us |  -0.10% |   SAME   |
|   I8    |    2^24    |  52.304 us |       3.42% |  52.112 us |       3.08% | -0.192 us |  -0.37% |   SAME   |
|   I8    |    2^28    | 411.529 us |       2.88% | 411.554 us |       2.90% |  0.025 us |   0.01% |   SAME   |
|   I16   |    2^16    |  21.979 us |      15.83% |  22.042 us |      15.79% |  0.063 us |   0.29% |   SAME   |
|   I16   |    2^20    |  25.677 us |       2.76% |  25.701 us |       3.79% |  0.024 us |   0.09% |   SAME   |
|   I16   |    2^24    |  74.401 us |       2.39% |  74.274 us |       2.42% | -0.127 us |  -0.17% |   SAME   |
|   I16   |    2^28    | 786.817 us |       2.30% | 787.486 us |       2.35% |  0.668 us |   0.08% |   SAME   |
|   I32   |    2^16    |  17.764 us |      16.98% |  17.478 us |      16.30% | -0.286 us |  -1.61% |   SAME   |
|   I32   |    2^20    |  27.802 us |       3.57% |  27.826 us |       3.19% |  0.024 us |   0.09% |   SAME   |
|   I32   |    2^24    | 119.648 us |       1.82% | 119.914 us |       2.02% |  0.266 us |   0.22% |   SAME   |
|   I32   |    2^28    |   1.541 ms |       1.71% |   1.540 ms |       1.64% | -0.555 us |  -0.04% |   SAME   |
|   I64   |    2^16    |  22.849 us |      13.40% |  22.647 us |      14.32% | -0.202 us |  -0.89% |   SAME   |
|   I64   |    2^20    |  33.601 us |       5.45% |  33.445 us |       4.90% | -0.156 us |  -0.46% |   SAME   |
|   I64   |    2^24    | 214.350 us |       1.10% | 214.274 us |       1.09% | -0.076 us |  -0.04% |   SAME   |
|   I64   |    2^28    |   3.046 ms |       1.06% |   3.047 ms |       1.19% |  1.244 us |   0.04% |   SAME   |
|  I128   |    2^16    |  25.115 us |       4.09% |  25.117 us |       3.54% |  0.002 us |   0.01% |   SAME   |
|  I128   |    2^20    |  52.146 us |       2.48% |  52.310 us |       2.91% |  0.165 us |   0.32% |   SAME   |
|  I128   |    2^24    | 420.804 us |       0.51% | 420.896 us |       0.50% |  0.092 us |   0.02% |   SAME   |
|  I128   |    2^28    |   6.094 ms |       0.67% |   6.094 ms |       0.74% |  0.603 us |   0.01% |   SAME   |
|   F32   |    2^16    |  18.340 us |      19.93% |  17.495 us |      17.58% | -0.845 us |  -4.61% |   SAME   |
|   F32   |    2^20    |  29.053 us |       3.75% |  29.054 us |       3.25% |  0.002 us |   0.01% |   SAME   |
|   F32   |    2^24    | 121.628 us |       3.01% | 122.183 us |       3.03% |  0.555 us |   0.46% |   SAME   |
|   F32   |    2^28    |   1.541 ms |       1.69% |   1.540 ms |       1.60% | -0.775 us |  -0.05% |   SAME   |
|   F64   |    2^16    |  23.936 us |       8.16% |  24.677 us |       7.63% |  0.741 us |   3.10% |   SAME   |
|   F64   |    2^20    |  35.803 us |       8.76% |  36.001 us |       8.97% |  0.198 us |   0.55% |   SAME   |
|   F64   |    2^24    | 220.234 us |       1.02% | 220.217 us |       0.98% | -0.017 us |  -0.01% |   SAME   |
|   F64   |    2^28    |   3.053 ms |       1.05% |   3.052 ms |       0.94% | -1.325 us |  -0.04% |   SAME   |

@miscco miscco enabled auto-merge (squash) February 4, 2026 12:57
@github-actions bot commented Feb 4, 2026

🥳 CI Workflow Results

🟩 Finished in 1h 30m: Pass: 100%/95 | Total: 16h 17m | Max: 1h 05m | Hits: 99%/248072

See results here.

@miscco miscco merged commit 7942d8c into NVIDIA:main Feb 4, 2026
113 checks passed
@miscco miscco deleted the improve_reduce_implementation branch February 4, 2026 14:06
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Feb 4, 2026
