
Conversation

@miscco (Contributor) commented Nov 25, 2025

This implements cuda::std::reduce using a CUB backend.

@miscco miscco requested review from a team as code owners November 25, 2025 09:12
@miscco miscco requested a review from wmaxey November 25, 2025 09:12
Comment on lines +71 to +79
_CCCL_TRY_CUDA_API(
::cub::DeviceReduce::Reduce,
"__pstl_cuda_reduce: cub::DeviceReduce::Reduce failed",
::cuda::std::move(__first),
__device_ret_ptr,
__count,
::cuda::std::move(__func),
::cuda::std::move(__init),
::cuda::std::move(__policy));
Contributor
Important: in the for_each_n implementation we create a ::cuda::stream_ref __stream{cudaStreamPerThread}; and pass the stream instead of the policy. I think we need to add __stream to __policy here.

Collaborator
Agreed, we should pass a stream. Regarding adding stream to policy, it's a complicated subject that we are deferring until after reduction is merged.
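A minimal sketch of the suggestion, assuming the internal dispatch accepts a stream argument in the position where __policy is currently passed (the exact overload is an assumption, not something this PR already exposes):

```cuda
// Sketch only: mirrors the for_each_n approach described above.
// Assumes an overload taking a stream in place of the policy.
::cuda::stream_ref __stream{cudaStreamPerThread};

_CCCL_TRY_CUDA_API(
  ::cub::DeviceReduce::Reduce,
  "__pstl_cuda_reduce: cub::DeviceReduce::Reduce failed",
  ::cuda::std::move(__first),
  __device_ret_ptr,
  __count,
  ::cuda::std::move(__func),
  ::cuda::std::move(__init),
  __stream); // stream instead of __policy, per the comments above
```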

Comment on lines +65 to +66
_CCCL_TRY_CUDA_API(
::cudaMalloc, "__pstl_cuda_reduce: allocation failed", reinterpret_cast<void**>(&__device_ret_ptr), sizeof(_Tp));
Contributor
Important: This must be cudaMallocAsync.

Collaborator
I don't think cudaMallocAsync is enough. Many calls below throw without ever freeing allocated memory. We need a RAII abstraction, like async_buffer, at least internally.

Contributor
How about a unique_ptr?
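A minimal sketch of the unique_ptr idea, assuming a __stream is in scope; it uses std::unique_ptr with a custom deleter (whether an internal RAII type like async_buffer is preferred instead is the deferred discussion above):

```cuda
// Sketch: RAII ownership of the device allocation so any later throw
// still frees it. The deleter swallows the error code because
// destructors must not throw.
struct __device_free
{
  ::cuda::stream_ref __stream_;
  void operator()(void* __ptr) const noexcept
  {
    ::cudaFreeAsync(__ptr, __stream_.get());
  }
};

_Tp* __raw = nullptr;
_CCCL_TRY_CUDA_API(
  ::cudaMallocAsync,
  "__pstl_cuda_reduce: allocation failed",
  reinterpret_cast<void**>(&__raw),
  sizeof(_Tp),
  __stream.get());
::std::unique_ptr<_Tp, __device_free> __device_ret_ptr{__raw, __device_free{__stream}};
```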

Comment on lines +81 to +89
_CCCL_TRY_CUDA_API(
::cudaMemcpy,
"__pstl_cuda_reduce: copy of result from device to host failed",
::cuda::std::addressof(__ret),
__device_ret_ptr,
sizeof(_Tp),
::cudaMemcpyDeviceToHost);

_CCCL_TRY_CUDA_API(::cudaFree, "__pstl_cuda_reduce: deallocate failed", __device_ret_ptr);
Contributor
Important: this should be cudaMemcpyAsync and cudaFreeAsync, followed by a sync of the stream.
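The async variant asked for here could look roughly like this (assuming a __stream such as ::cuda::stream_ref __stream{cudaStreamPerThread} exists earlier in the function):

```cuda
// Sketch: copy and free on the stream, then block once at the end so the
// host-side __ret is valid before it is returned.
_CCCL_TRY_CUDA_API(
  ::cudaMemcpyAsync,
  "__pstl_cuda_reduce: copy of result from device to host failed",
  ::cuda::std::addressof(__ret),
  __device_ret_ptr,
  sizeof(_Tp),
  ::cudaMemcpyDeviceToHost,
  __stream.get());

_CCCL_TRY_CUDA_API(
  ::cudaFreeAsync, "__pstl_cuda_reduce: deallocate failed", __device_ret_ptr, __stream.get());

__stream.sync(); // single synchronization point
```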

@github-actions
🥳 CI Workflow Results

🟩 Finished in 3h 12m: Pass: 100%/90 | Total: 3d 05h | Max: 3h 02m | Hits: 49%/212260

See results here.

Comment on lines +63 to +67
_Tp* __device_ret_ptr = nullptr;

_CCCL_TRY_CUDA_API(
::cudaMalloc, "__pstl_cuda_reduce: allocation failed", reinterpret_cast<void**>(&__device_ret_ptr), sizeof(_Tp));

Collaborator
important: this might lead to an issue if the user's type has a non-trivial constructor. Thrust handles that by invoking a kernel only when a constructor is needed. We might do a bit better: consider having a fancy iterator that does an in-place new for non-trivial types and a raw pointer otherwise.

Contributor
Yep, I think the iterator handling the output should do placement new.
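One possible shape for that iterator (names and the proxy design are illustrative, not part of the PR; a complete version also needs the usual iterator typedefs):

```cuda
// Sketch: an output iterator whose assignment placement-news the result,
// so a non-trivially-constructible _Tp is never assigned to before being
// constructed. For trivially constructible types a raw _Tp* suffices.
template <class _Tp>
struct __uninit_out_iter
{
  _Tp* __ptr_;

  struct __assign_proxy
  {
    _Tp* __ptr_;
    __device__ const __assign_proxy& operator=(const _Tp& __val) const
    {
      ::new (static_cast<void*>(__ptr_)) _Tp(__val); // construct in place
      return *this;
    }
  };

  __device__ __assign_proxy operator*() const
  {
    return __assign_proxy{__ptr_};
  }
};
```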

// Allocate memory for result
_Tp* __device_ret_ptr = nullptr;

_CCCL_TRY_CUDA_API(
Collaborator
important: given that we throw below, this looks like a memory leak. Consider a RAII abstraction.

thrust::sequence(data.begin(), data.end(), 1);

const auto policy = cuda::execution::__cub_par_unseq;
decltype(auto) res = cuda::std::reduce(policy, data.begin(), data.end(), 42, plus_two{});
Collaborator
suggestion: consider adding a check in the binary operator to see if it's actually invoked on GPU
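One way to implement that suggestion is to trap on the host path with NV_IF_TARGET from <nv/target> (the `+ 2` body of plus_two is a guess from its name, not taken from the test source):

```cuda
#include <cassert>
#include <nv/target>

struct plus_two
{
  __host__ __device__ int operator()(int __lhs, int __rhs) const
  {
    // Fail loudly if the backend ever invokes the operator on the host.
    NV_IF_TARGET(NV_IS_HOST, (assert(!"plus_two must be invoked on device");));
    return __lhs + __rhs + 2;
  }
};
```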
