
Conversation

@miscco (Contributor) commented Nov 25, 2025

This implements cuda::std::reduce using a CUB backend.

@miscco miscco requested review from a team as code owners November 25, 2025 09:12
@miscco miscco requested a review from wmaxey November 25, 2025 09:12
Comment on lines +71 to +79
_CCCL_TRY_CUDA_API(
::cub::DeviceReduce::Reduce,
"__pstl_cuda_reduce: cub::DeviceReduce::Reduce failed",
::cuda::std::move(__first),
__device_ret_ptr,
__count,
::cuda::std::move(__func),
::cuda::std::move(__init),
::cuda::std::move(__policy));
Contributor
Important: in the for_each_n implementation we create a ::cuda::stream_ref __stream{cudaStreamPerThread}; and pass the stream instead of the policy. I think we need to add __stream to __policy here.

Collaborator
Agreed, we should pass a stream. Regarding adding stream to policy, it's a complicated subject that we are deferring until after reduction is merged.
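A minimal sketch of the suggestion, assuming the internal dispatch accepts a stream argument in the position where __policy is currently passed (the exact overload is an assumption, not something this PR already exposes):

```cuda
// Sketch only: mirrors the for_each_n approach described above.
// Assumes an overload taking a stream in place of the policy.
::cuda::stream_ref __stream{cudaStreamPerThread};

_CCCL_TRY_CUDA_API(
  ::cub::DeviceReduce::Reduce,
  "__pstl_cuda_reduce: cub::DeviceReduce::Reduce failed",
  ::cuda::std::move(__first),
  __device_ret_ptr,
  __count,
  ::cuda::std::move(__func),
  ::cuda::std::move(__init),
  __stream); // stream instead of __policy, per the comments above
```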

Comment on lines +65 to +66
_CCCL_TRY_CUDA_API(
::cudaMalloc, "__pstl_cuda_reduce: allocation failed", reinterpret_cast<void**>(&__device_ret_ptr), sizeof(_Tp));
Contributor
Important: This must be cudaMallocAsync.

Collaborator
I don't think cudaMallocAsync is enough. Many calls below throw without ever freeing allocated memory. We need a RAII abstraction, like async_buffer, at least internally.

Contributor
How about a unique_ptr?
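A minimal sketch of the unique_ptr idea, assuming a __stream is in scope; it uses std::unique_ptr with a custom deleter (whether an internal RAII type like async_buffer is preferred instead is the deferred discussion above):

```cuda
// Sketch: RAII ownership of the device allocation so any later throw
// still frees it. The deleter swallows the error code because
// destructors must not throw.
struct __device_free
{
  ::cuda::stream_ref __stream_;
  void operator()(void* __ptr) const noexcept
  {
    ::cudaFreeAsync(__ptr, __stream_.get());
  }
};

_Tp* __raw = nullptr;
_CCCL_TRY_CUDA_API(
  ::cudaMallocAsync,
  "__pstl_cuda_reduce: allocation failed",
  reinterpret_cast<void**>(&__raw),
  sizeof(_Tp),
  __stream.get());
::std::unique_ptr<_Tp, __device_free> __device_ret_ptr{__raw, __device_free{__stream}};
```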

Comment on lines +81 to +89
_CCCL_TRY_CUDA_API(
::cudaMemcpy,
"__pstl_cuda_reduce: copy of result from device to host failed",
::cuda::std::addressof(__ret),
__device_ret_ptr,
sizeof(_Tp),
::cudaMemcpyDeviceToHost);

_CCCL_TRY_CUDA_API(::cudaFree, "__pstl_cuda_reduce: deallocate failed", __device_ret_ptr);
Contributor
Important: this should be cudaMemcpyAsync and cudaFreeAsync, followed by a sync of the stream.
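The async variant asked for here could look roughly like this (assuming a __stream such as ::cuda::stream_ref __stream{cudaStreamPerThread} exists earlier in the function):

```cuda
// Sketch: copy and free on the stream, then block once at the end so the
// host-side __ret is valid before it is returned.
_CCCL_TRY_CUDA_API(
  ::cudaMemcpyAsync,
  "__pstl_cuda_reduce: copy of result from device to host failed",
  ::cuda::std::addressof(__ret),
  __device_ret_ptr,
  sizeof(_Tp),
  ::cudaMemcpyDeviceToHost,
  __stream.get());

_CCCL_TRY_CUDA_API(
  ::cudaFreeAsync, "__pstl_cuda_reduce: deallocate failed", __device_ret_ptr, __stream.get());

__stream.sync(); // single synchronization point
```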

@github-actions
🥳 CI Workflow Results

🟩 Finished in 3h 12m: Pass: 100%/90 | Total: 3d 05h | Max: 3h 02m | Hits: 49%/212260

See results here.

Comment on lines +63 to +67
_Tp* __device_ret_ptr = nullptr;

_CCCL_TRY_CUDA_API(
::cudaMalloc, "__pstl_cuda_reduce: allocation failed", reinterpret_cast<void**>(&__device_ret_ptr), sizeof(_Tp));

Collaborator
important: this might lead to an issue if the user's type has a non-trivial constructor. Thrust handles that by invoking a kernel only when a constructor is needed. We might do a bit better: consider having a fancy iterator that does an in-place new for non-trivial types and a raw pointer otherwise.

Contributor
Yep, I think the iterator handling the output should do placement new.
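One possible shape for that iterator (names and the proxy design are illustrative, not part of the PR; a complete version also needs the usual iterator typedefs):

```cuda
// Sketch: an output iterator whose assignment placement-news the result,
// so a non-trivially-constructible _Tp is never assigned to before being
// constructed. For trivially constructible types a raw _Tp* suffices.
template <class _Tp>
struct __uninit_out_iter
{
  _Tp* __ptr_;

  struct __assign_proxy
  {
    _Tp* __ptr_;
    __device__ const __assign_proxy& operator=(const _Tp& __val) const
    {
      ::new (static_cast<void*>(__ptr_)) _Tp(__val); // construct in place
      return *this;
    }
  };

  __device__ __assign_proxy operator*() const
  {
    return __assign_proxy{__ptr_};
  }
};
```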

// Allocate memory for result
_Tp* __device_ret_ptr = nullptr;

_CCCL_TRY_CUDA_API(
Collaborator
important: given that we throw below, this looks like a memory leak. Consider a RAII abstraction.

thrust::sequence(data.begin(), data.end(), 1);

const auto policy = cuda::execution::__cub_par_unseq;
decltype(auto) res = cuda::std::reduce(policy, data.begin(), data.end(), 42, plus_two{});
Collaborator
suggestion: consider adding a check in the binary operator to see if it's actually invoked on GPU
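One way to implement that suggestion is to trap on the host path with NV_IF_TARGET from <nv/target> (the `+ 2` body of plus_two is a guess from its name, not taken from the test source):

```cuda
#include <cassert>
#include <nv/target>

struct plus_two
{
  __host__ __device__ int operator()(int __lhs, int __rhs) const
  {
    // Fail loudly if the backend ever invokes the operator on the host.
    NV_IF_TARGET(NV_IS_HOST, (assert(!"plus_two must be invoked on device");));
    return __lhs + __rhs + 2;
  }
};
```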
