Skip to content

Conversation

@PointKernel
Copy link
Member

@PointKernel PointKernel commented Oct 31, 2025

Description

A temporary workaround of NVIDIA/cccl#6430

This PR updates the groupby logic to round up output buffer sizes to a 4-byte boundary when the column data type is smaller than 4 bytes. This prevents false-positive memcheck failures from CCCL, where 4B CAS loops are used to emulate 1B or 2B atomic operations.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@PointKernel PointKernel self-assigned this Oct 31, 2025
@PointKernel PointKernel requested a review from a team as a code owner October 31, 2025 23:07
@PointKernel PointKernel added libcudf Affects libcudf (C++/CUDA) code. improvement Improvement / enhancement to an existing function labels Oct 31, 2025
@PointKernel PointKernel added non-breaking Non-breaking change bug Something isn't working and removed improvement Improvement / enhancement to an existing function labels Oct 31, 2025
auto const mask_flag = nullable ? mask_state::ALL_NULL : mask_state::UNALLOCATED;
return make_fixed_width_column(
cudf::detail::target_type(col_type, agg), output_size, mask_flag, stream, mr);
cudf::detail::target_type(col_type, agg), adjusted_size, mask_flag, stream, mr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean we change the number of rows in the output column?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we should not do this. The size of output results should match exactly with the number of output keys. I think the only way for us to fix this issue is to modify the aggregation kernel instead, adding specializing code for the cases with output type having size smaller than 4.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I did notice this issue, but all the C++ tests passed when I ran them locally, including the Compute Sanitizer checks. Let me take a closer look.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this instead? This will create the column manually instead of using the factory methods.

    auto const col_type =
      is_dictionary(col.type()) ? dictionary_column_view(col).keys().type() : col.type();
    auto const nullable =
      agg != aggregation::COUNT_VALID && agg != aggregation::COUNT_ALL && col.has_nulls();
    auto const make_empty_column = [&](data_type dt, size_type size, mask_state state) {
      auto const type_size = cudf::size_of(dt);
      if (type_size < 4) {
        auto adjusted_size = cudf::util::round_up_safe(size, static_cast<size_type>(4));
        auto buffer        = rmm::device_buffer(adjusted_size * type_size, stream, mr);
        auto mask          = cudf::detail::create_null_mask(size, state, stream, mr);
        auto null_count    = state == mask_state::UNINITIALIZED ? 0 : state_null_count(state, size);
        return std::make_unique<column>(dt, size, std::move(buffer), std::move(mask), null_count);
      }
      return make_fixed_width_column(dt, size, state, stream, mr);
    };
    if (agg != aggregation::SUM_WITH_OVERFLOW) {
      auto const target_type = cudf::detail::target_type(col_type, agg);
      auto const mask_flag   = nullable ? mask_state::ALL_NULL : mask_state::UNALLOCATED;
      return make_empty_column(target_type, output_size, mask_flag);
    }
    auto make_children = [&make_empty_column](size_type size) {
      std::vector<std::unique_ptr<column>> children;
      children.push_back(
        make_empty_column(data_type{type_id::INT64}, size, mask_state::UNALLOCATED));
      children.push_back(
        make_empty_column(data_type{type_id::BOOL8}, size, mask_state::UNALLOCATED));
      return children;
    };

This passed all the tests as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using a buffer directly with cursor instead of the column factory but didn’t have any luck. Thanks to @davidwendt for showing how to get it working.

@Aminsed
Copy link

Aminsed commented Nov 2, 2025

@PointKernel Thanks for putting this workaround together! FYI I just opened NVIDIA/cccl#6442, which fixes directly by making cuda::atomic_ref for sub-4-byte types use a byte-granular path. Once that lands and RAPIDS pulls the new CCCL, compute-sanitizer no longer flags the groupby run, so the padding workaround shouldn’t be necessary anymore. Thanks again for keeping things unblocked in the meantime!

@ttnghia
Copy link
Contributor

ttnghia commented Nov 3, 2025

Once that lands and RAPIDS pulls the new CCCL, compute-sanitizer no longer flags the groupby run, so the padding workaround shouldn’t be necessary anymore.

@PointKernel Should we just close this PR and wait instead?

@PointKernel
Copy link
Member Author

Once that lands and RAPIDS pulls the new CCCL, compute-sanitizer no longer flags the groupby run, so the padding workaround shouldn’t be necessary anymore.

@PointKernel Should we just close this PR and wait instead?

The CCCL PR was closed, concluding that silencing the warning rather than dispatching on smaller types is the right approach, so the proper fix may take longer than expected. IMO, we should make this work in the meantime, but I’m open to suggestions.

@PointKernel PointKernel changed the title Align groupby output buffers to 4 bytes for small data types Round up small-type groupby outputs to 4-byte boundary Nov 3, 2025
@PointKernel PointKernel requested a review from ttnghia November 3, 2025 19:17
@PointKernel
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit fae01c4 into rapidsai:main Nov 4, 2025
137 checks passed
@PointKernel PointKernel deleted the fix/groupby-atomic-memcheck branch November 4, 2025 00:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants