-
Notifications
You must be signed in to change notification settings - Fork 981
Round up small-type groupby outputs to 4-byte boundary #20455
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Round up small-type groupby outputs to 4-byte boundary #20455
Conversation
cpp/src/groupby/hash/output_utils.cu
Outdated
| auto const mask_flag = nullable ? mask_state::ALL_NULL : mask_state::UNALLOCATED; | ||
| return make_fixed_width_column( | ||
| cudf::detail::target_type(col_type, agg), output_size, mask_flag, stream, mr); | ||
| cudf::detail::target_type(col_type, agg), adjusted_size, mask_flag, stream, mr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean we change the number of rows in the output column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we should not do this. The size of output results should match exactly with the number of output keys. I think the only way for us to fix this issue is to modify the aggregation kernel instead, adding specializing code for the cases with output type having size smaller than 4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I did notice this issue, but all the C++ tests passed when I ran them locally, including the Compute Sanitizer checks. Let me take a closer look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this instead? This will create the column manually instead of using the factory methods.
auto const col_type =
is_dictionary(col.type()) ? dictionary_column_view(col).keys().type() : col.type();
auto const nullable =
agg != aggregation::COUNT_VALID && agg != aggregation::COUNT_ALL && col.has_nulls();
auto const make_empty_column = [&](data_type dt, size_type size, mask_state state) {
auto const type_size = cudf::size_of(dt);
if (type_size < 4) {
auto adjusted_size = cudf::util::round_up_safe(size, static_cast<size_type>(4));
auto buffer = rmm::device_buffer(adjusted_size * type_size, stream, mr);
auto mask = cudf::detail::create_null_mask(size, state, stream, mr);
auto null_count = state == mask_state::UNINITIALIZED ? 0 : state_null_count(state, size);
return std::make_unique<column>(dt, size, std::move(buffer), std::move(mask), null_count);
}
return make_fixed_width_column(dt, size, state, stream, mr);
};
if (agg != aggregation::SUM_WITH_OVERFLOW) {
auto const target_type = cudf::detail::target_type(col_type, agg);
auto const mask_flag = nullable ? mask_state::ALL_NULL : mask_state::UNALLOCATED;
return make_empty_column(target_type, output_size, mask_flag);
}
auto make_children = [&make_empty_column](size_type size) {
std::vector<std::unique_ptr<column>> children;
children.push_back(
make_empty_column(data_type{type_id::INT64}, size, mask_state::UNALLOCATED));
children.push_back(
make_empty_column(data_type{type_id::BOOL8}, size, mask_state::UNALLOCATED));
return children;
};This passed all the tests as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using a buffer directly with cursor instead of the column factory but didn’t have any luck. Thanks to @davidwendt for showing how to get it working.
|
@PointKernel Thanks for putting this workaround together! FYI I just opened NVIDIA/cccl#6442, which fixes directly by making cuda::atomic_ref for sub-4-byte types use a byte-granular path. Once that lands and RAPIDS pulls the new CCCL, compute-sanitizer no longer flags the groupby run, so the padding workaround shouldn’t be necessary anymore. Thanks again for keeping things unblocked in the meantime! |
@PointKernel Should we just close this PR and wait instead? |
The CCCL PR was closed, concluding that silencing the warning rather than dispatching on smaller types is the right approach, so the proper fix may take longer than expected. IMO, we should make this work in the meantime, but I’m open to suggestions. |
|
/merge |
Description
A temporary workaround of NVIDIA/cccl#6430
This PR updates the groupby logic to round up output buffer sizes to a 4-byte boundary when the column data type is smaller than 4 bytes. This prevents false-positive memcheck failures from CCCL, where 4B CAS loops are used to emulate 1B or 2B atomic operations.
Checklist