Optimizations for tdigest generation. #19140
base: branch-25.08
Conversation
- A CPU path for cases where we have small numbers of potentially large inputs.
- Use floats instead of doubles to run trig functions in the scale function.
- When appropriate, use a worst-case memory allocation strategy to cut the number of cluster computing kernel calls in half.
…back from pinned to device in the CPU or partial CPU case. Changed tests so that they forcibly run in non-CPU mode as well as CPU mode.
… lot of work in the scale function. Nets about a 40% win in the function itself.
… numbers of large groups to test a customer use case.
The changes look good to me. I ran some tests.
Effectively I am calculating the difference between the actual percentile and the approximate version for both the CPU and the GPU, so that we can see if the GPU is a better or worse estimate than the CPU is. For all of the tests that I ran, the GPU is as good as or better than the CPU, both before this change and after it.
That said, if I compare the new GPU calculation to the old GPU calculation, we differ only on the 100,000,000 rows version. In one case we are slightly better,
Instead of showing old vs. new benchmarks, can you run the comparison script please? Using https://github.com/NVIDIA/nvbench/blob/main/scripts/nvbench_compare.py.
auto col = cudf::make_numeric_column(
  cudf::data_type{cudf::type_id::INT32}, num_rows, cudf::mask_state::UNALLOCATED, stream, mr);
thrust::sequence(rmm::exec_policy_nosync(stream),
                 col->mutable_view().begin<int>(),
                 col->mutable_view().end<int>(),
                 0);

// group offsets, labels, valid counts
auto iter = thrust::make_counting_iterator(0);
rmm::device_uvector<cudf::size_type> group_offsets(num_groups + 1, stream, mr);
thrust::transform(rmm::exec_policy_nosync(stream),
                  iter,
                  iter + group_offsets.size(),
                  group_offsets.begin(),
                  cuda::proclaim_return_type<int>(
                    [rows_per_group] __device__(int i) { return i * rows_per_group; }));
rmm::device_uvector<cudf::size_type> group_labels(num_rows, stream, mr);
thrust::transform(rmm::exec_policy_nosync(stream),
                  iter,
                  iter + num_rows,
                  group_labels.begin(),
                  cuda::proclaim_return_type<int>(
                    [rows_per_group] __device__(int i) { return i / rows_per_group; }));
rmm::device_uvector<cudf::size_type> group_valid_counts(num_groups, stream, mr);
auto valid_count_iter = thrust::make_constant_iterator(rows_per_group);
thrust::copy(rmm::exec_policy_nosync(stream),
             valid_count_iter,
             valid_count_iter + group_valid_counts.size(),
             group_valid_counts.begin());
I believe all of this can be done with cudf APIs, which would be a bit shorter but would also mean this file could probably be just a .cpp file instead of a .cu file.
Here is my attempt:
auto zero = cudf::numeric_scalar<cudf::size_type>(0);
auto col = cudf::sequence(num_rows, zero);
auto rows_per_group = num_rows / num_groups;
auto rpg_scalar = cudf::numeric_scalar<cudf::size_type>(rows_per_group);
auto group_offsets = cudf::sequence(num_groups + 1, zero, rpg_scalar);
auto rpt_tbl = cudf::repeat(cudf::table_view({cudf::slice(col->view(), {0, num_groups}).front()}),
                            rows_per_group)->release();
auto group_labels = std::move(rpt_tbl.front());
auto group_valid_counts = cudf::sequence(num_groups, rpg_scalar, zero);
Initial flush: we should use cuda::std::XX in device and host-device code.
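For illustration, a minimal sketch of what that comment asks for, using cuda::std::numeric_limits from libcu++; the helper function itself is hypothetical, not part of the PR:

#include <cuda/std/limits>

// Hypothetical helper: cuda::std:: facilities are usable from host, device,
// and host-device code alike, unlike their std:: counterparts, which are not
// guaranteed to work inside __device__ code.
__host__ __device__ inline double safe_reciprocal(double v)
{
  return v == 0.0 ? cuda::std::numeric_limits<double>::max() : 1.0 / v;
}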
This PR implements optimizations for speeding up a specific part of the tdigest generation code. Tdigest generation happens in two parts:
- Choosing the cluster buckets (cutoff points) for each input group.
- Aggregating the input values into those buckets to produce the digest.
The issue lies with the first step. The process of choosing the buckets is inherently linear per input group. It uses a nonlinear 'scale function' to determine where the next bucket cutoff point should be based on where you currently are. The current implementation uses 1 thread per input group. The use case here involves only 1 group, with a large number of values. Broadly: ~5,000,000 input values, using a max of 10,000 segments, resulting in ~5000 actual segments being generated. The computation of those 5000 segments is the issue. As implemented today, it takes roughly 6.5ms to generate that on an A5000 GPU.
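For intuition, here is a minimal sketch of the serial loop this describes, assuming the t-digest 'k1' scale function (other scale functions exist); the names and structure are illustrative, not the PR's actual code:

#include <algorithm>
#include <cmath>
#include <vector>

// One scale-function step, k1 style: map the current quantile q into "k space"
// via k(q) = (delta / 2*pi) * asin(2q - 1), advance by one cluster, and map
// back via q(k) = (sin(2*pi*k / delta) + 1) / 2. Note the sin/asin pair.
double next_quantile(double q, double delta)
{
  double const pi = 3.14159265358979323846;
  double const k  = (delta / (2.0 * pi)) * std::asin(2.0 * q - 1.0);
  double const k1 = std::min(k + 1.0, delta / 4.0);  // clamp at q == 1
  return (std::sin(k1 * 2.0 * pi / delta) + 1.0) / 2.0;
}

// Each iteration consumes the previous iteration's output, so the work within
// a group cannot be parallelized; hence 1 thread per input group.
std::vector<double> make_cutoffs(double delta, int max_clusters)
{
  std::vector<double> cutoffs;
  double q = 0.0;
  while (q < 1.0 && static_cast<int>(cutoffs.size()) < max_clusters) {
    q = next_quantile(q, delta);
    cutoffs.push_back(q);
  }
  return cutoffs;
}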
There were 3 optimizations pursued:
First optimization: a CPU path for cases where we have small numbers of potentially large groups. The CPU takes roughly 250us to do the same work. This optimization works as follows: if we have up to 64 groups of work, run 32 of them on the CPU and the remainder on the GPU. 32 is the rough break-even point where they take about the same amount of time. Above 64 groups, run it all on the GPU, because the inherent parallelism starts to dominate and we don't have to do a device->pinned copy step to make the data available to the CPU. Since our specific customer case involves one group, this drops the processing time for the section from 6.5ms to 230us. Below we can see the case (highlighted in green) where the CPU and GPU are overlapping the same amount of work (32 groups CPU and 32 groups GPU).
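Separately from the screenshot referenced above, here is a minimal sketch of that split heuristic as described; the thresholds come from the text, but the names and structure are mine, not the PR's:

#include <algorithm>

struct work_split {
  int cpu_groups;
  int gpu_groups;
};

// Up to 64 groups: give the CPU its ~break-even share of 32 and overlap the
// remainder on the GPU. Above 64 groups, run everything on the GPU: the
// parallelism dominates and we skip the device->pinned copy entirely.
work_split choose_split(int num_groups)
{
  constexpr int break_even = 32;  // 1 CPU core ~= the GPU path at this size
  constexpr int max_split  = 64;  // beyond this, GPU-only
  if (num_groups > max_split) { return {0, num_groups}; }
  int const cpu = std::min(num_groups, break_even);
  return {cpu, num_groups - cpu};
}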
Second optimization: as with many GPU algorithms, we have to run this computation twice: once to determine the size of the outputs, and once to actually fill them in. So we're doubling our cost. However, we can make some reasonable assumptions about how big the worst case size will be. It is tied to the cluster limit (10,000, specified by the user). So instead of exactly computing the number of clusters per group in a separate pass, we can just allocate the worst case and do the whole process in one pass. This number can in theory be large, though. For example, 2000 groups * 10000 clusters * 8 bytes = 160MB. So I chose a somewhat arbitrary cap of 256 MB. If it is going to require more than that, we just do the two passes. It might be worth making this configurable via an environment variable.
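A minimal sketch of that single-pass decision, assuming clusters cost 8 bytes apiece as in the example above; the cap value comes from the description, while the helper name is illustrative:

#include <cstddef>

constexpr std::size_t worst_case_cap = 256 * 1024 * 1024;  // 256 MB

// Worst case: every group emits the full cluster limit. If that fits under
// the cap, allocate it up front and build the clusters in a single pass;
// otherwise fall back to the two-pass (size, then fill) approach.
bool use_single_pass(std::size_t num_groups, std::size_t max_clusters_per_group)
{
  std::size_t const worst_case_bytes = num_groups * max_clusters_per_group * sizeof(double);
  return worst_case_bytes <= worst_case_cap;
}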
Third optimization: a big part of the slowness is related to using doubles with some trig functions (sin and asin). Switching to floats brings the time down by about 40%, but it caused some accuracy errors that I couldn't easily fix by adjusting our error tolerance without it becoming unreasonable (in ulps). However, an easy reformulation of the math was pointed out to me which allows us to remove the sin and asin altogether, instead costing just a (double) sqrt per call. This results in about a 50% speedup for the function.
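For concreteness, here is one plausible form of such a reformulation, assuming the k1 scale function from the sketch above; this is my reading of the idea, not the PR's actual code. The k-space step c = 2*pi/delta is constant per group, so sin(asin(s) + c) with s = 2q - 1 expands via the angle-addition identity into s*cos(c) + sqrt(1 - s*s)*sin(c), leaving only a sqrt per call:

#include <cmath>

struct next_quantile_no_trig {
  double sin_c;  // sin(2*pi/delta), computed once per group
  double cos_c;  // cos(2*pi/delta), computed once per group

  explicit next_quantile_no_trig(double delta)
  {
    double const pi = 3.14159265358979323846;
    sin_c = std::sin(2.0 * pi / delta);
    cos_c = std::cos(2.0 * pi / delta);
  }

  // sin(asin(s) + c) == s*cos(c) + sqrt(1 - s*s)*sin(c), since asin(s) lies in
  // [-pi/2, pi/2], where cos is non-negative. (Clamping at the top of the
  // quantile range is omitted for brevity.)
  double operator()(double q) const
  {
    double const s = 2.0 * q - 1.0;
    return (s * cos_c + std::sqrt(1.0 - s * s) * sin_c + 1.0) / 2.0;
  }
};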
I added a benchmark specific to tdigest reductions (we needed it anyway), with emphasis on this specific case: small numbers of large groups. Speedups over the existing code are significant:

Old: [benchmark table]
New: [benchmark table]
This yields as much as a 10x speedup. You can see where the time starts to flatten out at 64 groups, where the parallelism of the GPU starts to win.
Existing (merge aggregation) benchmarks saw some wins too:

Old: [benchmark table]
New: [benchmark table]
Leaving as a draft for now to get some specific Spark testing done with it.