Skip to content

Use cuda::std:: traits and utilities for AST operators #19179

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jun 24, 2025

Conversation

PointKernel
Copy link
Member

@PointKernel PointKernel commented Jun 16, 2025

Description

While reviewing #19105, I noticed that the AST operators make extensive use of std:: traits and utilities within device functions. This is suboptimal, as it depends on relaxed constexpr behavior and may lead to undefined behavior. This PR updates all such usages to their corresponding cuda::std:: equivalents thus helping #7795.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@PointKernel PointKernel self-assigned this Jun 16, 2025
@PointKernel PointKernel requested a review from a team as a code owner June 16, 2025 18:15
@PointKernel PointKernel added libcudf Affects libcudf (C++/CUDA) code. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jun 16, 2025
Copy link
Contributor

@bdice bdice Jun 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a .cuh? See rules in developer guide. This is primarily __device__ functions, so I can see the argument, but I believe CUDF_HOST_DEVICE is meant to be usable by host compilers too.

Copy link
Member Author

@PointKernel PointKernel Jun 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure CUDF_HOST_DEVICE was ever meant to support host-only compilers. I had assumed it was just a temporary workaround from when CCCL wasn’t mature, and we had to rely on relaxed constexpr to use STL utilities on device. Back then, we didn’t really try to separate host and device usage because we didn’t have the option. Also, since CUDA rdc is disabled in libcudf, device function definitions can't be separated from declarations. If they contain host-device code, even headers ending in .hpp have likely always been compiled with nvcc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many of the pieces like ast_operator_dispatcher, ast_operator_return_type, ast_operator_arity, etc. are meant to be usable in host-only code, that's how we build up the AST program.

Copy link
Member Author

@PointKernel PointKernel Jun 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me move ast_operator_return_type and ast_operator_arity to operators.hpp. As for ast_operator_dispatcher, it seems that it's used in both host and device code so I will will probably leave it in the .cuh?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a .cuh? See rules in developer guide. This is primarily device functions, so I can see the argument, but I believe CUDF_HOST_DEVICE is meant to be usable by host compilers too.

Counterpoint: Given that the operator()()s run in a __device__ context, it appears that there isn't any getting around nvcc having to process this file. Wouldn't switching it to a .hpp then at best match the current compile time?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, CUDF_HOST_DEVICE does not rely on relaxed constexpr. The whole point of specifying CUDF_HOST_DEVICE is so that we get valid host and device callables for those functions without any UB around constexpr functions.

@PointKernel PointKernel requested a review from bdice June 17, 2025 20:39
static constexpr auto arity{2};

template <typename LHS, typename RHS>
__device__ inline auto operator()(LHS lhs, RHS rhs) -> decltype(lhs + rhs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can get away with these being const:

Suggested change
__device__ inline auto operator()(LHS lhs, RHS rhs) -> decltype(lhs + rhs)
__device__ inline auto operator()(LHS lhs, RHS rhs) const -> decltype(lhs + rhs)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an aside, is there value in lhs and rhs being passed by reference instead of value? Given that operator_functor is stateless, shouldn't pass-by-reference be better?

Or is it that the types are native (or at least cheap to copy, like string_view)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. If we're only dealing with primitive types like ints and floats, passing by value is more efficient and avoids the overhead of pointer or reference indirection. This is particularly important on GPUs, where passing by value can help reduce register usage, a well-known performance bottleneck, especially for AST operators. I took a quick look through the file and didn’t see any template specializations for complex types like string_view.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing cudf::string_view by value has also shown reduced register pressure in other cases (e.g. regex device code).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is very sensitive to by-value / by-reference. It must be passed by value to keep register pressure down.

Copy link
Contributor

@bdice bdice Jun 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I recall some unexpected performance sensitivity around constness as well. 😒 Please benchmark any changes you make here, with mixed joins and AST transforms.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sensitivity around constness...

Constness of the method or the arguments? (The latter, I can see.)

Thank you for the notes on register pressure. TIL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do see about 1% slowdown when making operators const, e.g. for mixed join:

# mixed_inner_join

## [0] Quadro RTX 8000

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|------------|---------|----------|
|  I32  |     0      |    1000     |     1000     | 132.910 us |      10.02% | 130.861 us |      10.29% |  -2.049 us |  -1.54% |   SAME   |
|  I32  |     0      |   100000    |     1000     | 159.854 us |       2.35% | 158.918 us |       3.56% |  -0.936 us |  -0.59% |   SAME   |
|  I32  |     0      |  10000000   |     1000     |   5.281 ms |       0.80% |   5.253 ms |       0.86% | -28.259 us |  -0.54% |   SAME   |
|  I32  |     0      |   100000    |    100000    | 183.792 us |       3.61% | 183.111 us |       4.78% |  -0.681 us |  -0.37% |   SAME   |
|  I32  |     0      |  10000000   |    100000    |   6.205 ms |       0.50% |   6.207 ms |       0.50% |   2.867 us |   0.05% |   SAME   |
|  I32  |     0      |  10000000   |   10000000   |  24.936 ms |       0.08% |  24.995 ms |       0.05% |  58.292 us |   0.23% |   SLOW   |
|  I32  |     1      |    1000     |     1000     | 125.346 us |       6.82% | 125.753 us |       4.35% |   0.407 us |   0.32% |   SAME   |
|  I32  |     1      |   100000    |     1000     | 191.004 us |       4.90% | 186.326 us |       3.57% |  -4.678 us |  -2.45% |   SAME   |
|  I32  |     1      |  10000000   |     1000     |   6.961 ms |       0.22% |   7.026 ms |       0.18% |  64.520 us |   0.93% |   SLOW   |
|  I32  |     1      |   100000    |    100000    | 214.082 us |       1.67% | 211.498 us |       1.71% |  -2.584 us |  -1.21% |   SAME   |
|  I32  |     1      |  10000000   |    100000    |   8.381 ms |       0.18% |   8.460 ms |       0.19% |  78.724 us |   0.94% |   SLOW   |
|  I32  |     1      |  10000000   |   10000000   |  11.391 ms |       0.29% |  11.427 ms |       0.14% |  36.116 us |   0.32% |   SLOW   |
|  I64  |     0      |    1000     |     1000     | 130.117 us |       6.49% | 128.317 us |       3.66% |  -1.800 us |  -1.38% |   SAME   |
|  I64  |     0      |   100000    |     1000     | 183.099 us |       3.07% | 181.291 us |       1.82% |  -1.808 us |  -0.99% |   SAME   |
|  I64  |     0      |  10000000   |     1000     |   6.581 ms |       0.50% |   6.592 ms |       0.52% |  10.553 us |   0.16% |   SAME   |
|  I64  |     0      |   100000    |    100000    | 205.050 us |       2.08% | 206.821 us |       3.17% |   1.772 us |   0.86% |   SAME   |
|  I64  |     0      |  10000000   |    100000    |   7.491 ms |       0.50% |   7.488 ms |       0.11% |  -3.175 us |  -0.04% |   SAME   |
|  I64  |     0      |  10000000   |   10000000   |  25.330 ms |       0.03% |  25.403 ms |       0.07% |  72.953 us |   0.29% |   SLOW   |
|  I64  |     1      |    1000     |     1000     | 130.264 us |       2.81% | 132.201 us |       9.46% |   1.937 us |   1.49% |   SAME   |
|  I64  |     1      |   100000    |     1000     | 195.976 us |       4.53% | 195.315 us |       2.02% |  -0.662 us |  -0.34% |   SAME   |
|  I64  |     1      |  10000000   |     1000     |   7.115 ms |       0.21% |   7.159 ms |       0.19% |  44.305 us |   0.62% |   SLOW   |
|  I64  |     1      |   100000    |    100000    | 211.711 us |       1.68% | 213.626 us |       1.67% |   1.915 us |   0.90% |   SAME   |
|  I64  |     1      |  10000000   |    100000    |   8.547 ms |       0.21% |   8.601 ms |       0.25% |  53.279 us |   0.62% |   SLOW   |
|  I64  |     1      |  10000000   |   10000000   |  11.503 ms |       0.17% |  11.548 ms |       0.15% |  45.039 us |   0.39% |   SLOW   |

# mixed_left_join

## [0] Quadro RTX 8000

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|------------|---------|----------|
|  I32  |     0      |    1000     |     1000     | 124.804 us |       2.37% | 124.582 us |       2.89% |  -0.222 us |  -0.18% |   SAME   |
|  I32  |     0      |   100000    |     1000     | 165.494 us |       3.29% | 166.799 us |       2.71% |   1.305 us |   0.79% |   SAME   |
|  I32  |     0      |  10000000   |     1000     |   5.864 ms |       0.85% |   5.835 ms |       0.85% | -29.155 us |  -0.50% |   SAME   |
|  I32  |     0      |   100000    |    100000    | 191.874 us |       2.16% | 193.675 us |       3.38% |   1.801 us |   0.94% |   SAME   |
|  I32  |     0      |  10000000   |    100000    |   6.873 ms |       0.50% |   6.858 ms |       0.46% | -15.563 us |  -0.23% |   SAME   |
|  I32  |     0      |  10000000   |   10000000   |  25.489 ms |       0.07% |  25.550 ms |       0.11% |  61.195 us |   0.24% |   SLOW   |
|  I32  |     1      |    1000     |     1000     | 127.568 us |       2.90% | 127.223 us |       8.21% |  -0.345 us |  -0.27% |   SAME   |
|  I32  |     1      |   100000    |     1000     | 195.345 us |       2.53% | 192.850 us |       2.47% |  -2.494 us |  -1.28% |   SAME   |
|  I32  |     1      |  10000000   |     1000     |   7.771 ms |       0.16% |   7.917 ms |       0.20% | 146.327 us |   1.88% |   SLOW   |
|  I32  |     1      |   100000    |    100000    | 216.120 us |       1.61% | 217.166 us |       1.85% |   1.046 us |   0.48% |   SAME   |
|  I32  |     1      |  10000000   |    100000    |   9.259 ms |       0.16% |   9.386 ms |       0.15% | 126.970 us |   1.37% |   SLOW   |
|  I32  |     1      |  10000000   |   10000000   |  12.237 ms |       0.13% |  12.334 ms |       0.13% |  96.711 us |   0.79% |   SLOW   |
|  I64  |     0      |    1000     |     1000     | 128.702 us |       2.71% | 128.971 us |       2.59% |   0.270 us |   0.21% |   SAME   |
|  I64  |     0      |   100000    |     1000     | 193.107 us |       3.73% | 191.436 us |       2.61% |  -1.671 us |  -0.87% |   SAME   |
|  I64  |     0      |  10000000   |     1000     |   7.078 ms |       0.73% |   7.069 ms |       0.72% |  -9.482 us |  -0.13% |   SAME   |
|  I64  |     0      |   100000    |    100000    | 211.049 us |       2.34% | 212.658 us |       3.70% |   1.608 us |   0.76% |   SAME   |
|  I64  |     0      |  10000000   |    100000    |   8.074 ms |       0.50% |   8.084 ms |       0.50% |  10.309 us |   0.13% |   SAME   |
|  I64  |     0      |  10000000   |   10000000   |  25.852 ms |       0.04% |  25.932 ms |       0.10% |  80.211 us |   0.31% |   SLOW   |
|  I64  |     1      |    1000     |     1000     | 133.222 us |       4.65% | 133.169 us |       4.50% |  -0.053 us |  -0.04% |   SAME   |
|  I64  |     1      |   100000    |     1000     | 198.697 us |       2.09% | 202.376 us |       3.17% |   3.679 us |   1.85% |   SAME   |
|  I64  |     1      |  10000000   |     1000     |   7.942 ms |       0.22% |   8.029 ms |       0.18% |  87.332 us |   1.10% |   SLOW   |
|  I64  |     1      |   100000    |    100000    | 219.467 us |       2.21% | 221.055 us |       2.15% |   1.587 us |   0.72% |   SAME   |
|  I64  |     1      |  10000000   |    100000    |   9.429 ms |       0.14% |   9.513 ms |       0.16% |  83.830 us |   0.89% |   SLOW   |
|  I64  |     1      |  10000000   |   10000000   |  12.363 ms |       0.15% |  12.459 ms |       0.14% |  95.220 us |   0.77% |   SLOW   |

# mixed_full_join

## [0] Quadro RTX 8000

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|------------|---------|----------|
|  I32  |     0      |    1000     |     1000     | 186.402 us |       5.14% | 187.243 us |       3.55% |   0.840 us |   0.45% |   SAME   |
|  I32  |     0      |   100000    |     1000     | 204.759 us |       5.05% | 204.760 us |       2.69% |   0.001 us |   0.00% |   SAME   |
|  I32  |     0      |  10000000   |     1000     |   6.243 ms |       0.82% |   6.233 ms |       0.80% |  -9.219 us |  -0.15% |   SAME   |
|  I32  |     0      |   100000    |    100000    | 261.039 us |       5.22% | 257.991 us |       3.41% |  -3.047 us |  -1.17% |   SAME   |
|  I32  |     0      |  10000000   |    100000    |   7.126 ms |       0.50% |   7.147 ms |       0.50% |  20.664 us |   0.29% |   SAME   |
|  I32  |     0      |  10000000   |   10000000   |  27.310 ms |       0.07% |  27.355 ms |       0.08% |  44.859 us |   0.16% |   SLOW   |
|  I32  |     1      |    1000     |     1000     | 187.769 us |       4.25% | 185.046 us |       3.52% |  -2.723 us |  -1.45% |   SAME   |
|  I32  |     1      |   100000    |     1000     | 256.381 us |       2.64% | 258.007 us |       5.78% |   1.626 us |   0.63% |   SAME   |
|  I32  |     1      |  10000000   |     1000     |   8.194 ms |       0.21% |   8.338 ms |       0.50% | 143.214 us |   1.75% |   SLOW   |
|  I32  |     1      |   100000    |    100000    | 278.476 us |       1.99% | 284.770 us |       3.33% |   6.294 us |   2.26% |   SLOW   |
|  I32  |     1      |  10000000   |    100000    |   9.687 ms |       0.21% |   9.807 ms |       0.17% | 120.468 us |   1.24% |   SLOW   |
|  I32  |     1      |  10000000   |   10000000   |  13.219 ms |       0.16% |  13.312 ms |       0.11% |  92.846 us |   0.70% |   SLOW   |
|  I64  |     0      |    1000     |     1000     | 189.413 us |       3.85% | 190.383 us |       4.73% |   0.969 us |   0.51% |   SAME   |
|  I64  |     0      |   100000    |     1000     | 230.558 us |       1.93% | 229.936 us |       3.78% |  -0.623 us |  -0.27% |   SAME   |
|  I64  |     0      |  10000000   |     1000     |   7.458 ms |       0.86% |   7.484 ms |       0.78% |  26.126 us |   0.35% |   SAME   |
|  I64  |     0      |   100000    |    100000    | 274.719 us |       2.01% | 278.216 us |       3.04% |   3.498 us |   1.27% |   SAME   |
|  I64  |     0      |  10000000   |    100000    |   8.353 ms |       0.43% |   8.424 ms |       0.54% |  70.820 us |   0.85% |   SLOW   |
|  I64  |     0      |  10000000   |   10000000   |  27.661 ms |       0.04% |  27.741 ms |       0.09% |  79.485 us |   0.29% |   SLOW   |
|  I64  |     1      |    1000     |     1000     | 194.380 us |       2.67% | 196.744 us |       3.10% |   2.364 us |   1.22% |   SAME   |
|  I64  |     1      |   100000    |     1000     | 259.710 us |       1.35% | 264.317 us |       2.14% |   4.607 us |   1.77% |   SLOW   |
|  I64  |     1      |  10000000   |     1000     |   8.365 ms |       0.22% |   8.469 ms |       0.33% | 104.722 us |   1.25% |   SLOW   |
|  I64  |     1      |   100000    |    100000    | 285.634 us |       2.74% | 288.712 us |       2.30% |   3.078 us |   1.08% |   SAME   |
|  I64  |     1      |  10000000   |    100000    |   9.866 ms |       0.37% |   9.942 ms |       0.15% |  75.998 us |   0.77% |   SLOW   |
|  I64  |     1      |  10000000   |   10000000   |  13.368 ms |       0.26% |  13.452 ms |       0.16% |  84.495 us |   0.63% |   SLOW   |

The impact on AST benchmarks is even smaller, so I assume making them const should be acceptable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for benchmarking. I'm happy with this outcome, more const is better and I'm glad the compiler was happy with it.

Comment on lines +876 to +877
// Case 3: One value is null, while the other is not. If it's true we return null, otherwise we
// return false.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is dead on, and agrees with Spark's semantics as well.

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, apart from a couple of nitpicks.

@PointKernel
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit bd518be into rapidsai:branch-25.08 Jun 24, 2025
91 checks passed
@PointKernel PointKernel deleted the cccl-ast-operators branch June 24, 2025 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants