@mroeschke
Contributor

Description

xref #20418 (comment)

Output with validation
~/cudf$ python python/cudf/cudf/pandas/_benchmarks/pdsh.py 0,1,2,3,4,5,6,7,8,9 --validate --scale=0.1 --iterations=1 --path "/cudf/sf0.1" 

✅ Query 0 passed validation!
Empty DataFrame
Columns: []
Index: []
Query 0 - Iteration 0 finished in 0.0003s
❌ query=1 unable to execute on CPU. Skipping validation.
Traceback (most recent call last):
  File "/pandas/core/ops/array_ops.py", line 218, in _na_arithmetic_op
    result = func(left, right)
  File "/pandas/core/computation/expressions.py", line 242, in evaluate
    return _evaluate(op, op_str, a, b)  # type: ignore[misc]
  File "/pandas/core/computation/expressions.py", line 73, in _evaluate_standard
    return op(a, b)
  File "/pandas/core/roperator.py", line 15, in rsub
    return right - left
           ~~~~~~^~~~~~
TypeError: unsupported operand type(s) for -: 'float' and 'decimal.Decimal'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cudf/python/cudf/cudf/pandas/_benchmarks/utils.py", line 469, in run_pandas
    cpu_result = execute_query(q_id, i, q, cpu_run_config)
  File "cudf/python/cudf/cudf/pandas/_benchmarks/utils.py", line 306, in execute_query
    return q(run_config)
  File "cudf/python/cudf/cudf/pandas/_benchmarks/pdsh.py", line 58, in q1
    filt["disc_price"] = filt.l_extendedprice * (1.0 - filt.l_discount)
                                                 ~~~~^~~~~~~~~~~~~~~~~
  File "/pandas/core/ops/common.py", line 76, in new_method
    return method(self, other)
  File "/pandas/core/arraylike.py", line 198, in __rsub__
    return self._arith_method(other, roperator.rsub)
           ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "/pandas/core/series.py", line 6154, in _arith_method
    return base.IndexOpsMixin._arith_method(self, other, op)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/pandas/core/base.py", line 1391, in _arith_method
    result = ops.arithmetic_op(lvalues, rvalues, op)
  File "/pandas/core/ops/array_ops.py", line 283, in arithmetic_op
    res_values = _na_arithmetic_op(left, right, op)  # type: ignore[arg-type]
  File "/pandas/core/ops/array_ops.py", line 227, in _na_arithmetic_op
    result = _masked_arith_op(left, right, op)
  File "/pandas/core/ops/array_ops.py", line 182, in _masked_arith_op
    result[mask] = op(xrav[mask], y)
                   ~~^^^^^^^^^^^^^^^
  File "/pandas/core/roperator.py", line 15, in rsub
    return right - left
           ~~~~~~^~~~~~
TypeError: unsupported operand type(s) for -: 'float' and 'decimal.Decimal'

  l_returnflag l_linestatus     sum_qty  ...     avg_price  avg_disc  count_order
0            A            F  3774200.00  ...  36002.123829  0.050145       147790
1            N            F    95257.00  ...  35521.326916  0.049394         3765
2            N            O  7459297.00  ...  36000.924688  0.050096       292000
3            R            F  3785523.00  ...  35994.029214  0.049989       148301

[4 rows x 10 columns]
Query 1 - Iteration 0 finished in 2.2644s
✅ Query 2 passed validation!
   s_acctbal  ...                                          s_comment
29   9828.21  ...             s the slyly even ideas poach fluffily 
5    9508.37  ...  ests sleep quickly express ideas. ironic ideas...
39   9508.37  ...  ests sleep quickly express ideas. ironic ideas...
21   9453.01  ...  gular frets. permanently special multipliers b...
31   9453.01  ...  gular frets. permanently special multipliers b...
32   9192.10  ...  es across the carefully express accounts boost...
10   9032.15  ...                      nding dependencies nag furiou
27   8702.02  ...  oss the deposits cajole carefully even pinto b...
22   8615.50  ...  y quickly regular deposits? quickly pending pa...
34   8615.50  ...  y quickly regular deposits? quickly pending pa...
14   8488.53  ...  ages. carefully final excuses nag finally. car...
26   8430.52  ...  ites among the always final ideas kindle accor...
7    8271.39  ...  s cajole quickly special requests. quickly ent...
0    8096.98  ...  ully after the regular requests. slyly final d...
16   7392.78  ...                   ake carefully across the quickly
24   7205.20  ...   excuses wake express deposits. furiously care...
30   6820.35  ...  s unwind silently furiously regular courts. fi...
6    6721.70  ...                   ect blithely blithely final acco
23   6329.90  ...                 ironic forges cajole blithely agai
41   6173.87  ...  blithely pending packages cajole furiously sly...
33   5364.99  ...      packages boost carefully. express ideas along
36   5069.27  ...           he unusual ideas. slyly final packages a
15   4941.88  ...                         y final requests impress s
28   4672.25  ...  arls wake furiously deposits. even, regular depen
11   4586.49  ...   the regularly regular dependencies. carefully...
42   4518.31  ...  ts detect along the foxes. final Tiresias are....
43   4315.15  ...  ronic orbits are furiously across the requests...
17   3526.53  ...                          lar dinos nag slyly brave
37   3526.53  ...                          lar dinos nag slyly brave
8    3294.68  ...  e slyly special foxes. furiously unusual depos...
1    2972.26  ...     ously express ideas haggle quickly dugouts? fu
4    2963.09  ...  eep blithely regular dependencies. blithely re...
35   2221.25  ...  nal foxes eat slyly about the fluffily permane...
40   1381.97  ...  gular ideas. bravely bold deposits haggle thro...
18    906.07  ...  ickly unusual requests cajole. accounts above ...
25    765.69  ...  nusual requests. furiously unusual epitaphs in...
13    727.89  ...  gular excuses. furiously regular excuses sleep...
9     683.07  ...                    ly regular requests cajole abou
2     167.56  ...    the theodolites. ironic, ironic deposits above 
19     91.39  ...       pinto beans. carefully express requests hagg
38   -314.06  ...                    bold deposits. carefully even d
3    -820.89  ...   y final, slow theodolites. furiously regular req
20   -845.44  ...                         ctions. carefully sly requ
12   -942.73  ...  slyly furiously final decoys; silent, special ...

[44 rows x 8 columns]
Query 2 - Iteration 0 finished in 0.1925s
❌ Query 3 failed validation!
Attributes of DataFrame.iloc[:, 2] (column name="o_orderdate") are different

Attribute "dtype" are different
[left]:  datetime64[s]
[right]: object
      l_orderkey      revenue o_orderdate  o_shippriority
435       223140  355369.0698  1995-03-14               0
1175      584291  354494.7318  1995-02-21               0
796       405063  353125.4577  1995-03-03               0
1150      573861  351238.2770  1995-03-09               0
1113      554757  349181.7426  1995-03-14               0
1019      506021  321075.5810  1995-03-10               0
218       121604  318576.4154  1995-03-07               0
197       108514  314967.0754  1995-02-20               0
928       462502  312604.5420  1995-03-08               0
346       178727  309728.9306  1995-02-25               0
Query 3 - Iteration 0 finished in 0.1852s
✅ Query 4 passed validation!
   o_orderpriority  order_count
0         1-URGENT          999
1           2-HIGH          997
2         3-MEDIUM         1031
3  4-NOT SPECIFIED          989
4            5-LOW         1077
Query 4 - Iteration 0 finished in 0.1540s
❌ query=5 unable to execute on CPU. Skipping validation.
Traceback (most recent call last):
  File "/pandas/core/ops/array_ops.py", line 218, in _na_arithmetic_op
    result = func(left, right)
  File "/pandas/core/computation/expressions.py", line 242, in evaluate
    return _evaluate(op, op_str, a, b)  # type: ignore[misc]
  File "/pandas/core/computation/expressions.py", line 73, in _evaluate_standard
    return op(a, b)
  File "/pandas/core/roperator.py", line 15, in rsub
    return right - left
           ~~~~~~^~~~~~
TypeError: unsupported operand type(s) for -: 'float' and 'decimal.Decimal'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cudf/python/cudf/cudf/pandas/_benchmarks/utils.py", line 469, in run_pandas
    cpu_result = execute_query(q_id, i, q, cpu_run_config)
  File "cudf/python/cudf/cudf/pandas/_benchmarks/utils.py", line 306, in execute_query
    return q(run_config)
  File "cudf/python/cudf/cudf/pandas/_benchmarks/pdsh.py", line 230, in q5
    jn5["revenue"] = jn5.l_extendedprice * (1.0 - jn5.l_discount)
                                            ~~~~^~~~~~~~~~~~~~~~
  File "/pandas/core/ops/common.py", line 76, in new_method
    return method(self, other)
  File "/pandas/core/arraylike.py", line 198, in __rsub__
    return self._arith_method(other, roperator.rsub)
           ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "/pandas/core/series.py", line 6154, in _arith_method
    return base.IndexOpsMixin._arith_method(self, other, op)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/pandas/core/base.py", line 1391, in _arith_method
    result = ops.arithmetic_op(lvalues, rvalues, op)
  File "/pandas/core/ops/array_ops.py", line 283, in arithmetic_op
    res_values = _na_arithmetic_op(left, right, op)  # type: ignore[arg-type]
  File "/pandas/core/ops/array_ops.py", line 227, in _na_arithmetic_op
    result = _masked_arith_op(left, right, op)
  File "/pandas/core/ops/array_ops.py", line 182, in _masked_arith_op
    result[mask] = op(xrav[mask], y)
                   ~~^^^^^^^^^^^^^^^
  File "/pandas/core/roperator.py", line 15, in rsub
    return right - left
           ~~~~~~^~~~~~
TypeError: unsupported operand type(s) for -: 'float' and 'decimal.Decimal'

      n_name       revenue
4    VIETNAM -4.497841e+06
2  INDONESIA -5.580475e+06
3      JAPAN -6.000077e+06
1      INDIA -6.376122e+06
0      CHINA -7.822103e+06
Query 5 - Iteration 0 finished in 0.2077s
❌ Query 6 failed validation!
DataFrame.iloc[:, 0] (column name="revenue") are different

DataFrame.iloc[:, 0] (column name="revenue") values are different (100.0 %)
[index]: [0]
[left]:  [11803420.2534]
[right]: [8575171.6224]
At positional index 0, first diff: 11803420.2534 != 8575171.6224
         revenue
0  11803420.2534
Query 6 - Iteration 0 finished in 0.0627s
❌ query=7 unable to execute on CPU. Skipping validation.
Traceback (most recent call last):
  File "/pandas/core/ops/array_ops.py", line 218, in _na_arithmetic_op
    result = func(left, right)
  File "/pandas/core/computation/expressions.py", line 242, in evaluate
    return _evaluate(op, op_str, a, b)  # type: ignore[misc]
  File "/pandas/core/computation/expressions.py", line 73, in _evaluate_standard
    return op(a, b)
  File "/pandas/core/roperator.py", line 15, in rsub
    return right - left
           ~~~~~~^~~~~~
TypeError: unsupported operand type(s) for -: 'float' and 'decimal.Decimal'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cudf/python/cudf/cudf/pandas/_benchmarks/utils.py", line 469, in run_pandas
    cpu_result = execute_query(q_id, i, q, cpu_run_config)
  File "cudf/python/cudf/cudf/pandas/_benchmarks/utils.py", line 306, in execute_query
    return q(run_config)
  File "cudf/python/cudf/cudf/pandas/_benchmarks/pdsh.py", line 306, in q7
    1.0 - total["l_discount"]
    ~~~~^~~~~~~~~~~~~~~~~~~~~
  File "/pandas/core/ops/common.py", line 76, in new_method
    return method(self, other)
  File "/pandas/core/arraylike.py", line 198, in __rsub__
    return self._arith_method(other, roperator.rsub)
           ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "/pandas/core/series.py", line 6154, in _arith_method
    return base.IndexOpsMixin._arith_method(self, other, op)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/pandas/core/base.py", line 1391, in _arith_method
    result = ops.arithmetic_op(lvalues, rvalues, op)
  File "/pandas/core/ops/array_ops.py", line 283, in arithmetic_op
    res_values = _na_arithmetic_op(left, right, op)  # type: ignore[arg-type]
  File "/pandas/core/ops/array_ops.py", line 227, in _na_arithmetic_op
    result = _masked_arith_op(left, right, op)
  File "/pandas/core/ops/array_ops.py", line 182, in _masked_arith_op
    result[mask] = op(xrav[mask], y)
                   ~~^^^^^^^^^^^^^^^
  File "/pandas/core/roperator.py", line 15, in rsub
    return right - left
           ~~~~~~^~~~~~
TypeError: unsupported operand type(s) for -: 'float' and 'decimal.Decimal'

  supp_nation cust_nation  l_year       revenue
0      FRANCE     GERMANY    1995 -4.637235e+06
1      FRANCE     GERMANY    1996 -5.224780e+06
2     GERMANY      FRANCE    1995 -6.232819e+06
3     GERMANY      FRANCE    1996 -5.557312e+06
Query 7 - Iteration 0 finished in 0.3059s
❌ query=8 unable to execute on CPU. Skipping validation.
Traceback (most recent call last):
  File "cudf/python/cudf/cudf/pandas/_benchmarks/utils.py", line 469, in run_pandas
    cpu_result = execute_query(q_id, i, q, cpu_run_config)
  File "cudf/python/cudf/cudf/pandas/_benchmarks/utils.py", line 306, in execute_query
    return q(run_config)
  File "cudf/python/cudf/cudf/pandas/_benchmarks/pdsh.py", line 357, in q8
    jn7["o_year"] = jn7["o_orderdate"].dt.year
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/pandas/core/generic.py", line 6321, in __getattr__
    return object.__getattribute__(self, name)
           ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/pandas/core/accessor.py", line 224, in __get__
    accessor_obj = self._accessor(obj)
  File "/pandas/core/indexes/accessors.py", line 643, in __new__
    raise AttributeError("Can only use .dt accessor with datetimelike values")
AttributeError: Can only use .dt accessor with datetimelike values. Did you mean: 'at'?

   o_year  mkt_share
0    1995       0.03
1    1996       0.02
Query 8 - Iteration 0 finished in 0.6356s
❌ query=9 unable to execute on CPU. Skipping validation.
Traceback (most recent call last):
  File "cudf/python/cudf/cudf/pandas/_benchmarks/utils.py", line 469, in run_pandas
    cpu_result = execute_query(q_id, i, q, cpu_run_config)
  File "cudf/python/cudf/cudf/pandas/_benchmarks/utils.py", line 306, in execute_query
    return q(run_config)
  File "cudf/python/cudf/cudf/pandas/_benchmarks/pdsh.py", line 396, in q9
    jn5["o_year"] = jn5["o_orderdate"].dt.year
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/pandas/core/generic.py", line 6321, in __getattr__
    return object.__getattribute__(self, name)
           ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/pandas/core/accessor.py", line 224, in __get__
    accessor_obj = self._accessor(obj)
  File "/pandas/core/indexes/accessors.py", line 643, in __new__
    raise AttributeError("Can only use .dt accessor with datetimelike values")
AttributeError: Can only use .dt accessor with datetimelike values. Did you mean: 'at'?

      nation  o_year    sum_profit
0    ALGERIA    1998  4.716279e+06
1    ALGERIA    1997  8.071240e+06
2    ALGERIA    1996  9.273503e+06
3    ALGERIA    1995  8.472341e+06
4    ALGERIA    1994  8.718336e+06
..       ...     ...           ...
170  VIETNAM    1996  8.576511e+06
171  VIETNAM    1995  8.890273e+06
172  VIETNAM    1994  8.934413e+06
173  VIETNAM    1993  6.282243e+06
174  VIETNAM    1992  8.378368e+06

[175 rows x 3 columns]
Query 9 - Iteration 0 finished in 0.4312s
Iteration Summary
=======================================
query: 0
path: cudf/sf0.1
scale_factor: 0.1
executor: in-memory
iterations: 1
---------------------------------------
min time : 0.0003
max time : 0.0003
mean time: 0.0003
=======================================
query: 1
path: cudf/sf0.1
scale_factor: 0.1
executor: in-memory
iterations: 1
---------------------------------------
min time : 2.2644
max time : 2.2644
mean time: 2.2644
=======================================
query: 2
path: cudf/sf0.1
scale_factor: 0.1
executor: in-memory
iterations: 1
---------------------------------------
min time : 0.1925
max time : 0.1925
mean time: 0.1925
=======================================
query: 3
path: cudf/sf0.1
scale_factor: 0.1
executor: in-memory
iterations: 1
---------------------------------------
min time : 0.1852
max time : 0.1852
mean time: 0.1852
=======================================
query: 4
path: cudf/sf0.1
scale_factor: 0.1
executor: in-memory
iterations: 1
---------------------------------------
min time : 0.1540
max time : 0.1540
mean time: 0.1540
=======================================
query: 5
path: cudf/sf0.1
scale_factor: 0.1
executor: in-memory
iterations: 1
---------------------------------------
min time : 0.2077
max time : 0.2077
mean time: 0.2077
=======================================
query: 6
path: cudf/sf0.1
scale_factor: 0.1
executor: in-memory
iterations: 1
---------------------------------------
min time : 0.0627
max time : 0.0627
mean time: 0.0627
=======================================
query: 7
path: cudf/sf0.1
scale_factor: 0.1
executor: in-memory
iterations: 1
---------------------------------------
min time : 0.3059
max time : 0.3059
mean time: 0.3059
=======================================
query: 8
path: cudf/sf0.1
scale_factor: 0.1
executor: in-memory
iterations: 1
---------------------------------------
min time : 0.6356
max time : 0.6356
mean time: 0.6356
=======================================
query: 9
path: cudf/sf0.1
scale_factor: 0.1
executor: in-memory
iterations: 1
---------------------------------------
min time : 0.4312
max time : 0.4312
mean time: 0.4312
=======================================
Total mean time across all queries: 4.4396 seconds

Validation Summary
==================
2 queries failed validation: [3, 6]
  1. I noticed disable_module_accelerator doesn't set cudf.pandas.LOADED to False, but maybe this is OK (or impractical to change): cudf.pandas can be called in a threaded context (which we have tests for), and mutating this global variable there probably isn't safe.
  2. Some of these queries don't fully run with pandas due to pandas data type limitations when reading from parquet:
    a. Decimal data is represented as an object column of decimal.Decimal values, which doesn't support binary subtraction with floats.
    b. Date data is represented as string data (object dtype), so the dt accessor doesn't work on it.
  • To address 2, we could modify the queries to ensure they run with pandas, but I didn't necessarily intend these benchmarks to be directly comparable with pandas. Open to making the modifications though.
  • Query 3 fails due to 2b: pandas represents dates from parquet as strings (object) and cuDF represents them as datetime64[s].
  • Query 6 actually fails because we return a different numerical result than pandas.
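
For reference, the two pandas limitations in point 2 can be reproduced without the benchmark data. The snippet below is only a minimal sketch: the `discount` and `orderdate` Series are hypothetical stand-ins for `l_discount` and `o_orderdate` as described above (object dtype holding `decimal.Decimal` values and date strings), and the casts shown are one possible way to make the expressions run on pandas, not a change made in this PR.

```python
from decimal import Decimal

import pandas as pd

# Hypothetical stand-ins for the parquet-derived columns described above.
discount = pd.Series([Decimal("0.05"), Decimal("0.10")], dtype=object)
orderdate = pd.Series(["1995-03-14", "1996-02-21"], dtype=object)


def fails_with(exc_type, fn):
    """Return True if calling fn() raises exc_type, False otherwise."""
    try:
        fn()
    except exc_type:
        return True
    return False


# 2a: float - Decimal is unsupported, matching the TypeError in the log.
float_minus_decimal_fails = fails_with(TypeError, lambda: 1.0 - discount)

# 2b: the .dt accessor is rejected on object-dtype data.
dt_on_object_fails = fails_with(AttributeError, lambda: orderdate.dt.year)

# Possible workarounds (assumptions, not what the benchmark does today):
# cast decimals to float64 and parse the dates before using .dt.
one_minus_discount = 1.0 - discount.astype("float64")
order_year = pd.to_datetime(orderdate).dt.year

print(float_minus_decimal_fails, dt_on_object_fails)
print(order_year.tolist())
```

Casting decimals to float64 trades exactness for compatibility, so results may differ slightly from a decimal-based computation; that trade-off would need to be considered before applying it to the queries.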

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke self-assigned this Nov 6, 2025
@mroeschke mroeschke requested a review from a team as a code owner November 6, 2025 22:57
@mroeschke mroeschke added the bug Something isn't working label Nov 6, 2025
@mroeschke mroeschke added the non-breaking Non-breaking change label Nov 6, 2025
@mroeschke mroeschke added the cudf.pandas Issues specific to cudf.pandas label Nov 6, 2025
@github-actions github-actions bot added the Python Affects Python cuDF API. label Nov 6, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python Nov 6, 2025
@mroeschke mroeschke mentioned this pull request Nov 6, 2025
3 tasks