[BUG] Multiple DataFrame.loc operations gives confusing error message upon compute on Dask-cuDF #11434

@alextxu

Describe the bug
After creating a Dask-cuDF DataFrame, if I apply multiple .loc operations to it using boolean Dask-cuDF Series, computing the result raises a RuntimeError with the message "cuDF failure at: ../src/stream_compaction/apply_boolean_mask.cu:73: Column size mismatch". A similar snippet works as expected in cuDF.

Steps/Code to reproduce bug

import dask_cudf
import cudf
ddf1 = dask_cudf.from_cudf(cudf.DataFrame({'a':[1,2,3], 'b':[4,5,6]}), npartitions=2)
f1 = dask_cudf.from_cudf(cudf.Series([False, True, True]), npartitions=2)
f2 = dask_cudf.from_cudf(cudf.Series([True, False]), npartitions=2)
ddf2 = ddf1.loc[f1]
ddf3 = ddf2.loc[f2]
print(ddf2.compute())
print(ddf3.compute())

The above code produces the following output:

   a  b                        
1  2  5
2  3  6                                                       
Traceback (most recent call last):     
  File "temp.py", line 9, in <module>
    print(ddf3.compute())
  File "/opt/conda/lib/python3.8/site-packages/dask/base.py", line 292, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/base.py", line 575, in compute
    results = schedule(dsk, keys, **kwargs)    
  File "/opt/conda/lib/python3.8/site-packages/dask/local.py", line 554, in get_sync
    return get_async(                                         
  File "/opt/conda/lib/python3.8/site-packages/dask/local.py", line 497, in get_async                                       
    for key, res_info, failed in queue_get(queue).result():
  File "/opt/conda/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/opt/conda/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/opt/conda/lib/python3.8/site-packages/dask/local.py", line 539, in submit
    fut.set_result(fn(*args, **kwargs))
  File "/opt/conda/lib/python3.8/site-packages/dask/local.py", line 235, in batch_execute_tasks
    return [execute_task(*a) for a in it]
  File "/opt/conda/lib/python3.8/site-packages/dask/local.py", line 235, in <listcomp>
    return [execute_task(*a) for a in it]
  File "/opt/conda/lib/python3.8/site-packages/dask/local.py", line 226, in execute_task
    result = pack_exception(e, dumps)
  File "/opt/conda/lib/python3.8/site-packages/dask/local.py", line 221, in execute_task
    result = _execute_task(task, data)
  File "/opt/conda/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/opt/conda/lib/python3.8/site-packages/dask/optimization.py", line 990, in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
  File "/opt/conda/lib/python3.8/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/opt/conda/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/opt/conda/lib/python3.8/site-packages/dask/utils.py", line 39, in apply
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/core.py", line 6330, in apply_and_enforce
    df = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/methods.py", line 37, in loc
    return df.loc[iindexer]
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/dataframe.py", line 127, in __getitem__
    return self._getitem_tuple_arg(arg)
  File "/opt/conda/lib/python3.8/site-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/dataframe.py", line 267, in _getitem_tuple_arg
    df = columns_df._apply_boolean_mask(tmp_arg[0])
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/indexed_frame.py", line 1696, in _apply_boolean_mask
    libcudf.stream_compaction.apply_boolean_mask(
  File "cudf/_lib/stream_compaction.pyx", line 101, in cudf._lib.stream_compaction.apply_boolean_mask
RuntimeError: cuDF failure at: ../src/stream_compaction/apply_boolean_mask.cu:73: Column size mismatch
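
The last frame of the traceback indicates that a boolean mask and the table it was applied to had different lengths. The sketch below is a plain-Python model of such a length check, not cuDF or Dask code. The report does not show how the rows were actually partitioned, so the partition layout and the helper name `apply_boolean_mask` are hypothetical, chosen only to show how a per-partition mask can stop matching its partition after an earlier filter.

```python
def apply_boolean_mask(rows, mask):
    """Keep rows whose mask entry is True; the two lengths must match."""
    if len(rows) != len(mask):
        raise RuntimeError(
            f"Column size mismatch: {len(rows)} rows vs {len(mask)} mask values"
        )
    return [row for row, keep in zip(rows, mask) if keep]


# Hypothetical partition layout (an assumption, not taken from the report).
# Each row is (index, a, b).
frame_parts = [[(0, 1, 4)], [(1, 2, 5), (2, 3, 6)]]
f1_parts = [[False], [True, True]]
f2_parts = [[True], [False]]

# First filter: every partition has as many mask values as rows.
step1 = [apply_boolean_mask(p, m) for p, m in zip(frame_parts, f1_parts)]

# Second filter: partition 0 is now empty, but its mask still has one value.
error_message = None
try:
    step2 = [apply_boolean_mask(p, m) for p, m in zip(step1, f2_parts)]
except RuntimeError as exc:
    error_message = str(exc)
```

In this model the first filter succeeds and the second fails on the first partition, which mirrors the shape of the failure reported above without claiming to reproduce its exact cause.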

Expected behavior
Expected output (verified with cudf instead of dask-cudf):

   a  b
1  2  5
2  3  6
   a  b
1  2  5
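
The expected output above is consistent with each mask being applied by position to the rows that survived the previous filter. A plain-Python sketch of that reading follows. It uses only the standard library, and the positional interpretation is an assumption inferred from the output shown, not a statement about how cuDF aligns masks internally.

```python
from itertools import compress

# Rows of the example frame as (index, a, b) tuples.
rows = [(0, 1, 4), (1, 2, 5), (2, 3, 6)]
f1 = [False, True, True]
f2 = [True, False]

# Apply f1 to the full frame, then f2 to the two surviving rows.
after_f1 = list(compress(rows, f1))
after_f2 = list(compress(after_f1, f2))
```

Here `after_f1` holds the rows with index 1 and 2, and `after_f2` holds only the row with index 1, matching the two tables in the expected output.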

Environment overview

  • Environment location: Docker
  • Method of cuDF install: Docker
    • docker run -it --rm --gpus all --ipc=host --network=host -v .

Environment details
cuDF version 22.4.0a0+306.g0cb75a4913

Labels: 0 - Backlog (In queue waiting for assignment), Python (Affects Python cuDF API), bug (Something isn't working), dask (Dask issue)
