Data placement on different GPUs when using dask_cudf library #17966

Open
preorat-sion opened this issue Feb 10, 2025 · 6 comments

@preorat-sion

I am using the dask_cudf library with two GPUs (CUDA_VISIBLE_DEVICES="0,1"). When reading a CSV file with dask_cudf, the dataframe is automatically split into a number of partitions, which are then processed by various operations.
temp = dask_cudf.read_csv('./test.csv', dtype={"id1":"Int32","id2":"Int32","id4":"object","id5":"object","v2":"Float64"})
Could you please tell me whether there is any way to track which specific GPU each partition is located on, and whether it is possible to manually direct a partition to a specific GPU? Also, is it possible to track which GPU is performing operations (such as join or groupby) on a particular partition? So far, I can only see this on the GPU memory tab.

@quasiben
Member

You can track which partitions are where using the who_has/has_what functions:

In [13]: client.who_has()
Out[13]:
{('frompandas-cabff790f1b8b84828c3c5e69884b775',
  3): ('tcp://127.0.0.1:38207',),
 ('frompandas-cabff790f1b8b84828c3c5e69884b775',
  2): ('tcp://127.0.0.1:43273',),
 ('frompandas-cabff790f1b8b84828c3c5e69884b775',
  1): ('tcp://127.0.0.1:38207',),
 ('frompandas-cabff790f1b8b84828c3c5e69884b775',
  0): ('tcp://127.0.0.1:43273',)}

In [14]: client.has_what()
Out[14]:
{'tcp://127.0.0.1:38207': (('frompandas-cabff790f1b8b84828c3c5e69884b775', 3),
  ('frompandas-cabff790f1b8b84828c3c5e69884b775', 1)),
 'tcp://127.0.0.1:43273': (('frompandas-cabff790f1b8b84828c3c5e69884b775', 2),
  ('frompandas-cabff790f1b8b84828c3c5e69884b775', 0))}


Dask enables some amount of user control over the placement of data. I would suggest reading the data locality page, specifically:

However, we generally recommend letting the scheduler control distribution of data across the workers
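
To relate those worker addresses back to physical GPUs, you can ask each worker which device it owns. A minimal sketch, assuming a dask_cuda LocalCUDACluster where each worker is pinned to one GPU (the helper name gpu_of_worker is just illustrative):

import os
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1")  # one worker per GPU
client = Client(cluster)

def gpu_of_worker():
    # dask-cuda sets CUDA_VISIBLE_DEVICES per worker, with that worker's own GPU listed first
    return os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")[0]

worker_to_gpu = client.run(gpu_of_worker)  # {worker address: GPU id}
for key, workers in client.who_has().items():
    print(key, [worker_to_gpu[w] for w in workers])

Joining the client.run() output with who_has() like this gives a per-partition GPU mapping without watching the dashboard.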

@preorat-sion
Author

After reading the CSV and calling has_what() (or who_has()), the functions often behave strangely: they show key_count = 0 on both GPUs, or nothing is displayed at all beyond the empty header (Key | Copies | Workers), even though partitioning took place and the partitions should be listed.

@quasiben
Member

My guess is that the graph hasn't been executed and data hasn't been persisted:

ddf = dask_cudf.read_csv(...)
ddf = ddf.persist()
client.who_has()
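
One caveat (my addition, not part of the reply above): persist() returns immediately, so it helps to block until the partitions are actually materialized before inspecting placement. A sketch reusing the file, dtypes, and client from earlier in this thread:

import dask_cudf
from dask.distributed import wait

ddf = dask_cudf.read_csv("./test.csv", dtype={"id1": "Int32", "id2": "Int32", "id4": "object", "id5": "object", "v2": "Float64"})
ddf = ddf.persist()      # submits the graph and returns right away with a lazy handle
wait(ddf)                # block until every partition is in GPU memory on some worker
print(client.who_has())  # now lists a worker address for each partition key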

@preorat-sion
Author

preorat-sion commented Feb 10, 2025

Yes, you need to execute the task graph first. After that, the display works correctly. However, sometimes the partitions are distributed very unevenly across GPUs (for example, out of 10 partitions, 2 may go to one GPU and 8 to another). I believe I read that you can also distribute partitions across different GPUs using map_partitions.
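
If the spread stays lopsided after persisting, one thing to try (my own suggestion, not something confirmed in this thread) is asking the scheduler to rebalance the in-memory data; note that map_partitions only runs a function on whichever worker already holds each partition, it does not move data. A minimal sketch:

from dask.distributed import wait

ddf = ddf.persist()
wait(ddf)                 # make sure all partitions are materialized first
client.rebalance()        # move in-memory data so workers hold roughly equal shares
print(client.has_what())  # partition counts per worker should now be more even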

@preorat-sion
Author

Is it possible to partition data by a key, so that values with the same key are placed in the same partition on the same GPU? This would be especially useful for operations like groupby and join.

@quasiben
Member

Yes, you can call shuffle directly, but groupby and join will call it under the hood as well.
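
For example, here is a minimal sketch of shuffling by a key column so that equal keys land in the same partition (column names id1 and v2 are taken from the read_csv call at the top of the thread; the exact shuffle signature may differ slightly between dask versions):

ddf = ddf.shuffle(on="id1")  # rows with the same id1 end up in the same partition
result = ddf.groupby("id1").v2.mean().compute()
# Alternatively, ddf.set_index("id1") also shuffles by key (and sorts on it).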
