Data placement on different GPUs when using dask_cudf library #17966
You can track which partitions are where using the who_has/has_what functions:

    In [13]: client.who_has()
    Out[13]:
    {('frompandas-cabff790f1b8b84828c3c5e69884b775', 3): ('tcp://127.0.0.1:38207',),
     ('frompandas-cabff790f1b8b84828c3c5e69884b775', 2): ('tcp://127.0.0.1:43273',),
     ('frompandas-cabff790f1b8b84828c3c5e69884b775', 1): ('tcp://127.0.0.1:38207',),
     ('frompandas-cabff790f1b8b84828c3c5e69884b775', 0): ('tcp://127.0.0.1:43273',)}

    In [14]: client.has_what()
    Out[14]:
    {'tcp://127.0.0.1:38207': (('frompandas-cabff790f1b8b84828c3c5e69884b775', 3),
                               ('frompandas-cabff790f1b8b84828c3c5e69884b775', 1)),
     'tcp://127.0.0.1:43273': (('frompandas-cabff790f1b8b84828c3c5e69884b775', 2),
                               ('frompandas-cabff790f1b8b84828c3c5e69884b775', 0))}

Dask enables some amount of user control over the placement of data. I would suggest reading the data locality page, specifically:

    However, we generally recommend letting the scheduler control distribution of data across the workers |
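As a rough illustration of that user control (a sketch only; the worker address below is a placeholder copied from the output above, not something the thread prescribes), data can be pinned to particular workers through the workers= argument of Client.persist / Client.compute / Client.scatter:

    # Sketch: pin all of ddf's partitions to one worker (and therefore one GPU).
    # The address is a placeholder; pass several addresses to allow a subset of workers.
    ddf = client.persist(ddf, workers=["tcp://127.0.0.1:38207"])

As the quoted documentation notes, though, letting the scheduler decide is usually the better default.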
After reading the CSV and calling has_what() (or who_has()), the functions often behave strangely: they show key_count = 0 on both GPUs, or nothing is displayed at all, even though the partitioning took place and the partitions should be listed. |
My guess is that the graph hasn't been executed and the data hasn't been persisted:

    ddf = dask_cudf.read_csv(...)
    ddf = ddf.persist()
    client.who_has()
|
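For completeness, here is a minimal end-to-end sketch of that flow (the cluster options, path, and dtypes are assumptions based on the original question, not an exact reproduction):

    # Sketch: one worker process per visible GPU, persist the frame, then inspect placement.
    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client, wait
    import dask_cudf

    cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1")
    client = Client(cluster)

    ddf = dask_cudf.read_csv("./test.csv",
                             dtype={"id1": "Int32", "id2": "Int32",
                                    "id4": "object", "id5": "object", "v2": "Float64"})
    ddf = ddf.persist()   # actually run the read and keep the partitions on the workers
    wait(ddf)             # block until every partition is in GPU memory

    print(client.has_what())   # worker address -> keys it holds
    print(client.who_has())    # key -> worker address(es)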
Yes, you need to execute the task graph first. After that, the display works correctly. However, sometimes the partitions are distributed very unevenly across GPUs (for example, out of 10 partitions, 2 may go to one GPU and 8 to another). I believe I read that you can also distribute partitions across different GPUs using map_partitions. |
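If the distribution ends up that lopsided, one option (a sketch, not something recommended in this thread) is to ask the scheduler to rebalance the persisted data across workers:

    # Sketch, continuing from the setup above: redistribute already-persisted
    # partitions so memory usage is more even across workers/GPUs.
    from dask.distributed import wait

    ddf = ddf.persist()
    wait(ddf)
    client.rebalance()          # move keys between workers to even out memory
    print(client.has_what())    # re-check placement afterwards

Note that rebalance() only moves data the workers already hold; it does not change where future tasks run.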
Is it possible to partition data by a key, so that values with the same key are placed in the same partition on the same GPU? This would be especially useful for operations like groupby and join. |
Yes, you can call shuffle directly, but groupby and join will be calling that as well. |
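A sketch of what that could look like (the file names and the "id1"/"v2" columns are borrowed from the dtypes in the original question; the exact shuffle API can vary between Dask versions):

    import dask_cudf

    # Hypothetical inputs that share an "id1" key column.
    left = dask_cudf.read_csv("./left.csv")
    right = dask_cudf.read_csv("./right.csv")

    # Co-locate all rows with the same id1 in the same partition (and hence on
    # the same GPU). merge/groupby trigger an equivalent shuffle internally,
    # so doing it explicitly mainly pays off if the shuffled layout is reused.
    left = left.shuffle(on="id1")
    right = right.shuffle(on="id1")

    joined = left.merge(right, on="id1")
    totals = joined.groupby("id1").v2.sum()   # assumes a "v2" column exists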
I am using the dask_cudf library with two GPUs (CUDA_VISIBLE_DEVICES="0,1"). When reading a CSV file with dask_cudf, the dataframe is automatically split into a number of partitions, which are then processed by various operations.

    temp = dask_cudf.read_csv('./test.csv',
                              dtype={"id1": "Int32", "id2": "Int32",
                                     "id4": "object", "id5": "object", "v2": "Float64"})

Could you please tell me whether there is any way to track which specific GPU each partition is located on, and whether it is possible to manually direct a partition to a specific GPU? Is it also possible to track which specific GPU is performing operations (such as join or groupby) on a particular partition? (So far, I can only do this using the GPU memory tab.)
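For the "which physical GPU" part of this question, one possible sketch (not an official dask_cudf API; it assumes a dask-cuda LocalCUDACluster, where each worker process gets its own CUDA_VISIBLE_DEVICES with its assigned GPU listed first) is to query every worker with client.run and cross-reference the result with who_has():

    import os

    # Sketch: map each worker to the physical GPU it uses, then combine with
    # who_has() to see which GPU holds each persisted partition.
    def first_visible_gpu():
        # Assumption: dask-cuda lists the worker's own GPU first in CUDA_VISIBLE_DEVICES.
        return os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")[0]

    worker_to_gpu = client.run(first_visible_gpu)   # {worker address: GPU index (str)}
    key_to_workers = client.who_has()               # {partition key: (worker address, ...)}

    for key, workers in key_to_workers.items():
        print(key, "-> GPU(s)", [worker_to_gpu[w] for w in workers])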