
[QST] How to achieve *stable* I/O Pipelining for read_parquet with Multithreaded/Multistream #17873

Open
JigaoLuo opened this issue Jan 30, 2025 · 2 comments
Labels: cuIO, libcudf, question

Comments


JigaoLuo commented Jan 30, 2025

Hi cudf Team,

I'm experimenting with I/O pipelining for read_parquet in multithreaded/multistream workflows, using cudf/cpp/examples/parquet_io/parquet_io_multithreaded.cpp as a starting point.

Before asking my questions, I reviewed #16936, so I will focus on achieving efficient pipelining for the non-first read batches, since the behavior of the first read batch seems hard to control per that issue.

What is your question?

My question builds on that existing issue, but concerns pipelining read I/O with nvCOMP kernels. The attached profiling results show clear pipelining between I/O and nvCOMP computation.

Test setup: using the standard parquet_io_multithreaded.cpp, I ran: ./parquet_io_multithreaded SNAPPY.parquet 10 FILEPATH 1 3
This configures 3 threads/streams to issue 10 read_parquet calls in total, with thread 0 handling 4 reads and threads 1 and 2 handling 3 reads each. The uneven workload distribution appears to prevent effective I/O pipelining.
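For reference, the structure I have in mind looks roughly like the sketch below. This is a simplified illustration, not the actual example code: I am assuming the stream-accepting read_parquet overload and a round-robin assignment of reads to threads (which would produce the 4/3/3 split above); parquet_io_multithreaded.cpp may distribute the work differently.

```cpp
#include <cudf/io/parquet.hpp>
#include <rmm/cuda_stream_pool.hpp>

#include <string>
#include <thread>
#include <vector>

// Each thread issues its share of read_parquet calls on its own CUDA stream,
// so I/O and decompression from different threads can, in principle, overlap.
void read_on_threads(std::string const& path, int num_reads, int num_threads)
{
  rmm::cuda_stream_pool stream_pool(num_threads);
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&, t] {
      auto stream = stream_pool.get_stream(t);
      // Round-robin: thread t handles reads t, t + num_threads, t + 2*num_threads, ...
      // With 10 reads and 3 threads this gives the 4/3/3 split.
      for (int i = t; i < num_reads; i += num_threads) {
        auto options =
          cudf::io::parquet_reader_options::builder(cudf::io::source_info{path}).build();
        auto result = cudf::io::read_parquet(options, stream);
        (void)result;  // keep or discard the table depending on the run mode
      }
      stream.synchronize();
    });
  }
  for (auto& th : threads) { th.join(); }
}
```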

Profiling Observations:
In the attached nsys profile overview, you can see the 3 threads/streams.

[screenshot: Overview]

[screenshot: Thread 0]

[screenshot: Thread 0, second read batch]

We can see the dense blue lines, which are cuFile reads to the SSD. Notably, thread 0 does not show any cuFile reads at the start of its read_parquet call.

Threads 1 & 2 -> Questions

[screenshot: Thread 1, second read batch]

[screenshot: Thread 2, second read batch]

Now we see the key issue I want to ask about: threads 1 and 2 each have two cuFile read ranges appearing simultaneously. This suggests that KvikIO is handling I/O for both read_parquet calls at the same time, meaning there is no I/O pipelining, ordering, or priority between the two threads in their second read batch.

My question is: what is the standard approach to prevent such I/O overlap between multiple read_parquet calls? If I/O operations overlap, no thread has exclusive ownership of the I/O, which could lead to slower performance than a scenario where each thread gets exclusive access in turn. This also sounds like issue #16936.
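The only portable workaround I can think of today is a global lock around each read_parquet call, as in the hedged sketch below (locked_read is a hypothetical helper of mine, not a cudf API). It would give each thread exclusive I/O, but it also serializes the nvCOMP decompression inside read_parquet, so it defeats the purpose of pipelining.

```cpp
#include <cudf/io/parquet.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <mutex>

std::mutex read_mutex;  // hypothetical global lock shared by all reader threads

// Exclusive I/O per call, but decompression is serialized as well.
cudf::io::table_with_metadata locked_read(cudf::io::parquet_reader_options const& opts,
                                          rmm::cuda_stream_view stream)
{
  std::lock_guard<std::mutex> lock{read_mutex};
  return cudf::io::read_parquet(opts, stream);
}
```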

I should note that in most cases, I/O pipelining works as expected: after the initial read batch, only one thread performs I/O at a time. However, this profiling result is an uncommon case (possibly difficult to reproduce due to nondeterminism) that I encountered, and I wanted to ask about it.

JigaoLuo added the question label on Jan 30, 2025
GregoryKimball (Contributor) commented Jan 31, 2025

Thank you @JigaoLuo for investigating this behavior. We've noted mixed success in efficient IO pipelining, especially in microbenchmarks rather than full query workloads. We've also heard feedback that multi-GPU IO can show poor throughput with KvikIO. Both of these symptoms could be improved by better ordering or priority mechanism for the task queue in KvikIO. We plan to start by introducing KvikIO benchmarks to measure the impact on IO throughput of different task queue changes. rapidsai/kvikio#606

We would love your feedback on the task counts, sizes, compression ratios, GDS vs. host reads, data sources, and anything else that might be of interest to you.
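For illustration, the kind of measurement we have in mind is along these lines: time a single large FileHandle::pread into device memory under different thread-pool/task-queue settings (e.g., via the KVIKIO_NTHREADS environment variable). This is only a rough sketch; the actual benchmarks in rapidsai/kvikio#606 may look quite different.

```cpp
#include <kvikio/file_handle.hpp>

#include <cuda_runtime.h>

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv)
{
  if (argc < 3) { std::printf("usage: %s <file> <bytes>\n", argv[0]); return 1; }
  std::size_t const nbytes = std::strtoull(argv[2], nullptr, 10);

  void* dev_buf{};
  cudaMalloc(&dev_buf, nbytes);

  // KvikIO splits the read into tasks on its internal thread pool
  // (pool size controlled by the KVIKIO_NTHREADS environment variable).
  kvikio::FileHandle fh{argv[1], "r"};

  auto const t0    = std::chrono::steady_clock::now();
  auto future      = fh.pread(dev_buf, nbytes, 0);  // asynchronous, returns a std::future
  auto const nread = future.get();                  // wait for all pool tasks to finish
  auto const t1    = std::chrono::steady_clock::now();

  double const secs = std::chrono::duration<double>(t1 - t0).count();
  std::printf("read %zu bytes in %.3f s (%.2f GiB/s)\n", nread, secs,
              static_cast<double>(nread) / secs / (1 << 30));

  cudaFree(dev_buf);
  return 0;
}
```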

GregoryKimball added the libcudf and cuIO labels on Jan 31, 2025
JigaoLuo (Author) commented Feb 2, 2025

Thank you @GregoryKimball for the information. I agree that a more refined ordering or priority mechanism for the task queue in KvikIO would be beneficial.

To reiterate the file-reading/decompression scenario: I strongly favor an I/O-computation pipelining approach. It would allow one stream to handle I/O (e.g., read_parquet) while the others wait; for decompression, it could allow multiple streams to decompress simultaneously, as I often see nvCOMP do.

Initial Idea:
I feel that introducing stream support into KvikIO could better facilitate such pipelining. Currently, KvikIO lacks stream support (apart from the raw I/O calls), which limits its ability to enable and optimize these workflows.

On benchmarking:
I have a slightly different perspective on KvikIO benchmarks. While pure I/O benchmarks (e.g., fio, gdsio, and the current KvikIO benchmarks) are useful, they cannot capture I/O-computation pipelining, which is critical for real-world use cases. For example, GDS always performs well in sequential I/O benchmarks (in gdsio or the KvikIO benchmarks), but this issue in read_parquet workflows only becomes visible in an nsys profile.
A more comprehensive benchmark suite with diverse options would be valuable, but incorporating pipelining scenarios remains challenging with benchmark tools alone.
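To make the distinction concrete, below is a rough sketch of the pipelining scenario I mean (my own illustration, not an existing benchmark; fake_decompress is just a stand-in for an nvCOMP kernel, and the chunk size and count are arbitrary): KvikIO reads chunk i+1 while a kernel processes chunk i on a CUDA stream. Tools like fio or gdsio never exercise this interleaving.

```cpp
#include <kvikio/file_handle.hpp>

#include <cuda_runtime.h>

#include <cstddef>
#include <future>

// Stand-in for an nvCOMP decompression kernel.
__global__ void fake_decompress(char* data, std::size_t n)
{
  auto i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] ^= 0x5a;
}

int main(int argc, char** argv)
{
  if (argc < 2) return 1;
  constexpr std::size_t chunk = 64UL << 20;  // 64 MiB per chunk (arbitrary)
  constexpr int num_chunks    = 8;           // assumes the file has >= num_chunks * chunk bytes

  char* buf[2]{};
  cudaMalloc(&buf[0], chunk);
  cudaMalloc(&buf[1], chunk);
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  kvikio::FileHandle fh{argv[1], "r"};

  fh.pread(buf[0], chunk, 0).get();  // prime the pipeline: read chunk 0 up front

  for (int i = 0; i < num_chunks; ++i) {
    // Start reading the next chunk while the current chunk is being "decompressed".
    std::future<std::size_t> next;
    if (i + 1 < num_chunks) { next = fh.pread(buf[(i + 1) % 2], chunk, (i + 1) * chunk); }

    auto const blocks = static_cast<unsigned int>((chunk + 255) / 256);
    fake_decompress<<<blocks, 256, 0, stream>>>(buf[i % 2], chunk);
    cudaStreamSynchronize(stream);

    if (next.valid()) { next.get(); }  // make sure the prefetched chunk has landed
  }

  cudaStreamDestroy(stream);
  cudaFree(buf[0]);
  cudaFree(buf[1]);
  return 0;
}
```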
