
[QST] How to achieve *stable* I/O Pipelining for read_parquet with Multithreaded/Multistream #17873

Open
JigaoLuo opened this issue Jan 30, 2025 · 2 comments
Labels: cuIO, libcudf, question

Comments


JigaoLuo commented Jan 30, 2025

Hi cudf Team,

I'm experimenting with I/O pipelining for read_parquet in multithreaded/multistream workflows, using cudf/cpp/examples/parquet_io/parquet_io_multithreaded.cpp as a starting point.

Before asking my questions, I reviewed #16936, so I will focus on achieving efficient pipelining for the non-first read batches, since the behavior of the first read batch seems hard to control per that issue.

What is your question?

My question builds on that existing issue, but concerns pipelining read I/O with nvCOMP kernels. The attached profiling results show clear pipelining between I/O and nvCOMP computation.

Test setup: using the standard parquet_io_multithreaded.cpp, I ran: ./parquet_io_multithreaded SNAPPY.parquet 10 FILEPATH 1 3
This configures 3 threads/streams to issue 10 read_parquet calls in total, with thread 0 handling 4 reads and threads 1 and 2 handling 3 reads each. The uneven workload distribution appears to prevent effective I/O pipelining.
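For reference, the structure I have in mind looks roughly like the sketch below. This is a simplified illustration, not the actual example code: I am assuming the stream-accepting read_parquet overload and a round-robin assignment of reads to threads (which would produce the 4/3/3 split above); parquet_io_multithreaded.cpp may distribute the work differently.

```cpp
#include <cudf/io/parquet.hpp>
#include <rmm/cuda_stream_pool.hpp>

#include <string>
#include <thread>
#include <vector>

// Each thread issues its share of read_parquet calls on its own CUDA stream,
// so I/O and decompression from different threads can, in principle, overlap.
void read_on_threads(std::string const& path, int num_reads, int num_threads)
{
  rmm::cuda_stream_pool stream_pool(num_threads);
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&, t] {
      auto stream = stream_pool.get_stream(t);
      // Round-robin: thread t handles reads t, t + num_threads, t + 2*num_threads, ...
      // With 10 reads and 3 threads this gives the 4/3/3 split.
      for (int i = t; i < num_reads; i += num_threads) {
        auto options =
          cudf::io::parquet_reader_options::builder(cudf::io::source_info{path}).build();
        auto result = cudf::io::read_parquet(options, stream);
        (void)result;  // keep or discard the table depending on the run mode
      }
      stream.synchronize();
    });
  }
  for (auto& th : threads) { th.join(); }
}
```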

Profiling Observations:
In the attached nsys profile overview, you can see the 3 threads/streams.

[screenshot: Overview]

[screenshot: Thread 0]

[screenshot: Thread 0, second read batch]

We can see the dense blue lines, which are cuFile reads to the SSD. Notably, thread 0 does not show any cuFile reads at the start of its read_parquet call.

Threads 1 & 2 -> Questions

[screenshot: Thread 1, second read batch]

[screenshot: Thread 2, second read batch]

Now we see the key issue I want to ask about: threads 1 and 2 each have two cuFile read ranges appearing simultaneously. This suggests that KvikIO is handling I/O for both read_parquet calls at the same time, meaning there is no I/O pipelining, ordering, or priority between the two threads in their second read batch.

My question is: what is the standard approach to prevent such I/O overlap between multiple read_parquet calls? If I/O operations overlap, no thread has exclusive ownership of the I/O, which could lead to slower performance than a scenario where each thread gets exclusive access in turn. This also sounds like issue #16936.
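The only portable workaround I can think of today is a global lock around each read_parquet call, as in the hedged sketch below (locked_read is a hypothetical helper of mine, not a cudf API). It would give each thread exclusive I/O, but it also serializes the nvCOMP decompression inside read_parquet, so it defeats the purpose of pipelining.

```cpp
#include <cudf/io/parquet.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <mutex>

std::mutex read_mutex;  // hypothetical global lock shared by all reader threads

// Exclusive I/O per call, but decompression is serialized as well.
cudf::io::table_with_metadata locked_read(cudf::io::parquet_reader_options const& opts,
                                          rmm::cuda_stream_view stream)
{
  std::lock_guard<std::mutex> lock{read_mutex};
  return cudf::io::read_parquet(opts, stream);
}
```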

I should note that in most cases, I/O pipelining works as expected: after the initial read batch, only one thread performs I/O at a time. However, this profiling result is an uncommon case (possibly difficult to reproduce due to nondeterminism) that I encountered, and I wanted to ask about it.

JigaoLuo added the question label on Jan 30, 2025
GregoryKimball (Contributor) commented Jan 31, 2025

Thank you @JigaoLuo for investigating this behavior. We've noted mixed success in efficient IO pipelining, especially in microbenchmarks rather than full query workloads. We've also heard feedback that multi-GPU IO can show poor throughput with KvikIO. Both of these symptoms could be improved by better ordering or priority mechanism for the task queue in KvikIO. We plan to start by introducing KvikIO benchmarks to measure the impact on IO throughput of different task queue changes. rapidsai/kvikio#606

We would love your feedback on the task counts, sizes, compression ratios, GDS vs. host reads, data sources, and anything else that might be of interest to you.
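For illustration, the kind of measurement we have in mind is along these lines: time a single large FileHandle::pread into device memory under different thread-pool/task-queue settings (e.g., via the KVIKIO_NTHREADS environment variable). This is only a rough sketch; the actual benchmarks in rapidsai/kvikio#606 may look quite different.

```cpp
#include <kvikio/file_handle.hpp>

#include <cuda_runtime.h>

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv)
{
  if (argc < 3) { std::printf("usage: %s <file> <bytes>\n", argv[0]); return 1; }
  std::size_t const nbytes = std::strtoull(argv[2], nullptr, 10);

  void* dev_buf{};
  cudaMalloc(&dev_buf, nbytes);

  // KvikIO splits the read into tasks on its internal thread pool
  // (pool size controlled by the KVIKIO_NTHREADS environment variable).
  kvikio::FileHandle fh{argv[1], "r"};

  auto const t0    = std::chrono::steady_clock::now();
  auto future      = fh.pread(dev_buf, nbytes, 0);  // asynchronous, returns a std::future
  auto const nread = future.get();                  // wait for all pool tasks to finish
  auto const t1    = std::chrono::steady_clock::now();

  double const secs = std::chrono::duration<double>(t1 - t0).count();
  std::printf("read %zu bytes in %.3f s (%.2f GiB/s)\n", nread, secs,
              static_cast<double>(nread) / secs / (1 << 30));

  cudaFree(dev_buf);
  return 0;
}
```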

GregoryKimball added the libcudf and cuIO labels on Jan 31, 2025
JigaoLuo (Author) commented Feb 2, 2025

Thank you @GregoryKimball for the information. I agree that a more refined ordering or priority mechanism for the task queue in KvikIO would be beneficial.

To reiterate the file-reading/decompression scenario: I strongly favor an I/O-computation pipelining approach. It would allow one stream to handle I/O (e.g., read_parquet) while the others wait; for decompression, it could allow multiple streams to decompress simultaneously, as I often see nvCOMP do.

Initial Idea:
I feel that introducing stream support into KvikIO could better facilitate such pipelining. Currently, KvikIO lacks stream support (apart from the raw I/O calls), which limits its ability to enable and optimize these workflows.

On benchmarking:
I have a slightly different perspective on KvikIO benchmarks. While pure I/O benchmarks (e.g., fio, gdsio, and the current KvikIO benchmarks) are useful, they cannot capture I/O-computation pipelining, which is critical for real-world use cases. For example, GDS always performs well in sequential I/O benchmarks (in gdsio or the KvikIO benchmarks), but this issue in read_parquet workflows only becomes visible in an nsys profile.
A more comprehensive benchmark suite with diverse options would be valuable, but incorporating pipelining scenarios remains challenging with benchmark tools alone.
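To make the distinction concrete, below is a rough sketch of the pipelining scenario I mean (my own illustration, not an existing benchmark; fake_decompress is just a stand-in for an nvCOMP kernel, and the chunk size and count are arbitrary): KvikIO reads chunk i+1 while a kernel processes chunk i on a CUDA stream. Tools like fio or gdsio never exercise this interleaving.

```cpp
#include <kvikio/file_handle.hpp>

#include <cuda_runtime.h>

#include <cstddef>
#include <future>

// Stand-in for an nvCOMP decompression kernel.
__global__ void fake_decompress(char* data, std::size_t n)
{
  auto i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] ^= 0x5a;
}

int main(int argc, char** argv)
{
  if (argc < 2) return 1;
  constexpr std::size_t chunk = 64UL << 20;  // 64 MiB per chunk (arbitrary)
  constexpr int num_chunks    = 8;           // assumes the file has >= num_chunks * chunk bytes

  char* buf[2]{};
  cudaMalloc(&buf[0], chunk);
  cudaMalloc(&buf[1], chunk);
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  kvikio::FileHandle fh{argv[1], "r"};

  fh.pread(buf[0], chunk, 0).get();  // prime the pipeline: read chunk 0 up front

  for (int i = 0; i < num_chunks; ++i) {
    // Start reading the next chunk while the current chunk is being "decompressed".
    std::future<std::size_t> next;
    if (i + 1 < num_chunks) { next = fh.pread(buf[(i + 1) % 2], chunk, (i + 1) * chunk); }

    auto const blocks = static_cast<unsigned int>((chunk + 255) / 256);
    fake_decompress<<<blocks, 256, 0, stream>>>(buf[i % 2], chunk);
    cudaStreamSynchronize(stream);

    if (next.valid()) { next.get(); }  // make sure the prefetched chunk has landed
  }

  cudaStreamDestroy(stream);
  cudaFree(buf[0]);
  cudaFree(buf[1]);
  return 0;
}
```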
