[QST] How to achieve *stable* I/O Pipelining for read_parquet with Multithreaded/Multistream #17873
Hi cudf Team,
I’m trying to achieve I/O pipelining for read_parquet in multithreaded/multistream workflows, using cudf/cpp/examples/parquet_io/parquet_io_multithreaded.cpp as a common starting point. Before asking my questions, I have reviewed "read_parquet calls on different threads" (#16936), so I will focus on achieving efficient pipelining for the non-first batches, since the first read batch's behavior seems hard to control, as discussed in that issue. Regarding #16936, Batch 2 and Batch 3 appear to begin nvCOMP decompression immediately, with no visible I/O stage like the read functions.
What is your question?
My question builds on that existing issue, but focuses on pipelining read I/O with nvCOMP kernels. The attached profiling results show the I/O and the nvCOMP computation clearly pipelined.
Test setup: using the standard parquet_io_multithreaded.cpp, I ran:
./parquet_io_multithreaded SNAPPY.parquet 10 FILEPATH 1 3
This configures 3 threads/streams to call read_parquet 10 times total, with thread 0 handling 4 reads and threads 1-2 handling 3 reads each. The uneven workload distribution appears to prevent effective I/O pipelining.
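For concreteness, the per-thread call pattern I have in mind is roughly the sketch below. This is my own simplified approximation, not the example's actual code: the round-robin file split, the stream-pool size, and the repeated SNAPPY.parquet file list (to mimic the input multiplier of 10) are assumptions, and error handling is omitted.

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/io/types.hpp>

#include <rmm/cuda_stream_pool.hpp>

#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Simplified sketch: each thread issues its share of read_parquet calls on its own
// CUDA stream from a shared pool, so ten reads are split 4/3/3 across three threads.
int main()
{
  std::vector<std::string> const files(10, "SNAPPY.parquet");  // "input multiplier" of 10
  int const num_threads = 3;
  rmm::cuda_stream_pool stream_pool(num_threads);

  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&, t] {
      auto stream = stream_pool.get_stream(t);  // one dedicated stream per thread
      // Round-robin split: thread 0 reads files 0,3,6,9; threads 1 and 2 read three each.
      for (std::size_t i = t; i < files.size(); i += num_threads) {
        auto options =
          cudf::io::parquet_reader_options::builder(cudf::io::source_info{files[i]}).build();
        // I/O and decompression for this file happen as part of this call, on `stream`;
        // the returned table is dropped immediately in this sketch.
        auto result = cudf::io::read_parquet(options, stream);
      }
      stream.synchronize();
    });
  }
  for (auto& th : threads) { th.join(); }
  return 0;
}
```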
Profiling Observations:
In the attached nsys profile overview, you can see the 3 threads/streams.
Thread0:
We can see the dense blue lines, which are cuFile reads to the SSD. Notably, thread0 does not show any cuFile reads at the start of its read_parquet call.
Thread1&2 -> Questions:
Now we come to the key issue I want to ask about: threads 1 and 2 each have two cuFile read ranges appearing simultaneously. This suggests that KvikIO is handling I/O for both read_parquet calls at the same time, meaning there is no I/O pipelining, ordering, or priority between the two threads in their second read batch.
My question is: what is the standard approach to prevent such I/O overlap between multiple read_parquet calls? If the I/O operations overlap, no thread has exclusive ownership of the I/O, which could lead to slower performance than a scenario where each thread gets exclusive access in turn. This also sounds related to issue #16936.
I should note that in most cases, I/O pipelining works as expected: after the initial read batch, only one thread performs I/O at a time. However, this profiling result is an uncommon case (possibly difficult to reproduce due to nondeterminism) that I encountered, and I wanted to ask about it.
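To make the "exclusive ownership of I/O" idea concrete, below is a minimal, hypothetical sketch of the kind of workaround I am imagining, not something cudf or KvikIO provides: a process-wide mutex that each thread holds for the duration of its read_parquet call. Since read_parquet is a single call from the caller's side, this also serializes decompression, which is exactly why I am asking whether there is a standard way to order only the I/O.

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/io/types.hpp>

#include <rmm/cuda_stream_view.hpp>

#include <mutex>
#include <string>

// Hypothetical workaround sketch (not a cudf/KvikIO API): a process-wide lock that
// gives one thread at a time exclusive ownership of a read_parquet call. This also
// serializes decompression, so it defeats pipelining; it only illustrates the idea
// of preventing cuFile read ranges from two threads overlapping.
namespace {
std::mutex read_parquet_mutex;  // assumed global, shared by all reader threads
}

cudf::io::table_with_metadata read_one_file_exclusively(std::string const& path,
                                                        rmm::cuda_stream_view stream)
{
  auto options =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info{path}).build();
  std::lock_guard<std::mutex> lock(read_parquet_mutex);
  return cudf::io::read_parquet(options, stream);
}
```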