
Conversation

@pmattione-nvidia (Contributor) commented Oct 30, 2025

For non-dictionary string columns, the strings are laid out in memory as: 4-byte length, string data, 4-byte length, string data, ... With this layout we cannot read the lengths in parallel, because we don't know where the next length value sits in global memory without knowing all of the preceding lengths. This is the dominant bottleneck for non-dictionary, non-Fixed-Length-Byte-Array string columns (in FLBA columns all string lengths are the same), taking six times as long as the rest of the decode.
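
To illustrate the serial dependency, here is a minimal host-side sketch (hypothetical names, not the libcudf implementation) that walks a plain-encoded page and records where each string's 4-byte length field starts:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Plain-encoded (non-dictionary) string page: [len0][bytes0][len1][bytes1]...
// Each length's position depends on every preceding length, so the walk is serial.
std::vector<uint32_t> compute_string_offsets(uint8_t const* page_data, size_t num_strings)
{
  std::vector<uint32_t> offsets(num_strings + 1);
  uint32_t pos = 0;
  offsets[0]   = 0;  // the first length field sits at the start of the page data
  for (size_t i = 0; i < num_strings; ++i) {
    uint32_t len;
    std::memcpy(&len, page_data + pos, sizeof(len));  // 4-byte little-endian length
    pos += 4 + len;                                   // skip the length field and the string bytes
    offsets[i + 1] = pos;                             // position of the next length field
  }
  return offsets;
}
```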

This PR moves the reading of the string lengths out of the main decode kernel and into a preprocessing kernel. In this kernel we allocate a small shared memory buffer that all threads in the warp cooperatively prefetch into from global memory. When thread 0 then extracts the string lengths, it reads from shared memory instead of global memory, so we spend much less time bottlenecked on a single thread.
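
A minimal CUDA sketch of the cooperative-prefetch idea; the buffer size, kernel name, and bookkeeping are illustrative assumptions rather than the actual kernel:

```cpp
#include <cstdint>

constexpr uint32_t PREFETCH_BYTES = 1024;  // assumed shared-memory buffer per page

// One warp per page: all 32 threads fill the buffer, then lane 0 walks it serially.
// Offsets follow the raw-page convention: each entry points at a 4-byte length field.
__global__ void string_offsets_kernel(uint8_t const* const* page_data,
                                      uint32_t const* page_sizes,
                                      uint32_t const* page_num_strings,
                                      uint32_t* const* page_offsets)
{
  __shared__ uint8_t buffer[PREFETCH_BYTES];

  int const page             = blockIdx.x;
  int const lane             = threadIdx.x;  // one 32-thread warp per block in this sketch
  uint8_t const* const in    = page_data[page];
  uint32_t const page_size   = page_sizes[page];
  uint32_t const num_strings = page_num_strings[page];

  uint32_t buffer_start = 0;  // page position currently held at buffer[0]
  uint32_t pos          = 0;  // serial cursor, advanced by lane 0 only
  uint32_t string_idx   = 0;

  if (lane == 0) { page_offsets[page][0] = 0; }

  while (string_idx < num_strings) {
    // Cooperative prefetch: the whole warp copies the next chunk into shared memory.
    uint32_t const chunk = min(PREFETCH_BYTES, page_size - buffer_start);
    for (uint32_t i = lane; i < chunk; i += 32) {
      buffer[i] = in[buffer_start + i];
    }
    __syncwarp();

    if (lane == 0) {
      // Serial walk, but the 4-byte length fields are now read from shared memory.
      while (string_idx < num_strings && (pos - buffer_start) + 4 <= chunk) {
        uint32_t len;
        memcpy(&len, &buffer[pos - buffer_start], 4);
        pos += 4 + len;                          // skip the length field and the string bytes
        page_offsets[page][++string_idx] = pos;  // position of the next length field
      }
    }
    // Broadcast the updated cursor to the warp and restart the buffer there.
    pos          = __shfl_sync(0xffffffff, pos, 0);
    string_idx   = __shfl_sync(0xffffffff, string_idx, 0);
    buffer_start = pos;
    __syncwarp();
  }
}
```

In this sketch the launch would be one 32-thread block per page, e.g. `string_offsets_kernel<<<num_pages, 32, 0, stream>>>(...)`. The actual kernel is more involved; this only shows the prefetch-then-extract pattern.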

However, if the average string length is large, prefetching a buffer into shared memory is slower: when only a few strings fit into the buffer, we spend too much time prefetching data that we mostly don't need. Benchmarking showed that this performance dropoff was around fitting 16 length-64 strings into the 1KB shared memory buffer. Above this cutoff we therefore revert to a fully single-threaded, non-buffering algorithm. For string lengths of 128+ there was no performance change.
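
A hedged sketch of how such a per-page cutoff could be applied; the function and constant names are assumptions:

```cpp
#include <algorithm>
#include <cstdint>

// Assumed heuristic: with a 1 KB buffer, about 16 strings of average length ~64 is the
// break-even point; above that, fall back to the unbuffered single-threaded walk.
constexpr uint32_t buffer_bytes         = 1024;
constexpr uint32_t min_strings_per_fill = 16;

bool use_prefetch_buffer(uint32_t page_string_bytes, uint32_t page_num_strings)
{
  uint32_t const avg_len = page_string_bytes / std::max(page_num_strings, 1u);
  return avg_len * min_strings_per_fill <= buffer_bytes;  // i.e. average length <= ~64
}
```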

There are several advantages to preprocessing instead of doing this in the main decode kernel. First, by having only one warp per page instead of the decode kernel's 4 warps per page, we can go much wider, so fewer threads sit bottlenecked. More importantly, we have significantly less shared memory and register pressure, and can safely throw a lot more memory at the prefetch buffer without reducing occupancy.

The performance gains:

  • Flat non-chunked: 20% faster
  • Struct non-chunked: 25% faster
  • List non-chunked: 10% faster
  • Chunked: 15% faster for all types.

I attempted double-buffering the data, by having one warp fill half of the prefetch buffer while thread 0 of the other warp read from the other half, but this was slower than single-buffering. For comparison, with two warps (64 threads) and 2 KB of shared memory you can do either:

  • Single buffering: 2 pages at once; all 64 threads prefetch, but when the buffers are full, 62 threads wait while the two thread 0s extract lengths from shared memory
  • Double buffering: 1 page at once; 32 threads continuously prefetch, 1 thread extracts lengths from the other buffer, and 31 threads do nothing at all

Single buffering wins because the 62 waiting threads probably don't wait very long for the thread 0s to extract the lengths, so most of the time all 64 threads are prefetching, instead of only 32 in the double-buffering scheme.

Instead of saving both string lengths and offsets as before, only offsets are saved now, since the string lengths can be recovered from them. Also, the string lengths for fixed-length-byte-array columns are known from the column data header, so they aren't saved at all. Finally, by doing this preprocessing I was able to free some shared memory in the main strings decode kernel, since those buffers are no longer needed.
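
For reference, a minimal sketch of recovering a length from two consecutive saved offsets, assuming (as the reviewed snippet's `next_offset - input_thread_string_offset - 4` suggests) that each offset points at a string's 4-byte length field in the raw page data:

```cpp
#include <cstdint>

// Illustrative helper, not the libcudf code: the gap between consecutive offsets is
// the 4-byte length field plus the string bytes, so subtract 4 to get the length.
__device__ inline uint32_t string_length(uint32_t const* offsets, int i)
{
  return offsets[i + 1] - offsets[i] - 4;
}
```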

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@pmattione-nvidia pmattione-nvidia self-assigned this Oct 30, 2025
@pmattione-nvidia pmattione-nvidia added the Performance, improvement, and non-breaking labels Oct 30, 2025
@copy-pr-bot copy-pr-bot bot commented Oct 30, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@github-actions github-actions bot added the libcudf label Oct 30, 2025
@pmattione-nvidia pmattione-nvidia marked this pull request as ready for review October 30, 2025 17:53
@pmattione-nvidia pmattione-nvidia requested a review from a team as a code owner October 30, 2025 17:53
@mhaseeb123 (Member) left a comment

Some very early comments

@mhaseeb123 (Member) left a comment

Flushing some more early comments from the first pass.

? 0
: next_offset - input_thread_string_offset - 4;
}
if (input_thread_string_offset >= (uint32_t)s->dict_size) {
Member

Is this a valid case? Should we set error_code here instead?

Contributor Author

This logic is lifted from gpuGetStringData(). I believe this case is impossible, but I'm trying to keep the same behavior as before.

@mhaseeb123 (Member)

Benchmarking showed that this performance dropoff was around fitting 16 length-64 strings into the 1KB shared memory buffer

@pmattione-nvidia
Quick question: am I understanding correctly that the above sentence means 16 strings with an average length of 64, or is it the other way around? (It doesn't seem like it from the context, but just to clarify.)

@nvdbaranec (Contributor)

I haven't read through the PR yet, but I'm assuming these preprocessed offsets are stored in global memory? This seems like it could become a pretty big number. We might need to take this into account in the chunked reader memory limit computation.

@nvdbaranec (Contributor) commented Oct 31, 2025

Part of me wonders here: if we're already prefetching everything into shared memory here to get the lengths, how hard would it be to also just do the rest of the work (copying to output) right here? Seems like we're going to end up touching every byte in the input twice otherwise. I guess the hard part is you don't know where your particular page's strings start in the output. Hmm...

Edit: Actually, I think we do know this. We already compute all the string sizes per page during setup, so we could scan those to get the output start for each page's first string. Then, as the warp(s) march through the precached buffers, it's easy enough to know where each prefetched piece goes.
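
A hedged sketch of the scan being suggested here, with hypothetical names (real libcudf code would use its stream-aware execution policies and existing per-page size arrays):

```cpp
#include <thrust/execution_policy.h>
#include <thrust/scan.h>
#include <cstdint>

// Hypothetical illustration: given the per-page string byte counts computed during
// setup, an exclusive scan gives where each page's first string starts in the output.
void compute_page_output_starts(int64_t const* d_page_string_bytes,  // input, per page
                                int64_t* d_page_output_start,        // output, per page
                                int num_pages)
{
  thrust::exclusive_scan(thrust::device,
                         d_page_string_bytes,
                         d_page_string_bytes + num_pages,
                         d_page_output_start);
}
```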

@nvdbaranec (Contributor)

Note: I'm not suggesting we change this PR. But it's something to think about maybe for later. We'd still have to run the decode kernel to generate validity/list info in some cases, but it seems like this might be a way to get a useful overall speedup.

Comment on lines +350 to +351
// Allocate the string offset buffer
_string_offset_buffer = rmm::device_uvector<uint32_t>(total_num_offsets, _stream, _mr);
Contributor

So this is the part where we might need to take this into account for the input chunk limiting size. If we had 50 million rows, that'd be 256MB, which would be a significant chunk of the typical memory limit set by Spark for temporary memory usage. I think an estimate (maybe worst case, pretending there are no nulls) might be ok.

Here's where we add in the temp memory usage for decompression

// include scratch space needed for decompression. for certain codecs (eg ZSTD) this

Contributor Author

Done

@nvdbaranec (Contributor) left a comment

Whoops. Didn't mean to approve.

@pmattione-nvidia (Contributor Author) commented Nov 4, 2025

Part of me wonders here: if we're already prefetching everything into shared memory here to get the lengths, how hard would it be to also just do the rest of the work (copying to output) right here? Seems like we're going to end up touching every byte in the input twice otherwise. I guess the hard part is you don't know where your particular page's strings start in the output. Hmm...

Edit: Actually, I think we do know this. We already compute all the string sizes per page during setup, so we could scan those to get the output start for each page's first string. Then, as the warp(s) march through the precached buffers, it's easy enough to know where each prefetched piece goes.

The problem with doing this in the main decode kernel is that we have a lot of shared memory taken up by all of the random decode buffers AND the string decode buffers (so that all of the threads in the block can cooperatively copy the strings). In fact, we use so much shared memory for string decode that it, not the register usage, is the limiting factor on occupancy. We're better off having this as a separate kernel where we can use a larger buffer without hurting occupancy. Plus, reading the lengths takes 6x as long as the rest of the decode kernel combined, so hitting the memory again later doesn't take as long (probably because we don't need to be single-threaded there).

@mhaseeb123 (Member)

CC: @vuule for viz only

@mhaseeb123 (Member) left a comment

Flush comments so far

@mhaseeb123 mhaseeb123 moved this to Burndown in libcudf Nov 10, 2025

Labels

  • improvement: Improvement / enhancement to an existing function
  • libcudf: Affects libcudf (C++/CUDA) code.
  • non-breaking: Non-breaking change
  • Performance: Performance related issue

Projects

Status: Burndown


3 participants