
GH-45594: [C++][Parquet] POC: Optimize Parquet DecodeArrow in DeltaLengthByteArray #45622

Open: wants to merge 11 commits into main from optimize-decode-delta-length-byte-array

Conversation

@mapleFU (Member) commented Feb 25, 2025

Rationale for this change

See #45594

What changes are included in this PR?

  1. Add a hack interface for the binary builder
  2. Optimize decoding in DeltaLengthByteArray (see the sketch below)
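
For illustration, a minimal sketch of the fast-path idea, not the exact PR code (the function and variable names `DecodeDenseSketch`, `length_ptr`, `data`, and `data_size` are hypothetical): DELTA_LENGTH_BYTE_ARRAY stores all value lengths before the value bytes, so the decoder can pre-compute the total size, reserve the builder buffers once, and append without per-value capacity checks:

    // Sketch only: assumes `length_ptr` holds the already-decoded lengths and
    // `data`/`data_size` point at the concatenated value bytes.
    arrow::Status DecodeDenseSketch(const int32_t* length_ptr, int num_values,
                                    const uint8_t* data, int64_t data_size,
                                    arrow::BinaryBuilder* builder) {
      int64_t accum_length = 0;
      for (int i = 0; i < num_values; ++i) {
        if (ARROW_PREDICT_FALSE(length_ptr[i] < 0)) {
          return arrow::Status::Invalid("negative string delta length");
        }
        accum_length += length_ptr[i];
      }
      if (ARROW_PREDICT_FALSE(accum_length > data_size)) {
        return arrow::Status::Invalid("Binary data is too short");
      }
      // Reserve offsets and value bytes once, then append unchecked.
      RETURN_NOT_OK(builder->Reserve(num_values));
      RETURN_NOT_OK(builder->ReserveData(accum_length));
      for (int i = 0; i < num_values; ++i) {
        builder->UnsafeAppend(data, length_ptr[i]);
        data += length_ptr[i];
      }
      return arrow::Status::OK();
    }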

Are these changes tested?

Covered by existing tests.

Are there any user-facing changes?

No.

@mapleFU requested a review from wgtmac as a code owner February 25, 2025 09:55

⚠️ GitHub issue #45594 has been automatically assigned in GitHub to PR creator.

@mapleFU (Member Author) commented Feb 25, 2025

On my macOS (not the newest machine), this gives roughly a 6x-13x throughput improvement:

After:

BM_ArrowBinaryDeltaLength/DL_DecodeArrow_Dense/1024        1803 ns         1794 ns       366634 bytes_per_second=3.27674G/s items_per_second=570.698M/s
BM_ArrowBinaryDeltaLength/DL_DecodeArrow_Dense/4096        5502 ns         5453 ns       129056 bytes_per_second=4.21124G/s items_per_second=751.186M/s
BM_ArrowBinaryDeltaLength/DL_DecodeArrow_Dense/32768      39024 ns        38960 ns        17944 bytes_per_second=4.68932G/s items_per_second=841.076M/s
BM_ArrowBinaryDeltaLength/DL_DecodeArrow_Dense/65536      76109 ns        76037 ns         9217 bytes_per_second=4.79637G/s items_per_second=861.897M/s

Before:

BM_ArrowBinaryDeltaLength/DL_DecodeArrow_Dense/1024       11620 ns        10801 ns        55154 bytes_per_second=557.395M/s items_per_second=94.8042M/s
BM_ArrowBinaryDeltaLength/DL_DecodeArrow_Dense/4096       51641 ns        51339 ns        13196 bytes_per_second=458.007M/s items_per_second=79.7829M/s
BM_ArrowBinaryDeltaLength/DL_DecodeArrow_Dense/32768     479142 ns       459910 ns         1550 bytes_per_second=406.772M/s items_per_second=71.2488M/s
BM_ArrowBinaryDeltaLength/DL_DecodeArrow_Dense/65536     963371 ns       929585 ns          759 bytes_per_second=401.743M/s items_per_second=70.5003M/s

@mapleFU changed the title from "GH-45594: [C++][Parquet] Optimize Parquet DecodeArrow in DeltaLengthByteArray" to "GH-45594: [C++][Parquet] POC: Optimize Parquet DecodeArrow in DeltaLengthByteArray" Feb 25, 2025
@mapleFU force-pushed the optimize-decode-delta-length-byte-array branch from 153b065 to aa25e2e on February 25, 2025 11:05
@mapleFU force-pushed the optimize-decode-delta-length-byte-array branch from aa25e2e to 4a4847f on February 25, 2025 11:12
@mapleFU (Member Author) commented Feb 25, 2025

cc @pitrou, this interface is a bit ugly, but I don't know whether we have a better way to do this. Would you mind taking a look?

@github-actions bot added the awaiting committer review label and removed the awaiting review label Feb 27, 2025
@pitrou (Member) commented Feb 27, 2025

Hmm, I really don't like the new BinaryBuilder API that this is adding.

Perhaps we should instead add these APIs and let Parquet use those builders?

diff --git a/cpp/src/arrow/array/builder_binary.h b/cpp/src/arrow/array/builder_binary.h
index 442e4a2632..e568279508 100644
--- a/cpp/src/arrow/array/builder_binary.h
+++ b/cpp/src/arrow/array/builder_binary.h
@@ -359,6 +359,9 @@ class BaseBinaryBuilder
   /// \return data pointer of the value date builder
   const offset_type* offsets_data() const { return offsets_builder_.data(); }
 
+  TypedBufferBuilder<offset_type>* offsets_builder() { return &offsets_builder_; }
+  TypedBufferBuilder<uint8_t>* value_data_builder() { return &value_data_builder_; }
+
   /// Temporary access to a value.
   ///
   /// This pointer becomes invalid on the next modifying operation.
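
For illustration, this is roughly how the Parquet decoder might consume those accessors to write offsets and value bytes in bulk; a sketch only, with hypothetical names (`builder`, `length_ptr`, `data`, `num_values`, `total_data_length`):

    // Sketch: bulk-write start offsets and value bytes through the exposed
    // builders. Caveat: the builder's length and null bitmap still have to be
    // kept in sync by the caller.
    TypedBufferBuilder<int32_t>* offsets = builder->offsets_builder();
    TypedBufferBuilder<uint8_t>* values = builder->value_data_builder();
    RETURN_NOT_OK(offsets->Reserve(num_values));
    RETURN_NOT_OK(values->Reserve(total_data_length));
    auto offset = static_cast<int32_t>(values->length());
    for (int i = 0; i < num_values; ++i) {
      offsets->UnsafeAppend(offset);  // start offset of value i
      offset += length_ptr[i];
    }
    values->UnsafeAppend(data, total_data_length);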

@mapleFU (Member Author) commented Feb 27, 2025

Perhaps we should instead add these APIs and let Parquet use those builders?

Previously I was using a POC like this; maybe an unsafe_ or similar prefix could make this better?

diff --git a/cpp/src/arrow/array/builder_binary.h b/cpp/src/arrow/array/builder_binary.h
index 442e4a2632..e568279508 100644
--- a/cpp/src/arrow/array/builder_binary.h
+++ b/cpp/src/arrow/array/builder_binary.h
@@ -359,6 +359,9 @@ class BaseBinaryBuilder
   /// \return data pointer of the value date builder
   const offset_type* offsets_data() const { return offsets_builder_.data(); }
 
+  TypedBufferBuilder<offset_type>* unsafe_offsets_builder() { return &offsets_builder_; }
+  TypedBufferBuilder<uint8_t>* unsafe_value_data_builder() { return &value_data_builder_; }
+
   /// Temporary access to a value.
   ///
   /// This pointer becomes invalid on the next modifying operation.

@pitrou (Member) commented Feb 27, 2025

I don't think adding "unsafe" would really bring anything (there is no risk of crashing, for instance). However, we should add a docstring explaining the caveats when using these methods.
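
A possible shape for such a docstring (the wording here is only a suggestion):

    /// Access the underlying offsets buffer builder.
    ///
    /// Caveats: this bypasses the builder's usual invariants. Callers that
    /// append through it must keep the offsets, value data, and null bitmap
    /// mutually consistent themselves. Intended for low-level decoders such
    /// as Parquet rather than general use.
    TypedBufferBuilder<offset_type>* offsets_builder() { return &offsets_builder_; }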

mapleFU added 3 commits March 3, 2025 11:51

@mapleFU requested a review from pitrou March 3, 2025 05:11
@mapleFU (Member Author) commented Mar 6, 2025

@pitrou I've addressed all the comments here; would you mind taking a look when you have spare time?

  RETURN_NOT_OK(offsets_builder->Reserve(num_values));
  accum_length = 0;
  if (valid_bits == nullptr) {
    for (int i = 0; i < max_values; ++i) {
Member

IIUC, VisitNullBitmapInline can also handle the case where valid_bits == nullptr since it uses OptionalBitBlockCounter under the hood.
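
For reference, the approximate shape of using that helper to accumulate lengths; the signature is paraphrased from memory, so treat the exact parameters as an assumption and see arrow/visit_data_inline.h for the real one:

    // Approximate sketch: the helper walks the validity bitmap in blocks and
    // treats a null `valid_bits` pointer as "all valid".
    int64_t i = 0;
    int64_t accum_length = 0;
    RETURN_NOT_OK(::arrow::internal::VisitNullBitmapInline(
        valid_bits, valid_bits_offset, num_values, null_count,
        /*valid_func=*/[&]() {
          accum_length += length_ptr[i++];
          return ::arrow::Status::OK();
        },
        /*null_func=*/[]() { return ::arrow::Status::OK(); }));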

Member Author

Oh, you're right; I'll update this.

    }
    accum_length += length_ptr[i];
  }
  if (ARROW_PREDICT_FALSE(accum_length > std::numeric_limits<int32_t>::max())) {
Member

Since CanFit returned true, can this actually happen?

Member Author

No. Should I just DCHECK this instead?
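
For instance, a sketch (DCHECK_LE is the debug-assertion macro from arrow/util/logging.h):

    // CanFit() already guaranteed the total fits into a 32-bit offset,
    // so assert in debug builds instead of returning a runtime error.
    DCHECK_LE(accum_length, std::numeric_limits<int32_t>::max());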

Member

Well, for large_binary it might even exceed int32, no?

@mapleFU (Member Author) Mar 7, 2025

Yes, but currently the builder is just BinaryBuilder, and large binary might use multiple chunks, so it might hit the !CanFit case. But once it fits, accum_length can never exceed std::numeric_limits<int32_t>::max().

    int num_values, int null_count, const uint8_t* valid_bits,
    int64_t valid_bits_offset, typename EncodingTraits<ByteArrayType>::Accumulator* out,
    int* out_num_values) {
  int max_values = num_values - null_count;
Member

Can you call this num_non_null_values?

@mapleFU requested a review from pitrou March 12, 2025 14:24
@mapleFU (Member Author) commented Mar 18, 2025

@pitrou I've tried to address the comments; would you mind taking a look again?

  if (ARROW_PREDICT_FALSE(decoder_->bytes_left() < accum_length)) {
    return Status::Invalid("Binary data is too short");
  }
  RETURN_NOT_OK(out->builder->ValidateOverflow(accum_length));
Member

But what happens if accum_length > chunk_space_remaining_? Would it just fail? It should probably append a new chunk.

Member Author

It can never happen: chunk_space_remaining_ >= decoder_->bytes_left() >= accum_length, so this case would not occur.

Member

But why is chunk_space_remaining_ >= decoder_->bytes_left()?

Member

Ah, this is because we know that CanFit(decoder_->bytes_left()), right? Adding a comment might help...

Member Author

Added here: c21878c

  return DecodeArrowDenseFastPath(num_values, null_count, valid_bits,
                                  valid_bits_offset, out, out_num_values);
}

Member

I wonder why we need the slow path below at all. Instead, we could just expose chunk_space_remaining_ and PushChunk, and use them to build the output chunk by chunk as necessary.

@mapleFU (Member Author) Mar 27, 2025

Hmm, I'm not sure what this means. When a page's buffer is larger than a single chunk can hold, do we need to find the maximum buffer we can push into, segment the Arrow array into multiple sub-chunks, and then switch to the next batch?

Member

Yes, basically something like this:

    int64_t start = 0;
    while (start < num_non_null_values) {
      // Find the number of values that we can write in this chunk
      // (`end` is declared outside the for loop so it stays visible below)
      int64_t chunk_data_length = 0;
      int64_t end = start;
      for (; end < num_non_null_values; ++end) {
        const int64_t length = length_ptr[end];
        if (ARROW_PREDICT_FALSE(length < 0)) {
          return Status::Invalid("negative string delta length");
        }
        if (chunk_data_length + length > out->chunk_space_remaining()) {
          break;
        }
        chunk_data_length += length;
      }
      // Write chunk [start, end)
      ...
      out->PushChunk();
      start = end;
    }

Member Author

So this should also figure out, given a num_non_null_values, what the corresponding num_values in that range is?

Member

Ahem... that's a good point. Ideally we would need something like VisitNullBitmapInline, but that would stop after a certain number of non-null values...

@mapleFU (Member Author) Mar 27, 2025

Yes, it seems this requires an underlying "loop" abstraction that could signal "next value", "end", and "break"... It also requires handling all-null values. So I'd prefer to use this fast path first.

This abstraction might be like:

struct Stride {
  int32_t num_non_null_values;
  int32_t num_values;
  int32_t binary_length;
};

Stride nextStride(int32_t max_length) {
  if (decoder_->bytes_left() <= out->chunk_space_remaining()) {
    // accumulate value lengths until max_length
    // ...
    return Stride {...};
  }
  VisitNullBitmapInline(...);  // find the largest stride that fits
  return Stride{..};
}

Member Author

So should I move forward with the slower stride checking, or leave it as a fast path here? @pitrou

Member

Ah! Well, I was underestimating the complexity of the clean solution, so leaving it as a fast path sounds fine.

@pitrou (Member) Apr 3, 2025

The remaining problem, though, is that the fallback isn't tested?
