GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation #37400
base: main
This is a port of #35691. I was busy over the past days and now have time to work on it. The previous comments have been resolved. cc @pitrou @wgtmac @emkornfield
Thanks for adding this! I just did an initial review except the test.
# Conflicts: # cpp/src/parquet/column_writer_test.cc
505e23b to e9c550a
Two things need fixing:
fd3856d to 79bdaeb
79bdaeb to d892819
@pitrou @wgtmac @emkornfield Sorry for the late reply. I believe all comments have been replied to or fixed now. The bloom filter location is now a map in all use cases, since it would be sparser than the page indices.
The feature freeze for Arrow 19 is planned for January 6, 2025, and I'm curious if there might be capacity to get this fully reviewed and merged by then (or soon after). If not, feel free to comment with what you'd need (more reviewers, more time, etc). cc @mapleFU @pitrou @wgtmac @emkornfield
Sorry for missing this! I will take a look. Meanwhile, I don't think we need to hurry for the code freeze.
# Conflicts: # cpp/src/parquet/column_writer.cc # cpp/src/parquet/type_fwd.h
Rebased, ready for review now.
I have reviewed for another pass and generally LGTM.
It would be good if @pitrou @emkornfield can take a look after the holiday season.
  ASSERT_EQ(nullptr, bloom_filter);
} else {
  ASSERT_NE(nullptr, bloom_filter);
  bloom_filters_.push_back(std::move(bloom_filter));
What about changing bloom_filters_ to be an output parameter of the function ReadBloomFilters instead of a class member variable?
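A minimal sketch of what this suggestion could look like, outside of the real test fixture. `FakeBloomFilter` and the `has_filter` argument are hypothetical stand-ins invented for illustration (the real helper reads from a Parquet file and returns `parquet::BloomFilter` objects); only the shape of the signature is the point here.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for parquet::BloomFilter, only here so the sketch compiles.
struct FakeBloomFilter {
  int column_ordinal;
};

// Hypothetical helper: instead of appending to a member such as bloom_filters_,
// the caller passes the destination vector in as an output parameter.
// Columns without a filter are skipped, mirroring the nullptr branch in the test.
void ReadBloomFilters(const std::vector<bool>& has_filter,
                      std::vector<std::unique_ptr<FakeBloomFilter>>* out) {
  out->clear();
  for (int i = 0; i < static_cast<int>(has_filter.size()); ++i) {
    if (has_filter[i]) {
      out->push_back(std::make_unique<FakeBloomFilter>(FakeBloomFilter{i}));
    }
  }
}
```

With this shape each test owns its own result vector, so no state leaks between test cases through the fixture.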
  std::vector<std::unique_ptr<BloomFilter>> bloom_filters_;
};

TEST_F(ParquetBloomFilterRoundTripTest, SimpleRoundTrip) {
The three test cases below share a lot of common logic (with exactly the same data). Should we refactor them to eliminate the duplication?
cpp/src/parquet/metadata.h
Outdated
struct BloomFilterLocation {
  /// Row group bloom filter index locations which uses row group ordinal as the key.
  ///
  /// Note: Before Parquet 2.10, the bloom filter index only have "offset". But here
- /// Note: Before Parquet 2.10, the bloom filter index only have "offset". But here
+ /// Note: Before Parquet Format v2.10, the bloom filter index only have "offset". But here
///
/// Number of columns with a bloom filter to be relatively small compared to
/// the number of overall columns, so map is used.
using RowGroupBloomFilterLocation = std::map<int32_t, IndexLocation>;
What about defining RowGroupBloomFilterLocation and FileBloomFilterLocation inside BloomFilterLocation?
I think PageIndexLocation doesn't define them inside the struct. What about keeping it consistent?
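For readers comparing the two options, here is a rough sketch of both layouts. `IndexLocation` below is a local stand-in (assumed to be an offset plus a length), the definition of `FileBloomFilterLocation` is an assumption since it is not quoted in this thread, and `NestedBloomFilterLocation`, its member name, and `CountFilters` are hypothetical names used only for illustration.

```cpp
#include <cstdint>
#include <map>

// Local stand-in for parquet::IndexLocation (assumed: an offset plus a length).
struct IndexLocation {
  int64_t offset;
  int32_t length;
};

// Layout A: aliases at namespace scope, as in the snippet under review.
// Inner key: column ordinal. Outer key (assumed): row group ordinal.
using RowGroupBloomFilterLocation = std::map<int32_t, IndexLocation>;
using FileBloomFilterLocation = std::map<int32_t, RowGroupBloomFilterLocation>;

// Layout B: the same aliases nested inside the struct, as the reviewer suggests.
struct NestedBloomFilterLocation {
  using RowGroupLocation = std::map<int32_t, IndexLocation>;
  using FileLocation = std::map<int32_t, RowGroupLocation>;
  FileLocation bloom_filter_location;  // hypothetical member name
};

// Counts how many (row group, column) pairs actually carry a bloom filter.
// Only present entries are stored, which is why a map suits sparse filters.
inline int64_t CountFilters(const FileBloomFilterLocation& file_location) {
  int64_t count = 0;
  for (const auto& row_group : file_location) {
    count += static_cast<int64_t>(row_group.second.size());
  }
  return count;
}
```

Both layouts name the same underlying map type, so the choice is about naming and consistency with PageIndexLocation rather than behavior.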
# Conflicts: # cpp/src/parquet/file_writer.cc # cpp/src/parquet/properties.h
Could you please fix the CI failure?
I've resolved the comments and fixed the CI. Would you mind re-checking? @wgtmac
Rationale for this change
Currently we allow reading the bloom filter for a specific column and row group. This patch adds support for writing bloom filters.
This patch is just a skeleton. If reviewers think the interface is OK, I'll go on and add tests.
What changes are included in this PR?
Allow writing bloom filters:
ParquetPageIndexRoundTripTest
Are these changes tested?
Yes
Are there any user-facing changes?
Users can create bloom filters in Parquet files with the C++ API.
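As background for what a bloom filter gives readers, here is a toy sketch of the insert/probe contract. It is an illustration only: it is not the split-block layout defined by the Parquet format, it does not use any Arrow or Parquet API, and `ToyBloomFilter` and its methods are hypothetical names.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy bloom filter for illustration only (NOT Parquet's split-block bloom filter).
class ToyBloomFilter {
 public:
  // num_bits must be greater than zero.
  explicit ToyBloomFilter(std::size_t num_bits) : bits_(num_bits, false) {}

  // Sets one bit per hash probe for the given value.
  void Insert(const std::string& value) {
    for (std::size_t seed = 0; seed < kNumHashes; ++seed) {
      bits_[Position(value, seed)] = true;
    }
  }

  // false: the value was definitely never inserted.
  // true: the value was possibly inserted (false positives can happen).
  bool MightContain(const std::string& value) const {
    for (std::size_t seed = 0; seed < kNumHashes; ++seed) {
      if (!bits_[Position(value, seed)]) return false;
    }
    return true;
  }

 private:
  static constexpr std::size_t kNumHashes = 3;

  // Salting the input with the seed yields kNumHashes different probe positions.
  std::size_t Position(const std::string& value, std::size_t seed) const {
    return std::hash<std::string>{}(value + '#' + std::to_string(seed)) % bits_.size();
  }

  std::vector<bool> bits_;
};
```

A reader can skip a row group whenever `MightContain` returns false for the value it is looking for, which is the reason a writer would want to emit these filters.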