GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation #37400

Open
wants to merge 70 commits into
base: main
from

Conversation

mapleFU
Member

@mapleFU mapleFU commented Aug 26, 2023

Rationale for this change

Currently we can read the bloom filter for a specific column and row group; this patch adds support for writing bloom filters.

This patch is just a skeleton. If reviewers think the interface is OK, I'll go on and add tests.

What changes are included in this PR?

Allow writing bloom filters:

  • Add WriterProperties config for writing bloom filters, including whether to write a bloom filter and the (per-row-group) NDV estimate.
  • Add BloomFilterBuilder for Parquet
  • Pass the bloom filter through from FileSerializer to ColumnWriter
  • Ensure Bloom Filter info is written to the file
  • Testing logic for BloomFilterBuilder
  • Testing logic for BloomFilter and ColumnWriter
  • Testing whole roundtrip like ParquetPageIndexRoundTripTest
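
To illustrate why the writer config carries an NDV estimate, here is a minimal sketch of how a split-block bloom filter could be sized from an expected number of distinct values and a target false positive probability. It assumes the sizing formula m = -8 * ndv / ln(1 - fpp^(1/8)) for split-block filters; `BloomFilterOptionsSketch`, `OptimalNumBytes`, and the byte bounds are hypothetical names and values for illustration, not the API added by this PR.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical per-column options (illustrative only, not the PR's API).
struct BloomFilterOptionsSketch {
  int32_t ndv = 1024 * 1024;  // expected distinct values per row group
  double fpp = 0.05;          // target false positive probability, 0 < fpp < 1
};

// Round up to the next power of two (returns v if it already is one).
inline uint64_t NextPowerOfTwo(uint64_t v) {
  uint64_t p = 1;
  while (p < v) p <<= 1;
  return p;
}

// Estimate the filter size in bytes for the given options.
inline uint32_t OptimalNumBytes(const BloomFilterOptionsSketch& opts) {
  constexpr uint32_t kMinBytes = 32;                  // one 256-bit block
  constexpr uint32_t kMaxBytes = 128u * 1024 * 1024;  // assumed upper bound
  // A split-block filter sets 8 bits per inserted value, hence the factor 8
  // and the 1/8 exponent (assumed formula, see lead-in).
  const double bits =
      -8.0 * opts.ndv / std::log(1.0 - std::pow(opts.fpp, 1.0 / 8.0));
  const double bytes = std::ceil(bits / 8.0);
  const double clamped =
      std::min<double>(std::max<double>(bytes, kMinBytes), kMaxBytes);
  // Power-of-two sizes keep block selection a cheap mask or shift.
  return static_cast<uint32_t>(NextPowerOfTwo(static_cast<uint64_t>(clamped)));
}
```

An underestimated NDV yields a filter that is too small and exceeds the target false positive rate, while an overestimate wastes space, which is why the estimate is configured per column and applied per row group.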

Are these changes tested?

Yes

Are there any user-facing changes?

Users can create bloom filters in Parquet files with the C++ API.


@mapleFU
Member Author

mapleFU commented Aug 26, 2023

This is a port of #35691. I was busy for the past few days, but I have time to work on it now.

The previous comments have been addressed. cc @pitrou @wgtmac @emkornfield

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Aug 26, 2023
@mapleFU mapleFU requested review from pitrou and emkornfield August 26, 2023 18:43
@mapleFU mapleFU changed the title from "GH-34785: [C++][Parquet] Parquet Bloom Filter Implement" to "GH-34785: [C++][Parquet] Parquet Bloom Filter Write Implement" Aug 27, 2023
Member

@wgtmac wgtmac left a comment

Thanks for adding this! I just did an initial review of everything except the tests.

@wgtmac wgtmac changed the title from "GH-34785: [C++][Parquet] Parquet Bloom Filter Write Implement" to "GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation" Aug 30, 2023
@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting committer review" label Sep 1, 2023
# Conflicts:
#	cpp/src/parquet/column_writer_test.cc
@mapleFU mapleFU force-pushed the parquet/support-write-bloom-filter branch from 505e23b to e9c550a Compare November 12, 2024 05:48
@mapleFU
Member Author

mapleFU commented Nov 12, 2024

Two things need fixing:

/arrow/cpp/src/parquet/bloom_filter.h:118: error: The following parameter of parquet::BloomFilter::Hash(const FLBA &value, uint32_t type_len) const is not documented:
  parameter 'type_len' (warning treated as error, aborting now)
D:/a/arrow/arrow/build/cpp/src/parquet/CMakeFiles/parquet_shared.dir/Unity/unity_3_cxx.cxx
In file included from D:/a/arrow/arrow/build/cpp/src/parquet/CMakeFiles/parquet_shared.dir/Unity/unity_3_cxx.cxx:16:
D:/a/arrow/arrow/cpp/src/parquet/schema.cc: In function 'void parquet::schema::PrintRepLevel(parquet::Repetition::type, std::ostream&)':
D:/a/arrow/arrow/cpp/src/parquet/schema.cc:630:30: error: expected unqualified-id before ':' token
  630 |     case Repetition::OPTIONAL:
      |                              ^
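
A minimal sketch of how both errors could be addressed, under stated assumptions. The first error goes away once `type_len` has a `\param` entry. The second looks like a macro collision: the assumption here is that a platform header pulled into the unity build defines `OPTIONAL` as an empty macro, which turns `Repetition::OPTIONAL` into `Repetition::`. The `#define` below only simulates that header, and `FLBA`, `BloomFilterSketch`, and the `std::hash` body are simplified stand-ins rather than the real Parquet code.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Simulates a platform header that defines OPTIONAL as an empty macro
// (assumption about what leaked into the unity build).
#define OPTIONAL

// Guard: drop the macro before using enumerators with the same name.
#ifdef OPTIONAL
#undef OPTIONAL
#endif

// Simplified stand-in for parquet::Repetition.
struct Repetition {
  enum type { REQUIRED = 0, OPTIONAL = 1, REPEATED = 2 };
};

inline const char* RepetitionName(Repetition::type t) {
  switch (t) {
    case Repetition::REQUIRED:
      return "required";
    case Repetition::OPTIONAL:  // compiles only because the macro is gone
      return "optional";
    case Repetition::REPEATED:
      return "repeated";
  }
  return "unknown";
}

// Simplified stand-in for Parquet's fixed-length byte array type.
struct FLBA {
  const uint8_t* ptr;
};

class BloomFilterSketch {
 public:
  /// Compute the hash of a fixed-length byte array value.
  ///
  /// \param value the value to hash.
  /// \param type_len the length in bytes of the fixed-length value.
  /// \return the hash of the value.
  uint64_t Hash(const FLBA& value, uint32_t type_len) const {
    // std::hash is a placeholder for the real hash function.
    return std::hash<std::string>{}(
        std::string(reinterpret_cast<const char*>(value.ptr), type_len));
  }
};
```

If the macro does come from a system header, undefining it in one shared fix-up header that is included after the system headers keeps the workaround in a single place.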

@mapleFU mapleFU force-pushed the parquet/support-write-bloom-filter branch from fd3856d to 79bdaeb Compare November 15, 2024 04:49
@mapleFU mapleFU force-pushed the parquet/support-write-bloom-filter branch from 79bdaeb to d892819 Compare November 15, 2024 06:57
@mapleFU
Member Author

mapleFU commented Nov 19, 2024

@pitrou @wgtmac @emkornfield Sorry for the late reply. I believe all comments have been replied to or fixed now. The bloom filter now uses a map in all use cases, since it is expected to be sparser than the page indices.

@amoeba
Member

amoeba commented Dec 19, 2024

The feature freeze for Arrow 19 is planned for January 6, 2025 and I'm curious if there might be capacity to get this fully reviewed and merged by then (or soon after). If not, feel free to comment with what you'd need (more reviewers, more time, etc). cc @mapleFU @pitrou @wgtmac @emkornfield

@wgtmac
Member

wgtmac commented Dec 20, 2024

Sorry for missing this! I will take a look. Meanwhile, I don't think we need to hurry for the code freeze.

# Conflicts:
#	cpp/src/parquet/column_writer.cc
#	cpp/src/parquet/type_fwd.h
@mapleFU
Member Author

mapleFU commented Dec 20, 2024

Rebased; ready for review now.

Member

@wgtmac wgtmac left a comment

I have done another review pass and it generally LGTM.

It would be good if @pitrou @emkornfield can take a look after the holiday season.

ASSERT_EQ(nullptr, bloom_filter);
} else {
ASSERT_NE(nullptr, bloom_filter);
bloom_filters_.push_back(std::move(bloom_filter));
Member

What about changing bloom_filters_ to be an output parameter to function ReadBloomFilters instead of a class member variable?
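
A rough sketch of what the output-parameter variant could look like. `BloomFilterSketch` and `RowGroupReaderSketch` are hypothetical stand-ins for the real reader types, and the helper skips columns without a filter instead of asserting on them as the test does.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for a bloom filter read back from the file.
struct BloomFilterSketch {
  uint32_t num_bytes;
};

// Stand-in row group reader: a size of 0 means the column has no bloom filter.
struct RowGroupReaderSketch {
  std::vector<uint32_t> filter_sizes;

  std::unique_ptr<BloomFilterSketch> GetColumnBloomFilter(size_t column) const {
    if (filter_sizes[column] == 0) return nullptr;
    return std::unique_ptr<BloomFilterSketch>(
        new BloomFilterSketch{filter_sizes[column]});
  }
};

// The caller owns the vector, so the fixture needs no bloom_filters_ member
// and each test starts from a clean result.
void ReadBloomFilters(
    const std::vector<RowGroupReaderSketch>& row_groups,
    std::vector<std::unique_ptr<BloomFilterSketch>>* bloom_filters) {
  bloom_filters->clear();
  for (const auto& row_group : row_groups) {
    for (size_t col = 0; col < row_group.filter_sizes.size(); ++col) {
      auto bloom_filter = row_group.GetColumnBloomFilter(col);
      if (bloom_filter != nullptr) {
        bloom_filters->push_back(std::move(bloom_filter));
      }
    }
  }
}
```

With this shape, state cannot leak between test cases through the fixture, at the cost of one extra argument at each call site.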

std::vector<std::unique_ptr<BloomFilter>> bloom_filters_;
};

TEST_F(ParquetBloomFilterRoundTripTest, SimpleRoundTrip) {
Member

The three test cases below share a lot of common logic (with exactly the same data). Should we refactor them to eliminate the duplication?

struct BloomFilterLocation {
/// Row group bloom filter index locations which uses row group ordinal as the key.
///
/// Note: Before Parquet 2.10, the bloom filter index only have "offset". But here
Member

Suggested change
- /// Note: Before Parquet 2.10, the bloom filter index only have "offset". But here
+ /// Note: Before Parquet Format v2.10, the bloom filter index only have "offset". But here

///
/// Number of columns with a bloom filter to be relatively small compared to
/// the number of overall columns, so map is used.
using RowGroupBloomFilterLocation = std::map<int32_t, IndexLocation>;
Member

What about defining RowGroupBloomFilterLocation and FileBloomFilterLocation in the BloomFilterLocation?

Member Author

I think PageIndexLocation doesn't define them inside the struct. What about keeping it consistent?
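
For readers following this thread, a minimal sketch of the map-based layout under discussion, with the aliases kept at namespace scope as in the reply above. The `IndexLocation` fields, the `FileBloomFilterLocation` alias, the member name, and `FindLocation` are assumptions for illustration; only `RowGroupBloomFilterLocation` is quoted from the diff.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>

// Assumed shape: byte offset and length of a serialized index in the file.
struct IndexLocation {
  int64_t offset;
  int32_t length;
};

// Quoted from the diff; the key is assumed to be the column ordinal.
using RowGroupBloomFilterLocation = std::map<int32_t, IndexLocation>;
// Assumed alias: the key is the row group ordinal.
using FileBloomFilterLocation = std::map<size_t, RowGroupBloomFilterLocation>;

struct BloomFilterLocation {
  FileBloomFilterLocation bloom_filter_location;
};

// Returns nullptr when the column in that row group has no bloom filter.
inline const IndexLocation* FindLocation(const BloomFilterLocation& location,
                                         size_t row_group, int32_t column) {
  const auto rg_it = location.bloom_filter_location.find(row_group);
  if (rg_it == location.bloom_filter_location.end()) return nullptr;
  const auto col_it = rg_it->second.find(column);
  if (col_it == rg_it->second.end()) return nullptr;
  return &col_it->second;
}
```

Because only a few columns typically carry a bloom filter, the maps store entries just for those columns, whereas a vector indexed by column ordinal would need a slot for every column.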

# Conflicts:
#	cpp/src/parquet/file_writer.cc
#	cpp/src/parquet/properties.h
@wgtmac
Member

wgtmac commented Feb 6, 2025

Could you please fix the CI failure?

@mapleFU
Member Author

mapleFU commented Mar 26, 2025

I've resolved the comments and fixed the CI. Would you mind re-checking? @wgtmac

Successfully merging this pull request may close these issues.

[C++][Parquet] Allow writing BloomFilter for specific column
8 participants