GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation #37400

Open
wants to merge 70 commits into
base: main

mapleFU
Member

@mapleFU mapleFU commented Aug 26, 2023

Rationale for this change

Currently we support reading the bloom filter for a specific column and row group. This patch adds support for writing bloom filters.

This patch is just a skeleton. If reviewers think the interface is OK, I'll continue and add tests.

What changes are included in this PR?

Allow writing bloom filters:

  • Add WriterProperties config for writing bloom filters, including the bloom filter settings and the (per-row-group) NDV (number of distinct values) estimate.
  • Add BloomFilterBuilder for parquet
  • Pass the bloom filter from FileSerializer down to ColumnWriter
  • Ensure Bloom Filter info is written to the file
  • Testing logic for BloomFilterBuilder
  • Testing logic for BloomFilter and ColumnWriter
  • Testing whole roundtrip like ParquetPageIndexRoundTripTest
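
To make the NDV setting concrete: a bloom filter's size is typically derived from the expected number of distinct values (NDV) and a target false-positive probability (FPP). The sketch below is a minimal, self-contained illustration of that sizing for a block-based filter that sets 8 bits per inserted value. `OptimalNumOfBits`, `NextPowerOfTwo` and the two bounds are written for this example only and are assumptions, not the options or constants added by this PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Illustrative bounds (assumptions for this sketch, not the PR's constants).
constexpr uint64_t kMinBits = 32 * 8;                     // 32 bytes
constexpr uint64_t kMaxBits = 128ull * 1024 * 1024 * 8;   // 128 MiB

// Round up to the next power of two (returns n if it already is one).
inline uint64_t NextPowerOfTwo(uint64_t n) {
  uint64_t p = 1;
  while (p < n) p <<= 1;
  return p;
}

// Estimate the number of bits needed to hold `ndv` distinct values at
// false-positive probability `fpp`, assuming 8 bits are set per value:
//   m = -8 * ndv / ln(1 - fpp^(1/8))
// The result is clamped to [kMinBits, kMaxBits] and rounded up to a power of two.
inline uint64_t OptimalNumOfBits(uint64_t ndv, double fpp) {
  assert(fpp > 0.0 && fpp < 1.0);
  const double m = -8.0 * static_cast<double>(ndv) /
                   std::log(1.0 - std::pow(fpp, 1.0 / 8.0));
  uint64_t bits = kMaxBits;
  if (m < static_cast<double>(kMaxBits)) bits = static_cast<uint64_t>(m);
  if (bits < kMinBits) bits = kMinBits;
  return NextPowerOfTwo(bits);
}
```

With 1000 distinct values and a 5% FPP target this gives 8192 bits (1 KiB). Underestimating the NDV makes the filter smaller and raises the real false-positive rate, which is why the estimate is worth exposing as a writer setting.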

Are these changes tested?

Yes

Are there any user-facing changes?

Users can create bloom filters in Parquet files with the C++ API.

@mapleFU
Member Author

mapleFU commented Aug 26, 2023

This is a port of #35691. I was busy for the past while, but I have time to work on it now.

The previous comments have been addressed. cc @pitrou @wgtmac @emkornfield

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Aug 26, 2023
@mapleFU mapleFU requested review from pitrou and emkornfield August 26, 2023 18:43
@mapleFU mapleFU changed the title GH-34785: [C++][Parquet] Parquet Bloom Filter Implement GH-34785: [C++][Parquet] Parquet Bloom Filter Write Implement Aug 27, 2023
Member

@wgtmac wgtmac left a comment

Thanks for adding this! I just did an initial review except the test.

@wgtmac wgtmac changed the title GH-34785: [C++][Parquet] Parquet Bloom Filter Write Implement GH-34785: [C++][Parquet] Parquet Bloom Filter Writer Implementation Aug 30, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Sep 1, 2023
@mapleFU mapleFU force-pushed the parquet/support-write-bloom-filter branch from 505e23b to e9c550a Compare November 12, 2024 05:48
@mapleFU
Member Author

mapleFU commented Nov 12, 2024

Two things need fixing:

/arrow/cpp/src/parquet/bloom_filter.h:118: error: The following parameter of parquet::BloomFilter::Hash(const FLBA &value, uint32_t type_len) const is not documented:
  parameter 'type_len' (warning treated as error, aborting now)
D:/a/arrow/arrow/build/cpp/src/parquet/CMakeFiles/parquet_shared.dir/Unity/unity_3_cxx.cxx
In file included from D:/a/arrow/arrow/build/cpp/src/parquet/CMakeFiles/parquet_shared.dir/Unity/unity_3_cxx.cxx:16:
D:/a/arrow/arrow/cpp/src/parquet/schema.cc: In function 'void parquet::schema::PrintRepLevel(parquet::Repetition::type, std::ostream&)':
D:/a/arrow/arrow/cpp/src/parquet/schema.cc:630:30: error: expected unqualified-id before ':' token
  630 |     case Repetition::OPTIONAL:
      |                              ^
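
An error like `expected unqualified-id before ':' token` at `case Repetition::OPTIONAL:` is what one would see if a platform header has defined `OPTIONAL` as an empty macro before the enum is used; Windows SDK headers are known to do this. Whether that is the cause here is an assumption, since the log does not say. A minimal sketch of the collision and the usual `#undef` workaround, using a stand-in `Repetition` enum and a hypothetical `RepLevelName` helper:

```cpp
#include <cassert>
#include <string>

// Simulate a platform header that defines OPTIONAL as an empty macro.
#define OPTIONAL

// Workaround: drop the macro before declaring or using the enumerator.
// Without these three lines, `Repetition::OPTIONAL` expands to `Repetition::`
// and the compiler reports an error at the token that follows.
#ifdef OPTIONAL
#undef OPTIONAL
#endif

// Stand-in for the real enum (an assumption for this sketch).
struct Repetition {
  enum type { REQUIRED = 0, OPTIONAL = 1, REPEATED = 2 };
};

// Hypothetical helper that switches over the enum, like the failing function.
inline std::string RepLevelName(Repetition::type t) {
  switch (t) {
    case Repetition::REQUIRED:
      return "required";
    case Repetition::OPTIONAL:
      return "optional";
    case Repetition::REPEATED:
      return "repeated";
  }
  assert(false && "unknown repetition");
  return "";
}
```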

@mapleFU mapleFU force-pushed the parquet/support-write-bloom-filter branch from fd3856d to 79bdaeb Compare November 15, 2024 04:49
@mapleFU mapleFU force-pushed the parquet/support-write-bloom-filter branch from 79bdaeb to d892819 Compare November 15, 2024 06:57
@mapleFU
Member Author

mapleFU commented Nov 19, 2024

@pitrou @wgtmac @emkornfield Sorry for the late reply. I believe all comments have been replied to or fixed now. The bloom filters are now stored in a map in all use cases, since they are expected to be sparser than page indices.

@amoeba
Member

amoeba commented Dec 19, 2024

The feature freeze for Arrow 19 is planned for January 6, 2025 and I'm curious if there might be capacity to get this fully reviewed and merged by then (or soon after). If not, feel free to comment with what you'd need (more reviewers, more time, etc). cc @mapleFU @pitrou @wgtmac @emkornfield

@wgtmac
Member

wgtmac commented Dec 20, 2024

Sorry for missing this! I will take a look. Meanwhile, I don't think we need to hurry for the code freeze.

# Conflicts:
#	cpp/src/parquet/column_writer.cc
#	cpp/src/parquet/type_fwd.h
@mapleFU
Member Author

mapleFU commented Dec 20, 2024

Rebased, ready for review now

Member

@wgtmac wgtmac left a comment

I did another review pass and this generally LGTM.

It would be good if @pitrou @emkornfield can take a look after the holiday season.

ASSERT_EQ(nullptr, bloom_filter);
} else {
ASSERT_NE(nullptr, bloom_filter);
bloom_filters_.push_back(std::move(bloom_filter));
Member

What about changing bloom_filters_ to be an output parameter to function ReadBloomFilters instead of a class member variable?
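
A minimal sketch of the suggested shape, assuming stand-in types: `BloomFilter` here is a placeholder struct and `ReadOne` is a hypothetical reader, neither taken from the PR. The point is only that the helper fills a caller-owned vector, so each test holds its own results and the fixture carries no state between cases.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Stand-in type for this sketch; the real BloomFilter interface is richer.
struct BloomFilter {
  int column = 0;
};

// Hypothetical reader: pretends that only even-numbered columns have a filter.
inline std::unique_ptr<BloomFilter> ReadOne(int column) {
  if (column % 2 != 0) return nullptr;
  auto bloom_filter = std::make_unique<BloomFilter>();
  bloom_filter->column = column;
  return bloom_filter;
}

// Results go to an output parameter instead of a fixture member variable.
inline void ReadBloomFilters(int num_columns,
                             std::vector<std::unique_ptr<BloomFilter>>* out) {
  assert(out != nullptr);
  out->clear();
  for (int column = 0; column < num_columns; ++column) {
    if (auto bloom_filter = ReadOne(column)) {
      out->push_back(std::move(bloom_filter));
    }
  }
}
```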

std::vector<std::unique_ptr<BloomFilter>> bloom_filters_;
};

TEST_F(ParquetBloomFilterRoundTripTest, SimpleRoundTrip) {
Member

The three test cases below share a lot of common logic (with exactly the same data). Should we refactor them to eliminate the duplication?

struct BloomFilterLocation {
/// Row group bloom filter index locations which uses row group ordinal as the key.
///
/// Note: Before Parquet 2.10, the bloom filter index only have "offset". But here
Member

Suggested change
- /// Note: Before Parquet 2.10, the bloom filter index only have "offset". But here
+ /// Note: Before Parquet Format v2.10, the bloom filter index only have "offset". But here

///
/// Number of columns with a bloom filter to be relatively small compared to
/// the number of overall columns, so map is used.
using RowGroupBloomFilterLocation = std::map<int32_t, IndexLocation>;
Member

What about defining RowGroupBloomFilterLocation and FileBloomFilterLocation in the BloomFilterLocation?

Member Author

I think PageIndexLocation doesn't define them here. What about keeping it consistent?
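
For readers following the map discussion, here is a small sketch of why a map keyed by column ordinal suits sparse bloom filters: columns without a filter simply have no entry. `IndexLocation` below is a stand-in with offset and length fields, and `FindBloomFilterLocation` is a hypothetical helper for this example, not part of the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>

// Stand-in for the location record (offset and length within the file).
struct IndexLocation {
  int64_t offset;
  int32_t length;
};

// Keyed by column ordinal; columns without a bloom filter have no entry.
using RowGroupBloomFilterLocation = std::map<int32_t, IndexLocation>;

// Hypothetical lookup helper: returns std::nullopt when the column has no filter.
inline std::optional<IndexLocation> FindBloomFilterLocation(
    const RowGroupBloomFilterLocation& locations, int32_t column_ordinal) {
  assert(column_ordinal >= 0);
  auto it = locations.find(column_ordinal);
  if (it == locations.end()) return std::nullopt;
  return it->second;
}
```

A vector indexed by column ordinal would need one slot per leaf column even when only a handful of columns have filters; the map stores only the entries that exist.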

# Conflicts:
#	cpp/src/parquet/file_writer.cc
#	cpp/src/parquet/properties.h
@wgtmac
Member

wgtmac commented Feb 6, 2025

Could you please fix the CI failure?

@mapleFU
Member Author

mapleFU commented Mar 26, 2025

I've resolved the comments and fixed the CI. Would you mind re-checking? @wgtmac

#include "arrow/io/type_fwd.h"
#include "parquet/types.h"

namespace parquet::internal {
Member

If this is an internal namespace, how about renaming these files to bloom_filter_internal.h/cc ?

///
/// The bloom filter cannot be modified after this method is called.
///
/// \param[out] sink The output stream to write the bloom filter.
Member

sink should be \param[in,out] I guess?

/// ```
class PARQUET_EXPORT BloomFilterBuilder {
public:
/// \brief API to create a BloomFilterBuilder.
Member

nit: explain the expected lifetime of schema and properties.


/// \brief Get the BloomFilter from column ordinal.
///
/// \param column_ordinal Column ordinal in schema, which is only for leaf columns.
Member

Why do some comments use \param while others use @param?


template <typename ArrayType>
void UpdateBinaryBloomFilter(BloomFilter* bloom_filter, const ArrayType& array) {
// Using a smaller size because an extra `byte_arrays` are used.
Member

Suggested change
- // Using a smaller size because an extra `byte_arrays` are used.
+ // Using a smaller size because an extra `byte_arrays` is used.

bool finished_ = false;

using RowGroupBloomFilters = std::map<int32_t, std::unique_ptr<BloomFilter>>;
// Using unique_ptr because the `std::unique_ptr<BloomFilter>` is not copyable.
Member

Suggested change
- // Using unique_ptr because the `std::unique_ptr<BloomFilter>` is not copyable.
+ // Using `std::map` because `std::unique_ptr<BloomFilter>` is not copyable.

return nullptr;
}
const BloomFilterOptions& bloom_filter_options = *bloom_filter_options_opt;
// CheckState() should have checked that file_bloom_filters_ is not empty.
Member

I still think that we don't need to be fool-proof here, especially when CheckState() has already done the same checks. Removing these DCHECKs improves readability.

}
if (page_statistics_ != nullptr) {
page_statistics_->UpdateSpaced(values, valid_bits, valid_bits_offset,
num_spaced_values, num_values, num_nulls);
}
UpdateUnencodedDataBytes();
}

void UpdateBloomFilter(const T* values, int64_t num_values);
Member

Does it make sense to move all these UpdateXXX to the BloomFilter class? These functions look more like the detail of bloom filter instead of a column writer.

int64_t valid_bits_offset) {
if (bloom_filter_) {
std::array<uint64_t, kHashBatchSize> hashes;
::arrow::internal::VisitSetBitRunsVoid(
Member

Same as my comment above, moving all UpdateBloomFilterXXX functions to the BloomFilter class can significantly reduce the changes in this file.
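
As background for the spaced update path quoted above, the sketch below shows the general idea of visiting runs of set bits in a validity bitmap so that only non-null slots are hashed. It is a plain reimplementation for illustration: `GetBit`, `VisitSetBitRuns` and `CollectRuns` are written for this example, assume an LSB-first bitmap, and are not Arrow's `VisitSetBitRunsVoid`.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Read bit `i` of an LSB-first bitmap.
inline bool GetBit(const uint8_t* bitmap, int64_t i) {
  return ((bitmap[i / 8] >> (i % 8)) & 1) != 0;
}

// Calls visit(position, run_length) for each maximal run of set bits in
// [offset, offset + length). `position` is relative to `offset`.
template <typename Visitor>
void VisitSetBitRuns(const uint8_t* bitmap, int64_t offset, int64_t length,
                     Visitor&& visit) {
  assert(offset >= 0 && length >= 0);
  int64_t i = 0;
  while (i < length) {
    // Skip unset bits (nulls).
    while (i < length && !GetBit(bitmap, offset + i)) ++i;
    const int64_t start = i;
    // Extend over set bits (valid values).
    while (i < length && GetBit(bitmap, offset + i)) ++i;
    if (i > start) visit(start, i - start);
  }
}

// Convenience wrapper that collects the runs as (position, run_length) pairs.
inline std::vector<std::pair<int64_t, int64_t>> CollectRuns(const uint8_t* bitmap,
                                                            int64_t offset,
                                                            int64_t length) {
  std::vector<std::pair<int64_t, int64_t>> runs;
  VisitSetBitRuns(bitmap, offset, length, [&](int64_t position, int64_t run_length) {
    runs.emplace_back(position, run_length);
  });
  return runs;
}
```

In a spaced layout the values array keeps a slot for every row, so each run maps directly to a contiguous range `values[position .. position + run_length)` that can be hashed as one batch.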

std::array<uint64_t, kHashBatchSize> hashes;
for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
int64_t current_hash_batch_size = std::min(kHashBatchSize, num_values - i);
bloom_filter_->Hashes(values, static_cast<int>(current_hash_batch_size),
Member

I just found that parquet-cpp has a lot of int num_values used in the function signature. Is it worth changing all of them to int32_t? cc @emkornfield @pitrou
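
For context on the quoted loop, here is a self-contained sketch of the batching pattern: values are hashed in chunks of at most `kHashBatchSize` into a fixed-size stack buffer, so the per-batch count always fits in an `int` even when the total is 64-bit. `MixHash` is a placeholder mixer chosen for this example (it is not the hash Parquet uses), and `HashInBatches` with its vector sink is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int64_t kHashBatchSize = 256;

// Placeholder 64-bit mixer for this sketch (not the hash used by Parquet).
inline uint64_t MixHash(int64_t value) {
  uint64_t x = static_cast<uint64_t>(value) + 0x9E3779B97F4A7C15ull;
  x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ull;
  x = (x ^ (x >> 27)) * 0x94D049BB133111EBull;
  return x ^ (x >> 31);
}

// Hash `num_values` values in batches of at most kHashBatchSize and append the
// hashes to `sink`. Returns the number of batches processed.
inline int64_t HashInBatches(const int64_t* values, int64_t num_values,
                             std::vector<uint64_t>* sink) {
  assert(num_values >= 0 && sink != nullptr);
  std::array<uint64_t, kHashBatchSize> hashes;
  int64_t num_batches = 0;
  for (int64_t i = 0; i < num_values; i += kHashBatchSize) {
    // Both arguments are int64_t, so std::min deduces a single type.
    const int64_t batch_size = std::min(kHashBatchSize, num_values - i);
    for (int64_t j = 0; j < batch_size; ++j) {
      hashes[j] = MixHash(values[i + j]);
    }
    sink->insert(sink->end(), hashes.begin(), hashes.begin() + batch_size);
    ++num_batches;
  }
  return num_batches;
}
```

Because `batch_size` never exceeds `kHashBatchSize`, a narrowing cast at the call into the per-batch hashing function is safe regardless of which integer type the signature uses.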

Successfully merging this pull request may close these issues.

[C++][Parquet] Allow writing BloomFilter for specific column
8 participants