Adding Compression for BloomFilter #408

Open
asfimport opened this issue Mar 13, 2023 · 3 comments

asfimport commented Mar 13, 2023

In current Parquet implementations, if the ndv (number of distinct values) is not set for a BloomFilter, most implementations assume an ndv of 1M and use it together with the fpp (false-positive probability) to size the filter. So, if the fpp is 0.01, the BloomFilter may grow to 2M for each column, which is really huge. Should we support compression for BloomFilter, like:

 


 /**
  * The compression used in the Bloom filter.
  **/
 struct Uncompressed {}
 union BloomFilterCompression {
   1: Uncompressed UNCOMPRESSED;
+  2: CompressionCodec COMPRESSION;
 }
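
For reference, the 2M figure above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes the split-block sizing rule `m = -8 * ndv / ln(1 - fpp^(1/8))` bits, rounded up to the next power of two, which is my understanding of what the common implementations do. The function names are hypothetical, and real implementations also clamp the result to minimum and maximum sizes, which is left out here.

```python
import math


def optimal_num_bits(ndv: int, fpp: float) -> int:
    """Estimate the bit count of a split-block Bloom filter.

    Assumption: bits = -8 * ndv / ln(1 - fpp^(1/8)), rounded up to the
    next power of two. Min/max clamping is intentionally omitted.
    """
    if ndv <= 0 or not 0.0 < fpp < 1.0:
        raise ValueError("ndv must be positive and fpp must be in (0, 1)")
    raw_bits = -8.0 * ndv / math.log(1.0 - fpp ** (1.0 / 8.0))
    # Round up to the next power of two.
    num_bits = 1
    while num_bits < raw_bits:
        num_bits <<= 1
    return num_bits


def optimal_num_bytes(ndv: int, fpp: float) -> int:
    """Same estimate as optimal_num_bits, expressed in bytes."""
    return optimal_num_bits(ndv, fpp) // 8


# ndv guessed as 1M, fpp = 0.01: about 9.7M raw bits, rounded up to 2^24 bits
print(optimal_num_bytes(1_000_000, 0.01))  # 2097152 bytes, i.e. 2 MiB
```

Under this assumption the rounding step matters: the raw estimate is roughly 1.2 MB, and the power-of-two rounding is what pushes it to 2M.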

Reporter: Xuwei Fu / @mapleFU
Assignee: Xuwei Fu / @mapleFU

Note: This issue was originally created as PARQUET-2256. Please see the migration documentation for further details.

Gang Wu / @wgtmac:
Apache ORC supports compressing bloom filters. It would be nice if we could do something similar.
However, I think there is a prerequisite (or at least a closely related issue): https://issues.apache.org/jira/browse/PARQUET-2257

Gabor Szadovszky / @gszadovszky:
@mapleFU, would you mind doing some investigation before this update? Let's take the binary data of one of the 2M bloom filters mentioned above and compress it with a few codecs to see the gain. If the ratio is good, it might be worth adding this feature. It is also worth mentioning that compressing the bloom filter might hurt filtering performance.
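
A rough way to try the experiment suggested above, before pulling a real filter out of a file, is to compress a synthetic one. The sketch below is only that: it fills a block-structured bit array (32-byte blocks, one bit set in each of the 8 32-bit words per insert) using a PRNG in place of real value hashes, and it uses only Python standard-library codecs, not the Parquet `CompressionCodec` set. `build_synthetic_filter` and `compression_ratios` are hypothetical names.

```python
import bz2
import lzma
import random
import zlib

BLOCK_BYTES = 32      # assumed block size: 8 words of 32 bits
WORDS_PER_BLOCK = 8


def build_synthetic_filter(num_bytes: int, num_inserts: int, seed: int = 0) -> bytes:
    """Fill a block-structured bit array, using a PRNG instead of real hashes."""
    if num_bytes <= 0 or num_bytes % BLOCK_BYTES != 0:
        raise ValueError("num_bytes must be a positive multiple of 32")
    rng = random.Random(seed)
    bits = bytearray(num_bytes)
    num_blocks = num_bytes // BLOCK_BYTES
    for _ in range(num_inserts):
        base = rng.randrange(num_blocks) * BLOCK_BYTES
        for word in range(WORDS_PER_BLOCK):
            bit = rng.randrange(32)  # one bit per 32-bit word
            bits[base + word * 4 + bit // 8] |= 1 << (bit % 8)
    return bytes(bits)


def compression_ratios(data: bytes) -> dict:
    """Return compressed_size / raw_size for a few standard-library codecs."""
    codecs = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}
    return {name: len(compress(data)) / len(data) for name, compress in codecs.items()}
```

Calling `compression_ratios(build_synthetic_filter(2 * 1024 * 1024, n))` for several values of `n` up to the guessed ndv shows how the ratio changes with fill level. A sparsely filled filter is mostly zero bytes and should compress well, while one filled close to its design capacity has nearly random bits, so the gain there may be small. Numbers from real filters could differ and are what the decision should rest on.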

Xuwei Fu / @mapleFU:
@gszadovszky Yes, I'd like to. I think having compression in the standard doesn't mean we always need to compress. We can do it only when the original BloomFilter occupies a lot of space and compression can save a lot of time.
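
The "compress only when it pays off" idea could look roughly like the sketch below on the writer side. Everything here is an assumption for illustration: `maybe_compress`, the `min_size` and `max_ratio` thresholds, and the use of zlib are hypothetical, and the returned tags simply mirror the field names of the proposed union.

```python
import zlib


def maybe_compress(filter_bytes: bytes,
                   min_size: int = 64 * 1024,
                   max_ratio: float = 0.8):
    """Return (tag, payload), compressing only when it is worthwhile.

    Hypothetical policy: skip small filters entirely, and keep the
    compressed form only if it is at most max_ratio of the raw size.
    """
    if len(filter_bytes) < min_size:
        return "UNCOMPRESSED", filter_bytes
    packed = zlib.compress(filter_bytes)
    if len(packed) <= max_ratio * len(filter_bytes):
        return "COMPRESSION", packed
    return "UNCOMPRESSED", filter_bytes
```

With a policy like this, a reader only pays the decompression cost for filters where the writer found a real size reduction.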
