In current Parquet implementations, if the BloomFilter's ndv is not set, most implementations guess 1M as the ndv and use it together with the fpp to size the filter. So if fpp is 0.01, the BloomFilter may grow to about 2 MB per column, which is really huge. Should we support compression for the BloomFilter, like:
/**
 * The compression used in the Bloom filter.
 **/
struct Uncompressed {}
union BloomFilterCompression {
  1: Uncompressed UNCOMPRESSED;
+ 2: CompressionCodec COMPRESSION;
}
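For context, the ~2 MB figure can be reproduced from the classic bloom filter sizing formula, m = -n·ln(p)/(ln 2)², plus power-of-two rounding of the bitset (which, as far as we know, implementations such as parquet-mr's BlockSplitBloomFilter apply — an assumption here, as are the helper names):

```python
import math

def bloom_filter_bytes(ndv: int, fpp: float) -> int:
    # Classic bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits,
    # converted to bytes. `ndv` is the number of distinct values,
    # `fpp` the target false-positive probability.
    bits = -ndv * math.log(fpp) / (math.log(2) ** 2)
    return math.ceil(bits / 8)

def round_up_to_power_of_two(n: int) -> int:
    # Hypothetical helper mirroring the power-of-two bitset rounding
    # that some implementations perform.
    return 1 << (n - 1).bit_length()

raw = bloom_filter_bytes(1_000_000, 0.01)   # roughly 1.2 MB by the formula
rounded = round_up_to_power_of_two(raw)     # rounds up to 2 MiB
print(raw, rounded)
```

With the guessed ndv of 1M and fpp of 0.01, the formula gives roughly 1.2 MB, and rounding the bitset up to the next power of two lands on the 2 MiB mentioned in the issue.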
Gabor Szadovszky / @gszadovszky: @mapleFU, would you mind doing some investigation before this update? Let's get the binary data of a mentioned 2MB bloom filter and compress it with some codecs to see the gain. If the ratio is good, it might be worth adding this feature. It is also worth mentioning that compressing the bloom filter might hurt filtering performance.
Xuwei Fu / @mapleFU: @gszadovszky Yes, I'd like to. I think having compression in the standard doesn't mean we always need to compress. We can do it only when the original BloomFilter occupies a lot of space and compression can save a lot of it.
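The suggested investigation could be sketched as below. This is not the Parquet implementation — it only models an oversized filter by setting pseudo-random bits for a given actual ndv (the bitset size, hash count, and codec choice of zlib are all assumptions) and measures how well the bitset compresses at different fill levels:

```python
import random
import zlib

BITSET_BYTES = 2 * 1024 * 1024  # a filter sized for the guessed 1M ndv
NUM_HASHES = 8                  # assumed hash functions per inserted value

def simulated_bitset(actual_ndv: int, seed: int = 42) -> bytearray:
    # Model a bloom filter that was sized for 1M distinct values but
    # actually holds `actual_ndv`: each value sets NUM_HASHES
    # pseudo-random bits in the bitset.
    rng = random.Random(seed)
    bits = bytearray(BITSET_BYTES)
    for _ in range(actual_ndv * NUM_HASHES):
        pos = rng.randrange(BITSET_BYTES * 8)
        bits[pos // 8] |= 1 << (pos % 8)
    return bits

for ndv in (10_000, 100_000, 1_000_000):
    compressed = len(zlib.compress(bytes(simulated_bitset(ndv)), level=6))
    print(f"ndv={ndv:>9}: compressed to {compressed / BITSET_BYTES:.1%}")
```

The trend this illustrates: a sparsely filled (oversized) bitset compresses very well, while a filter near its design capacity has roughly half its bits set and compresses poorly — which is why compression mainly pays off when the ndv guess was far too high.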
Reporter: Xuwei Fu / @mapleFU
Assignee: Xuwei Fu / @mapleFU
Note: This issue was originally created as PARQUET-2256. Please see the migration documentation for further details.