Adding Compression for BloomFilter #408

asfimport · 2023-03-13T14:30:00Z

In current Parquet implementations, if the ndv is not set for a BloomFilter, most implementations guess 1M as the ndv and use it together with the fpp to size the filter. So, if the fpp is 0.01, the BloomFilter size may grow to 2M for each column, which is really huge. Should we support compression for BloomFilter, like:


/**
 * The compression used in the Bloom filter.
 **/
struct Uncompressed {}
union BloomFilterCompression {
  1: Uncompressed UNCOMPRESSED;
  2: CompressionCodec COMPRESSION;  // proposed addition
}
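
For reference, the 2M figure above can be reproduced with a small sketch. It assumes the split-block sizing formula `m = -8 * ndv / ln(1 - fpp^(1/8))` and a round-up to the next power of two, which is my understanding of what common implementations do; the function name `optimal_num_bytes` is hypothetical, and the lower and upper size bounds that real implementations apply are left out.

```python
import math


def optimal_num_bytes(ndv: int, fpp: float) -> int:
    """Estimate the bitset size in bytes for a split-block Bloom filter.

    Assumption: bits = -8 * ndv / ln(1 - fpp^(1/8)), then the byte count is
    rounded up to the next power of two. Size bounds are not applied here.
    """
    if ndv <= 0 or not (0.0 < fpp < 1.0):
        raise ValueError("ndv must be positive and fpp must be in (0, 1)")
    num_bits = -8.0 * ndv / math.log(1.0 - fpp ** (1.0 / 8.0))
    num_bytes = math.ceil(num_bits / 8.0)
    # Round up to the next power of two.
    return 1 << (num_bytes - 1).bit_length()


if __name__ == "__main__":
    size = optimal_num_bytes(1_000_000, 0.01)
    print(f"{size} bytes = {size / (1024 * 1024):.1f} MiB")
```

Under these assumptions the formula asks for roughly 1.2 MB, and the power-of-two rounding is what pushes it to 2 MiB per column.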

Reporter: Xuwei Fu / @mapleFU
Assignee: Xuwei Fu / @mapleFU

Note: This issue was originally created as PARQUET-2256. Please see the migration documentation for further details.


asfimport · 2023-03-13T15:11:23Z

Gang Wu / @wgtmac:
Apache ORC supports compression of bloom filters. It would be nice if we could do something similar.
However, I think there is a prerequisite (at least highly relevant): https://issues.apache.org/jira/browse/PARQUET-2257

asfimport · 2023-03-17T08:29:48Z

Gabor Szadovszky / @gszadovszky:
@mapleFU, would you mind doing some investigation before this update? Let's take the binary data of the 2M bloom filter mentioned above and compress it with a few codecs to see the gain. If the ratio is good, it might be worth adding this feature. It is also worth mentioning that compressing bloom filters might hurt filtering performance.
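
A rough way to run that kind of experiment is sketched below. It is only an illustration: it uses a generic Bloom filter built with `hashlib.blake2b` rather than Parquet's split-block layout and xxHash, and zlib rather than the Parquet codecs; `build_bitset` and `compression_ratio` are hypothetical helper names.

```python
import hashlib
import zlib


def build_bitset(num_bytes: int, values, k: int = 8) -> bytes:
    """Build a generic Bloom filter bitset (not Parquet's split-block layout)."""
    bits = bytearray(num_bytes)
    num_bits = num_bytes * 8
    for v in values:
        # Derive k 32-bit positions from a single digest of the value.
        digest = hashlib.blake2b(str(v).encode(), digest_size=4 * k).digest()
        for i in range(k):
            pos = int.from_bytes(digest[4 * i:4 * i + 4], "little") % num_bits
            bits[pos >> 3] |= 1 << (pos & 7)
    return bytes(bits)


def compression_ratio(bitset: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(bitset, level)) / len(bitset)


if __name__ == "__main__":
    size = 64 * 1024
    # Filter sized for far more values than it actually holds.
    sparse = build_bitset(size, range(100))
    # Filter holding roughly the number of values it was sized for.
    dense = build_bitset(size, range(50_000))
    print("sparse ratio:", round(compression_ratio(sparse), 3))
    print("dense ratio: ", round(compression_ratio(dense), 3))
```

The expectation is that a filter sized for 1M values but holding far fewer is mostly zeros and compresses well, while a filter filled close to its design load has about half of its bits set and is close to incompressible.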

asfimport · 2023-03-17T08:32:02Z

Xuwei Fu / @mapleFU:
@gszadovszky Yes, I'd like to. I think having compression in the standard doesn't mean we always need to compress. We can do it only when the original BloomFilter occupies a lot of space and compression can save a lot of time.
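
One possible shape for such an opt-in policy, as a minimal sketch: compress only when the bitset is large and the codec actually pays off. The function `encode_bitset`, the thresholds `min_size` and `max_ratio`, and the use of zlib are all assumptions for illustration; none of them come from the Parquet format or any existing writer.

```python
import zlib


def encode_bitset(bitset: bytes, min_size: int = 64 * 1024, max_ratio: float = 0.8):
    """Return (tag, payload), compressing only when it is clearly worthwhile.

    Hypothetical policy: skip small filters entirely, and fall back to the
    raw bytes when the compressed size exceeds max_ratio of the original.
    """
    if len(bitset) < min_size:
        return ("UNCOMPRESSED", bitset)
    compressed = zlib.compress(bitset)
    if len(compressed) > max_ratio * len(bitset):
        return ("UNCOMPRESSED", bitset)
    return ("COMPRESSED", compressed)


if __name__ == "__main__":
    tag, payload = encode_bitset(bytes(2 * 1024 * 1024))  # empty 2 MiB filter
    print(tag, len(payload))
```

A reader would then check the tag and decompress only when needed, so filters stored uncompressed keep their current lookup cost.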

wgtmac removed Component: Format labels Jul 4, 2024
