
Optimize large IN filters on integer data types #23456

Merged
3 commits merged into trinodb:master from drf-bitset on Sep 25, 2024

Conversation

raunaqmorarka
Member

Description

Use a bitset-based filter instead of a hash set when the range of
values is narrow enough. The bitset is used only when it would occupy
less space than the equivalent open hash set.
This makes evaluation of dynamic filters more efficient, since we
often collect large integer sets in dynamic filters.

    BenchmarkDynamicPageFilter.filterPages
    (filterSize)   (inputDataSet)  (nonNullsSelectivity)   Mode  Cnt  Before score       After score
             100  INT64_FIXED_32K                    0.2  thrpt   30  446.174 ± 10.598   449.113 ±  5.323 ops/s
            1000  INT64_FIXED_32K                    0.2  thrpt   30  407.625 ±  3.139  1379.767 ± 19.318 ops/s
            5000  INT64_FIXED_32K                    0.2  thrpt   30  426.413 ±  6.485  1254.731 ± 11.685 ops/s
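To show the idea behind the change, here is a minimal, hypothetical sketch of a bitset-backed IN filter over a narrow range of longs. This is not Trino's actual LongBitSetFilter; the class name and structure are assumptions for illustration only.

```java
// Hypothetical sketch: membership test over [min, max] using one bit per
// possible value, so lookup is two array/bit operations instead of hashing.
final class BitSetLongFilter
{
    private final long[] bits;
    private final long min;
    private final long max;

    BitSetLongFilter(long[] values, long min, long max)
    {
        this.min = min;
        this.max = max;
        // (max - min) / 64 + 1 longs cover every offset in [0, max - min]
        int words = (int) (((max - min) / 64) + 1);
        this.bits = new long[words];
        for (long value : values) {
            long offset = value - min;
            bits[(int) (offset >>> 6)] |= 1L << (offset & 63);
        }
    }

    boolean contains(long value)
    {
        if (value < min || value > max) {
            return false;
        }
        long offset = value - min;
        return (bits[(int) (offset >>> 6)] & (1L << (offset & 63))) != 0;
    }
}
```

Unlike an open hash set, the lookup cost here is constant with no probing, which is consistent with the large speedups at filter sizes 1000 and 5000 above.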

Additional context and related issues

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)


@lukasz-stec left a comment


lgtm % comments

.add(BigInteger.valueOf(1));
// A Set based on a bitmap uses (max - min) / 64 longs
// Create a bitset only if it uses fewer entries than the equivalent hash set
if (range.compareTo(BigInteger.valueOf(Integer.MAX_VALUE)) > 0
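The quoted check above guards against ranges wider than Integer.MAX_VALUE. A fuller sketch of the selection heuristic might look like the following; the method name and the hash-set sizing model (power-of-two capacity at roughly 50% load) are assumptions, not Trino's exact code.

```java
import java.math.BigInteger;

// Hypothetical illustration: prefer the bitset only when its backing long[]
// would use no more entries than the equivalent open hash set's long[].
final class FilterChoice
{
    // Assumes valueCount >= 1
    static boolean useBitSet(long min, long max, int valueCount)
    {
        // max - min + 1 computed in BigInteger to avoid long overflow
        BigInteger range = BigInteger.valueOf(max)
                .subtract(BigInteger.valueOf(min))
                .add(BigInteger.ONE);
        if (range.compareTo(BigInteger.valueOf(Integer.MAX_VALUE)) > 0) {
            return false; // too wide to index with an int offset
        }
        // A bitmap over [min, max] needs ceil(range / 64) longs
        long bitSetWords = (range.longValueExact() + 63) / 64;
        // An open-addressing hash set needs roughly the next power of two
        // at or above 2 * valueCount long entries (assumed ~0.5 load factor)
        long hashSetEntries = 1L << (64 - Long.numberOfLeadingZeros(2L * valueCount - 1));
        return bitSetWords <= hashSetEntries;
    }
}
```

For example, 1000 values spread over a 32K range need only 512 bitset words versus roughly 2048 hash-set entries, so the bitset wins; 100 values over a 10M range do not.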
Member

For a bitset smaller than the L1 cache size, would it make sense to use bitset even if it uses a little bit more memory?

Member Author

Yes, the bitset is indeed faster than the open hash set at all sizes that fit in the L1 cache. However, the gap tends to be much smaller for small sets.
I'm also hesitant to increase memory usage, because this is unaccounted memory usage. So I'll keep the existing logic for now.

{
INT32_RANDOM(INTEGER, (block, r) -> INTEGER.writeLong(block, r.nextInt())),
INT64_RANDOM(BIGINT, (block, r) -> BIGINT.writeLong(block, r.nextLong())),
INT64_FIXED_32K(BIGINT, (block, r) -> BIGINT.writeLong(block, r.nextLong() % 32768)), // LongBitSetFilter
Member

Did you test smaller ranges than 32K? What ranges do we expect in practice?

Member Author

The 32K here is actually just the range of the input values. Narrowing the input range ensures that the bitset will be chosen instead of the hash set.
The size of the set is determined by filterSize, for which the benchmark parameters are 100, 1000, and 5000.

Member

I was basically asking whether we expect value ranges (max - min) smaller than 32K in practice.

Member Author

From what I've seen in benchmarks and other anecdotal cases, the joined columns tend to be dates/times or some kind of unique ID or primary key, which are usually not in a super-wide range.

@lukasz-stec
Member

Nice improvement!

return new LongBitSetFilter(values, min, max);
}

private static boolean isDirectLongComparisonValidType(Type type)
Member

We have this also in SimplifyContinuousInValues. I would move it to io.trino.type.TypeUtils

Member Author

That method is slightly different: there we are additionally looking for types where the next consecutive value can be obtained by just incrementing the underlying long by 1.
I've added a code comment and renamed the method to be clearer.
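To illustrate the distinction being drawn here, a toy sketch follows. The TypeKind enum, method names, and the grouping of concrete types are all hypothetical; they stand in for Trino's real Type hierarchy only to show that one predicate is strictly narrower than the other.

```java
// Purely illustrative, not Trino code. Two related but distinct predicates:
// direct long comparability vs. having consecutive long-encoded values.
enum TypeKind
{
    TINYINT, SMALLINT, INTEGER, BIGINT, DATE, // integer-like: raw long + 1 is the next value
    SHORT_TIMESTAMP,                          // long-backed and order-preserving (assumed)
    REAL;                                     // long-backed float bits: raw long order is wrong

    // Values fit in a single long whose natural ordering matches the type's ordering
    boolean isDirectLongComparable()
    {
        return this != REAL;
    }

    // Stricter: incrementing the underlying long yields the next consecutive value,
    // which is what SimplifyContinuousInValues-style rewrites need
    boolean hasConsecutiveLongValues()
    {
        switch (this) {
            case TINYINT:
            case SMALLINT:
            case INTEGER:
            case BIGINT:
            case DATE:
                return true;
            default:
                return false;
        }
    }
}
```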

public void setup(long inputRows)
{
// Pollute the JVM profile
for (DataSet dataSet : ImmutableList.of(DataSet.INT32_RANDOM, DataSet.INT64_FIXED_32K, DataSet.INT64_RANDOM)) {
Member

This is not guaranteed to pollute the profile. It depends on various factors, such as whether it runs just on the interpreter, C1, C2, etc. A better way to do that is to run with JMH's bulk warmup mode.

Member Author

I've made that change, but in practice I'm not finding JMH's bulk warmup mode to be the better choice.
Using it causes a multi-fold increase in the runtime of the benchmark, because it warms up every test permutation for N iterations before each benchmark, whereas what we want is to warm up each test permutation once and then run N warmup iterations for only the benchmark being measured.
Manually polluting the profile may not be guaranteed to work, but in practice I have never seen it fail.
JMH's bulk warmup mode, on the other hand, is so time-consuming that I always find myself editing the code to remove it and pollute the profile manually in order to get a JMH result in a reasonable amount of time.
Running JMH remotely is not a solution to this problem either; it still takes too much time.
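For reference, the bulk warmup mode discussed above is enabled through JMH's runner options. This is a generic harness-configuration sketch, not code from this PR; the benchmark include pattern is taken from the benchmark named in the description.

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.WarmupMode;

public class RunBenchmark
{
    public static void main(String[] args) throws Exception
    {
        Options options = new OptionsBuilder()
                .include("BenchmarkDynamicPageFilter")
                // BULK: run the warmup iterations of every matched benchmark
                // (and parameter permutation) before measuring any of them,
                // so each measured run sees a profile polluted by the others
                .warmupMode(WarmupMode.BULK)
                .build();
        new Runner(options).run();
    }
}
```

This is what makes the mode thorough but slow: warmup cost scales with the full cross-product of benchmarks and parameters, which is the runtime blow-up described above.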

@raunaqmorarka raunaqmorarka merged commit 3ac3530 into trinodb:master Sep 25, 2024
95 checks passed
@raunaqmorarka raunaqmorarka deleted the drf-bitset branch September 25, 2024 03:29
@github-actions github-actions bot added this to the 459 milestone Sep 25, 2024