
[COLLECTIONS-855] Fixed hashing calculation as per report #501

Conversation

@Claudenw (Contributor) commented Jun 7, 2024

Fix for COLLECTIONS-855

Modified the loop counters as recommended in the bug report.
Adjusted test data accordingly.

@Claudenw requested a review from @aherbert, Jun 7, 2024 14:21
@codecov-commenter commented Jun 7, 2024

Codecov Report

Attention: Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 81.50%. Comparing base (229425c) to head (9bffa24).
Report is 151 commits behind head on master.

Files Patch % Lines
...collections4/bloomfilter/EnhancedDoubleHasher.java 86.66% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master     #501      +/-   ##
============================================
- Coverage     81.60%   81.50%   -0.11%     
- Complexity     4745     4836      +91     
============================================
  Files           295      300       +5     
  Lines         13751    14092     +341     
  Branches       2022     2071      +49     
============================================
+ Hits          11222    11485     +263     
- Misses         1929     1994      +65     
- Partials        600      613      +13     


@garydgregory (Member) left a comment:

LGTM, thank you @Claudenw !
@aherbert ?

@aherbert (Contributor) commented Jun 7, 2024

Just wondering if we want to fix the code to only compute the index update when it will be used. Currently the final computation inside the loop is wasted. This may be noticeable when using a small number of hash functions (e.g. k=5). The associated issue in COLLECTIONS-855 has an example.
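
A minimal sketch of the suggested restructuring (illustrative only, not the merged change; the names index, inc, bits, k and consumer follow the code quoted later in this conversation):

    for (int i = 1; i <= k; i++) {
        if (!consumer.test(index)) {
            return false;
        }
        if (i == k) {
            break; // last hash function: skip the otherwise wasted update
        }
        // Update index and handle wrapping
        index -= inc;
        index = index < 0 ? index + bits : index;
        // Incorporate the counter into the increment to create a
        // tetrahedral number additional term, and handle wrapping.
        inc -= i;
        inc = inc < 0 ? inc + bits : inc;
    }
    return true;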

@aherbert changed the title: Fixed hashing calculation as per report → COLLECTIONS-855: Fixed hashing calculation as per report, Jun 7, 2024
@garydgregory (Member) commented:

@Claudenw ?

@garydgregory changed the title: COLLECTIONS-855: Fixed hashing calculation as per report → [COLLECTIONS-855] Fixed hashing calculation as per report, Jun 9, 2024
@Claudenw (Contributor, Author) commented Jun 11, 2024

I think this block is wrong:

    final int k = shape.getNumberOfHashFunctions();
    if (k > bits) {
        for (int j = k; j > 0;) {
            // handle k > bits
            final int block = Math.min(j, bits);
            j -= block;
            for (int i = 1; i <= block; i++) {
                if (!consumer.test(index)) {
                    return false;
                }
                // Update index and handle wrapping
                index -= inc;
                index = index < 0 ? index + bits : index;
                // Incorporate the counter into the increment to create a
                // tetrahedral number additional term, and handle wrapping.
                inc -= i;
                inc = inc < 0 ? inc + bits : inc;
            }
        }
    } else {
        for (int i = 1; i <= k; i++) {
            if (!consumer.test(index)) {
                return false;
            }
            // Update index and handle wrapping
            index -= inc;
            index = index < 0 ? index + bits : index;
            // Incorporate the counter into the increment to create a
            // tetrahedral number additional term, and handle wrapping.
            inc -= i;
            inc = inc < 0 ? inc + bits : inc;
        }
    }
    return true;

In the upper part we reset the tetrahedral number counter for each block, so the hashing algorithm is reset.
I don't recall why the algorithm is split into the k <= bits and k > bits cases; the wrapping algorithm should work for either case (I think). What am I missing? Can this be simplified by dropping lines 173-192 inclusive?

I will make the adjustment that @aherbert suggested after we resolve this.

@Claudenw (Contributor, Author) commented:

OK, I figured out what lines 173-192 are doing. They ensure that inc does not get greater than bits.

@aherbert (Contributor) commented Jun 11, 2024

The increment inc is adjusted by i. If the result is negative then inc has wrapped and we add bits. This works only if i is always less than bits.

If i > bits, then when we adjust inc by adding bits it may still be negative. This is then an invalid increment to use to adjust index.

In the worst case inc will be a large enough negative number that index -= inc will create a value above the size in bits. That could cause a bounds error on the IntPredicate consumer.
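
To make that concrete, a small standalone illustration with hypothetical numbers (not taken from the PR or its tests):

    int bits = 8;                      // filter size in bits
    int inc = 3;                       // a valid increment in [0, bits)
    int i = 20;                        // the counter has exceeded bits
    inc -= i;                          // inc = -17
    inc = inc < 0 ? inc + bits : inc;  // inc = -9: still negative
    // A subsequent index -= inc now adds 9 to index, which can push it
    // past bits and out of bounds for the IntPredicate consumer.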

In the real world the upper loop will not be used. It is a fail-safe for bad use: a number of hash functions greater than the number of bits would saturate the filter very fast. An alternative, less friendly solution would be to throw an IllegalArgumentException if called under these conditions.

As to resetting the tetrahedral number, you are correct. My fail-safe implementation is wrong for a correct enhanced double hasher. What the upper loop requires is for the inc += bits adjustment to be repeated until inc is non-negative:

                    for (int i = 1; i <= k; i++) {
                        if (!consumer.test(index)) {
                            return false;
                        }
                        // Update index and handle wrapping
                        index -= inc;
                        index = index < 0 ? index + bits : index;

                        // Incorporate the counter into the increment to create a
                        // tetrahedral number additional term, and handle wrapping
                        // **given** i can exceed bits
                        inc -= i;
                        if (inc < 0) {
                            inc += bits;
                            while (inc < 0) {
                                inc += bits;
                            }
                        }
                    }

Try that and see if we have coverage. If not, we should add a test to make sure that we are exercising extremely bad usage and that the hasher still works.

Note: I would not use this as the default implementation as the conditional load to adjust the inc will be faster than the branch statement and additional (unnecessary) loop.

@garydgregory (Member) commented:

Perhaps some of the information captured in this conversation could be preserved in the Javadoc or an in-line comment.

@Claudenw (Contributor, Author) commented Jun 11, 2024

We do have coverage in the tests for the bad case.

If we introduce an integer variable int tet = 1; and use it on line 198 rather than i, we can then do

if (++tet == bits) {
    tet = 0;
}

to ensure that tet is always in the proper range. Conceptually this is the same as taking the value modulo bits. This means that we only have one loop to worry about.
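
As a quick standalone check (illustrative, not part of the PR) that the wrap-on-increment counter tracks i % bits without a division, assuming bits > 1:

    int bits = 7;
    int tet = 1;
    for (int i = 1; i <= 1000; i++) {
        assert tet == i % bits; // the counter matches the modulus
        if (++tet == bits) {
            tet = 0;
        }
    }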

    inc -= tet;
    inc = inc < 0 ? inc + bits : inc;
    if (inc >= bits) {
        inc = BitMaps.mod(increment, bits);
    }
@aherbert (Contributor) replied:

Don't do this. The entire purpose of testing the increment against zero and adjusting it back into [0, bits) is to avoid a modulus inside the loop.

Besides, you should be testing tet is within [0, bits), not inc.

I feel that the better solution is to have two versions: one for the idiot who requests more hash functions than there are bits in the filter; the other optimised for the correct use case.

@Claudenw (Contributor, Author) replied:

Arrrg... The if check was there for testing; I intended to remove it. tet is in the [0, bits) range.

@aherbert (Contributor) replied:
OK. But I still think this is the wrong solution. You are compromising 99.9999% of use cases to handle the edge case of k >= bits.

@garydgregory (Member) replied:
I assume this PR is not ready then?

@Claudenw (Contributor, Author) replied:

                // the tetrahedral incrementer.  We need to ensure that this
                // number does not exceed bits-1 or we may end up with an index > bits.
                int tet = 1;
                for (int i = 1; i < k; i++) {
                    // Update index and handle wrapping
                    index -= inc;
                    index = index < 0 ? index + bits : index;
                    if (!consumer.test(index)) {
                        return false;
                    }

                    // Incorporate the counter into the increment to create a
                    // tetrahedral number additional term, and handle wrapping.
                    inc -= tet;
                    inc = inc < 0 ? inc + bits : inc;
                    if (++tet == bits) {
                        tet = 0;
                    }
                }

The difference between this block and a single simple path is the increment of tet and the if check near the bottom of the block. I don't think this is significant overhead, and it significantly simplifies the code. We should also consider that generating the values from the hash is considered an expensive operation, so the slight overhead here is not unexpected and is, I believe, still faster than executing multiple Murmur3 hashes, for example.

@Claudenw requested a review from @aherbert, Jun 12, 2024 06:32
@aherbert (Contributor) commented:

@Claudenw I think this is the wrong solution. You are compromising 99.9999% of use cases to handle the edge case of k >= bits. In practical use this will not be needed. I do not see the maintenance gain from consolidating this into one loop outweighing the runtime-efficiency cost.

@aherbert (Contributor) commented:

This looks good with the two loops and the note about why they are there. Perhaps create a ticket under COLLECTIONS in Jira to add JMH performance benchmarks for the Bloom filter code. This could have a list of use cases to test, including variants on this hasher (and the difference from a simple hasher without enhanced double hashing).
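
A sketch of what such a benchmark might look like. The class is hypothetical (no such benchmark exists in the repository at this point), and the hasher API used (indices(Shape) returning an index producer with forEachIndex) is an assumption based on the in-development 4.5 code; only the JMH annotations are standard:

    import java.util.concurrent.TimeUnit;

    import org.apache.commons.collections4.bloomfilter.EnhancedDoubleHasher;
    import org.apache.commons.collections4.bloomfilter.Shape;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;
    import org.openjdk.jmh.infra.Blackhole;

    // Hypothetical benchmark skeleton for the hasher discussed in this PR.
    @State(Scope.Benchmark)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public class EnhancedDoubleHasherBenchmark {

        // 17 hash functions over 2^20 bits: the common k << bits case.
        private final Shape shape = Shape.fromKM(17, 1 << 20);
        private final EnhancedDoubleHasher hasher =
                new EnhancedDoubleHasher(0x9E3779B97F4A7C15L, 0xC2B2AE3D27D4EB4FL);

        @Benchmark
        public void enhancedDoubleHash(final Blackhole bh) {
            // Consume every generated index so the loop cannot be optimised away.
            hasher.indices(shape).forEachIndex(i -> {
                bh.consume(i);
                return true;
            });
        }
    }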

@garydgregory (Member) commented:

@aherbert OK to merge then??

@aherbert (Contributor) commented:

LGTM

@garydgregory merged commit a277a2a into apache:master, Jun 12, 2024
8 of 9 checks passed