
[COLLECTIONS-855] Fixed hashing calculation as per report #501

Conversation

@Claudenw (Contributor) commented Jun 7, 2024

Fix for COLLECTIONS-855

Modified the loop counters as recommended in the bug report.
Adjusted test data accordingly.

@Claudenw requested a review from @aherbert, Jun 7, 2024 14:21
@codecov-commenter commented Jun 7, 2024

Codecov Report

Attention: Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 81.50%. Comparing base (229425c) to head (9bffa24).
Report is 151 commits behind head on master.

Files Patch % Lines
...collections4/bloomfilter/EnhancedDoubleHasher.java 86.66% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master     #501      +/-   ##
============================================
- Coverage     81.60%   81.50%   -0.11%     
- Complexity     4745     4836      +91     
============================================
  Files           295      300       +5     
  Lines         13751    14092     +341     
  Branches       2022     2071      +49     
============================================
+ Hits          11222    11485     +263     
- Misses         1929     1994      +65     
- Partials        600      613      +13     


@garydgregory (Member) left a comment:

LGTM, thank you @Claudenw !
@aherbert ?

@aherbert (Contributor) commented Jun 7, 2024

Just wondering if we want to fix the code to only compute the index update when it will be used. Currently the final computation inside the loop is wasted. This may be noticeable when using a small number of hash functions (e.g. k=5). The associated issue in COLLECTIONS-855 has an example.
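
A minimal sketch of the suggested restructuring (illustrative only, not the merged change; the names index, inc, bits, k and consumer follow the code quoted later in this conversation):

    for (int i = 1; i <= k; i++) {
        if (!consumer.test(index)) {
            return false;
        }
        if (i == k) {
            break; // last hash function: skip the otherwise wasted update
        }
        // Update index and handle wrapping
        index -= inc;
        index = index < 0 ? index + bits : index;
        // Incorporate the counter into the increment to create a
        // tetrahedral number additional term, and handle wrapping.
        inc -= i;
        inc = inc < 0 ? inc + bits : inc;
    }
    return true;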

@aherbert changed the title: Fixed hashing calculation as per report → COLLECTIONS-855: Fixed hashing calculation as per report, Jun 7, 2024
@garydgregory (Member) commented:

@Claudenw ?

@garydgregory changed the title: COLLECTIONS-855: Fixed hashing calculation as per report → [COLLECTIONS-855] Fixed hashing calculation as per report, Jun 9, 2024
@Claudenw (Contributor, Author) commented Jun 11, 2024

I think this block is wrong:

    final int k = shape.getNumberOfHashFunctions();
    if (k > bits) {
        for (int j = k; j > 0;) {
            // handle k > bits
            final int block = Math.min(j, bits);
            j -= block;
            for (int i = 1; i <= block; i++) {
                if (!consumer.test(index)) {
                    return false;
                }
                // Update index and handle wrapping
                index -= inc;
                index = index < 0 ? index + bits : index;
                // Incorporate the counter into the increment to create a
                // tetrahedral number additional term, and handle wrapping.
                inc -= i;
                inc = inc < 0 ? inc + bits : inc;
            }
        }
    } else {
        for (int i = 1; i <= k; i++) {
            if (!consumer.test(index)) {
                return false;
            }
            // Update index and handle wrapping
            index -= inc;
            index = index < 0 ? index + bits : index;
            // Incorporate the counter into the increment to create a
            // tetrahedral number additional term, and handle wrapping.
            inc -= i;
            inc = inc < 0 ? inc + bits : inc;
        }
    }
    return true;

In the upper part we reset the tetrahedral number counter for each block, so the hashing algorithm is reset.
I don't recall why the algorithm is split into the k <= bits and k > bits cases; the wrapping algorithm should work for either case (I think). What am I missing? Can this be simplified by dropping lines 173-192 inclusive?

I will make the adjustment that @aherbert suggested after we resolve this.

@Claudenw (Contributor, Author) commented:

OK, I figured out what lines 173-192 are doing. They ensure that inc does not get greater than bits.

@aherbert (Contributor) commented Jun 11, 2024

The increment inc is adjusted by i. If the result is negative then inc has wrapped and we add bits. This works only if i is always less than bits.

If i > bits, then when we adjust inc by adding bits it may still be negative. This is then an invalid increment to use to adjust index.

In the worst case inc will be a large enough negative number that index -= inc will create a value above the size in bits. That could cause a bounds error on the IntPredicate consumer.
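
To make that concrete, a small standalone illustration with hypothetical numbers (not taken from the PR or its tests):

    int bits = 8;                      // filter size in bits
    int inc = 3;                       // a valid increment in [0, bits)
    int i = 20;                        // the counter has exceeded bits
    inc -= i;                          // inc = -17
    inc = inc < 0 ? inc + bits : inc;  // inc = -9: still negative
    // A subsequent index -= inc now adds 9 to index, which can push it
    // past bits and out of bounds for the IntPredicate consumer.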

In the real world the upper loop will not be used. It is a fail-safe for bad use: a number of hash functions greater than the number of bits would saturate the filter very fast. An alternative, less friendly solution would be to throw an IllegalArgumentException if called under these conditions.

As to resetting the tetrahedral number, you are correct. My fail-safe implementation is wrong for a correct enhanced double hasher. What the upper loop requires is for the inc += bits adjustment to be repeated until inc is non-negative:

                    for (int i = 1; i <= k; i++) {
                        if (!consumer.test(index)) {
                            return false;
                        }
                        // Update index and handle wrapping
                        index -= inc;
                        index = index < 0 ? index + bits : index;

                        // Incorporate the counter into the increment to create a
                        // tetrahedral number additional term, and handle wrapping
                        // **given** i can exceed bits
                        inc -= i;
                        if (inc < 0) {
                            inc += bits;
                            while (inc < 0) {
                                inc += bits;
                            }
                        }
                    }

Try that and see if we have coverage. If not, we should add a test to make sure that we are exercising extremely bad usage and that the hasher still works.

Note: I would not use this as the default implementation as the conditional load to adjust the inc will be faster than the branch statement and additional (unnecessary) loop.

@garydgregory (Member) commented:

Perhaps some of the information captured in this conversation could be preserved in the Javadoc or an in-line comment.

@Claudenw (Contributor, Author) commented Jun 11, 2024

We do have coverage in the tests for the bad case.

If we introduce an integer variable int tet = 1; and use it on line 198 rather than i, we can then do

if (++tet == bits) {
    tet = 0;
}

to ensure that tet is always in the proper range. Conceptually this is the same as taking the value modulo bits. This means that we only have one loop to worry about.
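
As a quick standalone check (illustrative, not part of the PR) that the wrap-on-increment counter tracks i % bits without a division, assuming bits > 1:

    int bits = 7;
    int tet = 1;
    for (int i = 1; i <= 1000; i++) {
        assert tet == i % bits; // the counter matches the modulus
        if (++tet == bits) {
            tet = 0;
        }
    }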

    inc -= tet;
    inc = inc < 0 ? inc + bits : inc;
    if (inc >= bits) {
        inc = BitMaps.mod(increment, bits);
    }
@aherbert (Contributor) replied:

Don't do this. The entire purpose of testing the increment against zero and adjusting it back into [0, bits) is to avoid a modulus inside the loop.

Besides, you should be testing tet is within [0, bits), not inc.

I feel that the better solution is to have two versions: one for the idiot who requests more hash functions than there are bits in the filter; the other optimised for the correct use case.

@Claudenw (Contributor, Author) replied:

Arrrg... The if check was there for testing; I intended to remove it. tet is in the [0, bits) range.

@aherbert (Contributor) replied:
OK. But I still think this is the wrong solution. You are compromising 99.9999% of use cases to handle the edge case of k >= bits.

@garydgregory (Member) replied:
I assume this PR is not ready then?

@Claudenw (Contributor, Author) replied:

                // the tetrahedral incrementer.  We need to ensure that this
                // number does not exceed bits-1 or we may end up with an index > bits.
                int tet = 1;
                for (int i = 1; i < k; i++) {
                    // Update index and handle wrapping
                    index -= inc;
                    index = index < 0 ? index + bits : index;
                    if (!consumer.test(index)) {
                        return false;
                    }

                    // Incorporate the counter into the increment to create a
                    // tetrahedral number additional term, and handle wrapping.
                    inc -= tet;
                    inc = inc < 0 ? inc + bits : inc;
                    if (++tet == bits) {
                        tet = 0;
                    }
                }

The difference between this block and a single simple path is the increment of tet and the if check near the bottom of the block. I don't think this is significant overhead, and it significantly simplifies the code. We should also consider that generating the values from the hash is considered an expensive operation, so the slight overhead here is not unexpected and is, I believe, still faster than executing multiple Murmur3 hashes, for example.

@Claudenw requested a review from @aherbert, Jun 12, 2024 06:32
@aherbert (Contributor) commented:

@Claudenw I think this is the wrong solution. You are compromising 99.9999% of use cases to handle the edge case of k >= bits. In practical use this will not be needed. I do not see the maintenance gain from consolidating this into one loop outweighing the runtime-efficiency cost.

@aherbert (Contributor) commented:

This looks good with the two loops and the note about why they are there. Perhaps create a ticket under COLLECTIONS in Jira to add JMH performance benchmarks for the Bloom filter code. This could have a list of use cases to test, including variants on this hasher (and the difference from a simple hasher without enhanced double hashing).
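
A sketch of what such a benchmark might look like. The class is hypothetical (no such benchmark exists in the repository at this point), and the hasher API used (indices(Shape) returning an index producer with forEachIndex) is an assumption based on the in-development 4.5 code; only the JMH annotations are standard:

    import java.util.concurrent.TimeUnit;

    import org.apache.commons.collections4.bloomfilter.EnhancedDoubleHasher;
    import org.apache.commons.collections4.bloomfilter.Shape;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;
    import org.openjdk.jmh.infra.Blackhole;

    // Hypothetical benchmark skeleton for the hasher discussed in this PR.
    @State(Scope.Benchmark)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public class EnhancedDoubleHasherBenchmark {

        // 17 hash functions over 2^20 bits: the common k << bits case.
        private final Shape shape = Shape.fromKM(17, 1 << 20);
        private final EnhancedDoubleHasher hasher =
                new EnhancedDoubleHasher(0x9E3779B97F4A7C15L, 0xC2B2AE3D27D4EB4FL);

        @Benchmark
        public void enhancedDoubleHash(final Blackhole bh) {
            // Consume every generated index so the loop cannot be optimised away.
            hasher.indices(shape).forEachIndex(i -> {
                bh.consume(i);
                return true;
            });
        }
    }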

@garydgregory (Member) commented:

@aherbert OK to merge then??

@aherbert (Contributor) commented:

LGTM

@garydgregory merged commit a277a2a into apache:master, Jun 12, 2024
8 of 9 checks passed