[COLLECTIONS-855] Fixed hashing calculation as per report #501
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

```
@@             Coverage Diff              @@
##             master     #501      +/-  ##
============================================
- Coverage     81.60%   81.50%    -0.11%
- Complexity     4745     4836       +91
============================================
  Files           295      300        +5
  Lines         13751    14092      +341
  Branches       2022     2071       +49
============================================
+ Hits          11222    11485      +263
- Misses         1929     1994       +65
- Partials        600      613       +13
============================================
```

View full report in Codecov by Sentry.
Just wondering if we want to fix the code to only compute the index update when it will be used. Currently the final computation inside the loop is wasted. This may be noticeable when using a small number of hash functions (e.g. k = 5). The associated issue in COLLECTIONS-855 has an example.
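For illustration, a minimal stand-alone sketch of that idea (hypothetical class and method names, not the PR code; `Math.floorMod` is used for clarity instead of the branch-free wrap in the actual hasher). The loop tests the current index first and returns before computing the update on the final iteration, so no work is wasted:

```java
import java.util.function.IntPredicate;

public class HasherSketch {
    // Hypothetical sketch of an enhanced-double-hashing probe loop in which
    // the index/increment update is skipped on the last iteration, since its
    // result would never be consumed.
    static boolean forEachIndex(int initial, int increment, int bits, int k,
                                IntPredicate consumer) {
        int index = Math.floorMod(initial, bits);
        int inc = Math.floorMod(increment, bits);
        for (int i = 1;; i++) {
            if (!consumer.test(index)) {
                return false;
            }
            if (i == k) {
                return true; // no wasted update after the last probe
            }
            // Update index and handle wrapping.
            index = Math.floorMod(index - inc, bits);
            // Incorporate the counter into the increment (tetrahedral term).
            inc = Math.floorMod(inc - i, bits);
        }
    }
}
```

With k = 5 this saves one full index/increment computation per item, which is the case the comment above calls out.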
I think this block is wrong (lines 172 to 207 in bb2fac4). In the upper part we reset the tetrahedral number counter, so the hashing algorithm is reset. I will make the adjustment that @aherbert suggested after we resolve this.
OK, I figured out what lines 173-192 are doing. They ensure that the
In the real world the upper loop will not be used. It is a fail-safe for bad use: a number of hash functions greater than the number of bits would saturate the filter very fast. An alternative, less friendly, solution would be to throw an IllegalArgumentException if called under these conditions.

As to resetting the tetrahedral number: you are correct. My fail-safe implementation is wrong for a correct enhanced double hasher. What the upper loop requires is something like:

```java
for (int i = 1; i <= k; i++) {
    if (!consumer.test(index)) {
        return false;
    }
    // Update index and handle wrapping
    index -= inc;
    index = index < 0 ? index + bits : index;
    // Incorporate the counter into the increment to create a
    // tetrahedral number additional term, and handle wrapping
    // **given** i can exceed bits
    inc -= i;
    while (inc < 0) {
        inc += bits;
    }
}
```

Try that and see if we have coverage. If not then we should add a test to make sure that we are testing extremely bad usage and the hasher still works. Note: I would not use this as the default implementation, as the conditional load to adjust the increment penalises the common case.
Perhaps some of the information captured in this conversation could be preserved in a Javadoc or in-line comment.
We do have coverage in the tests for the bad case. If we use an integer variable `tet`, wrapped to ensure that it is always in the proper range, conceptually this is the same as calling modulus bits. This means that we only have one loop to worry about.
```java
inc -= tet;
inc = inc < 0 ? inc + bits : inc;
if (inc >= bits) {
    inc = BitMaps.mod(increment, bits);
}
```
Don't do this. The entire purpose of testing the increment against zero and adjusting it to be within [0, bits) is to avoid a modulus inside the loop. Besides, you should be testing that `tet` is within [0, bits), not `inc`.
I feel that the better solution is to have two versions: one for the idiot who requests more hash functions than there are bits in the filter; the other optimised for the correct use case.
Arrrg... The if check was there for testing; I intended to remove it. `tet` is in the [0, bits) range.
OK. But I still think this is the wrong solution. You are compromising 99.9999% of use cases to handle the edge case of k >= bits.
I assume this PR is not ready then?
```java
// the tetrahedral incrementer. We need to ensure that this
// number does not exceed bits-1 or we may end up with an index > bits.
int tet = 1;
for (int i = 1; i < k; i++) {
    // Update index and handle wrapping
    index -= inc;
    index = index < 0 ? index + bits : index;
    if (!consumer.test(index)) {
        return false;
    }
    // Incorporate the counter into the increment to create a
    // tetrahedral number additional term, and handle wrapping.
    inc -= tet;
    inc = inc < 0 ? inc + bits : inc;
    if (++tet == bits) {
        tet = 0;
    }
}
```
The difference between this block and a single simple path is the increment of `tet` and the if check near the bottom of the block. I don't think this is a significant overhead, and it significantly simplifies the code. We should also consider that generating the values from the hash is considered an expensive operation, so the slight overhead here is not unexpected and, I believe, faster than executing multiple Murmur3 hashes, for example.
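To back the claim that the wrapped `tet` counter behaves like a per-step modulus, here is a stand-alone comparison sketch (hypothetical class and method names, not the PR code): a slow reference using `Math.floorMod` on every step, against the branch-based wrapped-counter loop. For k >= bits the two should visit exactly the same indices.

```java
import java.util.ArrayList;
import java.util.List;

public class TetEquivalence {
    // Reference: apply Math.floorMod on every step (obviously correct, slow).
    static List<Integer> reference(int index, int inc, int bits, int k) {
        List<Integer> out = new ArrayList<>();
        long idx = Math.floorMod(index, bits);
        long step = Math.floorMod(inc, bits);
        for (long i = 1; i <= k; i++) {
            out.add((int) idx);
            idx = Math.floorMod(idx - step, (long) bits);
            step = Math.floorMod(step - i, (long) bits);
        }
        return out;
    }

    // Wrapped counter: no modulus inside the loop; tet stands in for i mod bits,
    // so a single conditional addition is enough to restore the [0, bits) range.
    static List<Integer> wrapped(int index, int inc, int bits, int k) {
        List<Integer> out = new ArrayList<>();
        int idx = Math.floorMod(index, bits);
        int step = Math.floorMod(inc, bits);
        int tet = 1;
        for (int i = 1; i <= k; i++) {
            out.add(idx);
            idx -= step;
            idx = idx < 0 ? idx + bits : idx;
            step -= tet;
            step = step < 0 ? step + bits : step;
            if (++tet == bits) {
                tet = 0;
            }
        }
        return out;
    }
}
```

Running both with bits = 7 and k = 40 (so k is well past bits) produces identical index sequences, which is the equivalence the single-loop argument relies on.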
@Claudenw I think this is the wrong solution. You are compromising 99.9999% of use cases to handle the edge case of k >= bits. In practical use this will not be needed. From a maintenance perspective, I do not see the gain of consolidating this into one loop outweighing the disadvantage in runtime efficiency.
This looks good with the two loops and the note about why they are there. Perhaps create a ticket under COLLECTIONS in Jira to add JMH performance benchmarks for the Bloom filter code. This can have a list of use cases to test, including variants on this hasher (and the difference from a simple hasher without enhanced double hashing).
@aherbert OK to merge, then?
LGTM
Fix for COLLECTIONS-855
Modified the loop counters as recommended in the bug report.
Adjusted test data accordingly.