-
Notifications
You must be signed in to change notification settings - Fork 473
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE #406
COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE #406
Conversation
Codecov Report
@@ Coverage Diff @@
## master #406 +/- ##
============================================
+ Coverage 81.20% 81.24% +0.03%
- Complexity 4625 4638 +13
============================================
Files 290 290
Lines 13483 13547 +64
Branches 1993 2003 +10
============================================
+ Hits 10949 11006 +57
- Misses 1933 1937 +4
- Partials 601 604 +3
... and 2 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Some extra checkstyle files were added. I hope to have time to review in the next few days. |
The functional changes are fine, i.e. expose some measure of the capacity of the counts. But without an exact use case I am not certain of how it would be used. I think this could do with some naming improvement. You cannot Should this property instead reflect that it is the maximum count of the same bit index that can be added to the filter before the state is invalidated. So change to Q. Is there a reference computation for an estimate of the number of items that can be added before saturating a bit index with a certain probability. It may be useful to point a reader in the direction of some calculations. Thus they can use the Shape and this property to get an estimate of capacity. |
Cell is a standard term in the literature. I am preparing two more pull requests.
Both the new counting implementation and the stable Bloom filter are dependant upon a CellManager class and some supporting classes. I could present the pull requests in either order, the first one being largish (like the layerd change) with the second being a couple of classes for the specific implementation. I can update the current counting Bloom filter with the term "cell", but I would like to keep the terminology. I thought I had change the method name to Would you be amenable to updating the documentation to refer to cell when we have "bits" that are values more than bits? edit: I misread your post (I should always wear my glasses when I sit down at the computer). I will change the name to |
There are two terms here used through the API: bit index which refers to a position within the shape; and the count of the number of observations of that index. In that context you are stating a cell is a position that can be filled with counts. So it is a countable bit index. Would the term cell ever be used for non-counting Bloom filters? Where does that leave the naming of BitCountProducer? CellCountProducer; CellProducer? This is why I questioned it. Adding new terminology should make sense to those reading the API for the first time. Perhaps it can be done by adding a note to the class javadoc for CountingBloomFilter that bit indices that can be counted are also known as cells (if this is correct). |
Cells are used in stable Bloom filters which are technically not a counting bloom filter they are in a class called "Element decay" Bloom filters. They do use cells, not bits, but only report and accept bits as output and input. There are several Bloom filters in this class. This change is to make the development of the stable and a variable size CountingBloomFilter possible. In that change I was going to propose changing Are you ok with my proceeding to;
|
Go ahead. If if more closely matches the literature then this is what we should do. |
@aherbert Please take a look again and see if you are OK to merge this change. |
…nged BitCount* to Cell* Updated documentation
00adc10
to
eccfd03
Compare
@@ -19,7 +19,7 @@ | |||
import java.util.Objects; | |||
|
|||
/** | |||
* The interface that describes a Bloom filter that associates a count with each | |||
* The interface that describes a Bloom filter that associates a cell with each |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing count to cell has led to some loss of information. The word count is obviously a tracking of occurrences. The word cell is not. IIUC a cell is a count for an index, thus it encapsulates both terms, i.e. you cannot have a cell without an index: cell = (index, count). If a cell did not have an associated index then it is simply a count.
Where does this leave the naming of CountingBloomFilter
and ArrayCountingBloomFilter
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definition of cell and usage of cell vs count was reviewed and corrections made
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because you squashed and force pushed to the branch the comments from the last review I made has been lost. FYI: don't do this until the PR is ready to merge.
Note: I started by adding a single comment as I clicked the wrong button in GH. I later noted that you explain what cells are in the package-info. This should probably be repeated in the appropriate interfaces/classes as not many people will read the package info. Simple changing counts
to cells
makes some of the grammar less intuitive. This could use some work.
I cannot remember everything I stated last time. Overall the PR looks OK. I didn't do a deep dive because there is an issue with the getMaxIndex
I raised last time that you have not addressed. It does not count repeats of the same index in the query item.
@Test
void test() {
ArrayCountingBloomFilter f = new ArrayCountingBloomFilter(Shape.fromKM(1, 10));
IndexProducer ip = p -> {
p.test(3);
p.test(3);
return true;
};
f.merge(ip);
System.out.println(f.getMaxInsert(ip));
}
Prints:
2
This is non-sense. It could be fixed. If you don't care that the returned value is an overestimate then the method should be documented with warnings. The value from getMaxInsert
is actually returning the number of times a cell touched by your item has been incremented, and not a factor of the number of times your item has been inserted. As such it is an overestimate of any item that touches the same cell multiple times.
src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java
Outdated
Show resolved
Hide resolved
* @return the maximum number of times the IndexProducer could have been inserted. | ||
*/ | ||
default int getMaxInsert(IndexProducer idxProducer) { | ||
return getMaxInsert( BitMapProducer.fromIndexProducer(idxProducer, getShape().getNumberOfBits())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whitespace ( BitMap
src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCountingBloomFilterTest.java
Outdated
Show resolved
Hide resolved
I fixed the issues with getMaxInsert by requiring the CellProducer return distinct ordered indices. It made much of the code simpler, but added more test cases. I will go back and review all the places where Cell is referenced in java doc and ensure that there is some sort of description. |
github.com:Claudenw/commons-collections into unlock_counting_bloom_filters
@@ -121,7 +186,7 @@ default boolean merge(final Hasher hasher) { | |||
/** | |||
* Merges the specified index producer into this Bloom filter. | |||
* | |||
* <p>Specifically: all counts for the indexes identified by the {@code indexProducer} will be incremented by 1.</p> | |||
* <p>Specifically: all cells for the indexes identified by the {@code indexProducer} will be incremented by 1.</p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think your conversion to a CellProducer upholds this contract. The cells will not be incremented by 1; they will be incremented by the count of the indices because merge then calls add.
IIUC a merge is from BloomFilter and should flatten all arguments to an effective bitmap, removing duplicates. The add operation should not remove duplicate indices because they are treated as cells with a count above 1.
We should have some tests to check these definitions are upheld.
ArrayCountingBloomFilter f1 = new ArrayCountingBloomFilter(Shape.fromKM(1, 10));
ArrayCountingBloomFilter f2 = f1.copy();
ArrayCountingBloomFilter f3 = f1.copy();
IndexProducer ip = p -> {
p.test(3);
p.test(3);
return true;
};
// The merge SHOULD increment cells by 1
f1.merge(ip);
// The add should increment cells by 2
f2.add(CellProducer.from(ip));
// This merge will increment by 1 as the round-trip makes the indices unique
f3.merge(IndexProducer.fromIndexArray(ip.asIndexArray()));
f1.forEachCell((k, v) -> {
System.out.printf("%d = %d%n", k, v);
return true;
});
f2.forEachCell((k, v) -> {
System.out.printf("%d = %d%n", k, v);
return true;
});
f3.forEachCell((k, v) -> {
System.out.printf("%d = %d%n", k, v);
return true;
});
Prints:
3 = 2 // ERROR !
3 = 2
3 = 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree this is incorrect. I am adding this as a test to the AbstractCountingFilterTest.
I think that we should make the following changes.
for counting filters
- merge always increments the cell value by at most 1.
- remove always decrements the cell value by at most 1.
- add and subtract as used to increment/decrement by more than 1.
for index producers
- as array always returns the indices in the number and order they are produced.
- move the "unique" method from hashers to index producers. Hashers create index producers. Let the use of the producer determine if it needs to be unique.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with all points except:
"as array always returns the indices in the number and order they are produced."
This constraint may impact performance. Is it required?
A hasher asIndexArray may not wish to do this as the indices are random. The EnhancedDoubleHasher current returns the indices unordered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it "required" no, but when working in complex arrangements it mean that the simply converting to an array does not change the state of the data. In most cases this is not an issue. The only case where it is an issue is when a Hasher backed IndexProducer.asArray() is called. In all other cases (I think) the array either already exists or exists in a different form (BitMaps, CellProducers). BitMaps and CellProducers always return distinct values in order.
My plan was to lift the code from BitMapProducer that produces bitMaps and use it to produce an array of indices from a Hasher.
Perhaps my original comment is not clear. Basically, however the underlying storage produces the indices is the order and number that will be produced in the array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you wish to return a distinct array then the default method in the interface using a bit set will be fine.
If you wish to return a count of duplicate indices then the current implementation will work if you just sort the array it currently creates. This may be faster than using a complicated intermediate data structure. The hasher is likely to have a small array and the speed of the sort will be fast.
This may be an opportunity to use some more characteristics flags where an index producer can declare if the indices are sorted and or duplicates. Then the interface can provide default methods to remove duplicates or sort them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have IndexProducer.asArray output the indices in the order that they would be returned by IndexProducer.forEachIndex() then we are not changing the data based on the representation.
This also means that sorting can be done by generating an array, using Array.sort() and then using the result to create an IndexProducer.fromIndexArray() to create the sorted index.
The characteristics flag may be a good idea. We already have them in the test code.
How about we move this forward as is and merge it. I think the change is large enough (I was trying to keep is small).
I still need to get the LayeredBloom filter added and it will need to be updated once we have this in place.
Finally, I have the StableBloom filter, which may clarify some of the questions we have a bout characteristics and whether we have chosen the proper restrictions on IndexProducer and CellProducers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last bit of my previous message was not meant to indicate that this should not be merged, just that there are 2 more changes coming as different pull requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I misread your previous statement "as array always returns the indices in the number and order they are produced." I took that to mean in order, i.e. sorted. Re-reading it I see that you are stating that the order matches the output produced by forEachIndex.
This is fine. But it now means the default implementation of asIndexArray (which creates sorted, unique indices) is now in conflict with the javadoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PS. I just saw that you changed the default implementation of asIndexArray to respect the order of forEachIndex. I noted a problem with the internal comment in that method about the capacity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should all be fixed and ready to go now.
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you try and update the PR so that the git history is kept for two of the renamed files. You should be able to restore the files from an old commit, use git mv
to rename them and then replace the contents with the new code files you have here.
Please remove changes to IndexFilter. They may be useful. They are not related to this change.
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java
Outdated
Show resolved
Hide resolved
...va/org/apache/commons/collections4/bloomfilter/CellProducerFromDefaultIndexProducerTest.java
Outdated
Show resolved
Hide resolved
...rg/apache/commons/collections4/bloomfilter/CellProducerFromArrayCountingBloomFilterTest.java
Outdated
Show resolved
Hide resolved
src/test/java/org/apache/commons/collections4/bloomfilter/DefaultCellProducerTest.java
Outdated
Show resolved
Hide resolved
@Test | ||
public void testMoreThan32Entries() { | ||
int[] values = new int[33]; | ||
for (int i=0; i<33; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int[] values = IntStream.range(0, 33).toArray();
src/test/java/org/apache/commons/collections4/bloomfilter/DefaultIndexProducerTest.java
Outdated
Show resolved
Hide resolved
@aherbert Do you have a tool that finds all the screwy whitespace? I try to keep the code clean but my historical style keeps slipping through. |
I spot them when reviewing. Sometimes there are not many so I point them out. If you would like I can stop pointing them out as it does add noise to a PR review. I can format the code and the tests in this package to be consistent when all your planned revisions are submitted. |
3aac605
to
9164ace
Compare
dad4a75
to
1c6a3f1
Compare
No luck trying to get the move to be recognized. Tried restoring and moving, restoring editing and then moving, restoring clearing the git cache of the new name, moving. Nothing I tried resulted in the github display showing the file as changed. Only deleted and new. If I revert the IndexFilter change the code coverage will not be 100% and no way to get to 100%. If that is the better solution at this point I will do that and make the change later as a separate pull request. |
I think you have fixed it. The GH UI sometimes does not show the renamed file as a move. I have cloned the latest PR code and verified locally that the history is there:
Log revisions date back to 2022. I should have done this before I asked you to restore the history as it may have been there already. Anyway, it is OK now. I think we should revert the IndexFilter change and do that separately. This PR is related to the counting Bloom filters and their ability to indicate the max capacity of a cell and estimate the number of entries of an item. |
I think that IndexUtils method should be moved to ArrayUtils in the root package. |
Note that class is package private, and copied and adapted from Commons Lang. To make the method public requires a bit more effort. For example duplicating for all 8 primitive array types and also a generic type. It may also require adding a method to expand the array for adding more than 1 item. (The current method only requires expanding the array to add 1 item.) This generic functionality may be better suited to ArrayUtils in commons-lang. It is simpler to hide this functionality at the package level for now, and potentially add this as a feature in the future. |
OK. If there are any issues remaining please let me know. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay in reviewing. There is an issue with the use of the IndexUtils to enlarge the array. The class should also be a package-private final utility class.
src/main/java/org/apache/commons/collections4/bloomfilter/IndexUtils.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/collections4/bloomfilter/IndexUtils.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/collections4/bloomfilter/IndexUtils.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java
Outdated
Show resolved
Hide resolved
src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java
Outdated
Show resolved
Hide resolved
@@ -72,7 +72,10 @@ public boolean test(final int number) { | |||
if (number >= size) { | |||
throw new IndexOutOfBoundsException(String.format("number too large %d >= %d", number, size)); | |||
} | |||
return !tracker.test(number) || consumer.test(number); | |||
if (tracker.test(number)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You reverted the IndexFilter file but left this change. I think the statements are identical. Since the plan is to replace this with a separate change to the IndexFilter then can you revert this too. The IndexFilter should be unchanged by this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change fixes code coverage check. In the original case there are 4 branches, one of which can not be tested. In this case there are only 3 and the code coverage report shows 100% coverage for the Bloomfilter code.
Counting bloom filter tests are dependant upon the maximum value that can be stored in a cell in a Bloom filter.
This pull adds a
getMaxValue()
method to the CountingBloomFilter interface. This is to indicate the maximum value that can be stored in a single cell of the counting filter. A cell is the counter that replaces the bit in the counting filter.Also adds the missing standard function
getMaxInsert()
which returns the maximum times an item could have been merged into the counting filter.Changes to tests
Fix for COLLECTIONS-844