COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE #406

Claudenw · 2023-07-16T18:13:06Z

Counting bloom filter tests are dependant upon the maximum value that can be stored in a cell in a Bloom filter.

This pull adds a getMaxValue() method to the CountingBloomFilter interface. This is to indicate the maximum value that can be stored in a single cell of the counting filter. A cell is the counter that replaces the bit in the counting filter.

Also adds the missing standard function getMaxInsert() which returns the maximum times an item could have been merged into the counting filter.

Changes to tests

Fix for COLLECTIONS-844

codecov-commenter · 2023-07-16T18:14:27Z

Codecov Report

Merging #406 (f1e159b) into master (99473df) will increase coverage by 0.03%.
Report is 12 commits behind head on master.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master     #406      +/-   ##
============================================
+ Coverage     81.20%   81.24%   +0.03%     
- Complexity     4625     4638      +13     
============================================
  Files           290      290              
  Lines         13483    13547      +64     
  Branches       1993     2003      +10     
============================================
+ Hits          10949    11006      +57     
- Misses         1933     1937       +4     
- Partials        601      604       +3

Files Changed	Coverage Δ
...ections4/bloomfilter/ArrayCountingBloomFilter.java	`100.00% <100.00%> (ø)`
.../commons/collections4/bloomfilter/BloomFilter.java	`100.00% <100.00%> (ø)`
...commons/collections4/bloomfilter/CellProducer.java	`100.00% <100.00%> (ø)`
.../collections4/bloomfilter/CountingBloomFilter.java	`100.00% <100.00%> (ø)`
.../commons/collections4/bloomfilter/IndexFilter.java	`100.00% <100.00%> (ø)`
...ommons/collections4/bloomfilter/IndexProducer.java	`100.00% <100.00%> (ø)`
...e/commons/collections4/bloomfilter/IndexUtils.java	`100.00% <100.00%> (ø)`

... and 2 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

aherbert · 2023-07-16T18:18:07Z

Some extra checkstyle files were added. I hope to have time to review in the next few days.

aherbert · 2023-07-17T14:46:41Z

The functional changes are fine, i.e. expose some measure of the capacity of the counts. But without an exact use case I am not certain of how it would be used.

I think this could do with some naming improvement. You cannot get values from a counting Bloom filter. So getMaxValue seems out of context. Plus no other docs in the class refer to a 'cell'. The filter always uses the term 'bit index'.

Should this property instead reflect that it is the maximum count of the same bit index that can be added to the filter before the state is invalidated. So change to getMaxCount and document that it is the maximum count that can be held for any bit index. If this is exceeded then the filter is invalidated (see #isValid).

Q. Is there a reference computation for an estimate of the number of items that can be added before saturating a bit index with a certain probability. It may be useful to point a reader in the direction of some calculations. Thus they can use the Shape and this property to get an estimate of capacity.

Claudenw · 2023-07-17T19:10:45Z

@aherbert

Cell is a standard term in the literature. I am preparing two more pull requests.

A stable Bloom filter (weird beast) that acts sorta like a counting Bloom filter with random deletions.
Another implementation of counting Bloom filter that uses the core classes from stable Bloom filter to implement counting Bloom filters that can use from 1 to 32 bits (1 is a bit ridiculous but it is supported).

Both the new counting implementation and the stable Bloom filter are dependant upon a CellManager class and some supporting classes. I could present the pull requests in either order, the first one being largish (like the layerd change) with the second being a couple of classes for the specific implementation.

I can update the current counting Bloom filter with the term "cell", but I would like to keep the terminology. I thought I had change the method name to getMaxCount, I'll figure out where that change went.

Would you be amenable to updating the documentation to refer to cell when we have "bits" that are values more than bits?

edit: I misread your post (I should always wear my glasses when I sit down at the computer). I will change the name to getMaxCount, but I still would like to use the cell nomenclature.

aherbert · 2023-07-17T22:41:14Z

There are two terms here used through the API: bit index which refers to a position within the shape; and the count of the number of observations of that index. In that context you are stating a cell is a position that can be filled with counts. So it is a countable bit index. Would the term cell ever be used for non-counting Bloom filters?

Where does that leave the naming of BitCountProducer? CellCountProducer; CellProducer? This is why I questioned it. Adding new terminology should make sense to those reading the API for the first time. Perhaps it can be done by adding a note to the class javadoc for CountingBloomFilter that bit indices that can be counted are also known as cells (if this is correct).

Claudenw · 2023-07-19T09:06:27Z

Cells are used in stable Bloom filters which are technically not a counting bloom filter they are in a class called "Element decay" Bloom filters. They do use cells, not bits, but only report and accept bits as output and input. There are several Bloom filters in this class.

This change is to make the development of the stable and a variable size CountingBloomFilter possible. In that change I was going to propose changing BitCountProducer and BitCountConsumer to CellProdudcer and CellConsumer respectively and change the method call from forEachCount to forEachCell. However, I see that it may make sense to do that in this change to minimize the changes for the stable and variable counting filters.

Are you ok with my proceeding to;

Update the documentation to talk about "cells"
Change BitCountProducer/BitCountConsumer to CellProducer/CellConsumer as noted above?
Change forEachCount to forEachCell

aherbert · 2023-07-19T10:03:36Z

Go ahead. If if more closely matches the literature then this is what we should do.

Claudenw · 2023-07-25T13:19:21Z

@aherbert Please take a look again and see if you are OK to merge this change.

…nged BitCount* to Cell* Updated documentation

aherbert · 2023-07-26T15:01:17Z

src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java

@@ -19,7 +19,7 @@
 import java.util.Objects;

 /**
- * The interface that describes a Bloom filter that associates a count with each
+ * The interface that describes a Bloom filter that associates a cell with each


Changing count to cell has led to some loss of information. The word count is obviously a tracking of occurrences. The word cell is not. IIUC a cell is a count for an index, thus it encapsulates both terms, i.e. you cannot have a cell without an index: cell = (index, count). If a cell did not have an associated index then it is simply a count.

Where does this leave the naming of CountingBloomFilter and ArrayCountingBloomFilter?

Definition of cell and usage of cell vs count was reviewed and corrections made

aherbert

Because you squashed and force pushed to the branch the comments from the last review I made has been lost. FYI: don't do this until the PR is ready to merge.

Note: I started by adding a single comment as I clicked the wrong button in GH. I later noted that you explain what cells are in the package-info. This should probably be repeated in the appropriate interfaces/classes as not many people will read the package info. Simple changing counts to cells makes some of the grammar less intuitive. This could use some work.

I cannot remember everything I stated last time. Overall the PR looks OK. I didn't do a deep dive because there is an issue with the getMaxIndex I raised last time that you have not addressed. It does not count repeats of the same index in the query item.

@Test
void test() {
    ArrayCountingBloomFilter f = new ArrayCountingBloomFilter(Shape.fromKM(1, 10));
    IndexProducer ip = p -> { 
        p.test(3);
        p.test(3);
        return true;
    };
    f.merge(ip);
    System.out.println(f.getMaxInsert(ip));
}

Prints:

This is non-sense. It could be fixed. If you don't care that the returned value is an overestimate then the method should be documented with warnings. The value from getMaxInsert is actually returning the number of times a cell touched by your item has been incremented, and not a factor of the number of times your item has been inserted. As such it is an overestimate of any item that touches the same cell multiple times.

src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java

aherbert · 2023-07-26T15:04:06Z

src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java

+     * @return the maximum number of times the IndexProducer could have been inserted.
+     */
+    default int getMaxInsert(IndexProducer idxProducer) {
+        return getMaxInsert( BitMapProducer.fromIndexProducer(idxProducer, getShape().getNumberOfBits()));


whitespace ( BitMap

src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCountingBloomFilterTest.java

Claudenw · 2023-07-27T08:04:46Z

I fixed the issues with getMaxInsert by requiring the CellProducer return distinct ordered indices. It made much of the code simpler, but added more test cases. I will go back and review all the places where Cell is referenced in java doc and ensure that there is some sort of description.

github.com:Claudenw/commons-collections into unlock_counting_bloom_filters

aherbert · 2023-07-27T14:17:53Z

src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java

@@ -121,7 +186,7 @@ default boolean merge(final Hasher hasher) {
    /**
     * Merges the specified index producer into this Bloom filter.
     *
-     * <p>Specifically: all counts for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>
+     * <p>Specifically: all cells for the indexes identified by the {@code indexProducer} will be incremented by 1.</p>


I don't think your conversion to a CellProducer upholds this contract. The cells will not be incremented by 1; they will be incremented by the count of the indices because merge then calls add.

IIUC a merge is from BloomFilter and should flatten all arguments to an effective bitmap, removing duplicates. The add operation should not remove duplicate indices because they are treated as cells with a count above 1.

We should have some tests to check these definitions are upheld.

ArrayCountingBloomFilter f1 = new ArrayCountingBloomFilter(Shape.fromKM(1, 10)); ArrayCountingBloomFilter f2 = f1.copy(); ArrayCountingBloomFilter f3 = f1.copy(); IndexProducer ip = p -> { p.test(3); p.test(3); return true; }; // The merge SHOULD increment cells by 1 f1.merge(ip); // The add should increment cells by 2 f2.add(CellProducer.from(ip)); // This merge will increment by 1 as the round-trip makes the indices unique f3.merge(IndexProducer.fromIndexArray(ip.asIndexArray())); f1.forEachCell((k, v) -> { System.out.printf("%d = %d%n", k, v); return true; }); f2.forEachCell((k, v) -> { System.out.printf("%d = %d%n", k, v); return true; }); f3.forEachCell((k, v) -> { System.out.printf("%d = %d%n", k, v); return true; });

Prints:

3 = 2 // ERROR ! 3 = 2 3 = 1

Agree this is incorrect. I am adding this as a test to the AbstractCountingFilterTest.

I think that we should make the following changes.

for counting filters

merge always increments the cell value by at most 1.

remove always decrements the cell value by at most 1.

add and subtract as used to increment/decrement by more than 1.

for index producers

as array always returns the indices in the number and order they are produced.

move the "unique" method from hashers to index producers. Hashers create index producers. Let the use of the producer determine if it needs to be unique.

Agree with all points except:

"as array always returns the indices in the number and order they are produced."

This constraint may impact performance. Is it required?

A hasher asIndexArray may not wish to do this as the indices are random. The EnhancedDoubleHasher current returns the indices unordered.

Is it "required" no, but when working in complex arrangements it mean that the simply converting to an array does not change the state of the data. In most cases this is not an issue. The only case where it is an issue is when a Hasher backed IndexProducer.asArray() is called. In all other cases (I think) the array either already exists or exists in a different form (BitMaps, CellProducers). BitMaps and CellProducers always return distinct values in order.

My plan was to lift the code from BitMapProducer that produces bitMaps and use it to produce an array of indices from a Hasher.

Perhaps my original comment is not clear. Basically, however the underlying storage produces the indices is the order and number that will be produced in the array.

If you wish to return a distinct array then the default method in the interface using a bit set will be fine.

If you wish to return a count of duplicate indices then the current implementation will work if you just sort the array it currently creates. This may be faster than using a complicated intermediate data structure. The hasher is likely to have a small array and the speed of the sort will be fast.

This may be an opportunity to use some more characteristics flags where an index producer can declare if the indices are sorted and or duplicates. Then the interface can provide default methods to remove duplicates or sort them.

If we have IndexProducer.asArray output the indices in the order that they would be returned by IndexProducer.forEachIndex() then we are not changing the data based on the representation.

This also means that sorting can be done by generating an array, using Array.sort() and then using the result to create an IndexProducer.fromIndexArray() to create the sorted index.

The characteristics flag may be a good idea. We already have them in the test code.

How about we move this forward as is and merge it. I think the change is large enough (I was trying to keep is small).

I still need to get the LayeredBloom filter added and it will need to be updated once we have this in place.
Finally, I have the StableBloom filter, which may clarify some of the questions we have a bout characteristics and whether we have chosen the proper restrictions on IndexProducer and CellProducers.

The last bit of my previous message was not meant to indicate that this should not be merged, just that there are 2 more changes coming as different pull requests.

I misread your previous statement "as array always returns the indices in the number and order they are produced." I took that to mean in order, i.e. sorted. Re-reading it I see that you are stating that the order matches the output produced by forEachIndex.

This is fine. But it now means the default implementation of asIndexArray (which creates sorted, unique indices) is now in conflict with the javadoc.

PS. I just saw that you changed the default implementation of asIndexArray to respect the order of forEachIndex. I noted a problem with the internal comment in that method about the capacity.

Should all be fixed and ready to go now.

src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java

aherbert

Can you try and update the PR so that the git history is kept for two of the renamed files. You should be able to restore the files from an old commit, use git mv to rename them and then replace the contents with the new code files you have here.

Please remove changes to IndexFilter. They may be useful. They are not related to this change.

src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java

src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java

...va/org/apache/commons/collections4/bloomfilter/CellProducerFromDefaultIndexProducerTest.java

...rg/apache/commons/collections4/bloomfilter/CellProducerFromArrayCountingBloomFilterTest.java

src/test/java/org/apache/commons/collections4/bloomfilter/DefaultCellProducerTest.java

aherbert · 2023-08-01T23:46:10Z

src/test/java/org/apache/commons/collections4/bloomfilter/DefaultIndexProducerTest.java

+    @Test
+    public void testMoreThan32Entries() {
+        int[] values = new int[33];
+        for (int i=0; i<33; i++) {


int[] values = IntStream.range(0, 33).toArray();

src/test/java/org/apache/commons/collections4/bloomfilter/DefaultIndexProducerTest.java

Claudenw · 2023-08-02T08:30:42Z

@aherbert Do you have a tool that finds all the screwy whitespace? I try to keep the code clean but my historical style keeps slipping through.

aherbert · 2023-08-02T14:06:07Z

@aherbert Do you have a tool that finds all the screwy whitespace? I try to keep the code clean but my historical style keeps slipping through.

I spot them when reviewing. Sometimes there are not many so I point them out. If you would like I can stop pointing them out as it does add noise to a PR review. I can format the code and the tests in this package to be consistent when all your planned revisions are submitted.

Claudenw · 2023-08-03T17:44:15Z

Can you try and update the PR so that the git history is kept for two of the renamed files. You should be able to restore the files from an old commit, use git mv to rename them and then replace the contents with the new code files you have here.

Please remove changes to IndexFilter. They may be useful. They are not related to this change.

No luck trying to get the move to be recognized. Tried restoring and moving, restoring editing and then moving, restoring clearing the git cache of the new name, moving. Nothing I tried resulted in the github display showing the file as changed. Only deleted and new.

If I revert the IndexFilter change the code coverage will not be 100% and no way to get to 100%. If that is the better solution at this point I will do that and make the change later as a separate pull request.

aherbert · 2023-08-03T19:12:12Z

I think you have fixed it. The GH UI sometimes does not show the renamed file as a move. I have cloned the latest PR code and verified locally that the history is there:

git log --follow  src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java
git log --follow  src/test/java/org/apache/commons/collections4/bloomfilter/AbstractCellProducerTest.java

Log revisions date back to 2022.

I should have done this before I asked you to restore the history as it may have been there already. Anyway, it is OK now.

I think we should revert the IndexFilter change and do that separately. This PR is related to the counting Bloom filters and their ability to indicate the max capacity of a cell and estimate the number of entries of an item.

Claudenw · 2023-08-05T12:27:43Z

I think that IndexUtils method should be moved to ArrayUtils in the root package.

aherbert · 2023-08-05T14:02:35Z

I think that IndexUtils method should be moved to ArrayUtils in the root package.

Note that class is package private, and copied and adapted from Commons Lang.

To make the method public requires a bit more effort. For example duplicating for all 8 primitive array types and also a generic type. It may also require adding a method to expand the array for adding more than 1 item. (The current method only requires expanding the array to add 1 item.) This generic functionality may be better suited to ArrayUtils in commons-lang.

It is simpler to hide this functionality at the package level for now, and potentially add this as a feature in the future.

Claudenw · 2023-08-05T16:32:16Z

OK. If there are any issues remaining please let me know.

aherbert

Sorry for the delay in reviewing. There is an issue with the use of the IndexUtils to enlarge the array. The class should also be a package-private final utility class.

src/main/java/org/apache/commons/collections4/bloomfilter/IndexUtils.java

src/main/java/org/apache/commons/collections4/bloomfilter/CellProducer.java

src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java

aherbert · 2023-08-08T22:32:09Z

src/main/java/org/apache/commons/collections4/bloomfilter/IndexFilter.java

@@ -72,7 +72,10 @@ public boolean test(final int number) {
        if (number >= size) {
            throw new IndexOutOfBoundsException(String.format("number too large %d >= %d", number, size));
        }
-        return !tracker.test(number) || consumer.test(number);
+        if (tracker.test(number)) {


You reverted the IndexFilter file but left this change. I think the statements are identical. Since the plan is to replace this with a separate change to the IndexFilter then can you revert this too. The IndexFilter should be unchanged by this PR.

This change fixes code coverage check. In the original case there are 4 branches, one of which can not be tested. In this case there are only 3 and the code coverage report shows 100% coverage for the Bloomfilter code.

Claudenw changed the title ~~COLLECTIONS-844 added maxValue() to CountingBloomFilter~~ COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE Jul 16, 2023

Claudenw requested a review from aherbert July 16, 2023 20:00

Claudenw self-assigned this Jul 21, 2023

added getMaxInsert() and getMaxValue() to CountingBloomFilter and cha…

eccfd03

…nged BitCount* to Cell* Updated documentation

Claudenw force-pushed the unlock_counting_bloom_filters branch from 00adc10 to eccfd03 Compare July 26, 2023 08:27

aherbert reviewed Jul 26, 2023

View reviewed changes

aherbert requested changes Jul 26, 2023

View reviewed changes

Changed CellProducer to require ordered distinct cell indices.

837a8ff

Claudenw added 6 commits July 27, 2023 10:35

Updated documentation

7a9f9c2

Merge branch 'unlock_counting_bloom_filters' of

afd0057

github.com:Claudenw/commons-collections into unlock_counting_bloom_filters

fixed javadoc issue

ec6360c

updated cell documentation and term usage

e6807e0

fixed typo

8a9b193

removed extra spaces.

66d45ba

aherbert reviewed Jul 27, 2023

View reviewed changes

Claudenw added 4 commits July 29, 2023 10:47

updated to fix Counting merge/remove issue

1ea112e

Changes to standardize merge() and IndexProducer

27e4119

modified as per discussion

f7e4be0

Cleaned up Jacoco report.

06051a5

aherbert reviewed Jul 29, 2023

View reviewed changes

src/main/java/org/apache/commons/collections4/bloomfilter/IndexProducer.java Outdated Show resolved Hide resolved

fixed asIndexArray data reallocation issues.

257d8a9

Claudenw requested a review from aherbert August 1, 2023 13:26

aherbert requested changes Aug 1, 2023

View reviewed changes

Claudenw force-pushed the unlock_counting_bloom_filters branch 3 times, most recently from 3aac605 to 9164ace Compare August 3, 2023 16:52

reversing again

1c6a3f1

Claudenw force-pushed the unlock_counting_bloom_filters branch from dad4a75 to 1c6a3f1 Compare August 3, 2023 17:18

Claudenw added 5 commits August 3, 2023 19:22

updated BitCountProducer

0e404bd

updated BitCountProducer Test

8c0c182

renamed bitcountproducer

94e5581

reverse

25143a5

renamed bitcountproducer

e5a2b45

updated as per comments

a197d2e

reverted IndexFilter with minor change for jacoco

12ae0e3

Claudenw requested a review from aherbert August 4, 2023 12:47

added javadoc

aa8f214

aherbert requested changes Aug 8, 2023

View reviewed changes

Claudenw requested a review from aherbert August 13, 2023 14:48

Claudenw added 2 commits August 13, 2023 16:49

changes as requested

0df5351

removed extra spaces

f1e159b

aherbert merged commit 488393e into apache:master Aug 15, 2023

COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE #406

COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE #406

Conversation

Claudenw commented Jul 16, 2023 • edited Loading

codecov-commenter commented Jul 16, 2023 • edited Loading

Codecov Report

aherbert commented Jul 16, 2023

aherbert commented Jul 17, 2023

Claudenw commented Jul 17, 2023 • edited Loading

aherbert commented Jul 17, 2023

Claudenw commented Jul 19, 2023

aherbert commented Jul 19, 2023

Claudenw commented Jul 25, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aherbert left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Claudenw commented Jul 27, 2023

aherbert Jul 27, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aherbert left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Claudenw commented Aug 2, 2023

aherbert commented Aug 2, 2023

Claudenw commented Aug 3, 2023

aherbert commented Aug 3, 2023

Claudenw commented Aug 5, 2023

aherbert commented Aug 5, 2023 • edited Loading

Claudenw commented Aug 5, 2023

aherbert left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Claudenw commented Jul 16, 2023 •

edited

Loading

codecov-commenter commented Jul 16, 2023 •

edited

Loading

Claudenw commented Jul 17, 2023 •

edited

Loading

aherbert Jul 27, 2023 •

edited

Loading

aherbert commented Aug 5, 2023 •

edited

Loading