[COLLECTIONS-843] Implement Layered Bloom filter #402

Merged
merged 54 commits into from
Dec 22, 2023
Changes from 22 commits
Commits
36beec4
Adjusted tests to handle bloom filter implementations that utilized
Claudenw Jun 21, 2023
5be49b7
cleaned up spacing
Claudenw Jun 27, 2023
3a1a8c8
fixed indent
Claudenw Jun 21, 2023
a6b5e81
updated for layered testing
Claudenw Jun 27, 2023
301041b
removed spaces
Claudenw Jun 27, 2023
7631528
fixed merge issue
Claudenw Jun 27, 2023
680d2bb
initial checkin
Claudenw Jun 27, 2023
ee2cdff
cleaned up tests
Claudenw Jun 29, 2023
d581e99
fixed timing on test
Claudenw Jun 29, 2023
0dddfa2
fixed formatting
Claudenw Jun 29, 2023
756e88f
added javadoc
Claudenw Jun 29, 2023
ccf430f
fixed typos
Claudenw Jun 29, 2023
b8fe880
removed blank lines
Claudenw Jun 29, 2023
a6d4f46
fixed javadocs
Claudenw Jun 30, 2023
4ab19a6
Fix Javadoc
garydgregory Jun 30, 2023
ad257e1
Add Javadoc @since 4.5
garydgregory Jun 30, 2023
d5e17e8
Add Javadoc @since 4.5
garydgregory Jun 30, 2023
97ca57e
updated tests and added BloomFilterProducer code
Claudenw Jun 30, 2023
cb09fbe
Merge branch 'apache:master' into layered_filter
Claudenw Jun 30, 2023
0123c3f
Merge branch 'layered_filter' of github.com:Claudenw/commons-collecti…
Claudenw Jun 30, 2023
1a647d5
Cleaned up javadoc and BiPredicate<BloomFilter,BloomFilter> processing
Claudenw Jun 30, 2023
62eba66
fixed javadoc issues
Claudenw Jun 30, 2023
4e7ab0b
fixed typography issue
Claudenw Jun 30, 2023
a1749d7
Fixed a documentation error
Claudenw Jul 4, 2023
8eda0ba
code format cleanup
Claudenw Jul 6, 2023
7bd8d33
code simplification and documentation
Claudenw Jul 6, 2023
63bfe90
added isEmpty and associated tests
Claudenw Jul 6, 2023
8f22d46
Changes as requested by review
Claudenw Jul 6, 2023
1addfe7
cleaned up formatting errors
Claudenw Jul 6, 2023
ac92c7d
fixed javadoc issues
Claudenw Jul 6, 2023
6188866
added LayeredBloomFilter to overview.
Claudenw Jul 6, 2023
2967b1e
added coco driven test cases.
Claudenw Jul 6, 2023
6517fed
attempt to fix formatting
Claudenw Jul 7, 2023
c3fbcf5
cleaned up javadoc differences
Claudenw Jul 7, 2023
b8e4850
cleaned up javadoc
Claudenw Jul 7, 2023
7732ede
Made flatten() part of BloomFilterProducer
Claudenw Jul 7, 2023
9d8282c
fixed since tag.
Claudenw Jul 7, 2023
c18e7da
changed X() methods to setX()
Claudenw Jul 21, 2023
f24bfad
updated javadoc
Claudenw Jul 21, 2023
f0eb919
fixed javadoc errors
Claudenw Jul 21, 2023
e193143
Merge branch 'master' into layered_filter
Claudenw Aug 15, 2023
75b9a6d
merged changes from master
Claudenw Aug 15, 2023
e3ac952
renamed to Test to CellProducerFromLayeredBloomFilterTest
Claudenw Aug 15, 2023
508eec3
changed to jupiter from junit.
Claudenw Aug 25, 2023
c2ecf7e
added override for uniqueIndices as optimization.
Claudenw Aug 25, 2023
9769046
fixed checkstyle issue
Claudenw Aug 27, 2023
b39a472
modified as per review
Claudenw Oct 1, 2023
195f987
Merge branch 'layered_filter' of github.com:Claudenw/commons-collecti…
Claudenw Oct 1, 2023
a71b55e
Updated tests as per review
Claudenw Oct 27, 2023
4babacf
fixed variable initialization issues
Claudenw Oct 28, 2023
95df800
Merge branch 'master' into layered_filter
Claudenw Nov 24, 2023
68201e5
made suggested test changes
Claudenw Nov 27, 2023
d806d6e
fixed broken test
Claudenw Nov 27, 2023
500a252
Remove dead comments per code reviews
garydgregory Dec 22, 2023
Original file line number Diff line number Diff line change
Expand Up @@ -209,6 +209,21 @@ default boolean isFull() {
*/
int cardinality();

/**
* Determines if all the bits are off. This is equivalent to
* {@code cardinality() == 0}.
*
* <p>
* <em>Note: This method is optimised for non-sparse filters.</em> Implementers
* are encouraged to implement faster checks if possible.
* </p>
*
* @return {@code true} if no bits are enabled, {@code false} otherwise.
*/
default boolean isEmpty() {
return forEachBitMap(y -> y == 0);
}
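
The default isEmpty() above depends on forEachBitMap stopping at the first failing word. A minimal, self-contained sketch of that behaviour follows. BitMapSketch and its members are hypothetical stand-ins for illustration, not the commons-collections types.

```java
import java.util.function.LongPredicate;

// Hypothetical stand-in for a bit-map backed filter; not the library's class.
final class BitMapSketch {
    private final long[] bitMaps;

    BitMapSketch(final long... bitMaps) {
        this.bitMaps = bitMaps.clone();
    }

    // Applies the predicate to each 64-bit word and stops at the first failure.
    boolean forEachBitMap(final LongPredicate predicate) {
        for (final long word : bitMaps) {
            if (!predicate.test(word)) {
                return false;
            }
        }
        return true;
    }

    // Same shape as the default method in the diff: empty when every word is zero.
    boolean isEmpty() {
        return forEachBitMap(y -> y == 0);
    }

    // Counts the set bits across all words.
    int cardinality() {
        int count = 0;
        for (final long word : bitMaps) {
            count += Long.bitCount(word);
        }
        return count;
    }
}
```

In this sketch isEmpty() can return as soon as it meets a non-zero word, whereas cardinality() always visits every word.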

/**
* Estimates the number of items in the Bloom filter.
*
Original file line number Diff line number Diff line change
Expand Up @@ -16,20 +16,22 @@
*/
package org.apache.commons.collections4.bloomfilter;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.Predicate;

/**
* Produces Bloom filters that are copies of Bloom filters in a collection (e.g.
* LayerBloomFilter).
* Produces Bloom filters from a collection (e.g. LayeredBloomFilter).
*
* @since 4.5
*/
public interface BloomFilterProducer {

/**
* Executes a Bloom filter Predicate on each Bloom filter in the manager in
* depth order. Oldest filter first.
* Executes a Bloom filter Predicate on each Bloom filter in the collection. The
* ordering of the Bloom filters is not specified by this interface.
*
* @param bloomFilterPredicate the predicate to evaluate each Bloom filter with.
* @return {@code false} when the first filter fails the predicate test. Returns
Expand All @@ -38,54 +40,102 @@ public interface BloomFilterProducer {
boolean forEachBloomFilter(Predicate<BloomFilter> bloomFilterPredicate);

/**
* Return a deep copy of the BloomFilterProducer data as a Bloom filter array.
* <p>
* The default implementation of this method is slow. It is recommended that
* implementing classes reimplement this method.
* </p>
* Return an array of the Bloom filters in the collection.
* <p><em>Implementations should specify if the array contains deep copies, immutable instances,
* or references to the filters in the collection.</em></p>
*
* @return An array of Bloom filters.
*/
default BloomFilter[] asBloomFilterArray() {
class Filters {
private BloomFilter[] data = new BloomFilter[16];
private int size;

boolean add(final BloomFilter filter) {
if (size == data.length) {
// This will throw an out-of-memory error if there are too many Bloom filters.
data = Arrays.copyOf(data, size * 2);
}
data[size++] = filter.copy();
return true;
}

BloomFilter[] toArray() {
// Edge case to avoid a large array copy
return size == data.length ? data : Arrays.copyOf(data, size);
}
}
final Filters filters = new Filters();
forEachBloomFilter(filters::add);
return filters.toArray();
final List<BloomFilter> filters = new ArrayList<>();
forEachBloomFilter(f -> filters.add(f.copy()));
return filters.toArray(new BloomFilter[filters.size()]);
}

/**
* Applies the {@code func} to each Bloom filter pair in order. Will apply all
* of the Bloom filters from the other BloomFilterProducer to this producer. If
* this producer does not have as many BloomFilters it will provide
* {@code null} for all excess calls to the BiPredicate.
* either {@code this} producer or {@code other} producer has fewer BloomFilters
this method will provide {@code null} for all excess calls to the {@code func}.
*
* <p><em>This implementation returns references to the Bloom filter. Other implementations
* should specify if the array contains deep copies, immutable instances,
* or references to the filters in the collection.</em></p>
*
* @param other The other BloomFilterProducer that provides the y values in the
* (x,y) pair.
* @param func The function to apply.
* @return A LongPredicate that tests this BitMapProducers bitmap values in
* order.
* @return {@code true} if the {@code func} returned {@code true} for every pair,
* {@code false} otherwise.
*/
default boolean forEachBloomFilterPair(final BloomFilterProducer other,
final BiPredicate<BloomFilter, BloomFilter> func) {
final CountingPredicate<BloomFilter> p = new CountingPredicate<>(asBloomFilterArray(), func);
return other.forEachBloomFilter(p) && p.forEachRemaining();
}
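
The pairing rule described in the Javadoc above (null is supplied for whichever side runs out first) can be sketched without the library. PairingSketch and its pairs helper are hypothetical names used only for illustration; the sketch mirrors the `other.forEachBloomFilter(p) && p.forEachRemaining()` shape but pairs Strings instead of Bloom filters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.Predicate;

// Hypothetical sketch of the pairing behaviour; names are illustrative only.
final class PairingSketch<T> implements Predicate<T> {
    private final T[] ary;
    private final BiPredicate<T, T> func;
    private int idx;

    PairingSketch(final T[] ary, final BiPredicate<T, T> func) {
        this.ary = ary;
        this.func = func;
    }

    // Pairs the next element of ary (or null once ary is exhausted) with 'other'.
    @Override
    public boolean test(final T other) {
        return func.test(idx == ary.length ? null : ary[idx++], other);
    }

    // Pairs each element of ary that has not been consumed yet with null.
    boolean forEachRemaining() {
        while (idx != ary.length && func.test(ary[idx], null)) {
            idx++;
        }
        return idx == ary.length;
    }

    // Records the (x, y) pairs seen when 'left' is paired with 'right'.
    static List<String> pairs(final String[] left, final String[] right) {
        final List<String> seen = new ArrayList<>();
        final PairingSketch<String> p =
            new PairingSketch<String>(left, (x, y) -> seen.add(x + ":" + y));
        for (final String r : right) {
            if (!p.test(r)) {
                return seen;
            }
        }
        p.forEachRemaining();
        return seen;
    }
}
```

With a longer left side, the extra left elements are paired with null on the right. With a longer right side, the extra right elements are paired with null on the left.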

/**
* Create a standard (non-layered) Bloom filter by merging all of the layers. If
* the filter is empty this method will return an empty Bloom filter.
*
* @return the merged bloom filter.
*/
default BloomFilter flatten() {
final BloomFilter[] bf = {null};
forEachBloomFilter(x -> {
if (bf[0] == null) {
bf[0] = new SimpleBloomFilter(x.getShape());
}
return bf[0].merge(x);
});
return bf[0];
}
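
The one-element-array holder used by flatten() is a common way to let a lambda assign a result. The sketch below models it with java.util.BitSet standing in for a Bloom filter and bitwise OR standing in for merge. FlattenSketch is a hypothetical name, and the stand-ins are assumptions made for illustration. As in the default method as written, the holder stays null when no layer is visited.

```java
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch: BitSet stands in for a Bloom filter, a List for the producer.
final class FlattenSketch {
    // Merges every layer into a new BitSet; returns null when there are no layers.
    static BitSet flatten(final List<BitSet> layers) {
        final BitSet[] holder = {null}; // one-element array so the lambda can assign it
        layers.forEach(layer -> {
            if (holder[0] == null) {
                holder[0] = new BitSet(layer.size());
            }
            holder[0].or(layer); // merge is modelled as a bitwise OR
        });
        return holder[0];
    }
}
```

The input layers are left unmodified because the first layer is merged into a fresh BitSet rather than reused.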

/**
* Creates a BloomFilterProducer from an array of Bloom filters.
*
* <ul>
* <li>The asBloomFilterArray() method returns a copy of the original array
* with references to the original filters.</li>
* <li>The forEachBloomFilterPair() method uses references to the original filters.</li>
* </ul>
* <p><em>All modifications to the Bloom filters are reflected in the original filters</em></p>
*
* @param filters The filters to be returned by the producer.
* @return The BloomFilterProducer containing the filters.
*/
static BloomFilterProducer fromBloomFilterArray(BloomFilter... filters) {
return new BloomFilterProducer() {
@Override
public boolean forEachBloomFilter(final Predicate<BloomFilter> predicate) {
for (final BloomFilter filter : filters) {
if (!predicate.test(filter)) {
return false;
}
}
return true;
}

/**
* This implementation returns a copy of the original array. The contained Bloom filters
* are references to the originals; any modifications to them are reflected in the original
* filters.
*/
@Override
public BloomFilter[] asBloomFilterArray() {
return Arrays.copyOf(filters, filters.length);
}

/**
* This implementation uses references to the original filters. Any modifications to the
* filters are reflected in the originals.
*/
@Override
public boolean forEachBloomFilterPair(final BloomFilterProducer other,
final BiPredicate<BloomFilter, BloomFilter> func) {
final CountingPredicate<BloomFilter> p = new CountingPredicate<>(filters, func);
return other.forEachBloomFilter(p) && p.forEachRemaining();
}
};
}
}
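
The "copy of the array, references to the filters" contract documented for fromBloomFilterArray() follows from how Arrays.copyOf works on object arrays. A small sketch, again using java.util.BitSet as an assumed stand-in for a Bloom filter and the hypothetical name ShallowCopySketch:

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch of a shallow array copy; BitSet stands in for a Bloom filter.
final class ShallowCopySketch {
    private final BitSet[] filters;

    ShallowCopySketch(final BitSet... filters) {
        this.filters = filters;
    }

    // Returns a new array that holds the same element references.
    BitSet[] asArray() {
        return Arrays.copyOf(filters, filters.length);
    }
}
```

Replacing a slot in the returned array does not affect the producer, but mutating an element does, because both arrays point at the same objects.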
Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,11 @@ default boolean forEachIndex(final IntPredicate predicate) {
return forEachCell((i, v) -> predicate.test(i));
}

@Override
default IndexProducer uniqueIndices() {
return this;
}

/**
* Creates a CellProducer from an IndexProducer.
*
Original file line number Diff line number Diff line change
Expand Up @@ -19,20 +19,21 @@
import java.util.function.LongPredicate;

/**
* A long predicate that applies the test func to each member of the @{code ary} in sequence for each call to @{code test()}.
* if the @{code ary} is exhausted, the subsequent calls to @{code test} are executed with a zero value.
* If the calls to @{code test} do not exhaust the @{code ary} the @{code forEachRemaining} method can be called to
* execute the @code{text} with a zero value for each remaining @{code idx} value.
* A long predicate that applies the test func to each member of the {@code ary} in sequence for each call to {@code test()}.
* If the {@code ary} is exhausted, the subsequent calls to {@code test} are executed with a zero value.
* If the calls to {@code test} do not exhaust the {@code ary} the {@code forEachRemaining} method can be called to
* execute the {@code test} with a zero value for each remaining {@code idx} value.
* @since 4.5
*/
class CountingLongPredicate implements LongPredicate {
private int idx = 0;
private final long[] ary;
private final LongBiPredicate func;

/**
* Constructs an instance that will compare the elements in @{code ary} with the elements returned by @{code func}.
* function is called as @{code func.test( idxValue, otherValue )}. If there are more @{code otherValue} values than
* @{code idxValues} then @{code func} is called as @{code func.test( 0, otherValue )}.
* Constructs an instance that will compare the elements in {@code ary} with the elements returned by {@code func}.
* The function is called as {@code func.test( idxValue, otherValue )}. If there are more {@code otherValue} values than
* {@code idxValues} then {@code func} is called as {@code func.test( 0, otherValue )}.
* @param ary The array of long values to compare.
* @param func The function to apply to the pairs of long values.
*/
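
The zero-padding described in this Javadoc can be sketched as follows. CountingLongSketch and its nested LongPair interface are hypothetical names (LongPair stands in for the package's LongBiPredicate), and the contains helper is only an assumed example of a word-by-word comparison.

```java
import java.util.function.LongPredicate;

// Hypothetical sketch; LongPair stands in for the package's LongBiPredicate.
final class CountingLongSketch implements LongPredicate {
    @FunctionalInterface
    interface LongPair {
        boolean test(long x, long y);
    }

    private final long[] ary;
    private final LongPair func;
    private int idx;

    CountingLongSketch(final long[] ary, final LongPair func) {
        this.ary = ary;
        this.func = func;
    }

    // Pairs the next word of ary (or zero once ary is exhausted) with 'other'.
    @Override
    public boolean test(final long other) {
        return func.test(idx == ary.length ? 0L : ary[idx++], other);
    }

    // Pairs each word of ary that has not been consumed yet with zero.
    boolean forEachRemaining() {
        while (idx != ary.length && func.test(ary[idx], 0L)) {
            idx++;
        }
        return idx == ary.length;
    }

    // True when every bit set in 'other' is also set in 'mine', word by word.
    static boolean contains(final long[] mine, final long[] other) {
        final CountingLongSketch p = new CountingLongSketch(mine, (x, y) -> (x & y) == y);
        for (final long word : other) {
            if (!p.test(word)) {
                return false;
            }
        }
        return p.forEachRemaining();
    }
}
```

Padding with zero means that a missing word behaves like a word with no bits set, so arrays of different lengths can still be compared.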
Original file line number Diff line number Diff line change
Expand Up @@ -20,29 +20,27 @@
import java.util.function.Predicate;

/**
* A predicate that applies the test func to each member of the @{code ary} in
* sequence for each call to @{code test()}. if the @{code ary} is exhausted,
* the subsequent calls to @{code test} are executed with a {@code null} value.
* If the calls to @{code test} do not exhaust the @{code ary} the @{code
* A predicate that applies the test {@code func} to each member of the {@code ary} in
* sequence for each call to {@code test()}. If the {@code ary} is exhausted,
* the subsequent calls to {@code test} are executed with a {@code null} value.
* If the calls to {@code test} do not exhaust the {@code ary} the {@code
* forEachRemaining} method can be called to execute the {@code test} with a
* {@code null} value for each remaining @{code idx} value.
* {@code null} value for each remaining {@code idx} value.
*
* @param <T> the type of object being compared.
*
* @Since 4.5
* @since 4.5
*/
class CountingPredicate<T> implements Predicate<T> {
private int idx = 0;
private final T[] ary;
private final BiPredicate<T, T> func;

/**
* Constructs an instance that will compare the elements in @{code ary} with the
* elements returned by @{code func}. function is called as @{code func.test(
* idxValue, otherValue )}. If there are more @{code otherValue} values than
* Constructs an instance that will compare the elements in {@code ary} with the
* elements returned by {@code func}. The function is called as {@code func.test(
* idxValue, otherValue )}. If there are more {@code otherValue} values than
* {@code idxValues} then {@code func} is called as {@code func.test(null, otherValue)}.
*
* @{code idxValues} then @{code func} is called as @{code func.test( null,
* otherValue )}.
* @param ary The array of values to compare.
* @param func The function to apply to the pairs of values.
*/
Expand All @@ -57,10 +55,10 @@ public boolean test(final T other) {
}

/**
* Call the T-T consuming bi-predicate for each remaining unpaired T in the
* Call BiPredicate&lt;T,T&gt; for each remaining unpaired &lt;T&gt; in the
* input array. This method should be invoked after the predicate has been
* passed to a @{code TProducer#forEachT(BiPredicate(T,T))} to consume any
* unpaired Ts. The second argument to the bi-predicate will be @{code null}.
* passed to a &lt;T&gt;Producer#forEach&lt;T&gt;(BiPredicate&lt;T,T&gt;) to consume any
* unpaired &lt;T&gt;s. The second argument to the BiPredicate will be {@code null}.
*
* @return true if all calls to the predicate were successful
*/
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ public interface Hasher {
/**
* Creates an IndexProducer for this hasher based on the Shape.
*
* <p>The @{code IndexProducer} will create indices within the range defined by the number of bits in
* <p>The {@code IndexProducer} will create indices within the range defined by the number of bits in
* the shape. The total number of indices will respect the number of hash functions per item
* defined by the shape. However the count of indices may not be a multiple of the number of
* hash functions if the implementation has removed duplicates.</p>
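
The remark about the index count can be illustrated with a sketch. It uses plain double hashing (h1 + i * h2 mod m), which is an assumption chosen for illustration and not necessarily what any Hasher in the library does. IndexSketch and all of its parameters are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of index generation for one item; not the library's Hasher.
final class IndexSketch {
    // Produces k indices in [0, m) by plain double hashing.
    // When 'unique' is true, duplicates are dropped, so fewer than k may be returned.
    static int[] indices(final long h1, final long h2, final int k, final int m,
                         final boolean unique) {
        final int[] all = new int[k];
        for (int i = 0; i < k; i++) {
            all[i] = (int) Math.floorMod(h1 + i * h2, (long) m);
        }
        return unique ? Arrays.stream(all).distinct().toArray() : all;
    }
}
```

With h1 = 0, h2 = 5, k = 4 and m = 10 the raw sequence is 0, 5, 0, 5, and removing duplicates leaves only two indices for an item that nominally uses four hash functions.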