COLLECTIONS-844 - allow counting Bloom filters with cell size other than Integer.SIZE #406

Merged
merged 24 commits into from
Aug 15, 2023
@@ -23,17 +23,16 @@
import java.util.stream.IntStream;

/**
* A counting Bloom filter using an int array to track counts for each enabled bit
* index.
* A counting Bloom filter using an int array to track cells for each enabled bit.
*
* <p>Any operation that results in negative counts or integer overflow of
* counts will mark this filter as invalid. This transition is not reversible.
* The operation is completed in full, no exception is raised and the state is
* set to invalid. This allows the counts for the filter immediately prior to the
* set to invalid. This allows the cells for the filter immediately prior to the
* operation that created the invalid state to be recovered. See the documentation
* in {@link #isValid()} for details.</p>
*
* <p>All the operations in the filter assume the counts are currently valid,
* <p>All the operations in the filter assume the cells are currently valid,
* for example {@code cardinality} or {@code contains} operations. Behavior of an invalid
* filter is undefined. It will no longer function identically to a standard
* Bloom filter that is the merge of all the Bloom filters that have been added
@@ -57,30 +56,30 @@ public final class ArrayCountingBloomFilter implements CountingBloomFilter {
private final Shape shape;

/**
* The count of each bit index in the filter.
* The cell for each bit index in the filter.
*/
private final int[] counts;
private final int[] cells;

/**
* The state flag. This is a bitwise {@code OR} of the entire history of all updated
* counts. If negative then a negative count or integer overflow has occurred on
* one or more counts in the history of the filter and the state is invalid.
* cells. If negative then a negative cell or integer overflow has occurred on
* one or more cells in the history of the filter and the state is invalid.
*
* <p>Maintenance of this state flag is branch-free for improved performance. It
* eliminates a conditional check for a negative count during remove/subtract
* eliminates a conditional check for a negative cell during remove/subtract
* operations and a conditional check for integer overflow during merge/add
* operations.</p>
*
* <p>Note: Integer overflow is unlikely in realistic usage scenarios. A count
* <p>Note: Integer overflow is unlikely in realistic usage scenarios. A cell
* that overflows indicates that the number of items in the filter exceeds the
* maximum possible size (number of bits) of any Bloom filter constrained by
* integer indices. At this point the filter is most likely full (all bits are
* non-zero) and thus useless.</p>
*
* <p>Negative counts are a concern if the filter is used incorrectly by
* <p>Negative cells are a concern if the filter is used incorrectly by
* removing an item that was never added. It is expected that a user of a
* counting Bloom filter will not perform this action as it is a mistake.
* Enabling an explicit recovery path for negative or overflow counts is a major
* Enabling an explicit recovery path for negative or overflow cells is a major
* performance burden not deemed necessary for the unlikely scenarios when an
* invalid state is created. Maintenance of the state flag is a concession to
* flag improper use that should not have a major performance impact.</p>
@@ -96,18 +95,23 @@ public final class ArrayCountingBloomFilter implements CountingBloomFilter {
public ArrayCountingBloomFilter(final Shape shape) {
Objects.requireNonNull(shape, "shape");
this.shape = shape;
counts = new int[shape.getNumberOfBits()];
cells = new int[shape.getNumberOfBits()];
}

private ArrayCountingBloomFilter(final ArrayCountingBloomFilter source) {
this.shape = source.shape;
this.state = source.state;
this.counts = source.counts.clone();
this.cells = source.cells.clone();
}

@Override
public void clear() {
Arrays.fill(counts, 0);
Arrays.fill(cells, 0);
}

@Override
public int getMaxCell() {
return Integer.MAX_VALUE;
}

@Override
@@ -122,20 +126,20 @@ public int characteristics() {

@Override
public int cardinality() {
return (int) IntStream.range(0, counts.length).filter(i -> counts[i] > 0).count();
return (int) IntStream.range(0, cells.length).filter(i -> cells[i] > 0).count();
}

@Override
public boolean add(final BitCountProducer other) {
public boolean add(final CellProducer other) {
Objects.requireNonNull(other, "other");
other.forEachCount(this::add);
other.forEachCell(this::add);
return isValid();
}

@Override
public boolean subtract(final BitCountProducer other) {
public boolean subtract(final CellProducer other) {
Objects.requireNonNull(other, "other");
other.forEachCount(this::subtract);
other.forEachCell(this::subtract);
return isValid();
}

@@ -146,23 +150,23 @@ public boolean subtract(final BitCountProducer other) {
*
* <p>The state transition to invalid is permanent.</p>
*
* <p>This implementation does not correct negative counts to zero or integer
* overflow counts to {@link Integer#MAX_VALUE}. Thus the operation that
* generated invalid counts can be reversed by using the complement of the
* original operation with the same Bloom filter. This will restore the counts
* to the state prior to the invalid operation. Counts can then be extracted
* using {@link #forEachCount(BitCountConsumer)}.</p>
* <p>This implementation does not correct negative cells to zero or integer
* overflow cells to {@link Integer#MAX_VALUE}. Thus the operation that
* generated invalid cells can be reversed by using the complement of the
* original operation with the same Bloom filter. This will restore the cells
* to the state prior to the invalid operation. Cells can then be extracted
* using {@link #forEachCell(CellConsumer)}.</p>
*/
@Override
public boolean isValid() {
return state >= 0;
}

@Override
public boolean forEachCount(final BitCountProducer.BitCountConsumer consumer) {
public boolean forEachCell(final CellProducer.CellConsumer consumer) {
Objects.requireNonNull(consumer, "consumer");
for (int i = 0; i < counts.length; i++) {
if (counts[i] != 0 && !consumer.test(i, counts[i])) {
for (int i = 0; i < cells.length; i++) {
if (cells[i] != 0 && !consumer.test(i, cells[i])) {
return false;
}
}
@@ -172,8 +176,8 @@ public boolean forEachCount(final BitCountProducer.BitCountConsumer consumer) {
@Override
public boolean forEachIndex(final IntPredicate consumer) {
Objects.requireNonNull(consumer, "consumer");
for (int i = 0; i < counts.length; i++) {
if (counts[i] != 0 && !consumer.test(i)) {
for (int i = 0; i < cells.length; i++) {
if (cells[i] != 0 && !consumer.test(i)) {
return false;
}
}
@@ -183,14 +187,14 @@ public boolean forEachIndex(final IntPredicate consumer) {
@Override
public boolean forEachBitMap(final LongPredicate consumer) {
Objects.requireNonNull(consumer, "consumer");
final int blocksm1 = BitMap.numberOfBitMaps(counts.length) - 1;
final int blocksm1 = BitMap.numberOfBitMaps(cells.length) - 1;
int i = 0;
long value;
// must break final block separate as the number of bits may not fall on the long boundary
for (int j = 0; j < blocksm1; j++) {
value = 0;
for (int k = 0; k < Long.SIZE; k++) {
if (counts[i++] != 0) {
if (cells[i++] != 0) {
value |= BitMap.getLongBit(k);
}
}
@@ -200,39 +204,39 @@ public boolean forEachBitMap(final LongPredicate consumer) {
}
// Final block
value = 0;
for (int k = 0; i < counts.length; k++) {
if (counts[i++] != 0) {
for (int k = 0; i < cells.length; k++) {
if (cells[i++] != 0) {
value |= BitMap.getLongBit(k);
}
}
return consumer.test(value);
}

/**
* Add to the count for the bit index.
* Add to the cell for the bit index.
*
* @param idx the index
* @param addend the amount to add
* @return {@code true} always.
*/
private boolean add(final int idx, final int addend) {
final int updated = counts[idx] + addend;
final int updated = cells[idx] + addend;
state |= updated;
counts[idx] = updated;
cells[idx] = updated;
return true;
}

/**
* Subtract from the count for the bit index.
* Subtract from the cell for the bit index.
*
* @param idx the index
* @param subtrahend the amount to subtract
* @return {@code true} always.
*/
private boolean subtract(final int idx, final int subtrahend) {
final int updated = counts[idx] - subtrahend;
final int updated = cells[idx] - subtrahend;
state |= updated;
counts[idx] = updated;
cells[idx] = updated;
return true;
}

@@ -243,7 +247,7 @@ public Shape getShape() {

@Override
public boolean contains(final IndexProducer indexProducer) {
return indexProducer.forEachIndex(idx -> this.counts[idx] != 0);
return indexProducer.forEachIndex(idx -> this.cells[idx] != 0);
}

@Override
@@ -253,6 +257,6 @@ public boolean contains(final BitMapProducer bitMapProducer) {

@Override
public int[] asIndexArray() {
return IntStream.range(0, counts.length).filter(i -> counts[i] > 0).toArray();
return IntStream.range(0, cells.length).filter(i -> cells[i] > 0).toArray();
}
}
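
The branch-free state flag described in the javadoc above can be illustrated in isolation. The following is a minimal sketch, assuming a hypothetical stand-alone class named `StateFlagSketch` (not part of Commons Collections) that keeps only the cell array and the OR-accumulated state. It shows that a negative cell or an integer overflow sets the sign bit of the state, that the flag never clears, and that applying the complementary operation restores the cell values.

```java
// Hypothetical illustration only; this is not the ArrayCountingBloomFilter class.
class StateFlagSketch {
    private final int[] cells;

    // Bitwise OR of every value ever written to a cell. The sign bit becomes
    // set as soon as any update produces a negative value, which covers both
    // a subtraction below zero and an int overflow on addition.
    private int state;

    StateFlagSketch(final int size) {
        cells = new int[size];
    }

    void add(final int idx, final int addend) {
        final int updated = cells[idx] + addend;
        state |= updated; // no conditional check for overflow
        cells[idx] = updated;
    }

    void subtract(final int idx, final int subtrahend) {
        final int updated = cells[idx] - subtrahend;
        state |= updated; // no conditional check for a negative cell
        cells[idx] = updated;
    }

    boolean isValid() {
        return state >= 0;
    }

    int cell(final int idx) {
        return cells[idx];
    }

    public static void main(final String[] args) {
        final StateFlagSketch f = new StateFlagSketch(4);
        f.add(1, 2);
        System.out.println(f.isValid()); // true
        f.subtract(1, 3);                // cell 1 is now -1
        System.out.println(f.isValid()); // false
        f.add(1, 3);                     // complement of the invalid operation
        System.out.println(f.cell(1));   // 2, the value before the invalid subtract
        System.out.println(f.isValid()); // false, the transition is permanent

        final StateFlagSketch g = new StateFlagSketch(1);
        g.add(0, Integer.MAX_VALUE);
        g.add(0, 1);                     // wraps to Integer.MIN_VALUE
        System.out.println(g.isValid()); // false
    }
}
```

Because the flag is only ever OR-ed, a single negative intermediate value is enough to mark the whole history as invalid, which is why the complement operation recovers the cells but not the valid state.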
Contributor

This file has lost the git history from BitCountProducer. It is showing as a new file and the old interface as deleted.

Contributor Author

This seems to be an issue with the GitHub display; local git showed the file as renamed during the commit.

@@ -19,20 +19,22 @@
import java.util.function.IntPredicate;

/**
* Defines a mapping of index to counts.
* Some Bloom filter implementations use a count rather than a bit flag. The term {@code Cell} is used to
* refer to these counts. This class is the equivalent of the index producer except that it produces a cell
* value associated with each index.
*
* <p>Note that a BitCountProducer may return duplicate indices and may be unordered.
* <p>Note that a CellProducer may return duplicate indices and may be unordered.
*
* <p>Implementations must guarantee that:
*
* <ul>
* <li>The mapping of index to counts is the combined sum of counts at each index.
* <li>The mapping of index to cells is the combined sum of cells at each index.
* <li>For every unique value produced by the IndexProducer there will be at least one matching
* index and count produced by the BitCountProducer.
* <li>The BitCountProducer will not generate indices that are not output by the IndexProducer.
* index and cell produced by the CellProducer.
* <li>The CellProducer will not generate indices that are not output by the IndexProducer.
* </ul>
*
* <p>Note that implementations that do not output duplicate indices for BitCountProducer and
* <p>Note that implementations that do not output duplicate indices for CellProducer and
* do for IndexProducer, or vice versa, are consistent if the distinct indices from each are
* the same.
*
@@ -48,47 +50,47 @@
* @since 4.5
*/
@FunctionalInterface
public interface BitCountProducer extends IndexProducer {
public interface CellProducer extends IndexProducer {

/**
* Performs the given action for each {@code <index, count>} pair where the count is non-zero.
* Performs the given action for each {@code <index, cell>} pair where the cell is non-zero.
* Any exceptions thrown by the action are relayed to the caller. The consumer is applied to each
* index-count pair, if the consumer returns {@code false} the execution is stopped, {@code false}
* index-cell pair, if the consumer returns {@code false} the execution is stopped, {@code false}
* is returned, and no further pairs are processed.
*
* Duplicate indices are not required to be aggregated. Duplicates may be output by the producer as
* noted in the class javadoc.
*
* @param consumer the action to be performed for each non-zero bit count
* @return {@code true} if all count pairs return true from consumer, {@code false} otherwise.
* @param consumer the action to be performed for each non-zero cell.
* @return {@code true} if all cells return true from consumer, {@code false} otherwise.
* @throws NullPointerException if the specified consumer is null
*/
boolean forEachCount(BitCountConsumer consumer);
boolean forEachCell(CellConsumer consumer);

/**
* The default implementation returns indices with ordering and uniqueness of {@code forEachCount()}.
* The default implementation returns indices with ordering and uniqueness of {@code forEachCell()}.
*/
@Override
default boolean forEachIndex(final IntPredicate predicate) {
return forEachCount((i, v) -> predicate.test(i));
return forEachCell((i, v) -> predicate.test(i));
}

/**
* Creates a BitCountProducer from an IndexProducer. The resulting
* producer will return every index from the IndexProducer with a count of 1.
* Creates a CellProducer from an IndexProducer. The resulting
* producer will return every index from the IndexProducer with a cell value of 1.
*
* <p>Note that the BitCountProducer does not remove duplicates. Any use of the
* BitCountProducer to create an aggregate mapping of index to counts, such as a
* CountingBloomFilter, should use the same BitCountProducer in both add and
* <p>Note that the CellProducer does not remove duplicates. Any use of the
* CellProducer to create an aggregate mapping of index to counts, such as a
* CountingBloomFilter, should use the same CellProducer in both add and
* subtract operations to maintain consistency.
* </p>
* @param idx An index producer.
* @return A BitCountProducer with the same indices as the IndexProducer.
* @return A CellProducer with the same indices as the IndexProducer.
*/
static BitCountProducer from(final IndexProducer idx) {
return new BitCountProducer() {
static CellProducer from(final IndexProducer idx) {
return new CellProducer() {
@Override
public boolean forEachCount(final BitCountConsumer consumer) {
public boolean forEachCell(final CellConsumer consumer) {
return idx.forEachIndex(i -> consumer.test(i, 1));
}

@@ -105,22 +107,22 @@ public boolean forEachIndex(final IntPredicate predicate) {
}

/**
* Represents an operation that accepts an {@code <index, count>} pair representing
* the count for a bit index. Returns {@code true}
* Represents an operation that accepts an {@code <index, cell>} pair representing
* the cell for a bit index. Returns {@code true}
* if processing should continue, {@code false} otherwise.
*
* <p>Note: This is a functional interface as a specialization of
* {@link java.util.function.BiPredicate} for {@code int}.</p>
*/
@FunctionalInterface
interface BitCountConsumer {
interface CellConsumer {
/**
* Performs an operation on the given {@code <index, cell>} pair.
*
* @param index the bit index.
* @param count the count at the specified bit index.
* @param cell the cell value at the specified bit index.
* @return {@code true} if processing should continue, {@code false} if processing should stop.
*/
boolean test(int index, int count);
boolean test(int index, int cell);
}
}
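
The contract in the javadoc above (duplicates are not removed, and the mapping of index to cells is the combined sum at each index) can be shown with a small stand-alone sketch. It assumes simplified, hypothetical copies of the `IndexProducer`, `CellProducer`, and `CellConsumer` interfaces nested in a class named `CellProducerSketch`; the helpers `fromArray` and `aggregate` are illustrative names and not part of the library.

```java
import java.util.Arrays;
import java.util.function.IntPredicate;

// Hypothetical, simplified copies of the interfaces in this diff.
class CellProducerSketch {

    @FunctionalInterface
    interface IndexProducer {
        boolean forEachIndex(IntPredicate predicate);
    }

    @FunctionalInterface
    interface CellConsumer {
        boolean test(int index, int cell);
    }

    @FunctionalInterface
    interface CellProducer extends IndexProducer {
        boolean forEachCell(CellConsumer consumer);

        @Override
        default boolean forEachIndex(final IntPredicate predicate) {
            return forEachCell((i, v) -> predicate.test(i));
        }

        // Every index from the IndexProducer is emitted with a cell value of 1.
        // Duplicate indices are passed through unchanged.
        static CellProducer from(final IndexProducer idx) {
            return consumer -> idx.forEachIndex(i -> consumer.test(i, 1));
        }
    }

    // Illustrative helper: an IndexProducer backed by a fixed list of indices.
    static IndexProducer fromArray(final int... indices) {
        return predicate -> {
            for (final int i : indices) {
                if (!predicate.test(i)) {
                    return false;
                }
            }
            return true;
        };
    }

    // Illustrative helper: sums the cell values per index, as a counting
    // filter would when adding a CellProducer.
    static int[] aggregate(final CellProducer producer, final int numberOfBits) {
        final int[] cells = new int[numberOfBits];
        producer.forEachCell((i, v) -> {
            cells[i] += v;
            return true;
        });
        return cells;
    }

    public static void main(final String[] args) {
        final CellProducer producer = CellProducer.from(fromArray(1, 3, 3));
        // Index 3 appears twice, so its aggregated cell is 2.
        System.out.println(Arrays.toString(aggregate(producer, 5))); // [0, 1, 0, 2, 0]
    }
}
```

Since the duplicate index contributes twice to the aggregate, subtracting with a producer that de-duplicates indices would leave a residual count, which is why the javadoc advises using the same producer for both add and subtract.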