Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[COLLECTIONS-857] update documentation #507

Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 15 additions & 1 deletion pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -170,6 +170,8 @@
<commons.jacoco.branchRatio>0.78</commons.jacoco.branchRatio>
<commons.jacoco.lineRatio>0.85</commons.jacoco.lineRatio>
<commons.jacoco.complexityRatio>0.78</commons.jacoco.complexityRatio>
<doxia.module.markdown.version>1.12.0</doxia.module.markdown.version>
<math.mathjax.version>2.7.2</math.mathjax.version>
</properties>

<build>
Expand Down Expand Up @@ -267,8 +269,21 @@
<artifactId>maven-javadoc-plugin</artifactId>
<configuration>
<source>8</source>
<!-- Enable MathJax -->
<additionalOptions>-Xdoclint:all --allow-script-in-comments -header '&lt;script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/${math.mathjax.version}/MathJax.js?config=TeX-AMS-MML_HTMLorMML"&gt;&lt;/script&gt;'</additionalOptions>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-site-plugin</artifactId>
<dependencies>
<dependency>
<groupId>org.apache.maven.doxia</groupId>
<artifactId>doxia-module-markdown</artifactId>
<version>${doxia.module.markdown.version}</version>
</dependency>
</dependencies>
</plugin>
</plugins>
</build>

Expand Down Expand Up @@ -812,4 +827,3 @@
</contributors>

</project>

Original file line number Diff line number Diff line change
Expand Up @@ -85,7 +85,7 @@ private static long toLong(final byte[] byteArray, final int offset, final int l
*</p>
* <ol>
* <li>If there is an odd number of bytes the excess byte is assigned to the increment value</li>
* <li>The bytes alloted are read in big-endian order any byte not populated is set to zero.</li>
* <li>The bytes allotted are read in big-endian order any byte not populated is set to zero.</li>
* </ol>
* <p>
* This ensures that small arrays generate the largest possible increment and initial values.
Expand Down
25 changes: 25 additions & 0 deletions src/site/markdown/bloomFilters.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Bloom filters

Bloom filters were added to commons collections in version 4.5.

The documentation comprises four parts:
* [An introduction](bloomFilters/intro.html) to Bloom filters.
* The [Commons Collections implementations](bloomFilters/implementation.html).
* [Unusual usage and advanced implementations](bloomFilters/advanced.html).
* Using [Bloom filters for indexing](bloomFilters/multidimensional.html).
324 changes: 324 additions & 0 deletions src/site/markdown/bloomFilters/advanced.md

Large diffs are not rendered by default.

235 changes: 235 additions & 0 deletions src/site/markdown/bloomFilters/implementation.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,235 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Bloom Filters Part 2: Apache Commons CollectionsⓇ Bloom Filter Implementation

Previously I wrote about Bloom filters, what they are, why we use them, and what statistics they can provide. In this post I am going to cover the Apache Commons Collections Bloom filter implementation in Java and explain the logic behind some of the design decisions.

We need to realize that there are several pain points in Bloom filter usage. First, the amount of time it takes to create the number of hashes needed for each filter. Second, the trade-off between storing the bits as a bit vector (which for sparsely populated filters can be wasteful) or storing the index of the enabled bits (which for densely populated filters can be very large). Adjacent to the internal storage question is how to serialize the Bloom filter. Consideration of these pain points drove design decisions in the development of the Apache Commons Collections Bloom filter implementation.

## Shape

Last time I spoke of the shape of the filter. In Commons Collections, `Shape` is a class that defines the number of bits in the filter (`m`) and the number of hashes (`k`) used for each object insertion.

Bloom filters with the same shape can be merged and compared. The code does not verify that the shape is the same, as the comparison would increase processing time. However, every Bloom filter has a reference to its `Shape` and the test can be performed if desired.

The `Shape` is constructed from one of five combinations of parameters specified in `fromXX` methods:
* `fromKM` - when the number of hash functions and the number of bits is known.
* `fromNM` - when the number of items and the number of bits is known.
* `fromNMK` - when the number of items, the number of bits and the number of hash functions is known. In this case the number of items is used to verify that the probability is within range.
* `fromNP` - when the number of items and the probability of false positives is known.
* `fromPMK` - when the probability of false positives, number of bits, and number of hash functions is known. In this case the probability is used to verify that the maximum number of elements will result in a probability that is valid.

## The Extractors

The question of how to efficiently externally represent the internal representation of a Bloom filter led to the development of two “extractors”. One that logically represents an ordered collection of BitMap longs, and the other that represents a collection of enabled bits indices. In both cases the extractor feeds each value to a function - called a predicate - that does some operation with the value and returns true to continue processing, or false to stop processing.

The extractors allow implementation of different internal storage as long as implementation can produce one or both of the standard producers. Each producer has a static method to convert from the other type so only one implementation is required of the internal storage.

### BitMapExtractor

The BitMap extractor is an interface that defines objects that can produce BitMap vectors. BitMap vectors are vectors of long values where the bits are enabled as per the `BitMap` class. All Bloom filters implement this interface. The `BitMapExtractor` has static methods to create a producer from an array of long values, as well as from an `IndexExtractor`. The interface has default implementations that convert the extractor into an array of long values, and one that executes a LongPredicate for each BitMap.

The method `processBitMaps(LongPredicate)` is the standard access method for the extractor. Each long value in the BitMap is passed to the predicate in order. Processing continues until the last BitMap is processed or the `LongPredicate` returns false.

### IndexExtractor

The index extractor produces the index values for the enabled bits in a Bloom filter. All Bloom filters implement this interface. The `IndexExtractor` has static methods to create the extractor from an array of integers or a `BitMapExtractor`. The interface also has default implementations to convert the extractor to an array of integers and one that executes an `IntPredicate` on each index.

The method `processIndices(IntPredicate)` is the standard access method for the extractor. Each int value in the extractor is passed to the predicate. Processing continues until the last int is processed or the `IntPredicate` returns false. The order and uniqueness of the values is not guaranteed.

## The Hashers

### Hasher

The `Hasher` interface defines a method that accepts a `Shape` and returns an `IndexrExtractor`. Calling `hasher.indices(shape)` will yield an `IndexExtractor` that will produce the `shape.numberOfHashFunctions()` integers.

The `Hasher` provides a clean separation between new data being hashed and the Bloom filter. It also provides a mechanism to add the same item to multiple Bloom filters even if they have different shapes. This can be an important consideration when using Bloom filters in multiple systems with different requirements.

Any current hashing process used by an existing Bloom filter implementation can be duplicated by a hasher. The hasher only needs to create an `IndexExtractor` for an arbitrary shape. In fact, the testing code contains `Hasher` implementations that produce sequential values.

### EnhancedDoubleHasher

In the previous post I described a hashing technique where the hash value is split into two values. One is used for the initial value, the other is used to increment the values as additional hashes are required. The Commons Collections implementation adds code to ensure that the increment is changed by a tetrahedral value on every iteration. This change addresses issues with the incrementer initially being zero or very small.

The constructor accepts either two longs or a byte array from a hashing function. Users are expected to select a hashing algorithm to create an initial hash for the item. The `EnhancedDoubleHasher` is then constructed from that hash value. In many cases the Murmur3 hash, available in the Apache Commons CodecⓇ library, is sufficient and very fast. The result is a `Hasher` that represents the hashed item. This hasher can now be passed to any Bloom filter and the `Shape` of the filter will determine the number of hashes created.

## The Bloom Filters

### Bloom Filter

Now we come to the Bloom filter. As noted above the `BloomFilter` interface implements both the `IndexExtractor` and the `BitMapExtractor`. The two extractors are the external representations of the internal representation of the Bloom filter. Bloom filter implementations are free to store bit values in any way deemed fit. The requirements for bit value storage are:
* Must be able to produce an IndexExtractor.
* Must be able to produce a BitMapExtractor.
* Must be able to clear the values. Reset the cardinality to zero.
* Must specify if the filter prefers (is faster creating) the IndexExtractor (Sparse characteristic) or the BitMapExtractor (Not sparse characteristic).
* Must be able to merge hashers, Bloom filters, index extractors, and BitMap extractors. When handling extractors it is often the case that an implementation will convert one type of extractor to the other for merging. The BloomFilter interface has default implementations for Bloom filter and hasher merging.
* Must be able to determine if the filter contains hashers, Bloom filters, index extractors, and BitMap extractors. The BloomFilter interface has default implementations for BitMap extractor, Bloom filter and hasher checking.
* Must be able to make a deep copy of itself.
* Must be able to produce its `Shape`.

Several implementations of Bloom filter are provided. We will start by focusing on the `SimpleBloomFilter` and the `SparseBloomFilter`.

### SimpleBloomFilter

The `SimpleBloomFilter` implements its storage as an on-heap array of longs. This implementation is suitable for many applications. It can also serve as a good model for how to implement `BitMap` storage in specific off-heap situations.

### SparseBloomFilter
The `SparseBloomFilter` implements its storage as a TreeSet of Integers. Since each bit is randomly selected across the entire [0,shape.numberOfBits) range the Sparse Bloom filter only makes sense when \\( \frac{2 \times m}{64} \lt k \times n \\). The reasoning being that every BitMap in the array is equivalent to 2 integers in the sparse filter. The number of integers is the number of hash functions times the number of items – this is an overestimation due to the number of hash collisions that will occur over the number of bits.

### Other

The helper classes included in the package make it easy to implement new Bloom filters, for example it should be fairly simple to implement a Bloom filter on a `ByteBuffer`, or `LongBuffer`, or one that uses a compressed bit vector.

## Common Usage Patterns

### Populating a Bloom filter

In most cases a shape is determined; in this example it is determined by the number of items to go in and the acceptable false positive rate, and then a number of items are added to the filter.

```java
import org.apache.commons.codec.digest.MurmurHash3;
import org.apache.commons.collections4.bloomfilter.BloomFilter;
import org.apache.commons.collections4.bloomfilter.EnhancedDoubleHasher;
import org.apache.commons.collections4.bloomfilter.Shape;
import org.apache.commons.lang3.SerializationUtils;

public class PopulatingExample {

/**
* Populates a Bloom filter with the items. Uses a shape that
* expects 10K items max and a false positive rate of 0.01.
* @param items The items to insert.
* @return the Bloom filter populated with the items.
*/
public static BloomFilter populate(Object[] items) {
populate(Shape.fromNP(10000, 0.01), items);
}

/**
* Populates a Bloom filter with the items.
* @param shape The shape of the Bloom filter.
* @param items The items to insert.
* @return the Bloom filter populated with the items.
*/
public static BloomFilter populate(Shape shape, Object[] items) {
BloomFilter collection = new SimpleBloomFilter(shape);
for (Object o : items) {
// this example serializes the entire object, actual implementation
// may want to serialize specific properties into the hash.
byte[] bytes = SerializationUtils.serialize(o);
long[] hash = MurmurHash3.hash128(bytes);
collection.merge(new EnhancedDoubleHasher(hash[0], hash[1]));
}
// collection now contains all the items from the list of items
return collection;
}
}
```

### Searching a Bloom filter
When searching a single Bloom filter, it makes sense to use the `Hasher` to search in order to reduce the overhead of creating a Bloom filter for the match.

```java
import org.apache.commons.codec.digest.MurmurHash3;
import org.apache.commons.collections4.bloomfilter.BloomFilter;
import org.apache.commons.collections4.bloomfilter.EnhancedDoubleHasher;
import org.apache.commons.lang3.SerializationUtils;

public class SearchingExample {

private Bloomfilter collection;

/**
* Creates the example from a populated collection.
* @param collection the collection to search.
*/
public SearchingExample(BloomFilter collection) {
this.collection = collection;
}

/**
* Perform the search.
* @param item the item to look for.
* @return true if the item is found, false otherwise.
*/
public boolean search(Object item) {
// create the hasher to look for.=
byte[] bytes = SerializationUtils.serialize(o);
long[] hash = MurmurHash3.hash128(bytes);
return collection.contains(new EnhancedDoubleHasher(hash[0], hash[1]));
}
}
```

However, if multiple filters are being checked, and they are all the same shape, then creating a Bloom filter from the hasher and using that is more efficient.

```java
import org.apache.commons.codec.digest.MurmurHash3;
import org.apache.commons.collections4.bloomfilter.BloomFilter;
import org.apache.commons.collections4.bloomfilter.EnhancedDoubleHasher;
import org.apache.commons.collections4.bloomfilter.Shape;
import org.apache.commons.lang3.SerializationUtils;

public abstract class SearchingExample2 {

private BloomFilter[] collections;

/**
* Create example from an array of populated Bloom filters that all have the same Shape.
* @param collections The Bloom filters to search.
*/
public SearchingExample2(BloomFilter[] collections) {
this.collections = collections;
}

/**
* Search the filters for matching items.
* @param item the Item t o search for.
*/
public void search(Object item) {
// create the Bloom filter to search for.
byte[] bytes = SerializationUtils.serialize(o);
long[] hash = MurmurHash3.hash128(bytes);
BloomFilter target = new SimpleBloomFilter(collectons[0].getShape());
target.merge(new EnhancedDoubleHasher(hash[0], hash[1]));

for (BloomFilter candidate : candidates) {
if (candidate.contains(target)) {
doSomethingInteresting(candidate, item);
}
}
}

/**
* The interesting thing to do when a match occurs.
* @param collection the Bloom filter that matched.
* @param item The item that was being searched for.
*/
abstract public void doSomethingInteresting(BloomFilter collection, Object item);

}
```

If multiple filters of different shapes are being checked then use the `Hasher` to perform the check. It will be more efficient than building a Bloom filter for each `Shape`.

## Statistics

The statistics that I mentioned in the previous bloom post are implemented in the `SetOperations` class. The methods in this class accept `BitMapExtractor` arguments. They can be called with `BloomFilter` implementations or with `BitMapExtractor` implementations generated from arrays of longs or `IndexExtractors` or any other valid `BitMapExtractor` instance.

## Review

In this post we investigated the Apache Commons Collections implementation of Bloom filters and how to use them. We touched on how they can be used to implement new designs of the standard components. In the next post we will cover some of the more unusual usages and introduce some unusual implementations.
Loading
Loading