Releases: apache/datasketches-java
0.8.2 Nov 14, 2016: New HLL-based UniqueCountMap sketch; Bug Fixes
New HLL-based UniqueCountMap sketch
This is an entirely new sketch in the HLL package that addresses real-time unique counting of identifiers associated with millions of keys. Please refer to the javadocs for the UniqueCountMap sketch class in the sketches/hll package.
Fixed SerDe Compatibility With Shaded, Reallocated sketches-core.jar for Pig and Hive
The previous scheme of creating a hash ID from the SerDe class names to prevent accidental deserialization with the wrong SerDe class was fragile and failed when the core classes were shaded and reallocated for the Pig and Hive jars. In our attempt to protect the user from themselves, we had inadvertently created a worse problem.
This has now been fixed, but to do that we had to abandon the SerDe ID concept entirely. This fix is backward compatible with all earlier releases of the SerDe classes.
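The fragility described above can be illustrated with a minimal sketch. The class names and the typeId helper below are hypothetical, and String.hashCode() is only a stand-in for whatever hash the earlier releases actually used; the point is simply that any ID derived from a fully qualified class name changes when shading relocates the package.

```java
// Illustrative only: these names are hypothetical, not the library's classes.
final class SerDeIdDemo {

  // Stand-in for an ID derived from a fully qualified class name.
  static int typeId(String fullyQualifiedClassName) {
    return fullyQualifiedClassName.hashCode();
  }

  public static void main(String[] args) {
    String original = "org.example.sketches.ArrayOfStringsSerDe";
    // Shading relocates the package, so the same class gets a new name.
    String relocated = "org.example.shaded.sketches.ArrayOfStringsSerDe";

    // An ID written by the original class is compared with the ID computed
    // by the relocated class; a mismatch would reject valid serialized data.
    System.out.println("IDs match: " + (typeId(original) == typeId(relocated)));
  }
}
```

A reader that compares the stored ID against its own computed ID would refuse data written by the unshaded class, even though the byte layout is identical.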
Upgraded Reservoir Sampling to allow for full integer precision values of K.
This allows the sketch size specification, K, for the Reservoir sketches to have full integer precision. This is also backward compatible with the earlier specification.
0.8.1 Oct 6, 2016: Fixed pom.xml bug
Soon after we released 0.8.0, we discovered that the shaded jar was misnamed as sketches-core.0.8.0-with-shaded-core.jar instead of sketches-core.0.8.0-with-shaded-memory.jar. In addition, both the core libraries and the memory package were shaded internally, which was not the intent: only the memory package should have been shaded. Both problems are fixed in this release.
The impact of the bug was relatively low as it would only be detected if you installed the shaded jar and tried to reference the core sketches. All the other jars are fine.
0.8.0 Oct 5, 2016: Modular release structure, reservoir sampling, and more...
Modular Release Structure
Because the Memory package has many applications beyond just the DataSketches library, it made sense to separate it out into its own module. The jars for the Memory package will appear as
memory-X.Y.Z-<type>.jar.
The remainder of the library is its own module and will appear as usual as
sketches-core-X.Y.Z-<type>.jar, but with a dependency on the memory jar.
In addition to the usual jar types, there is an additional
sketches-core-X.Y.Z-with-shaded-memory.jar, which contains the
sketches-core-X.Y.Z.jar and a shaded, renamed memory-X.Y.Z.jar.
This shading allows protection from the "DLL Hell" situation where there may be a different version of the memory jar registered in the same system.
New Sampling Package
We have now added a sampling package into the suite of different types of sketch algorithms.
The first entry in this area is an efficient implementation of the classical reservoir sampling algorithm that is often used as an interview question. However, this implementation is quite a bit more sophisticated in that it also solves the more complex problem of merging sketches of different sizes. It includes a base implementation using longs (more as a tutorial example) as well as a Java Generic version that can be extended to any type, including polymorphic types. As with all the other sketches in the library, the challenges of efficient serialization and deserialization have also been addressed.
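For readers unfamiliar with the classical algorithm, here is a minimal sketch of textbook reservoir sampling (Algorithm R) over longs. It is not the library's implementation: the SimpleReservoir class and its methods are hypothetical names, and merging and serialization are deliberately left out.

```java
import java.util.Arrays;
import java.util.Random;

// Minimal sketch of classical reservoir sampling (Algorithm R) over longs.
// Merging and serialization, which the library's sketches handle, are omitted.
final class SimpleReservoir {
  private final long[] samples;
  private final Random random;
  private long itemsSeen = 0;

  SimpleReservoir(int k, long seed) {
    if (k < 1) {
      throw new IllegalArgumentException("k must be at least 1");
    }
    samples = new long[k];
    random = new Random(seed);
  }

  void update(long item) {
    itemsSeen++;
    if (itemsSeen <= samples.length) {
      // Fill phase: the first k items are always kept.
      samples[(int) (itemsSeen - 1)] = item;
    } else {
      // Draw a slot in [0, itemsSeen); the item is kept only if the slot falls
      // inside the reservoir, which happens with probability k / itemsSeen.
      long slot = (long) (random.nextDouble() * itemsSeen);
      if (slot < samples.length) {
        samples[(int) slot] = item;
      }
    }
  }

  // Returns a copy of the samples currently held (at most k of them).
  long[] getSamples() {
    int n = (int) Math.min(itemsSeen, samples.length);
    return Arrays.copyOf(samples, n);
  }

  public static void main(String[] args) {
    SimpleReservoir reservoir = new SimpleReservoir(10, 1L);
    for (long i = 0; i < 1000; i++) {
      reservoir.update(i);
    }
    System.out.println(Arrays.toString(reservoir.getSamples()));
  }
}
```

Each item in the stream ends up in the reservoir with equal probability. Merging two such reservoirs with different k is the harder problem that the release note refers to.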
There are a number of exciting ways this sampling package can grow.
Memory Package Enhancements
The Memory package is used extensively in the library for off-heap work and a number of groups have shown a lot of interest in using it more broadly. In this release we have extended the API to include read-only variants of the Memory classes in the same way as the ByteBuffer classes. It also has been extended to allow direct access to the Unsafe class in those situations where the utmost in performance is required. Examples of the emerging use of this capability can be found in the PreambleUtils class in the quantiles package. Caution is advised when using this package, as it is easy to "shoot yourself in the foot"! Caveat Emptor!
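The ByteBuffer pattern that the read-only variants follow can be sketched with the JDK alone. This example uses only java.nio.ByteBuffer and does not show the Memory package's own API; the ReadOnlyViewDemo class is a hypothetical name for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ReadOnlyBufferException;

// Shows the JDK's read-only view pattern; the Memory API itself is not used here.
final class ReadOnlyViewDemo {

  // Returns true if a write through a read-only view of the buffer is rejected.
  static boolean writeRejected(ByteBuffer writable) {
    ByteBuffer readOnly = writable.asReadOnlyBuffer();
    try {
      readOnly.putLong(0, 42L);
      return false;
    } catch (ReadOnlyBufferException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    ByteBuffer buffer = ByteBuffer.allocate(16);
    buffer.putLong(0, 7L);

    // The view shares content with the original, so it sees the value 7.
    ByteBuffer view = buffer.asReadOnlyBuffer();
    System.out.println("value seen through view: " + view.getLong(0));
    System.out.println("write rejected: " + writeRejected(buffer));
  }
}
```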
New PairwiseSetOperations
The new PairwiseSetOperations class fills the need for fast set operations on exactly two arguments. These operations are stateless and are specifically optimized for Theta Sketches that are already in ordered, CompactSketch form.
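To show why ordered inputs make a stateless two-argument operation fast, here is a minimal sketch of a single-pass intersection of two sorted arrays of hash values. It is not the PairwiseSetOperations implementation: the PairwiseDemo class is a hypothetical name, and a real Theta Sketch intersection must also account for theta, which is omitted here.

```java
import java.util.Arrays;

// Single-pass intersection of two sorted arrays; illustrative only.
final class PairwiseDemo {

  // Both inputs must be sorted ascending and contain no duplicates.
  static long[] intersect(long[] a, long[] b) {
    long[] out = new long[Math.min(a.length, b.length)];
    int i = 0;
    int j = 0;
    int n = 0;
    while (i < a.length && j < b.length) {
      if (a[i] < b[j]) {
        i++;                 // a[i] cannot appear later in b
      } else if (a[i] > b[j]) {
        j++;                 // b[j] cannot appear later in a
      } else {
        out[n++] = a[i];     // common value
        i++;
        j++;
      }
    }
    return Arrays.copyOf(out, n);
  }

  public static void main(String[] args) {
    long[] a = {1L, 3L, 5L, 7L};
    long[] b = {3L, 4L, 5L, 8L};
    System.out.println(Arrays.toString(intersect(a, b)));
  }
}
```

Because the method keeps no state between calls and visits each element at most once, its cost is linear in the combined input size.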
0.7.0 Jul 29, 2016: SerDe API changes; Added MemoryMappedFile
Quantiles and FrequentItems Sketches:
- API changes to ArrayOfDoublesSerDe, ArrayOfItemsSerDe, ArrayOfLongsSerDe, ArrayOfStringsSerDe, ArrayOfUtf16StringsSerDe
- Changed the SerDe TYPE, which was a static final constant that had to be assigned by the class developer and was quite fragile from a maintenance point of view.
- TYPE has been replaced by a more robust TypeID, which is automatically assigned using hashCode() based on the SerDe class name and can be overridden by the SerDe class developer.
This is used to detect incorrect SerDe instances. Refer to the Javadocs and code documentation of ArrayOfItemsSerDe.
- Corrected a downsampling problem in quantiles/Union: if you set k to a lower value, it was ignored on the first update with a sketch that had a larger k.
All Sketches:
Changed the leaf-node classes to final. These classes are not designed to be extended.
Static analyzers
- Ran FindBugs, which found a few minor coding issues, which have been corrected.
- Running PMD and Checkstyle found a lot of "style" issues. All of the style issues that I agree with have been corrected.
- The /tools directory has a FindBugsExcludeFilter.xml and a SketchesCheckstyle.xml that you can use to run these static checkers yourself.
- They are not yet integrated into the maven pom.xml.
Memory Package
- Added MemoryMappedFile to the Memory package.
- This capability maps a file, which can be larger than 2GB, into native memory and can write any changes back to that file.
- It is similar to, but much simpler than, Java's FileChannel.map() method, which returns a MappedByteBuffer and is restricted to files less than 2GB.
- This capability should be considered experimental and is not thread safe.
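For comparison, the JDK route mentioned above can be sketched as follows. This uses only standard java.nio classes and not MemoryMappedFile itself; the MappedFileDemo class and its roundTrip method are hypothetical names for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Maps a small temporary file with the JDK's FileChannel.map(); illustrative only.
final class MappedFileDemo {

  // Writes a long through the mapping, forces it to disk, and reads it back from the file.
  static long roundTrip(long value) {
    try {
      Path file = Files.createTempFile("mapped-demo", ".bin");
      file.toFile().deleteOnExit();
      try (FileChannel channel = FileChannel.open(
          file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
        // The size parameter is a long, but map() rejects sizes above
        // Integer.MAX_VALUE, which is the 2GB restriction noted above.
        MappedByteBuffer mapped =
            channel.map(FileChannel.MapMode.READ_WRITE, 0, Long.BYTES);
        mapped.putLong(0, value);
        mapped.force(); // write the change back to the file
      }
      return ByteBuffer.wrap(Files.readAllBytes(file)).getLong(0);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip(123456789L));
  }
}
```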
0.6.0 Jun 29, 2016: Generic Quantiles
Major Additions
Generic Quantiles Sketch
Any object that can be compared using a supplied comparator (or one of the default native comparators) can be processed using the new generic version of Quantiles.
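The role of the supplied comparator can be illustrated with an exact, non-sketch computation. The ExactQuantileDemo class below is a hypothetical helper, not part of the library: it sorts the full input, which is exactly the cost a quantiles sketch avoids by approximating the answer in bounded space.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Exact quantile by full sort, for any type with a supplied comparator.
// A quantiles sketch approximates this result without retaining every item.
final class ExactQuantileDemo {

  static <T> T quantile(List<T> items, Comparator<? super T> comparator, double fraction) {
    if (items.isEmpty() || fraction < 0.0 || fraction > 1.0) {
      throw new IllegalArgumentException("need items and a fraction in [0, 1]");
    }
    List<T> sorted = new ArrayList<>(items);
    sorted.sort(comparator);
    // Clamp so that fraction = 1.0 returns the largest item.
    int index = (int) Math.min(sorted.size() - 1L, (long) (fraction * sorted.size()));
    return sorted.get(index);
  }

  public static void main(String[] args) {
    List<String> words = Arrays.asList("banana", "fig", "apple", "pear");
    Comparator<String> byLength = Comparator.comparingInt(String::length);
    System.out.println(quantile(words, byLength, 0.5));
  }
}
```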
Code Improvements in Frequent Items and Quantiles Sketches
Class name consistency
The original names were long and were redundant with the sketch package:
- FrequentItemsSketch in the frequencies package
- ItemsQuantilesSketch in the quantiles package
These classes have been renamed to remove redundancy and to be more consistent with class names in the Theta and Tuple packages. This is a one-time change that will require users to update their code base to the new names. For example:
- quantiles package:
  - ItemsQuantilesSketch -> ItemsSketch
  - DoublesQuantilesSketch -> DoublesSketch
  - DoublesQuantilesSketchBuilder -> DoublesSketchBuilder
- frequencies package:
  - FrequentItemsSketch -> ItemsSketch
  - FrequentLongsSketch -> LongsSketch
Similar changes are reflected in the names of the test classes.
Binary Storage Improvements for quantiles/DoublesSketch
The new storage structure is about half the size, on average, so sketches will be smaller to store and faster to merge, serialize, and deserialize. This is also a one-time change, and the new version cannot read the old binary format. Hopefully we have caught this early enough so that users don't have many sketches stored in the previous format.
Restructuring
The frequencies/ArrayOf<Type>SerDe classes are useful for the generic versions of both FrequentItems and Quantiles sketches and so have been promoted to the sketches package.
Javadocs and code formatting
This version includes a number of improvements to the javadocs and code to make it easier to read. This is an ongoing process.
External Contributions. Thank you!
George Kankava of DevFactory suggested creating dedicated RuntimeException classes for the library. This has been implemented, so systems that use the library can catch all exceptions thrown by the library classes as SketchesException. George also found a few binary OR operations (|) that should have been implemented as logical ORs (||). These have been fixed.
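Both contributions can be illustrated with a minimal sketch. The exception classes below are hypothetical stand-ins (the library's actual subclasses are not listed in these notes), and the counter exists only to show that | evaluates both operands while || stops after the first true one.

```java
// Illustrative only: DemoSketchesException and DemoArgumentException are
// hypothetical stand-ins for a library-wide base exception and one subclass.
final class ExceptionAndOrDemo {

  static class DemoSketchesException extends RuntimeException {
    DemoSketchesException(String message) {
      super(message);
    }
  }

  static class DemoArgumentException extends DemoSketchesException {
    DemoArgumentException(String message) {
      super(message);
    }
  }

  private static int evaluations = 0;

  private static boolean check(boolean result) {
    evaluations++;
    return result;
  }

  // Counts how many operands are evaluated by || versus |.
  static int countEvaluations(boolean shortCircuit) {
    evaluations = 0;
    boolean ignored = shortCircuit
        ? (check(true) || check(true))  // second operand is skipped
        : (check(true) | check(true));  // both operands are evaluated
    return evaluations;
  }

  // A caller can handle every library exception through the base type.
  static String catchAsBase() {
    try {
      throw new DemoArgumentException("k must be positive");
    } catch (DemoSketchesException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println("operands evaluated by ||: " + countEvaluations(true));
    System.out.println("operands evaluated by |: " + countEvaluations(false));
    System.out.println("caught via base type: " + catchAsBase());
  }
}
```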
0.5.2 May 23, 2016: UTF-16 Strings
- Added ArrayOfUtf16StringsSerDe
- Added char[] input capability to MurmurHash3
0.5.1 May 9, 2016: Bounds on ratios in Theta Sketched sets
- Added BoundsOnRatiosInThetaSketchedSets
0.5.0 May 3, 2016: Frequent Items
- Introduction of the FREQUENCY family of sketches
- JDK7 support has been removed. JDK8 is now required.
- Added BoundsOnBinomialProportions and BoundsOnRatiosInSampledSets
- Numerous updates to Javadocs, code docs, and a few minor API changes.
0.4.1 Apr 6, 2016: Javadoc Fixes
- Mostly added missing javadocs
0.4.0 Mar 4, 2016: Tuple Sketches
- Introduction of Tuple Family of sketches
- Promoted ResizeFactor to sketches package
- Renamed BinomialBounds to BinomialBoundsN
- Merged theta/DirectUnion and theta/HeapUnion into theta/UnionImpl