Releases: apache/datasketches-java

0.11.0 Mar 15, 2018: KLL quantiles sketch, tuple sketch API change and more

16 Mar 02:20
  • New KLL sketch: KllFloatsSketch:
    • This is a new quantiles sketch with better accuracy per stored bit than the original quantiles DoublesSketch. If you select a value of K for the KLL sketch that matches the accuracy of the DoublesSketch, K will be larger, but the space required will be much smaller. This sketch is tuned specifically for the smallest possible space usage (near the theoretical optimum) and uses floats rather than doubles. On update, the new KLL sketch is a little faster than the original DoublesSketch, but it may be slower on merge. The KLL sketch does not currently have a generic version (as the DoublesSketch does), nor does it provide off-heap capability like the DoublesSketch. Refer to the javadocs for a link to the KLL theoretical paper.
  • Tuple:
    • generic sketch API change
      • removed the convention requiring static methods with a certain signature; these methods are now based on a more visible API
      • added SummaryDeserializer
      • The need to serialize factories has been removed
      • removed getSummaries() method - use iterator instead
  • Theta:
    • added new SingleItemSketch - fast way to create sketches with a single input item
  • Original quantiles sketch enhancements:
    • added getRank() - faster than getCDF() with one split point
    • empty sketch returns null from getQuantiles(), getPMF() and getCDF()
    • empty sketch returns NaN from getQuantile(), getMinValue() and getMaxValue()
    • Kolmogorov-Smirnov Statistic between two quantiles sketches
    • fixed sorting using comparator in generic ItemsSketch
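
For readers unfamiliar with the Kolmogorov-Smirnov statistic mentioned above, the following is a minimal sketch of the exact two-sample statistic computed on raw data, using only the JDK. It is not the library's implementation, which works on quantiles sketches; the class name `KsExample`, the method `ksStatistic`, and the choice to return NaN for an empty input are assumptions made for illustration.

```java
import java.util.Arrays;

// Illustrative only: exact two-sample Kolmogorov-Smirnov statistic on raw arrays.
// The names here are hypothetical and are not part of the sketches library.
final class KsExample {

  /**
   * Returns the maximum absolute difference between the empirical CDFs of a and b.
   * Assumes neither array contains NaN. Returns NaN if either input is empty
   * (a convention chosen for this example).
   */
  static double ksStatistic(double[] a, double[] b) {
    if (a.length == 0 || b.length == 0) {
      return Double.NaN;
    }
    double[] x = a.clone();
    double[] y = b.clone();
    Arrays.sort(x);
    Arrays.sort(y);
    int i = 0;
    int j = 0;
    double maxDiff = 0.0;
    // Once one sample is exhausted its CDF is 1 and the gap can only shrink,
    // so the loop may stop as soon as either index reaches the end.
    while (i < x.length && j < y.length) {
      double v = Math.min(x[i], y[j]);
      // Step past every value equal to v in both samples before comparing CDFs.
      while (i < x.length && x[i] == v) { i++; }
      while (j < y.length && y[j] == v) { j++; }
      double cdfA = (double) i / x.length;
      double cdfB = (double) j / y.length;
      maxDiff = Math.max(maxDiff, Math.abs(cdfA - cdfB));
    }
    return maxDiff;
  }

  public static void main(String[] args) {
    double[] a = {1, 2, 3, 4};
    double[] b = {3, 4, 5, 6};
    System.out.println(ksStatistic(a, b)); // prints 0.5
  }
}
```

A value near 0 suggests the two samples follow similar distributions, while a value near 1 indicates they barely overlap.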

0.10.3 Oct 26, 2017: Theta backward compatibility

27 Oct 00:21

Theta sketch: As part of the resize factor serialization fix in version 0.10.2, a validation check was added that made it impossible to deserialize an UpdateSketch or Union serialized using sketches-core-0.8.4 and above. This release addresses the issue.

0.10.2 Oct 20, 2017: Theta, HLL bug fixes

20 Oct 22:17
  • Theta:
    • Fixed bug in HeapUpdateSketch.toByteArray() that did not set the resize factor
    • Added getFamily() to all Set Operations. Any user-defined subclasses of SetOperations will need to implement this method.
  • HLL:
    • Fixed HLL Union conversion to HLL_4 bug
    • Made isSameResource() public

0.10.1 Sep 7, 2017: HLL Sketch Extended for Off-Heap Operation

08 Sep 01:41
  • This release extends the prior HLL release 0.10.0 to also allow the HLL sketch to operate off-heap, leveraging the new Memory package (located in the DataSketches/Memory repository). This capability is critical for large systems that must manage millions of sketches as updatable fields located in off-heap (native) memory. The other sketches in the library that also enable this off-heap operation include the Theta sketch and the Quantiles sketch.

0.10.0 Jun 16, 2017: New Memory, new HLL, new weighted sampling

16 Jun 19:16
  • The Memory package, which is used extensively throughout the DataSketches library, has been completely rewritten and moved to its own repository.
    • The new Memory package now leverages Closeable and, when used with try-with-resources blocks, eliminates the need to explicitly close() resources external to the JVM (e.g., memory-mapped files and off-heap memory allocations). This completely replaces the freeMemory() requirements of the prior Memory implementation.
    • The API has been streamlined to allow simpler creation of regions (like ByteBuffer slices), which are views of the same underlying resource.
    • The internal architecture has been redesigned to eliminate redundancy and to separate more cleanly the management of resources (off-heap memory, memory-mapped files, wrapped ByteBuffers and wrapped primitive arrays) from the specifics of the API implementation.
    • Currently there are two API implementations: Memory, which provides direct-addressed, primitive (and primitive array) access, and Buffer, which provides a relative positional interface for primitive (and primitive array) access.
    • This has required some API changes when using the Memory package. For example, instead of new NativeMemory(bytes), use Memory.wrap(bytes) or WritableMemory.wrap(bytes). Watch the distinction between the read-only wrap methods, which take Memory, and the updatable wrap methods, which take WritableMemory. Attempts to modify read-only objects will throw SketchesReadOnlyException.
  • Completely rewritten HLL sketches with improved speed and accuracy.
    • The prior version of HLL had some performance, usability and design issues that were problematic. In addition, our science team has developed some more advanced estimators that dramatically improve the accuracy of the HLL sketches, especially in the low-range. We decided that the best route was to redesign the HLL sketches from scratch.
  • Added weighted sampling sketch
    • VarOptItemsSketch creates a random sample of weighted items from a stream, with the inclusion probability approximately a function of the item's weight. The sketch can additionally apply a predicate to the sampled items to compute sums of weights over the subset, along with error bounds.
  • Added support for subset sums with error bounds to Reservoir sampling
    • Mirrors the (new) functionality for weighted sampling, back-ported to unweighted sampling.
  • Some API changes in the Builder.build() methods:
    • Builder.build() methods no longer accept a sketch size and optionally accept only a Memory object. This was changed to avoid a bug that is easy for a user to create and difficult to find. The initMemory(Memory) function has moved to build(Memory), and the build(int k) function has moved to builder.setK(int k).
  • To improve consistency and clarity of functionality across the library, we have changed factory method names from the generic getInstance() to newInstance() when a new, empty instance is being created, and to heapify() or wrap() when the resulting instance already contains data.
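
The Memory changes in this release rest on three ideas: resources released automatically by try-with-resources, direct-addressed versus positional access, and read-only versus writable views. The sketch below illustrates those ideas by analogy using only JDK classes (`java.nio.ByteBuffer` and `AutoCloseable`). It does not use the Memory package itself; `MemoryAnalogy` and `DemoResource` are hypothetical names, and the exception shown is the JDK's `ReadOnlyBufferException`, not SketchesReadOnlyException.

```java
import java.nio.ByteBuffer;
import java.nio.ReadOnlyBufferException;

// Analogy only: JDK classes standing in for the concepts described in the notes.
final class MemoryAnalogy {

  /** Hypothetical stand-in for a resource that lives outside the JVM heap. */
  static final class DemoResource implements AutoCloseable {
    private boolean open = true;

    boolean isOpen() { return open; }

    @Override
    public void close() { open = false; }
  }

  /** Returns true if a write through a read-only view of the array is rejected. */
  static boolean readOnlyViewRejectsWrites(byte[] bytes) {
    ByteBuffer readOnly = ByteBuffer.wrap(bytes).asReadOnlyBuffer();
    try {
      readOnly.put(0, (byte) 1);
      return false;
    } catch (ReadOnlyBufferException e) {
      return true;
    }
  }

  /** Returns true if the resource was closed automatically when the try block exited. */
  static boolean closedAfterTryBlock() {
    DemoResource resource = new DemoResource();
    try (DemoResource managed = resource) {
      if (!managed.isOpen()) {
        throw new IllegalStateException("resource should be open inside the block");
      }
    }
    return !resource.isOpen();
  }

  public static void main(String[] args) {
    byte[] bytes = {10, 20, 30, 40};
    ByteBuffer buf = ByteBuffer.wrap(bytes);

    // Direct-addressed access: the index is explicit and the position is unchanged.
    System.out.println(buf.get(2)); // prints 30
    // Positional access: each call reads at the current position, then advances it.
    System.out.println(buf.get());  // prints 10
    System.out.println(buf.get());  // prints 20

    System.out.println(readOnlyViewRejectsWrites(bytes)); // prints true
    System.out.println(closedAfterTryBlock());            // prints true
  }
}
```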

0.9.1 Apr 14, 2017: Sorted Quantiles CompactDoublesSketch, added reset methods

14 Apr 19:22
  • Fixed issue with unsorted Quantiles CompactDoublesSketch
  • Added reset methods to Sampling sketches
  • Added reset methods to Tuple sketches

0.9.0 Mar 24, 2017: Quantiles DoublesSketch refactoring, Frequent Items merge fix, read-only memory fix

24 Mar 23:20
  • Quantiles DoublesSketch refactoring with API change
    • New UpdateDoublesSketch and CompactDoublesSketch classes; can only call update() on the former
    • Default serialization retains update or compact structure, allows wrap() to work as expected
    • Create Union from any combination of update or compact, direct or heap sketches
  • Fixed problem with merging Frequent Items sketches
  • Fixed problem with read-only memory

0.8.4 Jan 18, 2017: Quantiles DirectDoublesSketch, Jaccard Similarity, Sampling improvements, bug fixes

19 Jan 02:34
  • Quantiles DirectDoublesSketch
  • Jaccard Similarity
  • Quantiles forward compatibility from 0.3.0
  • Sampling improvements
  • additional getFrequentItems() method with threshold for convenience
  • PairwiseSetOperations bug fixes and performance improvements
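
As background for the Jaccard Similarity item above, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The following minimal sketch computes it exactly on `java.util.Set` inputs. It is not the library's sketch-based estimator; the class `JaccardExample` and the convention of returning 1.0 for two empty sets are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: exact Jaccard index on in-memory sets (hypothetical class name).
final class JaccardExample {

  /** Returns |intersection| / |union|, or 1.0 when both sets are empty (a chosen convention). */
  static <T> double jaccard(Set<T> a, Set<T> b) {
    if (a.isEmpty() && b.isEmpty()) {
      return 1.0;
    }
    Set<T> intersection = new HashSet<>(a);
    intersection.retainAll(b);
    // Inclusion-exclusion avoids materializing the union.
    int unionSize = a.size() + b.size() - intersection.size();
    return (double) intersection.size() / unionSize;
  }

  public static void main(String[] args) {
    Set<Integer> a = new HashSet<>(Arrays.asList(1, 2, 3));
    Set<Integer> b = new HashSet<>(Arrays.asList(2, 3, 4));
    System.out.println(jaccard(a, b)); // prints 0.5
  }
}
```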

0.8.3 Nov 20, 2016: NativeMemory support for sliced ByteBuffers

21 Nov 01:03
  • Thanks to a PR by Gian Merlino (@gianm), NativeMemory now supports sliced ByteBuffers, i.e., ByteBuffers with an internal offset that have been derived from another ByteBuffer.
  • We have done (and will continue to do) a lot of refactoring to improve code quality. We also continue to tighten the rules specified by the code checkers found in the /tools directory.
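
To show why sliced ByteBuffers need special handling, the sketch below uses only the JDK to demonstrate that a slice of a heap ByteBuffer shares the parent's backing array but starts at a non-zero `arrayOffset()`. It illustrates the JDK behavior only and makes no claim about how NativeMemory handles it internally; `SlicedBufferExample` and `readViaBackingArray` are hypothetical names.

```java
import java.nio.ByteBuffer;

// Illustrative only: why code that reads a ByteBuffer's backing array must honor arrayOffset().
final class SlicedBufferExample {

  /** Reads the byte at a buffer-relative index of a heap ByteBuffer via its backing array. */
  static byte readViaBackingArray(ByteBuffer buf, int index) {
    // For a slice, arrayOffset() is non-zero; ignoring it would read the wrong byte.
    return buf.array()[buf.arrayOffset() + index];
  }

  public static void main(String[] args) {
    ByteBuffer parent = ByteBuffer.wrap(new byte[] {10, 20, 30, 40, 50});
    parent.position(2);
    ByteBuffer slice = parent.slice(); // a view of the bytes 30, 40, 50

    System.out.println(slice.arrayOffset());           // prints 2
    System.out.println(slice.get(0));                  // prints 30
    System.out.println(readViaBackingArray(slice, 0)); // prints 30
    System.out.println(slice.array()[0]);              // prints 10 (offset ignored)
  }
}
```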

0.8.2 Nov 14, 2016: New HLL-based UniqueCountMap sketch; Bug Fixes

14 Nov 23:46

New HLL-based UniqueCountMap sketch

This is a totally new sketch in the HLL package that addresses real-time unique counting of identifiers associated with millions of Keys. Please refer to the javadocs for the UniqueCountMap sketch class in the sketches/hll package.

Fixed SerDe Compatibility With Shaded, Relocated sketches-core.jar for Pig and Hive

The previous scheme of creating a hash ID from the SerDe class names to prevent accidental deserialization with the wrong SerDe class was fragile and failed when the core classes were shaded and relocated for the Pig and Hive jars. In our attempt to protect users from themselves, we had inadvertently created a worse problem.

This has now been fixed, but to do that we had to abandon the SerDe ID concept entirely. This fix is backward compatible with all earlier releases of SerDe classes.

Upgraded Reservoir Sampling to allow for full integer precision values of K.

This allows the sketch size specification, K, for the Reservoir sketches to have full integer precision. This is also backward compatible with the earlier specification.
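
For context on what K controls, the following is a minimal sketch of classic reservoir sampling (often called Algorithm R), which keeps a uniform random sample of at most K items from a stream of unknown length. It is a textbook illustration using only the JDK, not the library's implementation; the class `ReservoirExample` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative only: textbook reservoir sampling with an int-valued K (hypothetical class).
final class ReservoirExample {
  private final int k;            // maximum sample size
  private final List<Long> items; // current sample
  private final Random random;
  private long n;                 // number of items seen so far

  ReservoirExample(int k, long seed) {
    if (k < 1) {
      throw new IllegalArgumentException("k must be at least 1");
    }
    this.k = k;
    this.items = new ArrayList<>();
    this.random = new Random(seed);
  }

  void update(long item) {
    n++;
    if (items.size() < k) {
      items.add(item); // the first k items always enter the reservoir
      return;
    }
    // Pick a slot roughly uniformly in [0, n); the item is kept only if the
    // slot falls inside the reservoir, which happens with probability about k / n.
    long slot = (long) (random.nextDouble() * n);
    if (slot < k) {
      items.set((int) slot, item);
    }
  }

  List<Long> sample() {
    return new ArrayList<>(items);
  }

  long itemsSeen() {
    return n;
  }

  public static void main(String[] args) {
    ReservoirExample reservoir = new ReservoirExample(5, 42L);
    for (long i = 0; i < 1000; i++) {
      reservoir.update(i);
    }
    System.out.println(reservoir.sample().size()); // prints 5
  }
}
```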