Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]

[0.5.0] - 2021-06-11

Added

  • Added new keyword argument tfidf_matrix_dtype, the datatype of the tf-idf values in the matrix components. Allowed values are numpy.float32 and numpy.float64 (used by the required external package sparse_dot_topn version 0.3.1). Default is numpy.float32. (Note: numpy.float32 often gives faster processing and a smaller memory footprint than numpy.float64, at the cost of lower numerical precision.)
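
The trade-off between the two datatypes can be illustrated without the library itself. The following is a minimal sketch using only the standard-library array module, whose "f" and "d" typecodes store 4-byte and 8-byte floats respectively (the same widths as numpy.float32 and numpy.float64). The sample values are arbitrary and chosen for illustration only.

```python
from array import array

# Arbitrary sample values standing in for tf-idf weights (illustrative only).
values = [0.1, 1 / 3, 0.987654321]

single = array("f", values)  # 4-byte floats, same width as numpy.float32
double = array("d", values)  # 8-byte floats, same width as numpy.float64

# Storage needed for the raw values in each precision.
single_bytes = single.itemsize * len(single)
double_bytes = double.itemsize * len(double)

# Largest difference introduced by storing the values in single precision.
max_rounding_error = max(abs(s - d) for s, d in zip(single, double))

print(single_bytes, double_bytes)
print(max_rounding_error)
```

Single precision halves the storage per value, while each stored value drifts slightly from its double-precision counterpart.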

Changed

  • Changed dependency on sparse_dot_topn from version 0.2.9 to 0.3.1.
  • Changed the default datatype for cosine similarities from numpy.float64 to numpy.float32 to boost computational performance at the expense of numerical precision.
  • Changed the default value of the keyword argument max_n_matches from 20 to the number of strings in duplicates (or master, if duplicates is not given).
  • Changed the warning into an exception: an exception is now raised when include_zeroes=True, min_similarity ≤ 0, and max_n_matches is not high enough to capture all nonzero-similarity matches.
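
The new max_n_matches default described above might be pictured as follows. This is only a sketch of the documented rule, not the library's implementation; the helper name default_max_n_matches is hypothetical.

```python
def default_max_n_matches(master, duplicates=None):
    """Hypothetical helper mirroring the documented 0.5.0 default.

    The default is the number of strings in `duplicates`, or in `master`
    when `duplicates` is not given.
    """
    return len(duplicates) if duplicates is not None else len(master)


master = ["foo", "bar", "baz"]
duplicates = ["foo bar", "baz"]

print(default_max_n_matches(master))              # only master given
print(default_max_n_matches(master, duplicates))  # duplicates given
```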

Removed

  • Removed the keyword argument suppress_warning

[0.4.0] - 2021-04-11

Added

  • Added group representative functionality - by default the centroid is used. From @ParticularMiner

  • Added string_grouper_utils package with additional group-representative functionality:

    • new_group_rep_by_earliest_timestamp
    • new_group_rep_by_completeness
    • new_group_rep_by_highest_weight

    From @ParticularMiner

  • Original indices are now added by default to output of group_similar_strings, match_most_similar and match_strings. From @ParticularMiner

  • Added compute_pairwise_similarities function. From @ParticularMiner

Changed

  • Default group representative is now the centroid. Used to be the first string in the series belonging to a group. From @ParticularMiner
  • Output of match_most_similar and match_strings is now a pandas.DataFrame object instead of a pandas.Series by default. From @ParticularMiner
  • Fixed a bug that occurred when min_similarity=0. From @ParticularMiner
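
To show how a centroid representative can differ from the first string in a group, here is a minimal sketch. It rests on two assumptions: that "centroid" means the group member with the largest total similarity to the other members, and that difflib.SequenceMatcher is an acceptable stand-in for the tf-idf cosine similarity the package computes. The function names similarity and pick_centroid are hypothetical and are not part of string_grouper.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in similarity score in [0, 1]; NOT the tf-idf cosine similarity
    # used by the package.
    return SequenceMatcher(None, a, b).ratio()


def pick_centroid(group):
    # Assumed definition: the member whose summed similarity to all other
    # members of the group is largest.
    def total_similarity(i):
        return sum(
            similarity(group[i], other)
            for j, other in enumerate(group)
            if j != i
        )

    best_index = max(range(len(group)), key=total_similarity)
    return group[best_index]


group = ["Acme Corp Ltd", "Acme Corp", "Acme Corp Inc"]

print(group[0])              # representative under the old default
print(pick_centroid(group))  # representative under the assumed centroid rule
```

In this group the shortest name is closest to both of the others, so it is chosen over the first entry.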