You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: CHANGELOG.md
+10Lines changed: 10 additions & 0 deletions
Original file line number
Diff line number
Diff line change
@@ -7,6 +7,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
7
7
8
8
## [Unreleased]
9
9
10
+
## [0.6.0?] - 2021-09-21
11
+
12
+
### Added
13
+
14
+
* matrix-blocking/splitting as a performance-enhancer (see README.md for details)
15
+
* new keyword arguments `force_symmetries` and `n_blocks` (see README.md for details)
16
+
* new dependency on packages `topn` and `sparse_dot_topn_for_blocks` to help with the matrix-blocking
17
+
* capability to reuse a previously initialized StringGrouper (that is, the corpus can now persist across high-level function calls like `match_strings()`)
***New in version 0.6.0***: a new *optional* parameter, namely `corpus`, can now be specified for all of the above high-level functions. `corpus` is a `StringGrouper` instance that has already been initialized (and thus already contains a corpus). The input Series (`master`, `duplicates`, and so on) will thus be tokenized, or transformed into tf-idf matrices, using this corpus.
60
+
59
61
In the rest of this document the names, `Series` and `DataFrame`, refer to the familiar `pandas` object types.
60
62
#### Parameters:
61
63
@@ -145,6 +147,8 @@ All functions are built using a class **`StringGrouper`**. This class can be use
145
147
***`replace_na`**: For function `match_most_similar`, determines whether `NaN` values in index-columns are replaced or not by index-labels from `duplicates`. Defaults to `False`. (See [tutorials/ignore_index_and_replace_na.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/ignore_index_and_replace_na.md) for a demonstration.)
146
148
***`include_zeroes`**: When `min_similarity`≤ 0, determines whether zero-similarity matches appear in the output. Defaults to `True`. (See [tutorials/zero_similarity.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/zero_similarity.md).) **Note:** If `include_zeroes` is `True` and the kwarg `max_n_matches` is set then it must be sufficiently high to capture ***all*** nonzero-similarity-matches, otherwise an error is raised and `string_grouper` suggests an alternative value for `max_n_matches`. To allow `string_grouper` to automatically use the appropriate value for `max_n_matches` then do not set this kwarg at all.
147
149
***`group_rep`**: For function `group_similar_strings`, determines how group-representatives are chosen. Allowed values are `'centroid'` (the default) and `'first'`. See [tutorials/group_representatives.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/group_representatives.md) for an explanation.
150
+
***`force_symmetries`**: In cases where `duplicates` is `None`, specifies whether corrections should be made to the results to account for symmetry, thus compensating for those losses of numerical significance which violate the symmetries. Defaults to `True`.
151
+
***`n_blocks`**: This parameter is a tuple of two `int`s provided to help boost performance, if possible, of processing large DataFrames (see [Subsection Performance](#perf)), by splitting the DataFrames into `n_blocks[0]` blocks for the left operand (of the underlying matrix multiplication) and into `n_blocks[1]` blocks for the right operand before performing the string-comparisons block-wise. Defaults to `None`, in which case automatic splitting occurs if an `OverflowError` would otherwise occur.
### Semilogx plots of run-times of `match_strings()` vs the number of blocks (`n_blocks[1]`) into which the right matrix-operand of the dataset (663 000 strings from sec__edgar_company_info.csv) was split before performing the string comparison. As shown in the legend, each plot corresponds to the number `n_blocks[0]` of blocks into which the left matrix-operand was split.
0 commit comments