Commit d0e099f

added blocking capabilities
1 parent 0c08cef commit d0e099f

10 files changed

+830
-167
lines changed


CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -7,6 +7,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.6.0?] - 2021-09-21
+
+### Added
+
+* matrix-blocking/splitting as a performance-enhancer (see README.md for details)
+* new keyword arguments `force_symmetries` and `n_blocks` (see README.md for details)
+* new dependency on packages `topn` and `sparse_dot_topn_for_blocks` to help with the matrix-blocking
+* capability to reuse a previously initialized StringGrouper (that is, the corpus can now persist across high-level function calls like `match_strings()`)
+
+
 ## [0.5.0] - 2021-06-11
 
 ### Added

README.md

Lines changed: 55 additions & 0 deletions
@@ -56,6 +56,8 @@ The permitted calling patterns of the four functions, and their return types, ar
 | `group_similar_strings`| `(strings_to_group, strings_id, **kwargs)`| `DataFrame` |
 | `compute_pairwise_similarities`| `(string_series_1, string_series_2, **kwargs)`| `Series` |
 
+***New in version 0.6.0***: a new *optional* parameter, namely `corpus`, can now be specified for all of the above high-level functions. `corpus` is a `StringGrouper` instance that has already been initialized (and thus already contains a corpus). The input Series (`master`, `duplicates`, and so on) will thus be tokenized, or transformed into tf-idf matrices, using this corpus.
+
 In the rest of this document the names, `Series` and `DataFrame`, refer to the familiar `pandas` object types.
 #### Parameters:

@@ -145,6 +147,8 @@ All functions are built using a class **`StringGrouper`**. This class can be use
 * **`replace_na`**: For function `match_most_similar`, determines whether `NaN` values in index-columns are replaced or not by index-labels from `duplicates`. Defaults to `False`. (See [tutorials/ignore_index_and_replace_na.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/ignore_index_and_replace_na.md) for a demonstration.)
 * **`include_zeroes`**: When `min_similarity` ≤ 0, determines whether zero-similarity matches appear in the output. Defaults to `True`. (See [tutorials/zero_similarity.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/zero_similarity.md).) **Note:** If `include_zeroes` is `True` and the kwarg `max_n_matches` is set then it must be sufficiently high to capture ***all*** nonzero-similarity-matches, otherwise an error is raised and `string_grouper` suggests an alternative value for `max_n_matches`. To allow `string_grouper` to automatically use the appropriate value for `max_n_matches` then do not set this kwarg at all.
 * **`group_rep`**: For function `group_similar_strings`, determines how group-representatives are chosen. Allowed values are `'centroid'` (the default) and `'first'`. See [tutorials/group_representatives.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/group_representatives.md) for an explanation.
+ * **`force_symmetries`**: When `duplicates` is `None`, specifies whether the results are corrected to restore symmetry, compensating for losses of numerical significance that would otherwise violate it. Defaults to `True`.
+ * **`n_blocks`**: A tuple of two `int`s that can help boost performance when processing large DataFrames (see [Subsection Performance](#perf)). The DataFrames are split into `n_blocks[0]` blocks for the left operand (of the underlying matrix multiplication) and into `n_blocks[1]` blocks for the right operand, and the string comparisons are then performed block-wise. Defaults to `None`, in which case automatic splitting occurs if an `OverflowError` would otherwise occur.
 
 ## Examples
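To make the `force_symmetries` entry above concrete, here is a minimal sketch of one way a symmetry correction could work. It is an illustration only, not `string_grouper`'s implementation: the function `force_symmetry` and the dict-of-pairs representation are hypothetical, and it assumes that a lost entry (j, i) can simply be restored from its mirror (i, j).

```python
def force_symmetry(matches):
    """Return a copy of `matches` in which (j, i) exists whenever (i, j) does.

    `matches` maps index pairs (i, j) to similarity scores.
    """
    symmetric = dict(matches)
    for (i, j), score in matches.items():
        # Keep an existing mirror entry; only fill in the ones that were lost.
        symmetric.setdefault((j, i), score)
    return symmetric


# Hypothetical result in which rounding dropped the (1, 0) entry.
matches = {(0, 0): 1.0, (0, 1): 0.8, (1, 1): 1.0}
fixed = force_symmetry(matches)
```

When a Series is matched against itself, string 0 matching string 1 implies the reverse, so any pair present in only one direction can be treated as a numerical casualty and mirrored.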

@@ -993,3 +997,54 @@ companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')
 </tbody>
 </table>
 </div>
+
+# Performance<a name="perf"></a>
+
+### Semilogx plots of run-times of `match_strings()` vs the number of blocks (`n_blocks[1]`) into which the right matrix-operand of the dataset (663 000 strings from sec__edgar_company_info.csv) was split before performing the string comparison. As shown in the legend, each plot corresponds to the number `n_blocks[0]` of blocks into which the left matrix-operand was split.
+<img width="100%" src="https://raw.githubusercontent.com/ParticularMiner/string_grouper/block/images/BlockNumberSpaceExploration1.png">
+
+String comparison, as implemented by `string_grouper`, is essentially matrix
+multiplication. A DataFrame of strings is converted (tokenized) into a matrix,
+which is then multiplied by its own transpose (or by that of another matrix).
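As a rough illustration of the idea that string comparison is matrix multiplication, the sketch below turns each string into a normalized vector of character 3-gram counts and multiplies the resulting matrix by its own transpose. It is a simplification under stated assumptions: `string_grouper` uses tf-idf weights and sparse matrices, whereas this uses raw counts and plain lists, and the helper names (`ngrams`, `to_matrix`, `times_transpose`) are hypothetical.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Return the character n-grams of a string (a crude tokenizer)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def to_matrix(strings, n=3):
    """Turn strings into L2-normalized n-gram count vectors (one row each)."""
    counts = [Counter(ngrams(s, n)) for s in strings]
    vocab = sorted(set().union(*counts))
    rows = []
    for c in counts:
        norm = math.sqrt(sum(v * v for v in c.values())) or 1.0
        rows.append([c[g] / norm for g in vocab])
    return rows


def times_transpose(left, right):
    """Compute left @ right.T; entry (i, j) is a cosine similarity."""
    return [[sum(a * b for a, b in zip(u, v)) for v in right] for u in left]


names = ["ACME CORP", "ACME CORP.", "ZENITH LLC"]
matrix = to_matrix(names)
similarities = times_transpose(matrix, matrix)
```

Here `similarities[i][j]` is the cosine similarity between `names[i]` and `names[j]`: each string scores 1 against itself, the two ACME variants score high against each other, and strings that share no 3-grams score 0.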
+
+Here is an illustration of multiplication of two matrices ***D*** and ***M***<sup>T</sup>:
+![Block Matrix 1 1](https://raw.githubusercontent.com/ParticularMiner/string_grouper/block/images/BlockMatrix_1_1.png)
+
+It turns out that when the matrix (or DataFrame) is very large, the computer
+proceeds quite slowly with the multiplication (apparently due to the RAM being
+too full). Some computers give up with an `OverflowError`.
+
+To circumvent this issue, `string_grouper` now allows the division of the DataFrame(s)
+into smaller chunks (or blocks) and multiplies the chunks one pair at a time
+instead to get the same result:
+
+![Block Matrix 2 2](https://raw.githubusercontent.com/ParticularMiner/string_grouper/block/images/BlockMatrix_2_2.png)
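The claim that multiplying the chunks one pair at a time gives the same result can be checked with a small sketch. This is plain Python on dense lists, not the sparse routines `string_grouper` relies on, and the function names are hypothetical; it only demonstrates that the block-wise product reassembles into the full product.

```python
def times_transpose(left, right):
    """Return left @ right.T for matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(u, v)) for v in right] for u in left]


def split_rows(rows, n_blocks):
    """Split a list of rows into at most n_blocks contiguous chunks."""
    size = -(-len(rows) // n_blocks)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def blockwise_times_transpose(left, right, n_left, n_right):
    """Same product as times_transpose(), computed one block pair at a time."""
    result = []
    for left_block in split_rows(left, n_left):
        # Rows of the result that this left block is responsible for.
        out_rows = [[] for _ in left_block]
        for right_block in split_rows(right, n_right):
            tile = times_transpose(left_block, right_block)
            for out_row, tile_row in zip(out_rows, tile):
                out_row.extend(tile_row)
        result.extend(out_rows)
    return result


D = [[1, 0, 2], [0, 3, 1], [4, 1, 0], [2, 2, 2]]
M = [[1, 1, 0], [0, 2, 1], [3, 0, 1]]
full = times_transpose(D, M)
blocked = blockwise_times_transpose(D, M, n_left=2, n_right=2)
```

Each tile depends only on one left block and one right block, so only that pair needs to be in working memory at a time.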
+
+Surprisingly, the run-time of the process is sometimes drastically reduced
+as a result. For example, the speed-up of the following call is about 500%
+(here, the DataFrame is divided into 200 blocks on the right operand, that is,
+1 block on the left &times; 200 on the right) compared to the same call with no
+splitting \[`n_blocks=(1, 1)`, which is what previous versions (0.5.0 and
+earlier) of `string_grouper` did\]:
+
+```python
+import pandas as pd
+from string_grouper import match_strings
+
+# A DataFrame of 663 000 records:
+companies = pd.read_csv('data/sec__edgar_company_info.csv')
+
+# The following call is more than 6 times faster than earlier versions of
+# match_strings() (that is, when n_blocks=(1, 1))!
+match_strings(companies['Company Name'], n_blocks=(1, 200))
+```
+
+Further exploration of the block number space has revealed that for any fixed
+number of right blocks, the run-time increases with the number of left
+blocks. For this reason, it is recommended *not* to split the left matrix.
+
+![Block Matrix 1 2](https://raw.githubusercontent.com/ParticularMiner/string_grouper/block/images/BlockMatrix_1_2.png)
+
+So what are the optimum block number values for any given DataFrame? That is
+anyone's guess, and the answer may vary from computer to computer.
+
+We nevertheless encourage users to experiment with the `n_blocks`
+parameter to boost the performance of `string_grouper`.
+
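Since the best block counts apparently have to be found by experiment, a small timing harness may help. The sketch below is a hypothetical helper, not part of `string_grouper`: `time_block_counts` and `fastest` are illustrative names, and the stand-in workload exists only so the example runs on its own.

```python
import time


def time_block_counts(run, candidates):
    """Time run(n_blocks) for each candidate and return {n_blocks: seconds}."""
    timings = {}
    for n_blocks in candidates:
        start = time.perf_counter()
        run(n_blocks)
        timings[n_blocks] = time.perf_counter() - start
    return timings


def fastest(timings):
    """Return the candidate with the shortest measured run-time."""
    return min(timings, key=timings.get)


# Stand-in workload so the sketch runs on its own; replace it with, e.g.,
# lambda n_blocks: match_strings(companies['Company Name'], n_blocks=n_blocks)
def dummy_run(n_blocks):
    left, right = n_blocks
    return sum(range(1000 * left * right))


# Only the right operand is split, following the recommendation above.
candidates = [(1, 1), (1, 2), (1, 4)]
timings = time_block_counts(dummy_run, candidates)
best = fastest(timings)
```

Single measurements are noisy, so repeating each candidate a few times and comparing the minimum run-times would give a steadier picture.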

images/BlockMatrix_1_1.png

436 KB

images/BlockMatrix_1_2.png

467 KB

images/BlockMatrix_2_2.png

525 KB
53.2 KB

images/Fuzzy_vs_Exact.png

35.2 KB

setup.py

Lines changed: 3 additions & 2 deletions
@@ -9,7 +9,7 @@
 
 setup(
 name='string_grouper',
-version='0.5.0',
+version='0.6.0',
 packages=['string_grouper', 'string_grouper_utils'],
 license='MIT License',
 description='String grouper contains functions to do string matching using TF-IDF and the cossine similarity. '
@@ -25,6 +25,7 @@
 , 'scipy'
 , 'scikit-learn'
 , 'numpy'
-, 'sparse_dot_topn>=0.3.1'
+, 'sparse_dot_topn_for_blocks>=0.3.1'
+, 'topn>=0.0.4'
 ]
 )
