Bergvca
diff --git a/‎CHANGELOG.md
Lines changed: 10 additions & 0 deletions b/‎CHANGELOG.md
Lines changed: 10 additions & 0 deletions
diff --git a/‎README.md
Lines changed: 55 additions & 0 deletions b/‎README.md
Lines changed: 55 additions & 0 deletions
diff --git a/‎images/BlockMatrix_1_1.png
436 KB b/‎images/BlockMatrix_1_1.png
436 KB
diff --git a/‎images/BlockMatrix_1_2.png
467 KB b/‎images/BlockMatrix_1_2.png
467 KB
diff --git a/‎images/BlockMatrix_2_2.png
525 KB b/‎images/BlockMatrix_2_2.png
525 KB
diff --git a/‎images/BlockNumberSpaceExploration1.png
53.2 KB b/‎images/BlockNumberSpaceExploration1.png
53.2 KB
diff --git a/‎images/Fuzzy_vs_Exact.png
35.2 KB b/‎images/Fuzzy_vs_Exact.png
35.2 KB
diff --git a/‎setup.py
Lines changed: 3 additions & 2 deletions b/‎setup.py
Lines changed: 3 additions & 2 deletions
@@ -7,6 +7,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.6.0?] - 2021-09-21
+
+### Added
+
+* matrix-blocking/splitting as a performance-enhancer (see README.md for details)
+* new keyword arguments `force_symmetries` and `n_blocks` (see README.md for details)
+* new dependency on packages `topn` and `sparse_dot_topn_for_blocks` to help with the matrix-blocking
+* capability to reuse a previously initialized StringGrouper (that is, the corpus can now persist across high-level function calls like `match_strings()`)
+
+
 ## [0.5.0] - 2021-06-11
 
 ### Added
 
@@ -56,6 +56,8 @@ The permitted calling patterns of the four functions, and their return types, ar
 | `group_similar_strings`| `(strings_to_group, strings_id, **kwargs)`| `DataFrame` |
 | `compute_pairwise_similarities`| `(string_series_1, string_series_2, **kwargs)`| `Series` |
 
+***New in version 0.6.0***: a new *optional* parameter, namely `corpus`, can now be specified for all of the above high-level functions.  `corpus` is a `StringGrouper` instance that has already been initialized (and thus already contains a corpus).  The input Series (`master`, `duplicates`, and so on) will thus be tokenized, or transformed into tf-idf matrices, using this corpus.
+
 In the rest of this document the names, `Series` and `DataFrame`, refer to the familiar `pandas` object types.
 #### Parameters:
 
@@ -145,6 +147,8 @@ All functions are built using a class **`StringGrouper`**. This class can be use
    * **`replace_na`**: For function `match_most_similar`, determines whether `NaN` values in index-columns are replaced or not by index-labels from `duplicates`. Defaults to `False`.  (See [tutorials/ignore_index_and_replace_na.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/ignore_index_and_replace_na.md) for a demonstration.)
    * **`include_zeroes`**: When `min_similarity` &le; 0, determines whether zero-similarity matches appear in the output.  Defaults to `True`.  (See [tutorials/zero_similarity.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/zero_similarity.md).)  **Note:** If `include_zeroes` is `True` and the kwarg `max_n_matches` is set then it must be sufficiently high to capture ***all*** nonzero-similarity-matches, otherwise an error is raised and `string_grouper` suggests an alternative value for `max_n_matches`.  To allow `string_grouper` to automatically use the appropriate value for `max_n_matches` then do not set this kwarg at all.
    * **`group_rep`**: For function `group_similar_strings`, determines how group-representatives are chosen.  Allowed values are `'centroid'` (the default) and `'first'`.  See [tutorials/group_representatives.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/group_representatives.md) for an explanation.
+   * **`force_symmetries`**: In cases where `duplicates` is `None`, specifies whether corrections should be made to the results to account for symmetry, thus compensating for those losses of numerical significance which violate the symmetries. Defaults to `True`.
+   * **`n_blocks`**: This parameter is a tuple of two `int`s provided to help boost performance, if possible, of processing large DataFrames (see [Subsection Performance](#perf)), by splitting the DataFrames into `n_blocks[0]` blocks for the left operand (of the underlying matrix multiplication) and into `n_blocks[1]` blocks for the right operand before performing the string-comparisons block-wise.  Defaults to `None`, in which case automatic splitting occurs if an `OverflowError` would otherwise occur.
 
 ## Examples
 
@@ -993,3 +997,54 @@ companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')
   </tbody>
 </table>
 </div>
+
+# Performance<a name="perf"></a>
+
+### Semilogx plots of run-times of `match_strings()` vs the number of blocks (`n_blocks[1]`) into which the right matrix-operand of the dataset (663 000 strings from sec__edgar_company_info.csv) was split before performing the string comparison.  As shown in the legend, each plot corresponds to the number `n_blocks[0]` of blocks into which the left matrix-operand was split.
+<img width="100%" src="https://raw.githubusercontent.com/ParticularMiner/string_grouper/block/images/BlockNumberSpaceExploration1.png">
+
+String comparison, as implemented by `string_grouper`, is essentially matrix 
+multiplication.  A DataFrame of strings is converted (tokenized) into a 
+matrix.  Then that matrix is multiplied by itself (or another) transposed.  
+
+Here is an illustration of multiplication of two matrices ***D*** and ***M***<sup>T</sup>:
+![Block Matrix 1 1](https://raw.githubusercontent.com/ParticularMiner/string_grouper/block/images/BlockMatrix_1_1.png)
+
+It turns out that when the matrix (or DataFrame) is very large, the computer 
+proceeds quite slowly with the multiplication (apparently due to the RAM being 
+too full).  Some computers give up with an `OverflowError`.
+
+To circumvent this issue, `string_grouper` now allows the division of the DataFrame(s) 
+into smaller chunks (or blocks) and multiplies the chunks one pair at a time 
+instead to get the same result:
+
+![Block Matrix 2 2](https://raw.githubusercontent.com/ParticularMiner/string_grouper/block/images/BlockMatrix_2_2.png)
+
+But surprise ... the run-time of the process is sometimes drastically reduced 
+as a result.  For example, the speed-up of the following call is about 500% 
+(here, the DataFrame is divided into 200 blocks on the right operand, that is, 
+1 block on the left &times; 200 on the right) compared to the same call with no
+splitting \[`n_blocks=(1, 1)`, the default, which is what previous versions 
+(0.5.0 and earlier) of `string_grouper` did\]:
+
+```python
+# A DataFrame of 668 000 records:
+companies = pd.read_csv('data/sec__edgar_company_info.csv')
+
+# The following call is more than 6 times faster than earlier versions of 
+# match_strings() (that is, when n_blocks=(1, 1))!
+match_strings(companies['Company Name')], n_blocks=(1, 200))
+```
+
+Further exploration of the block number space has revealed that for any fixed 
+number of right blocks, the run-time gets longer the larger the number of left 
+blocks specified.  For this reason, it is recommended *not* to split the left matrix.
+
+![Block Matrix 1 2](https://raw.githubusercontent.com/ParticularMiner/string_grouper/block/images/BlockMatrix_1_2.png)
+
+So what are the optimum block number values for any given DataFrame? That is 
+anyone's guess, and the answer may vary from computer to computer.  
+
+We however encourage the user to make judicious use of the `n_blocks` 
+parameter to boost performance of `string_grouper`.
+
@@ -9,7 +9,7 @@
 
 setup(
     name='string_grouper',
-    version='0.5.0',
+    version='0.6.0',
     packages=['string_grouper', 'string_grouper_utils'],
     license='MIT License',
     description='String grouper contains functions to do string matching using TF-IDF and the cossine similarity. '
@@ -25,6 +25,7 @@
                       , 'scipy'
                       , 'scikit-learn'
                       , 'numpy'
-                      , 'sparse_dot_topn>=0.3.1'
+                      , 'sparse_dot_topn_for_blocks>=0.3.1'
+                      , 'topn>=0.0.4'
                       ]
 )