diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
index 17dcc3ee..db3e1fbc 100644
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -11,7 +11,7 @@ jobs:
strategy:
matrix:
python-version: [3.7, 3.8, 3.9]
- os: [ubuntu-latest, windows-latest]
+ os: [ubuntu-latest]
steps:
- uses: actions/checkout@v2
@@ -21,8 +21,13 @@ jobs:
with:
python-version: ${{ matrix.python-version }}
- - name: Install package
- run: pip install .
+ - name: Install dev-package
+ run: |
+ sudo apt-get install -y qemu tree
+ python -m pip install --upgrade pip
+ pip install -v -e .
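+ # the -R flag caps qemu's reserved guest virtual address space (here 20M),
+ # presumably to exercise string_grouper's low-memory/OverflowError handling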
+ qemu-x86_64 -R 20M python time_match_strings.py
+
- name: Run tests
run: python -m unittest
diff --git a/CHANGELOG.md b/CHANGELOG.md
index d1cb63ff..7b77f8bd 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,33 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]
+## [0.6.0] - 2021-09-21
+
+### Added
+
+* matrix-blocking/splitting as a performance-enhancer (see [README.md](https://github.com/ParticularMiner/string_grouper/tree/block#performance) for details)
+* new keyword arguments `force_symmetries` and `n_blocks` (see [README.md](https://github.com/ParticularMiner/string_grouper/tree/block#kwargs) for details)
+* new dependency on packages `topn` and `sparse_dot_topn_for_blocks` to help with the matrix-blocking
+* capability to reuse a previously initialized StringGrouper (that is, the corpus can now persist across high-level function calls like `match_strings()`; see [README.md](https://github.com/ParticularMiner/string_grouper/tree/block#corpus) for details)
+
+
+## [0.5.0] - 2021-06-11
+
+### Added
+
+* Added new keyword argument **`tfidf_matrix_dtype`** (the datatype for the tf-idf values of the matrix components). Allowed values are `numpy.float32` and `numpy.float64` (used by the required external package `sparse_dot_topn` version 0.3.1). Default is `numpy.float32`. (Note: `numpy.float32` often leads to faster processing and a smaller memory footprint albeit less numerical precision than `numpy.float64`.)
+
+### Changed
+
+* Changed dependency on `sparse_dot_topn` from version 0.2.9 to 0.3.1
+* Changed the default datatype for cosine similarities from numpy.float64 to numpy.float32 to boost computational performance at the expense of numerical precision.
+* Changed the default value of the keyword argument `max_n_matches` from 20 to the number of strings in `duplicates` (or `master`, if `duplicates` is not given).
+* Changed the warning issued when the condition \[`include_zeroes=True` and `min_similarity` ≤ 0 and `max_n_matches` is not sufficiently high to capture all nonzero-similarity-matches\] is met into an exception.
+
+### Removed
+
+* Removed the keyword argument `suppress_warning`
+
## [0.4.0] - 2021-04-11
### Added
diff --git a/README.md b/README.md
index 13f22127..270b4e26 100644
--- a/README.md
+++ b/README.md
@@ -13,7 +13,7 @@
The image displayed above is a visualization of the graph-structure of one of the groups of strings found by `string_grouper`. Each circle (node) represents a string, and each connecting arc (edge) represents a match between a pair of strings with a similarity score above a given threshold score (here `0.8`).
-The ***centroid*** of the group, as determined by `string_grouper` (see [tutorials/group_representatives.md](tutorials/group_representatives.md) for an explanation), is the largest node, also with the most edges originating from it. A thick line in the image denotes a strong similarity between the nodes at its ends, while a faint thin line denotes weak similarity.
+The ***centroid*** of the group, as determined by `string_grouper` (see [tutorials/group_representatives.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/group_representatives.md) for an explanation), is the largest node, also with the most edges originating from it. A thick line in the image denotes a strong similarity between the nodes at its ends, while a faint thin line denotes weak similarity.
The power of `string_grouper` is discernible from this image: in large datasets, `string_grouper` is often able to resolve indirect associations between strings even when, say, due to memory-resource-limitations, direct matches between those strings cannot be computed using conventional methods with a lower threshold similarity score.
@@ -70,6 +70,18 @@ In the rest of this document the names, `Series` and `DataFrame`, refer to the f
|**`string_series_1(_2)`** | A `Series` of strings each of which is to be compared with its corresponding string in `string_series_2(_1)`. |
|**`**kwargs`** | Keyword arguments (see [below](#kwargs)).|
+***New in version 0.6.0***: each of the high-level functions listed above also has a `StringGrouper` method counterpart of the same name and parameters. Calling such a method of any instance of `StringGrouper` will not rebuild the instance's underlying corpus in order to perform the string-comparisons but will rather reuse it. The input Series to the method (`master`, `duplicates`, and so on) are thus encoded, or transformed, into tf-idf matrices using this corpus. For example:
+```python
+# Build a corpus using strings in the pandas Series master:
+sg = StringGrouper(master)
+# The following method-calls will compare strings first in
+# pandas Series new_master_1 and next in new_master_2
+# using the corpus already built above without rebuilding or
+# changing it in any way:
+matches1 = sg.match_strings(new_master_1)
+matches2 = sg.match_strings(new_master_2)
+```
+
#### Functions:
* #### `match_strings`
@@ -85,7 +97,7 @@ In the rest of this document the names, `Series` and `DataFrame`, refer to the f
2. `'similarity'` whose column has the similarity-scores as values, and
3. The name of `duplicates` (or `master` if `duplicates` is not given) and the name(s) of its index (or index-levels) prefixed by the string `'right_'`.
- Indexes (or their levels) only appear when the keyword argument `ignore_index=False` (the default). (See [tutorials/ignore_index_and_replace_na.md](tutorials/ignore_index_and_replace_na.md) for a demonstration.)
+ Indexes (or their levels) only appear when the keyword argument `ignore_index=False` (the default). (See [tutorials/ignore_index_and_replace_na.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/ignore_index_and_replace_na.md) for a demonstration.)
If either `master` or `duplicates` has no name, it assumes the name `'side'` which is then prefixed as described above. Similarly, if any of the indexes (or index-levels) has no name it assumes its `pandas` default name (`'index'`, `'level_0'`, and so on) and is then prefixed as described above.
@@ -101,7 +113,7 @@ In the rest of this document the names, `Series` and `DataFrame`, refer to the f
The name of the output `Series` is the same as that of `master` prefixed with the string `'most_similar_'`. If `master` has no name, it is assumed to have the name `'master'` before being prefixed.
- If `ignore_index=False` (the default), `match_most_similar` returns a `DataFrame` containing the same `Series` described above as one of its columns. So it inherits the same index and length as `duplicates`. The rest of its columns correspond to the index (or index-levels) of `master` and thus contain the index-labels of the most similar strings being output as values. If there are no similar strings in `master` for a given string in `duplicates` then the value(s) assigned to this index-column(s) for that string is `NaN` by default. However, if the keyword argument `replace_na=True`, then these `NaN` values are replaced with the index-label(s) of that string in `duplicates`. Note that such replacements can only occur if the indexes of `master` and `duplicates` have the same number of levels. (See [tutorials/ignore_index_and_replace_na.md](tutorials/ignore_index_and_replace_na.md#MMS) for a demonstration.)
+ If `ignore_index=False` (the default), `match_most_similar` returns a `DataFrame` containing the same `Series` described above as one of its columns. So it inherits the same index and length as `duplicates`. The rest of its columns correspond to the index (or index-levels) of `master` and thus contain the index-labels of the most similar strings being output as values. If there are no similar strings in `master` for a given string in `duplicates` then the value(s) assigned to this index-column(s) for that string is `NaN` by default. However, if the keyword argument `replace_na=True`, then these `NaN` values are replaced with the index-label(s) of that string in `duplicates`. Note that such replacements can only occur if the indexes of `master` and `duplicates` have the same number of levels. (See [tutorials/ignore_index_and_replace_na.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/ignore_index_and_replace_na.md#MMS) for a demonstration.)
Each column-name of the output `DataFrame` has the same name as its corresponding column, index, or index-level of `master` prefixed with the string `'most_similar_'`.
@@ -109,7 +121,7 @@ In the rest of this document the names, `Series` and `DataFrame`, refer to the f
* #### `group_similar_strings`
- Takes a single `Series` of strings (`strings_to_group`) and groups them by assigning to each string one string from `strings_to_group` chosen as the group-representative for each group of similar strings found. (See [tutorials/group_representatives.md](tutorials/group_representatives.md) for details on how the the group-representatives are chosen.)
+ Takes a single `Series` of strings (`strings_to_group`) and groups them by assigning to each string one string from `strings_to_group` chosen as the group-representative for each group of similar strings found. (See [tutorials/group_representatives.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/group_representatives.md) for details on how the group-representatives are chosen.)
If `ignore_index=True`, the output is a `Series` (with the same name as `strings_to_group` prefixed by the string `'group_rep_'`) of the same length and index as `strings_to_group` containing the group-representative strings. If `strings_to_group` has no name then the name of the returned `Series` is `'group_rep'`.
@@ -134,17 +146,20 @@ All functions are built using a class **`StringGrouper`**. This class can be use
All keyword arguments not mentioned in the function definitions above are used to update the default settings. The following optional arguments can be used:
* **`ngram_size`**: The amount of characters in each n-gram. Default is `3`.
- * **`regex`**: The regex string used to clean-up the input string. Default is `"[,-./]|\s"`.
- * **`max_n_matches`**: The maximum number of matches allowed per string in `master`. Default is `20`.
+ * **`regex`**: The regex string used to clean-up the input string. Default is `r"[,-./]|\s"`.
+ * **`ignore_case`**: Determines whether or not letter case in strings should be ignored. Defaults to `True`.
+ * **`tfidf_matrix_dtype`**: The datatype for the tf-idf values of the matrix components. Allowed values are `numpy.float32` and `numpy.float64`. Default is `numpy.float32`. (Note: `numpy.float32` often leads to faster processing and a smaller memory footprint albeit less numerical precision than `numpy.float64`.)
+ * **`max_n_matches`**: The maximum number of matching strings in `master` allowed per string in `duplicates`. Default is the total number of strings in `master`.
* **`min_similarity`**: The minimum cosine similarity for two strings to be considered a match.
Defaults to `0.8`
* **`number_of_processes`**: The number of processes used by the cosine similarity calculation. Defaults to
`number of cores on a machine - 1.`
- * **`ignore_index`**: Determines whether indexes are ignored or not. If `False` (the default), index-columns will appear in the output, otherwise not. (See [tutorials/ignore_index_and_replace_na.md](tutorials/ignore_index_and_replace_na.md) for a demonstration.)
- * **`replace_na`**: For function `match_most_similar`, determines whether `NaN` values in index-columns are replaced or not by index-labels from `duplicates`. Defaults to `False`. (See [tutorials/ignore_index_and_replace_na.md](tutorials/ignore_index_and_replace_na.md) for a demonstration.)
- * **`include_zeroes`**: When `min_similarity` ≤ 0, determines whether zero-similarity matches appear in the output. Defaults to `True`. (See [tutorials/zero_similarity.md](tutorials/zero_similarity.md) for a demonstration.) **Warning:** Make sure the kwarg `max_n_matches` is sufficiently high to capture ***all*** nonzero-similarity-matches, otherwise some zero-similarity-matches returned will be false.
- * **`suppress_warning`**: when `min_similarity` ≤ 0 and `include_zeroes` is `True`, determines whether or not to suppress the message warning that `max_n_matches` may be too small. Defaults to `False`.
- * **`group_rep`**: For function `group_similar_strings`, determines how group-representatives are chosen. Allowed values are `'centroid'` (the default) and `'first'`. See [tutorials/group_representatives.md](tutorials/group_representatives.md) for an explanation.
+ * **`ignore_index`**: Determines whether indexes are ignored or not. If `False` (the default), index-columns will appear in the output, otherwise not. (See [tutorials/ignore_index_and_replace_na.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/ignore_index_and_replace_na.md) for a demonstration.)
+ * **`replace_na`**: For function `match_most_similar`, determines whether `NaN` values in index-columns are replaced or not by index-labels from `duplicates`. Defaults to `False`. (See [tutorials/ignore_index_and_replace_na.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/ignore_index_and_replace_na.md) for a demonstration.)
+ * **`include_zeroes`**: When `min_similarity` ≤ 0, determines whether zero-similarity matches appear in the output. Defaults to `True`. (See [tutorials/zero_similarity.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/zero_similarity.md).) **Note:** If `include_zeroes` is `True` and the kwarg `max_n_matches` is set, then it must be sufficiently high to capture ***all*** nonzero-similarity-matches; otherwise an error is raised and `string_grouper` suggests an alternative value for `max_n_matches`. To allow `string_grouper` to choose the appropriate value for `max_n_matches` automatically, simply leave this kwarg unset.
+ * **`group_rep`**: For function `group_similar_strings`, determines how group-representatives are chosen. Allowed values are `'centroid'` (the default) and `'first'`. See [tutorials/group_representatives.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/group_representatives.md) for an explanation.
+ * **`force_symmetries`**: In cases where `duplicates` is `None`, specifies whether corrections should be made to the results to account for symmetry, thus compensating for those losses of numerical significance which violate the symmetries. Defaults to `True`.
+ * **`n_blocks`**: This parameter is a tuple of two `int`s provided to help boost performance, if possible, of processing large DataFrames (see [Subsection Performance](#perf)), by splitting the DataFrames into `n_blocks[0]` blocks for the left operand (of the underlying matrix multiplication) and into `n_blocks[1]` blocks for the right operand before performing the string-comparisons block-wise. Defaults to `None`, in which case automatic splitting occurs if an `OverflowError` would otherwise occur. (See the sketch following this list.)
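+
+For illustration, here is a minimal sketch of how `n_blocks` (and `force_symmetries`) may be passed to `match_strings`; the CSV path and column name are those used in the examples below:
+
+```python
+import pandas as pd
+from string_grouper import match_strings
+
+companies = pd.read_csv('data/sec__edgar_company_info.csv')
+
+# Split only the right operand of the underlying matrix multiplication
+# into 4 blocks; the left operand stays whole (n_blocks[0] = 1):
+matches = match_strings(companies['Company Name'], n_blocks=(1, 4), force_symmetries=True)
+```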
## Examples
@@ -306,7 +321,7 @@ Out of the four company names in `duplicates`, three companies are found in the
### Finding duplicates from a (database extract to) DataFrame where IDs for rows are supplied.
-A very common scenario is the case where duplicate records for an entity have been entered into a database. That is, there are two or more records where a name field has slightly different spelling. For example, "A.B. Corporation" and "AB Corporation". Using the optional 'ID' parameter in the `match_strings` function duplicates can be found easily. A [tutorial](tutorials/tutorial_1.md) that steps though the process with an example data set is available.
+A very common scenario is the case where duplicate records for an entity have been entered into a database. That is, there are two or more records where a name field has slightly different spelling. For example, "A.B. Corporation" and "AB Corporation". Using the optional 'ID' parameter in the `match_strings` function, duplicates can be found easily. A [tutorial](https://github.com/Bergvca/string_grouper/blob/master/tutorials/tutorial_1.md) that steps through the process with an example data set is available.
### For a second data set, find only the most similar match
@@ -993,3 +1008,89 @@ companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')
+
+# Performance
+
+### Semilogx plots of run-times of `match_strings()` vs the number of blocks (`n_blocks[1]`) into which the right matrix-operand of the dataset (663 000 strings from sec__edgar_company_info.csv) was split before performing the string comparison. As shown in the legend, each plot corresponds to the number `n_blocks[0]` of blocks into which the left matrix-operand was split.
+
+
+String comparison, as implemented by `string_grouper`, is essentially matrix
+multiplication. A pandas Series of strings is converted (tokenized) into a
+matrix. Then that matrix is multiplied by itself (or by another) transposed.
+
+Here is an illustration of the multiplication of two matrices ***D*** and ***M***<sup>T</sup>:
+
+
+It turns out that when the matrix (or Series) is very large, the computer
+proceeds quite slowly with the multiplication (apparently because the
+available RAM becomes exhausted). Some computers give up altogether with an `OverflowError`.
+
+To circumvent this issue, `string_grouper` now allows the division of the Series
+into smaller chunks (or blocks), which it multiplies one pair at a time
+instead to obtain the same result:
+
+
+
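+As a toy illustration of why this works (plain `numpy` here, not `string_grouper`'s actual code), multiplying block-wise reproduces the full matrix product exactly:
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+L, R = rng.random((4, 3)), rng.random((3, 6))
+
+# Split the right operand into two column-blocks, multiply each with L,
+# then stitch the partial products back together:
+blockwise = np.hstack([L @ B for B in np.hsplit(R, 2)])
+assert np.allclose(L @ R, blockwise)
+```
+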
+But surprise ... the run-time of the process is sometimes drastically reduced
+as a result. For example, the speed-up of the following call is about 500%
+(here, the Series is divided into 200 blocks on the right operand, that is,
+1 block on the left × 200 on the right) compared to the same call with no
+splitting \[`n_blocks=(1, 1)`, the default, which is what previous versions
+(0.5.0 and earlier) of `string_grouper` did\]:
+
+```python
+# A DataFrame of 663 000 records:
+companies = pd.read_csv('data/sec__edgar_company_info.csv')
+
+# The following call is more than 6 times faster than earlier versions of
+# match_strings() (that is, when n_blocks=(1, 1))!
+match_strings(companies['Company Name'], n_blocks=(1, 200))
+```
+
+Further exploration of the block number space ([see plot above](#Semilogx)) has revealed that for any fixed
+number of right blocks, the run-time grows as the number of left blocks increases.
+For this reason, it is recommended *not* to split the left matrix.
+
+
+
+In general,
+
+ ***total runtime*** = `n_blocks[0]` × `n_blocks[1]` × ***mean runtime per block-pair***
+
+ = ***Left Operand Size*** × ***Right Operand Size*** × ***mean runtime per block-pair*** / (***Left Block Size*** × ***Right Block Size***)
+
+since `n_blocks[0]` = ***Left Operand Size*** / ***Left Block Size*** and `n_blocks[1]` = ***Right Operand Size*** / ***Right Block Size***.
+
+So for given left and right operands, minimizing the ***total runtime*** is the same as minimizing the
+
+ ***runtime per string-pair comparison*** ≝ ***mean runtime per block-pair*** / (***Left Block Size*** × ***Right Block Size***)
+
+
+[Below is a log-log-log contour plot](#ContourPlot) of the ***runtime per string-pair comparison*** scaled by its value
+at ***Left Block Size*** = ***Right Block Size*** = 5000. Here, ***Block Size***
+is the number of strings in that block, and ***mean runtime per block-pair*** is the time taken for the following call to run:
+```python
+# note the parameter order!
+match_strings(right_Series, left_Series, n_blocks=(1, 1))
+```
+where `left_Series` and `right_Series`, corresponding to ***Left Block*** and ***Right Block*** respectively, are random subsets of the Series `companies['Company Name']` from the
+[sec__edgar_company_info.csv](https://www.kaggle.com/dattapiy/sec-edgar-companies-list/version/1) sample data file.
+
+ 
+
+It can be seen that when `right_Series` is roughly the size of 80 000 (denoted by the
+white dashed line in the contour plot above), the runtime per string-pair comparison is at
+its lowest for any fixed `left_Series` size. Above ***Right Block Size*** = 80 000, the
+matrix-multiplication routine begins to feel the limits of the computer's
+available memory space and thus its performance deteriorates, as evidenced by the increase
+in runtime per string-pair comparison there (above the white dashed line). This knowledge
+could serve as a guide for estimating the optimum block numbers —
+namely those that divide the Series into blocks of size roughly equal to
+80 000 for the right operand (or `right_Series`).
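+
+In code, such an estimate might look like the following sketch (the 80 000 figure is an empirical, machine-dependent guideline taken from the plot above, not a hard rule):
+
+```python
+import math
+
+right_block_size = 80_000  # machine-dependent guideline
+n_right = max(1, math.ceil(len(companies['Company Name']) / right_block_size))
+matches = match_strings(companies['Company Name'], n_blocks=(1, n_right))
+```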
+
+So what are the optimum block number values for *any* given Series? That is
+anyone's guess, and will likely depend on the data itself. Furthermore, as hinted above,
+the answer may vary from computer to computer.
+
+We nevertheless encourage the user to make judicious use of the `n_blocks`
+parameter to boost the performance of `string_grouper` whenever possible.
diff --git a/images/BlockMatrix_1_1.png b/images/BlockMatrix_1_1.png
new file mode 100644
index 00000000..23843452
Binary files /dev/null and b/images/BlockMatrix_1_1.png differ
diff --git a/images/BlockMatrix_1_2.png b/images/BlockMatrix_1_2.png
new file mode 100644
index 00000000..8e77511a
Binary files /dev/null and b/images/BlockMatrix_1_2.png differ
diff --git a/images/BlockMatrix_2_2.png b/images/BlockMatrix_2_2.png
new file mode 100644
index 00000000..89bbdbc5
Binary files /dev/null and b/images/BlockMatrix_2_2.png differ
diff --git a/images/BlockNumberSpaceExploration1.png b/images/BlockNumberSpaceExploration1.png
new file mode 100644
index 00000000..836600e5
Binary files /dev/null and b/images/BlockNumberSpaceExploration1.png differ
diff --git a/images/Fuzzy_vs_Exact.png b/images/Fuzzy_vs_Exact.png
new file mode 100644
index 00000000..4bfcdf39
Binary files /dev/null and b/images/Fuzzy_vs_Exact.png differ
diff --git a/images/ScaledRuntimeContourPlot.png b/images/ScaledRuntimeContourPlot.png
new file mode 100644
index 00000000..c51cea55
Binary files /dev/null and b/images/ScaledRuntimeContourPlot.png differ
diff --git a/images/ScaledTimePerComparison.png b/images/ScaledTimePerComparison.png
new file mode 100644
index 00000000..2436f54b
Binary files /dev/null and b/images/ScaledTimePerComparison.png differ
diff --git a/setup.py b/setup.py
index f4b5ecb0..cad6d08a 100644
--- a/setup.py
+++ b/setup.py
@@ -9,8 +9,8 @@
setup(
name='string_grouper',
- version='0.4.0',
- packages=['string_grouper'],
+ version='0.6.0',
+ packages=['string_grouper', 'string_grouper_utils'],
license='MIT License',
description='String grouper contains functions to do string matching using TF-IDF and the cossine similarity. '
'Based on https://bergvca.github.io/2017/10/14/super-fast-string-matching.html',
@@ -25,6 +25,7 @@
, 'scipy'
, 'scikit-learn'
, 'numpy'
- , 'sparse_dot_topn>=0.2.6'
+ , 'sparse_dot_topn_for_blocks>=0.3.1'
+ , 'topn>=0.0.7'
]
)
diff --git a/string_grouper/__init__.py b/string_grouper/__init__.py
index 84e3abd8..3b872b9b 100644
--- a/string_grouper/__init__.py
+++ b/string_grouper/__init__.py
@@ -1,2 +1,2 @@
from .string_grouper import compute_pairwise_similarities, group_similar_strings, match_most_similar, match_strings, \
-StringGrouperConfig, StringGrouper
+ StringGrouperConfig, StringGrouper
diff --git a/string_grouper/string_grouper.py b/string_grouper/string_grouper.py
index 3ab8cc46..63986354 100644
--- a/string_grouper/string_grouper.py
+++ b/string_grouper/string_grouper.py
@@ -2,15 +2,20 @@
import numpy as np
import re
import multiprocessing
+import warnings
from sklearn.feature_extraction.text import TfidfVectorizer
+from scipy.sparse import vstack
from scipy.sparse.csr import csr_matrix
+from scipy.sparse.lil import lil_matrix
from scipy.sparse.csgraph import connected_components
from typing import Tuple, NamedTuple, List, Optional, Union
-from sparse_dot_topn import awesome_cossim_topn
+from sparse_dot_topn_for_blocks import awesome_cossim_topn
+from topn import awesome_hstack_topn
from functools import wraps
-import warnings
+
DEFAULT_NGRAM_SIZE: int = 3
+DEFAULT_TFIDF_MATRIX_DTYPE: type = np.float32 # (only types np.float32 and np.float64 are allowed by sparse_dot_topn)
DEFAULT_REGEX: str = r'[,-./]|\s'
DEFAULT_MAX_N_MATCHES: int = 20
DEFAULT_MIN_SIMILARITY: float = 0.8 # minimum cosine similarity for an item to be considered a match
@@ -18,29 +23,31 @@
DEFAULT_IGNORE_CASE: bool = True # ignores case by default
DEFAULT_DROP_INDEX: bool = False # includes index-columns in output
DEFAULT_REPLACE_NA: bool = False # when finding the most similar strings, does not replace NaN values in most
- # similar string index-columns with corresponding duplicates-index values
-DEFAULT_INCLUDE_ZEROES: bool = True # when the minimum cosine similarity <=0, determines whether zero-similarity
- # matches appear in the output
-DEFAULT_SUPPRESS_WARNING: bool = False # when the minimum cosine similarity <=0 and zero-similarity matches are
- # requested, determines whether or not to suppress the message warning that
- # max_n_matches may be too small
+# similar string index-columns with corresponding duplicates-index values
+DEFAULT_INCLUDE_ZEROES: bool = True # when the minimum cosine similarity <=0, determines whether zero-similarity
+# matches appear in the output
GROUP_REP_CENTROID: str = 'centroid' # Option value to select the string in each group with the largest
- # similarity aggregate as group-representative:
+# similarity aggregate as group-representative:
GROUP_REP_FIRST: str = 'first' # Option value to select the first string in each group as group-representative:
-DEFAULT_GROUP_REP: str = GROUP_REP_CENTROID # chooses group centroid as group-representative by default
+DEFAULT_GROUP_REP: str = GROUP_REP_CENTROID # chooses group centroid as group-representative by default
+DEFAULT_FORCE_SYMMETRIES: bool = True # Option value to specify whether corrections should be made to the results
+# to account for symmetry thus compensating for those numerical errors that violate symmetry due to loss of
+# significance
+DEFAULT_N_BLOCKS: Optional[Tuple[int, int]] = None  # Option value used to split dataset(s) into roughly equal-sized blocks
# The following string constants are used by (but aren't [yet] options passed to) StringGrouper
DEFAULT_COLUMN_NAME: str = 'side' # used to name non-index columns of the output of StringGrouper.get_matches
-DEFAULT_ID_NAME: str = 'id' # used to name id-columns in the output of StringGrouper.get_matches
+DEFAULT_ID_NAME: str = 'id' # used to name id-columns in the output of StringGrouper.get_matches
LEFT_PREFIX: str = 'left_' # used to prefix columns on the left of the output of StringGrouper.get_matches
RIGHT_PREFIX: str = 'right_' # used to prefix columns on the right of the output of StringGrouper.get_matches
MOST_SIMILAR_PREFIX: str = 'most_similar_' # used to prefix columns of the output of
- # StringGrouper._get_nearest_matches
-DEFAULT_MASTER_NAME: str = 'master' # used to name non-index column of the output of StringGrouper.get_nearest_matches
+# StringGrouper._get_nearest_matches
+DEFAULT_MASTER_NAME: str = 'master' # used to name non-index column of the output of StringGrouper.get_nearest_matches
DEFAULT_MASTER_ID_NAME: str = f'{DEFAULT_MASTER_NAME}_{DEFAULT_ID_NAME}' # used to name id-column of the output of
- # StringGrouper.get_nearest_matches
+# StringGrouper.get_nearest_matches
GROUP_REP_PREFIX: str = 'group_rep_' # used to prefix and name columns of the output of StringGrouper._deduplicate
+
# High level functions
@@ -55,7 +62,8 @@ def compute_pairwise_similarities(string_series_1: pd.Series,
:param kwargs: All other keyword arguments are passed to StringGrouperConfig
:return: pandas.Series of similarity scores, the same length as string_series_1 and string_series_2
"""
- return StringGrouper(string_series_1, string_series_2, **kwargs).dot()
+ sg = StringGrouper(string_series_1, string_series_2, **kwargs)
+ return sg.dot()
def group_similar_strings(strings_to_group: pd.Series,
@@ -76,8 +84,11 @@ def group_similar_strings(strings_to_group: pd.Series,
:param kwargs: All other keyword arguments are passed to StringGrouperConfig. (Optional)
:return: pandas.Series or pandas.DataFrame.
"""
- string_grouper = StringGrouper(strings_to_group, master_id=string_ids, **kwargs).fit()
- return string_grouper.get_groups()
+ sg = StringGrouper(strings_to_group,
+ master_id=string_ids,
+ **kwargs)
+ sg = sg.fit()
+ return sg.get_groups()
def match_most_similar(master: pd.Series,
@@ -105,12 +116,14 @@ def match_most_similar(master: pd.Series,
:param kwargs: All other keyword arguments are passed to StringGrouperConfig. (Optional)
:return: pandas.Series or pandas.DataFrame.
"""
- string_grouper = StringGrouper(master,
- duplicates=duplicates,
- master_id=master_id,
- duplicates_id=duplicates_id,
- **kwargs).fit()
- return string_grouper.get_groups()
+ kwargs['max_n_matches'] = 1
+ sg = StringGrouper(master,
+ duplicates=duplicates,
+ master_id=master_id,
+ duplicates_id=duplicates_id,
+ **kwargs)
+ sg = sg.fit()
+ return sg.get_groups()
def match_strings(master: pd.Series,
@@ -130,48 +143,61 @@ def match_strings(master: pd.Series,
:param kwargs: All other keyword arguments are passed to StringGrouperConfig.
:return: pandas.Dataframe.
"""
- string_grouper = StringGrouper(master,
- duplicates=duplicates,
- master_id=master_id,
- duplicates_id=duplicates_id,
- **kwargs).fit()
- return string_grouper.get_matches()
+ sg = StringGrouper(master,
+ duplicates=duplicates,
+ master_id=master_id,
+ duplicates_id=duplicates_id,
+ **kwargs)
+ sg = sg.fit()
+ return sg.get_matches()
class StringGrouperConfig(NamedTuple):
- """
+ r"""
Class with configuration variables.
:param ngram_size: int. The amount of characters in each n-gram. Default is 3.
- :param regex: str. The regex string used to cleanup the input string. Default is [,-./]|\s.
- :param max_n_matches: int. The maximum number of matches allowed per string. Default is 20.
+ :param tfidf_matrix_dtype: type. The datatype for the tf-idf values of the matrix components.
+ Possible values allowed by sparse_dot_topn are np.float32 and np.float64. Default is np.float32.
+ (Note: np.float32 often leads to faster processing and a smaller memory footprint albeit less precision
+ than np.float64.)
+ :param regex: str. The regex string used to cleanup the input string. Default is '[,-./]|\s'.
+ :param max_n_matches: int. The maximum number of matching strings in master allowed per string in duplicates.
+ Default is the total number of strings in master.
:param min_similarity: float. The minimum cosine similarity for two strings to be considered a match.
Defaults to 0.8.
:param number_of_processes: int. The number of processes used by the cosine similarity calculation.
Defaults to number of cores on a machine - 1.
:param ignore_case: bool. Whether or not case should be ignored. Defaults to True (ignore case).
:param ignore_index: whether or not to exclude string Series index-columns in output. Defaults to False.
- :param include_zeroes: when the minimum cosine similarity <=0, determines whether zero-similarity matches
+ :param include_zeroes: when the minimum cosine similarity <=0, determines whether zero-similarity matches
appear in the output. Defaults to True.
- :param suppress_warning: when min_similarity <=0 and include_zeroes=True, determines whether or not to supress
- the message warning that max_n_matches may be too small. Defaults to False.
- :param replace_na: whether or not to replace NaN values in most similar string index-columns with
+ :param replace_na: whether or not to replace NaN values in most similar string index-columns with
corresponding duplicates-index values. Defaults to False.
:param group_rep: str. The scheme to select the group-representative. Default is 'centroid'.
The other choice is 'first'.
+ :param force_symmetries: bool. In cases where duplicates is None, specifies whether corrections should be
+ made to the results to account for symmetry, thus compensating for those losses of numerical significance
+ which violate the symmetries. Defaults to True.
+ :param n_blocks: (int, int) This parameter is provided to help boost performance, if possible, of
+ processing large DataFrames, by splitting the DataFrames into n_blocks[0] blocks for the left
+ operand (of the underlying matrix multiplication) and into n_blocks[1] blocks for the right operand
+ before performing the string-comparisons block-wise. Defaults to None.
"""
ngram_size: int = DEFAULT_NGRAM_SIZE
+ tfidf_matrix_dtype: type = DEFAULT_TFIDF_MATRIX_DTYPE
regex: str = DEFAULT_REGEX
- max_n_matches: int = DEFAULT_MAX_N_MATCHES
+ max_n_matches: Optional[int] = None
min_similarity: float = DEFAULT_MIN_SIMILARITY
number_of_processes: int = DEFAULT_N_PROCESSES
ignore_case: bool = DEFAULT_IGNORE_CASE
ignore_index: bool = DEFAULT_DROP_INDEX
include_zeroes: bool = DEFAULT_INCLUDE_ZEROES
- suppress_warning: bool = DEFAULT_SUPPRESS_WARNING
replace_na: bool = DEFAULT_REPLACE_NA
group_rep: str = DEFAULT_GROUP_REP
+ force_symmetries: bool = DEFAULT_FORCE_SYMMETRIES
+ n_blocks: Optional[Tuple[int, int]] = DEFAULT_N_BLOCKS
def validate_is_fit(f):
@@ -212,26 +238,130 @@ def __init__(self, master: pd.Series,
:param duplicates_id: pandas.Series. If set, contains ID values for each row in duplicates Series.
:param kwargs: All other keyword arguments are passed to StringGrouperConfig
"""
- # Validate match strings input
- if not StringGrouper._is_series_of_strings(master) or \
- (duplicates is not None and not StringGrouper._is_series_of_strings(duplicates)):
- raise TypeError('Input does not consist of pandas.Series containing only Strings')
+ # private members:
+ self.is_build = False
+
+ self._master: pd.DataFrame = pd.DataFrame()
+ self._duplicates: Optional[pd.Series] = None
+ self._master_id: Optional[pd.Series] = None
+ self._duplicates_id: Optional[pd.Series] = None
+
+ self._right_Series: pd.DataFrame = pd.DataFrame()
+ self._left_Series: pd.DataFrame = pd.DataFrame()
+
+ # After the StringGrouper is fit, _matches_list will contain the indices and similarities of the matches
+ self._matches_list: pd.DataFrame = pd.DataFrame()
+ # _true_max_n_matches will contain the true maximum number of matches over all strings in master if
+ # self._config.min_similarity <= 0
+ self._true_max_n_matches: int = 0
+ self._max_n_matches: int = 0
+
+ self._config: StringGrouperConfig = StringGrouperConfig(**kwargs)
+
+ # initialize the members:
+ self._set_data(master, duplicates, master_id, duplicates_id)
+ self._set_options(**kwargs)
+ self._build_corpus()
+
+ def _set_data(self,
+ master: pd.Series,
+ duplicates: Optional[pd.Series] = None,
+ master_id: Optional[pd.Series] = None,
+ duplicates_id: Optional[pd.Series] = None):
+ # Validate input strings data
+ self.master = master
+ self.duplicates = duplicates
+
# Validate optional IDs input
if not StringGrouper._is_input_data_combination_valid(duplicates, master_id, duplicates_id):
raise Exception('List of data Series options is invalid')
StringGrouper._validate_id_data(master, duplicates, master_id, duplicates_id)
+ self._master_id = master_id
+ self._duplicates_id = duplicates_id
+
+ # Set some private members
+ self._right_Series = self._master
+ if self._duplicates is None:
+ self._left_Series = self._master
+ else:
+ self._left_Series = self._duplicates
+
+ self.is_build = False
+
+ def _set_options(self, **kwargs):
+ self._config = StringGrouperConfig(**kwargs)
+
+ if self._config.max_n_matches is None:
+ self._max_n_matches = len(self._master)
+ else:
+ self._max_n_matches = self._config.max_n_matches
- self._master: pd.Series = master
- self._duplicates: pd.Series = duplicates if duplicates is not None else None
- self._master_id: pd.Series = master_id if master_id is not None else None
- self._duplicates_id: pd.Series = duplicates_id if duplicates_id is not None else None
- self._config: StringGrouperConfig = StringGrouperConfig(**kwargs)
self._validate_group_rep_specs()
+ self._validate_tfidf_matrix_dtype()
self._validate_replace_na_and_drop()
+ StringGrouper._validate_n_blocks(self._config.n_blocks)
+ self.is_build = False
+
+ def _build_corpus(self):
+ self._vectorizer = TfidfVectorizer(min_df=1, analyzer=self.n_grams, dtype=self._config.tfidf_matrix_dtype)
+ self._vectorizer = self._fit_vectorizer()
self.is_build = False # indicates if the grouper was fit or not
- self._vectorizer = TfidfVectorizer(min_df=1, analyzer=self.n_grams)
- # After the StringGrouper is build, _matches_list will contain the indices and similarities of two matches
- self._matches_list: pd.DataFrame = pd.DataFrame()
+
+ def reset_data(self,
+ master: pd.Series,
+ duplicates: Optional[pd.Series] = None,
+ master_id: Optional[pd.Series] = None,
+ duplicates_id: Optional[pd.Series] = None):
+ """
+ Sets the input Series of a StringGrouper instance without changing the underlying corpus.
+ :param master: pandas.Series. A Series of strings in which similar strings are searched, either against itself
+ or against the `duplicates` Series.
+ :param duplicates: pandas.Series. If set, for each string in duplicates a similar string is searched in Master.
+ :param master_id: pandas.Series. If set, contains ID values for each row in master Series.
+ :param duplicates_id: pandas.Series. If set, contains ID values for each row in duplicates Series.
+ :param kwargs: All other keyword arguments are passed to StringGrouperConfig
+ """
+ self._set_data(master, duplicates, master_id, duplicates_id)
+
+ def clear_data(self):
+ self._master = None
+ self._duplicates = None
+ self._master_id = None
+ self._duplicates_id = None
+ self._matches_list = None
+ self._left_Series = None
+ self._right_Series = None
+ self.is_build = False
+
+ def update_options(self, **kwargs):
+ """
+ Updates the kwargs of a StringGrouper object
+ :param **kwargs: any StringGrouper keyword=value argument pairs
+ """
+ _ = StringGrouperConfig(**kwargs)
+ old_kwargs = self._config._asdict()
+ old_kwargs.update(kwargs)
+ self._set_options(**old_kwargs)
+
+ @property
+ def master(self):
+ return self._master
+
+ @master.setter
+ def master(self, master):
+ if not StringGrouper._is_series_of_strings(master):
+ raise TypeError('Master input does not consist of pandas.Series containing only Strings')
+ self._master = master
+
+ @property
+ def duplicates(self):
+ return self._duplicates
+
+ @duplicates.setter
+ def duplicates(self, duplicates):
+ if duplicates is not None and not StringGrouper._is_series_of_strings(duplicates):
+ raise TypeError('Duplicates input does not consist of pandas.Series containing only Strings')
+ self._duplicates = duplicates
def n_grams(self, string: str) -> List[str]:
"""
@@ -246,16 +376,210 @@ def n_grams(self, string: str) -> List[str]:
n_grams = zip(*[string[i:] for i in range(ngram_size)])
return [''.join(n_gram) for n_gram in n_grams]
- def fit(self) -> 'StringGrouper':
- """Builds the _matches list which contains string matches indices and similarity"""
- master_matrix, duplicate_matrix = self._get_tf_idf_matrices()
- # Calculate the matches using the cosine similarity
- matches = self._build_matches(master_matrix, duplicate_matrix)
- # retrieve all matches
+ def _fit_blockwise_manual(self, n_blocks=(1, 1)):
+ # Function to compute matrix product by optionally first dividing
+ # the DataFrames(s) into equal-sized blocks as much as possible.
+
+ def divide_by(n, series):
+ # Returns an array of n rows and 2 columns.
+ # The columns denote the start and end of each of the n blocks.
+ # Note: zero-indexing is implied.
+ sz = len(series)//n
+ block_rem = np.full(n, 0, dtype=np.int64)
+ block_rem[:len(series) % n] = 1
+ if sz > 0:
+ equal_block_sz = np.full(n, sz, dtype=np.int64)
+ equal_block_sz += block_rem
+ else:
+ equal_block_sz = block_rem[:len(series) % n]
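+ # cumulative sums give each block's end index (exclusive); duplicating
+ # that row and shifting it right by one then yields the start indices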
+ equal_block_sz = np.cumsum(equal_block_sz)
+ equal_block_sz = np.tile(equal_block_sz, (2, 1))
+ equal_block_sz[0, 0] = 0
+ equal_block_sz[0, 1:] = equal_block_sz[1, :-1]
+ return equal_block_sz.T
+
+ block_ranges_left = divide_by(n_blocks[0], self._left_Series)
+ block_ranges_right = divide_by(n_blocks[1], self._right_Series)
+
+ self._true_max_n_matches = 0
+ block_true_max_n_matches = 0
+ vblocks = []
+ for left_block in block_ranges_left:
+ left_matrix = self._get_left_tf_idf_matrix(left_block)
+ nnz_rows = np.full(left_matrix.shape[0], 0, dtype=np.int32)
+ hblocks = []
+ for right_block in block_ranges_right:
+ right_matrix = self._get_right_tf_idf_matrix(right_block)
+ try:
+ # Calculate the matches using the cosine similarity
+ # Note: awesome_cossim_topn will sort each row only when
+ # _max_n_matches < size of right_block or sort=True
+ matches, block_true_max_n_matches = self._build_matches(
+ left_matrix, right_matrix, nnz_rows, sort=(len(block_ranges_right) == 1)
+ )
+ except OverflowError as oe:
+ import sys
+ raise (type(oe)(f"{str(oe)} Use the n_blocks parameter to split-up "
+ f"the data into smaller chunks. The current values"
+ f"(n_blocks = {n_blocks}) are too small.")
+ .with_traceback(sys.exc_info()[2]))
+ hblocks.append(matches)
+ # end of inner loop
+
+ self._true_max_n_matches = \
+ max(block_true_max_n_matches, self._true_max_n_matches)
+ if len(block_ranges_right) > 1:
+ # Note: awesome_hstack_topn will sort each row only when
+ # _max_n_matches < length of _right_Series or sort=True
+ vblocks.append(
+ awesome_hstack_topn(
+ hblocks,
+ self._max_n_matches,
+ sort=True,
+ use_threads=self._config.number_of_processes > 1,
+ n_jobs=self._config.number_of_processes
+ )
+ )
+ else:
+ vblocks.append(hblocks[0])
+ del hblocks
+ del matches
+ # end of outer loop
+
+ if len(block_ranges_left) > 1:
+ return vstack(vblocks)
+ else:
+ return vblocks[0]
+
+ def _fit_blockwise_auto(self,
+ left_partition=(None, None),
+ right_partition=(None, None),
+ nnz_rows=None,
+ sort=True,
+ whoami=0):
+ # This is a recursive function!
+ # fit() has been extended here to enable StringGrouper to handle large
+ # datasets which otherwise would lead to an OverflowError
+ # The handling is achieved using block matrix multiplication.
+ def begin(partition):
+ return partition[0] if partition[0] is not None else 0
+
+ def end(partition, left=True):
+ if partition[1] is not None:
+ return partition[1]
+
+ return len(self._left_Series if left else self._right_Series)
+
+ left_matrix = self._get_left_tf_idf_matrix(left_partition)
+ right_matrix = self._get_right_tf_idf_matrix(right_partition)
+
+ if whoami == 0:
+ # At the topmost level of recursion initialize nnz_rows
+ # which will be used to compute _true_max_n_matches
+ nnz_rows = np.full(left_matrix.shape[0], 0, dtype=np.int32)
+ self._true_max_n_matches = 0
+
+ try:
+ # Calculate the matches using the cosine similarity
+ matches, true_max_n_matches = self._build_matches(
+ left_matrix, right_matrix, nnz_rows[slice(*left_partition)],
+ sort=sort)
+ except OverflowError:
+ warnings.warn("An OverflowError occurred but is being "
+ "handled. The input data will be automatically "
+ "split-up into smaller chunks which will then be "
+ "processed one chunk at a time. To prevent "
+ "OverflowError, use the n_blocks parameter to split-up "
+ "the data manually into small enough chunks.")
+ # Matrices too big! Try splitting:
+ del left_matrix, right_matrix
+
+ def split_partition(partition, left=True):
+ data_begin = begin(partition)
+ data_end = end(partition, left=left)
+ data_mid = data_begin + (data_end - data_begin)//2
+ if data_mid > data_begin:
+ return [(data_begin, data_mid), (data_mid, data_end)]
+ else:
+ return [(data_begin, data_end)]
+
+ left_halves = split_partition(left_partition, left=True)
+ right_halves = split_partition(right_partition, left=False)
+ vblocks = []
+ for lhalf in left_halves:
+ hblocks = []
+ for rhalf in right_halves:
+ # Note: awesome_cossim_topn will sort each row only when
+ # _max_n_matches < size of right_partition or sort=True
+ matches = self._fit_blockwise_auto(
+ left_partition=lhalf, right_partition=rhalf,
+ nnz_rows=nnz_rows,
+ sort=((whoami == 0) and (len(right_halves) == 1)),
+ whoami=(whoami + 1)
+ )
+ hblocks.append(matches)
+ # end of inner loop
+ if whoami == 0:
+ self._true_max_n_matches = max(
+ np.amax(nnz_rows[slice(*lhalf)]),
+ self._true_max_n_matches
+ )
+ if len(right_halves) > 1:
+ # Note: awesome_hstack_topn will sort each row only when
+ # _max_n_matches < length of _right_Series or sort=True
+ vblocks.append(
+ awesome_hstack_topn(
+ hblocks,
+ self._max_n_matches,
+ sort=(whoami == 0),
+ use_threads=self._config.number_of_processes > 1,
+ n_jobs=self._config.number_of_processes
+ )
+ )
+ else:
+ vblocks.append(hblocks[0])
+ del hblocks
+ # end of outer loop
+ if len(left_halves) > 1:
+ return vstack(vblocks)
+ else:
+ return vblocks[0]
+
+ if whoami == 0:
+ self._true_max_n_matches = true_max_n_matches
+ return matches
+
+ def fit(self, force_symmetries=None, n_blocks=None):
+ """
+ Builds the _matches list which contains string-matches' indices and similarity
+ Updates and returns the StringGrouper object that called it.
+ """
+ if force_symmetries is None:
+ force_symmetries = self._config.force_symmetries
+ StringGrouper._validate_n_blocks(n_blocks)
+ if n_blocks is None:
+ n_blocks = self._config.n_blocks
+
+ # do the matching
+ if n_blocks is None:
+ matches = self._fit_blockwise_auto()
+ else:
+ matches = self._fit_blockwise_manual(n_blocks=n_blocks)
+
+ # enforce symmetries?
+ if force_symmetries and (self._duplicates is None):
+ # convert to lil format for best efficiency when setting
+ # matrix-elements
+ matches = matches.tolil()
+ # matrix diagonal elements must be exactly 1 (numerical precision
+ # errors introduced by floating-point computations in
+ # awesome_cossim_topn sometimes lead to unexpected results)
+ matches = StringGrouper._fix_diagonal(matches)
+ # the list of matches must be symmetric!
+ # (i.e., if A != B and A matches B; then B matches A)
+ matches = StringGrouper._symmetrize_matrix(matches)
+ matches = matches.tocsr()
self._matches_list = self._get_matches_list(matches)
- if self._duplicates is None:
- # the list of matches needs to be symmetric!!! (i.e., if A != B and A matches B; then B matches A)
- self._symmetrize_matches_list()
self.is_build = True
return self
@@ -263,26 +587,23 @@ def dot(self) -> pd.Series:
"""Computes the row-wise similarity scores between strings in _master and _duplicates"""
if len(self._master) != len(self._duplicates):
raise Exception("To perform this function, both input Series must have the same length.")
- master_matrix, duplicate_matrix = self._get_tf_idf_matrices()
+ master_matrix, duplicate_matrix = self._get_left_tf_idf_matrix(), self._get_right_tf_idf_matrix()
# Calculate pairwise cosine similarities:
- pairwise_similarities = np.asarray(master_matrix.multiply(duplicate_matrix).sum(axis=1)).squeeze()
+ pairwise_similarities = np.asarray(master_matrix.multiply(duplicate_matrix).sum(axis=1)).squeeze(axis=1)
return pd.Series(pairwise_similarities, name='similarity', index=self._master.index)
@validate_is_fit
def get_matches(self,
ignore_index: Optional[bool] = None,
- include_zeroes: Optional[bool]=None,
- suppress_warning: Optional[bool]=None) -> pd.DataFrame:
+ include_zeroes: Optional[bool] = None) -> pd.DataFrame:
"""
Returns a DataFrame with all the matches and their cosine similarity.
If optional IDs are used, returned as extra columns with IDs matched to respective data rows
- :param ignore_index: whether or not to exclude string Series index-columns in output. Defaults to
+ :param ignore_index: whether or not to exclude string Series index-columns in output. Defaults to
self._config.ignore_index.
- :param include_zeroes: when the minimum cosine similarity <=0, determines whether zero-similarity matches
+ :param include_zeroes: when the minimum cosine similarity <=0, determines whether zero-similarity matches
appear in the output. Defaults to self._config.include_zeroes.
- :param suppress_warning: when min_similarity <=0 and include_zeroes=True, determines whether or not to suppress
- the message warning that max_n_matches may be too small. Defaults to self._config.suppress_warning.
"""
def get_both_sides(master: pd.Series,
duplicates: pd.Series,
@@ -304,19 +625,20 @@ def prefix_column_names(data: Union[pd.Series, pd.DataFrame], prefix: str):
else:
return data.rename(f"{prefix}{data.name}")
- if ignore_index is None: ignore_index = self._config.ignore_index
- if include_zeroes is None: include_zeroes = self._config.include_zeroes
- if suppress_warning is None: suppress_warning = self._config.suppress_warning
+ if ignore_index is None:
+ ignore_index = self._config.ignore_index
+ if include_zeroes is None:
+ include_zeroes = self._config.include_zeroes
if self._config.min_similarity > 0 or not include_zeroes:
matches_list = self._matches_list
elif include_zeroes:
# Here's a fix to a bug pointed out by one GitHub user (@nbcvijanovic):
- # the fix includes zero-similarity matches that are missing by default
- # in _matches_list due to our use of sparse matrices
- non_matches_list = self._get_non_matches_list(suppress_warning)
+ # the fix includes zero-similarity matches that are missing by default
+ # in _matches_list due to our use of sparse matrices
+ non_matches_list = self._get_non_matches_list()
matches_list = self._matches_list if non_matches_list.empty else \
pd.concat([self._matches_list, non_matches_list], axis=0, ignore_index=True)
-
+
left_side, right_side = get_both_sides(self._master, self._duplicates, drop_index=ignore_index)
similarity = matches_list.similarity.reset_index(drop=True)
if self._master_id is None:
@@ -358,18 +680,128 @@ def get_groups(self,
If there are IDs (master_id and/or duplicates_id) then the IDs corresponding to the string outputs
above are returned as well altogether in a DataFrame.
- :param ignore_index: whether or not to exclude string Series index-columns in output. Defaults to
+ :param ignore_index: whether or not to exclude string Series index-columns in output. Defaults to
self._config.ignore_index.
- :param replace_na: whether or not to replace NaN values in most similar string index-columns with
+ :param replace_na: whether or not to replace NaN values in most similar string index-columns with
corresponding duplicates-index values. Defaults to self._config.replace_na.
"""
- if ignore_index is None: ignore_index = self._config.ignore_index
+ if ignore_index is None:
+ ignore_index = self._config.ignore_index
if self._duplicates is None:
return self._deduplicate(ignore_index=ignore_index)
else:
- if replace_na is None: replace_na = self._config.replace_na
+ if replace_na is None:
+ replace_na = self._config.replace_na
return self._get_nearest_matches(ignore_index=ignore_index, replace_na=replace_na)
+ def match_strings(self,
+ master: pd.Series,
+ duplicates: Optional[pd.Series] = None,
+ master_id: Optional[pd.Series] = None,
+ duplicates_id: Optional[pd.Series] = None,
+ **kwargs) -> pd.DataFrame:
+ """
+ Returns all highly similar strings without rebuilding the corpus.
+ If only 'master' is given, it will return highly similar strings within master.
+ This can be seen as a self-join. If both master and duplicates are given, it will return highly similar strings
+ between master and duplicates. This can be seen as an inner-join.
+
+ :param master: pandas.Series. Series of strings against which matches are calculated.
+ :param duplicates: pandas.Series. Series of strings that will be matched with master if given (Optional).
+ :param master_id: pandas.Series. Series of values that are IDs for master column rows (Optional).
+ :param duplicates_id: pandas.Series. Series of values that are IDs for duplicates column rows (Optional).
+ :param kwargs: All other keyword arguments are passed to StringGrouperConfig.
+ :return: pandas.Dataframe.
+ """
+ self.reset_data(master, duplicates, master_id, duplicates_id)
+ self.update_options(**kwargs)
+ self = self.fit()
+ return self.get_matches()
+
+ def match_most_similar(self,
+ master: pd.Series,
+ duplicates: pd.Series,
+ master_id: Optional[pd.Series] = None,
+ duplicates_id: Optional[pd.Series] = None,
+ **kwargs) -> Union[pd.DataFrame, pd.Series]:
+ """
+ If no IDs ('master_id' and 'duplicates_id') are given, returns, without rebuilding the corpus, a
+ Series of strings of the same length as 'duplicates' where for each string in duplicates the most
+ similar string in 'master' is returned.
+ If there are no similar strings in master for a given string in duplicates
+ (there is no potential match where the cosine similarity is above the threshold [default: 0.8])
+ the original string in duplicates is returned.
+
+ For example the input Series [foooo, bar, baz] (master) and [foooob, bar, new] will return:
+ [foooo, bar, new].
+
+ If IDs (both 'master_id' and 'duplicates_id') are also given, returns a DataFrame of the same strings
+ output in the above case with their corresponding IDs.
+
+ :param master: pandas.Series. Series of strings that the duplicates will be matched with.
+ :param duplicates: pandas.Series. Series of strings that will be matched with the master.
+ :param master_id: pandas.Series. Series of values that are IDs for master column rows. (Optional)
+ :param duplicates_id: pandas.Series. Series of values that are IDs for duplicates column rows. (Optional)
+ :param kwargs: All other keyword arguments are passed to StringGrouperConfig. (Optional)
+ :return: pandas.Series or pandas.DataFrame.
+ """
+ self.reset_data(master, duplicates, master_id, duplicates_id)
+
+ old_max_n_matches = self._max_n_matches
+ new_max_n_matches = None
+ if 'max_n_matches' in kwargs:
+ new_max_n_matches = kwargs['max_n_matches']
+ kwargs['max_n_matches'] = 1
+ self.update_options(**kwargs)
+
+ self = self.fit()
+ output = self.get_groups()
+
+ kwargs['max_n_matches'] = old_max_n_matches if new_max_n_matches is None else new_max_n_matches
+ self.update_options(**kwargs)
+ return output
+
+ def group_similar_strings(self,
+ strings_to_group: pd.Series,
+ string_ids: Optional[pd.Series] = None,
+ **kwargs) -> Union[pd.DataFrame, pd.Series]:
+ """
+ If 'string_ids' is not given, finds all similar strings in 'strings_to_group' without rebuilding the
+ corpus and returns a Series of strings of the same length as 'strings_to_group'. For each group of
+ similar strings a single string is chosen as the 'master' string and is returned for each member of
+ the group.
+
+ For example the input Series: [foooo, foooob, bar] will return [foooo, foooo, bar]. Here 'foooo' and
+ 'foooob' are grouped together into group 'foooo' because they are found to be very similar.
+
+ If string_ids is also given, a DataFrame of the strings and their corresponding IDs is instead returned.
+
+ :param strings_to_group: pandas.Series. The input Series of strings to be grouped.
+ :param string_ids: pandas.Series. The input Series of the IDs of the strings to be grouped. (Optional)
+ :param kwargs: All other keyword arguments are passed to StringGrouperConfig. (Optional)
+ :return: pandas.Series or pandas.DataFrame.
+ """
+ self.reset_data(strings_to_group, master_id=string_ids)
+ self.update_options(**kwargs)
+ self = self.fit()
+ return self.get_groups()
+
+ def compute_pairwise_similarities(self,
+ string_series_1: pd.Series,
+ string_series_2: pd.Series,
+ **kwargs) -> pd.Series:
+ """
+ Computes the similarity scores between two Series of strings row-wise without rebuilding the corpus.
+
+ :param string_series_1: pandas.Series. The first input Series of strings to be compared
+ :param string_series_2: pandas.Series. The second input Series of strings to be compared row-wise with string_series_1
+ :param kwargs: All other keyword arguments are passed to StringGrouperConfig
+ :return: pandas.Series of similarity scores, the same length as string_series_1 and string_series_2
+ """
+ self.reset_data(string_series_1, string_series_2)
+ self.update_options(**kwargs)
+ return self.dot()
+
@validate_is_fit
def add_match(self, master_side: str, dupe_side: str) -> 'StringGrouper':
"""Adds a match if it wasn't found by the fit function"""
@@ -409,19 +841,19 @@ def remove_match(self, master_side: str, dupe_side: str) -> 'StringGrouper':
)]
return self
- def _get_tf_idf_matrices(self) -> Tuple[csr_matrix, csr_matrix]:
- # Fit the tf-idf vectorizer
- self._vectorizer = self._fit_vectorizer()
- # Build the two matrices
- master_matrix = self._vectorizer.transform(self._master)
-
- if self._duplicates is not None:
- duplicate_matrix = self._vectorizer.transform(self._duplicates)
- # IF there is no duplicate matrix, we assume we want to match on the master matrix itself
- else:
- duplicate_matrix = master_matrix
+ def _get_left_tf_idf_matrix(self, partition=(None, None)):
+ # unlike _get_tf_idf_matrices(), _get_left_tf_idf_matrix
+ # does not set the corpus but rather
+ # builds a matrix using the existing corpus
+ return self._vectorizer.transform(
+ self._left_Series.iloc[slice(*partition)])
- return master_matrix, duplicate_matrix
+ def _get_right_tf_idf_matrix(self, partition=(None, None)):
+ # unlike _get_tf_idf_matrices(), _get_right_tf_idf_matrix
+ # does not set the corpus but rather
+ # builds a matrix using the existing corpus
+ return self._vectorizer.transform(
+ self._right_Series.iloc[slice(*partition)])
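+
+ # Note (illustrative): slice(*(None, None)) == slice(None, None), which
+ # selects the whole Series; a partition such as (0, 2) would select only
+ # rows 0 and 1, as in pd.Series(['a', 'b', 'c']).iloc[slice(0, 2)].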
def _fit_vectorizer(self) -> TfidfVectorizer:
# if both dupes and master string series are set - we concat them to fit the vectorizer on all
@@ -433,74 +865,57 @@ def _fit_vectorizer(self) -> TfidfVectorizer:
self._vectorizer.fit(strings)
return self._vectorizer
- def _build_matches(self, master_matrix: csr_matrix, duplicate_matrix: csr_matrix) -> csr_matrix:
+ def _build_matches(self,
+ left_matrix: csr_matrix, right_matrix: csr_matrix,
+ nnz_rows: np.ndarray = None,
+ sort: bool = True) -> csr_matrix:
"""Builds the cossine similarity matrix of two csr matrices"""
- tf_idf_matrix_1 = master_matrix
- tf_idf_matrix_2 = duplicate_matrix.transpose()
-
- optional_kwargs = dict()
- if self._config.number_of_processes > 1:
- optional_kwargs = {
- 'use_threads': True,
- 'n_jobs': self._config.number_of_processes
- }
-
- return awesome_cossim_topn(tf_idf_matrix_1, tf_idf_matrix_2,
- self._config.max_n_matches,
- self._config.min_similarity,
- **optional_kwargs)
-
- def _symmetrize_matches_list(self):
- # [symmetrized matches_list] = [matches_list] UNION [transposed matches_list] (i.e., column-names swapped):
- self._matches_list = self._matches_list.set_index(['master_side', 'dupe_side'])\
- .combine_first(
- self._matches_list.rename(
- columns={
- 'master_side': 'dupe_side',
- 'dupe_side': 'master_side'
- }
- ).set_index(['master_side', 'dupe_side'])
- ).reset_index()
-
- def _get_non_matches_list(self, suppress_warning=False) -> pd.DataFrame:
+ right_matrix = right_matrix.transpose()
+
+ if nnz_rows is None:
+ nnz_rows = np.full(left_matrix.shape[0], 0, dtype=np.int32)
+
+ optional_kwargs = {
+ 'return_best_ntop': True,
+ 'sort': sort,
+ 'use_threads': self._config.number_of_processes > 1,
+ 'n_jobs': self._config.number_of_processes}
+
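+ # With return_best_ntop=True, awesome_cossim_topn returns a tuple:
+ # (matches matrix, largest number of matches actually found for any row);
+ # hence the [0]-indexing of _build_matches' result in the test-suite.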
+ return awesome_cossim_topn(
+ left_matrix, right_matrix,
+ self._max_n_matches,
+ nnz_rows,
+ self._config.min_similarity,
+ **optional_kwargs)
+
+ def _get_matches_list(self,
+ matches: csr_matrix
+ ) -> pd.DataFrame:
+ """Returns a list of all the indices of matches"""
+ r, c = matches.nonzero()
+ d = matches.data
+ return pd.DataFrame({'master_side': c.astype(np.int64),
+ 'dupe_side': r.astype(np.int64),
+ 'similarity': d})
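+
+ # Illustrative example: for a csr_matrix with the single stored entry
+ # m[1, 2] = 0.9, m.nonzero() gives (array([1]), array([2])) and m.data
+ # gives array([0.9]), so the frame above would contain the single row
+ # (master_side=2, dupe_side=1, similarity=0.9).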
+
+ def _get_non_matches_list(self) -> pd.DataFrame:
"""Returns a list of all the indices of non-matching pairs (with similarity set to 0)"""
m_sz, d_sz = len(self._master), len(self._master if self._duplicates is None else self._duplicates)
all_pairs = pd.MultiIndex.from_product([range(m_sz), range(d_sz)], names=['master_side', 'dupe_side'])
matched_pairs = pd.MultiIndex.from_frame(self._matches_list[['master_side', 'dupe_side']])
missing_pairs = all_pairs.difference(matched_pairs)
- if missing_pairs.empty: return pd.DataFrame()
- if (self._config.max_n_matches < d_sz) and not suppress_warning:
- warnings.warn(f'WARNING: max_n_matches={self._config.max_n_matches} may be too small!\n'
- f'\t\t Some zero-similarity matches returned may be false!\n'
- f'\t\t To be absolutely certain all zero-similarity matches are true,\n'
- f'\t\t try setting max_n_matches={d_sz} (the length of the Series parameter duplicates).\n'
- f'\t\t To suppress this warning, set suppress_warning=True.')
+ if missing_pairs.empty:
+ return pd.DataFrame()
+ if self._max_n_matches < self._true_max_n_matches:
+ raise Exception(f'\nERROR: Cannot return zero-similarity matches since \n'
+ f'\t\t max_n_matches={self._max_n_matches} is too small!\n'
+ f'\t\t Try setting max_n_matches={self._true_max_n_matches} (the \n'
+ f'\t\t true maximum number of matches over all strings in master)\n'
+ f'\t\t or greater, or do not set this kwarg at all.')
missing_pairs = missing_pairs.to_frame(index=False)
missing_pairs['similarity'] = 0
return missing_pairs
- @staticmethod
- def _get_matches_list(matches) -> pd.DataFrame:
- """Returns a list of all the indices of matches"""
- non_zeros = matches.nonzero()
-
- sparserows = non_zeros[0]
- sparsecols = non_zeros[1]
- nr_matches = sparsecols.size
- master_side = np.empty([nr_matches], dtype=np.int64)
- dupe_side = np.empty([nr_matches], dtype=np.int64)
- similarity = np.zeros(nr_matches)
-
- for index in range(0, nr_matches):
- master_side[index] = sparserows[index]
- dupe_side[index] = sparsecols[index]
- similarity[index] = matches.data[index]
-
- matches_list = pd.DataFrame({'master_side': master_side,
- 'dupe_side': dupe_side,
- 'similarity': similarity})
- return matches_list
-
def _get_nearest_matches(self,
ignore_index=False,
replace_na=False) -> Union[pd.DataFrame, pd.Series]:
@@ -508,8 +923,8 @@ def _get_nearest_matches(self,
master_label = f'{prefix}{self._master.name if self._master.name else DEFAULT_MASTER_NAME}'
master = self._master.rename(master_label).reset_index(drop=ignore_index)
dupes = self._duplicates.rename('duplicates').reset_index(drop=ignore_index)
-
- # Rename new master-columns to avoid possible conflict with new dupes-columns when later merging
+
+ # Rename new master-columns to avoid possible conflict with new dupes-columns when later merging
if isinstance(dupes, pd.DataFrame):
master.rename(
columns={col: f'{prefix}{col}' for col in master.columns if str(col) != master_label},
@@ -539,14 +954,14 @@ def _get_nearest_matches(self,
if self._master_id is not None:
# Also update the master_id-series with the duplicates_id in cases were there is no match
dupes_max_sim.loc[rows_to_update, master_id_label] = dupes_max_sim[rows_to_update].duplicates_id
-
+
# For some weird reason, pandas' merge function changes int-datatype columns to float when NaN values
# appear within them. So here we change them back to their original datatypes if possible:
if dupes_max_sim[master_id_label].dtype != self._master_id.dtype and \
- self._duplicates_id.dtype == self._master_id.dtype:
+ self._duplicates_id.dtype == self._master_id.dtype:
dupes_max_sim.loc[:, master_id_label] = \
- dupes_max_sim.loc[:, master_id_label].astype(self._master_id.dtype)
-
+ dupes_max_sim.loc[:, master_id_label].astype(self._master_id.dtype)
+
# Prepare the output:
required_column_list = [master_label] if self._master_id is None else [master_id_label, master_label]
index_column_list = \
@@ -556,22 +971,21 @@ def _get_nearest_matches(self,
# Update the master index-columns with the duplicates index-column values in cases were there is no match
dupes_index_columns = [col for col in dupes.columns if str(col) != 'duplicates']
dupes_max_sim.loc[rows_to_update, index_column_list] = \
- dupes_max_sim.loc[rows_to_update, dupes_index_columns].values
-
+ dupes_max_sim.loc[rows_to_update, dupes_index_columns].values
+
# Restore their original datatypes if possible:
for m, d in zip(index_column_list, dupes_index_columns):
if dupes_max_sim[m].dtype != master[m].dtype and dupes[d].dtype == master[m].dtype:
dupes_max_sim.loc[:, m] = dupes_max_sim.loc[:, m].astype(master[m].dtype)
-
+
# Make sure to keep same order as duplicates
dupes_max_sim = dupes_max_sim.sort_values('dupe_side').set_index('dupe_side')
output = dupes_max_sim[index_column_list + required_column_list]
output.index = self._duplicates.index
- return output.squeeze()
+ return output.squeeze(axis=1)
def _deduplicate(self, ignore_index=False) -> Union[pd.DataFrame, pd.Series]:
- # discard self-matches: A matches A
- pairs = self._matches_list[self._matches_list['master_side'] != self._matches_list['dupe_side']]
+ pairs = self._matches_list
# rebuild graph adjacency matrix from already found matches:
n = len(self._master)
graph = csr_matrix(
@@ -599,7 +1013,7 @@ def _deduplicate(self, ignore_index=False) -> Union[pd.DataFrame, pd.Series]:
graph.data = pairs['similarity'].to_numpy()
# sum along the rows to obtain numpy 1D matrix of similarity aggregates then ...
# ... convert to 1D numpy array (using asarray then squeeze) and then to Series:
- group_of_master_index['weight'] = pd.Series(np.asarray(graph.sum(axis=1)).squeeze())
+ group_of_master_index['weight'] = pd.Series(np.asarray(graph.sum(axis=1)).squeeze(axis=1))
method = 'idxmax'
# Determine the group representatives AND merge with indices:
@@ -623,7 +1037,7 @@ def _deduplicate(self, ignore_index=False) -> Union[pd.DataFrame, pd.Series]:
output_id = self._master_id.iloc[group_of_master_index.group_rep].rename(id_label).reset_index(drop=True)
output = pd.concat([output_id, output], axis=1)
output.index = self._master.index
- return output.squeeze()
+ return output
def _get_indices_of(self, master_side: str, dupe_side: str) -> Tuple[pd.Series, pd.Series]:
master_strings = self._master
@@ -634,7 +1048,7 @@ def _get_indices_of(self, master_side: str, dupe_side: str) -> Tuple[pd.Series,
master_indices = master_strings[master_strings == master_side].index.to_series().reset_index(drop=True)
dupe_indices = dupe_strings[dupe_strings == dupe_side].index.to_series().reset_index(drop=True)
return master_indices, dupe_indices
-
+
def _validate_group_rep_specs(self):
group_rep_options = (GROUP_REP_FIRST, GROUP_REP_CENTROID)
if self._config.group_rep not in group_rep_options:
@@ -642,6 +1056,13 @@ def _validate_group_rep_specs(self):
f"Invalid option value for group_rep. The only permitted values are\n {group_rep_options}"
)
+ def _validate_tfidf_matrix_dtype(self):
+ dtype_options = (np.float32, np.float64)
+ if self._config.tfidf_matrix_dtype not in dtype_options:
+ raise Exception(
+ f"Invalid option value for tfidf_matrix_dtype. The only permitted values are\n {dtype_options}"
+ )
+
def _validate_replace_na_and_drop(self):
if self._config.ignore_index and self._config.replace_na:
raise Exception("replace_na can only be set to True when ignore_index=False.")
@@ -651,6 +1072,33 @@ def _validate_replace_na_and_drop(self):
"index if the number of index-levels does not equal the number of index-columns."
)
+ @staticmethod
+ def _validate_n_blocks(n_blocks):
+ errmsg = "Invalid option value for parameter n_blocks: "
+ "n_blocks must be None or a tuple of 2 integers greater than 0."
+ if n_blocks is None:
+ return
+ if not isinstance(n_blocks, tuple):
+ raise Exception(errmsg)
+ if len(n_blocks) != 2:
+ raise Exception(errmsg)
+ if not (isinstance(n_blocks[0], int) and isinstance(n_blocks[1], int)):
+ raise Exception(errmsg)
+ if (n_blocks[0] < 1) or (n_blocks[1] < 1):
+ raise Exception(errmsg)
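+
+ # For example: _validate_n_blocks(None) and _validate_n_blocks((2, 3)) pass
+ # silently, whereas _validate_n_blocks(2), _validate_n_blocks((0, 2)),
+ # _validate_n_blocks((1, 2.5)) and _validate_n_blocks((1,)) all raise.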
+
+ @staticmethod
+ def _fix_diagonal(m: lil_matrix) -> lil_matrix:
+ r = np.arange(m.shape[0])
+ m[r, r] = 1
+ return m
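+
+ # Note: r indexes the whole diagonal at once; for a 3x3 matrix the
+ # assignment above is equivalent to m[0, 0] = m[1, 1] = m[2, 2] = 1,
+ # restoring the perfect self-match score of every string.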
+
+ @staticmethod
+ def _symmetrize_matrix(m_symmetric: lil_matrix) -> lil_matrix:
+ r, c = m_symmetric.nonzero()
+ m_symmetric[c, r] = m_symmetric[r, c]
+ return m_symmetric
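+
+ # Note: if m[0, 1] == 0.7 is stored but m[1, 0] is not, (r, c) contains
+ # the pair (0, 1), so the assignment above also writes m[1, 0] = 0.7;
+ # every stored match thus gains its mirror entry.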
+
@staticmethod
def _make_symmetric(new_matches: pd.DataFrame) -> pd.DataFrame:
columns_switched = pd.DataFrame({'master_side': new_matches.dupe_side,
@@ -678,7 +1126,7 @@ def _is_series_of_strings(series_to_test: pd.Series) -> bool:
return False
elif series_to_test.to_frame().applymap(
lambda x: not isinstance(x, str)
- ).squeeze().any():
+ ).squeeze(axis=1).any():
return False
return True
diff --git a/string_grouper/test/test_string_grouper.py b/string_grouper/test/test_string_grouper.py
index 723d3f22..b159646b 100644
--- a/string_grouper/test/test_string_grouper.py
+++ b/string_grouper/test/test_string_grouper.py
@@ -3,13 +3,15 @@
import numpy as np
from scipy.sparse.csr import csr_matrix
from string_grouper.string_grouper import DEFAULT_MIN_SIMILARITY, \
- DEFAULT_MAX_N_MATCHES, DEFAULT_REGEX, \
- DEFAULT_NGRAM_SIZE, DEFAULT_N_PROCESSES, DEFAULT_IGNORE_CASE, \
+ DEFAULT_REGEX, DEFAULT_NGRAM_SIZE, DEFAULT_N_PROCESSES, DEFAULT_IGNORE_CASE, \
StringGrouperConfig, StringGrouper, StringGrouperNotFitException, \
- match_most_similar, group_similar_strings, match_strings,\
+ match_most_similar, group_similar_strings, match_strings, \
compute_pairwise_similarities
-from unittest.mock import patch
-import warnings
+from unittest.mock import patch, Mock
+
+
+def mock_symmetrize_matrix(x: csr_matrix) -> csr_matrix:
+ return x
class SimpleExample(object):
@@ -93,7 +95,7 @@ def test_config_defaults(self):
"""Empty initialisation should set default values"""
config = StringGrouperConfig()
self.assertEqual(config.min_similarity, DEFAULT_MIN_SIMILARITY)
- self.assertEqual(config.max_n_matches, DEFAULT_MAX_N_MATCHES)
+ self.assertEqual(config.max_n_matches, None)
self.assertEqual(config.regex, DEFAULT_REGEX)
self.assertEqual(config.ngram_size, DEFAULT_NGRAM_SIZE)
self.assertEqual(config.number_of_processes, DEFAULT_N_PROCESSES)
@@ -114,6 +116,251 @@ def test_config_non_default_values(self):
class StringGrouperTest(unittest.TestCase):
+
+ def test_auto_blocking_single_DataFrame(self):
+ """tests whether automatic blocking yields consistent results"""
+ # This function will force an OverflowError to occur when
+ # the input Series have a combined length above a given number:
+ # OverflowThreshold. This will in turn trigger automatic splitting
+ # of the Series/matrices into smaller blocks when n_blocks = None
+
+ sort_cols = ['right_index', 'left_index']
+
+ def fix_row_order(df):
+ return df.sort_values(sort_cols).reset_index(drop=True)
+
+ simple_example = SimpleExample()
+ df1 = simple_example.customers_df2['Customer Name']
+
+ # first do manual blocking
+ sg = StringGrouper(df1, min_similarity=0.1)
+ pd.testing.assert_series_equal(sg.master, df1)
+ self.assertEqual(sg.duplicates, None)
+
+ matches = fix_row_order(sg.match_strings(df1, n_blocks=(1, 1)))
+ self.assertEqual(sg._config.n_blocks, (1, 1))
+
+ # Create a custom wrapper for this StringGrouper instance's
+ # _build_matches() method which will later be used to
+ # mock _build_matches().
+ # Note that we have to define the wrapper here because
+ # _build_matches() is a non-static function of StringGrouper
+ # and needs access to the specific StringGrouper instance sg
+ # created here.
+ def mock_build_matches(OverflowThreshold,
+ real_build_matches=sg._build_matches):
+ def wrapper(left_matrix,
+ right_matrix,
+ nnz_rows=None,
+ sort=True):
+ if (left_matrix.shape[0] + right_matrix.shape[0]) > \
+ OverflowThreshold:
+ raise OverflowError
+ return real_build_matches(left_matrix, right_matrix, nnz_rows, sort)
+ return wrapper
+
+ def do_test_with(OverflowThreshold):
+ nonlocal sg # allows reference to sg, as sg will be modified below
+ # Now let us mock sg._build_matches:
+ sg._build_matches = Mock(side_effect=mock_build_matches(OverflowThreshold))
+ sg.clear_data()
+ matches_auto = fix_row_order(sg.match_strings(df1, n_blocks=None))
+ pd.testing.assert_series_equal(sg.master, df1)
+ pd.testing.assert_frame_equal(matches, matches_auto)
+ self.assertEqual(sg._config.n_blocks, None)
+ # Note that _build_matches is called more than once if and only if
+ # a split occurred (that is, there was more than one pair of
+ # matrix-blocks multiplied)
+ if len(sg._left_Series) + len(sg._right_Series) > \
+ OverflowThreshold:
+ # Assert that split occurred:
+ self.assertGreater(sg._build_matches.call_count, 1)
+ else:
+ # Assert that split did not occur:
+ self.assertEqual(sg._build_matches.call_count, 1)
+
+ # now test auto blocking by forcing an OverflowError when the
+ # combined Series length exceeds 10, 5, 3, or 2
+
+ do_test_with(OverflowThreshold=100) # does not trigger auto blocking
+ do_test_with(OverflowThreshold=10)
+ do_test_with(OverflowThreshold=5)
+ do_test_with(OverflowThreshold=3)
+ do_test_with(OverflowThreshold=2)
+
+ def test_n_blocks_single_DataFrame(self):
+ """tests whether manual blocking yields consistent results"""
+ sort_cols = ['right_index', 'left_index']
+
+ def fix_row_order(df):
+ return df.sort_values(sort_cols).reset_index(drop=True)
+
+ simple_example = SimpleExample()
+ df1 = simple_example.customers_df2['Customer Name']
+
+ matches11 = fix_row_order(match_strings(df1, min_similarity=0.1))
+
+ matches12 = fix_row_order(
+ match_strings(df1, n_blocks=(1, 2), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches12)
+
+ matches13 = fix_row_order(
+ match_strings(df1, n_blocks=(1, 3), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches13)
+
+ matches14 = fix_row_order(
+ match_strings(df1, n_blocks=(1, 4), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches14)
+
+ matches15 = fix_row_order(
+ match_strings(df1, n_blocks=(1, 5), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches15)
+
+ matches16 = fix_row_order(
+ match_strings(df1, n_blocks=(1, 6), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches16)
+
+ matches17 = fix_row_order(
+ match_strings(df1, n_blocks=(1, 7), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches17)
+
+ matches18 = fix_row_order(
+ match_strings(df1, n_blocks=(1, 8), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches18)
+
+ matches21 = fix_row_order(
+ match_strings(df1, n_blocks=(2, 1), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches21)
+
+ matches22 = fix_row_order(
+ match_strings(df1, n_blocks=(2, 2), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches22)
+
+ matches32 = fix_row_order(
+ match_strings(df1, n_blocks=(3, 2), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches32)
+
+ # Create a custom wrapper for this StringGrouper instance's
+ # _build_matches() method which will later be used to
+ # mock _build_matches().
+ # Note that we have to define the wrapper here because
+ # _build_matches() is a non-static function of StringGrouper
+ # and needs access to the specific StringGrouper instance sg
+ # created here.
+ sg = StringGrouper(df1, min_similarity=0.1)
+
+ def mock_build_matches(OverflowThreshold,
+ real_build_matches=sg._build_matches):
+ def wrapper(left_matrix,
+ right_matrix,
+ nnz_rows=None,
+ sort=True):
+ if (left_matrix.shape[0] + right_matrix.shape[0]) > \
+ OverflowThreshold:
+ raise OverflowError
+ return real_build_matches(left_matrix, right_matrix, nnz_rows, sort)
+ return wrapper
+
+ def test_overflow_error_with(OverflowThreshold, n_blocks):
+ nonlocal sg
+ sg._build_matches = Mock(side_effect=mock_build_matches(OverflowThreshold))
+ sg.clear_data()
+ max_left_block_size = (len(df1)//n_blocks[0]
+ + (1 if len(df1) % n_blocks[0] > 0 else 0))
+ max_right_block_size = (len(df1)//n_blocks[1]
+ + (1 if len(df1) % n_blocks[1] > 0 else 0))
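+ # each expression above computes ceil(len(df1)/n_blocks[i]), the size of
+ # the largest block on that side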
+ if (max_left_block_size + max_right_block_size) > OverflowThreshold:
+ with self.assertRaises(Exception):
+ _ = sg.match_strings(df1, n_blocks=n_blocks)
+ else:
+ matches_manual = fix_row_order(sg.match_strings(df1, n_blocks=n_blocks))
+ pd.testing.assert_frame_equal(matches11, matches_manual)
+
+ test_overflow_error_with(OverflowThreshold=100, n_blocks=(1, 1))
+ test_overflow_error_with(OverflowThreshold=10, n_blocks=(1, 1))
+ test_overflow_error_with(OverflowThreshold=10, n_blocks=(2, 1))
+ test_overflow_error_with(OverflowThreshold=10, n_blocks=(1, 2))
+ test_overflow_error_with(OverflowThreshold=10, n_blocks=(4, 4))
+
+ def test_n_blocks_both_DataFrames(self):
+ """tests whether manual blocking yields consistent results"""
+ sort_cols = ['right_index', 'left_index']
+
+ def fix_row_order(df):
+ return df.sort_values(sort_cols).reset_index(drop=True)
+
+ simple_example = SimpleExample()
+ df1 = simple_example.customers_df['Customer Name']
+ df2 = simple_example.customers_df2['Customer Name']
+
+ matches11 = fix_row_order(match_strings(df1, df2, min_similarity=0.1))
+
+ matches12 = fix_row_order(
+ match_strings(df1, df2, n_blocks=(1, 2), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches12)
+
+ matches13 = fix_row_order(
+ match_strings(df1, df2, n_blocks=(1, 3), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches13)
+
+ matches14 = fix_row_order(
+ match_strings(df1, df2, n_blocks=(1, 4), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches14)
+
+ matches15 = fix_row_order(
+ match_strings(df1, df2, n_blocks=(1, 5), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches15)
+
+ matches16 = fix_row_order(
+ match_strings(df1, df2, n_blocks=(1, 6), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches16)
+
+ matches17 = fix_row_order(
+ match_strings(df1, df2, n_blocks=(1, 7), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches17)
+
+ matches18 = fix_row_order(
+ match_strings(df1, df2, n_blocks=(1, 8), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches18)
+
+ matches21 = fix_row_order(
+ match_strings(df1, df2, n_blocks=(2, 1), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches21)
+
+ matches22 = fix_row_order(
+ match_strings(df1, df2, n_blocks=(2, 2), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches22)
+
+ matches32 = fix_row_order(
+ match_strings(df1, df2, n_blocks=(3, 2), min_similarity=0.1))
+ pd.testing.assert_frame_equal(matches11, matches32)
+
+ def test_n_blocks_bad_option_value(self):
+ """Tests that bad option values for n_blocks are caught"""
+ simple_example = SimpleExample()
+ df1 = simple_example.customers_df2['Customer Name']
+ with self.assertRaises(Exception):
+ _ = match_strings(df1, n_blocks=2)
+ with self.assertRaises(Exception):
+ _ = match_strings(df1, n_blocks=(0, 2))
+ with self.assertRaises(Exception):
+ _ = match_strings(df1, n_blocks=(1, 2.5))
+ with self.assertRaises(Exception):
+ _ = match_strings(df1, n_blocks=(1, 2, 3))
+ with self.assertRaises(Exception):
+ _ = match_strings(df1, n_blocks=(1, ))
+
+ def test_tfidf_dtype_bad_option_value(self):
+ """Tests that bad option values for n_blocks are caught"""
+ simple_example = SimpleExample()
+ df1 = simple_example.customers_df2['Customer Name']
+ with self.assertRaises(Exception):
+ _ = match_strings(df1, tfidf_matrix_dtype=None)
+ with self.assertRaises(Exception):
+ _ = match_strings(df1, tfidf_matrix_dtype=0)
+ with self.assertRaises(Exception):
+ _ = match_strings(df1, tfidf_matrix_dtype='whatever')
+
def test_compute_pairwise_similarities(self):
"""tests the high-level function compute_pairwise_similarities"""
simple_example = SimpleExample()
@@ -131,6 +378,10 @@ def test_compute_pairwise_similarities(self):
],
name='similarity'
)
+ expected_result = expected_result.astype(np.float32)
+ pd.testing.assert_series_equal(expected_result, similarities)
+ sg = StringGrouper(df1, df2)
+ similarities = sg.compute_pairwise_similarities(df1, df2)
pd.testing.assert_series_equal(expected_result, similarities)
def test_compute_pairwise_similarities_data_integrity(self):
@@ -197,14 +448,17 @@ def test_match_strings(self, mock_StringGouper):
mock_StringGrouper_instance.get_matches.assert_called_once()
self.assertEqual(df, 'whatever')
- @patch('string_grouper.string_grouper.StringGrouper._symmetrize_matches_list')
- def test_match_list_symmetry_without_symmetrize_function(self, mock_symmetrize_matches_list):
- """mocks StringGrouper._symmetrize_matches_list so that this test fails whenever _matches_list is
+ @patch(
+ 'string_grouper.string_grouper.StringGrouper._symmetrize_matrix',
+ side_effect=mock_symmetrize_matrix
+ )
+ def test_match_list_symmetry_without_symmetrize_function(self, mock_symmetrize_matrix_param):
+ """mocks StringGrouper._symmetrize_matches_list so that this test fails whenever _matches_list is
**partially** symmetric which often occurs when the kwarg max_n_matches is too small"""
simple_example = SimpleExample()
df = simple_example.customers_df2['Customer Name']
sg = StringGrouper(df, max_n_matches=2).fit()
- mock_symmetrize_matches_list.assert_called_once()
+ mock_symmetrize_matrix_param.assert_called_once()
# obtain the upper and lower triangular parts of the matrix of matches:
upper = sg._matches_list[sg._matches_list['master_side'] < sg._matches_list['dupe_side']]
lower = sg._matches_list[sg._matches_list['master_side'] > sg._matches_list['dupe_side']]
@@ -213,7 +467,7 @@ def test_match_list_symmetry_without_symmetrize_function(self, mock_symmetrize_m
# obtain the intersection between upper and upper_prime:
intersection = upper_prime.merge(upper, how='inner', on=['master_side', 'dupe_side'])
# if the intersection is empty then _matches_list is completely non-symmetric (this is acceptable)
- # if the intersection is not empty then at least some matches are repeated.
+ # if the intersection is not empty then at least some matches are repeated.
# To make sure all (and not just some) matches are repeated, the lengths of
# upper, upper_prime and their intersection should be identical.
self.assertFalse(intersection.empty or len(upper) == len(upper_prime) == len(intersection))
@@ -231,38 +485,53 @@ def test_match_list_symmetry_with_symmetrize_function(self):
# Obtain the intersection between upper and upper_prime:
intersection = upper_prime.merge(upper, how='inner', on=['master_side', 'dupe_side'])
# If the intersection is empty this means _matches_list is completely non-symmetric (this is acceptable)
- # If the intersection is not empty this means at least some matches are repeated.
+ # If the intersection is not empty this means at least some matches are repeated.
# To make sure all (and not just some) matches are repeated, the lengths of
# upper, upper_prime and their intersection should be identical.
self.assertTrue(intersection.empty or len(upper) == len(upper_prime) == len(intersection))
- def test_match_list_diagonal(self):
+ @patch(
+ 'string_grouper.string_grouper.StringGrouper._fix_diagonal',
+ side_effect=mock_symmetrize_matrix
+ )
+ def test_match_list_diagonal_without_the_fix(self, mock_fix_diagonal):
"""test fails whenever _matches_list's number of self-joins is not equal to the number of strings"""
# This bug is difficult to reproduce -- I mostly encounter it while working with very large datasets;
# for small datasets setting max_n_matches=1 reproduces the bug
simple_example = SimpleExample()
df = simple_example.customers_df['Customer Name']
matches = match_strings(df, max_n_matches=1)
+ mock_fix_diagonal.assert_called_once()
num_self_joins = len(matches[matches['left_index'] == matches['right_index']])
num_strings = len(df)
self.assertNotEqual(num_self_joins, num_strings)
+ def test_match_list_diagonal(self):
+ """This test ensures that all self-joins are present"""
+ # This bug is difficult to reproduce -- I mostly encounter it while working with very large datasets;
+ # for small datasets setting max_n_matches=1 reproduces the bug
+ simple_example = SimpleExample()
+ df = simple_example.customers_df['Customer Name']
+ matches = match_strings(df, max_n_matches=1)
+ num_self_joins = len(matches[matches['left_index'] == matches['right_index']])
+ num_strings = len(df)
+ self.assertEqual(num_self_joins, num_strings)
+
def test_zero_min_similarity(self):
- """Since sparse matrices exclude zero elements, this test ensures that zero similarity matches are
+ """Since sparse matrices exclude zero elements, this test ensures that zero similarity matches are
returned when min_similarity <= 0. A bug related to this was first pointed out by @nbcvijanovic"""
simple_example = SimpleExample()
s_master = simple_example.customers_df['Customer Name']
s_dup = simple_example.whatever_series_1
- matches = match_strings(s_master, s_dup, max_n_matches=len(s_master), min_similarity=0)
+ matches = match_strings(s_master, s_dup, min_similarity=0)
pd.testing.assert_frame_equal(simple_example.expected_result_with_zeroes, matches)
def test_zero_min_similarity_small_max_n_matches(self):
- """This test ensures that a warning is issued when n_max_matches is suspected to be too small while
+ """This test ensures that a warning is issued when n_max_matches is suspected to be too small while
min_similarity <= 0 and include_zeroes is True"""
simple_example = SimpleExample()
s_master = simple_example.customers_df['Customer Name']
s_dup = simple_example.two_strings
- warnings.simplefilter('error', UserWarning)
with self.assertRaises(Exception):
_ = match_strings(s_master, s_dup, max_n_matches=1, min_similarity=0)
@@ -276,7 +545,7 @@ def test_get_non_matches_empty_case(self):
def test_n_grams_case_unchanged(self):
"""Should return all ngrams in a string with case"""
- test_series = pd.Series(pd.Series(['aa']))
+ test_series = pd.Series(['aaa'])
# Explicit do not ignore case
sg = StringGrouper(test_series, ignore_case=False)
expected_result = ['McD', 'cDo', 'Don', 'ona', 'nal', 'ald', 'lds']
@@ -284,7 +553,7 @@ def test_n_grams_case_unchanged(self):
def test_n_grams_ignore_case_to_lower(self):
"""Should return all case insensitive ngrams in a string"""
- test_series = pd.Series(pd.Series(['aa']))
+ test_series = pd.Series(['aaa'])
# Explicit ignore case
sg = StringGrouper(test_series, ignore_case=True)
expected_result = ['mcd', 'cdo', 'don', 'ona', 'nal', 'ald', 'lds']
@@ -292,7 +561,7 @@ def test_n_grams_ignore_case_to_lower(self):
def test_n_grams_ignore_case_to_lower_with_defaults(self):
"""Should return all case insensitive ngrams in a string"""
- test_series = pd.Series(pd.Series(['aa']))
+ test_series = pd.Series(['aaa'])
# Implicit default case (i.e. default behaviour)
sg = StringGrouper(test_series)
expected_result = ['mcd', 'cdo', 'don', 'ona', 'nal', 'ald', 'lds']
@@ -302,7 +571,7 @@ def test_build_matrix(self):
"""Should create a csr matrix only master"""
test_series = pd.Series(['foo', 'bar', 'baz'])
sg = StringGrouper(test_series)
- master, dupe = sg._get_tf_idf_matrices()
+ master, dupe = sg._get_right_tf_idf_matrix(), sg._get_left_tf_idf_matrix()
c = csr_matrix([[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.]])
@@ -314,7 +583,7 @@ def test_build_matrix_master_and_duplicates(self):
test_series_1 = pd.Series(['foo', 'bar', 'baz'])
test_series_2 = pd.Series(['foo', 'bar', 'bop'])
sg = StringGrouper(test_series_1, test_series_2)
- master, dupe = sg._get_tf_idf_matrices()
+ master, dupe = sg._get_right_tf_idf_matrix(), sg._get_left_tf_idf_matrix()
master_expected = csr_matrix([[0., 0., 0., 1.],
[1., 0., 0., 0.],
[0., 1., 0., 0.]])
@@ -330,12 +599,12 @@ def test_build_matches(self):
test_series_1 = pd.Series(['foo', 'bar', 'baz'])
test_series_2 = pd.Series(['foo', 'bar', 'bop'])
sg = StringGrouper(test_series_1, test_series_2)
- master, dupe = sg._get_tf_idf_matrices()
+ master, dupe = sg._get_right_tf_idf_matrix(), sg._get_left_tf_idf_matrix()
expected_matches = np.array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 0.]])
- np.testing.assert_array_equal(expected_matches, sg._build_matches(master, dupe).toarray())
+ np.testing.assert_array_equal(expected_matches, sg._build_matches(master, dupe)[0].toarray())
def test_build_matches_list(self):
"""Should create the cosine similarity matrix of two series"""
@@ -347,6 +616,7 @@ def test_build_matches_list(self):
dupe_side = [0, 1]
similarity = [1.0, 1.0]
expected_df = pd.DataFrame({'master_side': master, 'dupe_side': dupe_side, 'similarity': similarity})
+ expected_df.loc[:, 'similarity'] = expected_df.loc[:, 'similarity'].astype(sg._config.tfidf_matrix_dtype)
pd.testing.assert_frame_equal(expected_df, sg._matches_list)
def test_case_insensitive_build_matches_list(self):
@@ -359,6 +629,7 @@ def test_case_insensitive_build_matches_list(self):
dupe_side = [0, 1]
similarity = [1.0, 1.0]
expected_df = pd.DataFrame({'master_side': master, 'dupe_side': dupe_side, 'similarity': similarity})
+ expected_df.loc[:, 'similarity'] = expected_df.loc[:, 'similarity'].astype(sg._config.tfidf_matrix_dtype)
pd.testing.assert_frame_equal(expected_df, sg._matches_list)
def test_get_matches_two_dataframes(self):
@@ -373,6 +644,7 @@ def test_get_matches_two_dataframes(self):
expected_df = pd.DataFrame({'left_index': left_index, 'left_side': left_side,
'similarity': similarity,
'right_side': right_side, 'right_index': right_index})
+ expected_df.loc[:, 'similarity'] = expected_df.loc[:, 'similarity'].astype(sg._config.tfidf_matrix_dtype)
pd.testing.assert_frame_equal(expected_df, sg.get_matches())
def test_get_matches_single(self):
@@ -381,12 +653,13 @@ def test_get_matches_single(self):
sg = sg.fit()
left_side = ['foo', 'foo', 'bar', 'baz', 'foo', 'foo']
right_side = ['foo', 'foo', 'bar', 'baz', 'foo', 'foo']
- left_index = [0, 0, 1, 2, 3, 3]
- right_index = [0, 3, 1, 2, 0, 3]
+ left_index = [0, 3, 1, 2, 0, 3]
+ right_index = [0, 0, 1, 2, 3, 3]
similarity = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
expected_df = pd.DataFrame({'left_index': left_index, 'left_side': left_side,
'similarity': similarity,
'right_side': right_side, 'right_index': right_index})
+ expected_df.loc[:, 'similarity'] = expected_df.loc[:, 'similarity'].astype(sg._config.tfidf_matrix_dtype)
pd.testing.assert_frame_equal(expected_df, sg.get_matches())
def test_get_matches_1_series_1_id_series(self):
@@ -395,15 +668,16 @@ def test_get_matches_1_series_1_id_series(self):
sg = StringGrouper(test_series_1, master_id=test_series_id_1)
sg = sg.fit()
left_side = ['foo', 'foo', 'bar', 'baz', 'foo', 'foo']
- left_side_id = ['A0', 'A0', 'A1', 'A2', 'A3', 'A3']
- left_index = [0, 0, 1, 2, 3, 3]
+ left_side_id = ['A0', 'A3', 'A1', 'A2', 'A0', 'A3']
+ left_index = [0, 3, 1, 2, 0, 3]
right_side = ['foo', 'foo', 'bar', 'baz', 'foo', 'foo']
- right_side_id = ['A0', 'A3', 'A1', 'A2', 'A0', 'A3']
- right_index = [0, 3, 1, 2, 0, 3]
+ right_side_id = ['A0', 'A0', 'A1', 'A2', 'A3', 'A3']
+ right_index = [0, 0, 1, 2, 3, 3]
similarity = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
expected_df = pd.DataFrame({'left_index': left_index, 'left_side': left_side, 'left_id': left_side_id,
'similarity': similarity,
'right_id': right_side_id, 'right_side': right_side, 'right_index': right_index})
+ expected_df.loc[:, 'similarity'] = expected_df.loc[:, 'similarity'].astype(sg._config.tfidf_matrix_dtype)
pd.testing.assert_frame_equal(expected_df, sg.get_matches())
def test_get_matches_2_series_2_id_series(self):
@@ -423,6 +697,7 @@ def test_get_matches_2_series_2_id_series(self):
expected_df = pd.DataFrame({'left_index': left_index, 'left_side': left_side, 'left_id': left_side_id,
'similarity': similarity,
'right_id': right_side_id, 'right_side': right_side, 'right_index': right_index})
+ expected_df.loc[:, 'similarity'] = expected_df.loc[:, 'similarity'].astype(sg._config.tfidf_matrix_dtype)
pd.testing.assert_frame_equal(expected_df, sg.get_matches())
def test_get_matches_raises_exception_if_unexpected_options_given(self):
@@ -469,6 +744,61 @@ def test_get_groups_single_df_group_rep_default(self):
ignore_index=True
)
)
+ sg = StringGrouper(customers_df['Customer Name'])
+ pd.testing.assert_series_equal(
+ simple_example.expected_result_centroid,
+ sg.group_similar_strings(
+ customers_df['Customer Name'],
+ min_similarity=0.6,
+ ignore_index=True
+ )
+ )
+
+ def test_get_groups_single_valued_series(self):
+ """This test ensures that get_groups() returns a single-valued DataFrame or Series object
+ since the input-series is also single-valued. This test was created in response to a bug discovered
+ by George Walker"""
+ pd.testing.assert_frame_equal(
+ pd.DataFrame([(0, "hello")], columns=['group_rep_index', 'group_rep']),
+ group_similar_strings(
+ pd.Series(["hello"]),
+ min_similarity=0.6
+ )
+ )
+ pd.testing.assert_series_equal(
+ pd.Series(["hello"], name='group_rep'),
+ group_similar_strings(
+ pd.Series(["hello"]),
+ min_similarity=0.6,
+ ignore_index=True
+ )
+ )
+ pd.testing.assert_frame_equal(
+ pd.DataFrame([(0, "hello")], columns=['most_similar_index', 'most_similar_master']),
+ match_most_similar(
+ pd.Series(["hello"]),
+ pd.Series(["hello"]),
+ min_similarity=0.6
+ )
+ )
+ pd.testing.assert_frame_equal(
+ pd.DataFrame([(0, "hello")], columns=['most_similar_index', 'most_similar_master']),
+ match_most_similar(
+ pd.Series(["hello"]),
+ pd.Series(["hello"]),
+ min_similarity=0.6,
+ max_n_matches=20
+ )
+ )
+ pd.testing.assert_series_equal(
+ pd.Series(["hello"], name='most_similar_master'),
+ match_most_similar(
+ pd.Series(["hello"]),
+ pd.Series(["hello"]),
+ min_similarity=0.6,
+ ignore_index=True
+ )
+ )
def test_get_groups_single_df_keep_index(self):
"""Should return a pd.Series object with the same length as the original df. The series object will contain
@@ -542,6 +872,8 @@ def test_get_groups_two_df(self):
result = sg.get_groups()
expected_result = pd.Series(['foooo', 'bar', 'baz', 'foooo'], name='most_similar_master')
pd.testing.assert_series_equal(expected_result, result)
+ result = sg.match_most_similar(test_series_1, test_series_2, max_n_matches=3)
+ pd.testing.assert_series_equal(expected_result, result)
def test_get_groups_2_string_series_2_id_series(self):
"""Should return a pd.DataFrame object with the length of the dupes. The series will contain the master string
@@ -610,9 +942,9 @@ def test_get_groups_4_df_same_similarity(self):
test_series_2 = pd.Series(['foooo', 'bar', 'baz', 'foooob'])
test_series_id_1 = pd.Series(['A0', 'A1', 'A2', 'A3'])
test_series_id_2 = pd.Series(['B0', 'B1', 'B2', 'B3'])
- sg = StringGrouper(test_series_1,
- test_series_2,
- master_id=test_series_id_1,
+ sg = StringGrouper(test_series_1,
+ test_series_2,
+ master_id=test_series_id_1,
duplicates_id=test_series_id_2,
ignore_index=True)
sg = sg.fit()
diff --git a/string_grouper_utils/string_grouper_utils.py b/string_grouper_utils/string_grouper_utils.py
index 11803a32..e674367b 100644
--- a/string_grouper_utils/string_grouper_utils.py
+++ b/string_grouper_utils/string_grouper_utils.py
@@ -1,7 +1,7 @@
-import numpy as np
import pandas as pd
from typing import List, Optional, Union
from dateutil.parser import parse
+from dateutil.tz import UTC
from numbers import Number
from datetime import datetime
import re
@@ -137,19 +137,19 @@ def get_column(col: Union[str, int, List[Union[str, int]]], data: pd.DataFrame):
def parse_timestamps(timestamps: pd.Series, parserinfo=None, **kwargs) -> pd.Series:
- error_msg = f"timestamps must be a Series of date-like or datetime-like strings"
- error_msg += f" or datetime datatype or pandas Timestamp datatype or numbers"
+ error_msg = "timestamps must be a Series of date-like or datetime-like strings"
+ error_msg += " or datetime datatype or pandas Timestamp datatype or numbers"
if is_series_of_type(str, timestamps):
# if any of the strings is not datetime-like raise an exception
if timestamps.to_frame().applymap(is_date).squeeze().all():
- # convert strings to numpy datetime64
- return timestamps.transform(lambda x: np.datetime64(parse(x, parserinfo, **kwargs)))
+ # convert strings to timezone-aware datetimes (UTC)
+ return timestamps.transform(lambda x: parse(x, parserinfo, **kwargs).astimezone(UTC))
elif is_series_of_type(type(pd.Timestamp('15-1-2000')), timestamps):
# convert pandas Timestamps to numpy datetime64
return timestamps.transform(lambda x: x.to_numpy())
elif is_series_of_type(datetime, timestamps):
- # convert python datetimes to numpy datetime64
- return timestamps.transform(lambda x: np.datetime64(x))
+ # convert python datetimes to timezone-aware datetimes (UTC)
+ return timestamps.transform(lambda x: x.astimezone(UTC))
elif is_series_of_type(Number, timestamps):
return timestamps
raise Exception(error_msg)
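+
+ # Illustrative note: with the normalization above, a string such as
+ # '2021-09-21 10:00+02:00' parses to a timezone-aware datetime which is then
+ # converted to 08:00 UTC, so timestamps from mixed timezones become comparable.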
diff --git a/string_grouper_utils/test/test_string_grouper_utils.py b/string_grouper_utils/test/test_string_grouper_utils.py
index 3798e3cd..0c8a8ee4 100644
--- a/string_grouper_utils/test/test_string_grouper_utils.py
+++ b/string_grouper_utils/test/test_string_grouper_utils.py
@@ -1,8 +1,8 @@
import unittest
import pandas as pd
from dateutil.parser import parse
-from string_grouper_utils.string_grouper_utils import new_group_rep_by_earliest_timestamp, new_group_rep_by_completeness, \
- new_group_rep_by_highest_weight
+from string_grouper_utils.string_grouper_utils import new_group_rep_by_earliest_timestamp, \
+ new_group_rep_by_completeness, new_group_rep_by_highest_weight
class SimpleExample(object):
diff --git a/time_match_strings.py b/time_match_strings.py
new file mode 100644
index 00000000..ee87b204
--- /dev/null
+++ b/time_match_strings.py
@@ -0,0 +1,63 @@
+import pandas as pd
+import numpy as np
+from string_grouper import match_strings
+import random
+import time
+import os
+
+# mem_limit = '1G'
+# procgov = r'C:\Users\heamu\Source\Repos\process-governor\ProcessGovernor\bin\x64\Debug\procgov.exe'
+# os.popen(f'{procgov} -r -m {mem_limit} -p {os.getpid()}')
+# time.sleep(1)
+progress = 0
+do_print = True
+companies = pd.read_csv('data/sec__edgar_company_info.csv')
+x0 = 10000
+Nx = 10000
+dNx = 1000
+Nx2 = 500000
+dNx2 = 50000
+y0 = 10000
+Ny = 10000
+dNy = 10000
+ns = 10
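+# x0/Nx/dNx: start/stop/step of the left Series sizes; y0/Ny/dNy: the same for
+# the right Series; ns: number of repeated samples averaged over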
+# X = np.append(np.arange(dNx, Nx + 1, dNx), np.arange(dNx2 + dNx2, Nx2 + 1, dNx2))
+X = np.arange(x0, Nx + 1, dNx)
+Y = np.arange(y0, Ny + 1, dNy)
+means = np.full((len(X), len(Y)), 0)
+for s in range(ns):
+ dgrid = []
+ i = 1
+ _ = print('[', flush=True, end='') if do_print else None
+ for x in X:
+ left_df = companies['Company Name'].iloc[random.sample(range(len(companies)), k=x)]
+ if i > 1:
+ _ = print(', ', flush=True) if do_print else None
+ dseries = []
+ stdseries = []
+ _ = print('[', flush=True, end='') if do_print else None
+ j = 1
+ for y in Y:
+ if j > 1:
+ _ = print(', ', flush=True, end='') if do_print else None
+ right_df = companies['Company Name'].iloc[random.sample(range(len(companies)), k=y)]
+ t0 = time.time()
+ _ = match_strings(right_df, left_df, n_blocks=(1, 1))
+ t1 = time.time()
+ dseries += [(t1 - t0)/60]
+ progress += 1.0/(ns*len(X)*len(Y))
+ # print(f'Progress {progress:.1%}', end='\x1b[1K\r')
+ _ = print(f'{dseries[-1]}', flush=True, end='') if do_print else None
+ # _ = print('.', flush=True, end='') if not do_print else None
+ j += 1
+ _ = print(']', flush=True, end='') if do_print else None
+ dgrid += [dseries]
+ i += 1
+ # _ = print(f'{i}/{len(X)}', flush=True) if not do_print else None
+ _ = print(']', flush=True) if do_print else None
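+ # update the running average of the timing grid over the s+1 samples so far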
+ means = (np.asarray(dgrid) + s*means)/(s + 1)
+ with open(f'runtime_means_x_{x0}-{Nx}_y_{y0}-{Ny}.npy', 'wb') as f:
+ np.save(f, means)
+ np.save(f, X)
+ np.save(f, Y)
+ #send_me_mail()