Commit b6180ae

changed default value of kwarg max_n_matches to #strings in master
1 parent 6711bb7 commit b6180ae

3 files changed: +11 -4 lines changed

CHANGELOG.md

+7
@@ -7,6 +7,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Changed
+
+## [0.5.1?] - 2021-07-05
+
+* Improved the performance of the function `match_most_similar`.
+* Changed the default value of the keyword argument `max_n_matches` to the number of strings in `master`. (`max_n_matches` is now defined as the maximum number of matches allowed per string in `duplicates` \[or `master` if `duplicates` is not given\]).
+
 ## [0.5.0] - 2021-06-11
 
 ### Added

README.md

+1 -1
@@ -136,7 +136,7 @@ All functions are built using a class **`StringGrouper`**. This class can be use
 * **`ngram_size`**: The amount of characters in each n-gram. Default is `3`.
 * **`tfidf_matrix_dtype`**: The datatype for the tf-idf values of the matrix components. Allowed values are `numpy.float32` and `numpy.float64`. Default is `numpy.float32`. (Note: `numpy.float32` often leads to faster processing and a smaller memory footprint albeit less numerical precision than `numpy.float64`.)
 * **`regex`**: The regex string used to clean-up the input string. Default is `"[,-./]|\s"`.
-* **`max_n_matches`**: The maximum number of matches allowed per string in `master`. Default is the number of strings in `duplicates` (or `master`, if `duplicates` is not given).
+* **`max_n_matches`**: The maximum number of matches allowed per string in `duplicates` (or `master` if `duplicates` is not given). Default is the number of strings in `master`.
 * **`min_similarity`**: The minimum cosine similarity for two strings to be considered a match.
 Defaults to `0.8`
 * **`number_of_processes`**: The number of processes used by the cosine similarity calculation. Defaults to
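
The revised `max_n_matches` wording can be illustrated with a toy sketch. This is not the library's implementation (which relies on `sparse_dot_topn`); the function `top_matches`, the string labels, and the similarity scores below are all hypothetical. It only shows what "maximum number of matches allowed per string in `duplicates`" means, and why a default equal to the number of strings in `master` never truncates the result.

```python
from typing import Dict, List, Tuple

# Hypothetical similarity scores: outer keys are strings from `duplicates`,
# inner keys are strings from `master`. The values are made up for illustration.
similarity: Dict[str, Dict[str, float]] = {
    "dup1": {"m1": 0.95, "m2": 0.90, "m3": 0.85},
    "dup2": {"m1": 0.50, "m2": 0.99, "m3": 0.81},
}


def top_matches(scores_by_query: Dict[str, Dict[str, float]],
                max_n_matches: int,
                min_similarity: float = 0.8) -> Dict[str, List[Tuple[str, float]]]:
    """Keep at most `max_n_matches` master strings per query string (toy version)."""
    result: Dict[str, List[Tuple[str, float]]] = {}
    for query, scores in scores_by_query.items():
        # Drop candidates below the similarity threshold.
        kept = [(m, s) for m, s in scores.items() if s >= min_similarity]
        # Best matches first, then apply the per-query cap.
        kept.sort(key=lambda pair: pair[1], reverse=True)
        result[query] = kept[:max_n_matches]
    return result


n_master = 3  # number of strings in the hypothetical `master`
capped = top_matches(similarity, max_n_matches=2)
uncapped = top_matches(similarity, max_n_matches=n_master)

print(capped["dup1"])    # two best matches for dup1
print(uncapped["dup1"])  # every match above the threshold for dup1
```

Because no string in `duplicates` can match more than `len(master)` strings, a cap of `len(master)` is the smallest default that is guaranteed never to discard a match.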

string_grouper/string_grouper.py

+3 -3
@@ -13,7 +13,6 @@
 DEFAULT_NGRAM_SIZE: int = 3
 DEFAULT_TFIDF_MATRIX_DTYPE: type = np.float32  # (only types np.float32 and np.float64 are allowed by sparse_dot_topn)
 DEFAULT_REGEX: str = r'[,-./]|\s'
-DEFAULT_MAX_N_MATCHES: int = 20
 DEFAULT_MIN_SIMILARITY: float = 0.8  # minimum cosine similarity for an item to be considered a match
 DEFAULT_N_PROCESSES: int = multiprocessing.cpu_count() - 1
 DEFAULT_IGNORE_CASE: bool = True  # ignores case by default
@@ -209,7 +208,8 @@ class StringGrouperConfig(NamedTuple):
     (Note: np.float32 often leads to faster processing and a smaller memory footprint albeit less precision
     than np.float64.)
     :param regex: str. The regex string used to cleanup the input string. Default is '[,-./]|\s'.
-    :param max_n_matches: int. The maximum number of matches allowed per string. Default is 20.
+    :param max_n_matches: int. The maximum number of matches allowed per string in `duplicates` (or `master`
+    if `duplicates` is not given). Default will be set by StringGrouper.
     :param min_similarity: float. The minimum cosine similarity for two strings to be considered a match.
     Defaults to 0.8.
     :param number_of_processes: int. The number of processes used by the cosine similarity calculation.
@@ -297,7 +297,7 @@ def __init__(self, master: pd.Series,
 
         self._config: StringGrouperConfig = StringGrouperConfig(**kwargs)
         if self._config.max_n_matches is None:
-            self._max_n_matches = len(self._master) if self._duplicates is None else len(self._duplicates)
+            self._max_n_matches = len(self._master)
         else:
             self._max_n_matches = self._config.max_n_matches
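
The change in this hunk can be sketched in isolation as follows. This is a minimal stand-alone illustration, not the project's code: `Config` is a hypothetical stand-in for `StringGrouperConfig`, and `resolve_max_n_matches` with its `legacy` flag is an invented helper that places the old and new default rules side by side. It assumes, as the diff suggests, that `None` acts as a sentinel meaning "let `StringGrouper` pick the default".

```python
from typing import NamedTuple, Optional, Sequence


class Config(NamedTuple):
    """Hypothetical stand-in for StringGrouperConfig; None means 'decide later'."""
    max_n_matches: Optional[int] = None


def resolve_max_n_matches(master: Sequence[str],
                          duplicates: Optional[Sequence[str]],
                          config: Config,
                          legacy: bool = False) -> int:
    """Return the effective cap on matches per query string (illustrative helper)."""
    if config.max_n_matches is not None:
        # An explicitly supplied value always wins.
        return config.max_n_matches
    if legacy:
        # Rule before this commit: size of duplicates when given, else size of master.
        return len(master) if duplicates is None else len(duplicates)
    # Rule after this commit: always the size of master.
    return len(master)


master = ["a", "b", "c", "d", "e"]
duplicates = ["x", "y"]

old_default = resolve_max_n_matches(master, duplicates, Config(), legacy=True)
new_default = resolve_max_n_matches(master, duplicates, Config())
explicit = resolve_max_n_matches(master, duplicates, Config(max_n_matches=3))

print(old_default, new_default, explicit)  # 2 5 3
```

With five strings in `master` and two in `duplicates`, the old rule would have capped each query at 2 matches, while the new rule allows up to 5. The two rules agree when `duplicates` is not given.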
