added blocking capabilities #72

Merged: 8 commits merged into Bergvca:master on Oct 15, 2021

Conversation

@ParticularMiner (Contributor) commented Sep 21, 2021

Hi @Bergvca

I'm glad you like red_string_grouper! So do I. 😃

  1. As requested, here is a branch with all the matrix-blocking capabilities included.

    The following call on the sec__edgar_companies.csv data file took 4 minutes (down from 25 minutes on my computer with the last version of string_grouper!)

    matches = match_strings(companies['Company Name'], n_blocks=(1, 200))

    It's almost hard to believe.

  2. Also included is the option to supply an already initialized StringGrouper object to the high-level functions match_strings, match_most_similar, group_similar_strings and compute_pairwise_similarities.

    This enables a StringGrouper object to be reused, so that the corpus can persist between calls, as mentioned by @justasojourner in Issue #69 (Question: How to have built StringGrouper corpus persist across multiple match_string calls in a programming session). See the sketch after this list.

  3. The README.md at this moment contains links to files on my branch in order to display the images. CHANGELOG.md likewise contains such links to README.md. After merging (or rebasing) this branch, you will need to do a "search and replace" for the following strings before uploading to pypi.org:
    "ParticularMiner/string_grouper/block/" → "Bergvca/string_grouper/master/"
    "ParticularMiner/string_grouper/tree/block/" → "Bergvca/string_grouper/tree/master/"

@ParticularMiner force-pushed the block branch 12 times, most recently from 652a400 to c6028ef (September 23, 2021)
@ParticularMiner (Contributor, Author) commented:

@Bergvca

Unit tests now complete. Code coverage is also 100%.

@Bergvca (Owner) commented Sep 27, 2021

Hi @ParticularMiner,

What would be the best order to review the different PRs? Is it best to start with this one? Thanks!

@ParticularMiner force-pushed the block branch 3 times, most recently from d56021a to acccfa9 (September 27, 2021)
@ParticularMiner (Contributor, Author) commented:

Hi @Bergvca,

Yes. This one.

@ParticularMiner (Contributor, Author) commented:

@Bergvca

Please don't hesitate to ask if you have any questions.

Cheers.

@ParticularMiner force-pushed the block branch 2 times, most recently from 2d5acf6 to 92907de (September 28, 2021)
@Bergvca (Owner) left a review:

Hi @ParticularMiner,

Many thanks for all your work again! It looks very good! As it's a pretty big PR, with multiple functionalities involved, I've just started going through it and adding comments. I still need to wrap my head around the blocking functionality and test the code itself, but here are some first comments :).

self._set_options(**kwargs)
self._build_corpus()

def _set_data(self,
@Bergvca (Owner) commented:

I would prefer to have all the attributes in the init function, either with a sensible default or as None (but with type hints, e.g. Optional[DataFrame]). This way it is always clear to the reader (and to the IDE) which attributes exist (instead of defining them in other functions).
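A minimal sketch of the suggested style (illustrative only; the attribute names are ones that appear elsewhere in this PR, not necessarily the final set):

from typing import Optional
import pandas as pd

class StringGrouper:
    def __init__(self, master: pd.Series,
                 duplicates: Optional[pd.Series] = None, **kwargs):
        # declare every attribute up front so readers and IDEs see them all
        self._master: pd.Series = master
        self._duplicates: Optional[pd.Series] = duplicates
        self._matches_list: Optional[pd.DataFrame] = None
        self._true_max_n_matches: int = 0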

"""
self._set_data(master, duplicates, master_id, duplicates_id)

def clear_data(self):
@Bergvca (Owner) commented:

2 questions:

  1. What is the goal of the "clear data" function?
  2. Same as the comment above: looking at the __init__ function, it is currently not clear that a _matches_list attribute exists, for example.

@ParticularMiner (Contributor, Author) commented Sep 28, 2021

@Bergvca

Same reason as above.

@Bergvca (Owner) commented:

I think your above comment somehow got lost - I think I saw it earlier, but I don't see it anymore...

self._duplicates = duplicates

@property
def master_id(self):
@Bergvca (Owner) commented:

Should we just make these attributes public? So instead of self._master_id we get self.master_id. This eliminates the need for these setter and property functions.

@ParticularMiner (Contributor, Author) commented:

@Bergvca

Sure. master_id and duplicates_id do not need setter and getter functions, as neither of them is validated on its own. I agree that we can get rid of them altogether, never mind making them public. In fact, it is best that they are always input together with master and duplicates through the __init__() or reset_data() functions.

left_matrix, right_matrix, nnz_rows[slice(*left_partition)])
except OverflowError:
# Matrices too big! Try splitting:
left_matrix = None
@Bergvca (Owner) commented:

The left_matrix / right_matrix variables are not being used.

@ParticularMiner (Contributor, Author) commented:

@Bergvca

Indeed, in the exception handler, left_matrix and right_matrix are no longer used. And since this is a recursive function, I didn't want them hogging memory while the function continues to recurse, which would degrade performance. So I set them to None.

(Every new call to fit_blockwise_auto() recreates its own copy of these variables, and they remain in memory until the end of the recursion.)
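For illustration, a self-contained sketch of that pattern (this is not the PR's actual fit_blockwise_auto; the row-count check merely stands in for catching the OverflowError):

from scipy.sparse import vstack

def blockwise_multiply(left, right, max_rows=1000):
    # If the operands are too big, slice them into blocks, drop the
    # original reference, and recurse, so that the full-size matrix does
    # not stay alive in every frame of the recursion.
    if left.shape[0] > max_rows:  # stand-in for the OverflowError case
        mid = left.shape[0] // 2
        top, bottom = left[:mid], left[mid:]  # CSR slicing copies the data
        left = None  # or: del left
        return vstack([blockwise_multiply(top, right, max_rows),
                       blockwise_multiply(bottom, right, max_rows)])
    return left @ right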

@Bergvca (Owner) commented:

Ah, I understand now. To make it clearer you could also do del left_matrix.

if not corpus:
corpus = StringGrouper(string_series_1, string_series_2, **kwargs)
else:
corpus.reset_data(string_series_1, string_series_2)
@Bergvca (Owner) commented:

I'm not sure we should add the corpus option to the higher-level functions. Basically this is a StringGrouper with a pre-fitted tf-idf vectorizer, right? I think there are some dangers here (mainly that the new series can contain ngrams that are not known to the vectorizer) that one should know about when using this. Also, it implies deeper knowledge of how the StringGrouper object works. The "high-level functions" were created so that the string grouper class can be used without much reading into the code. Once you start inserting your own tf-idf vectorizer, I think you are fairly knowledgeable about how the class works and should use the class functions directly. What do you think?

I guess I'm not seeing a (common) situation where a user first builds a StringGrouper and then inserts it into one of the high-level functions, but maybe I'm missing something.

@ParticularMiner (Contributor, Author) commented:

@Bergvca

Thanks for your comments.

I've updated the code by restoring the high-level functions to what they once were.

Also, I've restored the typing specifications Optional[pd.Series]. Sorry, I shouldn't have removed those.

All getter and setter methods have been removed, except for master and duplicates, since they have their own validation routines.

Following my suggestion of StringGrouper "shadow member functions" (counterparts of the high-level functions that do not rebuild the existing corpus), I've also added the following:

sg = StringGrouper(master, duplicates)
sg.match_strings(new_master, new_duplicates, min_similarity=0.7)
sg.match_most_similar(new_master, new_duplicates, min_similarity=0.7)
sg.group_similar_strings(new_master, min_similarity=0.7)
sg.compute_pairwise_similarities(new_master, new_duplicates, min_similarity=0.7)

What do you think?

I'm also awaiting your decision regarding red_string_grouper's wish-list for string_grouper, so as to decide whether or not to finally remove the code reorganization in StringGrouper's __init__() function.

@Bergvca (Owner) commented Sep 29, 2021

Hi @ParticularMiner,

I still don't see the use case for these "shadow member functions". I understand, for example, that you want to dedupe first on a large dataset, and later use the same StringGrouper with smaller datasets. In that case, however, wouldn't you just want to do something like:

sg = StringGrouper(master, duplicates)
sg.fit()
result = sg.get_groups()
# now later with a new set of dupes:
sg.set_duplicates(new_duplicates)
result = sg.get_groups()

Does that make sense? Maybe I'm just missing something?

@ParticularMiner (Contributor, Author) commented Sep 29, 2021

Hi @Bergvca

In a red_string_grouper use case for example, I wouldn't want to execute the second and third lines of your code:

sg = StringGrouper(master, duplicates)
sg.fit()
result = sg.get_groups()

because they would take just too much time for large master and duplicates. From empirical observation, matching takes far more time than just building the corpus. Building the corpus alone takes a very short time (seconds) even for large datasets.

Besides, I wouldn't necessarily need all the matches between master and duplicates. For example, in red_string_grouper, I do not match completely new series but rather subsets of the original master. In that way, I can choose which strings get matched and which do not:

sg = StringGrouper(master)  # this builds the corpus only; no matching, so very fast
sg.set_master(subset_1_of_master)
sg.fit()  # matching here is only between strings in this subset: subset_1_of_master
result_1 = sg.get_matches()
sg.set_master(subset_2_of_master)
sg.fit()  # matching here is only between strings in this subset: subset_2_of_master
result_2 = sg.get_matches()

You see, in this way I avoid unwanted and time-consuming matching between strings of subset_1_of_master and strings of subset_2_of_master. At the same time, if I had not maintained the corpus across all the calls, the similarity results would have been inconsistent with each other, because they would have been based on different corpora.

As I mentioned in a previous comment (I don't know if you received it), from a predictive modeling point of view, the new series need not even have all its n-grams in the existing corpus. Sure, it may not look ideal but it is still a predictive model.

See 2nd paragraph on page 50 of "Deep Learning for Natural Language Processing: Develop Deep Learning Models" By Jason Brownlee for an example of a new dataset with some n-grams not in the corpus.

@ParticularMiner (Contributor, Author) commented:

@Bergvca

So my above code (of 7 lines) would be condensed into the following 3 lines if a shadow member function were available:

sg = StringGrouper(master)  # this builds the corpus only; no matching, so very fast
result_1 = sg.match_strings(subset_1_of_master)
result_2 = sg.match_strings(subset_2_of_master)

@ParticularMiner (Contributor, Author) commented:

@Bergvca

  • Ah, I understand now. To make it clearer you could also do del left_matrix.

I agree. I was ahead of you there. 😄 Did that already.

@ParticularMiner (Contributor, Author) commented:

@Bergvca

See 2nd paragraph on page 50 of "Deep Learning for Natural Language Processing: Develop Deep Learning Models" By Jason Brownlee for an example of a new dataset with some n-grams not in the corpus.

@ParticularMiner (Contributor, Author) commented:

@Bergvca

So one concrete example, which I came across while working with two users in Issue #64 and which closely follows my suggested 3-line code snippet above, went like this:

Dataset: A large DataFrame with columns 'Address' and 'State' containing data from the USA.
Goal: To match addresses

The first user reasoned that it was pointless to match any two addresses that were in different states. So he would group his data by state first and perform the matching on each group. That made sense to me and also seemed efficient.

On the other hand, the second user didn't bother grouping at all and decided to filter his matching results afterwards. And it seemed OK to me too. Indeed, it was the natural way to go!

But as I studied both users' results, it turned out that the similarity scores were different between the users for the same pair of addresses. Not only that, but for the same min_similarity some matches found in the second user's data were missing in the first user's results!

That's how I remembered that the corpora had been different for each user. The second user had used a single corpus for all his matches, while the first user had used 51 different corpora (corresponding to 51 US states). Furthermore, it seemed that the smaller corpora had led to smaller similarity scores, some of which fell below the similarity threshold.

So, to resolve the discrepancy between the two results, while hoping also to take advantage of the first user's tremendous gains in efficiency, I thought of a way to recode StringGrouper to hold the corpus fixed while preserving its user interface. Hence also red_string_grouper.
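For concreteness, a hypothetical sketch of the per-state workflow with a single fixed corpus, using the shadow member function proposed above (df, 'Address' and 'State' refer to the example dataset just described):

sg = StringGrouper(df['Address'])  # corpus built over ALL addresses; no matching
results = {
    state: sg.match_strings(group['Address'])  # matches only within one state
    for state, group in df.groupby('State')
}
# every similarity score now comes from the same corpus, so the per-state
# results stay consistent with a single global run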

@ParticularMiner force-pushed the block branch 5 times, most recently from f0e9db0 to bf18a71 (October 3, 2021)
@Bergvca (Owner) commented Oct 6, 2021

Ah, clear now, thanks! Question about the variables outside the init function - I think you made a comment that explained it, but I no longer see it. Could you explain it again? (Maybe it was deleted by accident? Or am I just overlooking it?)

I ran the test and it really is crazy fast - almost impossibly fast, but the results are correct as far as I can see.

matches = StringGrouper._fix_diagonal(matches)
# the list of matches must be symmetric!
# (i.e., if A != B and A matches B; then B matches A)
matches = StringGrouper._symmetrize_matrix(matches)
@Bergvca (Owner) commented:

_fix_diagonal returns a csr_matrix, but _symmetrize_matrix expects a lil_matrix
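One possible fix, shown here for illustration only (the fix that actually landed may differ): convert explicitly before symmetrizing, since scipy sparse matrices provide tolil():

matches = StringGrouper._fix_diagonal(matches).tolil()
matches = StringGrouper._symmetrize_matrix(matches)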

# end of inner loop

self._true_max_n_matches = \
max(block_true_max_n_matches, self._true_max_n_matches)
@Bergvca (Owner) commented:

block_true_max_n_matches might be unassigned (not sure if that is possible, but my editor complains :)) - maybe initialize it to 0 by default right after self._true_max_n_matches?

@ParticularMiner (Contributor, Author) commented:

Hi @Bergvca

Thanks for your comments.

For some reason, I'm unable to reply to your in-code comments. So I'll put my comments here:

You were right about _symmetrize_matrix and lil_matrix. I've fixed them now.

I've also fixed block_true_max_n_matches.

I've also "declared" all class members in __init__(). Let me know if that fixes the issues with your IDE. I'm not sure about the missing comments you referred to. Let me know if they are still missing after this latest commit.

Cheers.

Commit: fixed _fix_diagonal() and _symmetrize_matrix()
@Bergvca (Owner) commented Oct 11, 2021

Hi @ParticularMiner, thanks for all the updates. From my side I can merge if you are ready. However, I do have a few questions / ideas to add:

  • When we hit the overflow error, we have potentially already been running calculations for a while, right? Should we not output a warning notifying the user of the overflow error, so that in the future it can be solved by setting n_blocks?
  • Since you found that a block size of 80k strings seems to be optimal, should we not just:
    • Add an option to guesstimate the optimal number of blocks (n_strings / 80,000)?
    • If the overflow error is hit, OR if the user inputs a parameter that requests the estimate (for example n_blocks = 'estimate_optimal'), use this estimate. This could even be the default.
  • Since setting the number of blocks seems to have such a big impact, should we not make it more prominent in the README? (I don't have a good idea yet how, though - maybe add it to the examples?)

Let me know what you think. I'm also happy to merge the new version as is, and we can choose to add the above ideas in future versions.

@ParticularMiner (Contributor, Author) commented:

Hi @Bergvca

  • When we hit the overflow error, we have potentially already been running calculations for a while, right? Should we not output a warning notifying the user of the overflow error, so that in the future it can be solved by setting n_blocks?

I think that’s a good idea. I’ll add that.

  • Since you found that a block size of 80k strings seems to be optimal, should we not just:
    • Add an option to guesstimate the optimal number of blocks (n_strings / 80,000)?
    • If the overflow error is hit, OR if the user inputs a parameter that requests the estimate (for example n_blocks = 'estimate_optimal'), use this estimate. This could even be the default.

I had been thinking along similar lines. But one thing that kept me from following through with it was the fact that the guesstimate is both data- and machine-dependent (probably even dependent on available memory size). So estimating the actual optimal number of blocks could be a more difficult problem.

But if you have any ideas on how to compute that on the fly, please do share. For instance, I believe there are Python packages that can promptly report the available memory size, which in turn can be used to calculate the guesstimate.
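A hypothetical sketch of such a guesstimate (the 80k-strings-per-block target is the empirical figure mentioned in this thread, and is data- and machine-dependent; psutil is one package that can report available memory, e.g. psutil.virtual_memory().available):

import math

def guesstimate_n_blocks(n_strings, target_block_size=80_000):
    # crude 'estimate_optimal' heuristic: one block per ~80k strings
    return max(1, math.ceil(n_strings / target_block_size))

guesstimate_n_blocks(650_000)  # -> 9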

  • Since setting the number of blocks seems to have such a big impact, should we not make it more prominent in the README? (I don't have a good idea yet how, though - maybe add it to the examples?)

Sure. No problem. We can think of ways to do that.

  • Let me know what you think. I'm also happy to merge the new version as is, and we can choose to add the above ideas in future versions.

Yes. Let’s merge now and add more later.

Thanks.

@ParticularMiner (Contributor, Author) commented:

Hi @Bergvca

  • When we hit the overflow error, we have potentially already been running calculations for a while, right? Should we not output a warning notifying the user of the overflow error, so that in the future it can be solved by setting n_blocks?

I've just added the warnings. I'm still thinking about the rest ...

@Bergvca merged commit 467b54e into Bergvca:master on Oct 15, 2021
@Bergvca (Owner) commented Oct 15, 2021

Thanks! I'll merge this now and we can think about further enhancements later.

@Bergvca (Owner) commented Oct 15, 2021

Merged and updated on PyPI.
