added blocking capabilities #72

Merged: 8 commits merged into Bergvca:master on Oct 15, 2021

Conversation

@ParticularMiner (Contributor) commented Sep 21, 2021

Hi @Bergvca

I'm glad you like red_string_grouper! So do I. 😃

  1. As requested, here is a branch with all the matrix-blocking capabilities included.

    The following call on the sec__edgar_companies.csv data file took 4 minutes (down from 25 minutes on my computer with the last version of string_grouper!)

    matches = match_strings(companies['Company Name'], n_blocks=(1, 200))

    It's almost hard to believe.

  2. Also included is the option to supply an already initialized StringGrouper object to the high-level functions match_strings, match_most_similar, group_similar_strings and compute_pairwise_similarities.

    This enables a StringGrouper object to be reused, so that the corpus can persist between calls, as mentioned by @justasojourner in Issue #69 (Question: How to have built StringGrouper corpus persist across multiple match_string calls in a programming session). See the sketch after this list.

  3. The README.md at this moment contains links to files on my branch in order to display the images. CHANGELOG.md likewise contains such links to README.md. After merging (or rebasing) this branch, you will need to do a "search and replace" for the following strings before uploading to pypi.org:
    "ParticularMiner/string_grouper/block/" → "Bergvca/string_grouper/master/"
    "ParticularMiner/string_grouper/tree/block/" → "Bergvca/string_grouper/tree/master/"

@ParticularMiner force-pushed the block branch 12 times, most recently from 652a400 to c6028ef (September 23, 2021)
@ParticularMiner (Contributor, Author) commented:

@Bergvca

Unit tests now complete. Code coverage is also 100%.

@Bergvca (Owner) commented Sep 27, 2021

Hi @ParticularMiner,

What would be the best order to review the different PRs? Is it best to start with this one? Thanks!

@ParticularMiner force-pushed the block branch 3 times, most recently from d56021a to acccfa9 (September 27, 2021)
@ParticularMiner (Contributor, Author) commented:

Hi @Bergvca,

Yes. This one.

@ParticularMiner (Contributor, Author) commented:

@Bergvca

Please don't hesitate to ask if you have any questions.

Cheers.

@ParticularMiner force-pushed the block branch 2 times, most recently from 2d5acf6 to 92907de (September 28, 2021)
@Bergvca (Owner) left a review:

Hi @ParticularMiner,

Many thanks for all your work again! It looks very good! As it's a pretty big PR, with multiple functionalities involved, I've just started going through it and adding comments. I still need to wrap my head around the blocking functionality and test the code itself, but here are some first comments :).

self._set_options(**kwargs)
self._build_corpus()

def _set_data(self,
@Bergvca (Owner) commented:

I would prefer to have all the attributes in the init function, either with a sensible default or as None (but with type hints, e.g. Optional[DataFrame]). This way it is always clear to the reader (and to the IDE) which attributes exist (instead of defining them in other functions).
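A minimal sketch of the suggested style (illustrative only; the attribute names are ones that appear elsewhere in this PR, not necessarily the final set):

from typing import Optional
import pandas as pd

class StringGrouper:
    def __init__(self, master: pd.Series,
                 duplicates: Optional[pd.Series] = None, **kwargs):
        # declare every attribute up front so readers and IDEs see them all
        self._master: pd.Series = master
        self._duplicates: Optional[pd.Series] = duplicates
        self._matches_list: Optional[pd.DataFrame] = None
        self._true_max_n_matches: int = 0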

"""
self._set_data(master, duplicates, master_id, duplicates_id)

def clear_data(self):
@Bergvca (Owner) commented:

2 questions:

  1. What is the goal of the "clear data" function?
  2. Same as the comment above: looking at the __init__ function, it is currently not clear that a _matches_list attribute exists, for example.

@ParticularMiner (Contributor, Author) commented Sep 28, 2021

@Bergvca

Same reason as above.

@Bergvca (Owner) commented:

I think your above comment somehow got lost - I think I saw it earlier, but I don't see it anymore...

self._duplicates = duplicates

@property
def master_id(self):
@Bergvca (Owner) commented:

Should we just make these attributes public? So instead of self._master_id we get self.master_id. This eliminates the need for these setter and property functions.

@ParticularMiner (Contributor, Author) commented:

@Bergvca

Sure. master_id and duplicates_id do not need setter and getter functions, as neither of them is validated on its own. I agree that we can get rid of them altogether, never mind making them public. In fact, it is best that they are always input together with master and duplicates through the __init__() or reset_data() functions.

left_matrix, right_matrix, nnz_rows[slice(*left_partition)])
except OverflowError:
# Matrices too big! Try splitting:
left_matrix = None
@Bergvca (Owner) commented:

The left_matrix / right_matrix variables are not being used.

@ParticularMiner (Contributor, Author) commented:

@Bergvca

Indeed, in the exception handler, left_matrix and right_matrix are no longer used. And since this is a recursive function, I didn't want them hogging memory while the function continues to recurse, which would degrade performance. So I set them to None.

(Every new call to fit_blockwise_auto() recreates its own copy of these variables, and they remain in memory until the end of the recursion.)
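For illustration, a self-contained sketch of that pattern (this is not the PR's actual fit_blockwise_auto; the row-count check merely stands in for catching the OverflowError):

from scipy.sparse import vstack

def blockwise_multiply(left, right, max_rows=1000):
    # If the operands are too big, slice them into blocks, drop the
    # original reference, and recurse, so that the full-size matrix does
    # not stay alive in every frame of the recursion.
    if left.shape[0] > max_rows:  # stand-in for the OverflowError case
        mid = left.shape[0] // 2
        top, bottom = left[:mid], left[mid:]  # CSR slicing copies the data
        left = None  # or: del left
        return vstack([blockwise_multiply(top, right, max_rows),
                       blockwise_multiply(bottom, right, max_rows)])
    return left @ right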

@Bergvca (Owner) commented:

Ah, I understand now. To make it clearer you could also do del left_matrix.

if not corpus:
corpus = StringGrouper(string_series_1, string_series_2, **kwargs)
else:
corpus.reset_data(string_series_1, string_series_2)
@Bergvca (Owner) commented:

I'm not sure we should add the corpus option to the higher-level functions. Basically this is a StringGrouper with a pre-fitted tf-idf vectorizer, right? I think there are some dangers here (mainly that the new series can contain ngrams that are not known to the vectorizer) that one should know about when using this. Also, it implies deeper knowledge of how the StringGrouper object works. The "high-level functions" were created so that the string grouper class can be used without much reading into the code. Once you start inserting your own tf-idf vectorizer, I think you are fairly knowledgeable about how the class works and should use the class functions directly. What do you think?

I guess I'm not seeing a (common) situation where a user first builds a StringGrouper and then inserts it into one of the high-level functions, but maybe I'm missing something.

@ParticularMiner (Contributor, Author) commented:

@Bergvca

Thanks for your comments.

I've updated the code by restoring the high-level functions to what they once were.

Also, I've restored the typing specifications Optional[pd.Series]. Sorry, I shouldn't have removed those.

All getter and setter methods have been removed, except for master and duplicates, since they have their own validation routines.

Following my suggestion of StringGrouper "shadow member functions" (counterparts of the high-level functions that do not rebuild the existing corpus), I've also added the following:

sg = StringGrouper(master, duplicates)
sg.match_strings(new_master, new_duplicates, min_similarity=0.7)
sg.match_most_similar(new_master, new_duplicates, min_similarity=0.7)
sg.group_similar_strings(new_master, min_similarity=0.7)
sg.compute_pairwise_similarities(new_master, new_duplicates, min_similarity=0.7)

What do you think?

I'm also awaiting your decision regarding red_string_grouper's wish-list for string_grouper, so as to decide whether or not to finally remove the code reorganization in StringGrouper's __init__() function.

@Bergvca (Owner) commented Sep 29, 2021

Hi @ParticularMiner,

I still don't see the use case for these "shadow member functions". I understand, for example, that you want to dedupe first on a large dataset, and later use the same StringGrouper with smaller datasets. In that case, however, wouldn't you just want to do something like:

sg = StringGrouper(master, duplicates)
sg.fit()
result = sg.get_groups()
# now later with a new set of dupes:
sg.set_duplicates(new_duplicates)
result = sg.get_groups()

Does that make sense? Maybe I'm just missing something?

@ParticularMiner (Contributor, Author) commented Sep 29, 2021

Hi @Bergvca

In a red_string_grouper use case for example, I wouldn't want to execute the second and third lines of your code:

sg = StringGrouper(master, duplicates)
sg.fit()
result = sg.get_groups()

because they would take just too much time for large master and duplicates. From empirical observation, matching takes far more time than just building the corpus. Building the corpus alone takes a very short time (seconds) even for large datasets.

Besides, I wouldn't necessarily need all the matches between master and duplicates. For example, in red_string_grouper, I do not match completely new series but rather subsets of the original master. In that way, I can choose which strings get matched and which do not:

sg = StringGrouper(master)  # this builds the corpus only; no matching, so very fast
sg.set_master(subset_1_of_master)
sg.fit()  # matching here is only between strings in this subset: subset_1_of_master
result_1 = sg.get_matches()
sg.set_master(subset_2_of_master)
sg.fit()  # matching here is only between strings in this subset: subset_2_of_master
result_2 = sg.get_matches()

You see, in this way I avoid unwanted and time-consuming matching between strings of subset_1_of_master and strings of subset_2_of_master. At the same time, if I had not maintained the corpus across all the calls, the similarity results would have been inconsistent with each other, because they would have been based on different corpora.

As I mentioned in a previous comment (I don't know if you received it), from a predictive modeling point of view, the new series need not even have all its n-grams in the existing corpus. Sure, it may not look ideal but it is still a predictive model.

See 2nd paragraph on page 50 of "Deep Learning for Natural Language Processing: Develop Deep Learning Models" By Jason Brownlee for an example of a new dataset with some n-grams not in the corpus.

@ParticularMiner (Contributor, Author) commented:

@Bergvca

So my above code (of 7 lines) would be condensed into the following 3 lines if a shadow member function were available:

sg = StringGrouper(master)  # this builds the corpus only; no matching, so very fast
result_1 = sg.match_strings(subset_1_of_master)
result_2 = sg.match_strings(subset_2_of_master)

@ParticularMiner (Contributor, Author) commented:

@Bergvca

  • Ah, I understand now. To make it clearer you could also do del left_matrix.

I agree. I was ahead of you there. 😄 Did that already.

@ParticularMiner (Contributor, Author) commented:

@Bergvca

See 2nd paragraph on page 50 of "Deep Learning for Natural Language Processing: Develop Deep Learning Models" By Jason Brownlee for an example of a new dataset with some n-grams not in the corpus.

@ParticularMiner (Contributor, Author) commented:

@Bergvca

So one concrete example, which I came across while working with two users in Issue #64 and which closely follows my suggested 3-line code snippet above, went like this:

Dataset: A large DataFrame with columns 'Address' and 'State' containing data from the USA.
Goal: To match addresses

The first user reasoned that it was pointless to match any two addresses that were in different states. So he would group his data by state first and perform the matching on each group. That made sense to me and also seemed efficient.

On the other hand, the second user didn't bother grouping at all and decided to filter his matching results afterwards. And it seemed OK to me too. Indeed, it was the natural way to go!

But as I studied both users' results, it turned out that the similarity scores were different between the users for the same pair of addresses. Not only that, but for the same min_similarity some matches found in the second user's data were missing in the first user's results!

That's how I remembered that the corpora had been different for each user. The second user had used a single corpus for all his matches, while the first user had used 51 different corpora (corresponding to 51 US states). Furthermore, it seemed that the smaller corpora had led to smaller similarity scores, some of which fell below the similarity threshold.

So, to resolve the discrepancy between the two results, while hoping also to take advantage of the first user's tremendous gains in efficiency, I thought of a way to recode StringGrouper to hold the corpus fixed while preserving its user interface. Hence also red_string_grouper.
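For concreteness, a hypothetical sketch of the per-state workflow with a single fixed corpus, using the shadow member function proposed above (df, 'Address' and 'State' refer to the example dataset just described):

sg = StringGrouper(df['Address'])  # corpus built over ALL addresses; no matching
results = {
    state: sg.match_strings(group['Address'])  # matches only within one state
    for state, group in df.groupby('State')
}
# every similarity score now comes from the same corpus, so the per-state
# results stay consistent with a single global run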

@ParticularMiner force-pushed the block branch 5 times, most recently from f0e9db0 to bf18a71 (October 3, 2021)
@Bergvca (Owner) commented Oct 6, 2021

Ah, clear now, thanks! Question about the variables outside the init function - I think you made a comment that explained it, but I no longer see it. Could you explain it again? (Maybe it was deleted by accident? Or am I just overlooking it?)

I ran the test and it really is crazy fast - almost impossibly fast, but the results are correct as far as I can see.

matches = StringGrouper._fix_diagonal(matches)
# the list of matches must be symmetric!
# (i.e., if A != B and A matches B; then B matches A)
matches = StringGrouper._symmetrize_matrix(matches)
@Bergvca (Owner) commented:

_fix_diagonal returns a csr_matrix, but _symmetrize_matrix expects a lil_matrix
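One possible fix, shown here for illustration only (the fix that actually landed may differ): convert explicitly before symmetrizing, since scipy sparse matrices provide tolil():

matches = StringGrouper._fix_diagonal(matches).tolil()
matches = StringGrouper._symmetrize_matrix(matches)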

# end of inner loop

self._true_max_n_matches = \
max(block_true_max_n_matches, self._true_max_n_matches)
@Bergvca (Owner) commented:

block_true_max_n_matches might be unassigned (not sure if that is possible, but my editor complains :)) - maybe initialize it to 0 by default right after self._true_max_n_matches?

@ParticularMiner (Contributor, Author) commented:

Hi @Bergvca

Thanks for your comments.

For some reason, I'm unable to reply to your in-code comments. So I'll put my comments here:

You were right about _symmetrize_matrix and lil_matrix. I've fixed them now.

I've also fixed block_true_max_n_matches.

I've also "declared" all class members in __init__(). Let me know if that fixes the issues with your IDE. I'm not sure about the missing comments you referred to. Let me know if they are still missing after this latest commit.

Cheers.

Commit: fixed _fix_diagonal() and _symmetrize_matrix()
@Bergvca (Owner) commented Oct 11, 2021

Hi @ParticularMiner, thanks for all the updates. From my side I can merge if you are ready. However, I do have a few questions / ideas to add:

  • When we hit the overflow error, we have potentially already been running calculations for a while, right? Should we not output a warning notifying the user of the overflow error, so that in the future it can be solved by setting n_blocks?
  • Since you found that a block size of 80k strings seems to be optimal, should we not just:
    • Add an option to guesstimate the optimal number of blocks (n_strings / 80,000)?
    • If the overflow error is hit, OR if the user inputs a parameter that requests the estimate (for example n_blocks = 'estimate_optimal'), use this estimate. This could even be the default.
  • Since setting the number of blocks seems to have such a big impact, should we not make it more prominent in the README? (I don't have a good idea yet how, though - maybe add it to the examples?)

Let me know what you think. I'm also happy to merge the new version as is, and we can choose to add the above ideas in future versions.

@ParticularMiner (Contributor, Author) commented:

Hi @Bergvca

  • When we hit the overflow error, we have potentially already been running calculations for a while, right? Should we not output a warning notifying the user of the overflow error, so that in the future it can be solved by setting n_blocks?

I think that’s a good idea. I’ll add that.

  • Since you found that a block size of 80k strings seems to be optimal, should we not just:
    • Add an option to guesstimate the optimal number of blocks (n_strings / 80,000)?
    • If the overflow error is hit, OR if the user inputs a parameter that requests the estimate (for example n_blocks = 'estimate_optimal'), use this estimate. This could even be the default.

I had been thinking along similar lines. But one thing that kept me from following through with it was the fact that the guesstimate is both data- and machine-dependent (probably even dependent on available memory size). So estimating the actual optimal number of blocks could be a more difficult problem.

But if you have any ideas on how to compute that on the fly, please do share. For instance, I believe there are Python packages that can promptly report the available memory size, which in turn can be used to calculate the guesstimate.
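A hypothetical sketch of such a guesstimate (the 80k-strings-per-block target is the empirical figure mentioned in this thread, and is data- and machine-dependent; psutil is one package that can report available memory, e.g. psutil.virtual_memory().available):

import math

def guesstimate_n_blocks(n_strings, target_block_size=80_000):
    # crude 'estimate_optimal' heuristic: one block per ~80k strings
    return max(1, math.ceil(n_strings / target_block_size))

guesstimate_n_blocks(650_000)  # -> 9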

  • Since setting the number of blocks seems to have such a big impact, should we not make it more prominent in the README? (I don't have a good idea yet how, though - maybe add it to the examples?)

Sure. No problem. We can think of ways to do that.

  • Let me know what you think. I'm also happy to merge the new version as is, and we can choose to add the above ideas in future versions.

Yes. Let’s merge now and add more later.

Thanks.

@ParticularMiner (Contributor, Author) commented:

Hi @Bergvca

  • When we hit the overflow error, we have potentially already been running calculations for a while, right? Should we not output a warning notifying the user of the overflow error, so that in the future it can be solved by setting n_blocks?

I've just added the warnings. I'm still thinking about the rest ...

@Bergvca merged commit 467b54e into Bergvca:master on Oct 15, 2021
@Bergvca (Owner) commented Oct 15, 2021

Thanks! I'll merge this now and we can think about further enhancements later.

@Bergvca (Owner) commented Oct 15, 2021

Merged and updated on PyPI.
