added blocking capabilities #72
Conversation
Unit tests now complete. Code coverage is also 100%.
Hi @ParticularMiner, what would be the best order to review the different PRs? Is it best to start with this one? Thanks!
Hi @Bergvca, Yes. This one.
Please don't hesitate to ask if you have any questions. Cheers.
Hi @ParticularMiner,
Many thanks for all your work again! It looks very good! As it's a pretty big PR with multiple functionalities involved, I've just started going through it and adding comments. I still need to wrap my head around the blocking functionality and test the code itself, but here are some first comments :).
```python
        self._set_options(**kwargs)
        self._build_corpus()

    def _set_data(self,
```
I would prefer to have all the attributes in the `__init__` function, either with a sensible default or as `None` (but with type hints, e.g. `Optional[DataFrame]`). This way it is always clear to the reader (and to the IDE) which attributes exist, instead of them being defined in other functions.
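A minimal sketch of the convention being asked for (class and attribute names follow the surrounding diff; the exact signature is illustrative):

```python
from typing import Optional
import pandas as pd

class StringGrouper:
    def __init__(self, master: pd.Series, duplicates: Optional[pd.Series] = None, **kwargs):
        # Declare every attribute up front, so readers and IDEs can see
        # all possible attributes at a glance:
        self._master: pd.Series = master
        self._duplicates: Optional[pd.Series] = duplicates
        self._matches_list: Optional[pd.DataFrame] = None  # populated later by fit()
```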
""" | ||
self._set_data(master, duplicates, master_id, duplicates_id) | ||
|
||
def clear_data(self): |
Two questions:
- What is the goal of the `clear_data` function?
- Same as the comment above: looking at the `__init__` function, it is currently not clear that, for example, a `_matches_list` attribute exists.
Same reason as above.
I think your above comment somehow got lost. I think I saw it earlier, but I don't see it anymore...
string_grouper/string_grouper.py (outdated):
```python
        self._duplicates = duplicates

    @property
    def master_id(self):
```
Should we just make these attributes public? So instead of `self._master_id` we get `self.master_id`. This eliminates the need for these setter and property functions.
Sure. `master_id` and `duplicates_id` do not need setter and getter functions, as neither of them is validated on its own. I agree that we can get rid of them altogether, never mind making them public. Actually, it is best that they are always input together with `master` and `duplicates` through the `__init__()` or `reset_data()` functions.
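For illustration, the kind of boilerplate being dropped versus what remains (a sketch, not the actual diff):

```python
class WithAccessors:
    # Before: a property/setter pair that adds no validation of its own.
    @property
    def master_id(self):
        return self._master_id

    @master_id.setter
    def master_id(self, value):
        self._master_id = value

class WithoutAccessors:
    # After: the attribute is simply assigned (and validated together with
    # `master`) in __init__() or reset_data(); no standalone accessor needed.
    def __init__(self, master, master_id=None):
        self.master = master
        self.master_id = master_id
```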
string_grouper/string_grouper.py (outdated):
```python
                left_matrix, right_matrix, nnz_rows[slice(*left_partition)])
        except OverflowError:
            # Matrices too big! Try splitting:
            left_matrix = None
```
The `left_matrix` / `right_matrix` variables are not being used.
Indeed, in the exception handler, `left_matrix` and `right_matrix` are no longer used. But since this is a recursive function, I didn't want them hogging memory while the function continues to recurse, which would degrade performance. So I set them to `None`.
(Every new call to `fit_blockwise_auto()` creates its own copies of these variables, and they remain in memory until the end of the recursion.)
Ah, I understand now. To make it clearer you could also do `del left_matrix`.
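A runnable sketch of the pattern under discussion (the real `fit_blockwise_auto()` differs; the splitting criterion and names here are illustrative):

```python
import numpy as np

def multiply_blockwise(left, right, max_rows=1024):
    """Recursively split `left` until each block is small enough to multiply."""
    if left.shape[0] <= max_rows:
        return left @ right
    n = left.shape[0] // 2
    top, bottom = left[:n].copy(), left[n:].copy()  # independent halves
    # Drop this frame's reference to the full-size operand before recursing;
    # once no caller needs it either, it can be garbage-collected instead of
    # sitting in memory for the rest of the recursion (the same intent as the
    # PR's `left_matrix = None`):
    del left
    return np.vstack([multiply_blockwise(top, right, max_rows),
                      multiply_blockwise(bottom, right, max_rows)])
```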
string_grouper/string_grouper.py (outdated):
```python
    if not corpus:
        corpus = StringGrouper(string_series_1, string_series_2, **kwargs)
    else:
        corpus.reset_data(string_series_1, string_series_2)
```
I'm not sure we should add the corpus option to the higher-level functions. Basically this is a StringGrouper with a pre-fitted tf-idf vectorizer, right? I think there are some dangers here (mainly that the new series can contain n-grams that are not known to the vectorizer) that one should know about when using this. It also implies deeper knowledge of how the StringGrouper object works. The high-level functions were created so that the string_grouper class can be used without much reading into the code. Once you start inserting your own tf-idf vectorizer, I think you are fairly knowledgeable about how the class works and should use the class functions directly. What do you think?
I guess I'm not seeing a (common) situation where a user first builds a StringGrouper and then inserts it into one of the high-level functions, but maybe I'm missing something.
Thanks for your comments. I've updated the code by restoring the high-level functions to what they once were. Also, I've restored the typing specifications. All getter and setter methods have been removed, except for … . Following my suggestion for shadow member functions, calls like the following are now possible:

```python
sg = StringGrouper(master, duplicates)
sg.match_strings(new_master, new_duplicates, min_similarity=0.7)
sg.match_most_similar(new_master, new_duplicates, min_similarity=0.7)
sg.group_similar_strings(new_master, min_similarity=0.7)
sg.compute_pairwise_similarities(new_master, new_duplicates, min_similarity=0.7)
```

What do you think? I'm also awaiting your decision regarding … .
Hi @ParticularMiner, I still don't see the use case for these "shadow member functions". I understand, for example, that you want to dedupe a large dataset first and later use the same StringGrouper with smaller datasets. In that case, however, wouldn't you just want to do something like: …
Does that make sense? Maybe I'm just missing something?
Hi @Bergvca,

In a case like that, I wouldn't want to run

```python
sg = StringGrouper(master, duplicates)
sg.fit()
result = sg.get_groups()
```

because those calls would take just too much time for large datasets. Besides, I wouldn't necessarily need all the matches between `master` and `duplicates`. I would rather do:

```python
sg = StringGrouper(master)  # this would build the corpus; no matching so very fast
sg.set_master(subset_1_of_master)
sg.fit()  # matching here is only between strings in this subset: subset_1_of_master
result_1 = sg.get_matches()
sg.set_master(subset_2_of_master)
sg.fit()  # matching here is only between strings in this subset: subset_2_of_master
result_2 = sg.get_matches()
```

You see, in this way I avoid unwanted and time-consuming matching between strings of different subsets.

As I mentioned in a previous comment (I don't know if you received it), from a predictive-modeling point of view, the new series need not even have all its n-grams in the existing corpus. Sure, it may not look ideal, but it is still a predictive model. See the 2nd paragraph on page 50 of "Deep Learning for Natural Language Processing: Develop Deep Learning Models" by Jason Brownlee for an example of a new dataset with some n-grams not in the corpus.
So my above code (of 7 lines) would be condensed into the following 3 lines if a shadow member function were available:

```python
sg = StringGrouper(master)  # this would build the corpus; no matching so very fast
result_1 = sg.match_strings(subset_1_of_master)
result_2 = sg.match_strings(subset_2_of_master)
```
I agree. I was ahead of you there. 😄 Did that already.
> See the 2nd paragraph on page 50 of "Deep Learning for Natural Language Processing: Develop Deep Learning Models" by Jason Brownlee for an example of a new dataset with some n-grams not in the corpus.
So one concrete example I came across while working with two users in Issue #64, and which closely follows my suggested 3-line code snippet above, went like this:

Dataset: a large DataFrame with columns 'Address' and 'State' containing data from the USA.

The first user reasoned that it was pointless to match any two addresses that were in different states, so he would group his data by state first and perform the matching within each group. That made sense to me and also seemed efficient. The second user, on the other hand, didn't bother grouping at all and decided to filter his matching results afterwards. That seemed OK to me too. Indeed, it was the natural way to go!

But as I studied both users' results, it turned out that the similarity scores were different between the users for the same pair of addresses. Not only that, but for the same similarity threshold, some matches appeared in one user's results and not in the other's.

That's how I remembered that the corpora had been different for each user. The second user had used a single corpus for all his matches, while the first user had used 51 different corpora (corresponding to 51 US states). Furthermore, it seemed that the smaller corpora had led to smaller similarity scores, some of which fell below the similarity threshold.

So, to mend the discrepancy between the two results, while hoping to also take advantage of the first user's tremendous gains in efficiency, I thought of a way to recode `string_grouper` so that a single corpus could persist across the separate per-group matching calls.
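For illustration, the amended first-user flow might look like this (a sketch assuming the proposed shadow member function `match_strings()`; the DataFrame contents are invented):

```python
import pandas as pd
from string_grouper import StringGrouper  # assumed import

df = pd.DataFrame({
    'Address': ['1 Main St', '1 Main Street', '2 Oak Ave'],
    'State':   ['NY',        'NY',            'CA'],
})

sg = StringGrouper(df['Address'])  # one shared corpus built from ALL addresses
results = {
    state: sg.match_strings(group['Address'])  # proposed shadow member function
    for state, group in df.groupby('State')   # matching stays within each state
}
```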
Ah, clear now, thanks! A question about the variables outside the `__init__` function: I think you made a comment that explained it, but I no longer see it. Could you explain it again? (Maybe it was deleted by accident? Or am I just overlooking it?) I ran the test and it really is crazy fast, almost impossibly fast, but the results are correct as far as I can see.
```python
        # the list of matches must be symmetric!
        # (i.e., if A != B and A matches B; then B matches A)
        matches = StringGrouper._symmetrize_matrix(matches)
```
`_fix_diagonal` returns a `csr_matrix`, but `_symmetrize_matrix` expects a `lil_matrix`.
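A sketch of the straightforward reconciliation (SciPy's `tolil()` conversion is real; whether to convert here or change `_symmetrize_matrix` itself is the author's call):

```python
import numpy as np
from scipy.sparse import csr_matrix

matches = csr_matrix(np.eye(3))  # stands in for _fix_diagonal()'s csr output
matches = matches.tolil()        # csr -> lil, the format _symmetrize_matrix expects
```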
```python
        # end of inner loop

        self._true_max_n_matches = \
            max(block_true_max_n_matches, self._true_max_n_matches)
```
`block_true_max_n_matches` might be unassigned (not sure if that is possible, but my editor complains :)). Maybe set it to 0 by default, next to `self._true_max_n_matches`?
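A runnable sketch of the suggested default (the name follows the comment; the loop is illustrative):

```python
def true_max_n_matches(blocks):
    # Bind the name to 0 up front, so it exists even if `blocks` is empty
    # and the loop body never runs:
    block_true_max_n_matches = 0
    for block in blocks:
        block_true_max_n_matches = max(block_true_max_n_matches, len(block))
    return block_true_max_n_matches
```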
Hi @Bergvca,

Thanks for your comments. For some reason, I'm unable to reply to your in-code comments, so I'll put my comments here:

You were right about `block_true_max_n_matches`: I've given it a default value so it is always assigned. I've also fixed `_fix_diagonal()` and `_symmetrize_matrix()` so that their matrix types agree. And I've "declared" all class members in `__init__()`, as you suggested.

Cheers.
Commit: fixed `_fix_diagonal()` and `_symmetrize_matrix()`
Hi @ParticularMiner, thanks for all the updates. From my side I can merge if you are ready. However, I do have a few questions / ideas to add:

…

Let me know what you think. I'm also happy to merge the new version as is, and we can choose to add the above ideas in future versions.
Hi @Bergvca
I think that’s a good idea. I’ll add that.
I had been thinking along similar lines. But one thing that kept me from following through was the fact that the guesstimate is both data- and machine-dependent (probably even dependent on available memory size), so estimating the actual optimal number of blocks could be a more difficult problem. But if you have any ideas on how to compute that on the fly, please do share. For instance, I believe there are Python packages that can promptly report the available memory size, which in turn could be used to calculate the guesstimate.
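(A sketch of such a check, assuming the `psutil` package; the sizing heuristic is invented purely for illustration:)

```python
import psutil

available = psutil.virtual_memory().available   # bytes currently free
estimated_product_bytes = 8 * 10_000_000 * 50   # hypothetical size estimate
# Guess how many blocks are needed so each fits in half the free memory:
n_blocks = max(1, -(-estimated_product_bytes // (available // 2)))  # ceil division
```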
Sure. No problem. We can think of ways to do that.
Yes. Let's merge now and add more later. Thanks.
Hi @Bergvca,
I've just added the warnings. I'm still thinking about the rest ...
Thanks! I'll merge this now and we can think about further enhancements later.
Merged and updated to PyPI.
Hi @Bergvca,

I'm glad you like `red_string_grouper`! So do I. 😃

As requested, here is a branch with all the matrix-blocking capabilities included. The following call on the `sec__edgar_companies.csv` data file took 4 minutes (down from 25 minutes on my computer with the last version of `string_grouper`!):

…

It's almost hard to believe.

Also included is the option to supply an already initialized `StringGrouper` object to the high-level functions `match_strings`, `match_most_similar`, `group_similar_strings` and `compute_pairwise_similarities`. This enables a `StringGrouper` object to be reused, so the corpus can persist between calls, as mentioned by @justasojourner in Issue #69 ("Question: How to have built StringGrouper corpus persist across multiple match_string calls in a programming session").

The README.md at this moment contains links to files on my branch in order to display the images. CHANGELOG.md also has such links to README.md. After merging (or rebasing) this branch, you will need to do a "search and replace" for the following strings before uploading to pypi.org:
"ParticularMiner/string_grouper/block/" → "Bergvca/string_grouper/master/"
"ParticularMiner/string_grouper/tree/block/" → "Bergvca/string_grouper/tree/master/"