Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto cleanup memmap scores #1086

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Conversation

NickCrews
Copy link
Contributor

Users don't need to worry about cleaning up the memmpa file now

@coveralls
Copy link

coveralls commented Aug 30, 2022

Coverage Status

Coverage increased (+0.05%) to 64.157% when pulling 133d744 on NickCrews:auto-cleanup into c595052 on dedupeio:main.

@@ -175,7 +174,6 @@ def partition(
clusters = self.cluster(pair_scores, threshold)
clusters = self._add_singletons(data, clusters)
clusters = list(clusters)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have auto-cleanup, perhaps we don't want to cast this to a list? I think then this entire method would be out-of-core, right?

Same for casting to list in join()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's right. i think we we should put in a warning for a few releases though because some folks are likely depending on it being a list instead of a generator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to do what Gazeteer.search() does with the generator: bool parameter?

def score(.... generator: bool = False):
    ...
    if not generator:
        warn("this deprecated")
        return list(scores)
    else:
        yield from scores

dedupe/core.py Outdated
@@ -176,9 +177,28 @@ def scoreDuplicates(
else:
scored_pairs = numpy.array([], dtype=dtype)

# See https://docs.python.org/3/library/weakref.html#comparing-finalizers-with-del-methods
scored_pairs.remove = weakref.finalize(scored_pairs, _cleanup_scores, scored_pairs)
scored_pairs.removed = property(_is_removed)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK if including the removed property is good, probably could live without it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if i understand correctly, it's never going to be anything except false in reachable code? because it would only set to true when scored_paris is garbage collected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Someone could call pairs.remove() manually to do the finalization, and then it would be True. But you're correct, we as the library rely on garbage collection so we never take advantage of the .removed property. We could remove it, always easier to add later than to have to continue supporting it forever...

It wasn't stated anywhere in the docs that you have to cleanup the
possibly-a-tempfile, so people might not have been doing it.
And, I thought this was better than
the alternatives from
https://stackoverflow.com/questions/865115/how-do-i-correctly-clean-up-a-python-object
Now you can #type: ignore[specific-code]
@@ -176,9 +177,29 @@ def scoreDuplicates(
else:
scored_pairs = numpy.array([], dtype=dtype)

# Monkeypatch in these extra methods and attributes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really cool!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah seriously! The first time I've used it too.

@fgregg
Copy link
Contributor

fgregg commented Sep 2, 2022

wow! i've long wanted to do something like this but del was always so finicky. a super nice approach.

can we get a test for this: something like

create_scored_pairs
get file location for memmap
del scored_pairs
check that there is no file at that location

Copy link
Contributor

@fgregg fgregg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made a request for a test in the thread

@NickCrews
Copy link
Contributor Author

Yes the test seems important, I can look into it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants