Add nonunique. #506

Open: wants to merge 7 commits into base: master

Conversation

groutr
Contributor

@groutr groutr commented Jan 11, 2021

itertoolz.unique yields the elements of a sequence that have not been seen before.
nonunique is the complement: it yields the elements of a sequence that have already been seen.

This is incredibly useful for finding duplicates in a sequence.

>>> tuple(nonunique([1, 2, 3, 4, 5, 1, 2, 3]))
(1, 2, 3)

This isn't really a new feature for itertoolz; it exposes logic that already exists. isdistinct already tracks the elements it has seen, but it only returns True/False. nonunique instead yields the already seen elements as they are encountered. This PR simply moves that logic into its own function.
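For readers following along, here is a minimal sketch of how such a function could look, assuming it mirrors the seen-set pattern used by unique and accepts an optional key. This is an illustration, not the PR's actual diff, and isdistinct_via_nonunique is a hypothetical helper that only shows how the two functions relate.

```python
def nonunique(seq, key=None):
    """Yield the elements of seq that have already been seen (sketch)."""
    seen = set()
    seen_add = seen.add
    for item in seq:
        # Compare on key(item) when a key is given, else on the item itself.
        val = item if key is None else key(item)
        if val in seen:
            yield item
        else:
            seen_add(val)


def isdistinct_via_nonunique(seq):
    # Hypothetical helper: a sequence is distinct when nothing repeats.
    # any() stops at the first repeated element.
    return not any(True for _ in nonunique(seq))
```

With this sketch, `tuple(nonunique([1, 2, 3, 4, 5, 1, 2, 3]))` gives `(1, 2, 3)`, matching the example above, and an element that appears three times is yielded twice.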

ping: @eriknw

Nonunique returns the already seen elements of a sequence.
Guarding the seen_add call can improve performance when there is a high
ratio of duplicates.
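
To make the "guarding" idea concrete, here is a rough sketch contrasting the two shapes of the loop. The function names are hypothetical and chosen only for this comparison. Both produce the same output; the guarded form simply skips the set insertion for elements that are already present. No timings are claimed here, since any gain depends on the data.

```python
def nonunique_unguarded(seq):
    # Hypothetical variant: adds to the set on every iteration,
    # even when the element is already there.
    seen = set()
    seen_add = seen.add
    for item in seq:
        if item in seen:
            yield item
        seen_add(item)


def nonunique_guarded(seq):
    # Hypothetical variant: only adds when the element is new,
    # so duplicates cost a membership test and nothing more.
    seen = set()
    seen_add = seen.add
    for item in seq:
        if item in seen:
            yield item
        else:
            seen_add(item)
```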
@groutr
Contributor Author

groutr commented Jul 21, 2021

@eriknw Can I get your thoughts on this?

@eriknw
Member

eriknw commented Oct 28, 2021

Thanks @groutr! Everything here looks reasonable and good. I'm curious: do you have a use case for this?

And sorry for my delay. This year has been, uh, a little crazy.

@groutr
Contributor Author

groutr commented Oct 28, 2021

I'm sure I had a better use case when I created the PR, but I can't recall it now.

One use case that comes to mind: when I ask "is this distinct?", I often really mean "why isn't this distinct?" If isdistinct is False, it's natural to wonder what the duplicated elements are. Pandas has duplicated, and with this PR toolz can answer the same question.

@eriknw
Member

eriknw commented Oct 28, 2021

Yeah, that sounds reasonable.

@groutr
Contributor Author

groutr commented Oct 29, 2021

@eriknw Which name do you find easier to remember: toolz.duplicated (toolz.duplicates?) or toolz.nonunique?

@groutr
Contributor Author

groutr commented May 2, 2022

I think I prefer the name nonunique as we don't produce a mask like pd.duplicated.
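
To illustrate the distinction behind the naming: a mask-style API, in the spirit of pandas' duplicated with its default of keeping the first occurrence, marks each position as duplicate or not, whereas nonunique yields the repeated elements themselves. A minimal pure-Python sketch follows; duplicated_mask is a hypothetical helper, not part of toolz or pandas.

```python
from itertools import compress


def duplicated_mask(seq):
    # Hypothetical helper: True at each position whose element
    # has already appeared earlier in the sequence.
    seen = set()
    mask = []
    for item in seq:
        mask.append(item in seen)
        seen.add(item)
    return mask


data = [1, 2, 3, 4, 5, 1, 2, 3]
mask = duplicated_mask(data)
# Selecting with the mask recovers what nonunique would yield directly.
dupes = tuple(compress(data, mask))
```

Here `mask` is five `False` values followed by three `True` values, and `dupes` is `(1, 2, 3)`.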

@groutr
Contributor Author

groutr commented May 6, 2022

I think this is ready. What do you think @eriknw?

2 participants