Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace python-Levenshtein with rapidfuzz #10

Merged
merged 10 commits into from
Aug 18, 2023
Merged

replace python-Levenshtein with rapidfuzz #10

merged 10 commits into from
Aug 18, 2023

Conversation

maxbachmann
Copy link
Contributor

@maxbachmann maxbachmann commented Nov 6, 2021

The pull request replaces python-Levenshtein with rapidfuzz, while still keeping the pure python fallback. This change has the following reasons:

  • Python-Levenshtein / StringMatcher.py are the only parts of the library which require the GPL license. Over the years there have been multiple requests for a more open license (e.g. Any specific reason why this package is released under GPLv2? #6). After this change it should be possible to use a more open license again
  • It is often an order of magnitude faster than the previous implementation (see benchmark results below)
  • It comes with python wheels making the installation for users simpler
  • It replaces the broken get_matching_blocks implementation from python-Levenshtein which lead to incorrect results
  • it is actively maintained, which is not the case for python-Levenshtein, which will not work in Python 3.12 anymore
  • it removes the different behavior between thefuzz and thefuzz[speedup] by always using thefuzz[speedup] and providing a pure Python fallback which has the same results
  • significantly reduce the size of the implementation which reduced the maintenance effort

issues this fixes

benchmarks

original_benchmark.txt
new_benchmark.txt

test_thefuzz_hypothesis.py Show resolved Hide resolved
test_thefuzz.py Show resolved Hide resolved
@eabase
Copy link

eabase commented Feb 6, 2022

Please merge!
I'm not able to use thefuzz, since the python-Levenshtein can't compile on my machine.

@josegonzalez
Copy link
Contributor

@bigtoast thoughts on this one? Seems legit but I'm no export on the fuzz 👮🏽 .

@maxbachmann
Copy link
Contributor Author

@josegonzalez @bigtoast any news on this?

@maxbachmann
Copy link
Contributor Author

maxbachmann commented Jul 7, 2022

Updated the pr assuming Python 2.7 will be dropped at some point. It does now use the pure python fallback of rapidfuzz. This has the following advantages:

  • behavior of pure Python and C++ implementation have the same results, which led to a lot of confusion in the past
  • faster pure Python version than the previous difflib based implementation
  • less code to maintain

@protux
Copy link

protux commented Nov 10, 2022

@kbasnayake-seatgeek @Hi-Tech-SeatGeek @josegonzalez @bigtoast
Hi, any updates on this one? This repository is broken with Levenshtein 20.4+

@maxbachmann
Copy link
Contributor Author

@protux what exactly is broken with these versions of Levenshtein? Do you refer to the same issue as #37? In this case the upgrade mechanism of pip does not work properly. You can fix this in your environment using one of the following methods:

  1. uninstall Levenshtein+python-Levenshtein and install them again
  2. pin the version yourself

@lucasrodes
Copy link

Any news on this?

@maxbachmann
Copy link
Contributor Author

maxbachmann commented Jan 13, 2023

No it appears seatgeek does not really monitor this repository either. Apparently they only wanted to rename it. Note that this PR converts thefuzz to a very simple compatibility wrapper around rapidfuzz, since there are some very small differences between the two. However for most projects it would be better to simply replace their dependency on thefuzz/fuzzywuzzy alltogether and use rapidfuzz directly (e.g. from rapidfuzz import fuzz instead of from thefuzz import fuzz)

@maxbachmann
Copy link
Contributor Author

As an update regarding this PR: I am the new maintainer of python-Levenshtein since a couple of months, so this will no longer break in Python 3.12. This PR still introduces the following improvements:

  • rapidfuzz is a dependency of python-Levenshtein anyways, so it reduces the amount of recursive dependencies
  • Python-Levenshtein / StringMatcher.py are the only GPL licensed parts, so after dropping them it would be possible to use a more permissive license again
  • significantly faster (see initial message)
  • both the C++ and the pure Python fallback have consistent results, which was not the case before
  • fixes some issues with partial_ratio returning incorrect scores in some cases both for the C and pure Python implementation
  • significantly reduces the amount of code in this library, which should make maintenance simpler

@maxbachmann
Copy link
Contributor Author

@josegonzalez I rebased this on the latest master now that the other PRs got merged

Comment on lines +104 to +105
self.assertEqual(fuzz.partial_token_sort_ratio(self.s10, self.s10a, full_process=False), 67)
self.assertEqual(fuzz.partial_token_sort_ratio(self.s10a, self.s10, full_process=False), 67)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original implementation did have different results for these two tests (50 / 67), since it only allowed alignments behind the string, but not in the front. This is fixed here.

@lucasrodes
Copy link

@maxbachmann Thanks for your input! For the time being, we'll be switching to rapidfuzz!

test_thefuzz.py Outdated
Comment on lines 290 to 293
if scorer in {fuzz.token_set_ratio, fuzz.partial_token_set_ratio, fuzz.WRatio}:
self.assertEqual(scorer('', ''), 0)
else:
self.assertEqual(scorer('', ''), 100)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was already the case in thefuzz beforehand. It is unclear to me whether this was done on purpose, or whether this is a bug.

@josegonzalez
Copy link
Contributor

josegonzalez commented Feb 15, 2023

I'm on vacation but will merge this in and create a major release based off of it when I'm back.

setup.py Show resolved Hide resolved
@maxbachmann
Copy link
Contributor Author

ping @josegonzalez

@maxbachmann
Copy link
Contributor Author

I did now write a compatible wrapper for process.py as well, which uses the rapidfuzz implementations under the hood. This allows a couple of optimizations making these parts significantly faster:
fuzz.py using rapidfuzz, but process.py still using the original implementation:

Test process.extract(scorer =  fuzz.QRatio) for string: "zarakana - cirque du soleil - bellagio"
-------------------------------
Total time: 0.151360s. Average run: 15.136ms.
Test process.extract(scorer =  fuzz.WRatio) for string: "zarakana - cirque du soleil - bellagio"
-------------------------------
Total time: 0.305969s. Average run: 30.597ms.

both using raidfuzz under the hood:

Test process.extract3(scorer =  fuzz.QRatio) for string: "zarakana - cirque du soleil - bellagio"
-------------------------------
Total time: 0.019336s. Average run: 1.934ms.
Test process.extract3(scorer =  fuzz.WRatio) for string: "zarakana - cirque du soleil - bellagio"
-------------------------------
Total time: 0.120800s. Average run: 12.080ms.

Since this is a complete rewrite of the affected functions, the only function not rewritten by this PR is process.dedupe. This function was authored by @brandomr in seatgeek/fuzzywuzzy@712833a and except for whitespace changes was never changed again. So we will need to ask a lot fewer people to change the license :)

@maxbachmann
Copy link
Contributor Author

As a note this breaks, since rapidfuzz does not support generators yet. I will add support for this, since this seems generally useful

@maxbachmann
Copy link
Contributor Author

maxbachmann commented Apr 1, 2023

I basically rewrote dedupe as well, since it can be significantly simplified. In addition I released an updated version of rapidfuzz which supports generators in process.extract and fixes the test failure.

So from what I can tell there is nothing stopping us from changing the license now.

maxbachmann and others added 5 commits April 16, 2023 17:48
You must always close a file after opening it. The current code might not work on Windows + PyPy, since it's assuming the garbage collector will run (which it won't on PyPy) & the file descriptor will be released (which is required on Windows).
@nonchris
Copy link

Okay, so why is this still open? :)

@maxbachmann
Copy link
Contributor Author

maxbachmann commented May 20, 2023

To my understanding Seatgeek doesn't have the capacity to maintain thefuzz/fuzzywuzzy at this point. I did offer @josegonzalez and @bigtoast to take over maintenance of the package, which they they wanted to look into. However I did not hear back about this since then (around 1 month ago).

@josegonzalez
Copy link
Contributor

Sorry I no longer work at SeatGeek, so I cannot help anymore.

@maxbachmann
Copy link
Contributor Author

Sorry I no longer work at SeatGeek, so I cannot help anymore.

I already thought so, since your seatgeek e-mail address ceased to exist last week ;)

I still hope @bigtoast get's back to this at some point (and would still be willing to take over maintenance). At least right now my recommendation for anyone using fuzzywuzzy/thefuzz would be to replace to usage with rapidfuzz.

@ddelange
Copy link

For the time being, people can also add to their requirements.txt:

thefuzz @ https://github.com/maxbachmann/thefuzz2/archive/refs/heads/master.zip

or run:

pip install thefuzz@https://github.com/maxbachmann/thefuzz2/archive/refs/heads/master.zip

@ddelange
Copy link

ddelange commented Aug 10, 2023

Hi @andrewjkerr 👋

I saw from your bio that you currently work at seatgeek. Could you ask someone from seatgeek's python related devs whether they can do a release of this PR?

If there is no capacity, could you give @maxbachmann write permission to https://pypi.org/project/thefuzz? What is your PyPI username @maxbachmann?

Then he can either become a maintainer of this repository, or you archive this repo with a notice in the README, and his fork becomes leading?

What do you think?

@maxbachmann
Copy link
Contributor Author

If there is no capacity, could you give @maxbachmann write permission to https://pypi.org/project/thefuzz? What is your PyPI username @maxbachmann?

My PyPI username is maxbachmann as well. I am in contact with Seatgeek in regards to taking over maintenance for both thefuzz and fuzzywuzzysince around half a year ago. To my understanding this is still discussed internally.

Then he can either become a maintainer of this repository, or you archive this repo with a notice in the README, and his fork becomes leading?

I would personally prefer taking over the existing repository, since it would allow me to handle existing issues. However I would be fine either way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants