-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The library doesn't use Levenshtein at all, or at least it's inconsistent #53
Comments
Note that the Indel distance is the same as the Levenshtein distance with substitutions weighted as 2. Still I agree this is misleading. I think to a part this stems from python-Levenshtein, where |
Yeah, I also mentioned they're similar. Still, they're not the same.
Yup. I mentioned that too. I can't help but wonder why. The library is literally named Levenshtein. It could've been an option instead of the default, having "Indel" in the function name like it does for Jaro, or as an optional parameter. Well, I think the readme should at least be changed to tell people about this. It's where they'd read first and currently it says "It uses Levenshtein Distance" at the very beginning, while it doesn't. |
Yes no clue why the original author wrote it like this 🤷♂️ |
This is really confusing to me, I also stumbled upon the fact that different ratios are returned depending on whether python-Levenshtein is installed or not. What the hell, how is this acceptable? @maxbachmann |
In thefuzz this is no longer the case for 1.5 years. Since #10 it always uses the same implementation. The basis for ratio is the Levenshtein distance with a substitution weight of 2. This is also known as InDel distance. |
This library is described as fuzzy string matching with Levenshtein distance. However, it doesn't seem to use Levenshtein at all?
fuzz.ratio("tide", "diet") returns:
This is because python-Levenshtein uses Indel distance for ratio instead of Levenshtein. Issue 47 in python-Levenshtein mentioned that "Levenshtein.ratio is based upon the Indel distance and not on the Levenshtein distance". (That's the correct repo I think). Indel distance is similar to Levenshtein, but they are not the same and may produce different results.
But it doesn't seem to just end there. difflib, the fallback for when you don't have python-Levenshtein installed, doesn't seem to use Levenshtein distance either. Issue 128 in the older repository/version of this package mentioned that difflib uses "Ratcliff-Obershelp algorithm, not the Levenshtein distance".
To conclude:
Note: This library uses ratio function from Levenshtein package or difflib for all of its function with ratio in its name.
The text was updated successfully, but these errors were encountered: