-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Array string distance alpha #2195
base: splink4_dev
Are you sure you want to change the base?
Array string distance alpha #2195
Conversation
… distance feature
Splink folks: this is a work in progress! I'll give it a review first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @JonnyShiUW! Looks good overall, with a few minor things to work out. We also need to add some tests.
self.distance_function = validate_categorical_parameter( | ||
allowed_values=["levenshtein", "damerau_levenshtein", "jaro_winkler", "jaro"], | ||
parameter_value=distance_function, | ||
level_name=self.__class__.__name__, | ||
parameter_name="distance_function" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's an open question whether it is better for the user to define any function name they want (as in DistanceFunctionLevel
) or only to have certain options (but it gets auto-transpiled).
), | ||
pair -> {d_fn}(pair[1], pair[2]) | ||
) | ||
) <= {self.distance_threshold}""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comparison should likely change to a >=
for "jaro" and "jaro_winkler" where higher scores are more similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh and distance_threshold should probably be a float, not an int.
self, | ||
col_name: str, | ||
distance_threshold_or_thresholds: Union[Iterable[int], int] = [1], | ||
distance_function: str = "levenshtein", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be a custom type that lists the allowable values, similar to DateMetricType
above.
) | ||
|
||
def create_label_for_charts(self) -> str: | ||
return f"Array string distance <= {self.distance_threshold}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's include the distance function name here
comma_separated_thresholds_string = ", ".join(map(str, self.thresholds)) | ||
plural = "s" if len(self.thresholds) > 1 else "" | ||
return ( | ||
f"Array string distance at maximum size{plural} " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's include the distance function name here
x -> list_transform( | ||
{col.name_r}, | ||
y -> [x,y] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x -> list_transform( | |
{col.name_r}, | |
y -> [x,y] | |
l_item -> list_transform( | |
{col.name_r}, | |
r_item -> [l_item, r_item] |
Just for readability
Thanks! I'm on leave after today, but i had a very quick look, and tried it out: Runnable code to try new comparisonimport duckdb
import pandas as pd
import splink.internals.comparison_level_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator
from splink.internals.comparison_library import CustomComparison
from splink.internals.dialects import DuckDBDialect
data_l = pd.DataFrame.from_dict(
[
{"unique_id": 1, "name": ["robin", "james"]},
{"unique_id": 2, "name": ["robyn", "steve"]},
{"unique_id": 3, "name": ["stephen"]},
{"unique_id": 4, "name": ["stephen"]},
]
)
arr_comparison = cl.ArrayStringDistanceLevel("name", 2, "levenshtein")
cc = CustomComparison(
[
cl.NullLevel("name"),
cl.ExactMatchLevel("name"),
arr_comparison,
cl.ElseLevel(),
]
)
print(arr_comparison.create_sql(DuckDBDialect()))
settings = SettingsCreator(
link_type="dedupe_only",
blocking_rules_to_generate_predictions=["1=1"],
comparisons=[cc],
)
linker = Linker(data_l, settings, database_api=DuckDBAPI(), set_up_basic_logging=True)
linker.predict().as_pandas_dataframe()
# Test sql
literal_array_l = str(["robin", "james"])
literal_array_r = str(["robyn", "bob"])
sql = f"""
select
list_max(
list_transform(
flatten(
list_transform(
{literal_array_l},
x -> list_transform(
{literal_array_r},
y -> [x,y]
)
)
),
pair -> jaro_winkler_similarity(pair[1], pair[2])
)
) >= 0.5 as result
"""
duckdb.sql(sql) Overall implementation looks great, thanks. Need to think a bit about the 'user facing' aspect of this:
For consistency we could consider doing something similar to: class JaroWinklerAtThresholds(ComparisonCreator): Where we use So I wonder whether we coud do something like: Where we have a simple inheritance that slightly changes the behaviour. So we could have a base class that takes a But then what's exposed to the user is actually two functions, called something like Or we could even consider going as far as simply exposing Finally, we'll want to check compat across dialects. I think to begin with supporting just Spark and Duckdb should be fine |
This sounds like the best compromise to me! |
@RobinL Circling back to how to design this, I actually would like to propose consistency with the Specifically, we'd have This feels slightly weird given that "DistanceFunction" seems to imply that higher is more distant, i.e. less similar, but that is already an issue with the existing As an add-on (in a separate PR?) we could also have a more configurable What do you think? |
Yeah - good spot re the existing issue/weirdness with I think I agree with you -
I think to minimise complexity I'm minded to leave it at that rather than go through the complexity of allowing any arbitrary sql function. For highly customised things, the user can provide the comparison as a dict, and they can use |
… distance feature
Type of PR
Is your Pull Request linked to an existing Issue or Pull Request?
Closes #1994
Give a brief description for the solution you have provided
First prototype of fuzzy matching for array-value columns for DuckDB
PR Checklist