
Array string distance alpha #2195

Open

JonnyShiUW wants to merge 2 commits into splink4_dev

Conversation

JonnyShiUW

… distance feature

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

Closes #1994

Give a brief description for the solution you have provided

First prototype of fuzzy matching for array-value columns for DuckDB

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Updated CHANGELOG.md (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter
  • Run the spellchecker (if appropriate)

@JonnyShiUW JonnyShiUW changed the base branch from master to splink4_dev May 21, 2024 22:11
@zmbc (Contributor) commented May 21, 2024

Splink folks: this is a work in progress! I'll give it a review first.

@zmbc (Contributor) left a comment

Thanks @JonnyShiUW! Looks good overall, with a few minor things to work out. We also need to add some tests.

Comment on lines +767 to +772
self.distance_function = validate_categorical_parameter(
allowed_values=["levenshtein", "damerau_levenshtein", "jaro_winkler", "jaro"],
parameter_value=distance_function,
level_name=self.__class__.__name__,
parameter_name="distance_function"
)

I think it's an open question whether it is better for the user to define any function name they want (as in DistanceFunctionLevel) or only to have certain options (but it gets auto-transpiled).

),
pair -> {d_fn}(pair[1], pair[2])
)
) <= {self.distance_threshold}"""

This comparison should likely change to a >= for "jaro" and "jaro_winkler" where higher scores are more similar.
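One way to handle this flip is a small helper that chooses the operator from the function name. A minimal sketch only; the names `SIMILARITY_FUNCTIONS` and `comparison_sql` are hypothetical, not from this PR:

```python
# Hypothetical helper: pick the SQL comparison operator based on whether the
# metric is a similarity (higher = more alike) or a distance (lower = more alike).
SIMILARITY_FUNCTIONS = {"jaro", "jaro_winkler"}

def comparison_sql(distance_function: str, threshold: float, metric_expr: str) -> str:
    # Similarities pass when they meet or exceed the threshold;
    # distances pass when they stay at or below it.
    op = ">=" if distance_function in SIMILARITY_FUNCTIONS else "<="
    return f"{metric_expr} {op} {threshold}"
```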


Oh and distance_threshold should probably be a float, not an int.

self,
col_name: str,
distance_threshold_or_thresholds: Union[Iterable[int], int] = [1],
distance_function: str = "levenshtein",

This should probably be a custom type that lists the allowable values, similar to DateMetricType above.

)

def create_label_for_charts(self) -> str:
return f"Array string distance <= {self.distance_threshold}"

Let's include the distance function name here

comma_separated_thresholds_string = ", ".join(map(str, self.thresholds))
plural = "s" if len(self.thresholds) > 1 else ""
return (
f"Array string distance at maximum size{plural} "

Let's include the distance function name here

Comment on lines +792 to +794
x -> list_transform(
{col.name_r},
y -> [x,y]

Suggested change:

- x -> list_transform(
-     {col.name_r},
-     y -> [x,y]
+ l_item -> list_transform(
+     {col.name_r},
+     r_item -> [l_item, r_item]

Just for readability

@RobinL (Member) commented May 22, 2024

Thanks! I'm on leave after today, but I had a very quick look and tried it out:

Runnable code to try new comparison
import duckdb
import pandas as pd

import splink.internals.comparison_level_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator
from splink.internals.comparison_library import CustomComparison
from splink.internals.dialects import DuckDBDialect

data_l = pd.DataFrame.from_dict(
    [
        {"unique_id": 1, "name": ["robin", "james"]},
        {"unique_id": 2, "name": ["robyn", "steve"]},
        {"unique_id": 3, "name": ["stephen"]},
        {"unique_id": 4, "name": ["stephen"]},
    ]
)

arr_comparison = cl.ArrayStringDistanceLevel("name", 2, "levenshtein")
cc = CustomComparison(
    [
        cl.NullLevel("name"),
        cl.ExactMatchLevel("name"),
        arr_comparison,
        cl.ElseLevel(),
    ]
)
print(arr_comparison.create_sql(DuckDBDialect()))

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=["1=1"],
    comparisons=[cc],
)


linker = Linker(data_l, settings, database_api=DuckDBAPI(), set_up_basic_logging=True)


linker.predict().as_pandas_dataframe()


# Test sql
literal_array_l = str(["robin", "james"])
literal_array_r = str(["robyn", "bob"])


sql = f"""
select
    list_max(
        list_transform(
            flatten(
                list_transform(
                    {literal_array_l},
                    x -> list_transform(
                        {literal_array_r},
                        y -> [x,y]
                    )
                )
            ),
            pair -> jaro_winkler_similarity(pair[1], pair[2])
        )
    ) >= 0.5 as result
"""

duckdb.sql(sql)

Overall implementation looks great, thanks. Need to think a bit about the 'user facing' aspect of this:

  • Best way to allow the user to provide the distance function
  • How to deal with the distance vs similarity issue (for levenshtein, higher is worse; for others like jaro, higher is better)

For consistency we could consider doing something similar to:

class JaroWinklerAtThresholds(ComparisonCreator):
class LevenshteinAtThresholds(ComparisonCreator):

Where we use distance_threshold_or_thresholds vs score_threshold_or_thresholds

So I wonder whether we could do something like:
class AbsoluteDateDifferenceLevel(AbsoluteTimeDifferenceLevel):

Where we have a simple inheritance that slightly changes the behaviour.

So we could have a base class that takes a higher_is_more_similar type argument, e.g. like here,

But then what's exposed to the user is actually two functions, called something like ArrayStringDistanceLevel and ArrayStringSimilarityLevel which set the higher_is_more_similar argument for the user, so they don't have to think about it.
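That split could look roughly like this. A sketch only, with simplified constructors; the base class name is hypothetical:

```python
# Hypothetical sketch of the two-subclass idea: a shared base level takes
# higher_is_more_similar, and the two user-facing classes fix it.
class _ArrayStringMetricLevelBase:
    def __init__(self, col_name, distance_function, threshold, higher_is_more_similar):
        self.col_name = col_name
        self.distance_function = distance_function
        self.threshold = threshold
        # Direction of the SQL comparison is decided here, once.
        self.operator = ">=" if higher_is_more_similar else "<="

class ArrayStringDistanceLevel(_ArrayStringMetricLevelBase):
    """For distances like levenshtein: lower means more similar."""
    def __init__(self, col_name, distance_function, threshold):
        super().__init__(col_name, distance_function, threshold,
                         higher_is_more_similar=False)

class ArrayStringSimilarityLevel(_ArrayStringMetricLevelBase):
    """For similarities like jaro_winkler: higher means more similar."""
    def __init__(self, col_name, distance_function, threshold):
        super().__init__(col_name, distance_function, threshold,
                         higher_is_more_similar=True)
```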

Or we could even consider going as far as simply exposing ArrayJaroWinklerLevel, ArrayLevenshteinLevel and maybe a couple more to the user. Though overall, if we set a custom type that allows autocomplete of the supported distance_function values, I think I'd err on the side of just having ArrayStringDistanceLevel and ArrayStringSimilarityLevel.

Finally, we'll want to check compatibility across dialects. I think to begin with, supporting just Spark and DuckDB should be fine.

@RobinL (Member) commented May 22, 2024

Just to further explain the distance vs score distinction, here's a Slack conversation I dug up where I asked the team:

[image: Slack conversation screenshot]

@zmbc (Contributor) commented May 22, 2024

But then what's exposed to the user is actually two functions, called something like ArrayStringDistanceLevel and ArrayStringSimilarityLevel which set the higher_is_more_similar argument for the user, so they don't have to think about it.

This sounds like the best compromise to me!

@zmbc (Contributor) commented Jun 26, 2024

@RobinL Circling back to how to design this, I actually would like to propose consistency with the DistanceFunctionAtThresholds class, rather than consistency with JaroWinklerAtThresholds etc.

Specifically, we'd have PairwiseStringDistanceFunctionAtThresholds and PairwiseStringDistanceFunctionLevel. These would each take a distance_function argument that would be validated to be one of the options "jaro", "levenshtein", etc., and transpiled automatically. The value of this argument would then automatically set higher_is_more_similar.

This feels slightly weird given that "DistanceFunction" seems to imply that higher is more distant, i.e. less similar, but that is already an issue with the existing DistanceFunctionAtThresholds.

As an add-on (in a separate PR?) we could also have a more configurable PairwiseDistanceFunctionAtThresholds (and associated level) that did not restrict the function used, did not transpile the function name, and required the user to specify higher_is_more_similar. I'd also like to give the user the ability to customize whether the min or max distance/similarity is used.

What do you think?
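A rough sketch of the proposal above (the function-name mapping is an assumption for illustration, not Splink's actual transpilation table):

```python
# Hypothetical mapping: distance_function -> (DuckDB function name,
# higher_is_more_similar). Jaro and Jaro-Winkler are similarities.
_PAIRWISE_FUNCTIONS = {
    "levenshtein": ("levenshtein", False),
    "damerau_levenshtein": ("damerau_levenshtein", False),
    "jaro": ("jaro_similarity", True),
    "jaro_winkler": ("jaro_winkler_similarity", True),
}

class PairwiseStringDistanceFunctionLevel:
    def __init__(self, col_name: str, distance_function: str, distance_threshold: float):
        if distance_function not in _PAIRWISE_FUNCTIONS:
            raise ValueError(
                f"distance_function must be one of {sorted(_PAIRWISE_FUNCTIONS)}"
            )
        self.col_name = col_name
        self.threshold = distance_threshold
        self.sql_function, higher_is_more_similar = _PAIRWISE_FUNCTIONS[distance_function]
        # The function choice automatically fixes the aggregation
        # (best pairwise score) and the comparison direction.
        self.aggregation = "list_max" if higher_is_more_similar else "list_min"
        self.operator = ">=" if higher_is_more_similar else "<="
```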

@RobinL (Member) commented Jul 2, 2024

Yeah - good spot re the existing issue/weirdness with DistanceFunctionAtThresholds

I think I agree with you - PairwiseStringDistanceFunctionAtThresholds and PairwiseStringDistanceFunctionLevel seem like the best options and the most consistent. I also think it's a good idea to constrain the list of possibilities with a validated enum, to keep a lid on complexity. Also, we can use typing.Literal so it auto-completes for the user:

Literal["jaro", ...]

I think to minimise complexity I'm minded to leave it at that rather than go through the complexity of allowing any arbitrary sql function. For highly customised things, the user can provide the comparison as a dict, and they can use PairwiseStringDistanceFunctionLevel to generate something close to what they need, that they can then manually edit.
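The typing.Literal idea might look like this; the alias name and the helper function are illustrative, not from the PR:

```python
from typing import Literal, get_args

# Hypothetical alias: IDEs can auto-complete these values for the user.
StringDistanceFunction = Literal[
    "levenshtein", "damerau_levenshtein", "jaro", "jaro_winkler"
]

def validate_distance_function(name: str) -> str:
    # Literal is a static hint only, so back it with a runtime check.
    allowed = get_args(StringDistanceFunction)
    if name not in allowed:
        raise ValueError(f"distance_function must be one of {allowed}, got {name!r}")
    return name
```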

Development

Successfully merging this pull request may close these issues.

[FEAT] Allow fuzzy matches on array-valued columns