
Array string distance alpha #2195

Open

JonnyShiUW wants to merge 2 commits into splink4_dev

Conversation

JonnyShiUW

… distance feature

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

Closes #1994

Give a brief description for the solution you have provided

First prototype of fuzzy matching for array-value columns for DuckDB

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Updated CHANGELOG.md (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter
  • Run the spellchecker (if appropriate)

@JonnyShiUW JonnyShiUW changed the base branch from master to splink4_dev May 21, 2024 22:11
@zmbc (Contributor) commented May 21, 2024

Splink folks: this is a work in progress! I'll give it a review first.

@zmbc (Contributor) left a comment

Thanks @JonnyShiUW! Looks good overall, with a few minor things to work out. We also need to add some tests.

Comment on lines +767 to +772
self.distance_function = validate_categorical_parameter(
allowed_values=["levenshtein", "damerau_levenshtein", "jaro_winkler", "jaro"],
parameter_value=distance_function,
level_name=self.__class__.__name__,
parameter_name="distance_function"
)

I think it's an open question whether it is better for the user to define any function name they want (as in DistanceFunctionLevel) or only to have certain options (but it gets auto-transpiled).

),
pair -> {d_fn}(pair[1], pair[2])
)
) <= {self.distance_threshold}"""

This comparison should likely change to a >= for "jaro" and "jaro_winkler" where higher scores are more similar.
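One way to handle this flip is a small helper that chooses the operator from the function name. A minimal sketch only; the names `SIMILARITY_FUNCTIONS` and `comparison_sql` are hypothetical, not from this PR:

```python
# Hypothetical helper: pick the SQL comparison operator based on whether the
# metric is a similarity (higher = more alike) or a distance (lower = more alike).
SIMILARITY_FUNCTIONS = {"jaro", "jaro_winkler"}

def comparison_sql(distance_function: str, threshold: float, metric_expr: str) -> str:
    # Similarities pass when they meet or exceed the threshold;
    # distances pass when they stay at or below it.
    op = ">=" if distance_function in SIMILARITY_FUNCTIONS else "<="
    return f"{metric_expr} {op} {threshold}"
```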


Oh and distance_threshold should probably be a float, not an int.

self,
col_name: str,
distance_threshold_or_thresholds: Union[Iterable[int], int] = [1],
distance_function: str = "levenshtein",

This should probably be a custom type that lists the allowable values, similar to DateMetricType above.

)

def create_label_for_charts(self) -> str:
return f"Array string distance <= {self.distance_threshold}"

Let's include the distance function name here

comma_separated_thresholds_string = ", ".join(map(str, self.thresholds))
plural = "s" if len(self.thresholds) > 1 else ""
return (
f"Array string distance at maximum size{plural} "

Let's include the distance function name here

Comment on lines +792 to +794
x -> list_transform(
{col.name_r},
y -> [x,y]

Suggested change:

- x -> list_transform(
-     {col.name_r},
-     y -> [x,y]
+ l_item -> list_transform(
+     {col.name_r},
+     r_item -> [l_item, r_item]

Just for readability

@RobinL (Member) commented May 22, 2024

Thanks! I'm on leave after today, but I had a very quick look and tried it out:

Runnable code to try new comparison
import duckdb
import pandas as pd

import splink.internals.comparison_level_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator
from splink.internals.comparison_library import CustomComparison
from splink.internals.dialects import DuckDBDialect

data_l = pd.DataFrame.from_dict(
    [
        {"unique_id": 1, "name": ["robin", "james"]},
        {"unique_id": 2, "name": ["robyn", "steve"]},
        {"unique_id": 3, "name": ["stephen"]},
        {"unique_id": 4, "name": ["stephen"]},
    ]
)

arr_comparison = cl.ArrayStringDistanceLevel("name", 2, "levenshtein")
cc = CustomComparison(
    [
        cl.NullLevel("name"),
        cl.ExactMatchLevel("name"),
        arr_comparison,
        cl.ElseLevel(),
    ]
)
print(arr_comparison.create_sql(DuckDBDialect()))

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=["1=1"],
    comparisons=[cc],
)


linker = Linker(data_l, settings, database_api=DuckDBAPI(), set_up_basic_logging=True)


linker.predict().as_pandas_dataframe()


# Test sql
literal_array_l = str(["robin", "james"])
literal_array_r = str(["robyn", "bob"])


sql = f"""
select
    list_max(
        list_transform(
            flatten(
                list_transform(
                    {literal_array_l},
                    x -> list_transform(
                        {literal_array_r},
                        y -> [x,y]
                    )
                )
            ),
            pair -> jaro_winkler_similarity(pair[1], pair[2])
        )
    ) >= 0.5 as result
"""

duckdb.sql(sql)

Overall implementation looks great, thanks. Need to think a bit about the 'user facing' aspect of this:

  • Best way to allow the user to provide the distance function
  • How to deal with the distance vs similarity issue (for levenshtein, higher is worse; for others like jaro, higher is better)

For consistency we could consider doing something similar to:

class JaroWinklerAtThresholds(ComparisonCreator):
class LevenshteinAtThresholds(ComparisonCreator):

Where we use distance_threshold_or_thresholds vs score_threshold_or_thresholds

So I wonder whether we could do something like:
class AbsoluteDateDifferenceLevel(AbsoluteTimeDifferenceLevel):

Where we have a simple inheritance that slightly changes the behaviour.

So we could have a base class that takes a higher_is_more_similar type argument, e.g. like here,

But then what's exposed to the user is actually two functions, called something like ArrayStringDistanceLevel and ArrayStringSimilarityLevel which set the higher_is_more_similar argument for the user, so they don't have to think about it.
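That split could look roughly like this. A sketch only, with simplified constructors; the base class name is hypothetical:

```python
# Hypothetical sketch of the two-subclass idea: a shared base level takes
# higher_is_more_similar, and the two user-facing classes fix it.
class _ArrayStringMetricLevelBase:
    def __init__(self, col_name, distance_function, threshold, higher_is_more_similar):
        self.col_name = col_name
        self.distance_function = distance_function
        self.threshold = threshold
        # Direction of the SQL comparison is decided here, once.
        self.operator = ">=" if higher_is_more_similar else "<="

class ArrayStringDistanceLevel(_ArrayStringMetricLevelBase):
    """For distances like levenshtein: lower means more similar."""
    def __init__(self, col_name, distance_function, threshold):
        super().__init__(col_name, distance_function, threshold,
                         higher_is_more_similar=False)

class ArrayStringSimilarityLevel(_ArrayStringMetricLevelBase):
    """For similarities like jaro_winkler: higher means more similar."""
    def __init__(self, col_name, distance_function, threshold):
        super().__init__(col_name, distance_function, threshold,
                         higher_is_more_similar=True)
```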

Or we could even consider going as far as simply exposing ArrayJaroWinklerLevel, ArrayLevenshteinLevel and maybe a couple more to the user. Though overall, if we set a custom type that allows autocomplete of the supported distance_function values, I think I'd err on the side of just having ArrayStringDistanceLevel and ArrayStringSimilarityLevel.

Finally, we'll want to check compatibility across dialects. I think to begin with, supporting just Spark and DuckDB should be fine.

@RobinL (Member) commented May 22, 2024

Just to further explain the distance vs score distinction, here's a Slack conversation I dug up where I asked the team:

[image: Slack conversation screenshot]

@zmbc (Contributor) commented May 22, 2024

But then what's exposed to the user is actually two functions, called something like ArrayStringDistanceLevel and ArrayStringSimilarityLevel which set the higher_is_more_similar argument for the user, so they don't have to think about it.

This sounds like the best compromise to me!

@zmbc (Contributor) commented Jun 26, 2024

@RobinL Circling back to how to design this, I actually would like to propose consistency with the DistanceFunctionAtThresholds class, rather than consistency with JaroWinklerAtThresholds etc.

Specifically, we'd have PairwiseStringDistanceFunctionAtThresholds and PairwiseStringDistanceFunctionLevel. These would each take a distance_function argument that would be validated to be one of the options "jaro", "levenshtein", etc., and transpiled automatically. The value of this argument would then automatically set higher_is_more_similar.

This feels slightly weird given that "DistanceFunction" seems to imply that higher is more distant, i.e. less similar, but that is already an issue with the existing DistanceFunctionAtThresholds.

As an add-on (in a separate PR?) we could also have a more configurable PairwiseDistanceFunctionAtThresholds (and associated level) that did not restrict the function used, did not transpile the function name, and required the user to specify higher_is_more_similar. I'd also like to give the user the ability to customize whether the min or max distance/similarity is used.

What do you think?
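A rough sketch of the proposal above (the function-name mapping is an assumption for illustration, not Splink's actual transpilation table):

```python
# Hypothetical mapping: distance_function -> (DuckDB function name,
# higher_is_more_similar). Jaro and Jaro-Winkler are similarities.
_PAIRWISE_FUNCTIONS = {
    "levenshtein": ("levenshtein", False),
    "damerau_levenshtein": ("damerau_levenshtein", False),
    "jaro": ("jaro_similarity", True),
    "jaro_winkler": ("jaro_winkler_similarity", True),
}

class PairwiseStringDistanceFunctionLevel:
    def __init__(self, col_name: str, distance_function: str, distance_threshold: float):
        if distance_function not in _PAIRWISE_FUNCTIONS:
            raise ValueError(
                f"distance_function must be one of {sorted(_PAIRWISE_FUNCTIONS)}"
            )
        self.col_name = col_name
        self.threshold = distance_threshold
        self.sql_function, higher_is_more_similar = _PAIRWISE_FUNCTIONS[distance_function]
        # The function choice automatically fixes the aggregation
        # (best pairwise score) and the comparison direction.
        self.aggregation = "list_max" if higher_is_more_similar else "list_min"
        self.operator = ">=" if higher_is_more_similar else "<="
```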

@RobinL (Member) commented Jul 2, 2024

Yeah - good spot re the existing issue/weirdness with DistanceFunctionAtThresholds

I think I agree with you - PairwiseStringDistanceFunctionAtThresholds and PairwiseStringDistanceFunctionLevel seem like the best options and the most consistent. I also think it's a good idea to constrain the list of possibilities with a validated enum, to keep a lid on complexity. Also, we can use typing.Literal so it auto-completes for the user:

Literal["jaro", ...]

I think to minimise complexity I'm minded to leave it at that rather than go through the complexity of allowing any arbitrary sql function. For highly customised things, the user can provide the comparison as a dict, and they can use PairwiseStringDistanceFunctionLevel to generate something close to what they need, that they can then manually edit.
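The typing.Literal idea might look like this; the alias name and the helper function are illustrative, not from the PR:

```python
from typing import Literal, get_args

# Hypothetical alias: IDEs can auto-complete these values for the user.
StringDistanceFunction = Literal[
    "levenshtein", "damerau_levenshtein", "jaro", "jaro_winkler"
]

def validate_distance_function(name: str) -> str:
    # Literal is a static hint only, so back it with a runtime check.
    allowed = get_args(StringDistanceFunction)
    if name not in allowed:
        raise ValueError(f"distance_function must be one of {allowed}, got {name!r}")
    return name
```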

Development

Successfully merging this pull request may close these issues.

[FEAT] Allow fuzzy matches on array-valued columns