ArraySubset defined, first draft #2383

mihir-packmoose · 2024-09-04T19:12:22Z

Type of PR

BUG
FEAT
MAINT
DOC

Is your Pull Request linked to an existing Issue or Pull Request?

Give a brief description for the solution you have provided

Defines a new class in splink.internals.comparison_level_library called ArrayIntersectPercentage. This class takes two arguments, col_name and percentage_threshold. The functionality here has been tested locally.

No testing yet, will continue adding components as I go along.

PR Checklist

Added documentation for changes
Added feature to example notebooks or tutorial (if appropriate)
Added tests (if appropriate)
Updated CHANGELOG.md (if appropriate)
Made changes based off the latest version of Splink
Run the linter
Run the spellchecker (if appropriate)

mihir-packmoose · 2024-09-04T19:27:38Z

Hello! I'm trying to figure out a couple of things with this PR:

I ran the linter and poetry run ruff format changed 27 files, all of which were ipynb files unrelated to my commit. I'm assuming it's fine to not push those changes out and tick off the linter task from the PR checklist?
What do my next steps look like? I'm assuming that I need to add tests since this is a code change, but after skimming through the tests, I couldn't figure out exactly where and what form these tests would take.

Thanks and have a great day!

…ative

RobinL · 2024-09-11T13:25:22Z

Yeah, happy to accept an ArraySubsetLevel, seems like a sensible idea. I think a slight adjustment to your implementation is needed, something like:

class ArraySubsetLevel(ComparisonLevelCreator):
    def __init__(self, col_name: str | ColumnExpression):
        """Represents a comparison level where the smaller array is an
        exact subset of the larger array.
        """
        self.col_expression = ColumnExpression.instantiate_if_str(col_name)

    @unsupported_splink_dialects(["sqlite"])
    def create_sql(self, sql_dialect: SplinkDialect) -> str:
        sqlglot_dialect_name = sql_dialect.sqlglot_name

        sqlglot_base_dialect_sql = """
            ARRAY_SIZE(
            ARRAY_INTERSECT(___col____l, ___col____r)) /
            LEAST(ARRAY_SIZE(___col____l), ARRAY_SIZE(___col____r))
            == 1
            """
        translated = _translate_sql_string(
            sqlglot_base_dialect_sql, sqlglot_dialect_name
        )

        self.col_expression.sql_dialect = sql_dialect
        col = self.col_expression
        col = self.col_expression
        translated = translated.replace("___col____l", col.name_l)
        translated = translated.replace("___col____r", col.name_r)
        return translated

    def create_label_for_charts(self) -> str:
        return "Array subset"

You then need to expose here for the user-facing API

In tersm of testing, you should be able to use an approach like this:
https://github.com/mihir-packmoose/splink/blob/bc18a044cdee713f3278240de6cbd99a27232b6e/tests/test_array_columns.py#L13

i.e. add tests to the test_array_columns.py file

I've checked the above suggestion against duckdb but not spark, so not 100% it'll work there as well

If you're fighting with ruff don't worry too much, i can add a commit once everything else is ready to fix. But yeah, you shouldn't need to change the .ipynbs

mihir-packmoose · 2024-09-11T20:45:24Z

Hey, Robin! Thanks for your help with where to place the tests and the corrections you mentioned. I've made the necessary amendments, and added some edgecase handling for cases where one array of the two was empty (we had a bit of a division-by-zero problem but it now runs through a preliminary check to make sure that the smaller of the two arrays is not empty).

A couple of questions here:

Since I'm not too sure about how the @mark_with_dialects_excluding decorator works, I removed the exclusion directive on "spark" since you mentioned that you'd checked the suggestion against duckdb and not spark. I'm presuming that this means my tests also ran for spark, but I'm not sure. What do you think?
When we consider mathematical accuracy, $\emptyset \subset \emptyset$. However, me and my team feel that:
-- For this specific application, the mathematical rule need not apply, since we don't expect the function to be required to behave in this manner when invoked; Quite the opposite, we expect that such a comparison should intuitively yield a False.
-- In the context that we expect subsets to be used, the expectation should probably be that there are no empty-to-empty matching attempts in the first place.
Even so, we could write a bit of logic into the query that makes it behave mathematically accurately. So we've decided that this should be left to the discretion of the original library authors. Let me know about your thoughts!

RobinL · 2024-09-17T10:51:30Z

@mihir-packmoose sorry for the delay. Thanks - yes I'm happy with this and agree re: arrays with zero elements. Are you happy for me to add some commits for tests, and go ahead and merge?

mihir-packmoose · 2024-09-17T15:03:41Z

Sounds good! Thanks again for your help, Robin!

mihir-packmoose · 2024-09-17T15:46:37Z

P.S. I see what you mean by adding commits for tests now, I tried to resolve those conflicts on test_array_columns.py and the literal_utils file has gone through a complete refactor haha. I'll leave it to you!

RobinL · 2024-09-20T18:07:10Z

I've moved this changeset over to #2416 to simplify the merge. I cherry picked your commits so you'll still appear in the contributors :-)

mihir-packmoose · 2024-09-20T19:18:04Z

Sounds good! Thanks for helping me through the process, @RobinL !

ArrayIntersectPercentage defined, first draft

b0adf6a

mihir-packmoose marked this pull request as draft September 4, 2024 19:29

mihir-packmoose marked this pull request as ready for review September 4, 2024 19:29

mihir-packmoose marked this pull request as draft September 5, 2024 16:42

Replacing ArrayIntersectPercentage with ArraySubsetLevel as an altern…

bc18a04

…ative

mihir-packmoose and others added 4 commits September 11, 2024 10:45

Minor fixes and exposing to user-facing API

8080600

Merge branch 'master' into master

679c9f4

Added testing and edgecase handling for empty arrays

55b5944

Reordered imports to pass ruff

8d5f912

mihir-packmoose marked this pull request as ready for review September 17, 2024 15:13

RobinL changed the title ~~ArrayIntersectPercentage defined, first draft~~ ArraySubset defined, first draft Sep 20, 2024

RobinL closed this Sep 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ArraySubset defined, first draft #2383

ArraySubset defined, first draft #2383

mihir-packmoose commented Sep 4, 2024 •

edited

Loading

mihir-packmoose commented Sep 4, 2024

RobinL commented Sep 11, 2024 •

edited

Loading

mihir-packmoose commented Sep 11, 2024

RobinL commented Sep 17, 2024

mihir-packmoose commented Sep 17, 2024

mihir-packmoose commented Sep 17, 2024

RobinL commented Sep 20, 2024 •

edited

Loading

mihir-packmoose commented Sep 20, 2024

ArraySubset defined, first draft #2383

ArraySubset defined, first draft #2383

Conversation

mihir-packmoose commented Sep 4, 2024 • edited Loading

Type of PR

Is your Pull Request linked to an existing Issue or Pull Request?

Give a brief description for the solution you have provided

PR Checklist

mihir-packmoose commented Sep 4, 2024

RobinL commented Sep 11, 2024 • edited Loading

mihir-packmoose commented Sep 11, 2024

RobinL commented Sep 17, 2024

mihir-packmoose commented Sep 17, 2024

mihir-packmoose commented Sep 17, 2024

RobinL commented Sep 20, 2024 • edited Loading

mihir-packmoose commented Sep 20, 2024

mihir-packmoose commented Sep 4, 2024 •

edited

Loading

RobinL commented Sep 11, 2024 •

edited

Loading

RobinL commented Sep 20, 2024 •

edited

Loading