Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ArraySubset defined, first draft #2383

Closed

Conversation

mihir-packmoose
Copy link
Contributor

@mihir-packmoose mihir-packmoose commented Sep 4, 2024

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

Closes #2382

Give a brief description for the solution you have provided

Defines a new class in splink.internals.comparison_level_library called ArrayIntersectPercentage. This class takes two arguments, col_name and percentage_threshold. The functionality here has been tested locally.

No testing yet, will continue adding components as I go along.

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Updated CHANGELOG.md (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter
  • Run the spellchecker (if appropriate)

@mihir-packmoose
Copy link
Contributor Author

Hello! I'm trying to figure out a couple of things with this PR:

  1. I ran the linter and poetry run ruff format changed 27 files, all of which were ipynb files unrelated to my commit. I'm assuming it's fine to not push those changes out and tick off the linter task from the PR checklist?
  2. What do my next steps look like? I'm assuming that I need to add tests since this is a code change, but after skimming through the tests, I couldn't figure out exactly where and what form these tests would take.

Thanks and have a great day!

@mihir-packmoose mihir-packmoose marked this pull request as draft September 4, 2024 19:29
@mihir-packmoose mihir-packmoose marked this pull request as ready for review September 4, 2024 19:29
@mihir-packmoose mihir-packmoose marked this pull request as draft September 5, 2024 16:42
@RobinL
Copy link
Member

RobinL commented Sep 11, 2024

Yeah, happy to accept an ArraySubsetLevel, seems like a sensible idea. I think a slight adjustment to your implementation is needed, something like:

class ArraySubsetLevel(ComparisonLevelCreator):
    def __init__(self, col_name: str | ColumnExpression):
        """Represents a comparison level where the smaller array is an
        exact subset of the larger array.
        """
        self.col_expression = ColumnExpression.instantiate_if_str(col_name)

    @unsupported_splink_dialects(["sqlite"])
    def create_sql(self, sql_dialect: SplinkDialect) -> str:
        sqlglot_dialect_name = sql_dialect.sqlglot_name

        sqlglot_base_dialect_sql = """
            ARRAY_SIZE(
            ARRAY_INTERSECT(___col____l, ___col____r)) /
            LEAST(ARRAY_SIZE(___col____l), ARRAY_SIZE(___col____r))
            == 1
            """
        translated = _translate_sql_string(
            sqlglot_base_dialect_sql, sqlglot_dialect_name
        )

        self.col_expression.sql_dialect = sql_dialect
        col = self.col_expression
        col = self.col_expression
        translated = translated.replace("___col____l", col.name_l)
        translated = translated.replace("___col____r", col.name_r)
        return translated

    def create_label_for_charts(self) -> str:
        return "Array subset"

You then need to expose here for the user-facing API

In tersm of testing, you should be able to use an approach like this:
https://github.com/mihir-packmoose/splink/blob/bc18a044cdee713f3278240de6cbd99a27232b6e/tests/test_array_columns.py#L13

i.e. add tests to the test_array_columns.py file

I've checked the above suggestion against duckdb but not spark, so not 100% it'll work there as well

If you're fighting with ruff don't worry too much, i can add a commit once everything else is ready to fix. But yeah, you shouldn't need to change the .ipynbs

@mihir-packmoose
Copy link
Contributor Author

Hey, Robin! Thanks for your help with where to place the tests and the corrections you mentioned. I've made the necessary amendments, and added some edgecase handling for cases where one array of the two was empty (we had a bit of a division-by-zero problem but it now runs through a preliminary check to make sure that the smaller of the two arrays is not empty).

A couple of questions here:

  1. Since I'm not too sure about how the @mark_with_dialects_excluding decorator works, I removed the exclusion directive on "spark" since you mentioned that you'd checked the suggestion against duckdb and not spark. I'm presuming that this means my tests also ran for spark, but I'm not sure. What do you think?
  2. When we consider mathematical accuracy, $\emptyset \subset \emptyset$. However, me and my team feel that:
    -- For this specific application, the mathematical rule need not apply, since we don't expect the function to be required to behave in this manner when invoked; Quite the opposite, we expect that such a comparison should intuitively yield a False.
    -- In the context that we expect subsets to be used, the expectation should probably be that there are no empty-to-empty matching attempts in the first place.
    Even so, we could write a bit of logic into the query that makes it behave mathematically accurately. So we've decided that this should be left to the discretion of the original library authors. Let me know about your thoughts!

@RobinL
Copy link
Member

RobinL commented Sep 17, 2024

@mihir-packmoose sorry for the delay. Thanks - yes I'm happy with this and agree re: arrays with zero elements. Are you happy for me to add some commits for tests, and go ahead and merge?

@mihir-packmoose
Copy link
Contributor Author

Sounds good! Thanks again for your help, Robin!

@mihir-packmoose mihir-packmoose marked this pull request as ready for review September 17, 2024 15:13
@mihir-packmoose
Copy link
Contributor Author

P.S. I see what you mean by adding commits for tests now, I tried to resolve those conflicts on test_array_columns.py and the literal_utils file has gone through a complete refactor haha. I'll leave it to you!

@RobinL RobinL changed the title ArrayIntersectPercentage defined, first draft ArraySubset defined, first draft Sep 20, 2024
@RobinL
Copy link
Member

RobinL commented Sep 20, 2024

I've moved this changeset over to #2416 to simplify the merge. I cherry picked your commits so you'll still appear in the contributors :-)

@RobinL RobinL closed this Sep 20, 2024
@mihir-packmoose
Copy link
Contributor Author

Sounds good! Thanks for helping me through the process, @RobinL !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEAT] A percentage threshold for array intersection
2 participants