
Allow a specific m and u probabilities to be fixed during training #2379

Merged — 6 commits, Sep 5, 2024

Conversation

RobinL (Member) commented Sep 4, 2024

This PR allows the user to fix m and u probabilities when the model is created, so that they aren't changed when training is run.

This is a fairly common requirement (e.g. here, here and here), because in some cases the user has prior knowledge of specific m and u values and wishes to fix them during training.

Suggested API:

```python
import splink.comparison_level_library as cll
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
from splink.datasets import splink_dataset_labels

labels = splink_dataset_labels.fake_1000_labels

db_api = DuckDBAPI()

first_name_comparison = cl.CustomComparison(
    comparison_levels=[
        cll.NullLevel("first_name"),
        cll.ExactMatchLevel("first_name").configure(
            m_probability=0.9999,
            fix_m_probability=True,
            u_probability=0.7,
            fix_u_probability=True,
        ),
        {
            "sql_condition": 'levenshtein("first_name_l", "first_name_r") <= 2',
            "label_for_charts": "Levenshtein distance of first_name <= 2",
            "m_probability": 0.88,
            "is_null_level": False,
            "fix_m_probability": True,
        },
        cll.ElseLevel(),
    ]
)
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        first_name_comparison,
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city"),
        cl.ExactMatch("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("dob"),
    ],
    additional_columns_to_retain=["cluster"],
)

linker = Linker(splink_datasets.fake_1000, settings, db_api)

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))

labels_sdf = linker.table_management.register_labels_table(labels)
linker.training.estimate_m_from_pairwise_labels(labels_sdf)
linker.training.estimate_m_from_label_column("cluster")

linker.visualisations.m_u_parameters_chart()
```

Closes #2068
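The behaviour being added — trained values are applied everywhere except on levels whose probabilities are flagged as fixed — can be sketched in plain Python. This is a toy illustration with hypothetical names, not Splink's internal code:

```python
from dataclasses import dataclass

@dataclass
class ComparisonLevel:
    label: str
    m_probability: float
    u_probability: float
    fix_m_probability: bool = False
    fix_u_probability: bool = False

def apply_trained_values(levels, trained_m, trained_u):
    # Copy freshly trained m/u estimates onto each level,
    # skipping any probability the user has marked as fixed.
    for lv in levels:
        if not lv.fix_m_probability and lv.label in trained_m:
            lv.m_probability = trained_m[lv.label]
        if not lv.fix_u_probability and lv.label in trained_u:
            lv.u_probability = trained_u[lv.label]
    return levels

levels = [
    ComparisonLevel("exact_match", 0.9999, 0.7,
                    fix_m_probability=True, fix_u_probability=True),
    ComparisonLevel("else", 0.2, 0.5),
]
apply_trained_values(levels,
                     trained_m={"exact_match": 0.5, "else": 0.05},
                     trained_u={"exact_match": 0.4, "else": 0.9})
# "exact_match" keeps its fixed values; "else" takes the trained ones
```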

RobinL (Member, Author) commented Sep 4, 2024

To do:

  • All training functions (e.g. train from labels) should support fixed probabilities
  • Add a test that uses all training functions, including from labels, and verify m and u don't change for the specified levels
  • Can the logic be pushed 'higher', e.g. into append_m_probability_to_comparison_level_trained_probabilities or somewhere else, to minimise the changes needed to the code?
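The testing item above amounts to a before/after comparison: run a training step, then assert that every fixed parameter is unchanged while unfixed ones move. A toy version, with an invented `run_training_step` standing in for a real training routine:

```python
def run_training_step(params, fixed):
    # Toy stand-in for an EM/labels training pass: it re-estimates every
    # parameter (here, simply halving it) unless the name is marked as fixed.
    return {name: value if name in fixed else value / 2
            for name, value in params.items()}

before = {"m_exact": 0.9999, "m_levenshtein": 0.88, "m_else": 0.2}
fixed = {"m_exact", "m_levenshtein"}
after = run_training_step(before, fixed)

assert all(after[name] == before[name] for name in fixed)  # fixed: untouched
assert after["m_else"] == 0.1                              # unfixed: updated
```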

bkitej-rw (Contributor) commented:

It's reassuring to know this is a common ask. Previously, I resorted to manually setting some params on the _settings_obj post-training, but I had been concerned about whether that biases anything in the resulting system of weights. It seems that's not a concern.

RobinL changed the title from (WIP) Fix m to (WIP) allow a specific m and u probabilities to be fixed during training, Sep 4, 2024
RobinL changed the title from (WIP) allow a specific m and u probabilities to be fixed during training to Allow a specific m and u probabilities to be fixed during training, Sep 5, 2024
RobinL requested a review from ADBond, September 5, 2024 13:16
```diff
@@ -169,7 +169,7 @@ def populate_m_u_from_lookup(
 ) -> None:
     cl = comparison_level

-    if "m" not in training_fixed_probabilities:
+    if not cl._fix_m_probability and "m" not in training_fixed_probabilities:
```

RobinL (Member, Author) commented on this diff:

This is needed because otherwise cl.m_probability is set on the training linker, so the value fluctuates during EM training despite never being assigned back to the m probability on the main linker.

ADBond (Contributor) approved these changes:

Great!

RobinL merged commit edc94c1 into master Sep 5, 2024 — 25 checks passed
RobinL deleted the fix_m branch September 5, 2024 17:01
Successfully merging this pull request may close this issue:

[FEAT] Allow exact or Bayesian pre-specification of m-probabilities for selected comparisons

3 participants