
Allow a specific m and u probabilities to be fixed during training #2379

Merged — 6 commits, Sep 5, 2024

Conversation

RobinL (Member) commented Sep 4, 2024

This PR allows the user to fix m and u probabilities when the model is created, so that they aren't changed when training is run.

This is a fairly common requirement (e.g. here, here and here), because in some cases the user has prior knowledge of specific m and u values and wishes to fix them during training.

Suggested API:

```python
import splink.comparison_level_library as cll
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
from splink.datasets import splink_dataset_labels

labels = splink_dataset_labels.fake_1000_labels

db_api = DuckDBAPI()

first_name_comparison = cl.CustomComparison(
    comparison_levels=[
        cll.NullLevel("first_name"),
        cll.ExactMatchLevel("first_name").configure(
            m_probability=0.9999,
            fix_m_probability=True,
            u_probability=0.7,
            fix_u_probability=True,
        ),
        {
            "sql_condition": 'levenshtein("first_name_l", "first_name_r") <= 2',
            "label_for_charts": "Levenshtein distance of first_name <= 2",
            "m_probability": 0.88,
            "is_null_level": False,
            "fix_m_probability": True,
        },
        cll.ElseLevel(),
    ]
)
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        first_name_comparison,
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city"),
        cl.ExactMatch("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("dob"),
    ],
    additional_columns_to_retain=["cluster"],
)

linker = Linker(splink_datasets.fake_1000, settings, db_api)

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))

labels_sdf = linker.table_management.register_labels_table(labels)
linker.training.estimate_m_from_pairwise_labels(labels_sdf)
linker.training.estimate_m_from_label_column("cluster")

linker.visualisations.m_u_parameters_chart()
```

Closes #2068
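The behaviour being added — trained values are applied everywhere except on levels whose probabilities are flagged as fixed — can be sketched in plain Python. This is a toy illustration with hypothetical names, not Splink's internal code:

```python
from dataclasses import dataclass

@dataclass
class ComparisonLevel:
    label: str
    m_probability: float
    u_probability: float
    fix_m_probability: bool = False
    fix_u_probability: bool = False

def apply_trained_values(levels, trained_m, trained_u):
    # Copy freshly trained m/u estimates onto each level,
    # skipping any probability the user has marked as fixed.
    for lv in levels:
        if not lv.fix_m_probability and lv.label in trained_m:
            lv.m_probability = trained_m[lv.label]
        if not lv.fix_u_probability and lv.label in trained_u:
            lv.u_probability = trained_u[lv.label]
    return levels

levels = [
    ComparisonLevel("exact_match", 0.9999, 0.7,
                    fix_m_probability=True, fix_u_probability=True),
    ComparisonLevel("else", 0.2, 0.5),
]
apply_trained_values(levels,
                     trained_m={"exact_match": 0.5, "else": 0.05},
                     trained_u={"exact_match": 0.4, "else": 0.9})
# "exact_match" keeps its fixed values; "else" takes the trained ones
```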

RobinL (Member, Author) commented Sep 4, 2024

To do:

  • All training functions (e.g. train from labels) should support fixed probabilities
  • Add a test that uses all training functions, including from labels, and verify m and u don't change for the specified levels
  • Can the logic be pushed 'higher', e.g. into append_m_probability_to_comparison_level_trained_probabilities or somewhere else, to minimise the changes needed to the code?
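The testing item above amounts to a before/after comparison: run a training step, then assert that every fixed parameter is unchanged while unfixed ones move. A toy version, with an invented `run_training_step` standing in for a real training routine:

```python
def run_training_step(params, fixed):
    # Toy stand-in for an EM/labels training pass: it re-estimates every
    # parameter (here, simply halving it) unless the name is marked as fixed.
    return {name: value if name in fixed else value / 2
            for name, value in params.items()}

before = {"m_exact": 0.9999, "m_levenshtein": 0.88, "m_else": 0.2}
fixed = {"m_exact", "m_levenshtein"}
after = run_training_step(before, fixed)

assert all(after[name] == before[name] for name in fixed)  # fixed: untouched
assert after["m_else"] == 0.1                              # unfixed: updated
```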

bkitej-rw (Contributor) commented:

It's reassuring to know this is a common ask. Previously, I resorted to manually setting some params on the _settings_obj post-training, but I had been concerned about whether that biases anything in the resulting system of weights. It seems that's not a concern.

RobinL changed the title from (WIP) Fix m to (WIP) allow a specific m and u probabilities to be fixed during training, Sep 4, 2024
RobinL changed the title from (WIP) allow a specific m and u probabilities to be fixed during training to Allow a specific m and u probabilities to be fixed during training, Sep 5, 2024
RobinL requested a review from ADBond, September 5, 2024 13:16
```diff
@@ -169,7 +169,7 @@ def populate_m_u_from_lookup(
 ) -> None:
     cl = comparison_level

-    if "m" not in training_fixed_probabilities:
+    if not cl._fix_m_probability and "m" not in training_fixed_probabilities:
```

RobinL (Member, Author) commented on this diff:

This is needed because otherwise cl.m_probability is set on the training linker, so the value fluctuates during EM training despite never being assigned back to the m probability on the main linker.

ADBond (Contributor) approved these changes:

Great!

RobinL merged commit edc94c1 into master Sep 5, 2024 — 25 checks passed
RobinL deleted the fix_m branch September 5, 2024 17:01
Successfully merging this pull request may close this issue:

[FEAT] Allow exact or Bayesian pre-specification of m-probabilities for selected comparisons

3 participants