
Cluster without linker #2412

Merged: 15 commits merged into master on Sep 30, 2024
Conversation

@RobinL (Member) commented Sep 18, 2024

We've heard from several people who want to cluster without a linker, for instance when combining predictions from multiple models and then clustering the combined results (e.g. #2358).

This PR allows the clustering algorithm to be used without needing a linker, similar to the exploratory analysis functions.

Example without linker
from splink import DuckDBAPI
from splink.internals.clustering import cluster_pairwise_predictions_at_threshold

db_api = DuckDBAPI()

nodes = [
    {"my_id": 1},
    {"my_id": 2},
    {"my_id": 3},
    {"my_id": 4},
    {"my_id": 5},
    {"my_id": 6},
]

edges = [
    {"n_1": 1, "n_2": 2, "match_probability": 0.8},
    {"n_1": 3, "n_2": 2, "match_probability": 0.9},
    {"n_1": 4, "n_2": 5, "match_probability": 0.99},
]

cluster_pairwise_predictions_at_threshold(
    nodes,
    edges,
    node_id_column_name="my_id",
    edge_id_column_name_left="n_1",
    edge_id_column_name_right="n_2",
    db_api=db_api,
    threshold_match_probability=0.5,
).as_pandas_dataframe()
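
With a threshold of 0.5 all three edges are retained, so the expected clusters are {1, 2, 3}, {4, 5} and the singleton {6}. A quick cross-check of that expectation with networkx (a sketch for illustration, not part of the PR):

import networkx as nx

# Illustration only: connected components over the thresholded edges
# should match the clusters splink returns.
G = nx.Graph()
G.add_nodes_from(n["my_id"] for n in nodes)
G.add_edges_from(
    (e["n_1"], e["n_2"]) for e in edges if e["match_probability"] >= 0.5
)
print(list(nx.connected_components(G)))  # [{1, 2, 3}, {4, 5}, {6}]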

nodes = [
    {"abc": 1},
    {"abc": 2},
    {"abc": 3},
    {"abc": 4},
]

edges = [
    {"abc_l": 1, "abc_r": 2, "match_probability": 0.8},
    {"abc_l": 3, "abc_r": 2, "match_probability": 0.9},
]

cluster_pairwise_predictions_at_threshold(
    nodes,
    edges,
    node_id_column_name="abc",
    db_api=db_api,
    threshold_match_probability=0.5,
).as_pandas_dataframe()
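Note that the edge id column names are omitted in this second call; they appear to default to <node_id_column_name>_l and <node_id_column_name>_r, i.e. abc_l and abc_r here.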
Example with a linker
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
from splink.internals.clustering import cluster_pairwise_predictions_at_threshold

db_api = DuckDBAPI()

df = splink_datasets.fake_1000

# Split df into two frames using unique_id modulo 2
df_1 = df[df["unique_id"] % 2 == 0]
df_2 = df[df["unique_id"] % 2 == 1]

settings = SettingsCreator(
    link_type="link_and_dedupe",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.ExactMatch("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
        block_on("dob"),
        block_on("city"),
        block_on("email"),
    ],
    max_iterations=2,
)

linker = Linker([df_1, df_2], settings, db_api, input_table_aliases=["a", "b"])
linker._settings_obj._get_source_dataset_column_name_is_required()
pairwise_predictions = linker.inference.predict(threshold_match_weight=-10)
pairwise_predictions.as_pandas_dataframe().sort_values(["unique_id_l", "unique_id_r"])
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    pairwise_predictions, 0.00001
)


cluster_pairwise_predictions_at_threshold(
    df,
    pairwise_predictions.physical_name,
    node_id_column_name="unique_id",
    db_api=db_api,
    threshold_match_probability=0.00001,
).as_pandas_dataframe()
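Note that the standalone function also accepts the physical table name of an existing predictions table (pairwise_predictions.physical_name above) as the edges argument.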
Also works for deterministic linking
import os

import pandas as pd

from splink import DuckDBAPI, Linker, SettingsCreator
from splink.blocking_analysis import (
    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
)

# Load the data
df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")

# Define blocking rules
br_for_predict = [
    "l.first_name = r.first_name and l.surname = r.surname and l.dob = r.dob",
    "l.surname = r.surname and l.dob = r.dob and l.email = r.email",
    "l.first_name = r.first_name and l.surname = r.surname and l.email = r.email",
]

# Create settings
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=br_for_predict,
    retain_matching_columns=True,
    retain_intermediate_calculation_columns=True,
)

# Initialize DuckDB API
db_api = DuckDBAPI()


# Create linker
linker = Linker(df, settings, db_api=db_api)

# Perform deterministic linking
df_predict = linker.inference.deterministic_link()

# Cluster predictions
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict,
)

clusters.as_pandas_dataframe()
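
No threshold is passed here; as one of the review comments below notes, the previous match_probability = 1 hack for deterministic links is no longer required after this refactor.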



Review comment on:

def cluster_pairwise_predictions_at_threshold(
    nodes: AcceptableInputTableType,

RobinL (Member Author):

This should probably eventually allow the input to also be a SplinkDataFrame, but I think that's for a wider PR which allows all public-API functions to accept SplinkDataFrames.

RobinL (Member Author):

The match_probability = 1 hack is no longer required due to this refactor.

@nabebaye commented:

🙌 Thanks for pushing this out! This will be extremely helpful for using Splink where the data is periodically fed live into DuckDB.

RobinL requested a review from ADBond September 27, 2024 13:28
@@ -1,18 +1,15 @@
---
tags:
- API
- Clustering
- clustering
RobinL (Member Author):

The old file is now linker_clustering.md, to distinguish it from the 'plain' (no-linker) clustering method.

@@ -2,54 +2,41 @@
import pytest

from tests.cc_testing_utils import (
RobinL (Member Author):

I've switched all the tests over to use the plain (no-linker) clustering functions.

return pd.DataFrame(rows)


def check_df_equality(df1, df2, skip_dtypes=False):
RobinL (Member Author):

Syntax like assert (cc_df.values == nx_df.values).all() is sufficient, so this doesn't need to be a function.
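
For illustration only, a minimal sketch of the inline comparison that replaces the helper; the column names cluster_id and node_id are hypothetical stand-ins for whatever columns the two result frames share, and both frames must be sorted identically first:

# Hypothetical sketch: sort both result frames the same way, then a plain
# element-wise comparison replaces the check_df_equality helper.
cc_df = cc_df.sort_values(["cluster_id", "node_id"]).reset_index(drop=True)
nx_df = nx_df.sort_values(["cluster_id", "node_id"]).reset_index(drop=True)
assert (cc_df.values == nx_df.values).all()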

RobinL removed the request for review from ADBond September 30, 2024 07:52
@RobinL (Member Author) commented Sep 30, 2024

Another testing script
import duckdb
import networkx as nx
import pandas as pd

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on


def generate_random_graph(graph_size, seed=47):
    if graph_size < 10:
        density = 1 / graph_size
    else:
        density = 2 / graph_size
    # print(f"Graph size: {graph_size}, Density: {density}")

    graph = nx.fast_gnp_random_graph(graph_size, density, seed=seed, directed=False)
    return graph


def nodes_and_edges_from_graph(G):
    edges = nx.to_pandas_edgelist(G)
    edges.columns = ["unique_id_l", "unique_id_r"]

    nodes = pd.DataFrame({"unique_id": list(G.nodes)})

    return nodes, edges


g = generate_random_graph(10000)
nodes, edges = nodes_and_edges_from_graph(g)

G = nx.from_pandas_edgelist(edges, "unique_id_l", "unique_id_r")

# Ensure all nodes from the original graph are in G
for node in nodes["unique_id"]:
    if node not in G:
        G.add_node(node)

connected_components = list(nx.connected_components(G))

# Create a dictionary mapping node to cluster
node_to_cluster = {}
for cluster_id, component in enumerate(connected_components):
    for node in component:
        node_to_cluster[node] = cluster_id

# Create the final DataFrame
nodes_with_clusters = nodes.copy()
nodes_with_clusters["cluster"] = nodes_with_clusters["unique_id"].map(node_to_cluster)


db_api = DuckDBAPI(":default:")

blocking_rules = [
    block_on("cluster"),
]


settings = SettingsCreator(
    link_type="dedupe_only",
    probability_two_random_records_match=0.5,
    blocking_rules_to_generate_predictions=blocking_rules,
    comparisons=[
        cl.ExactMatch("cluster").configure(
            m_probabilities=[0.99, 0.01], u_probabilities=[0.01, 0.99]
        )
    ],
    retain_intermediate_calculation_columns=True,
)


linker = Linker(nodes_with_clusters, settings, db_api=db_api)
linker.visualisations.match_weights_chart()

df_predict = linker.inference.predict()

res = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict=df_predict, threshold_match_probability=0.5
)

res_duck = res.as_duckdbpyrelation()
res_duck
sql = """
SELECT
    COUNT(DISTINCT cluster_id) AS number_of_clusters,
    AVG(cluster_size) AS average_cluster_size
FROM (
    SELECT
        cluster_id,
        COUNT(*) AS cluster_size
    FROM res_duck
    GROUP BY cluster_id
)
"""

duckdb.sql(sql)
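
A quick sanity check (not in the original script) that could be appended, since the script already computed the ground-truth components with networkx:

# Sanity check, not in the original script: splink's cluster count should
# equal the number of connected components networkx found above.
n_components = len(connected_components)
n_clusters = duckdb.sql(
    "SELECT COUNT(DISTINCT cluster_id) FROM res_duck"
).fetchone()[0]
assert n_clusters == n_components, (n_clusters, n_components)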

RobinL merged commit 5a9068b into master Sep 30, 2024
27 checks passed
RobinL deleted the cluster_without_linker branch September 30, 2024 07:58
ADBond added a commit that referenced this pull request Oct 2, 2024
change in #2412, but only just rebased so that it affects this branch