Cluster without linker #2412
Merged
15 commits
698b1cb  solve connected components without linker (RobinL)
2092b7f  clustering.py (RobinL)
3c0c2e6  linker clustering works again (RobinL)
2c16b23  tests work with new framework (RobinL)
b2b7cd1  allow clustering without match prob (RobinL)
43dac2a  remove errant comma (RobinL)
bfeb572  mypy (RobinL)
7a28c97  compute_graph_metrics works again (RobinL)
1bff8a2  fix tests (RobinL)
be28eac  improve docstring, add public api (RobinL)
fdc7115  make sure we delete intermediate tables from cache (RobinL)
fd15cd9  add docs (RobinL)
687cdd0  fix (RobinL)
705335b  add additional link (RobinL)
3f4dd4b  clean up unused fn (RobinL)
```diff
@@ -1,18 +1,15 @@
 ---
 tags:
   - API
-  - Clustering
+  - clustering
 ---
-# Methods in Linker.clustering
+# Documentation for `splink.clustering`

-::: splink.internals.linker_components.clustering.LinkerClustering
+::: splink.clustering
     handler: python
     filters:
       - "!^__init__$"
     options:
       show_root_heading: false
       show_root_toc: false
       show_source: false
       members_order: source
```
@@ -0,0 +1,3 @@
```python
from .internals.clustering import cluster_pairwise_predictions_at_threshold

__all__ = ["cluster_pairwise_predictions_at_threshold"]
```
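This new module is a thin public wrapper: it re-exports the internal implementation so clustering can be imported without constructing a `Linker`. As a minimal sketch of what the re-export means (not part of the PR, and assuming the file lives at `splink/clustering.py` as its relative import suggests):

```python
# The public path and the internal path resolve to the same function object
# after this change; the first is the new public import, the second the
# pre-existing internal one.
from splink.clustering import cluster_pairwise_predictions_at_threshold
from splink.internals.clustering import (
    cluster_pairwise_predictions_at_threshold as internal_impl,
)

assert cluster_pairwise_predictions_at_threshold is internal_impl
```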
@@ -0,0 +1,119 @@

(Review comment on the `nodes` parameter: This should probably eventually allow the input to also be SplinkDataFrame, but I think that's for a wider PR which allows all public-API functions to accept SplinkDataFrames.)

````python
from typing import Optional

from splink.internals.connected_components import solve_connected_components
from splink.internals.database_api import AcceptableInputTableType, DatabaseAPISubClass
from splink.internals.input_column import InputColumn
from splink.internals.misc import ascii_uid
from splink.internals.splink_dataframe import SplinkDataFrame


def cluster_pairwise_predictions_at_threshold(
    nodes: AcceptableInputTableType,
    edges: AcceptableInputTableType,
    db_api: DatabaseAPISubClass,
    node_id_column_name: str,
    edge_id_column_name_left: Optional[str] = None,
    edge_id_column_name_right: Optional[str] = None,
    threshold_match_probability: Optional[float] = None,
) -> SplinkDataFrame:
    """Clusters the pairwise match predictions into groups of connected records using
    the connected components graph clustering algorithm.

    Records with an estimated match probability at or above threshold_match_probability
    are considered to be a match (i.e. they represent the same entity).

    If no match probability column is provided, it is assumed that all edges
    (comparisons) are a match.

    If your node and edge column names follow Splink naming conventions, then you can
    omit edge_id_column_name_left and edge_id_column_name_right. For example, if you
    have a table of nodes with a column `unique_id`, it would be assumed that the
    edge table has columns `unique_id_l` and `unique_id_r`.

    Args:
        nodes (AcceptableInputTableType): The table containing node information
        edges (AcceptableInputTableType): The table containing edge information
        db_api (DatabaseAPISubClass): The database API to use for querying
        node_id_column_name (str): The name of the column containing node IDs
        edge_id_column_name_left (Optional[str]): The name of the column containing
            left edge IDs. If not provided, assumed to be f"{node_id_column_name}_l"
        edge_id_column_name_right (Optional[str]): The name of the column containing
            right edge IDs. If not provided, assumed to be f"{node_id_column_name}_r"
        threshold_match_probability (Optional[float]): Pairwise comparisons with a
            match_probability at or above this threshold are matched

    Returns:
        SplinkDataFrame: A SplinkDataFrame containing a list of all IDs, clustered
            into groups based on the desired match threshold.

    Examples:
        ```python
        from splink import DuckDBAPI
        from splink.clustering import cluster_pairwise_predictions_at_threshold

        db_api = DuckDBAPI()

        nodes = [
            {"my_id": 1},
            {"my_id": 2},
            {"my_id": 3},
            {"my_id": 4},
            {"my_id": 5},
            {"my_id": 6},
        ]

        edges = [
            {"n_1": 1, "n_2": 2, "match_probability": 0.8},
            {"n_1": 3, "n_2": 2, "match_probability": 0.9},
            {"n_1": 4, "n_2": 5, "match_probability": 0.99},
        ]

        cc = cluster_pairwise_predictions_at_threshold(
            nodes,
            edges,
            node_id_column_name="my_id",
            edge_id_column_name_left="n_1",
            edge_id_column_name_right="n_2",
            db_api=db_api,
            threshold_match_probability=0.5,
        )

        cc.as_duckdbpyrelation()
        ```
    """

    uid = ascii_uid(8)

    if isinstance(nodes, SplinkDataFrame):
        nodes_sdf = nodes
    else:
        nodes_sdf = db_api.register_table(nodes, f"__splink__df_nodes_{uid}")

    if isinstance(edges, SplinkDataFrame):
        edges_sdf = edges
    else:
        edges_sdf = db_api.register_table(edges, f"__splink__df_edges_{uid}")

    if not edge_id_column_name_left:
        edge_id_column_name_left = InputColumn(
            node_id_column_name,
            sqlglot_dialect_str=db_api.sql_dialect.sqlglot_dialect,
        ).name_l

    if not edge_id_column_name_right:
        edge_id_column_name_right = InputColumn(
            node_id_column_name,
            sqlglot_dialect_str=db_api.sql_dialect.sqlglot_dialect,
        ).name_r

    cc = solve_connected_components(
        nodes_table=nodes_sdf,
        edges_table=edges_sdf,
        node_id_column_name=node_id_column_name,
        edge_id_column_name_left=edge_id_column_name_left,
        edge_id_column_name_right=edge_id_column_name_right,
        db_api=db_api,
        threshold_match_probability=threshold_match_probability,
    )
    cc.metadata["threshold_match_probability"] = threshold_match_probability
    return cc
````
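The docstring example above passes the edge column names explicitly. As a complementary sketch of the naming-convention shortcut the docstring describes (not from the PR; the table contents and 0.5 threshold are illustrative), with a `unique_id` node column and `unique_id_l` / `unique_id_r` edge columns the edge column arguments can be omitted:

```python
from splink import DuckDBAPI
from splink.clustering import cluster_pairwise_predictions_at_threshold

db_api = DuckDBAPI()

# Nodes use the conventional `unique_id` column...
nodes = [{"unique_id": 1}, {"unique_id": 2}, {"unique_id": 3}]

# ...so the edge table is assumed to use `unique_id_l` / `unique_id_r`,
# and edge_id_column_name_left / edge_id_column_name_right can be omitted.
edges = [
    {"unique_id_l": 1, "unique_id_r": 2, "match_probability": 0.97},
    {"unique_id_l": 2, "unique_id_r": 3, "match_probability": 0.12},
]

cc = cluster_pairwise_predictions_at_threshold(
    nodes,
    edges,
    db_api=db_api,
    node_id_column_name="unique_id",
    threshold_match_probability=0.5,
)

cc.as_duckdbpyrelation()

# The chosen threshold is recorded on the result, per the last line of the function.
cc.metadata["threshold_match_probability"]  # 0.5
```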
Review comment: the old file is now linker_clustering.md, to distinguish it from the 'plain' linker method.