diff --git a/404.html b/404.html new file mode 100644 index 0000000000..ef18dae08d --- /dev/null +++ b/404.html @@ -0,0 +1,5127 @@ + + + + + + + + + + + + + + + + + + + + + + Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ +

404 - Not found

+ +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/api_docs/api_docs_index.html b/api_docs/api_docs_index.html new file mode 100644 index 0000000000..8efa75e175 --- /dev/null +++ b/api_docs/api_docs_index.html @@ -0,0 +1,5356 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Introduction - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + + + +
+
+ + + + + + + + + + + + +

API Documentation

+

This section contains reference material for the modules and functions within Splink.

+

API

+

Linker

+ +

Comparisons

+ +

Other

+ +

In-built datasets

+

Information on pre-made data tables available within Splink suitable for linking, to get up-and-running or to try out ideas.

+
    +
  • In-built datasets - information on included datasets, as well as how to use them, and methods for managing them.
  • +
+ +

Reference materials for the Splink Settings dictionary:

+
+ + + + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/api_docs/blocking.html b/api_docs/blocking.html new file mode 100644 index 0000000000..87d66486ee --- /dev/null +++ b/api_docs/blocking.html @@ -0,0 +1,5383 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Blocking rule creator - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + + + + +

Documentation for block_on

+ + +
+ + + + +
+ +

Generates blocking rules of equality conditions based on the columns +or SQL expressions specified.

+

When multiple columns or SQL snippets are provided, the function generates a +compound blocking rule, connecting individual match conditions with +"AND" clauses.

+

Further information on equi-join conditions can be found +here

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_names_or_exprs + Union[str, ColumnExpression] + +
+

A list of input columns or SQL conditions +you wish to create blocks on.

+
+
+ () +
salting_partitions + (optional, int) + +
+

The number of salting partitions to add +to the blocking rule. More information on salting can +be found within the docs.

+
+
+ None +
arrays_to_explode + (optional, List[str]) + +
+

List of arrays to explode +before applying the blocking rule.

+
+
+ None +
+ + +

Examples:

+
from splink import block_on
+br_1 = block_on("first_name")
+br_2 = block_on("substr(surname,1,2)", "surname")
+
+ +
+ +
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/api_docs/blocking_analysis.html b/api_docs/blocking_analysis.html new file mode 100644 index 0000000000..124694775c --- /dev/null +++ b/api_docs/blocking_analysis.html @@ -0,0 +1,5737 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Blocking analysis - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + + + + +

Documentation for splink.blocking_analysis

+ + +
+ + + + +
+ + + +
+ + + + + + + + + +
+ + +

+ count_comparisons_from_blocking_rule(*, table_or_tables, blocking_rule, link_type, db_api, unique_id_column_name='unique_id', source_dataset_column_name=None, compute_post_filter_count=True, max_rows_limit=int(1000000000.0)) + +

+ + +
+ +

Analyse a blocking rule to understand the number of comparisons it will generate.

+

Read more about the definition of pre and post filter conditions +here

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
table_or_tables + (dataframe, str) + +
+

Input data

+
+
+ required +
blocking_rule + Union[BlockingRuleCreator, str, Dict[str, Any]] + +
+

The blocking +rule to analyse

+
+
+ required +
link_type + user_input_link_type_options + +
+

The link type - "link_only", +"dedupe_only" or "link_and_dedupe"

+
+
+ required +
db_api + DatabaseAPISubClass + +
+

Database API

+
+
+ required +
unique_id_column_name + str + +
+

Defaults to "unique_id".

+
+
+ 'unique_id' +
source_dataset_column_name + Optional[str] + +
+

Defaults to None.

+
+
+ None +
compute_post_filter_count + bool + +
+

Whether to use a slower methodology +to calculate how many comparisons will be generated post filter conditions. +Defaults to True.

+
+
+ True +
max_rows_limit + int + +
+

Calculation of post filter counts will only +proceed if the fast method returns a value below this limit. Defaults +to int(1e9).

+
+
+ int(1000000000.0) +
+ + +

Returns:

+ + + + + + + + + + + + + +
TypeDescription
+ dict[str, Union[int, str]] + +
+

dict[str, Union[int, str]]: A dictionary containing the results

+
+
+ +
+ +
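The pre-filter count boils down to grouping records by their blocking-key value and summing the comparisons generated within each block. A minimal pure-Python sketch of that arithmetic for the "dedupe_only" case (illustrative only; Splink performs this in SQL on the backend, not in Python):

```python
from collections import Counter

def pre_filter_comparison_count(values, link_type="dedupe_only"):
    # Group records by blocking key value and count the comparisons
    # generated within each block (pre filter conditions).
    block_sizes = Counter(values)
    if link_type == "dedupe_only":
        # A block of n records yields n * (n - 1) / 2 distinct pairs
        return sum(n * (n - 1) // 2 for n in block_sizes.values())
    raise ValueError("sketch only covers dedupe_only")

# Blocking on first_name: the 'john' block of 3 records yields 3 pairs,
# the 'mary' block of 2 records yields 1 pair
names = ["john", "john", "john", "mary", "mary"]
print(pre_filter_comparison_count(names))  # 4
```

This is why skewed blocking keys are dangerous: the pair count grows quadratically with block size.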
+ +
+ + +

+ cumulative_comparisons_to_be_scored_from_blocking_rules_chart(*, table_or_tables, blocking_rules, link_type, db_api, unique_id_column_name='unique_id', max_rows_limit=int(1000000000.0), source_dataset_column_name=None) + +

+ + +
+ + + +
+ +
+ +
+ + +

+ cumulative_comparisons_to_be_scored_from_blocking_rules_data(*, table_or_tables, blocking_rules, link_type, db_api, unique_id_column_name='unique_id', max_rows_limit=int(1000000000.0), source_dataset_column_name=None) + +

+ + +
+ + + +
+ +
+ +
+ + +

+ n_largest_blocks(*, table_or_tables, blocking_rule, link_type, db_api, n_largest=5) + +

+ + +
+ +

Find the values responsible for creating the largest blocks of records.

+

For example, when blocking on first name and surname, the 'John Smith' block +might be the largest block of records. In cases where values are highly skewed, +a few values may be responsible for generating a large proportion of all comparisons. +This function helps you find the culprit values.

+

The analysis is performed pre filter conditions, read more about what this means +here

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
table_or_tables + (dataframe, str) + +
+

Input data

+
+
+ required +
blocking_rule + Union[BlockingRuleCreator, str, Dict[str, Any]] + +
+

The blocking +rule to analyse

+
+
+ required +
link_type + user_input_link_type_options + +
+

The link type - "link_only", +"dedupe_only" or "link_and_dedupe"

+
+
+ required +
db_api + DatabaseAPISubClass + +
+

Database API

+
+
+ required +
n_largest + int + +
+

How many rows to return. Defaults to 5.

+
+
+ 5 +
+ + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
SplinkDataFrame + 'SplinkDataFrame' + +
+

A dataframe containing the n_largest blocks

+
+
+ +
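Conceptually, finding the largest blocks is a frequency count over the blocking-key values. A small sketch of the idea in plain Python (an assumption-laden illustration, not Splink's SQL implementation):

```python
from collections import Counter

def n_largest_blocks_sketch(records, key_cols, n_largest=5):
    # Count how many records share each blocking-key value; the most
    # frequent values are the ones generating the biggest blocks.
    keys = [tuple(r[c] for c in key_cols) for r in records]
    return Counter(keys).most_common(n_largest)

records = [
    {"first_name": "John", "surname": "Smith"},
    {"first_name": "John", "surname": "Smith"},
    {"first_name": "John", "surname": "Smith"},
    {"first_name": "Ada", "surname": "Lovelace"},
]
print(n_largest_blocks_sketch(records, ["first_name", "surname"], 1))
# [(('John', 'Smith'), 3)]
```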
+ +
+ + + +
+ +
+ +
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/api_docs/clustering.html b/api_docs/clustering.html new file mode 100644 index 0000000000..7a6f59cb57 --- /dev/null +++ b/api_docs/clustering.html @@ -0,0 +1,5570 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Clustering - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + + + + +

Methods in Linker.clustering

+ + +
+ + + + +
+ + +

Cluster the results of the linkage model and analyse clusters, accessed via +linker.clustering.

+ + + + +
+ + + + + + + + + +
+ + +

+ cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=None) + +

+ + +
+ +

Clusters the pairwise match predictions that result from +linker.inference.predict() into groups of connected records using the connected +components graph clustering algorithm.

+

Records with an estimated match_probability at or above +threshold_match_probability are considered to be a match (i.e. they represent +the same entity).

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
df_predict + SplinkDataFrame + +
+

The results of linker.predict()

+
+
+ required +
threshold_match_probability + float + +
+

Pairwise comparisons with a +match_probability at or above this threshold are matched

+
+
+ None +
+ + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
SplinkDataFrame + SplinkDataFrame + +
+

A SplinkDataFrame containing a list of all IDs, clustered +into groups based on the desired match threshold.

+
+
+ +
+ +
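The connected-components step can be pictured with a small union-find over the edges that pass the threshold: any chain of pairwise matches ends up in the same cluster. A minimal sketch (illustrative only, not Splink's scalable implementation):

```python
def cluster_edges(node_ids, edges):
    # Union-find with path compression: every pair of matched records
    # ends up with the same cluster representative.
    parent = {n: n for n in node_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    return {n: find(n) for n in node_ids}

# Pairs 1-2 and 2-3 are matches at the threshold, so 1, 2 and 3 share a
# cluster even though 1-3 was never directly compared; 4 stands alone.
clusters = cluster_edges([1, 2, 3, 4], [(1, 2), (2, 3)])
print(clusters[1] == clusters[3], clusters[1] == clusters[4])  # True False
```

Note how transitivity is what makes the threshold choice important: a single spurious edge can merge two otherwise-distinct clusters.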
+ +
+ + +

+ compute_graph_metrics(df_predict, df_clustered, *, threshold_match_probability=None) + +

+ + +
+ +

Generates tables containing graph metrics (for nodes, edges and clusters), +and returns a data class of Splink dataframes

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
df_predict + SplinkDataFrame + +
+

The results of linker.inference.predict()

+
+
+ required +
df_clustered + SplinkDataFrame + +
+

The outputs of +linker.clustering.cluster_pairwise_predictions_at_threshold()

+
+
+ required +
threshold_match_probability + float + +
+

Filter the pairwise match +predictions to include only pairwise comparisons with a +match_probability at or above this threshold. If not provided, the value +will be taken from metadata on df_clustered. If no such metadata is +available, this value must be provided.

+
+
+ None +
+ + +

Returns:

+ + + + + + + + + + + + + + + + + +
Name TypeDescription
GraphMetricsResult + GraphMetricsResults + +
+

A data class containing SplinkDataFrames

+
+
+ GraphMetricsResults + +
+

of cluster IDs and selected node, edge or cluster metrics. +attribute "nodes" for nodes metrics table +attribute "edges" for edge metrics table +attribute "clusters" for cluster metrics table

+
+
+ +
+ +
+ + + +
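The node-level metrics are simple computations over the match graph. For instance, node degree (one of the metrics in the nodes table) counts how many above-threshold edges each record participates in; a sketch of just that metric, not the full GraphMetricsResults computation:

```python
from collections import defaultdict

def node_degrees(edges):
    # Node degree: the number of edges (matches at or above the
    # threshold) each record participates in.
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)

print(node_degrees([(1, 2), (2, 3), (2, 4)]))  # {1: 1, 2: 3, 3: 1, 4: 1}
```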
+ +
+ +
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/api_docs/column_expression.html b/api_docs/column_expression.html new file mode 100644 index 0000000000..39c3ddc090 --- /dev/null +++ b/api_docs/column_expression.html @@ -0,0 +1,5613 @@ + + + + + + + + + + + + + + + + + + + + + + + + Column Expressions - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + + + + +

Column Expressions

+

In comparisons, you may wish to consider expressions which are not simply columns of your input table. +For instance, you may have a forename column in your data, but when comparing records you may also wish to use the values in this column transformed to all lowercase, or just the first three letters of the name, or perhaps both of these transformations taken together.

+

If it is feasible to do so, then it may be best to derive a new column containing the transformed data. +Particularly if it is an expensive calculation, or you wish to refer to it many times, deriving the column once on your input data may well be preferable, as it is cheaper than doing so directly in comparisons where each input record may need to be processed many times. +However, there may be situations where you don't wish to derive a new column, perhaps for large data where you have many such transformations, or when you are experimenting with different models.

+

This is where a ColumnExpression may be used. It represents some SQL expression, which may be a column, or some more complicated construct, +to which you can also apply zero or more transformations. These are lazily evaluated, and in particular will not be tied to a specific SQL dialect until they are passed (via settings) into a linker.

+
+Term frequency adjustments +

One caveat to using a ColumnExpression is that it cannot be combined with term frequency adjustments. +Term frequency adjustments can only be computed on the raw values in a column prior to any function transforms.

+

If you wish to use term frequencies with transformations of an input column, you must pre-compute a new column in your input data +with the transforms applied, instead of a ColumnExpression.

+
+
from splink import ColumnExpression
+
+email_lowercase = ColumnExpression("email").lower()
+dob_as_string = ColumnExpression("dob").cast_to_string()
+surname_initial_lowercase = ColumnExpression("surname").substr(1, 1).lower()
+entry_date = ColumnExpression("entry_date_str").try_parse_date(date_format="YYYY-MM-DD")
+full_name_lowercase = ColumnExpression("first_name || ' ' || surname").lower()
+
+

You can use a ColumnExpression in most places where you might also use a simple column name, such as in a library comparison, a library comparison level, or in a blocking rule:

+
from splink import block_on
+import splink.comparison_library as cl
+import splink.comparison_level_library as cll
+
+full_name_lower_br = block_on(full_name_lowercase)
+
+email_comparison = cl.DamerauLevenshteinAtThresholds(email_lowercase, distance_threshold_or_thresholds=[1, 3])
+entry_date_comparison = cl.AbsoluteTimeDifferenceAtThresholds(
+    entry_date,
+    input_is_string=False,
+    metrics=["day", "day"],
+    thresholds=[1, 10],
+)
+name_comparison = cl.CustomComparison(
+    comparison_levels=[
+        cll.NullLevel(full_name_lowercase),
+        cll.ExactMatchLevel(full_name_lowercase),
+        cll.ExactMatchLevel("surname"),
+        cll.ExactMatchLevel("first_name"),
+        cll.ExactMatchLevel(surname_initial_lowercase),
+        cll.ElseLevel()
+    ],
+    output_column_name="name",
+)
+
+

ColumnExpression

+ + +
+ + + + +
+ + +

Enables transforms to be applied to a column before it's passed into a +comparison level.

+

Dialect agnostic. Execution is delayed until the dialect is known.

+ + +
+ For example +
from splink.column_expression import ColumnExpression
+col = (
+    ColumnExpression("first_name")
+    .lower()
+    .regex_extract("^[A-Z]{1,4}")
+)
+
+ExactMatchLevel(col)
+
+

Note that this will typically be created without a dialect, and the dialect +will later be populated when the ColumnExpression is passed via a comparison +level creator into a Linker.

+ + + + +
+ + + + + + + + + +
+ + +

+ lower() + +

+ + +
+ +

Applies a lowercase transform to the input expression.

+ +
+ +
+ +
+ + +

+ substr(start, length) + +

+ + +
+ +

Applies a substring transform to the input expression of a given length +starting from a specified index.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
start + int + +
+

The starting index of the substring.

+
+
+ required +
length + int + +
+

The length of the substring.

+
+
+ required +
+ +
+ +
+ +
+ + +

+ cast_to_string() + +

+ + +
+ +

Applies a cast to string transform to the input expression.

+ +
+ +
+ +
+ + +

+ regex_extract(pattern, capture_group=0) + +

+ + +
+ +

Applies a regex extract transform to the input expression.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
pattern + str + +
+

The regex pattern to match.

+
+
+ required +
capture_group + int + +
+

The capture group to extract from the matched pattern. +Defaults to 0, meaning the full pattern is extracted

+
+
+ 0 +
+ +
+ +
+ +
+ + +

+ try_parse_date(date_format=None) + +

+ + +
+ +

Applies a 'try parse date' transform to the input expression.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
date_format + str + +
+

The date format to attempt to parse. +Defaults to None, meaning the dialect-specific default format is used.

+
+
+ None +
+ +
+ +
+ +
+ + +

+ try_parse_timestamp(timestamp_format=None) + +

+ + +
+ +

Applies a 'try parse timestamp' transform to the input expression.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
timestamp_format + str + +
+

The timestamp format to attempt to parse. +Defaults to None, meaning the dialect-specific default format is used.

+
+
+ None +
+ +
+ +
+ + + +
+ +
+ +
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/api_docs/comparison_level_library.html b/api_docs/comparison_level_library.html new file mode 100644 index 0000000000..5c6302893e --- /dev/null +++ b/api_docs/comparison_level_library.html @@ -0,0 +1,7715 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Comparison Level Library - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + + + + +

Documentation for the comparison_level_library

+ + +
+ + + + +
+ + + +
+ + + + + + + + +
+ + + +

+ AbsoluteDifferenceLevel(col_name, difference_threshold) + +

+ + +
+

+ Bases: ComparisonLevelCreator

+ + +

Represents a comparison level where the absolute difference between two +numerical values is within a specified threshold.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str | ColumnExpression + +
+

Input column name or ColumnExpression.

+
+
+ required +
difference_threshold + int | float + +
+

The maximum allowed absolute difference +between the two values.

+
+
+ required +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ AbsoluteTimeDifferenceLevel(col_name, *, input_is_string, threshold, metric, datetime_format=None) + +

+ + +
+

+ Bases: ComparisonLevelCreator

+ + +

Computes the absolute elapsed time between two dates (total duration).

+

This function computes the amount of time that has passed between two dates, +in contrast to functions like date_diff found in some SQL backends, +which count the number of full calendar intervals (e.g., months, years) crossed.

+

For instance, the difference between January 29th and March 2nd would be less +than two months in terms of elapsed time, unlike a date_diff calculation that +would give an answer of 2 calendar intervals crossed.

+

Note that the threshold is inclusive, e.g. a level with a 10 day threshold +will include differences in date of 10 days.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str + +
+

The name of the input column containing the dates to compare

+
+
+ required +
input_is_string + bool + +
+

Indicates if the input date/times are in +string format, requiring parsing according to datetime_format.

+
+
+ required +
threshold + int + +
+

The maximum allowed difference between the two dates, +in units specified by date_metric.

+
+
+ required +
metric + str + +
+

The unit of time to use when comparing the dates. +Can be 'second', 'minute', 'hour', 'day', 'month', or 'year'.

+
+
+ required +
datetime_format + str + +
+

The format string for parsing dates. +ISO 8601 format used if not provided.

+
+
+ None +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
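The distinction between elapsed time and calendar intervals crossed can be checked directly with Python's datetime module; the January 29th to March 2nd example from above works out as follows:

```python
from datetime import date

d1, d2 = date(2023, 1, 29), date(2023, 3, 2)

# Elapsed time: 32 days, i.e. just over one month of actual duration
elapsed_days = (d2 - d1).days
print(elapsed_days)  # 32

# A date_diff-style count of calendar month boundaries crossed gives 2
months_crossed = (d2.year - d1.year) * 12 + (d2.month - d1.month)
print(months_crossed)  # 2
```

AbsoluteTimeDifferenceLevel uses the first interpretation, so with metric='month' this pair would fall inside a 2-month threshold.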
+ + + +

+ And(*comparison_levels) + +

+ + +
+

+ Bases: _Merge

+ + +

Represents a comparison level that is an 'AND' of other comparison levels

+

Merge multiple ComparisonLevelCreators into a single ComparisonLevelCreator by +merging their SQL conditions using a logical "AND".

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
*comparison_levels + ComparisonLevelCreator | dict + +
+

These represent the +comparison levels you wish to combine via 'AND'

+
+
+ () +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ ArrayIntersectLevel(col_name, min_intersection) + +

+ + +
+

+ Bases: ComparisonLevelCreator

+ + +

Represents a comparison level based around the size of an intersection of +arrays

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str + +
+

Input column name

+
+
+ required +
min_intersection + int + +
+

The minimum cardinality of the +intersection of arrays for this comparison level. Defaults to 1

+
+
+ required +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
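The level's logic amounts to comparing the size of a set intersection against min_intersection; a plain-Python sketch of the condition (the real level is generated as backend-specific SQL):

```python
def array_intersect_level_matches(arr_l, arr_r, min_intersection=1):
    # The level matches when the two arrays share at least
    # min_intersection distinct elements.
    return len(set(arr_l) & set(arr_r)) >= min_intersection

print(array_intersect_level_matches(["a", "b"], ["b", "c"]))     # True
print(array_intersect_level_matches(["a", "b"], ["b", "c"], 2))  # False
```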
+ + + +

+ ColumnsReversedLevel(col_name_1, col_name_2, symmetrical=False) + +

+ + +
+

+ Bases: ComparisonLevelCreator

+ + +

Represents a comparison level where the columns are reversed. For example, +if surname is in the forename field and vice versa

+

By default, the level requires col_name_1_l = col_name_2_r. If the symmetrical argument is True, +it additionally requires col_name_2_l = col_name_1_r.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name_1 + str + +
+

First column, e.g. forename

+
+
+ required +
col_name_2 + str + +
+

Second column, e.g. surname

+
+
+ required +
symmetrical + bool + +
+

If True, equality is required in both directions. +Default is False.

+
+
+ False +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ CustomLevel(sql_condition, label_for_charts=None, base_dialect_str=None) + +

+ + +
+

+ Bases: ComparisonLevelCreator

+ + +

Represents a comparison level with a custom sql expression

+

Must be in a form suitable for use in a SQL CASE WHEN expression +e.g. "substr(name_l, 1, 1) = substr(name_r, 1, 1)"

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
sql_condition + str + +
+

SQL condition to assess similarity

+
+
+ required +
label_for_charts + str + +
+

A label for this level to be used in +charts. Default None, so that sql_condition is used

+
+
+ None +
base_dialect_str + str + +
+

If specified, the SQL dialect that +this expression will be parsed as when attempting to translate to +other backends

+
+
+ None +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ DamerauLevenshteinLevel(col_name, distance_threshold) + +

+ + +
+

+ Bases: ComparisonLevelCreator

+ + +

A comparison level using a Damerau-Levenshtein distance function

+

e.g. damerau_levenshtein(val_l, val_r) <= distance_threshold

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str + +
+

Input column name

+
+
+ required +
distance_threshold + int + +
+

The threshold to use to assess +similarity

+
+
+ required +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ DistanceFunctionLevel(col_name, distance_function_name, distance_threshold, higher_is_more_similar=True) + +

+ + +
+

+ Bases: ComparisonLevelCreator

+ + +

A comparison level using an arbitrary distance function

+

e.g. custom_distance(val_l, val_r) >= (<=) distance_threshold

+

The function given by distance_function_name must exist in the SQL +backend you use, and must take two parameters of the type of col_name, +returning a numeric type

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str | ColumnExpression + +
+

Input column name

+
+
+ required +
distance_function_name + str + +
+

the name of the SQL distance function

+
+
+ required +
distance_threshold + Union[int, float] + +
+

The threshold to use to assess +similarity

+
+
+ required +
higher_is_more_similar + bool + +
+

Are higher values of the distance function +more similar? (e.g. True for Jaro-Winkler, False for Levenshtein) +Default is True

+
+
+ True +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ DistanceInKMLevel(lat_col, long_col, km_threshold, not_null=False) + +

+ + +
+

+ Bases: ComparisonLevelCreator

+ + +

Use the haversine formula to transform comparisons of lat,lngs +into distances measured in kilometers

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
lat_col + str + +
+

The name of a latitude column, or the respective array +or struct column containing the information. +For example: long_lat['lat'] or long_lat[0]

+
+
+ required +
long_col + str + +
+

The name of a longitude column, or the respective array +or struct column containing the information, plus an index. +For example: long_lat['long'] or long_lat[1]

+
+
+ required +
km_threshold + int + +
+

The total distance in kilometers to evaluate your +comparisons against

+
+
+ required +
not_null + bool + +
+

If true, ensure no attempt is made to compute this if +any inputs are null. This is only necessary if you are not +capturing nulls elsewhere in your comparison level.

+
+
+ False +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
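The haversine formula the level relies on can be sketched in plain Python (an illustration of the maths, assuming the usual mean Earth radius of 6371 km; the level itself generates SQL):

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lng1, lat2, lng2):
    # Great-circle distance between two lat/lng points in kilometers
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

# London to Paris is roughly 343 km, so it falls within a 400 km threshold
d = haversine_km(51.5074, -0.1278, 48.8566, 2.3522)
print(d <= 400)  # True
```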
+ + + +

+ ElseLevel + + +

+ + +
+

+ Bases: ComparisonLevelCreator

+ + +

This level is used to capture all comparisons that do not match any other +specified levels. It corresponds to the ELSE clause in a SQL CASE statement.

+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ ExactMatchLevel(col_name, term_frequency_adjustments=False) + +

+ + +
+

+ Bases: ComparisonLevelCreator

+ + +

Represents a comparison level where there is an exact match

+

e.g. val_l = val_r

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str + +
+

Input column name

+
+
+ required +
term_frequency_adjustments + bool + +
+

If True, apply term frequency +adjustments to the exact match level. Defaults to False.

+
+
+ False +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ JaccardLevel(col_name, distance_threshold) + +

+ + +
+

+ Bases: ComparisonLevelCreator

+ + +

A comparison level using a Jaccard distance function

+

e.g. jaccard(val_l, val_r) >= distance_threshold

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str + +
+

Input column name

+
+
+ required +
distance_threshold + Union[int, float] + +
+

The threshold to use to assess +similarity

+
+
+ required +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ JaroLevel(col_name, distance_threshold) + +

+ + +
+

+ Bases: ComparisonLevelCreator

+ + +

A comparison level using a Jaro distance function

+

e.g. jaro(val_l, val_r) >= distance_threshold

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str + +
+

Input column name

+
+
+ required +
distance_threshold + Union[int, float] + +
+

The threshold to use to assess +similarity

+
+
+ required +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ JaroWinklerLevel(col_name, distance_threshold) + +

+ + +
+

+ Bases: ComparisonLevelCreator

+ + +

A comparison level using a Jaro-Winkler distance function

+

e.g. jaro_winkler(val_l, val_r) >= distance_threshold

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str + +
+

Input column name

+
+
+ required +
distance_threshold + Union[int, float] + +
+

The threshold to use to assess +similarity

+
+
+ required +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ LevenshteinLevel(col_name, distance_threshold) + +

+ + +
+

+ Bases: ComparisonLevelCreator

+ + +

A comparison level using a Levenshtein distance function

+

e.g. levenshtein(val_l, val_r) <= distance_threshold

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str + +
+

Input column name

+
+
+ required +
distance_threshold + int + +
+

The threshold to use to assess +similarity

+
+
+ required +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ LiteralMatchLevel(col_name, literal_value, literal_datatype, side_of_comparison='both') + +

+ + +
+

+ Bases: ComparisonLevelCreator

+ + +

Represents a comparison level where a column matches a literal value

+

e.g. val_l = 'literal' AND/OR val_r = 'literal'

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + Union[str, ColumnExpression] + +
+

Input column name or +ColumnExpression

+
+
+ required +
literal_value + str + +
+

The literal value to compare against e.g. 'male'

+
+
+ required +
literal_datatype + str + +
+

The datatype of the literal value. +Must be one of: "string", "int", "float", "date"

+
+
+ required +
side_of_comparison + str + +
+

Which side(s) of the comparison to +apply. Must be one of: "left", "right", "both". Defaults to "both".

+
+
+ 'both' +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ Not(comparison_level) + +

+ + +
+

+ Bases: ComparisonLevelCreator

+ + +

Represents a comparison level that is the negation of another comparison level

+

Resulting ComparisonLevelCreator is equivalent to the passed ComparisonLevelCreator +but with SQL conditions negated with logical "NOT".

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
*comparison_level + ComparisonLevelCreator | dict + +
+

This represents the +comparison level you wish to negate with 'NOT'

+
+
+ required +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ NullLevel(col_name, valid_string_pattern=None) + +

+ + +
+

+ Bases: ComparisonLevelCreator

+ + +

Represents a comparison level where either or both values are NULL

+

e.g. val_l IS NULL OR val_r IS NULL

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + Union[str, ColumnExpression] + +
+

Input column name or ColumnExpression

+
+
+ required +
valid_string_pattern + str + +
+

If provided, a regex pattern to extract +a valid substring from the column before checking for NULL. Default is None.

+
+
+ None +
+ + +
+ Note +

If a valid_string_pattern is provided, the NULL check will be performed on +the extracted substring rather than the original column value.

+
+ + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ Or(*comparison_levels) + +

+ + +
+

+ Bases: _Merge

+ + +

Represents a comparison level that is an 'OR' of other comparison levels

+

Merge multiple ComparisonLevelCreators into a single ComparisonLevelCreator by +merging their SQL conditions using a logical "OR".

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
*comparison_levels + ComparisonLevelCreator | dict + +
+

These represent the +comparison levels you wish to combine via 'OR'

+
+
+ () +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ PercentageDifferenceLevel(col_name, percentage_threshold) + +

+ + +
+

+ Bases: ComparisonLevelCreator

+ + +

Represents a comparison level where the difference between two numerical +values is within a specified percentage threshold.

+

The percentage difference is calculated as the absolute difference between the +two values divided by the greater of the two values.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str + +
+

Input column name.

+
+
+ required +
percentage_threshold + float + +
+

The threshold percentage to use +to assess similarity e.g. 0.1 for 10%.

+
+
+ required +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ + + + +
+ +
+ +
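The condition this level expresses can be sketched in plain Python (the function name is illustrative, and positive numeric inputs are assumed):

```python
def within_percentage(val_l, val_r, percentage_threshold):
    # Absolute difference divided by the greater of the two values,
    # compared against the threshold (e.g. 0.1 for 10%)
    denom = max(val_l, val_r)
    return abs(val_l - val_r) / denom <= percentage_threshold
```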

AbsoluteDateDifferenceAtThresholds

+

An alias of AbsoluteTimeDifferenceAtThresholds.

+

Configuring comparisons

+

Note that all comparison levels have a .configure() method as follows:

+ + +
+ + + + +
+ +

Configure the comparison level with options which are common to all +comparison levels. The options align to the keys in the json +specification of a comparison level. These options are usually not +needed, but are available for advanced users.

+

All options have default options set initially. Any call to .configure() +will set any options that are supplied. Any subsequent calls to .configure() +will not override these values with defaults; to override values you must +explicitly provide a value corresponding to the default.

+

Generally speaking only a single call (at most) to .configure() should +be required.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
m_probability + float + +
+

The m probability for this +comparison level. +Default is equivalent to None, in which case a default initial value +will be provided for this level.

+
+
+ unsupplied_option +
u_probability + float + +
+

The u probability for this +comparison level. +Default is equivalent to None, in which case a default initial value +will be provided for this level.

+
+
+ unsupplied_option +
tf_adjustment_column + str + +
+

Make term frequency adjustments for +this comparison level using this input column. +Default is equivalent to None, meaning that term-frequency adjustments +will not be applied for this level.

+
+
+ unsupplied_option +
tf_adjustment_weight + float + +
+

Make term frequency adjustments +for this comparison level using this weight. +Default is equivalent to None, meaning term-frequency adjustments are +fully-weighted if turned on.

+
+
+ unsupplied_option +
tf_minimum_u_value + float + +
+

When term frequency adjustments are +turned on, where the term frequency adjustment implies a u value below +this value, use this minimum value instead. +Defaults is equivalent to None, meaning no minimum value.

+
+
+ unsupplied_option +
is_null_level + bool + +
+

If true, m and u values will not be +estimated and instead the match weight will be zero for this column. +Default is equivalent to False.

+
+
+ unsupplied_option +
label_for_charts + str + +
+

If provided, a custom label that will +be used for this level in any charts. +Default is equivalent to None, in which case a default label will be +provided for this level.

+
+
+ unsupplied_option +
disable_tf_exact_match_detection + bool + +
+

If true, if term +frequency adjustments are set, the corresponding adjustment will be +made using the u-value for this level, rather than the usual case +where it is the u-value of the exact match level in the same comparison. +Default is equivalent to False.

+
+
+ unsupplied_option +
fix_m_probability + bool + +
+

If true, the m probability for this +level will be fixed and not estimated during training. +Default is equivalent to False.

+
+
+ unsupplied_option +
fix_u_probability + bool + +
+

If true, the u probability for this +level will be fixed and not estimated during training. +Default is equivalent to False.

+
+
+ unsupplied_option +
+ + +
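For context on m_probability and u_probability: the match weight a comparison level contributes is derived from the ratio m/u (the Bayes factor). A hand calculation, for illustration only:

```python
import math

m, u = 0.9, 0.01             # illustrative m and u probabilities for a level
bayes_factor = m / u         # evidence in favour of a match
match_weight = math.log2(bayes_factor)
print(round(match_weight, 2))  # 6.49
```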
+ +
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/api_docs/comparison_library.html b/api_docs/comparison_library.html new file mode 100644 index 0000000000..b3f46bb7cd --- /dev/null +++ b/api_docs/comparison_library.html @@ -0,0 +1,7541 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Comparison Library - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + + + + +

Documentation for the comparison_library

+ + +
+ + + + +
+ + + +
+ + + + + + + + +
+ + + +

+ AbsoluteTimeDifferenceAtThresholds(col_name, *, input_is_string, metrics, thresholds, datetime_format=None, term_frequency_adjustments=False, invalid_dates_as_null=True) + +

+ + +
+

+ Bases: ComparisonCreator

+ + +

Represents a comparison of the data in col_name with multiple levels based on +absolute time differences:

+
    +
  • Exact match in col_name
  • +
  • Absolute time difference levels at specified thresholds
  • +
  • ...
  • +
  • Anything else
  • +
+

For example, with metrics = ['day', 'month'] and thresholds = [1, 3] the levels +are:

+
    +
  • Exact match in col_name
  • +
  • Absolute time difference in col_name <= 1 day
  • +
  • Absolute time difference in col_name <= 3 months
  • +
  • Anything else
  • +
+

This comparison uses the AbsoluteTimeDifferenceLevel, which computes the total +elapsed time between two dates, rather than counting calendar intervals.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str + +
+

The name of the column to compare.

+
+
+ required +
input_is_string + bool + +
+

If True, the input dates are treated as strings +and parsed according to datetime_format.

+
+
+ required +
metrics + Union[DateMetricType, List[DateMetricType]] + +
+

The unit(s) of time +to use when comparing dates. Can be 'second', 'minute', 'hour', 'day', +'month', or 'year'.

+
+
+ required +
thresholds + Union[int, float, List[Union[int, float]]] + +
+

The threshold(s) +to use for the time difference level(s).

+
+
+ required +
datetime_format + str + +
+

The format string for parsing dates if +input_is_string is True. ISO 8601 format used if not provided.

+
+
+ None +
term_frequency_adjustments + bool + +
+

Whether to apply term frequency +adjustments. Defaults to False.

+
+
+ False +
invalid_dates_as_null + bool + +
+

If True and input_is_string is +True, treat invalid dates as null. Defaults to True.

+
+
+ True +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
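To illustrate the "total elapsed time" point: a one-month threshold is interpreted as a fixed duration rather than a calendar-month comparison. The sketch below assumes an average month of 365.25/12 days, which is an assumption for illustration, not necessarily Splink's exact convention:

```python
from datetime import datetime

AVG_MONTH_SECONDS = 365.25 / 12 * 24 * 3600  # assumed average-month duration

def within_threshold(d1, d2, n_months):
    # Total elapsed time between the two dates, not calendar months
    return abs((d1 - d2).total_seconds()) <= n_months * AVG_MONTH_SECONDS

# 31 Jan -> 28 Feb is 28 days of elapsed time, within a "1 month" threshold
print(within_threshold(datetime(2023, 1, 31), datetime(2023, 2, 28), 1))
```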
+ +
+ + + +

+ ArrayIntersectAtSizes(col_name, size_threshold_or_thresholds=[1]) + +

+ + +
+

+ Bases: ComparisonCreator

+ + +

Represents a comparison of the data in col_name with multiple levels based on +the intersection sizes of array elements:

+
    +
  • Intersection at specified size thresholds
  • +
  • ...
  • +
  • Anything else
  • +
+

For example, with size_threshold_or_thresholds = [3, 1], the levels are:

+
    +
  • Intersection of arrays in col_name has at least 3 elements
  • +
  • Intersection of arrays in col_name has at least 1 element
  • +
  • Anything else (e.g., empty intersection)
  • +
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str + +
+

The name of the column to compare.

+
+
+ required +
size_threshold_or_thresholds + Union[int, list[int]] + +
+

The +size threshold(s) for the intersection levels. +Defaults to [1].

+
+
+ [1] +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
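The level-assignment logic can be sketched as follows (illustrative names; Splink implements this in SQL against the backend database):

```python
def intersect_level(arr_l, arr_r, size_thresholds=(3, 1)):
    # Sketch of the level assignment: the first size threshold met wins
    size = len(set(arr_l) & set(arr_r))
    for i, threshold in enumerate(size_thresholds):
        if size >= threshold:
            return i
    return len(size_thresholds)  # "anything else"
```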
+ +
+ + + +

+ CustomComparison(comparison_levels, output_column_name=None, comparison_description=None) + +

+ + +
+

+ Bases: ComparisonCreator

+ + +

Represents a comparison of the data with custom supplied levels.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
output_column_name + str + +
+

The column name to use to refer to this comparison

+
+
+ None +
comparison_levels + list + +
+

A list of some combination of +ComparisonLevelCreator objects, or dicts. These represent the +similarity levels assessed by the comparison, in order of decreasing +specificity

+
+
+ required +
comparison_description + str + +
+

An optional description of the +comparison

+
+
+ None +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ DamerauLevenshteinAtThresholds(col_name, distance_threshold_or_thresholds=[1, 2]) + +

+ + +
+

+ Bases: ComparisonCreator

+ + +

Represents a comparison of the data in col_name with three or more levels:

+
    +
  • Exact match in col_name
  • +
  • Damerau-Levenshtein levels at specified distance thresholds
  • +
  • ...
  • +
  • Anything else
  • +
+

For example, with distance_threshold_or_thresholds = [1, 3] the levels are

+
    +
  • Exact match in col_name
  • +
  • Damerau-Levenshtein distance in col_name <= 1
  • +
  • Damerau-Levenshtein distance in col_name <= 3
  • +
  • Anything else
  • +
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str + +
+

The name of the column to compare.

+
+
+ required +
distance_threshold_or_thresholds + Union[int, list] + +
+

The +threshold(s) to use for the Damerau-Levenshtein similarity level(s). +Defaults to [1, 2].

+
+
+ [1, 2] +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
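For intuition: Damerau-Levenshtein also counts an adjacent transposition (e.g. "smith" vs "smiht") as a single edit, where plain Levenshtein needs two. A minimal reference implementation of the optimal string alignment variant, for illustration only (Splink uses the backend database's own function):

```python
def dl_distance(a, b):
    # Optimal string alignment variant of Damerau-Levenshtein distance
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```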
+ +
+ + + +

+ DateOfBirthComparison(col_name, *, input_is_string, datetime_thresholds=[1, 1, 10], datetime_metrics=['month', 'year', 'year'], datetime_format=None, invalid_dates_as_null=True) + +

+ + +
+

+ Bases: ComparisonCreator

+ + +

Generate an 'out of the box' comparison for a date of birth column +in the col_name provided.

+

Note that input_is_string is a required argument: you must specify whether the +col_name contains values of type date/datetime or string.

+

The default arguments will give a comparison with comparison levels:

+
    +
  • Exact match (all other dates)
  • +
  • Damerau-Levenshtein distance <= 1
  • +
  • Date difference <= 1 month
  • +
  • Date difference <= 1 year
  • +
  • Date difference <= 10 years
  • +
  • Anything else
  • +
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + Union[str, ColumnExpression] + +
+

The column name

+
+
+ required +
input_is_string + bool + +
+

If True, the provided col_name must be of type +string. If False, it must be a date or datetime.

+
+
+ required +
datetime_thresholds + Union[int, float, List[Union[int, float]]] + +
+

Numeric thresholds for date differences. Defaults to [1, 1, 10].

+
+
+ [1, 1, 10] +
datetime_metrics + Union[DateMetricType, List[DateMetricType]] + +
+

Metrics for date differences. Defaults to ["month", "year", "year"].

+
+
+ ['month', 'year', 'year'] +
datetime_format + str + +
+

The datetime format used to cast strings +to dates. Only used if input is a string.

+
+
+ None +
invalid_dates_as_null + bool + +
+

If True, treat invalid dates as null +as opposed to allowing e.g. an exact or levenshtein match where one side +or both are an invalid date. Only used if input is a string. Defaults +to True.

+
+
+ True +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ DistanceFunctionAtThresholds(col_name, distance_function_name, distance_threshold_or_thresholds, higher_is_more_similar=True) + +

+ + +
+

+ Bases: ComparisonCreator

+ + +

Represents a comparison of the data in col_name with three or more levels:

+
    +
  • Exact match in col_name
  • +
  • Custom distance function levels at specified thresholds
  • +
  • ...
  • +
  • Anything else
  • +
+

For example, with distance_threshold_or_thresholds = [1, 3] +and distance_function_name 'hamming', with higher_is_more_similar False, +the levels are:

+
    +
  • Exact match in col_name
  • +
  • Hamming distance of col_name <= 1
  • +
  • Hamming distance of col_name <= 3
  • +
  • Anything else
  • +
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str + +
+

The name of the column to compare.

+
+
+ required +
distance_function_name + str + +
+

the name of the SQL distance function

+
+
+ required +
distance_threshold_or_thresholds + Union[float, list] + +
+

The +threshold(s) to use for the distance function level(s).

+
+
+ required +
higher_is_more_similar + bool + +
+

Are higher values of the distance function +more similar? (e.g. True for Jaro-Winkler, False for Levenshtein) +Default is True

+
+
+ True +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
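The direction of the threshold comparison flips with higher_is_more_similar. A sketch of the resulting level assignment, where the first satisfied threshold wins (illustrative only; level 0 here stands in for the exact-match level):

```python
def assign_level(score, thresholds, higher_is_more_similar=True):
    # Thresholds are checked in order; the first one satisfied wins
    for i, t in enumerate(thresholds):
        satisfied = score >= t if higher_is_more_similar else score <= t
        if satisfied:
            return i
    return len(thresholds)  # "anything else"
```

For example, a Hamming distance of 2 against thresholds [1, 3] with higher_is_more_similar=False falls into the second level.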
+ +
+ + + +

+ DistanceInKMAtThresholds(lat_col, long_col, km_thresholds) + +

+ + +
+

+ Bases: ComparisonCreator

+ + +

A comparison of the latitude and longitude coordinates defined in +'lat_col' and 'long_col', giving the great-circle distance between them in km.

+

An example of the output with km_thresholds = [1, 10] would be:

+
    +
  • The two coordinates are within 1 km of one another
  • +
  • The two coordinates are within 10 km of one another
  • +
  • Anything else (i.e. the coordinates are more than 10 km apart)
  • +
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
lat_col + str + +
+

The name of the latitude column to compare.

+
+
+ required +
long_col + str + +
+

The name of the longitude column to compare.

+
+
+ required +
km_thresholds + iterable[float] | float + +
+

The km threshold(s) for the +distance levels.

+
+
+ required +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
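The great-circle distance underlying these levels can be computed with the haversine formula. A sketch, assuming a mean Earth radius of 6371 km (Splink generates the equivalent SQL for the backend):

```python
from math import radians, sin, cos, asin, sqrt

def km_between(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance in kilometres
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))
```

For example, London (51.5074, -0.1278) to Paris (48.8566, 2.3522) comes out at roughly 344 km, so it would fall outside km_thresholds = [1, 10].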
+ +
+ + + +

+ EmailComparison(col_name) + +

+ + +
+

+ Bases: ComparisonCreator

+ + +

Generate an 'out of the box' comparison for an email address column +in the col_name provided.

+

The default comparison levels are:

+
    +
  • Null comparison: e.g., one email is missing or invalid.
  • +
  • Exact match on full email: e.g., john@smith.com vs. john@smith.com.
  • +
  • Exact match on username part of email: e.g., john@company.com vs. +john@other.com.
  • +
  • Jaro-Winkler similarity > 0.88 on full email: e.g., john.smith@company.com +vs. john.smyth@company.com.
  • +
  • Jaro-Winkler similarity > 0.88 on username part of email: e.g., +john.smith@company.com vs. john.smyth@other.com.
  • +
  • Anything else: e.g., john@company.com vs. rebecca@other.com.
  • +
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + Union[str, ColumnExpression] + +
+

The column name or expression for +the email addresses to be compared.

+
+
+ required +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
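The "username part" levels compare everything before the @. A sketch of that extraction (illustrative, not Splink's exact expression):

```python
def username_part(email):
    # Everything before the '@'; None if the address contains no '@'
    head, sep, _domain = email.partition("@")
    return head if sep else None
```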
+ +
+ + + +

+ ExactMatch(col_name) + +

+ + +
+

+ Bases: ComparisonCreator

+ + +

Represents a comparison of the data in col_name with two levels:

+
    +
  • Exact match in col_name
  • +
  • Anything else
  • +
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str + +
+

The name of the column to compare

+
+
+ required +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ ForenameSurnameComparison(forename_col_name, surname_col_name, *, jaro_winkler_thresholds=[0.92, 0.88], forename_surname_concat_col_name=None) + +

+ + +
+

+ Bases: ComparisonCreator

+ + +

Generate an 'out of the box' comparison for forename and surname columns +in the forename_col_name and surname_col_name provided.

+

It's recommended to derive an additional column containing the concatenated +forename and surname so that term frequencies can be applied to the +full name. If you have derived such a column, provide it at +forename_surname_concat_col_name.

+

The default comparison levels are:

+
    +
  • Null comparison on both forename and surname
  • +
  • Exact match on both forename and surname
  • +
  • Columns reversed comparison (forename and surname swapped)
  • +
  • Jaro-Winkler similarity > 0.92 on both forename and surname
  • +
  • Jaro-Winkler similarity > 0.88 on both forename and surname
  • +
  • Exact match on surname
  • +
  • Exact match on forename
  • +
  • Anything else
  • +
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
forename_col_name + Union[str, ColumnExpression] + +
+

The column name or +expression for the forenames to be compared.

+
+
+ required +
surname_col_name + Union[str, ColumnExpression] + +
+

The column name or +expression for the surnames to be compared.

+
+
+ required +
jaro_winkler_thresholds + Union[float, list[float]] + +
+

Thresholds +for Jaro-Winkler similarity. Defaults to [0.92, 0.88].

+
+
+ [0.92, 0.88] +
forename_surname_concat_col_name + str + +
+

The column name for +concatenated forename and surname values. If provided, term +frequencies are applied on the exact match using this column

+
+
+ None +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
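Deriving the recommended concatenated column is a simple pre-processing step before the data reaches Splink. A sketch with plain Python records (column names are illustrative):

```python
records = [
    {"forename": "John", "surname": "Smith"},
    {"forename": "Rebecca", "surname": "Jones"},
]
for r in records:
    # Derive the concatenated column before passing data to Splink
    r["forename_surname_concat"] = f"{r['forename']} {r['surname']}"
```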
+ +
+ + + +

+ JaccardAtThresholds(col_name, score_threshold_or_thresholds=[0.9, 0.7]) + +

+ + +
+

+ Bases: ComparisonCreator

+ + +

Represents a comparison of the data in col_name with three or more levels:

+
    +
  • Exact match in col_name
  • +
  • Jaccard score levels at specified thresholds
  • +
  • ...
  • +
  • Anything else
  • +
+

For example, with score_threshold_or_thresholds = [0.9, 0.7] the levels are:

+
    +
  • Exact match in col_name
  • +
  • Jaccard score in col_name >= 0.9
  • +
  • Jaccard score in col_name >= 0.7
  • +
  • Anything else
  • +
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str + +
+

The name of the column to compare.

+
+
+ required +
score_threshold_or_thresholds + Union[float, list] + +
+

The +threshold(s) to use for the Jaccard similarity level(s). +Defaults to [0.9, 0.7].

+
+
+ [0.9, 0.7] +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ JaroAtThresholds(col_name, score_threshold_or_thresholds=[0.9, 0.7]) + +

+ + +
+

+ Bases: ComparisonCreator

+ + +

Represents a comparison of the data in col_name with three or more levels:

+
    +
  • Exact match in col_name
  • +
  • Jaro score levels at specified thresholds
  • +
  • ...
  • +
  • Anything else
  • +
+

For example, with score_threshold_or_thresholds = [0.9, 0.7] the levels are:

+
    +
  • Exact match in col_name
  • +
  • Jaro score in col_name >= 0.9
  • +
  • Jaro score in col_name >= 0.7
  • +
  • Anything else
  • +
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str + +
+

The name of the column to compare.

+
+
+ required +
score_threshold_or_thresholds + Union[float, list] + +
+

The +threshold(s) to use for the Jaro similarity level(s). +Defaults to [0.9, 0.7].

+
+
+ [0.9, 0.7] +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ JaroWinklerAtThresholds(col_name, score_threshold_or_thresholds=[0.9, 0.7]) + +

+ + +
+

+ Bases: ComparisonCreator

+ + +

Represents a comparison of the data in col_name with three or more levels:

+
    +
  • Exact match in col_name
  • +
  • Jaro-Winkler score levels at specified thresholds
  • +
  • ...
  • +
  • Anything else
  • +
+

For example, with score_threshold_or_thresholds = [0.9, 0.7] the levels are:

+
    +
  • Exact match in col_name
  • +
  • Jaro-Winkler score in col_name >= 0.9
  • +
  • Jaro-Winkler score in col_name >= 0.7
  • +
  • Anything else
  • +
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str + +
+

The name of the column to compare.

+
+
+ required +
score_threshold_or_thresholds + Union[float, list] + +
+

The +threshold(s) to use for the Jaro-Winkler similarity level(s). +Defaults to [0.9, 0.7].

+
+
+ [0.9, 0.7] +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
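For intuition about what these thresholds mean, here is a compact reference implementation of Jaro-Winkler similarity (illustrative only; Splink uses the backend database's own similarity function):

```python
def jaro(s1, s2):
    # Jaro similarity: matching characters within a window, minus transpositions
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    match_dist = max(len1, len2) // 2 - 1
    matched1, matched2 = [False] * len1, [False] * len2
    m = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - match_dist), min(len2, i + match_dist + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    # Winkler boost for a shared prefix of up to 4 characters
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

For example, jaro_winkler("MARTHA", "MARHTA") is about 0.961, so that pair would clear a 0.9 threshold.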
+ +
+ + + +

+ LevenshteinAtThresholds(col_name, distance_threshold_or_thresholds=[1, 2]) + +

+ + +
+

+ Bases: ComparisonCreator

+ + +

Represents a comparison of the data in col_name with three or more levels:

+
    +
  • Exact match in col_name
  • +
  • Levenshtein levels at specified distance thresholds
  • +
  • ...
  • +
  • Anything else
  • +
+

For example, with distance_threshold_or_thresholds = [1, 3] the levels are

+
    +
  • Exact match in col_name
  • +
  • Levenshtein distance in col_name <= 1
  • +
  • Levenshtein distance in col_name <= 3
  • +
  • Anything else
  • +
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + str + +
+

The name of the column to compare

+
+
+ required +
distance_threshold_or_thresholds + Union[int, list] + +
+

The +threshold(s) to use for the levenshtein similarity level(s). +Defaults to [1, 2].

+
+
+ [1, 2] +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ NameComparison(col_name, *, jaro_winkler_thresholds=[0.92, 0.88, 0.7], dmeta_col_name=None) + +

+ + +
+

+ Bases: ComparisonCreator

+ + +

Generate an 'out of the box' comparison for a name column in the col_name +provided.

+

It's also possible to include a level for a dmetaphone match, but this requires +you to derive a dmetaphone column prior to importing it into Splink. Note +this is expected to be a column containing arrays of dmetaphone values, which +are of length 1 or 2.

+

The default comparison levels are:

+
    +
  • Null comparison
  • +
  • Exact match
  • +
  • Jaro-Winkler similarity > 0.92
  • +
  • Jaro-Winkler similarity > 0.88
  • +
  • Jaro-Winkler similarity > 0.70
  • +
  • Anything else
  • +
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + Union[str, ColumnExpression] + +
+

The column name or expression for +the names to be compared.

+
+
+ required +
jaro_winkler_thresholds + Union[float, list[float]] + +
+

Thresholds +for Jaro-Winkler similarity. Defaults to [0.92, 0.88, 0.7].

+
+
+ [0.92, 0.88, 0.7] +
dmeta_col_name + str + +
+

The column name for dmetaphone values. +If provided, array intersection level is included. This column must +contain arrays of dmetaphone values, which are of length 1 or 2.

+
+
+ None +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ +
+ + + +

+ PostcodeComparison(col_name, *, invalid_postcodes_as_null=False, lat_col=None, long_col=None, km_thresholds=[1, 10, 100]) + +

+ + +
+

+ Bases: ComparisonCreator

+ + +

Generate an 'out of the box' comparison for a postcode column +in the col_name provided.

+

The default comparison levels are:

+
    +
  • Null comparison
  • +
  • Exact match on full postcode
  • +
  • Exact match on sector
  • +
  • Exact match on district
  • +
  • Exact match on area
  • +
  • Distance in km (if lat_col and long_col are provided)
  • +
+

It's also possible to include levels for distance in km, but this requires +you to have geocoded your postcodes prior to importing them into Splink. Use +the lat_col and long_col arguments to tell Splink where to find the +latitude and longitude columns.

+

See https://ideal-postcodes.co.uk/guides/uk-postcode-format +for definitions

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
col_name + Union[str, ColumnExpression] + +
+

The column name or expression for +the postcodes to be compared.

+
+
+ required +
invalid_postcodes_as_null + bool + +
+

If True, treat invalid postcodes +as null. Defaults to False.

+
+
+ False +
lat_col + Union[str, ColumnExpression] + +
+

The column name or +expression for latitude. Required if km_thresholds is provided.

+
+
+ None +
long_col + Union[str, ColumnExpression] + +
+

The column name or +expression for longitude. Required if km_thresholds is provided.

+
+
+ None +
km_thresholds + Union[float, List[float]] + +
+

Thresholds for distance +in kilometers. If provided, lat_col and long_col must also be +provided.

+
+
+ [1, 10, 100] +
+ + + + +
+ + + + + + + + + + + +
+ +
+ +
+ + + + +
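The sector, district and area levels correspond to successively coarser parts of the postcode. A sketch of extracting them (the regex is an illustrative simplification of the full UK format described in the guide linked above):

```python
import re

# Simplified UK postcode pattern: area letters, district, sector digit, unit
POSTCODE_RE = re.compile(r"^([A-Z]{1,2})(\d[A-Z\d]?)\s*(\d)([A-Z]{2})$")

def postcode_parts(pc):
    m = POSTCODE_RE.match(pc.upper().strip())
    if not m:
        return None
    area, rest, sector_digit, _unit = m.groups()
    district = area + rest
    return {"area": area, "district": district,
            "sector": f"{district} {sector_digit}"}
```

For example, "SW1A 1AA" has area "SW", district "SW1A" and sector "SW1A 1".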
+ +
+ +

AbsoluteDateDifferenceAtThresholds

+

An alias of AbsoluteTimeDifferenceAtThresholds.

+

Configuring comparisons

+

Note that all comparisons have a .configure() method as follows:

+ + +
+ + + + +
+ +

Configure the comparison creator with options that are common to all +comparisons.

+

For m and u probabilities, the first +element in the list corresponds to the first comparison level, usually +an exact match level. Subsequent elements correspond to comparison +levels in sequential order, through to the last element, which is usually +the 'ELSE' level.

+

All options have default options set initially. Any call to .configure() +will set any options that are supplied. Any subsequent calls to .configure() +will not override these values with defaults; to override values you must +explicitly provide a value corresponding to the default.

+

Generally speaking only a single call (at most) to .configure() should +be required.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
term_frequency_adjustments + bool + +
+

Whether term frequency +adjustments are switched on for this comparison. Only applied +to exact match levels. +Default corresponds to False.

+
+
+ unsupplied_option +
m_probabilities + list + +
+

List of m probabilities +Default corresponds to None.

+
+
+ unsupplied_option +
u_probabilities + list + +
+

List of u probabilities +Default corresponds to None.

+
+
+ unsupplied_option +
+ + +
+ Example +
cc = LevenshteinAtThresholds("name", 2)
+cc.configure(
+    m_probabilities=[0.9, 0.08, 0.02],
+    u_probabilities=[0.01, 0.05, 0.94]
+    # probabilities for exact match level, levenshtein <= 2, and else
+    # in that order
+)
+
+
+
+ +
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/api_docs/datasets.html b/api_docs/datasets.html new file mode 100644 index 0000000000..d6cedf1934 --- /dev/null +++ b/api_docs/datasets.html @@ -0,0 +1,5761 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + SplinkDatasets - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + + + + +

In-built datasets

+

Splink has some datasets available for use to help you get up and running, test ideas, or explore Splink features. +To use, simply import splink_datasets: +

from splink import splink_datasets
+
+df = splink_datasets.fake_1000
+
+which you can then use to set up a linker: +
from splink import splink_datasets, Linker, DuckDBAPI, SettingsCreator
+import splink.comparison_library as cl
+
+df = splink_datasets.fake_1000
+linker = Linker(
+    df,
+    SettingsCreator(
+        link_type="dedupe_only",
+        comparisons=[
+            cl.ExactMatch("first_name"),
+            cl.ExactMatch("surname"),
+        ],
+    ),
+    db_api=DuckDBAPI()
+)
+
+
+Troubleshooting +

If you get a SSLCertVerificationError when trying to use the inbuilt datasets, this can be fixed with the ssl package by running:

+

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

+
+ +

Each attribute of splink_datasets is a dataset available for use, which exists as a pandas DataFrame. +These datasets are not packaged directly with Splink, but instead are downloaded only when they are requested. +Once requested they are cached for future use. +The cache can be cleared using splink_dataset_utils (see below), +which also contains information on available datasets, and which have already been cached.

+

Available datasets

+

The datasets available are listed below:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
dataset namedescriptionrowsunique entitieslink to source
fake_1000Fake 1000 from splink demos. Records are 250 simulated people, with different numbers of duplicates, labelled.1,000250source
historical_50kThe data is based on historical persons scraped from wikidata. Duplicate records are introduced with a variety of errors.50,0005,156source
febrl3The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. The FEBRL3 data set contains 5000 records (2000 originals and 3000 duplicates), with a maximum of 5 duplicates based on one original record.5,0002,000source
febrl4aThe Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL4a contains 5000 original records.5,0005,000source
febrl4bThe Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL4b contains 5000 duplicate records, one for each record in FEBRL4a.5,0005,000source
transactions_originThis data has been generated to resemble bank transactions leaving an account. There are no duplicates within the dataset and each transaction is designed to have a counterpart arriving in 'transactions_destination'. Memo is sometimes truncated or missing.45,32645,326source
transactions_destinationThis data has been generated to resemble bank transactions arriving in an account. There are no duplicates within the dataset and each transaction is designed to have a counterpart sent from 'transactions_origin'. There may be a delay between the source and destination account, and the amount may vary due to hidden fees and foreign exchange rates. Memo is sometimes truncated or missing.45,32645,326source
+ +

Some of the splink_datasets have corresponding clerical labels to help assess model performance. These are requested through the splink_dataset_labels module.

+

Available labels

+

The labels available are listed below:

+ + + + + + + + + + + + + + + + + + + +
dataset namedescriptionrowsunique entitieslink to source
fake_1000_labelsClerical labels for fake_10003,176NAsource
+ +

In addition to splink_datasets, you can also import splink_dataset_utils, +which has a few functions to help managing splink_datasets. +This can be useful if you have limited internet connection and want to see what is already cached, +or if you need to clear cache items (e.g. if datasets were to be updated, or if space is an issue).

+

For example: +

from splink.datasets import splink_dataset_utils
+
+splink_dataset_utils.show_downloaded_data()
+splink_dataset_utils.clear_cache(['fake_1000'])
+
+ + +
+ + + + +
+ + + + + +
+ + + + + + + + + +
+ + +

+ list_downloaded_datasets() + +

+ + +
+ +

Return a list of datasets that have already been pre-downloaded

+ +
+ +
+ +
+ + +

+ list_all_datasets() + +

+ + +
+ +

Return a list of all available datasets, regardless of whether +or not they have already been pre-downloaded

+ +
+ +
+ +
+ + +

+ show_downloaded_data() + +

+ + +
+ +

Print a list of datasets that have already been pre-downloaded

+ +
+ +
+ +
+ + +

+ clear_downloaded_data(datasets=None) + +

+ + +
+ +

Delete any pre-downloaded data stored locally.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
datasets + list + +
+

A list of dataset names (without any file suffix) +to delete. +If None, all datasets will be deleted. Default None

+
+
+ None +
+ +
+ +
+ + + +
+ +
+ +
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/api_docs/em_training_session.html b/api_docs/em_training_session.html new file mode 100644 index 0000000000..b36d50dc2a --- /dev/null +++ b/api_docs/em_training_session.html @@ -0,0 +1,5504 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + EM Training Session API - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + + + + +

Documentation for EMTrainingSession

+

linker.training.estimate_parameters_using_expectation_maximisation returns an object of type EMTrainingSession which has the following methods:

+ + +
+ + + + +
+ + +

Manages training models using the Expectation Maximisation algorithm, and +holds statistics on the evolution of parameter estimates. Plots diagnostic charts

+ + + + +
+ + + + + + + + + +
+ + +

+ probability_two_random_records_match_iteration_chart() + +

+ + +
+ +

Display a chart showing the iteration history of the probability that two +random records match.

+ + +

Returns:

+ + + + + + + + + + + + + +
TypeDescription
+ ChartReturnType + +
+

An interactive Altair chart.

+
+
+ +
+ +
+ +
+ + +

+ match_weights_interactive_history_chart() + +

+ + +
+ +

Display an interactive chart of the match weights history.

+ + +

Returns:

+ + + + + + + + + + + + + +
TypeDescription
+ ChartReturnType + +
+

An interactive Altair chart.

+
+
+ +
+ +
+ +
+ + +

+ m_u_values_interactive_history_chart() + +

+ + +
+ +

Display an interactive chart of the m and u values.

+ + +

Returns:

+ + + + + + + + + + + + + +
TypeDescription
+ ChartReturnType + +
+

An interactive Altair chart.

+
+
+ +
+ +
+ + + +
+ +
+ +
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/api_docs/evaluation.html b/api_docs/evaluation.html new file mode 100644 index 0000000000..8daaa8e4dc --- /dev/null +++ b/api_docs/evaluation.html @@ -0,0 +1,6259 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Evaluation - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + + + + +

Methods in Linker.evaluation

+ + +
+ + + + +
+ + +

Evaluate the performance of a Splink model. Accessed via +linker.evaluation

+ + + + +
+ + + + + + + + + +
+ + +

+ prediction_errors_from_labels_table(labels_splinkdataframe_or_table_name, include_false_positives=True, include_false_negatives=True, threshold_match_probability=0.5) + +

+ + +
+ +

Find false positives and false negatives by comparing the +clerical_match_score in the labels table with the Splink-predicted +match probability

+

The table of labels should be in the following format, and should be registered +as a table with your database using

+

labels_table = linker.table_management.register_labels_table(my_df)

+ + + + + + + + + + + + + + + + + + + + + + + + + + +
source_dataset_lunique_id_lsource_dataset_runique_id_rclerical_match_score
df_11df_220.99
df_11df_230.2
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
labels_splinkdataframe_or_table_name + str | SplinkDataFrame + +
+

Name of table +containing labels in the database

+
+
+ required +
include_false_positives + bool + +
+

Defaults to True.

+
+
+ True +
include_false_negatives + bool + +
+

Defaults to True.

+
+
+ True +
threshold_match_probability + float + +
+

Threshold probability +above which a prediction is considered to be a match. Defaults to 0.5.

+
+
+ 0.5 +
+ + +

Examples:

+
labels_table = linker.table_management.register_labels_table(df_labels)
+
+linker.evaluation.prediction_errors_from_labels_table(
+   labels_table, include_false_negatives=True, include_false_positives=False
+).as_pandas_dataframe()
+
+ + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
SplinkDataFrame + SplinkDataFrame + +
+

Table containing false positives and negatives

+
+
+ +
+ +
+ +
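In plain Python terms, the selection logic this method performs can be sketched as follows (a simplified illustration of the thresholding described above, not Splink's SQL implementation):

```python
def prediction_errors(labelled_pairs, threshold_match_probability=0.5):
    """Return the pairs scored on the wrong side of the threshold.

    labelled_pairs: iterable of (clerical_match_score, predicted_probability).
    """
    errors = []
    for clerical, predicted in labelled_pairs:
        is_true_match = clerical >= threshold_match_probability
        is_predicted_match = predicted >= threshold_match_probability
        if is_predicted_match and not is_true_match:
            errors.append(("false_positive", clerical, predicted))
        elif is_true_match and not is_predicted_match:
            errors.append(("false_negative", clerical, predicted))
    return errors

# Mirrors the example labels table above: one true match scored low,
# one non-match scored high
print(prediction_errors([(0.99, 0.10), (0.2, 0.95)]))
```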
+ + +

+ accuracy_analysis_from_labels_column(labels_column_name, *, threshold_match_probability=0.5, match_weight_round_to_nearest=0.1, output_type='threshold_selection', add_metrics=[], positives_not_captured_by_blocking_rules_scored_as_zero=True) + +

+ + +
+ +

Generate an accuracy chart or table from ground truth data, where the ground +truth is in a column in the input dataset called labels_column_name

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
labels_column_name + str + +
+

Column name containing labels in the input table

+
+
+ required +
threshold_match_probability + float + +
+

Where the +clerical_match_score provided by the user is a probability rather +than binary, this value is used as the threshold to classify +clerical_match_scores as binary matches or non matches. +Defaults to 0.5.

+
+
+ 0.5 +
match_weight_round_to_nearest + float + +
+

When provided, thresholds +are rounded. When large numbers of labels are provided, this is +sometimes necessary to reduce the size of the ROC table, and therefore +the number of points plotted on the chart. Defaults to 0.1.

+
+
+ 0.1 +
add_metrics + list(str) + +
+

Precision and recall metrics are always +included. Where provided, add_metrics specifies additional metrics +to show, with the following options:

+
    +
  • "specificity": specificity, selectivity, true negative rate (TNR)
  • +
  • "npv": negative predictive value (NPV)
  • +
  • "accuracy": overall accuracy (TP+TN)/(P+N)
  • +
  • "f1"/"f2"/"f0_5": F-scores for β=1 (balanced), β=2 +(emphasis on recall) and β=0.5 (emphasis on precision)
  • +
  • "p4" - an extended F1 score with specificity and NPV included
  • +
  • "phi" - φ coefficient or Matthews correlation coefficient (MCC)
  • +
+
+
+ [] +
+ + + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
chart + Union[ChartReturnType, SplinkDataFrame] + +
+

An altair chart

+
+
+ +
+ +
+ +
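The optional metrics listed under add_metrics are all standard functions of the confusion matrix. A minimal reference implementation, for intuition only (plain Python; Splink computes these internally):

```python
import math

def confusion_metrics(tp, fp, tn, fn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)             # a.k.a. sensitivity / TPR
    specificity = tn / (tn + fp)        # true negative rate (TNR)
    npv = tn / (tn + fn)                # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    phi = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )  # Matthews correlation coefficient
    return {
        "precision": precision, "recall": recall,
        "specificity": specificity, "npv": npv,
        "accuracy": accuracy, f"f{beta}": f_beta, "phi": phi,
    }

print(confusion_metrics(tp=6, fp=2, tn=10, fn=2))
```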
+ + +

+ accuracy_analysis_from_labels_table(labels_splinkdataframe_or_table_name, *, threshold_match_probability=0.5, match_weight_round_to_nearest=0.1, output_type='threshold_selection', add_metrics=[]) + +

+ + +
+ +

Generate an accuracy chart or table from labelled (ground truth) data.

+

The table of labels should be in the following format, and should be registered +as a table with your database using +labels_table = linker.register_labels_table(my_df)

+ + + + + + + + + + + + + + + + + + + + + + + + + + +
source_dataset_lunique_id_lsource_dataset_runique_id_rclerical_match_score
df_11df_220.99
df_11df_230.2
+

Note that source_dataset and unique_id should correspond to the values +specified in the settings dict, and the input_table_aliases passed to the +linker object.

+

For dedupe_only links, the source_dataset columns can be omitted.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
labels_splinkdataframe_or_table_name + str | SplinkDataFrame + +
+

Name of table +containing labels in the database

+
+
+ required +
threshold_match_probability + float + +
+

Where the +clerical_match_score provided by the user is a probability rather +than binary, this value is used as the threshold to classify +clerical_match_scores as binary matches or non matches. +Defaults to 0.5.

+
+
+ 0.5 +
match_weight_round_to_nearest + float + +
+

When provided, thresholds +are rounded. When large numbers of labels are provided, this is +sometimes necessary to reduce the size of the ROC table, and therefore +the number of points plotted on the chart. Defaults to 0.1.

+
+
+ 0.1 +
add_metrics + list(str) + +
+

Precision and recall metrics are always +included. Where provided, add_metrics specifies additional metrics +to show, with the following options:

+
    +
  • "specificity": specificity, selectivity, true negative rate (TNR)
  • +
  • "npv": negative predictive value (NPV)
  • +
  • "accuracy": overall accuracy (TP+TN)/(P+N)
  • +
  • "f1"/"f2"/"f0_5": F-scores for β=1 (balanced), β=2 +(emphasis on recall) and β=0.5 (emphasis on precision)
  • +
  • "p4" - an extended F1 score with specificity and NPV included
  • +
  • "phi" - φ coefficient or Matthews correlation coefficient (MCC)
  • +
+
+
+ [] +
+ + + +

Returns:

+ + + + + + + + + + + + + +
TypeDescription
+ Union[ChartReturnType, SplinkDataFrame] + +
+

altair.Chart: An altair chart

+
+
+ +
+ +
+ +
+ + +

+ prediction_errors_from_labels_column(label_colname, include_false_positives=True, include_false_negatives=True, threshold_match_probability=0.5) + +

+ + +
+ +

Generate a dataframe containing false positives and false negatives +based on the comparison between the splink match probability and the +labels column. A label column is a column in the input dataset that contains +the 'ground truth' cluster to which the record belongs

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
label_colname + str + +
+

Name of labels column in input data

+
+
+ required +
include_false_positives + bool + +
+

Defaults to True.

+
+
+ True +
include_false_negatives + bool + +
+

Defaults to True.

+
+
+ True +
threshold_match_probability + float + +
+

Threshold above which a score +is considered to be a match. Defaults to 0.5.

+
+
+ 0.5 +
+ + +

Examples:

+
linker.evaluation.prediction_errors_from_labels_column(
+    "ground_truth_cluster",
+    include_false_negatives=True,
+    include_false_positives=False
+).as_pandas_dataframe()
+
+ + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
SplinkDataFrame + SplinkDataFrame + +
+

Table containing false positives and negatives

+
+
+ +
+ +
+ +
+ + +

+ unlinkables_chart(x_col='match_weight', name_of_data_in_title=None, as_dict=False) + +

+ + +
+ +

Generate an interactive chart displaying the proportion of records that +are "unlinkable" for a given splink score threshold and model parameters.

+

Unlinkable records are those that, even when compared with themselves, do not +contain enough information to confirm a match.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
x_col + str + +
+

Column to use for the x-axis. +Defaults to "match_weight".

+
+
+ 'match_weight' +
name_of_data_in_title + str + +
+

Name of the source dataset to use for +the title of the output chart.

+
+
+ None +
as_dict + bool + +
+

If True, return a dict version of the chart.

+
+
+ False +
+ + +

Examples:

+

After estimating the parameters of the model, run:

+
linker.evaluation.unlinkables_chart()
+
+ + +

Returns:

+ + + + + + + + + + + + + +
TypeDescription
+ ChartReturnType + +
+

altair.Chart: An altair chart

+
+
+ +
+ +
+ +
+ + +

+ labelling_tool_for_specific_record(unique_id, source_dataset=None, out_path='labelling_tool.html', overwrite=False, match_weight_threshold=-4, view_in_jupyter=False, show_splink_predictions_in_interface=True) + +

+ + +
+ +

Create a standalone, offline labelling dashboard for a specific record +as identified by its unique id

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
unique_id + str + +
+

The unique id of the record for which to create the +labelling tool

+
+
+ required +
source_dataset + str + +
+

If there are multiple datasets, to +identify the record you must also specify the source_dataset. Defaults +to None.

+
+
+ None +
out_path + str + +
+

The output path for the labelling tool. Defaults +to "labelling_tool.html".

+
+
+ 'labelling_tool.html' +
overwrite + bool + +
+

If true, overwrite files at the output +path if they exist. Defaults to False.

+
+
+ False +
match_weight_threshold + int + +
+

Include possible matches in the +output which score above this threshold. Defaults to -4.

+
+
+ -4 +
view_in_jupyter + bool + +
+

If you're viewing in the Jupyter +html viewer, set this to True to extract your labels. Defaults to False.

+
+
+ False +
show_splink_predictions_in_interface + bool + +
+

Whether to +show information about the Splink model's predictions that could +potentially bias the decision of the clerical labeller. Defaults to +True.

+
+
+ True +
+ +
+ +
+ + + +
+ +
+ +
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/api_docs/exploratory.html b/api_docs/exploratory.html new file mode 100644 index 0000000000..062c7879eb --- /dev/null +++ b/api_docs/exploratory.html @@ -0,0 +1,5762 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Exploratory - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + + + + +

Documentation for splink.exploratory

+ + +
+ + + + +
+ + + +
+ + + + + + + + + +
+ + +

+ completeness_chart(table_or_tables, db_api, cols=None, table_names_for_chart=None) + +

+ + +
+ +

Generate a summary chart of data completeness (proportion of non-nulls) of +columns in each of the input table or tables. By default, completeness is assessed +for all columns in the input data.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
table_or_tables + Sequence[AcceptableInputTableType] + +
+

A single table or a list of tables of data

+
+
+ required +
db_api + DatabaseAPISubClass + +
+

The backend database API to use

+
+
+ required +
cols + List[str] + +
+

List of column names to calculate completeness. If +none, all columns will be computed. Default to None.

+
+
+ None +
table_names_for_chart + List[str] + +
+

A list of names. Must be the same length as +table_or_tables.

+
+
+ None +
+ +
+ +
+ +
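Completeness here simply means the proportion of non-null values per column. A toy version of the underlying calculation (an illustration in plain Python, not the chart itself):

```python
def completeness(records, cols=None):
    # records: list of dicts; cols: columns to assess (default: all keys seen)
    if cols is None:
        cols = sorted({k for r in records for k in r})
    n = len(records)
    return {c: sum(r.get(c) is not None for r in records) / n for c in cols}

rows = [
    {"first_name": "Ann", "dob": "1990-01-01"},
    {"first_name": "Bob", "dob": None},
    {"first_name": None, "dob": "1985-06-30"},
    {"first_name": "Cal", "dob": "2001-12-12"},
]
print(completeness(rows))  # {'dob': 0.75, 'first_name': 0.75}
```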
+ + +

+ profile_columns(table_or_tables, db_api, column_expressions=None, top_n=10, bottom_n=10) + +

+ + +
+ +

Profiles the specified columns of the dataframe initiated with the linker.

+

This can be computationally expensive if the dataframe is large.

+

For the provided columns with column_expressions (or for all columns if left empty) +calculate: +- A distribution plot that shows the count of values at each percentile. +- A top n chart, showing the count of the top n values +within the column. +- A bottom n chart, showing the count of the bottom +n values within the column.

+

This should be used to explore the dataframe, determine if columns have +sufficient completeness for linking, analyse the cardinality of columns, and +identify the need for standardisation within a given column.

+

Args:

+
column_expressions (list, optional): A list of strings containing the
+    specified column names.
+    If left empty this will default to all columns.
+top_n (int, optional): The number of top n values to plot.
+bottom_n (int, optional): The number of bottom n values to plot.
+
+ + +

Returns:

+ + + + + + + + + + + + + + + + + +
TypeDescription
+ Optional[ChartReturnType] + +
+

altair.Chart or dict: A visualization or JSON specification describing the

+
+
+ Optional[ChartReturnType] + +
+

profiling charts.

+
+
+ + +
+ Note +
    +
  • The linker object should be an instance of the initiated linker.
  • +
  • The provided column_expressions can be a list of column names to profile. + If left empty, all columns will be profiled.
  • +
  • The top_n and bottom_n parameters determine the number of top and bottom + values to display in the respective charts.
  • +
+
+
+ +
+ + + +
+ +
+ +

Documentation for splink.exploratory.similarity_analysis

+ + +
+ + + + +
+ + + +
+ + + + + + + + + +
+ + +

+ comparator_score(str1, str2, decimal_places=2) + +

+ + +
+ +

Helper function to give the similarity between two strings for +the string comparators in splink.

+ + +

Examples:

+
import splink.exploratory.similarity_analysis as sa
+
+sa.comparator_score("Richard", "iRchard")
+
+ +
+ +
+ +
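For intuition about what these similarity helpers measure, here is a self-contained Levenshtein (edit) distance in plain Python. This is an illustration only — it is not the implementation Splink's backends use:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance: the minimum number of
    # single-character insertions, deletions and substitutions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = cur
    return prev[-1]

print(levenshtein("Stephen", "Steven"))   # 2
print(levenshtein("Richard", "iRchard"))  # 2 (no transposition operation)
```

Note that plain Levenshtein counts the swapped leading letters of "Richard"/"iRchard" as two substitutions; measures such as Jaro-Winkler are more forgiving of transpositions.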
+ + +

+ comparator_score_chart(list, col1, col2) + +

+ + +
+ +

Helper function returning a heatmap showing the string similarity +scores and string distances for a list of strings.

+ + +

Examples:

+
import splink.exploratory.similarity_analysis as sa
+
+list = {
+        "string1": ["Stephen", "Stephen", "Stephen"],
+        "string2": ["Stephen", "Steven", "Stephan"],
+        }
+
+sa.comparator_score_chart(list, "string1", "string2")
+
+ +
+ +
+ +
+ + +

+ comparator_score_df(list, col1, col2, decimal_places=2) + +

+ + +
+ +

Helper function returning a dataframe showing the string similarity +scores and string distances for a list of strings.

+ + +

Examples:

+
import splink.exploratory.similarity_analysis as sa
+
+list = {
+        "string1": ["Stephen", "Stephen","Stephen"],
+        "string2": ["Stephen", "Steven", "Stephan"],
+        }
+
+sa.comparator_score_df(list, "string1", "string2")
+
+ +
+ +
+ +
+ + +

+ comparator_score_threshold_chart(list, col1, col2, similarity_threshold=None, distance_threshold=None) + +

+ + +
+ +

Helper function returning a heatmap showing the string similarity +scores and string distances for a list of strings given a threshold.

+ + +

Examples:

+
import splink.exploratory.similarity_analysis as sa
+
+list = {
+        "string1": ["Stephen", "Stephen","Stephen"],
+        "string2": ["Stephen", "Steven", "Stephan"],
+        }
+
+sa.comparator_score_threshold_chart(list,
+                         "string1", "string2",
+                         similarity_threshold=0.8,
+                         distance_threshold=2)
+
+ +
+ +
+ +
+ + +

+ phonetic_match_chart(list, col1, col2) + +

+ + +
+ +

Helper function returning a heatmap showing the phonetic transform and +matches for a list of strings given a threshold.

+ + +

Examples:

+
import splink.exploratory.similarity_analysis as sa
+
+list = {
+        "string1": ["Stephen", "Stephen","Stephen"],
+        "string2": ["Stephen", "Steven", "Stephan"],
+        }
+
+sa.phonetic_match_chart(list, "string1", "string2")
+
+ +
+ +
+ +
+ + +

+ phonetic_transform(string) + +

+ + +
+ +

Helper function to give the phonetic transformation of a string with +Soundex, Metaphone and Double Metaphone.

+ + +

Examples:

+
phonetic_transform("Richard")
+
+ +
+ +
+ +
+ + +

+ phonetic_transform_df(list, col1, col2) + +

+ + +
+ +

Helper function returning a dataframe showing the phonetic transforms +for a list of strings.

+ + +

Examples:

+
import splink.exploratory.similarity_analysis as sa
+
+list = {
+        "string1": ["Stephen", "Stephen","Stephen"],
+        "string2": ["Stephen", "Steven", "Stephan"],
+        }
+
+sa.phonetic_transform_df(list, "string1", "string2")
+
+ +
+ +
+ + + +
+ +
+ +
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/api_docs/inference.html b/api_docs/inference.html new file mode 100644 index 0000000000..9a70716e4f --- /dev/null +++ b/api_docs/inference.html @@ -0,0 +1,5810 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Inference - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + + + + +

Methods in Linker.inference

+ + +
+ + + + +
+ + +

Use your Splink model to make predictions (perform inference). Accessed via +linker.inference.

+ + + + +
+ + + + + + + + + +
+ + + + + +
+ +

Uses the blocking rules specified by +blocking_rules_to_generate_predictions in your settings to +generate pairwise record comparisons.

+

For deterministic linkage, this should be a list of blocking rules which +are strict enough to generate only true links.

+

Deterministic linkage, however, is likely to result in missed links +(false negatives).

+

Examples:

+
```py
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name", "surname"),
+        block_on("dob", "first_name"),
+    ],
+)
+
+linker = Linker(df, settings, db_api=db_api)
+splink_df = linker.inference.deterministic_link()
+```
+
+ + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
SplinkDataFrame + SplinkDataFrame + +
+

A SplinkDataFrame of the pairwise comparisons.

+
+
+ +
+ +
+ +
+ + +

+ predict(threshold_match_probability=None, threshold_match_weight=None, materialise_after_computing_term_frequencies=True, materialise_blocked_pairs=True) + +

+ + +
+ +

Create a dataframe of scored pairwise comparisons using the parameters +of the linkage model.

+

Uses the blocking rules specified in the +blocking_rules_to_generate_predictions key of the settings to +generate the pairwise comparisons.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
threshold_match_probability + float + +
+

If specified, +filter the results to include only pairwise comparisons with a +match_probability above this threshold. Defaults to None.

+
+
+ None +
threshold_match_weight + float + +
+

If specified, +filter the results to include only pairwise comparisons with a +match_weight above this threshold. Defaults to None.

+
+
+ None +
materialise_after_computing_term_frequencies + bool + +
+

If true, Splink +will materialise the table containing the input nodes (rows) +joined to any term frequencies which have been asked +for in the settings object. If False, this will be +computed as part of a large CTE pipeline. Defaults to True

+
+
+ True +
materialise_blocked_pairs + bool + +
+

In the blocking phase, materialise the table +of pairs of records that will be scored

+
+
+ True +
+ + +

Examples:

+
linker = linker(df, "saved_settings.json", db_api=db_api)
+splink_df = linker.inference.predict(threshold_match_probability=0.95)
+splink_df.as_pandas_dataframe(limit=5)
+
+ + +
+ +
+ +
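The two thresholds accepted by predict are two views of the same quantity: in Splink, a match weight is the log base-2 of the Bayes factor, so probabilities and match weights can be converted between freely. A small sketch of the conversion in plain Python:

```python
import math

def probability_to_match_weight(p: float) -> float:
    # Bayes factor = p / (1 - p); match weight = log2(Bayes factor)
    return math.log2(p / (1 - p))

def match_weight_to_probability(w: float) -> float:
    bayes_factor = 2 ** w
    return bayes_factor / (1 + bayes_factor)

print(probability_to_match_weight(0.95))  # ~4.25
print(match_weight_to_probability(0.0))   # 0.5 (even odds)
```

So, for example, filtering at threshold_match_probability=0.95 corresponds to a match weight threshold of roughly 4.25.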
+ + +

+ find_matches_to_new_records(records_or_tablename, blocking_rules=[], match_weight_threshold=-4) + +

+ + +
+ +

Given one or more records, find records in the input dataset(s) which match +and return in order of the Splink prediction score.

+

This effectively provides a way of searching the input datasets +for given record(s)

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
records_or_tablename + List[dict] + +
+

Input search record(s) as list of dict, +or a table registered to the database.

+
+
+ required +
blocking_rules + list + +
+

Blocking rules to select +which records to find and score. If [], do not use a blocking +rule - meaning the input records will be compared to all records +provided to the linker when it was instantiated. Defaults to [].

+
+
+ [] +
match_weight_threshold + int + +
+

Return matches with a match weight +above this threshold. Defaults to -4.

+
+
+ -4 +
+ + +

Examples:

+
linker = Linker(df, "saved_settings.json", db_api=db_api)
+
+# You should load or pre-compute tf tables for any tables with
+# term frequency adjustments
+linker.table_management.compute_tf_table("first_name")
+# OR
+linker.table_management.register_term_frequency_lookup(df, "first_name")
+
+record = {'unique_id': 1,
+    'first_name': "John",
+    'surname': "Smith",
+    'dob': "1971-05-24",
+    'city': "London",
+    'email': "john@smith.net"
+    }
+df = linker.inference.find_matches_to_new_records(
+    [record], blocking_rules=[]
+)
+
+ + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
SplinkDataFrame + SplinkDataFrame + +
+

The pairwise comparisons.

+
+
+ +
+ +
+ +
+ + +

+ compare_two_records(record_1, record_2) + +

+ + +
+ +

Use the linkage model to compare and score a pairwise record comparison +based on the two input records provided

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
record_1 + dict + +
+

dictionary representing the first record. Column names +and data types must be the same as the columns in the settings object

+
+
+ required +
record_2 + dict + +
+

dictionary representing the second record. Column names +and data types must be the same as the columns in the settings object

+
+
+ required +
+ + +

Examples:

+
linker = Linker(df, "saved_settings.json", db_api=db_api)
+
+# You should load or pre-compute tf tables for any tables with
+# term frequency adjustments
+linker.table_management.compute_tf_table("first_name")
+# OR
+linker.table_management.register_term_frequency_lookup(df, "first_name")
+
+record_1 = {'unique_id': 1,
+    'first_name': "John",
+    'surname': "Smith",
+    'dob': "1971-05-24",
+    'city': "London",
+    'email': "john@smith.net"
+    }
+
+record_2 = {'unique_id': 1,
+    'first_name': "Jon",
+    'surname': "Smith",
+    'dob': "1971-05-23",
+    'city': "London",
+    'email': "john@smith.net"
+    }
+df = linker.inference.compare_two_records(record_1, record_2)
+
+ + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
SplinkDataFrame + SplinkDataFrame + +
+

Pairwise comparison with scored prediction

+
+
+ +
+ +
+ + + +
+ +
+ +
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/api_docs/misc.html b/api_docs/misc.html new file mode 100644 index 0000000000..22c05e6b4e --- /dev/null +++ b/api_docs/misc.html @@ -0,0 +1,5482 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Miscellaneous functions - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + + + + +

Methods in Linker.misc

+ + +
+ + + + +
+ + +

Miscellaneous methods on the linker that don't fit into other categories. +Accessed via linker.misc.

+ + + + +
+ + + + + + + + + +
+ + +

+ save_model_to_json(out_path=None, overwrite=False) + +

+ + +
+ +

Save the configuration and parameters of the linkage model to a .json file.

+

The model can later be loaded into a new linker using +`Linker(df, settings="path/to/model.json", db_api=db_api)`.

+

The settings dict is also returned in case you want to save it in a different way.

+ + +

Examples:

+
linker.misc.save_model_to_json("my_settings.json", overwrite=True)
+
+ + + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
dict + dict[str, Any] + +
+

The settings as a dictionary.

+
+
+ +
+ +
+ +
+ + +

+ query_sql(sql, output_type='pandas') + +

+ + +
+ +

Run a SQL query against your backend database and return +the resulting output.

+ + +

Examples:

+
linker = Linker(df, settings, db_api)
+df_predict = linker.predict()
+linker.misc.query_sql(f"select * from {df_predict.physical_name} limit 10")
+
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
sql + str + +
+

The SQL to be queried.

+
+
+ required +
output_type + str + +
+

One of splink_df/splinkdf or pandas. +This determines the type of table that your results are output in.

+
+
+ 'pandas' +
+ +
+ +
+ + + +
+ +
+ +
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/api_docs/settings_dict_guide.html b/api_docs/settings_dict_guide.html new file mode 100644 index 0000000000..b8353bb78e --- /dev/null +++ b/api_docs/settings_dict_guide.html @@ -0,0 +1,5840 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Settings Dict - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + + + + +

Settings Dict

+ +
+

This document enumerates all the settings and configuration options available when +developing your data linkage model.

+
+ + +

The type of data linking task. Required.

+
    +
  • +

    When dedupe_only, Splink finds duplicates. The user is expected to provide a single input dataset.

    +
  • +
  • +

    When link_and_dedupe, Splink finds links within and between input datasets. The user is expected to provide two or more input datasets.

    +
  • +
  • +

    When link_only, Splink finds links between datasets, but does not attempt to deduplicate them (it does not try to find links within each input dataset). The user is expected to provide two or more input datasets.

    +
  • +
+

Examples: ['dedupe_only', 'link_only', 'link_and_dedupe']

+
+ +

probability_two_random_records_match

+

The probability that two records chosen at random (with no blocking) are a match. For example, if there are a million input records and each has on average one match, then this value should be 1/1,000,000.

+

If you estimate parameters using expectation maximisation (EM), this provides an initial value (prior) from which the EM algorithm will start iterating. EM will then estimate the true value of this parameter.

+

Default value: 0.0001

+

Examples: [1e-05, 0.006]

+
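The worked example above can be checked with a little arithmetic (a sketch; the counts are illustrative):

```python
n_records = 1_000_000

# Each record has on average one matching record, and each match involves
# two records, so the number of matching *pairs* is n / 2.
matching_pairs = n_records * 1 / 2
total_pairs = n_records * (n_records - 1) / 2  # all pairwise comparisons

prior = matching_pairs / total_pairs  # ~1 / 1,000,000
print(prior)
```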
+ +

em_convergence

+

Convergence tolerance for the Expectation Maximisation algorithm

+

The algorithm will stop converging when the maximum of the change in model parameters between iterations is below this value

+

Default value: 0.0001

+

Examples: [0.0001, 1e-05, 1e-06]

+
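A sketch of the stopping rule described above (plain Python; the parameter representation is illustrative, not Splink's internals):

```python
def has_converged(params_prev, params_new, em_convergence=0.0001):
    # Stop when the largest absolute change in any parameter between
    # successive EM iterations falls below the tolerance.
    max_change = max(abs(new - old) for old, new in zip(params_prev, params_new))
    return max_change < em_convergence

print(has_converged([0.30, 0.70], [0.30005, 0.69995]))  # True
print(has_converged([0.30, 0.70], [0.35, 0.70]))        # False
```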
+ +

max_iterations

+

The maximum number of Expectation Maximisation iterations to run (even if convergence has not been reached)

+

Default value: 25

+

Examples: [20, 150]

+
+ +

unique_id_column_name

+

Splink requires that the input dataset has a column that uniquely identifies each record. unique_id_column_name is the name of the column in the input dataset representing this unique id

+

For linking tasks, ids must be unique within each dataset being linked, and do not need to be globally unique across input datasets

+

Default value: unique_id

+

Examples: ['unique_id', 'id', 'pk']

+
+ +

source_dataset_column_name

+

The name of the column in the input dataset representing the source dataset

+

Where we are linking datasets, we can't guarantee that the unique id column is globally unique across datasets, so we combine it with a source_dataset column. Usually, this is created by Splink for the user

+

Default value: source_dataset

+

Examples: ['source_dataset', 'dataset_name']

+
+ +

retain_matching_columns

+

If set to true, each column used by the comparisons SQL expressions will be retained in output datasets

+

This is helpful so that the user can inspect matches, but once the comparison vector (gamma) columns are computed, this information is not actually needed by the algorithm. The algorithm will run faster and use less resources if this is set to false.

+

Default value: True

+

Examples: [False, True]

+
+ +

retain_intermediate_calculation_columns

+

Retain intermediate calculation columns, such as the Bayes factors associated with each column in comparisons

+

The algorithm will run faster and use less resources if this is set to false.

+

Default value: False

+

Examples: [False, True]

+
+ +

comparisons

+

A list specifying how records should be compared for probabilistic matching. Each element is a dictionary

+
+Settings keys nested within each member of comparisons +

output_column_name

+

The name used to refer to this comparison in the output dataset. By default, Splink will set this to the name(s) of any input columns used in the comparison. This key is most useful to give a clearer description to comparisons that use multiple input columns. e.g. a location column that uses postcode and town may be named location

+

For a comparison column that uses a single input column, e.g. first_name, this will be set to first_name. For comparison columns that use multiple columns, if left blank, this will be set to the concatenation of the columns used.

+

Examples: ['first_name', 'surname']

+
+

comparison_description

+

An optional label to describe this comparison, to be used in charting outputs.

+

Examples: ['First name exact match', 'Surname with middle levenshtein level']

+
+

comparison_levels

+

Comparison levels specify how input values should be compared. Each level corresponds to an assessment of similarity, such as exact match, Jaro-Winkler match, one side of the match being null, etc

+

Each comparison level represents a branch of a SQL case expression. They are specified in order of evaluation, each with a sql_condition that represents the branch of a case expression

+

Example: +

[{
+    "sql_condition": "first_name_l IS NULL OR first_name_r IS NULL",
+    "label_for_charts": "null",
+    "is_null_level": True
+},
+{
+    "sql_condition": "first_name_l = first_name_r",
+    "label_for_charts": "exact_match",
+    "tf_adjustment_column": "first_name"
+},
+{
+    "sql_condition": "ELSE",
+    "label_for_charts": "else"
+}]
+
+
+
+Settings keys nested within each member of comparison_levels +

sql_condition

+

A branch of a SQL case expression without WHEN and THEN e.g. jaro_winkler_sim(surname_l, surname_r) > 0.88

+

Examples: ['forename_l = forename_r', 'jaro_winkler_sim(surname_l, surname_r) > 0.88']

+
+

label_for_charts

+

A label for this comparison level, which will appear on charts as a reminder of what the level represents

+

Examples: ['exact', 'postcode exact']

+
+

u_probability

+

the u probability for this comparison level - i.e. the proportion of records that match this level amongst truly non-matching records

+

Examples: [0.9]

+
+

m_probability

+

the m probability for this comparison level - i.e. the proportion of records that match this level amongst truly matching records

+

Examples: [0.1]

+
+
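Taken together, the m and u probabilities for a level determine its Bayes factor (m/u) and its partial match weight (log2 of the Bayes factor) in the Fellegi-Sunter model that Splink implements. A minimal sketch of the arithmetic, using the illustrative m = 0.1 and u = 0.9 values from the examples above:

```python
import math

# Illustrative values from the examples above: m = 0.1, u = 0.9.
# Because m < u here, this level is evidence *against* a match,
# so the match weight is negative.
m_probability = 0.1  # proportion of truly matching records in this level
u_probability = 0.9  # proportion of truly non-matching records in this level

bayes_factor = m_probability / u_probability
match_weight = math.log2(bayes_factor)

print(round(bayes_factor, 3))  # 0.111
print(round(match_weight, 3))  # -3.17
```

A level where m exceeds u would instead produce a Bayes factor above 1 and a positive match weight.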

is_null_level

+

If true, m and u values will not be estimated and instead the match weight will be zero for this column. For the treatment of nulls, see page 356 of https://imai.fas.harvard.edu/research/files/linkage.pdf: 'Under this MAR assumption, we can simply ignore missing data.'

+

Default value: False

+
+

tf_adjustment_column

+

Make term frequency adjustments for this comparison level using this input column

+

Default value: None

+

Examples: ['first_name', 'postcode']

+
+

tf_adjustment_weight

+

Make term frequency adjustments using this weight. A weight of 1.0 is a full adjustment. A weight of 0.0 is no adjustment. A weight of 0.5 is a half adjustment

+

Default value: 1.0

+

Examples: [1.0, 0.5]

+
+

tf_minimum_u_value

+

Where the term frequency adjustment implies a u value below this value, use this minimum value instead

+

This prevents excessive weight being assigned to very unusual terms, such as a collision on a typo

+

Default value: 0.0

+

Examples: [0.001, 1e-09]

+
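A sketch combining the three term frequency keys documented above into a single comparison level dictionary (the first_name column and the specific values are illustrative, not a recommendation):

```python
# Illustrative exact-match comparison level using all three term
# frequency keys documented above.
exact_match_level = {
    "sql_condition": "first_name_l = first_name_r",
    "label_for_charts": "exact_match",
    "tf_adjustment_column": "first_name",  # adjust using this column's term frequencies
    "tf_adjustment_weight": 0.5,           # apply half of the full adjustment
    "tf_minimum_u_value": 0.001,           # floor for u values implied by rare terms
}

print(sorted(exact_match_level))
```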
+
+
+
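Assembling the keys documented in this section, a minimal sketch of one member of the comparisons list (the column name is illustrative; it uses the documented is_null_level key):

```python
# One member of the "comparisons" list: a null level, an exact match
# level with a term frequency adjustment, and a catch-all else level.
first_name_comparison = {
    "output_column_name": "first_name",
    "comparison_description": "First name exact match",
    "comparison_levels": [
        {
            "sql_condition": "first_name_l IS NULL OR first_name_r IS NULL",
            "label_for_charts": "null",
            "is_null_level": True,
        },
        {
            "sql_condition": "first_name_l = first_name_r",
            "label_for_charts": "exact_match",
            "tf_adjustment_column": "first_name",
        },
        {"sql_condition": "ELSE", "label_for_charts": "else"},
    ],
}

# The levels are evaluated in order, like branches of a SQL CASE expression.
print(len(first_name_comparison["comparison_levels"]))  # 3
```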

blocking_rules_to_generate_predictions

+

A list of one or more blocking rules to apply. A Cartesian join is applied if blocking_rules_to_generate_predictions is empty or not supplied.

+

Each rule is a SQL expression representing the blocking rule, which will be used to create a join. The left table is aliased with l and the right table is aliased with r. For example, if you want to block on a first_name column, the blocking rule would be

+

l.first_name = r.first_name.

+

To block on first name and the first letter of surname, it would be

+

l.first_name = r.first_name and substr(l.surname,1,1) = substr(r.surname,1,1).

+

Note that Splink deduplicates the comparisons generated by the blocking rules.

+

If empty or not supplied, all comparisons between the input dataset(s) will be generated and blocking will not be used. For large input datasets, this will generally be computationally intractable because it will generate comparisons equal to the number of rows squared.

+

Default value: []

+

Examples: [['l.first_name = r.first_name AND l.surname = r.surname', 'l.dob = r.dob']]

+
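The warning about intractability can be made concrete: with no blocking rules, deduplicating n records generates n(n-1)/2 pairwise comparisons. A sketch of the arithmetic (record counts illustrative):

```python
# Pairwise comparisons generated when deduplicating n records with an
# empty blocking_rules_to_generate_predictions (i.e. a Cartesian join).
def cartesian_comparisons(n: int) -> int:
    return n * (n - 1) // 2

print(cartesian_comparisons(1_000))      # 499500
print(cartesian_comparisons(1_000_000))  # 499999500000 - why blocking is needed
```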
+ +

additional_columns_to_retain

+

A list of columns not being used in the probabilistic matching comparisons that you want to include in your results.

+

By default, Splink drops columns which are not used by any comparisons. This gives you the option to retain columns which are not used by the model. A common example is if the user has labelled data (training data) and wishes to retain the labels in the outputs

+

Default value: []

+

Examples: [['cluster', 'col_2'], ['other_information']]

+
+ +

bayes_factor_column_prefix

+

The prefix to use for the columns that will be created to store the Bayes factors

+

Default value: bf_

+

Examples: ['bf_', '__bf__']

+
+ +

term_frequency_adjustment_column_prefix

+

The prefix to use for the columns that will be created to store the term frequency adjustments

+

Default value: tf_

+

Examples: ['tf_', '__tf__']

+
+ +

comparison_vector_value_column_prefix

+

The prefix to use for the columns that will be created to store the comparison vector values

+

Default value: gamma_

+

Examples: ['gamma_', '__gamma__']

+
+ +

sql_dialect

+

The SQL dialect in which sql_conditions are written. Must be a valid SQLGlot dialect

+

Default value: None

+

Examples: ['spark', 'duckdb', 'presto', 'sqlite']

+
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/api_docs/splink_dataframe.html b/api_docs/splink_dataframe.html new file mode 100644 index 0000000000..73bf1df199 --- /dev/null +++ b/api_docs/splink_dataframe.html @@ -0,0 +1,5601 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + SplinkDataFrame - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+ +
+
+ + + +
+
+ + + + + + + + + + + + + + + +

Documentation forSplinkDataFrame

+ + +
+ + + + +
+

+ Bases: ABC

+ + +

Abstraction over dataframe to handle basic operations like retrieving data and +retrieving column names, which need different implementations depending on whether +it's a spark dataframe, sqlite table etc. +Uses methods like as_pandas_dataframe() and as_record_dict() to retrieve data

+ + + + +
+ + + + + + + + + +
+ + + + + +
+ +

Return the dataframe as a pandas dataframe.

+

This can be computationally expensive if the dataframe is large.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
limit + int + +
+

If provided, return this number of rows (equivalent +to a limit statement in SQL). Defaults to None, meaning return all rows

+
+
+ None +
+ + +

Examples:

+
df_predict = linker.inference.predict()
+df_ten_edges = df_predict.as_pandas_dataframe(10)
+
+ + +
+ +
+ +
+ + + + + +
+ +

Return the dataframe as a list of record dictionaries.

+

This can be computationally expensive if the dataframe is large.

+ + +

Examples:

+
df_predict = linker.inference.predict()
+ten_edges = df_predict.as_record_dict(10)
+
+ + + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
list + list[dict[str, Any]] + +
+

a list of records, each of which is a dictionary

+
+
+ +
+ +
+ +
+ + + + + +
+ +

Drops the table from the underlying database, and removes it +from the (linker) cache.

+

By default this will fail if the table is not one created by Splink, +but this check can be overridden

+ + +

Examples:

+
df_predict = linker.inference.predict()
+df_predict.drop_table_from_database_and_remove_from_cache()
+# predictions table no longer in the database / cache
+
+ + +
+ +
+ +
+ + + + + +
+ +

Save the dataframe in csv format.

+ + +

Examples:

+
df_predict = linker.inference.predict()
+df_predict.to_csv("model_predictions.csv", overwrite=True)
+
+ + +
+ +
+ +
+ + + + + +
+ +

Save the dataframe in parquet format.

+ + +

Examples:

+
df_predict = linker.inference.predict()
+df_predict.to_parquet("model_predictions.parquet", overwrite=True)
+
+ + +
+ +
+ + + +
+ +
+ +
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/api_docs/table_management.html b/api_docs/table_management.html new file mode 100644 index 0000000000..945cdf79a6 --- /dev/null +++ b/api_docs/table_management.html @@ -0,0 +1,5965 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Table Management - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + + + + +

Methods in Linker.table_management

+ + +
+ + + + +
+ + +

Register Splink tables against your database backend and manage the Splink cache. +Accessed via linker.table_management.

+ + + + +
+ + + + + + + + + +
+ + +

+ compute_tf_table(column_name) + +

+ + +
+ +

Compute a term frequency table for a given column and persist to the database

+

This method is useful if you want to pre-compute term frequency tables e.g. +so that real time linkage executes faster, or so that you can estimate +various models without having to recompute term frequency tables each time

+

Examples:

+
Real time linkage
+```py
+linker = Linker(df, settings="saved_settings.json", db_api=db_api)
+linker.table_management.compute_tf_table("surname")
+linker.compare_two_records(record_left, record_right)
+```
+Pre-computed term frequency tables
+```py
+linker = Linker(df, db_api)
+df_first_name_tf = linker.table_management.compute_tf_table("first_name")
+df_first_name_tf.write.parquet("folder/first_name_tf")
+# On subsequent data linking job, read this table rather than recompute
+df_first_name_tf = pd.read_parquet("folder/first_name_tf")
+df_first_name_tf.createOrReplaceTempView("__splink__df_tf_first_name")
+```
+
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
column_name + str + +
+

The column name in the input table

+
+
+ required +
+ + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
SplinkDataFrame + SplinkDataFrame + +
+

The resultant table as a splink data frame

+
+
+ +
+ +
+ +
+ + +

+ invalidate_cache() + +

+ + +
+ +

Invalidate the Splink cache. Any previously-computed tables +will be recomputed. +This is useful, for example, if the input data tables have changed.

+ +
+ +
+ +
+ + +

+ register_table_input_nodes_concat_with_tf(input_data, overwrite=False) + +

+ + +
+ +

Register a pre-computed version of the input_nodes_concat_with_tf table that +you want to re-use e.g. that you created in a previous run.

+

This method allows you to register this table in the Splink cache so it will be +used rather than Splink computing this table anew.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
input_data + AcceptableInputTableType + +
+

The data you wish to register. This +can be either a dictionary, pandas dataframe, pyarrow table or a spark +dataframe.

+
+
+ required +
overwrite + bool + +
+

Overwrite the table in the underlying database if it +exists.

+
+
+ False +
+ + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
SplinkDataFrame + SplinkDataFrame + +
+

An abstraction representing the table created by the sql +pipeline

+
+
+ +
+ +
+ +
+ + +

+ register_table_predict(input_data, overwrite=False) + +

+ + +
+ +

Register a pre-computed version of the prediction table for use in Splink.

+

This method allows you to register a pre-computed prediction table in the Splink +cache so it will be used rather than Splink computing the table anew.

+ + +

Examples:

+
predict_df = pd.read_parquet("path/to/predict_df.parquet")
+predict_as_splinkdataframe = linker.table_management.register_table_predict(predict_df)
+clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
+    predict_as_splinkdataframe, threshold_match_probability=0.75
+)
+
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
input_data + AcceptableInputTableType + +
+

The data you wish to register. This +can be either a dictionary, pandas dataframe, pyarrow table, or a spark +dataframe.

+
+
+ required +
overwrite + bool + +
+

Overwrite the table in the underlying database +if it exists. Defaults to False.

+
+
+ False +
+ + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
SplinkDataFrame + +
+

An abstraction representing the table created by the SQL +pipeline.

+
+
+ +
+ +
+ +
+ + +

+ register_term_frequency_lookup(input_data, col_name, overwrite=False) + +

+ + +
+ +

Register a pre-computed term frequency lookup table for a given column.

+

This method allows you to register a term frequency table in the Splink +cache for a specific column. This table will then be used during linkage +rather than computing the term frequency table anew from your input data.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
input_data + AcceptableInputTableType + +
+

The data representing the term +frequency table. This can be either a dictionary, pandas dataframe, +pyarrow table, or a spark dataframe.

+
+
+ required +
col_name + str + +
+

The name of the column for which the term frequency +lookup table is being registered.

+
+
+ required +
overwrite + bool + +
+

Overwrite the table in the underlying +database if it exists. Defaults to False.

+
+
+ False +
+ + +

Returns:

+ + + + + + + + + + + + + + + + + +
Name TypeDescription
SplinkDataFrame + +
+

An abstraction representing the registered term frequency table.

+
+
+ +
+


+
+
+ + +

Examples:

+
tf_table = [
+    {"first_name": "theodore", "tf_first_name": 0.012},
+    {"first_name": "alfie", "tf_first_name": 0.013},
+]
+tf_df = pd.DataFrame(tf_table)
+linker.table_management.register_term_frequency_lookup(tf_df,
+                                                        "first_name")
+
+ +
+ +
+ +
+ + +

+ register_table(input_table, table_name, overwrite=False) + +

+ + +
+ +

Register a table to your backend database, to be used in one of the +splink methods, or simply to allow querying.

+

Tables can be of type: dictionary, record level dictionary, +pandas dataframe, pyarrow table and in the spark case, a spark df.

+ + +

Examples:

+
test_dict = {"a": [666,777,888],"b": [4,5,6]}
+linker.table_management.register_table(test_dict, "test_dict")
+linker.query_sql("select * from test_dict")
+
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
input_table + AcceptableInputTableType + +
+

The data you wish to register. This can be either a dictionary, +pandas dataframe, pyarrow table or a spark dataframe.

+
+
+ required +
table_name + str + +
+

The name you wish to assign to the table.

+
+
+ required +
overwrite + bool + +
+

Overwrite the table in the underlying database if it +exists

+
+
+ False +
+ + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
SplinkDataFrame + SplinkDataFrame + +
+

An abstraction representing the table created by the sql +pipeline

+
+
+ +
+ +
+ + + +
+ +
+ +
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/api_docs/training.html b/api_docs/training.html new file mode 100644 index 0000000000..063e2ab646 --- /dev/null +++ b/api_docs/training.html @@ -0,0 +1,5986 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Training - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + + + + +

Methods in Linker.training

+ + +
+ + + + +
+ + +

Estimate the parameters of the linkage model, accessed via +linker.training.

+ + + + +
+ + + + + + + + + +
+ + +

+ estimate_probability_two_random_records_match(deterministic_matching_rules, recall, max_rows_limit=int(1000000000.0)) + +

+ + +
+ +

Estimate the model parameter probability_two_random_records_match using +a direct estimation approach.

+

This method counts the number of matches found using deterministic rules and +divides by the total number of possible record comparisons. The recall of the +deterministic rules is used to adjust this proportion up to reflect missed +matches, providing an estimate of the probability that two random records from +the input data are a match.

+

Note that if more than one deterministic rule is provided, any duplicate +pairs are automatically removed, so you do not need to worry about double +counting.

+

See here +for discussion of methodology.

+ + +
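The estimation described above reduces to simple arithmetic: scale the deterministic match count up by the recall, then divide by the number of possible comparisons. A sketch with purely illustrative numbers:

```python
# Illustrative numbers only - not from any real dataset.
matches_found = 1_000           # pairs found by the deterministic rules
total_comparisons = 10_000_000  # possible record comparisons
recall = 0.8                    # assumed recall of the deterministic rules

estimated_true_matches = matches_found / recall  # 1250.0
probability_two_random_records_match = estimated_true_matches / total_comparisons

print(probability_two_random_records_match)  # 0.000125
```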

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
deterministic_matching_rules + list + +
+

A list of deterministic matching +rules designed to admit very few (preferably no) false positives.

+
+
+ required +
recall + float + +
+

An estimate of the recall the deterministic matching +rules will achieve, i.e., the proportion of all true matches these +rules will recover.

+
+
+ required +
max_rows_limit + int + +
+

Maximum number of rows to consider during estimation. +Defaults to 1e9.

+
+
+ int(1000000000.0) +
+ + +

Examples:

+
deterministic_rules = [
+    block_on("forename", "dob"),
+    "l.forename = r.forename and levenshtein(r.surname, l.surname) <= 2",
+    block_on("email")
+]
+linker.training.estimate_probability_two_random_records_match(
+    deterministic_rules, recall=0.8
+)
+
+ + +
+ +
+ +
+ + +

+ estimate_u_using_random_sampling(max_pairs=1000000.0, seed=None) + +

+ + +
+ +

Estimate the u parameters of the linkage model using random sampling.

+

The u parameters estimate the proportion of record comparisons that fall +into each comparison level amongst truly non-matching records.

+

This procedure takes a sample of the data and generates the cartesian +product of pairwise record comparisons amongst the sampled records. +The validity of the u values rests on the assumption that the resultant +pairwise comparisons are non-matches (or at least, they are very unlikely to be +matches). For large datasets, this is typically true.

+

The results of estimate_u_using_random_sampling, and therefore an entire splink +model, can be made reproducible by setting the seed parameter. Setting the seed +will have performance implications as additional processing is required.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
max_pairs + int + +
+

The maximum number of pairwise record comparisons to +sample. Larger values will give more accurate estimates but lead to longer +runtimes. In our experience at least 1e9 (one billion) gives best +results but can take a long time to compute. 1e7 (ten million) +is often adequate whilst testing different model specifications, before +the final model is estimated.

+
+
+ 1000000.0 +
seed + int + +
+

Seed for random sampling. Assign to get reproducible u +probabilities. Note, seed for random sampling is only supported for +DuckDB and Spark, for Athena and SQLite set to None.

+
+
+ None +
+ + +

Examples:

+
linker.training.estimate_u_using_random_sampling(max_pairs=1e8)
+
+ + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
Nothing + None + +
+

Updates the estimated u parameters within the linker object and +returns nothing.

+
+
+ +
+ +
+ +
+ + +

+ estimate_parameters_using_expectation_maximisation(blocking_rule, estimate_without_term_frequencies=False, fix_probability_two_random_records_match=False, fix_m_probabilities=False, fix_u_probabilities=True, populate_probability_two_random_records_match_from_trained_values=False) + +

+ + +
+ +

Estimate the parameters of the linkage model using expectation maximisation.

+

By default, the m probabilities are estimated, but not the u probabilities, +because good estimates for the u probabilities can be obtained from +linker.training.estimate_u_using_random_sampling(). You can change this by +setting fix_u_probabilities to False.

+

The blocking rule provided is used to generate pairwise record comparisons. +Usually, this should be a blocking rule that results in a dataframe where +matches are between about 1% and 99% of the blocked comparisons.

+

By default, m parameters are estimated for all comparisons except those which +are included in the blocking rule.

+

For example, if the blocking rule is block_on("first_name"), then +parameter estimates will be made for all comparisons except those which use +first_name in their sql_condition.

+

By default, the probability two random records match is allowed to vary +during EM estimation, but is not saved back to the model. See +this PR for +the rationale.

+ + +

Examples:

+

Default behaviour +

br_training = block_on("first_name", "dob")
+linker.training.estimate_parameters_using_expectation_maximisation(
+    br_training
+)
+
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
blocking_rule + BlockingRuleCreator | str + +
+

The blocking rule used to +generate pairwise record comparisons.

+
+
+ required +
estimate_without_term_frequencies + bool + +
+

If True, the iterations +of the EM algorithm ignore any term frequency adjustments and only +depend on the comparison vectors. This allows the EM algorithm to run +much faster, but the estimation of the parameters will change slightly.

+
+
+ False +
fix_probability_two_random_records_match + bool + +
+

If True, do not +update the probability two random records match after each iteration. +Defaults to False.

+
+
+ False +
fix_m_probabilities + bool + +
+

If True, do not update the m +probabilities after each iteration. Defaults to False.

+
+
+ False +
fix_u_probabilities + bool + +
+

If True, do not update the u +probabilities after each iteration. Defaults to True.

+
+
+ True +
populate_prob... + (bool, optional) + +
+

The full name of this parameter is +populate_probability_two_random_records_match_from_trained_values. If +True, derive this parameter from the blocked value. Defaults to False.

+
+
+ required +
+ + +

Examples:

+
blocking_rule = block_on("first_name", "surname")
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule
+)
+
+ + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
EMTrainingSession + EMTrainingSession + +
+

An object containing information about the training +session such as how parameters changed during the iteration history

+
+
+ +
+ +
+ +
+ + +

+ estimate_m_from_pairwise_labels(labels_splinkdataframe_or_table_name) + +

+ + +
+ +

Estimate the m probabilities of the linkage model from a dataframe of +pairwise labels.

+

The table of labels should be in the following format, and should +be registered with your database:

+ + + + + + + + + + + + + + + + + + + + + + + +
source_dataset_lunique_id_lsource_dataset_runique_id_r
df_11df_22
df_11df_23
+

Note that source_dataset and unique_id should correspond to the +values specified in the settings dict, and the input_table_aliases +passed to the linker object. Note that at the moment, this method does +not respect values in a clerical_match_score column. If provided, these +are ignored and it is assumed that every row in the table of labels is a score +of 1, i.e. a perfect match.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
labels_splinkdataframe_or_table_name + str + +
+

Name of table containing labels +in the database or SplinkDataframe

+
+
+ required +
+ + +

Examples:

+
pairwise_labels = pd.read_csv("./data/pairwise_labels_to_estimate_m.csv")
+
+linker.table_management.register_table(
+    pairwise_labels, "labels", overwrite=True
+)
+
+linker.training.estimate_m_from_pairwise_labels("labels")
+
+ +
+ +
+ +
+ + +

+ estimate_m_from_label_column(label_colname) + +

+ + +
+ +

Estimate the m parameters of the linkage model from a label (ground truth) +column in the input dataframe(s).

+

The m parameters represent the proportion of record comparisons that fall +into each comparison level amongst truly matching records.

+

The ground truth column is used to generate pairwise record comparisons +which are then assumed to be matches.

+

For example, if the entity being matched is persons, and your input dataset(s) +contain social security number, this could be used to estimate the m values +for the model.

+

Note that this column does not need to be fully populated. A common case is +where a unique identifier such as social security number is only partially +populated.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
label_colname + str + +
+

The name of the column containing the ground truth +label in the input data.

+
+
+ required +
+ + +

Examples:

+
linker.training.estimate_m_from_label_column("social_security_number")
+
+ + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
Nothing + None + +
+

Updates the estimated m parameters within the linker object.

+
+
+ +
+ +
+ + + +
+ +
+ +
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/api_docs/visualisations.html b/api_docs/visualisations.html new file mode 100644 index 0000000000..b06dd0e061 --- /dev/null +++ b/api_docs/visualisations.html @@ -0,0 +1,6420 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Visualisations - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + Skip to content + + +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + + + + +

Methods in Linker.visualisations

+ + +
+ + + + +
+ + +

Visualisations to help you understand and diagnose your linkage model. +Accessed via linker.visualisations.

+

Most of the visualisations return an altair.Chart +object, meaning it can be saved and manipulated using Altair.

+

For example:

+
altair_chart = linker.visualisations.match_weights_chart()
+
+# Save to various formats
+altair_chart.save("mychart.png")
+altair_chart.save("mychart.html")
+altair_chart.save("mychart.svg")
+altair_chart.save("mychart.json")
+
+# Get chart spec as dict
+altair_chart.to_dict()
+
+

To save the chart as a self-contained html file with all scripts +inlined so it can be viewed offline:

+
from splink.internals.charts import save_offline_chart
+c = linker.visualisations.match_weights_chart()
+save_offline_chart(c.to_dict(), "test_chart.html")
+
+

View resultant html file in Jupyter (or just load it in your browser)

+
from IPython.display import IFrame
+IFrame(src="./test_chart.html", width=1000, height=500)
+
+ + + + +
+ + + + + + + + + +
+ + +

+ match_weights_chart(as_dict=False) + +

+ + +
+ +

Display a chart of the (partial) match weights of the linkage model

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
as_dict + bool + +
+

If True, return the chart as a dictionary.

+
+
+ False +
+ + +

Examples:

+
altair_chart = linker.visualisations.match_weights_chart()
+altair_chart.save("mychart.png")
+
+ + +
+ +
+ +
+ + +

+ m_u_parameters_chart(as_dict=False) + +

+ + +
+ +

Display a chart of the m and u parameters of the linkage model

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
as_dict + bool + +
+

If True, return the chart as a dictionary.

+
+
+ False +
+ + +

Examples:

+
altair_chart = linker.visualisations.m_u_parameters_chart()
+altair_chart.save("mychart.png")
+
+ + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
altair_chart + ChartReturnType + +
+

An altair chart

+
+
+ +
+ +
+ +
+ + +

+ match_weights_histogram(df_predict, target_bins=30, width=600, height=250, as_dict=False) + +

+ + +
+ +

Generate a histogram that shows the distribution of match weights in +df_predict

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
df_predict + SplinkDataFrame + +
+

Output of linker.inference.predict()

+
+
+ required +
target_bins + int + +
+

Target number of bins in histogram. Defaults to +30.

+
+
+ 30 +
width + int + +
+

Width of output. Defaults to 600.

+
+
+ 600 +
height + int + +
+

Height of output chart. Defaults to 250.

+
+
+ 250 +
as_dict + bool + +
+

If True, return the chart as a dictionary.

+
+
+ False +
+ + +

Examples:

+
df_predict = linker.inference.predict(threshold_match_weight=-2)
+linker.visualisations.match_weights_histogram(df_predict)
+
+ + +
+ +
+ +
+ + +

+ parameter_estimate_comparisons_chart(include_m=True, include_u=False, as_dict=False) + +

+ + +
+ +

Show a chart that shows how parameter estimates have differed across +the different estimation methods you have used.

+

For example, if you have run two EM estimation sessions, blocking on +different variables, and both result in parameter estimates for +first_name, this chart will enable easy comparison of the different +estimates.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
include_m + bool + +
+

Show different estimates of m values. Defaults +to True.

+
+
+ True +
include_u + bool + +
+

Show different estimates of u values. Defaults +to False.

+
+
+ False +
as_dict + bool + +
+

If True, return the chart as a dictionary.

+
+
+ False +
+ + +

Examples:

+
linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule=block_on("first_name"),
+)
+
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule=block_on("surname"),
+)
+
+linker.visualisations.parameter_estimate_comparisons_chart()
+
+ + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
altair_chart + ChartReturnType + +
+

An Altair chart

+
+
+ +
+ +
+ +
+ + +

+ tf_adjustment_chart(output_column_name, n_most_freq=10, n_least_freq=10, vals_to_include=None, as_dict=False) + +

+ + +
+ +

Display a chart showing the impact of term frequency adjustments on a +specific comparison level.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
output_column_name + str + +
+

Name of an output column for which term frequency + adjustment has been applied.

+
+
+ required +
n_most_freq + int + +
+

Number of most frequent values to show. If this +or n_least_freq is set to None, all values will be shown. +Defaults to 10.

+
+
+ 10 +
n_least_freq + int + +
+

Number of least frequent values to show. If +this or n_most_freq is set to None, all values will be shown. +Defaults to 10.

+
+
+ 10 +
vals_to_include + list + +
+

Specific values for which to show term +frequency adjustments. +Defaults to None.

+
+
+ None +
as_dict + bool + +
+

If True, return the chart as a dictionary.

+
+
+ False +
+ + +

Examples:

+
linker.visualisations.tf_adjustment_chart("first_name")
+
+ + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
altair_chart + ChartReturnType + +
+

An Altair chart

+
+
+ +
+ +
+ +
+ + +

+ waterfall_chart(records, filter_nulls=True, remove_sensitive_data=False, as_dict=False) + +

+ + +
+ +

Visualise how the final match weight is computed for the provided pairwise +record comparisons.

+

Records must be provided as a list of dictionaries. This would usually be +obtained from df.as_record_dict(limit=n) where df is a SplinkDataFrame.

+ + +

Examples:

+
df = linker.inference.predict(threshold_match_weight=2)
+records = df.as_record_dict(limit=10)
+linker.visualisations.waterfall_chart(records)
+
+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
records + List[dict] + +
+

Usually obtained from df.as_record_dict(limit=n) +where df is a SplinkDataFrame.

+
+
+ required +
filter_nulls + bool + +
+

Whether the visualisation shows null +comparisons, which have no effect on final match weight. Defaults to +True.

+
+
+ True +
remove_sensitive_data + bool + +
+

When True, the waterfall chart will +contain match weights only, and all of the (potentially sensitive) data +from the input tables will be removed prior to the chart being created.

+
+
+ False +
as_dict + bool + +
+

If True, return the chart as a dictionary.

+
+
+ False +
+ + +

Returns:

+ + + + + + + + + + + + + +
Name TypeDescription
altair_chart + ChartReturnType + +
+

An Altair chart

+
+
+ +
+ +
+ +
+ + +
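To make the expected shape of records concrete, here is a minimal sketch: a list of dictionaries, one per pairwise comparison. The field names below are hypothetical stand-ins for whatever columns your as_record_dict(limit=n) output actually contains:

```python
# Hypothetical records, standing in for df.as_record_dict(limit=n).
# Real records contain many more fields (comparison vector values,
# term frequencies, left/right columns, etc.).
records = [
    {"match_weight": 7.2, "first_name_l": "john", "first_name_r": "jon"},
    {"match_weight": -3.1, "first_name_l": "john", "first_name_r": "mary"},
]

# Any ordinary list filtering works before charting, e.g. keeping only
# comparisons with positive match weight:
likely = [r for r in records if r["match_weight"] > 0]
```

Filtering like this is useful when you want the waterfall chart to focus on borderline or high-scoring pairs only.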

+ comparison_viewer_dashboard(df_predict, out_path, overwrite=False, num_example_rows=2, return_html_as_string=False) + +

+ + +
+ +

Generate an interactive html visualisation of the linker's predictions and +save to out_path. For more information see +this video

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
df_predict + SplinkDataFrame + +
+

The outputs of linker.inference.predict()

+
+
+ required +
out_path + str + +
+

The path (including filename) to save the html file to.

+
+
+ required +
overwrite + bool + +
+

Overwrite the html file if it already exists? +Defaults to False.

+
+
+ False +
num_example_rows + int + +
+

Number of example rows per comparison +vector. Defaults to 2.

+
+
+ 2 +
return_html_as_string + bool + +
+

If True, return the html as a string

+
+
+ False +
+ + +

Examples:

+
df_predictions = linker.inference.predict()
+linker.visualisations.comparison_viewer_dashboard(
+    df_predictions, "scv.html", True, 2
+)
+
+

Optionally, in Jupyter, you can display the results inline. +Otherwise you can just load the html file in your browser.

+
from IPython.display import IFrame
+IFrame(src="./scv.html", width="100%", height=1200)
+
+ +
+ +
+ +
+ + +
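If return_html_as_string=True, the dashboard html comes back as a plain string, which can then be written or embedded wherever is convenient. A sketch with a placeholder string standing in for the real return value:

```python
from pathlib import Path

# html_str is a hypothetical placeholder for the string returned when
# return_html_as_string=True; the real value is the full dashboard page.
html_str = "<html><body>comparison viewer dashboard</body></html>"

# For example, write a copy of the dashboard to a second location:
Path("dashboard_copy.html").write_text(html_str)
```

This is useful when the dashboard needs to be served from an application or stored somewhere other than a local file path.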

+ cluster_studio_dashboard(df_predict, df_clustered, out_path, sampling_method='random', sample_size=10, cluster_ids=None, cluster_names=None, overwrite=False, return_html_as_string=False, _df_cluster_metrics=None) + +

+ + +
+ +

Generate an interactive html visualisation of the predicted clusters and +save to out_path.

+ + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
df_predict + SplinkDataFrame + +
+

The outputs of linker.inference.predict()

+
+
+ required +
df_clustered + SplinkDataFrame + +
+

The outputs of +linker.clustering.cluster_pairwise_predictions_at_threshold()

+
+
+ required +
out_path + str + +
+

The path (including filename) to save the html file to.

+
+
+ required +
sampling_method + str + +
+

random, by_cluster_size or +lowest_density_clusters. Defaults to random.

+
+
+ 'random' +
sample_size + int + +
+

Number of clusters to show in the dashboard. +Defaults to 10.

+
+
+ 10 +
cluster_ids + list + +
+

The IDs of the clusters that will be displayed in the +dashboard. If provided, the sampling_method and sample_size +arguments are ignored. Defaults to None.

+
+
+ None +
overwrite + bool + +
+

Overwrite the html file if it already exists? +Defaults to False.

+
+
+ False +
cluster_names + list + +
+

If provided, the dashboard will display +these names in the selection box. Only works in conjunction with +cluster_ids. Defaults to None.

+
+
+ None +
return_html_as_string + bool + +
+

If True, return the html as a string

+
+
+ False +
+ + +

Examples:

+
df_p = linker.inference.predict()
+df_c = linker.clustering.cluster_pairwise_predictions_at_threshold(
+    df_p, 0.5
+)
+
+linker.visualisations.cluster_studio_dashboard(
+    df_p, df_c, "cluster_studio.html", cluster_ids=[0, 4, 7]
+)
+
+

Optionally, in Jupyter, you can display the results inline. +Otherwise you can just load the html file in your browser.

+
from IPython.display import IFrame
+IFrame(src="./cluster_studio.html", width="100%", height=1200)
+
+ +
+ +
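The built-in sampling options can also be reproduced explicitly via cluster_ids. For example, sampling_method="by_cluster_size" favours large clusters; the equivalent manual selection, sketched with toy data (the membership mapping below is made up for illustration):

```python
from collections import Counter

# Toy cluster membership: record_id -> cluster_id, a stand-in for the
# output of cluster_pairwise_predictions_at_threshold.
cluster_of = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "c"}

sizes = Counter(cluster_of.values())

# Pick the largest clusters explicitly, then pass them as cluster_ids.
sample_size = 2
cluster_ids = [cid for cid, _ in sizes.most_common(sample_size)]
```

Selecting cluster_ids yourself in this way gives full control over which clusters appear in the dashboard, whereas the sampling_method argument delegates that choice to Splink.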
+ + + +
+ +
+ +
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/assets/_mkdocstrings.css b/assets/_mkdocstrings.css new file mode 100644 index 0000000000..85449ec798 --- /dev/null +++ b/assets/_mkdocstrings.css @@ -0,0 +1,119 @@ + +/* Avoid breaking parameter names, etc. in table cells. */ +.doc-contents td code { + word-break: normal !important; +} + +/* No line break before first paragraph of descriptions. */ +.doc-md-description, +.doc-md-description>p:first-child { + display: inline; +} + +/* Max width for docstring sections tables. */ +.doc .md-typeset__table, +.doc .md-typeset__table table { + display: table !important; + width: 100%; +} + +.doc .md-typeset__table tr { + display: table-row; +} + +/* Defaults in Spacy table style. */ +.doc-param-default { + float: right; +} + +/* Backward-compatibility: docstring section titles in bold. */ +.doc-section-title { + font-weight: bold; +} + +/* Symbols in Navigation and ToC. */ +:root, +[data-md-color-scheme="default"] { + --doc-symbol-attribute-fg-color: #953800; + --doc-symbol-function-fg-color: #8250df; + --doc-symbol-method-fg-color: #8250df; + --doc-symbol-class-fg-color: #0550ae; + --doc-symbol-module-fg-color: #5cad0f; + + --doc-symbol-attribute-bg-color: #9538001a; + --doc-symbol-function-bg-color: #8250df1a; + --doc-symbol-method-bg-color: #8250df1a; + --doc-symbol-class-bg-color: #0550ae1a; + --doc-symbol-module-bg-color: #5cad0f1a; +} + +[data-md-color-scheme="slate"] { + --doc-symbol-attribute-fg-color: #ffa657; + --doc-symbol-function-fg-color: #d2a8ff; + --doc-symbol-method-fg-color: #d2a8ff; + --doc-symbol-class-fg-color: #79c0ff; + --doc-symbol-module-fg-color: #baff79; + + --doc-symbol-attribute-bg-color: #ffa6571a; + --doc-symbol-function-bg-color: #d2a8ff1a; + --doc-symbol-method-bg-color: #d2a8ff1a; + --doc-symbol-class-bg-color: #79c0ff1a; + --doc-symbol-module-bg-color: #baff791a; +} + +code.doc-symbol { + border-radius: .1rem; + font-size: .85em; + padding: 0 .3em; + 
font-weight: bold; +} + +code.doc-symbol-attribute { + color: var(--doc-symbol-attribute-fg-color); + background-color: var(--doc-symbol-attribute-bg-color); +} + +code.doc-symbol-attribute::after { + content: "attr"; +} + +code.doc-symbol-function { + color: var(--doc-symbol-function-fg-color); + background-color: var(--doc-symbol-function-bg-color); +} + +code.doc-symbol-function::after { + content: "func"; +} + +code.doc-symbol-method { + color: var(--doc-symbol-method-fg-color); + background-color: var(--doc-symbol-method-bg-color); +} + +code.doc-symbol-method::after { + content: "meth"; +} + +code.doc-symbol-class { + color: var(--doc-symbol-class-fg-color); + background-color: var(--doc-symbol-class-bg-color); +} + +code.doc-symbol-class::after { + content: "class"; +} + +code.doc-symbol-module { + color: var(--doc-symbol-module-fg-color); + background-color: var(--doc-symbol-module-bg-color); +} + +code.doc-symbol-module::after { + content: "mod"; +} + +.doc-signature .autorefs { + color: inherit; + border-bottom: 1px dotted currentcolor; +} diff --git a/assets/images/favicon.png b/assets/images/favicon.png new file mode 100644 index 0000000000..1cf13b9f9d Binary files /dev/null and b/assets/images/favicon.png differ diff --git a/assets/javascripts/bundle.d7c377c4.min.js b/assets/javascripts/bundle.d7c377c4.min.js new file mode 100644 index 0000000000..6a0bcf8803 --- /dev/null +++ b/assets/javascripts/bundle.d7c377c4.min.js @@ -0,0 +1,29 @@ +"use strict";(()=>{var Mi=Object.create;var gr=Object.defineProperty;var Li=Object.getOwnPropertyDescriptor;var _i=Object.getOwnPropertyNames,Ft=Object.getOwnPropertySymbols,Ai=Object.getPrototypeOf,xr=Object.prototype.hasOwnProperty,ro=Object.prototype.propertyIsEnumerable;var to=(e,t,r)=>t in e?gr(e,t,{enumerable:!0,configurable:!0,writable:!0,value:r}):e[t]=r,P=(e,t)=>{for(var r in t||(t={}))xr.call(t,r)&&to(e,r,t[r]);if(Ft)for(var r of Ft(t))ro.call(t,r)&&to(e,r,t[r]);return e};var oo=(e,t)=>{var r={};for(var o in 
e)xr.call(e,o)&&t.indexOf(o)<0&&(r[o]=e[o]);if(e!=null&&Ft)for(var o of Ft(e))t.indexOf(o)<0&&ro.call(e,o)&&(r[o]=e[o]);return r};var yr=(e,t)=>()=>(t||e((t={exports:{}}).exports,t),t.exports);var Ci=(e,t,r,o)=>{if(t&&typeof t=="object"||typeof t=="function")for(let n of _i(t))!xr.call(e,n)&&n!==r&&gr(e,n,{get:()=>t[n],enumerable:!(o=Li(t,n))||o.enumerable});return e};var jt=(e,t,r)=>(r=e!=null?Mi(Ai(e)):{},Ci(t||!e||!e.__esModule?gr(r,"default",{value:e,enumerable:!0}):r,e));var no=(e,t,r)=>new Promise((o,n)=>{var i=c=>{try{a(r.next(c))}catch(p){n(p)}},s=c=>{try{a(r.throw(c))}catch(p){n(p)}},a=c=>c.done?o(c.value):Promise.resolve(c.value).then(i,s);a((r=r.apply(e,t)).next())});var ao=yr((Er,io)=>{(function(e,t){typeof Er=="object"&&typeof io!="undefined"?t():typeof define=="function"&&define.amd?define(t):t()})(Er,function(){"use strict";function e(r){var o=!0,n=!1,i=null,s={text:!0,search:!0,url:!0,tel:!0,email:!0,password:!0,number:!0,date:!0,month:!0,week:!0,time:!0,datetime:!0,"datetime-local":!0};function a(C){return!!(C&&C!==document&&C.nodeName!=="HTML"&&C.nodeName!=="BODY"&&"classList"in C&&"contains"in C.classList)}function c(C){var ct=C.type,Ve=C.tagName;return!!(Ve==="INPUT"&&s[ct]&&!C.readOnly||Ve==="TEXTAREA"&&!C.readOnly||C.isContentEditable)}function p(C){C.classList.contains("focus-visible")||(C.classList.add("focus-visible"),C.setAttribute("data-focus-visible-added",""))}function l(C){C.hasAttribute("data-focus-visible-added")&&(C.classList.remove("focus-visible"),C.removeAttribute("data-focus-visible-added"))}function f(C){C.metaKey||C.altKey||C.ctrlKey||(a(r.activeElement)&&p(r.activeElement),o=!0)}function u(C){o=!1}function d(C){a(C.target)&&(o||c(C.target))&&p(C.target)}function y(C){a(C.target)&&(C.target.classList.contains("focus-visible")||C.target.hasAttribute("data-focus-visible-added"))&&(n=!0,window.clearTimeout(i),i=window.setTimeout(function(){n=!1},100),l(C.target))}function 
b(C){document.visibilityState==="hidden"&&(n&&(o=!0),D())}function D(){document.addEventListener("mousemove",J),document.addEventListener("mousedown",J),document.addEventListener("mouseup",J),document.addEventListener("pointermove",J),document.addEventListener("pointerdown",J),document.addEventListener("pointerup",J),document.addEventListener("touchmove",J),document.addEventListener("touchstart",J),document.addEventListener("touchend",J)}function Q(){document.removeEventListener("mousemove",J),document.removeEventListener("mousedown",J),document.removeEventListener("mouseup",J),document.removeEventListener("pointermove",J),document.removeEventListener("pointerdown",J),document.removeEventListener("pointerup",J),document.removeEventListener("touchmove",J),document.removeEventListener("touchstart",J),document.removeEventListener("touchend",J)}function J(C){C.target.nodeName&&C.target.nodeName.toLowerCase()==="html"||(o=!1,Q())}document.addEventListener("keydown",f,!0),document.addEventListener("mousedown",u,!0),document.addEventListener("pointerdown",u,!0),document.addEventListener("touchstart",u,!0),document.addEventListener("visibilitychange",b,!0),D(),r.addEventListener("focus",d,!0),r.addEventListener("blur",y,!0),r.nodeType===Node.DOCUMENT_FRAGMENT_NODE&&r.host?r.host.setAttribute("data-js-focus-visible",""):r.nodeType===Node.DOCUMENT_NODE&&(document.documentElement.classList.add("js-focus-visible"),document.documentElement.setAttribute("data-js-focus-visible",""))}if(typeof window!="undefined"&&typeof document!="undefined"){window.applyFocusVisiblePolyfill=e;var t;try{t=new CustomEvent("focus-visible-polyfill-ready")}catch(r){t=document.createEvent("CustomEvent"),t.initCustomEvent("focus-visible-polyfill-ready",!1,!1,{})}window.dispatchEvent(t)}typeof document!="undefined"&&e(document)})});var Kr=yr((kt,qr)=>{/*! 
+ * clipboard.js v2.0.11 + * https://clipboardjs.com/ + * + * Licensed MIT © Zeno Rocha + */(function(t,r){typeof kt=="object"&&typeof qr=="object"?qr.exports=r():typeof define=="function"&&define.amd?define([],r):typeof kt=="object"?kt.ClipboardJS=r():t.ClipboardJS=r()})(kt,function(){return function(){var e={686:function(o,n,i){"use strict";i.d(n,{default:function(){return Oi}});var s=i(279),a=i.n(s),c=i(370),p=i.n(c),l=i(817),f=i.n(l);function u(V){try{return document.execCommand(V)}catch(_){return!1}}var d=function(_){var O=f()(_);return u("cut"),O},y=d;function b(V){var _=document.documentElement.getAttribute("dir")==="rtl",O=document.createElement("textarea");O.style.fontSize="12pt",O.style.border="0",O.style.padding="0",O.style.margin="0",O.style.position="absolute",O.style[_?"right":"left"]="-9999px";var $=window.pageYOffset||document.documentElement.scrollTop;return O.style.top="".concat($,"px"),O.setAttribute("readonly",""),O.value=V,O}var D=function(_,O){var $=b(_);O.container.appendChild($);var N=f()($);return u("copy"),$.remove(),N},Q=function(_){var O=arguments.length>1&&arguments[1]!==void 0?arguments[1]:{container:document.body},$="";return typeof _=="string"?$=D(_,O):_ instanceof HTMLInputElement&&!["text","search","url","tel","password"].includes(_==null?void 0:_.type)?$=D(_.value,O):($=f()(_),u("copy")),$},J=Q;function C(V){"@babel/helpers - typeof";return typeof Symbol=="function"&&typeof Symbol.iterator=="symbol"?C=function(O){return typeof O}:C=function(O){return O&&typeof Symbol=="function"&&O.constructor===Symbol&&O!==Symbol.prototype?"symbol":typeof O},C(V)}var ct=function(){var _=arguments.length>0&&arguments[0]!==void 0?arguments[0]:{},O=_.action,$=O===void 0?"copy":O,N=_.container,Y=_.target,ke=_.text;if($!=="copy"&&$!=="cut")throw new Error('Invalid "action" value, use either "copy" or "cut"');if(Y!==void 0)if(Y&&C(Y)==="object"&&Y.nodeType===1){if($==="copy"&&Y.hasAttribute("disabled"))throw new Error('Invalid "target" attribute. 
Please use "readonly" instead of "disabled" attribute');if($==="cut"&&(Y.hasAttribute("readonly")||Y.hasAttribute("disabled")))throw new Error(`Invalid "target" attribute. You can't cut text from elements with "readonly" or "disabled" attributes`)}else throw new Error('Invalid "target" value, use a valid Element');if(ke)return J(ke,{container:N});if(Y)return $==="cut"?y(Y):J(Y,{container:N})},Ve=ct;function Fe(V){"@babel/helpers - typeof";return typeof Symbol=="function"&&typeof Symbol.iterator=="symbol"?Fe=function(O){return typeof O}:Fe=function(O){return O&&typeof Symbol=="function"&&O.constructor===Symbol&&O!==Symbol.prototype?"symbol":typeof O},Fe(V)}function vi(V,_){if(!(V instanceof _))throw new TypeError("Cannot call a class as a function")}function eo(V,_){for(var O=0;O<_.length;O++){var $=_[O];$.enumerable=$.enumerable||!1,$.configurable=!0,"value"in $&&($.writable=!0),Object.defineProperty(V,$.key,$)}}function gi(V,_,O){return _&&eo(V.prototype,_),O&&eo(V,O),V}function xi(V,_){if(typeof _!="function"&&_!==null)throw new TypeError("Super expression must either be null or a function");V.prototype=Object.create(_&&_.prototype,{constructor:{value:V,writable:!0,configurable:!0}}),_&&br(V,_)}function br(V,_){return br=Object.setPrototypeOf||function($,N){return $.__proto__=N,$},br(V,_)}function yi(V){var _=Ti();return function(){var $=Rt(V),N;if(_){var Y=Rt(this).constructor;N=Reflect.construct($,arguments,Y)}else N=$.apply(this,arguments);return Ei(this,N)}}function Ei(V,_){return _&&(Fe(_)==="object"||typeof _=="function")?_:wi(V)}function wi(V){if(V===void 0)throw new ReferenceError("this hasn't been initialised - super() hasn't been called");return V}function Ti(){if(typeof Reflect=="undefined"||!Reflect.construct||Reflect.construct.sham)return!1;if(typeof Proxy=="function")return!0;try{return Date.prototype.toString.call(Reflect.construct(Date,[],function(){})),!0}catch(V){return!1}}function Rt(V){return 
Rt=Object.setPrototypeOf?Object.getPrototypeOf:function(O){return O.__proto__||Object.getPrototypeOf(O)},Rt(V)}function vr(V,_){var O="data-clipboard-".concat(V);if(_.hasAttribute(O))return _.getAttribute(O)}var Si=function(V){xi(O,V);var _=yi(O);function O($,N){var Y;return vi(this,O),Y=_.call(this),Y.resolveOptions(N),Y.listenClick($),Y}return gi(O,[{key:"resolveOptions",value:function(){var N=arguments.length>0&&arguments[0]!==void 0?arguments[0]:{};this.action=typeof N.action=="function"?N.action:this.defaultAction,this.target=typeof N.target=="function"?N.target:this.defaultTarget,this.text=typeof N.text=="function"?N.text:this.defaultText,this.container=Fe(N.container)==="object"?N.container:document.body}},{key:"listenClick",value:function(N){var Y=this;this.listener=p()(N,"click",function(ke){return Y.onClick(ke)})}},{key:"onClick",value:function(N){var Y=N.delegateTarget||N.currentTarget,ke=this.action(Y)||"copy",It=Ve({action:ke,container:this.container,target:this.target(Y),text:this.text(Y)});this.emit(It?"success":"error",{action:ke,text:It,trigger:Y,clearSelection:function(){Y&&Y.focus(),window.getSelection().removeAllRanges()}})}},{key:"defaultAction",value:function(N){return vr("action",N)}},{key:"defaultTarget",value:function(N){var Y=vr("target",N);if(Y)return document.querySelector(Y)}},{key:"defaultText",value:function(N){return vr("text",N)}},{key:"destroy",value:function(){this.listener.destroy()}}],[{key:"copy",value:function(N){var Y=arguments.length>1&&arguments[1]!==void 0?arguments[1]:{container:document.body};return J(N,Y)}},{key:"cut",value:function(N){return y(N)}},{key:"isSupported",value:function(){var N=arguments.length>0&&arguments[0]!==void 0?arguments[0]:["copy","cut"],Y=typeof N=="string"?[N]:N,ke=!!document.queryCommandSupported;return Y.forEach(function(It){ke=ke&&!!document.queryCommandSupported(It)}),ke}}]),O}(a()),Oi=Si},828:function(o){var n=9;if(typeof Element!="undefined"&&!Element.prototype.matches){var 
i=Element.prototype;i.matches=i.matchesSelector||i.mozMatchesSelector||i.msMatchesSelector||i.oMatchesSelector||i.webkitMatchesSelector}function s(a,c){for(;a&&a.nodeType!==n;){if(typeof a.matches=="function"&&a.matches(c))return a;a=a.parentNode}}o.exports=s},438:function(o,n,i){var s=i(828);function a(l,f,u,d,y){var b=p.apply(this,arguments);return l.addEventListener(u,b,y),{destroy:function(){l.removeEventListener(u,b,y)}}}function c(l,f,u,d,y){return typeof l.addEventListener=="function"?a.apply(null,arguments):typeof u=="function"?a.bind(null,document).apply(null,arguments):(typeof l=="string"&&(l=document.querySelectorAll(l)),Array.prototype.map.call(l,function(b){return a(b,f,u,d,y)}))}function p(l,f,u,d){return function(y){y.delegateTarget=s(y.target,f),y.delegateTarget&&d.call(l,y)}}o.exports=c},879:function(o,n){n.node=function(i){return i!==void 0&&i instanceof HTMLElement&&i.nodeType===1},n.nodeList=function(i){var s=Object.prototype.toString.call(i);return i!==void 0&&(s==="[object NodeList]"||s==="[object HTMLCollection]")&&"length"in i&&(i.length===0||n.node(i[0]))},n.string=function(i){return typeof i=="string"||i instanceof String},n.fn=function(i){var s=Object.prototype.toString.call(i);return s==="[object Function]"}},370:function(o,n,i){var s=i(879),a=i(438);function c(u,d,y){if(!u&&!d&&!y)throw new Error("Missing required arguments");if(!s.string(d))throw new TypeError("Second argument must be a String");if(!s.fn(y))throw new TypeError("Third argument must be a Function");if(s.node(u))return p(u,d,y);if(s.nodeList(u))return l(u,d,y);if(s.string(u))return f(u,d,y);throw new TypeError("First argument must be a String, HTMLElement, HTMLCollection, or NodeList")}function p(u,d,y){return u.addEventListener(d,y),{destroy:function(){u.removeEventListener(d,y)}}}function l(u,d,y){return 
Array.prototype.forEach.call(u,function(b){b.addEventListener(d,y)}),{destroy:function(){Array.prototype.forEach.call(u,function(b){b.removeEventListener(d,y)})}}}function f(u,d,y){return a(document.body,u,d,y)}o.exports=c},817:function(o){function n(i){var s;if(i.nodeName==="SELECT")i.focus(),s=i.value;else if(i.nodeName==="INPUT"||i.nodeName==="TEXTAREA"){var a=i.hasAttribute("readonly");a||i.setAttribute("readonly",""),i.select(),i.setSelectionRange(0,i.value.length),a||i.removeAttribute("readonly"),s=i.value}else{i.hasAttribute("contenteditable")&&i.focus();var c=window.getSelection(),p=document.createRange();p.selectNodeContents(i),c.removeAllRanges(),c.addRange(p),s=c.toString()}return s}o.exports=n},279:function(o){function n(){}n.prototype={on:function(i,s,a){var c=this.e||(this.e={});return(c[i]||(c[i]=[])).push({fn:s,ctx:a}),this},once:function(i,s,a){var c=this;function p(){c.off(i,p),s.apply(a,arguments)}return p._=s,this.on(i,p,a)},emit:function(i){var s=[].slice.call(arguments,1),a=((this.e||(this.e={}))[i]||[]).slice(),c=0,p=a.length;for(c;c{"use strict";/*! 
+ * escape-html + * Copyright(c) 2012-2013 TJ Holowaychuk + * Copyright(c) 2015 Andreas Lubbe + * Copyright(c) 2015 Tiancheng "Timothy" Gu + * MIT Licensed + */var Wa=/["'&<>]/;Vn.exports=Ua;function Ua(e){var t=""+e,r=Wa.exec(t);if(!r)return t;var o,n="",i=0,s=0;for(i=r.index;i0&&i[i.length-1])&&(p[0]===6||p[0]===2)){r=0;continue}if(p[0]===3&&(!i||p[1]>i[0]&&p[1]=e.length&&(e=void 0),{value:e&&e[o++],done:!e}}};throw new TypeError(t?"Object is not iterable.":"Symbol.iterator is not defined.")}function z(e,t){var r=typeof Symbol=="function"&&e[Symbol.iterator];if(!r)return e;var o=r.call(e),n,i=[],s;try{for(;(t===void 0||t-- >0)&&!(n=o.next()).done;)i.push(n.value)}catch(a){s={error:a}}finally{try{n&&!n.done&&(r=o.return)&&r.call(o)}finally{if(s)throw s.error}}return i}function K(e,t,r){if(r||arguments.length===2)for(var o=0,n=t.length,i;o1||a(u,d)})})}function a(u,d){try{c(o[u](d))}catch(y){f(i[0][3],y)}}function c(u){u.value instanceof ot?Promise.resolve(u.value.v).then(p,l):f(i[0][2],u)}function p(u){a("next",u)}function l(u){a("throw",u)}function f(u,d){u(d),i.shift(),i.length&&a(i[0][0],i[0][1])}}function po(e){if(!Symbol.asyncIterator)throw new TypeError("Symbol.asyncIterator is not defined.");var t=e[Symbol.asyncIterator],r;return t?t.call(e):(e=typeof be=="function"?be(e):e[Symbol.iterator](),r={},o("next"),o("throw"),o("return"),r[Symbol.asyncIterator]=function(){return this},r);function o(i){r[i]=e[i]&&function(s){return new Promise(function(a,c){s=e[i](s),n(a,c,s.done,s.value)})}}function n(i,s,a,c){Promise.resolve(c).then(function(p){i({value:p,done:a})},s)}}function k(e){return typeof e=="function"}function pt(e){var t=function(o){Error.call(o),o.stack=new Error().stack},r=e(t);return r.prototype=Object.create(Error.prototype),r.prototype.constructor=r,r}var Ut=pt(function(e){return function(r){e(this),this.message=r?r.length+` errors occurred during unsubscription: +`+r.map(function(o,n){return n+1+") "+o.toString()}).join(` + 
`):"",this.name="UnsubscriptionError",this.errors=r}});function ze(e,t){if(e){var r=e.indexOf(t);0<=r&&e.splice(r,1)}}var je=function(){function e(t){this.initialTeardown=t,this.closed=!1,this._parentage=null,this._finalizers=null}return e.prototype.unsubscribe=function(){var t,r,o,n,i;if(!this.closed){this.closed=!0;var s=this._parentage;if(s)if(this._parentage=null,Array.isArray(s))try{for(var a=be(s),c=a.next();!c.done;c=a.next()){var p=c.value;p.remove(this)}}catch(b){t={error:b}}finally{try{c&&!c.done&&(r=a.return)&&r.call(a)}finally{if(t)throw t.error}}else s.remove(this);var l=this.initialTeardown;if(k(l))try{l()}catch(b){i=b instanceof Ut?b.errors:[b]}var f=this._finalizers;if(f){this._finalizers=null;try{for(var u=be(f),d=u.next();!d.done;d=u.next()){var y=d.value;try{lo(y)}catch(b){i=i!=null?i:[],b instanceof Ut?i=K(K([],z(i)),z(b.errors)):i.push(b)}}}catch(b){o={error:b}}finally{try{d&&!d.done&&(n=u.return)&&n.call(u)}finally{if(o)throw o.error}}}if(i)throw new Ut(i)}},e.prototype.add=function(t){var r;if(t&&t!==this)if(this.closed)lo(t);else{if(t instanceof e){if(t.closed||t._hasParent(this))return;t._addParent(this)}(this._finalizers=(r=this._finalizers)!==null&&r!==void 0?r:[]).push(t)}},e.prototype._hasParent=function(t){var r=this._parentage;return r===t||Array.isArray(r)&&r.includes(t)},e.prototype._addParent=function(t){var r=this._parentage;this._parentage=Array.isArray(r)?(r.push(t),r):r?[r,t]:t},e.prototype._removeParent=function(t){var r=this._parentage;r===t?this._parentage=null:Array.isArray(r)&&ze(r,t)},e.prototype.remove=function(t){var r=this._finalizers;r&&ze(r,t),t instanceof e&&t._removeParent(this)},e.EMPTY=function(){var t=new e;return t.closed=!0,t}(),e}();var Tr=je.EMPTY;function Nt(e){return e instanceof je||e&&"closed"in e&&k(e.remove)&&k(e.add)&&k(e.unsubscribe)}function lo(e){k(e)?e():e.unsubscribe()}var He={onUnhandledError:null,onStoppedNotification:null,Promise:void 
0,useDeprecatedSynchronousErrorHandling:!1,useDeprecatedNextContext:!1};var lt={setTimeout:function(e,t){for(var r=[],o=2;o0},enumerable:!1,configurable:!0}),t.prototype._trySubscribe=function(r){return this._throwIfClosed(),e.prototype._trySubscribe.call(this,r)},t.prototype._subscribe=function(r){return this._throwIfClosed(),this._checkFinalizedStatuses(r),this._innerSubscribe(r)},t.prototype._innerSubscribe=function(r){var o=this,n=this,i=n.hasError,s=n.isStopped,a=n.observers;return i||s?Tr:(this.currentObservers=null,a.push(r),new je(function(){o.currentObservers=null,ze(a,r)}))},t.prototype._checkFinalizedStatuses=function(r){var o=this,n=o.hasError,i=o.thrownError,s=o.isStopped;n?r.error(i):s&&r.complete()},t.prototype.asObservable=function(){var r=new I;return r.source=this,r},t.create=function(r,o){return new xo(r,o)},t}(I);var xo=function(e){se(t,e);function t(r,o){var n=e.call(this)||this;return n.destination=r,n.source=o,n}return t.prototype.next=function(r){var o,n;(n=(o=this.destination)===null||o===void 0?void 0:o.next)===null||n===void 0||n.call(o,r)},t.prototype.error=function(r){var o,n;(n=(o=this.destination)===null||o===void 0?void 0:o.error)===null||n===void 0||n.call(o,r)},t.prototype.complete=function(){var r,o;(o=(r=this.destination)===null||r===void 0?void 0:r.complete)===null||o===void 0||o.call(r)},t.prototype._subscribe=function(r){var o,n;return(n=(o=this.source)===null||o===void 0?void 0:o.subscribe(r))!==null&&n!==void 0?n:Tr},t}(x);var St={now:function(){return(St.delegate||Date).now()},delegate:void 0};var Ot=function(e){se(t,e);function t(r,o,n){r===void 0&&(r=1/0),o===void 0&&(o=1/0),n===void 0&&(n=St);var i=e.call(this)||this;return i._bufferSize=r,i._windowTime=o,i._timestampProvider=n,i._buffer=[],i._infiniteTimeWindow=!0,i._infiniteTimeWindow=o===1/0,i._bufferSize=Math.max(1,r),i._windowTime=Math.max(1,o),i}return t.prototype.next=function(r){var 
o=this,n=o.isStopped,i=o._buffer,s=o._infiniteTimeWindow,a=o._timestampProvider,c=o._windowTime;n||(i.push(r),!s&&i.push(a.now()+c)),this._trimBuffer(),e.prototype.next.call(this,r)},t.prototype._subscribe=function(r){this._throwIfClosed(),this._trimBuffer();for(var o=this._innerSubscribe(r),n=this,i=n._infiniteTimeWindow,s=n._buffer,a=s.slice(),c=0;c0?e.prototype.requestAsyncId.call(this,r,o,n):(r.actions.push(this),r._scheduled||(r._scheduled=ut.requestAnimationFrame(function(){return r.flush(void 0)})))},t.prototype.recycleAsyncId=function(r,o,n){var i;if(n===void 0&&(n=0),n!=null?n>0:this.delay>0)return e.prototype.recycleAsyncId.call(this,r,o,n);var s=r.actions;o!=null&&((i=s[s.length-1])===null||i===void 0?void 0:i.id)!==o&&(ut.cancelAnimationFrame(o),r._scheduled=void 0)},t}(zt);var wo=function(e){se(t,e);function t(){return e!==null&&e.apply(this,arguments)||this}return t.prototype.flush=function(r){this._active=!0;var o=this._scheduled;this._scheduled=void 0;var n=this.actions,i;r=r||n.shift();do if(i=r.execute(r.state,r.delay))break;while((r=n[0])&&r.id===o&&n.shift());if(this._active=!1,i){for(;(r=n[0])&&r.id===o&&n.shift();)r.unsubscribe();throw i}},t}(qt);var ge=new wo(Eo);var M=new I(function(e){return e.complete()});function Kt(e){return e&&k(e.schedule)}function Cr(e){return e[e.length-1]}function Ge(e){return k(Cr(e))?e.pop():void 0}function Ae(e){return Kt(Cr(e))?e.pop():void 0}function Qt(e,t){return typeof Cr(e)=="number"?e.pop():t}var dt=function(e){return e&&typeof e.length=="number"&&typeof e!="function"};function Yt(e){return k(e==null?void 0:e.then)}function Bt(e){return k(e[ft])}function Gt(e){return Symbol.asyncIterator&&k(e==null?void 0:e[Symbol.asyncIterator])}function Jt(e){return new TypeError("You provided "+(e!==null&&typeof e=="object"?"an invalid object":"'"+e+"'")+" where a stream was expected. 
You can provide an Observable, Promise, ReadableStream, Array, AsyncIterable, or Iterable.")}function Wi(){return typeof Symbol!="function"||!Symbol.iterator?"@@iterator":Symbol.iterator}var Xt=Wi();function Zt(e){return k(e==null?void 0:e[Xt])}function er(e){return co(this,arguments,function(){var r,o,n,i;return Wt(this,function(s){switch(s.label){case 0:r=e.getReader(),s.label=1;case 1:s.trys.push([1,,9,10]),s.label=2;case 2:return[4,ot(r.read())];case 3:return o=s.sent(),n=o.value,i=o.done,i?[4,ot(void 0)]:[3,5];case 4:return[2,s.sent()];case 5:return[4,ot(n)];case 6:return[4,s.sent()];case 7:return s.sent(),[3,2];case 8:return[3,10];case 9:return r.releaseLock(),[7];case 10:return[2]}})})}function tr(e){return k(e==null?void 0:e.getReader)}function F(e){if(e instanceof I)return e;if(e!=null){if(Bt(e))return Ui(e);if(dt(e))return Ni(e);if(Yt(e))return Di(e);if(Gt(e))return To(e);if(Zt(e))return Vi(e);if(tr(e))return zi(e)}throw Jt(e)}function Ui(e){return new I(function(t){var r=e[ft]();if(k(r.subscribe))return r.subscribe(t);throw new TypeError("Provided object does not correctly implement Symbol.observable")})}function Ni(e){return new I(function(t){for(var r=0;r=2;return function(o){return o.pipe(e?v(function(n,i){return e(n,i,o)}):pe,ue(1),r?$e(t):Uo(function(){return new or}))}}function Rr(e){return e<=0?function(){return M}:g(function(t,r){var o=[];t.subscribe(E(r,function(n){o.push(n),e=2,!0))}function de(e){e===void 0&&(e={});var t=e.connector,r=t===void 0?function(){return new x}:t,o=e.resetOnError,n=o===void 0?!0:o,i=e.resetOnComplete,s=i===void 0?!0:i,a=e.resetOnRefCountZero,c=a===void 0?!0:a;return function(p){var l,f,u,d=0,y=!1,b=!1,D=function(){f==null||f.unsubscribe(),f=void 0},Q=function(){D(),l=u=void 0,y=b=!1},J=function(){var C=l;Q(),C==null||C.unsubscribe()};return g(function(C,ct){d++,!b&&!y&&D();var Ve=u=u!=null?u:r();ct.add(function(){d--,d===0&&!b&&!y&&(f=jr(J,c))}),Ve.subscribe(ct),!l&&d>0&&(l=new it({next:function(Fe){return 
Ve.next(Fe)},error:function(Fe){b=!0,D(),f=jr(Q,n,Fe),Ve.error(Fe)},complete:function(){y=!0,D(),f=jr(Q,s),Ve.complete()}}),F(C).subscribe(l))})(p)}}function jr(e,t){for(var r=[],o=2;oe.next(document)),e}function W(e,t=document){return Array.from(t.querySelectorAll(e))}function U(e,t=document){let r=ce(e,t);if(typeof r=="undefined")throw new ReferenceError(`Missing element: expected "${e}" to be present`);return r}function ce(e,t=document){return t.querySelector(e)||void 0}function Ie(){return document.activeElement instanceof HTMLElement&&document.activeElement||void 0}var ca=L(h(document.body,"focusin"),h(document.body,"focusout")).pipe(ye(1),q(void 0),m(()=>Ie()||document.body),Z(1));function vt(e){return ca.pipe(m(t=>e.contains(t)),X())}function qo(e,t){return L(h(e,"mouseenter").pipe(m(()=>!0)),h(e,"mouseleave").pipe(m(()=>!1))).pipe(t?ye(t):pe,q(!1))}function Ue(e){return{x:e.offsetLeft,y:e.offsetTop}}function Ko(e){return L(h(window,"load"),h(window,"resize")).pipe(Le(0,ge),m(()=>Ue(e)),q(Ue(e)))}function ir(e){return{x:e.scrollLeft,y:e.scrollTop}}function et(e){return L(h(e,"scroll"),h(window,"resize")).pipe(Le(0,ge),m(()=>ir(e)),q(ir(e)))}function Qo(e,t){if(typeof t=="string"||typeof t=="number")e.innerHTML+=t.toString();else if(t instanceof Node)e.appendChild(t);else if(Array.isArray(t))for(let r of t)Qo(e,r)}function S(e,t,...r){let o=document.createElement(e);if(t)for(let n of Object.keys(t))typeof t[n]!="undefined"&&(typeof t[n]!="boolean"?o.setAttribute(n,t[n]):o.setAttribute(n,""));for(let n of r)Qo(o,n);return o}function ar(e){if(e>999){let t=+((e-950)%1e3>99);return`${((e+1e-6)/1e3).toFixed(t)}k`}else return e.toString()}function gt(e){let t=S("script",{src:e});return H(()=>(document.head.appendChild(t),L(h(t,"load"),h(t,"error").pipe(w(()=>kr(()=>new ReferenceError(`Invalid script: ${e}`))))).pipe(m(()=>{}),A(()=>document.head.removeChild(t)),ue(1))))}var Yo=new x,pa=H(()=>typeof 
ResizeObserver=="undefined"?gt("https://unpkg.com/resize-observer-polyfill"):R(void 0)).pipe(m(()=>new ResizeObserver(e=>{for(let t of e)Yo.next(t)})),w(e=>L(Ke,R(e)).pipe(A(()=>e.disconnect()))),Z(1));function le(e){return{width:e.offsetWidth,height:e.offsetHeight}}function Se(e){return pa.pipe(T(t=>t.observe(e)),w(t=>Yo.pipe(v(({target:r})=>r===e),A(()=>t.unobserve(e)),m(()=>le(e)))),q(le(e)))}function xt(e){return{width:e.scrollWidth,height:e.scrollHeight}}function sr(e){let t=e.parentElement;for(;t&&(e.scrollWidth<=t.scrollWidth&&e.scrollHeight<=t.scrollHeight);)t=(e=t).parentElement;return t?e:void 0}var Bo=new x,la=H(()=>R(new IntersectionObserver(e=>{for(let t of e)Bo.next(t)},{threshold:0}))).pipe(w(e=>L(Ke,R(e)).pipe(A(()=>e.disconnect()))),Z(1));function yt(e){return la.pipe(T(t=>t.observe(e)),w(t=>Bo.pipe(v(({target:r})=>r===e),A(()=>t.unobserve(e)),m(({isIntersecting:r})=>r))))}function Go(e,t=16){return et(e).pipe(m(({y:r})=>{let o=le(e),n=xt(e);return r>=n.height-o.height-t}),X())}var cr={drawer:U("[data-md-toggle=drawer]"),search:U("[data-md-toggle=search]")};function Jo(e){return cr[e].checked}function Ye(e,t){cr[e].checked!==t&&cr[e].click()}function Ne(e){let t=cr[e];return h(t,"change").pipe(m(()=>t.checked),q(t.checked))}function ma(e,t){switch(e.constructor){case HTMLInputElement:return e.type==="radio"?/^Arrow/.test(t):!0;case HTMLSelectElement:case HTMLTextAreaElement:return!0;default:return e.isContentEditable}}function fa(){return L(h(window,"compositionstart").pipe(m(()=>!0)),h(window,"compositionend").pipe(m(()=>!1))).pipe(q(!1))}function Xo(){let e=h(window,"keydown").pipe(v(t=>!(t.metaKey||t.ctrlKey)),m(t=>({mode:Jo("search")?"search":"global",type:t.key,claim(){t.preventDefault(),t.stopPropagation()}})),v(({mode:t,type:r})=>{if(t==="global"){let o=Ie();if(typeof o!="undefined")return!ma(o,r)}return!0}),de());return fa().pipe(w(t=>t?M:e))}function me(){return new URL(location.href)}function st(e,t=!1){if(G("navigation.instant")&&!t){let 
r=S("a",{href:e.href});document.body.appendChild(r),r.click(),r.remove()}else location.href=e.href}function Zo(){return new x}function en(){return location.hash.slice(1)}function pr(e){let t=S("a",{href:e});t.addEventListener("click",r=>r.stopPropagation()),t.click()}function ua(e){return L(h(window,"hashchange"),e).pipe(m(en),q(en()),v(t=>t.length>0),Z(1))}function tn(e){return ua(e).pipe(m(t=>ce(`[id="${t}"]`)),v(t=>typeof t!="undefined"))}function At(e){let t=matchMedia(e);return nr(r=>t.addListener(()=>r(t.matches))).pipe(q(t.matches))}function rn(){let e=matchMedia("print");return L(h(window,"beforeprint").pipe(m(()=>!0)),h(window,"afterprint").pipe(m(()=>!1))).pipe(q(e.matches))}function Dr(e,t){return e.pipe(w(r=>r?t():M))}function lr(e,t){return new I(r=>{let o=new XMLHttpRequest;o.open("GET",`${e}`),o.responseType="blob",o.addEventListener("load",()=>{o.status>=200&&o.status<300?(r.next(o.response),r.complete()):r.error(new Error(o.statusText))}),o.addEventListener("error",()=>{r.error(new Error("Network Error"))}),o.addEventListener("abort",()=>{r.error(new Error("Request aborted"))}),typeof(t==null?void 0:t.progress$)!="undefined"&&(o.addEventListener("progress",n=>{if(n.lengthComputable)t.progress$.next(n.loaded/n.total*100);else{let i=Number(o.getResponseHeader("Content-Length"))||0;t.progress$.next(n.loaded/i*100)}}),t.progress$.next(5)),o.send()})}function De(e,t){return lr(e,t).pipe(w(r=>r.text()),m(r=>JSON.parse(r)),Z(1))}function on(e,t){let r=new DOMParser;return lr(e,t).pipe(w(o=>o.text()),m(o=>r.parseFromString(o,"text/xml")),Z(1))}function nn(){return{x:Math.max(0,scrollX),y:Math.max(0,scrollY)}}function an(){return L(h(window,"scroll",{passive:!0}),h(window,"resize",{passive:!0})).pipe(m(nn),q(nn()))}function sn(){return{width:innerWidth,height:innerHeight}}function cn(){return h(window,"resize",{passive:!0}).pipe(m(sn),q(sn()))}function pn(){return B([an(),cn()]).pipe(m(([e,t])=>({offset:e,size:t})),Z(1))}function 
mr(e,{viewport$:t,header$:r}){let o=t.pipe(te("size")),n=B([o,r]).pipe(m(()=>Ue(e)));return B([r,t,n]).pipe(m(([{height:i},{offset:s,size:a},{x:c,y:p}])=>({offset:{x:s.x-c,y:s.y-p+i},size:a})))}function da(e){return h(e,"message",t=>t.data)}function ha(e){let t=new x;return t.subscribe(r=>e.postMessage(r)),t}function ln(e,t=new Worker(e)){let r=da(t),o=ha(t),n=new x;n.subscribe(o);let i=o.pipe(ee(),oe(!0));return n.pipe(ee(),Re(r.pipe(j(i))),de())}var ba=U("#__config"),Et=JSON.parse(ba.textContent);Et.base=`${new URL(Et.base,me())}`;function he(){return Et}function G(e){return Et.features.includes(e)}function we(e,t){return typeof t!="undefined"?Et.translations[e].replace("#",t.toString()):Et.translations[e]}function Oe(e,t=document){return U(`[data-md-component=${e}]`,t)}function ne(e,t=document){return W(`[data-md-component=${e}]`,t)}function va(e){let t=U(".md-typeset > :first-child",e);return h(t,"click",{once:!0}).pipe(m(()=>U(".md-typeset",e)),m(r=>({hash:__md_hash(r.innerHTML)})))}function mn(e){if(!G("announce.dismiss")||!e.childElementCount)return M;if(!e.hidden){let t=U(".md-typeset",e);__md_hash(t.innerHTML)===__md_get("__announce")&&(e.hidden=!0)}return H(()=>{let t=new x;return t.subscribe(({hash:r})=>{e.hidden=!0,__md_set("__announce",r)}),va(e).pipe(T(r=>t.next(r)),A(()=>t.complete()),m(r=>P({ref:e},r)))})}function ga(e,{target$:t}){return t.pipe(m(r=>({hidden:r!==e})))}function fn(e,t){let r=new x;return r.subscribe(({hidden:o})=>{e.hidden=o}),ga(e,t).pipe(T(o=>r.next(o)),A(()=>r.complete()),m(o=>P({ref:e},o)))}function Ct(e,t){return t==="inline"?S("div",{class:"md-tooltip md-tooltip--inline",id:e,role:"tooltip"},S("div",{class:"md-tooltip__inner md-typeset"})):S("div",{class:"md-tooltip",id:e,role:"tooltip"},S("div",{class:"md-tooltip__inner md-typeset"}))}function un(e,t){if(t=t?`${t}_annotation_${e}`:void 0,t){let r=t?`#${t}`:void 0;return 
S("aside",{class:"md-annotation",tabIndex:0},Ct(t),S("a",{href:r,class:"md-annotation__index",tabIndex:-1},S("span",{"data-md-annotation-id":e})))}else return S("aside",{class:"md-annotation",tabIndex:0},Ct(t),S("span",{class:"md-annotation__index",tabIndex:-1},S("span",{"data-md-annotation-id":e})))}function dn(e){return S("button",{class:"md-clipboard md-icon",title:we("clipboard.copy"),"data-clipboard-target":`#${e} > code`})}function Vr(e,t){let r=t&2,o=t&1,n=Object.keys(e.terms).filter(c=>!e.terms[c]).reduce((c,p)=>[...c,S("del",null,p)," "],[]).slice(0,-1),i=he(),s=new URL(e.location,i.base);G("search.highlight")&&s.searchParams.set("h",Object.entries(e.terms).filter(([,c])=>c).reduce((c,[p])=>`${c} ${p}`.trim(),""));let{tags:a}=he();return S("a",{href:`${s}`,class:"md-search-result__link",tabIndex:-1},S("article",{class:"md-search-result__article md-typeset","data-md-score":e.score.toFixed(2)},r>0&&S("div",{class:"md-search-result__icon md-icon"}),r>0&&S("h1",null,e.title),r<=0&&S("h2",null,e.title),o>0&&e.text.length>0&&e.text,e.tags&&e.tags.map(c=>{let p=a?c in a?`md-tag-icon md-tag--${a[c]}`:"md-tag-icon":"";return S("span",{class:`md-tag ${p}`},c)}),o>0&&n.length>0&&S("p",{class:"md-search-result__terms"},we("search.result.term.missing"),": ",...n)))}function hn(e){let t=e[0].score,r=[...e],o=he(),n=r.findIndex(l=>!`${new URL(l.location,o.base)}`.includes("#")),[i]=r.splice(n,1),s=r.findIndex(l=>l.scoreVr(l,1)),...c.length?[S("details",{class:"md-search-result__more"},S("summary",{tabIndex:-1},S("div",null,c.length>0&&c.length===1?we("search.result.more.one"):we("search.result.more.other",c.length))),...c.map(l=>Vr(l,1)))]:[]];return S("li",{class:"md-search-result__item"},p)}function bn(e){return S("ul",{class:"md-source__facts"},Object.entries(e).map(([t,r])=>S("li",{class:`md-source__fact md-source__fact--${t}`},typeof r=="number"?ar(r):r)))}function zr(e){let t=`tabbed-control tabbed-control--${e}`;return 
S("div",{class:t,hidden:!0},S("button",{class:"tabbed-button",tabIndex:-1,"aria-hidden":"true"}))}function vn(e){return S("div",{class:"md-typeset__scrollwrap"},S("div",{class:"md-typeset__table"},e))}function xa(e){let t=he(),r=new URL(`../${e.version}/`,t.base);return S("li",{class:"md-version__item"},S("a",{href:`${r}`,class:"md-version__link"},e.title))}function gn(e,t){return S("div",{class:"md-version"},S("button",{class:"md-version__current","aria-label":we("select.version")},t.title),S("ul",{class:"md-version__list"},e.map(xa)))}var ya=0;function Ea(e,t){document.body.append(e);let{width:r}=le(e);e.style.setProperty("--md-tooltip-width",`${r}px`),e.remove();let o=sr(t),n=typeof o!="undefined"?et(o):R({x:0,y:0}),i=L(vt(t),qo(t)).pipe(X());return B([i,n]).pipe(m(([s,a])=>{let{x:c,y:p}=Ue(t),l=le(t),f=t.closest("table");return f&&t.parentElement&&(c+=f.offsetLeft+t.parentElement.offsetLeft,p+=f.offsetTop+t.parentElement.offsetTop),{active:s,offset:{x:c-a.x+l.width/2-r/2,y:p-a.y+l.height+8}}}))}function Be(e){let t=e.title;if(!t.length)return M;let r=`__tooltip_${ya++}`,o=Ct(r,"inline"),n=U(".md-typeset",o);return n.innerHTML=t,H(()=>{let i=new x;return 
i.subscribe({next({offset:s}){o.style.setProperty("--md-tooltip-x",`${s.x}px`),o.style.setProperty("--md-tooltip-y",`${s.y}px`)},complete(){o.style.removeProperty("--md-tooltip-x"),o.style.removeProperty("--md-tooltip-y")}}),L(i.pipe(v(({active:s})=>s)),i.pipe(ye(250),v(({active:s})=>!s))).subscribe({next({active:s}){s?(e.insertAdjacentElement("afterend",o),e.setAttribute("aria-describedby",r),e.removeAttribute("title")):(o.remove(),e.removeAttribute("aria-describedby"),e.setAttribute("title",t))},complete(){o.remove(),e.removeAttribute("aria-describedby"),e.setAttribute("title",t)}}),i.pipe(Le(16,ge)).subscribe(({active:s})=>{o.classList.toggle("md-tooltip--active",s)}),i.pipe(_t(125,ge),v(()=>!!e.offsetParent),m(()=>e.offsetParent.getBoundingClientRect()),m(({x:s})=>s)).subscribe({next(s){s?o.style.setProperty("--md-tooltip-0",`${-s}px`):o.style.removeProperty("--md-tooltip-0")},complete(){o.style.removeProperty("--md-tooltip-0")}}),Ea(o,e).pipe(T(s=>i.next(s)),A(()=>i.complete()),m(s=>P({ref:e},s)))}).pipe(qe(ie))}function wa(e,t){let r=H(()=>B([Ko(e),et(t)])).pipe(m(([{x:o,y:n},i])=>{let{width:s,height:a}=le(e);return{x:o-i.x+s/2,y:n-i.y+a/2}}));return vt(e).pipe(w(o=>r.pipe(m(n=>({active:o,offset:n})),ue(+!o||1/0))))}function xn(e,t,{target$:r}){let[o,n]=Array.from(e.children);return H(()=>{let i=new x,s=i.pipe(ee(),oe(!0));return 
i.subscribe({next({offset:a}){e.style.setProperty("--md-tooltip-x",`${a.x}px`),e.style.setProperty("--md-tooltip-y",`${a.y}px`)},complete(){e.style.removeProperty("--md-tooltip-x"),e.style.removeProperty("--md-tooltip-y")}}),yt(e).pipe(j(s)).subscribe(a=>{e.toggleAttribute("data-md-visible",a)}),L(i.pipe(v(({active:a})=>a)),i.pipe(ye(250),v(({active:a})=>!a))).subscribe({next({active:a}){a?e.prepend(o):o.remove()},complete(){e.prepend(o)}}),i.pipe(Le(16,ge)).subscribe(({active:a})=>{o.classList.toggle("md-tooltip--active",a)}),i.pipe(_t(125,ge),v(()=>!!e.offsetParent),m(()=>e.offsetParent.getBoundingClientRect()),m(({x:a})=>a)).subscribe({next(a){a?e.style.setProperty("--md-tooltip-0",`${-a}px`):e.style.removeProperty("--md-tooltip-0")},complete(){e.style.removeProperty("--md-tooltip-0")}}),h(n,"click").pipe(j(s),v(a=>!(a.metaKey||a.ctrlKey))).subscribe(a=>{a.stopPropagation(),a.preventDefault()}),h(n,"mousedown").pipe(j(s),ae(i)).subscribe(([a,{active:c}])=>{var p;if(a.button!==0||a.metaKey||a.ctrlKey)a.preventDefault();else if(c){a.preventDefault();let l=e.parentElement.closest(".md-annotation");l instanceof HTMLElement?l.focus():(p=Ie())==null||p.blur()}}),r.pipe(j(s),v(a=>a===o),Qe(125)).subscribe(()=>e.focus()),wa(e,t).pipe(T(a=>i.next(a)),A(()=>i.complete()),m(a=>P({ref:e},a)))})}function Ta(e){return e.tagName==="CODE"?W(".c, .c1, .cm",e):[e]}function Sa(e){let t=[];for(let r of Ta(e)){let o=[],n=document.createNodeIterator(r,NodeFilter.SHOW_TEXT);for(let i=n.nextNode();i;i=n.nextNode())o.push(i);for(let i of o){let s;for(;s=/(\(\d+\))(!)?/.exec(i.textContent);){let[,a,c]=s;if(typeof c=="undefined"){let p=i.splitText(s.index);i=p.splitText(a.length),t.push(p)}else{i.textContent=a,t.push(i);break}}}}return t}function yn(e,t){t.append(...Array.from(e.childNodes))}function fr(e,t,{target$:r,print$:o}){let n=t.closest("[id]"),i=n==null?void 0:n.id,s=new Map;for(let a of Sa(t)){let[,c]=a.textContent.match(/\((\d+)\)/);ce(`:scope > 
li:nth-child(${c})`,e)&&(s.set(c,un(c,i)),a.replaceWith(s.get(c)))}return s.size===0?M:H(()=>{let a=new x,c=a.pipe(ee(),oe(!0)),p=[];for(let[l,f]of s)p.push([U(".md-typeset",f),U(`:scope > li:nth-child(${l})`,e)]);return o.pipe(j(c)).subscribe(l=>{e.hidden=!l,e.classList.toggle("md-annotation-list",l);for(let[f,u]of p)l?yn(f,u):yn(u,f)}),L(...[...s].map(([,l])=>xn(l,t,{target$:r}))).pipe(A(()=>a.complete()),de())})}function En(e){if(e.nextElementSibling){let t=e.nextElementSibling;if(t.tagName==="OL")return t;if(t.tagName==="P"&&!t.children.length)return En(t)}}function wn(e,t){return H(()=>{let r=En(e);return typeof r!="undefined"?fr(r,e,t):M})}var Tn=jt(Kr());var Oa=0;function Sn(e){if(e.nextElementSibling){let t=e.nextElementSibling;if(t.tagName==="OL")return t;if(t.tagName==="P"&&!t.children.length)return Sn(t)}}function Ma(e){return Se(e).pipe(m(({width:t})=>({scrollable:xt(e).width>t})),te("scrollable"))}function On(e,t){let{matches:r}=matchMedia("(hover)"),o=H(()=>{let n=new x,i=n.pipe(Rr(1));n.subscribe(({scrollable:c})=>{c&&r?e.setAttribute("tabindex","0"):e.removeAttribute("tabindex")});let s=[];if(Tn.default.isSupported()&&(e.closest(".copy")||G("content.code.copy")&&!e.closest(".no-copy"))){let c=e.closest("pre");c.id=`__code_${Oa++}`;let p=dn(c.id);c.insertBefore(p,e),G("content.tooltips")&&s.push(Be(p))}let a=e.closest(".highlight");if(a instanceof HTMLElement){let c=Sn(a);if(typeof c!="undefined"&&(a.classList.contains("annotate")||G("content.code.annotate"))){let p=fr(c,e,t);s.push(Se(a).pipe(j(i),m(({width:l,height:f})=>l&&f),X(),w(l=>l?p:M)))}}return Ma(e).pipe(T(c=>n.next(c)),A(()=>n.complete()),m(c=>P({ref:e},c)),Re(...s))});return G("content.lazy")?yt(e).pipe(v(n=>n),ue(1),w(()=>o)):o}function La(e,{target$:t,print$:r}){let o=!0;return L(t.pipe(m(n=>n.closest("details:not([open])")),v(n=>e===n),m(()=>({action:"open",reveal:!0}))),r.pipe(v(n=>n||!o),T(()=>o=e.open),m(n=>({action:n?"open":"close"}))))}function Mn(e,t){return H(()=>{let r=new 
x;return r.subscribe(({action:o,reveal:n})=>{e.toggleAttribute("open",o==="open"),n&&e.scrollIntoView()}),La(e,t).pipe(T(o=>r.next(o)),A(()=>r.complete()),m(o=>P({ref:e},o)))})}var Ln=".node circle,.node ellipse,.node path,.node polygon,.node rect{fill:var(--md-mermaid-node-bg-color);stroke:var(--md-mermaid-node-fg-color)}marker{fill:var(--md-mermaid-edge-color)!important}.edgeLabel .label rect{fill:#0000}.label{color:var(--md-mermaid-label-fg-color);font-family:var(--md-mermaid-font-family)}.label foreignObject{line-height:normal;overflow:visible}.label div .edgeLabel{color:var(--md-mermaid-label-fg-color)}.edgeLabel,.edgeLabel rect,.label div .edgeLabel{background-color:var(--md-mermaid-label-bg-color)}.edgeLabel,.edgeLabel rect{fill:var(--md-mermaid-label-bg-color);color:var(--md-mermaid-edge-color)}.edgePath .path,.flowchart-link{stroke:var(--md-mermaid-edge-color);stroke-width:.05rem}.edgePath .arrowheadPath{fill:var(--md-mermaid-edge-color);stroke:none}.cluster rect{fill:var(--md-default-fg-color--lightest);stroke:var(--md-default-fg-color--lighter)}.cluster span{color:var(--md-mermaid-label-fg-color);font-family:var(--md-mermaid-font-family)}g #flowchart-circleEnd,g #flowchart-circleStart,g #flowchart-crossEnd,g #flowchart-crossStart,g #flowchart-pointEnd,g #flowchart-pointStart{stroke:none}g.classGroup line,g.classGroup rect{fill:var(--md-mermaid-node-bg-color);stroke:var(--md-mermaid-node-fg-color)}g.classGroup text{fill:var(--md-mermaid-label-fg-color);font-family:var(--md-mermaid-font-family)}.classLabel .box{fill:var(--md-mermaid-label-bg-color);background-color:var(--md-mermaid-label-bg-color);opacity:1}.classLabel .label{fill:var(--md-mermaid-label-fg-color);font-family:var(--md-mermaid-font-family)}.node .divider{stroke:var(--md-mermaid-node-fg-color)}.relation{stroke:var(--md-mermaid-edge-color)}.cardinality{fill:var(--md-mermaid-label-fg-color);font-family:var(--md-mermaid-font-family)}.cardinality text{fill:inherit!important}defs 
#classDiagram-compositionEnd,defs #classDiagram-compositionStart,defs #classDiagram-dependencyEnd,defs #classDiagram-dependencyStart,defs #classDiagram-extensionEnd,defs #classDiagram-extensionStart{fill:var(--md-mermaid-edge-color)!important;stroke:var(--md-mermaid-edge-color)!important}defs #classDiagram-aggregationEnd,defs #classDiagram-aggregationStart{fill:var(--md-mermaid-label-bg-color)!important;stroke:var(--md-mermaid-edge-color)!important}g.stateGroup rect{fill:var(--md-mermaid-node-bg-color);stroke:var(--md-mermaid-node-fg-color)}g.stateGroup .state-title{fill:var(--md-mermaid-label-fg-color)!important;font-family:var(--md-mermaid-font-family)}g.stateGroup .composit{fill:var(--md-mermaid-label-bg-color)}.nodeLabel{color:var(--md-mermaid-label-fg-color);font-family:var(--md-mermaid-font-family)}.node circle.state-end,.node circle.state-start,.start-state{fill:var(--md-mermaid-edge-color);stroke:none}.end-state-inner,.end-state-outer{fill:var(--md-mermaid-edge-color)}.end-state-inner,.node circle.state-end{stroke:var(--md-mermaid-label-bg-color)}.transition{stroke:var(--md-mermaid-edge-color)}[id^=state-fork] rect,[id^=state-join] rect{fill:var(--md-mermaid-edge-color)!important;stroke:none!important}.statediagram-cluster.statediagram-cluster .inner{fill:var(--md-default-bg-color)}.statediagram-cluster rect{fill:var(--md-mermaid-node-bg-color);stroke:var(--md-mermaid-node-fg-color)}.statediagram-state rect.divider{fill:var(--md-default-fg-color--lightest);stroke:var(--md-default-fg-color--lighter)}defs 
#statediagram-barbEnd{stroke:var(--md-mermaid-edge-color)}.attributeBoxEven,.attributeBoxOdd{fill:var(--md-mermaid-node-bg-color);stroke:var(--md-mermaid-node-fg-color)}.entityBox{fill:var(--md-mermaid-label-bg-color);stroke:var(--md-mermaid-node-fg-color)}.entityLabel{fill:var(--md-mermaid-label-fg-color);font-family:var(--md-mermaid-font-family)}.relationshipLabelBox{fill:var(--md-mermaid-label-bg-color);fill-opacity:1;background-color:var(--md-mermaid-label-bg-color);opacity:1}.relationshipLabel{fill:var(--md-mermaid-label-fg-color)}.relationshipLine{stroke:var(--md-mermaid-edge-color)}defs #ONE_OR_MORE_END *,defs #ONE_OR_MORE_START *,defs #ONLY_ONE_END *,defs #ONLY_ONE_START *,defs #ZERO_OR_MORE_END *,defs #ZERO_OR_MORE_START *,defs #ZERO_OR_ONE_END *,defs #ZERO_OR_ONE_START *{stroke:var(--md-mermaid-edge-color)!important}defs #ZERO_OR_MORE_END circle,defs #ZERO_OR_MORE_START circle{fill:var(--md-mermaid-label-bg-color)}.actor{fill:var(--md-mermaid-sequence-actor-bg-color);stroke:var(--md-mermaid-sequence-actor-border-color)}text.actor>tspan{fill:var(--md-mermaid-sequence-actor-fg-color);font-family:var(--md-mermaid-font-family)}line{stroke:var(--md-mermaid-sequence-actor-line-color)}.actor-man circle,.actor-man line{fill:var(--md-mermaid-sequence-actorman-bg-color);stroke:var(--md-mermaid-sequence-actorman-line-color)}.messageLine0,.messageLine1{stroke:var(--md-mermaid-sequence-message-line-color)}.note{fill:var(--md-mermaid-sequence-note-bg-color);stroke:var(--md-mermaid-sequence-note-border-color)}.loopText,.loopText>tspan,.messageText,.noteText>tspan{stroke:none;font-family:var(--md-mermaid-font-family)!important}.messageText{fill:var(--md-mermaid-sequence-message-fg-color)}.loopText,.loopText>tspan{fill:var(--md-mermaid-sequence-loop-fg-color)}.noteText>tspan{fill:var(--md-mermaid-sequence-note-fg-color)}#arrowhead 
path{fill:var(--md-mermaid-sequence-message-line-color);stroke:none}.loopLine{fill:var(--md-mermaid-sequence-loop-bg-color);stroke:var(--md-mermaid-sequence-loop-border-color)}.labelBox{fill:var(--md-mermaid-sequence-label-bg-color);stroke:none}.labelText,.labelText>span{fill:var(--md-mermaid-sequence-label-fg-color);font-family:var(--md-mermaid-font-family)}.sequenceNumber{fill:var(--md-mermaid-sequence-number-fg-color)}rect.rect{fill:var(--md-mermaid-sequence-box-bg-color);stroke:none}rect.rect+text.text{fill:var(--md-mermaid-sequence-box-fg-color)}defs #sequencenumber{fill:var(--md-mermaid-sequence-number-bg-color)!important}";var Qr,Aa=0;function Ca(){return typeof mermaid=="undefined"||mermaid instanceof Element?gt("https://unpkg.com/mermaid@10.6.1/dist/mermaid.min.js"):R(void 0)}function _n(e){return e.classList.remove("mermaid"),Qr||(Qr=Ca().pipe(T(()=>mermaid.initialize({startOnLoad:!1,themeCSS:Ln,sequence:{actorFontSize:"16px",messageFontSize:"16px",noteFontSize:"16px"}})),m(()=>{}),Z(1))),Qr.subscribe(()=>no(this,null,function*(){e.classList.add("mermaid");let t=`__mermaid_${Aa++}`,r=S("div",{class:"mermaid"}),o=e.textContent,{svg:n,fn:i}=yield mermaid.render(t,o),s=r.attachShadow({mode:"closed"});s.innerHTML=n,e.replaceWith(r),i==null||i(s)})),Qr.pipe(m(()=>({ref:e})))}var An=S("table");function Cn(e){return e.replaceWith(An),An.replaceWith(vn(e)),R({ref:e})}function ka(e){let t=e.find(r=>r.checked)||e[0];return L(...e.map(r=>h(r,"change").pipe(m(()=>U(`label[for="${r.id}"]`))))).pipe(q(U(`label[for="${t.id}"]`)),m(r=>({active:r})))}function kn(e,{viewport$:t,target$:r}){let o=U(".tabbed-labels",e),n=W(":scope > input",e),i=zr("prev");e.append(i);let s=zr("next");return e.append(s),H(()=>{let a=new x,c=a.pipe(ee(),oe(!0));B([a,Se(e)]).pipe(j(c),Le(1,ge)).subscribe({next([{active:p},l]){let f=Ue(p),{width:u}=le(p);e.style.setProperty("--md-indicator-x",`${f.x}px`),e.style.setProperty("--md-indicator-width",`${u}px`);let 
d=ir(o);(f.xd.x+l.width)&&o.scrollTo({left:Math.max(0,f.x-16),behavior:"smooth"})},complete(){e.style.removeProperty("--md-indicator-x"),e.style.removeProperty("--md-indicator-width")}}),B([et(o),Se(o)]).pipe(j(c)).subscribe(([p,l])=>{let f=xt(o);i.hidden=p.x<16,s.hidden=p.x>f.width-l.width-16}),L(h(i,"click").pipe(m(()=>-1)),h(s,"click").pipe(m(()=>1))).pipe(j(c)).subscribe(p=>{let{width:l}=le(o);o.scrollBy({left:l*p,behavior:"smooth"})}),r.pipe(j(c),v(p=>n.includes(p))).subscribe(p=>p.click()),o.classList.add("tabbed-labels--linked");for(let p of n){let l=U(`label[for="${p.id}"]`);l.replaceChildren(S("a",{href:`#${l.htmlFor}`,tabIndex:-1},...Array.from(l.childNodes))),h(l.firstElementChild,"click").pipe(j(c),v(f=>!(f.metaKey||f.ctrlKey)),T(f=>{f.preventDefault(),f.stopPropagation()})).subscribe(()=>{history.replaceState({},"",`#${l.htmlFor}`),l.click()})}return G("content.tabs.link")&&a.pipe(Ee(1),ae(t)).subscribe(([{active:p},{offset:l}])=>{let f=p.innerText.trim();if(p.hasAttribute("data-md-switching"))p.removeAttribute("data-md-switching");else{let u=e.offsetTop-l.y;for(let y of W("[data-tabs]"))for(let b of W(":scope > input",y)){let D=U(`label[for="${b.id}"]`);if(D!==p&&D.innerText.trim()===f){D.setAttribute("data-md-switching",""),b.click();break}}window.scrollTo({top:e.offsetTop-u});let d=__md_get("__tabs")||[];__md_set("__tabs",[...new Set([f,...d])])}}),a.pipe(j(c)).subscribe(()=>{for(let p of W("audio, video",e))p.pause()}),ka(n).pipe(T(p=>a.next(p)),A(()=>a.complete()),m(p=>P({ref:e},p)))}).pipe(qe(ie))}function Hn(e,{viewport$:t,target$:r,print$:o}){return L(...W(".annotate:not(.highlight)",e).map(n=>wn(n,{target$:r,print$:o})),...W("pre:not(.mermaid) > 
code",e).map(n=>On(n,{target$:r,print$:o})),...W("pre.mermaid",e).map(n=>_n(n)),...W("table:not([class])",e).map(n=>Cn(n)),...W("details",e).map(n=>Mn(n,{target$:r,print$:o})),...W("[data-tabs]",e).map(n=>kn(n,{viewport$:t,target$:r})),...W("[title]",e).filter(()=>G("content.tooltips")).map(n=>Be(n)))}function Ha(e,{alert$:t}){return t.pipe(w(r=>L(R(!0),R(!1).pipe(Qe(2e3))).pipe(m(o=>({message:r,active:o})))))}function $n(e,t){let r=U(".md-typeset",e);return H(()=>{let o=new x;return o.subscribe(({message:n,active:i})=>{e.classList.toggle("md-dialog--active",i),r.textContent=n}),Ha(e,t).pipe(T(n=>o.next(n)),A(()=>o.complete()),m(n=>P({ref:e},n)))})}function $a({viewport$:e}){if(!G("header.autohide"))return R(!1);let t=e.pipe(m(({offset:{y:n}})=>n),Ce(2,1),m(([n,i])=>[nMath.abs(i-n.y)>100),m(([,[n]])=>n),X()),o=Ne("search");return B([e,o]).pipe(m(([{offset:n},i])=>n.y>400&&!i),X(),w(n=>n?r:R(!1)),q(!1))}function Pn(e,t){return H(()=>B([Se(e),$a(t)])).pipe(m(([{height:r},o])=>({height:r,hidden:o})),X((r,o)=>r.height===o.height&&r.hidden===o.hidden),Z(1))}function Rn(e,{header$:t,main$:r}){return H(()=>{let o=new x,n=o.pipe(ee(),oe(!0));o.pipe(te("active"),Ze(t)).subscribe(([{active:s},{hidden:a}])=>{e.classList.toggle("md-header--shadow",s&&!a),e.hidden=a});let i=fe(W("[title]",e)).pipe(v(()=>G("content.tooltips")),re(s=>Be(s)));return r.subscribe(o),t.pipe(j(n),m(s=>P({ref:e},s)),Re(i.pipe(j(n))))})}function Pa(e,{viewport$:t,header$:r}){return mr(e,{viewport$:t,header$:r}).pipe(m(({offset:{y:o}})=>{let{height:n}=le(e);return{active:o>=n}}),te("active"))}function In(e,t){return H(()=>{let r=new x;r.subscribe({next({active:n}){e.classList.toggle("md-header__title--active",n)},complete(){e.classList.remove("md-header__title--active")}});let o=ce(".md-content h1");return typeof o=="undefined"?M:Pa(o,t).pipe(T(n=>r.next(n)),A(()=>r.complete()),m(n=>P({ref:e},n)))})}function Fn(e,{viewport$:t,header$:r}){let 
o=r.pipe(m(({height:i})=>i),X()),n=o.pipe(w(()=>Se(e).pipe(m(({height:i})=>({top:e.offsetTop,bottom:e.offsetTop+i})),te("bottom"))));return B([o,n,t]).pipe(m(([i,{top:s,bottom:a},{offset:{y:c},size:{height:p}}])=>(p=Math.max(0,p-Math.max(0,s-c,i)-Math.max(0,p+c-a)),{offset:s-i,height:p,active:s-i<=c})),X((i,s)=>i.offset===s.offset&&i.height===s.height&&i.active===s.active))}function Ra(e){let t=__md_get("__palette")||{index:e.findIndex(r=>matchMedia(r.getAttribute("data-md-color-media")).matches)};return R(...e).pipe(re(r=>h(r,"change").pipe(m(()=>r))),q(e[Math.max(0,t.index)]),m(r=>({index:e.indexOf(r),color:{media:r.getAttribute("data-md-color-media"),scheme:r.getAttribute("data-md-color-scheme"),primary:r.getAttribute("data-md-color-primary"),accent:r.getAttribute("data-md-color-accent")}})),Z(1))}function jn(e){let t=W("input",e),r=S("meta",{name:"theme-color"});document.head.appendChild(r);let o=S("meta",{name:"color-scheme"});document.head.appendChild(o);let n=At("(prefers-color-scheme: light)");return H(()=>{let i=new x;return i.subscribe(s=>{if(document.body.setAttribute("data-md-color-switching",""),s.color.media==="(prefers-color-scheme)"){let a=matchMedia("(prefers-color-scheme: light)"),c=document.querySelector(a.matches?"[data-md-color-media='(prefers-color-scheme: light)']":"[data-md-color-media='(prefers-color-scheme: dark)']");s.color.scheme=c.getAttribute("data-md-color-scheme"),s.color.primary=c.getAttribute("data-md-color-primary"),s.color.accent=c.getAttribute("data-md-color-accent")}for(let[a,c]of Object.entries(s.color))document.body.setAttribute(`data-md-color-${a}`,c);for(let a=0;a{let s=Oe("header"),a=window.getComputedStyle(s);return 
o.content=a.colorScheme,a.backgroundColor.match(/\d+/g).map(c=>(+c).toString(16).padStart(2,"0")).join("")})).subscribe(s=>r.content=`#${s}`),i.pipe(Me(ie)).subscribe(()=>{document.body.removeAttribute("data-md-color-switching")}),Ra(t).pipe(j(n.pipe(Ee(1))),at(),T(s=>i.next(s)),A(()=>i.complete()),m(s=>P({ref:e},s)))})}function Wn(e,{progress$:t}){return H(()=>{let r=new x;return r.subscribe(({value:o})=>{e.style.setProperty("--md-progress-value",`${o}`)}),t.pipe(T(o=>r.next({value:o})),A(()=>r.complete()),m(o=>({ref:e,value:o})))})}var Yr=jt(Kr());function Ia(e){e.setAttribute("data-md-copying","");let t=e.closest("[data-copy]"),r=t?t.getAttribute("data-copy"):e.innerText;return e.removeAttribute("data-md-copying"),r.trimEnd()}function Un({alert$:e}){Yr.default.isSupported()&&new I(t=>{new Yr.default("[data-clipboard-target], [data-clipboard-text]",{text:r=>r.getAttribute("data-clipboard-text")||Ia(U(r.getAttribute("data-clipboard-target")))}).on("success",r=>t.next(r))}).pipe(T(t=>{t.trigger.focus()}),m(()=>we("clipboard.copied"))).subscribe(e)}function Fa(e){if(e.length<2)return[""];let[t,r]=[...e].sort((n,i)=>n.length-i.length).map(n=>n.replace(/[^/]+$/,"")),o=0;if(t===r)o=t.length;else for(;t.charCodeAt(o)===r.charCodeAt(o);)o++;return e.map(n=>n.replace(t.slice(0,o),""))}function ur(e){let t=__md_get("__sitemap",sessionStorage,e);if(t)return R(t);{let r=he();return on(new URL("sitemap.xml",e||r.base)).pipe(m(o=>Fa(W("loc",o).map(n=>n.textContent))),xe(()=>M),$e([]),T(o=>__md_set("__sitemap",o,sessionStorage,e)))}}function Nn(e){let t=ce("[rel=canonical]",e);typeof t!="undefined"&&(t.href=t.href.replace("//localhost:","//127.0.0.1:"));let r=new Map;for(let o of W(":scope > *",e)){let n=o.outerHTML;for(let i of["href","src"]){let s=o.getAttribute(i);if(s===null)continue;let a=new URL(s,t==null?void 0:t.href),c=o.cloneNode();c.setAttribute(i,`${a}`),n=c.outerHTML;break}r.set(n,o)}return r}function Dn({location$:e,viewport$:t,progress$:r}){let 
o=he();if(location.protocol==="file:")return M;let n=ur().pipe(m(l=>l.map(f=>`${new URL(f,o.base)}`))),i=h(document.body,"click").pipe(ae(n),w(([l,f])=>{if(!(l.target instanceof Element))return M;let u=l.target.closest("a");if(u===null)return M;if(u.target||l.metaKey||l.ctrlKey)return M;let d=new URL(u.href);return d.search=d.hash="",f.includes(`${d}`)?(l.preventDefault(),R(new URL(u.href))):M}),de());i.pipe(ue(1)).subscribe(()=>{let l=ce("link[rel=icon]");typeof l!="undefined"&&(l.href=l.href)}),h(window,"beforeunload").subscribe(()=>{history.scrollRestoration="auto"}),i.pipe(ae(t)).subscribe(([l,{offset:f}])=>{history.scrollRestoration="manual",history.replaceState(f,""),history.pushState(null,"",l)}),i.subscribe(e);let s=e.pipe(q(me()),te("pathname"),Ee(1),w(l=>lr(l,{progress$:r}).pipe(xe(()=>(st(l,!0),M))))),a=new DOMParser,c=s.pipe(w(l=>l.text()),w(l=>{let f=a.parseFromString(l,"text/html");for(let b of["[data-md-component=announce]","[data-md-component=container]","[data-md-component=header-topic]","[data-md-component=outdated]","[data-md-component=logo]","[data-md-component=skip]",...G("navigation.tabs.sticky")?["[data-md-component=tabs]"]:[]]){let D=ce(b),Q=ce(b,f);typeof D!="undefined"&&typeof Q!="undefined"&&D.replaceWith(Q)}let u=Nn(document.head),d=Nn(f.head);for(let[b,D]of d)D.getAttribute("rel")==="stylesheet"||D.hasAttribute("src")||(u.has(b)?u.delete(b):document.head.appendChild(D));for(let b of u.values())b.getAttribute("rel")==="stylesheet"||b.hasAttribute("src")||b.remove();let y=Oe("container");return We(W("script",y)).pipe(w(b=>{let D=f.createElement("script");if(b.src){for(let Q of b.getAttributeNames())D.setAttribute(Q,b.getAttribute(Q));return b.replaceWith(D),new I(Q=>{D.onload=()=>Q.complete()})}else return D.textContent=b.textContent,b.replaceWith(D),M}),ee(),oe(f))}),de());return h(window,"popstate").pipe(m(me)).subscribe(e),e.pipe(q(me()),Ce(2,1),v(([l,f])=>l.pathname===f.pathname&&l.hash!==f.hash),m(([,l])=>l)).subscribe(l=>{var 
f,u;history.state!==null||!l.hash?window.scrollTo(0,(u=(f=history.state)==null?void 0:f.y)!=null?u:0):(history.scrollRestoration="auto",pr(l.hash),history.scrollRestoration="manual")}),e.pipe(Ir(i),q(me()),Ce(2,1),v(([l,f])=>l.pathname===f.pathname&&l.hash===f.hash),m(([,l])=>l)).subscribe(l=>{history.scrollRestoration="auto",pr(l.hash),history.scrollRestoration="manual",history.back()}),c.pipe(ae(e)).subscribe(([,l])=>{var f,u;history.state!==null||!l.hash?window.scrollTo(0,(u=(f=history.state)==null?void 0:f.y)!=null?u:0):pr(l.hash)}),t.pipe(te("offset"),ye(100)).subscribe(({offset:l})=>{history.replaceState(l,"")}),c}var qn=jt(zn());function Kn(e){let t=e.separator.split("|").map(n=>n.replace(/(\(\?[!=<][^)]+\))/g,"").length===0?"\uFFFD":n).join("|"),r=new RegExp(t,"img"),o=(n,i,s)=>`${i}${s}`;return n=>{n=n.replace(/[\s*+\-:~^]+/g," ").trim();let i=new RegExp(`(^|${e.separator}|)(${n.replace(/[|\\{}()[\]^$+*?.-]/g,"\\$&").replace(r,"|")})`,"img");return s=>(0,qn.default)(s).replace(i,o).replace(/<\/mark>(\s+)]*>/img,"$1")}}function Ht(e){return e.type===1}function dr(e){return e.type===3}function Qn(e,t){let r=ln(e);return L(R(location.protocol!=="file:"),Ne("search")).pipe(Pe(o=>o),w(()=>t)).subscribe(({config:o,docs:n})=>r.next({type:0,data:{config:o,docs:n,options:{suggest:G("search.suggest")}}})),r}function Yn({document$:e}){let t=he(),r=De(new URL("../versions.json",t.base)).pipe(xe(()=>M)),o=r.pipe(m(n=>{let[,i]=t.base.match(/([^/]+)\/?$/);return n.find(({version:s,aliases:a})=>s===i||a.includes(i))||n[0]}));r.pipe(m(n=>new Map(n.map(i=>[`${new URL(`../${i.version}/`,t.base)}`,i]))),w(n=>h(document.body,"click").pipe(v(i=>!i.metaKey&&!i.ctrlKey),ae(o),w(([i,s])=>{if(i.target instanceof Element){let a=i.target.closest("a");if(a&&!a.target&&n.has(a.href)){let c=a.href;return!i.target.closest(".md-version")&&n.get(c)===s?M:(i.preventDefault(),R(c))}}return M}),w(i=>{let{version:s}=n.get(i);return ur(new URL(i)).pipe(m(a=>{let 
p=me().href.replace(t.base,"");return a.includes(p.split("#")[0])?new URL(`../${s}/${p}`,t.base):new URL(i)}))})))).subscribe(n=>st(n,!0)),B([r,o]).subscribe(([n,i])=>{U(".md-header__topic").appendChild(gn(n,i))}),e.pipe(w(()=>o)).subscribe(n=>{var s;let i=__md_get("__outdated",sessionStorage);if(i===null){i=!0;let a=((s=t.version)==null?void 0:s.default)||"latest";Array.isArray(a)||(a=[a]);e:for(let c of a)for(let p of n.aliases.concat(n.version))if(new RegExp(c,"i").test(p)){i=!1;break e}__md_set("__outdated",i,sessionStorage)}if(i)for(let a of ne("outdated"))a.hidden=!1})}function Da(e,{worker$:t}){let{searchParams:r}=me();r.has("q")&&(Ye("search",!0),e.value=r.get("q"),e.focus(),Ne("search").pipe(Pe(i=>!i)).subscribe(()=>{let i=me();i.searchParams.delete("q"),history.replaceState({},"",`${i}`)}));let o=vt(e),n=L(t.pipe(Pe(Ht)),h(e,"keyup"),o).pipe(m(()=>e.value),X());return B([n,o]).pipe(m(([i,s])=>({value:i,focus:s})),Z(1))}function Bn(e,{worker$:t}){let r=new x,o=r.pipe(ee(),oe(!0));B([t.pipe(Pe(Ht)),r],(i,s)=>s).pipe(te("value")).subscribe(({value:i})=>t.next({type:2,data:i})),r.pipe(te("focus")).subscribe(({focus:i})=>{i&&Ye("search",i)}),h(e.form,"reset").pipe(j(o)).subscribe(()=>e.focus());let n=U("header [for=__search]");return h(n,"click").subscribe(()=>e.focus()),Da(e,{worker$:t}).pipe(T(i=>r.next(i)),A(()=>r.complete()),m(i=>P({ref:e},i)),Z(1))}function Gn(e,{worker$:t,query$:r}){let o=new x,n=Go(e.parentElement).pipe(v(Boolean)),i=e.parentElement,s=U(":scope > :first-child",e),a=U(":scope > :last-child",e);Ne("search").subscribe(l=>a.setAttribute("role",l?"list":"presentation")),o.pipe(ae(r),Wr(t.pipe(Pe(Ht)))).subscribe(([{items:l},{value:f}])=>{switch(l.length){case 0:s.textContent=f.length?we("search.result.none"):we("search.result.placeholder");break;case 1:s.textContent=we("search.result.one");break;default:let u=ar(l.length);s.textContent=we("search.result.other",u)}});let 
c=o.pipe(T(()=>a.innerHTML=""),w(({items:l})=>L(R(...l.slice(0,10)),R(...l.slice(10)).pipe(Ce(4),Nr(n),w(([f])=>f)))),m(hn),de());return c.subscribe(l=>a.appendChild(l)),c.pipe(re(l=>{let f=ce("details",l);return typeof f=="undefined"?M:h(f,"toggle").pipe(j(o),m(()=>f))})).subscribe(l=>{l.open===!1&&l.offsetTop<=i.scrollTop&&i.scrollTo({top:l.offsetTop})}),t.pipe(v(dr),m(({data:l})=>l)).pipe(T(l=>o.next(l)),A(()=>o.complete()),m(l=>P({ref:e},l)))}function Va(e,{query$:t}){return t.pipe(m(({value:r})=>{let o=me();return o.hash="",r=r.replace(/\s+/g,"+").replace(/&/g,"%26").replace(/=/g,"%3D"),o.search=`q=${r}`,{url:o}}))}function Jn(e,t){let r=new x,o=r.pipe(ee(),oe(!0));return r.subscribe(({url:n})=>{e.setAttribute("data-clipboard-text",e.href),e.href=`${n}`}),h(e,"click").pipe(j(o)).subscribe(n=>n.preventDefault()),Va(e,t).pipe(T(n=>r.next(n)),A(()=>r.complete()),m(n=>P({ref:e},n)))}function Xn(e,{worker$:t,keyboard$:r}){let o=new x,n=Oe("search-query"),i=L(h(n,"keydown"),h(n,"focus")).pipe(Me(ie),m(()=>n.value),X());return o.pipe(Ze(i),m(([{suggest:a},c])=>{let p=c.split(/([\s-]+)/);if(a!=null&&a.length&&p[p.length-1]){let l=a[a.length-1];l.startsWith(p[p.length-1])&&(p[p.length-1]=l)}else p.length=0;return p})).subscribe(a=>e.innerHTML=a.join("").replace(/\s/g," ")),r.pipe(v(({mode:a})=>a==="search")).subscribe(a=>{switch(a.type){case"ArrowRight":e.innerText.length&&n.selectionStart===n.value.length&&(n.value=e.innerText);break}}),t.pipe(v(dr),m(({data:a})=>a)).pipe(T(a=>o.next(a)),A(()=>o.complete()),m(()=>({ref:e})))}function Zn(e,{index$:t,keyboard$:r}){let o=he();try{let n=Qn(o.search,t),i=Oe("search-query",e),s=Oe("search-result",e);h(e,"click").pipe(v(({target:c})=>c instanceof Element&&!!c.closest("a"))).subscribe(()=>Ye("search",!1)),r.pipe(v(({mode:c})=>c==="search")).subscribe(c=>{let p=Ie();switch(c.type){case"Enter":if(p===i){let l=new Map;for(let f of W(":first-child [href]",s)){let 
u=f.firstElementChild;l.set(f,parseFloat(u.getAttribute("data-md-score")))}if(l.size){let[[f]]=[...l].sort(([,u],[,d])=>d-u);f.click()}c.claim()}break;case"Escape":case"Tab":Ye("search",!1),i.blur();break;case"ArrowUp":case"ArrowDown":if(typeof p=="undefined")i.focus();else{let l=[i,...W(":not(details) > [href], summary, details[open] [href]",s)],f=Math.max(0,(Math.max(0,l.indexOf(p))+l.length+(c.type==="ArrowUp"?-1:1))%l.length);l[f].focus()}c.claim();break;default:i!==Ie()&&i.focus()}}),r.pipe(v(({mode:c})=>c==="global")).subscribe(c=>{switch(c.type){case"f":case"s":case"/":i.focus(),i.select(),c.claim();break}});let a=Bn(i,{worker$:n});return L(a,Gn(s,{worker$:n,query$:a})).pipe(Re(...ne("search-share",e).map(c=>Jn(c,{query$:a})),...ne("search-suggest",e).map(c=>Xn(c,{worker$:n,keyboard$:r}))))}catch(n){return e.hidden=!0,Ke}}function ei(e,{index$:t,location$:r}){return B([t,r.pipe(q(me()),v(o=>!!o.searchParams.get("h")))]).pipe(m(([o,n])=>Kn(o.config)(n.searchParams.get("h"))),m(o=>{var s;let n=new Map,i=document.createNodeIterator(e,NodeFilter.SHOW_TEXT);for(let a=i.nextNode();a;a=i.nextNode())if((s=a.parentElement)!=null&&s.offsetHeight){let c=a.textContent,p=o(c);p.length>c.length&&n.set(a,p)}for(let[a,c]of n){let{childNodes:p}=S("span",null,c);a.replaceWith(...Array.from(p))}return{ref:e,nodes:n}}))}function za(e,{viewport$:t,main$:r}){let o=e.closest(".md-grid"),n=o.offsetTop-o.parentElement.offsetTop;return B([r,t]).pipe(m(([{offset:i,height:s},{offset:{y:a}}])=>(s=s+Math.min(n,Math.max(0,a-i))-n,{height:s,locked:a>=i+n})),X((i,s)=>i.height===s.height&&i.locked===s.locked))}function Br(e,o){var n=o,{header$:t}=n,r=oo(n,["header$"]);let i=U(".md-sidebar__scrollwrap",e),{y:s}=Ue(i);return H(()=>{let a=new x,c=a.pipe(ee(),oe(!0)),p=a.pipe(Le(0,ge));return p.pipe(ae(t)).subscribe({next([{height:l},{height:f}]){i.style.height=`${l-2*s}px`,e.style.top=`${f}px`},complete(){i.style.height="",e.style.top=""}}),p.pipe(Pe()).subscribe(()=>{for(let l of 
W(".md-nav__link--active[href]",e)){if(!l.clientHeight)continue;let f=l.closest(".md-sidebar__scrollwrap");if(typeof f!="undefined"){let u=l.offsetTop-f.offsetTop,{height:d}=le(f);f.scrollTo({top:u-d/2})}}}),fe(W("label[tabindex]",e)).pipe(re(l=>h(l,"click").pipe(Me(ie),m(()=>l),j(c)))).subscribe(l=>{let f=U(`[id="${l.htmlFor}"]`);U(`[aria-labelledby="${l.id}"]`).setAttribute("aria-expanded",`${f.checked}`)}),za(e,r).pipe(T(l=>a.next(l)),A(()=>a.complete()),m(l=>P({ref:e},l)))})}function ti(e,t){if(typeof t!="undefined"){let r=`https://api.github.com/repos/${e}/${t}`;return Lt(De(`${r}/releases/latest`).pipe(xe(()=>M),m(o=>({version:o.tag_name})),$e({})),De(r).pipe(xe(()=>M),m(o=>({stars:o.stargazers_count,forks:o.forks_count})),$e({}))).pipe(m(([o,n])=>P(P({},o),n)))}else{let r=`https://api.github.com/users/${e}`;return De(r).pipe(m(o=>({repositories:o.public_repos})),$e({}))}}function ri(e,t){let r=`https://${e}/api/v4/projects/${encodeURIComponent(t)}`;return De(r).pipe(xe(()=>M),m(({star_count:o,forks_count:n})=>({stars:o,forks:n})),$e({}))}function oi(e){let t=e.match(/^.+github\.com\/([^/]+)\/?([^/]+)?/i);if(t){let[,r,o]=t;return ti(r,o)}if(t=e.match(/^.+?([^/]*gitlab[^/]+)\/(.+?)\/?$/i),t){let[,r,o]=t;return ri(r,o)}return M}var qa;function Ka(e){return qa||(qa=H(()=>{let t=__md_get("__source",sessionStorage);if(t)return R(t);if(ne("consent").length){let o=__md_get("__consent");if(!(o&&o.github))return M}return oi(e.href).pipe(T(o=>__md_set("__source",o,sessionStorage)))}).pipe(xe(()=>M),v(t=>Object.keys(t).length>0),m(t=>({facts:t})),Z(1)))}function ni(e){let t=U(":scope > :last-child",e);return H(()=>{let r=new x;return r.subscribe(({facts:o})=>{t.appendChild(bn(o)),t.classList.add("md-source__repository--active")}),Ka(e).pipe(T(o=>r.next(o)),A(()=>r.complete()),m(o=>P({ref:e},o)))})}function Qa(e,{viewport$:t,header$:r}){return Se(document.body).pipe(w(()=>mr(e,{header$:r,viewport$:t})),m(({offset:{y:o}})=>({hidden:o>=10})),te("hidden"))}function 
ii(e,t){return H(()=>{let r=new x;return r.subscribe({next({hidden:o}){e.hidden=o},complete(){e.hidden=!1}}),(G("navigation.tabs.sticky")?R({hidden:!1}):Qa(e,t)).pipe(T(o=>r.next(o)),A(()=>r.complete()),m(o=>P({ref:e},o)))})}function Ya(e,{viewport$:t,header$:r}){let o=new Map,n=W("[href^=\\#]",e);for(let a of n){let c=decodeURIComponent(a.hash.substring(1)),p=ce(`[id="${c}"]`);typeof p!="undefined"&&o.set(a,p)}let i=r.pipe(te("height"),m(({height:a})=>{let c=Oe("main"),p=U(":scope > :first-child",c);return a+.8*(p.offsetTop-c.offsetTop)}),de());return Se(document.body).pipe(te("height"),w(a=>H(()=>{let c=[];return R([...o].reduce((p,[l,f])=>{for(;c.length&&o.get(c[c.length-1]).tagName>=f.tagName;)c.pop();let u=f.offsetTop;for(;!u&&f.parentElement;)f=f.parentElement,u=f.offsetTop;let d=f.offsetParent;for(;d;d=d.offsetParent)u+=d.offsetTop;return p.set([...c=[...c,l]].reverse(),u)},new Map))}).pipe(m(c=>new Map([...c].sort(([,p],[,l])=>p-l))),Ze(i),w(([c,p])=>t.pipe(Fr(([l,f],{offset:{y:u},size:d})=>{let y=u+d.height>=Math.floor(a.height);for(;f.length;){let[,b]=f[0];if(b-p=u&&!y)f=[l.pop(),...f];else break}return[l,f]},[[],[...c]]),X((l,f)=>l[0]===f[0]&&l[1]===f[1])))))).pipe(m(([a,c])=>({prev:a.map(([p])=>p),next:c.map(([p])=>p)})),q({prev:[],next:[]}),Ce(2,1),m(([a,c])=>a.prev.length{let i=new x,s=i.pipe(ee(),oe(!0));if(i.subscribe(({prev:a,next:c})=>{for(let[p]of c)p.classList.remove("md-nav__link--passed"),p.classList.remove("md-nav__link--active");for(let[p,[l]]of a.entries())l.classList.add("md-nav__link--passed"),l.classList.toggle("md-nav__link--active",p===a.length-1)}),G("toc.follow")){let a=L(t.pipe(ye(1),m(()=>{})),t.pipe(ye(250),m(()=>"smooth")));i.pipe(v(({prev:c})=>c.length>0),Ze(o.pipe(Me(ie))),ae(a)).subscribe(([[{prev:c}],p])=>{let[l]=c[c.length-1];if(l.offsetHeight){let f=sr(l);if(typeof f!="undefined"){let u=l.offsetTop-f.offsetTop,{height:d}=le(f);f.scrollTo({top:u-d/2,behavior:p})}}})}return 
G("navigation.tracking")&&t.pipe(j(s),te("offset"),ye(250),Ee(1),j(n.pipe(Ee(1))),at({delay:250}),ae(i)).subscribe(([,{prev:a}])=>{let c=me(),p=a[a.length-1];if(p&&p.length){let[l]=p,{hash:f}=new URL(l.href);c.hash!==f&&(c.hash=f,history.replaceState({},"",`${c}`))}else c.hash="",history.replaceState({},"",`${c}`)}),Ya(e,{viewport$:t,header$:r}).pipe(T(a=>i.next(a)),A(()=>i.complete()),m(a=>P({ref:e},a)))})}function Ba(e,{viewport$:t,main$:r,target$:o}){let n=t.pipe(m(({offset:{y:s}})=>s),Ce(2,1),m(([s,a])=>s>a&&a>0),X()),i=r.pipe(m(({active:s})=>s));return B([i,n]).pipe(m(([s,a])=>!(s&&a)),X(),j(o.pipe(Ee(1))),oe(!0),at({delay:250}),m(s=>({hidden:s})))}function si(e,{viewport$:t,header$:r,main$:o,target$:n}){let i=new x,s=i.pipe(ee(),oe(!0));return i.subscribe({next({hidden:a}){e.hidden=a,a?(e.setAttribute("tabindex","-1"),e.blur()):e.removeAttribute("tabindex")},complete(){e.style.top="",e.hidden=!0,e.removeAttribute("tabindex")}}),r.pipe(j(s),te("height")).subscribe(({height:a})=>{e.style.top=`${a+16}px`}),h(e,"click").subscribe(a=>{a.preventDefault(),window.scrollTo({top:0})}),Ba(e,{viewport$:t,main$:o,target$:n}).pipe(T(a=>i.next(a)),A(()=>i.complete()),m(a=>P({ref:e},a)))}function ci({document$:e}){e.pipe(w(()=>W(".md-ellipsis")),re(t=>yt(t).pipe(j(e.pipe(Ee(1))),v(r=>r),m(()=>t),ue(1))),v(t=>t.offsetWidth{let r=t.innerText,o=t.closest("a")||t;return o.title=r,Be(o).pipe(j(e.pipe(Ee(1))),A(()=>o.removeAttribute("title")))})).subscribe(),e.pipe(w(()=>W(".md-status")),re(t=>Be(t))).subscribe()}function pi({document$:e,tablet$:t}){e.pipe(w(()=>W(".md-toggle--indeterminate")),T(r=>{r.indeterminate=!0,r.checked=!1}),re(r=>h(r,"change").pipe(Ur(()=>r.classList.contains("md-toggle--indeterminate")),m(()=>r))),ae(t)).subscribe(([r,o])=>{r.classList.remove("md-toggle--indeterminate"),o&&(r.checked=!1)})}function Ga(){return/(iPad|iPhone|iPod)/.test(navigator.userAgent)}function 
li({document$:e}){e.pipe(w(()=>W("[data-md-scrollfix]")),T(t=>t.removeAttribute("data-md-scrollfix")),v(Ga),re(t=>h(t,"touchstart").pipe(m(()=>t)))).subscribe(t=>{let r=t.scrollTop;r===0?t.scrollTop=1:r+t.offsetHeight===t.scrollHeight&&(t.scrollTop=r-1)})}function mi({viewport$:e,tablet$:t}){B([Ne("search"),t]).pipe(m(([r,o])=>r&&!o),w(r=>R(r).pipe(Qe(r?400:100))),ae(e)).subscribe(([r,{offset:{y:o}}])=>{if(r)document.body.setAttribute("data-md-scrolllock",""),document.body.style.top=`-${o}px`;else{let n=-1*parseInt(document.body.style.top,10);document.body.removeAttribute("data-md-scrolllock"),document.body.style.top="",n&&window.scrollTo(0,n)}})}Object.entries||(Object.entries=function(e){let t=[];for(let r of Object.keys(e))t.push([r,e[r]]);return t});Object.values||(Object.values=function(e){let t=[];for(let r of Object.keys(e))t.push(e[r]);return t});typeof Element!="undefined"&&(Element.prototype.scrollTo||(Element.prototype.scrollTo=function(e,t){typeof e=="object"?(this.scrollLeft=e.left,this.scrollTop=e.top):(this.scrollLeft=e,this.scrollTop=t)}),Element.prototype.replaceWith||(Element.prototype.replaceWith=function(...e){let t=this.parentNode;if(t){e.length===0&&t.removeChild(this);for(let r=e.length-1;r>=0;r--){let o=e[r];typeof o=="string"?o=document.createTextNode(o):o.parentNode&&o.parentNode.removeChild(o),r?t.insertBefore(this.previousSibling,o):t.replaceChild(o,this)}}}));function Ja(){return location.protocol==="file:"?gt(`${new URL("search/search_index.js",Gr.base)}`).pipe(m(()=>__index),Z(1)):De(new URL("search/search_index.json",Gr.base))}document.documentElement.classList.remove("no-js");document.documentElement.classList.add("js");var rt=zo(),Pt=Zo(),wt=tn(Pt),Jr=Xo(),_e=pn(),hr=At("(min-width: 960px)"),ui=At("(min-width: 1220px)"),di=rn(),Gr=he(),hi=document.forms.namedItem("search")?Ja():Ke,Xr=new x;Un({alert$:Xr});var Zr=new x;G("navigation.instant")&&Dn({location$:Pt,viewport$:_e,progress$:Zr}).subscribe(rt);var 
fi;((fi=Gr.version)==null?void 0:fi.provider)==="mike"&&Yn({document$:rt});L(Pt,wt).pipe(Qe(125)).subscribe(()=>{Ye("drawer",!1),Ye("search",!1)});Jr.pipe(v(({mode:e})=>e==="global")).subscribe(e=>{switch(e.type){case"p":case",":let t=ce("link[rel=prev]");typeof t!="undefined"&&st(t);break;case"n":case".":let r=ce("link[rel=next]");typeof r!="undefined"&&st(r);break;case"Enter":let o=Ie();o instanceof HTMLLabelElement&&o.click()}});ci({document$:rt});pi({document$:rt,tablet$:hr});li({document$:rt});mi({viewport$:_e,tablet$:hr});var tt=Pn(Oe("header"),{viewport$:_e}),$t=rt.pipe(m(()=>Oe("main")),w(e=>Fn(e,{viewport$:_e,header$:tt})),Z(1)),Xa=L(...ne("consent").map(e=>fn(e,{target$:wt})),...ne("dialog").map(e=>$n(e,{alert$:Xr})),...ne("header").map(e=>Rn(e,{viewport$:_e,header$:tt,main$:$t})),...ne("palette").map(e=>jn(e)),...ne("progress").map(e=>Wn(e,{progress$:Zr})),...ne("search").map(e=>Zn(e,{index$:hi,keyboard$:Jr})),...ne("source").map(e=>ni(e))),Za=H(()=>L(...ne("announce").map(e=>mn(e)),...ne("content").map(e=>Hn(e,{viewport$:_e,target$:wt,print$:di})),...ne("content").map(e=>G("search.highlight")?ei(e,{index$:hi,location$:Pt}):M),...ne("header-title").map(e=>In(e,{viewport$:_e,header$:tt})),...ne("sidebar").map(e=>e.getAttribute("data-md-type")==="navigation"?Dr(ui,()=>Br(e,{viewport$:_e,header$:tt,main$:$t})):Dr(hr,()=>Br(e,{viewport$:_e,header$:tt,main$:$t}))),...ne("tabs").map(e=>ii(e,{viewport$:_e,header$:tt})),...ne("toc").map(e=>ai(e,{viewport$:_e,header$:tt,main$:$t,target$:wt})),...ne("top").map(e=>si(e,{viewport$:_e,header$:tt,main$:$t,target$:wt})))),bi=rt.pipe(w(()=>Za),Re(Xa),Z(1));bi.subscribe();window.document$=rt;window.location$=Pt;window.target$=wt;window.keyboard$=Jr;window.viewport$=_e;window.tablet$=hr;window.screen$=ui;window.print$=di;window.alert$=Xr;window.progress$=Zr;window.component$=bi;})(); +//# sourceMappingURL=bundle.d7c377c4.min.js.map + diff --git a/assets/javascripts/bundle.d7c377c4.min.js.map 
b/assets/javascripts/bundle.d7c377c4.min.js.map new file mode 100644 index 0000000000..a57d388af0 --- /dev/null +++ b/assets/javascripts/bundle.d7c377c4.min.js.map @@ -0,0 +1,7 @@ +{ + "version": 3, + "sources": ["node_modules/focus-visible/dist/focus-visible.js", "node_modules/clipboard/dist/clipboard.js", "node_modules/escape-html/index.js", "src/templates/assets/javascripts/bundle.ts", "node_modules/rxjs/node_modules/tslib/tslib.es6.js", "node_modules/rxjs/src/internal/util/isFunction.ts", "node_modules/rxjs/src/internal/util/createErrorClass.ts", "node_modules/rxjs/src/internal/util/UnsubscriptionError.ts", "node_modules/rxjs/src/internal/util/arrRemove.ts", "node_modules/rxjs/src/internal/Subscription.ts", "node_modules/rxjs/src/internal/config.ts", "node_modules/rxjs/src/internal/scheduler/timeoutProvider.ts", "node_modules/rxjs/src/internal/util/reportUnhandledError.ts", "node_modules/rxjs/src/internal/util/noop.ts", "node_modules/rxjs/src/internal/NotificationFactories.ts", "node_modules/rxjs/src/internal/util/errorContext.ts", "node_modules/rxjs/src/internal/Subscriber.ts", "node_modules/rxjs/src/internal/symbol/observable.ts", "node_modules/rxjs/src/internal/util/identity.ts", "node_modules/rxjs/src/internal/util/pipe.ts", "node_modules/rxjs/src/internal/Observable.ts", "node_modules/rxjs/src/internal/util/lift.ts", "node_modules/rxjs/src/internal/operators/OperatorSubscriber.ts", "node_modules/rxjs/src/internal/scheduler/animationFrameProvider.ts", "node_modules/rxjs/src/internal/util/ObjectUnsubscribedError.ts", "node_modules/rxjs/src/internal/Subject.ts", "node_modules/rxjs/src/internal/scheduler/dateTimestampProvider.ts", "node_modules/rxjs/src/internal/ReplaySubject.ts", "node_modules/rxjs/src/internal/scheduler/Action.ts", "node_modules/rxjs/src/internal/scheduler/intervalProvider.ts", "node_modules/rxjs/src/internal/scheduler/AsyncAction.ts", "node_modules/rxjs/src/internal/Scheduler.ts", 
"node_modules/rxjs/src/internal/scheduler/AsyncScheduler.ts", "node_modules/rxjs/src/internal/scheduler/async.ts", "node_modules/rxjs/src/internal/scheduler/AnimationFrameAction.ts", "node_modules/rxjs/src/internal/scheduler/AnimationFrameScheduler.ts", "node_modules/rxjs/src/internal/scheduler/animationFrame.ts", "node_modules/rxjs/src/internal/observable/empty.ts", "node_modules/rxjs/src/internal/util/isScheduler.ts", "node_modules/rxjs/src/internal/util/args.ts", "node_modules/rxjs/src/internal/util/isArrayLike.ts", "node_modules/rxjs/src/internal/util/isPromise.ts", "node_modules/rxjs/src/internal/util/isInteropObservable.ts", "node_modules/rxjs/src/internal/util/isAsyncIterable.ts", "node_modules/rxjs/src/internal/util/throwUnobservableError.ts", "node_modules/rxjs/src/internal/symbol/iterator.ts", "node_modules/rxjs/src/internal/util/isIterable.ts", "node_modules/rxjs/src/internal/util/isReadableStreamLike.ts", "node_modules/rxjs/src/internal/observable/innerFrom.ts", "node_modules/rxjs/src/internal/util/executeSchedule.ts", "node_modules/rxjs/src/internal/operators/observeOn.ts", "node_modules/rxjs/src/internal/operators/subscribeOn.ts", "node_modules/rxjs/src/internal/scheduled/scheduleObservable.ts", "node_modules/rxjs/src/internal/scheduled/schedulePromise.ts", "node_modules/rxjs/src/internal/scheduled/scheduleArray.ts", "node_modules/rxjs/src/internal/scheduled/scheduleIterable.ts", "node_modules/rxjs/src/internal/scheduled/scheduleAsyncIterable.ts", "node_modules/rxjs/src/internal/scheduled/scheduleReadableStreamLike.ts", "node_modules/rxjs/src/internal/scheduled/scheduled.ts", "node_modules/rxjs/src/internal/observable/from.ts", "node_modules/rxjs/src/internal/observable/of.ts", "node_modules/rxjs/src/internal/observable/throwError.ts", "node_modules/rxjs/src/internal/util/EmptyError.ts", "node_modules/rxjs/src/internal/util/isDate.ts", "node_modules/rxjs/src/internal/operators/map.ts", "node_modules/rxjs/src/internal/util/mapOneOrManyArgs.ts", 
"node_modules/rxjs/src/internal/util/argsArgArrayOrObject.ts", "node_modules/rxjs/src/internal/util/createObject.ts", "node_modules/rxjs/src/internal/observable/combineLatest.ts", "node_modules/rxjs/src/internal/operators/mergeInternals.ts", "node_modules/rxjs/src/internal/operators/mergeMap.ts", "node_modules/rxjs/src/internal/operators/mergeAll.ts", "node_modules/rxjs/src/internal/operators/concatAll.ts", "node_modules/rxjs/src/internal/observable/concat.ts", "node_modules/rxjs/src/internal/observable/defer.ts", "node_modules/rxjs/src/internal/observable/fromEvent.ts", "node_modules/rxjs/src/internal/observable/fromEventPattern.ts", "node_modules/rxjs/src/internal/observable/timer.ts", "node_modules/rxjs/src/internal/observable/merge.ts", "node_modules/rxjs/src/internal/observable/never.ts", "node_modules/rxjs/src/internal/util/argsOrArgArray.ts", "node_modules/rxjs/src/internal/operators/filter.ts", "node_modules/rxjs/src/internal/observable/zip.ts", "node_modules/rxjs/src/internal/operators/audit.ts", "node_modules/rxjs/src/internal/operators/auditTime.ts", "node_modules/rxjs/src/internal/operators/bufferCount.ts", "node_modules/rxjs/src/internal/operators/catchError.ts", "node_modules/rxjs/src/internal/operators/scanInternals.ts", "node_modules/rxjs/src/internal/operators/combineLatest.ts", "node_modules/rxjs/src/internal/operators/combineLatestWith.ts", "node_modules/rxjs/src/internal/operators/debounceTime.ts", "node_modules/rxjs/src/internal/operators/defaultIfEmpty.ts", "node_modules/rxjs/src/internal/operators/take.ts", "node_modules/rxjs/src/internal/operators/ignoreElements.ts", "node_modules/rxjs/src/internal/operators/mapTo.ts", "node_modules/rxjs/src/internal/operators/delayWhen.ts", "node_modules/rxjs/src/internal/operators/delay.ts", "node_modules/rxjs/src/internal/operators/distinctUntilChanged.ts", "node_modules/rxjs/src/internal/operators/distinctUntilKeyChanged.ts", "node_modules/rxjs/src/internal/operators/throwIfEmpty.ts", 
"node_modules/rxjs/src/internal/operators/endWith.ts", "node_modules/rxjs/src/internal/operators/finalize.ts", "node_modules/rxjs/src/internal/operators/first.ts", "node_modules/rxjs/src/internal/operators/takeLast.ts", "node_modules/rxjs/src/internal/operators/merge.ts", "node_modules/rxjs/src/internal/operators/mergeWith.ts", "node_modules/rxjs/src/internal/operators/repeat.ts", "node_modules/rxjs/src/internal/operators/sample.ts", "node_modules/rxjs/src/internal/operators/scan.ts", "node_modules/rxjs/src/internal/operators/share.ts", "node_modules/rxjs/src/internal/operators/shareReplay.ts", "node_modules/rxjs/src/internal/operators/skip.ts", "node_modules/rxjs/src/internal/operators/skipUntil.ts", "node_modules/rxjs/src/internal/operators/startWith.ts", "node_modules/rxjs/src/internal/operators/switchMap.ts", "node_modules/rxjs/src/internal/operators/takeUntil.ts", "node_modules/rxjs/src/internal/operators/takeWhile.ts", "node_modules/rxjs/src/internal/operators/tap.ts", "node_modules/rxjs/src/internal/operators/throttle.ts", "node_modules/rxjs/src/internal/operators/throttleTime.ts", "node_modules/rxjs/src/internal/operators/withLatestFrom.ts", "node_modules/rxjs/src/internal/operators/zip.ts", "node_modules/rxjs/src/internal/operators/zipWith.ts", "src/templates/assets/javascripts/browser/document/index.ts", "src/templates/assets/javascripts/browser/element/_/index.ts", "src/templates/assets/javascripts/browser/element/focus/index.ts", "src/templates/assets/javascripts/browser/element/hover/index.ts", "src/templates/assets/javascripts/browser/element/offset/_/index.ts", "src/templates/assets/javascripts/browser/element/offset/content/index.ts", "src/templates/assets/javascripts/utilities/h/index.ts", "src/templates/assets/javascripts/utilities/round/index.ts", "src/templates/assets/javascripts/browser/script/index.ts", "src/templates/assets/javascripts/browser/element/size/_/index.ts", "src/templates/assets/javascripts/browser/element/size/content/index.ts", 
"src/templates/assets/javascripts/browser/element/visibility/index.ts", "src/templates/assets/javascripts/browser/toggle/index.ts", "src/templates/assets/javascripts/browser/keyboard/index.ts", "src/templates/assets/javascripts/browser/location/_/index.ts", "src/templates/assets/javascripts/browser/location/hash/index.ts", "src/templates/assets/javascripts/browser/media/index.ts", "src/templates/assets/javascripts/browser/request/index.ts", "src/templates/assets/javascripts/browser/viewport/offset/index.ts", "src/templates/assets/javascripts/browser/viewport/size/index.ts", "src/templates/assets/javascripts/browser/viewport/_/index.ts", "src/templates/assets/javascripts/browser/viewport/at/index.ts", "src/templates/assets/javascripts/browser/worker/index.ts", "src/templates/assets/javascripts/_/index.ts", "src/templates/assets/javascripts/components/_/index.ts", "src/templates/assets/javascripts/components/announce/index.ts", "src/templates/assets/javascripts/components/consent/index.ts", "src/templates/assets/javascripts/templates/tooltip/index.tsx", "src/templates/assets/javascripts/templates/annotation/index.tsx", "src/templates/assets/javascripts/templates/clipboard/index.tsx", "src/templates/assets/javascripts/templates/search/index.tsx", "src/templates/assets/javascripts/templates/source/index.tsx", "src/templates/assets/javascripts/templates/tabbed/index.tsx", "src/templates/assets/javascripts/templates/table/index.tsx", "src/templates/assets/javascripts/templates/version/index.tsx", "src/templates/assets/javascripts/components/tooltip/index.ts", "src/templates/assets/javascripts/components/content/annotation/_/index.ts", "src/templates/assets/javascripts/components/content/annotation/list/index.ts", "src/templates/assets/javascripts/components/content/annotation/block/index.ts", "src/templates/assets/javascripts/components/content/code/_/index.ts", "src/templates/assets/javascripts/components/content/details/index.ts", 
"src/templates/assets/javascripts/components/content/mermaid/index.css", "src/templates/assets/javascripts/components/content/mermaid/index.ts", "src/templates/assets/javascripts/components/content/table/index.ts", "src/templates/assets/javascripts/components/content/tabs/index.ts", "src/templates/assets/javascripts/components/content/_/index.ts", "src/templates/assets/javascripts/components/dialog/index.ts", "src/templates/assets/javascripts/components/header/_/index.ts", "src/templates/assets/javascripts/components/header/title/index.ts", "src/templates/assets/javascripts/components/main/index.ts", "src/templates/assets/javascripts/components/palette/index.ts", "src/templates/assets/javascripts/components/progress/index.ts", "src/templates/assets/javascripts/integrations/clipboard/index.ts", "src/templates/assets/javascripts/integrations/sitemap/index.ts", "src/templates/assets/javascripts/integrations/instant/index.ts", "src/templates/assets/javascripts/integrations/search/highlighter/index.ts", "src/templates/assets/javascripts/integrations/search/worker/message/index.ts", "src/templates/assets/javascripts/integrations/search/worker/_/index.ts", "src/templates/assets/javascripts/integrations/version/index.ts", "src/templates/assets/javascripts/components/search/query/index.ts", "src/templates/assets/javascripts/components/search/result/index.ts", "src/templates/assets/javascripts/components/search/share/index.ts", "src/templates/assets/javascripts/components/search/suggest/index.ts", "src/templates/assets/javascripts/components/search/_/index.ts", "src/templates/assets/javascripts/components/search/highlight/index.ts", "src/templates/assets/javascripts/components/sidebar/index.ts", "src/templates/assets/javascripts/components/source/facts/github/index.ts", "src/templates/assets/javascripts/components/source/facts/gitlab/index.ts", "src/templates/assets/javascripts/components/source/facts/_/index.ts", 
"src/templates/assets/javascripts/components/source/_/index.ts", "src/templates/assets/javascripts/components/tabs/index.ts", "src/templates/assets/javascripts/components/toc/index.ts", "src/templates/assets/javascripts/components/top/index.ts", "src/templates/assets/javascripts/patches/ellipsis/index.ts", "src/templates/assets/javascripts/patches/indeterminate/index.ts", "src/templates/assets/javascripts/patches/scrollfix/index.ts", "src/templates/assets/javascripts/patches/scrolllock/index.ts", "src/templates/assets/javascripts/polyfills/index.ts"], + "sourcesContent": ["(function (global, factory) {\n typeof exports === 'object' && typeof module !== 'undefined' ? factory() :\n typeof define === 'function' && define.amd ? define(factory) :\n (factory());\n}(this, (function () { 'use strict';\n\n /**\n * Applies the :focus-visible polyfill at the given scope.\n * A scope in this case is either the top-level Document or a Shadow Root.\n *\n * @param {(Document|ShadowRoot)} scope\n * @see https://github.com/WICG/focus-visible\n */\n function applyFocusVisiblePolyfill(scope) {\n var hadKeyboardEvent = true;\n var hadFocusVisibleRecently = false;\n var hadFocusVisibleRecentlyTimeout = null;\n\n var inputTypesAllowlist = {\n text: true,\n search: true,\n url: true,\n tel: true,\n email: true,\n password: true,\n number: true,\n date: true,\n month: true,\n week: true,\n time: true,\n datetime: true,\n 'datetime-local': true\n };\n\n /**\n * Helper function for legacy browsers and iframes which sometimes focus\n * elements like document, body, and non-interactive SVG.\n * @param {Element} el\n */\n function isValidFocusTarget(el) {\n if (\n el &&\n el !== document &&\n el.nodeName !== 'HTML' &&\n el.nodeName !== 'BODY' &&\n 'classList' in el &&\n 'contains' in el.classList\n ) {\n return true;\n }\n return false;\n }\n\n /**\n * Computes whether the given element should automatically trigger the\n * `focus-visible` class being added, i.e. 
whether it should always match\n * `:focus-visible` when focused.\n * @param {Element} el\n * @return {boolean}\n */\n function focusTriggersKeyboardModality(el) {\n var type = el.type;\n var tagName = el.tagName;\n\n if (tagName === 'INPUT' && inputTypesAllowlist[type] && !el.readOnly) {\n return true;\n }\n\n if (tagName === 'TEXTAREA' && !el.readOnly) {\n return true;\n }\n\n if (el.isContentEditable) {\n return true;\n }\n\n return false;\n }\n\n /**\n * Add the `focus-visible` class to the given element if it was not added by\n * the author.\n * @param {Element} el\n */\n function addFocusVisibleClass(el) {\n if (el.classList.contains('focus-visible')) {\n return;\n }\n el.classList.add('focus-visible');\n el.setAttribute('data-focus-visible-added', '');\n }\n\n /**\n * Remove the `focus-visible` class from the given element if it was not\n * originally added by the author.\n * @param {Element} el\n */\n function removeFocusVisibleClass(el) {\n if (!el.hasAttribute('data-focus-visible-added')) {\n return;\n }\n el.classList.remove('focus-visible');\n el.removeAttribute('data-focus-visible-added');\n }\n\n /**\n * If the most recent user interaction was via the keyboard;\n * and the key press did not include a meta, alt/option, or control key;\n * then the modality is keyboard. 
Otherwise, the modality is not keyboard.\n * Apply `focus-visible` to any current active element and keep track\n * of our keyboard modality state with `hadKeyboardEvent`.\n * @param {KeyboardEvent} e\n */\n function onKeyDown(e) {\n if (e.metaKey || e.altKey || e.ctrlKey) {\n return;\n }\n\n if (isValidFocusTarget(scope.activeElement)) {\n addFocusVisibleClass(scope.activeElement);\n }\n\n hadKeyboardEvent = true;\n }\n\n /**\n * If at any point a user clicks with a pointing device, ensure that we change\n * the modality away from keyboard.\n * This avoids the situation where a user presses a key on an already focused\n * element, and then clicks on a different element, focusing it with a\n * pointing device, while we still think we're in keyboard modality.\n * @param {Event} e\n */\n function onPointerDown(e) {\n hadKeyboardEvent = false;\n }\n\n /**\n * On `focus`, add the `focus-visible` class to the target if:\n * - the target received focus as a result of keyboard navigation, or\n * - the event target is an element that will likely require interaction\n * via the keyboard (e.g. 
a text box)\n * @param {Event} e\n */\n function onFocus(e) {\n // Prevent IE from focusing the document or HTML element.\n if (!isValidFocusTarget(e.target)) {\n return;\n }\n\n if (hadKeyboardEvent || focusTriggersKeyboardModality(e.target)) {\n addFocusVisibleClass(e.target);\n }\n }\n\n /**\n * On `blur`, remove the `focus-visible` class from the target.\n * @param {Event} e\n */\n function onBlur(e) {\n if (!isValidFocusTarget(e.target)) {\n return;\n }\n\n if (\n e.target.classList.contains('focus-visible') ||\n e.target.hasAttribute('data-focus-visible-added')\n ) {\n // To detect a tab/window switch, we look for a blur event followed\n // rapidly by a visibility change.\n // If we don't see a visibility change within 100ms, it's probably a\n // regular focus change.\n hadFocusVisibleRecently = true;\n window.clearTimeout(hadFocusVisibleRecentlyTimeout);\n hadFocusVisibleRecentlyTimeout = window.setTimeout(function() {\n hadFocusVisibleRecently = false;\n }, 100);\n removeFocusVisibleClass(e.target);\n }\n }\n\n /**\n * If the user changes tabs, keep track of whether or not the previously\n * focused element had .focus-visible.\n * @param {Event} e\n */\n function onVisibilityChange(e) {\n if (document.visibilityState === 'hidden') {\n // If the tab becomes active again, the browser will handle calling focus\n // on the element (Safari actually calls it twice).\n // If this tab change caused a blur on an element with focus-visible,\n // re-apply the class when the user switches back to the tab.\n if (hadFocusVisibleRecently) {\n hadKeyboardEvent = true;\n }\n addInitialPointerMoveListeners();\n }\n }\n\n /**\n * Add a group of listeners to detect usage of any pointing devices.\n * These listeners will be added when the polyfill first loads, and anytime\n * the window is blurred, so that they are active when the window regains\n * focus.\n */\n function addInitialPointerMoveListeners() {\n document.addEventListener('mousemove', onInitialPointerMove);\n 
document.addEventListener('mousedown', onInitialPointerMove);\n document.addEventListener('mouseup', onInitialPointerMove);\n document.addEventListener('pointermove', onInitialPointerMove);\n document.addEventListener('pointerdown', onInitialPointerMove);\n document.addEventListener('pointerup', onInitialPointerMove);\n document.addEventListener('touchmove', onInitialPointerMove);\n document.addEventListener('touchstart', onInitialPointerMove);\n document.addEventListener('touchend', onInitialPointerMove);\n }\n\n function removeInitialPointerMoveListeners() {\n document.removeEventListener('mousemove', onInitialPointerMove);\n document.removeEventListener('mousedown', onInitialPointerMove);\n document.removeEventListener('mouseup', onInitialPointerMove);\n document.removeEventListener('pointermove', onInitialPointerMove);\n document.removeEventListener('pointerdown', onInitialPointerMove);\n document.removeEventListener('pointerup', onInitialPointerMove);\n document.removeEventListener('touchmove', onInitialPointerMove);\n document.removeEventListener('touchstart', onInitialPointerMove);\n document.removeEventListener('touchend', onInitialPointerMove);\n }\n\n /**\n * When the polfyill first loads, assume the user is in keyboard modality.\n * If any event is received from a pointing device (e.g. mouse, pointer,\n * touch), turn off keyboard modality.\n * This accounts for situations where focus enters the page from the URL bar.\n * @param {Event} e\n */\n function onInitialPointerMove(e) {\n // Work around a Safari quirk that fires a mousemove on whenever the\n // window blurs, even if you're tabbing out of the page. \u00AF\\_(\u30C4)_/\u00AF\n if (e.target.nodeName && e.target.nodeName.toLowerCase() === 'html') {\n return;\n }\n\n hadKeyboardEvent = false;\n removeInitialPointerMoveListeners();\n }\n\n // For some kinds of state, we are interested in changes at the global scope\n // only. 
For example, global pointer input, global key presses and global\n // visibility change should affect the state at every scope:\n document.addEventListener('keydown', onKeyDown, true);\n document.addEventListener('mousedown', onPointerDown, true);\n document.addEventListener('pointerdown', onPointerDown, true);\n document.addEventListener('touchstart', onPointerDown, true);\n document.addEventListener('visibilitychange', onVisibilityChange, true);\n\n addInitialPointerMoveListeners();\n\n // For focus and blur, we specifically care about state changes in the local\n // scope. This is because focus / blur events that originate from within a\n // shadow root are not re-dispatched from the host element if it was already\n // the active element in its own scope:\n scope.addEventListener('focus', onFocus, true);\n scope.addEventListener('blur', onBlur, true);\n\n // We detect that a node is a ShadowRoot by ensuring that it is a\n // DocumentFragment and also has a host property. This check covers native\n // implementation and polyfill implementation transparently. If we only cared\n // about the native implementation, we could just check if the scope was\n // an instance of a ShadowRoot.\n if (scope.nodeType === Node.DOCUMENT_FRAGMENT_NODE && scope.host) {\n // Since a ShadowRoot is a special kind of DocumentFragment, it does not\n // have a root element to add a class to. 
So, we add this attribute to the\n // host element instead:\n scope.host.setAttribute('data-js-focus-visible', '');\n } else if (scope.nodeType === Node.DOCUMENT_NODE) {\n document.documentElement.classList.add('js-focus-visible');\n document.documentElement.setAttribute('data-js-focus-visible', '');\n }\n }\n\n // It is important to wrap all references to global window and document in\n // these checks to support server-side rendering use cases\n // @see https://github.com/WICG/focus-visible/issues/199\n if (typeof window !== 'undefined' && typeof document !== 'undefined') {\n // Make the polyfill helper globally available. This can be used as a signal\n // to interested libraries that wish to coordinate with the polyfill for e.g.,\n // applying the polyfill to a shadow root:\n window.applyFocusVisiblePolyfill = applyFocusVisiblePolyfill;\n\n // Notify interested libraries of the polyfill's presence, in case the\n // polyfill was loaded lazily:\n var event;\n\n try {\n event = new CustomEvent('focus-visible-polyfill-ready');\n } catch (error) {\n // IE11 does not support using CustomEvent as a constructor directly:\n event = document.createEvent('CustomEvent');\n event.initCustomEvent('focus-visible-polyfill-ready', false, false, {});\n }\n\n window.dispatchEvent(event);\n }\n\n if (typeof document !== 'undefined') {\n // Apply the polyfill to the global document, so that no JavaScript\n // coordination is required to use the polyfill in the top-level document:\n applyFocusVisiblePolyfill(document);\n }\n\n})));\n", "/*!\n * clipboard.js v2.0.11\n * https://clipboardjs.com/\n *\n * Licensed MIT \u00A9 Zeno Rocha\n */\n(function webpackUniversalModuleDefinition(root, factory) {\n\tif(typeof exports === 'object' && typeof module === 'object')\n\t\tmodule.exports = factory();\n\telse if(typeof define === 'function' && define.amd)\n\t\tdefine([], factory);\n\telse if(typeof exports === 'object')\n\t\texports[\"ClipboardJS\"] = 
factory();\n\telse\n\t\troot[\"ClipboardJS\"] = factory();\n})(this, function() {\nreturn /******/ (function() { // webpackBootstrap\n/******/ \tvar __webpack_modules__ = ({\n\n/***/ 686:\n/***/ (function(__unused_webpack_module, __webpack_exports__, __webpack_require__) {\n\n\"use strict\";\n\n// EXPORTS\n__webpack_require__.d(__webpack_exports__, {\n \"default\": function() { return /* binding */ clipboard; }\n});\n\n// EXTERNAL MODULE: ./node_modules/tiny-emitter/index.js\nvar tiny_emitter = __webpack_require__(279);\nvar tiny_emitter_default = /*#__PURE__*/__webpack_require__.n(tiny_emitter);\n// EXTERNAL MODULE: ./node_modules/good-listener/src/listen.js\nvar listen = __webpack_require__(370);\nvar listen_default = /*#__PURE__*/__webpack_require__.n(listen);\n// EXTERNAL MODULE: ./node_modules/select/src/select.js\nvar src_select = __webpack_require__(817);\nvar select_default = /*#__PURE__*/__webpack_require__.n(src_select);\n;// CONCATENATED MODULE: ./src/common/command.js\n/**\n * Executes a given operation type.\n * @param {String} type\n * @return {Boolean}\n */\nfunction command(type) {\n try {\n return document.execCommand(type);\n } catch (err) {\n return false;\n }\n}\n;// CONCATENATED MODULE: ./src/actions/cut.js\n\n\n/**\n * Cut action wrapper.\n * @param {String|HTMLElement} target\n * @return {String}\n */\n\nvar ClipboardActionCut = function ClipboardActionCut(target) {\n var selectedText = select_default()(target);\n command('cut');\n return selectedText;\n};\n\n/* harmony default export */ var actions_cut = (ClipboardActionCut);\n;// CONCATENATED MODULE: ./src/common/create-fake-element.js\n/**\n * Creates a fake textarea element with a value.\n * @param {String} value\n * @return {HTMLElement}\n */\nfunction createFakeElement(value) {\n var isRTL = document.documentElement.getAttribute('dir') === 'rtl';\n var fakeElement = document.createElement('textarea'); // Prevent zooming on iOS\n\n fakeElement.style.fontSize = '12pt'; // Reset box 
model\n\n fakeElement.style.border = '0';\n fakeElement.style.padding = '0';\n fakeElement.style.margin = '0'; // Move element out of screen horizontally\n\n fakeElement.style.position = 'absolute';\n fakeElement.style[isRTL ? 'right' : 'left'] = '-9999px'; // Move element to the same position vertically\n\n var yPosition = window.pageYOffset || document.documentElement.scrollTop;\n fakeElement.style.top = \"\".concat(yPosition, \"px\");\n fakeElement.setAttribute('readonly', '');\n fakeElement.value = value;\n return fakeElement;\n}\n;// CONCATENATED MODULE: ./src/actions/copy.js\n\n\n\n/**\n * Create fake copy action wrapper using a fake element.\n * @param {String} target\n * @param {Object} options\n * @return {String}\n */\n\nvar fakeCopyAction = function fakeCopyAction(value, options) {\n var fakeElement = createFakeElement(value);\n options.container.appendChild(fakeElement);\n var selectedText = select_default()(fakeElement);\n command('copy');\n fakeElement.remove();\n return selectedText;\n};\n/**\n * Copy action wrapper.\n * @param {String|HTMLElement} target\n * @param {Object} options\n * @return {String}\n */\n\n\nvar ClipboardActionCopy = function ClipboardActionCopy(target) {\n var options = arguments.length > 1 && arguments[1] !== undefined ? arguments[1] : {\n container: document.body\n };\n var selectedText = '';\n\n if (typeof target === 'string') {\n selectedText = fakeCopyAction(target, options);\n } else if (target instanceof HTMLInputElement && !['text', 'search', 'url', 'tel', 'password'].includes(target === null || target === void 0 ? void 0 : target.type)) {\n // If input type doesn't support `setSelectionRange`. Simulate it. 
https://developer.mozilla.org/en-US/docs/Web/API/HTMLInputElement/setSelectionRange\n selectedText = fakeCopyAction(target.value, options);\n } else {\n selectedText = select_default()(target);\n command('copy');\n }\n\n return selectedText;\n};\n\n/* harmony default export */ var actions_copy = (ClipboardActionCopy);\n;// CONCATENATED MODULE: ./src/actions/default.js\nfunction _typeof(obj) { \"@babel/helpers - typeof\"; if (typeof Symbol === \"function\" && typeof Symbol.iterator === \"symbol\") { _typeof = function _typeof(obj) { return typeof obj; }; } else { _typeof = function _typeof(obj) { return obj && typeof Symbol === \"function\" && obj.constructor === Symbol && obj !== Symbol.prototype ? \"symbol\" : typeof obj; }; } return _typeof(obj); }\n\n\n\n/**\n * Inner function which performs selection from either `text` or `target`\n * properties and then executes copy or cut operations.\n * @param {Object} options\n */\n\nvar ClipboardActionDefault = function ClipboardActionDefault() {\n var options = arguments.length > 0 && arguments[0] !== undefined ? arguments[0] : {};\n // Defines base properties passed from constructor.\n var _options$action = options.action,\n action = _options$action === void 0 ? 'copy' : _options$action,\n container = options.container,\n target = options.target,\n text = options.text; // Sets the `action` to be performed which can be either 'copy' or 'cut'.\n\n if (action !== 'copy' && action !== 'cut') {\n throw new Error('Invalid \"action\" value, use either \"copy\" or \"cut\"');\n } // Sets the `target` property using an element that will be have its content copied.\n\n\n if (target !== undefined) {\n if (target && _typeof(target) === 'object' && target.nodeType === 1) {\n if (action === 'copy' && target.hasAttribute('disabled')) {\n throw new Error('Invalid \"target\" attribute. 
Please use \"readonly\" instead of \"disabled\" attribute');\n }\n\n if (action === 'cut' && (target.hasAttribute('readonly') || target.hasAttribute('disabled'))) {\n throw new Error('Invalid \"target\" attribute. You can\\'t cut text from elements with \"readonly\" or \"disabled\" attributes');\n }\n } else {\n throw new Error('Invalid \"target\" value, use a valid Element');\n }\n } // Define selection strategy based on `text` property.\n\n\n if (text) {\n return actions_copy(text, {\n container: container\n });\n } // Defines which selection strategy based on `target` property.\n\n\n if (target) {\n return action === 'cut' ? actions_cut(target) : actions_copy(target, {\n container: container\n });\n }\n};\n\n/* harmony default export */ var actions_default = (ClipboardActionDefault);\n;// CONCATENATED MODULE: ./src/clipboard.js\nfunction clipboard_typeof(obj) { \"@babel/helpers - typeof\"; if (typeof Symbol === \"function\" && typeof Symbol.iterator === \"symbol\") { clipboard_typeof = function _typeof(obj) { return typeof obj; }; } else { clipboard_typeof = function _typeof(obj) { return obj && typeof Symbol === \"function\" && obj.constructor === Symbol && obj !== Symbol.prototype ? 
\"symbol\" : typeof obj; }; } return clipboard_typeof(obj); }\n\nfunction _classCallCheck(instance, Constructor) { if (!(instance instanceof Constructor)) { throw new TypeError(\"Cannot call a class as a function\"); } }\n\nfunction _defineProperties(target, props) { for (var i = 0; i < props.length; i++) { var descriptor = props[i]; descriptor.enumerable = descriptor.enumerable || false; descriptor.configurable = true; if (\"value\" in descriptor) descriptor.writable = true; Object.defineProperty(target, descriptor.key, descriptor); } }\n\nfunction _createClass(Constructor, protoProps, staticProps) { if (protoProps) _defineProperties(Constructor.prototype, protoProps); if (staticProps) _defineProperties(Constructor, staticProps); return Constructor; }\n\nfunction _inherits(subClass, superClass) { if (typeof superClass !== \"function\" && superClass !== null) { throw new TypeError(\"Super expression must either be null or a function\"); } subClass.prototype = Object.create(superClass && superClass.prototype, { constructor: { value: subClass, writable: true, configurable: true } }); if (superClass) _setPrototypeOf(subClass, superClass); }\n\nfunction _setPrototypeOf(o, p) { _setPrototypeOf = Object.setPrototypeOf || function _setPrototypeOf(o, p) { o.__proto__ = p; return o; }; return _setPrototypeOf(o, p); }\n\nfunction _createSuper(Derived) { var hasNativeReflectConstruct = _isNativeReflectConstruct(); return function _createSuperInternal() { var Super = _getPrototypeOf(Derived), result; if (hasNativeReflectConstruct) { var NewTarget = _getPrototypeOf(this).constructor; result = Reflect.construct(Super, arguments, NewTarget); } else { result = Super.apply(this, arguments); } return _possibleConstructorReturn(this, result); }; }\n\nfunction _possibleConstructorReturn(self, call) { if (call && (clipboard_typeof(call) === \"object\" || typeof call === \"function\")) { return call; } return _assertThisInitialized(self); }\n\nfunction _assertThisInitialized(self) { if 
(self === void 0) { throw new ReferenceError(\"this hasn't been initialised - super() hasn't been called\"); } return self; }\n\nfunction _isNativeReflectConstruct() { if (typeof Reflect === \"undefined\" || !Reflect.construct) return false; if (Reflect.construct.sham) return false; if (typeof Proxy === \"function\") return true; try { Date.prototype.toString.call(Reflect.construct(Date, [], function () {})); return true; } catch (e) { return false; } }\n\nfunction _getPrototypeOf(o) { _getPrototypeOf = Object.setPrototypeOf ? Object.getPrototypeOf : function _getPrototypeOf(o) { return o.__proto__ || Object.getPrototypeOf(o); }; return _getPrototypeOf(o); }\n\n\n\n\n\n\n/**\n * Helper function to retrieve attribute value.\n * @param {String} suffix\n * @param {Element} element\n */\n\nfunction getAttributeValue(suffix, element) {\n var attribute = \"data-clipboard-\".concat(suffix);\n\n if (!element.hasAttribute(attribute)) {\n return;\n }\n\n return element.getAttribute(attribute);\n}\n/**\n * Base class which takes one or more elements, adds event listeners to them,\n * and instantiates a new `ClipboardAction` on each click.\n */\n\n\nvar Clipboard = /*#__PURE__*/function (_Emitter) {\n _inherits(Clipboard, _Emitter);\n\n var _super = _createSuper(Clipboard);\n\n /**\n * @param {String|HTMLElement|HTMLCollection|NodeList} trigger\n * @param {Object} options\n */\n function Clipboard(trigger, options) {\n var _this;\n\n _classCallCheck(this, Clipboard);\n\n _this = _super.call(this);\n\n _this.resolveOptions(options);\n\n _this.listenClick(trigger);\n\n return _this;\n }\n /**\n * Defines if attributes would be resolved using internal setter functions\n * or custom functions that were passed in the constructor.\n * @param {Object} options\n */\n\n\n _createClass(Clipboard, [{\n key: \"resolveOptions\",\n value: function resolveOptions() {\n var options = arguments.length > 0 && arguments[0] !== undefined ? 
arguments[0] : {};\n this.action = typeof options.action === 'function' ? options.action : this.defaultAction;\n this.target = typeof options.target === 'function' ? options.target : this.defaultTarget;\n this.text = typeof options.text === 'function' ? options.text : this.defaultText;\n this.container = clipboard_typeof(options.container) === 'object' ? options.container : document.body;\n }\n /**\n * Adds a click event listener to the passed trigger.\n * @param {String|HTMLElement|HTMLCollection|NodeList} trigger\n */\n\n }, {\n key: \"listenClick\",\n value: function listenClick(trigger) {\n var _this2 = this;\n\n this.listener = listen_default()(trigger, 'click', function (e) {\n return _this2.onClick(e);\n });\n }\n /**\n * Defines a new `ClipboardAction` on each click event.\n * @param {Event} e\n */\n\n }, {\n key: \"onClick\",\n value: function onClick(e) {\n var trigger = e.delegateTarget || e.currentTarget;\n var action = this.action(trigger) || 'copy';\n var text = actions_default({\n action: action,\n container: this.container,\n target: this.target(trigger),\n text: this.text(trigger)\n }); // Fires an event based on the copy operation result.\n\n this.emit(text ? 
'success' : 'error', {\n action: action,\n text: text,\n trigger: trigger,\n clearSelection: function clearSelection() {\n if (trigger) {\n trigger.focus();\n }\n\n window.getSelection().removeAllRanges();\n }\n });\n }\n /**\n * Default `action` lookup function.\n * @param {Element} trigger\n */\n\n }, {\n key: \"defaultAction\",\n value: function defaultAction(trigger) {\n return getAttributeValue('action', trigger);\n }\n /**\n * Default `target` lookup function.\n * @param {Element} trigger\n */\n\n }, {\n key: \"defaultTarget\",\n value: function defaultTarget(trigger) {\n var selector = getAttributeValue('target', trigger);\n\n if (selector) {\n return document.querySelector(selector);\n }\n }\n /**\n * Allow fire programmatically a copy action\n * @param {String|HTMLElement} target\n * @param {Object} options\n * @returns Text copied.\n */\n\n }, {\n key: \"defaultText\",\n\n /**\n * Default `text` lookup function.\n * @param {Element} trigger\n */\n value: function defaultText(trigger) {\n return getAttributeValue('text', trigger);\n }\n /**\n * Destroy lifecycle.\n */\n\n }, {\n key: \"destroy\",\n value: function destroy() {\n this.listener.destroy();\n }\n }], [{\n key: \"copy\",\n value: function copy(target) {\n var options = arguments.length > 1 && arguments[1] !== undefined ? arguments[1] : {\n container: document.body\n };\n return actions_copy(target, options);\n }\n /**\n * Allow fire programmatically a cut action\n * @param {String|HTMLElement} target\n * @returns Text cutted.\n */\n\n }, {\n key: \"cut\",\n value: function cut(target) {\n return actions_cut(target);\n }\n /**\n * Returns the support of the given action, or all actions if no action is\n * given.\n * @param {String} [action]\n */\n\n }, {\n key: \"isSupported\",\n value: function isSupported() {\n var action = arguments.length > 0 && arguments[0] !== undefined ? arguments[0] : ['copy', 'cut'];\n var actions = typeof action === 'string' ? 
[action] : action;\n var support = !!document.queryCommandSupported;\n actions.forEach(function (action) {\n support = support && !!document.queryCommandSupported(action);\n });\n return support;\n }\n }]);\n\n return Clipboard;\n}((tiny_emitter_default()));\n\n/* harmony default export */ var clipboard = (Clipboard);\n\n/***/ }),\n\n/***/ 828:\n/***/ (function(module) {\n\nvar DOCUMENT_NODE_TYPE = 9;\n\n/**\n * A polyfill for Element.matches()\n */\nif (typeof Element !== 'undefined' && !Element.prototype.matches) {\n var proto = Element.prototype;\n\n proto.matches = proto.matchesSelector ||\n proto.mozMatchesSelector ||\n proto.msMatchesSelector ||\n proto.oMatchesSelector ||\n proto.webkitMatchesSelector;\n}\n\n/**\n * Finds the closest parent that matches a selector.\n *\n * @param {Element} element\n * @param {String} selector\n * @return {Function}\n */\nfunction closest (element, selector) {\n while (element && element.nodeType !== DOCUMENT_NODE_TYPE) {\n if (typeof element.matches === 'function' &&\n element.matches(selector)) {\n return element;\n }\n element = element.parentNode;\n }\n}\n\nmodule.exports = closest;\n\n\n/***/ }),\n\n/***/ 438:\n/***/ (function(module, __unused_webpack_exports, __webpack_require__) {\n\nvar closest = __webpack_require__(828);\n\n/**\n * Delegates event to a selector.\n *\n * @param {Element} element\n * @param {String} selector\n * @param {String} type\n * @param {Function} callback\n * @param {Boolean} useCapture\n * @return {Object}\n */\nfunction _delegate(element, selector, type, callback, useCapture) {\n var listenerFn = listener.apply(this, arguments);\n\n element.addEventListener(type, listenerFn, useCapture);\n\n return {\n destroy: function() {\n element.removeEventListener(type, listenerFn, useCapture);\n }\n }\n}\n\n/**\n * Delegates event to a selector.\n *\n * @param {Element|String|Array} [elements]\n * @param {String} selector\n * @param {String} type\n * @param {Function} callback\n * @param {Boolean} 
useCapture\n * @return {Object}\n */\nfunction delegate(elements, selector, type, callback, useCapture) {\n // Handle the regular Element usage\n if (typeof elements.addEventListener === 'function') {\n return _delegate.apply(null, arguments);\n }\n\n // Handle Element-less usage, it defaults to global delegation\n if (typeof type === 'function') {\n // Use `document` as the first parameter, then apply arguments\n // This is a short way to .unshift `arguments` without running into deoptimizations\n return _delegate.bind(null, document).apply(null, arguments);\n }\n\n // Handle Selector-based usage\n if (typeof elements === 'string') {\n elements = document.querySelectorAll(elements);\n }\n\n // Handle Array-like based usage\n return Array.prototype.map.call(elements, function (element) {\n return _delegate(element, selector, type, callback, useCapture);\n });\n}\n\n/**\n * Finds closest match and invokes callback.\n *\n * @param {Element} element\n * @param {String} selector\n * @param {String} type\n * @param {Function} callback\n * @return {Function}\n */\nfunction listener(element, selector, type, callback) {\n return function(e) {\n e.delegateTarget = closest(e.target, selector);\n\n if (e.delegateTarget) {\n callback.call(element, e);\n }\n }\n}\n\nmodule.exports = delegate;\n\n\n/***/ }),\n\n/***/ 879:\n/***/ (function(__unused_webpack_module, exports) {\n\n/**\n * Check if argument is a HTML element.\n *\n * @param {Object} value\n * @return {Boolean}\n */\nexports.node = function(value) {\n return value !== undefined\n && value instanceof HTMLElement\n && value.nodeType === 1;\n};\n\n/**\n * Check if argument is a list of HTML elements.\n *\n * @param {Object} value\n * @return {Boolean}\n */\nexports.nodeList = function(value) {\n var type = Object.prototype.toString.call(value);\n\n return value !== undefined\n && (type === '[object NodeList]' || type === '[object HTMLCollection]')\n && ('length' in value)\n && (value.length === 0 || 
exports.node(value[0]));\n};\n\n/**\n * Check if argument is a string.\n *\n * @param {Object} value\n * @return {Boolean}\n */\nexports.string = function(value) {\n return typeof value === 'string'\n || value instanceof String;\n};\n\n/**\n * Check if argument is a function.\n *\n * @param {Object} value\n * @return {Boolean}\n */\nexports.fn = function(value) {\n var type = Object.prototype.toString.call(value);\n\n return type === '[object Function]';\n};\n\n\n/***/ }),\n\n/***/ 370:\n/***/ (function(module, __unused_webpack_exports, __webpack_require__) {\n\nvar is = __webpack_require__(879);\nvar delegate = __webpack_require__(438);\n\n/**\n * Validates all params and calls the right\n * listener function based on its target type.\n *\n * @param {String|HTMLElement|HTMLCollection|NodeList} target\n * @param {String} type\n * @param {Function} callback\n * @return {Object}\n */\nfunction listen(target, type, callback) {\n if (!target && !type && !callback) {\n throw new Error('Missing required arguments');\n }\n\n if (!is.string(type)) {\n throw new TypeError('Second argument must be a String');\n }\n\n if (!is.fn(callback)) {\n throw new TypeError('Third argument must be a Function');\n }\n\n if (is.node(target)) {\n return listenNode(target, type, callback);\n }\n else if (is.nodeList(target)) {\n return listenNodeList(target, type, callback);\n }\n else if (is.string(target)) {\n return listenSelector(target, type, callback);\n }\n else {\n throw new TypeError('First argument must be a String, HTMLElement, HTMLCollection, or NodeList');\n }\n}\n\n/**\n * Adds an event listener to a HTML element\n * and returns a remove listener function.\n *\n * @param {HTMLElement} node\n * @param {String} type\n * @param {Function} callback\n * @return {Object}\n */\nfunction listenNode(node, type, callback) {\n node.addEventListener(type, callback);\n\n return {\n destroy: function() {\n node.removeEventListener(type, callback);\n }\n }\n}\n\n/**\n * Add an event listener 
to a list of HTML elements\n * and returns a remove listener function.\n *\n * @param {NodeList|HTMLCollection} nodeList\n * @param {String} type\n * @param {Function} callback\n * @return {Object}\n */\nfunction listenNodeList(nodeList, type, callback) {\n Array.prototype.forEach.call(nodeList, function(node) {\n node.addEventListener(type, callback);\n });\n\n return {\n destroy: function() {\n Array.prototype.forEach.call(nodeList, function(node) {\n node.removeEventListener(type, callback);\n });\n }\n }\n}\n\n/**\n * Add an event listener to a selector\n * and returns a remove listener function.\n *\n * @param {String} selector\n * @param {String} type\n * @param {Function} callback\n * @return {Object}\n */\nfunction listenSelector(selector, type, callback) {\n return delegate(document.body, selector, type, callback);\n}\n\nmodule.exports = listen;\n\n\n/***/ }),\n\n/***/ 817:\n/***/ (function(module) {\n\nfunction select(element) {\n var selectedText;\n\n if (element.nodeName === 'SELECT') {\n element.focus();\n\n selectedText = element.value;\n }\n else if (element.nodeName === 'INPUT' || element.nodeName === 'TEXTAREA') {\n var isReadOnly = element.hasAttribute('readonly');\n\n if (!isReadOnly) {\n element.setAttribute('readonly', '');\n }\n\n element.select();\n element.setSelectionRange(0, element.value.length);\n\n if (!isReadOnly) {\n element.removeAttribute('readonly');\n }\n\n selectedText = element.value;\n }\n else {\n if (element.hasAttribute('contenteditable')) {\n element.focus();\n }\n\n var selection = window.getSelection();\n var range = document.createRange();\n\n range.selectNodeContents(element);\n selection.removeAllRanges();\n selection.addRange(range);\n\n selectedText = selection.toString();\n }\n\n return selectedText;\n}\n\nmodule.exports = select;\n\n\n/***/ }),\n\n/***/ 279:\n/***/ (function(module) {\n\nfunction E () {\n // Keep this empty so it's easier to inherit from\n // (via https://github.com/lipsmack from 
https://github.com/scottcorgan/tiny-emitter/issues/3)\n}\n\nE.prototype = {\n on: function (name, callback, ctx) {\n var e = this.e || (this.e = {});\n\n (e[name] || (e[name] = [])).push({\n fn: callback,\n ctx: ctx\n });\n\n return this;\n },\n\n once: function (name, callback, ctx) {\n var self = this;\n function listener () {\n self.off(name, listener);\n callback.apply(ctx, arguments);\n };\n\n listener._ = callback\n return this.on(name, listener, ctx);\n },\n\n emit: function (name) {\n var data = [].slice.call(arguments, 1);\n var evtArr = ((this.e || (this.e = {}))[name] || []).slice();\n var i = 0;\n var len = evtArr.length;\n\n for (i; i < len; i++) {\n evtArr[i].fn.apply(evtArr[i].ctx, data);\n }\n\n return this;\n },\n\n off: function (name, callback) {\n var e = this.e || (this.e = {});\n var evts = e[name];\n var liveEvents = [];\n\n if (evts && callback) {\n for (var i = 0, len = evts.length; i < len; i++) {\n if (evts[i].fn !== callback && evts[i].fn._ !== callback)\n liveEvents.push(evts[i]);\n }\n }\n\n // Remove event from queue to prevent memory leak\n // Suggested by https://github.com/lazd\n // Ref: https://github.com/scottcorgan/tiny-emitter/commit/c6ebfaa9bc973b33d110a84a307742b7cf94c953#commitcomment-5024910\n\n (liveEvents.length)\n ? 
e[name] = liveEvents\n : delete e[name];\n\n return this;\n }\n};\n\nmodule.exports = E;\nmodule.exports.TinyEmitter = E;\n\n\n/***/ })\n\n/******/ \t});\n/************************************************************************/\n/******/ \t// The module cache\n/******/ \tvar __webpack_module_cache__ = {};\n/******/ \t\n/******/ \t// The require function\n/******/ \tfunction __webpack_require__(moduleId) {\n/******/ \t\t// Check if module is in cache\n/******/ \t\tif(__webpack_module_cache__[moduleId]) {\n/******/ \t\t\treturn __webpack_module_cache__[moduleId].exports;\n/******/ \t\t}\n/******/ \t\t// Create a new module (and put it into the cache)\n/******/ \t\tvar module = __webpack_module_cache__[moduleId] = {\n/******/ \t\t\t// no module.id needed\n/******/ \t\t\t// no module.loaded needed\n/******/ \t\t\texports: {}\n/******/ \t\t};\n/******/ \t\n/******/ \t\t// Execute the module function\n/******/ \t\t__webpack_modules__[moduleId](module, module.exports, __webpack_require__);\n/******/ \t\n/******/ \t\t// Return the exports of the module\n/******/ \t\treturn module.exports;\n/******/ \t}\n/******/ \t\n/************************************************************************/\n/******/ \t/* webpack/runtime/compat get default export */\n/******/ \t!function() {\n/******/ \t\t// getDefaultExport function for compatibility with non-harmony modules\n/******/ \t\t__webpack_require__.n = function(module) {\n/******/ \t\t\tvar getter = module && module.__esModule ?\n/******/ \t\t\t\tfunction() { return module['default']; } :\n/******/ \t\t\t\tfunction() { return module; };\n/******/ \t\t\t__webpack_require__.d(getter, { a: getter });\n/******/ \t\t\treturn getter;\n/******/ \t\t};\n/******/ \t}();\n/******/ \t\n/******/ \t/* webpack/runtime/define property getters */\n/******/ \t!function() {\n/******/ \t\t// define getter functions for harmony exports\n/******/ \t\t__webpack_require__.d = function(exports, definition) {\n/******/ \t\t\tfor(var key in definition) 
{\n/******/ \t\t\t\tif(__webpack_require__.o(definition, key) && !__webpack_require__.o(exports, key)) {\n/******/ \t\t\t\t\tObject.defineProperty(exports, key, { enumerable: true, get: definition[key] });\n/******/ \t\t\t\t}\n/******/ \t\t\t}\n/******/ \t\t};\n/******/ \t}();\n/******/ \t\n/******/ \t/* webpack/runtime/hasOwnProperty shorthand */\n/******/ \t!function() {\n/******/ \t\t__webpack_require__.o = function(obj, prop) { return Object.prototype.hasOwnProperty.call(obj, prop); }\n/******/ \t}();\n/******/ \t\n/************************************************************************/\n/******/ \t// module exports must be returned from runtime so entry inlining is disabled\n/******/ \t// startup\n/******/ \t// Load entry module and return exports\n/******/ \treturn __webpack_require__(686);\n/******/ })()\n.default;\n});", "/*!\n * escape-html\n * Copyright(c) 2012-2013 TJ Holowaychuk\n * Copyright(c) 2015 Andreas Lubbe\n * Copyright(c) 2015 Tiancheng \"Timothy\" Gu\n * MIT Licensed\n */\n\n'use strict';\n\n/**\n * Module variables.\n * @private\n */\n\nvar matchHtmlRegExp = /[\"'&<>]/;\n\n/**\n * Module exports.\n * @public\n */\n\nmodule.exports = escapeHtml;\n\n/**\n * Escape special characters in the given string of html.\n *\n * @param {string} string The string to escape for inserting into HTML\n * @return {string}\n * @public\n */\n\nfunction escapeHtml(string) {\n var str = '' + string;\n var match = matchHtmlRegExp.exec(str);\n\n if (!match) {\n return str;\n }\n\n var escape;\n var html = '';\n var index = 0;\n var lastIndex = 0;\n\n for (index = match.index; index < str.length; index++) {\n switch (str.charCodeAt(index)) {\n case 34: // \"\n escape = '"';\n break;\n case 38: // &\n escape = '&';\n break;\n case 39: // '\n escape = ''';\n break;\n case 60: // <\n escape = '<';\n break;\n case 62: // >\n escape = '>';\n break;\n default:\n continue;\n }\n\n if (lastIndex !== index) {\n html += str.substring(lastIndex, index);\n }\n\n lastIndex = 
index + 1;\n html += escape;\n }\n\n return lastIndex !== index\n ? html + str.substring(lastIndex, index)\n : html;\n}\n", "/*\n * Copyright (c) 2016-2023 Martin Donath \n *\n * Permission is hereby granted, free of charge, to any person obtaining a copy\n * of this software and associated documentation files (the \"Software\"), to\n * deal in the Software without restriction, including without limitation the\n * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or\n * sell copies of the Software, and to permit persons to whom the Software is\n * furnished to do so, subject to the following conditions:\n *\n * The above copyright notice and this permission notice shall be included in\n * all copies or substantial portions of the Software.\n *\n * THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n * FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE\n * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING\n * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS\n * IN THE SOFTWARE.\n */\n\nimport \"focus-visible\"\n\nimport {\n EMPTY,\n NEVER,\n Observable,\n Subject,\n defer,\n delay,\n filter,\n map,\n merge,\n mergeWith,\n shareReplay,\n switchMap\n} from \"rxjs\"\n\nimport { configuration, feature } from \"./_\"\nimport {\n at,\n getActiveElement,\n getOptionalElement,\n requestJSON,\n setLocation,\n setToggle,\n watchDocument,\n watchKeyboard,\n watchLocation,\n watchLocationTarget,\n watchMedia,\n watchPrint,\n watchScript,\n watchViewport\n} from \"./browser\"\nimport {\n getComponentElement,\n getComponentElements,\n mountAnnounce,\n mountBackToTop,\n mountConsent,\n mountContent,\n mountDialog,\n mountHeader,\n mountHeaderTitle,\n mountPalette,\n mountProgress,\n mountSearch,\n 
mountSearchHiglight,\n mountSidebar,\n mountSource,\n mountTableOfContents,\n mountTabs,\n watchHeader,\n watchMain\n} from \"./components\"\nimport {\n SearchIndex,\n setupClipboardJS,\n setupInstantNavigation,\n setupVersionSelector\n} from \"./integrations\"\nimport {\n patchEllipsis,\n patchIndeterminate,\n patchScrollfix,\n patchScrolllock\n} from \"./patches\"\nimport \"./polyfills\"\n\n/* ----------------------------------------------------------------------------\n * Functions - @todo refactor\n * ------------------------------------------------------------------------- */\n\n/**\n * Fetch search index\n *\n * @returns Search index observable\n */\nfunction fetchSearchIndex(): Observable {\n if (location.protocol === \"file:\") {\n return watchScript(\n `${new URL(\"search/search_index.js\", config.base)}`\n )\n .pipe(\n // @ts-ignore - @todo fix typings\n map(() => __index),\n shareReplay(1)\n )\n } else {\n return requestJSON(\n new URL(\"search/search_index.json\", config.base)\n )\n }\n}\n\n/* ----------------------------------------------------------------------------\n * Application\n * ------------------------------------------------------------------------- */\n\n/* Yay, JavaScript is available */\ndocument.documentElement.classList.remove(\"no-js\")\ndocument.documentElement.classList.add(\"js\")\n\n/* Set up navigation observables and subjects */\nconst document$ = watchDocument()\nconst location$ = watchLocation()\nconst target$ = watchLocationTarget(location$)\nconst keyboard$ = watchKeyboard()\n\n/* Set up media observables */\nconst viewport$ = watchViewport()\nconst tablet$ = watchMedia(\"(min-width: 960px)\")\nconst screen$ = watchMedia(\"(min-width: 1220px)\")\nconst print$ = watchPrint()\n\n/* Retrieve search index, if search is enabled */\nconst config = configuration()\nconst index$ = document.forms.namedItem(\"search\")\n ? 
fetchSearchIndex()\n : NEVER\n\n/* Set up Clipboard.js integration */\nconst alert$ = new Subject()\nsetupClipboardJS({ alert$ })\n\n/* Set up progress indicator */\nconst progress$ = new Subject()\n\n/* Set up instant navigation, if enabled */\nif (feature(\"navigation.instant\"))\n setupInstantNavigation({ location$, viewport$, progress$ })\n .subscribe(document$)\n\n/* Set up version selector */\nif (config.version?.provider === \"mike\")\n setupVersionSelector({ document$ })\n\n/* Always close drawer and search on navigation */\nmerge(location$, target$)\n .pipe(\n delay(125)\n )\n .subscribe(() => {\n setToggle(\"drawer\", false)\n setToggle(\"search\", false)\n })\n\n/* Set up global keyboard handlers */\nkeyboard$\n .pipe(\n filter(({ mode }) => mode === \"global\")\n )\n .subscribe(key => {\n switch (key.type) {\n\n /* Go to previous page */\n case \"p\":\n case \",\":\n const prev = getOptionalElement(\"link[rel=prev]\")\n if (typeof prev !== \"undefined\")\n setLocation(prev)\n break\n\n /* Go to next page */\n case \"n\":\n case \".\":\n const next = getOptionalElement(\"link[rel=next]\")\n if (typeof next !== \"undefined\")\n setLocation(next)\n break\n\n /* Expand navigation, see https://bit.ly/3ZjG5io */\n case \"Enter\":\n const active = getActiveElement()\n if (active instanceof HTMLLabelElement)\n active.click()\n }\n })\n\n/* Set up patches */\npatchEllipsis({ document$ })\npatchIndeterminate({ document$, tablet$ })\npatchScrollfix({ document$ })\npatchScrolllock({ viewport$, tablet$ })\n\n/* Set up header and main area observable */\nconst header$ = watchHeader(getComponentElement(\"header\"), { viewport$ })\nconst main$ = document$\n .pipe(\n map(() => getComponentElement(\"main\")),\n switchMap(el => watchMain(el, { viewport$, header$ })),\n shareReplay(1)\n )\n\n/* Set up control component observables */\nconst control$ = merge(\n\n /* Consent */\n ...getComponentElements(\"consent\")\n .map(el => mountConsent(el, { target$ })),\n\n /* Dialog 
*/\n ...getComponentElements(\"dialog\")\n .map(el => mountDialog(el, { alert$ })),\n\n /* Header */\n ...getComponentElements(\"header\")\n .map(el => mountHeader(el, { viewport$, header$, main$ })),\n\n /* Color palette */\n ...getComponentElements(\"palette\")\n .map(el => mountPalette(el)),\n\n /* Progress bar */\n ...getComponentElements(\"progress\")\n .map(el => mountProgress(el, { progress$ })),\n\n /* Search */\n ...getComponentElements(\"search\")\n .map(el => mountSearch(el, { index$, keyboard$ })),\n\n /* Repository information */\n ...getComponentElements(\"source\")\n .map(el => mountSource(el))\n)\n\n/* Set up content component observables */\nconst content$ = defer(() => merge(\n\n /* Announcement bar */\n ...getComponentElements(\"announce\")\n .map(el => mountAnnounce(el)),\n\n /* Content */\n ...getComponentElements(\"content\")\n .map(el => mountContent(el, { viewport$, target$, print$ })),\n\n /* Search highlighting */\n ...getComponentElements(\"content\")\n .map(el => feature(\"search.highlight\")\n ? mountSearchHiglight(el, { index$, location$ })\n : EMPTY\n ),\n\n /* Header title */\n ...getComponentElements(\"header-title\")\n .map(el => mountHeaderTitle(el, { viewport$, header$ })),\n\n /* Sidebar */\n ...getComponentElements(\"sidebar\")\n .map(el => el.getAttribute(\"data-md-type\") === \"navigation\"\n ? 
at(screen$, () => mountSidebar(el, { viewport$, header$, main$ }))\n : at(tablet$, () => mountSidebar(el, { viewport$, header$, main$ }))\n ),\n\n /* Navigation tabs */\n ...getComponentElements(\"tabs\")\n .map(el => mountTabs(el, { viewport$, header$ })),\n\n /* Table of contents */\n ...getComponentElements(\"toc\")\n .map(el => mountTableOfContents(el, {\n viewport$, header$, main$, target$\n })),\n\n /* Back-to-top button */\n ...getComponentElements(\"top\")\n .map(el => mountBackToTop(el, { viewport$, header$, main$, target$ }))\n))\n\n/* Set up component observables */\nconst component$ = document$\n .pipe(\n switchMap(() => content$),\n mergeWith(control$),\n shareReplay(1)\n )\n\n/* Subscribe to all components */\ncomponent$.subscribe()\n\n/* ----------------------------------------------------------------------------\n * Exports\n * ------------------------------------------------------------------------- */\n\nwindow.document$ = document$ /* Document observable */\nwindow.location$ = location$ /* Location subject */\nwindow.target$ = target$ /* Location target observable */\nwindow.keyboard$ = keyboard$ /* Keyboard observable */\nwindow.viewport$ = viewport$ /* Viewport observable */\nwindow.tablet$ = tablet$ /* Media tablet observable */\nwindow.screen$ = screen$ /* Media screen observable */\nwindow.print$ = print$ /* Media print observable */\nwindow.alert$ = alert$ /* Alert subject */\nwindow.progress$ = progress$ /* Progress indicator subject */\nwindow.component$ = component$ /* Component observable */\n", "/*! *****************************************************************************\r\nCopyright (c) Microsoft Corporation.\r\n\r\nPermission to use, copy, modify, and/or distribute this software for any\r\npurpose with or without fee is hereby granted.\r\n\r\nTHE SOFTWARE IS PROVIDED \"AS IS\" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH\r\nREGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY\r\nAND FITNESS. 
IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,\r\nINDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM\r\nLOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR\r\nOTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR\r\nPERFORMANCE OF THIS SOFTWARE.\r\n***************************************************************************** */\r\n/* global Reflect, Promise */\r\n\r\nvar extendStatics = function(d, b) {\r\n extendStatics = Object.setPrototypeOf ||\r\n ({ __proto__: [] } instanceof Array && function (d, b) { d.__proto__ = b; }) ||\r\n function (d, b) { for (var p in b) if (Object.prototype.hasOwnProperty.call(b, p)) d[p] = b[p]; };\r\n return extendStatics(d, b);\r\n};\r\n\r\nexport function __extends(d, b) {\r\n if (typeof b !== \"function\" && b !== null)\r\n throw new TypeError(\"Class extends value \" + String(b) + \" is not a constructor or null\");\r\n extendStatics(d, b);\r\n function __() { this.constructor = d; }\r\n d.prototype = b === null ? 
Object.create(b) : (__.prototype = b.prototype, new __());\r\n}\r\n\r\nexport var __assign = function() {\r\n __assign = Object.assign || function __assign(t) {\r\n for (var s, i = 1, n = arguments.length; i < n; i++) {\r\n s = arguments[i];\r\n for (var p in s) if (Object.prototype.hasOwnProperty.call(s, p)) t[p] = s[p];\r\n }\r\n return t;\r\n }\r\n return __assign.apply(this, arguments);\r\n}\r\n\r\nexport function __rest(s, e) {\r\n var t = {};\r\n for (var p in s) if (Object.prototype.hasOwnProperty.call(s, p) && e.indexOf(p) < 0)\r\n t[p] = s[p];\r\n if (s != null && typeof Object.getOwnPropertySymbols === \"function\")\r\n for (var i = 0, p = Object.getOwnPropertySymbols(s); i < p.length; i++) {\r\n if (e.indexOf(p[i]) < 0 && Object.prototype.propertyIsEnumerable.call(s, p[i]))\r\n t[p[i]] = s[p[i]];\r\n }\r\n return t;\r\n}\r\n\r\nexport function __decorate(decorators, target, key, desc) {\r\n var c = arguments.length, r = c < 3 ? target : desc === null ? desc = Object.getOwnPropertyDescriptor(target, key) : desc, d;\r\n if (typeof Reflect === \"object\" && typeof Reflect.decorate === \"function\") r = Reflect.decorate(decorators, target, key, desc);\r\n else for (var i = decorators.length - 1; i >= 0; i--) if (d = decorators[i]) r = (c < 3 ? d(r) : c > 3 ? d(target, key, r) : d(target, key)) || r;\r\n return c > 3 && r && Object.defineProperty(target, key, r), r;\r\n}\r\n\r\nexport function __param(paramIndex, decorator) {\r\n return function (target, key) { decorator(target, key, paramIndex); }\r\n}\r\n\r\nexport function __metadata(metadataKey, metadataValue) {\r\n if (typeof Reflect === \"object\" && typeof Reflect.metadata === \"function\") return Reflect.metadata(metadataKey, metadataValue);\r\n}\r\n\r\nexport function __awaiter(thisArg, _arguments, P, generator) {\r\n function adopt(value) { return value instanceof P ? 
value : new P(function (resolve) { resolve(value); }); }\r\n return new (P || (P = Promise))(function (resolve, reject) {\r\n function fulfilled(value) { try { step(generator.next(value)); } catch (e) { reject(e); } }\r\n function rejected(value) { try { step(generator[\"throw\"](value)); } catch (e) { reject(e); } }\r\n function step(result) { result.done ? resolve(result.value) : adopt(result.value).then(fulfilled, rejected); }\r\n step((generator = generator.apply(thisArg, _arguments || [])).next());\r\n });\r\n}\r\n\r\nexport function __generator(thisArg, body) {\r\n var _ = { label: 0, sent: function() { if (t[0] & 1) throw t[1]; return t[1]; }, trys: [], ops: [] }, f, y, t, g;\r\n return g = { next: verb(0), \"throw\": verb(1), \"return\": verb(2) }, typeof Symbol === \"function\" && (g[Symbol.iterator] = function() { return this; }), g;\r\n function verb(n) { return function (v) { return step([n, v]); }; }\r\n function step(op) {\r\n if (f) throw new TypeError(\"Generator is already executing.\");\r\n while (_) try {\r\n if (f = 1, y && (t = op[0] & 2 ? y[\"return\"] : op[0] ? 
y[\"throw\"] || ((t = y[\"return\"]) && t.call(y), 0) : y.next) && !(t = t.call(y, op[1])).done) return t;\r\n if (y = 0, t) op = [op[0] & 2, t.value];\r\n switch (op[0]) {\r\n case 0: case 1: t = op; break;\r\n case 4: _.label++; return { value: op[1], done: false };\r\n case 5: _.label++; y = op[1]; op = [0]; continue;\r\n case 7: op = _.ops.pop(); _.trys.pop(); continue;\r\n default:\r\n if (!(t = _.trys, t = t.length > 0 && t[t.length - 1]) && (op[0] === 6 || op[0] === 2)) { _ = 0; continue; }\r\n if (op[0] === 3 && (!t || (op[1] > t[0] && op[1] < t[3]))) { _.label = op[1]; break; }\r\n if (op[0] === 6 && _.label < t[1]) { _.label = t[1]; t = op; break; }\r\n if (t && _.label < t[2]) { _.label = t[2]; _.ops.push(op); break; }\r\n if (t[2]) _.ops.pop();\r\n _.trys.pop(); continue;\r\n }\r\n op = body.call(thisArg, _);\r\n } catch (e) { op = [6, e]; y = 0; } finally { f = t = 0; }\r\n if (op[0] & 5) throw op[1]; return { value: op[0] ? op[1] : void 0, done: true };\r\n }\r\n}\r\n\r\nexport var __createBinding = Object.create ? (function(o, m, k, k2) {\r\n if (k2 === undefined) k2 = k;\r\n Object.defineProperty(o, k2, { enumerable: true, get: function() { return m[k]; } });\r\n}) : (function(o, m, k, k2) {\r\n if (k2 === undefined) k2 = k;\r\n o[k2] = m[k];\r\n});\r\n\r\nexport function __exportStar(m, o) {\r\n for (var p in m) if (p !== \"default\" && !Object.prototype.hasOwnProperty.call(o, p)) __createBinding(o, m, p);\r\n}\r\n\r\nexport function __values(o) {\r\n var s = typeof Symbol === \"function\" && Symbol.iterator, m = s && o[s], i = 0;\r\n if (m) return m.call(o);\r\n if (o && typeof o.length === \"number\") return {\r\n next: function () {\r\n if (o && i >= o.length) o = void 0;\r\n return { value: o && o[i++], done: !o };\r\n }\r\n };\r\n throw new TypeError(s ? 
\"Object is not iterable.\" : \"Symbol.iterator is not defined.\");\r\n}\r\n\r\nexport function __read(o, n) {\r\n var m = typeof Symbol === \"function\" && o[Symbol.iterator];\r\n if (!m) return o;\r\n var i = m.call(o), r, ar = [], e;\r\n try {\r\n while ((n === void 0 || n-- > 0) && !(r = i.next()).done) ar.push(r.value);\r\n }\r\n catch (error) { e = { error: error }; }\r\n finally {\r\n try {\r\n if (r && !r.done && (m = i[\"return\"])) m.call(i);\r\n }\r\n finally { if (e) throw e.error; }\r\n }\r\n return ar;\r\n}\r\n\r\n/** @deprecated */\r\nexport function __spread() {\r\n for (var ar = [], i = 0; i < arguments.length; i++)\r\n ar = ar.concat(__read(arguments[i]));\r\n return ar;\r\n}\r\n\r\n/** @deprecated */\r\nexport function __spreadArrays() {\r\n for (var s = 0, i = 0, il = arguments.length; i < il; i++) s += arguments[i].length;\r\n for (var r = Array(s), k = 0, i = 0; i < il; i++)\r\n for (var a = arguments[i], j = 0, jl = a.length; j < jl; j++, k++)\r\n r[k] = a[j];\r\n return r;\r\n}\r\n\r\nexport function __spreadArray(to, from, pack) {\r\n if (pack || arguments.length === 2) for (var i = 0, l = from.length, ar; i < l; i++) {\r\n if (ar || !(i in from)) {\r\n if (!ar) ar = Array.prototype.slice.call(from, 0, i);\r\n ar[i] = from[i];\r\n }\r\n }\r\n return to.concat(ar || Array.prototype.slice.call(from));\r\n}\r\n\r\nexport function __await(v) {\r\n return this instanceof __await ? 
(this.v = v, this) : new __await(v);\r\n}\r\n\r\nexport function __asyncGenerator(thisArg, _arguments, generator) {\r\n if (!Symbol.asyncIterator) throw new TypeError(\"Symbol.asyncIterator is not defined.\");\r\n var g = generator.apply(thisArg, _arguments || []), i, q = [];\r\n return i = {}, verb(\"next\"), verb(\"throw\"), verb(\"return\"), i[Symbol.asyncIterator] = function () { return this; }, i;\r\n function verb(n) { if (g[n]) i[n] = function (v) { return new Promise(function (a, b) { q.push([n, v, a, b]) > 1 || resume(n, v); }); }; }\r\n function resume(n, v) { try { step(g[n](v)); } catch (e) { settle(q[0][3], e); } }\r\n function step(r) { r.value instanceof __await ? Promise.resolve(r.value.v).then(fulfill, reject) : settle(q[0][2], r); }\r\n function fulfill(value) { resume(\"next\", value); }\r\n function reject(value) { resume(\"throw\", value); }\r\n function settle(f, v) { if (f(v), q.shift(), q.length) resume(q[0][0], q[0][1]); }\r\n}\r\n\r\nexport function __asyncDelegator(o) {\r\n var i, p;\r\n return i = {}, verb(\"next\"), verb(\"throw\", function (e) { throw e; }), verb(\"return\"), i[Symbol.iterator] = function () { return this; }, i;\r\n function verb(n, f) { i[n] = o[n] ? function (v) { return (p = !p) ? { value: __await(o[n](v)), done: n === \"return\" } : f ? f(v) : v; } : f; }\r\n}\r\n\r\nexport function __asyncValues(o) {\r\n if (!Symbol.asyncIterator) throw new TypeError(\"Symbol.asyncIterator is not defined.\");\r\n var m = o[Symbol.asyncIterator], i;\r\n return m ? m.call(o) : (o = typeof __values === \"function\" ? 
__values(o) : o[Symbol.iterator](), i = {}, verb(\"next\"), verb(\"throw\"), verb(\"return\"), i[Symbol.asyncIterator] = function () { return this; }, i);\r\n function verb(n) { i[n] = o[n] && function (v) { return new Promise(function (resolve, reject) { v = o[n](v), settle(resolve, reject, v.done, v.value); }); }; }\r\n function settle(resolve, reject, d, v) { Promise.resolve(v).then(function(v) { resolve({ value: v, done: d }); }, reject); }\r\n}\r\n\r\nexport function __makeTemplateObject(cooked, raw) {\r\n if (Object.defineProperty) { Object.defineProperty(cooked, \"raw\", { value: raw }); } else { cooked.raw = raw; }\r\n return cooked;\r\n};\r\n\r\nvar __setModuleDefault = Object.create ? (function(o, v) {\r\n Object.defineProperty(o, \"default\", { enumerable: true, value: v });\r\n}) : function(o, v) {\r\n o[\"default\"] = v;\r\n};\r\n\r\nexport function __importStar(mod) {\r\n if (mod && mod.__esModule) return mod;\r\n var result = {};\r\n if (mod != null) for (var k in mod) if (k !== \"default\" && Object.prototype.hasOwnProperty.call(mod, k)) __createBinding(result, mod, k);\r\n __setModuleDefault(result, mod);\r\n return result;\r\n}\r\n\r\nexport function __importDefault(mod) {\r\n return (mod && mod.__esModule) ? mod : { default: mod };\r\n}\r\n\r\nexport function __classPrivateFieldGet(receiver, state, kind, f) {\r\n if (kind === \"a\" && !f) throw new TypeError(\"Private accessor was defined without a getter\");\r\n if (typeof state === \"function\" ? receiver !== state || !f : !state.has(receiver)) throw new TypeError(\"Cannot read private member from an object whose class did not declare it\");\r\n return kind === \"m\" ? f : kind === \"a\" ? f.call(receiver) : f ? 
f.value : state.get(receiver);\r\n}\r\n\r\nexport function __classPrivateFieldSet(receiver, state, value, kind, f) {\r\n if (kind === \"m\") throw new TypeError(\"Private method is not writable\");\r\n if (kind === \"a\" && !f) throw new TypeError(\"Private accessor was defined without a setter\");\r\n if (typeof state === \"function\" ? receiver !== state || !f : !state.has(receiver)) throw new TypeError(\"Cannot write private member to an object whose class did not declare it\");\r\n return (kind === \"a\" ? f.call(receiver, value) : f ? f.value = value : state.set(receiver, value)), value;\r\n}\r\n", "/**\n * Returns true if the object is a function.\n * @param value The value to check\n */\nexport function isFunction(value: any): value is (...args: any[]) => any {\n return typeof value === 'function';\n}\n", "/**\n * Used to create Error subclasses until the community moves away from ES5.\n *\n * This is because compiling from TypeScript down to ES5 has issues with subclassing Errors\n * as well as other built-in types: https://github.com/Microsoft/TypeScript/issues/12123\n *\n * @param createImpl A factory function to create the actual constructor implementation. The returned\n * function should be a named function that calls `_super` internally.\n */\nexport function createErrorClass(createImpl: (_super: any) => any): T {\n const _super = (instance: any) => {\n Error.call(instance);\n instance.stack = new Error().stack;\n };\n\n const ctorFunc = createImpl(_super);\n ctorFunc.prototype = Object.create(Error.prototype);\n ctorFunc.prototype.constructor = ctorFunc;\n return ctorFunc;\n}\n", "import { createErrorClass } from './createErrorClass';\n\nexport interface UnsubscriptionError extends Error {\n readonly errors: any[];\n}\n\nexport interface UnsubscriptionErrorCtor {\n /**\n * @deprecated Internal implementation detail. 
Do not construct error instances.\n * Cannot be tagged as internal: https://github.com/ReactiveX/rxjs/issues/6269\n */\n new (errors: any[]): UnsubscriptionError;\n}\n\n/**\n * An error thrown when one or more errors have occurred during the\n * `unsubscribe` of a {@link Subscription}.\n */\nexport const UnsubscriptionError: UnsubscriptionErrorCtor = createErrorClass(\n (_super) =>\n function UnsubscriptionErrorImpl(this: any, errors: (Error | string)[]) {\n _super(this);\n this.message = errors\n ? `${errors.length} errors occurred during unsubscription:\n${errors.map((err, i) => `${i + 1}) ${err.toString()}`).join('\\n ')}`\n : '';\n this.name = 'UnsubscriptionError';\n this.errors = errors;\n }\n);\n", "/**\n * Removes an item from an array, mutating it.\n * @param arr The array to remove the item from\n * @param item The item to remove\n */\nexport function arrRemove(arr: T[] | undefined | null, item: T) {\n if (arr) {\n const index = arr.indexOf(item);\n 0 <= index && arr.splice(index, 1);\n }\n}\n", "import { isFunction } from './util/isFunction';\nimport { UnsubscriptionError } from './util/UnsubscriptionError';\nimport { SubscriptionLike, TeardownLogic, Unsubscribable } from './types';\nimport { arrRemove } from './util/arrRemove';\n\n/**\n * Represents a disposable resource, such as the execution of an Observable. 
A\n * Subscription has one important method, `unsubscribe`, that takes no argument\n * and just disposes the resource held by the subscription.\n *\n * Additionally, subscriptions may be grouped together through the `add()`\n * method, which will attach a child Subscription to the current Subscription.\n * When a Subscription is unsubscribed, all its children (and its grandchildren)\n * will be unsubscribed as well.\n *\n * @class Subscription\n */\nexport class Subscription implements SubscriptionLike {\n /** @nocollapse */\n public static EMPTY = (() => {\n const empty = new Subscription();\n empty.closed = true;\n return empty;\n })();\n\n /**\n * A flag to indicate whether this Subscription has already been unsubscribed.\n */\n public closed = false;\n\n private _parentage: Subscription[] | Subscription | null = null;\n\n /**\n * The list of registered finalizers to execute upon unsubscription. Adding and removing from this\n * list occurs in the {@link #add} and {@link #remove} methods.\n */\n private _finalizers: Exclude[] | null = null;\n\n /**\n * @param initialTeardown A function executed first as part of the finalization\n * process that is kicked off when {@link #unsubscribe} is called.\n */\n constructor(private initialTeardown?: () => void) {}\n\n /**\n * Disposes the resources held by the subscription. 
May, for instance, cancel\n * an ongoing Observable execution or cancel any other type of work that\n * started when the Subscription was created.\n * @return {void}\n */\n unsubscribe(): void {\n let errors: any[] | undefined;\n\n if (!this.closed) {\n this.closed = true;\n\n // Remove this from it's parents.\n const { _parentage } = this;\n if (_parentage) {\n this._parentage = null;\n if (Array.isArray(_parentage)) {\n for (const parent of _parentage) {\n parent.remove(this);\n }\n } else {\n _parentage.remove(this);\n }\n }\n\n const { initialTeardown: initialFinalizer } = this;\n if (isFunction(initialFinalizer)) {\n try {\n initialFinalizer();\n } catch (e) {\n errors = e instanceof UnsubscriptionError ? e.errors : [e];\n }\n }\n\n const { _finalizers } = this;\n if (_finalizers) {\n this._finalizers = null;\n for (const finalizer of _finalizers) {\n try {\n execFinalizer(finalizer);\n } catch (err) {\n errors = errors ?? [];\n if (err instanceof UnsubscriptionError) {\n errors = [...errors, ...err.errors];\n } else {\n errors.push(err);\n }\n }\n }\n }\n\n if (errors) {\n throw new UnsubscriptionError(errors);\n }\n }\n }\n\n /**\n * Adds a finalizer to this subscription, so that finalization will be unsubscribed/called\n * when this subscription is unsubscribed. If this subscription is already {@link #closed},\n * because it has already been unsubscribed, then whatever finalizer is passed to it\n * will automatically be executed (unless the finalizer itself is also a closed subscription).\n *\n * Closed Subscriptions cannot be added as finalizers to any subscription. Adding a closed\n * subscription to a any subscription will result in no operation. (A noop).\n *\n * Adding a subscription to itself, or adding `null` or `undefined` will not perform any\n * operation at all. (A noop).\n *\n * `Subscription` instances that are added to this instance will automatically remove themselves\n * if they are unsubscribed. 
Functions and {@link Unsubscribable} objects that you wish to remove\n * will need to be removed manually with {@link #remove}\n *\n * @param teardown The finalization logic to add to this subscription.\n */\n add(teardown: TeardownLogic): void {\n // Only add the finalizer if it's not undefined\n // and don't add a subscription to itself.\n if (teardown && teardown !== this) {\n if (this.closed) {\n // If this subscription is already closed,\n // execute whatever finalizer is handed to it automatically.\n execFinalizer(teardown);\n } else {\n if (teardown instanceof Subscription) {\n // We don't add closed subscriptions, and we don't add the same subscription\n // twice. Subscription unsubscribe is idempotent.\n if (teardown.closed || teardown._hasParent(this)) {\n return;\n }\n teardown._addParent(this);\n }\n (this._finalizers = this._finalizers ?? []).push(teardown);\n }\n }\n }\n\n /**\n * Checks to see if a this subscription already has a particular parent.\n * This will signal that this subscription has already been added to the parent in question.\n * @param parent the parent to check for\n */\n private _hasParent(parent: Subscription) {\n const { _parentage } = this;\n return _parentage === parent || (Array.isArray(_parentage) && _parentage.includes(parent));\n }\n\n /**\n * Adds a parent to this subscription so it can be removed from the parent if it\n * unsubscribes on it's own.\n *\n * NOTE: THIS ASSUMES THAT {@link _hasParent} HAS ALREADY BEEN CHECKED.\n * @param parent The parent subscription to add\n */\n private _addParent(parent: Subscription) {\n const { _parentage } = this;\n this._parentage = Array.isArray(_parentage) ? (_parentage.push(parent), _parentage) : _parentage ? 
[_parentage, parent] : parent;\n }\n\n /**\n * Called on a child when it is removed via {@link #remove}.\n * @param parent The parent to remove\n */\n private _removeParent(parent: Subscription) {\n const { _parentage } = this;\n if (_parentage === parent) {\n this._parentage = null;\n } else if (Array.isArray(_parentage)) {\n arrRemove(_parentage, parent);\n }\n }\n\n /**\n * Removes a finalizer from this subscription that was previously added with the {@link #add} method.\n *\n * Note that `Subscription` instances, when unsubscribed, will automatically remove themselves\n * from every other `Subscription` they have been added to. This means that using the `remove` method\n * is not a common thing and should be used thoughtfully.\n *\n * If you add the same finalizer instance of a function or an unsubscribable object to a `Subscription` instance\n * more than once, you will need to call `remove` the same number of times to remove all instances.\n *\n * All finalizer instances are removed to free up memory upon unsubscription.\n *\n * @param teardown The finalizer to remove from this subscription\n */\n remove(teardown: Exclude): void {\n const { _finalizers } = this;\n _finalizers && arrRemove(_finalizers, teardown);\n\n if (teardown instanceof Subscription) {\n teardown._removeParent(this);\n }\n }\n}\n\nexport const EMPTY_SUBSCRIPTION = Subscription.EMPTY;\n\nexport function isSubscription(value: any): value is Subscription {\n return (\n value instanceof Subscription ||\n (value && 'closed' in value && isFunction(value.remove) && isFunction(value.add) && isFunction(value.unsubscribe))\n );\n}\n\nfunction execFinalizer(finalizer: Unsubscribable | (() => void)) {\n if (isFunction(finalizer)) {\n finalizer();\n } else {\n finalizer.unsubscribe();\n }\n}\n", "import { Subscriber } from './Subscriber';\nimport { ObservableNotification } from './types';\n\n/**\n * The {@link GlobalConfig} object for RxJS. 
It is used to configure things\n * like how to react on unhandled errors.\n */\nexport const config: GlobalConfig = {\n onUnhandledError: null,\n onStoppedNotification: null,\n Promise: undefined,\n useDeprecatedSynchronousErrorHandling: false,\n useDeprecatedNextContext: false,\n};\n\n/**\n * The global configuration object for RxJS, used to configure things\n * like how to react on unhandled errors. Accessible via {@link config}\n * object.\n */\nexport interface GlobalConfig {\n /**\n * A registration point for unhandled errors from RxJS. These are errors that\n * cannot were not handled by consuming code in the usual subscription path. For\n * example, if you have this configured, and you subscribe to an observable without\n * providing an error handler, errors from that subscription will end up here. This\n * will _always_ be called asynchronously on another job in the runtime. This is because\n * we do not want errors thrown in this user-configured handler to interfere with the\n * behavior of the library.\n */\n onUnhandledError: ((err: any) => void) | null;\n\n /**\n * A registration point for notifications that cannot be sent to subscribers because they\n * have completed, errored or have been explicitly unsubscribed. By default, next, complete\n * and error notifications sent to stopped subscribers are noops. However, sometimes callers\n * might want a different behavior. For example, with sources that attempt to report errors\n * to stopped subscribers, a caller can configure RxJS to throw an unhandled error instead.\n * This will _always_ be called asynchronously on another job in the runtime. 
This is because\n * we do not want errors thrown in this user-configured handler to interfere with the\n * behavior of the library.\n */\n onStoppedNotification: ((notification: ObservableNotification, subscriber: Subscriber) => void) | null;\n\n /**\n * The promise constructor used by default for {@link Observable#toPromise toPromise} and {@link Observable#forEach forEach}\n * methods.\n *\n * @deprecated As of version 8, RxJS will no longer support this sort of injection of a\n * Promise constructor. If you need a Promise implementation other than native promises,\n * please polyfill/patch Promise as you see appropriate. Will be removed in v8.\n */\n Promise?: PromiseConstructorLike;\n\n /**\n * If true, turns on synchronous error rethrowing, which is a deprecated behavior\n * in v6 and higher. This behavior enables bad patterns like wrapping a subscribe\n * call in a try/catch block. It also enables producer interference, a nasty bug\n * where a multicast can be broken for all observers by a downstream consumer with\n * an unhandled error. DO NOT USE THIS FLAG UNLESS IT'S NEEDED TO BUY TIME\n * FOR MIGRATION REASONS.\n *\n * @deprecated As of version 8, RxJS will no longer support synchronous throwing\n * of unhandled errors. All errors will be thrown on a separate call stack to prevent bad\n * behaviors described above. 
Will be removed in v8.\n */\n useDeprecatedSynchronousErrorHandling: boolean;\n\n /**\n * If true, enables an as-of-yet undocumented feature from v5: The ability to access\n * `unsubscribe()` via `this` context in `next` functions created in observers passed\n * to `subscribe`.\n *\n * This is being removed because the performance was severely problematic, and it could also cause\n * issues when types other than POJOs are passed to subscribe as subscribers, as they will likely have\n * their `this` context overwritten.\n *\n * @deprecated As of version 8, RxJS will no longer support altering the\n * context of next functions provided as part of an observer to Subscribe. Instead,\n * you will have access to a subscription or a signal or token that will allow you to do things like\n * unsubscribe and test closed status. Will be removed in v8.\n */\n useDeprecatedNextContext: boolean;\n}\n", "import type { TimerHandle } from './timerHandle';\ntype SetTimeoutFunction = (handler: () => void, timeout?: number, ...args: any[]) => TimerHandle;\ntype ClearTimeoutFunction = (handle: TimerHandle) => void;\n\ninterface TimeoutProvider {\n setTimeout: SetTimeoutFunction;\n clearTimeout: ClearTimeoutFunction;\n delegate:\n | {\n setTimeout: SetTimeoutFunction;\n clearTimeout: ClearTimeoutFunction;\n }\n | undefined;\n}\n\nexport const timeoutProvider: TimeoutProvider = {\n // When accessing the delegate, use the variable rather than `this` so that\n // the functions can be called without being bound to the provider.\n setTimeout(handler: () => void, timeout?: number, ...args) {\n const { delegate } = timeoutProvider;\n if (delegate?.setTimeout) {\n return delegate.setTimeout(handler, timeout, ...args);\n }\n return setTimeout(handler, timeout, ...args);\n },\n clearTimeout(handle) {\n const { delegate } = timeoutProvider;\n return (delegate?.clearTimeout || clearTimeout)(handle as any);\n },\n delegate: undefined,\n};\n", "import { config } from '../config';\nimport { 
timeoutProvider } from '../scheduler/timeoutProvider';\n\n/**\n * Handles an error on another job either with the user-configured {@link onUnhandledError},\n * or by throwing it on that new job so it can be picked up by `window.onerror`, `process.on('error')`, etc.\n *\n * This should be called whenever there is an error that is out-of-band with the subscription\n * or when an error hits a terminal boundary of the subscription and no error handler was provided.\n *\n * @param err the error to report\n */\nexport function reportUnhandledError(err: any) {\n timeoutProvider.setTimeout(() => {\n const { onUnhandledError } = config;\n if (onUnhandledError) {\n // Execute the user-configured error handler.\n onUnhandledError(err);\n } else {\n // Throw so it is picked up by the runtime's uncaught error mechanism.\n throw err;\n }\n });\n}\n", "/* tslint:disable:no-empty */\nexport function noop() { }\n", "import { CompleteNotification, NextNotification, ErrorNotification } from './types';\n\n/**\n * A completion object optimized for memory use and created to be the\n * same \"shape\" as other notifications in v8.\n * @internal\n */\nexport const COMPLETE_NOTIFICATION = (() => createNotification('C', undefined, undefined) as CompleteNotification)();\n\n/**\n * Internal use only. Creates an optimized error notification that is the same \"shape\"\n * as other notifications.\n * @internal\n */\nexport function errorNotification(error: any): ErrorNotification {\n return createNotification('E', undefined, error) as any;\n}\n\n/**\n * Internal use only. 
Creates an optimized next notification that is the same \"shape\"\n * as other notifications.\n * @internal\n */\nexport function nextNotification(value: T) {\n return createNotification('N', value, undefined) as NextNotification;\n}\n\n/**\n * Ensures that all notifications created internally have the same \"shape\" in v8.\n *\n * TODO: This is only exported to support a crazy legacy test in `groupBy`.\n * @internal\n */\nexport function createNotification(kind: 'N' | 'E' | 'C', value: any, error: any) {\n return {\n kind,\n value,\n error,\n };\n}\n", "import { config } from '../config';\n\nlet context: { errorThrown: boolean; error: any } | null = null;\n\n/**\n * Handles dealing with errors for super-gross mode. Creates a context, in which\n * any synchronously thrown errors will be passed to {@link captureError}, which\n * will record the error such that it will be rethrown after the callback is complete.\n * TODO: Remove in v8\n * @param cb An immediately executed function.\n */\nexport function errorContext(cb: () => void) {\n if (config.useDeprecatedSynchronousErrorHandling) {\n const isRoot = !context;\n if (isRoot) {\n context = { errorThrown: false, error: null };\n }\n cb();\n if (isRoot) {\n const { errorThrown, error } = context!;\n context = null;\n if (errorThrown) {\n throw error;\n }\n }\n } else {\n // This is the general non-deprecated path for everyone that\n // isn't crazy enough to use super-gross mode (useDeprecatedSynchronousErrorHandling)\n cb();\n }\n}\n\n/**\n * Captures errors only in super-gross mode.\n * @param err the error to capture\n */\nexport function captureError(err: any) {\n if (config.useDeprecatedSynchronousErrorHandling && context) {\n context.errorThrown = true;\n context.error = err;\n }\n}\n", "import { isFunction } from './util/isFunction';\nimport { Observer, ObservableNotification } from './types';\nimport { isSubscription, Subscription } from './Subscription';\nimport { config } from './config';\nimport { 
reportUnhandledError } from './util/reportUnhandledError';\nimport { noop } from './util/noop';\nimport { nextNotification, errorNotification, COMPLETE_NOTIFICATION } from './NotificationFactories';\nimport { timeoutProvider } from './scheduler/timeoutProvider';\nimport { captureError } from './util/errorContext';\n\n/**\n * Implements the {@link Observer} interface and extends the\n * {@link Subscription} class. While the {@link Observer} is the public API for\n * consuming the values of an {@link Observable}, all Observers get converted to\n * a Subscriber, in order to provide Subscription-like capabilities such as\n * `unsubscribe`. Subscriber is a common type in RxJS, and crucial for\n * implementing operators, but it is rarely used as a public API.\n *\n * @class Subscriber\n */\nexport class Subscriber extends Subscription implements Observer {\n /**\n * A static factory for a Subscriber, given a (potentially partial) definition\n * of an Observer.\n * @param next The `next` callback of an Observer.\n * @param error The `error` callback of an\n * Observer.\n * @param complete The `complete` callback of an\n * Observer.\n * @return A Subscriber wrapping the (partially defined)\n * Observer represented by the given arguments.\n * @nocollapse\n * @deprecated Do not use. Will be removed in v8. There is no replacement for this\n * method, and there is no reason to be creating instances of `Subscriber` directly.\n * If you have a specific use case, please file an issue.\n */\n static create(next?: (x?: T) => void, error?: (e?: any) => void, complete?: () => void): Subscriber {\n return new SafeSubscriber(next, error, complete);\n }\n\n /** @deprecated Internal implementation detail, do not use directly. Will be made internal in v8. */\n protected isStopped: boolean = false;\n /** @deprecated Internal implementation detail, do not use directly. Will be made internal in v8. 
*/\n protected destination: Subscriber | Observer; // this `any` is the escape hatch to erase extra type param (e.g. R)\n\n /**\n * @deprecated Internal implementation detail, do not use directly. Will be made internal in v8.\n * There is no reason to directly create an instance of Subscriber. This type is exported for typings reasons.\n */\n constructor(destination?: Subscriber | Observer) {\n super();\n if (destination) {\n this.destination = destination;\n // Automatically chain subscriptions together here.\n // if destination is a Subscription, then it is a Subscriber.\n if (isSubscription(destination)) {\n destination.add(this);\n }\n } else {\n this.destination = EMPTY_OBSERVER;\n }\n }\n\n /**\n * The {@link Observer} callback to receive notifications of type `next` from\n * the Observable, with a value. The Observable may call this method 0 or more\n * times.\n * @param {T} [value] The `next` value.\n * @return {void}\n */\n next(value?: T): void {\n if (this.isStopped) {\n handleStoppedNotification(nextNotification(value), this);\n } else {\n this._next(value!);\n }\n }\n\n /**\n * The {@link Observer} callback to receive notifications of type `error` from\n * the Observable, with an attached `Error`. Notifies the Observer that\n * the Observable has experienced an error condition.\n * @param {any} [err] The `error` exception.\n * @return {void}\n */\n error(err?: any): void {\n if (this.isStopped) {\n handleStoppedNotification(errorNotification(err), this);\n } else {\n this.isStopped = true;\n this._error(err);\n }\n }\n\n /**\n * The {@link Observer} callback to receive a valueless notification of type\n * `complete` from the Observable. 
Notifies the Observer that the Observable\n * has finished sending push-based notifications.\n * @return {void}\n */\n complete(): void {\n if (this.isStopped) {\n handleStoppedNotification(COMPLETE_NOTIFICATION, this);\n } else {\n this.isStopped = true;\n this._complete();\n }\n }\n\n unsubscribe(): void {\n if (!this.closed) {\n this.isStopped = true;\n super.unsubscribe();\n this.destination = null!;\n }\n }\n\n protected _next(value: T): void {\n this.destination.next(value);\n }\n\n protected _error(err: any): void {\n try {\n this.destination.error(err);\n } finally {\n this.unsubscribe();\n }\n }\n\n protected _complete(): void {\n try {\n this.destination.complete();\n } finally {\n this.unsubscribe();\n }\n }\n}\n\n/**\n * This bind is captured here because we want to be able to have\n * compatibility with monoid libraries that tend to use a method named\n * `bind`. In particular, a library called Monio requires this.\n */\nconst _bind = Function.prototype.bind;\n\nfunction bind any>(fn: Fn, thisArg: any): Fn {\n return _bind.call(fn, thisArg);\n}\n\n/**\n * Internal optimization only, DO NOT EXPOSE.\n * @internal\n */\nclass ConsumerObserver implements Observer {\n constructor(private partialObserver: Partial>) {}\n\n next(value: T): void {\n const { partialObserver } = this;\n if (partialObserver.next) {\n try {\n partialObserver.next(value);\n } catch (error) {\n handleUnhandledError(error);\n }\n }\n }\n\n error(err: any): void {\n const { partialObserver } = this;\n if (partialObserver.error) {\n try {\n partialObserver.error(err);\n } catch (error) {\n handleUnhandledError(error);\n }\n } else {\n handleUnhandledError(err);\n }\n }\n\n complete(): void {\n const { partialObserver } = this;\n if (partialObserver.complete) {\n try {\n partialObserver.complete();\n } catch (error) {\n handleUnhandledError(error);\n }\n }\n }\n}\n\nexport class SafeSubscriber extends Subscriber {\n constructor(\n observerOrNext?: Partial> | ((value: T) => void) | 
null,\n error?: ((e?: any) => void) | null,\n complete?: (() => void) | null\n ) {\n super();\n\n let partialObserver: Partial>;\n if (isFunction(observerOrNext) || !observerOrNext) {\n // The first argument is a function, not an observer. The next\n // two arguments *could* be observers, or they could be empty.\n partialObserver = {\n next: (observerOrNext ?? undefined) as (((value: T) => void) | undefined),\n error: error ?? undefined,\n complete: complete ?? undefined,\n };\n } else {\n // The first argument is a partial observer.\n let context: any;\n if (this && config.useDeprecatedNextContext) {\n // This is a deprecated path that made `this.unsubscribe()` available in\n // next handler functions passed to subscribe. This only exists behind a flag\n // now, as it is *very* slow.\n context = Object.create(observerOrNext);\n context.unsubscribe = () => this.unsubscribe();\n partialObserver = {\n next: observerOrNext.next && bind(observerOrNext.next, context),\n error: observerOrNext.error && bind(observerOrNext.error, context),\n complete: observerOrNext.complete && bind(observerOrNext.complete, context),\n };\n } else {\n // The \"normal\" path. 
Just use the partial observer directly.\n partialObserver = observerOrNext;\n }\n }\n\n // Wrap the partial observer to ensure it's a full observer, and\n // make sure proper error handling is accounted for.\n this.destination = new ConsumerObserver(partialObserver);\n }\n}\n\nfunction handleUnhandledError(error: any) {\n if (config.useDeprecatedSynchronousErrorHandling) {\n captureError(error);\n } else {\n // Ideal path, we report this as an unhandled error,\n // which is thrown on a new call stack.\n reportUnhandledError(error);\n }\n}\n\n/**\n * An error handler used when no error handler was supplied\n * to the SafeSubscriber -- meaning no error handler was supplied\n * to the `subscribe` call on our observable.\n * @param err The error to handle\n */\nfunction defaultErrorHandler(err: any) {\n throw err;\n}\n\n/**\n * A handler for notifications that cannot be sent to a stopped subscriber.\n * @param notification The notification being sent\n * @param subscriber The stopped subscriber\n */\nfunction handleStoppedNotification(notification: ObservableNotification, subscriber: Subscriber) {\n const { onStoppedNotification } = config;\n onStoppedNotification && timeoutProvider.setTimeout(() => onStoppedNotification(notification, subscriber));\n}\n\n/**\n * The observer used as a stub for subscriptions where the user did not\n * pass any arguments to `subscribe`. Comes with the default error handling\n * behavior.\n */\nexport const EMPTY_OBSERVER: Readonly> & { closed: true } = {\n closed: true,\n next: noop,\n error: defaultErrorHandler,\n complete: noop,\n};\n", "/**\n * Symbol.observable or a string \"@@observable\". 
Used for interop\n *\n * @deprecated We will no longer be exporting this symbol in upcoming versions of RxJS.\n * Instead polyfill and use Symbol.observable directly *or* use https://www.npmjs.com/package/symbol-observable\n */\nexport const observable: string | symbol = (() => (typeof Symbol === 'function' && Symbol.observable) || '@@observable')();\n", "/**\n * This function takes one parameter and just returns it. Simply put,\n * this is like `(x: T): T => x`.\n *\n * ## Examples\n *\n * This is useful in some cases when using things like `mergeMap`\n *\n * ```ts\n * import { interval, take, map, range, mergeMap, identity } from 'rxjs';\n *\n * const source$ = interval(1000).pipe(take(5));\n *\n * const result$ = source$.pipe(\n * map(i => range(i)),\n * mergeMap(identity) // same as mergeMap(x => x)\n * );\n *\n * result$.subscribe({\n * next: console.log\n * });\n * ```\n *\n * Or when you want to selectively apply an operator\n *\n * ```ts\n * import { interval, take, identity } from 'rxjs';\n *\n * const shouldLimit = () => Math.random() < 0.5;\n *\n * const source$ = interval(1000);\n *\n * const result$ = source$.pipe(shouldLimit() ? 
take(5) : identity);\n *\n * result$.subscribe({\n * next: console.log\n * });\n * ```\n *\n * @param x Any value that is returned by this function\n * @returns The value passed as the first parameter to this function\n */\nexport function identity(x: T): T {\n return x;\n}\n", "import { identity } from './identity';\nimport { UnaryFunction } from '../types';\n\nexport function pipe(): typeof identity;\nexport function pipe(fn1: UnaryFunction): UnaryFunction;\nexport function pipe(fn1: UnaryFunction, fn2: UnaryFunction): UnaryFunction;\nexport function pipe(fn1: UnaryFunction, fn2: UnaryFunction, fn3: UnaryFunction): UnaryFunction;\nexport function pipe(\n fn1: UnaryFunction,\n fn2: UnaryFunction,\n fn3: UnaryFunction,\n fn4: UnaryFunction\n): UnaryFunction;\nexport function pipe(\n fn1: UnaryFunction,\n fn2: UnaryFunction,\n fn3: UnaryFunction,\n fn4: UnaryFunction,\n fn5: UnaryFunction\n): UnaryFunction;\nexport function pipe(\n fn1: UnaryFunction,\n fn2: UnaryFunction,\n fn3: UnaryFunction,\n fn4: UnaryFunction,\n fn5: UnaryFunction,\n fn6: UnaryFunction\n): UnaryFunction;\nexport function pipe(\n fn1: UnaryFunction,\n fn2: UnaryFunction,\n fn3: UnaryFunction,\n fn4: UnaryFunction,\n fn5: UnaryFunction,\n fn6: UnaryFunction,\n fn7: UnaryFunction\n): UnaryFunction;\nexport function pipe(\n fn1: UnaryFunction,\n fn2: UnaryFunction,\n fn3: UnaryFunction,\n fn4: UnaryFunction,\n fn5: UnaryFunction,\n fn6: UnaryFunction,\n fn7: UnaryFunction,\n fn8: UnaryFunction\n): UnaryFunction;\nexport function pipe(\n fn1: UnaryFunction,\n fn2: UnaryFunction,\n fn3: UnaryFunction,\n fn4: UnaryFunction,\n fn5: UnaryFunction,\n fn6: UnaryFunction,\n fn7: UnaryFunction,\n fn8: UnaryFunction,\n fn9: UnaryFunction\n): UnaryFunction;\nexport function pipe(\n fn1: UnaryFunction,\n fn2: UnaryFunction,\n fn3: UnaryFunction,\n fn4: UnaryFunction,\n fn5: UnaryFunction,\n fn6: UnaryFunction,\n fn7: UnaryFunction,\n fn8: UnaryFunction,\n fn9: UnaryFunction,\n ...fns: UnaryFunction[]\n): 
UnaryFunction;\n\n/**\n * pipe() can be called on one or more functions, each of which can take one argument (\"UnaryFunction\")\n * and uses it to return a value.\n * It returns a function that takes one argument, passes it to the first UnaryFunction, and then\n * passes the result to the next one, passes that result to the next one, and so on. \n */\nexport function pipe(...fns: Array>): UnaryFunction {\n return pipeFromArray(fns);\n}\n\n/** @internal */\nexport function pipeFromArray(fns: Array>): UnaryFunction {\n if (fns.length === 0) {\n return identity as UnaryFunction;\n }\n\n if (fns.length === 1) {\n return fns[0];\n }\n\n return function piped(input: T): R {\n return fns.reduce((prev: any, fn: UnaryFunction) => fn(prev), input as any);\n };\n}\n", "import { Operator } from './Operator';\nimport { SafeSubscriber, Subscriber } from './Subscriber';\nimport { isSubscription, Subscription } from './Subscription';\nimport { TeardownLogic, OperatorFunction, Subscribable, Observer } from './types';\nimport { observable as Symbol_observable } from './symbol/observable';\nimport { pipeFromArray } from './util/pipe';\nimport { config } from './config';\nimport { isFunction } from './util/isFunction';\nimport { errorContext } from './util/errorContext';\n\n/**\n * A representation of any set of values over any amount of time. This is the most basic building block\n * of RxJS.\n *\n * @class Observable\n */\nexport class Observable implements Subscribable {\n /**\n * @deprecated Internal implementation detail, do not use directly. Will be made internal in v8.\n */\n source: Observable | undefined;\n\n /**\n * @deprecated Internal implementation detail, do not use directly. Will be made internal in v8.\n */\n operator: Operator | undefined;\n\n /**\n * @constructor\n * @param {Function} subscribe the function that is called when the Observable is\n * initially subscribed to. 
This function is given a Subscriber, to which new values\n * can be `next`ed, or an `error` method can be called to raise an error, or\n * `complete` can be called to notify of a successful completion.\n */\n constructor(subscribe?: (this: Observable, subscriber: Subscriber) => TeardownLogic) {\n if (subscribe) {\n this._subscribe = subscribe;\n }\n }\n\n // HACK: Since TypeScript inherits static properties too, we have to\n // fight against TypeScript here so Subject can have a different static create signature\n /**\n * Creates a new Observable by calling the Observable constructor\n * @owner Observable\n * @method create\n * @param {Function} subscribe? the subscriber function to be passed to the Observable constructor\n * @return {Observable} a new observable\n * @nocollapse\n * @deprecated Use `new Observable()` instead. Will be removed in v8.\n */\n static create: (...args: any[]) => any = (subscribe?: (subscriber: Subscriber) => TeardownLogic) => {\n return new Observable(subscribe);\n };\n\n /**\n * Creates a new Observable, with this Observable instance as the source, and the passed\n * operator defined as the new observable's operator.\n * @method lift\n * @param operator the operator defining the operation to take on the observable\n * @return a new observable with the Operator applied\n * @deprecated Internal implementation detail, do not use directly. Will be made internal in v8.\n * If you have implemented an operator using `lift`, it is recommended that you create an\n * operator by simply returning `new Observable()` directly. 
See \"Creating new operators from\n * scratch\" section here: https://rxjs.dev/guide/operators\n */\n lift(operator?: Operator): Observable {\n const observable = new Observable();\n observable.source = this;\n observable.operator = operator;\n return observable;\n }\n\n subscribe(observerOrNext?: Partial> | ((value: T) => void)): Subscription;\n /** @deprecated Instead of passing separate callback arguments, use an observer argument. Signatures taking separate callback arguments will be removed in v8. Details: https://rxjs.dev/deprecations/subscribe-arguments */\n subscribe(next?: ((value: T) => void) | null, error?: ((error: any) => void) | null, complete?: (() => void) | null): Subscription;\n /**\n * Invokes an execution of an Observable and registers Observer handlers for notifications it will emit.\n *\n * Use it when you have all these Observables, but still nothing is happening.\n *\n * `subscribe` is not a regular operator, but a method that calls Observable's internal `subscribe` function. It\n * might be for example a function that you passed to Observable's constructor, but most of the time it is\n * a library implementation, which defines what will be emitted by an Observable, and when it will be emitted. This means\n * that calling `subscribe` is actually the moment when Observable starts its work, not when it is created, as is often\n * thought.\n *\n * Apart from starting the execution of an Observable, this method allows you to listen for values\n * that an Observable emits, as well as for when it completes or errors. You can achieve this in two\n * of the following ways.\n *\n * The first way is creating an object that implements {@link Observer} interface. It should have methods\n * defined by that interface, but note that it should be just a regular JavaScript object, which you can create\n * yourself in any way you want (ES6 class, classic function constructor, object literal etc.). 
In particular, do\n * not attempt to use any RxJS implementation details to create Observers - you don't need them. Remember also\n * that your object does not have to implement all methods. If you find yourself creating a method that doesn't\n * do anything, you can simply omit it. Note however, if the `error` method is not provided and an error happens,\n * it will be thrown asynchronously. Errors thrown asynchronously cannot be caught using `try`/`catch`. Instead,\n * use the {@link onUnhandledError} configuration option or use a runtime handler (like `window.onerror` or\n * `process.on('error')`) to be notified of unhandled errors. Because of this, it's recommended that you provide\n * an `error` method to avoid missing thrown errors.\n *\n * The second way is to give up on the Observer object altogether and simply provide callback functions in place of its methods.\n * This means you can provide three functions as arguments to `subscribe`, where the first function is the equivalent\n * of a `next` method, the second of an `error` method and the third of a `complete` method. Just as in the case of an Observer,\n * if you do not need to listen for something, you can omit a function by passing `undefined` or `null`,\n * since `subscribe` recognizes these functions by where they were placed in the function call. When it comes\n * to the `error` function, as with an Observer, if not provided, errors emitted by an Observable will be thrown asynchronously.\n *\n * You can, however, subscribe with no parameters at all. This may be the case where you're not interested in terminal events\n * and you also handled emissions internally by using operators (e.g. using `tap`).\n *\n * Whichever style of calling `subscribe` you use, in both cases it returns a Subscription object.\n * This object allows you to call `unsubscribe` on it, which in turn will stop the work that an Observable does and will clean\n * up all resources that an Observable used. 
Note that cancelling a subscription will not call `complete` callback\n * provided to `subscribe` function, which is reserved for a regular completion signal that comes from an Observable.\n *\n * Remember that callbacks provided to `subscribe` are not guaranteed to be called asynchronously.\n * It is an Observable itself that decides when these functions will be called. For example {@link of}\n * by default emits all its values synchronously. Always check documentation for how given Observable\n * will behave when subscribed and if its default behavior can be modified with a `scheduler`.\n *\n * #### Examples\n *\n * Subscribe with an {@link guide/observer Observer}\n *\n * ```ts\n * import { of } from 'rxjs';\n *\n * const sumObserver = {\n * sum: 0,\n * next(value) {\n * console.log('Adding: ' + value);\n * this.sum = this.sum + value;\n * },\n * error() {\n * // We actually could just remove this method,\n * // since we do not really care about errors right now.\n * },\n * complete() {\n * console.log('Sum equals: ' + this.sum);\n * }\n * };\n *\n * of(1, 2, 3) // Synchronously emits 1, 2, 3 and then completes.\n * .subscribe(sumObserver);\n *\n * // Logs:\n * // 'Adding: 1'\n * // 'Adding: 2'\n * // 'Adding: 3'\n * // 'Sum equals: 6'\n * ```\n *\n * Subscribe with functions ({@link deprecations/subscribe-arguments deprecated})\n *\n * ```ts\n * import { of } from 'rxjs'\n *\n * let sum = 0;\n *\n * of(1, 2, 3).subscribe(\n * value => {\n * console.log('Adding: ' + value);\n * sum = sum + value;\n * },\n * undefined,\n * () => console.log('Sum equals: ' + sum)\n * );\n *\n * // Logs:\n * // 'Adding: 1'\n * // 'Adding: 2'\n * // 'Adding: 3'\n * // 'Sum equals: 6'\n * ```\n *\n * Cancel a subscription\n *\n * ```ts\n * import { interval } from 'rxjs';\n *\n * const subscription = interval(1000).subscribe({\n * next(num) {\n * console.log(num)\n * },\n * complete() {\n * // Will not be called, even when cancelling subscription.\n * console.log('completed!');\n * 
}\n * });\n *\n * setTimeout(() => {\n * subscription.unsubscribe();\n * console.log('unsubscribed!');\n * }, 2500);\n *\n * // Logs:\n * // 0 after 1s\n * // 1 after 2s\n * // 'unsubscribed!' after 2.5s\n * ```\n *\n * @param {Observer|Function} observerOrNext (optional) Either an observer with methods to be called,\n * or the first of three possible handlers, which is the handler for each value emitted from the subscribed\n * Observable.\n * @param {Function} error (optional) A handler for a terminal event resulting from an error. If no error handler is provided,\n * the error will be thrown asynchronously as unhandled.\n * @param {Function} complete (optional) A handler for a terminal event resulting from successful completion.\n * @return {Subscription} a subscription reference to the registered handlers\n * @method subscribe\n */\n subscribe(\n observerOrNext?: Partial> | ((value: T) => void) | null,\n error?: ((error: any) => void) | null,\n complete?: (() => void) | null\n ): Subscription {\n const subscriber = isSubscriber(observerOrNext) ? observerOrNext : new SafeSubscriber(observerOrNext, error, complete);\n\n errorContext(() => {\n const { operator, source } = this;\n subscriber.add(\n operator\n ? // We're dealing with a subscription in the\n // operator chain to one of our lifted operators.\n operator.call(subscriber, source)\n : source\n ? // If `source` has a value, but `operator` does not, something that\n // had intimate knowledge of our API, like our `Subject`, must have\n // set it. 
We're going to just call `_subscribe` directly.\n this._subscribe(subscriber)\n : // In all other cases, we're likely wrapping a user-provided initializer\n // function, so we need to catch errors and handle them appropriately.\n this._trySubscribe(subscriber)\n );\n });\n\n return subscriber;\n }\n\n /** @internal */\n protected _trySubscribe(sink: Subscriber): TeardownLogic {\n try {\n return this._subscribe(sink);\n } catch (err) {\n // We don't need to return anything in this case,\n // because it's just going to try to `add()` to a subscription\n // above.\n sink.error(err);\n }\n }\n\n /**\n * Used as a NON-CANCELLABLE means of subscribing to an observable, for use with\n * APIs that expect promises, like `async/await`. You cannot unsubscribe from this.\n *\n * **WARNING**: Only use this with observables you *know* will complete. If the source\n * observable does not complete, you will end up with a promise that is hung up, and\n * potentially all of the state of an async function hanging out in memory. 
To avoid\n * this situation, look into adding something like {@link timeout}, {@link take},\n * {@link takeWhile}, or {@link takeUntil} amongst others.\n *\n * #### Example\n *\n * ```ts\n * import { interval, take } from 'rxjs';\n *\n * const source$ = interval(1000).pipe(take(4));\n *\n * async function getTotal() {\n * let total = 0;\n *\n * await source$.forEach(value => {\n * total += value;\n * console.log('observable -> ' + value);\n * });\n *\n * return total;\n * }\n *\n * getTotal().then(\n * total => console.log('Total: ' + total)\n * );\n *\n * // Expected:\n * // 'observable -> 0'\n * // 'observable -> 1'\n * // 'observable -> 2'\n * // 'observable -> 3'\n * // 'Total: 6'\n * ```\n *\n * @param next a handler for each value emitted by the observable\n * @return a promise that either resolves on observable completion or\n * rejects with the handled error\n */\n forEach(next: (value: T) => void): Promise;\n\n /**\n * @param next a handler for each value emitted by the observable\n * @param promiseCtor a constructor function used to instantiate the Promise\n * @return a promise that either resolves on observable completion or\n * rejects with the handled error\n * @deprecated Passing a Promise constructor will no longer be available\n * in upcoming versions of RxJS. This is because it adds weight to the library, for very\n * little benefit. If you need this functionality, it is recommended that you either\n * polyfill Promise, or you create an adapter to convert the returned native promise\n * to whatever promise implementation you wanted. 
Will be removed in v8.\n */\n forEach(next: (value: T) => void, promiseCtor: PromiseConstructorLike): Promise;\n\n forEach(next: (value: T) => void, promiseCtor?: PromiseConstructorLike): Promise {\n promiseCtor = getPromiseCtor(promiseCtor);\n\n return new promiseCtor((resolve, reject) => {\n const subscriber = new SafeSubscriber({\n next: (value) => {\n try {\n next(value);\n } catch (err) {\n reject(err);\n subscriber.unsubscribe();\n }\n },\n error: reject,\n complete: resolve,\n });\n this.subscribe(subscriber);\n }) as Promise;\n }\n\n /** @internal */\n protected _subscribe(subscriber: Subscriber): TeardownLogic {\n return this.source?.subscribe(subscriber);\n }\n\n /**\n * An interop point defined by the es7-observable spec https://github.com/zenparsing/es-observable\n * @method Symbol.observable\n * @return {Observable} this instance of the observable\n */\n [Symbol_observable]() {\n return this;\n }\n\n /* tslint:disable:max-line-length */\n pipe(): Observable;\n pipe(op1: OperatorFunction): Observable;\n pipe(op1: OperatorFunction, op2: OperatorFunction): Observable;\n pipe(op1: OperatorFunction, op2: OperatorFunction, op3: OperatorFunction): Observable;\n pipe(\n op1: OperatorFunction,\n op2: OperatorFunction,\n op3: OperatorFunction,\n op4: OperatorFunction\n ): Observable;\n pipe(\n op1: OperatorFunction,\n op2: OperatorFunction,\n op3: OperatorFunction,\n op4: OperatorFunction,\n op5: OperatorFunction\n ): Observable;\n pipe(\n op1: OperatorFunction,\n op2: OperatorFunction,\n op3: OperatorFunction,\n op4: OperatorFunction,\n op5: OperatorFunction,\n op6: OperatorFunction\n ): Observable;\n pipe(\n op1: OperatorFunction,\n op2: OperatorFunction,\n op3: OperatorFunction,\n op4: OperatorFunction,\n op5: OperatorFunction,\n op6: OperatorFunction,\n op7: OperatorFunction\n ): Observable;\n pipe(\n op1: OperatorFunction,\n op2: OperatorFunction,\n op3: OperatorFunction,\n op4: OperatorFunction,\n op5: OperatorFunction,\n op6: OperatorFunction,\n op7: 
OperatorFunction,\n op8: OperatorFunction\n ): Observable;\n pipe(\n op1: OperatorFunction,\n op2: OperatorFunction,\n op3: OperatorFunction,\n op4: OperatorFunction,\n op5: OperatorFunction,\n op6: OperatorFunction,\n op7: OperatorFunction,\n op8: OperatorFunction,\n op9: OperatorFunction\n ): Observable;\n pipe(\n op1: OperatorFunction,\n op2: OperatorFunction,\n op3: OperatorFunction,\n op4: OperatorFunction,\n op5: OperatorFunction,\n op6: OperatorFunction,\n op7: OperatorFunction,\n op8: OperatorFunction,\n op9: OperatorFunction,\n ...operations: OperatorFunction[]\n ): Observable;\n /* tslint:enable:max-line-length */\n\n /**\n * Used to stitch together functional operators into a chain.\n * @method pipe\n * @return {Observable} the Observable result of all of the operators having\n * been called in the order they were passed in.\n *\n * ## Example\n *\n * ```ts\n * import { interval, filter, map, scan } from 'rxjs';\n *\n * interval(1000)\n * .pipe(\n * filter(x => x % 2 === 0),\n * map(x => x + x),\n * scan((acc, x) => acc + x)\n * )\n * .subscribe(x => console.log(x));\n * ```\n */\n pipe(...operations: OperatorFunction[]): Observable {\n return pipeFromArray(operations)(this);\n }\n\n /* tslint:disable:max-line-length */\n /** @deprecated Replaced with {@link firstValueFrom} and {@link lastValueFrom}. Will be removed in v8. Details: https://rxjs.dev/deprecations/to-promise */\n toPromise(): Promise;\n /** @deprecated Replaced with {@link firstValueFrom} and {@link lastValueFrom}. Will be removed in v8. Details: https://rxjs.dev/deprecations/to-promise */\n toPromise(PromiseCtor: typeof Promise): Promise;\n /** @deprecated Replaced with {@link firstValueFrom} and {@link lastValueFrom}. Will be removed in v8. 
Details: https://rxjs.dev/deprecations/to-promise */\n toPromise(PromiseCtor: PromiseConstructorLike): Promise;\n /* tslint:enable:max-line-length */\n\n /**\n * Subscribe to this Observable and get a Promise resolving on\n * `complete` with the last emission (if any).\n *\n * **WARNING**: Only use this with observables you *know* will complete. If the source\n * observable does not complete, you will end up with a promise that is hung up, and\n * potentially all of the state of an async function hanging out in memory. To avoid\n * this situation, look into adding something like {@link timeout}, {@link take},\n * {@link takeWhile}, or {@link takeUntil} amongst others.\n *\n * @method toPromise\n * @param [promiseCtor] a constructor function used to instantiate\n * the Promise\n * @return A Promise that resolves with the last value emit, or\n * rejects on an error. If there were no emissions, Promise\n * resolves with undefined.\n * @deprecated Replaced with {@link firstValueFrom} and {@link lastValueFrom}. Will be removed in v8. Details: https://rxjs.dev/deprecations/to-promise\n */\n toPromise(promiseCtor?: PromiseConstructorLike): Promise {\n promiseCtor = getPromiseCtor(promiseCtor);\n\n return new promiseCtor((resolve, reject) => {\n let value: T | undefined;\n this.subscribe(\n (x: T) => (value = x),\n (err: any) => reject(err),\n () => resolve(value)\n );\n }) as Promise;\n }\n}\n\n/**\n * Decides between a passed promise constructor from consuming code,\n * A default configured promise constructor, and the native promise\n * constructor and returns it. If nothing can be found, it will throw\n * an error.\n * @param promiseCtor The optional promise constructor to passed by consuming code\n */\nfunction getPromiseCtor(promiseCtor: PromiseConstructorLike | undefined) {\n return promiseCtor ?? config.Promise ?? 
Promise;\n}\n\nfunction isObserver(value: any): value is Observer {\n return value && isFunction(value.next) && isFunction(value.error) && isFunction(value.complete);\n}\n\nfunction isSubscriber(value: any): value is Subscriber {\n return (value && value instanceof Subscriber) || (isObserver(value) && isSubscription(value));\n}\n", "import { Observable } from '../Observable';\nimport { Subscriber } from '../Subscriber';\nimport { OperatorFunction } from '../types';\nimport { isFunction } from './isFunction';\n\n/**\n * Used to determine if an object is an Observable with a lift function.\n */\nexport function hasLift(source: any): source is { lift: InstanceType['lift'] } {\n return isFunction(source?.lift);\n}\n\n/**\n * Creates an `OperatorFunction`. Used to define operators throughout the library in a concise way.\n * @param init The logic to connect the liftedSource to the subscriber at the moment of subscription.\n */\nexport function operate(\n init: (liftedSource: Observable, subscriber: Subscriber) => (() => void) | void\n): OperatorFunction {\n return (source: Observable) => {\n if (hasLift(source)) {\n return source.lift(function (this: Subscriber, liftedSource: Observable) {\n try {\n return init(liftedSource, this);\n } catch (err) {\n this.error(err);\n }\n });\n }\n throw new TypeError('Unable to lift unknown Observable type');\n };\n}\n", "import { Subscriber } from '../Subscriber';\n\n/**\n * Creates an instance of an `OperatorSubscriber`.\n * @param destination The downstream subscriber.\n * @param onNext Handles next values, only called if this subscriber is not stopped or closed. Any\n * error that occurs in this function is caught and sent to the `error` method of this subscriber.\n * @param onError Handles errors from the subscription, any errors that occur in this handler are caught\n * and send to the `destination` error handler.\n * @param onComplete Handles completion notification from the subscription. 
Any errors that occur in\n * this handler are sent to the `destination` error handler.\n * @param onFinalize Additional teardown logic here. This will only be called on teardown if the\n * subscriber itself is not already closed. This is called after all other teardown logic is executed.\n */\nexport function createOperatorSubscriber(\n destination: Subscriber,\n onNext?: (value: T) => void,\n onComplete?: () => void,\n onError?: (err: any) => void,\n onFinalize?: () => void\n): Subscriber {\n return new OperatorSubscriber(destination, onNext, onComplete, onError, onFinalize);\n}\n\n/**\n * A generic helper for allowing operators to be created with a Subscriber and\n * use closures to capture necessary state from the operator function itself.\n */\nexport class OperatorSubscriber extends Subscriber {\n /**\n * Creates an instance of an `OperatorSubscriber`.\n * @param destination The downstream subscriber.\n * @param onNext Handles next values, only called if this subscriber is not stopped or closed. Any\n * error that occurs in this function is caught and sent to the `error` method of this subscriber.\n * @param onError Handles errors from the subscription, any errors that occur in this handler are caught\n * and send to the `destination` error handler.\n * @param onComplete Handles completion notification from the subscription. Any errors that occur in\n * this handler are sent to the `destination` error handler.\n * @param onFinalize Additional finalization logic here. This will only be called on finalization if the\n * subscriber itself is not already closed. 
This is called after all other finalization logic is executed.\n * @param shouldUnsubscribe An optional check to see if an unsubscribe call should truly unsubscribe.\n * NOTE: This currently **ONLY** exists to support the strange behavior of {@link groupBy}, where unsubscription\n * to the resulting observable does not actually disconnect from the source if there are active subscriptions\n * to any grouped observable. (DO NOT EXPOSE OR USE EXTERNALLY!!!)\n */\n constructor(\n destination: Subscriber,\n onNext?: (value: T) => void,\n onComplete?: () => void,\n onError?: (err: any) => void,\n private onFinalize?: () => void,\n private shouldUnsubscribe?: () => boolean\n ) {\n // It's important - for performance reasons - that all of this class's\n // members are initialized and that they are always initialized in the same\n // order. This will ensure that all OperatorSubscriber instances have the\n // same hidden class in V8. This, in turn, will help keep the number of\n // hidden classes involved in property accesses within the base class as\n // low as possible. If the number of hidden classes involved exceeds four,\n // the property accesses will become megamorphic and performance penalties\n // will be incurred - i.e. inline caches won't be used.\n //\n // The reasons for ensuring all instances have the same hidden class are\n // further discussed in this blog post from Benedikt Meurer:\n // https://benediktmeurer.de/2018/03/23/impact-of-polymorphism-on-component-based-frameworks-like-react/\n super(destination);\n this._next = onNext\n ? function (this: OperatorSubscriber, value: T) {\n try {\n onNext(value);\n } catch (err) {\n destination.error(err);\n }\n }\n : super._next;\n this._error = onError\n ? 
function (this: OperatorSubscriber, err: any) {\n try {\n onError(err);\n } catch (err) {\n // Send any errors that occur down stream.\n destination.error(err);\n } finally {\n // Ensure finalization.\n this.unsubscribe();\n }\n }\n : super._error;\n this._complete = onComplete\n ? function (this: OperatorSubscriber) {\n try {\n onComplete();\n } catch (err) {\n // Send any errors that occur down stream.\n destination.error(err);\n } finally {\n // Ensure finalization.\n this.unsubscribe();\n }\n }\n : super._complete;\n }\n\n unsubscribe() {\n if (!this.shouldUnsubscribe || this.shouldUnsubscribe()) {\n const { closed } = this;\n super.unsubscribe();\n // Execute additional teardown if we have any and we didn't already do so.\n !closed && this.onFinalize?.();\n }\n }\n}\n", "import { Subscription } from '../Subscription';\n\ninterface AnimationFrameProvider {\n schedule(callback: FrameRequestCallback): Subscription;\n requestAnimationFrame: typeof requestAnimationFrame;\n cancelAnimationFrame: typeof cancelAnimationFrame;\n delegate:\n | {\n requestAnimationFrame: typeof requestAnimationFrame;\n cancelAnimationFrame: typeof cancelAnimationFrame;\n }\n | undefined;\n}\n\nexport const animationFrameProvider: AnimationFrameProvider = {\n // When accessing the delegate, use the variable rather than `this` so that\n // the functions can be called without being bound to the provider.\n schedule(callback) {\n let request = requestAnimationFrame;\n let cancel: typeof cancelAnimationFrame | undefined = cancelAnimationFrame;\n const { delegate } = animationFrameProvider;\n if (delegate) {\n request = delegate.requestAnimationFrame;\n cancel = delegate.cancelAnimationFrame;\n }\n const handle = request((timestamp) => {\n // Clear the cancel function. 
The request has been fulfilled, so\n // attempting to cancel the request upon unsubscription would be\n // pointless.\n cancel = undefined;\n callback(timestamp);\n });\n return new Subscription(() => cancel?.(handle));\n },\n requestAnimationFrame(...args) {\n const { delegate } = animationFrameProvider;\n return (delegate?.requestAnimationFrame || requestAnimationFrame)(...args);\n },\n cancelAnimationFrame(...args) {\n const { delegate } = animationFrameProvider;\n return (delegate?.cancelAnimationFrame || cancelAnimationFrame)(...args);\n },\n delegate: undefined,\n};\n", "import { createErrorClass } from './createErrorClass';\n\nexport interface ObjectUnsubscribedError extends Error {}\n\nexport interface ObjectUnsubscribedErrorCtor {\n /**\n * @deprecated Internal implementation detail. Do not construct error instances.\n * Cannot be tagged as internal: https://github.com/ReactiveX/rxjs/issues/6269\n */\n new (): ObjectUnsubscribedError;\n}\n\n/**\n * An error thrown when an action is invalid because the object has been\n * unsubscribed.\n *\n * @see {@link Subject}\n * @see {@link BehaviorSubject}\n *\n * @class ObjectUnsubscribedError\n */\nexport const ObjectUnsubscribedError: ObjectUnsubscribedErrorCtor = createErrorClass(\n (_super) =>\n function ObjectUnsubscribedErrorImpl(this: any) {\n _super(this);\n this.name = 'ObjectUnsubscribedError';\n this.message = 'object unsubscribed';\n }\n);\n", "import { Operator } from './Operator';\nimport { Observable } from './Observable';\nimport { Subscriber } from './Subscriber';\nimport { Subscription, EMPTY_SUBSCRIPTION } from './Subscription';\nimport { Observer, SubscriptionLike, TeardownLogic } from './types';\nimport { ObjectUnsubscribedError } from './util/ObjectUnsubscribedError';\nimport { arrRemove } from './util/arrRemove';\nimport { errorContext } from './util/errorContext';\n\n/**\n * A Subject is a special type of Observable that allows values to be\n * multicasted to many Observers. 
Subjects are like EventEmitters.\n *\n * Every Subject is an Observable and an Observer. You can subscribe to a\n * Subject, and you can call next to feed values as well as error and complete.\n */\nexport class Subject extends Observable implements SubscriptionLike {\n closed = false;\n\n private currentObservers: Observer[] | null = null;\n\n /** @deprecated Internal implementation detail, do not use directly. Will be made internal in v8. */\n observers: Observer[] = [];\n /** @deprecated Internal implementation detail, do not use directly. Will be made internal in v8. */\n isStopped = false;\n /** @deprecated Internal implementation detail, do not use directly. Will be made internal in v8. */\n hasError = false;\n /** @deprecated Internal implementation detail, do not use directly. Will be made internal in v8. */\n thrownError: any = null;\n\n /**\n * Creates a \"subject\" by basically gluing an observer to an observable.\n *\n * @nocollapse\n * @deprecated Recommended you do not use. Will be removed at some point in the future. Plans for replacement still under discussion.\n */\n static create: (...args: any[]) => any = (destination: Observer, source: Observable): AnonymousSubject => {\n return new AnonymousSubject(destination, source);\n };\n\n constructor() {\n // NOTE: This must be here to obscure Observable's constructor.\n super();\n }\n\n /** @deprecated Internal implementation detail, do not use directly. Will be made internal in v8. 
*/\n lift(operator: Operator): Observable {\n const subject = new AnonymousSubject(this, this);\n subject.operator = operator as any;\n return subject as any;\n }\n\n /** @internal */\n protected _throwIfClosed() {\n if (this.closed) {\n throw new ObjectUnsubscribedError();\n }\n }\n\n next(value: T) {\n errorContext(() => {\n this._throwIfClosed();\n if (!this.isStopped) {\n if (!this.currentObservers) {\n this.currentObservers = Array.from(this.observers);\n }\n for (const observer of this.currentObservers) {\n observer.next(value);\n }\n }\n });\n }\n\n error(err: any) {\n errorContext(() => {\n this._throwIfClosed();\n if (!this.isStopped) {\n this.hasError = this.isStopped = true;\n this.thrownError = err;\n const { observers } = this;\n while (observers.length) {\n observers.shift()!.error(err);\n }\n }\n });\n }\n\n complete() {\n errorContext(() => {\n this._throwIfClosed();\n if (!this.isStopped) {\n this.isStopped = true;\n const { observers } = this;\n while (observers.length) {\n observers.shift()!.complete();\n }\n }\n });\n }\n\n unsubscribe() {\n this.isStopped = this.closed = true;\n this.observers = this.currentObservers = null!;\n }\n\n get observed() {\n return this.observers?.length > 0;\n }\n\n /** @internal */\n protected _trySubscribe(subscriber: Subscriber): TeardownLogic {\n this._throwIfClosed();\n return super._trySubscribe(subscriber);\n }\n\n /** @internal */\n protected _subscribe(subscriber: Subscriber): Subscription {\n this._throwIfClosed();\n this._checkFinalizedStatuses(subscriber);\n return this._innerSubscribe(subscriber);\n }\n\n /** @internal */\n protected _innerSubscribe(subscriber: Subscriber) {\n const { hasError, isStopped, observers } = this;\n if (hasError || isStopped) {\n return EMPTY_SUBSCRIPTION;\n }\n this.currentObservers = null;\n observers.push(subscriber);\n return new Subscription(() => {\n this.currentObservers = null;\n arrRemove(observers, subscriber);\n });\n }\n\n /** @internal */\n protected 
_checkFinalizedStatuses(subscriber: Subscriber) {\n const { hasError, thrownError, isStopped } = this;\n if (hasError) {\n subscriber.error(thrownError);\n } else if (isStopped) {\n subscriber.complete();\n }\n }\n\n /**\n * Creates a new Observable with this Subject as the source. You can do this\n * to create custom Observer-side logic of the Subject and conceal it from\n * code that uses the Observable.\n * @return {Observable} Observable that the Subject casts to\n */\n asObservable(): Observable {\n const observable: any = new Observable();\n observable.source = this;\n return observable;\n }\n}\n\n/**\n * @class AnonymousSubject\n */\nexport class AnonymousSubject extends Subject {\n constructor(\n /** @deprecated Internal implementation detail, do not use directly. Will be made internal in v8. */\n public destination?: Observer,\n source?: Observable\n ) {\n super();\n this.source = source;\n }\n\n next(value: T) {\n this.destination?.next?.(value);\n }\n\n error(err: any) {\n this.destination?.error?.(err);\n }\n\n complete() {\n this.destination?.complete?.();\n }\n\n /** @internal */\n protected _subscribe(subscriber: Subscriber): Subscription {\n return this.source?.subscribe(subscriber) ?? 
EMPTY_SUBSCRIPTION;\n }\n}\n", "import { TimestampProvider } from '../types';\n\ninterface DateTimestampProvider extends TimestampProvider {\n delegate: TimestampProvider | undefined;\n}\n\nexport const dateTimestampProvider: DateTimestampProvider = {\n now() {\n // Use the variable rather than `this` so that the function can be called\n // without being bound to the provider.\n return (dateTimestampProvider.delegate || Date).now();\n },\n delegate: undefined,\n};\n", "import { Subject } from './Subject';\nimport { TimestampProvider } from './types';\nimport { Subscriber } from './Subscriber';\nimport { Subscription } from './Subscription';\nimport { dateTimestampProvider } from './scheduler/dateTimestampProvider';\n\n/**\n * A variant of {@link Subject} that \"replays\" old values to new subscribers by emitting them when they first subscribe.\n *\n * `ReplaySubject` has an internal buffer that will store a specified number of values that it has observed. Like `Subject`,\n * `ReplaySubject` \"observes\" values by having them passed to its `next` method. When it observes a value, it will store that\n * value for a time determined by the configuration of the `ReplaySubject`, as passed to its constructor.\n *\n * When a new subscriber subscribes to the `ReplaySubject` instance, it will synchronously emit all values in its buffer in\n * a First-In-First-Out (FIFO) manner. The `ReplaySubject` will also complete, if it has observed completion; and it will\n * error if it has observed an error.\n *\n * There are two main configuration items to be concerned with:\n *\n * 1. `bufferSize` - This will determine how many items are stored in the buffer, defaults to infinite.\n * 2. `windowTime` - The amount of time to hold a value in the buffer before removing it from the buffer.\n *\n * Both configurations may exist simultaneously. 
So if you would like to buffer a maximum of 3 values, as long as the values\n * are less than 2 seconds old, you could do so with a `new ReplaySubject(3, 2000)`.\n *\n * ### Differences with BehaviorSubject\n *\n * `BehaviorSubject` is similar to `new ReplaySubject(1)`, with a couple of exceptions:\n *\n * 1. `BehaviorSubject` comes \"primed\" with a single value upon construction.\n * 2. `ReplaySubject` will replay values, even after observing an error, where `BehaviorSubject` will not.\n *\n * @see {@link Subject}\n * @see {@link BehaviorSubject}\n * @see {@link shareReplay}\n */\nexport class ReplaySubject extends Subject {\n private _buffer: (T | number)[] = [];\n private _infiniteTimeWindow = true;\n\n /**\n * @param bufferSize The size of the buffer to replay on subscription\n * @param windowTime The amount of time the buffered items will stay buffered\n * @param timestampProvider An object with a `now()` method that provides the current timestamp. This is used to\n * calculate the amount of time something has been buffered.\n */\n constructor(\n private _bufferSize = Infinity,\n private _windowTime = Infinity,\n private _timestampProvider: TimestampProvider = dateTimestampProvider\n ) {\n super();\n this._infiniteTimeWindow = _windowTime === Infinity;\n this._bufferSize = Math.max(1, _bufferSize);\n this._windowTime = Math.max(1, _windowTime);\n }\n\n next(value: T): void {\n const { isStopped, _buffer, _infiniteTimeWindow, _timestampProvider, _windowTime } = this;\n if (!isStopped) {\n _buffer.push(value);\n !_infiniteTimeWindow && _buffer.push(_timestampProvider.now() + _windowTime);\n }\n this._trimBuffer();\n super.next(value);\n }\n\n /** @internal */\n protected _subscribe(subscriber: Subscriber): Subscription {\n this._throwIfClosed();\n this._trimBuffer();\n\n const subscription = this._innerSubscribe(subscriber);\n\n const { _infiniteTimeWindow, _buffer } = this;\n // We use a copy here, so reentrant code does not mutate our array while we're\n // 
emitting it to a new subscriber.\n const copy = _buffer.slice();\n for (let i = 0; i < copy.length && !subscriber.closed; i += _infiniteTimeWindow ? 1 : 2) {\n subscriber.next(copy[i] as T);\n }\n\n this._checkFinalizedStatuses(subscriber);\n\n return subscription;\n }\n\n private _trimBuffer() {\n const { _bufferSize, _timestampProvider, _buffer, _infiniteTimeWindow } = this;\n // If we don't have an infinite buffer size, and we're over the length,\n // use splice to truncate the old buffer values off. Note that we have to\n // double the size for instances where we're not using an infinite time window\n // because we're storing the values and the timestamps in the same array.\n const adjustedBufferSize = (_infiniteTimeWindow ? 1 : 2) * _bufferSize;\n _bufferSize < Infinity && adjustedBufferSize < _buffer.length && _buffer.splice(0, _buffer.length - adjustedBufferSize);\n\n // Now, if we're not in an infinite time window, remove all values where the time is\n // older than what is allowed.\n if (!_infiniteTimeWindow) {\n const now = _timestampProvider.now();\n let last = 0;\n // Search the array for the first timestamp that isn't expired and\n // truncate the buffer up to that point.\n for (let i = 1; i < _buffer.length && (_buffer[i] as number) <= now; i += 2) {\n last = i;\n }\n last && _buffer.splice(0, last + 1);\n }\n }\n}\n", "import { Scheduler } from '../Scheduler';\nimport { Subscription } from '../Subscription';\nimport { SchedulerAction } from '../types';\n\n/**\n * A unit of work to be executed in a `scheduler`. 
An action is typically\n * created from within a {@link SchedulerLike} and an RxJS user does not need to concern\n * themselves about creating and manipulating an Action.\n *\n * ```ts\n * class Action extends Subscription {\n * new (scheduler: Scheduler, work: (state?: T) => void);\n * schedule(state?: T, delay: number = 0): Subscription;\n * }\n * ```\n *\n * @class Action\n */\nexport class Action extends Subscription {\n constructor(scheduler: Scheduler, work: (this: SchedulerAction, state?: T) => void) {\n super();\n }\n /**\n * Schedules this action on its parent {@link SchedulerLike} for execution. May be passed\n * some context object, `state`. May happen at some point in the future,\n * according to the `delay` parameter, if specified.\n * @param {T} [state] Some contextual data that the `work` function uses when\n * called by the Scheduler.\n * @param {number} [delay] Time to wait before executing the work, where the\n * time unit is implicit and defined by the Scheduler.\n * @return {void}\n */\n public schedule(state?: T, delay: number = 0): Subscription {\n return this;\n }\n}\n", "import type { TimerHandle } from './timerHandle';\ntype SetIntervalFunction = (handler: () => void, timeout?: number, ...args: any[]) => TimerHandle;\ntype ClearIntervalFunction = (handle: TimerHandle) => void;\n\ninterface IntervalProvider {\n setInterval: SetIntervalFunction;\n clearInterval: ClearIntervalFunction;\n delegate:\n | {\n setInterval: SetIntervalFunction;\n clearInterval: ClearIntervalFunction;\n }\n | undefined;\n}\n\nexport const intervalProvider: IntervalProvider = {\n // When accessing the delegate, use the variable rather than `this` so that\n // the functions can be called without being bound to the provider.\n setInterval(handler: () => void, timeout?: number, ...args) {\n const { delegate } = intervalProvider;\n if (delegate?.setInterval) {\n return delegate.setInterval(handler, timeout, ...args);\n }\n return setInterval(handler, timeout, ...args);\n 
},\n clearInterval(handle) {\n const { delegate } = intervalProvider;\n return (delegate?.clearInterval || clearInterval)(handle as any);\n },\n delegate: undefined,\n};\n", "import { Action } from './Action';\nimport { SchedulerAction } from '../types';\nimport { Subscription } from '../Subscription';\nimport { AsyncScheduler } from './AsyncScheduler';\nimport { intervalProvider } from './intervalProvider';\nimport { arrRemove } from '../util/arrRemove';\nimport { TimerHandle } from './timerHandle';\n\nexport class AsyncAction extends Action {\n public id: TimerHandle | undefined;\n public state?: T;\n // @ts-ignore: Property has no initializer and is not definitely assigned\n public delay: number;\n protected pending: boolean = false;\n\n constructor(protected scheduler: AsyncScheduler, protected work: (this: SchedulerAction, state?: T) => void) {\n super(scheduler, work);\n }\n\n public schedule(state?: T, delay: number = 0): Subscription {\n if (this.closed) {\n return this;\n }\n\n // Always replace the current state with the new state.\n this.state = state;\n\n const id = this.id;\n const scheduler = this.scheduler;\n\n //\n // Important implementation note:\n //\n // Actions only execute once by default, unless rescheduled from within the\n // scheduled callback. This allows us to implement single and repeat\n // actions via the same code path, without adding API surface area, as well\n // as mimic traditional recursion but across asynchronous boundaries.\n //\n // However, JS runtimes and timers distinguish between intervals achieved by\n // serial `setTimeout` calls vs. a single `setInterval` call. An interval of\n // serial `setTimeout` calls can be individually delayed, which delays\n // scheduling the next `setTimeout`, and so on. 
`setInterval` attempts to\n // guarantee the interval callback will be invoked more precisely to the\n // interval period, regardless of load.\n //\n // Therefore, we use `setInterval` to schedule single and repeat actions.\n // If the action reschedules itself with the same delay, the interval is not\n // canceled. If the action doesn't reschedule, or reschedules with a\n // different delay, the interval will be canceled after scheduled callback\n // execution.\n //\n if (id != null) {\n this.id = this.recycleAsyncId(scheduler, id, delay);\n }\n\n // Set the pending flag indicating that this action has been scheduled, or\n // has recursively rescheduled itself.\n this.pending = true;\n\n this.delay = delay;\n // If this action has already an async Id, don't request a new one.\n this.id = this.id ?? this.requestAsyncId(scheduler, this.id, delay);\n\n return this;\n }\n\n protected requestAsyncId(scheduler: AsyncScheduler, _id?: TimerHandle, delay: number = 0): TimerHandle {\n return intervalProvider.setInterval(scheduler.flush.bind(scheduler, this), delay);\n }\n\n protected recycleAsyncId(_scheduler: AsyncScheduler, id?: TimerHandle, delay: number | null = 0): TimerHandle | undefined {\n // If this action is rescheduled with the same delay time, don't clear the interval id.\n if (delay != null && this.delay === delay && this.pending === false) {\n return id;\n }\n // Otherwise, if the action's delay time is different from the current delay,\n // or the action has been rescheduled before it's executed, clear the interval id\n if (id != null) {\n intervalProvider.clearInterval(id);\n }\n\n return undefined;\n }\n\n /**\n * Immediately executes this action and the `work` it contains.\n * @return {any}\n */\n public execute(state: T, delay: number): any {\n if (this.closed) {\n return new Error('executing a cancelled action');\n }\n\n this.pending = false;\n const error = this._execute(state, delay);\n if (error) {\n return error;\n } else if (this.pending === false 
&& this.id != null) {\n // Dequeue if the action didn't reschedule itself. Don't call\n // unsubscribe(), because the action could reschedule later.\n // For example:\n // ```\n // scheduler.schedule(function doWork(counter) {\n // /* ... I'm a busy worker bee ... */\n // var originalAction = this;\n // /* wait 100ms before rescheduling the action */\n // setTimeout(function () {\n // originalAction.schedule(counter + 1);\n // }, 100);\n // }, 1000);\n // ```\n this.id = this.recycleAsyncId(this.scheduler, this.id, null);\n }\n }\n\n protected _execute(state: T, _delay: number): any {\n let errored: boolean = false;\n let errorValue: any;\n try {\n this.work(state);\n } catch (e) {\n errored = true;\n // HACK: Since code elsewhere is relying on the \"truthiness\" of the\n // return here, we can't have it return \"\" or 0 or false.\n // TODO: Clean this up when we refactor schedulers mid-version-8 or so.\n errorValue = e ? e : new Error('Scheduled action threw falsy error');\n }\n if (errored) {\n this.unsubscribe();\n return errorValue;\n }\n }\n\n unsubscribe() {\n if (!this.closed) {\n const { id, scheduler } = this;\n const { actions } = scheduler;\n\n this.work = this.state = this.scheduler = null!;\n this.pending = false;\n\n arrRemove(actions, this);\n if (id != null) {\n this.id = this.recycleAsyncId(scheduler, id, null);\n }\n\n this.delay = null!;\n super.unsubscribe();\n }\n }\n}\n", "import { Action } from './scheduler/Action';\nimport { Subscription } from './Subscription';\nimport { SchedulerLike, SchedulerAction } from './types';\nimport { dateTimestampProvider } from './scheduler/dateTimestampProvider';\n\n/**\n * An execution context and a data structure to order tasks and schedule their\n * execution. 
Provides a notion of (potentially virtual) time, through the\n * `now()` getter method.\n *\n * Each unit of work in a Scheduler is called an `Action`.\n *\n * ```ts\n * class Scheduler {\n * now(): number;\n * schedule(work, delay?, state?): Subscription;\n * }\n * ```\n *\n * @class Scheduler\n * @deprecated Scheduler is an internal implementation detail of RxJS, and\n * should not be used directly. Rather, create your own class and implement\n * {@link SchedulerLike}. Will be made internal in v8.\n */\nexport class Scheduler implements SchedulerLike {\n public static now: () => number = dateTimestampProvider.now;\n\n constructor(private schedulerActionCtor: typeof Action, now: () => number = Scheduler.now) {\n this.now = now;\n }\n\n /**\n * A getter method that returns a number representing the current time\n * (at the time this function was called) according to the scheduler's own\n * internal clock.\n * @return {number} A number that represents the current time. May or may not\n * have a relation to wall-clock time. May or may not refer to a time unit\n * (e.g. milliseconds).\n */\n public now: () => number;\n\n /**\n * Schedules a function, `work`, for execution. May happen at some point in\n * the future, according to the `delay` parameter, if specified. 
May be passed\n * some context object, `state`, which will be passed to the `work` function.\n *\n * The given arguments will be processed an stored as an Action object in a\n * queue of actions.\n *\n * @param {function(state: ?T): ?Subscription} work A function representing a\n * task, or some unit of work to be executed by the Scheduler.\n * @param {number} [delay] Time to wait before executing the work, where the\n * time unit is implicit and defined by the Scheduler itself.\n * @param {T} [state] Some contextual data that the `work` function uses when\n * called by the Scheduler.\n * @return {Subscription} A subscription in order to be able to unsubscribe\n * the scheduled work.\n */\n public schedule(work: (this: SchedulerAction, state?: T) => void, delay: number = 0, state?: T): Subscription {\n return new this.schedulerActionCtor(this, work).schedule(state, delay);\n }\n}\n", "import { Scheduler } from '../Scheduler';\nimport { Action } from './Action';\nimport { AsyncAction } from './AsyncAction';\nimport { TimerHandle } from './timerHandle';\n\nexport class AsyncScheduler extends Scheduler {\n public actions: Array> = [];\n /**\n * A flag to indicate whether the Scheduler is currently executing a batch of\n * queued actions.\n * @type {boolean}\n * @internal\n */\n public _active: boolean = false;\n /**\n * An internal ID used to track the latest asynchronous task such as those\n * coming from `setTimeout`, `setInterval`, `requestAnimationFrame`, and\n * others.\n * @type {any}\n * @internal\n */\n public _scheduled: TimerHandle | undefined;\n\n constructor(SchedulerAction: typeof Action, now: () => number = Scheduler.now) {\n super(SchedulerAction, now);\n }\n\n public flush(action: AsyncAction): void {\n const { actions } = this;\n\n if (this._active) {\n actions.push(action);\n return;\n }\n\n let error: any;\n this._active = true;\n\n do {\n if ((error = action.execute(action.state, action.delay))) {\n break;\n }\n } while ((action = 
actions.shift()!)); // exhaust the scheduler queue\n\n this._active = false;\n\n if (error) {\n while ((action = actions.shift()!)) {\n action.unsubscribe();\n }\n throw error;\n }\n }\n}\n", "import { AsyncAction } from './AsyncAction';\nimport { AsyncScheduler } from './AsyncScheduler';\n\n/**\n *\n * Async Scheduler\n *\n * Schedule task as if you used setTimeout(task, duration)\n *\n * `async` scheduler schedules tasks asynchronously, by putting them on the JavaScript\n * event loop queue. It is best used to delay tasks in time or to schedule tasks repeating\n * in intervals.\n *\n * If you just want to \"defer\" task, that is to perform it right after currently\n * executing synchronous code ends (commonly achieved by `setTimeout(deferredTask, 0)`),\n * better choice will be the {@link asapScheduler} scheduler.\n *\n * ## Examples\n * Use async scheduler to delay task\n * ```ts\n * import { asyncScheduler } from 'rxjs';\n *\n * const task = () => console.log('it works!');\n *\n * asyncScheduler.schedule(task, 2000);\n *\n * // After 2 seconds logs:\n * // \"it works!\"\n * ```\n *\n * Use async scheduler to repeat task in intervals\n * ```ts\n * import { asyncScheduler } from 'rxjs';\n *\n * function task(state) {\n * console.log(state);\n * this.schedule(state + 1, 1000); // `this` references currently executing Action,\n * // which we reschedule with new state and delay\n * }\n *\n * asyncScheduler.schedule(task, 3000, 0);\n *\n * // Logs:\n * // 0 after 3s\n * // 1 after 4s\n * // 2 after 5s\n * // 3 after 6s\n * ```\n */\n\nexport const asyncScheduler = new AsyncScheduler(AsyncAction);\n\n/**\n * @deprecated Renamed to {@link asyncScheduler}. 
Will be removed in v8.\n */\nexport const async = asyncScheduler;\n", "import { AsyncAction } from './AsyncAction';\nimport { AnimationFrameScheduler } from './AnimationFrameScheduler';\nimport { SchedulerAction } from '../types';\nimport { animationFrameProvider } from './animationFrameProvider';\nimport { TimerHandle } from './timerHandle';\n\nexport class AnimationFrameAction extends AsyncAction {\n constructor(protected scheduler: AnimationFrameScheduler, protected work: (this: SchedulerAction, state?: T) => void) {\n super(scheduler, work);\n }\n\n protected requestAsyncId(scheduler: AnimationFrameScheduler, id?: TimerHandle, delay: number = 0): TimerHandle {\n // If delay is greater than 0, request as an async action.\n if (delay !== null && delay > 0) {\n return super.requestAsyncId(scheduler, id, delay);\n }\n // Push the action to the end of the scheduler queue.\n scheduler.actions.push(this);\n // If an animation frame has already been requested, don't request another\n // one. If an animation frame hasn't been requested yet, request one. Return\n // the current animation frame request id.\n return scheduler._scheduled || (scheduler._scheduled = animationFrameProvider.requestAnimationFrame(() => scheduler.flush(undefined)));\n }\n\n protected recycleAsyncId(scheduler: AnimationFrameScheduler, id?: TimerHandle, delay: number = 0): TimerHandle | undefined {\n // If delay exists and is greater than 0, or if the delay is null (the\n // action wasn't rescheduled) but was originally scheduled as an async\n // action, then recycle as an async action.\n if (delay != null ? 
delay > 0 : this.delay > 0) {\n return super.recycleAsyncId(scheduler, id, delay);\n }\n // If the scheduler queue has no remaining actions with the same async id,\n // cancel the requested animation frame and set the scheduled flag to\n // undefined so the next AnimationFrameAction will request its own.\n const { actions } = scheduler;\n if (id != null && actions[actions.length - 1]?.id !== id) {\n animationFrameProvider.cancelAnimationFrame(id as number);\n scheduler._scheduled = undefined;\n }\n // Return undefined so the action knows to request a new async id if it's rescheduled.\n return undefined;\n }\n}\n", "import { AsyncAction } from './AsyncAction';\nimport { AsyncScheduler } from './AsyncScheduler';\n\nexport class AnimationFrameScheduler extends AsyncScheduler {\n public flush(action?: AsyncAction): void {\n this._active = true;\n // The async id that effects a call to flush is stored in _scheduled.\n // Before executing an action, it's necessary to check the action's async\n // id to determine whether it's supposed to be executed in the current\n // flush.\n // Previous implementations of this method used a count to determine this,\n // but that was unsound, as actions that are unsubscribed - i.e. 
cancelled -\n // are removed from the actions array and that can shift actions that are\n // scheduled to be executed in a subsequent flush into positions at which\n // they are executed within the current flush.\n const flushId = this._scheduled;\n this._scheduled = undefined;\n\n const { actions } = this;\n let error: any;\n action = action || actions.shift()!;\n\n do {\n if ((error = action.execute(action.state, action.delay))) {\n break;\n }\n } while ((action = actions[0]) && action.id === flushId && actions.shift());\n\n this._active = false;\n\n if (error) {\n while ((action = actions[0]) && action.id === flushId && actions.shift()) {\n action.unsubscribe();\n }\n throw error;\n }\n }\n}\n", "import { AnimationFrameAction } from './AnimationFrameAction';\nimport { AnimationFrameScheduler } from './AnimationFrameScheduler';\n\n/**\n *\n * Animation Frame Scheduler\n *\n * Perform task when `window.requestAnimationFrame` would fire\n *\n * When `animationFrame` scheduler is used with delay, it will fall back to {@link asyncScheduler} scheduler\n * behaviour.\n *\n * Without delay, `animationFrame` scheduler can be used to create smooth browser animations.\n * It makes sure scheduled task will happen just before next browser content repaint,\n * thus performing animations as efficiently as possible.\n *\n * ## Example\n * Schedule div height animation\n * ```ts\n * // html:
\n * import { animationFrameScheduler } from 'rxjs';\n *\n * const div = document.querySelector('div');\n *\n * animationFrameScheduler.schedule(function(height) {\n * div.style.height = height + \"px\";\n *\n * this.schedule(height + 1); // `this` references currently executing Action,\n * // which we reschedule with new state\n * }, 0, 0);\n *\n * // You will see a div element growing in height\n * ```\n */\n\nexport const animationFrameScheduler = new AnimationFrameScheduler(AnimationFrameAction);\n\n/**\n * @deprecated Renamed to {@link animationFrameScheduler}. Will be removed in v8.\n */\nexport const animationFrame = animationFrameScheduler;\n", "import { Observable } from '../Observable';\nimport { SchedulerLike } from '../types';\n\n/**\n * A simple Observable that emits no items to the Observer and immediately\n * emits a complete notification.\n *\n * Just emits 'complete', and nothing else.\n *\n * ![](empty.png)\n *\n * A simple Observable that only emits the complete notification. It can be used\n * for composing with other Observables, such as in a {@link mergeMap}.\n *\n * ## Examples\n *\n * Log complete notification\n *\n * ```ts\n * import { EMPTY } from 'rxjs';\n *\n * EMPTY.subscribe({\n * next: () => console.log('Next'),\n * complete: () => console.log('Complete!')\n * });\n *\n * // Outputs\n * // Complete!\n * ```\n *\n * Emit the number 7, then complete\n *\n * ```ts\n * import { EMPTY, startWith } from 'rxjs';\n *\n * const result = EMPTY.pipe(startWith(7));\n * result.subscribe(x => console.log(x));\n *\n * // Outputs\n * // 7\n * ```\n *\n * Map and flatten only odd numbers to the sequence `'a'`, `'b'`, `'c'`\n *\n * ```ts\n * import { interval, mergeMap, of, EMPTY } from 'rxjs';\n *\n * const interval$ = interval(1000);\n * const result = interval$.pipe(\n * mergeMap(x => x % 2 === 1 ? 
of('a', 'b', 'c') : EMPTY),\n * );\n * result.subscribe(x => console.log(x));\n *\n * // Results in the following to the console:\n * // x is equal to the count on the interval, e.g. (0, 1, 2, 3, ...)\n * // x will occur every 1000ms\n * // if x % 2 is equal to 1, print a, b, c (each on its own)\n * // if x % 2 is not equal to 1, nothing will be output\n * ```\n *\n * @see {@link Observable}\n * @see {@link NEVER}\n * @see {@link of}\n * @see {@link throwError}\n */\nexport const EMPTY = new Observable((subscriber) => subscriber.complete());\n\n/**\n * @param scheduler A {@link SchedulerLike} to use for scheduling\n * the emission of the complete notification.\n * @deprecated Replaced with the {@link EMPTY} constant or {@link scheduled} (e.g. `scheduled([], scheduler)`). Will be removed in v8.\n */\nexport function empty(scheduler?: SchedulerLike) {\n return scheduler ? emptyScheduled(scheduler) : EMPTY;\n}\n\nfunction emptyScheduled(scheduler: SchedulerLike) {\n return new Observable((subscriber) => scheduler.schedule(() => subscriber.complete()));\n}\n", "import { SchedulerLike } from '../types';\nimport { isFunction } from './isFunction';\n\nexport function isScheduler(value: any): value is SchedulerLike {\n return value && isFunction(value.schedule);\n}\n", "import { SchedulerLike } from '../types';\nimport { isFunction } from './isFunction';\nimport { isScheduler } from './isScheduler';\n\nfunction last(arr: T[]): T | undefined {\n return arr[arr.length - 1];\n}\n\nexport function popResultSelector(args: any[]): ((...args: unknown[]) => unknown) | undefined {\n return isFunction(last(args)) ? args.pop() : undefined;\n}\n\nexport function popScheduler(args: any[]): SchedulerLike | undefined {\n return isScheduler(last(args)) ? args.pop() : undefined;\n}\n\nexport function popNumber(args: any[], defaultValue: number): number {\n return typeof last(args) === 'number' ? args.pop()! 
: defaultValue;\n}\n", "export const isArrayLike = ((x: any): x is ArrayLike => x && typeof x.length === 'number' && typeof x !== 'function');", "import { isFunction } from \"./isFunction\";\n\n/**\n * Tests to see if the object is \"thennable\".\n * @param value the object to test\n */\nexport function isPromise(value: any): value is PromiseLike {\n return isFunction(value?.then);\n}\n", "import { InteropObservable } from '../types';\nimport { observable as Symbol_observable } from '../symbol/observable';\nimport { isFunction } from './isFunction';\n\n/** Identifies an input as being Observable (but not necessary an Rx Observable) */\nexport function isInteropObservable(input: any): input is InteropObservable {\n return isFunction(input[Symbol_observable]);\n}\n", "import { isFunction } from './isFunction';\n\nexport function isAsyncIterable(obj: any): obj is AsyncIterable {\n return Symbol.asyncIterator && isFunction(obj?.[Symbol.asyncIterator]);\n}\n", "/**\n * Creates the TypeError to throw if an invalid object is passed to `from` or `scheduled`.\n * @param input The object that was passed.\n */\nexport function createInvalidObservableTypeError(input: any) {\n // TODO: We should create error codes that can be looked up, so this can be less verbose.\n return new TypeError(\n `You provided ${\n input !== null && typeof input === 'object' ? 'an invalid object' : `'${input}'`\n } where a stream was expected. 
You can provide an Observable, Promise, ReadableStream, Array, AsyncIterable, or Iterable.`\n );\n}\n", "export function getSymbolIterator(): symbol {\n if (typeof Symbol !== 'function' || !Symbol.iterator) {\n return '@@iterator' as any;\n }\n\n return Symbol.iterator;\n}\n\nexport const iterator = getSymbolIterator();\n", "import { iterator as Symbol_iterator } from '../symbol/iterator';\nimport { isFunction } from './isFunction';\n\n/** Identifies an input as being an Iterable */\nexport function isIterable(input: any): input is Iterable {\n return isFunction(input?.[Symbol_iterator]);\n}\n", "import { ReadableStreamLike } from '../types';\nimport { isFunction } from './isFunction';\n\nexport async function* readableStreamLikeToAsyncGenerator(readableStream: ReadableStreamLike): AsyncGenerator {\n const reader = readableStream.getReader();\n try {\n while (true) {\n const { value, done } = await reader.read();\n if (done) {\n return;\n }\n yield value!;\n }\n } finally {\n reader.releaseLock();\n }\n}\n\nexport function isReadableStreamLike(obj: any): obj is ReadableStreamLike {\n // We don't want to use instanceof checks because they would return\n // false for instances from another Realm, like an +

+
+

Worked Example

+
import splink.comparison_library as cl
+import splink.comparison_template_library as ctl
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
+
+df = splink_datasets.fake_1000
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    comparisons=[
+        cl.JaroWinklerAtThresholds("first_name", [0.9, 0.7]),
+        cl.JaroAtThresholds("surname", [0.9, 0.7]),
+        ctl.DateComparison(
+            "dob",
+            input_is_string=True,
+            datetime_metrics=["year", "month"],
+            datetime_thresholds=[1, 1],
+        ),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        ctl.EmailComparison("email"),
+    ],
+    blocking_rules_to_generate_predictions=[
+        block_on("substr(first_name,1,1)"),
+        block_on("substr(surname, 1,1)"),
+    ],
+    retain_intermediate_calculation_columns=True,
+    retain_matching_columns=True,
+)
+
+linker = Linker(df, settings, DuckDBAPI())
+linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+blocking_rule_for_training = block_on("first_name", "surname")
+
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+blocking_rule_for_training = block_on("dob")
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+df_predictions = linker.inference.predict(threshold_match_probability=0.2)
+df_clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
+    df_predictions, threshold_match_probability=0.5
+)
+
+linker.visualisations.cluster_studio_dashboard(
+    df_predictions, df_clusters, "img/cluster_studio.html",
+    sampling_method="by_cluster_size", overwrite=True
+)
+
+# You can view the cluster_studio.html file in your browser, or inline in a notebook as follows
+from IPython.display import IFrame
+IFrame(src="./img/cluster_studio.html", width="100%", height=1200)
+
+

What the chart shows

+

See here for a video explanation of the chart.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/charts/comparison_viewer_dashboard.html b/charts/comparison_viewer_dashboard.html new file mode 100644 index 0000000000..3bff1aaca6 --- /dev/null +++ b/charts/comparison_viewer_dashboard.html @@ -0,0 +1,5371 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + comparison viewer dashboard - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

comparison_viewer_dashboard

+

+

+
+

At a glance

+

API Documentation: comparison_viewer_dashboard()

+
+

Worked Example

+
import splink.comparison_library as cl
+import splink.comparison_template_library as ctl
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
+
+df = splink_datasets.fake_1000
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    comparisons=[
+        cl.JaroWinklerAtThresholds("first_name", [0.9, 0.7]),
+        cl.JaroAtThresholds("surname", [0.9, 0.7]),
+        ctl.DateComparison(
+            "dob",
+            input_is_string=True,
+            datetime_metrics=["year", "month"],
+            datetime_thresholds=[1, 1],
+        ),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        ctl.EmailComparison("email"),
+    ],
+    blocking_rules_to_generate_predictions=[
+        block_on("substr(first_name,1,1)"),
+        block_on("substr(surname, 1,1)"),
+    ],
+    retain_intermediate_calculation_columns=True,
+    retain_matching_columns=True,
+)
+
+linker = Linker(df, settings, DuckDBAPI())
+linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+blocking_rule_for_training = block_on("first_name", "surname")
+
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+blocking_rule_for_training = block_on("dob")
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+df_predictions = linker.inference.predict(threshold_match_probability=0.2)
+
+linker.visualisations.comparison_viewer_dashboard(
+    df_predictions, "img/scv.html", overwrite=True
+)
+
+# You can view the scv.html file in your browser, or inline in a notebook as follows
+from IPython.display import IFrame
+IFrame(
+    src="./img/scv.html", width="100%", height=1200
+)
+
+

What the chart shows

+

See the following video: +An introduction to the Splink Comparison Viewer dashboard

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/charts/completeness_chart.html b/charts/completeness_chart.html new file mode 100644 index 0000000000..e5fe111f84 --- /dev/null +++ b/charts/completeness_chart.html @@ -0,0 +1,5436 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + completeness chart - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+ +
+
+ + + +
+
+ + + + + + + + + + + + +

completeness_chart

+ +
+ + +
+

At a glance

+

Useful for: Looking at which columns are populated across datasets.

+

API Documentation: completeness_chart()

+

What is needed to generate the chart? One or more input tables and a database API, as shown in the worked example below.

+
+

What the chart shows

+

The completeness_chart shows the proportion of populated (non-null) values in the columns of multiple datasets.

+
+What the chart tooltip shows +

+

The tooltip shows a number of values based on the panel that the user is hovering over, including:

+
    +
  • The dataset and column name
  • +
  • The count and percentage of non-null values in the column for the relevant dataset.
  • +
+
+
+ +

How to interpret the chart

+

Each panel represents the percentage of non-null values in a given dataset-column combination. The darker the panel, the lower the percentage of non-null values.

+
+ +
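The value each panel encodes is simply the share of non-null entries for a dataset-column pair. A minimal pure-Python sketch of that calculation, using made-up data (this is illustrative only, not Splink's implementation):

```python
# Made-up example data: two datasets with some missing (None) values
datasets = {
    "df_l": {"first_name": ["amy", None, "joe"], "city": ["leeds", "york", None]},
    "df_r": {"first_name": ["sam", "kit", None], "city": [None, None, "hull"]},
}

# Proportion of non-null values for each dataset-column combination --
# this is the quantity each panel of the completeness chart encodes
completeness = {
    (name, col): sum(v is not None for v in values) / len(values)
    for name, columns in datasets.items()
    for col, values in columns.items()
}

for (name, col), proportion in sorted(completeness.items()):
    print(f"{name}.{col}: {proportion:.0%}")
```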

Actions to take as a result of the chart

+

Only choose features that are sufficiently populated across all datasets in a linkage model.

+

Worked Example

+
from splink import splink_datasets, DuckDBAPI
+from splink.exploratory import completeness_chart
+
+df = splink_datasets.fake_1000
+
+# Split a simple dataset into two, separate datasets which can be linked together.
+df_l = df.sample(frac=0.5)
+df_r = df.drop(df_l.index)
+
+
+chart = completeness_chart([df_l, df_r], db_api=DuckDBAPI())
+chart
+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/charts/cumulative_comparisons_to_be_scored_from_blocking_rules_chart.html b/charts/cumulative_comparisons_to_be_scored_from_blocking_rules_chart.html new file mode 100644 index 0000000000..f63bd01a87 --- /dev/null +++ b/charts/cumulative_comparisons_to_be_scored_from_blocking_rules_chart.html @@ -0,0 +1,5660 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + cumulative num comparisons from blocking rules chart - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+ +
+
+ + + +
+
+ + + + + + + + + + + + +

cumulative_comparisons_to_be_scored_from_blocking_rules_chart

+ +
+ + +
+

At a glance

+

Useful for: Counting the number of comparisons generated by Blocking Rules.

+

API Documentation: cumulative_comparisons_to_be_scored_from_blocking_rules_chart()

+

What is needed to generate the chart? One or more input tables, a list of blocking rules, and a database API, as shown in the worked example below.

+
+

What the chart shows

+

The cumulative_comparisons_to_be_scored_from_blocking_rules_chart shows the count of pairwise comparisons generated by a set of blocking rules.

+
+What the chart tooltip shows +

+

The tooltip shows a number of statistics based on the bar that the user is hovering over, including:

+
    +
  • The blocking rule as an SQL statement.
  • +
  • The number of additional pairwise comparisons generated by the blocking rule.
  • +
  • The cumulative number of pairwise comparisons generated by the blocking rule and the previous blocking rules.
  • +
  • The total number of possible pairwise comparisons (i.e. the Cartesian product). This represents the number of comparisons which would need to be evaluated if no blocking was implemented.
  • +
  • The percentage of possible pairwise comparisons excluded by the blocking rule and the previous blocking rules (i.e. the Reduction Ratio). This is calculated as \(1-\frac{\textsf{cumulative comparisons}}{\textsf{total possible comparisons}}\).
  • +
+
+
+ +
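The reduction ratio in the tooltip follows directly from the formula above. A minimal sketch, using an illustrative cumulative count; for a dedupe of n records, the total possible comparisons is n(n-1)/2:

```python
def reduction_ratio(cumulative_comparisons: int, total_possible: int) -> float:
    """Share of possible comparisons excluded by the blocking rules."""
    return 1 - cumulative_comparisons / total_possible

# For a dedupe of n records, the Cartesian product of distinct pairs
# is n * (n - 1) / 2
n = 1_000
total_possible = n * (n - 1) // 2  # 499,500 pairs for 1,000 records

# Illustrative cumulative comparison count from a set of blocking rules
print(f"{reduction_ratio(3_664, total_possible):.4f}")
```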

How to interpret the chart

+

Blocking rules are order dependent; each bar in this chart therefore shows the additional comparisons generated on top of the previous blocking rules.

+

For example, the chart above shows that an exact match on surname generates an additional 1,351 comparisons. If we reverse the order of the surname and first_name blocking rules:

+
blocking_rules_for_analysis = [
+    block_on("surname"),
+    block_on("first_name"),
+    block_on("email"),
+]
+
+cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
+    table_or_tables=df,
+    blocking_rules=blocking_rules_for_analysis,
+    db_api=db_api,
+    link_type="dedupe_only",
+)
+
+ +
+ + +

The total number of comparisons is the same (3,664), but now 1,638 have been generated by the surname blocking rule. This suggests that 287 record pairs share both first_name and surname.

+
+ +
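The 287 figure can be recovered with simple arithmetic on the counts quoted above:

```python
# Comparisons generated by block_on("surname") depending on rule order
surname_when_first = 1638   # surname rule runs before first_name
surname_when_second = 1351  # surname rule runs after first_name

# Pairs the first_name rule had already captured, i.e. pairs that
# share both first_name and surname
overlap = surname_when_first - surname_when_second
print(overlap)  # 287
```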

Actions to take as a result of the chart

+

The main aim of this chart is to understand how many comparisons are generated by the blocking rules that the Splink model will consider. The number of comparisons is the primary driver of the computational resource required for Splink model training, prediction etc. (i.e. how long things will take to run).

+

The appropriate number of comparisons varies from model to model. In general, if a model is taking hours to run (and you are not working with 100+ million records), it can be helpful to reduce the number of comparisons by defining more restrictive blocking rules.

+

For instance, since many people could share the same first_name in the example above, you may want to add an additional requirement of a match on dob as well, to reduce the number of record pairs the model needs to consider.

+
blocking_rules_for_analysis = [
+    block_on("first_name", "dob"),
+    block_on("surname"),
+    block_on("email"),
+]
+
+
+cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
+    table_or_tables=df,
+    blocking_rules=blocking_rules_for_analysis,
+    db_api=db_api,
+    link_type="dedupe_only",
+)
+
+ +
+ + +

Here, the total number of record pairs considered by the model has been reduced from 3,664 to 2,213.

+
+

Further Reading

+

For a deeper dive on blocking, please refer to the Blocking Topic Guides.

+

For more on the blocking tools in Splink, please refer to the Blocking API documentation.

+
+

Worked Example

+ +
+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/charts/img/accuracy_chart_from_labels_table.png b/charts/img/accuracy_chart_from_labels_table.png new file mode 100644 index 0000000000..7ce6f00aac Binary files /dev/null and b/charts/img/accuracy_chart_from_labels_table.png differ diff --git a/charts/img/cluster_studio.html b/charts/img/cluster_studio.html new file mode 100644 index 0000000000..bf5d4e8ded --- /dev/null +++ b/charts/img/cluster_studio.html @@ -0,0 +1,11080 @@ + + + + + +Splink cluster studio + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +

Splink cluster studio

+ +
+
+ +
+
+
+ +
+
+ + +
+
+
+
+ +
+
+
+ + +
+
+
+ +
+ + + + + +
+
+
+ + + + + +
+ + + + + + diff --git a/charts/img/cluster_studio_dashboard.png b/charts/img/cluster_studio_dashboard.png new file mode 100644 index 0000000000..94615c67ee Binary files /dev/null and b/charts/img/cluster_studio_dashboard.png differ diff --git a/charts/img/comparator_score_chart.png b/charts/img/comparator_score_chart.png new file mode 100644 index 0000000000..60e45c5821 Binary files /dev/null and b/charts/img/comparator_score_chart.png differ diff --git a/charts/img/comparator_score_threshold_chart.png b/charts/img/comparator_score_threshold_chart.png new file mode 100644 index 0000000000..0be421b966 Binary files /dev/null and b/charts/img/comparator_score_threshold_chart.png differ diff --git a/charts/img/comparison_viewer_dashboard.png b/charts/img/comparison_viewer_dashboard.png new file mode 100644 index 0000000000..a76ff801a5 Binary files /dev/null and b/charts/img/comparison_viewer_dashboard.png differ diff --git a/charts/img/completeness_chart.png b/charts/img/completeness_chart.png new file mode 100644 index 0000000000..681880c5cf Binary files /dev/null and b/charts/img/completeness_chart.png differ diff --git a/charts/img/completeness_chart_tooltip.png b/charts/img/completeness_chart_tooltip.png new file mode 100644 index 0000000000..80216ecfa1 Binary files /dev/null and b/charts/img/completeness_chart_tooltip.png differ diff --git a/charts/img/cumulative_num_comparisons_from_blocking_rules_chart.png b/charts/img/cumulative_num_comparisons_from_blocking_rules_chart.png new file mode 100644 index 0000000000..f8ab502441 Binary files /dev/null and b/charts/img/cumulative_num_comparisons_from_blocking_rules_chart.png differ diff --git a/charts/img/cumulative_num_comparisons_from_blocking_rules_chart_tooltip.png b/charts/img/cumulative_num_comparisons_from_blocking_rules_chart_tooltip.png new file mode 100644 index 0000000000..18c1e6ec2e Binary files /dev/null and b/charts/img/cumulative_num_comparisons_from_blocking_rules_chart_tooltip.png differ diff --git 
a/charts/img/m_u_parameters_chart.png b/charts/img/m_u_parameters_chart.png new file mode 100644 index 0000000000..dbf4b0b562 Binary files /dev/null and b/charts/img/m_u_parameters_chart.png differ diff --git a/charts/img/m_u_parameters_chart_tooltip_1.png b/charts/img/m_u_parameters_chart_tooltip_1.png new file mode 100644 index 0000000000..bdf538d569 Binary files /dev/null and b/charts/img/m_u_parameters_chart_tooltip_1.png differ diff --git a/charts/img/m_u_parameters_chart_tooltip_2.png b/charts/img/m_u_parameters_chart_tooltip_2.png new file mode 100644 index 0000000000..e4013ba80a Binary files /dev/null and b/charts/img/m_u_parameters_chart_tooltip_2.png differ diff --git a/charts/img/match_weights_chart.png b/charts/img/match_weights_chart.png new file mode 100644 index 0000000000..163c638fc1 Binary files /dev/null and b/charts/img/match_weights_chart.png differ diff --git a/charts/img/match_weights_chart_tooltip.png b/charts/img/match_weights_chart_tooltip.png new file mode 100644 index 0000000000..f9004007a1 Binary files /dev/null and b/charts/img/match_weights_chart_tooltip.png differ diff --git a/charts/img/missingness_chart.png b/charts/img/missingness_chart.png new file mode 100644 index 0000000000..8d77f0d0f4 Binary files /dev/null and b/charts/img/missingness_chart.png differ diff --git a/charts/img/missingness_chart_tooltip.png b/charts/img/missingness_chart_tooltip.png new file mode 100644 index 0000000000..05b66a601b Binary files /dev/null and b/charts/img/missingness_chart_tooltip.png differ diff --git a/charts/img/parameter_estimate_comparisons_chart.png b/charts/img/parameter_estimate_comparisons_chart.png new file mode 100644 index 0000000000..1bef208236 Binary files /dev/null and b/charts/img/parameter_estimate_comparisons_chart.png differ diff --git a/charts/img/phonetic_match_chart.png b/charts/img/phonetic_match_chart.png new file mode 100644 index 0000000000..c7e38b296d Binary files /dev/null and b/charts/img/phonetic_match_chart.png 
differ diff --git a/charts/img/profile_columns.png b/charts/img/profile_columns.png new file mode 100644 index 0000000000..a67c565910 Binary files /dev/null and b/charts/img/profile_columns.png differ diff --git a/charts/img/profile_columns_tooltip_1.png b/charts/img/profile_columns_tooltip_1.png new file mode 100644 index 0000000000..fd4570908c Binary files /dev/null and b/charts/img/profile_columns_tooltip_1.png differ diff --git a/charts/img/profile_columns_tooltip_2.png b/charts/img/profile_columns_tooltip_2.png new file mode 100644 index 0000000000..a785da0469 Binary files /dev/null and b/charts/img/profile_columns_tooltip_2.png differ diff --git a/charts/img/profile_columns_tooltip_3.png b/charts/img/profile_columns_tooltip_3.png new file mode 100644 index 0000000000..3f9fda2cf1 Binary files /dev/null and b/charts/img/profile_columns_tooltip_3.png differ diff --git a/charts/img/roc_chart_from_labels_table.png b/charts/img/roc_chart_from_labels_table.png new file mode 100644 index 0000000000..170222b1c5 Binary files /dev/null and b/charts/img/roc_chart_from_labels_table.png differ diff --git a/charts/img/roc_chart_from_labels_table_tooltip.png b/charts/img/roc_chart_from_labels_table_tooltip.png new file mode 100644 index 0000000000..2942432ca6 Binary files /dev/null and b/charts/img/roc_chart_from_labels_table_tooltip.png differ diff --git a/charts/img/roc_curve_explainer.png b/charts/img/roc_curve_explainer.png new file mode 100644 index 0000000000..350c5e7676 Binary files /dev/null and b/charts/img/roc_curve_explainer.png differ diff --git a/charts/img/scv.html b/charts/img/scv.html new file mode 100644 index 0000000000..df15a11b6f --- /dev/null +++ b/charts/img/scv.html @@ -0,0 +1,11024 @@ + + + + + +Splink comparison viewer + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +

Splink comparison viewer

+ +
+ +
+
+
+
+ +
+
+ +
+
+
+
+
+ + + + + + \ No newline at end of file diff --git a/charts/img/tf_adjustment_chart.png b/charts/img/tf_adjustment_chart.png new file mode 100644 index 0000000000..143c11afc5 Binary files /dev/null and b/charts/img/tf_adjustment_chart.png differ diff --git a/charts/img/tf_adjustment_chart_tooltip_1.png b/charts/img/tf_adjustment_chart_tooltip_1.png new file mode 100644 index 0000000000..d545a4da41 Binary files /dev/null and b/charts/img/tf_adjustment_chart_tooltip_1.png differ diff --git a/charts/img/tf_adjustment_chart_tooltip_2.png b/charts/img/tf_adjustment_chart_tooltip_2.png new file mode 100644 index 0000000000..2531f5fd68 Binary files /dev/null and b/charts/img/tf_adjustment_chart_tooltip_2.png differ diff --git a/charts/img/threshold_selection_tool_from_labels_table.png b/charts/img/threshold_selection_tool_from_labels_table.png new file mode 100644 index 0000000000..04730cb7be Binary files /dev/null and b/charts/img/threshold_selection_tool_from_labels_table.png differ diff --git a/charts/img/unlinkables_chart.png b/charts/img/unlinkables_chart.png new file mode 100644 index 0000000000..111b01179b Binary files /dev/null and b/charts/img/unlinkables_chart.png differ diff --git a/charts/img/unlinkables_chart_tooltip.png b/charts/img/unlinkables_chart_tooltip.png new file mode 100644 index 0000000000..c197bc1e26 Binary files /dev/null and b/charts/img/unlinkables_chart_tooltip.png differ diff --git a/charts/img/waterfall_chart.png b/charts/img/waterfall_chart.png new file mode 100644 index 0000000000..9827e9f9ec Binary files /dev/null and b/charts/img/waterfall_chart.png differ diff --git a/charts/img/waterfall_chart_tooltip.png b/charts/img/waterfall_chart_tooltip.png new file mode 100644 index 0000000000..7d4cef57f1 Binary files /dev/null and b/charts/img/waterfall_chart_tooltip.png differ diff --git a/charts/index.html b/charts/index.html new file mode 100644 index 0000000000..bdeb9efa16 --- /dev/null +++ b/charts/index.html @@ -0,0 +1,5354 @@ + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + Charts Gallery - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + + + + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/charts/m_u_parameters_chart.html b/charts/m_u_parameters_chart.html new file mode 100644 index 0000000000..684ebc92a7 --- /dev/null +++ b/charts/m_u_parameters_chart.html @@ -0,0 +1,5513 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + m u parameters chart - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+ +
+ + + +
+
+ + + + + + + + + + + + +

m_u_parameters_chart

+ +
+ + +
+

At a glance

+

Useful for: Looking at the m and u values generated by a Splink model.

+

API Documentation: m_u_parameters_chart()

+

What is needed to generate the chart? A trained Splink model.

+
+

What the chart shows

+

The m_u_parameters_chart shows the results of a trained Splink model:

+
    +
  • The left chart shows the estimated m probabilities from the Splink model
  • +
  • The right chart shows the estimated u probabilities from the Splink model.
  • +
+

Each comparison within a model is represented by the trained m and u values that have been estimated for each comparison level during Splink model training.

+
+What the chart tooltip shows +

Estimated m probability tooltip

+

+

The tooltip of the left chart shows information based on the comparison level bar that the user is hovering over, including:

+
    +
  • An explanation of the m probability for the comparison level.
  • +
  • The name of the comparison and comparison level.
  • +
  • The comparison level condition as an SQL statement.
  • +
  • The m and u probability for the comparison level.
  • +
  • The resulting bayes factor and match weight for the comparison level.
  • +
+

Estimated u probability tooltip

+

+

The tooltip of the right chart shows information based on the comparison level bar that the user is hovering over, including:

+
    +
  • An explanation of the u probability for the comparison level.
  • +
  • The name of the comparison and comparison level.
  • +
  • The comparison level condition as an SQL statement.
  • +
  • The m and u probability for the comparison level.
  • +
  • The resulting bayes factor and match weight for the comparison level.
  • +
+
+

How to interpret the chart

+

Each bar of the left chart shows the probability of a given comparison level when two records are a match. This can also be interpreted as the proportion of matching records which are allocated to the comparison level (as stated in the x axis label).

+

Similarly, each bar of the right chart shows the probability of a given comparison level when two records are not a match. This can also be interpreted as the proportion of non-matching records which are allocated to the comparison level (as stated in the x axis label).

+
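To make this interpretation concrete, here is a minimal sketch using hypothetical m and u values for a two-level city comparison (the numbers are illustrative, not taken from a real trained model):

```python
# Hypothetical m and u probabilities for a two-level "city" comparison.
# m = P(observing this level | the two records are a true match)
# u = P(observing this level | the two records are not a match)
m = {"Exact match": 0.85, "All other comparisons": 0.15}
u = {"Exact match": 0.05, "All other comparisons": 0.95}

# Within a comparison, the levels are exhaustive and mutually exclusive,
# so the m values sum to 1 across levels, and likewise the u values.
assert abs(sum(m.values()) - 1.0) < 1e-9
assert abs(sum(u.values()) - 1.0) < 1e-9
```

This is why each bar can be read as a proportion: the m bars for a comparison partition all matching record pairs, and the u bars partition all non-matching pairs.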
+

Further Reading

+

For a more comprehensive introduction to m and u probabilities, check out the Fellegi Sunter model topic guide.

+
+

Actions to take as a result of the chart

+

As with the match_weights_chart, one of the most effective methods to assess a Splink model is to walk through each of the comparison levels of the m_u_parameters_chart and sense check the m and u probabilities that have been allocated by the model.

+

For example, for all non-matching pairwise comparisons (which form the vast majority of all pairwise comparisons), it makes sense that the exact match and fuzzy levels occur very rarely. Furthermore, dob and city are lower cardinality features (i.e. have fewer possible values) than names so "All other comparisons" is less likely.

+

If there are any m or u values that appear unusual, check out the values generated for each training session in the parameter_estimate_comparisons_chart.

+ + +

Worked Example

+
import splink.comparison_library as cl
+
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
+
+df = splink_datasets.fake_1000
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    comparisons=[
+        cl.JaroWinklerAtThresholds("first_name", [0.9, 0.7]),
+        cl.JaroAtThresholds("surname", [0.9, 0.7]),
+        cl.DateOfBirthComparison(
+            "dob",
+            input_is_string=True,
+            datetime_metrics=["year", "month"],
+            datetime_thresholds=[1, 1],
+        ),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        cl.EmailComparison("email"),
+    ],
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name"),
+        block_on("surname"),
+    ],
+)
+
+linker = Linker(df, settings, DuckDBAPI())
+linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+blocking_rule_for_training = block_on("first_name", "surname")
+
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+blocking_rule_for_training = block_on("dob")
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+chart = linker.visualisations.m_u_parameters_chart()
+chart
+
+

+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/charts/match_weights_chart.html b/charts/match_weights_chart.html new file mode 100644 index 0000000000..37906706d0 --- /dev/null +++ b/charts/match_weights_chart.html @@ -0,0 +1,5509 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + match weights chart - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+ +
+ + + +
+
+ + + + + + + + + + + + +

match_weights_chart

+ +
+ + +
+

At a glance

+

Useful for: Looking at the whole Splink model definition.

+

API Documentation: match_weights_chart()

+

What is needed to generate the chart? A trained Splink model.

+
+

What the chart shows

+

The match_weights_chart shows the results of a trained Splink model. Each comparison within the model is represented in a bar chart, with a bar showing the evidence for two records being a match (i.e. match weight) for each comparison level.

+
+What the chart tooltip shows +

+

The tooltip shows information based on the comparison level bar that the user is hovering over, including:

+
    +
  • The name of the comparison and comparison level.
  • +
  • The comparison level condition as an SQL statement.
  • +
  • The m and u probability for the comparison level.
  • +
  • The resulting bayes factor and match weight for the comparison level.
  • +
+
+

How to interpret the chart

+

Each bar in the match_weights_chart shows the evidence of a match provided by each level in a Splink model (i.e. match weight). As such, the match weight chart provides a summary for the entire Splink model, as it shows the match weights for every type of comparison defined within the model.

+

Any Splink score generated to compare two records will add up the evidence (i.e. match weights) for each comparison to come up with a final match weight score, which can then be converted into a probability of a match.

+

The first bar is the Prior Match Weight: the match weight before any information from the comparisons is taken into account. This can be thought of in the same way as the y-intercept of a simple regression model.

+

This chart is an aggregation of the m_u_parameters_chart. The match weight for a comparison level is simply \(log_2(\frac{m}{u})\).

+
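That arithmetic can be sketched in a few lines of plain Python (the m and u values below are hypothetical): the match weight of a level is \(log_2(\frac{m}{u})\), and a total match weight can be converted back to a match probability.

```python
import math

def match_weight(m: float, u: float) -> float:
    """Match weight (bits of evidence) for a comparison level: log2(m / u)."""
    return math.log2(m / u)

def probability_from_weight(total_weight: float) -> float:
    """Convert a final (summed) match weight into a match probability."""
    bayes_factor = 2 ** total_weight
    return bayes_factor / (1 + bayes_factor)

# Hypothetical m and u values for an exact match level:
w = match_weight(m=0.85, u=0.05)   # log2(17), roughly 4.09 bits of evidence
p = probability_from_weight(w)     # 17/18, roughly 0.94
```

Summing the match weights of the activated level in every comparison, plus the prior match weight, gives the final score that `probability_from_weight` converts to a probability.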

Actions to take as a result of the chart

+

Some heuristics to help assess Splink models with the match_weights_chart:

+

Match weights gradually reducing within a comparison

+

Comparison levels are order dependent: they are constructed so that the most "similar" levels come first and subsequent levels become gradually less "similar". As a result, we would generally expect match weight to reduce as we go down the levels in a comparison.

+

Very similar comparison levels

+

Comparisons are broken up into comparison levels to show different levels of similarity between records. As these levels are associated with different levels of similarity, we expect the amount of evidence (i.e. match weight) to vary between comparison levels. Two levels with the same match weight do not provide the model with any additional information that could improve its performance.

+

Therefore, if two levels of a comparison return the same match weight, these should be combined into a single level.

+

Very different comparison levels

+

Levels that have a large variation between comparison levels have a significant impact on the model results. For example, looking at the email comparison in the chart above, the difference in match weight between an exact/fuzzy match and "All other comparisons" is > 13, which is quite extreme. This generally happens with highly predictive features (e.g. email, national insurance number, social security number).

+

If there are a number of highly predictive features, it is worth looking at simplifying your model using these more predictive features. In some cases, similar results may be obtained with a deterministic rather than a probabilistic linkage model.

+

Logical Walk-through

+

One of the most effective methods to assess a Splink model is to walk through each of the comparison levels of the match_weights_chart and sense check the amount of evidence (i.e. match weight) that has been allocated by the model.

+

For example, in the chart above, we would expect records with the same dob to provide more evidence of a match than first_name or surname. Conversely, given how people can move location, we would expect that city would be less predictive than people's fixed, personally identifying characteristics like surname, dob etc.

+

Anything look strange?

+

If anything still looks unusual, check out:

+ + + +

Worked Example

+
import splink.comparison_library as cl
+
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
+
+df = splink_datasets.fake_1000
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    comparisons=[
+        cl.JaroWinklerAtThresholds("first_name", [0.9, 0.7]),
+        cl.JaroAtThresholds("surname", [0.9, 0.7]),
+        cl.DateOfBirthComparison(
+            "dob",
+            input_is_string=True,
+            datetime_metrics=["year", "month"],
+            datetime_thresholds=[1, 1],
+        ),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        cl.EmailComparison("email"),
+    ],
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name"),
+        block_on("surname"),
+    ],
+)
+
+linker = Linker(df, settings, DuckDBAPI())
+linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+blocking_rule_for_training = block_on("first_name", "surname")
+
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+blocking_rule_for_training = block_on("dob")
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+chart = linker.visualisations.match_weights_chart()
+chart
+
+

+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/charts/parameter_estimate_comparisons_chart.html b/charts/parameter_estimate_comparisons_chart.html new file mode 100644 index 0000000000..d3277e9597 --- /dev/null +++ b/charts/parameter_estimate_comparisons_chart.html @@ -0,0 +1,5422 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + parameter estimate comparisons chart - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

parameter_estimate_comparisons_chart

+ +
+ + +
+

At a glance

+

Useful for: Looking at the m and u value estimates across multiple Splink model training sessions.

+

API Documentation: parameter_estimate_comparisons_chart()

+

What is needed to generate the chart? A trained Splink model.

+
+ + +

Worked Example

+
import splink.comparison_library as cl
+import splink.comparison_template_library as ctl
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
+
+df = splink_datasets.fake_1000
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    comparisons=[
+        cl.JaroWinklerAtThresholds("first_name", [0.9, 0.7]),
+        cl.JaroAtThresholds("surname", [0.9, 0.7]),
+        ctl.DateComparison(
+            "dob",
+            input_is_string=True,
+            datetime_metrics=["year", "month"],
+            datetime_thresholds=[1, 1],
+        ),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        ctl.EmailComparison("email"),
+    ],
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name"),
+        block_on("surname"),
+    ],
+)
+
+linker = Linker(df, settings, DuckDBAPI())
+linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+blocking_rule_for_training = block_on("first_name", "surname")
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+blocking_rule_for_training = block_on("dob")
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+blocking_rule_for_training = block_on("email")
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+chart = linker.visualisations.parameter_estimate_comparisons_chart()
+chart
+
+

+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/charts/profile_columns.html b/charts/profile_columns.html new file mode 100644 index 0000000000..cf0e2a23be --- /dev/null +++ b/charts/profile_columns.html @@ -0,0 +1,5674 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + profile columns - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+ +
+
+ + + +
+
+ + + + + + + + + + + + +

profile_columns

+ +
+ + +
+

At a glance

+

Useful for: Looking at the distribution of values in columns.

+

API Documentation: profile_columns()

+

What is needed to generate the chart?: A linker with some data.

+
+
+ +

What the chart shows

+

The profile_columns chart shows 3 charts for each selected column:

+
    +
  • The left chart shows the distribution of all values in the column. It is a summary of the skew of value frequencies: the width of each "step" represents the proportion of all (non-null) values with a given count, while the height of each "step" gives that count.
  • +
  • The middle chart shows the counts of the ten most common values in the column. These correspond to the 10 leftmost "steps" in the left chart.
  • +
  • The right chart shows the counts of the ten least common values in the column. These correspond to the 10 rightmost "steps" in the left chart.
  • +
+
+What the chart tooltip shows +
Left chart:
+

+

This tooltip shows a number of statistics based on the column value of the "step" that the user is hovering over, including:

+
    +
  • The number of occurrences of the given value.
  • +
  • The percentile of the column value (excluding and including null values).
  • +
  • The total number of rows in the column (excluding and including null values).
  • +
+
Middle and right chart:
+

+

This tooltip shows a number of statistics based on the column value of the bar that the user is hovering over, including:

+
    +
  • The column value
  • +
  • The count of the column value.
  • +
  • The total number of rows in the column (excluding and including null values).
  • +
+
+
+ +

How to interpret the chart

+

The distribution of values in your data is important for two main reasons:

+
    +
  1. +

    Columns with higher cardinality (number of distinct values) are usually more useful for data linking. For instance, date of birth is a much stronger linkage variable than gender.

    +
  2. +
  3. +

    The skew of values is important. If you have a birth_place column that has 1,000 distinct values, but 75% of them are London, this is much less useful for linkage than if the 1,000 values were equally distributed.

    +
  4. +
+
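Both of these properties can be measured directly. The following is a minimal, library-free sketch (using made-up birth_place data) of computing cardinality and the skew towards the most common value:

```python
from collections import Counter

# Hypothetical birth_place values with heavy skew towards "london"
birth_place = ["london"] * 75 + ["york"] * 10 + ["leeds"] * 10 + ["felthorpe"] * 5

counts = Counter(birth_place)
cardinality = len(counts)                    # number of distinct values: 4
top_value, top_count = counts.most_common(1)[0]
skew = top_count / len(birth_place)          # share held by the most common value: 0.75

assert cardinality == 4
assert top_value == "london" and skew == 0.75
```

A column with high `cardinality` and low `skew` is the ideal linkage variable; profile_columns visualises exactly these two quantities.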
+ +

Actions to take as a result of the chart

+

In an ideal world, all of the columns in datasets used for linkage would be high cardinality with a low skew (i.e. many distinct values that are evenly distributed). This is rarely the case with real-life datasets, but there are a number of steps that can be taken to extract the most predictive value, particularly from skewed data.

+

Skewed String Columns

+

Consider the skew of birth_place in our example:

+
profile_columns(df, column_expressions="birth_place", db_api=DuckDBAPI())
+
+ +
+ + +

Here we can see that "london" is the most common value, with many times more entries than any other value. In this case, two records both having a birth_place of "london" gives far less evidence for a match than both having a rarer birth_place (e.g. "felthorpe").

+

To take this skew into account, we can build Splink models with Term Frequency Adjustments. These adjustments will increase the amount of evidence for rare matching values and reduce the amount of evidence for common matching values.

+

To understand how these work in more detail, check out the Term Frequency Adjustments Topic Guide

+
+ +

Skewed Date Columns

+

Dates can also be skewed, but tend to be dealt with slightly differently.

+

Consider the dob column from our example:

+
profile_columns(df, column_expressions="dob", db_api=DuckDBAPI())
+
+ +
+ + +

Here we can see a large skew towards dates which are the 1st January. We can narrow down the profiling to show the distribution of month and day to explore this further:

+
profile_columns(df, column_expressions="substr(dob, 6, 10)", db_api=DuckDBAPI())
+
+ +
+ + +

Here we can see that over 35% of all dates in this dataset are the 1st January. This is fairly common in manually entered datasets where if only the year of birth is known, people will generally enter the 1st January for that year.

+
+ +

Low cardinality columns

+

Unfortunately, there is not much that can be done to improve low cardinality data. Ultimately, they will provide some evidence of a match between records, but need to be used in conjunction with some more predictive, higher cardinality fields.

+

Worked Example

+
from splink import splink_datasets, DuckDBAPI
+from splink.exploratory import profile_columns
+
+df = splink_datasets.historical_50k
+profile_columns(df, db_api=DuckDBAPI())
+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/charts/template.html b/charts/template.html new file mode 100644 index 0000000000..2bc4811177 --- /dev/null +++ b/charts/template.html @@ -0,0 +1,5260 @@ + + + + + + + + + + + + + + + + + + + + + + + + XXXXX_chart - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

XXXXX_chart

+
+

At a glance

+

Useful for:

+

API Documentation: XXXXXX_chart()

+

What is needed to generate the chart?

+
+

Worked Example

+
from splink.duckdb.linker import DuckDBLinker
+import splink.duckdb.comparison_library as cl
+import splink.duckdb.comparison_template_library as ctl
+from splink.duckdb.blocking_rule_library import block_on
+from splink.datasets import splink_datasets
+import logging, sys
+logging.disable(sys.maxsize)
+
+df = splink_datasets.fake_1000
+
+settings = {
+    "link_type": "dedupe_only",
+    "blocking_rules_to_generate_predictions": [
+        block_on("first_name"),
+        block_on("surname"),
+    ],
+    "comparisons": [
+        ctl.name_comparison("first_name"),
+        ctl.name_comparison("surname"),
+        ctl.date_comparison("dob", cast_strings_to_date=True),
+        cl.exact_match("city", term_frequency_adjustments=True),
+        ctl.email_comparison("email", include_username_fuzzy_level=False),
+    ],
+}
+
+linker = DuckDBLinker(df, settings)
+linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+blocking_rule_for_training = block_on(["first_name", "surname"])
+
+linker.training.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)
+
+blocking_rule_for_training = block_on("dob")
+linker.training.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)
+
+

What the chart shows

+
+What the chart tooltip shows +

+
+

How to interpret the chart

+

Actions to take as a result of the chart

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/charts/tf_adjustment_chart.html b/charts/tf_adjustment_chart.html new file mode 100644 index 0000000000..0ef6c32f76 --- /dev/null +++ b/charts/tf_adjustment_chart.html @@ -0,0 +1,5484 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + tf adjustment chart - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+ +
+
+ + + +
+
+ + + + + + + + + + + + +

tf_adjustment_chart

+ +
+ + +
+

At a glance

+

Useful for: Looking at the impact of Term Frequency Adjustments on Match Weights.

+

API Documentation: tf_adjustment_chart()

+

What is needed to generate the chart?: A trained Splink model, including comparisons with term frequency adjustments.

+
+

What the chart shows

+

The tf_adjustment_chart shows the impact of Term Frequency Adjustments on the Match Weight of a comparison. It is made up of two charts for each selected comparison:

+
    +
  • The left chart shows the match weight for two records with a matching first_name including a term frequency adjustment. The black horizontal line represents the base match weight (i.e. with no term frequency adjustment applied). By default this chart contains the 10 most frequent and 10 least frequent values in a comparison as well as any values assigned in the vals_to_include parameter.
  • +
  • The right chart shows the distribution of match weights across all of the values of first_name.
  • +
+
+What the tooltip shows +

Left chart

+

+

The tooltip shows a number of statistics based on the column value of the point that the user is hovering over, including:

+
    +
  • The column value
  • +
  • The base match weight (i.e. with no term frequency adjustment) for a match on the column.
  • +
  • The term frequency adjustment for the column value.
  • +
  • The final match weight (i.e. the combined base match weight and term frequency adjustment)
  • +
+

Right chart

+

+

The tooltip shows a number of statistics based on the bar that the user is hovering over, including:

+
    +
  • The final match weight bucket (in steps of 0.5).
  • +
  • The number of records with a final match weight in the final match weight bucket.
  • +
+
+
+ +

How to interpret the chart

+

The most common terms (on the left of the first chart) will have a negative term frequency adjustment, and the values on the chart represent the lowest match weight for a match for the selected comparison. Conversely, the least common terms (on the right of the first chart) will have a positive term frequency adjustment, and the values on the chart represent the highest match weight for a match for the selected comparison.

+

Given that the first chart only shows the most and least frequently occurring values, the second chart is provided to show the distribution of final match weights (including term frequency adjustments) across all values in the dataset.

+
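The underlying arithmetic can be sketched as follows. In the standard Fellegi-Sunter treatment of term frequencies, the column-average u probability for an exact match is replaced by the relative frequency of the specific matching value, so rare values gain match weight and common values lose it. This is a simplified sketch with hypothetical numbers, not Splink's exact implementation (which, for example, can damp the adjustment):

```python
import math

def tf_adjusted_match_weight(m: float, u: float, term_freq: float) -> float:
    # Base match weight uses the column-average u probability...
    base = math.log2(m / u)
    # ...and the term frequency adjustment effectively swaps u for the
    # relative frequency of the specific matching value.
    adjustment = math.log2(u / term_freq)
    return base + adjustment

# Hypothetical: m = 0.5, column-average u = 0.02
common = tf_adjusted_match_weight(0.5, 0.02, term_freq=0.10)   # frequent value: weight reduced
rare = tf_adjusted_match_weight(0.5, 0.02, term_freq=0.001)    # rare value: weight increased

assert common < math.log2(0.5 / 0.02) < rare
```

Values with `term_freq` equal to the column-average u get no adjustment, which is why the black base-weight line sits in the middle of the left chart.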
+ +

Actions to take as a result of the chart

+

There are no direct actions that need to be taken as a result of this chart. It is intended to give the user an indication of the size of the impact of Term Frequency Adjustments on comparisons, as seen in the Waterfall Chart.

+

Worked Example

+
import splink.comparison_library as cl
+import splink.comparison_template_library as ctl
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
+
+df = splink_datasets.fake_1000
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    comparisons=[
+        cl.JaroWinklerAtThresholds("first_name", [0.9, 0.7]).configure(
+            term_frequency_adjustments=True
+        ),
+        cl.JaroAtThresholds("surname", [0.9, 0.7]),
+        ctl.DateComparison(
+            "dob",
+            input_is_string=True,
+            datetime_metrics=["year", "month"],
+            datetime_thresholds=[1, 1],
+        ),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        ctl.EmailComparison("email"),
+    ],
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name"),
+        block_on("surname"),
+    ],
+)
+
+linker = Linker(df, settings, DuckDBAPI())
+linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+blocking_rule_for_training = block_on("first_name", "surname")
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+blocking_rule_for_training = block_on("dob")
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+chart = linker.visualisations.tf_adjustment_chart(
+    "first_name", vals_to_include=["Robert", "Grace"]
+)
+chart
+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/charts/threshold_selection_tool_from_labels_table.html b/charts/threshold_selection_tool_from_labels_table.html new file mode 100644 index 0000000000..d6996f4851 --- /dev/null +++ b/charts/threshold_selection_tool_from_labels_table.html @@ -0,0 +1,5475 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + threshold selection tool - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+ +
+
+ + + +
+
+ + + + + + + + + + + + +

threshold_selection_tool_from_labels_table

+ +
+ + +
+

At a glance

+

Useful for: Selecting an optimal match weight threshold for generating linked clusters.

+

API Documentation: accuracy_chart_from_labels_table()

+

What is needed to generate the chart? A linker with some data and a corresponding labelled dataset

+
+

What the chart shows

+

For a given match weight threshold, a record pair with a score above this threshold will be labelled a match and below the threshold will be labelled a non-match. Lowering the threshold to the extreme ensures many more matches are generated - this maximises the True Positives (high recall) but at the expense of some False Positives (low precision).

+

You can then see the effect on the confusion matrix of raising the match threshold. As more predicted matches become non-matches at the higher threshold, True Positives become False Negatives, but False Positives become True Negatives.

+

This demonstrates the trade-off between Type 1 (FP) and Type 2 (FN) errors when selecting a match threshold, or precision vs recall.

+

This chart adds further context to accuracy_analysis_from_labels_table showing:

+
    +
  • the relationship between match weight and match probability
  • +
  • various accuracy metrics comparing the Splink scores against clerical labels
  • +
  • the confusion matrix of the predictions and the labels
  • +
+

How to interpret the chart

+

Precision can be maximised by increasing the match threshold (reducing false positives).

+

Recall can be maximised by decreasing the match threshold (reducing false negatives).

+

Additional metrics can be used to find the optimal compromise between these two, looking for the threshold at which peak accuracy is achieved.

+
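The metrics involved can be sketched with plain Python. The confusion-matrix counts below are hypothetical, chosen only to illustrate how raising the threshold trades recall for precision:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp)   # fewer false positives -> higher precision
    recall = tp / (tp + fn)      # fewer false negatives -> higher recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts at two different match weight thresholds:
low = precision_recall_f1(tp=90, fp=30, fn=10)   # low threshold: recall 0.90, precision 0.75
high = precision_recall_f1(tp=70, fp=5, fn=30)   # high threshold: recall 0.70, precision 0.93

assert low[1] > high[1]    # lowering the threshold maximises recall
assert high[0] > low[0]    # raising the threshold maximises precision
```

Metrics such as F1 combine the two, so scanning thresholds for the peak F1 (as the chart does when `add_metrics=["f1"]` is passed) is one way to pick a compromise threshold.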

Actions to take as a result of the chart

+

Having identified an optimal match weight threshold, this can be applied when generating linked clusters using cluster_pairwise_predictions_at_threshold().

+

Worked Example

+
import splink.comparison_library as cl
+import splink.comparison_template_library as ctl
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
+from splink.datasets import splink_dataset_labels
+
+db_api = DuckDBAPI()
+
+df = splink_datasets.fake_1000
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    comparisons=[
+        cl.JaroWinklerAtThresholds("first_name", [0.9, 0.7]),
+        cl.JaroAtThresholds("surname", [0.9, 0.7]),
+        ctl.DateComparison(
+            "dob",
+            input_is_string=True,
+            datetime_metrics=["year", "month"],
+            datetime_thresholds=[1, 1],
+        ),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        ctl.EmailComparison("email"),
+    ],
+    blocking_rules_to_generate_predictions=[
+        block_on("substr(first_name,1,1)"),
+        block_on("substr(surname, 1,1)"),
+    ],
+)
+
+linker = Linker(df, settings, db_api)
+
+linker.training.estimate_probability_two_random_records_match(
+    [block_on("first_name", "surname")], recall=0.7
+)
+linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+blocking_rule_for_training = block_on("first_name", "surname")
+
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+blocking_rule_for_training = block_on("dob")
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+df_labels = splink_dataset_labels.fake_1000_labels
+labels_table = linker.table_management.register_labels_table(df_labels)
+
+chart = linker.evaluation.accuracy_analysis_from_labels_table(
+    labels_table, output_type="threshold_selection", add_metrics=["f1"]
+)
+chart
+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/charts/unlinkables_chart.html b/charts/unlinkables_chart.html new file mode 100644 index 0000000000..3c68f9fb40 --- /dev/null +++ b/charts/unlinkables_chart.html @@ -0,0 +1,5478 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + unlinkables chart - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+ +
+
+ + + +
+
+ + + + + + + + + + + + +

unlinkables_chart

+ +
+ + +
+

At a glance

+

Useful for: Looking at how many records have insufficient information to be linked to themselves.

+

API Documentation: unlinkables_chart()

+

What is needed to generate the chart? A trained Splink model

+
+

What the chart shows

+

The unlinkables_chart shows the proportion of records with insufficient information to be matched to themselves at differing match thresholds.

+
+What the chart tooltip shows +

+

This tooltip shows a number of statistics based on the match weight at the selected point on the line, including:

+
    +
  • The chosen match weight and corresponding match probability.
  • +
  • The proportion of records that cannot be linked to themselves given the chosen match weight threshold for a match.
  • +
+
+
+ +

How to interpret the chart

+

This chart gives an indication of data quality and/or model predictiveness within a Splink model. If a high proportion of records are not linkable to themselves at a low match threshold (e.g. 0 match weight/50% probability), we can conclude one or both of the following:

+
    +
  • the data quality is low enough that a significant proportion of records are unable to be linked to themselves
  • +
  • the parameters of the Splink model are such that features have not been assigned enough weight, and therefore the model will not perform well
  • +
+
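
The threshold conversions quoted above (e.g. 0 match weight corresponding to a 50% match probability) follow from the fact that a match weight is the log base 2 of a Bayes factor. A minimal sketch of the conversion, where the helper name is our own rather than part of the Splink API:

```python
def match_weight_to_probability(match_weight: float) -> float:
    """Convert a match weight (log2 of a Bayes factor) to a match probability."""
    bayes_factor = 2 ** match_weight
    return bayes_factor / (1 + bayes_factor)

match_weight_to_probability(0)   # 0.5, i.e. 50% probability
match_weight_to_probability(10)  # ~0.999
```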

This chart also gives an indication of the number of False Negatives (i.e. missed links) at a given threshold, assuming sufficient data quality. For example:

+
    +
  • we know that a record should be linked to itself; here, a match weight threshold of \(\approx\) 10 leaves 16% of records unable to link to themselves
  • +
  • exact matches generally provide the strongest evidence, so we can expect any "fuzzy" matches to have lower match scores. As a result, we can deduce that the proportion of False Negatives will be higher than 16%.
  • +
+
+ +

Actions to take as a result of the chart

+

If the level of unlinkable records is extremely high at low match weight thresholds, you have a poorly performing model. This may be an issue that can be resolved by tweaking the model's comparisons, but if the poor performance is primarily down to poor data quality, there is very little that can be done to improve the model.

+

When interpreted as an indicator of False Negatives, this chart can be used to establish an upper bound for the match weight threshold, depending on the tolerance for False Negatives in the particular use case.

+

Worked Example

+
import splink.comparison_library as cl
+import splink.comparison_template_library as ctl
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
+
+db_api = DuckDBAPI()
+
+df = splink_datasets.fake_1000
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    comparisons=[
+        cl.JaroWinklerAtThresholds("first_name", [0.9, 0.7]),
+        cl.JaroAtThresholds("surname", [0.9, 0.7]),
+        ctl.DateComparison(
+            "dob",
+            input_is_string=True,
+            datetime_metrics=["year", "month"],
+            datetime_thresholds=[1, 1],
+        ),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        ctl.EmailComparison("email"),
+    ],
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name"),
+        block_on("surname"),
+    ],
+)
+
+linker = Linker(df, settings, db_api)
+linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+blocking_rule_for_training = block_on("first_name", "surname")
+
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+blocking_rule_for_training = block_on("dob")
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+chart = linker.evaluation.unlinkables_chart()
+chart
+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/charts/waterfall_chart.html b/charts/waterfall_chart.html new file mode 100644 index 0000000000..b7e062221b --- /dev/null +++ b/charts/waterfall_chart.html @@ -0,0 +1,5493 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + waterfall chart - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+ +
+
+ + + +
+
+ + + + + + + + + + + + +

waterfall_chart

+ +
+ + +
+

At a glance

+

Useful for: Looking at the breakdown of the match weight for a pair of records.

+

API Documentation: waterfall_chart()

+

What is needed to generate the chart? A trained Splink model

+
+

What the chart shows

+

The waterfall_chart shows the amount of evidence of a match that is provided by each comparison for a pair of records. Each bar represents a comparison and the corresponding amount of evidence (i.e. match weight) of a match for the pair of values displayed above the bar.

+
+What the chart tooltip shows +

+

The tooltip contains information based on the bar that the user is hovering over, including:

+
    +
  • The comparison column (or columns)
  • +
  • The column values from the pair of records being compared
  • +
  • The comparison level as a label, SQL statement and the corresponding comparison vector value
  • +
  • The Bayes factor (i.e. how many times more likely a match is, given this evidence)
  • +
  • The match weight for the comparison level
  • +
  • The cumulative match probability from the chosen comparison and all of the previous comparisons.
  • +
+
+
+ +

How to interpret the chart

+

The first bar (labelled "Prior") is the match weight if no additional knowledge of features is taken into account, and can be thought of as similar to the y-intercept in a simple regression.

+

Each subsequent bar shows the match weight for a comparison. These bars can be positive or negative depending on whether the given comparison gives positive or negative evidence for the two records being a match.

+

Additional bars are added for comparisons with term frequency adjustments. For example, the chart above has term frequency adjustments for first_name so there is an extra tf_first_name bar showing how the frequency of a given name impacts the amount of evidence for the two records being a match.

+

The final bar represents the total match weight for the pair of records. This match weight can also be translated into a final match probability, which is shown on the right axis (note the logarithmic scale).

+
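
Since the bars accumulate additively in match weight space, the arithmetic behind the chart can be sketched as follows (function and variable names are illustrative, not part of the Splink API):

```python
def final_match_probability(prior_weight, comparison_weights):
    """Sum the waterfall bars, then convert the total weight to a probability."""
    total_weight = prior_weight + sum(comparison_weights)  # height of the final bar
    bayes_factor = 2 ** total_weight  # match weight is log2 of the Bayes factor
    return bayes_factor / (1 + bayes_factor)

# A weak prior outweighed by mostly positive evidence from three comparisons:
final_match_probability(-5.0, [6.0, 4.0, -1.0])  # ~0.94
```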
+ +

Actions to take as a result of the chart

+

This chart is useful for spot checking pairs of records to see if the Splink model is behaving as expected.

+

If a pair of records look like they are incorrectly being assigned as a match/non-match, it is a sign that the Splink model is not working optimally. If this is the case, it is worth revisiting the model training step.

+

Some common scenarios include:

+
    +
  • +

    If a comparison isn't capturing a specific edge case (e.g. fuzzy match), add a comparison level to capture this case and retrain the model.

    +
  • +
  • +

    If the match weight for a comparison is looking unusual, refer to the match_weights_chart to see the match weight in context with the rest of the comparison levels within that comparison. If it is still looking unusual, you can dig deeper with the parameter_estimate_comparisons_chart to see if the model training runs are consistent. If there is a lot of variation between model training sessions, this can suggest some instability in the model. In this case, try some different model training rules and/or comparison levels.

    +
  • +
  • +

    If the "Prior" match weight is too small or large compared to the match weight provided by the comparisons, try some different determininstic rules and recall inputs to the estimate_probability_two_records_match function.

    +
  • +
  • +

    If you are working with a model with term frequency adjustments and want to dig deeper into the impact of term frequency on the model as a whole (i.e. not just for a single pairwise comparison), check out the tf_adjustment_chart.

    +
  • +
+

Worked Example

+
import splink.comparison_library as cl
+import splink.comparison_template_library as ctl
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
+
+df = splink_datasets.fake_1000
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    comparisons=[
+        ctl.NameComparison("first_name").configure(term_frequency_adjustments=True),
+        ctl.NameComparison("surname"),
+        ctl.DateComparison(
+            "dob",
+            input_is_string=True,
+            datetime_metrics=["year", "month"],
+            datetime_thresholds=[1, 1],
+        ),
+        cl.ExactMatch("city"),
+        ctl.EmailComparison("email", include_username_fuzzy_level=False),
+    ],
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name"),
+        block_on("surname"),
+    ],
+    retain_intermediate_calculation_columns=True,
+    retain_matching_columns=True,
+)
+
+linker = Linker(df, settings, DuckDBAPI())
+linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+blocking_rule_for_training = block_on("first_name", "surname")
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+blocking_rule_for_training = block_on("dob")
+linker.training.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+df_predictions = linker.inference.predict(threshold_match_probability=0.2)
+records_to_view = df_predictions.as_record_dict(limit=5)
+
+chart = linker.visualisations.waterfall_chart(records_to_view, filter_nulls=False)
+chart
+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/css/ansi-colours.css b/css/ansi-colours.css new file mode 100644 index 0000000000..42301ef93b --- /dev/null +++ b/css/ansi-colours.css @@ -0,0 +1,174 @@ +/*! +* +* IPython notebook +* +*/ +/* CSS font colors for translated ANSI escape sequences */ +/* The color values are a mix of + http://www.xcolors.net/dl/baskerville-ivorylight and + http://www.xcolors.net/dl/euphrasia */ +.ansi-black-fg { + color: #3E424D; +} +.ansi-black-bg { + background-color: #3E424D; +} +.ansi-black-intense-fg { + color: #282C36; +} +.ansi-black-intense-bg { + background-color: #282C36; +} +.ansi-red-fg { + color: #E75C58; +} +.ansi-red-bg { + background-color: #E75C58; +} +.ansi-red-intense-fg { + color: #B22B31; +} +.ansi-red-intense-bg { + background-color: #B22B31; +} +.ansi-green-fg { + color: #00A250; +} +.ansi-green-bg { + background-color: #00A250; +} +.ansi-green-intense-fg { + color: #007427; +} +.ansi-green-intense-bg { + background-color: #007427; +} +.ansi-yellow-fg { + color: #DDB62B; +} +.ansi-yellow-bg { + background-color: #DDB62B; +} +.ansi-yellow-intense-fg { + color: #B27D12; +} +.ansi-yellow-intense-bg { + background-color: #B27D12; +} +.ansi-blue-fg { + color: #208FFB; +} +.ansi-blue-bg { + background-color: #208FFB; +} +.ansi-blue-intense-fg { + color: #0065CA; +} +.ansi-blue-intense-bg { + background-color: #0065CA; +} +.ansi-magenta-fg { + color: #D160C4; +} +.ansi-magenta-bg { + background-color: #D160C4; +} +.ansi-magenta-intense-fg { + color: #A03196; +} +.ansi-magenta-intense-bg { + background-color: #A03196; +} +.ansi-cyan-fg { + color: #60C6C8; +} +.ansi-cyan-bg { + background-color: #60C6C8; +} +.ansi-cyan-intense-fg { + color: #258F8F; +} +.ansi-cyan-intense-bg { + background-color: #258F8F; +} +.ansi-white-fg { + color: #C5C1B4; +} +.ansi-white-bg { + background-color: #C5C1B4; +} +.ansi-white-intense-fg { + color: #A1A6B2; +} +.ansi-white-intense-bg { + 
background-color: #A1A6B2; +} +.ansi-default-inverse-fg { + color: #FFFFFF; +} +.ansi-default-inverse-bg { + background-color: #000000; +} +.ansi-bold { + font-weight: bold; +} +.ansi-underline { + text-decoration: underline; +} +/* The following styles are deprecated an will be removed in a future version */ +.ansibold { + font-weight: bold; +} +.ansi-inverse { + outline: 0.5px dotted; +} +/* use dark versions for foreground, to improve visibility */ +.ansiblack { + color: black; +} +.ansired { + color: darkred; +} +.ansigreen { + color: darkgreen; +} +.ansiyellow { + color: #c4a000; +} +.ansiblue { + color: darkblue; +} +.ansipurple { + color: darkviolet; +} +.ansicyan { + color: steelblue; +} +.ansigray { + color: gray; +} +/* and light for background, for the same reason */ +.ansibgblack { + background-color: black; +} +.ansibgred { + background-color: red; +} +.ansibggreen { + background-color: green; +} +.ansibgyellow { + background-color: yellow; +} +.ansibgblue { + background-color: blue; +} +.ansibgpurple { + background-color: magenta; +} +.ansibgcyan { + background-color: cyan; +} +.ansibggray { + background-color: gray; +} \ No newline at end of file diff --git a/css/custom.css b/css/custom.css new file mode 100644 index 0000000000..1217777fcd --- /dev/null +++ b/css/custom.css @@ -0,0 +1,33 @@ +code.language-python > span:not(:first-child) { + font-size: small; + color: rgb(113 112 112); +} + +.vega-embed details, +.vega-embed summary details, +.vega-embed summary, +.vega-embed summary::before, +.vega-embed summary::after { + all: unset; +} + +/* 'This one weird trick will increase specificity'! 
:awesome: */ +/* This deals with the problem of mkdocs overriding the styles */ +/* on the summary button on vega embed charts */ +.vega-embed.vega-embed.vega-embed summary { + list-style: none; + position: absolute; + top: 0; + right: 0; + padding: 6px; + z-index: 1000; + background: white; + box-shadow: 1px 1px 3px rgba(0, 0, 0, 0.1); + color: #1b1e23; + border: 1px solid #aaa; + border-radius: 999px; + opacity: 0.2; + transition: opacity 0.4s ease-in; + cursor: pointer; + line-height: 0px; +} diff --git a/css/jupyter-cells.css b/css/jupyter-cells.css new file mode 100644 index 0000000000..46def9f9c4 --- /dev/null +++ b/css/jupyter-cells.css @@ -0,0 +1,10 @@ +/* Input cells */ +.input code, .input pre { + background-color: #3333aa11; +} + +/* Output cells */ +.output pre { + background-color: #ececec80; + padding: 10px; +} diff --git a/css/neoteroi-mkdocs.css b/css/neoteroi-mkdocs.css new file mode 100644 index 0000000000..4265dc22d0 --- /dev/null +++ b/css/neoteroi-mkdocs.css @@ -0,0 +1,1625 @@ +/** + * All CSS for the neoteroi-mkdocs extensions. 
+ * + * https://github.com/Neoteroi/mkdocs-plugins +**/ +:root { + --nt-color-0: #CD853F; + --nt-color-1: #B22222; + --nt-color-2: #000080; + --nt-color-3: #4B0082; + --nt-color-4: #3CB371; + --nt-color-5: #D2B48C; + --nt-color-6: #FF00FF; + --nt-color-7: #98FB98; + --nt-color-8: #FFEBCD; + --nt-color-9: #2E8B57; + --nt-color-10: #6A5ACD; + --nt-color-11: #48D1CC; + --nt-color-12: #FFA500; + --nt-color-13: #F4A460; + --nt-color-14: #A52A2A; + --nt-color-15: #FFE4C4; + --nt-color-16: #FF4500; + --nt-color-17: #AFEEEE; + --nt-color-18: #FA8072; + --nt-color-19: #2F4F4F; + --nt-color-20: #FFDAB9; + --nt-color-21: #BC8F8F; + --nt-color-22: #FFC0CB; + --nt-color-23: #00FA9A; + --nt-color-24: #F0FFF0; + --nt-color-25: #FFFACD; + --nt-color-26: #F5F5F5; + --nt-color-27: #FF6347; + --nt-color-28: #FFFFF0; + --nt-color-29: #7FFFD4; + --nt-color-30: #E9967A; + --nt-color-31: #7B68EE; + --nt-color-32: #FFF8DC; + --nt-color-33: #0000CD; + --nt-color-34: #D2691E; + --nt-color-35: #708090; + --nt-color-36: #5F9EA0; + --nt-color-37: #008080; + --nt-color-38: #008000; + --nt-color-39: #FFE4E1; + --nt-color-40: #FFFF00; + --nt-color-41: #FFFAF0; + --nt-color-42: #DCDCDC; + --nt-color-43: #ADFF2F; + --nt-color-44: #ADD8E6; + --nt-color-45: #8B008B; + --nt-color-46: #7FFF00; + --nt-color-47: #800000; + --nt-color-48: #20B2AA; + --nt-color-49: #556B2F; + --nt-color-50: #778899; + --nt-color-51: #E6E6FA; + --nt-color-52: #FFFAFA; + --nt-color-53: #FF7F50; + --nt-color-54: #FF0000; + --nt-color-55: #F5DEB3; + --nt-color-56: #008B8B; + --nt-color-57: #66CDAA; + --nt-color-58: #808000; + --nt-color-59: #FAF0E6; + --nt-color-60: #00BFFF; + --nt-color-61: #C71585; + --nt-color-62: #00FFFF; + --nt-color-63: #8B4513; + --nt-color-64: #F0F8FF; + --nt-color-65: #FAEBD7; + --nt-color-66: #8B0000; + --nt-color-67: #4682B4; + --nt-color-68: #F0E68C; + --nt-color-69: #BDB76B; + --nt-color-70: #A0522D; + --nt-color-71: #FAFAD2; + --nt-color-72: #FFD700; + --nt-color-73: #DEB887; + --nt-color-74: 
#E0FFFF; + --nt-color-75: #8A2BE2; + --nt-color-76: #32CD32; + --nt-color-77: #87CEFA; + --nt-color-78: #00CED1; + --nt-color-79: #696969; + --nt-color-80: #DDA0DD; + --nt-color-81: #EE82EE; + --nt-color-82: #FFB6C1; + --nt-color-83: #8FBC8F; + --nt-color-84: #D8BFD8; + --nt-color-85: #9400D3; + --nt-color-86: #A9A9A9; + --nt-color-87: #FFFFE0; + --nt-color-88: #FFF5EE; + --nt-color-89: #FFF0F5; + --nt-color-90: #FFDEAD; + --nt-color-91: #800080; + --nt-color-92: #B0E0E6; + --nt-color-93: #9932CC; + --nt-color-94: #DAA520; + --nt-color-95: #F0FFFF; + --nt-color-96: #40E0D0; + --nt-color-97: #00FF7F; + --nt-color-98: #006400; + --nt-color-99: #808080; + --nt-color-100: #87CEEB; + --nt-color-101: #0000FF; + --nt-color-102: #6495ED; + --nt-color-103: #FDF5E6; + --nt-color-104: #B8860B; + --nt-color-105: #BA55D3; + --nt-color-106: #C0C0C0; + --nt-color-107: #000000; + --nt-color-108: #F08080; + --nt-color-109: #B0C4DE; + --nt-color-110: #00008B; + --nt-color-111: #6B8E23; + --nt-color-112: #FFE4B5; + --nt-color-113: #FFA07A; + --nt-color-114: #9ACD32; + --nt-color-115: #FFFFFF; + --nt-color-116: #F5F5DC; + --nt-color-117: #90EE90; + --nt-color-118: #1E90FF; + --nt-color-119: #7CFC00; + --nt-color-120: #FF69B4; + --nt-color-121: #F8F8FF; + --nt-color-122: #F5FFFA; + --nt-color-123: #00FF00; + --nt-color-124: #D3D3D3; + --nt-color-125: #DB7093; + --nt-color-126: #DA70D6; + --nt-color-127: #FF1493; + --nt-color-128: #228B22; + --nt-color-129: #FFEFD5; + --nt-color-130: #4169E1; + --nt-color-131: #191970; + --nt-color-132: #9370DB; + --nt-color-133: #483D8B; + --nt-color-134: #FF8C00; + --nt-color-135: #EEE8AA; + --nt-color-136: #CD5C5C; + --nt-color-137: #DC143C; +} + +:root { + --nt-group-0-main: #000000; + --nt-group-0-dark: #FFFFFF; + --nt-group-0-light: #000000; + --nt-group-0-main-bg: #F44336; + --nt-group-0-dark-bg: #BA000D; + --nt-group-0-light-bg: #FF7961; + --nt-group-1-main: #000000; + --nt-group-1-dark: #FFFFFF; + --nt-group-1-light: #000000; + 
--nt-group-1-main-bg: #E91E63; + --nt-group-1-dark-bg: #B0003A; + --nt-group-1-light-bg: #FF6090; + --nt-group-2-main: #FFFFFF; + --nt-group-2-dark: #FFFFFF; + --nt-group-2-light: #000000; + --nt-group-2-main-bg: #9C27B0; + --nt-group-2-dark-bg: #6A0080; + --nt-group-2-light-bg: #D05CE3; + --nt-group-3-main: #FFFFFF; + --nt-group-3-dark: #FFFFFF; + --nt-group-3-light: #000000; + --nt-group-3-main-bg: #673AB7; + --nt-group-3-dark-bg: #320B86; + --nt-group-3-light-bg: #9A67EA; + --nt-group-4-main: #FFFFFF; + --nt-group-4-dark: #FFFFFF; + --nt-group-4-light: #000000; + --nt-group-4-main-bg: #3F51B5; + --nt-group-4-dark-bg: #002984; + --nt-group-4-light-bg: #757DE8; + --nt-group-5-main: #000000; + --nt-group-5-dark: #FFFFFF; + --nt-group-5-light: #000000; + --nt-group-5-main-bg: #2196F3; + --nt-group-5-dark-bg: #0069C0; + --nt-group-5-light-bg: #6EC6FF; + --nt-group-6-main: #000000; + --nt-group-6-dark: #FFFFFF; + --nt-group-6-light: #000000; + --nt-group-6-main-bg: #03A9F4; + --nt-group-6-dark-bg: #007AC1; + --nt-group-6-light-bg: #67DAFF; + --nt-group-7-main: #000000; + --nt-group-7-dark: #000000; + --nt-group-7-light: #000000; + --nt-group-7-main-bg: #00BCD4; + --nt-group-7-dark-bg: #008BA3; + --nt-group-7-light-bg: #62EFFF; + --nt-group-8-main: #000000; + --nt-group-8-dark: #FFFFFF; + --nt-group-8-light: #000000; + --nt-group-8-main-bg: #009688; + --nt-group-8-dark-bg: #00675B; + --nt-group-8-light-bg: #52C7B8; + --nt-group-9-main: #000000; + --nt-group-9-dark: #FFFFFF; + --nt-group-9-light: #000000; + --nt-group-9-main-bg: #4CAF50; + --nt-group-9-dark-bg: #087F23; + --nt-group-9-light-bg: #80E27E; + --nt-group-10-main: #000000; + --nt-group-10-dark: #000000; + --nt-group-10-light: #000000; + --nt-group-10-main-bg: #8BC34A; + --nt-group-10-dark-bg: #5A9216; + --nt-group-10-light-bg: #BEF67A; + --nt-group-11-main: #000000; + --nt-group-11-dark: #000000; + --nt-group-11-light: #000000; + --nt-group-11-main-bg: #CDDC39; + --nt-group-11-dark-bg: #99AA00; + 
--nt-group-11-light-bg: #FFFF6E; + --nt-group-12-main: #000000; + --nt-group-12-dark: #000000; + --nt-group-12-light: #000000; + --nt-group-12-main-bg: #FFEB3B; + --nt-group-12-dark-bg: #C8B900; + --nt-group-12-light-bg: #FFFF72; + --nt-group-13-main: #000000; + --nt-group-13-dark: #000000; + --nt-group-13-light: #000000; + --nt-group-13-main-bg: #FFC107; + --nt-group-13-dark-bg: #C79100; + --nt-group-13-light-bg: #FFF350; + --nt-group-14-main: #000000; + --nt-group-14-dark: #000000; + --nt-group-14-light: #000000; + --nt-group-14-main-bg: #FF9800; + --nt-group-14-dark-bg: #C66900; + --nt-group-14-light-bg: #FFC947; + --nt-group-15-main: #000000; + --nt-group-15-dark: #FFFFFF; + --nt-group-15-light: #000000; + --nt-group-15-main-bg: #FF5722; + --nt-group-15-dark-bg: #C41C00; + --nt-group-15-light-bg: #FF8A50; + --nt-group-16-main: #FFFFFF; + --nt-group-16-dark: #FFFFFF; + --nt-group-16-light: #000000; + --nt-group-16-main-bg: #795548; + --nt-group-16-dark-bg: #4B2C20; + --nt-group-16-light-bg: #A98274; + --nt-group-17-main: #000000; + --nt-group-17-dark: #FFFFFF; + --nt-group-17-light: #000000; + --nt-group-17-main-bg: #9E9E9E; + --nt-group-17-dark-bg: #707070; + --nt-group-17-light-bg: #CFCFCF; + --nt-group-18-main: #000000; + --nt-group-18-dark: #FFFFFF; + --nt-group-18-light: #000000; + --nt-group-18-main-bg: #607D8B; + --nt-group-18-dark-bg: #34515E; + --nt-group-18-light-bg: #8EACBB; +} + +.nt-pastello { + --nt-group-0-main: #000000; + --nt-group-0-dark: #000000; + --nt-group-0-light: #000000; + --nt-group-0-main-bg: #EF9A9A; + --nt-group-0-dark-bg: #BA6B6C; + --nt-group-0-light-bg: #FFCCCB; + --nt-group-1-main: #000000; + --nt-group-1-dark: #000000; + --nt-group-1-light: #000000; + --nt-group-1-main-bg: #F48FB1; + --nt-group-1-dark-bg: #BF5F82; + --nt-group-1-light-bg: #FFC1E3; + --nt-group-2-main: #000000; + --nt-group-2-dark: #000000; + --nt-group-2-light: #000000; + --nt-group-2-main-bg: #CE93D8; + --nt-group-2-dark-bg: #9C64A6; + --nt-group-2-light-bg: 
#FFC4FF; + --nt-group-3-main: #000000; + --nt-group-3-dark: #000000; + --nt-group-3-light: #000000; + --nt-group-3-main-bg: #B39DDB; + --nt-group-3-dark-bg: #836FA9; + --nt-group-3-light-bg: #E6CEFF; + --nt-group-4-main: #000000; + --nt-group-4-dark: #000000; + --nt-group-4-light: #000000; + --nt-group-4-main-bg: #9FA8DA; + --nt-group-4-dark-bg: #6F79A8; + --nt-group-4-light-bg: #D1D9FF; + --nt-group-5-main: #000000; + --nt-group-5-dark: #000000; + --nt-group-5-light: #000000; + --nt-group-5-main-bg: #90CAF9; + --nt-group-5-dark-bg: #5D99C6; + --nt-group-5-light-bg: #C3FDFF; + --nt-group-6-main: #000000; + --nt-group-6-dark: #000000; + --nt-group-6-light: #000000; + --nt-group-6-main-bg: #81D4FA; + --nt-group-6-dark-bg: #4BA3C7; + --nt-group-6-light-bg: #B6FFFF; + --nt-group-7-main: #000000; + --nt-group-7-dark: #000000; + --nt-group-7-light: #000000; + --nt-group-7-main-bg: #80DEEA; + --nt-group-7-dark-bg: #4BACB8; + --nt-group-7-light-bg: #B4FFFF; + --nt-group-8-main: #000000; + --nt-group-8-dark: #000000; + --nt-group-8-light: #000000; + --nt-group-8-main-bg: #80CBC4; + --nt-group-8-dark-bg: #4F9A94; + --nt-group-8-light-bg: #B2FEF7; + --nt-group-9-main: #000000; + --nt-group-9-dark: #000000; + --nt-group-9-light: #000000; + --nt-group-9-main-bg: #A5D6A7; + --nt-group-9-dark-bg: #75A478; + --nt-group-9-light-bg: #D7FFD9; + --nt-group-10-main: #000000; + --nt-group-10-dark: #000000; + --nt-group-10-light: #000000; + --nt-group-10-main-bg: #C5E1A5; + --nt-group-10-dark-bg: #94AF76; + --nt-group-10-light-bg: #F8FFD7; + --nt-group-11-main: #000000; + --nt-group-11-dark: #000000; + --nt-group-11-light: #000000; + --nt-group-11-main-bg: #E6EE9C; + --nt-group-11-dark-bg: #B3BC6D; + --nt-group-11-light-bg: #FFFFCE; + --nt-group-12-main: #000000; + --nt-group-12-dark: #000000; + --nt-group-12-light: #000000; + --nt-group-12-main-bg: #FFF59D; + --nt-group-12-dark-bg: #CBC26D; + --nt-group-12-light-bg: #FFFFCF; + --nt-group-13-main: #000000; + --nt-group-13-dark: #000000; 
+ --nt-group-13-light: #000000; + --nt-group-13-main-bg: #FFE082; + --nt-group-13-dark-bg: #CAAE53; + --nt-group-13-light-bg: #FFFFB3; + --nt-group-14-main: #000000; + --nt-group-14-dark: #000000; + --nt-group-14-light: #000000; + --nt-group-14-main-bg: #FFCC80; + --nt-group-14-dark-bg: #CA9B52; + --nt-group-14-light-bg: #FFFFB0; + --nt-group-15-main: #000000; + --nt-group-15-dark: #000000; + --nt-group-15-light: #000000; + --nt-group-15-main-bg: #FFAB91; + --nt-group-15-dark-bg: #C97B63; + --nt-group-15-light-bg: #FFDDC1; + --nt-group-16-main: #000000; + --nt-group-16-dark: #000000; + --nt-group-16-light: #000000; + --nt-group-16-main-bg: #BCAAA4; + --nt-group-16-dark-bg: #8C7B75; + --nt-group-16-light-bg: #EFDCD5; + --nt-group-17-main: #000000; + --nt-group-17-dark: #000000; + --nt-group-17-light: #000000; + --nt-group-17-main-bg: #EEEEEE; + --nt-group-17-dark-bg: #BCBCBC; + --nt-group-17-light-bg: #FFFFFF; + --nt-group-18-main: #000000; + --nt-group-18-dark: #000000; + --nt-group-18-light: #000000; + --nt-group-18-main-bg: #B0BEC5; + --nt-group-18-dark-bg: #808E95; + --nt-group-18-light-bg: #E2F1F8; +} + +.nt-group-0 .nt-plan-group-summary, +.nt-group-0 .nt-timeline-dot { + color: var(--nt-group-0-dark); + background-color: var(--nt-group-0-dark-bg); +} +.nt-group-0 .period { + color: var(--nt-group-0-main); + background-color: var(--nt-group-0-main-bg); +} + +.nt-group-1 .nt-plan-group-summary, +.nt-group-1 .nt-timeline-dot { + color: var(--nt-group-1-dark); + background-color: var(--nt-group-1-dark-bg); +} +.nt-group-1 .period { + color: var(--nt-group-1-main); + background-color: var(--nt-group-1-main-bg); +} + +.nt-group-2 .nt-plan-group-summary, +.nt-group-2 .nt-timeline-dot { + color: var(--nt-group-2-dark); + background-color: var(--nt-group-2-dark-bg); +} +.nt-group-2 .period { + color: var(--nt-group-2-main); + background-color: var(--nt-group-2-main-bg); +} + +.nt-group-3 .nt-plan-group-summary, +.nt-group-3 .nt-timeline-dot { + color: 
var(--nt-group-3-dark); + background-color: var(--nt-group-3-dark-bg); +} +.nt-group-3 .period { + color: var(--nt-group-3-main); + background-color: var(--nt-group-3-main-bg); +} + +.nt-group-4 .nt-plan-group-summary, +.nt-group-4 .nt-timeline-dot { + color: var(--nt-group-4-dark); + background-color: var(--nt-group-4-dark-bg); +} +.nt-group-4 .period { + color: var(--nt-group-4-main); + background-color: var(--nt-group-4-main-bg); +} + +.nt-group-5 .nt-plan-group-summary, +.nt-group-5 .nt-timeline-dot { + color: var(--nt-group-5-dark); + background-color: var(--nt-group-5-dark-bg); +} +.nt-group-5 .period { + color: var(--nt-group-5-main); + background-color: var(--nt-group-5-main-bg); +} + +.nt-group-6 .nt-plan-group-summary, +.nt-group-6 .nt-timeline-dot { + color: var(--nt-group-6-dark); + background-color: var(--nt-group-6-dark-bg); +} +.nt-group-6 .period { + color: var(--nt-group-6-main); + background-color: var(--nt-group-6-main-bg); +} + +.nt-group-7 .nt-plan-group-summary, +.nt-group-7 .nt-timeline-dot { + color: var(--nt-group-7-dark); + background-color: var(--nt-group-7-dark-bg); +} +.nt-group-7 .period { + color: var(--nt-group-7-main); + background-color: var(--nt-group-7-main-bg); +} + +.nt-group-8 .nt-plan-group-summary, +.nt-group-8 .nt-timeline-dot { + color: var(--nt-group-8-dark); + background-color: var(--nt-group-8-dark-bg); +} +.nt-group-8 .period { + color: var(--nt-group-8-main); + background-color: var(--nt-group-8-main-bg); +} + +.nt-group-9 .nt-plan-group-summary, +.nt-group-9 .nt-timeline-dot { + color: var(--nt-group-9-dark); + background-color: var(--nt-group-9-dark-bg); +} +.nt-group-9 .period { + color: var(--nt-group-9-main); + background-color: var(--nt-group-9-main-bg); +} + +.nt-group-10 .nt-plan-group-summary, +.nt-group-10 .nt-timeline-dot { + color: var(--nt-group-10-dark); + background-color: var(--nt-group-10-dark-bg); +} +.nt-group-10 .period { + color: var(--nt-group-10-main); + background-color: 
var(--nt-group-10-main-bg); +} + +.nt-group-11 .nt-plan-group-summary, +.nt-group-11 .nt-timeline-dot { + color: var(--nt-group-11-dark); + background-color: var(--nt-group-11-dark-bg); +} +.nt-group-11 .period { + color: var(--nt-group-11-main); + background-color: var(--nt-group-11-main-bg); +} + +.nt-group-12 .nt-plan-group-summary, +.nt-group-12 .nt-timeline-dot { + color: var(--nt-group-12-dark); + background-color: var(--nt-group-12-dark-bg); +} +.nt-group-12 .period { + color: var(--nt-group-12-main); + background-color: var(--nt-group-12-main-bg); +} + +.nt-group-13 .nt-plan-group-summary, +.nt-group-13 .nt-timeline-dot { + color: var(--nt-group-13-dark); + background-color: var(--nt-group-13-dark-bg); +} +.nt-group-13 .period { + color: var(--nt-group-13-main); + background-color: var(--nt-group-13-main-bg); +} + +.nt-group-14 .nt-plan-group-summary, +.nt-group-14 .nt-timeline-dot { + color: var(--nt-group-14-dark); + background-color: var(--nt-group-14-dark-bg); +} +.nt-group-14 .period { + color: var(--nt-group-14-main); + background-color: var(--nt-group-14-main-bg); +} + +.nt-group-15 .nt-plan-group-summary, +.nt-group-15 .nt-timeline-dot { + color: var(--nt-group-15-dark); + background-color: var(--nt-group-15-dark-bg); +} +.nt-group-15 .period { + color: var(--nt-group-15-main); + background-color: var(--nt-group-15-main-bg); +} + +.nt-group-16 .nt-plan-group-summary, +.nt-group-16 .nt-timeline-dot { + color: var(--nt-group-16-dark); + background-color: var(--nt-group-16-dark-bg); +} +.nt-group-16 .period { + color: var(--nt-group-16-main); + background-color: var(--nt-group-16-main-bg); +} + +.nt-group-17 .nt-plan-group-summary, +.nt-group-17 .nt-timeline-dot { + color: var(--nt-group-17-dark); + background-color: var(--nt-group-17-dark-bg); +} +.nt-group-17 .period { + color: var(--nt-group-17-main); + background-color: var(--nt-group-17-main-bg); +} + +.nt-group-18 .nt-plan-group-summary, +.nt-group-18 .nt-timeline-dot { + color: 
var(--nt-group-18-dark); + background-color: var(--nt-group-18-dark-bg); +} +.nt-group-18 .period { + color: var(--nt-group-18-main); + background-color: var(--nt-group-18-main-bg); +} + +/** + * Extra CSS file for MkDocs and the neoteroi.timeline extension. + * + * https://github.com/Neoteroi/mkdocs-plugins +**/ +.nt-error { + border: 2px dashed darkred; + padding: 0 1rem; + background: #faf9ba; + color: darkred; +} + +.nt-timeline { + margin-top: 30px; +} +.nt-timeline .nt-timeline-title { + font-size: 1.1rem; + margin-top: 0; +} +.nt-timeline .nt-timeline-sub-title { + margin-top: 0; +} +.nt-timeline .nt-timeline-content { + font-size: 0.8rem; + border-bottom: 2px dashed #ccc; + padding-bottom: 1.2rem; +} +.nt-timeline.horizontal .nt-timeline-items { + flex-direction: row; + overflow-x: scroll; +} +.nt-timeline.horizontal .nt-timeline-items > div { + min-width: 400px; + margin-right: 50px; +} +.nt-timeline.horizontal.reverse .nt-timeline-items { + flex-direction: row-reverse; +} +.nt-timeline.horizontal.center .nt-timeline-before { + background-image: linear-gradient(rgba(252, 70, 107, 0) 0%, rgb(252, 70, 107) 100%); + background-repeat: no-repeat; + background-size: 100% 2px; + background-position: 0 center; +} +.nt-timeline.horizontal.center .nt-timeline-after { + background-image: linear-gradient(180deg, rgb(252, 70, 107) 0%, rgba(252, 70, 107, 0) 100%); + background-repeat: no-repeat; + background-size: 100% 2px; + background-position: 0 center; +} +.nt-timeline.horizontal.center .nt-timeline-items { + background-image: radial-gradient(circle, rgb(63, 94, 251) 0%, rgb(252, 70, 107) 100%); + background-repeat: no-repeat; + background-size: 100% 2px; + background-position: 0 center; +} +.nt-timeline.horizontal .nt-timeline-dot { + left: 50%; +} +.nt-timeline.horizontal .nt-timeline-dot:not(.bigger) { + top: calc(50% - 4px); +} +.nt-timeline.horizontal .nt-timeline-dot.bigger { + top: calc(50% - 15px); +} +.nt-timeline.vertical .nt-timeline-items { + 
flex-direction: column; +} +.nt-timeline.vertical.reverse .nt-timeline-items { + flex-direction: column-reverse; +} +.nt-timeline.vertical.center .nt-timeline-before { + background: linear-gradient(rgba(252, 70, 107, 0) 0%, rgb(252, 70, 107) 100%) no-repeat center/2px 100%; +} +.nt-timeline.vertical.center .nt-timeline-after { + background: linear-gradient(rgb(252, 70, 107) 0%, rgba(252, 70, 107, 0) 100%) no-repeat center/2px 100%; +} +.nt-timeline.vertical.center .nt-timeline-items { + background: radial-gradient(circle, rgb(63, 94, 251) 0%, rgb(252, 70, 107) 100%) no-repeat center/2px 100%; +} +.nt-timeline.vertical.center .nt-timeline-dot { + left: calc(50% - 10px); +} +.nt-timeline.vertical.center .nt-timeline-dot:not(.bigger) { + top: 10px; +} +.nt-timeline.vertical.center .nt-timeline-dot.bigger { + left: calc(50% - 20px); +} +.nt-timeline.vertical.left { + padding-left: 100px; +} +.nt-timeline.vertical.left .nt-timeline-item { + padding-left: 70px; +} +.nt-timeline.vertical.left .nt-timeline-sub-title { + left: -100px; + width: 100px; +} +.nt-timeline.vertical.left .nt-timeline-before { + background: linear-gradient(rgba(252, 70, 107, 0) 0%, rgb(252, 70, 107) 100%) no-repeat 30px/2px 100%; +} +.nt-timeline.vertical.left .nt-timeline-after { + background: linear-gradient(rgb(252, 70, 107) 0%, rgba(252, 70, 107, 0) 100%) no-repeat 30px/2px 100%; +} +.nt-timeline.vertical.left .nt-timeline-items { + background: radial-gradient(circle, rgb(63, 94, 251) 0%, rgb(252, 70, 107) 100%) no-repeat 30px/2px 100%; +} +.nt-timeline.vertical.left .nt-timeline-dot { + left: 21px; + top: 8px; +} +.nt-timeline.vertical.left .nt-timeline-dot.bigger { + top: 0px; + left: 10px; +} +.nt-timeline.vertical.right { + padding-right: 100px; +} +.nt-timeline.vertical.right .nt-timeline-sub-title { + right: -100px; + text-align: left; + width: 100px; +} +.nt-timeline.vertical.right .nt-timeline-item { + padding-right: 70px; +} +.nt-timeline.vertical.right .nt-timeline-before { + 
background: linear-gradient(rgba(252, 70, 107, 0) 0%, rgb(252, 70, 107) 100%) no-repeat calc(100% - 30px)/2px 100%; +} +.nt-timeline.vertical.right .nt-timeline-after { + background: linear-gradient(rgb(252, 70, 107) 0%, rgba(252, 70, 107, 0) 100%) no-repeat calc(100% - 30px)/2px 100%; +} +.nt-timeline.vertical.right .nt-timeline-items { + background: radial-gradient(circle, rgb(63, 94, 251) 0%, rgb(252, 70, 107) 100%) no-repeat calc(100% - 30px)/2px 100%; +} +.nt-timeline.vertical.right .nt-timeline-dot { + right: 21px; + top: 8px; +} +.nt-timeline.vertical.right .nt-timeline-dot.bigger { + top: 10px; + right: 10px; +} + +.nt-timeline-items { + display: flex; + position: relative; +} +.nt-timeline-items > div { + min-height: 100px; + padding-top: 2px; + padding-bottom: 20px; +} + +.nt-timeline-before { + content: ""; + height: 15px; +} + +.nt-timeline-after { + content: ""; + height: 60px; + margin-bottom: 20px; +} + +.nt-timeline-sub-title { + position: absolute; + width: 50%; + top: 4px; + font-size: 18px; + color: var(--nt-color-50); +} + +[data-md-color-scheme=slate] .nt-timeline-sub-title { + color: var(--nt-color-51); +} + +.nt-timeline-item { + position: relative; +} + +.nt-timeline.vertical.center:not(.alternate) .nt-timeline-item { + padding-left: calc(50% + 40px); +} +.nt-timeline.vertical.center:not(.alternate) .nt-timeline-item .nt-timeline-sub-title { + left: 0; + padding-right: 40px; + text-align: right; +} +.nt-timeline.vertical.center.alternate .nt-timeline-item:nth-child(odd) { + padding-left: calc(50% + 40px); +} +.nt-timeline.vertical.center.alternate .nt-timeline-item:nth-child(odd) .nt-timeline-sub-title { + left: 0; + padding-right: 40px; + text-align: right; +} +.nt-timeline.vertical.center.alternate .nt-timeline-item:nth-child(even) { + text-align: right; + padding-right: calc(50% + 40px); +} +.nt-timeline.vertical.center.alternate .nt-timeline-item:nth-child(even) .nt-timeline-sub-title { + right: 0; + padding-left: 40px; + text-align: 
left; +} + +.nt-timeline-dot { + position: relative; + width: 20px; + height: 20px; + border-radius: 100%; + background-color: #fc5b5b; + position: absolute; + top: 0px; + z-index: 2; + display: flex; + justify-content: center; + align-items: center; + box-shadow: 0 2px 1px -1px rgba(0, 0, 0, 0.2), 0 1px 1px 0 rgba(0, 0, 0, 0.14), 0 1px 3px 0 rgba(0, 0, 0, 0.12); + border: 3px solid white; +} +.nt-timeline-dot:not(.bigger) .icon { + font-size: 10px; +} +.nt-timeline-dot.bigger { + width: 40px; + height: 40px; + padding: 3px; +} +.nt-timeline-dot .icon { + color: white; + position: relative; + top: 1px; +} + +/* Fix for webkit (Chrome, Safari) */ +@supports not (-moz-appearance: none) { + /* + This fix is necessary, for some reason, to render the timeline properly + inside `details` elements used by pymdownx. Firefox doesn't need this fix, + it renders elements properly. + */ + details .nt-timeline.vertical.center.alternate .nt-timeline-item:nth-child(odd) .nt-timeline-sub-title, +details .nt-timeline.vertical.center:not(.alternate) .nt-timeline-item .nt-timeline-sub-title { + left: -40px; + } + details .nt-timeline.vertical.center.alternate .nt-timeline-item:nth-child(even) .nt-timeline-sub-title { + right: -40px; + } + details .nt-timeline.vertical.center .nt-timeline-dot { + left: calc(50% - 12px); + } + details .nt-timeline-dot.bigger { + font-size: 1rem !important; + } +} +/* default colors */ +.nt-timeline-item:nth-child(0) .nt-timeline-dot { + background-color: var(--nt-color-0); +} + +.nt-timeline-item:nth-child(1) .nt-timeline-dot { + background-color: var(--nt-color-1); +} + +.nt-timeline-item:nth-child(2) .nt-timeline-dot { + background-color: var(--nt-color-2); +} + +.nt-timeline-item:nth-child(3) .nt-timeline-dot { + background-color: var(--nt-color-3); +} + +.nt-timeline-item:nth-child(4) .nt-timeline-dot { + background-color: var(--nt-color-4); +} + +.nt-timeline-item:nth-child(5) .nt-timeline-dot { + background-color: var(--nt-color-5); +} + 
+.nt-timeline-item:nth-child(6) .nt-timeline-dot { + background-color: var(--nt-color-6); +} + +.nt-timeline-item:nth-child(7) .nt-timeline-dot { + background-color: var(--nt-color-7); +} + +.nt-timeline-item:nth-child(8) .nt-timeline-dot { + background-color: var(--nt-color-8); +} + +.nt-timeline-item:nth-child(9) .nt-timeline-dot { + background-color: var(--nt-color-9); +} + +.nt-timeline-item:nth-child(10) .nt-timeline-dot { + background-color: var(--nt-color-10); +} + +.nt-timeline-item:nth-child(11) .nt-timeline-dot { + background-color: var(--nt-color-11); +} + +.nt-timeline-item:nth-child(12) .nt-timeline-dot { + background-color: var(--nt-color-12); +} + +.nt-timeline-item:nth-child(13) .nt-timeline-dot { + background-color: var(--nt-color-13); +} + +.nt-timeline-item:nth-child(14) .nt-timeline-dot { + background-color: var(--nt-color-14); +} + +.nt-timeline-item:nth-child(15) .nt-timeline-dot { + background-color: var(--nt-color-15); +} + +.nt-timeline-item:nth-child(16) .nt-timeline-dot { + background-color: var(--nt-color-16); +} + +.nt-timeline-item:nth-child(17) .nt-timeline-dot { + background-color: var(--nt-color-17); +} + +.nt-timeline-item:nth-child(18) .nt-timeline-dot { + background-color: var(--nt-color-18); +} + +.nt-timeline-item:nth-child(19) .nt-timeline-dot { + background-color: var(--nt-color-19); +} + +.nt-timeline-item:nth-child(20) .nt-timeline-dot { + background-color: var(--nt-color-20); +} + +/** + * Extra CSS for the neoteroi.projects.gantt extension. 
+ * + * https://github.com/Neoteroi/mkdocs-plugins +**/ +:root { + --nt-scrollbar-color: #2751b0; + --nt-plan-actions-height: 24px; + --nt-units-background: #ff9800; + --nt-months-background: #2751b0; + --nt-plan-vertical-line-color: #a3a3a3ad; +} + +.nt-pastello { + --nt-scrollbar-color: #9fb8f4; + --nt-units-background: #f5dc82; + --nt-months-background: #5b7fd1; +} + +[data-md-color-scheme=slate] { + --nt-units-background: #003773; +} +[data-md-color-scheme=slate] .nt-pastello { + --nt-units-background: #3f4997; +} + +.nt-plan-root { + min-height: 200px; + scrollbar-width: 20px; + scrollbar-color: var(--nt-scrollbar-color); + display: flex; +} +.nt-plan-root ::-webkit-scrollbar { + width: 20px; +} +.nt-plan-root ::-webkit-scrollbar-track { + box-shadow: inset 0 0 5px grey; + border-radius: 10px; +} +.nt-plan-root ::-webkit-scrollbar-thumb { + background: var(--nt-scrollbar-color); + border-radius: 10px; +} +.nt-plan-root .nt-plan { + flex: 80%; +} +.nt-plan-root.no-groups .nt-plan-periods { + padding-left: 0; +} +.nt-plan-root.no-groups .nt-plan-group-summary { + display: none; +} +.nt-plan-root .nt-timeline-dot.bigger { + top: -10px; +} +.nt-plan-root .nt-timeline-dot.bigger[title] { + cursor: help; +} + +.nt-plan { + white-space: nowrap; + overflow-x: auto; + display: flex; +} +.nt-plan .ug-timeline-dot { + left: 368px; + top: -8px; + cursor: help; +} + +.months { + display: flex; +} + +.month { + flex: auto; + display: inline-block; + box-shadow: rgba(0, 0, 0, 0.2) 0px 3px 1px -2px, rgba(0, 0, 0, 0.14) 0px 2px 2px 0px, rgba(0, 0, 0, 0.12) 0px 1px 5px 0px inset; + background-color: var(--nt-months-background); + color: white; + text-transform: uppercase; + font-family: Roboto, Helvetica, Arial, sans-serif; + padding: 2px 5px; + font-size: 12px; + border: 1px solid #000; + width: 150px; + border-radius: 8px; +} + +.nt-plan-group-activities { + flex: auto; + position: relative; +} + +.nt-vline { + border-left: 1px dashed var(--nt-plan-vertical-line-color); + 
height: 100%; + left: 0; + position: absolute; + margin-left: -0.5px; + top: 0; + -webkit-transition: all 0.5s linear !important; + -moz-transition: all 0.5s linear !important; + -ms-transition: all 0.5s linear !important; + -o-transition: all 0.5s linear !important; + transition: all 0.5s linear !important; + z-index: -2; +} + +.nt-plan-activity { + display: flex; + margin: 2px 0; + background-color: rgba(187, 187, 187, 0.2509803922); +} + +.actions { + height: var(--nt-plan-actions-height); +} + +.actions { + position: relative; +} + +.period { + display: inline-block; + height: var(--nt-plan-actions-height); + width: 120px; + position: absolute; + left: 0px; + background: #1da1f2; + border-radius: 5px; + transition: all 0.5s; + cursor: help; + -webkit-transition: width 1s ease-in-out; + -moz-transition: width 1s ease-in-out; + -o-transition: width 1s ease-in-out; + transition: width 1s ease-in-out; +} +.period .nt-tooltip { + display: none; + top: 30px; + position: relative; + padding: 1rem; + text-align: center; + font-size: 12px; +} +.period:hover .nt-tooltip { + display: inline-block; +} + +.period-0 { + left: 340px; + visibility: visible; + background-color: rgb(69, 97, 101); +} + +.period-1 { + left: 40px; + visibility: visible; + background-color: green; +} + +.period-2 { + left: 120px; + visibility: visible; + background-color: pink; + width: 80px; +} + +.period-3 { + left: 190px; + visibility: visible; + background-color: darkred; + width: 150px; +} + +.weeks > span, +.days > span { + height: 25px; +} + +.weeks > span { + display: inline-block; + margin: 0; + padding: 0; + font-weight: bold; +} +.weeks > span .week-text { + font-size: 10px; + position: absolute; + display: inline-block; + padding: 3px 4px; +} + +.days { + z-index: -2; + position: relative; +} + +.day-text { + font-size: 10px; + position: absolute; + display: inline-block; + padding: 3px 4px; +} + +.period span { + font-size: 12px; + vertical-align: top; + margin-left: 4px; + color: 
black; + background: rgba(255, 255, 255, 0.6588235294); + border-radius: 6px; + padding: 0 4px; +} + +.weeks, +.days { + height: 20px; + display: flex; + box-sizing: content-box; +} + +.months { + display: flex; +} + +.week, +.day { + height: 20px; + position: relative; + border: 1; + flex: auto; + border: 2px solid white; + border-radius: 4px; + background-color: var(--nt-units-background); + cursor: help; +} + +.years { + display: flex; +} + +.year { + text-align: center; + border-right: 1px solid var(--nt-plan-vertical-line-color); + font-weight: bold; +} +.year:first-child { + border-left: 1px solid var(--nt-plan-vertical-line-color); +} +.year:first-child:last-child { + width: 100%; +} + +.quarters { + display: flex; +} + +.quarter { + width: 12.5%; + text-align: center; + border-right: 1px solid var(--nt-plan-vertical-line-color); + font-weight: bold; +} +.quarter:first-child { + border-left: 1px solid var(--nt-plan-vertical-line-color); +} + +.nt-plan-group { + margin: 20px 0; + position: relative; +} + +.nt-plan-group { + display: flex; +} + +.nt-plan-group-summary { + background: #2751b0; + width: 150px; + white-space: normal; + padding: 0.1rem 0.5rem; + border-radius: 5px; + color: #fff; + z-index: 3; +} +.nt-plan-group-summary p { + margin: 0; + padding: 0; + font-size: 0.6rem; + color: #fff; +} + +.nt-plan-group-summary, +.month, +.period, +.week, +.day, +.nt-tooltip { + border: 3px solid white; + box-shadow: 0 2px 3px -1px rgba(0, 0, 0, 0.2), 0 3px 3px 0 rgba(0, 0, 0, 0.14), 0 1px 5px 0 rgba(0, 0, 0, 0.12); +} + +.nt-plan-periods { + padding-left: 150px; +} + +.months { + z-index: 2; + position: relative; +} + +.weeks { + position: relative; + top: -2px; + z-index: 0; +} + +.month, +.quarter, +.year, +.week, +.day, +.nt-tooltip { + font-family: Roboto, Helvetica, Arial, sans-serif; + box-sizing: border-box; +} + +.nt-cards.nt-grid { + display: grid; + grid-auto-columns: 1fr; + gap: 0.5rem; + max-width: 100vw; + overflow-x: auto; + padding: 1px; +} 
+.nt-cards.nt-grid.cols-1 { + grid-template-columns: repeat(1, 1fr); +} +.nt-cards.nt-grid.cols-2 { + grid-template-columns: repeat(2, 1fr); +} +.nt-cards.nt-grid.cols-3 { + grid-template-columns: repeat(3, 1fr); +} +.nt-cards.nt-grid.cols-4 { + grid-template-columns: repeat(4, 1fr); +} +.nt-cards.nt-grid.cols-5 { + grid-template-columns: repeat(5, 1fr); +} +.nt-cards.nt-grid.cols-6 { + grid-template-columns: repeat(6, 1fr); +} + +@media only screen and (max-width: 400px) { + .nt-cards.nt-grid { + grid-template-columns: repeat(1, 1fr) !important; + } +} +.nt-card { + box-shadow: 0 2px 2px 0 rgba(0, 0, 0, 0.14), 0 3px 1px -2px rgba(0, 0, 0, 0.2), 0 1px 5px 0 rgba(0, 0, 0, 0.12); +} +.nt-card:hover { + box-shadow: 0 2px 2px 0 rgba(0, 0, 0, 0.24), 0 3px 1px -2px rgba(0, 0, 0, 0.3), 0 1px 5px 0 rgba(0, 0, 0, 0.22); +} + +[data-md-color-scheme=slate] .nt-card { + box-shadow: 0 2px 2px 0 rgba(4, 40, 33, 0.14), 0 3px 1px -2px rgba(40, 86, 94, 0.47), 0 1px 5px 0 rgba(139, 252, 255, 0.64); +} +[data-md-color-scheme=slate] .nt-card:hover { + box-shadow: 0 2px 2px 0 rgba(0, 255, 206, 0.14), 0 3px 1px -2px rgba(33, 156, 177, 0.47), 0 1px 5px 0 rgba(96, 251, 255, 0.64); +} + +.nt-card > a { + color: var(--md-default-fg-color); +} + +.nt-card > a > div { + cursor: pointer; +} + +.nt-card { + padding: 5px; + margin-bottom: 0.5rem; +} + +.nt-card-title { + font-size: 1rem; + font-weight: bold; + margin: 4px 0 8px 0; + line-height: 22px; +} + +.nt-card-content { + padding: 0.4rem 0.8rem 0.8rem 0.8rem; +} + +.nt-card-text { + font-size: 14px; + padding: 0; + margin: 0; +} + +.nt-card .nt-card-image { + text-align: center; + border-radius: 2px; + background-position: center center; + background-size: cover; + background-repeat: no-repeat; + min-height: 120px; +} + +.nt-card .nt-card-image.tags img { + margin-top: 12px; +} + +.nt-card .nt-card-image img { + height: 105px; + margin-top: 5px; +} + +.nt-card .nt-card-icon { + text-align: center; + padding-top: 12px; + min-height: 120px; 
+} + +.nt-card .nt-card-icon .icon { + font-size: 95px; + line-height: 1; +} + +.nt-card a:hover, +.nt-card a:focus { + color: var(--md-accent-fg-color); +} + +.nt-card h2 { + margin: 0; +} + +/** + * Extra CSS file recommended for MkDocs and neoteroi.spantable extension. + * + * https://github.com/Neoteroi/mkdocs-plugins +**/ +.span-table-wrapper table { + border-collapse: collapse; + margin-bottom: 2rem; + border-radius: 0.1rem; +} + +.span-table td, +.span-table th { + padding: 0.2rem; + background-color: var(--md-default-bg-color); + font-size: 0.64rem; + max-width: 100%; + overflow: auto; + touch-action: auto; + border-top: 0.05rem solid var(--md-typeset-table-color); + padding: 0.9375em 1.25em; + vertical-align: top; +} + +.span-table tr:first-child td { + font-weight: 700; + min-width: 5rem; + padding: 0.9375em 1.25em; + vertical-align: top; +} + +.span-table td:first-child { + border-left: 0.05rem solid var(--md-typeset-table-color); +} + +.span-table td:last-child { + border-right: 0.05rem solid var(--md-typeset-table-color); +} + +.span-table tr:last-child { + border-bottom: 0.05rem solid var(--md-typeset-table-color); +} + +.span-table [colspan], +.span-table [rowspan] { + font-weight: bold; + border: 0.05rem solid var(--md-typeset-table-color); +} + +.span-table tr:not(:first-child):hover td:not([colspan]):not([rowspan]), +.span-table td[colspan]:hover, +.span-table td[rowspan]:hover { + background-color: rgba(0, 0, 0, 0.035); + box-shadow: 0 0.05rem 0 var(--md-default-bg-color) inset; + transition: background-color 125ms; +} + +.nt-contribs { + margin-top: 2rem; + font-size: small; + border-top: 1px dotted lightgray; + padding-top: 0.5rem; +} +.nt-contribs .nt-contributors { + padding-top: 0.5rem; + display: flex; + flex-wrap: wrap; +} +.nt-contribs .nt-contributor { + background: lightgrey; + background-size: cover; + width: 40px; + height: 40px; + border-radius: 100%; + margin: 0 6px 6px 0; + cursor: help; + opacity: 0.7; +} +.nt-contribs 
.nt-contributor:hover { + opacity: 1; +} +.nt-contribs .nt-contributors-title { + font-style: italic; + margin-bottom: 0; +} +.nt-contribs .nt-initials { + text-transform: uppercase; + font-size: 20px; + text-align: center; + width: 40px; + height: 40px; + display: inline-block; + vertical-align: middle; + position: relative; + top: 4px; + color: inherit; + font-weight: bold; +} +.nt-contribs .nt-group-0 { + background-color: var(--nt-color-0); +} +.nt-contribs .nt-group-1 { + background-color: var(--nt-color-1); +} +.nt-contribs .nt-group-2 { + background-color: var(--nt-color-2); +} +.nt-contribs .nt-group-3 { + background-color: var(--nt-color-3); +} +.nt-contribs .nt-group-4 { + background-color: var(--nt-color-4); +} +.nt-contribs .nt-group-5 { + background-color: var(--nt-color-5); +} +.nt-contribs .nt-group-6 { + background-color: var(--nt-color-6); +} +.nt-contribs .nt-group-7 { + color: #000; + background-color: var(--nt-color-7); +} +.nt-contribs .nt-group-8 { + color: #000; + background-color: var(--nt-color-8); +} +.nt-contribs .nt-group-9 { + background-color: var(--nt-color-9); +} +.nt-contribs .nt-group-10 { + background-color: var(--nt-color-10); +} +.nt-contribs .nt-group-11 { + background-color: var(--nt-color-11); +} +.nt-contribs .nt-group-12 { + background-color: var(--nt-color-12); +} +.nt-contribs .nt-group-13 { + background-color: var(--nt-color-13); +} +.nt-contribs .nt-group-14 { + background-color: var(--nt-color-14); +} +.nt-contribs .nt-group-15 { + color: #000; + background-color: var(--nt-color-15); +} +.nt-contribs .nt-group-16 { + background-color: var(--nt-color-16); +} +.nt-contribs .nt-group-17 { + color: #000; + background-color: var(--nt-color-17); +} +.nt-contribs .nt-group-18 { + background-color: var(--nt-color-18); +} +.nt-contribs .nt-group-19 { + background-color: var(--nt-color-19); +} +.nt-contribs .nt-group-20 { + color: #000; + background-color: var(--nt-color-20); +} +.nt-contribs .nt-group-21 { + color: #000; + 
background-color: var(--nt-color-21); +} +.nt-contribs .nt-group-22 { + color: #000; + background-color: var(--nt-color-22); +} +.nt-contribs .nt-group-23 { + color: #000; + background-color: var(--nt-color-23); +} +.nt-contribs .nt-group-24 { + color: #000; + background-color: var(--nt-color-24); +} +.nt-contribs .nt-group-25 { + color: #000; + background-color: var(--nt-color-25); +} +.nt-contribs .nt-group-26 { + color: #000; + background-color: var(--nt-color-26); +} +.nt-contribs .nt-group-27 { + background-color: var(--nt-color-27); +} +.nt-contribs .nt-group-28 { + color: #000; + background-color: var(--nt-color-28); +} +.nt-contribs .nt-group-29 { + color: #000; + background-color: var(--nt-color-29); +} +.nt-contribs .nt-group-30 { + background-color: var(--nt-color-30); +} +.nt-contribs .nt-group-31 { + background-color: var(--nt-color-31); +} +.nt-contribs .nt-group-32 { + color: #000; + background-color: var(--nt-color-32); +} +.nt-contribs .nt-group-33 { + background-color: var(--nt-color-33); +} +.nt-contribs .nt-group-34 { + background-color: var(--nt-color-34); +} +.nt-contribs .nt-group-35 { + background-color: var(--nt-color-35); +} +.nt-contribs .nt-group-36 { + background-color: var(--nt-color-36); +} +.nt-contribs .nt-group-37 { + background-color: var(--nt-color-37); +} +.nt-contribs .nt-group-38 { + background-color: var(--nt-color-38); +} +.nt-contribs .nt-group-39 { + color: #000; + background-color: var(--nt-color-39); +} +.nt-contribs .nt-group-40 { + color: #000; + background-color: var(--nt-color-40); +} +.nt-contribs .nt-group-41 { + color: #000; + background-color: var(--nt-color-41); +} +.nt-contribs .nt-group-42 { + color: #000; + background-color: var(--nt-color-42); +} +.nt-contribs .nt-group-43 { + color: #000; + background-color: var(--nt-color-43); +} +.nt-contribs .nt-group-44 { + color: #000; + background-color: var(--nt-color-44); +} +.nt-contribs .nt-group-45 { + background-color: var(--nt-color-45); +} +.nt-contribs 
.nt-group-46 { + color: #000; + background-color: var(--nt-color-46); +} +.nt-contribs .nt-group-47 { + background-color: var(--nt-color-47); +} +.nt-contribs .nt-group-48 { + background-color: var(--nt-color-48); +} +.nt-contribs .nt-group-49 { + background-color: var(--nt-color-49); +} diff --git a/css/pandas-dataframe.css b/css/pandas-dataframe.css new file mode 100644 index 0000000000..2c18015dba --- /dev/null +++ b/css/pandas-dataframe.css @@ -0,0 +1,36 @@ +/* Pretty Pandas Dataframes */ +/* Supports mkdocs-material color variables */ +.dataframe { + border: 0; + font-size: smaller; +} +.dataframe tr { + border: none; + background: var(--md-code-bg-color, #ffffff); +} +.dataframe tr:nth-child(even) { + background: var(--md-default-bg-color, #f5f5f5); +} +.dataframe tr:hover { + background-color: var(--md-footer-bg-color--dark, #e1f5fe); +} + +.dataframe thead th { + background: var(--md-default-bg-color, #ffffff); + border-bottom: 1px solid #aaa; + font-weight: bold; +} +.dataframe th { + border: none; + padding-left: 10px; + padding-right: 10px; +} + +.dataframe td{ + /* background: #fff; */ + border: none; + text-align: right; + min-width:5em; + padding-left: 10px; + padding-right: 10px; +} diff --git a/demos/conftest.py b/demos/conftest.py new file mode 100644 index 0000000000..7e334ec7e6 --- /dev/null +++ b/demos/conftest.py @@ -0,0 +1,5 @@ +# add default marker to all tests - this flag is on by default +# set in pyproject.toml to aid testing tests/ +def pytest_collection_modifyitems(items, config): + for item in items: + item.add_marker("default") diff --git a/demos/data/demos_requirements.txt b/demos/data/demos_requirements.txt new file mode 100644 index 0000000000..deb4675178 --- /dev/null +++ b/demos/data/demos_requirements.txt @@ -0,0 +1,6 @@ +ipywidgets==8.1.2 +pytest==8.0.0 +nbmake==1.5.0 +pytest-xdist==3.5.0 +jupyterlab==4.1.1 +rapidfuzz==3.6.1 diff --git a/demos/data/fake_1000_combined.json b/demos/data/fake_1000_combined.json new file mode 100644 
index 0000000000..37d0b1e0a6 --- /dev/null +++ b/demos/data/fake_1000_combined.json @@ -0,0 +1,827 @@ +{ + "current_settings_dict": { + "link_type": "dedupe_only", + "blocking_rules": [], + "comparison_columns": [ + { + "col_name": "first_name", + "num_levels": 3, + "term_frequency_adjustments": true, + "gamma_index": 0, + "data_type": "string", + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when first_name_l is null or first_name_r is null then -1\n when jaro_winkler_sim(first_name_l, first_name_r) >= 1.0 then 2\n when jaro_winkler_sim(first_name_l, first_name_r) >= 0.88 then 1\n else 0 end as gamma_first_name", + "m_probabilities": [ + 0.38295871197174913, + 0.15226716735596263, + 0.4647741206722884 + ], + "u_probabilities": [ + 0.99388666286368, + 0.0028669412982625627, + 0.0032463958380572937 + ], + "tf_adjustment_weights": [ + 0.0, + 0.0, + 1.0 + ] + }, + { + "col_name": "surname", + "num_levels": 3, + "term_frequency_adjustments": true, + "gamma_index": 1, + "data_type": "string", + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when surname_l is null or surname_r is null then -1\n when jaro_winkler_sim(surname_l, surname_r) >= 1.0 then 2\n when jaro_winkler_sim(surname_l, surname_r) >= 0.88 then 1\n else 0 end as gamma_surname", + "m_probabilities": [ + 0.39129544025833124, + 0.12030161113062919, + 0.48840294861103967 + ], + "u_probabilities": [ + 0.9926894735506957, + 0.002265232513862262, + 0.00504529393544206 + ], + "tf_adjustment_weights": [ + 0.0, + 0.0, + 1.0 + ] + }, + { + "col_name": "dob", + "gamma_index": 2, + "num_levels": 2, + "data_type": "string", + "term_frequency_adjustments": false, + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when dob_l is null or dob_r is null then -1\n when dob_l = dob_r then 1\n else 0 end as gamma_dob", + "m_probabilities": [ + 0.3155452170681331, + 0.6844547829318668 + ], + 
"u_probabilities": [ + 0.999744431865398, + 0.0002555681346020589 + ], + "tf_adjustment_weights": [ + 0.0, + 1.0 + ] + }, + { + "col_name": "city", + "term_frequency_adjustments": true, + "gamma_index": 3, + "num_levels": 2, + "data_type": "string", + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when city_l is null or city_r is null then -1\n when city_l = city_r then 1\n else 0 end as gamma_city", + "m_probabilities": [ + 0.28847678300721113, + 0.711523216992789 + ], + "u_probabilities": [ + 0.9101196115782743, + 0.08988038842172567 + ], + "tf_adjustment_weights": [ + 0.0, + 1.0 + ] + }, + { + "col_name": "email", + "gamma_index": 4, + "num_levels": 2, + "data_type": "string", + "term_frequency_adjustments": false, + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when email_l is null or email_r is null then -1\n when email_l = email_r then 1\n else 0 end as gamma_email", + "m_probabilities": [ + 0.26345589294588023, + 0.7365441070541197 + ], + "u_probabilities": [ + 0.999752381434188, + 0.00024761856581195573 + ], + "tf_adjustment_weights": [ + 0.0, + 1.0 + ] + } + ], + "additional_columns_to_retain": [ + "group" + ], + "em_convergence": 0.01, + "source_dataset_column_name": "source_dataset", + "unique_id_column_name": "unique_id", + "retain_matching_columns": true, + "retain_intermediate_calculation_columns": false, + "max_iterations": 25, + "proportion_of_matches": 0.0050396660717059545 + }, + "historical_settings_dicts": [ + { + "link_type": "dedupe_only", + "blocking_rules": [], + "comparison_columns": [ + { + "col_name": "first_name", + "num_levels": 3, + "term_frequency_adjustments": true, + "gamma_index": 0, + "data_type": "string", + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when first_name_l is null or first_name_r is null then -1\n when jaro_winkler_sim(first_name_l, first_name_r) >= 1.0 then 2\n when 
jaro_winkler_sim(first_name_l, first_name_r) >= 0.88 then 1\n else 0 end as gamma_first_name", + "m_probabilities": [ + 0.1, + 0.2, + 0.7 + ], + "u_probabilities": [ + 0.7000000000000001, + 0.20000000000000004, + 0.10000000000000002 + ], + "tf_adjustment_weights": [ + 0.0, + 0.0, + 1.0 + ] + }, + { + "col_name": "surname", + "num_levels": 3, + "term_frequency_adjustments": true, + "gamma_index": 1, + "data_type": "string", + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when surname_l is null or surname_r is null then -1\n when jaro_winkler_sim(surname_l, surname_r) >= 1.0 then 2\n when jaro_winkler_sim(surname_l, surname_r) >= 0.88 then 1\n else 0 end as gamma_surname", + "m_probabilities": [ + 0.1, + 0.2, + 0.7 + ], + "u_probabilities": [ + 0.7000000000000001, + 0.20000000000000004, + 0.10000000000000002 + ], + "tf_adjustment_weights": [ + 0.0, + 0.0, + 1.0 + ] + }, + { + "col_name": "dob", + "gamma_index": 2, + "num_levels": 2, + "data_type": "string", + "term_frequency_adjustments": false, + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when dob_l is null or dob_r is null then -1\n when dob_l = dob_r then 1\n else 0 end as gamma_dob", + "m_probabilities": [ + 0.1, + 0.9 + ], + "u_probabilities": [ + 0.9, + 0.1 + ], + "tf_adjustment_weights": [ + 0.0, + 1.0 + ] + }, + { + "col_name": "city", + "term_frequency_adjustments": true, + "gamma_index": 3, + "num_levels": 2, + "data_type": "string", + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when city_l is null or city_r is null then -1\n when city_l = city_r then 1\n else 0 end as gamma_city", + "m_probabilities": [ + 0.1, + 0.9 + ], + "u_probabilities": [ + 0.9, + 0.1 + ], + "tf_adjustment_weights": [ + 0.0, + 1.0 + ] + }, + { + "col_name": "email", + "gamma_index": 4, + "num_levels": 2, + "data_type": "string", + "term_frequency_adjustments": false, + "fix_u_probabilities": 
false, + "fix_m_probabilities": false, + "case_expression": "case\n when email_l is null or email_r is null then -1\n when email_l = email_r then 1\n else 0 end as gamma_email", + "m_probabilities": [ + 0.1, + 0.9 + ], + "u_probabilities": [ + 0.9, + 0.1 + ], + "tf_adjustment_weights": [ + 0.0, + 1.0 + ] + } + ], + "additional_columns_to_retain": [ + "group" + ], + "em_convergence": 0.01, + "source_dataset_column_name": "source_dataset", + "unique_id_column_name": "unique_id", + "retain_matching_columns": true, + "retain_intermediate_calculation_columns": false, + "max_iterations": 25, + "proportion_of_matches": 0.3 + }, + { + "link_type": "dedupe_only", + "blocking_rules": [], + "comparison_columns": [ + { + "col_name": "first_name", + "num_levels": 3, + "term_frequency_adjustments": true, + "gamma_index": 0, + "data_type": "string", + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when first_name_l is null or first_name_r is null then -1\n when jaro_winkler_sim(first_name_l, first_name_r) >= 1.0 then 2\n when jaro_winkler_sim(first_name_l, first_name_r) >= 0.88 then 1\n else 0 end as gamma_first_name", + "m_probabilities": [ + 0.3186934546499692, + 0.12840032519139744, + 0.5529062201586333 + ], + "u_probabilities": [ + 0.9939759786854048, + 0.0030314329671755963, + 0.002992588347419581 + ], + "tf_adjustment_weights": [ + 0.0, + 0.0, + 1.0 + ] + }, + { + "col_name": "surname", + "num_levels": 3, + "term_frequency_adjustments": true, + "gamma_index": 1, + "data_type": "string", + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when surname_l is null or surname_r is null then -1\n when jaro_winkler_sim(surname_l, surname_r) >= 1.0 then 2\n when jaro_winkler_sim(surname_l, surname_r) >= 0.88 then 1\n else 0 end as gamma_surname", + "m_probabilities": [ + 0.3198128508055946, + 0.1012872814376532, + 0.5788998677567522 + ], + "u_probabilities": [ + 0.9928755085862229, + 
0.0023850489513080227, + 0.004739442462469059 + ], + "tf_adjustment_weights": [ + 0.0, + 0.0, + 1.0 + ] + }, + { + "col_name": "dob", + "gamma_index": 2, + "num_levels": 2, + "data_type": "string", + "term_frequency_adjustments": false, + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when dob_l is null or dob_r is null then -1\n when dob_l = dob_r then 1\n else 0 end as gamma_dob", + "m_probabilities": [ + 0.4458241699110161, + 0.5541758300889841 + ], + "u_probabilities": [ + 0.9991201294011005, + 0.0008798705988996084 + ], + "tf_adjustment_weights": [ + 0.0, + 1.0 + ] + }, + { + "col_name": "city", + "term_frequency_adjustments": true, + "gamma_index": 3, + "num_levels": 2, + "data_type": "string", + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when city_l is null or city_r is null then -1\n when city_l = city_r then 1\n else 0 end as gamma_city", + "m_probabilities": [ + 0.23947578325976432, + 0.7605242167402356 + ], + "u_probabilities": [ + 0.9103126147950835, + 0.08968738520491662 + ], + "tf_adjustment_weights": [ + 0.0, + 1.0 + ] + }, + { + "col_name": "email", + "gamma_index": 4, + "num_levels": 2, + "data_type": "string", + "term_frequency_adjustments": false, + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when email_l is null or email_r is null then -1\n when email_l = email_r then 1\n else 0 end as gamma_email", + "m_probabilities": [ + 0.33765358450383004, + 0.6623464154961699 + ], + "u_probabilities": [ + 0.9990162350531908, + 0.000983764946809028 + ], + "tf_adjustment_weights": [ + 0.0, + 1.0 + ] + } + ], + "additional_columns_to_retain": [ + "group" + ], + "em_convergence": 0.01, + "source_dataset_column_name": "source_dataset", + "unique_id_column_name": "unique_id", + "retain_matching_columns": true, + "retain_intermediate_calculation_columns": false, + "max_iterations": 25, + "proportion_of_matches": 
0.0051036575568089455 + }, + { + "link_type": "dedupe_only", + "blocking_rules": [], + "comparison_columns": [ + { + "col_name": "first_name", + "num_levels": 3, + "term_frequency_adjustments": true, + "gamma_index": 0, + "data_type": "string", + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when first_name_l is null or first_name_r is null then -1\n when jaro_winkler_sim(first_name_l, first_name_r) >= 1.0 then 2\n when jaro_winkler_sim(first_name_l, first_name_r) >= 0.88 then 1\n else 0 end as gamma_first_name", + "m_probabilities": [ + 0.35487565573699936, + 0.15688609023387654, + 0.4882382540291243 + ], + "u_probabilities": [ + 0.9939453208371932, + 0.0028636761577645457, + 0.003191003005042176 + ], + "tf_adjustment_weights": [ + 0.0, + 0.0, + 1.0 + ] + }, + { + "col_name": "surname", + "num_levels": 3, + "term_frequency_adjustments": true, + "gamma_index": 1, + "data_type": "string", + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when surname_l is null or surname_r is null then -1\n when jaro_winkler_sim(surname_l, surname_r) >= 1.0 then 2\n when jaro_winkler_sim(surname_l, surname_r) >= 0.88 then 1\n else 0 end as gamma_surname", + "m_probabilities": [ + 0.36256020556857577, + 0.12485690828706437, + 0.5125828861443598 + ], + "u_probabilities": [ + 0.9927532940008038, + 0.002257975949059752, + 0.004988730050136418 + ], + "tf_adjustment_weights": [ + 0.0, + 0.0, + 1.0 + ] + }, + { + "col_name": "dob", + "gamma_index": 2, + "num_levels": 2, + "data_type": "string", + "term_frequency_adjustments": false, + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when dob_l is null or dob_r is null then -1\n when dob_l = dob_r then 1\n else 0 end as gamma_dob", + "m_probabilities": [ + 0.3279820237915454, + 0.6720179762084547 + ], + "u_probabilities": [ + 0.9996008268999741, + 0.0003991731000261967 + ], + "tf_adjustment_weights": [ + 0.0, + 
1.0 + ] + }, + { + "col_name": "city", + "term_frequency_adjustments": true, + "gamma_index": 3, + "num_levels": 2, + "data_type": "string", + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when city_l is null or city_r is null then -1\n when city_l = city_r then 1\n else 0 end as gamma_city", + "m_probabilities": [ + 0.28081937910577376, + 0.7191806208942262 + ], + "u_probabilities": [ + 0.9100766010176691, + 0.08992339898233102 + ], + "tf_adjustment_weights": [ + 0.0, + 1.0 + ] + }, + { + "col_name": "email", + "gamma_index": 4, + "num_levels": 2, + "data_type": "string", + "term_frequency_adjustments": false, + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when email_l is null or email_r is null then -1\n when email_l = email_r then 1\n else 0 end as gamma_email", + "m_probabilities": [ + 0.27396677650624485, + 0.7260332234937552 + ], + "u_probabilities": [ + 0.9996263656879927, + 0.0003736343120073876 + ], + "tf_adjustment_weights": [ + 0.0, + 1.0 + ] + } + ], + "additional_columns_to_retain": [ + "group" + ], + "em_convergence": 0.01, + "source_dataset_column_name": "source_dataset", + "unique_id_column_name": "unique_id", + "retain_matching_columns": true, + "retain_intermediate_calculation_columns": false, + "max_iterations": 25, + "proportion_of_matches": 0.004920247301570404 + }, + { + "link_type": "dedupe_only", + "blocking_rules": [], + "comparison_columns": [ + { + "col_name": "first_name", + "num_levels": 3, + "term_frequency_adjustments": true, + "gamma_index": 0, + "data_type": "string", + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when first_name_l is null or first_name_r is null then -1\n when jaro_winkler_sim(first_name_l, first_name_r) >= 1.0 then 2\n when jaro_winkler_sim(first_name_l, first_name_r) >= 0.88 then 1\n else 0 end as gamma_first_name", + "m_probabilities": [ + 0.3738161602012986, + 
0.15437008371167008, + 0.47181375608703136 + ], + "u_probabilities": [ + 0.9938954272411262, + 0.0028654619727581858, + 0.0032391107861156807 + ], + "tf_adjustment_weights": [ + 0.0, + 0.0, + 1.0 + ] + }, + { + "col_name": "surname", + "num_levels": 3, + "term_frequency_adjustments": true, + "gamma_index": 1, + "data_type": "string", + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when surname_l is null or surname_r is null then -1\n when jaro_winkler_sim(surname_l, surname_r) >= 1.0 then 2\n when jaro_winkler_sim(surname_l, surname_r) >= 0.88 then 1\n else 0 end as gamma_surname", + "m_probabilities": [ + 0.38219314199949384, + 0.12219952769923849, + 0.49560733030126775 + ], + "u_probabilities": [ + 0.9926988136057959, + 0.0022628506063210436, + 0.005038335787882892 + ], + "tf_adjustment_weights": [ + 0.0, + 0.0, + 1.0 + ] + }, + { + "col_name": "dob", + "gamma_index": 2, + "num_levels": 2, + "data_type": "string", + "term_frequency_adjustments": false, + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when dob_l is null or dob_r is null then -1\n when dob_l = dob_r then 1\n else 0 end as gamma_dob", + "m_probabilities": [ + 0.3153843236207835, + 0.6846156763792166 + ], + "u_probabilities": [ + 0.9997049272927501, + 0.0002950727072498611 + ], + "tf_adjustment_weights": [ + 0.0, + 1.0 + ] + }, + { + "col_name": "city", + "term_frequency_adjustments": true, + "gamma_index": 3, + "num_levels": 2, + "data_type": "string", + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when city_l is null or city_r is null then -1\n when city_l = city_r then 1\n else 0 end as gamma_city", + "m_probabilities": [ + 0.2858843246576875, + 0.7141156753423125 + ], + "u_probabilities": [ + 0.9100965728213073, + 0.08990342717869258 + ], + "tf_adjustment_weights": [ + 0.0, + 1.0 + ] + }, + { + "col_name": "email", + "gamma_index": 4, + "num_levels": 2, + 
"data_type": "string", + "term_frequency_adjustments": false, + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when email_l is null or email_r is null then -1\n when email_l = email_r then 1\n else 0 end as gamma_email", + "m_probabilities": [ + 0.2631134529996605, + 0.7368865470003393 + ], + "u_probabilities": [ + 0.9997170170485102, + 0.0002829829514897332 + ], + "tf_adjustment_weights": [ + 0.0, + 1.0 + ] + } + ], + "additional_columns_to_retain": [ + "group" + ], + "em_convergence": 0.01, + "source_dataset_column_name": "source_dataset", + "unique_id_column_name": "unique_id", + "retain_matching_columns": true, + "retain_intermediate_calculation_columns": false, + "max_iterations": 25, + "proportion_of_matches": 0.004981043940754473 + }, + { + "link_type": "dedupe_only", + "blocking_rules": [], + "comparison_columns": [ + { + "col_name": "first_name", + "num_levels": 3, + "term_frequency_adjustments": true, + "gamma_index": 0, + "data_type": "string", + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when first_name_l is null or first_name_r is null then -1\n when jaro_winkler_sim(first_name_l, first_name_r) >= 1.0 then 2\n when jaro_winkler_sim(first_name_l, first_name_r) >= 0.88 then 1\n else 0 end as gamma_first_name", + "m_probabilities": [ + 0.38295871197174913, + 0.15226716735596263, + 0.4647741206722884 + ], + "u_probabilities": [ + 0.99388666286368, + 0.0028669412982625627, + 0.0032463958380572937 + ], + "tf_adjustment_weights": [ + 0.0, + 0.0, + 1.0 + ] + }, + { + "col_name": "surname", + "num_levels": 3, + "term_frequency_adjustments": true, + "gamma_index": 1, + "data_type": "string", + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when surname_l is null or surname_r is null then -1\n when jaro_winkler_sim(surname_l, surname_r) >= 1.0 then 2\n when jaro_winkler_sim(surname_l, surname_r) >= 0.88 then 1\n else 0 end as 
gamma_surname", + "m_probabilities": [ + 0.39129544025833124, + 0.12030161113062919, + 0.48840294861103967 + ], + "u_probabilities": [ + 0.9926894735506957, + 0.002265232513862262, + 0.00504529393544206 + ], + "tf_adjustment_weights": [ + 0.0, + 0.0, + 1.0 + ] + }, + { + "col_name": "dob", + "gamma_index": 2, + "num_levels": 2, + "data_type": "string", + "term_frequency_adjustments": false, + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when dob_l is null or dob_r is null then -1\n when dob_l = dob_r then 1\n else 0 end as gamma_dob", + "m_probabilities": [ + 0.3155452170681331, + 0.6844547829318668 + ], + "u_probabilities": [ + 0.999744431865398, + 0.0002555681346020589 + ], + "tf_adjustment_weights": [ + 0.0, + 1.0 + ] + }, + { + "col_name": "city", + "term_frequency_adjustments": true, + "gamma_index": 3, + "num_levels": 2, + "data_type": "string", + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when city_l is null or city_r is null then -1\n when city_l = city_r then 1\n else 0 end as gamma_city", + "m_probabilities": [ + 0.28847678300721113, + 0.711523216992789 + ], + "u_probabilities": [ + 0.9101196115782743, + 0.08988038842172567 + ], + "tf_adjustment_weights": [ + 0.0, + 1.0 + ] + }, + { + "col_name": "email", + "gamma_index": 4, + "num_levels": 2, + "data_type": "string", + "term_frequency_adjustments": false, + "fix_u_probabilities": false, + "fix_m_probabilities": false, + "case_expression": "case\n when email_l is null or email_r is null then -1\n when email_l = email_r then 1\n else 0 end as gamma_email", + "m_probabilities": [ + 0.26345589294588023, + 0.7365441070541197 + ], + "u_probabilities": [ + 0.999752381434188, + 0.00024761856581195573 + ], + "tf_adjustment_weights": [ + 0.0, + 1.0 + ] + } + ], + "additional_columns_to_retain": [ + "group" + ], + "em_convergence": 0.01, + "source_dataset_column_name": "source_dataset", + "unique_id_column_name": 
"unique_id", + "retain_matching_columns": true, + "retain_intermediate_calculation_columns": false, + "max_iterations": 25, + "proportion_of_matches": 0.0050396660717059545 + } + ], + "original_settings_dict": { + "link_type": "dedupe_only", + "blocking_rules": [], + "comparison_columns": [ + { + "col_name": "first_name", + "num_levels": 3, + "term_frequency_adjustments": true + }, + { + "col_name": "surname", + "num_levels": 3, + "term_frequency_adjustments": true + }, + { + "col_name": "dob" + }, + { + "col_name": "city", + "term_frequency_adjustments": true + }, + { + "col_name": "email" + } + ], + "additional_columns_to_retain": [ + "group" + ], + "em_convergence": 0.01 + }, + "iteration": 4 +} \ No newline at end of file diff --git a/demos/data/febrl/source.txt b/demos/data/febrl/source.txt new file mode 100644 index 0000000000..73e2d951c6 --- /dev/null +++ b/demos/data/febrl/source.txt @@ -0,0 +1 @@ +febrl datasets from https://recordlinkage.readthedocs.io/en/latest/ref-datasets.html ref A.2 https://arxiv.org/pdf/2008.04443.pdf \ No newline at end of file diff --git a/demos/demo_settings/real_time_settings.json b/demos/demo_settings/real_time_settings.json new file mode 100644 index 0000000000..6dcb15af08 --- /dev/null +++ b/demos/demo_settings/real_time_settings.json @@ -0,0 +1 @@ +{"probability_two_random_records_match": 0.02326254791526835, "link_type": "dedupe_only", "blocking_rules_to_generate_predictions": ["l.surname = r.surname", "l.first_name = r.first_name"], "comparisons": [{"output_column_name": "first_name", "comparison_levels": [{"sql_condition": "first_name_l IS NULL OR first_name_r IS NULL", "label_for_charts": "Null", "is_null_level": true}, {"sql_condition": "first_name_l = first_name_r", "label_for_charts": "exact_match", "m_probability": 0.5073501669215337, "u_probability": 0.0057935713975033705, "tf_adjustment_column": "first_name", "tf_adjustment_weight": 1.0}, {"sql_condition": "levenshtein(first_name_l, first_name_r) <= 2", 
"label_for_charts": "Levenstein <= 2", "m_probability": 0.27736434159157797, "u_probability": 0.010119901990634016}, {"sql_condition": "ELSE", "label_for_charts": "All other comparisons", "m_probability": 0.21528549148688833, "u_probability": 0.9840865266118626}]}, {"output_column_name": "surname", "comparison_levels": [{"sql_condition": "surname_l IS NULL OR surname_r IS NULL", "label_for_charts": "Null", "is_null_level": true}, {"sql_condition": "surname_l = surname_r", "label_for_charts": "exact_match", "m_probability": 0.4517645215191846, "u_probability": 0.004889975550122249, "tf_adjustment_column": "surname", "tf_adjustment_weight": 1.0}, {"sql_condition": "levenshtein(surname_l, surname_r) <= 2", "label_for_charts": "Levenstein <= 2", "m_probability": 0.3078165102205689, "u_probability": 0.007373772654946249}, {"sql_condition": "ELSE", "label_for_charts": "All other comparisons", "m_probability": 0.24041896826024636, "u_probability": 0.9877362517949315}]}, {"output_column_name": "dob", "comparison_levels": [{"sql_condition": "dob_l IS NULL OR dob_r IS NULL", "label_for_charts": "Null", "is_null_level": true}, {"sql_condition": "dob_l = dob_r", "label_for_charts": "exact_match", "m_probability": 0.405530771330678, "u_probability": 0.0017477477477477479, "tf_adjustment_column": "dob", "tf_adjustment_weight": 1.0}, {"sql_condition": "levenshtein(dob_l, dob_r) <= 2", "label_for_charts": "Levenstein <= 2", "m_probability": 0.3679356056637918, "u_probability": 0.01711911911911912}, {"sql_condition": "ELSE", "label_for_charts": "All other comparisons", "m_probability": 0.22653362300553073, "u_probability": 0.9811331331331331}]}, {"output_column_name": "city", "comparison_levels": [{"sql_condition": "city_l IS NULL OR city_r IS NULL", "label_for_charts": "Null", "is_null_level": true}, {"sql_condition": "city_l = city_r", "label_for_charts": "exact_match", "m_probability": 0.5782144900964232, "u_probability": 0.0551475711801453, "tf_adjustment_column": "city", 
"tf_adjustment_weight": 1.0}, {"sql_condition": "ELSE", "label_for_charts": "All other comparisons", "m_probability": 0.4217855099035769, "u_probability": 0.9448524288198547}]}, {"output_column_name": "email", "comparison_levels": [{"sql_condition": "email_l IS NULL OR email_r IS NULL", "label_for_charts": "Null", "is_null_level": true}, {"sql_condition": "email_l = email_r", "label_for_charts": "exact_match", "m_probability": 0.5774909200578013, "u_probability": 0.0021938713143283602, "tf_adjustment_column": "email", "tf_adjustment_weight": 1.0}, {"sql_condition": "ELSE", "label_for_charts": "All other comparisons", "m_probability": 0.42250907994219916, "u_probability": 0.9978061286856716}]}], "retain_matching_columns": true, "retain_intermediate_calculation_columns": true, "max_iterations": 20} \ No newline at end of file diff --git a/demos/demo_settings/saved_model_from_demo.json b/demos/demo_settings/saved_model_from_demo.json new file mode 100644 index 0000000000..5e750aa9ec --- /dev/null +++ b/demos/demo_settings/saved_model_from_demo.json @@ -0,0 +1,210 @@ +{ + "link_type": "dedupe_only", + "probability_two_random_records_match": 0.00298012298012298, + "retain_matching_columns": true, + "retain_intermediate_calculation_columns": true, + "additional_columns_to_retain": [], + "sql_dialect": "duckdb", + "linker_uid": "gael41sp", + "em_convergence": 0.0001, + "max_iterations": 25, + "bayes_factor_column_prefix": "bf_", + "term_frequency_adjustment_column_prefix": "tf_", + "comparison_vector_value_column_prefix": "gamma_", + "unique_id_column_name": "unique_id", + "source_dataset_column_name": "source_dataset", + "blocking_rules_to_generate_predictions": [ + { + "blocking_rule": "(l.\"first_name\" = r.\"first_name\") AND (l.\"city\" = r.\"city\")", + "sql_dialect": "duckdb" + }, + { + "blocking_rule": "l.\"surname\" = r.\"surname\"", + "sql_dialect": "duckdb" + } + ], + "comparisons": [ + { + "output_column_name": "first_name", + "comparison_levels": [ + { + 
"sql_condition": "\"first_name_l\" IS NULL OR \"first_name_r\" IS NULL", + "label_for_charts": "first_name is NULL", + "is_null_level": true + }, + { + "sql_condition": "\"first_name_l\" = \"first_name_r\"", + "label_for_charts": "Exact match on first_name", + "m_probability": 0.49142094931763786, + "u_probability": 0.0057935713975033705, + "tf_adjustment_column": "first_name", + "tf_adjustment_weight": 1.0 + }, + { + "sql_condition": "jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.92", + "label_for_charts": "Jaro-Winkler distance of first_name >= 0.92", + "m_probability": 0.15176057384758357, + "u_probability": 0.0023429457903817435 + }, + { + "sql_condition": "jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.88", + "label_for_charts": "Jaro-Winkler distance of first_name >= 0.88", + "m_probability": 0.07406496776118936, + "u_probability": 0.0015484319951285285 + }, + { + "sql_condition": "jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.7", + "label_for_charts": "Jaro-Winkler distance of first_name >= 0.7", + "m_probability": 0.07908610771504865, + "u_probability": 0.018934945558406913 + }, + { + "sql_condition": "ELSE", + "label_for_charts": "All other comparisons", + "m_probability": 0.20366740135854072, + "u_probability": 0.9713801052585794 + } + ], + "comparison_description": "NameComparison" + }, + { + "output_column_name": "surname", + "comparison_levels": [ + { + "sql_condition": "\"surname_l\" IS NULL OR \"surname_r\" IS NULL", + "label_for_charts": "surname is NULL", + "is_null_level": true + }, + { + "sql_condition": "\"surname_l\" = \"surname_r\"", + "label_for_charts": "Exact match on surname", + "m_probability": 0.43457460622893745, + "u_probability": 0.004889975550122249, + "tf_adjustment_column": "surname", + "tf_adjustment_weight": 1.0 + }, + { + "sql_condition": "jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.92", + "label_for_charts": "Jaro-Winkler distance of surname >= 0.92", + 
"m_probability": 0.22529103510053106, + "u_probability": 0.00283905173880724 + }, + { + "sql_condition": "jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.88", + "label_for_charts": "Jaro-Winkler distance of surname >= 0.88", + "m_probability": 0.044322214569398714, + "u_probability": 0.0011314412292407403 + }, + { + "sql_condition": "jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.7", + "label_for_charts": "Jaro-Winkler distance of surname >= 0.7", + "m_probability": 0.0778408787095487, + "u_probability": 0.014729633311540402 + }, + { + "sql_condition": "ELSE", + "label_for_charts": "All other comparisons", + "m_probability": 0.21797126539158398, + "u_probability": 0.9764098981702893 + } + ], + "comparison_description": "NameComparison" + }, + { + "output_column_name": "dob", + "comparison_levels": [ + { + "sql_condition": "\"dob_l\" IS NULL OR \"dob_r\" IS NULL", + "label_for_charts": "dob is NULL", + "is_null_level": true + }, + { + "sql_condition": "\"dob_l\" = \"dob_r\"", + "label_for_charts": "Exact match on dob", + "m_probability": 0.39142166528829947, + "u_probability": 0.0017477477477477479 + }, + { + "sql_condition": "levenshtein(\"dob_l\", \"dob_r\") <= 1", + "label_for_charts": "Levenshtein distance of dob <= 1", + "m_probability": 0.14937817941895076, + "u_probability": 0.0016016016016016017 + }, + { + "sql_condition": "ELSE", + "label_for_charts": "All other comparisons", + "m_probability": 0.4592001552927498, + "u_probability": 0.9966506506506506 + } + ], + "comparison_description": "LevenshteinAtThresholds" + }, + { + "output_column_name": "city", + "comparison_levels": [ + { + "sql_condition": "\"city_l\" IS NULL OR \"city_r\" IS NULL", + "label_for_charts": "city is NULL", + "is_null_level": true + }, + { + "sql_condition": "\"city_l\" = \"city_r\"", + "label_for_charts": "Exact match on city", + "m_probability": 0.5625747223574914, + "u_probability": 0.0551475711801453, + "tf_adjustment_column": "city", + 
"tf_adjustment_weight": 1.0 + }, + { + "sql_condition": "ELSE", + "label_for_charts": "All other comparisons", + "m_probability": 0.43742527764250866, + "u_probability": 0.9448524288198547 + } + ], + "comparison_description": "ExactMatch" + }, + { + "output_column_name": "email", + "comparison_levels": [ + { + "sql_condition": "\"email_l\" IS NULL OR \"email_r\" IS NULL", + "label_for_charts": "email is NULL", + "is_null_level": true + }, + { + "sql_condition": "\"email_l\" = \"email_r\"", + "label_for_charts": "Exact match on email", + "m_probability": 0.5529665836585227, + "u_probability": 0.0021938713143283602, + "tf_adjustment_column": "\"email\"", + "tf_adjustment_weight": 1.0 + }, + { + "sql_condition": "NULLIF(regexp_extract(\"email_l\", '^[^@]+', 0), '') = NULLIF(regexp_extract(\"email_r\", '^[^@]+', 0), '')", + "label_for_charts": "Exact match on username", + "m_probability": 0.2208741262673715, + "u_probability": 0.0010390328952024346 + }, + { + "sql_condition": "jaro_winkler_similarity(\"email_l\", \"email_r\") >= 0.88", + "label_for_charts": "Jaro-Winkler distance of email >= 0.88", + "m_probability": 0.21412999464826887, + "u_probability": 0.0009135769109519858 + }, + { + "sql_condition": "jaro_winkler_similarity(NULLIF(regexp_extract(\"email_l\", '^[^@]+', 0), ''), NULLIF(regexp_extract(\"email_r\", '^[^@]+', 0), '')) >= 0.88", + "label_for_charts": "Jaro-Winkler >0.88 on username", + "u_probability": 0.000501823937001795 + }, + { + "sql_condition": "ELSE", + "label_for_charts": "All other comparisons", + "m_probability": 0.01202929542583697, + "u_probability": 0.9953516949425154 + } + ], + "comparison_description": "EmailComparison" + } + ] +} \ No newline at end of file diff --git a/demos/examples/athena/dashboards/50k_cluster.html b/demos/examples/athena/dashboards/50k_cluster.html new file mode 100644 index 0000000000..0781d500d2 --- /dev/null +++ b/demos/examples/athena/dashboards/50k_cluster.html @@ -0,0 +1,9513 @@ + + + + + +Splink cluster 
studio + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +

Splink cluster studio

+ +
+
+ +
+
+
+ +
+
+ + +
+
+
+
+ +
+
+
+ + +
+
+
+ +
+ + + + + +
+
+
+ + + + + +
+ + + + + + diff --git a/demos/examples/athena/deduplicate_50k_synthetic.html b/demos/examples/athena/deduplicate_50k_synthetic.html new file mode 100644 index 0000000000..3b0c7a81f8 --- /dev/null +++ b/demos/examples/athena/deduplicate_50k_synthetic.html @@ -0,0 +1,6025 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Deduplicate 50k rows historical persons - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Deduplicate 50k rows historical persons

+ +

Linking a dataset of real historical persons

+


In this example, we deduplicate a more realistic dataset. The data is based on historical persons scraped from wikidata. Duplicate records have been introduced, containing a variety of errors.

+

Create a boto3 session to be used within the linker

+
import boto3
+
+boto3_session = boto3.Session(region_name="eu-west-1")
+
+

AthenaLinker Setup

+

To work nicely with Athena, you need to specify the various filepaths, buckets and database(s) you wish to interact with.

+
+ +

The AthenaLinker has three required inputs: +* input_table_or_tables - the input table to use for linking. This can be either a table in a database or a pandas dataframe. +* output_database - the database to which all of your Splink tables will be output. +* output_bucket - the s3 bucket to which any parquet files produced by Splink will be written.

+

and two optional inputs: +* output_filepath - the s3 filepath to output files to. This is an extension of output_bucket and dictates the full filepath your files will be written to. +* input_table_aliases - the name to give your table within the database, should you choose to use a pandas dataframe as an input.

+
# Set the output bucket and the additional filepath to write outputs to
+############################################
+# EDIT THESE BEFORE ATTEMPTING TO RUN THIS #
+############################################
+
+from splink.backends.athena import AthenaAPI
+
+
+bucket = "MYTESTBUCKET"
+database = "MYTESTDATABASE"
+filepath = "MYTESTFILEPATH"  # file path inside of your bucket
+
+aws_filepath = f"s3://{bucket}/{filepath}"
+db_api = AthenaAPI(
+    boto3_session,
+    output_bucket=bucket,
+    output_database=database,
+    output_filepath=filepath,
+)
+
+
import splink.comparison_library as cl
+from splink import block_on
+
+from splink import Linker, SettingsCreator, splink_datasets
+
+df = splink_datasets.historical_50k
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name", "surname"),
+        block_on("surname", "dob"),
+    ],
+    comparisons=[
+        cl.ExactMatch("first_name").configure(term_frequency_adjustments=True),
+        cl.LevenshteinAtThresholds("surname", [1, 3]),
+        cl.LevenshteinAtThresholds("dob", [1, 2]),
+        cl.LevenshteinAtThresholds("postcode_fake", [1, 2]),
+        cl.ExactMatch("birth_place").configure(term_frequency_adjustments=True),
+        cl.ExactMatch("occupation").configure(term_frequency_adjustments=True),
+    ],
+    retain_intermediate_calculation_columns=True,
+)
+
+
from splink.exploratory import profile_columns
+
+profile_columns(df, db_api, column_expressions=["first_name", "substr(surname,1,2)"])
+
+ +
+ + +
from splink.blocking_analysis import (
+    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
+)
+from splink import block_on
+
+cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
+    table_or_tables=df,
+    db_api=db_api,
+    blocking_rules=[block_on("first_name", "surname"), block_on("surname", "dob")],
+    link_type="dedupe_only",
+)
+
+ +
+ + +
import splink.comparison_library as cl
+
+
+from splink import Linker, SettingsCreator
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name", "surname"),
+        block_on("surname", "dob"),
+    ],
+    comparisons=[
+        cl.ExactMatch("first_name").configure(term_frequency_adjustments=True),
+        cl.LevenshteinAtThresholds("surname", [1, 3]),
+        cl.LevenshteinAtThresholds("dob", [1, 2]),
+        cl.LevenshteinAtThresholds("postcode_fake", [1, 2]),
+        cl.ExactMatch("birth_place").configure(term_frequency_adjustments=True),
+        cl.ExactMatch("occupation").configure(term_frequency_adjustments=True),
+    ],
+    retain_intermediate_calculation_columns=True,
+)
+
+linker = Linker(df, settings, db_api=db_api)
+
+
linker.training.estimate_probability_two_random_records_match(
+    [
+        block_on("first_name", "surname", "dob"),
+        block_on("substr(first_name,1,2)", "surname", "substr(postcode_fake, 1,2)"),
+        block_on("dob", "postcode_fake"),
+    ],
+    recall=0.6,
+)
+
+
Probability two random records match is estimated to be  0.000136.
+This means that amongst all possible pairwise record comparisons, one in 7,362.31 are expected to match.  With 1,279,041,753 total possible comparisons, we expect a total of around 173,728.33 matching pairs
+
+
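As a quick arithmetic check on the log output above: for a `dedupe_only` job the number of possible pairwise comparisons is n(n-1)/2, and multiplying that by the estimated prior recovers the expected number of matching pairs. This is only a sketch; the row count used below is inferred from the quoted comparison total rather than taken from the source.

```python
# Sanity-check the figures quoted in the training log above.
# Assumption: splink_datasets.historical_50k has 50,578 rows (inferred
# from the quoted comparison count, not stated in the log itself).
n_rows = 50_578

# For dedupe_only, every unordered pair of rows is a candidate comparison.
total_comparisons = n_rows * (n_rows - 1) // 2
print(f"{total_comparisons:,}")  # 1,279,041,753

# Multiplying by the estimated prior (one in 7,362.31) gives the
# expected number of matching pairs.
p_match = 1 / 7_362.31
print(f"{total_comparisons * p_match:,.2f}")  # ~173,728 matching pairs
```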
linker.training.estimate_u_using_random_sampling(max_pairs=5e6)
+
+
----- Estimating u probabilities using random sampling -----
+
+Estimated u probabilities using random sampling
+
+Your model is not yet fully trained. Missing estimates for:
+    - first_name (no m values are trained).
+    - surname (no m values are trained).
+    - dob (no m values are trained).
+    - postcode_fake (no m values are trained).
+    - birth_place (no m values are trained).
+    - occupation (no m values are trained).
+
+
blocking_rule = block_on("first_name", "surname")
+training_session_names = (
+    linker.training.estimate_parameters_using_expectation_maximisation(blocking_rule)
+)
+
+
----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+(l."first_name" = r."first_name") AND (l."surname" = r."surname")
+
+Parameter estimates will be made for the following comparison(s):
+    - dob
+    - postcode_fake
+    - birth_place
+    - occupation
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - first_name
+    - surname
+
+Iteration 1: Largest change in params was -0.526 in probability_two_random_records_match
+Iteration 2: Largest change in params was -0.0321 in probability_two_random_records_match
+Iteration 3: Largest change in params was 0.0109 in the m_probability of birth_place, level `Exact match on birth_place`
+Iteration 4: Largest change in params was -0.00341 in the m_probability of birth_place, level `All other comparisons`
+Iteration 5: Largest change in params was -0.00116 in the m_probability of dob, level `All other comparisons`
+Iteration 6: Largest change in params was -0.000547 in the m_probability of dob, level `All other comparisons`
+Iteration 7: Largest change in params was -0.00029 in the m_probability of dob, level `All other comparisons`
+Iteration 8: Largest change in params was -0.000169 in the m_probability of dob, level `All other comparisons`
+Iteration 9: Largest change in params was -0.000105 in the m_probability of dob, level `All other comparisons`
+Iteration 10: Largest change in params was -6.87e-05 in the m_probability of dob, level `All other comparisons`
+
+EM converged after 10 iterations
+
+Your model is not yet fully trained. Missing estimates for:
+    - first_name (no m values are trained).
+    - surname (no m values are trained).
+
+
blocking_rule = block_on("dob")
+training_session_dob = (
+    linker.training.estimate_parameters_using_expectation_maximisation(blocking_rule)
+)
+
+
----- Starting EM training session -----
+
+
+
+Estimating the m probabilities of the model by blocking on:
+l."dob" = r."dob"
+
+Parameter estimates will be made for the following comparison(s):
+    - first_name
+    - surname
+    - postcode_fake
+    - birth_place
+    - occupation
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - dob
+
+Iteration 1: Largest change in params was -0.355 in the m_probability of first_name, level `Exact match on first_name`
+Iteration 2: Largest change in params was -0.0383 in the m_probability of first_name, level `Exact match on first_name`
+Iteration 3: Largest change in params was 0.00531 in the m_probability of postcode_fake, level `All other comparisons`
+Iteration 4: Largest change in params was 0.00129 in the m_probability of postcode_fake, level `All other comparisons`
+Iteration 5: Largest change in params was 0.00034 in the m_probability of surname, level `All other comparisons`
+Iteration 6: Largest change in params was 8.9e-05 in the m_probability of surname, level `All other comparisons`
+
+EM converged after 6 iterations
+
+Your model is fully trained. All comparisons have at least one estimate for their m and u values
+
+
linker.visualisations.match_weights_chart()
+
+ +
+ + +
linker.evaluation.unlinkables_chart()
+
+ +
+ + +
df_predict = linker.inference.predict()
+df_e = df_predict.as_pandas_dataframe(limit=5)
+df_e
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
match_weightmatch_probabilityunique_id_lunique_id_rfirst_name_lfirst_name_rgamma_first_nametf_first_name_ltf_first_name_rbf_first_name...bf_birth_placebf_tf_adj_birth_placeoccupation_loccupation_rgamma_occupationtf_occupation_ltf_occupation_rbf_occupationbf_tf_adj_occupationmatch_key
027.1494931.000000Q2296770-1Q2296770-12thomasrhomas00.0286670.0000590.455194...160.7139334.179108politicianpolitician10.0889320.08893222.9168590.4412731
11.6272420.755454Q2296770-1Q2296770-15thomasclifford,00.0286670.0000200.455194...0.1545501.000000politician<NA>-10.088932NaN1.0000001.0000001
229.2065051.000000Q2296770-1Q2296770-3thomastom00.0286670.0129480.455194...160.7139334.179108politicianpolitician10.0889320.08893222.9168590.4412731
313.7830270.999929Q2296770-1Q2296770-7thomastom00.0286670.0129480.455194...0.1545501.000000politician<NA>-10.088932NaN1.0000001.0000001
429.2065051.000000Q2296770-2Q2296770-3thomastom00.0286670.0129480.455194...160.7139334.179108politicianpolitician10.0889320.08893222.9168590.4412731
+

5 rows × 38 columns

+
+ +
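An aside on reading this table: Splink's match_weight is the log2 of the overall Bayes factor, so match_probability can be recovered from it directly. A quick check, not part of the Splink API:

```python
def weight_to_probability(match_weight: float) -> float:
    # match_weight is log2 of the Bayes factor, so the corresponding
    # probability is bf / (1 + bf) where bf = 2 ** match_weight.
    bf = 2.0 ** match_weight
    return bf / (1.0 + bf)

# Row 1 above: match_weight ≈ 1.627242 gives match_probability ≈ 0.755454
print(weight_to_probability(1.627242))
```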

You can also view rows in this dataset as a waterfall chart as follows:

+
records_to_plot = df_e.to_dict(orient="records")
+linker.visualisations.waterfall_chart(records_to_plot, filter_nulls=False)
+
+ +
+ + +
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
+    df_predict, threshold_match_probability=0.95
+)
+
+
Completed iteration 1, root rows count 641
+Completed iteration 2, root rows count 187
+Completed iteration 3, root rows count 251
+Completed iteration 4, root rows count 75
+Completed iteration 5, root rows count 23
+Completed iteration 6, root rows count 30
+Completed iteration 7, root rows count 34
+Completed iteration 8, root rows count 30
+Completed iteration 9, root rows count 9
+Completed iteration 10, root rows count 5
+Completed iteration 11, root rows count 0
+
+
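Conceptually, clustering at a threshold keeps only the pairwise predictions scoring at or above the threshold and takes connected components of the resulting graph; the iteration log above reflects Splink's iterative SQL implementation of this. A minimal pure-Python sketch of the same idea (illustrative only, with made-up record ids):

```python
def connected_components(edges):
    # Minimal union-find; records with no surviving edge would appear
    # as singleton clusters in Splink's output and are omitted here.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

# Keep only pairs scoring at or above the threshold, then cluster.
scored = [("r1", "r2", 0.99), ("r2", "r3", 0.97), ("r4", "r5", 0.40)]
edges = [(a, b) for a, b, p in scored if p >= 0.95]
clusters = connected_components(edges)  # one cluster containing r1, r2 and r3
```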
linker.visualisations.cluster_studio_dashboard(
+    df_predict,
+    clusters,
+    "dashboards/50k_cluster.html",
+    sampling_method="by_cluster_size",
+    overwrite=True,
+)
+
+from IPython.display import IFrame
+
+IFrame(src="./dashboards/50k_cluster.html", width="100%", height=1200)
+
+

+

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/examples/duckdb/accuracy_analysis_from_labels_column.html b/demos/examples/duckdb/accuracy_analysis_from_labels_column.html new file mode 100644 index 0000000000..8691350a5b --- /dev/null +++ b/demos/examples/duckdb/accuracy_analysis_from_labels_column.html @@ -0,0 +1,6079 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Evaluation from ground truth column - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Evaluation from ground truth column

+ +

Evaluation when you have fully labelled data

+

In this example, our data contains a fully-populated ground-truth column called cluster that enables us to perform accuracy analysis of the final model.

+

+ Open In Colab +

+
from splink import splink_datasets
+
+df = splink_datasets.fake_1000
+df.head(2)
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
unique_idfirst_namesurnamedobcityemailcluster
00RobertAlan1971-06-24NaNrobert255@smith.net0
11RobertAllen1971-05-24NaNroberta25@smith.net0
+
+ +
from splink import SettingsCreator, Linker, block_on, DuckDBAPI
+
+import splink.comparison_library as cl
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name"),
+        block_on("surname"),
+        block_on("dob"),
+        block_on("email"),
+    ],
+    comparisons=[
+        cl.ForenameSurnameComparison("first_name", "surname"),
+        cl.DateOfBirthComparison(
+            "dob",
+            input_is_string=True,
+        ),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        cl.EmailComparison("email"),
+    ],
+    retain_intermediate_calculation_columns=True,
+)
+
+
db_api = DuckDBAPI()
+linker = Linker(df, settings, db_api=db_api)
+deterministic_rules = [
+    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
+    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
+    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
+    "l.email = r.email",
+]
+
+linker.training.estimate_probability_two_random_records_match(
+    deterministic_rules, recall=0.7
+)
+
+
Probability two random records match is estimated to be  0.00333.
+This means that amongst all possible pairwise record comparisons, one in 300.13 are expected to match.  With 499,500 total possible comparisons, we expect a total of around 1,664.29 matching pairs
+
+
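The figures in this message are mutually consistent, as a back-of-the-envelope check shows (plain arithmetic, not Splink code; note the printed values are rounded, so the reproduction is approximate):

```python
# 1,000 input records give n * (n - 1) / 2 pairwise comparisons.
n_records = 1_000
total_comparisons = n_records * (n_records - 1) // 2  # 499,500

p_match = 1 / 300.13  # "one in 300.13" comparisons, i.e. ~0.00333
expected_matches = total_comparisons * p_match

print(total_comparisons)        # 499500
print(round(expected_matches))  # ~1664, matching "around 1,664.29"
```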
linker.training.estimate_u_using_random_sampling(max_pairs=1e6, seed=5)
+
+
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
+----- Estimating u probabilities using random sampling -----
+
+Estimated u probabilities using random sampling
+
+Your model is not yet fully trained. Missing estimates for:
+    - first_name_surname (no m values are trained).
+    - dob (no m values are trained).
+    - city (no m values are trained).
+    - email (no m values are trained).
+
+
session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
+    block_on("dob"), estimate_without_term_frequencies=True
+)
+session_email = linker.training.estimate_parameters_using_expectation_maximisation(
+    block_on("email"), estimate_without_term_frequencies=True
+)
+session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
+    block_on("first_name", "surname"), estimate_without_term_frequencies=True
+)
+
+
----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+l."dob" = r."dob"
+
+Parameter estimates will be made for the following comparison(s):
+    - first_name_surname
+    - city
+    - email
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - dob
+
+WARNING:
+Level Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value
+
+Iteration 1: Largest change in params was -0.751 in the m_probability of first_name_surname, level `(Exact match on first_name) AND (Exact match on surname)`
+Iteration 2: Largest change in params was 0.196 in probability_two_random_records_match
+Iteration 3: Largest change in params was 0.0536 in probability_two_random_records_match
+Iteration 4: Largest change in params was 0.0189 in probability_two_random_records_match
+Iteration 5: Largest change in params was 0.00731 in probability_two_random_records_match
+Iteration 6: Largest change in params was 0.0029 in probability_two_random_records_match
+Iteration 7: Largest change in params was 0.00116 in probability_two_random_records_match
+Iteration 8: Largest change in params was 0.000469 in probability_two_random_records_match
+Iteration 9: Largest change in params was 0.000189 in probability_two_random_records_match
+Iteration 10: Largest change in params was 7.62e-05 in probability_two_random_records_match
+
+EM converged after 10 iterations
+m probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
+
+Your model is not yet fully trained. Missing estimates for:
+    - dob (no m values are trained).
+    - email (some m values are not trained).
+
+----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+l."email" = r."email"
+
+Parameter estimates will be made for the following comparison(s):
+    - first_name_surname
+    - dob
+    - city
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - email
+
+Iteration 1: Largest change in params was -0.438 in the m_probability of dob, level `Exact match on dob`
+Iteration 2: Largest change in params was 0.122 in probability_two_random_records_match
+Iteration 3: Largest change in params was 0.0286 in probability_two_random_records_match
+Iteration 4: Largest change in params was 0.01 in probability_two_random_records_match
+Iteration 5: Largest change in params was 0.00448 in probability_two_random_records_match
+Iteration 6: Largest change in params was 0.00237 in probability_two_random_records_match
+Iteration 7: Largest change in params was 0.0014 in probability_two_random_records_match
+Iteration 8: Largest change in params was 0.000893 in probability_two_random_records_match
+Iteration 9: Largest change in params was 0.000597 in probability_two_random_records_match
+Iteration 10: Largest change in params was 0.000413 in probability_two_random_records_match
+Iteration 11: Largest change in params was 0.000292 in probability_two_random_records_match
+Iteration 12: Largest change in params was 0.000211 in probability_two_random_records_match
+Iteration 13: Largest change in params was 0.000154 in probability_two_random_records_match
+Iteration 14: Largest change in params was 0.000113 in probability_two_random_records_match
+Iteration 15: Largest change in params was 8.4e-05 in probability_two_random_records_match
+
+EM converged after 15 iterations
+
+Your model is not yet fully trained. Missing estimates for:
+    - email (some m values are not trained).
+
+----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+(l."first_name" = r."first_name") AND (l."surname" = r."surname")
+
+Parameter estimates will be made for the following comparison(s):
+    - dob
+    - city
+    - email
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - first_name_surname
+
+WARNING:
+Level Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value
+
+Iteration 1: Largest change in params was 0.473 in probability_two_random_records_match
+Iteration 2: Largest change in params was 0.0452 in probability_two_random_records_match
+Iteration 3: Largest change in params was 0.00766 in probability_two_random_records_match
+Iteration 4: Largest change in params was 0.00135 in probability_two_random_records_match
+Iteration 5: Largest change in params was 0.00025 in probability_two_random_records_match
+Iteration 6: Largest change in params was 0.000468 in the m_probability of email, level `All other comparisons`
+Iteration 7: Largest change in params was 0.00776 in the m_probability of email, level `All other comparisons`
+Iteration 8: Largest change in params was 0.00992 in the m_probability of email, level `All other comparisons`
+Iteration 9: Largest change in params was 0.00277 in probability_two_random_records_match
+Iteration 10: Largest change in params was 0.000972 in probability_two_random_records_match
+Iteration 11: Largest change in params was 0.000337 in probability_two_random_records_match
+Iteration 12: Largest change in params was 0.000118 in probability_two_random_records_match
+Iteration 13: Largest change in params was 4.14e-05 in probability_two_random_records_match
+
+EM converged after 13 iterations
+m probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
+
+Your model is not yet fully trained. Missing estimates for:
+    - email (some m values are not trained).
+
+
linker.evaluation.accuracy_analysis_from_labels_column(
+    "cluster", output_type="table"
+).as_pandas_dataframe(limit=5)
+
+
 -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'email':
+    m values not fully trained
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
truth_thresholdmatch_probabilitytotal_clerical_labelspntptnfpfnP_rate...precisionrecallspecificitynpvaccuracyf1f2f0_5p4phi
0-17.80.000004499500.02031.0497469.01650.0495130.02339.0381.00.004066...0.4136380.8124080.9952980.9992310.9945550.5481730.6810860.4586650.7074660.577474
1-17.70.000005499500.02031.0497469.01650.0495225.02244.0381.00.004066...0.4237290.8124080.9954890.9992310.9947450.5569620.6864700.4685640.7147690.584558
2-17.10.000007499500.02031.0497469.01650.0495311.02158.0381.00.004066...0.4332980.8124080.9956620.9992310.9949170.5651650.6914180.4779010.7215120.591197
3-17.00.000008499500.02031.0497469.01650.0495354.02115.0381.00.004066...0.4382470.8124080.9957480.9992310.9950030.5693580.6939190.4827100.7249310.594601
4-16.90.000008499500.02031.0497469.01650.0495386.02083.0381.00.004066...0.4420040.8124080.9958130.9992310.9950670.5725190.6957920.4863530.7274970.597173
+

5 rows × 25 columns

+
+ +
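The derived metrics in each row follow directly from the confusion-matrix counts. For example, taking the tp, fp and fn counts from the first row above (a quick check, not Splink code):

```python
# Counts from the first row of the table (threshold -17.8).
tp, fp, fn = 1650, 2339, 381

precision = tp / (tp + fp)  # ≈ 0.413638
recall = tp / (tp + fn)     # ≈ 0.812408
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.548173

print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.4136 0.8124 0.5482
```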
linker.evaluation.accuracy_analysis_from_labels_column("cluster", output_type="roc")
+
+
 -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'email':
+    m values not fully trained
+
+ +
+ + +
linker.evaluation.accuracy_analysis_from_labels_column(
+    "cluster",
+    output_type="threshold_selection",
+    threshold_match_probability=0.5,
+    add_metrics=["f1"],
+)
+
+
 -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'email':
+    m values not fully trained
+
+ +
+ + +
# Plot some false positives
+linker.evaluation.prediction_errors_from_labels_column(
+    "cluster", include_false_negatives=True, include_false_positives=True
+).as_pandas_dataframe(limit=5)
+
+
 -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'email':
+    m values not fully trained
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
clerical_match_scorefound_by_blocking_rulesmatch_weightmatch_probabilityunique_id_lunique_id_rsurname_lsurname_rfirst_name_lfirst_name_r...email_lemail_rgamma_emailtf_email_ltf_email_rbf_emailbf_tf_adj_emailcluster_lcluster_rmatch_key
01.0False-15.5689450.000021452454DavesReubenNoneDavies...rd@lewis.comidlewrs.cocm00.0038020.0012670.010991.01151154
11.0False-14.8840570.000033715717JoesJonesNoneMia...Nonemia.j63@martinez.biz-1NaN0.0050701.000001.01821824
21.0False-14.8840570.000033626628DavidsonNonegeeorGeGeeorge...Nonegdavidson@johnson-brown.com-1NaN0.0050701.000001.01581584
31.0False-13.7615890.000072983984MilllerMillerJessicaaessicJ...Nonejessica.miller@johnson.com-1NaN0.0076051.000001.02462464
41.0True-11.6375850.000314594595KikKiirkGraceGrace...gk@frey-robinson.orgrgk@frey-robinon.org00.0012670.0012670.010991.01461460
+

5 rows × 38 columns

+
+ +
records = linker.evaluation.prediction_errors_from_labels_column(
+    "cluster", include_false_negatives=True, include_false_positives=True
+).as_record_dict(limit=5)
+
+linker.visualisations.waterfall_chart(records)
+
+
 -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'email':
+    m values not fully trained
+
+ +
+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/examples/duckdb/cookbook.html b/demos/examples/duckdb/cookbook.html new file mode 100644 index 0000000000..92c9fa0c18 --- /dev/null +++ b/demos/examples/duckdb/cookbook.html @@ -0,0 +1,6038 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Cookbook - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + +

Cookbook

+

This notebook contains a miscellaneous collection of runnable examples illustrating various Splink techniques.

+

Array columns

+

Comparing array columns

+

This example shows how we can use ArrayIntersectAtSizes to assess the similarity of columns containing arrays.

+
import pandas as pd
+
+import splink.comparison_library as cl
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on
+
+
+data = [
+    {"unique_id": 1, "first_name": "John", "postcode": ["A", "B"]},
+    {"unique_id": 2, "first_name": "John", "postcode": ["B"]},
+    {"unique_id": 3, "first_name": "John", "postcode": ["A"]},
+    {"unique_id": 4, "first_name": "John", "postcode": ["A", "B"]},
+    {"unique_id": 5, "first_name": "John", "postcode": ["C"]},
+]
+
+df = pd.DataFrame(data)
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name"),
+    ],
+    comparisons=[
+        cl.ArrayIntersectAtSizes("postcode", [2, 1]),
+        cl.ExactMatch("first_name"),
+    ]
+)
+
+
+linker = Linker(df, settings, DuckDBAPI(), set_up_basic_logging=False)
+
+linker.inference.predict().as_pandas_dataframe()
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
match_weightmatch_probabilityunique_id_lunique_id_rpostcode_lpostcode_rgamma_postcodefirst_name_lfirst_name_rgamma_first_name
0-8.2875680.00319045[A, B][C]0JohnJohn1
1-0.2875680.45033334[A][A, B]1JohnJohn1
2-8.2875680.00319035[A][C]0JohnJohn1
3-8.2875680.00319023[B][A]0JohnJohn1
4-0.2875680.45033324[B][A, B]1JohnJohn1
5-8.2875680.00319025[B][C]0JohnJohn1
6-0.2875680.45033312[A, B][B]1JohnJohn1
7-0.2875680.45033313[A, B][A]1JohnJohn1
86.7124320.99055414[A, B][A, B]2JohnJohn1
9-8.2875680.00319015[A, B][C]0JohnJohn1
+
+ +

Blocking on array columns

+

This example shows how we can use block_on to block on the individual elements of an array column - that is, pairwise comparisons are created for pairs of records where any of the elements in the array columns match.

+
import pandas as pd
+
+import splink.comparison_library as cl
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on
+
+
+data = [
+    {"unique_id": 1, "first_name": "John", "postcode": ["A", "B"]},
+    {"unique_id": 2, "first_name": "John", "postcode": ["B"]},
+    {"unique_id": 3, "first_name": "John", "postcode": ["C"]},
+
+]
+
+df = pd.DataFrame(data)
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    blocking_rules_to_generate_predictions=[
+        block_on("postcode", arrays_to_explode=["postcode"]),
+    ],
+    comparisons=[
+        cl.ArrayIntersectAtSizes("postcode", [2, 1]),
+        cl.ExactMatch("first_name"),
+    ]
+)
+
+
+linker = Linker(df, settings, DuckDBAPI(), set_up_basic_logging=False)
+
+linker.inference.predict().as_pandas_dataframe()
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
match_weightmatch_probabilityunique_id_lunique_id_rpostcode_lpostcode_rgamma_postcodefirst_name_lfirst_name_rgamma_first_name
0-0.2875680.45033312[A, B][B]1JohnJohn1
+
+ +
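The single generated comparison makes sense: with arrays_to_explode, a pair is blocked together whenever the two postcode arrays share at least one element. A minimal pure-Python illustration of that rule (not Splink's actual SQL implementation):

```python
from itertools import combinations

postcodes = {1: {"A", "B"}, 2: {"B"}, 3: {"C"}}

# A candidate pair is generated when the arrays share any element.
pairs = [
    (l, r)
    for l, r in combinations(sorted(postcodes), 2)
    if postcodes[l] & postcodes[r]
]
print(pairs)  # [(1, 2)] — records 1 and 2 share "B"; record 3 matches nobody
```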

Other

+

Using DuckDB without pandas

+

In this example, we read data directly with DuckDB and obtain results as DuckDB's native DuckDBPyRelation type.

+
import duckdb
+import tempfile
+import os
+
+import splink.comparison_library as cl
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
+
+# Create a parquet file on disk to demonstrate native DuckDB parquet reading
+df = splink_datasets.fake_1000
+temp_file = tempfile.NamedTemporaryFile(delete=True, suffix=".parquet")
+temp_file_path = temp_file.name
+df.to_parquet(temp_file_path)
+
+# Example would start here if you already had a parquet file
+duckdb_df = duckdb.read_parquet(temp_file_path)
+
+db_api = DuckDBAPI(":default:")
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    comparisons=[
+        cl.NameComparison("first_name"),
+        cl.JaroAtThresholds("surname"),
+    ],
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name", "dob"),
+        block_on("surname"),
+    ],
+)
+
+linker = Linker(df, settings, db_api, set_up_basic_logging=False)
+
+result = linker.inference.predict().as_duckdbpyrelation()
+
+# Since result is a DuckDBPyRelation, we can use all the usual DuckDB API
+# functions on it.
+
+# For example, we can use the `sort` function to sort the results,
+# or could use result.to_parquet() to write to a parquet file.
+result.sort("match_weight")
+
+
┌─────────────────────┬──────────────────────┬─────────────┬───┬───────────────┬────────────┬────────────┬───────────┐
+│    match_weight     │  match_probability   │ unique_id_l │ … │ gamma_surname │   dob_l    │   dob_r    │ match_key │
+│       double        │        double        │    int64    │   │     int32     │  varchar   │  varchar   │  varchar  │
+├─────────────────────┼──────────────────────┼─────────────┼───┼───────────────┼────────────┼────────────┼───────────┤
+│  -11.83278901894715 │ 0.000274066864295451 │         758 │ … │             0 │ 2002-09-15 │ 2002-09-15 │ 0         │
+│ -10.247826518225994 │  0.0008217501639050… │         670 │ … │             0 │ 2006-12-05 │ 2006-12-05 │ 0         │
+│  -9.662864017504837 │  0.0012321189988629… │         558 │ … │             0 │ 2020-02-11 │ 2020-02-11 │ 0         │
+│  -9.470218939562441 │  0.0014078881864458… │         259 │ … │             1 │ 1983-03-07 │ 1983-03-07 │ 0         │
+│  -8.470218939562441 │ 0.002811817648042493 │         644 │ … │            -1 │ 1992-02-06 │ 1992-02-06 │ 0         │
+│  -8.287568102831404 │  0.0031901106569634… │         393 │ … │             3 │ 1991-05-06 │ 1991-04-12 │ 1         │
+│  -8.287568102831404 │  0.0031901106569634… │         282 │ … │             3 │ 2004-12-02 │ 2002-02-25 │ 1         │
+│  -8.287568102831404 │  0.0031901106569634… │         282 │ … │             3 │ 2004-12-02 │ 1993-03-01 │ 1         │
+│  -8.287568102831404 │  0.0031901106569634… │         531 │ … │             3 │ 1987-09-11 │ 2000-09-03 │ 1         │
+│  -8.287568102831404 │  0.0031901106569634… │         531 │ … │             3 │ 1987-09-11 │ 1990-10-06 │ 1         │
+│           ·         │            ·         │          ·  │ · │             · │     ·      │     ·      │ ·         │
+│           ·         │            ·         │          ·  │ · │             · │     ·      │     ·      │ ·         │
+│           ·         │            ·         │          ·  │ · │             · │     ·      │     ·      │ ·         │
+│   5.337135982495163 │   0.9758593366351407 │         554 │ … │             3 │ 2020-02-11 │ 2030-02-08 │ 1         │
+│   5.337135982495163 │   0.9758593366351407 │         774 │ … │             3 │ 2027-04-21 │ 2017-04-23 │ 1         │
+│   5.337135982495163 │   0.9758593366351407 │         874 │ … │             3 │ 2020-06-23 │ 2019-05-23 │ 1         │
+│   5.337135982495163 │   0.9758593366351407 │         409 │ … │             3 │ 2017-05-03 │ 2008-05-05 │ 1         │
+│   5.337135982495163 │   0.9758593366351407 │         415 │ … │             3 │ 2002-02-25 │ 1993-03-01 │ 1         │
+│   5.337135982495163 │   0.9758593366351407 │         740 │ … │             3 │ 2005-09-18 │ 2006-09-14 │ 1         │
+│   5.337135982495163 │   0.9758593366351407 │         417 │ … │             3 │ 2002-02-24 │ 1992-02-28 │ 1         │
+│   5.337135982495163 │   0.9758593366351407 │         534 │ … │             3 │ 1974-02-28 │ 1975-03-31 │ 1         │
+│   5.337135982495163 │   0.9758593366351407 │         286 │ … │             3 │ 1985-01-05 │ 1986-02-04 │ 1         │
+│   5.337135982495163 │   0.9758593366351407 │         172 │ … │             3 │ 2012-07-06 │ 2012-07-09 │ 1         │
+├─────────────────────┴──────────────────────┴─────────────┴───┴───────────────┴────────────┴────────────┴───────────┤
+│ 1800 rows (20 shown)                                                                          13 columns (7 shown) │
+└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
+
+

Fixing m or u probabilities during training

+
import splink.comparison_level_library as cll
+import splink.comparison_library as cl
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
+
+
+db_api = DuckDBAPI()
+
+first_name_comparison = cl.CustomComparison(
+    comparison_levels=[
+        cll.NullLevel("first_name"),
+        cll.ExactMatchLevel("first_name").configure(
+            m_probability=0.9999,
+            fix_m_probability=True,
+            u_probability=0.7,
+            fix_u_probability=True,
+        ),
+        cll.ElseLevel(),
+    ]
+)
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    comparisons=[
+        first_name_comparison,
+        cl.ExactMatch("surname"),
+        cl.ExactMatch("dob"),
+        cl.ExactMatch("city"),
+    ],
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name"),
+        block_on("dob"),
+    ],
+    additional_columns_to_retain=["cluster"],
+)
+
+df = splink_datasets.fake_1000
+linker = Linker(df, settings, db_api, set_up_basic_logging=False)
+
+linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))
+
+linker.visualisations.m_u_parameters_chart()
+
+ +
+ + +

Manually altering m and u probabilities post-training

+

This is not officially supported, but can be useful for ad-hoc alterations to trained models.

+
import splink.comparison_level_library as cll
+import splink.comparison_library as cl
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
+from splink.datasets import splink_dataset_labels
+
+labels = splink_dataset_labels.fake_1000_labels
+
+db_api = DuckDBAPI()
+
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    comparisons=[
+        cl.ExactMatch("first_name"),
+        cl.ExactMatch("surname"),
+        cl.ExactMatch("dob"),
+        cl.ExactMatch("city"),
+    ],
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name"),
+        block_on("dob"),
+    ],
+)
+df = splink_datasets.fake_1000
+linker = Linker(df, settings, db_api, set_up_basic_logging=False)
+
+linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))
+
+
+surname_comparison = linker._settings_obj._get_comparison_by_output_column_name(
+    "surname"
+)
+else_comparison_level = (
+    surname_comparison._get_comparison_level_by_comparison_vector_value(0)
+)
+else_comparison_level._m_probability = 0.1
+
+
+linker.visualisations.m_u_parameters_chart()
+
+ +
+ + +

Generate the (beta) labelling tool

+
import splink.comparison_library as cl
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
+
+db_api = DuckDBAPI()
+
+df = splink_datasets.fake_1000
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    comparisons=[
+        cl.ExactMatch("first_name"),
+        cl.ExactMatch("surname"),
+        cl.ExactMatch("dob"),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        cl.ExactMatch("email"),
+    ],
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name"),
+        block_on("surname"),
+    ],
+    max_iterations=2,
+)
+
+linker = Linker(df, settings, db_api, set_up_basic_logging=False)
+
+linker.training.estimate_probability_two_random_records_match(
+    [block_on("first_name", "surname")], recall=0.7
+)
+
+linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))
+
+pairwise_predictions = linker.inference.predict(threshold_match_weight=-10)
+
+first_unique_id = df.iloc[0].unique_id
+linker.evaluation.labelling_tool_for_specific_record(unique_id=first_unique_id, overwrite=True)
+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/examples/duckdb/dashboards/50k_cluster.html b/demos/examples/duckdb/dashboards/50k_cluster.html new file mode 100644 index 0000000000..b92590e386 --- /dev/null +++ b/demos/examples/duckdb/dashboards/50k_cluster.html @@ -0,0 +1,11080 @@ + + + + + +Splink cluster studio + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +

Splink cluster studio

+ +
+
+ +
+
+
+ +
+
+ + +
+
+
+
+ +
+
+
+ + +
+
+
+ +
+ + + + + +
+
+
+ + + + + +
+ + + + + + diff --git a/demos/examples/duckdb/dashboards/50k_deterministic_cluster.html b/demos/examples/duckdb/dashboards/50k_deterministic_cluster.html new file mode 100644 index 0000000000..1617631134 --- /dev/null +++ b/demos/examples/duckdb/dashboards/50k_deterministic_cluster.html @@ -0,0 +1,9513 @@ + + + + + +Splink cluster studio + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +

Splink cluster studio

+ +
+
+ +
+
+
+ +
+
+ + +
+
+
+
+ +
+
+
+ + +
+
+
+ +
+ + + + + +
+
+
+ + + + + +
+ + + + + + diff --git a/demos/examples/duckdb/dashboards/comparison_viewer_transactions.html b/demos/examples/duckdb/dashboards/comparison_viewer_transactions.html new file mode 100644 index 0000000000..7bb8650dee --- /dev/null +++ b/demos/examples/duckdb/dashboards/comparison_viewer_transactions.html @@ -0,0 +1,11024 @@ + + + + + +Splink comparison viewer + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +

Splink comparison viewer

+ +
+ +
+
+
+
+ +
+
+ +
+
+
+
+
+ + + + + + \ No newline at end of file diff --git a/demos/examples/duckdb/deduplicate_50k_synthetic.html b/demos/examples/duckdb/deduplicate_50k_synthetic.html new file mode 100644 index 0000000000..a571cca09e --- /dev/null +++ b/demos/examples/duckdb/deduplicate_50k_synthetic.html @@ -0,0 +1,6434 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Deduplicate 50k rows historical persons - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Deduplicate 50k rows historical persons

+ +

Linking a dataset of real historical persons

+

In this example, we deduplicate a more realistic dataset. The data is based on historical persons scraped from wikidata. Duplicate records have been introduced, with a variety of errors.

+

+ Open In Colab +

+
from splink import splink_datasets
+
+df = splink_datasets.historical_50k
+
+
df.head()
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
unique_idclusterfull_namefirst_and_surnamefirst_namesurnamedobbirth_placepostcode_fakegenderoccupation
0Q2296770-1Q2296770thomas clifford, 1st baron clifford of chudleighthomas chudleighthomaschudleigh1630-08-01devontq13 8dfmalepolitician
1Q2296770-2Q2296770thomas of chudleighthomas chudleighthomaschudleigh1630-08-01devontq13 8dfmalepolitician
2Q2296770-3Q2296770tom 1st baron clifford of chudleightom chudleightomchudleigh1630-08-01devontq13 8dfmalepolitician
3Q2296770-4Q2296770thomas 1st chudleighthomas chudleighthomaschudleigh1630-08-01devontq13 8huNonepolitician
4Q2296770-5Q2296770thomas clifford, 1st baron chudleighthomas chudleighthomaschudleigh1630-08-01devontq13 8dfNonepolitician
+
+ +
from splink import DuckDBAPI
+from splink.exploratory import profile_columns
+
+db_api = DuckDBAPI()
+profile_columns(df, db_api, column_expressions=["first_name", "substr(surname,1,2)"])
+
+ +
+ + +
from splink import DuckDBAPI, block_on
+from splink.blocking_analysis import (
+    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
+)
+
+blocking_rules = [
+    block_on("substr(first_name,1,3)", "substr(surname,1,4)"),
+    block_on("surname", "dob"),
+    block_on("first_name", "dob"),
+    block_on("postcode_fake", "first_name"),
+    block_on("postcode_fake", "surname"),
+    block_on("dob", "birth_place"),
+    block_on("substr(postcode_fake,1,3)", "dob"),
+    block_on("substr(postcode_fake,1,3)", "first_name"),
+    block_on("substr(postcode_fake,1,3)", "surname"),
+    block_on("substr(first_name,1,2)", "substr(surname,1,2)", "substr(dob,1,4)"),
+]
+
+db_api = DuckDBAPI()
+
+cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
+    table_or_tables=df,
+    blocking_rules=blocking_rules,
+    db_api=db_api,
+    link_type="dedupe_only",
+)
+
+ +
+ + +
import splink.comparison_library as cl
+
+from splink import Linker, SettingsCreator
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    blocking_rules_to_generate_predictions=blocking_rules,
+    comparisons=[
+        cl.ForenameSurnameComparison(
+            "first_name",
+            "surname",
+            forename_surname_concat_col_name="first_name_surname_concat",
+        ),
+        cl.DateOfBirthComparison(
+            "dob", input_is_string=True
+        ),
+        cl.PostcodeComparison("postcode_fake"),
+        cl.ExactMatch("birth_place").configure(term_frequency_adjustments=True),
+        cl.ExactMatch("occupation").configure(term_frequency_adjustments=True),
+    ],
+    retain_intermediate_calculation_columns=True,
+)
+# Needed to apply term frequencies to first+surname comparison
+df["first_name_surname_concat"] = df["first_name"] + " " + df["surname"]
+linker = Linker(df, settings, db_api=db_api)
+
+
linker.training.estimate_probability_two_random_records_match(
+    [
+        "l.first_name = r.first_name and l.surname = r.surname and l.dob = r.dob",
+        "substr(l.first_name,1,2) = substr(r.first_name,1,2) and l.surname = r.surname and substr(l.postcode_fake,1,2) = substr(r.postcode_fake,1,2)",
+        "l.dob = r.dob and l.postcode_fake = r.postcode_fake",
+    ],
+    recall=0.6,
+)
+
+
Probability two random records match is estimated to be  0.000136.
+This means that amongst all possible pairwise record comparisons, one in 7,362.31 are expected to match.  With 1,279,041,753 total possible comparisons, we expect a total of around 173,728.33 matching pairs
+
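As a quick sanity check on the figures in this output, the expected number of matching pairs is simply the total comparison count divided by the "one in N" rate (both taken from the log above):

```python
# Figures reported by estimate_probability_two_random_records_match above.
total_comparisons = 1_279_041_753  # all possible pairwise comparisons
one_match_in = 7_362.31  # "one in 7,362.31 are expected to match"

expected_matches = total_comparisons / one_match_in
print(f"{expected_matches:,.2f}")  # ~173,728 matching pairs, as logged
```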
+
linker.training.estimate_u_using_random_sampling(max_pairs=5e6)
+
+
----- Estimating u probabilities using random sampling -----
+
+
+
+FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))
+
+
+u probability not trained for first_name_surname - Match on reversed cols: first_name and surname (comparison vector value: 5). This usually means the comparison level was never observed in the training data.
+
+
+
+Estimated u probabilities using random sampling
+
+
+
+Your model is not yet fully trained. Missing estimates for:
+    - first_name_surname (some u values are not trained, no m values are trained).
+    - dob (no m values are trained).
+    - postcode_fake (no m values are trained).
+    - birth_place (no m values are trained).
+    - occupation (no m values are trained).
+
+
training_blocking_rule = block_on("first_name", "surname")
+training_session_names = (
+    linker.training.estimate_parameters_using_expectation_maximisation(
+        training_blocking_rule, estimate_without_term_frequencies=True
+    )
+)
+
+
----- Starting EM training session -----
+
+
+
+Estimating the m probabilities of the model by blocking on:
+(l."first_name" = r."first_name") AND (l."surname" = r."surname")
+
+Parameter estimates will be made for the following comparison(s):
+    - dob
+    - postcode_fake
+    - birth_place
+    - occupation
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - first_name_surname
+
+
+
+
+
+Iteration 1: Largest change in params was 0.247 in probability_two_random_records_match
+
+
+Iteration 2: Largest change in params was -0.0938 in the m_probability of postcode_fake, level `Exact match on full postcode`
+
+
+Iteration 3: Largest change in params was -0.0236 in the m_probability of birth_place, level `Exact match on birth_place`
+
+
+Iteration 4: Largest change in params was 0.00967 in the m_probability of birth_place, level `All other comparisons`
+
+
+Iteration 5: Largest change in params was -0.00467 in the m_probability of birth_place, level `Exact match on birth_place`
+
+
+Iteration 6: Largest change in params was 0.00267 in the m_probability of birth_place, level `All other comparisons`
+
+
+Iteration 7: Largest change in params was 0.00186 in the m_probability of dob, level `Abs date difference <= 10 year`
+
+
+Iteration 8: Largest change in params was 0.00127 in the m_probability of dob, level `Abs date difference <= 10 year`
+
+
+Iteration 9: Largest change in params was 0.000847 in the m_probability of dob, level `Abs date difference <= 10 year`
+
+
+Iteration 10: Largest change in params was 0.000563 in the m_probability of dob, level `Abs date difference <= 10 year`
+
+
+Iteration 11: Largest change in params was 0.000373 in the m_probability of dob, level `Abs date difference <= 10 year`
+
+
+Iteration 12: Largest change in params was 0.000247 in the m_probability of dob, level `Abs date difference <= 10 year`
+
+
+Iteration 13: Largest change in params was 0.000163 in the m_probability of dob, level `Abs date difference <= 10 year`
+
+
+Iteration 14: Largest change in params was 0.000108 in the m_probability of dob, level `Abs date difference <= 10 year`
+
+
+Iteration 15: Largest change in params was 7.14e-05 in the m_probability of dob, level `Abs date difference <= 10 year`
+
+
+
+EM converged after 15 iterations
+
+
+
+Your model is not yet fully trained. Missing estimates for:
+    - first_name_surname (some u values are not trained, no m values are trained).
+
+
training_blocking_rule = block_on("dob")
+training_session_dob = (
+    linker.training.estimate_parameters_using_expectation_maximisation(
+        training_blocking_rule, estimate_without_term_frequencies=True
+    )
+)
+
+
----- Starting EM training session -----
+
+
+
+Estimating the m probabilities of the model by blocking on:
+l."dob" = r."dob"
+
+Parameter estimates will be made for the following comparison(s):
+    - first_name_surname
+    - postcode_fake
+    - birth_place
+    - occupation
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - dob
+
+
+
+
+
+Iteration 1: Largest change in params was -0.472 in the m_probability of first_name_surname, level `Exact match on first_name_surname_concat`
+
+
+Iteration 2: Largest change in params was 0.0524 in the m_probability of first_name_surname, level `All other comparisons`
+
+
+Iteration 3: Largest change in params was 0.0175 in the m_probability of first_name_surname, level `All other comparisons`
+
+
+Iteration 4: Largest change in params was 0.00537 in the m_probability of first_name_surname, level `All other comparisons`
+
+
+Iteration 5: Largest change in params was 0.00165 in the m_probability of first_name_surname, level `All other comparisons`
+
+
+Iteration 6: Largest change in params was 0.000518 in the m_probability of first_name_surname, level `All other comparisons`
+
+
+Iteration 7: Largest change in params was 0.000164 in the m_probability of first_name_surname, level `All other comparisons`
+
+
+Iteration 8: Largest change in params was 5.2e-05 in the m_probability of first_name_surname, level `All other comparisons`
+
+
+
+EM converged after 8 iterations
+
+
+
+Your model is not yet fully trained. Missing estimates for:
+    - first_name_surname (some u values are not trained).
+
+

The final match weights can be viewed in the match weights chart:

+
linker.visualisations.match_weights_chart()
+
+ +
+ + +
linker.evaluation.unlinkables_chart()
+
+ +
+ + +
df_predict = linker.inference.predict()
+df_e = df_predict.as_pandas_dataframe(limit=5)
+df_e
+
+
Blocking time: 0.65 seconds
+
+
+Predict time: 1.71 seconds
+
+
+
+ -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'first_name_surname':
+    u values not fully trained
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
match_weightmatch_probabilityunique_id_lunique_id_rfirst_name_lfirst_name_rsurname_lsurname_rfirst_name_surname_concat_lfirst_name_surname_concat_r...bf_birth_placebf_tf_adj_birth_placeoccupation_loccupation_rgamma_occupationtf_occupation_ltf_occupation_rbf_occupationbf_tf_adj_occupationmatch_key
05.9031330.983565Q6105786-11Q6105786-6joanj.garsongarsonjoan garsonj. garson...0.1641591.000000anthropologistanatomist00.0020560.0005930.1072481.04
12.3548190.836476Q6105786-11Q6105786-8joanj.garsongarsonjoan garsonj. garson...0.1641591.000000anthropologistanatomist00.0020560.0005930.1072481.04
22.3548190.836476Q6105786-11Q6105786-9joaniangarsongarsonjoan garsonian garson...0.1641591.000000anthropologistanatomist00.0020560.0005930.1072481.04
33.3192020.908935Q6105786-11Q6105786-13joanj.garsongarsonjoan garsonj. garson...0.1641591.000000anthropologistNone-10.002056NaN1.0000001.04
416.8816610.999992Q6241382-1Q6241382-11johnjoanjacksonjacksonjohn jacksonjoan jackson...147.48951117.689372authorNone-10.003401NaN1.0000001.04
+

5 rows × 42 columns

+
+ +
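The match_weight and match_probability columns are two views of the same score: a match weight w corresponds to a probability of 2^w / (1 + 2^w). A quick check against the first row of the table above (values taken from that row; minor rounding aside):

```python
# Convert a Splink match weight to a match probability: p = 2**w / (1 + 2**w).
w = 5.903133  # match_weight of the first predicted pair above
p = 2**w / (1 + 2**w)
print(p)  # close to the match_probability column value of 0.983565
```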

You can also view rows in this dataset as a waterfall chart as follows:

+
records_to_plot = df_e.to_dict(orient="records")
+linker.visualisations.waterfall_chart(records_to_plot, filter_nulls=False)
+
+ +
+ + +
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
+    df_predict, threshold_match_probability=0.95
+)
+
+
Completed iteration 1, root rows count 858
+
+
+Completed iteration 2, root rows count 202
+
+
+Completed iteration 3, root rows count 68
+
+
+Completed iteration 4, root rows count 9
+
+
+Completed iteration 5, root rows count 1
+
+
+Completed iteration 6, root rows count 0
+
+
from IPython.display import IFrame
+
+linker.visualisations.cluster_studio_dashboard(
+    df_predict,
+    clusters,
+    "dashboards/50k_cluster.html",
+    sampling_method="by_cluster_size",
+    overwrite=True,
+)
+
+
+IFrame(src="./dashboards/50k_cluster.html", width="100%", height=1200)
+
+

+

+
linker.evaluation.accuracy_analysis_from_labels_column(
+    "cluster", output_type="accuracy", match_weight_round_to_nearest=0.02
+)
+
+
Blocking time: 1.10 seconds
+
+
+Predict time: 1.54 seconds
+
+
+
+ -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'first_name_surname':
+    u values not fully trained
+
+ +
+ + +
records = linker.evaluation.prediction_errors_from_labels_column(
+    "cluster",
+    threshold_match_probability=0.999,
+    include_false_negatives=False,
+    include_false_positives=True,
+).as_record_dict()
+linker.visualisations.waterfall_chart(records)
+
+
Blocking time: 0.86 seconds
+
+
+Predict time: 0.30 seconds
+
+
+
+ -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'first_name_surname':
+    u values not fully trained
+
+ +
+ + +
# Some of the false negatives will be because they weren't detected by the blocking rules
+records = linker.evaluation.prediction_errors_from_labels_column(
+    "cluster",
+    threshold_match_probability=0.5,
+    include_false_negatives=True,
+    include_false_positives=False,
+).as_record_dict(limit=50)
+
+linker.visualisations.waterfall_chart(records)
+
+
Blocking time: 0.92 seconds
+
+
+Predict time: 0.30 seconds
+
+
+
+ -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'first_name_surname':
+    u values not fully trained
+
+ +
+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/examples/duckdb/deterministic_dedupe.html b/demos/examples/duckdb/deterministic_dedupe.html new file mode 100644 index 0000000000..438829ef5a --- /dev/null +++ b/demos/examples/duckdb/deterministic_dedupe.html @@ -0,0 +1,5765 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Deterministic dedupe - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Deterministic dedupe

+ +

Linking a dataset of real historical persons with Deterministic Rules

+

While Splink is primarily a tool for probabilistic record linkage, it also includes functionality to perform deterministic (i.e. rules-based) linkage.

+

Significant work has gone into optimising the performance of rules-based matching, so Splink is likely to be significantly faster than writing the equivalent SQL by hand.

+
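For intuition, the SQL being optimised away is essentially one self-join per rule, unioned across rules with duplicate pairs removed. A rough, hypothetical sketch of a single rule using a pandas merge (the toy data and column names below are illustrative, chosen to mirror the dataset used in this example):

```python
import pandas as pd

# Hypothetical toy data; the key mirrors a single deterministic rule on
# first_name, surname and dob.
df = pd.DataFrame(
    {
        "unique_id": [1, 2, 3],
        "first_name": ["ann", "ann", "bob"],
        "surname": ["lee", "lee", "ray"],
        "dob": ["1990-01-01", "1990-01-01", "1985-05-05"],
    }
)

key = ["first_name", "surname", "dob"]
pairs = df.merge(df, on=key, suffixes=("_l", "_r"))
# Keep each unordered pair once and drop self-matches.
pairs = pairs[pairs["unique_id_l"] < pairs["unique_id_r"]]
print(pairs[["unique_id_l", "unique_id_r"]])  # the single pair (1, 2)
```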

In this example, we deduplicate a 50k-row dataset based on historical persons scraped from Wikidata, with duplicate records introduced containing a variety of errors. The probabilistic dedupe of the same dataset can be found at Deduplicate 50k rows historical persons.

+

+ Open In Colab +

+
# Uncomment and run this cell if you're running in Google Colab.
+# !pip install splink
+
+
import pandas as pd
+
+from splink import splink_datasets
+
+pd.options.display.max_rows = 1000
+df = splink_datasets.historical_50k
+df.head()
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
unique_idclusterfull_namefirst_and_surnamefirst_namesurnamedobbirth_placepostcode_fakegenderoccupation
0Q2296770-1Q2296770thomas clifford, 1st baron clifford of chudleighthomas chudleighthomaschudleigh1630-08-01devontq13 8dfmalepolitician
1Q2296770-2Q2296770thomas of chudleighthomas chudleighthomaschudleigh1630-08-01devontq13 8dfmalepolitician
2Q2296770-3Q2296770tom 1st baron clifford of chudleightom chudleightomchudleigh1630-08-01devontq13 8dfmalepolitician
3Q2296770-4Q2296770thomas 1st chudleighthomas chudleighthomaschudleigh1630-08-01devontq13 8huNonepolitician
4Q2296770-5Q2296770thomas clifford, 1st baron chudleighthomas chudleighthomaschudleigh1630-08-01devontq13 8dfNonepolitician
+
+ +

When defining the settings object, specify your deterministic rules in the blocking_rules_to_generate_predictions key.

+

For a deterministic linkage, the methodology is based solely on these rules, so there is no need to define comparisons or any of the other parameters required to train a probabilistic model.

+

Prior to running the linkage, it's usually a good idea to check how many record comparisons will be generated by your deterministic rules:

+
from splink import DuckDBAPI, block_on
+from splink.blocking_analysis import (
+    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
+)
+
+db_api = DuckDBAPI()
+cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
+    table_or_tables=df,
+    blocking_rules=[
+        block_on("first_name", "surname", "dob"),
+        block_on("surname", "dob", "postcode_fake"),
+        block_on("first_name", "dob", "occupation"),
+    ],
+    db_api=db_api,
+    link_type="dedupe_only",
+)
+
+ +
+ + +
from splink import Linker, SettingsCreator
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name", "surname", "dob"),
+        block_on("surname", "dob", "postcode_fake"),
+        block_on("first_name", "dob", "occupation"),
+    ],
+    retain_intermediate_calculation_columns=True,
+)
+
+linker = Linker(df, settings, db_api=db_api)
+
+

The results of the linkage can be viewed with the deterministic_link function.

+
df_predict = linker.inference.deterministic_link()
+df_predict.as_pandas_dataframe().head()
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
unique_id_lunique_id_roccupation_loccupation_rfirst_name_lfirst_name_rdob_ldob_rsurname_lsurname_rpostcode_fake_lpostcode_fake_rmatch_key
0Q55455287-12Q55455287-2Nonewriterjaidojaido1836-01-011836-01-01moratamoratata4 2ugta4 2uu0
1Q55455287-12Q55455287-3Nonewriterjaidojaido1836-01-011836-01-01moratamoratata4 2ugta4 2uu0
2Q55455287-12Q55455287-4Nonewriterjaidojaido1836-01-011836-01-01moratamoratata4 2ugta4 2sz0
3Q55455287-12Q55455287-5NoneNonejaidojaido1836-01-011836-01-01moratamoratata4 2ugta4 2ug0
4Q55455287-12Q55455287-6Nonewriterjaidojaido1836-01-011836-01-01moratamoratata4 2ugNone0
+
+ +

Which can be used to generate clusters.

+

Note that for a deterministic linkage, each comparison is assigned a match probability of 1, so to generate clusters, set threshold_match_probability=1 in the cluster_pairwise_predictions_at_threshold function.

+
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
+    df_predict, threshold_match_probability=1
+)
+
+
Completed iteration 1, root rows count 94
+
+
+Completed iteration 2, root rows count 10
+
+
+Completed iteration 3, root rows count 0
+
+
clusters.as_pandas_dataframe(limit=5)
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
cluster_idunique_idclusterfull_namefirst_and_surnamefirst_namesurnamedobbirth_placepostcode_fakegenderoccupation__splink_salt
0Q16025107-1Q5497940-9Q5497940frederick hallfrederick hallfrederickhall1855-01-01bristol, city ofbs11 9pnNoneNone0.002739
1Q1149445-1Q1149445-9Q1149445earl egertonearl egertonearlegerton1800-01-01westminsterw1d 2hfNoneNone0.991459
2Q20664532-1Q21466387-2Q21466387harry brookerharry brookerharrybrooker1848-01-01plymouthpl4 9hxmalepainter0.506127
3Q1124636-1Q1124636-12Q1124636tom stapletontom stapletontomstapleton1535-01-01Nonebn6 9namaletheologian0.612694
4Q18508292-1Q21466711-4Q21466711harry s0enceharry s0enceharrys0ence1860-01-01londonse1 7pbmalepainter0.488917
+
+ +

These results can then be passed into the Cluster Studio Dashboard.

+
linker.visualisations.cluster_studio_dashboard(
+    df_predict,
+    clusters,
+    "dashboards/50k_deterministic_cluster.html",
+    sampling_method="by_cluster_size",
+    overwrite=True,
+)
+
+from IPython.display import IFrame
+
+IFrame(src="./dashboards/50k_deterministic_cluster.html", width="100%", height=1200)
+
+

+

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/examples/duckdb/febrl3.html b/demos/examples/duckdb/febrl3.html new file mode 100644 index 0000000000..dd0c8bf529 --- /dev/null +++ b/demos/examples/duckdb/febrl3.html @@ -0,0 +1,6155 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Febrl3 Dedupe - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Febrl3 Dedupe

+ +

Deduplicating the febrl3 dataset

+

See A.2 here and here for the source of this data

+

+ Open In Colab +

+
from splink.datasets import splink_datasets
+
+df = splink_datasets.febrl3
+
+
df = df.rename(columns=lambda x: x.strip())
+
+df["cluster"] = df["rec_id"].apply(lambda x: "-".join(x.split("-")[:2]))
+
+df["date_of_birth"] = df["date_of_birth"].astype(str).str.strip()
+df["soc_sec_id"] = df["soc_sec_id"].astype(str).str.strip()
+
+df.head(2)
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
rec_idgiven_namesurnamestreet_numberaddress_1address_2suburbpostcodestatedate_of_birthsoc_sec_idcluster
0rec-1496-orgmitchellgreen7wallaby placedelmarcleveland2119sa195604091804974rec-1496
1rec-552-dup-3harleymccarthy177pridhamstreetmiltonmarsden3165nsw190804196089216rec-552
+
+ +
+
from splink import DuckDBAPI, Linker, SettingsCreator
+
+# TODO:  Allow missingness to be analysed without a linker
+settings = SettingsCreator(
+    unique_id_column_name="rec_id",
+    link_type="dedupe_only",
+)
+
+linker = Linker(df, settings, db_api=DuckDBAPI())
+
+

It's usually a good idea to perform exploratory analysis on your data so you understand what's in each column and how often it's missing:

+
from splink.exploratory import completeness_chart
+
+completeness_chart(df, db_api=DuckDBAPI())
+
+ +
+ + +
from splink.exploratory import profile_columns
+
+profile_columns(df, db_api=DuckDBAPI(), column_expressions=["given_name", "surname"])
+
+ +
+ + +
from splink import DuckDBAPI, block_on
+from splink.blocking_analysis import (
+    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
+)
+
+blocking_rules = [
+    block_on("soc_sec_id"),
+    block_on("given_name"),
+    block_on("surname"),
+    block_on("date_of_birth"),
+    block_on("postcode"),
+]
+
+db_api = DuckDBAPI()
+cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
+    table_or_tables=df,
+    blocking_rules=blocking_rules,
+    db_api=db_api,
+    link_type="dedupe_only",
+    unique_id_column_name="rec_id",
+)
+
+ +
+ + +
import splink.comparison_library as cl
+
+from splink import Linker
+
+settings = SettingsCreator(
+    unique_id_column_name="rec_id",
+    link_type="dedupe_only",
+    blocking_rules_to_generate_predictions=blocking_rules,
+    comparisons=[
+        cl.NameComparison("given_name"),
+        cl.NameComparison("surname"),
+        cl.DateOfBirthComparison(
+            "date_of_birth",
+            input_is_string=True,
+            datetime_format="%Y%m%d",
+        ),
+        cl.DamerauLevenshteinAtThresholds("soc_sec_id", [2]),
+        cl.ExactMatch("street_number").configure(term_frequency_adjustments=True),
+        cl.ExactMatch("postcode").configure(term_frequency_adjustments=True),
+    ],
+    retain_intermediate_calculation_columns=True,
+)
+
+linker = Linker(df, settings, db_api=DuckDBAPI())
+
+
from splink import block_on
+
+deterministic_rules = [
+    block_on("soc_sec_id"),
+    block_on("given_name", "surname", "date_of_birth"),
+    "l.given_name = r.surname and l.surname = r.given_name and l.date_of_birth = r.date_of_birth",
+]
+
+linker.training.estimate_probability_two_random_records_match(
+    deterministic_rules, recall=0.9
+)
+
+
Probability two random records match is estimated to be  0.000528.
+This means that amongst all possible pairwise record comparisons, one in 1,893.56 are expected to match.  With 12,497,500 total possible comparisons, we expect a total of around 6,600.00 matching pairs
+
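The 12,497,500 figure is just the number of distinct pairs among the 5,000 febrl3 records, n(n-1)/2, and the expected match count follows from the "one in 1,893.56" rate reported above:

```python
# Pairwise comparison count when deduplicating n records: n * (n - 1) / 2.
n = 5_000  # number of records in the febrl3 dataset
total_comparisons = n * (n - 1) // 2
print(total_comparisons)  # 12497500, as logged above

expected_matches = total_comparisons / 1_893.56  # "one in 1,893.56"
print(round(expected_matches))  # ~6600 matching pairs
```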
+
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
+----- Estimating u probabilities using random sampling -----
+
+
+
+FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))
+
+
+u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
+u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
+u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
+
+Estimated u probabilities using random sampling
+
+Your model is not yet fully trained. Missing estimates for:
+    - given_name (no m values are trained).
+    - surname (no m values are trained).
+    - date_of_birth (some u values are not trained, no m values are trained).
+    - soc_sec_id (no m values are trained).
+    - street_number (no m values are trained).
+    - postcode (no m values are trained).
+
+
em_blocking_rule_1 = block_on("date_of_birth")
+session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
+    em_blocking_rule_1
+)
+
+
----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+l."date_of_birth" = r."date_of_birth"
+
+Parameter estimates will be made for the following comparison(s):
+    - given_name
+    - surname
+    - soc_sec_id
+    - street_number
+    - postcode
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - date_of_birth
+
+Iteration 1: Largest change in params was -0.376 in the m_probability of surname, level `Exact match on surname`
+Iteration 2: Largest change in params was 0.0156 in the m_probability of surname, level `All other comparisons`
+Iteration 3: Largest change in params was 0.000699 in the m_probability of postcode, level `All other comparisons`
+Iteration 4: Largest change in params was -3.77e-05 in the m_probability of postcode, level `Exact match on postcode`
+
+EM converged after 4 iterations
+
+Your model is not yet fully trained. Missing estimates for:
+    - date_of_birth (some u values are not trained, no m values are trained).
+
+
em_blocking_rule_2 = block_on("postcode")
+session_postcode = linker.training.estimate_parameters_using_expectation_maximisation(
+    em_blocking_rule_2
+)
+
+
----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+l."postcode" = r."postcode"
+
+Parameter estimates will be made for the following comparison(s):
+    - given_name
+    - surname
+    - date_of_birth
+    - soc_sec_id
+    - street_number
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - postcode
+
+WARNING:
+Level Abs difference of 'transformed date_of_birth <= 1 month' on comparison date_of_birth not observed in dataset, unable to train m value
+
+WARNING:
+Level Abs difference of 'transformed date_of_birth <= 1 year' on comparison date_of_birth not observed in dataset, unable to train m value
+
+WARNING:
+Level Abs difference of 'transformed date_of_birth <= 10 year' on comparison date_of_birth not observed in dataset, unable to train m value
+
+Iteration 1: Largest change in params was 0.0681 in probability_two_random_records_match
+Iteration 2: Largest change in params was -0.00185 in the m_probability of date_of_birth, level `Exact match on date_of_birth`
+Iteration 3: Largest change in params was -5.7e-05 in the m_probability of date_of_birth, level `Exact match on date_of_birth`
+
+EM converged after 3 iterations
+m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
+m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
+m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
+
+Your model is not yet fully trained. Missing estimates for:
+    - date_of_birth (some u values are not trained, some m values are not trained).
+
+
linker.visualisations.match_weights_chart()
+
+ +
+ + +
results = linker.inference.predict(threshold_match_probability=0.2)
+
+
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))
+
+
+
+ -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'date_of_birth':
+    m values not fully trained
+Comparison: 'date_of_birth':
+    u values not fully trained
+
+
linker.evaluation.accuracy_analysis_from_labels_column(
+    "cluster", match_weight_round_to_nearest=0.1, output_type="accuracy"
+)
+
+
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))
+
+
+
+ -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'date_of_birth':
+    m values not fully trained
+Comparison: 'date_of_birth':
+    u values not fully trained
+
+ +
+ + +
pred_errors_df = linker.evaluation.prediction_errors_from_labels_column(
+    "cluster"
+).as_pandas_dataframe()
+len(pred_errors_df)
+pred_errors_df.head()
+
+
 -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'date_of_birth':
+    m values not fully trained
+Comparison: 'date_of_birth':
+    u values not fully trained
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
clerical_match_scorefound_by_blocking_rulesmatch_weightmatch_probabilityrec_id_lrec_id_rgiven_name_lgiven_name_rgamma_given_nametf_given_name_l...postcode_lpostcode_rgamma_postcodetf_postcode_ltf_postcode_rbf_postcodebf_tf_adj_postcodecluster_lcluster_rmatch_key
01.0False-27.8057314.262268e-09rec-993-dup-1rec-993-dup-3westbrookjake00.0004...2704207400.00020.00140.2301731.0rec-993rec-9935
11.0False-27.8057314.262268e-09rec-829-dup-0rec-829-dup-2wildekyra00.0002...3859359500.00040.00060.2301731.0rec-829rec-8295
21.0False-19.7178771.159651e-06rec-829-dup-0rec-829-dup-1wildekyra00.0002...3859388900.00040.00020.2301731.0rec-829rec-8295
31.0True-15.4531902.229034e-05rec-721-dup-0rec-721-dup-1mikhailielly00.0008...4806486000.00080.00140.2301731.0rec-721rec-7212
41.0True-12.9317811.279648e-04rec-401-dup-1rec-401-dup-3whitbealexa-ose00.0002...3040304100.00200.00040.2301731.0rec-401rec-4010
+

5 rows × 45 columns

+
+ +

The following chart suggests that, where the model is making errors, it is because the data is corrupted beyond recognition, and no reasonable linkage model could find these matches.

+
records = linker.evaluation.prediction_errors_from_labels_column(
+    "cluster"
+).as_record_dict(limit=10)
+linker.visualisations.waterfall_chart(records)
+
+
 -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'date_of_birth':
+    m values not fully trained
+Comparison: 'date_of_birth':
+    u values not fully trained
+
+ +
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/examples/duckdb/febrl4.html b/demos/examples/duckdb/febrl4.html new file mode 100644 index 0000000000..6adce16588 --- /dev/null +++ b/demos/examples/duckdb/febrl4.html @@ -0,0 +1,7416 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Febrl4 link-only - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

Febrl4 link-only

+ +

Linking the febrl4 datasets

+

See A.2 here and here for the source of this data.

+

It consists of two datasets, A and B, of 5000 records each, with each record in dataset A having a corresponding record in dataset B. The aim will be to capture as many of those 5000 true links as possible, with minimal false linkages.

+

It is worth noting that we should not necessarily expect to capture all links. For some pairs, although we know they correspond to the same person, the data is so mismatched between them that we could not reasonably expect a model to link them; indeed, if a model did link them, it might indicate that we had overengineered things using our knowledge of the true links, which would not be a helpful reference in situations where we attempt to link unlabelled data, as will usually be the case.

+

+ Open In Colab +

+

Exploring data and defining model

+

Firstly let's read in the data and have a little look at it

+
from splink import splink_datasets
+
+df_a = splink_datasets.febrl4a
+df_b = splink_datasets.febrl4b
+
+
+def prepare_data(data):
+    data = data.rename(columns=lambda x: x.strip())
+    data["cluster"] = data["rec_id"].apply(lambda x: "-".join(x.split("-")[:2]))
+    data["date_of_birth"] = data["date_of_birth"].astype(str).str.strip()
+    data["soc_sec_id"] = data["soc_sec_id"].astype(str).str.strip()
+    data["postcode"] = data["postcode"].astype(str).str.strip()
+    return data
+
+
+dfs = [prepare_data(dataset) for dataset in [df_a, df_b]]
+
+display(dfs[0].head(2))
+display(dfs[1].head(2))
+
+
rec_idgiven_namesurnamestreet_numberaddress_1address_2suburbpostcodestatedate_of_birthsoc_sec_idcluster
0rec-1070-orgmichaelaneumann8stanley streetmiamiwinston hills4223nsw191511115304218rec-1070
1rec-1016-orgcourtneypainter12pinkerton circuitbega flatsrichlands4560vic191612144066625rec-1016
+
+ +
rec_idgiven_namesurnamestreet_numberaddress_1address_2suburbpostcodestatedate_of_birthsoc_sec_idcluster
0rec-561-dup-0elton3light setreetpinehillwindermere3212vic196510131551941rec-561
1rec-2642-dup-0mitchellmaxon47edkins streetlochaoairnorth ryde3355nsw193902128859999rec-2642
+
+ +
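The cluster labels created in prepare_data simply strip the record-id suffix, so the original record and its duplicates share a label. A standalone illustration of that transformation:

```python
# Derive the cluster label by dropping the "-org"/"-dup-N" suffix from rec_id,
# mirroring the lambda used in prepare_data above.
def to_cluster(rec_id: str) -> str:
    return "-".join(rec_id.split("-")[:2])

print(to_cluster("rec-1070-org"))   # rec-1070
print(to_cluster("rec-561-dup-0"))  # rec-561
```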

Next, to better understand which variables will prove useful in linking, we have a look at how populated each column is, as well as the distribution of unique values within each

+
from splink import DuckDBAPI, Linker, SettingsCreator
+
+basic_settings = SettingsCreator(
+    unique_id_column_name="rec_id",
+    link_type="link_only",
+    # NB as we are linking one-one, we know the probability that a random pair will be a match
+    # hence we could set:
+    # "probability_two_random_records_match": 1/5000,
+    # however we will not specify this here, as we will use this as a check that
+    # our estimation procedure returns something sensible
+)
+
+linker = Linker(dfs, basic_settings, db_api=DuckDBAPI())
+
+

It's usually a good idea to perform exploratory analysis on your data so you understand what's in each column and how often it's missing

+
from splink.exploratory import completeness_chart
+
+completeness_chart(dfs, db_api=DuckDBAPI())
+
+ +
+ + +
from splink.exploratory import profile_columns
+
+profile_columns(dfs, db_api=DuckDBAPI(), column_expressions=["given_name", "surname"])
+
+ +
+ + +

Next let's come up with some candidate blocking rules, which define which record comparisons are generated, and have a look at how many comparisons each will generate.

+

For blocking rules that we use in prediction, our aim is to have the union of all rules cover all true matches, whilst avoiding generating so many comparisons that it becomes computationally intractable - i.e. each true match should have at least one of the following conditions holding.

+
from splink import DuckDBAPI, block_on
+from splink.blocking_analysis import (
+    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
+)
+
+blocking_rules = [
+    block_on("given_name", "surname"),
+    # A blocking rule can also be an arbitrary SQL expression
+    "l.given_name = r.surname and l.surname = r.given_name",
+    block_on("date_of_birth"),
+    block_on("soc_sec_id"),
+    block_on("state", "address_1"),
+    block_on("street_number", "address_1"),
+    block_on("postcode"),
+]
+
+
+db_api = DuckDBAPI()
+cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
+    table_or_tables=dfs,
+    blocking_rules=blocking_rules,
+    db_api=db_api,
+    link_type="link_only",
+    unique_id_column_name="rec_id",
+    source_dataset_column_name="source_dataset",
+)
+
+ +
+ + +

The broadest rule, having a matching postcode, unsurprisingly gives the largest number of comparisons. For this small dataset we still have a very manageable number, but if it were larger we might have needed to include a further AND condition with it to reduce the number of comparisons further.

+
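To see why conjoining an extra condition tames a broad rule, here is a toy, pure-python sketch (invented records) of how the comparison count shrinks when a second condition is ANDed onto a postcode block:

```python
from itertools import product

# Toy records: everything shares a postcode, so the broad rule generates
# every cross-dataset pair; adding a surname condition prunes most of them.
recs_a = [{"postcode": "2000", "surname": "smith"}, {"postcode": "2000", "surname": "jones"}]
recs_b = [{"postcode": "2000", "surname": "smith"}, {"postcode": "2000", "surname": "brown"}]

broad = [(l, r) for l, r in product(recs_a, recs_b) if l["postcode"] == r["postcode"]]
tight = [(l, r) for l, r in broad if l["surname"] == r["surname"]]
print(len(broad), len(tight))  # 4 1
```

In Splink terms the tighter rule would simply be written as `block_on("postcode", "surname")`.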

Now we get the full settings by including the blocking rules, as well as deciding the actual comparisons we will be including in our model.

+

We will define two models, each with a separate linker with different settings, so that we can compare performance. One will be a very basic model, whilst the other will include a lot more detail.

+
import splink.comparison_level_library as cll
+import splink.comparison_library as cl
+
+
+# the simple model only considers a few columns, and only two comparison levels for each
+simple_model_settings = SettingsCreator(
+    unique_id_column_name="rec_id",
+    link_type="link_only",
+    blocking_rules_to_generate_predictions=blocking_rules,
+    comparisons=[
+        cl.ExactMatch("given_name").configure(term_frequency_adjustments=True),
+        cl.ExactMatch("surname").configure(term_frequency_adjustments=True),
+        cl.ExactMatch("street_number").configure(term_frequency_adjustments=True),
+    ],
+    retain_intermediate_calculation_columns=True,
+)
+
+# the detailed model considers more columns, using the information we saw in the exploratory phase
+# we also include further comparison levels to account for typos and other differences
+detailed_model_settings = SettingsCreator(
+    unique_id_column_name="rec_id",
+    link_type="link_only",
+    blocking_rules_to_generate_predictions=blocking_rules,
+    comparisons=[
+        cl.NameComparison("given_name").configure(term_frequency_adjustments=True),
+        cl.NameComparison("surname").configure(term_frequency_adjustments=True),
+        cl.DateOfBirthComparison(
+            "date_of_birth",
+            input_is_string=True,
+            datetime_format="%Y%m%d",
+            invalid_dates_as_null=True,
+        ),
+        cl.DamerauLevenshteinAtThresholds("soc_sec_id", [1, 2]),
+        cl.ExactMatch("street_number").configure(term_frequency_adjustments=True),
+        cl.DamerauLevenshteinAtThresholds("postcode", [1, 2]).configure(
+            term_frequency_adjustments=True
+        ),
+        # we don't consider further location columns as they will be strongly correlated with postcode
+    ],
+    retain_intermediate_calculation_columns=True,
+)
+
+
+linker_simple = Linker(dfs, simple_model_settings, db_api=DuckDBAPI())
+linker_detailed = Linker(dfs, detailed_model_settings, db_api=DuckDBAPI())
+
+

Estimating model parameters

+

We need to furnish our models with parameter estimates so that we can generate results. We will focus on the detailed model, generating the values for the simple model at the end

+

We can instead estimate the probability two random records match, and compare with the known value of 1/5000 = 0.0002, to see how well our estimation procedure works.

+

To do this we come up with some deterministic rules - the aim here is that we generate very few false positives (i.e. we expect that the majority of records with at least one of these conditions holding are true matches), whilst also capturing the majority of matches - our guess here is that these two rules should capture 80% of all matches.

+
deterministic_rules = [
+    block_on("soc_sec_id"),
+    block_on("given_name", "surname", "date_of_birth"),
+]
+
+linker_detailed.training.estimate_probability_two_random_records_match(
+    deterministic_rules, recall=0.8
+)
+
+
Probability two random records match is estimated to be  0.000239.
+This means that amongst all possible pairwise record comparisons, one in 4,185.85 are expected to match.  With 25,000,000 total possible comparisons, we expect a total of around 5,972.50 matching pairs
+
+

Even playing around with changing these deterministic rules, or the nominal recall, leaves us with an answer which is pretty close to our known value

+
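The arithmetic behind the estimate above can be sketched as follows. Note this is an illustration only: the pair count is hypothetical and the scaling mechanism (pairs found by the deterministic rules, scaled up by 1/recall, divided by the total number of cross-dataset comparisons) is an assumption about how the estimate is formed.

```python
recall = 0.8
pairs_found_by_rules = 4778      # hypothetical count from the deterministic rules
total_comparisons = 5000 * 5000  # link_only: every A record against every B record

# Scale up by 1/recall to account for true matches the rules miss,
# then normalise by the number of possible pairwise comparisons.
estimate = pairs_found_by_rules / recall / total_comparisons
true_value = 1 / 5000            # known ground truth: one-to-one correspondence
print(round(estimate, 6), true_value)  # 0.000239 0.0002
```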

Next we estimate u and m values for each comparison, so that we can move to generating predictions

+
# We generally recommend setting max pairs higher (e.g. 1e7 or more)
+# But this will run faster for the purpose of this demo
+linker_detailed.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
+----- Estimating u probabilities using random sampling -----
+
+
+
+FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))
+
+
+u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
+u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
+u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
+
+Estimated u probabilities using random sampling
+
+Your model is not yet fully trained. Missing estimates for:
+    - given_name (no m values are trained).
+    - surname (no m values are trained).
+    - date_of_birth (some u values are not trained, no m values are trained).
+    - soc_sec_id (no m values are trained).
+    - street_number (no m values are trained).
+    - postcode (no m values are trained).
+
+

When training the m values using expectation maximisation, we need some more blocking rules to reduce the total number of comparisons. For each rule, we want to ensure that we have neither proportionally too many matches nor too few.

+

We must run this multiple times using different rules so that we can obtain estimates for all comparisons - if we block on e.g. date_of_birth, then we cannot compute the m values for the date_of_birth comparison, as we have only looked at records where these match.

+
session_dob = (
+    linker_detailed.training.estimate_parameters_using_expectation_maximisation(
+        block_on("date_of_birth"), estimate_without_term_frequencies=True
+    )
+)
+session_pc = (
+    linker_detailed.training.estimate_parameters_using_expectation_maximisation(
+        block_on("postcode"), estimate_without_term_frequencies=True
+    )
+)
+
+
----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+l."date_of_birth" = r."date_of_birth"
+
+Parameter estimates will be made for the following comparison(s):
+    - given_name
+    - surname
+    - soc_sec_id
+    - street_number
+    - postcode
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - date_of_birth
+
+Iteration 1: Largest change in params was -0.331 in probability_two_random_records_match
+Iteration 2: Largest change in params was 0.00365 in the m_probability of given_name, level `All other comparisons`
+Iteration 3: Largest change in params was 9.22e-05 in the m_probability of soc_sec_id, level `All other comparisons`
+
+EM converged after 3 iterations
+
+Your model is not yet fully trained. Missing estimates for:
+    - date_of_birth (some u values are not trained, no m values are trained).
+
+----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+l."postcode" = r."postcode"
+
+Parameter estimates will be made for the following comparison(s):
+    - given_name
+    - surname
+    - date_of_birth
+    - soc_sec_id
+    - street_number
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - postcode
+
+WARNING:
+Level Abs difference of 'transformed date_of_birth <= 1 month' on comparison date_of_birth not observed in dataset, unable to train m value
+
+WARNING:
+Level Abs difference of 'transformed date_of_birth <= 1 year' on comparison date_of_birth not observed in dataset, unable to train m value
+
+WARNING:
+Level Abs difference of 'transformed date_of_birth <= 10 year' on comparison date_of_birth not observed in dataset, unable to train m value
+
+Iteration 1: Largest change in params was 0.0374 in the m_probability of date_of_birth, level `All other comparisons`
+Iteration 2: Largest change in params was 0.000457 in the m_probability of date_of_birth, level `All other comparisons`
+Iteration 3: Largest change in params was 7.66e-06 in the m_probability of soc_sec_id, level `All other comparisons`
+
+EM converged after 3 iterations
+m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
+m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
+m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
+
+Your model is not yet fully trained. Missing estimates for:
+    - date_of_birth (some u values are not trained, some m values are not trained).
+
+

If we wish, we can have a look at how our parameter estimates change over these training sessions

+
session_dob.m_u_values_interactive_history_chart()
+
+ +
+ + +

For variables that aren't used in the m-training blocking rules, we have two estimates --- one from each of the training sessions (see for example street_number). We can have a look at how the values compare between them, to ensure that we don't have drastically different values, which may be indicative of an issue.

+
linker_detailed.visualisations.parameter_estimate_comparisons_chart()
+
+ +
+ + +

We repeat our parameter estimations for the simple model in much the same fashion

+
linker_simple.training.estimate_probability_two_random_records_match(
+    deterministic_rules, recall=0.8
+)
+linker_simple.training.estimate_u_using_random_sampling(max_pairs=1e7)
+session_ssid = (
+    linker_simple.training.estimate_parameters_using_expectation_maximisation(
+        block_on("given_name"), estimate_without_term_frequencies=True
+    )
+)
+session_pc = linker_simple.training.estimate_parameters_using_expectation_maximisation(
+    block_on("street_number"), estimate_without_term_frequencies=True
+)
+linker_simple.visualisations.parameter_estimate_comparisons_chart()
+
+
Probability two random records match is estimated to be  0.000239.
+This means that amongst all possible pairwise record comparisons, one in 4,185.85 are expected to match.  With 25,000,000 total possible comparisons, we expect a total of around 5,972.50 matching pairs
+----- Estimating u probabilities using random sampling -----
+
+Estimated u probabilities using random sampling
+
+Your model is not yet fully trained. Missing estimates for:
+    - given_name (no m values are trained).
+    - surname (no m values are trained).
+    - street_number (no m values are trained).
+
+----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+l."given_name" = r."given_name"
+
+Parameter estimates will be made for the following comparison(s):
+    - surname
+    - street_number
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - given_name
+
+Iteration 1: Largest change in params was 0.0812 in the m_probability of surname, level `All other comparisons`
+Iteration 2: Largest change in params was -0.0261 in the m_probability of surname, level `Exact match on surname`
+Iteration 3: Largest change in params was -0.0247 in the m_probability of surname, level `Exact match on surname`
+Iteration 4: Largest change in params was 0.0227 in the m_probability of surname, level `All other comparisons`
+Iteration 5: Largest change in params was -0.0198 in the m_probability of surname, level `Exact match on surname`
+Iteration 6: Largest change in params was 0.0164 in the m_probability of surname, level `All other comparisons`
+Iteration 7: Largest change in params was -0.0131 in the m_probability of surname, level `Exact match on surname`
+Iteration 8: Largest change in params was 0.0101 in the m_probability of surname, level `All other comparisons`
+Iteration 9: Largest change in params was -0.00769 in the m_probability of surname, level `Exact match on surname`
+Iteration 10: Largest change in params was 0.00576 in the m_probability of surname, level `All other comparisons`
+Iteration 11: Largest change in params was -0.00428 in the m_probability of surname, level `Exact match on surname`
+Iteration 12: Largest change in params was 0.00316 in the m_probability of surname, level `All other comparisons`
+Iteration 13: Largest change in params was -0.00234 in the m_probability of surname, level `Exact match on surname`
+Iteration 14: Largest change in params was -0.00172 in the m_probability of surname, level `Exact match on surname`
+Iteration 15: Largest change in params was 0.00127 in the m_probability of surname, level `All other comparisons`
+Iteration 16: Largest change in params was -0.000939 in the m_probability of surname, level `Exact match on surname`
+Iteration 17: Largest change in params was -0.000694 in the m_probability of surname, level `Exact match on surname`
+Iteration 18: Largest change in params was -0.000514 in the m_probability of surname, level `Exact match on surname`
+Iteration 19: Largest change in params was -0.000381 in the m_probability of surname, level `Exact match on surname`
+Iteration 20: Largest change in params was -0.000282 in the m_probability of surname, level `Exact match on surname`
+Iteration 21: Largest change in params was 0.00021 in the m_probability of surname, level `All other comparisons`
+Iteration 22: Largest change in params was -0.000156 in the m_probability of surname, level `Exact match on surname`
+Iteration 23: Largest change in params was 0.000116 in the m_probability of surname, level `All other comparisons`
+Iteration 24: Largest change in params was 8.59e-05 in the m_probability of surname, level `All other comparisons`
+
+EM converged after 24 iterations
+
+Your model is not yet fully trained. Missing estimates for:
+    - given_name (no m values are trained).
+
+----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+l."street_number" = r."street_number"
+
+Parameter estimates will be made for the following comparison(s):
+    - given_name
+    - surname
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - street_number
+
+Iteration 1: Largest change in params was -0.0446 in the m_probability of surname, level `Exact match on surname`
+Iteration 2: Largest change in params was -0.0285 in the m_probability of surname, level `All other comparisons`
+Iteration 3: Largest change in params was -0.026 in the m_probability of given_name, level `Exact match on given_name`
+Iteration 4: Largest change in params was 0.0252 in the m_probability of given_name, level `All other comparisons`
+Iteration 5: Largest change in params was -0.0231 in the m_probability of given_name, level `Exact match on given_name`
+Iteration 6: Largest change in params was -0.02 in the m_probability of given_name, level `Exact match on given_name`
+Iteration 7: Largest change in params was -0.0164 in the m_probability of given_name, level `Exact match on given_name`
+Iteration 8: Largest change in params was -0.013 in the m_probability of given_name, level `Exact match on given_name`
+Iteration 9: Largest change in params was 0.01 in the m_probability of given_name, level `All other comparisons`
+Iteration 10: Largest change in params was -0.00757 in the m_probability of given_name, level `Exact match on given_name`
+Iteration 11: Largest change in params was 0.00564 in the m_probability of given_name, level `All other comparisons`
+Iteration 12: Largest change in params was -0.00419 in the m_probability of given_name, level `Exact match on given_name`
+Iteration 13: Largest change in params was 0.0031 in the m_probability of given_name, level `All other comparisons`
+Iteration 14: Largest change in params was -0.00231 in the m_probability of given_name, level `Exact match on given_name`
+Iteration 15: Largest change in params was -0.00173 in the m_probability of given_name, level `Exact match on given_name`
+Iteration 16: Largest change in params was 0.0013 in the m_probability of given_name, level `All other comparisons`
+Iteration 17: Largest change in params was 0.000988 in the m_probability of given_name, level `All other comparisons`
+Iteration 18: Largest change in params was -0.000756 in the m_probability of given_name, level `Exact match on given_name`
+Iteration 19: Largest change in params was -0.000584 in the m_probability of given_name, level `Exact match on given_name`
+Iteration 20: Largest change in params was -0.000465 in the m_probability of surname, level `Exact match on surname`
+Iteration 21: Largest change in params was -0.000388 in the m_probability of surname, level `Exact match on surname`
+Iteration 22: Largest change in params was -0.000322 in the m_probability of surname, level `Exact match on surname`
+Iteration 23: Largest change in params was 0.000266 in the m_probability of surname, level `All other comparisons`
+Iteration 24: Largest change in params was -0.000219 in the m_probability of surname, level `Exact match on surname`
+Iteration 25: Largest change in params was -0.00018 in the m_probability of surname, level `Exact match on surname`
+
+EM converged after 25 iterations
+
+Your model is fully trained. All comparisons have at least one estimate for their m and u values
+
+ +
+ + +
# import json
+# we can have a look at the full settings if we wish, including the values of our estimated parameters:
+# print(json.dumps(linker_detailed._settings_obj.as_dict(), indent=2))
+# we can also get a handy summary of the model in an easily readable format if we wish:
+# print(linker_detailed._settings_obj.human_readable_description)
+# (we suppress output here for brevity)
+
+

We can now visualise some of the details of our models. We can look at the match weights, which tell us the relative importance for/against a match for each of our comparison levels.

+

Comparing the two models will show the added benefit we get in the more detailed model --- what in the simple model is classed as 'all other comparisons' is instead broken down further, and we can see that the detail of how this is broken down in fact gives us quite a bit of useful information about the likelihood of a match.

+
linker_simple.visualisations.match_weights_chart()
+
+ +
+ + +
linker_detailed.visualisations.match_weights_chart()
+
+ +
+ + +

As well as the match weights, which give us an idea of the overall effect of each comparison level, we can also look at the individual u and m parameter estimates, which tell us about the prevalence of coincidences and mistakes (for further details/explanation about this see this article). We might want to revise aspects of our model based on the information we ascertain here.

+
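As a reminder of how m and u combine into the match weights shown earlier: under the Fellegi-Sunter model, the weight contributed by a comparison level is the log2 of its Bayes factor m/u. A small sketch with hypothetical values:

```python
import math

# m = P(observe this level | records are a match)
# u = P(observe this level | records are not a match)
m = 0.9    # hypothetical: level is common among true matches
u = 0.01   # hypothetical: level arises rarely by coincidence
match_weight = math.log2(m / u)
print(round(match_weight, 2))  # 6.49
```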

Note however that some of these values are very small, which is why the match weight chart is often more useful for getting a decent picture of things.

+
# linker_simple.m_u_parameters_chart()
+linker_detailed.visualisations.m_u_parameters_chart()
+
+ +
+ + +

It is also useful to have a look at unlinkable records - these are records which do not contain enough information to be linked at some match probability threshold. We can figure this out by seeing whether records are able to be matched with themselves.

+

This is of course relative to the information we have put into the model - we see that in our simple model, at a 99% match threshold nearly 10% of records are unlinkable, as we have not included enough information in the model for distinct records to be adequately distinguished; this is not an issue in our more detailed model.

+
linker_simple.evaluation.unlinkables_chart()
+
+ +
+ + +
linker_detailed.evaluation.unlinkables_chart()
+
+ +
+ + +

Our simple model doesn't do terribly, but suffers if we want to have a high match probability --- to be 99% (match weight ~7) certain of matches we have ~10% of records that we will be unable to link.

+
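The correspondence quoted above between a 99% match probability and a match weight of ~7 follows from the match weight being a log2 odds, so probability = 2**w / (1 + 2**w). A quick check:

```python
# Convert a match weight (log2 odds of a match) into a match probability.
def weight_to_probability(w: float) -> float:
    odds = 2**w
    return odds / (1 + odds)

print(round(weight_to_probability(7), 3))  # 0.992
```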

Our detailed model, however, has enough nuance that we can at least self-link records.

+

Predictions

+

Now that we have had a look into the details of the models, we will focus on only our more detailed model, which should be able to capture more of the genuine links in our data

+
predictions = linker_detailed.inference.predict(threshold_match_probability=0.2)
+df_predictions = predictions.as_pandas_dataframe()
+df_predictions.head(5)
+
+
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))
+
+
+
+ -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'date_of_birth':
+    m values not fully trained
+Comparison: 'date_of_birth':
+    u values not fully trained
+
+
match_weightmatch_probabilitysource_dataset_lsource_dataset_rrec_id_lrec_id_rgiven_name_lgiven_name_rgamma_given_nametf_given_name_l...gamma_postcodetf_postcode_ltf_postcode_rbf_postcodebf_tf_adj_postcodeaddress_1_laddress_1_rstate_lstate_rmatch_key
0-1.8300010.219521__splink__input_table_0__splink__input_table_1rec-760-orgrec-3951-dup-0lachlanlachlan40.0113...30.00070.0007759.4071551.583362bushby closetemplestoew avenuenswvic0
1-1.8017360.222896__splink__input_table_0__splink__input_table_1rec-4980-orgrec-4980-dup-0isabellactercteko00.0069...30.00040.0004759.4071552.770884sturt avenuesturta venuevicvic2
2-1.2717940.292859__splink__input_table_0__splink__input_table_1rec-585-orgrec-585-dup-0dannystephenson00.0001...20.00160.001211.2648251.000000o'shanassy streeto'shanassy streettastas1
3-1.2134410.301305__splink__input_table_0__splink__input_table_1rec-1250-orgrec-1250-dup-0lukegazzola00.0055...20.00150.000211.2648251.000000newman morris circuitnewman morr is circuitnswnsw1
4-0.3803360.434472__splink__input_table_0__splink__input_table_1rec-4763-orgrec-4763-dup-0maxalisha00.0021...10.00040.00160.0435651.000000duffy streetduffy s treetnswnsw2
+

5 rows × 47 columns

+
+ +

We can see how our model performs at different probability thresholds, with a couple of options depending on the space we wish to view things

+
linker_detailed.evaluation.accuracy_analysis_from_labels_column(
+    "cluster", output_type="accuracy"
+)
+
+
 -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'date_of_birth':
+    m values not fully trained
+Comparison: 'date_of_birth':
+    u values not fully trained
+
+ +
+ + +

and we can easily see how many individuals we identify and link by looking at clusters generated at some threshold match probability of interest - in this example 99%

+
clusters = linker_detailed.clustering.cluster_pairwise_predictions_at_threshold(
+    predictions, threshold_match_probability=0.99
+)
+df_clusters = clusters.as_pandas_dataframe().sort_values("cluster_id")
+df_clusters.groupby("cluster_id").size().value_counts()
+
+
Completed iteration 1, root rows count 0
+
+
+
+
+
+2    4959
+1      82
+Name: count, dtype: int64
+
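The cluster-size counts above can be turned into a quick accounting check: each size-2 cluster is a captured A-B link, and each singleton is a record whose true match fell below the 99% threshold (counts taken from the output above):

```python
size_counts = {2: 4959, 1: 82}
linked_pairs = size_counts[2]       # clusters of two records = links found
unlinked_records = size_counts[1]   # singletons left unlinked at this threshold

# All 10,000 input records are accounted for:
print(linked_pairs * 2 + unlinked_records)  # 10000

# Since cluster sizes here are only 1 or 2, each missed link leaves
# two singletons, so the 5,000 true links split as:
missed_links = unlinked_records // 2
print(linked_pairs + missed_links)  # 5000
```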
+

In this case, we happen to know what the true links are, so we can manually inspect the ones that are doing worst to see what our model is not capturing - i.e. where we have false negatives.

+

Similarly, we can look at the non-links which are performing the best, to see whether we have an issue with false positives.

+

Ordinarily we would not have this luxury, and so would need to dig a bit deeper for clues as to how to improve our model, such as manually inspecting records across threshold probabilities:

+
df_predictions["cluster_l"] = df_predictions["rec_id_l"].apply(
+    lambda x: "-".join(x.split("-")[:2])
+)
+df_predictions["cluster_r"] = df_predictions["rec_id_r"].apply(
+    lambda x: "-".join(x.split("-")[:2])
+)
+df_true_links = df_predictions[
+    df_predictions["cluster_l"] == df_predictions["cluster_r"]
+].sort_values("match_probability")
+
+
records_to_view = 3
+linker_detailed.visualisations.waterfall_chart(
+    df_true_links.head(records_to_view).to_dict(orient="records")
+)
+
+ +
+ + +
df_non_links = df_predictions[
+    df_predictions["cluster_l"] != df_predictions["cluster_r"]
+].sort_values("match_probability", ascending=False)
+linker_detailed.visualisations.waterfall_chart(
+    df_non_links.head(records_to_view).to_dict(orient="records")
+)
+
+ +
+ + +

Further refinements

+

Looking at the non-links, we have done well in having no false positives at any substantial match probability. However, looking at some of the true links, we can see that there are a few that we are not capturing with sufficient match probability.

+

We can see that there are a few features that we are not capturing or weighting appropriately:

+
    +
  • single-character transpositions, particularly in postcode (which are being lumped in with more 'severe' typos/probable non-matches)
  • +
  • given names and surnames being swapped, possibly in combination with typos
  • +
  • given names and surnames cross-matching on only one of the names, with no match on the other
  • +
+
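Transpositions of this kind are why the refined model below uses Damerau-Levenshtein levels: under the 'optimal string alignment' variant of that distance, an adjacent swap costs a single edit, whereas plain Levenshtein charges two. A self-contained sketch (the first pair of postcodes is made up; the street pair is from the predictions shown earlier):

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment ('restricted' Damerau-Levenshtein):
    insertions, deletions, substitutions, plus adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,       # deletion
                d[i][j - 1] + 1,       # insertion
                d[i - 1][j - 1] + cost # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


print(osa_distance("2614", "2164"))                   # 1: one adjacent swap
print(osa_distance("duffy street", "duffy s treet"))  # 1: one inserted space
```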

We will quickly see if we can incorporate these features into a new model. As we are now going into more detail with the inter-relationship between given name and surname, it is probably no longer sensible to model them as independent comparisons, and so we will need to switch to a combined comparison on full name.

+
# we need to append a full name column to our source data frames
+# so that we can use it for term frequency adjustments
+dfs[0]["full_name"] = dfs[0]["given_name"] + "_" + dfs[0]["surname"]
+dfs[1]["full_name"] = dfs[1]["given_name"] + "_" + dfs[1]["surname"]
+
+
+extended_model_settings = {
+    "unique_id_column_name": "rec_id",
+    "link_type": "link_only",
+    "blocking_rules_to_generate_predictions": blocking_rules,
+    "comparisons": [
+        {
+            "output_column_name": "Full name",
+            "comparison_levels": [
+                {
+                    "sql_condition": "(given_name_l IS NULL OR given_name_r IS NULL) and (surname_l IS NULL OR surname_r IS NULL)",
+                    "label_for_charts": "Null",
+                    "is_null_level": True,
+                },
+                # full name match
+                cll.ExactMatchLevel("full_name", term_frequency_adjustments=True),
+                # typos - keep levels across full name rather than scoring separately
+                cll.JaroWinklerLevel("full_name", 0.9),
+                cll.JaroWinklerLevel("full_name", 0.7),
+                # name switched
+                cll.ColumnsReversedLevel("given_name", "surname"),
+                # name switched + typo
+                {
+                    "sql_condition": "jaro_winkler_similarity(given_name_l, surname_r) + jaro_winkler_similarity(surname_l, given_name_r) >= 1.8",
+                    "label_for_charts": "switched + jaro_winkler_similarity >= 1.8",
+                },
+                {
+                    "sql_condition": "jaro_winkler_similarity(given_name_l, surname_r) + jaro_winkler_similarity(surname_l, given_name_r) >= 1.4",
+                    "label_for_charts": "switched + jaro_winkler_similarity >= 1.4",
+                },
+                # single name match
+                cll.ExactMatchLevel("given_name", term_frequency_adjustments=True),
+                cll.ExactMatchLevel("surname", term_frequency_adjustments=True),
+                # single name cross-match
+                {
+                    "sql_condition": "given_name_l = surname_r OR surname_l = given_name_r",
+                    "label_for_charts": "single name cross-matches",
+                },  # single name typos
+                cll.JaroWinklerLevel("given_name", 0.9),
+                cll.JaroWinklerLevel("surname", 0.9),
+                # the rest
+                cll.ElseLevel(),
+            ],
+        },
+        cl.DateOfBirthComparison(
+            "date_of_birth",
+            input_is_string=True,
+            datetime_format="%Y%m%d",
+            invalid_dates_as_null=True,
+        ),
+        {
+            "output_column_name": "Social security ID",
+            "comparison_levels": [
+                cll.NullLevel("soc_sec_id"),
+                cll.ExactMatchLevel("soc_sec_id", term_frequency_adjustments=True),
+                cll.DamerauLevenshteinLevel("soc_sec_id", 1),
+                cll.DamerauLevenshteinLevel("soc_sec_id", 2),
+                cll.ElseLevel(),
+            ],
+        },
+        {
+            "output_column_name": "Street number",
+            "comparison_levels": [
+                cll.NullLevel("street_number"),
+                cll.ExactMatchLevel("street_number", term_frequency_adjustments=True),
+                cll.DamerauLevenshteinLevel("street_number", 1),
+                cll.ElseLevel(),
+            ],
+        },
+        {
+            "output_column_name": "Postcode",
+            "comparison_levels": [
+                cll.NullLevel("postcode"),
+                cll.ExactMatchLevel("postcode", term_frequency_adjustments=True),
+                cll.DamerauLevenshteinLevel("postcode", 1),
+                cll.DamerauLevenshteinLevel("postcode", 2),
+                cll.ElseLevel(),
+            ],
+        },
+        # we don't consider further location columns as they will be strongly correlated with postcode
+    ],
+    "retain_intermediate_calculation_columns": True,
+}
+
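The switched-name levels above rely on summing jaro_winkler_similarity across the crossed name columns. To give a feel for how that similarity behaves, here is a plain-Python approximation of the standard Jaro-Winkler definition (the real values come from DuckDB's jaro_winkler_similarity function, which may differ slightly in edge cases; the names are illustrative):

```python
def jaro(s: str, t: str) -> float:
    if s == t:
        return 1.0
    window = max(len(s), len(t)) // 2 - 1
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, c in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_matched[j] and t[j] == c:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i, c in enumerate(s):
        if s_matched[i]:
            while not t_matched[k]:
                k += 1
            if c != t[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, p: float = 0.1) -> float:
    sim = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):  # common prefix, capped at 4 chars
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


# Swapped names with a typo: the crossed similarities sum comfortably above 1.8,
# so such a pair lands in the 'switched + jaro_winkler_similarity >= 1.8' level
crossed = jaro_winkler("stephenson", "stephenson") + jaro_winkler("dany", "danny")
print(round(crossed, 3))  # 1.953
```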
+
# train
+linker_advanced = Linker(dfs, extended_model_settings, db_api=DuckDBAPI())
+linker_advanced.training.estimate_probability_two_random_records_match(
+    deterministic_rules, recall=0.8
+)
+# We recommend increasing target rows to 1e8 to improve accuracy for u
+# values in full name comparison, as we have subdivided the data more finely
+
+# Here, 1e7 for speed
+linker_advanced.training.estimate_u_using_random_sampling(max_pairs=1e7)
+
+
Probability two random records match is estimated to be  0.000239.
+This means that amongst all possible pairwise record comparisons, one in 4,185.85 are expected to match.  With 25,000,000 total possible comparisons, we expect a total of around 5,972.50 matching pairs
+----- Estimating u probabilities using random sampling -----
+
+
+
+FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))
+
+
+u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
+u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
+u probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
+
+Estimated u probabilities using random sampling
+
+Your model is not yet fully trained. Missing estimates for:
+    - Full name (no m values are trained).
+    - date_of_birth (some u values are not trained, no m values are trained).
+    - Social security ID (no m values are trained).
+    - Street number (no m values are trained).
+    - Postcode (no m values are trained).
+
+
session_dob = (
+    linker_advanced.training.estimate_parameters_using_expectation_maximisation(
+        "l.date_of_birth = r.date_of_birth", estimate_without_term_frequencies=True
+    )
+)
+
+
----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+l.date_of_birth = r.date_of_birth
+
+Parameter estimates will be made for the following comparison(s):
+    - Full name
+    - Social security ID
+    - Street number
+    - Postcode
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - date_of_birth
+
+WARNING:
+Level single name cross-matches on comparison Full name not observed in dataset, unable to train m value
+
+Iteration 1: Largest change in params was -0.465 in the m_probability of Full name, level `Exact match on full_name`
+Iteration 2: Largest change in params was 0.00252 in the m_probability of Social security ID, level `All other comparisons`
+Iteration 3: Largest change in params was 4.98e-05 in the m_probability of Social security ID, level `All other comparisons`
+
+EM converged after 3 iterations
+m probability not trained for Full name - single name cross-matches (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
+
+Your model is not yet fully trained. Missing estimates for:
+    - Full name (some m values are not trained).
+    - date_of_birth (some u values are not trained, no m values are trained).
+
+
session_pc = (
+    linker_advanced.training.estimate_parameters_using_expectation_maximisation(
+        "l.postcode = r.postcode", estimate_without_term_frequencies=True
+    )
+)
+
+
----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+l.postcode = r.postcode
+
+Parameter estimates will be made for the following comparison(s):
+    - Full name
+    - date_of_birth
+    - Social security ID
+    - Street number
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - Postcode
+
+WARNING:
+Level single name cross-matches on comparison Full name not observed in dataset, unable to train m value
+
+WARNING:
+Level Abs difference of 'transformed date_of_birth <= 1 month' on comparison date_of_birth not observed in dataset, unable to train m value
+
+WARNING:
+Level Abs difference of 'transformed date_of_birth <= 1 year' on comparison date_of_birth not observed in dataset, unable to train m value
+
+WARNING:
+Level Abs difference of 'transformed date_of_birth <= 10 year' on comparison date_of_birth not observed in dataset, unable to train m value
+
+Iteration 1: Largest change in params was 0.0374 in the m_probability of date_of_birth, level `All other comparisons`
+Iteration 2: Largest change in params was 0.000656 in the m_probability of date_of_birth, level `All other comparisons`
+Iteration 3: Largest change in params was 1.75e-05 in the m_probability of Social security ID, level `All other comparisons`
+
+EM converged after 3 iterations
+m probability not trained for Full name - single name cross-matches (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
+m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
+m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
+m probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
+
+Your model is not yet fully trained. Missing estimates for:
+    - Full name (some m values are not trained).
+    - date_of_birth (some u values are not trained, some m values are not trained).
+
+
linker_advanced.visualisations.parameter_estimate_comparisons_chart()
+
+ +
+ + +
linker_advanced.visualisations.match_weights_chart()
+
+ +
+ + +
predictions_adv = linker_advanced.inference.predict()
+df_predictions_adv = predictions_adv.as_pandas_dataframe()
+clusters_adv = linker_advanced.clustering.cluster_pairwise_predictions_at_threshold(
+    predictions_adv, threshold_match_probability=0.99
+)
+df_clusters_adv = clusters_adv.as_pandas_dataframe().sort_values("cluster_id")
+df_clusters_adv.groupby("cluster_id").size().value_counts()
+
+
 -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'Full name':
+    m values not fully trained
+Comparison: 'date_of_birth':
+    m values not fully trained
+Comparison: 'date_of_birth':
+    u values not fully trained
+Completed iteration 1, root rows count 0
+
+
+
+
+
+2    4960
+1      80
+Name: count, dtype: int64
+
+

This is a pretty modest improvement on our previous model - however, it is worth reiterating that we should not necessarily expect to recover all matches, as in several cases it may be unrealistic to expect a model to have reasonable confidence that two records refer to the same entity.

+
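Putting the two cluster-size distributions side by side quantifies the gain (the counts are taken from the model outputs above):

```python
# Cluster-size distributions: {cluster size: number of clusters}
baseline = {2: 4959, 1: 82}  # 82 records left unmatched (singleton clusters)
extended = {2: 4960, 1: 80}  # 80 records left unmatched

for name, dist in [("baseline", baseline), ("extended", extended)]:
    records = sum(size * count for size, count in dist.items())
    print(name, "unmatched:", dist.get(1, 0), "of", records, "records")
```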

If we wished to improve matters we could iterate on this process - investigating where our model is not performing as we would hope, and adjusting the model to address those shortcomings.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/examples/duckdb/link_only.html b/demos/examples/duckdb/link_only.html new file mode 100644 index 0000000000..835ac36f7f --- /dev/null +++ b/demos/examples/duckdb/link_only.html @@ -0,0 +1,5746 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Linking two tables of persons - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Linking two tables of persons

+ +

Linking without deduplication

+

A simple record linkage model using the link_only link type.

+

With link_only, only between-dataset record comparisons are generated. No within-dataset record comparisons are created, meaning that the model does not attempt to find within-dataset duplicates.

+
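Since each half of the fake_1000 split below has 500 records, link_only considers only the 500 x 500 cross-dataset pairs; the within-dataset pairs a dedupe job would add are skipped entirely. A quick sanity check of the counts (which lines up with the 250,000 total comparisons reported in the training output later):

```python
n_left = n_right = 500  # each half of the fake_1000 split

# Pairs link_only actually generates: one record from each dataset
between_dataset_pairs = n_left * n_right

# Pairs link_only skips: comparisons within each dataset
within_dataset_pairs = n_left * (n_left - 1) // 2 + n_right * (n_right - 1) // 2

print(between_dataset_pairs)  # 250000
print(within_dataset_pairs)   # 249500 comparisons that are never made
```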

+ Open In Colab +

+
from splink import splink_datasets
+
+df = splink_datasets.fake_1000
+
+# Split a simple dataset into two separate datasets which can be linked together.
+df_l = df.sample(frac=0.5)
+df_r = df.drop(df_l.index)
+
+df_l.head(2)
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
unique_idfirst_namesurnamedobcityemailcluster
922922EvieJones2002-07-22NaNeviejones@brewer-sparks.org230
224224LognFeeruson2013-10-15Londonl.fergson46@shah.com58
+
+ +
import splink.comparison_library as cl
+
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on
+
+settings = SettingsCreator(
+    link_type="link_only",
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name"),
+        block_on("surname"),
+    ],
+    comparisons=[
+        cl.NameComparison(
+            "first_name",
+        ),
+        cl.NameComparison("surname"),
+        cl.DateOfBirthComparison(
+            "dob",
+            input_is_string=True,
+            invalid_dates_as_null=True,
+        ),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        cl.EmailComparison("email"),
+    ],
+)
+
+linker = Linker(
+    [df_l, df_r],
+    settings,
+    db_api=DuckDBAPI(),
+    input_table_aliases=["df_left", "df_right"],
+)
+
+
from splink.exploratory import completeness_chart
+
+completeness_chart(
+    [df_l, df_r],
+    cols=["first_name", "surname", "dob", "city", "email"],
+    db_api=DuckDBAPI(),
+    table_names_for_chart=["df_left", "df_right"],
+)
+
+ +
+ + +
deterministic_rules = [
+    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
+    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
+    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
+    block_on("email"),
+]
+
+
+linker.training.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)
+
+
Probability two random records match is estimated to be  0.00338.
+This means that amongst all possible pairwise record comparisons, one in 295.61 are expected to match.  With 250,000 total possible comparisons, we expect a total of around 845.71 matching pairs
+
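The arithmetic behind this estimate is straightforward: the pairs found by the deterministic rules are assumed to be only `recall` of all true matches, and the inflated count is divided by the total number of possible comparisons. The pair count below is back-derived from the printed figures (845.71 x 0.7 ≈ 592), so treat it as illustrative:

```python
pairs_found_by_rules = 592  # back-derived from the output above
recall = 0.7                # assumed recall of the deterministic rules
total_comparisons = 250_000

estimated_matches = pairs_found_by_rules / recall
probability = estimated_matches / total_comparisons
print(round(estimated_matches, 2), round(probability, 5))  # 845.71 0.00338
```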
+
linker.training.estimate_u_using_random_sampling(max_pairs=1e6, seed=1)
+
+
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
+----- Estimating u probabilities using random sampling -----
+
+Estimated u probabilities using random sampling
+
+Your model is not yet fully trained. Missing estimates for:
+    - first_name (no m values are trained).
+    - surname (no m values are trained).
+    - dob (no m values are trained).
+    - city (no m values are trained).
+    - email (no m values are trained).
+
+
session_dob = linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))
+session_email = linker.training.estimate_parameters_using_expectation_maximisation(
+    block_on("email")
+)
+session_first_name = linker.training.estimate_parameters_using_expectation_maximisation(
+    block_on("first_name")
+)
+
+
----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+l."dob" = r."dob"
+
+Parameter estimates will be made for the following comparison(s):
+    - first_name
+    - surname
+    - city
+    - email
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - dob
+
+WARNING:
+Level Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value
+
+Iteration 1: Largest change in params was -0.418 in the m_probability of surname, level `Exact match on surname`
+Iteration 2: Largest change in params was 0.104 in probability_two_random_records_match
+Iteration 3: Largest change in params was 0.0711 in the m_probability of first_name, level `All other comparisons`
+Iteration 4: Largest change in params was 0.0237 in probability_two_random_records_match
+Iteration 5: Largest change in params was 0.0093 in probability_two_random_records_match
+Iteration 6: Largest change in params was 0.00407 in probability_two_random_records_match
+Iteration 7: Largest change in params was 0.0019 in probability_two_random_records_match
+Iteration 8: Largest change in params was 0.000916 in probability_two_random_records_match
+Iteration 9: Largest change in params was 0.000449 in probability_two_random_records_match
+Iteration 10: Largest change in params was 0.000222 in probability_two_random_records_match
+Iteration 11: Largest change in params was 0.00011 in probability_two_random_records_match
+Iteration 12: Largest change in params was 5.46e-05 in probability_two_random_records_match
+
+EM converged after 12 iterations
+m probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
+
+Your model is not yet fully trained. Missing estimates for:
+    - dob (no m values are trained).
+    - email (some m values are not trained).
+
+----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+l."email" = r."email"
+
+Parameter estimates will be made for the following comparison(s):
+    - first_name
+    - surname
+    - dob
+    - city
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - email
+
+Iteration 1: Largest change in params was -0.483 in the m_probability of dob, level `Exact match on dob`
+Iteration 2: Largest change in params was 0.0905 in probability_two_random_records_match
+Iteration 3: Largest change in params was 0.02 in probability_two_random_records_match
+Iteration 4: Largest change in params was 0.00718 in probability_two_random_records_match
+Iteration 5: Largest change in params was 0.0031 in probability_two_random_records_match
+Iteration 6: Largest change in params was 0.00148 in probability_two_random_records_match
+Iteration 7: Largest change in params was 0.000737 in probability_two_random_records_match
+Iteration 8: Largest change in params was 0.000377 in probability_two_random_records_match
+Iteration 9: Largest change in params was 0.000196 in probability_two_random_records_match
+Iteration 10: Largest change in params was 0.000102 in probability_two_random_records_match
+Iteration 11: Largest change in params was 5.37e-05 in probability_two_random_records_match
+
+EM converged after 11 iterations
+
+Your model is not yet fully trained. Missing estimates for:
+    - email (some m values are not trained).
+
+----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+l."first_name" = r."first_name"
+
+Parameter estimates will be made for the following comparison(s):
+    - surname
+    - dob
+    - city
+    - email
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - first_name
+
+Iteration 1: Largest change in params was -0.169 in the m_probability of surname, level `All other comparisons`
+Iteration 2: Largest change in params was -0.0127 in the m_probability of surname, level `All other comparisons`
+Iteration 3: Largest change in params was -0.00388 in the m_probability of surname, level `All other comparisons`
+Iteration 4: Largest change in params was -0.00164 in the m_probability of email, level `Jaro-Winkler >0.88 on username`
+Iteration 5: Largest change in params was -0.00089 in the m_probability of email, level `Jaro-Winkler >0.88 on username`
+Iteration 6: Largest change in params was -0.000454 in the m_probability of email, level `Jaro-Winkler >0.88 on username`
+Iteration 7: Largest change in params was -0.000225 in the m_probability of email, level `Jaro-Winkler >0.88 on username`
+Iteration 8: Largest change in params was -0.00011 in the m_probability of email, level `Jaro-Winkler >0.88 on username`
+Iteration 9: Largest change in params was -5.31e-05 in the m_probability of email, level `Jaro-Winkler >0.88 on username`
+
+EM converged after 9 iterations
+
+Your model is fully trained. All comparisons have at least one estimate for their m and u values
+
+
results = linker.inference.predict(threshold_match_probability=0.9)
+
+
results.as_pandas_dataframe(limit=5)
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
match_weightmatch_probabilitysource_dataset_lsource_dataset_runique_id_lunique_id_rfirst_name_lfirst_name_rgamma_first_namesurname_l...dob_ldob_rgamma_dobcity_lcity_rgamma_cityemail_lemail_rgamma_emailmatch_key
03.1807670.900674df_leftdf_right242240FreyaFreya4Shah...1970-12-171970-12-164LonnodnoLdon0NoneNone-10
13.1807670.900674df_leftdf_right241240FreyaFreya4None...1970-12-171970-12-164LondonnoLdon0f.s@flynn.comNone-10
23.2125230.902626df_leftdf_right679682ElizabethElizabeth4Shaw...2006-04-212016-04-181CardiffCardifrf0e.shaw@smith-hall.bize.shaw@smith-hall.lbiz30
33.2241260.903331df_leftdf_right576580JessicaJessica4None...1974-11-171974-12-174NoneWalsall-1jesscac.owen@elliott.orgNone-10
43.2241260.903331df_leftdf_right577580JessicaJessica4None...1974-11-171974-12-174NoneWalsall-1jessica.owen@elliott.orgNone-10
+

5 rows × 22 columns

+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/examples/duckdb/pairwise_labels.html b/demos/examples/duckdb/pairwise_labels.html new file mode 100644 index 0000000000..f78c10b2f0 --- /dev/null +++ b/demos/examples/duckdb/pairwise_labels.html @@ -0,0 +1,5671 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Estimating m probabilities from labels - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Estimating m probabilities from labels

+ +

+ Open In Colab +

+

Estimating m from a sample of pairwise labels

+

In this example, we estimate the m probabilities of the model from a table containing pairwise record comparisons which we know are 'true' matches. For example, these may be the result of work by a clerical team who have manually labelled a sample of matches.

+

The table must be in the following format:

+ + + + + + + + + + + + + + + + + + + + + + + +
source_dataset_lunique_id_lsource_dataset_runique_id_r
df_11df_22
df_11df_23
+

It is assumed that every record in the table represents a certain match, i.e. a pair known to refer to the same entity.

+

Note that the column names above are the defaults. They should correspond to the values you've set for unique_id_column_name and source_dataset_column_name, if you've chosen custom values.

+
from splink.datasets import splink_dataset_labels
+
+pairwise_labels = splink_dataset_labels.fake_1000_labels
+
+# Choose labels indicating a match
+pairwise_labels = pairwise_labels[pairwise_labels["clerical_match_score"] == 1]
+pairwise_labels
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
unique_id_lsource_dataset_lunique_id_rsource_dataset_rclerical_match_score
00fake_10001fake_10001.0
10fake_10002fake_10001.0
20fake_10003fake_10001.0
491fake_10002fake_10001.0
501fake_10003fake_10001.0
..................
3171994fake_1000996fake_10001.0
3172995fake_1000996fake_10001.0
3173997fake_1000998fake_10001.0
3174997fake_1000999fake_10001.0
3175998fake_1000999fake_10001.0
+

2031 rows × 5 columns

+
+ +

We now proceed to estimate the Fellegi Sunter model:

+
from splink import splink_datasets
+
+df = splink_datasets.fake_1000
+df.head(2)
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
unique_idfirst_namesurnamedobcityemailcluster
00RobertAlan1971-06-24NaNrobert255@smith.net0
11RobertAllen1971-05-24NaNroberta25@smith.net0
+
+ +
import splink.comparison_library as cl
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name"),
+        block_on("surname"),
+    ],
+    comparisons=[
+        cl.NameComparison("first_name"),
+        cl.NameComparison("surname"),
+        cl.DateOfBirthComparison(
+            "dob",
+            input_is_string=True,
+        ),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        cl.EmailComparison("email"),
+    ],
+    retain_intermediate_calculation_columns=True,
+)
+
+
linker = Linker(df, settings, db_api=DuckDBAPI(), set_up_basic_logging=False)
+deterministic_rules = [
+    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
+    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
+    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
+    "l.email = r.email",
+]
+
+linker.training.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)
+
+
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
+
+
# Register the pairwise labels table with the database, and then use it to estimate the m values
+labels_df = linker.table_management.register_labels_table(pairwise_labels, overwrite=True)
+linker.training.estimate_m_from_pairwise_labels(labels_df)
+
+
+# If the labels table already exists in the database, you could run
+# linker.training.estimate_m_from_pairwise_labels("labels_tablename_here")
+
+
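Conceptually, estimating m from labels amounts to evaluating each comparison on the labelled pairs and taking, for each comparison level, the share of labelled matches that fall in that level (since m is the probability of observing a level given that the records truly match). A toy sketch with hypothetical comparison-vector values, not Splink's internal code:

```python
from collections import Counter

# Hypothetical comparison-vector values for one comparison (e.g. first_name)
# observed on labelled true-match pairs: 2 = exact, 1 = fuzzy, 0 = anything else
levels_on_labelled_matches = [2, 2, 2, 1, 2, 0, 2, 1, 2, 2]

counts = Counter(levels_on_labelled_matches)
m = {level: n / len(levels_on_labelled_matches) for level, n in counts.items()}
print(m)  # {2: 0.7, 1: 0.2, 0: 0.1}
```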
training_blocking_rule = block_on("first_name")
+linker.training.estimate_parameters_using_expectation_maximisation(training_blocking_rule)
+
+
<EMTrainingSession, blocking on l."first_name" = r."first_name", deactivating comparisons first_name>
+
+
linker.visualisations.parameter_estimate_comparisons_chart()
+
+ +
+ + +
linker.visualisations.match_weights_chart()
+
+ +
+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/examples/duckdb/quick_and_dirty_persons.html b/demos/examples/duckdb/quick_and_dirty_persons.html new file mode 100644 index 0000000000..985f314bd8 --- /dev/null +++ b/demos/examples/duckdb/quick_and_dirty_persons.html @@ -0,0 +1,5623 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Quick and dirty persons model - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Quick and dirty persons model

+ +

Historical people: Quick and dirty

+

This example shows how to get some initial record linkage results as quickly as possible.

+

There are many ways to improve the accuracy of this model. But this may be a good place to start if you just want to give Splink a try and see what it's capable of.

+

+ Open In Colab +

+
from splink.datasets import splink_datasets
+
+df = splink_datasets.historical_50k
+df.head(5)
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
unique_idclusterfull_namefirst_and_surnamefirst_namesurnamedobbirth_placepostcode_fakegenderoccupation
0Q2296770-1Q2296770thomas clifford, 1st baron clifford of chudleighthomas chudleighthomaschudleigh1630-08-01devontq13 8dfmalepolitician
1Q2296770-2Q2296770thomas of chudleighthomas chudleighthomaschudleigh1630-08-01devontq13 8dfmalepolitician
2Q2296770-3Q2296770tom 1st baron clifford of chudleightom chudleightomchudleigh1630-08-01devontq13 8dfmalepolitician
3Q2296770-4Q2296770thomas 1st chudleighthomas chudleighthomaschudleigh1630-08-01devontq13 8huNonepolitician
4Q2296770-5Q2296770thomas clifford, 1st baron chudleighthomas chudleighthomaschudleigh1630-08-01devontq13 8dfNonepolitician
+
+ +
from splink import block_on, SettingsCreator
+import splink.comparison_library as cl
+
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    blocking_rules_to_generate_predictions=[
+        block_on("full_name"),
+        block_on("substr(full_name,1,6)", "dob", "birth_place"),
+        block_on("dob", "birth_place"),
+        block_on("postcode_fake"),
+    ],
+    comparisons=[
+        cl.ForenameSurnameComparison(
+            "first_name",
+            "surname",
+            forename_surname_concat_col_name="first_and_surname",
+        ),
+        cl.DateOfBirthComparison(
+            "dob",
+            input_is_string=True,
+        ),
+        cl.LevenshteinAtThresholds("postcode_fake", 2),
+        cl.JaroWinklerAtThresholds("birth_place", 0.9).configure(
+            term_frequency_adjustments=True
+        ),
+        cl.ExactMatch("occupation").configure(term_frequency_adjustments=True),
+    ],
+)
+
+
from splink import Linker, DuckDBAPI
+
+
+linker = Linker(df, settings, db_api=DuckDBAPI(), set_up_basic_logging=False)
+deterministic_rules = [
+    "l.full_name = r.full_name",
+    "l.postcode_fake = r.postcode_fake and l.dob = r.dob",
+]
+
+linker.training.estimate_probability_two_random_records_match(
+    deterministic_rules, recall=0.6
+)
+
+
linker.training.estimate_u_using_random_sampling(max_pairs=2e6)
+
+
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))
+
+
results = linker.inference.predict(threshold_match_probability=0.9)
+
+
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))
+
+
+
+ -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'first_name_surname':
+    m values not fully trained
+Comparison: 'first_name_surname':
+    u values not fully trained
+Comparison: 'dob':
+    m values not fully trained
+Comparison: 'postcode_fake':
+    m values not fully trained
+Comparison: 'birth_place':
+    m values not fully trained
+Comparison: 'occupation':
+    m values not fully trained
+
+
results.as_pandas_dataframe(limit=5)
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
match_weightmatch_probabilityunique_id_lunique_id_rfirst_name_lfirst_name_rsurname_lsurname_rfirst_and_surname_lfirst_and_surname_r...gamma_postcode_fakebirth_place_lbirth_place_rgamma_birth_placeoccupation_loccupation_rgamma_occupationfull_name_lfull_name_rmatch_key
03.1700050.900005Q7412607-1Q7412607-3samuelsamuelshelleyshelleysamuel shelleysamuel shelley...0whitechapelcity of london0illuminatorilluminator1samuel shelleysamuel shelley0
13.1706950.900048Q15997578-4Q15997578-7jobwildingwildingNonejob wildingwilding...-1wrexhamwrexham2association football playerassociation football player1job wildingwilding2
23.1706950.900048Q15997578-2Q15997578-7jobwildingwildingNonejob wildingwilding...-1wrexhamwrexham2association football playerassociation football player1job wildingwilding2
33.1706950.900048Q15997578-1Q15997578-7jobwildingwildingNonejob wildingwilding...-1wrexhamwrexham2association football playerassociation football player1job wildingwilding2
43.1725530.900164Q5726641-11Q5726641-8henryharrypagepaigehenry pageharry paige...2staffordshire moorlandsstaffordshire moorlands2cricketercricketer1henry pageharry paige3
+

5 rows × 26 columns

+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/examples/duckdb/real_time_record_linkage.html b/demos/examples/duckdb/real_time_record_linkage.html new file mode 100644 index 0000000000..59b4a1662f --- /dev/null +++ b/demos/examples/duckdb/real_time_record_linkage.html @@ -0,0 +1,5968 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Real time record linkage - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + +

Real time record linkage

+ +

Real time linkage

+

In this notebook, we demonstrate Splink's incremental and real-time linkage capabilities - specifically:

+
    +
  • the linker.inference.compare_two_records function, which allows you to interactively explore the results of a linkage model; and
  • +
  • the linker.inference.find_matches_to_new_records function, which allows you to incrementally find matches to a small number of new records
  • +
+

+ Open In Colab +

+

Step 1: Load a pre-trained linkage model

+
import urllib.request
+import json
+from pathlib import Path
+from splink import Linker, DuckDBAPI, block_on, SettingsCreator, splink_datasets
+
+df = splink_datasets.fake_1000
+
+url = "https://raw.githubusercontent.com/moj-analytical-services/splink_demos/master/demo_settings/real_time_settings.json"
+
+with urllib.request.urlopen(url) as u:
+    settings = json.loads(u.read().decode())
+
+
+linker = Linker(df, settings, db_api=DuckDBAPI())
+
+
linker.visualisations.waterfall_chart(
+    linker.inference.predict().as_record_dict(limit=2)
+)
+
+ +
+ + +

Step 2: Comparing two records

+

It's now possible to compute a match weight for any two records using linker.inference.compare_two_records()

+
record_1 = {
+    "unique_id": 1,
+    "first_name": "Lucas",
+    "surname": "Smith",
+    "dob": "1984-01-02",
+    "city": "London",
+    "email": "lucas.smith@hotmail.com",
+}
+
+record_2 = {
+    "unique_id": 2,
+    "first_name": "Lucas",
+    "surname": "Smith",
+    "dob": "1983-02-12",
+    "city": "Machester",
+    "email": "lucas.smith@hotmail.com",
+}
+
+linker._settings_obj._retain_intermediate_calculation_columns = True
+
+
+# To use `compare_two_records`, the linker needs to compute term frequency tables
+# If you have precomputed tables, you can use linker.register_term_frequency_lookup()
+linker.table_management.compute_tf_table("first_name")
+linker.table_management.compute_tf_table("surname")
+linker.table_management.compute_tf_table("dob")
+linker.table_management.compute_tf_table("city")
+linker.table_management.compute_tf_table("email")
+
+
+df_two = linker.inference.compare_two_records(record_1, record_2)
+df_two.as_pandas_dataframe()
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
match_weightmatch_probabilityunique_id_lunique_id_rfirst_name_lfirst_name_rgamma_first_nametf_first_name_ltf_first_name_rbf_first_name...bf_citybf_tf_adj_cityemail_lemail_rgamma_emailtf_email_ltf_email_rbf_emailbf_tf_adj_emailmatch_key
013.1616720.99989112LucasLucas20.0012030.00120387.571229...0.4464041.0lucas.smith@hotmail.comlucas.smith@hotmail.com1NaNNaN263.2291681.00
+

1 rows × 40 columns

+
+ +
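As a side note, the match_weight and match_probability columns in the output above are directly related: Splink's match weight is the log2 of the overall Bayes factor (including the prior odds), so the probability can be recovered from the weight. A quick sketch, using the weight printed above:

```python
# Splink's match weight w is log2 of the overall Bayes factor, so
# match_probability = 2**w / (1 + 2**w)
w = 13.161672                # match_weight from the output above
p = 2**w / (1 + 2**w)
print(round(p, 6))           # → 0.999891, agreeing with match_probability above
```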

Step 3: Interactive comparisons

+

One interesting application of compare_two_records is to create a simple interface that allows the user to input two records interactively and get real-time feedback.

+

In the following cell we use ipywidgets for this purpose. ✨✨ Change the values in the text boxes to see the waterfall chart update in real time. ✨✨

+
import ipywidgets as widgets
+from IPython.display import display
+
+
+fields = ["unique_id", "first_name", "surname", "dob", "email", "city"]
+
+left_text_boxes = []
+right_text_boxes = []
+
+inputs_to_interactive_output = {}
+
+for f in fields:
+    wl = widgets.Text(description=f, value=str(record_1[f]))
+    left_text_boxes.append(wl)
+    inputs_to_interactive_output[f"{f}_l"] = wl
+    wr = widgets.Text(description=f, value=str(record_2[f]))
+    right_text_boxes.append(wr)
+    inputs_to_interactive_output[f"{f}_r"] = wr
+
+b1 = widgets.VBox(left_text_boxes)
+b2 = widgets.VBox(right_text_boxes)
+ui = widgets.HBox([b1, b2])
+
+
+def myfn(**kwargs):
+    my_args = dict(kwargs)
+
+    record_left = {}
+    record_right = {}
+
+    for key, value in my_args.items():
+        if value == "":
+            value = None
+        if key.endswith("_l"):
+            record_left[key[:-2]] = value
+        elif key.endswith("_r"):
+            record_right[key[:-2]] = value
+
+    # Assuming 'linker' is defined earlier in your code
+    linker._settings_obj._retain_intermediate_calculation_columns = True
+
+    df_two = linker.inference.compare_two_records(record_left, record_right)
+
+    recs = df_two.as_pandas_dataframe().to_dict(orient="records")
+
+    display(linker.visualisations.waterfall_chart(recs, filter_nulls=False))
+
+
+out = widgets.interactive_output(myfn, inputs_to_interactive_output)
+
+display(ui, out)
+
+
HBox(children=(VBox(children=(Text(value='1', description='unique_id'), Text(value='Lucas', description='first…
+
+
+
+Output()
+
+

Finding matching records interactively

+

It is also possible to search the records in the input dataset rapidly using the linker.inference.find_matches_to_new_records() function

+
record = {
+    "unique_id": 123987,
+    "first_name": "Robert",
+    "surname": "Alan",
+    "dob": "1971-05-24",
+    "city": "London",
+    "email": "robert255@smith.net",
+}
+
+
+df_inc = linker.inference.find_matches_to_new_records(
+    [record], blocking_rules=[]
+).as_pandas_dataframe()
+df_inc.sort_values("match_weight", ascending=False)
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
match_weightmatch_probabilityunique_id_lunique_id_rfirst_name_lfirst_name_rgamma_first_nametf_first_name_ltf_first_name_rbf_first_name...tf_city_rbf_citybf_tf_adj_cityemail_lemail_rgamma_emailtf_email_ltf_email_rbf_emailbf_tf_adj_email
623.5317931.0000000123987RobertRobert20.0036100.0036187.571229...0.2127921.0000001.000000robert255@smith.netrobert255@smith.net10.0012670.001267263.2291681.730964
514.5503200.9999581123987RobertRobert20.0036100.0036187.571229...0.2127921.0000001.000000roberta25@smith.netrobert255@smith.net00.0025350.0012670.4234381.000000
410.3886230.9992553123987RobertRobert20.0036100.0036187.571229...0.2127920.4464041.000000Nonerobert255@smith.net-1NaN0.0012671.0000001.000000
32.4272560.8432282123987RobRobert00.0012030.003610.218767...0.21279210.4848590.259162roberta25@smith.netrobert255@smith.net00.0025350.0012670.4234381.000000
2-2.1230900.1866978123987NoneRobert-1NaN0.003611.000000...0.2127921.0000001.000000Nonerobert255@smith.net-1NaN0.0012671.0000001.000000
1-2.2058940.178139754123987NoneRobert-1NaN0.003611.000000...0.2127921.0000001.000000j.c@whige.wortrobert255@smith.net00.0012670.0012670.4234381.000000
0-2.8023090.125383750123987NoneRobert-1NaN0.003611.000000...0.21279210.4848590.259162j.c@white.orgrobert255@smith.net00.0025350.0012670.4234381.000000
+

7 rows × 39 columns

+
+ +

Interactive interface for finding records

+

Again, we can use ipywidgets to build an interactive interface for the linker.inference.find_matches_to_new_records function

+
@widgets.interact(
+    first_name="Robert",
+    surname="Alan",
+    dob="1971-05-24",
+    city="London",
+    email="robert255@smith.net",
+)
+def interactive_link(first_name, surname, dob, city, email):
+    record = {
+        "unique_id": 123987,
+        "first_name": first_name,
+        "surname": surname,
+        "dob": dob,
+        "city": city,
+        "email": email,
+        "group": 0,
+    }
+
+    for key in record.keys():
+        if isinstance(record[key], str):
+            if record[key].strip() == "":
+                record[key] = None
+
+    df_inc = linker.inference.find_matches_to_new_records(
+        [record], blocking_rules=["(true)"]
+    ).as_pandas_dataframe()
+    df_inc = df_inc.sort_values("match_weight", ascending=False)
+    recs = df_inc.to_dict(orient="records")
+
+    display(linker.visualisations.waterfall_chart(recs, filter_nulls=False))
+
+
interactive(children=(Text(value='Robert', description='first_name'), Text(value='Alan', description='surname'…
+
+
linker.visualisations.match_weights_chart()
+
+ +
+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/examples/duckdb/transactions.html b/demos/examples/duckdb/transactions.html new file mode 100644 index 0000000000..0c9379361e --- /dev/null +++ b/demos/examples/duckdb/transactions.html @@ -0,0 +1,5980 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Linking financial transactions - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Linking financial transactions

+ +

Linking banking transactions

+

This example shows how to perform a one-to-one link on banking transactions.

+

The data is fake, and was generated to have the following features:

+
    +
  • Money shows up in the destination account with some time delay
  • +
  • The amount sent and the amount received are not always the same - there are hidden fees and foreign exchange effects
  • +
  • The memo is sometimes truncated and content is sometimes missing
  • +
+

Since each origin payment should end up in the destination account, the probability_two_random_records_match of the model is known.

+
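To spell out why the prior is known here, consider a sketch with a hypothetical record count: in a one-to-one link of two equally sized datasets, each of the n origin records has exactly one true match among the n destination records, so the probability that a random origin–destination pair matches is n / (n × n) = 1/n.

```python
# Sketch with a hypothetical record count (not the real dataset size):
# every origin record has exactly one true match among the destination records
n = 45_000                   # hypothetical number of records per dataset
true_matches = n
total_comparisons = n * n    # link_only compares every origin to every destination
p = true_matches / total_comparisons
assert p == 1 / n            # hence probability_two_random_records_match = 1 / len(df_origin)
```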

+ Open In Colab +

+
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
+
+df_origin = splink_datasets.transactions_origin
+df_destination = splink_datasets.transactions_destination
+
+display(df_origin.head(2))
+display(df_destination.head(2))
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ground_truthmemotransaction_dateamountunique_id
00MATTHIAS C paym2022-03-2836.360
11M CORVINUS dona2022-02-14221.911
+
+ +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ground_truthmemotransaction_dateamountunique_id
00MATTHIAS C payment BGC2022-03-2936.360
11M CORVINUS BGC2022-02-16221.911
+
+ +

In the following chart, we can see this is a challenging dataset to link:

+
    +
  • There are only 151 distinct transaction dates, with strong skew
  • +
  • Some 'memos' are used multiple times (up to 48 times)
  • +
  • There is strong skew in the 'amount' column, with 1,400 transactions of around 60.00
  • +
+
from splink.exploratory import profile_columns
+
+db_api = DuckDBAPI()
+profile_columns(
+    [df_origin, df_destination],
+    db_api=db_api,
+    column_expressions=[
+        "memo",
+        "transaction_date",
+        "amount",
+    ],
+)
+
+ +
+ + +
from splink import DuckDBAPI, block_on
+from splink.blocking_analysis import (
+    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
+)
+
+# Design blocking rules that allow for differences in transaction date and amounts
+blocking_rule_date_1 = """
+    strftime(l.transaction_date, '%Y%m') = strftime(r.transaction_date, '%Y%m')
+    and substr(l.memo, 1,3) = substr(r.memo,1,3)
+    and l.amount/r.amount > 0.7   and l.amount/r.amount < 1.3
+"""
+
+# Offset by half a month to ensure we capture the case where the dates are e.g. 31st Jan and 1st Feb
+blocking_rule_date_2 = """
+    strftime(l.transaction_date+15, '%Y%m') = strftime(r.transaction_date, '%Y%m')
+    and substr(l.memo, 1,3) = substr(r.memo,1,3)
+    and l.amount/r.amount > 0.7   and l.amount/r.amount < 1.3
+"""
+
+blocking_rule_memo = block_on("substr(memo,1,9)")
+
+blocking_rule_amount_1 = """
+round(l.amount/2,0)*2 = round(r.amount/2,0)*2 and yearweek(r.transaction_date) = yearweek(l.transaction_date)
+"""
+
+blocking_rule_amount_2 = """
+round(l.amount/2,0)*2 = round((r.amount+1)/2,0)*2 and yearweek(r.transaction_date) = yearweek(l.transaction_date + 4)
+"""
+
+blocking_rule_cheat = block_on("unique_id")
+
+
+brs = [
+    blocking_rule_date_1,
+    blocking_rule_date_2,
+    blocking_rule_memo,
+    blocking_rule_amount_1,
+    blocking_rule_amount_2,
+    blocking_rule_cheat,
+]
+
+
+db_api = DuckDBAPI()
+
+cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
+    table_or_tables=[df_origin, df_destination],
+    blocking_rules=brs,
+    db_api=db_api,
+    link_type="link_only"
+)
+
+ +
+ + +
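The purpose of the half-month offset in blocking_rule_date_2 can be verified with a quick Python sketch: two dates either side of a month boundary land in different '%Y%m' buckets, but shifting the left-hand date by 15 days puts them in the same bucket.

```python
from datetime import date, timedelta

# A pair either side of a month boundary
d_l, d_r = date(2022, 1, 31), date(2022, 2, 1)

# Plain year-month blocking (blocking_rule_date_1) misses this pair...
assert d_l.strftime("%Y%m") != d_r.strftime("%Y%m")

# ...but the 15-day offset used in blocking_rule_date_2 recovers it
assert (d_l + timedelta(days=15)).strftime("%Y%m") == d_r.strftime("%Y%m")
```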
# Full settings for linking model
+import splink.comparison_level_library as cll
+import splink.comparison_library as cl
+
+comparison_amount = {
+    "output_column_name": "amount",
+    "comparison_levels": [
+        cll.NullLevel("amount"),
+        cll.ExactMatchLevel("amount"),
+        cll.PercentageDifferenceLevel("amount", 0.01),
+        cll.PercentageDifferenceLevel("amount", 0.03),
+        cll.PercentageDifferenceLevel("amount", 0.1),
+        cll.PercentageDifferenceLevel("amount", 0.3),
+        cll.ElseLevel(),
+    ],
+    "comparison_description": "Amount percentage difference",
+}
+
+# The date distance is one-sided because transactions should only arrive after they've left
+# As a result, the comparison_template_library date difference functions are not appropriate
+within_n_days_template = "transaction_date_r - transaction_date_l <= {n} and transaction_date_r >= transaction_date_l"
+
+comparison_date = {
+    "output_column_name": "transaction_date",
+    "comparison_levels": [
+        cll.NullLevel("transaction_date"),
+        {
+            "sql_condition": within_n_days_template.format(n=1),
+            "label_for_charts": "1 day",
+        },
+        {
+            "sql_condition": within_n_days_template.format(n=4),
+            "label_for_charts": "<=4 days",
+        },
+        {
+            "sql_condition": within_n_days_template.format(n=10),
+            "label_for_charts": "<=10 days",
+        },
+        {
+            "sql_condition": within_n_days_template.format(n=30),
+            "label_for_charts": "<=30 days",
+        },
+        cll.ElseLevel(),
+    ],
+    "comparison_description": "Transaction date days apart",
+}
+
+
+settings = SettingsCreator(
+    link_type="link_only",
+    probability_two_random_records_match=1 / len(df_origin),
+    blocking_rules_to_generate_predictions=[
+        blocking_rule_date_1,
+        blocking_rule_date_2,
+        blocking_rule_memo,
+        blocking_rule_amount_1,
+        blocking_rule_amount_2,
+        blocking_rule_cheat,
+    ],
+    comparisons=[
+        comparison_amount,
+        cl.LevenshteinAtThresholds("memo", [2, 6, 10]),
+        comparison_date,
+    ],
+    retain_intermediate_calculation_columns=True,
+)
+
+
linker = Linker(
+    [df_origin, df_destination],
+    settings,
+    input_table_aliases=["__ori", "_dest"],
+    db_api=db_api,
+)
+
+
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
+----- Estimating u probabilities using random sampling -----
+
+Estimated u probabilities using random sampling
+
+Your model is not yet fully trained. Missing estimates for:
+    - amount (no m values are trained).
+    - memo (no m values are trained).
+    - transaction_date (no m values are trained).
+
+
linker.training.estimate_parameters_using_expectation_maximisation(block_on("memo"))
+
+
----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+l."memo" = r."memo"
+
+Parameter estimates will be made for the following comparison(s):
+    - amount
+    - transaction_date
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - memo
+
+Iteration 1: Largest change in params was -0.588 in the m_probability of amount, level `Exact match on amount`
+Iteration 2: Largest change in params was -0.176 in the m_probability of transaction_date, level `1 day`
+Iteration 3: Largest change in params was 0.00996 in the m_probability of amount, level `Percentage difference of 'amount' within 10.00%`
+Iteration 4: Largest change in params was 0.0022 in the m_probability of transaction_date, level `<=30 days`
+Iteration 5: Largest change in params was 0.000385 in the m_probability of transaction_date, level `<=30 days`
+Iteration 6: Largest change in params was -0.000255 in the m_probability of amount, level `All other comparisons`
+Iteration 7: Largest change in params was -0.000229 in the m_probability of amount, level `All other comparisons`
+Iteration 8: Largest change in params was -0.000208 in the m_probability of amount, level `All other comparisons`
+Iteration 9: Largest change in params was -0.00019 in the m_probability of amount, level `All other comparisons`
+Iteration 10: Largest change in params was -0.000173 in the m_probability of amount, level `All other comparisons`
+Iteration 11: Largest change in params was -0.000159 in the m_probability of amount, level `All other comparisons`
+Iteration 12: Largest change in params was -0.000146 in the m_probability of amount, level `All other comparisons`
+Iteration 13: Largest change in params was -0.000135 in the m_probability of amount, level `All other comparisons`
+Iteration 14: Largest change in params was -0.000124 in the m_probability of amount, level `All other comparisons`
+Iteration 15: Largest change in params was -0.000115 in the m_probability of amount, level `All other comparisons`
+Iteration 16: Largest change in params was -0.000107 in the m_probability of amount, level `All other comparisons`
+Iteration 17: Largest change in params was -9.92e-05 in the m_probability of amount, level `All other comparisons`
+
+EM converged after 17 iterations
+
+Your model is not yet fully trained. Missing estimates for:
+    - memo (no m values are trained).
+
+
+
+
+
+<EMTrainingSession, blocking on l."memo" = r."memo", deactivating comparisons memo>
+
+
session = linker.training.estimate_parameters_using_expectation_maximisation(block_on("amount"))
+
+
----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+l."amount" = r."amount"
+
+Parameter estimates will be made for the following comparison(s):
+    - memo
+    - transaction_date
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - amount
+
+Iteration 1: Largest change in params was -0.373 in the m_probability of memo, level `Exact match on memo`
+Iteration 2: Largest change in params was -0.108 in the m_probability of memo, level `Exact match on memo`
+Iteration 3: Largest change in params was 0.0202 in the m_probability of memo, level `Levenshtein distance of memo <= 10`
+Iteration 4: Largest change in params was -0.00538 in the m_probability of memo, level `Exact match on memo`
+Iteration 5: Largest change in params was 0.00482 in the m_probability of memo, level `All other comparisons`
+Iteration 6: Largest change in params was 0.00508 in the m_probability of memo, level `All other comparisons`
+Iteration 7: Largest change in params was 0.00502 in the m_probability of memo, level `All other comparisons`
+Iteration 8: Largest change in params was 0.00466 in the m_probability of memo, level `All other comparisons`
+Iteration 9: Largest change in params was 0.00409 in the m_probability of memo, level `All other comparisons`
+Iteration 10: Largest change in params was 0.00343 in the m_probability of memo, level `All other comparisons`
+Iteration 11: Largest change in params was 0.00276 in the m_probability of memo, level `All other comparisons`
+Iteration 12: Largest change in params was 0.00216 in the m_probability of memo, level `All other comparisons`
+Iteration 13: Largest change in params was 0.00165 in the m_probability of memo, level `All other comparisons`
+Iteration 14: Largest change in params was 0.00124 in the m_probability of memo, level `All other comparisons`
+Iteration 15: Largest change in params was 0.000915 in the m_probability of memo, level `All other comparisons`
+Iteration 16: Largest change in params was 0.000671 in the m_probability of memo, level `All other comparisons`
+Iteration 17: Largest change in params was 0.000488 in the m_probability of memo, level `All other comparisons`
+Iteration 18: Largest change in params was 0.000353 in the m_probability of memo, level `All other comparisons`
+Iteration 19: Largest change in params was 0.000255 in the m_probability of memo, level `All other comparisons`
+Iteration 20: Largest change in params was 0.000183 in the m_probability of memo, level `All other comparisons`
+Iteration 21: Largest change in params was 0.000132 in the m_probability of memo, level `All other comparisons`
+Iteration 22: Largest change in params was 9.45e-05 in the m_probability of memo, level `All other comparisons`
+
+EM converged after 22 iterations
+
+Your model is fully trained. All comparisons have at least one estimate for their m and u values
+
+
linker.visualisations.match_weights_chart()
+
+ +
+ + +
df_predict = linker.inference.predict(threshold_match_probability=0.001)
+
+
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))
+
+
linker.visualisations.comparison_viewer_dashboard(
+    df_predict, "dashboards/comparison_viewer_transactions.html", overwrite=True
+)
+from IPython.display import IFrame
+
+IFrame(
+    src="./dashboards/comparison_viewer_transactions.html", width="100%", height=1200
+)
+
+

+

+
pred_errors = linker.evaluation.prediction_errors_from_labels_column(
+    "ground_truth", include_false_positives=True, include_false_negatives=False
+)
+linker.visualisations.waterfall_chart(pred_errors.as_record_dict(limit=5))
+
+
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))
+
+ +
+ + +
pred_errors = linker.evaluation.prediction_errors_from_labels_column(
+    "ground_truth", include_false_positives=False, include_false_negatives=True
+)
+linker.visualisations.waterfall_chart(pred_errors.as_record_dict(limit=5))
+
+ +
+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/examples/examples_index.html b/demos/examples/examples_index.html new file mode 100644 index 0000000000..a552f36376 --- /dev/null +++ b/demos/examples/examples_index.html @@ -0,0 +1,5413 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Introduction - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + + + +
+
+ + + + + + + + + + + + + + + + + +

Example Notebooks

+

This section provides a series of examples to help you get started with Splink. You can find the underlying notebooks in the demos folder of the Splink repository.

+

DuckDB examples

+
Entity type: Persons
+

+ Open In Colab + Deduplicating 50,000 records of realistic data based on historical persons

+

+ Open In Colab + Using the link_only setting to link, but not dedupe, two datasets

+

+ Open In Colab + Real time record linkage

+

+ Open In Colab + Accuracy analysis and ROC charts using a ground truth (cluster) column

+

+ Open In Colab + Estimating m probabilities from pairwise labels

+

+ Open In Colab + Deduplicating 50,000 records with Deterministic Rules

+

+ Open In Colab Deduplicating the febrl3 dataset. Note this dataset comes from febrl, as referenced in A.2 here and replicated here. +

+

+ Open In Colab Linking the febrl4 datasets. As above, these datasets are from febrl, replicated here. +

+

+ Open In Colab + Cookbook of various Splink techniques

+
Entity type: Financial transactions
+

+ Open In Colab + Linking financial transactions

+

PySpark examples

+

+ Open In Colab + Deduplication of a small dataset using PySpark. Entity type is persons.

+

Athena examples

+

Deduplicating 50,000 records of realistic data based on historical persons

+

SQLite examples

+

+ Open In Colab + Deduplicating 50,000 records of realistic data based on historical persons

+ + + + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/examples/spark/deduplicate_1k_synthetic.html b/demos/examples/spark/deduplicate_1k_synthetic.html new file mode 100644 index 0000000000..e26c1a8a7f --- /dev/null +++ b/demos/examples/spark/deduplicate_1k_synthetic.html @@ -0,0 +1,5452 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Deduplication using Pyspark - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Deduplication using Pyspark

+ +

Linking in Spark

+

+ Open In Colab +

+
from pyspark import SparkConf, SparkContext
+from pyspark.sql import SparkSession
+
+from splink.backends.spark import similarity_jar_location
+
+conf = SparkConf()
+# This parallelism setting is only suitable for a small toy example
+conf.set("spark.driver.memory", "12g")
+conf.set("spark.default.parallelism", "8")
+conf.set("spark.sql.codegen.wholeStage", "false")
+
+
+# Add custom similarity functions, which are bundled with Splink
+# documented here: https://github.com/moj-analytical-services/splink_scalaudfs
+path = similarity_jar_location()
+conf.set("spark.jars", path)
+
+sc = SparkContext.getOrCreate(conf=conf)
+
+spark = SparkSession(sc)
+spark.sparkContext.setCheckpointDir("./tmp_checkpoints")
+
+
24/07/13 19:50:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
+Setting default log level to "WARN".
+To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
+
+
from splink import splink_datasets
+
+pandas_df = splink_datasets.fake_1000
+
+df = spark.createDataFrame(pandas_df)
+
+
import splink.comparison_library as cl
+from splink import Linker, SettingsCreator, SparkAPI, block_on
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    comparisons=[
+        cl.NameComparison("first_name"),
+        cl.NameComparison("surname"),
+        cl.LevenshteinAtThresholds(
+            "dob"
+        ),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        cl.EmailComparison("email"),
+    ],
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name"),
+        "l.surname = r.surname",  # alternatively, you can write BRs in their SQL form
+    ],
+    retain_intermediate_calculation_columns=True,
+    em_convergence=0.01,
+)
+
+
linker = Linker(df, settings, db_api=SparkAPI(spark_session=spark))
+deterministic_rules = [
+    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
+    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
+    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
+    "l.email = r.email",
+]
+
+linker.training.estimate_probability_two_random_records_match(deterministic_rules, recall=0.6)
+
+
Probability two random records match is estimated to be  0.0806.                
+This means that amongst all possible pairwise record comparisons, one in 12.41 are expected to match.  With 499,500 total possible comparisons, we expect a total of around 40,246.67 matching pairs
+
+
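The logged estimate can be reproduced by hand: pairs found by the deterministic rules are scaled up by the assumed recall, then divided by the total number of possible comparisons. Note the deterministic-rule pair count below is inferred from the logged figures rather than read from Splink's output:

```python
n = 1_000
total_comparisons = n * (n - 1) // 2    # 499,500 for dedupe_only, as logged
pairs_from_rules = 24_148               # inferred: 40,246.67 expected matches * 0.6 recall
recall = 0.6
expected_matches = pairs_from_rules / recall
p = expected_matches / total_comparisons
print(round(p, 4))                      # → 0.0806, matching the log above
```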
linker.training.estimate_u_using_random_sampling(max_pairs=5e5)
+
+
----- Estimating u probabilities using random sampling -----
+
+
+
+Estimated u probabilities using random sampling
+
+Your model is not yet fully trained. Missing estimates for:
+    - first_name (no m values are trained).
+    - surname (no m values are trained).
+    - dob (no m values are trained).
+    - city (no m values are trained).
+    - email (no m values are trained).
+
+
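The u probabilities estimated above describe how often each comparison level occurs among truly non-matching record pairs — for an exact-match level, this is roughly the chance that two unrelated records happen to share a value. A toy, stdlib-only illustration of the idea (not Splink's implementation, which samples random pairs at scale rather than enumerating them):

```python
import itertools

# Toy data: the u value for "exact match on city" is the share of
# record pairs that agree on city purely by chance.
cities = ["London", "London", "Leeds", "Swansea", "London", "Leeds"]

pairs = list(itertools.combinations(range(len(cities)), 2))
agreeing = sum(cities[i] == cities[j] for i, j in pairs)
u_exact_city = agreeing / len(pairs)

print(round(u_exact_city, 3))  # 0.267
```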
training_blocking_rule = "l.first_name = r.first_name and l.surname = r.surname"
+training_session_fname_sname = (
+    linker.training.estimate_parameters_using_expectation_maximisation(training_blocking_rule)
+)
+
+training_blocking_rule = "l.dob = r.dob"
+training_session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
+    training_blocking_rule
+)
+
+
----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+l.first_name = r.first_name and l.surname = r.surname
+
+Parameter estimates will be made for the following comparison(s):
+    - dob
+    - city
+    - email
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - first_name
+    - surname
+
+Iteration 1: Largest change in params was -0.709 in probability_two_random_records_match
+Iteration 2: Largest change in params was 0.0573 in the m_probability of email, level `All other comparisons`
+Iteration 3: Largest change in params was 0.0215 in the m_probability of email, level `All other comparisons`
+Iteration 4: Largest change in params was 0.00888 in the m_probability of email, level `All other comparisons`
+
+EM converged after 4 iterations
+
+Your model is not yet fully trained. Missing estimates for:
+    - first_name (no m values are trained).
+    - surname (no m values are trained).
+
+----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+l.dob = r.dob
+
+Parameter estimates will be made for the following comparison(s):
+    - first_name
+    - surname
+    - city
+    - email
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - dob
+
+WARNING:                                                                        
+Level Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value
+
+Iteration 1: Largest change in params was -0.548 in the m_probability of surname, level `Exact match on surname`
+Iteration 2: Largest change in params was 0.129 in probability_two_random_records_match
+Iteration 3: Largest change in params was 0.0313 in probability_two_random_records_match
+Iteration 4: Largest change in params was 0.0128 in probability_two_random_records_match
+Iteration 5: Largest change in params was 0.00651 in probability_two_random_records_match
+
+EM converged after 5 iterations
+m probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
+
+Your model is fully trained. All comparisons have at least one estimate for their m and u values
+
+
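The em_convergence=0.01 value in the settings above controls when these training sessions stop: a session ends once the largest absolute change in any parameter falls below the threshold, which is why the second session converged after the 0.00651 change. A sketch of the check:

```python
em_convergence = 0.01

# Largest parameter change reported at each iteration of the second session
largest_changes = [-0.548, 0.129, 0.0313, 0.0128, 0.00651]

converged_at = next(
    i + 1
    for i, change in enumerate(largest_changes)
    if abs(change) < em_convergence
)
print(converged_at)  # 5, matching "EM converged after 5 iterations"
```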
results = linker.inference.predict(threshold_match_probability=0.9)
+
+
Blocking time: 4.65 seconds                                                     
+Predict time: 82.92 seconds
+
+
spark_df = results.as_spark_dataframe().show()
+
+
+------------------+------------------+-----------+-----------+------------+------------+----------------+---------------+---------------+------------------+--------------------+---------+---------+-------------+------------+------------+-------------------+------------------+----------+----------+---------+------------------+----------+----------+----------+---------+---------+------------------+------------------+--------------------+--------------------+-----------+----------+----------+-------------------+-------------------+---------+
+|      match_weight| match_probability|unique_id_l|unique_id_r|first_name_l|first_name_r|gamma_first_name|tf_first_name_l|tf_first_name_r|     bf_first_name|bf_tf_adj_first_name|surname_l|surname_r|gamma_surname|tf_surname_l|tf_surname_r|         bf_surname| bf_tf_adj_surname|     dob_l|     dob_r|gamma_dob|            bf_dob|    city_l|    city_r|gamma_city|tf_city_l|tf_city_r|           bf_city|    bf_tf_adj_city|             email_l|             email_r|gamma_email|tf_email_l|tf_email_r|           bf_email|    bf_tf_adj_email|match_key|
++------------------+------------------+-----------+-----------+------------+------------+----------------+---------------+---------------+------------------+--------------------+---------+---------+-------------+------------+------------+-------------------+------------------+----------+----------+---------+------------------+----------+----------+----------+---------+---------+------------------+------------------+--------------------+--------------------+-----------+----------+----------+-------------------+-------------------+---------+
+|15.131885475840011|0.9999721492762709|         51|         56|      Jayden|      Jayden|               4|          0.008|          0.008|11.371009132404957|  4.0525525525525525|  Bennett|  Bennett|            4|       0.006|       0.006|  9.113630950205666| 5.981981981981981|2017-01-11|2017-02-10|        1|14.373012181955707|   Swansea|   Swansea|         1|    0.013|    0.013|5.8704874944935215| 5.481481481481482|                 NaN|       jb88@king.com|          0|     0.211|     0.004|0.35260600559686806|                1.0|        0|
+|  7.86514930254232|0.9957293356289956|        575|        577|     Jessica|     Jessica|               4|          0.011|          0.011|11.371009132404957|  2.9473109473109473|     Owen|      NaN|            0|       0.006|       0.181|0.45554364195240765|               1.0|1974-11-17|1974-11-17|        3|220.92747883214062|       NaN|       NaN|         1|    0.187|    0.187|5.8704874944935215|0.3810655575361458|                 NaN|jessica.owen@elli...|          0|     0.211|     0.002|0.35260600559686806|                1.0|        0|
+| 5.951711022429932|0.9841000517299358|        171|        174|         NaN|        Leah|               0|          0.169|          0.002|0.4452000905514796|                 1.0|  Russell|  Russell|            4|        0.01|        0.01|  9.113630950205666| 3.589189189189189|2011-06-08|2012-07-09|        0|0.2607755750325071|    London|    London|         1|    0.173|    0.173|5.8704874944935215|0.4119032327124813|leahrussell@charl...|leahrussell@charl...|          4|     0.005|     0.005|  8.411105418567649|  9.143943943943944|        1|
+|21.650093935297473|0.9999996961409438|        518|        519|      Amelia|     Amlelia|               2|          0.009|          0.001| 47.10808446952784|                 1.0|   Morgan|   Morgan|            4|       0.012|       0.012|  9.113630950205666|2.9909909909909906|2011-05-26|2011-05-26|        3|220.92747883214062|   Swindno|   Swindon|         0|    0.001|     0.01|0.6263033203299755|               1.0|amelia.morgan92@d...|amelia.morgan92@d...|          3|     0.004|     0.001| 211.35554441198767|                1.0|        1|
+|11.456207518049865|0.9996442185022277|        752|        754|        Jaes|         NaN|               0|          0.001|          0.169|0.4452000905514796|                 1.0|      NaN|      NaN|            4|       0.181|       0.181|  9.113630950205666|0.1982977452590712|1972-07-20|1971-07-20|        2| 84.28155355946456|       NaN|       NaN|         1|    0.187|    0.187|5.8704874944935215|0.3810655575361458|       j.c@white.org|      j.c@whige.wort|          3|     0.002|     0.001| 211.35554441198767|                1.0|        1|
+|24.387299048327478|0.9999999544286963|        760|        761|       Henry|       Henry|               4|          0.009|          0.009|11.371009132404957|   3.602268935602269|      Day|      Day|            4|       0.004|       0.004|  9.113630950205666| 8.972972972972972|2002-09-15|2002-08-18|        1|14.373012181955707|     Leeds|     Leeds|         1|    0.017|    0.017|5.8704874944935215| 4.191721132897603|hday48@thomas-car...|hday48@thomas-car...|          3|     0.003|     0.001| 211.35554441198767|                1.0|        0|
+|12.076660303346712|0.9997685471829967|        920|        922|         Evi|        Evie|               3|          0.001|          0.007| 61.79623639995749|                 1.0|    Jones|    Jones|            4|       0.023|       0.023|  9.113630950205666|1.5605170387779081|2012-06-19|2002-07-22|        0|0.2607755750325071|       NaN|       NaN|         1|    0.187|    0.187|5.8704874944935215|0.3810655575361458|eviejones@brewer-...|eviejones@brewer-...|          4|     0.004|     0.004|  8.411105418567649|  11.42992992992993|        1|
+| 4.002786788974079|0.9412833223288347|        171|        175|         NaN|       Lheah|               0|          0.169|          0.001|0.4452000905514796|                 1.0|  Russell|  Russell|            4|        0.01|        0.01|  9.113630950205666| 3.589189189189189|2011-06-08|2011-07-10|        0|0.2607755750325071|    London|   Londoon|         0|    0.173|    0.002|0.6263033203299755|               1.0|leahrussell@charl...|leahrussell@charl...|          4|     0.005|     0.005|  8.411105418567649|  9.143943943943944|        1|
+|19.936162812706836|0.9999990031804153|        851|        853|    Mhichael|     Michael|               2|          0.001|          0.006| 47.10808446952784|                 1.0|      NaN|      NaN|            4|       0.181|       0.181|  9.113630950205666|0.1982977452590712|2000-04-03|2000-04-03|        3|220.92747883214062|    London|    London|         1|    0.173|    0.173|5.8704874944935215|0.4119032327124813|      m.w@cannon.com|      m@w.cannon.com|          2|     0.002|     0.001| 251.69908796212906|                1.0|        1|
+| 21.33290823458872|0.9999996214227064|        400|        402|       James|       James|               4|          0.013|          0.013|11.371009132404957|  2.4938784938784937|    Dixon|    Dixon|            4|       0.009|       0.009|  9.113630950205666| 3.987987987987988|1991-04-12|1991-04-12|        3|220.92747883214062|       NaN|   Loodnon|         0|    0.187|    0.001|0.6263033203299755|               1.0|james.d@merritot-...|james.d@merritt-s...|          3|     0.001|     0.005| 211.35554441198767|                1.0|        0|
+|22.169132705637786|0.9999997879560012|         81|         84|        Ryan|        Ryan|               4|          0.005|          0.005|11.371009132404957|   6.484084084084084|     Cole|     Cole|            4|       0.005|       0.005|  9.113630950205666| 7.178378378378378|1987-05-27|1988-05-27|        2| 84.28155355946456|       NaN|   Bristol|         0|    0.187|    0.016|0.6263033203299755|               1.0|r.cole1@ramirez-a...|r.cole1@ramtrez-a...|          3|     0.005|     0.001| 211.35554441198767|                1.0|        0|
+|6.1486678498977065|0.9861008615160808|        652|        654|         NaN|         NaN|               4|          0.169|          0.169|11.371009132404957| 0.19183680722142257|  Roberts|      NaN|            0|       0.006|       0.181|0.45554364195240765|               1.0|1990-10-26|1990-10-26|        3|220.92747883214062|Birmingham|Birmingham|         1|     0.04|     0.04|5.8704874944935215|1.7814814814814814|                 NaN|droberts73@taylor...|          0|     0.211|     0.003|0.35260600559686806|                1.0|        0|
+|17.935398542824068|0.9999960106207738|        582|        584|      ilivOa|      Olivia|               1|          0.001|          0.014| 3.944098136204933|                 1.0|  Edwards|  Edwards|            4|       0.008|       0.008|  9.113630950205666| 4.486486486486486|1988-12-27|1988-12-27|        3|220.92747883214062|    Dudley|   Duudley|         0|    0.006|    0.001|0.6263033203299755|               1.0|      oe56@lopez.net|      oe56@lopez.net|          4|     0.003|     0.003|  8.411105418567649| 15.239906573239907|        1|
+|21.036204363210302|0.9999995349803662|        978|        981|     Jessica|     Jessica|               4|          0.011|          0.011|11.371009132404957|  2.9473109473109473|   Miller|  Miiller|            3|       0.004|       0.001|  82.56312210691897|               1.0|2001-05-23|2001-05-23|        3|220.92747883214062|       NaN|  Coventry|         0|    0.187|    0.021|0.6263033203299755|               1.0|jessica.miller@jo...|jessica.miller@jo...|          4|     0.006|     0.006|  8.411105418567649|  7.619953286619953|        0|
+|13.095432674729635|0.9998857562788657|        684|        686|       Rosie|       Rosie|               4|          0.005|          0.005|11.371009132404957|   6.484084084084084|  Johnstn| Johnston|            3|       0.001|       0.002|  82.56312210691897|               1.0|1979-12-23|1978-11-23|        1|14.373012181955707|       NaN| Sheffield|         0|    0.187|    0.007|0.6263033203299755|               1.0|                 NaN|                 NaN|          4|     0.211|     0.211|  8.411105418567649|0.21668113611241574|        0|
+|25.252698357543103|0.9999999749861632|        279|        280|        Lola|        Lola|               4|          0.008|          0.008|11.371009132404957|  4.0525525525525525|   Taylor|   Taylor|            4|       0.014|       0.014|  9.113630950205666|2.5637065637065635|2017-11-20|2016-11-20|        2| 84.28155355946456|  Aberdeen|  Aberdeen|         1|    0.016|    0.016|5.8704874944935215| 4.453703703703703|lolat86@bishop-gi...|lolat86@bishop-gi...|          4|     0.002|     0.002|  8.411105418567649|  22.85985985985986|        0|
+| 9.711807138722323|0.9988089303569408|         42|         43|    Theodore|    Theodore|               4|           0.01|           0.01|11.371009132404957|   3.242042042042042|   Morris|   Morris|            4|       0.004|       0.004|  9.113630950205666| 8.972972972972972|1978-09-18|1978-08-19|        1|14.373012181955707|Birgmhniam|Birmingham|         0|    0.001|     0.04|0.6263033203299755|               1.0|                 NaN|t.m39@brooks-sawy...|          0|     0.211|     0.005|0.35260600559686806|                1.0|        0|
+| 5.951711022429932|0.9841000517299358|        173|        174|         NaN|        Leah|               0|          0.169|          0.002|0.4452000905514796|                 1.0|  Russell|  Russell|            4|        0.01|        0.01|  9.113630950205666| 3.589189189189189|2011-06-08|2012-07-09|        0|0.2607755750325071|    London|    London|         1|    0.173|    0.173|5.8704874944935215|0.4119032327124813|leahrussell@charl...|leahrussell@charl...|          4|     0.005|     0.005|  8.411105418567649|  9.143943943943944|        1|
+| 23.43211696288854|0.9999999116452517|         88|         89|        Lexi|        Lexi|               4|          0.003|          0.003|11.371009132404957|  10.806806806806806|      NaN|      NaN|            4|       0.181|       0.181|  9.113630950205666|0.1982977452590712|1994-09-02|1994-09-02|        3|220.92747883214062|Birmingham|Birmingham|         1|     0.04|     0.04|5.8704874944935215|1.7814814814814814|l.gordon34cfren@h...|l.gordon34@french...|          2|     0.001|     0.002| 251.69908796212906|                1.0|        0|
+|7.1659948250873144|0.9930847652376709|        391|        393|       Isaac|       Isaac|               4|          0.005|          0.005|11.371009132404957|   6.484084084084084|      NaN|    James|            0|       0.181|       0.007|0.45554364195240765|               1.0|1991-05-06|1991-05-06|        3|220.92747883214062|     Lodon|    London|         0|    0.008|    0.173|0.6263033203299755|               1.0|isaac.james@smich...|                 NaN|          0|     0.001|     0.211|0.35260600559686806|                1.0|        0|
++------------------+------------------+-----------+-----------+------------+------------+----------------+---------------+---------------+------------------+--------------------+---------+---------+-------------+------------+------------+-------------------+------------------+----------+----------+---------+------------------+----------+----------+----------+---------+---------+------------------+------------------+--------------------+--------------------+-----------+----------+----------+-------------------+-------------------+---------+
+only showing top 20 rows
+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/examples/sqlite/dashboards/50k_cluster.html b/demos/examples/sqlite/dashboards/50k_cluster.html new file mode 100644 index 0000000000..4ac584e47f --- /dev/null +++ b/demos/examples/sqlite/dashboards/50k_cluster.html @@ -0,0 +1,11080 @@ + + + + + +Splink cluster studio + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +

Splink cluster studio

+ +
+
+ +
+
+
+ +
+
+ + +
+
+
+
+ +
+
+
+ + +
+
+
+ +
+ + + + + +
+
+
+ + + + + +
+ + + + + + diff --git a/demos/examples/sqlite/deduplicate_50k_synthetic.html b/demos/examples/sqlite/deduplicate_50k_synthetic.html new file mode 100644 index 0000000000..d37fae4718 --- /dev/null +++ b/demos/examples/sqlite/deduplicate_50k_synthetic.html @@ -0,0 +1,6240 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Deduplicate 50k rows historical persons - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Deduplicate 50k rows historical persons

+ +

Linking a dataset of real historical persons

+

In this example, we deduplicate a more realistic dataset. The data is based on historical persons scraped from Wikidata. Duplicate records have been introduced, containing a variety of errors.

+

Note, as explained in the backends topic guide, SQLite does not natively support fuzzy string matching functions such as Damerau-Levenshtein and Jaro-Winkler (as used in this example). Instead, these have been imported as Python user-defined functions (UDFs). One drawback of Python UDFs is that they are considerably slower than native SQL comparisons. As such, if you are hitting issues with long run times, consider switching to DuckDB (or another backend).

+
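To illustrate what "imported as Python UDFs" means in practice, here is a minimal sketch of registering a similarity function with SQLite. It uses stdlib difflib as a stand-in for the rapidfuzz functions that Splink's SQLiteAPI registers for you; the function name and behaviour here are illustrative only:

```python
import sqlite3
from difflib import SequenceMatcher

# Stand-in similarity function; Splink registers rapidfuzz equivalents
# (e.g. Jaro-Winkler) with SQLite in the same way.
def similarity(a, b):
    if a is None or b is None:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()

con = sqlite3.connect(":memory:")
con.create_function("similarity", 2, similarity)

score = con.execute("select similarity('Smith', 'Smyth')").fetchone()[0]
print(round(score, 2))  # 0.8
```

Because each call crosses from SQL back into Python, UDFs like this carry per-row overhead — the reason the note above suggests DuckDB for larger workloads.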

+ Open In Colab +

+
# Uncomment and run this cell if you're running in Google Colab.
+# !pip install splink
+# !pip install rapidfuzz
+
+
import pandas as pd
+
+from splink import splink_datasets
+
+pd.options.display.max_rows = 1000
+# reduce size of dataset to make things run faster
+df = splink_datasets.historical_50k.sample(5000)
+
+
from splink.backends.sqlite import SQLiteAPI
+from splink.exploratory import profile_columns
+
+db_api = SQLiteAPI()
+profile_columns(
+    df, db_api, column_expressions=["first_name", "postcode_fake", "substr(dob, 1,4)"]
+)
+
+ +
+ + +
from splink import block_on
+from splink.blocking_analysis import (
+    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
+)
+
blocking_rules = [
+    block_on("first_name", "surname"),
+    block_on("surname", "dob"),
+    block_on("first_name", "dob"),
+    block_on("postcode_fake", "first_name"),
+]
+
+db_api = SQLiteAPI()
+
+cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
+    table_or_tables=df,
+    blocking_rules=blocking_rules,
+    db_api=db_api,
+    link_type="dedupe_only"
+)
+
+ +
+ + +
import splink.comparison_library as cl
+from splink import Linker
+
+settings = {
+    "link_type": "dedupe_only",
+    "blocking_rules_to_generate_predictions": [
+        block_on("first_name", "surname"),
+        block_on("surname", "dob"),
+        block_on("first_name", "dob"),
+        block_on("postcode_fake", "first_name"),
+    ],
+    "comparisons": [
+        cl.NameComparison("first_name"),
+        cl.NameComparison("surname"),
+        cl.DamerauLevenshteinAtThresholds("dob", [1, 2]).configure(
+            term_frequency_adjustments=True
+        ),
+        cl.DamerauLevenshteinAtThresholds("postcode_fake", [1, 2]),
+        cl.ExactMatch("birth_place").configure(term_frequency_adjustments=True),
+        cl.ExactMatch("occupation").configure(term_frequency_adjustments=True),
+    ],
+    "retain_matching_columns": True,
+    "retain_intermediate_calculation_columns": True,
+    "max_iterations": 10,
+    "em_convergence": 0.01,
+}
+
+linker = Linker(df, settings, db_api=db_api)
+
+
linker.training.estimate_probability_two_random_records_match(
+    [
+        "l.first_name = r.first_name and l.surname = r.surname and l.dob = r.dob",
+        "substr(l.first_name,1,2) = substr(r.first_name,1,2) and l.surname = r.surname and substr(l.postcode_fake,1,2) = substr(r.postcode_fake,1,2)",
+        "l.dob = r.dob and l.postcode_fake = r.postcode_fake",
+    ],
+    recall=0.6,
+)
+
+
Probability two random records match is estimated to be  0.000125.
+This means that amongst all possible pairwise record comparisons, one in 7,985.62 are expected to match.  With 12,497,500 total possible comparisons, we expect a total of around 1,565.00 matching pairs
+
+
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
+----- Estimating u probabilities using random sampling -----
+u probability not trained for first_name - Jaro-Winkler distance of first_name >= 0.88 (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
+u probability not trained for surname - Jaro-Winkler distance of surname >= 0.88 (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
+
+Estimated u probabilities using random sampling
+
+Your model is not yet fully trained. Missing estimates for:
+    - first_name (some u values are not trained, no m values are trained).
+    - surname (some u values are not trained, no m values are trained).
+    - dob (no m values are trained).
+    - postcode_fake (no m values are trained).
+    - birth_place (no m values are trained).
+    - occupation (no m values are trained).
+
+
training_blocking_rule = "l.first_name = r.first_name and l.surname = r.surname"
+training_session_names = linker.training.estimate_parameters_using_expectation_maximisation(
+    training_blocking_rule, estimate_without_term_frequencies=True
+)
+
+
----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+l.first_name = r.first_name and l.surname = r.surname
+
+Parameter estimates will be made for the following comparison(s):
+    - dob
+    - postcode_fake
+    - birth_place
+    - occupation
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - first_name
+    - surname
+
+Iteration 1: Largest change in params was -0.438 in probability_two_random_records_match
+Iteration 2: Largest change in params was -0.0347 in probability_two_random_records_match
+Iteration 3: Largest change in params was -0.0126 in the m_probability of birth_place, level `All other comparisons`
+Iteration 4: Largest change in params was 0.00644 in the m_probability of birth_place, level `Exact match on birth_place`
+
+EM converged after 4 iterations
+
+Your model is not yet fully trained. Missing estimates for:
+    - first_name (some u values are not trained, no m values are trained).
+    - surname (some u values are not trained, no m values are trained).
+
+
training_blocking_rule = "l.dob = r.dob"
+training_session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
+    training_blocking_rule, estimate_without_term_frequencies=True
+)
+
+
----- Starting EM training session -----
+
+Estimating the m probabilities of the model by blocking on:
+l.dob = r.dob
+
+Parameter estimates will be made for the following comparison(s):
+    - first_name
+    - surname
+    - postcode_fake
+    - birth_place
+    - occupation
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - dob
+
+WARNING:
+Level Jaro-Winkler distance of first_name >= 0.88 on comparison first_name not observed in dataset, unable to train m value
+
+WARNING:
+Level Jaro-Winkler distance of surname >= 0.88 on comparison surname not observed in dataset, unable to train m value
+
+Iteration 1: Largest change in params was 0.327 in the m_probability of first_name, level `All other comparisons`
+Iteration 2: Largest change in params was -0.0566 in the m_probability of surname, level `Exact match on surname`
+Iteration 3: Largest change in params was -0.0184 in the m_probability of surname, level `Exact match on surname`
+Iteration 4: Largest change in params was -0.006 in the m_probability of surname, level `Exact match on surname`
+
+EM converged after 4 iterations
+m probability not trained for first_name - Jaro-Winkler distance of first_name >= 0.88 (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
+m probability not trained for surname - Jaro-Winkler distance of surname >= 0.88 (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
+
+Your model is not yet fully trained. Missing estimates for:
+    - first_name (some u values are not trained, some m values are not trained).
+    - surname (some u values are not trained, some m values are not trained).
+
+

The final match weights can be viewed in the match weights chart:

+
linker.visualisations.match_weights_chart()
+
+ +
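The match weights plotted here relate directly to the match probabilities in the predictions: a match weight is the log2 of the overall Bayes factor, so a weight w corresponds to a probability of 2**w / (1 + 2**w). As a rough check of the conversion (17.042339 is a weight that appears in the prediction rows later in this example):

```python
# Sketch of the weight-to-probability relationship used by Splink.
def weight_to_probability(w):
    bayes_factor = 2 ** w
    return bayes_factor / (1 + bayes_factor)

print(weight_to_probability(0))                    # 0.5 (even odds)
print(round(weight_to_probability(17.042339), 6))  # 0.999993
```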
+ + +
linker.evaluation.unlinkables_chart()
+
+ +
+ + +
df_predict = linker.inference.predict()
+df_e = df_predict.as_pandas_dataframe(limit=5)
+df_e
+
+
 -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'first_name':
+    m values not fully trained
+Comparison: 'first_name':
+    u values not fully trained
+Comparison: 'surname':
+    m values not fully trained
+Comparison: 'surname':
+    u values not fully trained
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
match_weightmatch_probabilityunique_id_lunique_id_rfirst_name_lfirst_name_rgamma_first_nametf_first_name_ltf_first_name_rbf_first_name...bf_birth_placebf_tf_adj_birth_placeoccupation_loccupation_rgamma_occupationtf_occupation_ltf_occupation_rbf_occupationbf_tf_adj_occupationmatch_key
026.9320831.000000Q446382-1Q446382-3mariannemarianne40.0008010.00080151.871289...0.1623661.000000NoneNone-1NaNNaN1.0000001.0000000
130.7888001.000000Q2835078-1Q2835078-2alfredalfred40.0136220.01362251.871289...197.4525260.607559NoneNone-1NaNNaN1.0000001.0000000
223.8823401.000000Q2835078-1Q2835078-5alfredalfred40.0136220.01362251.871289...1.0000001.000000NoneNone-1NaNNaN1.0000001.0000000
339.9321871.000000Q80158702-1Q80158702-4johnjohn40.0530850.05308551.871289...197.4525262.025198sculptorsculptor10.0027690.00276923.83678113.8680190
417.0423390.999993Q18810722-3Q18810722-6frederickfrederick40.0122200.01222051.871289...197.4525260.607559printerprinter10.0007910.00079123.83678148.5380670
+

5 rows × 44 columns

+
+ +

You can also view rows in this dataset as a waterfall chart as follows:

+
records_to_plot = df_e.to_dict(orient="records")
+linker.visualisations.waterfall_chart(records_to_plot, filter_nulls=False)
+
+ +
+ + +
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
+    df_predict, threshold_match_probability=0.95
+)
+
+
Completed iteration 1, root rows count 5
+Completed iteration 2, root rows count 0
+
+
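Conceptually, cluster_pairwise_predictions_at_threshold keeps only the pairs scoring at or above the threshold and then groups records into connected components. Splink implements this as an iterative SQL algorithm (hence the "Completed iteration" log lines above); the union-find sketch below, on hypothetical toy pairs, just shows the idea:

```python
# Illustrative connected-components clustering over thresholded pairs.
def cluster(records, scored_pairs, threshold):
    parent = {r: r for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union the two sides of every pair that clears the threshold
    for a, b, prob in scored_pairs:
        if prob >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return sorted(clusters.values(), key=min)

records = ["A", "B", "C", "D", "E"]
scored_pairs = [("A", "B", 0.99), ("B", "C", 0.97), ("D", "E", 0.40)]
print([sorted(c) for c in cluster(records, scored_pairs, 0.95)])
# [['A', 'B', 'C'], ['D'], ['E']]
```

Note that A and C end up in the same cluster without ever being compared directly, and that records with no qualifying pair remain singleton clusters.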
linker.visualisations.cluster_studio_dashboard(
+    df_predict,
+    clusters,
+    "dashboards/50k_cluster.html",
+    sampling_method="by_cluster_size",
+    overwrite=True,
+)
+
+from IPython.display import IFrame
+
+IFrame(src="./dashboards/50k_cluster.html", width="100%", height=1200)
+
+

+

+
linker.evaluation.accuracy_analysis_from_labels_column(
+    "cluster", output_type="roc", match_weight_round_to_nearest=0.02
+)
+
+
 -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'first_name':
+    m values not fully trained
+Comparison: 'first_name':
+    u values not fully trained
+Comparison: 'surname':
+    m values not fully trained
+Comparison: 'surname':
+    u values not fully trained
+
+ +
+ + +
records = linker.evaluation.prediction_errors_from_labels_column(
+    "cluster",
+    threshold_match_probability=0.999,
+    include_false_negatives=False,
+    include_false_positives=True,
+).as_record_dict()
+linker.visualisations.waterfall_chart(records)
+
+
 -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'first_name':
+    m values not fully trained
+Comparison: 'first_name':
+    u values not fully trained
+Comparison: 'surname':
+    m values not fully trained
+Comparison: 'surname':
+    u values not fully trained
+
+ +
+ + +
# Some of the false negatives will be because they weren't detected by the blocking rules
+records = linker.evaluation.prediction_errors_from_labels_column(
+    "cluster",
+    threshold_match_probability=0.5,
+    include_false_negatives=True,
+    include_false_positives=False,
+).as_record_dict(limit=50)
+
+linker.visualisations.waterfall_chart(records)
+
+
 -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'first_name':
+    m values not fully trained
+Comparison: 'first_name':
+    u values not fully trained
+Comparison: 'surname':
+    m values not fully trained
+Comparison: 'surname':
+    u values not fully trained
+
+ +
+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/tutorials/00_Tutorial_Introduction.html b/demos/tutorials/00_Tutorial_Introduction.html new file mode 100644 index 0000000000..42603fc8e2 --- /dev/null +++ b/demos/tutorials/00_Tutorial_Introduction.html @@ -0,0 +1,5337 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Introduction - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Introductory tutorial

+

This is the introduction to a seven-part tutorial which demonstrates how to de-duplicate a small dataset using simple settings.

+

The aim of the tutorial is to demonstrate core Splink functionality succinctly, rather than comprehensively documenting all configuration options.

+

The seven parts are:

+ +

Throughout the tutorial, we use the DuckDB backend, which is the recommended option for smaller datasets of up to around 1 million records on a normal laptop.

+

You can find these tutorial notebooks in the docs/demos/tutorials/ folder of the splink repo, or click the Colab links to run in your browser.

+

End-to-end demos

+

After following the steps of the tutorial, it might prove useful to have a look at some of the example notebooks that show various use-case scenarios of Splink from start to finish.

+

Interactive Introduction to Record Linkage Theory

+

If you'd like to learn more about record linkage theory, an interactive introduction is available here.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/tutorials/01_Prerequisites.html b/demos/tutorials/01_Prerequisites.html new file mode 100644 index 0000000000..4603e2d080 --- /dev/null +++ b/demos/tutorials/01_Prerequisites.html @@ -0,0 +1,5371 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + 1. Data prep prerequisites - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + +

Data Prerequisites

+

Splink requires that you clean your data and assign unique IDs to rows before linking.

+

This section outlines the additional data cleaning steps needed before loading data into Splink.

+

Unique IDs

+
    +
  • Each input dataset must have a unique ID column, which is unique within the dataset. By default, Splink assumes this column will be called unique_id, but this can be changed with the unique_id_column_name key in your Splink settings. The unique ID is essential because it enables Splink to keep track of each row correctly.
  • +
+
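As a sketch, a unique ID can be added with pandas before loading the data into Splink (the column values here are illustrative):

```python
import pandas as pd

# Hypothetical input data that lacks an ID column
df = pd.DataFrame({"first_name": ["Robert", "Rob", "Grace"]})

# Assign a sequential unique_id (Splink's default expected column name)
df["unique_id"] = range(len(df))

assert df["unique_id"].is_unique
```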

Conformant input datasets

+
    +
  • Input datasets must be conformant, meaning they share the same column names and data formats. For instance, if one dataset has a "date of birth" column and another has a "dob" column, rename them to match. Ensure data type and number formatting are consistent across both columns. The order of columns in input dataframes is not important.
  • +
+
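For example, aligning two hypothetical datasets with mismatched column names might look like this in pandas (names are illustrative):

```python
import pandas as pd

# Two input datasets whose schemas do not yet match
df_left = pd.DataFrame({"unique_id": [1], "date_of_birth": ["1971-06-24"]})
df_right = pd.DataFrame({"unique_id": [1], "dob": ["1971-06-24"]})

# Rename so both datasets share an identical schema before linking
df_right = df_right.rename(columns={"dob": "date_of_birth"})
```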

Cleaning

+
    +
  • Ensure data consistency by cleaning your data. This process includes standardizing date formats, matching text case, and handling invalid data. For example, if one dataset uses "yyyy-mm-dd" date format and another uses "mm/dd/yyyy," convert them to the same format before using Splink. Try also to identify and rectify any obvious data entry errors, such as removing values such as 'Mr' or 'Mrs' from a 'first name' column.
  • +
+

Ensure nulls are consistently and correctly represented

+
    +
  • Ensure null values (or other 'not known' indicators) are represented as true nulls, not empty strings. Splink treats null values differently from empty strings, so using true nulls guarantees proper matching across datasets.
  • +
+

Further details on data cleaning and standardisation

+

Splink performs optimally with cleaned and standardized data. Here is a non-exhaustive list of suggestions for data cleaning rules to enhance matching accuracy:

+
    +
  • Trim leading and trailing whitespace from string values (e.g., " john smith " becomes "john smith").
  • +
  • Remove special characters from string values (e.g., "O'Hara" becomes "Ohara").
  • +
  • Standardise date formats as strings in "yyyy-mm-dd" format.
  • +
  • Replace abbreviations with full words (e.g., standardize "St." and "Street" to "Street").
  • +
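A minimal pandas sketch of these cleaning steps (column names, formats and rules are illustrative, not a Splink API):

```python
import pandas as pd

# Hypothetical messy input data
df = pd.DataFrame(
    {
        "first_name": ["  john smith ", "O'Hara", ""],
        "dob": ["03/21/1971", "11/05/1988", None],
    }
)

# Trim whitespace and strip special characters from names
df["first_name"] = (
    df["first_name"].str.strip().str.replace(r"[^A-Za-z ]", "", regex=True)
)

# Standardise dates to yyyy-mm-dd strings
df["dob"] = pd.to_datetime(df["dob"], format="%m/%d/%Y").dt.strftime("%Y-%m-%d")

# Represent missing values as true nulls rather than empty strings
df["first_name"] = df["first_name"].replace({"": None})
```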
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/tutorials/02_Exploratory_analysis.html b/demos/tutorials/02_Exploratory_analysis.html new file mode 100644 index 0000000000..349fefc552 --- /dev/null +++ b/demos/tutorials/02_Exploratory_analysis.html @@ -0,0 +1,5593 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + 2. Exploratory analysis - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Exploratory analysis

+

+ Open In Colab +

+

Exploratory analysis helps you understand features of your data which are relevant to linking or deduplicating your data.

+

Splink includes a variety of charts to help with this, which are demonstrated in this notebook.

+

Read in the data

+

For the purpose of this tutorial we will use a 1,000 row synthetic dataset that contains duplicates.

+

The first five rows of this dataset are printed below.

+

Note that the cluster column represents the 'ground truth' - a column which tells us which rows refer to the same person. In most real linkage scenarios, we wouldn't have this column (this is what Splink is trying to estimate).

+
from splink import splink_datasets
+
+df = splink_datasets.fake_1000
+df = df.drop(columns=["cluster"])
+df.head(5)
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
unique_idfirst_namesurnamedobcityemail
00RobertAlan1971-06-24NaNrobert255@smith.net
11RobertAllen1971-05-24NaNroberta25@smith.net
22RobAllen1971-06-24Londonroberta25@smith.net
33RobertAlen1971-06-24LononNaN
44GraceNaN1997-04-26Hullgrace.kelly52@jones.com
+
+ +

Analyse missingness

+

It's important to understand the level of missingness in your data, because columns with higher levels of missingness are less useful for data linking.

+
from splink.exploratory import completeness_chart
+from splink import DuckDBAPI
+db_api = DuckDBAPI()
+completeness_chart(df, db_api=db_api)
+
+ +
+ + +

The above summary chart shows that in this dataset, the email, city, surname and first_name columns contain nulls, but the level of missingness is relatively low (less than 22%).

+
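The same kind of summary can be sanity-checked with plain pandas (a toy frame here stands in for the tutorial dataset):

```python
import pandas as pd

# Toy stand-in for the tutorial data
df = pd.DataFrame(
    {
        "surname": ["Allen", None, "Allen", "Kelly"],
        "email": [None, None, "roberta25@smith.net", None],
    }
)

# Share of nulls per column: higher missingness means a column
# contributes less evidence for linking
null_share = df.isna().mean().sort_values(ascending=False)
print(null_share)
```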

Analyse the distribution of values in your data

+

The distribution of values in your data is important for two main reasons:

+
    +
  1. +

    Columns with higher cardinality (number of distinct values) are usually more useful for data linking. For instance, date of birth is a much stronger linkage variable than gender.

    +
  2. +
  3. +

    The skew of values is important. If you have a city column that has 1,000 distinct values, but 75% of them are London, this is much less useful for linkage than if the 1,000 values were equally distributed.

    +
  4. +
+
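Both properties can be checked quickly with plain pandas before profiling (toy data standing in for the tutorial dataset):

```python
import pandas as pd

# Toy stand-in for the tutorial data
df = pd.DataFrame(
    {
        "city": ["London", "London", "London", "Hull"],
        "dob": ["1971-06-24", "1971-05-24", "1997-04-26", "2000-01-02"],
    }
)

# Cardinality: dob has more distinct values, so it carries more
# linking power than city
print(df["city"].nunique(), df["dob"].nunique())

# Skew: 75% of the city values are London
print(df["city"].value_counts(normalize=True))
```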

The profile_columns() function creates summary charts to help you understand these aspects of your data.

+

To profile all columns, leave the column_expressions argument empty.

+
from splink.exploratory import profile_columns
+
+profile_columns(df, db_api=DuckDBAPI(), top_n=10, bottom_n=5)
+
+ +
+ + +

This chart is very information-dense, but here are some key takeaways relevant to our linkage:

+
    +
  • +

    There is strong skew in the city field with around 20% of the values being London. We therefore will probably want to use term_frequency_adjustments in our linkage model, so that it can weight a match on London differently to a match on, say, Norwich.

    +
  • +
  • +

    Looking at the "Bottom 5 values by value count", we can see typos in the data in most fields. This tells us this information was possibly entered by hand, or using Optical Character Recognition, giving us an insight into the type of data entry errors we may see.

    +
  • +
  • +

    Email is a much more uniquely identifying field than any other, with a maximum value count of 6. It's likely to be a strong linking variable.

    +
  • +
+
+

Further Reading

+

For more on exploratory analysis tools in Splink, please refer to the Exploratory Analysis API documentation.

+

📊 For more on the charts used in this tutorial, please refer to the Charts Gallery.

+
+

Next steps

+

At this point, we have begun to develop a strong understanding of our data. It's time to move on to estimating a linkage model

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/tutorials/03_Blocking.html b/demos/tutorials/03_Blocking.html new file mode 100644 index 0000000000..122e136a56 --- /dev/null +++ b/demos/tutorials/03_Blocking.html @@ -0,0 +1,5819 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + 3. Blocking - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + +

Choosing blocking rules to optimise runtime

+

+ Open In Colab +

+

To link records, we need to compare pairs of records and decide which pairs are matches.

+

For example consider the following two records:

+ + + + + + + + + + + + + + + + + + + + + + + + + + +
first_namesurnamedobcityemail
RobertAllen1971-05-24nanroberta25@smith.net
RobAllen1971-06-24Londonroberta25@smith.net
+

These can be represented as a pairwise comparison as follows:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
first_name_lfirst_name_rsurname_lsurname_rdob_ldob_rcity_lcity_remail_lemail_r
RobertRobAllenAllen1971-05-241971-06-24nanLondonroberta25@smith.netroberta25@smith.net
+

For most large datasets, it is computationally intractable to compare every row with every other row, since the number of comparisons rises quadratically with the number of records.

+

Instead we rely on blocking rules, which specify which pairwise comparisons to generate. For example, we could generate the subset of pairwise comparisons where either first name or surname matches.

+

This is part of a two step process to link data:

+
    +
  1. +

    Use blocking rules to generate candidate pairwise record comparisons

    +
  2. +
  3. +

    Use a probabilistic linkage model to score these candidate pairs, to determine which ones should be linked

    +
  4. +
+

Blocking rules are the most important determinant of the performance of your linkage job.

+

When deciding on your blocking rules, you're trading off accuracy for performance:

+
    +
  • If your rules are too loose, your linkage job may fail.
  • +
  • If they're too tight, you may miss some valid links.
  • +
+

This tutorial clarifies what blocking rules are, and how to choose good rules.

+ +

In Splink, blocking rules are specified as SQL expressions.

+

For example, to generate the subset of record comparisons where the first name and surname matches, we can specify the following blocking rule:

+
from splink import block_on
+block_on("first_name", "surname")
+
+

When executed, this blocking rule will be converted to a SQL statement with the following form:

+
SELECT ...
+FROM input_tables as l
+INNER JOIN input_tables as r
+ON l.first_name = r.first_name AND l.surname = r.surname
+
+

Since blocking rules are SQL expressions, they can be arbitrarily complex. For example, you could create record comparisons where the initial of the first name and the surname match with the following rule:

+
from splink import block_on
+block_on("substr(first_name, 1, 2)", "surname")
+
+

Devising effective blocking rules for prediction

+

The aims of your blocking rules are twofold:

+
    +
  1. Eliminate enough non-matching comparison pairs so your record linkage job is small enough to compute
  2. +
  3. Eliminate as few truly matching pairs as possible (ideally none)
  4. +
+

It is usually impossible to find a single blocking rule which achieves both aims, so we recommend using multiple blocking rules.

+

When we specify multiple blocking rules, Splink will generate all comparison pairs that meet any one of the rules.

+

For example, consider the following blocking rule:

+

block_on("first_name", "dob")

+

This rule is likely to be effective in reducing the number of comparison pairs. It will retain all truly matching pairs, except those with errors or nulls in either the first_name or dob fields.

+

Now consider a second blocking rule:

+

block_on("email")

+

This will retain all truly matching pairs, except those with errors or nulls in the email column.

+

Individually, these blocking rules are problematic because they exclude true matches where the records contain typos of certain types. But between them, they might do quite a good job.

+

For a true match to be eliminated by the use of these two blocking rules, it would have to have an error in both email AND (first_name or dob).

+

This is not completely implausible, but it is significantly less likely than if we'd used a single rule.

+
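A back-of-the-envelope calculation, with error rates assumed purely for illustration, shows why the combination is much safer than either rule alone:

```python
# Assumed error rates (illustrative only, not measured from any dataset)
p_email_error = 0.10          # share of true matches with an error/null in email
p_name_or_dob_error = 0.15    # share with an error in first_name or dob

# A true match is excluded only if it fails BOTH blocking rules
p_missed_by_both = p_email_error * p_name_or_dob_error

print(f"{p_missed_by_both:.1%}")  # far lower than either rate alone
```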

More generally, we can often specify multiple blocking rules such that it becomes highly implausible that a true match would not meet at least one of these blocking criteria. This is the recommended approach in Splink. Generally we would recommend between about 3 and 10, though even more is possible.

+

The question then becomes how to choose what to put in this list.

+ +

Splink contains a number of tools to help you choose effective blocking rules. Let's try them out, using our small test dataset:

+
from splink import DuckDBAPI, block_on, splink_datasets
+
+df = splink_datasets.fake_1000
+
+

Counting the number of comparisons created by a single blocking rule

+

On large datasets, some blocking rules imply the creation of trillions of record comparisons, which would cause a linkage job to fail.

+

Before using a blocking rule in a linkage job, it's therefore a good idea to count the number of comparisons it generates to ensure it is not too loose:

+
from splink.blocking_analysis import count_comparisons_from_blocking_rule
+
+db_api = DuckDBAPI()
+
+br = block_on("substr(first_name, 1,1)", "surname")
+
+counts = count_comparisons_from_blocking_rule(
+    table_or_tables=df,
+    blocking_rule=br,
+    link_type="dedupe_only",
+    db_api=db_api,
+)
+
+counts
+
+
{'number_of_comparisons_generated_pre_filter_conditions': 1632,
+ 'number_of_comparisons_to_be_scored_post_filter_conditions': 473,
+ 'filter_conditions_identified': '',
+ 'equi_join_conditions_identified': 'SUBSTR(l.first_name, 1, 1) = SUBSTR(r.first_name, 1, 1) AND l."surname" = r."surname"',
+ 'link_type_join_condition': 'where l."unique_id" < r."unique_id"'}
+
+
br = "l.first_name = r.first_name and levenshtein(l.surname, r.surname) < 2"
+
+counts = count_comparisons_from_blocking_rule(
+    table_or_tables=df,
+    blocking_rule= br,
+    link_type="dedupe_only",
+    db_api=db_api,
+)
+counts
+
+
{'number_of_comparisons_generated_pre_filter_conditions': 4827,
+ 'number_of_comparisons_to_be_scored_post_filter_conditions': 372,
+ 'filter_conditions_identified': 'LEVENSHTEIN(l.surname, r.surname) < 2',
+ 'equi_join_conditions_identified': 'l.first_name = r.first_name',
+ 'link_type_join_condition': 'where l."unique_id" < r."unique_id"'}
+
+

The maximum number of comparisons that you can compute will be affected by your choice of SQL backend, and how powerful your computer is.

+

For linkages in DuckDB on a standard laptop, we suggest using blocking rules that create no more than about 20 million comparisons. For Spark and Athena, try starting with fewer than 100 million comparisons, before scaling up.

+

Finding 'worst offending' values for your blocking rule

+

Blocking rules can be affected by skew: some values of a field may be much more common than others, and this can lead to a disproportionate number of comparisons being generated.

+

It can be useful to identify whether your data is afflicted by this problem.

+
from splink.blocking_analysis import n_largest_blocks
+
+result = n_largest_blocks(    table_or_tables=df,
+    blocking_rule= block_on("city", "first_name"),
+    link_type="dedupe_only",
+    db_api=db_api,
+    n_largest=3
+    )
+
+result.as_pandas_dataframe()
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
key_0key_1count_lcount_rblock_count
0BirminghamTheodore7749
1LondonOliver7749
2LondonJames6636
+
+ +

In this case, we can see that Olivers in London will result in 49 comparisons being generated. This is acceptable on this small dataset, but on a larger dataset, Olivers in London could be responsible for many millions of comparisons.

+

Counting the number of comparisons created by a list of blocking rules

+

As noted above, it's usually a good idea to use multiple blocking rules. It's therefore useful to know how many record comparisons will be generated when these rules are applied.

+

Since the same record comparison may be created by several blocking rules, and Splink automatically deduplicates these comparisons, we cannot simply total the number of comparisons generated by each rule individually.

+

Splink provides a chart that shows the marginal (additional) comparisons generated by each blocking rule, after deduplication:

+
from splink.blocking_analysis import (
+    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
+)
+
+blocking_rules_for_analysis = [
+    block_on("substr(first_name, 1,1)", "surname"),
+    block_on("surname"),
+    block_on("email"),
+    block_on("city", "first_name"),
+    "l.first_name = r.first_name and levenshtein(l.surname, r.surname) < 2",
+]
+
+
+cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
+    table_or_tables=df,
+    blocking_rules=blocking_rules_for_analysis,
+    db_api=db_api,
+    link_type="dedupe_only",
+)
+
+ +
+ + +

Digging deeper: Understanding why certain blocking rules create large numbers of comparisons

+

Finally, we can use the profile_columns function we saw in the previous tutorial to understand a specific blocking rule in more depth.

+

Suppose we're interested in blocking on city and first initial.

+

Within each distinct value of (city, first initial), all possible pairwise comparisons will be generated.

+

So for instance, if there are 15 distinct records with London,J then these records will result in n(n-1)/2 = 105 pairwise comparisons being generated.

+

In a larger dataset, we might observe 10,000 London,J records, which would then be responsible for 49,995,000 comparisons.

+
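The arithmetic in these two examples can be checked directly (a small helper function, not part of Splink's API):

```python
def comparisons_in_block(n: int) -> int:
    # All pairwise comparisons among n records in the same block
    return n * (n - 1) // 2

print(comparisons_in_block(15))      # 105
print(comparisons_in_block(10_000))  # 49,995,000
```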

These high-frequency values therefore have a disproportionate influence on the overall number of pairwise comparisons, and so it can be useful to analyse skew, as follows:

+
from splink.exploratory import profile_columns
+
+profile_columns(df, column_expressions=["city || left(first_name,1)"], db_api=db_api)
+
+ +
+ + +
+

Further Reading

+

For a deeper dive on blocking, please refer to the Blocking Topic Guides.

+

For more on the blocking tools in Splink, please refer to the Blocking API documentation.

+

📊 For more on the charts used in this tutorial, please refer to the Charts Gallery.

+
+

Next steps

+

Now we have chosen which records to compare, we can use those records to train a linkage model.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/tutorials/04_Estimating_model_parameters.html b/demos/tutorials/04_Estimating_model_parameters.html new file mode 100644 index 0000000000..0a5f75bb5e --- /dev/null +++ b/demos/tutorials/04_Estimating_model_parameters.html @@ -0,0 +1,6234 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + 4. Estimating model parameters - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + +

Specifying and estimating a linkage model

+

+ Open In Colab +

+

In the last tutorial we looked at how we can use blocking rules to generate pairwise record comparisons.

+

Now it's time to estimate a probabilistic linkage model to score each of these comparisons. The resultant match score is a prediction of whether the two records represent the same entity (e.g. are the same person).

+

The purpose of estimating the model is to learn the relative importance of different parts of your data for the purpose of data linking.

+

For example, a match on date of birth is a much stronger indicator that two records refer to the same entity than a match on gender. A mismatch on gender may be stronger evidence against two records being a match than a mismatch on name, since names are more likely to be entered differently.

+

The relative importance of different information is captured in the (partial) 'match weights', which can be learned from your data. These match weights are then added up to compute the overall match score.

+

The match weights are derived from the m and u parameters of the underlying Fellegi Sunter model. Splink uses various statistical routines to estimate these parameters. Further details of the underlying theory can be found here, which will help you understand this part of the tutorial.

+
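As an illustration of that arithmetic (with toy m, u and prior values chosen for the example, not estimated from data): a partial match weight is log2(m/u), and the summed weights convert back to a match probability:

```python
import math

# Toy parameters for a single 'exact match on dob' comparison level
m = 0.95   # P(level | records truly match)      - assumed for illustration
u = 0.01   # P(level | records truly non-match)  - assumed for illustration
partial_match_weight = math.log2(m / u)  # evidence in favour, in bits

# Prior evidence comes from probability_two_random_records_match
p = 1 / 1000
prior_weight = math.log2(p / (1 - p))

# Weights are summed, then converted back to a match probability
total_weight = prior_weight + partial_match_weight
match_probability = 2**total_weight / (1 + 2**total_weight)
print(round(match_probability, 3))  # ≈ 0.087
```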

Specifying a linkage model

+

To build a linkage model, the user defines the partial match weights that Splink needs to estimate. This is done by defining how the information in the input records should be compared.

+

To be concrete, here is an example comparison:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
first_name_lfirst_name_rsurname_lsurname_rdob_ldob_rcity_lcity_remail_lemail_r
RobertRobAllenAllen1971-05-241971-06-24nanLondonroberta25@smith.netroberta25@smith.net
+

What functions should we use to assess the similarity of Rob vs. Robert in the first_name field?

+

Should similarity in the dob field be computed in the same way, or a different way?

+

Your job as the developer of a linkage model is to decide what comparisons are most appropriate for the types of data you have.

+

Splink can then estimate how much weight to place on a fuzzy match of Rob vs. Robert, relative to an exact match on Robert, or a non-match.

+

Defining these scenarios is done using Comparisons.

+

Comparisons

+

The concept of a Comparison has a specific definition within Splink: it defines how data from one or more input columns is compared.

+

For example, one Comparison may represent how similarity is assessed for a person's date of birth.

+

Another Comparison may represent the comparison of a person's name or location.

+

A model is composed of many Comparisons, which between them assess the similarity of all of the columns being used for data linking.

+

Each Comparison contains two or more ComparisonLevels which define n discrete gradations of similarity between the input columns within the Comparison.

+

As such, ComparisonLevels are nested within Comparisons as follows:

+
Data Linking Model
+├─-- Comparison: Date of birth
+│    ├─-- ComparisonLevel: Exact match
+│    ├─-- ComparisonLevel: One character difference
+│    ├─-- ComparisonLevel: All other
+├─-- Comparison: Surname
+│    ├─-- ComparisonLevel: Exact match on surname
+│    ├─-- ComparisonLevel: All other
+│    etc.
+
+

Our example data would therefore result in the following comparisons, for dob and surname:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
dob_ldob_rcomparison_levelinterpretation
1971-05-241971-05-24Exact matchgreat match
1971-05-241971-06-24One character differencefuzzy match
1971-05-242000-01-02All otherbad match
+


+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
surname_lsurname_rcomparison_levelinterpretation
RobRobExact matchgreat match
RobJaneAll otherbad match
RobRobertAll otherbad match, this comparison has no notion of nicknames
+

More information about specifying comparisons can be found here and here.

+

We will now use these concepts to build a data linking model.

+
# Begin by reading in the tutorial data again
+from splink import splink_datasets
+
+df = splink_datasets.fake_1000
+
+

Specifying the model using comparisons

+

Splink includes a library of comparison functions at splink.comparison_library to make it simple to get started. These are split into two categories:

+
    +
  1. Generic Comparison functions which apply a particular fuzzy matching function. For example, levenshtein distance.
  2. +
+
import splink.comparison_library as cl
+
+city_comparison = cl.LevenshteinAtThresholds("city", 2)
+print(city_comparison.get_comparison("duckdb").human_readable_description)
+
+
Comparison 'LevenshteinAtThresholds' of "city".
+Similarity is assessed using the following ComparisonLevels:
+    - 'city is NULL' with SQL rule: "city_l" IS NULL OR "city_r" IS NULL
+    - 'Exact match on city' with SQL rule: "city_l" = "city_r"
+    - 'Levenshtein distance of city <= 2' with SQL rule: levenshtein("city_l", "city_r") <= 2
+    - 'All other comparisons' with SQL rule: ELSE
+
+
    +
  1. Comparison functions tailored for specific data types. For example, email.
  2. +
+
email_comparison = cl.EmailComparison("email")
+print(email_comparison.get_comparison("duckdb").human_readable_description)
+
+
Comparison 'EmailComparison' of "email".
+Similarity is assessed using the following ComparisonLevels:
+    - 'email is NULL' with SQL rule: "email_l" IS NULL OR "email_r" IS NULL
+    - 'Exact match on email' with SQL rule: "email_l" = "email_r"
+    - 'Exact match on username' with SQL rule: NULLIF(regexp_extract("email_l", '^[^@]+', 0), '') = NULLIF(regexp_extract("email_r", '^[^@]+', 0), '')
+    - 'Jaro-Winkler distance of email >= 0.88' with SQL rule: jaro_winkler_similarity("email_l", "email_r") >= 0.88
+    - 'Jaro-Winkler >0.88 on username' with SQL rule: jaro_winkler_similarity(NULLIF(regexp_extract("email_l", '^[^@]+', 0), ''), NULLIF(regexp_extract("email_r", '^[^@]+', 0), '')) >= 0.88
+    - 'All other comparisons' with SQL rule: ELSE
+
+

Specifying the full settings dictionary

+

Comparisons are specified as part of the Splink settings, a Python dictionary which controls all of the configuration of a Splink model:

+
from splink import Linker, SettingsCreator, block_on, DuckDBAPI
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    comparisons=[
+        cl.NameComparison("first_name"),
+        cl.NameComparison("surname"),
+        cl.LevenshteinAtThresholds("dob", 1),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        cl.EmailComparison("email"),
+    ],
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name", "city"),
+        block_on("surname"),
+
+    ],
+    retain_intermediate_calculation_columns=True,
+)
+
+linker = Linker(df, settings, db_api=DuckDBAPI())
+
+

In words, this setting dictionary says:

+
    +
  • We are performing a dedupe_only linkage (the other options are link_only, or link_and_dedupe, which may be used if there are multiple input datasets).
  • +
  • When comparing records, we will use information from the first_name, surname, dob, city and email columns to compute a match score.
  • +
  • The blocking_rules_to_generate_predictions states that we will only check for duplicates amongst records where either the first_name AND city or surname is identical.
  • +
  • We have enabled term frequency adjustments for the 'city' column, because some values (e.g. London) appear much more frequently than others.
  • +
  • We have set retain_intermediate_calculation_columns to True so that Splink outputs additional information that helps the user understand the calculations. If it were False, the computations would run faster.
  • +
+

Estimate the parameters of the model

+

Now that we have specified our linkage model, we need to estimate the probability_two_random_records_match, u, and m parameters.

+
    +
  • +

    The probability_two_random_records_match parameter is the probability that two records taken at random from your input data represent a match (typically a very small number).

    +
  • +
  • +

    The u values are the proportion of records falling into each ComparisonLevel amongst truly non-matching records.

    +
  • +
  • +

    The m values are the proportion of records falling into each ComparisonLevel amongst truly matching records

    +
  • +
+

You can read more about the theory of what these mean.

+

We can estimate these parameters using unlabeled data. If we have labels, then we can estimate them even more accurately.

+

Estimation of probability_two_random_records_match

+

In some cases, the probability_two_random_records_match will be known. For example, if you are linking two tables of 10,000 records and expect a one-to-one match, then you should set this value to 1/10_000 in your settings instead of estimating it.

+

More generally, this parameter is unknown and needs to be estimated.

+

It can be estimated accurately enough for most purposes by combining a series of deterministic matching rules and a guess of the recall corresponding to those rules. For further details of the rationale behind this approach see here.

+

In this example, I guess that the following deterministic matching rules have a recall of about 70%. That means, between them, the rules recover 70% of all true matches.

+
deterministic_rules = [
+    block_on("first_name", "dob"),
+    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
+    block_on("email")
+]
+
+linker.training.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)
+
+
Probability two random records match is estimated to be  0.00298.
+This means that amongst all possible pairwise record comparisons, one in 335.56 are expected to match.  With 499,500 total possible comparisons, we expect a total of around 1,488.57 matching pairs
+
+
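The figures in this message can be reproduced by hand for the 1,000-record tutorial dataset:

```python
p = 0.00298  # estimated probability two random records match (as printed above)
n = 1_000    # records in the tutorial dataset

total_comparisons = n * (n - 1) // 2  # 499,500 possible pairs
expected_matches = total_comparisons * p

# ≈ 1,488.5 matching pairs; the message above reports 1,488.57
# because it uses the unrounded probability estimate
print(total_comparisons, round(expected_matches, 1))
```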

Estimation of u probabilities

+

Once we have the probability_two_random_records_match parameter, we can estimate the u probabilities.

+

We estimate u using the estimate_u_using_random_sampling method, which doesn't require any labels.

+

It works by sampling random pairs of records, since most of these pairs are going to be non-matches. Over these non-matches we compute the distribution of ComparisonLevels for each Comparison.

+

For instance, for gender, we would find that the gender matches 50% of the time, and mismatches 50% of the time.

+

For dob on the other hand, we would find that the dob matches 1% of the time, has a "one character difference" 3% of the time, and everything else happens 96% of the time.

+
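That intuition can be sketched with a toy simulation (synthetic data, not Splink's implementation): for a roughly 50/50 gender field, random pairs, which are nearly all non-matches, agree about half the time.

```python
import random

random.seed(0)

# Synthetic gender column, roughly 50/50
genders = [random.choice(["M", "F"]) for _ in range(1_000)]

# Random pairs are overwhelmingly non-matches, so their agreement
# rate approximates u for the 'exact match' comparison level
pairs = [(random.choice(genders), random.choice(genders)) for _ in range(100_000)]
u_exact_match = sum(a == b for a, b in pairs) / len(pairs)
print(round(u_exact_match, 2))  # close to 0.5
```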

The larger the random sample, the more accurate the estimates. You control this using the max_pairs parameter. For large datasets, we recommend using at least 10 million pairs; the higher the better, and 1 billion is often appropriate for very large datasets.

+
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
+
+
+----- Estimating u probabilities using random sampling -----
+
+
+
+Estimated u probabilities using random sampling
+
+
+
+Your model is not yet fully trained. Missing estimates for:
+    - first_name (no m values are trained).
+    - surname (no m values are trained).
+    - dob (no m values are trained).
+    - city (no m values are trained).
+    - email (no m values are trained).
+
+

Estimation of m probabilities

+

m is the trickiest of the parameters to estimate, because we have to have some idea of what the true matches are.

+

If we have labels, we can directly estimate it.

+

If we do not have labelled data, the m parameters can be estimated using an iterative maximum likelihood approach called Expectation Maximisation.

+

Estimating directly

+

If we have labels, we can estimate m directly using the estimate_m_from_label_column method of the linker.

+

For example, if the entity being matched is persons, and your input dataset(s) contain social security number, this could be used to estimate the m values for the model.

+

Note that this column does not need to be fully populated. A common case is where a unique identifier such as social security number is only partially populated.

+

For example (in this tutorial we don't have labels, so we're not actually going to use this):

+
linker.training.estimate_m_from_label_column("social_security_number")
+
+

Estimating with Expectation Maximisation

+

This algorithm estimates the m values by generating pairwise record comparisons, and using them to maximise a likelihood function.

+

Each estimation pass requires the user to configure an estimation blocking rule to reduce the number of record comparisons generated to a manageable level.

+

In our first estimation pass, we block on first_name and surname, meaning we will generate all record comparisons that have first_name and surname exactly equal.

+

Recall we are trying to estimate the m values of the model, i.e. the proportion of records falling into each ComparisonLevel amongst truly matching records.

+

This means that, in this training session, we cannot produce parameter estimates for the first_name or surname columns, since they will be equal for all the comparisons we generate.

+

We can, however, produce parameter estimates for all of the other columns. The output messages produced by Splink confirm this.

+
training_blocking_rule = block_on("first_name", "surname")
+training_session_fname_sname = (
+    linker.training.estimate_parameters_using_expectation_maximisation(training_blocking_rule)
+)
+
+
----- Starting EM training session -----
+
+
+
+Estimating the m probabilities of the model by blocking on:
+(l."first_name" = r."first_name") AND (l."surname" = r."surname")
+
+Parameter estimates will be made for the following comparison(s):
+    - dob
+    - city
+    - email
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - first_name
+    - surname
+
+
+
+
+
+WARNING:
+Level Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value
+
+
+
+Iteration 1: Largest change in params was -0.521 in the m_probability of dob, level `Exact match on dob`
+
+
+Iteration 2: Largest change in params was 0.0516 in probability_two_random_records_match
+
+
+Iteration 3: Largest change in params was 0.0183 in probability_two_random_records_match
+
+
+Iteration 4: Largest change in params was 0.00744 in probability_two_random_records_match
+
+
+Iteration 5: Largest change in params was 0.00349 in probability_two_random_records_match
+
+
+Iteration 6: Largest change in params was 0.00183 in probability_two_random_records_match
+
+
+Iteration 7: Largest change in params was 0.00103 in probability_two_random_records_match
+
+
+Iteration 8: Largest change in params was 0.000607 in probability_two_random_records_match
+
+
+Iteration 9: Largest change in params was 0.000367 in probability_two_random_records_match
+
+
+Iteration 10: Largest change in params was 0.000226 in probability_two_random_records_match
+
+
+Iteration 11: Largest change in params was 0.00014 in probability_two_random_records_match
+
+
+Iteration 12: Largest change in params was 8.73e-05 in probability_two_random_records_match
+
+
+
+EM converged after 12 iterations
+
+
+m probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
+
+
+
+Your model is not yet fully trained. Missing estimates for:
+    - first_name (no m values are trained).
+    - surname (no m values are trained).
+    - email (some m values are not trained).
+
+

In a second estimation pass, we block on dob. This allows us to estimate parameters for the first_name and surname comparisons.

+

Between the two estimation passes, we now have parameter estimates for all comparisons.

+
training_blocking_rule = block_on("dob")
+training_session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
+    training_blocking_rule
+)
+
+
----- Starting EM training session -----
+
+
+
+Estimating the m probabilities of the model by blocking on:
+l."dob" = r."dob"
+
+Parameter estimates will be made for the following comparison(s):
+    - first_name
+    - surname
+    - city
+    - email
+
+Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
+    - dob
+
+
+
+
+
+WARNING:
+Level Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value
+
+
+
+Iteration 1: Largest change in params was -0.407 in the m_probability of surname, level `Exact match on surname`
+
+
+Iteration 2: Largest change in params was 0.0929 in probability_two_random_records_match
+
+
+Iteration 3: Largest change in params was 0.0548 in the m_probability of first_name, level `All other comparisons`
+
+
+Iteration 4: Largest change in params was 0.0186 in probability_two_random_records_match
+
+
+Iteration 5: Largest change in params was 0.00758 in probability_two_random_records_match
+
+
+Iteration 6: Largest change in params was 0.00339 in probability_two_random_records_match
+
+
+Iteration 7: Largest change in params was 0.0016 in probability_two_random_records_match
+
+
+Iteration 8: Largest change in params was 0.000773 in probability_two_random_records_match
+
+
+Iteration 9: Largest change in params was 0.000379 in probability_two_random_records_match
+
+
+Iteration 10: Largest change in params was 0.000189 in probability_two_random_records_match
+
+
+Iteration 11: Largest change in params was 9.68e-05 in probability_two_random_records_match
+
+
+
+EM converged after 11 iterations
+
+
+m probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
+
+
+
+Your model is not yet fully trained. Missing estimates for:
+    - email (some m values are not trained).
+
+

Note that Splink includes other algorithms for estimating m and u values, which are documented here.

+

Visualising model parameters

+

Splink can generate a number of charts to help you understand your model. For an introduction to these charts and how to interpret them, please see this video.

+

The final estimated match weights can be viewed in the match weights chart:

+
linker.visualisations.match_weights_chart()
+
+ +
+ + +
linker.visualisations.m_u_parameters_chart()
+
+ +
+ + +

We can also compare the estimates that were produced by the different EM training sessions:

+
linker.visualisations.parameter_estimate_comparisons_chart()
+
+ +
+ + +

Saving the model

+

We can save the model, including our estimated parameters, to a .json file, so we can use it in the next tutorial.

+
settings = linker.misc.save_model_to_json(
+    "../demo_settings/saved_model_from_demo.json", overwrite=True
+)
+
+

Detecting unlinkable records

+

An interesting application of our trained model, worth exploring before making any predictions, is detecting 'unlinkable' records.

+

Unlinkable records are those which do not contain enough information to be linked. A simple example would be a record containing only 'John Smith', and null in all other fields. This record may link to other records, but we'll never know because there's not enough information to disambiguate any potential links. Unlinkable records can be found by linking records to themselves - if, even when matched to themselves, they don't meet the match threshold score, we can be sure they will never link to anything.

+
linker.evaluation.unlinkables_chart()
+
+ +
+ + +

In the above chart, we can see that about 1.3% of records in the input dataset are unlinkable at a threshold match weight of 6.11 (corresponding to a match probability of around 98.6%).
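The relationship between match weight and match probability quoted above can be checked directly: a match weight is the log2 Bayes factor, so probability = 2^weight / (1 + 2^weight). (A small sketch, not part of the tutorial code.)

```python
def match_weight_to_probability(weight):
    # A match weight is the log2 of the Bayes factor (odds of a match)
    bayes_factor = 2 ** weight
    return bayes_factor / (1 + bayes_factor)

print(round(match_weight_to_probability(6.11), 3))  # 0.986
```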

+
+

Further Reading

+

For more on the model estimation tools in Splink, please refer to the Model Training API documentation.

+

For a deeper dive on:

+ +

📊 For more on the charts used in this tutorial, please refer to the Charts Gallery.

+
+

Next steps

+

Now we have trained a model, we can move on to using it to predict matching records.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/tutorials/05_Predicting_results.html b/demos/tutorials/05_Predicting_results.html new file mode 100644 index 0000000000..613d6214f6 --- /dev/null +++ b/demos/tutorials/05_Predicting_results.html @@ -0,0 +1,5964 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + 5. Predicting results - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Predicting which records match

+

+ Open In Colab +

+

In the previous tutorial, we built and estimated a linkage model.

+

In this tutorial, we will load the estimated model and use it to make predictions of which pairwise record comparisons match.

+
from splink import Linker, DuckDBAPI, splink_datasets
+
+import pandas as pd
+
+pd.options.display.max_columns = 1000
+
+db_api = DuckDBAPI()
+df = splink_datasets.fake_1000
+
+

Load estimated model from previous tutorial

+
import json
+import urllib
+
+url = "https://raw.githubusercontent.com/moj-analytical-services/splink/847e32508b1a9cdd7bcd2ca6c0a74e547fb69865/docs/demos/demo_settings/saved_model_from_demo.json"
+
+with urllib.request.urlopen(url) as u:
+    settings = json.loads(u.read().decode())
+
+
+linker = Linker(df, settings, db_api=DuckDBAPI())
+
+

Predicting match weights using the trained model

+

We use linker.inference.predict() to run the model.

+

Under the hood this will:

+
    +
  • +

    Generate all pairwise record comparisons that match at least one of the blocking_rules_to_generate_predictions

    +
  • +
  • +

    Use the rules specified in the Comparisons to evaluate the similarity of the input data

    +
  • +
  • +

    Use the estimated match weights, applying term frequency adjustments where requested to produce the final match_weight and match_probability scores

    +
  • +
+

Optionally, a threshold_match_probability or threshold_match_weight can be provided, which will drop any row where the predicted score is below the threshold.

+
df_predictions = linker.inference.predict(threshold_match_probability=0.2)
+df_predictions.as_pandas_dataframe(limit=5)
+
+
 -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'email':
+    m values not fully trained
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
match_weightmatch_probabilityunique_id_lunique_id_rfirst_name_lfirst_name_rgamma_first_nametf_first_name_ltf_first_name_rbf_first_namebf_tf_adj_first_namesurname_lsurname_rgamma_surnametf_surname_ltf_surname_rbf_surnamebf_tf_adj_surnamedob_ldob_rgamma_dobbf_dobcity_lcity_rgamma_citytf_city_ltf_city_rbf_citybf_tf_adj_cityemail_lemail_rgamma_emailtf_email_ltf_email_rbf_emailbf_tf_adj_emailmatch_key
0-1.7496640.229211324326KaiKai40.0060170.00601784.8217650.962892NoneTurner-1NaN0.0073261.0000001.0000002018-12-312009-11-0300.460743LondonLondon10.2127920.21279210.201260.259162k.t50eherand@z.ncomNone-10.001267NaN1.01.00
1-1.6260760.2446952527GabrielNone-10.001203NaN1.0000001.000000ThomasThomas40.0048840.00488488.8705071.0012221977-09-131977-10-1700.460743LondonLondon10.2127920.21279210.201260.259162gabriel.t54@nichols.infoNone-10.002535NaN1.01.01
2-1.5512650.254405626629geeorGeGeorge10.0012030.0144404.1767271.000000DavidsonDavidson40.0073260.00732688.8705070.6674821999-05-072000-05-0600.460743SouthamptnNone-10.001230NaN1.000001.000000Nonegdavidson@johnson-brown.com-1NaN0.005071.01.01
3-1.4277350.270985600602TobyToby40.0048130.00481384.8217651.203614NoneNone-1NaNNaN1.0000001.0000002003-04-232013-03-2100.460743LondonLondon10.2127920.21279210.201260.259162toby.d@menhez.comNone-10.001267NaN1.01.00
4-1.4277350.270985599602TobyToby40.0048130.00481384.8217651.203614HaallNone-10.001221NaN1.0000001.0000002003-04-232013-03-2100.460743LondonLondon10.2127920.21279210.201260.259162NoneNone-1NaNNaN1.01.00
+
+ +

Clustering

+

The result of linker.inference.predict() is a list of pairwise record comparisons and their associated scores. For instance, if we have input records A, B, C and D, it could be represented conceptually as:

+
A -> B with score 0.9
+B -> C with score 0.95
+C -> D with score 0.1
+D -> E with score 0.99
+
+

Often, an alternative representation of this result is more useful, where each row is an input record, and where records link, they are assigned to the same cluster.

+

With a score threshold of 0.5, the above data could be represented conceptually as:

+
ID, Cluster ID
+A,  1
+B,  1
+C,  1
+D,  2
+E,  2
+
+
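This conversion can be reproduced on the conceptual example above with a tiny union-find sketch (illustrative only; Splink ships a scalable connected components implementation):

```python
def cluster_edges(nodes, scored_edges, threshold):
    # Union-find: each node starts as its own cluster root
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for l, r, score in scored_edges:
        if score >= threshold:
            parent[find(l)] = find(r)  # merge the two clusters

    return {n: find(n) for n in nodes}

edges = [("A", "B", 0.9), ("B", "C", 0.95), ("C", "D", 0.1), ("D", "E", 0.99)]
clusters = cluster_edges("ABCDE", edges, threshold=0.5)
# A, B and C share one cluster id; D and E share another
```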

The algorithm that converts between the pairwise results and the clusters is called connected components, and it is included in Splink. You can use it as follows:

+
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
+    df_predictions, threshold_match_probability=0.5
+)
+clusters.as_pandas_dataframe(limit=10)
+
+
Completed iteration 1, root rows count 2
+
+
+Completed iteration 2, root rows count 0
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
cluster_idunique_idfirst_namesurnamedobcityemailcluster__splink_salttf_surnametf_emailtf_citytf_first_name
000RobertAlan1971-06-24Nonerobert255@smith.net00.0129240.0012210.001267NaN0.003610
111RobertAllen1971-05-24Noneroberta25@smith.net00.4787560.0024420.002535NaN0.003610
212RobAllen1971-06-24Londonroberta25@smith.net00.4096620.0024420.0025350.2127920.001203
333RobertAlen1971-06-24LononNone00.3110290.001221NaN0.0073800.003610
444GraceNone1997-04-26Hullgrace.kelly52@jones.com10.486141NaN0.0025350.0012300.006017
555GraceKelly1991-04-26Nonegrace.kelly52@jones.com10.4345660.0024420.002535NaN0.006017
666LoganpMurphy1973-08-01NoneNone20.4237600.001221NaNNaN0.012034
777NoneNone2015-03-03Portsmouthevied56@harris-bailey.net30.683689NaN0.0025350.017220NaN
888NoneDean2015-03-03NoneNone30.5530860.003663NaNNaNNaN
989EvieDean2015-03-03Pootsmruthevihd56@earris-bailey.net30.7530700.0036630.0012670.0012300.008424
+
+ +
sql = f"""
+select *
+from {df_predictions.physical_name}
+limit 2
+"""
+linker.misc.query_sql(sql)
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
match_weightmatch_probabilityunique_id_lunique_id_rfirst_name_lfirst_name_rgamma_first_nametf_first_name_ltf_first_name_rbf_first_namebf_tf_adj_first_namesurname_lsurname_rgamma_surnametf_surname_ltf_surname_rbf_surnamebf_tf_adj_surnamedob_ldob_rgamma_dobbf_dobcity_lcity_rgamma_citytf_city_ltf_city_rbf_citybf_tf_adj_cityemail_lemail_rgamma_emailtf_email_ltf_email_rbf_emailbf_tf_adj_emailmatch_key
0-1.7496640.229211324326KaiKai40.0060170.00601784.8217650.962892NoneTurner-1NaN0.0073261.0000001.0000002018-12-312009-11-0300.460743LondonLondon10.2127920.21279210.201260.259162k.t50eherand@z.ncomNone-10.001267NaN1.01.00
1-1.6260760.2446952527GabrielNone-10.001203NaN1.0000001.000000ThomasThomas40.0048840.00488488.8705071.0012221977-09-131977-10-1700.460743LondonLondon10.2127920.21279210.201260.259162gabriel.t54@nichols.infoNone-10.002535NaN1.01.01
+
+ +
+

Further Reading

+
+

For more on the prediction tools in Splink, please refer to the Prediction API documentation.

+

Next steps

+

Now we have made predictions with a model, we can move on to visualising these predictions to understand how the model is working.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/tutorials/06_Visualising_predictions.html b/demos/tutorials/06_Visualising_predictions.html new file mode 100644 index 0000000000..443b9a1ed5 --- /dev/null +++ b/demos/tutorials/06_Visualising_predictions.html @@ -0,0 +1,5478 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + 6. Visualising predictions - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Visualising predictions

+

+ Open In Colab +

+

Splink contains a variety of tools to help you visualise your predictions.

+

The idea is that, by developing an understanding of how your model works, you can gain confidence that the predictions it makes are sensible, or alternatively find examples of where your model isn't working, which may help you improve the model specification and fix these problems.

+
# Rerun our predictions so we're ready to view the charts
+from splink import Linker, DuckDBAPI, splink_datasets
+
+import pandas as pd
+
+pd.options.display.max_columns = 1000
+
+db_api = DuckDBAPI()
+df = splink_datasets.fake_1000
+
+
import json
+import urllib
+
+url = "https://raw.githubusercontent.com/moj-analytical-services/splink/847e32508b1a9cdd7bcd2ca6c0a74e547fb69865/docs/demos/demo_settings/saved_model_from_demo.json"
+
+with urllib.request.urlopen(url) as u:
+    settings = json.loads(u.read().decode())
+
+
+linker = Linker(df, settings, db_api=DuckDBAPI())
+df_predictions = linker.inference.predict(threshold_match_probability=0.2)
+
+
 -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'email':
+    m values not fully trained
+
+

Waterfall chart

+

The waterfall chart provides a means of visualising individual predictions to understand how Splink computed the final match weight for a particular pairwise record comparison.

+

To plot a waterfall chart, the user chooses one or more records from the results of linker.inference.predict(), and provides these records to the linker.visualisations.waterfall_chart() function.

+

For an introduction to waterfall charts and how to interpret them, please see this video.

+
records_to_view = df_predictions.as_record_dict(limit=5)
+linker.visualisations.waterfall_chart(records_to_view, filter_nulls=False)
+
+ +
+ + +

Comparison viewer dashboard

+

The comparison viewer dashboard takes this one step further by producing an interactive dashboard that contains example predictions from across the spectrum of match scores.

+

An in-depth video describing how to interpret the dashboard can be found here.

+
linker.visualisations.comparison_viewer_dashboard(df_predictions, "scv.html", overwrite=True)
+
+# You can view the scv.html file in your browser, or inline in a notebook as follows
+from IPython.display import IFrame
+
+IFrame(src="./scv.html", width="100%", height=1200)
+
+

+

+

Cluster studio dashboard

+

Cluster Studio is an interactive dashboard that visualises the results of clustering your predictions.

+

It provides examples of clusters of different sizes. The shape and size of clusters can be indicative of problems with record linkage, so the dashboard provides a tool to help you find potential false positive and false negative links.

+
df_clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
+    df_predictions, threshold_match_probability=0.5
+)
+
+linker.visualisations.cluster_studio_dashboard(
+    df_predictions,
+    df_clusters,
+    "cluster_studio.html",
+    sampling_method="by_cluster_size",
+    overwrite=True,
+)
+
+# You can view the cluster_studio.html file in your browser, or inline in a notebook as follows
+from IPython.display import IFrame
+
+IFrame(src="./cluster_studio.html", width="100%", height=1000)
+
+
Completed iteration 1, root rows count 2
+
+
+Completed iteration 2, root rows count 0
+
+

+

+
+

Further Reading

+

For more on the visualisation tools in Splink, please refer to the Visualisation API documentation.

+

📊 For more on the charts used in this tutorial, please refer to the Charts Gallery

+
+

Next steps

+

Now we have visualised the results of a model, we can move on to some more formal Quality Assurance procedures using labelled data.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/tutorials/07_Evaluation.html b/demos/tutorials/07_Evaluation.html new file mode 100644 index 0000000000..4b81d66ab0 --- /dev/null +++ b/demos/tutorials/07_Evaluation.html @@ -0,0 +1,6127 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + 7. Evaluation - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + +

7. Evaluation

+ +

Evaluation of prediction results

+

+ Open In Colab +

+

In the previous tutorial, we looked at various ways to visualise the results of our model. These are useful for evaluating a linkage pipeline because they allow us to understand how our model works and verify that it is doing something sensible. They can also be useful to identify examples where the model is not performing as expected.

+

In addition to these spot checks, Splink also has functions to perform more formal accuracy analysis. These functions allow you to understand the likely prevalence of false positives and false negatives in your linkage models.

+

They rely on the existence of a sample of labelled (ground truth) matches, which may have been produced (for example) by human beings. For the accuracy analysis to be unbiased, the sample should be representative of the overall dataset.

+
# Rerun our predictions so we're ready to view the charts
+import pandas as pd
+
+from splink import DuckDBAPI, Linker, splink_datasets
+
+pd.options.display.max_columns = 1000
+
+db_api = DuckDBAPI()
+df = splink_datasets.fake_1000
+
+
import json
+import urllib
+
+from splink import block_on
+
+url = "https://raw.githubusercontent.com/moj-analytical-services/splink/847e32508b1a9cdd7bcd2ca6c0a74e547fb69865/docs/demos/demo_settings/saved_model_from_demo.json"
+
+with urllib.request.urlopen(url) as u:
+    settings = json.loads(u.read().decode())
+
+# The data quality is very poor in this dataset, so we need looser blocking rules
+# to achieve decent recall
+settings["blocking_rules_to_generate_predictions"] = [
+    block_on("first_name"),
+    block_on("city"),
+    block_on("email"),
+    block_on("dob"),
+]
+
+linker = Linker(df, settings, db_api=DuckDBAPI())
+df_predictions = linker.inference.predict(threshold_match_probability=0.01)
+
+
Blocking time: 0.02 seconds
+
+
+Predict time: 0.80 seconds
+
+
+
+ -- WARNING --
+You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
+Comparison: 'email':
+    m values not fully trained
+
+

Load in labels

+

The labels file contains a list of pairwise comparisons which represent matches and non-matches.

+

The required format of the labels file is described here.

+
from splink.datasets import splink_dataset_labels
+
+df_labels = splink_dataset_labels.fake_1000_labels
+labels_table = linker.table_management.register_labels_table(df_labels)
+df_labels.head(5)
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
unique_id_lsource_dataset_lunique_id_rsource_dataset_rclerical_match_score
00fake_10001fake_10001.0
10fake_10002fake_10001.0
20fake_10003fake_10001.0
30fake_10004fake_10000.0
40fake_10005fake_10000.0
+
+ +

View examples of false positives and false negatives

+
splink_df = linker.evaluation.prediction_errors_from_labels_table(
+    labels_table, include_false_negatives=True, include_false_positives=False
+)
+false_negatives = splink_df.as_record_dict(limit=5)
+linker.visualisations.waterfall_chart(false_negatives)
+
+ +
+ + +

False positives

+
# Note I've picked a threshold match probability of 0.01 here because otherwise
+# in this simple example there are no false positives
+splink_df = linker.evaluation.prediction_errors_from_labels_table(
+    labels_table, include_false_negatives=False, include_false_positives=True, threshold_match_probability=0.01
+)
+false_positives = splink_df.as_record_dict(limit=5)
+linker.visualisations.waterfall_chart(false_positives)
+
+ +
+ + +

Threshold Selection chart

+

Splink includes an interactive dashboard that shows key accuracy statistics:

+
linker.evaluation.accuracy_analysis_from_labels_table(
+    labels_table, output_type="threshold_selection", add_metrics=["f1"]
+)
+
+ +
+ + +

Receiver operating characteristic curve

+

A ROC chart shows how the number of false positives and false negatives varies depending on the match threshold chosen. The match threshold is the match weight chosen as a cutoff for which pairwise comparisons to accept as matches.

+
linker.evaluation.accuracy_analysis_from_labels_table(labels_table, output_type="roc")
+
+ +
+ + +

Truth table

+

Finally, Splink can also report the underlying table used to construct the ROC and precision recall curves.

+
roc_table = linker.evaluation.accuracy_analysis_from_labels_table(
+    labels_table, output_type="table"
+)
+roc_table.as_pandas_dataframe(limit=5)
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
truth_thresholdmatch_probabilitytotal_clerical_labelspntptnfpfnP_rateN_ratetp_ratetn_ratefp_ratefn_rateprecisionrecallspecificitynpvaccuracyf1f2f0_5p4phi
0-18.90.0000023176.02031.01145.01709.01103.042.0322.00.6394840.3605160.8414570.9633190.0366810.1585430.9760140.8414570.9633190.7740350.8853900.9037550.8653160.9457660.8804760.776931
1-16.70.0000093176.02031.01145.01709.01119.026.0322.00.6394840.3605160.8414570.9772930.0227070.1585430.9850140.8414570.9772930.7765440.8904280.9075940.8667210.9525140.8860100.789637
2-12.80.0001403176.02031.01145.01709.01125.020.0322.00.6394840.3605160.8414570.9825330.0174670.1585430.9884330.8414570.9825330.7774710.8923170.9090430.8672490.9550690.8880760.794416
3-12.50.0001733176.02031.01145.01708.01125.020.0323.00.6394840.3605160.8409650.9825330.0174670.1590350.9884260.8409650.9825330.7769340.8920030.9087520.8668290.9549370.8877630.793897
4-12.40.0001853176.02031.01145.01705.01132.013.0326.00.6394840.3605160.8394880.9886460.0113540.1605120.9924330.8394880.9886460.7764060.8932620.9095760.8661860.9575420.8892250.797936
+
+ +
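The accuracy metrics in this table follow directly from the confusion counts. For example, taking the first row above (tp=1709, fp=42, fn=322), a quick sketch reproduces the reported precision, recall and F1:

```python
tp, fp, fn = 1709, 42, 322  # counts from the first row of the truth table

precision = tp / (tp + fp)  # of predicted matches, how many are true matches
recall = tp / (tp + fn)     # of true matches, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.976 0.841 0.904
```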

Unlinkables chart

+

Finally, it can be interesting to analyse whether your dataset contains any 'unlinkable' records.

+

'Unlinkable records' are records with such poor data quality they don't even link to themselves at a high enough probability to be accepted as matches.

+

For example, in a typical linkage problem, a 'John Smith' record with nulls for their address and postcode may be unlinkable. By 'unlinkable' we don't mean there are no matches; rather, we mean it is not possible to determine whether there are matches.

+

A high proportion of unlinkable records is an indication of poor quality in the input dataset.

+
linker.evaluation.unlinkables_chart()
+
+ +
+ + +

For this dataset and this trained model, we can see that most records are (theoretically) linkable: at a match weight of 6, around 99% of records could be linked to themselves.

+
+

Further Reading

+

For more on the quality assurance tools in Splink, please refer to the Evaluation API documentation.

+

📊 For more on the charts used in this tutorial, please refer to the Charts Gallery.

+

For more on the Evaluation Metrics used in this tutorial, please refer to the Edge Metrics guide.

+
+

That's it!

+

That wraps up the Splink tutorial! Don't worry, there are still plenty of resources to help on the next steps of your Splink journey:

+

For some end-to-end notebooks of Splink pipelines, check out our Examples

+

For more deepdives into the different aspects of Splink, and record linkage more generally, check out our Topic Guides

+

For a reference on all the functionality available in Splink, see our Documentation

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/demos/tutorials/cluster_studio.html b/demos/tutorials/cluster_studio.html new file mode 100644 index 0000000000..ea226fe9e6 --- /dev/null +++ b/demos/tutorials/cluster_studio.html @@ -0,0 +1,11080 @@ + + + + + +Splink cluster studio + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +

Splink cluster studio

+ +
+
+ +
+
+
+ +
+
+ + +
+
+
+
+ +
+
+
+ + +
+
+
+ +
+ + + + + +
+
+
+ + + + + +
+ + + + + + diff --git a/demos/tutorials/scv.html b/demos/tutorials/scv.html new file mode 100644 index 0000000000..6574a30a0f --- /dev/null +++ b/demos/tutorials/scv.html @@ -0,0 +1,11024 @@ + + + + + +Splink comparison viewer + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +

Splink comparison viewer

+ +
+ +
+
+
+
+ +
+
+ +
+
+
+
+
+ + + + + + \ No newline at end of file diff --git a/dev_guides/CONTRIBUTING.html b/dev_guides/CONTRIBUTING.html new file mode 100644 index 0000000000..f201d9918a --- /dev/null +++ b/dev_guides/CONTRIBUTING.html @@ -0,0 +1,5430 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Contributor Guide - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Contributing to Splink

+

Contributing to an open source project takes many forms. Below are some of the ways you can contribute to Splink!

+

Asking questions

+

If you have a question about Splink, we recommended asking on our GitHub discussion board. This means that other users can benefit from the answers too! On that note, it is always worth checking if a similar question has been asked (and answered) before.

+

Reporting issues

+

Is something broken? Or not acting how you would expect? Are we missing a feature that would make your life easier? We want to know about it!

+

When reporting issues, please include as much detail as possible about your operating system, Splink version, Python version and which SQL backend you are using. Whenever possible, please also include a brief, self-contained code example that demonstrates the problem. It is particularly helpful if you can look through the existing issues and provide links to any related issues.

+

Contributing to documentation

+

Contributions to Splink are not limited to the code. Feedback and input on our documentation from a user's perspective is extremely valuable - even something as small as fixing a typo. More generally, if you are interested in starting to work on Splink, documentation is a great way to get those first commits!

+

The easiest way to contribute to the documentation is by clicking the pencil icon at the top right of the docs page you want to edit. +This will automatically create a fork of the Splink repository on GitHub and make it easy to open a pull request with your changes, +which one of the Splink dev team will review.

+

If you need to make a larger change to the docs, this workflow might not be the best, since you won't get to see the effects +of your changes before submitting them. +In that case, you will need to create a fork of the Splink repo, +then clone your fork to your computer. +Then, you can edit the documentation in the docs folder +(and API documentation, which can be found as docstrings in the code itself) locally. +To see what the docs will look like with your changes, you can +build the docs site locally. +When you are happy with your changes, commit and push them to your fork, then +create a Pull Request.

+

We are trying to make our documentation as accessible to as many people as possible. If you find any problems with accessibility then please let us know by raising an issue, or feel free to put in a Pull Request with your suggested fixes.

+

Contributing code

+

Thanks for your interest in contributing code to Splink!

+

There are a number of ways to get involved:

+
    +
  • Start work on an existing issue; those with a good first issue flag are a good place to start.
  • +
  • Tackle a problem you have identified. If you have identified a feature or bug, the first step is to create a new issue to explain what you have identified and what you plan to implement, then you are free to fork the repository and get coding!
  • +
+

In either case, we ask that you assign yourself to the relevant issue and open up a draft pull request (PR) while you are working on your feature/bug-fix. This helps the Splink dev team keep track of developments and means we can start supporting you sooner!

+

You can always add further PRs to build extra functionality. Starting out with a minimum viable product and iterating makes for better software (in our opinion). It also helps get features out into the wild sooner.

+

To get set up for development locally, see the development quickstart.

+

Best practices

+

When making code changes, we recommend:

+
    +
  • Adding tests to ensure your code works as expected. These will be run through GitHub Actions when a PR is opened.
  • +
  • Linting to ensure that code is styled consistently.
  • +
+

Branching Strategy

+

All pull requests (PRs) should target the master branch.

+

We believe that small Pull Requests make better code. They:

+
    +
  • are more focused
  • +
  • increase understanding and clarity
  • +
  • are easier (and quicker) to review
  • +
  • get feedback quicker
  • +
+

If you have a larger feature, please consider creating a simple minimum-viable feature and submitting it for review. Once this has been reviewed by the Splink dev team, there are two options to consider:

+
    +
  1. Merge minimal feature, then create a new branch with additional features.
  2. +
  3. Do not merge the initial feature branch, create additional feature branches from the reviewed branch.
  4. +
+

The best solution often depends on the specific feature being created and any other development work happening in that area of the codebase. If you are unsure, please ask the dev team for advice on how to best structure your changes in your initial PR and we can come to a decision together.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/dev_guides/caching.html b/dev_guides/caching.html new file mode 100644 index 0000000000..b8d6f5d8b6 --- /dev/null +++ b/dev_guides/caching.html @@ -0,0 +1,5403 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Caching and pipelining - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+ +
+
+ + + +
+
+ + + + + + + + + + + + +

Caching and pipelining

+ +

Caching and pipelining

+

Splink is able to run against multiple SQL backends because all of the core data linking calculations are implemented in SQL. This SQL can therefore be submitted to a chosen SQL backend for execution.

+

Computations in Splink often take the form of a number of select statements run in sequence.

+

For example, the predict() step:

+
    +
  • Inputs __splink__df_concat_with_tf and outputs __splink__df_blocked
  • +
  • Inputs __splink__df_blocked and outputs __splink__df_comparison_vectors
  • +
  • Inputs __splink__df_comparison_vectors and outputs __splink__df_match_weight_parts
  • +
  • Inputs __splink__df_match_weight_parts and outputs __splink__df_predict
  • +
+

To make this run faster, two key optimisations are implemented:

+
    +
  • Pipelining - combining multiple select statements into a single statement using WITH(CTE) queries
  • +
  • Caching: saving the results of calculations so they don't need recalculating. This is especially useful because some intermediate calculations are reused multiple times during a typical Splink session
  • +
+

This article discusses the general implementation of caching and pipelining. The implementation needs some alterations for certain backends like Spark, which lazily evaluate SQL by default.

+

Implementation: Pipelining

+

A SQLPipeline class manages SQL pipelining.

+

A SQLPipeline is composed of a number of SQLTask objects, each of which represents a select statement.

+

The code is fairly straightforward: given a sequence of select statements [a, b, c], they are combined into a single query as follows:

+
with
+a as (a_sql),
+b as (b_sql),
+c_sql
+
+

To make this work, each statement (a,b,c) in the pipeline must refer to the previous step by name. For example, b_sql probably selects from the a_sql table, which has been aliased a. So b_sql must use the table name a to refer to the result of a_sql.

+

To make this tractable, each SQLTask has an output_table_name. For example, the output_table_name for a_sql in the above example is a.

+

For instance, in the predict() pipeline above, the first output_table_name is __splink__df_blocked. By giving each task a meaningful output_table_name, subsequent tasks can reference previous outputs in a way which is semantically clear.
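As an illustration, the combining logic described above can be sketched in a few lines of Python. This is a simplified stand-in for the SQLPipeline/SQLTask machinery, not Splink's actual implementation; the function and variable names are assumptions.

```python
# Simplified sketch of pipelining select statements into one WITH (CTE)
# query. Each task is a (sql, output_table_name) pair, mirroring the
# SQLTask idea described above (illustrative only, not Splink's code).
def pipeline_sql(tasks):
    *intermediate, (final_sql, _) = tasks
    ctes = ",\n".join(f"{name} as ({sql})" for sql, name in intermediate)
    return f"with\n{ctes}\n{final_sql}" if ctes else final_sql

tasks = [
    ("select * from input", "a"),
    ("select * from a", "b"),
    ("select count(*) from b", "c"),
]
print(pipeline_sql(tasks))
```

Note how each statement refers to the previous step by its output_table_name, which is what makes the combined query semantically clear.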

+

Implementation: Caching

+

When a SQL pipeline is executed, it has two output names:

+
    +
  • A physical_name, which is the name of the materialised table in the output database e.g. __splink__df_predict_cbc9833
  • +
  • A templated_name, which is a descriptive name of what the table represents e.g. __splink__df_predict
  • +
+

Each time Splink runs a SQL pipeline, the SQL string is hashed. This creates a unique identifier for that particular SQL string, which serves to identify the output.

+

When Splink is asked to execute a SQL string, before execution, it checks whether the resultant table already exists. If it does, it returns the table rather than recomputing it.
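A minimal sketch of this hashing idea, assuming a SHA-256 hash truncated to seven characters (the actual algorithm and hash length used by Splink may differ):

```python
import hashlib

# Derive a unique physical table name from a templated name plus a hash
# of the pipelined SQL string (hash algorithm and length are assumptions).
def physical_name(templated_name, sql):
    return f"{templated_name}_{hashlib.sha256(sql.encode()).hexdigest()[:7]}"

sql = "select ... from __splink__df_match_weight_parts"
name = physical_name("__splink__df_predict", sql)
# The same SQL string always hashes to the same physical name, so the
# database can be checked for this table before re-executing the SQL.
assert name == physical_name("__splink__df_predict", sql)
```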

+

For example, when we run linker.predict(), Splink:

+
    +
  • Generates the SQL tasks
  • +
  • Pipelines them into a single SQL statement
  • +
  • Hashes the statement to create a physical name for the outputs __splink__df_predict_cbc9833
  • +
  • Checks whether a table with physical name __splink__df_predict_cbc9833 already exists in the database
  • +
  • If not, executes the SQL statement, creating table __splink__df_predict_cbc9833 in the database.
  • +
+

In terms of implementation, the following happens:

+
    +
  • SQL statements are generated and put in the queue - see here
  • +
  • Once all the tasks have been added to the queue, we call _execute_sql_pipeline() see here
  • +
  • The SQL is combined into a single pipelined statement here
  • +
  • We call _sql_to_splink_dataframe() which returns the table (from the cache if it already exists, or it executes the SQL)
  • +
  • The table is returned as a SplinkDataframe, an abstraction over a table in a database. See here.
  • +
+

Some cached tables do not need a hash

+

A hash is required to uniquely identify some outputs. For example, blocking is used in several places in Splink, with different results: the __splink__df_blocked needed to estimate parameters is different from the __splink__df_blocked needed in the predict() step.

+

As a result, we cannot materialise a single table called __splink__df_blocked in the database and reuse it multiple times. This is why we append the hash of the SQL, so that we can uniquely identify the different versions of __splink__df_blocked which are needed in different contexts.

+

There are, however, some tables which are globally unique. They only take a single form, and if they exist in the cache they never need recomputing.

+

An example of this is __splink__df_concat_with_tf, which represents the concatenation of the input dataframes.

+

To create this table, we can execute _sql_to_splink_dataframe with materialise_as_hash set to False. The resultant materialised table will not have a hash appended, and will simply be called __splink__df_concat_with_tf. This is useful, because when performing calculations Splink can now check the cache for __splink__df_concat_with_tf each time it is needed.
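The difference between hashed and globally unique table names can be sketched as follows. This is a simplified illustration of the behaviour described above, with a dict standing in for the database; the real signature of _sql_to_splink_dataframe may differ.

```python
import hashlib

cache = {}  # physical_name -> materialised result (stand-in for the database)

# Illustrative sketch: hashed tables get one name per SQL string, whereas
# globally unique tables (materialise_as_hash=False) are cached under the
# templated name alone, so they can always be found by name.
def sql_to_splink_dataframe(sql, templated_name, materialise_as_hash=True):
    if materialise_as_hash:
        name = f"{templated_name}_{hashlib.sha256(sql.encode()).hexdigest()[:7]}"
    else:
        name = templated_name
    if name not in cache:
        cache[name] = f"materialised: {sql}"  # pretend to execute the SQL
    return cache[name]

sql_to_splink_dataframe("select ...", "__splink__df_concat_with_tf", materialise_as_hash=False)
print("__splink__df_concat_with_tf" in cache)  # True - findable on later runs
```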

+

In fact, many Splink pipelines begin with the assumption that this table exists in the database, because the first SQLTask in the pipeline refers to a table named __splink__df_concat_with_tf. To ensure this is the case, a function is used to create this table if it doesn't exist.

+ +

At what point should a pipeline of SQLTasks be executed (materialised into a physical table)?

+

For any individual output, it will usually be fastest to pipeline the full lineage of tasks, right from raw data through to the end result.

+

However, there are many intermediate outputs which are used by many different Splink operations.

+

Performance can therefore be improved by computing and saving these intermediate outputs to a cache, to ensure they don't need to be computed repeatedly.

+

This is achieved by enqueueing SQL to a pipeline and strategically calling execute_sql_pipeline to materialise results that need to be cached.
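The enqueue-then-materialise pattern can be sketched as follows. This is a deliberately simplified illustration that omits the hashing described earlier; the class and method names mirror the prose above but are not Splink's real API.

```python
# Simplified sketch: SQL tasks are enqueued, then materialised in one go
# at strategic points so reusable intermediates end up in the cache.
class Pipeline:
    def __init__(self, cache):
        self.cache = cache
        self.tasks = []

    def enqueue_sql(self, sql, output_table_name):
        self.tasks.append((sql, output_table_name))

    def execute_sql_pipeline(self):
        final_name = self.tasks[-1][1]
        combined = " -> ".join(sql for sql, _ in self.tasks)  # stand-in for CTE pipelining
        if final_name not in self.cache:                      # hashing omitted for brevity
            self.cache[final_name] = f"result of ({combined})"
        self.tasks = []
        return self.cache[final_name]

cache = {}
p = Pipeline(cache)
p.enqueue_sql("select ... from input", "__splink__df_concat_with_tf")
p.execute_sql_pipeline()  # materialised now, so later pipelines can reuse it
p.enqueue_sql("select ... from __splink__df_concat_with_tf", "__splink__df_blocked")
p.enqueue_sql("select ... from __splink__df_blocked", "__splink__df_predict")
predictions = p.execute_sql_pipeline()
```

Only the strategically chosen outputs are materialised; intermediate steps such as __splink__df_blocked stay pipelined inside the single combined statement.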

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/dev_guides/changing_splink/blog_posts.html b/dev_guides/changing_splink/blog_posts.html new file mode 100644 index 0000000000..d3a92199b6 --- /dev/null +++ b/dev_guides/changing_splink/blog_posts.html @@ -0,0 +1,5325 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Contributing to the Splink Blog - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Contributing to the Splink Blog

+

Thanks for considering making a contribution to the Splink Blog! We are keen to use this blog as a forum for all things data linking and Splink!

+

This blog, and the docs as a whole, are built using the fantastic MkDocs material. To understand more about how the blog works under the hood, check out the MkDocs material blog documentation.

+

For more general guidance for contributing to Splink, check out our Contributor Guide.

+

Adding a blog post

+

The easiest way to get started with a blog post is to make a copy of one of the pre-existing blog posts and make edits from there. There is a metadata section at the top of each post which should be updated with the post date, authors and the category of the post (this is a tag system to make posts easier to find).

+

Blog posts are ordered by date, so change the name of your post markdown file to be a recent date (YYYY-MM-DD format) to make sure it appears at the top of the blog.
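By way of illustration, the metadata block at the top of a post typically looks something like this in MkDocs Material (the values shown here are hypothetical):

```yaml
---
date: 2024-01-15
authors:
  - your_author_id   # must match an entry in .authors.yml
categories:
  - Data Linking
---
```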

+
+

Note

+

In this blog we want to make content as easily digestible as possible. We encourage breaking up big blocks of text into sections and using visuals/emojis/gifs to bring your post to life!

+
+

Adding a new author to the blogs

+

If you are a new author, you will need to add yourself to the .authors.yml file.

+

Testing your changes

+

Once you have made a first draft, check out how the deployed blog will look by building the docs locally.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/dev_guides/changing_splink/building_env_locally.html b/dev_guides/changing_splink/building_env_locally.html new file mode 100644 index 0000000000..d39a55b56a --- /dev/null +++ b/dev_guides/changing_splink/building_env_locally.html @@ -0,0 +1,5368 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Building your local environment - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+ +
+
+ + + +
+
+ + + + + + + + + + + + +

Building your local environment

+ +
+

Managing Dependencies with Poetry

+

Splink utilises poetry for managing its core dependencies, offering a clean and effective solution for tracking and resolving any ensuing package and version conflicts.

+

You can find a list of Splink's core dependencies within the pyproject.toml file.

+

Fundamental Commands in Poetry

+

Below are some useful commands to help in the maintenance and upkeep of the pyproject.toml file.

+

Adding Packages +- To incorporate a new package into Splink: +

poetry add <package-name>
+
+- To specify a version when adding a new package: +
poetry add <package-name>==<version>
+# Add quotes if you want to use other equality calls
+poetry add "<package-name> >= <version>"
+
+

Modifying Packages +- To remove a package from the project: +

poetry remove <package-name>
+
+- Updating an existing package to a specific version: +
poetry add <package-name>==<version>
+poetry add "<package-name> >= <version>"
+
+- To update an existing package to the latest version: +
poetry add <package-name>==<version>
+poetry update <package-name>
+
+ Note: Direct updates can also be performed within the pyproject.toml file. +

Locking the Project +- To update the existing poetry.lock file, thereby locking the project to ensure consistent dependency installation across different environments: +

poetry lock
+
+ Note: This should be used sparingly due to our loose dependency requirements and the resulting time to solve the dependency graph. If you only need to update a single dependency, update it using poetry add <pkg>==<version> instead. +

Installing Dependencies +- To install project dependencies as per the lock file: +

poetry install
+
+- For optional dependencies, additional flags are required. For instance, to install dependencies along with Spark support: +
poetry install -E spark
+
+

A comprehensive list of Poetry commands is available in the Poetry documentation.

+

Automating Virtual Environment Creation

+

To streamline the creation of a virtual environment via venv, you may use the create_venv.sh script.

+

This script facilitates the automatic setup of a virtual environment, with the default environment name being venv.

+

Default Environment Creation: +

source scripts/create_venv.sh
+
+

Specifying a Custom Environment Name: +

source scripts/create_venv.sh <name_of_venv>
+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/dev_guides/changing_splink/contributing_to_docs.html b/dev_guides/changing_splink/contributing_to_docs.html new file mode 100644 index 0000000000..fc848a4a54 --- /dev/null +++ b/dev_guides/changing_splink/contributing_to_docs.html @@ -0,0 +1,5348 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Contributing to Documentation - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Contributing to Documentation

+ +

Building docs locally

+

Before building the docs locally, you will need to follow the development quickstart to set up the necessary environment. You cannot skip this step, because some Splink docs Markdown is auto-generated using the Splink development environment.

+

Once you've done that, +to rapidly build the documentation and immediately see changes you've made you can use this script +outside your Poetry virtual environment:

+
source scripts/make_docs_locally.sh
+
+

This is much faster than waiting for GitHub actions to run if you're trying to make fiddly changes to formatting etc.

+

Once you've finished updating Splink documentation we ask that you run our spellchecker. Instructions on how to do this are given below.

+

Quick builds for rapidly authoring new content

+

When you mkdocs serve -v --dirtyreload or mkdocs build the documentation, the mkdocs command will rebuild the entire site. This can be slow if you're just making small changes to a single page.

+

To speed up the process, you can temporarily tell mkdocs to ignore content by modifying mkdocs.yml, for example by adding:

+
exclude_docs: |
+  dev_guides/**
+  charts/**
+  topic_guides/**
+  demos/**
+  blog/**
+
+

Spellchecking docs

+

When updating Splink documentation, we ask that you run our spellchecker before submitting a pull request. This is to help ensure quality and consistency across the documentation. If for whatever reason you can't run the spellchecker on your system, please don't let this prevent you from contributing to the documentation. Please note, the spellchecker only works on markdown files.

+

If you are a Mac user with the Homebrew package manager installed, the script below will automatically install +the required system dependency, aspell. +If you've created your development environment using conda, aspell will have been installed as part of that +process. +Instructions for installing aspell through other means may be added here in the future.

+

To run the spellchecker on either a single markdown file or folder of markdown files, you can run the following bash script:

+
./scripts/pyspelling/spellchecker.sh <path_to_file_or_folder>
+
+

Omitting the file/folder path will run the spellchecker on all markdown files contained in the docs folder. We recommend running the spellchecker only on files that you have created or edited.

+

The spellchecker uses the Python package PySpelling and its underlying spellchecking tool, Aspell. Running the above script will automatically install these packages along with any other necessary dependencies.

+

The spellchecker compares words to a standard British English dictionary and a custom dictionary (scripts/pyspelling/custom_dictionary.txt) of words. If no spelling mistakes are found, you will see the following terminal printout:

+
Spelling check passed :)
+
+

otherwise, PySpelling will print out the spelling mistakes found in each file.

+

Correct spellings of words not found in a standard dictionary (e.g. "Splink") can be recorded as such by adding them to scripts/pyspelling/custom_dictionary.txt.

+

Please correct any mistakes found or update the custom dictionary to ensure the spellchecker passes before putting in a pull request containing updates to the documentation.

+
+

Note

+

The spellchecker is configured (via pyspelling.yml) to ignore text between certain delimiters to minimise picking up Splink/programming-specific terms. If there are additional patterns that you think should be excepted then please let us know in your pull request.

+

The custom dictionary deliberately contains a small number of misspelled words (e.g. “Siohban”). These are sometimes necessary where we are explaining how Splink handles typos in data records.

+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/dev_guides/changing_splink/development_quickstart.html b/dev_guides/changing_splink/development_quickstart.html new file mode 100644 index 0000000000..86ad86aeea --- /dev/null +++ b/dev_guides/changing_splink/development_quickstart.html @@ -0,0 +1,5705 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Development Quickstart - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + +

Development Quickstart

+ +

Splink is a complex project with many dependencies. +This page provides step-by-step instructions for getting set up to develop Splink. +Once you have followed these instructions, you should be all set to start making changes.

+

Step 0: Unix-like operating system

+

We highly recommend developing Splink on a Unix-like operating system, such as MacOS or Linux. +While it is possible to develop on another operating system such as Windows, we do not provide +instructions for how to do so.

+

Luckily, Windows users can easily fulfil this requirement by installing the Windows Subsystem for Linux (WSL):

+
    +
  • Open PowerShell as Administrator: Right-click the Start button, select “Windows Terminal (Admin)”, and ensure PowerShell is the selected shell.
  • +
  • Run the command wsl --install.
  • +
  • You can find more guidance on setting up WSL on the Microsoft website + but you don't need to do anything additional.
  • +
  • Open the Windows Terminal again (does not need to be Admin) and select the Ubuntu shell. + Follow the rest of these instructions in that shell.
  • +
+ +

If you haven't already, create a fork of the Splink repository. +You can find the Splink repository here, +or click here to go directly to making a fork. +Clone your fork to whatever directory you want to work in with git clone https://github.com/<YOUR_USERNAME>/splink.git.

+

Step 2: Choose how to install system dependencies

+

Developing Splink requires Python, as well as Poetry (the package manager we use to install Python package dependencies). +Running Spark or PostgreSQL on your computer to test those backends requires additional dependencies. +Athena only runs in the AWS cloud, so to locally run the tests for that backend you will need to create an AWS account and +configure Splink to use it.

+

There are two ways to install these system dependencies: globally on your computer, or in an isolated conda environment.

+

The decision of which approach to take is subjective.

+

If you already have Python and Poetry installed (plus Java and PostgreSQL if you want to run the +Spark and PostgreSQL backends locally), there is probably little advantage to using conda.

+

On the other hand, conda is particularly suitable if:

+
    +
  • You're already a conda user, and/or
  • +
  • You're working in an environment where security policies prevent the installation of system level packages like Java
  • +
  • You don't want to do global installs of some of the requirements like Java
  • +
+

Step 3, Manual install option: Install system dependencies

+

Python

+

Check if Python is already installed by running python3 --version. +If that outputs a version like 3.10.12, you've already got it! +Otherwise, follow the instructions for installation on your platform +from the Python website.

+

Poetry

+

Run these commands to install Poetry globally. +Note that we currently use an older version of Poetry, so the version +must be specified.

+
pip install --upgrade pip
+pip install poetry==1.4.2
+
+

Java

+

The instructions to install Java globally depend on your operating system. +Generally, some version of Java will be available from your operating system's +package manager. +Note that you must install a version of Java earlier than Java 18 because +Splink currently uses an older version of Spark.

+

As an example, you could run this on Ubuntu:

+
sudo apt install openjdk-11-jre-headless
+
+

PostgreSQL (optional)

+

Follow the instructions on the PostgreSQL website +to install it on your computer.

+

Then, we will need to set up a database for Splink. +You can achieve that with the following commands:

+
initdb splink_db
+pg_ctl -D splink_db start --wait -l ./splink_db_log
+createdb splink_db # The inner database
+psql -d splink_db <<SQL
+  CREATE USER splinkognito CREATEDB CREATEROLE password 'splink123!' ;
+SQL
+
+

Most of these commands are one-time setup, but the pg_ctl -D splink_db start --wait -l ./splink_db_log +command will need to be run each time you want to start PostgreSQL (after rebooting, for example).

+

Alternatively, you can run PostgreSQL using Docker. +First, install Docker Desktop.

+

Then run the setup script (a thin wrapper around docker-compose) each time you want to start your PostgreSQL server:

+
./scripts/postgres_docker/setup.sh
+
+

and the teardown script each time you want to stop it:

+
./scripts/postgres_docker/teardown.sh
+
+

Included in the docker-compose file is a pgAdmin container to allow easy exploration of the database as you work, which can be accessed in-browser on the default port. +The default username is a@b.com with password b.

+

Step 3, Conda install option: Install system dependencies

+

These instructions are the same no matter what operating system you are using. +As an added benefit, these installations will be specific to the conda environment +you create for Splink, so they will not interfere with other projects.

+

For convenience, we have created an automatic installation script that will install all dependencies for you. +It will create an isolated conda environment called splink.

+

From the directory where you have cloned the Splink repository, simply run:

+
./scripts/conda/development_setup_with_conda.sh
+
+

If you use a shell besides bash, add the mamba CLI to your PATH by running ~/miniforge3/bin/mamba init <your_shell> +-- e.g. ~/miniforge3/bin/mamba init zsh for zsh.

+

If you've run this successfully, restart your terminal and skip to the "Step 5: Activating your environment(s)" section.

+

If you would prefer to manually go through the steps to have a better understanding of what you are installing, continue +to the next section.

+

Install Conda itself

+

First, we need to install a conda CLI. +Any will do, but we recommend Miniforge, which can be installed like so:

+
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
+bash Miniforge3-$(uname)-$(uname -m).sh -b
+
+

Miniforge is great because it defaults to the community-curated conda-forge channel, and it +installs the mamba CLI by default, which is generally faster than the conda CLI.

+

Before you'll be able to run the mamba command, you need to run ~/miniforge3/bin/mamba init +for your shell -- e.g. ~/miniforge3/bin/mamba init for Bash or ~/miniforge3/bin/mamba init zsh for zsh.

+

Install Conda packages

+

The rest is easy, because all the other dependencies can be installed as conda packages. +Simply run:

+
mamba env create -n splink --file ./scripts/conda/development_environment.yaml
+
+

Now run mamba activate splink to enter your newly created conda environment +-- you will need to do this again each time you open a new terminal. +Run the rest of the steps in this guide inside this environment. +mamba deactivate leaves the environment.

+

Step 4: Python package dependencies

+

Splink manages the other Python packages it depends on using Poetry. +Simply run poetry install in the Splink directory to install them. +You can find more options for this command (such as how to install +optional dependencies) on the managing dependencies with Poetry page.

+

To enter the virtual environment created by poetry, run poetry shell. +You will need to do this again each time you open a new terminal. +Use exit to leave the Poetry shell.

+

Step 5: Activating your environment(s)

+

Depending on the options you chose in this document, you now have either:

+
    +
  • Only a Poetry virtual environment.
  • +
  • Both a conda environment and a Poetry virtual environment.
  • +
+

If you did not use conda, then each time you open a terminal to develop +Splink, after navigating to the repository directory, run poetry shell.

+

If you did use conda, then each time you open a terminal to develop +Splink, after navigating to the repository directory, run mamba activate splink +and then poetry shell.

+

Step 6: Checking that it worked

+

If you have installed all the dependencies, including PostgreSQL, +you should be able to run the following command without error (will take about 10 minutes):

+
pytest tests/
+
+

This runs all the Splink tests across the default DuckDB and Spark backends, +and runs some integration tests across the rest of the backends except for Athena, +which can't run locally.

+

If you haven't installed PostgreSQL, try this:

+
pytest tests/ --ignore tests/test_full_example_postgres.py
+
+

Step 7: Visual Studio Code (optional)

+

You're now all set to develop Splink. +If you have a text editor/IDE you are comfortable with for working on Python packages, +you can use that. +If you don't, we recommend Visual Studio Code. +Here are some tips on how to get started:

+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/dev_guides/changing_splink/lint_and_format.html b/dev_guides/changing_splink/lint_and_format.html new file mode 100644 index 0000000000..e0f5758942 --- /dev/null +++ b/dev_guides/changing_splink/lint_and_format.html @@ -0,0 +1,5298 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Linting and Formatting - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Linting and Formatting

+ +

Linting your code

+

We use ruff for linting and formatting.

+

To quickly run both the linter and the formatter, you can run the commands shown below. To have ruff apply automatic fixes, add the --fix flag to ruff check. +If you simply wish for ruff to print the errors it finds to the console, omit this flag.

+
poetry run ruff format
+poetry run ruff check .
+
+

Additional Rules

+

ruff contains an extensive arsenal of linting rules and techniques that can be applied.

+

If you wish to add an additional rule, do so in the pyproject.toml file in the root of the project.
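For example, rule selection in pyproject.toml looks like the fragment below. The rule codes shown are illustrative, not Splink's actual configuration, and on older ruff versions this table is [tool.ruff] rather than [tool.ruff.lint]:

```toml
[tool.ruff.lint]
# E/W = pycodestyle, F = pyflakes, I = import sorting (illustrative selection)
select = ["E", "W", "F", "I"]
```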

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/dev_guides/changing_splink/managing_dependencies_with_poetry.html b/dev_guides/changing_splink/managing_dependencies_with_poetry.html new file mode 100644 index 0000000000..05acd38b41 --- /dev/null +++ b/dev_guides/changing_splink/managing_dependencies_with_poetry.html @@ -0,0 +1,5399 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Managing Dependencies with Poetry - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Managing Dependencies with Poetry

+ +

Splink utilises poetry for managing its core dependencies, offering a clean and effective solution for tracking and resolving any ensuing package and version conflicts.

+

You can find a list of Splink's core dependencies within the pyproject.toml file.

+

A comprehensive list of Poetry commands is available in the Poetry documentation.

+

Fundamental Commands in Poetry

+

Below are some useful commands to help in the maintenance and upkeep of the pyproject.toml file.

+

Adding Packages

+

To incorporate a new package into Splink: +

poetry add <package-name>
+
+

To specify a version when adding a new package: +

poetry add <package-name>==<version>
+# Add quotes if you want to use other equality calls
+poetry add "<package-name> >= <version>"
+
+

Modifying Packages

+

To remove a package from the project:

+
poetry remove <package-name>
+
+

Updating an existing package to a specific version:

+
poetry add <package-name>==<version>
+poetry add "<package-name> >= <version>"
+
+

To update an existing package to the latest version:

+
poetry update <package-name>
+
+

Note: Direct updates can also be performed within the pyproject.toml file.
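For instance, a dependency constraint in pyproject.toml takes the following form (package name and versions are illustrative). If you edit the file directly, remember to run poetry lock afterwards so the lock file stays in sync:

```toml
[tool.poetry.dependencies]
duckdb = ">=0.9.2,<1.0.0"
```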

+

Locking the Project

+

To update the existing poetry.lock file, thereby locking the project to ensure consistent dependency installation across different environments:

+
poetry lock
+
+

Note: This updates all dependencies and may take some time. If you only need to update a single dependency, update it using poetry add <pkg>==<version> instead.

+

Installing Dependencies

+

To install project dependencies as per the lock file:

+
poetry install
+
+

For optional dependencies, additional flags are required. For instance, to install dependencies along with Spark support:

+
poetry install -E spark
+
+

+

To install everything:

+
poetry install --with dev --with linting --with testing --with benchmarking --with typechecking --with demos --all-extras
+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/dev_guides/changing_splink/releases.html b/dev_guides/changing_splink/releases.html new file mode 100644 index 0000000000..8676952cf3 --- /dev/null +++ b/dev_guides/changing_splink/releases.html @@ -0,0 +1,5245 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Releasing a Package Version - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Releasing a new version of Splink

+

Splink is regularly updated with releases to add new features or bug fixes to the package.

+

Below are the steps for releasing a new version of Splink:

+
  1. On a new branch, update pyproject.toml and __init__.py with the new version.
  2. Update CHANGELOG.md. This consists of adding a heading for the new release below the 'Unreleased' heading, with the new version and date. Additionally, the links at the bottom of the file for 'unreleased' and the new version should be updated.
  3. Open a pull request to merge the new branch with the master branch (the base branch).
  4. Once the pull request has been approved, merge the changes and generate a new release in the releases section of the repo, including:

  • Choosing a new release tag (which matches your updates to pyproject.toml and __init__.py). Ensure that your release tag follows semantic versioning. The target branch should be set to master.

+
    +
  • Generating release notes. This can be done automatically by pressing the Generate release notes button.
  • +
+

This will give you release notes based on the Pull Requests which have been merged since the last release.

+


+
    +
  • Publish as the latest release
  • +
+

+

Now your release should be published to PyPI.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/dev_guides/changing_splink/testing.html b/dev_guides/changing_splink/testing.html new file mode 100644 index 0000000000..2ea86a61bf --- /dev/null +++ b/dev_guides/changing_splink/testing.html @@ -0,0 +1,5703 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Testing - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + + + + +

Testing in Splink

+

Tests in Splink make use of the pytest framework. You can find the tests themselves in the tests folder.

+

Splink tests can be broadly categorised into three sets:

+
    +
  • 'Core' tests - these are tests which test some specific bit of functionality which does not depend on any specific SQL dialect. They are usually unit tests - examples are testing InputColumn and testing the latitude-longitude distance calculation.
  • +
  • Backend-agnostic tests - these are tests which run against some SQL backend, but which are written in such a way that they can run against many backends by making use of the backend-agnostic testing framework. The majority of tests are of this type.
  • +
  • Backend-specific tests - these are tests which run against a specific SQL backend, and test some feature particular to this backend. There are not many of these, as Splink is designed to run very similarly independent of the backend used.
  • +
+

Running tests

+

Running tests locally

+

To run tests locally against duckdb only (the default) run: +

poetry run pytest tests/
+
+

To run a single test file, append the filename to the tests/ folder call, for example:

+
poetry run pytest tests/test_u_train.py
+
+

or for a single test, additionally append the test name after a pair of colons, as:

+
poetry run pytest tests/test_u_train.py::test_u_train_multilink
+
+
+Further useful pytest options +

There may be many warnings emitted, for instance by library dependencies, cluttering your output, in which case you can use --disable-pytest-warnings or -W ignore so that these will not be displayed. Some additional command-line options that may be useful:

+
    +
  • -s to disable output capture, so that test output is displayed in the terminal in all cases
  • +
  • -v for verbose mode, where each test instance will be displayed on a separate line with status
  • +
  • -q for quiet mode, where output is extremely minimal
  • +
  • -x to fail on first error/failure rather than continuing to run all selected tests
  • +
  • -m some_mark run only those tests marked with some_mark - see below for useful options here
  • +
+

For instance usage might be: +

# ignore warnings, display output
+pytest -W ignore -s tests/
+
+

or +

# ignore warnings, verbose output, fail on first error/failure
+pytest -W ignore -v -x tests/
+
+

You can find a host of other available options using pytest's in-built help: +

pytest -h
+
+
+

Running tests for specific backends or backend groups

+

You may wish to run tests relating to specific backends, tests which are backend-independent, or any combination of these. Splink allows for various combinations by making use of pytest's mark feature.

+

If you pass no marks explicitly when you invoke pytest, an implicit mark of default is applied, as per the pytest configuration in pyproject.toml (see also the decorator.py file).
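An implicit default mark of this kind can be set via pytest's addopts option. The fragment below is an assumed illustration of such a pyproject.toml configuration, not a verbatim copy of Splink's:

```toml
[tool.pytest.ini_options]
addopts = "-m default"
markers = [
    "core",
    "default",
    "all",
    "duckdb",
    "spark",
    "sqlite",
]
```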

+

The available options are:

+
Run core tests
+

Option for running only the backend-independent 'core' tests:

+
    +
  • poetry run pytest tests/ -m core - run only the 'core' tests, meaning those without dialect-dependence. In practice this means any test that hasn't been decorated using mark_with_dialects_excluding or mark_with_dialects_including.
  • +
+
Run tests on a specific backend
+

Options for running tests on one backend only - this includes tests written specifically for that backend, as well as backend-agnostic tests supported for that backend.

+
    +
  • poetry run pytest tests/ -m duckdb - run all duckdb tests, and all core tests
      +
    • & similarly for other dialects
    • +
    +
  • +
  • poetry run pytest tests/ -m duckdb_only - run all duckdb tests only, and not the core tests
      +
    • & similarly for other dialects
    • +
    +
  • +
+
Run tests across multiple backends
+

Options for running tests on multiple backends (including all backends) - this includes tests written specifically for those backends, as well as backend-agnostic tests supported for those backends.

+
    +
  • pytest tests/ -m default or equivalently pytest tests/ - run all tests in the default group. The default group consists of the core tests, and those dialects in the default group - currently spark and duckdb.
      +
    • Other groups of dialects can be added and will similarly run with pytest tests/ -m new_dialect_group. Dialects within the current scope of testing and the groups they belong to are defined in the dialect_groups dictionary in tests/decorator.py
    • +
    +
  • +
  • pytest tests/ -m all run all tests for all available dialects
  • +
+

These all work alongside all the other pytest options, so for instance to run the tests for training probability_two_random_records_match for only duckdb, ignoring warnings, with quiet output, and exiting on the first failure/error: +

pytest -W ignore -q -x -m duckdb tests/test_estimate_prob_two_rr_match.py
+
+
+Running tests against a specific version of Python +

Testing Splink against a specific version of Python, especially newer versions not included in our GitHub Actions, is vital for identifying compatibility issues early and reviewing errors reported by users.

+

If you're a conda user, you can create an isolated environment according to the instructions in the development quickstart.

+

Another method is to utilise docker 🐳.

+

A pre-built Dockerfile for running tests against python version 3.9.10 can be located within scripts/run_tests.Dockerfile.

+

To run, simply use the following docker command from within a terminal and the root folder of a Splink clone: +

docker build -t run_tests:testing -f scripts/run_tests.Dockerfile . && docker run --rm --name splink-test run_tests:testing
+
+

This will both build the image and run the test suite.

+

Feel free to replace run_tests:testing with an image name and tag you're happy with.

+

Reusing the same image and tag will overwrite your existing image.

+

You can also overwrite the default CMD if you want a different set of pytest command-line options, for example +

docker run --rm --name splink-test run_tests:testing pytest -W ignore -m spark tests/test_u_train.py
+
+
+

Running with a pre-existing Postgres database

+

If you have a pre-existing Postgres server you wish to use to run the tests against, you will need to specify environment variables for the credentials where they differ from default (in parentheses):

+
    +
  • SPLINKTEST_PG_USER (splinkognito)
  • +
  • SPLINKTEST_PG_PASSWORD (splink123!)
  • +
  • SPLINKTEST_PG_HOST (localhost)
  • +
  • SPLINKTEST_PG_PORT (5432)
  • +
  • SPLINKTEST_PG_DB (splink_db) - tests will not actually run against this, but it is from a connection to this that the temporary test database + user will be created
  • +
+
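As a rough illustration of how these variables behave, the sketch below resolves the connection settings from the environment, falling back to the documented defaults. This is a hypothetical helper for illustration, not Splink's actual test code:

```python
import os

# Sketch only (not Splink's actual test code): resolve the Postgres test
# credentials from the environment, falling back to the documented defaults.
def pg_test_settings(env=None):
    env = os.environ if env is None else env
    defaults = {
        "SPLINKTEST_PG_USER": "splinkognito",
        "SPLINKTEST_PG_PASSWORD": "splink123!",
        "SPLINKTEST_PG_HOST": "localhost",
        "SPLINKTEST_PG_PORT": "5432",
        "SPLINKTEST_PG_DB": "splink_db",
    }
    return {key: env.get(key, default) for key, default in defaults.items()}
```

So running, say, SPLINKTEST_PG_HOST=db.example pytest tests/ overrides only the host, and every other setting keeps its default.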

While care has been taken to ensure that tests are run using minimal permissions, and are cleaned up after, it is probably wise to run tests connected to a non-important database, in case anything goes wrong. In addition to the standard privileges for Splink usage, in order to run the tests you will need:

+
    +
  • CREATE DATABASE to create a temporary testing database
  • +
  • CREATEROLE to create a temporary user role with limited privileges, which will be actually used for all the SQL execution in the tests
  • +
+

Tests in CI

+

Splink utilises GitHub actions to run tests for each pull request. This consists of a few independent checks:

+
    +
  • The full test suite is run separately against several different python versions
  • +
  • The example notebooks are checked to ensure they run without error
  • +
  • The tutorial notebooks are checked to ensure they run without error
  • +
+

Writing tests

+

Core tests

+

Core tests are treated the same way as ordinary pytest tests. Any test is marked as core by default, and will only be excluded from being a core test if it is decorated using either:

+ +

from the test decorator file.

+

Backend-agnostic testing

+

The majority of tests should be written using the backend-agnostic testing framework. This just provides some small tools which allow tests to be written in a backend-independent way. This means the tests can then be run against all available SQL backends (or a subset, if some lack necessary features for the test).

+

As an example, let's consider a test that will run on all dialects, and then break down the various parts to see what each is doing.

+
 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+ 9
+10
+11
+12
+13
+14
+15
+16
+17
+18
+19
+20
+21
+22
+23
+24
+25
from tests.decorator import mark_with_dialects_excluding
+
+@mark_with_dialects_excluding()
+def test_feature_that_works_for_all_backends(test_helpers, dialect, some_other_test_fixture):
+    helper = test_helpers[dialect]
+
+    df = helper.load_frame_from_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
+    settings = SettingsCreator(
+        link_type="dedupe_only",
+        comparisons=[
+            cl.ExactMatch("first_name"),
+            cl.ExactMatch("surname"),
+        ],
+        blocking_rules_to_generate_predictions=[
+            block_on("first_name"),
+        ],
+    )
+    linker = helper.Linker(
+        df,
+        settings,
+        **helper.extra_linker_args(),
+    )
+
+
+    # and then some actual testing logic
+
+

Firstly you should import the decorator-factory mark_with_dialects_excluding, which will decorate each test function:

+
1
from tests.decorator import mark_with_dialects_excluding
+
+

Then we define the function, and pass parameters:

+
3
+4
@mark_with_dialects_excluding()
+def test_feature_that_works_for_all_backends(test_helpers, dialect, some_other_test_fixture):
+
+

The decorator @mark_with_dialects_excluding() will do two things:

+
    +
  • marks the test it decorates with the appropriate custom pytest marks. This ensures that it will be run with tests for each dialect, excluding any that are passed as arguments; in this case it will be run for all dialects, as we have passed no arguments.
  • +
  • parameterises the test with a string parameter dialect, which will be used to configure the test for that dialect. The test will run for each value of dialect possible, excluding any passed to the decorator (none in this case).
  • +
+
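The dialect selection this describes can be sketched as follows. This is a simplified illustration only - the dialect list is assumed, and the real logic lives in tests/decorator.py:

```python
# Simplified illustration of the decorator's dialect selection - the real
# logic lives in tests/decorator.py; this dialect list is assumed.
ALL_TEST_DIALECTS = ["duckdb", "spark", "sqlite", "postgres"]

def dialects_for_test(*excluded):
    """Dialects a test decorated with mark_with_dialects_excluding(*excluded)
    would be parameterised with."""
    return [d for d in ALL_TEST_DIALECTS if d not in excluded]
```

With no arguments every dialect is kept, matching the no-exclusion default described above.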

You should aim to exclude as few dialects as possible - consider if you really need to exclude any. Dialects should only be excluded if the test doesn't make sense for them due to features they lack. The default choice should be the decorator with no arguments @mark_with_dialects_excluding(), meaning the test runs for all dialects.

+
3
+4
@mark_with_dialects_excluding()
+def test_feature_that_works_for_all_backends(test_helpers, dialect, some_other_test_fixture):
+
+

As well as the parameter dialect (which is provided by the decorator), we must also pass the helper-factory fixture test_helpers. We can additionally pass further fixtures if needed - in this case some_other_test_fixture. We could similarly provide an explicit parameterisation to the test, in which case we would also pass these parameters - see the pytest docs on parameterisation for more information.

+
5
    helper = test_helpers[dialect]
+
+

The fixture test_helpers is simply a dictionary of the specific-dialect test helpers - here we pick the appropriate one for our test.

+

Each helper has the same set of methods and properties, which encapsulate all of the dialect-dependencies. You can find the full set of properties and methods by examining the source for the base class TestHelper.

+
7
    df = helper.load_frame_from_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
+
+

Here we are now actually using a method of the test helper - in this case we are loading a table from a csv to the database and returning it in a form suitable for passing to a Splink linker.

+

Finally we instantiate the linker, passing any default set of extra arguments provided by the helper, which some dialects require. +

18
    linker = helper.Linker(df, settings, **helper.extra_linker_args())
+
+

From this point onwards we will be working with the instantiated linker, and so will not need to refer to helper any more - the rest of the test can be written as usual.

+

Excluding some backends

+

Now let's consider an example in which we wanted to test a ComparisonLevel that included the split_part function which does not exist in the sqlite dialect. We assume that this particular comparison level is crucial for the test to make sense, otherwise we would rewrite this line to make it run universally. When you come to run the tests, this test will not run on the sqlite backend.

+
{
+    "sql_condition": "split_part(email_l, '@', 1) = split_part(email_r, '@', 1)",
+    "label_for_charts": "email local-part matches",
+}
+
+
+

Warning

+

Tests should be made available to the widest range of backends possible. Only exclude backends if features not shared by all backends are crucial to the test-logic - otherwise consider rewriting things so that all backends are covered.

+
+

We therefore want to exclude the sqlite backend, as the test relies on features not directly available for it, which we can do as follows:

+
1
+2
+3
+4
+5
+6
+7
+8
+9
from tests.decorator import mark_with_dialects_excluding
+
+@mark_with_dialects_excluding("sqlite")
+def test_feature_that_doesnt_work_with_sqlite(test_helpers, dialect, some_other_test_fixture):
+    helper = test_helpers[dialect]
+
+    df = helper.load_frame_from_csv("./tests/datasets/fake_1000_from_splink_demos.csv")
+
+    # and then some actual testing logic
+
+

The key difference is the argument we pass to the decorator: +

3
+4
@mark_with_dialects_excluding("sqlite")
+def test_feature_that_doesnt_work_with_sqlite(test_helpers, dialect, some_other_test_fixture):
+
As above, this marks the test it decorates with the appropriate custom pytest marks, but in this case it ensures that it will be run with tests for each dialect excluding sqlite. Again dialect is passed as a parameter, and the test will run in turn for each value of dialect except for sqlite.

If you need to exclude multiple dialects this is also possible - just pass each as an argument. For example, to decorate a test that is not supported on spark or sqlite, use the decorator @mark_with_dialects_excluding("sqlite", "spark").

+

Backend-specific tests

+

If you intend to write a test for a specific backend, first consider whether it is definitely specific to that backend - if not then a backend-agnostic test would be preferable, as then your test will be run against many backends. If you really do need to test features peculiar to one backend, then you can write it simply as you would an ordinary pytest test. The only difference is that you should decorate it with @mark_with_dialects_including (from tests/decorator.py) - for example:

+
+
+
+
@mark_with_dialects_including("duckdb")
+def test_some_specific_duckdb_feature():
+    ...
+
+
+
+
@mark_with_dialects_including("spark")
+def test_some_specific_spark_feature():
+    ...
+
+
+
+
@mark_with_dialects_including("sqlite")
+def test_some_specific_sqlite_feature():
+    ...
+
+
+
+
+

This ensures that the test gets marked appropriately, so that it runs only when tests for that specific backend should be run, and excludes it from the set of core tests.

+

Note that unlike the exclusive mark_with_dialects_excluding, this decorator will not parameterise the test with the dialect argument. This is because usage of the inclusive form is largely designed for single-dialect tests. If you wish to override this behaviour and parameterise the test you can use the argument pass_dialect, for example @mark_with_dialects_including("spark", "sqlite", pass_dialect=True), in which case you would need to write the test in a backend-independent manner.

+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/dev_guides/charts/building_charts.html b/dev_guides/charts/building_charts.html new file mode 100644 index 0000000000..311d66241d --- /dev/null +++ b/dev_guides/charts/building_charts.html @@ -0,0 +1,6414 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Building new charts - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + +

Building a new chart in Splink

+

As mentioned in the Understanding Splink Charts topic guide, splink charts are made up of three distinct parts:

+
  1. A function to create the dataset for the chart
  2. A template chart definition (in a json file)
  3. A function to read the chart definition, add the data to it, and return the chart itself
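In practice, part 3 usually boils down to a small function that reads the JSON template and splices the data in. A minimal sketch of that pattern - the function and file names here are hypothetical, not Splink's actual internals:

```python
import json

# Illustrative sketch of part 3 (names are hypothetical, not Splink internals):
# read a chart definition template, add the data to it, and return the spec.
def chart_spec_from_template(template_path, records):
    with open(template_path) as f:
        spec = json.load(f)
    # Inject the chart data as inline values, leaving the rest of the
    # template (marks, encodings, config) untouched
    spec["data"] = {"values": records}
    return spec
```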

Worked Example

+

Below is a worked example of how to create a new chart that shows all comparisons levels ordered by match weight:

+
import splink.comparison_library as cl
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
+df = splink_datasets.fake_1000
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    comparisons=[
+      cl.NameComparison("first_name"),
+        cl.NameComparison("surname"),
+        cl.DateOfBirthComparison("dob", input_is_string=True),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        cl.LevenshteinAtThresholds("email", 2),
+    ],
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name", "dob"),
+        block_on("surname"),
+    ]
+)
+
+linker = Linker(df, settings, DuckDBAPI())
+linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+for rule in [block_on("first_name"), block_on("dob")]:
+    linker.training.estimate_parameters_using_expectation_maximisation(rule)
+
+

Generate data for chart

+
# Take linker object and extract complete settings dict
+records = linker._settings_obj._parameters_as_detailed_records
+
+cols_to_keep = [
+    "comparison_name",
+    "sql_condition",
+    "label_for_charts",
+    "m_probability",
+    "u_probability",
+    "bayes_factor",
+    "log2_bayes_factor",
+    "comparison_vector_value"
+]
+
+# Keep useful information for a match weights chart
+records = [{k: r[k] for k in cols_to_keep}
+           for r in records
+           if r["comparison_vector_value"] != -1 and r["comparison_sort_order"] != -1]
+
+records[:3]
+
+
[{'comparison_name': 'first_name',
+  'sql_condition': '"first_name_l" = "first_name_r"',
+  'label_for_charts': 'Exact match on first_name',
+  'm_probability': 0.5009783629340309,
+  'u_probability': 0.0057935713975033705,
+  'bayes_factor': 86.4714229896119,
+  'log2_bayes_factor': 6.434151525637829,
+  'comparison_vector_value': 4},
+ {'comparison_name': 'first_name',
+  'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.92',
+  'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.92',
+  'm_probability': 0.15450921411813767,
+  'u_probability': 0.0023429457903817435,
+  'bayes_factor': 65.9465595629351,
+  'log2_bayes_factor': 6.043225490816602,
+  'comparison_vector_value': 3},
+ {'comparison_name': 'first_name',
+  'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.88',
+  'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.88',
+  'm_probability': 0.07548037415770431,
+  'u_probability': 0.0015484319951285285,
+  'bayes_factor': 48.7463281533646,
+  'log2_bayes_factor': 5.607221645966225,
+  'comparison_vector_value': 2}]
+
+

Create a chart template

+

Build prototype chart in Altair

+
import pandas as pd
+import altair as alt
+
+df = pd.DataFrame(records)
+
+# Need a unique name for each comparison level - easier to create in pandas than altair
+df["cl_id"] = df["comparison_name"] + "_" + \
+    df["comparison_vector_value"].astype("str")
+
+# Simple start - bar chart with x, y and color encodings
+alt.Chart(df).mark_bar().encode(
+    y="cl_id",
+    x="log2_bayes_factor",
+    color="comparison_name"
+)
+
+ +
+ + +

Sort bars, edit axes/titles

+
alt.Chart(df).mark_bar().encode(
+    y=alt.Y("cl_id",
+        sort="-x",
+        title="Comparison level"
+    ),
+    x=alt.X("log2_bayes_factor",
+        title="Comparison level match weight = log2(m/u)",
+        scale=alt.Scale(domain=[-10,10])
+    ),
+    color="comparison_name"
+).properties(
+    title="New Chart - WOO!"
+).configure_view(
+    step=15
+)
+
+ +
+ + +

Add tooltip

+
alt.Chart(df).mark_bar().encode(
+    y=alt.Y("cl_id",
+            sort="-x",
+            title="Comparison level"
+            ),
+    x=alt.X("log2_bayes_factor",
+            title="Comparison level match weight = log2(m/u)",
+            scale=alt.Scale(domain=[-10, 10])
+            ),
+    color="comparison_name",
+    tooltip=[
+        "comparison_name",
+        "label_for_charts",
+        "sql_condition",
+        "m_probability",
+        "u_probability",
+        "bayes_factor",
+        "log2_bayes_factor"
+        ]
+).properties(
+    title="New Chart - WOO!"
+).configure_view(
+    step=15
+)
+
+ +
+ + +

Add text layer

+
# Create base chart with shared data and encodings (mark type not specified)
+base = alt.Chart(df).encode(
+    y=alt.Y("cl_id",
+            sort="-x",
+            title="Comparison level"
+            ),
+    x=alt.X("log2_bayes_factor",
+            title="Comparison level match weight = log2(m/u)",
+            scale=alt.Scale(domain=[-10, 10])
+            ),
+    tooltip=[
+        "comparison_name",
+        "label_for_charts",
+        "sql_condition",
+        "m_probability",
+        "u_probability",
+        "bayes_factor",
+        "log2_bayes_factor"
+    ]
+)
+
+# Build bar chart from base (color legend made redundant by text labels)
+bar = base.mark_bar().encode(
+    color=alt.Color("comparison_name", legend=None)
+)
+
+# Build text layer from base
+text = base.mark_text(dx=0, align="right").encode(
+    text="comparison_name"
+)
+
+# Final layered chart
+chart = bar + text
+
+# Add global config
+chart.resolve_axis(
+    y="shared",
+    x="shared"
+).properties(
+    title="New Chart - WOO!"
+).configure_view(
+    step=15
+)
+
+ +
+ + +

Sometimes things go wrong in Altair and it's not clear why or how to fix it. If the docs and Stack Overflow don't have a solution, the answer is usually that Altair is making decisions under the hood about the Vega-Lite schema that are out of your control.

+

In this example, the sorting of the y-axis is broken when layering charts. If we show bar and text side-by-side, you can see they work as expected, but the sorting is broken in the layering process.

+
bar | text
+
+ +
+ + +

Once we get to this stage (or whenever you're comfortable), we can switch to Vega-Lite by exporting the JSON from our chart object, or opening the chart in the Vega-Lite editor.

+
chart.to_json()
+
+
+Chart JSON +
  {
+  "$schema": "https://vega.github.io/schema/vega-lite/v5.8.0.json",
+  "config": {
+    "view": {
+      "continuousHeight": 300,
+      "continuousWidth": 300
+    }
+  },
+  "data": {
+    "name": "data-3901c03d78701611834aa82ab7374cce"
+  },
+  "datasets": {
+    "data-3901c03d78701611834aa82ab7374cce": [
+      {
+        "bayes_factor": 86.62949969575988,
+        "cl_id": "first_name_4",
+        "comparison_name": "first_name",
+        "comparison_vector_value": 4,
+        "label_for_charts": "Exact match first_name",
+        "log2_bayes_factor": 6.436786480320881,
+        "m_probability": 0.5018941916173814,
+        "sql_condition": "\"first_name_l\" = \"first_name_r\"",
+        "u_probability": 0.0057935713975033705
+      },
+      {
+        "bayes_factor": 82.81743551783742,
+        "cl_id": "first_name_3",
+        "comparison_name": "first_name",
+        "comparison_vector_value": 3,
+        "label_for_charts": "Damerau_levenshtein <= 1",
+        "log2_bayes_factor": 6.371862624533329,
+        "m_probability": 0.19595791797531015,
+        "sql_condition": "damerau_levenshtein(\"first_name_l\", \"first_name_r\") <= 1",
+        "u_probability": 0.00236614327345483
+      },
+      {
+        "bayes_factor": 35.47812468678278,
+        "cl_id": "first_name_2",
+        "comparison_name": "first_name",
+        "comparison_vector_value": 2,
+        "label_for_charts": "Jaro_winkler_similarity >= 0.9",
+        "log2_bayes_factor": 5.148857848140163,
+        "m_probability": 0.045985303626033085,
+        "sql_condition": "jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.9",
+        "u_probability": 0.001296159366708712
+      },
+      {
+        "bayes_factor": 11.266641370022352,
+        "cl_id": "first_name_1",
+        "comparison_name": "first_name",
+        "comparison_vector_value": 1,
+        "label_for_charts": "Jaro_winkler_similarity >= 0.8",
+        "log2_bayes_factor": 3.493985601438375,
+        "m_probability": 0.06396730257493154,
+        "sql_condition": "jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.8",
+        "u_probability": 0.005677583982137938
+      },
+      {
+        "bayes_factor": 0.19514855669673956,
+        "cl_id": "first_name_0",
+        "comparison_name": "first_name",
+        "comparison_vector_value": 0,
+        "label_for_charts": "All other comparisons",
+        "log2_bayes_factor": -2.357355302129234,
+        "m_probability": 0.19219528420634394,
+        "sql_condition": "ELSE",
+        "u_probability": 0.9848665419801952
+      },
+      {
+        "bayes_factor": 113.02818119005431,
+        "cl_id": "surname_4",
+        "comparison_name": "surname",
+        "comparison_vector_value": 4,
+        "label_for_charts": "Exact match surname",
+        "log2_bayes_factor": 6.820538712806792,
+        "m_probability": 0.5527050424941531,
+        "sql_condition": "\"surname_l\" = \"surname_r\"",
+        "u_probability": 0.004889975550122249
+      },
+      {
+        "bayes_factor": 80.61351958508214,
+        "cl_id": "surname_3",
+        "comparison_name": "surname",
+        "comparison_vector_value": 3,
+        "label_for_charts": "Damerau_levenshtein <= 1",
+        "log2_bayes_factor": 6.332949906378981,
+        "m_probability": 0.22212752320956386,
+        "sql_condition": "damerau_levenshtein(\"surname_l\", \"surname_r\") <= 1",
+        "u_probability": 0.0027554624131641246
+      },
+      {
+        "bayes_factor": 48.57568460485815,
+        "cl_id": "surname_2",
+        "comparison_name": "surname",
+        "comparison_vector_value": 2,
+        "label_for_charts": "Jaro_winkler_similarity >= 0.9",
+        "log2_bayes_factor": 5.602162423566203,
+        "m_probability": 0.0490149338194711,
+        "sql_condition": "jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.9",
+        "u_probability": 0.0010090425738347498
+      },
+      {
+        "bayes_factor": 13.478820689774516,
+        "cl_id": "surname_1",
+        "comparison_name": "surname",
+        "comparison_vector_value": 1,
+        "label_for_charts": "Jaro_winkler_similarity >= 0.8",
+        "log2_bayes_factor": 3.752622370380284,
+        "m_probability": 0.05001678986356945,
+        "sql_condition": "jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.8",
+        "u_probability": 0.003710768991942586
+      },
+      {
+        "bayes_factor": 0.1277149376863226,
+        "cl_id": "surname_0",
+        "comparison_name": "surname",
+        "comparison_vector_value": 0,
+        "label_for_charts": "All other comparisons",
+        "log2_bayes_factor": -2.969000820703079,
+        "m_probability": 0.1261357106132424,
+        "sql_condition": "ELSE",
+        "u_probability": 0.9876347504709363
+      },
+      {
+        "bayes_factor": 236.78351486807742,
+        "cl_id": "dob_5",
+        "comparison_name": "dob",
+        "comparison_vector_value": 5,
+        "label_for_charts": "Exact match",
+        "log2_bayes_factor": 7.887424832202931,
+        "m_probability": 0.41383785481447766,
+        "sql_condition": "\"dob_l\" = \"dob_r\"",
+        "u_probability": 0.0017477477477477479
+      },
+      {
+        "bayes_factor": 65.74625268345359,
+        "cl_id": "dob_4",
+        "comparison_name": "dob",
+        "comparison_vector_value": 4,
+        "label_for_charts": "Damerau_levenshtein <= 1",
+        "log2_bayes_factor": 6.038836762842662,
+        "m_probability": 0.10806341031654734,
+        "sql_condition": "damerau_levenshtein(\"dob_l\", \"dob_r\") <= 1",
+        "u_probability": 0.0016436436436436436
+      },
+      {
+        "bayes_factor": 29.476860590690453,
+        "cl_id": "dob_3",
+        "comparison_name": "dob",
+        "comparison_vector_value": 3,
+        "label_for_charts": "Within 1 month",
+        "log2_bayes_factor": 4.881510974428093,
+        "m_probability": 0.11300938544779224,
+        "sql_condition": "\n            abs(date_diff('month',\n                strptime(\"dob_l\", '%Y-%m-%d'),\n                strptime(\"dob_r\", '%Y-%m-%d'))\n                ) <= 1\n        ",
+        "u_probability": 0.003833833833833834
+      },
+      {
+        "bayes_factor": 3.397551460259144,
+        "cl_id": "dob_2",
+        "comparison_name": "dob",
+        "comparison_vector_value": 2,
+        "label_for_charts": "Within 1 year",
+        "log2_bayes_factor": 1.7644954026183992,
+        "m_probability": 0.17200656922328977,
+        "sql_condition": "\n            abs(date_diff('year',\n                strptime(\"dob_l\", '%Y-%m-%d'),\n                strptime(\"dob_r\", '%Y-%m-%d'))\n                ) <= 1\n        ",
+        "u_probability": 0.05062662662662663
+      },
+      {
+        "bayes_factor": 0.6267794172297388,
+        "cl_id": "dob_1",
+        "comparison_name": "dob",
+        "comparison_vector_value": 1,
+        "label_for_charts": "Within 10 years",
+        "log2_bayes_factor": -0.6739702908716182,
+        "m_probability": 0.19035523041792068,
+        "sql_condition": "\n            abs(date_diff('year',\n                strptime(\"dob_l\", '%Y-%m-%d'),\n                strptime(\"dob_r\", '%Y-%m-%d'))\n                ) <= 10\n        ",
+        "u_probability": 0.3037037037037037
+      },
+      {
+        "bayes_factor": 0.004272180302776005,
+        "cl_id": "dob_0",
+        "comparison_name": "dob",
+        "comparison_vector_value": 0,
+        "label_for_charts": "All other comparisons",
+        "log2_bayes_factor": -7.870811748958801,
+        "m_probability": 0.002727549779972325,
+        "sql_condition": "ELSE",
+        "u_probability": 0.6384444444444445
+      },
+      {
+        "bayes_factor": 10.904938885948333,
+        "cl_id": "city_1",
+        "comparison_name": "city",
+        "comparison_vector_value": 1,
+        "label_for_charts": "Exact match",
+        "log2_bayes_factor": 3.4469097796586596,
+        "m_probability": 0.6013808934279701,
+        "sql_condition": "\"city_l\" = \"city_r\"",
+        "u_probability": 0.0551475711801453
+      },
+      {
+        "bayes_factor": 0.42188504195296994,
+        "cl_id": "city_0",
+        "comparison_name": "city",
+        "comparison_vector_value": 0,
+        "label_for_charts": "All other comparisons",
+        "log2_bayes_factor": -1.2450781575619725,
+        "m_probability": 0.3986191065720299,
+        "sql_condition": "ELSE",
+        "u_probability": 0.9448524288198547
+      },
+      {
+        "bayes_factor": 269.6074384240141,
+        "cl_id": "email_2",
+        "comparison_name": "email",
+        "comparison_vector_value": 2,
+        "label_for_charts": "Exact match",
+        "log2_bayes_factor": 8.07471649055784,
+        "m_probability": 0.5914840252879943,
+        "sql_condition": "\"email_l\" = \"email_r\"",
+        "u_probability": 0.0021938713143283602
+      },
+      {
+        "bayes_factor": 222.9721189153553,
+        "cl_id": "email_1",
+        "comparison_name": "email",
+        "comparison_vector_value": 1,
+        "label_for_charts": "Levenshtein <= 2",
+        "log2_bayes_factor": 7.800719512398763,
+        "m_probability": 0.3019669634613132,
+        "sql_condition": "levenshtein(\"email_l\", \"email_r\") <= 2",
+        "u_probability": 0.0013542812658830492
+      },
+      {
+        "bayes_factor": 0.10692840956298139,
+        "cl_id": "email_0",
+        "comparison_name": "email",
+        "comparison_vector_value": 0,
+        "label_for_charts": "All other comparisons",
+        "log2_bayes_factor": -3.225282884575804,
+        "m_probability": 0.10654901125069259,
+        "sql_condition": "ELSE",
+        "u_probability": 0.9964518474197885
+      }
+    ]
+  },
+  "layer": [
+    {
+      "encoding": {
+        "color": {
+          "field": "comparison_name",
+          "legend": null,
+          "type": "nominal"
+        },
+        "tooltip": [
+          {
+            "field": "comparison_name",
+            "type": "nominal"
+          },
+          {
+            "field": "label_for_charts",
+            "type": "nominal"
+          },
+          {
+            "field": "sql_condition",
+            "type": "nominal"
+          },
+          {
+            "field": "m_probability",
+            "type": "quantitative"
+          },
+          {
+            "field": "u_probability",
+            "type": "quantitative"
+          },
+          {
+            "field": "bayes_factor",
+            "type": "quantitative"
+          },
+          {
+            "field": "log2_bayes_factor",
+            "type": "quantitative"
+          }
+        ],
+        "x": {
+          "field": "log2_bayes_factor",
+          "scale": {
+            "domain": [
+              -10,
+              10
+            ]
+          },
+          "title": "Comparison level match weight = log2(m/u)",
+          "type": "quantitative"
+        },
+        "y": {
+          "field": "cl_id",
+          "sort": "-x",
+          "title": "Comparison level",
+          "type": "nominal"
+        }
+      },
+      "mark": {
+        "type": "bar"
+      }
+    },
+    {
+      "encoding": {
+        "text": {
+          "field": "comparison_name",
+          "type": "nominal"
+        },
+        "tooltip": [
+          {
+            "field": "comparison_name",
+            "type": "nominal"
+          },
+          {
+            "field": "label_for_charts",
+            "type": "nominal"
+          },
+          {
+            "field": "sql_condition",
+            "type": "nominal"
+          },
+          {
+            "field": "m_probability",
+            "type": "quantitative"
+          },
+          {
+            "field": "u_probability",
+            "type": "quantitative"
+          },
+          {
+            "field": "bayes_factor",
+            "type": "quantitative"
+          },
+          {
+            "field": "log2_bayes_factor",
+            "type": "quantitative"
+          }
+        ],
+        "x": {
+          "field": "log2_bayes_factor",
+          "scale": {
+            "domain": [
+              -10,
+              10
+            ]
+          },
+          "title": "Comparison level match weight = log2(m/u)",
+          "type": "quantitative"
+        },
+        "y": {
+          "field": "cl_id",
+          "sort": "-x",
+          "title": "Comparison level",
+          "type": "nominal"
+        }
+      },
+      "mark": {
+        "align": "right",
+        "dx": 0,
+        "type": "text"
+      }
+    }
+  ]
+  }
+
+
+

Edit in Vega-Lite

+

Opening the JSON from the above chart in Vega-Lite editor, it is now behaving as intended, with both bar and text layers sorted by match weight.

+

If the chart is working as intended, there is only one step required before saving the JSON file - removing data from the template schema.

+

The data appears as follows with a dictionary of all included datasets by name, and then each chart referencing the data it uses by name:

+
"data": {"name": "data-a6c84a9cf1a0c7a2cd30cc1a0e2c1185"},
+"datasets": {
+  "data-a6c84a9cf1a0c7a2cd30cc1a0e2c1185": [
+
+    ...
+
+  ]
+},
+
+

Where only one dataset is required, this is equivalent to:

"data": {"values": [...]}
+
+
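As a rough sketch (using a hypothetical dataset name), the same transformation can be done programmatically before stripping the example rows out of the template:

```python
# Minimal sketch, assuming the spec has already been loaded as a dict.
# The dataset name "data-abc123" is illustrative only.
spec = {
    "data": {"name": "data-abc123"},
    "datasets": {"data-abc123": [{"x": 1}, {"x": 2}]},
}

# Inline the single named dataset as "data.values"...
name = spec["data"].pop("name")
spec["data"]["values"] = spec.pop("datasets")[name]

# ...then empty it, so the saved template carries no example data.
spec["data"]["values"] = []
```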

After removing the data references, the template can be saved in Splink as splink/files/chart_defs/my_new_chart.json.

+

Combine the chart dataset and template

+

Putting all of the above together, Splink needs definitions for the methods that generate the chart and the data behind it (these can be separate or performed by the same function if relatively simple).

+

Chart definition

+

In splink/charts.py we can add a new function to populate the chart definition with the provided data:

+
def my_new_chart(records, as_dict=False):
+    chart_path = "my_new_chart.json"
+    chart = load_chart_definition(chart_path)
+
+    chart["data"]["values"] = records
+    return altair_or_json(chart, as_dict=as_dict)
+
+
+

Note - only the data is being added to a fixed chart definition here. Other elements of the chart spec can be changed by editing the chart dictionary in the same way.

+

For example, if you wanted to add a color_scheme argument to replace the default scheme ("tableau10"), this function could include the line: chart["layer"][0]["encoding"]["color"]["scale"]["scheme"] = color_scheme

+
+

Chart method

+

Then we can add a method to the linker in splink/linker.py so the chart can be generated by linker.my_new_chart():

+
from .charts import my_new_chart
+
+...
+
+class Linker:
+
+    ...
+
+    def my_new_chart(self):
+
+        # Take linker object and extract complete settings dict
+        records = self._settings_obj._parameters_as_detailed_records
+
+        cols_to_keep = [
+            "comparison_name",
+            "sql_condition",
+            "label_for_charts",
+            "m_probability",
+            "u_probability",
+            "bayes_factor",
+            "log2_bayes_factor",
+            "comparison_vector_value"
+        ]
+
+        # Keep useful information for a match weights chart
+        records = [{k: r[k] for k in cols_to_keep}
+                   for r in records 
+                   if r["comparison_vector_value"] != -1 and r["comparison_sort_order"] != -1]
+
+        return my_new_chart(records)
+
+

Previous new chart PRs

+

Real-life Splink chart additions, for reference:

+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/dev_guides/charts/understanding_and_editing_charts.html b/dev_guides/charts/understanding_and_editing_charts.html new file mode 100644 index 0000000000..61a87b2a1a --- /dev/null +++ b/dev_guides/charts/understanding_and_editing_charts.html @@ -0,0 +1,5511 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Understanding and editing charts - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

Charts in Splink

+

Interactive charts are a key tool when linking data with Splink. To see all of the charts available, check out the Splink Charts Gallery.

+ +

Charts in Splink are built with Altair.

+

For a given chart, there is usually:

+ +
The Vega-Lite Editor

By far the best feature of Vega-Lite is the online editor where the JSON schema and the chart are shown side-by-side, showing changes in real time as the editor helps you to navigate the API.

+

Vega-Lite editor

+
+

Editing existing charts

+

If you take any Altair chart in HTML format, you should be able to make changes pretty easily with the Vega-Lite Editor.

+

For example, consider the comparator_score_chart from the similarity analysis library:

+ + + + + + + + + + + + + +
BeforeAfter
Alt textAlt text
+

Desired changes

+
    +
  • Titles (shared title)
  • Axis titles
  • Shared y-axis
  • Colour scales!! 🤮 (see the Vega colour schemes docs)
      • red-green is an accessibility no-no
      • shared colour scheme for different metrics
      • unpleasant and unclear to look at
  • legends not necessary (especially when using text labels)
  • Text size encoding (larger text for similar strings)
  • Remove "_similarity" and "_distance" from column labels
  • Fixed column width (rather than chart width)
  • Row highlighting (on click/hover)
+

The old spec can be pasted into the Vega-Lite editor and edited as shown in the video below:

+

+

Check out the final, improved version chart specification.

+
Before-After diff
@@ -1,9 +1,8 @@
+{
+-  "config": {
+-    "view": {
+-      "continuousWidth": 400,
+-      "continuousHeight": 300
+-    }
++  "title": {
++    "text": "Heatmaps of string comparison metrics",
++    "anchor": "middle",
++    "fontSize": 16
+  },
+  "hconcat": [
+    {
+@@ -18,25 +17,32 @@
+                  0,
+                  1
+                ],
+-                "range": [
+-                  "red",
+-                  "green"
+-                ]
++                "scheme": "greenblue"
+              },
+-              "type": "quantitative"
++              "type": "quantitative",
++              "legend": null
+            },
+            "x": {
+              "field": "comparator",
+-              "type": "ordinal"
++              "type": "ordinal",
++              "title": null
+            },
+            "y": {
+              "field": "strings_to_compare",
+-              "type": "ordinal"
++              "type": "ordinal",
++              "title": "String comparison",
++              "axis": {
++                "titleFontSize": 14
++              }
+            }
+          },
+-          "height": 300,
+-          "title": "Heatmap of Similarity Scores",
+-          "width": 300
++          "title": "Similarity",
++          "width": {
++            "step": 40
++          },
++          "height": {
++            "step": 30
++          }
+        },
+        {
+          "mark": {
+@@ -44,6 +50,16 @@
+            "baseline": "middle"
+          },
+          "encoding": {
++            "size": {
++              "field": "score",
++              "scale": {
++                "range": [
++                  8,
++                  14
++                ]
++              },
++              "legend": null
++            },
+            "text": {
+              "field": "score",
+              "format": ".2f",
+@@ -51,7 +67,10 @@
+            },
+            "x": {
+              "field": "comparator",
+-              "type": "ordinal"
++              "type": "ordinal",
++              "axis": {
++                "labelFontSize": 12
++              }
+            },
+            "y": {
+              "field": "strings_to_compare",
+@@ -72,29 +91,33 @@
+            "color": {
+              "field": "score",
+              "scale": {
+-                "domain": [
+-                  0,
+-                  5
+-                ],
+-                "range": [
+-                  "green",
+-                  "red"
+-                ]
++                "scheme": "yelloworangered",
++                "reverse": true
+              },
+-              "type": "quantitative"
++              "type": "quantitative",
++              "legend": null
+            },
+            "x": {
+              "field": "comparator",
+-              "type": "ordinal"
++              "type": "ordinal",
++              "title": null,
++              "axis": {
++                "labelFontSize": 12
++              }
+            },
+            "y": {
+              "field": "strings_to_compare",
+-              "type": "ordinal"
++              "type": "ordinal",
++              "axis": null
+            }
+          },
+-          "height": 300,
+-          "title": "Heatmap of Distance Scores",
+-          "width": 200
++          "title": "Distance",
++          "width": {
++            "step": 40
++          },
++          "height": {
++            "step": 30
++          }
+        },
+        {
+          "mark": {
+@@ -102,6 +125,17 @@
+            "baseline": "middle"
+          },
+          "encoding": {
++            "size": {
++              "field": "score",
++              "scale": {
++                "range": [
++                  8,
++                  14
++                ],
++                "reverse": true
++              },
++              "legend": null
++            },
+            "text": {
+              "field": "score",
+              "type": "quantitative"
+@@ -124,7 +158,9 @@
+  ],
+  "resolve": {
+    "scale": {
+-      "color": "independent"
++      "color": "independent",
++      "y": "shared",
++      "size": "independent"
+    }
+  },
+  "$schema": "https://vega.github.io/schema/vega-lite/v4.17.0.json",
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/dev_guides/debug_modes.html b/dev_guides/debug_modes.html new file mode 100644 index 0000000000..6927f7873d --- /dev/null +++ b/dev_guides/debug_modes.html @@ -0,0 +1,5399 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Understanding and debugging Splink - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

Understanding and debugging Splink

+ +
+

Splink contains tooling to help developers understand the underlying computations, see how caching and pipelining are working, and debug problems.

+

There are two main mechanisms: _debug_mode, and setting different logging levels.

+

Debug mode

+

You can turn on debug mode by setting linker._debug_mode = True.

+

This has the following effects:

+
    +
  • Each step of Splink's calculations is executed in turn. That is, pipelining is switched off.
  • The SQL statements being executed by Splink are displayed
  • The results of the SQL statements are displayed in tabular format
+

This is probably the best way to understand each step of the calculations being performed by Splink - because a lot of the implementation gets 'hidden' within pipelines for performance reasons.

+

Note that enabling debug mode will dramatically reduce Splink's performance!

+

Logging

+

Splink has a range of logging modes that output information about what Splink is doing at different levels of verbosity.

+

Unlike debug mode, logging doesn't affect the performance of Splink.

+

Logging levels

+

You can set the logging level with code like logging.getLogger("splink").setLevel(desired_level) although see notes below about gotchas.

+

The logging levels in Splink are:

+
    +
  • logging.INFO (20): This outputs user-facing messages about the training status of Splink models
  • 15: Outputs additional information about time taken and parameter estimation
  • logging.DEBUG (10): Outputs information about the names of the SQL statements executed
  • 7: Outputs information about the names of the components of the SQL pipelines
  • 5: Outputs the SQL statements themselves
+
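For example, to capture the level-15 detail without the full SQL output (a sketch using the standard library logging module):

```python
import logging

splink_logger = logging.getLogger("splink")

# Level 15 sits between INFO (20) and DEBUG (10), so it adds timing and
# parameter-estimation detail without the SQL-level verbosity.
splink_logger.setLevel(15)

assert splink_logger.isEnabledFor(15)
assert not splink_logger.isEnabledFor(logging.DEBUG)  # level 10 is filtered out
```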

How to control logging

+

Note that by default Splink sets the logging level to INFO on initialisation

+

With basic logging

+
import logging
+linker = Linker(df, settings, db_api)
+
+# This must come AFTER the linker is initialised, because the logging level
+# will be set to INFO
+logging.getLogger("splink").setLevel(logging.DEBUG)
+
+

Without basic logging

+
# This code can be anywhere since set_up_basic_logging is False
+import logging
+logging.basicConfig(format="%(message)s")
+splink_logger = logging.getLogger("splink")
+splink_logger.setLevel(logging.INFO)
+
+linker = Linker(df, settings, db_api, set_up_basic_logging=False)
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/dev_guides/dependency_compatibility_policy.html b/dev_guides/dependency_compatibility_policy.html new file mode 100644 index 0000000000..a73beaa9f7 --- /dev/null +++ b/dev_guides/dependency_compatibility_policy.html @@ -0,0 +1,5479 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Dependency Compatibility Policy - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

Dependency Compatibility Policy

+ +

This page highlights the importance of package versioning and proposes a "sunsetting" strategy for updating our supported Python and dependency versions as they reach end-of-life.

+

Additionally, it lays out some rough guidelines for us to follow when addressing future package conflicts and issues arising from antiquated dependency versions.

+
+ +

Package Versioning Policy

+

Monitoring package versioning within Splink is important. It ensures that the project can be used by as wide a group of individuals as possible, without wreaking havoc on our issues log.

+

Below is a rough summary of versioning and some complementary guidelines detailing how we should look to deal with dependency management going forward.

+

Benefits to Effective Versioning

+

Effective versioning is crucial for ensuring Splink's compatibility across diverse technical ecosystems and seamless integration with various Python versions and cloud tools. Key advantages include:

+
    +
  • Faster dependency resolution with poetry lock.
  • Reduces dependency conflicts across systems.
+

Versioning Guidance

+

Establish Minimum Supported Versions

+
    +
  • Align with Python Versions: Select the minimum required versions for dependencies based on the earliest version of Python we plan to support. This approach is aligned with our policy on Sunsetting End-of-Life Python Versions, ensuring Splink remains compatible across a broad spectrum of environments.
  • Document Reasons: Where appropriate, clearly document why specific versions are chosen as minimums, including any critical features or bug fixes that dictate these choices. We should look to do this in pull requests implementing the change and as comments in pyproject.toml. Doing so allows us to easily track versioning decisions.
+

Prefer Open Version Constraints

+
    +
  • Use Open Upper Bounds: Wherever feasible, avoid setting an upper version limit for a dependency. This reduces compatibility conflicts with external packages and allows the user to decide their versioning strategy at the application level.
  • Monitor Compatibility: Actively monitor the development of our core dependencies to anticipate significant updates (such as new major versions) that might necessitate code changes. Within Splink, this is particularly relevant for both SQLGlot and DuckDB, which (semi)frequently release new, breaking changes.
+
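As an illustrative pyproject.toml fragment (all version numbers below are hypothetical, not Splink's actual constraints), this approach might look like:

```toml
[tool.poetry.dependencies]
python = ">=3.8"        # floor aligned with the oldest supported Python
sqlglot = ">=13.0.0"    # illustrative minimum, reason documented in the PR
duckdb = ">=0.9.0"      # illustrative minimum, deliberately no upper bound
```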

Compatibility Checks

+
    +
  • Automated Testing: Use Continuous Integration (CI) to help test the latest Python and package versions. This helps identify compatibility issues early.
  • Matrix Testing: Test against a matrix of dependencies or Python versions to ensure broad compatibility. pytest_run_tests_with_cache.yml is currently our broad compatibility check for supported versions of Python.
+

Handling Breaking Changes

+
    +
  • Temporary Version Pinning for Major Changes: In cases where a dependency introduces breaking changes that we cannot immediately accommodate, we should look to temporarily pin to a specific version or version range until we have an opportunity to update Splink.
  • Adaptive Code Changes: When feasible, adapt code to be compatible with new major versions of dependencies. This may include conditional logic to handle differences across versions. An example of this can be found within input_column.py, where we adjust how column identifiers are extracted from SQLGlot based on its version.
+
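A sketch of that kind of conditional logic (the version threshold and dictionary layouts here are hypothetical, not SQLGlot's real API):

```python
def parse_version(version_string):
    # "13.2.1" -> (13, 2): enough granularity to branch on major/minor
    return tuple(int(part) for part in version_string.split(".")[:2])

def extract_identifier(node, dependency_version):
    if parse_version(dependency_version) >= (12, 0):
        return node["name"]            # hypothetical newer layout
    return node["args"]["name"]        # hypothetical older layout

# The same call works against either layout
new_style = extract_identifier({"name": "first_name"}, "13.2.1")
old_style = extract_identifier({"args": {"name": "first_name"}}, "11.5.0")
```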

Documentation and Communication

+
    +
  • Clear Documentation: Clearly log installation instructions within the Getting Started section of our documentation. This should cover not only standard installation procedures but also specialised instructions, for instance, installing a -less version of Splink, for locked down environments.
  • Log Dependency Changes in the CHANGELOG: Where dependencies are adjusted, ensure that changes are logged within CHANGELOG.md. This can help simplify debugging and creates a guide that can be easily referenced.
+

User Support and Feedback

+
    +
  • Issue Tracking: Actively track and address issues related to dependency compatibility. Where users are having issues, have them report their package versions through either pip freeze or pip-chill, so we can more easily identify what may have caused the problem.
  • Feedback Loops: Encourage feedback from users regarding compatibility and dependency issues. Streamline the reporting process in our issues log.
+
+ +

Sunsetting End-of-Life Python Versions

+

In alignment with the Python community's practices, we are phasing out support for Python versions that have hit end-of-life and are no longer maintained by the core Python development team. This decision ensures that Splink remains secure, efficient, and up-to-date with the latest Python features and improvements.

+

Our approach mirrors that of key package maintainers, such as the developers behind NumPy. The NumPy developers have kindly pulled together NEP 29, their guidelines for python version support. This outlines a recommended framework for the deprecation of outdated Python versions.

+

Benefits of Discontinuing Support for Older Python Versions:

+
    +
  • Enhanced Tooling: Embracing newer versions enables the use of advanced Python features. For Python 3.8, these include protocols, walrus operators, and improved type annotations, amongst others.
  • Fewer Dependabot Alerts: Transitioning away from older Python versions reduces the volume of alerts associated with legacy package dependencies.
  • Minimised Package Conflicts: Updating Python decreases the necessity for makeshift solutions to resolve dependency issues with our core dependencies, fostering a smoother integration with tools like Poetry.
+

For a comprehensive rationale behind upgrading, the article "It's time to stop using python 3.7" offers an insightful summary.

+

Implementation Timeline:

+

The cessation of support for major Python versions post-end-of-life will not be immediate but will instead be phased in gradually over the months following their official end-of-life designation.

+

Proposed Workflow for Sunsetting Major Python Versions:

+
    +
  1. Initial Grace Period: We propose a waiting period of approximately six months post-end-of-life before initiating the upgrade process. This interval:
      • Mitigates potential complications arising from system-wide Python updates across major cloud distributors and network administrators.
      • Provides a window to inform users about the impending deprecation of older versions.
  2. Following the Grace Period:
      • Ensure the upgrade process is seamless and devoid of critical issues, leveraging the backward compatibility strengths of newer Python versions.
      • Address any bugs discovered during the upgrade process.
      • Update pyproject.toml accordingly. Pull requests updating our supported versions should be clearly marked with the [DEPENDENCIES] tag and python_version_update label for straightforward tracking.
+

Python's Development Cycle:

+

A comprehensive summary of Python's development cycle is available on the Python Developer's Guide. This includes a chart outlining the full release cycle up to 2029:

+

+

As it stands, support for Python 3.8 will officially end in October of 2024. Following an initial grace period of around six months, we will then look to phase out support.

+

We will look to regularly review this page and update Splink's dependencies accordingly.

+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/dev_guides/index.html b/dev_guides/index.html new file mode 100644 index 0000000000..b4450b6f61 --- /dev/null +++ b/dev_guides/index.html @@ -0,0 +1,5256 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Contributing to Splink - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

Contributing to Splink

+

Thank you for your interest in contributing to Splink! If this is your first time working with Splink, check our Contributors Guide.

+

When making changes to Splink, there are a number of common operations that developers need to perform. The guides below lay out some of these common operations and provide scripts to automate these processes. These include:

+ + +

Splink is quite a large, complex codebase. The guides in this section lay out some of the key structures and key areas within the Splink codebase. These include:

+
    +
  • Understanding and Debugging Splink - demonstrates several ways of understanding how Splink code is running under the hood. This includes Splink's debug mode and logging.
  • +
  • Transpilation using SQLGlot - demonstrates how Splink translates SQL in order to be compatible with multiple SQL engines using the SQLGlot package.
  • +
  • Performance and caching - demonstrates how pipelining and caching are used to make Splink run more efficiently.
  • +
  • Charts - demonstrates how charts are built in Splink, including how to add new charts and edit existing charts.
  • +
  • User-Defined Functions - demonstrates how User Defined Functions (UDFs) are used to provide functionality within Splink that is not native to a given SQL backend.
  • +
  • Settings Validation - summarises how to use and expand the existing settings schema and validation functions.
  • +
  • Managing Splink's Dependencies - this section provides guidelines for managing our core dependencies and our strategy for phasing out Python versions that have reached their end-of-life.
  • +
+ + + + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/dev_guides/settings_validation/extending_settings_validator.html b/dev_guides/settings_validation/extending_settings_validator.html new file mode 100644 index 0000000000..13dc8111ff --- /dev/null +++ b/dev_guides/settings_validation/extending_settings_validator.html @@ -0,0 +1,5571 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Extending the Settings Validator - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + +

Enhancing the Settings Validator

+

Overview of Current Validation Checks

+

Below is a summary of the key validation checks currently implemented by our settings validator. For detailed information, please refer to the source code:

+
    +
  • Blocking Rules and Comparison Levels Validation: Ensures that the user’s blocking rules and comparison levels are correctly imported from the designated library, and that they contain the necessary details for effective use within Splink.
  • +
  • Column Existence Verification: Verifies the presence of columns specified in the user’s settings across all input dataframes, preventing errors due to missing data fields.
  • +
  • Miscellaneous Checks: Conducts a range of additional checks aimed at providing clear and informative error messages, facilitating smoother user experiences when deviations from typical Splink usage are detected.
  • +
+

Extending Validation Logic

+

If you are introducing new validation checks that deviate from the existing ones, please incorporate them as functions within a new script located in the splink/settings_validation directory. This ensures that all validation logic is centrally managed and easily maintainable.

+
+ +

Error handling and logging

+

Error handling and logging in the settings validator takes the following forms:

+
    +
  • Raising INFO level logs - These are raised when the settings validator detects an issue with the user's settings dictionary. These logs are intended to provide the user with information on how to rectify the issue, but should not halt the program.
  • +
  • Raising single exceptions - Raise a built-in Python or Splink exception in response to finding an error.
  • +
  • Concurrently raising multiple exceptions - In some instances, it makes sense to raise multiple errors simultaneously, so as not to disrupt the program. This is achieved using the ErrorLogger class.
  • +
+

The first two use standard Python logging and exception handling. The third is a custom class, covered in more detail below.

+

You should look to use whichever makes the most sense given your requirements.

+

Raising multiple exceptions concurrently

+

Raising multiple exceptions simultaneously provides users with faster and more manageable feedback, avoiding the tedious back-and-forth that typically occurs when errors are reported and addressed one at a time.

+

To enable the logging of multiple errors in a single check, the ErrorLogger class can be utilised. This is designed to operate similarly to a list, allowing the storing of errors using the append method.

+

Once all errors have been logged, you can raise them with the raise_and_log_all_errors method. This will raise an exception of your choice and report all stored errors to the user.
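A rough sketch of how such a class might work is shown below. This is illustrative only — the real implementation lives in splink.exceptions and differs in its details (for instance, this version collapses everything into a single raised exception):

```python
class ErrorLogger:
    """Minimal sketch of an error logger: collects exceptions as they are
    found, then raises them together at the end."""

    def __init__(self):
        self.errors = []

    def append(self, error: Exception) -> None:
        # Store the error rather than raising it immediately
        self.errors.append(error)

    def raise_and_log_all_errors(self, exception: type = Exception) -> None:
        # Raise a single exception summarising everything collected so far
        if self.errors:
            summary = "\n".join(f"{type(e).__name__}: {e}" for e in self.errors)
            raise exception(f"{len(self.errors)} error(s) found:\n{summary}")
```
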

+
+ErrorLogger in practice +
from splink.exceptions import ErrorLogger

# Create an error logger instance
e = ErrorLogger()

# Log your errors
e.append(SyntaxError("The syntax is wrong"))
e.append(NameError("Invalid name entered"))

# Raise your errors
e.raise_and_log_all_errors()
+
+

+
+
+ +

Expanding miscellaneous checks

+

Miscellaneous checks should be added as standalone functions within an appropriate script inside splink/settings_validation. These functions can then be integrated into the linker's startup process for validation.

+

An example of a miscellaneous check is the validate_dialect function. This assesses whether the settings dialect aligns with the linker's dialect.

+

This is then injected into the _validate_settings method within our linker, as seen here.

+
+ +

Additional comparison and blocking rule checks

+

Comparison and Blocking Rule checks can be found within the valid_types.py script.

+

These checks currently interface with the ErrorLogger class which is used to store and raise multiple errors simultaneously (see above).

+

If you wish to expand the current set of tests, it is advised that you incorporate any new checks into either log_comparison_errors or _validate_settings (mentioned above).

+
+ +

Checking for the existence of user specified columns

+

Column and SQL validation is performed within log_invalid_columns.py.

+

The aim of this script is to check that the columns specified by the user exist within the input dataframe(s). If any invalid columns are found, the script will report this to the user.

+

Should you need to include extra checks to assess the validity of columns supplied by a user, your primary focus should be on the column_lookups.py script.

+

There are two main classes within this script that can be used or extended to perform additional column checks:

+
+InvalidCols +

InvalidCols is a NamedTuple, used to construct the bulk of our log strings. This accepts a list of columns and the type of error, producing a complete log string when requested.

+

For simplicity, there are three partial implementations to cover the most common cases:
- MissingColumnsLogGenerator - missing column identified.
- InvalidTableNamesLogGenerator - table name entered by the user is missing or invalid.
- InvalidColumnSuffixesLogGenerator - _l and _r suffixes are missing or invalid.

+

In practice, this can be used as follows:

# Store our invalid columns
my_invalid_cols = MissingColumnsLogGenerator(["first_col", "second_col"])
# Construct the corresponding log string
my_invalid_cols.construct_log_string()
+
+
+
+InvalidColumnsLogger +

InvalidColumnsLogger takes in a series of cleansed columns from your settings object (see SettingsColumnCleaner) and runs a series of validation checks to assess whether the column(s) are present within the underlying dataframes.

+

Any invalid columns are stored in an InvalidCols instance (see above), which is then used to construct a log string.

+

Logs are output to the user at the INFO level.

+
+

To extend the column checks, you simply need to add an additional validation method to the InvalidColumnsLogger class and call it within construct_output_logs.

+

Single column, multi-column and SQL checks

+

Single and multi-column

+

Single and multi-column checks are relatively straightforward. Assuming you have a clean set of columns, you can leverage the check_for_missing_settings_column function.

+

This expects the following arguments:
* settings_id: the name of the settings ID. This is only used for logging and does not necessarily need to match the true ID.
* settings_column_to_check: the column(s) you wish to validate.
* valid_input_dataframe_columns: the cleaned columns from all of your input dataframes.
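The core of this check is a simple set difference. A hypothetical sketch of the kind of logic involved is below — this is not Splink's actual implementation (the real function lives in splink/settings_validation and builds log-ready objects rather than plain strings):

```python
def check_for_missing_settings_column(
    settings_id, settings_column_to_check, valid_input_dataframe_columns
):
    # Report any settings column that is absent from the input dataframes
    missing = sorted(
        set(settings_column_to_check) - set(valid_input_dataframe_columns)
    )
    if missing:
        return f"'{settings_id}' contains invalid columns: {missing}"
    return None
```

For example, checking a unique_id_column_name of "uid" against input columns ["unique_id", "first_name"] would report "uid" as invalid.
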

+

Checking columns in SQL statements

+

Checking SQL statements is a little more complex, given the need to parse SQL in order to extract your column names.

+

To do this, you can leverage the check_for_missing_or_invalid_columns_in_sql_strings function.

+

This expects the following arguments:
* sql_dialect: The SQL dialect used by the linker.
* sql_strings: A list of SQL strings.
* valid_input_dataframe_columns: The list of columns identified in your input dataframe(s).
* additional_validation_checks: Functions used to check for other issues with the parsed SQL string, namely table name and column suffix validation.

+

NB: for nested SQL statements, you'll need to add an additional loop. See check_comparison_for_missing_or_invalid_sql_strings for more details.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/dev_guides/settings_validation/settings_validation_overview.html b/dev_guides/settings_validation/settings_validation_overview.html new file mode 100644 index 0000000000..b2854250b7 --- /dev/null +++ b/dev_guides/settings_validation/settings_validation_overview.html @@ -0,0 +1,5351 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Settings Validation Overview - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Settings Validation Overview

+ +

Settings Validation

+

A common problem within Splink comes from users providing invalid settings dictionaries. To prevent this, we've built a settings validator to scan through a given settings dictionary and provide user-friendly feedback on what needs to be fixed.

+

At a high level, this includes:

+
    +
  1. Assessing the structure of the settings dictionary. See the Settings Schema Validation section.
  2. +
  3. Assessing the contents of the settings dictionary. See the Settings Validator section.
  4. +
+
+ +

Settings Schema Validation

+

Our custom settings schema can be found within settings_jsonschema.json.

+

This is a json file, outlining the required data type, key and value(s) to be specified by the user while constructing their settings. Where values deviate from this specified schema, an error will be thrown.

+

Schema validation is currently performed inside the settings.py script.
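Validating a dictionary against a json schema can be sketched with the jsonschema package. The schema fragment below is hypothetical and purely for illustration — the real schema is settings_jsonschema.json:

```python
from jsonschema import ValidationError, validate

# Hypothetical fragment of a settings schema, for illustration only
schema = {
    "type": "object",
    "properties": {"unique_id_column_name": {"type": "string"}},
}

# A conforming value passes silently
validate(instance={"unique_id_column_name": "unique_id"}, schema=schema)

try:
    # Wrong data type - deviates from the schema, so an error is thrown
    validate(instance={"unique_id_column_name": 42}, schema=schema)
except ValidationError as err:
    print(f"Invalid settings: {err.message}")
```
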

+

You can modify the schema by manually editing the json schema.

+

Modifications can be used to (amongst other uses):

+
    +
  • Set or remove default values for schema keys.
  • +
  • Set the required data type for a given key.
  • +
  • Expand or refine previous titles and descriptions to help with clarity.
  • +
+

Any updates you wish to make to the schema should be discussed with the wider team, to ensure it won't break backwards compatibility and makes sense as a design decision.

+

Detailed information on the arguments that can be supplied to the json schema can be found within the json schema documentation.

+
+ +

Settings Validator

+

As long as an input is of the correct data type, it will pass our initial schema checks. This means user inputs that would generate invalid SQL can slip through, only to be caught by the database engine, commonly resulting in uninformative and confusing errors that the user is unsure how to resolve.

+

The settings validation code (found within the settings validation directory of Splink) is another layer of validation, executing a series of checks to determine whether values in the user's settings dictionary will generate invalid SQL.

+

Frequently encountered problems include:

+
    +
  • Invalid column names. For example, specifying a unique_id_column_name that doesn't exist in the underlying dataframe(s). Such names satisfy the schema requirements as long as they are strings.
  • +
  • Using the settings dictionary's default values
  • +
  • Importing comparisons and blocking rules for the wrong dialect.
  • +
  • Using an inappropriate custom data type - e.g. supplying a comparison level where a comparison is expected.
  • +
  • Using Splink for an invalid form of linkage - See the following discussion.
  • +
+

All code relating to settings validation can be found within one of the following scripts:

+
    +
  • valid_types.py - This script includes various miscellaneous checks for comparison levels, blocking rules, and linker objects. These checks are primarily performed within settings.py.
  • +
  • settings_column_cleaner.py - Includes a set of functions for cleaning and extracting data, designed to sanitise user inputs in the settings dictionary and retrieve necessary SQL or column identifiers.
  • +
  • log_invalid_columns.py - Pulls the information extracted in settings_column_cleaner.py and generates any log strings outlining invalid columns or SQL identified within the settings dictionary. Any generated error logs are reported to the user when initialising a linker object at the INFO level.
  • +
  • settings_validation_log_strings.py - a home for any error messages or logs generated by the settings validator.
  • +
+

For information on expanding the range of checks available to the validator, see Extending the Settings Validator.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/dev_guides/spark_pipelining_and_caching.html b/dev_guides/spark_pipelining_and_caching.html new file mode 100644 index 0000000000..b5029e332f --- /dev/null +++ b/dev_guides/spark_pipelining_and_caching.html @@ -0,0 +1,5323 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Spark caching - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Spark caching

+ +

Caching and pipelining in Spark

+

This article assumes you've read the general guide to caching and pipelining.

+

In Spark, some additions have to be made to this general pattern because all transformations in Spark are lazy.

+

That is, when we call df = spark.sql(sql), the df is not immediately computed.

+

Furthermore, even when an action is called, the results aren't automatically persisted by Spark to disk. This differs from other backends, which execute SQL as a create table statement, meaning that the result is automatically saved.

+

This interferes with caching, because Splink assumes that when the function _execute_sql_against_backend() is called, it will be evaluated greedily (immediately evaluated) AND the results will be saved to the 'database'.

+

Another quirk of Spark is that it chunks work up into tasks. This is relevant for two reasons:

+
    +
  • Tasks can suffer from skew, meaning some take longer than others, which can be bad from a performance point of view.
  • +
  • The number of tasks and how data is partitioned controls how many files are output when results are saved. Some Splink operations result in a very large number of small files, which can take a long time to read and write relative to the same data stored in fewer files.
  • +
+

Repartitioning can be used to rebalance workloads (reduce skew) and to avoid the 'many small files' problem.

+

Spark-specific modifications

+

The logic for Spark is captured in the implementation of _execute_sql_against_backend() in the spark_linker.py.

+

This has three roles:

+
    +
  • It determines how to save results - using either persist, checkpoint or saving to .parquet, with .parquet being the default.
  • +
  • It determines which results to save. Some small results, such as __splink__m_u_counts, are immediately converted using toPandas() rather than being saved. This is because saving to disk and reloading is expensive and unnecessary.
  • +
  • It chooses which Spark dataframes to repartition to reduce the number of files which are written/read
  • +
+

Note that repartitioning and saving are independent. Some dataframes are saved without repartitioning. Some dataframes are repartitioned without being saved.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/dev_guides/transpilation.html b/dev_guides/transpilation.html new file mode 100644 index 0000000000..d7111ee6a5 --- /dev/null +++ b/dev_guides/transpilation.html @@ -0,0 +1,5331 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Transpilation using sqlglot - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + +

SQL Transpilation in Splink, and how we support multiple SQL backends

+

In Splink, all the core data linking algorithms are implemented in SQL. This allows computation to be offloaded to a SQL backend of the user's choice.

+

One difficulty with this paradigm is that SQL implementations differ - the functions available in (say) the Spark dialect of SQL differ from those available in DuckDB SQL. And to make matters worse, functions with the same name may behave differently (e.g. different arguments, arguments in different orders, etc.).

+

Splink therefore needs a mechanism of writing SQL statements that are able to run against all the target SQL backends (engines).

+

Details are as follows:

+ +

1. Core data linking algorithms are implemented in 'backend agnostic' SQL

That is, they're written using basic SQL functions that are common to all the target backends, and so don't need any translation.

+

It has been possible to write all of the core Splink logic in SQL that is consistent between dialects.

+

However, this is not the case with Comparisons, which tend to use backend-specific SQL functions like jaro_winkler, whose function names and signatures differ between backends.

+

2. User-provided SQL is interpolated into these dialect-agnostic SQL statements

+

The user provides custom SQL in two places in Splink:

+
    +
  1. Blocking rules
  2. +
  3. The sql_condition (see here) provided as part of a Comparison
  4. +
+

The user is free to write this SQL however they want.

+

It's up to the user to ensure the SQL they provide will execute successfully in their chosen backend. So the sql_condition must use functions that exist in the target execution engine.

+

3. Backends can implement transpilation and/or dialect steps to further transform the SQL if needed

+

Occasionally some modifications are needed to the SQL to ensure it executes against the target backend.

+

sqlglot is used for this purpose. For instance, a custom dialect is implemented in the Spark linker.

+

A transformer is implemented in the Athena linker.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/dev_guides/udfs.html b/dev_guides/udfs.html new file mode 100644 index 0000000000..0ca37bdb49 --- /dev/null +++ b/dev_guides/udfs.html @@ -0,0 +1,5358 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + User-Defined Functions - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

User Defined Functions

+

User Defined Functions (UDFs) are functions that can be created to add functionality to a given SQL backend where it does not already exist. These are particularly useful within Splink, as it supports multiple SQL engines, each with different inherent functionality. UDFs are an important tool for creating consistent functionality across backends.

+

For example, DuckDB has an in-built string comparison function for Jaccard similarity whereas Spark SQL doesn't have an equivalent function. Therefore, a UDF is required to use functions like JaccardAtThresholds() and JaccardLevel() with a Spark backend.

+

Spark

+

Spark supports UDFs written in Scala and Java.

+

Splink currently uses UDFs written in Scala, which are implemented as follows:

+ +

Now that the Spark UDFs have been successfully registered, they can be used in Spark SQL. For example,

+
jaccard("name_column_1", "name_column_2") >= 0.9
+
+

which provides the basis for functions such as JaccardAtThresholds() and JaccardLevel().

+

DuckDB

+

Python UDFs can be registered to a DuckDB connection from version 0.8.0 onwards.

+

The documentation is here, and examples are here. Note that these functions should be registered against the DuckDB connection provided to the linker using connection.create_function.

+

Note that performance will generally be substantially slower than using native DuckDB functions. Consider using vectorised UDFs where possible - see here.

+

Athena

+

Athena supports UDFs written in Java, however these have not yet been implemented in Splink.

+

SQLite

+

Python UDFs can be registered to a SQLite connection using the create_function function. An example is as follows:

+
import sqlite3

from rapidfuzz.distance.Levenshtein import distance

conn = sqlite3.connect(":memory:")
conn.create_function("levenshtein", 2, distance)
+
+

The function levenshtein is now available to use within SQL executed on this connection.
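The sketch below shows the registered function being called from SQL. To keep it self-contained, it swaps in a tiny pure-Python edit distance standing in for rapidfuzz's implementation:

```python
import sqlite3


def levenshtein(a: str, b: str) -> int:
    # Simple dynamic-programming edit distance, standing in for
    # rapidfuzz.distance.Levenshtein.distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


conn = sqlite3.connect(":memory:")
conn.create_function("levenshtein", 2, levenshtein)

# The UDF is now callable from SQL run on this connection
dist = conn.execute("SELECT levenshtein('kitten', 'sitting')").fetchone()[0]
```
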

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/feed_json_created.json b/feed_json_created.json new file mode 100644 index 0000000000..b71d68c06f --- /dev/null +++ b/feed_json_created.json @@ -0,0 +1 @@ +{"version": "https://jsonfeed.org/version/1", "title": "Splink", "home_page_url": "https://moj-analytical-services.github.io/splink/", "feed_url": "https://moj-analytical-services.github.io/splink/feed_json_created.json", "description": null, "icon": null, "authors": [], "language": "en", "items": [{"id": "https://moj-analytical-services.github.io/splink/blog/2024/08/19/bias-in-data-linking.html", "url": "https://moj-analytical-services.github.io/splink/blog/2024/08/19/bias-in-data-linking.html", "title": "Bias in Data Linking", "content_html": "

Bias in Data Linking

\n

In March 2024, the Splink team launched a 6-month 'Bias in Data Linking' internship with the Alan Turing Institute. This installment of the Splink Blog is going to introduce the internship, its goals, and provide an update on what's happened so far.

", "image": null, "date_published": "2024-08-19T00:00:00+00:00", "authors": [{"name": "erica-k"}], "tags": null}, {"id": "https://moj-analytical-services.github.io/splink/blog/2024/07/24/splink-400-released.html", "url": "https://moj-analytical-services.github.io/splink/blog/2024/07/24/splink-400-released.html", "title": "Splink 4.0.0 released", "content_html": "

Splink 4.0.0 released

\n

We're pleased to release Splink 4, which is more scalable and easier to use than Splink 3.

\n

For the uninitiated, Splink is a free and open source library for record linkage and deduplication at scale, capable of deduplicating 100 million records+, that is widely used and has been downloaded over 8 million times.

\n

Version 4 is recommended to all new users. For existing users, there has been no change to the statistical methodology. Version 3 and 4 will give the same results, so there's no urgency to upgrade existing pipelines.

", "image": null, "date_published": "2024-07-24T00:00:00+00:00", "authors": [{"name": "robin-l"}], "tags": null}, {"id": "https://moj-analytical-services.github.io/splink/blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html", "url": "https://moj-analytical-services.github.io/splink/blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html", "title": "Splink 3 updates, and Splink 4 development announcement - April 2024", "content_html": "

Splink 3 updates, and Splink 4 development announcement - April 2024

\n

This post describes significant updates to Splink since our previous post and details of development work taking place on the forthcoming release of Splink 4.

", "image": null, "date_published": "2024-04-02T00:00:00+00:00", "authors": [{"name": "robin-l"}], "tags": null}, {"id": "https://moj-analytical-services.github.io/splink/blog/2024/01/23/ethics-in-data-linking.html", "url": "https://moj-analytical-services.github.io/splink/blog/2024/01/23/ethics-in-data-linking.html", "title": "Ethics in Data Linking", "content_html": "

Ethics in Data Linking

\n

Welcome to the next installment of the Splink Blog where we\u2019re talking about Data Ethics!

\n

:question: Why should we care about ethics?

\n

Splink was developed in-house at the UK Government\u2019s Ministry of Justice. As data scientists in government, we are accountable to the public and have a duty to maintain public trust. This includes upholding high standards of data ethics in our work.

", "image": null, "date_published": "2024-01-23T00:00:00+00:00", "authors": [{"name": "zoe-s"}, {"name": "alice-o"}], "tags": null}, {"id": "https://moj-analytical-services.github.io/splink/blog/2023/12/06/splink-updates---december-2023.html", "url": "https://moj-analytical-services.github.io/splink/blog/2023/12/06/splink-updates---december-2023.html", "title": "Splink Updates - December 2023", "content_html": "

Splink Updates - December 2023

\n

Welcome to the second installment of the Splink Blog!

\n

Here are some of the highlights from the second half of 2023, and a taste of what is in store for 2024!

", "image": null, "date_published": "2023-12-06T00:00:00+00:00", "authors": [{"name": "ross-k"}], "tags": null}, {"id": "https://moj-analytical-services.github.io/splink/blog/2023/07/27/splink-updates---july-2023.html", "url": "https://moj-analytical-services.github.io/splink/blog/2023/07/27/splink-updates---july-2023.html", "title": "Splink Updates - July 2023", "content_html": "

Splink Updates - July 2023

\n

:new: Welcome to the Splink Blog! :new:

\n

It's hard to keep up to date with all of the new features being added to Splink, so we have launched this blog to share a round up of latest developments every few months.

\n

So, without further ado, here are some of the highlights from the first half of 2023!

", "image": null, "date_published": "2023-07-27T00:00:00+00:00", "authors": [{"name": "ross-k"}, {"name": "robin-l"}], "tags": null}]} \ No newline at end of file diff --git a/feed_json_updated.json b/feed_json_updated.json new file mode 100644 index 0000000000..f0c5c7192d --- /dev/null +++ b/feed_json_updated.json @@ -0,0 +1 @@ +{"version": "https://jsonfeed.org/version/1", "title": "Splink", "home_page_url": "https://moj-analytical-services.github.io/splink/", "feed_url": "https://moj-analytical-services.github.io/splink/feed_json_updated.json", "description": null, "icon": null, "authors": [], "language": "en", "items": [{"id": "https://moj-analytical-services.github.io/splink/blog/2023/07/27/splink-updates---july-2023.html", "url": "https://moj-analytical-services.github.io/splink/blog/2023/07/27/splink-updates---july-2023.html", "title": "Splink Updates - July 2023", "content_html": "

Splink Updates - July 2023

\n

:new: Welcome to the Splink Blog! :new:

\n

It's hard to keep up to date with all of the new features being added to Splink, so we have launched this blog to share a round up of latest developments every few months.

\n

So, without further ado, here are some of the highlights from the first half of 2023!

", "image": null, "date_modified": "2024-09-15T08:09:22+00:00", "authors": [{"name": "ross-k"}, {"name": "robin-l"}], "tags": null}, {"id": "https://moj-analytical-services.github.io/splink/blog/2023/12/06/splink-updates---december-2023.html", "url": "https://moj-analytical-services.github.io/splink/blog/2023/12/06/splink-updates---december-2023.html", "title": "Splink Updates - December 2023", "content_html": "

Splink Updates - December 2023

\n

Welcome to the second installment of the Splink Blog!

\n

Here are some of the highlights from the second half of 2023, and a taste of what is in store for 2024!

", "image": null, "date_modified": "2024-09-15T08:09:22+00:00", "authors": [{"name": "ross-k"}], "tags": null}, {"id": "https://moj-analytical-services.github.io/splink/blog/2024/01/23/ethics-in-data-linking.html", "url": "https://moj-analytical-services.github.io/splink/blog/2024/01/23/ethics-in-data-linking.html", "title": "Ethics in Data Linking", "content_html": "

Ethics in Data Linking

\n

Welcome to the next installment of the Splink Blog where we\u2019re talking about Data Ethics!

\n

:question: Why should we care about ethics?

\n

Splink was developed in-house at the UK Government\u2019s Ministry of Justice. As data scientists in government, we are accountable to the public and have a duty to maintain public trust. This includes upholding high standards of data ethics in our work.

", "image": null, "date_modified": "2024-09-15T08:09:22+00:00", "authors": [{"name": "zoe-s"}, {"name": "alice-o"}], "tags": null}, {"id": "https://moj-analytical-services.github.io/splink/blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html", "url": "https://moj-analytical-services.github.io/splink/blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html", "title": "Splink 3 updates, and Splink 4 development announcement - April 2024", "content_html": "

Splink 3 updates, and Splink 4 development announcement - April 2024

\n

This post describes significant updates to Splink since our previous post and details of development work taking place on the forthcoming release of Splink 4.

", "image": null, "date_modified": "2024-09-15T08:09:22+00:00", "authors": [{"name": "robin-l"}], "tags": null}, {"id": "https://moj-analytical-services.github.io/splink/blog/2024/07/24/splink-400-released.html", "url": "https://moj-analytical-services.github.io/splink/blog/2024/07/24/splink-400-released.html", "title": "Splink 4.0.0 released", "content_html": "

Splink 4.0.0 released

\n

We're pleased to release Splink 4, which is more scalable and easier to use than Splink 3.

\n

For the uninitiated, Splink is a free and open source library for record linkage and deduplication at scale, capable of deduplicating 100 million records+, that is widely used and has been downloaded over 8 million times.

\n

Version 4 is recommended to all new users. For existing users, there has been no change to the statistical methodology. Version 3 and 4 will give the same results, so there's no urgency to upgrade existing pipelines.

", "image": null, "date_modified": "2024-09-15T08:09:22+00:00", "authors": [{"name": "robin-l"}], "tags": null}, {"id": "https://moj-analytical-services.github.io/splink/blog/2024/08/19/bias-in-data-linking.html", "url": "https://moj-analytical-services.github.io/splink/blog/2024/08/19/bias-in-data-linking.html", "title": "Bias in Data Linking", "content_html": "

Bias in Data Linking

\n

In March 2024, the Splink team launched a 6-month 'Bias in Data Linking' internship with the Alan Turing Institute. This installment of the Splink Blog is going to introduce the internship, its goals, and provide an update on what's happened so far.

", "image": null, "date_modified": "2024-09-15T08:09:22+00:00", "authors": [{"name": "erica-k"}], "tags": null}]} \ No newline at end of file diff --git a/feed_rss_created.xml b/feed_rss_created.xml new file mode 100644 index 0000000000..253fae6cc9 --- /dev/null +++ b/feed_rss_created.xml @@ -0,0 +1 @@ + Splinkhttps://moj-analytical-services.github.io/splink/https://github.com/moj-analytical-services/splinken Sun, 15 Sep 2024 08:09:54 -0000 Sun, 15 Sep 2024 08:09:54 -0000 1440 MkDocs RSS plugin - v1.15.0 Bias in Data Linking erica-k <h1>Bias in Data Linking</h1><p>In March 2024, the Splink team launched a 6-month <em>'Bias in Data Linking'</em> internship with the <a href="https://www.turing.ac.uk">Alan Turing Institute</a>. This installment of the Splink Blog is going to introduce the internship, its goals, and provide an update on what's happened so far.</p>https://moj-analytical-services.github.io/splink/blog/2024/08/19/bias-in-data-linking.html Mon, 19 Aug 2024 00:00:00 +0000Splinkhttps://moj-analytical-services.github.io/splink/blog/2024/08/19/bias-in-data-linking.html Splink 4.0.0 released robin-l <h1>Splink 4.0.0 released</h1><p>We're pleased to release Splink 4, which is more scalable and easier to use than Splink 3.</p><p>For the uninitiated, <a href="../../index.md">Splink</a> is a free and open source library for record linkage and deduplication at scale, capable of deduplicating 100 million records+, that is <a href="../../index.md#use-cases">widely used</a> and has been downloaded over 8 million times.</p><p>Version 4 is recommended to all new users. For existing users, there has been no change to the statistical methodology. 
Version 3 and 4 will give the same results, so there's no urgency to upgrade existing pipelines.</p>https://moj-analytical-services.github.io/splink/blog/2024/07/24/splink-400-released.html Wed, 24 Jul 2024 00:00:00 +0000Splinkhttps://moj-analytical-services.github.io/splink/blog/2024/07/24/splink-400-released.html Splink 3 updates, and Splink 4 development announcement - April 2024 robin-l <h1>Splink 3 updates, and Splink 4 development announcement - April 2024</h1><p>This post describes significant updates to Splink since our previous <a href="https://moj-analytical-services.github.io/splink/blog/2023/12/06/splink-updates---december-2023.html">post</a> and details of development work taking place on the forthcoming release of Splink 4.</p>https://moj-analytical-services.github.io/splink/blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html Tue, 02 Apr 2024 00:00:00 +0000Splinkhttps://moj-analytical-services.github.io/splink/blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html Ethics in Data Linking zoe-s alice-o <h1>Ethics in Data Linking</h1><p>Welcome to the next installment of the Splink Blog where we’re talking about Data Ethics!</p><h2>:question: Why should we care about ethics?</h2><p>Splink was developed in-house at the UK Government’s Ministry of Justice. As data scientists in government, we are accountable to the public and have a duty to maintain public trust. 
This includes upholding high standards of data ethics in our work.</p>https://moj-analytical-services.github.io/splink/blog/2024/01/23/ethics-in-data-linking.html Tue, 23 Jan 2024 00:00:00 +0000Splinkhttps://moj-analytical-services.github.io/splink/blog/2024/01/23/ethics-in-data-linking.html Splink Updates - December 2023 ross-k <h1>Splink Updates - December 2023</h1><p>Welcome to the second installment of the Splink Blog!</p><p>Here are some of the highlights from the second half of 2023, and a taste of what is in store for 2024!</p>https://moj-analytical-services.github.io/splink/blog/2023/12/06/splink-updates---december-2023.html Wed, 06 Dec 2023 00:00:00 +0000Splinkhttps://moj-analytical-services.github.io/splink/blog/2023/12/06/splink-updates---december-2023.html Splink Updates - July 2023 ross-k robin-l <h1>Splink Updates - July 2023</h1><h2>:new: Welcome to the Splink Blog! :new:</h2><p>Its hard to keep up to date with all of the new features being added to Splink, so we have launched this blog to share a round up of latest developments every few months.</p><p>So, without further ado, here are some of the highlights from the first half of 2023!</p>https://moj-analytical-services.github.io/splink/blog/2023/07/27/splink-updates---july-2023.html Thu, 27 Jul 2023 00:00:00 +0000Splinkhttps://moj-analytical-services.github.io/splink/blog/2023/07/27/splink-updates---july-2023.html \ No newline at end of file diff --git a/feed_rss_updated.xml b/feed_rss_updated.xml new file mode 100644 index 0000000000..5048b12745 --- /dev/null +++ b/feed_rss_updated.xml @@ -0,0 +1 @@ + Splinkhttps://moj-analytical-services.github.io/splink/https://github.com/moj-analytical-services/splinken Sun, 15 Sep 2024 08:09:54 -0000 Sun, 15 Sep 2024 08:09:54 -0000 1440 MkDocs RSS plugin - v1.15.0 Splink Updates - July 2023 ross-k robin-l <h1>Splink Updates - July 2023</h1><h2>:new: Welcome to the Splink Blog! 
:new:</h2><p>Its hard to keep up to date with all of the new features being added to Splink, so we have launched this blog to share a round up of latest developments every few months.</p><p>So, without further ado, here are some of the highlights from the first half of 2023!</p>https://moj-analytical-services.github.io/splink/blog/2023/07/27/splink-updates---july-2023.html Sun, 15 Sep 2024 08:09:22 +0000Splinkhttps://moj-analytical-services.github.io/splink/blog/2023/07/27/splink-updates---july-2023.html Splink Updates - December 2023 ross-k <h1>Splink Updates - December 2023</h1><p>Welcome to the second installment of the Splink Blog!</p><p>Here are some of the highlights from the second half of 2023, and a taste of what is in store for 2024!</p>https://moj-analytical-services.github.io/splink/blog/2023/12/06/splink-updates---december-2023.html Sun, 15 Sep 2024 08:09:22 +0000Splinkhttps://moj-analytical-services.github.io/splink/blog/2023/12/06/splink-updates---december-2023.html Ethics in Data Linking zoe-s alice-o <h1>Ethics in Data Linking</h1><p>Welcome to the next installment of the Splink Blog where we’re talking about Data Ethics!</p><h2>:question: Why should we care about ethics?</h2><p>Splink was developed in-house at the UK Government’s Ministry of Justice. As data scientists in government, we are accountable to the public and have a duty to maintain public trust. 
This includes upholding high standards of data ethics in our work.</p>https://moj-analytical-services.github.io/splink/blog/2024/01/23/ethics-in-data-linking.html Sun, 15 Sep 2024 08:09:22 +0000Splinkhttps://moj-analytical-services.github.io/splink/blog/2024/01/23/ethics-in-data-linking.html Splink 3 updates, and Splink 4 development announcement - April 2024 robin-l <h1>Splink 3 updates, and Splink 4 development announcement - April 2024</h1><p>This post describes significant updates to Splink since our previous <a href="https://moj-analytical-services.github.io/splink/blog/2023/12/06/splink-updates---december-2023.html">post</a> and details of development work taking place on the forthcoming release of Splink 4.</p>https://moj-analytical-services.github.io/splink/blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html Sun, 15 Sep 2024 08:09:22 +0000Splinkhttps://moj-analytical-services.github.io/splink/blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html Splink 4.0.0 released robin-l <h1>Splink 4.0.0 released</h1><p>We're pleased to release Splink 4, which is more scalable and easier to use than Splink 3.</p><p>For the uninitiated, <a href="../../index.md">Splink</a> is a free and open source library for record linkage and deduplication at scale, capable of deduplicating 100 million records+, that is <a href="../../index.md#use-cases">widely used</a> and has been downloaded over 8 million times.</p><p>Version 4 is recommended to all new users. For existing users, there has been no change to the statistical methodology. 
Version 3 and 4 will give the same results, so there's no urgency to upgrade existing pipelines.</p>https://moj-analytical-services.github.io/splink/blog/2024/07/24/splink-400-released.html Sun, 15 Sep 2024 08:09:22 +0000Splinkhttps://moj-analytical-services.github.io/splink/blog/2024/07/24/splink-400-released.html Bias in Data Linking erica-k <h1>Bias in Data Linking</h1><p>In March 2024, the Splink team launched a 6-month <em>'Bias in Data Linking'</em> internship with the <a href="https://www.turing.ac.uk">Alan Turing Institute</a>. This installment of the Splink Blog is going to introduce the internship, its goals, and provide an update on what's happened so far.</p>https://moj-analytical-services.github.io/splink/blog/2024/08/19/bias-in-data-linking.html Sun, 15 Sep 2024 08:09:22 +0000Splinkhttps://moj-analytical-services.github.io/splink/blog/2024/08/19/bias-in-data-linking.html \ No newline at end of file diff --git a/getting_started.html b/getting_started.html new file mode 100644 index 0000000000..255336c327 --- /dev/null +++ b/getting_started.html @@ -0,0 +1,5461 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + Getting Started - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + + + + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Getting Started

+

Install

+

Splink supports Python 3.8+.

+

To obtain the latest released version of Splink, you can install it from PyPI using pip: +

pip install splink
+
+

or if you prefer, you can instead install Splink using conda: +

conda install -c conda-forge splink
+
+
+Backend Specific Installs +

Backend Specific Installs

+

From Splink v3.9.7, packages required by specific Splink backends can be optionally installed by adding the [<backend>] suffix to the end of your pip install.

+

Note that SQLite and DuckDB come packaged with Splink and do not need to be optionally installed.

+

The following backends are supported:

+
+
+
+
pip install 'splink[spark]'
+
+
+
+
pip install 'splink[athena]'
+
+
+
+
pip install 'splink[postgres]'
+
+
+
+
+
+
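If you need more than one backend, the extras can be combined in a single install, since pip accepts a comma-separated list of extras. For example, to get the Spark and Postgres dependencies together:

```shell
pip install 'splink[spark,postgres]'
```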

🚀 Quickstart

+

To get a basic Splink model up and running, use the following code. It demonstrates how to:

+
    +
  1. Estimate the parameters of a deduplication model
  2. Use the parameter estimates to identify duplicate records
  3. Use clustering to generate an estimated unique person ID.
+
+
+Simple Splink Model Example +
import splink.comparison_library as cl
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
+
+db_api = DuckDBAPI()
+
+df = splink_datasets.fake_1000
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    comparisons=[
+        cl.NameComparison("first_name"),
+        cl.JaroAtThresholds("surname"),
+        cl.DateOfBirthComparison(
+            "dob",
+            input_is_string=True,
+        ),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        cl.EmailComparison("email"),
+    ],
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name", "dob"),
+        block_on("surname"),
+    ]
+)
+
+linker = Linker(df, settings, db_api)
+
+linker.training.estimate_probability_two_random_records_match(
+    [block_on("first_name", "surname")],
+    recall=0.7,
+)
+
+linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+linker.training.estimate_parameters_using_expectation_maximisation(
+    block_on("first_name", "surname")
+)
+
+linker.training.estimate_parameters_using_expectation_maximisation(block_on("email"))
+
+pairwise_predictions = linker.inference.predict(threshold_match_weight=-5)
+
+clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
+    pairwise_predictions, 0.95
+)
+
+df_clusters = clusters.as_pandas_dataframe(limit=5)
+
+
+
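Note that the quickstart above uses two different score scales: `predict()` filters on a match weight (`threshold_match_weight=-5`), while clustering filters on a match probability (`0.95`). Splink's match weight is the log base 2 of the overall Bayes factor, so the two scales are interconvertible. A minimal plain-Python sketch of that relationship (the helper names are illustrative, not part of the Splink API):

```python
import math

def match_weight_to_probability(weight: float) -> float:
    """Convert a match weight (log2 of the Bayes factor) to a match probability."""
    bayes_factor = 2.0 ** weight
    return bayes_factor / (1.0 + bayes_factor)

def probability_to_match_weight(probability: float) -> float:
    """Inverse conversion: match probability back to a match weight."""
    return math.log2(probability / (1.0 - probability))

# threshold_match_weight=-5 keeps pairs scoring above roughly 3% match probability
print(round(match_weight_to_probability(-5), 4))    # -> 0.0303
# the clustering threshold of 0.95 corresponds to a match weight of about 4.25
print(round(probability_to_match_weight(0.95), 2))  # -> 4.25
```

The weight threshold at the prediction stage is deliberately permissive, so borderline pairs are retained for inspection, while the stricter probability threshold controls which pairs are actually merged into clusters.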

Tutorials

+

You can learn more about Splink in the step-by-step tutorials. Each has a corresponding Google Colab link so you can run the notebook in your browser.

+

Example Notebooks

+

You can see end-to-end examples of several use cases in the example notebooks. Each has a corresponding Google Colab link so you can run the notebook in your browser.

+

Getting help

+

If, after reading the documentation, you still have questions, please feel free to post on our discussion forum.

+ + + + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/hooks/__init__.py b/hooks/__init__.py new file mode 100644 index 0000000000..61d93a53ef --- /dev/null +++ b/hooks/__init__.py @@ -0,0 +1,120 @@ +from __future__ import annotations + +import re +from pathlib import Path + +from mkdocs.config.defaults import MkDocsConfig +from mkdocs.plugins import event_priority +from mkdocs.structure.files import Files +from mkdocs.structure.pages import Page +from nbconvert import MarkdownExporter +from nbconvert.preprocessors import TagRemovePreprocessor + +INCLUDE_MARKDOWN_REGEX = ( + # opening tag and any whitespace + r"{%\s*" + # include-markdown literal and more whitespace + r"include-markdown\s*" + # the path in double-quotes (unvalidated) + r"\"(.*)\"" + # more whitespace and closing tag + r"\s*%}" +) + + +def include_markdown(markdown: str) -> str | None: + """ + Takes markdown string content and replaces blocks such as: + + {% include-markdown "./includes/some_file.md" %} + + with the _contents_ of "docs/includes/some_file.md", or fail + with an error if the file is not located + + If there is no such block, returns None. 
+ """ + if not re.search(INCLUDE_MARKDOWN_REGEX, markdown): + return + # if we have an include tag, replace text with file contents + for match in re.finditer(INCLUDE_MARKDOWN_REGEX, markdown): + text_to_replace = match.group(0) + include_path = match.group(1) + try: + with open(Path("docs") / include_path) as f_inc: + include_text = f_inc.read() + new_text = re.sub(text_to_replace, include_text, markdown) + # update text, in case we are iterating + markdown = new_text + # if we can't find include file then warn but carry on + except FileNotFoundError as e: + raise FileNotFoundError( + f"Couldn't find specified include file: {e}" + ) from None + return markdown + + +def re_route_links(markdown: str, page_title: str) -> str | None: + """ + If any links are to files 'docs/foo/bar.md' (which work directly in-repo) + reroute instead to 'foo/bar.md' (which work in structure of docs) + + To avoid false positives this is opt-in - i.e. it only works for files + with titles as specified in tuple + + If not one of these files, or no such links, we return None + """ + # the 'proper' way to do this would be to check if the file lives outside + # the docs/ folder, and only adjust if so, rather relying on title + # (which could be changed), and must be opted-into + relevant_file_titles = ("Contributor Guide",) + if page_title not in relevant_file_titles: + return + + docs_folder_regex = "docs/" + if not re.search(docs_folder_regex, markdown): + return + return re.sub(docs_folder_regex, "", markdown) + + +# hooks for use by mkdocs + + +# priority last - run this after any other such hooks +# this ensures we are overwriting mknotebooks config, +# not the other way round +@event_priority(-100) +def on_config(config: MkDocsConfig) -> MkDocsConfig: + # convert ipynb to md rather than html directly + # this ensures we render symbols such as '<' correctly + # in codeblocks, instead of '%lt;' + + t = TagRemovePreprocessor() + mknotebooks_config = config.get("plugins", 
{}).get("mknotebooks", {}) + tag_remove_configs = mknotebooks_config.config.get("tag_remove_configs", {}) + for option, setting in tag_remove_configs.items(): + setattr(t, option, set(setting)) + + md_exporter = MarkdownExporter(config=config) + md_exporter.register_preprocessor(t, enabled=True) + + # md_exporter.config["TagRemovePreprocessor"]["remove_input_tags"] = ("hideme",) + # overwrite mknotebooks config option + config["notebook_exporter"] = md_exporter + return config + + +def on_page_markdown( + markdown: str, page: Page, config: MkDocsConfig, files: Files +) -> str | None: + """ + mkdocs hook to transform the raw markdown before it is sent to the renderer. + + See https://www.mkdocs.org/dev-guide/plugins/#on_page_markdown for details. + """ + if (replaced_markdown := include_markdown(markdown)) is not None: + return replaced_markdown + # this only works if we don't have files that need links rewritten + includes + # this is currently the case, so no need to worry + if (replaced_markdown := re_route_links(markdown, page.title)) is not None: + return replaced_markdown + return diff --git a/hooks/__pycache__/__init__.cpython-39.pyc b/hooks/__pycache__/__init__.cpython-39.pyc new file mode 100644 index 0000000000..dd303888ff Binary files /dev/null and b/hooks/__pycache__/__init__.cpython-39.pyc differ diff --git a/img/README/what_does_splink_do_1.drawio.png b/img/README/what_does_splink_do_1.drawio.png new file mode 100644 index 0000000000..1cedd60ae5 Binary files /dev/null and b/img/README/what_does_splink_do_1.drawio.png differ diff --git a/img/README/what_does_splink_do_2.drawio.png b/img/README/what_does_splink_do_2.drawio.png new file mode 100644 index 0000000000..78ae849ddf Binary files /dev/null and b/img/README/what_does_splink_do_2.drawio.png differ diff --git a/img/README/what_does_splink_do_3.drawio.png b/img/README/what_does_splink_do_3.drawio.png new file mode 100644 index 0000000000..01f7312628 Binary files /dev/null and 
b/img/README/what_does_splink_do_3.drawio.png differ diff --git a/img/blocking/cumulative_comparisons.png b/img/blocking/cumulative_comparisons.png new file mode 100644 index 0000000000..a9438c864f Binary files /dev/null and b/img/blocking/cumulative_comparisons.png differ diff --git a/img/blocking/pairwise_comparisons.png b/img/blocking/pairwise_comparisons.png new file mode 100644 index 0000000000..fab8922566 Binary files /dev/null and b/img/blocking/pairwise_comparisons.png differ diff --git a/img/charts/AltairUserGuide.png b/img/charts/AltairUserGuide.png new file mode 100644 index 0000000000..0be999a55b Binary files /dev/null and b/img/charts/AltairUserGuide.png differ diff --git a/img/charts/Vega-Lite-editor.png b/img/charts/Vega-Lite-editor.png new file mode 100644 index 0000000000..2fe7836847 Binary files /dev/null and b/img/charts/Vega-Lite-editor.png differ diff --git a/img/charts/chart.json b/img/charts/chart.json new file mode 100644 index 0000000000..695019b76a --- /dev/null +++ b/img/charts/chart.json @@ -0,0 +1 @@ +{"config": {"view": {"continuousWidth": 300, "continuousHeight": 300}}, "layer": [{"mark": {"type": "line", "color": "darkred", "point": true, "tooltip": true}, "encoding": {"x": {"field": "Date", "type": "temporal"}, "y": {"field": "Downloads", "title": "Cumulative splink downloads", "type": "quantitative"}}}, {"mark": {"type": "text", "dx": 10, "dy": -10}, "encoding": {"text": {"field": "Downloads", "type": "quantitative"}, "x": {"field": "Date", "type": "temporal"}, "y": {"field": "Downloads", "title": "Cumulative splink downloads", "type": "quantitative"}}}], "data": {"name": "data-22c31c9b0350c6c75db8d1f2d37de72f"}, "$schema": "https://vega.github.io/schema/vega-lite/v5.8.0.json", "datasets": {"data-22c31c9b0350c6c75db8d1f2d37de72f": [{"Date": "2023-04-01", "Downloads": 5506813}, {"Date": "2023-01-01", "Downloads": 4519181}, {"Date": "2022-10-01", "Downloads": 3699127}, {"Date": "2022-07-01", "Downloads": 2932701}, {"Date": 
"2022-04-01", "Downloads": 2145950}, {"Date": "2022-01-01", "Downloads": 1529909}, {"Date": "2021-10-01", "Downloads": 895990}, {"Date": "2021-07-01", "Downloads": 605775}, {"Date": "2021-04-01", "Downloads": 401654}, {"Date": "2021-01-01", "Downloads": 213616}, {"Date": "2020-10-01", "Downloads": 91175}, {"Date": "2020-07-01", "Downloads": 10567}, {"Date": "2020-04-01", "Downloads": 1823}, {"Date": "2020-03-13", "Downloads": 0}]}} \ No newline at end of file diff --git a/img/charts/charts.mp4 b/img/charts/charts.mp4 new file mode 100644 index 0000000000..2425f1be98 Binary files /dev/null and b/img/charts/charts.mp4 differ diff --git a/img/charts/new_chart.png b/img/charts/new_chart.png new file mode 100644 index 0000000000..3c5f6ff61a Binary files /dev/null and b/img/charts/new_chart.png differ diff --git a/img/charts/new_chart_def.json b/img/charts/new_chart_def.json new file mode 100644 index 0000000000..eacbe4fd0c --- /dev/null +++ b/img/charts/new_chart_def.json @@ -0,0 +1,171 @@ +{ + "title": { + "text": "Heatmaps of string comparison metrics", + "anchor": "middle", + "fontSize": 16 + }, + "hconcat": [ + { + "layer": [ + { + "mark": "rect", + "encoding": { + "color": { + "field": "score", + "scale": { + "domain": [ + 0, + 1 + ], + "scheme": "greenblue" + }, + "type": "quantitative", + "legend": null + }, + "x": { + "field": "comparator", + "type": "ordinal", + "title": null + }, + "y": { + "field": "strings_to_compare", + "type": "ordinal", + "title": "String comparison", + "axis": { + "titleFontSize": 14 + } + } + }, + "title": "Similarity", + "width": { + "step": 40 + }, + "height": { + "step": 30 + } + }, + { + "mark": { + "type": "text", + "baseline": "middle" + }, + "encoding": { + "size": { + "field": "score", + "scale": { + "range": [ + 8, + 14 + ] + }, + "legend": null + }, + "text": { + "field": "score", + "format": ".2f", + "type": "quantitative" + }, + "x": { + "field": "comparator", + "type": "ordinal", + "axis": { + "labelFontSize": 12 + } + }, + 
"y": { + "field": "strings_to_compare", + "type": "ordinal" + } + } + } + ], + "data": { + "name": "data-similarity" + } + }, + { + "layer": [ + { + "mark": "rect", + "encoding": { + "color": { + "field": "score", + "scale": { + "scheme": "yelloworangered", + "reverse": true + }, + "type": "quantitative", + "legend": null + }, + "x": { + "field": "comparator", + "type": "ordinal", + "title": null, + "axis": { + "labelFontSize": 12 + } + }, + "y": { + "field": "strings_to_compare", + "type": "ordinal", + "axis": null + } + }, + "title": "Distance", + "width": { + "step": 40 + }, + "height": { + "step": 30 + } + }, + { + "mark": { + "type": "text", + "baseline": "middle" + }, + "encoding": { + "size": { + "field": "score", + "scale": { + "range": [ + 8, + 14 + ], + "reverse": true + }, + "legend": null + }, + "text": { + "field": "score", + "type": "quantitative" + }, + "x": { + "field": "comparator", + "type": "ordinal" + }, + "y": { + "field": "strings_to_compare", + "type": "ordinal" + } + } + } + ], + "data": { + "name": "data-distance" + } + } + ], + "resolve": { + "scale": { + "color": "independent", + "y": "shared", + "size": "independent" + } + }, + "$schema": "https://vega.github.io/schema/vega-lite/v4.17.0.json", + "datasets": { + "data-similarity": "[{\"strings_to_compare\":\"Richard, Richard\",\"comparator\":\"jaro\",\"score\":1.0},{\"strings_to_compare\":\"Richard, RICHARD\",\"comparator\":\"jaro\",\"score\":0.43},{\"strings_to_compare\":\"Richard, Richar\",\"comparator\":\"jaro\",\"score\":0.95},{\"strings_to_compare\":\"Richard, iRchard\",\"comparator\":\"jaro\",\"score\":0.95},{\"strings_to_compare\":\"Richard, Richadr\",\"comparator\":\"jaro\",\"score\":0.95},{\"strings_to_compare\":\"Richard, Rich\",\"comparator\":\"jaro\",\"score\":0.86},{\"strings_to_compare\":\"Richard, Rick\",\"comparator\":\"jaro\",\"score\":0.73},{\"strings_to_compare\":\"Richard, Ricky\",\"comparator\":\"jaro\",\"score\":0.68},{\"strings_to_compare\":\"Richard, 
Dick\",\"comparator\":\"jaro\",\"score\":0.6},{\"strings_to_compare\":\"Richard, Rico\",\"comparator\":\"jaro\",\"score\":0.73},{\"strings_to_compare\":\"Richard, Rachael\",\"comparator\":\"jaro\",\"score\":0.71},{\"strings_to_compare\":\"Richard, Stephen\",\"comparator\":\"jaro\",\"score\":0.43},{\"strings_to_compare\":\"Richard, Richard\",\"comparator\":\"jaro_winkler\",\"score\":1.0},{\"strings_to_compare\":\"Richard, RICHARD\",\"comparator\":\"jaro_winkler\",\"score\":0.43},{\"strings_to_compare\":\"Richard, Richar\",\"comparator\":\"jaro_winkler\",\"score\":0.97},{\"strings_to_compare\":\"Richard, iRchard\",\"comparator\":\"jaro_winkler\",\"score\":0.95},{\"strings_to_compare\":\"Richard, Richadr\",\"comparator\":\"jaro_winkler\",\"score\":0.97},{\"strings_to_compare\":\"Richard, Rich\",\"comparator\":\"jaro_winkler\",\"score\":0.91},{\"strings_to_compare\":\"Richard, Rick\",\"comparator\":\"jaro_winkler\",\"score\":0.81},{\"strings_to_compare\":\"Richard, Ricky\",\"comparator\":\"jaro_winkler\",\"score\":0.68},{\"strings_to_compare\":\"Richard, Dick\",\"comparator\":\"jaro_winkler\",\"score\":0.6},{\"strings_to_compare\":\"Richard, Rico\",\"comparator\":\"jaro_winkler\",\"score\":0.81},{\"strings_to_compare\":\"Richard, Rachael\",\"comparator\":\"jaro_winkler\",\"score\":0.74},{\"strings_to_compare\":\"Richard, Stephen\",\"comparator\":\"jaro_winkler\",\"score\":0.43},{\"strings_to_compare\":\"Richard, Richard\",\"comparator\":\"jaccard\",\"score\":1.0},{\"strings_to_compare\":\"Richard, RICHARD\",\"comparator\":\"jaccard\",\"score\":0.08},{\"strings_to_compare\":\"Richard, Richar\",\"comparator\":\"jaccard\",\"score\":0.86},{\"strings_to_compare\":\"Richard, iRchard\",\"comparator\":\"jaccard\",\"score\":1.0},{\"strings_to_compare\":\"Richard, Richadr\",\"comparator\":\"jaccard\",\"score\":1.0},{\"strings_to_compare\":\"Richard, Rich\",\"comparator\":\"jaccard\",\"score\":0.57},{\"strings_to_compare\":\"Richard, 
Rick\",\"comparator\":\"jaccard\",\"score\":0.38},{\"strings_to_compare\":\"Richard, Ricky\",\"comparator\":\"jaccard\",\"score\":0.33},{\"strings_to_compare\":\"Richard, Dick\",\"comparator\":\"jaccard\",\"score\":0.22},{\"strings_to_compare\":\"Richard, Rico\",\"comparator\":\"jaccard\",\"score\":0.38},{\"strings_to_compare\":\"Richard, Rachael\",\"comparator\":\"jaccard\",\"score\":0.44},{\"strings_to_compare\":\"Richard, Stephen\",\"comparator\":\"jaccard\",\"score\":0.08}]", + "data-distance": "[{\"strings_to_compare\":\"Richard, Richard\",\"comparator\":\"levenshtein\",\"score\":0.0},{\"strings_to_compare\":\"Richard, RICHARD\",\"comparator\":\"levenshtein\",\"score\":6.0},{\"strings_to_compare\":\"Richard, Richar\",\"comparator\":\"levenshtein\",\"score\":1.0},{\"strings_to_compare\":\"Richard, iRchard\",\"comparator\":\"levenshtein\",\"score\":2.0},{\"strings_to_compare\":\"Richard, Richadr\",\"comparator\":\"levenshtein\",\"score\":2.0},{\"strings_to_compare\":\"Richard, Rich\",\"comparator\":\"levenshtein\",\"score\":3.0},{\"strings_to_compare\":\"Richard, Rick\",\"comparator\":\"levenshtein\",\"score\":4.0},{\"strings_to_compare\":\"Richard, Ricky\",\"comparator\":\"levenshtein\",\"score\":4.0},{\"strings_to_compare\":\"Richard, Dick\",\"comparator\":\"levenshtein\",\"score\":5.0},{\"strings_to_compare\":\"Richard, Rico\",\"comparator\":\"levenshtein\",\"score\":4.0},{\"strings_to_compare\":\"Richard, Rachael\",\"comparator\":\"levenshtein\",\"score\":3.0},{\"strings_to_compare\":\"Richard, Stephen\",\"comparator\":\"levenshtein\",\"score\":7.0},{\"strings_to_compare\":\"Richard, Richard\",\"comparator\":\"damerau_levenshtein\",\"score\":0.0},{\"strings_to_compare\":\"Richard, RICHARD\",\"comparator\":\"damerau_levenshtein\",\"score\":6.0},{\"strings_to_compare\":\"Richard, Richar\",\"comparator\":\"damerau_levenshtein\",\"score\":1.0},{\"strings_to_compare\":\"Richard, 
iRchard\",\"comparator\":\"damerau_levenshtein\",\"score\":1.0},{\"strings_to_compare\":\"Richard, Richadr\",\"comparator\":\"damerau_levenshtein\",\"score\":1.0},{\"strings_to_compare\":\"Richard, Rich\",\"comparator\":\"damerau_levenshtein\",\"score\":3.0},{\"strings_to_compare\":\"Richard, Rick\",\"comparator\":\"damerau_levenshtein\",\"score\":4.0},{\"strings_to_compare\":\"Richard, Ricky\",\"comparator\":\"damerau_levenshtein\",\"score\":4.0},{\"strings_to_compare\":\"Richard, Dick\",\"comparator\":\"damerau_levenshtein\",\"score\":5.0},{\"strings_to_compare\":\"Richard, Rico\",\"comparator\":\"damerau_levenshtein\",\"score\":4.0},{\"strings_to_compare\":\"Richard, Rachael\",\"comparator\":\"damerau_levenshtein\",\"score\":3.0},{\"strings_to_compare\":\"Richard, Stephen\",\"comparator\":\"damerau_levenshtein\",\"score\":7.0}]" + } +} \ No newline at end of file diff --git a/img/charts/old_chart.png b/img/charts/old_chart.png new file mode 100644 index 0000000000..3316f0bf5b Binary files /dev/null and b/img/charts/old_chart.png differ diff --git a/img/charts/old_chart_def.json b/img/charts/old_chart_def.json new file mode 100644 index 0000000000..fcbeafa188 --- /dev/null +++ b/img/charts/old_chart_def.json @@ -0,0 +1,135 @@ +{ + "config": { + "view": { + "continuousWidth": 400, + "continuousHeight": 300 + } + }, + "hconcat": [ + { + "layer": [ + { + "mark": "rect", + "encoding": { + "color": { + "field": "score", + "scale": { + "domain": [ + 0, + 1 + ], + "range": [ + "red", + "green" + ] + }, + "type": "quantitative" + }, + "x": { + "field": "comparator", + "type": "ordinal" + }, + "y": { + "field": "strings_to_compare", + "type": "ordinal" + } + }, + "height": 300, + "title": "Heatmap of Similarity Scores", + "width": 300 + }, + { + "mark": { + "type": "text", + "baseline": "middle" + }, + "encoding": { + "text": { + "field": "score", + "format": ".2f", + "type": "quantitative" + }, + "x": { + "field": "comparator", + "type": "ordinal" + }, + "y": { + 
"field": "strings_to_compare", + "type": "ordinal" + } + } + } + ], + "data": { + "name": "data-similarity" + } + }, + { + "layer": [ + { + "mark": "rect", + "encoding": { + "color": { + "field": "score", + "scale": { + "domain": [ + 0, + 5 + ], + "range": [ + "green", + "red" + ] + }, + "type": "quantitative" + }, + "x": { + "field": "comparator", + "type": "ordinal" + }, + "y": { + "field": "strings_to_compare", + "type": "ordinal" + } + }, + "height": 300, + "title": "Heatmap of Distance Scores", + "width": 200 + }, + { + "mark": { + "type": "text", + "baseline": "middle" + }, + "encoding": { + "text": { + "field": "score", + "type": "quantitative" + }, + "x": { + "field": "comparator", + "type": "ordinal" + }, + "y": { + "field": "strings_to_compare", + "type": "ordinal" + } + } + } + ], + "data": { + "name": "data-distance" + } + } + ], + "resolve": { + "scale": { + "color": "independent" + } + }, + "$schema": "https://vega.github.io/schema/vega-lite/v4.17.0.json", + "datasets": { + "data-similarity": "[{\"strings_to_compare\":\"Richard, Richard\",\"comparator\":\"jaro\",\"score\":1.0},{\"strings_to_compare\":\"Richard, RICHARD\",\"comparator\":\"jaro\",\"score\":0.43},{\"strings_to_compare\":\"Richard, Richar\",\"comparator\":\"jaro\",\"score\":0.95},{\"strings_to_compare\":\"Richard, iRchard\",\"comparator\":\"jaro\",\"score\":0.95},{\"strings_to_compare\":\"Richard, Richadr\",\"comparator\":\"jaro\",\"score\":0.95},{\"strings_to_compare\":\"Richard, Rich\",\"comparator\":\"jaro\",\"score\":0.86},{\"strings_to_compare\":\"Richard, Rick\",\"comparator\":\"jaro\",\"score\":0.73},{\"strings_to_compare\":\"Richard, Ricky\",\"comparator\":\"jaro\",\"score\":0.68},{\"strings_to_compare\":\"Richard, Dick\",\"comparator\":\"jaro\",\"score\":0.6},{\"strings_to_compare\":\"Richard, Rico\",\"comparator\":\"jaro\",\"score\":0.73},{\"strings_to_compare\":\"Richard, Rachael\",\"comparator\":\"jaro\",\"score\":0.71},{\"strings_to_compare\":\"Richard, 
Stephen\",\"comparator\":\"jaro\",\"score\":0.43},{\"strings_to_compare\":\"Richard, Richard\",\"comparator\":\"jaro_winkler\",\"score\":1.0},{\"strings_to_compare\":\"Richard, RICHARD\",\"comparator\":\"jaro_winkler\",\"score\":0.43},{\"strings_to_compare\":\"Richard, Richar\",\"comparator\":\"jaro_winkler\",\"score\":0.97},{\"strings_to_compare\":\"Richard, iRchard\",\"comparator\":\"jaro_winkler\",\"score\":0.95},{\"strings_to_compare\":\"Richard, Richadr\",\"comparator\":\"jaro_winkler\",\"score\":0.97},{\"strings_to_compare\":\"Richard, Rich\",\"comparator\":\"jaro_winkler\",\"score\":0.91},{\"strings_to_compare\":\"Richard, Rick\",\"comparator\":\"jaro_winkler\",\"score\":0.81},{\"strings_to_compare\":\"Richard, Ricky\",\"comparator\":\"jaro_winkler\",\"score\":0.68},{\"strings_to_compare\":\"Richard, Dick\",\"comparator\":\"jaro_winkler\",\"score\":0.6},{\"strings_to_compare\":\"Richard, Rico\",\"comparator\":\"jaro_winkler\",\"score\":0.81},{\"strings_to_compare\":\"Richard, Rachael\",\"comparator\":\"jaro_winkler\",\"score\":0.74},{\"strings_to_compare\":\"Richard, Stephen\",\"comparator\":\"jaro_winkler\",\"score\":0.43},{\"strings_to_compare\":\"Richard, Richard\",\"comparator\":\"jaccard\",\"score\":1.0},{\"strings_to_compare\":\"Richard, RICHARD\",\"comparator\":\"jaccard\",\"score\":0.08},{\"strings_to_compare\":\"Richard, Richar\",\"comparator\":\"jaccard\",\"score\":0.86},{\"strings_to_compare\":\"Richard, iRchard\",\"comparator\":\"jaccard\",\"score\":1.0},{\"strings_to_compare\":\"Richard, Richadr\",\"comparator\":\"jaccard\",\"score\":1.0},{\"strings_to_compare\":\"Richard, Rich\",\"comparator\":\"jaccard\",\"score\":0.57},{\"strings_to_compare\":\"Richard, Rick\",\"comparator\":\"jaccard\",\"score\":0.38},{\"strings_to_compare\":\"Richard, Ricky\",\"comparator\":\"jaccard\",\"score\":0.33},{\"strings_to_compare\":\"Richard, Dick\",\"comparator\":\"jaccard\",\"score\":0.22},{\"strings_to_compare\":\"Richard, 
Rico\",\"comparator\":\"jaccard\",\"score\":0.38},{\"strings_to_compare\":\"Richard, Rachael\",\"comparator\":\"jaccard\",\"score\":0.44},{\"strings_to_compare\":\"Richard, Stephen\",\"comparator\":\"jaccard\",\"score\":0.08}]", + "data-distance": "[{\"strings_to_compare\":\"Richard, Richard\",\"comparator\":\"levenshtein\",\"score\":0.0},{\"strings_to_compare\":\"Richard, RICHARD\",\"comparator\":\"levenshtein\",\"score\":6.0},{\"strings_to_compare\":\"Richard, Richar\",\"comparator\":\"levenshtein\",\"score\":1.0},{\"strings_to_compare\":\"Richard, iRchard\",\"comparator\":\"levenshtein\",\"score\":2.0},{\"strings_to_compare\":\"Richard, Richadr\",\"comparator\":\"levenshtein\",\"score\":2.0},{\"strings_to_compare\":\"Richard, Rich\",\"comparator\":\"levenshtein\",\"score\":3.0},{\"strings_to_compare\":\"Richard, Rick\",\"comparator\":\"levenshtein\",\"score\":4.0},{\"strings_to_compare\":\"Richard, Ricky\",\"comparator\":\"levenshtein\",\"score\":4.0},{\"strings_to_compare\":\"Richard, Dick\",\"comparator\":\"levenshtein\",\"score\":5.0},{\"strings_to_compare\":\"Richard, Rico\",\"comparator\":\"levenshtein\",\"score\":4.0},{\"strings_to_compare\":\"Richard, Rachael\",\"comparator\":\"levenshtein\",\"score\":3.0},{\"strings_to_compare\":\"Richard, Stephen\",\"comparator\":\"levenshtein\",\"score\":7.0},{\"strings_to_compare\":\"Richard, Richard\",\"comparator\":\"damerau_levenshtein\",\"score\":0.0},{\"strings_to_compare\":\"Richard, RICHARD\",\"comparator\":\"damerau_levenshtein\",\"score\":6.0},{\"strings_to_compare\":\"Richard, Richar\",\"comparator\":\"damerau_levenshtein\",\"score\":1.0},{\"strings_to_compare\":\"Richard, iRchard\",\"comparator\":\"damerau_levenshtein\",\"score\":1.0},{\"strings_to_compare\":\"Richard, Richadr\",\"comparator\":\"damerau_levenshtein\",\"score\":1.0},{\"strings_to_compare\":\"Richard, Rich\",\"comparator\":\"damerau_levenshtein\",\"score\":3.0},{\"strings_to_compare\":\"Richard, 
Rick\",\"comparator\":\"damerau_levenshtein\",\"score\":4.0},{\"strings_to_compare\":\"Richard, Ricky\",\"comparator\":\"damerau_levenshtein\",\"score\":4.0},{\"strings_to_compare\":\"Richard, Dick\",\"comparator\":\"damerau_levenshtein\",\"score\":5.0},{\"strings_to_compare\":\"Richard, Rico\",\"comparator\":\"damerau_levenshtein\",\"score\":4.0},{\"strings_to_compare\":\"Richard, Rachael\",\"comparator\":\"damerau_levenshtein\",\"score\":3.0},{\"strings_to_compare\":\"Richard, Stephen\",\"comparator\":\"damerau_levenshtein\",\"score\":7.0}]" + } +} \ No newline at end of file diff --git a/img/clusters/basic_graph.drawio.png b/img/clusters/basic_graph.drawio.png new file mode 100644 index 0000000000..43c6415b06 Binary files /dev/null and b/img/clusters/basic_graph.drawio.png differ diff --git a/img/clusters/basic_graph_cluster.drawio.png b/img/clusters/basic_graph_cluster.drawio.png new file mode 100644 index 0000000000..54f8067c90 Binary files /dev/null and b/img/clusters/basic_graph_cluster.drawio.png differ diff --git a/img/clusters/basic_graph_cluster_person.drawio.png b/img/clusters/basic_graph_cluster_person.drawio.png new file mode 100644 index 0000000000..cca8376e61 Binary files /dev/null and b/img/clusters/basic_graph_cluster_person.drawio.png differ diff --git a/img/clusters/basic_graph_records.drawio.png b/img/clusters/basic_graph_records.drawio.png new file mode 100644 index 0000000000..239eb5d24e Binary files /dev/null and b/img/clusters/basic_graph_records.drawio.png differ diff --git a/img/clusters/cluster_density.drawio.png b/img/clusters/cluster_density.drawio.png new file mode 100644 index 0000000000..73b1705281 Binary files /dev/null and b/img/clusters/cluster_density.drawio.png differ diff --git a/img/clusters/cluster_size.drawio.png b/img/clusters/cluster_size.drawio.png new file mode 100644 index 0000000000..d204682d78 Binary files /dev/null and b/img/clusters/cluster_size.drawio.png differ diff --git a/img/clusters/graph.png 
b/img/clusters/graph.png new file mode 100644 index 0000000000..5ff4786f92 Binary files /dev/null and b/img/clusters/graph.png differ diff --git a/img/clusters/is_bridge.drawio.png b/img/clusters/is_bridge.drawio.png new file mode 100644 index 0000000000..03743a6759 Binary files /dev/null and b/img/clusters/is_bridge.drawio.png differ diff --git a/img/clusters/threshold_cluster.drawio.png b/img/clusters/threshold_cluster.drawio.png new file mode 100644 index 0000000000..4c04f476ef Binary files /dev/null and b/img/clusters/threshold_cluster.drawio.png differ diff --git a/img/clusters/threshold_cluster_high.drawio.png b/img/clusters/threshold_cluster_high.drawio.png new file mode 100644 index 0000000000..4df4314717 Binary files /dev/null and b/img/clusters/threshold_cluster_high.drawio.png differ diff --git a/img/clusters/threshold_cluster_low.drawio.png b/img/clusters/threshold_cluster_low.drawio.png new file mode 100644 index 0000000000..733f777709 Binary files /dev/null and b/img/clusters/threshold_cluster_low.drawio.png differ diff --git a/img/clusters/threshold_cluster_medium.drawio.png b/img/clusters/threshold_cluster_medium.drawio.png new file mode 100644 index 0000000000..122609ed15 Binary files /dev/null and b/img/clusters/threshold_cluster_medium.drawio.png differ diff --git a/img/dependency_management/python_release_cycle.png b/img/dependency_management/python_release_cycle.png new file mode 100644 index 0000000000..5d75e7b4b6 Binary files /dev/null and b/img/dependency_management/python_release_cycle.png differ diff --git a/img/favicon.ico b/img/favicon.ico new file mode 100644 index 0000000000..f4a8876490 Binary files /dev/null and b/img/favicon.ico differ diff --git a/img/fellegi_sunter/prob_v_weight.png b/img/fellegi_sunter/prob_v_weight.png new file mode 100644 index 0000000000..193a963a97 Binary files /dev/null and b/img/fellegi_sunter/prob_v_weight.png differ diff --git a/img/fellegi_sunter/waterfall.png b/img/fellegi_sunter/waterfall.png new file 
mode 100644 index 0000000000..bc4d44a020 Binary files /dev/null and b/img/fellegi_sunter/waterfall.png differ diff --git a/img/postcode_components.png b/img/postcode_components.png new file mode 100644 index 0000000000..a2dfbe0ca3 Binary files /dev/null and b/img/postcode_components.png differ diff --git a/img/probabilistic_vs_deterministic/probabilistic_example.png b/img/probabilistic_vs_deterministic/probabilistic_example.png new file mode 100644 index 0000000000..f6ed224718 Binary files /dev/null and b/img/probabilistic_vs_deterministic/probabilistic_example.png differ diff --git a/img/probabilistic_vs_deterministic/simplified_waterfall.png b/img/probabilistic_vs_deterministic/simplified_waterfall.png new file mode 100644 index 0000000000..c5b5aa6f64 Binary files /dev/null and b/img/probabilistic_vs_deterministic/simplified_waterfall.png differ diff --git a/img/releases/notes.png b/img/releases/notes.png new file mode 100644 index 0000000000..8bfbb48947 Binary files /dev/null and b/img/releases/notes.png differ diff --git a/img/releases/notes_button.png b/img/releases/notes_button.png new file mode 100644 index 0000000000..c09b31d951 Binary files /dev/null and b/img/releases/notes_button.png differ diff --git a/img/releases/publish.png b/img/releases/publish.png new file mode 100644 index 0000000000..7630b3ad3a Binary files /dev/null and b/img/releases/publish.png differ diff --git a/img/releases/tag.png b/img/releases/tag.png new file mode 100644 index 0000000000..cdd8b1735b Binary files /dev/null and b/img/releases/tag.png differ diff --git a/img/settings_validation/error_logger.png b/img/settings_validation/error_logger.png new file mode 100644 index 0000000000..a5cf2a785f Binary files /dev/null and b/img/settings_validation/error_logger.png differ diff --git a/img/term_frequency/calc.png b/img/term_frequency/calc.png new file mode 100644 index 0000000000..c6fa0857b6 Binary files /dev/null and b/img/term_frequency/calc.png differ diff --git 
a/img/term_frequency/example.png b/img/term_frequency/example.png new file mode 100644 index 0000000000..3614fe6050 Binary files /dev/null and b/img/term_frequency/example.png differ diff --git a/img/term_frequency/gender-distribution.png b/img/term_frequency/gender-distribution.png new file mode 100644 index 0000000000..2472ec2268 Binary files /dev/null and b/img/term_frequency/gender-distribution.png differ diff --git a/img/term_frequency/surname-distribution.png b/img/term_frequency/surname-distribution.png new file mode 100644 index 0000000000..0848ef1572 Binary files /dev/null and b/img/term_frequency/surname-distribution.png differ diff --git a/img/term_frequency/tf-intro.drawio.png b/img/term_frequency/tf-intro.drawio.png new file mode 100644 index 0000000000..16cd5447cc Binary files /dev/null and b/img/term_frequency/tf-intro.drawio.png differ diff --git a/img/term_frequency/tf-match-weight.png b/img/term_frequency/tf-match-weight.png new file mode 100644 index 0000000000..b14e7f012f Binary files /dev/null and b/img/term_frequency/tf-match-weight.png differ diff --git a/img/term_frequency/waterfall.png b/img/term_frequency/waterfall.png new file mode 100644 index 0000000000..2d2d69aa5a Binary files /dev/null and b/img/term_frequency/waterfall.png differ diff --git a/img/vega_spec_for_readme.vg.json b/img/vega_spec_for_readme.vg.json new file mode 100644 index 0000000000..f3ad42ea1b --- /dev/null +++ b/img/vega_spec_for_readme.vg.json @@ -0,0 +1,381 @@ +{ + "$schema": "https://vega.github.io/schema/vega/v5.json", + "description": "Links and nodes", + "padding": 0, + "autosize": "none", + "signals": [ + { + "name": "node_click", + "on": [ + { + "events": "@nodes:click", + "update": "datum" + } + ] + }, + { + "name": "nodeRadius", + "value": 1 + }, + { + "name": "nodeCollideStrength", + "value": 1 + }, + { + "name": "nodeCollideRadius", + "value": 1.4 + }, + { + "name": "linkStrength", + "value": 0.5 + }, + { + "name": "edge_click", + "on": [ + { + "events": 
"@edges:click", + "update": "datum" + } + ] + }, + { + "name": "cx", + "update": "width / 2" + }, + { + "name": "cy", + "update": "height / 2" + }, + { + "name": "nodeCharge", + "value": 30 + }, + { + "name": "linkDistance", + "value": 0.5 + }, + { + "name": "vis_height", + "value": 320 + }, + { + "name": "vis_width", + "value": 780 + }, + { + "name": "static", + "value": true + }, + { + "description": "State variable for active node fix status.", + "name": "fix", + "value": false, + "on": [ + { + "events": "symbol:mouseout[!event.buttons], window:mouseup", + "update": "false" + }, + { + "events": "symbol:mouseover", + "update": "fix || true" + }, + { + "events": "[symbol:mousedown, window:mouseup] > window:mousemove!", + "update": "xy()", + "force": true + } + ] + }, + { + "description": "Graph node most recently interacted with.", + "name": "node", + "value": null, + "on": [ + { + "events": "symbol:mouseover", + "update": "fix === true ? item() : node" + } + ] + }, + { + "description": "Flag to restart Force simulation upon data changes.", + "name": "restart", + "value": false, + "on": [ + { + "events": { + "signal": "fix" + }, + "update": "fix && fix.length" + } + ] + } + ], + "width": { + "signal": "vis_width" + }, + "height": { + "signal": "vis_height" + }, + "data": [ + { + "name": "node-data", + "values": [ + { + "name": "1. lucas smith", + "eigen_centrality": 0.4352045493, + "__node_id": "1", + "cluster_id": 1, + "tooltip": {} + }, + { + "name": "2. lucas smyth", + "eigen_centrality": 0.4352045493, + "__node_id": "2", + "cluster_id": 1, + "tooltip": {} + }, + { + "name": "3. lucas smyth", + "eigen_centrality": 0.4352045493, + "__node_id": "3", + "cluster_id": 1, + "tooltip": {} + }, + { + "name": "4. david jones", + "eigen_centrality": 0.4352045493, + "__node_id": "4", + "cluster_id": 2, + "tooltip": {} + }, + { + "name": "5. 
david jones", + "eigen_centrality": 0.4352045493, + "__node_id": "5", + "cluster_id": 2, + "tooltip": {} + } + ] + }, + { + "name": "link-data", + "values": [ + { + "source": "1", + "target": "2", + "tooltip": {}, + "match_probability": 1 + }, + { + "source": "1", + "target": "3", + "tooltip": {}, + "match_probability": 1 + }, + { + "source": "2", + "target": "3", + "tooltip": {}, + "match_probability": 1 + }, + { + "source": "4", + "target": "5", + "tooltip": {}, + "match_probability": 1 + } + ] + } + ], + "scales": [ + { + "name": "color", + "type": "ordinal", + "domain": { + "data": "node-data", + "field": "cluster_id" + }, + "range": [ + "#A9C5FC", + "#FFB300", + "#deebf7", + "#fee0d2", + "#ef8a62", + "#b2182b" + ] + }, + { + "name": "link_colour", + "type": "linear", + "domain": [0, 1], + "range": { + "scheme": "redyellowgreen" + }, + "reverse": false + } + ], + "legends": [], + "marks": [ + { + "name": "nodes", + "type": "symbol", + "zindex": 1, + "from": { + "data": "node-data" + }, + "on": [ + { + "trigger": "fix", + "modify": "node", + "values": "fix === true ? 
{fx: node.x, fy: node.y} : {fx: fix[0], fy: fix[1]}" + } + ], + "encode": { + "enter": { + "stroke": { + "value": "black" + }, + "tooltip": { + "signal": "datum.tooltip" + } + }, + "update": { + "size": { + "value": 1000, + "mult": { + "signal": "nodeRadius" + } + }, + "cursor": { + "value": "pointer" + }, + "fill": { + "scale": "color", + "field": "cluster_id" + } + } + }, + "transform": [ + { + "type": "force", + "iterations": 400, + "restart": { + "signal": "restart" + }, + "static": { + "signal": "static" + }, + "signal": "force", + "forces": [ + { + "force": "center", + "x": { + "signal": "cx" + }, + "y": { + "signal": "cy" + } + }, + { + "force": "collide", + "radius": { + "expr": "pow(1000*nodeRadius,0.5)*nodeCollideStrength*nodeCollideRadius" + }, + "strength": { + "signal": "nodeCollideStrength" + } + }, + { + "force": "nbody", + "strength": { + "signal": "nodeCharge" + } + }, + { + "description": "Uses link-data to find links between nodes constraining x and y of nodes. Tranforms link-data so source and target are objects that include e.g. x and y coords", + "force": "link", + "links": "link-data", + "distance": { + "expr": "50*linkDistance" + }, + "id": "datum.__node_id", + "strength": { + "signal": "linkStrength" + } + } + ] + } + ] + }, + { + "description": "The force link transform will replace source and target with objects containing x and y coords. 
We need to extract x and y to plot a path between them", + "type": "path", + "name": "edges", + "from": { + "data": "link-data" + }, + "interactive": true, + "encode": { + "update": { + "stroke": { + "scale": "link_colour", + "field": "match_probability" + }, + "tooltip": { + "signal": "datum.tooltip" + }, + "strokeWidth": { + "value": 2 + } + } + }, + "transform": [ + { + "type": "linkpath", + "require": { + "signal": "force" + }, + "shape": "line", + "sourceX": "datum.source.x", + "sourceY": "datum.source.y", + "targetX": "datum.target.x", + "targetY": "datum.target.y" + } + ] + }, + { + "type": "text", + "from": { + "data": "nodes" + }, + "interactive": false, + "zindex": 2, + "encode": { + "enter": { + "align": { + "value": "center" + }, + "baseline": { + "value": "middle" + }, + "fontSize": { + "value": 12 + }, + + "text": { + "field": "datum.name" + } + }, + "update": { + "x": { + "field": "x" + }, + "y": { + "field": "y" + } + } + } + } + ] +} diff --git a/includes/generated_files/dataset_labels_table.html b/includes/generated_files/dataset_labels_table.html new file mode 100644 index 0000000000..8807a96445 --- /dev/null +++ b/includes/generated_files/dataset_labels_table.html @@ -0,0 +1,5175 @@ + + + + + + + + + + + + + + + + + + + + + + + + Dataset labels table - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Dataset labels table

+ + + + + + + + + + + + + + + + + + + + +
dataset namedescriptionrowsunique entitieslink to source
fake_1000_labelsClerical labels for fake_10003,176NAsource
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/includes/generated_files/datasets_table.html b/includes/generated_files/datasets_table.html new file mode 100644 index 0000000000..2fb28e721c --- /dev/null +++ b/includes/generated_files/datasets_table.html @@ -0,0 +1,5217 @@ + + + + + + + + + + + + + + + + + + + + + + + + Datasets table - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +
+
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Datasets table

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
dataset namedescriptionrowsunique entitieslink to source
fake_1000Fake 1000 from splink demos. Records are 250 simulated people, with different numbers of duplicates, labelled.1,000250source
historical_50kThe data is based on historical persons scraped from wikidata. Duplicate records are introduced with a variety of errors.50,0005,156source
febrl3The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. The FEBRL3 data set contains 5000 records (2000 originals and 3000 duplicates), with a maximum of 5 duplicates based on one original record.5,0002,000source
febrl4aThe Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL4a contains 5000 original records.5,0005,000source
febrl4bThe Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL4b contains 5000 duplicate records, one for each record in FEBRL4a.5,0005,000source
transactions_originThis data has been generated to resemble bank transactions leaving an account. There are no duplicates within the dataset and each transaction is designed to have a counterpart arriving in 'transactions_destination'. Memo is sometimes truncated or missing.45,32645,326source
transactions_destinationThis data has been generated to resemble bank transactions arriving in an account. There are no duplicates within the dataset and each transaction is designed to have a counterpart sent from 'transactions_origin'. There may be a delay between the source and destination account, and the amount may vary due to hidden fees and foreign exchange rates. Memo is sometimes truncated or missing.45,32645,326source
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/includes/tags.html b/includes/tags.html new file mode 100644 index 0000000000..89672951a3 --- /dev/null +++ b/includes/tags.html @@ -0,0 +1,5163 @@ + + + + + + + + + + + + + + + + + + + + + + + + Tags - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Tags

+

Following is a list of relevant tags:

+

[TAGS]

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/index.html b/index.html new file mode 100644 index 0000000000..f761bdbc9a --- /dev/null +++ b/index.html @@ -0,0 +1,5345 @@ + + + + + + + + + + + + + + + + + + + + + + + + Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + + + + + + + + + + + + + +
+
+ + + + + + + + + + + + +

+Splink: data linkage at scale. (Splink logo). +

+ +
+

Info

+

🎉 Splink 4 has been released! Examples of new syntax are here and a release announcement is here.

+
+

Fast, accurate and scalable probabilistic data linkage

+

Splink is a Python package for probabilistic record linkage (entity resolution) that allows you to deduplicate and link records from datasets without unique identifiers.

+

Get Started with Splink

+
+ +

Key Features

+

⚡ Speed: Capable of linking a million records on a laptop in approximately one minute.
+🎯 Accuracy: Full support for term frequency adjustments and user-defined fuzzy matching logic.
+🌐 Scalability: Execute linkage jobs in Python (using DuckDB) or big-data backends like AWS Athena or Spark for 100+ million records.
+🎓 Unsupervised Learning: No training data is required, as models can be trained using an unsupervised approach.
+📊 Interactive Outputs: Provides a wide range of interactive outputs to help users understand their model and diagnose linkage problems.

+

Splink's core linkage algorithm is based on Fellegi-Sunter's model of record linkage, with various customizations to improve accuracy.
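To give a feel for the Fellegi-Sunter approach, here is a minimal, illustrative sketch of how match weights combine into a match probability. The m/u probabilities and prior are made-up numbers for illustration only — this is not Splink's actual implementation:

```python
import math

# Illustrative m/u probabilities (made up for this sketch):
# m = P(columns agree | records are a true match)
# u = P(columns agree | records are not a match)
comparisons = {
    "first_name": {"agree": True,  "m": 0.9,  "u": 0.01},
    "dob":        {"agree": True,  "m": 0.95, "u": 0.001},
    "city":       {"agree": False, "m": 0.8,  "u": 0.2},
}

prior_match_probability = 1e-4  # assumed chance two random records match

# Each comparison contributes a Bayes factor: m/u on agreement,
# (1-m)/(1-u) on disagreement. Its log2 is the "match weight".
bayes_factor = prior_match_probability / (1 - prior_match_probability)
for col, c in comparisons.items():
    bf = c["m"] / c["u"] if c["agree"] else (1 - c["m"]) / (1 - c["u"])
    print(f"{col}: match weight = {math.log2(bf):.2f}")
    bayes_factor *= bf

# Convert the final odds back to a probability
match_probability = bayes_factor / (1 + bayes_factor)
print(f"match probability = {match_probability:.4f}")
```

The waterfall charts Splink produces visualise exactly this accumulation of match weights for a given record pair.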

+ +

Consider the following records that lack a unique person identifier:

+

tables showing what Splink does

+

Splink predicts which rows link together:

+

tables showing what Splink does

+

and clusters these links to produce an estimated person ID:

+

tables showing what Splink does

+ +

Before using Splink, input data should be standardised, with consistent column names and formatting (e.g. lowercased, with punctuation cleaned up).

+

Splink performs best with input data containing multiple columns that are not highly correlated. For instance, if the entity type is persons, you may have columns for full name, date of birth, and city. If the entity type is companies, you could have columns for name, turnover, sector, and telephone number.

+

High correlation occurs when the value of a column is highly constrained (predictable) from the value of another column. For example, a 'city' field is almost perfectly correlated with 'postcode'. Gender is highly correlated with 'first name'. Correlation is particularly problematic if all of your input columns are highly correlated.
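To see why correlated columns are problematic, consider what happens if two fields are treated as independent when they are not. A toy sketch with made-up Bayes factors (this is an illustration of the pitfall, not Splink code):

```python
import math

# Toy Bayes factors for agreement on two fields (made-up numbers)
bf_city = 0.95 / 0.05      # city agreement: moderately strong evidence
bf_postcode = 0.95 / 0.01  # postcode agreement: stronger evidence

# Naive independence assumption: multiply both factors
naive_weight = math.log2(bf_city * bf_postcode)

# But if postcode nearly determines city, city agreement adds almost
# no *new* information; the honest weight is closer to the postcode
# factor alone, so independence overstates the evidence
honest_weight = math.log2(bf_postcode)

print(f"naive: {naive_weight:.1f} bits, honest: {honest_weight:.1f} bits")
```

Multiplying Bayes factors for highly correlated columns double-counts the same evidence, inflating match probabilities.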

+

Splink is not designed for linking a single column containing a 'bag of words'. For example, a table with a single 'company name' column, and no other details.

+

Support

+

If, after reading the documentation, you still have questions, please feel free to post on our discussion forum.

+

Use Cases

+

Here is a list of some of our known users and their use cases:

+
+
+ +
+ +
+ +
+
    +
  • Marie Curie have used Splink to build a single customer view on fundraising data which has been a "huge success [...] the tooling is just so much better. [...] The power of being able to select, plug in, configure and train a tool versus writing code. It's just mind boggling actually." Amongst other benefits, the system is expected to "dramatically reduce manual reporting efforts previously required". See also the blog post here.
  • +
  • Club Brugge uses Splink to link football players from different data providers to their own database, simplifying and reducing the need for manual linkage labor.
  • +
+
+
+
+

Sadly, we don't hear about the majority of our users or what they are working on. If you have a use case and it is not shown here, please add it to the list!

+

Awards

+

🥈 Civil Service Awards 2023: Best Use of Data, Science, and Technology - Runner up

+

🥇 Analysis in Government Awards 2022: People's Choice Award - Winner

+

🥈 Analysis in Government Awards 2022: Innovative Methods - Runner up

+

🥇 Analysis in Government Awards 2020: Innovative Methods - Winner

+

🥇 Ministry of Justice Data and Analytical Services Directorate (DASD) Awards 2020: Innovation and Impact - Winner

+

Citation

+

If you use Splink in your research, we'd be grateful for a citation as follows:

+
@article{Linacre_Lindsay_Manassis_Slade_Hepworth_2022,
+    title        = {Splink: Free software for probabilistic record linkage at scale.},
+    author       = {Linacre, Robin and Lindsay, Sam and Manassis, Theodore and Slade, Zoe and Hepworth, Tom and Kennedy, Ross and Bond, Andrew},
+    year         = 2022,
+    month        = {Aug.},
+    journal      = {International Journal of Population Data Science},
+    volume       = 7,
+    number       = 3,
+    doi          = {10.23889/ijpds.v7i3.1794},
+    url          = {https://ijpds.org/article/view/1794},
+}
+
+

Acknowledgements

+

We are very grateful to ADR UK (Administrative Data Research UK) for providing the initial funding for this work as part of the Data First project.

+

We are extremely grateful to professors Katie Harron, James Doidge and Peter Christen for their expert advice and guidance in the development of Splink. We are also very grateful to colleagues at the UK's Office for National Statistics for their expert advice and peer review of this work. Any errors remain our own.

+ + + + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/javascripts/mathjax.js b/javascripts/mathjax.js new file mode 100644 index 0000000000..0f4b6e694e --- /dev/null +++ b/javascripts/mathjax.js @@ -0,0 +1,16 @@ +window.MathJax = { + tex: { + inlineMath: [["\\(", "\\)"]], + displayMath: [["\\[", "\\]"]], + processEscapes: true, + processEnvironments: true + }, + options: { + ignoreHtmlClass: ".*|", + processHtmlClass: "arithmatex" + } + }; + + document$.subscribe(() => { + MathJax.typesetPromise() + }) \ No newline at end of file diff --git a/js/mkdocs-charts-plugin.js b/js/mkdocs-charts-plugin.js new file mode 100644 index 0000000000..056a4d4292 --- /dev/null +++ b/js/mkdocs-charts-plugin.js @@ -0,0 +1,246 @@ +// Adapted from https://github.com/koaning/justcharts/blob/main/justcharts.js +async function fetchSchema(url){ + var resp = await fetch(url); + var schema = await resp.json(); + return schema +} + +function checkNested(obj /*, level1, level2, ... 
levelN*/) { + var args = Array.prototype.slice.call(arguments, 1); + + for (var i = 0; i < args.length; i++) { + if (!obj || !obj.hasOwnProperty(args[i])) { + return false; + } + obj = obj[args[i]]; + } + return true; + } + + + +function classnameInParents(el, classname) { + // check if class name in any parents + while (el.parentNode) { + el = el.parentNode; + if (el.classList === undefined) { + continue; + } + if (el.classList.contains(classname) ){ + return true; + } + } + return false; +} + +function findElementInParents(el, classname) { + while (el.parentNode) { + el = el.parentNode; + if (el.classList === undefined) { + continue; + } + if (el.classList.contains(classname) ){ + return el; + } + } + return null; +} + +function findProperChartWidth(el) { + + // mkdocs-material theme uses 'md-content' + var parent = findElementInParents(el, "md-content") + + // mkdocs theme uses 'col-md-9' + if (parent === undefined || parent == null) { + var parent = findElementInParents(el, "col-md-9") + } + if (parent === undefined || parent == null) { + // we can't find a suitable content parent + // 800 width is a good default + return '800' + } else { + // Use full width of parent + // Should bparent.offsetWidth - parseFloat(computedStyle.paddingLeft) - parseFloat(computedStyle.paddingRight) e equilavent to width: 100% + computedStyle = getComputedStyle(parent) + return parent.offsetWidth - parseFloat(computedStyle.paddingLeft) - parseFloat(computedStyle.paddingRight) + } +} + +function updateURL(url) { + // detect if absolute UR: + // credits https://stackoverflow.com/a/19709846 + var r = new RegExp('^(?:[a-z]+:)?//', 'i'); + if (r.test(url)) { + return url; + } + + // If 'use_data_path' is set to true + // schema and data urls are relative to + // 'data_path', not the to current page + // We need to update the specified URL + // to point to the actual location relative to current page + // Example: + // Actual location data file: docs/assets/data.csv + // Page: 
docs/folder/page.md + // data url in page's schema: assets/data.csv + // data_path in plugin settings: "" + // use_data_path in plugin settings: True + // path_to_homepage: ".." (this was detected in plugin on_post_page() event) + // output url: "../assets/data.csv" + if (mkdocs_chart_plugin['use_data_path'] == "True") { + new_url = window.location.href + new_url = new_url.endsWith('/') ? new_url.slice(0, -1) : new_url; + + if (mkdocs_chart_plugin['path_to_homepage'] != "") { + new_url += "/" + mkdocs_chart_plugin['path_to_homepage'] + } + + new_url = new_url.endsWith('/') ? new_url.slice(0, -1) : new_url; + new_url += "/" + url + new_url = new_url.endsWith('/') ? new_url.slice(0, -1) : new_url; + + if (mkdocs_chart_plugin['data_path'] != "") { + new_url += "/" + mkdocs_chart_plugin['data_path'] + } + + return new_url + } + return url; +} + +var vegalite_charts = []; + +function embedChart(block, schema) { + + // Make sure the schema is specified + let baseSchema = { + "$schema": "https://vega.github.io/schema/vega-lite/v5.json", + } + schema = Object.assign({}, baseSchema, schema); + + // If width is not set at all, + // default is set to 'container' + // Note we inserted .. 
+ // So 'container' will use 100% width + if (!('width' in schema)) { + schema.width = mkdocs_chart_plugin['vega_width'] + } + + // Set default height if not specified + // if (!('height' in schema)) { + // schema.height = mkdocs_chart_plugin['default_height'] + // } + + // charts widths are screwed in content tabs (thinks its zero width) + // https://squidfunk.github.io/mkdocs-material/reference/content-tabs/?h= + // we need to set an explicit, absolute width in those cases + // detect if chart is in tabbed-content: + if (classnameInParents(block, "tabbed-content")) { + var chart_width = schema.width || 'notset'; + if (isNaN(chart_width)) { + schema.width = findProperChartWidth(block); + } + } + + // Update URL if 'use_data_path' is configured + if (schema?.data?.url !== undefined) { + schema.data.url = updateURL(schema.data.url) + } + if (schema?.spec?.data?.url !== undefined) { + schema.spec.data.url = updateURL(schema.spec.data.url) + } + // see docs/assets/data/geo_choropleth.json for example + if (schema.transform) { + for (const t of schema.transform) { + if (t?.from?.data?.url !== undefined) { + t.from.data.url = updateURL(t.from.data.url) + } + } + } + + + + + // Save the block and schema + // This way we can re-render the block + // in a different theme + vegalite_charts.push({'block' : block, 'schema': schema}); + + // mkdocs-material has a dark mode + // detect which one is being used + var theme = (document.querySelector('body').getAttribute('data-md-color-scheme') == 'slate') ? 
mkdocs_chart_plugin['vega_theme_dark'] : mkdocs_chart_plugin['vega_theme']; + + // Render the chart + vegaEmbed(block, schema, { + actions: false, + "theme": theme, + "renderer": mkdocs_chart_plugin['vega_renderer'] + }); +} + +// Adapted from +// https://facelessuser.github.io/pymdown-extensions/extensions/superfences/#uml-diagram-example +// https://github.com/koaning/justcharts/blob/main/justcharts.js +const chartplugin = className => { + + // Find all of our vegalite sources and render them. + const blocks = document.querySelectorAll('vegachart'); + + for (let i = 0; i < blocks.length; i++) { + + const block = blocks[i] + const block_json = JSON.parse(block.textContent); + + // get the vegalite JSON + if ('schema-url' in block_json) { + + var url = updateURL(block_json['schema-url']) + fetchSchema(url).then( + schema => embedChart(block, schema) + ); + } else { + embedChart(block, block_json); + } + + } + } + + +// mkdocs-material has a dark mode including a toggle +// We should watch when dark mode changes and update charts accordingly + +var bodyelement = document.querySelector('body'); +var observer = new MutationObserver(function(mutations) { + mutations.forEach(function(mutation) { + if (mutation.type === "attributes") { + + if (mutation.attributeName == "data-md-color-scheme") { + + var theme = (bodyelement.getAttribute('data-md-color-scheme') == 'slate') ? 
mkdocs_chart_plugin['vega_theme_dark'] : mkdocs_chart_plugin['vega_theme']; + for (let i = 0; i < vegalite_charts.length; i++) { + vegaEmbed(vegalite_charts[i].block, vegalite_charts[i].schema, { + actions: false, + "theme": theme, + "renderer": mkdocs_chart_plugin['vega_renderer'] + }); + } + } + + } + }); + }); +observer.observe(bodyelement, { +attributes: true //configure it to listen to attribute changes +}); + + +// Load when DOM ready +if (typeof document$ !== "undefined") { + // compatibility with mkdocs-material's instant loading feature + document$.subscribe(function() { + chartplugin("vegalite") + }) +} else { + document.addEventListener("DOMContentLoaded", () => {chartplugin("vegalite")}) +} diff --git a/objects.inv b/objects.inv new file mode 100644 index 0000000000..f26a68be62 Binary files /dev/null and b/objects.inv differ diff --git a/overrides/main.html b/overrides/main.html new file mode 100644 index 0000000000..cf702f5ac0 --- /dev/null +++ b/overrides/main.html @@ -0,0 +1,9 @@ +{% extends "base.html" %} + +{% block announce %} + +
🆕 Check out our latest blog post exploring Bias in Data Linking! 🆕
+ +
Still using Splink 3 and looking for the old docs? You can find them here
+ +{% endblock %} \ No newline at end of file diff --git a/profile_columns_tooltip_1.png b/profile_columns_tooltip_1.png new file mode 100644 index 0000000000..fd4570908c Binary files /dev/null and b/profile_columns_tooltip_1.png differ diff --git a/search/search_index.json b/search/search_index.json new file mode 100644 index 0000000000..5eb9382916 --- /dev/null +++ b/search/search_index.json @@ -0,0 +1 @@ +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"index.html","title":"Home","text":"

Info

\ud83c\udf89 Splink 4 has been released! Examples of new syntax are here and a release announcement is here.

"},{"location":"index.html#fast-accurate-and-scalable-probabilistic-data-linkage","title":"Fast, accurate and scalable probabilistic data linkage","text":"

Splink is a Python package for probabilistic record linkage (entity resolution) that allows you to deduplicate and link records from datasets without unique identifiers.

Get Started with Splink

"},{"location":"index.html#key-features","title":"Key Features","text":"

\u26a1 Speed: Capable of linking a million records on a laptop in approximately one minute. \ud83c\udfaf Accuracy: Full support for term frequency adjustments and user-defined fuzzy matching logic. \ud83c\udf10 Scalability: Execute linkage jobs in Python (using DuckDB) or big-data backends like AWS Athena or Spark for 100+ million records. \ud83c\udf93 Unsupervised Learning: No training data is required, as models can be trained using an unsupervised approach. \ud83d\udcca Interactive Outputs: Provides a wide range of interactive outputs to help users understand their model and diagnose linkage problems.

Splink's core linkage algorithm is based on Fellegi-Sunter's model of record linkage, with various customizations to improve accuracy.
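The core idea of the Fellegi-Sunter model can be sketched in a few lines of plain Python. This is an illustrative sketch, not Splink's API: the m and u probabilities below are hypothetical values, and the even prior odds are an assumption for the example.

```python
import math

# Sketch of Fellegi-Sunter scoring (illustrative values, not Splink internals).
# Each comparison contributes a Bayes factor m/u, where m is the probability
# of observing that outcome among true matches and u the probability among
# non-matches. Taking log2 gives a "match weight", so independent evidence
# can simply be added.

def match_weight(m: float, u: float) -> float:
    """log2 Bayes factor for a single comparison outcome."""
    return math.log2(m / u)

# Hypothetical m/u values for two comparisons on one candidate record pair
comparisons = [
    {"name": "first_name exact match", "m": 0.90, "u": 0.010},
    {"name": "dob exact match", "m": 0.95, "u": 0.001},
]

total_weight = sum(match_weight(c["m"], c["u"]) for c in comparisons)

# Convert the total match weight back to a probability,
# assuming even prior odds for simplicity
probability = 2**total_weight / (1 + 2**total_weight)
```

In practice Splink estimates the m and u parameters from the data (e.g. via expectation maximisation) and folds in the prior probability that two random records match; this sketch only shows how the per-comparison evidence combines.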

"},{"location":"index.html#what-does-splink-do","title":"What does Splink do?","text":"

Consider the following records that lack a unique person identifier:

Splink predicts which rows link together:

and clusters these links to produce an estimated person ID:

"},{"location":"index.html#what-data-does-splink-work-best-with","title":"What data does Splink work best with?","text":"

Before using Splink, input data should be standardised, with consistent column names and formatting (e.g., lowercased, punctuation cleaned up, etc.).

Splink performs best with input data containing multiple columns that are not highly correlated. For instance, if the entity type is persons, you may have columns for full name, date of birth, and city. If the entity type is companies, you could have columns for name, turnover, sector, and telephone number.

High correlation occurs when the value of a column is highly constrained (predictable) from the value of another column. For example, a 'city' field is almost perfectly correlated with 'postcode'. Gender is highly correlated with 'first name'. Correlation is particularly problematic if all of your input columns are highly correlated.

Splink is not designed for linking a single column containing a 'bag of words'. For example, a table with a single 'company name' column, and no other details.

"},{"location":"index.html#support","title":"Support","text":"

If after reading the documentation you still have questions, please feel free to post on our discussion forum.

"},{"location":"index.html#use-cases","title":"Use Cases","text":"

Here is a list of some of our known users and their use cases:

Public Sector (UK)Public Sector (International)AcademiaOther
  • Ministry of Justice created linked datasets (combining courts, prisons and probation data) for use by researchers as part of the Data First programme
  • Office for National Statistics's Business Index (formerly the Inter Departmental Business Register), Demographic Index and the 2021 Census
  • Lewisham Council (London) identified and auto-enrolled over 500 additional eligible families to receive Free School Meals
  • London Office of Technology and Innovation created a dashboard to help better measure and reduce rough sleeping across London
  • Competition and Markets Authority identified 'Persons with Significant Control' and estimated ownership groups across companies
  • Office for Health Improvement and Disparities linked Health and Justice data to assess the pathways between probation and specialist alcohol and drug treatment services as part of the Better Outcomes through Linked Data programme
  • Ministry of Defence recently launched their Veteran's Card system which uses Splink to verify applicants against historic records. This project was shortlisted for the Civil Service Awards
  • Gateshead Council, in partnership with the National Innovation Centre for Data are creating a single view of debt
  • The German Federal Statistical Office (Destatis) uses Splink to conduct projects in linking register-based census data.
  • Chilean Ministry of Health and University College London have assessed the access to immunisation programs among the migrant population
  • Florida Cancer Registry, published a feasibility study which showed Splink was faster and more accurate than alternatives
  • Catalyst Cooperative's Public Utility Data Liberation Project links public financial and operational data from electric utilities for use by US climate advocates, policymakers, and researchers seeking to accelerate the transition away from fossil fuels.
  • Stanford University investigated the impact that receiving government assistance has on political attitudes
  • Bern University researched how Active Learning can be applied to Biomedical Record Linkage
  • Marie Curie have used Splink to build a single customer view on fundraising data which has been a \"huge success [...] the tooling is just so much better. [...] The power of being able to select, plug in, configure and train a tool versus writing code. It's just mind boggling actually.\" Amongst other benefits, the system is expected to \"dramatically reduce manual reporting efforts previously required\". See also the blog post here.
  • Club Brugge uses Splink to link football players from different data providers to their own database, simplifying and reducing the need for manual linkage labor.

Sadly, we don't hear about the majority of our users or what they are working on. If you have a use case and it is not shown here please add it to the list!

"},{"location":"index.html#awards","title":"Awards","text":"

\ud83e\udd48 Civil Service Awards 2023: Best Use of Data, Science, and Technology - Runner up

\ud83e\udd47 Analysis in Government Awards 2022: People's Choice Award - Winner

\ud83e\udd48 Analysis in Government Awards 2022: Innovative Methods - Runner up

\ud83e\udd47 Analysis in Government Awards 2020: Innovative Methods - Winner

\ud83e\udd47 Ministry of Justice Data and Analytical Services Directorate (DASD) Awards 2020: Innovation and Impact - Winner

"},{"location":"index.html#citation","title":"Citation","text":"

If you use Splink in your research, we'd be grateful for a citation as follows:

@article{Linacre_Lindsay_Manassis_Slade_Hepworth_2022,\n    title        = {Splink: Free software for probabilistic record linkage at scale.},\n    author       = {Linacre, Robin and Lindsay, Sam and Manassis, Theodore and Slade, Zoe and Hepworth, Tom and Kennedy, Ross and Bond, Andrew},\n    year         = 2022,\n    month        = {Aug.},\n    journal      = {International Journal of Population Data Science},\n    volume       = 7,\n    number       = 3,\n    doi          = {10.23889/ijpds.v7i3.1794},\n    url          = {https://ijpds.org/article/view/1794},\n}\n
"},{"location":"index.html#acknowledgements","title":"Acknowledgements","text":"

We are very grateful to ADR UK (Administrative Data Research UK) for providing the initial funding for this work as part of the Data First project.

We are extremely grateful to professors Katie Harron, James Doidge and Peter Christen for their expert advice and guidance in the development of Splink. We are also very grateful to colleagues at the UK's Office for National Statistics for their expert advice and peer review of this work. Any errors remain our own.

"},{"location":"getting_started.html","title":"Getting Started","text":""},{"location":"getting_started.html#getting-started","title":"Getting Started","text":""},{"location":"getting_started.html#install","title":"Install","text":"

Splink supports Python 3.8+.

To obtain the latest released version of Splink you can install from PyPI using pip:

pip install splink\n

or if you prefer, you can instead install Splink using conda:

conda install -c conda-forge splink\n
Backend Specific Installs"},{"location":"getting_started.html#backend-specific-installs","title":"Backend Specific Installs","text":"

From Splink v3.9.7, packages required by specific Splink backends can be optionally installed by adding the [<backend>] suffix to the end of your pip install.

Note that SQLite and DuckDB come packaged with Splink and do not need to be optionally installed.

The following backends are supported:

Spark Athena PostgreSQL
pip install 'splink[spark]'\n
pip install 'splink[athena]'\n
pip install 'splink[postgres]'\n
"},{"location":"getting_started.html#quickstart","title":"Quickstart","text":"

To get a basic Splink model up and running, use the following code. It demonstrates how to:

  1. Estimate the parameters of a deduplication model
  2. Use the parameter estimates to identify duplicate records
  3. Use clustering to generate an estimated unique person ID.
Simple Splink Model Example
import splink.comparison_library as cl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\ndb_api = DuckDBAPI()\n\ndf = splink_datasets.fake_1000\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.NameComparison(\"first_name\"),\n        cl.JaroAtThresholds(\"surname\"),\n        cl.DateOfBirthComparison(\n            \"dob\",\n            input_is_string=True,\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        cl.EmailComparison(\"email\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\", \"dob\"),\n        block_on(\"surname\"),\n    ]\n)\n\nlinker = Linker(df, settings, db_api)\n\nlinker.training.estimate_probability_two_random_records_match(\n    [block_on(\"first_name\", \"surname\")],\n    recall=0.7,\n)\n\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    block_on(\"first_name\", \"surname\")\n)\n\nlinker.training.estimate_parameters_using_expectation_maximisation(block_on(\"email\"))\n\npairwise_predictions = linker.inference.predict(threshold_match_weight=-5)\n\nclusters = linker.clustering.cluster_pairwise_predictions_at_threshold(\n    pairwise_predictions, 0.95\n)\n\ndf_clusters = clusters.as_pandas_dataframe(limit=5)\n
"},{"location":"getting_started.html#tutorials","title":"Tutorials","text":"

You can learn more about Splink in the step-by-step tutorial. Each has a corresponding Google Colab link to run the notebook in your browser.

"},{"location":"getting_started.html#example-notebooks","title":"Example Notebooks","text":"

You can see end-to-end example of several use cases in the example notebooks. Each has a corresponding Google Colab link to run the notebook in your browser.

"},{"location":"getting_started.html#getting-help","title":"Getting help","text":"

If after reading the documentation you still have questions, please feel free to post on our discussion forum.

"},{"location":"api_docs/api_docs_index.html","title":"Introduction","text":""},{"location":"api_docs/api_docs_index.html#api-documentation","title":"API Documentation","text":"

This section contains reference material for the modules and functions within Splink.

"},{"location":"api_docs/api_docs_index.html#api","title":"API","text":""},{"location":"api_docs/api_docs_index.html#linker","title":"Linker","text":"
  • Training
  • Visualisations
  • Inference
  • Clustering
  • Evaluation
  • Table Management
  • Miscellaneous functions
"},{"location":"api_docs/api_docs_index.html#comparisons","title":"Comparisons","text":"
  • Comparison Library
  • Comparison Level Library
"},{"location":"api_docs/api_docs_index.html#other","title":"Other","text":"
  • Exploratory
  • Blocking Analysis
  • Blocking
  • SplinkDataFrame
  • EM Training Session API
  • Column Expression API
"},{"location":"api_docs/api_docs_index.html#in-built-datasets","title":"In-built datasets","text":"

Information on pre-made data tables available within Splink suitable for linking, to get up-and-running or to try out ideas.

  • In-built datasets - information on included datasets, as well as how to use them, and methods for managing them.
"},{"location":"api_docs/api_docs_index.html#splink-settings","title":"Splink Settings","text":"

Reference materials for the Splink Settings dictionary:

  • Settings Dictionary Reference - for reference material on the parameters available within a Splink Settings dictionary.
"},{"location":"api_docs/blocking.html","title":"Blocking rule creator","text":"","tags":["API","blocking"]},{"location":"api_docs/blocking.html#documentation-forblock_on","title":"Documentation forblock_on","text":"

Generates blocking rules of equality conditions based on the columns or SQL expressions specified.

When multiple columns or SQL snippets are provided, the function generates a compound blocking rule, connecting individual match conditions with \"AND\" clauses.

Further information on equi-join conditions can be found here

Parameters:

Name Type Description Default col_names_or_exprs Union[str, ColumnExpression]

A list of input columns or SQL conditions you wish to create blocks on.

() salting_partitions (optional, int)

Whether to add salting to the blocking rule. More information on salting can be found within the docs.

None arrays_to_explode (optional, List[str])

List of arrays to explode before applying the blocking rule.

None

Examples:

from splink import block_on\nbr_1 = block_on(\"first_name\")\nbr_2 = block_on(\"substr(surname,1,2)\", \"surname\")\n
","tags":["API","blocking"]},{"location":"api_docs/blocking_analysis.html","title":"Blocking analysis","text":"","tags":["API","blocking"]},{"location":"api_docs/blocking_analysis.html#documentation-forsplinkblocking_analysis","title":"Documentation forsplink.blocking_analysis","text":"","tags":["API","blocking"]},{"location":"api_docs/blocking_analysis.html#splink.blocking_analysis.count_comparisons_from_blocking_rule","title":"count_comparisons_from_blocking_rule(*, table_or_tables, blocking_rule, link_type, db_api, unique_id_column_name='unique_id', source_dataset_column_name=None, compute_post_filter_count=True, max_rows_limit=int(1000000000.0))","text":"

Analyse a blocking rule to understand the number of comparisons it will generate.

Read more about the definition of pre and post filter conditions here

Parameters:

Name Type Description Default table_or_tables (dataframe, str)

Input data

required blocking_rule Union[BlockingRuleCreator, str, Dict[str, Any]]

The blocking rule to analyse

required link_type user_input_link_type_options

The link type - \"link_only\", \"dedupe_only\" or \"link_and_dedupe\"

required db_api DatabaseAPISubClass

Database API

required unique_id_column_name str

Defaults to \"unique_id\".

'unique_id' source_dataset_column_name Optional[str]

Defaults to None.

None compute_post_filter_count bool

Whether to use a slower methodology to calculate how many comparisons will be generated post filter conditions. Defaults to True.

True max_rows_limit int

Calculation of post filter counts will only proceed if the fast method returns a value below this limit. Defaults to int(1e9).

int(1000000000.0)

Returns:

Type Description dict[str, Union[int, str]]

dict[str, Union[int, str]]: A dictionary containing the results

","tags":["API","blocking"]},{"location":"api_docs/blocking_analysis.html#splink.blocking_analysis.cumulative_comparisons_to_be_scored_from_blocking_rules_chart","title":"cumulative_comparisons_to_be_scored_from_blocking_rules_chart(*, table_or_tables, blocking_rules, link_type, db_api, unique_id_column_name='unique_id', max_rows_limit=int(1000000000.0), source_dataset_column_name=None)","text":"","tags":["API","blocking"]},{"location":"api_docs/blocking_analysis.html#splink.blocking_analysis.cumulative_comparisons_to_be_scored_from_blocking_rules_data","title":"cumulative_comparisons_to_be_scored_from_blocking_rules_data(*, table_or_tables, blocking_rules, link_type, db_api, unique_id_column_name='unique_id', max_rows_limit=int(1000000000.0), source_dataset_column_name=None)","text":"","tags":["API","blocking"]},{"location":"api_docs/blocking_analysis.html#splink.blocking_analysis.n_largest_blocks","title":"n_largest_blocks(*, table_or_tables, blocking_rule, link_type, db_api, n_largest=5)","text":"

Find the values responsible for creating the largest blocks of records.

For example, when blocking on first name and surname, the 'John Smith' block might be the largest block of records. In cases where values are highly skewed, a few values may be responsible for generating a large proportion of all comparisons. This function helps you find the culprit values.

The analysis is performed pre filter conditions, read more about what this means here

Parameters:

Name Type Description Default table_or_tables (dataframe, str)

Input data

required blocking_rule Union[BlockingRuleCreator, str, Dict[str, Any]]

The blocking rule to analyse

required link_type user_input_link_type_options

The link type - \"link_only\", \"dedupe_only\" or \"link_and_dedupe\"

required db_api DatabaseAPISubClass

Database API

required n_largest int

How many rows to return. Defaults to 5.

5

Returns:

Name Type Description SplinkDataFrame 'SplinkDataFrame'

A dataframe containing the n_largest blocks

","tags":["API","blocking"]},{"location":"api_docs/clustering.html","title":"Clustering","text":"","tags":["API","Clustering"]},{"location":"api_docs/clustering.html#methods-in-linkerclustering","title":"Methods in Linker.clustering","text":"

Cluster the results of the linkage model and analyse clusters, accessed via linker.clustering.

","tags":["API","Clustering"]},{"location":"api_docs/clustering.html#splink.internals.linker_components.clustering.LinkerClustering.cluster_pairwise_predictions_at_threshold","title":"cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=None)","text":"

Clusters the pairwise match predictions that result from linker.inference.predict() into groups of connected records using the connected components graph clustering algorithm

Records with an estimated match_probability at or above threshold_match_probability are considered to be a match (i.e. they represent the same entity).

Parameters:

Name Type Description Default df_predict SplinkDataFrame

The results of linker.predict()

required threshold_match_probability float

Pairwise comparisons with a match_probability at or above this threshold are matched

None

Returns:

Name Type Description SplinkDataFrame SplinkDataFrame

A SplinkDataFrame containing a list of all IDs, clustered into groups based on the desired match threshold.

","tags":["API","Clustering"]},{"location":"api_docs/clustering.html#splink.internals.linker_components.clustering.LinkerClustering.compute_graph_metrics","title":"compute_graph_metrics(df_predict, df_clustered, *, threshold_match_probability=None)","text":"

Generates tables containing graph metrics (for nodes, edges and clusters), and returns a data class of Splink dataframes

Parameters:

Name Type Description Default df_predict SplinkDataFrame

The results of linker.inference.predict()

required df_clustered SplinkDataFrame

The outputs of linker.clustering.cluster_pairwise_predictions_at_threshold()

required threshold_match_probability float

Filter the pairwise match predictions to include only pairwise comparisons with a match_probability at or above this threshold. If not provided, the value will be taken from metadata on df_clustered. If no such metadata is available, this value must be provided.

None

Returns:

Name Type Description GraphMetricsResult GraphMetricsResults

A data class containing SplinkDataFrames

GraphMetricsResults

of cluster IDs and selected node, edge or cluster metrics. attribute \"nodes\" for nodes metrics table attribute \"edges\" for edge metrics table attribute \"clusters\" for cluster metrics table

","tags":["API","Clustering"]},{"location":"api_docs/column_expression.html","title":"Column Expressions","text":"","tags":["API","comparisons","blocking"]},{"location":"api_docs/column_expression.html#column-expressions","title":"Column Expressions","text":"

In comparisons, you may wish to consider expressions which are not simply columns of your input table. For instance you may have a forename column in your data, but when comparing records you may wish to also use the values in this column transformed all to lowercase, or just the first three letters of the name, or perhaps both of these transformations taken together.

If it is feasible to do so, then it may be best to derive a new column containing the transformed data. Particularly if it is an expensive calculation, or you wish to refer to it many times, deriving the column once on your input data may well be preferable, as it is cheaper than doing so directly in comparisons where each input record may need to be processed many times. However, there may be situations where you don't wish to derive a new column, perhaps for large data where you have many such transformations, or when you are experimenting with different models.

This is where a ColumnExpression may be used. It represents some SQL expression, which may be a column, or some more complicated construct, to which you can also apply zero or more transformations. These are lazily evaluated, and in particular will not be tied to a specific SQL dialect until they are put (via settings) into a linker.
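The lazy-evaluation pattern described above can be illustrated with a toy sketch. This is not Splink's actual internals — `LazyColumnExpression` is a hypothetical class — it just shows how transforms can be recorded as functions and only rendered into SQL once the dialect is known.

```python
# A toy sketch of the lazy-evaluation pattern, NOT Splink's implementation.
# Transforms are accumulated as callables; SQL is only built at render time,
# when the dialect is finally available.

class LazyColumnExpression:
    def __init__(self, raw_sql, transforms=None):
        self.raw_sql = raw_sql
        self.transforms = transforms or []

    def lower(self):
        # Record the transform; do not build any SQL yet
        return LazyColumnExpression(
            self.raw_sql, self.transforms + [lambda sql, dialect: f"lower({sql})"]
        )

    def substr(self, start, length):
        return LazyColumnExpression(
            self.raw_sql,
            self.transforms
            + [lambda sql, dialect: f"substr({sql}, {start}, {length})"],
        )

    def to_sql(self, dialect):
        # The dialect is only needed here, at render time
        sql = self.raw_sql
        for transform in self.transforms:
            sql = transform(sql, dialect)
        return sql

surname_initial = LazyColumnExpression("surname").substr(1, 1).lower()
rendered = surname_initial.to_sql("duckdb")  # "lower(substr(surname, 1, 1))"
```

Because each chained call returns a new object, a partially-built expression can be reused and extended without mutating the original.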

Term frequency adjustments

One caveat to using a ColumnExpression is that it cannot be combined with term frequency adjustments. Term frequency adjustments can only be computed on the raw values in a column prior to any function transforms.

If you wish to use term frequencies with transformations of an input column, you must pre-compute a new column in your input data with the transforms applied, instead of a ColumnExpression.

from splink import ColumnExpression\n\nemail_lowercase = ColumnExpression(\"email\").lower()\ndob_as_string = ColumnExpression(\"dob\").cast_to_string()\nsurname_initial_lowercase = ColumnExpression(\"surname\").substr(1, 1).lower()\nentry_date = ColumnExpression(\"entry_date_str\").try_parse_date(date_format=\"YYYY-MM-DD\")\nfull_name_lowercase = ColumnExpression(\"first_name || ' ' || surname\").lower()\n

You can use a ColumnExpression in most places where you might also use a simple column name, such as in a library comparison, a library comparison level, or in a blocking rule:

from splink import block_on\nimport splink.comparison_library as cl\nimport splink.comparison_level_library as cll\n\nfull_name_lower_br = block_on([full_name_lowercase])\n\nemail_comparison = cl.DamerauLevenshteinAtThresholds(email_lowercase, distance_threshold_or_thresholds=[1, 3])\nentry_date_comparison = cl.AbsoluteTimeDifferenceAtThresholds(\n    entry_date,\n    input_is_string=False,\n    metrics=[\"day\", \"day\"],\n    thresholds=[1, 10],\n)\nname_comparison = cl.CustomComparison(\n    comparison_levels=[\n        cll.NullLevel(full_name_lowercase),\n        cll.ExactMatch(full_name_lowercase),\n        cll.ExactMatch(\"surname\"),\n        cll.ExactMatch(\"first_name\"),\n        cll.ExactMatch(surname_initial_lowercase),\n        cll.ElseLevel()\n    ],\n    output_column_name=\"name\",\n)\n
","tags":["API","comparisons","blocking"]},{"location":"api_docs/column_expression.html#columnexpression","title":"ColumnExpression","text":"

Enables transforms to be applied to a column before it's passed into a comparison level.

Dialect agnostic. Execution is delayed until the dialect is known.

For example
from splink.column_expression import ColumnExpression\ncol = (\n    ColumnExpression(\"first_name\")\n    .lower()\n    .regex_extract(\"^[A-Z]{1,4}\")\n)\n\nExactMatchLevel(col)\n

Note that this will typically be created without a dialect, and the dialect will later be populated when the ColumnExpression is passed via a comparison level creator into a Linker.

","tags":["API","comparisons","blocking"]},{"location":"api_docs/column_expression.html#splink.internals.column_expression.ColumnExpression.lower","title":"lower()","text":"

Applies a lowercase transform to the input expression.

","tags":["API","comparisons","blocking"]},{"location":"api_docs/column_expression.html#splink.internals.column_expression.ColumnExpression.substr","title":"substr(start, length)","text":"

Applies a substring transform to the input expression of a given length starting from a specified index.

Parameters:

Name Type Description Default start int

The starting index of the substring.

required length int

The length of the substring.

required","tags":["API","comparisons","blocking"]},{"location":"api_docs/column_expression.html#splink.internals.column_expression.ColumnExpression.cast_to_string","title":"cast_to_string()","text":"

Applies a cast to string transform to the input expression.

","tags":["API","comparisons","blocking"]},{"location":"api_docs/column_expression.html#splink.internals.column_expression.ColumnExpression.regex_extract","title":"regex_extract(pattern, capture_group=0)","text":"

Applies a regex extract transform to the input expression.

Parameters:

Name Type Description Default pattern str

The regex pattern to match.

required capture_group int

The capture group to extract from the matched pattern. Defaults to 0, meaning the full pattern is extracted

0","tags":["API","comparisons","blocking"]},{"location":"api_docs/column_expression.html#splink.internals.column_expression.ColumnExpression.try_parse_date","title":"try_parse_date(date_format=None)","text":"

Applies a 'try parse date' transform to the input expression.

Parameters:

Name Type Description Default date_format str

The date format to attempt to parse. Defaults to None, meaning the dialect-specific default format is used.

None","tags":["API","comparisons","blocking"]},{"location":"api_docs/column_expression.html#splink.internals.column_expression.ColumnExpression.try_parse_timestamp","title":"try_parse_timestamp(timestamp_format=None)","text":"

Applies a 'try parse timestamp' transform to the input expression.

Parameters:

Name Type Description Default timestamp_format str

The timestamp format to attempt to parse. Defaults to None, meaning the dialect-specific default format is used.

None","tags":["API","comparisons","blocking"]},{"location":"api_docs/comparison_level_library.html","title":"Comparison Level Library","text":"","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#documentation-for-the-comparison_level_library","title":"Documentation for the comparison_level_library","text":"","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.AbsoluteDifferenceLevel","title":"AbsoluteDifferenceLevel(col_name, difference_threshold)","text":"

Bases: ComparisonLevelCreator

Represents a comparison level where the absolute difference between two numerical values is within a specified threshold.

Parameters:

Name Type Description Default col_name str | ColumnExpression

Input column name or ColumnExpression.

required difference_threshold int | float

The maximum allowed absolute difference between the two values.

required","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.AbsoluteTimeDifferenceLevel","title":"AbsoluteTimeDifferenceLevel(col_name, *, input_is_string, threshold, metric, datetime_format=None)","text":"

Bases: ComparisonLevelCreator

Computes the absolute elapsed time between two dates (total duration).

This function computes the amount of time that has passed between two dates, in contrast to functions like date_diff found in some SQL backends, which count the number of full calendar intervals (e.g., months, years) crossed.

For instance, the difference between January 29th and March 2nd would be less than two months in terms of elapsed time, unlike a date_diff calculation that would give an answer of 2 calendar intervals crossed.

Note that the threshold is inclusive, e.g. a level with a 10 day threshold will include a difference in date of 10 days.
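The distinction between elapsed time and calendar intervals can be seen with plain Python dates. This is an illustrative sketch, not Splink code; the average month length used to express elapsed days as months is an assumption for illustration.

```python
from datetime import date

# The example from the text: January 29th to March 2nd
d1 = date(2023, 1, 29)
d2 = date(2023, 3, 2)

# Elapsed time between the dates, in days
elapsed_days = (d2 - d1).days  # 32 days

# A calendar-style month difference, as date_diff('month', ...) computes it
# in some SQL backends: count month boundaries crossed, ignoring the days
calendar_months = (d2.year - d1.year) * 12 + (d2.month - d1.month)  # 2

# Expressed as elapsed time using an average month length (assumption:
# 365.25 / 12 days per month), 32 days is less than two months
avg_days_per_month = 365.25 / 12
elapsed_months = elapsed_days / avg_days_per_month
```

So a calendar-interval calculation reports 2 months crossed, while the elapsed time is only around 1.05 months — which is why the two approaches can place the same record pair in different comparison levels.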

Parameters:

Name Type Description Default col_name str

The name of the input column containing the dates to compare

required input_is_string bool

Indicates if the input date/times are in string format, requiring parsing according to datetime_format.

required threshold int

The maximum allowed difference between the two dates, in units specified by metric.

required metric str

The unit of time to use when comparing the dates. Can be 'second', 'minute', 'hour', 'day', 'month', or 'year'.

required datetime_format str

The format string for parsing dates. ISO 8601 format used if not provided.

None","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.And","title":"And(*comparison_levels)","text":"

Bases: _Merge

Represents a comparison level that is an 'AND' of other comparison levels

Merge multiple ComparisonLevelCreators into a single ComparisonLevelCreator by merging their SQL conditions using a logical \"AND\".

Parameters:

Name Type Description Default *comparison_levels ComparisonLevelCreator | dict

These represent the comparison levels you wish to combine via 'AND'

()","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.ArrayIntersectLevel","title":"ArrayIntersectLevel(col_name, min_intersection)","text":"

Bases: ComparisonLevelCreator

Represents a comparison level based around the size of an intersection of arrays

Parameters:

Name Type Description Default col_name str

Input column name

required min_intersection int

The minimum cardinality of the intersection of arrays for this comparison level. Defaults to 1

required","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.ColumnsReversedLevel","title":"ColumnsReversedLevel(col_name_1, col_name_2, symmetrical=False)","text":"

Bases: ComparisonLevelCreator

Represents a comparison level where the columns are reversed. For example, if surname is in the forename field and vice versa

By default, the condition is col_name_1_l = col_name_2_r. If the symmetrical argument is True, the condition is col_name_1_l = col_name_2_r AND col_name_2_l = col_name_1_r.

Parameters:

Name Type Description Default col_name_1 str

First column, e.g. forename

required col_name_2 str

Second column, e.g. surname

required symmetrical bool

If True, equality is required in both directions. Default is False.

False","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.CustomLevel","title":"CustomLevel(sql_condition, label_for_charts=None, base_dialect_str=None)","text":"

Bases: ComparisonLevelCreator

Represents a comparison level with a custom sql expression

Must be in a form suitable for use in a SQL CASE WHEN expression e.g. \"substr(name_l, 1, 1) = substr(name_r, 1, 1)\"

Parameters:

Name Type Description Default sql_condition str

SQL condition to assess similarity

required label_for_charts str

A label for this level to be used in charts. Default None, so that sql_condition is used

None base_dialect_str str

If specified, the SQL dialect that this expression will be parsed as when attempting to translate to other backends

None","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.DamerauLevenshteinLevel","title":"DamerauLevenshteinLevel(col_name, distance_threshold)","text":"

Bases: ComparisonLevelCreator

A comparison level using a Damerau-Levenshtein distance function

e.g. damerau_levenshtein(val_l, val_r) <= distance_threshold

Parameters:

Name Type Description Default col_name str

Input column name

required distance_threshold int

The threshold to use to assess similarity
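For intuition, here is a pure-Python sketch of the (restricted) Damerau-Levenshtein metric; in Splink the distance is computed by the SQL backend's own function:

```python
def damerau_levenshtein(s: str, t: str) -> int:
    # Restricted (optimal string alignment) Damerau-Levenshtein distance:
    # the edits counted are insertions, deletions, substitutions and
    # transpositions of adjacent characters.
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i
    for j in range(len(t) + 1):
        d[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[len(s)][len(t)]

# A transposed pair ("smiht") counts as one edit, so it would match
# at distance_threshold = 1 where plain Levenshtein would need 2.
print(damerau_levenshtein("smith", "smiht"))
```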

required","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.DistanceFunctionLevel","title":"DistanceFunctionLevel(col_name, distance_function_name, distance_threshold, higher_is_more_similar=True)","text":"

Bases: ComparisonLevelCreator

A comparison level using an arbitrary distance function

e.g. custom_distance(val_l, val_r) >= (<=) distance_threshold

The function given by distance_function_name must exist in the SQL backend you use, and must take two parameters of the type in col_name, returning a numeric type

Parameters:

Name Type Description Default col_name str | ColumnExpression

Input column name

required distance_function_name str

the name of the SQL distance function

required distance_threshold Union[int, float]

The threshold to use to assess similarity

required higher_is_more_similar bool

Are higher values of the distance function more similar? (e.g. True for Jaro-Winkler, False for Levenshtein) Default is True

True","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.DistanceInKMLevel","title":"DistanceInKMLevel(lat_col, long_col, km_threshold, not_null=False)","text":"

Bases: ComparisonLevelCreator

Use the haversine formula to transform comparisons of lat,lngs into distances measured in kilometers

Parameters:

Name Type Description Default lat_col str

The name of a latitude column or the respective array or struct column containing the information. For example: long_lat['lat'] or long_lat[0]

required long_col str

The name of a longitude column or the respective array or struct column containing the information, plus an index. For example: long_lat['long'] or long_lat[1]

required km_threshold int

The total distance in kilometers to evaluate your comparisons against

required not_null bool

If true, ensure no attempt is made to compute this if any inputs are null. This is only necessary if you are not capturing nulls elsewhere in your comparison level.
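The haversine formula mentioned above can be sketched in plain Python (Splink generates the equivalent SQL for your backend; the coordinates below are just example values):

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, long) points in kilometres,
    # using a mean Earth radius of 6371 km.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# London to Paris is roughly 340 km, so this pair would fall within
# a km_threshold of 350 but not one of 100.
print(haversine_km(51.5074, -0.1278, 48.8566, 2.3522))
```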

False","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.ElseLevel","title":"ElseLevel","text":"

Bases: ComparisonLevelCreator

This level is used to capture all comparisons that do not match any other specified levels. It corresponds to the ELSE clause in a SQL CASE statement.

","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.ExactMatchLevel","title":"ExactMatchLevel(col_name, term_frequency_adjustments=False)","text":"

Bases: ComparisonLevelCreator

Represents a comparison level where there is an exact match

e.g. val_l = val_r

Parameters:

Name Type Description Default col_name str

Input column name

required term_frequency_adjustments bool

If True, apply term frequency adjustments to the exact match level. Defaults to False.

False","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.JaccardLevel","title":"JaccardLevel(col_name, distance_threshold)","text":"

Bases: ComparisonLevelCreator

A comparison level using a Jaccard distance function

e.g. jaccard(val_l, val_r) >= distance_threshold

Parameters:

Name Type Description Default col_name str

Input column name

required distance_threshold Union[int, float]

The threshold to use to assess similarity

required","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.JaroLevel","title":"JaroLevel(col_name, distance_threshold)","text":"

Bases: ComparisonLevelCreator

A comparison level using a Jaro distance function

e.g. jaro(val_l, val_r) >= distance_threshold

Parameters:

Name Type Description Default col_name str

Input column name

required distance_threshold Union[int, float]

The threshold to use to assess similarity

required","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.JaroWinklerLevel","title":"JaroWinklerLevel(col_name, distance_threshold)","text":"

Bases: ComparisonLevelCreator

A comparison level using a Jaro-Winkler distance function

e.g. jaro_winkler(val_l, val_r) >= distance_threshold

Parameters:

Name Type Description Default col_name str

Input column name

required distance_threshold Union[int, float]

The threshold to use to assess similarity

required","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.LevenshteinLevel","title":"LevenshteinLevel(col_name, distance_threshold)","text":"

Bases: ComparisonLevelCreator

A comparison level using a Levenshtein distance function

e.g. levenshtein(val_l, val_r) <= distance_threshold

Parameters:

Name Type Description Default col_name str

Input column name

required distance_threshold int

The threshold to use to assess similarity

required","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.LiteralMatchLevel","title":"LiteralMatchLevel(col_name, literal_value, literal_datatype, side_of_comparison='both')","text":"

Bases: ComparisonLevelCreator

Represents a comparison level where a column matches a literal value

e.g. val_l = 'literal' AND/OR val_r = 'literal'

Parameters:

Name Type Description Default col_name Union[str, ColumnExpression]

Input column name or ColumnExpression

required literal_value str

The literal value to compare against e.g. 'male'

required literal_datatype str

The datatype of the literal value. Must be one of: \"string\", \"int\", \"float\", \"date\"

required side_of_comparison str

Which side(s) of the comparison to apply. Must be one of: \"left\", \"right\", \"both\". Defaults to \"both\".

'both'","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.Not","title":"Not(comparison_level)","text":"

Bases: ComparisonLevelCreator

Represents a comparison level that is the negation of another comparison level

Resulting ComparisonLevelCreator is equivalent to the passed ComparisonLevelCreator but with its SQL condition negated with logical \"NOT\".

Parameters:

Name Type Description Default comparison_level ComparisonLevelCreator | dict

This represents the comparison level you wish to negate with 'NOT'

required","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.NullLevel","title":"NullLevel(col_name, valid_string_pattern=None)","text":"

Bases: ComparisonLevelCreator

Represents a comparison level where either or both values are NULL

e.g. val_l IS NULL OR val_r IS NULL

Parameters:

Name Type Description Default col_name Union[str, ColumnExpression]

Input column name or ColumnExpression

required valid_string_pattern str

If provided, a regex pattern to extract a valid substring from the column before checking for NULL. Default is None.

None Note

If a valid_string_pattern is provided, the NULL check will be performed on the extracted substring rather than the original column value.

","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.Or","title":"Or(*comparison_levels)","text":"

Bases: _Merge

Represents a comparison level that is an 'OR' of other comparison levels

Merge multiple ComparisonLevelCreators into a single ComparisonLevelCreator by merging their SQL conditions using a logical \"OR\".

Parameters:

Name Type Description Default *comparison_levels ComparisonLevelCreator | dict

These represent the comparison levels you wish to combine via 'OR'

()","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#splink.comparison_level_library.PercentageDifferenceLevel","title":"PercentageDifferenceLevel(col_name, percentage_threshold)","text":"

Bases: ComparisonLevelCreator

Represents a comparison level where the difference between two numerical values is within a specified percentage threshold.

The percentage difference is calculated as the absolute difference between the two values divided by the greater of the two values.

Parameters:

Name Type Description Default col_name str

Input column name.

required percentage_threshold float

The threshold percentage to use to assess similarity e.g. 0.1 for 10%.
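The calculation described above, sketched in plain Python:

```python
def percentage_difference(x: float, y: float) -> float:
    # Absolute difference divided by the greater of the two values.
    return abs(x - y) / max(x, y)

# With percentage_threshold = 0.1 (10%), 100 vs 105 falls within the
# level (difference ~4.8%) but 100 vs 120 does not (difference ~16.7%).
print(percentage_difference(100, 105))
print(percentage_difference(100, 120))
```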

required","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#absolutedatedifferenceatthresholds","title":"AbsoluteDateDifferenceAtThresholds","text":"

An alias of AbsoluteTimeDifferenceAtThresholds.

","tags":["API","comparisons"]},{"location":"api_docs/comparison_level_library.html#configuring-comparisons","title":"Configuring comparisons","text":"

Note that all comparison levels have a .configure() method as follows:

Configure the comparison level with options which are common to all comparison levels. The options align to the keys in the json specification of a comparison level. These options are usually not needed, but are available for advanced users.

All options have default values set initially. Any call to .configure() will set any options that are supplied. Any subsequent calls to .configure() will not override these values with defaults; to override values you must explicitly provide a value corresponding to the default.

Generally speaking only a single call (at most) to .configure() should be required.

Parameters:

Name Type Description Default m_probability float

The m probability for this comparison level. Default is equivalent to None, in which case a default initial value will be provided for this level.

unsupplied_option u_probability float

The u probability for this comparison level. Default is equivalent to None, in which case a default initial value will be provided for this level.

unsupplied_option tf_adjustment_column str

Make term frequency adjustments for this comparison level using this input column. Default is equivalent to None, meaning that term-frequency adjustments will not be applied for this level.

unsupplied_option tf_adjustment_weight float

Make term frequency adjustments for this comparison level using this weight. Default is equivalent to None, meaning term-frequency adjustments are fully-weighted if turned on.

unsupplied_option tf_minimum_u_value float

When term frequency adjustments are turned on, where the term frequency adjustment implies a u value below this value, use this minimum value instead. Defaults is equivalent to None, meaning no minimum value.

unsupplied_option is_null_level bool

If true, m and u values will not be estimated and instead the match weight will be zero for this column. Default is equivalent to False.

unsupplied_option label_for_charts str

If provided, a custom label that will be used for this level in any charts. Default is equivalent to None, in which case a default label will be provided for this level.

unsupplied_option disable_tf_exact_match_detection bool

If true, if term frequency adjustments are set, the corresponding adjustment will be made using the u-value for this level, rather than the usual case where it is the u-value of the exact match level in the same comparison. Default is equivalent to False.

unsupplied_option fix_m_probability bool

If true, the m probability for this level will be fixed and not estimated during training. Default is equivalent to False.

unsupplied_option fix_u_probability bool

If true, the u probability for this level will be fixed and not estimated during training. Default is equivalent to False.
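The "set once, never silently reset" behaviour of .configure() can be sketched as follows (an illustration of the semantics described above, not Splink's implementation; the class and option names are invented for the example):

```python
_UNSUPPLIED = object()  # stands in for the "unsupplied_option" default above

class LevelConfig:
    def __init__(self):
        self.options = {"m_probability": None, "u_probability": None}

    def configure(self, m_probability=_UNSUPPLIED, u_probability=_UNSUPPLIED):
        # Only options explicitly supplied are updated; options omitted
        # from a later call keep whatever value was set earlier.
        supplied = {"m_probability": m_probability, "u_probability": u_probability}
        for key, value in supplied.items():
            if value is not _UNSUPPLIED:
                self.options[key] = value
        return self

cfg = LevelConfig().configure(m_probability=0.9)
cfg.configure(u_probability=0.01)  # does not reset m_probability
print(cfg.options)
```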

unsupplied_option","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html","title":"Comparison Library","text":"","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html#documentation-for-the-comparison_library","title":"Documentation for the comparison_library","text":"","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html#splink.comparison_library.AbsoluteTimeDifferenceAtThresholds","title":"AbsoluteTimeDifferenceAtThresholds(col_name, *, input_is_string, metrics, thresholds, datetime_format=None, term_frequency_adjustments=False, invalid_dates_as_null=True)","text":"

Bases: ComparisonCreator

Represents a comparison of the data in col_name with multiple levels based on absolute time differences:

  • Exact match in col_name
  • Absolute time difference levels at specified thresholds
  • ...
  • Anything else

For example, with metrics = ['day', 'month'] and thresholds = [1, 3] the levels are:

  • Exact match in col_name
  • Absolute time difference in col_name <= 1 day
  • Absolute time difference in col_name <= 3 months
  • Anything else

This comparison uses the AbsoluteTimeDifferenceLevel, which computes the total elapsed time between two dates, rather than counting calendar intervals.

Parameters:

Name Type Description Default col_name str

The name of the column to compare.

required input_is_string bool

If True, the input dates are treated as strings and parsed according to datetime_format.

required metrics Union[DateMetricType, List[DateMetricType]]

The unit(s) of time to use when comparing dates. Can be 'second', 'minute', 'hour', 'day', 'month', or 'year'.

required thresholds Union[int, float, List[Union[int, float]]]

The threshold(s) to use for the time difference level(s).

required datetime_format str

The format string for parsing dates if input_is_string is True. ISO 8601 format used if not provided.

None term_frequency_adjustments bool

Whether to apply term frequency adjustments. Defaults to False.

False invalid_dates_as_null bool

If True and input_is_string is True, treat invalid dates as null. Defaults to True.
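Each metric is paired with the threshold at the same position, as in the worked example above (metrics = ['day', 'month'], thresholds = [1, 3]). A sketch of the resulting level descriptions:

```python
metrics = ["day", "month"]
thresholds = [1, 3]

levels = ["Exact match"]
for metric, threshold in zip(metrics, thresholds):
    # Each (metric, threshold) pair becomes one comparison level.
    levels.append(f"Absolute time difference <= {threshold} {metric}")
levels.append("Anything else")

print(levels)
```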

True","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html#splink.comparison_library.ArrayIntersectAtSizes","title":"ArrayIntersectAtSizes(col_name, size_threshold_or_thresholds=[1])","text":"

Bases: ComparisonCreator

Represents a comparison of the data in col_name with multiple levels based on the intersection sizes of array elements:

  • Intersection at specified size thresholds
  • ...
  • Anything else

For example, with size_threshold_or_thresholds = [3, 1], the levels are:

  • Intersection of arrays in col_name has at least 3 elements
  • Intersection of arrays in col_name has at least 1 element
  • Anything else (e.g., empty intersection)

Parameters:

Name Type Description Default col_name str

The name of the column to compare.

required size_threshold_or_thresholds Union[int, list[int]]

The size threshold(s) for the intersection levels. Defaults to [1].

[1]","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html#splink.comparison_library.CustomComparison","title":"CustomComparison(comparison_levels, output_column_name=None, comparison_description=None)","text":"

Bases: ComparisonCreator

Represents a comparison of the data with custom supplied levels.

Parameters:

Name Type Description Default output_column_name str

The column name to use to refer to this comparison

None comparison_levels list

A list of some combination of ComparisonLevelCreator objects, or dicts. These represent the similarity levels assessed by the comparison, in order of decreasing specificity

required comparison_description str

An optional description of the comparison

None","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html#splink.comparison_library.DamerauLevenshteinAtThresholds","title":"DamerauLevenshteinAtThresholds(col_name, distance_threshold_or_thresholds=[1, 2])","text":"

Bases: ComparisonCreator

Represents a comparison of the data in col_name with three or more levels:

  • Exact match in col_name
  • Damerau-Levenshtein levels at specified distance thresholds
  • ...
  • Anything else

For example, with distance_threshold_or_thresholds = [1, 3] the levels are

  • Exact match in col_name
  • Damerau-Levenshtein distance in col_name <= 1
  • Damerau-Levenshtein distance in col_name <= 3
  • Anything else

Parameters:

Name Type Description Default col_name str

The name of the column to compare.

required distance_threshold_or_thresholds Union[int, list]

The threshold(s) to use for the Damerau-Levenshtein similarity level(s). Defaults to [1, 2].

[1, 2]","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html#splink.comparison_library.DateOfBirthComparison","title":"DateOfBirthComparison(col_name, *, input_is_string, datetime_thresholds=[1, 1, 10], datetime_metrics=['month', 'year', 'year'], datetime_format=None, invalid_dates_as_null=True)","text":"

Bases: ComparisonCreator

Generate an 'out of the box' comparison for a date of birth column in the col_name provided.

Note that input_is_string is a required argument: you must denote whether the col_name is of type date/datetime or string.

The default arguments will give a comparison with comparison levels:

  • Exact match (all other dates)
  • Damerau-Levenshtein distance <= 1
  • Date difference <= 1 month
  • Date difference <= 1 year
  • Date difference <= 10 years
  • Anything else

Parameters:

Name Type Description Default col_name Union[str, ColumnExpression]

The column name

required input_is_string bool

If True, the provided col_name must be of type string. If False, it must be a date or datetime.

required datetime_thresholds Union[int, float, List[Union[int, float]]]

Numeric thresholds for date differences. Defaults to [1, 1, 10].

[1, 1, 10] datetime_metrics Union[DateMetricType, List[DateMetricType]]

Metrics for date differences. Defaults to [\"month\", \"year\", \"year\"].

['month', 'year', 'year'] datetime_format str

The datetime format used to cast strings to dates. Only used if input is a string.

None invalid_dates_as_null bool

If True, treat invalid dates as null as opposed to allowing e.g. an exact or levenshtein match where one side or both are an invalid date. Only used if input is a string. Defaults to True.

True","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html#splink.comparison_library.DistanceFunctionAtThresholds","title":"DistanceFunctionAtThresholds(col_name, distance_function_name, distance_threshold_or_thresholds, higher_is_more_similar=True)","text":"

Bases: ComparisonCreator

Represents a comparison of the data in col_name with three or more levels:

  • Exact match in col_name
  • Custom distance function levels at specified thresholds
  • ...
  • Anything else

For example, with distance_threshold_or_thresholds = [1, 3] and distance_function 'hamming', with higher_is_more_similar False the levels are:

  • Exact match in col_name
  • Hamming distance of col_name <= 1
  • Hamming distance of col_name <= 3
  • Anything else

Parameters:

Name Type Description Default col_name str

The name of the column to compare.

required distance_function_name str

the name of the SQL distance function

required distance_threshold_or_thresholds Union[float, list]

The threshold(s) to use for the distance function level(s).

required higher_is_more_similar bool

Are higher values of the distance function more similar? (e.g. True for Jaro-Winkler, False for Levenshtein) Default is True

True","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html#splink.comparison_library.DistanceInKMAtThresholds","title":"DistanceInKMAtThresholds(lat_col, long_col, km_thresholds)","text":"

Bases: ComparisonCreator

A comparison of the latitude, longitude coordinates defined in 'lat_col' and 'long_col', giving the great circle distance between them in km.

An example of the output with km_thresholds = [1, 10] would be:

  • The two coordinates are within 1 km of one another
  • The two coordinates are within 10 km of one another
  • Anything else (i.e. the coordinates are more than 10 km apart)

Parameters:

Name Type Description Default lat_col str

The name of the latitude column to compare.

required long_col str

The name of the longitude column to compare.

required km_thresholds iterable[float] | float

The km threshold(s) for the distance levels.

required","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html#splink.comparison_library.EmailComparison","title":"EmailComparison(col_name)","text":"

Bases: ComparisonCreator

Generate an 'out of the box' comparison for an email address column in the col_name provided.

The default comparison levels are:

  • Null comparison: e.g., one email is missing or invalid.
  • Exact match on full email: e.g., john@smith.com vs. john@smith.com.
  • Exact match on username part of email: e.g., john@company.com vs. john@other.com.
  • Jaro-Winkler similarity > 0.88 on full email: e.g., john.smith@company.com vs. john.smyth@company.com.
  • Jaro-Winkler similarity > 0.88 on username part of email: e.g., john.smith@company.com vs. john.smyth@other.com.
  • Anything else: e.g., john@company.com vs. rebecca@other.com.

Parameters:

Name Type Description Default col_name Union[str, ColumnExpression]

The column name or expression for the email addresses to be compared.
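The "username part" levels compare the portion of the address before the @. A minimal sketch of that idea (illustrative only, not Splink's SQL):

```python
def email_username(email: str) -> str:
    # Everything before the first "@" is treated as the username part.
    return email.split("@", 1)[0]

# Same username, different domains: this pair matches at the
# "exact match on username part" level but not on the full email.
print(email_username("john@company.com") == email_username("john@other.com"))
```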

required","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html#splink.comparison_library.ExactMatch","title":"ExactMatch(col_name)","text":"

Bases: ComparisonCreator

Represents a comparison of the data in col_name with two levels:

  • Exact match in col_name
  • Anything else

Parameters:

Name Type Description Default col_name str

The name of the column to compare

required","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html#splink.comparison_library.ForenameSurnameComparison","title":"ForenameSurnameComparison(forename_col_name, surname_col_name, *, jaro_winkler_thresholds=[0.92, 0.88], forename_surname_concat_col_name=None)","text":"

Bases: ComparisonCreator

Generate an 'out of the box' comparison for forename and surname columns in the forename_col_name and surname_col_name provided.

It's recommended to derive an additional column containing the concatenated forename and surname so that term frequencies can be applied to the full name. If you have derived such a column, provide it as forename_surname_concat_col_name.

The default comparison levels are:

  • Null comparison on both forename and surname
  • Exact match on both forename and surname
  • Columns reversed comparison (forename and surname swapped)
  • Jaro-Winkler similarity > 0.92 on both forename and surname
  • Jaro-Winkler similarity > 0.88 on both forename and surname
  • Exact match on surname
  • Exact match on forename
  • Anything else

Parameters:

Name Type Description Default forename_col_name Union[str, ColumnExpression]

The column name or expression for the forenames to be compared.

required surname_col_name Union[str, ColumnExpression]

The column name or expression for the surnames to be compared.

required jaro_winkler_thresholds Union[float, list[float]]

Thresholds for Jaro-Winkler similarity. Defaults to [0.92, 0.88].

[0.92, 0.88] forename_surname_concat_col_name str

The column name for concatenated forename and surname values. If provided, term frequencies are applied on the exact match using this column

None","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html#splink.comparison_library.JaccardAtThresholds","title":"JaccardAtThresholds(col_name, score_threshold_or_thresholds=[0.9, 0.7])","text":"

Bases: ComparisonCreator

Represents a comparison of the data in col_name with three or more levels:

  • Exact match in col_name
  • Jaccard score levels at specified thresholds
  • ...
  • Anything else

For example, with score_threshold_or_thresholds = [0.9, 0.7] the levels are:

  • Exact match in col_name
  • Jaccard score in col_name >= 0.9
  • Jaccard score in col_name >= 0.7
  • Anything else

Parameters:

Name Type Description Default col_name str

The name of the column to compare.

required score_threshold_or_thresholds Union[float, list]

The threshold(s) to use for the Jaccard similarity level(s). Defaults to [0.9, 0.7].

[0.9, 0.7]","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html#splink.comparison_library.JaroAtThresholds","title":"JaroAtThresholds(col_name, score_threshold_or_thresholds=[0.9, 0.7])","text":"

Bases: ComparisonCreator

Represents a comparison of the data in col_name with three or more levels:

  • Exact match in col_name
  • Jaro score levels at specified thresholds
  • ...
  • Anything else

For example, with score_threshold_or_thresholds = [0.9, 0.7] the levels are:

  • Exact match in col_name
  • Jaro score in col_name >= 0.9
  • Jaro score in col_name >= 0.7
  • Anything else

Parameters:

Name Type Description Default col_name str

The name of the column to compare.

required score_threshold_or_thresholds Union[float, list]

The threshold(s) to use for the Jaro similarity level(s). Defaults to [0.9, 0.7].

[0.9, 0.7]","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html#splink.comparison_library.JaroWinklerAtThresholds","title":"JaroWinklerAtThresholds(col_name, score_threshold_or_thresholds=[0.9, 0.7])","text":"

Bases: ComparisonCreator

Represents a comparison of the data in col_name with three or more levels:

  • Exact match in col_name
  • Jaro-Winkler score levels at specified thresholds
  • ...
  • Anything else

For example, with score_threshold_or_thresholds = [0.9, 0.7] the levels are:

  • Exact match in col_name
  • Jaro-Winkler score in col_name >= 0.9
  • Jaro-Winkler score in col_name >= 0.7
  • Anything else

Parameters:

Name Type Description Default col_name str

The name of the column to compare.

required score_threshold_or_thresholds Union[float, list]

The threshold(s) to use for the Jaro-Winkler similarity level(s). Defaults to [0.9, 0.7].

[0.9, 0.7]","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html#splink.comparison_library.LevenshteinAtThresholds","title":"LevenshteinAtThresholds(col_name, distance_threshold_or_thresholds=[1, 2])","text":"

Bases: ComparisonCreator

Represents a comparison of the data in col_name with three or more levels:

  • Exact match in col_name
  • Levenshtein levels at specified distance thresholds
  • ...
  • Anything else

For example, with distance_threshold_or_thresholds = [1, 3] the levels are

  • Exact match in col_name
  • Levenshtein distance in col_name <= 1
  • Levenshtein distance in col_name <= 3
  • Anything else

Parameters:

Name Type Description Default col_name str

The name of the column to compare

required distance_threshold_or_thresholds Union[int, list]

The threshold(s) to use for the levenshtein similarity level(s). Defaults to [1, 2].

[1, 2]","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html#splink.comparison_library.NameComparison","title":"NameComparison(col_name, *, jaro_winkler_thresholds=[0.92, 0.88, 0.7], dmeta_col_name=None)","text":"

Bases: ComparisonCreator

Generate an 'out of the box' comparison for a name column in the col_name provided.

It's also possible to include a level for a dmetaphone match, but this requires you to derive a dmetaphone column prior to importing it into Splink. Note this is expected to be a column containing arrays of dmetaphone values, which are of length 1 or 2.

The default comparison levels are:

  • Null comparison
  • Exact match
  • Jaro-Winkler similarity > 0.92
  • Jaro-Winkler similarity > 0.88
  • Jaro-Winkler similarity > 0.70
  • Anything else

Parameters:

Name Type Description Default col_name Union[str, ColumnExpression]

The column name or expression for the names to be compared.

required jaro_winkler_thresholds Union[float, list[float]]

Thresholds for Jaro-Winkler similarity. Defaults to [0.92, 0.88, 0.7].

[0.92, 0.88, 0.7] dmeta_col_name str

The column name for dmetaphone values. If provided, array intersection level is included. This column must contain arrays of dmetaphone values, which are of length 1 or 2.

None","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html#splink.comparison_library.PostcodeComparison","title":"PostcodeComparison(col_name, *, invalid_postcodes_as_null=False, lat_col=None, long_col=None, km_thresholds=[1, 10, 100])","text":"

Bases: ComparisonCreator

Generate an 'out of the box' comparison for a postcode column in the col_name provided.

The default comparison levels are:

  • Null comparison
  • Exact match on full postcode
  • Exact match on sector
  • Exact match on district
  • Exact match on area
  • Distance in km (if lat_col and long_col are provided)

It's also possible to include levels for distance in km, but this requires you to have geocoded your postcodes prior to importing them into Splink. Use the lat_col and long_col arguments to tell Splink where to find the latitude and longitude columns.

See https://ideal-postcodes.co.uk/guides/uk-postcode-format for definitions

Parameters:

Name Type Description Default col_name Union[str, ColumnExpression]

The column name or expression for the postcodes to be compared.

required invalid_postcodes_as_null bool

If True, treat invalid postcodes as null. Defaults to False.

False lat_col Union[str, ColumnExpression]

The column name or expression for latitude. Required if km_thresholds is provided.

None long_col Union[str, ColumnExpression]

The column name or expression for longitude. Required if km_thresholds is provided.

None km_thresholds Union[float, List[float]]

Thresholds for distance in kilometers. If provided, lat_col and long_col must also be provided.
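Per the UK postcode format guide linked above, the sector, district and area levels compare successively coarser parts of the postcode. A sketch of how those parts relate for a well-formed postcode (illustrative only, not Splink's implementation):

```python
import re

def postcode_parts(postcode: str) -> dict:
    outcode, incode = postcode.split()  # e.g. "SW1A", "1AA"
    return {
        "area": re.match(r"[A-Z]{1,2}", outcode).group(),  # leading letters, e.g. "SW"
        "district": outcode,                               # e.g. "SW1A"
        "sector": f"{outcode} {incode[0]}",                # e.g. "SW1A 1"
    }

print(postcode_parts("SW1A 1AA"))
```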

[1, 10, 100]","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html#absolutedatedifferenceatthresholds","title":"AbsoluteDateDifferenceAtThresholds","text":"

An alias of AbsoluteTimeDifferenceAtThresholds.

","tags":["API","comparisons"]},{"location":"api_docs/comparison_library.html#configuring-comparisons","title":"Configuring comparisons","text":"

Note that all comparisons have a .configure() method as follows:

Configure the comparison creator with options that are common to all comparisons.

For m and u probabilities, the first element in the list corresponds to the first comparison level, usually an exact match level. Subsequent elements correspond to comparison levels in sequential order, through to the last element, which is usually the 'ELSE' level.

All options have defaults set initially. Any call to .configure() will set any options that are supplied. Subsequent calls to .configure() will not reset unsupplied options to their defaults; to restore a default you must explicitly supply the corresponding default value.

Generally speaking only a single call (at most) to .configure() should be required.

Parameters:

Name Type Description Default term_frequency_adjustments bool

Whether term frequency adjustments are switched on for this comparison. Only applied to exact match levels. Default corresponds to False.

unsupplied_option m_probabilities list

List of m probabilities Default corresponds to None.

unsupplied_option u_probabilities list

List of u probabilities Default corresponds to None.

unsupplied_option Example
cc = LevenshteinAtThresholds(\"name\", 2)\ncc.configure(\n    m_probabilities=[0.9, 0.08, 0.02],\n    u_probabilities=[0.01, 0.05, 0.94]\n    # probabilities for exact match level, levenshtein <= 2, and else\n    # in that order\n)\n
","tags":["API","comparisons"]},{"location":"api_docs/datasets.html","title":"SplinkDatasets","text":"","tags":["API","Datasets","Examples"]},{"location":"api_docs/datasets.html#in-built-datasets","title":"In-built datasets","text":"

Splink has some datasets available for use to help you get up and running, test ideas, or explore Splink features. To use, simply import splink_datasets:

from splink import splink_datasets\n\ndf = splink_datasets.fake_1000\n
which you can then use to set up a linker:
import splink.comparison_library as cl\nfrom splink import splink_datasets, Linker, DuckDBAPI, SettingsCreator\n\ndf = splink_datasets.fake_1000\nlinker = Linker(\n    df,\n    SettingsCreator(\n        link_type=\"dedupe_only\",\n        comparisons=[\n            cl.ExactMatch(\"first_name\"),\n            cl.ExactMatch(\"surname\"),\n        ],\n    ),\n    db_api=DuckDBAPI()\n)\n
Troubleshooting

If you get an SSLCertVerificationError when trying to use the inbuilt datasets, this can be fixed with the ssl package by running:

import ssl; ssl._create_default_https_context = ssl._create_unverified_context. Note that this disables certificate verification, so only use it for sources you trust.

","tags":["API","Datasets","Examples"]},{"location":"api_docs/datasets.html#splink_datasets","title":"splink_datasets","text":"

Each attribute of splink_datasets is a dataset available for use, which exists as a pandas DataFrame. These datasets are not packaged directly with Splink, but instead are downloaded only when they are requested. Once requested they are cached for future use. The cache can be cleared using splink_dataset_utils (see below), which also contains information on available datasets, and which have already been cached.

","tags":["API","Datasets","Examples"]},{"location":"api_docs/datasets.html#available-datasets","title":"Available datasets","text":"

The datasets available are listed below:

dataset name description rows unique entities link to source fake_1000 Fake 1000 from splink demos. Records are 250 simulated people, with different numbers of duplicates, labelled. 1,000 250 source historical_50k The data is based on historical persons scraped from wikidata. Duplicate records are introduced with a variety of errors. 50,000 5,156 source febrl3 The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL3 data set contains 5000 records (2000 originals and 3000 duplicates), with a maximum of 5 duplicates based on one original record. 5,000 2,000 source febrl4a The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL4a contains 5000 original records. 5,000 5,000 source febrl4b The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL4b contains 5000 duplicate records, one for each record in FEBRL4a. 5,000 5,000 source transactions_origin This data has been generated to resemble bank transactions leaving an account. There are no duplicates within the dataset and each transaction is designed to have a counterpart arriving in 'transactions_destination'. Memo is sometimes truncated or missing. 45,326 45,326 source transactions_destination This data has been generated to resemble bank transactions arriving in an account. There are no duplicates within the dataset and each transaction is designed to have a counterpart sent from 'transactions_origin'. There may be a delay between the source and destination account, and the amount may vary due to hidden fees and foreign exchange rates. Memo is sometimes truncated or missing. 45,326 45,326 source","tags":["API","Datasets","Examples"]},{"location":"api_docs/datasets.html#splink_dataset_labels","title":"splink_dataset_labels","text":"

Some of the splink_datasets have corresponding clerical labels to help assess model performance. These are requested through the splink_dataset_labels module.

","tags":["API","Datasets","Examples"]},{"location":"api_docs/datasets.html#available-datasets_1","title":"Available datasets","text":"

The datasets available are listed below:

dataset name description rows unique entities link to source fake_1000_labels Clerical labels for fake_1000 3,176 NA source","tags":["API","Datasets","Examples"]},{"location":"api_docs/datasets.html#splink_dataset_utils-api","title":"splink_dataset_utils API","text":"

In addition to splink_datasets, you can also import splink_dataset_utils, which has a few functions to help manage splink_datasets. This can be useful if you have a limited internet connection and want to see what is already cached, or if you need to clear cached items (e.g. if datasets have been updated, or if space is an issue).

For example:

from splink.datasets import splink_dataset_utils\n\nsplink_dataset_utils.show_downloaded_data()\nsplink_dataset_utils.clear_downloaded_data(['fake_1000'])\n
","tags":["API","Datasets","Examples"]},{"location":"api_docs/datasets.html#splink.internals.datasets.utils.SplinkDataUtils.list_downloaded_datasets","title":"list_downloaded_datasets()","text":"

Return a list of datasets that have already been pre-downloaded

","tags":["API","Datasets","Examples"]},{"location":"api_docs/datasets.html#splink.internals.datasets.utils.SplinkDataUtils.list_all_datasets","title":"list_all_datasets()","text":"

Return a list of all available datasets, regardless of whether or not they have already been pre-downloaded

","tags":["API","Datasets","Examples"]},{"location":"api_docs/datasets.html#splink.internals.datasets.utils.SplinkDataUtils.show_downloaded_data","title":"show_downloaded_data()","text":"

Print a list of datasets that have already been pre-downloaded

","tags":["API","Datasets","Examples"]},{"location":"api_docs/datasets.html#splink.internals.datasets.utils.SplinkDataUtils.clear_downloaded_data","title":"clear_downloaded_data(datasets=None)","text":"

Delete any pre-downloaded data stored locally.

Parameters:

Name Type Description Default datasets list

A list of dataset names (without any file suffix) to delete. If None, all datasets will be deleted. Default None

None","tags":["API","Datasets","Examples"]},{"location":"api_docs/em_training_session.html","title":"EM Training Session API","text":"","tags":["API","training"]},{"location":"api_docs/em_training_session.html#documentation-foremtrainingsession","title":"Documentation forEMTrainingSession","text":"

linker.training.estimate_parameters_using_expectation_maximisation returns an object of type EMTrainingSession which has the following methods:

Manages training models using the Expectation Maximisation algorithm, holds statistics on the evolution of parameter estimates, and plots diagnostic charts.

","tags":["API","training"]},{"location":"api_docs/em_training_session.html#splink.internals.em_training_session.EMTrainingSession.probability_two_random_records_match_iteration_chart","title":"probability_two_random_records_match_iteration_chart()","text":"

Display a chart showing the iteration history of the probability that two random records match.

Returns:

Type Description ChartReturnType

An interactive Altair chart.

","tags":["API","training"]},{"location":"api_docs/em_training_session.html#splink.internals.em_training_session.EMTrainingSession.match_weights_interactive_history_chart","title":"match_weights_interactive_history_chart()","text":"

Display an interactive chart of the match weights history.

Returns:

Type Description ChartReturnType

An interactive Altair chart.

","tags":["API","training"]},{"location":"api_docs/em_training_session.html#splink.internals.em_training_session.EMTrainingSession.m_u_values_interactive_history_chart","title":"m_u_values_interactive_history_chart()","text":"

Display an interactive chart of the m and u values.

Returns:

Type Description ChartReturnType

An interactive Altair chart.

","tags":["API","training"]},{"location":"api_docs/evaluation.html","title":"Evaluation","text":"","tags":["API","Clustering"]},{"location":"api_docs/evaluation.html#methods-in-linkerevaluation","title":"Methods in Linker.evaluation","text":"

Evaluate the performance of a Splink model. Accessed via linker.evaluation

","tags":["API","Clustering"]},{"location":"api_docs/evaluation.html#splink.internals.linker_components.evaluation.LinkerEvalution.prediction_errors_from_labels_table","title":"prediction_errors_from_labels_table(labels_splinkdataframe_or_table_name, include_false_positives=True, include_false_negatives=True, threshold_match_probability=0.5)","text":"

Find false positives and false negatives by comparing the clerical_match_score in the labels table with the Splink predicted match probability.

The table of labels should be in the following format, and should be registered as a table with your database using

labels_table = linker.table_management.register_labels_table(my_df)

source_dataset_l unique_id_l source_dataset_r unique_id_r clerical_match_score df_1 1 df_2 2 0.99 df_1 1 df_2 3 0.2

Parameters:

Name Type Description Default labels_splinkdataframe_or_table_name str | SplinkDataFrame

Name of table containing labels in the database

required include_false_positives bool

Defaults to True.

True include_false_negatives bool

Defaults to True.

True threshold_match_probability float

Threshold probability above which a prediction is considered to be a match. Defaults to 0.5.

0.5

Examples:

labels_table = linker.table_management.register_labels_table(df_labels)\n\nlinker.evaluation.prediction_errors_from_labels_table(\n   labels_table, include_false_negatives=True, include_false_positives=False\n).as_pandas_dataframe()\n

Returns:

Name Type Description SplinkDataFrame SplinkDataFrame

Table containing false positives and negatives

","tags":["API","Clustering"]},{"location":"api_docs/evaluation.html#splink.internals.linker_components.evaluation.LinkerEvalution.accuracy_analysis_from_labels_column","title":"accuracy_analysis_from_labels_column(labels_column_name, *, threshold_match_probability=0.5, match_weight_round_to_nearest=0.1, output_type='threshold_selection', add_metrics=[], positives_not_captured_by_blocking_rules_scored_as_zero=True)","text":"

Generate an accuracy chart or table from ground truth data, where the ground truth is in a column in the input dataset called labels_column_name

Parameters:

Name Type Description Default labels_column_name str

Column name containing labels in the input table

required threshold_match_probability float

Where the clerical_match_score provided by the user is a probability rather than binary, this value is used as the threshold to classify clerical_match_scores as binary matches or non matches. Defaults to 0.5.

0.5 match_weight_round_to_nearest float

When provided, thresholds are rounded. When large numbers of labels are provided, this is sometimes necessary to reduce the size of the ROC table, and therefore the number of points plotted on the chart. Defaults to 0.1.

0.1 add_metrics list(str)

Precision and recall metrics are always included. Where provided, add_metrics specifies additional metrics to show, with the following options:

  • \"specificity\": specificity, selectivity, true negative rate (TNR)
  • \"npv\": negative predictive value (NPV)
  • \"accuracy\": overall accuracy (TP+TN)/(P+N)
  • \"f1\"/\"f2\"/\"f0_5\": F-scores for \u03b2=1 (balanced), \u03b2=2 (emphasis on recall) and \u03b2=0.5 (emphasis on precision)
  • \"p4\" - an extended F1 score with specificity and NPV included
  • \"phi\" - \u03c6 coefficient or Matthews correlation coefficient (MCC)
[]
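The additional metrics listed above are standard functions of the confusion matrix. For illustration, a pure-Python sketch of the definitions, independent of Splink (which computes these internally):

```python
import math

def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard accuracy metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    def f_beta(beta: float) -> float:
        # F-beta: beta > 1 weights recall more heavily, beta < 1 precision
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return {
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),               # true negative rate
        "npv": tn / (tn + fn),                       # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": f_beta(1),
        "f2": f_beta(2),
        "f0_5": f_beta(0.5),
        # phi coefficient / Matthews correlation coefficient
        "phi": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }
```

For example, with tp=40, fp=10, tn=30, fn=20 this gives precision 0.8, recall 2/3 and F1 of 8/11.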

Returns:

Name Type Description chart Union[ChartReturnType, SplinkDataFrame]

An altair chart

","tags":["API","Clustering"]},{"location":"api_docs/evaluation.html#splink.internals.linker_components.evaluation.LinkerEvalution.accuracy_analysis_from_labels_table","title":"accuracy_analysis_from_labels_table(labels_splinkdataframe_or_table_name, *, threshold_match_probability=0.5, match_weight_round_to_nearest=0.1, output_type='threshold_selection', add_metrics=[])","text":"

Generate an accuracy chart or table from labelled (ground truth) data.

The table of labels should be in the following format, and should be registered as a table with your database using labels_table = linker.register_labels_table(my_df)

source_dataset_l unique_id_l source_dataset_r unique_id_r clerical_match_score df_1 1 df_2 2 0.99 df_1 1 df_2 3 0.2

Note that source_dataset and unique_id should correspond to the values specified in the settings dict, and the input_table_aliases passed to the linker object.

For dedupe_only links, the source_dataset columns can be omitted.

Parameters:

Name Type Description Default labels_splinkdataframe_or_table_name str | SplinkDataFrame

Name of table containing labels in the database

required threshold_match_probability float

Where the clerical_match_score provided by the user is a probability rather than binary, this value is used as the threshold to classify clerical_match_scores as binary matches or non matches. Defaults to 0.5.

0.5 match_weight_round_to_nearest float

When provided, thresholds are rounded. When large numbers of labels are provided, this is sometimes necessary to reduce the size of the ROC table, and therefore the number of points plotted on the chart. Defaults to 0.1.

0.1 add_metrics list(str)

Precision and recall metrics are always included. Where provided, add_metrics specifies additional metrics to show, with the following options:

  • \"specificity\": specificity, selectivity, true negative rate (TNR)
  • \"npv\": negative predictive value (NPV)
  • \"accuracy\": overall accuracy (TP+TN)/(P+N)
  • \"f1\"/\"f2\"/\"f0_5\": F-scores for \u03b2=1 (balanced), \u03b2=2 (emphasis on recall) and \u03b2=0.5 (emphasis on precision)
  • \"p4\" - an extended F1 score with specificity and NPV included
  • \"phi\" - \u03c6 coefficient or Matthews correlation coefficient (MCC)
[]

Returns:

Type Description Union[ChartReturnType, SplinkDataFrame]

altair.Chart: An altair chart

","tags":["API","Clustering"]},{"location":"api_docs/evaluation.html#splink.internals.linker_components.evaluation.LinkerEvalution.prediction_errors_from_labels_column","title":"prediction_errors_from_labels_column(label_colname, include_false_positives=True, include_false_negatives=True, threshold_match_probability=0.5)","text":"

Generate a dataframe containing false positives and false negatives based on the comparison between the Splink match probability and the labels column. A labels column is a column in the input dataset that contains the 'ground truth' cluster to which the record belongs.

Parameters:

Name Type Description Default label_colname str

Name of labels column in input data

required include_false_positives bool

Defaults to True.

True include_false_negatives bool

Defaults to True.

True threshold_match_probability float

Threshold above which a score is considered to be a match. Defaults to 0.5.

0.5

Examples:

linker.evaluation.prediction_errors_from_labels_column(\n    \"ground_truth_cluster\",\n    include_false_negatives=True,\n    include_false_positives=False\n).as_pandas_dataframe()\n

Returns:

Name Type Description SplinkDataFrame SplinkDataFrame

Table containing false positives and negatives

","tags":["API","Clustering"]},{"location":"api_docs/evaluation.html#splink.internals.linker_components.evaluation.LinkerEvalution.unlinkables_chart","title":"unlinkables_chart(x_col='match_weight', name_of_data_in_title=None, as_dict=False)","text":"

Generate an interactive chart displaying the proportion of records that are \"unlinkable\" for a given splink score threshold and model parameters.

Unlinkable records are those that, even when compared with themselves, do not contain enough information to confirm a match.

Parameters:

Name Type Description Default x_col str

Column to use for the x-axis. Defaults to \"match_weight\".

'match_weight' name_of_data_in_title str

Name of the source dataset to use for the title of the output chart.

None as_dict bool

If True, return a dict version of the chart.

False

Examples:

After estimating the parameters of the model, run:

linker.evaluation.unlinkables_chart()\n

Returns:

Type Description ChartReturnType

altair.Chart: An altair chart

","tags":["API","Clustering"]},{"location":"api_docs/evaluation.html#splink.internals.linker_components.evaluation.LinkerEvalution.labelling_tool_for_specific_record","title":"labelling_tool_for_specific_record(unique_id, source_dataset=None, out_path='labelling_tool.html', overwrite=False, match_weight_threshold=-4, view_in_jupyter=False, show_splink_predictions_in_interface=True)","text":"

Create a standalone, offline labelling dashboard for a specific record as identified by its unique id

Parameters:

Name Type Description Default unique_id str

The unique id of the record for which to create the labelling tool

required source_dataset str

If there are multiple datasets, to identify the record you must also specify the source_dataset. Defaults to None.

None out_path str

The output path for the labelling tool. Defaults to \"labelling_tool.html\".

'labelling_tool.html' overwrite bool

If true, overwrite files at the output path if they exist. Defaults to False.

False match_weight_threshold int

Include possible matches in the output which score above this threshold. Defaults to -4.

-4 view_in_jupyter bool

If you're viewing in the Jupyter html viewer, set this to True to extract your labels. Defaults to False.

False show_splink_predictions_in_interface bool

Whether to show information about the Splink model's predictions that could potentially bias the decision of the clerical labeller. Defaults to True.

True","tags":["API","Clustering"]},{"location":"api_docs/exploratory.html","title":"Exploratory","text":"","tags":["API","comparisons"]},{"location":"api_docs/exploratory.html#documentation-forsplinkexploratory","title":"Documentation forsplink.exploratory","text":"","tags":["API","comparisons"]},{"location":"api_docs/exploratory.html#splink.exploratory.completeness_chart","title":"completeness_chart(table_or_tables, db_api, cols=None, table_names_for_chart=None)","text":"

Generate a summary chart of data completeness (the proportion of non-null values) for the columns in each of the input table or tables. By default, completeness is assessed for all columns in the input data.
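Completeness here is simply the proportion of non-null values in each column. Conceptually, the quantity being charted is the following (a pure-Python sketch, not Splink's implementation):

```python
def completeness(records, cols=None):
    """Proportion of non-null values per column across a list of record dicts."""
    if not records:
        return {}
    if cols is None:            # default: assess every column
        cols = list(records[0])
    n = len(records)
    return {c: sum(r.get(c) is not None for r in records) / n for c in cols}

rows = [
    {"first_name": "John", "surname": "Smith"},
    {"first_name": None, "surname": "Jones"},
    {"first_name": "Ann", "surname": None},
]
# first_name and surname are each 2/3 complete in this toy data
scores = completeness(rows)
```

Columns with low completeness carry little evidence for linkage, which is why this chart is a useful first check.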

Parameters:

Name Type Description Default table_or_tables Sequence[AcceptableInputTableType]

A single table or a list of tables of data

required db_api DatabaseAPISubClass

The backend database API to use

required cols List[str]

List of column names for which to calculate completeness. If None, completeness is calculated for all columns. Defaults to None.

None table_names_for_chart List[str]

A list of names. Must be the same length as table_or_tables.

None","tags":["API","comparisons"]},{"location":"api_docs/exploratory.html#splink.exploratory.profile_columns","title":"profile_columns(table_or_tables, db_api, column_expressions=None, top_n=10, bottom_n=10)","text":"

Profiles the specified columns of the dataframe initiated with the linker.

This can be computationally expensive if the dataframe is large.

For the provided columns with column_expressions (or for all columns if left empty), calculate:

  • A distribution plot showing the count of values at each percentile.
  • A top n chart, showing the count of the top n values within the column.
  • A bottom n chart, showing the count of the bottom n values within the column.

This should be used to explore the dataframe, determine if columns have sufficient completeness for linking, analyse the cardinality of columns, and identify the need for standardisation within a given column.

Args:

column_expressions (list, optional): A list of strings containing the\n    specified column names.\n    If left empty this will default to all columns.\ntop_n (int, optional): The number of top n values to plot.\nbottom_n (int, optional): The number of bottom n values to plot.\n

Returns:

Type Description Optional[ChartReturnType]

altair.Chart or dict: A visualization or JSON specification describing the profiling charts.

Note
  • The linker object should be an instance of the initiated linker.
  • The provided column_expressions can be a list of column names to profile. If left empty, all columns will be profiled.
  • The top_n and bottom_n parameters determine the number of top and bottom values to display in the respective charts.
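The 'top n' and 'bottom n' values underlying those charts are just frequency counts over a column. A minimal sketch of the computation (illustrative and independent of Splink):

```python
from collections import Counter

def top_and_bottom_n(values, top_n=10, bottom_n=10):
    """Most and least frequent values in a column, ignoring nulls."""
    counts = Counter(v for v in values if v is not None)
    ranked = counts.most_common()          # sorted most -> least frequent
    return ranked[:top_n], ranked[-bottom_n:]

city = ["London", "London", "London", "Leeds", "Leeds", "York", None]
top, bottom = top_and_bottom_n(city, top_n=2, bottom_n=1)
# top is [("London", 3), ("Leeds", 2)]; bottom is [("York", 1)]
```

High-frequency values (low cardinality) are exactly where term frequency adjustments tend to matter in a linkage model.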
","tags":["API","comparisons"]},{"location":"api_docs/exploratory.html#documentation-forsplinkexploratorysimilarity_analysis","title":"Documentation forsplink.exploratory.similarity_analysis","text":"","tags":["API","comparisons"]},{"location":"api_docs/exploratory.html#splink.exploratory.similarity_analysis.comparator_score","title":"comparator_score(str1, str2, decimal_places=2)","text":"

Helper function to give the similarity between two strings for the string comparators in splink.

Examples:

import splink.exploratory.similarity_analysis as sa\n\nsa.comparator_score(\"Richard\", \"iRchard\")\n
","tags":["API","comparisons"]},{"location":"api_docs/exploratory.html#splink.exploratory.similarity_analysis.comparator_score_chart","title":"comparator_score_chart(list, col1, col2)","text":"

Helper function returning a heatmap showing the string similarity scores and string distances for a list of strings.

Examples:

import splink.exploratory.similarity_analysis as sa\n\nlist = {\n        \"string1\": [\"Stephen\", \"Stephen\", \"Stephen\"],\n        \"string2\": [\"Stephen\", \"Steven\", \"Stephan\"],\n        }\n\nsa.comparator_score_chart(list, \"string1\", \"string2\")\n
","tags":["API","comparisons"]},{"location":"api_docs/exploratory.html#splink.exploratory.similarity_analysis.comparator_score_df","title":"comparator_score_df(list, col1, col2, decimal_places=2)","text":"

Helper function returning a dataframe showing the string similarity scores and string distances for a list of strings.

Examples:

import splink.exploratory.similarity_analysis as sa\n\nlist = {\n        \"string1\": [\"Stephen\", \"Stephen\",\"Stephen\"],\n        \"string2\": [\"Stephen\", \"Steven\", \"Stephan\"],\n        }\n\nsa.comparator_score_df(list, \"string1\", \"string2\")\n
","tags":["API","comparisons"]},{"location":"api_docs/exploratory.html#splink.exploratory.similarity_analysis.comparator_score_threshold_chart","title":"comparator_score_threshold_chart(list, col1, col2, similarity_threshold=None, distance_threshold=None)","text":"

Helper function returning a heatmap showing the string similarity scores and string distances for a list of strings given a threshold.

Examples:

import splink.exploratory.similarity_analysis as sa\n\nlist = {\n        \"string1\": [\"Stephen\", \"Stephen\",\"Stephen\"],\n        \"string2\": [\"Stephen\", \"Steven\", \"Stephan\"],\n        }\n\nsa.comparator_score_threshold_chart(list,\n                         \"string1\", \"string2\",\n                         similarity_threshold=0.8,\n                         distance_threshold=2)\n
","tags":["API","comparisons"]},{"location":"api_docs/exploratory.html#splink.exploratory.similarity_analysis.phonetic_match_chart","title":"phonetic_match_chart(list, col1, col2)","text":"

Helper function returning a heatmap showing the phonetic transform and matches for a list of strings given a threshold.

Examples:

import splink.exploratory.similarity_analysis as sa\n\nlist = {\n        \"string1\": [\"Stephen\", \"Stephen\",\"Stephen\"],\n        \"string2\": [\"Stephen\", \"Steven\", \"Stephan\"],\n        }\n\nsa.phonetic_match_chart(list, \"string1\", \"string2\")\n
","tags":["API","comparisons"]},{"location":"api_docs/exploratory.html#splink.exploratory.similarity_analysis.phonetic_transform","title":"phonetic_transform(string)","text":"

Helper function to give the phonetic transformation of a string with Soundex, Metaphone and Double Metaphone.

Examples:

phonetic_transform(\"Richard\", \"iRchard\")\n
","tags":["API","comparisons"]},{"location":"api_docs/exploratory.html#splink.exploratory.similarity_analysis.phonetic_transform_df","title":"phonetic_transform_df(list, col1, col2)","text":"

Helper function returning a dataframe showing the phonetic transforms for a list of strings.

Examples:

import splink.exploratory.similarity_analysis as sa\n\nlist = {\n        \"string1\": [\"Stephen\", \"Stephen\",\"Stephen\"],\n        \"string2\": [\"Stephen\", \"Steven\", \"Stephan\"],\n        }\n\nsa.phonetic_transform_df(list, \"string1\", \"string2\")\n
","tags":["API","comparisons"]},{"location":"api_docs/inference.html","title":"Inference","text":"","tags":["API","Inference"]},{"location":"api_docs/inference.html#methods-in-linkerinference","title":"Methods in Linker.inference","text":"

Use your Splink model to make predictions (perform inference). Accessed via linker.inference.

","tags":["API","Inference"]},{"location":"api_docs/inference.html#splink.internals.linker_components.inference.LinkerInference.deterministic_link","title":"deterministic_link()","text":"

Uses the blocking rules specified by blocking_rules_to_generate_predictions in your settings to generate pairwise record comparisons.

For deterministic linkage, this should be a list of blocking rules which are strict enough to generate only true links.

Deterministic linkage, however, is likely to result in missed links (false negatives).

Examples:

settings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\", \"surname\"),\n        block_on(\"dob\", \"first_name\"),\n    ],\n)\n\nlinker = Linker(df, settings, db_api=db_api)\nsplink_df = linker.inference.deterministic_link()\n

Returns:

Name Type Description SplinkDataFrame SplinkDataFrame

A SplinkDataFrame of the pairwise comparisons.

","tags":["API","Inference"]},{"location":"api_docs/inference.html#splink.internals.linker_components.inference.LinkerInference.predict","title":"predict(threshold_match_probability=None, threshold_match_weight=None, materialise_after_computing_term_frequencies=True, materialise_blocked_pairs=True)","text":"

Create a dataframe of scored pairwise comparisons using the parameters of the linkage model.

Uses the blocking rules specified in the blocking_rules_to_generate_predictions key of the settings to generate the pairwise comparisons.

Parameters:

Name Type Description Default threshold_match_probability float

If specified, filter the results to include only pairwise comparisons with a match_probability above this threshold. Defaults to None.

None threshold_match_weight float

If specified, filter the results to include only pairwise comparisons with a match_weight above this threshold. Defaults to None.

None materialise_after_computing_term_frequencies bool

If true, Splink will materialise the table containing the input nodes (rows) joined to any term frequencies which have been asked for in the settings object. If False, this will be computed as part of a large CTE pipeline. Defaults to True

True materialise_blocked_pairs bool

In the blocking phase, materialise the table of pairs of records that will be scored

True

Examples:

linker = Linker(df, \"saved_settings.json\", db_api=db_api)\nsplink_df = linker.inference.predict(threshold_match_probability=0.95)\nsplink_df.as_pandas_dataframe(limit=5)\n
","tags":["API","Inference"]},{"location":"api_docs/inference.html#splink.internals.linker_components.inference.LinkerInference.find_matches_to_new_records","title":"find_matches_to_new_records(records_or_tablename, blocking_rules=[], match_weight_threshold=-4)","text":"

Given one or more records, find records in the input dataset(s) which match and return in order of the Splink prediction score.

This effectively provides a way of searching the input datasets for given record(s).

Parameters:

Name Type Description Default records_or_tablename List[dict]

Input search record(s) as list of dict, or a table registered to the database.

required blocking_rules list

Blocking rules to select which records to find and score. If [], do not use a blocking rule - meaning the input records will be compared to all records provided to the linker when it was instantiated. Defaults to [].

[] match_weight_threshold int

Return matches with a match weight above this threshold. Defaults to -4.

-4
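Assuming Splink's standard definition of match weight as the log base 2 of the Bayes factor, a weight threshold can be translated into an approximate probability threshold. A small sketch of the conversion:

```python
def match_weight_to_probability(w):
    """Convert a match weight (log2 of the Bayes factor) into a probability."""
    bayes_factor = 2 ** w
    return bayes_factor / (1 + bayes_factor)

# The default threshold of -4 corresponds to a match probability of ~0.0588,
# i.e. even fairly weak candidate matches are returned for inspection.
p = match_weight_to_probability(-4)
```

So raising match_weight_threshold tightens the search, at the cost of missing weaker candidate matches.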

Examples:

linker = Linker(df, \"saved_settings.json\", db_api=db_api)\n\n# You should load or pre-compute tf tables for any tables with\n# term frequency adjustments\nlinker.table_management.compute_tf_table(\"first_name\")\n# OR\nlinker.table_management.register_term_frequency_lookup(df, \"first_name\")\n\nrecord = {'unique_id': 1,\n    'first_name': \"John\",\n    'surname': \"Smith\",\n    'dob': \"1971-05-24\",\n    'city': \"London\",\n    'email': \"john@smith.net\"\n    }\ndf = linker.inference.find_matches_to_new_records(\n    [record], blocking_rules=[]\n)\n

Returns:

Name Type Description SplinkDataFrame SplinkDataFrame

The pairwise comparisons.

","tags":["API","Inference"]},{"location":"api_docs/inference.html#splink.internals.linker_components.inference.LinkerInference.compare_two_records","title":"compare_two_records(record_1, record_2)","text":"

Use the linkage model to compare and score a pairwise record comparison based on the two input records provided

Parameters:

Name Type Description Default record_1 dict

dictionary representing the first record. Column names and data types must be the same as the columns in the settings object

required record_2 dict

dictionary representing the second record. Column names and data types must be the same as the columns in the settings object

required

Examples:

linker = Linker(df, \"saved_settings.json\", db_api=db_api)\n\n# You should load or pre-compute tf tables for any tables with\n# term frequency adjustments\nlinker.table_management.compute_tf_table(\"first_name\")\n# OR\nlinker.table_management.register_term_frequency_lookup(df, \"first_name\")\n\nrecord_1 = {'unique_id': 1,\n    'first_name': \"John\",\n    'surname': \"Smith\",\n    'dob': \"1971-05-24\",\n    'city': \"London\",\n    'email': \"john@smith.net\"\n    }\n\nrecord_2 = {'unique_id': 1,\n    'first_name': \"Jon\",\n    'surname': \"Smith\",\n    'dob': \"1971-05-23\",\n    'city': \"London\",\n    'email': \"john@smith.net\"\n    }\ndf = linker.inference.compare_two_records(record_1, record_2)\n

Returns:

Name Type Description SplinkDataFrame SplinkDataFrame

Pairwise comparison with scored prediction

","tags":["API","Inference"]},{"location":"api_docs/misc.html","title":"Miscellaneous functions","text":"","tags":["API","Misc"]},{"location":"api_docs/misc.html#methods-in-linkermisc","title":"Methods in Linker.misc","text":"

Miscellaneous methods on the linker that don't fit into other categories. Accessed via linker.misc.

","tags":["API","Misc"]},{"location":"api_docs/misc.html#splink.internals.linker_components.misc.LinkerMisc.save_model_to_json","title":"save_model_to_json(out_path=None, overwrite=False)","text":"

Save the configuration and parameters of the linkage model to a .json file.

The model can later be loaded into a new linker using `Linker(df, settings=\"path/to/model.json\", db_api=db_api)`.

The settings dict is also returned in case you want to save it in a different way.

Examples:

linker.misc.save_model_to_json(\"my_settings.json\", overwrite=True)\n

Returns:

Name Type Description dict dict[str, Any]

The settings as a dictionary.

","tags":["API","Misc"]},{"location":"api_docs/misc.html#splink.internals.linker_components.misc.LinkerMisc.query_sql","title":"query_sql(sql, output_type='pandas')","text":"

Run a SQL query against your backend database and return the resulting output.

Examples:

linker = Linker(df, settings, db_api)\ndf_predict = linker.predict()\nlinker.misc.query_sql(f\"select * from {df_predict.physical_name} limit 10\")\n

Parameters:

Name Type Description Default sql str

The SQL to be queried.

required output_type str

One of splink_df/splinkdf or pandas. This determines the type of table that your results are output in.

'pandas'","tags":["API","Misc"]},{"location":"api_docs/settings_dict_guide.html","title":"Settings Dict","text":"","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#guide-to-splink-settings","title":"Guide to Splink settings","text":"

This document enumerates all the settings and configuration options available when developing your data linkage model.

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#link_type","title":"link_type","text":"

The type of data linking task. Required.

  • When dedupe_only, Splink finds duplicates. The user is expected to provide a single input dataset.

  • When link_and_dedupe, Splink finds links within and between input datasets. The user is expected to provide two or more input datasets.

  • When link_only, Splink finds links between datasets, but does not attempt to deduplicate them (it does not try to find links within each input dataset). The user is expected to provide two or more input datasets.

Examples: ['dedupe_only', 'link_only', 'link_and_dedupe']

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#probability_two_random_records_match","title":"probability_two_random_records_match","text":"

The probability that two records chosen at random (with no blocking) are a match. For example, if there are a million input records and each has on average one match, then this value should be 1/1,000,000.

If you estimate parameters using expectation maximisation (EM), this provides an initial value (prior) from which the EM algorithm will start iterating. EM will then estimate the true value of this parameter.

Default value: 0.0001

Examples: [1e-05, 0.006]
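The arithmetic behind the example above can be sketched in pure Python (illustrative numbers only; this is not part of the Splink API):

```python
# One million records, each with on average one true match elsewhere
# in the data.
n_records = 1_000_000
expected_matches_per_record = 1

# Number of distinct unordered record pairs when no blocking is applied.
total_comparisons = n_records * (n_records - 1) / 2

# Each record contributes ~1 matching pair, but every pair is shared
# between two records, so we halve the total.
expected_matching_pairs = n_records * expected_matches_per_record / 2

prior = expected_matching_pairs / total_comparisons
# prior is ~1e-6, i.e. roughly one in a million random pairs is a match
```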

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#em_convergence","title":"em_convergence","text":"

Convergence tolerance for the Expectation Maximisation algorithm

The algorithm will be considered to have converged, and will stop iterating, when the maximum change in any model parameter between iterations falls below this value

Default value: 0.0001

Examples: [0.0001, 1e-05, 1e-06]

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#max_iterations","title":"max_iterations","text":"

The maximum number of Expectation Maximisation iterations to run (even if convergence has not been reached)

Default value: 25

Examples: [20, 150]

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#unique_id_column_name","title":"unique_id_column_name","text":"

Splink requires that the input dataset has a column that uniquely identifies each record. unique_id_column_name is the name of the column in the input dataset representing this unique id

For linking tasks, ids must be unique within each dataset being linked, and do not need to be globally unique across input datasets

Default value: unique_id

Examples: ['unique_id', 'id', 'pk']

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#source_dataset_column_name","title":"source_dataset_column_name","text":"

The name of the column in the input dataset representing the source dataset

Where we are linking datasets, we can't guarantee that the unique id column is globally unique across datasets, so we combine it with a source_dataset column. Usually, this is created by Splink for the user

Default value: source_dataset

Examples: ['source_dataset', 'dataset_name']

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#retain_matching_columns","title":"retain_matching_columns","text":"

If set to true, each column used by the comparisons SQL expressions will be retained in output datasets

This is helpful so that the user can inspect matches, but once the comparison vector (gamma) columns are computed, this information is not actually needed by the algorithm. The algorithm will run faster and use less resources if this is set to false.

Default value: True

Examples: [False, True]

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#retain_intermediate_calculation_columns","title":"retain_intermediate_calculation_columns","text":"

Retain intermediate calculation columns, such as the Bayes factors associated with each column in comparisons

The algorithm will run faster and use less resources if this is set to false.

Default value: False

Examples: [False, True]

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#comparisons","title":"comparisons","text":"

A list specifying how records should be compared for probabilistic matching. Each element is a dictionary

Settings keys nested within each member of comparisons","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#output_column_name","title":"output_column_name","text":"

The name used to refer to this comparison in the output dataset. By default, Splink will set this to the name(s) of any input columns used in the comparison. This key is most useful to give a clearer description to comparisons that use multiple input columns. e.g. a location column that uses postcode and town may be named location

For a comparison that uses a single input column, e.g. first_name, this will be set to first_name. For comparisons that use multiple input columns, if left blank, this will be set to the concatenation of the columns used.

Examples: ['first_name', 'surname']

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#comparison_description","title":"comparison_description","text":"

An optional label to describe this comparison, to be used in charting outputs.

Examples: ['First name exact match', 'Surname with middle levenshtein level']

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#comparison_levels","title":"comparison_levels","text":"

Comparison levels specify how input values should be compared. Each level corresponds to an assessment of similarity, such as exact match, Jaro-Winkler match, one side of the match being null, etc

Each comparison level represents a branch of a SQL case expression. They are specified in order of evaluation, each with a sql_condition that represents the branch of a case expression

Example:

[{\n    \"sql_condition\": \"first_name_l IS NULL OR first_name_r IS NULL\",\n    \"label_for_charts\": \"null\",\n    \"null_level\": True\n},\n{\n    \"sql_condition\": \"first_name_l = first_name_r\",\n    \"label_for_charts\": \"exact_match\",\n    \"tf_adjustment_column\": \"first_name\"\n},\n{\n    \"sql_condition\": \"ELSE\",\n    \"label_for_charts\": \"else\"\n}]\n
Settings keys nested within each member of comparison_levels","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#sql_condition","title":"sql_condition","text":"

A branch of a SQL case expression without WHEN and THEN e.g. jaro_winkler_sim(surname_l, surname_r) > 0.88

Examples: ['forename_l = forename_r', 'jaro_winkler_sim(surname_l, surname_r) > 0.88']

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#label_for_charts","title":"label_for_charts","text":"

A label for this comparison level, which will appear on charts as a reminder of what the level represents

Examples: ['exact', 'postcode exact']

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#u_probability","title":"u_probability","text":"

The u probability for this comparison level, i.e. the proportion of record comparisons that fall into this level amongst truly non-matching records

Examples: [0.9]

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#m_probability","title":"m_probability","text":"

The m probability for this comparison level, i.e. the proportion of record comparisons that fall into this level amongst truly matching records

Examples: [0.1]

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#is_null_level","title":"is_null_level","text":"

If true, m and u values will not be estimated and the match weight will instead be zero for this column. For the treatment of nulls, see page 356 of https://imai.fas.harvard.edu/research/files/linkage.pdf: 'Under this MAR assumption, we can simply ignore missing data.'

Default value: False

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#tf_adjustment_column","title":"tf_adjustment_column","text":"

Make term frequency adjustments for this comparison level using this input column

Default value: None

Examples: ['first_name', 'postcode']

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#tf_adjustment_weight","title":"tf_adjustment_weight","text":"

Make term frequency adjustments using this weight. A weight of 1.0 is a full adjustment. A weight of 0.0 is no adjustment. A weight of 0.5 is a half adjustment

Default value: 1.0

Examples: [1.0, 0.5, 0.0]

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#tf_minimum_u_value","title":"tf_minimum_u_value","text":"

Where the term frequency adjustment implies a u value below this value, use this minimum value instead

This prevents excessive weight being assigned to very unusual terms, such as a collision on a typo

Default value: 0.0

Examples: [0.001, 1e-09]
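A minimal sketch of the clamping behaviour described above (illustrative only; u_value_with_floor is a hypothetical helper, not Splink's internal implementation):

```python
def u_value_with_floor(term_frequency: float, tf_minimum_u_value: float) -> float:
    # For an exact-match level, the term frequency of a value is
    # (roughly) the u probability implied for that value; extremely
    # rare terms get clamped to the configured minimum.
    return max(term_frequency, tf_minimum_u_value)

# A common term is unaffected by the floor.
common = u_value_with_floor(0.01, 1e-4)    # -> 0.01

# A very rare term (e.g. a typo that happens to collide) is floored,
# preventing it from receiving an excessive match weight.
rare = u_value_with_floor(1e-7, 1e-4)      # -> 0.0001
```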

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#blocking_rules_to_generate_predictions","title":"blocking_rules_to_generate_predictions","text":"

A list of one or more blocking rules to apply. A Cartesian join is applied if blocking_rules_to_generate_predictions is empty or not supplied.

Each rule is a SQL expression representing the blocking rule, which will be used to create a join. The left table is aliased with l and the right table is aliased with r. For example, if you want to block on a first_name column, the blocking rule would be

l.first_name = r.first_name.

To block on first name and the first letter of surname, it would be

l.first_name = r.first_name and substr(l.surname,1,1) = substr(r.surname,1,1).

Note that Splink deduplicates the comparisons generated by the blocking rules.

If empty or not supplied, all comparisons between the input dataset(s) will be generated and blocking will not be used. For large input datasets, this will generally be computationally intractable because it will generate comparisons equal to the number of rows squared.

Default value: []

Examples: [['l.first_name = r.first_name AND l.surname = r.surname', 'l.dob = r.dob']]
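To illustrate why blocking matters, the comparison counts can be sketched in pure Python (hypothetical block sizes; not part of the Splink API):

```python
def cartesian_comparisons(n: int) -> int:
    # Without blocking rules, every distinct pair of records is
    # compared: n * (n - 1) / 2 comparisons.
    return n * (n - 1) // 2

def blocked_comparisons(block_sizes: list[int]) -> int:
    # With a rule like l.first_name = r.first_name, only pairs that
    # share a first name are compared; each block contributes its own
    # (much smaller) Cartesian product.
    return sum(cartesian_comparisons(size) for size in block_sizes)

n = 1_000_000
unblocked = cartesian_comparisons(n)                # 499,999,500,000 pairs
blocked = blocked_comparisons([1_000] * 1_000)      # 499,500,000 pairs
# Here blocking into 1,000 equal blocks cuts the workload ~1000x.
```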

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#additional_columns_to_retain","title":"additional_columns_to_retain","text":"

A list of columns not being used in the probabilistic matching comparisons that you want to include in your results.

By default, Splink drops columns which are not used by any comparisons. This gives you the option to retain columns which are not used by the model. A common example is if the user has labelled data (training data) and wishes to retain the labels in the outputs

Default value: []

Examples: [['cluster', 'col_2'], ['other_information']]

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#bayes_factor_column_prefix","title":"bayes_factor_column_prefix","text":"

The prefix to use for the columns that will be created to store the Bayes factors

Default value: bf_

Examples: ['bf_', '__bf__']
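As a sketch of how the bf_ columns combine (illustrative numbers, not Splink internals): the prior odds are multiplied by each comparison's Bayes factor to give posterior match odds, and a match weight is the log2 of a Bayes factor:

```python
import math

# Hypothetical Bayes factors for three comparisons (the bf_ columns).
bayes_factors = {"bf_first_name": 90.0, "bf_surname": 40.0, "bf_dob": 0.25}

prior_match_probability = 1e-4
prior_odds = prior_match_probability / (1 - prior_match_probability)

# Posterior odds = prior odds multiplied by every Bayes factor.
posterior_odds = prior_odds
for bf in bayes_factors.values():
    posterior_odds *= bf

match_probability = posterior_odds / (1 + posterior_odds)

# A match weight is the log2 of a Bayes factor, so weights add
# where Bayes factors multiply.
match_weights = {name: math.log2(bf) for name, bf in bayes_factors.items()}
```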

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#term_frequency_adjustment_column_prefix","title":"term_frequency_adjustment_column_prefix","text":"

The prefix to use for the columns that will be created to store the term frequency adjustments

Default value: tf_

Examples: ['tf_', '__tf__']

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#comparison_vector_value_column_prefix","title":"comparison_vector_value_column_prefix","text":"

The prefix to use for the columns that will be created to store the comparison vector values

Default value: gamma_

Examples: ['gamma_', '__gamma__']

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/settings_dict_guide.html#sql_dialect","title":"sql_dialect","text":"

The SQL dialect in which sql_conditions are written. Must be a valid SQLGlot dialect

Default value: None

Examples: ['spark', 'duckdb', 'presto', 'sqlite']

","tags":["settings","Dedupe","Link","Link and Dedupe","Expectation Maximisation","Comparisons","Blocking Rules"]},{"location":"api_docs/splink_dataframe.html","title":"SplinkDataFrame","text":"","tags":["API","comparisons"]},{"location":"api_docs/splink_dataframe.html#documentation-forsplinkdataframe","title":"Documentation forSplinkDataFrame","text":"

Bases: ABC

Abstraction over dataframe to handle basic operations like retrieving data and retrieving column names, which need different implementations depending on whether it's a spark dataframe, sqlite table etc. Uses methods like as_pandas_dataframe() and as_record_dict() to retrieve data

","tags":["API","comparisons"]},{"location":"api_docs/splink_dataframe.html#splink.internals.splink_dataframe.SplinkDataFrame.as_pandas_dataframe","title":"as_pandas_dataframe(limit=None)","text":"

Return the dataframe as a pandas dataframe.

This can be computationally expensive if the dataframe is large.

Parameters:

Name Type Description Default limit int

If provided, return this number of rows (equivalent to a limit statement in SQL). Defaults to None, meaning return all rows

None

Examples:

df_predict = linker.inference.predict()\ndf_ten_edges = df_predict.as_pandas_dataframe(10)\n
","tags":["API","comparisons"]},{"location":"api_docs/splink_dataframe.html#splink.internals.splink_dataframe.SplinkDataFrame.as_record_dict","title":"as_record_dict(limit=None)","text":"

Return the dataframe as a list of record dictionaries.

This can be computationally expensive if the dataframe is large.

Examples:

df_predict = linker.inference.predict()\nten_edges = df_predict.as_record_dict(10)\n

Returns:

Name Type Description list list[dict[str, Any]]

a list of records, each of which is a dictionary

","tags":["API","comparisons"]},{"location":"api_docs/splink_dataframe.html#splink.internals.splink_dataframe.SplinkDataFrame.drop_table_from_database_and_remove_from_cache","title":"drop_table_from_database_and_remove_from_cache(force_non_splink_table=False)","text":"

Drops the table from the underlying database, and removes it from the (linker) cache.

By default this will fail if the table is not one created by Splink, but this check can be overridden by setting force_non_splink_table=True

Examples:

df_predict = linker.inference.predict()\ndf_predict.drop_table_from_database_and_remove_from_cache()\n# predictions table no longer in the database / cache\n
","tags":["API","comparisons"]},{"location":"api_docs/splink_dataframe.html#splink.internals.splink_dataframe.SplinkDataFrame.to_csv","title":"to_csv(filepath, overwrite=False)","text":"

Save the dataframe in csv format.

Examples:

df_predict = linker.inference.predict()\ndf_predict.to_csv(\"model_predictions.csv\", overwrite=True)\n
","tags":["API","comparisons"]},{"location":"api_docs/splink_dataframe.html#splink.internals.splink_dataframe.SplinkDataFrame.to_parquet","title":"to_parquet(filepath, overwrite=False)","text":"

Save the dataframe in parquet format.

Examples:

df_predict = linker.inference.predict()\ndf_predict.to_parquet(\"model_predictions.parquet\", overwrite=True)\n
","tags":["API","comparisons"]},{"location":"api_docs/table_management.html","title":"Table Management","text":"","tags":["API","Clustering"]},{"location":"api_docs/table_management.html#methods-in-linkertable_management","title":"Methods in Linker.table_management","text":"

Register Splink tables against your database backend and manage the Splink cache. Accessed via linker.table_management.

","tags":["API","Clustering"]},{"location":"api_docs/table_management.html#splink.internals.linker_components.table_management.LinkerTableManagement.compute_tf_table","title":"compute_tf_table(column_name)","text":"

Compute a term frequency table for a given column and persist to the database

This method is useful if you want to pre-compute term frequency tables e.g. so that real time linkage executes faster, or so that you can estimate various models without having to recompute term frequency tables each time

Examples:

Real time linkage\n```py\nlinker = Linker(df, settings=\"saved_settings.json\", db_api=db_api)\nlinker.table_management.compute_tf_table(\"surname\")\nlinker.inference.compare_two_records(record_left, record_right)\n```\nPre-computed term frequency tables\n```py\nlinker = Linker(df, db_api)\ndf_first_name_tf = linker.table_management.compute_tf_table(\"first_name\")\ndf_first_name_tf.write.parquet(\"folder/first_name_tf\")\n\n# On a subsequent data linking job, read this table rather than recomputing it\ndf_first_name_tf = pd.read_parquet(\"folder/first_name_tf\")\ndf_first_name_tf.createOrReplaceTempView(\"__splink__df_tf_first_name\")\n```\n

Parameters:

Name Type Description Default column_name str

The column name in the input table

required

Returns:

Name Type Description SplinkDataFrame SplinkDataFrame

The resultant table as a splink data frame

","tags":["API","Clustering"]},{"location":"api_docs/table_management.html#splink.internals.linker_components.table_management.LinkerTableManagement.invalidate_cache","title":"invalidate_cache()","text":"

Invalidate the Splink cache. Any previously-computed tables will be recomputed. This is useful, for example, if the input data tables have changed.

","tags":["API","Clustering"]},{"location":"api_docs/table_management.html#splink.internals.linker_components.table_management.LinkerTableManagement.register_table_input_nodes_concat_with_tf","title":"register_table_input_nodes_concat_with_tf(input_data, overwrite=False)","text":"

Register a pre-computed version of the input_nodes_concat_with_tf table that you want to re-use e.g. that you created in a previous run.

This method allows you to register this table in the Splink cache so it will be used rather than Splink computing this table anew.

Parameters:

Name Type Description Default input_data AcceptableInputTableType

The data you wish to register. This can be either a dictionary, pandas dataframe, pyarrow table or a spark dataframe.

required overwrite bool

Overwrite the table in the underlying database if it exists.

False

Returns:

Name Type Description SplinkDataFrame SplinkDataFrame

An abstraction representing the table created by the SQL pipeline

","tags":["API","Clustering"]},{"location":"api_docs/table_management.html#splink.internals.linker_components.table_management.LinkerTableManagement.register_table_predict","title":"register_table_predict(input_data, overwrite=False)","text":"

Register a pre-computed version of the prediction table for use in Splink.

This method allows you to register a pre-computed prediction table in the Splink cache so it will be used rather than Splink computing the table anew.

Examples:

predict_df = pd.read_parquet(\"path/to/predict_df.parquet\")\npredict_as_splinkdataframe = linker.table_management.register_table_predict(predict_df)\nclusters = linker.clustering.cluster_pairwise_predictions_at_threshold(\n    predict_as_splinkdataframe, threshold_match_probability=0.75\n)\n

Parameters:

Name Type Description Default input_data AcceptableInputTableType

The data you wish to register. This can be either a dictionary, pandas dataframe, pyarrow table, or a spark dataframe.

required overwrite bool

Overwrite the table in the underlying database if it exists. Defaults to False.

False

Returns:

Name Type Description SplinkDataFrame

An abstraction representing the table created by the SQL pipeline.

","tags":["API","Clustering"]},{"location":"api_docs/table_management.html#splink.internals.linker_components.table_management.LinkerTableManagement.register_term_frequency_lookup","title":"register_term_frequency_lookup(input_data, col_name, overwrite=False)","text":"

Register a pre-computed term frequency lookup table for a given column.

This method allows you to register a term frequency table in the Splink cache for a specific column. This table will then be used during linkage rather than computing the term frequency table anew from your input data.

Parameters:

Name Type Description Default input_data AcceptableInputTableType

The data representing the term frequency table. This can be either a dictionary, pandas dataframe, pyarrow table, or a spark dataframe.

required col_name str

The name of the column for which the term frequency lookup table is being registered.

required overwrite bool

Overwrite the table in the underlying database if it exists. Defaults to False.

False

Returns:

Name Type Description SplinkDataFrame

An abstraction representing the registered term frequency table.

Examples:

tf_table = [\n    {\"first_name\": \"theodore\", \"tf_first_name\": 0.012},\n    {\"first_name\": \"alfie\", \"tf_first_name\": 0.013},\n]\ntf_df = pd.DataFrame(tf_table)\nlinker.table_management.register_term_frequency_lookup(tf_df,\n                                                        \"first_name\")\n
","tags":["API","Clustering"]},{"location":"api_docs/table_management.html#splink.internals.linker_components.table_management.LinkerTableManagement.register_table","title":"register_table(input_table, table_name, overwrite=False)","text":"

Register a table to your backend database, to be used in one of the splink methods, or simply to allow querying.

Tables can be of type: dictionary, record level dictionary, pandas dataframe, pyarrow table and in the spark case, a spark df.

Examples:

test_dict = {\"a\": [666,777,888],\"b\": [4,5,6]}\nlinker.table_management.register_table(test_dict, \"test_dict\")\nlinker.query_sql(\"select * from test_dict\")\n

Parameters:

Name Type Description Default input_table AcceptableInputTableType

The data you wish to register. This can be either a dictionary, pandas dataframe, pyarrow table or a spark dataframe.

required table_name str

The name you wish to assign to the table.

required overwrite bool

Overwrite the table in the underlying database if it exists

False

Returns:

Name Type Description SplinkDataFrame SplinkDataFrame

An abstraction representing the table created by the SQL pipeline

","tags":["API","Clustering"]},{"location":"api_docs/training.html","title":"Training","text":"","tags":["API","Training"]},{"location":"api_docs/training.html#methods-in-linkertraining","title":"Methods in Linker.training","text":"

Estimate the parameters of the linkage model, accessed via linker.training.

","tags":["API","Training"]},{"location":"api_docs/training.html#splink.internals.linker_components.training.LinkerTraining.estimate_probability_two_random_records_match","title":"estimate_probability_two_random_records_match(deterministic_matching_rules, recall, max_rows_limit=int(1000000000.0))","text":"

Estimate the model parameter probability_two_random_records_match using a direct estimation approach.

This method counts the number of matches found using deterministic rules and divides by the total number of possible record comparisons. The recall of the deterministic rules is used to adjust this proportion up to reflect missed matches, providing an estimate of the probability that two random records from the input data are a match.

Note that if more than one deterministic rule is provided, any duplicate pairs are automatically removed, so you do not need to worry about double counting.

See here for discussion of methodology.
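The direct estimation approach described above can be sketched with hypothetical numbers (pure Python; not the Splink implementation):

```python
# Matches found by the deterministic rules, after deduplicating pairs.
matches_found_by_rules = 2_000

# Total possible record comparisons in the input data.
total_possible_comparisons = 10_000_000

# The rules are believed to recover 80% of all true matches, so scale
# the observed count up by 1 / recall to account for missed matches.
recall = 0.8
estimated_true_matches = matches_found_by_rules / recall

probability_two_random_records_match = (
    estimated_true_matches / total_possible_comparisons
)
# 2,500 estimated true matches / 10,000,000 comparisons = 0.00025
```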

Parameters:

Name Type Description Default deterministic_matching_rules list

A list of deterministic matching rules designed to admit very few (preferably no) false positives.

required recall float

An estimate of the recall the deterministic matching rules will achieve, i.e., the proportion of all true matches these rules will recover.

required max_rows_limit int

Maximum number of rows to consider during estimation. Defaults to 1e9.

int(1000000000.0)

Examples:

deterministic_rules = [\n    block_on(\"forename\", \"dob\"),\n    \"l.forename = r.forename and levenshtein(r.surname, l.surname) <= 2\",\n    block_on(\"email\")\n]\nlinker.training.estimate_probability_two_random_records_match(\n    deterministic_rules, recall=0.8\n)\n
","tags":["API","Training"]},{"location":"api_docs/training.html#splink.internals.linker_components.training.LinkerTraining.estimate_u_using_random_sampling","title":"estimate_u_using_random_sampling(max_pairs=1000000.0, seed=None)","text":"

Estimate the u parameters of the linkage model using random sampling.

The u parameters estimate the proportion of record comparisons that fall into each comparison level amongst truly non-matching records.

This procedure takes a sample of the data and generates the cartesian product of pairwise record comparisons amongst the sampled records. The validity of the u values rests on the assumption that the resultant pairwise comparisons are non-matches (or at least, they are very unlikely to be matches). For large datasets, this is typically true.

The results of estimate_u_using_random_sampling, and therefore an entire splink model, can be made reproducible by setting the seed parameter. Setting the seed will have performance implications as additional processing is required.
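A toy illustration of the idea (not Splink internals): among randomly sampled pairs, which are assumed to be non-matches, the u probability of a comparison level is simply the proportion of pairs falling into that level:

```python
import random

random.seed(0)  # seeding makes the estimate reproducible
first_names = ["john", "jane", "ali", "maria", "chen"]

# Sample random pairs of values; for a large dataset, almost all such
# pairs are true non-matches.
sampled_pairs = [
    (random.choice(first_names), random.choice(first_names))
    for _ in range(10_000)
]

# u probability for the "exact match on first_name" comparison level:
# the fraction of random (non-matching) pairs that agree by chance.
u_exact = sum(left == right for left, right in sampled_pairs) / len(sampled_pairs)
# With 5 equally likely names, roughly 1 in 5 random pairs agree.
```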

Parameters:

Name Type Description Default max_pairs int

The maximum number of pairwise record comparisons to sample. Larger will give more accurate estimates but lead to longer runtimes. In our experience at least 1e9 (one billion) gives best results but can take a long time to compute. 1e7 (ten million) is often adequate whilst testing different model specifications, before the final model is estimated.

1000000.0 seed int

Seed for random sampling. Assign to get reproducible u probabilities. Note, seed for random sampling is only supported for DuckDB and Spark, for Athena and SQLite set to None.

None

Examples:

linker.training.estimate_u_using_random_sampling(max_pairs=1e8)\n

Returns:

Name Type Description Nothing None

Updates the estimated u parameters within the linker object and returns nothing.

","tags":["API","Training"]},{"location":"api_docs/training.html#splink.internals.linker_components.training.LinkerTraining.estimate_parameters_using_expectation_maximisation","title":"estimate_parameters_using_expectation_maximisation(blocking_rule, estimate_without_term_frequencies=False, fix_probability_two_random_records_match=False, fix_m_probabilities=False, fix_u_probabilities=True, populate_probability_two_random_records_match_from_trained_values=False)","text":"

Estimate the parameters of the linkage model using expectation maximisation.

By default, the m probabilities are estimated, but not the u probabilities, because good estimates for the u probabilities can be obtained from linker.training.estimate_u_using_random_sampling(). You can change this by setting fix_u_probabilities to False.

The blocking rule provided is used to generate pairwise record comparisons. Usually, this should be a blocking rule that results in a dataframe where matches are between about 1% and 99% of the blocked comparisons.

By default, m parameters are estimated for all comparisons except those which are included in the blocking rule.

For example, if the blocking rule is block_on(\"first_name\"), then parameter estimates will be made for all comparisons except those which use first_name in their sql_condition.

By default, the probability two random records match is allowed to vary during EM estimation, but is not saved back to the model. See this PR for the rationale.

Examples:

Default behaviour

br_training = block_on(\"first_name\", \"dob\")\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    br_training\n)\n

Parameters:

Name Type Description Default blocking_rule BlockingRuleCreator | str

The blocking rule used to generate pairwise record comparisons.

required estimate_without_term_frequencies bool

If True, the iterations of the EM algorithm ignore any term frequency adjustments and only depend on the comparison vectors. This allows the EM algorithm to run much faster, but the estimation of the parameters will change slightly.

False fix_probability_two_random_records_match bool

If True, do not update the probability two random records match after each iteration. Defaults to False.

False fix_m_probabilities bool

If True, do not update the m probabilities after each iteration. Defaults to False.

False fix_u_probabilities bool

If True, do not update the u probabilities after each iteration. Defaults to True.

True populate_probability_two_random_records_match_from_trained_values bool

If True, derive the probability that two random records match from the trained values. Defaults to False.

False

Examples:

blocking_rule = block_on(\"first_name\", \"surname\")\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule\n)\n

Returns:

Name Type Description EMTrainingSession EMTrainingSession

An object containing information about the training session such as how parameters changed during the iteration history

","tags":["API","Training"]},{"location":"api_docs/training.html#splink.internals.linker_components.training.LinkerTraining.estimate_m_from_pairwise_labels","title":"estimate_m_from_pairwise_labels(labels_splinkdataframe_or_table_name)","text":"

Estimate the m probabilities of the linkage model from a dataframe of pairwise labels.

The table of labels should be in the following format, and should be registered with your database:

source_dataset_l unique_id_l source_dataset_r unique_id_r df_1 1 df_2 2 df_1 1 df_2 3

Note that source_dataset and unique_id should correspond to the values specified in the settings dict, and the input_table_aliases passed to the linker object. Note that at the moment, this method does not respect values in a clerical_match_score column. If provided, these are ignored and it is assumed that every row in the table of labels has a score of 1, i.e. is a perfect match.

Parameters:

Name Type Description Default labels_splinkdataframe_or_table_name str

Name of table containing labels in the database or SplinkDataframe

required

Examples:

pairwise_labels = pd.read_csv(\"./data/pairwise_labels_to_estimate_m.csv\")\n\nlinker.table_management.register_table(\n    pairwise_labels, \"labels\", overwrite=True\n)\n\nlinker.training.estimate_m_from_pairwise_labels(\"labels\")\n
","tags":["API","Training"]},{"location":"api_docs/training.html#splink.internals.linker_components.training.LinkerTraining.estimate_m_from_label_column","title":"estimate_m_from_label_column(label_colname)","text":"

Estimate the m parameters of the linkage model from a label (ground truth) column in the input dataframe(s).

The m parameters represent the proportion of record comparisons that fall into each comparison level amongst truly matching records.

The ground truth column is used to generate pairwise record comparisons which are then assumed to be matches.

For example, if the entity being matched is persons, and your input dataset(s) contain social security number, this could be used to estimate the m values for the model.

Note that this column does not need to be fully populated. A common case is where a unique identifier such as social security number is only partially populated.
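To build intuition for what this method computes, the estimation logic can be sketched in plain Python. This is an illustrative toy, not Splink's actual implementation; all field names and data below are invented:

```python
from itertools import combinations

# Toy records with a partially populated ground-truth label (ssn).
records = [
    {"id": 1, "ssn": "A1", "first_name": "john"},
    {"id": 2, "ssn": "A1", "first_name": "john"},
    {"id": 3, "ssn": "A1", "first_name": "jon"},
    {"id": 4, "ssn": "B2", "first_name": "mary"},
    {"id": 5, "ssn": "B2", "first_name": "mary"},
    {"id": 6, "ssn": None, "first_name": "mark"},  # unlabelled rows contribute no pairs
]

# Pairs of records sharing a non-null label are assumed to be true matches.
true_matches = [
    (a, b)
    for a, b in combinations(records, 2)
    if a["ssn"] is not None and a["ssn"] == b["ssn"]
]

# The m probability for an "exact match on first_name" comparison level is the
# proportion of true matches that fall into that level.
agree = sum(a["first_name"] == b["first_name"] for a, b in true_matches)
m_exact = agree / len(true_matches)
```

Here the four labelled pairs split evenly between exact agreement and disagreement on first_name, so the estimated m for the exact-match level is 0.5.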

Parameters:

Name Type Description Default label_colname str

The name of the column containing the ground truth label in the input data.

required

Examples:

linker.training.estimate_m_from_label_column(\"social_security_number\")\n

Returns:

Name Type Description Nothing None

Updates the estimated m parameters within the linker object.

","tags":["API","Training"]},{"location":"api_docs/visualisations.html","title":"Visualisations","text":"","tags":["API","Visualisations"]},{"location":"api_docs/visualisations.html#methods-in-linkervisualisations","title":"Methods in Linker.visualisations","text":"

Visualisations to help you understand and diagnose your linkage model. Accessed via linker.visualisations.

Most of the visualisations return an altair.Chart object, meaning the chart can be saved and manipulated using Altair.

For example:

altair_chart = linker.visualisations.match_weights_chart()\n\n# Save to various formats\naltair_chart.save(\"mychart.png\")\naltair_chart.save(\"mychart.html\")\naltair_chart.save(\"mychart.svg\")\naltair_chart.save(\"mychart.json\")\n\n# Get chart spec as dict\naltair_chart.to_dict()\n

To save the chart as a self-contained html file with all scripts inlined so it can be viewed offline:

from splink.internals.charts import save_offline_chart\nc = linker.visualisations.match_weights_chart()\nsave_offline_chart(c.to_dict(), \"test_chart.html\")\n

View resultant html file in Jupyter (or just load it in your browser)

from IPython.display import IFrame\nIFrame(src=\"./test_chart.html\", width=1000, height=500)\n
","tags":["API","Visualisations"]},{"location":"api_docs/visualisations.html#splink.internals.linker_components.visualisations.LinkerVisualisations.match_weights_chart","title":"match_weights_chart(as_dict=False)","text":"

Display a chart of the (partial) match weights of the linkage model

Parameters:

Name Type Description Default as_dict bool

If True, return the chart as a dictionary.

False

Examples:

altair_chart = linker.visualisations.match_weights_chart()\naltair_chart.save(\"mychart.png\")\n
","tags":["API","Visualisations"]},{"location":"api_docs/visualisations.html#splink.internals.linker_components.visualisations.LinkerVisualisations.m_u_parameters_chart","title":"m_u_parameters_chart(as_dict=False)","text":"

Display a chart of the m and u parameters of the linkage model

Parameters:

Name Type Description Default as_dict bool

If True, return the chart as a dictionary.

False

Examples:

altair_chart = linker.visualisations.m_u_parameters_chart()\naltair_chart.save(\"mychart.png\")\n

Returns:

Name Type Description altair_chart ChartReturnType

An altair chart

","tags":["API","Visualisations"]},{"location":"api_docs/visualisations.html#splink.internals.linker_components.visualisations.LinkerVisualisations.match_weights_histogram","title":"match_weights_histogram(df_predict, target_bins=30, width=600, height=250, as_dict=False)","text":"

Generate a histogram that shows the distribution of match weights in df_predict

Parameters:

Name Type Description Default df_predict SplinkDataFrame

Output of linker.inference.predict()

required target_bins int

Target number of bins in histogram. Defaults to 30.

30 width int

Width of output. Defaults to 600.

600 height int

Height of output chart. Defaults to 250.

250 as_dict bool

If True, return the chart as a dictionary.

False

Examples:

df_predict = linker.inference.predict(threshold_match_weight=-2)\nlinker.visualisations.match_weights_histogram(df_predict)\n
","tags":["API","Visualisations"]},{"location":"api_docs/visualisations.html#splink.internals.linker_components.visualisations.LinkerVisualisations.parameter_estimate_comparisons_chart","title":"parameter_estimate_comparisons_chart(include_m=True, include_u=False, as_dict=False)","text":"

Show a chart that shows how parameter estimates have differed across the different estimation methods you have used.

For example, if you have run two EM estimation sessions, blocking on different variables, and both result in parameter estimates for first_name, this chart will enable easy comparison of the different estimates

Parameters:

Name Type Description Default include_m bool

Show different estimates of m values. Defaults to True.

True include_u bool

Show different estimates of u values. Defaults to False.

False as_dict bool

If True, return the chart as a dictionary.

False

Examples:

linker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule=block_on(\"first_name\"),\n)\n\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule=block_on(\"surname\"),\n)\n\nlinker.visualisations.parameter_estimate_comparisons_chart()\n

Returns:

Name Type Description altair_chart ChartReturnType

An Altair chart

","tags":["API","Visualisations"]},{"location":"api_docs/visualisations.html#splink.internals.linker_components.visualisations.LinkerVisualisations.tf_adjustment_chart","title":"tf_adjustment_chart(output_column_name, n_most_freq=10, n_least_freq=10, vals_to_include=None, as_dict=False)","text":"

Display a chart showing the impact of term frequency adjustments on a specific comparison level.

Parameters:

Name Type Description Default output_column_name str

Name of an output column for which term frequency adjustment has been applied.

required n_most_freq int

Number of most frequent values to show. If this or n_least_freq is set to None, all values will be shown. Defaults to 10.

10 n_least_freq int

Number of least frequent values to show. If this or n_most_freq is set to None, all values will be shown. Defaults to 10.

10 vals_to_include list

Specific values for which to show term frequency adjustments. Defaults to None.

None as_dict bool

If True, return the chart as a dictionary.

False

Examples:

linker.visualisations.tf_adjustment_chart(\"first_name\")\n

Returns:

Name Type Description altair_chart ChartReturnType

An Altair chart

","tags":["API","Visualisations"]},{"location":"api_docs/visualisations.html#splink.internals.linker_components.visualisations.LinkerVisualisations.waterfall_chart","title":"waterfall_chart(records, filter_nulls=True, remove_sensitive_data=False, as_dict=False)","text":"

Visualise how the final match weight is computed for the provided pairwise record comparisons.

Records must be provided as a list of dictionaries. This would usually be obtained from df.as_record_dict(limit=n) where df is a SplinkDataFrame.

Examples:

df = linker.inference.predict(threshold_match_weight=2)\nrecords = df.as_record_dict(limit=10)\nlinker.visualisations.waterfall_chart(records)\n

Parameters:

Name Type Description Default records List[dict]

Usually be obtained from df.as_record_dict(limit=n) where df is a SplinkDataFrame.

required filter_nulls bool

Whether the visualisation shows null comparisons, which have no effect on final match weight. Defaults to True.

True remove_sensitive_data bool

When True, the waterfall chart will contain match weights only, and all of the (potentially sensitive) data from the input tables will be removed before the chart is created.

False as_dict bool

If True, return the chart as a dictionary.

False

Returns:

Name Type Description altair_chart ChartReturnType

An Altair chart

","tags":["API","Visualisations"]},{"location":"api_docs/visualisations.html#splink.internals.linker_components.visualisations.LinkerVisualisations.comparison_viewer_dashboard","title":"comparison_viewer_dashboard(df_predict, out_path, overwrite=False, num_example_rows=2, return_html_as_string=False)","text":"

Generate an interactive html visualization of the linker's predictions and save to out_path. For more information see this video

Parameters:

Name Type Description Default df_predict SplinkDataFrame

The output of linker.inference.predict()

required out_path str

The path (including filename) to save the html file to.

required overwrite bool

Overwrite the html file if it already exists? Defaults to False.

False num_example_rows int

Number of example rows per comparison vector. Defaults to 2.

2 return_html_as_string bool

If True, return the html as a string

False

Examples:

df_predictions = linker.inference.predict()\nlinker.visualisations.comparison_viewer_dashboard(\n    df_predictions, \"scv.html\", overwrite=True, num_example_rows=2\n)\n

Optionally, in Jupyter, you can display the results inline. Otherwise, you can load the html file in your browser:

from IPython.display import IFrame\nIFrame(src=\"./scv.html\", width=\"100%\", height=1200)\n
","tags":["API","Visualisations"]},{"location":"api_docs/visualisations.html#splink.internals.linker_components.visualisations.LinkerVisualisations.cluster_studio_dashboard","title":"cluster_studio_dashboard(df_predict, df_clustered, out_path, sampling_method='random', sample_size=10, cluster_ids=None, cluster_names=None, overwrite=False, return_html_as_string=False, _df_cluster_metrics=None)","text":"

Generate an interactive html visualization of the predicted cluster and save to out_path.

Parameters:

Name Type Description Default df_predict SplinkDataFrame

The output of linker.inference.predict()

required df_clustered SplinkDataFrame

The output of linker.clustering.cluster_pairwise_predictions_at_threshold()

required out_path str

The path (including filename) to save the html file to.

required sampling_method str

random, by_cluster_size or lowest_density_clusters. Defaults to random.

'random' sample_size int

Number of clusters to show in the dashboard. Defaults to 10.

10 cluster_ids list

The IDs of the clusters that will be displayed in the dashboard. If provided, the sampling_method and sample_size arguments are ignored. Defaults to None.

None overwrite bool

Overwrite the html file if it already exists? Defaults to False.

False cluster_names list

If provided, the dashboard will display these names in the selection box. Only works in conjunction with cluster_ids. Defaults to None.

None return_html_as_string bool

If True, return the html as a string

False

Examples:

df_p = linker.inference.predict()\ndf_c = linker.clustering.cluster_pairwise_predictions_at_threshold(\n    df_p, 0.5\n)\n\nlinker.visualisations.cluster_studio_dashboard(\n    df_p, df_c, \"cluster_studio.html\", cluster_ids=[0, 4, 7]\n)\n

Optionally, in Jupyter, you can display the results inline. Otherwise, you can load the html file in your browser:

from IPython.display import IFrame\nIFrame(src=\"./cluster_studio.html\", width=\"100%\", height=1200)\n
","tags":["API","Visualisations"]},{"location":"blog/index.html","title":"Blog","text":"","tags":["Blog","News"]},{"location":"blog/2023/07/27/splink-updates---july-2023.html","title":"Splink Updates - July 2023","text":""},{"location":"blog/2023/07/27/splink-updates---july-2023.html#splink-updates-july-2023","title":"Splink Updates - July 2023","text":""},{"location":"blog/2023/07/27/splink-updates---july-2023.html#welcome-to-the-splink-blog","title":"Welcome to the Splink Blog!","text":"

It\u2019s hard to keep up to date with all of the new features being added to Splink, so we have launched this blog to share a round-up of the latest developments every few months.

So, without further ado, here are some of the highlights from the first half of 2023!

Latest Splink version: v3.9.4

"},{"location":"blog/2023/07/27/splink-updates---july-2023.html#massive-speed-gains-in-em-training","title":"Massive speed gains in EM training","text":"

There\u2019s now an option to make EM training much faster - in one example we\u2019ve seen a 1000-fold speedup. Kudos to external contributor @aymonwuolanne from the Australian Bureau of Statistics!

To make use of this, set the estimate_without_term_frequencies parameter to True; for example:

linker.estimate_parameters_using_expectation_maximisation(..., estimate_without_term_frequencies=True)\n

Note: If True, the EM algorithm ignores term frequency adjustments during the iterations. Instead, the adjustments are added once the EM algorithm has converged. This will result in a slight difference in the final parameter estimates.

"},{"location":"blog/2023/07/27/splink-updates---july-2023.html#out-of-the-box-comparisons","title":"Out-of-the-box Comparisons","text":"

Splink now contains lots of new out-of-the-box comparisons for dates, names, postcodes etc. The Comparison Template Library (CTL) provides suggested settings for common types of data used in linkage models.

For example, a Comparison for \"first_name\" can now be written as:

import splink.duckdb.comparison_template_library as ctl\n\nfirst_name_comparison = ctl.name_comparison(\"first_name\")\n

Check out these new functions in the Topic Guide and Documentation.

"},{"location":"blog/2023/07/27/splink-updates---july-2023.html#blocking-rule-library","title":"Blocking Rule Library","text":"

Blocking has, historically, been a point of confusion for users so we have been working behind the scenes to make that easier! The recently launched Blocking Rules Library (BRL) provides a set of functions for defining Blocking Rules (similar to the Comparison Library functions).

For example, a Blocking Rule for \"date_of_birth\" can now be written as:

import splink.duckdb.blocking_rule_library as brl\n\nbrl.exact_match_rule(\"date_of_birth\")\n

Note: from Splink v3.9.6, exact_match_rule has been superseded by block_on. We advise using this going forward.

Check out these new functions in the BRL Documentation as well as some new Blocking Topic Guides to better explain what Blocking Rules are, how they are used in Splink, and how to choose them.

Keep a look out, as there are more improvements in the pipeline for Blocking in the coming months!

"},{"location":"blog/2023/07/27/splink-updates---july-2023.html#postgres-support","title":"Postgres Support","text":"

With a massive thanks to external contributor @hanslemm, Splink now supports Postgres. To get started, check out the Postgres Topic Guide.

"},{"location":"blog/2023/07/27/splink-updates---july-2023.html#clerical-labelling-tool-beta","title":"Clerical Labelling Tool (beta)","text":"

Clerical labelling is an important tool for generating performance metrics for linkage models (False Positive Rate, Recall, Precision etc.).

Splink now has a (beta) GUI for clerical labelling which produces labels in a form that can be easily ingested into Splink to generate these performance metrics. Check out the example tool, linked Pull Request, and some previous tweets:

Draft new Splink tool to speed up manual labelling of record linkage data. Example dashboard: https://t.co/yc1yHpa90X Grateful for any feedback whilst I'm still working on this, on Twitter or the draft PR: https://t.co/eXSNHHe2kcFree and open source pic.twitter.com/MEo4DmaxO9

\u2014 Robin Linacre (@RobinLinacre) April 28, 2023

This tool is still in the beta phase, so is a work in progress and subject to change based on feedback we get from users. As a result, it is not thoroughly documented at this stage. We recommend checking out the links above to see a ready-made example of the tool. However, if you would like to generate your own, this example is a good starting point.

We would love any feedback from users, so please comment on the PR or open a discussion.

"},{"location":"blog/2023/07/27/splink-updates---july-2023.html#charts-in-altair-5","title":"Charts in Altair 5","text":"

Charts are now all fully-fledged Altair charts, making them much easier to work with.

For example, a chart c can now be saved with:

c.save(\"chart.png\", scale_factor=2)\n

where json, html, png, svg and pdf are all supported.

"},{"location":"blog/2023/07/27/splink-updates---july-2023.html#reduced-duplication-in-comparison-libraries","title":"Reduced duplication in Comparison libraries","text":"

Historically, importing of the comparison libraries has included declaring the backend twice. For example:

import splink.duckdb.duckdb_comparison_level_library as cll\n
This repetition has now been removed:
import splink.duckdb.comparison_level_library as cll\n
The original structure still works, but throws a warning to switch to the new version."},{"location":"blog/2023/07/27/splink-updates---july-2023.html#in-built-datasets","title":"In-built datasets","text":"

When following along with the tutorial or example notebooks, one common issue is references to paths for data that does not exist on users\u2019 machines. To overcome this, Splink now has a splink_datasets module which stores these datasets, ensuring any user can copy and paste working code without fear of path issues. For example:

from splink.datasets import splink_datasets\n\ndf = splink_datasets.fake_1000\n
returns the fake 1000-row dataset that is used in the Splink tutorial.

For more information check out the in-built datasets Documentation.

"},{"location":"blog/2023/07/27/splink-updates---july-2023.html#regular-expressions-in-comparisons","title":"Regular Expressions in Comparisons","text":"

When comparing records, some columns will have a particular structure (e.g. dates, postcodes, email addresses). It can be useful to compare sections of a column entry. Splink's string comparison level functions now include a regex_extract to extract a portion of strings to be compared. For example, an exact_match comparison that compares the first section of a postcode (outcode) can be written as:

import splink.duckdb.duckdb_comparison_library as cl\n\npc_comparison = cl.exact_match(\"postcode\", regex_extract=\"^[A-Z]{1,2}\")\n

Splink's string comparison level functions also now include a valid_string_regex parameter which sends any entries that do not conform to a specified structure to the null level. For example, a levenshtein comparison that ensures emails have an \"@\" symbol can be written as:

import splink.duckdb.duckdb_comparison_library as cl\n\nemail_comparison = cl.levenshtein_at_thresholds(\"email\", valid_string_regex=\"^[^@]+\")\n

For more on how Regular Expressions can be used in Splink, check out the Regex topic guide.

Note: from Splink v3.9.6, valid_string_regex has been renamed as valid_string_pattern.

"},{"location":"blog/2023/07/27/splink-updates---july-2023.html#documentation-improvements","title":"Documentation Improvements","text":"

We have been putting a lot of effort into improving our documentation site, including launching this blog!

Some of the improvements include:

  • More Topic Guides covering things such as Record Linkage Theory, Guidance on Splink's backends and String Fuzzy Matching.
  • A Contributors Guide to make contributing to Splink even easier. If you are interested in getting involved in open source, check the guide out!
  • Adding tables to the Comparison libraries documentation to show the functions available for each SQL backend.

Thanks to everyone who filled out our feedback survey. If you have any more feedback or ideas for how we can make the docs better please do let us know by raising an issue, starting a discussion or filling out the survey.

"},{"location":"blog/2023/07/27/splink-updates---july-2023.html#whats-in-the-pipeline","title":"What's in the pipeline?","text":"
  • More Blocking improvements
  • Settings dictionary improvements
  • More guidance on how to evaluate Splink models and linkages
"},{"location":"blog/2023/12/06/splink-updates---december-2023.html","title":"Splink Updates - December 2023","text":""},{"location":"blog/2023/12/06/splink-updates---december-2023.html#splink-updates-december-2023","title":"Splink Updates - December 2023","text":"

Welcome to the second installment of the Splink Blog!

Here are some of the highlights from the second half of 2023, and a taste of what is in store for 2024!

Latest Splink version: v3.9.10

"},{"location":"blog/2023/12/06/splink-updates---december-2023.html#charts-gallery","title":"Charts Gallery","text":"

The Splink docs site now has a Charts Gallery to show off all of the charts that come out-of-the-box with Splink to make linking easier.

Each chart now has an explanation of:

  1. What the chart shows
  2. How to interpret it
  3. Actions to take as a result

This is the first step on a longer term journey to provide more guidance on how to evaluate Splink models and linkages, so watch this space for more in the coming months!

"},{"location":"blog/2023/12/06/splink-updates---december-2023.html#new-charts","title":"New Charts","text":"

We are always adding more charts to Splink - to understand how these charts are built see our new Charts Developer Guide.

Two of our latest additions are:

"},{"location":"blog/2023/12/06/splink-updates---december-2023.html#confusion-matrix","title":"Confusion Matrix","text":"

When evaluating any classification model, a confusion matrix is a useful tool for summarising performance by representing counts of true positive, true negative, false positive, and false negative predictions.

Splink now has its own confusion matrix chart to show how model performance changes with a given match weight threshold.

Note, labelled data is required to generate this chart.
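The counts underlying such a chart are simple to reproduce. A minimal sketch in plain Python (illustrative only, not Splink's internal implementation; the data below is invented), given predicted match weights and ground-truth labels:

```python
def confusion_counts(match_weights, labels, threshold):
    """Count TP/FP/TN/FN at a given match weight threshold.

    match_weights: predicted match weight for each record comparison
    labels: 1 for a true match, 0 otherwise
    """
    tp = fp = tn = fn = 0
    for w, y in zip(match_weights, labels):
        predicted_match = w >= threshold
        if predicted_match and y:
            tp += 1
        elif predicted_match and not y:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}

# Four toy comparisons: two true matches, two true non-matches.
counts = confusion_counts([8.2, 3.1, -1.5, -6.0], [1, 0, 1, 0], threshold=0.0)
```

Sweeping `threshold` over a range of match weights and recomputing the counts is what lets the chart show how performance changes with the chosen threshold.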

"},{"location":"blog/2023/12/06/splink-updates---december-2023.html#completeness-chart","title":"Completeness Chart","text":"

When linking multiple datasets together, one of the most important factors for a successful linkage is the number of common fields across the datasets.

Splink now has the completeness chart which gives a simple view of how well populated fields are across datasets.
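The statistic this chart plots - the proportion of non-null values per field, per dataset - can be sketched in a few lines of plain Python (dataset and field names below are invented for illustration):

```python
# Two toy input datasets with partially populated fields.
datasets = {
    "df_left": [
        {"first_name": "joe", "postcode": "AB1 2CD"},
        {"first_name": "ann", "postcode": None},
    ],
    "df_right": [
        {"first_name": None, "postcode": "EF3 4GH"},
        {"first_name": "sam", "postcode": "IJ5 6KL"},
    ],
}

# Completeness = share of rows where the field is populated.
completeness = {
    name: {
        col: sum(row[col] is not None for row in rows) / len(rows)
        for col in rows[0]
    }
    for name, rows in datasets.items()
}
```

Fields that are well populated in one dataset but sparse in another are poor candidates for linking variables, which is exactly what the chart surfaces.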

"},{"location":"blog/2023/12/06/splink-updates---december-2023.html#settings-validation","title":"Settings Validation","text":"

The Settings dictionary is central to everything in Splink. It defines everything from the SQL dialect of your backend to how features are compared in a Splink model.

A common sticking point with users is how easy it is to make small errors when defining the Settings dictionary, resulting in unhelpful error messages.

To address this issue, the Settings Validator provides clear, user-friendly feedback on what the issue is and how to fix it.

"},{"location":"blog/2023/12/06/splink-updates---december-2023.html#blocking-rule-library-improved","title":"Blocking Rule Library (Improved)","text":"

In our previous blog we introduced the Blocking Rule Library (BRL) built upon the exact_match_rule function. When testing this functionality we found it pretty verbose, particularly when blocking on multiple columns, so figured we could do better. From Splink v3.9.6 we introduced the block_on function to supersede exact_match_rule.

For example, a block on first_name and surname now looks like:

from splink.duckdb.blocking_rule_library import block_on\nblock_on([\"first_name\", \"surname\"])\n

as opposed to

import splink.duckdb.blocking_rule_library as brl\nbrl.and_(\n  brl.exact_match_rule(\"first_name\"),\n  brl.exact_match_rule(\"surname\")\n)\n

All of the tutorials, example notebooks and docs have been updated to use block_on.

"},{"location":"blog/2023/12/06/splink-updates---december-2023.html#backend-specific-installs","title":"Backend Specific Installs","text":"

Some users have had difficulties downloading Splink due to additional dependencies, some of which may not be relevant for the backend they are using. To solve this, you can now install a minimal version of Splink for your given SQL engine.

For example, to install Splink purely for Spark use the command:

pip install 'splink[spark]'\n

See the Getting Started page for further guidance.

"},{"location":"blog/2023/12/06/splink-updates---december-2023.html#drop-support-for-python-37","title":"Drop support for python 3.7","text":"

From Splink 3.9.7, support has been dropped for python 3.7. This decision has been made to manage dependency clashes in the back end of Splink.

If you are working with python 3.7, please revert to Splink 3.9.6.

pip install splink==3.9.6\n
"},{"location":"blog/2023/12/06/splink-updates---december-2023.html#whats-in-the-pipeline","title":"What's in the pipeline?","text":"
  • Work on Splink 4 is currently underway
  • More guidance on how to evaluate Splink models and linkages
"},{"location":"blog/2024/01/23/ethics-in-data-linking.html","title":"Ethics in Data Linking","text":""},{"location":"blog/2024/01/23/ethics-in-data-linking.html#ethics-in-data-linking","title":"Ethics in Data Linking","text":"

Welcome to the next installment of the Splink Blog where we\u2019re talking about Data Ethics!

"},{"location":"blog/2024/01/23/ethics-in-data-linking.html#why-should-we-care-about-ethics","title":"Why should we care about ethics?","text":"

Splink was developed in-house at the UK Government\u2019s Ministry of Justice. As data scientists in government, we are accountable to the public and have a duty to maintain public trust. This includes upholding high standards of data ethics in our work.

Furthermore, data linkage is generally used at the start of analytical projects so any design decisions that are made, or biases introduced, will have consequences for all downstream use cases of that data. With this in mind, it is important to try and address any potential ethical issues at the linking stage.

"},{"location":"blog/2024/01/23/ethics-in-data-linking.html#ethics-and-splink","title":"Ethics and Splink","text":""},{"location":"blog/2024/01/23/ethics-in-data-linking.html#what-do-we-already-have-in-place","title":"What do we already have in place?","text":"

Data ethics has been a foundational consideration throughout Splink\u2019s development. For example, the decision to make Splink open-source was motivated by an ambition to make our data linking software fully transparent, accessible and auditable to users both inside and outside of government. The fact that this also empowers external users to expand and improve upon Splink\u2019s functionality is another huge benefit!

Another core principle guiding the development of Splink has been explainability. Under the hood we use the Fellegi-Sunter model, which is an industry-standard, well-researched, explainable methodology. This, in combination with interactive charts such as the waterfall chart, where model results can be easily broken down and visualised for individual record pairs, makes Splink predictions easily interrogatable and explainable. Being able to interrogate predictions is especially valuable when things go wrong - if an incorrect link has been made you can trace it back to see exactly why the model made the decision.

"},{"location":"blog/2024/01/23/ethics-in-data-linking.html#what-else-should-we-be-considering","title":"What else should we be considering?","text":"

To continue our exploration of ethical issues, we recently had a team away day focused on data ethics. We aimed to better understand where ethical concerns (e.g. bias) could arise in our own Splink linkage pipelines and what further steps we could take to empower users to be able to better understand and possibly mitigate these issues within their own projects.

We discussed a typical data linking pipeline, as used in the Ministry of Justice, from data collection at source through to the generation of Splink cluster IDs. It became clear that there are considerations to make at each stage of a pipeline that can have ethical implications.

For example, a higher occurrence of misspellings for names of non-UK origin during data collection can impact the accuracy of links for certain groups.

As you can see, the entire data linking process has many stages with lots of moving parts, resulting in numerous opportunities for ethical issues to arise.

"},{"location":"blog/2024/01/23/ethics-in-data-linking.html#what-are-we-going-to-do-about-it","title":"What are we going to do about it?","text":"

Summarised below are the key areas of ethical concern we identified and how we plan to address them.

"},{"location":"blog/2024/01/23/ethics-in-data-linking.html#evaluation","title":"Evaluation","text":"

Splink is not plug and play. As a piece of software, it provides many configuration options to support its users, from blocking rules to term frequency adjustments. However, with greater flexibility comes greater variation in model design. From an explainability and quality assurance perspective, it is important to understand how different model design choices interact and can influence results.

It isn\u2019t trivial to unpick the interplaying factors that affect Splink\u2019s outputs, but as a first step we are building a framework and guidance to demonstrate how changes to a model's settings can impact predictions. We hope this will give users a better understanding of model sensitivity and more confidence in explaining and justifying the results of their models. We also hope this will serve as a stepping stone to tools that help evaluate models in a production setting (e.g. model drift).

"},{"location":"blog/2024/01/23/ethics-in-data-linking.html#bias","title":"Bias","text":"

Bias is a key area of ethical concern within data linking and one that crops up at many stages during a typical linking pipeline; from data collection to downstream linking. It is important to identify, quantify and, where possible, mitigate bias in input sources, model building and outputs. However, sources of bias are specific to a given use-case, and therefore finding general solutions to mitigating bias is challenging.

This year we are embarking on a collaboration with the Alan Turing Institute to get expert support on assessing bias in our linking pipelines. The long-term goal is to create general tooling to help Splink users gain a better understanding of how bias could be being introduced into their models. Improved model evaluation (see above) is a first step in the development of these tools.

"},{"location":"blog/2024/01/23/ethics-in-data-linking.html#communication","title":"Communication","text":"

Sharing both our current knowledge and future discoveries on the ethics of data linking with Splink is important to help support our users and the data linking community more broadly. This blog is the first step on that journey for us.

As already mentioned, Splink comes with a variety of tools that support explainability. We will be updating the Splink documentation to convey the significance of these resources from a data ethics perspective to help give existing users, potential adopters and their customers greater confidence in building Splink models and model predictions.

Please visit the Ethics in Data Linking discussion on Splink's GitHub repository to get involved in the conversation and share your thoughts - we'd love to hear them!

If you want to stay up to date with the latest Splink blogs, subscribe to our new RSS feed!

"},{"location":"blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html","title":"Splink 3 updates, and Splink 4 development announcement - April 2024","text":""},{"location":"blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html#splink-3-updates-and-splink-4-development-announcement-april-2024","title":"Splink 3 updates, and Splink 4 development announcement - April 2024","text":"

This post describes significant updates to Splink since our previous post and details of development work taking place on the forthcoming release of Splink 4.

Latest Splink version: v3.9.14

"},{"location":"blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html#splink-3-updates","title":"Splink 3 Updates","text":"

Here are some highlights of Splink development since our last update in December 2023. As always, keep an eye on the changelog for more regular updates.

"},{"location":"blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html#graph-metrics","title":"Graph metrics","text":"

Linked data can be interpreted as graphs, as described in our graph definitions guide. Given this, graph metrics are useful in record linkage because they give insights into the quality of your final output (linked data) and, by extension, the linkage pipeline. They are particularly relevant for the analysis of clusters.

For example, a cluster where all entities are connected to all others with high match weights is likely to be more reliable than a cluster where many of the entities connect to only a small proportion of the other entities in the cluster. This can be measured by a graph metric called density.
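To illustrate, density is the number of observed edges divided by the number of possible edges in a cluster. A minimal sketch in plain Python (an illustrative helper, not the Splink implementation):

```python
def cluster_density(n_nodes: int, n_edges: int) -> float:
    """Density = observed edges / possible edges for an undirected cluster."""
    if n_nodes < 2:
        return 0.0
    possible_edges = n_nodes * (n_nodes - 1) / 2
    return n_edges / possible_edges

print(cluster_density(4, 6))  # fully connected 4-record cluster: 1.0
print(cluster_density(4, 3))  # sparsely connected 4-record cluster: 0.5
```

A density close to 1 suggests strong internal evidence for the cluster; a low density suggests the cluster may merit closer inspection.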

Several graph metrics can now be computed using linker.compute_graph_metrics.

"},{"location":"blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html#duckdb-performance-improvements-and-benchmarking","title":"DuckDB Performance Improvements and Benchmarking","text":"

The DuckDB backend is now fully parallelised, resulting in large performance increases especially on high core count machines.

We now recommend the DuckDB backend for most users. It is the fastest backend, and is capable of linking large datasets, especially if you have access to high-spec machines.

For the first time, we have also conducted formal benchmarking of DuckDB on machines of different sizes. Check out our blog post outlining the results of this investigation.

"},{"location":"blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html#blocking-on-an-array-column","title":"Blocking on an array column","text":"

In some circumstances, it is useful to block on an array column. For example, if each person has an array (list) of postcodes associated with their record, then we may wish to generate all record comparisons where at least one postcode matches (the intersection of the arrays is non-empty). This feature was added in PR 1692, with thanks to GitHub user nerskin for this external contribution!
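The logic of blocking on an array column can be sketched in plain Python (a toy illustration of the idea with hypothetical records, not Splink's SQL implementation): a pair of records is compared only if their postcode arrays share at least one element.

```python
from itertools import combinations

# Hypothetical records, each with an array of postcodes
records = [
    {"id": 1, "postcodes": ["SW1A 1AA", "E1 6AN"]},
    {"id": 2, "postcodes": ["E1 6AN", "M1 1AE"]},
    {"id": 3, "postcodes": ["LS1 4AP"]},
]

# Keep only the pairs whose postcode arrays share at least one element
pairs = [
    (a["id"], b["id"])
    for a, b in combinations(records, 2)
    if set(a["postcodes"]) & set(b["postcodes"])
]
print(pairs)  # [(1, 2)]
```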

"},{"location":"blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html#more-documentation","title":"More Documentation","text":"

We have been building more guidance and documentation to make life as easy as possible for users, including:

  • Topic Guides exploring Evaluation for different outputs of the linkage process, including the Linkage Model, the Edges (Links) and Clusters.
  • Guidance on our strategy for Managing Dependencies within Splink.
  • A Developer Quickstart guide to help contributors get up and running smoothly (with thanks to external contributor zmbc for putting this together).

Warning

Splink 3 has entered maintenance mode. We will continue to apply bugfixes, but new features should be built on the splink4_dev branch. We are no longer accepting new features on the master (Splink 3) branch.

"},{"location":"blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html#splink-4","title":"Splink 4","text":"

The team has been focussing development efforts on Splink 4, due to be released later this year.

We\u2019re pleased to announce we\u2019ve recently reached an important milestone: all tests are passing, and all of the tutorial and example notebooks have been updated and work successfully in the new version.

Development releases of Splink 4 have commenced, and you can try it out using pip install --pre splink, or in your web browser using the Colab links at the top of the tutorial and example notebooks.

As a result, Splink 3 has entered maintenance mode. We will continue to apply bugfixes, but new features should be built on the splink4_dev branch. We are no longer accepting new features on the master (Splink 3) branch.

"},{"location":"blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html#aims-of-splink-4","title":"Aims of Splink 4","text":"

Splink 4 represents an incremental improvement to version 3 that makes Splink easier to use without making any major changes to workflows. The core functionality has not changed - the steps to train a model and predict results are the same, and models trained in Splink 3 will still work in Splink 4.

"},{"location":"blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html#improve-ease-of-use","title":"Improve ease of use","text":"

The primary aim is to improve the user-facing API so that:

  • The user has to write less code to achieve the same result
  • Function imports are simpler and grouped more intuitively
  • Settings and configuration can be constructed entirely using Python objects, meaning the user can rely heavily on autocomplete rather than needing to remember the names of settings
  • Less dialect-specific code is needed

You can see an example of how the code changes between version 3 and 4 in the screenshot below:

The corresponding code is here.

"},{"location":"blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html#improve-ease-of-development","title":"Improve ease of development","text":"

A second important aim of Splink 4 is to improve the internal codebase to make Splink easier to develop for the core team and external contributors. These changes don\u2019t affect the end user, but should enable a faster pace of development.

A wide range of improvements have been made such as:

  • Code quality: type hinting, mypy conformance etc.
  • Making CI run much faster
  • Reducing rigidities in dependencies
  • Decoupling parts of the codebase and less mutable state
"},{"location":"blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html#timelines","title":"Timelines","text":"

We expect to do regular beta releases to PyPI in the coming months. They can be found here, and you can install the latest version of Splink 4 using pip install --pre splink.

Warning

During this time, there may be further breaking changes to the public API so please use Splink 4 with caution. However, we think the new API is now relatively stable, and big changes are unlikely.

We expect to bring Splink 4 out of beta, and do a first full release sometime in the autumn.

"},{"location":"blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html#feedback","title":"Feedback","text":"

We would love feedback on Splink 4, so please check it out and let us know what you think! The best way to get in contact is via our discussion forum.

"},{"location":"blog/2024/07/24/splink-400-released.html","title":"Splink 4.0.0 released","text":""},{"location":"blog/2024/07/24/splink-400-released.html#splink-400-released","title":"Splink 4.0.0 released","text":"

We're pleased to release Splink 4, which is more scalable and easier to use than Splink 3.

For the uninitiated, Splink is a free and open source library for record linkage and deduplication at scale. It is capable of deduplicating 100 million+ records, is widely used, and has been downloaded over 8 million times.

Version 4 is recommended to all new users. For existing users, there has been no change to the statistical methodology: versions 3 and 4 will give the same results, so there's no urgency to upgrade existing pipelines.

The improvements we've made to the user experience mean that Splink 4 syntax is not backwards compatible, so Splink 3 scripts will need to be adjusted to work in Splink 4. However, the model serialisation format is unchanged, so models saved from Splink 3 in .json format can be imported into Splink 4.

To get started quickly with Splink 4, check out the examples. You can see how things have changed by comparing them to the Splink 3 examples, or see the screenshot at the bottom of this post.

"},{"location":"blog/2024/07/24/splink-400-released.html#main-enhancements","title":"Main enhancements","text":"
  • User Experience: We have revamped all aspects of the user-facing API. Functionality is easier to find, better named and better organised.

  • Faster and more scalable: Our testing suggests that the internal changes have made Splink 4 significantly more scalable. Our testing also suggests Splink 4 is faster than Splink 3 for many workloads. This is in addition to dramatic speedups that were integrated into Splink 3 in January, meaning Splink is now 5x faster for a typical workload on a modern laptop than it was in November 2023. We welcome any feedback from users about speed and scalability, as it's hard for us to test the full range of scenarios.

  • Improved backend code quality: The Splink 4 codebase represents a big refactor focused on code quality. It should now be easier to contribute, and quicker and easier for the team to fix bugs.

  • Autocomplete everywhere: All functions, most notably the settings object, have been rewritten to ensure autocomplete (IntelliSense/code completion) works. This means you no longer need to remember the specific name of the wide range of configuration options - a key like blocking_rules_to_generate_predictions will autocomplete. Where settings such as link_type have a predefined list of valid options, these will also autocomplete.

"},{"location":"blog/2024/07/24/splink-400-released.html#smaller-enhancements","title":"Smaller enhancements","text":"

Some highlights of other smaller improvements:

  • Linker functionality is now organised into namespaces. In Splink 3, a very large number of functions were available on the linker object, making it hard to find and remember what functionality exists. In Splink 4, functions are available in namespaces such as linker.training and linker.inference. Documentation here.

  • Blocking analysis. The new blocking functions at splink.blocking include routines to ensure users don't accidentally run code that generates so many comparisons it never completes. Blocking analysis is also much faster. See the blocking tutorial for more.

  • Switch between dialects more easily. The backend SQL dialect (DuckDB, Spark etc.) is now imported using the relevant database API. This is passed into Splink functions (such as creation of the linker), meaning that switching between dialects is now simply a case of importing a different database API; no other code needs to change. For example, compare the DuckDB and SQLite examples.

  • Exploratory analysis no longer needs a linker. Exploratory analysis that is typically conducted before starting data linking can now be done in isolation, without the user needing to configure a linker. Exploratory analysis is now available at splink.exploratory. Similarly, blocking can be done without a linker using the functions at splink.blocking.

  • Enhancements to API documentation. Now that the codebase is better organised, it has been much easier to provide high-quality API documentation - the new pages are here.

"},{"location":"blog/2024/07/24/splink-400-released.html#updating-splink-3-code","title":"Updating Splink 3 code","text":"

Conceptually, there are no major changes in Splink 4. Splink 4 code follows the same steps as Splink 3. The same core estimation and prediction routines are used. Splink 4 code that uses the same settings will produce the same results (predictions) as Splink 3.

That said, there have been significant changes to the syntax and a reorganisation of functions.

For users wishing to familiarise themselves with Splink 4, we recommend the easiest way is to compare and contrast the new examples with their Splink 3 equivalents.

You may also find the following screenshot useful, which shows the diff of a fairly standard Splink 3 workflow that has been rewritten in Splink 4.

You can find the corresponding code here.

"},{"location":"blog/2024/08/19/bias-in-data-linking.html","title":"Bias in Data Linking","text":""},{"location":"blog/2024/08/19/bias-in-data-linking.html#bias-in-data-linking","title":"Bias in Data Linking","text":"

In March 2024, the Splink team launched a 6-month 'Bias in Data Linking' internship with the Alan Turing Institute. This instalment of the Splink Blog introduces the internship and its goals, and provides an update on what's happened so far.

The internship is being undertaken by myself, Erica Kane. I am a PhD student based at the University of Leeds. My doctoral research is in Data Analytics, conducted in partnership with the Parole Board, and I have a background in quantitative research within Criminal Justice.

"},{"location":"blog/2024/08/19/bias-in-data-linking.html#background","title":"\ud83d\udcdd Background","text":"

The internship stemmed from the team's previous engagement with ethics, understanding that this complex yet inevitable aspect of data science has implications for data linking.

Data science pipelines are intricate processes with lots of decision points. At each of these decision points bias can creep in. If it does, its impact on results can vary as it interacts with different parts of the pipeline. For example, two datasets might react differently to the same bias introduced by a model. Additionally, multiple biases can interact with each other, making it difficult to see their individual effects. Therefore, detecting, monitoring, quantifying, and mitigating bias in data science pipelines is extremely challenging.

"},{"location":"blog/2024/08/19/bias-in-data-linking.html#goals","title":"\ud83c\udfaf Goals","text":"

To set the direction for the internship, it was useful to first define what a successful outcome would look like.

Many users and developers of data linking pipelines have ideas about where bias might be entering their pipeline, but they aren\u2019t always sure how to evaluate this bias or understand its impact. So, the goal was to create a standardised approach to evaluating bias that\u2019s adaptable to different use cases.

Before developing this approach, it was useful to look at different types and sources of bias in data linking pipelines. This made sure that the development was grounded in real-life examples, which was crucial for assessing if an evaluation method was suitable.

"},{"location":"blog/2024/08/19/bias-in-data-linking.html#sources-of-bias","title":"\ud83d\udd0d Sources of bias","text":"

From talking with experts and reviewing relevant materials, it was clear that there were already hypotheses about where bias might enter a data linking pipeline.

These hypotheses were reviewed and grouped into broad categories, highlighting the key areas for evaluation:

The input data can contain mistakes or legitimate qualities which make some records harder to link than others. Data preparation techniques intended to address these mistakes or qualities can have the same effect. If this impact is not random, the input data will introduce bias.

Model design involves specifying settings that define which records to compare and how to compare them. If these design choices result in a better/worse performance for certain record types, bias will be introduced.

Understanding these potential bias sources laid the groundwork for determining the most suitable evaluation method.

"},{"location":"blog/2024/08/19/bias-in-data-linking.html#evaluating-bias","title":"\ud83d\udcca Evaluating bias","text":"

There are many ways to evaluate performance in data science, and a common approach is to compare the output of a model with a ground truth. In data linking, this means manually labelling comparisons as \"link\" or \"non-link\", running them through the pipeline, and then comparing the predicted results to these labels.
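The core of this comparison can be sketched in plain Python (a toy illustration with hypothetical labels and predictions, not part of the Splink API):

```python
# Hypothetical clerical labels and pipeline predictions for five record pairs
labels      = ["link", "non-link", "link", "non-link", "link"]
predictions = ["link", "link",     "link", "non-link", "non-link"]

# Count true positives, false positives and false negatives
tp = sum(l == "link" and p == "link" for l, p in zip(labels, predictions))
fp = sum(l == "non-link" and p == "link" for l, p in zip(labels, predictions))
fn = sum(l == "link" and p == "non-link" for l, p in zip(labels, predictions))

precision = tp / (tp + fp)  # share of predicted links that are real links
recall = tp / (tp + fn)     # share of real links that were found
```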

Since this method is commonly used to measure overall performance, labelled data may already exist. If this is the case, it's worth exploring how these labels could be repurposed to evaluate bias instead. This requires a more focussed approach, where it's necessary to pinpoint specific records that align with a defined hypothesis. These might include records that represent mistakes, qualities, or preparation of the input data, or those affected by model design settings.

Assuming there's already a hypothesis in place, this approach involves a 3-step process:

Each step was reviewed to understand the considerations for evaluating bias...

"},{"location":"blog/2024/08/19/bias-in-data-linking.html#step-1-hand-label-record-comparisons","title":"Step 1: Hand label record comparisons","text":"

Firstly, a sample of record pairs is labelled by human experts to decide whether each pair relates to the same person. This sample provides the base from which performance is assessed.

When working with real data, it's not always clear whether records relate to the same individual...

Even human evaluators can struggle, and individuals often disagree. In these uncertain cases, there is a risk of bias being introduced into the labels by the evaluators themselves. The lack of consistency or reliability of labels makes it hard to consider them a \"ground truth\" from which to assess bias.

"},{"location":"blog/2024/08/19/bias-in-data-linking.html#step-2-identify-records","title":"Step 2: Identify records","text":"

The second step is to focus the evaluation on bias by identifying records that represent the specific hypothesis. Issues with this process are demonstrated by the following example:

Bias is suspected to enter a pipeline through data standardisation using an English phonetic algorithm (e.g. Metaphone).

Records with non-English phonetic names must be identified for evaluation. There are two main options to identify these records, both with associated drawbacks.

  1. Using variables as proxies

    • Assumes a relationship between the variable and name phonetics (e.g. ethnicity/nationality)
    • Relies on accurate recording of the variable
  2. Direct identification

    • Requires a complex technical solution which would be difficult to develop and verify

These issues are applicable to most hypotheses, and both options are likely to introduce additional bias.

"},{"location":"blog/2024/08/19/bias-in-data-linking.html#step-3-assess-the-performance","title":"Step 3: Assess the performance","text":"

The final step is to assess performance outcomes by comparing the labelled data with the pipeline\u2019s predictions. In bias evaluation, understanding where the model goes wrong is of particular interest (false positives and false negatives).

Typical data linking performance, at a high level, may look like this:

This represents the reality of the dominant class (non-links) in data linking, as most record pairs that are compared will not relate to the same individual. This leaves very few errors to evaluate within a large sample of labels. When analysing for bias, the focus would be on an even smaller subset of records of interest.

A bias-specific data linking performance may look like this:

This further reduces the absolute number of examples, making it difficult to quantify the impact of any bias.

Getting enough useful examples is possible, but impractical. It would require either dedicating a lot of resources to labelling or using a sampling method that could introduce additional bias.
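A toy calculation makes the scale of the problem concrete (all numbers hypothetical):

```python
# Hypothetical illustration of how few labelled errors remain to analyse
total_labels = 10_000
errors = total_labels * 2 // 100        # suppose 2% of labelled pairs are FPs or FNs
errors_of_interest = errors * 5 // 100  # suppose 5% of those match the bias hypothesis
print(errors_of_interest)  # 10
```

Ten examples is far too few to draw statistically meaningful conclusions about bias.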

"},{"location":"blog/2024/08/19/bias-in-data-linking.html#conclusions","title":"\ud83d\udca1 Conclusions","text":"

The internship aims to develop an approach that helps users of data linking pipelines evaluate suspected bias. This first blog covers the initial steps taken to figure out what the evaluation process could look like.

Looking into how a common performance evaluation strategy handles bias investigation in data linking uncovered three main issues:

  1. Manual labelling does not give a reliable \"ground truth\".
  2. Records of interest for a specific hypothesis are difficult to identify.
  3. Gathering large samples of FPs and FNs is impractical.

These challenges stem from working with real data and make this approach unsuitable for bias evaluation. We\u2019re currently looking into alternative options \u2014 stay tuned for updates!

"},{"location":"charts/index.html","title":"Charts Gallery","text":""},{"location":"charts/index.html#charts-gallery","title":"Charts Gallery","text":""},{"location":"charts/index.html#exploratory-analysis","title":"Exploratory Analysis","text":"

profile columns

completeness chart

"},{"location":"charts/index.html#blocking","title":"Blocking","text":"

cumulative comparisons to be scored from blocking rules chart

"},{"location":"charts/index.html#comparison-helpers","title":"Comparison Helpers","text":"

comparator score chart

comparator score threshold chart

phonetic match chart

"},{"location":"charts/index.html#evaluation","title":"Evaluation","text":""},{"location":"charts/index.html#model-evaluation","title":"Model Evaluation","text":"

match weights chart

m u parameters chart

parameter estimate comparisons chart

tf adjustment chart

unlinkables chart

"},{"location":"charts/index.html#edge-link-evaluation","title":"Edge (Link) Evaluation","text":""},{"location":"charts/index.html#overall","title":"Overall","text":"

accuracy chart from labels table

threshold_selection_tool from labels table

"},{"location":"charts/index.html#spot-checking","title":"Spot Checking","text":"

comparison viewer dashboard

waterfall chart

"},{"location":"charts/index.html#cluster-evaluation","title":"Cluster Evaluation","text":""},{"location":"charts/index.html#overall_1","title":"Overall","text":""},{"location":"charts/index.html#spot-checking_1","title":"Spot Checking","text":"

cluster studio dashboard

"},{"location":"charts/index.html#all-charts","title":"All Charts","text":"

accuracy chart from labels table

cluster studio dashboard

comparator score chart

comparator score threshold chart

comparison viewer dashboard

completeness chart

cumulative comparisons from blocking rules chart

m u parameters chart

match weights chart

parameter estimate comparisons chart

phonetic match chart

profile columns

tf adjustment chart

unlinkables chart

waterfall chart

"},{"location":"charts/accuracy_analysis_from_labels_table.html","title":"accuracy chart from labels table","text":""},{"location":"charts/accuracy_analysis_from_labels_table.html#accuracy_analysis_from_labels_table","title":"accuracy_analysis_from_labels_table","text":"

At a glance

Useful for: Selecting an optimal match weight threshold for generating linked clusters.

API Documentation: accuracy_chart_from_labels_table()

What is needed to generate the chart? A linker with some data and a corresponding labelled dataset

"},{"location":"charts/accuracy_analysis_from_labels_table.html#what-the-chart-shows","title":"What the chart shows","text":"

For a given match weight threshold, a record pair with a score above this threshold will be labelled a match and below the threshold will be labelled a non-match. For all possible match weight thresholds, this chart shows various accuracy metrics comparing the Splink scores against clerical labels.

Precision and recall are shown by default, but various additional metrics can be added: specificity, negative predictive value (NPV), accuracy, \\(F_1\\), \\(F_2\\), \\(F_{0.5}\\), \\(P_4\\) and \\(\\phi\\) (Matthews correlation coefficient).

"},{"location":"charts/accuracy_analysis_from_labels_table.html#how-to-interpret-the-chart","title":"How to interpret the chart","text":"

Precision can be maximised by increasing the match threshold (reducing false positives).

Recall can be maximised by decreasing the match threshold (reducing false negatives).

Additional metrics can be used to find the optimal compromise between these two, looking for the threshold at which peak accuracy is achieved.
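The threshold sweep behind this chart can be sketched in plain Python (hypothetical match weights and labels; the real chart is produced by accuracy_analysis_from_labels_table):

```python
# Hypothetical (match_weight, is_link) pairs from clerical labelling
scored_pairs = [(-5.0, False), (-1.0, False), (2.0, True),
                (3.0, False), (6.0, True), (8.0, True)]

def f1_at(threshold):
    """F1 score if pairs scoring at or above `threshold` are called links."""
    tp = sum(s >= threshold and y for s, y in scored_pairs)
    fp = sum(s >= threshold and not y for s, y in scored_pairs)
    fn = sum(s < threshold and y for s, y in scored_pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Evaluate F1 at every observed score and pick the best threshold
thresholds = sorted({s for s, _ in scored_pairs})
best = max(thresholds, key=f1_at)
print(best)  # 2.0
```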

Confusion matrix

See threshold_selection_tool_from_labels_table for a more complete visualisation of the impact of match threshold on false positives and false negatives, with reference to the confusion matrix.

"},{"location":"charts/accuracy_analysis_from_labels_table.html#actions-to-take-as-a-result-of-the-chart","title":"Actions to take as a result of the chart","text":"

Having identified an optimal match weight threshold, this can be applied when generating linked clusters using cluster_pairwise_predictions_at_threshold().

"},{"location":"charts/accuracy_analysis_from_labels_table.html#worked-example","title":"Worked Example","text":"
import splink.comparison_library as cl\nimport splink.comparison_template_library as ctl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\nfrom splink.datasets import splink_dataset_labels\n\ndb_api = DuckDBAPI()\n\ndf = splink_datasets.fake_1000\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.JaroWinklerAtThresholds(\"first_name\", [0.9, 0.7]),\n        cl.JaroAtThresholds(\"surname\", [0.9, 0.7]),\n        ctl.DateComparison(\n            \"dob\",\n            input_is_string=True,\n            datetime_metrics=[\"year\", \"month\"],\n            datetime_thresholds=[1, 1],\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        ctl.EmailComparison(\"email\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"substr(first_name,1,1)\"),\n        block_on(\"substr(surname, 1,1)\"),\n    ],\n)\n\nlinker = Linker(df, settings, db_api)\n\nlinker.training.estimate_probability_two_random_records_match(\n    [block_on(\"first_name\", \"surname\")], recall=0.7\n)\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n\nblocking_rule_for_training = block_on(\"first_name\", \"surname\")\n\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\nblocking_rule_for_training = block_on(\"dob\")\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\n\ndf_labels = splink_dataset_labels.fake_1000_labels\nlabels_table = linker.table_management.register_labels_table(df_labels)\n\nchart = linker.evaluation.accuracy_analysis_from_labels_table(\n    labels_table, output_type=\"accuracy\", add_metrics=[\"f1\"]\n)\n

Note that you can also produce a ROC chart, a precision recall chart, or get the results as a table:

linker.evaluation.accuracy_analysis_from_labels_table(\n    labels_table, output_type=\"roc\", add_metrics=[\"f1\"]\n)\n
linker.evaluation.accuracy_analysis_from_labels_table(\n    labels_table, output_type=\"precision_recall\", add_metrics=[\"f1\"]\n)\n
linker.evaluation.accuracy_analysis_from_labels_table(\n    labels_table, output_type=\"table\", add_metrics=[\"f1\"]\n).as_pandas_dataframe()\n
truth_threshold match_probability total_clerical_labels p n tp tn fp fn P_rate ... precision recall specificity npv accuracy f1 f2 f0_5 p4 phi 0 -23.8 6.846774e-08 3176.0 2031.0 1145.0 1446.0 1055.0 90.0 585.0 0.639484 ... 0.941406 0.711965 0.921397 0.643293 0.787469 0.810765 0.748447 0.884404 0.783298 0.608544 1 -22.7 1.467638e-07 3176.0 2031.0 1145.0 1446.0 1077.0 68.0 585.0 0.639484 ... 0.955086 0.711965 0.940611 0.648014 0.794395 0.815797 0.750156 0.894027 0.790841 0.627351 2 -21.7 2.935275e-07 3176.0 2031.0 1145.0 1446.0 1083.0 62.0 585.0 0.639484 ... 0.958886 0.711965 0.945852 0.649281 0.796285 0.817180 0.750623 0.896689 0.792887 0.632504 3 -21.6 3.145950e-07 3176.0 2031.0 1145.0 1446.0 1088.0 57.0 585.0 0.639484 ... 0.962076 0.711965 0.950218 0.650329 0.797859 0.818336 0.751013 0.898918 0.794588 0.636808 4 -20.6 6.291899e-07 3176.0 2031.0 1145.0 1446.0 1094.0 51.0 585.0 0.639484 ... 0.965932 0.711965 0.955459 0.651578 0.799748 0.819728 0.751481 0.901609 0.796624 0.641982 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 278 24.2 9.999999e-01 3176.0 2031.0 1145.0 5.0 1145.0 0.0 2026.0 0.639484 ... 1.000000 0.002462 1.000000 0.361085 0.362091 0.004912 0.003075 0.012189 0.009733 0.029815 279 24.3 1.000000e+00 3176.0 2031.0 1145.0 4.0 1145.0 0.0 2027.0 0.639484 ... 1.000000 0.001969 1.000000 0.360971 0.361776 0.003931 0.002461 0.009770 0.007805 0.026663 280 24.4 1.000000e+00 3176.0 2031.0 1145.0 3.0 1145.0 0.0 2028.0 0.639484 ... 1.000000 0.001477 1.000000 0.360857 0.361461 0.002950 0.001846 0.007342 0.005867 0.023087 281 24.6 1.000000e+00 3176.0 2031.0 1145.0 2.0 1145.0 0.0 2029.0 0.639484 ... 1.000000 0.000985 1.000000 0.360744 0.361146 0.001968 0.001231 0.004904 0.003921 0.018848 282 25.1 1.000000e+00 3176.0 2031.0 1145.0 1.0 1145.0 0.0 2030.0 0.639484 ... 1.000000 0.000492 1.000000 0.360630 0.360831 0.000984 0.000615 0.002457 0.001965 0.013325

283 rows \u00d7 25 columns

"},{"location":"charts/cluster_studio_dashboard.html","title":"cluster studio dashboard","text":""},{"location":"charts/cluster_studio_dashboard.html#cluster_studio_dashboard","title":"cluster_studio_dashboard","text":"

At a glance

API Documentation: cluster_studio_dashboard()

"},{"location":"charts/cluster_studio_dashboard.html#worked-example","title":"Worked Example","text":"
import splink.comparison_library as cl\nimport splink.comparison_template_library as ctl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\ndf = splink_datasets.fake_1000\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.JaroWinklerAtThresholds(\"first_name\", [0.9, 0.7]),\n        cl.JaroAtThresholds(\"surname\", [0.9, 0.7]),\n        ctl.DateComparison(\n            \"dob\",\n            input_is_string=True,\n            datetime_metrics=[\"year\", \"month\"],\n            datetime_thresholds=[1, 1],\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        ctl.EmailComparison(\"email\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"substr(first_name,1,1)\"),\n        block_on(\"substr(surname, 1,1)\"),\n    ],\n    retain_intermediate_calculation_columns=True,\n    retain_matching_columns=True,\n)\n\nlinker = Linker(df, settings, DuckDBAPI())\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n\nblocking_rule_for_training = block_on(\"first_name\", \"surname\")\n\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\nblocking_rule_for_training = block_on(\"dob\")\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\ndf_predictions = linker.inference.predict(threshold_match_probability=0.2)\ndf_clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(\n    df_predictions, threshold_match_probability=0.5\n)\n\nlinker.visualisations.cluster_studio_dashboard(\n    df_predictions, df_clusters, \"img/cluster_studio.html\",\n    sampling_method=\"by_cluster_size\", overwrite=True\n)\n\n# You can view the cluster_studio.html file in your browser, or inline in a notebook as follows\nfrom IPython.display import IFrame\nIFrame(src=\"./img/cluster_studio.html\", width=\"100%\", height=1200)\n
"},{"location":"charts/cluster_studio_dashboard.html#what-the-chart-shows","title":"What the chart shows","text":"

See here for a video explanation of the chart.

"},{"location":"charts/comparison_viewer_dashboard.html","title":"comparison viewer dashboard","text":""},{"location":"charts/comparison_viewer_dashboard.html#comparison_viewer_dashboard","title":"comparison_viewer_dashboard","text":"

At a glance

API Documentation: comparison_viewer_dashboard()

"},{"location":"charts/comparison_viewer_dashboard.html#worked-example","title":"Worked Example","text":"
import splink.comparison_library as cl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\ndf = splink_datasets.fake_1000\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.JaroWinklerAtThresholds(\"first_name\", [0.9, 0.7]),\n        cl.JaroAtThresholds(\"surname\", [0.9, 0.7]),\n        cl.DateOfBirthComparison(\n            \"dob\",\n            input_is_string=True,\n            datetime_metrics=[\"year\", \"month\"],\n            datetime_thresholds=[1, 1],\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        cl.EmailComparison(\"email\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"substr(first_name,1,1)\"),\n        block_on(\"substr(surname, 1,1)\"),\n    ],\n    retain_intermediate_calculation_columns=True,\n    retain_matching_columns=True,\n)\n\nlinker = Linker(df, settings, DuckDBAPI())\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n\nblocking_rule_for_training = block_on(\"first_name\", \"surname\")\n\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\nblocking_rule_for_training = block_on(\"dob\")\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\ndf_predictions = linker.inference.predict(threshold_match_probability=0.2)\n\nlinker.visualisations.comparison_viewer_dashboard(\n    df_predictions, \"img/scv.html\", overwrite=True\n)\n\n# You can view the scv.html file in your browser, or inline in a notebook as follows\nfrom IPython.display import IFrame\nIFrame(\n    src=\"./img/scv.html\", width=\"100%\", height=1200\n)\n
"},{"location":"charts/comparison_viewer_dashboard.html#what-the-chart-shows","title":"What the chart shows","text":"

See the following video: An introduction to the Splink Comparison Viewer dashboard

"},{"location":"charts/completeness_chart.html","title":"completeness chart","text":""},{"location":"charts/completeness_chart.html#completeness_chart","title":"completeness_chart","text":"

At a glance

Useful for: Looking at which columns are populated across datasets.

API Documentation: completeness_chart()

What is needed to generate the chart? A linker with some data.

"},{"location":"charts/completeness_chart.html#what-the-chart-shows","title":"What the chart shows","text":"

The completeness_chart shows the proportion of populated (non-null) values in the columns of multiple datasets.

What the chart tooltip shows

The tooltip shows a number of values based on the panel that the user is hovering over, including:

  • The dataset and column name
  • The count and percentage of non-null values in the column for the relevant dataset.
"},{"location":"charts/completeness_chart.html#how-to-interpret-the-chart","title":"How to interpret the chart","text":"

Each panel represents the percentage of non-null values in a given dataset-column combination. The darker the panel, the lower the percentage of non-null values.
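The proportions underlying the chart can be reproduced directly with pandas. This is a minimal sketch with two small invented datasets, not how Splink computes the chart internally:

```python
import pandas as pd

# Two toy datasets with differing completeness (illustrative data)
df_l = pd.DataFrame(
    {"first_name": ["amy", None, "bob"], "email": ["a@x.com", None, None]}
)
df_r = pd.DataFrame(
    {"first_name": ["cal", "dee", None], "email": ["c@x.com", "d@x.com", "e@x.com"]}
)

# Proportion of non-null values per column, per dataset -
# the quantity each panel of the chart encodes
completeness = pd.DataFrame(
    {
        "left": df_l.notna().mean(),
        "right": df_r.notna().mean(),
    }
)
print(completeness)
```

A column like email above, well populated in one dataset but sparse in the other, is exactly the kind of feature this chart is designed to flag.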

"},{"location":"charts/completeness_chart.html#actions-to-take-as-a-result-of-the-chart","title":"Actions to take as a result of the chart","text":"

Only choose features that are sufficiently populated across all datasets in a linkage model.

"},{"location":"charts/completeness_chart.html#worked-example","title":"Worked Example","text":"
from splink import splink_datasets, DuckDBAPI\nfrom splink.exploratory import completeness_chart\n\ndf = splink_datasets.fake_1000\n\n# Split a simple dataset into two, separate datasets which can be linked together.\ndf_l = df.sample(frac=0.5)\ndf_r = df.drop(df_l.index)\n\n\nchart = completeness_chart([df_l, df_r], db_api=DuckDBAPI())\nchart\n
"},{"location":"charts/cumulative_comparisons_to_be_scored_from_blocking_rules_chart.html","title":"cumulative num comparisons from blocking rules chart","text":""},{"location":"charts/cumulative_comparisons_to_be_scored_from_blocking_rules_chart.html#cumulative_comparisons_to_be_scored_from_blocking_rules_chart","title":"cumulative_comparisons_to_be_scored_from_blocking_rules_chart","text":"

At a glance

Useful for: Counting the number of comparisons generated by Blocking Rules.

API Documentation: cumulative_comparisons_to_be_scored_from_blocking_rules_chart()

What is needed to generate the chart? A linker with some data and a settings dictionary defining some Blocking Rules.

"},{"location":"charts/cumulative_comparisons_to_be_scored_from_blocking_rules_chart.html#what-the-chart-shows","title":"What the chart shows","text":"

The cumulative_comparisons_to_be_scored_from_blocking_rules_chart shows the count of pairwise comparisons generated by a set of blocking rules.

What the chart tooltip shows

The tooltip shows a number of statistics based on the bar that the user is hovering over, including:

  • The blocking rule as an SQL statement.
  • The number of additional pairwise comparisons generated by the blocking rule.
  • The cumulative number of pairwise comparisons generated by the blocking rule and the previous blocking rules.
  • The total number of possible pairwise comparisons (i.e. the Cartesian product). This represents the number of comparisons which would need to be evaluated if no blocking was implemented.
  • The percentage of possible pairwise comparisons excluded by the blocking rule and the previous blocking rules (i.e. the Reduction Ratio). This is calculated as \\(1-\\frac{\\textsf{cumulative comparisons}}{\\textsf{total possible comparisons}}\\).
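The Reduction Ratio in the tooltip follows directly from this formula. A minimal sketch, using illustrative counts rather than output from a real run:

```python
# Total possible pairwise comparisons when deduplicating n records:
# every unordered pair, excluding self-comparisons
n_records = 1_000
total_possible = n_records * (n_records - 1) // 2

# Cumulative comparisons generated by the blocking rules so far (illustrative)
cumulative_comparisons = 3_664

# Reduction Ratio: share of possible comparisons excluded by blocking
reduction_ratio = 1 - cumulative_comparisons / total_possible
print(f"{reduction_ratio:.4%}")
```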
"},{"location":"charts/cumulative_comparisons_to_be_scored_from_blocking_rules_chart.html#how-to-interpret-the-chart","title":"How to interpret the chart","text":"

Blocking rules are order dependent, so each bar in this chart shows the additional comparisons generated on top of those from the previous blocking rules.

For example, the chart above shows that an exact match on surname generates an additional 1,351 comparisons. If we reverse the order of the surname and first_name blocking rules:

blocking_rules_for_analysis = [\n    block_on(\"surname\"),\n    block_on(\"first_name\"),\n    block_on(\"email\"),\n]\n\ncumulative_comparisons_to_be_scored_from_blocking_rules_chart(\n    table_or_tables=df,\n    blocking_rules=blocking_rules_for_analysis,\n    db_api=db_api,\n    link_type=\"dedupe_only\",\n)\n

The total number of comparisons is the same (3,664), but now 1,638 have been generated by the surname blocking rule. This suggests that 287 record comparisons have the same first_name and surname.
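The order dependence arises because each pairwise comparison is attributed to the first blocking rule that generates it. A toy sketch of this counting logic, with invented records and simple key functions standing in for blocking rules (this is not Splink's implementation):

```python
from itertools import combinations
from operator import itemgetter

# Illustrative records to block on
records = [
    {"id": 1, "first_name": "amy", "surname": "smith"},
    {"id": 2, "first_name": "amy", "surname": "smith"},
    {"id": 3, "first_name": "amy", "surname": "jones"},
    {"id": 4, "first_name": "bob", "surname": "jones"},
]

def pairs_for(key):
    """All record-id pairs whose blocking key matches."""
    return {
        (a["id"], b["id"])
        for a, b in combinations(records, 2)
        if key(a) == key(b)
    }

def cumulative_counts(rules):
    """Additional comparisons contributed by each rule, in order."""
    seen, counts = set(), []
    for key in rules:
        new = pairs_for(key) - seen  # only pairs no earlier rule produced
        seen |= new
        counts.append(len(new))
    return counts

by_first_name = itemgetter("first_name")
by_surname = itemgetter("surname")

print(cumulative_counts([by_first_name, by_surname]))  # [3, 1]
print(cumulative_counts([by_surname, by_first_name]))  # [2, 2] - same total of 4
```

Reversing the rule order changes how the pairs are attributed but not the total, mirroring the behaviour described above.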

"},{"location":"charts/cumulative_comparisons_to_be_scored_from_blocking_rules_chart.html#actions-to-take-as-a-result-of-the-chart","title":"Actions to take as a result of the chart","text":"

The main aim of this chart is to understand how many comparisons are generated by the blocking rules that the Splink model will consider. The number of comparisons is the primary driver of the amount of computational resource required for Splink model training, predictions etc. (i.e. how long things will take to run).

The number of comparisons that are appropriate for a model varies. In general, if a model is taking hours to run (unless you are working with 100+ million records), it could be helpful to reduce the number of comparisons by defining more restrictive blocking rules.

For instance, since many people can share the same first_name in the example above, you may want to add an additional requirement for a match on dob as well to reduce the number of record pairs the model needs to consider.

blocking_rules_for_analysis = [\n    block_on(\"first_name\", \"dob\"),\n    block_on(\"surname\"),\n    block_on(\"email\"),\n]\n\n\ncumulative_comparisons_to_be_scored_from_blocking_rules_chart(\n    table_or_tables=df,\n    blocking_rules=blocking_rules_for_analysis,\n    db_api=db_api,\n    link_type=\"dedupe_only\",\n)\n

Here, the total number of record pairs considered by the model has been reduced from 3,664 to 2,213.

Further Reading

For a deeper dive on blocking, please refer to the Blocking Topic Guides.

For more on the blocking tools in Splink, please refer to the Blocking API documentation.

"},{"location":"charts/cumulative_comparisons_to_be_scored_from_blocking_rules_chart.html#worked-example","title":"Worked Example","text":""},{"location":"charts/m_u_parameters_chart.html","title":"m u parameters chart","text":""},{"location":"charts/m_u_parameters_chart.html#m_u_parameters_chart","title":"m_u_parameters_chart","text":"

At a glance

Useful for: Looking at the m and u values generated by a Splink model.

API Documentation: m_u_parameters_chart()

What is needed to generate the chart? A trained Splink model.

"},{"location":"charts/m_u_parameters_chart.html#what-the-chart-shows","title":"What the chart shows","text":"

The m_u_parameters_chart shows the results of a trained Splink model:

  • The left chart shows the estimated m probabilities from the Splink model
  • The right chart shows the estimated u probabilities from the Splink model.

Each comparison within the model is represented by the m and u values that were estimated during model training for each of its comparison levels.

What the chart tooltip shows"},{"location":"charts/m_u_parameters_chart.html#estimated-m-probability-tooltip","title":"Estimated m probability tooltip","text":"

The tooltip of the left chart shows information based on the comparison level bar that the user is hovering over, including:

  • An explanation of the m probability for the comparison level.
  • The name of the comparison and comparison level.
  • The comparison level condition as an SQL statement.
  • The m and u probability for the comparison level.
  • The resulting Bayes factor and match weight for the comparison level.
"},{"location":"charts/m_u_parameters_chart.html#estimated-u-probability-tooltip","title":"Estimated u probability tooltip","text":"

The tooltip of the right chart shows information based on the comparison level bar that the user is hovering over, including:

  • An explanation of the u probability from the comparison level.
  • The name of the comparison and comparison level.
  • The comparison level condition as an SQL statement.
  • The m and u probability for the comparison level.
  • The resulting Bayes factor and match weight for the comparison level.
"},{"location":"charts/m_u_parameters_chart.html#how-to-interpret-the-chart","title":"How to interpret the chart","text":"

Each bar of the left chart shows the probability of a given comparison level when two records are a match. This can also be interpreted as the proportion of matching records which are allocated to the comparison level (as stated in the x axis label).

Similarly, each bar of the right chart shows the probability of a given comparison level when two records are not a match. This can also be interpreted as the proportion of non-matching records which are allocated to the comparison level (as stated in the x axis label).
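Because each set of probabilities describes how one population of pairs (matches on the left, non-matches on the right) is split across the levels of a comparison, the m values for a comparison sum to 1, as do the u values. A toy illustration, with probabilities invented for the example rather than taken from a trained model:

```python
# Illustrative m and u probabilities for a three-level comparison
# (exact match / fuzzy match / all other comparisons) - toy values
m = {"exact": 0.70, "fuzzy": 0.20, "other": 0.10}
u = {"exact": 0.01, "fuzzy": 0.04, "other": 0.95}

# Each set of probabilities partitions its population of pairs,
# so both must sum to 1 across the comparison's levels
assert abs(sum(m.values()) - 1) < 1e-9
assert abs(sum(u.values()) - 1) < 1e-9

# e.g. 70% of truly-matching pairs fall in the exact-match level,
# while only 1% of non-matching pairs do
print(m["exact"], u["exact"])
```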

Further Reading

For a more comprehensive introduction to m and u probabilities, check out the Fellegi Sunter model topic guide.

"},{"location":"charts/m_u_parameters_chart.html#actions-to-take-as-a-result-of-the-chart","title":"Actions to take as a result of the chart","text":"

As with the match_weights_chart, one of the most effective methods to assess a Splink model is to walk through each of the comparison levels of the m_u_parameters_chart and sense check the m and u probabilities that have been allocated by the model.

For example, for all non-matching pairwise comparisons (which form the vast majority of all pairwise comparisons), it makes sense that the exact match and fuzzy levels occur very rarely. Furthermore, dob and city are lower cardinality features (i.e. have fewer possible values) than names, so \"All other comparisons\" is less likely.

If there are any m or u values that appear unusual, check out the values generated for each training session in the parameter_estimate_comparisons_chart.

"},{"location":"charts/m_u_parameters_chart.html#related-charts","title":"Related Charts","text":"

match weights chart

parameter estimate comparisons chart

"},{"location":"charts/m_u_parameters_chart.html#worked-example","title":"Worked Example","text":"
import splink.comparison_library as cl\n\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\ndf = splink_datasets.fake_1000\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.JaroWinklerAtThresholds(\"first_name\", [0.9, 0.7]),\n        cl.JaroAtThresholds(\"surname\", [0.9, 0.7]),\n        cl.DateOfBirthComparison(\n            \"dob\",\n            input_is_string=True,\n            datetime_metrics=[\"year\", \"month\"],\n            datetime_thresholds=[1, 1],\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        cl.EmailComparison(\"email\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"surname\"),\n    ],\n)\n\nlinker = Linker(df, settings, DuckDBAPI())\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n\nblocking_rule_for_training = block_on(\"first_name\", \"surname\")\n\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\nblocking_rule_for_training = block_on(\"dob\")\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\nchart = linker.visualisations.m_u_parameters_chart()\nchart\n
\n
"},{"location":"charts/match_weights_chart.html","title":"match weights chart","text":""},{"location":"charts/match_weights_chart.html#match_weights_chart","title":"match_weights_chart","text":"

At a glance

Useful for: Looking at the whole Splink model definition.

API Documentation: match_weights_chart()

What is needed to generate the chart? A trained Splink model.

"},{"location":"charts/match_weights_chart.html#what-the-chart-shows","title":"What the chart shows","text":"

The match_weights_chart shows the results of a trained Splink model. Each comparison within the model is represented as a bar chart, with a bar showing the evidence for two records being a match (i.e. the match weight) for each comparison level.

What the chart tooltip shows

The tooltip shows information based on the comparison level bar that the user is hovering over, including:

  • The name of the comparison and comparison level.
  • The comparison level condition as an SQL statement.
  • The m and u probability for the comparison level.
  • The resulting Bayes factor and match weight for the comparison level.
"},{"location":"charts/match_weights_chart.html#how-to-interpret-the-chart","title":"How to interpret the chart","text":"

Each bar in the match_weights_chart shows the evidence of a match provided by each level in a Splink model (i.e. match weight). As such, the match weight chart provides a summary for the entire Splink model, as it shows the match weights for every type of comparison defined within the model.

To score a pair of records, Splink adds up the evidence (i.e. match weights) from each comparison to produce a final match weight, which can then be converted into a probability of a match.

The first bar is the Prior Match Weight: the starting match weight before any evidence from the comparisons is taken into account. This can be thought of in the same way as the y-intercept of a simple regression model.

This chart is an aggregation of the m_u_parameters_chart. The match weight for a comparison level is simply \\(log_2(\\frac{m}{u})\\).
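Putting these pieces together, the arithmetic behind a final match probability can be sketched as follows (all the m, u and prior values here are invented for illustration):

```python
import math

# Toy m and u probabilities for two comparison levels (illustrative values only)
levels = [
    {"name": "exact match on dob", "m": 0.8, "u": 0.01},
    {"name": "exact match on city", "m": 0.6, "u": 0.2},
]

# The match weight for a level is log2 of its Bayes factor, m/u
for level in levels:
    level["match_weight"] = math.log2(level["m"] / level["u"])

# The prior match weight comes from the probability that two random records
# match - assumed to be 1 in 1000 here, purely for illustration
p = 0.001
prior_match_weight = math.log2(p / (1 - p))

# Scoring a pair adds the prior to the evidence from every comparison...
final_match_weight = prior_match_weight + sum(l["match_weight"] for l in levels)

# ...and the total converts back to a match probability
bayes_factor = 2 ** final_match_weight
match_probability = bayes_factor / (1 + bayes_factor)
print(round(match_probability, 3))
```

Note how two individually strong pieces of evidence (Bayes factors of 80 and 3) can still yield a modest match probability once the prior is taken into account.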

"},{"location":"charts/match_weights_chart.html#actions-to-take-as-a-result-of-the-chart","title":"Actions to take as a result of the chart","text":"

Some heuristics to help assess Splink models with the match_weights_chart:

"},{"location":"charts/match_weights_chart.html#match-weights-gradually-reducing-within-a-comparison","title":"Match weights gradually reducing within a comparison","text":"

Comparison levels are order dependent: they are constructed so that the most \"similar\" levels come first, with subsequent levels being gradually less \"similar\". As a result, we would generally expect the match weight to reduce as we go down the levels in a comparison.

"},{"location":"charts/match_weights_chart.html#very-similar-comparison-levels","title":"Very similar comparison levels","text":"

Comparisons are broken up into comparison levels to show different levels of similarity between records. As these levels are associated with different degrees of similarity, we expect the amount of evidence (i.e. match weight) to vary between comparison levels. Two levels with the same match weight do not provide the model with any additional information which could help it perform better.

Therefore, if two levels of a comparison return the same match weight, these should be combined into a single level.

"},{"location":"charts/match_weights_chart.html#very-different-comparison-levels","title":"Very different comparison levels","text":"

Comparisons with a large variation in match weight between levels have a significant impact on the model results. For example, looking at the email comparison in the chart above, the difference in match weight between an exact/fuzzy match and \"All other comparisons\" is > 13, which is quite extreme. This generally happens with highly predictive features (e.g. email, national insurance number, social security number).

If there are a number of highly predictive features, it is worth exploring whether your model can be simplified to rely on these features. In some cases, similar results may be obtained with a deterministic rather than a probabilistic linkage model.

"},{"location":"charts/match_weights_chart.html#logical-walk-through","title":"Logical Walk-through","text":"

One of the most effective methods to assess a Splink model is to walk through each of the comparison levels of the match_weights_chart and sense check the amount of evidence (i.e. match weight) that has been allocated by the model.

For example, in the chart above, we would expect records with the same dob to provide more evidence of a match than first_name or surname. Conversely, given how people can move location, we would expect that city would be less predictive than people's fixed, personally identifying characteristics like surname, dob etc.

"},{"location":"charts/match_weights_chart.html#anything-look-strange","title":"Anything look strange?","text":"

If anything still looks unusual, check out:

  • the underlying m and u values in the m_u_parameters_chart
  • the values from each training session in the parameter_estimate_comparisons_chart
"},{"location":"charts/match_weights_chart.html#related-charts","title":"Related Charts","text":"

m u parameters chart

parameter estimate comparisons chart

"},{"location":"charts/match_weights_chart.html#worked-example","title":"Worked Example","text":"
import splink.comparison_library as cl\n\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\ndf = splink_datasets.fake_1000\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.JaroWinklerAtThresholds(\"first_name\", [0.9, 0.7]),\n        cl.JaroAtThresholds(\"surname\", [0.9, 0.7]),\n        cl.DateOfBirthComparison(\n            \"dob\",\n            input_is_string=True,\n            datetime_metrics=[\"year\", \"month\"],\n            datetime_thresholds=[1, 1],\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        cl.EmailComparison(\"email\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"surname\"),\n    ],\n)\n\nlinker = Linker(df, settings, DuckDBAPI())\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n\nblocking_rule_for_training = block_on(\"first_name\", \"surname\")\n\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\nblocking_rule_for_training = block_on(\"dob\")\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\nchart = linker.visualisations.match_weights_chart()\nchart\n
\n
"},{"location":"charts/parameter_estimate_comparisons_chart.html","title":"parameter estimate comparisons chart","text":""},{"location":"charts/parameter_estimate_comparisons_chart.html#parameter_estimate_comparisons_chart","title":"parameter_estimate_comparisons_chart","text":"

At a glance

Useful for: Looking at the m and u value estimates across multiple Splink model training sessions.

API Documentation: parameter_estimate_comparisons_chart()

What is needed to generate the chart? A trained Splink model.

"},{"location":"charts/parameter_estimate_comparisons_chart.html#related-charts","title":"Related Charts","text":"

m u parameters chart

match weights chart

"},{"location":"charts/parameter_estimate_comparisons_chart.html#worked-example","title":"Worked Example","text":"
import splink.comparison_library as cl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\ndf = splink_datasets.fake_1000\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.JaroWinklerAtThresholds(\"first_name\", [0.9, 0.7]),\n        cl.JaroAtThresholds(\"surname\", [0.9, 0.7]),\n        cl.DateOfBirthComparison(\n            \"dob\",\n            input_is_string=True,\n            datetime_metrics=[\"year\", \"month\"],\n            datetime_thresholds=[1, 1],\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        cl.EmailComparison(\"email\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"surname\"),\n    ],\n)\n\nlinker = Linker(df, settings, DuckDBAPI())\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n\nblocking_rule_for_training = block_on(\"first_name\", \"surname\")\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\nblocking_rule_for_training = block_on(\"dob\")\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\nblocking_rule_for_training = block_on(\"email\")\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\nchart = linker.visualisations.parameter_estimate_comparisons_chart()\nchart\n
\n
"},{"location":"charts/profile_columns.html","title":"profile columns","text":""},{"location":"charts/profile_columns.html#profile_columns","title":"profile_columns","text":"

At a glance

Useful for: Looking at the distribution of values in columns.

API Documentation: profile_columns()

What is needed to generate the chart?: A linker with some data.

"},{"location":"charts/profile_columns.html#what-the-chart-shows","title":"What the chart shows","text":"

The profile_columns chart shows 3 charts for each selected column:

  • The left chart shows the distribution of all values in the column. It is a summary of the skew of value frequencies. The width of each \"step\" represents the proportion of all (non-null) values with a given frequency, while the height of each \"step\" gives that frequency (the number of times the value occurs).
  • The middle chart shows the counts of the ten most common values in the column. These correspond to the 10 leftmost \"steps\" in the left chart.
  • The right chart shows the counts of the ten least common values in the column. These correspond to the 10 rightmost \"steps\" in the left chart.
What the chart tooltip shows"},{"location":"charts/profile_columns.html#left-chart","title":"Left chart:","text":"

This tooltip shows a number of statistics based on the column value of the \"step\" that the user is hovering over, including:

  • The number of occurrences of the given value.
  • The percentile of the column value (excluding and including null values).
  • The total number of rows in the column (excluding and including null values).
"},{"location":"charts/profile_columns.html#middle-and-right-chart","title":"Middle and right chart:","text":"

This tooltip shows a number of statistics based on the column value of the bar that the user is hovering over, including:

  • The column value
  • The count of the column value.
  • The total number of rows in the column (excluding and including null values).
"},{"location":"charts/profile_columns.html#how-to-interpret-the-chart","title":"How to interpret the chart","text":"

The distribution of values in your data is important for two main reasons:

  1. Columns with higher cardinality (number of distinct values) are usually more useful for data linking. For instance, date of birth is a much stronger linkage variable than gender.

  2. The skew of values is important. If you have a birth_place column that has 1,000 distinct values, but 75% of them are London, this is much less useful for linkage than if the 1,000 values were equally distributed.

"},{"location":"charts/profile_columns.html#actions-to-take-as-a-result-of-the-chart","title":"Actions to take as a result of the chart","text":"

In an ideal world, all of the columns in datasets used for linkage would be high cardinality with a low skew (i.e. many distinct values that are evenly distributed). This is rarely the case with real-life datasets, but there are a number of steps you can take to extract the most predictive value, particularly with skewed data.

"},{"location":"charts/profile_columns.html#skewed-string-columns","title":"Skewed String Columns","text":"

Consider the skew of birth_place in our example:

profile_columns(df, column_expressions=\"birth_place\", db_api=DuckDBAPI())\n

Here we can see that \"london\" is the most common value, with many times more entries than the other values. In this case two records both having a birth_place of \"london\" gives far less evidence for a match than both having a rarer birth_place (e.g. \"felthorpe\").

To take this skew into account, we can build Splink models with Term Frequency Adjustments. These adjustments will increase the amount of evidence for rare matching values and reduce the amount of evidence for common matching values.

To understand how these work in more detail, check out the Term Frequency Adjustments Topic Guide
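One way to build intuition for the adjustment is to replace the average u probability with the frequency of the specific matching value. This is a simplified sketch of the idea, with invented m, u and frequency values; it is not Splink's exact implementation:

```python
import math
from collections import Counter

# Illustrative column values with heavy skew towards one value
values = ["london"] * 75 + ["felthorpe"] * 1 + ["york"] * 24

# Term frequency: relative frequency of each distinct value
tf = {v: c / len(values) for v, c in Counter(values).items()}

# Base parameters for an exact-match level (toy values)
m, u = 0.9, 0.05
base_match_weight = math.log2(m / u)

def adjusted_match_weight(value):
    # Swap the average u for the frequency of the actual matching value
    return math.log2(m / tf[value])

print(round(base_match_weight, 2))
print(round(adjusted_match_weight("london"), 2))     # common value: less evidence
print(round(adjusted_match_weight("felthorpe"), 2))  # rare value: more evidence
```

A match on the rare value ends up carrying more evidence than the base match weight, and a match on the dominant value less, which is the behaviour the adjustments are designed to produce.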

"},{"location":"charts/profile_columns.html#skewed-date-columns","title":"Skewed Date Columns","text":"

Dates can also be skewed, but tend to be dealt with slightly differently.

Consider the dob column from our example:

profile_columns(df, column_expressions=\"dob\", db_api=DuckDBAPI())\n

Here we can see a large skew towards dates which are the 1st January. We can narrow down the profiling to show the distribution of month and day to explore this further:

profile_columns(df, column_expressions=\"substr(dob, 6, 10)\", db_api=DuckDBAPI())\n

Here we can see that over 35% of all dates in this dataset are the 1st January. This is fairly common in manually entered datasets where if only the year of birth is known, people will generally enter the 1st January for that year.

"},{"location":"charts/profile_columns.html#low-cardinality-columns","title":"Low cardinality columns","text":"

Unfortunately, there is not much that can be done to improve low cardinality data. Ultimately, they will provide some evidence of a match between records, but need to be used in conjunction with some more predictive, higher cardinality fields.

"},{"location":"charts/profile_columns.html#worked-example","title":"Worked Example","text":"
from splink import splink_datasets, DuckDBAPI\nfrom splink.exploratory import profile_columns\n\ndf = splink_datasets.historical_50k\nprofile_columns(df, db_api=DuckDBAPI())\n
"},{"location":"charts/template.html","title":"XXXXX_chart","text":""},{"location":"charts/template.html#xxxxx_chart","title":"XXXXX_chart","text":"

At a glance

Useful for:

API Documentation: XXXXXX_chart()

What is needed to generate the chart?

"},{"location":"charts/template.html#worked-example","title":"Worked Example","text":"
import splink.comparison_library as cl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\ndf = splink_datasets.fake_1000\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.JaroWinklerAtThresholds(\"first_name\", [0.9, 0.7]),\n        cl.JaroAtThresholds(\"surname\", [0.9, 0.7]),\n        cl.DateOfBirthComparison(\n            \"dob\",\n            input_is_string=True,\n            datetime_metrics=[\"year\", \"month\"],\n            datetime_thresholds=[1, 1],\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        cl.EmailComparison(\"email\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"surname\"),\n    ],\n)\n\nlinker = Linker(df, settings, DuckDBAPI())\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n\nblocking_rule_for_training = block_on(\"first_name\", \"surname\")\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\nblocking_rule_for_training = block_on(\"dob\")\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n
"},{"location":"charts/template.html#what-the-chart-shows","title":"What the chart shows","text":"What the chart tooltip shows"},{"location":"charts/template.html#how-to-interpret-the-chart","title":"How to interpret the chart","text":""},{"location":"charts/template.html#actions-to-take-as-a-result-of-the-chart","title":"Actions to take as a result of the chart","text":""},{"location":"charts/tf_adjustment_chart.html","title":"tf adjustment chart","text":""},{"location":"charts/tf_adjustment_chart.html#tf_adjustment_chart","title":"tf_adjustment_chart","text":"

At a glance

Useful for: Looking at the impact of Term Frequency Adjustments on Match Weights.

API Documentation: tf_adjustment_chart()

What is needed to generate the chart?: A trained Splink model, including comparisons with term frequency adjustments.

"},{"location":"charts/tf_adjustment_chart.html#what-the-chart-shows","title":"What the chart shows","text":"

The tf_adjustment_chart shows the impact of Term Frequency Adjustments on the Match Weight of a comparison. It is made up of two charts for each selected comparison:

  • The left chart shows the match weight for two records with a matching first_name including a term frequency adjustment. The black horizontal line represents the base match weight (i.e. with no term frequency adjustment applied). By default this chart contains the 10 most frequent and 10 least frequent values in a comparison as well as any values assigned in the vals_to_include parameter.
  • The right chart shows the distribution of match weights across all of the values of first_name.
What the tooltip shows"},{"location":"charts/tf_adjustment_chart.html#left-chart","title":"Left chart","text":"

The tooltip shows a number of statistics based on the column value of the point that the user is hovering over, including:

  • The column value
  • The base match weight (i.e. with no term frequency adjustment) for a match on the column.
  • The term frequency adjustment for the column value.
  • The final match weight (i.e. the combined base match weight and term frequency adjustment)
"},{"location":"charts/tf_adjustment_chart.html#right-chart","title":"Right chart","text":"

The tooltip shows a number of statistics based on the bar that the user is hovering over, including:

  • The final match weight bucket (in steps of 0.5).
  • The number of records with a final match weight in the final match weight bucket.
"},{"location":"charts/tf_adjustment_chart.html#how-to-interpret-the-chart","title":"How to interpret the chart","text":"

The most common terms (on the left of the first chart) will have a negative term frequency adjustment, and the values on the chart represent the lowest match weight for a match for the selected comparison. Conversely, the least common terms (on the right of the first chart) will have a positive term frequency adjustment, and the values on the chart represent the highest match weight for a match for the selected comparison.

Given that the first chart only shows the most and least frequently occurring values, the second chart is provided to show the distribution of final match weights (including term frequency adjustments) across all values in the dataset.
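The intuition can be sketched with a toy calculation. This is a simplified illustration of the idea, not Splink's exact formula: the adjustment shifts the base match weight by (roughly) the log ratio of the average term frequency to the value's own frequency, so common values are penalised and rare values rewarded. The frequencies below are invented for illustration.

```python
import math

# Simplified sketch (not Splink's exact formula): the term frequency
# adjustment shifts the match weight by how much rarer or commoner a value
# is than the average frequency implicit in the base match weight.
def tf_adjustment(value_freq, avg_freq):
    return math.log2(avg_freq / value_freq)

avg = 0.002  # assumed average term frequency for the column
print(tf_adjustment(0.02, avg))    # common value -> negative adjustment
print(tf_adjustment(0.0001, avg))  # rare value -> positive adjustment
```

A value exactly at the average frequency gets a zero adjustment, i.e. the base match weight is left unchanged.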

"},{"location":"charts/tf_adjustment_chart.html#actions-to-take-as-a-result-of-the-chart","title":"Actions to take as a result of the chart","text":"

There are no direct actions that need to be taken as a result of this chart. It is intended to give the user an indication of the size of the impact of Term Frequency Adjustments on comparisons, as seen in the Waterfall Chart.

"},{"location":"charts/tf_adjustment_chart.html#worked-example","title":"Worked Example","text":"
import splink.comparison_library as cl\nimport splink.comparison_template_library as ctl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\ndf = splink_datasets.fake_1000\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.JaroWinklerAtThresholds(\"first_name\", [0.9, 0.7]).configure(\n            term_frequency_adjustments=True\n        ),\n        cl.JaroAtThresholds(\"surname\", [0.9, 0.7]),\n        ctl.DateComparison(\n            \"dob\",\n            input_is_string=True,\n            datetime_metrics=[\"year\", \"month\"],\n            datetime_thresholds=[1, 1],\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        ctl.EmailComparison(\"email\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"surname\"),\n    ],\n)\n\nlinker = Linker(df, settings, DuckDBAPI())\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n\nblocking_rule_for_training = block_on(\"first_name\", \"surname\")\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\nblocking_rule_for_training = block_on(\"dob\")\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\nchart = linker.visualisations.tf_adjustment_chart(\n    \"first_name\", vals_to_include=[\"Robert\", \"Grace\"]\n)\nchart\n
"},{"location":"charts/threshold_selection_tool_from_labels_table.html","title":"threshold selection tool","text":""},{"location":"charts/threshold_selection_tool_from_labels_table.html#threshold_selection_tool_from_labels_table","title":"threshold_selection_tool_from_labels_table","text":"

At a glance

Useful for: Selecting an optimal match weight threshold for generating linked clusters.

API Documentation: accuracy_analysis_from_labels_table()

What is needed to generate the chart? A linker with some data and a corresponding labelled dataset

"},{"location":"charts/threshold_selection_tool_from_labels_table.html#what-the-chart-shows","title":"What the chart shows","text":"

For a given match weight threshold, a record pair with a score above this threshold will be labelled a match and below the threshold will be labelled a non-match. Lowering the threshold to the extreme ensures many more matches are generated - this maximises the True Positives (high recall) but at the expense of some False Positives (low precision).

You can then see the effect on the confusion matrix of raising the match threshold. As more predicted matches become non-matches at the higher threshold, True Positives become False Negatives, but False Positives become True Negatives.

This demonstrates the trade-off between Type 1 (FP) and Type 2 (FN) errors when selecting a match threshold, or precision vs recall.
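The trade-off above can be sketched with a toy calculation (plain Python, not the Splink API; the scored pairs below are invented for illustration):

```python
# Toy illustration: how raising the match weight threshold trades recall
# for precision on a small set of scored, clerically labelled pairs.
pairs = [  # (match_weight, is_true_match)
    (12.0, True), (8.5, True), (6.0, True), (4.0, False),
    (2.0, True), (1.0, False), (-3.0, False), (-7.0, False),
]

def precision_recall(threshold):
    tp = sum(1 for w, y in pairs if w >= threshold and y)
    fp = sum(1 for w, y in pairs if w >= threshold and not y)
    fn = sum(1 for w, y in pairs if w < threshold and y)
    return tp / (tp + fp), tp / (tp + fn)

print(precision_recall(0.0))  # low threshold: perfect recall, lower precision
print(precision_recall(5.0))  # high threshold: perfect precision, lower recall
```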

This chart adds further context to accuracy_analysis_from_labels_table showing:

  • the relationship between match weight and match probability
  • various accuracy metrics comparing the Splink scores against clerical labels
  • the confusion matrix of the predictions and the labels
"},{"location":"charts/threshold_selection_tool_from_labels_table.html#how-to-interpret-the-chart","title":"How to interpret the chart","text":"

Precision can be maximised by increasing the match threshold (reducing false positives).

Recall can be maximised by decreasing the match threshold (reducing false negatives).

Additional metrics can be used to find the optimal compromise between these two, looking for the threshold at which peak accuracy is achieved.

"},{"location":"charts/threshold_selection_tool_from_labels_table.html#actions-to-take-as-a-result-of-the-chart","title":"Actions to take as a result of the chart","text":"

Having identified an optimal match weight threshold, this can be applied when generating linked clusters using cluster_pairwise_predictions_at_threshold().

"},{"location":"charts/threshold_selection_tool_from_labels_table.html#worked-example","title":"Worked Example","text":"
import splink.comparison_library as cl\nimport splink.comparison_template_library as ctl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\nfrom splink.datasets import splink_dataset_labels\n\ndb_api = DuckDBAPI()\n\ndf = splink_datasets.fake_1000\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.JaroWinklerAtThresholds(\"first_name\", [0.9, 0.7]),\n        cl.JaroAtThresholds(\"surname\", [0.9, 0.7]),\n        ctl.DateComparison(\n            \"dob\",\n            input_is_string=True,\n            datetime_metrics=[\"year\", \"month\"],\n            datetime_thresholds=[1, 1],\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        ctl.EmailComparison(\"email\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"substr(first_name,1,1)\"),\n        block_on(\"substr(surname, 1,1)\"),\n    ],\n)\n\nlinker = Linker(df, settings, db_api)\n\nlinker.training.estimate_probability_two_random_records_match(\n    [block_on(\"first_name\", \"surname\")], recall=0.7\n)\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n\nblocking_rule_for_training = block_on(\"first_name\", \"surname\")\n\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\nblocking_rule_for_training = block_on(\"dob\")\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\ndf_labels = splink_dataset_labels.fake_1000_labels\nlabels_table = linker.table_management.register_labels_table(df_labels)\n\nchart = linker.evaluation.accuracy_analysis_from_labels_table(\n    labels_table, output_type=\"threshold_selection\", add_metrics=[\"f1\"]\n)\nchart\n
"},{"location":"charts/unlinkables_chart.html","title":"unlinkables chart","text":""},{"location":"charts/unlinkables_chart.html#unlinkables_chart","title":"unlinkables_chart","text":"

At a glance

Useful for: Looking at how many records have insufficient information to be linked to themselves.

API Documentation: unlinkables_chart()

What is needed to generate the chart? A trained Splink model

"},{"location":"charts/unlinkables_chart.html#what-the-chart-shows","title":"What the chart shows","text":"

The unlinkables_chart shows the proportion of records with insufficient information to be matched to themselves at differing match thresholds.

What the chart tooltip shows

The tooltip shows a number of statistics based on the match weight at the selected point on the line, including:

  • The chosen match weight and corresponding match probability.
  • The proportion of records that cannot be linked to themselves given the chosen match weight threshold for a match.
"},{"location":"charts/unlinkables_chart.html#how-to-interpret-the-chart","title":"How to interpret the chart","text":"

This chart gives an indication of data quality and/or model predictiveness within a Splink model. If a high proportion of records are not linkable to themselves at a low match threshold (e.g. 0 match weight/50% probability) we can conclude that one or both of the following apply:

  • the data quality is low enough such that a significant proportion of records are unable to be linked to themselves
  • the parameters of the Splink model are such that features have not been assigned enough weight, and therefore will not perform well

This chart also gives an indication of the number of False Negatives (i.e. missed links) at a given threshold, assuming sufficient data quality. For example:

  • we know that a record should always be linked to itself, so a match weight threshold of \(\approx\) 10 leaving 16% of records unable to link to themselves indicates missed links
  • exact matches generally provide the strongest matches, so we can expect any \"fuzzy\" matches to have lower match scores. As a result, we can deduce that the proportion of False Negatives will be higher than 16%.
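The relationship between match weight and match probability referred to above can be sketched directly: match weight is the log2 of the overall Bayes factor, so probability = 2^w / (1 + 2^w), giving 50% at a match weight of 0:

```python
# Sketch of the match weight <-> match probability relationship:
# match weight w is log2 of the Bayes factor, so
# probability = 2**w / (1 + 2**w).
def probability_from_match_weight(w):
    return 2**w / (1 + 2**w)

print(probability_from_match_weight(0))   # 0.5 - the "50% probability" point
print(probability_from_match_weight(10))  # very close to 1
```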
"},{"location":"charts/unlinkables_chart.html#actions-to-take-as-a-result-of-the-chart","title":"Actions to take as a result of the chart","text":"

If the level of unlinkable records is extremely high at low match weight thresholds, you have a poorly performing model. This may be resolvable by tweaking the model's comparisons, but if the poor performance is primarily down to poor data quality, there is very little that can be done to improve the model.

When interpreted as an indicator of False Negatives, this chart can be used to establish an upper bound for the match weight threshold, depending on the propensity for False Negatives in the particular use case.

"},{"location":"charts/unlinkables_chart.html#worked-example","title":"Worked Example","text":"
import splink.comparison_library as cl\nimport splink.comparison_template_library as ctl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\ndb_api = DuckDBAPI()\n\ndf = splink_datasets.fake_1000\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.JaroWinklerAtThresholds(\"first_name\", [0.9, 0.7]),\n        cl.JaroAtThresholds(\"surname\", [0.9, 0.7]),\n        ctl.DateComparison(\n            \"dob\",\n            input_is_string=True,\n            datetime_metrics=[\"year\", \"month\"],\n            datetime_thresholds=[1, 1],\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        ctl.EmailComparison(\"email\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"surname\"),\n    ],\n)\n\nlinker = Linker(df, settings, db_api)\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n\nblocking_rule_for_training = block_on(\"first_name\", \"surname\")\n\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\nblocking_rule_for_training = block_on(\"dob\")\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\nchart = linker.evaluation.unlinkables_chart()\nchart\n
"},{"location":"charts/waterfall_chart.html","title":"waterfall chart","text":""},{"location":"charts/waterfall_chart.html#waterfall_chart","title":"waterfall_chart","text":"

At a glance

Useful for: Looking at the breakdown of the match weight for a pair of records.

API Documentation: waterfall_chart()

What is needed to generate the chart? A trained Splink model

"},{"location":"charts/waterfall_chart.html#what-the-chart-shows","title":"What the chart shows","text":"

The waterfall_chart shows the amount of evidence of a match that is provided by each comparison for a pair of records. Each bar represents a comparison and the corresponding amount of evidence (i.e. match weight) of a match for the pair of values displayed above the bar.

What the chart tooltip shows

The tooltip contains information based on the bar that the user is hovering over, including:

  • The comparison column (or columns)
  • The column values from the pair of records being compared
  • The comparison level as a label, SQL statement and the corresponding comparison vector value
  • The Bayes factor (i.e. how many times more likely a match is, given this evidence)
  • The match weight for the comparison level
  • The cumulative match probability from the chosen comparison and all of the previous comparisons.
"},{"location":"charts/waterfall_chart.html#how-to-interpret-the-chart","title":"How to interpret the chart","text":"

The first bar (labelled \"Prior\") is the match weight if no additional knowledge of features is taken into account, and can be thought of as similar to the y-intercept in a simple regression.

Each subsequent bar shows the match weight for a comparison. These bars can be positive or negative depending on whether the given comparison gives positive or negative evidence for the two records being a match.

Additional bars are added for comparisons with term frequency adjustments. For example, the chart above has term frequency adjustments for first_name so there is an extra tf_first_name bar showing how the frequency of a given name impacts the amount of evidence for the two records being a match.

The final bar represents the total match weight for the pair of records. This match weight can also be translated into a final match probability, and the corresponding match probability is shown on the right axis (note the logarithmic scale).
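As a sketch of how the bars combine (toy numbers, not Splink output): the final match weight is the prior plus each comparison's match weight, and converts to the probability shown on the right axis via 2^w / (1 + 2^w):

```python
# Toy sketch of waterfall bars combining (numbers invented for illustration).
prior = -8.6
comparison_weights = {"first_name": 7.5, "surname": 6.9, "dob": 4.1, "city": -0.6}

# Final bar: prior plus every comparison's match weight.
total = prior + sum(comparison_weights.values())

# Right axis: match weight converted to match probability.
probability = 2**total / (1 + 2**total)
print(total, probability)
```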

"},{"location":"charts/waterfall_chart.html#actions-to-take-as-a-result-of-the-chart","title":"Actions to take as a result of the chart","text":"

This chart is useful for spot checking pairs of records to see if the Splink model is behaving as expected.

If a pair of records look like they are incorrectly being assigned as a match/non-match, it is a sign that the Splink model is not working optimally. If this is the case, it is worth revisiting the model training step.

Some common scenarios include:

  • If a comparison isn't capturing a specific edge case (e.g. fuzzy match), add a comparison level to capture this case and retrain the model.

  • If the match weight for a comparison is looking unusual, refer to the match_weights_chart to see the match weight in context with the rest of the comparison levels within that comparison. If it is still looking unusual, you can dig deeper with the parameter_estimate_comparisons_chart to see if the model training runs are consistent. If there is a lot of variation between model training sessions, this can suggest some instability in the model. In this case, try some different model training rules and/or comparison levels.

  • If the \"Prior\" match weight is too small or large compared to the match weight provided by the comparisons, try some different deterministic rules and recall inputs to the estimate_probability_two_random_records_match function.

  • If you are working with a model with term frequency adjustments and want to dig deeper into the impact of term frequency on the model as a whole (i.e. not just for a single pairwise comparison), check out the tf_adjustment_chart.

"},{"location":"charts/waterfall_chart.html#worked-example","title":"Worked Example","text":"
import splink.comparison_library as cl\nimport splink.comparison_template_library as ctl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\ndf = splink_datasets.fake_1000\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        ctl.NameComparison(\"first_name\").configure(term_frequency_adjustments=True),\n        ctl.NameComparison(\"surname\"),\n        ctl.DateComparison(\n            \"dob\",\n            input_is_string=True,\n            datetime_metrics=[\"year\", \"month\"],\n            datetime_thresholds=[1, 1],\n        ),\n        cl.ExactMatch(\"city\"),\n        ctl.EmailComparison(\"email\", include_username_fuzzy_level=False),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"surname\"),\n    ],\n    retain_intermediate_calculation_columns=True,\n    retain_matching_columns=True,\n)\n\nlinker = Linker(df, settings, DuckDBAPI())\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n\nblocking_rule_for_training = block_on(\"first_name\", \"surname\")\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\nblocking_rule_for_training = block_on(\"dob\")\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n\ndf_predictions = linker.inference.predict(threshold_match_probability=0.2)\nrecords_to_view = df_predictions.as_record_dict(limit=5)\n\nchart = linker.visualisations.waterfall_chart(records_to_view, filter_nulls=False)\nchart\n
"},{"location":"demos/examples/examples_index.html","title":"Introduction","text":"","tags":["Examples","DuckDB","Spark","Athena"]},{"location":"demos/examples/examples_index.html#example-notebooks","title":"Example Notebooks","text":"

This section provides a series of examples to help you get started with Splink. You can find the underlying notebooks in the demos folder of the Splink repository.

","tags":["Examples","DuckDB","Spark","Athena"]},{"location":"demos/examples/examples_index.html#duckdb-examples","title":"DuckDB examples","text":"","tags":["Examples","DuckDB","Spark","Athena"]},{"location":"demos/examples/examples_index.html#entity-type-persons","title":"Entity type: Persons","text":"

Deduplicating 50,000 records of realistic data based on historical persons

Using the link_only setting to link, but not dedupe, two datasets

Real time record linkage

Accuracy analysis and ROC charts using a ground truth (cluster) column

Estimating m probabilities from pairwise labels

Deduplicating 50,000 records with Deterministic Rules

Deduplicating the febrl3 dataset. Note this dataset comes from febrl, as referenced in A.2 here and replicated here.

Linking the febrl4 datasets. As above, these datasets are from febrl, replicated here.

Cookbook of various Splink techniques

","tags":["Examples","DuckDB","Spark","Athena"]},{"location":"demos/examples/examples_index.html#entity-type-financial-transactions","title":"Entity type: Financial transactions","text":"

Linking financial transactions

","tags":["Examples","DuckDB","Spark","Athena"]},{"location":"demos/examples/examples_index.html#pyspark-examples","title":"PySpark examples","text":"

Deduplication of a small dataset using PySpark. Entity type is persons.

","tags":["Examples","DuckDB","Spark","Athena"]},{"location":"demos/examples/examples_index.html#athena-examples","title":"Athena examples","text":"

Deduplicating 50,000 records of realistic data based on historical persons

","tags":["Examples","DuckDB","Spark","Athena"]},{"location":"demos/examples/examples_index.html#sqlite-examples","title":"SQLite examples","text":"

Deduplicating 50,000 records of realistic data based on historical persons

","tags":["Examples","DuckDB","Spark","Athena"]},{"location":"demos/examples/athena/deduplicate_50k_synthetic.html","title":"Deduplicate 50k rows historical persons","text":""},{"location":"demos/examples/athena/deduplicate_50k_synthetic.html#linking-a-dataset-of-real-historical-persons","title":"Linking a dataset of real historical persons","text":"

In this example, we deduplicate a more realistic dataset. The data is based on historical persons scraped from wikidata. Duplicate records have been introduced, with a variety of errors.

Create a boto3 session to be used within the linker

import boto3\n\nboto3_session = boto3.Session(region_name=\"eu-west-1\")\n
"},{"location":"demos/examples/athena/deduplicate_50k_synthetic.html#athenalinker-setup","title":"AthenaLinker Setup","text":"

To work nicely with Athena, you need to outline various filepaths, buckets and the database(s) you wish to interact with.

The AthenaLinker has three required inputs:

  • input_table_or_tables - the input table to use for linking. This can either be a table in a database or a pandas dataframe
  • output_database - the database to output all of your splink tables to.
  • output_bucket - the s3 bucket you wish any parquet files produced by splink to be output to.

and two optional inputs:

  • output_filepath - the s3 filepath to output files to. This is an extension of output_bucket and dictates the full filepath your files will be output to.
  • input_table_aliases - the name of your table within your database, should you choose to use a pandas df as an input.

# Set the output bucket and the additional filepath to write outputs to\n############################################\n# EDIT THESE BEFORE ATTEMPTING TO RUN THIS #\n############################################\n\nfrom splink.backends.athena import AthenaAPI\n\n\nbucket = \"MYTESTBUCKET\"\ndatabase = \"MYTESTDATABASE\"\nfilepath = \"MYTESTFILEPATH\"  # file path inside of your bucket\n\naws_filepath = f\"s3://{bucket}/{filepath}\"\ndb_api = AthenaAPI(\n    boto3_session,\n    output_bucket=bucket,\n    output_database=database,\n    output_filepath=filepath,\n)\n
import splink.comparison_library as cl\nfrom splink import block_on\n\nfrom splink import Linker, SettingsCreator, splink_datasets\n\ndf = splink_datasets.historical_50k\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\", \"surname\"),\n        block_on(\"surname\", \"dob\"),\n    ],\n    comparisons=[\n        cl.ExactMatch(\"first_name\").configure(term_frequency_adjustments=True),\n        cl.LevenshteinAtThresholds(\"surname\", [1, 3]),\n        cl.LevenshteinAtThresholds(\"dob\", [1, 2]),\n        cl.LevenshteinAtThresholds(\"postcode_fake\", [1, 2]),\n        cl.ExactMatch(\"birth_place\").configure(term_frequency_adjustments=True),\n        cl.ExactMatch(\"occupation\").configure(term_frequency_adjustments=True),\n    ],\n    retain_intermediate_calculation_columns=True,\n)\n
from splink.exploratory import profile_columns\n\nprofile_columns(df, db_api, column_expressions=[\"first_name\", \"substr(surname,1,2)\"])\n
from splink.blocking_analysis import (\n    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,\n)\nfrom splink import block_on\n\ncumulative_comparisons_to_be_scored_from_blocking_rules_chart(\n    table_or_tables=df,\n    db_api=db_api,\n    blocking_rules=[block_on(\"first_name\", \"surname\"), block_on(\"surname\", \"dob\")],\n    link_type=\"dedupe_only\",\n)\n
import splink.comparison_library as cl\n\n\nfrom splink import Linker, SettingsCreator\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\", \"surname\"),\n        block_on(\"surname\", \"dob\"),\n    ],\n    comparisons=[\n        cl.ExactMatch(\"first_name\").configure(term_frequency_adjustments=True),\n        cl.LevenshteinAtThresholds(\"surname\", [1, 3]),\n        cl.LevenshteinAtThresholds(\"dob\", [1, 2]),\n        cl.LevenshteinAtThresholds(\"postcode_fake\", [1, 2]),\n        cl.ExactMatch(\"birth_place\").configure(term_frequency_adjustments=True),\n        cl.ExactMatch(\"occupation\").configure(term_frequency_adjustments=True),\n    ],\n    retain_intermediate_calculation_columns=True,\n)\n\nlinker = Linker(df, settings, db_api=db_api)\n
linker.training.estimate_probability_two_random_records_match(\n    [\n        block_on(\"first_name\", \"surname\", \"dob\"),\n        block_on(\"substr(first_name,1,2)\", \"surname\", \"substr(postcode_fake, 1,2)\"),\n        block_on(\"dob\", \"postcode_fake\"),\n    ],\n    recall=0.6,\n)\n
Probability two random records match is estimated to be  0.000136.\nThis means that amongst all possible pairwise record comparisons, one in 7,362.31 are expected to match.  With 1,279,041,753 total possible comparisons, we expect a total of around 173,728.33 matching pairs\n
linker.training.estimate_u_using_random_sampling(max_pairs=5e6)\n
----- Estimating u probabilities using random sampling -----\n\nEstimated u probabilities using random sampling\n\nYour model is not yet fully trained. Missing estimates for:\n    - first_name (no m values are trained).\n    - surname (no m values are trained).\n    - dob (no m values are trained).\n    - postcode_fake (no m values are trained).\n    - birth_place (no m values are trained).\n    - occupation (no m values are trained).\n
blocking_rule = block_on(\"first_name\", \"surname\")\ntraining_session_names = (\n    linker.training.estimate_parameters_using_expectation_maximisation(blocking_rule)\n)\n
----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\n(l.\"first_name\" = r.\"first_name\") AND (l.\"surname\" = r.\"surname\")\n\nParameter estimates will be made for the following comparison(s):\n    - dob\n    - postcode_fake\n    - birth_place\n    - occupation\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - first_name\n    - surname\n\nIteration 1: Largest change in params was -0.526 in probability_two_random_records_match\nIteration 2: Largest change in params was -0.0321 in probability_two_random_records_match\nIteration 3: Largest change in params was 0.0109 in the m_probability of birth_place, level `Exact match on birth_place`\nIteration 4: Largest change in params was -0.00341 in the m_probability of birth_place, level `All other comparisons`\nIteration 5: Largest change in params was -0.00116 in the m_probability of dob, level `All other comparisons`\nIteration 6: Largest change in params was -0.000547 in the m_probability of dob, level `All other comparisons`\nIteration 7: Largest change in params was -0.00029 in the m_probability of dob, level `All other comparisons`\nIteration 8: Largest change in params was -0.000169 in the m_probability of dob, level `All other comparisons`\nIteration 9: Largest change in params was -0.000105 in the m_probability of dob, level `All other comparisons`\nIteration 10: Largest change in params was -6.87e-05 in the m_probability of dob, level `All other comparisons`\n\nEM converged after 10 iterations\n\nYour model is not yet fully trained. Missing estimates for:\n    - first_name (no m values are trained).\n    - surname (no m values are trained).\n
blocking_rule = block_on(\"dob\")\ntraining_session_dob = (\n    linker.training.estimate_parameters_using_expectation_maximisation(blocking_rule)\n)\n
----- Starting EM training session -----\n\n\n\nEstimating the m probabilities of the model by blocking on:\nl.\"dob\" = r.\"dob\"\n\nParameter estimates will be made for the following comparison(s):\n    - first_name\n    - surname\n    - postcode_fake\n    - birth_place\n    - occupation\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - dob\n\nIteration 1: Largest change in params was -0.355 in the m_probability of first_name, level `Exact match on first_name`\nIteration 2: Largest change in params was -0.0383 in the m_probability of first_name, level `Exact match on first_name`\nIteration 3: Largest change in params was 0.00531 in the m_probability of postcode_fake, level `All other comparisons`\nIteration 4: Largest change in params was 0.00129 in the m_probability of postcode_fake, level `All other comparisons`\nIteration 5: Largest change in params was 0.00034 in the m_probability of surname, level `All other comparisons`\nIteration 6: Largest change in params was 8.9e-05 in the m_probability of surname, level `All other comparisons`\n\nEM converged after 6 iterations\n\nYour model is fully trained. All comparisons have at least one estimate for their m and u values\n
linker.visualisations.match_weights_chart()\n
linker.evaluation.unlinkables_chart()\n
df_predict = linker.inference.predict()\ndf_e = df_predict.as_pandas_dataframe(limit=5)\ndf_e\n
match_weight match_probability unique_id_l unique_id_r first_name_l first_name_r gamma_first_name tf_first_name_l tf_first_name_r bf_first_name ... bf_birth_place bf_tf_adj_birth_place occupation_l occupation_r gamma_occupation tf_occupation_l tf_occupation_r bf_occupation bf_tf_adj_occupation match_key 0 27.149493 1.000000 Q2296770-1 Q2296770-12 thomas rhomas 0 0.028667 0.000059 0.455194 ... 160.713933 4.179108 politician politician 1 0.088932 0.088932 22.916859 0.441273 1 1 1.627242 0.755454 Q2296770-1 Q2296770-15 thomas clifford, 0 0.028667 0.000020 0.455194 ... 0.154550 1.000000 politician <NA> -1 0.088932 NaN 1.000000 1.000000 1 2 29.206505 1.000000 Q2296770-1 Q2296770-3 thomas tom 0 0.028667 0.012948 0.455194 ... 160.713933 4.179108 politician politician 1 0.088932 0.088932 22.916859 0.441273 1 3 13.783027 0.999929 Q2296770-1 Q2296770-7 thomas tom 0 0.028667 0.012948 0.455194 ... 0.154550 1.000000 politician <NA> -1 0.088932 NaN 1.000000 1.000000 1 4 29.206505 1.000000 Q2296770-2 Q2296770-3 thomas tom 0 0.028667 0.012948 0.455194 ... 160.713933 4.179108 politician politician 1 0.088932 0.088932 22.916859 0.441273 1

5 rows \u00d7 38 columns

You can also view rows in this dataset as a waterfall chart as follows:

records_to_plot = df_e.to_dict(orient=\"records\")\nlinker.visualisations.waterfall_chart(records_to_plot, filter_nulls=False)\n
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(\n    df_predict, threshold_match_probability=0.95\n)\n
Completed iteration 1, root rows count 641\nCompleted iteration 2, root rows count 187\nCompleted iteration 3, root rows count 251\nCompleted iteration 4, root rows count 75\nCompleted iteration 5, root rows count 23\nCompleted iteration 6, root rows count 30\nCompleted iteration 7, root rows count 34\nCompleted iteration 8, root rows count 30\nCompleted iteration 9, root rows count 9\nCompleted iteration 10, root rows count 5\nCompleted iteration 11, root rows count 0\n
linker.visualisations.cluster_studio_dashboard(\n    df_predict,\n    clusters,\n    \"dashboards/50k_cluster.html\",\n    sampling_method=\"by_cluster_size\",\n    overwrite=True,\n)\n\nfrom IPython.display import IFrame\n\nIFrame(src=\"./dashboards/50k_cluster.html\", width=\"100%\", height=1200)\n

"},{"location":"demos/examples/duckdb/accuracy_analysis_from_labels_column.html","title":"Evaluation from ground truth column","text":""},{"location":"demos/examples/duckdb/accuracy_analysis_from_labels_column.html#evaluation-when-you-have-fully-labelled-data","title":"Evaluation when you have fully labelled data","text":"

In this example, our data contains a fully-populated ground-truth column called cluster that enables us to perform accuracy analysis of the final model.

from splink import splink_datasets\n\ndf = splink_datasets.fake_1000\ndf.head(2)\n
unique_id first_name surname dob city email cluster 0 0 Robert Alan 1971-06-24 NaN robert255@smith.net 0 1 1 Robert Allen 1971-05-24 NaN roberta25@smith.net 0
from splink import SettingsCreator, Linker, block_on, DuckDBAPI\n\nimport splink.comparison_library as cl\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"surname\"),\n        block_on(\"dob\"),\n        block_on(\"email\"),\n    ],\n    comparisons=[\n        cl.ForenameSurnameComparison(\"first_name\", \"surname\"),\n        cl.DateOfBirthComparison(\n            \"dob\",\n            input_is_string=True,\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        cl.EmailComparison(\"email\"),\n    ],\n    retain_intermediate_calculation_columns=True,\n)\n
db_api = DuckDBAPI()\nlinker = Linker(df, settings, db_api=db_api)\ndeterministic_rules = [\n    \"l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1\",\n    \"l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1\",\n    \"l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2\",\n    \"l.email = r.email\",\n]\n\nlinker.training.estimate_probability_two_random_records_match(\n    deterministic_rules, recall=0.7\n)\n
Probability two random records match is estimated to be  0.00333.\nThis means that amongst all possible pairwise record comparisons, one in 300.13 are expected to match.  With 499,500 total possible comparisons, we expect a total of around 1,664.29 matching pairs\n
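The figures in this output can be reproduced with simple arithmetic: a dedupe of n records yields n(n-1)/2 possible pairs, and the expected number of matches is that total multiplied by the estimated match probability. A quick sketch (assuming the 1,000 rows of fake_1000; the exact decimals depend on rounding of the logged probability):

```python
n = 1000  # rows in the fake_1000 dataset
total_comparisons = n * (n - 1) // 2  # 499,500 possible pairs for dedupe_only
p = 1 / 300.13  # "one in 300.13", from the log output above
print(total_comparisons)                # 499500
print(round(total_comparisons * p, 2))  # ~1,664 expected matching pairs
```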
linker.training.estimate_u_using_random_sampling(max_pairs=1e6, seed=5)\n
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.\n----- Estimating u probabilities using random sampling -----\n\nEstimated u probabilities using random sampling\n\nYour model is not yet fully trained. Missing estimates for:\n    - first_name_surname (no m values are trained).\n    - dob (no m values are trained).\n    - city (no m values are trained).\n    - email (no m values are trained).\n
session_dob = linker.training.estimate_parameters_using_expectation_maximisation(\n    block_on(\"dob\"), estimate_without_term_frequencies=True\n)\nsession_email = linker.training.estimate_parameters_using_expectation_maximisation(\n    block_on(\"email\"), estimate_without_term_frequencies=True\n)\nsession_dob = linker.training.estimate_parameters_using_expectation_maximisation(\n    block_on(\"first_name\", \"surname\"), estimate_without_term_frequencies=True\n)\n
----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\nl.\"dob\" = r.\"dob\"\n\nParameter estimates will be made for the following comparison(s):\n    - first_name_surname\n    - city\n    - email\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - dob\n\nWARNING:\nLevel Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value\n\nIteration 1: Largest change in params was -0.751 in the m_probability of first_name_surname, level `(Exact match on first_name) AND (Exact match on surname)`\nIteration 2: Largest change in params was 0.196 in probability_two_random_records_match\nIteration 3: Largest change in params was 0.0536 in probability_two_random_records_match\nIteration 4: Largest change in params was 0.0189 in probability_two_random_records_match\nIteration 5: Largest change in params was 0.00731 in probability_two_random_records_match\nIteration 6: Largest change in params was 0.0029 in probability_two_random_records_match\nIteration 7: Largest change in params was 0.00116 in probability_two_random_records_match\nIteration 8: Largest change in params was 0.000469 in probability_two_random_records_match\nIteration 9: Largest change in params was 0.000189 in probability_two_random_records_match\nIteration 10: Largest change in params was 7.62e-05 in probability_two_random_records_match\n\nEM converged after 10 iterations\nm probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.\n\nYour model is not yet fully trained. 
Missing estimates for:\n    - dob (no m values are trained).\n    - email (some m values are not trained).\n\n----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\nl.\"email\" = r.\"email\"\n\nParameter estimates will be made for the following comparison(s):\n    - first_name_surname\n    - dob\n    - city\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - email\n\nIteration 1: Largest change in params was -0.438 in the m_probability of dob, level `Exact match on dob`\nIteration 2: Largest change in params was 0.122 in probability_two_random_records_match\nIteration 3: Largest change in params was 0.0286 in probability_two_random_records_match\nIteration 4: Largest change in params was 0.01 in probability_two_random_records_match\nIteration 5: Largest change in params was 0.00448 in probability_two_random_records_match\nIteration 6: Largest change in params was 0.00237 in probability_two_random_records_match\nIteration 7: Largest change in params was 0.0014 in probability_two_random_records_match\nIteration 8: Largest change in params was 0.000893 in probability_two_random_records_match\nIteration 9: Largest change in params was 0.000597 in probability_two_random_records_match\nIteration 10: Largest change in params was 0.000413 in probability_two_random_records_match\nIteration 11: Largest change in params was 0.000292 in probability_two_random_records_match\nIteration 12: Largest change in params was 0.000211 in probability_two_random_records_match\nIteration 13: Largest change in params was 0.000154 in probability_two_random_records_match\nIteration 14: Largest change in params was 0.000113 in probability_two_random_records_match\nIteration 15: Largest change in params was 8.4e-05 in probability_two_random_records_match\n\nEM converged after 15 iterations\n\nYour model is not yet fully trained. 
Missing estimates for:\n    - email (some m values are not trained).\n\n----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\n(l.\"first_name\" = r.\"first_name\") AND (l.\"surname\" = r.\"surname\")\n\nParameter estimates will be made for the following comparison(s):\n    - dob\n    - city\n    - email\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - first_name_surname\n\nWARNING:\nLevel Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value\n\nIteration 1: Largest change in params was 0.473 in probability_two_random_records_match\nIteration 2: Largest change in params was 0.0452 in probability_two_random_records_match\nIteration 3: Largest change in params was 0.00766 in probability_two_random_records_match\nIteration 4: Largest change in params was 0.00135 in probability_two_random_records_match\nIteration 5: Largest change in params was 0.00025 in probability_two_random_records_match\nIteration 6: Largest change in params was 0.000468 in the m_probability of email, level `All other comparisons`\nIteration 7: Largest change in params was 0.00776 in the m_probability of email, level `All other comparisons`\nIteration 8: Largest change in params was 0.00992 in the m_probability of email, level `All other comparisons`\nIteration 9: Largest change in params was 0.00277 in probability_two_random_records_match\nIteration 10: Largest change in params was 0.000972 in probability_two_random_records_match\nIteration 11: Largest change in params was 0.000337 in probability_two_random_records_match\nIteration 12: Largest change in params was 0.000118 in probability_two_random_records_match\nIteration 13: Largest change in params was 4.14e-05 in probability_two_random_records_match\n\nEM converged after 13 iterations\nm probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector 
value: 1). This usually means the comparison level was never observed in the training data.\n\nYour model is not yet fully trained. Missing estimates for:\n    - email (some m values are not trained).\n
linker.evaluation.accuracy_analysis_from_labels_column(\n    \"cluster\", output_type=\"table\"\n).as_pandas_dataframe(limit=5)\n
 -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'email':\n    m values not fully trained\n
truth_threshold match_probability total_clerical_labels p n tp tn fp fn P_rate ... precision recall specificity npv accuracy f1 f2 f0_5 p4 phi 0 -17.8 0.000004 499500.0 2031.0 497469.0 1650.0 495130.0 2339.0 381.0 0.004066 ... 0.413638 0.812408 0.995298 0.999231 0.994555 0.548173 0.681086 0.458665 0.707466 0.577474 1 -17.7 0.000005 499500.0 2031.0 497469.0 1650.0 495225.0 2244.0 381.0 0.004066 ... 0.423729 0.812408 0.995489 0.999231 0.994745 0.556962 0.686470 0.468564 0.714769 0.584558 2 -17.1 0.000007 499500.0 2031.0 497469.0 1650.0 495311.0 2158.0 381.0 0.004066 ... 0.433298 0.812408 0.995662 0.999231 0.994917 0.565165 0.691418 0.477901 0.721512 0.591197 3 -17.0 0.000008 499500.0 2031.0 497469.0 1650.0 495354.0 2115.0 381.0 0.004066 ... 0.438247 0.812408 0.995748 0.999231 0.995003 0.569358 0.693919 0.482710 0.724931 0.594601 4 -16.9 0.000008 499500.0 2031.0 497469.0 1650.0 495386.0 2083.0 381.0 0.004066 ... 0.442004 0.812408 0.995813 0.999231 0.995067 0.572519 0.695792 0.486353 0.727497 0.597173

5 rows \u00d7 25 columns
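Each row of this table can be checked by hand from the confusion-matrix counts it contains. For example, for the first threshold shown (a sketch using the tp, fp and fn values from row 0 above):

```python
# Counts taken from the first row of the accuracy table above
tp, fp, fn = 1650.0, 2339.0, 381.0

precision = tp / (tp + fp)                          # ~0.4136
recall = tp / (tp + fn)                             # ~0.8124
f1 = 2 * precision * recall / (precision + recall)  # ~0.5482

print(round(precision, 6), round(recall, 6), round(f1, 6))
```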

linker.evaluation.accuracy_analysis_from_labels_column(\"cluster\", output_type=\"roc\")\n
 -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'email':\n    m values not fully trained\n
linker.evaluation.accuracy_analysis_from_labels_column(\n    \"cluster\",\n    output_type=\"threshold_selection\",\n    threshold_match_probability=0.5,\n    add_metrics=[\"f1\"],\n)\n
 -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'email':\n    m values not fully trained\n
# Plot some false positives\nlinker.evaluation.prediction_errors_from_labels_column(\n    \"cluster\", include_false_negatives=True, include_false_positives=True\n).as_pandas_dataframe(limit=5)\n
 -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'email':\n    m values not fully trained\n
clerical_match_score found_by_blocking_rules match_weight match_probability unique_id_l unique_id_r surname_l surname_r first_name_l first_name_r ... email_l email_r gamma_email tf_email_l tf_email_r bf_email bf_tf_adj_email cluster_l cluster_r match_key 0 1.0 False -15.568945 0.000021 452 454 Daves Reuben None Davies ... rd@lewis.com idlewrs.cocm 0 0.003802 0.001267 0.01099 1.0 115 115 4 1 1.0 False -14.884057 0.000033 715 717 Joes Jones None Mia ... None mia.j63@martinez.biz -1 NaN 0.005070 1.00000 1.0 182 182 4 2 1.0 False -14.884057 0.000033 626 628 Davidson None geeorGe Geeorge ... None gdavidson@johnson-brown.com -1 NaN 0.005070 1.00000 1.0 158 158 4 3 1.0 False -13.761589 0.000072 983 984 Milller Miller Jessica aessicJ ... None jessica.miller@johnson.com -1 NaN 0.007605 1.00000 1.0 246 246 4 4 1.0 True -11.637585 0.000314 594 595 Kik Kiirk Grace Grace ... gk@frey-robinson.org rgk@frey-robinon.org 0 0.001267 0.001267 0.01099 1.0 146 146 0

5 rows \u00d7 38 columns

records = linker.evaluation.prediction_errors_from_labels_column(\n    \"cluster\", include_false_negatives=True, include_false_positives=True\n).as_record_dict(limit=5)\n\nlinker.visualisations.waterfall_chart(records)\n
 -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'email':\n    m values not fully trained\n
"},{"location":"demos/examples/duckdb/cookbook.html","title":"Cookbook","text":""},{"location":"demos/examples/duckdb/cookbook.html#cookbook","title":"Cookbook","text":"

This notebook contains a miscellaneous collection of runnable examples illustrating various Splink techniques.

"},{"location":"demos/examples/duckdb/cookbook.html#array-columns","title":"Array columns","text":""},{"location":"demos/examples/duckdb/cookbook.html#comparing-array-columns","title":"Comparing array columns","text":"

This example shows how we can use ArrayIntersectAtSizes to assess the similarity of columns containing arrays.
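The underlying logic is straightforward: the comparison level is chosen by the size of the intersection of the two arrays. A plain-Python sketch of that logic (not the Splink implementation; the function name is made up for illustration):

```python
def intersect_level(left, right, sizes=(2, 1)):
    # Hypothetical illustration of ArrayIntersectAtSizes("postcode", [2, 1]):
    # return the gamma value implied by the intersection size
    n = len(set(left) & set(right))
    for i, s in enumerate(sizes):
        if n >= s:
            return len(sizes) - i  # larger intersections map to higher gammas
    return 0  # "all other comparisons"

print(intersect_level(["A", "B"], ["A", "B"]))  # 2
print(intersect_level(["A", "B"], ["B"]))       # 1
print(intersect_level(["A"], ["C"]))            # 0
```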

import pandas as pd\n\nimport splink.comparison_library as cl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on\n\n\ndata = [\n    {\"unique_id\": 1, \"first_name\": \"John\", \"postcode\": [\"A\", \"B\"]},\n    {\"unique_id\": 2, \"first_name\": \"John\", \"postcode\": [\"B\"]},\n    {\"unique_id\": 3, \"first_name\": \"John\", \"postcode\": [\"A\"]},\n    {\"unique_id\": 4, \"first_name\": \"John\", \"postcode\": [\"A\", \"B\"]},\n    {\"unique_id\": 5, \"first_name\": \"John\", \"postcode\": [\"C\"]},\n]\n\ndf = pd.DataFrame(data)\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n    ],\n    comparisons=[\n        cl.ArrayIntersectAtSizes(\"postcode\", [2, 1]),\n        cl.ExactMatch(\"first_name\"),\n    ]\n)\n\n\nlinker = Linker(df, settings, DuckDBAPI(), set_up_basic_logging=False)\n\nlinker.inference.predict().as_pandas_dataframe()\n
match_weight match_probability unique_id_l unique_id_r postcode_l postcode_r gamma_postcode first_name_l first_name_r gamma_first_name 0 -8.287568 0.003190 4 5 [A, B] [C] 0 John John 1 1 -0.287568 0.450333 3 4 [A] [A, B] 1 John John 1 2 -8.287568 0.003190 3 5 [A] [C] 0 John John 1 3 -8.287568 0.003190 2 3 [B] [A] 0 John John 1 4 -0.287568 0.450333 2 4 [B] [A, B] 1 John John 1 5 -8.287568 0.003190 2 5 [B] [C] 0 John John 1 6 -0.287568 0.450333 1 2 [A, B] [B] 1 John John 1 7 -0.287568 0.450333 1 3 [A, B] [A] 1 John John 1 8 6.712432 0.990554 1 4 [A, B] [A, B] 2 John John 1 9 -8.287568 0.003190 1 5 [A, B] [C] 0 John John 1"},{"location":"demos/examples/duckdb/cookbook.html#blocking-on-array-columns","title":"Blocking on array columns","text":"

This example shows how we can use block_on to block on the individual elements of an array column - that is, pairwise comparisons are created for pairs of records where any of the elements in the array columns match.
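Conceptually, a pair is generated whenever two records share at least one array element. A pure-Python sketch of that rule (Splink actually does this in SQL by exploding the arrays, but the selection criterion is the same):

```python
from itertools import combinations

# Same postcode arrays as the example below
records = {
    1: ["A", "B"],
    2: ["B"],
    3: ["C"],
}

# Keep a pair when the records share at least one array element
pairs = [
    (l, r)
    for l, r in combinations(records, 2)
    if set(records[l]) & set(records[r])
]
print(pairs)  # [(1, 2)] - only the pair sharing element "B"
```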

import pandas as pd\n\nimport splink.comparison_library as cl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on\n\n\ndata = [\n    {\"unique_id\": 1, \"first_name\": \"John\", \"postcode\": [\"A\", \"B\"]},\n    {\"unique_id\": 2, \"first_name\": \"John\", \"postcode\": [\"B\"]},\n    {\"unique_id\": 3, \"first_name\": \"John\", \"postcode\": [\"C\"]},\n\n]\n\ndf = pd.DataFrame(data)\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    blocking_rules_to_generate_predictions=[\n        block_on(\"postcode\", arrays_to_explode=[\"postcode\"]),\n    ],\n    comparisons=[\n        cl.ArrayIntersectAtSizes(\"postcode\", [2, 1]),\n        cl.ExactMatch(\"first_name\"),\n    ]\n)\n\n\nlinker = Linker(df, settings, DuckDBAPI(), set_up_basic_logging=False)\n\nlinker.inference.predict().as_pandas_dataframe()\n
match_weight match_probability unique_id_l unique_id_r postcode_l postcode_r gamma_postcode first_name_l first_name_r gamma_first_name 0 -0.287568 0.450333 1 2 [A, B] [B] 1 John John 1"},{"location":"demos/examples/duckdb/cookbook.html#other","title":"Other","text":""},{"location":"demos/examples/duckdb/cookbook.html#using-duckdb-without-pandas","title":"Using DuckDB without pandas","text":"

In this example, we read data directly using DuckDB and obtain results in DuckDB's native DuckDBPyRelation format.

import duckdb\nimport tempfile\nimport os\n\nimport splink.comparison_library as cl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\n# Create a parquet file on disk to demonstrate native DuckDB parquet reading\ndf = splink_datasets.fake_1000\ntemp_file = tempfile.NamedTemporaryFile(delete=True, suffix=\".parquet\")\ntemp_file_path = temp_file.name\ndf.to_parquet(temp_file_path)\n\n# Example would start here if you already had a parquet file\nduckdb_df = duckdb.read_parquet(temp_file_path)\n\ndb_api = DuckDBAPI(\":default:\")\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.NameComparison(\"first_name\"),\n        cl.JaroAtThresholds(\"surname\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\", \"dob\"),\n        block_on(\"surname\"),\n    ],\n)\n\nlinker = Linker(duckdb_df, settings, db_api, set_up_basic_logging=False)\n\nresult = linker.inference.predict().as_duckdbpyrelation()\n\n# Since result is a DuckDBPyRelation, we can use all the usual DuckDB API\n# functions on it.\n\n# For example, we can use the `sort` function to sort the results,\n# or could use result.to_parquet() to write to a parquet file.\nresult.sort(\"match_weight\")\n
\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502    match_weight     \u2502  match_probability   \u2502 unique_id_l \u2502 \u2026 \u2502 gamma_surname \u2502   dob_l    \u2502   dob_r    \u2502 match_key \u2502\n\u2502       double        \u2502        double        \u2502    int64    \u2502   \u2502     int32     \u2502  varchar   \u2502  varchar   \u2502  varchar  \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502  -11.83278901894715 \u2502 0.000274066864295451 \u2502         758 \u2502 \u2026 \u2502             0 \u2502 2002-09-15 \u2502 2002-09-15 \u2502 0         \u2502\n\u2502 -10.247826518225994 \u2502  0.0008217501639050\u2026 \u2502         
670 \u2502 \u2026 \u2502             0 \u2502 2006-12-05 \u2502 2006-12-05 \u2502 0         \u2502\n\u2502  -9.662864017504837 \u2502  0.0012321189988629\u2026 \u2502         558 \u2502 \u2026 \u2502             0 \u2502 2020-02-11 \u2502 2020-02-11 \u2502 0         \u2502\n\u2502  -9.470218939562441 \u2502  0.0014078881864458\u2026 \u2502         259 \u2502 \u2026 \u2502             1 \u2502 1983-03-07 \u2502 1983-03-07 \u2502 0         \u2502\n\u2502  -8.470218939562441 \u2502 0.002811817648042493 \u2502         644 \u2502 \u2026 \u2502            -1 \u2502 1992-02-06 \u2502 1992-02-06 \u2502 0         \u2502\n\u2502  -8.287568102831404 \u2502  0.0031901106569634\u2026 \u2502         393 \u2502 \u2026 \u2502             3 \u2502 1991-05-06 \u2502 1991-04-12 \u2502 1         \u2502\n\u2502  -8.287568102831404 \u2502  0.0031901106569634\u2026 \u2502         282 \u2502 \u2026 \u2502             3 \u2502 2004-12-02 \u2502 2002-02-25 \u2502 1         \u2502\n\u2502  -8.287568102831404 \u2502  0.0031901106569634\u2026 \u2502         282 \u2502 \u2026 \u2502             3 \u2502 2004-12-02 \u2502 1993-03-01 \u2502 1         \u2502\n\u2502  -8.287568102831404 \u2502  0.0031901106569634\u2026 \u2502         531 \u2502 \u2026 \u2502             3 \u2502 1987-09-11 \u2502 2000-09-03 \u2502 1         \u2502\n\u2502  -8.287568102831404 \u2502  0.0031901106569634\u2026 \u2502         531 \u2502 \u2026 \u2502             3 \u2502 1987-09-11 \u2502 1990-10-06 \u2502 1         \u2502\n\u2502           \u00b7         \u2502            \u00b7         \u2502          \u00b7  \u2502 \u00b7 \u2502             \u00b7 \u2502     \u00b7      \u2502     \u00b7      \u2502 \u00b7         \u2502\n\u2502           \u00b7         \u2502            \u00b7         \u2502          \u00b7  \u2502 \u00b7 \u2502             \u00b7 \u2502     \u00b7      \u2502     \u00b7      \u2502 \u00b7         \u2502\n\u2502           \u00b7         \u2502            \u00b7         \u2502          \u00b7  
\u2502 \u00b7 \u2502             \u00b7 \u2502     \u00b7      \u2502     \u00b7      \u2502 \u00b7         \u2502\n\u2502   5.337135982495163 \u2502   0.9758593366351407 \u2502         554 \u2502 \u2026 \u2502             3 \u2502 2020-02-11 \u2502 2030-02-08 \u2502 1         \u2502\n\u2502   5.337135982495163 \u2502   0.9758593366351407 \u2502         774 \u2502 \u2026 \u2502             3 \u2502 2027-04-21 \u2502 2017-04-23 \u2502 1         \u2502\n\u2502   5.337135982495163 \u2502   0.9758593366351407 \u2502         874 \u2502 \u2026 \u2502             3 \u2502 2020-06-23 \u2502 2019-05-23 \u2502 1         \u2502\n\u2502   5.337135982495163 \u2502   0.9758593366351407 \u2502         409 \u2502 \u2026 \u2502             3 \u2502 2017-05-03 \u2502 2008-05-05 \u2502 1         \u2502\n\u2502   5.337135982495163 \u2502   0.9758593366351407 \u2502         415 \u2502 \u2026 \u2502             3 \u2502 2002-02-25 \u2502 1993-03-01 \u2502 1         \u2502\n\u2502   5.337135982495163 \u2502   0.9758593366351407 \u2502         740 \u2502 \u2026 \u2502             3 \u2502 2005-09-18 \u2502 2006-09-14 \u2502 1         \u2502\n\u2502   5.337135982495163 \u2502   0.9758593366351407 \u2502         417 \u2502 \u2026 \u2502             3 \u2502 2002-02-24 \u2502 1992-02-28 \u2502 1         \u2502\n\u2502   5.337135982495163 \u2502   0.9758593366351407 \u2502         534 \u2502 \u2026 \u2502             3 \u2502 1974-02-28 \u2502 1975-03-31 \u2502 1         \u2502\n\u2502   5.337135982495163 \u2502   0.9758593366351407 \u2502         286 \u2502 \u2026 \u2502             3 \u2502 1985-01-05 \u2502 1986-02-04 \u2502 1         \u2502\n\u2502   5.337135982495163 \u2502   0.9758593366351407 \u2502         172 \u2502 \u2026 \u2502             3 \u2502 2012-07-06 \u2502 2012-07-09 \u2502 1         
\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502 1800 rows (20 shown)                                                                          13 columns (7 shown) \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
"},{"location":"demos/examples/duckdb/cookbook.html#fixing-m-or-u-probabilities-during-training","title":"Fixing m or u probabilities during training","text":"
import splink.comparison_level_library as cll\nimport splink.comparison_library as cl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\n\ndb_api = DuckDBAPI()\n\nfirst_name_comparison = cl.CustomComparison(\n    comparison_levels=[\n        cll.NullLevel(\"first_name\"),\n        cll.ExactMatchLevel(\"first_name\").configure(\n            m_probability=0.9999,\n            fix_m_probability=True,\n            u_probability=0.7,\n            fix_u_probability=True,\n        ),\n        cll.ElseLevel(),\n    ]\n)\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        first_name_comparison,\n        cl.ExactMatch(\"surname\"),\n        cl.ExactMatch(\"dob\"),\n        cl.ExactMatch(\"city\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"dob\"),\n    ],\n    additional_columns_to_retain=[\"cluster\"],\n)\n\ndf = splink_datasets.fake_1000\nlinker = Linker(df, settings, db_api, set_up_basic_logging=False)\n\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\nlinker.training.estimate_parameters_using_expectation_maximisation(block_on(\"dob\"))\n\nlinker.visualisations.m_u_parameters_chart()\n
"},{"location":"demos/examples/duckdb/cookbook.html#manually-altering-m-and-u-probabilities-post-training","title":"Manually altering m and u probabilities post-training","text":"

This is not officially supported, but can be useful for ad-hoc alterations to trained models.

import splink.comparison_level_library as cll\nimport splink.comparison_library as cl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\nfrom splink.datasets import splink_dataset_labels\n\nlabels = splink_dataset_labels.fake_1000_labels\n\ndb_api = DuckDBAPI()\n\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.ExactMatch(\"first_name\"),\n        cl.ExactMatch(\"surname\"),\n        cl.ExactMatch(\"dob\"),\n        cl.ExactMatch(\"city\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"dob\"),\n    ],\n)\ndf = splink_datasets.fake_1000\nlinker = Linker(df, settings, db_api, set_up_basic_logging=False)\n\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\nlinker.training.estimate_parameters_using_expectation_maximisation(block_on(\"dob\"))\n\n\nsurname_comparison = linker._settings_obj._get_comparison_by_output_column_name(\n    \"surname\"\n)\nelse_comparison_level = (\n    surname_comparison._get_comparison_level_by_comparison_vector_value(0)\n)\nelse_comparison_level._m_probability = 0.1\n\n\nlinker.visualisations.m_u_parameters_chart()\n
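When hand-editing parameters like this, it helps to remember how they feed into predictions: each comparison level contributes a Bayes factor of m/u, and its match weight is the log2 of that factor. A quick illustrative sketch (the u value here is chosen for the example, not taken from the trained model above):

```python
import math

m = 0.1  # the manually-set m probability for the 'else' level above
u = 0.4  # an illustrative u probability

bayes_factor = m / u               # 0.25: this level makes a match less likely
match_weight = math.log2(bayes_factor)
print(bayes_factor, match_weight)  # 0.25 -2.0
```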
"},{"location":"demos/examples/duckdb/cookbook.html#generate-the-beta-labelling-tool","title":"Generate the (beta) labelling tool","text":"
import splink.comparison_library as cl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\ndb_api = DuckDBAPI()\n\ndf = splink_datasets.fake_1000\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.ExactMatch(\"first_name\"),\n        cl.ExactMatch(\"surname\"),\n        cl.ExactMatch(\"dob\"),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        cl.ExactMatch(\"email\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"surname\"),\n    ],\n    max_iterations=2,\n)\n\nlinker = Linker(df, settings, db_api, set_up_basic_logging=False)\n\nlinker.training.estimate_probability_two_random_records_match(\n    [block_on(\"first_name\", \"surname\")], recall=0.7\n)\n\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n\nlinker.training.estimate_parameters_using_expectation_maximisation(block_on(\"dob\"))\n\npairwise_predictions = linker.inference.predict(threshold_match_weight=-10)\n\nfirst_unique_id = df.iloc[0].unique_id\nlinker.evaluation.labelling_tool_for_specific_record(unique_id=first_unique_id, overwrite=True)\n
"},{"location":"demos/examples/duckdb/deduplicate_50k_synthetic.html","title":"Deduplicate 50k rows historical persons","text":""},{"location":"demos/examples/duckdb/deduplicate_50k_synthetic.html#linking-a-dataset-of-real-historical-persons","title":"Linking a dataset of real historical persons","text":"

In this example, we deduplicate a more realistic dataset. The data is based on historical persons scraped from Wikidata. Duplicate records have been introduced, containing a variety of errors.

from splink import splink_datasets\n\ndf = splink_datasets.historical_50k\n
df.head()\n
unique_id cluster full_name first_and_surname first_name surname dob birth_place postcode_fake gender occupation 0 Q2296770-1 Q2296770 thomas clifford, 1st baron clifford of chudleigh thomas chudleigh thomas chudleigh 1630-08-01 devon tq13 8df male politician 1 Q2296770-2 Q2296770 thomas of chudleigh thomas chudleigh thomas chudleigh 1630-08-01 devon tq13 8df male politician 2 Q2296770-3 Q2296770 tom 1st baron clifford of chudleigh tom chudleigh tom chudleigh 1630-08-01 devon tq13 8df male politician 3 Q2296770-4 Q2296770 thomas 1st chudleigh thomas chudleigh thomas chudleigh 1630-08-01 devon tq13 8hu None politician 4 Q2296770-5 Q2296770 thomas clifford, 1st baron chudleigh thomas chudleigh thomas chudleigh 1630-08-01 devon tq13 8df None politician
from splink import DuckDBAPI\nfrom splink.exploratory import profile_columns\n\ndb_api = DuckDBAPI()\nprofile_columns(df, db_api, column_expressions=[\"first_name\", \"substr(surname,1,2)\"])\n
from splink import DuckDBAPI, block_on\nfrom splink.blocking_analysis import (\n    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,\n)\n\nblocking_rules = [\n    block_on(\"substr(first_name,1,3)\", \"substr(surname,1,4)\"),\n    block_on(\"surname\", \"dob\"),\n    block_on(\"first_name\", \"dob\"),\n    block_on(\"postcode_fake\", \"first_name\"),\n    block_on(\"postcode_fake\", \"surname\"),\n    block_on(\"dob\", \"birth_place\"),\n    block_on(\"substr(postcode_fake,1,3)\", \"dob\"),\n    block_on(\"substr(postcode_fake,1,3)\", \"first_name\"),\n    block_on(\"substr(postcode_fake,1,3)\", \"surname\"),\n    block_on(\"substr(first_name,1,2)\", \"substr(surname,1,2)\", \"substr(dob,1,4)\"),\n]\n\ndb_api = DuckDBAPI()\n\ncumulative_comparisons_to_be_scored_from_blocking_rules_chart(\n    table_or_tables=df,\n    blocking_rules=blocking_rules,\n    db_api=db_api,\n    link_type=\"dedupe_only\",\n)\n
import splink.comparison_library as cl\n\nfrom splink import Linker, SettingsCreator\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    blocking_rules_to_generate_predictions=blocking_rules,\n    comparisons=[\n        cl.ForenameSurnameComparison(\n            \"first_name\",\n            \"surname\",\n            forename_surname_concat_col_name=\"first_name_surname_concat\",\n        ),\n        cl.DateOfBirthComparison(\n            \"dob\", input_is_string=True\n        ),\n        cl.PostcodeComparison(\"postcode_fake\"),\n        cl.ExactMatch(\"birth_place\").configure(term_frequency_adjustments=True),\n        cl.ExactMatch(\"occupation\").configure(term_frequency_adjustments=True),\n    ],\n    retain_intermediate_calculation_columns=True,\n)\n# Needed to apply term frequencies to first+surname comparison\ndf[\"first_name_surname_concat\"] = df[\"first_name\"] + \" \" + df[\"surname\"]\nlinker = Linker(df, settings, db_api=db_api)\n
linker.training.estimate_probability_two_random_records_match(\n    [\n        \"l.first_name = r.first_name and l.surname = r.surname and l.dob = r.dob\",\n        \"substr(l.first_name,1,2) = substr(r.first_name,1,2) and l.surname = r.surname and substr(l.postcode_fake,1,2) = substr(r.postcode_fake,1,2)\",\n        \"l.dob = r.dob and l.postcode_fake = r.postcode_fake\",\n    ],\n    recall=0.6,\n)\n
Probability two random records match is estimated to be  0.000136.\nThis means that amongst all possible pairwise record comparisons, one in 7,362.31 are expected to match.  With 1,279,041,753 total possible comparisons, we expect a total of around 173,728.33 matching pairs\n
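The estimate above is simple arithmetic: the count of matches found by the deterministic rules is scaled up by the assumed recall, then divided by the total number of pairwise comparisons. A quick check reproducing the final figures from the log (numbers taken directly from the output above):

```python
# Figures reported in the log above
total_comparisons = 1_279_041_753
one_in = 7_362.31  # i.e. 1 / probability_two_random_records_match

# Expected number of matching pairs = total comparisons x match probability
expected_matches = total_comparisons / one_in
print(round(expected_matches, 2))  # ~173,728.3, as reported
```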
linker.training.estimate_u_using_random_sampling(max_pairs=5e6)\n
----- Estimating u probabilities using random sampling -----\n\n\n\nFloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))\n\n\nu probability not trained for first_name_surname - Match on reversed cols: first_name and surname (comparison vector value: 5). This usually means the comparison level was never observed in the training data.\n\n\n\nEstimated u probabilities using random sampling\n\n\n\nYour model is not yet fully trained. Missing estimates for:\n    - first_name_surname (some u values are not trained, no m values are trained).\n    - dob (no m values are trained).\n    - postcode_fake (no m values are trained).\n    - birth_place (no m values are trained).\n    - occupation (no m values are trained).\n
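The u probabilities being estimated here are the chances of observing each comparison level among *non-matching* pairs; random sampling works because the overwhelming majority of randomly chosen pairs are non-matches. A minimal, self-contained sketch of the idea on toy data (not Splink's implementation):

```python
import random

random.seed(0)

# Toy population: with 2000 records, almost all random pairs are non-matches.
names = ["john", "joan", "jane", "mark", "mary", "paul", "pete", "anna"]
records = [
    {"first_name": random.choice(names), "dob_year": random.randint(1900, 1999)}
    for _ in range(2000)
]

# Estimate u for the level "exact match on first_name" by sampling random
# pairs and measuring how often the level fires.
n_pairs = 50_000
hits = 0
for _ in range(n_pairs):
    a, b = random.sample(records, 2)
    if a["first_name"] == b["first_name"]:
        hits += 1

u_exact_first_name = hits / n_pairs
# With 8 equally likely names, u should come out near 1/8.
print(u_exact_first_name)
```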
training_blocking_rule = block_on(\"first_name\", \"surname\")\ntraining_session_names = (\n    linker.training.estimate_parameters_using_expectation_maximisation(\n        training_blocking_rule, estimate_without_term_frequencies=True\n    )\n)\n
----- Starting EM training session -----\n\n\n\nEstimating the m probabilities of the model by blocking on:\n(l.\"first_name\" = r.\"first_name\") AND (l.\"surname\" = r.\"surname\")\n\nParameter estimates will be made for the following comparison(s):\n    - dob\n    - postcode_fake\n    - birth_place\n    - occupation\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - first_name_surname\n\n\n\n\n\nIteration 1: Largest change in params was 0.247 in probability_two_random_records_match\n\n\nIteration 2: Largest change in params was -0.0938 in the m_probability of postcode_fake, level `Exact match on full postcode`\n\n\nIteration 3: Largest change in params was -0.0236 in the m_probability of birth_place, level `Exact match on birth_place`\n\n\nIteration 4: Largest change in params was 0.00967 in the m_probability of birth_place, level `All other comparisons`\n\n\nIteration 5: Largest change in params was -0.00467 in the m_probability of birth_place, level `Exact match on birth_place`\n\n\nIteration 6: Largest change in params was 0.00267 in the m_probability of birth_place, level `All other comparisons`\n\n\nIteration 7: Largest change in params was 0.00186 in the m_probability of dob, level `Abs date difference <= 10 year`\n\n\nIteration 8: Largest change in params was 0.00127 in the m_probability of dob, level `Abs date difference <= 10 year`\n\n\nIteration 9: Largest change in params was 0.000847 in the m_probability of dob, level `Abs date difference <= 10 year`\n\n\nIteration 10: Largest change in params was 0.000563 in the m_probability of dob, level `Abs date difference <= 10 year`\n\n\nIteration 11: Largest change in params was 0.000373 in the m_probability of dob, level `Abs date difference <= 10 year`\n\n\nIteration 12: Largest change in params was 0.000247 in the m_probability of dob, level `Abs date difference <= 10 year`\n\n\nIteration 13: Largest change in params was 0.000163 in the 
m_probability of dob, level `Abs date difference <= 10 year`\n\n\nIteration 14: Largest change in params was 0.000108 in the m_probability of dob, level `Abs date difference <= 10 year`\n\n\nIteration 15: Largest change in params was 7.14e-05 in the m_probability of dob, level `Abs date difference <= 10 year`\n\n\n\nEM converged after 15 iterations\n\n\n\nYour model is not yet fully trained. Missing estimates for:\n    - first_name_surname (some u values are not trained, no m values are trained).\n
training_blocking_rule = block_on(\"dob\")\ntraining_session_dob = (\n    linker.training.estimate_parameters_using_expectation_maximisation(\n        training_blocking_rule, estimate_without_term_frequencies=True\n    )\n)\n
----- Starting EM training session -----\n\n\n\nEstimating the m probabilities of the model by blocking on:\nl.\"dob\" = r.\"dob\"\n\nParameter estimates will be made for the following comparison(s):\n    - first_name_surname\n    - postcode_fake\n    - birth_place\n    - occupation\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - dob\n\n\n\n\n\nIteration 1: Largest change in params was -0.472 in the m_probability of first_name_surname, level `Exact match on first_name_surname_concat`\n\n\nIteration 2: Largest change in params was 0.0524 in the m_probability of first_name_surname, level `All other comparisons`\n\n\nIteration 3: Largest change in params was 0.0175 in the m_probability of first_name_surname, level `All other comparisons`\n\n\nIteration 4: Largest change in params was 0.00537 in the m_probability of first_name_surname, level `All other comparisons`\n\n\nIteration 5: Largest change in params was 0.00165 in the m_probability of first_name_surname, level `All other comparisons`\n\n\nIteration 6: Largest change in params was 0.000518 in the m_probability of first_name_surname, level `All other comparisons`\n\n\nIteration 7: Largest change in params was 0.000164 in the m_probability of first_name_surname, level `All other comparisons`\n\n\nIteration 8: Largest change in params was 5.2e-05 in the m_probability of first_name_surname, level `All other comparisons`\n\n\n\nEM converged after 8 iterations\n\n\n\nYour model is not yet fully trained. Missing estimates for:\n    - first_name_surname (some u values are not trained).\n

The final match weights can be viewed in the match weights chart:

linker.visualisations.match_weights_chart()\n
linker.evaluation.unlinkables_chart()\n
df_predict = linker.inference.predict()\ndf_e = df_predict.as_pandas_dataframe(limit=5)\ndf_e\n
Blocking time: 0.65 seconds\n\n\nPredict time: 1.71 seconds\n\n\n\n -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'first_name_surname':\n    u values not fully trained\n
match_weight match_probability unique_id_l unique_id_r first_name_l first_name_r surname_l surname_r first_name_surname_concat_l first_name_surname_concat_r ... bf_birth_place bf_tf_adj_birth_place occupation_l occupation_r gamma_occupation tf_occupation_l tf_occupation_r bf_occupation bf_tf_adj_occupation match_key 0 5.903133 0.983565 Q6105786-11 Q6105786-6 joan j. garson garson joan garson j. garson ... 0.164159 1.000000 anthropologist anatomist 0 0.002056 0.000593 0.107248 1.0 4 1 2.354819 0.836476 Q6105786-11 Q6105786-8 joan j. garson garson joan garson j. garson ... 0.164159 1.000000 anthropologist anatomist 0 0.002056 0.000593 0.107248 1.0 4 2 2.354819 0.836476 Q6105786-11 Q6105786-9 joan ian garson garson joan garson ian garson ... 0.164159 1.000000 anthropologist anatomist 0 0.002056 0.000593 0.107248 1.0 4 3 3.319202 0.908935 Q6105786-11 Q6105786-13 joan j. garson garson joan garson j. garson ... 0.164159 1.000000 anthropologist None -1 0.002056 NaN 1.000000 1.0 4 4 16.881661 0.999992 Q6241382-1 Q6241382-11 john joan jackson jackson john jackson joan jackson ... 147.489511 17.689372 author None -1 0.003401 NaN 1.000000 1.0 4

5 rows \u00d7 42 columns
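The `match_weight` and `match_probability` columns are two views of the same score: the weight is the log2 of the Bayes factor, so probability = 2**weight / (1 + 2**weight). A quick check against the first row of the predictions above:

```python
def weight_to_probability(match_weight: float) -> float:
    # match_weight is log2 of the Bayes factor (the odds of a match)
    bayes_factor = 2 ** match_weight
    return bayes_factor / (1 + bayes_factor)

# First row above: match_weight 5.903133 -> match_probability ~0.983565
print(weight_to_probability(5.903133))
```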

You can also view rows in this dataset as a waterfall chart as follows:

records_to_plot = df_e.to_dict(orient=\"records\")\nlinker.visualisations.waterfall_chart(records_to_plot, filter_nulls=False)\n
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(\n    df_predict, threshold_match_probability=0.95\n)\n
Completed iteration 1, root rows count 858\n\n\nCompleted iteration 2, root rows count 202\n\n\nCompleted iteration 3, root rows count 68\n\n\nCompleted iteration 4, root rows count 9\n\n\nCompleted iteration 5, root rows count 1\n\n\nCompleted iteration 6, root rows count 0\n
from IPython.display import IFrame\n\nlinker.visualisations.cluster_studio_dashboard(\n    df_predict,\n    clusters,\n    \"dashboards/50k_cluster.html\",\n    sampling_method=\"by_cluster_size\",\n    overwrite=True,\n)\n\n\nIFrame(src=\"./dashboards/50k_cluster.html\", width=\"100%\", height=1200)\n

linker.evaluation.accuracy_analysis_from_labels_column(\n    \"cluster\", output_type=\"accuracy\", match_weight_round_to_nearest=0.02\n)\n
Blocking time: 1.10 seconds\n\n\nPredict time: 1.54 seconds\n\n\n\n -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'first_name_surname':\n    u values not fully trained\n
records = linker.evaluation.prediction_errors_from_labels_column(\n    \"cluster\",\n    threshold_match_probability=0.999,\n    include_false_negatives=False,\n    include_false_positives=True,\n).as_record_dict()\nlinker.visualisations.waterfall_chart(records)\n
Blocking time: 0.86 seconds\n\n\nPredict time: 0.30 seconds\n\n\n\n -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'first_name_surname':\n    u values not fully trained\n
# Some of the false negatives will be because they weren't detected by the blocking rules\nrecords = linker.evaluation.prediction_errors_from_labels_column(\n    \"cluster\",\n    threshold_match_probability=0.5,\n    include_false_negatives=True,\n    include_false_positives=False,\n).as_record_dict(limit=50)\n\nlinker.visualisations.waterfall_chart(records)\n
Blocking time: 0.92 seconds\n\n\nPredict time: 0.30 seconds\n\n\n\n -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'first_name_surname':\n    u values not fully trained\n
"},{"location":"demos/examples/duckdb/deterministic_dedupe.html","title":"Deterministic dedupe","text":""},{"location":"demos/examples/duckdb/deterministic_dedupe.html#linking-a-dataset-of-real-historical-persons-with-deterrministic-rules","title":"Linking a dataset of real historical persons with Deterrministic Rules","text":"

While Splink is primarily a tool for probabilistic record linkage, it includes functionality to perform deterministic (i.e. rules-based) linkage.

Significant work has gone into optimising the performance of rules based matching, so Splink is likely to be significantly faster than writing the basic SQL by hand.
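For a sense of what "writing the basic SQL by hand" involves: each deterministic rule corresponds to a self-join, and pairs found by multiple rules must then be deduplicated. A rough sketch of the hand-written equivalent of a single rule such as `block_on("first_name", "surname", "dob")`, illustrated here on toy data with sqlite3 rather than Splink's generated SQL:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE df (unique_id TEXT, first_name TEXT, surname TEXT, dob TEXT)"
)
con.executemany(
    "INSERT INTO df VALUES (?, ?, ?, ?)",
    [
        ("Q1-1", "thomas", "chudleigh", "1630-08-01"),
        ("Q1-2", "thomas", "chudleigh", "1630-08-01"),
        ("Q2-1", "harry", "brooker", "1848-01-01"),
    ],
)

# Hand-written equivalent of one blocking rule: a self-join, keeping each
# pair only once via l.unique_id < r.unique_id.
pairs = con.execute(
    """
    SELECT l.unique_id, r.unique_id
    FROM df AS l
    JOIN df AS r
      ON l.first_name = r.first_name
     AND l.surname = r.surname
     AND l.dob = r.dob
    WHERE l.unique_id < r.unique_id
    """
).fetchall()
print(pairs)  # [('Q1-1', 'Q1-2')]
```

With several rules, you would need one such join per rule plus a de-duplication step across rules, which is the boilerplate Splink generates and optimises for you.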

In this example, we deduplicate a 50k row dataset based on historical persons scraped from wikidata. Duplicate records have been introduced, with a variety of errors. The probabilistic dedupe of the same dataset can be found at Deduplicate 50k rows historical persons.

# Uncomment and run this cell if you're running in Google Colab.\n# !pip install splink\n
import pandas as pd\n\nfrom splink import splink_datasets\n\npd.options.display.max_rows = 1000\ndf = splink_datasets.historical_50k\ndf.head()\n
unique_id cluster full_name first_and_surname first_name surname dob birth_place postcode_fake gender occupation 0 Q2296770-1 Q2296770 thomas clifford, 1st baron clifford of chudleigh thomas chudleigh thomas chudleigh 1630-08-01 devon tq13 8df male politician 1 Q2296770-2 Q2296770 thomas of chudleigh thomas chudleigh thomas chudleigh 1630-08-01 devon tq13 8df male politician 2 Q2296770-3 Q2296770 tom 1st baron clifford of chudleigh tom chudleigh tom chudleigh 1630-08-01 devon tq13 8df male politician 3 Q2296770-4 Q2296770 thomas 1st chudleigh thomas chudleigh thomas chudleigh 1630-08-01 devon tq13 8hu None politician 4 Q2296770-5 Q2296770 thomas clifford, 1st baron chudleigh thomas chudleigh thomas chudleigh 1630-08-01 devon tq13 8df None politician

When defining the settings object, specify your deterministic rules in the blocking_rules_to_generate_predictions key.

For a deterministic linkage, the linkage methodology is based solely on these rules, so there is no need to define comparisons, or any of the other parameters required for model training in a probabilistic model.

Prior to running the linkage, it's usually a good idea to check how many record comparisons will be generated by your deterministic rules:

from splink import DuckDBAPI, block_on\nfrom splink.blocking_analysis import (\n    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,\n)\n\ndb_api = DuckDBAPI()\ncumulative_comparisons_to_be_scored_from_blocking_rules_chart(\n    table_or_tables=df,\n    blocking_rules=[\n        block_on(\"first_name\", \"surname\", \"dob\"),\n        block_on(\"surname\", \"dob\", \"postcode_fake\"),\n        block_on(\"first_name\", \"dob\", \"occupation\"),\n    ],\n    db_api=db_api,\n    link_type=\"dedupe_only\",\n)\n
from splink import Linker, SettingsCreator\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\", \"surname\", \"dob\"),\n        block_on(\"surname\", \"dob\", \"postcode_fake\"),\n        block_on(\"first_name\", \"dob\", \"occupation\"),\n    ],\n    retain_intermediate_calculation_columns=True,\n)\n\nlinker = Linker(df, settings, db_api=db_api)\n

The results of the linkage can be viewed with the deterministic_link function.

df_predict = linker.inference.deterministic_link()\ndf_predict.as_pandas_dataframe().head()\n
unique_id_l unique_id_r occupation_l occupation_r first_name_l first_name_r dob_l dob_r surname_l surname_r postcode_fake_l postcode_fake_r match_key 0 Q55455287-12 Q55455287-2 None writer jaido jaido 1836-01-01 1836-01-01 morata morata ta4 2ug ta4 2uu 0 1 Q55455287-12 Q55455287-3 None writer jaido jaido 1836-01-01 1836-01-01 morata morata ta4 2ug ta4 2uu 0 2 Q55455287-12 Q55455287-4 None writer jaido jaido 1836-01-01 1836-01-01 morata morata ta4 2ug ta4 2sz 0 3 Q55455287-12 Q55455287-5 None None jaido jaido 1836-01-01 1836-01-01 morata morata ta4 2ug ta4 2ug 0 4 Q55455287-12 Q55455287-6 None writer jaido jaido 1836-01-01 1836-01-01 morata morata ta4 2ug None 0

Which can be used to generate clusters.

Note, for deterministic linkage, each comparison has been assigned a match probability of 1, so to generate clusters, set threshold_match_probability=1 in the cluster_pairwise_predictions_at_threshold function.
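Clustering groups records into connected components: any two records joined by a chain of accepted pairwise links end up in the same cluster. A minimal sketch of the idea using union-find on toy pairs (Splink uses an iterative SQL algorithm, but the result is the same):

```python
def cluster_pairs(edges):
    # Union-find: each record starts in its own cluster; each accepted
    # pairwise link merges the two clusters it connects.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(c) for c in clusters.values())

# A-B and B-C are linked pairs, so {A, B, C} form one cluster even though
# A and C were never directly compared.
edges = [("A", "B"), ("B", "C"), ("D", "E")]
print(cluster_pairs(edges))  # [['A', 'B', 'C'], ['D', 'E']]
```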

clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(\n    df_predict, threshold_match_probability=1\n)\n
Completed iteration 1, root rows count 94\n\n\nCompleted iteration 2, root rows count 10\n\n\nCompleted iteration 3, root rows count 0\n
clusters.as_pandas_dataframe(limit=5)\n
cluster_id unique_id cluster full_name first_and_surname first_name surname dob birth_place postcode_fake gender occupation __splink_salt 0 Q16025107-1 Q5497940-9 Q5497940 frederick hall frederick hall frederick hall 1855-01-01 bristol, city of bs11 9pn None None 0.002739 1 Q1149445-1 Q1149445-9 Q1149445 earl egerton earl egerton earl egerton 1800-01-01 westminster w1d 2hf None None 0.991459 2 Q20664532-1 Q21466387-2 Q21466387 harry brooker harry brooker harry brooker 1848-01-01 plymouth pl4 9hx male painter 0.506127 3 Q1124636-1 Q1124636-12 Q1124636 tom stapleton tom stapleton tom stapleton 1535-01-01 None bn6 9na male theologian 0.612694 4 Q18508292-1 Q21466711-4 Q21466711 harry s0ence harry s0ence harry s0ence 1860-01-01 london se1 7pb male painter 0.488917

These results can then be passed into the Cluster Studio Dashboard.

linker.visualisations.cluster_studio_dashboard(\n    df_predict,\n    clusters,\n    \"dashboards/50k_deterministic_cluster.html\",\n    sampling_method=\"by_cluster_size\",\n    overwrite=True,\n)\n\nfrom IPython.display import IFrame\n\nIFrame(src=\"./dashboards/50k_deterministic_cluster.html\", width=\"100%\", height=1200)\n

"},{"location":"demos/examples/duckdb/febrl3.html","title":"Febrl3 Dedupe","text":""},{"location":"demos/examples/duckdb/febrl3.html#deduplicating-the-febrl3-dataset","title":"Deduplicating the febrl3 dataset","text":"

See A.2 here and here for the source of this data.

from splink.datasets import splink_datasets\n\ndf = splink_datasets.febrl3\n
df = df.rename(columns=lambda x: x.strip())\n\ndf[\"cluster\"] = df[\"rec_id\"].apply(lambda x: \"-\".join(x.split(\"-\")[:2]))\n\ndf[\"date_of_birth\"] = df[\"date_of_birth\"].astype(str).str.strip()\ndf[\"soc_sec_id\"] = df[\"soc_sec_id\"].astype(str).str.strip()\n\ndf.head(2)\n
rec_id given_name surname street_number address_1 address_2 suburb postcode state date_of_birth soc_sec_id cluster 0 rec-1496-org mitchell green 7 wallaby place delmar cleveland 2119 sa 19560409 1804974 rec-1496 1 rec-552-dup-3 harley mccarthy 177 pridhamstreet milton marsden 3165 nsw 19080419 6089216 rec-552
df[\"date_of_birth\"] = df[\"date_of_birth\"].astype(str).str.strip()\ndf[\"soc_sec_id\"] = df[\"soc_sec_id\"].astype(str).str.strip()\n
df[\"date_of_birth\"] = df[\"date_of_birth\"].astype(str).str.strip()\ndf[\"soc_sec_id\"] = df[\"soc_sec_id\"].astype(str).str.strip()\n
from splink import DuckDBAPI, Linker, SettingsCreator\n\n# TODO:  Allow missingness to be analysed without a linker\nsettings = SettingsCreator(\n    unique_id_column_name=\"rec_id\",\n    link_type=\"dedupe_only\",\n)\n\nlinker = Linker(df, settings, db_api=DuckDBAPI())\n

It's usually a good idea to perform exploratory analysis on your data so you understand what's in each column and how often it's missing:

from splink.exploratory import completeness_chart\n\ncompleteness_chart(df, db_api=DuckDBAPI())\n
from splink.exploratory import profile_columns\n\nprofile_columns(df, db_api=DuckDBAPI(), column_expressions=[\"given_name\", \"surname\"])\n
from splink import DuckDBAPI, block_on\nfrom splink.blocking_analysis import (\n    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,\n)\n\nblocking_rules = [\n    block_on(\"soc_sec_id\"),\n    block_on(\"given_name\"),\n    block_on(\"surname\"),\n    block_on(\"date_of_birth\"),\n    block_on(\"postcode\"),\n]\n\ndb_api = DuckDBAPI()\ncumulative_comparisons_to_be_scored_from_blocking_rules_chart(\n    table_or_tables=df,\n    blocking_rules=blocking_rules,\n    db_api=db_api,\n    link_type=\"dedupe_only\",\n    unique_id_column_name=\"rec_id\",\n)\n
import splink.comparison_library as cl\n\nfrom splink import Linker\n\nsettings = SettingsCreator(\n    unique_id_column_name=\"rec_id\",\n    link_type=\"dedupe_only\",\n    blocking_rules_to_generate_predictions=blocking_rules,\n    comparisons=[\n        cl.NameComparison(\"given_name\"),\n        cl.NameComparison(\"surname\"),\n        cl.DateOfBirthComparison(\n            \"date_of_birth\",\n            input_is_string=True,\n            datetime_format=\"%Y%m%d\",\n        ),\n        cl.DamerauLevenshteinAtThresholds(\"soc_sec_id\", [2]),\n        cl.ExactMatch(\"street_number\").configure(term_frequency_adjustments=True),\n        cl.ExactMatch(\"postcode\").configure(term_frequency_adjustments=True),\n    ],\n    retain_intermediate_calculation_columns=True,\n)\n\nlinker = Linker(df, settings, db_api=DuckDBAPI())\n
from splink import block_on\n\ndeterministic_rules = [\n    block_on(\"soc_sec_id\"),\n    block_on(\"given_name\", \"surname\", \"date_of_birth\"),\n    \"l.given_name = r.surname and l.surname = r.given_name and l.date_of_birth = r.date_of_birth\",\n]\n\nlinker.training.estimate_probability_two_random_records_match(\n    deterministic_rules, recall=0.9\n)\n
Probability two random records match is estimated to be  0.000528.\nThis means that amongst all possible pairwise record comparisons, one in 1,893.56 are expected to match.  With 12,497,500 total possible comparisons, we expect a total of around 6,600.00 matching pairs\n
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.\n----- Estimating u probabilities using random sampling -----\n\n\n\nFloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))\n\n\nu probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.\nu probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.\nu probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.\n\nEstimated u probabilities using random sampling\n\nYour model is not yet fully trained. Missing estimates for:\n    - given_name (no m values are trained).\n    - surname (no m values are trained).\n    - date_of_birth (some u values are not trained, no m values are trained).\n    - soc_sec_id (no m values are trained).\n    - street_number (no m values are trained).\n    - postcode (no m values are trained).\n
em_blocking_rule_1 = block_on(\"date_of_birth\")\nsession_dob = linker.training.estimate_parameters_using_expectation_maximisation(\n    em_blocking_rule_1\n)\n
----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\nl.\"date_of_birth\" = r.\"date_of_birth\"\n\nParameter estimates will be made for the following comparison(s):\n    - given_name\n    - surname\n    - soc_sec_id\n    - street_number\n    - postcode\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - date_of_birth\n\nIteration 1: Largest change in params was -0.376 in the m_probability of surname, level `Exact match on surname`\nIteration 2: Largest change in params was 0.0156 in the m_probability of surname, level `All other comparisons`\nIteration 3: Largest change in params was 0.000699 in the m_probability of postcode, level `All other comparisons`\nIteration 4: Largest change in params was -3.77e-05 in the m_probability of postcode, level `Exact match on postcode`\n\nEM converged after 4 iterations\n\nYour model is not yet fully trained. Missing estimates for:\n    - date_of_birth (some u values are not trained, no m values are trained).\n
em_blocking_rule_2 = block_on(\"postcode\")\nsession_postcode = linker.training.estimate_parameters_using_expectation_maximisation(\n    em_blocking_rule_2\n)\n
----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\nl.\"postcode\" = r.\"postcode\"\n\nParameter estimates will be made for the following comparison(s):\n    - given_name\n    - surname\n    - date_of_birth\n    - soc_sec_id\n    - street_number\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - postcode\n\nWARNING:\nLevel Abs difference of 'transformed date_of_birth <= 1 month' on comparison date_of_birth not observed in dataset, unable to train m value\n\nWARNING:\nLevel Abs difference of 'transformed date_of_birth <= 1 year' on comparison date_of_birth not observed in dataset, unable to train m value\n\nWARNING:\nLevel Abs difference of 'transformed date_of_birth <= 10 year' on comparison date_of_birth not observed in dataset, unable to train m value\n\nIteration 1: Largest change in params was 0.0681 in probability_two_random_records_match\nIteration 2: Largest change in params was -0.00185 in the m_probability of date_of_birth, level `Exact match on date_of_birth`\nIteration 3: Largest change in params was -5.7e-05 in the m_probability of date_of_birth, level `Exact match on date_of_birth`\n\nEM converged after 3 iterations\nm probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.\nm probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.\nm probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.\n\nYour model is not yet fully trained. 
Missing estimates for:\n    - date_of_birth (some u values are not trained, some m values are not trained).\n
linker.visualisations.match_weights_chart()\n
results = linker.inference.predict(threshold_match_probability=0.2)\n
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))\n\n\n\n -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'date_of_birth':\n    m values not fully trained\nComparison: 'date_of_birth':\n    u values not fully trained\n
linker.evaluation.accuracy_analysis_from_labels_column(\n    \"cluster\", match_weight_round_to_nearest=0.1, output_type=\"accuracy\"\n)\n
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))\n\n\n\n -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'date_of_birth':\n    m values not fully trained\nComparison: 'date_of_birth':\n    u values not fully trained\n
pred_errors_df = linker.evaluation.prediction_errors_from_labels_column(\n    \"cluster\"\n).as_pandas_dataframe()\nlen(pred_errors_df)\npred_errors_df.head()\n
 -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'date_of_birth':\n    m values not fully trained\nComparison: 'date_of_birth':\n    u values not fully trained\n
clerical_match_score found_by_blocking_rules match_weight match_probability rec_id_l rec_id_r given_name_l given_name_r gamma_given_name tf_given_name_l ... postcode_l postcode_r gamma_postcode tf_postcode_l tf_postcode_r bf_postcode bf_tf_adj_postcode cluster_l cluster_r match_key 0 1.0 False -27.805731 4.262268e-09 rec-993-dup-1 rec-993-dup-3 westbrook jake 0 0.0004 ... 2704 2074 0 0.0002 0.0014 0.230173 1.0 rec-993 rec-993 5 1 1.0 False -27.805731 4.262268e-09 rec-829-dup-0 rec-829-dup-2 wilde kyra 0 0.0002 ... 3859 3595 0 0.0004 0.0006 0.230173 1.0 rec-829 rec-829 5 2 1.0 False -19.717877 1.159651e-06 rec-829-dup-0 rec-829-dup-1 wilde kyra 0 0.0002 ... 3859 3889 0 0.0004 0.0002 0.230173 1.0 rec-829 rec-829 5 3 1.0 True -15.453190 2.229034e-05 rec-721-dup-0 rec-721-dup-1 mikhaili elly 0 0.0008 ... 4806 4860 0 0.0008 0.0014 0.230173 1.0 rec-721 rec-721 2 4 1.0 True -12.931781 1.279648e-04 rec-401-dup-1 rec-401-dup-3 whitbe alexa-ose 0 0.0002 ... 3040 3041 0 0.0020 0.0004 0.230173 1.0 rec-401 rec-401 0

5 rows \u00d7 45 columns

The following chart seems to suggest that, where the model is making errors, it is because the data is corrupted beyond recognition, and no reasonable linkage model could find these matches.

records = linker.evaluation.prediction_errors_from_labels_column(\n    \"cluster\"\n).as_record_dict(limit=10)\nlinker.visualisations.waterfall_chart(records)\n
 -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'date_of_birth':\n    m values not fully trained\nComparison: 'date_of_birth':\n    u values not fully trained\n
"},{"location":"demos/examples/duckdb/febrl4.html","title":"Febrl4 link-only","text":""},{"location":"demos/examples/duckdb/febrl4.html#linking-the-febrl4-datasets","title":"Linking the febrl4 datasets","text":"

See A.2 here and here for the source of this data.

It consists of two datasets, A and B, of 5000 records each, with each record in dataset A having a corresponding record in dataset B. The aim will be to capture as many of those 5000 true links as possible, with minimal false linkages.

It is worth noting that we should not necessarily expect to capture all links. For some pairs, although we know they correspond to the same person, the data is so mismatched that we would not reasonably expect a model to link them. Indeed, if a model did link them, that might indicate we had overengineered things using our knowledge of the true links, which would not be a helpful reference when linking unlabelled data, as will usually be the case.

"},{"location":"demos/examples/duckdb/febrl4.html#exploring-data-and-defining-model","title":"Exploring data and defining model","text":"

Firstly let's read in the data and have a little look at it

from splink import splink_datasets\n\ndf_a = splink_datasets.febrl4a\ndf_b = splink_datasets.febrl4b\n\n\ndef prepare_data(data):\n    data = data.rename(columns=lambda x: x.strip())\n    data[\"cluster\"] = data[\"rec_id\"].apply(lambda x: \"-\".join(x.split(\"-\")[:2]))\n    data[\"date_of_birth\"] = data[\"date_of_birth\"].astype(str).str.strip()\n    data[\"soc_sec_id\"] = data[\"soc_sec_id\"].astype(str).str.strip()\n    data[\"postcode\"] = data[\"postcode\"].astype(str).str.strip()\n    return data\n\n\ndfs = [prepare_data(dataset) for dataset in [df_a, df_b]]\n\ndisplay(dfs[0].head(2))\ndisplay(dfs[1].head(2))\n
rec_id given_name surname street_number address_1 address_2 suburb postcode state date_of_birth soc_sec_id cluster 0 rec-1070-org michaela neumann 8 stanley street miami winston hills 4223 nsw 19151111 5304218 rec-1070 1 rec-1016-org courtney painter 12 pinkerton circuit bega flats richlands 4560 vic 19161214 4066625 rec-1016 rec_id given_name surname street_number address_1 address_2 suburb postcode state date_of_birth soc_sec_id cluster 0 rec-561-dup-0 elton 3 light setreet pinehill windermere 3212 vic 19651013 1551941 rec-561 1 rec-2642-dup-0 mitchell maxon 47 edkins street lochaoair north ryde 3355 nsw 19390212 8859999 rec-2642

Next, to better understand which variables will prove useful in linking, we have a look at how populated each column is, as well as the distribution of unique values within each.

from splink import DuckDBAPI, Linker, SettingsCreator\n\nbasic_settings = SettingsCreator(\n    unique_id_column_name=\"rec_id\",\n    link_type=\"link_only\",\n    # NB as we are linking one-one, we know the probability that a random pair will be a match\n    # hence we could set:\n    # \"probability_two_random_records_match\": 1/5000,\n    # however we will not specify this here, as we will use this as a check that\n    # our estimation procedure returns something sensible\n)\n\nlinker = Linker(dfs, basic_settings, db_api=DuckDBAPI())\n

It's usually a good idea to perform exploratory analysis on your data so that you understand what's in each column and how often it's missing.

from splink.exploratory import completeness_chart\n\ncompleteness_chart(dfs, db_api=DuckDBAPI())\n
from splink.exploratory import profile_columns\n\nprofile_columns(dfs, db_api=DuckDBAPI(), column_expressions=[\"given_name\", \"surname\"])\n

Next let's come up with some candidate blocking rules, which define which record comparisons are generated, and have a look at how many comparisons each will generate.

For the blocking rules that we use in prediction, our aim is for the union of all rules to cover all true matches, whilst avoiding generating so many comparisons that it becomes computationally intractable. In other words, each true match should satisfy at least one of the following conditions.
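The comparison count for a single equi-join blocking rule can be sketched in plain Python: records from the two datasets are paired whenever they share a blocking-key value, so each shared value contributes n_a × n_b pairs. This is a minimal illustration with made-up records, not Splink's implementation.

```python
from collections import Counter

def comparisons_for_rule(records_a, records_b, key):
    """Count the pairs a single equi-join blocking rule generates when
    linking two datasets: records are paired only when they share the
    same value of the blocking key."""
    counts_a = Counter(r[key] for r in records_a)
    counts_b = Counter(r[key] for r in records_b)
    # each shared key value contributes n_a * n_b candidate pairs
    return sum(n_a * counts_b[k] for k, n_a in counts_a.items() if k in counts_b)

records_a = [{"postcode": "4223"}, {"postcode": "4560"}, {"postcode": "4223"}]
records_b = [{"postcode": "4223"}, {"postcode": "3212"}]
print(comparisons_for_rule(records_a, records_b, "postcode"))  # 2 * 1 = 2
```

A broader key (shared by many records) inflates these products quickly, which is why the postcode rule below dominates the comparison counts.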

from splink import DuckDBAPI, block_on\nfrom splink.blocking_analysis import (\n    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,\n)\n\nblocking_rules = [\n    block_on(\"given_name\", \"surname\"),\n    # A blocking rule can also be an arbitrary SQL expression\n    \"l.given_name = r.surname and l.surname = r.given_name\",\n    block_on(\"date_of_birth\"),\n    block_on(\"soc_sec_id\"),\n    block_on(\"state\", \"address_1\"),\n    block_on(\"street_number\", \"address_1\"),\n    block_on(\"postcode\"),\n]\n\n\ndb_api = DuckDBAPI()\ncumulative_comparisons_to_be_scored_from_blocking_rules_chart(\n    table_or_tables=dfs,\n    blocking_rules=blocking_rules,\n    db_api=db_api,\n    link_type=\"link_only\",\n    unique_id_column_name=\"rec_id\",\n    source_dataset_column_name=\"source_dataset\",\n)\n

The broadest rule, having a matching postcode, unsurprisingly gives the largest number of comparisons. For this small dataset we still have a very manageable number, but if it were larger we might have needed to include a further AND condition to bring the number of comparisons down.

Now we get the full settings by including the blocking rules, as well as deciding the actual comparisons we will be including in our model.

We will define two models, each with a separate linker with different settings, so that we can compare performance. One will be a very basic model, whilst the other will include a lot more detail.

import splink.comparison_level_library as cll\nimport splink.comparison_library as cl\n\n\n# the simple model only considers a few columns, and only two comparison levels for each\nsimple_model_settings = SettingsCreator(\n    unique_id_column_name=\"rec_id\",\n    link_type=\"link_only\",\n    blocking_rules_to_generate_predictions=blocking_rules,\n    comparisons=[\n        cl.ExactMatch(\"given_name\").configure(term_frequency_adjustments=True),\n        cl.ExactMatch(\"surname\").configure(term_frequency_adjustments=True),\n        cl.ExactMatch(\"street_number\").configure(term_frequency_adjustments=True),\n    ],\n    retain_intermediate_calculation_columns=True,\n)\n\n# the detailed model considers more columns, using the information we saw in the exploratory phase\n# we also include further comparison levels to account for typos and other differences\ndetailed_model_settings = SettingsCreator(\n    unique_id_column_name=\"rec_id\",\n    link_type=\"link_only\",\n    blocking_rules_to_generate_predictions=blocking_rules,\n    comparisons=[\n        cl.NameComparison(\"given_name\").configure(term_frequency_adjustments=True),\n        cl.NameComparison(\"surname\").configure(term_frequency_adjustments=True),\n        cl.DateOfBirthComparison(\n            \"date_of_birth\",\n            input_is_string=True,\n            datetime_format=\"%Y%m%d\",\n            invalid_dates_as_null=True,\n        ),\n        cl.DamerauLevenshteinAtThresholds(\"soc_sec_id\", [1, 2]),\n        cl.ExactMatch(\"street_number\").configure(term_frequency_adjustments=True),\n        cl.DamerauLevenshteinAtThresholds(\"postcode\", [1, 2]).configure(\n            term_frequency_adjustments=True\n        ),\n        # we don't consider further location columns as they will be strongly correlated with postcode\n    ],\n    retain_intermediate_calculation_columns=True,\n)\n\n\nlinker_simple = Linker(dfs, simple_model_settings, db_api=DuckDBAPI())\nlinker_detailed = Linker(dfs, 
detailed_model_settings, db_api=DuckDBAPI())\n
## Estimating model parameters

We need to furnish our models with parameter estimates so that we can generate results. We will focus on the detailed model, generating the values for the simple model at the end.

Since we did not specify the probability that two random records match, we can estimate it and compare with the known value of 1/5000 = 0.0002, to see how well our estimation procedure works.

To do this we come up with some deterministic rules - the aim here is that we generate very few false positives (i.e. we expect that the majority of records with at least one of these conditions holding are true matches), whilst also capturing the majority of matches - our guess here is that these two rules should capture 80% of all matches.
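The arithmetic behind this estimate can be sketched as follows: scale the number of pairs found by the deterministic rules up by the assumed recall, then divide by the total number of possible comparisons. The function name and the pair count of 4,778 are hypothetical, chosen purely for illustration.

```python
def estimate_prob_random_match(num_pairs_from_rules, recall, total_comparisons):
    """If the deterministic rules capture `recall` of all true matches,
    scale the observed pair count up to estimate the total number of
    matches, then divide by the number of possible comparisons."""
    estimated_matches = num_pairs_from_rules / recall
    return estimated_matches / total_comparisons

# illustrative: 5,000 x 5,000 link-only comparisons, with the rules
# flagging 4,778 pairs (a hypothetical count)
p = estimate_prob_random_match(4_778, 0.8, 5_000 * 5_000)
print(round(p, 6))  # 0.000239
```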

deterministic_rules = [\n    block_on(\"soc_sec_id\"),\n    block_on(\"given_name\", \"surname\", \"date_of_birth\"),\n]\n\nlinker_detailed.training.estimate_probability_two_random_records_match(\n    deterministic_rules, recall=0.8\n)\n
Probability two random records match is estimated to be  0.000239.\nThis means that amongst all possible pairwise record comparisons, one in 4,185.85 are expected to match.  With 25,000,000 total possible comparisons, we expect a total of around 5,972.50 matching pairs\n

Even playing around with these deterministic rules, or with the nominal recall, leaves us with an answer which is pretty close to our known value.

Next we estimate u and m values for each comparison, so that we can move on to generating predictions.
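As a reminder of what these parameters mean: m is the probability of observing a given comparison level amongst true matches, u is the same probability amongst non-matches, and the log2 of their ratio is that level's match weight. A minimal sketch, with illustrative (not trained) m and u values:

```python
import math

def match_weight(m, u):
    """Match weight for a comparison level: log2 of the Bayes factor m/u,
    where m = P(level | match) and u = P(level | non-match)."""
    return math.log2(m / u)

# e.g. an exact surname match might have m = 0.9 (true matches usually agree)
# and u = 0.005 (non-matches rarely agree by coincidence) -- illustrative values
print(round(match_weight(0.9, 0.005), 2))  # log2(180) = 7.49
```

A level with m = u carries no evidence (weight 0), while levels far more common amongst matches than non-matches push strongly towards a match.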

# We generally recommend setting max pairs higher (e.g. 1e7 or more)\n# But this will run faster for the purpose of this demo\nlinker_detailed.training.estimate_u_using_random_sampling(max_pairs=1e6)\n
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.\n----- Estimating u probabilities using random sampling -----\n\n\n\nFloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))\n\n\nu probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.\nu probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.\nu probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.\n\nEstimated u probabilities using random sampling\n\nYour model is not yet fully trained. Missing estimates for:\n    - given_name (no m values are trained).\n    - surname (no m values are trained).\n    - date_of_birth (some u values are not trained, no m values are trained).\n    - soc_sec_id (no m values are trained).\n    - street_number (no m values are trained).\n    - postcode (no m values are trained).\n

When training the m values using expectation maximisation, we need some more blocking rules to reduce the total number of comparisons. For each rule, we want to ensure that we have neither proportionally too many matches nor too few.

We must run this multiple times using different rules so that we can obtain estimates for all comparisons - if we block on e.g. date_of_birth, then we cannot compute the m values for the date_of_birth comparison, as we have only looked at records where these match.

session_dob = (\n    linker_detailed.training.estimate_parameters_using_expectation_maximisation(\n        block_on(\"date_of_birth\"), estimate_without_term_frequencies=True\n    )\n)\nsession_pc = (\n    linker_detailed.training.estimate_parameters_using_expectation_maximisation(\n        block_on(\"postcode\"), estimate_without_term_frequencies=True\n    )\n)\n
----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\nl.\"date_of_birth\" = r.\"date_of_birth\"\n\nParameter estimates will be made for the following comparison(s):\n    - given_name\n    - surname\n    - soc_sec_id\n    - street_number\n    - postcode\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - date_of_birth\n\nIteration 1: Largest change in params was -0.331 in probability_two_random_records_match\nIteration 2: Largest change in params was 0.00365 in the m_probability of given_name, level `All other comparisons`\nIteration 3: Largest change in params was 9.22e-05 in the m_probability of soc_sec_id, level `All other comparisons`\n\nEM converged after 3 iterations\n\nYour model is not yet fully trained. Missing estimates for:\n    - date_of_birth (some u values are not trained, no m values are trained).\n\n----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\nl.\"postcode\" = r.\"postcode\"\n\nParameter estimates will be made for the following comparison(s):\n    - given_name\n    - surname\n    - date_of_birth\n    - soc_sec_id\n    - street_number\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - postcode\n\nWARNING:\nLevel Abs difference of 'transformed date_of_birth <= 1 month' on comparison date_of_birth not observed in dataset, unable to train m value\n\nWARNING:\nLevel Abs difference of 'transformed date_of_birth <= 1 year' on comparison date_of_birth not observed in dataset, unable to train m value\n\nWARNING:\nLevel Abs difference of 'transformed date_of_birth <= 10 year' on comparison date_of_birth not observed in dataset, unable to train m value\n\nIteration 1: Largest change in params was 0.0374 in the m_probability of date_of_birth, level `All other comparisons`\nIteration 2: Largest change in params was 
0.000457 in the m_probability of date_of_birth, level `All other comparisons`\nIteration 3: Largest change in params was 7.66e-06 in the m_probability of soc_sec_id, level `All other comparisons`\n\nEM converged after 3 iterations\nm probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.\nm probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.\nm probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.\n\nYour model is not yet fully trained. Missing estimates for:\n    - date_of_birth (some u values are not trained, some m values are not trained).\n

If we wish, we can have a look at how our parameter estimates changed over these training sessions.

session_dob.m_u_values_interactive_history_chart()\n

For variables that aren't used in the m-training blocking rules, we have two estimates --- one from each of the training sessions (see for example street_number). We can have a look at how the values compare between them, to ensure that we don't have drastically different values, which may be indicative of an issue.

linker_detailed.visualisations.parameter_estimate_comparisons_chart()\n

We repeat our parameter estimation for the simple model in much the same fashion.

linker_simple.training.estimate_probability_two_random_records_match(\n    deterministic_rules, recall=0.8\n)\nlinker_simple.training.estimate_u_using_random_sampling(max_pairs=1e7)\nsession_ssid = (\n    linker_simple.training.estimate_parameters_using_expectation_maximisation(\n        block_on(\"given_name\"), estimate_without_term_frequencies=True\n    )\n)\nsession_pc = linker_simple.training.estimate_parameters_using_expectation_maximisation(\n    block_on(\"street_number\"), estimate_without_term_frequencies=True\n)\nlinker_simple.visualisations.parameter_estimate_comparisons_chart()\n
Probability two random records match is estimated to be  0.000239.\nThis means that amongst all possible pairwise record comparisons, one in 4,185.85 are expected to match.  With 25,000,000 total possible comparisons, we expect a total of around 5,972.50 matching pairs\n----- Estimating u probabilities using random sampling -----\n\nEstimated u probabilities using random sampling\n\nYour model is not yet fully trained. Missing estimates for:\n    - given_name (no m values are trained).\n    - surname (no m values are trained).\n    - street_number (no m values are trained).\n\n----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\nl.\"given_name\" = r.\"given_name\"\n\nParameter estimates will be made for the following comparison(s):\n    - surname\n    - street_number\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - given_name\n\nIteration 1: Largest change in params was 0.0812 in the m_probability of surname, level `All other comparisons`\nIteration 2: Largest change in params was -0.0261 in the m_probability of surname, level `Exact match on surname`\nIteration 3: Largest change in params was -0.0247 in the m_probability of surname, level `Exact match on surname`\nIteration 4: Largest change in params was 0.0227 in the m_probability of surname, level `All other comparisons`\nIteration 5: Largest change in params was -0.0198 in the m_probability of surname, level `Exact match on surname`\nIteration 6: Largest change in params was 0.0164 in the m_probability of surname, level `All other comparisons`\nIteration 7: Largest change in params was -0.0131 in the m_probability of surname, level `Exact match on surname`\nIteration 8: Largest change in params was 0.0101 in the m_probability of surname, level `All other comparisons`\nIteration 9: Largest change in params was -0.00769 in the m_probability of surname, level `Exact match on 
surname`\nIteration 10: Largest change in params was 0.00576 in the m_probability of surname, level `All other comparisons`\nIteration 11: Largest change in params was -0.00428 in the m_probability of surname, level `Exact match on surname`\nIteration 12: Largest change in params was 0.00316 in the m_probability of surname, level `All other comparisons`\nIteration 13: Largest change in params was -0.00234 in the m_probability of surname, level `Exact match on surname`\nIteration 14: Largest change in params was -0.00172 in the m_probability of surname, level `Exact match on surname`\nIteration 15: Largest change in params was 0.00127 in the m_probability of surname, level `All other comparisons`\nIteration 16: Largest change in params was -0.000939 in the m_probability of surname, level `Exact match on surname`\nIteration 17: Largest change in params was -0.000694 in the m_probability of surname, level `Exact match on surname`\nIteration 18: Largest change in params was -0.000514 in the m_probability of surname, level `Exact match on surname`\nIteration 19: Largest change in params was -0.000381 in the m_probability of surname, level `Exact match on surname`\nIteration 20: Largest change in params was -0.000282 in the m_probability of surname, level `Exact match on surname`\nIteration 21: Largest change in params was 0.00021 in the m_probability of surname, level `All other comparisons`\nIteration 22: Largest change in params was -0.000156 in the m_probability of surname, level `Exact match on surname`\nIteration 23: Largest change in params was 0.000116 in the m_probability of surname, level `All other comparisons`\nIteration 24: Largest change in params was 8.59e-05 in the m_probability of surname, level `All other comparisons`\n\nEM converged after 24 iterations\n\nYour model is not yet fully trained. 
Missing estimates for:\n    - given_name (no m values are trained).\n\n----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\nl.\"street_number\" = r.\"street_number\"\n\nParameter estimates will be made for the following comparison(s):\n    - given_name\n    - surname\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - street_number\n\nIteration 1: Largest change in params was -0.0446 in the m_probability of surname, level `Exact match on surname`\nIteration 2: Largest change in params was -0.0285 in the m_probability of surname, level `All other comparisons`\nIteration 3: Largest change in params was -0.026 in the m_probability of given_name, level `Exact match on given_name`\nIteration 4: Largest change in params was 0.0252 in the m_probability of given_name, level `All other comparisons`\nIteration 5: Largest change in params was -0.0231 in the m_probability of given_name, level `Exact match on given_name`\nIteration 6: Largest change in params was -0.02 in the m_probability of given_name, level `Exact match on given_name`\nIteration 7: Largest change in params was -0.0164 in the m_probability of given_name, level `Exact match on given_name`\nIteration 8: Largest change in params was -0.013 in the m_probability of given_name, level `Exact match on given_name`\nIteration 9: Largest change in params was 0.01 in the m_probability of given_name, level `All other comparisons`\nIteration 10: Largest change in params was -0.00757 in the m_probability of given_name, level `Exact match on given_name`\nIteration 11: Largest change in params was 0.00564 in the m_probability of given_name, level `All other comparisons`\nIteration 12: Largest change in params was -0.00419 in the m_probability of given_name, level `Exact match on given_name`\nIteration 13: Largest change in params was 0.0031 in the m_probability of given_name, level `All other 
comparisons`\nIteration 14: Largest change in params was -0.00231 in the m_probability of given_name, level `Exact match on given_name`\nIteration 15: Largest change in params was -0.00173 in the m_probability of given_name, level `Exact match on given_name`\nIteration 16: Largest change in params was 0.0013 in the m_probability of given_name, level `All other comparisons`\nIteration 17: Largest change in params was 0.000988 in the m_probability of given_name, level `All other comparisons`\nIteration 18: Largest change in params was -0.000756 in the m_probability of given_name, level `Exact match on given_name`\nIteration 19: Largest change in params was -0.000584 in the m_probability of given_name, level `Exact match on given_name`\nIteration 20: Largest change in params was -0.000465 in the m_probability of surname, level `Exact match on surname`\nIteration 21: Largest change in params was -0.000388 in the m_probability of surname, level `Exact match on surname`\nIteration 22: Largest change in params was -0.000322 in the m_probability of surname, level `Exact match on surname`\nIteration 23: Largest change in params was 0.000266 in the m_probability of surname, level `All other comparisons`\nIteration 24: Largest change in params was -0.000219 in the m_probability of surname, level `Exact match on surname`\nIteration 25: Largest change in params was -0.00018 in the m_probability of surname, level `Exact match on surname`\n\nEM converged after 25 iterations\n\nYour model is fully trained. All comparisons have at least one estimate for their m and u values\n
# import json\n# we can have a look at the full settings if we wish, including the values of our estimated parameters:\n# print(json.dumps(linker_detailed._settings_obj.as_dict(), indent=2))\n# we can also get a handy summary of of the model in an easily readable format if we wish:\n# print(linker_detailed._settings_obj.human_readable_description)\n# (we suppress output here for brevity)\n

We can now visualise some of the details of our models. We can look at the match weights, which tell us the relative importance for or against a match of each of our comparison levels.

Comparing the two models will show the added benefit we get in the more detailed model --- what in the simple model is classed as 'all other comparisons' is instead broken down further, and we can see that the detail of how this is broken down in fact gives us quite a bit of useful information about the likelihood of a match.

linker_simple.visualisations.match_weights_chart()\n
linker_detailed.visualisations.match_weights_chart()\n

As well as the match weights, which give us an idea of the overall effect of each comparison level, we can also look at the individual u and m parameter estimates, which tell us about the prevalence of coincidences and mistakes (for further details/explanation about this see this article). We might want to revise aspects of our model based on the information we ascertain here.

Note however that some of these values are very small, which is why the match weight chart is often more useful for getting a decent picture of things.

# linker_simple.m_u_parameters_chart()\nlinker_detailed.visualisations.m_u_parameters_chart()\n

It is also useful to have a look at unlinkable records - these are records which do not contain enough information to be linked at some match probability threshold. We can figure this out by seeing whether records are able to be matched with themselves.
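The idea can be sketched as follows: comparing a record with itself, its non-null columns fall into their exact-match levels while null columns contribute no evidence, so sparsely populated records accumulate less match weight and may never clear a high threshold. The column names and m/u values below are illustrative, not taken from the trained model.

```python
import math

def self_match_weight(record, m_u_by_column, prior_weight=0.0):
    """Compare a record with itself: each non-null column hits its
    exact-match level and contributes log2(m/u); null columns add nothing."""
    total = prior_weight
    for col, (m, u) in m_u_by_column.items():
        if record.get(col) is not None:
            total += math.log2(m / u)
    return total

# illustrative m/u values per column
params = {"given_name": (0.9, 0.01), "surname": (0.9, 0.005), "postcode": (0.95, 0.001)}
full = {"given_name": "ann", "surname": "lee", "postcode": "4223"}
sparse = {"given_name": "ann", "surname": None, "postcode": None}
print(self_match_weight(full, params) > self_match_weight(sparse, params))  # True
```

A record whose self-match weight falls below the weight corresponding to the chosen probability threshold is unlinkable at that threshold, no matter how good the candidate pairs are.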

This is of course relative to the information we have put into the model - we see that in our simple model, at a 99% match threshold nearly 10% of records are unlinkable, as we have not included enough information in the model for distinct records to be adequately distinguished; this is not an issue in our more detailed model.

linker_simple.evaluation.unlinkables_chart()\n
linker_detailed.evaluation.unlinkables_chart()\n

Our simple model doesn't do terribly, but suffers if we want to have a high match probability --- to be 99% (match weight ~7) certain of matches we have ~10% of records that we will be unable to link.

Our detailed model, however, has enough nuance that we can at least self-link records.
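The correspondence between a match weight of ~7 and a 99% probability noted above follows from the weight being the log2 Bayes factor: probability = 2**w / (1 + 2**w). A quick check:

```python
def weight_to_probability(w):
    """Convert a match weight (log2 Bayes factor, prior included) into a
    match probability: p = 2**w / (1 + 2**w)."""
    bf = 2.0 ** w
    return bf / (1.0 + bf)

print(round(weight_to_probability(7), 4))  # 128/129 = 0.9922
```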

## Predictions

Now that we have had a look into the details of the models, we will focus on only our more detailed model, which should be able to capture more of the genuine links in our data.

predictions = linker_detailed.inference.predict(threshold_match_probability=0.2)\ndf_predictions = predictions.as_pandas_dataframe()\ndf_predictions.head(5)\n
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))\n\n\n\n -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'date_of_birth':\n    m values not fully trained\nComparison: 'date_of_birth':\n    u values not fully trained\n
match_weight match_probability source_dataset_l source_dataset_r rec_id_l rec_id_r given_name_l given_name_r gamma_given_name tf_given_name_l ... gamma_postcode tf_postcode_l tf_postcode_r bf_postcode bf_tf_adj_postcode address_1_l address_1_r state_l state_r match_key 0 -1.830001 0.219521 __splink__input_table_0 __splink__input_table_1 rec-760-org rec-3951-dup-0 lachlan lachlan 4 0.0113 ... 3 0.0007 0.0007 759.407155 1.583362 bushby close templestoew avenue nsw vic 0 1 -1.801736 0.222896 __splink__input_table_0 __splink__input_table_1 rec-4980-org rec-4980-dup-0 isabella ctercteko 0 0.0069 ... 3 0.0004 0.0004 759.407155 2.770884 sturt avenue sturta venue vic vic 2 2 -1.271794 0.292859 __splink__input_table_0 __splink__input_table_1 rec-585-org rec-585-dup-0 danny stephenson 0 0.0001 ... 2 0.0016 0.0012 11.264825 1.000000 o'shanassy street o'shanassy street tas tas 1 3 -1.213441 0.301305 __splink__input_table_0 __splink__input_table_1 rec-1250-org rec-1250-dup-0 luke gazzola 0 0.0055 ... 2 0.0015 0.0002 11.264825 1.000000 newman morris circuit newman morr is circuit nsw nsw 1 4 -0.380336 0.434472 __splink__input_table_0 __splink__input_table_1 rec-4763-org rec-4763-dup-0 max alisha 0 0.0021 ... 1 0.0004 0.0016 0.043565 1.000000 duffy street duffy s treet nsw nsw 2

5 rows × 47 columns

We can see how our model performs at different probability thresholds, with a couple of options depending on the space in which we wish to view things.

linker_detailed.evaluation.accuracy_analysis_from_labels_column(\n    \"cluster\", output_type=\"accuracy\"\n)\n
 -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'date_of_birth':\n    m values not fully trained\nComparison: 'date_of_birth':\n    u values not fully trained\n

and we can easily see how many individuals we identify and link by looking at clusters generated at some threshold match probability of interest - in this example 99%
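Conceptually, clustering at a threshold takes the connected components of the graph whose edges are the scored pairs above that threshold. A toy union-find sketch of that idea (not Splink's actual implementation):

```python
def cluster_pairs(edges):
    """Group records into connected components, where each edge is a
    pairwise link scoring above the chosen threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)  # union the two components

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

# A-B and B-C link above threshold, so A, B, C form one cluster; D-E another
print(cluster_pairs([("A", "B"), ("B", "C"), ("D", "E")]))
```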

clusters = linker_detailed.clustering.cluster_pairwise_predictions_at_threshold(\n    predictions, threshold_match_probability=0.99\n)\ndf_clusters = clusters.as_pandas_dataframe().sort_values(\"cluster_id\")\ndf_clusters.groupby(\"cluster_id\").size().value_counts()\n
Completed iteration 1, root rows count 0\n\n\n\n\n\n2    4959\n1      82\nName: count, dtype: int64\n

In this case, we happen to know what the true links are, so we can manually inspect the ones that are doing worst to see what our model is not capturing - i.e. where we have false negatives.

Similarly, we can look at the non-links which are performing the best, to see whether we have an issue with false positives.

Ordinarily we would not have this luxury, and so would need to dig a bit deeper for clues as to how to improve our model, such as manually inspecting records across threshold probabilities.

df_predictions[\"cluster_l\"] = df_predictions[\"rec_id_l\"].apply(\n    lambda x: \"-\".join(x.split(\"-\")[:2])\n)\ndf_predictions[\"cluster_r\"] = df_predictions[\"rec_id_r\"].apply(\n    lambda x: \"-\".join(x.split(\"-\")[:2])\n)\ndf_true_links = df_predictions[\n    df_predictions[\"cluster_l\"] == df_predictions[\"cluster_r\"]\n].sort_values(\"match_probability\")\n
records_to_view = 3\nlinker_detailed.visualisations.waterfall_chart(\n    df_true_links.head(records_to_view).to_dict(orient=\"records\")\n)\n
df_non_links = df_predictions[\n    df_predictions[\"cluster_l\"] != df_predictions[\"cluster_r\"]\n].sort_values(\"match_probability\", ascending=False)\nlinker_detailed.visualisations.waterfall_chart(\n    df_non_links.head(records_to_view).to_dict(orient=\"records\")\n)\n
## Further refinements

Looking at the non-links we have done well in having no false positives at any substantial match probability --- however looking at some of the true links we can see that there are a few that we are not capturing with sufficient match probability.

We can see that there are a few features that we are not capturing/weighting appropriately:

  • single-character transpositions, particularly in postcode (which is being lumped in with more 'severe typos'/probable non-matches)
  • given/sur-names being swapped with typos
  • given/sur-names cross-matching on one name only, with no match on the other

We will quickly see if we can incorporate these features into a new model. As we are now going into more detail with the inter-relationship between given name and surname, it is probably no longer sensible to model them as independent comparisons, and so we will need to switch to a combined comparison on full name.
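To see why a combined comparison helps: with independent per-column comparisons, a clean swap of given name and surname scores as two non-matches, whereas a joint comparison can give swaps their own level. A hypothetical first-match-wins level assignment, using exact equality only for brevity (the real model below also uses Jaro-Winkler levels):

```python
def name_comparison_level(gn_l, sn_l, gn_r, sn_r):
    """Assign a record pair to the first full-name comparison level it
    satisfies, from strongest evidence to weakest."""
    if gn_l == gn_r and sn_l == sn_r:
        return "exact full name"
    if gn_l == sn_r and sn_l == gn_r:
        return "names reversed"
    if gn_l == gn_r or sn_l == sn_r:
        return "single name match"
    return "all other"

# a swapped pair is caught as its own level rather than two non-matches
print(name_comparison_level("john", "smith", "smith", "john"))  # names reversed
```

Because the levels are evaluated in order, a pair only falls through to a weaker level when all stronger ones fail, mirroring how Splink comparison levels are structured.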

# we need to append a full name column to our source data frames\n# so that we can use it for term frequency adjustments\ndfs[0][\"full_name\"] = dfs[0][\"given_name\"] + \"_\" + dfs[0][\"surname\"]\ndfs[1][\"full_name\"] = dfs[1][\"given_name\"] + \"_\" + dfs[1][\"surname\"]\n\n\nextended_model_settings = {\n    \"unique_id_column_name\": \"rec_id\",\n    \"link_type\": \"link_only\",\n    \"blocking_rules_to_generate_predictions\": blocking_rules,\n    \"comparisons\": [\n        {\n            \"output_column_name\": \"Full name\",\n            \"comparison_levels\": [\n                {\n                    \"sql_condition\": \"(given_name_l IS NULL OR given_name_r IS NULL) and (surname_l IS NULL OR surname_r IS NULL)\",\n                    \"label_for_charts\": \"Null\",\n                    \"is_null_level\": True,\n                },\n                # full name match\n                cll.ExactMatchLevel(\"full_name\", term_frequency_adjustments=True),\n                # typos - keep levels across full name rather than scoring separately\n                cll.JaroWinklerLevel(\"full_name\", 0.9),\n                cll.JaroWinklerLevel(\"full_name\", 0.7),\n                # name switched\n                cll.ColumnsReversedLevel(\"given_name\", \"surname\"),\n                # name switched + typo\n                {\n                    \"sql_condition\": \"jaro_winkler_similarity(given_name_l, surname_r) + jaro_winkler_similarity(surname_l, given_name_r) >= 1.8\",\n                    \"label_for_charts\": \"switched + jaro_winkler_similarity >= 1.8\",\n                },\n                {\n                    \"sql_condition\": \"jaro_winkler_similarity(given_name_l, surname_r) + jaro_winkler_similarity(surname_l, given_name_r) >= 1.4\",\n                    \"label_for_charts\": \"switched + jaro_winkler_similarity >= 1.4\",\n                },\n                # single name match\n                cll.ExactMatchLevel(\"given_name\", 
term_frequency_adjustments=True),\n                cll.ExactMatchLevel(\"surname\", term_frequency_adjustments=True),\n                # single name cross-match\n                {\n                    \"sql_condition\": \"given_name_l = surname_r OR surname_l = given_name_r\",\n                    \"label_for_charts\": \"single name cross-matches\",\n                },  # single name typos\n                cll.JaroWinklerLevel(\"given_name\", 0.9),\n                cll.JaroWinklerLevel(\"surname\", 0.9),\n                # the rest\n                cll.ElseLevel(),\n            ],\n        },\n        cl.DateOfBirthComparison(\n            \"date_of_birth\",\n            input_is_string=True,\n            datetime_format=\"%Y%m%d\",\n            invalid_dates_as_null=True,\n        ),\n        {\n            \"output_column_name\": \"Social security ID\",\n            \"comparison_levels\": [\n                cll.NullLevel(\"soc_sec_id\"),\n                cll.ExactMatchLevel(\"soc_sec_id\", term_frequency_adjustments=True),\n                cll.DamerauLevenshteinLevel(\"soc_sec_id\", 1),\n                cll.DamerauLevenshteinLevel(\"soc_sec_id\", 2),\n                cll.ElseLevel(),\n            ],\n        },\n        {\n            \"output_column_name\": \"Street number\",\n            \"comparison_levels\": [\n                cll.NullLevel(\"street_number\"),\n                cll.ExactMatchLevel(\"street_number\", term_frequency_adjustments=True),\n                cll.DamerauLevenshteinLevel(\"street_number\", 1),\n                cll.ElseLevel(),\n            ],\n        },\n        {\n            \"output_column_name\": \"Postcode\",\n            \"comparison_levels\": [\n                cll.NullLevel(\"postcode\"),\n                cll.ExactMatchLevel(\"postcode\", term_frequency_adjustments=True),\n                cll.DamerauLevenshteinLevel(\"postcode\", 1),\n                cll.DamerauLevenshteinLevel(\"postcode\", 2),\n                
cll.ElseLevel(),\n            ],\n        },\n        # we don't consider further location columns as they will be strongly correlated with postcode\n    ],\n    \"retain_intermediate_calculation_columns\": True,\n}\n
# train\nlinker_advanced = Linker(dfs, extended_model_settings, db_api=DuckDBAPI())\nlinker_advanced.training.estimate_probability_two_random_records_match(\n    deterministic_rules, recall=0.8\n)\n# We recommend increasing target rows to 1e8 improve accuracy for u\n# values in full name comparison, as we have subdivided the data more finely\n\n# Here, 1e7 for speed\nlinker_advanced.training.estimate_u_using_random_sampling(max_pairs=1e7)\n
Probability two random records match is estimated to be  0.000239.\nThis means that amongst all possible pairwise record comparisons, one in 4,185.85 are expected to match.  With 25,000,000 total possible comparisons, we expect a total of around 5,972.50 matching pairs\n----- Estimating u probabilities using random sampling -----\n\n\n\nFloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))\n\n\nu probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.\nu probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). This usually means the comparison level was never observed in the training data.\nu probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.\n\nEstimated u probabilities using random sampling\n\nYour model is not yet fully trained. Missing estimates for:\n    - Full name (no m values are trained).\n    - date_of_birth (some u values are not trained, no m values are trained).\n    - Social security ID (no m values are trained).\n    - Street number (no m values are trained).\n    - Postcode (no m values are trained).\n
session_dob = (\n    linker_advanced.training.estimate_parameters_using_expectation_maximisation(\n        \"l.date_of_birth = r.date_of_birth\", estimate_without_term_frequencies=True\n    )\n)\n
----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\nl.date_of_birth = r.date_of_birth\n\nParameter estimates will be made for the following comparison(s):\n    - Full name\n    - Social security ID\n    - Street number\n    - Postcode\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - date_of_birth\n\nWARNING:\nLevel single name cross-matches on comparison Full name not observed in dataset, unable to train m value\n\nIteration 1: Largest change in params was -0.465 in the m_probability of Full name, level `Exact match on full_name`\nIteration 2: Largest change in params was 0.00252 in the m_probability of Social security ID, level `All other comparisons`\nIteration 3: Largest change in params was 4.98e-05 in the m_probability of Social security ID, level `All other comparisons`\n\nEM converged after 3 iterations\nm probability not trained for Full name - single name cross-matches (comparison vector value: 3). This usually means the comparison level was never observed in the training data.\n\nYour model is not yet fully trained. Missing estimates for:\n    - Full name (some m values are not trained).\n    - date_of_birth (some u values are not trained, no m values are trained).\n
session_pc = (\n    linker_advanced.training.estimate_parameters_using_expectation_maximisation(\n        \"l.postcode = r.postcode\", estimate_without_term_frequencies=True\n    )\n)\n
----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\nl.postcode = r.postcode\n\nParameter estimates will be made for the following comparison(s):\n    - Full name\n    - date_of_birth\n    - Social security ID\n    - Street number\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - Postcode\n\nWARNING:\nLevel single name cross-matches on comparison Full name not observed in dataset, unable to train m value\n\nWARNING:\nLevel Abs difference of 'transformed date_of_birth <= 1 month' on comparison date_of_birth not observed in dataset, unable to train m value\n\nWARNING:\nLevel Abs difference of 'transformed date_of_birth <= 1 year' on comparison date_of_birth not observed in dataset, unable to train m value\n\nWARNING:\nLevel Abs difference of 'transformed date_of_birth <= 10 year' on comparison date_of_birth not observed in dataset, unable to train m value\n\nIteration 1: Largest change in params was 0.0374 in the m_probability of date_of_birth, level `All other comparisons`\nIteration 2: Largest change in params was 0.000656 in the m_probability of date_of_birth, level `All other comparisons`\nIteration 3: Largest change in params was 1.75e-05 in the m_probability of Social security ID, level `All other comparisons`\n\nEM converged after 3 iterations\nm probability not trained for Full name - single name cross-matches (comparison vector value: 3). This usually means the comparison level was never observed in the training data.\nm probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 month' (comparison vector value: 3). This usually means the comparison level was never observed in the training data.\nm probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 1 year' (comparison vector value: 2). 
This usually means the comparison level was never observed in the training data.\nm probability not trained for date_of_birth - Abs difference of 'transformed date_of_birth <= 10 year' (comparison vector value: 1). This usually means the comparison level was never observed in the training data.\n\nYour model is not yet fully trained. Missing estimates for:\n    - Full name (some m values are not trained).\n    - date_of_birth (some u values are not trained, some m values are not trained).\n
linker_advanced.visualisations.parameter_estimate_comparisons_chart()\n
linker_advanced.visualisations.match_weights_chart()\n
predictions_adv = linker_advanced.inference.predict()\ndf_predictions_adv = predictions_adv.as_pandas_dataframe()\nclusters_adv = linker_advanced.clustering.cluster_pairwise_predictions_at_threshold(\n    predictions_adv, threshold_match_probability=0.99\n)\ndf_clusters_adv = clusters_adv.as_pandas_dataframe().sort_values(\"cluster_id\")\ndf_clusters_adv.groupby(\"cluster_id\").size().value_counts()\n
 -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'Full name':\n    m values not fully trained\nComparison: 'date_of_birth':\n    m values not fully trained\nComparison: 'date_of_birth':\n    u values not fully trained\nCompleted iteration 1, root rows count 0\n\n\n\n\n\n2    4960\n1      80\nName: count, dtype: int64\n

This is a fairly modest improvement on our previous model. However, it is worth reiterating that we should not necessarily expect to recover all matches: in several cases the records may simply not contain enough information for a model to be justifiably confident that they refer to the same entity.

If we wished to improve matters, we could iterate on this process: investigating where the model is not performing as we would hope, and adjusting those areas to address the shortcomings.
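One way to begin such an investigation is to pull out the borderline predictions, since these are the pairs where the model is least certain and manual review is most informative. The sketch below uses a small synthetic stand-in for the predictions dataframe produced above (the record ids are placeholders); in practice you would apply the same filter to your real predictions.

```python
import pandas as pd

# Toy stand-in for the pandas predictions dataframe produced above.
df_predictions = pd.DataFrame(
    {
        "unique_id_l": ["rec-1", "rec-2", "rec-3"],
        "unique_id_r": ["rec-4", "rec-5", "rec-6"],
        "match_weight": [12.3, 2.1, -4.0],
        "match_probability": [0.999, 0.81, 0.06],
    }
)

# Pairs that sit between 'clearly a match' and 'clearly not' are where
# the model is least certain, and where manual review (e.g. with the
# waterfall chart) tells you most about its weaknesses.
borderline = df_predictions[
    df_predictions["match_probability"].between(0.5, 0.99)
].sort_values("match_weight")
print(borderline)
```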

"},{"location":"demos/examples/duckdb/link_only.html","title":"Linking two tables of persons","text":""},{"location":"demos/examples/duckdb/link_only.html#linking-without-deduplication","title":"Linking without deduplication","text":"

A simple record linkage model using the link_only link type.

With link_only, only between-dataset record comparisons are generated. No within-dataset record comparisons are created, meaning that the model does not attempt to find within-dataset duplicates.
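As a rough sanity check (plain arithmetic, no Splink required), the size of the comparison space implied by link_only can be computed directly. For two input frames of n_l and n_r rows, only the n_l * n_r between-dataset pairs are generated, rather than all pairs drawn from the pooled data that a dedupe_only model would consider:

```python
# The two halves of fake_1000 used in this example
n_l, n_r = 500, 500

# link_only: only between-dataset pairs (before blocking)
link_only_pairs = n_l * n_r

# dedupe_only on the pooled data would consider C(n_l + n_r, 2) pairs
dedupe_pairs = (n_l + n_r) * (n_l + n_r - 1) // 2

print(link_only_pairs, dedupe_pairs)
```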

from splink import splink_datasets\n\ndf = splink_datasets.fake_1000\n\n# Split a simple dataset into two, separate datasets which can be linked together.\ndf_l = df.sample(frac=0.5)\ndf_r = df.drop(df_l.index)\n\ndf_l.head(2)\n
| | unique_id | first_name | surname | dob | city | email | cluster |
|---|---|---|---|---|---|---|---|
| 922 | 922 | Evie | Jones | 2002-07-22 | NaN | eviejones@brewer-sparks.org | 230 |
| 224 | 224 | Logn | Feeruson | 2013-10-15 | London | l.fergson46@shah.com | 58 |
import splink.comparison_library as cl\n\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on\n\nsettings = SettingsCreator(\n    link_type=\"link_only\",\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"surname\"),\n    ],\n    comparisons=[\n        cl.NameComparison(\n            \"first_name\",\n        ),\n        cl.NameComparison(\"surname\"),\n        cl.DateOfBirthComparison(\n            \"dob\",\n            input_is_string=True,\n            invalid_dates_as_null=True,\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        cl.EmailComparison(\"email\"),\n    ],\n)\n\nlinker = Linker(\n    [df_l, df_r],\n    settings,\n    db_api=DuckDBAPI(),\n    input_table_aliases=[\"df_left\", \"df_right\"],\n)\n
from splink.exploratory import completeness_chart\n\ncompleteness_chart(\n    [df_l, df_r],\n    cols=[\"first_name\", \"surname\", \"dob\", \"city\", \"email\"],\n    db_api=DuckDBAPI(),\n    table_names_for_chart=[\"df_left\", \"df_right\"],\n)\n
deterministic_rules = [\n    \"l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1\",\n    \"l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1\",\n    \"l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2\",\n    block_on(\"email\"),\n]\n\n\nlinker.training.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)\n
Probability two random records match is estimated to be  0.00338.\nThis means that amongst all possible pairwise record comparisons, one in 295.61 are expected to match.  With 250,000 total possible comparisons, we expect a total of around 845.71 matching pairs\n
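The figures in this output are related by simple arithmetic: the estimated match probability multiplied by the total number of possible comparisons gives the expected number of matching pairs. A rough reconstruction, using the rounded probability reported above:

```python
# link_only on two frames of 500 rows each
total_comparisons = 500 * 500

# Rounded estimate reported in the output; the exact internal value
# yields the 845.71 figure shown there.
prob_match = 0.00338

expected_matches = prob_match * total_comparisons
print(round(expected_matches))  # roughly 845-846 matching pairs
```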
linker.training.estimate_u_using_random_sampling(max_pairs=1e6, seed=1)\n
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.\n----- Estimating u probabilities using random sampling -----\n\nEstimated u probabilities using random sampling\n\nYour model is not yet fully trained. Missing estimates for:\n    - first_name (no m values are trained).\n    - surname (no m values are trained).\n    - dob (no m values are trained).\n    - city (no m values are trained).\n    - email (no m values are trained).\n
session_dob = linker.training.estimate_parameters_using_expectation_maximisation(block_on(\"dob\"))\nsession_email = linker.training.estimate_parameters_using_expectation_maximisation(\n    block_on(\"email\")\n)\nsession_first_name = linker.training.estimate_parameters_using_expectation_maximisation(\n    block_on(\"first_name\")\n)\n
----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\nl.\"dob\" = r.\"dob\"\n\nParameter estimates will be made for the following comparison(s):\n    - first_name\n    - surname\n    - city\n    - email\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - dob\n\nWARNING:\nLevel Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value\n\nIteration 1: Largest change in params was -0.418 in the m_probability of surname, level `Exact match on surname`\nIteration 2: Largest change in params was 0.104 in probability_two_random_records_match\nIteration 3: Largest change in params was 0.0711 in the m_probability of first_name, level `All other comparisons`\nIteration 4: Largest change in params was 0.0237 in probability_two_random_records_match\nIteration 5: Largest change in params was 0.0093 in probability_two_random_records_match\nIteration 6: Largest change in params was 0.00407 in probability_two_random_records_match\nIteration 7: Largest change in params was 0.0019 in probability_two_random_records_match\nIteration 8: Largest change in params was 0.000916 in probability_two_random_records_match\nIteration 9: Largest change in params was 0.000449 in probability_two_random_records_match\nIteration 10: Largest change in params was 0.000222 in probability_two_random_records_match\nIteration 11: Largest change in params was 0.00011 in probability_two_random_records_match\nIteration 12: Largest change in params was 5.46e-05 in probability_two_random_records_match\n\nEM converged after 12 iterations\nm probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.\n\nYour model is not yet fully trained. 
Missing estimates for:\n    - dob (no m values are trained).\n    - email (some m values are not trained).\n\n----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\nl.\"email\" = r.\"email\"\n\nParameter estimates will be made for the following comparison(s):\n    - first_name\n    - surname\n    - dob\n    - city\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - email\n\nIteration 1: Largest change in params was -0.483 in the m_probability of dob, level `Exact match on dob`\nIteration 2: Largest change in params was 0.0905 in probability_two_random_records_match\nIteration 3: Largest change in params was 0.02 in probability_two_random_records_match\nIteration 4: Largest change in params was 0.00718 in probability_two_random_records_match\nIteration 5: Largest change in params was 0.0031 in probability_two_random_records_match\nIteration 6: Largest change in params was 0.00148 in probability_two_random_records_match\nIteration 7: Largest change in params was 0.000737 in probability_two_random_records_match\nIteration 8: Largest change in params was 0.000377 in probability_two_random_records_match\nIteration 9: Largest change in params was 0.000196 in probability_two_random_records_match\nIteration 10: Largest change in params was 0.000102 in probability_two_random_records_match\nIteration 11: Largest change in params was 5.37e-05 in probability_two_random_records_match\n\nEM converged after 11 iterations\n\nYour model is not yet fully trained. 
Missing estimates for:\n    - email (some m values are not trained).\n\n----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\nl.\"first_name\" = r.\"first_name\"\n\nParameter estimates will be made for the following comparison(s):\n    - surname\n    - dob\n    - city\n    - email\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - first_name\n\nIteration 1: Largest change in params was -0.169 in the m_probability of surname, level `All other comparisons`\nIteration 2: Largest change in params was -0.0127 in the m_probability of surname, level `All other comparisons`\nIteration 3: Largest change in params was -0.00388 in the m_probability of surname, level `All other comparisons`\nIteration 4: Largest change in params was -0.00164 in the m_probability of email, level `Jaro-Winkler >0.88 on username`\nIteration 5: Largest change in params was -0.00089 in the m_probability of email, level `Jaro-Winkler >0.88 on username`\nIteration 6: Largest change in params was -0.000454 in the m_probability of email, level `Jaro-Winkler >0.88 on username`\nIteration 7: Largest change in params was -0.000225 in the m_probability of email, level `Jaro-Winkler >0.88 on username`\nIteration 8: Largest change in params was -0.00011 in the m_probability of email, level `Jaro-Winkler >0.88 on username`\nIteration 9: Largest change in params was -5.31e-05 in the m_probability of email, level `Jaro-Winkler >0.88 on username`\n\nEM converged after 9 iterations\n\nYour model is fully trained. All comparisons have at least one estimate for their m and u values\n
results = linker.inference.predict(threshold_match_probability=0.9)\n
results.as_pandas_dataframe(limit=5)\n
match_weight match_probability source_dataset_l source_dataset_r unique_id_l unique_id_r first_name_l first_name_r gamma_first_name surname_l ... dob_l dob_r gamma_dob city_l city_r gamma_city email_l email_r gamma_email match_key 0 3.180767 0.900674 df_left df_right 242 240 Freya Freya 4 Shah ... 1970-12-17 1970-12-16 4 Lonnod noLdon 0 None None -1 0 1 3.180767 0.900674 df_left df_right 241 240 Freya Freya 4 None ... 1970-12-17 1970-12-16 4 London noLdon 0 f.s@flynn.com None -1 0 2 3.212523 0.902626 df_left df_right 679 682 Elizabeth Elizabeth 4 Shaw ... 2006-04-21 2016-04-18 1 Cardiff Cardifrf 0 e.shaw@smith-hall.biz e.shaw@smith-hall.lbiz 3 0 3 3.224126 0.903331 df_left df_right 576 580 Jessica Jessica 4 None ... 1974-11-17 1974-12-17 4 None Walsall -1 jesscac.owen@elliott.org None -1 0 4 3.224126 0.903331 df_left df_right 577 580 Jessica Jessica 4 None ... 1974-11-17 1974-12-17 4 None Walsall -1 jessica.owen@elliott.org None -1 0

5 rows \u00d7 22 columns

"},{"location":"demos/examples/duckdb/pairwise_labels.html","title":"Estimating m probabilities from labels","text":""},{"location":"demos/examples/duckdb/pairwise_labels.html#estimating-m-from-a-sample-of-pairwise-labels","title":"Estimating m from a sample of pairwise labels","text":"

In this example, we estimate the m probabilities of the model from a table containing pairwise record comparisons which we know are 'true' matches. For example, these may be the result of work by a clerical team who have manually labelled a sample of matches.

The table must be in the following format:

| source_dataset_l | unique_id_l | source_dataset_r | unique_id_r |
|---|---|---|---|
| df_1 | 1 | df_2 | 2 |
| df_1 | 1 | df_2 | 3 |

It is assumed that every row in the table represents a certain match.

Note that the column names above are the defaults. They should correspond to the values you've set for unique_id_column_name and source_dataset_column_name, if you've chosen custom values.
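For illustration, a minimal labels table with the default column names could be constructed in pandas like this (the df_1/df_2 dataset names and ids are placeholders taken from the format table above):

```python
import pandas as pd

# Pairwise labels in the default format: each row asserts that the
# pair of records it references is a known ('certain') match.
labels = pd.DataFrame(
    {
        "source_dataset_l": ["df_1", "df_1"],
        "unique_id_l": [1, 1],
        "source_dataset_r": ["df_2", "df_2"],
        "unique_id_r": [2, 3],
    }
)
print(labels)
```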

from splink.datasets import splink_dataset_labels\n\npairwise_labels = splink_dataset_labels.fake_1000_labels\n\n# Choose labels indicating a match\npairwise_labels = pairwise_labels[pairwise_labels[\"clerical_match_score\"] == 1]\npairwise_labels\n
unique_id_l source_dataset_l unique_id_r source_dataset_r clerical_match_score 0 0 fake_1000 1 fake_1000 1.0 1 0 fake_1000 2 fake_1000 1.0 2 0 fake_1000 3 fake_1000 1.0 49 1 fake_1000 2 fake_1000 1.0 50 1 fake_1000 3 fake_1000 1.0 ... ... ... ... ... ... 3171 994 fake_1000 996 fake_1000 1.0 3172 995 fake_1000 996 fake_1000 1.0 3173 997 fake_1000 998 fake_1000 1.0 3174 997 fake_1000 999 fake_1000 1.0 3175 998 fake_1000 999 fake_1000 1.0

2031 rows \u00d7 5 columns

We now proceed to estimate the Fellegi-Sunter model:

from splink import splink_datasets\n\ndf = splink_datasets.fake_1000\ndf.head(2)\n
| | unique_id | first_name | surname | dob | city | email | cluster |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Robert | Alan | 1971-06-24 | NaN | robert255@smith.net | 0 |
| 1 | 1 | Robert | Allen | 1971-05-24 | NaN | roberta25@smith.net | 0 |
import splink.comparison_library as cl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"surname\"),\n    ],\n    comparisons=[\n        cl.NameComparison(\"first_name\"),\n        cl.NameComparison(\"surname\"),\n        cl.DateOfBirthComparison(\n            \"dob\",\n            input_is_string=True,\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        cl.EmailComparison(\"email\"),\n    ],\n    retain_intermediate_calculation_columns=True,\n)\n
linker = Linker(df, settings, db_api=DuckDBAPI(), set_up_basic_logging=False)\ndeterministic_rules = [\n    \"l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1\",\n    \"l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1\",\n    \"l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2\",\n    \"l.email = r.email\",\n]\n\nlinker.training.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)\n
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.\n
# Register the pairwise labels table with the database, and then use it to estimate the m values\nlabels_df = linker.table_management.register_labels_table(pairwise_labels, overwrite=True)\nlinker.training.estimate_m_from_pairwise_labels(labels_df)\n\n\n# If the labels table already existing in the dataset you could run\n# linker.training.estimate_m_from_pairwise_labels(\"labels_tablename_here\")\n
training_blocking_rule = block_on(\"first_name\")\nlinker.training.estimate_parameters_using_expectation_maximisation(training_blocking_rule)\n
<EMTrainingSession, blocking on l.\"first_name\" = r.\"first_name\", deactivating comparisons first_name>\n
linker.visualisations.parameter_estimate_comparisons_chart()\n
linker.visualisations.match_weights_chart()\n
"},{"location":"demos/examples/duckdb/quick_and_dirty_persons.html","title":"Quick and dirty persons model","text":""},{"location":"demos/examples/duckdb/quick_and_dirty_persons.html#historical-people-quick-and-dirty","title":"Historical people: Quick and dirty","text":"

This example shows how to get some initial record linkage results as quickly as possible.

There are many ways to improve the accuracy of this model. But this may be a good place to start if you just want to give Splink a try and see what it's capable of.

from splink.datasets import splink_datasets\n\ndf = splink_datasets.historical_50k\ndf.head(5)\n
unique_id cluster full_name first_and_surname first_name surname dob birth_place postcode_fake gender occupation 0 Q2296770-1 Q2296770 thomas clifford, 1st baron clifford of chudleigh thomas chudleigh thomas chudleigh 1630-08-01 devon tq13 8df male politician 1 Q2296770-2 Q2296770 thomas of chudleigh thomas chudleigh thomas chudleigh 1630-08-01 devon tq13 8df male politician 2 Q2296770-3 Q2296770 tom 1st baron clifford of chudleigh tom chudleigh tom chudleigh 1630-08-01 devon tq13 8df male politician 3 Q2296770-4 Q2296770 thomas 1st chudleigh thomas chudleigh thomas chudleigh 1630-08-01 devon tq13 8hu None politician 4 Q2296770-5 Q2296770 thomas clifford, 1st baron chudleigh thomas chudleigh thomas chudleigh 1630-08-01 devon tq13 8df None politician
from splink import block_on, SettingsCreator\nimport splink.comparison_library as cl\n\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    blocking_rules_to_generate_predictions=[\n        block_on(\"full_name\"),\n        block_on(\"substr(full_name,1,6)\", \"dob\", \"birth_place\"),\n        block_on(\"dob\", \"birth_place\"),\n        block_on(\"postcode_fake\"),\n    ],\n    comparisons=[\n        cl.ForenameSurnameComparison(\n            \"first_name\",\n            \"surname\",\n            forename_surname_concat_col_name=\"first_and_surname\",\n        ),\n        cl.DateOfBirthComparison(\n            \"dob\",\n            input_is_string=True,\n        ),\n        cl.LevenshteinAtThresholds(\"postcode_fake\", 2),\n        cl.JaroWinklerAtThresholds(\"birth_place\", 0.9).configure(\n            term_frequency_adjustments=True\n        ),\n        cl.ExactMatch(\"occupation\").configure(term_frequency_adjustments=True),\n    ],\n)\n
from splink import Linker, DuckDBAPI\n\n\nlinker = Linker(df, settings, db_api=DuckDBAPI(), set_up_basic_logging=False)\ndeterministic_rules = [\n    \"l.full_name = r.full_name\",\n    \"l.postcode_fake = r.postcode_fake and l.dob = r.dob\",\n]\n\nlinker.training.estimate_probability_two_random_records_match(\n    deterministic_rules, recall=0.6\n)\n
linker.training.estimate_u_using_random_sampling(max_pairs=2e6)\n
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))\n
results = linker.inference.predict(threshold_match_probability=0.9)\n
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))\n\n\n\n -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'first_name_surname':\n    m values not fully trained\nComparison: 'first_name_surname':\n    u values not fully trained\nComparison: 'dob':\n    m values not fully trained\nComparison: 'postcode_fake':\n    m values not fully trained\nComparison: 'birth_place':\n    m values not fully trained\nComparison: 'occupation':\n    m values not fully trained\n
results.as_pandas_dataframe(limit=5)\n
match_weight match_probability unique_id_l unique_id_r first_name_l first_name_r surname_l surname_r first_and_surname_l first_and_surname_r ... gamma_postcode_fake birth_place_l birth_place_r gamma_birth_place occupation_l occupation_r gamma_occupation full_name_l full_name_r match_key 0 3.170005 0.900005 Q7412607-1 Q7412607-3 samuel samuel shelley shelley samuel shelley samuel shelley ... 0 whitechapel city of london 0 illuminator illuminator 1 samuel shelley samuel shelley 0 1 3.170695 0.900048 Q15997578-4 Q15997578-7 job wilding wilding None job wilding wilding ... -1 wrexham wrexham 2 association football player association football player 1 job wilding wilding 2 2 3.170695 0.900048 Q15997578-2 Q15997578-7 job wilding wilding None job wilding wilding ... -1 wrexham wrexham 2 association football player association football player 1 job wilding wilding 2 3 3.170695 0.900048 Q15997578-1 Q15997578-7 job wilding wilding None job wilding wilding ... -1 wrexham wrexham 2 association football player association football player 1 job wilding wilding 2 4 3.172553 0.900164 Q5726641-11 Q5726641-8 henry harry page paige henry page harry paige ... 2 staffordshire moorlands staffordshire moorlands 2 cricketer cricketer 1 henry page harry paige 3

5 rows \u00d7 26 columns

"},{"location":"demos/examples/duckdb/real_time_record_linkage.html","title":"Real time record linkage","text":""},{"location":"demos/examples/duckdb/real_time_record_linkage.html#real-time-linkage","title":"Real time linkage","text":"

In this notebook, we demonstrate Splink's incremental and real-time linkage capabilities, specifically:

  • the linker.inference.compare_two_records function, which allows you to interactively explore the results of a linkage model; and
  • the linker.inference.find_matches_to_new_records function, which allows you to incrementally find matches to a small number of new records

"},{"location":"demos/examples/duckdb/real_time_record_linkage.html#step-1-load-a-pre-trained-linkage-model","title":"Step 1: Load a pre-trained linkage model","text":"
import urllib.request\nimport json\nfrom pathlib import Path\nfrom splink import Linker, DuckDBAPI, block_on, SettingsCreator, splink_datasets\n\ndf = splink_datasets.fake_1000\n\nurl = \"https://raw.githubusercontent.com/moj-analytical-services/splink_demos/master/demo_settings/real_time_settings.json\"\n\nwith urllib.request.urlopen(url) as u:\n    settings = json.loads(u.read().decode())\n\n\nlinker = Linker(df, settings, db_api=DuckDBAPI())\n
linker.visualisations.waterfall_chart(\n    linker.inference.predict().as_record_dict(limit=2)\n)\n
"},{"location":"demos/examples/duckdb/real_time_record_linkage.html#step-comparing-two-records","title":"Step 2: Comparing two records","text":"

It's now possible to compute a match weight for any two records using linker.compare_two_records()

record_1 = {\n    \"unique_id\": 1,\n    \"first_name\": \"Lucas\",\n    \"surname\": \"Smith\",\n    \"dob\": \"1984-01-02\",\n    \"city\": \"London\",\n    \"email\": \"lucas.smith@hotmail.com\",\n}\n\nrecord_2 = {\n    \"unique_id\": 2,\n    \"first_name\": \"Lucas\",\n    \"surname\": \"Smith\",\n    \"dob\": \"1983-02-12\",\n    \"city\": \"Machester\",\n    \"email\": \"lucas.smith@hotmail.com\",\n}\n\nlinker._settings_obj._retain_intermediate_calculation_columns = True\n\n\n# To `compare_two_records` the linker needs to compute term frequency tables\n# If you have precomputed tables, you can linker.register_term_frequency_lookup()\nlinker.table_management.compute_tf_table(\"first_name\")\nlinker.table_management.compute_tf_table(\"surname\")\nlinker.table_management.compute_tf_table(\"dob\")\nlinker.table_management.compute_tf_table(\"city\")\nlinker.table_management.compute_tf_table(\"email\")\n\n\ndf_two = linker.inference.compare_two_records(record_1, record_2)\ndf_two.as_pandas_dataframe()\n
match_weight match_probability unique_id_l unique_id_r first_name_l first_name_r gamma_first_name tf_first_name_l tf_first_name_r bf_first_name ... bf_city bf_tf_adj_city email_l email_r gamma_email tf_email_l tf_email_r bf_email bf_tf_adj_email match_key 0 13.161672 0.999891 1 2 Lucas Lucas 2 0.001203 0.001203 87.571229 ... 0.446404 1.0 lucas.smith@hotmail.com lucas.smith@hotmail.com 1 NaN NaN 263.229168 1.0 0

1 rows \u00d7 40 columns

"},{"location":"demos/examples/duckdb/real_time_record_linkage.html#step-3-interactive-comparisons","title":"Step 3: Interactive comparisons","text":"

One interesting application of compare_two_records is to create a simple interface that allows the user to input two records interactively and get real-time feedback.

In the following cell we use ipywidgets for this purpose. \u2728\u2728 Change the values in the text boxes to see the waterfall chart update in real time. \u2728\u2728

import ipywidgets as widgets\nfrom IPython.display import display\n\n\nfields = [\"unique_id\", \"first_name\", \"surname\", \"dob\", \"email\", \"city\"]\n\nleft_text_boxes = []\nright_text_boxes = []\n\ninputs_to_interactive_output = {}\n\nfor f in fields:\n    wl = widgets.Text(description=f, value=str(record_1[f]))\n    left_text_boxes.append(wl)\n    inputs_to_interactive_output[f\"{f}_l\"] = wl\n    wr = widgets.Text(description=f, value=str(record_2[f]))\n    right_text_boxes.append(wr)\n    inputs_to_interactive_output[f\"{f}_r\"] = wr\n\nb1 = widgets.VBox(left_text_boxes)\nb2 = widgets.VBox(right_text_boxes)\nui = widgets.HBox([b1, b2])\n\n\ndef myfn(**kwargs):\n    my_args = dict(kwargs)\n\n    record_left = {}\n    record_right = {}\n\n    for key, value in my_args.items():\n        if value == \"\":\n            value = None\n        if key.endswith(\"_l\"):\n            record_left[key[:-2]] = value\n        elif key.endswith(\"_r\"):\n            record_right[key[:-2]] = value\n\n    # Assuming 'linker' is defined earlier in your code\n    linker._settings_obj._retain_intermediate_calculation_columns = True\n\n    df_two = linker.inference.compare_two_records(record_left, record_right)\n\n    recs = df_two.as_pandas_dataframe().to_dict(orient=\"records\")\n\n    display(linker.visualisations.waterfall_chart(recs, filter_nulls=False))\n\n\nout = widgets.interactive_output(myfn, inputs_to_interactive_output)\n\ndisplay(ui, out)\n
HBox(children=(VBox(children=(Text(value='1', description='unique_id'), Text(value='Lucas', description='first\u2026\n\n\n\nOutput()\n
"},{"location":"demos/examples/duckdb/real_time_record_linkage.html#finding-matching-records-interactively","title":"Finding matching records interactively","text":"

It is also possible to rapidly search the records in the input dataset using the linker.inference.find_matches_to_new_records() function.

record = {\n    \"unique_id\": 123987,\n    \"first_name\": \"Robert\",\n    \"surname\": \"Alan\",\n    \"dob\": \"1971-05-24\",\n    \"city\": \"London\",\n    \"email\": \"robert255@smith.net\",\n}\n\n\ndf_inc = linker.inference.find_matches_to_new_records(\n    [record], blocking_rules=[]\n).as_pandas_dataframe()\ndf_inc.sort_values(\"match_weight\", ascending=False)\n
match_weight match_probability unique_id_l unique_id_r first_name_l first_name_r gamma_first_name tf_first_name_l tf_first_name_r bf_first_name ... tf_city_r bf_city bf_tf_adj_city email_l email_r gamma_email tf_email_l tf_email_r bf_email bf_tf_adj_email 6 23.531793 1.000000 0 123987 Robert Robert 2 0.003610 0.00361 87.571229 ... 0.212792 1.000000 1.000000 robert255@smith.net robert255@smith.net 1 0.001267 0.001267 263.229168 1.730964 5 14.550320 0.999958 1 123987 Robert Robert 2 0.003610 0.00361 87.571229 ... 0.212792 1.000000 1.000000 roberta25@smith.net robert255@smith.net 0 0.002535 0.001267 0.423438 1.000000 4 10.388623 0.999255 3 123987 Robert Robert 2 0.003610 0.00361 87.571229 ... 0.212792 0.446404 1.000000 None robert255@smith.net -1 NaN 0.001267 1.000000 1.000000 3 2.427256 0.843228 2 123987 Rob Robert 0 0.001203 0.00361 0.218767 ... 0.212792 10.484859 0.259162 roberta25@smith.net robert255@smith.net 0 0.002535 0.001267 0.423438 1.000000 2 -2.123090 0.186697 8 123987 None Robert -1 NaN 0.00361 1.000000 ... 0.212792 1.000000 1.000000 None robert255@smith.net -1 NaN 0.001267 1.000000 1.000000 1 -2.205894 0.178139 754 123987 None Robert -1 NaN 0.00361 1.000000 ... 0.212792 1.000000 1.000000 j.c@whige.wort robert255@smith.net 0 0.001267 0.001267 0.423438 1.000000 0 -2.802309 0.125383 750 123987 None Robert -1 NaN 0.00361 1.000000 ... 0.212792 10.484859 0.259162 j.c@white.org robert255@smith.net 0 0.002535 0.001267 0.423438 1.000000

7 rows \u00d7 39 columns

"},{"location":"demos/examples/duckdb/real_time_record_linkage.html#interactive-interface-for-finding-records","title":"Interactive interface for finding records","text":"

Again, we can use ipywidgets to build an interactive interface around the linker.inference.find_matches_to_new_records() function.

@widgets.interact(\n    first_name=\"Robert\",\n    surname=\"Alan\",\n    dob=\"1971-05-24\",\n    city=\"London\",\n    email=\"robert255@smith.net\",\n)\ndef interactive_link(first_name, surname, dob, city, email):\n    record = {\n        \"unique_id\": 123987,\n        \"first_name\": first_name,\n        \"surname\": surname,\n        \"dob\": dob,\n        \"city\": city,\n        \"email\": email,\n        \"group\": 0,\n    }\n\n    for key in record.keys():\n        if type(record[key]) == str:\n            if record[key].strip() == \"\":\n                record[key] = None\n\n    df_inc = linker.inference.find_matches_to_new_records(\n        [record], blocking_rules=[\"(true)\"]\n    ).as_pandas_dataframe()\n    df_inc = df_inc.sort_values(\"match_weight\", ascending=False)\n    recs = df_inc.to_dict(orient=\"records\")\n\n    display(linker.visualisations.waterfall_chart(recs, filter_nulls=False))\n
interactive(children=(Text(value='Robert', description='first_name'), Text(value='Alan', description='surname'\u2026\n
linker.visualisations.match_weights_chart()\n
"},{"location":"demos/examples/duckdb/transactions.html","title":"Linking financial transactions","text":""},{"location":"demos/examples/duckdb/transactions.html#linking-banking-transactions","title":"Linking banking transactions","text":"

This example shows how to perform a one-to-one link on banking transactions.

The data is fake, and was generated to have the following features:

  • Money shows up in the destination account with some time delay
  • The amount sent and the amount received are not always the same - there are hidden fees and foreign exchange effects
  • The memo is sometimes truncated and content is sometimes missing

Since each origin payment should end up in the destination account, the probability_two_random_records_match of the model is known.
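The prior can be derived directly for a one-to-one link: every origin record has exactly one true match among the destination records. A minimal sketch of the arithmetic (the row count below is hypothetical; the notebook itself passes `1 / len(df_origin)` to `SettingsCreator`):

```python
# Sketch only: for a one-to-one link between two tables of N rows each,
# there are N * N possible pairwise comparisons, of which exactly N are
# true matches, so the prior match probability is N / (N * N) = 1 / N.
n_origin = 45_326  # hypothetical row count; use len(df_origin) in practice

total_comparisons = n_origin * n_origin
expected_matches = n_origin

probability_two_random_records_match = expected_matches / total_comparisons
print(probability_two_random_records_match == 1 / n_origin)  # True
```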

from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\ndf_origin = splink_datasets.transactions_origin\ndf_destination = splink_datasets.transactions_destination\n\ndisplay(df_origin.head(2))\ndisplay(df_destination.head(2))\n
ground_truth memo transaction_date amount unique_id 0 0 MATTHIAS C paym 2022-03-28 36.36 0 1 1 M CORVINUS dona 2022-02-14 221.91 1 ground_truth memo transaction_date amount unique_id 0 0 MATTHIAS C payment BGC 2022-03-29 36.36 0 1 1 M CORVINUS BGC 2022-02-16 221.91 1

In the following chart, we can see this is a challenging dataset to link:

  • There are only 151 distinct transaction dates, with strong skew
  • Some 'memos' are used multiple times (up to 48 times)
  • There is strong skew in the 'amount' column, with 1,400 transactions of around 60.00
from splink.exploratory import profile_columns\n\ndb_api = DuckDBAPI()\nprofile_columns(\n    [df_origin, df_destination],\n    db_api=db_api,\n    column_expressions=[\n        \"memo\",\n        \"transaction_date\",\n        \"amount\",\n    ],\n)\n
from splink import DuckDBAPI, block_on\nfrom splink.blocking_analysis import (\n    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,\n)\n\n# Design blocking rules that allow for differences in transaction date and amounts\nblocking_rule_date_1 = \"\"\"\n    strftime(l.transaction_date, '%Y%m') = strftime(r.transaction_date, '%Y%m')\n    and substr(l.memo, 1,3) = substr(r.memo,1,3)\n    and l.amount/r.amount > 0.7   and l.amount/r.amount < 1.3\n\"\"\"\n\n# Offset by half a month to ensure we capture case when the dates are e.g. 31st Jan and 1st Feb\nblocking_rule_date_2 = \"\"\"\n    strftime(l.transaction_date+15, '%Y%m') = strftime(r.transaction_date, '%Y%m')\n    and substr(l.memo, 1,3) = substr(r.memo,1,3)\n    and l.amount/r.amount > 0.7   and l.amount/r.amount < 1.3\n\"\"\"\n\nblocking_rule_memo = block_on(\"substr(memo,1,9)\")\n\nblocking_rule_amount_1 = \"\"\"\nround(l.amount/2,0)*2 = round(r.amount/2,0)*2 and yearweek(r.transaction_date) = yearweek(l.transaction_date)\n\"\"\"\n\nblocking_rule_amount_2 = \"\"\"\nround(l.amount/2,0)*2 = round((r.amount+1)/2,0)*2 and yearweek(r.transaction_date) = yearweek(l.transaction_date + 4)\n\"\"\"\n\nblocking_rule_cheat = block_on(\"unique_id\")\n\n\nbrs = [\n    blocking_rule_date_1,\n    blocking_rule_date_2,\n    blocking_rule_memo,\n    blocking_rule_amount_1,\n    blocking_rule_amount_2,\n    blocking_rule_cheat,\n]\n\n\ndb_api = DuckDBAPI()\n\ncumulative_comparisons_to_be_scored_from_blocking_rules_chart(\n    table_or_tables=[df_origin, df_destination],\n    blocking_rules=brs,\n    db_api=db_api,\n    link_type=\"link_only\"\n)\n
# Full settings for linking model\nimport splink.comparison_level_library as cll\nimport splink.comparison_library as cl\n\ncomparison_amount = {\n    \"output_column_name\": \"amount\",\n    \"comparison_levels\": [\n        cll.NullLevel(\"amount\"),\n        cll.ExactMatchLevel(\"amount\"),\n        cll.PercentageDifferenceLevel(\"amount\", 0.01),\n        cll.PercentageDifferenceLevel(\"amount\", 0.03),\n        cll.PercentageDifferenceLevel(\"amount\", 0.1),\n        cll.PercentageDifferenceLevel(\"amount\", 0.3),\n        cll.ElseLevel(),\n    ],\n    \"comparison_description\": \"Amount percentage difference\",\n}\n\n# The date distance is one-sided because transactions should only arrive after they've left\n# As a result, the comparison_template_library date difference functions are not appropriate\nwithin_n_days_template = \"transaction_date_r - transaction_date_l <= {n} and transaction_date_r >= transaction_date_l\"\n\ncomparison_date = {\n    \"output_column_name\": \"transaction_date\",\n    \"comparison_levels\": [\n        cll.NullLevel(\"transaction_date\"),\n        {\n            \"sql_condition\": within_n_days_template.format(n=1),\n            \"label_for_charts\": \"1 day\",\n        },\n        {\n            \"sql_condition\": within_n_days_template.format(n=4),\n            \"label_for_charts\": \"<=4 days\",\n        },\n        {\n            \"sql_condition\": within_n_days_template.format(n=10),\n            \"label_for_charts\": \"<=10 days\",\n        },\n        {\n            \"sql_condition\": within_n_days_template.format(n=30),\n            \"label_for_charts\": \"<=30 days\",\n        },\n        cll.ElseLevel(),\n    ],\n    \"comparison_description\": \"Transaction date days apart\",\n}\n\n\nsettings = SettingsCreator(\n    link_type=\"link_only\",\n    probability_two_random_records_match=1 / len(df_origin),\n    blocking_rules_to_generate_predictions=[\n        blocking_rule_date_1,\n        blocking_rule_date_2,\n        
blocking_rule_memo,\n        blocking_rule_amount_1,\n        blocking_rule_amount_2,\n        blocking_rule_cheat,\n    ],\n    comparisons=[\n        comparison_amount,\n        cl.LevenshteinAtThresholds(\"memo\", [2, 6, 10]),\n        comparison_date,\n    ],\n    retain_intermediate_calculation_columns=True,\n)\n
linker = Linker(\n    [df_origin, df_destination],\n    settings,\n    input_table_aliases=[\"__ori\", \"_dest\"],\n    db_api=db_api,\n)\n
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.\n----- Estimating u probabilities using random sampling -----\n\nEstimated u probabilities using random sampling\n\nYour model is not yet fully trained. Missing estimates for:\n    - amount (no m values are trained).\n    - memo (no m values are trained).\n    - transaction_date (no m values are trained).\n
linker.training.estimate_parameters_using_expectation_maximisation(block_on(\"memo\"))\n
----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\nl.\"memo\" = r.\"memo\"\n\nParameter estimates will be made for the following comparison(s):\n    - amount\n    - transaction_date\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - memo\n\nIteration 1: Largest change in params was -0.588 in the m_probability of amount, level `Exact match on amount`\nIteration 2: Largest change in params was -0.176 in the m_probability of transaction_date, level `1 day`\nIteration 3: Largest change in params was 0.00996 in the m_probability of amount, level `Percentage difference of 'amount' within 10.00%`\nIteration 4: Largest change in params was 0.0022 in the m_probability of transaction_date, level `<=30 days`\nIteration 5: Largest change in params was 0.000385 in the m_probability of transaction_date, level `<=30 days`\nIteration 6: Largest change in params was -0.000255 in the m_probability of amount, level `All other comparisons`\nIteration 7: Largest change in params was -0.000229 in the m_probability of amount, level `All other comparisons`\nIteration 8: Largest change in params was -0.000208 in the m_probability of amount, level `All other comparisons`\nIteration 9: Largest change in params was -0.00019 in the m_probability of amount, level `All other comparisons`\nIteration 10: Largest change in params was -0.000173 in the m_probability of amount, level `All other comparisons`\nIteration 11: Largest change in params was -0.000159 in the m_probability of amount, level `All other comparisons`\nIteration 12: Largest change in params was -0.000146 in the m_probability of amount, level `All other comparisons`\nIteration 13: Largest change in params was -0.000135 in the m_probability of amount, level `All other comparisons`\nIteration 14: Largest change in params was -0.000124 in the m_probability of amount, level `All other comparisons`\nIteration 15: 
Largest change in params was -0.000115 in the m_probability of amount, level `All other comparisons`\nIteration 16: Largest change in params was -0.000107 in the m_probability of amount, level `All other comparisons`\nIteration 17: Largest change in params was -9.92e-05 in the m_probability of amount, level `All other comparisons`\n\nEM converged after 17 iterations\n\nYour model is not yet fully trained. Missing estimates for:\n    - memo (no m values are trained).\n\n\n\n\n\n<EMTrainingSession, blocking on l.\"memo\" = r.\"memo\", deactivating comparisons memo>\n
session = linker.training.estimate_parameters_using_expectation_maximisation(block_on(\"amount\"))\n
----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\nl.\"amount\" = r.\"amount\"\n\nParameter estimates will be made for the following comparison(s):\n    - memo\n    - transaction_date\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - amount\n\nIteration 1: Largest change in params was -0.373 in the m_probability of memo, level `Exact match on memo`\nIteration 2: Largest change in params was -0.108 in the m_probability of memo, level `Exact match on memo`\nIteration 3: Largest change in params was 0.0202 in the m_probability of memo, level `Levenshtein distance of memo <= 10`\nIteration 4: Largest change in params was -0.00538 in the m_probability of memo, level `Exact match on memo`\nIteration 5: Largest change in params was 0.00482 in the m_probability of memo, level `All other comparisons`\nIteration 6: Largest change in params was 0.00508 in the m_probability of memo, level `All other comparisons`\nIteration 7: Largest change in params was 0.00502 in the m_probability of memo, level `All other comparisons`\nIteration 8: Largest change in params was 0.00466 in the m_probability of memo, level `All other comparisons`\nIteration 9: Largest change in params was 0.00409 in the m_probability of memo, level `All other comparisons`\nIteration 10: Largest change in params was 0.00343 in the m_probability of memo, level `All other comparisons`\nIteration 11: Largest change in params was 0.00276 in the m_probability of memo, level `All other comparisons`\nIteration 12: Largest change in params was 0.00216 in the m_probability of memo, level `All other comparisons`\nIteration 13: Largest change in params was 0.00165 in the m_probability of memo, level `All other comparisons`\nIteration 14: Largest change in params was 0.00124 in the m_probability of memo, level `All other comparisons`\nIteration 15: Largest change in params was 0.000915 in the 
m_probability of memo, level `All other comparisons`\nIteration 16: Largest change in params was 0.000671 in the m_probability of memo, level `All other comparisons`\nIteration 17: Largest change in params was 0.000488 in the m_probability of memo, level `All other comparisons`\nIteration 18: Largest change in params was 0.000353 in the m_probability of memo, level `All other comparisons`\nIteration 19: Largest change in params was 0.000255 in the m_probability of memo, level `All other comparisons`\nIteration 20: Largest change in params was 0.000183 in the m_probability of memo, level `All other comparisons`\nIteration 21: Largest change in params was 0.000132 in the m_probability of memo, level `All other comparisons`\nIteration 22: Largest change in params was 9.45e-05 in the m_probability of memo, level `All other comparisons`\n\nEM converged after 22 iterations\n\nYour model is fully trained. All comparisons have at least one estimate for their m and u values\n
linker.visualisations.match_weights_chart()\n
df_predict = linker.inference.predict(threshold_match_probability=0.001)\n
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))\n
linker.visualisations.comparison_viewer_dashboard(\n    df_predict, \"dashboards/comparison_viewer_transactions.html\", overwrite=True\n)\nfrom IPython.display import IFrame\n\nIFrame(\n    src=\"./dashboards/comparison_viewer_transactions.html\", width=\"100%\", height=1200\n)\n

pred_errors = linker.evaluation.prediction_errors_from_labels_column(\n    \"ground_truth\", include_false_positives=True, include_false_negatives=False\n)\nlinker.visualisations.waterfall_chart(pred_errors.as_record_dict(limit=5))\n
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))\n
pred_errors = linker.evaluation.prediction_errors_from_labels_column(\n    \"ground_truth\", include_false_positives=False, include_false_negatives=True\n)\nlinker.visualisations.waterfall_chart(pred_errors.as_record_dict(limit=5))\n
"},{"location":"demos/examples/spark/deduplicate_1k_synthetic.html","title":"Deduplication using PySpark","text":""},{"location":"demos/examples/spark/deduplicate_1k_synthetic.html#linking-in-spark","title":"Linking in Spark","text":"
from pyspark import SparkConf, SparkContext\nfrom pyspark.sql import SparkSession\n\nfrom splink.backends.spark import similarity_jar_location\n\nconf = SparkConf()\n# This parallelism setting is only suitable for a small toy example\nconf.set(\"spark.driver.memory\", \"12g\")\nconf.set(\"spark.default.parallelism\", \"8\")\nconf.set(\"spark.sql.codegen.wholeStage\", \"false\")\n\n\n# Add custom similarity functions, which are bundled with Splink\n# documented here: https://github.com/moj-analytical-services/splink_scalaudfs\npath = similarity_jar_location()\nconf.set(\"spark.jars\", path)\n\nsc = SparkContext.getOrCreate(conf=conf)\n\nspark = SparkSession(sc)\nspark.sparkContext.setCheckpointDir(\"./tmp_checkpoints\")\n
24/07/13 19:50:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\nSetting default log level to \"WARN\".\nTo adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n
from splink import splink_datasets\n\npandas_df = splink_datasets.fake_1000\n\ndf = spark.createDataFrame(pandas_df)\n
import splink.comparison_library as cl\nfrom splink import Linker, SettingsCreator, SparkAPI, block_on\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.NameComparison(\"first_name\"),\n        cl.NameComparison(\"surname\"),\n        cl.LevenshteinAtThresholds(\n            \"dob\"\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        cl.EmailComparison(\"email\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        \"l.surname = r.surname\",  # alternatively, you can write BRs in their SQL form\n    ],\n    retain_intermediate_calculation_columns=True,\n    em_convergence=0.01,\n)\n
linker = Linker(df, settings, db_api=SparkAPI(spark_session=spark))\ndeterministic_rules = [\n    \"l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1\",\n    \"l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1\",\n    \"l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2\",\n    \"l.email = r.email\",\n]\n\nlinker.training.estimate_probability_two_random_records_match(deterministic_rules, recall=0.6)\n
Probability two random records match is estimated to be  0.0806.                \nThis means that amongst all possible pairwise record comparisons, one in 12.41 are expected to match.  With 499,500 total possible comparisons, we expect a total of around 40,246.67 matching pairs\n
linker.training.estimate_u_using_random_sampling(max_pairs=5e5)\n
----- Estimating u probabilities using random sampling -----\n\n\n\nEstimated u probabilities using random sampling\n\nYour model is not yet fully trained. Missing estimates for:\n    - first_name (no m values are trained).\n    - surname (no m values are trained).\n    - dob (no m values are trained).\n    - city (no m values are trained).\n    - email (no m values are trained).\n
training_blocking_rule = \"l.first_name = r.first_name and l.surname = r.surname\"\ntraining_session_fname_sname = (\n    linker.training.estimate_parameters_using_expectation_maximisation(training_blocking_rule)\n)\n\ntraining_blocking_rule = \"l.dob = r.dob\"\ntraining_session_dob = linker.training.estimate_parameters_using_expectation_maximisation(\n    training_blocking_rule\n)\n
----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\nl.first_name = r.first_name and l.surname = r.surname\n\nParameter estimates will be made for the following comparison(s):\n    - dob\n    - city\n    - email\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - first_name\n    - surname\n\nIteration 1: Largest change in params was -0.709 in probability_two_random_records_match\nIteration 2: Largest change in params was 0.0573 in the m_probability of email, level `All other comparisons`\nIteration 3: Largest change in params was 0.0215 in the m_probability of email, level `All other comparisons`\nIteration 4: Largest change in params was 0.00888 in the m_probability of email, level `All other comparisons`\n\nEM converged after 4 iterations\n\nYour model is not yet fully trained. Missing estimates for:\n    - first_name (no m values are trained).\n    - surname (no m values are trained).\n\n----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\nl.dob = r.dob\n\nParameter estimates will be made for the following comparison(s):\n    - first_name\n    - surname\n    - city\n    - email\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - dob\n\nWARNING:                                                                        \nLevel Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value\n\nIteration 1: Largest change in params was -0.548 in the m_probability of surname, level `Exact match on surname`\nIteration 2: Largest change in params was 0.129 in probability_two_random_records_match\nIteration 3: Largest change in params was 0.0313 in probability_two_random_records_match\nIteration 4: Largest change in params was 0.0128 in probability_two_random_records_match\nIteration 5: Largest 
change in params was 0.00651 in probability_two_random_records_match\n\nEM converged after 5 iterations\nm probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.\n\nYour model is fully trained. All comparisons have at least one estimate for their m and u values\n
results = linker.inference.predict(threshold_match_probability=0.9)\n
Blocking time: 4.65 seconds                                                     \nPredict time: 82.92 seconds\n
spark_df = results.as_spark_dataframe().show()\n
+------------------+------------------+-----------+-----------+------------+------------+----------------+---------------+---------------+------------------+--------------------+---------+---------+-------------+------------+------------+-------------------+------------------+----------+----------+---------+------------------+----------+----------+----------+---------+---------+------------------+------------------+--------------------+--------------------+-----------+----------+----------+-------------------+-------------------+---------+\n|      match_weight| match_probability|unique_id_l|unique_id_r|first_name_l|first_name_r|gamma_first_name|tf_first_name_l|tf_first_name_r|     bf_first_name|bf_tf_adj_first_name|surname_l|surname_r|gamma_surname|tf_surname_l|tf_surname_r|         bf_surname| bf_tf_adj_surname|     dob_l|     dob_r|gamma_dob|            bf_dob|    city_l|    city_r|gamma_city|tf_city_l|tf_city_r|           bf_city|    bf_tf_adj_city|             email_l|             email_r|gamma_email|tf_email_l|tf_email_r|           bf_email|    bf_tf_adj_email|match_key|\n+------------------+------------------+-----------+-----------+------------+------------+----------------+---------------+---------------+------------------+--------------------+---------+---------+-------------+------------+------------+-------------------+------------------+----------+----------+---------+------------------+----------+----------+----------+---------+---------+------------------+------------------+--------------------+--------------------+-----------+----------+----------+-------------------+-------------------+---------+\n|15.131885475840011|0.9999721492762709|         51|         56|      Jayden|      Jayden|               4|          0.008|          0.008|11.371009132404957|  4.0525525525525525|  Bennett|  Bennett|            4|       0.006|       0.006|  9.113630950205666| 5.981981981981981|2017-01-11|2017-02-10|        1|14.373012181955707|   Swansea|   Swansea|         
1|    0.013|    0.013|5.8704874944935215| 5.481481481481482|                 NaN|       jb88@king.com|          0|     0.211|     0.004|0.35260600559686806|                1.0|        0|\n|  7.86514930254232|0.9957293356289956|        575|        577|     Jessica|     Jessica|               4|          0.011|          0.011|11.371009132404957|  2.9473109473109473|     Owen|      NaN|            0|       0.006|       0.181|0.45554364195240765|               1.0|1974-11-17|1974-11-17|        3|220.92747883214062|       NaN|       NaN|         1|    0.187|    0.187|5.8704874944935215|0.3810655575361458|                 NaN|jessica.owen@elli...|          0|     0.211|     0.002|0.35260600559686806|                1.0|        0|\n| 5.951711022429932|0.9841000517299358|        171|        174|         NaN|        Leah|               0|          0.169|          0.002|0.4452000905514796|                 1.0|  Russell|  Russell|            4|        0.01|        0.01|  9.113630950205666| 3.589189189189189|2011-06-08|2012-07-09|        0|0.2607755750325071|    London|    London|         1|    0.173|    0.173|5.8704874944935215|0.4119032327124813|leahrussell@charl...|leahrussell@charl...|          4|     0.005|     0.005|  8.411105418567649|  9.143943943943944|        1|\n|21.650093935297473|0.9999996961409438|        518|        519|      Amelia|     Amlelia|               2|          0.009|          0.001| 47.10808446952784|                 1.0|   Morgan|   Morgan|            4|       0.012|       0.012|  9.113630950205666|2.9909909909909906|2011-05-26|2011-05-26|        3|220.92747883214062|   Swindno|   Swindon|         0|    0.001|     0.01|0.6263033203299755|               1.0|amelia.morgan92@d...|amelia.morgan92@d...|          3|     0.004|     0.001| 211.35554441198767|                1.0|        1|\n|11.456207518049865|0.9996442185022277|        752|        754|        Jaes|         NaN|               0|          0.001|          0.169|0.4452000905514796|              
   1.0|      NaN|      NaN|            4|       0.181|       0.181|  9.113630950205666|0.1982977452590712|1972-07-20|1971-07-20|        2| 84.28155355946456|       NaN|       NaN|         1|    0.187|    0.187|5.8704874944935215|0.3810655575361458|       j.c@white.org|      j.c@whige.wort|          3|     0.002|     0.001| 211.35554441198767|                1.0|        1|\n|24.387299048327478|0.9999999544286963|        760|        761|       Henry|       Henry|               4|          0.009|          0.009|11.371009132404957|   3.602268935602269|      Day|      Day|            4|       0.004|       0.004|  9.113630950205666| 8.972972972972972|2002-09-15|2002-08-18|        1|14.373012181955707|     Leeds|     Leeds|         1|    0.017|    0.017|5.8704874944935215| 4.191721132897603|hday48@thomas-car...|hday48@thomas-car...|          3|     0.003|     0.001| 211.35554441198767|                1.0|        0|\n|12.076660303346712|0.9997685471829967|        920|        922|         Evi|        Evie|               3|          0.001|          0.007| 61.79623639995749|                 1.0|    Jones|    Jones|            4|       0.023|       0.023|  9.113630950205666|1.5605170387779081|2012-06-19|2002-07-22|        0|0.2607755750325071|       NaN|       NaN|         1|    0.187|    0.187|5.8704874944935215|0.3810655575361458|eviejones@brewer-...|eviejones@brewer-...|          4|     0.004|     0.004|  8.411105418567649|  11.42992992992993|        1|\n| 4.002786788974079|0.9412833223288347|        171|        175|         NaN|       Lheah|               0|          0.169|          0.001|0.4452000905514796|                 1.0|  Russell|  Russell|            4|        0.01|        0.01|  9.113630950205666| 3.589189189189189|2011-06-08|2011-07-10|        0|0.2607755750325071|    London|   Londoon|         0|    0.173|    0.002|0.6263033203299755|               1.0|leahrussell@charl...|leahrussell@charl...|          4|     0.005|     0.005|  8.411105418567649|  
9.143943943943944|        1|\n|19.936162812706836|0.9999990031804153|        851|        853|    Mhichael|     Michael|               2|          0.001|          0.006| 47.10808446952784|                 1.0|      NaN|      NaN|            4|       0.181|       0.181|  9.113630950205666|0.1982977452590712|2000-04-03|2000-04-03|        3|220.92747883214062|    London|    London|         1|    0.173|    0.173|5.8704874944935215|0.4119032327124813|      m.w@cannon.com|      m@w.cannon.com|          2|     0.002|     0.001| 251.69908796212906|                1.0|        1|\n| 21.33290823458872|0.9999996214227064|        400|        402|       James|       James|               4|          0.013|          0.013|11.371009132404957|  2.4938784938784937|    Dixon|    Dixon|            4|       0.009|       0.009|  9.113630950205666| 3.987987987987988|1991-04-12|1991-04-12|        3|220.92747883214062|       NaN|   Loodnon|         0|    0.187|    0.001|0.6263033203299755|               1.0|james.d@merritot-...|james.d@merritt-s...|          3|     0.001|     0.005| 211.35554441198767|                1.0|        0|\n|22.169132705637786|0.9999997879560012|         81|         84|        Ryan|        Ryan|               4|          0.005|          0.005|11.371009132404957|   6.484084084084084|     Cole|     Cole|            4|       0.005|       0.005|  9.113630950205666| 7.178378378378378|1987-05-27|1988-05-27|        2| 84.28155355946456|       NaN|   Bristol|         0|    0.187|    0.016|0.6263033203299755|               1.0|r.cole1@ramirez-a...|r.cole1@ramtrez-a...|          3|     0.005|     0.001| 211.35554441198767|                1.0|        0|\n|6.1486678498977065|0.9861008615160808|        652|        654|         NaN|         NaN|               4|          0.169|          0.169|11.371009132404957| 0.19183680722142257|  Roberts|      NaN|            0|       0.006|       0.181|0.45554364195240765|               1.0|1990-10-26|1990-10-26|        
3|220.92747883214062|Birmingham|Birmingham|         1|     0.04|     0.04|5.8704874944935215|1.7814814814814814|                 NaN|droberts73@taylor...|          0|     0.211|     0.003|0.35260600559686806|                1.0|        0|\n|17.935398542824068|0.9999960106207738|        582|        584|      ilivOa|      Olivia|               1|          0.001|          0.014| 3.944098136204933|                 1.0|  Edwards|  Edwards|            4|       0.008|       0.008|  9.113630950205666| 4.486486486486486|1988-12-27|1988-12-27|        3|220.92747883214062|    Dudley|   Duudley|         0|    0.006|    0.001|0.6263033203299755|               1.0|      oe56@lopez.net|      oe56@lopez.net|          4|     0.003|     0.003|  8.411105418567649| 15.239906573239907|        1|\n|21.036204363210302|0.9999995349803662|        978|        981|     Jessica|     Jessica|               4|          0.011|          0.011|11.371009132404957|  2.9473109473109473|   Miller|  Miiller|            3|       0.004|       0.001|  82.56312210691897|               1.0|2001-05-23|2001-05-23|        3|220.92747883214062|       NaN|  Coventry|         0|    0.187|    0.021|0.6263033203299755|               1.0|jessica.miller@jo...|jessica.miller@jo...|          4|     0.006|     0.006|  8.411105418567649|  7.619953286619953|        0|\n|13.095432674729635|0.9998857562788657|        684|        686|       Rosie|       Rosie|               4|          0.005|          0.005|11.371009132404957|   6.484084084084084|  Johnstn| Johnston|            3|       0.001|       0.002|  82.56312210691897|               1.0|1979-12-23|1978-11-23|        1|14.373012181955707|       NaN| Sheffield|         0|    0.187|    0.007|0.6263033203299755|               1.0|                 NaN|                 NaN|          4|     0.211|     0.211|  8.411105418567649|0.21668113611241574|        0|\n|25.252698357543103|0.9999999749861632|        279|        280|        Lola|        Lola|               4|          
0.008|          0.008|11.371009132404957|  4.0525525525525525|   Taylor|   Taylor|            4|       0.014|       0.014|  9.113630950205666|2.5637065637065635|2017-11-20|2016-11-20|        2| 84.28155355946456|  Aberdeen|  Aberdeen|         1|    0.016|    0.016|5.8704874944935215| 4.453703703703703|lolat86@bishop-gi...|lolat86@bishop-gi...|          4|     0.002|     0.002|  8.411105418567649|  22.85985985985986|        0|\n| 9.711807138722323|0.9988089303569408|         42|         43|    Theodore|    Theodore|               4|           0.01|           0.01|11.371009132404957|   3.242042042042042|   Morris|   Morris|            4|       0.004|       0.004|  9.113630950205666| 8.972972972972972|1978-09-18|1978-08-19|        1|14.373012181955707|Birgmhniam|Birmingham|         0|    0.001|     0.04|0.6263033203299755|               1.0|                 NaN|t.m39@brooks-sawy...|          0|     0.211|     0.005|0.35260600559686806|                1.0|        0|\n| 5.951711022429932|0.9841000517299358|        173|        174|         NaN|        Leah|               0|          0.169|          0.002|0.4452000905514796|                 1.0|  Russell|  Russell|            4|        0.01|        0.01|  9.113630950205666| 3.589189189189189|2011-06-08|2012-07-09|        0|0.2607755750325071|    London|    London|         1|    0.173|    0.173|5.8704874944935215|0.4119032327124813|leahrussell@charl...|leahrussell@charl...|          4|     0.005|     0.005|  8.411105418567649|  9.143943943943944|        1|\n| 23.43211696288854|0.9999999116452517|         88|         89|        Lexi|        Lexi|               4|          0.003|          0.003|11.371009132404957|  10.806806806806806|      NaN|      NaN|            4|       0.181|       0.181|  9.113630950205666|0.1982977452590712|1994-09-02|1994-09-02|        3|220.92747883214062|Birmingham|Birmingham|         1|     0.04|     0.04|5.8704874944935215|1.7814814814814814|l.gordon34cfren@h...|l.gordon34@french...|          2|  
   0.001|     0.002| 251.69908796212906|                1.0|        0|\n|7.1659948250873144|0.9930847652376709|        391|        393|       Isaac|       Isaac|               4|          0.005|          0.005|11.371009132404957|   6.484084084084084|      NaN|    James|            0|       0.181|       0.007|0.45554364195240765|               1.0|1991-05-06|1991-05-06|        3|220.92747883214062|     Lodon|    London|         0|    0.008|    0.173|0.6263033203299755|               1.0|isaac.james@smich...|                 NaN|          0|     0.001|     0.211|0.35260600559686806|                1.0|        0|\n+------------------+------------------+-----------+-----------+------------+------------+----------------+---------------+---------------+------------------+--------------------+---------+---------+-------------+------------+------------+-------------------+------------------+----------+----------+---------+------------------+----------+----------+----------+---------+---------+------------------+------------------+--------------------+--------------------+-----------+----------+----------+-------------------+-------------------+---------+\nonly showing top 20 rows\n
"},{"location":"demos/examples/sqlite/deduplicate_50k_synthetic.html","title":"Deduplicate 50k rows historical persons","text":""},{"location":"demos/examples/sqlite/deduplicate_50k_synthetic.html#linking-a-dataset-of-real-historical-persons","title":"Linking a dataset of real historical persons","text":"

In this example, we deduplicate a more realistic dataset. The data is based on historical persons scraped from Wikidata. Duplicate records have been introduced, with a variety of errors.

Note, as explained in the backends topic guide, SQLite does not natively support fuzzy string matching functions such as Damerau-Levenshtein and Jaro-Winkler (as used in this example). Instead, these have been imported as Python User Defined Functions (UDFs). One drawback of Python UDFs is that they are considerably slower than native SQL comparisons. As such, if you are hitting issues with long run times, consider switching to DuckDB (or another backend).

# Uncomment and run this cell if you're running in Google Colab.\n# !pip install splink\n# !pip install rapidfuzz\n
import pandas as pd\n\nfrom splink import splink_datasets\n\npd.options.display.max_rows = 1000\n# reduce size of dataset to make things run faster\ndf = splink_datasets.historical_50k.sample(5000)\n
from splink.backends.sqlite import SQLiteAPI\nfrom splink.exploratory import profile_columns\n\ndb_api = SQLiteAPI()\nprofile_columns(\n    df, db_api, column_expressions=[\"first_name\", \"postcode_fake\", \"substr(dob, 1,4)\"]\n)\n
from splink import block_on\nfrom splink.blocking_analysis import (\n    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,\n)\n\nblocking_rules =  [block_on(\"first_name\", \"surname\"),\n        block_on(\"surname\", \"dob\"),\n        block_on(\"first_name\", \"dob\"),\n        block_on(\"postcode_fake\", \"first_name\")]\n\ndb_api = SQLiteAPI()\n\ncumulative_comparisons_to_be_scored_from_blocking_rules_chart(\n    table_or_tables=df,\n    blocking_rules=blocking_rules,\n    db_api=db_api,\n    link_type=\"dedupe_only\"\n)\n
import splink.comparison_library as cl\nfrom splink import Linker\n\nsettings = {\n    \"link_type\": \"dedupe_only\",\n    \"blocking_rules_to_generate_predictions\": [\n        block_on(\"first_name\", \"surname\"),\n        block_on(\"surname\", \"dob\"),\n        block_on(\"first_name\", \"dob\"),\n        block_on(\"postcode_fake\", \"first_name\"),\n\n    ],\n    \"comparisons\": [\n        cl.NameComparison(\"first_name\"),\n        cl.NameComparison(\"surname\"),\n        cl.DamerauLevenshteinAtThresholds(\"dob\", [1, 2]).configure(\n            term_frequency_adjustments=True\n        ),\n        cl.DamerauLevenshteinAtThresholds(\"postcode_fake\", [1, 2]),\n        cl.ExactMatch(\"birth_place\").configure(term_frequency_adjustments=True),\n        cl.ExactMatch(\n            \"occupation\",\n        ).configure(term_frequency_adjustments=True),\n    ],\n    \"retain_matching_columns\": True,\n    \"retain_intermediate_calculation_columns\": True,\n    \"max_iterations\": 10,\n    \"em_convergence\": 0.01,\n}\n\nlinker = Linker(df, settings, db_api=db_api)\n
linker.training.estimate_probability_two_random_records_match(\n    [\n        \"l.first_name = r.first_name and l.surname = r.surname and l.dob = r.dob\",\n        \"substr(l.first_name,1,2) = substr(r.first_name,1,2) and l.surname = r.surname and substr(l.postcode_fake,1,2) = substr(r.postcode_fake,1,2)\",\n        \"l.dob = r.dob and l.postcode_fake = r.postcode_fake\",\n    ],\n    recall=0.6,\n)\n
Probability two random records match is estimated to be  0.000125.\nThis means that amongst all possible pairwise record comparisons, one in 7,985.62 are expected to match.  With 12,497,500 total possible comparisons, we expect a total of around 1,565.00 matching pairs\n
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.\n----- Estimating u probabilities using random sampling -----\nu probability not trained for first_name - Jaro-Winkler distance of first_name >= 0.88 (comparison vector value: 2). This usually means the comparison level was never observed in the training data.\nu probability not trained for surname - Jaro-Winkler distance of surname >= 0.88 (comparison vector value: 2). This usually means the comparison level was never observed in the training data.\n\nEstimated u probabilities using random sampling\n\nYour model is not yet fully trained. Missing estimates for:\n    - first_name (some u values are not trained, no m values are trained).\n    - surname (some u values are not trained, no m values are trained).\n    - dob (no m values are trained).\n    - postcode_fake (no m values are trained).\n    - birth_place (no m values are trained).\n    - occupation (no m values are trained).\n
training_blocking_rule = \"l.first_name = r.first_name and l.surname = r.surname\"\ntraining_session_names = linker.training.estimate_parameters_using_expectation_maximisation(\n    training_blocking_rule, estimate_without_term_frequencies=True\n)\n
----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\nl.first_name = r.first_name and l.surname = r.surname\n\nParameter estimates will be made for the following comparison(s):\n    - dob\n    - postcode_fake\n    - birth_place\n    - occupation\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - first_name\n    - surname\n\nIteration 1: Largest change in params was -0.438 in probability_two_random_records_match\nIteration 2: Largest change in params was -0.0347 in probability_two_random_records_match\nIteration 3: Largest change in params was -0.0126 in the m_probability of birth_place, level `All other comparisons`\nIteration 4: Largest change in params was 0.00644 in the m_probability of birth_place, level `Exact match on birth_place`\n\nEM converged after 4 iterations\n\nYour model is not yet fully trained. Missing estimates for:\n    - first_name (some u values are not trained, no m values are trained).\n    - surname (some u values are not trained, no m values are trained).\n
training_blocking_rule = \"l.dob = r.dob\"\ntraining_session_dob = linker.training.estimate_parameters_using_expectation_maximisation(\n    training_blocking_rule, estimate_without_term_frequencies=True\n)\n
----- Starting EM training session -----\n\nEstimating the m probabilities of the model by blocking on:\nl.dob = r.dob\n\nParameter estimates will be made for the following comparison(s):\n    - first_name\n    - surname\n    - postcode_fake\n    - birth_place\n    - occupation\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - dob\n\nWARNING:\nLevel Jaro-Winkler distance of first_name >= 0.88 on comparison first_name not observed in dataset, unable to train m value\n\nWARNING:\nLevel Jaro-Winkler distance of surname >= 0.88 on comparison surname not observed in dataset, unable to train m value\n\nIteration 1: Largest change in params was 0.327 in the m_probability of first_name, level `All other comparisons`\nIteration 2: Largest change in params was -0.0566 in the m_probability of surname, level `Exact match on surname`\nIteration 3: Largest change in params was -0.0184 in the m_probability of surname, level `Exact match on surname`\nIteration 4: Largest change in params was -0.006 in the m_probability of surname, level `Exact match on surname`\n\nEM converged after 4 iterations\nm probability not trained for first_name - Jaro-Winkler distance of first_name >= 0.88 (comparison vector value: 2). This usually means the comparison level was never observed in the training data.\nm probability not trained for surname - Jaro-Winkler distance of surname >= 0.88 (comparison vector value: 2). This usually means the comparison level was never observed in the training data.\n\nYour model is not yet fully trained. Missing estimates for:\n    - first_name (some u values are not trained, some m values are not trained).\n    - surname (some u values are not trained, some m values are not trained).\n

The final match weights can be viewed in the match weights chart:

linker.visualisations.match_weights_chart()\n
linker.evaluation.unlinkables_chart()\n
df_predict = linker.inference.predict()\ndf_e = df_predict.as_pandas_dataframe(limit=5)\ndf_e\n
 -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'first_name':\n    m values not fully trained\nComparison: 'first_name':\n    u values not fully trained\nComparison: 'surname':\n    m values not fully trained\nComparison: 'surname':\n    u values not fully trained\n
match_weight match_probability unique_id_l unique_id_r first_name_l first_name_r gamma_first_name tf_first_name_l tf_first_name_r bf_first_name ... bf_birth_place bf_tf_adj_birth_place occupation_l occupation_r gamma_occupation tf_occupation_l tf_occupation_r bf_occupation bf_tf_adj_occupation match_key 0 26.932083 1.000000 Q446382-1 Q446382-3 marianne marianne 4 0.000801 0.000801 51.871289 ... 0.162366 1.000000 None None -1 NaN NaN 1.000000 1.000000 0 1 30.788800 1.000000 Q2835078-1 Q2835078-2 alfred alfred 4 0.013622 0.013622 51.871289 ... 197.452526 0.607559 None None -1 NaN NaN 1.000000 1.000000 0 2 23.882340 1.000000 Q2835078-1 Q2835078-5 alfred alfred 4 0.013622 0.013622 51.871289 ... 1.000000 1.000000 None None -1 NaN NaN 1.000000 1.000000 0 3 39.932187 1.000000 Q80158702-1 Q80158702-4 john john 4 0.053085 0.053085 51.871289 ... 197.452526 2.025198 sculptor sculptor 1 0.002769 0.002769 23.836781 13.868019 0 4 17.042339 0.999993 Q18810722-3 Q18810722-6 frederick frederick 4 0.012220 0.012220 51.871289 ... 197.452526 0.607559 printer printer 1 0.000791 0.000791 23.836781 48.538067 0

5 rows \u00d7 44 columns

You can also view rows in this dataset as a waterfall chart as follows:

records_to_plot = df_e.to_dict(orient=\"records\")\nlinker.visualisations.waterfall_chart(records_to_plot, filter_nulls=False)\n
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(\n    df_predict, threshold_match_probability=0.95\n)\n
Completed iteration 1, root rows count 5\nCompleted iteration 2, root rows count 0\n
linker.visualisations.cluster_studio_dashboard(\n    df_predict,\n    clusters,\n    \"dashboards/50k_cluster.html\",\n    sampling_method=\"by_cluster_size\",\n    overwrite=True,\n)\n\nfrom IPython.display import IFrame\n\nIFrame(src=\"./dashboards/50k_cluster.html\", width=\"100%\", height=1200)\n

linker.evaluation.accuracy_analysis_from_labels_column(\n    \"cluster\", output_type=\"roc\", match_weight_round_to_nearest=0.02\n)\n
 -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'first_name':\n    m values not fully trained\nComparison: 'first_name':\n    u values not fully trained\nComparison: 'surname':\n    m values not fully trained\nComparison: 'surname':\n    u values not fully trained\n
records = linker.evaluation.prediction_errors_from_labels_column(\n    \"cluster\",\n    threshold_match_probability=0.999,\n    include_false_negatives=False,\n    include_false_positives=True,\n).as_record_dict()\nlinker.visualisations.waterfall_chart(records)\n
 -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'first_name':\n    m values not fully trained\nComparison: 'first_name':\n    u values not fully trained\nComparison: 'surname':\n    m values not fully trained\nComparison: 'surname':\n    u values not fully trained\n
# Some of the false negatives will be because they weren't detected by the blocking rules\nrecords = linker.evaluation.prediction_errors_from_labels_column(\n    \"cluster\",\n    threshold_match_probability=0.5,\n    include_false_negatives=True,\n    include_false_positives=False,\n).as_record_dict(limit=50)\n\nlinker.visualisations.waterfall_chart(records)\n
 -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'first_name':\n    m values not fully trained\nComparison: 'first_name':\n    u values not fully trained\nComparison: 'surname':\n    m values not fully trained\nComparison: 'surname':\n    u values not fully trained\n
"},{"location":"demos/tutorials/00_Tutorial_Introduction.html","title":"Introduction","text":""},{"location":"demos/tutorials/00_Tutorial_Introduction.html#introductory-tutorial","title":"Introductory tutorial","text":"

This is the introduction to a seven part tutorial which demonstrates how to de-duplicate a small dataset using simple settings.

The aim of the tutorial is to demonstrate core Splink functionality succinctly, rather than comprehensively document all configuration options.

The seven parts are:

  • 1. Data prep pre-requisites

  • 2. Exploratory analysis

  • 3. Choosing blocking rules to optimise runtimes

  • 4. Estimating model parameters

  • 5. Predicting results

  • 6. Visualising predictions

  • 7. Evaluation

Throughout the tutorial, we use the DuckDB backend, which is the recommended option for smaller datasets of up to around 1 million records on a normal laptop.

You can find these tutorial notebooks in the docs/demos/tutorials/ folder of the splink repo, or click the Colab links to run in your browser.

"},{"location":"demos/tutorials/00_Tutorial_Introduction.html#end-to-end-demos","title":"End-to-end demos","text":"

After following the steps of the tutorial, it might prove useful to have a look at some of the example notebooks that show various use-case scenarios of Splink from start to finish.

"},{"location":"demos/tutorials/00_Tutorial_Introduction.html#interactive-introduction-to-record-linkage-theory","title":"Interactive Introduction to Record Linkage Theory","text":"

If you'd like to learn more about record linkage theory, an interactive introduction is available here.

"},{"location":"demos/tutorials/01_Prerequisites.html","title":"1. Data prep prerequisites","text":""},{"location":"demos/tutorials/01_Prerequisites.html#data-prerequisites","title":"Data Prerequisites","text":"

Splink requires that you clean your data and assign unique IDs to rows before linking.

This section outlines the additional data cleaning steps needed before loading data into Splink.

"},{"location":"demos/tutorials/01_Prerequisites.html#unique-ids","title":"Unique IDs","text":"
  • Each input dataset must have a unique ID column, which is unique within the dataset. By default, Splink assumes this column will be called unique_id, but this can be changed with the unique_id_column_name key in your Splink settings. The unique ID is essential because it enables Splink to keep track of each row correctly.
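If an input dataset lacks such a column, one can be added before loading. A minimal pandas sketch (the toy data is illustrative; `unique_id` matches Splink's default column name):

```python
import pandas as pd

# Toy input with no ID column (illustrative data)
df = pd.DataFrame(
    {"first_name": ["Robert", "Rob"], "surname": ["Allen", "Allen"]}
)

# Add a unique_id column -- Splink's default unique_id_column_name
df["unique_id"] = range(len(df))

assert df["unique_id"].is_unique
```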
"},{"location":"demos/tutorials/01_Prerequisites.html#conformant-input-datasets","title":"Conformant input datasets","text":"
  • Input datasets must be conformant, meaning they share the same column names and data formats. For instance, if one dataset has a \"date of birth\" column and another has a \"dob\" column, rename them to match. Ensure data type and number formatting are consistent across both columns. The order of columns in input dataframes is not important.
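Making two datasets conformant is often just a rename. A sketch with made-up column names:

```python
import pandas as pd

# Two illustrative datasets whose column names disagree
df_a = pd.DataFrame({"first_name": ["Grace"], "date_of_birth": ["1997-04-26"]})
df_b = pd.DataFrame({"forename": ["Grace"], "dob": ["1997-04-26"]})

# Rename df_b's columns so both datasets share the same schema
df_b = df_b.rename(columns={"forename": "first_name", "dob": "date_of_birth"})

assert list(df_a.columns) == list(df_b.columns)
```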
"},{"location":"demos/tutorials/01_Prerequisites.html#cleaning","title":"Cleaning","text":"
  • Ensure data consistency by cleaning your data. This process includes standardizing date formats, matching text case, and handling invalid data. For example, if one dataset uses \"yyyy-mm-dd\" date format and another uses \"mm/dd/yyyy,\" convert them to the same format before using Splink. Also try to identify and rectify obvious data entry errors, for example by removing values such as 'Mr' or 'Mrs' from a 'first name' column.
"},{"location":"demos/tutorials/01_Prerequisites.html#ensure-nulls-are-consistently-and-correctly-represented","title":"Ensure nulls are consistently and correctly represented","text":"
  • Ensure null values (or other 'not known' indicators) are represented as true nulls, not empty strings. Splink treats null values differently from empty strings, so using true nulls guarantees proper matching across datasets.
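One way to convert empty strings and placeholder values into true nulls before loading (a pandas sketch; the placeholder strings are examples only):

```python
import pandas as pd

df = pd.DataFrame({"email": ["a@b.com", "", "N/A"]})

# Map empty strings and 'not known' placeholders to true nulls,
# so Splink treats them as missing rather than as matching values
df["email"] = df["email"].replace({"": None, "N/A": None})
```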
"},{"location":"demos/tutorials/01_Prerequisites.html#further-details-on-data-cleaning-and-standardisation","title":"Further details on data cleaning and standardisation","text":"

Splink performs optimally with cleaned and standardized data. Here is a non-exhaustive list of suggestions for data cleaning rules to enhance matching accuracy:

  • Trim leading and trailing whitespace from string values (e.g., \" john smith \" becomes \"john smith\").
  • Remove special characters from string values (e.g., \"O'Hara\" becomes \"Ohara\").
  • Standardise date formats as strings in \"yyyy-mm-dd\" format.
  • Replace abbreviations with full words (e.g., standardize \"St.\" and \"Street\" to \"Street\").
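The first three rules above can be sketched in pandas (illustrative data and patterns; adapt to your own columns):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "first_name": ["  John Smith ", "O'Hara"],
        "dob": ["03/15/1990", "04/26/1997"],  # mm/dd/yyyy strings
    }
)

# Trim whitespace and normalise case
df["first_name"] = df["first_name"].str.strip().str.lower()
# Remove special characters, keeping letters and spaces
df["first_name"] = df["first_name"].str.replace(r"[^a-z\s]", "", regex=True)
# Standardise dates to yyyy-mm-dd strings
df["dob"] = pd.to_datetime(df["dob"], format="%m/%d/%Y").dt.strftime("%Y-%m-%d")
```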
"},{"location":"demos/tutorials/02_Exploratory_analysis.html","title":"2. Exploratory analysis","text":""},{"location":"demos/tutorials/02_Exploratory_analysis.html#exploratory-analysis","title":"Exploratory analysis","text":"

Exploratory analysis helps you understand features of your data which are relevant to linking or deduplicating your data.

Splink includes a variety of charts to help with this, which are demonstrated in this notebook.

"},{"location":"demos/tutorials/02_Exploratory_analysis.html#read-in-the-data","title":"Read in the data","text":"

For the purpose of this tutorial we will use a 1,000 row synthetic dataset that contains duplicates.

The first five rows of this dataset are printed below.

Note that the cluster column represents the 'ground truth' - a column which tells us which rows refer to the same person. In most real linkage scenarios, we wouldn't have this column (this is what Splink is trying to estimate).

from splink import  splink_datasets\n\ndf = splink_datasets.fake_1000\ndf = df.drop(columns=[\"cluster\"])\ndf.head(5)\n
| unique_id | first_name | surname | dob | city | email |
|---|---|---|---|---|---|
| 0 | Robert | Alan | 1971-06-24 | NaN | robert255@smith.net |
| 1 | Robert | Allen | 1971-05-24 | NaN | roberta25@smith.net |
| 2 | Rob | Allen | 1971-06-24 | London | roberta25@smith.net |
| 3 | Robert | Alen | 1971-06-24 | Lonon | NaN |
| 4 | Grace | NaN | 1997-04-26 | Hull | grace.kelly52@jones.com |
"},{"location":"demos/tutorials/02_Exploratory_analysis.html#analyse-missingness","title":"Analyse missingness","text":"

It's important to understand the level of missingness in your data, because columns with higher levels of missingness are less useful for data linking.

from splink.exploratory import completeness_chart\nfrom splink import DuckDBAPI\ndb_api = DuckDBAPI()\ncompleteness_chart(df, db_api=db_api)\n

The above summary chart shows that in this dataset, the email, city, surname and first name columns contain nulls, but the level of missingness is relatively low (less than 22%).

"},{"location":"demos/tutorials/02_Exploratory_analysis.html#analyse-the-distribution-of-values-in-your-data","title":"Analyse the distribution of values in your data","text":"

The distribution of values in your data is important for two main reasons:

  1. Columns with higher cardinality (number of distinct values) are usually more useful for data linking. For instance, date of birth is a much stronger linkage variable than gender.

  2. The skew of values is important. If you have a city column that has 1,000 distinct values, but 75% of them are London, this is much less useful for linkage than if the 1,000 values were equally distributed.
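Both properties can be checked quickly with pandas before (or alongside) profiling. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "city": ["London"] * 75 + ["Leeds"] * 15 + ["Hull"] * 10,
        "dob": pd.date_range("1970-01-01", periods=100).astype(str),
    }
)

for col in ["city", "dob"]:
    cardinality = df[col].nunique()
    # Share of the most common value: a crude measure of skew
    top_share = df[col].value_counts(normalize=True).iloc[0]
    print(col, cardinality, round(top_share, 2))
```

Here dob has high cardinality and no skew, so it is the stronger linkage variable of the two.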

The profile_columns() function creates summary charts to help you understand these aspects of your data.

To profile all columns, leave the column_expressions argument empty.

from splink.exploratory import profile_columns\n\nprofile_columns(df, db_api=DuckDBAPI(), top_n=10, bottom_n=5)\n

This chart is very information-dense, but here are some key takeaways relevant to our linkage:

  • There is strong skew in the city field with around 20% of the values being London. We therefore will probably want to use term_frequency_adjustments in our linkage model, so that it can weight a match on London differently to a match on, say, Norwich.

  • Looking at the \"Bottom 5 values by value count\", we can see typos in the data in most fields. This tells us this information was possibly entered by hand, or using Optical Character Recognition, giving us an insight into the type of data entry errors we may see.

  • Email is a much more uniquely-identifying field than the others, with a maximum value count of 6. It's likely to be a strong linking variable.

Further Reading

For more on exploratory analysis tools in Splink, please refer to the Exploratory Analysis API documentation.

For more on the charts used in this tutorial, please refer to the Charts Gallery.

"},{"location":"demos/tutorials/02_Exploratory_analysis.html#next-steps","title":"Next steps","text":"

At this point, we have begun to develop a strong understanding of our data. It's time to move on to estimating a linkage model.

"},{"location":"demos/tutorials/03_Blocking.html","title":"3. Blocking","text":""},{"location":"demos/tutorials/03_Blocking.html#choosing-blocking-rules-to-optimise-runtime","title":"Choosing blocking rules to optimise runtime","text":"

To link records, we need to compare pairs of records and decide which pairs are matches.

For example consider the following two records:

| first_name | surname | dob | city | email |
|---|---|---|---|---|
| Robert | Allen | 1971-05-24 | nan | roberta25@smith.net |
| Rob | Allen | 1971-06-24 | London | roberta25@smith.net |

These can be represented as a pairwise comparison as follows:

| first_name_l | first_name_r | surname_l | surname_r | dob_l | dob_r | city_l | city_r | email_l | email_r |
|---|---|---|---|---|---|---|---|---|---|
| Robert | Rob | Allen | Allen | 1971-05-24 | 1971-06-24 | nan | London | roberta25@smith.net | roberta25@smith.net |

For most large datasets, it is computationally intractable to compare every row with every other row, since the number of comparisons rises quadratically with the number of records.
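To see why, note that n records produce n(n-1)/2 distinct pairs:

```python
def n_comparisons(n: int) -> int:
    """Number of distinct pairwise comparisons among n records."""
    return n * (n - 1) // 2

print(n_comparisons(1_000))      # 499,500 -- easily tractable
print(n_comparisons(1_000_000))  # 499,999,500,000 -- intractable without blocking
```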

Instead we rely on blocking rules, which specify which pairwise comparisons to generate. For example, we could generate the subset of pairwise comparisons where either first name or surname matches.

This is part of a two step process to link data:

  1. Use blocking rules to generate candidate pairwise record comparisons

  2. Use a probabilistic linkage model to score these candidate pairs, to determine which ones should be linked

Blocking rules are the most important determinant of the performance of your linkage job.

When deciding on your blocking rules, you're trading off accuracy for performance:

  • If your rules are too loose, your linkage job may fail.
  • If they're too tight, you may miss some valid links.

This tutorial clarifies what blocking rules are, and how to choose good rules.

"},{"location":"demos/tutorials/03_Blocking.html#blocking-rules-in-splink","title":"Blocking rules in Splink","text":"

In Splink, blocking rules are specified as SQL expressions.

For example, to generate the subset of record comparisons where the first name and surname matches, we can specify the following blocking rule:

from splink import block_on\nblock_on(\"first_name\", \"surname\")\n

When executed, this blocking rule will be converted to a SQL statement with the following form:

SELECT ...\nFROM input_tables as l\nINNER JOIN input_tables as r\nON l.first_name = r.first_name AND l.surname = r.surname\n

Since blocking rules are SQL expressions, they can be arbitrarily complex. For example, you could create record comparisons where the initial of the first name and the surname match with the following rule:

from splink import block_on\nblock_on(\"substr(first_name, 1, 2)\", \"surname\")\n
"},{"location":"demos/tutorials/03_Blocking.html#devising-effective-blocking-rules-for-prediction","title":"Devising effective blocking rules for prediction","text":"

The aims of your blocking rules are twofold:

  1. Eliminate enough non-matching comparison pairs so your record linkage job is small enough to compute
  2. Eliminate as few truly matching pairs as possible (ideally none)

It is usually impossible to find a single blocking rule which achieves both aims, so we recommend using multiple blocking rules.

When we specify multiple blocking rules, Splink will generate all comparison pairs that meet any one of the rules.

For example, consider the following blocking rule:

block_on(\"first_name\", \"dob\")

This rule is likely to be effective in reducing the number of comparison pairs. It will retain all truly matching pairs, except those with errors or nulls in either the first_name or dob fields.

Now consider a second blocking rule:

block_on(\"email\").

This will retain all truly matching pairs, except those with errors or nulls in the email column.

Individually, these blocking rules are problematic because they exclude true matches where the records contain typos of certain types. But between them, they might do quite a good job.

For a true match to be eliminated by the use of these two blocking rules, it would have to have an error in both email AND (first_name or dob).

This is not completely implausible, but it is significantly less likely than if we'd used a single rule.
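As a back-of-envelope illustration (the error rates here are assumed for the sake of the example, not measured from any dataset):

```python
# Assumed, illustrative probabilities of a disqualifying error/null in a true match
p_email_error = 0.10      # error or null in email
p_name_dob_error = 0.20   # error or null in first_name or dob

# With block_on("email") alone, ~10% of true matches would be lost
loss_single_rule = p_email_error

# With both rules, a true match is lost only if both criteria fail
# (treating the two error types as independent)
loss_both_rules = p_email_error * p_name_dob_error

print(loss_single_rule, loss_both_rules)  # 0.1 vs 0.02
```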

More generally, we can often specify multiple blocking rules such that it becomes highly implausible that a true match would not meet at least one of these blocking criteria. This is the recommended approach in Splink. Generally we would recommend between about 3 and 10 rules, though even more is possible.

The question then becomes how to choose what to put in this list.

"},{"location":"demos/tutorials/03_Blocking.html#splink-tools-to-help-choose-your-blocking-rules","title":"Splink tools to help choose your blocking rules","text":"

Splink contains a number of tools to help you choose effective blocking rules. Let's try them out, using our small test dataset:

from splink import DuckDBAPI, block_on, splink_datasets\n\ndf = splink_datasets.fake_1000\n
"},{"location":"demos/tutorials/03_Blocking.html#counting-the-number-of-comparisons-created-by-a-single-blocking-rule","title":"Counting the number of comparisons created by a single blocking rule","text":"

On large datasets, some blocking rules imply the creation of trillions of record comparisons, which would cause a linkage job to fail.

Before using a blocking rule in a linkage job, it's therefore a good idea to count the number of comparisons it generates to ensure it is not too loose:

from splink.blocking_analysis import count_comparisons_from_blocking_rule\n\ndb_api = DuckDBAPI()\n\nbr = block_on(\"substr(first_name, 1,1)\", \"surname\")\n\ncounts = count_comparisons_from_blocking_rule(\n    table_or_tables=df,\n    blocking_rule=br,\n    link_type=\"dedupe_only\",\n    db_api=db_api,\n)\n\ncounts\n
{'number_of_comparisons_generated_pre_filter_conditions': 1632,\n 'number_of_comparisons_to_be_scored_post_filter_conditions': 473,\n 'filter_conditions_identified': '',\n 'equi_join_conditions_identified': 'SUBSTR(l.first_name, 1, 1) = SUBSTR(r.first_name, 1, 1) AND l.\"surname\" = r.\"surname\"',\n 'link_type_join_condition': 'where l.\"unique_id\" < r.\"unique_id\"'}\n
br = \"l.first_name = r.first_name and levenshtein(l.surname, r.surname) < 2\"\n\ncounts = count_comparisons_from_blocking_rule(\n    table_or_tables=df,\n    blocking_rule= br,\n    link_type=\"dedupe_only\",\n    db_api=db_api,\n)\ncounts\n
{'number_of_comparisons_generated_pre_filter_conditions': 4827,\n 'number_of_comparisons_to_be_scored_post_filter_conditions': 372,\n 'filter_conditions_identified': 'LEVENSHTEIN(l.surname, r.surname) < 2',\n 'equi_join_conditions_identified': 'l.first_name = r.first_name',\n 'link_type_join_condition': 'where l.\"unique_id\" < r.\"unique_id\"'}\n

The maximum number of comparisons that you can compute will be affected by your choice of SQL backend, and how powerful your computer is.

For linkages in DuckDB on a standard laptop, we suggest using blocking rules that create no more than about 20 million comparisons. For Spark and Athena, try starting with fewer than 100 million comparisons, before scaling up.

"},{"location":"demos/tutorials/03_Blocking.html#finding-worst-offending-values-for-your-blocking-rule","title":"Finding 'worst offending' values for your blocking rule","text":"

Blocking rules can be affected by skew: some values of a field may be much more common than others, and this can lead to a disproportionate number of comparisons being generated.

It can be useful to identify whether your data is afflicted by this problem.

from splink.blocking_analysis import n_largest_blocks\n\nresult = n_largest_blocks(    table_or_tables=df,\n    blocking_rule= block_on(\"city\", \"first_name\"),\n    link_type=\"dedupe_only\",\n    db_api=db_api,\n    n_largest=3\n    )\n\nresult.as_pandas_dataframe()\n
key_0 key_1 count_l count_r block_count 0 Birmingham Theodore 7 7 49 1 London Oliver 7 7 49 2 London James 6 6 36

In this case, we can see that Olivers in London will result in 49 comparisons being generated. This is acceptable on this small dataset, but on a larger dataset, Olivers in London could be responsible for many millions of comparisons.

"},{"location":"demos/tutorials/03_Blocking.html#counting-the-number-of-comparisons-created-by-a-list-of-blocking-rules","title":"Counting the number of comparisons created by a list of blocking rules","text":"

As noted above, it's usually a good idea to use multiple blocking rules. It's therefore useful to know how many record comparisons will be generated when these rules are applied.

Since the same record comparison may be created by several blocking rules, and Splink automatically deduplicates these comparisons, we cannot simply total the number of comparisons generated by each rule individually.
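The effect of this deduplication can be illustrated with a toy sketch (plain Python, not the Splink API; the records are invented): summing per-rule counts double-counts pairs captured by more than one rule, whereas a set union counts each pair once.

```python
# Toy illustration (not the Splink API) of why per-rule counts cannot be summed.
records = [
    {"id": 1, "first_name": "Rob", "surname": "Allen"},
    {"id": 2, "first_name": "Rob", "surname": "Allen"},
    {"id": 3, "first_name": "Rob", "surname": "Smith"},
]

def pairs_matching(records, key):
    """Return the set of record-id pairs that agree exactly on `key`."""
    out = set()
    for i, left in enumerate(records):
        for right in records[i + 1:]:
            if key(left) == key(right):
                out.add((left["id"], right["id"]))
    return out

by_first_name = pairs_matching(records, lambda rec: rec["first_name"])
by_surname = pairs_matching(records, lambda rec: rec["surname"])

# Summing the individual counts double-counts the (1, 2) pair:
print(len(by_first_name) + len(by_surname))  # 4
# The union deduplicates, matching what would actually be scored:
print(len(by_first_name | by_surname))       # 3
```

Splink performs the equivalent deduplication in SQL, which is why the chart below reports marginal rather than raw counts.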

Splink provides a chart that shows the marginal (additional) comparisons generated by each blocking rule, after deduplication:

from splink.blocking_analysis import (\n    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,\n)\n\nblocking_rules_for_analysis = [\n    block_on(\"substr(first_name, 1,1)\", \"surname\"),\n    block_on(\"surname\"),\n    block_on(\"email\"),\n    block_on(\"city\", \"first_name\"),\n    \"l.first_name = r.first_name and levenshtein(l.surname, r.surname) < 2\",\n]\n\n\ncumulative_comparisons_to_be_scored_from_blocking_rules_chart(\n    table_or_tables=df,\n    blocking_rules=blocking_rules_for_analysis,\n    db_api=db_api,\n    link_type=\"dedupe_only\",\n)\n
"},{"location":"demos/tutorials/03_Blocking.html#digging-deeper-understanding-why-certain-blocking-rules-create-large-numbers-of-comparisons","title":"Digging deeper: Understanding why certain blocking rules create large numbers of comparisons","text":"

Finally, we can use the profile_columns function we saw in the previous tutorial to understand a specific blocking rule in more depth.

Suppose we're interested in blocking on city and first initial.

Within each distinct value of (city, first initial), all possible pairwise comparisons will be generated.

So for instance, if there are 15 distinct records with London,J then these records will result in n(n-1)/2 = 105 pairwise comparisons being generated.

In a larger dataset, we might observe 10,000 London,J records, which would then be responsible for 49,995,000 comparisons.
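The quadratic growth described above can be checked with a one-line calculation (plain Python, for illustration):

```python
def comparisons_within_block(n: int) -> int:
    # Number of distinct pairwise comparisons among n records: n choose 2.
    return n * (n - 1) // 2

print(comparisons_within_block(15))      # 105
print(comparisons_within_block(10_000))  # 49995000
```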

These high-frequency values therefore have a disproportionate influence on the overall number of pairwise comparisons, and so it can be useful to analyse skew, as follows:

from splink.exploratory import profile_columns\n\nprofile_columns(df, column_expressions=[\"city || left(first_name,1)\"], db_api=db_api)\n

Further Reading

For a deeper dive on blocking, please refer to the Blocking Topic Guides.

For more on the blocking tools in Splink, please refer to the Blocking API documentation.

For more on the charts used in this tutorial, please refer to the Charts Gallery.

"},{"location":"demos/tutorials/03_Blocking.html#next-steps","title":"Next steps","text":"

Now we have chosen which records to compare, we can use those records to train a linkage model.

"},{"location":"demos/tutorials/04_Estimating_model_parameters.html","title":"4. Estimating model parameters","text":""},{"location":"demos/tutorials/04_Estimating_model_parameters.html#specifying-and-estimating-a-linkage-model","title":"Specifying and estimating a linkage model","text":"

In the last tutorial we looked at how we can use blocking rules to generate pairwise record comparisons.

Now it's time to estimate a probabilistic linkage model to score each of these comparisons. The resultant match score is a prediction of whether the two records represent the same entity (e.g. are the same person).

The purpose of estimating the model is to learn the relative importance of different parts of your data for the purpose of data linking.

For example, a match on date of birth is a much stronger indicator that two records refer to the same entity than a match on gender. A mismatch on gender may be stronger evidence against two records being a match than a mismatch on name, since names are more likely to be entered inconsistently.

The relative importance of different information is captured in the (partial) 'match weights', which can be learned from your data. These match weights are then added up to compute the overall match score.

The match weights are derived from the m and u parameters of the underlying Fellegi-Sunter model. Splink uses various statistical routines to estimate these parameters. Further details of the underlying theory, which will help you understand this part of the tutorial, can be found here.
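As a sketch of the arithmetic described above (the m, u and prior values here are invented for illustration, not estimated from data): each partial match weight is log2(m/u), and the partial weights are summed with a prior weight to give the overall match weight, which converts to a probability.

```python
import math

# Invented m and u values for two comparison levels (illustrative only):
# m = P(level | records match), u = P(level | records do not match)
observed_levels = [
    {"name": "exact match on dob", "m": 0.95, "u": 0.01},
    {"name": "exact match on gender", "m": 0.98, "u": 0.50},
]

prior = 1 / 1000  # hypothetical probability_two_random_records_match
prior_weight = math.log2(prior / (1 - prior))

# Each partial match weight is log2(m / u); strong evidence gives a large weight
partial_weights = [math.log2(lvl["m"] / lvl["u"]) for lvl in observed_levels]

match_weight = prior_weight + sum(partial_weights)
match_probability = 2**match_weight / (1 + 2**match_weight)

for lvl, w in zip(observed_levels, partial_weights):
    print(f"{lvl['name']}: partial match weight {w:.2f}")
print(f"overall match weight {match_weight:.2f}, probability {match_probability:.3f}")
```

Note that the date-of-birth level carries a much larger partial weight than the gender level, reflecting the relative-importance point made above.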

"},{"location":"demos/tutorials/04_Estimating_model_parameters.html#specifying-a-linkage-model","title":"Specifying a linkage model","text":"

To build a linkage model, the user defines the partial match weights that Splink needs to estimate. This is done by defining how the information in the input records should be compared.

To be concrete, here is an example comparison:

first_name_l first_name_r surname_l surname_r dob_l dob_r city_l city_r email_l email_r Robert Rob Allen Allen 1971-05-24 1971-06-24 nan London roberta25@smith.net roberta25@smith.net

What functions should we use to assess the similarity of Rob vs. Robert in the first_name field?

Should similarity in the dob field be computed in the same way, or a different way?

Your job as the developer of a linkage model is to decide what comparisons are most appropriate for the types of data you have.

Splink can then estimate how much weight to place on a fuzzy match of Rob vs. Robert, relative to an exact match on Robert, or a non-match.

Defining these scenarios is done using Comparisons.

"},{"location":"demos/tutorials/04_Estimating_model_parameters.html#comparisons","title":"Comparisons","text":"

The concept of a Comparison has a specific definition within Splink: it defines how data from one or more input columns is compared.

For example, one Comparison may represent how similarity is assessed for a person's date of birth.

Another Comparison may represent the comparison of a person's name or location.

A model is composed of many Comparisons, which between them assess the similarity of all of the columns being used for data linking.

Each Comparison contains two or more ComparisonLevels, which define discrete gradations of similarity between the input columns within the Comparison.

As such, ComparisonLevels are nested within Comparisons as follows:

Data Linking Model\n\u251c\u2500-- Comparison: Date of birth\n\u2502    \u251c\u2500-- ComparisonLevel: Exact match\n\u2502    \u251c\u2500-- ComparisonLevel: One character difference\n\u2502    \u251c\u2500-- ComparisonLevel: All other\n\u251c\u2500-- Comparison: Surname\n\u2502    \u251c\u2500-- ComparisonLevel: Exact match on surname\n\u2502    \u251c\u2500-- ComparisonLevel: All other\n\u2502    etc.\n

Our example data would therefore result in the following comparisons, for dob and surname:

dob_l dob_r comparison_level interpretation 1971-05-24 1971-05-24 Exact match great match 1971-05-24 1971-06-24 One character difference fuzzy match 1971-05-24 2000-01-02 All other bad match

surname_l surname_r comparison_level interpretation Rob Rob Exact match great match Rob Jane All other bad match Rob Robert All other bad match, this comparison has no notion of nicknames

More information about specifying comparisons can be found here and here.

We will now use these concepts to build a data linking model.

# Begin by reading in the tutorial data again\nfrom splink import splink_datasets\n\ndf = splink_datasets.fake_1000\n
"},{"location":"demos/tutorials/04_Estimating_model_parameters.html#specifying-the-model-using-comparisons","title":"Specifying the model using comparisons","text":"

Splink includes a library of comparison functions at splink.comparison_library to make it simple to get started. These are split into two categories:

  1. Generic Comparison functions which apply a particular fuzzy matching function. For example, levenshtein distance.
import splink.comparison_library as cl\n\ncity_comparison = cl.LevenshteinAtThresholds(\"city\", 2)\nprint(city_comparison.get_comparison(\"duckdb\").human_readable_description)\n
Comparison 'LevenshteinAtThresholds' of \"city\".\nSimilarity is assessed using the following ComparisonLevels:\n    - 'city is NULL' with SQL rule: \"city_l\" IS NULL OR \"city_r\" IS NULL\n    - 'Exact match on city' with SQL rule: \"city_l\" = \"city_r\"\n    - 'Levenshtein distance of city <= 2' with SQL rule: levenshtein(\"city_l\", \"city_r\") <= 2\n    - 'All other comparisons' with SQL rule: ELSE\n
  2. Comparison functions tailored for specific data types. For example, email.
email_comparison = cl.EmailComparison(\"email\")\nprint(email_comparison.get_comparison(\"duckdb\").human_readable_description)\n
Comparison 'EmailComparison' of \"email\".\nSimilarity is assessed using the following ComparisonLevels:\n    - 'email is NULL' with SQL rule: \"email_l\" IS NULL OR \"email_r\" IS NULL\n    - 'Exact match on email' with SQL rule: \"email_l\" = \"email_r\"\n    - 'Exact match on username' with SQL rule: NULLIF(regexp_extract(\"email_l\", '^[^@]+', 0), '') = NULLIF(regexp_extract(\"email_r\", '^[^@]+', 0), '')\n    - 'Jaro-Winkler distance of email >= 0.88' with SQL rule: jaro_winkler_similarity(\"email_l\", \"email_r\") >= 0.88\n    - 'Jaro-Winkler >0.88 on username' with SQL rule: jaro_winkler_similarity(NULLIF(regexp_extract(\"email_l\", '^[^@]+', 0), ''), NULLIF(regexp_extract(\"email_r\", '^[^@]+', 0), '')) >= 0.88\n    - 'All other comparisons' with SQL rule: ELSE\n
"},{"location":"demos/tutorials/04_Estimating_model_parameters.html#specifying-the-full-settings-dictionary","title":"Specifying the full settings dictionary","text":"

Comparisons are specified as part of the Splink settings, a Python dictionary which controls all of the configuration of a Splink model:

from splink import Linker, SettingsCreator, block_on, DuckDBAPI\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.NameComparison(\"first_name\"),\n        cl.NameComparison(\"surname\"),\n        cl.LevenshteinAtThresholds(\"dob\", 1),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        cl.EmailComparison(\"email\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\", \"city\"),\n        block_on(\"surname\"),\n\n    ],\n    retain_intermediate_calculation_columns=True,\n)\n\nlinker = Linker(df, settings, db_api=DuckDBAPI())\n

In words, this setting dictionary says:

  • We are performing a dedupe_only linkage (the other options are link_only or link_and_dedupe, which may be used if there are multiple input datasets).
  • When comparing records, we will use information from the first_name, surname, dob, city and email columns to compute a match score.
  • The blocking_rules_to_generate_predictions states that we will only check for duplicates amongst records where either the first_name AND city or surname is identical.
  • We have enabled term frequency adjustments for the 'city' column, because some values (e.g. London) appear much more frequently than others.
  • We have set retain_intermediate_calculation_columns to True so that Splink outputs additional information that helps the user understand the calculations. If it were False, the computations would run faster.
"},{"location":"demos/tutorials/04_Estimating_model_parameters.html#estimate-the-parameters-of-the-model","title":"Estimate the parameters of the model","text":"

Now that we have specified our linkage model, we need to estimate the probability_two_random_records_match, u, and m parameters.

  • The probability_two_random_records_match parameter is the probability that two records taken at random from your input data represent a match (typically a very small number).

  • The u values are the proportion of records falling into each ComparisonLevel amongst truly non-matching records.

  • The m values are the proportion of records falling into each ComparisonLevel amongst truly matching records.

You can read more about the theory of what these mean.

We can estimate these parameters using unlabeled data. If we have labels, then we can estimate them even more accurately.

"},{"location":"demos/tutorials/04_Estimating_model_parameters.html#estimation-of-probability_two_random_records_match","title":"Estimation of probability_two_random_records_match","text":"

In some cases, the probability_two_random_records_match will be known. For example, if you are linking two tables of 10,000 records and expect a one-to-one match, then you should set this value to 1/10_000 in your settings instead of estimating it.
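In that situation, the known value can be supplied directly when creating the settings. A minimal sketch, assuming the `SettingsCreator` accepts a `probability_two_random_records_match` argument and using placeholder comparisons:

```python
from splink import SettingsCreator, block_on
import splink.comparison_library as cl

settings = SettingsCreator(
    link_type="link_only",
    # Known prior: one-to-one link between two tables of 10,000 records
    probability_two_random_records_match=1 / 10_000,
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
    ],
    blocking_rules_to_generate_predictions=[block_on("surname")],
)
```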

More generally, this parameter is unknown and needs to be estimated.

It can be estimated accurately enough for most purposes by combining a series of deterministic matching rules with a guess of the recall corresponding to those rules. For further details of the rationale behind this approach, see here.

In this example, I guess that the following deterministic matching rules have a recall of about 70%. That means, between them, the rules recover 70% of all true matches.

deterministic_rules = [\n    block_on(\"first_name\", \"dob\"),\n    \"l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2\",\n    block_on(\"email\")\n]\n\nlinker.training.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)\n
Probability two random records match is estimated to be  0.00298.\nThis means that amongst all possible pairwise record comparisons, one in 335.56 are expected to match.  With 499,500 total possible comparisons, we expect a total of around 1,488.57 matching pairs\n
"},{"location":"demos/tutorials/04_Estimating_model_parameters.html#estimation-of-u-probabilities","title":"Estimation of u probabilities","text":"

Once we have the probability_two_random_records_match parameter, we can estimate the u probabilities.

We estimate u using the estimate_u_using_random_sampling method, which doesn't require any labels.

It works by sampling random pairs of records, since most of these pairs are going to be non-matches. Over these non-matches we compute the distribution of ComparisonLevels for each Comparison.

For instance, for gender, we would find that the gender matches 50% of the time, and mismatches 50% of the time.

For dob on the other hand, we would find that the dob matches 1% of the time, has a \"one character difference\" 3% of the time, and everything else happens 96% of the time.
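This intuition can be checked with a little arithmetic (the value frequencies here are invented, not Splink output): amongst randomly drawn pairs, the chance of an exact match on a field is the sum of the squared value frequencies.

```python
# Probability two randomly-drawn records agree exactly on a field
# = sum of squared value frequencies (illustrative numbers only).
gender_freqs = {"M": 0.5, "F": 0.5}
dob_freqs = {f"date_{i}": 0.01 for i in range(100)}  # 100 equally likely dates

def u_exact_match(freqs):
    return sum(p**2 for p in freqs.values())

print(u_exact_match(gender_freqs))  # 0.5
print(u_exact_match(dob_freqs))     # ~0.01
```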

The larger the random sample, the more accurate the estimates. You control the sample size using the max_pairs parameter. For large datasets, we recommend using at least 10 million pairs; the higher the better, and 1 billion is often appropriate for larger datasets.

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n
You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.\n\n\n----- Estimating u probabilities using random sampling -----\n\n\n\nEstimated u probabilities using random sampling\n\n\n\nYour model is not yet fully trained. Missing estimates for:\n    - first_name (no m values are trained).\n    - surname (no m values are trained).\n    - dob (no m values are trained).\n    - city (no m values are trained).\n    - email (no m values are trained).\n
"},{"location":"demos/tutorials/04_Estimating_model_parameters.html#estimation-of-m-probabilities","title":"Estimation of m probabilities","text":"

m is the trickiest of the parameters to estimate, because we have to have some idea of what the true matches are.

If we have labels, we can directly estimate it.

If we do not have labelled data, the m parameters can be estimated using an iterative maximum likelihood approach called Expectation Maximisation.

"},{"location":"demos/tutorials/04_Estimating_model_parameters.html#estimating-directly","title":"Estimating directly","text":"

If we have labels, we can estimate m directly using the estimate_m_from_label_column method of the linker.

For example, if the entity being matched is persons, and your input dataset(s) contain social security number, this could be used to estimate the m values for the model.

Note that this column does not need to be fully populated. A common case is where a unique identifier such as social security number is only partially populated.

For example (in this tutorial we don't have labels, so we're not actually going to use this):

linker.training.estimate_m_from_label_column(\"social_security_number\")\n
"},{"location":"demos/tutorials/04_Estimating_model_parameters.html#estimating-with-expectation-maximisation","title":"Estimating with Expectation Maximisation","text":"

This algorithm estimates the m values by generating pairwise record comparisons, and using them to maximise a likelihood function.

Each estimation pass requires the user to configure an estimation blocking rule to reduce the number of record comparisons generated to a manageable level.

In our first estimation pass, we block on first_name and surname, meaning we will generate all record comparisons that have first_name and surname exactly equal.

Recall we are trying to estimate the m values of the model, i.e. proportion of records falling into each ComparisonLevel amongst truly matching records.

This means that, in this training session, we cannot estimate parameters for the first_name or surname columns, since they will be equal for all the comparisons we generate.

We can, however, estimate parameters for all of the other columns. The output messages produced by Splink confirm this.

training_blocking_rule = block_on(\"first_name\", \"surname\")\ntraining_session_fname_sname = (\n    linker.training.estimate_parameters_using_expectation_maximisation(training_blocking_rule)\n)\n
----- Starting EM training session -----\n\n\n\nEstimating the m probabilities of the model by blocking on:\n(l.\"first_name\" = r.\"first_name\") AND (l.\"surname\" = r.\"surname\")\n\nParameter estimates will be made for the following comparison(s):\n    - dob\n    - city\n    - email\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - first_name\n    - surname\n\n\n\n\n\nWARNING:\nLevel Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value\n\n\n\nIteration 1: Largest change in params was -0.521 in the m_probability of dob, level `Exact match on dob`\n\n\nIteration 2: Largest change in params was 0.0516 in probability_two_random_records_match\n\n\nIteration 3: Largest change in params was 0.0183 in probability_two_random_records_match\n\n\nIteration 4: Largest change in params was 0.00744 in probability_two_random_records_match\n\n\nIteration 5: Largest change in params was 0.00349 in probability_two_random_records_match\n\n\nIteration 6: Largest change in params was 0.00183 in probability_two_random_records_match\n\n\nIteration 7: Largest change in params was 0.00103 in probability_two_random_records_match\n\n\nIteration 8: Largest change in params was 0.000607 in probability_two_random_records_match\n\n\nIteration 9: Largest change in params was 0.000367 in probability_two_random_records_match\n\n\nIteration 10: Largest change in params was 0.000226 in probability_two_random_records_match\n\n\nIteration 11: Largest change in params was 0.00014 in probability_two_random_records_match\n\n\nIteration 12: Largest change in params was 8.73e-05 in probability_two_random_records_match\n\n\n\nEM converged after 12 iterations\n\n\nm probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.\n\n\n\nYour model is not yet fully trained. 
Missing estimates for:\n    - first_name (no m values are trained).\n    - surname (no m values are trained).\n    - email (some m values are not trained).\n

In a second estimation pass, we block on dob. This allows us to estimate parameters for the first_name and surname comparisons.

Between the two estimation passes, we now have parameter estimates for all comparisons.

training_blocking_rule = block_on(\"dob\")\ntraining_session_dob = linker.training.estimate_parameters_using_expectation_maximisation(\n    training_blocking_rule\n)\n
----- Starting EM training session -----\n\n\n\nEstimating the m probabilities of the model by blocking on:\nl.\"dob\" = r.\"dob\"\n\nParameter estimates will be made for the following comparison(s):\n    - first_name\n    - surname\n    - city\n    - email\n\nParameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: \n    - dob\n\n\n\n\n\nWARNING:\nLevel Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value\n\n\n\nIteration 1: Largest change in params was -0.407 in the m_probability of surname, level `Exact match on surname`\n\n\nIteration 2: Largest change in params was 0.0929 in probability_two_random_records_match\n\n\nIteration 3: Largest change in params was 0.0548 in the m_probability of first_name, level `All other comparisons`\n\n\nIteration 4: Largest change in params was 0.0186 in probability_two_random_records_match\n\n\nIteration 5: Largest change in params was 0.00758 in probability_two_random_records_match\n\n\nIteration 6: Largest change in params was 0.00339 in probability_two_random_records_match\n\n\nIteration 7: Largest change in params was 0.0016 in probability_two_random_records_match\n\n\nIteration 8: Largest change in params was 0.000773 in probability_two_random_records_match\n\n\nIteration 9: Largest change in params was 0.000379 in probability_two_random_records_match\n\n\nIteration 10: Largest change in params was 0.000189 in probability_two_random_records_match\n\n\nIteration 11: Largest change in params was 9.68e-05 in probability_two_random_records_match\n\n\n\nEM converged after 11 iterations\n\n\nm probability not trained for email - Jaro-Winkler >0.88 on username (comparison vector value: 1). This usually means the comparison level was never observed in the training data.\n\n\n\nYour model is not yet fully trained. Missing estimates for:\n    - email (some m values are not trained).\n

Note that Splink includes other algorithms for estimating m and u values, which are documented here.

"},{"location":"demos/tutorials/04_Estimating_model_parameters.html#visualising-model-parameters","title":"Visualising model parameters","text":"

Splink can generate a number of charts to help you understand your model. For an introduction to these charts and how to interpret them, please see this video.

The final estimated match weights can be viewed in the match weights chart:

linker.visualisations.match_weights_chart()\n
linker.visualisations.m_u_parameters_chart()\n

We can also compare the estimates that were produced by the different EM training sessions

linker.visualisations.parameter_estimate_comparisons_chart()\n
"},{"location":"demos/tutorials/04_Estimating_model_parameters.html#saving-the-model","title":"Saving the model","text":"

We can save the model, including our estimated parameters, to a .json file, so we can use it in the next tutorial.

settings = linker.misc.save_model_to_json(\n    \"../demo_settings/saved_model_from_demo.json\", overwrite=True\n)\n
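A saved model can later be reloaded by passing the path of the JSON file in place of a settings dictionary. A sketch, assuming the `Linker` constructor accepts a settings-file path and that `df` is the input dataframe loaded earlier:

```python
from splink import Linker, DuckDBAPI

linker = Linker(
    df,
    "../demo_settings/saved_model_from_demo.json",
    db_api=DuckDBAPI(),
)
```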
"},{"location":"demos/tutorials/04_Estimating_model_parameters.html#detecting-unlinkable-records","title":"Detecting unlinkable records","text":"

An interesting application of our trained model that is useful to explore before making any predictions is to detect 'unlinkable' records.

Unlinkable records are those which do not contain enough information to be linked. A simple example would be a record containing only 'John Smith', and null in all other fields. This record may link to other records, but we'll never know because there's not enough information to disambiguate any potential links. Unlinkable records can be found by linking records to themselves - if, even when matched to themselves, they don't meet the match threshold score, we can be sure they will never link to anything.

linker.evaluation.unlinkables_chart()\n

In the above chart, we can see that about 1.3% of records in the input dataset are unlinkable at a threshold match weight of 6.11 (corresponding to a match probability of around 98.6%).

Further Reading

For more on the model estimation tools in Splink, please refer to the Model Training API documentation.

For a deeper dive on:

  • choosing comparisons, please refer to the Comparisons Topic Guides
  • the underlying model theory, please refer to the Fellegi Sunter Topic Guide
  • model training, please refer to the Model Training Topic Guides (Coming Soon).

For more on the charts used in this tutorial, please refer to the Charts Gallery.

"},{"location":"demos/tutorials/04_Estimating_model_parameters.html#next-steps","title":"Next steps","text":"

Now we have trained a model, we can move on to using it to predict matching records.

"},{"location":"demos/tutorials/05_Predicting_results.html","title":"5. Predicting results","text":""},{"location":"demos/tutorials/05_Predicting_results.html#predicting-which-records-match","title":"Predicting which records match","text":"

In the previous tutorial, we built and estimated a linkage model.

In this tutorial, we will load the estimated model and use it to make predictions of which pairwise record comparisons match.

from splink import Linker, DuckDBAPI, splink_datasets\n\nimport pandas as pd\n\npd.options.display.max_columns = 1000\n\ndb_api = DuckDBAPI()\ndf = splink_datasets.fake_1000\n
"},{"location":"demos/tutorials/05_Predicting_results.html#load-estimated-model-from-previous-tutorial","title":"Load estimated model from previous tutorial","text":"
import json\nimport urllib\n\nurl = \"https://raw.githubusercontent.com/moj-analytical-services/splink/847e32508b1a9cdd7bcd2ca6c0a74e547fb69865/docs/demos/demo_settings/saved_model_from_demo.json\"\n\nwith urllib.request.urlopen(url) as u:\n    settings = json.loads(u.read().decode())\n\n\nlinker = Linker(df, settings, db_api=DuckDBAPI())\n
"},{"location":"demos/tutorials/05_Predicting_results.html#predicting-match-weights-using-the-trained-model","title":"Predicting match weights using the trained model","text":"

We use linker.inference.predict() to run the model.

Under the hood this will:

  • Generate all pairwise record comparisons that match at least one of the blocking_rules_to_generate_predictions

  • Use the rules specified in the Comparisons to evaluate the similarity of the input data

  • Use the estimated match weights, applying term frequency adjustments where requested to produce the final match_weight and match_probability scores

Optionally, a threshold_match_probability or threshold_match_weight can be provided, which will drop any row where the predicted score is below the threshold.

df_predictions = linker.inference.predict(threshold_match_probability=0.2)\ndf_predictions.as_pandas_dataframe(limit=5)\n
 -- WARNING --\nYou have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.\nComparison: 'email':\n    m values not fully trained\n
match_weight match_probability unique_id_l unique_id_r first_name_l first_name_r gamma_first_name tf_first_name_l tf_first_name_r bf_first_name bf_tf_adj_first_name surname_l surname_r gamma_surname tf_surname_l tf_surname_r bf_surname bf_tf_adj_surname dob_l dob_r gamma_dob bf_dob city_l city_r gamma_city tf_city_l tf_city_r bf_city bf_tf_adj_city email_l email_r gamma_email tf_email_l tf_email_r bf_email bf_tf_adj_email match_key 0 -1.749664 0.229211 324 326 Kai Kai 4 0.006017 0.006017 84.821765 0.962892 None Turner -1 NaN 0.007326 1.000000 1.000000 2018-12-31 2009-11-03 0 0.460743 London London 1 0.212792 0.212792 10.20126 0.259162 k.t50eherand@z.ncom None -1 0.001267 NaN 1.0 1.0 0 1 -1.626076 0.244695 25 27 Gabriel None -1 0.001203 NaN 1.000000 1.000000 Thomas Thomas 4 0.004884 0.004884 88.870507 1.001222 1977-09-13 1977-10-17 0 0.460743 London London 1 0.212792 0.212792 10.20126 0.259162 gabriel.t54@nichols.info None -1 0.002535 NaN 1.0 1.0 1 2 -1.551265 0.254405 626 629 geeorGe George 1 0.001203 0.014440 4.176727 1.000000 Davidson Davidson 4 0.007326 0.007326 88.870507 0.667482 1999-05-07 2000-05-06 0 0.460743 Southamptn None -1 0.001230 NaN 1.00000 1.000000 None gdavidson@johnson-brown.com -1 NaN 0.00507 1.0 1.0 1 3 -1.427735 0.270985 600 602 Toby Toby 4 0.004813 0.004813 84.821765 1.203614 None None -1 NaN NaN 1.000000 1.000000 2003-04-23 2013-03-21 0 0.460743 London London 1 0.212792 0.212792 10.20126 0.259162 toby.d@menhez.com None -1 0.001267 NaN 1.0 1.0 0 4 -1.427735 0.270985 599 602 Toby Toby 4 0.004813 0.004813 84.821765 1.203614 Haall None -1 0.001221 NaN 1.000000 1.000000 2003-04-23 2013-03-21 0 0.460743 London London 1 0.212792 0.212792 10.20126 0.259162 None None -1 NaN NaN 1.0 1.0 0"},{"location":"demos/tutorials/05_Predicting_results.html#clustering","title":"Clustering","text":"

The result of linker.inference.predict() is a list of pairwise record comparisons and their associated scores. For instance, if we have input records A, B, C, D and E, it could be represented conceptually as:

A -> B with score 0.9\nB -> C with score 0.95\nC -> D with score 0.1\nD -> E with score 0.99\n

Often, an alternative representation of this result is more useful, where each row is an input record, and where records link, they are assigned to the same cluster.

With a score threshold of 0.5, the above data could be represented conceptually as:

ID, Cluster ID\nA,  1\nB,  1\nC,  1\nD,  2\nE,  2\n
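The conversion from the pairwise scores above to clusters can be sketched in pure Python with a tiny union-find pass over the thresholded edges (an illustration only, not Splink's implementation, which runs in SQL):

```python
# Toy connected components over the pairwise scores above.
edges = [("A", "B", 0.9), ("B", "C", 0.95), ("C", "D", 0.1), ("D", "E", 0.99)]
threshold = 0.5

# Union-find: each node starts in its own cluster
parent = {node: node for edge in edges for node in edge[:2]}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Merge clusters only for edges at or above the score threshold
for left, right, score in edges:
    if score >= threshold:
        union(left, right)

clusters = {node: find(node) for node in sorted(parent)}
print(clusters)  # A, B and C share one root; D and E share another
```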

The algorithm that converts between the pairwise results and the clusters is called connected components, and it is included in Splink. You can use it as follows:

clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(\n    df_predictions, threshold_match_probability=0.5\n)\nclusters.as_pandas_dataframe(limit=10)\n
Completed iteration 1, root rows count 2\n\n\nCompleted iteration 2, root rows count 0\n
cluster_id unique_id first_name surname dob city email cluster __splink_salt tf_surname tf_email tf_city tf_first_name 0 0 0 Robert Alan 1971-06-24 None robert255@smith.net 0 0.012924 0.001221 0.001267 NaN 0.003610 1 1 1 Robert Allen 1971-05-24 None roberta25@smith.net 0 0.478756 0.002442 0.002535 NaN 0.003610 2 1 2 Rob Allen 1971-06-24 London roberta25@smith.net 0 0.409662 0.002442 0.002535 0.212792 0.001203 3 3 3 Robert Alen 1971-06-24 Lonon None 0 0.311029 0.001221 NaN 0.007380 0.003610 4 4 4 Grace None 1997-04-26 Hull grace.kelly52@jones.com 1 0.486141 NaN 0.002535 0.001230 0.006017 5 5 5 Grace Kelly 1991-04-26 None grace.kelly52@jones.com 1 0.434566 0.002442 0.002535 NaN 0.006017 6 6 6 Logan pMurphy 1973-08-01 None None 2 0.423760 0.001221 NaN NaN 0.012034 7 7 7 None None 2015-03-03 Portsmouth evied56@harris-bailey.net 3 0.683689 NaN 0.002535 0.017220 NaN 8 8 8 None Dean 2015-03-03 None None 3 0.553086 0.003663 NaN NaN NaN 9 8 9 Evie Dean 2015-03-03 Pootsmruth evihd56@earris-bailey.net 3 0.753070 0.003663 0.001267 0.001230 0.008424
sql = f"""
select *
from {df_predictions.physical_name}
limit 2
"""
linker.misc.query_sql(sql)
match_weight match_probability unique_id_l unique_id_r first_name_l first_name_r gamma_first_name tf_first_name_l tf_first_name_r bf_first_name bf_tf_adj_first_name surname_l surname_r gamma_surname tf_surname_l tf_surname_r bf_surname bf_tf_adj_surname dob_l dob_r gamma_dob bf_dob city_l city_r gamma_city tf_city_l tf_city_r bf_city bf_tf_adj_city email_l email_r gamma_email tf_email_l tf_email_r bf_email bf_tf_adj_email match_key 0 -1.749664 0.229211 324 326 Kai Kai 4 0.006017 0.006017 84.821765 0.962892 None Turner -1 NaN 0.007326 1.000000 1.000000 2018-12-31 2009-11-03 0 0.460743 London London 1 0.212792 0.212792 10.20126 0.259162 k.t50eherand@z.ncom None -1 0.001267 NaN 1.0 1.0 0 1 -1.626076 0.244695 25 27 Gabriel None -1 0.001203 NaN 1.000000 1.000000 Thomas Thomas 4 0.004884 0.004884 88.870507 1.001222 1977-09-13 1977-10-17 0 0.460743 London London 1 0.212792 0.212792 10.20126 0.259162 gabriel.t54@nichols.info None -1 0.002535 NaN 1.0 1.0 1

Further Reading

For more on the prediction tools in Splink, please refer to the Prediction API documentation.

"},{"location":"demos/tutorials/05_Predicting_results.html#next-steps","title":"Next steps","text":"

Now we have made predictions with a model, we can move on to visualising it to understand how it is working.

"},{"location":"demos/tutorials/06_Visualising_predictions.html","title":"6. Visualising predictions","text":""},{"location":"demos/tutorials/06_Visualising_predictions.html#visualising-predictions","title":"Visualising predictions","text":"

Splink contains a variety of tools to help you visualise your predictions.

The idea is that, by developing an understanding of how your model works, you can gain confidence that the predictions it makes are sensible, or alternatively find examples of where your model isn't working, which may help you improve the model specification and fix these problems.

# Rerun our predictions so we're ready to view the charts
from splink import Linker, DuckDBAPI, splink_datasets

import pandas as pd

pd.options.display.max_columns = 1000

db_api = DuckDBAPI()
df = splink_datasets.fake_1000
import json
import urllib.request

url = "https://raw.githubusercontent.com/moj-analytical-services/splink/847e32508b1a9cdd7bcd2ca6c0a74e547fb69865/docs/demos/demo_settings/saved_model_from_demo.json"

with urllib.request.urlopen(url) as u:
    settings = json.loads(u.read().decode())


linker = Linker(df, settings, db_api=DuckDBAPI())
df_predictions = linker.inference.predict(threshold_match_probability=0.2)
 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
    m values not fully trained
"},{"location":"demos/tutorials/06_Visualising_predictions.html#waterfall-chart","title":"Waterfall chart","text":"

The waterfall chart provides a means of visualising individual predictions to understand how Splink computed the final match weight for a particular pairwise record comparison.

To plot a waterfall chart, the user chooses one or more records from the results of linker.inference.predict(), and provides these records to the linker.visualisations.waterfall_chart() function.

For an introduction to waterfall charts and how to interpret them, please see this video.

records_to_view = df_predictions.as_record_dict(limit=5)
linker.visualisations.waterfall_chart(records_to_view, filter_nulls=False)
"},{"location":"demos/tutorials/06_Visualising_predictions.html#comparison-viewer-dashboard","title":"Comparison viewer dashboard","text":"

The comparison viewer dashboard takes this one step further by producing an interactive dashboard that contains example predictions from across the spectrum of match scores.

An in-depth video describing how to interpret the dashboard can be found here.

linker.visualisations.comparison_viewer_dashboard(df_predictions, "scv.html", overwrite=True)

# You can view the scv.html file in your browser, or inline in a notebook as follows
from IPython.display import IFrame

IFrame(src="./scv.html", width="100%", height=1200)

"},{"location":"demos/tutorials/06_Visualising_predictions.html#cluster-studio-dashboard","title":"Cluster studio dashboard","text":"

Cluster studio is an interactive dashboard that visualises the results of clustering your predictions.

It provides examples of clusters of different sizes. The shape and size of clusters can be indicative of problems with record linkage, so it provides a tool to help you find potential false positive and negative links.

df_clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predictions, threshold_match_probability=0.5
)

linker.visualisations.cluster_studio_dashboard(
    df_predictions,
    df_clusters,
    "cluster_studio.html",
    sampling_method="by_cluster_size",
    overwrite=True,
)

# You can view the cluster_studio.html file in your browser, or inline in a notebook as follows
from IPython.display import IFrame

IFrame(src="./cluster_studio.html", width="100%", height=1000)
Completed iteration 1, root rows count 2
Completed iteration 2, root rows count 0

Further Reading

For more on the visualisation tools in Splink, please refer to the Visualisation API documentation.

For more on the charts used in this tutorial, please refer to the Charts Gallery

"},{"location":"demos/tutorials/06_Visualising_predictions.html#next-steps","title":"Next steps","text":"

Now we have visualised the results of a model, we can move on to some more formal Quality Assurance procedures using labelled data.

"},{"location":"demos/tutorials/07_Evaluation.html","title":"7. Evaluation","text":""},{"location":"demos/tutorials/07_Evaluation.html#evaluation-of-prediction-results","title":"Evaluation of prediction results","text":"

In the previous tutorial, we looked at various ways to visualise the results of our model. These are useful for evaluating a linkage pipeline because they allow us to understand how our model works and verify that it is doing something sensible. They can also be useful to identify examples where the model is not performing as expected.

In addition to these spot checks, Splink also has functions to perform more formal accuracy analysis. These functions allow you to understand the likely prevalence of false positives and false negatives in your linkage models.

They rely on the existence of a sample of labelled (ground truth) matches, which may have been produced (for example) by human beings. For the accuracy analysis to be unbiased, the sample should be representative of the overall dataset.

# Rerun our predictions so we're ready to view the charts
import pandas as pd

from splink import DuckDBAPI, Linker, splink_datasets

pd.options.display.max_columns = 1000

db_api = DuckDBAPI()
df = splink_datasets.fake_1000
import json
import urllib.request

from splink import block_on

url = "https://raw.githubusercontent.com/moj-analytical-services/splink/847e32508b1a9cdd7bcd2ca6c0a74e547fb69865/docs/demos/demo_settings/saved_model_from_demo.json"

with urllib.request.urlopen(url) as u:
    settings = json.loads(u.read().decode())

# The data quality is very poor in this dataset, so we need looser blocking rules
# to achieve decent recall
settings["blocking_rules_to_generate_predictions"] = [
    block_on("first_name"),
    block_on("city"),
    block_on("email"),
    block_on("dob"),
]

linker = Linker(df, settings, db_api=DuckDBAPI())
df_predictions = linker.inference.predict(threshold_match_probability=0.01)
Blocking time: 0.02 seconds
Predict time: 0.80 seconds

 -- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
    m values not fully trained
"},{"location":"demos/tutorials/07_Evaluation.html#load-in-labels","title":"Load in labels","text":"

The labels file contains a list of pairwise comparisons which represent matches and non-matches.

The required format of the labels file is described here.

from splink.datasets import splink_dataset_labels

df_labels = splink_dataset_labels.fake_1000_labels
labels_table = linker.table_management.register_labels_table(df_labels)
df_labels.head(5)
  unique_id_l source_dataset_l  unique_id_r source_dataset_r  clerical_match_score
0           0        fake_1000            1        fake_1000                   1.0
1           0        fake_1000            2        fake_1000                   1.0
2           0        fake_1000            3        fake_1000                   1.0
3           0        fake_1000            4        fake_1000                   0.0
4           0        fake_1000            5        fake_1000                   0.0
"},{"location":"demos/tutorials/07_Evaluation.html#view-examples-of-false-positives-and-false-negatives","title":"View examples of false positives and false negatives","text":"
splink_df = linker.evaluation.prediction_errors_from_labels_table(
    labels_table, include_false_negatives=True, include_false_positives=False
)
false_negatives = splink_df.as_record_dict(limit=5)
linker.visualisations.waterfall_chart(false_negatives)
"},{"location":"demos/tutorials/07_Evaluation.html#false-positives","title":"False positives","text":"
# Note I've picked a threshold match probability of 0.01 here because otherwise
# in this simple example there are no false positives
splink_df = linker.evaluation.prediction_errors_from_labels_table(
    labels_table,
    include_false_negatives=False,
    include_false_positives=True,
    threshold_match_probability=0.01,
)
false_positives = splink_df.as_record_dict(limit=5)
linker.visualisations.waterfall_chart(false_positives)
"},{"location":"demos/tutorials/07_Evaluation.html#threshold-selection-chart","title":"Threshold Selection chart","text":"

Splink includes an interactive dashboard that shows key accuracy statistics:

linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table, output_type="threshold_selection", add_metrics=["f1"]
)
"},{"location":"demos/tutorials/07_Evaluation.html#receiver-operating-characteristic-curve","title":"Receiver operating characteristic curve","text":"

A ROC chart shows how the number of false positives and false negatives varies depending on the match threshold chosen. The match threshold is the match weight chosen as a cutoff for which pairwise comparisons to accept as matches.

linker.evaluation.accuracy_analysis_from_labels_table(labels_table, output_type="roc")
"},{"location":"demos/tutorials/07_Evaluation.html#truth-table","title":"Truth table","text":"

Finally, Splink can also report the underlying table used to construct the ROC and precision recall curves.

roc_table = linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table, output_type="table"
)
roc_table.as_pandas_dataframe(limit=5)
truth_threshold match_probability total_clerical_labels p n tp tn fp fn P_rate N_rate tp_rate tn_rate fp_rate fn_rate precision recall specificity npv accuracy f1 f2 f0_5 p4 phi 0 -18.9 0.000002 3176.0 2031.0 1145.0 1709.0 1103.0 42.0 322.0 0.639484 0.360516 0.841457 0.963319 0.036681 0.158543 0.976014 0.841457 0.963319 0.774035 0.885390 0.903755 0.865316 0.945766 0.880476 0.776931 1 -16.7 0.000009 3176.0 2031.0 1145.0 1709.0 1119.0 26.0 322.0 0.639484 0.360516 0.841457 0.977293 0.022707 0.158543 0.985014 0.841457 0.977293 0.776544 0.890428 0.907594 0.866721 0.952514 0.886010 0.789637 2 -12.8 0.000140 3176.0 2031.0 1145.0 1709.0 1125.0 20.0 322.0 0.639484 0.360516 0.841457 0.982533 0.017467 0.158543 0.988433 0.841457 0.982533 0.777471 0.892317 0.909043 0.867249 0.955069 0.888076 0.794416 3 -12.5 0.000173 3176.0 2031.0 1145.0 1708.0 1125.0 20.0 323.0 0.639484 0.360516 0.840965 0.982533 0.017467 0.159035 0.988426 0.840965 0.982533 0.776934 0.892003 0.908752 0.866829 0.954937 0.887763 0.793897 4 -12.4 0.000185 3176.0 2031.0 1145.0 1705.0 1132.0 13.0 326.0 0.639484 0.360516 0.839488 0.988646 0.011354 0.160512 0.992433 0.839488 0.988646 0.776406 0.893262 0.909576 0.866186 0.957542 0.889225 0.797936"},{"location":"demos/tutorials/07_Evaluation.html#unlinkables-chart","title":"Unlinkables chart","text":"

Finally, it can be interesting to analyse whether your dataset contains any 'unlinkable' records.

'Unlinkable records' are records with such poor data quality that they don't even link to themselves at a high enough probability to be accepted as matches.

For example, in a typical linkage problem, a 'John Smith' record with nulls for their address and postcode may be unlinkable. By 'unlinkable' we don't mean there are no matches; rather, we mean it is not possible to determine whether there are matches.

A high proportion of unlinkable records is an indication of poor quality in the input dataset.

linker.evaluation.unlinkables_chart()

For this dataset and this trained model, we can see that most records are (theoretically) linkable: at a match weight of 6, around 99% of records could be linked to themselves.

Further Reading

For more on the quality assurance tools in Splink, please refer to the Evaluation API documentation.

For more on the charts used in this tutorial, please refer to the Charts Gallery.

For more on the Evaluation Metrics used in this tutorial, please refer to the Edge Metrics guide.

"},{"location":"demos/tutorials/07_Evaluation.html#thats-it","title":"That's it!","text":"

That wraps up the Splink tutorial! Don't worry, there are still plenty of resources to help on the next steps of your Splink journey:

For some end-to-end notebooks of Splink pipelines, check out our Examples

For more deep dives into the different aspects of Splink, and record linkage more generally, check out our Topic Guides

For a reference on all the functionality available in Splink, see our Documentation

"},{"location":"dev_guides/index.html","title":"Contributing to Splink","text":""},{"location":"dev_guides/index.html#contributing-to-splink","title":"Contributing to Splink","text":"

Thank you for your interest in contributing to Splink! If this is your first time working with Splink, check our Contributors Guide.

When making changes to Splink, there are a number of common operations that developers need to perform. The guides below lay out some of these common operations, and provide scripts to automate these processes. These include:

  • Developer Quickstart - to get contributors up and running.
  • Linting and Formatting - to ensure consistent code style and to reformat code, where possible.
  • Testing - to ensure all of the codebase is performing as intended.
  • Building the Documentation locally - to test any changes to the docs site render correctly.
  • Releasing a new package version - to walk through the release process for new versions of Splink. This generally happens every 2 weeks, or in the case of an urgent bug fix.
  • Contributing to the Splink Blog - to walk through the process of adding a post to the Splink blog.
"},{"location":"dev_guides/index.html#how-splink-works","title":"How Splink works","text":"

Splink is quite a large, complex codebase. The guides in this section lay out some of the key structures and key areas within the Splink codebase. These include:

  • Understanding and Debugging Splink - demonstrates several ways of understanding how Splink code is running under the hood. This includes Splink's debug mode and logging.
  • Transpilation using SQLGlot - demonstrates how Splink translates SQL in order to be compatible with multiple SQL engines using the SQLGlot package.
  • Performance and caching - demonstrates how pipelining and caching are used to make Splink run more efficiently.
  • Charts - demonstrates how charts are built in Splink, including how to add new charts and edit existing charts.
  • User-Defined Functions - demonstrates how User Defined Functions (UDFs) are used to provide functionality within Splink that is not native to a given SQL backend.
  • Settings Validation - summarises how to use and expand the existing settings schema and validation functions.
  • Managing Splink's Dependencies - this section provides guidelines for managing our core dependencies and our strategy for phasing out Python versions that have reached their end-of-life.
"},{"location":"dev_guides/CONTRIBUTING.html","title":"Contributor Guide","text":""},{"location":"dev_guides/CONTRIBUTING.html#contributing-to-splink","title":"Contributing to Splink","text":"

Contributing to an open source project takes many forms. Below are some of the ways you can contribute to Splink!

"},{"location":"dev_guides/CONTRIBUTING.html#asking-questions","title":"Asking questions","text":"

If you have a question about Splink, we recommend asking on our GitHub discussion board. This means that other users can benefit from the answers too! On that note, it is always worth checking if a similar question has been asked (and answered) before.

"},{"location":"dev_guides/CONTRIBUTING.html#reporting-issues","title":"Reporting issues","text":"

Is something broken? Or not acting how you would expect? Are we missing a feature that would make your life easier? We want to know about it!

When reporting issues please include as much detail as possible about your operating system, Splink version, python version and which SQL backend you are using. Whenever possible, please also include a brief, self-contained code example that demonstrates the problem. It is particularly helpful if you can look through the existing issues and provide links to any related issues.

"},{"location":"dev_guides/CONTRIBUTING.html#contributing-to-documentation","title":"Contributing to documentation","text":"

Contributions to Splink are not limited to the code. Feedback and input on our documentation from a user's perspective is extremely valuable - even something as small as fixing a typo. More generally, if you are interested in starting to work on Splink, documentation is a great way to get those first commits!

The easiest way to contribute to the documentation is by clicking the pencil icon at the top right of the docs page you want to edit. This will automatically create a fork of the Splink repository on GitHub and make it easy to open a pull request with your changes, which one of the Splink dev team will review.

If you need to make a larger change to the docs, this workflow might not be the best, since you won't get to see the effects of your changes before submitting them. To do this, you will need to create a fork of the Splink repo, then clone your fork to your computer. Then, you can edit the documentation in the docs folder (and API documentation, which can be found as docstrings in the code itself) locally. To see what the docs will look like with your changes, you can build the docs site locally. When you are happy with your changes, commit and push them to your fork, then create a Pull Request.

We are trying to make our documentation as accessible to as many people as possible. If you find any problems with accessibility then please let us know by raising an issue, or feel free to put in a Pull Request with your suggested fixes.

"},{"location":"dev_guides/CONTRIBUTING.html#contributing-code","title":"Contributing code","text":"

Thanks for your interest in contributing code to Splink!

There are a number of ways to get involved:

  • Start work on an existing issue, there should be some with a good first issue flag which are a good place to start.
  • Tackle a problem you have identified. If you have identified a feature or bug, the first step is to create a new issue to explain what you have identified and what you plan to implement, then you are free to fork the repository and get coding!

In either case, we ask that you assign yourself to the relevant issue and open up a draft pull request (PR) while you are working on your feature/bug-fix. This helps the Splink dev team keep track of developments and means we can start supporting you sooner!

You can always add further PRs to build extra functionality. Starting out with a minimum viable product and iterating makes for better software (in our opinion). It also helps get features out into the wild sooner.

To get set up for development locally, see the development quickstart.

"},{"location":"dev_guides/CONTRIBUTING.html#best-practices","title":"Best practices","text":"

When making code changes, we recommend:

  • Adding tests to ensure your code works as expected. These will be run through GitHub Actions when a PR is opened.
  • Linting to ensure that code is styled consistently.
"},{"location":"dev_guides/CONTRIBUTING.html#branching-strategy","title":"Branching Strategy","text":"

All pull requests (PRs) should target the master branch.

We believe that small Pull Requests make better code. They:

  • are more focused
  • increase understanding and clarity
  • are easier (and quicker) to review
  • get feedback quicker

If you have a larger feature, please consider creating a simple minimum-viable feature and submit for review. Once this has been reviewed by the Splink dev team there are two options to consider:

  1. Merge minimal feature, then create a new branch with additional features.
  2. Do not merge the initial feature branch, create additional feature branches from the reviewed branch.

The best solution often depends on the specific feature being created and any other development work happening in that area of the codebase. If you are unsure, please ask the dev team for advice on how to best structure your changes in your initial PR and we can come to a decision together.

"},{"location":"dev_guides/caching.html","title":"Caching and pipelining","text":""},{"location":"dev_guides/caching.html#caching-and-pipelining","title":"Caching and pipelining","text":"

Splink is able to run against multiple SQL backends because all of the core data linking calculations are implemented in SQL. This SQL can therefore be submitted to a chosen SQL backend for execution.

Computations in Splink often take the form of a number of select statements run in sequence.

For example, the predict() step:

  • Inputs __splink__df_concat_with_tf and outputs __splink__df_blocked
  • Inputs __splink__df_blocked and outputs __splink__df_comparison_vectors
  • Inputs __splink__df_comparison_vectors and outputs __splink__df_match_weight_parts
  • Inputs __splink__df_match_weight_parts and outputs __splink__df_predict

To make this run faster, two key optimisations are implemented:

  • Pipelining - combining multiple select statements into a single statement using WITH(CTE) queries
  • Caching: saving the results of calculations so they don't need recalculating. This is especially useful because some intermediate calculations are reused multiple times during a typical Splink session

This article discusses the general implementation of caching and pipelining. The implementation needs some alterations for certain backends like Spark, which lazily evaluate SQL by default.

"},{"location":"dev_guides/caching.html#implementation-pipelining","title":"Implementation: Pipelining","text":"

A SQLPipeline class manages SQL pipelining.

A SQLPipeline is composed of a number of SQLTask objects, each of which represents a select statement.

The code is fairly straightforward: Given a sequence of select statements, [a,b,c] they are combined into a single query as follows:

with
a as (a_sql),
b as (b_sql),
c_sql

To make this work, each statement (a,b,c) in the pipeline must refer to the previous step by name. For example, b_sql probably selects from the a_sql table, which has been aliased a. So b_sql must use the table name a to refer to the result of a_sql.

To make this tractable, each SQLTask has an output_table_name. For example, the output_table_name for a_sql in the above example is a.

For instance, in the predict() pipeline above, the first output_table_name is __splink__df_blocked. By giving each task a meaningful output_table_name, subsequent tasks can reference previous outputs in a way which is semantically clear.
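The combination step can be sketched with a hypothetical helper, where each task is a (sql, output_table_name) pair and later statements refer to earlier ones by their output_table_name; this is an illustration, not Splink's actual SQLPipeline code.

```python
# Hypothetical sketch of pipelining select statements into a single
# WITH (CTE) query. All tasks except the last become named CTEs; the
# final statement is appended as the query body.

def pipeline_sql(tasks):
    """tasks: list of (sql, output_table_name) tuples, in execution order."""
    *intermediate, (final_sql, _) = tasks
    ctes = ",\n".join(f"{name} as ({sql})" for sql, name in intermediate)
    return f"with\n{ctes}\n{final_sql}" if ctes else final_sql

tasks = [
    ("select * from __splink__df_concat_with_tf", "__splink__df_blocked"),
    ("select * from __splink__df_blocked", "__splink__df_comparison_vectors"),
    ("select * from __splink__df_comparison_vectors", "__splink__df_predict"),
]
print(pipeline_sql(tasks))
```

The (simplified) SQL in each task here is a placeholder; the point is that each statement selects from the previous task's output_table_name.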

"},{"location":"dev_guides/caching.html#implementation-caching","title":"Implementation: Caching","text":"

When a SQL pipeline is executed, it has two output names:

  • A physical_name, which is the name of the materialised table in the output database e.g. __splink__df_predict_cbc9833
  • A templated_name, which is a descriptive name of what the table represents e.g. __splink__df_predict

Each time Splink runs a SQL pipeline, the SQL string is hashed. This creates a unique identifier for that particular SQL string, which serves to identify the output.

When Splink is asked to execute a SQL string, before execution, it checks whether the resultant table already exists. If it does, it returns the table rather than recomputing it.

For example, when we run linker.predict(), Splink:

  • Generates the SQL tasks
  • Pipelines them into a single SQL statement
  • Hashes the statement to create a physical name for the outputs __splink__df_predict_cbc9833
  • Checks whether a table with physical name __splink__df_predict_cbc9833 already exists in the database
  • If not, executes the SQL statement, creating table __splink__df_predict_cbc9833 in the database.
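The check-then-execute logic can be sketched as follows; the cache dict and execute_sql callable are illustrative stand-ins for the database, not Splink's real internals.

```python
import hashlib

cache = {}  # physical_name -> materialised table

def sql_to_table(sql, templated_name, execute_sql):
    # Hash the SQL string to derive a unique physical name for the output
    sql_hash = hashlib.sha256(sql.encode()).hexdigest()[:7]
    physical_name = f"{templated_name}_{sql_hash}"
    # Only execute on a cache miss; otherwise return the existing table
    if physical_name not in cache:
        cache[physical_name] = execute_sql(sql)
    return physical_name, cache[physical_name]
```

Calling this twice with the same SQL string executes the query only once; any change to the SQL produces a different hash and hence a fresh physical table.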

In terms of implementation, the following happens:

  • SQL statements are generated and put in the queue - see here
  • Once all the tasks have been added to the queue, we call _execute_sql_pipeline() see here
  • The SQL is combined into a single pipelined statement here
  • We call _sql_to_splink_dataframe() which returns the table (from the cache if it already exists, or it executes the SQL)
  • The table is returned as a SplinkDataframe, an abstraction over a table in a database. See here.
"},{"location":"dev_guides/caching.html#some-cached-tables-do-not-need-a-hash","title":"Some cached tables do not need a hash","text":"

A hash is required to uniquely identify some outputs. For example, blocking is used in several places in Splink, with different results: the __splink__df_blocked needed to estimate parameters is different to the __splink__df_blocked needed in the predict() step.

As a result, we cannot materialise a single table called __splink__df_blocked in the database and reuse it multiple times. This is why we append the hash of the SQL, so that we can uniquely identify the different versions of __splink__df_blocked which are needed in different contexts.

There are, however, some tables which are globally unique. They only take a single form, and if they exist in the cache they never need recomputing.

An example of this is __splink__df_concat_with_tf, which represents the concatenation of the input dataframes.

To create this table, we can execute _sql_to_splink_dataframe with materialise_as_hash set to False. The resultant materialised table will not have a hash appended, and will simply be called __splink__df_concat_with_tf. This is useful, because when performing calculations Splink can now check the cache for __splink__df_concat_with_tf each time it is needed.

In fact, many Splink pipelines begin with the assumption that this table exists in the database, because the first SQLTask in the pipeline refers to a table named __splink__df_concat_with_tf. To ensure this is the case, a function is used to create this table if it doesn't exist.

"},{"location":"dev_guides/caching.html#using-pipelining-to-optimise-splink-workloads","title":"Using pipelining to optimise Splink workloads","text":"

At what point should a pipeline of SQLTasks be executed (materialised into a physical table)?

For any individual output, it will usually be fastest to pipeline the full lineage of tasks, right from raw data through to the end result.

However, there are many intermediate outputs which are used by many different Splink operations.

Performance can therefore be improved by computing and saving these intermediate outputs to a cache, to ensure they don't need to be computed repeatedly.

This is achieved by enqueueing SQL to a pipeline and strategically calling execute_sql_pipeline to materialise results that need to be cached.

"},{"location":"dev_guides/debug_modes.html","title":"Understanding and debugging Splink","text":""},{"location":"dev_guides/debug_modes.html#understanding-and-debugging-splinks-computations","title":"Understanding and debugging Splink's computations","text":"

Splink contains tooling to help developers understand the underlying computations, how caching and pipelining is working, and debug problems.

There are two main mechanisms: _debug_mode, and setting different logging levels

"},{"location":"dev_guides/debug_modes.html#debug-mode","title":"Debug mode","text":"

You can turn on debug mode by setting linker._debug_mode = True.

This has the following effects:

  • Each step of Splink's calculations is executed in turn. That is, pipelining is switched off.
  • The SQL statements being executed by Splink are displayed
  • The results of the SQL statements are displayed in tabular format

This is probably the best way to understand each step of the calculations being performed by Splink - because a lot of the implementation gets 'hidden' within pipelines for performance reasons.

Note that enabling debug mode will dramatically reduce Splink's performance!

"},{"location":"dev_guides/debug_modes.html#logging","title":"Logging","text":"

Splink has a range of logging modes that output information about what Splink is doing at different levels of verbosity.

Unlike debug mode, logging doesn't affect the performance of Splink.

"},{"location":"dev_guides/debug_modes.html#logging-levels","title":"Logging levels","text":"

You can set the logging level with code like logging.getLogger(\"splink\").setLevel(desired_level) although see notes below about gotchas.

The logging levels in Splink are:

  • logging.INFO (20): This outputs user facing messages about the training status of Splink models
  • 15: Outputs additional information about time taken and parameter estimation
  • logging.DEBUG (10): Outputs information about the names of the SQL statements executed
  • 7: Outputs information about the names of the components of the SQL pipelines
  • 5: Outputs the SQL statements themselves
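The intermediate levels (15, 7, 5) have no named constants in Python's logging module, so they are set by number; a minimal example:

```python
import logging

# 15 sits between logging.DEBUG (10) and logging.INFO (20): it shows
# timing and parameter-estimation detail without full SQL-level output
logging.getLogger("splink").setLevel(15)
```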
"},{"location":"dev_guides/debug_modes.html#how-to-control-logging","title":"How to control logging","text":"

Note that by default Splink sets the logging level to INFO on initialisation

"},{"location":"dev_guides/debug_modes.html#with-basic-logging","title":"With basic logging","text":"
import logging
linker = Linker(df, settings, db_api)

# This must come AFTER the linker is initialised, because the logging level
# will be set to INFO
logging.getLogger("splink").setLevel(logging.DEBUG)
"},{"location":"dev_guides/debug_modes.html#without-basic-logging","title":"Without basic logging","text":"
# This code can be anywhere since set_up_basic_logging is False
import logging
logging.basicConfig(format="%(message)s")
splink_logger = logging.getLogger("splink")
splink_logger.setLevel(logging.INFO)

linker = Linker(df, settings, db_api, set_up_basic_logging=False)
"},{"location":"dev_guides/dependency_compatibility_policy.html","title":"Dependency Compatibility Policy","text":"

This page highlights the importance of package versioning and proposes that we use a "sunsetting" strategy for updating our supported Python and dependency versions as they reach end-of-life.

Additionally, it lays out some rough guidelines for us to follow when addressing future package conflicts and issues arising from antiquated dependency versions.

"},{"location":"dev_guides/dependency_compatibility_policy.html#package-versioning-policy","title":"Package Versioning Policy","text":"

Monitoring package versioning within Splink is important. It ensures that the project can be used by as wide a group of individuals as possible, without wreaking havoc on our issues log.

Below is a rough summary of versioning and some complementary guidelines detailing how we should look to deal with dependency management going forward.

"},{"location":"dev_guides/dependency_compatibility_policy.html#benefits-to-effective-versioning","title":"Benefits to Effective Versioning","text":"

Effective versioning is crucial for ensuring Splink's compatibility across diverse technical ecosystems and seamless integration with various Python versions and cloud tools. Key advantages include:

  • Faster dependency resolution with poetry lock.
  • Reduces dependency conflicts across systems.
"},{"location":"dev_guides/dependency_compatibility_policy.html#versioning-guidance","title":"Versioning Guidance","text":""},{"location":"dev_guides/dependency_compatibility_policy.html#establish-minimum-supported-versions","title":"Establish Minimum Supported Versions","text":"
  • Align with Python Versions: Select the minimum required versions for dependencies based on the earliest version of Python we plan to support. This approach is aligned with our policy on Sunsetting End-of-Life Python Versions, ensuring Splink remains compatible across a broad spectrum of environments.
  • Document Reasons: Where appropriate, clearly document why specific versions are chosen as minimums, including any critical features or bug fixes that dictate these choices. We should look to do this in pull requests implementing the change and as comments in pyproject.toml. Doing so allows us to easily track versioning decisions.
"},{"location":"dev_guides/dependency_compatibility_policy.html#prefer-open-version-constraints","title":"Prefer Open Version Constraints","text":"
  • Use Open Upper Bounds: Wherever feasible, avoid setting an upper version limit for a dependency. This reduces compatibility conflicts with external packages and allows the user to decide their versioning strategy at the application level.
  • Monitor Compatibility: Actively monitor the development of our core dependencies to anticipate significant updates (such as new major versions) that might necessitate code changes. Within Splink, this is particularly relevant for SQLGlot and DuckDB, both of which (semi-)frequently release breaking changes.
"},{"location":"dev_guides/dependency_compatibility_policy.html#compatibility-checks","title":"Compatibility Checks","text":"
  • Automated Testing: Use Continuous Integration (CI) to help test the latest python and package versions. This helps identify compatibility issues early.
  • Matrix Testing: Test against a matrix of dependencies or python versions to ensure broad compatibility. pytest_run_tests_with_cache.yml is currently our broad compatibility check for supported versions of python.
"},{"location":"dev_guides/dependency_compatibility_policy.html#handling-breaking-changes","title":"Handling Breaking Changes","text":"
  • Temporary Version Pinning for Major Changes: In cases where a dependency introduces breaking changes that we cannot immediately accommodate, we should look to temporarily pin to a specific version or version range until we have an opportunity to update Splink.
  • Adaptive Code Changes: When feasible, adapt code to be compatible with new major versions of dependencies. This may include conditional logic to handle differences across versions. An example of this can be found within input_column.py, where we adjust how column identifiers are extracted from SQLGlot based on its version.
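The "adaptive code" pattern above can be sketched in a toy, self-contained form. Here Python's own version stands in for a third-party dependency's version (the real example in input_column.py branches on SQLGlot's version instead):

```python
import sys
from functools import reduce

# Branch on the installed version so both old and new releases keep working:
# newer interpreters get the native implementation, older ones a fallback.
if sys.version_info >= (3, 8):
    from math import prod
else:
    def prod(xs):
        # Backported fallback for interpreters without math.prod
        return reduce(lambda a, b: a * b, xs, 1)
```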
"},{"location":"dev_guides/dependency_compatibility_policy.html#documentation-and-communication","title":"Documentation and Communication","text":"
  • Clear Documentation: Clearly log installation instructions within the Getting Started section of our documentation. This should cover not only standard installation procedures but also specialised instructions, for instance, installing a DuckDB-less version of Splink for locked-down environments.
  • Log Dependency Changes in the CHANGELOG: Where dependencies are adjusted, ensure that changes are logged within CHANGELOG.md. This can help simplify debugging and creates a guide that can be easily referenced.
"},{"location":"dev_guides/dependency_compatibility_policy.html#user-support-and-feedback","title":"User Support and Feedback","text":"
  • Issue Tracking: Actively track and address issues related to dependency compatibility. Where users are having issues, have them report their package versions through either pip freeze or pip-chill, so we can more easily identify what may have caused the problem.
  • Feedback Loops: Encourage feedback from users regarding compatibility and dependency issues. Streamline the reporting process in our issues log.
"},{"location":"dev_guides/dependency_compatibility_policy.html#sunsetting-end-of-life-python-versions","title":"Sunsetting End-of-Life Python Versions","text":"

In alignment with the Python community's practices, we are phasing out support for Python versions that have hit end-of-life and are no longer maintained by the core Python development team. This decision ensures that Splink remains secure, efficient, and up-to-date with the latest Python features and improvements.

Our approach mirrors that of key package maintainers, such as the developers behind NumPy. The NumPy developers have kindly pulled together NEP 29, their guidelines for python version support. This outlines a recommended framework for the deprecation of outdated Python versions.

"},{"location":"dev_guides/dependency_compatibility_policy.html#benefits-of-discontinuing-support-for-older-python-versions","title":"Benefits of Discontinuing Support for Older Python Versions:","text":"
  • Enhanced Tooling: Embracing newer versions enables the use of advanced Python features. For python 3.8, these include protocols, walrus operators, and improved type annotations, amongst others.
  • Fewer Dependabot Alerts: Transitioning away from older Python versions reduces the volume of alerts associated with legacy package dependencies.
  • Minimised Package Conflicts: Updating python decreases the necessity for makeshift solutions to resolve dependency issues with our core dependencies, fostering a smoother integration with tools like Poetry.

For a comprehensive rationale behind upgrading, the article \"It's time to stop using python 3.7\" offers an insightful summary.

"},{"location":"dev_guides/dependency_compatibility_policy.html#implementation-timeline","title":"Implementation Timeline:","text":"

The cessation of support for major Python versions post-end-of-life will not be immediate but will instead be phased in gradually over the months following their official end-of-life designation.

Proposed Workflow for Sunsetting Major Python Versions:

  1. Initial Grace Period: We propose a waiting period of approximately six months post-end-of-life before initiating the upgrade process. This interval:
    • Mitigates potential complications arising from system-wide Python updates across major cloud distributors and network administrators.
    • Provides a window to inform users about the impending deprecation of older versions.
  2. Following the Grace Period:
    • Ensure the upgrade process is seamless and devoid of critical issues, leveraging the backward compatibility strengths of newer Python versions.
    • Address any bugs discovered during the upgrade process.
    • Update pyproject.toml accordingly. Pull requests updating our supported versions should be clearly marked with the [DEPENDENCIES] tag and python_version_update label for straightforward tracking.
"},{"location":"dev_guides/dependency_compatibility_policy.html#pythons-development-cycle","title":"Python's Development Cycle:","text":"

A comprehensive summary of Python's development cycle is available on the Python Developer's Guide. This includes a chart outlining the full release cycle up to 2029.

As it stands, support for Python 3.8 will officially end in October of 2024. Following an initial grace period of around six months, we will then look to phase out support.

We will look to regularly review this page and update Splink's dependencies accordingly.

"},{"location":"dev_guides/spark_pipelining_and_caching.html","title":"Spark caching","text":""},{"location":"dev_guides/spark_pipelining_and_caching.html#caching-and-pipelining-in-spark","title":"Caching and pipelining in Spark","text":"

This article assumes you've read the general guide to caching and pipelining.

In Spark, some additions have to be made to this general pattern because all transformations in Spark are lazy.

That is, when we call df = spark.sql(sql), the df is not immediately computed.

Furthermore, even when an action is called, the results aren't automatically persisted by Spark to disk. This differs from other backends, which execute SQL as a create table statement, meaning that the result is automatically saved.

This interferes with caching, because Splink assumes that when the function _execute_sql_against_backend() is called, it will be evaluated greedily (immediately) AND the results will be saved to the 'database'.
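The lazy/greedy distinction can be illustrated with a toy sketch (this is not Splink's internals; LazyResult is an invented class for illustration):

```python
# A greedy backend computes and persists results as soon as SQL is executed
# (e.g. via CREATE TABLE AS); Spark instead returns a lazy handle, so the
# Spark linker must force evaluation and save the output itself.
class LazyResult:
    """Computation deferred until explicitly materialised, as in Spark."""

    def __init__(self, compute):
        self._compute = compute
        self._value = None

    def materialise(self):
        # Compute once, then cache - the step Splink's caching relies on
        if self._value is None:
            self._value = self._compute()
        return self._value

lazy = LazyResult(lambda: sum(range(5)))  # nothing computed yet
value = lazy.materialise()  # computed (and cached) only now
```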

Another quirk of Spark is that it chunks work up into tasks. This is relevant for two reasons:

  • Tasks can suffer from skew, meaning some take longer than others, which can be bad from a performance point of view.
  • The number of tasks and how data is partitioned controls how many files are output when results are saved. Some Splink operations result in a very large number of small files, which can take a long time to read and write relative to the same data stored in fewer files.

Repartitioning can be used to rebalance workloads (reduce skew) and to avoid the 'many small files' problem.

"},{"location":"dev_guides/spark_pipelining_and_caching.html#spark-specific-modifications","title":"Spark-specific modifications","text":"

The logic for Spark is captured in the implementation of _execute_sql_against_backend() in the spark_linker.py.

This has three roles:

  • It determines how to save results - using either persist, checkpoint or saving to .parquet, with .parquet being the default.
  • It determines which results to save. Some small results, such as __splink__m_u_counts, are immediately converted using toPandas() rather than being saved. This is because saving to disk and reloading is expensive and unnecessary.
  • It chooses which Spark dataframes to repartition to reduce the number of files which are written/read.

Note that repartitioning and saving are independent. Some dataframes are saved without repartitioning. Some dataframes are repartitioned without being saved.

"},{"location":"dev_guides/transpilation.html","title":"Transpilation using sqlglot","text":""},{"location":"dev_guides/transpilation.html#sql-transpilation-in-splink-and-how-we-support-multiple-sql-backends","title":"SQL Transpilation in Splink, and how we support multiple SQL backends","text":"

In Splink, all the core data linking algorithms are implemented in SQL. This allows computation to be offloaded to a SQL backend of the user's choice.

One difficulty with this paradigm is that SQL implementations differ - the functions available in (say) the Spark dialect of SQL differ from those available in DuckDB SQL. And to make matters worse, functions with the same name may behave differently (e.g. different arguments, arguments in different orders, etc.).

Splink therefore needs a mechanism of writing SQL statements that are able to run against all the target SQL backends (engines).

Details are as follows:

"},{"location":"dev_guides/transpilation.html#1-core-data-linking-algorithms-are-splink","title":"1. Core data linking algorithms are written in backend-agnostic SQL","text":"

Core data linking algorithms are implemented in 'backend agnostic' SQL. That is, they're written using basic SQL functions that are common across all the target backends, and so don't need any translation.

It has been possible to write all of the core Splink logic in SQL that is consistent between dialects.

However, this is not the case with Comparisons, which tend to use backend-specific SQL functions like jaro_winkler, whose names and signatures differ between backends.

"},{"location":"dev_guides/transpilation.html#2-user-provided-sql-is-interpolated-into-these-dialect-agnostic-sql-statements","title":"2. User-provided SQL is interpolated into these dialect-agnostic SQL statements","text":"

The user provides custom SQL in two places in Splink:

  1. Blocking rules
  2. The sql_condition (see here) provided as part of a Comparison

The user is free to write this SQL however they want.

It's up to the user to ensure the SQL they provide will execute successfully in their chosen backend. So the sql_condition must use functions that exist in the target execution engine.
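The mechanism above can be pictured with a toy sketch (this is not Splink's real internals; the template and column names are invented for illustration) of how a user-supplied sql_condition is interpolated into a backend-agnostic SQL statement:

```python
# A dialect-agnostic template: CASE WHEN is valid in every target backend.
template = (
    "SELECT *, CASE WHEN {condition} THEN 1 ELSE 0 END AS gamma "
    "FROM comparison_pairs"
)

# The user-supplied fragment - jaro_winkler must exist in the chosen engine
user_condition = "jaro_winkler(name_l, name_r) > 0.9"

sql = template.format(condition=user_condition)
```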

"},{"location":"dev_guides/transpilation.html#3-backends-can-implement-transpilation-and-or-dialect-steps-to-further-transform-the-sql-if-needed","title":"3. Backends can implement transpilation and/or dialect steps to further transform the SQL if needed","text":"

Occasionally some modifications are needed to the SQL to ensure it executes against the target backend.

sqlglot is used for this purpose. For instance, a custom dialect is implemented in the Spark linker.

A transformer is implemented in the Athena linker.

"},{"location":"dev_guides/udfs.html","title":"User-Defined Functions","text":""},{"location":"dev_guides/udfs.html#user-defined-functions","title":"User Defined Functions","text":"

User Defined Functions (UDFs) are functions that can be created to add functionality to a given SQL backend that does not already exist. These are particularly useful within Splink as it supports multiple SQL engines each with different inherent functionality. UDFs are an important tool for creating consistent functionality across backends.

For example, DuckDB has an in-built string comparison function for Jaccard similarity whereas Spark SQL doesn't have an equivalent function. Therefore, a UDF is required to use functions like JaccardAtThresholds() and JaccardLevel() with a Spark backend.

"},{"location":"dev_guides/udfs.html#spark","title":"Spark","text":"

Spark supports UDFs written in Scala and Java.

Splink currently uses UDFs written in Scala, which are implemented as follows:

  • The UDFs are created in a separate repository, splink_scalaudfs, with the Scala functions being defined in Similarity.scala.
  • The functions are then stored in a Java Archive (JAR) file - for more on JAR files, see the Java documentation.
  • Once the JAR file containing the UDFs has been created, it is copied across to the spark_jars folder in Splink.
  • Specify the correct JAR location within Splink.
  • UDFs are then registered within the Spark Linker.

Now that the Spark UDFs have been successfully registered, they can be used in Spark SQL. For example,

jaccard(\"name_column_1\", \"name_column_2\") >= 0.9\n

which provides the basis for functions such as JaccardAtThresholds() and JaccardLevel().

"},{"location":"dev_guides/udfs.html#duckdb","title":"DuckDB","text":"

Python UDFs can be registered to a DuckDB connection from version 0.8.0 onwards.

The documentation is here, and examples are here. Note that these functions should be registered against the DuckDB connection provided to the linker, using connection.create_function.

Note that performance will generally be substantially slower than using native DuckDB functions. Consider using vectorised UDFs where possible - see here.

"},{"location":"dev_guides/udfs.html#athena","title":"Athena","text":"

Athena supports UDFs written in Java, however these have not yet been implemented in Splink.

"},{"location":"dev_guides/udfs.html#sqlite","title":"SQLite","text":"

Python UDFs can be registered to a SQLite connection using the create_function function. An example is as follows:

import sqlite3\n\nfrom rapidfuzz.distance.Levenshtein import distance\n\nconn = sqlite3.connect(\":memory:\")\nconn.create_function(\"levenshtein\", 2, distance)\n

The function levenshtein is now available to use in SQL queries executed against this connection.
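A self-contained usage sketch of the same setup - here a tiny pure-Python edit-distance implementation stands in for rapidfuzz so the example has no third-party dependencies:

```python
import sqlite3

def levenshtein(a: str, b: str) -> int:
    # Minimal dynamic-programming edit distance, standing in for rapidfuzz
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

conn = sqlite3.connect(":memory:")
conn.create_function("levenshtein", 2, levenshtein)
# The registered function can now be called from SQL on this connection
distance = conn.execute("SELECT levenshtein('splink', 'splonk')").fetchone()[0]
```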

"},{"location":"dev_guides/changing_splink/blog_posts.html","title":"Contributing to the Splink Blog","text":""},{"location":"dev_guides/changing_splink/blog_posts.html#contributing-to-the-splink-blog","title":"Contributing to the Splink Blog","text":"

Thanks for considering making a contribution to the Splink Blog! We are keen to use this blog as a forum for all things data linking and Splink!

This blog, and the docs as a whole, are built using the fantastic MkDocs Material. To understand more about how the blog works under the hood, check out the MkDocs Material blog documentation.

For more general guidance for contributing to Splink, check out our Contributor Guide.

"},{"location":"dev_guides/changing_splink/blog_posts.html#adding-a-blog-post","title":"Adding a blog post","text":"

The easiest way to get started with a blog post is to make a copy of one of the pre-existing blog posts and make edits from there. There is a metadata section at the top of each post which should be updated with the post date, authors and the category of the post (this is a tag system to make posts easier to find).

Blog posts are ordered by date, so change the name of your post markdown file to be a recent date (YYYY-MM-DD format) to make sure it appears at the top of the blog.
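For illustration only (the slug and helper here are invented, not part of the blog tooling), a post filename with the date prefix described above could be generated like so:

```python
from datetime import date

# Prefix the post's slug with today's date in YYYY-MM-DD format,
# which the blog uses to order posts
slug = "my-new-post"
filename = f"{date.today():%Y-%m-%d}-{slug}.md"
```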

Note

In this blog we want to make content as easily digestible as possible. We encourage breaking up big blocks of text into sections and using visuals/emojis/gifs to bring your post to life!

"},{"location":"dev_guides/changing_splink/blog_posts.html#adding-a-new-author-to-the-blogs","title":"Adding a new author to the blogs","text":"

If you are a new author, you will need to add yourself to the .authors.yml file.

"},{"location":"dev_guides/changing_splink/blog_posts.html#testing-your-changes","title":"Testing your changes","text":"

Once you have made a first draft, check out how the deployed blog will look by building the docs locally.

"},{"location":"dev_guides/changing_splink/building_env_locally.html","title":"Building your local environment","text":""},{"location":"dev_guides/changing_splink/building_env_locally.html#creating-a-virtual-environment-for-splink","title":"Creating a Virtual Environment for Splink","text":""},{"location":"dev_guides/changing_splink/building_env_locally.html#managing-dependencies-with-poetry","title":"Managing Dependencies with Poetry","text":"

Splink utilises poetry for managing its core dependencies, offering a clean and effective solution for tracking and resolving any ensuing package and version conflicts.

You can find a list of Splink's core dependencies within the pyproject.toml file.

"},{"location":"dev_guides/changing_splink/building_env_locally.html#fundamental-commands-in-poetry","title":"Fundamental Commands in Poetry","text":"

Below are some useful commands to help in the maintenance and upkeep of the pyproject.toml file.

Adding Packages - To incorporate a new package into Splink:

poetry add <package-name>\n
- To specify a version when adding a new package:
poetry add <package-name>==<version>\n# Add quotes if you want to use other equality calls\npoetry add \"<package-name> >= <version>\"\n

Modifying Packages - To remove a package from the project:

poetry remove <package-name>\n
- Updating an existing package to a specific version:
poetry add <package-name>==<version>\npoetry add \"<package-name> >= <version>\"\n
- To update an existing package to the latest version:
poetry update <package-name>\n
Note: Direct updates can also be performed within the pyproject.toml file.

Locking the Project - To update the existing poetry.lock file, thereby locking the project to ensure consistent dependency installation across different environments:

poetry lock\n
Note: This should be used sparingly due to our loose dependency requirements and the resulting time to solve the dependency graph. If you only need to update a single dependency, update it using poetry add <pkg>==<version> instead.

Installing Dependencies - To install project dependencies as per the lock file:

poetry install\n
- For optional dependencies, additional flags are required. For instance, to install dependencies along with Spark support:
poetry install -E spark\n

A comprehensive list of Poetry commands is available in the Poetry documentation.

"},{"location":"dev_guides/changing_splink/building_env_locally.html#automating-virtual-environment-creation","title":"Automating Virtual Environment Creation","text":"

To streamline the creation of a virtual environment via venv, you may use the create_venv.sh script.

This script facilitates the automatic setup of a virtual environment, with the default environment name being venv.

Default Environment Creation:

source scripts/create_venv.sh\n

Specifying a Custom Environment Name:

source scripts/create_venv.sh <name_of_venv>\n
"},{"location":"dev_guides/changing_splink/contributing_to_docs.html","title":"Contributing to Documentation","text":""},{"location":"dev_guides/changing_splink/contributing_to_docs.html#building-docs-locally","title":"Building docs locally","text":"

Before building the docs locally, you will need to follow the development quickstart to set up the necessary environment. You cannot skip this step, because some Splink docs Markdown is auto-generated using the Splink development environment.

Once you've done that, to rapidly build the documentation and immediately see changes you've made you can use this script outside your Poetry virtual environment:

source scripts/make_docs_locally.sh\n

This is much faster than waiting for GitHub actions to run if you're trying to make fiddly changes to formatting etc.

Once you've finished updating Splink documentation we ask that you run our spellchecker. Instructions on how to do this are given below.

"},{"location":"dev_guides/changing_splink/contributing_to_docs.html#quick-builds-for-rapidly-authoring-new-content","title":"Quick builds for rapidly authoring new content","text":"

When you run mkdocs serve -v --dirtyreload or mkdocs build, the mkdocs command will rebuild the entire site. This can be slow if you're just making small changes to a single page.

To speed up the process, you can temporarily tell mkdocs to ignore content by modifying mkdocs.yml, for example by adding:

exclude_docs: |\n  dev_guides/**\n  charts/**\n  topic_guides/**\n  demos/**\n  blog/**\n
"},{"location":"dev_guides/changing_splink/contributing_to_docs.html#spellchecking-docs","title":"Spellchecking docs","text":"

When updating Splink documentation, we ask that you run our spellchecker before submitting a pull request. This is to help ensure quality and consistency across the documentation. If for whatever reason you can't run the spellchecker on your system, please don't let this prevent you from contributing to the documentation. Please note, the spellchecker only works on markdown files.

If you are a Mac user with the Homebrew package manager installed, the script below will automatically install the required system dependency, aspell. If you've created your development environment using conda, aspell will have been installed as part of that process. Instructions for installing aspell through other means may be added here in the future.

To run the spellchecker on either a single markdown file or folder of markdown files, you can run the following bash script:

./scripts/pyspelling/spellchecker.sh <path_to_file_or_folder>\n

Omitting the file/folder path will run the spellchecker on all markdown files contained in the docs folder. We recommend running the spellchecker only on files that you have created or edited.

The spellchecker uses the Python package PySpelling and its underlying spellchecking tool, Aspell. Running the above script will automatically install these packages along with any other necessary dependencies.

The spellchecker compares words to a standard British English dictionary and a custom dictionary (scripts/pyspelling/custom_dictionary.txt) of words. If no spelling mistakes are found, you will see the following terminal printout:

Spelling check passed :)\n

otherwise, PySpelling will print out the spelling mistakes found in each file.

Correct spellings of words not found in a standard dictionary (e.g. \"Splink\") can be recorded as such by adding them to scripts/pyspelling/custom_dictionary.txt.

Please correct any mistakes found or update the custom dictionary to ensure the spellchecker passes before putting in a pull request containing updates to the documentation.

Note

The spellchecker is configured (via pyspelling.yml) to ignore text between certain delimiters to minimise picking up Splink/programming-specific terms. If there are additional patterns that you think should be excepted then please let us know in your pull request.

The custom dictionary deliberately contains a small number of misspelled words (e.g. \u201cSiohban\u201d). These are sometimes necessary where we are explaining how Splink handles typos in data records.

"},{"location":"dev_guides/changing_splink/development_quickstart.html","title":"Development Quickstart","text":"

Splink is a complex project with many dependencies. This page provides step-by-step instructions for getting set up to develop Splink. Once you have followed these instructions, you should be all set to start making changes.

"},{"location":"dev_guides/changing_splink/development_quickstart.html#step-0-unix-like-operating-system","title":"Step 0: Unix-like operating system","text":"

We highly recommend developing Splink on a Unix-like operating system, such as MacOS or Linux. While it is possible to develop on another operating system such as Windows, we do not provide instructions for how to do so.

Luckily, Windows users can easily fulfil this requirement by installing the Windows Subsystem for Linux (WSL):

  • Open PowerShell as Administrator: Right-click the Start button, select \u201cWindows Terminal (Admin)\u201d, and ensure PowerShell is the selected shell.
  • Run the command wsl --install.
  • You can find more guidance on setting up WSL on the Microsoft website but you don't need to do anything additional.
  • Open the Windows Terminal again (does not need to be Admin) and select the Ubuntu shell. Follow the rest of these instructions in that shell.
"},{"location":"dev_guides/changing_splink/development_quickstart.html#step-1-clone-splink","title":"Step 1: Clone Splink","text":"

If you haven't already, create a fork of the Splink repository. You can find the Splink repository here, or click here to go directly to making a fork. Clone your fork to whatever directory you want to work in with git clone https://github.com/<YOUR_USERNAME>/splink.git.

"},{"location":"dev_guides/changing_splink/development_quickstart.html#step-2-choose-how-to-install-system-dependencies","title":"Step 2: Choose how to install system dependencies","text":"

Developing Splink requires Python, as well as Poetry (the package manager we use to install Python package dependencies). Running Spark or PostgreSQL on your computer to test those backends requires additional dependencies. Athena only runs in the AWS cloud, so to locally run the tests for that backend you will need to create an AWS account and configure Splink to use it.

There are two ways to install these system dependencies: globally on your computer, or in an isolated conda environment.

The decision of which approach to take is subjective.

If you already have Python and Poetry installed (plus Java and PostgreSQL if you want to run the Spark and PostgreSQL backends locally), there is probably little advantage to using conda.

On the other hand, conda is particularly suitable if:

  • You're already a conda user, and/or
  • You're working in an environment where security policies prevent the installation of system level packages like Java
  • You don't want to do global installs of some of the requirements like Java
"},{"location":"dev_guides/changing_splink/development_quickstart.html#step-3-manual-install-option-install-system-dependencies","title":"Step 3, Manual install option: Install system dependencies","text":""},{"location":"dev_guides/changing_splink/development_quickstart.html#python","title":"Python","text":"

Check if Python is already installed by running python3 --version. If that outputs a version like 3.10.12, you've already got it! Otherwise, follow the instructions for installation on your platform from the Python website.

"},{"location":"dev_guides/changing_splink/development_quickstart.html#poetry","title":"Poetry","text":"

Run these commands to install Poetry globally. Note that we currently use an older version of Poetry, so the version must be specified.

pip install --upgrade pip\npip install poetry==1.4.2\n
"},{"location":"dev_guides/changing_splink/development_quickstart.html#java","title":"Java","text":"

The instructions to install Java globally depend on your operating system. Generally, some version of Java will be available from your operating system's package manager. Note that you must install a version of Java earlier than Java 18 because Splink currently uses an older version of Spark.

As an example, you could run this on Ubuntu:

sudo apt install openjdk-11-jre-headless\n
"},{"location":"dev_guides/changing_splink/development_quickstart.html#postgresql-optional","title":"PostgreSQL (optional)","text":"

Follow the instructions on the PostgreSQL website to install it on your computer.

Then, we will need to set up a database for Splink. You can achieve that with the following commands:

initdb splink_db\npg_ctl -D splink_db start --wait -l ./splink_db_log\ncreatedb splink_db # The inner database\npsql -d splink_db <<SQL\n  CREATE USER splinkognito CREATEDB CREATEROLE password 'splink123!' ;\nSQL\n

Most of these commands are one-time setup, but the pg_ctl -D splink_db start --wait -l ./splink_db_log command will need to be run each time you want to start PostgreSQL (after rebooting, for example).

Alternatively, you can run PostgreSQL using Docker. First, install Docker Desktop.

Then run the setup script (a thin wrapper around docker-compose) each time you want to start your PostgreSQL server:

./scripts/postgres_docker/setup.sh\n

and the teardown script each time you want to stop it:

./scripts/postgres_docker/teardown.sh\n

Included in the docker-compose file is a pgAdmin container to allow easy exploration of the database as you work, which can be accessed in-browser on the default port. The default username is a@b.com with password b.

"},{"location":"dev_guides/changing_splink/development_quickstart.html#step-3-conda-install-option-install-system-dependencies","title":"Step 3, Conda install option: Install system dependencies","text":"

These instructions are the same no matter what operating system you are using. As an added benefit, these installations will be specific to the conda environment you create for Splink, so they will not interfere with other projects.

For convenience, we have created an automatic installation script that will install all dependencies for you. It will create an isolated conda environment called splink.

From the directory where you have cloned the Splink repository, simply run:

./scripts/conda/development_setup_with_conda.sh\n

If you use a shell besides bash, add the mamba CLI to your PATH by running ~/miniforge3/bin/mamba init <your_shell> -- e.g. ~/miniforge3/bin/mamba init zsh for zsh.

If you've run this successfully, restart your terminal and skip to the \"Step 5: Activating your environment(s)\" section.

If you would prefer to manually go through the steps to have a better understanding of what you are installing, continue to the next section.

"},{"location":"dev_guides/changing_splink/development_quickstart.html#install-conda-itself","title":"Install Conda itself","text":"

First, we need to install a conda CLI. Any will do, but we recommend Miniforge, which can be installed like so:

curl -L -O \"https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh\"\nbash Miniforge3-$(uname)-$(uname -m).sh -b\n

Miniforge is great because it defaults to the community-curated conda-forge channel, and it installs the mamba CLI by default, which is generally faster than the conda CLI.

Before you'll be able to run the mamba command, you need to run ~/miniforge3/bin/mamba init for your shell -- e.g. ~/miniforge3/bin/mamba init for Bash or ~/miniforge3/bin/mamba init zsh for zsh.

"},{"location":"dev_guides/changing_splink/development_quickstart.html#install-conda-packages","title":"Install Conda packages","text":"

The rest is easy, because all the other dependencies can be installed as conda packages. Simply run:

mamba env create -n splink --file ./scripts/conda/development_environment.yaml\n

Now run mamba activate splink to enter your newly created conda environment -- you will need to do this again each time you open a new terminal. Run the rest of the steps in this guide inside this environment. mamba deactivate leaves the environment.

"},{"location":"dev_guides/changing_splink/development_quickstart.html#step-4-python-package-dependencies","title":"Step 4: Python package dependencies","text":"

Splink manages the other Python packages it depends on using Poetry. Simply run poetry install in the Splink directory to install them. You can find more options for this command (such as how to install optional dependencies) on the managing dependencies with Poetry page.

To enter the virtual environment created by poetry, run poetry shell. You will need to do this again each time you open a new terminal. Use exit to leave the Poetry shell.

"},{"location":"dev_guides/changing_splink/development_quickstart.html#step-5-activating-your-environments","title":"Step 5: Activating your environment(s)","text":"

Depending on the options you chose in this document, you now have either:

  • Only a Poetry virtual environment.
  • Both a conda environment and a Poetry virtual environment.

If you did not use conda, then each time you open a terminal to develop Splink, after navigating to the repository directory, run poetry shell.

If you did use conda, then each time you open a terminal to develop Splink, after navigating to the repository directory, run mamba activate splink and then poetry shell.

"},{"location":"dev_guides/changing_splink/development_quickstart.html#step-6-checking-that-it-worked","title":"Step 6: Checking that it worked","text":"

If you have installed all the dependencies, including PostgreSQL, you should be able to run the following command without error (will take about 10 minutes):

pytest tests/\n

This runs all the Splink tests across the default DuckDB and Spark backends, and runs some integration tests across the rest of the backends except for Athena, which can't run locally.

If you haven't installed PostgreSQL, try this:

pytest tests/ --ignore tests/test_full_example_postgres.py\n
"},{"location":"dev_guides/changing_splink/development_quickstart.html#step-7-visual-studio-code-optional","title":"Step 7: Visual Studio Code (optional)","text":"

You're now all set to develop Splink. If you have a text editor/IDE you are comfortable with for working on Python packages, you can use that. If you don't, we recommend Visual Studio Code. Here are some tips on how to get started:

  • Install Visual Studio Code
  • If you are using WSL on Windows, install the WSL extension. You will want to do all development inside a WSL \"remote.\"
  • Install the Python extension.
  • Use the Python extension's pytest functionality to run the tests within your IDE.
  • Use the interactive window to run code snippets.
"},{"location":"dev_guides/changing_splink/lint_and_format.html","title":"Linting and Formatting","text":""},{"location":"dev_guides/changing_splink/lint_and_format.html#linting-your-code","title":"Linting your code","text":"

We use ruff for linting and formatting.

To quickly run both the linter and formatter, you can source the linting bash script (shown below). The -f flag can be called to run automatic fixes with ruff. If you simply wish for ruff to print the errors it finds to the console, remove this flag.

poetry run ruff format\npoetry run ruff check .\n
"},{"location":"dev_guides/changing_splink/lint_and_format.html#additional-rules","title":"Additional Rules","text":"

ruff contains an extensive arsenal of linting rules and techniques that can be applied.

If you wish to add an additional rule, do so in the pyproject.toml file in the root of the project.

"},{"location":"dev_guides/changing_splink/managing_dependencies_with_poetry.html","title":"Managing Dependencies with Poetry","text":"

Splink utilises poetry for managing its core dependencies, offering a clean and effective solution for tracking and resolving any ensuing package and version conflicts.

You can find a list of Splink's core dependencies within the pyproject.toml file.

A comprehensive list of Poetry commands is available in the Poetry documentation.

"},{"location":"dev_guides/changing_splink/managing_dependencies_with_poetry.html#fundamental-commands-in-poetry","title":"Fundamental Commands in Poetry","text":"

Below are some useful commands to help in the maintenance and upkeep of the pyproject.toml file.

"},{"location":"dev_guides/changing_splink/managing_dependencies_with_poetry.html#adding-packages","title":"Adding Packages","text":"

To incorporate a new package into Splink:

poetry add <package-name>\n

To specify a version when adding a new package:

poetry add <package-name>==<version>\n# Add quotes if you want to use other version constraints\npoetry add \"<package-name> >= <version>\"\n
"},{"location":"dev_guides/changing_splink/managing_dependencies_with_poetry.html#modifying-packages","title":"Modifying Packages","text":"

To remove a package from the project:

poetry remove <package-name>\n

Updating an existing package to a specific version:

poetry add <package-name>==<version>\npoetry add \"<package-name> >= <version>\"\n

To update an existing package to the latest version:

poetry update <package-name>\n

Note: Direct updates can also be performed within the pyproject.toml file.
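As a sketch of what such a direct edit looks like (the package names and version constraints below are illustrative, not Splink's actual pins — check the real pyproject.toml), a dependency entry follows Poetry's constraint syntax:

```toml
# Illustrative only -- see Splink's pyproject.toml for the real constraints
[tool.poetry.dependencies]
python = ">=3.8,<4.0"
duckdb = ">=0.9.2"   # plain range constraint
altair = "^5.0.1"    # caret constraint: means >=5.0.1,<6.0.0
```

After editing the file by hand, run poetry lock so that poetry.lock stays in sync.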

"},{"location":"dev_guides/changing_splink/managing_dependencies_with_poetry.html#locking-the-project","title":"Locking the Project","text":"

To update the existing poetry.lock file, thereby locking the project to ensure consistent dependency installation across different environments:

poetry lock\n

Note: This updates all dependencies and may take some time. If you only need to update a single dependency, update it using poetry add <pkg>==<version> instead.

"},{"location":"dev_guides/changing_splink/managing_dependencies_with_poetry.html#installing-dependencies","title":"Installing Dependencies","text":"

To install project dependencies as per the lock file:

poetry install\n

For optional dependencies, additional flags are required. For instance, to install dependencies along with Spark support:

poetry install -E spark\n


To install everything:

poetry install --with dev --with linting --with testing --with benchmarking --with typechecking --with demos --all-extras\n
"},{"location":"dev_guides/changing_splink/releases.html","title":"Releasing a Package Version","text":""},{"location":"dev_guides/changing_splink/releases.html#releasing-a-new-version-of-splink","title":"Releasing a new version of Splink","text":"

Splink is regularly updated with releases to add new features or bug fixes to the package.

Below are the steps for releasing a new version of Splink:

  1. On a new branch, update pyproject.toml and __init__.py with the latest version.
  2. Update CHANGELOG.md. This consists of adding a heading for the new release below the 'Unreleased' heading, with the new version and date. Additionally the links at the bottom of the file for 'unreleased' and the new version should be updated.
  3. Open a pull request to merge the new branch with the master branch (the base branch).
  4. Once the pull request has been approved, merge the changes and generate a new release in the releases section of the repo, including:

  5. Choosing a new release tag (which matches your updates to pyproject.toml and __init__.py). Ensure that your release tag follows semantic versioning. The target branch should be set to master.


  • Generating release notes. This can be done automatically by pressing the \"Generate release notes\" button.

This will give you release notes based on the pull requests which have been merged since the last release.


  • Publish as the latest release

Now your release should be published to PyPI.

"},{"location":"dev_guides/changing_splink/testing.html","title":"Testing","text":"","tags":["Testing","Pytest","Backends"]},{"location":"dev_guides/changing_splink/testing.html#testing-in-splink","title":"Testing in Splink","text":"

Tests in Splink make use of the pytest framework. You can find the tests themselves in the tests folder.

Splink tests can be broadly categorised into three sets:

  • 'Core' tests - these are tests which test some specific bit of functionality which does not depend on any specific SQL dialect. They are usually unit tests - examples are testing InputColumn and testing the latitude-longitude distance calculation.
  • Backend-agnostic tests - these are tests which run against some SQL backend, but which are written in such a way that they can run against many backends by making use of the backend-agnostic testing framework. The majority of tests are of this type.
  • Backend-specific tests - these are tests which run against a specific SQL backend, and test some feature particular to this backend. There are not many of these, as Splink is designed to run very similarly independent of the backend used.
","tags":["Testing","Pytest","Backends"]},{"location":"dev_guides/changing_splink/testing.html#running-tests","title":"Running tests","text":"","tags":["Testing","Pytest","Backends"]},{"location":"dev_guides/changing_splink/testing.html#running-tests-locally","title":"Running tests locally","text":"

To run tests locally against duckdb only (the default) run:

poetry run pytest tests/\n

To run a single test file, append the filename to the tests/ folder call, for example:

poetry run pytest tests/test_u_train.py\n

or for a single test, additionally append the test name after a pair of colons, as:

poetry run pytest tests/test_u_train.py::test_u_train_multilink\n
Further useful pytest options

There may be many warnings emitted, for instance by library dependencies, cluttering your output. In this case you can use --disable-pytest-warnings or -W ignore so that these will not be displayed. Some additional command-line options that may be useful:

  • -s to disable output capture, so that test output is displayed in the terminal in all cases
  • -v for verbose mode, where each test instance will be displayed on a separate line with status
  • -q for quiet mode, where output is extremely minimal
  • -x to fail on first error/failure rather than continuing to run all selected tests
  • -m some_mark run only those tests marked with some_mark - see below for useful options here

For instance usage might be:

# ignore warnings, display output\npytest -W ignore -s tests/\n

or

# ignore warnings, verbose output, fail on first error/failure\npytest -W ignore -v -x tests/\n

You can find a host of other available options using pytest's in-built help:

pytest -h\n
","tags":["Testing","Pytest","Backends"]},{"location":"dev_guides/changing_splink/testing.html#running-tests-for-specific-backends-or-backend-groups","title":"Running tests for specific backends or backend groups","text":"

You may wish to run tests relating to specific backends, tests which are backend-independent, or any combination of these. Splink allows for various combinations by making use of pytest's mark feature.

If, when you invoke pytest, you pass no marks explicitly, there will be an implicit mark of default, as per the pytest configuration in pyproject.toml; see also the decorator.py file.
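The mechanics look roughly like the following pyproject.toml fragment (an illustrative sketch, not copied verbatim from Splink's configuration): the custom marks are registered so pytest accepts them, and a default addopts supplies -m default when you pass no mark yourself:

```toml
# Illustrative sketch of how such pytest marks could be configured
[tool.pytest.ini_options]
addopts = "-m default"
markers = [
    "core",
    "default",
    "all",
    "duckdb",
    "duckdb_only",
    "spark",
    "spark_only",
    "sqlite",
    "sqlite_only",
]
```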

The available options are:

","tags":["Testing","Pytest","Backends"]},{"location":"dev_guides/changing_splink/testing.html#run-core-tests","title":"Run core tests","text":"

Option for running only the backend-independent 'core' tests:

  • poetry run pytest tests/ -m core - run only the 'core' tests, meaning those without dialect-dependence. In practice this means any test that hasn't been decorated using mark_with_dialects_excluding or mark_with_dialects_including.
","tags":["Testing","Pytest","Backends"]},{"location":"dev_guides/changing_splink/testing.html#run-tests-on-a-specific-backend","title":"Run tests on a specific backend","text":"

Options for running tests on one backend only - this includes tests written specifically for that backend, as well as backend-agnostic tests supported for that backend.

  • poetry run pytest tests/ -m duckdb - run all duckdb tests, and all core tests
    • & similarly for other dialects
  • poetry run pytest tests/ -m duckdb_only - run all duckdb tests only, and not the core tests
    • & similarly for other dialects
","tags":["Testing","Pytest","Backends"]},{"location":"dev_guides/changing_splink/testing.html#run-tests-across-multiple-backends","title":"Run tests across multiple backends","text":"

Options for running tests on multiple backends (including all backends) - this includes tests written specifically for those backends, as well as backend-agnostic tests supported for those backends.

  • pytest tests/ -m default or equivalently pytest tests/ - run all tests in the default group. The default group consists of the core tests, and those dialects in the default group - currently spark and duckdb.
    • Other groups of dialects can be added and will similarly run with pytest tests/ -m new_dialect_group. Dialects within the current scope of testing and the groups they belong to are defined in the dialect_groups dictionary in tests/decorator.py
  • pytest tests/ -m all run all tests for all available dialects

These all work alongside all the other pytest options, so for instance to run the tests for training probability_two_random_records_match for only duckdb, ignoring warnings, with quiet output, and exiting on the first failure/error:

pytest -W ignore -q -x -m duckdb tests/test_estimate_prob_two_rr_match.py\n
Running tests against a specific version of Python

Testing Splink against a specific version of Python, especially newer versions not included in our GitHub Actions, is vital for identifying compatibility issues early and reviewing errors reported by users.

If you're a conda user, you can create an isolated environment according to the instructions in the development quickstart.

Another method is to utilise Docker 🐳.

A pre-built Dockerfile for running tests against python version 3.9.10 can be located within scripts/run_tests.Dockerfile.

To run, simply use the following docker command from within a terminal and the root folder of a Splink clone:

docker build -t run_tests:testing -f scripts/run_tests.Dockerfile . && docker run --rm --name splink-test run_tests:testing\n

This will both build the image and run the test suite.

Feel free to replace run_tests:testing with an image name and tag you're happy with.

Reusing the same image and tag will overwrite your existing image.

You can also overwrite the default CMD if you want a different set of pytest command-line options, for example

docker run --rm --name splink-test run_tests:testing pytest -W ignore -m spark tests/test_u_train.py\n
","tags":["Testing","Pytest","Backends"]},{"location":"dev_guides/changing_splink/testing.html#running-with-a-pre-existing-postgres-database","title":"Running with a pre-existing Postgres database","text":"

If you have a pre-existing Postgres server you wish to use to run the tests against, you will need to specify environment variables for the credentials where they differ from default (in parentheses):

  • SPLINKTEST_PG_USER (splinkognito)
  • SPLINKTEST_PG_PASSWORD (splink123!)
  • SPLINKTEST_PG_HOST (localhost)
  • SPLINKTEST_PG_PORT (5432)
  • SPLINKTEST_PG_DB (splink_db) - tests will not actually run against this, but it is from a connection to this that the temporary test database + user will be created
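For example, pointing the tests at a hypothetical server might look like the following (the values shown are simply the defaults; substitute your own credentials):

```shell
# Hypothetical credentials -- replace with the details of your own server
export SPLINKTEST_PG_USER="splinkognito"
export SPLINKTEST_PG_PASSWORD="splink123!"
export SPLINKTEST_PG_HOST="localhost"
export SPLINKTEST_PG_PORT="5432"
export SPLINKTEST_PG_DB="splink_db"

# then run the Postgres tests as usual:
# poetry run pytest tests/test_full_example_postgres.py
```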

While care has been taken to ensure that tests are run using minimal permissions, and are cleaned up after, it is probably wise to run tests connected to a non-important database, in case anything goes wrong. In addition to the standard privileges for Splink usage, in order to run the tests you will need:

  • CREATE DATABASE to create a temporary testing database
  • CREATEROLE to create a temporary user role with limited privileges, which will be actually used for all the SQL execution in the tests
","tags":["Testing","Pytest","Backends"]},{"location":"dev_guides/changing_splink/testing.html#tests-in-ci","title":"Tests in CI","text":"

Splink utilises GitHub actions to run tests for each pull request. This consists of a few independent checks:

  • The full test suite is run separately against several different python versions
  • The example notebooks are checked to ensure they run without error
  • The tutorial notebooks are checked to ensure they run without error
","tags":["Testing","Pytest","Backends"]},{"location":"dev_guides/changing_splink/testing.html#writing-tests","title":"Writing tests","text":"","tags":["Testing","Pytest","Backends"]},{"location":"dev_guides/changing_splink/testing.html#core-tests","title":"Core tests","text":"

Core tests are treated the same way as ordinary pytest tests. Any test is marked as core by default, and will only be excluded from being a core test if it is decorated using either:

  • @mark_with_dialects_excluding for backend-agnostic tests, or
  • @mark_with_dialects_including for backend-specific tests

from the test decorator file.

","tags":["Testing","Pytest","Backends"]},{"location":"dev_guides/changing_splink/testing.html#backend-agnostic-testing","title":"Backend-agnostic testing","text":"

The majority of tests should be written using the backend-agnostic testing framework. This just provides some small tools which allow tests to be written in a backend-independent way. This means the tests can then be run against all available SQL backends (or a subset, if some lack necessary features for the test).

As an example, let's consider a test that will run on all dialects, and then break down the various parts to see what each is doing.

from tests.decorator import mark_with_dialects_excluding\n\n@mark_with_dialects_excluding()\ndef test_feature_that_works_for_all_backends(test_helpers, dialect, some_other_test_fixture):\n    helper = test_helpers[dialect]\n\n    df = helper.load_frame_from_csv(\"./tests/datasets/fake_1000_from_splink_demos.csv\")\n    settings = SettingsCreator(\n        link_type=\"dedupe_only\",\n        comparisons=[\n            cl.ExactMatch(\"first_name\"),\n            cl.ExactMatch(\"surname\"),\n        ],\n        blocking_rules_to_generate_predictions=[\n            block_on(\"first_name\"),\n        ],\n    )\n    linker = helper.Linker(\n        df,\n        settings,\n        **helper.extra_linker_args(),\n    )\n\n\n    # and then some actual testing logic\n

Firstly you should import the decorator-factory mark_with_dialects_excluding, which will decorate each test function:

from tests.decorator import mark_with_dialects_excluding\n

Then we define the function, and pass parameters:

@mark_with_dialects_excluding()\ndef test_feature_that_works_for_all_backends(test_helpers, dialect, some_other_test_fixture):\n

The decorator @mark_with_dialects_excluding() will do two things:

  • marks the test it decorates with the appropriate custom pytest marks. This ensures that it will be run with tests for each dialect, excluding any that are passed as arguments; in this case it will be run for all dialects, as we have passed no arguments.
  • parameterises the test with a string parameter dialect, which will be used to configure the test for that dialect. The test will run for each value of dialect possible, excluding any passed to the decorator (none in this case).

You should aim to exclude as few dialects as possible - consider if you really need to exclude any. Dialects should only be excluded if the test doesn't make sense for them due to features they lack. The default choice should be the decorator with no arguments @mark_with_dialects_excluding(), meaning the test runs for all dialects.

@mark_with_dialects_excluding()\ndef test_feature_that_works_for_all_backends(test_helpers, dialect, some_other_test_fixture):\n

As well as the parameter dialect (which is provided by the decorator), we must also pass the helper-factory fixture test_helpers. We can additionally pass further fixtures if needed - in this case some_other_test_fixture. We could similarly provide an explicit parameterisation to the test, in which case we would also pass these parameters - see the pytest docs on parameterisation for more information.

    helper = test_helpers[dialect]\n

The fixture test_helpers is simply a dictionary of the specific-dialect test helpers - here we pick the appropriate one for our test.

Each helper has the same set of methods and properties, which encapsulate all of the dialect-dependencies. You can find the full set of properties and methods by examining the source for the base class TestHelper.
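To make the shape of this concrete, here is a minimal, hypothetical sketch of the pattern (the class and method names are simplified; the real helper classes live in Splink's tests and expose more methods, such as Linker and load_frame_from_csv):

```python
# Simplified, hypothetical sketch -- not Splink's actual helper classes
class DuckDBTestHelper:
    dialect = "duckdb"

    def extra_linker_args(self):
        # DuckDB needs nothing extra to construct a linker
        return {}


class SQLiteTestHelper:
    dialect = "sqlite"

    def extra_linker_args(self):
        # e.g. a database connection the SQLite linker requires
        return {"connection": ":memory:"}


# The test_helpers fixture is, in essence, a dialect-keyed dictionary:
test_helpers = {h.dialect: h for h in [DuckDBTestHelper(), SQLiteTestHelper()]}

helper = test_helpers["duckdb"]
print(helper.dialect, helper.extra_linker_args())  # -> duckdb {}
```

Because every helper exposes the same interface, the test body never needs to branch on the dialect itself.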

    df = helper.load_frame_from_csv(\"./tests/datasets/fake_1000_from_splink_demos.csv\")\n

Here we are now actually using a method of the test helper - in this case we are loading a table from a csv to the database and returning it in a form suitable for passing to a Splink linker.

Finally we instantiate the linker, passing any default set of extra arguments provided by the helper, which some dialects require.

    linker = helper.Linker(df, settings, **helper.extra_linker_args())\n

From this point onwards we will be working with the instantiated linker, and so will not need to refer to helper any more - the rest of the test can be written as usual.

","tags":["Testing","Pytest","Backends"]},{"location":"dev_guides/changing_splink/testing.html#excluding-some-backends","title":"Excluding some backends","text":"

Now let's consider an example in which we wanted to test a ComparisonLevel that included the split_part function which does not exist in the sqlite dialect. We assume that this particular comparison level is crucial for the test to make sense, otherwise we would rewrite this line to make it run universally. When you come to run the tests, this test will not run on the sqlite backend.

{\n    \"sql_condition\": \"split_part(email_l, '@', 1) = split_part(email_r, '@', 1)\",\n    \"label_for_charts\": \"email local-part matches\",\n}\n

Warning

Tests should be made available to the widest range of backends possible. Only exclude backends if features not shared by all backends are crucial to the test-logic - otherwise consider rewriting things so that all backends are covered.

We therefore want to exclude the sqlite backend, as the test relies on features not directly available for that backend, which we can do as follows:

from tests.decorator import mark_with_dialects_excluding\n\n@mark_with_dialects_excluding(\"sqlite\")\ndef test_feature_that_doesnt_work_with_sqlite(test_helpers, dialect, some_other_test_fixture):\n    helper = test_helpers[dialect]\n\n    df = helper.load_frame_from_csv(\"./tests/datasets/fake_1000_from_splink_demos.csv\")\n\n    # and then some actual testing logic\n

The key difference is the argument we pass to the decorator:

@mark_with_dialects_excluding(\"sqlite\")\ndef test_feature_that_doesnt_work_with_sqlite(test_helpers, dialect, some_other_test_fixture):\n
As above, this marks the test it decorates with the appropriate custom pytest marks, but in this case it ensures that it will be run with tests for each dialect excluding sqlite. Again, dialect is passed as a parameter, and the test will run in turn for each value of dialect except for sqlite.

If you need to exclude multiple dialects this is also possible - just pass each as an argument. For example, to decorate a test that is not supported on spark or sqlite, use the decorator @mark_with_dialects_excluding(\"sqlite\", \"spark\").

","tags":["Testing","Pytest","Backends"]},{"location":"dev_guides/changing_splink/testing.html#backend-specific-tests","title":"Backend-specific tests","text":"

If you intend to write a test for a specific backend, first consider whether it is definitely specific to that backend - if not then a backend-agnostic test would be preferable, as then your test will be run against many backends. If you really do need to test features peculiar to one backend, then you can write it simply as you would an ordinary pytest test. The only difference is that you should decorate it with @mark_with_dialects_including (from tests/decorator.py) - for example:

DuckDB Spark SQLite
@mark_with_dialects_including(\"duckdb\")\ndef test_some_specific_duckdb_feature():\n    ...\n
@mark_with_dialects_including(\"spark\")\ndef test_some_specific_spark_feature():\n    ...\n
@mark_with_dialects_including(\"sqlite\")\ndef test_some_specific_sqlite_feature():\n    ...\n

This ensures that the test gets marked appropriately so that it runs when the tests for that backend should be run, and excludes it from the set of core tests.

Note that unlike the exclusive mark_with_dialects_excluding, this decorator will not parameterise the test with the dialect argument. This is because usage of the inclusive form is largely designed for single-dialect tests. If you wish to override this behaviour and parameterise the test you can use the argument pass_dialect, for example @mark_with_dialects_including(\"spark\", \"sqlite\", pass_dialect=True), in which case you would need to write the test in a backend-independent manner.

","tags":["Testing","Pytest","Backends"]},{"location":"dev_guides/charts/building_charts.html","title":"Building new charts","text":""},{"location":"dev_guides/charts/building_charts.html#building-a-new-chart-in-splink","title":"Building a new chart in Splink","text":"

As mentioned in the Understanding Splink Charts topic guide, splink charts are made up of three distinct parts:

  1. A function to create the dataset for the chart
  2. A template chart definition (in a json file)
  3. A function to read the chart definition, add the data to it, and return the chart itself
"},{"location":"dev_guides/charts/building_charts.html#worked-example","title":"Worked Example","text":"

Below is a worked example of how to create a new chart that shows all comparisons levels ordered by match weight:

import splink.comparison_library as cl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\ndf = splink_datasets.fake_1000\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.NameComparison(\"first_name\"),\n        cl.NameComparison(\"surname\"),\n        cl.DateOfBirthComparison(\"dob\", input_is_string=True),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        cl.LevenshteinAtThresholds(\"email\", 2),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\", \"dob\"),\n        block_on(\"surname\"),\n    ]\n)\n\nlinker = Linker(df, settings, DuckDBAPI())\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\nfor rule in [block_on(\"first_name\"), block_on(\"dob\")]:\n    linker.training.estimate_parameters_using_expectation_maximisation(rule)\n
"},{"location":"dev_guides/charts/building_charts.html#generate-data-for-chart","title":"Generate data for chart","text":"
# Take linker object and extract complete settings dict\nrecords = linker._settings_obj._parameters_as_detailed_records\n\ncols_to_keep = [\n    \"comparison_name\",\n    \"sql_condition\",\n    \"label_for_charts\",\n    \"m_probability\",\n    \"u_probability\",\n    \"bayes_factor\",\n    \"log2_bayes_factor\",\n    \"comparison_vector_value\"\n]\n\n# Keep useful information for a match weights chart\nrecords = [{k: r[k] for k in cols_to_keep}\n           for r in records\n           if r[\"comparison_vector_value\"] != -1 and r[\"comparison_sort_order\"] != -1]\n\nrecords[:3]\n
[{'comparison_name': 'first_name',\n  'sql_condition': '\"first_name_l\" = \"first_name_r\"',\n  'label_for_charts': 'Exact match on first_name',\n  'm_probability': 0.5009783629340309,\n  'u_probability': 0.0057935713975033705,\n  'bayes_factor': 86.4714229896119,\n  'log2_bayes_factor': 6.434151525637829,\n  'comparison_vector_value': 4},\n {'comparison_name': 'first_name',\n  'sql_condition': 'jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.92',\n  'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.92',\n  'm_probability': 0.15450921411813767,\n  'u_probability': 0.0023429457903817435,\n  'bayes_factor': 65.9465595629351,\n  'log2_bayes_factor': 6.043225490816602,\n  'comparison_vector_value': 3},\n {'comparison_name': 'first_name',\n  'sql_condition': 'jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.88',\n  'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.88',\n  'm_probability': 0.07548037415770431,\n  'u_probability': 0.0015484319951285285,\n  'bayes_factor': 48.7463281533646,\n  'log2_bayes_factor': 5.607221645966225,\n  'comparison_vector_value': 2}]\n
"},{"location":"dev_guides/charts/building_charts.html#create-a-chart-template","title":"Create a chart template","text":""},{"location":"dev_guides/charts/building_charts.html#build-prototype-chart-in-altair","title":"Build prototype chart in Altair","text":"
import pandas as pd\nimport altair as alt\n\ndf = pd.DataFrame(records)\n\n# Need a unique name for each comparison level - easier to create in pandas than altair\ndf[\"cl_id\"] = df[\"comparison_name\"] + \"_\" + \\\n    df[\"comparison_vector_value\"].astype(\"str\")\n\n# Simple start - bar chart with x, y and color encodings\nalt.Chart(df).mark_bar().encode(\n    y=\"cl_id\",\n    x=\"log2_bayes_factor\",\n    color=\"comparison_name\"\n)\n
"},{"location":"dev_guides/charts/building_charts.html#sort-bars-edit-axestitles","title":"Sort bars, edit axes/titles","text":"
alt.Chart(df).mark_bar().encode(\n    y=alt.Y(\"cl_id\",\n        sort=\"-x\",\n        title=\"Comparison level\"\n    ),\n    x=alt.X(\"log2_bayes_factor\",\n        title=\"Comparison level match weight = log2(m/u)\",\n        scale=alt.Scale(domain=[-10,10])\n    ),\n    color=\"comparison_name\"\n).properties(\n    title=\"New Chart - WOO!\"\n).configure_view(\n    step=15\n)\n
"},{"location":"dev_guides/charts/building_charts.html#add-tooltip","title":"Add tooltip","text":"
alt.Chart(df).mark_bar().encode(\n    y=alt.Y(\"cl_id\",\n            sort=\"-x\",\n            title=\"Comparison level\"\n            ),\n    x=alt.X(\"log2_bayes_factor\",\n            title=\"Comparison level match weight = log2(m/u)\",\n            scale=alt.Scale(domain=[-10, 10])\n            ),\n    color=\"comparison_name\",\n    tooltip=[\n        \"comparison_name\",\n        \"label_for_charts\",\n        \"sql_condition\",\n        \"m_probability\",\n        \"u_probability\",\n        \"bayes_factor\",\n        \"log2_bayes_factor\"\n        ]\n).properties(\n    title=\"New Chart - WOO!\"\n).configure_view(\n    step=15\n)\n
"},{"location":"dev_guides/charts/building_charts.html#add-text-layer","title":"Add text layer","text":"
# Create base chart with shared data and encodings (mark type not specified)\nbase = alt.Chart(df).encode(\n    y=alt.Y(\"cl_id\",\n            sort=\"-x\",\n            title=\"Comparison level\"\n            ),\n    x=alt.X(\"log2_bayes_factor\",\n            title=\"Comparison level match weight = log2(m/u)\",\n            scale=alt.Scale(domain=[-10, 10])\n            ),\n    tooltip=[\n        \"comparison_name\",\n        \"label_for_charts\",\n        \"sql_condition\",\n        \"m_probability\",\n        \"u_probability\",\n        \"bayes_factor\",\n        \"log2_bayes_factor\"\n    ]\n)\n\n# Build bar chart from base (color legend made redundant by text labels)\nbar = base.mark_bar().encode(\n    color=alt.Color(\"comparison_name\", legend=None)\n)\n\n# Build text layer from base\ntext = base.mark_text(dx=0, align=\"right\").encode(\n    text=\"comparison_name\"\n)\n\n# Final layered chart\nchart = bar + text\n\n# Add global config\nchart.resolve_axis(\n    y=\"shared\",\n    x=\"shared\"\n).properties(\n    title=\"New Chart - WOO!\"\n).configure_view(\n    step=15\n)\n

Sometimes things go wrong in Altair and it's not clear why or how to fix it. If the docs and Stack Overflow don't have a solution, the answer is usually that Altair is making decisions under the hood about the Vega-Lite schema that are out of your control.

In this example, the y-axis sorting breaks when the charts are layered. Shown side-by-side, bar and text each sort as expected; the order is only lost once they are combined into a layered chart.

bar | text\n

Once we get to this stage (or whenever you're comfortable), we can switch to Vega-Lite by exporting the JSON from our chart object, or opening the chart in the Vega-Lite editor.

chart.to_json()\n
Chart JSON
  {\n  \"$schema\": \"https://vega.github.io/schema/vega-lite/v5.8.0.json\",\n  \"config\": {\n    \"view\": {\n      \"continuousHeight\": 300,\n      \"continuousWidth\": 300\n    }\n  },\n  \"data\": {\n    \"name\": \"data-3901c03d78701611834aa82ab7374cce\"\n  },\n  \"datasets\": {\n    \"data-3901c03d78701611834aa82ab7374cce\": [\n      {\n        \"bayes_factor\": 86.62949969575988,\n        \"cl_id\": \"first_name_4\",\n        \"comparison_name\": \"first_name\",\n        \"comparison_vector_value\": 4,\n        \"label_for_charts\": \"Exact match first_name\",\n        \"log2_bayes_factor\": 6.436786480320881,\n        \"m_probability\": 0.5018941916173814,\n        \"sql_condition\": \"\\\"first_name_l\\\" = \\\"first_name_r\\\"\",\n        \"u_probability\": 0.0057935713975033705\n      },\n      {\n        \"bayes_factor\": 82.81743551783742,\n        \"cl_id\": \"first_name_3\",\n        \"comparison_name\": \"first_name\",\n        \"comparison_vector_value\": 3,\n        \"label_for_charts\": \"Damerau_levenshtein <= 1\",\n        \"log2_bayes_factor\": 6.371862624533329,\n        \"m_probability\": 0.19595791797531015,\n        \"sql_condition\": \"damerau_levenshtein(\\\"first_name_l\\\", \\\"first_name_r\\\") <= 1\",\n        \"u_probability\": 0.00236614327345483\n      },\n      {\n        \"bayes_factor\": 35.47812468678278,\n        \"cl_id\": \"first_name_2\",\n        \"comparison_name\": \"first_name\",\n        \"comparison_vector_value\": 2,\n        \"label_for_charts\": \"Jaro_winkler_similarity >= 0.9\",\n        \"log2_bayes_factor\": 5.148857848140163,\n        \"m_probability\": 0.045985303626033085,\n        \"sql_condition\": \"jaro_winkler_similarity(\\\"first_name_l\\\", \\\"first_name_r\\\") >= 0.9\",\n        \"u_probability\": 0.001296159366708712\n      },\n      {\n        \"bayes_factor\": 11.266641370022352,\n        \"cl_id\": \"first_name_1\",\n        \"comparison_name\": \"first_name\",\n        
\"comparison_vector_value\": 1,\n        \"label_for_charts\": \"Jaro_winkler_similarity >= 0.8\",\n        \"log2_bayes_factor\": 3.493985601438375,\n        \"m_probability\": 0.06396730257493154,\n        \"sql_condition\": \"jaro_winkler_similarity(\\\"first_name_l\\\", \\\"first_name_r\\\") >= 0.8\",\n        \"u_probability\": 0.005677583982137938\n      },\n      {\n        \"bayes_factor\": 0.19514855669673956,\n        \"cl_id\": \"first_name_0\",\n        \"comparison_name\": \"first_name\",\n        \"comparison_vector_value\": 0,\n        \"label_for_charts\": \"All other comparisons\",\n        \"log2_bayes_factor\": -2.357355302129234,\n        \"m_probability\": 0.19219528420634394,\n        \"sql_condition\": \"ELSE\",\n        \"u_probability\": 0.9848665419801952\n      },\n      {\n        \"bayes_factor\": 113.02818119005431,\n        \"cl_id\": \"surname_4\",\n        \"comparison_name\": \"surname\",\n        \"comparison_vector_value\": 4,\n        \"label_for_charts\": \"Exact match surname\",\n        \"log2_bayes_factor\": 6.820538712806792,\n        \"m_probability\": 0.5527050424941531,\n        \"sql_condition\": \"\\\"surname_l\\\" = \\\"surname_r\\\"\",\n        \"u_probability\": 0.004889975550122249\n      },\n      {\n        \"bayes_factor\": 80.61351958508214,\n        \"cl_id\": \"surname_3\",\n        \"comparison_name\": \"surname\",\n        \"comparison_vector_value\": 3,\n        \"label_for_charts\": \"Damerau_levenshtein <= 1\",\n        \"log2_bayes_factor\": 6.332949906378981,\n        \"m_probability\": 0.22212752320956386,\n        \"sql_condition\": \"damerau_levenshtein(\\\"surname_l\\\", \\\"surname_r\\\") <= 1\",\n        \"u_probability\": 0.0027554624131641246\n      },\n      {\n        \"bayes_factor\": 48.57568460485815,\n        \"cl_id\": \"surname_2\",\n        \"comparison_name\": \"surname\",\n        \"comparison_vector_value\": 2,\n        \"label_for_charts\": \"Jaro_winkler_similarity >= 0.9\",\n     
   \"log2_bayes_factor\": 5.602162423566203,\n        \"m_probability\": 0.0490149338194711,\n        \"sql_condition\": \"jaro_winkler_similarity(\\\"surname_l\\\", \\\"surname_r\\\") >= 0.9\",\n        \"u_probability\": 0.0010090425738347498\n      },\n      {\n        \"bayes_factor\": 13.478820689774516,\n        \"cl_id\": \"surname_1\",\n        \"comparison_name\": \"surname\",\n        \"comparison_vector_value\": 1,\n        \"label_for_charts\": \"Jaro_winkler_similarity >= 0.8\",\n        \"log2_bayes_factor\": 3.752622370380284,\n        \"m_probability\": 0.05001678986356945,\n        \"sql_condition\": \"jaro_winkler_similarity(\\\"surname_l\\\", \\\"surname_r\\\") >= 0.8\",\n        \"u_probability\": 0.003710768991942586\n      },\n      {\n        \"bayes_factor\": 0.1277149376863226,\n        \"cl_id\": \"surname_0\",\n        \"comparison_name\": \"surname\",\n        \"comparison_vector_value\": 0,\n        \"label_for_charts\": \"All other comparisons\",\n        \"log2_bayes_factor\": -2.969000820703079,\n        \"m_probability\": 0.1261357106132424,\n        \"sql_condition\": \"ELSE\",\n        \"u_probability\": 0.9876347504709363\n      },\n      {\n        \"bayes_factor\": 236.78351486807742,\n        \"cl_id\": \"dob_5\",\n        \"comparison_name\": \"dob\",\n        \"comparison_vector_value\": 5,\n        \"label_for_charts\": \"Exact match\",\n        \"log2_bayes_factor\": 7.887424832202931,\n        \"m_probability\": 0.41383785481447766,\n        \"sql_condition\": \"\\\"dob_l\\\" = \\\"dob_r\\\"\",\n        \"u_probability\": 0.0017477477477477479\n      },\n      {\n        \"bayes_factor\": 65.74625268345359,\n        \"cl_id\": \"dob_4\",\n        \"comparison_name\": \"dob\",\n        \"comparison_vector_value\": 4,\n        \"label_for_charts\": \"Damerau_levenshtein <= 1\",\n        \"log2_bayes_factor\": 6.038836762842662,\n        \"m_probability\": 0.10806341031654734,\n        \"sql_condition\": 
\"damerau_levenshtein(\\\"dob_l\\\", \\\"dob_r\\\") <= 1\",\n        \"u_probability\": 0.0016436436436436436\n      },\n      {\n        \"bayes_factor\": 29.476860590690453,\n        \"cl_id\": \"dob_3\",\n        \"comparison_name\": \"dob\",\n        \"comparison_vector_value\": 3,\n        \"label_for_charts\": \"Within 1 month\",\n        \"log2_bayes_factor\": 4.881510974428093,\n        \"m_probability\": 0.11300938544779224,\n        \"sql_condition\": \"\\n            abs(date_diff('month',\\n                strptime(\\\"dob_l\\\", '%Y-%m-%d'),\\n                strptime(\\\"dob_r\\\", '%Y-%m-%d'))\\n                ) <= 1\\n        \",\n        \"u_probability\": 0.003833833833833834\n      },\n      {\n        \"bayes_factor\": 3.397551460259144,\n        \"cl_id\": \"dob_2\",\n        \"comparison_name\": \"dob\",\n        \"comparison_vector_value\": 2,\n        \"label_for_charts\": \"Within 1 year\",\n        \"log2_bayes_factor\": 1.7644954026183992,\n        \"m_probability\": 0.17200656922328977,\n        \"sql_condition\": \"\\n            abs(date_diff('year',\\n                strptime(\\\"dob_l\\\", '%Y-%m-%d'),\\n                strptime(\\\"dob_r\\\", '%Y-%m-%d'))\\n                ) <= 1\\n        \",\n        \"u_probability\": 0.05062662662662663\n      },\n      {\n        \"bayes_factor\": 0.6267794172297388,\n        \"cl_id\": \"dob_1\",\n        \"comparison_name\": \"dob\",\n        \"comparison_vector_value\": 1,\n        \"label_for_charts\": \"Within 10 years\",\n        \"log2_bayes_factor\": -0.6739702908716182,\n        \"m_probability\": 0.19035523041792068,\n        \"sql_condition\": \"\\n            abs(date_diff('year',\\n                strptime(\\\"dob_l\\\", '%Y-%m-%d'),\\n                strptime(\\\"dob_r\\\", '%Y-%m-%d'))\\n                ) <= 10\\n        \",\n        \"u_probability\": 0.3037037037037037\n      },\n      {\n        \"bayes_factor\": 0.004272180302776005,\n        \"cl_id\": \"dob_0\",\n        
\"comparison_name\": \"dob\",\n        \"comparison_vector_value\": 0,\n        \"label_for_charts\": \"All other comparisons\",\n        \"log2_bayes_factor\": -7.870811748958801,\n        \"m_probability\": 0.002727549779972325,\n        \"sql_condition\": \"ELSE\",\n        \"u_probability\": 0.6384444444444445\n      },\n      {\n        \"bayes_factor\": 10.904938885948333,\n        \"cl_id\": \"city_1\",\n        \"comparison_name\": \"city\",\n        \"comparison_vector_value\": 1,\n        \"label_for_charts\": \"Exact match\",\n        \"log2_bayes_factor\": 3.4469097796586596,\n        \"m_probability\": 0.6013808934279701,\n        \"sql_condition\": \"\\\"city_l\\\" = \\\"city_r\\\"\",\n        \"u_probability\": 0.0551475711801453\n      },\n      {\n        \"bayes_factor\": 0.42188504195296994,\n        \"cl_id\": \"city_0\",\n        \"comparison_name\": \"city\",\n        \"comparison_vector_value\": 0,\n        \"label_for_charts\": \"All other comparisons\",\n        \"log2_bayes_factor\": -1.2450781575619725,\n        \"m_probability\": 0.3986191065720299,\n        \"sql_condition\": \"ELSE\",\n        \"u_probability\": 0.9448524288198547\n      },\n      {\n        \"bayes_factor\": 269.6074384240141,\n        \"cl_id\": \"email_2\",\n        \"comparison_name\": \"email\",\n        \"comparison_vector_value\": 2,\n        \"label_for_charts\": \"Exact match\",\n        \"log2_bayes_factor\": 8.07471649055784,\n        \"m_probability\": 0.5914840252879943,\n        \"sql_condition\": \"\\\"email_l\\\" = \\\"email_r\\\"\",\n        \"u_probability\": 0.0021938713143283602\n      },\n      {\n        \"bayes_factor\": 222.9721189153553,\n        \"cl_id\": \"email_1\",\n        \"comparison_name\": \"email\",\n        \"comparison_vector_value\": 1,\n        \"label_for_charts\": \"Levenshtein <= 2\",\n        \"log2_bayes_factor\": 7.800719512398763,\n        \"m_probability\": 0.3019669634613132,\n        \"sql_condition\": 
\"levenshtein(\\\"email_l\\\", \\\"email_r\\\") <= 2\",\n        \"u_probability\": 0.0013542812658830492\n      },\n      {\n        \"bayes_factor\": 0.10692840956298139,\n        \"cl_id\": \"email_0\",\n        \"comparison_name\": \"email\",\n        \"comparison_vector_value\": 0,\n        \"label_for_charts\": \"All other comparisons\",\n        \"log2_bayes_factor\": -3.225282884575804,\n        \"m_probability\": 0.10654901125069259,\n        \"sql_condition\": \"ELSE\",\n        \"u_probability\": 0.9964518474197885\n      }\n    ]\n  },\n  \"layer\": [\n    {\n      \"encoding\": {\n        \"color\": {\n          \"field\": \"comparison_name\",\n          \"legend\": null,\n          \"type\": \"nominal\"\n        },\n        \"tooltip\": [\n          {\n            \"field\": \"comparison_name\",\n            \"type\": \"nominal\"\n          },\n          {\n            \"field\": \"label_for_charts\",\n            \"type\": \"nominal\"\n          },\n          {\n            \"field\": \"sql_condition\",\n            \"type\": \"nominal\"\n          },\n          {\n            \"field\": \"m_probability\",\n            \"type\": \"quantitative\"\n          },\n          {\n            \"field\": \"u_probability\",\n            \"type\": \"quantitative\"\n          },\n          {\n            \"field\": \"bayes_factor\",\n            \"type\": \"quantitative\"\n          },\n          {\n            \"field\": \"log2_bayes_factor\",\n            \"type\": \"quantitative\"\n          }\n        ],\n        \"x\": {\n          \"field\": \"log2_bayes_factor\",\n          \"scale\": {\n            \"domain\": [\n              -10,\n              10\n            ]\n          },\n          \"title\": \"Comparison level match weight = log2(m/u)\",\n          \"type\": \"quantitative\"\n        },\n        \"y\": {\n          \"field\": \"cl_id\",\n          \"sort\": \"-x\",\n          \"title\": \"Comparison level\",\n          \"type\": \"nominal\"\n     
   }\n      },\n      \"mark\": {\n        \"type\": \"bar\"\n      }\n    },\n    {\n      \"encoding\": {\n        \"text\": {\n          \"field\": \"comparison_name\",\n          \"type\": \"nominal\"\n        },\n        \"tooltip\": [\n          {\n            \"field\": \"comparison_name\",\n            \"type\": \"nominal\"\n          },\n          {\n            \"field\": \"label_for_charts\",\n            \"type\": \"nominal\"\n          },\n          {\n            \"field\": \"sql_condition\",\n            \"type\": \"nominal\"\n          },\n          {\n            \"field\": \"m_probability\",\n            \"type\": \"quantitative\"\n          },\n          {\n            \"field\": \"u_probability\",\n            \"type\": \"quantitative\"\n          },\n          {\n            \"field\": \"bayes_factor\",\n            \"type\": \"quantitative\"\n          },\n          {\n            \"field\": \"log2_bayes_factor\",\n            \"type\": \"quantitative\"\n          }\n        ],\n        \"x\": {\n          \"field\": \"log2_bayes_factor\",\n          \"scale\": {\n            \"domain\": [\n              -10,\n              10\n            ]\n          },\n          \"title\": \"Comparison level match weight = log2(m/u)\",\n          \"type\": \"quantitative\"\n        },\n        \"y\": {\n          \"field\": \"cl_id\",\n          \"sort\": \"-x\",\n          \"title\": \"Comparison level\",\n          \"type\": \"nominal\"\n        }\n      },\n      \"mark\": {\n        \"align\": \"right\",\n        \"dx\": 0,\n        \"type\": \"text\"\n      }\n    }\n  ]\n  }\n
"},{"location":"dev_guides/charts/building_charts.html#edit-in-vega-lite","title":"Edit in Vega-Lite","text":"

Opening the JSON from the chart above in the Vega-Lite editor, it now behaves as intended, with both bar and text layers sorted by match weight.

If the chart is working as intended, there is only one step required before saving the JSON file - removing data from the template schema.

The data appears as follows, with a dictionary of all included datasets keyed by name, and each chart referencing the dataset it uses by name:

\"data\": {\"name\": \"data-a6c84a9cf1a0c7a2cd30cc1a0e2c1185\"},\n\"datasets\": {\n  \"data-a6c84a9cf1a0c7a2cd30cc1a0e2c1185\": [\n\n    ...\n\n  ]\n},\n

Where only one dataset is required, this is equivalent to:

\"data\": {\"values\": [...]}\n

After removing the data references, the template can be saved in Splink as splink/files/chart_defs/my_new_chart.json.
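That clean-up step can be scripted. The sketch below uses a toy spec with illustrative names and values: it pops the named datasets block out of an exported spec and leaves an empty values slot for Splink to fill at chart-build time.

```python
import json

# Toy stand-in for a spec exported from Altair (names/values illustrative)
spec = {
    "data": {"name": "data-a6c8"},
    "datasets": {"data-a6c8": [{"x": 1}, {"x": 2}]},
    "mark": "bar",
}

def strip_data(spec: dict) -> dict:
    """Remove the embedded dataset so the spec can be saved as a reusable template."""
    template = dict(spec)               # shallow copy; the original spec is untouched
    template.pop("datasets", None)      # drop the named datasets block
    template["data"] = {"values": []}   # empty slot, filled by Splink later
    return template

template = strip_data(spec)
print(json.dumps(template, indent=2))
```

The same approach works on a full spec loaded with json.load after exporting from the Vega-Lite editor.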

"},{"location":"dev_guides/charts/building_charts.html#combine-the-chart-dataset-and-template","title":"Combine the chart dataset and template","text":"

Putting all of the above together, Splink needs definitions for the methods that generate the chart and the data behind it (these can be separate or performed by the same function if relatively simple).

"},{"location":"dev_guides/charts/building_charts.html#chart-definition","title":"Chart definition","text":"

In splink/charts.py we can add a new function to populate the chart definition with the provided data:

def my_new_chart(records, as_dict=False):\n    chart_path = \"my_new_chart.json\"\n    chart = load_chart_definition(chart_path)\n\n    chart[\"data\"][\"values\"] = records\n    return altair_or_json(chart, as_dict=as_dict)\n

Note - only the data is being added to a fixed chart definition here. Other elements of the chart spec can be changed by editing the chart dictionary in the same way.

For example, if you wanted to add a color_scheme argument to replace the default scheme (\"tableau10\"), this function could include the line: chart[\"layer\"][0][\"encoding\"][\"color\"][\"scale\"][\"scheme\"] = color_scheme
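A minimal, self-contained sketch of that kind of override (the chart dict is a cut-down stand-in for the real loaded definition, and apply_color_scheme is an illustrative helper, not a Splink function):

```python
def apply_color_scheme(chart: dict, color_scheme: str) -> dict:
    """Point the bar layer's colour encoding at a named Vega colour scheme."""
    # Layer 0 is the bar layer in the layered spec built earlier
    chart["layer"][0]["encoding"]["color"].setdefault("scale", {})["scheme"] = color_scheme
    return chart

# Cut-down stand-in for the loaded chart definition
chart = {"layer": [{"encoding": {"color": {"field": "comparison_name", "type": "nominal"}}}]}
chart = apply_color_scheme(chart, "viridis")
```

Because the chart definition is just a dictionary, any other part of the spec (titles, axis domains, mark properties) can be overridden the same way before rendering.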

"},{"location":"dev_guides/charts/building_charts.html#chart-method","title":"Chart method","text":"

Then we can add a method to the linker in splink/linker.py so the chart can be generated by linker.my_new_chart():

from .charts import my_new_chart\n\n...\n\nclass Linker:\n\n    ...\n\n    def my_new_chart(self):\n\n        # Take linker object and extract complete settings dict\n        records = self._settings_obj._parameters_as_detailed_records\n\n        cols_to_keep = [\n            \"comparison_name\",\n            \"sql_condition\",\n            \"label_for_charts\",\n            \"m_probability\",\n            \"u_probability\",\n            \"bayes_factor\",\n            \"log2_bayes_factor\",\n            \"comparison_vector_value\"\n        ]\n\n        # Keep useful information for a match weights chart\n        records = [{k: r[k] for k in cols_to_keep}\n                   for r in records \n                   if r[\"comparison_vector_value\"] != -1 and r[\"comparison_sort_order\"] != -1]\n\n        return my_new_chart(records)\n
"},{"location":"dev_guides/charts/building_charts.html#previous-new-chart-prs","title":"Previous new chart PRs","text":"

Real-life Splink chart additions, for reference:

  • Term frequency adjustment chart
  • Completeness (multi-dataset) chart
  • Cumulative blocking rule chart
  • Unlinkables chart
  • Missingness chart
  • Waterfall chart
"},{"location":"dev_guides/charts/understanding_and_editing_charts.html","title":"Understanding and editing charts","text":""},{"location":"dev_guides/charts/understanding_and_editing_charts.html#charts-in-splink","title":"Charts in Splink","text":"

Interactive charts are a key tool when linking data with Splink. To see all of the charts available, check out the Splink Charts Gallery.

"},{"location":"dev_guides/charts/understanding_and_editing_charts.html#how-do-charts-work-in-splink","title":"How do charts work in Splink?","text":"

Charts in Splink are built with Altair.

For a given chart, there is usually:

  • A template chart definition (e.g. match_weights_waterfall.json)
  • A function to create the dataset for the chart (e.g. records_to_waterfall_data)
  • A function to read the chart definition, add the data to it, and return the chart itself (e.g. waterfall_chart)
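The three pieces fit together roughly like this (function and file names here are illustrative, not Splink's actual implementations):

```python
import json

def load_chart_definition(path):
    """Read a template chart definition from disk."""
    with open(path) as f:
        return json.load(f)

def records_to_chart_data(records):
    """Shape raw records into the fields the template's encodings expect."""
    return [{"x": r["x"], "y": r["y"]} for r in records]

def my_chart(records, template_path="my_chart_template.json"):
    """Read the template, add the data to it, and return the chart spec."""
    chart = load_chart_definition(template_path)
    chart["data"]["values"] = records_to_chart_data(records)
    return chart
```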
The Vega-Lite Editor

By far the best feature of Vega-Lite is the online editor where the JSON schema and the chart are shown side-by-side, showing changes in real time as the editor helps you to navigate the API.

"},{"location":"dev_guides/charts/understanding_and_editing_charts.html#editing-existing-charts","title":"Editing existing charts","text":"

If you take any Altair chart in HTML format, you should be able to make changes pretty easily with the Vega-Lite Editor.

For example, consider the comparator_score_chart from the similarity analysis library:

Before After

Desired changes

  • Titles (shared title)
  • Axis titles
  • Shared y-axis
  • Colour scales!! 🤮 (see the Vega colour schemes docs)
      • red-green is an accessibility no-no
      • shared colour scheme for different metrics
      • unpleasant and unclear to look at
  • legends not necessary (especially when using text labels)
  • Text size encoding (larger text for similar strings)
  • Remove \"_similarity\" and \"_distance\" from column labels
  • Fixed column width (rather than chart width)
  • Row highlighting (on click/hover)

The old spec can be pasted into the Vega-Lite editor and edited as shown in the video below:

Check out the final, improved version chart specification.

Before-After diff
@@ -1,9 +1,8 @@\n{\n-  \"config\": {\n-    \"view\": {\n-      \"continuousWidth\": 400,\n-      \"continuousHeight\": 300\n-    }\n+  \"title\": {\n+    \"text\": \"Heatmaps of string comparison metrics\",\n+    \"anchor\": \"middle\",\n+    \"fontSize\": 16\n  },\n  \"hconcat\": [\n    {\n@@ -18,25 +17,32 @@\n                  0,\n                  1\n                ],\n-                \"range\": [\n-                  \"red\",\n-                  \"green\"\n-                ]\n+                \"scheme\": \"greenblue\"\n              },\n-              \"type\": \"quantitative\"\n+              \"type\": \"quantitative\",\n+              \"legend\": null\n            },\n            \"x\": {\n              \"field\": \"comparator\",\n-              \"type\": \"ordinal\"\n+              \"type\": \"ordinal\",\n+              \"title\": null\n            },\n            \"y\": {\n              \"field\": \"strings_to_compare\",\n-              \"type\": \"ordinal\"\n+              \"type\": \"ordinal\",\n+              \"title\": \"String comparison\",\n+              \"axis\": {\n+                \"titleFontSize\": 14\n+              }\n            }\n          },\n-          \"height\": 300,\n-          \"title\": \"Heatmap of Similarity Scores\",\n-          \"width\": 300\n+          \"title\": \"Similarity\",\n+          \"width\": {\n+            \"step\": 40\n+          },\n+          \"height\": {\n+            \"step\": 30\n+          }\n        },\n        {\n          \"mark\": {\n@@ -44,6 +50,16 @@\n            \"baseline\": \"middle\"\n          },\n          \"encoding\": {\n+            \"size\": {\n+              \"field\": \"score\",\n+              \"scale\": {\n+                \"range\": [\n+                  8,\n+                  14\n+                ]\n+              },\n+              \"legend\": null\n+            },\n            \"text\": {\n              \"field\": \"score\",\n              \"format\": \".2f\",\n@@ -51,7 +67,10 @@\n     
       },\n            \"x\": {\n              \"field\": \"comparator\",\n-              \"type\": \"ordinal\"\n+              \"type\": \"ordinal\",\n+              \"axis\": {\n+                \"labelFontSize\": 12\n+              }\n            },\n            \"y\": {\n              \"field\": \"strings_to_compare\",\n@@ -72,29 +91,33 @@\n            \"color\": {\n              \"field\": \"score\",\n              \"scale\": {\n-                \"domain\": [\n-                  0,\n-                  5\n-                ],\n-                \"range\": [\n-                  \"green\",\n-                  \"red\"\n-                ]\n+                \"scheme\": \"yelloworangered\",\n+                \"reverse\": true\n              },\n-              \"type\": \"quantitative\"\n+              \"type\": \"quantitative\",\n+              \"legend\": null\n            },\n            \"x\": {\n              \"field\": \"comparator\",\n-              \"type\": \"ordinal\"\n+              \"type\": \"ordinal\",\n+              \"title\": null,\n+              \"axis\": {\n+                \"labelFontSize\": 12\n+              }\n            },\n            \"y\": {\n              \"field\": \"strings_to_compare\",\n-              \"type\": \"ordinal\"\n+              \"type\": \"ordinal\",\n+              \"axis\": null\n            }\n          },\n-          \"height\": 300,\n-          \"title\": \"Heatmap of Distance Scores\",\n-          \"width\": 200\n+          \"title\": \"Distance\",\n+          \"width\": {\n+            \"step\": 40\n+          },\n+          \"height\": {\n+            \"step\": 30\n+          }\n        },\n        {\n          \"mark\": {\n@@ -102,6 +125,17 @@\n            \"baseline\": \"middle\"\n          },\n          \"encoding\": {\n+            \"size\": {\n+              \"field\": \"score\",\n+              \"scale\": {\n+                \"range\": [\n+                  8,\n+                  14\n+                ],\n+       
         \"reverse\": true\n+              },\n+              \"legend\": null\n+            },\n            \"text\": {\n              \"field\": \"score\",\n              \"type\": \"quantitative\"\n@@ -124,7 +158,9 @@\n  ],\n  \"resolve\": {\n    \"scale\": {\n-      \"color\": \"independent\"\n+      \"color\": \"independent\",\n+      \"y\": \"shared\",\n+      \"size\": \"independent\"\n    }\n  },\n  \"$schema\": \"https://vega.github.io/schema/vega-lite/v4.17.0.json\",\n
"},{"location":"dev_guides/settings_validation/extending_settings_validator.html","title":"Extending the Settings Validator","text":""},{"location":"dev_guides/settings_validation/extending_settings_validator.html#enhancing-the-settings-validator","title":"Enhancing the Settings Validator","text":""},{"location":"dev_guides/settings_validation/extending_settings_validator.html#overview-of-current-validation-checks","title":"Overview of Current Validation Checks","text":"

Below is a summary of the key validation checks currently implemented by our settings validator. For detailed information, please refer to the source code:

  • Blocking Rules and Comparison Levels Validation: Ensures that the user's blocking rules and comparison levels are correctly imported from the designated library, and that they contain the necessary details for effective use within Splink.
  • Column Existence Verification: Verifies the presence of columns specified in the user's settings across all input dataframes, preventing errors due to missing data fields.
  • Miscellaneous Checks: Conducts a range of additional checks aimed at providing clear and informative error messages, facilitating smoother user experiences when deviations from typical Splink usage are detected.
"},{"location":"dev_guides/settings_validation/extending_settings_validator.html#extending-validation-logic","title":"Extending Validation Logic","text":"

If you are introducing new validation checks that deviate from the existing ones, please incorporate them as functions within a new script located in the splink/settings_validation directory. This ensures that all validation logic is centrally managed and easily maintainable.

"},{"location":"dev_guides/settings_validation/extending_settings_validator.html#error-handling-and-logging","title":"Error handling and logging","text":"

Error handling and logging in the settings validator takes the following forms:

  • Raising INFO level logs - These are raised when the settings validator detects an issue with the user's settings dictionary. These logs are intended to provide the user with information on how to rectify the issue, but should not halt the program.
  • Raising single exceptions - Raise a built-in Python or Splink exception in response to finding an error.
  • Concurrently raising multiple exceptions - In some instances, it makes sense to raise multiple errors simultaneously, so as not to disrupt the program. This is achieved using the ErrorLogger class.

The first two use standard Python logging and exception handling. The third is a custom class, covered in more detail below.

You should look to use whichever makes the most sense given your requirements.

"},{"location":"dev_guides/settings_validation/extending_settings_validator.html#raising-multiple-exceptions-concurrently","title":"Raising multiple exceptions concurrently","text":"

Raising multiple exceptions simultaneously provides users with faster and more manageable feedback, avoiding the tedious back-and-forth that typically occurs when errors are reported and addressed one at a time.

To enable the logging of multiple errors in a single check, the ErrorLogger class can be utilised. This is designed to operate similarly to a list, allowing the storing of errors using the append method.

Once all errors have been logged, you can raise them with the raise_and_log_all_errors method. This will raise an exception of your choice and report all stored errors to the user.

ErrorLogger in practice
from splink.exceptions import ErrorLogger\n\n# Create an error logger instance\ne = ErrorLogger()\n\n# Log your errors\ne.append(SyntaxError(\"The syntax is wrong\"))\ne.append(NameError(\"Invalid name entered\"))\n\n# Raise your errors\ne.raise_and_log_all_errors()\n

"},{"location":"dev_guides/settings_validation/extending_settings_validator.html#expanding-miscellaneous-checks","title":"Expanding miscellaneous checks","text":"

Miscellaneous checks should be added as standalone functions within an appropriate script inside splink/settings_validation. These functions can then be integrated into the linker's startup process for validation.

An example of a miscellaneous check is the validate_dialect function. This assesses whether the settings dialect aligns with the linker's dialect.

This is then injected into the _validate_settings method within our linker, as seen here.
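As a sketch of the pattern (an illustrative reimplementation, not Splink's actual validate_dialect):

```python
def validate_dialect(settings_dialect: str, linker_dialect: str) -> None:
    """Raise if the settings dictionary was written for a different SQL dialect."""
    if settings_dialect != linker_dialect:
        raise ValueError(
            f"Settings dialect '{settings_dialect}' does not match "
            f"linker dialect '{linker_dialect}'"
        )

validate_dialect("duckdb", "duckdb")  # matching dialects: no error raised
```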

"},{"location":"dev_guides/settings_validation/extending_settings_validator.html#additional-comparison-and-blocking-rule-checks","title":"Additional comparison and blocking rule checks","text":"

Comparison and Blocking Rule checks can be found within the valid_types.py script.

These checks currently interface with the ErrorLogger class which is used to store and raise multiple errors simultaneously (see above).

If you wish to expand the current set of tests, it is advised that you incorporate any new checks into either log_comparison_errors or _validate_settings (mentioned above).

"},{"location":"dev_guides/settings_validation/extending_settings_validator.html#checking-for-the-existence-of-user-specified-columns","title":"Checking for the existence of user specified columns","text":"

Column and SQL validation is performed within log_invalid_columns.py.

The aim of this script is to check that the columns specified by the user exist within the input dataframe(s). If any invalid columns are found, the script will log this with the user.

Should you need to include extra checks to assess the validity of columns supplied by a user, your primary focus should be on the column_lookups.py script.

There are two main classes within this script that can be used or extended to perform additional column checks:

InvalidCols

InvalidCols is a NamedTuple, used to construct the bulk of our log strings. This accepts a list of columns and the type of error, producing a complete log string when requested.

For simplicity, there are three partial implementations to cover the most common cases:

  • MissingColumnsLogGenerator - missing column identified.
  • InvalidTableNamesLogGenerator - table name entered by the user is missing or invalid.
  • InvalidColumnSuffixesLogGenerator - _l and _r suffixes are missing or invalid.

In practice, this can be used as follows:

# Store our invalid columns\nmy_invalid_cols = MissingColumnsLogGenerator([\"first_col\", \"second_col\"])\n# Construct the corresponding log string\nmy_invalid_cols.construct_log_string()\n
InvalidColumnsLogger

InvalidColumnsLogger takes in a series of cleansed columns from your settings object (see SettingsColumnCleaner) and runs a series of validation checks to assess whether the column(s) are present within the underlying dataframes.

Any invalid columns are stored in an InvalidCols instance (see above), which is then used to construct a log string.

Logs are output to the user at the INFO level.

To extend the column checks, you simply need to add an additional validation method to the InvalidColumnsLogger class. Checks must be added as a new method and then called within construct_output_logs.
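The extension pattern described above can be sketched with a self-contained mimic. The class and method names below echo Splink's, but this is an illustrative toy, not the real implementation:

```python
from typing import NamedTuple


class InvalidCols(NamedTuple):
    """Mimic of Splink's InvalidCols NamedTuple: an error type plus a list of columns."""
    invalid_type: str
    invalid_columns: list

    def construct_log_string(self) -> str:
        return f"{self.invalid_type}: {', '.join(self.invalid_columns)}"


class InvalidColumnsLogger:
    """Toy logger: each check_* method returns an InvalidCols (or None),
    and construct_output_logs collects the resulting log strings."""

    def __init__(self, settings_columns, input_columns):
        self.settings_columns = settings_columns
        self.input_columns = set(input_columns)

    def check_for_missing_columns(self):
        missing = [c for c in self.settings_columns if c not in self.input_columns]
        return InvalidCols("missing_cols", missing) if missing else None

    def construct_output_logs(self):
        logs = []
        # A new validation check is added as a method, then called here
        for check in (self.check_for_missing_columns,):
            result = check()
            if result:
                logs.append(result.construct_log_string())
        return logs


logger = InvalidColumnsLogger(["first_name", "middle_name"], ["unique_id", "first_name"])
print(logger.construct_output_logs())  # ['missing_cols: middle_name']
```

In the real codebase, the new method would live on InvalidColumnsLogger in log_invalid_columns.py and be called from construct_output_logs in the same way.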

"},{"location":"dev_guides/settings_validation/extending_settings_validator.html#single-column-multi-column-and-sql-checks","title":"Single column, multi-column and SQL checks","text":""},{"location":"dev_guides/settings_validation/extending_settings_validator.html#single-and-multi-column","title":"Single and multi-column","text":"

Single and multi-column checks are relatively straightforward. Assuming you have a clean set of columns, you can leverage the check_for_missing_settings_column function.

This expects the following arguments:

  • settings_id: the settings ID. This is only used for logging and does not necessarily need to match the true ID.
  • settings_column_to_check: the column(s) you wish to validate.
  • valid_input_dataframe_columns: the cleaned columns from all of your input dataframes.
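As an illustration of the behaviour described, here is a hypothetical, self-contained re-implementation of the check. The real function lives in Splink's settings validation code and differs in detail:

```python
def check_for_missing_settings_column(settings_id, settings_column_to_check,
                                      valid_input_dataframe_columns):
    """Return the settings columns that do not exist in the input dataframe(s),
    logging a message if any are found (illustrative sketch, not Splink's code)."""
    missing = set(settings_column_to_check) - set(valid_input_dataframe_columns)
    if missing:
        print(f"{settings_id}: column(s) not found in input data: {sorted(missing)}")
    return missing


valid_cols = ["unique_id", "first_name", "surname", "dob"]

# A unique_id_column_name pointing at a non-existent column is flagged
print(check_for_missing_settings_column("unique_id_column_name", ["uid"], valid_cols))
# An existing column passes silently
print(check_for_missing_settings_column("unique_id_column_name", ["unique_id"], valid_cols))
```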

"},{"location":"dev_guides/settings_validation/extending_settings_validator.html#checking-columns-in-sql-statements","title":"Checking columns in SQL statements","text":"

Checking SQL statements is a little more complex, given the need to parse SQL in order to extract your column names.

To do this, you can leverage the check_for_missing_or_invalid_columns_in_sql_strings function.

This expects the following arguments:

  • sql_dialect: The SQL dialect used by the linker.
  • sql_strings: A list of SQL strings.
  • valid_input_dataframe_columns: The list of columns identified in your input dataframe(s).
  • additional_validation_checks: Functions used to check for other issues with the parsed SQL string, namely table name and column suffix validation.

NB: for nested SQL statements, you'll need to add an additional loop. See check_comparison_for_missing_or_invalid_sql_strings for more details.
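To show the idea, here is a crude, self-contained sketch of pulling column names out of a blocking-rule SQL string with a regex and checking them against the input columns. The real validator uses a proper SQL parser, and the function name here is illustrative:

```python
import re


def columns_in_sql(sql_string: str) -> set:
    """Crudely extract column names referenced with l. / r. prefixes.
    A real implementation would parse the SQL rather than use a regex."""
    return set(re.findall(r"\b[lr]\.(\w+)", sql_string))


valid_cols = {"first_name", "surname", "dob"}
sql = "l.first_name = r.first_name and levenshtein(l.surnme, r.surname) < 3"

missing = columns_in_sql(sql) - valid_cols
print(missing)  # {'surnme'} - a typo the validator would report to the user
```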

"},{"location":"dev_guides/settings_validation/settings_validation_overview.html","title":"Settings Validation Overview","text":""},{"location":"dev_guides/settings_validation/settings_validation_overview.html#settings-validation","title":"Settings Validation","text":"

A common problem within Splink comes from users providing invalid settings dictionaries. To prevent this, we've built a settings validator to scan through a given settings dictionary and provide user-friendly feedback on what needs to be fixed.

At a high level, this includes:

  1. Assessing the structure of the settings dictionary. See the Settings Schema Validation section.
  2. Assessing the contents of the settings dictionary. See the Settings Validator section.
"},{"location":"dev_guides/settings_validation/settings_validation_overview.html#settings-schema-validation","title":"Settings Schema Validation","text":"

Our custom settings schema can be found within settings_jsonschema.json.

This is a json file, outlining the required data type, key and value(s) to be specified by the user while constructing their settings. Where values deviate from this specified schema, an error will be thrown.

Schema validation is currently performed inside the settings.py script.

You can modify the schema by manually editing the json schema.

Modifications can be used to (amongst other uses):

  • Set or remove default values for schema keys.
  • Set the required data type for a given key.
  • Expand or refine previous titles and descriptions to help with clarity.

Any updates you wish to make to the schema should be discussed with the wider team, to ensure it won't break backwards compatibility and makes sense as a design decision.

Detailed information on the arguments that can be supplied to the json schema can be found within the json schema documentation.

"},{"location":"dev_guides/settings_validation/settings_validation_overview.html#settings-validator","title":"Settings Validator","text":"

As long as an input is of the correct data type, it will pass our initial schema checks. This means that user inputs which would generate invalid SQL can slip through, only to be caught later by the database engine, commonly resulting in uninformative and confusing errors that the user is unsure how to resolve.

The settings validation code (found within the settings validation directory of Splink) is another layer of validation, executing a series of checks to determine whether values in the user's settings dictionary will generate invalid SQL.

Frequently encountered problems include:

  • Invalid column names. For example, specifying a unique_id_column_name that doesn't exist in the underlying dataframe(s). Such names satisfy the schema requirements as long as they are strings.
  • Using the settings dictionary's default values
  • Importing comparisons and blocking rules for the wrong dialect.
  • Using an inappropriate custom data type (e.g. a comparison level where a comparison is expected within our comparisons).
  • Using Splink for an invalid form of linkage - See the following discussion.

All code relating to settings validation can be found within one of the following scripts:

  • valid_types.py - This script includes various miscellaneous checks for comparison levels, blocking rules, and linker objects. These checks are primarily performed within settings.py.
  • settings_column_cleaner.py - Includes a set of functions for cleaning and extracting data, designed to sanitise user inputs in the settings dictionary and retrieve necessary SQL or column identifiers.
  • log_invalid_columns.py - Pulls the information extracted in settings_column_cleaner.py and generates any log strings outlining invalid columns or SQL identified within the settings dictionary. Any generated error logs are reported to the user when initialising a linker object at the INFO level.
  • settings_validation_log_strings.py - a home for any error messages or logs generated by the settings validator.

For information on expanding the range of checks available to the validator, see Extending the Settings Validator.

"},{"location":"includes/tags.html","title":"Tags","text":""},{"location":"includes/tags.html#tags","title":"Tags","text":"

Following is a list of relevant tags:

[TAGS]

"},{"location":"includes/generated_files/dataset_labels_table.html","title":"Dataset labels table","text":"dataset name description rows unique entities link to source fake_1000_labels Clerical labels for fake_1000 3,176 NA source"},{"location":"includes/generated_files/datasets_table.html","title":"Datasets table","text":"dataset name description rows unique entities link to source fake_1000 Fake 1000 from splink demos. Records are 250 simulated people, with different numbers of duplicates, labelled. 1,000 250 source historical_50k The data is based on historical persons scraped from wikidata. Duplicate records are introduced with a variety of errors. 50,000 5,156 source febrl3 The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL3 data set contains 5000 records (2000 originals and 3000 duplicates), with a maximum of 5 duplicates based on one original record. 5,000 2,000 source febrl4a The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL4a contains 5000 original records. 5,000 5,000 source febrl4b The Freely Extensible Biomedical Record Linkage (FEBRL) datasets consist of comparison patterns from an epidemiological cancer study in Germany. FEBRL4b contains 5000 duplicate records, one for each record in FEBRL4a. 5,000 5,000 source transactions_origin This data has been generated to resemble bank transactions leaving an account. There are no duplicates within the dataset and each transaction is designed to have a counterpart arriving in 'transactions_destination'. Memo is sometimes truncated or missing. 45,326 45,326 source transactions_destination This data has been generated to resemble bank transactions arriving in an account. There are no duplicates within the dataset and each transaction is designed to have a counterpart sent from 'transactions_origin'. 
There may be a delay between the source and destination account, and the amount may vary due to hidden fees and foreign exchange rates. Memo is sometimes truncated or missing. 45,326 45,326 source"},{"location":"topic_guides/topic_guides_index.html","title":"Introduction","text":""},{"location":"topic_guides/topic_guides_index.html#user-guide","title":"User Guide","text":"

This section contains in-depth guides on a variety of topics and concepts within Splink, as well as data linking more generally. These are intended to provide an extra layer of detail on top of the Splink tutorial and examples.

The user guide is broken up into the following categories:

  1. Record Linkage Theory - for an introduction to data linkage from a theoretical perspective, and to help build some intuition around the parameters being estimated in Splink models.
  2. Linkage Models in Splink - for an introduction to the building blocks of a Splink model. Including the supported SQL Backends and how to define a model with a Splink Settings dictionary.
  3. Data Preparation - for guidance on preparing your data for linkage. Including guidance on feature engineering to help improve Splink models.
  4. Blocking - for an introduction to Blocking Rules and their purpose within record linkage. Including how blocking rules are used in different contexts within Splink.
  5. Comparing Records - for guidance on defining Comparisons within a Splink model. Including how record comparisons are structured within Comparisons, how to utilise string comparators for fuzzy matching, and how to deal with skewed data with Term Frequency Adjustments.
  6. Model Training - for guidance on the methods for training a Splink model, and how to choose them for specific use cases. (Coming soon)
  7. Clustering - for guidance on how records are clustered together. (Coming Soon)
  8. Evaluation - for guidance on how to evaluate Splink models, links and clusters (including Clerical Labelling).
  9. Performance - for guidance on how to make Splink models run more efficiently.
"},{"location":"topic_guides/blocking/blocking_rules.html","title":"What are Blocking Rules?","text":"","tags":["Blocking","Performance"]},{"location":"topic_guides/blocking/blocking_rules.html#what-are-blocking-rules","title":"What are Blocking Rules?","text":"

The primary driver of the run time of Splink is the number of record pairs that the Splink model has to process. This is controlled by the blocking rules.

This guide explains what blocking rules are, and how they can be used.

","tags":["Blocking","Performance"]},{"location":"topic_guides/blocking/blocking_rules.html#introduction","title":"Introduction","text":"

One of the main challenges to overcome in record linkage is the scale of the problem.

The number of pairs of records to compare grows using the formula \\(\\frac{n\\left(n-1\\right)}2\\), i.e. with (approximately) the square of the number of records, as shown in the following chart:

For example, a dataset of 1 million input records would generate around 500 billion pairwise record comparisons.
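This figure follows directly from the formula; a quick arithmetic check:

```python
def num_pairwise_comparisons(n: int) -> int:
    """Number of distinct record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# 1,000 records -> roughly half a million comparisons
print(num_pairwise_comparisons(1_000))      # 499500
# 1 million records -> roughly 500 billion comparisons
print(num_pairwise_comparisons(1_000_000))  # 499999500000
```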

So, when datasets get bigger, the computation can become infeasibly large. We use blocking to reduce the scale of the computation to something more tractable.

","tags":["Blocking","Performance"]},{"location":"topic_guides/blocking/blocking_rules.html#blocking","title":"Blocking","text":"

Blocking is a technique for reducing the number of record pairs that are considered by a model.

Considering a dataset of 1 million records, comparing each record against all of the other records in the dataset generates ~500 billion pairwise comparisons. However, we know the vast majority of these record comparisons won't be matches, so processing the full ~500 billion comparisons would be largely pointless (as well as costly and time-consuming).

Instead, we can define a subset of potential comparisons using Blocking Rules. These are rules that define \"blocks\" of comparisons that should be considered. For example, the blocking rule:

block_on(\"first_name\", \"surname\")

will generate only those pairwise record comparisons where first name and surname match. That is, it is equivalent to joining input records using the SQL condition l.first_name = r.first_name and l.surname = r.surname

Within a Splink model, you can specify multiple Blocking Rules to ensure all potential matches are considered. These are provided as a list. Splink will then produce all record comparisons that satisfy at least one of your blocking rules.
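The "at least one rule" semantics can be illustrated with a toy, pure-Python sketch. This is not Splink's implementation, which generates comparisons in SQL and deduplicates pairs across rules:

```python
records = [
    {"id": 1, "first_name": "Amy", "surname": "Jones", "postcode": "AB1"},
    {"id": 2, "first_name": "Amy", "surname": "Jones", "postcode": "XY9"},
    {"id": 3, "first_name": "Ann", "surname": "Jones", "postcode": "AB1"},
]

# Two rules, analogous to block_on("first_name", "surname") and block_on("postcode")
rules = [
    lambda l, r: l["first_name"] == r["first_name"] and l["surname"] == r["surname"],
    lambda l, r: l["postcode"] == r["postcode"],
]

# A pair is kept if it satisfies at least one rule; using a set means a pair
# matching several rules is still only produced once
pairs = {
    (l["id"], r["id"])
    for i, l in enumerate(records)
    for r in records[i + 1:]
    if any(rule(l, r) for rule in rules)
}
print(sorted(pairs))  # [(1, 2), (1, 3)]
```

Records 2 and 3 are never compared: they satisfy neither rule.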

Further Reading

For more information on blocking, please refer to this article

","tags":["Blocking","Performance"]},{"location":"topic_guides/blocking/blocking_rules.html#blocking-in-splink","title":"Blocking in Splink","text":"

There are two areas in Splink where blocking is used:

  • The first is to generate pairwise comparisons when finding links (running predict()). This is the sense in which 'blocking' is usually understood in the context of record linkage. These blocking rules are provided in the model settings using blocking_rules_to_generate_predictions.

  • The second is a less familiar application of blocking: using it for model training. This is a more advanced topic, and is covered in the model training topic guide.

","tags":["Blocking","Performance"]},{"location":"topic_guides/blocking/blocking_rules.html#choosing-blocking_rules_to_generate_predictions","title":"Choosing blocking_rules_to_generate_predictions","text":"

The blocking rules specified in your settings at blocking_rules_to_generate_predictions are the single most important determinant of how quickly your linkage runs. This is because the number of comparisons generated is usually many times higher than the number of input records.

How can we choose a good set of blocking rules? It's usually better to use a longer list of strict blocking rules, than a short list of loose blocking rules. Let's see why:

The aims of our blocking rules are to:

  • Capture as many true matches as possible
  • Reduce the total number of comparisons being generated

There is a tension between these aims, because by choosing loose blocking rules which generate more comparisons, you have a greater chance of capturing all true matches.

A single rule is unlikely to be able to achieve both aims.

For example, consider:

SettingsCreator(\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\", \"surname\")\n    ]\n)\n
This will generate comparisons for all true matches where names match. But it would miss a true match where there was a typo in the name.

This is why blocking_rules_to_generate_predictions is a list.

Suppose we also block on postcode:

SettingsCreator(\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\", \"surname\"),\n        block_on(\"postcode\")\n    ]\n)\n

Now it doesn't matter if there's a typo in the name so long as postcode matches (and vice versa).

We could take this further and block on, say, date_of_birth as well.

By specifying a variety of blocking_rules_to_generate_predictions, even if each rule on its own is relatively tight, it becomes implausible that a truly matching record would not be captured by at least one of the rules.

","tags":["Blocking","Performance"]},{"location":"topic_guides/blocking/blocking_rules.html#tightening-blocking-rules-for-linking-larger-datasets","title":"Tightening blocking rules for linking larger datasets","text":"

As the size of your input data grows, tighter blocking rules may be needed. Blocking on, say, first_name and surname may be insufficiently tight to reduce the number of comparisons down to a computationally tractable number.

In this situation, it's often best to use an even larger list of tighter blocking rules.

An example could be something like:

SettingsCreator(\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\", \"surname\", \"substr(postcode,1,3)\"),\n        block_on(\"surname\", \"dob\"),\n        block_on(\"first_name\", \"dob\"),\n        block_on(\"dob\", \"postcode\"),\n        block_on(\"first_name\", \"postcode\"),\n        block_on(\"surname\", \"postcode\")\n    ]\n)\n
","tags":["Blocking","Performance"]},{"location":"topic_guides/blocking/blocking_rules.html#analysing-blocking_rules_to_generate_predictions","title":"Analysing blocking_rules_to_generate_predictions","text":"

It's generally a good idea to analyse the number of comparisons generated by your blocking rules before trying to use them to make predictions, to make sure you don't accidentally generate trillions of pairs. You can use the following function to do this:

from splink.blocking_analysis import count_comparisons_from_blocking_rule\n\nbr = block_on(\"substr(first_name, 1,1)\", \"surname\")\n\ncount_comparisons_from_blocking_rule(\n        table_or_tables=df,\n        blocking_rule=br,\n        link_type=\"dedupe_only\",\n        db_api=db_api,\n    )\n
","tags":["Blocking","Performance"]},{"location":"topic_guides/blocking/blocking_rules.html#more-compelex-blocking-rules","title":"More complex blocking rules","text":"

It is possible to use more complex blocking rules that use non-equijoin conditions. For example, you could use a blocking rule that uses a fuzzy matching function:

l.first_name = r.first_name and levenshtein(l.surname, r.surname) < 3\n

However, this will not be executed very efficiently, for reasons described in this page.

","tags":["Blocking","Performance"]},{"location":"topic_guides/blocking/model_training.html","title":"Model Training Blocking Rules","text":""},{"location":"topic_guides/blocking/model_training.html#blocking-for-model-training","title":"Blocking for Model Training","text":"

Model Training Blocking Rules choose which record pairs from a dataset get considered when training a Splink model. These are used during Expectation Maximisation (EM), where we estimate the m probability (in most cases).

The aim of Model Training Blocking Rules is to reduce the number of record pairs considered when training a Splink model, in order to reduce the computational resource required. Each Training Blocking Rule defines a training \"block\" of records which contains a combination of matches and non-matches that are considered by Splink's Expectation Maximisation algorithm.

The Expectation Maximisation algorithm seems to work best when the pairwise record comparisons are a mix of anywhere between around 0.1% and 99.9% true matches. It works less efficiently if there is a huge imbalance between the two (e.g. a billion non-matches and only a hundred matches).

Note

Unlike blocking rules for prediction, it does not matter if Training Rules exclude some true matches - they just need to generate examples of matches and non-matches.

"},{"location":"topic_guides/blocking/model_training.html#using-training-rules-in-splink","title":"Using Training Rules in Splink","text":"

Blocking Rules for Model Training are used as a parameter in the estimate_parameters_using_expectation_maximisation function. After a linker object has been instantiated, you can estimate m probability with training sessions such as:

from splink.duckdb.blocking_rule_library import block_on\n\nblocking_rule_for_training = block_on(\"first_name\")\nlinker.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n)\n

Here, we have defined a \"block\" of records where first_name is the same. As names are not unique, we can be pretty sure that there will be a combination of matches and non-matches in this \"block\", which is what is required for the EM algorithm.

Matching only on first_name will likely generate a large \"block\" of pairwise comparisons which will take longer to run. In this case it may be worthwhile applying a stricter blocking rule to reduce runtime. For example, a match on first_name and surname:

from splink.duckdb.blocking_rule_library import block_on\nblocking_rule_for_training = block_on([\"first_name\", \"surname\"])\nlinker.estimate_parameters_using_expectation_maximisation(\n    blocking_rule_for_training\n    )\n

which will still have a combination of matches and non-matches, but fewer record pairs to consider.

"},{"location":"topic_guides/blocking/model_training.html#choosing-training-rules","title":"Choosing Training Rules","text":"

The idea behind Training Rules is to consider \"blocks\" of record pairs with a mixture of matches and non-matches. In practice, most blocking rules have a mixture of matches and non-matches so the primary consideration should be to reduce the runtime of model training by choosing Training Rules that reduce the number of record pairs in the training set.

There are some tools within Splink to help with choosing these rules. For example, the count_num_comparisons_from_blocking_rule function gives the number of record pairs generated by a blocking rule:

from splink.duckdb.blocking_rule_library import block_on\nblocking_rule = block_on([\"first_name\", \"surname\"])\nlinker.count_num_comparisons_from_blocking_rule(blocking_rule)\n

1056

It is recommended that you run this function to check how many comparisons are generated before training a model so that you do not needlessly run a training session on billions of comparisons.

Note

Unlike blocking rules for prediction, Training Rules are treated separately for each EM training session therefore the total number of comparisons for Model Training is simply the sum of count_num_comparisons_from_blocking_rule across all Blocking Rules (as opposed to the result of cumulative_comparisons_from_blocking_rules_records).

"},{"location":"topic_guides/blocking/performance.html","title":"Computational Performance","text":""},{"location":"topic_guides/blocking/performance.html#blocking-rule-performance","title":"Blocking Rule Performance","text":"

When considering computational performance of blocking rules, there are two main drivers to address:

  • How many pairwise comparisons are generated
  • How long each pairwise comparison takes to run

Below we run through an example of how to address each of these drivers.

"},{"location":"topic_guides/blocking/performance.html#strict-vs-lenient-blocking-rules","title":"Strict vs lenient Blocking Rules","text":"

One way to reduce the number of comparisons being considered within a model is to apply strict blocking rules. However, this can have a significant impact on how well the Splink model works.

In reality, we recommend getting a model up and running with strict Blocking Rules and incrementally loosening them to see the impact on the runtime and quality of the results. By starting with strict blocking rules, the linking process will run faster which means you can iterate through model versions more quickly.

Example - Incrementally loosening Prediction Blocking Rules

When choosing Prediction Blocking Rules, consider how blocking_rules_to_generate_predictions may be made incrementally less strict. We may start with the following rule:

l.first_name = r.first_name and l.surname = r.surname and l.dob = r.dob.

This is a very strict rule, and will only create comparisons where full name and date of birth match. This has the advantage of creating few record comparisons, but the disadvantage that the rule will miss true matches where there are typos or nulls in any of these three fields.

This blocking rule could be loosened to:

substr(l.first_name,1,1) = substr(r.first_name,1,1) and l.surname = r.surname and l.year_of_birth = r.year_of_birth

Now it allows for typos or aliases in the first name, so long as the first letter is the same, and errors in month or day of birth.

Depending on the size of your input data, the rule could be further loosened to

substr(l.first_name,1,1) = substr(r.first_name,1,1) and l.surname = r.surname

or even

l.surname = r.surname

The user could use the linker.count_num_comparisons_from_blocking_rule() function to select which rule is appropriate for their data.

"},{"location":"topic_guides/blocking/performance.html#efficient-blocking-rules","title":"Efficient Blocking Rules","text":"

While the number of pairwise comparisons is important for reducing the computation, it is also helpful to consider the efficiency of the Blocking Rules. There are a number of ways to define subsets of records (i.e. \"blocks\"), but they are not all computationally efficient.

From a performance perspective, here we consider two classes of blocking rule:

  • Equi-join conditions
  • Filter conditions
"},{"location":"topic_guides/blocking/performance.html#equi-join-conditions","title":"Equi-join Conditions","text":"

Equi-joins are simply equality conditions between records, e.g.

l.first_name = r.first_name

Equality-based blocking rules can be executed efficiently by SQL engines in the sense that the engine is able to create only the record pairs that satisfy the blocking rule. The engine does not have to create all possible record pairs and then filter out the pairs that do not satisfy the blocking rule. This is in contrast to filter conditions (see below), where the engine has to create a larger set of comparisons and then filter it down.

Due to this efficiency advantage, equality-based blocking rules should be considered the default method for defining blocking rules. For example, the above example can be written as:

from splink import block_on\nblock_on(\"first_name\")\n
"},{"location":"topic_guides/blocking/performance.html#filter-conditions","title":"Filter Conditions","text":"

Filter conditions refer to any Blocking Rule that isn't a simple equality between columns. E.g.

levenshtein(l.surname, r.surname) < 3

Blocking rules which use similarity or distance functions, such as the example above, are inefficient as the levenshtein function needs to be evaluated for all possible record comparisons before filtering out the pairs that do not satisfy the filter condition.
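The difference between the two classes of rule can be illustrated in pure Python: grouping on the join key (as a SQL engine's hash join does) builds only the matching pairs, whereas a filter condition must be evaluated against every one of the n(n-1)/2 candidate pairs. This is a toy sketch of the principle, not engine code:

```python
from collections import defaultdict
from itertools import combinations

names = ["amy", "amy", "ann", "bob", "bob", "bob"]

# Equi-join style: group record indexes by the join key, pair only within groups
groups = defaultdict(list)
for idx, name in enumerate(names):
    groups[name].append(idx)
equi_pairs = [p for ids in groups.values() for p in combinations(ids, 2)]

# Filter style: the predicate is evaluated on every candidate pair
evaluated = 0
filter_pairs = []
for l, r in combinations(range(len(names)), 2):
    evaluated += 1
    if names[l] == names[r]:
        filter_pairs.append((l, r))

print(sorted(equi_pairs))  # [(0, 1), (3, 4), (3, 5), (4, 5)]
print(evaluated)           # 15 candidate pairs scanned to find the same 4 results
```

Replace the equality test with a similarity function and only the filter-style approach remains possible, which is why such rules scale so poorly.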

"},{"location":"topic_guides/blocking/performance.html#combining-blocking-rules-efficiently","title":"Combining Blocking Rules Efficiently","text":"

Just as the choice of Blocking Rules can impact performance, so can the way they are combined. The most efficient Blocking Rule combinations are \"AND\" statements. E.g.

block_on(\"first_name\", \"surname\")

which is equivalent to

l.first_name = r.first_name AND l.surname = r.surname

\"OR\" statements are extremely inefficient and should almost never be used. E.g.

l.first_name = r.first_name OR l.surname = r.surname

In most SQL engines, an OR condition within a blocking rule will result in all possible record comparisons being generated. That is, the whole blocking rule becomes a filter condition rather than an equi-join condition, so these should be avoided. For further information, see here.

Instead of including the OR condition in the blocking rule, provide two blocking rules to Splink. This will achieve the desired outcome of generating all comparisons where either the first name or surname match.

SettingsCreator(\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"surname\")\n    ]\n)\n
Spark-specific Further Reading

Given the ability to parallelise operations in Spark, there are some additional configuration options which can improve performance of blocking. Please refer to the Spark Performance Topic Guides for more information.

Note: In Spark, equi-joins are implemented using hash partitioning, which facilitates splitting the workload across multiple machines.

"},{"location":"topic_guides/comparisons/choosing_comparators.html","title":"Choosing string comparators","text":""},{"location":"topic_guides/comparisons/choosing_comparators.html#choosing-string-comparators","title":"Choosing String Comparators","text":"

When building a Splink model, one of the most important aspects is defining the Comparisons and Comparison Levels that the model will train on. Each Comparison Level within a Comparison should contain a different amount of evidence that two records are a match, to which the model can assign a match weight. When considering different amounts of evidence for the model, it is helpful to explore fuzzy matching as a way of distinguishing strings that are similar, but not the same, as one another.

This guide is intended to show how Splink's string comparators perform in different situations, to help you choose the most appropriate comparator for a given column as well as the most appropriate threshold (or thresholds). For descriptions and examples of each string comparator available in Splink, see the dedicated topic guide.

"},{"location":"topic_guides/comparisons/choosing_comparators.html#what-options-are-available-when-comparing-strings","title":"What options are available when comparing strings?","text":"

There are three main classes of string comparator that are considered within Splink:

  1. String Similarity Scores
  2. String Distance Scores
  3. Phonetic Matching

where

String Similarity Scores are scores between 0 and 1 indicating how similar two strings are. 0 represents two completely dissimilar strings and 1 represents identical strings. E.g. Jaro-Winkler Similarity.

String Distance Scores are integer distances, counting the number of operations to convert one string into another. A lower string distance indicates more similar strings. E.g. Levenshtein Distance.
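As an illustration of a distance score (Splink computes these inside the SQL backend, so this pure-Python version is purely for intuition), here is a minimal Levenshtein distance implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b,
    via the standard dynamic-programming recurrence (one row kept at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("Richard", "iRchard"))  # 2 - a transposition costs two edits
print(levenshtein("Richard", "Rich"))     # 3 - three trailing deletions
```

The cost of 2 for a simple transposition is why Damerau-Levenshtein, which treats a transposition as a single edit, often scores typos more intuitively.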

Phonetic Matching determines whether two strings are phonetically similar. The two strings are passed through a phonetic transformation algorithm and then the resulting phonetic codes are matched. E.g. Double Metaphone.
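To illustrate the idea of phonetic matching, here is the classic American Soundex algorithm in pure Python. Soundex is a simpler stand-in for algorithms like Double Metaphone, used here only to show how strings are compared via their phonetic codes rather than their characters:

```python
def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits encoding consonant sounds."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":              # h and w do not separate duplicate codes
            continue
        code = codes.get(ch, "")
        if code and code != prev:   # skip repeats of the same consonant sound
            result += code
        prev = code                 # vowels reset prev, allowing a repeat later
    return (result + "000")[:4]     # pad with zeros to a fixed 4-character code


# Phonetically similar spellings map to the same code...
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
# ...but an alias like "Dick" still fails to match "Richard"
print(soundex("Richard"), soundex("Dick"))    # R263 D200
```

As with the string metrics above, even phonetic codes cannot bridge aliases such as "Dick" for "Richard", which share no initial sound.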

"},{"location":"topic_guides/comparisons/choosing_comparators.html#comparing-string-similarity-and-distance-scores","title":"Comparing String Similarity and Distance Scores","text":"

Splink contains a comparison_helpers module which includes helper functions for comparing the string similarity and distance scores, which can help when choosing the most appropriate fuzzy matching function.

For comparing two strings the comparator_score function returns the scores for all of the available comparators. E.g. consider a simple inversion \"Richard\" vs \"iRchard\":

from splink.exploratory import similarity_analysis as sa\n\nsa.comparator_score(\"Richard\", \"iRchard\")\n
string1 string2 levenshtein_distance damerau_levenshtein_distance jaro_similarity jaro_winkler_similarity jaccard_similarity 0 Richard iRchard 2 1 0.95 0.95 1.0

Now consider a collection of common variations of the name \"Richard\" - which comparators will consider these variations as sufficiently similar to \"Richard\"?

import pandas as pd\n\ndata = [\n    {\"string1\": \"Richard\", \"string2\": \"Richard\", \"error_type\": \"None\"},\n    {\"string1\": \"Richard\", \"string2\": \"ichard\", \"error_type\": \"Deletion\"},\n    {\"string1\": \"Richard\", \"string2\": \"Richar\", \"error_type\": \"Deletion\"},\n    {\"string1\": \"Richard\", \"string2\": \"iRchard\", \"error_type\": \"Transposition\"},\n    {\"string1\": \"Richard\", \"string2\": \"Richadr\", \"error_type\": \"Transposition\"},\n    {\"string1\": \"Richard\", \"string2\": \"Rich\", \"error_type\": \"Shortening\"},\n    {\"string1\": \"Richard\", \"string2\": \"Rick\", \"error_type\": \"Nickname/Alias\"},\n    {\"string1\": \"Richard\", \"string2\": \"Ricky\", \"error_type\": \"Nickname/Alias\"},\n    {\"string1\": \"Richard\", \"string2\": \"Dick\", \"error_type\": \"Nickname/Alias\"},\n    {\"string1\": \"Richard\", \"string2\": \"Rico\", \"error_type\": \"Nickname/Alias\"},\n    {\"string1\": \"Richard\", \"string2\": \"Rachael\", \"error_type\": \"Different Name\"},\n    {\"string1\": \"Richard\", \"string2\": \"Stephen\", \"error_type\": \"Different Name\"},\n]\n\ndf = pd.DataFrame(data)\ndf\n
| | string1 | string2 | error_type |
|---|---|---|---|
| 0 | Richard | Richard | None |
| 1 | Richard | ichard | Deletion |
| 2 | Richard | Richar | Deletion |
| 3 | Richard | iRchard | Transposition |
| 4 | Richard | Richadr | Transposition |
| 5 | Richard | Rich | Shortening |
| 6 | Richard | Rick | Nickname/Alias |
| 7 | Richard | Ricky | Nickname/Alias |
| 8 | Richard | Dick | Nickname/Alias |
| 9 | Richard | Rico | Nickname/Alias |
| 10 | Richard | Rachael | Different Name |
| 11 | Richard | Stephen | Different Name |

The comparator_score_chart function allows you to compare two lists of strings and how similar the elements are according to the available string similarity and distance metrics.

sa.comparator_score_chart(data, \"string1\", \"string2\")\n

Here we can see that all of the metrics are fairly sensitive to transcription errors (\"Richadr\", \"Richar\", \"iRchard\"). However, for nicknames/aliases (\"Rick\", \"Ricky\", \"Rico\"), simple metrics such as Jaccard, Levenshtein and Damerau-Levenshtein tend to be less useful. The same applies to name shortenings (\"Rich\"), though to a lesser extent than for more complex nicknames. Even the more subtle metrics, Jaro and Jaro-Winkler, struggle to identify less obvious nicknames/aliases such as \"Dick\".

If you would prefer the underlying dataframe instead of the chart, there is the comparator_score_df function.

sa.comparator_score_df(data, \"string1\", \"string2\")\n
| | string1 | string2 | levenshtein_distance | damerau_levenshtein_distance | jaro_similarity | jaro_winkler_similarity | jaccard_similarity |
|---|---|---|---|---|---|---|---|
| 0 | Richard | Richard | 0 | 0 | 1.00 | 1.00 | 1.00 |
| 1 | Richard | ichard | 1 | 1 | 0.95 | 0.95 | 0.86 |
| 2 | Richard | Richar | 1 | 1 | 0.95 | 0.97 | 0.86 |
| 3 | Richard | iRchard | 2 | 1 | 0.95 | 0.95 | 1.00 |
| 4 | Richard | Richadr | 2 | 1 | 0.95 | 0.97 | 1.00 |
| 5 | Richard | Rich | 3 | 3 | 0.86 | 0.91 | 0.57 |
| 6 | Richard | Rick | 4 | 4 | 0.73 | 0.81 | 0.38 |
| 7 | Richard | Ricky | 4 | 4 | 0.68 | 0.68 | 0.33 |
| 8 | Richard | Dick | 5 | 5 | 0.60 | 0.60 | 0.22 |
| 9 | Richard | Rico | 4 | 4 | 0.73 | 0.81 | 0.38 |
| 10 | Richard | Rachael | 3 | 3 | 0.71 | 0.74 | 0.44 |
| 11 | Richard | Stephen | 7 | 7 | 0.43 | 0.43 | 0.08 |
"},{"location":"topic_guides/comparisons/choosing_comparators.html#choosing-thresholds","title":"Choosing thresholds","text":"

We can add distance and similarity thresholds to the comparators to see what strings would be included in a given comparison level:

sa.comparator_score_threshold_chart(\n    data, \"string1\", \"string2\", distance_threshold=2, similarity_threshold=0.8\n)\n

To class our variations on \"Richard\" in the same Comparison Level, a good choice of metric could be Jaro-Winkler with a threshold of 0.8. Lowering the threshold any further would increase the chance of false positives.

For example, a single Jaro-Winkler Comparison Level with a threshold of 0.7 would lead to \"Rachael\" being considered as providing the same amount of evidence for a record matching as \"iRchard\".

An alternative way around this is to construct a Comparison with multiple levels, each corresponding to a different threshold of Jaro-Winkler similarity. For example, below we construct a Comparison using the Comparison Library function JaroWinklerAtThresholds with multiple levels for different match thresholds:

import splink.comparison_library as cl\n\nfirst_name_comparison = cl.JaroWinklerAtThresholds(\"first_name\", [0.9, 0.8, 0.7])\n

If we print this comparison as a dictionary we can see the underlying SQL.

first_name_comparison.get_comparison(\"duckdb\").as_dict()\n
{'output_column_name': 'first_name',\n 'comparison_levels': [{'sql_condition': '\"first_name_l\" IS NULL OR \"first_name_r\" IS NULL',\n   'label_for_charts': 'first_name is NULL',\n   'is_null_level': True},\n  {'sql_condition': '\"first_name_l\" = \"first_name_r\"',\n   'label_for_charts': 'Exact match on first_name'},\n  {'sql_condition': 'jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.9',\n   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.9'},\n  {'sql_condition': 'jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.8',\n   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.8'},\n  {'sql_condition': 'jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.7',\n   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.7'},\n  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],\n 'comparison_description': 'JaroWinklerAtThresholds'}\n

Where:

  • Exact Match level will catch perfect matches (\"Richard\").
  • The 0.9 threshold will catch Shortenings and Typos (\"ichard\", \"Richar\", \"iRchard\", \"Richadr\", \"Rich\").
  • The 0.8 threshold will catch simple Nicknames/Aliases (\"Rick\", \"Rico\").
  • The 0.7 threshold will catch more complex Nicknames/Aliases (\"Ricky\"), but will also include less relevant names (e.g. \"Rachael\"). However, this should not be a concern as the model should give less predictive power (i.e. Match Weight) to this level of evidence.
  • All other comparisons will end up in the \"Else\" level.
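Because the levels are evaluated in order, this tiered Comparison behaves like an if/elif cascade: a score is captured by the first level whose threshold it meets. As a rough sketch in plain Python (a hypothetical helper for illustration, not part of Splink's API):

```python
def assign_level(score, thresholds=(0.9, 0.8, 0.7)):
    # Return the label of the first level whose threshold the score meets;
    # scores below every threshold fall through to the "Else" level.
    for t in thresholds:
        if score >= t:
            return f"Jaro-Winkler similarity >= {t}"
    return "All other comparisons"

assign_level(0.95)  # caught by the 0.9 level
assign_level(0.82)  # caught by the 0.8 level
assign_level(0.5)   # falls through to the "Else" level
```

Note that in the Splink Comparison above, exact matches are caught by the dedicated Exact Match level before any similarity threshold is evaluated.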
"},{"location":"topic_guides/comparisons/choosing_comparators.html#phonetic-matching","title":"Phonetic Matching","text":"

There are similar functions available within Splink to help users get familiar with phonetic transformations, allowing you to create visualisations similar to those for the string comparators.

To see the phonetic transformations for a single string, there is the phonetic_transform function:

sa.phonetic_transform(\"Richard\")\n
{'soundex': 'R02063', 'metaphone': 'RXRT', 'dmetaphone': ('RXRT', 'RKRT')}\n
sa.phonetic_transform(\"Steven\")\n
{'soundex': 'S30105', 'metaphone': 'STFN', 'dmetaphone': ('STFN', '')}\n

Now consider a collection of common variations of the name \"Stephen\". Which phonetic transforms will consider these as sufficiently similar to \"Stephen\"?

data = [\n    {\"string1\": \"Stephen\", \"string2\": \"Stephen\", \"error_type\": \"None\"},\n    {\"string1\": \"Stephen\", \"string2\": \"Steven\", \"error_type\": \"Spelling Variation\"},\n    {\"string1\": \"Stephen\", \"string2\": \"Stephan\", \"error_type\": \"Spelling Variation/Similar Name\"},\n    {\"string1\": \"Stephen\", \"string2\": \"Steve\", \"error_type\": \"Nickname/Alias\"},\n    {\"string1\": \"Stephen\", \"string2\": \"Stehpen\", \"error_type\": \"Transposition\"},\n    {\"string1\": \"Stephen\", \"string2\": \"tSephen\", \"error_type\": \"Transposition\"},\n    {\"string1\": \"Stephen\", \"string2\": \"Stephne\", \"error_type\": \"Transposition\"},\n    {\"string1\": \"Stephen\", \"string2\": \"Stphen\", \"error_type\": \"Deletion\"},\n    {\"string1\": \"Stephen\", \"string2\": \"Stepheb\", \"error_type\": \"Replacement\"},\n    {\"string1\": \"Stephen\", \"string2\": \"Stephanie\", \"error_type\": \"Different Name\"},\n    {\"string1\": \"Stephen\", \"string2\": \"Richard\", \"error_type\": \"Different Name\"},\n]\n\n\ndf = pd.DataFrame(data)\ndf\n
| | string1 | string2 | error_type |
|---|---|---|---|
| 0 | Stephen | Stephen | None |
| 1 | Stephen | Steven | Spelling Variation |
| 2 | Stephen | Stephan | Spelling Variation/Similar Name |
| 3 | Stephen | Steve | Nickname/Alias |
| 4 | Stephen | Stehpen | Transposition |
| 5 | Stephen | tSephen | Transposition |
| 6 | Stephen | Stephne | Transposition |
| 7 | Stephen | Stphen | Deletion |
| 8 | Stephen | Stepheb | Replacement |
| 9 | Stephen | Stephanie | Different Name |
| 10 | Stephen | Richard | Different Name |

The phonetic_match_chart function allows you to compare two lists of strings and see whether the elements match according to the available phonetic algorithms.

sa.phonetic_match_chart(data, \"string1\", \"string2\")\n

Here we can see that all of the algorithms recognise simple phonetically similar names (\"Stephen\", \"Steven\"). However, there is some variation when it comes to transposition errors (\"Stehpen\", \"Stephne\"), with Soundex and the Metaphone-based algorithms giving different results. Behaviour also differs when considering different names (\"Stephanie\").

Given there is no clear winner that captures all of the similar names, it is recommended that phonetic matches are used as a single Comparison Level within a Comparison which also includes string comparators in the other levels. To see an example of this, see the Combining String scores and Phonetic matching section of this topic guide.

If you would prefer the underlying dataframe instead of the chart, there is the phonetic_transform_df function.

sa.phonetic_transform_df(data, \"string1\", \"string2\")\n
| | string1 | string2 | soundex | metaphone | dmetaphone |
|---|---|---|---|---|---|
| 0 | Stephen | Stephen | [S30105, S30105] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 1 | Stephen | Steven | [S30105, S30105] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 2 | Stephen | Stephan | [S30105, S30105] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 3 | Stephen | Steve | [S30105, S3010] | [STFN, STF] | [(STFN, ), (STF, )] |
| 4 | Stephen | Stehpen | [S30105, S30105] | [STFN, STPN] | [(STFN, ), (STPN, )] |
| 5 | Stephen | tSephen | [S30105, t50105] | [STFN, TSFN] | [(STFN, ), (TSFN, )] |
| 6 | Stephen | Stephne | [S30105, S301050] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 7 | Stephen | Stphen | [S30105, S3105] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 8 | Stephen | Stepheb | [S30105, S30101] | [STFN, STFP] | [(STFN, ), (STFP, )] |
| 9 | Stephen | Stephanie | [S30105, S301050] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 10 | Stephen | Richard | [S30105, R02063] | [STFN, RXRT] | [(STFN, ), (RXRT, RKRT)] |
"},{"location":"topic_guides/comparisons/choosing_comparators.html#combining-string-scores-and-phonetic-matching","title":"Combining String scores and Phonetic matching","text":"

Once you have considered all of the string comparators and phonetic transforms for a given column, you may decide that you would like to have multiple comparison levels including a combination of options.

For this you can construct a custom comparison to catch all of the edge cases you want. For example, if you decide that the comparison for first_name in the model should consider:

  1. A Dmetaphone level for phonetic similarity
  2. A Levenshtein level with distance of 2 for typos
  3. A Jaro-Winkler level with similarity 0.8 for fuzzy matching
import splink.comparison_library as cl\nimport splink.comparison_level_library as cll\n\nfirst_name_comparison = cl.CustomComparison(\n    output_column_name=\"first_name\",\n    comparison_levels=[\n        cll.NullLevel(\"first_name\"),\n        cll.ExactMatchLevel(\"first_name\"),\n        cll.LevenshteinLevel(\"first_name\", 2),\n        cll.JaroWinklerLevel(\"first_name\", 0.8),\n        cll.ArrayIntersectLevel(\"first_name_dm\", 1),\n        cll.ElseLevel(),\n    ],\n)\n\nprint(first_name_comparison.get_comparison(\"duckdb\").human_readable_description)\n
Comparison 'CustomComparison' of \"first_name\" and \"first_name_dm\".\nSimilarity is assessed using the following ComparisonLevels:\n    - 'first_name is NULL' with SQL rule: \"first_name_l\" IS NULL OR \"first_name_r\" IS NULL\n    - 'Exact match on first_name' with SQL rule: \"first_name_l\" = \"first_name_r\"\n    - 'Levenshtein distance of first_name <= 2' with SQL rule: levenshtein(\"first_name_l\", \"first_name_r\") <= 2\n    - 'Jaro-Winkler distance of first_name >= 0.8' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.8\n    - 'Array intersection size >= 1' with SQL rule: array_length(list_intersect(\"first_name_dm_l\", \"first_name_dm_r\")) >= 1\n    - 'All other comparisons' with SQL rule: ELSE\n

where first_name_dm refers to a column in the dataset which has been created during the feature engineering step to give the Dmetaphone transform of first_name.

"},{"location":"topic_guides/comparisons/comparators.html","title":"String comparators","text":"","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#string-comparators","title":"String Comparators","text":"

There are a number of string comparator functions available in Splink that allow fuzzy matching for strings within Comparisons and Comparison Levels. For each of these fuzzy matching functions, below you will find explanations of how they work, worked examples and recommendations for the types of data they are useful for.

For guidance on how to choose the most suitable string comparator, and associated threshold, see the dedicated topic guide.

","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#levenshtein-distance","title":"Levenshtein Distance","text":"

At a glance

Useful for: Data entry errors e.g. character miskeys. Splink comparison functions: LevenshteinLevel() and LevenshteinAtThresholds() Returns: An integer (lower is more similar).

","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#description","title":"Description","text":"

Levenshtein distance, also known as edit distance, is a measure of the difference between two strings. It represents the minimum number of insertions, deletions, or substitutions of characters required to transform one string into the other.

Or, as a formula,

\\[\\textsf{Levenshtein}(s_1, s_2) = \\min \\lbrace \\begin{array}{l} \\text{insertion , } \\text{deletion , } \\text{substitution} \\end{array} \\rbrace \\]","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#examples","title":"Examples","text":"\"KITTEN\" vs \"SITTING\"

The minimum number of operations to convert \"KITTEN\" into \"SITTING\" are:

  • Substitute \"K\" in \"KITTEN\" with \"S\" to get \"SITTEN.\"
  • Substitute \"E\" in \"SITTEN\" with \"I\" to get \"SITTIN.\"
  • Insert \"G\" after \"N\" in \"SITTIN\" to get \"SITTING.\"

Therefore,

\\[\\textsf{Levenshtein}(\\texttt{KITTEN}, \\texttt{SITTING}) = 3\\] \"CAKE\" vs \"ACKE\"

The minimum number of operations to convert \"CAKE\" into \"ACKE\" are:

  • Substitute \"C\" in \"CAKE\" with \"A\" to get \"AAKE.\"
  • Substitute the second \"A\" in \"AAKE\" with \"C\" to get \"ACKE\".

Therefore,

\\[\\textsf{Levenshtein}(\\texttt{CAKE}, \\texttt{ACKE}) = 2\\]","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#sample-code","title":"Sample code","text":"

You can test out the Levenshtein distance as follows:

import duckdb\nduckdb.sql(\"SELECT levenshtein('CAKE', 'ACKE')\").df().iloc[0,0]\n

2
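If you don't have DuckDB to hand, the same distance can be computed with a short pure-Python sketch of the standard dynamic-programming recurrence (for illustration only; in Splink the SQL backend does this work):

```python
def levenshtein(s1, s2):
    # Classic dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

levenshtein("CAKE", "ACKE")       # 2
levenshtein("KITTEN", "SITTING")  # 3
```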

","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#damerau-levenshtein-distance","title":"Damerau-Levenshtein Distance","text":"

At a glance

Useful for: Data entry errors e.g. character transpositions and miskeys. Splink comparison functions: DamerauLevenshteinLevel() and DamerauLevenshteinAtThresholds() Returns: An integer (lower is more similar).

","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#description_1","title":"Description","text":"

Damerau-Levenshtein distance is a variation of Levenshtein distance that also includes transposition operations, which are the interchange of adjacent characters. This distance measures the minimum number of operations required to transform one string into another by allowing insertions, deletions, substitutions, and transpositions of characters.

Or, as a formula,

\\[\\textsf{DamerauLevenshtein}(s_1, s_2) = \\min \\lbrace \\begin{array}{l} \\text{insertion , } \\text{deletion , } \\text{substitution , } \\text{transposition} \\end{array} \\rbrace \\]","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#examples_1","title":"Examples","text":"\"KITTEN\" vs \"SITTING\"

The minimum number of operations to convert \"KITTEN\" into \"SITTING\" are:

  • Substitute \"K\" in \"KITTEN\" with \"S\" to get \"SITTEN\".
  • Substitute \"E\" in \"SITTEN\" with \"I\" to get \"SITTIN\".
  • Insert \"G\" after \"T\" in \"SITTIN\" to get \"SITTING\".

Therefore,

\\[\\textsf{DamerauLevenshtein}(\\texttt{KITTEN}, \\texttt{SITTING}) = 3\\] \"CAKE\" vs \"ACKE\"

The minimum number of operations to convert \"CAKE\" into \"ACKE\" are:

  • Transpose \"C\" and \"A\" in \"CAKE\" with \"A\" to get \"ACKE.\"

Therefore,

\\[\\textsf{DamerauLevenshtein}(\\texttt{CAKE}, \\texttt{ACKE}) = 1\\]","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#sample-code_1","title":"Sample code","text":"

You can test out the Damerau-Levenshtein distance as follows:

import duckdb\nduckdb.sql(\"SELECT damerau_levenshtein('CAKE', 'ACKE')\").df().iloc[0,0]\n

1
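A pure-Python sketch of the same calculation (the "optimal string alignment" variant, which extends the Levenshtein recurrence with an adjacent-transposition case; for illustration only):

```python
def damerau_levenshtein(s1, s2):
    # Optimal string alignment: Levenshtein plus adjacent transpositions.
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

damerau_levenshtein("CAKE", "ACKE")  # 1
```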

","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#jaro-similarity","title":"Jaro Similarity","text":"

At a glance

Useful for: Strings where all characters are considered equally important, regardless of order e.g. ID numbers. Splink comparison functions: JaroLevel() and JaroAtThresholds() Returns: A score between 0 and 1 (higher is more similar).

","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#description_2","title":"Description","text":"

Jaro similarity is a measure of similarity between two strings. It takes into account the number and order of matching characters, as well as the number of transpositions needed to make the strings identical.

Jaro similarity considers:

  • The number of matching characters (characters in the same position in both strings).
  • The number of transpositions (pairs of characters that are not in the same position in both strings).

Or, as a formula:

\\[\\textsf{Jaro}(s_1, s_2) = \\frac{1}{3} \\left[ \\frac{m}{|s_1|} + \\frac{m}{|s_2|} + \\frac{m-t}{m} \\right]\\]

where:

  • \\(s_1\\) and \\(s_2\\) are the two strings being compared
  • \\(m\\) is the number of common characters (which are considered matching only if they are the same and not farther than \\(\\left\\lfloor \\frac{\\max(|s_1|,|s_2|)}{2} \\right\\rfloor - 1\\) characters apart)
  • \\(t\\) is the number of transpositions (which is calculated as the number of matching characters that are not in the right order divided by two).
","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#examples_2","title":"Examples","text":"\"MARTHA\" vs \"MARHTA\":
  • There are six matching characters: \"M\", \"A\", \"R\", \"T\", \"H\" and \"A\".
  • There is one transposition: the matched characters \"T\" and \"H\" appear in opposite orders in the two strings.
  • We calculate the Jaro similarity using the formula:
\\[\\textsf{Jaro}(\\texttt{MARTHA}, \\texttt{MARHTA}) = \\frac{1}{3} \\left[ \\frac{6}{6} + \\frac{6}{6} + \\frac{6-1}{6} \\right] = 0.944\\] \"MARTHA\" vs \"AMRTHA\":
  • There are six matching characters: \"M\", \"A\", \"R\", \"T\", \"H\" and \"A\".
  • There is one transposition: the matched characters \"M\" and \"A\" appear in opposite orders in the two strings.
  • We calculate the Jaro similarity using the formula:
\\[\\textsf{Jaro}(\\texttt{MARTHA}, \\texttt{AMRTHA}) = \\frac{1}{3} \\left[ \\frac{6}{6} + \\frac{6}{6} + \\frac{6-1}{6} \\right] = 0.944\\]

Note that a transposition yields the same Jaro similarity regardless of where in the string it occurs.

","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#sample-code_2","title":"Sample code","text":"

You can test out the Jaro similarity as follows:

import duckdb\nduckdb.sql(\"SELECT jaro_similarity('MARTHA', 'MARHTA')\").df().iloc[0,0]\n

0.944
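The same score can be computed with a pure-Python sketch that follows the definition above (matching window, common characters, then transpositions; for illustration only):

```python
def jaro(s1, s2):
    # Jaro similarity, following the standard definition.
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    m = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Transpositions: matched characters out of order, divided by two.
    t, k = 0, 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

round(jaro("MARTHA", "MARHTA"), 3)  # 0.944
```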

","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#jaro-winkler-similarity","title":"Jaro-Winkler Similarity","text":"

At a glance

Useful for: Strings where importance is weighted towards the first 4 characters e.g. names. Splink comparison functions: JaroWinklerLevel() and JaroWinklerAtThresholds() Returns: A score between 0 and 1 (higher is more similar).

","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#description_3","title":"Description","text":"

Jaro-Winkler similarity is a variation of Jaro similarity that gives extra weight to matching prefixes of the strings. It is particularly useful for names.

The Jaro-Winkler similarity is calculated as follows:

\\[\\textsf{JaroWinkler}(s_1, s_2) = \\textsf{Jaro}(s_1, s_2) + p \\cdot l \\cdot (1 - \\textsf{Jaro}(s_1, s_2))\\]

where:

  • \\(\\textsf{Jaro}(s_1, s_2)\\) is the Jaro similarity between the two strings
  • \\(l\\) is the length of the common prefix between the two strings, up to a maximum of four characters
  • \\(p\\) is a prefix scale factor, commonly set to 0.1.

","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#examples_3","title":"Examples","text":"\"MARTHA\" vs \"MARHTA\"

The common prefix between the two strings is \"MAR\", which has a length of 3. We calculate the Jaro-Winkler similarity using the formula:

\\[\\textsf{Jaro-Winkler}(\\texttt{MARTHA}, \\texttt{MARHTA}) = 0.944 + 0.1 \\cdot 3 \\cdot (1 - 0.944) = 0.961\\]

The Jaro-Winkler similarity is slightly higher than the Jaro similarity, due to the matching prefix.

\"MARTHA\" vs \"AMRTHA\":

There is no common prefix, so the Jaro-Winkler similarity formula gives:

\\[\\textsf{Jaro-Winkler}(\\texttt{MARTHA}, \\texttt{AMRTHA}) = 0.944 + 0.1 \\cdot 0 \\cdot (1 - 0.944) = 0.944\\]

Which is the same as the Jaro score.

Note that the Jaro-Winkler similarity should be used with caution, as it may not always provide better results than the standard Jaro similarity, especially when dealing with short strings or strings that have no common prefix.

","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#sample-code_3","title":"Sample code","text":"

You can test out the Jaro-Winkler similarity as follows:

import duckdb\nduckdb.sql(\"SELECT jaro_winkler_similarity('MARTHA', 'MARHTA')\").df().iloc[0,0]\n

0.9611
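A self-contained pure-Python sketch, combining the Jaro calculation with the prefix boost (for illustration only; note that some implementations only apply the boost when the base Jaro score exceeds a threshold of around 0.7, which this sketch omits):

```python
def jaro(s1, s2):
    # Jaro similarity, following the standard definition.
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    m = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    # Boost the Jaro score by the length of the common prefix (max 4 chars).
    j = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return j + l * p * (1 - j)

round(jaro_winkler("MARTHA", "MARHTA"), 4)  # 0.9611
```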

","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#jaccard-similarity","title":"Jaccard Similarity","text":"

At a glance

Useful for: Strings where the overlap of characters matters more than their order. Splink comparison functions: JaccardLevel() and JaccardAtThresholds() Returns: A score between 0 and 1 (higher is more similar).

","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#description_4","title":"Description","text":"

Jaccard similarity is a measure of similarity between two sets of items, based on the size of their intersection (elements in common) and union (total elements across both sets). For strings, it considers the overlap of characters within each string. Mathematically, it can be represented as:

\\[\\textsf{Jaccard}=\\frac{|A \\cap B|}{|A \\cup B|}\\]

where A and B are the sets of elements (e.g. characters or tokens) in each string, the numerator is the number of elements found in both sets, and the denominator is the total number of distinct elements across both sets.

In practice, Jaccard is more useful with strings that can be split up into multiple words as opposed to characters within a single word or string. E.g. tokens within addresses:

Address 1: {\"flat\", \"2\", \"123\", \"high\", \"street\", \"london\", \"sw1\", \"1ab\"}

Address 2: {\"2\", \"high\", \"street\", \"london\", \"sw1a\", \"1ab\"},

where:

  • there are 9 unique tokens across the addresses: \"flat\", \"2\", \"123\", \"high\", \"street\", \"london\", \"sw1\", \"sw1a\", \"1ab\"
  • there are 5 tokens found in both addresses: \"2\", \"high\", \"street\", \"london\", \"1ab\"

We calculate the Jaccard similarity using the formula:

\\[\\textsf{Jaccard}(\\textrm{Address1}, \\textrm{Address2})=\\frac{5}{9}=0.5556\\]

However, this functionality is not currently implemented within Splink.
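Although token-level Jaccard is not built into Splink, it is straightforward to sketch in plain Python for the address example above (a hypothetical helper, not a Splink function):

```python
def token_jaccard(a, b):
    # Jaccard similarity over whitespace-separated tokens.
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

address_1 = "flat 2 123 high street london sw1 1ab"
address_2 = "2 high street london sw1a 1ab"
round(token_jaccard(address_1, address_2), 4)  # 5 shared tokens out of 9 -> 0.5556
```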

","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#examples_4","title":"Examples","text":"\"DUCK\" vs \"LUCK\"
  • There are five unique characters across the strings: \"D\", \"U\", \"C\", \"K\", \"L\"
  • Three are found in both strings: \"U\", \"C\", \"K\"

We calculate the Jaccard similarity using the formula:

\\[\\textsf{Jaccard}(\\texttt{DUCK}, \\texttt{LUCK})=\\frac{3}{5}=0.6\\] \"MARTHA\" vs \"MARHTA\"
  • There are five unique characters across the strings: \"M\", \"A\", \"R\", \"T\", \"H\"
  • Five are found in both strings: \"M\", \"A\", \"R\", \"T\", \"H\"

We calculate the Jaccard similarity using the formula:

\\[\\textsf{Jaccard}(\\texttt{MARTHA}, \\texttt{MARHTA})=\\frac{5}{5}=1\\]","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparators.html#sample-code_4","title":"Sample code","text":"

You can test out the Jaccard similarity between two strings with the function below:

def jaccard_similarity(str1, str2):\n    set1 = set(str1)\n    set2 = set(str2)\n    return len(set1 & set2) / len(set1 | set2)\n\njaccard_similarity(\"DUCK\", \"LUCK\")\n

0.6

","tags":["API","comparisons","Levenshtein","Damerau-Levenshtein","Jaro","Jaro-Winkler","Jaccard"]},{"location":"topic_guides/comparisons/comparisons_and_comparison_levels.html","title":"Comparisons and comparison levels","text":""},{"location":"topic_guides/comparisons/comparisons_and_comparison_levels.html#comparison-and-comparisonlevels","title":"Comparison and ComparisonLevels","text":""},{"location":"topic_guides/comparisons/comparisons_and_comparison_levels.html#comparing-information","title":"Comparing information","text":"

To find matching records, Splink creates pairwise record comparisons from the input records, and scores these comparisons.

Suppose for instance your data contains first_name, surname and dob:

| id | first_name | surname | dob |
|---|---|---|---|
| 1 | john | smith | 1991-04-11 |
| 2 | jon | smith | 1991-04-17 |
| 3 | john | smyth | 1991-04-11 |

To compare these records, at the blocking stage, Splink will set these records against each other in a table of pairwise record comparisons:

| id_l | id_r | first_name_l | first_name_r | surname_l | surname_r | dob_l | dob_r |
|---|---|---|---|---|---|---|---|
| 1 | 2 | john | jon | smith | smith | 1991-04-11 | 1991-04-17 |
| 1 | 3 | john | john | smith | smyth | 1991-04-11 | 1991-04-11 |
| 2 | 3 | jon | john | smith | smyth | 1991-04-17 | 1991-04-11 |

When defining comparisons, we are defining rules that operate on each row of this latter table of pairwise comparisons.

"},{"location":"topic_guides/comparisons/comparisons_and_comparison_levels.html#defining-similarity","title":"Defining similarity","text":"

How should we assess similarity between the records?

In Splink, we will use different measures of similarity for different columns in the data, and then combine these measures to get an overall similarity score. But the most appropriate definition of similarity will differ between columns.

For example, two surnames that differ by a single character would usually be considered to be similar. But a one character difference in a 'gender' field encoded as M or F is not similar at all!

To allow for this, Splink uses the concepts of Comparisons and ComparisonLevels. Each Comparison usually measures the similarity of a single column in the data, and each Comparison is made up of one or more ComparisonLevels.

Within each Comparison are n discrete ComparisonLevels. Each ComparisonLevel defines a discrete gradation (category) of similarity within a Comparison. There can be as many ComparisonLevels as you want. For example:

Data Linking Model\n\u251c\u2500-- Comparison: Gender\n\u2502    \u251c\u2500-- ComparisonLevel: Exact match\n\u2502    \u251c\u2500-- ComparisonLevel: All other\n\u251c\u2500-- Comparison: First name\n\u2502    \u251c\u2500-- ComparisonLevel: Exact match on first_name\n\u2502    \u251c\u2500-- ComparisonLevel: first_names have JaroWinklerSimilarity > 0.95\n\u2502    \u251c\u2500-- ComparisonLevel: All other\n

The categories are discrete rather than continuous for performance reasons - so for instance, a ComparisonLevel may be defined as Jaro-Winkler similarity > 0.95, as opposed to using the Jaro-Winkler score as a continuous measure directly.

It is up to the user to decide how best to define similarity for the different columns (fields) in their data, and this is a key part of modelling a record linkage problem.

A much more detailed explanation of how this works can be found in this series of interactive tutorials - refer in particular to computing the Fellegi Sunter model.

"},{"location":"topic_guides/comparisons/comparisons_and_comparison_levels.html#an-example","title":"An example:","text":"

The concepts of Comparisons and ComparisonLevels are best explained using an example.

Consider the following simple data linkage model with only two columns (in a real example there would usually be more):

Data Linking Model\n\u251c\u2500-- Comparison: Date of birth\n\u2502    \u251c\u2500-- ComparisonLevel: Exact match\n\u2502    \u251c\u2500-- ComparisonLevel: One character difference\n\u2502    \u251c\u2500-- ComparisonLevel: All other\n\u251c\u2500-- Comparison: First name\n\u2502    \u251c\u2500-- ComparisonLevel: Exact match on first_name\n\u2502    \u251c\u2500-- ComparisonLevel: first_names have JaroWinklerSimilarity > 0.95\n\u2502    \u251c\u2500-- ComparisonLevel: first_names have JaroWinklerSimilarity > 0.8\n\u2502    \u251c\u2500-- ComparisonLevel: All other\n

In this model we have two Comparisons: one for date of birth and one for first name:

For date of birth, we have chosen three discrete ComparisonLevels to measure similarity. Either the dates of birth are an exact match, they differ by one character, or they are different in some other way.

For first name, we have chosen four discrete ComparisonLevels to measure similarity. Either the first names are an exact match, they have a JaroWinkler similarity of greater than 0.95, they have a JaroWinkler similarity of greater than 0.8, or they are different in some other way.

Note that these definitions are mutually exclusive, because they're implemented by Splink like an if statement. For example, for first name, the Comparison is equivalent to the following pseudocode:

if first_name_l == first_name_r:\n    return \"Assign to category: Exact match\"\nelif JaroWinklerSimilarity(first_name_l, first_name_r) > 0.95:\n    return \"Assign to category: JaroWinklerSimilarity > 0.95\"\nelif JaroWinklerSimilarity(first_name_l, first_name_r) > 0.8:\n    return \"Assign to category: JaroWinklerSimilarity > 0.8\"\nelse:\n    return \"Assign to category: All other\"\n
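A runnable version of this cascade can be sketched with the standard library, using difflib's SequenceMatcher ratio purely as a stand-in for Jaro-Winkler (which is not in the stdlib); the function name and thresholds here are illustrative, not Splink API:

```python
from difflib import SequenceMatcher

def first_name_category(first_name_l, first_name_r):
    # SequenceMatcher.ratio() stands in for Jaro-Winkler similarity here.
    similarity = SequenceMatcher(None, first_name_l, first_name_r).ratio()
    if first_name_l == first_name_r:
        return "Exact match"
    elif similarity > 0.95:
        return "Similarity > 0.95"
    elif similarity > 0.8:
        return "Similarity > 0.8"
    else:
        return "All other"

first_name_category("john", "john")  # "Exact match"
first_name_category("john", "jon")   # "Similarity > 0.8"
```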

In the next section, we will see how to define these Comparisons and ComparisonLevels in Splink.

"},{"location":"topic_guides/comparisons/customising_comparisons.html","title":"Defining and customising comparisons","text":""},{"location":"topic_guides/comparisons/customising_comparisons.html#defining-and-customising-how-record-comparisons-are-made","title":"Defining and customising how record comparisons are made","text":"

A key feature of Splink is the ability to customise how record comparisons are made - that is, how similarity is defined for different data types. For example, the definition of similarity that is appropriate for a date of birth field is different from that for a first name field.

By tailoring the definitions of similarity, linking models are more effectively able to distinguish between different gradations of similarity, leading to more accurate data linking models.

"},{"location":"topic_guides/comparisons/customising_comparisons.html#comparisons-and-comparisonlevels","title":"Comparisons and ComparisonLevels","text":"

Recall that a Splink model contains a collection of Comparisons and ComparisonLevels organised in a hierarchy.

Each ComparisonLevel defines the different gradations of similarity that make up a Comparison.

An example is as follows:

Data Linking Model\n\u251c\u2500-- Comparison: Date of birth\n\u2502    \u251c\u2500-- ComparisonLevel: Exact match\n\u2502    \u251c\u2500-- ComparisonLevel: Up to one character difference\n\u2502    \u251c\u2500-- ComparisonLevel: Up to three character difference\n\u2502    \u251c\u2500-- ComparisonLevel: All other\n\u251c\u2500-- Comparison: Name\n\u2502    \u251c\u2500-- ComparisonLevel: Exact match on first name and surname\n\u2502    \u251c\u2500-- ComparisonLevel: Exact match on first name\n\u2502    \u251c\u2500-- etc.\n
"},{"location":"topic_guides/comparisons/customising_comparisons.html#three-ways-of-specifying-comparisons","title":"Three ways of specifying Comparisons","text":"

In Splink, there are three ways of specifying Comparisons:

  • Using 'out-of-the-box' Comparisons (most simple/succinct)
  • Composing pre-defined ComparisonLevels
  • Writing a full dictionary spec of a Comparison by hand (most verbose/flexible)
"},{"location":"topic_guides/comparisons/customising_comparisons.html#method-1-using-the-comparisonlibrary","title":"Method 1: Using the ComparisonLibrary","text":"

The ComparisonLibrary contains pre-baked similarity functions that cover many common use cases.

These functions generate an entire Comparison, composed of several ComparisonLevels.

You can find a listing of all available Comparisons in the API documentation here

The following provides an example of using the ExactMatch Comparison, and producing the description (with associated SQL) for the duckdb backend:

import splink.comparison_library as cl\n\nfirst_name_comparison = cl.ExactMatch(\"first_name\")\nprint(first_name_comparison.get_comparison(\"duckdb\").human_readable_description)\n
Comparison 'ExactMatch' of \"first_name\".\nSimilarity is assessed using the following ComparisonLevels:\n    - 'first_name is NULL' with SQL rule: \"first_name_l\" IS NULL OR \"first_name_r\" IS NULL\n    - 'Exact match on first_name' with SQL rule: \"first_name_l\" = \"first_name_r\"\n    - 'All other comparisons' with SQL rule: ELSE\n

Note that, under the hood, these functions generate a Python dictionary, which conforms to the underlying .json specification of a model:

first_name_comparison.get_comparison(\"duckdb\").as_dict()\n
{'output_column_name': 'first_name',\n 'comparison_levels': [{'sql_condition': '\"first_name_l\" IS NULL OR \"first_name_r\" IS NULL',\n   'label_for_charts': 'first_name is NULL',\n   'is_null_level': True},\n  {'sql_condition': '\"first_name_l\" = \"first_name_r\"',\n   'label_for_charts': 'Exact match on first_name'},\n  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],\n 'comparison_description': 'ExactMatch'}\n

We can now generate a second, more complex comparison using one of our data-specific comparisons, the PostcodeComparison:

pc_comparison = cl.PostcodeComparison(\"postcode\")\nprint(pc_comparison.get_comparison(\"duckdb\").human_readable_description)\n
Comparison 'PostcodeComparison' of \"postcode\".\nSimilarity is assessed using the following ComparisonLevels:\n    - 'postcode is NULL' with SQL rule: \"postcode_l\" IS NULL OR \"postcode_r\" IS NULL\n    - 'Exact match on full postcode' with SQL rule: \"postcode_l\" = \"postcode_r\"\n    - 'Exact match on sector' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')\n    - 'Exact match on district' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '')\n    - 'Exact match on area' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Za-z]{1,2}', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Za-z]{1,2}', 0), '')\n    - 'All other comparisons' with SQL rule: ELSE\n

For a deep dive on out-of-the-box comparisons, see the dedicated topic guide.

Comparisons can be further configured using the .configure() method - full API docs here.

"},{"location":"topic_guides/comparisons/customising_comparisons.html#method-2-comparisonlevels","title":"Method 2: ComparisonLevels","text":"

ComparisonLevels provide a lower-level API that allows you to compose your own comparisons.

For example, the user may wish to specify a comparison that has levels for a match on soundex and jaro_winkler of the first_name field.

The example below assumes the user has derived a column soundex_first_name containing the soundex of the first name.

from splink.comparison_library import CustomComparison\nimport splink.comparison_level_library as cll\n\ncustom_name_comparison = CustomComparison(\n    output_column_name=\"first_name\",\n    comparison_levels=[\n        cll.NullLevel(\"first_name\"),\n        cll.ExactMatchLevel(\"first_name\").configure(tf_adjustment_column=\"first_name\"),\n        cll.ExactMatchLevel(\"soundex_first_name\").configure(\n            tf_adjustment_column=\"soundex_first_name\"\n        ),\n        cll.ElseLevel(),\n    ],\n)\n\nprint(custom_name_comparison.get_comparison(\"duckdb\").human_readable_description)\n
Comparison 'CustomComparison' of \"first_name\" and \"soundex_first_name\".\nSimilarity is assessed using the following ComparisonLevels:\n    - 'first_name is NULL' with SQL rule: \"first_name_l\" IS NULL OR \"first_name_r\" IS NULL\n    - 'Exact match on first_name' with SQL rule: \"first_name_l\" = \"first_name_r\"\n    - 'Exact match on soundex_first_name' with SQL rule: \"soundex_first_name_l\" = \"soundex_first_name_r\"\n    - 'All other comparisons' with SQL rule: ELSE\n

This can now be specified in the settings dictionary as follows:

from splink import SettingsCreator, block_on\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"surname\"),\n    ],\n    comparisons=[\n        custom_name_comparison,\n        cl.LevenshteinAtThresholds(\"dob\", [1, 2]),\n    ],\n)\n

To inspect the custom comparison as a dictionary, you can call custom_name_comparison.get_comparison(\"duckdb\").as_dict()

Note that ComparisonLevels can be further configured using the .configure() method - full API documentation here

"},{"location":"topic_guides/comparisons/customising_comparisons.html#method-3-providing-the-spec-as-a-dictionary","title":"Method 3: Providing the spec as a dictionary","text":"

Behind the scenes in Splink, all Comparisons are eventually turned into a dictionary which conforms to the formal jsonschema specification of the settings dictionary, which can be found here.

The library functions described above are convenience functions that provide a shorthand way to produce valid dictionaries.

For maximum control over your settings, you can specify your comparisons as a dictionary.

comparison_first_name = {\n    \"output_column_name\": \"first_name\",\n    \"comparison_levels\": [\n        {\n            \"sql_condition\": \"first_name_l IS NULL OR first_name_r IS NULL\",\n            \"label_for_charts\": \"Null\",\n            \"is_null_level\": True,\n        },\n        {\n            \"sql_condition\": \"first_name_l = first_name_r\",\n            \"label_for_charts\": \"Exact match\",\n            \"tf_adjustment_column\": \"first_name\",\n            \"tf_adjustment_weight\": 1.0,\n            \"tf_minimum_u_value\": 0.001,\n        },\n        {\n            \"sql_condition\": \"dmeta_first_name_l = dmeta_first_name_r\",\n            \"label_for_charts\": \"Exact match\",\n            \"tf_adjustment_column\": \"dmeta_first_name\",\n            \"tf_adjustment_weight\": 1.0,\n        },\n        {\n            \"sql_condition\": \"jaro_winkler_sim(first_name_l, first_name_r) > 0.8\",\n            \"label_for_charts\": \"Exact match\",\n            \"tf_adjustment_column\": \"first_name\",\n            \"tf_adjustment_weight\": 0.5,\n            \"tf_minimum_u_value\": 0.001,\n        },\n        {\"sql_condition\": \"ELSE\", \"label_for_charts\": \"All other comparisons\"},\n    ],\n}\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"surname\"),\n    ],\n    comparisons=[\n        comparison_first_name,\n        cl.LevenshteinAtThresholds(\"dob\", [1, 2]),\n    ],\n)\n
"},{"location":"topic_guides/comparisons/customising_comparisons.html#examples","title":"Examples","text":"

Below are some examples of how you can define the same comparison, but through different methods.

"},{"location":"topic_guides/comparisons/customising_comparisons.html#exact-match-comparison-with-term-frequency-adjustments","title":"Exact match Comparison with Term-Frequency Adjustments","text":"Comparison LibraryComparison Level LibrarySettings Dictionary
import splink.comparison_library as cl\n\nfirst_name_comparison = cl.ExactMatch(\"first_name\").configure(\n    term_frequency_adjustments=True\n)\n
import splink.comparison_library as cl\nimport splink.comparison_level_library as cll\n\nfirst_name_comparison = cl.CustomComparison(\n    output_column_name=\"first_name\",\n    comparison_description=\"Exact match vs. anything else\",\n    comparison_levels=[\n        cll.NullLevel(\"first_name\"),\n        cll.ExactMatchLevel(\"first_name\").configure(tf_adjustment_column=\"first_name\"),\n        cll.ElseLevel(),\n    ],\n)\n
first_name_comparison = {\n    'output_column_name': 'first_name',\n    'comparison_levels': [\n        {\n            'sql_condition': '\"first_name_l\" IS NULL OR \"first_name_r\" IS NULL',\n            'label_for_charts': 'Null',\n            'is_null_level': True\n        },\n        {\n            'sql_condition': '\"first_name_l\" = \"first_name_r\"',\n            'label_for_charts': 'Exact match',\n            'tf_adjustment_column': 'first_name',\n            'tf_adjustment_weight': 1.0\n        },\n        {\n            'sql_condition': 'ELSE', \n            'label_for_charts': 'All other comparisons'\n        }],\n    'comparison_description': 'Exact match vs. anything else'\n}\n

Each of which gives

{\n    'output_column_name': 'first_name',\n    'comparison_levels': [\n        {\n            'sql_condition': '\"first_name_l\" IS NULL OR \"first_name_r\" IS NULL',\n            'label_for_charts': 'Null',\n            'is_null_level': True\n        },\n        {\n            'sql_condition': '\"first_name_l\" = \"first_name_r\"',\n            'label_for_charts': 'Exact match',\n            'tf_adjustment_column': 'first_name',\n            'tf_adjustment_weight': 1.0\n        },\n        {\n            'sql_condition': 'ELSE', \n            'label_for_charts': 'All other comparisons'\n        }],\n    'comparison_description': 'Exact match vs. anything else'\n}\n
in your settings dictionary."},{"location":"topic_guides/comparisons/customising_comparisons.html#levenshtein-comparison","title":"Levenshtein Comparison","text":"Comparison LibraryComparison Level LibrarySettings Dictionary
import splink.comparison_library as cl\n\nemail_comparison = cl.LevenshteinAtThresholds(\"email\", [2, 4])\n
import splink.comparison_library as cl\nimport splink.comparison_level_library as cll\n\nemail_comparison = cl.CustomComparison(\n    output_column_name=\"email\",\n    comparison_description=\"Exact match vs. Email within levenshtein thresholds 2, 4 vs. anything else\",\n    comparison_levels=[\n        cll.NullLevel(\"email\"),\n        cll.LevenshteinLevel(\"email\", distance_threshold=2),\n        cll.LevenshteinLevel(\"email\", distance_threshold=4),\n        cll.ElseLevel(),\n    ],\n)\n
email_comparison = {\n    'output_column_name': 'email',\n    'comparison_levels': [{'sql_condition': '\"email_l\" IS NULL OR \"email_r\" IS NULL',\n    'label_for_charts': 'Null',\n    'is_null_level': True},\n    {\n        'sql_condition': '\"email_l\" = \"email_r\"',\n        'label_for_charts': 'Exact match'\n    },\n    {\n        'sql_condition': 'levenshtein(\"email_l\", \"email_r\") <= 2',\n        'label_for_charts': 'Levenshtein <= 2'\n    },\n    {\n        'sql_condition': 'levenshtein(\"email_l\", \"email_r\") <= 4',\n        'label_for_charts': 'Levenshtein <= 4'\n    },\n    {\n        'sql_condition': 'ELSE', \n        'label_for_charts': 'All other comparisons'\n    }],\n    'comparison_description': 'Exact match vs. Email within levenshtein thresholds 2, 4 vs. anything else'}\n

Each of which gives

{\n    'output_column_name': 'email',\n    'comparison_levels': [\n        {\n            'sql_condition': '\"email_l\" IS NULL OR \"email_r\" IS NULL',\n            'label_for_charts': 'Null',\n            'is_null_level': True},\n        {\n            'sql_condition': '\"email_l\" = \"email_r\"',\n            'label_for_charts': 'Exact match'\n        },\n        {\n            'sql_condition': 'levenshtein(\"email_l\", \"email_r\") <= 2',\n            'label_for_charts': 'Levenshtein <= 2'\n        },\n        {\n            'sql_condition': 'levenshtein(\"email_l\", \"email_r\") <= 4',\n            'label_for_charts': 'Levenshtein <= 4'\n        },\n        {\n            'sql_condition': 'ELSE', \n            'label_for_charts': 'All other comparisons'\n        }],\n    'comparison_description': 'Exact match vs. Email within levenshtein thresholds 2, 4 vs. anything else'\n}\n

in your settings dictionary.

"},{"location":"topic_guides/comparisons/out_of_the_box_comparisons.html","title":"Out-of-the-box comparisons","text":""},{"location":"topic_guides/comparisons/out_of_the_box_comparisons.html#out-of-the-box-comparisons-for-specific-data-types","title":"Out-of-the-box Comparisons for specific data types","text":"

Splink has pre-defined Comparisons available for a variety of data types.

"},{"location":"topic_guides/comparisons/out_of_the_box_comparisons.html#dateofbirthcomparison","title":"DateOfBirthComparison","text":"

You can find full API docs for DateOfBirthComparison here

import splink.comparison_library as cl\n\ndate_of_birth_comparison = cl.DateOfBirthComparison(\n    \"date_of_birth\",\n    input_is_string=True,\n)\n

You can view the structure of the comparison as follows:

print(date_of_birth_comparison.get_comparison(\"duckdb\").human_readable_description)\n
Comparison 'DateOfBirthComparison' of \"date_of_birth\".\nSimilarity is assessed using the following ComparisonLevels:\n    - 'transformed date_of_birth is NULL' with SQL rule: try_strptime(\"date_of_birth_l\", '%Y-%m-%d') IS NULL OR try_strptime(\"date_of_birth_r\", '%Y-%m-%d') IS NULL\n    - 'Exact match on date of birth' with SQL rule: \"date_of_birth_l\" = \"date_of_birth_r\"\n    - 'DamerauLevenshtein distance <= 1' with SQL rule: damerau_levenshtein(\"date_of_birth_l\", \"date_of_birth_r\") <= 1\n    - 'Abs date difference <= 1 month' with SQL rule: ABS(EPOCH(try_strptime(\"date_of_birth_l\", '%Y-%m-%d')) - EPOCH(try_strptime(\"date_of_birth_r\", '%Y-%m-%d'))) <= 2629800.0\n    - 'Abs date difference <= 1 year' with SQL rule: ABS(EPOCH(try_strptime(\"date_of_birth_l\", '%Y-%m-%d')) - EPOCH(try_strptime(\"date_of_birth_r\", '%Y-%m-%d'))) <= 31557600.0\n    - 'Abs date difference <= 10 year' with SQL rule: ABS(EPOCH(try_strptime(\"date_of_birth_l\", '%Y-%m-%d')) - EPOCH(try_strptime(\"date_of_birth_r\", '%Y-%m-%d'))) <= 315576000.0\n    - 'All other comparisons' with SQL rule: ELSE\n

To see this as a specifications dictionary you can use:

date_of_birth_comparison.get_comparison(\"duckdb\").as_dict()\n

which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide, if desired.

"},{"location":"topic_guides/comparisons/out_of_the_box_comparisons.html#name-comparison","title":"Name Comparison","text":"

A Name comparison is intended for use on an individual name column (e.g. forename, surname).

You can find full API docs for NameComparison here

import splink.comparison_library as cl\n\nfirst_name_comparison = cl.NameComparison(\"first_name\")\n
print(first_name_comparison.get_comparison(\"duckdb\").human_readable_description)\n
Comparison 'NameComparison' of \"first_name\".\nSimilarity is assessed using the following ComparisonLevels:\n    - 'first_name is NULL' with SQL rule: \"first_name_l\" IS NULL OR \"first_name_r\" IS NULL\n    - 'Exact match on first_name' with SQL rule: \"first_name_l\" = \"first_name_r\"\n    - 'Jaro-Winkler distance of first_name >= 0.92' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.92\n    - 'Jaro-Winkler distance of first_name >= 0.88' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.88\n    - 'Jaro-Winkler distance of first_name >= 0.7' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.7\n    - 'All other comparisons' with SQL rule: ELSE\n

The NameComparison also allows flexibility to change the parameters and/or fuzzy matching comparison levels.

For example:

surname_comparison = cl.NameComparison(\n    \"surname\",\n    jaro_winkler_thresholds=[0.95, 0.9],\n    dmeta_col_name=\"surname_dmeta\",\n)\nprint(surname_comparison.get_comparison(\"duckdb\").human_readable_description)\n
Comparison 'NameComparison' of \"surname\" and \"surname_dmeta\".\nSimilarity is assessed using the following ComparisonLevels:\n    - 'surname is NULL' with SQL rule: \"surname_l\" IS NULL OR \"surname_r\" IS NULL\n    - 'Exact match on surname' with SQL rule: \"surname_l\" = \"surname_r\"\n    - 'Jaro-Winkler distance of surname >= 0.95' with SQL rule: jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.95\n    - 'Jaro-Winkler distance of surname >= 0.9' with SQL rule: jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.9\n    - 'Array intersection size >= 1' with SQL rule: array_length(list_intersect(\"surname_dmeta_l\", \"surname_dmeta_r\")) >= 1\n    - 'All other comparisons' with SQL rule: ELSE\n

Here surname_dmeta refers to a column derived by applying the DoubleMetaphone algorithm to surname to give a phonetic spelling. This helps to catch names which sound the same but have different spellings (e.g. Stephens vs Stevens). For more on Phonetic Transformations, see the topic guide.

To see this as a specifications dictionary you can call

surname_comparison.get_comparison(\"duckdb\").as_dict()\n

which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide, if desired.

"},{"location":"topic_guides/comparisons/out_of_the_box_comparisons.html#forename-and-surname-comparison","title":"Forename and Surname Comparison","text":"

It can be helpful to construct a single comparison for comparing the forename and surname because:

  1. The Fellegi-Sunter model assumes that columns are independent. We know that forename and surname are usually correlated, given the regional variation of names etc., so considering them in a single comparison can help to create better models.

    As a result, the term frequencies of forename and surname individually do not necessarily reflect how common the combination of forename and surname is. For more information on term frequencies, see the dedicated topic guide. Combining forename and surname in a single comparison allows the model to consider the joint term frequency as well as the individual ones.

  2. It is common for some records to have swapped forename and surname by mistake. Addressing forename and surname in a single comparison allows the model to consider these name inversions.
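To illustrate point 1, here is a small stdlib-only sketch, with made-up records, showing how the joint term frequency of a forename-surname combination can differ from what the individual frequencies would predict under independence:

```python
from collections import Counter

# Hypothetical records: the *combination* "Mohammed Khan" is common,
# which separate forename and surname frequency tables cannot capture.
records = [
    ("Mohammed", "Khan"),
    ("Mohammed", "Khan"),
    ("John", "Khan"),
    ("John", "Smith"),
]
n = len(records)

forename_tf = Counter(f for f, _ in records)
surname_tf = Counter(s for _, s in records)
joint_tf = Counter(records)

# Independence would predict 0.5 * 0.75 = 0.375, but the observed
# joint frequency is 0.5: the combination is over-represented.
print(forename_tf["Mohammed"] / n * surname_tf["Khan"] / n)  # 0.375
print(joint_tf[("Mohammed", "Khan")] / n)  # 0.5
```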

The ForenameSurnameComparison has been designed to accommodate this.

You can find full API docs for ForenameSurnameComparison here

import splink.comparison_library as cl\n\nfull_name_comparison = cl.ForenameSurnameComparison(\"forename\", \"surname\")\n
print(full_name_comparison.get_comparison(\"duckdb\").human_readable_description)\n
Comparison 'ForenameSurnameComparison' of \"forename\" and \"surname\".\nSimilarity is assessed using the following ComparisonLevels:\n    - '(forename is NULL) AND (surname is NULL)' with SQL rule: (\"forename_l\" IS NULL OR \"forename_r\" IS NULL) AND (\"surname_l\" IS NULL OR \"surname_r\" IS NULL)\n    - '(Exact match on forename) AND (Exact match on surname)' with SQL rule: (\"forename_l\" = \"forename_r\") AND (\"surname_l\" = \"surname_r\")\n    - 'Match on reversed cols: forename and surname' with SQL rule: \"forename_l\" = \"surname_r\" AND \"forename_r\" = \"surname_l\"\n    - '(Jaro-Winkler distance of forename >= 0.92) AND (Jaro-Winkler distance of surname >= 0.92)' with SQL rule: (jaro_winkler_similarity(\"forename_l\", \"forename_r\") >= 0.92) AND (jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.92)\n    - '(Jaro-Winkler distance of forename >= 0.88) AND (Jaro-Winkler distance of surname >= 0.88)' with SQL rule: (jaro_winkler_similarity(\"forename_l\", \"forename_r\") >= 0.88) AND (jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.88)\n    - 'Exact match on surname' with SQL rule: \"surname_l\" = \"surname_r\"\n    - 'Exact match on forename' with SQL rule: \"forename_l\" = \"forename_r\"\n    - 'All other comparisons' with SQL rule: ELSE\n

As noted in the feature engineering guide, to take advantage of term frequency adjustments on full name, you need to derive a full name column prior to importing data into Splink. You then provide the column name using the forename_surname_concat_col_name argument:

full_name_comparison = cl.ForenameSurnameComparison(\"forename\", \"surname\", forename_surname_concat_col_name=\"first_and_last_name\")\n
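The concatenated column itself must be derived before the data reaches Splink. For example, with pandas (the data here is hypothetical, using the column names above):

```python
import pandas as pd

# Hypothetical input data with forename and surname columns.
df = pd.DataFrame(
    {"forename": ["John", "Jane"], "surname": ["Smith", "Doe"]}
)

# Derive the concatenated column, then pass its name to Splink via
# the forename_surname_concat_col_name argument.
df["first_and_last_name"] = df["forename"] + " " + df["surname"]
print(df["first_and_last_name"].tolist())  # ['John Smith', 'Jane Doe']
```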

To see this as a specifications dictionary you can call

full_name_comparison.get_comparison(\"duckdb\").as_dict()\n

which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide, if desired.

"},{"location":"topic_guides/comparisons/out_of_the_box_comparisons.html#postcode-comparisons","title":"Postcode Comparisons","text":"

See Feature Engineering for more details.

import splink.comparison_library as cl\n\npc_comparison = cl.PostcodeComparison(\"postcode\")\n
print(pc_comparison.get_comparison(\"duckdb\").human_readable_description)\n
Comparison 'PostcodeComparison' of \"postcode\".\nSimilarity is assessed using the following ComparisonLevels:\n    - 'postcode is NULL' with SQL rule: \"postcode_l\" IS NULL OR \"postcode_r\" IS NULL\n    - 'Exact match on full postcode' with SQL rule: \"postcode_l\" = \"postcode_r\"\n    - 'Exact match on sector' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')\n    - 'Exact match on district' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '')\n    - 'Exact match on area' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Za-z]{1,2}', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Za-z]{1,2}', 0), '')\n    - 'All other comparisons' with SQL rule: ELSE\n
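The sector, district and area levels rely on regular expressions over the UK postcode structure. A quick Python mirror of those regexp_extract patterns, applied to an illustrative postcode:

```python
import re

# The same patterns as in the SQL above, applied to a sample postcode.
pc = "SW1A 1AA"
sector = re.match(r"^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]", pc).group(0)
district = re.match(r"^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?", pc).group(0)
area = re.match(r"^[A-Za-z]{1,2}", pc).group(0)
print(sector, district, area)  # SW1A 1 SW1A SW
```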

If you have derived latitude and longitude columns, you can model geographical distances.

pc_comparison = cl.PostcodeComparison(\"postcode\", lat_col=\"lat\", long_col=\"long\", km_thresholds=[1,10,50])\nprint(pc_comparison.get_comparison(\"duckdb\").human_readable_description)\n
Comparison 'PostcodeComparison' of \"postcode\", \"long\" and \"lat\".\nSimilarity is assessed using the following ComparisonLevels:\n    - 'postcode is NULL' with SQL rule: \"postcode_l\" IS NULL OR \"postcode_r\" IS NULL\n    - 'Exact match on postcode' with SQL rule: \"postcode_l\" = \"postcode_r\"\n    - 'Exact match on transformed postcode' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')\n    - 'Distance in km <= 1' with SQL rule:  cast( acos( case when ( sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) + cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") ) * cos( radians(\"long_r\" - \"long_l\") ) ) > 1 then 1 when ( sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) + cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") ) * cos( radians(\"long_r\" - \"long_l\") ) ) < -1 then -1 else ( sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) + cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") ) * cos( radians(\"long_r\" - \"long_l\") ) ) end ) * 6371 as float ) <= 1\n    - 'Distance in km <= 10' with SQL rule:  cast( acos( case when ( sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) + cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") ) * cos( radians(\"long_r\" - \"long_l\") ) ) > 1 then 1 when ( sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) + cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") ) * cos( radians(\"long_r\" - \"long_l\") ) ) < -1 then -1 else ( sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) + cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") ) * cos( radians(\"long_r\" - \"long_l\") ) ) end ) * 6371 as float ) <= 10\n    - 'Distance in km <= 50' with SQL rule:  cast( acos( case when ( sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) + cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") ) * cos( radians(\"long_r\" - \"long_l\") ) ) > 1 then 1 when ( sin( 
radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) + cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") ) * cos( radians(\"long_r\" - \"long_l\") ) ) < -1 then -1 else ( sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) + cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") ) * cos( radians(\"long_r\" - \"long_l\") ) ) end ) * 6371 as float ) <= 50\n    - 'All other comparisons' with SQL rule: ELSE\n
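The km_thresholds levels use a spherical law of cosines distance, as the generated SQL shows. A minimal Python equivalent (the coordinates below are illustrative, roughly London and Birmingham):

```python
from math import acos, cos, radians, sin

def arc_distance_km(lat_l, long_l, lat_r, long_r):
    # Spherical law of cosines, mirroring the generated SQL;
    # 6371 is the Earth radius in km, and the clamp guards acos's domain.
    x = (sin(radians(lat_l)) * sin(radians(lat_r))
         + cos(radians(lat_l)) * cos(radians(lat_r))
         * cos(radians(long_r - long_l)))
    return acos(max(-1.0, min(1.0, x))) * 6371

# Roughly London to Birmingham: about 160 km as the crow flies,
# so it would fall outside a km_threshold of 50.
d = arc_distance_km(51.5074, -0.1278, 52.4862, -1.8904)
print(round(d))
```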

To see this as a specifications dictionary you can call

pc_comparison.get_comparison(\"duckdb\").as_dict()\n

which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide, if desired.

"},{"location":"topic_guides/comparisons/out_of_the_box_comparisons.html#email-comparison","title":"Email Comparison","text":"

You can find full API docs for EmailComparison here

import splink.comparison_library as cl\n\nemail_comparison = cl.EmailComparison(\"email\")\n
print(email_comparison.get_comparison(\"duckdb\").human_readable_description)\n
Comparison 'EmailComparison' of \"email\".\nSimilarity is assessed using the following ComparisonLevels:\n    - 'email is NULL' with SQL rule: \"email_l\" IS NULL OR \"email_r\" IS NULL\n    - 'Exact match on email' with SQL rule: \"email_l\" = \"email_r\"\n    - 'Exact match on username' with SQL rule: NULLIF(regexp_extract(\"email_l\", '^[^@]+', 0), '') = NULLIF(regexp_extract(\"email_r\", '^[^@]+', 0), '')\n    - 'Jaro-Winkler distance of email >= 0.88' with SQL rule: jaro_winkler_similarity(\"email_l\", \"email_r\") >= 0.88\n    - 'Jaro-Winkler >0.88 on username' with SQL rule: jaro_winkler_similarity(NULLIF(regexp_extract(\"email_l\", '^[^@]+', 0), ''), NULLIF(regexp_extract(\"email_r\", '^[^@]+', 0), '')) >= 0.88\n    - 'All other comparisons' with SQL rule: ELSE\n
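The username levels compare everything before the @, as the regexp_extract('^[^@]+') in the SQL shows. A quick Python mirror (the helper name here is ours, not part of Splink):

```python
import re

# Hypothetical helper mirroring the SQL's regexp_extract(email, '^[^@]+'):
# everything before the first '@' is treated as the username.
def username(email):
    m = re.match(r"^[^@]+", email)
    return m.group(0) if m else None

print(username("jane.doe@example.com"))  # jane.doe
# The same person on two providers still matches at the username level:
print(username("jane.doe@example.com") == username("jane.doe@gmail.com"))  # True
```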

To see this as a specifications dictionary you can call

email_comparison.get_comparison(\"duckdb\").as_dict()\n

which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide, if desired.

"},{"location":"topic_guides/comparisons/phonetic.html","title":"Phonetic algorithms","text":"","tags":["API","Phonetic Transformations","Comparisons","Blocking","Soundex","Metaphone","Double Metaphone"]},{"location":"topic_guides/comparisons/phonetic.html#phonetic-transformation-algorithms","title":"Phonetic transformation algorithms","text":"

Phonetic transformation algorithms can be used to identify words that sound similar, even if they are spelled differently (e.g. \"Stephen\" vs \"Steven\"). These algorithms give another type of fuzzy match, and the derived columns are often generated in the Feature Engineering step of record linkage.

Once generated, phonetic matches can be used within comparisons & comparison levels and blocking rules.

import splink.comparison_library as cl\n\nfirst_name_comparison = cl.NameComparison(\n    \"first_name\",\n    dmeta_col_name=\"first_name_dm\",\n)\nprint(first_name_comparison.get_comparison(\"duckdb\").human_readable_description)\n
Comparison 'NameComparison' of \"first_name\" and \"first_name_dm\".\nSimilarity is assessed using the following ComparisonLevels:\n\n    - 'first_name is NULL' with SQL rule: \"first_name_l\" IS NULL OR \"first_name_r\" IS NULL\n    - 'Exact match on first_name' with SQL rule: \"first_name_l\" = \"first_name_r\"\n    - 'Jaro-Winkler distance of first_name >= 0.92' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.92\n    - 'Jaro-Winkler distance of first_name >= 0.88' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.88\n    - 'Array intersection size >= 1' with SQL rule: array_length(list_intersect(\"first_name_dm_l\", \"first_name_dm_r\")) >= 1\n    - 'Jaro-Winkler distance of first_name >= 0.7' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.7\n    - 'All other comparisons' with SQL rule: ELSE\n
","tags":["API","Phonetic Transformations","Comparisons","Blocking","Soundex","Metaphone","Double Metaphone"]},{"location":"topic_guides/comparisons/phonetic.html#algorithms","title":"Algorithms","text":"

Below are some examples of well known phonetic transformation algorithms.

","tags":["API","Phonetic Transformations","Comparisons","Blocking","Soundex","Metaphone","Double Metaphone"]},{"location":"topic_guides/comparisons/phonetic.html#soundex","title":"Soundex","text":"

Soundex is a phonetic algorithm that assigns a code to words based on their sound. The Soundex algorithm works by converting a word into a four-character code, where the first character is the first letter of the word, and the next three characters are numerical codes representing the word's remaining consonants. Vowels and some consonants, such as H, W, and Y, are ignored.

Algorithm Steps

The Soundex algorithm works by following these steps:

  1. Retain the first letter of the word and remove all other vowels and the letters \"H\", \"W\", and \"Y\".

  2. Replace each remaining consonant (excluding the first letter) with a numerical code as follows:

    1. B, F, P, and V are replaced with \"1\"
    2. C, G, J, K, Q, S, X, and Z are replaced with \"2\"
    3. D and T are replaced with \"3\"
    4. L is replaced with \"4\"
    5. M and N are replaced with \"5\"
    6. R is replaced with \"6\"
  3. Combine the first letter and the numerical codes to form a four-character code. If there are fewer than four characters, pad the code with zeros.
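The steps above can be sketched in Python. This is a simplified illustration — it does not handle the vowel-separator subtlety of the full algorithm, and note that the phonetics package used in the example that follows emits a five-character variant (S5030) rather than the classic four-character code:

```python
def soundex(word: str) -> str:
    # Classic four-character Soundex, following the steps above.
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    word = word.upper()
    first = word[0]
    # Encode every letter; vowels, H, W and Y map to no digit.
    digits = [codes.get(ch, "") for ch in word]
    # Drop adjacent duplicate digits, then discard the first letter's digit.
    result = []
    prev = digits[0]
    for d in digits[1:]:
        if d and d != prev:
            result.append(d)
        prev = d or prev  # simplified: vowels do not reset the previous code
    # Pad with zeros and truncate to four characters.
    return (first + "".join(result) + "000")[:4]

print(soundex("Smith"), soundex("Smyth"))  # S530 S530
```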

Example

You can test out the Soundex transformation between two strings through the phonetics package.

import phonetics\nprint(phonetics.soundex(\"Smith\"), phonetics.soundex(\"Smyth\"))\n

S5030 S5030

","tags":["API","Phonetic Transformations","Comparisons","Blocking","Soundex","Metaphone","Double Metaphone"]},{"location":"topic_guides/comparisons/phonetic.html#metaphone","title":"Metaphone","text":"

Metaphone is an improved version of the Soundex algorithm that was developed to handle a wider range of words and languages. The Metaphone algorithm assigns a code to a word based on its phonetic pronunciation, but it takes into account the sound of the entire word, rather than just its first letter and consonants. The Metaphone algorithm works by applying a set of rules to the word's pronunciation, such as converting the \"TH\" sound to a \"T\" sound, or removing silent letters. The resulting code is a variable-length string of letters that represents the word's pronunciation.

Algorithm Steps

The Metaphone algorithm works by following these steps:

  1. Convert the word to uppercase and remove all non-alphabetic characters.

  2. Apply a set of pronunciation rules to the word, such as:

    1. Convert the letters \"C\" and \"K\" to \"K\"
    2. Convert the letters \"PH\" to \"F\"
    3. Convert the letters \"W\" and \"H\" to nothing if they are not at the beginning of the word
  3. Apply a set of replacement rules to the resulting word, such as:

    1. Replace the letter \"G\" with \"J\" if it is followed by an \"E\", \"I\", or \"Y\"
    2. Replace the letter \"C\" with \"S\" if it is followed by an \"E\", \"I\", or \"Y\"
    3. Replace the letter \"X\" with \"KS\"
  4. If the resulting word ends with \"S\", remove it.

  5. If the resulting word ends with \"ED\", \"ING\", or \"ES\", remove it.

  6. If the resulting word starts with \"KN\", \"GN\", \"PN\", \"AE\", \"WR\", or \"WH\", remove the first letter.

  7. If the resulting word starts with a vowel, retain the first letter.

  8. Retain the first four characters of the resulting word, or pad it with zeros if it has fewer than four characters.

Example

You can test out the Metaphone transformation between two strings through the phonetics package.

import phonetics\nprint(phonetics.metaphone(\"Smith\"), phonetics.metaphone(\"Smyth\"))\n

SM0 SM0

","tags":["API","Phonetic Transformations","Comparisons","Blocking","Soundex","Metaphone","Double Metaphone"]},{"location":"topic_guides/comparisons/phonetic.html#double-metaphone","title":"Double Metaphone","text":"

Double Metaphone is an extension of the Metaphone algorithm that generates two codes for each word, one for the primary pronunciation and one for an alternate pronunciation. The Double Metaphone algorithm is designed to handle a wide range of languages and dialects, and it is more accurate than the original Metaphone algorithm.

The Double Metaphone algorithm works by applying a set of rules to the word's pronunciation, similar to the Metaphone algorithm, but it generates two codes for each word. The primary code is the most likely pronunciation of the word, while the alternate code represents a less common pronunciation.

Algorithm Steps: Standard Double Metaphone and Alternative Double Metaphone

The Double Metaphone algorithm works by following these steps:

  1. Convert the word to uppercase and remove all non-alphabetic characters.

  2. Apply a set of pronunciation rules to the word, such as:

    1. Convert the letters \"C\" and \"K\" to \"K\"
    2. Convert the letters \"PH\" to \"F\"
    3. Convert the letters \"W\" and \"H\" to nothing if they are not at the beginning of the word
  3. Apply a set of replacement rules to the resulting word, such as:

    1. Replace the letter \"G\" with \"J\" if it is followed by an \"E\", \"I\", or \"Y\"
    2. Replace the letter \"C\" with \"S\" if it is followed by an \"E\", \"I\", or \"Y\"
    3. Replace the letter \"X\" with \"KS\"
  4. If the resulting word ends with \"S\", remove it.

  5. If the resulting word ends with \"ED\", \"ING\", or \"ES\", remove it.

  6. If the resulting word starts with \"KN\", \"GN\", \"PN\", \"AE\", \"WR\", or \"WH\", remove the first letter.

  7. If the resulting word starts with \"X\", \"Z\", \"GN\", or \"KN\", retain the first two characters.

  8. Apply a second set of rules to the resulting word to generate an alternative code.

  9. Return the primary and alternative codes as a tuple.

The Alternative Double Metaphone algorithm takes into account different contexts in the word and is generated by following these steps:

  1. Apply a set of prefix rules, such as:

    1. Convert the letter \"G\" at the beginning of the word to \"K\" if it is followed by \"N\", \"NED\", or \"NER\"
    2. Convert the letter \"A\" at the beginning of the word to \"E\" if it is followed by \"SCH\"
  2. Apply a set of suffix rules, such as:

    1. Convert the letters \"E\" and \"I\" at the end of the word to \"Y\"
    2. Convert the letters \"S\" and \"Z\" at the end of the word to \"X\"
    3. Remove the letter \"D\" at the end of the word if it is preceded by \"N\"
  3. Apply a set of replacement rules, such as:

    1. Replace the letter \"C\" with \"X\" if it is followed by \"IA\" or \"H\"
    2. Replace the letter \"T\" with \"X\" if it is followed by \"IA\" or \"CH\"
  4. Retain the first four characters of the resulting word, or pad it with zeros if it has fewer than four characters.

  5. If the resulting word starts with \"X\", \"Z\", \"GN\", or \"KN\", retain the first two characters.

  6. Return the alternative code.

Example

You can test out the Double Metaphone transformation between two strings through the phonetics package.

import phonetics\nprint(phonetics.dmetaphone(\"Smith\"), phonetics.dmetaphone(\"Smyth\"))\n

('SM0', 'XMT') ('SM0', 'XMT')

","tags":["API","Phonetic Transformations","Comparisons","Blocking","Soundex","Metaphone","Double Metaphone"]},{"location":"topic_guides/comparisons/regular_expressions.html","title":"Regular expressions","text":""},{"location":"topic_guides/comparisons/regular_expressions.html#extracting-partial-strings","title":"Extracting partial strings","text":"

It can sometimes be useful to make comparisons based on substrings or parts of column values. For example, one approach to comparing postcodes is to consider their constituent components, e.g. area, district, etc. (see Feature Engineering for more details).

We can use functions such as substrings and regular expressions to enable users to compare strings without needing to engineer new features from source data.

Splink supports this functionality via the use of the ComparisonExpression.

"},{"location":"topic_guides/comparisons/regular_expressions.html#examples","title":"Examples","text":""},{"location":"topic_guides/comparisons/regular_expressions.html#1-exact-match-on-postcode-area","title":"1. Exact match on postcode area","text":"

Suppose you wish to make comparisons on a postcode column in your data, but only care about finding links between people who share the same area code (given by the first one or two letters of the postcode). The regular expression to pick out these initial letters is ^[A-Z]{1,2}:

import splink.comparison_level_library as cll\nfrom splink import ColumnExpression\n\npc_ce = ColumnExpression(\"postcode\").regex_extract(\"^[A-Z]{1,2}\")\nprint(cll.ExactMatchLevel(pc_ce).get_comparison_level(\"duckdb\").sql_condition)\n
NULLIF(regexp_extract(\"postcode_l\", '^[A-Z]{1,2}', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Z]{1,2}', 0), '')\n

We may therefore configure a comparison as follows:

from splink.comparison_library import CustomComparison\n\ncc = CustomComparison(\n    output_column_name=\"postcode\",\n    comparison_levels=[\n        cll.NullLevel(\"postcode\"),\n        cll.ExactMatchLevel(pc_ce),\n        cll.ElseLevel()\n    ]\n\n)\nprint(cc.get_comparison(\"duckdb\").human_readable_description)\n
Comparison 'CustomComparison' of \"postcode\".\nSimilarity is assessed using the following ComparisonLevels:\n    - 'postcode is NULL' with SQL rule: \"postcode_l\" IS NULL OR \"postcode_r\" IS NULL\n    - 'Exact match on transformed postcode' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Z]{1,2}', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Z]{1,2}', 0), '')\n    - 'All other comparisons' with SQL rule: ELSE\n
person_id_l person_id_r postcode_l postcode_r comparison_level 7 1 SE1P 0NY SE1P 0NY exact match 5 1 SE2 4UZ SE1P 0NY exact match 9 2 SW14 7PQ SW3 9JG exact match 4 8 N7 8RL EC2R 8AH else level 6 3 SE2 4UZ null level"},{"location":"topic_guides/comparisons/regular_expressions.html#2-exact-match-on-initial","title":"2. Exact match on initial","text":"

In this example we use the .substr function to create a comparison level based on the first letter of a column value.

Note that the substr function is 1-indexed, so the first character is given by substr(1, 1), and the first two characters by substr(1, 2).

import splink.comparison_level_library as cll\nfrom splink import ColumnExpression\n\ninitial = ColumnExpression(\"first_name\").substr(1,1)\nprint(cll.ExactMatchLevel(initial).get_comparison_level(\"duckdb\").sql_condition)\n
SUBSTRING(\"first_name_l\", 1, 1) = SUBSTRING(\"first_name_r\", 1, 1)\n
"},{"location":"topic_guides/comparisons/regular_expressions.html#additional-info","title":"Additional info","text":"

Regular expressions containing \u201c\\\u201d (the python escape character) are tricky to make work with the Spark linker due to escaping so consider using alternative syntax, for example replacing \u201c\\d\u201d with \u201c[0-9]\u201d.

Different regex patterns can achieve the same result but with more or less efficiency. You might want to consider optimising your regular expressions to improve performance (see here, for example).

"},{"location":"topic_guides/comparisons/term-frequency.html","title":"Term frequency adjustments","text":"","tags":["Term Frequency","Comparisons"]},{"location":"topic_guides/comparisons/term-frequency.html#term-frequency-adjustments","title":"Term-Frequency Adjustments","text":"","tags":["Term Frequency","Comparisons"]},{"location":"topic_guides/comparisons/term-frequency.html#problem-statement","title":"Problem Statement","text":"

A shortcoming of the basic Fellegi-Sunter model is that it doesn\u2019t account for skew in the distributions of linking variables. A stark example is a binary variable such as gender in the prison population, where male offenders outnumber female offenders by 10:1.

","tags":["Term Frequency","Comparisons"]},{"location":"topic_guides/comparisons/term-frequency.html#how-does-this-affect-our-m-and-u-probabilities","title":"How does this affect our m and u probabilities?","text":"
  • m probability is unaffected - given two records are a match, the gender field should also match with roughly the same probability for males and females

  • Given two records are not a match, however, it is far more likely that both records will be male than that they will both be female - u probability is too low for the more common value (male) and too high otherwise.

In this example, one solution might be to create an extra comparison level for matches on gender:

  • l.gender = r.gender AND l.gender = 'Male'

  • l.gender = r.gender AND l.gender = 'Female'

However, this complexity forces us to estimate two m probabilities when one would do, and it becomes impractical if we extend to higher-cardinality variables like surname, requiring thousands of additional comparison levels.

This problem used to be addressed with an ex-post (after the fact) solution - once the linking is done, we look at the average match probability for each value in a column to determine which values tend to be stronger indicators of a match. If the average match probability for record pairs that share a surname is 0.2, but the average for the specific surname Smith is 0.1, then we know that the match weight for name should be adjusted downwards for Smiths.

The shortcoming of this option is that in practice, the model training is conducted on the assumption that all name matches are equally informative, and all of the underlying probabilities are evaluated accordingly. Ideally, we want to be able to account for term frequencies within the Fellegi-Sunter framework as trained by the EM algorithm.

","tags":["Term Frequency","Comparisons"]},{"location":"topic_guides/comparisons/term-frequency.html#toy-example","title":"Toy Example","text":"

Below is an illustration of 2 datasets (10 records each) with skewed distributions of first name. A link_and_dedupe Splink model concatenates these two tables and deduplicates those 20 records.

In principle, u probabilities for a small dataset like this can be estimated directly - out of 190 possible pairwise comparisons, 77 of them have the same first name. Based on the assumption that matches are rare (i.e. the vast majority of these comparisons are non-matches), we use this as a direct estimate of u. Random sampling makes the same assumption, but by using a manageable-sized sample of a much larger dataset where it would be prohibitively costly to perform all possible comparisons (a Cartesian join).
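The direct estimate described above can be sketched as follows. The name counts here are hypothetical (chosen only so the totals match the 190 comparisons and 77 shared-name pairs quoted in the text):

```python
from math import comb

# Hypothetical first-name counts across the 20 concatenated records
# (illustrative only -- not the actual datasets from the example)
name_counts = {"John": 12, "Mary": 5, "Ahmed": 2, "Priya": 1}

n = sum(name_counts.values())              # 20 records
total_pairs = comb(n, 2)                   # 190 possible pairwise comparisons
# Pairs sharing a first name: sum of C(count, 2) over each value
same_name_pairs = sum(comb(c, 2) for c in name_counts.values())  # 77

# Assuming matches are rare (almost all comparisons are non-matches),
# this ratio is a direct estimate of u for an exact match on first name
u_estimate = same_name_pairs / total_pairs
print(total_pairs, same_name_pairs, round(u_estimate, 3))  # 190 77 0.405
```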

Once we have concatenated our input tables, it is useful to calculate the term frequencies (TF) of each value. Rather than keep a separate TF table, we can add a TF column to the concatenated table - this is what df_concat_with_tf refers to within Splink.

Building on the example above, we can define the m and u probabilities for a specific first name value, and work out an expression for the resulting match weight.
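As a rough numerical sketch of this idea (the m, u and term-frequency values below are purely illustrative, not taken from the example data): replacing the column-average u with a value-specific term frequency splits the match weight into a base term plus an independent TF adjustment.

```python
from math import log2

# Illustrative values only (assumptions for this sketch)
m = 0.9          # P(first names match | records are a true match)
u = 0.01         # average P(first names match | records are not a match)
tf_smith = 0.05  # term frequency of "Smith": a value-specific u estimate

base_weight = log2(m / u)           # column-level match weight, ~ +6.49
tf_adjustment = log2(u / tf_smith)  # negative for common values, ~ -2.32
adjusted_weight = base_weight + tf_adjustment

# The sum is equivalent to using the value-specific u directly
assert abs(adjusted_weight - log2(m / tf_smith)) < 1e-9
print(round(adjusted_weight, 2))  # 4.17
```

Note that the TF adjustment term depends only on u and the term frequency, which is why it does not need to be estimated by the EM algorithm and can be switched off independently.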

Just as we can add independent match weights for name, DOB and other comparisons (as shown in the Splink waterfall charts), we can also add an independent TF adjustment term for each comparison. This is useful because:

  • The TF adjustment doesn't depend on m, and therefore does not have to be estimated by the EM algorithm - it is known already

  • The EM algorithm benefits from the TF adjustment (rather than previous post hoc implementations)

  • It is trivially easy to \u201cturn off\u201d TF adjustments in our final match weights if we wish

  • We can easily disentangle and visualise the aggregate significance of a particular column, separately from the deviations within it (see charts below)

","tags":["Term Frequency","Comparisons"]},{"location":"topic_guides/comparisons/term-frequency.html#visualising-tf-adjustments","title":"Visualising TF Adjustments","text":"

For an individual comparison of two records, we can see the impact of TF adjustments in the waterfall charts:

This example shows two records having a match weight of +15.69 due to a match on first name, surname and DOB. Because all three of these have relatively uncommon values, each also has a term frequency adjustment contributing around +5 to the final match weight.

We can also see these match weights and TF adjustments summarised using a chart like the one below to highlight common and uncommon names. We already do this using the Splink linker's profile_columns method, but once we know the u probabilities for our comparison columns, we can show these outliers in terms of their impact on match weight:

In this example of names from FEBRL data used in the demo notebooks, we see that a match on first name has a match weight of +6. For an uncommon name like Portia this is increased, whereas a common name like Jack has a reduced match weight. This chart can be generated using `linker.tf_adjustment_chart(\"name\")`","tags":["Term Frequency","Comparisons"]},{"location":"topic_guides/comparisons/term-frequency.html#applying-tf-adjustments-in-splink","title":"Applying TF adjustments in Splink","text":"

Depending on how you compose your Splink settings, TF adjustments can be applied to a specific comparison level in different ways:

","tags":["Term Frequency","Comparisons"]},{"location":"topic_guides/comparisons/term-frequency.html#comparisonlibrary-and-comparisontemplatelibrary-functions","title":"ComparisonLibrary and ComparisonTemplateLibrary functions","text":"
import splink.comparison_library as cl\nimport splink.comparison_template_library as ctl\n\nsex_comparison = cl.ExactMatch(\"sex\").configure(term_frequency_adjustments=True)\n\nname_comparison = cl.JaroWinklerAtThresholds(\n    \"name\",\n    score_threshold_or_thresholds=[0.9, 0.8],\n).configure(term_frequency_adjustments=True)\n\nemail_comparison = ctl.EmailComparison(\"email\").configure(\n    term_frequency_adjustments=True,\n)\n
","tags":["Term Frequency","Comparisons"]},{"location":"topic_guides/comparisons/term-frequency.html#comparison-level-library-functions","title":"Comparison level library functions","text":"
import splink.comparison_level_library as cll\nimport splink.comparison_library as cl\n\nname_comparison = cl.CustomComparison(\n    output_column_name=\"name\",\n    comparison_description=\"Full name\",\n    comparison_levels=[\n        cll.NullLevel(\"full_name\"),\n        cll.ExactMatchLevel(\"full_name\").configure(tf_adjustment_column=\"full_name\"),\n        cll.ColumnsReversedLevel(\"first_name\", \"surname\").configure(\n            tf_adjustment_column=\"surname\"\n        ),\n        cll.ElseLevel(),\n    ],\n)\n
","tags":["Term Frequency","Comparisons"]},{"location":"topic_guides/comparisons/term-frequency.html#providing-a-detailed-spec-as-a-dictionary","title":"Providing a detailed spec as a dictionary","text":"
comparison_first_name = {\n    \"output_column_name\": \"first_name\",\n    \"comparison_description\": \"First name jaro dmeta\",\n    \"comparison_levels\": [\n        {\n            \"sql_condition\": \"first_name_l IS NULL OR first_name_r IS NULL\",\n            \"label_for_charts\": \"Null\",\n            \"is_null_level\": True,\n        },\n        {\n            \"sql_condition\": \"first_name_l = first_name_r\",\n            \"label_for_charts\": \"Exact match\",\n            \"tf_adjustment_column\": \"first_name\",\n            \"tf_adjustment_weight\": 1.0,\n            \"tf_minimum_u_value\": 0.001,\n        },\n        {\n            \"sql_condition\": \"jaro_winkler_sim(first_name_l, first_name_r) > 0.8\",\n            \"label_for_charts\": \"Jaro-Winkler >0.8\",\n            \"tf_adjustment_column\": \"first_name\",\n            \"tf_adjustment_weight\": 0.5,\n            \"tf_minimum_u_value\": 0.001,\n        },\n        {\"sql_condition\": \"ELSE\", \"label_for_charts\": \"All other comparisons\"},\n    ],\n}\n
","tags":["Term Frequency","Comparisons"]},{"location":"topic_guides/comparisons/term-frequency.html#more-advanced-applications","title":"More advanced applications","text":"

The code examples above show how we can use term frequencies for different columns for different comparison levels, and demonstrated a few other features of the TF adjustment implementation in Splink:

","tags":["Term Frequency","Comparisons"]},{"location":"topic_guides/comparisons/term-frequency.html#multiple-columns","title":"Multiple columns","text":"

Each comparison level can be adjusted on the basis of a specified column. In the case of exact match levels, this is trivial but it allows some partial matches to be reframed as exact matches on a different derived column. One example could be ethnicity, often provided in codes as a letter (W/M/B/A/O - the ethnic group) and a number. Without TF adjustments, an ethnicity comparison might have 3 levels - exact match, match on ethnic group (LEFT(ethnicity,1)), no match. By creating a derived column ethnic_group = LEFT(ethnicity,1) we can apply TF adjustments to both levels.

ethnicity_comparison = cl.CustomComparison(\n    output_column_name=\"ethnicity\",\n    comparison_description=\"Self-defined ethnicity\",\n    comparison_levels=[\n        cll.NullLevel(\"ethnicity\"),\n        cll.ExactMatchLevel(\"ethnicity\").configure(tf_adjustment_column=\"ethnicity\"),\n        cll.ExactMatchLevel(\"ethnic_group\").configure(tf_adjustment_column=\"ethnic_group\"),\n        cll.ElseLevel(),\n    ],\n)\n

A more critical example would be a full name comparison that uses separate first name and surname columns. Previous implementations would apply TF adjustments to each name component independently, so \u201cJohn Smith\u201d would be adjusted down for the common name \u201cJohn\u201d and then again for the common name \u201cSmith\u201d. However, the frequencies of names are not generally independent (e.g. \u201cMohammed Khan\u201d is a relatively common full name despite neither name occurring frequently). A simple full name comparison could therefore be structured as follows:

name_comparison = cl.CustomComparison(\n    output_column_name=\"name\",\n    comparison_description=\"Full name\",\n    comparison_levels=[\n        cll.NullLevel(\"full_name\"),\n        cll.ExactMatchLevel(\"full_name\").configure(tf_adjustment_column=\"full_name\"),\n        cll.ExactMatchLevel(\"first_name\").configure(tf_adjustment_column=\"first_name\"),\n        cll.ExactMatchLevel(\"surname\").configure(tf_adjustment_column=\"surname\"),\n        cll.ElseLevel(),\n    ],\n)\n
","tags":["Term Frequency","Comparisons"]},{"location":"topic_guides/comparisons/term-frequency.html#fuzzy-matches","title":"Fuzzy matches","text":"

All of the above discussion of TF adjustments has assumed an exact match on the column in question, but this need not be the case. Where we have a \u201cfuzzy\u201d match between string values, it is generally assumed that there has been some small corruption in the text, for a number of possible reasons. A trivial example could be \"Smith\" vs \"Smith \" which we know to be equivalent if not an exact string match.

In the case of a fuzzy match, we may decide it is desirable to apply TF adjustments for the same reasons as an exact match, but given there are now two distinct sides to the comparison, there are also two different TF adjustments. Building on our assumption that one side is the \u201ccorrect\u201d or standard value and the other contains some mistake, Splink will simply use the greater of the two term frequencies. There should be more \"Smith\"s than \"Smith \"s, so the former provides the best estimate of the true prevalence of the name Smith in the data.

In cases where this assumption might not hold and both values are valid and distinct (e.g. \"Alex\" v \"Alexa\"), this behaviour is still desirable. Taking the more common of the two frequencies ensures that we err on the side of lowering the match score (by assuming the more common name) rather than increasing it (by assuming the less common one).
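A minimal sketch of the greater-of-the-two-frequencies rule for fuzzy matches (the frequency values below are illustrative, not real data):

```python
# Term frequencies for the two sides of a fuzzy match (illustrative)
tf_l = 0.012    # "Smith" on the left record
tf_r = 0.0001   # the corrupted "Smith " on the right record

# The more common value is assumed to be the "correct" one, so its
# frequency is used as the estimate of the name's true prevalence
tf_used = max(tf_l, tf_r)
print(tf_used)  # 0.012
```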

TF adjustments will not be applied to any comparison level without explicitly being turned on, but to allow for some middle ground when applying them to fuzzy match columns, there is a tf_adjustment_weight setting that can down-weight the TF adjustment. A weight of zero is equivalent to turning TF adjustments off, while a weight of 0.5 halves the TF adjustment's contribution to the match weight, mitigating its impact:

{\n  \"sql_condition\": \"jaro_winkler_sim(first_name_l, first_name_r) > 0.8\",\n  \"label_for_charts\": \"Exact match\",\n  \"tf_adjustment_column\": \"first_name\",\n  \"tf_adjustment_weight\": 0.5\n}\n
","tags":["Term Frequency","Comparisons"]},{"location":"topic_guides/comparisons/term-frequency.html#low-frequency-outliers","title":"Low-frequency outliers","text":"

Another example of where you may wish to limit the impact of TF adjustments is for exceedingly rare values. As defined above, the TF-adjusted match weight, K, is inversely proportional to the term frequency, allowing K to become very large in some cases.

Let\u2019s say we have a handful of records with the misspelt first name \u201cSiohban\u201d (rather than \u201cSiobhan\u201d). Fuzzy matches between the two spellings will rightly be adjusted on the basis of the frequency of the correct spelling, but there will be a small number of cases where the misspellings match one another. Given we suspect these values are more likely to be misspellings of more common names, rather than a distinct and very rare name, we can mitigate this effect by imposing a minimum value on the term frequency used (equivalent to the u value). This can be added to your full settings dictionary as in the example above using \"tf_minimum_u_value\": 0.001. This means that for values with a frequency of less than 1 in 1,000, the frequency used is set to 0.001.
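The effect of this floor can be sketched as a simple clamp (the function and values below are illustrative, not Splink's internal implementation):

```python
# Floor on the term frequency used in the adjustment (per the example above)
TF_MINIMUM_U_VALUE = 0.001

def effective_tf(term_frequency: float) -> float:
    # Frequencies below the floor are raised to it, capping the maximum
    # boost a very rare (possibly misspelt) value can receive
    return max(term_frequency, TF_MINIMUM_U_VALUE)

print(effective_tf(0.00001))  # 0.001 -- rarer than 1 in 1,000, so clamped
print(effective_tf(0.02))     # 0.02  -- above the floor, unchanged
```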

","tags":["Term Frequency","Comparisons"]},{"location":"topic_guides/data_preparation/feature_engineering.html","title":"Feature Engineering","text":"","tags":["API","Feature Engineering","Comparisons","Postcode","Phonetic Transformations","Soundex","Metaphone","Double Metaphone"]},{"location":"topic_guides/data_preparation/feature_engineering.html#feature-engineering-for-data-linkage","title":"Feature Engineering for Data Linkage","text":"

During record linkage, the features in a given dataset are used to provide evidence as to whether two records are a match. Like any predictive model, the quality of a Splink model is dictated by the features provided.

Below are some examples of features that can be created from common columns, and how to create more detailed comparisons with them in a Splink model.

","tags":["API","Feature Engineering","Comparisons","Postcode","Phonetic Transformations","Soundex","Metaphone","Double Metaphone"]},{"location":"topic_guides/data_preparation/feature_engineering.html#postcodes","title":"Postcodes","text":"

In this example, we derive latitude and longitude coordinates from a postcode column to create a more nuanced comparison. By doing so, we account for similarity not just in the string of the postcode, but in the geographical location it represents. This could be useful if we believe, for instance, that people move house, but generally stay within the same geographical area.

We start with a comparison that uses the postcode's components. For example, UK postcodes can be broken down into the following substrings:

See image source for more details.

The pre-built postcode comparison generates a comparison with levels for an exact match on full postcode, sector, district and area in turn.

Code examples to use the comparison template:

import splink.comparison_library as cl\n\npc_comparison = cl.PostcodeComparison(\"postcode\").get_comparison(\"duckdb\")\nprint(pc_comparison.human_readable_description)\n
Output
Comparison 'PostcodeComparison' of \"postcode\".\nSimilarity is assessed using the following ComparisonLevels:\n    - 'postcode is NULL' with SQL rule: \"postcode_l\" IS NULL OR \"postcode_r\" IS NULL\n    - 'Exact match on full postcode' with SQL rule: \"postcode_l\" = \"postcode_r\"\n    - 'Exact match on sector' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')\n    - 'Exact match on district' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '')\n    - 'Exact match on area' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Za-z]{1,2}', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Za-z]{1,2}', 0), '')\n    - 'All other comparisons' with SQL rule: ELSE\n

Note that this is not able to compute geographical distance by default, because it cannot assume that lat-long coordinates are available.

We now proceed to derive lat and long columns so that we can take advantage of geographical distance. We will use the ONS Postcode Directory to look up the lat-long coordinates for each postcode.

Read in a dataset with postcodes:

import duckdb\n\nfrom splink import splink_datasets\n\ndf = splink_datasets.historical_50k\n\ndf_with_pc = \"\"\"\nWITH postcode_lookup AS (\n    SELECT\n        pcd AS postcode,\n        lat,\n        long\n    FROM\n        read_csv_auto('./path/to/ONSPD_FEB_2023_UK.csv')\n)\nSELECT\n    df.*,\n    postcode_lookup.lat,\n    postcode_lookup.long\nFROM\n    df\nLEFT JOIN\n    postcode_lookup\nON\n    upper(df.postcode_fake) = postcode_lookup.postcode\n\"\"\"\n\ndf_with_postcode = duckdb.sql(df_with_pc)\n

Now that coordinates have been added, a more detailed postcode comparison can be produced using PostcodeComparison:

pc_comparison = cl.PostcodeComparison(\n    \"postcode\", lat_col=\"lat\", long_col=\"long\", km_thresholds=[1, 10]\n).get_comparison(\"duckdb\")\nprint(pc_comparison.human_readable_description)\n
Output
Comparison 'PostcodeComparison' of \"postcode\", \"lat\" and \"long\".\nSimilarity is assessed using the following ComparisonLevels:\n    - 'postcode is NULL' with SQL rule: \"postcode_l\" IS NULL OR \"postcode_r\" IS NULL\n    - 'Exact match on postcode' with SQL rule: \"postcode_l\" = \"postcode_r\"\n    - 'Exact match on transformed postcode' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')\n    - 'Distance less than 1km' with SQL rule:\n        cast(\n            acos(\n\n        case\n            when (\n        sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) +\n        cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") )\n            * cos( radians(\"long_r\" - \"long_l\") )\n    ) > 1 then 1\n            when (\n        sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) +\n        cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") )\n            * cos( radians(\"long_r\" - \"long_l\") )\n    ) < -1 then -1\n            else (\n        sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) +\n        cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") )\n            * cos( radians(\"long_r\" - \"long_l\") )\n    )\n        end\n\n            ) * 6371\n            as float\n        )\n    <= 1\n    - 'Distance less than 10km' with SQL rule:\n        cast(\n            acos(\n\n        case\n            when (\n        sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) +\n        cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") )\n            * cos( radians(\"long_r\" - \"long_l\") )\n    ) > 1 then 1\n            when (\n        sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) +\n        cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") )\n            * cos( radians(\"long_r\" - \"long_l\") )\n    ) < -1 then -1\n            else (\n        sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) +\n  
      cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") )\n            * cos( radians(\"long_r\" - \"long_l\") )\n    )\n        end\n\n            ) * 6371\n            as float\n        )\n    <= 10\n    - 'All other comparisons' with SQL rule: ELSE\n

or by using cll.DistanceInKMLevel() in conjunction with other comparison levels:

import splink.comparison_level_library as cll\nimport splink.comparison_library as cl\n\ncustom_postcode_comparison = cl.CustomComparison(\n    output_column_name=\"postcode\",\n    comparison_description=\"Postcode\",\n    comparison_levels=[\n        cll.NullLevel(\"postcode\"),\n        cll.ExactMatchLevel(\"postcode\"),\n        cll.DistanceInKMLevel(\"lat\", \"long\", 1),\n        cll.DistanceInKMLevel(\"lat\", \"long\", 10),\n        cll.DistanceInKMLevel(\"lat\", \"long\", 50),\n        cll.ElseLevel(),\n    ],\n)\n
","tags":["API","Feature Engineering","Comparisons","Postcode","Phonetic Transformations","Soundex","Metaphone","Double Metaphone"]},{"location":"topic_guides/data_preparation/feature_engineering.html#phonetic-transformations","title":"Phonetic transformations","text":"

Phonetic transformation algorithms can be used to identify words that sound similar, even if they are spelled differently. These are particularly useful for names and can be used as an additional comparison level within name comparisons.

For a more detailed explanation on phonetic transformation algorithms, see the topic guide.

","tags":["API","Feature Engineering","Comparisons","Postcode","Phonetic Transformations","Soundex","Metaphone","Double Metaphone"]},{"location":"topic_guides/data_preparation/feature_engineering.html#example","title":"Example","text":"

There are a number of Python packages that support phonetic transformations. These can be applied to a pandas dataframe, which can then be loaded into the Linker. For example, creating a Double Metaphone column with the phonetics Python library:

import pandas as pd\nimport phonetics\n\nfrom splink import splink_datasets\ndf = splink_datasets.fake_1000\n\n# Define a function to apply the dmetaphone phonetic algorithm to each name in the column\ndef dmetaphone_name(name):\n    if name is None:\n        return None\n    return phonetics.dmetaphone(name)\n\n# Apply the function to the \"first_name\" and surname columns using the apply method\ndf['first_name_dm'] = df['first_name'].apply(dmetaphone_name)\ndf['surname_dm'] = df['surname'].apply(dmetaphone_name)\n\ndf.head()\n
Output unique_id first_name surname dob city email group first_name_dm surname_dm 0 0 Julia 2015-10-29 London hannah88@powers.com 0 ('JL', 'AL') 1 1 Julia Taylor 2015-07-31 London hannah88@powers.com 0 ('JL', 'AL') ('TLR', '') 2 2 Julia Taylor 2016-01-27 London hannah88@powers.com 0 ('JL', 'AL') ('TLR', '') 3 3 Julia Taylor 2015-10-29 hannah88opowersc@m 0 ('JL', 'AL') ('TLR', '') 4 4 oNah Watson 2008-03-23 Bolton matthew78@ballard-mcdonald.net 1 ('AN', '') ('ATSN', 'FTSN')

Note: Soundex and Metaphone are also supported in phonetics.

Now that the dmetaphone columns have been added, they can be used within comparisons. For example, using the NameComparison function from the comparison library.

import splink.comparison_library as cl\n\ncomparison = cl.NameComparison(\"first_name\", dmeta_col_name=\"first_name_dm\").get_comparison(\"duckdb\")\ncomparison.human_readable_description\n
Output
Comparison 'NameComparison' of \"first_name\" and \"first_name_dm\".\nSimilarity is assessed using the following ComparisonLevels:\n    - 'first_name is NULL' with SQL rule: \"first_name_l\" IS NULL OR \"first_name_r\" IS NULL\n    - 'Exact match on first_name' with SQL rule: \"first_name_l\" = \"first_name_r\"\n    - 'Jaro-Winkler distance of first_name >= 0.92' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.92\n    - 'Jaro-Winkler distance of first_name >= 0.88' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.88\n    - 'Array intersection size >= 1' with SQL rule: array_length(list_intersect(\"first_name_dm_l\", \"first_name_dm_r\")) >= 1\n    - 'Jaro-Winkler distance of first_name >= 0.7' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.7\n    - 'All other comparisons' with SQL rule: ELSE\n
","tags":["API","Feature Engineering","Comparisons","Postcode","Phonetic Transformations","Soundex","Metaphone","Double Metaphone"]},{"location":"topic_guides/data_preparation/feature_engineering.html#full-name","title":"Full name","text":"

If Splink has access to a combined full name column, it can use the term frequency of the full name, as opposed to treating forename and surname as independent.

This can be important because correlations in names are common. For example, in the UK, \u201cMohammed Khan\u201d is a more common full name than the individual frequencies of \"Mohammed\" or \"Khan\" would suggest.

The following example shows how to do this.

For more on term frequency, see the dedicated topic guide.

","tags":["API","Feature Engineering","Comparisons","Postcode","Phonetic Transformations","Soundex","Metaphone","Double Metaphone"]},{"location":"topic_guides/data_preparation/feature_engineering.html#example_1","title":"Example","text":"

Derive a full name column:

import pandas as pd\n\nfrom splink import splink_datasets\n\ndf = splink_datasets.fake_1000\n\ndf['full_name'] = df['first_name'] + ' ' + df['surname']\n\ndf.head()\n

Now that the full_name column has been added, it can be used within comparisons. For example, using the ForenameSurnameComparison function from the comparison library.

comparison = cl.ForenameSurnameComparison(\n    \"first_name\", \"surname\", forename_surname_concat_col_name=\"full_name\"\n)\ncomparison.get_comparison(\"duckdb\").as_dict()\n
Output
{'output_column_name': 'first_name_surname',\n'comparison_levels': [{'sql_condition': '(\"first_name_l\" IS NULL OR \"first_name_r\" IS NULL) AND (\"surname_l\" IS NULL OR \"surname_r\" IS NULL)',\n'label_for_charts': '(first_name is NULL) AND (surname is NULL)',\n'is_null_level': True},\n{'sql_condition': '\"full_name_l\" = \"full_name_r\"',\n'label_for_charts': 'Exact match on full_name',\n'tf_adjustment_column': 'full_name',\n'tf_adjustment_weight': 1.0},\n{'sql_condition': '\"first_name_l\" = \"surname_r\" AND \"first_name_r\" = \"surname_l\"',\n'label_for_charts': 'Match on reversed cols: first_name and surname'},\n{'sql_condition': '(jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.92) AND (jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.92)',\n'label_for_charts': '(Jaro-Winkler distance of first_name >= 0.92) AND (Jaro-Winkler distance of surname >= 0.92)'},\n{'sql_condition': '(jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.88) AND (jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.88)',\n'label_for_charts': '(Jaro-Winkler distance of first_name >= 0.88) AND (Jaro-Winkler distance of surname >= 0.88)'},\n{'sql_condition': '\"surname_l\" = \"surname_r\"',\n'label_for_charts': 'Exact match on surname',\n'tf_adjustment_column': 'surname',\n'tf_adjustment_weight': 1.0},\n{'sql_condition': '\"first_name_l\" = \"first_name_r\"',\n'label_for_charts': 'Exact match on first_name',\n'tf_adjustment_column': 'first_name',\n'tf_adjustment_weight': 1.0},\n{'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],\n'comparison_description': 'ForenameSurnameComparison'}\n

Note that the first level after the null level is now:

{'sql_condition': '\"full_name_l\" = \"full_name_r\"',\n'label_for_charts': 'Exact match on full_name',\n'tf_adjustment_column': 'full_name',\n'tf_adjustment_weight': 1.0},\n

whereas without specifying forename_surname_concat_col_name we would have had:

{'sql_condition': '(\"first_name_l\" = \"first_name_r\") AND (\"surname_l\" = \"surname_r\")',\n'label_for_charts': '(Exact match on first_name) AND (Exact match on surname)'},\n
","tags":["API","Feature Engineering","Comparisons","Postcode","Phonetic Transformations","Soundex","Metaphone","Double Metaphone"]},{"location":"topic_guides/evaluation/edge_metrics.html","title":"Edge Metrics","text":"

This is intended as a reference guide for the edge metrics used throughout Splink, building up from basic principles to more complex metrics.

Note

All of these metrics are dependent on having a \"ground truth\" to compare against. This is generally provided by Clerical Labelling (i.e. labels created by a human). For more on how to generate this ground truth (and the impact that can have on Edge Metrics), check out the Clerical Labelling Topic Guide.

"},{"location":"topic_guides/evaluation/edge_metrics.html#the-basics","title":"The Basics","text":"

Any Edge (Link) within a Splink model will fall into one of four categories:

"},{"location":"topic_guides/evaluation/edge_metrics.html#true-positive","title":"True Positive","text":"

Also known as: True Link

A True Positive is a case where a Splink model correctly predicts a match between two records.

"},{"location":"topic_guides/evaluation/edge_metrics.html#true-negative","title":"True Negative","text":"

Also known as: True Non-link

A True Negative is a case where a Splink model correctly predicts a non-match between two records.

"},{"location":"topic_guides/evaluation/edge_metrics.html#false-positive","title":"False Positive","text":"

Also known as: False Link, Type I Error

A False Positive is a case where a Splink model incorrectly predicts a match between two records, when they are actually a non-match.

"},{"location":"topic_guides/evaluation/edge_metrics.html#false-negative","title":"False Negative","text":"

Also known as: False Non-link, Missed Link, Type II Error

A False Negative is a case where a Splink model incorrectly predicts a non-match between two records, when they are actually a match.

"},{"location":"topic_guides/evaluation/edge_metrics.html#confusion-matrix","title":"Confusion Matrix","text":"

These can be summarised in a Confusion Matrix:

In a perfect model there would be no False Positives or False Negatives (i.e. FP = 0 and FN = 0).

"},{"location":"topic_guides/evaluation/edge_metrics.html#metrics-for-linkage","title":"Metrics for Linkage","text":"

The confusion matrix shows counts of each link type, but we are generally more interested in proportions, i.e. what percentage of the time does the model get the answer right?

"},{"location":"topic_guides/evaluation/edge_metrics.html#accuracy","title":"Accuracy","text":"

The simplest metric is

\\[\\textsf{Accuracy} = \\frac{\\textsf{True Positives}+\\textsf{True Negatives}}{\\textsf{All Predictions}}\\]

This measures the proportion of correct classifications (of any kind). It can be useful for balanced data, but for highly imbalanced data high accuracy can be achieved simply by always predicting the majority class (e.g. predicting every record pair to be a non-match).
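As a minimal sketch (the confusion-matrix counts below are invented for illustration), accuracy can be computed directly from the four counts, and the imbalance pitfall is easy to demonstrate:

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Balanced data: 90 correct predictions out of 100
print(accuracy(tp=45, tn=45, fp=5, fn=5))  # 0.9

# Highly imbalanced data: predicting "non-match" for every record
# pair still scores 99% accuracy, despite finding no links at all
print(accuracy(tp=0, tn=990, fp=0, fn=10))  # 0.99
```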

Accuracy in Splink
  • Accuracy can be calculated in Splink using the accuracy_analysis_from_labels_column and accuracy_analysis_from_labels_table methods. Check out the splink.evaluation docs for more.

"},{"location":"topic_guides/evaluation/edge_metrics.html#true-positive-rate-recall","title":"True Positive Rate (Recall)","text":"

Also known as: Sensitivity

The True Positive Rate (Recall) is the proportion of matches that are correctly predicted by Splink.

\\[\\textsf{Recall} = \\frac{\\textsf{True Positives}}{\\textsf{All Positives}} = \\frac{\\textsf{True Positives}}{\\textsf{True Positives} + \\textsf{False Negatives}}\\] Recall in Splink
  • Recall can be calculated in Splink using the accuracy_analysis_from_labels_column and accuracy_analysis_from_labels_table methods. Check out the splink.evaluation docs for more.
"},{"location":"topic_guides/evaluation/edge_metrics.html#true-negative-rate-specificity","title":"True Negative Rate (Specificity)","text":"

Also known as: Selectivity

The True Negative Rate (Specificity) is the proportion of non-matches that are correctly predicted by Splink.

\\[\\textsf{Specificity} = \\frac{\\textsf{True Negatives}}{\\textsf{All Negatives}} = \\frac{\\textsf{True Negatives}}{\\textsf{True Negatives} + \\textsf{False Positives}}\\] Specificity in Splink
  • Specificity can be calculated in Splink using the accuracy_analysis_from_labels_column and accuracy_analysis_from_labels_table methods. Check out the splink.evaluation docs for more.
"},{"location":"topic_guides/evaluation/edge_metrics.html#positive-predictive-value-precision","title":"Positive Predictive Value (Precision)","text":"

The Positive Predictive Value (Precision), is the proportion of predicted matches which are true matches.

\\[\\textsf{Precision} = \\frac{\\textsf{True Positives}}{\\textsf{All Predicted Positives}} = \\frac{\\textsf{True Positives}}{\\textsf{True Positives} + \\textsf{False Positives}}\\] Precision in Splink
  • Precision can be calculated in Splink using the accuracy_analysis_from_labels_column and accuracy_analysis_from_labels_table methods. Check out the splink.evaluation docs for more.
"},{"location":"topic_guides/evaluation/edge_metrics.html#negative-predictive-value","title":"Negative Predictive Value","text":"

The Negative Predictive Value is the proportion of predicted non-matches which are true non-matches.

\\[\\textsf{Negative Predictive Value} = \\frac{\\textsf{True Negatives}}{\\textsf{All Predicted Negatives}} = \\frac{\\textsf{True Negatives}}{\\textsf{True Negatives} + \\textsf{False Negatives}}\\] Negative Predictive Value in Splink
  • Negative predictive value can be calculated in Splink using the accuracy_analysis_from_labels_column and accuracy_analysis_from_labels_table methods. Check out the splink.evaluation docs for more.
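Each of the four metrics above is a different ratio over the same confusion matrix. A minimal sketch, with counts invented for illustration:

```python
# Invented confusion-matrix counts for a batch of pairwise comparisons
tp, tn, fp, fn = 80, 900, 20, 40

recall = tp / (tp + fn)         # True Positive Rate
specificity = tn / (tn + fp)    # True Negative Rate
precision = tp / (tp + fp)      # Positive Predictive Value
npv = tn / (tn + fn)            # Negative Predictive Value

print(round(recall, 3), round(specificity, 3), round(precision, 3), round(npv, 3))
# 0.667 0.978 0.8 0.957
```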

Warning

Each of these metrics looks at just one row or column of the confusion matrix. A model cannot be meaningfully summarised by just one of these performance measures.

\u201cPredicts cancer with 100% Precision\u201d - is true of a \u201cmodel\u201d that correctly identifies one known cancer patient, but misdiagnoses everyone else as cancer-free.

\u201cAI judge\u2019s verdicts have Recall of 100%\u201d - is true for a power-mad AI judge that declares everyone guilty, regardless of any evidence to the contrary.

"},{"location":"topic_guides/evaluation/edge_metrics.html#composite-metrics-for-linkage","title":"Composite Metrics for Linkage","text":"

This section covers composite metrics, i.e. combinations of the metrics that can be derived from the confusion matrix (Precision, Recall, Specificity and Negative Predictive Value).

Any comparison of two records has a number of possible outcomes (True Positives, False Positives etc.), each of which has a different impact on your specific use case. It is very rare that a single metric defines the desired behaviour of a model. Therefore, evaluating performance with a composite metric (or a combination of metrics) is advised.

"},{"location":"topic_guides/evaluation/edge_metrics.html#f-score","title":"F Score","text":"

The F-Score is a weighted harmonic mean of Precision (Positive Predictive Value) and Recall (True Positive Rate). For a general weight \\(\\beta\\):

\\[F_{\\beta} = \\frac{(1 + \\beta^2) \\cdot \\textsf{Precision} \\cdot \\textsf{Recall}}{\\beta^2 \\cdot \\textsf{Precision} + \\textsf{Recall}}\\]

where Recall is considered \\(\\beta\\) times as important as Precision.

For example, when Precision and Recall are equally weighted (\\(\\beta = 1\\)), we get:

\\[F_{1} = 2\\left[\\frac{1}{\\textsf{Precision}}+\\frac{1}{\\textsf{Recall}}\\right]^{-1} = \\frac{2 \\cdot \\textsf{Precision} \\cdot \\textsf{Recall}}{\\textsf{Precision} + \\textsf{Recall}}\\]

Other popular versions of the F score are \\(F_{2}\\) (Recall twice as important as Precision) and \\(F_{0.5}\\) (Precision twice as important as Recall).
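The general formula can be sketched directly (the precision and recall values here are invented for illustration):

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

p, r = 0.8, 0.6
print(round(f_beta(p, r, beta=1), 3))    # 0.686 - precision and recall weighted equally
print(round(f_beta(p, r, beta=2), 3))    # 0.632 - recall weighted more heavily
print(round(f_beta(p, r, beta=0.5), 3))  # 0.75  - precision weighted more heavily
```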

F-Score in Splink
  • The F score can be calculated in Splink using the accuracy_analysis_from_labels_column and accuracy_analysis_from_labels_table methods. Check out the splink.evaluation docs for more.

Warning

F-score does not account for class imbalance in the data, and is asymmetric (i.e. it considers the prediction of matching records, but ignores how well the model correctly predicts non-matching records).

"},{"location":"topic_guides/evaluation/edge_metrics.html#p4-score","title":"P4 Score","text":"

The \\(P_{4}\\) Score is the harmonic mean of the 4 metrics that can be directly derived from the confusion matrix:

\\[ 4\\left[\\frac{1}{\\textsf{Recall}}+\\frac{1}{\\textsf{Specificity}}+\\frac{1}{\\textsf{Precision}}+\\frac{1}{\\textsf{Negative Predictive Value}}\\right]^{-1} \\]

This addresses one of the issues with the F-Score as it considers how well the model predicts non-matching records as well as matching records.

Note: all metrics are given equal weighting.
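The formula can be sketched as follows (metric values invented for illustration); note how a single poor metric drags the harmonic mean down:

```python
def p4(recall, specificity, precision, npv):
    """Harmonic mean of the four confusion-matrix metrics."""
    return 4 / (1 / recall + 1 / specificity + 1 / precision + 1 / npv)

print(round(p4(0.9, 0.9, 0.9, 0.9), 3))  # 0.9 - all four metrics agree
print(round(p4(0.9, 0.9, 0.9, 0.1), 3))  # 0.3 - dragged down by one poor metric
```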

\\(P_{4}\\) in Splink
  • \\(P_{4}\\) can be calculated in Splink using the accuracy_analysis_from_labels_column and accuracy_analysis_from_labels_table methods. Check out the splink.evaluation docs for more.
"},{"location":"topic_guides/evaluation/edge_metrics.html#matthews-correlation-coefficient","title":"Matthews Correlation Coefficient","text":"

The Matthews Correlation Coefficient (\\(\\phi\\)) is a measure of the correlation between predictions and actual observations.

\\[ \\phi = \\sqrt{\\textsf{Recall} \\cdot \\textsf{Specificity} \\cdot \\textsf{Precision} \\cdot \\textsf{Negative Predictive Value}} - \\sqrt{(1 - \\textsf{Recall})(1 - \\textsf{Specificity})(1 - \\textsf{Precision})(1 - \\textsf{Negative Predictive Value})} \\] Matthews Correlation Coefficient (\\(\\phi\\)) in Splink
  • \\(\\phi\\) can be calculated in Splink using the accuracy_analysis_from_labels_column and accuracy_analysis_from_labels_table methods. Check out the splink.evaluation docs for more.

Note

Unlike the other metrics in this guide, \\(\\phi\\) is a correlation coefficient, so can range from -1 to 1 (as opposed to a range of 0 to 1).

In reality, linkage models should never be negatively correlated with actual observations, so \\(\\phi\\) can be used in the same way as other metrics.
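A sketch of the formula above (metric values invented for illustration):

```python
from math import sqrt

def matthews_phi(recall, specificity, precision, npv):
    """Matthews correlation coefficient from the four confusion-matrix metrics."""
    return sqrt(recall * specificity * precision * npv) - sqrt(
        (1 - recall) * (1 - specificity) * (1 - precision) * (1 - npv)
    )

print(round(matthews_phi(0.9, 0.9, 0.9, 0.9), 3))  # 0.8 - strong positive correlation
print(round(matthews_phi(0.5, 0.5, 0.5, 0.5), 3))  # 0.0 - no better than chance
```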

"},{"location":"topic_guides/evaluation/edge_overview.html","title":"Overview","text":""},{"location":"topic_guides/evaluation/edge_overview.html#edge-evaluation","title":"Edge Evaluation","text":"

Once you have a trained model, you use it to generate edges (links) between entities (nodes). These edges will have a Match Weight and corresponding Probability.

There are several strategies for checking whether the links created in your pipeline perform as you want/expect.

"},{"location":"topic_guides/evaluation/edge_overview.html#consider-the-edge-metrics","title":"Consider the Edge Metrics","text":"

Edge Metrics measure how links perform at an overall level.

First, consider how you would like your model to perform. What is important for your use case? Do you want to ensure that you capture all possible matches (i.e. high recall)? Or do you want to minimise the number of incorrectly predicted matches (i.e. high precision)? Perhaps a combination of both?

For a summary of all the edge metrics available in Splink, check out the Edge Metrics guide.

Note

To produce Edge Metrics you will require a \"ground truth\" to compare your linkage results against (which can be achieved by Clerical Labelling).

"},{"location":"topic_guides/evaluation/edge_overview.html#spot-checking-pairs-of-records","title":"Spot Checking pairs of records","text":"

Spot checking real examples of record pairs helps build confidence in linkage results. It is an effective way to build intuition for how the model works in practice and allows you to interrogate edge cases.

Results of individual record pairs can be examined with the Waterfall Chart.

Choosing which pairs of records to spot check can be done by either:

  • Looking at all combinations of comparison levels and choosing which to examine in the Comparison Viewer Dashboard.
  • Identifying and examining records which have been incorrectly predicted by your Splink model.

As you are checking real examples, you will often come across cases that have not been accounted for by your model which you believe signify a match (e.g. a fuzzy match for names). We recommend using this feedback loop to help iterate and improve the definition of your model.

"},{"location":"topic_guides/evaluation/edge_overview.html#choosing-a-threshold","title":"Choosing a Threshold","text":"

Threshold selection is a key decision point within a linkage pipeline. One of the major benefits of probabilistic linkage versus a deterministic (i.e. rules-based) approach is the ability to choose the amount of evidence required for two records to be considered a match (i.e. a threshold).

When you have decided on the metrics that are important for your use case, you can use the Threshold Selection Tool to get a first estimate for what your threshold should be.

Note

The Threshold Selection Tool requires labelled data to act as a \"ground truth\" to compare your linkage results against.

Once you have an initial threshold, you can use the Comparison Viewer Dashboard to look at records on either side of your threshold to check whether the threshold makes intuitive sense.

From here, we recommend an iterative process of tweaking your threshold based on your spot checking, then looking at the impact this has on your overall edge metrics. Another useful tool is prediction_errors_from_labels_table, which surfaces record pairs your model has predicted incorrectly, as demonstrated in the accuracy analysis demo.

"},{"location":"topic_guides/evaluation/edge_overview.html#in-summary","title":"In Summary","text":"

Evaluating the edges (links) of a linkage model depends on your use case. Defining what \"good\" looks like is a key step, which then allows you to choose a relevant metric (or metrics) for measuring success.

Your desired metric should help give an initial estimation for a linkage threshold, then you can use spot checking to help settle on a final threshold.

In general, the links between pairs of records are not the final output of a linkage pipeline. Most use cases group records together into clusters using these links. In this instance, evaluating the links themselves is not sufficient; you have to evaluate the resulting clusters as well.

"},{"location":"topic_guides/evaluation/labelling.html","title":"Clerical Labelling","text":""},{"location":"topic_guides/evaluation/labelling.html#clerical-labelling","title":"Clerical Labelling","text":"

This page is under construction - check back soon!

"},{"location":"topic_guides/evaluation/model.html","title":"Model","text":""},{"location":"topic_guides/evaluation/model.html#model-evaluation","title":"Model Evaluation","text":"

The parameters in a trained Splink model determine the match probability (Splink score) assigned to pairwise record comparisons. Before scoring any pairs of records there are a number of ways to check whether your model will perform as you expect.

"},{"location":"topic_guides/evaluation/model.html#look-at-the-model-parameters","title":"Look at the model parameters","text":"

The final model is summarised in the match weights chart with each bar in the chart signifying the match weight (i.e. the amount of evidence for or against a match) for each comparison level in your model.

If, after some investigation, you still can't make sense of some of the match weights, take a look at the corresponding \\(m\\) and \\(u\\) values generated to see if they themselves make sense. These can be viewed in the m u parameters chart.

Remember that \\(\\textsf{Match Weight} = \\log_2 \\frac{m}{u}\\)
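As a quick illustration of this relationship (the m and u values below are invented for the example, not real Splink output):

```python
from math import log2

# m: probability of observing this comparison level among true matches
# u: probability of observing it among non-matches
m, u = 0.9, 0.01

# A comparison level that is common among matches but rare among
# non-matches contributes strong evidence for a match
print(round(log2(m / u), 2))  # 6.49
```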

"},{"location":"topic_guides/evaluation/model.html#look-at-the-model-training","title":"Look at the model training","text":"

The behaviour of a model during training can offer some insight into its utility. The more stable a model is in the training process, the more reliable the outputs are.

Stability of model training can be seen in the Expectation Maximisation stage (for \\(m\\) training):

  • Stability across EM training sessions can be seen through the parameter estimates chart

  • Stability within each session is indicated by the speed of convergence of the algorithm. This is shown in the terminal output during training. In general, the fewer iterations required to converge, the better. You can also access convergence charts on the EM training session object.

    from splink import block_on\n\ntraining_session = linker.training.estimate_parameters_using_expectation_maximisation(\n    block_on(\"first_name\", \"surname\")\n)\ntraining_session.match_weights_interactive_history_chart()\n
"},{"location":"topic_guides/evaluation/model.html#in-summary","title":"In summary","text":"

Evaluating a trained model is not an exact science - there are no metrics which can definitively say whether a model is good or bad at this stage. In most cases, applying human logic and heuristics is the best you can do to establish whether the model is sensible. Given the variety of potential use cases of Splink, there is no perfect, universal model, just models that can be tuned to produce useful outputs for a given application.

The tools within Splink are intended to help identify areas where your model may not be performing as expected. In future releases we hope to automatically flag areas of a model that require further investigation, making this process easier for the user.

"},{"location":"topic_guides/evaluation/overview.html","title":"Overview","text":""},{"location":"topic_guides/evaluation/overview.html#evaluation-overview","title":"Evaluation Overview","text":"

Evaluation is a non-trivial, but crucial, task in data linkage. Linkage pipelines are complex and require many design decisions, each of which has an impact on the end result.

This set of topic guides is intended to provide some structure and guidance on how to evaluate a Splink model alongside its resulting links and clusters.

"},{"location":"topic_guides/evaluation/overview.html#how-do-we-evaluate-different-stages-of-the-pipeline","title":"How do we evaluate different stages of the pipeline?","text":"

Evaluation in a data linking pipeline can be broken into 3 broad categories:

"},{"location":"topic_guides/evaluation/overview.html#model-evaluation","title":"Model Evaluation","text":"

After you have trained your model, you can start evaluating the parameters and overall design of the model. To see how, check out the Model Evaluation guide.

"},{"location":"topic_guides/evaluation/overview.html#edge-link-evaluation","title":"Edge (Link) Evaluation","text":"

Once you have trained a model, you will use it to predict the probability of links (edges) between entities (nodes). To see how to evaluate these links, check out the Edge Evaluation guide.

"},{"location":"topic_guides/evaluation/overview.html#cluster-evaluation","title":"Cluster Evaluation","text":"

Once you have chosen a linkage threshold, the edges are used to generate clusters of records. To see how to evaluate these clusters, check out the Cluster Evaluation guide.

Note

In reality, the development of a linkage pipeline involves iterating through multiple versions of models, links and clusters. For example, for each model version you will generally want to understand the downstream impact on the links and clusters generated. As such, you will likely revisit each stage of evaluation a number of times before settling on a final output.

The aim of these guides, and the tools provided in Splink, is to ensure that you are able to extract enough information from each iteration to better understand how your pipeline is working and identify areas for improvement.

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html","title":"Graph metrics","text":""},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#graph-metrics","title":"Graph metrics","text":"

Graph metrics quantify the characteristics of a graph. A simple example of a graph metric is cluster size, which is the number of nodes within a cluster.

For data linking with Splink, it is useful to sort graph metrics into three categories:

  • Node metrics
  • Edge metrics
  • Cluster metrics

Each of these are defined below together with examples and explanations of how they can be applied to linked data to evaluate cluster quality. The examples cover all metrics currently available in Splink.

Note

It is important to bear in mind that whilst graph metrics can be very useful for assessing linkage quality, they are rarely definitive, especially when taken in isolation. A more comprehensive picture can be built by considering various metrics in conjunction with one another.

It is also important to consider metrics within the context of their distribution and the underlying dataset. For example: a cluster density (see below) of 0.4 might seem low but could actually be above average for the dataset in question; a cluster of size 80 might be suspiciously large for one dataset but not for another.

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#node-metrics","title":"Node metrics","text":"

Node metrics quantify the properties of the nodes which live within clusters.

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#node-degree","title":"Node Degree","text":""},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#definition","title":"Definition","text":"

Node degree is the number of edges connected to a node.

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#example","title":"Example","text":"

In the cluster below A has a node degree of 1, whereas D has a node degree of 3.
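Node degree can be counted directly from an edge list. A sketch using an invented edge list consistent with the description above (A linked only to D; D linked to A, B and C):

```python
from collections import Counter
from itertools import chain

# Hypothetical cluster: A is a leaf node, D is well connected
edges = [("A", "D"), ("B", "D"), ("C", "D"), ("B", "C")]

# Each edge contributes 1 to the degree of both its endpoints
degree = Counter(chain.from_iterable(edges))

print(degree["A"], degree["D"])  # 1 3
```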

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#application-in-data-linkage","title":"Application in Data Linkage","text":"

High node degree is generally considered good as it means there are many edges in support of records in a cluster being linked. Nodes with low node degree could indicate links being missed (false negatives) or be the result of a small number of false links (false positives).

However, erroneous links (false positives) could also be the reason for high node degree, so it can be useful to validate the edges of highly connected nodes.

It is important to consider cluster size when looking at node degree. By definition, larger clusters contain more nodes to form links between, allowing nodes within them to attain higher degrees compared to those in smaller clusters. Consequently, low node degree within larger clusters can carry greater significance.

Bear in mind that the degree of a single node in a cluster isn't necessarily representative of the overall connectedness of the cluster. This is where cluster centralisation can help.

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#edge-metrics","title":"Edge metrics","text":"

Edge metrics quantify the properties of the edges within a cluster.

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#is-bridge","title":"'is bridge'","text":""},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#definition_1","title":"Definition","text":"

An edge is classified as a 'bridge' if its removal splits a cluster into two smaller clusters.

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#example_1","title":"Example","text":"

For example, the removal of the link labelled \"Bridge\" below would break this cluster of 9 nodes into two clusters of 5 and 4 nodes, respectively.
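A bridge can be identified by checking whether removing an edge disconnects the cluster. The sketch below uses an invented 9-node edge list matching the example above; a real pipeline would typically use a graph library rather than this naive O(E^2) check:

```python
from collections import defaultdict, deque

def is_connected(nodes, edges):
    """True if every node is reachable from the first node (BFS)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        for nbr in adj[queue.popleft()]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen == set(nodes)

def bridges(nodes, edges):
    """Edges whose removal splits the cluster into two."""
    return [e for e in edges if not is_connected(nodes, [x for x in edges if x != e])]

# Two dense sub-clusters of 5 and 4 nodes, joined by a single edge
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 5), (4, 5),  # 5-node sub-cluster
         (6, 7), (6, 8), (7, 8), (7, 9), (8, 9),          # 4-node sub-cluster
         (5, 6)]                                          # the joining edge
nodes = range(1, 10)

print(bridges(nodes, edges))  # [(5, 6)]
```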

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#application-in-data-linkage_1","title":"Application in Data Linkage","text":"

Bridges can signal false positives in linked data, especially when they join two highly connected sub-clusters. Examining bridges can shed light on issues with the linking process that lead to the formation of false positive links.

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#cluster-metrics","title":"Cluster metrics","text":"

Cluster metrics refer to the characteristics of a cluster as a whole, rather than the individual nodes and edges it contains.

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#cluster-size","title":"Cluster Size","text":""},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#definition_2","title":"Definition","text":"

Cluster size refers to the number of nodes within a cluster.

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#example_2","title":"Example","text":"

The cluster below is of size 5.

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#application-in-data-linkage_2","title":"Application in Data Linkage","text":"

When thinking about cluster size, it is often useful to consider the biggest clusters produced and ask yourself if the sizes seem reasonable for the dataset being linked. For example, when linking people, does it make sense that an individual appears hundreds of times in the linked data, resulting in a cluster of over 100 nodes? If the answer is no, then false positive links are probably being formed.

If you don't have an intuition of what seems reasonable, then it is worth inspecting a sample of the largest clusters in Splink's Cluster Studio Dashboard to validate (or invalidate) links. From there you can develop an understanding of what maximum cluster size to expect for your linkage. Bear in mind that a large and highly dense cluster is usually less suspicious than a large low-density cluster.

There may also be a lower bound on cluster size. For example, when linking two datasets in which you know each person appears at least once, the minimum expected cluster size is 2. Clusters smaller than this minimum indicate that links have been missed.

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#cluster-density","title":"Cluster Density","text":""},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#definition_3","title":"Definition","text":"

The density of a cluster is given by the number of edges it contains divided by the maximum possible number of edges. Density ranges from 0 to 1. A density of 1 means that all nodes are connected to all other nodes in a cluster.

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#example_3","title":"Example","text":"

The left cluster below has links between all nodes (giving a density of 1), whereas the right cluster has the minimum number of edges (4) to link 5 nodes together (giving a density of 0.4).
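Density follows directly from the definition. A minimal sketch for the two clusters described above:

```python
def density(n_nodes, n_edges):
    """Edges present divided by the maximum possible number of edges."""
    max_edges = n_nodes * (n_nodes - 1) / 2
    return n_edges / max_edges

print(density(5, 10))  # 1.0 - all 5 nodes connected to each other
print(density(5, 4))   # 0.4 - the minimum 4 edges needed to link 5 nodes
```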

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#application-in-data-linkage_3","title":"Application in Data Linkage","text":"

When evaluating clusters, a high density (closer to 1) is generally considered good as it means there are many edges in support of the records in a cluster being linked.

A low density could indicate links being missed. This could happen, for example, if blocking rules are too tight or the clustering threshold is too high.

A sample of low density clusters can be inspected in Splink's Cluster Studio Dashboard via the option sampling_method = \"lowest_density_clusters_by_size\", which performs stratified sampling across different cluster sizes. When inspecting a cluster, ask yourself the question: why aren't more links being formed between record nodes?
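As a sketch, generating a dashboard of low-density clusters might look like the following (this assumes a trained linker plus the df_predict and df_clustered outputs from earlier steps; check the cluster_studio_dashboard signature against your Splink version):

```python
# Illustrative usage: write a dashboard sampling the lowest-density
# clusters, stratified by cluster size, to an HTML file.
linker.visualisations.cluster_studio_dashboard(
    df_predict,
    df_clustered,
    "cluster_studio.html",
    sampling_method="lowest_density_clusters_by_size",
)
```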

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#cluster-centralisation","title":"Cluster Centralisation","text":"

Work in Progress

We are still working out where Cluster Centralisation can be best used in the context of record linkage. At this stage, we do not have clear recommendations or guidance on the best places to use it - so if you have any expertise in this area we would love to hear from you!

We will update this guidance as and when we have clearer strategies in this space.

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#definition_4","title":"Definition","text":"

Cluster centralisation is defined as the deviation from maximum node degree normalised with respect to the maximum possible value. In other words, cluster centralisation tells us about the concentration of edges in a cluster. Centralisation ranges from 0 to 1.
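One standard way to compute this is Freeman degree centralisation, assumed here as an illustration (Splink's exact formula may differ):

```python
def cluster_centralisation(degrees: list[int]) -> float:
    """Sum of deviations of node degrees from the maximum degree,
    normalised by the largest value that sum can take, (n - 1) * (n - 2)."""
    n = len(degrees)
    if n < 3:
        return float("nan")  # undefined for clusters of fewer than 3 nodes
    d_max = max(degrees)
    return sum(d_max - d for d in degrees) / ((n - 1) * (n - 2))
```

Under this definition, a star-shaped cluster (one hub connected to every other node) scores 1.0, while a fully connected cluster, where every node has the same degree, scores 0.0.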

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#example_4","title":"Example","text":"

Coming Soon

"},{"location":"topic_guides/evaluation/clusters/graph_metrics.html#application-in-data-linkage_4","title":"Application in Data Linkage","text":"

A high cluster centralisation (closer to 1) indicates that a few nodes are home to significantly more connections compared to the rest of the nodes in a cluster. This can help identify clusters containing nodes with a lower number of connections (low node degree) relative to what is possible for that cluster.

Low centralisation suggests that edges are more evenly distributed amongst nodes in a cluster. This can be good if all nodes within a cluster enjoy many connections. However, low centralisation could also indicate that most nodes are not as highly connected as they could be. To check for this, look at low centralisation in conjunction with low density.

A guide on how to compute graph metrics mentioned above with Splink is given in the next chapter.

Please note, this topic guide is a work in progress and we welcome any feedback.

"},{"location":"topic_guides/evaluation/clusters/how_to_compute_metrics.html","title":"How to compute graph metrics","text":""},{"location":"topic_guides/evaluation/clusters/how_to_compute_metrics.html#how-to-compute-graph-metrics-with-splink","title":"How to compute graph metrics with Splink","text":""},{"location":"topic_guides/evaluation/clusters/how_to_compute_metrics.html#introduction-to-the-compute_graph_metrics-method","title":"Introduction to the compute_graph_metrics() method","text":"

To enable users to calculate a variety of graph metrics for their linked data, Splink provides the compute_graph_metrics() method.

The method is called on the linker like so:

linker.clustering.compute_graph_metrics(df_predict, df_clustered, threshold_match_probability=0.95)\n

Parameters:

  • df_predict (SplinkDataFrame): The results of linker.inference.predict(). Required.
  • df_clustered (SplinkDataFrame): The outputs of linker.clustering.cluster_pairwise_predictions_at_threshold(). Required.
  • threshold_match_probability (float): Filter the pairwise match predictions to include only pairwise comparisons with a match_probability at or above this threshold. If not provided, the value will be taken from metadata on df_clustered. If no such metadata is available, this value must be provided. Default: None.

Warning

threshold_match_probability should be the same as the clustering threshold passed to cluster_pairwise_predictions_at_threshold(). If this information is available to Splink then it will be passed automatically, otherwise the user will have to provide it themselves and take care to ensure that threshold values align.

The method generates tables containing graph metrics (for nodes, edges and clusters), and returns a data class of Splink dataframes. The individual Splink dataframes containing node, edge and cluster metrics can be accessed as follows:

graph_metrics = linker.clustering.compute_graph_metrics(\n    pairwise_predictions, clusters\n)\n\ndf_edges = graph_metrics.edges.as_pandas_dataframe()\ndf_nodes = graph_metrics.nodes.as_pandas_dataframe()\ndf_clusters = graph_metrics.clusters.as_pandas_dataframe()\n

The metrics computed by compute_graph_metrics() include all those mentioned in the Graph metrics chapter, namely:

  • Node degree
  • 'Is bridge'
  • Cluster size
  • Cluster density
  • Cluster centralisation

All of these metrics are calculated by default. If you are unable to install the igraph package required for 'is bridge', this metric won't be calculated; however, all other metrics will still be generated.

"},{"location":"topic_guides/evaluation/clusters/how_to_compute_metrics.html#full-code-example","title":"Full code example","text":"

This code snippet computes graph metrics for a simple Splink dedupe model. A pandas dataframe of cluster metrics is displayed as the final output.

import splink.comparison_library as cl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\ndf = splink_datasets.historical_50k\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.ExactMatch(\n            \"first_name\",\n        ).configure(term_frequency_adjustments=True),\n        cl.JaroWinklerAtThresholds(\"surname\", score_threshold_or_thresholds=[0.9, 0.8]),\n        cl.LevenshteinAtThresholds(\n            \"postcode_fake\", distance_threshold_or_thresholds=[1, 2]\n        ),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"postcode_fake\", \"first_name\"),\n        block_on(\"first_name\", \"surname\"),\n        block_on(\"dob\", \"substr(postcode_fake,1,2)\"),\n        block_on(\"postcode_fake\", \"substr(dob,1,3)\"),\n        block_on(\"postcode_fake\", \"substr(dob,4,5)\"),\n    ],\n    retain_intermediate_calculation_columns=True,\n)\n\ndb_api = DuckDBAPI()\nlinker = Linker(df, settings, db_api)\n\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    block_on(\"first_name\", \"surname\")\n)\n\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    block_on(\"dob\", \"substr(postcode_fake, 1,3)\")\n)\n\npairwise_predictions = linker.inference.predict()\nclusters = linker.clustering.cluster_pairwise_predictions_at_threshold(\n    pairwise_predictions, 0.95\n)\n\ngraph_metrics = linker.clustering.compute_graph_metrics(pairwise_predictions, clusters)\n\ndf_clusters = graph_metrics.clusters.as_pandas_dataframe()\n
df_clusters\n
         cluster_id  n_nodes  n_edges   density  cluster_centralisation\n0        Q5076213-1       10     31.0  0.688889                0.250000\n1         Q760788-1        9     30.0  0.833333                0.214286\n2      Q88466525-10        3      3.0  1.000000                0.000000\n3       Q88466525-1       10     37.0  0.822222                0.222222\n4        Q1386511-1       13     47.0  0.602564                0.272727\n...             ...      ...      ...       ...                     ...\n21346   Q1562561-16        1      0.0       NaN                     NaN\n21347   Q15999964-5        1      0.0       NaN                     NaN\n21348   Q5363139-12        1      0.0       NaN                     NaN\n21349    Q4722328-5        1      0.0       NaN                     NaN\n21350   Q7528564-13        1      0.0       NaN                     NaN\n

21351 rows \u00d7 5 columns

"},{"location":"topic_guides/evaluation/clusters/overview.html","title":"Overview","text":""},{"location":"topic_guides/evaluation/clusters/overview.html#cluster-evaluation","title":"Cluster Evaluation","text":"

Graphs provide a natural way to think about linked data (see the \"Linked data as graphs\" guide for a refresher). Visualising linked data as a graph and employing graph metrics are powerful ways to evaluate linkage quality.

Graph metrics help to give a big-picture view of the clusters generated by a Splink model. Through metric distributions and statistics, we can gauge the quality of clusters and monitor how adjustments to models impact results.

Graph metrics can also help us home in on problematic clusters, such as those containing inaccurate links (false positives). Spot-checking can be performed with Splink\u2019s Cluster Studio Dashboard which enables users to visualise individual clusters and interrogate the links between their member records.

"},{"location":"topic_guides/evaluation/clusters/overview.html#evaluating-cluster-quality","title":"Evaluating cluster quality","text":""},{"location":"topic_guides/evaluation/clusters/overview.html#what-is-a-high-quality-cluster","title":"What is a high quality cluster?","text":"

When it comes to data linking, the highest quality clusters will be those containing all possible true matches (no missed links, a.k.a. false negatives) and no false matches (no false positives). In other words, they contain precisely those nodes corresponding to records about the same entity.

Generating clusters which all adhere to this ideal is rare in practice. For example,

  • Blocking rules, necessary to make computations tractable, can prevent record comparisons between some true matches ever being made
  • Data limitations can place an upper bound on the level of quality achievable

Despite this, graph metrics can help us get closer to a satisfactory level of quality as well as monitor it going forward.

"},{"location":"topic_guides/evaluation/clusters/overview.html#what-does-cluster-quality-look-like-for-you","title":"What does cluster quality look like for you?","text":"

The extent of cluster evaluation efforts and what is considered 'good enough' will vary greatly with linkage use-case. You might already have labelled data or quality assured outputs from another model which define a clear benchmark for cluster quality.

Domain knowledge can also set expectations of what is deemed reasonable or good. For example, you might already know that a large cluster (containing say 100 nodes) is suspicious for your deduplicated dataset.

However, you may currently have little or no knowledge about the data, nor a clear idea of what good quality clusters look like for your linkage.

Whatever the starting point, this topic guide is designed to help users develop a better understanding of their clusters and help focus quality assurance efforts to get the best out of their linkage models.

"},{"location":"topic_guides/evaluation/clusters/overview.html#what-this-topic-guide-contains","title":"What this topic guide contains","text":"
  • An introduction to the graph metrics currently available in Splink and how to apply them to linked data
  • Instructions on how to compute graph metrics with Splink

Please note, this topic guide is a work in progress and we welcome any feedback.

"},{"location":"topic_guides/performance/drivers_of_performance.html","title":"Run times, performance and linking large data","text":"

This topic guide covers the fundamental drivers of the run time of Splink jobs.

","tags":["Performance"]},{"location":"topic_guides/performance/drivers_of_performance.html#blocking","title":"Blocking","text":"

The primary driver of run time is the number of record pairs that the Splink model has to process. In Splink, the number of pairs to consider is reduced using Blocking Rules which are covered in depth in their own set of topic guides.

","tags":["Performance"]},{"location":"topic_guides/performance/drivers_of_performance.html#complexity-of-comparisons","title":"Complexity of comparisons","text":"

More complex comparisons reduce performance. Complexity is added to comparisons in a number of ways, including:

  • Increasing the number of comparison levels
  • Using more computationally expensive comparison functions
  • Adding Term Frequency Adjustments

Performant Term Frequency Adjustments

Model training with Term Frequency adjustments can be made more performant by setting estimate_without_term_frequencies parameter to True in estimate_parameters_using_expectation_maximisation.
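For example (a sketch assuming a configured linker and the block_on helper; the blocking rule shown is illustrative):

```python
# Skip term frequency adjustments during the EM iterations; they are
# applied once at the end instead, which is usually much faster.
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "surname"),
    estimate_without_term_frequencies=True,
)
```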

","tags":["Performance"]},{"location":"topic_guides/performance/drivers_of_performance.html#retaining-columns-through-the-linkage-process","title":"Retaining columns through the linkage process","text":"

The size of your dataset has an impact on the performance of Splink. The same applies to the tables that Splink creates and uses under the hood. Some Splink functionality requires additional calculated columns to be stored. For example:

  • The comparison_viewer_dashboard requires retain_matching_columns and retain_intermediate_calculation_columns to be set to True in the settings dictionary, but this makes some processes less performant.
","tags":["Performance"]},{"location":"topic_guides/performance/drivers_of_performance.html#filtering-out-pairwise-in-the-predict-step","title":"Filtering out pairwise in the predict() step","text":"

Reducing the number of pairwise comparisons that need to be returned will make Splink perform faster. One way of doing this is to filter comparisons with a match score below a given threshold (using a threshold_match_probability or threshold_match_weight) when you call predict().
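For example (a sketch assuming a trained linker):

```python
# Only return pairwise comparisons with match_probability >= 0.9
df_predict = linker.inference.predict(threshold_match_probability=0.9)
```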

","tags":["Performance"]},{"location":"topic_guides/performance/drivers_of_performance.html#spark-performance","title":"Spark Performance","text":"

Spark is designed to distribute processing across multiple machines, so there are additional configuration options available to make jobs run more quickly. For more information, check out the Spark Performance Topic Guide.

Balancing computational performance and model accuracy

There is usually a trade-off between performance and accuracy in Splink models. That is, some model design decisions that improve computational performance can have a negative impact on the accuracy of the model.

Be sure to check how the suggestions in this topic guide impact the accuracy of your model to ensure the best results.

","tags":["Performance"]},{"location":"topic_guides/performance/optimising_duckdb.html","title":"Optimising DuckDB performance","text":"","tags":["Performance","DuckDB","Salting","Parallelism"]},{"location":"topic_guides/performance/optimising_duckdb.html#optimising-duckdb-jobs","title":"Optimising DuckDB jobs","text":"

This topic guide describes how to configure DuckDB to optimise performance.

It is assumed readers have already read the more general guide to linking big data, and have chosen appropriate blocking rules.

","tags":["Performance","DuckDB","Salting","Parallelism"]},{"location":"topic_guides/performance/optimising_duckdb.html#summary","title":"Summary:","text":"
  • From splink==3.9.11 onwards, DuckDB generally parallelises jobs well, so you should see 100% usage of all CPU cores for the main Splink operations (parameter estimation and prediction)
  • In some cases predict() needs salting on blocking_rules_to_generate_predictions to achieve 100% CPU use. You're most likely to need this in the following scenarios:
    • Very high core count machines
    • Splink models that contain a small number of blocking_rules_to_generate_predictions
    • Splink models that have a relatively small number of input rows (less than around 500k)
  • If you are facing memory issues with DuckDB, you have the option of using an on-disk database.
  • Reducing the amount of parallelism by removing salting can also sometimes reduce memory usage

You can find a blog post with formal benchmarks of DuckDB performance on a variety of machine types here.

","tags":["Performance","DuckDB","Salting","Parallelism"]},{"location":"topic_guides/performance/optimising_duckdb.html#configuration","title":"Configuration","text":"","tags":["Performance","DuckDB","Salting","Parallelism"]},{"location":"topic_guides/performance/optimising_duckdb.html#ensuring-100-cpu-usage-across-all-cores-on-predict","title":"Ensuring 100% CPU usage across all cores on predict()","text":"

The aim is for the overall parallelism of the predict() step to closely align with the number of threads/vCPU cores you have:

  • If parallelism is too low, you won't use all your threads
  • If parallelism is too high, runtime will be longer

The number of CPU cores used is given by the following formula:

\\(\\text{base parallelism} = \\frac{\\text{number of input rows}}{122,880}\\)

\\(\\text{blocking rule parallelism}\\)

\\(= \\text{count of blocking rules} \\times\\) \\(\\text{number of salting partitions per blocking rule}\\)

\\(\\text{overall parallelism} = \\text{base parallelism} \\times \\text{blocking rule parallelism}\\)

If overall parallelism is less than the total number of threads, then you won't achieve 100% CPU usage.

","tags":["Performance","DuckDB","Salting","Parallelism"]},{"location":"topic_guides/performance/optimising_duckdb.html#example","title":"Example","text":"

Consider a deduplication job with 1,000,000 input rows, on a machine with 32 cores (64 threads)

In our Splink suppose we set:

settings = {\n    ...\n    \"blocking_rules_to_generate_predictions\": [\n        block_on([\"first_name\"], salting_partitions=2),\n        block_on([\"dob\"], salting_partitions=2),\n        block_on([\"surname\"], salting_partitions=2),\n    ],\n    ...\n}\n

Then we have:

  • Base parallelism of 9.
  • 3 blocking rules
  • 2 salting partitions per blocking rule

We therefore have parallelism of \(9 \times 3 \times 2 = 54\), which is less than the 64 threads, so we won't quite achieve full parallelism.
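This worked example can be checked with a short calculation (a sketch of the formula; it assumes base parallelism rounds up to a whole number):

```python
import math

ROWS_PER_UNIT = 122_880  # divisor from the base parallelism formula

def overall_parallelism(n_rows: int, salting_partitions: list[int]) -> int:
    """salting_partitions holds one entry per blocking rule (1 = no salting)."""
    base = math.ceil(n_rows / ROWS_PER_UNIT)  # 1,000,000 rows -> 9
    return base * sum(salting_partitions)

print(overall_parallelism(1_000_000, [2, 2, 2]))  # 54, less than 64 threads
```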

","tags":["Performance","DuckDB","Salting","Parallelism"]},{"location":"topic_guides/performance/optimising_duckdb.html#generalisation","title":"Generalisation","text":"

The above formula for overall parallelism assumes all blocking rules have the same number of salting partitions, which is not necessarily the case. In the more general case of variable numbers of salting partitions, the formula becomes

\\[ \\text{overall parallelism} = \\text{base parallelism} \\times \\text{total number of salted blocking partitions across all blocking rules} \\]

So for example, with two blocking rules, if the first has 2 salting partitions and the second has 10 salting partitions, then we would multiply base parallelism by 12.

This may be useful in the case where one blocking rule produces more comparisons than another: the 'bigger' blocking rule can be salted more.

For further information about how parallelism works in DuckDB, including links to relevant DuckDB documentation and discussions, see here.

","tags":["Performance","DuckDB","Salting","Parallelism"]},{"location":"topic_guides/performance/optimising_duckdb.html#running-out-of-memory","title":"Running out of memory","text":"

If your job is running out of memory, the first thing to consider is tightening your blocking rules, or running the workload on a larger machine.

If these are not possible, the following config options may help reduce memory usage:

","tags":["Performance","DuckDB","Salting","Parallelism"]},{"location":"topic_guides/performance/optimising_duckdb.html#using-an-on-disk-database","title":"Using an on-disk database","text":"

DuckDB can spill to disk using several settings:

Use the special :temporary: connection built into Splink, which creates a temporary on-disk database:

linker = Linker(\n    df, settings, DuckDBAPI(connection=\":temporary:\")\n)\n

Use an on-disk database:

con = duckdb.connect(database='my-db.duckdb')\nlinker = Linker(\n    df, settings, DuckDBAPI(connection=con)\n)\n

Use an in-memory database, but ensure it can spill to disk:

con = duckdb.connect(\":memory:\")\n\ncon.execute(\"SET temp_directory='/path/to/temp';\")\nlinker = Linker(\n    df, settings, DuckDBAPI(connection=con)\n)\n

See also this section of the DuckDB docs

","tags":["Performance","DuckDB","Salting","Parallelism"]},{"location":"topic_guides/performance/optimising_duckdb.html#reducing-salting","title":"Reducing salting","text":"

Empirically we have noticed that there is a tension between parallelism and total memory usage. If you're running out of memory, you could consider reducing parallelism.

","tags":["Performance","DuckDB","Salting","Parallelism"]},{"location":"topic_guides/performance/optimising_spark.html","title":"Optimising Spark performance","text":"","tags":["Performance","Spark","Salting","Parallelism"]},{"location":"topic_guides/performance/optimising_spark.html#optimising-spark-jobs","title":"Optimising Spark jobs","text":"

This topic guide describes how to configure Spark to optimise performance - especially large linkage jobs which are slow or are not completing using default settings.

It is assumed readers have already read the more general guide to linking big data, and that blocking rules are proportionate to the size of the Spark cluster. As a very rough guide, on a small cluster of (say) 8 machines, we recommend starting with blocking rules that generate around 100 million comparisons. Once this is working, loosening the blocking rules to around 1 billion comparisons or more is often achievable.

","tags":["Performance","Spark","Salting","Parallelism"]},{"location":"topic_guides/performance/optimising_spark.html#summary","title":"Summary:","text":"
  • Ensure blocking rules are not generating too many comparisons.
  • We recommend setting the break_lineage_method to \"parquet\", which is the default
  • num_partitions_on_repartition should be set so that each file in the output of predict() is roughly 100MB.
  • Try setting spark.default.parallelism to around 5x the number of CPUs in your cluster

For a cluster with 10 CPUs, that outputs about 8GB of data in parquet format, the following setup may be appropriate:

spark.conf.set(\"spark.default.parallelism\", \"50\")\nspark.conf.set(\"spark.sql.shuffle.partitions\", \"50\")\n\nlinker = Linker(\n    person_standardised_nodes,\n    settings,\n    db_api=spark_api,\n    break_lineage_method=\"parquet\",\n    num_partitions_on_repartition=80,\n)\n
","tags":["Performance","Spark","Salting","Parallelism"]},{"location":"topic_guides/performance/optimising_spark.html#breaking-lineage","title":"Breaking lineage","text":"

Splink uses an iterative algorithm for model training and, more generally, its query lineage is long and complex. We have found that big jobs fail to complete without further optimisation. This is a well-known problem:

Quote

\"This long lineage bottleneck is widely known by sophisticated Spark application programmers. A common practice for dealing with long lineage is to have the application program strategically checkpoint RDDs at code locations that truncate much of the lineage for checkpointed data and resume computation immediately from the checkpoint.\"

Splink will automatically break lineage in sensible places. We have found in practice that, when running Spark jobs backed by AWS S3, the fastest method of breaking lineage is persisting outputs to .parquet file.

You can do this using the break_lineage_method parameter as follows:

linker = Linker(\n    person_standardised_nodes,\n    settings,\n    db_api=db_api,\n    break_lineage_method=\"parquet\"\n)\n

Other options are checkpoint and persist. For different Spark setups, particularly if you have fast local storage, you may find these options perform better.

","tags":["Performance","Spark","Salting","Parallelism"]},{"location":"topic_guides/performance/optimising_spark.html#spark-parallelism","title":"Spark Parallelism","text":"

We suggest setting default parallelism to roughly 5x the number of CPUs in your cluster. This is a very rough rule of thumb, and if you're encountering performance problems you may wish to experiment with different values.

One way to set default parallelism is as follows:

from pyspark.context import SparkContext, SparkConf\nfrom pyspark.sql import SparkSession\n\nconf = SparkConf()\n\nconf.set(\"spark.default.parallelism\", \"50\")\nconf.set(\"spark.sql.shuffle.partitions\", \"50\")\n\nsc = SparkContext.getOrCreate(conf=conf)\nspark = SparkSession(sc)\n

In general, increasing parallelism will make Spark 'chunk' your job into a larger number of smaller tasks. This may solve memory issues. But note there is a tradeoff here: if you increase parallelism too far, Spark may take too much time scheduling large numbers of tasks, and may even run out of memory performing this work. See here. Also note that when blocking, jobs cannot be split into a larger number of tasks than the cardinality of the blocking rule. For example, if you block on month of birth, the job will be split into at most 12 tasks, irrespective of the parallelism setting. See here. You can use salting (below) to partially address this limitation.

","tags":["Performance","Spark","Salting","Parallelism"]},{"location":"topic_guides/performance/optimising_spark.html#repartition-after-blocking","title":"Repartition after blocking","text":"

For some jobs, setting repartition_after_blocking=True when you initialise the SparkAPI may improve performance.

","tags":["Performance","Spark","Salting","Parallelism"]},{"location":"topic_guides/performance/optimising_spark.html#salting","title":"Salting","text":"

For very large jobs, you may find that salting your blocking keys results in faster run times.

","tags":["Performance","Spark","Salting","Parallelism"]},{"location":"topic_guides/performance/optimising_spark.html#general-spark-config","title":"General Spark config","text":"

Splink generates large numbers of record comparisons from relatively small input datasets. This is an unusual type of workload, and so default Spark parameters are not always appropriate. Some of the issues encountered are similar to performance issues encountered with Cartesian joins - so some of the tips in relevant articles may help.

","tags":["Performance","Spark","Salting","Parallelism"]},{"location":"topic_guides/performance/salting.html","title":"Salting blocking rules","text":"","tags":["Performance","Salting","Spark"]},{"location":"topic_guides/performance/salting.html#salting-blocking-rules","title":"Salting blocking rules","text":"

For very large linkages using Apache Spark, Splink supports salting blocking rules.

Under certain conditions, this can help Spark better parallelise workflows, leading to shorter run times, and avoiding out of memory errors. It is most likely to help where you have blocking rules that create very large numbers of comparisons (100m records+) and where there is skew in how record comparisons are made (e.g. blocking on full name creates more comparisons amongst 'John Smith's than many other names).

Further information about the motivation for salting can be found here.

Note that salting is only available for the Spark backend

","tags":["Performance","Salting","Spark"]},{"location":"topic_guides/performance/salting.html#how-to-use-salting","title":"How to use salting","text":"

To enable salting using the Linker with Spark, you provide some of your blocking rules as a dictionary rather than a string.

This enables you to choose the number of salts for each blocking rule.

Blocking rules provided as plain strings default to no salting (salting_partitions = 1)

The following code snippet illustrates:

import logging\n\nfrom pyspark.context import SparkConf, SparkContext\nfrom pyspark.sql import SparkSession\n\nimport splink.comparison_library as cl\nfrom splink import Linker, SparkAPI, splink_datasets\n\nconf = SparkConf()\nconf.set(\"spark.driver.memory\", \"12g\")\nconf.set(\"spark.sql.shuffle.partitions\", \"8\")\nconf.set(\"spark.default.parallelism\", \"8\")\n\nsc = SparkContext.getOrCreate(conf=conf)\nspark = SparkSession(sc)\nspark.sparkContext.setCheckpointDir(\"./tmp_checkpoints\")\n\nsettings = {\n    \"probability_two_random_records_match\": 0.01,\n    \"link_type\": \"dedupe_only\",\n    \"blocking_rules_to_generate_predictions\": [\n        \"l.dob = r.dob\",\n        {\"blocking_rule\": \"l.first_name = r.first_name\", \"salting_partitions\": 4},\n    ],\n    \"comparisons\": [\n        cl.LevenshteinAtThresholds(\"first_name\", 2),\n        cl.ExactMatch(\"surname\"),\n        cl.ExactMatch(\"dob\"),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        cl.ExactMatch(\"email\"),\n    ],\n    \"retain_matching_columns\": True,\n    \"retain_intermediate_calculation_columns\": True,\n    \"additional_columns_to_retain\": [\"cluster\"],\n    \"max_iterations\": 1,\n    \"em_convergence\": 0.01,\n}\n\n\ndf = splink_datasets.fake_1000\n\nspark_api = SparkAPI(spark_session=spark)\nlinker = Linker(df, settings, db_api=spark_api)\nlogging.getLogger(\"splink\").setLevel(5)\n\nlinker.inference.deterministic_link()\n

And we can see that salting has been applied by looking at the SQL generated in the log:

SELECT\n  l.unique_id AS unique_id_l,\n  r.unique_id AS unique_id_r,\n  l.first_name AS first_name_l,\n  r.first_name AS first_name_r,\n  l.surname AS surname_l,\n  r.surname AS surname_r,\n  l.dob AS dob_l,\n  r.dob AS dob_r,\n  l.city AS city_l,\n  r.city AS city_r,\n  l.tf_city AS tf_city_l,\n  r.tf_city AS tf_city_r,\n  l.email AS email_l,\n  r.email AS email_r,\n  l.`group` AS `group_l`,\n  r.`group` AS `group_r`,\n  '0' AS match_key\nFROM __splink__df_concat_with_tf AS l\nINNER JOIN __splink__df_concat_with_tf AS r\n  ON l.dob = r.dob\nWHERE\n  l.unique_id < r.unique_id\nUNION ALL\nSELECT\n  l.unique_id AS unique_id_l,\n  r.unique_id AS unique_id_r,\n  l.first_name AS first_name_l,\n  r.first_name AS first_name_r,\n  l.surname AS surname_l,\n  r.surname AS surname_r,\n  l.dob AS dob_l,\n  r.dob AS dob_r,\n  l.city AS city_l,\n  r.city AS city_r,\n  l.tf_city AS tf_city_l,\n  r.tf_city AS tf_city_r,\n  l.email AS email_l,\n  r.email AS email_r,\n  l.`group` AS `group_l`,\n  r.`group` AS `group_r`,\n  '1' AS match_key\nFROM __splink__df_concat_with_tf AS l\nINNER JOIN __splink__df_concat_with_tf AS r\n  ON l.first_name = r.first_name\n  AND CEIL(l.__splink_salt * 4) = 1\n  AND NOT (\n    COALESCE((\n        l.dob = r.dob\n    ), FALSE)\n  )\nWHERE\n  l.unique_id < r.unique_id\nUNION ALL\nSELECT\n  l.unique_id AS unique_id_l,\n  r.unique_id AS unique_id_r,\n  l.first_name AS first_name_l,\n  r.first_name AS first_name_r,\n  l.surname AS surname_l,\n  r.surname AS surname_r,\n  l.dob AS dob_l,\n  r.dob AS dob_r,\n  l.city AS city_l,\n  r.city AS city_r,\n  l.tf_city AS tf_city_l,\n  r.tf_city AS tf_city_r,\n  l.email AS email_l,\n  r.email AS email_r,\n  l.`group` AS `group_l`,\n  r.`group` AS `group_r`,\n  '1' AS match_key\nFROM __splink__df_concat_with_tf AS l\nINNER JOIN __splink__df_concat_with_tf AS r\n  ON l.first_name = r.first_name\n  AND CEIL(l.__splink_salt * 4) = 2\n  AND NOT (\n    COALESCE((\n        l.dob = r.dob\n    ), FALSE)\n  )\nWHERE\n  
l.unique_id < r.unique_id\nUNION ALL\nSELECT\n  l.unique_id AS unique_id_l,\n  r.unique_id AS unique_id_r,\n  l.first_name AS first_name_l,\n  r.first_name AS first_name_r,\n  l.surname AS surname_l,\n  r.surname AS surname_r,\n  l.dob AS dob_l,\n  r.dob AS dob_r,\n  l.city AS city_l,\n  r.city AS city_r,\n  l.tf_city AS tf_city_l,\n  r.tf_city AS tf_city_r,\n  l.email AS email_l,\n  r.email AS email_r,\n  l.`group` AS `group_l`,\n  r.`group` AS `group_r`,\n  '1' AS match_key\nFROM __splink__df_concat_with_tf AS l\nINNER JOIN __splink__df_concat_with_tf AS r\n  ON l.first_name = r.first_name\n  AND CEIL(l.__splink_salt * 4) = 3\n  AND NOT (\n    COALESCE((\n        l.dob = r.dob\n    ), FALSE)\n  )\nWHERE\n  l.unique_id < r.unique_id\nUNION ALL\nSELECT\n  l.unique_id AS unique_id_l,\n  r.unique_id AS unique_id_r,\n  l.first_name AS first_name_l,\n  r.first_name AS first_name_r,\n  l.surname AS surname_l,\n  r.surname AS surname_r,\n  l.dob AS dob_l,\n  r.dob AS dob_r,\n  l.city AS city_l,\n  r.city AS city_r,\n  l.tf_city AS tf_city_l,\n  r.tf_city AS tf_city_r,\n  l.email AS email_l,\n  r.email AS email_r,\n  l.`group` AS `group_l`,\n  r.`group` AS `group_r`,\n  '1' AS match_key\nFROM __splink__df_concat_with_tf AS l\nINNER JOIN __splink__df_concat_with_tf AS r\n  ON l.first_name = r.first_name\n  AND CEIL(l.__splink_salt * 4) = 4\n  AND NOT (\n    COALESCE((\n        l.dob = r.dob\n    ), FALSE)\n  )\nWHERE\n  l.unique_id < r.unique_id\n
","tags":["Performance","Salting","Spark"]},{"location":"topic_guides/splink_fundamentals/link_type.html","title":"Link type - linking vs deduping","text":"","tags":["Dedupe","Link","Link and Dedupe"]},{"location":"topic_guides/splink_fundamentals/link_type.html#link-type-linking-deduping-or-both","title":"Link type: Linking, Deduping or Both","text":"

Splink allows data to be linked, deduplicated or both.

Linking refers to finding links between datasets, whereas deduplication refers to finding links within a dataset.

Data linking is therefore only meaningful when more than one dataset is provided.

This guide shows how to specify the settings dictionary and initialise the linker for the three link types.

","tags":["Dedupe","Link","Link and Dedupe"]},{"location":"topic_guides/splink_fundamentals/link_type.html#deduplication","title":"Deduplication","text":"

The dedupe_only link type expects the user to provide a single input table, and is specified as follows

from splink import Linker, SettingsCreator\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n)\n\nlinker = Linker(df, settings, db_api=dbapi)\n
","tags":["Dedupe","Link","Link and Dedupe"]},{"location":"topic_guides/splink_fundamentals/link_type.html#link-only","title":"Link only","text":"

The link_only link type expects the user to provide a list of input tables, and is specified as follows:

from splink import Linker, SettingsCreator\n\nsettings = SettingsCreator(\n    link_type=\"link_only\",\n)\n\nlinker = Linker(\n    [df_1, df_2, df_n],\n    settings,\n    db_api=dbapi,\n    input_table_aliases=[\"name1\", \"name2\", \"name3\"],\n)\n

The input_table_aliases argument is optional and is used to label the tables in the outputs. If not provided, defaults will be chosen automatically by Splink.

","tags":["Dedupe","Link","Link and Dedupe"]},{"location":"topic_guides/splink_fundamentals/link_type.html#link-and-dedupe","title":"Link and dedupe","text":"

The link_and_dedupe link type expects the user to provide a list of input tables, and is specified as follows:

from splink import Linker, SettingsCreator\n\nsettings = SettingsCreator(\n    link_type=\"link_and_dedupe\",\n)\n\nlinker = Linker(\n    [df_1, df_2, df_n],\n    settings,\n    db_api=dbapi,\n    input_table_aliases=[\"name1\", \"name2\", \"name3\"],\n)\n

The input_table_aliases argument is optional and is used to label the tables in the outputs. If not provided, defaults will be chosen automatically by Splink.

","tags":["Dedupe","Link","Link and Dedupe"]},{"location":"topic_guides/splink_fundamentals/querying_splink_results.html","title":"Retrieving and querying Splink results","text":"","tags":["SQL","Data Frames","SplinkDataFrame"]},{"location":"topic_guides/splink_fundamentals/querying_splink_results.html#retrieving-and-querying-splink-results","title":"Retrieving and Querying Splink Results","text":"

When Splink returns results, it does so in the format of a SplinkDataFrame. This allows Splink to provide results in a uniform format across the different database backends.

For example, when you run df_predict = linker.predict(), the result df_predict is a SplinkDataFrame.

A SplinkDataFrame is an abstraction of a table in the underlying backend database, and provides several convenience methods for interacting with the underlying table. For detailed information check the full API.

","tags":["SQL","Data Frames","SplinkDataFrame"]},{"location":"topic_guides/splink_fundamentals/querying_splink_results.html#converting-to-other-types","title":"Converting to other types","text":"

You can convert a SplinkDataFrame into a Pandas dataframe using splink_df.as_pandas_dataframe().

To view the first few records use a limit statement: splink_df.as_pandas_dataframe(limit=10).

For large linkages, it is not recommended to convert the whole SplinkDataFrame to pandas: Splink results can be very large, so converting them can be slow and cause out-of-memory errors. It is usually better to use SQL to query the tables directly.

","tags":["SQL","Data Frames","SplinkDataFrame"]},{"location":"topic_guides/splink_fundamentals/querying_splink_results.html#querying-tables","title":"Querying tables","text":"

You can find out the name of the table in the underlying database using splink_df.physical_name. This enables you to run SQL queries directly against the results. You can execute queries using linker.misc.query_sql - this is the recommended approach as it's typically faster and more memory efficient than using pandas dataframes.

The following is an example of this approach, in which we use SQL to find the best match to each input record in a link_type=\"link_only\" job (i.e. remove duplicate matches):

# linker is a Linker with link_type set to \"link_only\"\ndf_predict = linker.predict(threshold_match_probability=0.75)\n\nsql = f\"\"\"\nwith ranked as\n(\nselect *,\nrow_number() OVER (\n    PARTITION BY unique_id_l order by match_weight desc\n    ) as row_number\nfrom {df_predict.physical_name}\n)\n\nselect *\nfrom ranked\nwhere row_number = 1\n\"\"\"\n\ndf_query_result = linker.misc.query_sql(sql)  # pandas dataframe\n

Note that linker.misc.query_sql will return a pandas dataframe by default, but you can instead return a SplinkDataFrame as follows:

df_query_result = linker.misc.query_sql(sql, output_type='splink_df')\n
","tags":["SQL","Data Frames","SplinkDataFrame"]},{"location":"topic_guides/splink_fundamentals/querying_splink_results.html#saving-results","title":"Saving results","text":"

If you have a SplinkDataFrame, you may wish to store the results in some file outside of your database. As tables may be large, there are a couple of convenience methods for doing this directly without needing to load the table into memory. Currently Splink supports saving frames to either csv or parquet format. Of these we generally recommend the latter, as it is typed, compressed, column-oriented, and easily supports nested data.

To save results, simply use the methods to_csv() or to_parquet() - for example:

df_predict = linker.inference.predict()\ndf_predict.to_parquet(\"splink_predictions.parquet\", overwrite=True)\n# or alternatively:\ndf_predict.to_csv(\"splink_predictions.csv\", overwrite=True)\n
","tags":["SQL","Data Frames","SplinkDataFrame"]},{"location":"topic_guides/splink_fundamentals/querying_splink_results.html#creating-a-splinkdataframe","title":"Creating a SplinkDataFrame","text":"

You can create a SplinkDataFrame for any table in your database. You will need to already have a linker to manage interactions with the database:

import pandas as pd\nimport duckdb\n\nfrom splink import Linker, SettingsCreator, DuckDBAPI\nfrom splink.datasets import splink_datasets\n\ncon = duckdb.connect()\ndf_numbers = pd.DataFrame({\"id\": [1, 2, 3], \"number\": [\"one\", \"two\", \"three\"]})\ncon.sql(\"CREATE TABLE number_table AS SELECT * FROM df_numbers\")\n\ndb_api = DuckDBAPI(connection=con)\ndf = splink_datasets.fake_1000\n\nlinker = Linker(df, settings=SettingsCreator(link_type=\"dedupe_only\"), db_api=db_api)\nsplink_df = linker.table_management.register_table(\"number_table\", \"a_templated_name\")\nsplink_df.as_pandas_dataframe()\n
```","tags":["SQL","Data Frames","SplinkDataFrame"]},{"location":"topic_guides/splink_fundamentals/settings.html","title":"Defining Splink models","text":"","tags":["settings","Dedupe","Link","Link and Dedupe","Comparisons","Blocking Rules"]},{"location":"topic_guides/splink_fundamentals/settings.html#defining-a-splink-model","title":"Defining a Splink Model","text":"","tags":["settings","Dedupe","Link","Link and Dedupe","Comparisons","Blocking Rules"]},{"location":"topic_guides/splink_fundamentals/settings.html#what-makes-a-splink-model","title":"What makes a Splink Model?","text":"

When building any linkage model in Splink, there are 3 key things which need to be defined:

  1. What type of linkage you want (defined by the link type)
  2. What pairs of records to consider (defined by blocking rules)
  3. What features to consider, and how they should be compared (defined by comparisons)
","tags":["settings","Dedupe","Link","Link and Dedupe","Comparisons","Blocking Rules"]},{"location":"topic_guides/splink_fundamentals/settings.html#defining-a-splink-model-with-a-settings-dictionary","title":"Defining a Splink model with a settings dictionary","text":"

All aspects of a Splink model are defined via the SettingsCreator object.

For example, consider a simple model:

import splink.comparison_library as cl\nimport splink.comparison_template_library as ctl\nfrom splink import SettingsCreator, block_on\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"surname\"),\n    ],\n    comparisons=[\n        ctl.NameComparison(\"first_name\"),\n        ctl.NameComparison(\"surname\"),\n        ctl.DateComparison(\n            \"dob\",\n            input_is_string=True,\n            datetime_metrics=[\"month\", \"year\"],\n            datetime_thresholds=[\n                1,\n                1,\n            ],\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        ctl.EmailComparison(\"email\"),\n    ],\n)\n

Where:

1. Type of linkage

The \"link_type\" is defined as a deduplication for a single dataset.

    link_type=\"dedupe_only\",\n

2. Pairs of records to consider

The \"blocking_rules_to_generate_predictions\" define a subset of pairs of records for the model to be considered when making predictions. In this case, where there is a match on:

  • first_name
  • OR surname.
    blocking_rules_to_generate_predictions=[\n            block_on(\"first_name\"),\n            block_on(\"surname\"),\n        ],\n

For more information on how blocking is used in Splink, see the dedicated topic guide.

3. Features to consider, and how they should be compared

The \"comparisons\" define the features to be compared between records: \"first_name\", \"surname\", \"dob\", \"city\" and \"email\".

    comparisons=[\n        ctl.NameComparison(\"first_name\"),\n        ctl.NameComparison(\"surname\"),\n        ctl.DateComparison(\n            \"dob\",\n            input_is_string=True,\n            datetime_metrics=[\"month\", \"year\"],\n            datetime_thresholds=[\n                1,\n                1,\n            ],\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        ctl.EmailComparison(\"email\"),\n    ],\n

Using functions from the comparison library to define how these features should be compared.

For more information on how comparisons are defined, see the dedicated topic guide.

With our finalised settings object, we can train a Splink model using the following code:

Example model using the settings dictionary
import splink.comparison_library as cl\nimport splink.comparison_template_library as ctl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\ndb_api = DuckDBAPI()\ndf = splink_datasets.fake_1000\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"surname\"),\n    ],\n    comparisons=[\n        ctl.NameComparison(\"first_name\"),\n        ctl.NameComparison(\"surname\"),\n        ctl.DateComparison(\n            \"dob\",\n            input_is_string=True,\n            datetime_metrics=[\"month\", \"year\"],\n            datetime_thresholds=[\n                1,\n                1,\n            ],\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        ctl.EmailComparison(\"email\"),\n    ],\n)\n\nlinker = Linker(df, settings, db_api=db_api)\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n\nblocking_rule_for_training = block_on(\"first_name\", \"surname\")\nlinker.training.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)\n\nblocking_rule_for_training = block_on(\"dob\")\nlinker.training.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)\n\npairwise_predictions = linker.inference.predict()\n\nclusters = linker.clustering.cluster_pairwise_predictions_at_threshold(pairwise_predictions, 0.95)\nclusters.as_pandas_dataframe(limit=5)\n
","tags":["settings","Dedupe","Link","Link and Dedupe","Comparisons","Blocking Rules"]},{"location":"topic_guides/splink_fundamentals/settings.html#advanced-usage-of-the-settings-dictionary","title":"Advanced usage of the settings dictionary","text":"

The section above refers to the three key aspects of the Splink settings dictionary. There are a variety of other, lesser-used settings, which can be found as the arguments to the SettingsCreator.

","tags":["settings","Dedupe","Link","Link and Dedupe","Comparisons","Blocking Rules"]},{"location":"topic_guides/splink_fundamentals/settings.html#saving-a-trained-model","title":"Saving a trained model","text":"

Once you have a trained Splink model, it is often helpful to save it out. The save_model_to_json function allows the user to save the specifications of their trained model.

linker.misc.save_model_to_json(\"model.json\")\n

which, using the example settings and model training from above, gives the following output:

Model JSON

When the splink model is saved to disk using linker.misc.save_model_to_json(\"model.json\") these settings become:

{\n    \"link_type\": \"dedupe_only\",\n    \"probability_two_random_records_match\": 0.0008208208208208208,\n    \"retain_matching_columns\": true,\n    \"retain_intermediate_calculation_columns\": false,\n    \"additional_columns_to_retain\": [],\n    \"sql_dialect\": \"duckdb\",\n    \"linker_uid\": \"29phy7op\",\n    \"em_convergence\": 0.0001,\n    \"max_iterations\": 25,\n    \"bayes_factor_column_prefix\": \"bf_\",\n    \"term_frequency_adjustment_column_prefix\": \"tf_\",\n    \"comparison_vector_value_column_prefix\": \"gamma_\",\n    \"unique_id_column_name\": \"unique_id\",\n    \"source_dataset_column_name\": \"source_dataset\",\n    \"blocking_rules_to_generate_predictions\": [\n        {\n            \"blocking_rule\": \"l.\\\"first_name\\\" = r.\\\"first_name\\\"\",\n            \"sql_dialect\": \"duckdb\"\n        },\n        {\n            \"blocking_rule\": \"l.\\\"surname\\\" = r.\\\"surname\\\"\",\n            \"sql_dialect\": \"duckdb\"\n        }\n    ],\n    \"comparisons\": [\n        {\n            \"output_column_name\": \"first_name\",\n            \"comparison_levels\": [\n                {\n                    \"sql_condition\": \"\\\"first_name_l\\\" IS NULL OR \\\"first_name_r\\\" IS NULL\",\n                    \"label_for_charts\": \"first_name is NULL\",\n                    \"is_null_level\": true\n                },\n                {\n                    \"sql_condition\": \"\\\"first_name_l\\\" = \\\"first_name_r\\\"\",\n                    \"label_for_charts\": \"Exact match on first_name\",\n                    \"m_probability\": 0.48854806009621365,\n                    \"u_probability\": 0.0056770619302010565\n                },\n                {\n                    \"sql_condition\": \"jaro_winkler_similarity(\\\"first_name_l\\\", \\\"first_name_r\\\") >= 0.9\",\n                    \"label_for_charts\": \"Jaro-Winkler distance of first_name >= 0.9\",\n                    \"m_probability\": 0.1903763096120358,\n          
          \"u_probability\": 0.003424501164330396\n                },\n                {\n                    \"sql_condition\": \"jaro_winkler_similarity(\\\"first_name_l\\\", \\\"first_name_r\\\") >= 0.8\",\n                    \"label_for_charts\": \"Jaro-Winkler distance of first_name >= 0.8\",\n                    \"m_probability\": 0.08609678978546921,\n                    \"u_probability\": 0.006620702251038765\n                },\n                {\n                    \"sql_condition\": \"ELSE\",\n                    \"label_for_charts\": \"All other comparisons\",\n                    \"m_probability\": 0.23497884050628137,\n                    \"u_probability\": 0.9842777346544298\n                }\n            ],\n            \"comparison_description\": \"jaro_winkler at thresholds 0.9, 0.8 vs. anything else\"\n        },\n        {\n            \"output_column_name\": \"surname\",\n            \"comparison_levels\": [\n                {\n                    \"sql_condition\": \"\\\"surname_l\\\" IS NULL OR \\\"surname_r\\\" IS NULL\",\n                    \"label_for_charts\": \"surname is NULL\",\n                    \"is_null_level\": true\n                },\n                {\n                    \"sql_condition\": \"\\\"surname_l\\\" = \\\"surname_r\\\"\",\n                    \"label_for_charts\": \"Exact match on surname\",\n                    \"m_probability\": 0.43210610613512185,\n                    \"u_probability\": 0.004322481469643699\n                },\n                {\n                    \"sql_condition\": \"jaro_winkler_similarity(\\\"surname_l\\\", \\\"surname_r\\\") >= 0.9\",\n                    \"label_for_charts\": \"Jaro-Winkler distance of surname >= 0.9\",\n                    \"m_probability\": 0.2514700606335103,\n                    \"u_probability\": 0.002907020988387136\n                },\n                {\n                    \"sql_condition\": \"jaro_winkler_similarity(\\\"surname_l\\\", \\\"surname_r\\\") >= 
0.8\",\n                    \"label_for_charts\": \"Jaro-Winkler distance of surname >= 0.8\",\n                    \"m_probability\": 0.0757748206402343,\n                    \"u_probability\": 0.0033636211436311888\n                },\n                {\n                    \"sql_condition\": \"ELSE\",\n                    \"label_for_charts\": \"All other comparisons\",\n                    \"m_probability\": 0.2406490125911336,\n                    \"u_probability\": 0.989406876398338\n                }\n            ],\n            \"comparison_description\": \"jaro_winkler at thresholds 0.9, 0.8 vs. anything else\"\n        },\n        {\n            \"output_column_name\": \"dob\",\n            \"comparison_levels\": [\n                {\n                    \"sql_condition\": \"\\\"dob_l\\\" IS NULL OR \\\"dob_r\\\" IS NULL\",\n                    \"label_for_charts\": \"dob is NULL\",\n                    \"is_null_level\": true\n                },\n                {\n                    \"sql_condition\": \"\\\"dob_l\\\" = \\\"dob_r\\\"\",\n                    \"label_for_charts\": \"Exact match on dob\",\n                    \"m_probability\": 0.39025358731716286,\n                    \"u_probability\": 0.0016036280808555408\n                },\n                {\n                    \"sql_condition\": \"damerau_levenshtein(\\\"dob_l\\\", \\\"dob_r\\\") <= 1\",\n                    \"label_for_charts\": \"Damerau-Levenshtein distance of dob <= 1\",\n                    \"m_probability\": 0.1489444378965258,\n                    \"u_probability\": 0.0016546990388445707\n                },\n                {\n                    \"sql_condition\": \"ABS(EPOCH(try_strptime(\\\"dob_l\\\", '%Y-%m-%d')) - EPOCH(try_strptime(\\\"dob_r\\\", '%Y-%m-%d'))) <= 2629800.0\",\n                    \"label_for_charts\": \"Abs difference of 'transformed dob <= 1 month'\",\n                    \"m_probability\": 0.08866691175438302,\n                    \"u_probability\": 
0.002594404665842722\n                },\n                {\n                    \"sql_condition\": \"ABS(EPOCH(try_strptime(\\\"dob_l\\\", '%Y-%m-%d')) - EPOCH(try_strptime(\\\"dob_r\\\", '%Y-%m-%d'))) <= 31557600.0\",\n                    \"label_for_charts\": \"Abs difference of 'transformed dob <= 1 year'\",\n                    \"m_probability\": 0.10518866178811104,\n                    \"u_probability\": 0.030622146410222362\n                },\n                {\n                    \"sql_condition\": \"ELSE\",\n                    \"label_for_charts\": \"All other comparisons\",\n                    \"m_probability\": 0.26694640124381713,\n                    \"u_probability\": 0.9635251218042348\n                }\n            ],\n            \"comparison_description\": \"Exact match vs. Damerau-Levenshtein distance <= 1 vs. month difference <= 1 vs. year difference <= 1 vs. anything else\"\n        },\n        {\n            \"output_column_name\": \"city\",\n            \"comparison_levels\": [\n                {\n                    \"sql_condition\": \"\\\"city_l\\\" IS NULL OR \\\"city_r\\\" IS NULL\",\n                    \"label_for_charts\": \"city is NULL\",\n                    \"is_null_level\": true\n                },\n                {\n                    \"sql_condition\": \"\\\"city_l\\\" = \\\"city_r\\\"\",\n                    \"label_for_charts\": \"Exact match on city\",\n                    \"m_probability\": 0.561103053663773,\n                    \"u_probability\": 0.052019405886043986,\n                    \"tf_adjustment_column\": \"city\",\n                    \"tf_adjustment_weight\": 1.0\n                },\n                {\n                    \"sql_condition\": \"ELSE\",\n                    \"label_for_charts\": \"All other comparisons\",\n                    \"m_probability\": 0.438896946336227,\n                    \"u_probability\": 0.947980594113956\n                }\n            ],\n            
\"comparison_description\": \"Exact match 'city' vs. anything else\"\n        },\n        {\n            \"output_column_name\": \"email\",\n            \"comparison_levels\": [\n                {\n                    \"sql_condition\": \"\\\"email_l\\\" IS NULL OR \\\"email_r\\\" IS NULL\",\n                    \"label_for_charts\": \"email is NULL\",\n                    \"is_null_level\": true\n                },\n                {\n                    \"sql_condition\": \"\\\"email_l\\\" = \\\"email_r\\\"\",\n                    \"label_for_charts\": \"Exact match on email\",\n                    \"m_probability\": 0.5521904988218763,\n                    \"u_probability\": 0.0023577568563241916\n                },\n                {\n                    \"sql_condition\": \"NULLIF(regexp_extract(\\\"email_l\\\", '^[^@]+', 0), '') = NULLIF(regexp_extract(\\\"email_r\\\", '^[^@]+', 0), '')\",\n                    \"label_for_charts\": \"Exact match on transformed email\",\n                    \"m_probability\": 0.22046667643566936,\n                    \"u_probability\": 0.0010970118706508391\n                },\n                {\n                    \"sql_condition\": \"jaro_winkler_similarity(\\\"email_l\\\", \\\"email_r\\\") >= 0.88\",\n                    \"label_for_charts\": \"Jaro-Winkler distance of email >= 0.88\",\n                    \"m_probability\": 0.21374764835824084,\n                    \"u_probability\": 0.0007367990176013098\n                },\n                {\n                    \"sql_condition\": \"jaro_winkler_similarity(NULLIF(regexp_extract(\\\"email_l\\\", '^[^@]+', 0), ''), NULLIF(regexp_extract(\\\"email_r\\\", '^[^@]+', 0), '')) >= 0.88\",\n                    \"label_for_charts\": \"Jaro-Winkler distance of transformed email >= 0.88\",\n                    \"u_probability\": 0.00027834629553827263\n                },\n                {\n                    \"sql_condition\": \"ELSE\",\n                    \"label_for_charts\": 
\"All other comparisons\",\n                    \"m_probability\": 0.013595176384213488,\n                    \"u_probability\": 0.9955300859598853\n                }\n            ],\n            \"comparison_description\": \"jaro_winkler on username at threshold 0.88 vs. anything else\"\n        }\n    ]\n}\n

This is simply the settings dictionary with additional entries for \"m_probability\" and \"u_probability\" in each of the \"comparison_levels\", which have been estimated during model training.

For example in the first name exact match level:

{\n    \"sql_condition\": \"\\\"first_name_l\\\" = \\\"first_name_r\\\"\",\n    \"label_for_charts\": \"Exact match on first_name\",\n    \"m_probability\": 0.48854806009621365,\n    \"u_probability\": 0.0056770619302010565\n},\n

where the m_probability and u_probability values are used to generate the match weight for an exact match on \"first_name\" between two records (i.e. the amount of evidence provided by records having the same first name) in model predictions.
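Concretely, the Bayes factor for a comparison level is m/u, and the match weight is the log base 2 of that Bayes factor. A quick check using the m and u values from the exact-match first_name level above:

```python
import math

# m and u probabilities from the "Exact match on first_name" level above
m = 0.48854806009621365
u = 0.0056770619302010565

bayes_factor = m / u                     # how much more likely a match is,
                                         # given an exact first_name agreement
match_weight = math.log2(bayes_factor)   # evidence expressed in bits

print(f"Bayes factor: {bayes_factor:.2f}")   # roughly 86
print(f"Match weight: {match_weight:.2f}")   # roughly 6.4
```

So an exact first_name match contributes around 6.4 bits of evidence in favour of the two records being the same entity; match weights from all comparisons are summed to give the overall prediction.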

","tags":["settings","Dedupe","Link","Link and Dedupe","Comparisons","Blocking Rules"]},{"location":"topic_guides/splink_fundamentals/settings.html#loading-a-pre-trained-model","title":"Loading a pre-trained model","text":"

When using a pre-trained model, you can read in the model from a json and recreate the linker object to make new pairwise predictions. For example:

linker = Linker(\n    new_df,\n    settings=\"./path/to/model.json\",\n    db_api=db_api\n)\n
","tags":["settings","Dedupe","Link","Link and Dedupe","Comparisons","Blocking Rules"]},{"location":"topic_guides/splink_fundamentals/backends/backends.html","title":"Backends overview","text":"","tags":["Spark","DuckDB","Athena","SQLite","Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/backends.html#splinks-sql-backends-spark-duckdb-etc","title":"Splink's SQL backends: Spark, DuckDB, etc","text":"

Splink is a Python library. However, it implements all data linking computations by generating SQL, and submitting the SQL statements to a backend of the user's choosing for execution.

The Splink code you write is almost identical between backends, so it's straightforward to migrate between backends. Often, it's a good idea to start working using DuckDB on a sample of data, because it will produce results very quickly. When you're comfortable with your model, you may wish to migrate to a big data backend to estimate/predict on the full dataset.

","tags":["Spark","DuckDB","Athena","SQLite","Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/backends.html#choosing-a-backend","title":"Choosing a backend","text":"","tags":["Spark","DuckDB","Athena","SQLite","Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/backends.html#considerations-when-choosing-a-sql-backend-for-splink","title":"Considerations when choosing a SQL backend for Splink","text":"

When choosing which backend to use when getting started with Splink, there are a number of factors to consider:

  • the size of the dataset(s)
  • the amount of boilerplate code/configuration required
  • access to specific (sometimes proprietary) platforms
  • the backend-specific features offered by Splink
  • the level of support and active development offered by Splink

Below is a short summary of each of the backends available in Splink.

","tags":["Spark","DuckDB","Athena","SQLite","Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/backends.html#duckdb","title":"DuckDB","text":"

DuckDB is recommended for most users for all but the largest linkages.

It is the fastest backend, and is capable of linking large datasets, especially if you have access to high-spec machines.

As a rough guide it can:

  • Link up to around 5 million records on a modern laptop (4 core/16GB RAM)
  • Link tens of millions of records very quickly on high-spec cloud computers.

For further details, see the results of formal benchmarking here.

DuckDB is also recommended because, for many users, it is the simplest to set up.

It can be run on any device with Python installed, and it is installed automatically with Splink via pip install splink. DuckDB has complete coverage for the functions in the Splink comparison libraries. Alongside the Spark linker, it receives the most attention from the development team.

See the DuckDB deduplication example notebook to get a better idea of how Splink works with DuckDB.

","tags":["Spark","DuckDB","Athena","SQLite","Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/backends.html#spark","title":"Spark","text":"

Spark is recommended for very large linkages, especially where DuckDB is performing poorly or running out of memory, or where you have easier access to a Spark cluster than to a single high-spec machine to run DuckDB.

It is not our default recommendation for most users because it involves more configuration than DuckDB, such as registering UDFs and setting up a Spark cluster, and because it is slower than DuckDB for many workloads.

The Spark linker has complete coverage for the functions in the Splink comparison libraries.

If working with Databricks, note that the Splink development team does not have access to a Databricks environment, so we may struggle to help with Databricks-specific issues.

See the Spark deduplication example notebook for an example of how Splink works with Spark.

","tags":["Spark","DuckDB","Athena","SQLite","Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/backends.html#athena","title":"Athena","text":"

Athena is a big data SQL backend provided on AWS which is great for large datasets (10+ million records). It requires access to a live AWS account and, as a persistent database, requires some additional management of the tables created by Splink. Athena has reasonable, but not complete, coverage for fuzzy matching functions (see the Presto string functions documentation: https://prestodb.io/docs/current/functions/string.html). At this time, the Athena backend is being used sparingly by the Splink development team so receives minimal levels of support.

In addition, from a development perspective, the necessity for an AWS connection makes testing Athena code more difficult, so there may be occasional bugs that would normally be caught by our testing framework.

See the Athena deduplication example notebook to get a better idea of how Splink works with Athena.

","tags":["Spark","DuckDB","Athena","SQLite","Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/backends.html#sqlite","title":"SQLite","text":"

SQLite is similar to DuckDB in that it is generally more suited to smaller datasets. SQLite is simple to set up and can be run directly in a Jupyter notebook, but it is not as performant as DuckDB. SQLite has reasonable, but not complete, coverage for the functions in the Splink comparison libraries, with gaps in array and date comparisons. String fuzzy matching, while not native to SQLite, is available via Python UDFs, which has some performance implications. SQLite is not actively used by the Splink team, so it receives minimal levels of support.

","tags":["Spark","DuckDB","Athena","SQLite","Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/backends.html#postgresql","title":"PostgreSQL","text":"

PostgreSQL is a relatively new backend, so we have not fully tested its performance or what size of datasets can be processed with Splink. The Postgres backend requires a Postgres database, so it is recommended only if you are working with a pre-existing Postgres database. Postgres has reasonable, but not complete, coverage for the functions in the Splink comparison libraries, with gaps in string fuzzy matching functionality due to the lack of some string functions in Postgres. At this time, the Postgres backend is not being actively used by the Splink development team, so it receives minimal levels of support.

More details on using Postgres as a Splink backend can be found on the Postgres page.

","tags":["Spark","DuckDB","Athena","SQLite","Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/backends.html#using-your-chosen-backend","title":"Using your chosen backend","text":"

Choose the relevant DBAPI:

Once you have initialised the linker object, there is no difference in the subsequent code between backends.

DuckDB Spark Athena SQLite PostgreSQL
from splink import Linker, DuckDBAPI\n\nlinker = Linker(your_args, db_api=DuckDBAPI())\n
from splink import Linker, SparkAPI\n\nlinker = Linker(your_args, db_api=SparkAPI())\n
from splink import Linker, AthenaAPI\n\nlinker = Linker(your_args. AthenaAPI)\n
from splink import Linker, SQLiteAPI\n\nlinker = Linker(your_args. SQLiteAPI)\n
from splink import Linker, PostgresAPI\n\nlinker = Linker(your_args. PostgresAPI)\n
","tags":["Spark","DuckDB","Athena","SQLite","Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/backends.html#additional-information-for-specific-backends","title":"Additional Information for specific backends","text":"","tags":["Spark","DuckDB","Athena","SQLite","Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/backends.html#sqlite_1","title":"SQLite","text":"

SQLite does not have native support for fuzzy string-matching functions. However, the following are available for Splink users as python user-defined functions (UDFs) which are automatically registered when calling SQLiteAPI()

  • levenshtein
  • damerau_levenshtein
  • jaro
  • jaro_winkler

However, there are a couple of points to note:

  • These functions are implemented using the RapidFuzz package, which must be installed if you wish to make use of them, via e.g. pip install rapidfuzz. If you do not wish to do so you can disable the use of these functions when creating your linker:
    SQLiteAPI(register_udfs=False)\n
  • As these functions are implemented in Python, they will be considerably slower than any native-SQL comparisons. If you find that your model training or predictions are taking a long time to run, you may wish to consider switching instead to DuckDB (or some other backend).
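As a sketch of what these UDFs compute, here is a minimal pure-Python edit distance equivalent to the `levenshtein` function. The real UDFs are backed by RapidFuzz and are much faster; this is for illustration only, not Splink code:

```python
def levenshtein(s1: str, s2: str) -> int:
    # Classic dynamic-programming edit distance: the minimum number of
    # single-character insertions, deletions or substitutions needed
    # to turn s1 into s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (c1 != c2),  # substitution
            ))
        prev = curr
    return prev[-1]

print(levenshtein("Smith", "Smyth"))  # 1
print(levenshtein("Robert", "Bob"))   # 4
```

In practice `rapidfuzz.distance.Levenshtein` performs the same calculation in optimised native code, which is why Splink registers it rather than a Python implementation like this.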
","tags":["Spark","DuckDB","Athena","SQLite","Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/postgres.html","title":"PostgreSQL","text":"","tags":["Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/postgres.html#using-postgresql-as-a-splink-backend","title":"Using PostgreSQL as a Splink backend","text":"

Splink is compatible with using PostgreSQL (often referred to simply as Postgres) as a SQL backend - for other options have a look at the overview of Splink backends.

","tags":["Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/postgres.html#setup","title":"Setup","text":"

Splink makes use of SQLAlchemy for connecting to Postgres; the default database adapter is psycopg2, but you should be able to use any other if you prefer. The PostgresAPI requires a valid engine upon creation to manage interactions with the database:

from sqlalchemy import create_engine\n\nfrom splink import Linker, PostgresAPI, SettingsCreator\n\n# create a sqlalchemy engine to manage connecting to the database\nengine = create_engine(\"postgresql+psycopg2://USER:PASSWORD@HOST:PORT/DB_NAME\")\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n)\n

You can pass data to the linker in one of two ways:

  • use the name of a pre-existing table in your database

    db_api = PostgresAPI(engine=engine)\nlinker = Linker(\n    \"my_data_table\",\n    settings,\n    db_api=db_api,\n)\n
  • or pass a pandas DataFrame directly, in which case the linker will create a corresponding table for you automatically in the database

    import pandas as pd\n\n# create pandas frame from csv\ndf = pd.read_csv(\"./my_data_table.csv\")\n\ndb_api = PostgresAPI(engine=engine)\nlinker = Linker(\n    df,\n    settings,\n    db_api=db_api,\n)\n
","tags":["Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/postgres.html#permissions","title":"Permissions","text":"

When you connect to Postgres, you must do so with a role that has sufficient privileges for Splink to operate correctly. These are:

  • CREATE ON DATABASE, to allow Splink to create a schema for working, and install the fuzzystrmatch extension
  • USAGE ON LANGUAGE SQL and USAGE ON TYPE float8 - these are required for creating the UDFs that Splink employs for calculations
","tags":["Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/postgres.html#things-to-know","title":"Things to know","text":"","tags":["Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/postgres.html#schemas","title":"Schemas","text":"

When you create a PostgresAPI, Splink will create a new schema within the database you specify - by default this schema is called splink, but you can choose another name by passing the appropriate argument when creating it:

db_api = PostgresAPI(engine=engine, schema=\"another_splink_schema\")\n
This schema is where all of Splink's work will be carried out, and where any tables created by Splink will live.

By default when looking for tables, Splink will check the schema it created, and the public schema; if you have tables in other schemas that you would like to be discoverable by Splink, you can use the parameter other_schemas_to_search:

db_api = PostgresAPI(engine=engine, other_schemas_to_search=[\"my_data_schema_1\", \"my_data_schema_2\"])\n
","tags":["Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/postgres.html#user-defined-functions-udfs","title":"User-Defined Functions (UDFs)","text":"

Splink makes use of Postgres' user-defined functions in order to operate, which are defined in the schema created by Splink when you create the linker. These functions are all defined using SQL, and are:

  • log2 - required for core Splink functionality
  • datediff - for the datediff comparison level
  • ave_months_between - for the datediff comparison level
  • array_intersect - for the array intersect comparison level

Information

The information below is only relevant if you are planning on making changes to Splink. If you are only intending to use Splink with Postgres, you do not need to read any further.

","tags":["Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/postgres.html#testing-splink-with-postgres","title":"Testing Splink with Postgres","text":"

To run only the Splink tests that run against Postgres, you can simply run:

pytest -m postgres_only tests/\n
For more information see the documentation page for testing in Splink.

The tests are run using a temporary database and user that are created at the start of the test session, and destroyed at the end.

","tags":["Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/postgres.html#postgres-via-docker","title":"Postgres via docker","text":"

If you are trying to run tests with Splink on Postgres, or simply develop using Postgres, you may prefer not to install Postgres on your system, but to run it instead using Docker. In this case you can simply run the setup script (a thin wrapper around docker-compose):

./scripts/postgres_docker/setup.sh\n
Included in the docker-compose file is a pgAdmin container to allow easy exploration of the database as you work, which can be accessed in-browser on the default port.

When you are finished you can remove these resources:

./scripts/postgres_docker/teardown.sh\n
","tags":["Postgres","Backends"]},{"location":"topic_guides/splink_fundamentals/backends/postgres.html#running-with-a-pre-existing-database","title":"Running with a pre-existing database","text":"

If you have a pre-existing Postgres server you wish to run the tests against, you will need to specify environment variables for the credentials where they differ from the defaults (given in parentheses):

  • SPLINKTEST_PG_USER (splinkognito)
  • SPLINKTEST_PG_PASSWORD (splink123!)
  • SPLINKTEST_PG_HOST (localhost)
  • SPLINKTEST_PG_PORT (5432)
  • SPLINKTEST_PG_DB (splink_db) - tests will not actually run against this, but it is from a connection to this that the temporary test database + user will be created

While care has been taken to ensure that tests are run using minimal permissions, and are cleaned up after, it is probably wise to run tests connected to a non-important database, in case anything goes wrong. In addition to the above privileges, in order to run the tests you will need:

  • CREATE DATABASE to create a temporary testing database
  • CREATEROLE to create a temporary user role with limited privileges, which will actually be used for all the SQL execution in the tests
","tags":["Postgres","Backends"]},{"location":"topic_guides/theory/fellegi_sunter.html","title":"The Fellegi-Sunter Model","text":""},{"location":"topic_guides/theory/fellegi_sunter.html#the-fellegi-sunter-model","title":"The Fellegi-Sunter model","text":"

This topic guide gives a high-level introduction to the Fellegi-Sunter model, the statistical model that underlies Splink's methodology.

For a more detailed interactive guide that aligns to Splink's methodology see Robin Linacre's interactive introduction to probabilistic linkage.

"},{"location":"topic_guides/theory/fellegi_sunter.html#parameters-of-the-fellegi-sunter-model","title":"Parameters of the Fellegi-Sunter model","text":"

The Fellegi-Sunter model has three main parameters that need to be considered to generate a match probability between two records:

  • \\(\\lambda\\) - probability that any two records match
  • \\(m\\) - probability of a given observation given the records are a match
  • \\(u\\) - probability of a given observation given the records are not a match
"},{"location":"topic_guides/theory/fellegi_sunter.html#probability","title":"\u03bb probability","text":"

The lambda (\\(\\lambda\\)) parameter is the prior probability that any two records match. I.e. assuming no other knowledge of the data, how likely is a match? Or, as a formula:

\\[ \\lambda = Pr(\\textsf{Records match}) \\]

This is the same for all record comparisons, but is highly dependent on:

  • The total number of records
  • The number of duplicate records (more duplicates increases \\(\\lambda\\))
  • The overlap between datasets
    • Two datasets covering the same cohort (high overlap, high \\(\\lambda\\))
    • Two entirely independent datasets (low overlap, low \\(\\lambda\\))
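To build intuition, \(\lambda\) can be estimated as the expected number of matching pairs divided by the total number of pairwise comparisons. The figures below are purely illustrative (not from Splink):

```python
from math import comb

# Hypothetical: a dataset of 1,000 records thought to contain 150 duplicate pairs
n_records = 1_000
n_matching_pairs = 150

total_comparisons = comb(n_records, 2)  # all possible record pairs: 499,500
lam = n_matching_pairs / total_comparisons

print(f"lambda ~ {lam:.6f}")  # ~0.0003: matching pairs are rare among all pairs
```

Even with a sizeable number of duplicates, \(\lambda\) is tiny because the number of possible comparisons grows quadratically with the number of records.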
"},{"location":"topic_guides/theory/fellegi_sunter.html#m-probability","title":"m probability","text":"

The \\(m\\) probability is the probability of a given observation given the records are a match. Or, as a formula:

\\[ m = Pr(\\textsf{Observation | Records match}) \\]

For example, consider the \(m\) probability of a match on Date of Birth (DOB). For two records that are a match, what is the probability that:

  • DOB is the same:
  • Almost 100%, say 98% \\(\\Longrightarrow m \\approx 0.98\\)
  • DOB is different:
  • Maybe a 2% chance of a data error? \\(\\Longrightarrow m \\approx 0.02\\)

The \\(m\\) probability is largely a measure of data quality - if DOB is poorly collected, it may only match exactly for 50% of true matches.

"},{"location":"topic_guides/theory/fellegi_sunter.html#u-probability","title":"u probability","text":"

The \\(u\\) probability is the probability of a given observation given the records are not a match. Or, as a formula:

\\[ u = Pr(\\textsf{Observation | Records do not match}) \\]

For example, consider the \(u\) probability of a match on Surname. For two records that are not a match, what is the probability that:

  • Surname is the same:
  • Depending on the surname, <1%? \\(\\Longrightarrow u \\approx 0.005\\)
  • Surname is different:
  • Almost 100% \\(\\Longrightarrow u \\approx 0.995\\)

The \\(u\\) probability is a measure of coincidence. As there are so many possible surnames, the chance of sharing the same surname with a randomly-selected person is small.

"},{"location":"topic_guides/theory/fellegi_sunter.html#interpreting-m-and-u","title":"Interpreting m and u","text":"

In the case of a perfect unique identifier:

  • A person is only assigned one such value - \\(m = 1\\) (match) or \\(m=0\\) (non-match)
  • A value is only ever assigned to one person - \\(u = 0\\) (match) or \\(u = 1\\) (non-match)

Where \\(m\\) and \\(u\\) deviate from these ideals can usually be intuitively explained:

"},{"location":"topic_guides/theory/fellegi_sunter.html#m-probability_1","title":"m probability","text":"

A measure of data quality/reliability.

How often might a person's information change legitimately or through data error?

  • Names: typos, aliases, nicknames, middle names, married names etc.
  • DOB: typos, estimates (e.g. 1st Jan YYYY where date not known)
  • Address: formatting issues, moving house, multiple addresses, temporary addresses
"},{"location":"topic_guides/theory/fellegi_sunter.html#u-probability_1","title":"u probability","text":"

A measure of coincidence/cardinality1.

How many different people might share a given identifier?

  • DOB (high cardinality) \u2013 for a flat age distribution spanning ~30 years, there are ~10,000 DOBs (0.01% chance of a match)
  • Sex (low cardinality) \u2013 only 2 potential values (~50% chance of a match)
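These back-of-envelope \(u\) values follow directly from cardinality, under the simplifying assumption that values are uniformly distributed (real DOB and sex distributions are not uniform):

```python
# Approximate u as 1 / cardinality, assuming a uniform distribution of values
dob_cardinality = 365 * 30   # ~30 years of possible dates of birth (~10,950)
u_dob = 1 / dob_cardinality  # chance two random people share a DOB: ~0.01%

sex_cardinality = 2
u_sex = 1 / sex_cardinality  # ~50% chance two random people share a sex

print(f"u(DOB) ~ {u_dob:.5f}")
print(f"u(sex) = {u_sex}")
```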
"},{"location":"topic_guides/theory/fellegi_sunter.html#match-weights","title":"Match Weights","text":"

One of the key measures of evidence of a match between records is the match weight.

"},{"location":"topic_guides/theory/fellegi_sunter.html#deriving-match-weights-from-m-and-u","title":"Deriving Match Weights from m and u","text":"

The match weight is a measure of the relative size of \\(m\\) and \\(u\\):

\\[ \\begin{equation} \\begin{aligned} M &= \\log_2\\left(\\frac{\\lambda}{1-\\lambda}\\right) + \\log_2 K \\\\[10pt] &= \\log_2\\left(\\frac{\\lambda}{1-\\lambda}\\right) + \\log_2 m - \\log_2 u \\end{aligned} \\end{equation} \\]

where \\(\\lambda\\) is the probability that two random records match and \\(K=m/u\\) is the Bayes factor.

A key assumption of the Fellegi-Sunter model is that observations from different columns/comparisons are independent of one another. This means that the Bayes factor for two records is the product of the Bayes factors for each column/comparison:

\\[ K_\\textsf{features} = K_\\textsf{forename} \\cdot K_\\textsf{surname} \\cdot K_\\textsf{dob} \\cdot K_\\textsf{city} \\cdot K_\\textsf{email} \\]

This, in turn, means that match weights are additive:

\\[ M_\\textsf{obs} = M_\\textsf{prior} + M_\\textsf{features} \\]

where \\(M_\\textsf{prior} = \\log_2\\left(\\frac{\\lambda}{1-\\lambda}\\right)\\) and \\(M_\\textsf{features} = M_\\textsf{forename} + M_\\textsf{surname} + M_\\textsf{dob} + M_\\textsf{city} + M_\\textsf{email}\\).

So, considering these properties, the total match weight for two observed records can be rewritten as:

\\[ \\begin{equation} \\begin{aligned} M_\\textsf{obs} &= \\log_2\\left(\\frac{\\lambda}{1-\\lambda}\\right) + \\sum_{i}^\\textsf{features}\\log_2(\\frac{m_i}{u_i}) \\\\[10pt] &= \\log_2\\left(\\frac{\\lambda}{1-\\lambda}\\right) + \\log_2\\left(\\prod_i^\\textsf{features}\\frac{m_i}{u_i}\\right) \\end{aligned} \\end{equation} \\]"},{"location":"topic_guides/theory/fellegi_sunter.html#interpreting-match-weights","title":"Interpreting Match Weights","text":"

The match weight is the central metric showing how much evidence of a match is provided by each of the features in a model. This is most easily shown through Splink's Waterfall Chart:

  • 1\ufe0f\u20e3 are the two records being compared
  • 2\ufe0f\u20e3 is the match weight of the prior, \\(M_\\textsf{prior} = \\log_2\\left(\\frac{\\lambda}{1-\\lambda}\\right)\\). This is the match weight if no additional knowledge of features is taken into account, and can be thought of as similar to the y-intercept in a simple regression.

  • 3\ufe0f\u20e3 are the match weights of each feature, \\(M_\\textsf{forename}\\), \\(M_\\textsf{surname}\\), \\(M_\\textsf{dob}\\), \\(M_\\textsf{city}\\) and \\(M_\\textsf{email}\\) respectively.

  • 4\ufe0f\u20e3 is the total match weight for two observed records, combining 2\ufe0f\u20e3 and 3\ufe0f\u20e3:

    \\[ \\begin{equation} \\begin{aligned} M_\\textsf{obs} &= M_\\textsf{prior} + M_\\textsf{forename} + M_\\textsf{surname} + M_\\textsf{dob} + M_\\textsf{city} + M_\\textsf{email} \\\\[10pt] &= -6.67 + 4.74 + 6.49 - 1.97 - 1.12 + 8.00 \\\\[10pt] &= 9.48 \\end{aligned} \\end{equation} \\]
  • 5\ufe0f\u20e3 is an axis representing the \(\textsf{match weight} = \log_2(\textsf{Bayes factor})\)

  • 6\ufe0f\u20e3 is an axis representing the equivalent match probability (noting the non-linear scale). For more on the relationship between match weight and probability, see the sections below
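The additive total in 4\ufe0f\u20e3 can be reproduced directly. The weights below are the rounded values shown in the chart, so the sum differs marginally from the chart's 9.48, which is computed from unrounded weights:

```python
m_prior = -6.67  # match weight of the prior, log2(lambda / (1 - lambda))

# Rounded per-feature match weights read off the waterfall chart
feature_weights = {
    "forename": 4.74,
    "surname": 6.49,
    "dob": -1.97,
    "city": -1.12,
    "email": 8.00,
}

# Match weights are additive, so the total is simply the sum
m_obs = m_prior + sum(feature_weights.values())
print(round(m_obs, 2))  # 9.47
```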

"},{"location":"topic_guides/theory/fellegi_sunter.html#match-probability","title":"Match Probability","text":"

Match probability is a more intuitive measure of similarity than match weight, and is generally used when choosing a similarity threshold for record matching.

"},{"location":"topic_guides/theory/fellegi_sunter.html#deriving-match-probability-from-match-weight","title":"Deriving Match Probability from Match Weight","text":"

The probability of two records being a match can be derived from the total match weight:

\\[ Pr(\\textsf{Match | Observation}) = \\frac{2^{M_\\textsf{obs}}}{1+2^{M_\\textsf{obs}}} \\] Example

Consider the example in the Interpreting Match Weights section. The total match weight, \\(M_\\textsf{obs} = 9.48\\). Therefore,

\\[ Pr(\\textsf{Match | Observation}) = \\frac{2^{9.48}}{1+2^{9.48}} \\approx 0.999 \\]"},{"location":"topic_guides/theory/fellegi_sunter.html#understanding-the-relationship-between-match-probability-and-match-weight","title":"Understanding the relationship between Match Probability and Match Weight","text":"

It can be helpful to build up some intuition for how match weight translates into match probability.

Plotting match probability versus match weight gives the following chart:

Some observations from this chart:

  • \\(\\textsf{Match weight} = 0 \\Longrightarrow \\textsf{Match probability} = 0.5\\)
  • \\(\\textsf{Match weight} = 2 \\Longrightarrow \\textsf{Match probability} = 0.8\\)
  • \\(\\textsf{Match weight} = 3 \\Longrightarrow \\textsf{Match probability} = 0.9\\)
  • \\(\\textsf{Match weight} = 4 \\Longrightarrow \\textsf{Match probability} = 0.95\\)
  • \\(\\textsf{Match weight} = 7 \\Longrightarrow \\textsf{Match probability} = 0.99\\)
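These figures can be checked by converting match weight to probability with \(2^{M}/(1+2^{M})\); note that the values listed above are rounded (e.g. a match weight of 3 actually gives \(\approx 0.889\)):

```python
def match_probability(match_weight: float) -> float:
    # Pr(Match | Observation) = 2^M / (1 + 2^M)
    bayes_factor = 2 ** match_weight
    return bayes_factor / (1 + bayes_factor)

for m in [0, 2, 3, 4, 7]:
    print(m, round(match_probability(m), 3))  # 0.5, 0.8, 0.889, 0.941, 0.992
```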

So, the impact of any additional match weight on match probability gets smaller as the total match weight increases. This makes intuitive sense: when comparing two records, once you already have a lot of evidence/features indicating a match, adding further evidence will not have much of an impact on the probability of a match.

Similarly, if you already have a lot of evidence/features indicating a non-match, adding further evidence will not have much of an impact on the probability of a match.

"},{"location":"topic_guides/theory/fellegi_sunter.html#deriving-match-probability-from-m-and-u","title":"Deriving Match Probability from m and u","text":"

Given the definitions for match probability and match weight above, we can rewrite the probability in terms of \\(m\\) and \\(u\\).

\\[ \\begin{equation} \\begin{aligned} Pr(\\textsf{Match | Observation}) &= \\frac{2^{\\log_2\\left(\\frac{\\lambda}{1-\\lambda}\\right) + \\log_2\\left(\\prod_{i}^\\textsf{features}\\frac{m_{i}}{u_{i}}\\right)}}{1+2^{\\log_2\\left(\\frac{\\lambda}{1-\\lambda}\\right) + \\log_2\\left(\\prod_{i}^\\textsf{features}\\frac{m_{i}}{u_{i}}\\right)}} \\\\[20pt] &= \\frac{\\left(\\frac{\\lambda}{1-\\lambda}\\right)\\prod_{i}^\\textsf{features}\\frac{m_{i}}{u_{i}}}{1+\\left(\\frac{\\lambda}{1-\\lambda}\\right)\\prod_{i}^\\textsf{features}\\frac{m_{i}}{u_{i}}} \\\\[20pt] &= 1 - \\left[1+\\left(\\frac{\\lambda}{1-\\lambda}\\right)\\prod_{i}^\\textsf{features}\\frac{m_{i}}{u_{i}}\\right]^{-1} \\end{aligned} \\end{equation} \\]"},{"location":"topic_guides/theory/fellegi_sunter.html#further-reading","title":"Further Reading","text":"

This academic paper provides a detailed mathematical description of the model used by the R fastLink package. The mathematics used by Splink is very similar.

  1. Cardinality is the number of items in a set. In record linkage, cardinality refers to the number of possible values a feature could have. This is important in record linkage, as the number of possible options for e.g. date of birth has a significant impact on the amount of evidence that a match on date of birth provides for two records being a match.\u00a0\u21a9

"},{"location":"topic_guides/theory/linked_data_as_graphs.html","title":"Linked Data as Graphs","text":""},{"location":"topic_guides/theory/linked_data_as_graphs.html#linked-data-as-graphs","title":"Linked data as graphs","text":"

When you link data, the results can be thought of as a graph, where each record (node) in your data is connected to other records by links (edges). This guide discusses relevant graph theory.

A graph is a collection of points (referred to in graph theory as nodes or vertices) connected by lines (referred to as edges).

Then a group of interconnected nodes is referred to as a cluster.

Graphs provide a natural way to represent linked data, where the nodes of a graph represent records being linked and the edges represent the links between them. So, if we have 5 records (A-E) in our dataset(s), with links between them, this can be represented as a graph like so:

When linking people together, a cluster represents all of the records in our dataset(s) that refer to the same person. We can give this cluster a new identifier (F) as a way of referring to this single person.

Note

For clusters produced with Splink, every edge comes with an associated Splink score (the probability of two records being a match). The clustering threshold (match_probability_threshold) supplied by the user determines which records are included in a cluster, as any links (edges) between records with a match probability below this threshold are excluded.

Clusters, specifically cluster IDs, are the ultimate output of a Splink pipeline.

"},{"location":"topic_guides/theory/linked_data_as_graphs.html#probabilistic-data-linkage-and-graphs","title":"Probabilistic data linkage and graphs","text":"

When performing probabilistic linkage, each pair of records has a score indicating how similar they are. For example, consider a collection of records with pairwise similarity scores:

Having a score associated with each pair of records is the key benefit of probabilistic linkage, as we have a measure of similarity of the records (rather than a binary link/no-link). However, we need to choose a threshold at or above which links are considered valid in order to generate our final linked data (clusters).

Let's consider a few different thresholds for the records above to see how the resulting clusters change. Setting a threshold of 0.95 keeps all links, so the records are all joined up into a single cluster.

Whereas if we increase the threshold to 0.99, one link is discarded. This breaks the records into two clusters.

Increasing the threshold further (to 0.999) breaks an additional two links, resulting in a total of three clusters.

This demonstrates that choice of threshold can have a significant impact on the final linked data produced (i.e. clusters). For more specific guidance on selecting linkage thresholds, check out the Evaluation Topic Guides.
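The effect of the threshold can be sketched with a simple union-find clustering over hypothetical pairwise scores (the scores and records below are illustrative, not the ones from the charts above):

```python
def cluster(nodes, scored_edges, threshold):
    # Union-find: records linked by an edge at or above the threshold
    # end up in the same cluster.
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, score in scored_edges:
        if score >= threshold:  # discard links below the threshold
            parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())

nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B", 0.999), ("B", "C", 0.95), ("C", "D", 0.999), ("D", "E", 0.99)]

print(len(cluster(nodes, edges, 0.95)))   # 1 cluster: every link kept
print(len(cluster(nodes, edges, 0.99)))   # 2 clusters: B-C link dropped
print(len(cluster(nodes, edges, 0.999)))  # 3 clusters: D-E link also dropped
```

Splink's own clustering uses connected components in the same spirit: raising `match_probability_threshold` removes edges and can only split clusters, never merge them.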

"},{"location":"topic_guides/theory/probabilistic_vs_deterministic.html","title":"Probabilistic vs Deterministic linkage","text":""},{"location":"topic_guides/theory/probabilistic_vs_deterministic.html#types-of-record-linkage","title":"Types of Record Linkage","text":"

There are two main types of record linkage - Deterministic and Probabilistic.

"},{"location":"topic_guides/theory/probabilistic_vs_deterministic.html#deterministic-linkage","title":"Deterministic Linkage","text":"

Deterministic Linkage is a rules-based approach for joining records together.

For example, consider a single table with duplicates:

ID | Name | DOB | Postcode
A00001 | Bob Smith | 1990-05-09 | AB12 3CD
A00002 | Robert Smith | 1990-05-09 | AB12 3CD
A00003 | Robert \u201cBobby\u201d Smith | 1990-05-09 | -

and some deterministic rules:

IF Name matches AND DOB matches (Rule 1)\nTHEN records are a match\n\nELSE\n\nIF Forename matches AND DOB matches AND Postcode match (Rule 2)\nTHEN records are a match\n\nELSE\n\nrecords do not match\n

Applying these rules to the table above leads to no matches:

A00001-A00002 | No match (different forename)
A00001-A00003 | No match (different forename)
A00002-A00003 | No match (missing postcode)
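The two rules above can be sketched in Python to confirm that no pair of the example records matches (a hypothetical implementation for illustration, not Splink code):

```python
from itertools import combinations

records = {
    "A00001": {"name": "Bob Smith", "dob": "1990-05-09", "postcode": "AB12 3CD"},
    "A00002": {"name": "Robert Smith", "dob": "1990-05-09", "postcode": "AB12 3CD"},
    "A00003": {"name": 'Robert "Bobby" Smith', "dob": "1990-05-09", "postcode": None},
}

def is_match(r1, r2):
    # Rule 1: full name and DOB both match
    if r1["name"] == r2["name"] and r1["dob"] == r2["dob"]:
        return True
    # Rule 2: forename, DOB and (non-missing) postcode all match
    forename1 = r1["name"].split()[0]
    forename2 = r2["name"].split()[0]
    if (forename1 == forename2 and r1["dob"] == r2["dob"]
            and r1["postcode"] is not None and r1["postcode"] == r2["postcode"]):
        return True
    return False

for (id1, r1), (id2, r2) in combinations(records.items(), 2):
    print(id1, id2, is_match(r1, r2))  # every pair: False
```

"Bob" vs "Robert" defeats both rules even though the records plainly refer to the same person, which is exactly the brittleness described above.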

So, even a relatively simple dataset, with duplicates that are obvious to a human, will require more complex rules.

In general, Deterministic linkage is:

\u2705 Computationally cheap
\u2705 Capable of achieving high precision (few False Positives)
\u274c Lacking in subtlety
\u274c Prone to low recall (False Negatives)

Deterministic Linkage in Splink

While Splink is primarily a tool for Probabilistic linkage, Deterministic linkage is also supported (utilising blocking rules). See the example notebooks to see how Deterministic linkage is implemented in Splink.

"},{"location":"topic_guides/theory/probabilistic_vs_deterministic.html#probabilistic-linkage","title":"Probabilistic Linkage","text":"

Probabilistic Linkage is an evidence-based approach for joining records together.

Linkage is probabilistic in the sense that it relies on the balance of evidence. In a large dataset, observing that two records match on the full name 'Robert Smith' provides some evidence that these two records may refer to the same person, but this evidence is inconclusive. However, the cumulative evidence from across multiple features within the dataset (e.g. date of birth, home address, email address) can provide conclusive evidence of a match. The evidence for a match is commonly represented as a probability.

For example, putting the first 2 records of the table above through a probabilistic model gives an overall probability that the records are a match:

In addition, the breakdown of this probability by the evidence provided by each feature can be shown through a waterfall chart:

Given these probabilities, unlike (binary) Deterministic linkage, the user can choose an evidence threshold for what they consider a match before creating a new unique identifier.

This is important, as it allows the linkage to be customised to best support the specific use case. For example, if it is important to:

  • minimise False Positive matches (i.e. where False Negatives are less of a concern), a higher threshold for a match can be chosen.
  • maximise True Positive matches (i.e. where False Positives are less of a concern), a lower threshold can be chosen.

Further Reading

For a more in-depth introduction to Probabilistic Data Linkage, including an interactive version of the waterfall chart above, see Robin Linacre's Blog.

Probabilistic Linkage in Splink

Splink is primarily a tool for Probabilistic linkage, and implements the Fellegi-Sunter model - the most common probabilistic record linkage model. See the Splink Tutorial for a step by step guide for Probabilistic linkage in Splink.

A Topic Guide on the Fellegi-Sunter model can be found here!

"},{"location":"topic_guides/theory/record_linkage.html","title":"Why do we need record linkage?","text":""},{"location":"topic_guides/theory/record_linkage.html#why-do-we-need-record-linkage","title":"Why do we need record linkage?","text":""},{"location":"topic_guides/theory/record_linkage.html#in-a-perfect-world","title":"In a perfect world","text":"

In a perfect world, everyone (and everything) would have a single, unique identifier. If this were the case, linking any datasets would be a simple inner join.

Example

Consider 2 tables of people, A and B, with no duplicates, where each person has a unique id, UID. To join these tables in SQL we would write:

SELECT *\nFROM A\nINNER JOIN B\nON A.UID = B.UID\n
"},{"location":"topic_guides/theory/record_linkage.html#in-reality","title":"In reality","text":"

Real datasets often lack truly unique identifiers (both within and across datasets).

The overall aim of record linkage is to generate a unique identifier to be used like the UID in our \"perfect world\" scenario.

Record linkage is the process of using the information within records to assess whether records refer to the same entity. For example, if records refer to people, factors such as names, date of birth, location etc. can be used to link records together.

Record linkage can be done within datasets (deduplication) or between datasets (linkage), or both.

"},{"location":"blog/category/bias.html","title":"Bias","text":""},{"location":"blog/category/feature-updates.html","title":"Feature Updates","text":""},{"location":"blog/category/ethics.html","title":"Ethics","text":""}]} \ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml new file mode 100644 index 0000000000..cfdbb5154b --- /dev/null +++ b/sitemap.xml @@ -0,0 +1,628 @@ + + + + https://moj-analytical-services.github.io/splink/index.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/getting_started.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/api_docs/api_docs_index.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/api_docs/blocking.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/api_docs/blocking_analysis.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/api_docs/clustering.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/api_docs/column_expression.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/api_docs/comparison_level_library.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/api_docs/comparison_library.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/api_docs/datasets.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/api_docs/em_training_session.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/api_docs/evaluation.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/api_docs/exploratory.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/api_docs/inference.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/api_docs/misc.html + 2024-09-15 + daily + + + 
https://moj-analytical-services.github.io/splink/api_docs/settings_dict_guide.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/api_docs/splink_dataframe.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/api_docs/table_management.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/api_docs/training.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/api_docs/visualisations.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/blog/index.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/blog/2023/07/27/splink-updates---july-2023.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/blog/2023/12/06/splink-updates---december-2023.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/blog/2024/01/23/ethics-in-data-linking.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/blog/2024/04/02/splink-3-updates-and-splink-4-development-announcement---april-2024.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/blog/2024/07/24/splink-400-released.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/blog/2024/08/19/bias-in-data-linking.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/charts/index.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/charts/accuracy_analysis_from_labels_table.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/charts/cluster_studio_dashboard.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/charts/comparison_viewer_dashboard.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/charts/completeness_chart.html + 2024-09-15 + daily + + + 
https://moj-analytical-services.github.io/splink/charts/cumulative_comparisons_to_be_scored_from_blocking_rules_chart.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/charts/m_u_parameters_chart.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/charts/match_weights_chart.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/charts/parameter_estimate_comparisons_chart.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/charts/profile_columns.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/charts/template.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/charts/tf_adjustment_chart.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/charts/threshold_selection_tool_from_labels_table.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/charts/unlinkables_chart.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/charts/waterfall_chart.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/examples/examples_index.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/examples/athena/deduplicate_50k_synthetic.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/examples/duckdb/accuracy_analysis_from_labels_column.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/examples/duckdb/cookbook.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/examples/duckdb/deduplicate_50k_synthetic.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/examples/duckdb/deterministic_dedupe.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/examples/duckdb/febrl3.html + 2024-09-15 + daily + + + 
https://moj-analytical-services.github.io/splink/demos/examples/duckdb/febrl4.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/examples/duckdb/link_only.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/examples/duckdb/pairwise_labels.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/examples/duckdb/quick_and_dirty_persons.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/examples/duckdb/real_time_record_linkage.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/examples/duckdb/transactions.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/examples/spark/deduplicate_1k_synthetic.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/examples/sqlite/deduplicate_50k_synthetic.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/tutorials/00_Tutorial_Introduction.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/tutorials/01_Prerequisites.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/tutorials/02_Exploratory_analysis.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/tutorials/03_Blocking.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/tutorials/04_Estimating_model_parameters.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/tutorials/05_Predicting_results.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/tutorials/06_Visualising_predictions.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/demos/tutorials/07_Evaluation.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/dev_guides/index.html + 
2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/dev_guides/CONTRIBUTING.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/dev_guides/caching.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/dev_guides/debug_modes.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/dev_guides/dependency_compatibility_policy.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/dev_guides/spark_pipelining_and_caching.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/dev_guides/transpilation.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/dev_guides/udfs.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/dev_guides/changing_splink/blog_posts.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/dev_guides/changing_splink/building_env_locally.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/dev_guides/changing_splink/contributing_to_docs.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/dev_guides/changing_splink/development_quickstart.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/dev_guides/changing_splink/lint_and_format.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/dev_guides/changing_splink/managing_dependencies_with_poetry.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/dev_guides/changing_splink/releases.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/dev_guides/changing_splink/testing.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/dev_guides/charts/building_charts.html + 2024-09-15 + daily + + + 
https://moj-analytical-services.github.io/splink/dev_guides/charts/understanding_and_editing_charts.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/dev_guides/settings_validation/extending_settings_validator.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/dev_guides/settings_validation/settings_validation_overview.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/includes/tags.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/includes/generated_files/dataset_labels_table.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/includes/generated_files/datasets_table.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/topic_guides_index.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/blocking/blocking_rules.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/blocking/model_training.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/blocking/performance.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/comparisons/choosing_comparators.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/comparisons/comparators.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/comparisons/comparisons_and_comparison_levels.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/comparisons/customising_comparisons.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/comparisons/out_of_the_box_comparisons.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/comparisons/phonetic.html + 2024-09-15 + daily + + + 
https://moj-analytical-services.github.io/splink/topic_guides/comparisons/regular_expressions.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/comparisons/term-frequency.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/data_preparation/feature_engineering.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/evaluation/edge_metrics.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/evaluation/edge_overview.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/evaluation/labelling.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/evaluation/model.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/evaluation/overview.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/evaluation/clusters/graph_metrics.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/evaluation/clusters/how_to_compute_metrics.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/evaluation/clusters/overview.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/performance/drivers_of_performance.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/performance/optimising_duckdb.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/performance/optimising_spark.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/performance/salting.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/splink_fundamentals/link_type.html + 2024-09-15 + daily + + + 
https://moj-analytical-services.github.io/splink/topic_guides/splink_fundamentals/querying_splink_results.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/splink_fundamentals/settings.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/splink_fundamentals/backends/backends.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/splink_fundamentals/backends/postgres.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/theory/fellegi_sunter.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/theory/linked_data_as_graphs.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/theory/probabilistic_vs_deterministic.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/topic_guides/theory/record_linkage.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/blog/category/bias.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/blog/category/feature-updates.html + 2024-09-15 + daily + + + https://moj-analytical-services.github.io/splink/blog/category/ethics.html + 2024-09-15 + daily + + \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz new file mode 100644 index 0000000000..4c75095262 Binary files /dev/null and b/sitemap.xml.gz differ diff --git a/topic_guides/blocking/blocking_rules.html b/topic_guides/blocking/blocking_rules.html new file mode 100644 index 0000000000..fcc9a33b19 --- /dev/null +++ b/topic_guides/blocking/blocking_rules.html @@ -0,0 +1,5507 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + What are Blocking Rules? - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + + + + +

What are Blocking Rules?

+

The primary driver of the run time of Splink is the number of record pairs that the Splink model has to process. This is controlled by the blocking rules.

+

This guide explains what blocking rules are, and how they can be used.

+

Introduction

+

One of the main challenges to overcome in record linkage is the scale of the problem.

+

The number of pairs of records to compare grows according to the formula \(\frac{n\left(n-1\right)}2\), i.e. with (approximately) the square of the number of records, as shown in the following chart:

+

+

For example, a dataset of 1 million input records would generate around 500 billion pairwise record comparisons.

+
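The ~500 billion figure can be verified with a quick calculation (a minimal sketch in plain Python, not part of the Splink API):

```python
def n_pairwise_comparisons(n: int) -> int:
    # Number of distinct unordered pairs of n records: n(n-1)/2
    return n * (n - 1) // 2

print(n_pairwise_comparisons(1_000))      # 499,500
print(n_pairwise_comparisons(1_000_000))  # 499,999,500,000 - roughly 500 billion
```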

So, as datasets get bigger, the computation can become infeasibly large. We use blocking to reduce the scale of the computation to something more tractable.

+

Blocking

+

Blocking is a technique for reducing the number of record pairs that are considered by a model.

+

Considering a dataset of 1 million records, comparing each record against all of the other records in the dataset generates ~500 billion pairwise comparisons. However, we know the vast majority of these record comparisons won't be matches, so processing the full ~500 billion comparisons would be largely pointless (as well as costly and time-consuming).

+

Instead, we can define a subset of potential comparisons using Blocking Rules. These are rules that define "blocks" of comparisons that should be considered. For example, the blocking rule:

+

block_on("first_name", "surname")

+

will generate only those pairwise record comparisons where first name and surname match. That is, it is equivalent to joining input records using the SQL condition l.first_name = r.first_name and l.surname = r.surname

+

Within a Splink model, you can specify multiple Blocking Rules to ensure all potential matches are considered. These are provided as a list. Splink will then produce all record comparisons that satisfy at least one of your blocking rules.

+
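To illustrate the "at least one rule" behaviour, here is a minimal pure-Python sketch. The toy records and helper function are illustrative only, not part of the Splink API — Splink performs the equivalent work in SQL, deduplicating pairs captured by more than one rule:

```python
from itertools import combinations

records = [
    {"id": 1, "first_name": "Ann",  "surname": "Smith", "postcode": "AB1"},
    {"id": 2, "first_name": "Ann",  "surname": "Smith", "postcode": "XY9"},
    {"id": 3, "first_name": "Anne", "surname": "Smith", "postcode": "AB1"},
]

def pairs_satisfying(rule_cols, records):
    # Record-id pairs where every column named in the rule is equal
    return {
        (l["id"], r["id"])
        for l, r in combinations(records, 2)
        if all(l[c] == r[c] for c in rule_cols)
    }

rules = [("first_name", "surname"), ("postcode",)]
candidate_pairs = set().union(*(pairs_satisfying(r, records) for r in rules))
print(sorted(candidate_pairs))  # [(1, 2), (1, 3)] - each pair appears once
```

Records 1 and 2 are captured by the name rule, records 1 and 3 by the postcode rule; the union is what gets scored.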
+Further Reading +

For more information on blocking, please refer to this article

+
+ +

There are two areas in Splink where blocking is used:

+
    +
  • +

    The first is to generate pairwise comparisons when finding links (running predict()). This is the sense in which 'blocking' is usually understood in the context of record linkage. These blocking rules are provided in the model settings using blocking_rules_to_generate_predictions.

    +
  • +
  • +

    The second is a less familiar application of blocking: using it for model training. This is a more advanced topic, and is covered in the model training topic guide.

    +
  • +
+

Choosing blocking_rules_to_generate_predictions

+

The blocking rules specified in your settings at blocking_rules_to_generate_predictions are the single most important determinant of how quickly your linkage runs. This is because the number of comparisons generated is usually many times higher than the number of input records.

+

How can we choose a good set of blocking rules? It's usually better to use a longer list of strict blocking rules than a short list of loose blocking rules. Let's see why:

+

The aims of our blocking rules are to:

+
    +
  • Capture as many true matches as possible
  • +
  • Reduce the total number of comparisons being generated
  • +
+

There is a tension between these aims, because by choosing loose blocking rules which generate more comparisons, you have a greater chance of capturing all true matches.

+

A single rule is unlikely to be able to achieve both aims.

+

For example, consider:

+

SettingsCreator(
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name", "surname")
+    ]
+)
+
+This will generate comparisons for all true matches where names match. But it would miss a true match where there was a typo in the name. +

This is why blocking_rules_to_generate_predictions is a list.

+

Suppose we also block on postcode:

+
SettingsCreator(
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name", "surname"),
+        block_on("postcode")
+    ]
+)
+
+

Now it doesn't matter if there's a typo in the name so long as postcode matches (and vice versa).

+

We could take this further and block on, say, date_of_birth as well.

+

By specifying a variety of blocking_rules_to_generate_predictions, even if each rule on its own is relatively tight, it becomes implausible that a truly matching record would not be captured by at least one of the rules.

+

Tightening blocking rules for linking larger datasets

+

As the size of your input data grows, tighter blocking rules may be needed. Blocking on, say, first_name and surname may be insufficiently tight to reduce the number of comparisons to a computationally tractable number.

+

In this situation, it's often best to use an even larger list of tighter blocking rules.

+

An example could be something like: +

SettingsCreator(
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name", "surname", "substr(postcode,1,3)"),
+        block_on("surname", "dob"),
+        block_on("first_name", "dob"),
+        block_on("dob", "postcode"),
+        block_on("first_name", "postcode"),
+        block_on("surname", "postcode")
+    ]
+)
+
+

Analysing blocking_rules_to_generate_predictions

+

It's generally a good idea to analyse the number of comparisons generated by your blocking rules before trying to use them to make predictions, to make sure you don't accidentally generate trillions of pairs. You can use the following function to do this:

+
from splink import block_on
from splink.blocking_analysis import count_comparisons_from_blocking_rule
+
+br = block_on("substr(first_name, 1,1)", "surname")
+
+count_comparisons_from_blocking_rule(
+        table_or_tables=df,
+        blocking_rule=br,
+        link_type="dedupe_only",
+        db_api=db_api,
+    )
+
+

More complex blocking rules

+

It is possible to use more complex blocking rules that use non-equijoin conditions. For example, you could use a blocking rule that uses a fuzzy matching function:

+
l.first_name = r.first_name and levenshtein(l.surname, r.surname) < 3
+
+

However, this will not be executed very efficiently, for reasons described in this page.

+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/blocking/model_training.html b/topic_guides/blocking/model_training.html new file mode 100644 index 0000000000..f8d8c60c9a --- /dev/null +++ b/topic_guides/blocking/model_training.html @@ -0,0 +1,5333 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Model Training Blocking Rules - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Blocking for Model Training

+

Model Training Blocking Rules determine which record pairs from a dataset are considered when training a Splink model. These are used during Expectation Maximisation (EM), where we estimate the m probability (in most cases).

+

The aim of Model Training Blocking Rules is to reduce the number of record pairs considered when training a Splink model in order to reduce the computational resource required. Each Training Blocking Rule defines a training "block" of records which have a combination of matches and non-matches that are considered by Splink's Expectation Maximisation algorithm.

+

The Expectation Maximisation algorithm seems to work best when the pairwise record comparisons are a mix of anywhere between around 0.1% and 99.9% true matches. It works less efficiently if there is a huge imbalance between the two (e.g. a billion non-matches and only a hundred matches).

+
+

Note

+

Unlike blocking rules for prediction, it does not matter if Training Rules exclude some true matches - they just need to generate examples of matches and non-matches.

+
+ +

Blocking Rules for Model Training are used as a parameter in the estimate_parameters_using_expectation_maximisation function. After a linker object has been instantiated, you can estimate m probability with training sessions such as:

+
from splink.duckdb.blocking_rule_library import block_on
+
+blocking_rule_for_training = block_on("first_name")
+linker.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+

Here, we have defined a "block" of records where first_name is the same. As names are not unique, we can be pretty sure that there will be a combination of matches and non-matches in this "block", which is what is required for the EM algorithm.

+

Matching only on first_name will likely generate a large "block" of pairwise comparisons which will take longer to run. In this case it may be worthwhile applying a stricter blocking rule to reduce runtime. For example, a match on first_name and surname:

+
from splink.duckdb.blocking_rule_library import block_on
+blocking_rule_for_training = block_on(["first_name", "surname"])
+linker.estimate_parameters_using_expectation_maximisation(
+    blocking_rule_for_training
+)
+
+

which will still have a combination of matches and non-matches, but fewer record pairs to consider.

+
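As an illustrative toy example (plain Python with made-up records, not the Splink API), tightening the blocking key shrinks the training blocks and hence the number of pairs the EM algorithm must process:

```python
from itertools import combinations

people = [
    ("Ann", "Smith"), ("Ann", "Smith"), ("Ann", "Jones"),
    ("Bob", "Lee"),   ("Bob", "Lee"),   ("Bob", "Kaur"),
]

def n_pairs_in_blocks(key):
    # Pairs of records that fall in the same "block" under the given key
    return sum(1 for a, b in combinations(people, 2) if key(a) == key(b))

print(n_pairs_in_blocks(lambda p: p[0]))  # block on first name only: 6 pairs
print(n_pairs_in_blocks(lambda p: p))     # first name and surname: 2 pairs
```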

Choosing Training Rules

+

The idea behind Training Rules is to consider "blocks" of record pairs with a mixture of matches and non-matches. In practice, most blocking rules produce a mixture of matches and non-matches, so the primary consideration should be to reduce the runtime of model training by choosing Training Rules that reduce the number of record pairs in the training set.

+

There are some tools within Splink to help choose these rules. For example, the count_num_comparisons_from_blocking_rule function gives the number of record pairs generated by a blocking rule:

+
from splink.duckdb.blocking_rule_library import block_on
+blocking_rule = block_on(["first_name", "surname"])
+linker.count_num_comparisons_from_blocking_rule(blocking_rule)
+
+
+

1056

+
+

It is recommended that you run this function to check how many comparisons are generated before training a model so that you do not needlessly run a training session on billions of comparisons.

+
+

Note

+

Unlike blocking rules for prediction, Training Rules are treated separately for each EM training session; therefore, the total number of comparisons for Model Training is simply the sum of count_num_comparisons_from_blocking_rule across all Blocking Rules (as opposed to the result of cumulative_comparisons_from_blocking_rules_records).

+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/blocking/performance.html b/topic_guides/blocking/performance.html new file mode 100644 index 0000000000..c144932016 --- /dev/null +++ b/topic_guides/blocking/performance.html @@ -0,0 +1,5422 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Computational Performance - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+ +
+ + + +
+
+ + + + + + + + + + + + +

Blocking Rule Performance

+

When considering computational performance of blocking rules, there are two main drivers to address:

+
    +
  • How many pairwise comparisons are generated
  • +
  • How long each pairwise comparison takes to run
  • +
+

Below we run through an example of how to address each of these drivers.

+

Strict vs lenient Blocking Rules

+

One way to reduce the number of comparisons being considered within a model is to apply strict blocking rules. However, this can have a significant impact on how well the Splink model works.

+

In practice, we recommend getting a model up and running with strict Blocking Rules and incrementally loosening them to see the impact on the runtime and quality of the results. By starting with strict blocking rules, the linking process will run faster, which means you can iterate through model versions more quickly.

+
+Example - Incrementally loosening Prediction Blocking Rules +

When choosing Prediction Blocking Rules, consider how blocking_rules_to_generate_predictions may be made incrementally less strict. We may start with the following rule:

+

l.first_name = r.first_name and l.surname = r.surname and l.dob = r.dob.

+

This is a very strict rule, and will only create comparisons where full name and date of birth match. This has the advantage of creating few record comparisons, but the disadvantage that the rule will miss true matches where there are typos or nulls in any of these three fields.

+

This blocking rule could be loosened to:

+

substr(l.first_name,1,1) = substr(r.first_name,1,1) and l.surname = r.surname and l.year_of_birth = r.year_of_birth

+

Now it allows for typos or aliases in the first name, so long as the first letter is the same, and errors in month or day of birth.

+

Depending on the size of your input data, the rule could be further loosened to

+

substr(l.first_name,1,1) = substr(r.first_name,1,1) and l.surname = r.surname

+

or even

+

l.surname = r.surname

+

The user could use the linker.count_num_comparisons_from_blocking_rule() function to select which rule is appropriate for their data.

+
+

Efficient Blocking Rules

+

While the number of pairwise comparisons is important for reducing the computation, it is also helpful to consider the efficiency of the Blocking Rules. There are a number of ways to define subsets of records (i.e. "blocks"), but they are not all computationally efficient.

+

From a performance perspective, here we consider two classes of blocking rule:

+
    +
  • Equi-join conditions
  • +
  • Filter conditions
  • +
+

Equi-join Conditions

+

Equi-joins are simply equality conditions between records, e.g.

+

l.first_name = r.first_name

+

Equality-based blocking rules can be executed efficiently by SQL engines in the sense that the engine is able to create only the record pairs that satisfy the blocking rule. The engine does not have to create all possible record pairs and then filter out the pairs that do not satisfy the blocking rule. This is in contrast to filter conditions (see below), where the engine has to create a larger set of comparisons and then filter it down.

+

Due to this efficiency advantage, equality-based blocking rules should be considered the default method for defining blocking rules. For example, the above example can be written as:

+
from splink import block_on
+block_on("first_name")
+
+

Filter Conditions

+

Filter conditions refer to any Blocking Rule that isn't a simple equality between columns. E.g.

+

levenshtein(l.surname, r.surname) < 3

+

Blocking rules which use similarity or distance functions, such as the example above, are inefficient as the levenshtein function needs to be evaluated for all possible record comparisons before filtering out the pairs that do not satisfy the filter condition.

+
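The difference in work done can be sketched in pure Python: an equi-join only ever builds pairs within hash groups, whereas a filter condition must be evaluated against every one of the \(\frac{n\left(n-1\right)}2\) candidate pairs. This is an illustrative sketch of the principle, not how a SQL engine is implemented internally:

```python
from collections import defaultdict
from itertools import combinations

surnames = ["smith", "smith", "smyth", "jones", "jones", "jones"]

# Equi-join: hash records by key; pairs come only from within each group.
groups = defaultdict(list)
for i, s in enumerate(surnames):
    groups[s].append(i)
equi_pairs = [p for ids in groups.values() for p in combinations(ids, 2)]

# Filter condition: the predicate must be checked on every possible pair.
all_pairs = list(combinations(range(len(surnames)), 2))

print(len(equi_pairs))  # 4 pairs created directly
print(len(all_pairs))   # 15 pairs evaluated before any filtering
```

Even on this tiny dataset the filter condition touches nearly four times as many pairs; the gap widens quadratically as the data grows.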

Combining Blocking Rules Efficiently

+

Just as the choice of Blocking Rules can impact performance, so can the way they are combined. The most efficient Blocking Rule combinations are "AND" statements. E.g.

+

block_on("first_name", "surname")

+

which is equivalent to

+

l.first_name = r.first_name AND l.surname = r.surname

+

"OR" statements are extremely inefficient and should almost never be used. E.g.

+

l.first_name = r.first_name OR l.surname = r.surname

+

In most SQL engines, an OR condition within a blocking rule will result in all possible record comparisons being generated. That is, the whole blocking rule becomes a filter condition rather than an equi-join condition, so these should be avoided. For further information, see here.

+

Instead of including the OR condition in a single blocking rule, provide two blocking rules to Splink. This will achieve the desired outcome of generating all comparisons where either the first name or surname match.

+
SettingsCreator(
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name"),
+        block_on("surname")
+    ]
+)
+
+
+Spark-specific Further Reading +

Given the ability to parallelise operations in Spark, there are some additional configuration options which can improve performance of blocking. Please refer to the Spark Performance Topic Guides for more information.

+

Note: In Spark, equi-joins are implemented using hash partitioning, which facilitates splitting the workload across multiple machines.

+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/comparisons/choosing_comparators.html b/topic_guides/comparisons/choosing_comparators.html new file mode 100644 index 0000000000..f6c7383f47 --- /dev/null +++ b/topic_guides/comparisons/choosing_comparators.html @@ -0,0 +1,6225 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Choosing string comparators - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + +

Choosing String Comparators

+

When building a Splink model, one of the most important aspects is defining the Comparisons and Comparison Levels that the model will train on. Each Comparison Level within a Comparison should contain a different amount of evidence that two records are a match, to which the model can assign a match weight. When considering different amounts of evidence for the model, it is helpful to explore fuzzy matching as a way of distinguishing strings that are similar, but not the same, as one another.

+

This guide is intended to show how Splink's string comparators perform in different situations, to help choose the most appropriate comparator for a given column as well as the most appropriate threshold (or thresholds). +For descriptions and examples of each string comparator available in Splink, see the dedicated topic guide.

+

What options are available when comparing strings?

+

There are three main classes of string comparator that are considered within Splink:

+
    +
  1. String Similarity Scores
  2. +
  3. String Distance Scores
  4. +
  5. Phonetic Matching
  6. +
+

where

+

String Similarity Scores are scores between 0 and 1 indicating how similar two strings are. 0 represents two completely dissimilar strings and 1 represents identical strings. E.g. Jaro-Winkler Similarity.

+
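For intuition about how such a similarity score is computed, here is a minimal, unoptimised implementation of the Jaro similarity. This is a sketch for illustration only — in practice Splink delegates these functions to the SQL backend:

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    matched2 = [False] * len(s2)
    m1 = []
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched2[j] = True
                m1.append(c)
                break
    if not m1:
        return 0.0
    m2 = [c for j, c in enumerate(s2) if matched2[j]]
    t = sum(a != b for a, b in zip(m1, m2)) / 2  # half the transposed matches
    m = len(m1)
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

print(round(jaro("Richard", "iRchard"), 2))  # 0.95
```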

String Distance Scores are integer distances, counting the number of operations to convert one string into another. A lower string distance indicates more similar strings. E.g. Levenshtein Distance.

+
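Similarly, the Levenshtein distance can be computed with a short dynamic program (an illustrative sketch, not the implementation Splink uses):

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the edit distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution
            ))
        prev = curr
    return prev[-1]

print(levenshtein("Richard", "iRchard"))  # 2: a transposition costs two substitutions
```

Note that Damerau-Levenshtein counts the transposition above as a single operation, which is why its score for this pair is 1 rather than 2.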

Phonetic Matching indicates whether two strings sound alike. The two strings are passed through a phonetic transformation algorithm and then the resulting phonetic codes are matched. E.g. Double Metaphone.

+
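As a simple illustration of phonetic coding, here is a sketch of the classic American Soundex algorithm — much simpler than the Double Metaphone used in practice, but it shows the idea of comparing derived codes rather than raw strings:

```python
CODES = {
    **dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"), "l": "4",
    **dict.fromkeys("mn", "5"), "r": "6",
}

def soundex(name: str) -> str:
    name = name.lower()
    result = name[0].upper()
    last = CODES.get(name[0], "")
    for c in name[1:]:
        if c in "hw":
            continue  # h/w do not break a run of identical codes
        code = CODES.get(c, "")
        if code and code != last:
            result += code
        last = code if c not in "aeiouy" else ""
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163 - a phonetic match
print(soundex("Richard"))                    # R263
```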

Comparing String Similarity and Distance Scores

+

Splink's exploratory module includes some helper functions for comparing string similarity and distance scores, which can help when choosing the most appropriate fuzzy matching function.

+

For comparing two strings, the comparator_score function returns the scores for all of the available comparators. E.g. consider a simple transposition, "Richard" vs "iRchard":

+
from splink.exploratory import similarity_analysis as sa
+
+sa.comparator_score("Richard", "iRchard")
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + +
|   | string1 | string2 | levenshtein_distance | damerau_levenshtein_distance | jaro_similarity | jaro_winkler_similarity | jaccard_similarity |
|---|---------|---------|----------------------|------------------------------|-----------------|-------------------------|--------------------|
| 0 | Richard | iRchard | 2 | 1 | 0.95 | 0.95 | 1.0 |
+
+ +

Now consider a collection of common variations of the name "Richard" - which comparators will consider these variations as sufficiently similar to "Richard"?

+
import pandas as pd
+
+data = [
+    {"string1": "Richard", "string2": "Richard", "error_type": "None"},
+    {"string1": "Richard", "string2": "ichard", "error_type": "Deletion"},
+    {"string1": "Richard", "string2": "Richar", "error_type": "Deletion"},
+    {"string1": "Richard", "string2": "iRchard", "error_type": "Transposition"},
+    {"string1": "Richard", "string2": "Richadr", "error_type": "Transposition"},
+    {"string1": "Richard", "string2": "Rich", "error_type": "Shortening"},
+    {"string1": "Richard", "string2": "Rick", "error_type": "Nickname/Alias"},
+    {"string1": "Richard", "string2": "Ricky", "error_type": "Nickname/Alias"},
+    {"string1": "Richard", "string2": "Dick", "error_type": "Nickname/Alias"},
+    {"string1": "Richard", "string2": "Rico", "error_type": "Nickname/Alias"},
+    {"string1": "Richard", "string2": "Rachael", "error_type": "Different Name"},
+    {"string1": "Richard", "string2": "Stephen", "error_type": "Different Name"},
+]
+
+df = pd.DataFrame(data)
+df
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
|    | string1 | string2 | error_type |
|----|---------|---------|------------|
| 0  | Richard | Richard | None |
| 1  | Richard | ichard  | Deletion |
| 2  | Richard | Richar  | Deletion |
| 3  | Richard | iRchard | Transposition |
| 4  | Richard | Richadr | Transposition |
| 5  | Richard | Rich    | Shortening |
| 6  | Richard | Rick    | Nickname/Alias |
| 7  | Richard | Ricky   | Nickname/Alias |
| 8  | Richard | Dick    | Nickname/Alias |
| 9  | Richard | Rico    | Nickname/Alias |
| 10 | Richard | Rachael | Different Name |
| 11 | Richard | Stephen | Different Name |
+
+ +

The comparator_score_chart function allows you to compare two lists of strings and visualise how similar the elements are according to the available string similarity and distance metrics.

+
sa.comparator_score_chart(data, "string1", "string2")
+
+ +
+ + +

Here we can see that all of the metrics are fairly sensitive to transcription errors ("Richadr", "Richar", "iRchard"). However, when considering nicknames/aliases ("Rick", "Ricky", "Rico"), simple metrics such as Jaccard, Levenshtein and Damerau-Levenshtein tend to be less useful. The same can be said for name shortenings ("Rich"), though to a lesser extent than for more complex nicknames. Even more subtle metrics like Jaro and Jaro-Winkler still struggle to identify less obvious nicknames/aliases such as "Dick".

+

If you would prefer the underlying dataframe instead of the chart, there is the comparator_score_df function.

+
sa.comparator_score_df(data, "string1", "string2")
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
|    | string1 | string2 | levenshtein_distance | damerau_levenshtein_distance | jaro_similarity | jaro_winkler_similarity | jaccard_similarity |
|----|---------|---------|----------------------|------------------------------|-----------------|-------------------------|--------------------|
| 0  | Richard | Richard | 0 | 0 | 1.00 | 1.00 | 1.00 |
| 1  | Richard | ichard  | 1 | 1 | 0.95 | 0.95 | 0.86 |
| 2  | Richard | Richar  | 1 | 1 | 0.95 | 0.97 | 0.86 |
| 3  | Richard | iRchard | 2 | 1 | 0.95 | 0.95 | 1.00 |
| 4  | Richard | Richadr | 2 | 1 | 0.95 | 0.97 | 1.00 |
| 5  | Richard | Rich    | 3 | 3 | 0.86 | 0.91 | 0.57 |
| 6  | Richard | Rick    | 4 | 4 | 0.73 | 0.81 | 0.38 |
| 7  | Richard | Ricky   | 4 | 4 | 0.68 | 0.68 | 0.33 |
| 8  | Richard | Dick    | 5 | 5 | 0.60 | 0.60 | 0.22 |
| 9  | Richard | Rico    | 4 | 4 | 0.73 | 0.81 | 0.38 |
| 10 | Richard | Rachael | 3 | 3 | 0.71 | 0.74 | 0.44 |
| 11 | Richard | Stephen | 7 | 7 | 0.43 | 0.43 | 0.08 |
+
+ +

Choosing thresholds

+

We can add distance and similarity thresholds to the comparators to see what strings would be included in a given comparison level:

+
sa.comparator_score_threshold_chart(
+    data, "string1", "string2", distance_threshold=2, similarity_threshold=0.8
+)
+
+ +
+ + +

To class our variations on "Richard" in the same Comparison Level, a good choice of metric could be Jaro-Winkler with a threshold of 0.8. Lowering the threshold further could increase the chance of false positives.

+

For example, a single Jaro-Winkler Comparison Level with a threshold of 0.7 would lead to "Rachael" being considered as providing the same amount of evidence for a record matching as "iRchard".

+

An alternative way around this is to construct a Comparison with multiple levels, each corresponding to a different threshold of Jaro-Winkler similarity. For example, below we construct a Comparison using the Comparison Library function JaroWinklerAtThresholds with multiple levels for different match thresholds:

+
import splink.comparison_library as cl
+
+first_name_comparison = cl.JaroWinklerAtThresholds("first_name", [0.9, 0.8, 0.7])
+
+

If we print this comparison as a dictionary we can see the underlying SQL.

+
first_name_comparison.get_comparison("duckdb").as_dict()
+
+
{'output_column_name': 'first_name',
+ 'comparison_levels': [{'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
+   'label_for_charts': 'first_name is NULL',
+   'is_null_level': True},
+  {'sql_condition': '"first_name_l" = "first_name_r"',
+   'label_for_charts': 'Exact match on first_name'},
+  {'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.9',
+   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.9'},
+  {'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.8',
+   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.8'},
+  {'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7',
+   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.7'},
+  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
+ 'comparison_description': 'JaroWinklerAtThresholds'}
+
+

Where:

+
    +
  • Exact Match level will catch perfect matches ("Richard").
  • +
  • The 0.9 threshold will catch Shortenings and Typos ("ichard", "Richar", "iRchard", "Richadr", "Rich").
  • +
  • The 0.8 threshold will catch simple Nicknames/Aliases ("Rick", "Rico").
  • +
  • The 0.7 threshold will catch more distant variations, but will also include less relevant names (e.g. "Rachael"). However, this should not be a concern, as the model should assign less predictive power (i.e. Match Weight) to this level of evidence.
  • +
  • All other comparisons will end up in the "Else" level
  • +
+
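The cascading logic of these levels can be sketched in plain Python. This is an illustration of how levels are assigned, not Splink's implementation (Splink evaluates the equivalent SQL CASE expression on the backend); the jw argument stands for any Jaro-Winkler scorer returning a value between 0 and 1, and the level labels are purely illustrative.

```python
def assign_level(name_l, name_r, jw) -> str:
    # Levels are evaluated top to bottom; a pair lands in the
    # first level whose condition it satisfies.
    if name_l is None or name_r is None:
        return "NULL"
    if name_l == name_r:
        return "Exact match"
    score = jw(name_l, name_r)
    if score >= 0.9:
        return "Jaro-Winkler >= 0.9"
    if score >= 0.8:
        return "Jaro-Winkler >= 0.8"
    if score >= 0.7:
        return "Jaro-Winkler >= 0.7"
    return "Else"
```

For example, using the scores from the table above, ("Richard", "Rich") at 0.91 lands in the 0.9 level, while ("Richard", "Rick") at 0.81 lands in the 0.8 level.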

Phonetic Matching

+

There are similar functions available within Splink to help users get familiar with phonetic transformations, allowing you to create visualisations similar to those for string comparators.

+

To see the phonetic transformations for a single string, there is the phonetic_transform function:

+
sa.phonetic_transform("Richard")
+
+
{'soundex': 'R02063', 'metaphone': 'RXRT', 'dmetaphone': ('RXRT', 'RKRT')}
+
+
sa.phonetic_transform("Steven")
+
+
{'soundex': 'S30105', 'metaphone': 'STFN', 'dmetaphone': ('STFN', '')}
+
+

Now consider a collection of common variations of the name "Stephen". Which phonetic transforms will consider these as sufficiently similar to "Stephen"?

+
data = [
+    {"string1": "Stephen", "string2": "Stephen", "error_type": "None"},
+    {"string1": "Stephen", "string2": "Steven", "error_type": "Spelling Variation"},
+    {"string1": "Stephen", "string2": "Stephan", "error_type": "Spelling Variation/Similar Name"},
+    {"string1": "Stephen", "string2": "Steve", "error_type": "Nickname/Alias"},
+    {"string1": "Stephen", "string2": "Stehpen", "error_type": "Transposition"},
+    {"string1": "Stephen", "string2": "tSephen", "error_type": "Transposition"},
+    {"string1": "Stephen", "string2": "Stephne", "error_type": "Transposition"},
+    {"string1": "Stephen", "string2": "Stphen", "error_type": "Deletion"},
+    {"string1": "Stephen", "string2": "Stepheb", "error_type": "Replacement"},
+    {"string1": "Stephen", "string2": "Stephanie", "error_type": "Different Name"},
+    {"string1": "Stephen", "string2": "Richard", "error_type": "Different Name"},
+]
+
+
+df = pd.DataFrame(data)
+df
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
|    | string1 | string2 | error_type |
|----|---------|---------|------------|
| 0  | Stephen | Stephen | None |
| 1  | Stephen | Steven | Spelling Variation |
| 2  | Stephen | Stephan | Spelling Variation/Similar Name |
| 3  | Stephen | Steve | Nickname/Alias |
| 4  | Stephen | Stehpen | Transposition |
| 5  | Stephen | tSephen | Transposition |
| 6  | Stephen | Stephne | Transposition |
| 7  | Stephen | Stphen | Deletion |
| 8  | Stephen | Stepheb | Replacement |
| 9  | Stephen | Stephanie | Different Name |
| 10 | Stephen | Richard | Different Name |
+
+ +

The phonetic_match_chart function allows you to compare two lists of strings and see whether the elements match according to the available phonetic matching algorithms.

+
sa.phonetic_match_chart(data, "string1", "string2")
+
+ +
+ + +

Here we can see that all of the algorithms recognise simple phonetically similar names ("Stephen", "Steven"). However, there is some variation when it comes to transposition errors ("Stehpen", "Stephne"), with Soundex and the Metaphone-based algorithms giving different results. Behaviour also differs when considering different names ("Stephanie").

+

Given there is no clear winner that captures all of the similar names, it is recommended that phonetic matches are used as a single Comparison Level within a Comparison that also includes string comparators in its other levels. To see an example of this, see the Combining String scores and Phonetic matching section of this topic guide.

+

If you would prefer the underlying dataframe instead of the chart, there is the phonetic_transform_df function.

+
sa.phonetic_transform_df(data, "string1", "string2")
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
|    | string1 | string2 | soundex | metaphone | dmetaphone |
|----|---------|---------|---------|-----------|------------|
| 0  | Stephen | Stephen | [S30105, S30105] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 1  | Stephen | Steven | [S30105, S30105] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 2  | Stephen | Stephan | [S30105, S30105] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 3  | Stephen | Steve | [S30105, S3010] | [STFN, STF] | [(STFN, ), (STF, )] |
| 4  | Stephen | Stehpen | [S30105, S30105] | [STFN, STPN] | [(STFN, ), (STPN, )] |
| 5  | Stephen | tSephen | [S30105, t50105] | [STFN, TSFN] | [(STFN, ), (TSFN, )] |
| 6  | Stephen | Stephne | [S30105, S301050] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 7  | Stephen | Stphen | [S30105, S3105] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 8  | Stephen | Stepheb | [S30105, S30101] | [STFN, STFP] | [(STFN, ), (STFP, )] |
| 9  | Stephen | Stephanie | [S30105, S301050] | [STFN, STFN] | [(STFN, ), (STFN, )] |
| 10 | Stephen | Richard | [S30105, R02063] | [STFN, RXRT] | [(STFN, ), (RXRT, RKRT)] |
+
+ +

Combining String scores and Phonetic matching

+

Once you have considered all of the string comparators and phonetic transforms for a given column, you may decide that you would like to have multiple comparison levels including a combination of options.

+

For this you can construct a custom comparison to catch all of the edge cases you want. For example, if you decide that the comparison for first_name in the model should consider:

+
    +
  1. A Dmetaphone level for phonetic similarity
  2. A Levenshtein level with distance of 2 for typos
  3. A Jaro-Winkler level with similarity 0.9 for fuzzy matching
+
import splink.comparison_library as cl
+import splink.comparison_level_library as cll
+first_name_comparison = cl.CustomComparison(
+    output_column_name="first_name",
+    comparison_levels=[
+        cll.NullLevel("first_name"),
+        cll.ExactMatchLevel("first_name"),
+        cll.JaroWinklerLevel("first_name", 0.9),
+        cll.LevenshteinLevel("first_name", 2),
+        cll.ArrayIntersectLevel("first_name_dm", 1),
+        cll.ElseLevel()
+    ]
+)
+
+print(first_name_comparison.get_comparison("duckdb").human_readable_description)
+
+
Comparison 'CustomComparison' of "first_name" and "first_name_dm".
+Similarity is assessed using the following ComparisonLevels:
+    - 'first_name is NULL' with SQL rule: "first_name_l" IS NULL OR "first_name_r" IS NULL
+    - 'Exact match on first_name' with SQL rule: "first_name_l" = "first_name_r"
+    - 'Jaro-Winkler distance of first_name >= 0.9' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.9
+    - 'Levenshtein distance of first_name <= 2' with SQL rule: levenshtein("first_name_l", "first_name_r") <= 2
+    - 'Array intersection size >= 1' with SQL rule: array_length(list_intersect("first_name_dm_l", "first_name_dm_r")) >= 1
+    - 'All other comparisons' with SQL rule: ELSE
+
+

where first_name_dm refers to a column in the dataset which has been created during the feature engineering step to give the Dmetaphone transform of first_name.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/comparisons/comparators.html b/topic_guides/comparisons/comparators.html new file mode 100644 index 0000000000..3d41799805 --- /dev/null +++ b/topic_guides/comparisons/comparators.html @@ -0,0 +1,5643 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + String comparators - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+ +
+
+ + + +
+
+ + + + + + + + + + + + + + + +

String Comparators

+

There are a number of string comparator functions available in Splink that allow fuzzy matching for strings within Comparisons and Comparison Levels. For each of these fuzzy matching functions, below you will find explanations of how they work, worked examples and recommendations for the types of data they are useful for.

+

For guidance on how to choose the most suitable string comparator, and associated threshold, see the dedicated topic guide.

+
+ +

Levenshtein Distance

+
+

At a glance

+

Useful for: Data entry errors e.g. character miskeys. +Splink comparison functions: LevenshteinLevel() and LevenshteinAtThresholds() +Returns: An integer (lower is more similar).

+
+
Description
+

Levenshtein distance, also known as edit distance, is a measure of the difference between two strings. It represents the minimum number of insertions, deletions, or substitutions of characters required to transform one string into the other.

+

Or, as a formula,

+
\[\textsf{Levenshtein}(s_1, s_2) = \min \lbrace \begin{array}{l} +\text{insertion , } +\text{deletion , } +\text{substitution} +\end{array} \rbrace \]
+
Examples
+
+"KITTEN" vs "SITTING" +

The minimum number of operations to convert "KITTEN" into "SITTING" are:

+
    +
  • Substitute "K" in "KITTEN" with "S" to get "SITTEN."
  • +
  • Substitute "E" in "SITTEN" with "I" to get "SITTIN."
  • +
  • Insert "G" after "N" in "SITTIN" to get "SITTING."
  • +
+

Therefore,

+
\[\textsf{Levenshtein}(\texttt{KITTEN}, \texttt{SITTING}) = 3\]
+
+
+"CAKE" vs "ACKE" +

The minimum number of operations to convert "CAKE" into "ACKE" are:

+
    +
  • Substitute "C" in "CAKE" with "A" to get "AAKE."
  • +
  • Substitute the second "A" in "AAKE" with "C" to get "ACKE".
  • +
+

Therefore,

+
\[\textsf{Levenshtein}(\texttt{CAKE}, \texttt{ACKE}) = 2\]
+
+
Sample code
+

You can test out the Levenshtein distance as follows:

+
import duckdb
+duckdb.sql("SELECT levenshtein('CAKE', 'ACKE')").df().iloc[0,0]
+
+
+

2
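If DuckDB is not to hand, the same edit-distance recurrence can be sketched in pure Python. This is a minimal illustration of the dynamic-programming algorithm, not Splink's own implementation:

```python
def levenshtein(s1: str, s2: str) -> int:
    # Row-by-row dynamic programming: prev[j] holds the distance
    # between the current prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (c1 != c2),  # substitution
            ))
        prev = curr
    return prev[-1]

print(levenshtein("KITTEN", "SITTING"))  # 3
print(levenshtein("CAKE", "ACKE"))       # 2
```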

+
+
+ +

Damerau-Levenshtein Distance

+
+

At a glance

+

Useful for: Data entry errors e.g. character transpositions and miskeys. +Splink comparison functions: DamerauLevenshteinLevel() and DamerauLevenshteinAtThresholds() +Returns: An integer (lower is more similar).

+
+
Description
+

Damerau-Levenshtein distance is a variation of Levenshtein distance that also includes transposition operations, which are the interchange of adjacent characters. This distance measures the minimum number of operations required to transform one string into another by allowing insertions, deletions, substitutions, and transpositions of characters.

+

Or, as a formula,

+
\[\textsf{DamerauLevenshtein}(s_1, s_2) = \min \lbrace \begin{array}{l} +\text{insertion , } +\text{deletion , } +\text{substitution , } +\text{transposition} +\end{array} \rbrace \]
+
Examples
+
+"KITTEN" vs "SITTING" +

The minimum number of operations to convert "KITTEN" into "SITTING" are:

+
    +
  • Substitute "K" in "KITTEN" with "S" to get "SITTEN".
  • +
  • Substitute "E" in "SITTEN" with "I" to get "SITTIN".
  • +
  • Insert "G" after "N" in "SITTIN" to get "SITTING".
  • +
+

Therefore,

+
\[\textsf{DamerauLevenshtein}(\texttt{KITTEN}, \texttt{SITTING}) = 3\]
+
+
+"CAKE" vs "ACKE" +

The minimum number of operations to convert "CAKE" into "ACKE" are:

+
    +
  • Transpose "C" and "A" in "CAKE" to get "ACKE".
  • +
+

Therefore,

+
\[\textsf{DamerauLevenshtein}(\texttt{CAKE}, \texttt{ACKE}) = 1\]
+
+
Sample code
+

You can test out the Damerau-Levenshtein distance as follows:

+
import duckdb
+duckdb.sql("SELECT damerau_levenshtein('CAKE', 'ACKE')").df().iloc[0,0]
+
+
+

1
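The recurrence is the Levenshtein one, plus an extra case for adjacent transpositions. A minimal pure-Python sketch of the restricted (optimal string alignment) variant, for illustration only:

```python
def damerau_levenshtein(s1: str, s2: str) -> int:
    # Optimal string alignment: edit distance where swapping two
    # adjacent characters counts as a single operation.
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = int(s1[i - 1] != s2[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

print(damerau_levenshtein("CAKE", "ACKE"))       # 1
print(damerau_levenshtein("KITTEN", "SITTING"))  # 3
```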

+
+
+ +

Jaro Similarity

+
+

At a glance

+

Useful for: Strings where all characters are considered equally important, regardless of order e.g. ID numbers. +Splink comparison functions: JaroLevel() and JaroAtThresholds() +Returns: A score between 0 and 1 (higher is more similar).

+
+
Description
+

Jaro similarity is a measure of similarity between two strings. It takes into account the number and order of matching characters, as well as the number of transpositions needed to make the strings identical.

+

Jaro similarity considers:

+
    +
  • The number of matching characters (characters that appear in both strings within a limited matching window).
  • +
  • The number of transpositions (matching characters that appear in a different order in the two strings).
  • +
+

Or, as a formula:

+
\[\textsf{Jaro}(s_1, s_2) = \frac{1}{3} \left[ \frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m-t}{m} \right]\]
+

where:

+
    +
  • \(s_1\) and \(s_2\) are the two strings being compared
  • +
  • \(m\) is the number of matching characters (characters are considered matching only if they are the same and not farther than \(\left\lfloor \frac{\max(|s_1|,|s_2|)}{2} \right\rfloor - 1\) characters apart)
  • +
  • \(t\) is the number of transpositions (which is calculated as the number of matching characters that are not in the right order divided by two).
  • +
+
Examples
+
+"MARTHA" vs "MARHTA": +
    +
  • There are six matching characters: "M", "A", "R", "T", "H", and "A".
  • +
  • There is one transposition: the matched characters "T" and "H" appear in opposite orders in the two strings.
  • +
  • We calculate the Jaro similarity using the formula:
  • +
+
\[\textsf{Jaro}(\texttt{MARTHA}, \texttt{MARHTA}) = \frac{1}{3} \left[ \frac{6}{6} + \frac{6}{6} + \frac{6-1}{6} \right] = 0.944\]
+
+
+"MARTHA" vs "AMRTHA": +
    +
  • There are six matching characters: "M", "A", "R", "T", "H", and "A".
  • +
  • There is one transposition: the matched characters "M" and "A" appear in opposite orders at the start of the two strings.
  • +
  • We calculate the Jaro similarity using the formula:
  • +
+
\[\textsf{Jaro}(\texttt{MARTHA}, \texttt{AMRTHA}) = \frac{1}{3} \left[ \frac{6}{6} + \frac{6}{6} + \frac{6-1}{6} \right] = 0.944\]
+

Note that a transposition yields the same Jaro similarity regardless of where it occurs within the strings.

+
+
Sample code
+

You can test out the Jaro similarity as follows:

+
import duckdb
+duckdb.sql("SELECT jaro_similarity('MARTHA', 'MARHTA')").df().iloc[0,0]
+
+
+

0.944
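The two-pass logic (find matches within the window, then count transpositions) can also be sketched in pure Python. This is a minimal illustration of the algorithm, not Splink's implementation:

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    used = [False] * len(s2)
    matched1 = []
    # Pass 1: for each character of s1, find an unused matching
    # character of s2 within the window.
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used[j] and s2[j] == c:
                used[j] = True
                matched1.append(c)
                break
    m = len(matched1)
    if m == 0:
        return 0.0
    matched2 = [c for j, c in enumerate(s2) if used[j]]
    # Pass 2: transpositions = out-of-order matched pairs, halved.
    t = sum(a != b for a, b in zip(matched1, matched2)) // 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

print(round(jaro("MARTHA", "MARHTA"), 3))  # 0.944
```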

+
+
+ +

Jaro-Winkler Similarity

+
+

At a glance

+

Useful for: Strings where importance is weighted towards the first 4 characters e.g. Names. +Splink comparison functions: JaroWinklerLevel() and JaroWinklerAtThresholds() +Returns: A score between 0 and 1 (higher is more similar).

+
+
Description
+

Jaro-Winkler similarity is a variation of Jaro similarity that gives extra weight to matching prefixes of the strings. It is particularly useful for names.

+

The Jaro-Winkler similarity is calculated as follows:

+
\[\textsf{JaroWinkler}(s_1, s_2) = \textsf{Jaro}(s_1, s_2) + p \cdot l \cdot (1 - \textsf{Jaro}(s_1, s_2))\]
+

where: +- \(\textsf{Jaro}(s_1, s_2)\) is the Jaro similarity between the two strings +- \(l\) is the length of the common prefix between the two strings, up to a maximum of four characters +- \(p\) is a prefix scale factor, commonly set to 0.1.

+
Examples
+
+"MARTHA" vs "MARHTA" +

The common prefix between the two strings is "MAR", which has a length of 3. +We calculate the Jaro-Winkler similarity using the formula:

+
\[\textsf{Jaro-Winkler}(\texttt{MARTHA}, \texttt{MARHTA}) = 0.944 + 0.1 \cdot 3 \cdot (1 - 0.944) \approx 0.9611\]
+

The Jaro-Winkler similarity is slightly higher than the Jaro similarity, due to the matching prefix.

+
+
+"MARTHA" vs "AMRTHA": +

There is no common prefix, so the Jaro-Winkler similarity formula gives:

+
\[\textsf{Jaro-Winkler}(\texttt{MARTHA}, \texttt{AMRTHA}) = 0.944 + 0.1 \cdot 0 \cdot (1 - 0.944) = 0.944\]
+

Which is the same as the Jaro score.

+

Note that the Jaro-Winkler similarity should be used with caution, as it may not always provide better results than the standard Jaro similarity, especially when dealing with short strings or strings that have no common prefix.

+
+
Sample code
+

You can test out the Jaro-Winkler similarity as follows:

+
import duckdb
+duckdb.sql("SELECT jaro_winkler_similarity('MARTHA', 'MARHTA')").df().iloc[0,0]
+
+
+

0.9611
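Given a precomputed Jaro score, the prefix boost is simple to sketch in pure Python. This is an illustration of the formula, not Splink's implementation; jaro_score would come from any Jaro implementation:

```python
def jaro_winkler(jaro_score: float, s1: str, s2: str,
                 p: float = 0.1, max_prefix: int = 4) -> float:
    # l = length of the common prefix, capped at max_prefix characters.
    l = 0
    for a, b in zip(s1, s2):
        if a != b or l == max_prefix:
            break
        l += 1
    return jaro_score + p * l * (1 - jaro_score)

print(round(jaro_winkler(0.9444, "MARTHA", "MARHTA"), 4))  # 0.9611
```

With no common prefix (as in "MARTHA" vs "AMRTHA"), l is 0 and the score is unchanged from the Jaro score.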

+
+
+ +

Jaccard Similarity

+
+

At a glance

+

Useful for: Strings that can be split into multiple tokens e.g. addresses. +Splink comparison functions: JaccardLevel() and JaccardAtThresholds() +Returns: A score between 0 and 1 (higher is more similar).

+
+
Description
+

Jaccard similarity is a measure of similarity between two sets of items, based on the size of their intersection (elements in common) and union (total elements across both sets). For strings, it considers the overlap of characters within each string. Mathematically, it can be represented as:

+
\[\textsf{Jaccard}=\frac{|A \cap B|}{|A \cup B|}\]
+

where \(A\) and \(B\) are the sets of elements (e.g. characters or tokens) being compared, \(|A \cap B|\) is the size of their intersection, and \(|A \cup B|\) is the size of their union.

+

In practice, Jaccard is more useful with strings that can be split up into multiple words as opposed to characters within a single word or string. E.g. tokens within addresses:

+

Address 1: {"flat", "2", "123", "high", "street", "london", "sw1", "1ab"}

+

Address 2: {"2", "high", "street", "london", "sw1a", "1ab"},

+

where:

+
    +
  • there are 9 unique tokens across the addresses: "flat", "2", "123", "high", "street", "london", "sw1", "sw1a", "1ab"
  • +
  • there are 5 tokens found in both addresses: "2", "high", "street", "london", "1ab"
  • +
+

We calculate the Jaccard similarity using the formula:

+
\[\textsf{Jaccard}(\textrm{Address1}, \textrm{Address2})=\frac{5}{9}=0.5556\]
+

However, this functionality is not currently implemented within Splink.
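Although token-level Jaccard is not built into Splink, the address example above is easy to reproduce in plain Python as a sketch:

```python
def jaccard_tokens(a: str, b: str) -> float:
    # Compare sets of whitespace-separated tokens rather than characters.
    set_a, set_b = set(a.split()), set(b.split())
    return len(set_a & set_b) / len(set_a | set_b)

address_1 = "flat 2 123 high street london sw1 1ab"
address_2 = "2 high street london sw1a 1ab"
print(round(jaccard_tokens(address_1, address_2), 4))  # 0.5556
```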

+
Examples
+
+"DUCK" vs "LUCK" +
    +
  • There are five unique characters across the strings: "D", "U", "C", "K", "L"
  • +
  • Three are found in both strings: "U", "C", "K"
  • +
+

We calculate the Jaccard similarity using the formula:

+
\[\textsf{Jaccard}(\texttt{DUCK}, \texttt{LUCK})=\frac{3}{5}=0.6\]
+
+
+"MARTHA" vs "MARHTA" +
    +
  • There are five unique characters across the strings: "M", "A", "R", "T", "H"
  • +
  • Five are found in both strings: "M", "A", "R", "T", "H"
  • +
+

We calculate the Jaccard similarity using the formula:

+
\[\textsf{Jaccard}(\texttt{MARTHA}, \texttt{MARHTA})=\frac{5}{5}=1\]
+
+
Sample code
+

You can test out the Jaccard similarity between two strings with the function below:

+
def jaccard_similarity(str1, str2):
+    set1 = set(str1)
+    set2 = set(str2)
+    return len(set1 & set2) / len(set1 | set2)
+
+jaccard_similarity("DUCK", "LUCK")
+
+
+

0.6

+
+
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/comparisons/comparisons_and_comparison_levels.html b/topic_guides/comparisons/comparisons_and_comparison_levels.html new file mode 100644 index 0000000000..744f09acdb --- /dev/null +++ b/topic_guides/comparisons/comparisons_and_comparison_levels.html @@ -0,0 +1,5438 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Comparisons and comparison levels - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Comparison and ComparisonLevels

+

Comparing information

+

To find matching records, Splink creates pairwise record comparisons from the input records, and scores these comparisons.

+

Suppose for instance your data contains first_name and surname and dob:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| id | first_name | surname | dob |
|----|------------|---------|-----|
| 1  | john | smith | 1991-04-11 |
| 2  | jon  | smith | 1991-04-17 |
| 3  | john | smyth | 1991-04-11 |
+

To compare these records, at the blocking stage, Splink will set these records against each other in a table of pairwise record comparisons:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| id_l | id_r | first_name_l | first_name_r | surname_l | surname_r | dob_l | dob_r |
|------|------|--------------|--------------|-----------|-----------|-------|-------|
| 1 | 2 | john | jon  | smith | smith | 1991-04-11 | 1991-04-17 |
| 1 | 3 | john | john | smith | smyth | 1991-04-11 | 1991-04-11 |
| 2 | 3 | jon  | john | smith | smyth | 1991-04-17 | 1991-04-11 |
+
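In plain Python, the pairwise table above amounts to taking every 2-combination of the input records. This is only a sketch to show the shape of the data - Splink generates this table in SQL on your chosen backend, restricted by blocking rules:

```python
from itertools import combinations

records = [
    {"id": 1, "first_name": "john", "surname": "smith", "dob": "1991-04-11"},
    {"id": 2, "first_name": "jon", "surname": "smith", "dob": "1991-04-17"},
    {"id": 3, "first_name": "john", "surname": "smyth", "dob": "1991-04-11"},
]

# Build one row per record pair, suffixing columns with _l and _r.
pairs = [
    {**{f"{k}_l": v for k, v in left.items()},
     **{f"{k}_r": v for k, v in right.items()}}
    for left, right in combinations(records, 2)
]

for pair in pairs:
    print(pair["id_l"], pair["id_r"], pair["first_name_l"], pair["first_name_r"])
# 1 2 john jon
# 1 3 john john
# 2 3 jon john
```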

When defining comparisons, we are defining rules that operate on each row of this latter table of pairwise comparisons

+

Defining similarity

+

How should we assess similarity between the records?

+

In Splink, we will use different measures of similarity for different columns in the data, and then combine these measures to get an overall similarity score. But the most appropriate definition of similarity will differ between columns.

+

For example, two surnames that differ by a single character would usually be considered to be similar. But a one character difference in a 'gender' field encoded as M or F is not similar at all!

+

To allow for this, Splink uses the concepts of Comparisons and ComparisonLevels. Each Comparison usually measures the similarity of a single column in the data, and each Comparison is made up of one or more ComparisonLevels.

+

Within each Comparison are n discrete ComparisonLevels. Each ComparisonLevel defines a discrete gradation (category) of similarity within a Comparison. There can be as many ComparisonLevels as you want. For example:

+
Data Linking Model
+├─-- Comparison: Gender
+│    ├─-- ComparisonLevel: Exact match
+│    ├─-- ComparisonLevel: All other
+├─-- Comparison: First name
+│    ├─-- ComparisonLevel: Exact match on first name
+│    ├─-- ComparisonLevel: first names have JaroWinklerSimilarity > 0.95
+│    ├─-- ComparisonLevel: All other
+
+

The categories are discrete rather than continuous for performance reasons - so for instance, a ComparisonLevel may be defined as Jaro-Winkler similarity > 0.95, as opposed to using the Jaro-Winkler score as a continuous measure directly.

+

It is up to the user to decide how best to define similarity for the different columns (fields) in their data, and this is a key part of modelling a record linkage problem.

+

A much more detailed explanation of how this works can be found in this series of interactive tutorials - refer in particular to computing the Fellegi Sunter model.

+

An example:

+

The concepts of Comparisons and ComparisonLevels are best explained using an example.

+

Consider the following simple data linkage model with only two columns (in a real example there would usually be more):

+
Data Linking Model
+├─-- Comparison: Date of birth
+│    ├─-- ComparisonLevel: Exact match
+│    ├─-- ComparisonLevel: One character difference
+│    ├─-- ComparisonLevel: All other
+├─-- Comparison: First name
+│    ├─-- ComparisonLevel: Exact match on first_name
+│    ├─-- ComparisonLevel: first_names have JaroWinklerSimilarity > 0.95
+│    ├─-- ComparisonLevel: first_names have JaroWinklerSimilarity > 0.8
+│    ├─-- ComparisonLevel: All other
+
+

In this model we have two Comparisons: one for date of birth and one for first name:

+

For date of birth, we have chosen three discrete ComparisonLevels to measure similarity. Either the dates of birth are an exact match, they differ by one character, or they are different in some other way.

+

For first name, we have chosen four discrete ComparisonLevels to measure similarity. Either the first names are an exact match, they have a JaroWinkler similarity of greater than 0.95, they have a JaroWinkler similarity of greater than 0.8, or they are different in some other way.

+

Note that these definitions are mutually exclusive, because they're implemented by Splink like an if statement. For example, for first name, the Comparison is equivalent to the following pseudocode:

+
if first_name_l == first_name_r:
+    return "Assign to category: Exact match"
+elif JaroWinklerSimilarity(first_name_l, first_name_r) > 0.95:
+    return "Assign to category: JaroWinklerSimilarity > 0.95"
+elif JaroWinklerSimilarity(first_name_l, first_name_r) > 0.8:
+    return "Assign to category: JaroWinklerSimilarity > 0.8"
+else:
+    return "Assign to category: All other"
+
+

In the next section, we will see how to define these Comparisons and ComparisonLevels in Splink.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/comparisons/customising_comparisons.html b/topic_guides/comparisons/customising_comparisons.html new file mode 100644 index 0000000000..4cae712528 --- /dev/null +++ b/topic_guides/comparisons/customising_comparisons.html @@ -0,0 +1,5746 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Defining and customising comparisons - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + +

Defining and customising how record comparisons are made

+

A key feature of Splink is the ability to customise how record comparisons are made - that is, how similarity is defined for different data types. For example, the definition of similarity that is appropriate for a date of birth field is different than for a first name field.

+

By tailoring the definitions of similarity, linking models are more effectively able to distinguish between different gradations of similarity, leading to more accurate data linking models.

+

Comparisons and ComparisonLevels

+

Recall that a Splink model contains a collection of Comparisons and ComparisonLevels organised in a hierarchy.

+

Each ComparisonLevel defines the different gradations of similarity that make up a Comparison.

+

An example is as follows:

+
Data Linking Model
+├─-- Comparison: Date of birth
+│    ├─-- ComparisonLevel: Exact match
+│    ├─-- ComparisonLevel: Up to one character difference
+│    ├─-- ComparisonLevel: Up to three character difference
+│    ├─-- ComparisonLevel: All other
+├─-- Comparison: Name
+│    ├─-- ComparisonLevel: Exact match on first name and surname
+│    ├─-- ComparisonLevel: Exact match on first name
+│    ├─-- etc.
+
+

Three ways of specifying Comparisons

+

In Splink, there are three ways of specifying Comparisons:

+
    +
  • Using 'out-of-the-box' Comparisons (Most simple/succinct)
  • +
  • Composing pre-defined ComparisonLevels
  • +
  • Writing a full dictionary spec of a Comparison by hand (most verbose/flexible)
  • +
+
+ +

Method 1: Using the ComparisonLibrary

+

The ComparisonLibrary contains pre-baked similarity functions that cover many common use cases.

+

These functions generate an entire Comparison, composed of several ComparisonLevels.

+

You can find a listing of all available Comparisons in the API documentation here

+

The following provides an example of using the ExactMatch Comparison, and producing the description (with associated SQL) for the duckdb backend:

+
import splink.comparison_library as cl
+
+first_name_comparison = cl.ExactMatch("first_name")
+print(first_name_comparison.get_comparison("duckdb").human_readable_description)
+
+
Comparison 'ExactMatch' of "first_name".
+Similarity is assessed using the following ComparisonLevels:
+    - 'first_name is NULL' with SQL rule: "first_name_l" IS NULL OR "first_name_r" IS NULL
+    - 'Exact match on first_name' with SQL rule: "first_name_l" = "first_name_r"
+    - 'All other comparisons' with SQL rule: ELSE
+
+

Note that, under the hood, these functions generate a Python dictionary, which conforms to the underlying .json specification of a model:

+
first_name_comparison.get_comparison("duckdb").as_dict()
+
+
{'output_column_name': 'first_name',
+ 'comparison_levels': [{'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
+   'label_for_charts': 'first_name is NULL',
+   'is_null_level': True},
+  {'sql_condition': '"first_name_l" = "first_name_r"',
+   'label_for_charts': 'Exact match on first_name'},
+  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
+ 'comparison_description': 'ExactMatch'}
+
+
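Because the output is just a Python dictionary, it can be reproduced or tweaked by hand. A minimal sketch of the same spec built directly in plain Python (matching the dictionary shown above):

```python
# A hand-built dictionary matching the spec generated by cl.ExactMatch("first_name"),
# useful as a starting point for customisation.
exact_match_spec = {
    "output_column_name": "first_name",
    "comparison_levels": [
        {
            "sql_condition": '"first_name_l" IS NULL OR "first_name_r" IS NULL',
            "label_for_charts": "first_name is NULL",
            "is_null_level": True,
        },
        {
            "sql_condition": '"first_name_l" = "first_name_r"',
            "label_for_charts": "Exact match on first_name",
        },
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],
    "comparison_description": "ExactMatch",
}

# By convention, the null level comes first and the catch-all ELSE level last
assert exact_match_spec["comparison_levels"][0]["is_null_level"]
assert exact_match_spec["comparison_levels"][-1]["sql_condition"] == "ELSE"
```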

We can now generate a second, more complex comparison using one of our data-specific comparisons, the PostcodeComparison:

+
pc_comparison = cl.PostcodeComparison("postcode")
+print(pc_comparison.get_comparison("duckdb").human_readable_description)
+
+
Comparison 'PostcodeComparison' of "postcode".
+Similarity is assessed using the following ComparisonLevels:
+    - 'postcode is NULL' with SQL rule: "postcode_l" IS NULL OR "postcode_r" IS NULL
+    - 'Exact match on full postcode' with SQL rule: "postcode_l" = "postcode_r"
+    - 'Exact match on sector' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')
+    - 'Exact match on district' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '')
+    - 'Exact match on area' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}', 0), '')
+    - 'All other comparisons' with SQL rule: ELSE
+
+
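The sector, district and area levels above are simply regex prefixes of the postcode string. The same extractions can be sketched with Python's standard-library `re` module, mirroring the `regexp_extract` patterns in the SQL (an illustration only; the real computation happens in the SQL backend):

```python
import re

# Regex patterns copied from the SQL rules above; an empty/no match is
# treated as None, mirroring SQL's NULLIF(..., '')
PATTERNS = {
    "sector": r"^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]",
    "district": r"^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?",
    "area": r"^[A-Za-z]{1,2}",
}

def postcode_component(postcode: str, level: str):
    m = re.match(PATTERNS[level], postcode)
    return m.group(0) if m else None

print(postcode_component("SW1A 1AA", "sector"))    # SW1A 1
print(postcode_component("SW1A 1AA", "district"))  # SW1A
print(postcode_component("SW1A 1AA", "area"))      # SW
```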

For a deep dive on out-of-the-box comparisons, see the dedicated topic guide.

+

Comparisons can be further configured using the .configure() method - full API docs here.

+
+ +

Method 2: ComparisonLevels

+

ComparisonLevels provide a lower-level API that allows you to compose your own comparisons.

+

For example, the user may wish to specify a comparison that has levels for a match on soundex and jaro_winkler of the first_name field.

+

The below example assumes the user has derived a column soundex_first_name which contains the soundex of the first name.

+
from splink.comparison_library import CustomComparison
+import splink.comparison_level_library as cll
+
+custom_name_comparison = CustomComparison(
+    output_column_name="first_name",
+    comparison_levels=[
+        cll.NullLevel("first_name"),
+        cll.ExactMatchLevel("first_name").configure(tf_adjustment_column="first_name"),
+        cll.ExactMatchLevel("soundex_first_name").configure(
+            tf_adjustment_column="soundex_first_name"
+        ),
+        cll.ElseLevel(),
+    ],
+)
+
+print(custom_name_comparison.get_comparison("duckdb").human_readable_description)
+
+
Comparison 'CustomComparison' of "first_name" and "soundex_first_name".
+Similarity is assessed using the following ComparisonLevels:
+    - 'first_name is NULL' with SQL rule: "first_name_l" IS NULL OR "first_name_r" IS NULL
+    - 'Exact match on first_name' with SQL rule: "first_name_l" = "first_name_r"
+    - 'Exact match on soundex_first_name' with SQL rule: "soundex_first_name_l" = "soundex_first_name_r"
+    - 'All other comparisons' with SQL rule: ELSE
+
+

This can now be specified in the settings dictionary as follows:

+
from splink import SettingsCreator, block_on
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name"),
+        block_on("surname"),
+    ],
+    comparisons=[
+        custom_name_comparison,
+        cl.LevenshteinAtThresholds("dob", [1, 2]),
+    ],
+)
+
+

To inspect the custom comparison as a dictionary, you can call custom_name_comparison.get_comparison("duckdb").as_dict()

+

Note that ComparisonLevels can be further configured using the .configure() method - full API documentation here

+
+ +

Method 3: Providing the spec as a dictionary

+

Behind the scenes in Splink, all Comparisons are eventually turned into a dictionary that conforms to the formal jsonschema specification of the settings dictionary (see here).

+

The library functions described above are convenience functions that provide a shorthand way to produce valid dictionaries.

+

For maximum control over your settings, you can specify your comparisons as a dictionary.

+
comparison_first_name = {
+    "output_column_name": "first_name",
+    "comparison_levels": [
+        {
+            "sql_condition": "first_name_l IS NULL OR first_name_r IS NULL",
+            "label_for_charts": "Null",
+            "is_null_level": True,
+        },
+        {
+            "sql_condition": "first_name_l = first_name_r",
+            "label_for_charts": "Exact match",
+            "tf_adjustment_column": "first_name",
+            "tf_adjustment_weight": 1.0,
+            "tf_minimum_u_value": 0.001,
+        },
+        {
+            "sql_condition": "dmeta_first_name_l = dmeta_first_name_r",
+            "label_for_charts": "Exact match",
+            "tf_adjustment_column": "dmeta_first_name",
+            "tf_adjustment_weight": 1.0,
+        },
+        {
+            "sql_condition": "jaro_winkler_sim(first_name_l, first_name_r) > 0.8",
+            "label_for_charts": "Exact match",
+            "tf_adjustment_column": "first_name",
+            "tf_adjustment_weight": 0.5,
+            "tf_minimum_u_value": 0.001,
+        },
+        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
+    ],
+}
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name"),
+        block_on("surname"),
+    ],
+    comparisons=[
+        comparison_first_name,
+        cl.LevenshteinAtThresholds("dob", [1, 2]),
+    ],
+)
+
+
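When writing specs by hand it is easy to miss a required key or misplace the catch-all level. A small sanity-check helper can catch this before the settings are used. This is a sketch, not part of Splink; it checks the conventions followed by the examples in this guide (null level present, ELSE level last):

```python
def check_comparison_spec(spec: dict) -> list:
    """Return a list of problems found in a hand-written comparison spec."""
    problems = []
    levels = spec.get("comparison_levels", [])
    if not levels:
        problems.append("no comparison_levels defined")
    if levels and levels[-1].get("sql_condition") != "ELSE":
        problems.append("last level should be the catch-all ELSE level")
    if not any(lvl.get("is_null_level") for lvl in levels):
        problems.append("no null level (is_null_level=True) defined")
    for lvl in levels:
        if "sql_condition" not in lvl:
            problems.append("a level is missing sql_condition")
    return problems

# An empty spec is flagged with a non-empty list of problems
assert check_comparison_spec({"comparison_levels": []})
```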

Examples

+

Below are some examples of how you can define the same comparison through different methods.

+

Exact match Comparison with Term-Frequency Adjustments

+
+
+
+
import splink.comparison_library as cl
+
+first_name_comparison = cl.ExactMatch("first_name").configure(
+    term_frequency_adjustments=True
+)
+
+
+
+
import splink.comparison_library as cl
+import splink.comparison_level_library as cll
+
+first_name_comparison = cl.CustomComparison(
+    output_column_name="first_name",
+    comparison_description="Exact match vs. anything else",
+    comparison_levels=[
+        cll.NullLevel("first_name"),
+        cll.ExactMatchLevel("first_name").configure(tf_adjustment_column="first_name"),
+        cll.ElseLevel(),
+    ],
+)
+
+
+
+
first_name_comparison = {
+    'output_column_name': 'first_name',
+    'comparison_levels': [
+        {
+            'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
+            'label_for_charts': 'Null',
+            'is_null_level': True
+        },
+        {
+            'sql_condition': '"first_name_l" = "first_name_r"',
+            'label_for_charts': 'Exact match',
+            'tf_adjustment_column': 'first_name',
+            'tf_adjustment_weight': 1.0
+        },
+        {
+            'sql_condition': 'ELSE', 
+            'label_for_charts': 'All other comparisons'
+        }],
+    'comparison_description': 'Exact match vs. anything else'
+}
+
+
+
+
+

Each of which gives

+

{
+    'output_column_name': 'first_name',
+    'comparison_levels': [
+        {
+            'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
+            'label_for_charts': 'Null',
+            'is_null_level': True
+        },
+        {
+            'sql_condition': '"first_name_l" = "first_name_r"',
+            'label_for_charts': 'Exact match',
+            'tf_adjustment_column': 'first_name',
+            'tf_adjustment_weight': 1.0
+        },
+        {
+            'sql_condition': 'ELSE', 
+            'label_for_charts': 'All other comparisons'
+        }],
+    'comparison_description': 'Exact match vs. anything else'
+}
+
in your settings dictionary.

Levenshtein Comparison

+
+
+
+
import splink.comparison_library as cl
+
+email_comparison = cl.LevenshteinAtThresholds("email", [2, 4])
+
+
+
+
import splink.comparison_library as cl
+import splink.comparison_level_library as cll
+
+email_comparison = cl.CustomComparison(
+    output_column_name="email",
+    comparison_description="Exact match vs. Email within levenshtein thresholds 2, 4 vs. anything else",
+    comparison_levels=[
+        cll.NullLevel("email"),
+        cll.LevenshteinLevel("email", distance_threshold=2),
+        cll.LevenshteinLevel("email", distance_threshold=4),
+        cll.ElseLevel(),
+    ],
+)
+
+
+
+
email_comparison = {
+    'output_column_name': 'email',
+    'comparison_levels': [{'sql_condition': '"email_l" IS NULL OR "email_r" IS NULL',
+    'label_for_charts': 'Null',
+    'is_null_level': True},
+    {
+        'sql_condition': '"email_l" = "email_r"',
+        'label_for_charts': 'Exact match'
+    },
+    {
+        'sql_condition': 'levenshtein("email_l", "email_r") <= 2',
+        'label_for_charts': 'Levenshtein <= 2'
+    },
+    {
+        'sql_condition': 'levenshtein("email_l", "email_r") <= 4',
+        'label_for_charts': 'Levenshtein <= 4'
+    },
+    {
+        'sql_condition': 'ELSE', 
+        'label_for_charts': 'All other comparisons'
+    }],
+    'comparison_description': 'Exact match vs. Email within levenshtein thresholds 2, 4 vs. anything else'}
+
+
+
+
+

Each of which gives

+
{
+    'output_column_name': 'email',
+    'comparison_levels': [
+        {
+            'sql_condition': '"email_l" IS NULL OR "email_r" IS NULL',
+            'label_for_charts': 'Null',
+            'is_null_level': True},
+        {
+            'sql_condition': '"email_l" = "email_r"',
+            'label_for_charts': 'Exact match'
+        },
+        {
+            'sql_condition': 'levenshtein("email_l", "email_r") <= 2',
+            'label_for_charts': 'Levenshtein <= 2'
+        },
+        {
+            'sql_condition': 'levenshtein("email_l", "email_r") <= 4',
+            'label_for_charts': 'Levenshtein <= 4'
+        },
+        {
+            'sql_condition': 'ELSE', 
+            'label_for_charts': 'All other comparisons'
+        }],
+    'comparison_description': 'Exact match vs. Email within levenshtein thresholds 2, 4 vs. anything else'
+}
+
+

in your settings dictionary.
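The levenshtein thresholds above count single-character edits (insertions, deletions, substitutions). For intuition, here is a compact reference implementation in plain Python; Splink itself delegates to the backend's `levenshtein` SQL function:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

# A distance of 1 would fall in the "Levenshtein <= 2" level above
print(levenshtein("joe.bloggs@example.com", "joe.blogs@example.com"))  # 1
```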

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/comparisons/out_of_the_box_comparisons.html b/topic_guides/comparisons/out_of_the_box_comparisons.html new file mode 100644 index 0000000000..a031ce5654 --- /dev/null +++ b/topic_guides/comparisons/out_of_the_box_comparisons.html @@ -0,0 +1,5518 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Out-of-the-box comparisons - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Out-of-the-box Comparisons for specific data types

+

Splink has pre-defined Comparisons available for a variety of data types.

+
+ +

DateOfBirthComparison

+

You can find full API docs for DateOfBirthComparison here

+
import splink.comparison_library as cl
+
+date_of_birth_comparison = cl.DateOfBirthComparison(
+    "date_of_birth",
+    input_is_string=True,
+)
+
+

You can view the structure of the comparison as follows:

+
print(date_of_birth_comparison.get_comparison("duckdb").human_readable_description)
+
+
Comparison 'DateOfBirthComparison' of "date_of_birth".
+Similarity is assessed using the following ComparisonLevels:
+    - 'transformed date_of_birth is NULL' with SQL rule: try_strptime("date_of_birth_l", '%Y-%m-%d') IS NULL OR try_strptime("date_of_birth_r", '%Y-%m-%d') IS NULL
+    - 'Exact match on date of birth' with SQL rule: "date_of_birth_l" = "date_of_birth_r"
+    - 'DamerauLevenshtein distance <= 1' with SQL rule: damerau_levenshtein("date_of_birth_l", "date_of_birth_r") <= 1
+    - 'Abs date difference <= 1 month' with SQL rule: ABS(EPOCH(try_strptime("date_of_birth_l", '%Y-%m-%d')) - EPOCH(try_strptime("date_of_birth_r", '%Y-%m-%d'))) <= 2629800.0
+    - 'Abs date difference <= 1 year' with SQL rule: ABS(EPOCH(try_strptime("date_of_birth_l", '%Y-%m-%d')) - EPOCH(try_strptime("date_of_birth_r", '%Y-%m-%d'))) <= 31557600.0
+    - 'Abs date difference <= 10 year' with SQL rule: ABS(EPOCH(try_strptime("date_of_birth_l", '%Y-%m-%d')) - EPOCH(try_strptime("date_of_birth_r", '%Y-%m-%d'))) <= 315576000.0
+    - 'All other comparisons' with SQL rule: ELSE
+
+
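The epoch thresholds in the SQL above are average durations in seconds: a year is taken as the average Julian year of 365.25 days, and a month as a twelfth of that. A quick check in plain Python reproduces the constants:

```python
# Reproduce the epoch-second constants used in the SQL rules above
SECONDS_PER_DAY = 24 * 60 * 60
AVG_YEAR = 365.25 * SECONDS_PER_DAY  # average Julian year in seconds
AVG_MONTH = AVG_YEAR / 12            # one twelfth of the average year

print(AVG_MONTH, AVG_YEAR, 10 * AVG_YEAR)
# 2629800.0 31557600.0 315576000.0
```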

To see this as a specifications dictionary you can use:

+
date_of_birth_comparison.get_comparison("duckdb").as_dict()
+
+

which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide, if desired.

+
+ +

Name Comparison

+

A Name comparison is intended for use on an individual name column (e.g. forename, surname).

+

You can find full API docs for NameComparison here

+
import splink.comparison_library as cl
+
+first_name_comparison = cl.NameComparison("first_name")
+
+
print(first_name_comparison.get_comparison("duckdb").human_readable_description)
+
+
Comparison 'NameComparison' of "first_name".
+Similarity is assessed using the following ComparisonLevels:
+    - 'first_name is NULL' with SQL rule: "first_name_l" IS NULL OR "first_name_r" IS NULL
+    - 'Exact match on first_name' with SQL rule: "first_name_l" = "first_name_r"
+    - 'Jaro-Winkler distance of first_name >= 0.92' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.92
+    - 'Jaro-Winkler distance of first_name >= 0.88' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.88
+    - 'Jaro-Winkler distance of first_name >= 0.7' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7
+    - 'All other comparisons' with SQL rule: ELSE
+
+
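Jaro-Winkler similarity rewards matching characters found within a sliding window, penalises transpositions, and boosts strings sharing a common prefix. A compact reference implementation follows, for intuition only; Splink delegates to the backend's `jaro_winkler_similarity` function:

```python
def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):  # find matching characters within the window
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # transpositions: matched characters appearing in a different order
    t = sum(a != b for a, b in zip(
        (c for i, c in enumerate(s1) if m1[i]),
        (c for j, c in enumerate(s2) if m2[j]),
    )) // 2
    jaro = (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3
    prefix = 0  # common prefix, capped at 4 characters
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```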

The NameComparison also allows flexibility to change the parameters and/or fuzzy matching comparison levels.

+

For example:

+
surname_comparison = cl.NameComparison(
+    "surname",
+    jaro_winkler_thresholds=[0.95, 0.9],
+    dmeta_col_name="surname_dmeta",
+)
+print(surname_comparison.get_comparison("duckdb").human_readable_description)
+
+
Comparison 'NameComparison' of "surname" and "surname_dmeta".
+Similarity is assessed using the following ComparisonLevels:
+    - 'surname is NULL' with SQL rule: "surname_l" IS NULL OR "surname_r" IS NULL
+    - 'Exact match on surname' with SQL rule: "surname_l" = "surname_r"
+    - 'Jaro-Winkler distance of surname >= 0.95' with SQL rule: jaro_winkler_similarity("surname_l", "surname_r") >= 0.95
+    - 'Jaro-Winkler distance of surname >= 0.9' with SQL rule: jaro_winkler_similarity("surname_l", "surname_r") >= 0.9
+    - 'Array intersection size >= 1' with SQL rule: array_length(list_intersect("surname_dmeta_l", "surname_dmeta_r")) >= 1
+    - 'All other comparisons' with SQL rule: ELSE
+
+

Here surname_dmeta refers to a column containing the DoubleMetaphone phonetic encoding of surname. This helps to catch names which sound the same but have different spellings (e.g. Stephens vs Stevens). For more on Phonetic Transformations, see the topic guide.

+

To see this as a specifications dictionary you can call

+
surname_comparison.get_comparison("duckdb").as_dict()
+
+

which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide, if desired.

+
+ +

Forename and Surname Comparison

+

It can be helpful to construct a single comparison for forename and surname because:

+
    +
  1. +

The Fellegi-Sunter model assumes that columns are independent. We know that forename and surname are usually correlated, given the regional variation of names etc., so considering them in a single comparison can help to create better models.

    +

As a result, the term frequencies of forename and surname individually do not necessarily reflect how common the combination of forename and surname is. For more information on term frequencies, see the dedicated topic guide. Combining forename and surname in a single comparison allows the model to consider the joint term frequency as well as the individual ones.

    +
  2. +
  3. +

It is common for some records to have forename and surname swapped by mistake. Addressing forename and surname in a single comparison allows the model to consider these name inversions.

    +
  4. +
+

The ForenameSurnameComparison has been designed to accommodate this.

+
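Point 1 above can be illustrated with a toy dataset in plain Python: when forename and surname are correlated, the observed frequency of a full-name combination can differ substantially from what independence of the two columns would predict. The names below are invented purely for illustration:

```python
from collections import Counter

# Toy data (hypothetical names) where forename and surname are correlated
names = [
    ("mohammed", "khan"), ("mohammed", "khan"), ("mohammed", "smith"),
    ("john", "smith"), ("john", "khan"), ("emma", "jones"),
]
n = len(names)

forename_freq = Counter(f for f, _ in names)
surname_freq = Counter(s for _, s in names)
pair_freq = Counter(names)

# Observed frequency of the full name...
observed = pair_freq[("mohammed", "khan")] / n
# ...vs what independence of the two columns would predict
expected = (forename_freq["mohammed"] / n) * (surname_freq["khan"] / n)
print(observed, expected)  # the joint frequency exceeds the independent estimate
```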

You can find full API docs for ForenameSurnameComparison here

+
import splink.comparison_library as cl
+
+full_name_comparison = cl.ForenameSurnameComparison("forename", "surname")
+
+
print(full_name_comparison.get_comparison("duckdb").human_readable_description)
+
+
Comparison 'ForenameSurnameComparison' of "forename" and "surname".
+Similarity is assessed using the following ComparisonLevels:
+    - '(forename is NULL) AND (surname is NULL)' with SQL rule: ("forename_l" IS NULL OR "forename_r" IS NULL) AND ("surname_l" IS NULL OR "surname_r" IS NULL)
+    - '(Exact match on forename) AND (Exact match on surname)' with SQL rule: ("forename_l" = "forename_r") AND ("surname_l" = "surname_r")
+    - 'Match on reversed cols: forename and surname' with SQL rule: "forename_l" = "surname_r" AND "forename_r" = "surname_l"
+    - '(Jaro-Winkler distance of forename >= 0.92) AND (Jaro-Winkler distance of surname >= 0.92)' with SQL rule: (jaro_winkler_similarity("forename_l", "forename_r") >= 0.92) AND (jaro_winkler_similarity("surname_l", "surname_r") >= 0.92)
+    - '(Jaro-Winkler distance of forename >= 0.88) AND (Jaro-Winkler distance of surname >= 0.88)' with SQL rule: (jaro_winkler_similarity("forename_l", "forename_r") >= 0.88) AND (jaro_winkler_similarity("surname_l", "surname_r") >= 0.88)
+    - 'Exact match on surname' with SQL rule: "surname_l" = "surname_r"
+    - 'Exact match on forename' with SQL rule: "forename_l" = "forename_r"
+    - 'All other comparisons' with SQL rule: ELSE
+
+

As noted in the feature engineering guide, to take advantage of term frequency adjustments on full name, you need to derive a full name column prior to importing data into Splink. You then provide the column name using the forename_surname_concat_col_name argument:

+
full_name_comparison = cl.ForenameSurnameComparison("forename", "surname", forename_surname_concat_col_name="first_and_last_name")
+
+

To see this as a specifications dictionary you can call

+
full_name_comparison.get_comparison("duckdb").as_dict()
+
+

which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide, if desired.

+
+ +

Postcode Comparisons

+

See Feature Engineering for more details.

+
import splink.comparison_library as cl
+
+pc_comparison = cl.PostcodeComparison("postcode")
+
+
print(pc_comparison.get_comparison("duckdb").human_readable_description)
+
+
Comparison 'PostcodeComparison' of "postcode".
+Similarity is assessed using the following ComparisonLevels:
+    - 'postcode is NULL' with SQL rule: "postcode_l" IS NULL OR "postcode_r" IS NULL
+    - 'Exact match on full postcode' with SQL rule: "postcode_l" = "postcode_r"
+    - 'Exact match on sector' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')
+    - 'Exact match on district' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '')
+    - 'Exact match on area' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}', 0), '')
+    - 'All other comparisons' with SQL rule: ELSE
+
+

If you have derived latitude and longitude columns, you can model geographical distances.

+
pc_comparison = cl.PostcodeComparison("postcode", lat_col="lat", long_col="long", km_thresholds=[1,10,50])
+print(pc_comparison.get_comparison("duckdb").human_readable_description)
+
+
Comparison 'PostcodeComparison' of "postcode", "long" and "lat".
+Similarity is assessed using the following ComparisonLevels:
+    - 'postcode is NULL' with SQL rule: "postcode_l" IS NULL OR "postcode_r" IS NULL
+    - 'Exact match on postcode' with SQL rule: "postcode_l" = "postcode_r"
+    - 'Exact match on transformed postcode' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')
+    - 'Distance in km <= 1' with SQL rule:  cast( acos( case when ( sin( radians("lat_l") ) * sin( radians("lat_r") ) + cos( radians("lat_l") ) * cos( radians("lat_r") ) * cos( radians("long_r" - "long_l") ) ) > 1 then 1 when ( sin( radians("lat_l") ) * sin( radians("lat_r") ) + cos( radians("lat_l") ) * cos( radians("lat_r") ) * cos( radians("long_r" - "long_l") ) ) < -1 then -1 else ( sin( radians("lat_l") ) * sin( radians("lat_r") ) + cos( radians("lat_l") ) * cos( radians("lat_r") ) * cos( radians("long_r" - "long_l") ) ) end ) * 6371 as float ) <= 1
+    - 'Distance in km <= 10' with SQL rule:  cast( acos( case when ( sin( radians("lat_l") ) * sin( radians("lat_r") ) + cos( radians("lat_l") ) * cos( radians("lat_r") ) * cos( radians("long_r" - "long_l") ) ) > 1 then 1 when ( sin( radians("lat_l") ) * sin( radians("lat_r") ) + cos( radians("lat_l") ) * cos( radians("lat_r") ) * cos( radians("long_r" - "long_l") ) ) < -1 then -1 else ( sin( radians("lat_l") ) * sin( radians("lat_r") ) + cos( radians("lat_l") ) * cos( radians("lat_r") ) * cos( radians("long_r" - "long_l") ) ) end ) * 6371 as float ) <= 10
+    - 'Distance in km <= 50' with SQL rule:  cast( acos( case when ( sin( radians("lat_l") ) * sin( radians("lat_r") ) + cos( radians("lat_l") ) * cos( radians("lat_r") ) * cos( radians("long_r" - "long_l") ) ) > 1 then 1 when ( sin( radians("lat_l") ) * sin( radians("lat_r") ) + cos( radians("lat_l") ) * cos( radians("lat_r") ) * cos( radians("long_r" - "long_l") ) ) < -1 then -1 else ( sin( radians("lat_l") ) * sin( radians("lat_r") ) + cos( radians("lat_l") ) * cos( radians("lat_r") ) * cos( radians("long_r" - "long_l") ) ) end ) * 6371 as float ) <= 50
+    - 'All other comparisons' with SQL rule: ELSE
+
+
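The km thresholds use the spherical-law-of-cosines great-circle distance embedded in the SQL above, with an Earth radius of 6371 km and the cosine clamped to [-1, 1] to guard against floating-point drift. The same computation in plain Python (for illustration; the real computation happens in the SQL backend):

```python
from math import acos, cos, radians, sin

def distance_km(lat_l, long_l, lat_r, long_r):
    """Great-circle distance via the spherical law of cosines,
    mirroring the SQL rule above."""
    c = (sin(radians(lat_l)) * sin(radians(lat_r))
         + cos(radians(lat_l)) * cos(radians(lat_r))
         * cos(radians(long_r - long_l)))
    c = max(-1.0, min(1.0, c))  # clamp, as the SQL's CASE expression does
    return acos(c) * 6371       # Earth radius in km, as in the SQL

# London to Birmingham: roughly 160 km, so this pair would only
# satisfy the "Distance in km <= 50" level if the threshold were larger
print(round(distance_km(51.5074, -0.1278, 52.4862, -1.8904)))
```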

To see this as a specifications dictionary you can call

+
pc_comparison.get_comparison("duckdb").as_dict()
+
+

which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide, if desired.

+
+ +

Email Comparison

+

You can find full API docs for EmailComparison here

+
import splink.comparison_library as cl
+
+email_comparison = cl.EmailComparison("email")
+
+
print(email_comparison.get_comparison("duckdb").human_readable_description)
+
+
Comparison 'EmailComparison' of "email".
+Similarity is assessed using the following ComparisonLevels:
+    - 'email is NULL' with SQL rule: "email_l" IS NULL OR "email_r" IS NULL
+    - 'Exact match on email' with SQL rule: "email_l" = "email_r"
+    - 'Exact match on username' with SQL rule: NULLIF(regexp_extract("email_l", '^[^@]+', 0), '') = NULLIF(regexp_extract("email_r", '^[^@]+', 0), '')
+    - 'Jaro-Winkler distance of email >= 0.88' with SQL rule: jaro_winkler_similarity("email_l", "email_r") >= 0.88
+    - 'Jaro-Winkler >0.88 on username' with SQL rule: jaro_winkler_similarity(NULLIF(regexp_extract("email_l", '^[^@]+', 0), ''), NULLIF(regexp_extract("email_r", '^[^@]+', 0), '')) >= 0.88
+    - 'All other comparisons' with SQL rule: ELSE
+
+
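The username levels above compare only the part of the address before the "@". The same extraction can be sketched with Python's standard-library `re` module, mirroring the SQL's `regexp_extract('^[^@]+')` wrapped in NULLIF:

```python
import re

def email_username(email: str):
    """Everything before the '@', or None if there is nothing (NULLIF-style)."""
    m = re.match(r"^[^@]+", email)
    return m.group(0) if m else None

print(email_username("jane.doe@example.com"))  # jane.doe
print(email_username("@example.com"))          # None
```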

To see this as a specifications dictionary you can call

+
email_comparison.get_comparison("duckdb").as_dict()
+
+

which can be used as the basis for a more custom comparison, as shown in the Defining and Customising Comparisons topic guide, if desired.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/comparisons/phonetic.html b/topic_guides/comparisons/phonetic.html new file mode 100644 index 0000000000..ac0daf2ff5 --- /dev/null +++ b/topic_guides/comparisons/phonetic.html @@ -0,0 +1,5608 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Phonetic algorithms - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + + + + +

Phonetic transformation algorithms

+

Phonetic transformation algorithms can be used to identify words that sound similar, even if they are spelled differently (e.g. "Stephen" vs "Steven"). These algorithms provide another type of fuzzy match, and the transformed columns are often generated in the Feature Engineering step of record linkage.

+

Once generated, phonetic matches can be used within comparisons & comparison levels and blocking rules.

+
import splink.comparison_library as cl
+
+first_name_comparison = cl.NameComparison(
+                        "first_name",
+                        dmeta_col_name= "first_name_dm")
print(first_name_comparison.get_comparison("duckdb").human_readable_description)
+
+
Comparison 'NameComparison' of "first_name" and "first_name_dm".
+Similarity is assessed using the following ComparisonLevels:
+
+    - 'first_name is NULL' with SQL rule: "first_name_l" IS NULL OR "first_name_r" IS NULL
+    - 'Exact match on first_name' with SQL rule: "first_name_l" = "first_name_r"
+    - 'Jaro-Winkler distance of first_name >= 0.92' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.92
+    - 'Jaro-Winkler distance of first_name >= 0.88' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.88
+    - 'Array intersection size >= 1' with SQL rule: array_length(list_intersect("first_name_dm_l", "first_name_dm_r")) >= 1
+    - 'Jaro-Winkler distance of first_name >= 0.7' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7
+    - 'All other comparisons' with SQL rule: ELSE
+
+
+ +

Algorithms

+

Below are some examples of well known phonetic transformation algorithms.

+

Soundex

+

Soundex is a phonetic algorithm that assigns a code to words based on their sound. The Soundex algorithm works by converting a word into a four-character code, where the first character is the first letter of the word, and the next three characters are numerical codes representing the word's remaining consonants. Vowels and some consonants, such as H, W, and Y, are ignored.

+
+Algorithm Steps +

The Soundex algorithm works by following these steps:

+
    +
  1. +

    Retain the first letter of the word and remove all other vowels and the letters "H", "W", and "Y".

    +
  2. +
  3. +

    Replace each remaining consonant (excluding the first letter) with a numerical code as follows:

    +
      +
    1. B, F, P, and V are replaced with "1"
    2. +
    3. C, G, J, K, Q, S, X, and Z are replaced with "2"
    4. +
    5. D and T are replaced with "3"
    6. +
    7. L is replaced with "4"
    8. +
    9. M and N are replaced with "5"
    10. +
    11. R is replaced with "6"
    12. +
    +
  4. +
  5. +

    Combine the first letter and the numerical codes to form a four-character code. If there are fewer than four characters, pad the code with zeros.

    +
  6. +
+
+
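The steps above can be sketched in plain Python. This simplified version follows the listed rules directly; real implementations (such as the phonetics package used below) add extra handling, e.g. collapsing adjacent duplicate codes, so their output can differ slightly:

```python
# Digit codes from step 2 of the algorithm description above
CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}

def simple_soundex(word: str) -> str:
    word = word.upper()
    first = word[0]
    # Steps 1-2: drop vowels/H/W/Y, encode the remaining consonants
    digits = [CODES[c] for c in word[1:] if c in CODES]
    # Step 3: combine, then pad with zeros / truncate to four characters
    return (first + "".join(digits) + "000")[:4]

print(simple_soundex("Smith"), simple_soundex("Smyth"))  # S530 S530
```

Note that "Smith" and "Smyth" receive the same code, which is exactly the property phonetic matching relies on.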
+Example +

You can test out the Soundex transformation between two strings through the phonetics package.

+
import phonetics
+print(phonetics.soundex("Smith"), phonetics.soundex("Smyth"))
+
+
+

S5030 S5030

+
+
+
+ +

Metaphone

+

Metaphone is an improved version of the Soundex algorithm that was developed to handle a wider range of words and languages. The Metaphone algorithm assigns a code to a word based on its phonetic pronunciation, but it takes into account the sound of the entire word, rather than just its first letter and consonants. The Metaphone algorithm works by applying a set of rules to the word's pronunciation, such as converting the "TH" sound to a "T" sound, or removing silent letters. The resulting code is a variable-length string of letters that represents the word's pronunciation.

+
+Algorithm Steps +

The Metaphone algorithm works by following these steps:

+
    +
  1. +

    Convert the word to uppercase and remove all non-alphabetic characters.

    +
  2. +
  3. +

    Apply a set of pronunciation rules to the word, such as:

    +
      +
    1. Convert the letters "C" and "K" to "K"
    2. +
    3. Convert the letters "PH" to "F"
    4. +
    5. Convert the letters "W" and "H" to nothing if they are not at the beginning of the word
    6. +
    +
  4. +
  5. +

    Apply a set of replacement rules to the resulting word, such as:

    +
      +
    1. Replace the letter "G" with "J" if it is followed by an "E", "I", or "Y"
    2. +
    3. Replace the letter "C" with "S" if it is followed by an "E", "I", or "Y"
    4. +
    5. Replace the letter "X" with "KS"
    6. +
    +
  6. +
  7. +

    If the resulting word ends with "S", remove it.

    +
  8. +
  9. +

    If the resulting word ends with "ED", "ING", or "ES", remove it.

    +
  10. +
  11. +

    If the resulting word starts with "KN", "GN", "PN", "AE", "WR", or "WH", remove the first letter.

    +
  12. +
  13. +

    If the resulting word starts with a vowel, retain the first letter.

    +
  14. +
  15. +

    Retain the first four characters of the resulting word, or pad it with zeros if it has fewer than four characters.

    +
  16. +
+
+
+Example +

You can test out the Metaphone transformation between two strings through the phonetics package.

+
import phonetics
+print(phonetics.metaphone("Smith"), phonetics.metaphone("Smyth"))
+
+
+

SM0 SM0

+
+
+
+ +

Double Metaphone

Double Metaphone is an extension of the Metaphone algorithm that generates two codes for each word, one for the primary pronunciation and one for an alternate pronunciation. The Double Metaphone algorithm is designed to handle a wide range of languages and dialects, and it is more accurate than the original Metaphone algorithm.

The Double Metaphone algorithm works by applying a set of rules to the word's pronunciation, similar to the Metaphone algorithm, but it generates two codes for each word. The primary code is the most likely pronunciation of the word, while the alternate code represents a less common pronunciation.

Algorithm Steps

The Double Metaphone algorithm works by following these steps:
1. Convert the word to uppercase and remove all non-alphabetic characters.

2. Apply a set of pronunciation rules to the word, such as:

    1. Convert the letters "C" and "K" to "K"
    2. Convert the letters "PH" to "F"
    3. Convert the letters "W" and "H" to nothing if they are not at the beginning of the word

3. Apply a set of replacement rules to the resulting word, such as:

    1. Replace the letter "G" with "J" if it is followed by an "E", "I", or "Y"
    2. Replace the letter "C" with "S" if it is followed by an "E", "I", or "Y"
    3. Replace the letter "X" with "KS"

4. If the resulting word ends with "S", remove it.

5. If the resulting word ends with "ED", "ING", or "ES", remove it.

6. If the resulting word starts with "KN", "GN", "PN", "AE", "WR", or "WH", remove the first letter.

7. If the resulting word starts with "X", "Z", "GN", or "KN", retain the first two characters.

8. Apply a second set of rules to the resulting word to generate an alternative code.

9. Return the primary and alternative codes as a tuple.

The Alternative Double Metaphone code takes into account different contexts in the word and is generated by following these steps:

1. Apply a set of prefix rules, such as:

    1. Convert the letter "G" at the beginning of the word to "K" if it is followed by "N", "NED", or "NER"
    2. Convert the letter "A" at the beginning of the word to "E" if it is followed by "SCH"

2. Apply a set of suffix rules, such as:

    1. Convert the letters "E" and "I" at the end of the word to "Y"
    2. Convert the letters "S" and "Z" at the end of the word to "X"
    3. Remove the letter "D" at the end of the word if it is preceded by "N"

3. Apply a set of replacement rules, such as:

    1. Replace the letter "C" with "X" if it is followed by "IA" or "H"
    2. Replace the letter "T" with "X" if it is followed by "IA" or "CH"

4. Retain the first four characters of the resulting word, or pad it with zeros if it has fewer than four characters.

5. If the resulting word starts with "X", "Z", "GN", or "KN", retain the first two characters.

6. Return the alternative code.
Example

You can test out the Double Metaphone transformation between two strings through the phonetics package.

import phonetics
print(phonetics.dmetaphone("Smith"), phonetics.dmetaphone("Smyth"))

Output:

('SM0', 'XMT') ('SM0', 'XMT')
Regular expressions

Extracting partial strings

It can sometimes be useful to make comparisons based on substrings or parts of column values. For example, one approach to comparing postcodes is to consider their constituent components, e.g. area, district, etc. (see Feature Engineering for more details).

We can use functions such as substrings and regular expressions to compare strings without needing to engineer new features from source data.

Splink supports this functionality via the use of ColumnExpression.

Examples

1. Exact match on postcode area

Suppose you wish to make comparisons on a postcode column in your data, but only care about finding links between people who share the same area code (given by the first 1 to 2 letters of the postcode). The regular expression to pick out these leading letters is ^[A-Z]{1,2}:
import splink.comparison_level_library as cll
from splink import ColumnExpression

pc_ce = ColumnExpression("postcode").regex_extract("^[A-Z]{1,2}")
print(cll.ExactMatchLevel(pc_ce).get_comparison_level("duckdb").sql_condition)

NULLIF(regexp_extract("postcode_l", '^[A-Z]{1,2}', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Z]{1,2}', 0), '')

We may therefore configure a comparison as follows:

from splink.comparison_library import CustomComparison

cc = CustomComparison(
    output_column_name="postcode",
    comparison_levels=[
        cll.NullLevel("postcode"),
        cll.ExactMatchLevel(pc_ce),
        cll.ElseLevel()
    ]
)
print(cc.get_comparison("duckdb").human_readable_description)

Comparison 'CustomComparison' of "postcode".
Similarity is assessed using the following ComparisonLevels:
    - 'postcode is NULL' with SQL rule: "postcode_l" IS NULL OR "postcode_r" IS NULL
    - 'Exact match on transformed postcode' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Z]{1,2}', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Z]{1,2}', 0), '')
    - 'All other comparisons' with SQL rule: ELSE
person_id_l | person_id_r | postcode_l | postcode_r | comparison_level
7           | 1           | SE1P 0NY   | SE1P 0NY   | exact match
5           | 1           | SE2 4UZ    | SE1P 0NY   | exact match
9           | 2           | SW14 7PQ   | SW3 9JG    | exact match
4           | 8           | N7 8RL     | EC2R 8AH   | else level
6           | 3           | SE2 4UZ    |            | null level

2. Exact match on initial

In this example we use the .substr function to create a comparison level based on the first letter of a column value.

Note that the substr function is 1-indexed, so the first character is given by substr(1, 1). The first two characters would be given by substr(1, 2).

import splink.comparison_level_library as cll
from splink import ColumnExpression

initial = ColumnExpression("first_name").substr(1, 1)
print(cll.ExactMatchLevel(initial).get_comparison_level("duckdb").sql_condition)

SUBSTRING("first_name_l", 1, 1) = SUBSTRING("first_name_r", 1, 1)

Additional info

Regular expressions containing "\" (the Python escape character) are tricky to make work with the Spark linker due to escaping, so consider using alternative syntax, for example replacing "\d" with "[0-9]".

Different regex patterns can achieve the same result but with more or less efficiency. You might want to consider optimising your regular expressions to improve performance (see here, for example).

Term-Frequency Adjustments

Problem Statement

A shortcoming of the basic Fellegi-Sunter model is that it doesn't account for skew in the distributions of linking variables. A stark example is a binary variable such as gender in the prison population, where male offenders outnumber female offenders by 10:1.

How does this affect our m and u probabilities?

• The m probability is unaffected - given two records are a match, the gender field should also match with roughly the same probability for males and females.

• Given two records are not a match, however, it is far more likely that both records will be male than that they will both be female - the u probability is too low for the more common value (male) and too high otherwise.

In this example, one solution might be to create an extra comparison level for matches on gender:

• l.gender = r.gender AND l.gender = 'Male'

• l.gender = r.gender AND l.gender = 'Female'

However, this complexity forces us to estimate two m probabilities when one would do, and it becomes impractical if we extend to higher-cardinality variables like surname, which would require thousands of additional comparison levels.

This problem used to be addressed with an ex-post (after the fact) solution - once the linking is done, we look at the average match probability for each value in a column to determine which values tend to be stronger indicators of a match. If the average match probability for record pairs that share a surname is 0.2, but the average for the specific surname Smith is 0.1, then we know that the match weight for name should be adjusted downwards for Smiths.

The shortcoming of this option is that, in practice, the model training is conducted on the assumption that all name matches are equally informative, and all of the underlying probabilities are evaluated accordingly. Ideally, we want to be able to account for term frequencies within the Fellegi-Sunter framework as trained by the EM algorithm.

Toy Example

Below is an illustration of 2 datasets (10 records each) with skewed distributions of first name. A link_and_dedupe Splink model concatenates these two tables and deduplicates those 20 records.

In principle, u probabilities for a small dataset like this can be estimated directly - out of 190 possible pairwise comparisons, 77 of them have the same first name. Based on the assumption that matches are rare (i.e. the vast majority of these comparisons are non-matches), we use this as a direct estimate of u. Random sampling makes the same assumption, but uses a manageable-sized sample of a much larger dataset where it would be prohibitively costly to perform all possible comparisons (a Cartesian join).
+

Once we have concatenated our input tables, it is useful to calculate the term frequencies (TF) of each value. Rather than keep a separate TF table, we can add a TF column to the concatenated table - this is what df_concat_with_tf refers to within Splink.

+
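A minimal sketch of deriving such a TF column (pure Python; the column and variable names here are illustrative, not Splink internals):

```python
from collections import Counter

first_names = ["John", "John", "John", "Jane", "Robin"]  # toy column
n = len(first_names)
# Term frequency of each value = its share of the column
tf = {name: count / n for name, count in Counter(first_names).items()}
# Attach the TF to every row, mirroring the idea of df_concat_with_tf
first_names_with_tf = [(name, tf[name]) for name in first_names]
print(tf["John"])  # 0.6
```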

Building on the example above, we can define the m and u probabilities for a specific first name value, and work out an expression for the resulting match weight.

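That expression can be sketched as follows (a reconstruction for intuition, not necessarily Splink's exact internal formulation). Writing $m$ for the level's m probability, $\bar{u}$ for its average u probability, and $\mathrm{tf}(v)$ for the term frequency of the shared value $v$ (so that $u(v) \approx \mathrm{tf}(v)$), the match weight for an exact match on $v$ decomposes as:

$$
\log_2 \frac{m}{\mathrm{tf}(v)} \;=\; \underbrace{\log_2 \frac{m}{\bar{u}}}_{\text{base match weight}} \;+\; \underbrace{\log_2 \frac{\bar{u}}{\mathrm{tf}(v)}}_{\text{TF adjustment}}
$$

The second term is positive for values rarer than average and negative for common ones.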

Just as we can add independent match weights for name, DOB and other comparisons (as shown in the Splink waterfall charts), we can also add an independent TF adjustment term for each comparison. This is useful because:

• The TF adjustment doesn't depend on m, and therefore does not have to be estimated by the EM algorithm - it is known already

• The EM algorithm benefits from the TF adjustment (unlike previous post hoc implementations)

• It is trivially easy to "turn off" TF adjustments in our final match weights if we wish

• We can easily disentangle and visualise the aggregate significance of a particular column, separately from the deviations within it (see charts below)

Visualising TF Adjustments

For an individual comparison of two records, we can see the impact of TF adjustments in the waterfall charts:

This example shows two records having a match weight of +15.69 due to a match on first name, surname and DOB. Because all 3 of these have relatively uncommon values, they each have an additional term frequency adjustment contributing around +5 to the final match weight.

We can also see these match weights and TF adjustments summarised using a chart like the below to highlight common and uncommon names. We do this already using the Splink linker's profile_columns method, but once we know the u probabilities for our comparison columns, we can show these outliers in terms of their impact on match weight:

In this example of names from FEBRL data used in the demo notebooks, we see that a match on first name has a match weight of +6. For an uncommon name like Portia this is increased, whereas a common name like Jack has a reduced match weight. This chart can be generated using `linker.tf_adjustment_chart("name")`.

Depending on how you compose your Splink settings, TF adjustments can be applied to a specific comparison level in different ways:

ComparisonLibrary and ComparisonTemplateLibrary functions

import splink.comparison_library as cl
import splink.comparison_template_library as ctl

sex_comparison = cl.ExactMatch("sex").configure(term_frequency_adjustments=True)

name_comparison = cl.JaroWinklerAtThresholds(
    "name",
    score_threshold_or_thresholds=[0.9, 0.8],
).configure(term_frequency_adjustments=True)

email_comparison = ctl.EmailComparison("email").configure(
    term_frequency_adjustments=True,
)

Comparison level library functions

import splink.comparison_level_library as cll
import splink.comparison_library as cl

name_comparison = cl.CustomComparison(
    output_column_name="name",
    comparison_description="Full name",
    comparison_levels=[
        cll.NullLevel("full_name"),
        cll.ExactMatchLevel("full_name").configure(tf_adjustment_column="full_name"),
        cll.ColumnsReversedLevel("first_name", "surname").configure(
            tf_adjustment_column="surname"
        ),
        cll.ElseLevel(),
    ],
)

Providing a detailed spec as a dictionary

comparison_first_name = {
    "output_column_name": "first_name",
    "comparison_description": "First name jaro dmeta",
    "comparison_levels": [
        {
            "sql_condition": "first_name_l IS NULL OR first_name_r IS NULL",
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": "first_name_l = first_name_r",
            "label_for_charts": "Exact match",
            "tf_adjustment_column": "first_name",
            "tf_adjustment_weight": 1.0,
            "tf_minimum_u_value": 0.001,
        },
        {
            "sql_condition": "jaro_winkler_sim(first_name_l, first_name_r) > 0.8",
            "label_for_charts": "Jaro-Winkler similarity > 0.8",
            "tf_adjustment_column": "first_name",
            "tf_adjustment_weight": 0.5,
            "tf_minimum_u_value": 0.001,
        },
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],
}

More advanced applications

The code examples above show how we can use term frequencies for different columns for different comparison levels, and demonstrate a few other features of the TF adjustment implementation in Splink:

Multiple columns

Each comparison level can be adjusted on the basis of a specified column. In the case of exact match levels this is trivial, but it allows some partial matches to be reframed as exact matches on a different derived column. One example could be ethnicity, often provided in codes as a letter (W/M/B/A/O - the ethnic group) and a number. Without TF adjustments, an ethnicity comparison might have 3 levels - exact match, match on ethnic group (LEFT(ethnicity,1)), no match. By creating a derived column ethnic_group = LEFT(ethnicity,1) we can apply TF adjustments to both levels.

ethnicity_comparison = cl.CustomComparison(
    output_column_name="ethnicity",
    comparison_description="Self-defined ethnicity",
    comparison_levels=[
        cll.NullLevel("ethnicity"),
        cll.ExactMatchLevel("ethnicity").configure(tf_adjustment_column="ethnicity"),
        cll.ExactMatchLevel("ethnic_group").configure(tf_adjustment_column="ethnic_group"),
        cll.ElseLevel(),
    ],
)

A more critical example would be a full name comparison that uses separate first name and surname columns. Previous implementations would apply TF adjustments to each name component independently, so "John Smith" would be adjusted down for the common name "John" and then again for the common name "Smith". However, the frequencies of names are not generally independent (e.g. "Mohammed Khan" is a relatively common full name despite neither name occurring frequently). A simple full name comparison could therefore be structured as follows:

name_comparison = cl.CustomComparison(
    output_column_name="name",
    comparison_description="Full name",
    comparison_levels=[
        cll.NullLevel("full_name"),
        cll.ExactMatchLevel("full_name").configure(tf_adjustment_column="full_name"),
        cll.ExactMatchLevel("first_name").configure(tf_adjustment_column="first_name"),
        cll.ExactMatchLevel("surname").configure(tf_adjustment_column="surname"),
        cll.ElseLevel(),
    ],
)

Fuzzy matches

All of the above discussion of TF adjustments has assumed an exact match on the column in question, but this need not be the case. Where we have a "fuzzy" match between string values, it is generally assumed that there has been some small corruption in the text, for a number of possible reasons. A trivial example could be "Smith" vs "Smith " which we know to be equivalent even if not an exact string match.

In the case of a fuzzy match, we may decide it is desirable to apply TF adjustments for the same reasons as an exact match, but given there are now two distinct sides to the comparison, there are also two different TF adjustments. Building on our assumption that one side is the "correct" or standard value and the other contains some mistake, Splink will simply use the greater of the two term frequencies. There should be more "Smith"s than "Smith "s, so the former provides the best estimate of the true prevalence of the name Smith in the data.

In cases where this assumption might not hold and both values are valid and distinct (e.g. "Alex" v "Alexa"), this behaviour is still desirable. Taking the more common of the two ensures that we err on the side of lowering the match score for a more common name rather than increasing the score by assuming the less common name.
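The "greater of the two term frequencies" rule amounts to a one-line choice (the frequencies here are invented for illustration):

```python
# Hypothetical term frequencies for the two sides of a fuzzy match
tf_smith, tf_smith_typo = 0.004, 0.00002   # "Smith" vs "Smith "
tf_used = max(tf_smith, tf_smith_typo)     # the more common value wins
print(tf_used)  # 0.004
```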

TF adjustments will not be applied to any comparison level without explicitly being turned on, but to allow for some middle ground when applying them to a fuzzy match column, there is a tf_adjustment_weight setting that can down-weight the TF adjustment. A weight of zero is equivalent to turning TF adjustments off, while a weight of 0.5 means the adjustment match weights are halved, mitigating their impact:

{
  "sql_condition": "jaro_winkler_sim(first_name_l, first_name_r) > 0.8",
  "label_for_charts": "Jaro-Winkler similarity > 0.8",
  "tf_adjustment_column": "first_name",
  "tf_adjustment_weight": 0.5
}
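To make the halving concrete, here is a sketch with invented numbers, using log base 2 as match weights do (Splink's exact internals may differ):

```python
import math

u_average = 0.004   # hypothetical average u for the level
tf_value = 0.0005   # hypothetical term frequency of the observed value
full_adjustment = math.log2(u_average / tf_value)  # log2(8) = 3.0 match-weight bits
scaled = 0.5 * full_adjustment                     # halved by tf_adjustment_weight=0.5
print(full_adjustment, scaled)  # 3.0 1.5
```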

Low-frequency outliers

Another example of where you may wish to limit the impact of TF adjustments is for exceedingly rare values. As defined above, the TF-adjusted match weight K is inversely proportional to the term frequency, allowing K to become very large in some cases.

Let's say we have a handful of records with the misspelt first name "Siohban" (rather than "Siobhan"). Fuzzy matches between the two spellings will rightly be adjusted on the basis of the frequency of the correct spelling, but there will be a small number of cases where the misspellings match one another. Given we suspect these values are more likely to be misspellings of more common names, rather than a distinct and very rare name, we can mitigate this effect by imposing a minimum value on the term frequency used (equivalent to the u value). This can be added to your full settings dictionary as in the example above using "tf_minimum_u_value": 0.001. This means that for values with a frequency of less than 1 in 1000, a value of 0.001 will be used instead.
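The floor can be sketched with invented numbers:

```python
tf_value = 1 / 50_000            # hypothetical frequency of "Siohban"
tf_minimum_u_value = 0.001
# Values rarer than 1 in 1000 are clamped up to the minimum
u_used = max(tf_value, tf_minimum_u_value)
print(u_used)  # 0.001
```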


Feature Engineering for Data Linkage

During record linkage, the features in a given dataset are used to provide evidence as to whether two records are a match. Like any predictive model, the quality of a Splink model is dictated by the features provided.

Below are some examples of features that can be created from common columns, and how to create more detailed comparisons with them in a Splink model.

Postcodes

In this example, we derive latitude and longitude coordinates from a postcode column to create a more nuanced comparison. By doing so, we account for similarity not just in the string of the postcode, but in the geographical location it represents. This could be useful if we believe, for instance, that people move house, but generally stay within the same geographical area.

We start with a comparison that uses the postcode's components. For example, UK postcodes can be broken down into the following substrings:

UK postcode components from https://ideal-postcodes.co.uk/guides/uk-postcode-format (see image source for more details).

The pre-built postcode comparison generates a comparison with levels for an exact match on full postcode, sector, district and area in turn.

Code examples to use the comparison template:
import splink.comparison_library as cl

pc_comparison = cl.PostcodeComparison("postcode").get_comparison("duckdb")
print(pc_comparison.human_readable_description)
Output:

Comparison 'PostcodeComparison' of "postcode".
Similarity is assessed using the following ComparisonLevels:
    - 'postcode is NULL' with SQL rule: "postcode_l" IS NULL OR "postcode_r" IS NULL
    - 'Exact match on full postcode' with SQL rule: "postcode_l" = "postcode_r"
    - 'Exact match on sector' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')
    - 'Exact match on district' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '')
    - 'Exact match on area' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}', 0), '')
    - 'All other comparisons' with SQL rule: ELSE
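The regular expressions in the generated SQL above can be checked in plain Python against an example postcode:

```python
import re

postcode = "SW14 7PQ"
patterns = {
    "sector": r"^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]",
    "district": r"^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?",
    "area": r"^[A-Za-z]{1,2}",
}
parts = {name: re.match(pat, postcode).group(0) for name, pat in patterns.items()}
print(parts)  # {'sector': 'SW14 7', 'district': 'SW14', 'area': 'SW'}
```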

Note that this is not able to compute geographical distance by default, because it cannot assume that lat-long coordinates are available.

We now proceed to derive lat and long columns so that we can take advantage of geographical distance. We will use the ONS Postcode Directory to look up the lat-long coordinates for each postcode.

Read in a dataset with postcodes:
import duckdb

from splink import splink_datasets

df = splink_datasets.historical_50k

df_with_pc = """
WITH postcode_lookup AS (
    SELECT
        pcd AS postcode,
        lat,
        long
    FROM
        read_csv_auto('./path/to/ONSPD_FEB_2023_UK.csv')
)
SELECT
    df.*,
    postcode_lookup.lat,
    postcode_lookup.long
FROM
    df
LEFT JOIN
    postcode_lookup
ON
    upper(df.postcode_fake) = postcode_lookup.postcode
"""

df_with_postcode = duckdb.sql(df_with_pc)

Now that coordinates have been added, a more detailed postcode comparison can be produced using PostcodeComparison:

pc_comparison = cl.PostcodeComparison(
    "postcode", lat_col="lat", long_col="long", km_thresholds=[1, 10]
).get_comparison("duckdb")
print(pc_comparison.human_readable_description)
Output:

Comparison 'PostcodeComparison' of "postcode", "lat" and "long".
Similarity is assessed using the following ComparisonLevels:
    - 'postcode is NULL' with SQL rule: "postcode_l" IS NULL OR "postcode_r" IS NULL
    - 'Exact match on postcode' with SQL rule: "postcode_l" = "postcode_r"
    - 'Exact match on transformed postcode' with SQL rule: NULLIF(regexp_extract("postcode_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(regexp_extract("postcode_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')
    - 'Distance less than 1km' with SQL rule:
        cast(
            acos(
                case
                    when (
                        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
                        cos( radians("lat_l") ) * cos( radians("lat_r") )
                            * cos( radians("long_r" - "long_l") )
                    ) > 1 then 1
                    when (
                        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
                        cos( radians("lat_l") ) * cos( radians("lat_r") )
                            * cos( radians("long_r" - "long_l") )
                    ) < -1 then -1
                    else (
                        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
                        cos( radians("lat_l") ) * cos( radians("lat_r") )
                            * cos( radians("long_r" - "long_l") )
                    )
                end
            ) * 6371
            as float
        )
    <= 1
    - 'Distance less than 10km' with SQL rule:
        cast(
            acos(
                case
                    when (
                        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
                        cos( radians("lat_l") ) * cos( radians("lat_r") )
                            * cos( radians("long_r" - "long_l") )
                    ) > 1 then 1
                    when (
                        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
                        cos( radians("lat_l") ) * cos( radians("lat_r") )
                            * cos( radians("long_r" - "long_l") )
                    ) < -1 then -1
                    else (
                        sin( radians("lat_l") ) * sin( radians("lat_r") ) +
                        cos( radians("lat_l") ) * cos( radians("lat_r") )
                            * cos( radians("long_r" - "long_l") )
                    )
                end
            ) * 6371
            as float
        )
    <= 10
    - 'All other comparisons' with SQL rule: ELSE

or by using cll.DistanceInKMLevel in conjunction with other comparison levels:

import splink.comparison_level_library as cll
import splink.comparison_library as cl

custom_postcode_comparison = cl.CustomComparison(
    output_column_name="postcode",
    comparison_description="Postcode",
    comparison_levels=[
        cll.NullLevel("postcode"),
        cll.ExactMatchLevel("postcode"),
        cll.DistanceInKMLevel("lat", "long", 1),
        cll.DistanceInKMLevel("lat", "long", 10),
        cll.DistanceInKMLevel("lat", "long", 50),
        cll.ElseLevel(),
    ],
)
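The distance levels rest on a spherical law of cosines calculation, as in the generated SQL shown earlier. A plain-Python sketch of the same formula (for intuition; Splink generates this in SQL, not Python):

```python
import math

def distance_km(lat_l, long_l, lat_r, long_r):
    """Great-circle distance via the spherical law of cosines."""
    cos_angle = (
        math.sin(math.radians(lat_l)) * math.sin(math.radians(lat_r))
        + math.cos(math.radians(lat_l)) * math.cos(math.radians(lat_r))
        * math.cos(math.radians(long_r - long_l))
    )
    # Clamp to [-1, 1] to guard against floating-point drift,
    # mirroring the CASE expression in the SQL
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.acos(cos_angle) * 6371  # Earth radius in km

print(distance_km(51.5074, -0.1278, 48.8566, 2.3522))  # London-Paris, ~340 km
```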
Phonetic transformations

Phonetic transformation algorithms can be used to identify words that sound similar, even if they are spelled differently. These are particularly useful for names and can be used as an additional comparison level within name comparisons.

For a more detailed explanation of phonetic transformation algorithms, see the topic guide.

Example

There are a number of Python packages which support phonetic transformations that can be applied to a pandas dataframe, which can then be loaded into the Linker. For example, creating a Double Metaphone column with the phonetics Python library:
import pandas as pd
import phonetics

from splink import splink_datasets
df = splink_datasets.fake_1000

# Apply the Double Metaphone algorithm to each name, handling nulls
def dmetaphone_name(name):
    if name is None:
        return None
    return phonetics.dmetaphone(name)

# Apply the function to the first_name and surname columns
df['first_name_dm'] = df['first_name'].apply(dmetaphone_name)
df['surname_dm'] = df['surname'].apply(dmetaphone_name)

df.head()
Output:

  | unique_id | first_name | surname | dob        | city   | email                          | group | first_name_dm | surname_dm
0 | 0         | Julia      |         | 2015-10-29 | London | hannah88@powers.com            | 0     | ('JL', 'AL')  |
1 | 1         | Julia      | Taylor  | 2015-07-31 | London | hannah88@powers.com            | 0     | ('JL', 'AL')  | ('TLR', '')
2 | 2         | Julia      | Taylor  | 2016-01-27 | London | hannah88@powers.com            | 0     | ('JL', 'AL')  | ('TLR', '')
3 | 3         | Julia      | Taylor  | 2015-10-29 |        | hannah88opowersc@m             | 0     | ('JL', 'AL')  | ('TLR', '')
4 | 4         | oNah       | Watson  | 2008-03-23 | Bolton | matthew78@ballard-mcdonald.net | 1     | ('AN', '')    | ('ATSN', 'FTSN')

Note: Soundex and Metaphone are also supported in phonetics.

Now that the dmetaphone columns have been added, they can be used within comparisons. For example, using the NameComparison function from the comparison library:

import splink.comparison_library as cl

comparison = cl.NameComparison("first_name", dmeta_col_name="first_name_dm").get_comparison("duckdb")
comparison.human_readable_description
Output:

Comparison 'NameComparison' of "first_name" and "first_name_dm".
Similarity is assessed using the following ComparisonLevels:
    - 'first_name is NULL' with SQL rule: "first_name_l" IS NULL OR "first_name_r" IS NULL
    - 'Exact match on first_name' with SQL rule: "first_name_l" = "first_name_r"
    - 'Jaro-Winkler distance of first_name >= 0.92' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.92
    - 'Jaro-Winkler distance of first_name >= 0.88' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.88
    - 'Array intersection size >= 1' with SQL rule: array_length(list_intersect("first_name_dm_l", "first_name_dm_r")) >= 1
    - 'Jaro-Winkler distance of first_name >= 0.7' with SQL rule: jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7
    - 'All other comparisons' with SQL rule: ELSE
Full name

If Splink has access to a combined full name column, it can use the term frequency of the full name, as opposed to treating forename and surname as independent.

This can be important because correlations in names are common. For example, in the UK, "Mohammed Khan" is a more common full name than the individual frequencies of "Mohammed" or "Khan" would suggest.

The following example shows how to do this.

For more on term frequency, see the dedicated topic guide.

Example

Derive a full name column:
import pandas as pd

from splink import splink_datasets

df = splink_datasets.fake_1000

df['full_name'] = df['first_name'] + ' ' + df['surname']

df.head()

Now that the full_name column has been added, it can be used within comparisons. For example, using the ForenameSurnameComparison function from the comparison library:

comparison = cl.ForenameSurnameComparison(
    "first_name", "surname", forename_surname_concat_col_name="full_name"
)
comparison.get_comparison("duckdb").as_dict()
Output:

{'output_column_name': 'first_name_surname',
'comparison_levels': [{'sql_condition': '("first_name_l" IS NULL OR "first_name_r" IS NULL) AND ("surname_l" IS NULL OR "surname_r" IS NULL)',
'label_for_charts': '(first_name is NULL) AND (surname is NULL)',
'is_null_level': True},
{'sql_condition': '"full_name_l" = "full_name_r"',
'label_for_charts': 'Exact match on full_name',
'tf_adjustment_column': 'full_name',
'tf_adjustment_weight': 1.0},
{'sql_condition': '"first_name_l" = "surname_r" AND "first_name_r" = "surname_l"',
'label_for_charts': 'Match on reversed cols: first_name and surname'},
{'sql_condition': '(jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.92) AND (jaro_winkler_similarity("surname_l", "surname_r") >= 0.92)',
'label_for_charts': '(Jaro-Winkler distance of first_name >= 0.92) AND (Jaro-Winkler distance of surname >= 0.92)'},
{'sql_condition': '(jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.88) AND (jaro_winkler_similarity("surname_l", "surname_r") >= 0.88)',
'label_for_charts': '(Jaro-Winkler distance of first_name >= 0.88) AND (Jaro-Winkler distance of surname >= 0.88)'},
{'sql_condition': '"surname_l" = "surname_r"',
'label_for_charts': 'Exact match on surname',
'tf_adjustment_column': 'surname',
'tf_adjustment_weight': 1.0},
{'sql_condition': '"first_name_l" = "first_name_r"',
'label_for_charts': 'Exact match on first_name',
'tf_adjustment_column': 'first_name',
'tf_adjustment_weight': 1.0},
{'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
'comparison_description': 'ForenameSurnameComparison'}

Note that the first level after the null level is now:

+
{'sql_condition': '"full_name_l" = "full_name_r"',
+'label_for_charts': 'Exact match on full_name',
+'tf_adjustment_column': 'full_name',
+'tf_adjustment_weight': 1.0},
+
+

whereas without specifying forename_surname_concat_col_name we would have had:

+
{'sql_condition': '("first_name_l" = "first_name_r") AND ("surname_l" = "surname_r")',
+'label_for_charts': '(Exact match on first_name) AND (Exact match on surname)'},
+
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/evaluation/clusters/graph_metrics.html b/topic_guides/evaluation/clusters/graph_metrics.html new file mode 100644 index 0000000000..9ffb3eec2e --- /dev/null +++ b/topic_guides/evaluation/clusters/graph_metrics.html @@ -0,0 +1,5518 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Graph metrics - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Graph metrics

+

Graph metrics quantify the characteristics of a graph. A simple example of a graph metric is cluster size, which is the number of nodes within a cluster.

+

For data linking with Splink, it is useful to sort graph metrics into three categories:

+ +

Each of these is defined below, together with examples and explanations of how they can be applied to linked data to evaluate cluster quality. The examples cover all metrics currently available in Splink.

+
+

Note

+

It is important to bear in mind that whilst graph metrics can be very useful for assessing linkage quality, they are rarely definitive, especially when taken in isolation. A more comprehensive picture can be built by considering various metrics in conjunction with one another.

+

It is also important to consider metrics within the context of their distribution and the underlying dataset. For example: a cluster density (see below) of 0.4 might seem low but could actually be above average for the dataset in question; a cluster of size 80 might be suspiciously large for one dataset but not for another.

+
+

🟣 Node metrics

+

Node metrics quantify the properties of the nodes which live within clusters.

+

Node Degree

+
Definition
+

Node degree is the number of edges connected to a node.

+
Example
+

In the cluster below A has a node degree of 1, whereas D has a node degree of 3.

+

Basic Graph - Records

+
Application in Data Linkage
+

High node degree is generally considered good as it means there are many edges in support of records in a cluster being linked. Nodes with low node degree could indicate links being missed (false negatives) or be the result of a small number of false links (false positives).

+

However, erroneous links (false positives) could also be the reason for high node degree, so it can be useful to validate the edges of highly connected nodes.

+

It is important to consider cluster size when looking at node degree. By definition, larger clusters contain more nodes to form links between, allowing nodes within them to attain higher degrees compared to those in smaller clusters. Consequently, low node degree within larger clusters can carry greater significance.

+

Bear in mind that the degree of a single node in a cluster isn't necessarily representative of the overall connectedness of the cluster. This is where cluster centralisation can help.
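As a sketch, node degree is just a count over a cluster's edge list. The edge list below is invented to match the example cluster above (A with degree 1, D with degree 3):

```python
from collections import Counter

# Invented edge list for a small cluster (labels match the example above)
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("D", "F")]

# Node degree = number of edges connected to each node
degree = Counter()
for left, right in edges:
    degree[left] += 1
    degree[right] += 1

print(degree["A"])  # 1
print(degree["D"])  # 3
```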

+
+ +

🔗 Edge metrics

+

Edge metrics quantify the properties of the edges within a cluster.

+

'is bridge'

+
Definition
+

An edge is classified as a 'bridge' if its removal splits a cluster into two smaller clusters.

+
Example
+

For example, the removal of the link labelled "Bridge" below would break this cluster of 9 nodes into two clusters of 5 and 4 nodes, respectively.

+

+
Application in Data Linkage
+

Bridges can signal false positives in linked data, especially when they join two highly connected sub-clusters. Examining bridges can shed light on issues with the linking process that lead to the formation of false positive links.
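Splink uses igraph to compute this metric, but the idea can be sketched in pure Python: an edge is a bridge if, once it is removed, one endpoint can no longer reach the other. The 9-node example above can be mimicked with an invented edge list of two well-connected sub-clusters joined by a single edge:

```python
from collections import defaultdict, deque

def is_bridge(edges, edge):
    """True if removing `edge` leaves its two endpoints in separate clusters."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if {a, b} != set(edge):  # drop the candidate edge
            adjacency[a].add(b)
            adjacency[b].add(a)
    start, target = edge
    # Breadth-first search from one endpoint of the removed edge
    seen, queue = {start}, deque([start])
    while queue:
        for neighbour in adjacency[queue.popleft()]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return target not in seen

# Invented edge list: sub-clusters of 5 and 4 nodes joined only by ("E", "F")
edges = [
    ("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("C", "E"), ("D", "E"),
    ("F", "G"), ("F", "H"), ("G", "H"), ("H", "I"),
    ("E", "F"),
]

print(is_bridge(edges, ("E", "F")))  # True: removal splits 9 nodes into 5 and 4
print(is_bridge(edges, ("A", "B")))  # False: A still reaches B via C
```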

+
+ +

Cluster metrics

+

Cluster metrics refer to the characteristics of a cluster as a whole, rather than the individual nodes and edges it contains.

+

Cluster Size

+
Definition
+

Cluster size refers to the number of nodes within a cluster.

+
Example
+

The cluster below is of size 5.

+

+
Application in Data Linkage
+

When thinking about cluster size, it is often useful to consider the biggest clusters produced and ask yourself if the sizes seem reasonable for the dataset being linked. For example when linking people, does it make sense that an individual is appearing hundreds of times in the linked data resulting in a cluster of over 100 nodes? If the answer is no, then false positive links are probably being formed.

+

If you don't have an intuition of what seems reasonable, then it is worth inspecting a sample of the largest clusters in Splink's Cluster Studio Dashboard to validate (or invalidate) links. From there you can develop an understanding of what maximum cluster size to expect for your linkage. Bear in mind that a large and highly dense cluster is usually less suspicious than a large low-density cluster.

+

There might also be a lower bound on cluster size. For example, when linking two datasets in which you know each person appears at least once, the minimum expected cluster size will be 2. Clusters smaller than this minimum indicate links have been missed.

+

Cluster Density

+
Definition
+

The density of a cluster is given by the number of edges it contains divided by the maximum possible number of edges. Density ranges from 0 to 1. A density of 1 means that all nodes are connected to all other nodes in a cluster.

+
Example
+

The left cluster below has links between all nodes (giving a density of 1), whereas the right cluster has the minimum number of edges (4) to link 5 nodes together (giving a density of 0.4).
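These numbers can be reproduced with a short sketch: for an undirected cluster of \(n\) nodes, the maximum possible number of edges is \(n(n-1)/2\):

```python
def cluster_density(n_nodes, n_edges):
    """Observed edges divided by the maximum possible number of edges."""
    max_possible_edges = n_nodes * (n_nodes - 1) / 2
    return n_edges / max_possible_edges

print(cluster_density(5, 10))  # fully connected cluster of 5: density 1.0
print(cluster_density(5, 4))   # minimally connected cluster of 5: density 0.4
```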

+

+
Application in Data Linkage
+

When evaluating clusters, a high density (closer to 1) is generally considered good as it means there are many edges in support of the records in a cluster being linked.

+

A low density could indicate links being missed. This could happen, for example, if blocking rules are too tight or the clustering threshold is too high.

+

A sample of low density clusters can be inspected in Splink's Cluster Studio Dashboard via the option sampling_method = "lowest_density_clusters_by_size", which performs stratified sampling across different cluster sizes. When inspecting a cluster, ask yourself the question: why aren't more links being formed between record nodes?

+

Cluster Centralisation

+
+

Work in Progress

+

We are still working out where Cluster Centralisation can be best used in the context of record linkage. At this stage, we do not have clear recommendations or guidance on the best places to use it - so if you have any expertise in this area we would love to hear from you!

+

We will update this guidance as and when we have clearer strategies in this space.

+
+
Definition
+

Cluster centralisation is defined as the deviation from maximum node degree normalised with respect to the maximum possible value. In other words, cluster centralisation tells us about the concentration of edges in a cluster. Centralisation ranges from 0 to 1.
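One common way to make this definition concrete is Freeman degree centralisation: the sum of each node's deviation from the maximum node degree, divided by the largest value that sum could take, \((n-1)(n-2)\). Treat this as an illustrative formulation rather than a statement of Splink's exact internals:

```python
def cluster_centralisation(degrees):
    """Degree centralisation of a cluster from its list of node degrees."""
    n = len(degrees)
    d_max = max(degrees)
    # Sum of deviations from the maximum degree, normalised by the largest
    # possible value of that sum (attained by a star-shaped cluster)
    return sum(d_max - d for d in degrees) / ((n - 1) * (n - 2))

print(cluster_centralisation([4, 1, 1, 1, 1]))            # star of 5 nodes: 1.0
print(round(cluster_centralisation([1, 2, 2, 2, 1]), 3))  # chain of 5 nodes: 0.167
```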

+
Example
+

Coming Soon

+
Application in Data Linkage
+

A high cluster centralisation (closer to 1) indicates that a few nodes are home to significantly more connections compared to the rest of the nodes in a cluster. This can help identify clusters containing nodes with a lower number of connections (low node degree) relative to what is possible for that cluster.

+

Low centralisation suggests that edges are more evenly distributed amongst nodes in a cluster. This can be good if all nodes within a clusters enjoy many connections. However, low centralisation could also indicate that most nodes are not as highly connected as they could be. To check for this, look at low centralisation in conjunction with low density.

+
+ +

A guide on how to compute graph metrics mentioned above with Splink is given in the next chapter.

+

Please note, this topic guide is a work in progress and we welcome any feedback.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/evaluation/clusters/how_to_compute_metrics.html b/topic_guides/evaluation/clusters/how_to_compute_metrics.html new file mode 100644 index 0000000000..ed0bf8e6e7 --- /dev/null +++ b/topic_guides/evaluation/clusters/how_to_compute_metrics.html @@ -0,0 +1,5584 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + How to compute graph metrics - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

How to compute graph metrics with Splink

+

Introduction to the compute_graph_metrics() method

+

To enable users to calculate a variety of graph metrics for their linked data, Splink provides the compute_graph_metrics() method.

+

The method is called on the linker like so:

+
linker.clustering.compute_graph_metrics(df_predict, df_clustered, threshold_match_probability=0.95)
+
+ + +
+ + + + +
+ + + +

Parameters:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NameTypeDescriptionDefault
df_predict + SplinkDataFrame + +
+

The results of linker.inference.predict()

+
+
+ required +
df_clustered + SplinkDataFrame + +
+

The outputs of +linker.clustering.cluster_pairwise_predictions_at_threshold()

+
+
+ required +
threshold_match_probability + float + +
+

Filter the pairwise match +predictions to include only pairwise comparisons with a +match_probability at or above this threshold. If not provided, the value +will be taken from metadata on df_clustered. If no such metadata is +available, this value must be provided.

+
+
+ None +
+ +
+ +
+

Warning

+

threshold_match_probability should be the same as the clustering threshold passed to cluster_pairwise_predictions_at_threshold(). If this information is available to Splink then it will be passed automatically, otherwise the user will have to provide it themselves and take care to ensure that threshold values align.

+
+

The method generates tables containing graph metrics (for nodes, edges and clusters), and returns a data class of Splink dataframes. The individual Splink dataframes containing node, edge and cluster metrics can be accessed as follows:

+
graph_metrics = linker.clustering.compute_graph_metrics(
+    pairwise_predictions, clusters
+)
+
+df_edges = graph_metrics.edges.as_pandas_dataframe()
+df_nodes = graph_metrics.nodes.as_pandas_dataframe()
+df_clusters = graph_metrics.clusters.as_pandas_dataframe()
+
+

The metrics computed by compute_graph_metrics() include all those mentioned in the Graph metrics chapter, namely:

+
    +
  • Node degree
  • +
  • 'Is bridge'
  • +
  • Cluster size
  • +
  • Cluster density
  • +
  • Cluster centralisation
  • +
+

All of these metrics are calculated by default. If you are unable to install the igraph package required for 'is bridge', this metric won't be calculated; however, all other metrics will still be generated.

+

Full code example

+

This code snippet computes graph metrics for a simple Splink dedupe model. A pandas dataframe of cluster metrics is displayed as the final output.

+
import splink.comparison_library as cl
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
+
+df = splink_datasets.historical_50k
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    comparisons=[
+        cl.ExactMatch(
+            "first_name",
+        ).configure(term_frequency_adjustments=True),
+        cl.JaroWinklerAtThresholds("surname", score_threshold_or_thresholds=[0.9, 0.8]),
+        cl.LevenshteinAtThresholds(
+            "postcode_fake", distance_threshold_or_thresholds=[1, 2]
+        ),
+    ],
+    blocking_rules_to_generate_predictions=[
+        block_on("postcode_fake", "first_name"),
+        block_on("first_name", "surname"),
+        block_on("dob", "substr(postcode_fake,1,2)"),
+        block_on("postcode_fake", "substr(dob,1,3)"),
+        block_on("postcode_fake", "substr(dob,4,5)"),
+    ],
+    retain_intermediate_calculation_columns=True,
+)
+
+db_api = DuckDBAPI()
+linker = Linker(df, settings, db_api)
+
+linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+linker.training.estimate_parameters_using_expectation_maximisation(
+    block_on("first_name", "surname")
+)
+
+linker.training.estimate_parameters_using_expectation_maximisation(
+    block_on("dob", "substr(postcode_fake, 1,3)")
+)
+
+pairwise_predictions = linker.inference.predict()
+clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
+    pairwise_predictions, 0.95
+)
+
+graph_metrics = linker.clustering.compute_graph_metrics(pairwise_predictions, clusters)
+
+df_clusters = graph_metrics.clusters.as_pandas_dataframe()
+
+
df_clusters
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
       cluster_id    n_nodes  n_edges  density   cluster_centralisation
0      Q5076213-1    10       31.0     0.688889  0.250000
1      Q760788-1     9        30.0     0.833333  0.214286
2      Q88466525-10  3        3.0      1.000000  0.000000
3      Q88466525-1   10       37.0     0.822222  0.222222
4      Q1386511-1    13       47.0     0.602564  0.272727
...    ...           ...      ...      ...       ...
21346  Q1562561-16   1        0.0      NaN       NaN
21347  Q15999964-5   1        0.0      NaN       NaN
21348  Q5363139-12   1        0.0      NaN       NaN
21349  Q4722328-5    1        0.0      NaN       NaN
21350  Q7528564-13   1        0.0      NaN       NaN
+

21351 rows × 5 columns

+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/evaluation/clusters/overview.html b/topic_guides/evaluation/clusters/overview.html new file mode 100644 index 0000000000..c1971cac67 --- /dev/null +++ b/topic_guides/evaluation/clusters/overview.html @@ -0,0 +1,5367 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Overview - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + +

Cluster Evaluation

+

Graphs provide a natural way to think about linked data (see the "Linked data as graphs" guide for a refresher). Visualising linked data as a graph and employing graph metrics are powerful ways to evaluate linkage quality.

+

Basic Cluster

+

Graph metrics help to give a big-picture view of the clusters generated by a Splink model. Through metric distributions and statistics, we can gauge the quality of clusters and monitor how adjustments to models impact results.

+

Graph metrics can also help us home in on problematic clusters, such as those containing inaccurate links (false positives). Spot-checking can be performed with Splink’s Cluster Studio Dashboard, which enables users to visualise individual clusters and interrogate the links between their member records.

+

Evaluating cluster quality

+

What is a high quality cluster?

+

When it comes to data linking, the highest quality clusters will be those containing all possible true matches (no missed links, a.k.a. false negatives) and no false matches (no false positives). In other words, each cluster should contain precisely those nodes corresponding to records about the same entity.

+

Generating clusters which all adhere to this ideal is rare in practice. For example,

+
    +
  • Blocking rules, necessary to make computations tractable, can prevent record comparisons between some true matches from ever being made
  • +
  • Data limitations can place an upper bound on the level of quality achievable
  • +
+

Despite this, graph metrics can help us get closer to a satisfactory level of quality as well as monitor it going forward.

+

What does cluster quality look like for you?

+

The extent of cluster evaluation efforts and what is considered 'good enough' will vary greatly with linkage use-case. You might already have labelled data or quality assured outputs from another model which define a clear benchmark for cluster quality.

+

Domain knowledge can also set expectations of what is deemed reasonable or good. For example, you might already know that a large cluster (containing say 100 nodes) is suspicious for your deduplicated dataset.

+

However, you may currently have little or no knowledge about the data, nor a clear idea of what good quality clusters look like for your linkage.

+

Whatever the starting point, this topic guide is designed to help users develop a better understanding of their clusters and help focus quality assurance efforts to get the best out of their linkage models.

+

What this topic guide contains

+ +

Please note, this topic guide is a work in progress and we welcome any feedback.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/evaluation/edge_metrics.html b/topic_guides/evaluation/edge_metrics.html new file mode 100644 index 0000000000..1248238f20 --- /dev/null +++ b/topic_guides/evaluation/edge_metrics.html @@ -0,0 +1,5705 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Edge Metrics - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + +

Edge Metrics

+ +

This guide is intended to be a reference guide for Edge Metrics used throughout Splink. It will build up from basic principles into more complex metrics.

+
+

Note

+

All of these metrics are dependent on having a "ground truth" to compare against. This is generally provided by Clerical Labelling (i.e. labels created by a human). For more on how to generate this ground truth (and the impact that can have on Edge Metrics), check out the Clerical Labelling Topic Guide.

+
+

The Basics

+

Any Edge (Link) within a Splink model will fall into one of four categories:

+

True Positive

+

Also known as: True Link

+

A True Positive is a case where a Splink model correctly predicts a match between two records.

+

True Negative

+

Also known as: True Non-link

+

A True Negative is a case where a Splink model correctly predicts a non-match between two records.

+

False Positive

+

Also known as: False Link, Type I Error

+

A False Positive is a case where a Splink model incorrectly predicts a match between two records, when they are actually a non-match.

+

False Negative

+

Also known as: False Non-link, Missed Link, Type II Error

+

A False Negative is a case where a Splink model incorrectly predicts a non-match between two records, when they are actually a match.

+

Confusion Matrix

+

These can be summarised in a Confusion Matrix

+

+

In a perfect model there would be no False Positives or False Negatives (i.e. FP = 0 and FN = 0).

+

Metrics for Linkage

+

The confusion matrix shows counts of each link type, but we are generally more interested in proportions, i.e. what percentage of the time does the model get the answer right?

+

Accuracy

+

The simplest metric is

+
\[\textsf{Accuracy} = \frac{\textsf{True Positives}+\textsf{True Negatives}}{\textsf{All Predictions}}\]
+

This measures the proportion of correct classifications (of any kind). This may be useful for balanced data but high accuracy can be achieved by simply assuming the majority class for highly imbalanced data (e.g. assuming non-matches).
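To make the class-imbalance caveat concrete, here is a small sketch (the counts are invented for illustration) showing a majority-class "model" scoring high accuracy while finding no matches at all:

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of all predictions (links and non-links) that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Invented, highly imbalanced data: 10 true matches among 1,000 comparisons.
# A "model" that predicts non-match for every pair misses all 10 matches,
# yet is correct for all 990 non-matches.
tp, fn, fp, tn = 0, 10, 0, 990

print(accuracy(tp, tn, fp, fn))  # 0.99 - high accuracy despite finding no links
```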

+
+Accuracy in Splink +
    +
  • Accuracy can be calculated in Splink using the accuracy_analysis_from_labels_column and accuracy_analysis_from_labels_table methods. Check out the splink.evaluation docs for more.
  • +
+
+
+ +

+

True Positive Rate (Recall)

+

Also known as: Sensitivity

+

The True Positive Rate (Recall) is the proportion of matches that are correctly predicted by Splink.

+
\[\textsf{Recall} = \frac{\textsf{True Positives}}{\textsf{All Positives}} = \frac{\textsf{True Positives}}{\textsf{True Positives} + \textsf{False Negatives}}\]
+
+Recall in Splink +
    +
  • Recall can be calculated in Splink using the accuracy_analysis_from_labels_column and accuracy_analysis_from_labels_table methods. Check out the splink.evaluation docs for more.
  • +
+
+

True Negative Rate (Specificity)

+

Also known as: Selectivity

+

The True Negative Rate (Specificity) is the proportion of non-matches that are correctly predicted by Splink.

+
\[\textsf{Specificity} = \frac{\textsf{True Negatives}}{\textsf{All Negatives}} = \frac{\textsf{True Negatives}}{\textsf{True Negatives} + \textsf{False Positives}}\]
+
+Specificity in Splink +
    +
  • Specificity can be calculated in Splink using the accuracy_analysis_from_labels_column and accuracy_analysis_from_labels_table methods. Check out the splink.evaluation docs for more.
  • +
+
+

Positive Predictive Value (Precision)

+

The Positive Predictive Value (Precision), is the proportion of predicted matches which are true matches.

+
\[\textsf{Precision} = \frac{\textsf{True Positives}}{\textsf{All Predicted Positives}} = \frac{\textsf{True Positives}}{\textsf{True Positives} + \textsf{False Positives}}\]
+
+Precision in Splink +
    +
  • Precision can be calculated in Splink using the accuracy_analysis_from_labels_column and accuracy_analysis_from_labels_table methods. Check out the splink.evaluation docs for more.
  • +
+
+

Negative Predictive Value

+

The Negative Predictive Value is the proportion of predicted non-matches which are true non-matches.

+
\[\textsf{Negative Predictive Value} = \frac{\textsf{True Negatives}}{\textsf{All Predicted Negatives}} = \frac{\textsf{True Negatives}}{\textsf{True Negatives} + \textsf{False Negatives}}\]
+
+Negative Predictive Value in Splink +
    +
  • Negative predictive value can be calculated in Splink using the accuracy_analysis_from_labels_column and accuracy_analysis_from_labels_table methods. Check out the splink.evaluation docs for more.
  • +
+
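The four row/column metrics above can be sketched as simple functions of confusion-matrix counts (the counts below are invented for illustration):

```python
def recall(tp, fn):  # True Positive Rate
    return tp / (tp + fn)

def specificity(tn, fp):  # True Negative Rate
    return tn / (tn + fp)

def precision(tp, fp):  # Positive Predictive Value
    return tp / (tp + fp)

def negative_predictive_value(tn, fn):
    return tn / (tn + fn)

# Invented confusion-matrix counts, for illustration only
tp, fn, fp, tn = 80, 20, 10, 890

print(recall(tp, fn))                               # 0.8
print(round(specificity(tn, fp), 3))                # 0.989
print(round(precision(tp, fp), 3))                  # 0.889
print(round(negative_predictive_value(tn, fn), 3))  # 0.978
```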
+
+

Warning

+

Each of these metrics looks at just one row or column of the confusion matrix. A model cannot be meaningfully summarised by just one of these performance measures.

+

“Predicts cancer with 100% Precision” - is true of a “model” that correctly identifies one known cancer patient, but misdiagnoses everyone else as cancer-free.

+

“AI judge’s verdicts have Recall of 100%” - is true for a power-mad AI judge that declares everyone guilty, regardless of any evidence to the contrary.

+
+

Composite Metrics for Linkage

+

This section contains composite metrics, i.e. combinations of the metrics that can be derived from the confusion matrix (Precision, Recall, Specificity and Negative Predictive Value).

+

Any comparison of two records has a number of possible outcomes (True Positives, False Positives etc.), each of which has a different impact on your specific use case. It is very rare that a single metric defines the desired behaviour of a model. Therefore, evaluating performance with a composite metric (or a combination of metrics) is advised.

+

F Score

+

The F-Score is a weighted harmonic mean of Precision (Positive Predictive Value) and Recall (True Positive Rate). For a general weight \(\beta\):

+
\[F_{\beta} = \frac{(1 + \beta^2) \cdot \textsf{Precision} \cdot \textsf{Recall}}{\beta^2 \cdot \textsf{Precision} + \textsf{Recall}}\]
+

where Recall is considered \(\beta\) times as important as Precision.

+

For example, when Precision and Recall are equally weighted (\(\beta = 1\)), we get:

+
\[F_{1} = 2\left[\frac{1}{\textsf{Precision}}+\frac{1}{\textsf{Recall}}\right]^{-1} = \frac{2 \cdot \textsf{Precision} \cdot \textsf{Recall}}{\textsf{Precision} + \textsf{Recall}}\]
+

Other popular versions of the F score are \(F_{2}\) (Recall twice as important as Precision) and \(F_{0.5}\) (Precision twice as important as Recall).
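A minimal sketch of the \(F_{\beta}\) formula, using invented precision and recall values (a precise model with weaker recall):

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Invented example values: high precision, lower recall
p, r = 0.9, 0.6

print(round(f_beta(p, r, beta=1), 4))    # 0.72
print(round(f_beta(p, r, beta=2), 4))    # 0.6429 - recall-weighted, pulled down by low recall
print(round(f_beta(p, r, beta=0.5), 4))  # 0.8182 - precision-weighted
```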

+
+F-Score in Splink +
    +
  • The F score can be calculated in Splink using the accuracy_analysis_from_labels_column and accuracy_analysis_from_labels_table methods. Check out the splink.evaluation docs for more.
  • +
+
+
+

Warning

+

F-score does not account for class imbalance in the data, and is asymmetric (i.e. it considers the prediction of matching records, but ignores how well the model correctly predicts non-matching records).

+
+

P4 Score

+

The \(P_{4}\) Score is the harmonic mean of the 4 metrics that can be directly derived from the confusion matrix:

+
\[ 4\left[\frac{1}{\textsf{Recall}}+\frac{1}{\textsf{Specificity}}+\frac{1}{\textsf{Precision}}+\frac{1}{\textsf{Negative Predictive Value}}\right]^{-1} \]
+

This addresses one of the issues with the F-Score as it considers how well the model predicts non-matching records as well as matching records.

+

Note: all metrics are given equal weighting.
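The \(P_{4}\) formula above can be sketched directly (the metric values are invented, derived from illustrative counts tp=80, fn=20, fp=10, tn=890):

```python
def p4(recall, specificity, precision, npv):
    """Harmonic mean of the four confusion-matrix metrics."""
    return 4 / (1 / recall + 1 / specificity + 1 / precision + 1 / npv)

# Invented metric values for illustration
score = p4(recall=0.8, specificity=890 / 900, precision=80 / 90, npv=890 / 910)
print(round(score, 3))  # 0.907
```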

+
+\(P_{4}\) in Splink +
    +
  • \(P_{4}\) can be calculated in Splink using the accuracy_analysis_from_labels_column and accuracy_analysis_from_labels_table methods. Check out the splink.evaluation docs for more.
  • +
+
+

Matthews Correlation Coefficient

+

The Matthews Correlation Coefficient (\(\phi\)) is a measure of the correlation between predictions and actual observations.

+
\[ \phi = \sqrt{\textsf{Recall} \cdot \textsf{Specificity} \cdot \textsf{Precision} \cdot \textsf{Negative Predictive Value}} - \sqrt{(1 - \textsf{Recall})(1 - \textsf{Specificity})(1 - \textsf{Precision})(1 - \textsf{Negative Predictive Value})} \]
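As a sketch, the rate-based formula above agrees with the more common count-based definition of the Matthews Correlation Coefficient (the confusion-matrix counts are invented for illustration):

```python
import math

def mcc_from_counts(tp, tn, fp, fn):
    # Standard count-based definition of the Matthews Correlation Coefficient
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator

def mcc_from_rates(recall, specificity, precision, npv):
    # The rate-based formula given above
    return math.sqrt(recall * specificity * precision * npv) - math.sqrt(
        (1 - recall) * (1 - specificity) * (1 - precision) * (1 - npv)
    )

# Invented confusion-matrix counts, for illustration only
tp, tn, fp, fn = 80, 890, 10, 20
recall_, specificity_ = tp / (tp + fn), tn / (tn + fp)
precision_, npv_ = tp / (tp + fp), tn / (tn + fn)

phi = mcc_from_counts(tp, tn, fp, fn)
# Both formulations agree (up to floating-point error)
print(abs(phi - mcc_from_rates(recall_, specificity_, precision_, npv_)) < 1e-9)  # True
```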
+
+Matthews Correlation Coefficient (\(\phi\)) in Splink +
    +
  • \(\phi\) can be calculated in Splink using the accuracy_analysis_from_labels_column and accuracy_analysis_from_labels_table methods. Check out the splink.evaluation docs for more.
  • +
+
+
+

Note

+

Unlike the other metrics in this guide, \(\phi\) is a correlation coefficient, so can range from -1 to 1 (as opposed to a range of 0 to 1).

+

In reality, linkage models should never be negatively correlated with actual observations, so \(\phi\) can be used in the same way as other metrics.

+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/evaluation/edge_overview.html b/topic_guides/evaluation/edge_overview.html new file mode 100644 index 0000000000..34584fa32f --- /dev/null +++ b/topic_guides/evaluation/edge_overview.html @@ -0,0 +1,5363 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Overview - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Edge Evaluation

+

Once you have a trained model, you use it to generate edges (links) between entities (nodes). These edges will have a Match Weight and corresponding Probability.

+

There are several strategies for checking whether the links created in your pipeline perform as you want/expect.

+

Consider the Edge Metrics

+

Edge Metrics measure how links perform at an overall level.

+

First, consider how you would like your model to perform. What is important for your use case? Do you want to ensure that you capture all possible matches (i.e. high recall)? Or do you want to minimise the number of incorrectly predicted matches (i.e. high precision)? Perhaps a combination of both?

+

For a summary of all the edge metrics available in Splink, check out the Edge Metrics guide.

+
+

Note

+

To produce Edge Metrics you will require a "ground truth" to compare your linkage results against (which can be achieved by Clerical Labelling).

+
+

Spot Checking pairs of records

+

Spot checking real examples of record pairs helps build confidence in linkage results. It is an effective way to develop intuition for how the model works in practice and allows you to interrogate edge cases.

+

Results of individual record pairs can be examined with the Waterfall Chart.

+

Choosing which pairs of records to spot check can be done by either:

+ +

As you are checking real examples, you will often come across cases that have not been accounted for by your model but which you believe signify a match (e.g. a fuzzy match for names). We recommend using this feedback loop to iterate on and improve the definition of your model.

+

Choosing a Threshold

+

Threshold selection is a key decision point within a linkage pipeline. One of the major benefits of probabilistic linkage versus a deterministic (i.e. rules-based) approach is the ability to choose the amount of evidence required for two records to be considered a match (i.e. a threshold).

+

When you have decided on the metrics that are important for your use case, you can use the Threshold Selection Tool to get a first estimate for what your threshold should be.

+
+

Note

+

The Threshold Selection Tool requires labelled data to act as a "ground truth" to compare your linkage results against.

+
+

Once you have an initial threshold, you can use Comparison Viewer Dashboard to look at records on either side of your threshold to check whether the threshold makes intuitive sense.

+

From here, we recommend an iterative process of tweaking your threshold based on your spot checking, then looking at the impact this has on your overall edge metrics. Another tool that can be useful is inspecting where your model has gone wrong using prediction_errors_from_labels_table, as demonstrated in the accuracy analysis demo.

+

In Summary

+

Evaluating the edges (links) of a linkage model depends on your use case. Defining what "good" looks like is a key step, which then allows you to choose a relevant metric (or metrics) for measuring success.

+

Your desired metric should help give an initial estimation for a linkage threshold, then you can use spot checking to help settle on a final threshold.

+

In general, the links between pairs of records are not the final output of a linkage pipeline. Most use cases go on to group records together into clusters using these links. In this instance, evaluating the links themselves is not sufficient; you also have to evaluate the resulting clusters.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/evaluation/image/confusion_matrix.drawio.png b/topic_guides/evaluation/image/confusion_matrix.drawio.png new file mode 100644 index 0000000000..b1ac07f018 Binary files /dev/null and b/topic_guides/evaluation/image/confusion_matrix.drawio.png differ diff --git a/topic_guides/evaluation/image/confusion_matrix_extra.drawio.png b/topic_guides/evaluation/image/confusion_matrix_extra.drawio.png new file mode 100644 index 0000000000..62e4d8d4ce Binary files /dev/null and b/topic_guides/evaluation/image/confusion_matrix_extra.drawio.png differ diff --git a/topic_guides/evaluation/labelling.html b/topic_guides/evaluation/labelling.html new file mode 100644 index 0000000000..e3d9db2c76 --- /dev/null +++ b/topic_guides/evaluation/labelling.html @@ -0,0 +1,5222 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Clerical Labelling - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Clerical Labelling

+

This page is under construction - check back soon!

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/evaluation/model.html b/topic_guides/evaluation/model.html new file mode 100644 index 0000000000..55beb4f3f2 --- /dev/null +++ b/topic_guides/evaluation/model.html @@ -0,0 +1,5337 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Model - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Model Evaluation

+

The parameters in a trained Splink model determine the match probability (Splink score) assigned to pairwise record comparisons. Before scoring any pairs of records there are a number of ways to check whether your model will perform as you expect.

+

Look at the model parameters

+

The final model is summarised in the match weights chart with each bar in the chart signifying the match weight (i.e. the amount of evidence for or against a match) for each comparison level in your model.

+

If, after some investigation, you still can't make sense of some of the match weights, take a look at the corresponding \(m\) and \(u\) values generated to see if they themselves make sense. These can be viewed in the m u parameters chart.

+
+

Remember that \(\textsf{Match Weight} = \log_2 \frac{m}{u}\)

+
+
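This relationship can be sanity-checked directly in Python (a minimal sketch; the m and u values below are illustrative, not taken from any particular model):

```python
import math

def match_weight(m: float, u: float) -> float:
    # Match weight is the log2 of the Bayes factor m/u
    return math.log2(m / u)

# Illustrative values for an exact-match comparison level
print(match_weight(0.9, 0.01))  # ~6.49: strong evidence for a match
print(match_weight(0.05, 0.9))  # negative: evidence against a match
```

A large positive match weight corresponds to strong evidence for a match; a negative weight is evidence against.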

Look at the model training

+

The behaviour of a model during training can offer some insight into its utility. The more stable a model is in the training process, the more reliable the outputs are.

+

Stability of model training can be seen in the Expectation Maximisation stage (for \(m\) training):

+
    +
  • +

    Stability across EM training sessions can be seen through the parameter estimates chart

    +
  • +
  • +

    Stability within each session is indicated by the speed of convergence of the algorithm. This is shown in the terminal output during training. In general, the fewer iterations required to converge the better. You can also access convergence charts on the EM training session object

    +
    training_session = linker.training.estimate_parameters_using_expectation_maximisation(
    +    block_on("first_name", "surname")
    +)
    +training_session.match_weights_interactive_history_chart()
    +
    +
  • +
+

In summary

+

Evaluating a trained model is not an exact science - there are no metrics which can definitively say whether a model is good or bad at this stage. In most cases, applying human logic and heuristics is the best you can do to establish whether the model is sensible. Given the variety of potential use cases of Splink, there is no perfect, universal model, just models that can be tuned to produce useful outputs for a given application.

+

The tools within Splink are intended to help identify areas where your model may not be performing as expected. In future releases we hope to automatically flag areas of a model that require further investigation, to make this process easier for the user.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/evaluation/overview.html b/topic_guides/evaluation/overview.html new file mode 100644 index 0000000000..3ecf414337 --- /dev/null +++ b/topic_guides/evaluation/overview.html @@ -0,0 +1,5358 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Overview - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Evaluation Overview

+

Evaluation is a non-trivial, but crucial, task in data linkage. Linkage pipelines are complex and require many design decisions, each of which has an impact on the end result.

+

This set of topic guides is intended to provide some structure and guidance on how to evaluate a Splink model alongside its resulting links and clusters.

+

How do we evaluate different stages of the pipeline?

+

Evaluation in a data linking pipeline can be broken into 3 broad categories:

+

Model Evaluation

+

After you have trained your model, you can start evaluating the parameters and overall design of the model. To see how, check out the Model Evaluation guide.

+

Edge Evaluation

+

Once you have trained a model, you will use it to predict the probability of links (edges) between entities (nodes). To see how to evaluate these links, check out the Edge Evaluation guide.

+

Cluster Evaluation

+

Once you have chosen a linkage threshold, the edges are used to generate clusters of records. To see how to evaluate these clusters, check out the Cluster Evaluation guide.

+
+ +
+

Note

+

In reality, the development of a linkage pipeline involves iterating through multiple versions of models, links and clusters. For example, for each model version you will generally want to understand the downstream impact on the links and clusters generated. As such, you will likely revisit each stage of evaluation a number of times before settling on a final output.

+

The aim of these guides, and the tools provided in Splink, is to ensure that you are able to extract enough information from each iteration to better understand how your pipeline is working and identify areas for improvement.

+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/performance/drivers_of_performance.html b/topic_guides/performance/drivers_of_performance.html new file mode 100644 index 0000000000..f9984397bf --- /dev/null +++ b/topic_guides/performance/drivers_of_performance.html @@ -0,0 +1,5384 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Run times, performance and linking large data - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + + + + +

Run times, performance and linking large data

+ +

This topic guide covers the fundamental drivers of the run time of Splink jobs.

+

Blocking

+

The primary driver of run time is the number of record pairs that the Splink model has to process. In Splink, the number of pairs to consider is reduced using Blocking Rules, which are covered in depth in their own set of topic guides.
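For example, blocking rules are provided in the settings dictionary via blocking_rules_to_generate_predictions (a sketch using the block_on helper; the column names are illustrative):

```python
from splink import block_on

settings = {
    # ...
    # Only record pairs satisfying at least one rule are compared
    "blocking_rules_to_generate_predictions": [
        block_on("surname", "dob"),  # illustrative columns
        block_on("postcode"),
    ],
    # ...
}
```

Tighter rules mean fewer pairs and faster runs, at the risk of missing true matches.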

+

Complexity of comparisons

+

More complex comparisons reduce performance. Complexity is added to comparisons in a number of ways, including:

+
    +
  • Increasing the number of comparison levels
  • +
  • Using more computationally expensive comparison functions
  • +
  • Adding Term Frequency Adjustments
  • +
+
+

Performant Term Frequency Adjustments

+

Model training with Term Frequency adjustments can be made more performant by setting the estimate_without_term_frequencies parameter to True in estimate_parameters_using_expectation_maximisation.

+
+

Retaining columns through the linkage process

+

The size of your dataset has an impact on the performance of Splink. This also applies to the tables that Splink creates and uses under the hood. Some Splink functionality requires additional calculated columns to be stored. For example:

+
    +
  • The comparison_viewer_dashboard requires retain_matching_columns and retain_intermediate_calculation_columns to be set to True in the settings dictionary, but this makes some processes less performant.
  • +
+
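For example, to use the comparison viewer dashboard, the settings dictionary needs the following flags (a sketch; setting them to True increases the size of the tables Splink stores):

```python
settings = {
    # ...
    # Required by the comparison viewer dashboard, at some cost to performance
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
    # ...
}
```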

Filtering out pairwise comparisons in the predict() step

+

Reducing the number of pairwise comparisons that need to be returned will make Splink perform faster. One way of doing this is to filter comparisons with a match score below a given threshold (using a threshold_match_probability or threshold_match_weight) when you call predict().
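The two threshold parameters are related: a match weight w corresponds to a match probability of 2^w / (1 + 2^w). A minimal pure-Python sketch of the conversion (independent of Splink):

```python
def weight_to_probability(w: float) -> float:
    # Convert a match weight (log2 Bayes factor) to a match probability
    bayes_factor = 2 ** w
    return bayes_factor / (1 + bayes_factor)

print(weight_to_probability(0))  # 0.5: no net evidence either way
print(weight_to_probability(2))  # 0.8: 4-to-1 odds in favour of a match
```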

+

Spark Performance

+

As Spark is designed to distribute processing across multiple machines, there are additional configuration options available to make jobs run more quickly. For more information, check out the Spark Performance Topic Guide.

+
+

Balancing computational performance and model accuracy

+

There is usually a trade-off between computational performance and accuracy in Splink models. That is, some model design decisions that improve computational performance can also have a negative impact on the accuracy of the model.

+

Be sure to check how the suggestions in this topic guide impact the accuracy of your model to ensure the best results.

+
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/performance/optimising_duckdb.html b/topic_guides/performance/optimising_duckdb.html new file mode 100644 index 0000000000..ab30ba8bf5 --- /dev/null +++ b/topic_guides/performance/optimising_duckdb.html @@ -0,0 +1,5489 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Optimising DuckDB performance - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + + + + +

Optimising DuckDB performance

+ +

Optimising DuckDB jobs

+

This topic guide describes how to configure DuckDB to optimise performance.

+

It is assumed readers have already read the more general guide to linking big data, and have chosen appropriate blocking rules.

+

Summary:

+
    +
  • From splink==3.9.11 onwards, DuckDB generally parallelises jobs well, so you should see 100% usage of all CPU cores for the main Splink operations (parameter estimation and prediction)
  • +
  • In some cases predict() needs salting on blocking_rules_to_generate_predictions to achieve 100% CPU use. You're most likely to need this in the following scenarios:
      +
    • Very high core count machines
    • +
    • Splink models that contain a small number of blocking_rules_to_generate_predictions
    • +
    • Splink models that have a relatively small number of input rows (less than around 500k)
    • +
    +
  • +
  • If you are facing memory issues with DuckDB, you have the option of using an on-disk database.
  • +
  • Reducing the amount of parallelism by removing salting can also sometimes reduce memory usage
  • +
+

You can find a blog post with formal benchmarks of DuckDB performance on a variety of machine types here.

+

Configuration

+

Ensuring 100% CPU usage across all cores on predict()

+

The aim is for overall parallelism of the predict() step to closely align to the number of thread/vCPU cores you have: +- If parallelism is too low, you won't use all your threads +- If parallelism is too high, runtime will be longer.

+

The number of CPU cores used is given by the following formula:

+

\(\text{base parallelism} = \frac{\text{number of input rows}}{122,880}\)

+

\(\text{blocking rule parallelism}\)

+

\(= \text{count of blocking rules} \times\) \(\text{number of salting partitions per blocking rule}\)

+

\(\text{overall parallelism} = \text{base parallelism} \times \text{blocking rule parallelism}\)

+

If overall parallelism is less than the total number of threads, then you won't achieve 100% CPU usage.

+

Example

+

Consider a deduplication job with 1,000,000 input rows, on a machine with 32 cores (64 threads)

+

In our Splink settings, suppose we set:

+
settings =  {
+    ...
+    "blocking_rules_to_generate_predictions" ; [
+        block_on(["first_name"], salting_partitions=2),
+        block_on(["dob"], salting_partitions=2),
+        block_on(["surname"], salting_partitions=2),
+    ]
+    ...
+}
+
+

Then we have:

+
    +
  • Base parallelism of 9.
  • +
  • 3 blocking rules
  • +
  • 2 salting partitions per blocking rule
  • +
+

We therefore have parallelism of \(9 \times 3 \times 2 = 54\), which is less than the 64 threads, so we won't quite achieve full parallelism.
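The worked example above can be reproduced in a few lines of Python (a sketch of the formula only, using the 122,880-row figure from the base parallelism equation):

```python
import math

def overall_parallelism(n_input_rows: int, total_salted_partitions: int) -> int:
    # Base parallelism: input rows per 122,880-row batch, rounded up
    base = math.ceil(n_input_rows / 122_880)
    # Multiply by the total number of salted blocking partitions
    return base * total_salted_partitions

# 1,000,000 rows; 3 blocking rules with 2 salting partitions each
print(overall_parallelism(1_000_000, 3 * 2))  # 54, just short of 64 threads
```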

+

Generalisation

+

The above formula for overall parallelism assumes all blocking rules have the same number of salting partitions, which is not necessarily the case. In the more general case of variable numbers of salting partitions, the formula becomes

+
\[ +\text{overall parallelism} = +\text{base parallelism} \times \text{total number of salted blocking partitions across all blocking rules} +\]
+

So, for example, with two blocking rules, if the first has 2 salting partitions and the second has 10 salting partitions, then we would multiply base parallelism by 12.

+

This may be useful in the case where one blocking rule produces more comparisons than another: the 'bigger' blocking rule can be salted more.

+

For further information about how parallelism works in DuckDB, including links to relevant DuckDB documentation and discussions, see here.

+

Running out of memory

+

If your job is running out of memory, the first thing to consider is tightening your blocking rules, or running the workload on a larger machine.

+

If these are not possible, the following config options may help reduce memory usage:

+

Using an on-disk database

+

DuckDB can spill to disk using several settings:

+

Use the special :temporary: connection built into Splink, which creates a temporary on-disk database

+
linker = Linker(
+    df, settings, DuckDBAPI(connection=":temporary:")
+)
+
+

Use an on-disk database:

+
con = duckdb.connect(database='my-db.duckdb')
+linker = Linker(
+    df, settings, DuckDBAPI(connection=con)
+)
+
+

Use an in-memory database, but ensure it can spill to disk:

+
con = duckdb.connect(":memory:")
+
+con.execute("SET temp_directory='/path/to/temp';")
+linker = Linker(
+    df, settings, DuckDBAPI(connection=con)
+)
+
+

See also this section of the DuckDB docs

+

Reducing salting

+

Empirically we have noticed that there is a tension between parallelism and total memory usage. If you're running out of memory, you could consider reducing parallelism.

+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/performance/optimising_spark.html b/topic_guides/performance/optimising_spark.html new file mode 100644 index 0000000000..ea6fe30cdb --- /dev/null +++ b/topic_guides/performance/optimising_spark.html @@ -0,0 +1,5471 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Optimising Spark performance - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+ +
+ + + +
+
+ + + + + + + + + + + + + + + +

Optimising Spark performance

+ +

Optimising Spark jobs

+

This topic guide describes how to configure Spark to optimise performance - especially large linkage jobs which are slow or are not completing using default settings.

+

It is assumed readers have already read the more general guide to linking big data, and blocking rules are proportionate to the size of the Spark cluster. As a very rough guide, on a small cluster of (say) 8 machines, we recommend starting with blocking rules that generate around 100 million comparisons. Once this is working, loosening the blocking rules to around 1 billion comparisons or more is often achievable.

+

Summary:

+
    +
  • Ensure blocking rules are not generating too many comparisons.
  • +
  • We recommend setting the break_lineage_method to "parquet", which is the default
  • +
  • num_partitions_on_repartition should be set so that each file in the output of predict() is roughly 100MB.
  • +
  • Try setting spark.default.parallelism to around 5x the number of CPUs in your cluster
  • +
+

For a cluster with 10 CPUs, that outputs about 8GB of data in parquet format, the following setup may be appropriate:

+
spark.conf.set("spark.default.parallelism", "50")
+spark.conf.set("spark.sql.shuffle.partitions", "50")
+
+linker = Linker(
+    person_standardised_nodes,
+    settings,
+    db_api=spark_api,
+    break_lineage_method="parquet",
+    num_partitions_on_repartition=80,
+)
+
+

Breaking lineage

+

Splink uses an iterative algorithm for model training, and more generally, lineage is long and complex. We have found that big jobs fail to complete without further optimisation. This is a well-known problem:

+
+

Quote

+

"This long lineage bottleneck is widely known by sophisticated Spark application programmers. A common practice for dealing with long lineage is to have the application program strategically checkpoint RDDs at code locations that truncate much of the lineage for checkpointed data and resume computation immediately from the checkpoint."

+
+

Splink will automatically break lineage in sensible places. We have found in practice that, when running Spark jobs backed by AWS S3, the fastest method of breaking lineage is persisting outputs to .parquet file.

+

You can do this using the break_lineage_method parameter as follows:

+
linker = Linker(
+    person_standardised_nodes,
+    settings,
+    db_api=db_api,
+    break_lineage_method="parquet"
+)
+
+

Other options are checkpoint and persist. For different Spark setups, particularly if you have fast local storage, you may find these options perform better.

+

Spark Parallelism

+

We suggest setting default parallelism to roughly 5x the number of CPUs in your cluster. This is a very rough rule of thumb, and if you're encountering performance problems you may wish to experiment with different values.

+

One way to set default parallelism is as follows:

+
from pyspark.context import SparkContext, SparkConf
+from pyspark.sql import SparkSession
+
+conf = SparkConf()
+
+conf.set("spark.default.parallelism", "50")
+conf.set("spark.sql.shuffle.partitions", "50")
+
+sc = SparkContext.getOrCreate(conf=conf)
+spark = SparkSession(sc)
+
+

In general, increasing parallelism will make Spark 'chunk' your job into a larger number of smaller tasks. This may solve memory issues. But note there is a tradeoff here: if you increase parallelism too far, Spark may take too much time scheduling large numbers of tasks, and may even run out of memory performing this work. See here. Also note that when blocking, jobs cannot be split into a larger number of tasks than the cardinality of the blocking rule. For example, if you block on month of birth, this will be split into 12 tasks, irrespective of the parallelism setting. See here. You can use salting (below) to partially address this limitation.

+

Repartition after blocking

+

For some jobs, setting repartition_after_blocking=True when you initialise the SparkAPI may improve performance.

+

Salting

+

For very large jobs, you may find that salting your blocking keys results in faster run times.

+

General Spark config

+

Splink generates large numbers of record comparisons from relatively small input datasets. This is an unusual type of workload, and so default Spark parameters are not always appropriate. Some of the issues encountered are similar to performance issues encountered with Cartesian joins - so some of the tips in relevant articles may help.

+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/performance/salting.html b/topic_guides/performance/salting.html new file mode 100644 index 0000000000..1427f02dcd --- /dev/null +++ b/topic_guides/performance/salting.html @@ -0,0 +1,5504 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Salting blocking rules - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + + + + +

Salting blocking rules

+

For very large linkages using Apache Spark, Splink supports salting blocking rules.

+

Under certain conditions, this can help Spark better parallelise workflows, leading to shorter run times, and avoiding out of memory errors. It is most likely to help where you have blocking rules that create very large numbers of comparisons (100m records+) and where there is skew in how record comparisons are made (e.g. blocking on full name creates more comparisons amongst 'John Smith's than many other names).

+

Further information about the motivation for salting can be found here.

+

Note that salting is only available for the Spark backend

+

How to use salting

+

To enable salting using the Linker with Spark, you provide some of your blocking rules as a dictionary rather than a string.

+

This enables you to choose the number of salts for each blocking rule.

+

Blocking rules provided as plain strings default to no salting (salting_partitions = 1)

+

The following code snippet illustrates:

+
import logging
+
+from pyspark.context import SparkConf, SparkContext
+from pyspark.sql import SparkSession
+
+import splink.comparison_library as cl
+from splink import Linker, SparkAPI, splink_datasets
+
+conf = SparkConf()
+conf.set("spark.driver.memory", "12g")
+conf.set("spark.sql.shuffle.partitions", "8")
+conf.set("spark.default.parallelism", "8")
+
+sc = SparkContext.getOrCreate(conf=conf)
+spark = SparkSession(sc)
+spark.sparkContext.setCheckpointDir("./tmp_checkpoints")
+
+settings = {
+    "probability_two_random_records_match": 0.01,
+    "link_type": "dedupe_only",
+    "blocking_rules_to_generate_predictions": [
+        "l.dob = r.dob",
+        {"blocking_rule": "l.first_name = r.first_name", "salting_partitions": 4},
+    ],
+    "comparisons": [
+        cl.LevenshteinAtThresholds("first_name", 2),
+        cl.ExactMatch("surname"),
+        cl.ExactMatch("dob"),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        cl.ExactMatch("email"),
+    ],
+    "retain_matching_columns": True,
+    "retain_intermediate_calculation_columns": True,
+    "additional_columns_to_retain": ["cluster"],
+    "max_iterations": 1,
+    "em_convergence": 0.01,
+}
+
+
+df = splink_datasets.fake_1000
+
+spark_api = SparkAPI(spark_session=spark)
+linker = Linker(df, settings, db_api=spark_api)
+logging.getLogger("splink").setLevel(5)
+
+linker.inference.deterministic_link()
+
+

And we can see that salting has been applied by looking at the SQL generated in the log:

+
SELECT
+  l.unique_id AS unique_id_l,
+  r.unique_id AS unique_id_r,
+  l.first_name AS first_name_l,
+  r.first_name AS first_name_r,
+  l.surname AS surname_l,
+  r.surname AS surname_r,
+  l.dob AS dob_l,
+  r.dob AS dob_r,
+  l.city AS city_l,
+  r.city AS city_r,
+  l.tf_city AS tf_city_l,
+  r.tf_city AS tf_city_r,
+  l.email AS email_l,
+  r.email AS email_r,
+  l.`group` AS `group_l`,
+  r.`group` AS `group_r`,
+  '0' AS match_key
+FROM __splink__df_concat_with_tf AS l
+INNER JOIN __splink__df_concat_with_tf AS r
+  ON l.dob = r.dob
+WHERE
+  l.unique_id < r.unique_id
+UNION ALL
+SELECT
+  l.unique_id AS unique_id_l,
+  r.unique_id AS unique_id_r,
+  l.first_name AS first_name_l,
+  r.first_name AS first_name_r,
+  l.surname AS surname_l,
+  r.surname AS surname_r,
+  l.dob AS dob_l,
+  r.dob AS dob_r,
+  l.city AS city_l,
+  r.city AS city_r,
+  l.tf_city AS tf_city_l,
+  r.tf_city AS tf_city_r,
+  l.email AS email_l,
+  r.email AS email_r,
+  l.`group` AS `group_l`,
+  r.`group` AS `group_r`,
+  '1' AS match_key
+FROM __splink__df_concat_with_tf AS l
+INNER JOIN __splink__df_concat_with_tf AS r
+  ON l.first_name = r.first_name
+  AND CEIL(l.__splink_salt * 4) = 1
+  AND NOT (
+    COALESCE((
+        l.dob = r.dob
+    ), FALSE)
+  )
+WHERE
+  l.unique_id < r.unique_id
+UNION ALL
+SELECT
+  l.unique_id AS unique_id_l,
+  r.unique_id AS unique_id_r,
+  l.first_name AS first_name_l,
+  r.first_name AS first_name_r,
+  l.surname AS surname_l,
+  r.surname AS surname_r,
+  l.dob AS dob_l,
+  r.dob AS dob_r,
+  l.city AS city_l,
+  r.city AS city_r,
+  l.tf_city AS tf_city_l,
+  r.tf_city AS tf_city_r,
+  l.email AS email_l,
+  r.email AS email_r,
+  l.`group` AS `group_l`,
+  r.`group` AS `group_r`,
+  '1' AS match_key
+FROM __splink__df_concat_with_tf AS l
+INNER JOIN __splink__df_concat_with_tf AS r
+  ON l.first_name = r.first_name
+  AND CEIL(l.__splink_salt * 4) = 2
+  AND NOT (
+    COALESCE((
+        l.dob = r.dob
+    ), FALSE)
+  )
+WHERE
+  l.unique_id < r.unique_id
+UNION ALL
+SELECT
+  l.unique_id AS unique_id_l,
+  r.unique_id AS unique_id_r,
+  l.first_name AS first_name_l,
+  r.first_name AS first_name_r,
+  l.surname AS surname_l,
+  r.surname AS surname_r,
+  l.dob AS dob_l,
+  r.dob AS dob_r,
+  l.city AS city_l,
+  r.city AS city_r,
+  l.tf_city AS tf_city_l,
+  r.tf_city AS tf_city_r,
+  l.email AS email_l,
+  r.email AS email_r,
+  l.`group` AS `group_l`,
+  r.`group` AS `group_r`,
+  '1' AS match_key
+FROM __splink__df_concat_with_tf AS l
+INNER JOIN __splink__df_concat_with_tf AS r
+  ON l.first_name = r.first_name
+  AND CEIL(l.__splink_salt * 4) = 3
+  AND NOT (
+    COALESCE((
+        l.dob = r.dob
+    ), FALSE)
+  )
+WHERE
+  l.unique_id < r.unique_id
+UNION ALL
+SELECT
+  l.unique_id AS unique_id_l,
+  r.unique_id AS unique_id_r,
+  l.first_name AS first_name_l,
+  r.first_name AS first_name_r,
+  l.surname AS surname_l,
+  r.surname AS surname_r,
+  l.dob AS dob_l,
+  r.dob AS dob_r,
+  l.city AS city_l,
+  r.city AS city_r,
+  l.tf_city AS tf_city_l,
+  r.tf_city AS tf_city_r,
+  l.email AS email_l,
+  r.email AS email_r,
+  l.`group` AS `group_l`,
+  r.`group` AS `group_r`,
+  '1' AS match_key
+FROM __splink__df_concat_with_tf AS l
+INNER JOIN __splink__df_concat_with_tf AS r
+  ON l.first_name = r.first_name
+  AND CEIL(l.__splink_salt * 4) = 4
+  AND NOT (
+    COALESCE((
+        l.dob = r.dob
+    ), FALSE)
+  )
+WHERE
+  l.unique_id < r.unique_id
+
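The CEIL(l.__splink_salt * 4) = k predicates above split the left-hand side of the join into 4 disjoint partitions, so the union of the salted sub-joins produces exactly the same pairs as the unsalted block. This can be illustrated conceptually in pure Python (a sketch, not Splink code):

```python
import math
import random

random.seed(42)
# Each record gets a salt in (0, 1], mimicking Splink's __splink_salt column
records = [{"id": i, "first_name": "john", "salt": 1 - random.random()}
           for i in range(20)]

def blocked_pairs(left, right):
    # Pairs sharing a first_name, deduplicated via the id ordering
    return {(l["id"], r["id"]) for l in left for r in right
            if l["first_name"] == r["first_name"] and l["id"] < r["id"]}

unsalted = blocked_pairs(records, records)

salted = set()
for k in range(1, 5):  # one sub-join per salt partition, as in the SQL above
    left_k = [rec for rec in records if math.ceil(rec["salt"] * 4) == k]
    salted |= blocked_pairs(left_k, records)

print(salted == unsalted)  # True: salting changes how work is split, not the result
```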
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/splink_fundamentals/backends/backends.html b/topic_guides/splink_fundamentals/backends/backends.html new file mode 100644 index 0000000000..4a8121a1ab --- /dev/null +++ b/topic_guides/splink_fundamentals/backends/backends.html @@ -0,0 +1,5605 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Backends overview - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + + + + +

Splink's SQL backends: Spark, DuckDB, etc

+

Splink is a Python library. However, it implements all data linking computations by generating SQL, and submitting the SQL statements to a backend of the user's choosing for execution.

+

The Splink code you write is almost identical between backends, so it's straightforward to migrate between backends. Often, it's a good idea to start working using DuckDB on a sample of data, because it will produce results very quickly. When you're comfortable with your model, you may wish to migrate to a big data backend to estimate/predict on the full dataset.

+

Choosing a backend

+ +

When choosing which backend to use when getting started with Splink, there are a number of factors to consider:

+
    +
  • the size of the dataset(s)
  • +
  • the amount of boilerplate code/configuration required
  • +
  • access to specific (sometimes proprietary) platforms
  • +
  • the backend-specific features offered by Splink
  • +
  • the level of support and active development offered by Splink
  • +
+

Below is a short summary of each of the backends available in Splink.

+

DuckDB

+

DuckDB is recommended for most users for all but the largest linkages.

+

It is the fastest backend, and is capable of linking large datasets, especially if you have access to high-spec machines.

+

As a rough guide it can:

+
    +
  • Link up to around 5 million records on a modern laptop (4 core/16GB RAM)
  • +
  • Link tens of millions of records on high spec cloud computers very fast.
  • +
+

For further details, see the results of formal benchmarking here.

+

DuckDB is also recommended because, for many users, it is the simplest backend to set up.

+

It can be run on any device with Python installed, and it is installed automatically with Splink via pip install splink. DuckDB has complete coverage for the functions in the Splink comparison libraries. Alongside the Spark linker, it receives most attention from the development team.

+

See the DuckDB deduplication example notebook to get a better idea of how Splink works with DuckDB.

+

Spark

+

Spark is recommended for: +- Very large linkages, especially where DuckDB is performing poorly or running out of memory, or +- Users who have easier access to a Spark cluster than to a single high-spec machine for running DuckDB

+

It is not our default recommendation for most users because: +- It involves more configuration than DuckDB, such as registering UDFs and setting up a Spark cluster +- It is slower than DuckDB for many workloads

+

The Spark linker has complete coverage for the functions in the Splink comparison libraries.

+

If working with Databricks, note that the Splink development team does not have access to a Databricks environment, so we can struggle to help with Databricks-specific issues.

+

See the Spark deduplication example notebook for an example of how Splink works with Spark.

+

Athena

+

Athena is a big data SQL backend provided on AWS which is great for large datasets (10+ million records). It requires access to a live AWS account and, as a persistent database, requires some additional management of the tables created by Splink. Athena has reasonable, but not complete, coverage for fuzzy matching functions (see the Presto documentation: https://prestodb.io/docs/current/functions/string.html). At this time, the Athena backend is being used sparingly by the Splink development team, so it receives minimal levels of support.

+

In addition, from a development perspective, the necessity for an AWS connection makes testing Athena code more difficult, so there may be occasional bugs that would normally be caught by our testing framework.

+

See the Athena deduplication example notebook to get a better idea of how Splink works with Athena.

+

SQLite

+

SQLite is similar to DuckDB in that it is, generally, more suited to smaller datasets. SQLite is simple to set up and can be run directly in a Jupyter notebook, but is not as performant as DuckDB. SQLite has reasonable, but not complete, coverage for the functions in the Splink comparison libraries, with gaps in array and date comparisons. String fuzzy matching, while not native to SQLite, is available via Python UDFs, which has some performance implications. SQLite is not actively being used by the Splink team so receives minimal levels of support.

+

PostgreSQL

+

PostgreSQL is a relatively new backend, so we have not fully tested performance, or what size of datasets can be processed with Splink. The Postgres backend requires a Postgres database, so it is recommended to use this backend only if you are working with a pre-existing Postgres database. Postgres has reasonable, but not complete, coverage for the functions in the Splink comparison libraries, with gaps in string fuzzy matching functionality due to the lack of some string functions in Postgres. At this time, the Postgres backend is not being actively used by the Splink development team so receives minimal levels of support.

+

More details on using Postgres as a Splink backend can be found on the Postgres page.

+

Using your chosen backend

+

Choose the relevant DBAPI:

+

Once you have initialised the linker object, there is no difference in the subsequent code between backends.

+
+
+
+
from splink import Linker, DuckDBAPI
+
+linker = Linker(your_args, db_api=DuckDBAPI())
+
+
+
+
from splink import Linker, SparkAPI
+
+linker = Linker(your_args, db_api=SparkAPI())
+
+
+
+
from splink import Linker, AthenaAPI
+
+linker = Linker(your_args, db_api=AthenaAPI())
+
+
+
+
from splink import Linker, SQLiteAPI
+
+linker = Linker(your_args, db_api=SQLiteAPI())
+
+
+
+
from splink import Linker, PostgresAPI
+
+linker = Linker(your_args, db_api=PostgresAPI())
+
+
+
+
+

Additional Information for specific backends

+

SQLite

+

SQLite does not have native support for fuzzy string-matching functions. +However, the following are available for Splink users as python user-defined functions (UDFs) which are automatically registered when calling SQLiteAPI()

+
    +
  • levenshtein
  • +
  • damerau_levenshtein
  • +
  • jaro
  • +
  • jaro_winkler
  • +
+

However, there are a couple of points to note:

+
    +
  • These functions are implemented using the RapidFuzz package, which must be installed if you wish to make use of them, via e.g. pip install rapidfuzz. If you do not wish to do so you can disable the use of these functions when creating your linker: +
    SQLiteAPI(register_udfs=False)
    +
  • +
As these functions are implemented in python they will be considerably slower than any native-SQL comparisons. If you find that your model training or predictions are taking a long time to run, you may wish to consider switching to DuckDB (or some other backend).
  • +
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/splink_fundamentals/backends/postgres.html b/topic_guides/splink_fundamentals/backends/postgres.html new file mode 100644 index 0000000000..8ade735aeb --- /dev/null +++ b/topic_guides/splink_fundamentals/backends/postgres.html @@ -0,0 +1,5547 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + PostgreSQL - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + + + + +

Using PostgreSQL as a Splink backend

+

Splink is compatible with using PostgreSQL (often referred to simply as Postgres) as a SQL backend - for other options have a look at the overview of Splink backends.

+

Setup

+

Splink makes use of SQLAlchemy for connecting to Postgres; the default database adapter is psycopg2, but you should be able to use any other if you prefer. The PostgresAPI requires a valid engine upon creation to manage interactions with the database: +

from sqlalchemy import create_engine
+
+from splink import Linker, SettingsCreator, PostgresAPI
+
+# create a sqlalchemy engine to manage connecting to the database
+engine = create_engine("postgresql+psycopg2://USER:PASSWORD@HOST:PORT/DB_NAME")
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+)
+
+

You can pass data to the linker in one of two ways:

+
    +
  • +

    use the name of a pre-existing table in your database +

db_api = PostgresAPI(engine=engine)
+linker = Linker(
+    "my_data_table",
+    settings,
    +    db_api=db_api,
    +)
    +
    +
  • +
  • +

    or pass a pandas DataFrame directly, in which case the linker will create a corresponding table for you automatically in the database +

    import pandas as pd
    +
    +# create pandas frame from csv
    +df = pd.read_csv("./my_data_table.csv")
    +
+db_api = PostgresAPI(engine=engine)
+linker = Linker(
+    df,
+    settings,
    +    db_api=db_api,
    +)
    +
    +
  • +
+

Permissions

+

When you connect to Postgres, you must do so with a role that has sufficient privileges for Splink to operate correctly. These are:

+
    +
  • CREATE ON DATABASE, to allow Splink to create a schema for working, and install the fuzzystrmatch extension
  • +
  • USAGE ON LANGUAGE SQL and USAGE ON TYPE float8 - these are required for creating the UDFs that Splink employs for calculations
  • +
+

Things to know

+

Schemas

+

When you create a PostgresAPI, Splink will create a new schema within the database you specify - by default this schema is called splink, but you can choose another name by passing the appropriate argument when creating it: +

dbapi = PostgresAPI(engine=engine, schema="another_splink_schema")
+
+This schema is where all of Splink's work will be carried out, and where any tables created by Splink will live. +

By default when looking for tables, Splink will check the schema it created, and the public schema; if you have tables in other schemas that you would like to be discoverable by Splink, you can use the parameter other_schemas_to_search: +

dbapi = PostgresAPI(engine=engine, other_schemas_to_search=["my_data_schema_1", "my_data_schema_2"])
+
+

User-Defined Functions (UDFs)

+

Splink makes use of Postgres' user-defined functions in order to operate, which are defined in the schema created by Splink when you create the linker. These functions are all defined using SQL, and are:

+ +
+

Information

+

The information below is only relevant if you are planning on making changes to Splink. If you are only intending to use Splink with Postgres, you do not need to read any further.

+
+ +

To run only the Splink tests that run against Postgres, you can simply run: +

pytest -m postgres_only tests/
+
+For more information see the documentation page for testing in Splink. +

The tests are run using a temporary database and user that are created at the start of the test session, and destroyed at the end.

+

Postgres via docker

+

If you are trying to run tests with Splink on Postgres, or simply develop using Postgres, you may prefer not to install Postgres on your system, but to run it instead using Docker. +In this case you can simply run the setup script (a thin wrapper around docker-compose): +

./scripts/postgres_docker/setup.sh
+
+Included in the docker-compose file is a pgAdmin container to allow easy exploration of the database as you work, which can be accessed in-browser on the default port. +

When you are finished you can remove these resources: +

./scripts/postgres_docker/teardown.sh
+
+

Running with a pre-existing database

+

If you have a pre-existing Postgres server you wish to use to run the tests against, you will need to specify environment variables for the credentials where they differ from default (in parentheses):

+
    +
  • SPLINKTEST_PG_USER (splinkognito)
  • +
  • SPLINKTEST_PG_PASSWORD (splink123!)
  • +
  • SPLINKTEST_PG_HOST (localhost)
  • +
  • SPLINKTEST_PG_PORT (5432)
  • +
  • SPLINKTEST_PG_DB (splink_db) - tests will not actually run against this, but it is from a connection to this that the temporary test database + user will be created
  • +
+

While care has been taken to ensure that tests are run using minimal permissions, and are cleaned up after, it is probably wise to run tests connected to a non-important database, in case anything goes wrong. +In addition to the above privileges, in order to run the tests you will need:

+
    +
  • CREATE DATABASE to create a temporary testing database
  • +
  • CREATEROLE to create a temporary user role with limited privileges, which will be actually used for all the SQL execution in the tests
  • +
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/splink_fundamentals/link_type.html b/topic_guides/splink_fundamentals/link_type.html new file mode 100644 index 0000000000..9026967562 --- /dev/null +++ b/topic_guides/splink_fundamentals/link_type.html @@ -0,0 +1,5383 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Link type - linking vs deduping - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + + + + +

Link type: Linking, Deduping or Both

+

Splink allows data to be linked, deduplicated or both.

+

Linking refers to finding links between datasets, whereas deduplication refers to finding links within datasets.

+

Data linking is therefore only meaningful when more than one dataset is provided.

+
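To give a rough sense of why this distinction matters for scale (illustrative arithmetic only - these helpers are not part of Splink): without any blocking, deduplicating a single table of n records implies n(n-1)/2 pairwise comparisons, whereas linking two tables of sizes n and m implies n*m:

```python
# Illustrative only - not a Splink API.
def dedupe_comparisons(n: int) -> int:
    # Each unordered pair of distinct records within one table is compared once.
    return n * (n - 1) // 2

def link_comparisons(n: int, m: int) -> int:
    # Every record in the first table is compared with every record in the second.
    return n * m

print(dedupe_comparisons(1_000))       # 499500
print(link_comparisons(1_000, 1_000))  # 1000000
```

This quadratic growth in comparisons is the motivation for blocking rules, discussed elsewhere in the documentation.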

This guide shows how to specify the settings dictionary and initialise the linker for the three link types.

+

Deduplication

+

The dedupe_only link type expects the user to provide a single input table, and is specified as follows

+
from splink import Linker, SettingsCreator
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+)
+
+linker = Linker(df, settings, db_api=db_api)
+
+ +

The link_only link type expects the user to provide a list of input tables, and is specified as follows:

+
from splink import Linker, SettingsCreator
+
+settings = SettingsCreator(
+    link_type="link_only",
+)
+
+linker = Linker(
+    [df_1, df_2, df_n],
+    settings,
+    db_api=dbapi,
+    input_table_aliases=["name1", "name2", "name3"],
+)
+
+

The input_table_aliases argument is optional and is used to label the tables in the outputs. If not provided, defaults will be automatically chosen by Splink.

+ +

The link_and_dedupe link type expects the user to provide a list of input tables, and is specified as follows:

+
from splink import Linker, SettingsCreator
+
+settings = SettingsCreator(
+    link_type="link_and_dedupe",
+)
+
+linker = Linker(
+    [df_1, df_2, df_n],
+    settings,
+    db_api=dbapi,
+    input_table_aliases=["name1", "name2", "name3"],
+)
+
+

The input_table_aliases argument is optional and is used to label the tables in the outputs. If not provided, defaults will be automatically chosen by Splink.

+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/splink_fundamentals/querying_splink_results.html b/topic_guides/splink_fundamentals/querying_splink_results.html new file mode 100644 index 0000000000..1fbe5a1db1 --- /dev/null +++ b/topic_guides/splink_fundamentals/querying_splink_results.html @@ -0,0 +1,5422 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Retrieving and querying Splink results - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + + + + +

Retrieving and Querying Splink Results

+

When Splink returns results, it does so in the format of a SplinkDataFrame. This is needed to allow Splink to provide results in a uniform format across the different database backends.

+

For example, when you run df_predict = linker.inference.predict(), the result df_predict is a SplinkDataFrame.

+

A SplinkDataFrame is an abstraction of a table in the underlying backend database, and provides several convenience methods for interacting with the underlying table. +For detailed information check the full API.

+

Converting to other types

+

You can convert a SplinkDataFrame into a Pandas dataframe using splink_df.as_pandas_dataframe().

+

To view the first few records use a limit statement: splink_df.as_pandas_dataframe(limit=10).

+

For large linkages, it is not recommended to convert the whole SplinkDataFrame to pandas, because Splink results can be very large, so converting them into pandas can be slow and result in out-of-memory errors. Usually it will be better to use SQL to query the tables directly.

+

Querying tables

+

You can find out the name of the table in the underlying database using splink_df.physical_name. This enables you to run SQL queries directly against the results. +You can execute queries using linker.misc.query_sql - +this is the recommended approach as it's typically faster and more memory efficient than using pandas dataframes.

+

The following is an example of this approach, in which we use SQL to find the best match to each input record in a link_type="link_only" job (i.e. remove duplicate matches):

+
# linker is a Linker with link_type set to "link_only"
+df_predict = linker.inference.predict(threshold_match_probability=0.75)
+
+sql = f"""
+with ranked as
+(
+select *,
+row_number() OVER (
+    PARTITION BY unique_id_l order by match_weight desc
+    ) as row_number
+from {df_predict.physical_name}
+)
+
+select *
+from ranked
+where row_number = 1
+"""
+
+df_query_result = linker.misc.query_sql(sql)  # pandas dataframe
+
+
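The row_number logic above can also be sketched in plain python (using hypothetical prediction rows named after Splink's usual output columns), which may help when reasoning about the SQL:

```python
from itertools import groupby

# Hypothetical prediction rows, named after Splink's usual output columns.
predictions = [
    {"unique_id_l": 1, "unique_id_r": 10, "match_weight": 4.2},
    {"unique_id_l": 1, "unique_id_r": 11, "match_weight": 7.9},
    {"unique_id_l": 2, "unique_id_r": 12, "match_weight": 3.1},
]

# Sort by left-hand id, then descending match weight, and keep the first
# (i.e. best-scoring) row in each group - the same effect as row_number = 1.
rows = sorted(predictions, key=lambda r: (r["unique_id_l"], -r["match_weight"]))
best = [next(group) for _, group in groupby(rows, key=lambda r: r["unique_id_l"])]
print([r["unique_id_r"] for r in best])  # [11, 12]
```

In practice the SQL version is preferable for large result sets, since it runs inside the backend database rather than in python memory.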

Note that linker.misc.query_sql will return a pandas dataframe by default, but you can instead return a SplinkDataFrame as follows: +

df_query_result = linker.misc.query_sql(sql, output_type='splink_df')
+
+

Saving results

+

If you have a SplinkDataFrame, you may wish to store the results in some file outside of your database. +As tables may be large, there are a couple of convenience methods for doing this directly without needing to load the table into memory. +Currently Splink supports saving frames to either csv or parquet format. +Of these we generally recommend the latter, as it is typed, compressed, column-oriented, and easily supports nested data.

+

To save results, simply use the methods to_csv() or to_parquet() - for example: +

df_predict = linker.inference.predict()
+df_predict.to_parquet("splink_predictions.parquet", overwrite=True)
+# or alternatively:
+df_predict.to_csv("splink_predictions.csv", overwrite=True)
+
+

Creating a SplinkDataFrame

+

You can create a SplinkDataFrame for any table in your database. You will need to already have a linker to manage interactions with the database: +

import pandas as pd
+import duckdb
+
+from splink import Linker, SettingsCreator, DuckDBAPI
+from splink.datasets import splink_datasets
+
+con = duckdb.connect()
+df_numbers = pd.DataFrame({"id": [1, 2, 3], "number": ["one", "two", "three"]})
+con.sql("CREATE TABLE number_table AS SELECT * FROM df_numbers")
+
+db_api = DuckDBAPI(connection=con)
+df = splink_datasets.fake_1000
+
+linker = Linker(df, settings=SettingsCreator(link_type="dedupe_only"), db_api=db_api)
+splink_df = linker.table_management.register_table("number_table", "a_templated_name")
+splink_df.as_pandas_dataframe()
+
+```
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/splink_fundamentals/settings.html b/topic_guides/splink_fundamentals/settings.html new file mode 100644 index 0000000000..82a8877e7e --- /dev/null +++ b/topic_guides/splink_fundamentals/settings.html @@ -0,0 +1,5792 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Defining Splink models - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + + + + +

Defining a Splink Model

+ +

When building any linkage model in Splink, there are 3 key things which need to be defined:

+
    +
  1. What type of linkage you want (defined by the link type)
  2. What pairs of records to consider (defined by blocking rules)
  3. What features to consider, and how they should be compared (defined by comparisons)
+ +

All aspects of a Splink model are defined via the SettingsCreator object.

+

For example, consider a simple model:

+
 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+ 9
+10
+11
+12
+13
+14
+15
+16
+17
+18
+19
+20
+21
+22
+23
+24
+25
import splink.comparison_library as cl
+import splink.comparison_template_library as ctl
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name"),
+        block_on("surname"),
+    ],
+    comparisons=[
+        ctl.NameComparison("first_name"),
+        ctl.NameComparison("surname"),
+        ctl.DateComparison(
+            "dob",
+            input_is_string=True,
+            datetime_metrics=["month", "year"],
+            datetime_thresholds=[
+                1,
+                1,
+            ],
+        ),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        ctl.EmailComparison("email"),
+    ],
+)
+
+

Where:

+

1. Type of linkage

+

The "link_type" is defined as a deduplication for a single dataset.

+
5
    link_type="dedupe_only",
+
+

2. Pairs of records to consider

+

The "blocking_rules_to_generate_predictions" define a subset of pairs of records for the model to be considered when making predictions. In this case, where there is a match on:

+
    +
  • first_name
  • +
  • OR surname.
  • +
+
6
+7
+8
+9
    blocking_rules_to_generate_predictions=[
+            block_on("first_name"),
+            block_on("surname", "dob"),
+        ],
+
+

For more information on how blocking is used in Splink, see the dedicated topic guide.

+

3. Features to consider, and how they should be compared

+

The "comparisons" define the features to be compared between records: "first_name", "surname", "dob", "city" and "email".

+
10
+11
+12
+13
+14
+15
+16
+17
+18
+19
+20
+21
+22
+23
+24
    comparisons=[
+        cl.NameComparison("first_name"),
+        cl.NameComparison("surname"),
+        cl.DateComparison(
+            "dob",
+            input_is_string=True,
+            datetime_metrics=["month", "year"],
+            datetime_thresholds=[
+                1,
+                1,
+            ],
+        ),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        cl.EmailComparison("email"),
+    ],
+
+

Using functions from the comparison library to define how these features should be compared.

+

For more information on how comparisons are defined, see the dedicated topic guide.

+

With our finalised settings object, we can train a Splink model using the following code:

+
+Example model using the settings dictionary +
import splink.comparison_library as cl
+import splink.comparison_template_library as ctl
+from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
+
+db_api = DuckDBAPI()
+df = splink_datasets.fake_1000
+
+settings = SettingsCreator(
+    link_type="dedupe_only",
+    blocking_rules_to_generate_predictions=[
+        block_on("first_name"),
+        block_on("surname"),
+    ],
+    comparisons=[
+        ctl.NameComparison("first_name"),
+        ctl.NameComparison("surname"),
+        ctl.DateComparison(
+            "dob",
+            input_is_string=True,
+            datetime_metrics=["month", "year"],
+            datetime_thresholds=[
+                1,
+                1,
+            ],
+        ),
+        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
+        ctl.EmailComparison("email"),
+    ],
+)
+
+linker = Linker(df, settings, db_api=db_api)
+linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+blocking_rule_for_training = block_on("first_name", "surname")
+linker.training.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)
+
+blocking_rule_for_training = block_on("dob")
+linker.training.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)
+
+pairwise_predictions = linker.inference.predict()
+
+clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(pairwise_predictions, 0.95)
+clusters.as_pandas_dataframe(limit=5)
+
+
+

Advanced usage of the settings dictionary

+

The section above refers to the three key aspects of the Splink settings dictionary. There are a variety of other, lesser-used settings, which can be found as the arguments to the SettingsCreator.

+

Saving a trained model

+

Once you have a trained Splink model, it is often helpful to save the model out. The save_model_to_json function allows the user to save the specifications of their trained model.

+
linker.misc.save_model_to_json("model.json")
+
+

which, using the example settings and model training from above, gives the following output:

+
+Model JSON +

When the Splink model is saved to disk using linker.misc.save_model_to_json("model.json"), these settings become:

+
{
+    "link_type": "dedupe_only",
+    "probability_two_random_records_match": 0.0008208208208208208,
+    "retain_matching_columns": true,
+    "retain_intermediate_calculation_columns": false,
+    "additional_columns_to_retain": [],
+    "sql_dialect": "duckdb",
+    "linker_uid": "29phy7op",
+    "em_convergence": 0.0001,
+    "max_iterations": 25,
+    "bayes_factor_column_prefix": "bf_",
+    "term_frequency_adjustment_column_prefix": "tf_",
+    "comparison_vector_value_column_prefix": "gamma_",
+    "unique_id_column_name": "unique_id",
+    "source_dataset_column_name": "source_dataset",
+    "blocking_rules_to_generate_predictions": [
+        {
+            "blocking_rule": "l.\"first_name\" = r.\"first_name\"",
+            "sql_dialect": "duckdb"
+        },
+        {
+            "blocking_rule": "l.\"surname\" = r.\"surname\"",
+            "sql_dialect": "duckdb"
+        }
+    ],
+    "comparisons": [
+        {
+            "output_column_name": "first_name",
+            "comparison_levels": [
+                {
+                    "sql_condition": "\"first_name_l\" IS NULL OR \"first_name_r\" IS NULL",
+                    "label_for_charts": "first_name is NULL",
+                    "is_null_level": true
+                },
+                {
+                    "sql_condition": "\"first_name_l\" = \"first_name_r\"",
+                    "label_for_charts": "Exact match on first_name",
+                    "m_probability": 0.48854806009621365,
+                    "u_probability": 0.0056770619302010565
+                },
+                {
+                    "sql_condition": "jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.9",
+                    "label_for_charts": "Jaro-Winkler distance of first_name >= 0.9",
+                    "m_probability": 0.1903763096120358,
+                    "u_probability": 0.003424501164330396
+                },
+                {
+                    "sql_condition": "jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.8",
+                    "label_for_charts": "Jaro-Winkler distance of first_name >= 0.8",
+                    "m_probability": 0.08609678978546921,
+                    "u_probability": 0.006620702251038765
+                },
+                {
+                    "sql_condition": "ELSE",
+                    "label_for_charts": "All other comparisons",
+                    "m_probability": 0.23497884050628137,
+                    "u_probability": 0.9842777346544298
+                }
+            ],
+            "comparison_description": "jaro_winkler at thresholds 0.9, 0.8 vs. anything else"
+        },
+        {
+            "output_column_name": "surname",
+            "comparison_levels": [
+                {
+                    "sql_condition": "\"surname_l\" IS NULL OR \"surname_r\" IS NULL",
+                    "label_for_charts": "surname is NULL",
+                    "is_null_level": true
+                },
+                {
+                    "sql_condition": "\"surname_l\" = \"surname_r\"",
+                    "label_for_charts": "Exact match on surname",
+                    "m_probability": 0.43210610613512185,
+                    "u_probability": 0.004322481469643699
+                },
+                {
+                    "sql_condition": "jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.9",
+                    "label_for_charts": "Jaro-Winkler distance of surname >= 0.9",
+                    "m_probability": 0.2514700606335103,
+                    "u_probability": 0.002907020988387136
+                },
+                {
+                    "sql_condition": "jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.8",
+                    "label_for_charts": "Jaro-Winkler distance of surname >= 0.8",
+                    "m_probability": 0.0757748206402343,
+                    "u_probability": 0.0033636211436311888
+                },
+                {
+                    "sql_condition": "ELSE",
+                    "label_for_charts": "All other comparisons",
+                    "m_probability": 0.2406490125911336,
+                    "u_probability": 0.989406876398338
+                }
+            ],
+            "comparison_description": "jaro_winkler at thresholds 0.9, 0.8 vs. anything else"
+        },
+        {
+            "output_column_name": "dob",
+            "comparison_levels": [
+                {
+                    "sql_condition": "\"dob_l\" IS NULL OR \"dob_r\" IS NULL",
+                    "label_for_charts": "dob is NULL",
+                    "is_null_level": true
+                },
+                {
+                    "sql_condition": "\"dob_l\" = \"dob_r\"",
+                    "label_for_charts": "Exact match on dob",
+                    "m_probability": 0.39025358731716286,
+                    "u_probability": 0.0016036280808555408
+                },
+                {
+                    "sql_condition": "damerau_levenshtein(\"dob_l\", \"dob_r\") <= 1",
+                    "label_for_charts": "Damerau-Levenshtein distance of dob <= 1",
+                    "m_probability": 0.1489444378965258,
+                    "u_probability": 0.0016546990388445707
+                },
+                {
+                    "sql_condition": "ABS(EPOCH(try_strptime(\"dob_l\", '%Y-%m-%d')) - EPOCH(try_strptime(\"dob_r\", '%Y-%m-%d'))) <= 2629800.0",
+                    "label_for_charts": "Abs difference of 'transformed dob <= 1 month'",
+                    "m_probability": 0.08866691175438302,
+                    "u_probability": 0.002594404665842722
+                },
+                {
+                    "sql_condition": "ABS(EPOCH(try_strptime(\"dob_l\", '%Y-%m-%d')) - EPOCH(try_strptime(\"dob_r\", '%Y-%m-%d'))) <= 31557600.0",
+                    "label_for_charts": "Abs difference of 'transformed dob <= 1 year'",
+                    "m_probability": 0.10518866178811104,
+                    "u_probability": 0.030622146410222362
+                },
+                {
+                    "sql_condition": "ELSE",
+                    "label_for_charts": "All other comparisons",
+                    "m_probability": 0.26694640124381713,
+                    "u_probability": 0.9635251218042348
+                }
+            ],
+            "comparison_description": "Exact match vs. Damerau-Levenshtein distance <= 1 vs. month difference <= 1 vs. year difference <= 1 vs. anything else"
+        },
+        {
+            "output_column_name": "city",
+            "comparison_levels": [
+                {
+                    "sql_condition": "\"city_l\" IS NULL OR \"city_r\" IS NULL",
+                    "label_for_charts": "city is NULL",
+                    "is_null_level": true
+                },
+                {
+                    "sql_condition": "\"city_l\" = \"city_r\"",
+                    "label_for_charts": "Exact match on city",
+                    "m_probability": 0.561103053663773,
+                    "u_probability": 0.052019405886043986,
+                    "tf_adjustment_column": "city",
+                    "tf_adjustment_weight": 1.0
+                },
+                {
+                    "sql_condition": "ELSE",
+                    "label_for_charts": "All other comparisons",
+                    "m_probability": 0.438896946336227,
+                    "u_probability": 0.947980594113956
+                }
+            ],
+            "comparison_description": "Exact match 'city' vs. anything else"
+        },
+        {
+            "output_column_name": "email",
+            "comparison_levels": [
+                {
+                    "sql_condition": "\"email_l\" IS NULL OR \"email_r\" IS NULL",
+                    "label_for_charts": "email is NULL",
+                    "is_null_level": true
+                },
+                {
+                    "sql_condition": "\"email_l\" = \"email_r\"",
+                    "label_for_charts": "Exact match on email",
+                    "m_probability": 0.5521904988218763,
+                    "u_probability": 0.0023577568563241916
+                },
+                {
+                    "sql_condition": "NULLIF(regexp_extract(\"email_l\", '^[^@]+', 0), '') = NULLIF(regexp_extract(\"email_r\", '^[^@]+', 0), '')",
+                    "label_for_charts": "Exact match on transformed email",
+                    "m_probability": 0.22046667643566936,
+                    "u_probability": 0.0010970118706508391
+                },
+                {
+                    "sql_condition": "jaro_winkler_similarity(\"email_l\", \"email_r\") >= 0.88",
+                    "label_for_charts": "Jaro-Winkler distance of email >= 0.88",
+                    "m_probability": 0.21374764835824084,
+                    "u_probability": 0.0007367990176013098
+                },
+                {
+                    "sql_condition": "jaro_winkler_similarity(NULLIF(regexp_extract(\"email_l\", '^[^@]+', 0), ''), NULLIF(regexp_extract(\"email_r\", '^[^@]+', 0), '')) >= 0.88",
+                    "label_for_charts": "Jaro-Winkler distance of transformed email >= 0.88",
+                    "u_probability": 0.00027834629553827263
+                },
+                {
+                    "sql_condition": "ELSE",
+                    "label_for_charts": "All other comparisons",
+                    "m_probability": 0.013595176384213488,
+                    "u_probability": 0.9955300859598853
+                }
+            ],
+            "comparison_description": "jaro_winkler on username at threshold 0.88 vs. anything else"
+        }
+    ]
+}
+
+
+

This is simply the settings dictionary with additional entries for "m_probability" and "u_probability" in each of the "comparison_levels", which have been estimated during model training.

+

For example in the first name exact match level:

+
{
+    "sql_condition": "\"first_name_l\" = \"first_name_r\"",
+    "label_for_charts": "Exact match on first_name",
+    "m_probability": 0.48854806009621365,
+    "u_probability": 0.0056770619302010565
+},
+
+

where the m_probability and u_probability values here are then used to generate the match weight for an exact match on "first_name" between two records (i.e. the amount of evidence provided by records having the same first name) in model predictions.

+
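As a quick numerical check (plain Python, not part of Splink's API), the match weight implied by this comparison level is the base-2 logarithm of the ratio of these two probabilities:

```python
import math

# m and u for the exact-match-on-first_name comparison level shown above
m = 0.48854806009621365
u = 0.0056770619302010565

bayes_factor = m / u  # how much more likely this observation is among matches
match_weight = math.log2(bayes_factor)

print(round(match_weight, 2))  # 6.43 - bits of evidence from an exact first_name match
```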

Loading a pre-trained model

+

When using a pre-trained model, you can read in the model from a JSON file and recreate the Linker object to make new pairwise predictions. For example:

+
linker = Linker(
+    new_df,
+    settings="./path/to/model.json",
+    db_api=db_api
+)
+
+ + + + + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/theory/fellegi_sunter.html b/topic_guides/theory/fellegi_sunter.html new file mode 100644 index 0000000000..55d6f4c880 --- /dev/null +++ b/topic_guides/theory/fellegi_sunter.html @@ -0,0 +1,5760 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + The Fellegi-Sunter Model - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + + + + + +
+
+ + + + + + + + + + + + +

The Fellegi-Sunter model

+

This topic guide gives a high-level introduction to the Fellegi-Sunter model, the statistical model that underlies Splink's methodology.

+

For a more detailed interactive guide that aligns to Splink's methodology see Robin Linacre's interactive introduction to probabilistic linkage.

+
+ +

Parameters of the Fellegi-Sunter model

+

The Fellegi-Sunter model has three main parameters that need to be considered to generate a match probability between two records:

+
+
    +
  • \(\lambda\) - probability that any two records match
  • +
  • \(m\) - probability of a given observation given the records are a match
  • +
  • \(u\) - probability of a given observation given the records are not a match
  • +
+
+
+ +

λ probability

+

The lambda (\(\lambda\)) parameter is the prior probability that any two records match. I.e. assuming no other knowledge of the data, how likely is a match? Or, as a formula:

+
\[ +\lambda = Pr(\textsf{Records match}) +\]
+

This is the same for all record comparisons, but is highly dependent on:

+
    +
  • The total number of records
  • +
  • The number of duplicate records (more duplicates increases \(\lambda\))
  • +
  • The overlap between datasets
      +
    • Two datasets covering the same cohort (high overlap, high \(\lambda\))
    • +
    • Two entirely independent datasets (low overlap, low \(\lambda\))
    • +
    +
  • +
+
+ +

m probability

+

The \(m\) probability is the probability of a given observation given the records are a match. Or, as a formula:

+
\[ +m = Pr(\textsf{Observation | Records match}) +\]
+

For example, consider the \(m\) probability of a match on Date of Birth (DOB). For two records that are a match, what is the probability that:

+
    +
  • DOB is the same:
  • +
  • Almost 100%, say 98% \(\Longrightarrow m \approx 0.98\)
  • +
  • DOB is different:
  • +
  • Maybe a 2% chance of a data error? \(\Longrightarrow m \approx 0.02\)
  • +
+

The \(m\) probability is largely a measure of data quality - if DOB is poorly collected, it may only match exactly for 50% of true matches.

+
+ +

u probability

+

The \(u\) probability is the probability of a given observation given the records are not a match. Or, as a formula:

+
\[ +u = Pr(\textsf{Observation | Records do not match}) +\]
+

For example, consider the \(u\) probability of a match on Surname. For two records that are not a match, what is the probability that:

+
    +
  • Surname is the same:
  • +
  • Depending on the surname, <1%? \(\Longrightarrow u \approx 0.005\)
  • +
  • Surname is different:
  • +
  • Almost 100% \(\Longrightarrow u \approx 0.995\)
  • +
+

The \(u\) probability is a measure of coincidence. As there are so many possible surnames, the chance of sharing the same surname with a randomly-selected person is small.

+
+ +

Interpreting m and u

+

In the case of a perfect unique identifier:

+
    +
  • A person is only assigned one such value - \(m = 1\) (match) or \(m=0\) (non-match)
  • +
  • A value is only ever assigned to one person - \(u = 0\) (match) or \(u = 1\) (non-match)
  • +
+

Where \(m\) and \(u\) deviate from these ideals can usually be intuitively explained:

+
+

m probability

+

A measure of data quality/reliability.

+

How often might a person's information change legitimately or through data error?

+
    +
  • Names: typos, aliases, nicknames, middle names, married names etc.
  • +
  • DOB: typos, estimates (e.g. 1st Jan YYYY where date not known)
  • +
  • Address: formatting issues, moving house, multiple addresses, temporary addresses
  • +
+
+
+

u probability

+

A measure of coincidence/cardinality1.

+

How many different people might share a given identifier?

+
    +
  • DOB (high cardinality) – for a flat age distribution spanning ~30 years, there are ~10,000 DOBs (0.01% chance of a match)
  • +
  • Sex (low cardinality) – only 2 potential values (~50% chance of a match)
  • +
+
+
+ +

Match Weights

+

One of the key measures of evidence of a match between records is the match weight.

+

Deriving Match Weights from m and u

+

The match weight is a measure of the relative size of \(m\) and \(u\):

+
\[ +\begin{equation} +\begin{aligned} + M &= \log_2\left(\frac{\lambda}{1-\lambda}\right) + \log_2 K \\[10pt] + &= \log_2\left(\frac{\lambda}{1-\lambda}\right) + \log_2 m - \log_2 u +\end{aligned} +\end{equation} +\]
+

where \(\lambda\) is the probability that two random records match and \(K=m/u\) is the Bayes factor.

+

A key assumption of the Fellegi-Sunter model is that observations from different columns/comparisons are independent of one another. This means that the Bayes factor for two records is the product of the Bayes factors for each column/comparison:

+
\[ K_\textsf{features} = K_\textsf{forename} \cdot K_\textsf{surname} \cdot K_\textsf{dob} \cdot K_\textsf{city} \cdot K_\textsf{email} \]
+

This, in turn, means that match weights are additive:

+
\[ M_\textsf{obs} = M_\textsf{prior} + M_\textsf{features} \]
+

where \(M_\textsf{prior} = \log_2\left(\frac{\lambda}{1-\lambda}\right)\) and +\(M_\textsf{features} = M_\textsf{forename} + M_\textsf{surname} + M_\textsf{dob} + M_\textsf{city} + M_\textsf{email}\).

+

So, considering these properties, the total match weight for two observed records can be rewritten as:

+
\[ +\begin{equation} +\begin{aligned} + M_\textsf{obs} &= \log_2\left(\frac{\lambda}{1-\lambda}\right) + \sum_{i}^\textsf{features}\log_2(\frac{m_i}{u_i}) \\[10pt] + &= \log_2\left(\frac{\lambda}{1-\lambda}\right) + \log_2\left(\prod_i^\textsf{features}\frac{m_i}{u_i}\right) +\end{aligned} +\end{equation} +\]
+
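This additivity can be verified numerically. The sketch below uses plain Python with made-up \(\lambda\), \(m\) and \(u\) values (not taken from any real model):

```python
import math

lam = 0.001  # illustrative prior probability that two random records match

# illustrative (m, u) pairs for three feature comparisons
features = {
    "forename": (0.9, 0.01),
    "surname": (0.9, 0.005),
    "dob": (0.95, 0.0001),
}

prior_weight = math.log2(lam / (1 - lam))
feature_weights = {name: math.log2(m / u) for name, (m, u) in features.items()}

# match weights are additive...
total_match_weight = prior_weight + sum(feature_weights.values())

# ...which is equivalent to the log2 of the product of the Bayes factors
product_form = math.log2(
    (lam / (1 - lam)) * math.prod(m / u for m, u in features.values())
)
assert math.isclose(total_match_weight, product_form)
```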

Interpreting Match Weights

+

The match weight is the central metric showing the amount of evidence for a match provided by each of the features in a model.
+This is most easily shown through Splink's Waterfall Chart:

+

+
    +
  • 1️⃣ are the two records being compared
  • +
  • +

    2️⃣ is the match weight of the prior, \(M_\textsf{prior} = \log_2\left(\frac{\lambda}{1-\lambda}\right)\). + This is the match weight if no additional knowledge of features is taken into account, and can be thought of as similar to the y-intercept in a simple regression.

    +
  • +
  • +

    3️⃣ are the match weights of each feature, \(M_\textsf{forename}\), \(M_\textsf{surname}\), \(M_\textsf{dob}\), \(M_\textsf{city}\) and \(M_\textsf{email}\) respectively.

    +
  • +
  • +

    4️⃣ is the total match weight for two observed records, combining 2️⃣ and 3️⃣:

    +
    \[ +\begin{equation} +\begin{aligned} + M_\textsf{obs} &= M_\textsf{prior} + M_\textsf{forename} + M_\textsf{surname} + M_\textsf{dob} + M_\textsf{city} + M_\textsf{email} \\[10pt] + &= -6.67 + 4.74 + 6.49 - 1.97 - 1.12 + 8.00 \\[10pt] + &= 9.48 +\end{aligned} +\end{equation} +\]
    +
  • +
  • +

5️⃣ is an axis representing the \(\textsf{match weight} = \log_2(\textsf{Bayes factor})\)

    +
  • +
  • +

    6️⃣ is an axis representing the equivalent match probability (noting the non-linear scale). For more on the relationship between match weight and probability, see the sections below

    +
  • +
+
+ +

Match Probability

+

Match probability is a more intuitive measure of similarity than match weight, and is generally used when choosing a similarity threshold for record matching.

+

Deriving Match Probability from Match Weight

+

Probability of two records being a match can be derived from the total match weight:

+
\[ +Pr(\textsf{Match | Observation}) = \frac{2^{M_\textsf{obs}}}{1+2^{M_\textsf{obs}}} +\]
+
+Example +

Consider the example in the Interpreting Match Weights section. +The total match weight, \(M_\textsf{obs} = 9.48\). Therefore,

+
\[ Pr(\textsf{Match | Observation}) = \frac{2^{9.48}}{1+2^{9.48}} \approx 0.999 \]
+
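This conversion is easy to express as a small helper (a plain-Python sketch, not a Splink function):

```python
def match_weight_to_probability(match_weight: float) -> float:
    """Convert a total match weight M into Pr(Match | Observation)."""
    bayes_factor = 2 ** match_weight
    return bayes_factor / (1 + bayes_factor)

print(round(match_weight_to_probability(9.48), 3))  # 0.999, matching the example
print(match_weight_to_probability(0.0))             # 0.5 - even odds
```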
+

Understanding the relationship between Match Probability and Match Weight

+

It can be helpful to build up some intuition for how match weight translates into match probability.

+

Plotting match probability versus match weight gives the following chart:

+

+

Some observations from this chart:

+
    +
  • \(\textsf{Match weight} = 0 \Longrightarrow \textsf{Match probability} = 0.5\)
  • +
  • \(\textsf{Match weight} = 2 \Longrightarrow \textsf{Match probability} = 0.8\)
  • +
  • \(\textsf{Match weight} = 3 \Longrightarrow \textsf{Match probability} = 0.9\)
  • +
  • \(\textsf{Match weight} = 4 \Longrightarrow \textsf{Match probability} = 0.95\)
  • +
  • \(\textsf{Match weight} = 7 \Longrightarrow \textsf{Match probability} = 0.99\)
  • +
+

So, the impact of any additional match weight on match probability gets smaller as the total match weight increases. This makes intuitive sense as, when comparing two records, after you already have a lot of evidence/features indicating a match, adding more evidence/features will not have much of an impact on the probability of a match.

+

Similarly, if you already have a lot of negative evidence/features indicating a non-match, adding further evidence/features will not have much of an impact on the probability of a match.

+

Deriving Match Probability from m and u

+

Given the definitions for match probability and match weight above, we can rewrite the probability in terms of \(m\) and \(u\).

+
\[ +\begin{equation} +\begin{aligned} +Pr(\textsf{Match | Observation}) &= \frac{2^{\log_2\left(\frac{\lambda}{1-\lambda}\right) + \log_2\left(\prod_{i}^\textsf{features}\frac{m_{i}}{u_{i}}\right)}}{1+2^{\log_2\left(\frac{\lambda}{1-\lambda}\right) + \log_2\left(\prod_{i}^\textsf{features}\frac{m_{i}}{u_{i}}\right)}} \\[20pt] + &= \frac{\left(\frac{\lambda}{1-\lambda}\right)\prod_{i}^\textsf{features}\frac{m_{i}}{u_{i}}}{1+\left(\frac{\lambda}{1-\lambda}\right)\prod_{i}^\textsf{features}\frac{m_{i}}{u_{i}}} \\[20pt] + &= 1 - \left[1+\left(\frac{\lambda}{1-\lambda}\right)\prod_{i}^\textsf{features}\frac{m_{i}}{u_{i}}\right]^{-1} +\end{aligned} +\end{equation} +\]
+
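As a sanity check, the direct form on the last line agrees with the route via the total match weight. The \(\lambda\) and \(m/u\) values below are illustrative only:

```python
import math

lam = 0.001
m_over_u = [0.9 / 0.01, 0.9 / 0.005, 0.95 / 0.0001]  # illustrative Bayes factors

prior_odds = lam / (1 - lam)
bayes_factor_product = math.prod(m_over_u)

# direct form: 1 - [1 + (lambda/(1-lambda)) * prod(m/u)]^-1
p_direct = 1 - 1 / (1 + prior_odds * bayes_factor_product)

# via match weight: 2^M / (1 + 2^M)
M = math.log2(prior_odds) + sum(math.log2(k) for k in m_over_u)
p_via_weight = 2 ** M / (1 + 2 ** M)

assert math.isclose(p_direct, p_via_weight)
```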
+ +

Further Reading

+

This academic paper provides a detailed mathematical description of the model used by the R fastLink package. The mathematical model used by Splink is very similar.

+
+
+
    +
  1. +

Cardinality is the number of items in a set. In record linkage, cardinality refers to the number of possible values a feature could have. +This is important in record linkage, as the number of possible options for e.g. date of birth has a significant impact on the amount of evidence that a match on date of birth provides for two records being a match. 

    +
  2. +
+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/theory/linked_data_as_graphs.html b/topic_guides/theory/linked_data_as_graphs.html new file mode 100644 index 0000000000..127ce33d0d --- /dev/null +++ b/topic_guides/theory/linked_data_as_graphs.html @@ -0,0 +1,5300 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Linked Data as Graphs - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Linked data as graphs

+

When you link data, the results can be thought of as a graph, where each record (node) in your data is connected to other records by links (edges). This guide discusses relevant graph theory.

+

A graph is a collection of points (referred to in graph theory as nodes or vertices) connected by lines (referred to as edges).

+

Basic Graph

+

A group of interconnected nodes is referred to as a cluster.

+

Basic Cluster

+

Graphs provide a natural way to represent linked data, where the nodes of a graph represent records being linked and the edges represent the links between them. So, if we have 5 records (A-E) in our dataset(s), with links between them, this can be represented as a graph like so:

+

Basic Graph - Records

+

When linking people together, a cluster represents all of the records in our dataset(s) that refer to the same person. We can give this cluster a new identifier (F) as a way of referring to this single person.

+

Basic Person Cluster

+
+

Note

+

For clusters produced with Splink, every edge comes with an associated Splink score (the probability of two records being a match). The clustering threshold (match_probability_threshold) supplied by the user determines which records are included in a cluster, as any links (edges) between records with a match probability below this threshold are excluded.

+

Clusters, specifically cluster IDs, are the ultimate output of a Splink pipeline.

+
+

Probabilistic data linkage and graphs

+

When performing probabilistic linkage, each pair of records has a score indicating how similar they are. For example, consider a collection of records with pairwise similarity scores:

+

Threshold Cluster

+

Having a score associated with each pair of records is the key benefit of probabilistic linkage, as we have a measure of similarity of the records (rather than a binary link/no-link). However, we need to choose a threshold at or above which links are considered valid in order to generate our final linked data (clusters).

+

Let's consider a few different thresholds for the records above to see how the resulting clusters change. Setting a threshold of 0.95 keeps all links, so the records are all joined up into a single cluster.

+

Threshold Cluster

+

Whereas if we increase the threshold to 0.99, one link is discarded. This breaks the records into two clusters.

+

Threshold Cluster

+

Increasing the threshold further (to 0.999) breaks an additional two links, resulting in a total of three clusters.

+

Threshold Cluster

+

This demonstrates that choice of threshold can have a significant impact on the final linked data produced (i.e. clusters). For more specific guidance on selecting linkage thresholds, check out the Evaluation Topic Guides.
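This thresholding behaviour can be sketched with a small union-find over scored edges. The five records and scores below are illustrative, chosen to reproduce the one/two/three-cluster pattern described above:

```python
def clusters_at_threshold(nodes, scored_edges, threshold):
    """Group nodes into clusters, keeping only edges with score >= threshold."""
    parent = {n: n for n in nodes}

    def find(n):
        # walk up to the root, compressing the path as we go
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b, score in scored_edges:
        if score >= threshold:
            parent[find(a)] = find(b)  # union the two components

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())

nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B", 0.999), ("B", "C", 0.99), ("C", "D", 0.95), ("D", "E", 0.999)]

print(len(clusters_at_threshold(nodes, edges, 0.95)))   # 1 cluster: all links kept
print(len(clusters_at_threshold(nodes, edges, 0.99)))   # 2 clusters: C-D link dropped
print(len(clusters_at_threshold(nodes, edges, 0.999)))  # 3 clusters: B-C also dropped
```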

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/theory/probabilistic_vs_deterministic.html b/topic_guides/theory/probabilistic_vs_deterministic.html new file mode 100644 index 0000000000..7116583946 --- /dev/null +++ b/topic_guides/theory/probabilistic_vs_deterministic.html @@ -0,0 +1,5383 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Probabilistic vs Deterministic linkage - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Types of Record Linkage

+

There are two main types of record linkage - Deterministic and Probabilistic.

+

Deterministic Linkage

+

Deterministic Linkage is a rules-based approach for joining records together.

+

For example, consider a single table with duplicates:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
AIDNameDOBPostcode
A00001Bob Smith1990-05-09AB12 3CD
A00002Robert Smith1990-05-09AB12 3CD
A00003Robert "Bobby” Smith1990-05-09-
+

and some deterministic rules:

+
IF Name matches AND DOB matches (Rule 1)
+THEN records are a match
+
+ELSE
+
IF Forename matches AND DOB matches AND Postcode matches (Rule 2)
+THEN records are a match
+
+ELSE
+
+records do not match
+
+

Applying these rules to the table above leads to no matches:

+

A00001-A00002 No match (different forename)
+A00001-A00003 No match (different forename)
+A00002-A00003 No match (missing postcode)

+

So, even a relatively simple dataset, with duplicates that are obvious to a human, will require more complex rules.

+

In general, Deterministic linkage is:

+
+
+ ✅ Computationally cheap
+ ✅ Capable of achieving high precision (few False Positives) +
+
+ ❌ Lacking in subtlety
+ ❌ Prone to Low recall (False Negatives) +
+
+ +
+Deterministic Linkage in Splink +

While Splink is primarily a tool for Probabilistic linkage, Deterministic linkage is also supported (utilising blocking rules). See the example notebooks for how Deterministic linkage is implemented in Splink.

+
+

Probabilistic Linkage

+

Probabilistic Linkage is an evidence-based approach for joining records together.

+

Linkage is probabilistic in the sense that it relies on the balance of evidence. In a large dataset, observing that two records match on the full name 'Robert Smith' provides some evidence that these two records may refer to the same person, but this evidence is inconclusive. However, the cumulative evidence from across multiple features within the dataset (e.g. date of birth, home address, email address) can provide conclusive evidence of a match. The evidence for a match is commonly represented as a probability.

+

For example, putting the first 2 records of the table above through a probabilistic model gives an overall probability that the records are a match:
+

+

In addition, the breakdown of this probability by the evidence provided by each feature can be shown through a waterfall chart:

+

+

Given these probabilities, unlike (binary) Deterministic linkage, the user can choose an evidence threshold for what they consider a match before creating a new unique identifier.

+

This is important, as it allows the linkage to be customised to best support the specific use case. For example, if it is important to:

+
    +
  • minimise False Positive matches (i.e. where False Negatives are less of a concern), a higher threshold for a match can be chosen.
  • +
  • maximise True Positive matches (i.e. where False Positives are less of a concern), a lower threshold can be chosen.
  • +
+
+

Further Reading

+

For a more in-depth introduction to Probabilistic Data Linkage, including an interactive version of the waterfall chart above, see Robin Linacre's Blog.

+
+
+Probabilistic Linkage in Splink +

Splink is primarily a tool for Probabilistic linkage, and implements the Fellegi-Sunter model - the most common probabilistic record linkage model. See the Splink Tutorial for a step by step guide for Probabilistic linkage in Splink.

+

A Topic Guide on the Fellegi-Sunter model can be found here!

+
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/theory/record_linkage.html b/topic_guides/theory/record_linkage.html new file mode 100644 index 0000000000..9b1606a5d4 --- /dev/null +++ b/topic_guides/theory/record_linkage.html @@ -0,0 +1,5309 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Why do we need record linkage? - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

Why do we need record linkage?

+

In a perfect world

+

In a perfect world, everyone (and everything) would have a single, unique identifier. If this were the case, linking any datasets would be a simple inner join.

+
+Example +

Consider 2 tables of people A and B with no duplicates and each person has a unique id UID. To join these tables in SQL we would write:

+
SELECT *
+FROM A
+INNER JOIN B
+ON A.UID = B.UID
+
+
+

In reality

+

Real datasets often lack truly unique identifiers (both within and across datasets).

+

The overall aim of record linkage is to generate a unique identifier to be used like the UID in our "perfect world" scenario.

+

Record linkage is the process of using the information within records to assess whether records refer to the same entity. For example, if records refer to people, factors such as names, date of birth, location etc. can be used to link records together.

+

Record linkage can be done within datasets (deduplication) or between datasets (linkage), or both.

+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/topic_guides/topic_guides_index.html b/topic_guides/topic_guides_index.html new file mode 100644 index 0000000000..9d667cda29 --- /dev/null +++ b/topic_guides/topic_guides_index.html @@ -0,0 +1,5230 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + Introduction - Splink + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + + + + +
+ + + + + + + +
+ +
+ + + + +
+
+ + + +
+
+
+ + + + + + + +
+
+
+ + + +
+
+
+ + + +
+
+
+ + + +
+
+ + + + + + + + + + + + +

User Guide

+

This section contains in-depth guides on a variety of topics and concepts within Splink, as well as data linking more generally. These are intended to provide an extra layer of detail on top of the Splink tutorial and examples.

+

The user guide is broken up into the following categories:

+
    +
  1. Record Linkage Theory - for an introduction to data linkage from a theoretical perspective, and to help build some intuition around the parameters being estimated in Splink models.
  2. +
  3. Linkage Models in Splink - for an introduction to the building blocks of a Splink model. Including the supported SQL Backends and how to define a model with a Splink Settings dictionary.
  4. +
  5. Data Preparation - for guidance on preparing your data for linkage. Including guidance on feature engineering to help improve Splink models.
  6. +
  7. Blocking - for an introduction to Blocking Rules and their purpose within record linkage. Including how blocking rules are used in different contexts within Splink.
  8. +
9. Comparing Records - for guidance on defining Comparisons within a Splink model. Including how record comparisons are structured within Comparisons, how to utilise string comparators for fuzzy matching and how to deal with skewed data with Term Frequency Adjustments.
  10. +
  11. Model Training - for guidance on the methods for training a Splink model, and how to choose them for specific use cases. (Coming soon)
  12. +
  13. Clustering - for guidance on how records are clustered together. (Coming Soon)
  14. +
  15. Evaluation - for guidance on how to evaluate Splink models, links and clusters (including Clerical Labelling).
  16. +
  17. Performance - for guidance on how to make Splink models run more efficiently.
  18. +
+ + + + + + + + + + + + + + + +
+
+ + + + + +
+ + + +
+ + + +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file