"id","truth_value","family_name","given_name","gender","birth_date","phone","street_address","city","state","postal_code","SSN"
"4_Arlene_Oliver.xml","","oliver","arlene","female","05/26/1972","434-228-8487","845 Collier Pike Center","North Katheryn","WA","99033",""
"21_Raul_Waters.xml","","waters","raul","male","10/31/1972","(878)360-4765","61035 Adell Ranch Wall","Kerlukeburgh","NC","28885",""
"22_Raul_Waters.xml","","waters","raul","male","10/31/1972","(878)360-4765","61035 Adell Ranch Wall","Kerlukeburgh","NC","28885",""
"29_Terrance_Weber.xml","","weber","terrance","male","09/19/1960","903-814-2082","88790 Lemke Trail Road","South Quinton","NE","68843",""
"30_Terrance_Weber.xml","","weber","terrance","male","09/12/1960","903-814-2082","88790 Lemke Trail Road","South Quinton","NE","68843",""
This input has no SSN or truth value which may have something to do with the error.
Things get very bad (please note that running with the same data i initially sent is working fine) :
Format is CSV
Data dir is /tmp/x.csv
----- Estimating u probabilities using random sampling -----
u probability not trained for postal_code - Exact match Postcode District (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code - Exact match Postcode Area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
Estimated u probabilities using random sampling
Your model is not yet fully trained. Missing estimates for:
- given_name (no m values are trained).
- family_name (no m values are trained).
- birth_date (no m values are trained).
- postal_code (some u values are not trained, no m values are trained).
- street_address (no m values are trained).
- phone (no m values are trained).
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
l."ssn" = r."ssn"
Parameter estimates will be made for the following comparison(s):
- given_name
- family_name
- birth_date
- postal_code
- street_address
- phone
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
WARNING:
Level Within 1 month on comparison birth_date not observed in dataset, unable to train m value
WARNING:
Level Within 10 years on comparison birth_date not observed in dataset, unable to train m value
WARNING:
Level Exact match Postcode District on comparison postal_code not observed in dataset, unable to train m value
WARNING:
Level Exact match Postcode Area on comparison postal_code not observed in dataset, unable to train m value
WARNING:
Level All other comparisons on comparison postal_code not observed in dataset, unable to train m value
Iteration 1: Largest change in params was 0.904 in the m_probability of street_address, level `All other comparisons`
Iteration 2: Largest change in params was 0.197 in probability_two_random_records_match
Iteration 3: Largest change in params was 1.06e-09 in probability_two_random_records_match
EM converged after 3 iterations
m probability not trained for birth_date - Within 1 month (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Within 10 years (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code - Exact match Postcode District (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code - Exact match Postcode Area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
Your model is not yet fully trained. Missing estimates for:
- birth_date (some m values are not trained).
- postal_code (some u values are not trained, some m values are not trained).
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
l."birth_date" = r."birth_date"
Parameter estimates will be made for the following comparison(s):
- given_name
- family_name
- postal_code
- street_address
- phone
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- birth_date
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/splink/linker.py", line 691, in _log_and_run_sql_execution
return self._run_sql_execution(final_sql, templated_name, physical_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/splink/duckdb/linker.py", line 225, in _run_sql_execution
self._con.sql(final_sql)
duckdb.duckdb.OutOfRangeException: Out of Range Error: cannot take logarithm of zero
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/cage/public_html/mdinteractive/extraApps/dedupliFHIR-main/cli/ecqm_dedupe.py", line 190, in <module>
cli()
File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cage/public_html/mdinteractive-03/extraApps/dedupliFHIR-main/cli/deduplifhirLib/utils.py", line 150, in wrapper
return func(*args,**kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/home/cage/public_html/mdinteractive/extraApps/dedupliFHIR-main/cli/ecqm_dedupe.py", line 35, in dedupe_data
linker.estimate_parameters_using_expectation_maximisation(
File "/usr/local/lib/python3.11/site-packages/splink/linker.py", line 1682, in estimate_parameters_using_expectation_maximisation
em_training_session._train()
File "/usr/local/lib/python3.11/site-packages/splink/em_training_session.py", line 197, in _train
expectation_maximisation(self, cvv)
File "/usr/local/lib/python3.11/site-packages/splink/expectation_maximisation.py", line 256, in expectation_maximisation
df_params = linker._execute_sql_pipeline([df_comparison_vector_values])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/splink/linker.py", line 629, in _execute_sql_pipeline
raise e
File "/usr/local/lib/python3.11/site-packages/splink/linker.py", line 623, in _execute_sql_pipeline
dataframe = self._sql_to_splink_dataframe_checking_cache(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/splink/linker.py", line 889, in _sql_to_splink_dataframe_checking_cache
splink_dataframe = self._execute_sql_against_backend(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/splink/duckdb/linker.py", line 220, in _execute_sql_against_backend
self._log_and_run_sql_execution(sql, templated_name, physical_name)
File "/usr/local/lib/python3.11/site-packages/splink/linker.py", line 703, in _log_and_run_sql_execution
raise SplinkException(
splink.exceptions.SplinkException: Error executing the following sql for table `__splink__m_u_counts`(__splink__m_u_counts_26a1779fb):
CREATE TABLE __splink__m_u_counts_26a1779fb AS
(
WITH __splink__df_comparison_vectors AS (
SELECT
*
FROM __splink__df_comparison_vectors_a7cd2b9d8
), __splink__df_match_weight_parts AS (
SELECT
"unique_id_l",
"unique_id_r",
gamma_given_name,
CASE
WHEN gamma_given_name = -1
THEN CAST(1.0 AS DOUBLE)
WHEN gamma_given_name = 4
THEN CAST(111.89189187998129 AS DOUBLE)
WHEN gamma_given_name = 3
THEN CAST(81.88130481588395 AS DOUBLE)
WHEN gamma_given_name = 2
THEN CAST(92.09886357253461 AS DOUBLE)
WHEN gamma_given_name = 1
THEN CAST(48.06629076272102 AS DOUBLE)
WHEN gamma_given_name = 0
THEN CAST(4.009431336940681e-229 AS DOUBLE)
END AS bf_given_name,
CASE
WHEN gamma_given_name = -1
THEN CAST(1 AS DOUBLE)
WHEN gamma_given_name = 4
THEN (
CASE
WHEN NOT COALESCE("tf_given_name_l", "tf_given_name_r") IS NULL
THEN POWER(
CAST(0.0014492753623188406 AS DOUBLE) / NULLIF(
(
CASE
WHEN COALESCE("tf_given_name_l", "tf_given_name_r") >= COALESCE("tf_given_name_r", "tf_given_name_l")
THEN COALESCE("tf_given_name_l", "tf_given_name_r")
ELSE COALESCE("tf_given_name_r", "tf_given_name_l")
END
),
0
),
CAST(1.0 AS DOUBLE)
)
ELSE CAST(1 AS DOUBLE)
END
)
WHEN gamma_given_name = 3
THEN CAST(1 AS DOUBLE)
WHEN gamma_given_name = 2
THEN CAST(1 AS DOUBLE)
WHEN gamma_given_name = 1
THEN CAST(1 AS DOUBLE)
WHEN gamma_given_name = 0
THEN CAST(1 AS DOUBLE)
END AS bf_tf_adj_given_name,
gamma_family_name,
CASE
WHEN gamma_family_name = -1
THEN CAST(1.0 AS DOUBLE)
WHEN gamma_family_name = 4
THEN CAST(139.19707503507163 AS DOUBLE)
WHEN gamma_family_name = 3
THEN CAST(115.4071885370217 AS DOUBLE)
WHEN gamma_family_name = 2
THEN CAST(165.4130726624298 AS DOUBLE)
WHEN gamma_family_name = 1
THEN CAST(57.22386641775578 AS DOUBLE)
WHEN gamma_family_name = 0
THEN CAST(3.992587823932763e-229 AS DOUBLE)
END AS bf_family_name,
CASE
WHEN gamma_family_name = -1
THEN CAST(1 AS DOUBLE)
WHEN gamma_family_name = 4
THEN (
CASE
WHEN NOT COALESCE("tf_family_name_l", "tf_family_name_r") IS NULL
THEN POWER(
CAST(0.0011649825407125862 AS DOUBLE) / NULLIF(
(
CASE
WHEN COALESCE("tf_family_name_l", "tf_family_name_r") >= COALESCE("tf_family_name_r", "tf_family_name_l")
THEN COALESCE("tf_family_name_l", "tf_family_name_r")
ELSE COALESCE("tf_family_name_r", "tf_family_name_l")
END
),
0
),
CAST(1.0 AS DOUBLE)
)
ELSE CAST(1 AS DOUBLE)
END
)
WHEN gamma_family_name = 3
THEN CAST(1 AS DOUBLE)
WHEN gamma_family_name = 2
THEN CAST(1 AS DOUBLE)
WHEN gamma_family_name = 1
THEN CAST(1 AS DOUBLE)
WHEN gamma_family_name = 0
THEN CAST(1 AS DOUBLE)
END AS bf_tf_adj_family_name,
gamma_postal_code,
CASE
WHEN gamma_postal_code = -1
THEN CAST(1.0 AS DOUBLE)
WHEN gamma_postal_code = 4
THEN CAST(266.5650741350906 AS DOUBLE)
WHEN gamma_postal_code = 3
THEN CAST(3.970841325158504e-229 AS DOUBLE)
WHEN gamma_postal_code = 2
THEN CAST(1.259921049894873 AS DOUBLE)
WHEN gamma_postal_code = 1
THEN CAST(0.19842513149602492 AS DOUBLE)
WHEN gamma_postal_code = 0
THEN CAST(0.03125 AS DOUBLE)
END AS bf_postal_code,
gamma_street_address,
CASE
WHEN gamma_street_address = -1
THEN CAST(1.0 AS DOUBLE)
WHEN gamma_street_address = 1
THEN CAST(259.14714720358563 AS DOUBLE)
WHEN gamma_street_address = 0
THEN CAST(0.9640980278331669 AS DOUBLE)
END AS bf_street_address,
CASE
WHEN gamma_street_address = -1
THEN CAST(1 AS DOUBLE)
WHEN gamma_street_address = 1
THEN (
CASE
WHEN NOT COALESCE("tf_street_address_l", "tf_street_address_r") IS NULL
THEN POWER(
CAST(0.00013905627143784185 AS DOUBLE) / NULLIF(
(
CASE
WHEN COALESCE("tf_street_address_l", "tf_street_address_r") >= COALESCE("tf_street_address_r", "tf_street_address_l")
THEN COALESCE("tf_street_address_l", "tf_street_address_r")
ELSE COALESCE("tf_street_address_r", "tf_street_address_l")
END
),
0
),
CAST(1.0 AS DOUBLE)
)
ELSE CAST(1 AS DOUBLE)
END
)
WHEN gamma_street_address = 0
THEN CAST(1 AS DOUBLE)
END AS bf_tf_adj_street_address,
gamma_phone,
CASE
WHEN gamma_phone = -1
THEN CAST(1.0 AS DOUBLE)
WHEN gamma_phone = 1
THEN CAST(1.7515202196470399 AS DOUBLE)
WHEN gamma_phone = 0
THEN CAST(0.5579969710188165 AS DOUBLE)
END AS bf_phone,
CASE
WHEN gamma_phone = -1
THEN CAST(1 AS DOUBLE)
WHEN gamma_phone = 1
THEN (
CASE
WHEN NOT COALESCE("tf_phone_l", "tf_phone_r") IS NULL
THEN POWER(
CAST(0.3703346620932604 AS DOUBLE) / NULLIF(
(
CASE
WHEN COALESCE("tf_phone_l", "tf_phone_r") >= COALESCE("tf_phone_r", "tf_phone_l")
THEN COALESCE("tf_phone_l", "tf_phone_r")
ELSE COALESCE("tf_phone_r", "tf_phone_l")
END
),
0
),
CAST(1.0 AS DOUBLE)
)
ELSE CAST(1 AS DOUBLE)
END
)
WHEN gamma_phone = 0
THEN CAST(1 AS DOUBLE)
END AS bf_tf_adj_phone,
"ssn_l",
"ssn_r",
match_key
FROM __splink__df_comparison_vectors
), __splink__df_predict AS (
SELECT
LOG(
2,
CAST(0.02611780678018568 AS DOUBLE) * bf_given_name * bf_tf_adj_given_name * bf_family_name * bf_tf_adj_family_name * bf_postal_code * bf_street_address * bf_tf_adj_street_address * bf_phone * bf_tf_adj_phone
) AS match_weight,
CASE
WHEN bf_given_name = CAST('infinity' AS DOUBLE)
OR bf_tf_adj_given_name = CAST('infinity' AS DOUBLE)
OR bf_family_name = CAST('infinity' AS DOUBLE)
OR bf_tf_adj_family_name = CAST('infinity' AS DOUBLE)
OR bf_postal_code = CAST('infinity' AS DOUBLE)
OR bf_street_address = CAST('infinity' AS DOUBLE)
OR bf_tf_adj_street_address = CAST('infinity' AS DOUBLE)
OR bf_phone = CAST('infinity' AS DOUBLE)
OR bf_tf_adj_phone = CAST('infinity' AS DOUBLE)
THEN 1.0
ELSE (
CAST(0.02611780678018568 AS DOUBLE) * bf_given_name * bf_tf_adj_given_name * bf_family_name * bf_tf_adj_family_name * bf_postal_code * bf_street_address * bf_tf_adj_street_address * bf_phone * bf_tf_adj_phone
) / NULLIF(
(
1 + (
CAST(0.02611780678018568 AS DOUBLE) * bf_given_name * bf_tf_adj_given_name * bf_family_name * bf_tf_adj_family_name * bf_postal_code * bf_street_address * bf_tf_adj_street_address * bf_phone * bf_tf_adj_phone
)
),
0
)
END AS match_probability,
"unique_id_l",
"unique_id_r",
gamma_given_name,
gamma_family_name,
gamma_postal_code,
gamma_street_address,
gamma_phone,
"ssn_l",
"ssn_r",
match_key
FROM __splink__df_match_weight_parts
ORDER BY
1 NULLS LAST
)
SELECT
gamma_given_name AS comparison_vector_value,
SUM(match_probability * 1) AS m_count,
SUM((
1 - match_probability
) * 1) AS u_count,
'given_name' AS output_column_name
FROM __splink__df_predict
GROUP BY
gamma_given_name
UNION ALL
SELECT
gamma_family_name AS comparison_vector_value,
SUM(match_probability * 1) AS m_count,
SUM((
1 - match_probability
) * 1) AS u_count,
'family_name' AS output_column_name
FROM __splink__df_predict
GROUP BY
gamma_family_name
UNION ALL
SELECT
gamma_postal_code AS comparison_vector_value,
SUM(match_probability * 1) AS m_count,
SUM((
1 - match_probability
) * 1) AS u_count,
'postal_code' AS output_column_name
FROM __splink__df_predict
GROUP BY
gamma_postal_code
UNION ALL
SELECT
gamma_street_address AS comparison_vector_value,
SUM(match_probability * 1) AS m_count,
SUM((
1 - match_probability
) * 1) AS u_count,
'street_address' AS output_column_name
FROM __splink__df_predict
GROUP BY
gamma_street_address
UNION ALL
SELECT
gamma_phone AS comparison_vector_value,
SUM(match_probability * 1) AS m_count,
SUM((
1 - match_probability
) * 1) AS u_count,
'phone' AS output_column_name
FROM __splink__df_predict
GROUP BY
gamma_phone
UNION ALL
SELECT
0 AS comparison_vector_value,
SUM(match_probability * 1) / NULLIF(SUM(1), 0) AS m_count,
SUM((
1 - match_probability
) * 1) / NULLIF(SUM(1), 0) AS u_count,
'_probability_two_random_records_match' AS output_column_name
FROM __splink__df_predict
)
Error was: Out of Range Error: cannot take logarithm of zero
Describe the bug
Ran into out of range error when using tool with this CSV input:
This input has no SSN or truth value which may have something to do with the error.
However when i run the script:
python3.11 cli/ecqm_dedupe.py dedupe-data --fmt CSV /tmp/x.csv /tmp/out.csv
Things get very bad (please note that running with the same data i initially sent is working fine) :
To Reproduce
Use the above input with this CLI call:
python3.11 cli/ecqm_dedupe.py dedupe-data --fmt CSV /tmp/x.csv /tmp/out.csvExpected behavior
The CLI should output the deduplicated data normally.
Actual behavior
The tool throws an out of range error