Resolving cluster_id issue in gazetteer_example.py #135

jmiller558 · 2023-05-26T19:47:43Z

Currently, gazetteer_example.py assigns cluster_id incorrectly (see #134).

This pull request resolves that issue by assigning a unique cluster_id to each entry in the messy dataset and then assigning that same cluster_id to every match from the canonical dataset. An entry in the canonical dataset can therefore carry multiple cluster_ids. The output is a CSV that can be sorted by cluster_id to see each entry in the messy dataset alongside all of its matches from the canonical dataset.
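As a rough illustration of the assignment described above, here is a minimal sketch. It is not the code from this pull request: the `results` list, the record ids, the `source` column, and the `assign_cluster_ids` helper are all hypothetical, and the `(messy_id, [(canonical_id, score), ...])` shape is only assumed to resemble what the example script iterates over.

```python
import csv
import io


def assign_cluster_ids(results):
    """Give each messy record its own cluster_id and copy it to its matches.

    `results` is assumed to be a list of (messy_id, matches) pairs, where
    `matches` is a list of (canonical_id, score) pairs. This shape is an
    assumption made for illustration.
    """
    rows = []
    for cluster_id, (messy_id, matches) in enumerate(results):
        # One row for the messy record itself.
        rows.append({"cluster_id": cluster_id, "source": "messy",
                     "record_id": messy_id, "score": ""})
        # One row per canonical match, sharing the messy record's cluster_id.
        for canonical_id, score in matches:
            rows.append({"cluster_id": cluster_id, "source": "canonical",
                         "record_id": canonical_id, "score": score})
    return rows


# Hypothetical match results: canonical record "c2" matches two messy records.
results = [
    ("m0", [("c1", 0.92), ("c2", 0.81)]),
    ("m1", [("c2", 0.77)]),
    ("m2", []),
]

rows = assign_cluster_ids(results)

# Write the rows as CSV (to an in-memory buffer here), sorted by cluster_id.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["cluster_id", "source", "record_id", "score"]
)
writer.writeheader()
writer.writerows(sorted(rows, key=lambda row: row["cluster_id"]))
csv_text = buffer.getvalue()
```

In this sketch a canonical record that matches several messy records appears once per cluster, which is what allows it to belong to more than one cluster_id.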

Fixing typo in gazetteer_example.py. The for loop should either use enumerate to generate cluster_id, or set cluster_id to 0 before the loop and increment it on each iteration. The current code does both, which results in incorrect cluster_ids.
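To make the failure mode concrete, the sketch below contrasts the mixed pattern with the two consistent alternatives. It is a hypothetical reconstruction for illustration only, not the original loop: the function names, the sample `results`, and the position of the stray increment are assumptions.

```python
def label_mixed(results):
    """Mixed pattern: enumerate AND a manual increment (illustrative bug)."""
    labels = []
    cluster_id = 0
    for cluster_id, (messy_id, matches) in enumerate(results):
        labels.append((messy_id, cluster_id))
        # This increment shifts the id used for the matches below, and
        # enumerate rebinds cluster_id at the start of the next iteration.
        cluster_id += 1
        for canonical_id, _score in matches:
            labels.append((canonical_id, cluster_id))
    return labels


def label_enumerate(results):
    """Option 1: let enumerate supply cluster_id."""
    labels = []
    for cluster_id, (messy_id, matches) in enumerate(results):
        labels.append((messy_id, cluster_id))
        for canonical_id, _score in matches:
            labels.append((canonical_id, cluster_id))
    return labels


def label_counter(results):
    """Option 2: start at 0 and increment once at the end of each iteration."""
    labels = []
    cluster_id = 0
    for messy_id, matches in results:
        labels.append((messy_id, cluster_id))
        for canonical_id, _score in matches:
            labels.append((canonical_id, cluster_id))
        cluster_id += 1
    return labels


# Hypothetical data: each messy record has one canonical match.
results = [("m0", [("c1", 0.9)]), ("m1", [("c2", 0.8)])]

mixed = label_mixed(results)
by_enumerate = label_enumerate(results)
by_counter = label_counter(results)
```

With this sample data, the mixed version labels `c1` with the id that `m1` receives on the next iteration, while the two consistent versions agree with each other.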

Updated the code to resolve the cluster_id issue. Each entry from the messy dataset is now assigned a unique cluster_id, and its matching entries from the canonical dataset are assigned that same cluster_id. The relationship is one-to-many (one messy entry to many canonical matches), and an entry from the canonical dataset can belong to multiple cluster_ids.

jmiller558 added 2 commits May 26, 2023 10:17

Update gazetteer_example.py

023236a

jmiller558 mentioned this pull request May 26, 2023

Issue with cluster_id in gazetteer_example.py #134

Open
