Resolving cluster_id issue in gazetteer_example.py #135

Open: wants to merge 2 commits into main


jmiller558

Currently, gazetteer_example.py has an issue with the cluster_id assignment (see #134).

This pull request resolves that issue by assigning a unique cluster_id to each entry in the messy dataset, and then assigning that same cluster_id to all of its matches from the canonical dataset. Entries in the canonical dataset may therefore have multiple cluster_ids. The script then outputs a CSV that can be sorted by cluster_id to see each entry in the messy dataset alongside all of its corresponding matches from the canonical dataset.

Fixing typo in gazetteer_example.py

The for loop should either use enumerate to supply the cluster_id, or set cluster_id to 0 before the loop and increment it on each iteration.

The current code does both, resulting in incorrect cluster_ids.
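
The two alternatives described above can be sketched as follows. This is a minimal illustration, not the code from gazetteer_example.py: the `results` list, its `(messy_id, matches)` shape, and both function names are assumptions made for the example.

```python
# Hypothetical match results: one (messy_record_id, [canonical_record_ids])
# pair per messy record. This shape is assumed for illustration only.
results = [
    ("messy_a", ["canon_1", "canon_2"]),
    ("messy_b", ["canon_2"]),
    ("messy_c", []),
]


def ids_with_enumerate(results):
    """Option 1: let enumerate supply the cluster_id; no manual counter."""
    assigned = {}
    for cluster_id, (messy_id, _matches) in enumerate(results):
        assigned[messy_id] = cluster_id
    return assigned


def ids_with_counter(results):
    """Option 2: start cluster_id at 0 and increment once per iteration."""
    assigned = {}
    cluster_id = 0
    for messy_id, _matches in results:
        assigned[messy_id] = cluster_id
        cluster_id += 1
    return assigned
```

Either option on its own gives each messy record a distinct, consecutive cluster_id. Mixing them lets the manual increment and the value supplied by enumerate overwrite each other, which is the kind of conflict this commit removes.
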
Updated the code to resolve the cluster_id issue.

Now each entry from the messy dataset is assigned a unique cluster_id, and matching entries from the canonical dataset are assigned that same cluster_id.

The relationship is one-to-many, so an entry from the canonical dataset can belong to multiple cluster_ids.
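
As a rough sketch of the one-to-many assignment described above (again, not the actual code in this pull request): the `example_results` data, the `build_rows` and `to_csv` helpers, and the `source` and `record_id` column names are all hypothetical.

```python
import csv
import io


def build_rows(results):
    """Give each messy record its own cluster_id and copy that id onto
    every canonical match, so a canonical record may appear several times."""
    rows = []
    for cluster_id, (messy_id, canonical_ids) in enumerate(results):
        rows.append(
            {"cluster_id": cluster_id, "source": "messy", "record_id": messy_id}
        )
        for canonical_id in canonical_ids:
            rows.append(
                {"cluster_id": cluster_id, "source": "canonical", "record_id": canonical_id}
            )
    return rows


def to_csv(rows):
    """Write the rows as CSV text, sorted by cluster_id."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["cluster_id", "source", "record_id"])
    writer.writeheader()
    writer.writerows(sorted(rows, key=lambda row: row["cluster_id"]))
    return buffer.getvalue()


# Hypothetical input: (messy_record_id, [matching canonical_record_ids]).
example_results = [
    ("messy_a", ["canon_1", "canon_2"]),
    ("messy_b", ["canon_2"]),
]
rows = build_rows(example_results)
csv_text = to_csv(rows)
```

In this example `canon_2` ends up in two rows, one per cluster it matched, which is what allows the output to be sorted by cluster_id to read each messy entry next to its canonical matches.
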