@Lightning11wins Lightning11wins commented Nov 14, 2025

The duplicate detection project is ready for review, although even in the best case, a couple of things still block it from being ready to merge.

I would appreciate a full review of all changes, as there's quite a lot here. That said, some areas may need extra attention, so I've compiled a list of all 28 TODO: Greg comments below. (Note: Some of my TODOs assume the reader understands the nearby context or has read the indicated source code.)

  • 1 TODO: Greg in update.qy.
  • 1 TODO: Greg in update_duplicates.sh.

Please let me know if you have any questions, comments, or concerns about my changes and design choices.

Israel added 3 commits November 7, 2025 11:08
Update front end dataqa plugin to follow the new schema for p_dups.
Add helpful comments.
Add smoother color gradient to dups UI.
Ignore blank emails, phones, and addresses.
Fix a bug where cos_compare() was used for phone numbers instead of lev_compare().
Abstract values into an object in update.qy.
Remove unhelpful optimization attempts.
Remove unhelpful comments.
Add known issues to string similarity documentation.
Clean up and organize todos.
Clean up testing code in several files.
@Lightning11wins
Author

Centrallix PR.

@Lightning11wins Lightning11wins changed the title Dups Update Duplicate Detection Nov 14, 2025
Israel and others added 8 commits November 17, 2025 11:09
Rearchitect dupe acquisition queries, inlining them in update.qy.
Add comments to get/<field>.qy files to explain how the data is marshalled.
Add print statements to update.qy which help with debugging.
Add a comment to update.qy explaining the strategies used.
Add code to compute individual-field similarities for concat dups, improving reasons displayed to the DB Admin.
Rename globals.qy to cluster_params.qy.
Add last updated date.
Add handling for edge case: missing dup reason.
Add handling for some nondup edge cases.
Organize confusing joins.
Add some friendly flair when all dups are resolved.
3 participants