@Lightning11wins
Contributor

@Lightning11wins Lightning11wins commented Nov 14, 2025

The duplicate detection project is ready for review, although even in the best case there are still a couple of things blocking it from being ready to merge.

I would appreciate a full review of all changes, as there's quite a lot here. That said, some areas may need special attention, so I've compiled a list of all 28 TODO: Greg comments below. (Note: some of my todos assume the reader understands the nearby context or has read the indicated source code.)

  • 4 TODO: Gregs in objdrv_cluster.c
  • 2 TODO: Gregs in OSDriver_authoring.md
  • 1 TODO: Greg in mtsession.md
  • 1 TODO: Greg in xarray.md
  • 1 TODO: Greg in xstring.md

Please let me know if you have any questions, comments, or concerns about my changes and design choices.

Israel added 10 commits October 13, 2025 09:53
Improve edge case logic in comparison functions.
Remove unregister driver function.
Clean up exp_functions.c.
Simplify dataqa_duplicates component in preparation for making it the boundary into our new duplicate system.
Add exp functions: sparse_eql(), ln(), and logn().
Fix bugs in comparison functions.
Make minor tweaks to objdrv_cluster.c.
Modify cluster files to use string keys.
Build vectors fully sparsely.
Add ca_fprint_vector().
Add snprint_llu().
Add exp_fn_trim().
Update exp_fn_cmp().
Organize exp function definitions by group.
Add statistics tracking to cluster driver.
Reduce minimum hint threshold.
Add array handling to ci_xaToTrimmedArray().
Update timer to handle multiple starts and stops properly.
Re-add Levenshtein to exp_functions.
Publish edit_dist() in the cluster library.
Fix mistakes in cluster driver function signatures.
Fix spelling mistakes.
Add detail to an error message in the lexer.
Remove unused .cluster files.
Clean up cluster-schema.cluster.
Clean up other unused junk.
Add known issues to string similarity documentation.
Clean up and organize todos.
Clean up testing code in several files.
@Lightning11wins
Contributor Author

Kardia PR.

@Lightning11wins
Contributor Author

I'd probably recommend rebasing these into one commit, but that's up to you.

Israel added 10 commits November 14, 2025 16:10
…ast commit).

Update tests to pass with this modification.
… caches).

Fix a formatting issue with the stat method.
Fix a missing include in the util.c library.
…le hundred bytes.

Add check_double() to handle functions that return NAN on failure.
Clean up.
…rary.

Round similarity results to avoid floating point errors.
Enable caching for memory allocated in get_cluster_size().
Rename edit_dist() to ca_edit_dist() to match format for public functions.
Rename print_diagnostics() to print_err().
Member

@gbeeley gbeeley left a comment

First pass here, covering all of the documentation and other parts of this PR except the clusters.c and objdrv_cluster.c files. Thanks!

@gbeeley
Member

gbeeley commented Dec 4, 2025

Ok, this is odd. GitHub crashed when I submitted this PR review. This initial review is for everything except the core of this PR (clusters.c and objdrv_cluster.c). So I'm not sure GitHub submitted this correctly.

Fix styling mistakes.
Finish docs in OSDriver_Authoring.md.
Add support for querying the driver node object.
Fix clusterOpenQuery() succeeding on objects that could not be queried, resulting in fetch failures.
Remove "date_created" and "date_computed" from the list of * attributes on cluster and search entries.
Rename TARGET_ROOT to TARGET_NODE.
Rename snprint_llu() to snprint_commas_llu().
Move double_metaphone.c into centrallix util.
Move TypeToStr() to obj_datatypes.c.
Move TypeFromStr() to obj_datatypes.c.
Remove exp_fn_trim() (temporarily).
Revert reorder of exp_function registrations to avoid merge conflicts.
Update tests to give clearer feedback.
Add GCC_Dependencies.md to document a list of dependencies on GCC features.
Add .cluster to Prefixes.md.
# Conflicts:
#	centrallix/expression/exp_functions.c
Add log() and trim() exp functions (with tests).
Add optional variables in schemas.
Fix styling for schema verification.
Fix log() being treated as a reserved word.
Move docs for newmalloc, xarray, xhash, xstring, mtsession, and mtask out of OSDriver_Authoring.md and into their own files.
Add the imported date to OSDriver_Authoring.md.
@Lightning11wins
Contributor Author

FYI: I just cleaned up my todos for you, but I also updated and reorganized the lists in each PR so that they only include todos in their respective branches.

@Lightning11wins
Contributor Author

Lightning11wins commented Dec 12, 2025

FYI: I am aware that most of my functions are missing the required final return; statement. I'll get to that next time I'm on this branch.

Just pushed a commit to fix this.
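
For anyone unfamiliar with the convention, here is a minimal sketch of what that kind of fix might look like, assuming the "required final return;" refers to ending void functions with an explicit `return;`. The function `count_nonzero()` is hypothetical and is not taken from this PR.

```c
#include <stddef.h>

/* Hypothetical helper (not from the PR): counts the non-zero entries
 * in an array and stores the result through out_count. */
void count_nonzero(const int* values, size_t n, size_t* out_count)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (values[i] != 0) count++;
    }
    *out_count = count;

    /* Explicit final return, even though the function returns void.
     * This is the statement the assumed convention asks for. */
    return;
}
```

The trailing `return;` does not change behavior in C; it only makes the exit point of the function explicit.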
