Update Duplicate Detection #77
Open
Lightning11wins wants to merge 42 commits into master from dups
Changes from 27 commits
Commits
5f2e901
Checkpoint: Switching to DM UI project.
994e99f
Checkpoint: Switching to DM project.
ea6430f
Checkpoing: Switching to DM project.
cf0dbb5
Finish implementing major features for the cluster driver.
a861fb4
Upgrade memory handling in the cluster driver.
b4634f3
Begin adding query files to search for duplicates.
63a4dc2
Add warning for providing an invalid parameter.
22e55a3
Merge branch 'master' into dups
4b656a4
Improve exp_functions() to use central schema verification.
fa28afa
Add ClusterDriverRequirements (forgot to commit them before).
81a1d2f
Clean up unintended usage of glyph.h
e624d40
Attempt to reduce issues from ambiguously signed chars.
b0e000b
All tests now pass.
0874365
Re-apply reduced weight for duplicate pairs (temporarily turned off l…
01d918a
Clean up.
42a65f1
Update licences.
b281037
Clean up.
ee0bca7
Add "show_less" option to the cache method (skips printing uncomputed…
0c9eb2c
Update cluster library to use dynamic memory for any data over a coup…
394764e
Remove necessary requests for the driver name in objQueryFetch().
9b8cc19
Fix bugs that caused regressions after the updates to the cluster lib…
17156b7
Fix an invalid free (nmFree used instead of nmSysFree()).
648e30a
Merge branch 'master' into dups
29640a1
Minor improvements and clean up.
0fa62d3
Correct minor mistakes.
d3b571c
Merge branch 'master' into dups
06bae81
Implement a more extendable schema verification system.
13fd4b7
Replace old schema verification with the new system.
e83c15f
Expand the new schema verification system with extra data validation …
070cfe3
Clean up, bug fixes, and naming convention updates.
8795aaf
Add tests for log and power functions.
2e948d8
Add exp_fn_i_get_number().
4c347be
Add exp_fn_i_do_math() to bring the power of schema verification to l…
d177522
Minor clean up.
7b49a5b
Address Greg's comments
e9c10a5
Merge branch 'exp-schema' into dups
b6abca7
Finish exp_functions.c work.
8c86b5f
Organize docs.
63fa5ba
Fix wrong stAddValue() info caused by reading old code.
d0d4f54
Clean up stale TOODs.
3b86627
Fix more styling mistakes.
6b83c67
Fix indentation mistakes (thanks Centrallix Indent extension).
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -62,3 +62,4 @@ perf.data.old | ||
| .idea/ | ||
| .vscode/ | ||
| centrallix-os/tmp/* | ||
| centrallix-os/datasets/ | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,137 @@ | ||
| #ifndef CLUSTERS_H | ||
| #define CLUSTERS_H | ||
|
|
||
| /************************************************************************/ | ||
| /* Centrallix Application Server System */ | ||
| /* Centrallix Core */ | ||
| /* */ | ||
| /* Copyright (C) 1998-2012 LightSys Technology Services, Inc. */ | ||
| /* */ | ||
| /* This program is free software; you can redistribute it and/or modify */ | ||
| /* it under the terms of the GNU General Public License as published by */ | ||
| /* the Free Software Foundation; either version 2 of the License, or */ | ||
| /* (at your option) any later version. */ | ||
| /* */ | ||
| /* This program is distributed in the hope that it will be useful, */ | ||
| /* but WITHOUT ANY WARRANTY; without even the implied warranty of */ | ||
| /* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the */ | ||
| /* GNU General Public License for more details. */ | ||
| /* */ | ||
| /* You should have received a copy of the GNU General Public License */ | ||
| /* along with this program; if not, write to the Free Software */ | ||
| /* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA */ | ||
| /* 02111-1307 USA */ | ||
| /* */ | ||
| /* A copy of the GNU General Public License has been included in this */ | ||
| /* distribution in the file "COPYING". */ | ||
| /* */ | ||
| /* Module: lib_cluster.c, lib_cluster.h */ | ||
| /* Author: Israel Fuller */ | ||
| /* Creation: September 29, 2025 */ | ||
| /* Description: Clustering library used to cluster and search data with */ | ||
| /* cosine similarity and Levenshtein similarity (aka. edit */ | ||
| /* distance). Used by the "clustering driver". */ | ||
| /* For more information on how to use this library, see */ | ||
| /* string-similarity.md in the centrallix-sysdoc folder. */ | ||
| /************************************************************************/ | ||
|
|
||
| #include <stdlib.h> | ||
| #include <stdbool.h> | ||
|
|
||
| #ifdef CXLIB_INTERNAL | ||
| #include "xarray.h" | ||
| #else | ||
| #include "cxlib/xarray.h" | ||
| #endif | ||
|
|
||
| /*** 2147483629 is a prime number just below the signed 32-bit int max | ||
| *** (2147483647). Using this value ensures that the longest run of 0s | ||
| *** will not cause an int underflow with the current encoding scheme. | ||
| *** | ||
| *** Unfortunately, we can't use a number this large yet because the | ||
| *** kmeans algorithm creates densely allocated centroids with | ||
| *** `CA_NUM_DIMS` dimensions, so a large number causes it to fail. | ||
| ***/ | ||
| #define CA_NUM_DIMS 251 //2147483629 /* aka. The vector table size. */ | ||
|
|
||
| /// LINK ../../centrallix-sysdoc/string_comparison.md#cosine_charsets | ||
| /** The character used to create a pair with the first and last characters of a string. **/ | ||
| #define CA_BOUNDARY_CHAR (unsigned char)('a' - 1) | ||
|
|
||
| /** Types. **/ | ||
| typedef int* pVector; /* Sparse vector. */ | ||
| typedef double* pCentroid; /* Dense centroid. */ | ||
| #define pCentroidSize (CA_NUM_DIMS * sizeof(double)) | ||
|
|
||
| /** Duplicate information. **/ | ||
| typedef struct | ||
| { | ||
| void* key1; | ||
| void* key2; | ||
| double similarity; | ||
| } | ||
| Dup, *pDup; | ||
|
|
||
| /** Registering all defined types for debugging. **/ | ||
| #define ca_init() \ | ||
| nmRegister(sizeof(pVector), "pVector"); \ | ||
| nmRegister(sizeof(pCentroid), "pCentroid"); \ | ||
| nmRegister(pCentroidSize, "Centroid"); \ | ||
| nmRegister(sizeof(Dup), "Dup") | ||
|
|
||
| /** Edit distance function. **/ | ||
| int ca_edit_dist(const char* str1, const char* str2, const size_t str1_length, const size_t str2_length); | ||
|
|
||
| /** Vector functions. **/ | ||
| pVector ca_build_vector(const char* str); | ||
| unsigned int ca_sparse_len(const pVector vector); | ||
| void ca_print_vector(const pVector vector); | ||
| void ca_free_vector(pVector sparse_vector); | ||
|
|
||
| /** Kmeans function. **/ | ||
| int ca_kmeans( | ||
| pVector* vectors, | ||
| const unsigned int num_vectors, | ||
| const unsigned int num_clusters, | ||
| const unsigned int max_iter, | ||
| const double min_improvement, | ||
| unsigned int* labels, | ||
| double* vector_sims); | ||
|
|
||
| /** Vector helper macros. **/ | ||
| #define ca_is_empty(vector) ((vector)[0] == -CA_NUM_DIMS) | ||
| #define ca_has_no_pairs(vector) \ | ||
| ({ \ | ||
| __typeof__ (vector) _v = (vector); \ | ||
| _v[0] == -172 && _v[1] == 11 && _v[2] == -78; \ | ||
| }) | ||
|
|
||
| /** Comparison functions (see ca_search()). **/ | ||
| double ca_cos_compare(void* v1, void* v2); | ||
| double ca_lev_compare(void* str1, void* str2); | ||
| bool ca_eql(pVector v1, pVector v2); | ||
|
|
||
| /** Similarity search functions. **/ | ||
| void* ca_most_similar( | ||
| void* target, | ||
| void** data, | ||
| const unsigned int num_data, | ||
| const double (*similarity)(void*, void*), | ||
| const double threshold); | ||
| pXArray ca_sliding_search( | ||
| void** data, | ||
| const unsigned int num_data, | ||
| const unsigned int window_size, | ||
| const double (*similarity)(void*, void*), | ||
| const double dupe_threshold, | ||
| void** maybe_keys, | ||
| pXArray dups); | ||
| pXArray ca_complete_search( | ||
| void** data, | ||
| const unsigned int num_data, | ||
| const double (*similarity)(void*, void*), | ||
| const double dupe_threshold, | ||
| void** maybe_keys, | ||
| pXArray dups); | ||
|
|
||
| #endif /* End of .h file. */ | ||
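The header above only declares `ca_edit_dist()`, `ca_lev_compare()`, and `ca_most_similar()`; their implementations are not part of this diff view. As a rough illustration of how those three pieces could fit together, here is a minimal sketch. Every `sketch_*` name is hypothetical, and the behavior shown (two-row Levenshtein distance, similarity normalized by the longer string, a linear scan that returns `NULL` when nothing scores strictly above the threshold) is an assumption based on the declared signatures, not the PR's actual code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ca_edit_dist(): classic two-row Levenshtein
 * distance. Returns -1 if memory allocation fails (an assumption). */
int sketch_edit_dist(const char* str1, const char* str2, const size_t str1_length, const size_t str2_length)
    {
    size_t* prev = malloc((str2_length + 1) * sizeof(size_t));
    size_t* curr = malloc((str2_length + 1) * sizeof(size_t));
    if (prev == NULL || curr == NULL) { free(prev); free(curr); return -1; }

    /* Distance from the empty prefix of str1 to each prefix of str2. */
    for (size_t j = 0; j <= str2_length; j++) prev[j] = j;

    for (size_t i = 1; i <= str1_length; i++)
        {
        curr[0] = i;
        for (size_t j = 1; j <= str2_length; j++)
            {
            const size_t cost = (str1[i - 1] == str2[j - 1]) ? 0 : 1;
            size_t best = prev[j - 1] + cost;                    /* Substitution. */
            if (prev[j] + 1 < best) best = prev[j] + 1;          /* Deletion. */
            if (curr[j - 1] + 1 < best) best = curr[j - 1] + 1;  /* Insertion. */
            curr[j] = best;
            }
        size_t* tmp = prev; prev = curr; curr = tmp;
        }

    const int result = (int)prev[str2_length];
    free(prev);
    free(curr);
    return result;
    }

/* Hypothetical stand-in for ca_lev_compare(): maps edit distance to a
 * similarity in [0, 1], or -1.0 on allocation failure. */
double sketch_lev_compare(void* str1, void* str2)
    {
    const char* a = (const char*)str1;
    const char* b = (const char*)str2;
    const size_t la = strlen(a), lb = strlen(b);
    const size_t longest = (la > lb) ? la : lb;
    if (longest == 0) return 1.0;
    const int dist = sketch_edit_dist(a, b, la, lb);
    if (dist < 0) return -1.0;
    return 1.0 - (double)dist / (double)longest;
    }

/* Hypothetical stand-in for ca_most_similar(): linear scan that returns
 * the entry scoring strictly above the threshold with the highest
 * similarity, or NULL if no entry qualifies. */
void* sketch_most_similar(void* target, void** data, const unsigned int num_data,
    double (*similarity)(void*, void*), const double threshold)
    {
    void* best = NULL;
    double best_sim = threshold;
    for (unsigned int i = 0u; i < num_data; i++)
        {
        const double sim = similarity(target, data[i]);
        if (sim > best_sim) { best_sim = sim; best = data[i]; }
        }
    return best;
    }
```

With this sketch, comparing "kitten" against "sitting" gives a distance of 3 and a similarity of 1 - 3/7. Passing the comparison function as a pointer mirrors the `similarity` parameter in the declarations above, which is presumably what lets the same search code work with either the cosine or the Levenshtein comparison.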
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,78 @@ | ||
| #ifndef GLYPH_H | ||
| #define GLYPH_H | ||
|
|
||
| /************************************************************************/ | ||
| /* Centrallix Application Server System */ | ||
| /* Centrallix Core */ | ||
| /* */ | ||
| /* Copyright (C) 1998-2012 LightSys Technology Services, Inc. */ | ||
| /* */ | ||
| /* This program is free software; you can redistribute it and/or modify */ | ||
| /* it under the terms of the GNU General Public License as published by */ | ||
| /* the Free Software Foundation; either version 2 of the License, or */ | ||
| /* (at your option) any later version. */ | ||
| /* */ | ||
| /* This program is distributed in the hope that it will be useful, */ | ||
| /* but WITHOUT ANY WARRANTY; without even the implied warranty of */ | ||
| /* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the */ | ||
| /* GNU General Public License for more details. */ | ||
| /* */ | ||
| /* You should have received a copy of the GNU General Public License */ | ||
| /* along with this program; if not, write to the Free Software */ | ||
| /* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA */ | ||
| /* 02111-1307 USA */ | ||
| /* */ | ||
| /* A copy of the GNU General Public License has been included in this */ | ||
| /* distribution in the file "COPYING". */ | ||
| /* */ | ||
| /* Module: glyph.h */ | ||
| /* Author: Israel Fuller */ | ||
| /* Creation: October 27, 2025 */ | ||
| /* Description: A simple debug visualizer to make pretty patterns in */ | ||
| /* the developer's terminal, which can be surprisingly */ | ||
| /* useful for debugging algorithms. */ | ||
| /************************************************************************/ | ||
|
|
||
| #include <stdlib.h> | ||
|
|
||
| /** Uncomment to activate glyphs. **/ | ||
| /** Should not be enabled in production code on the master branch. **/ | ||
| // #define ENABLE_GLYPHS | ||
|
|
||
| #ifdef ENABLE_GLYPHS | ||
| #define glyph_print(s) printf("%s", s); | ||
| /*** Initialize a simple debug visualizer to make pretty patterns in the | ||
| *** developer's terminal. Great for when you need to run a long task and | ||
| *** want a super simple way to make sure it's still working. | ||
| *** | ||
| *** @attention - Relies on storing data in variables in scope, so calling | ||
| *** glyph() requires a call to glyph_init() previously in the same scope. | ||
| *** | ||
| *** @param name The symbol name of the visualizer. | ||
| *** @param str The string printed for the visualization. | ||
| *** @param interval The number of invocations of glyph() required to print. | ||
| *** @param flush Whether to flush on output. | ||
| ***/ | ||
| #define glyph_init(name, str, interval, flush) \ | ||
| const char* vis_##name##_str = str; \ | ||
| const unsigned int vis_##name##_interval = interval; \ | ||
| const bool vis_##name##_flush = flush; \ | ||
| unsigned int vis_##name##_i = 0u; | ||
|
|
||
| /*** Invoke a visualizer. | ||
| *** | ||
| *** @param name The name of the visualizer to invoke. | ||
| ***/ | ||
| #define glyph(name) \ | ||
| if (++vis_##name##_i % vis_##name##_interval == 0) \ | ||
| { \ | ||
| glyph_print(vis_##name##_str); \ | ||
| if (vis_##name##_flush) fflush(stdout); \ | ||
| } | ||
| #else | ||
| #define glyph_print(str) | ||
| #define glyph_init(name, str, interval, flush) | ||
| #define glyph(name) | ||
| #endif | ||
|
|
||
| #endif /* End of .h file. */ |
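A short usage sketch may help show how `glyph_init()` and `glyph()` are meant to be paired in one scope. The macros below are copied from the `ENABLE_GLYPHS` branch of the header so the example compiles on its own; `sketch_long_task()` and the name `progress` are hypothetical. The sketch includes `<stdio.h>` and `<stdbool.h>` itself because the macros expand to `printf()`, `fflush()`, and `bool`, which `<stdlib.h>` alone does not provide.

```c
#include <stdbool.h>
#include <stdio.h>

/* Macros copied from the ENABLE_GLYPHS branch of glyph.h so that this
 * sketch compiles standalone. */
#define glyph_print(s) printf("%s", s);
#define glyph_init(name, str, interval, flush) \
    const char* vis_##name##_str = str; \
    const unsigned int vis_##name##_interval = interval; \
    const bool vis_##name##_flush = flush; \
    unsigned int vis_##name##_i = 0u;
#define glyph(name) \
    if (++vis_##name##_i % vis_##name##_interval == 0) \
        { \
        glyph_print(vis_##name##_str); \
        if (vis_##name##_flush) fflush(stdout); \
        }

/* Hypothetical long-running task: prints one "." per 100 iterations
 * and returns the number of iterations completed. */
unsigned int sketch_long_task(const unsigned int steps)
    {
    unsigned int done = 0u;

    /* Declares the counter and settings used by glyph(progress) below. */
    glyph_init(progress, ".", 100u, true)

    for (unsigned int i = 0u; i < steps; i++)
        {
        done++;  /* Stand-in for real work. */
        glyph(progress)
        }

    printf("\n");
    return done;
    }
```

Because `glyph()` expands to a bare `if` block, it is safest to call it inside braces when it sits in the body of another `if`/`else`, so that a following `else` does not bind to the macro's `if`.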