Add fuzzy matching by iamsims · Pull Request #103 · NASA-IMPACT/accelerated-discovery

iamsims · 2025-08-05T20:34:19Z

Config use_fuzzy_match and fuzzy_match_threshold

Fuzzy matching uses the fuzzy token set ratio. By default, fuzzy matching is disabled (use_fuzzy_match is false) and the threshold for a match (fuzzy_match_threshold) is 87.
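
A minimal sketch of the two options and how the threshold could be applied. The class name `SourceValidatorConfig` and the helper `is_fuzzy_match` are hypothetical, and `token_set_ratio` here is a simplified standard-library stand-in written only so the example runs without a third-party dependency; it is not the function the PR imports.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class SourceValidatorConfig:  # hypothetical class name
    use_fuzzy_match: bool = False      # fuzzy matching is off by default
    fuzzy_match_threshold: float = 87  # minimum score (0-100) that counts as a match


def _ratio(a: str, b: str) -> float:
    """Character-level similarity on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def token_set_ratio(a: str, b: str) -> float:
    """Stdlib stand-in for a token set ratio (not the library function)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    common = " ".join(sorted(tokens_a & tokens_b))
    # Shared tokens followed by the tokens unique to each side.
    full_a = " ".join([common, *sorted(tokens_a - tokens_b)]).strip()
    full_b = " ".join([common, *sorted(tokens_b - tokens_a)]).strip()
    return max(_ratio(common, full_a), _ratio(common, full_b), _ratio(full_a, full_b))


def is_fuzzy_match(source_title: str, whitelisted_title: str,
                   config: SourceValidatorConfig) -> bool:
    """True when the score reaches the configured threshold."""
    return token_set_ratio(source_title, whitelisted_title) >= config.fuzzy_match_threshold
```

Because the comparison works on sets of tokens, word order and repeated words do not affect the score in this sketch.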

NISH1001 · 2025-08-06T19:31:47Z

@iamsims please follow our branch naming convention, e.g. feature/, bugfix/, hotfix/, refactor/, etc.

Thanks.

NISH1001 · 2025-08-06T19:34:00Z

akd/tools/source_validator.py

+
+                 # if (
+                #     whitelisted_title in source_title
+                #     or source_title in whitelisted_title
+                # ):
+                #     # Check if it's a meaningful match (not just common words)
+                #     if len(whitelisted_title) > 10 or len(source_title) > 10:
+                #         return True, category_name, 0.8


I think this removes the older logic, no?

Instead of a destructive change, can we make a constructive one? That is: first check whether fuzzy matching is enabled. If it is not enabled, use the older logic. Don't replace the current logic, because we can't be sure fuzzy matching will work 100% of the time.

I'd suggest something like:

    if self.config.use_fuzzy_match:
        ...<your code>...
        return

and then continue with whatever logic we had previously. This should fix this comment.
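
The suggested shape can be sketched as below. The function name `match_title`, the stand-in `fuzzy_score`, the no-match return value `(False, None, 0.0)`, and the use of `score / 100` as the confidence are all assumptions made for illustration; only the substring fallback (including the length check and the 0.8 confidence) comes from the commented-out code in the diff.

```python
from difflib import SequenceMatcher
from types import SimpleNamespace


def fuzzy_score(a: str, b: str) -> float:
    """Stand-in scorer on a 0-100 scale; the PR calls token_set_ratio here."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def match_title(source_title, whitelisted_title, category_name, config):
    """Return (matched, category, confidence) for one whitelist entry."""
    if config.use_fuzzy_match:
        score = fuzzy_score(source_title, whitelisted_title)
        if score >= config.fuzzy_match_threshold:
            return True, category_name, score / 100.0  # assumed confidence
        return False, None, 0.0

    # Fuzzy matching disabled: the earlier substring logic stays in place.
    if whitelisted_title in source_title or source_title in whitelisted_title:
        # Check if it's a meaningful match (not just common words)
        if len(whitelisted_title) > 10 or len(source_title) > 10:
            return True, category_name, 0.8
    return False, None, 0.0


# Usage: the same call behaves differently depending on the flag.
legacy = SimpleNamespace(use_fuzzy_match=False, fuzzy_match_threshold=87)
fuzzy = SimpleNamespace(use_fuzzy_match=True, fuzzy_match_threshold=87)
print(match_title("journal of climate", "journal of climate letters", "Climate", legacy))
print(match_title("journal of climate", "journal of climate", "Climate", fuzzy))
```

Keeping the old branch reachable means that turning the flag off restores the previous behavior without a code change.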

NISH1001 · 2025-08-06T19:34:56Z

akd/tools/source_validator.py

+
+
+                if self.config.use_fuzzy_match:
+                    fuzzy_score_set = token_set_ratio(source_title, whitelisted_title)


Can we make the fuzzy match function configurable by name? That is, add fuzzy_fn_name or something similar and look up the function based on that name. It will give more control over which fuzzy matching algorithm is applied.
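
One way this lookup could work is a plain name-to-function registry. In the sketch below, `SCORERS` and `get_scorer` are hypothetical names, `fuzzy_fn_name` is the option name floated in the comment, and the two scorers are standard-library stand-ins rather than the functions the PR imports.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict


def ratio(a: str, b: str) -> float:
    """Character-level similarity on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def token_sort_ratio(a: str, b: str) -> float:
    """Compare the strings after lowercasing and sorting their tokens."""
    def normalize(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return ratio(normalize(a), normalize(b))


# Registry of the names a config value may take.
SCORERS: Dict[str, Callable[[str, str], float]] = {
    "ratio": ratio,
    "token_sort": token_sort_ratio,
}


def get_scorer(fuzzy_fn_name: str) -> Callable[[str, str], float]:
    """Resolve a configured name to its scoring function."""
    try:
        return SCORERS[fuzzy_fn_name]
    except KeyError:
        known = ", ".join(sorted(SCORERS))
        raise ValueError(
            f"unknown fuzzy_fn_name {fuzzy_fn_name!r}; expected one of: {known}"
        ) from None
```

Failing loudly on an unknown name surfaces a typo in the config at startup instead of silently falling back to a different algorithm.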

NISH1001 · 2025-08-06T19:36:38Z

@iamsims Also, let's add source validation test cases as well, in tests/tools/test_source_validator or something like that.

Thanks.

- Configure the type of fuzzy match function
- Fall back to the original matching algorithm if fuzzy match is disabled

NISH1001 · 2025-08-07T15:43:42Z

@iamsims What's the test coverage now?

Could you run python -m pytest --cov=akd --cov-report=term-missing tests/ and post the coverage results here? Thanks.

NISH1001 · 2025-08-07T15:44:31Z

akd/tools/source_validator.py

+    token_set = "token_set"
+    token_sort = "token_sort"
+    ratio = "ratio"


Let's make the enum members all caps, like TOKEN_SET etc.

TOKEN_SET = "token_set" ...

NISH1001 · 2025-08-07T15:46:24Z

akd/tools/source_validator.py

+        scorer_map = {
+            "token_set": token_set_ratio,
+            "token_sort": token_sort_ratio,
+            "ratio": ratio,
+        }


Maybe this could be a class-level attribute?

    class SourceValidator(...):
        ...
        _scorer_map = dict(token_set=token_set_ratio, ...)
        ...
        def __init__(self, ...): ...
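
A runnable sketch of the two requests from this review round, upper-case enum members and a class-level scorer map. The enum name `FuzzyScorer`, the constructor signature, and the `score` method are assumptions, and the three scorers are standard-library stand-ins rather than the functions the PR imports.

```python
from difflib import SequenceMatcher
from enum import Enum


def ratio(a: str, b: str) -> float:
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def token_sort_ratio(a: str, b: str) -> float:
    return ratio(" ".join(sorted(a.lower().split())),
                 " ".join(sorted(b.lower().split())))


def token_set_ratio(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    common = " ".join(sorted(ta & tb))
    full_a = " ".join([common, *sorted(ta - tb)]).strip()
    full_b = " ".join([common, *sorted(tb - ta)]).strip()
    return max(ratio(common, full_a), ratio(common, full_b), ratio(full_a, full_b))


class FuzzyScorer(str, Enum):  # hypothetical enum name
    TOKEN_SET = "token_set"
    TOKEN_SORT = "token_sort"
    RATIO = "ratio"


class SourceValidator:
    # Built once at class-definition time and shared by all instances.
    _scorer_map = {
        FuzzyScorer.TOKEN_SET: token_set_ratio,
        FuzzyScorer.TOKEN_SORT: token_sort_ratio,
        FuzzyScorer.RATIO: ratio,
    }

    def __init__(self, scorer: FuzzyScorer = FuzzyScorer.TOKEN_SET):
        # FuzzyScorer(...) also accepts the raw string value, e.g. "ratio".
        self._score = self._scorer_map[FuzzyScorer(scorer)]

    def score(self, source_title: str, whitelisted_title: str) -> float:
        return self._score(source_title, whitelisted_title)
```

Keying the map by enum member rather than by bare string means an unsupported value is rejected when the validator is constructed.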

- Change the case of the enum constants to CAPS
- Make scorer_map a class variable instead of a function variable

NISH1001 · 2025-09-03T20:10:55Z

@iamsims is this PR still relevant?

Add fuzzy matching

f8dfb68

NISH1001 requested changes Aug 6, 2025

iamsims added 2 commits August 6, 2025 21:43

Make change to address PR comments

0b6da10

Add unit tests for source validator

c670353

iamsims requested a review from NISH1001 August 7, 2025 02:45

NISH1001 requested changes Aug 7, 2025

Address the comments

aac0f52

iamsims wants to merge 4 commits into develop from configure-fuzzy-match
