Conversation
- Config use_fuzzy_match and fuzzy_match_threshold
|
@iamsims please do follow the branch naming convention we have such as Thanks. |
akd/tools/source_validator.py
Outdated
|
|
||
| # if ( | ||
| # whitelisted_title in source_title | ||
| # or source_title in whitelisted_title | ||
| # ): | ||
| # # Check if it's a meaningful match (not just common words) | ||
| # if len(whitelisted_title) > 10 or len(source_title) > 10: | ||
| # return True, category_name, 0.8 |
There was a problem hiding this comment.
I think this remove older logic no?
Instead of having destructive change, can we have constructive change. That is: first check if fuzzy is enabled. If not enabled, then use the older logic. Don't replace current logic because not sure fuzzy will work 100% of the time.
I'd suggest something like
if self.config.use_fuzzy_match:
....<your code>
...return
continue with whatever logic we have previously. This should fix this comment.
akd/tools/source_validator.py
Outdated
|
|
||
|
|
||
| if self.config.use_fuzzy_match: | ||
| fuzzy_score_set = token_set_ratio(source_title, whitelisted_title) |
There was a problem hiding this comment.
Can we have the fuzzy match function name as configurable? That is fuzzy_fn_name or something and get the function based on that name. It will give more configurability on what fuzzy matching algo to apply
|
@iamsims ALso let's add source validation test cases as well Thanks. |
- Configure the type of fuzzy match function - Fallback to the original algorithm for matching if fuzzy match is disabled
|
@iamsims What's the test coverage now? Could you run |
akd/tools/source_validator.py
Outdated
| token_set = "token_set" | ||
| token_sort = "token_sort" | ||
| ratio = "ratio" |
There was a problem hiding this comment.
Let's have enum all caps lock like TOKEN_SET etc....
TOKEN_SET = "token_set"
...
akd/tools/source_validator.py
Outdated
| scorer_map = { | ||
| "token_set": token_set_ratio, | ||
| "token_sort": token_sort_ratio, | ||
| "ratio": ratio, | ||
| } |
There was a problem hiding this comment.
maybe this could be class-level attribute?
class SourceValidator(...):
...
_scorer_map = dict(token_set = token_set_ratio, ...)
...
def __init__(self, ...):
...- Change the cases of enum constants to CAPS - Make class variable of scorer_map instead of function variable
|
@iamsims is this PR still relevant? |
Fuzzy match uses fuzzy token set ratio, in default fuzzy logic is set to false and the fuzzy threshold for match is 87.