-
-
Notifications
You must be signed in to change notification settings - Fork 420
perf: Optimize searching tags with DB indexes #1129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
01bb83b to
30d9403
Compare
| if limit > 0 and not name: | ||
| query = query.limit(limit).order_by(func.lower(Tag.name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This causes the sorting to happen after truncating the results, which differs from the behaviour when searching, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this was not sorting by len(text) before truncating. The previous implementation only sorted priority results by len so I've updated sort_key to do that. Which will make this order_by statement and sort_key produce the same results when no query is provided.
| tags.sort(key=lambda t: sort_key(t[1])) | ||
| seen_ids = set() | ||
| tag_ids = [] | ||
| for row in tags: | ||
| id = row[0] | ||
| if id in seen_ids: | ||
| continue | ||
| tag_ids.append(id) | ||
| seen_ids.add(id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't this be written as the following?
tags = dict(tags)
tag_ids = sorted(tags.keys(), key=lambda t: sort_key(tags[t]))
del tags # not sure if this is makes a diff, but `tags` could become quite large and triggering gc on it sooner can't hurtthis is both simpler code wise and should be faster by only sorting the deduplicated list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so it will use the order from Tag.name or TagAlias.name depending on which comes first for each tag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in that case the following should work since dict.keys() maintains insertion order and dict deduplicates by key:
tags.sort(key=lambda t: sort_key(t[1]))
tag_ids = dict(tags).keys() # get the deduplicated list of idsThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other than that I think this is good to merge
src/tagstudio/qt/mixed/tag_search.py
Outdated
| direct_tags, ancestor_tags = self.lib.search_tags(name=query, limit=tag_limit) | ||
|
|
||
| if query and query.strip(): | ||
| for tag in raw_results: | ||
| if tag.name.lower().startswith(query_lower): | ||
| priority_results.add(tag) | ||
| all_results = [t for t in direct_tags if t.id not in self.exclude] | ||
| for tag in ancestor_tags: | ||
| if tag.id not in self.exclude: | ||
| all_results.append(tag) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code previously handled self.exclude being None, is there a reason you removed that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its type is list[int] and I couldn't find any code that could cause it to be None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good reason ^^, I missed that
src/tagstudio/qt/mixed/tag_search.py
Outdated
| all_results = [t for t in direct_tags if t.id not in self.exclude] | ||
| for tag in ancestor_tags: | ||
| if tag.id not in self.exclude: | ||
| all_results.append(tag) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| all_results = [t for t in direct_tags if t.id not in self.exclude] | |
| for tag in ancestor_tags: | |
| if tag.id not in self.exclude: | |
| all_results.append(tag) | |
| all_results = [t for t in direct_tags if t.id not in self.exclude] | |
| all_results += [t for t in ancestor_tags if t.id not in self.exclude] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ended up doing this to avoid creating extra lists.
all_results.extend(t for t in ancestor_tags if t.id not in self.exclude)
|
Sorry for the delay on the review had to work on my thesis ^^ |
Summary
Create indexes on columns commonly used in queries.
Tag.name, Tag.shorthandfinding tags by nameTagParent.child_idfetching tag hierarchiesTagEntry.entry_idfetching entries with their tagsChange query in
Library.search_tagsso it will use above indexes.Sort results before applying limit to prevent truncating tags that should be prioritized.
Tasks Completed