perf: Optimize searching tags with DB indexes #1129

TheBobBobs · 2025-09-14T22:05:18Z

Summary

Create indexes on columns commonly used in queries:
- Tag.name, Tag.shorthand: finding tags by name
- TagParent.child_id: fetching tag hierarchies
- TagEntry.entry_id: fetching entries with their tags
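
A minimal sketch of how one might check that such indexes are picked up, using the standard-library sqlite3 module. The table and index names here are hypothetical stand-ins for whatever the project's SQLAlchemy models generate, and the lower(name) query is only an illustration of one common way a lookup can miss a plain column index, not a description of the query this PR changed.

```python
import sqlite3

# Hypothetical table and index names; the real schema comes from the
# project's SQLAlchemy models and may differ.
SCHEMA = """
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL, shorthand TEXT);
CREATE TABLE tag_parents (parent_id INTEGER NOT NULL, child_id INTEGER NOT NULL);
CREATE TABLE tag_entries (tag_id INTEGER NOT NULL, entry_id INTEGER NOT NULL);
CREATE INDEX ix_tags_name ON tags (name);
CREATE INDEX ix_tags_shorthand ON tags (shorthand);
CREATE INDEX ix_tag_parents_child_id ON tag_parents (child_id);
CREATE INDEX ix_tag_entries_entry_id ON tag_entries (entry_id);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the demo schema and indexes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def query_plan(conn: sqlite3.Connection, sql: str) -> str:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


conn = build_demo_db()

# Equality lookups on indexed columns are planned as index searches.
by_name = query_plan(conn, "SELECT id FROM tags WHERE name = 'cat'")
by_child = query_plan(conn, "SELECT parent_id FROM tag_parents WHERE child_id = 1")
by_entry = query_plan(conn, "SELECT tag_id FROM tag_entries WHERE entry_id = 1")

# Wrapping the column in a function means the plain index on name can no
# longer be used for a lookup, so the planner falls back to a scan.
by_lowered = query_plan(conn, "SELECT id FROM tags WHERE lower(name) = 'cat'")

for label, plan in [("name", by_name), ("child_id", by_child),
                    ("entry_id", by_entry), ("lower(name)", by_lowered)]:
    print(f"{label}: {plan}")
```

The exact plan text varies between SQLite versions, so it is safer to look for SEARCH versus SCAN and the index name than to compare whole strings.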

Change the query in Library.search_tags so that it uses the indexes above.
Sort results before applying the limit so that tags that should be prioritized are not truncated.
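
To illustrate why the order of sorting and limiting matters, here is a small pure-Python sketch. The sort key (shorter names first, then alphabetical) is an assumption made for the example and is not taken from the PR's code.

```python
def sort_key(name: str) -> tuple[int, str]:
    """Hypothetical priority: shorter names first, ties broken alphabetically."""
    return (len(name), name.lower())


def limit_then_sort(names: list[str], limit: int) -> list[str]:
    """Truncate first: only the first `limit` rows are ever considered."""
    return sorted(names[:limit], key=sort_key)


def sort_then_limit(names: list[str], limit: int) -> list[str]:
    """Sort first: the best `limit` rows overall survive the truncation."""
    return sorted(names, key=sort_key)[:limit]


names = ["zebra", "mango", "apple", "kiwi"]

print(limit_then_sort(names, 2))  # "kiwi" and "apple" are lost to truncation
print(sort_then_limit(names, 2))  # the two highest-priority names are kept
```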

Tasks Completed

Platforms Tested:
- Linux x86
Tested For:
- Basic functionality
- PyInstaller executable

Computerdores · 2025-09-15T13:07:13Z

src/tagstudio/core/library/alchemy/library.py

+            if limit > 0 and not name:
+                query = query.limit(limit).order_by(func.lower(Tag.name))


This causes the sorting to happen after truncating the results, which differs from the behaviour when searching, no?

Yes, this was not sorting by len(text) before truncating. The previous implementation only sorted priority results by len, so I've updated sort_key to do that. This makes the order_by statement and sort_key produce the same results when no query is provided.
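
The actual sort_key is not shown in this thread. As a rough sketch of the behaviour described in the reply, assuming "priority" means the name starts with the query, a key function could look like the hypothetical make_sort_key below: priority results are ordered by length, everything else by lowercased name, so an empty query collapses to the same ordering as order_by(func.lower(Tag.name)).

```python
def make_sort_key(query: str):
    """Build a hypothetical sort key for tag names matching `query`."""
    q = query.strip().lower()

    def sort_key(text: str) -> tuple[int, int, str]:
        lowered = text.lower()
        # Assumption: a result is "priority" when it starts with the query.
        if q and lowered.startswith(q):
            return (0, len(lowered), lowered)
        # Non-priority results are ordered by lowercased name only.
        return (1, 0, lowered)

    return sort_key


names = ["Cat", "Scarf", "Catalog", "cab"]

with_query = sorted(names, key=make_sort_key("ca"))
without_query = sorted(names, key=make_sort_key(""))

print(with_query)
print(without_query)
```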

Computerdores · 2025-09-15T13:10:20Z

src/tagstudio/core/library/alchemy/library.py

+            tags.sort(key=lambda t: sort_key(t[1]))
+            seen_ids = set()
+            tag_ids = []
+            for row in tags:
+                id = row[0]
+                if id in seen_ids:
+                    continue
+                tag_ids.append(id)
+                seen_ids.add(id)


Couldn't this be written as the following?

tags = dict(tags)
tag_ids = sorted(tags.keys(), key=lambda t: sort_key(tags[t]))
del tags  # not sure if this makes a diff, but `tags` could become quite large and triggering gc on it sooner can't hurt

this is both simpler code wise and should be faster by only sorting the deduplicated list

This is so it will use the order from Tag.name or TagAlias.name depending on which comes first for each tag.

in that case the following should work since dict.keys() maintains insertion order and dict deduplicates by key:

tags.sort(key=lambda t: sort_key(t[1]))
tag_ids = dict(tags).keys()  # get the deduplicated list of ids

other than that I think this is good to merge
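
The difference between the variants discussed in this thread can be sketched as follows. The rows are made-up (tag_id, matched_text) pairs, where the text could be a tag name or an alias, and sort_key is a hypothetical stand-in for the real one.

```python
def sort_key(text: str) -> tuple[int, str]:
    """Hypothetical key: shorter text first, ties broken alphabetically."""
    return (len(text), text.lower())


def dedupe_loop(rows: list[tuple[int, str]]) -> list[int]:
    """Original approach: sort all rows, keep the first occurrence of each id."""
    seen_ids: set[int] = set()
    tag_ids: list[int] = []
    for tag_id, _text in sorted(rows, key=lambda r: sort_key(r[1])):
        if tag_id in seen_ids:
            continue
        tag_ids.append(tag_id)
        seen_ids.add(tag_id)
    return tag_ids


def dedupe_sorted_dict(rows: list[tuple[int, str]]) -> list[int]:
    """Final suggestion: sort first, then rely on dict insertion order."""
    return list(dict(sorted(rows, key=lambda r: sort_key(r[1]))))


def dedupe_dict_then_sort(rows: list[tuple[int, str]]) -> list[int]:
    """First suggestion: dedupe first, which keeps the LAST text seen per id."""
    by_id = dict(rows)
    return sorted(by_id, key=lambda tag_id: sort_key(by_id[tag_id]))


# Tag 3 matches twice: once via a short text and once via a long one.
rows = [(3, "c"), (1, "ab"), (3, "caterpillar"), (2, "cab")]

print(dedupe_loop(rows))
print(dedupe_sorted_dict(rows))
print(dedupe_dict_then_sort(rows))
```

Sorting before deduplicating ranks each tag by its best-matching text, which is the behaviour the author wanted to preserve. Deduplicating first ranks tag 3 by whichever text happened to come last.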

Computerdores · 2025-09-15T13:29:13Z

src/tagstudio/qt/mixed/tag_search.py

+        direct_tags, ancestor_tags = self.lib.search_tags(name=query, limit=tag_limit)

-        if query and query.strip():
-            for tag in raw_results:
-                if tag.name.lower().startswith(query_lower):
-                    priority_results.add(tag)
+        all_results = [t for t in direct_tags if t.id not in self.exclude]
+        for tag in ancestor_tags:
+            if tag.id not in self.exclude:
+                all_results.append(tag)


This code previously handled self.exclude being None, is there a reason you removed that?

Its type is list[int] and I couldn't find any code that could cause it to be None.

good reason ^^, I missed that

Computerdores · 2025-09-15T13:34:09Z

src/tagstudio/qt/mixed/tag_search.py

+        all_results = [t for t in direct_tags if t.id not in self.exclude]
+        for tag in ancestor_tags:
+            if tag.id not in self.exclude:
+                all_results.append(tag)


Suggested change

-        all_results = [t for t in direct_tags if t.id not in self.exclude]
-        for tag in ancestor_tags:
-            if tag.id not in self.exclude:
-                all_results.append(tag)
+        all_results = [t for t in direct_tags if t.id not in self.exclude]
+        all_results += [t for t in ancestor_tags if t.id not in self.exclude]

Ended up doing this to avoid creating extra lists.
all_results.extend(t for t in ancestor_tags if t.id not in self.exclude)
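
A self-contained sketch of the extend-with-a-generator form, using a hypothetical Tag dataclass and a standalone merge_results function in place of the real widget method:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """Hypothetical stand-in for the project's Tag model."""
    id: int
    name: str


def merge_results(direct_tags: list[Tag], ancestor_tags: list[Tag],
                  exclude: list[int]) -> list[Tag]:
    """Combine direct and ancestor matches, skipping excluded tag ids."""
    all_results = [t for t in direct_tags if t.id not in exclude]
    # A generator expression feeds extend() directly, so no intermediate
    # list is built for the ancestor tags.
    all_results.extend(t for t in ancestor_tags if t.id not in exclude)
    return all_results


direct = [Tag(1, "a"), Tag(2, "b")]
ancestors = [Tag(3, "c"), Tag(4, "d")]

merged = merge_results(direct, ancestors, exclude=[2, 4])
print(merged)
```

If exclude can grow large, converting it to a set before filtering would make each membership test constant time. Whether that matters here depends on typical list sizes, which this thread does not state.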

Computerdores · 2025-10-07T18:28:24Z

Sorry for the delay on the review, I had to work on my thesis ^^

TheBobBobs added 5 commits September 14, 2025 13:49

perf: create sqlite indexes for common columns

ea288e9

perf: optimize Library.search_tags

1d7a267

fix(tag_search): do ordering before applying limit

8ce773b

tag_search: order shorter tag names first

a1dd8d4

update tests

30d9403

TheBobBobs force-pushed the perf/tag_search branch from 01bb83b to 30d9403 September 14, 2025 22:12

Computerdores reviewed Sep 15, 2025

CyanVoxel added the TagStudio: Library, TagStudio: Tags, TagStudio: Search, and Type: Performance labels Sep 15, 2025

CyanVoxel added this to TagStudio Development Sep 15, 2025

CyanVoxel moved this to 👀 In review in TagStudio Development Sep 15, 2025

TheBobBobs added 3 commits September 16, 2025 10:50

cleanup

06dc5e8

tag_search: use same sorting order when returning all tags

9ff3364

use dict for deduplicating tags

dba9e24

Computerdores approved these changes Oct 7, 2025

CyanVoxel added this to the Alpha v9.5.7 milestone Oct 8, 2025

3 participants

Computerdores Sep 16, 2025 •

edited

Loading