Improve short-query precision in Algolia search #741

Open
Flamki wants to merge 3 commits into precice:master from Flamki:issue-733-search-short-token-precision

Flamki commented Feb 22, 2026

Summary
Improve search precision for short queries (for example gsoc) without weakening XML reference search globally.

Problem
Short queries can produce false positives from XML/code snippets because typo tolerance combines with prefix matching.

Change
In both Algolia client entry points (_includes/algolia.html, js/algolia-search.js):

  • Detect whether the query contains any short alphanumeric token (length 3-5), including multi-word queries.
  • Apply stricter typo thresholds only for those queries:
    • minWordSizefor1Typo = 5
    • minWordSizefor2Typos = 9
  • Keep default thresholds otherwise (4 / 8).
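
A minimal sketch of the query-time selection described in this list. The helper names (hasShortToken, typoParamsFor), the whitespace tokenization, and the regular expression are assumptions for illustration and are not taken from the PR diff; only minWordSizefor1Typo and minWordSizefor2Typos are actual Algolia search parameters.

```javascript
// Hypothetical helper: true if any whitespace-separated token is
// purely alphanumeric and 3 to 5 characters long.
function hasShortToken(query) {
  return query
    .trim()
    .split(/\s+/)
    .some((token) => /^[a-z0-9]{3,5}$/i.test(token));
}

// Hypothetical helper: choose typo thresholds for this query only.
// Stricter values (5 / 9) apply when a short token is present,
// otherwise the defaults (4 / 8) are kept.
function typoParamsFor(query) {
  return hasShortToken(query)
    ? { minWordSizefor1Typo: 5, minWordSizefor2Typos: 9 }
    : { minWordSizefor1Typo: 4, minWordSizefor2Typos: 8 };
}
```

The returned object would presumably be merged into the search request parameters in each of the two entry points; how that merge looks depends on the client code in those files.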

Why this approach

  • Query-time and scoped: no global index/config changes.
  • Keeps XML search behavior for normal XML queries (for example sockets).
  • Handles both single-word and multi-word cases (for example gsoc projects).

Validation

  • pre-commit run --files _includes/algolia.html js/algolia-search.js
  • docker run --rm -v "${PWD}:/srv/jekyll" -w /srv/jekyll jekyll/jekyll:4 bash -lc "bundle install && bundle exec jekyll build"

Closes #733

@MuhammadAashirAslam
Contributor

Hi @Flamki, thanks for submitting the PR! I just have one question about the scope of the fix.

The getStrictShortTokens function only activates for single-word alphanumeric queries of 3 to 5 characters. What happens in a multi-word query like "gsoc projects", where the short token is part of a longer search?

Since tokens.length !== 1 returns early with an empty array, disableTypoToleranceOnWords would receive [] and the typo tolerance would remain at default — meaning "gsoc" could still match "sockets" in that context.
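
The early return described here can be sketched as follows. This is a reconstruction from the description in this comment, not the code from the PR: the tokenization and the regular expression are assumptions, while getStrictShortTokens and disableTypoToleranceOnWords are the names used above.

```javascript
// Assumed reconstruction of the earlier helper: only a single-word
// query can yield strict tokens.
function getStrictShortTokens(query) {
  const tokens = query.trim().split(/\s+/).filter(Boolean);
  if (tokens.length !== 1) {
    return []; // multi-word queries bail out early
  }
  // Keep the token only if it is alphanumeric and 3 to 5 characters long.
  return /^[a-z0-9]{3,5}$/i.test(tokens[0]) ? [tokens[0]] : [];
}

// Single-word query: "gsoc" is excluded from typo tolerance.
const single = { disableTypoToleranceOnWords: getStrictShortTokens("gsoc") };

// Multi-word query: the list is empty, so typo tolerance stays at default.
const multi = { disableTypoToleranceOnWords: getStrictShortTokens("gsoc projects") };
```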

Was this intentional to keep the fix narrow?

(Also, could you attach a screenshot of your solution working locally?)

Also, could you take a look at my PR #744?

@Flamki
Author

Flamki commented Feb 23, 2026

Thanks for the careful review, great catch.

You were right: the earlier helper was too narrow for multi-word input. I updated the PR so short tokens are handled even when part of a longer query (for example gsoc projects).

Follow-up changes in this PR:

  • Detect short alphanumeric tokens (3-5 chars) anywhere in the query, not only single-word queries.
  • Use an Algolia-compatible query-time approach:
    • if a short token exists: minWordSizefor1Typo=5, minWordSizefor2Typos=9
    • otherwise: defaults 4 / 8

This keeps the scope narrow (only short-token queries), avoids global index/config changes, and keeps XML search behavior for normal XML queries.

Validation:

  • pre-commit run --files _includes/algolia.html js/algolia-search.js
  • docker run --rm -v "${PWD}:/srv/jekyll" -w /srv/jekyll jekyll/jekyll:4 bash -lc "bundle install && bundle exec jekyll build"

Local screenshots:

  1. gsoc (no noisy XML matches)

[screenshot: local-gsoc]

  2. sockets (XML results still available)

[screenshot: local-sockets]

@Flamki Flamki force-pushed the issue-733-search-short-token-precision branch from d33a06c to 6629092 Compare February 23, 2026 06:08
@Flamki
Author

Flamki commented Feb 23, 2026

Also yes, I saw #744. The main difference is scope: #744 applies typo-threshold changes globally, while this PR applies stricter thresholds only when a short token is present in the query.

@MuhammadAashirAslam
Contributor

Hey @Flamki, thanks for iterating on this! One thing I wanted to point out about the updated approach:

minWordSizefor1Typo and minWordSizefor2Typos are per-word thresholds, not per-query settings. Setting minWordSizefor1Typo = 5 only affects words with 4 or fewer characters. If the query has no such words, the setting has no effect compared to the default of 4.

This means the conditional check via hasStrictShortToken produces identical results to always setting the values. For example, searching "configuration" (13 chars) gets 2 typos regardless of whether minWordSizefor1Typo is 4 or 5, because 13 > 5 either way.
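
The per-word rule behind this argument can be illustrated with a simplified model. The function name allowedTypos is hypothetical, and the model only encodes the documented meaning of the two thresholds (minimum word length for one typo and for two typos); the real engine applies further matching and ranking rules on top.

```javascript
// Simplified, assumed model: how many typos a single query word may
// contain, given the two minimum-length thresholds.
function allowedTypos(word, minFor1Typo, minFor2Typos) {
  if (word.length >= minFor2Typos) return 2;
  if (word.length >= minFor1Typo) return 1;
  return 0;
}

// "configuration" (13 chars) is above both thresholds in either setup.
const configDefault = allowedTypos("configuration", 4, 8); // 2
const configStrict = allowedTypos("configuration", 5, 9);  // 2

// "gsoc" (4 chars) loses its single typo once the threshold moves to 5.
const gsocDefault = allowedTypos("gsoc", 4, 8); // 1
const gsocStrict = allowedTypos("gsoc", 5, 9);  // 0
```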

So the helper function adds ~20 lines of logic that doesn't change any search behavior compared to the simpler unconditional approach.

I think the minimal 2-line version keeps things cleaner and easier to maintain. Happy to discuss if I'm missing something though! 🙂

MakisH added the GSoC (Contributed in the context of the Google Summer of Code) and technical (Technical issues on the website) labels Feb 23, 2026

Linked issue (may be closed by merging this pull request): Search for ‘GSoC’ Returns Irrelevant XML Documentation Pages