feat: incremental matching #65

Open
ph1losof wants to merge 5 commits into saghen:main from ph1losof:feat/incremental-matching

ph1losof commented Feb 22, 2026

Hi @saghen, this is between a draft and fully ready, so feel free to share your opinion.

Note: AI was used to help write the benchmark code.

Closes #5, related to television discussion - #19

What this does

When you're typing into a fuzzy finder character by character (f -> fo -> foo), each keystroke can only eliminate matches from the previous set - it can never add new ones. IncrementalMatcher takes advantage of this by remembering which haystack indices matched last time and only rescoring that shrinking subset on each prefix extension. Everything else (completely different query, haystack list change) falls back to a full rescore.

The API mirrors Matcher but takes the needle per-call instead of at construction:

let mut matcher = IncrementalMatcher::new(&config);
let matches = matcher.match_list("f", &haystacks);
let matches = matcher.match_list("fo", &haystacks);  // only rescores previous matches

Also supports match_list_indices, match_list_parallel, reset(), and handles haystack growth between calls.
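
To make the narrowing idea concrete, here is a minimal standalone sketch. It is not the code in this PR: `toy_score`, `NarrowingState`, and `step` are hypothetical names, the scorer is a plain subsequence test standing in for the real Smith-Waterman scoring, and any change in haystack count simply triggers a full rescore (the sketch also assumes the haystack contents are unchanged when the count is unchanged).

```rust
/// Toy stand-in for the real scorer: returns Some(score) when every byte of
/// `needle` appears in `haystack` in order (a plain subsequence test).
/// This is NOT the library's Smith-Waterman scoring, only a placeholder.
fn toy_score(needle: &str, haystack: &str) -> Option<u32> {
    let mut hay = haystack.bytes();
    for n in needle.bytes() {
        // `any` consumes the iterator up to and including the first hit,
        // so later needle bytes must match further to the right.
        if !hay.any(|h| h == n) {
            return None;
        }
    }
    Some(needle.len() as u32)
}

/// Hypothetical incremental state: the last needle plus the surviving indices.
#[derive(Default)]
struct NarrowingState {
    last_needle: String,
    survivors: Vec<usize>,
    haystack_len: usize,
}

impl NarrowingState {
    /// Returns the matching haystack indices and how many items were rescored.
    fn step(&mut self, needle: &str, haystacks: &[&str]) -> (Vec<usize>, usize) {
        // Narrowing is only valid when the new needle extends the old one
        // and the haystack list has the same length as last time.
        let can_narrow = !self.last_needle.is_empty()
            && needle.starts_with(self.last_needle.as_str())
            && haystacks.len() == self.haystack_len;
        let candidates: Vec<usize> = if can_narrow {
            self.survivors.clone()
        } else {
            (0..haystacks.len()).collect()
        };
        let rescored = candidates.len();
        let matched: Vec<usize> = candidates
            .into_iter()
            .filter(|&i| toy_score(needle, haystacks[i]).is_some())
            .collect();
        self.last_needle = needle.to_string();
        self.survivors = matched.clone();
        self.haystack_len = haystacks.len();
        (matched, rescored)
    }
}
```

Calling `step("f", ..)` followed by `step("fo", ..)` rescans only the indices that survived `"f"`, while an unrelated needle such as `"ba"` goes back to scanning the whole list.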

Benchmark results on synthetic file-path data (500k haystacks, 8 keystroke sequence):

Overall:  one-shot 319ms  incremental 166ms  (1.92x)

Per step (needle, match count, one-shot → incremental time, speedup):
 "s"        482088 matches    30ms →  30ms   1.0x
 "sr"       355063 matches    44ms →  43ms   1.0x
 "src"      188339 matches    41ms →  36ms   1.2x
 "src/"     112305 matches    43ms →  25ms   1.7x
 "src/c"     53330 matches    50ms →  18ms   2.7x
 "src/co"    42282 matches    51ms →   6ms   8.5x
 "src/com"   17301 matches    38ms →   3ms   9.8x
 "src/comp"   6293 matches    22ms →   1ms  16.2x

Early steps where almost everything matches show no overhead (1.0x). The wins kick in once selectivity increases and the narrowed subset gets meaningfully smaller.

Approaches I tested that didn't pan out

Delta prefilter - before running the full SIMD prefilter on the narrowed set, check whether just the new character exists in the haystack (cheap scalar byte scan). It turns out the SIMD prefilter already runs at ~16ns/item, while the scalar check costs ~25ns/item. Adding it as a pre-pass made things slower: the check costs more per item than the SIMD prefilter it was meant to skip.
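
A back-of-the-envelope model of why the pre-pass loses, using the two per-item timings quoted above. The function name and the `pass_rate` parameter are made up for this illustration, and the model ignores cache and branch-prediction effects.

```rust
/// Expected per-item cost in nanoseconds when a scalar pre-pass runs on every
/// item and the SIMD prefilter runs only on the fraction `pass_rate` of items
/// that survive the pre-pass. Hypothetical cost model, not measured code.
fn cost_with_prepass_ns(scalar_ns: f64, simd_ns: f64, pass_rate: f64) -> f64 {
    scalar_ns + pass_rate * simd_ns
}
```

Even in the best case for the pre-pass, where it rejects every item (`pass_rate = 0.0`), the cost is 25ns/item against 16ns/item for running the SIMD prefilter alone, so under this model the pre-pass cannot break even at these timings.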

Incremental prefilter/SW construction - set_needle rebuilds both the prefilter and the Smith-Waterman matcher from scratch each call. Thought about making it append-only for prefix extensions. Measured it: set_needle costs 120-270ns depending on needle length. Over an 8-step sequence that's ~1.7µs total out of ~166ms. Not worth the complexity, and the SW matrix uses a variable stride that changes per haystack, making in-place growth impractical without reworking the core DP.

Partial SW reuse - cache the first N-1 rows of the score matrix from the previous needle, only compute the new row. Problem is the matrix is per-haystack (different haystack lengths = different column counts), so you'd need to store a matrix per surviving match. That's ~12KB per match. At 23k matches that's 280MB.

What could come next

  • Top-K mode where you can stop scoring early once you have enough high-quality results. Right now we score everything because the API returns all matches. A match_list_top_k(needle, haystacks, k) variant could skip SW for items that can't possibly make the cut based on their previous score.
  • Score threshold parameter for similar early-out behavior when the caller knows they only care about matches above a quality bar.
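
One possible shape for the top-K idea, as a rough sketch only. Nothing here exists in the PR: `top_k_with_bound` is a hypothetical helper, `score` stands in for the expensive Smith-Waterman call, and `max_gain` is an assumed upper bound on how much one extra needle character can raise a score. Whether such a bound holds for the real scoring rules would need to be checked before relying on it.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical top-k selection over the previous match set.
/// `prev_scores` holds (haystack index, score for the previous needle),
/// ideally sorted by descending score so the heap fills with strong
/// candidates first. Returns the kept (index, new score) pairs, best first,
/// plus the number of items skipped without calling `score`.
fn top_k_with_bound<F>(
    prev_scores: &[(usize, u32)],
    max_gain: u32, // ASSUMED bound on the score gain from one extra character
    k: usize,
    mut score: F,
) -> (Vec<(usize, u32)>, usize)
where
    F: FnMut(usize) -> Option<u32>,
{
    if k == 0 {
        return (Vec::new(), prev_scores.len());
    }
    // Min-heap over (score, index): the weakest kept result sits on top.
    let mut heap: BinaryHeap<Reverse<(u32, usize)>> = BinaryHeap::new();
    let mut skipped = 0;
    for &(idx, prev) in prev_scores {
        if heap.len() == k {
            if let Some(Reverse((worst, _))) = heap.peek() {
                // Even the best possible new score cannot beat the current
                // k-th result, so the expensive scoring call is skipped.
                if prev.saturating_add(max_gain) < *worst {
                    skipped += 1;
                    continue;
                }
            }
        }
        if let Some(s) = score(idx) {
            heap.push(Reverse((s, idx)));
            if heap.len() > k {
                heap.pop(); // drop the weakest entry
            }
        }
    }
    let mut out: Vec<(usize, u32)> =
        heap.into_iter().map(|Reverse((s, i))| (i, s)).collect();
    out.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    (out, skipped)
}
```

The skip test uses a strict comparison so that items which could still tie with the current k-th result are scored rather than dropped. A score-threshold variant would look similar, with a fixed cutoff in place of the heap minimum.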

Replace `if let ... && ...` expressions with nested `if let` / `if`
blocks to avoid E0658 on toolchains without stabilized let chains.

When a user types a query character by character, extending a needle
can only eliminate matches, never create new ones. IncrementalMatcher
stores which haystack indices matched previously and only rescores
that subset on prefix extension, giving ~2x overall speedup.

Supports match_list, match_list_indices, match_list_parallel, reset,
and haystack growth between calls.

Compares IncrementalMatcher vs one-shot match_list with file-path
datasets across multiple query patterns and dataset sizes. Shows
per-step breakdowns and selectivity impact.
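
For readers unfamiliar with the let-chain rewrite mentioned in the commit notes above, this is roughly what the transformation looks like. The function is a made-up example, not code from this PR; the let-chain form is shown only in a comment so the block compiles on toolchains where that syntax is still rejected with E0658.

```rust
/// Hypothetical helper: returns the characters appended to `prev` to form
/// `needle`, or None when `needle` is not a strict extension of `prev`.
///
/// With let chains available, the body could be a single condition:
///
///     if let Some(prev) = prev
///         && let Some(rest) = needle.strip_prefix(prev)
///         && !rest.is_empty()
///     {
///         return Some(rest);
///     }
///
/// The nested form below behaves the same and needs no let-chain support.
fn appended_suffix<'a>(prev: Option<&str>, needle: &'a str) -> Option<&'a str> {
    if let Some(prev) = prev {
        if let Some(rest) = needle.strip_prefix(prev) {
            if !rest.is_empty() {
                return Some(rest);
            }
        }
    }
    None
}
```

The nesting is more verbose, but it keeps the crate buildable on older compilers and editions.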