Replace `if let ... && ...` expressions with nested `if let` / `if` blocks to avoid E0658 on toolchains without stabilized let chains.
When a user types a query character by character, extending a needle can only eliminate matches, never create new ones. IncrementalMatcher stores which haystack indices matched previously and only rescores that subset on prefix extension, giving ~2x overall speedup. Supports match_list, match_list_indices, match_list_parallel, reset, and haystack growth between calls.
Compares IncrementalMatcher vs one-shot match_list with file-path datasets across multiple query patterns and dataset sizes. Shows per-step breakdowns and selectivity impact.
Hi @saghen, this is between a draft and fully ready, so feel free to share your opinion.
Note: AI was used to help write the benchmark code.
Closes #5, related to television discussion - #19
What this does
When you're typing into a fuzzy finder character by character (`f` -> `fo` -> `foo`), each keystroke can only eliminate matches from the previous set; it can never add new ones. `IncrementalMatcher` takes advantage of this by remembering which haystack indices matched last time and only rescoring that shrinking subset on each prefix extension. Everything else (a completely different query, a change to the haystack list) falls back to a full rescore.

The API mirrors `Matcher` but takes the needle per call instead of at construction. It also supports `match_list_indices`, `match_list_parallel`, and `reset()`, and handles haystack growth between calls.

Benchmark results on synthetic file-path data (500k haystacks, 8-keystroke sequence):
Early steps where almost everything matches show no overhead (1.0x). The wins kick in once selectivity increases and the narrowed subset gets meaningfully smaller.
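The narrowing idea can be sketched roughly like this. Note this is a toy illustration with hypothetical names (`IncrementalDemo`, a subsequence test standing in for the real prefilter + Smith-Waterman scoring), not frizbee's actual implementation:

```rust
/// Toy incremental matcher: remembers which haystack indices survived the
/// previous needle and, when the new needle extends the old one, rescans
/// only that subset. Anything else falls back to a full scan.
struct IncrementalDemo {
    prev_needle: String,
    surviving: Option<Vec<usize>>, // indices that matched prev_needle
}

impl IncrementalDemo {
    fn new() -> Self {
        Self { prev_needle: String::new(), surviving: None }
    }

    /// Case-sensitive subsequence test, standing in for the real
    /// prefilter + Smith-Waterman scoring.
    fn is_match(needle: &str, haystack: &str) -> bool {
        let mut it = haystack.chars();
        needle.chars().all(|c| it.by_ref().any(|h| h == c))
    }

    fn match_list(&mut self, needle: &str, haystacks: &[&str]) -> Vec<usize> {
        // Prefix extension: narrow the previous survivors. A diverging or
        // shorter needle forces a full rescan of every index.
        let candidates: Vec<usize> = match &self.surviving {
            Some(prev) if needle.starts_with(self.prev_needle.as_str()) => prev.clone(),
            _ => (0..haystacks.len()).collect(),
        };
        let matched: Vec<usize> = candidates
            .into_iter()
            .filter(|&i| Self::is_match(needle, haystacks[i]))
            .collect();
        self.prev_needle = needle.to_string();
        self.surviving = Some(matched.clone());
        matched
    }
}
```

Each keystroke shrinks `surviving`, so later (more selective) steps touch far fewer haystacks than a one-shot scan would.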
Approaches I tested that didn't pan out
- **Delta prefilter** - before running the full SIMD prefilter on the narrowed set, check whether just the new character exists in the haystack (a cheap scalar byte scan). Turns out the SIMD prefilter already runs at ~16ns/item while the scalar check costs ~25ns/item, so adding it as a pre-pass actually made things slower; the SIMD path is fast enough that the extra branch isn't worth it.
- **Incremental prefilter/SW construction** - `set_needle` rebuilds both the prefilter and the Smith-Waterman matcher from scratch on each call. I considered making it append-only for prefix extensions, then measured it: `set_needle` costs 120-270ns depending on needle length. Over an 8-step sequence that's ~1.7µs total out of ~166ms. Not worth the complexity, and the SW matrix uses a variable stride that changes per haystack, making in-place growth impractical without reworking the core DP.
- **Partial SW reuse** - cache the first N-1 rows of the score matrix from the previous needle and only compute the new row. The problem is that the matrix is per-haystack (different haystack lengths mean different column counts), so you'd need to store a matrix per surviving match. That's ~12KB per match; at 23k matches, that's ~280MB.
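For reference, the rejected delta-prefilter check boils down to a one-byte membership scan. This is my own sketch of the idea with a hypothetical helper name (it also ignores case folding, which a real check would need); per the measurements above, this scalar scan at ~25ns/item lost to just rerunning the ~16ns/item SIMD prefilter:

```rust
/// Rejected "delta prefilter" idea (hypothetical, not frizbee's API): on a
/// prefix extension, every surviving haystack already matched the old needle,
/// so the only cheap new rejection test is whether the newly typed byte
/// appears in the haystack at all.
fn delta_reject(new_byte: u8, haystack: &str) -> bool {
    // true = haystack can be dropped without rerunning the full prefilter
    !haystack.as_bytes().contains(&new_byte)
}
```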
What could come next
A `match_list_top_k(needle, haystacks, k)` variant could skip SW for items that can't possibly make the cut based on their previous score.
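That pruning could look roughly like the following. This is a sketch under the assumption that scores are monotonically non-increasing on prefix extension; `top_k_with_pruning` and its signature are hypothetical, not part of this PR. Since a rescored item can only score at or below its previous score, skipping on the previous score is conservative and never drops a true top-k item:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Return the indices of the k best items for the new needle, skipping the
/// expensive Smith-Waterman rescore for items whose *previous* score already
/// can't beat the current k-th best.
fn top_k_with_pruning(
    prev_scores: &[u16],              // score each item got for the previous needle
    rescore: impl Fn(usize) -> u16,   // expensive SW rescore for the new needle
    k: usize,
) -> Vec<usize> {
    // Min-heap of (score, index) holding the best k rescored items so far.
    let mut heap: BinaryHeap<Reverse<(u16, usize)>> = BinaryHeap::new();
    for (i, &prev) in prev_scores.iter().enumerate() {
        if heap.len() == k {
            if let Some(&Reverse((kth, _))) = heap.peek() {
                // New score <= prev, so prev <= kth means it can never
                // enter the top k: skip SW entirely.
                if prev <= kth {
                    continue;
                }
            }
        }
        heap.push(Reverse((rescore(i), i)));
        if heap.len() > k {
            heap.pop();
        }
    }
    // Drain the heap and sort best-first.
    let mut out: Vec<(u16, usize)> = heap.into_iter().map(|Reverse(x)| x).collect();
    out.sort_by(|a, b| b.0.cmp(&a.0));
    out.into_iter().map(|(_, i)| i).collect()
}
```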