-
-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
benchmarks: add ad hoc benchmark for 'quick-strings' #156
base: master
Are you sure you want to change the base?
Conversation
This is in response to @samuelcolvin's [tweet] about Rust's standard library substring search routines being slow. This PR hooks up his naive implementation of substring search to memchr's [rebar] harness and bakes them off against the substring search implementations in `std` and `memchr` across a wide variety of benchmarks. I ran the benchmarks like so: ``` $ rebar build $ rebar measure \ -e quick-strings \ -e rust/std/memmem/oneshot \ -e rust/std/memmem/prebuilt \ -e rust/memchr/memmem/oneshot \ -e rust/memchr/memmem/prebuilt | tee /tmp/results.csv ``` And here is a high level comparison using the geometric mean of relative speed ratios across all measured benchmarks (limited only to those on which all of std's, memchr's and quick-string's code is measured): ``` $ rebar rank --intersection /tmp/results.csv Engine Version Geometric mean of speed ratios Benchmark count ------ ------- ------------------------------ --------------- rust/memchr/memmem/prebuilt 2.7.4 1.05 38 rust/memchr/memmem/oneshot 2.7.4 1.22 38 rust/std/memmem/prebuilt 1.80.0-nightly b92758a9a 7.02 38 rust/std/memmem/oneshot 1.80.0-nightly b92758a9a 7.48 38 rust/quick-strings/memmem/oneshot 0.0.1 11.22 38 ``` The "oneshot" benchmarks are the closest apples-to-apples comparison. Namely, they measure the time it takes to do a search of a needle in a haystack with a single function call. The "prebuilt" benchmarks, on the other hand, are allowed to accept a needle, build a searcher and then reuse that search for subsequent use. The "prebuilt" benchmarks demonstrate the advantages of a superior API when the needle is invariant. (For example, "find all occurrences of this needle in the lines of a file.") For our purposes, and so we can narrow things down a bit, let's stick to "oneshot" benchmarks: ``` $ rebar rank /tmp/results.csv --intersection -e oneshot Engine Version Geometric mean of speed ratios Benchmark count ------ ------- ------------------------------ --------------- rust/memchr/memmem/oneshot 2.7.4 1.12 38 rust/std/memmem/oneshot 1.80.0-nightly b92758a9a 6.87 38 rust/quick-strings/memmem/oneshot 0.0.1 10.31 38 ``` Here is a breakdown on the level of the individual benchmark: ``` $ rebar cmp /tmp/results.csv --intersection -e oneshot benchmark rust/memchr/memmem/oneshot rust/quick-strings/memmem/oneshot rust/std/memmem/oneshot --------- -------------------------- --------------------------------- ----------------------- memmem/code/rust-library-never-fn-strength 48.9 GB/s (1.00x) 1620.4 MB/s (30.87x) 1378.7 MB/s (36.28x) memmem/code/rust-library-never-fn-strength-paren 48.3 GB/s (1.00x) 1637.2 MB/s (30.21x) 1482.8 MB/s (33.35x) memmem/code/rust-library-never-fn-quux 29.4 GB/s (1.00x) 1637.2 MB/s (18.42x) 1540.9 MB/s (19.57x) memmem/code/rust-library-rare-fn-from-str 48.9 GB/s (1.00x) 5.8 GB/s (8.40x) 1496.9 MB/s (33.44x) memmem/pathological/md5-huge-no-hash 46.8 GB/s (1.00x) 1190.6 MB/s (40.27x) 8.7 GB/s (5.36x) memmem/pathological/md5-huge-last-hash 46.7 GB/s (1.00x) 1196.0 MB/s (39.95x) 10.5 GB/s (4.45x) memmem/pathological/rare-repeated-huge-tricky 51.2 GB/s (1.00x) 1984.3 MB/s (26.44x) 1240.3 MB/s (42.30x) memmem/pathological/rare-repeated-small-tricky 21.2 GB/s (1.00x) 1936.4 MB/s (11.20x) 1202.3 MB/s (18.05x) memmem/pathological/defeat-simple-vector-alphabet 5.4 GB/s (1.00x) 1069.8 MB/s (5.18x) 4.4 GB/s (1.24x) memmem/pathological/defeat-simple-vector-freq-alphabet 19.2 GB/s (1.37x) 1167.3 MB/s (23.04x) 26.3 GB/s (1.00x) memmem/pathological/defeat-simple-vector-repeated-alphabet 991.6 MB/s (1.00x) 22.9 MB/s (43.36x) 826.4 MB/s (1.20x) memmem/subtitles/never/huge-en-john-watson 39.6 GB/s (1.00x) 1930.5 MB/s (21.03x) 1605.2 MB/s (25.29x) memmem/subtitles/never/huge-en-all-common-bytes 43.9 GB/s (1.00x) 1377.7 MB/s (32.63x) 2.9 GB/s (15.12x) memmem/subtitles/never/huge-en-some-rare-bytes 50.3 GB/s (1.00x) 1969.1 MB/s (26.17x) 3.2 GB/s (15.52x) memmem/subtitles/never/huge-en-two-space 41.4 GB/s (1.00x) 727.8 MB/s (58.28x) 1158.8 MB/s (36.61x) memmem/subtitles/never/teeny-en-john-watson 1027.0 MB/s (1.44x) 1483.5 MB/s (1.00x) 920.8 MB/s (1.61x) memmem/subtitles/never/teeny-en-all-common-bytes 989.0 MB/s (1.50x) 1483.5 MB/s (1.00x) 1027.0 MB/s (1.44x) memmem/subtitles/never/teeny-en-some-rare-bytes 1027.0 MB/s (1.30x) 1335.1 MB/s (1.00x) 1161.0 MB/s (1.15x) memmem/subtitles/never/teeny-en-two-space 989.0 MB/s (1.12x) 1112.6 MB/s (1.00x) 953.7 MB/s (1.17x) memmem/subtitles/never/huge-ru-john-watson 51.3 GB/s (1.00x) 475.6 MB/s (110.41x) 2.2 GB/s (23.42x) memmem/subtitles/never/teeny-ru-john-watson 1178.1 MB/s (1.42x) 1668.9 MB/s (1.00x) 1001.4 MB/s (1.67x) memmem/subtitles/never/huge-zh-john-watson 45.1 GB/s (1.00x) 1457.1 MB/s (31.69x) 6.5 GB/s (6.90x) memmem/subtitles/never/teeny-zh-john-watson 1095.0 MB/s (1.50x) 1642.4 MB/s (1.00x) 953.7 MB/s (1.72x) memmem/subtitles/rare/huge-en-sherlock-holmes 51.2 GB/s (1.00x) 1907.4 MB/s (27.48x) 2.5 GB/s (20.36x) memmem/subtitles/rare/huge-en-sherlock 37.9 GB/s (1.00x) 1907.0 MB/s (20.33x) 2.7 GB/s (14.11x) memmem/subtitles/rare/huge-en-medium-needle 48.0 GB/s (1.00x) 1358.2 MB/s (36.22x) 3.9 GB/s (12.27x) memmem/subtitles/rare/huge-en-long-needle 43.2 GB/s (1.00x) 1715.0 MB/s (25.78x) 6.8 GB/s (6.32x) memmem/subtitles/rare/huge-en-huge-needle 41.8 GB/s (1.00x) 1915.9 MB/s (22.37x) 7.0 GB/s (5.97x) memmem/subtitles/rare/teeny-en-sherlock-holmes 989.0 MB/s (1.35x) 1335.1 MB/s (1.00x) 387.0 MB/s (3.45x) memmem/subtitles/rare/teeny-en-sherlock 1112.6 MB/s (1.33x) 1483.5 MB/s (1.00x) 606.9 MB/s (2.44x) memmem/subtitles/rare/huge-ru-sherlock-holmes 40.1 GB/s (1.00x) 479.5 MB/s (85.61x) 2.6 GB/s (15.64x) memmem/subtitles/rare/huge-ru-sherlock 51.1 GB/s (1.00x) 479.5 MB/s (109.22x) 1733.9 MB/s (30.21x) memmem/subtitles/rare/teeny-ru-sherlock-holmes 1112.6 MB/s (1.29x) 1430.5 MB/s (1.00x) 408.7 MB/s (3.50x) memmem/subtitles/rare/teeny-ru-sherlock 1292.1 MB/s (1.29x) 1668.9 MB/s (1.00x) 656.6 MB/s (2.54x) memmem/subtitles/rare/huge-zh-sherlock-holmes 49.0 GB/s (1.00x) 1080.8 MB/s (46.38x) 4.9 GB/s (9.92x) memmem/subtitles/rare/huge-zh-sherlock 49.9 GB/s (1.00x) 1075.2 MB/s (47.56x) 3.4 GB/s (14.73x) memmem/subtitles/rare/teeny-zh-sherlock-holmes 985.5 MB/s (1.50x) 1478.2 MB/s (1.00x) 343.8 MB/s (4.30x) memmem/subtitles/rare/teeny-zh-sherlock 1182.6 MB/s (1.56x) 1847.7 MB/s (1.00x) 579.7 MB/s (3.19x) ``` Even in comparison to `memchr`, which has been optimized like crazy, `quick-strings` actually does better on some benchmarks! But... if we ask for only results where there is, say, a very large difference, watch what happens: ``` $ rebar cmp /tmp/results.csv --intersection -e oneshot -t 5 benchmark rust/memchr/memmem/oneshot rust/quick-strings/memmem/oneshot rust/std/memmem/oneshot --------- -------------------------- --------------------------------- ----------------------- memmem/code/rust-library-never-fn-strength 48.9 GB/s (1.00x) 1620.4 MB/s (30.87x) 1378.7 MB/s (36.28x) memmem/code/rust-library-never-fn-strength-paren 48.3 GB/s (1.00x) 1637.2 MB/s (30.21x) 1482.8 MB/s (33.35x) memmem/code/rust-library-never-fn-quux 29.4 GB/s (1.00x) 1637.2 MB/s (18.42x) 1540.9 MB/s (19.57x) memmem/code/rust-library-rare-fn-from-str 48.9 GB/s (1.00x) 5.8 GB/s (8.40x) 1496.9 MB/s (33.44x) memmem/pathological/md5-huge-no-hash 46.8 GB/s (1.00x) 1190.6 MB/s (40.27x) 8.7 GB/s (5.36x) memmem/pathological/md5-huge-last-hash 46.7 GB/s (1.00x) 1196.0 MB/s (39.95x) 10.5 GB/s (4.45x) memmem/pathological/rare-repeated-huge-tricky 51.2 GB/s (1.00x) 1984.3 MB/s (26.44x) 1240.3 MB/s (42.30x) memmem/pathological/rare-repeated-small-tricky 21.2 GB/s (1.00x) 1936.4 MB/s (11.20x) 1202.3 MB/s (18.05x) memmem/pathological/defeat-simple-vector-alphabet 5.4 GB/s (1.00x) 1069.8 MB/s (5.18x) 4.4 GB/s (1.24x) memmem/pathological/defeat-simple-vector-freq-alphabet 19.2 GB/s (1.37x) 1167.3 MB/s (23.04x) 26.3 GB/s (1.00x) memmem/pathological/defeat-simple-vector-repeated-alphabet 991.6 MB/s (1.00x) 22.9 MB/s (43.36x) 826.4 MB/s (1.20x) memmem/subtitles/never/huge-en-john-watson 39.6 GB/s (1.00x) 1930.5 MB/s (21.03x) 1605.2 MB/s (25.29x) memmem/subtitles/never/huge-en-all-common-bytes 43.9 GB/s (1.00x) 1377.7 MB/s (32.63x) 2.9 GB/s (15.12x) memmem/subtitles/never/huge-en-some-rare-bytes 50.3 GB/s (1.00x) 1969.1 MB/s (26.17x) 3.2 GB/s (15.52x) memmem/subtitles/never/huge-en-two-space 41.4 GB/s (1.00x) 727.8 MB/s (58.28x) 1158.8 MB/s (36.61x) memmem/subtitles/never/huge-ru-john-watson 51.3 GB/s (1.00x) 475.6 MB/s (110.41x) 2.2 GB/s (23.42x) memmem/subtitles/never/huge-zh-john-watson 45.1 GB/s (1.00x) 1457.1 MB/s (31.69x) 6.5 GB/s (6.90x) memmem/subtitles/rare/huge-en-sherlock-holmes 51.2 GB/s (1.00x) 1907.4 MB/s (27.48x) 2.5 GB/s (20.36x) memmem/subtitles/rare/huge-en-sherlock 37.9 GB/s (1.00x) 1907.0 MB/s (20.33x) 2.7 GB/s (14.11x) memmem/subtitles/rare/huge-en-medium-needle 48.0 GB/s (1.00x) 1358.2 MB/s (36.22x) 3.9 GB/s (12.27x) memmem/subtitles/rare/huge-en-long-needle 43.2 GB/s (1.00x) 1715.0 MB/s (25.78x) 6.8 GB/s (6.32x) memmem/subtitles/rare/huge-en-huge-needle 41.8 GB/s (1.00x) 1915.9 MB/s (22.37x) 7.0 GB/s (5.97x) memmem/subtitles/rare/huge-ru-sherlock-holmes 40.1 GB/s (1.00x) 479.5 MB/s (85.61x) 2.6 GB/s (15.64x) memmem/subtitles/rare/huge-ru-sherlock 51.1 GB/s (1.00x) 479.5 MB/s (109.22x) 1733.9 MB/s (30.21x) memmem/subtitles/rare/huge-zh-sherlock-holmes 49.0 GB/s (1.00x) 1080.8 MB/s (46.38x) 4.9 GB/s (9.92x) memmem/subtitles/rare/huge-zh-sherlock 49.9 GB/s (1.00x) 1075.2 MB/s (47.56x) 3.4 GB/s (14.73x) ``` In every case, `quick-strings` is the one that is slower. Here's another way to slice the data. Let's _only_ look at benchmarks with non-teeny haystacks: ``` [andrew@duff memchr]$ rebar cmp /tmp/results.csv --intersection -e oneshot -F teeny benchmark rust/memchr/memmem/oneshot rust/quick-strings/memmem/oneshot rust/std/memmem/oneshot --------- -------------------------- --------------------------------- ----------------------- memmem/code/rust-library-never-fn-strength 48.9 GB/s (1.00x) 1620.4 MB/s (30.87x) 1378.7 MB/s (36.28x) memmem/code/rust-library-never-fn-strength-paren 48.3 GB/s (1.00x) 1637.2 MB/s (30.21x) 1482.8 MB/s (33.35x) memmem/code/rust-library-never-fn-quux 29.4 GB/s (1.00x) 1637.2 MB/s (18.42x) 1540.9 MB/s (19.57x) memmem/code/rust-library-rare-fn-from-str 48.9 GB/s (1.00x) 5.8 GB/s (8.40x) 1496.9 MB/s (33.44x) memmem/pathological/md5-huge-no-hash 46.8 GB/s (1.00x) 1190.6 MB/s (40.27x) 8.7 GB/s (5.36x) memmem/pathological/md5-huge-last-hash 46.7 GB/s (1.00x) 1196.0 MB/s (39.95x) 10.5 GB/s (4.45x) memmem/pathological/rare-repeated-huge-tricky 51.2 GB/s (1.00x) 1984.3 MB/s (26.44x) 1240.3 MB/s (42.30x) memmem/pathological/rare-repeated-small-tricky 21.2 GB/s (1.00x) 1936.4 MB/s (11.20x) 1202.3 MB/s (18.05x) memmem/pathological/defeat-simple-vector-alphabet 5.4 GB/s (1.00x) 1069.8 MB/s (5.18x) 4.4 GB/s (1.24x) memmem/pathological/defeat-simple-vector-freq-alphabet 19.2 GB/s (1.37x) 1167.3 MB/s (23.04x) 26.3 GB/s (1.00x) memmem/pathological/defeat-simple-vector-repeated-alphabet 991.6 MB/s (1.00x) 22.9 MB/s (43.36x) 826.4 MB/s (1.20x) memmem/subtitles/never/huge-en-john-watson 39.6 GB/s (1.00x) 1930.5 MB/s (21.03x) 1605.2 MB/s (25.29x) memmem/subtitles/never/huge-en-all-common-bytes 43.9 GB/s (1.00x) 1377.7 MB/s (32.63x) 2.9 GB/s (15.12x) memmem/subtitles/never/huge-en-some-rare-bytes 50.3 GB/s (1.00x) 1969.1 MB/s (26.17x) 3.2 GB/s (15.52x) memmem/subtitles/never/huge-en-two-space 41.4 GB/s (1.00x) 727.8 MB/s (58.28x) 1158.8 MB/s (36.61x) memmem/subtitles/never/huge-ru-john-watson 51.3 GB/s (1.00x) 475.6 MB/s (110.41x) 2.2 GB/s (23.42x) memmem/subtitles/never/huge-zh-john-watson 45.1 GB/s (1.00x) 1457.1 MB/s (31.69x) 6.5 GB/s (6.90x) memmem/subtitles/rare/huge-en-sherlock-holmes 51.2 GB/s (1.00x) 1907.4 MB/s (27.48x) 2.5 GB/s (20.36x) memmem/subtitles/rare/huge-en-sherlock 37.9 GB/s (1.00x) 1907.0 MB/s (20.33x) 2.7 GB/s (14.11x) memmem/subtitles/rare/huge-en-medium-needle 48.0 GB/s (1.00x) 1358.2 MB/s (36.22x) 3.9 GB/s (12.27x) memmem/subtitles/rare/huge-en-long-needle 43.2 GB/s (1.00x) 1715.0 MB/s (25.78x) 6.8 GB/s (6.32x) memmem/subtitles/rare/huge-en-huge-needle 41.8 GB/s (1.00x) 1915.9 MB/s (22.37x) 7.0 GB/s (5.97x) memmem/subtitles/rare/huge-ru-sherlock-holmes 40.1 GB/s (1.00x) 479.5 MB/s (85.61x) 2.6 GB/s (15.64x) memmem/subtitles/rare/huge-ru-sherlock 51.1 GB/s (1.00x) 479.5 MB/s (109.22x) 1733.9 MB/s (30.21x) memmem/subtitles/rare/huge-zh-sherlock-holmes 49.0 GB/s (1.00x) 1080.8 MB/s (46.38x) 4.9 GB/s (9.92x) memmem/subtitles/rare/huge-zh-sherlock 49.9 GB/s (1.00x) 1075.2 MB/s (47.56x) 3.4 GB/s (14.73x) ``` Again, `quick-strings` is slower in every case. Now let's look at benchmarks with only teeny haystacks (the opposite of the above): ``` $ rebar cmp /tmp/results.csv --intersection -e oneshot -f teeny benchmark rust/memchr/memmem/oneshot rust/quick-strings/memmem/oneshot rust/std/memmem/oneshot --------- -------------------------- --------------------------------- ----------------------- memmem/subtitles/never/teeny-en-john-watson 1027.0 MB/s (1.44x) 1483.5 MB/s (1.00x) 920.8 MB/s (1.61x) memmem/subtitles/never/teeny-en-all-common-bytes 989.0 MB/s (1.50x) 1483.5 MB/s (1.00x) 1027.0 MB/s (1.44x) memmem/subtitles/never/teeny-en-some-rare-bytes 1027.0 MB/s (1.30x) 1335.1 MB/s (1.00x) 1161.0 MB/s (1.15x) memmem/subtitles/never/teeny-en-two-space 989.0 MB/s (1.12x) 1112.6 MB/s (1.00x) 953.7 MB/s (1.17x) memmem/subtitles/never/teeny-ru-john-watson 1178.1 MB/s (1.42x) 1668.9 MB/s (1.00x) 1001.4 MB/s (1.67x) memmem/subtitles/never/teeny-zh-john-watson 1095.0 MB/s (1.50x) 1642.4 MB/s (1.00x) 953.7 MB/s (1.72x) memmem/subtitles/rare/teeny-en-sherlock-holmes 989.0 MB/s (1.35x) 1335.1 MB/s (1.00x) 387.0 MB/s (3.45x) memmem/subtitles/rare/teeny-en-sherlock 1112.6 MB/s (1.33x) 1483.5 MB/s (1.00x) 606.9 MB/s (2.44x) memmem/subtitles/rare/teeny-ru-sherlock-holmes 1112.6 MB/s (1.29x) 1430.5 MB/s (1.00x) 408.7 MB/s (3.50x) memmem/subtitles/rare/teeny-ru-sherlock 1292.1 MB/s (1.29x) 1668.9 MB/s (1.00x) 656.6 MB/s (2.54x) memmem/subtitles/rare/teeny-zh-sherlock-holmes 985.5 MB/s (1.50x) 1478.2 MB/s (1.00x) 343.8 MB/s (4.30x) memmem/subtitles/rare/teeny-zh-sherlock 1182.6 MB/s (1.56x) 1847.7 MB/s (1.00x) 579.7 MB/s (3.19x) ``` In this case, `quick-strings` is actually faster in _every_ case! My contextualization of these results is that `quick-strings` implements a "naive" substring search algorithm that has worst case `O(m * n)` time but *very low* startup costs. This means that if you pick your benchmarks correctly, it can appear to be a lot faster in some common cases. But once you extend beyond that or get large haystacks, perf can really tank pretty considerably. Substring search implementations have a lot of different cases to cover. `memchr` in particular does try very hard to optimize for the kinds of cases where `quick-strings` is fast, and this is likely why it fairs a bit better than `std` does above on the teeny haystacks. (For example, `memchr` is only 1.5x slower than `quick-strings` on `teeny-zh-sherlock-holmes`, but std is 4.3x slower.) The reason why it's difficult to get as fast as `quick-strings` (or pretty much any naive implementation) is that there's a bunch of startup overhead to pay for in a substring search implementation that uses multiple different strategies. For example, `memchr` will use SIMD in some cases, but only when the CPU supports it (via runtime querying). And `memchr` will change which implementation it uses based on haystack length. So there is some amount of case analysis being done before the search can even begin to execute. In contrast, `quick-strings` can just launch right into the search. With that said, there's always room for improvement! `memchr` isn't necessarily at the floor of an otherwise irreducible trade-off, but I think that so long as you want to get high throughput for big haystacks (like what ripgrep really wants to do), you probably need to at least partially sell out some other types of tasks for very very small haystacks. [tweet]: https://x.com/samuel_colvin/status/1815453691480338692 [rebar]: https://github.com/BurntSushi/rebar
This is fascinating, thank you so much. I'll look now soon. Perhaps we can just use It's worth noting that wrt
|
That said, |
Thanks @BurntSushi,
Here arrow-rs already falls back to using For that reason, we only need ascii case insensitive methods — if the needle is case insensitive, we use that methods, otherwise we can just fallback to using Ascii needles make up (in my experience) a very large proportion of queries, hence the value in optimising that case. Is there any interest in adding ascii only case-insensitive methods to |
cc @alamb who I think might be interested in this discussion. |
The "search in ASCII case insensitive mode" is something that I think would probably have a good home in this crate, but it is a big scope increase. That is, it's something that I do think would be nice to support, but it's a tall order. Of course, an initial implementation could be naive (as long as it's not naive algorithmically, i.e., optimal worst case behavior) I suppose. But there are likely SIMD tricks specific to ASCII case insensitive search that we'll want to employ. That's probably easily ~1 month of my free coding time in the evenings. But yeah, your strategy makes sense. I suspect what you're fighting is actually just |
An approach we have taken elsewhere in arrow-rs is to literally special case the different implementations So maybe we could do something like this: if haystack.len() < 1024 { // or whatever the constant happens to be
// use small-strings implementation
} else {
// use more general purpose std or memchr etc
} This would of course result in more code / bigger binary |
I think this is an excellent observation @BurntSushi @samuelcolvin maybe this is a good hint that we could make the arrow-rs substring kernels behave more like the regex kernes and "compile / prepare" the matcher Given the kernels typically operate on batches of 4000 or 8000ish rows we can probably amortize the "prepare the needle" step well |
See apache/arrow-rs#6118, apache/arrow-rs#6128 and apache/arrow-rs#6131 for implementation in arrow-rs. |
This is in response to @samuelcolvin's tweet about Rust's standard
library substring search routines being slow. This PR hooks up his naive
implementation of substring search to memchr's rebar harness and bakes
them off against the substring search implementations in
std
andmemchr
across a wide variety of benchmarks.I ran the benchmarks like so:
And here is a high level comparison using the geometric mean of
relative speed ratios across all measured benchmarks (limited only
to those on which all of std's, memchr's and quick-string's code is
measured):
The "oneshot" benchmarks are the closest apples-to-apples comparison.
Namely, they measure the time it takes to do a search of a needle in a
haystack with a single function call. The "prebuilt" benchmarks, on the
other hand, are allowed to accept a needle, build a searcher and then
reuse that search for subsequent use. The "prebuilt" benchmarks
demonstrate the advantages of a superior API when the needle is
invariant. (For example, "find all occurrences of this needle in the
lines of a file.")
For our purposes, and so we can narrow things down a bit, let's stick to
"oneshot" benchmarks:
Here is a breakdown on the level of the individual benchmark:
Even in comparison to
memchr
, which has been optimized like crazy,quick-strings
actually does better on some benchmarks! But... if weask for only results where there is, say, a very large difference, watch
what happens:
In every case,
quick-strings
is the one that is slower. Here's anotherway to slice the data. Let's only look at benchmarks with non-teeny
haystacks:
Again,
quick-strings
is slower in every case. Now let's look atbenchmarks with only teeny haystacks (the opposite of the above):
In this case,
quick-strings
is actually faster in every case!My contextualization of these results is that
quick-strings
implements a "naive" substring search algorithm that has worst case
O(m * n)
time but very low startup costs. This means that if youpick your benchmarks correctly, it can appear to be a lot faster
in some common cases. But once you extend beyond that or get large
haystacks, perf can really tank pretty considerably.
Substring search implementations have a lot of different cases to
cover.
memchr
in particular does try very hard to optimize for thekinds of cases where
quick-strings
is fast, and this is likely whyit fairs a bit better than
std
does above on the teeny haystacks.(For example,
memchr
is only 1.5x slower thanquick-strings
onteeny-zh-sherlock-holmes
, but std is 4.3x slower.)The reason why it's difficult to get as fast as
quick-strings
(orpretty much any naive implementation) is that there's a bunch of startup
overhead to pay for in a substring search implementation that uses
multiple different strategies. For example,
memchr
will use SIMD insome cases, but only when the CPU supports it (via runtime querying).
And
memchr
will change which implementation it uses based on haystacklength. So there is some amount of case analysis being done before the
search can even begin to execute. In contrast,
quick-strings
can justlaunch right into the search.
With that said, there's always room for improvement!
memchr
isn'tnecessarily at the floor of an otherwise irreducible trade-off, but I
think that so long as you want to get high throughput for big haystacks
(like what ripgrep really wants to do), you probably need to at least
partially sell out some other types of tasks for very very small
haystacks.