Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

benchmarks: add ad hoc benchmark for 'quick-strings' #156

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

BurntSushi
Copy link
Owner

This is in response to @samuelcolvin's tweet about Rust's standard
library substring search routines being slow. This PR hooks up his naive
implementation of substring search to memchr's rebar harness and bakes
them off against the substring search implementations in std and
memchr across a wide variety of benchmarks.

I ran the benchmarks like so:

$ rebar build
$ rebar measure \
    -e quick-strings \
    -e rust/std/memmem/oneshot \
    -e rust/std/memmem/prebuilt \
    -e rust/memchr/memmem/oneshot \
    -e rust/memchr/memmem/prebuilt | tee /tmp/results.csv

And here is a high level comparison using the geometric mean of
relative speed ratios across all measured benchmarks (limited only
to those on which all of std's, memchr's and quick-string's code is
measured):

$ rebar rank --intersection /tmp/results.csv
Engine                             Version                   Geometric mean of speed ratios  Benchmark count
------                             -------                   ------------------------------  ---------------
rust/memchr/memmem/prebuilt        2.7.4                     1.05                            38
rust/memchr/memmem/oneshot         2.7.4                     1.22                            38
rust/std/memmem/prebuilt           1.80.0-nightly b92758a9a  7.02                            38
rust/std/memmem/oneshot            1.80.0-nightly b92758a9a  7.48                            38
rust/quick-strings/memmem/oneshot  0.0.1                     11.22                           38

The "oneshot" benchmarks are the closest apples-to-apples comparison.
Namely, they measure the time it takes to do a search of a needle in a
haystack with a single function call. The "prebuilt" benchmarks, on the
other hand, are allowed to accept a needle, build a searcher and then
reuse that search for subsequent use. The "prebuilt" benchmarks
demonstrate the advantages of a superior API when the needle is
invariant. (For example, "find all occurrences of this needle in the
lines of a file.")

For our purposes, and so we can narrow things down a bit, let's stick to
"oneshot" benchmarks:

$ rebar rank /tmp/results.csv --intersection -e oneshot
Engine                             Version                   Geometric mean of speed ratios  Benchmark count
------                             -------                   ------------------------------  ---------------
rust/memchr/memmem/oneshot         2.7.4                     1.12                            38
rust/std/memmem/oneshot            1.80.0-nightly b92758a9a  6.87                            38
rust/quick-strings/memmem/oneshot  0.0.1                     10.31                           38

Here is a breakdown on the level of the individual benchmark:

$ rebar cmp /tmp/results.csv --intersection -e oneshot
benchmark                                                   rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                                   --------------------------  ---------------------------------  -----------------------
memmem/code/rust-library-never-fn-strength                  48.9 GB/s (1.00x)           1620.4 MB/s (30.87x)               1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren            48.3 GB/s (1.00x)           1637.2 MB/s (30.21x)               1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux                      29.4 GB/s (1.00x)           1637.2 MB/s (18.42x)               1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str                   48.9 GB/s (1.00x)           5.8 GB/s (8.40x)                   1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash                        46.8 GB/s (1.00x)           1190.6 MB/s (40.27x)               8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash                      46.7 GB/s (1.00x)           1196.0 MB/s (39.95x)               10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky               51.2 GB/s (1.00x)           1984.3 MB/s (26.44x)               1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky              21.2 GB/s (1.00x)           1936.4 MB/s (11.20x)               1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet           5.4 GB/s (1.00x)            1069.8 MB/s (5.18x)                4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.37x)           1167.3 MB/s (23.04x)               26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  991.6 MB/s (1.00x)          22.9 MB/s (43.36x)                 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson                  39.6 GB/s (1.00x)           1930.5 MB/s (21.03x)               1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes             43.9 GB/s (1.00x)           1377.7 MB/s (32.63x)               2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes              50.3 GB/s (1.00x)           1969.1 MB/s (26.17x)               3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space                    41.4 GB/s (1.00x)           727.8 MB/s (58.28x)                1158.8 MB/s (36.61x)
memmem/subtitles/never/teeny-en-john-watson                 1027.0 MB/s (1.44x)         1483.5 MB/s (1.00x)                920.8 MB/s (1.61x)
memmem/subtitles/never/teeny-en-all-common-bytes            989.0 MB/s (1.50x)          1483.5 MB/s (1.00x)                1027.0 MB/s (1.44x)
memmem/subtitles/never/teeny-en-some-rare-bytes             1027.0 MB/s (1.30x)         1335.1 MB/s (1.00x)                1161.0 MB/s (1.15x)
memmem/subtitles/never/teeny-en-two-space                   989.0 MB/s (1.12x)          1112.6 MB/s (1.00x)                953.7 MB/s (1.17x)
memmem/subtitles/never/huge-ru-john-watson                  51.3 GB/s (1.00x)           475.6 MB/s (110.41x)               2.2 GB/s (23.42x)
memmem/subtitles/never/teeny-ru-john-watson                 1178.1 MB/s (1.42x)         1668.9 MB/s (1.00x)                1001.4 MB/s (1.67x)
memmem/subtitles/never/huge-zh-john-watson                  45.1 GB/s (1.00x)           1457.1 MB/s (31.69x)               6.5 GB/s (6.90x)
memmem/subtitles/never/teeny-zh-john-watson                 1095.0 MB/s (1.50x)         1642.4 MB/s (1.00x)                953.7 MB/s (1.72x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.2 GB/s (1.00x)           1907.4 MB/s (27.48x)               2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock                      37.9 GB/s (1.00x)           1907.0 MB/s (20.33x)               2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle                 48.0 GB/s (1.00x)           1358.2 MB/s (36.22x)               3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle                   43.2 GB/s (1.00x)           1715.0 MB/s (25.78x)               6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle                   41.8 GB/s (1.00x)           1915.9 MB/s (22.37x)               7.0 GB/s (5.97x)
memmem/subtitles/rare/teeny-en-sherlock-holmes              989.0 MB/s (1.35x)          1335.1 MB/s (1.00x)                387.0 MB/s (3.45x)
memmem/subtitles/rare/teeny-en-sherlock                     1112.6 MB/s (1.33x)         1483.5 MB/s (1.00x)                606.9 MB/s (2.44x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               40.1 GB/s (1.00x)           479.5 MB/s (85.61x)                2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock                      51.1 GB/s (1.00x)           479.5 MB/s (109.22x)               1733.9 MB/s (30.21x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes              1112.6 MB/s (1.29x)         1430.5 MB/s (1.00x)                408.7 MB/s (3.50x)
memmem/subtitles/rare/teeny-ru-sherlock                     1292.1 MB/s (1.29x)         1668.9 MB/s (1.00x)                656.6 MB/s (2.54x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               49.0 GB/s (1.00x)           1080.8 MB/s (46.38x)               4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock                      49.9 GB/s (1.00x)           1075.2 MB/s (47.56x)               3.4 GB/s (14.73x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes              985.5 MB/s (1.50x)          1478.2 MB/s (1.00x)                343.8 MB/s (4.30x)
memmem/subtitles/rare/teeny-zh-sherlock                     1182.6 MB/s (1.56x)         1847.7 MB/s (1.00x)                579.7 MB/s (3.19x)

Even in comparison to memchr, which has been optimized like crazy,
quick-strings actually does better on some benchmarks! But... if we
ask for only results where there is, say, a very large difference, watch
what happens:

$ rebar cmp /tmp/results.csv --intersection -e oneshot -t 5
benchmark                                                   rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                                   --------------------------  ---------------------------------  -----------------------
memmem/code/rust-library-never-fn-strength                  48.9 GB/s (1.00x)           1620.4 MB/s (30.87x)               1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren            48.3 GB/s (1.00x)           1637.2 MB/s (30.21x)               1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux                      29.4 GB/s (1.00x)           1637.2 MB/s (18.42x)               1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str                   48.9 GB/s (1.00x)           5.8 GB/s (8.40x)                   1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash                        46.8 GB/s (1.00x)           1190.6 MB/s (40.27x)               8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash                      46.7 GB/s (1.00x)           1196.0 MB/s (39.95x)               10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky               51.2 GB/s (1.00x)           1984.3 MB/s (26.44x)               1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky              21.2 GB/s (1.00x)           1936.4 MB/s (11.20x)               1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet           5.4 GB/s (1.00x)            1069.8 MB/s (5.18x)                4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.37x)           1167.3 MB/s (23.04x)               26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  991.6 MB/s (1.00x)          22.9 MB/s (43.36x)                 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson                  39.6 GB/s (1.00x)           1930.5 MB/s (21.03x)               1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes             43.9 GB/s (1.00x)           1377.7 MB/s (32.63x)               2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes              50.3 GB/s (1.00x)           1969.1 MB/s (26.17x)               3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space                    41.4 GB/s (1.00x)           727.8 MB/s (58.28x)                1158.8 MB/s (36.61x)
memmem/subtitles/never/huge-ru-john-watson                  51.3 GB/s (1.00x)           475.6 MB/s (110.41x)               2.2 GB/s (23.42x)
memmem/subtitles/never/huge-zh-john-watson                  45.1 GB/s (1.00x)           1457.1 MB/s (31.69x)               6.5 GB/s (6.90x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.2 GB/s (1.00x)           1907.4 MB/s (27.48x)               2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock                      37.9 GB/s (1.00x)           1907.0 MB/s (20.33x)               2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle                 48.0 GB/s (1.00x)           1358.2 MB/s (36.22x)               3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle                   43.2 GB/s (1.00x)           1715.0 MB/s (25.78x)               6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle                   41.8 GB/s (1.00x)           1915.9 MB/s (22.37x)               7.0 GB/s (5.97x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               40.1 GB/s (1.00x)           479.5 MB/s (85.61x)                2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock                      51.1 GB/s (1.00x)           479.5 MB/s (109.22x)               1733.9 MB/s (30.21x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               49.0 GB/s (1.00x)           1080.8 MB/s (46.38x)               4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock                      49.9 GB/s (1.00x)           1075.2 MB/s (47.56x)               3.4 GB/s (14.73x)

In every case, quick-strings is the one that is slower. Here's another
way to slice the data. Let's only look at benchmarks with non-teeny
haystacks:

[andrew@duff memchr]$ rebar cmp /tmp/results.csv --intersection -e oneshot -F teeny
benchmark                                                   rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                                   --------------------------  ---------------------------------  -----------------------
memmem/code/rust-library-never-fn-strength                  48.9 GB/s (1.00x)           1620.4 MB/s (30.87x)               1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren            48.3 GB/s (1.00x)           1637.2 MB/s (30.21x)               1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux                      29.4 GB/s (1.00x)           1637.2 MB/s (18.42x)               1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str                   48.9 GB/s (1.00x)           5.8 GB/s (8.40x)                   1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash                        46.8 GB/s (1.00x)           1190.6 MB/s (40.27x)               8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash                      46.7 GB/s (1.00x)           1196.0 MB/s (39.95x)               10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky               51.2 GB/s (1.00x)           1984.3 MB/s (26.44x)               1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky              21.2 GB/s (1.00x)           1936.4 MB/s (11.20x)               1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet           5.4 GB/s (1.00x)            1069.8 MB/s (5.18x)                4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.37x)           1167.3 MB/s (23.04x)               26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  991.6 MB/s (1.00x)          22.9 MB/s (43.36x)                 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson                  39.6 GB/s (1.00x)           1930.5 MB/s (21.03x)               1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes             43.9 GB/s (1.00x)           1377.7 MB/s (32.63x)               2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes              50.3 GB/s (1.00x)           1969.1 MB/s (26.17x)               3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space                    41.4 GB/s (1.00x)           727.8 MB/s (58.28x)                1158.8 MB/s (36.61x)
memmem/subtitles/never/huge-ru-john-watson                  51.3 GB/s (1.00x)           475.6 MB/s (110.41x)               2.2 GB/s (23.42x)
memmem/subtitles/never/huge-zh-john-watson                  45.1 GB/s (1.00x)           1457.1 MB/s (31.69x)               6.5 GB/s (6.90x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.2 GB/s (1.00x)           1907.4 MB/s (27.48x)               2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock                      37.9 GB/s (1.00x)           1907.0 MB/s (20.33x)               2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle                 48.0 GB/s (1.00x)           1358.2 MB/s (36.22x)               3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle                   43.2 GB/s (1.00x)           1715.0 MB/s (25.78x)               6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle                   41.8 GB/s (1.00x)           1915.9 MB/s (22.37x)               7.0 GB/s (5.97x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               40.1 GB/s (1.00x)           479.5 MB/s (85.61x)                2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock                      51.1 GB/s (1.00x)           479.5 MB/s (109.22x)               1733.9 MB/s (30.21x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               49.0 GB/s (1.00x)           1080.8 MB/s (46.38x)               4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock                      49.9 GB/s (1.00x)           1075.2 MB/s (47.56x)               3.4 GB/s (14.73x)

Again, quick-strings is slower in every case. Now let's look at
benchmarks with only teeny haystacks (the opposite of the above):

$ rebar cmp /tmp/results.csv --intersection -e oneshot -f teeny
benchmark                                         rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                         --------------------------  ---------------------------------  -----------------------
memmem/subtitles/never/teeny-en-john-watson       1027.0 MB/s (1.44x)         1483.5 MB/s (1.00x)                920.8 MB/s (1.61x)
memmem/subtitles/never/teeny-en-all-common-bytes  989.0 MB/s (1.50x)          1483.5 MB/s (1.00x)                1027.0 MB/s (1.44x)
memmem/subtitles/never/teeny-en-some-rare-bytes   1027.0 MB/s (1.30x)         1335.1 MB/s (1.00x)                1161.0 MB/s (1.15x)
memmem/subtitles/never/teeny-en-two-space         989.0 MB/s (1.12x)          1112.6 MB/s (1.00x)                953.7 MB/s (1.17x)
memmem/subtitles/never/teeny-ru-john-watson       1178.1 MB/s (1.42x)         1668.9 MB/s (1.00x)                1001.4 MB/s (1.67x)
memmem/subtitles/never/teeny-zh-john-watson       1095.0 MB/s (1.50x)         1642.4 MB/s (1.00x)                953.7 MB/s (1.72x)
memmem/subtitles/rare/teeny-en-sherlock-holmes    989.0 MB/s (1.35x)          1335.1 MB/s (1.00x)                387.0 MB/s (3.45x)
memmem/subtitles/rare/teeny-en-sherlock           1112.6 MB/s (1.33x)         1483.5 MB/s (1.00x)                606.9 MB/s (2.44x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes    1112.6 MB/s (1.29x)         1430.5 MB/s (1.00x)                408.7 MB/s (3.50x)
memmem/subtitles/rare/teeny-ru-sherlock           1292.1 MB/s (1.29x)         1668.9 MB/s (1.00x)                656.6 MB/s (2.54x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes    985.5 MB/s (1.50x)          1478.2 MB/s (1.00x)                343.8 MB/s (4.30x)
memmem/subtitles/rare/teeny-zh-sherlock           1182.6 MB/s (1.56x)         1847.7 MB/s (1.00x)                579.7 MB/s (3.19x)

In this case, quick-strings is actually faster in every case!

My contextualization of these results is that quick-strings
implements a "naive" substring search algorithm that has worst case
O(m * n) time but very low startup costs. This means that if you
pick your benchmarks correctly, it can appear to be a lot faster
in some common cases. But once you extend beyond that or get large
haystacks, perf can really tank pretty considerably.

Substring search implementations have a lot of different cases to
cover. memchr in particular does try very hard to optimize for the
kinds of cases where quick-strings is fast, and this is likely why
it fairs a bit better than std does above on the teeny haystacks.
(For example, memchr is only 1.5x slower than quick-strings on
teeny-zh-sherlock-holmes, but std is 4.3x slower.)

The reason why it's difficult to get as fast as quick-strings (or
pretty much any naive implementation) is that there's a bunch of startup
overhead to pay for in a substring search implementation that uses
multiple different strategies. For example, memchr will use SIMD in
some cases, but only when the CPU supports it (via runtime querying).
And memchr will change which implementation it uses based on haystack
length. So there is some amount of case analysis being done before the
search can even begin to execute. In contrast, quick-strings can just
launch right into the search.

With that said, there's always room for improvement! memchr isn't
necessarily at the floor of an otherwise irreducible trade-off, but I
think that so long as you want to get high throughput for big haystacks
(like what ripgrep really wants to do), you probably need to at least
partially sell out some other types of tasks for very very small
haystacks.

This is in response to @samuelcolvin's [tweet] about Rust's standard
library substring search routines being slow. This PR hooks up his naive
implementation of substring search to memchr's [rebar] harness and bakes
them off against the substring search implementations in `std` and
`memchr` across a wide variety of benchmarks.

I ran the benchmarks like so:

```
$ rebar build
$ rebar measure \
    -e quick-strings \
    -e rust/std/memmem/oneshot \
    -e rust/std/memmem/prebuilt \
    -e rust/memchr/memmem/oneshot \
    -e rust/memchr/memmem/prebuilt | tee /tmp/results.csv
```

And here is a high level comparison using the geometric mean of
relative speed ratios across all measured benchmarks (limited only
to those on which all of std's, memchr's and quick-string's code is
measured):

```
$ rebar rank --intersection /tmp/results.csv
Engine                             Version                   Geometric mean of speed ratios  Benchmark count
------                             -------                   ------------------------------  ---------------
rust/memchr/memmem/prebuilt        2.7.4                     1.05                            38
rust/memchr/memmem/oneshot         2.7.4                     1.22                            38
rust/std/memmem/prebuilt           1.80.0-nightly b92758a9a  7.02                            38
rust/std/memmem/oneshot            1.80.0-nightly b92758a9a  7.48                            38
rust/quick-strings/memmem/oneshot  0.0.1                     11.22                           38
```

The "oneshot" benchmarks are the closest apples-to-apples comparison.
Namely, they measure the time it takes to do a search of a needle in a
haystack with a single function call. The "prebuilt" benchmarks, on the
other hand, are allowed to accept a needle, build a searcher and then
reuse that search for subsequent use. The "prebuilt" benchmarks
demonstrate the advantages of a superior API when the needle is
invariant. (For example, "find all occurrences of this needle in the
lines of a file.")

For our purposes, and so we can narrow things down a bit, let's stick to
"oneshot" benchmarks:

```
$ rebar rank /tmp/results.csv --intersection -e oneshot
Engine                             Version                   Geometric mean of speed ratios  Benchmark count
------                             -------                   ------------------------------  ---------------
rust/memchr/memmem/oneshot         2.7.4                     1.12                            38
rust/std/memmem/oneshot            1.80.0-nightly b92758a9a  6.87                            38
rust/quick-strings/memmem/oneshot  0.0.1                     10.31                           38
```

Here is a breakdown on the level of the individual benchmark:

```
$ rebar cmp /tmp/results.csv --intersection -e oneshot
benchmark                                                   rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                                   --------------------------  ---------------------------------  -----------------------
memmem/code/rust-library-never-fn-strength                  48.9 GB/s (1.00x)           1620.4 MB/s (30.87x)               1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren            48.3 GB/s (1.00x)           1637.2 MB/s (30.21x)               1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux                      29.4 GB/s (1.00x)           1637.2 MB/s (18.42x)               1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str                   48.9 GB/s (1.00x)           5.8 GB/s (8.40x)                   1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash                        46.8 GB/s (1.00x)           1190.6 MB/s (40.27x)               8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash                      46.7 GB/s (1.00x)           1196.0 MB/s (39.95x)               10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky               51.2 GB/s (1.00x)           1984.3 MB/s (26.44x)               1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky              21.2 GB/s (1.00x)           1936.4 MB/s (11.20x)               1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet           5.4 GB/s (1.00x)            1069.8 MB/s (5.18x)                4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.37x)           1167.3 MB/s (23.04x)               26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  991.6 MB/s (1.00x)          22.9 MB/s (43.36x)                 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson                  39.6 GB/s (1.00x)           1930.5 MB/s (21.03x)               1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes             43.9 GB/s (1.00x)           1377.7 MB/s (32.63x)               2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes              50.3 GB/s (1.00x)           1969.1 MB/s (26.17x)               3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space                    41.4 GB/s (1.00x)           727.8 MB/s (58.28x)                1158.8 MB/s (36.61x)
memmem/subtitles/never/teeny-en-john-watson                 1027.0 MB/s (1.44x)         1483.5 MB/s (1.00x)                920.8 MB/s (1.61x)
memmem/subtitles/never/teeny-en-all-common-bytes            989.0 MB/s (1.50x)          1483.5 MB/s (1.00x)                1027.0 MB/s (1.44x)
memmem/subtitles/never/teeny-en-some-rare-bytes             1027.0 MB/s (1.30x)         1335.1 MB/s (1.00x)                1161.0 MB/s (1.15x)
memmem/subtitles/never/teeny-en-two-space                   989.0 MB/s (1.12x)          1112.6 MB/s (1.00x)                953.7 MB/s (1.17x)
memmem/subtitles/never/huge-ru-john-watson                  51.3 GB/s (1.00x)           475.6 MB/s (110.41x)               2.2 GB/s (23.42x)
memmem/subtitles/never/teeny-ru-john-watson                 1178.1 MB/s (1.42x)         1668.9 MB/s (1.00x)                1001.4 MB/s (1.67x)
memmem/subtitles/never/huge-zh-john-watson                  45.1 GB/s (1.00x)           1457.1 MB/s (31.69x)               6.5 GB/s (6.90x)
memmem/subtitles/never/teeny-zh-john-watson                 1095.0 MB/s (1.50x)         1642.4 MB/s (1.00x)                953.7 MB/s (1.72x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.2 GB/s (1.00x)           1907.4 MB/s (27.48x)               2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock                      37.9 GB/s (1.00x)           1907.0 MB/s (20.33x)               2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle                 48.0 GB/s (1.00x)           1358.2 MB/s (36.22x)               3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle                   43.2 GB/s (1.00x)           1715.0 MB/s (25.78x)               6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle                   41.8 GB/s (1.00x)           1915.9 MB/s (22.37x)               7.0 GB/s (5.97x)
memmem/subtitles/rare/teeny-en-sherlock-holmes              989.0 MB/s (1.35x)          1335.1 MB/s (1.00x)                387.0 MB/s (3.45x)
memmem/subtitles/rare/teeny-en-sherlock                     1112.6 MB/s (1.33x)         1483.5 MB/s (1.00x)                606.9 MB/s (2.44x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               40.1 GB/s (1.00x)           479.5 MB/s (85.61x)                2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock                      51.1 GB/s (1.00x)           479.5 MB/s (109.22x)               1733.9 MB/s (30.21x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes              1112.6 MB/s (1.29x)         1430.5 MB/s (1.00x)                408.7 MB/s (3.50x)
memmem/subtitles/rare/teeny-ru-sherlock                     1292.1 MB/s (1.29x)         1668.9 MB/s (1.00x)                656.6 MB/s (2.54x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               49.0 GB/s (1.00x)           1080.8 MB/s (46.38x)               4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock                      49.9 GB/s (1.00x)           1075.2 MB/s (47.56x)               3.4 GB/s (14.73x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes              985.5 MB/s (1.50x)          1478.2 MB/s (1.00x)                343.8 MB/s (4.30x)
memmem/subtitles/rare/teeny-zh-sherlock                     1182.6 MB/s (1.56x)         1847.7 MB/s (1.00x)                579.7 MB/s (3.19x)
```

Even in comparison to `memchr`, which has been optimized like crazy,
`quick-strings` actually does better on some benchmarks! But... if we
ask for only results where there is, say, a very large difference, watch
what happens:

```
$ rebar cmp /tmp/results.csv --intersection -e oneshot -t 5
benchmark                                                   rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                                   --------------------------  ---------------------------------  -----------------------
memmem/code/rust-library-never-fn-strength                  48.9 GB/s (1.00x)           1620.4 MB/s (30.87x)               1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren            48.3 GB/s (1.00x)           1637.2 MB/s (30.21x)               1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux                      29.4 GB/s (1.00x)           1637.2 MB/s (18.42x)               1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str                   48.9 GB/s (1.00x)           5.8 GB/s (8.40x)                   1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash                        46.8 GB/s (1.00x)           1190.6 MB/s (40.27x)               8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash                      46.7 GB/s (1.00x)           1196.0 MB/s (39.95x)               10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky               51.2 GB/s (1.00x)           1984.3 MB/s (26.44x)               1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky              21.2 GB/s (1.00x)           1936.4 MB/s (11.20x)               1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet           5.4 GB/s (1.00x)            1069.8 MB/s (5.18x)                4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.37x)           1167.3 MB/s (23.04x)               26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  991.6 MB/s (1.00x)          22.9 MB/s (43.36x)                 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson                  39.6 GB/s (1.00x)           1930.5 MB/s (21.03x)               1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes             43.9 GB/s (1.00x)           1377.7 MB/s (32.63x)               2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes              50.3 GB/s (1.00x)           1969.1 MB/s (26.17x)               3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space                    41.4 GB/s (1.00x)           727.8 MB/s (58.28x)                1158.8 MB/s (36.61x)
memmem/subtitles/never/huge-ru-john-watson                  51.3 GB/s (1.00x)           475.6 MB/s (110.41x)               2.2 GB/s (23.42x)
memmem/subtitles/never/huge-zh-john-watson                  45.1 GB/s (1.00x)           1457.1 MB/s (31.69x)               6.5 GB/s (6.90x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.2 GB/s (1.00x)           1907.4 MB/s (27.48x)               2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock                      37.9 GB/s (1.00x)           1907.0 MB/s (20.33x)               2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle                 48.0 GB/s (1.00x)           1358.2 MB/s (36.22x)               3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle                   43.2 GB/s (1.00x)           1715.0 MB/s (25.78x)               6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle                   41.8 GB/s (1.00x)           1915.9 MB/s (22.37x)               7.0 GB/s (5.97x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               40.1 GB/s (1.00x)           479.5 MB/s (85.61x)                2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock                      51.1 GB/s (1.00x)           479.5 MB/s (109.22x)               1733.9 MB/s (30.21x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               49.0 GB/s (1.00x)           1080.8 MB/s (46.38x)               4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock                      49.9 GB/s (1.00x)           1075.2 MB/s (47.56x)               3.4 GB/s (14.73x)
```

In every case, `quick-strings` is the one that is slower. Here's another
way to slice the data. Let's _only_ look at benchmarks with non-teeny
haystacks:

```
[andrew@duff memchr]$ rebar cmp /tmp/results.csv --intersection -e oneshot -F teeny
benchmark                                                   rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                                   --------------------------  ---------------------------------  -----------------------
memmem/code/rust-library-never-fn-strength                  48.9 GB/s (1.00x)           1620.4 MB/s (30.87x)               1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren            48.3 GB/s (1.00x)           1637.2 MB/s (30.21x)               1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux                      29.4 GB/s (1.00x)           1637.2 MB/s (18.42x)               1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str                   48.9 GB/s (1.00x)           5.8 GB/s (8.40x)                   1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash                        46.8 GB/s (1.00x)           1190.6 MB/s (40.27x)               8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash                      46.7 GB/s (1.00x)           1196.0 MB/s (39.95x)               10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky               51.2 GB/s (1.00x)           1984.3 MB/s (26.44x)               1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky              21.2 GB/s (1.00x)           1936.4 MB/s (11.20x)               1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet           5.4 GB/s (1.00x)            1069.8 MB/s (5.18x)                4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.37x)           1167.3 MB/s (23.04x)               26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  991.6 MB/s (1.00x)          22.9 MB/s (43.36x)                 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson                  39.6 GB/s (1.00x)           1930.5 MB/s (21.03x)               1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes             43.9 GB/s (1.00x)           1377.7 MB/s (32.63x)               2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes              50.3 GB/s (1.00x)           1969.1 MB/s (26.17x)               3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space                    41.4 GB/s (1.00x)           727.8 MB/s (58.28x)                1158.8 MB/s (36.61x)
memmem/subtitles/never/huge-ru-john-watson                  51.3 GB/s (1.00x)           475.6 MB/s (110.41x)               2.2 GB/s (23.42x)
memmem/subtitles/never/huge-zh-john-watson                  45.1 GB/s (1.00x)           1457.1 MB/s (31.69x)               6.5 GB/s (6.90x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.2 GB/s (1.00x)           1907.4 MB/s (27.48x)               2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock                      37.9 GB/s (1.00x)           1907.0 MB/s (20.33x)               2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle                 48.0 GB/s (1.00x)           1358.2 MB/s (36.22x)               3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle                   43.2 GB/s (1.00x)           1715.0 MB/s (25.78x)               6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle                   41.8 GB/s (1.00x)           1915.9 MB/s (22.37x)               7.0 GB/s (5.97x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               40.1 GB/s (1.00x)           479.5 MB/s (85.61x)                2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock                      51.1 GB/s (1.00x)           479.5 MB/s (109.22x)               1733.9 MB/s (30.21x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               49.0 GB/s (1.00x)           1080.8 MB/s (46.38x)               4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock                      49.9 GB/s (1.00x)           1075.2 MB/s (47.56x)               3.4 GB/s (14.73x)
```

Again, `quick-strings` is slower in every case. Now let's look at
benchmarks with only teeny haystacks (the opposite of the above):

```
$ rebar cmp /tmp/results.csv --intersection -e oneshot -f teeny
benchmark                                         rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                         --------------------------  ---------------------------------  -----------------------
memmem/subtitles/never/teeny-en-john-watson       1027.0 MB/s (1.44x)         1483.5 MB/s (1.00x)                920.8 MB/s (1.61x)
memmem/subtitles/never/teeny-en-all-common-bytes  989.0 MB/s (1.50x)          1483.5 MB/s (1.00x)                1027.0 MB/s (1.44x)
memmem/subtitles/never/teeny-en-some-rare-bytes   1027.0 MB/s (1.30x)         1335.1 MB/s (1.00x)                1161.0 MB/s (1.15x)
memmem/subtitles/never/teeny-en-two-space         989.0 MB/s (1.12x)          1112.6 MB/s (1.00x)                953.7 MB/s (1.17x)
memmem/subtitles/never/teeny-ru-john-watson       1178.1 MB/s (1.42x)         1668.9 MB/s (1.00x)                1001.4 MB/s (1.67x)
memmem/subtitles/never/teeny-zh-john-watson       1095.0 MB/s (1.50x)         1642.4 MB/s (1.00x)                953.7 MB/s (1.72x)
memmem/subtitles/rare/teeny-en-sherlock-holmes    989.0 MB/s (1.35x)          1335.1 MB/s (1.00x)                387.0 MB/s (3.45x)
memmem/subtitles/rare/teeny-en-sherlock           1112.6 MB/s (1.33x)         1483.5 MB/s (1.00x)                606.9 MB/s (2.44x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes    1112.6 MB/s (1.29x)         1430.5 MB/s (1.00x)                408.7 MB/s (3.50x)
memmem/subtitles/rare/teeny-ru-sherlock           1292.1 MB/s (1.29x)         1668.9 MB/s (1.00x)                656.6 MB/s (2.54x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes    985.5 MB/s (1.50x)          1478.2 MB/s (1.00x)                343.8 MB/s (4.30x)
memmem/subtitles/rare/teeny-zh-sherlock           1182.6 MB/s (1.56x)         1847.7 MB/s (1.00x)                579.7 MB/s (3.19x)
```

In this case, `quick-strings` is actually faster in _every_ case!

My contextualization of these results is that `quick-strings`
implements a "naive" substring search algorithm that has worst case
`O(m * n)` time but *very low* startup costs. This means that if you
pick your benchmarks correctly, it can appear to be a lot faster
in some common cases. But once you extend beyond that or get large
haystacks, perf can really tank pretty considerably.

Substring search implementations have a lot of different cases to
cover. `memchr` in particular does try very hard to optimize for the
kinds of cases where `quick-strings` is fast, and this is likely why
it fairs a bit better than `std` does above on the teeny haystacks.
(For example, `memchr` is only 1.5x slower than `quick-strings` on
`teeny-zh-sherlock-holmes`, but std is 4.3x slower.)

The reason why it's difficult to get as fast as `quick-strings` (or
pretty much any naive implementation) is that there's a bunch of startup
overhead to pay for in a substring search implementation that uses
multiple different strategies. For example, `memchr` will use SIMD in
some cases, but only when the CPU supports it (via runtime querying).
And `memchr` will change which implementation it uses based on haystack
length. So there is some amount of case analysis being done before the
search can even begin to execute. In contrast, `quick-strings` can just
launch right into the search.

With that said, there's always room for improvement! `memchr` isn't
necessarily at the floor of an otherwise irreducible trade-off, but I
think that so long as you want to get high throughput for big haystacks
(like what ripgrep really wants to do), you probably need to at least
partially sell out some other types of tasks for very very small
haystacks.

[tweet]: https://x.com/samuel_colvin/status/1815453691480338692
[rebar]: https://github.com/BurntSushi/rebar
@samuelcolvin
Copy link

This is fascinating, thank you so much.

I'll look now soon.

Perhaps we can just use memchr for arrow-rs.

It's worth noting that wrt LIKE and ILIKE queries:

  1. Most worries have quite a short needle
  2. 99.x% of rows will often be misses, so the cost of a match doesn't matter much.

@BurntSushi
Copy link
Owner Author

memchr would be a good bet. If you're doing LIKE-style queries, then you should be able to build a memchr::memmem::Finder once and reuse it. And if you do that, it does much better on the short haystacks cases, even when comparing with quick-strings:

[andrew@duff memchr]$ rebar cmp /tmp/results.csv --intersection -e quick-strings -e memchr/memmem/prebuilt
benchmark                                                   rust/memchr/memmem/prebuilt  rust/quick-strings/memmem/oneshot
---------                                                   ---------------------------  ---------------------------------
memmem/code/rust-library-never-fn-strength                  49.0 GB/s (1.00x)            1620.4 MB/s (30.94x)
memmem/code/rust-library-never-fn-strength-paren            48.9 GB/s (1.00x)            1637.2 MB/s (30.59x)
memmem/code/rust-library-never-fn-quux                      36.5 GB/s (1.00x)            1637.2 MB/s (22.86x)
memmem/code/rust-library-rare-fn-from-str                   48.9 GB/s (1.00x)            5.8 GB/s (8.40x)
memmem/pathological/md5-huge-no-hash                        40.3 GB/s (1.00x)            1190.6 MB/s (34.63x)
memmem/pathological/md5-huge-last-hash                      47.3 GB/s (1.00x)            1196.0 MB/s (40.49x)
memmem/pathological/rare-repeated-huge-tricky               51.3 GB/s (1.00x)            1984.3 MB/s (26.47x)
memmem/pathological/rare-repeated-small-tricky              27.4 GB/s (1.00x)            1936.4 MB/s (14.50x)
memmem/pathological/defeat-simple-vector-alphabet           5.4 GB/s (1.00x)             1069.8 MB/s (5.18x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.00x)            1167.3 MB/s (16.87x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  991.9 MB/s (1.00x)           22.9 MB/s (43.38x)
memmem/subtitles/never/huge-en-john-watson                  40.2 GB/s (1.00x)            1930.5 MB/s (21.34x)
memmem/subtitles/never/huge-en-all-common-bytes             45.6 GB/s (1.00x)            1377.7 MB/s (33.91x)
memmem/subtitles/never/huge-en-some-rare-bytes              49.2 GB/s (1.00x)            1969.1 MB/s (25.59x)
memmem/subtitles/never/huge-en-two-space                    41.6 GB/s (1.00x)            727.8 MB/s (58.53x)
memmem/subtitles/never/teeny-en-john-watson                 1668.9 MB/s (1.00x)          1483.5 MB/s (1.12x)
memmem/subtitles/never/teeny-en-all-common-bytes            1668.9 MB/s (1.00x)          1483.5 MB/s (1.12x)
memmem/subtitles/never/teeny-en-some-rare-bytes             1668.9 MB/s (1.00x)          1335.1 MB/s (1.25x)
memmem/subtitles/never/teeny-en-two-space                   1668.9 MB/s (1.00x)          1112.6 MB/s (1.50x)
memmem/subtitles/never/huge-ru-john-watson                  40.3 GB/s (1.00x)            475.6 MB/s (86.68x)
memmem/subtitles/never/teeny-ru-john-watson                 2.6 GB/s (1.00x)             1668.9 MB/s (1.60x)
memmem/subtitles/never/huge-zh-john-watson                  49.8 GB/s (1.00x)            1457.1 MB/s (35.00x)
memmem/subtitles/never/teeny-zh-john-watson                 1847.7 MB/s (1.00x)          1642.4 MB/s (1.12x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.4 GB/s (1.00x)            1907.4 MB/s (27.58x)
memmem/subtitles/rare/huge-en-sherlock                      50.8 GB/s (1.00x)            1907.0 MB/s (27.26x)
memmem/subtitles/rare/huge-en-medium-needle                 36.9 GB/s (1.00x)            1358.2 MB/s (27.84x)
memmem/subtitles/rare/huge-en-long-needle                   43.2 GB/s (1.00x)            1715.0 MB/s (25.82x)
memmem/subtitles/rare/huge-en-huge-needle                   39.3 GB/s (1.00x)            1915.9 MB/s (21.03x)
memmem/subtitles/rare/teeny-en-sherlock-holmes              1570.8 MB/s (1.00x)          1335.1 MB/s (1.18x)
memmem/subtitles/rare/teeny-en-sherlock                     1335.1 MB/s (1.11x)          1483.5 MB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               50.3 GB/s (1.00x)            479.5 MB/s (107.49x)
memmem/subtitles/rare/huge-ru-sherlock                      51.2 GB/s (1.00x)            479.5 MB/s (109.32x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes              2.1 GB/s (1.00x)             1430.5 MB/s (1.47x)
memmem/subtitles/rare/teeny-ru-sherlock                     1741.5 MB/s (1.00x)          1668.9 MB/s (1.04x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               49.1 GB/s (1.00x)            1080.8 MB/s (46.50x)
memmem/subtitles/rare/huge-zh-sherlock                      50.0 GB/s (1.00x)            1075.2 MB/s (47.60x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes              1137.1 MB/s (1.30x)          1478.2 MB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock                     1231.8 MB/s (1.50x)          1847.7 MB/s (1.00x)

std doesn't expose any sort of analogous API for a prebuilt searcher based on just the needle.

That said, memchr doesn't do case insensitive search. That's a different can of worms, especially if you care about Unicode support. A regex::Regex supports it, but there's a lot more machinery there and its overhead is even bigger than memchr::memmem. If you only need ASCII case insensitivity, then there is AhoCorasickBuilder::ascii_case_insensitive, although I don't believe that's had a ton of optimization work put into it, so YMMV.

@samuelcolvin
Copy link

Thanks @BurntSushi,

That said, memchr doesn't do case insensitive search.

Here arrow-rs already falls back to using regex::Regex, the idea for quick-strings is for the very common cases where a LIKE or ILIKE query can be represented as starts_with, ends_with, contains, istarts_with, iends_with or icontains.

For that reason, we only need ascii case insensitive methods — if the needle is case insensitive, we use that methods, otherwise we can just fallback to using regex::Regex.

Ascii needles make up (in my experience) a very large proportion of queries, hence the value in optimising that case.

Is there any interest in adding ascii only case-insensitive methods to memchr? I would really like to use just memchr and not need to use any mof the logic from quick-strings.

@samuelcolvin
Copy link

cc @alamb who I think might be interested in this discussion.

@BurntSushi
Copy link
Owner Author

The "search in ASCII case insensitive mode" is something that I think would probably have a good home in this crate, but it is a big scope increase. That is, it's something that I do think would be nice to support, but it's a tall order. Of course, an initial implementation could be naive (as long as it's not naive algorithmically, i.e., optimal worst case behavior) I suppose. But there are likely SIMD tricks specific to ASCII case insensitive search that we'll want to employ. That's probably easily ~1 month of my free coding time in the evenings.

But yeah, your strategy makes sense. I suspect what you're fighting is actually just regex overhead itself. The regex crate is, quite literally, doing exactly the optimization you're trying to do manually. The actual regex engine won't be used at all for simple ASCII-only literal queries. It will just defer to this crate. But there are abstraction layers it has to cut through at search time... Maybe there is a way to improve those, but I suspect there will always be cases like this where it will be faster for the caller to do this sort of optimization on their own unfortunately. If you were searching larger haystacks, then you wouldn't have to worry about this at all.

@alamb
Copy link

alamb commented Jul 23, 2024

An approach we have taken elsewhere in arrow-rs is to literally special case the different implementations

So maybe we could do something like this:

if haystack.len() < 1024 { // or whatever the constant happens to be
  // use small-strings implementation
} else {
  // use more general purpose std or memchr etc
}

This would of course result in more code / bigger binary

@alamb
Copy link

alamb commented Jul 23, 2024

. The "prebuilt" benchmarks, on the other hand, are allowed to accept a needle, build a searcher and then reuse that search for subsequent use. The "prebuilt" benchmarks demonstrate the advantages of a superior API when the needle is invariant.

I think this is an excellent observation @BurntSushi

@samuelcolvin maybe this is a good hint that we could make the arrow-rs substring kernels behave more like the regex kernes and "compile / prepare" the matcher

Given the kernels typically operate on batches of 4000 or 8000ish rows we can probably amortize the "prepare the needle" step well

@samuelcolvin
Copy link

samuelcolvin commented Jul 26, 2024

See apache/arrow-rs#6118, apache/arrow-rs#6128 and apache/arrow-rs#6131 for implementation in arrow-rs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants