benchmarks: add ad hoc benchmark for 'quick-strings' #156

BurntSushi · 2024-07-22T19:48:47Z

This is in response to @samuelcolvin's tweet about Rust's standard
library substring search routines being slow. This PR hooks up his naive
implementation of substring search to memchr's rebar harness and bakes
them off against the substring search implementations in std and
memchr across a wide variety of benchmarks.

I ran the benchmarks like so:

$ rebar build
$ rebar measure \
    -e quick-strings \
    -e rust/std/memmem/oneshot \
    -e rust/std/memmem/prebuilt \
    -e rust/memchr/memmem/oneshot \
    -e rust/memchr/memmem/prebuilt | tee /tmp/results.csv

And here is a high level comparison using the geometric mean of
relative speed ratios across all measured benchmarks (limited only
to those on which all of std's, memchr's and quick-string's code is
measured):

$ rebar rank --intersection /tmp/results.csv
Engine                             Version                   Geometric mean of speed ratios  Benchmark count
------                             -------                   ------------------------------  ---------------
rust/memchr/memmem/prebuilt        2.7.4                     1.05                            38
rust/memchr/memmem/oneshot         2.7.4                     1.22                            38
rust/std/memmem/prebuilt           1.80.0-nightly b92758a9a  7.02                            38
rust/std/memmem/oneshot            1.80.0-nightly b92758a9a  7.48                            38
rust/quick-strings/memmem/oneshot  0.0.1                     11.22                           38

The "oneshot" benchmarks are the closest apples-to-apples comparison.
Namely, they measure the time it takes to do a search of a needle in a
haystack with a single function call. The "prebuilt" benchmarks, on the
other hand, are allowed to accept a needle, build a searcher and then
reuse that search for subsequent use. The "prebuilt" benchmarks
demonstrate the advantages of a superior API when the needle is
invariant. (For example, "find all occurrences of this needle in the
lines of a file.")

For our purposes, and so we can narrow things down a bit, let's stick to
"oneshot" benchmarks:

$ rebar rank /tmp/results.csv --intersection -e oneshot
Engine                             Version                   Geometric mean of speed ratios  Benchmark count
------                             -------                   ------------------------------  ---------------
rust/memchr/memmem/oneshot         2.7.4                     1.12                            38
rust/std/memmem/oneshot            1.80.0-nightly b92758a9a  6.87                            38
rust/quick-strings/memmem/oneshot  0.0.1                     10.31                           38

Here is a breakdown on the level of the individual benchmark:

$ rebar cmp /tmp/results.csv --intersection -e oneshot
benchmark                                                   rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                                   --------------------------  ---------------------------------  -----------------------
memmem/code/rust-library-never-fn-strength                  48.9 GB/s (1.00x)           1620.4 MB/s (30.87x)               1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren            48.3 GB/s (1.00x)           1637.2 MB/s (30.21x)               1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux                      29.4 GB/s (1.00x)           1637.2 MB/s (18.42x)               1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str                   48.9 GB/s (1.00x)           5.8 GB/s (8.40x)                   1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash                        46.8 GB/s (1.00x)           1190.6 MB/s (40.27x)               8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash                      46.7 GB/s (1.00x)           1196.0 MB/s (39.95x)               10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky               51.2 GB/s (1.00x)           1984.3 MB/s (26.44x)               1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky              21.2 GB/s (1.00x)           1936.4 MB/s (11.20x)               1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet           5.4 GB/s (1.00x)            1069.8 MB/s (5.18x)                4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.37x)           1167.3 MB/s (23.04x)               26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  991.6 MB/s (1.00x)          22.9 MB/s (43.36x)                 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson                  39.6 GB/s (1.00x)           1930.5 MB/s (21.03x)               1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes             43.9 GB/s (1.00x)           1377.7 MB/s (32.63x)               2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes              50.3 GB/s (1.00x)           1969.1 MB/s (26.17x)               3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space                    41.4 GB/s (1.00x)           727.8 MB/s (58.28x)                1158.8 MB/s (36.61x)
memmem/subtitles/never/teeny-en-john-watson                 1027.0 MB/s (1.44x)         1483.5 MB/s (1.00x)                920.8 MB/s (1.61x)
memmem/subtitles/never/teeny-en-all-common-bytes            989.0 MB/s (1.50x)          1483.5 MB/s (1.00x)                1027.0 MB/s (1.44x)
memmem/subtitles/never/teeny-en-some-rare-bytes             1027.0 MB/s (1.30x)         1335.1 MB/s (1.00x)                1161.0 MB/s (1.15x)
memmem/subtitles/never/teeny-en-two-space                   989.0 MB/s (1.12x)          1112.6 MB/s (1.00x)                953.7 MB/s (1.17x)
memmem/subtitles/never/huge-ru-john-watson                  51.3 GB/s (1.00x)           475.6 MB/s (110.41x)               2.2 GB/s (23.42x)
memmem/subtitles/never/teeny-ru-john-watson                 1178.1 MB/s (1.42x)         1668.9 MB/s (1.00x)                1001.4 MB/s (1.67x)
memmem/subtitles/never/huge-zh-john-watson                  45.1 GB/s (1.00x)           1457.1 MB/s (31.69x)               6.5 GB/s (6.90x)
memmem/subtitles/never/teeny-zh-john-watson                 1095.0 MB/s (1.50x)         1642.4 MB/s (1.00x)                953.7 MB/s (1.72x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.2 GB/s (1.00x)           1907.4 MB/s (27.48x)               2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock                      37.9 GB/s (1.00x)           1907.0 MB/s (20.33x)               2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle                 48.0 GB/s (1.00x)           1358.2 MB/s (36.22x)               3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle                   43.2 GB/s (1.00x)           1715.0 MB/s (25.78x)               6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle                   41.8 GB/s (1.00x)           1915.9 MB/s (22.37x)               7.0 GB/s (5.97x)
memmem/subtitles/rare/teeny-en-sherlock-holmes              989.0 MB/s (1.35x)          1335.1 MB/s (1.00x)                387.0 MB/s (3.45x)
memmem/subtitles/rare/teeny-en-sherlock                     1112.6 MB/s (1.33x)         1483.5 MB/s (1.00x)                606.9 MB/s (2.44x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               40.1 GB/s (1.00x)           479.5 MB/s (85.61x)                2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock                      51.1 GB/s (1.00x)           479.5 MB/s (109.22x)               1733.9 MB/s (30.21x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes              1112.6 MB/s (1.29x)         1430.5 MB/s (1.00x)                408.7 MB/s (3.50x)
memmem/subtitles/rare/teeny-ru-sherlock                     1292.1 MB/s (1.29x)         1668.9 MB/s (1.00x)                656.6 MB/s (2.54x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               49.0 GB/s (1.00x)           1080.8 MB/s (46.38x)               4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock                      49.9 GB/s (1.00x)           1075.2 MB/s (47.56x)               3.4 GB/s (14.73x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes              985.5 MB/s (1.50x)          1478.2 MB/s (1.00x)                343.8 MB/s (4.30x)
memmem/subtitles/rare/teeny-zh-sherlock                     1182.6 MB/s (1.56x)         1847.7 MB/s (1.00x)                579.7 MB/s (3.19x)

Even in comparison to memchr, which has been optimized like crazy,
quick-strings actually does better on some benchmarks! But... if we
ask for only results where there is, say, a very large difference, watch
what happens:

$ rebar cmp /tmp/results.csv --intersection -e oneshot -t 5
benchmark                                                   rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                                   --------------------------  ---------------------------------  -----------------------
memmem/code/rust-library-never-fn-strength                  48.9 GB/s (1.00x)           1620.4 MB/s (30.87x)               1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren            48.3 GB/s (1.00x)           1637.2 MB/s (30.21x)               1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux                      29.4 GB/s (1.00x)           1637.2 MB/s (18.42x)               1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str                   48.9 GB/s (1.00x)           5.8 GB/s (8.40x)                   1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash                        46.8 GB/s (1.00x)           1190.6 MB/s (40.27x)               8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash                      46.7 GB/s (1.00x)           1196.0 MB/s (39.95x)               10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky               51.2 GB/s (1.00x)           1984.3 MB/s (26.44x)               1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky              21.2 GB/s (1.00x)           1936.4 MB/s (11.20x)               1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet           5.4 GB/s (1.00x)            1069.8 MB/s (5.18x)                4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.37x)           1167.3 MB/s (23.04x)               26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  991.6 MB/s (1.00x)          22.9 MB/s (43.36x)                 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson                  39.6 GB/s (1.00x)           1930.5 MB/s (21.03x)               1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes             43.9 GB/s (1.00x)           1377.7 MB/s (32.63x)               2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes              50.3 GB/s (1.00x)           1969.1 MB/s (26.17x)               3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space                    41.4 GB/s (1.00x)           727.8 MB/s (58.28x)                1158.8 MB/s (36.61x)
memmem/subtitles/never/huge-ru-john-watson                  51.3 GB/s (1.00x)           475.6 MB/s (110.41x)               2.2 GB/s (23.42x)
memmem/subtitles/never/huge-zh-john-watson                  45.1 GB/s (1.00x)           1457.1 MB/s (31.69x)               6.5 GB/s (6.90x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.2 GB/s (1.00x)           1907.4 MB/s (27.48x)               2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock                      37.9 GB/s (1.00x)           1907.0 MB/s (20.33x)               2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle                 48.0 GB/s (1.00x)           1358.2 MB/s (36.22x)               3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle                   43.2 GB/s (1.00x)           1715.0 MB/s (25.78x)               6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle                   41.8 GB/s (1.00x)           1915.9 MB/s (22.37x)               7.0 GB/s (5.97x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               40.1 GB/s (1.00x)           479.5 MB/s (85.61x)                2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock                      51.1 GB/s (1.00x)           479.5 MB/s (109.22x)               1733.9 MB/s (30.21x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               49.0 GB/s (1.00x)           1080.8 MB/s (46.38x)               4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock                      49.9 GB/s (1.00x)           1075.2 MB/s (47.56x)               3.4 GB/s (14.73x)

In every case, quick-strings is the one that is slower. Here's another
way to slice the data. Let's only look at benchmarks with non-teeny
haystacks:

[andrew@duff memchr]$ rebar cmp /tmp/results.csv --intersection -e oneshot -F teeny
benchmark                                                   rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                                   --------------------------  ---------------------------------  -----------------------
memmem/code/rust-library-never-fn-strength                  48.9 GB/s (1.00x)           1620.4 MB/s (30.87x)               1378.7 MB/s (36.28x)
memmem/code/rust-library-never-fn-strength-paren            48.3 GB/s (1.00x)           1637.2 MB/s (30.21x)               1482.8 MB/s (33.35x)
memmem/code/rust-library-never-fn-quux                      29.4 GB/s (1.00x)           1637.2 MB/s (18.42x)               1540.9 MB/s (19.57x)
memmem/code/rust-library-rare-fn-from-str                   48.9 GB/s (1.00x)           5.8 GB/s (8.40x)                   1496.9 MB/s (33.44x)
memmem/pathological/md5-huge-no-hash                        46.8 GB/s (1.00x)           1190.6 MB/s (40.27x)               8.7 GB/s (5.36x)
memmem/pathological/md5-huge-last-hash                      46.7 GB/s (1.00x)           1196.0 MB/s (39.95x)               10.5 GB/s (4.45x)
memmem/pathological/rare-repeated-huge-tricky               51.2 GB/s (1.00x)           1984.3 MB/s (26.44x)               1240.3 MB/s (42.30x)
memmem/pathological/rare-repeated-small-tricky              21.2 GB/s (1.00x)           1936.4 MB/s (11.20x)               1202.3 MB/s (18.05x)
memmem/pathological/defeat-simple-vector-alphabet           5.4 GB/s (1.00x)            1069.8 MB/s (5.18x)                4.4 GB/s (1.24x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.37x)           1167.3 MB/s (23.04x)               26.3 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  991.6 MB/s (1.00x)          22.9 MB/s (43.36x)                 826.4 MB/s (1.20x)
memmem/subtitles/never/huge-en-john-watson                  39.6 GB/s (1.00x)           1930.5 MB/s (21.03x)               1605.2 MB/s (25.29x)
memmem/subtitles/never/huge-en-all-common-bytes             43.9 GB/s (1.00x)           1377.7 MB/s (32.63x)               2.9 GB/s (15.12x)
memmem/subtitles/never/huge-en-some-rare-bytes              50.3 GB/s (1.00x)           1969.1 MB/s (26.17x)               3.2 GB/s (15.52x)
memmem/subtitles/never/huge-en-two-space                    41.4 GB/s (1.00x)           727.8 MB/s (58.28x)                1158.8 MB/s (36.61x)
memmem/subtitles/never/huge-ru-john-watson                  51.3 GB/s (1.00x)           475.6 MB/s (110.41x)               2.2 GB/s (23.42x)
memmem/subtitles/never/huge-zh-john-watson                  45.1 GB/s (1.00x)           1457.1 MB/s (31.69x)               6.5 GB/s (6.90x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.2 GB/s (1.00x)           1907.4 MB/s (27.48x)               2.5 GB/s (20.36x)
memmem/subtitles/rare/huge-en-sherlock                      37.9 GB/s (1.00x)           1907.0 MB/s (20.33x)               2.7 GB/s (14.11x)
memmem/subtitles/rare/huge-en-medium-needle                 48.0 GB/s (1.00x)           1358.2 MB/s (36.22x)               3.9 GB/s (12.27x)
memmem/subtitles/rare/huge-en-long-needle                   43.2 GB/s (1.00x)           1715.0 MB/s (25.78x)               6.8 GB/s (6.32x)
memmem/subtitles/rare/huge-en-huge-needle                   41.8 GB/s (1.00x)           1915.9 MB/s (22.37x)               7.0 GB/s (5.97x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               40.1 GB/s (1.00x)           479.5 MB/s (85.61x)                2.6 GB/s (15.64x)
memmem/subtitles/rare/huge-ru-sherlock                      51.1 GB/s (1.00x)           479.5 MB/s (109.22x)               1733.9 MB/s (30.21x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               49.0 GB/s (1.00x)           1080.8 MB/s (46.38x)               4.9 GB/s (9.92x)
memmem/subtitles/rare/huge-zh-sherlock                      49.9 GB/s (1.00x)           1075.2 MB/s (47.56x)               3.4 GB/s (14.73x)

Again, quick-strings is slower in every case. Now let's look at
benchmarks with only teeny haystacks (the opposite of the above):

$ rebar cmp /tmp/results.csv --intersection -e oneshot -f teeny
benchmark                                         rust/memchr/memmem/oneshot  rust/quick-strings/memmem/oneshot  rust/std/memmem/oneshot
---------                                         --------------------------  ---------------------------------  -----------------------
memmem/subtitles/never/teeny-en-john-watson       1027.0 MB/s (1.44x)         1483.5 MB/s (1.00x)                920.8 MB/s (1.61x)
memmem/subtitles/never/teeny-en-all-common-bytes  989.0 MB/s (1.50x)          1483.5 MB/s (1.00x)                1027.0 MB/s (1.44x)
memmem/subtitles/never/teeny-en-some-rare-bytes   1027.0 MB/s (1.30x)         1335.1 MB/s (1.00x)                1161.0 MB/s (1.15x)
memmem/subtitles/never/teeny-en-two-space         989.0 MB/s (1.12x)          1112.6 MB/s (1.00x)                953.7 MB/s (1.17x)
memmem/subtitles/never/teeny-ru-john-watson       1178.1 MB/s (1.42x)         1668.9 MB/s (1.00x)                1001.4 MB/s (1.67x)
memmem/subtitles/never/teeny-zh-john-watson       1095.0 MB/s (1.50x)         1642.4 MB/s (1.00x)                953.7 MB/s (1.72x)
memmem/subtitles/rare/teeny-en-sherlock-holmes    989.0 MB/s (1.35x)          1335.1 MB/s (1.00x)                387.0 MB/s (3.45x)
memmem/subtitles/rare/teeny-en-sherlock           1112.6 MB/s (1.33x)         1483.5 MB/s (1.00x)                606.9 MB/s (2.44x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes    1112.6 MB/s (1.29x)         1430.5 MB/s (1.00x)                408.7 MB/s (3.50x)
memmem/subtitles/rare/teeny-ru-sherlock           1292.1 MB/s (1.29x)         1668.9 MB/s (1.00x)                656.6 MB/s (2.54x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes    985.5 MB/s (1.50x)          1478.2 MB/s (1.00x)                343.8 MB/s (4.30x)
memmem/subtitles/rare/teeny-zh-sherlock           1182.6 MB/s (1.56x)         1847.7 MB/s (1.00x)                579.7 MB/s (3.19x)

In this case, quick-strings is actually faster in every case!

My contextualization of these results is that quick-strings
implements a "naive" substring search algorithm that has worst case
O(m * n) time but very low startup costs. This means that if you
pick your benchmarks correctly, it can appear to be a lot faster
in some common cases. But once you extend beyond that or get large
haystacks, perf can really tank pretty considerably.

Substring search implementations have a lot of different cases to
cover. memchr in particular does try very hard to optimize for the
kinds of cases where quick-strings is fast, and this is likely why
it fairs a bit better than std does above on the teeny haystacks.
(For example, memchr is only 1.5x slower than quick-strings on
teeny-zh-sherlock-holmes, but std is 4.3x slower.)

The reason why it's difficult to get as fast as quick-strings (or
pretty much any naive implementation) is that there's a bunch of startup
overhead to pay for in a substring search implementation that uses
multiple different strategies. For example, memchr will use SIMD in
some cases, but only when the CPU supports it (via runtime querying).
And memchr will change which implementation it uses based on haystack
length. So there is some amount of case analysis being done before the
search can even begin to execute. In contrast, quick-strings can just
launch right into the search.

With that said, there's always room for improvement! memchr isn't
necessarily at the floor of an otherwise irreducible trade-off, but I
think that so long as you want to get high throughput for big haystacks
(like what ripgrep really wants to do), you probably need to at least
partially sell out some other types of tasks for very very small
haystacks.

@samuelcolvin

This is in response to @samuelcolvin's [tweet] about Rust's standard library substring search routines being slow. This PR hooks up his naive implementation of substring search to memchr's [rebar] harness and bakes them off against the substring search implementations in `std` and `memchr` across a wide variety of benchmarks. I ran the benchmarks like so: ``` $ rebar build $ rebar measure \ -e quick-strings \ -e rust/std/memmem/oneshot \ -e rust/std/memmem/prebuilt \ -e rust/memchr/memmem/oneshot \ -e rust/memchr/memmem/prebuilt | tee /tmp/results.csv ``` And here is a high level comparison using the geometric mean of relative speed ratios across all measured benchmarks (limited only to those on which all of std's, memchr's and quick-string's code is measured): ``` $ rebar rank --intersection /tmp/results.csv Engine Version Geometric mean of speed ratios Benchmark count ------ ------- ------------------------------ --------------- rust/memchr/memmem/prebuilt 2.7.4 1.05 38 rust/memchr/memmem/oneshot 2.7.4 1.22 38 rust/std/memmem/prebuilt 1.80.0-nightly b92758a9a 7.02 38 rust/std/memmem/oneshot 1.80.0-nightly b92758a9a 7.48 38 rust/quick-strings/memmem/oneshot 0.0.1 11.22 38 ``` The "oneshot" benchmarks are the closest apples-to-apples comparison. Namely, they measure the time it takes to do a search of a needle in a haystack with a single function call. The "prebuilt" benchmarks, on the other hand, are allowed to accept a needle, build a searcher and then reuse that search for subsequent use. The "prebuilt" benchmarks demonstrate the advantages of a superior API when the needle is invariant. (For example, "find all occurrences of this needle in the lines of a file.") For our purposes, and so we can narrow things down a bit, let's stick to "oneshot" benchmarks: ``` $ rebar rank /tmp/results.csv --intersection -e oneshot Engine Version Geometric mean of speed ratios Benchmark count ------ ------- ------------------------------ --------------- rust/memchr/memmem/oneshot 2.7.4 1.12 38 rust/std/memmem/oneshot 1.80.0-nightly b92758a9a 6.87 38 rust/quick-strings/memmem/oneshot 0.0.1 10.31 38 ``` Here is a breakdown on the level of the individual benchmark: ``` $ rebar cmp /tmp/results.csv --intersection -e oneshot benchmark rust/memchr/memmem/oneshot rust/quick-strings/memmem/oneshot rust/std/memmem/oneshot --------- -------------------------- --------------------------------- ----------------------- memmem/code/rust-library-never-fn-strength 48.9 GB/s (1.00x) 1620.4 MB/s (30.87x) 1378.7 MB/s (36.28x) memmem/code/rust-library-never-fn-strength-paren 48.3 GB/s (1.00x) 1637.2 MB/s (30.21x) 1482.8 MB/s (33.35x) memmem/code/rust-library-never-fn-quux 29.4 GB/s (1.00x) 1637.2 MB/s (18.42x) 1540.9 MB/s (19.57x) memmem/code/rust-library-rare-fn-from-str 48.9 GB/s (1.00x) 5.8 GB/s (8.40x) 1496.9 MB/s (33.44x) memmem/pathological/md5-huge-no-hash 46.8 GB/s (1.00x) 1190.6 MB/s (40.27x) 8.7 GB/s (5.36x) memmem/pathological/md5-huge-last-hash 46.7 GB/s (1.00x) 1196.0 MB/s (39.95x) 10.5 GB/s (4.45x) memmem/pathological/rare-repeated-huge-tricky 51.2 GB/s (1.00x) 1984.3 MB/s (26.44x) 1240.3 MB/s (42.30x) memmem/pathological/rare-repeated-small-tricky 21.2 GB/s (1.00x) 1936.4 MB/s (11.20x) 1202.3 MB/s (18.05x) memmem/pathological/defeat-simple-vector-alphabet 5.4 GB/s (1.00x) 1069.8 MB/s (5.18x) 4.4 GB/s (1.24x) memmem/pathological/defeat-simple-vector-freq-alphabet 19.2 GB/s (1.37x) 1167.3 MB/s (23.04x) 26.3 GB/s (1.00x) memmem/pathological/defeat-simple-vector-repeated-alphabet 991.6 MB/s (1.00x) 22.9 MB/s (43.36x) 826.4 MB/s (1.20x) memmem/subtitles/never/huge-en-john-watson 39.6 GB/s (1.00x) 1930.5 MB/s (21.03x) 1605.2 MB/s (25.29x) memmem/subtitles/never/huge-en-all-common-bytes 43.9 GB/s (1.00x) 1377.7 MB/s (32.63x) 2.9 GB/s (15.12x) memmem/subtitles/never/huge-en-some-rare-bytes 50.3 GB/s (1.00x) 1969.1 MB/s (26.17x) 3.2 GB/s (15.52x) memmem/subtitles/never/huge-en-two-space 41.4 GB/s (1.00x) 727.8 MB/s (58.28x) 1158.8 MB/s (36.61x) memmem/subtitles/never/teeny-en-john-watson 1027.0 MB/s (1.44x) 1483.5 MB/s (1.00x) 920.8 MB/s (1.61x) memmem/subtitles/never/teeny-en-all-common-bytes 989.0 MB/s (1.50x) 1483.5 MB/s (1.00x) 1027.0 MB/s (1.44x) memmem/subtitles/never/teeny-en-some-rare-bytes 1027.0 MB/s (1.30x) 1335.1 MB/s (1.00x) 1161.0 MB/s (1.15x) memmem/subtitles/never/teeny-en-two-space 989.0 MB/s (1.12x) 1112.6 MB/s (1.00x) 953.7 MB/s (1.17x) memmem/subtitles/never/huge-ru-john-watson 51.3 GB/s (1.00x) 475.6 MB/s (110.41x) 2.2 GB/s (23.42x) memmem/subtitles/never/teeny-ru-john-watson 1178.1 MB/s (1.42x) 1668.9 MB/s (1.00x) 1001.4 MB/s (1.67x) memmem/subtitles/never/huge-zh-john-watson 45.1 GB/s (1.00x) 1457.1 MB/s (31.69x) 6.5 GB/s (6.90x) memmem/subtitles/never/teeny-zh-john-watson 1095.0 MB/s (1.50x) 1642.4 MB/s (1.00x) 953.7 MB/s (1.72x) memmem/subtitles/rare/huge-en-sherlock-holmes 51.2 GB/s (1.00x) 1907.4 MB/s (27.48x) 2.5 GB/s (20.36x) memmem/subtitles/rare/huge-en-sherlock 37.9 GB/s (1.00x) 1907.0 MB/s (20.33x) 2.7 GB/s (14.11x) memmem/subtitles/rare/huge-en-medium-needle 48.0 GB/s (1.00x) 1358.2 MB/s (36.22x) 3.9 GB/s (12.27x) memmem/subtitles/rare/huge-en-long-needle 43.2 GB/s (1.00x) 1715.0 MB/s (25.78x) 6.8 GB/s (6.32x) memmem/subtitles/rare/huge-en-huge-needle 41.8 GB/s (1.00x) 1915.9 MB/s (22.37x) 7.0 GB/s (5.97x) memmem/subtitles/rare/teeny-en-sherlock-holmes 989.0 MB/s (1.35x) 1335.1 MB/s (1.00x) 387.0 MB/s (3.45x) memmem/subtitles/rare/teeny-en-sherlock 1112.6 MB/s (1.33x) 1483.5 MB/s (1.00x) 606.9 MB/s (2.44x) memmem/subtitles/rare/huge-ru-sherlock-holmes 40.1 GB/s (1.00x) 479.5 MB/s (85.61x) 2.6 GB/s (15.64x) memmem/subtitles/rare/huge-ru-sherlock 51.1 GB/s (1.00x) 479.5 MB/s (109.22x) 1733.9 MB/s (30.21x) memmem/subtitles/rare/teeny-ru-sherlock-holmes 1112.6 MB/s (1.29x) 1430.5 MB/s (1.00x) 408.7 MB/s (3.50x) memmem/subtitles/rare/teeny-ru-sherlock 1292.1 MB/s (1.29x) 1668.9 MB/s (1.00x) 656.6 MB/s (2.54x) memmem/subtitles/rare/huge-zh-sherlock-holmes 49.0 GB/s (1.00x) 1080.8 MB/s (46.38x) 4.9 GB/s (9.92x) memmem/subtitles/rare/huge-zh-sherlock 49.9 GB/s (1.00x) 1075.2 MB/s (47.56x) 3.4 GB/s (14.73x) memmem/subtitles/rare/teeny-zh-sherlock-holmes 985.5 MB/s (1.50x) 1478.2 MB/s (1.00x) 343.8 MB/s (4.30x) memmem/subtitles/rare/teeny-zh-sherlock 1182.6 MB/s (1.56x) 1847.7 MB/s (1.00x) 579.7 MB/s (3.19x) ``` Even in comparison to `memchr`, which has been optimized like crazy, `quick-strings` actually does better on some benchmarks! But... if we ask for only results where there is, say, a very large difference, watch what happens: ``` $ rebar cmp /tmp/results.csv --intersection -e oneshot -t 5 benchmark rust/memchr/memmem/oneshot rust/quick-strings/memmem/oneshot rust/std/memmem/oneshot --------- -------------------------- --------------------------------- ----------------------- memmem/code/rust-library-never-fn-strength 48.9 GB/s (1.00x) 1620.4 MB/s (30.87x) 1378.7 MB/s (36.28x) memmem/code/rust-library-never-fn-strength-paren 48.3 GB/s (1.00x) 1637.2 MB/s (30.21x) 1482.8 MB/s (33.35x) memmem/code/rust-library-never-fn-quux 29.4 GB/s (1.00x) 1637.2 MB/s (18.42x) 1540.9 MB/s (19.57x) memmem/code/rust-library-rare-fn-from-str 48.9 GB/s (1.00x) 5.8 GB/s (8.40x) 1496.9 MB/s (33.44x) memmem/pathological/md5-huge-no-hash 46.8 GB/s (1.00x) 1190.6 MB/s (40.27x) 8.7 GB/s (5.36x) memmem/pathological/md5-huge-last-hash 46.7 GB/s (1.00x) 1196.0 MB/s (39.95x) 10.5 GB/s (4.45x) memmem/pathological/rare-repeated-huge-tricky 51.2 GB/s (1.00x) 1984.3 MB/s (26.44x) 1240.3 MB/s (42.30x) memmem/pathological/rare-repeated-small-tricky 21.2 GB/s (1.00x) 1936.4 MB/s (11.20x) 1202.3 MB/s (18.05x) memmem/pathological/defeat-simple-vector-alphabet 5.4 GB/s (1.00x) 1069.8 MB/s (5.18x) 4.4 GB/s (1.24x) memmem/pathological/defeat-simple-vector-freq-alphabet 19.2 GB/s (1.37x) 1167.3 MB/s (23.04x) 26.3 GB/s (1.00x) memmem/pathological/defeat-simple-vector-repeated-alphabet 991.6 MB/s (1.00x) 22.9 MB/s (43.36x) 826.4 MB/s (1.20x) memmem/subtitles/never/huge-en-john-watson 39.6 GB/s (1.00x) 1930.5 MB/s (21.03x) 1605.2 MB/s (25.29x) memmem/subtitles/never/huge-en-all-common-bytes 43.9 GB/s (1.00x) 1377.7 MB/s (32.63x) 2.9 GB/s (15.12x) memmem/subtitles/never/huge-en-some-rare-bytes 50.3 GB/s (1.00x) 1969.1 MB/s (26.17x) 3.2 GB/s (15.52x) memmem/subtitles/never/huge-en-two-space 41.4 GB/s (1.00x) 727.8 MB/s (58.28x) 1158.8 MB/s (36.61x) memmem/subtitles/never/huge-ru-john-watson 51.3 GB/s (1.00x) 475.6 MB/s (110.41x) 2.2 GB/s (23.42x) memmem/subtitles/never/huge-zh-john-watson 45.1 GB/s (1.00x) 1457.1 MB/s (31.69x) 6.5 GB/s (6.90x) memmem/subtitles/rare/huge-en-sherlock-holmes 51.2 GB/s (1.00x) 1907.4 MB/s (27.48x) 2.5 GB/s (20.36x) memmem/subtitles/rare/huge-en-sherlock 37.9 GB/s (1.00x) 1907.0 MB/s (20.33x) 2.7 GB/s (14.11x) memmem/subtitles/rare/huge-en-medium-needle 48.0 GB/s (1.00x) 1358.2 MB/s (36.22x) 3.9 GB/s (12.27x) memmem/subtitles/rare/huge-en-long-needle 43.2 GB/s (1.00x) 1715.0 MB/s (25.78x) 6.8 GB/s (6.32x) memmem/subtitles/rare/huge-en-huge-needle 41.8 GB/s (1.00x) 1915.9 MB/s (22.37x) 7.0 GB/s (5.97x) memmem/subtitles/rare/huge-ru-sherlock-holmes 40.1 GB/s (1.00x) 479.5 MB/s (85.61x) 2.6 GB/s (15.64x) memmem/subtitles/rare/huge-ru-sherlock 51.1 GB/s (1.00x) 479.5 MB/s (109.22x) 1733.9 MB/s (30.21x) memmem/subtitles/rare/huge-zh-sherlock-holmes 49.0 GB/s (1.00x) 1080.8 MB/s (46.38x) 4.9 GB/s (9.92x) memmem/subtitles/rare/huge-zh-sherlock 49.9 GB/s (1.00x) 1075.2 MB/s (47.56x) 3.4 GB/s (14.73x) ``` In every case, `quick-strings` is the one that is slower. Here's another way to slice the data. Let's _only_ look at benchmarks with non-teeny haystacks: ``` [andrew@duff memchr]$ rebar cmp /tmp/results.csv --intersection -e oneshot -F teeny benchmark rust/memchr/memmem/oneshot rust/quick-strings/memmem/oneshot rust/std/memmem/oneshot --------- -------------------------- --------------------------------- ----------------------- memmem/code/rust-library-never-fn-strength 48.9 GB/s (1.00x) 1620.4 MB/s (30.87x) 1378.7 MB/s (36.28x) memmem/code/rust-library-never-fn-strength-paren 48.3 GB/s (1.00x) 1637.2 MB/s (30.21x) 1482.8 MB/s (33.35x) memmem/code/rust-library-never-fn-quux 29.4 GB/s (1.00x) 1637.2 MB/s (18.42x) 1540.9 MB/s (19.57x) memmem/code/rust-library-rare-fn-from-str 48.9 GB/s (1.00x) 5.8 GB/s (8.40x) 1496.9 MB/s (33.44x) memmem/pathological/md5-huge-no-hash 46.8 GB/s (1.00x) 1190.6 MB/s (40.27x) 8.7 GB/s (5.36x) memmem/pathological/md5-huge-last-hash 46.7 GB/s (1.00x) 1196.0 MB/s (39.95x) 10.5 GB/s (4.45x) memmem/pathological/rare-repeated-huge-tricky 51.2 GB/s (1.00x) 1984.3 MB/s (26.44x) 1240.3 MB/s (42.30x) memmem/pathological/rare-repeated-small-tricky 21.2 GB/s (1.00x) 1936.4 MB/s (11.20x) 1202.3 MB/s (18.05x) memmem/pathological/defeat-simple-vector-alphabet 5.4 GB/s (1.00x) 1069.8 MB/s (5.18x) 4.4 GB/s (1.24x) memmem/pathological/defeat-simple-vector-freq-alphabet 19.2 GB/s (1.37x) 1167.3 MB/s (23.04x) 26.3 GB/s (1.00x) memmem/pathological/defeat-simple-vector-repeated-alphabet 991.6 MB/s (1.00x) 22.9 MB/s (43.36x) 826.4 MB/s (1.20x) memmem/subtitles/never/huge-en-john-watson 39.6 GB/s (1.00x) 1930.5 MB/s (21.03x) 1605.2 MB/s (25.29x) memmem/subtitles/never/huge-en-all-common-bytes 43.9 GB/s (1.00x) 1377.7 MB/s (32.63x) 2.9 GB/s (15.12x) memmem/subtitles/never/huge-en-some-rare-bytes 50.3 GB/s (1.00x) 1969.1 MB/s (26.17x) 3.2 GB/s (15.52x) memmem/subtitles/never/huge-en-two-space 41.4 GB/s (1.00x) 727.8 MB/s (58.28x) 1158.8 MB/s (36.61x) memmem/subtitles/never/huge-ru-john-watson 51.3 GB/s (1.00x) 475.6 MB/s (110.41x) 2.2 GB/s (23.42x) memmem/subtitles/never/huge-zh-john-watson 45.1 GB/s (1.00x) 1457.1 MB/s (31.69x) 6.5 GB/s (6.90x) memmem/subtitles/rare/huge-en-sherlock-holmes 51.2 GB/s (1.00x) 1907.4 MB/s (27.48x) 2.5 GB/s (20.36x) memmem/subtitles/rare/huge-en-sherlock 37.9 GB/s (1.00x) 1907.0 MB/s (20.33x) 2.7 GB/s (14.11x) memmem/subtitles/rare/huge-en-medium-needle 48.0 GB/s (1.00x) 1358.2 MB/s (36.22x) 3.9 GB/s (12.27x) memmem/subtitles/rare/huge-en-long-needle 43.2 GB/s (1.00x) 1715.0 MB/s (25.78x) 6.8 GB/s (6.32x) memmem/subtitles/rare/huge-en-huge-needle 41.8 GB/s (1.00x) 1915.9 MB/s (22.37x) 7.0 GB/s (5.97x) memmem/subtitles/rare/huge-ru-sherlock-holmes 40.1 GB/s (1.00x) 479.5 MB/s (85.61x) 2.6 GB/s (15.64x) memmem/subtitles/rare/huge-ru-sherlock 51.1 GB/s (1.00x) 479.5 MB/s (109.22x) 1733.9 MB/s (30.21x) memmem/subtitles/rare/huge-zh-sherlock-holmes 49.0 GB/s (1.00x) 1080.8 MB/s (46.38x) 4.9 GB/s (9.92x) memmem/subtitles/rare/huge-zh-sherlock 49.9 GB/s (1.00x) 1075.2 MB/s (47.56x) 3.4 GB/s (14.73x) ``` Again, `quick-strings` is slower in every case. Now let's look at benchmarks with only teeny haystacks (the opposite of the above): ``` $ rebar cmp /tmp/results.csv --intersection -e oneshot -f teeny benchmark rust/memchr/memmem/oneshot rust/quick-strings/memmem/oneshot rust/std/memmem/oneshot --------- -------------------------- --------------------------------- ----------------------- memmem/subtitles/never/teeny-en-john-watson 1027.0 MB/s (1.44x) 1483.5 MB/s (1.00x) 920.8 MB/s (1.61x) memmem/subtitles/never/teeny-en-all-common-bytes 989.0 MB/s (1.50x) 1483.5 MB/s (1.00x) 1027.0 MB/s (1.44x) memmem/subtitles/never/teeny-en-some-rare-bytes 1027.0 MB/s (1.30x) 1335.1 MB/s (1.00x) 1161.0 MB/s (1.15x) memmem/subtitles/never/teeny-en-two-space 989.0 MB/s (1.12x) 1112.6 MB/s (1.00x) 953.7 MB/s (1.17x) memmem/subtitles/never/teeny-ru-john-watson 1178.1 MB/s (1.42x) 1668.9 MB/s (1.00x) 1001.4 MB/s (1.67x) memmem/subtitles/never/teeny-zh-john-watson 1095.0 MB/s (1.50x) 1642.4 MB/s (1.00x) 953.7 MB/s (1.72x) memmem/subtitles/rare/teeny-en-sherlock-holmes 989.0 MB/s (1.35x) 1335.1 MB/s (1.00x) 387.0 MB/s (3.45x) memmem/subtitles/rare/teeny-en-sherlock 1112.6 MB/s (1.33x) 1483.5 MB/s (1.00x) 606.9 MB/s (2.44x) memmem/subtitles/rare/teeny-ru-sherlock-holmes 1112.6 MB/s (1.29x) 1430.5 MB/s (1.00x) 408.7 MB/s (3.50x) memmem/subtitles/rare/teeny-ru-sherlock 1292.1 MB/s (1.29x) 1668.9 MB/s (1.00x) 656.6 MB/s (2.54x) memmem/subtitles/rare/teeny-zh-sherlock-holmes 985.5 MB/s (1.50x) 1478.2 MB/s (1.00x) 343.8 MB/s (4.30x) memmem/subtitles/rare/teeny-zh-sherlock 1182.6 MB/s (1.56x) 1847.7 MB/s (1.00x) 579.7 MB/s (3.19x) ``` In this case, `quick-strings` is actually faster in _every_ case! My contextualization of these results is that `quick-strings` implements a "naive" substring search algorithm that has worst case `O(m * n)` time but *very low* startup costs. This means that if you pick your benchmarks correctly, it can appear to be a lot faster in some common cases. But once you extend beyond that or get large haystacks, perf can really tank pretty considerably. Substring search implementations have a lot of different cases to cover. `memchr` in particular does try very hard to optimize for the kinds of cases where `quick-strings` is fast, and this is likely why it fairs a bit better than `std` does above on the teeny haystacks. (For example, `memchr` is only 1.5x slower than `quick-strings` on `teeny-zh-sherlock-holmes`, but std is 4.3x slower.) The reason why it's difficult to get as fast as `quick-strings` (or pretty much any naive implementation) is that there's a bunch of startup overhead to pay for in a substring search implementation that uses multiple different strategies. For example, `memchr` will use SIMD in some cases, but only when the CPU supports it (via runtime querying). And `memchr` will change which implementation it uses based on haystack length. So there is some amount of case analysis being done before the search can even begin to execute. In contrast, `quick-strings` can just launch right into the search. With that said, there's always room for improvement! `memchr` isn't necessarily at the floor of an otherwise irreducible trade-off, but I think that so long as you want to get high throughput for big haystacks (like what ripgrep really wants to do), you probably need to at least partially sell out some other types of tasks for very very small haystacks. [tweet]: https://x.com/samuel_colvin/status/1815453691480338692 [rebar]: https://github.com/BurntSushi/rebar

samuelcolvin · 2024-07-22T21:50:59Z

This is fascinating, thank you so much.

I'll look now soon.

Perhaps we can just use memchr for arrow-rs.

It's worth noting that wrt LIKE and ILIKE queries:

Most worries have quite a short needle
99.x% of rows will often be misses, so the cost of a match doesn't matter much.

BurntSushi · 2024-07-22T22:20:43Z

memchr would be a good bet. If you're doing LIKE-style queries, then you should be able to build a memchr::memmem::Finder once and reuse it. And if you do that, it does much better on the short haystacks cases, even when comparing with quick-strings:

[andrew@duff memchr]$ rebar cmp /tmp/results.csv --intersection -e quick-strings -e memchr/memmem/prebuilt
benchmark                                                   rust/memchr/memmem/prebuilt  rust/quick-strings/memmem/oneshot
---------                                                   ---------------------------  ---------------------------------
memmem/code/rust-library-never-fn-strength                  49.0 GB/s (1.00x)            1620.4 MB/s (30.94x)
memmem/code/rust-library-never-fn-strength-paren            48.9 GB/s (1.00x)            1637.2 MB/s (30.59x)
memmem/code/rust-library-never-fn-quux                      36.5 GB/s (1.00x)            1637.2 MB/s (22.86x)
memmem/code/rust-library-rare-fn-from-str                   48.9 GB/s (1.00x)            5.8 GB/s (8.40x)
memmem/pathological/md5-huge-no-hash                        40.3 GB/s (1.00x)            1190.6 MB/s (34.63x)
memmem/pathological/md5-huge-last-hash                      47.3 GB/s (1.00x)            1196.0 MB/s (40.49x)
memmem/pathological/rare-repeated-huge-tricky               51.3 GB/s (1.00x)            1984.3 MB/s (26.47x)
memmem/pathological/rare-repeated-small-tricky              27.4 GB/s (1.00x)            1936.4 MB/s (14.50x)
memmem/pathological/defeat-simple-vector-alphabet           5.4 GB/s (1.00x)             1069.8 MB/s (5.18x)
memmem/pathological/defeat-simple-vector-freq-alphabet      19.2 GB/s (1.00x)            1167.3 MB/s (16.87x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  991.9 MB/s (1.00x)           22.9 MB/s (43.38x)
memmem/subtitles/never/huge-en-john-watson                  40.2 GB/s (1.00x)            1930.5 MB/s (21.34x)
memmem/subtitles/never/huge-en-all-common-bytes             45.6 GB/s (1.00x)            1377.7 MB/s (33.91x)
memmem/subtitles/never/huge-en-some-rare-bytes              49.2 GB/s (1.00x)            1969.1 MB/s (25.59x)
memmem/subtitles/never/huge-en-two-space                    41.6 GB/s (1.00x)            727.8 MB/s (58.53x)
memmem/subtitles/never/teeny-en-john-watson                 1668.9 MB/s (1.00x)          1483.5 MB/s (1.12x)
memmem/subtitles/never/teeny-en-all-common-bytes            1668.9 MB/s (1.00x)          1483.5 MB/s (1.12x)
memmem/subtitles/never/teeny-en-some-rare-bytes             1668.9 MB/s (1.00x)          1335.1 MB/s (1.25x)
memmem/subtitles/never/teeny-en-two-space                   1668.9 MB/s (1.00x)          1112.6 MB/s (1.50x)
memmem/subtitles/never/huge-ru-john-watson                  40.3 GB/s (1.00x)            475.6 MB/s (86.68x)
memmem/subtitles/never/teeny-ru-john-watson                 2.6 GB/s (1.00x)             1668.9 MB/s (1.60x)
memmem/subtitles/never/huge-zh-john-watson                  49.8 GB/s (1.00x)            1457.1 MB/s (35.00x)
memmem/subtitles/never/teeny-zh-john-watson                 1847.7 MB/s (1.00x)          1642.4 MB/s (1.12x)
memmem/subtitles/rare/huge-en-sherlock-holmes               51.4 GB/s (1.00x)            1907.4 MB/s (27.58x)
memmem/subtitles/rare/huge-en-sherlock                      50.8 GB/s (1.00x)            1907.0 MB/s (27.26x)
memmem/subtitles/rare/huge-en-medium-needle                 36.9 GB/s (1.00x)            1358.2 MB/s (27.84x)
memmem/subtitles/rare/huge-en-long-needle                   43.2 GB/s (1.00x)            1715.0 MB/s (25.82x)
memmem/subtitles/rare/huge-en-huge-needle                   39.3 GB/s (1.00x)            1915.9 MB/s (21.03x)
memmem/subtitles/rare/teeny-en-sherlock-holmes              1570.8 MB/s (1.00x)          1335.1 MB/s (1.18x)
memmem/subtitles/rare/teeny-en-sherlock                     1335.1 MB/s (1.11x)          1483.5 MB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               50.3 GB/s (1.00x)            479.5 MB/s (107.49x)
memmem/subtitles/rare/huge-ru-sherlock                      51.2 GB/s (1.00x)            479.5 MB/s (109.32x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes              2.1 GB/s (1.00x)             1430.5 MB/s (1.47x)
memmem/subtitles/rare/teeny-ru-sherlock                     1741.5 MB/s (1.00x)          1668.9 MB/s (1.04x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               49.1 GB/s (1.00x)            1080.8 MB/s (46.50x)
memmem/subtitles/rare/huge-zh-sherlock                      50.0 GB/s (1.00x)            1075.2 MB/s (47.60x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes              1137.1 MB/s (1.30x)          1478.2 MB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock                     1231.8 MB/s (1.50x)          1847.7 MB/s (1.00x)

std doesn't expose any sort of analogous API for a prebuilt searcher based on just the needle.

That said, memchr doesn't do case insensitive search. That's a different can of worms, especially if you care about Unicode support. A regex::Regex supports it, but there's a lot more machinery there and its overhead is even bigger than memchr::memmem. If you only need ASCII case insensitivity, then there is AhoCorasickBuilder::ascii_case_insensitive, although I don't believe that's had a ton of optimization work put into it, so YMMV.

samuelcolvin · 2024-07-23T20:05:52Z

Thanks @BurntSushi,

That said, memchr doesn't do case insensitive search.

Here arrow-rs already falls back to using regex::Regex, the idea for quick-strings is for the very common cases where a LIKE or ILIKE query can be represented as starts_with, ends_with, contains, istarts_with, iends_with or icontains.

For that reason, we only need ascii case insensitive methods — if the needle is case insensitive, we use that methods, otherwise we can just fallback to using regex::Regex.

Ascii needles make up (in my experience) a very large proportion of queries, hence the value in optimising that case.

Is there any interest in adding ascii only case-insensitive methods to memchr? I would really like to use just memchr and not need to use any mof the logic from quick-strings.

samuelcolvin · 2024-07-23T20:06:29Z

cc @alamb who I think might be interested in this discussion.

BurntSushi · 2024-07-23T21:30:25Z

The "search in ASCII case insensitive mode" is something that I think would probably have a good home in this crate, but it is a big scope increase. That is, it's something that I do think would be nice to support, but it's a tall order. Of course, an initial implementation could be naive (as long as it's not naive algorithmically, i.e., optimal worst case behavior) I suppose. But there are likely SIMD tricks specific to ASCII case insensitive search that we'll want to employ. That's probably easily ~1 month of my free coding time in the evenings.

But yeah, your strategy makes sense. I suspect what you're fighting is actually just regex overhead itself. The regex crate is, quite literally, doing exactly the optimization you're trying to do manually. The actual regex engine won't be used at all for simple ASCII-only literal queries. It will just defer to this crate. But there are abstraction layers it has to cut through at search time... Maybe there is a way to improve those, but I suspect there will always be cases like this where it will be faster for the caller to do this sort of optimization on their own unfortunately. If you were searching larger haystacks, then you wouldn't have to worry about this at all.

alamb · 2024-07-23T21:50:20Z

An approach we have taken elsewhere in arrow-rs is to literally special case the different implementations

So maybe we could do something like this:

if haystack.len() < 1024 { // or whatever the constant happens to be
  // use small-strings implementation
} else {
  // use more general purpose std or memchr etc
}

This would of course result in more code / bigger binary

alamb · 2024-07-23T21:53:58Z

. The "prebuilt" benchmarks, on the other hand, are allowed to accept a needle, build a searcher and then reuse that search for subsequent use. The "prebuilt" benchmarks demonstrate the advantages of a superior API when the needle is invariant.

I think this is an excellent observation @BurntSushi

@samuelcolvin maybe this is a good hint that we could make the arrow-rs substring kernels behave more like the regex kernes and "compile / prepare" the matcher

Given the kernels typically operate on batches of 4000 or 8000ish rows we can probably amortize the "prepare the needle" step well

samuelcolvin · 2024-07-26T12:58:53Z

See apache/arrow-rs#6118, apache/arrow-rs#6128 and apache/arrow-rs#6131 for implementation in arrow-rs.

samuelcolvin mentioned this pull request Jul 24, 2024

String search kernel optimisations apache/arrow-rs#6107

Open

6 tasks

samuelcolvin mentioned this pull request Jul 28, 2024

Add IContains for case insensitiveILIKE: improve performance for "contains" where the needle is ASCII apache/arrow-rs#6131

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

benchmarks: add ad hoc benchmark for 'quick-strings' #156

benchmarks: add ad hoc benchmark for 'quick-strings' #156

BurntSushi commented Jul 22, 2024

samuelcolvin commented Jul 22, 2024

BurntSushi commented Jul 22, 2024

samuelcolvin commented Jul 23, 2024

samuelcolvin commented Jul 23, 2024

BurntSushi commented Jul 23, 2024

alamb commented Jul 23, 2024 •

edited

Loading

alamb commented Jul 23, 2024

samuelcolvin commented Jul 26, 2024 •

edited

Loading

benchmarks: add ad hoc benchmark for 'quick-strings' #156

Are you sure you want to change the base?

benchmarks: add ad hoc benchmark for 'quick-strings' #156

Conversation

BurntSushi commented Jul 22, 2024

samuelcolvin commented Jul 22, 2024

BurntSushi commented Jul 22, 2024

samuelcolvin commented Jul 23, 2024

samuelcolvin commented Jul 23, 2024

BurntSushi commented Jul 23, 2024

alamb commented Jul 23, 2024 • edited Loading

alamb commented Jul 23, 2024

samuelcolvin commented Jul 26, 2024 • edited Loading

alamb commented Jul 23, 2024 •

edited

Loading

samuelcolvin commented Jul 26, 2024 •

edited

Loading