Speed up s.startswith() #671

gvanrossum · 2024-04-01T17:13:01Z

People keep discovering that, embarrassingly, checking whether a string starts with a given substring is slower with s.startswith("foo") than with s[:3] == "foo". The reason is that startswith requires a name lookup. I think we now have the machinery to special-case selected built-in method in the specializer? This would be a good candidate.

The text was updated successfully, but these errors were encountered:

brandtbucher · 2024-04-01T17:25:14Z

FWIW, we already do this with list.append (CALL_LIST_APPEND). The same pattern could be followed for any other builtin methods without too much hassle.

serhiy-storchaka · 2024-04-01T17:26:04Z

It is in the debug build, but startswith is actually faster:

$ ./python -m timeit -s 's = "abcdef"' 's.startswith("abc")'
2000000 loops, best of 5: 189 nsec per loop
$ ./python -m timeit -s 's = "abcdef"' 's[:3] == "abc"'
1000000 loops, best of 5: 283 nsec per loop

gvanrossum · 2024-04-01T17:35:19Z

I think @JelleZijlstra showed that in a fully optimized build it is still slower. I repeated his experiments and got the same results:

(.venv) ~$ python3.13 -m timeit -s 'x = "......"' 'x[:3] == "..."'
10000000 loops, best of 5: 32 nsec per loop
(.venv) ~$ python3.13 -m timeit -s 'x = "......"' 'x[:3].startswith("...")'
5000000 loops, best of 5: 64.4 nsec per loop
(.venv) ~$ python3.13 -V
Python 3.13.0a5

EDIT: The second timing is wrong, see corrected timing below.

brandtbucher · 2024-04-01T17:37:06Z

Your second timing does both the slice and method call. :)

gvanrossum · 2024-04-01T17:44:52Z

Your second timing does both the slice and method call. :)

Oof. :-( But it's still slower: 40 vs. 32:

(.venv) ~$ python3.13 -m timeit -s 'x = "......"' 'x.startswith("...")'
5000000 loops, best of 5: 40.3 nsec per loop

serhiy-storchaka · 2024-04-01T18:10:56Z

Yes, for comparison with #671 (comment), on the same computer:

$ ./python -m timeit -s 's = "abcdef"' 's.startswith("abc")'
5000000 loops, best of 5: 84.3 nsec per loop
$ ./python -m timeit -s 's = "abcdef"' 's[:3] == "abc"'
5000000 loops, best of 5: 65.9 nsec per loop

startswith is 25-30% slower.

The difference is larger for 1-character prefix (because the overhead of making a 1-character substring is much smaller):

$ ./python -m timeit -s 's = "abcdef"' 's.startswith("a")'
5000000 loops, best of 5: 83.6 nsec per loop
$ ./python -m timeit -s 's = "abcdef"' 's[:1] == "a"'
5000000 loops, best of 5: 51 nsec per loop
$ ./python -m timeit -s 's = "abcdef"' 's[0] == "a"'
10000000 loops, best of 5: 22.3 nsec per loop

eendebakpt · 2024-04-01T19:42:18Z

There is also some overhead in the argument parsing for startswith. On my system

(env312) python -m timeit -s "s = 'abcdef'" "s.startswith('abc')"
5000000 loops, best of 5: 63.8 nsec per loop

(env312) python -m timeit -s "s = 'abcdef'" "s[:3] == 'abc'"
5000000 loops, best of 5: 47.6 nsec per loop

Adding a simple fast path for the single argument case of startswith results in

(env312) python -m timeit -s "s = 'abcdef'" "s.startswith('abc')"
5000000 loops, best of 5: 42.3 nsec per loop

The fast path I implemented is python/cpython@main...eendebakpt:cpython:startswith. That might not be the best approach, but this was just a quick test. The asciilib_parse_args_finds could perhaps also be improved, for example eliminate the copy at https://github.com/python/cpython/blob/c741ad3537193c63fe697a8f0316aecd45eeb9ba/Objects/stringlib/find.h#L97

markshannon · 2024-04-02T08:58:24Z

The reason is that startswith requires a name lookup.

This is not true. It is the call that is slow, not the attribute lookup.

The attribute lookup gets specialized to LOAD_ATTR_METHOD_NO_DICT which is as fast as an attribute lookup can be (in tier 1).

The call is slow for two reasons.

str.startwith is a METH_VARARGS function. It should be METH_FASTCALL
As @eendebakpt says, there is no fast path for the the common case.

def foo(s):
   for _ in range(1000):
       s.startswith("foo")

>>> dis.dis(foo, adaptive=True)
  1           RESUME_CHECK             0

  2           LOAD_GLOBAL              1 (range + NULL)
              LOAD_CONST               1 (1000)
              CALL                     1
              GET_ITER
      L1:     FOR_ITER_RANGE          20 (to L2)
              STORE_FAST               1 (_)

  3           LOAD_FAST                0 (s)
              LOAD_ATTR_METHOD_NO_DICT 3 (startswith + NULL|self)
              LOAD_CONST               2 ('foo')
              CALL                     1
              POP_TOP
              JUMP_BACKWARD           22 (to L1)

  2   L2:     END_FOR
              POP_TOP
              RETURN_CONST             0 (None)

markshannon · 2024-04-02T09:13:10Z

We should probably replace all METH_VARARGS with METH_FASTCALL, in the long run.

In the short term, we should change the methods on builtin objects.
git grep METH_VARARGS | grep "Objects/" gives me 49 results, so it wouldn't be that much work.

erlend-aasland · 2024-04-02T09:50:45Z

We should probably replace all METH_VARARGS with METH_FASTCALL, in the long run.

~~FYI: With python/cpython#107984, Argument Clinic got the ability to deprecate keyword use of parameters.~~ Oh, they're already positional-only. Sorry for the noise.

erlend-aasland · 2024-04-02T14:29:42Z

METH_FASTCALL gives a significant speedup¹:

$ ./python.exe -m timeit -s "s = 'abcdef'" "s[:3] == 'abc'"
2000000 loops, best of 5: 177 nsec per loop
$ ./python.exe -m timeit -s "s = 'abcdef'" "s.startswith('abc')"
5000000 loops, best of 5: 58.3 nsec per loop

debug build ↩

gvanrossum · 2024-04-02T14:34:47Z

@eendebakpt Can you turn that into a PR then?

eendebakpt · 2024-04-02T15:05:37Z

@eendebakpt Can you turn that into a PR then?

Yes, I will pick that up

@erlend-aasland Did you use argument clinic in testing the METH_FASTCALL?

erlend-aasland · 2024-04-02T15:12:46Z

@eendebakpt: yes. See my ~~draft~~ PR.

erlend-aasland · 2024-04-03T12:13:02Z

Resolved by:

gh-117431: Adapt bytes and bytearray .startswith() and .endswith() to Argument Clinic python/cpython#117495
gh-117431: Adapt str.{start,end}swith to use the METH_FASTCALL calling convention python/cpython#117466

Argument Clinic (and METH_FASTCALL) FTW.

erlend-aasland · 2024-04-03T12:13:49Z

@eendebakpt is also working on further optimisations in stringlib 🚀

eendebakpt · 2024-04-11T21:48:01Z

With the recent conversion to vectorcall by Erland there is not much to gain in the implementation of startswith (and endswith). The computation time for x.startswith('a') can be split into the following components (all numbers are rough estimates, details in the section below)

Attribute lookup of startswith on x: 2.7 ns
Python method call overhead: 21.8 ns
Time for adding the single argument a: 1.5 ns
Argument parsing: 0.6 ns
Calculations with the start and end parameters, type checking: 2.6 ns
Actual string comparison: 8.4 ns
Total: 37.6 ns

The implementation of the actual string comparison can be improved a bit, for the single character case a few ns. See #117782

Our baseline is `x.startswith('a')` which takes about 37.6 ns: ``` x.startswith('a'): Mean +- std dev: 37.6 ns +- 0.1 ns ``` That time can be roughly divided into:

2.7 ns: lookup of startswith on the string x

x_startswith('a'): Mean +- std dev: 34.9 ns +- 0.2 ns

21.8 ns: python method call overhead. Determined by adding

    if (nargs==0)
        return Py_None;

at the start of unicode_startswith and benchmarking x.startswith()

x.startswith(): Mean +- std dev: 24.5 ns +- 0.4 ns

1.5 ns: additional time required for adding a single argument a. Determined by adding

    if (args[0]==Py_None)
        return Py_None;

at the start of unicode_startswith and benchmarking x.startswith(None)

x.startswith(None): Mean +- std dev: 26.0 ns +- 0.3 ns

0.6 ns = 26.6 - 26.0 ns. The time for argument parsing for the single argument case is (measured by adding to
the start of unicode_startswith_impl

    if (subobj == Py_None) {
        // measure minimum time for call with argument processing
        return Py_None;
    }

and comparing to the previous.

x.startswith(None): Mean +- std dev: 26.6 ns +- 0.7 ns

The implementation unicode_startswith_impl checks the type of the first argument and then calls tailmatch which performs a few simple checks with the (optional) start and end arguments of startswith. In the case of an empty substring no work needs to be done

x.startswith(''): Mean +- std dev: 29.2 ns +- 0.2 ns

Comparing x.startswith('a') with x.startswith('') shows about 8.4 ns is used for the actual string comparison.
Adding additional arguments to startswith is rather expensive. This is mainly due to adding the arguments to the call, and partly due to the parameter processing with calls _PyEval_SliceIndex.

x.startswith('a', 0, 10): Mean +- std dev: 54.3 ns +- 0.3 ns

Performance of x.startswith('a', 0, 10) is about 54.3 ns, which seems high. A large part of that time is in the preparation of the three arguments. A small part is in the argument parsing, in particular the calls to _PyEval_SliceIndex. That method can be improved, since in the common path (e.g. indices of type int) we can avoid an incref/decref and some type checks.
With this python/cpython@main...eendebakpt:cpython:pynumber performance improves

x.startswith('a', 0, 10): Mean +- std dev: [main] 54.4 ns +- 1.1 ns -> [p] 52.3 ns +- 0.3 ns: 1.04x faster

The performance gain is not very large, but PyNumber_AsSsize_t is used in many places so things might be worthwhile. The implementation is short and localized, but not very pretty as it adds an ad-hoc variable do_decref. I am not yet sure how to implement this better, so feedback is welcome.

AlexWaygood mentioned this issue Apr 1, 2024

gh-117349: Micro-optimize a few os.path functions python/cpython#117350

Merged

eendebakpt mentioned this issue Apr 1, 2024

Improve performance of startswith, endswith, count, *find, and *index methods for str, bytes and bytearray python/cpython#117431

Open

erlend-aasland mentioned this issue Apr 2, 2024

gh-117431: Adapt str.{start,end}swith to use the METH_FASTCALL calling convention python/cpython#117466

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up s.startswith() #671

Speed up s.startswith() #671

gvanrossum commented Apr 1, 2024

brandtbucher commented Apr 1, 2024 •

edited

Loading

serhiy-storchaka commented Apr 1, 2024

gvanrossum commented Apr 1, 2024 •

edited

Loading

brandtbucher commented Apr 1, 2024

gvanrossum commented Apr 1, 2024

serhiy-storchaka commented Apr 1, 2024

eendebakpt commented Apr 1, 2024

markshannon commented Apr 2, 2024

markshannon commented Apr 2, 2024

erlend-aasland commented Apr 2, 2024 •

edited

Loading

erlend-aasland commented Apr 2, 2024 •

edited

Loading

gvanrossum commented Apr 2, 2024

eendebakpt commented Apr 2, 2024

erlend-aasland commented Apr 2, 2024 •

edited

Loading

erlend-aasland commented Apr 3, 2024

erlend-aasland commented Apr 3, 2024

eendebakpt commented Apr 11, 2024

Speed up s.startswith() #671

Speed up s.startswith() #671

Comments

gvanrossum commented Apr 1, 2024

brandtbucher commented Apr 1, 2024 • edited Loading

serhiy-storchaka commented Apr 1, 2024

gvanrossum commented Apr 1, 2024 • edited Loading

brandtbucher commented Apr 1, 2024

gvanrossum commented Apr 1, 2024

serhiy-storchaka commented Apr 1, 2024

eendebakpt commented Apr 1, 2024

markshannon commented Apr 2, 2024

markshannon commented Apr 2, 2024

erlend-aasland commented Apr 2, 2024 • edited Loading

erlend-aasland commented Apr 2, 2024 • edited Loading

Footnotes

gvanrossum commented Apr 2, 2024

eendebakpt commented Apr 2, 2024

erlend-aasland commented Apr 2, 2024 • edited Loading

erlend-aasland commented Apr 3, 2024

erlend-aasland commented Apr 3, 2024

eendebakpt commented Apr 11, 2024

brandtbucher commented Apr 1, 2024 •

edited

Loading

gvanrossum commented Apr 1, 2024 •

edited

Loading

erlend-aasland commented Apr 2, 2024 •

edited

Loading

erlend-aasland commented Apr 2, 2024 •

edited

Loading

erlend-aasland commented Apr 2, 2024 •

edited

Loading