Add fast path skipping UTF8 length counting #2819

gaearon · 2024-09-14T17:29:07Z

Stacked on #2817

Commits

What

Similar to #2817, I'm trying to avoid calling into new TextEncoder().encode(str).byteLength for every string. After this change, I basically don't hit it in the app at all; the fast path always lets me out early.

The fast path itself is pretty general. The idea is that .length counts UTF-16 code units, and each UTF-16 code unit corresponds to at most 3 bytes in UTF-8 encoding. So we can safely use value.length * 3 as an upper bound on what utf8Len(value) could possibly be. If this upper bound is below minLength, then utf8Len is below it too, so the string is too short. If this upper bound is within maxLength, then utf8Len is within it too, so the string can't be too long.
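For readers who want to see the shape of the check, here is a minimal sketch of the idea described above. It is not the code from this PR: validateUtf8Length, its options object, and the getSlowPathCalls counter are hypothetical names used only for illustration, and utf8Len is assumed to be the exact TextEncoder-based count.

```typescript
// Instrumentation for the example only: counts how often we pay for encoding.
let slowPathCalls = 0

function getSlowPathCalls(): number {
  return slowPathCalls
}

// Slow path: exact UTF-8 byte length via TextEncoder.
function utf8Len(value: string): number {
  slowPathCalls++
  return new TextEncoder().encode(value).byteLength
}

// Hypothetical validator; the real lexicon validation code may be structured differently.
function validateUtf8Length(
  value: string,
  opts: { minLength?: number; maxLength?: number },
): boolean {
  const { minLength, maxLength } = opts
  // Each UTF-16 code unit becomes at most 3 UTF-8 bytes.
  const upperBound = value.length * 3

  // Fast fail: even the worst case is shorter than minLength.
  if (minLength !== undefined && upperBound < minLength) return false

  // Fast pass: even the worst case fits in maxLength and there is no
  // minimum left to check, so the exact length is never needed.
  const maxSatisfied = maxLength === undefined || upperBound <= maxLength
  if (maxSatisfied && minLength === undefined) return true

  // Otherwise fall back to the exact count.
  const len = utf8Len(value)
  if (minLength !== undefined && len < minLength) return false
  if (maxLength !== undefined && len > maxLength) return false
  return true
}
```

With this sketch, validateUtf8Length('hello', { maxLength: 100 }) returns without encoding because 5 * 3 = 15 is already within the limit, while a string whose upper bound exceeds maxLength still goes through utf8Len.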

Why * 3?

Codepoints that fit into a single UTF-16 code unit become 1 to 3 bytes in UTF-8. (Worst case is 3x.)
Codepoints that need two UTF-16 code units become 4 bytes in UTF-8. (Worst case is 2x.)

So .length * 3 should always give us a valid upper bound. But this needs a look from an expert.
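As a quick sanity check of the reasoning above, the following sketch compares UTF-16 code units against UTF-8 bytes for one sample from each size class. The sample list and the rows variable are illustrative choices, not part of this PR. The lone surrogate case is an extra edge case: TextEncoder replaces an unpaired surrogate with U+FFFD, which encodes to 3 bytes, so it still sits at the 3x worst case.

```typescript
const encoder = new TextEncoder()

function utf8Len(value: string): number {
  return encoder.encode(value).byteLength
}

// One sample per size class, written as escapes to avoid normalization surprises.
const samples: Array<[string, string]> = [
  ['\u0061', 'U+0061: 1 code unit, 1 byte'],
  ['\u00E9', 'U+00E9: 1 code unit, 2 bytes'],
  ['\u20AC', 'U+20AC: 1 code unit, 3 bytes'],
  ['\uD83D\uDE00', 'U+1F600: 2 code units (surrogate pair), 4 bytes'],
  ['\uD800', 'lone surrogate: 1 code unit, encoded as U+FFFD (3 bytes)'],
]

const rows = samples.map(([value, label]) => {
  const units = value.length
  const bytes = utf8Len(value)
  return { label, units, bytes, withinBound: bytes <= units * 3 }
})

for (const row of rows) {
  console.log(row.label, '->', row.units, 'units,', row.bytes, 'bytes, within bound:', row.withinBound)
}
```

This is only a spot check on a handful of characters, not a proof, but it matches the per-class argument: 3x is reached by single code units in the upper BMP range, and surrogate pairs top out at 2x.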

I've added some test cases.

bnewbold · 2024-09-16T19:18:33Z

this seems reasonable, though I should probably re-read more carefully and maybe cook up more corner-cases. I kind of suspect that it won't be as much of a win as the earlier grapheme cluster and utf8 caching patch though? I guess UTF-16 to UTF-8 does cost something though, and this probably does help with the happy path, and we do a lot of these, hrm.

devinivy

Good thinkin! Re: the factor of 3 in here, I am quite sure that checks out.

gaearon added 5 commits September 14, 2024 15:33

Cache length calculations between min and max

0d6b54d

Harden grapheme counter tests

7b97382

Add fast paths

8b0bdcb

Harden UTF8 length test cases

e3dabaf

Add fast path for UTF8 length check

9337c2b

gaearon requested review from bnewbold, pfrazee, devinivy and dholms September 14, 2024 17:29

gaearon changed the title ~~Add fast path for UTF8 length counting~~ Add fast path skipping UTF8 length counting Sep 14, 2024

devinivy approved these changes Oct 1, 2024
