Skip to content

Comments

local V8 regexp.c: improve case-INsensitivity support#738

Open
avih wants to merge 5 commits intogwsw:masterfrom
avih:v8-case
Open

local V8 regexp.c: improve case-INsensitivity support#738
avih wants to merge 5 commits intogwsw:masterfrom
avih:v8-case

Conversation

@avih
Copy link
Contributor

@avih avih commented Feb 17, 2026

Note: this PR is completely independent of #732 . One does not depend on the other, and if both are merged then their merge order doesn't matter.

Currently there's an issue with the local V8 regexp.c (and other engines which don't support ICASE in regcomp), that when the search is case-insensitive, then patterns like [H-a] or [@-_] (starts just before A and ends few chars after Z) or [[:upper:]] (if supported) don't work correctly.

With such engines "less" itself converts the pattern and text to lowercase, and while it works for patterns like [A-Z], it has no effect on the patterns above, and the matched chars do not reflect the case-insensitivity correctly.

This PR fixes the problem when using the local V8 regexp.c .

Two solutions are suggested.

Solution 1: Using only the first two commits.

1st commit implements regcomp2(str, lower) which, when lower is not 0, does an internal tolower on the chars which str would match.

2nd commit wires this new regcomp2 into less.

This seems ideal in the context of "less", because it's very little code, and "less" already supports converting the pattern and/or the text to lowercase,

However, it adds new semantics with re_handles_lower (name TBD) which means "the engine can do tolower in regcomp2, but not at regexec2".

Such semantics is not present in other engines, and while currently it seems that it's integrated correctly, it does "split" the behavior which was previously controled by re_handles_caseless, and can get a bit dodgy, depending on how much of the code has to be split, possibly in the future too.

If this re_handles_lower semantics is considered too delicate, then the next commits go the full length of making the local V8 support ICASE properly.

Solution 2: squash the first 4 commits, resulting in "add regcomp2 which supports icase flag with the same semantics as posix REG_ICASE".

On the plus side, proper support for ICASE in V8 means that the only change in "less" is to enable re_supports_caseless for the local V8, and use its respective regcomp2 which takes an icase argument.

On the other hand, this means that regexec2 now has more work to do - perform the tolower conversion, with malloc for the copy, which previously happened in "less".

Other than requiring more code, this is also not ideal from a performance point of view, because typically "less" already does a conversion anyway, and "tolower" is only one of the flags during this conversion, so it probably doesn't result in new allocation specifically for the tolower conversion, but, pending various optimizations, it does result in new allocation in regexec2 if the program was compiled as case-insensitive.

However, while not ideal, at least so far I couldn't measure performance differences while doing case-insensitive search in "less" between using only the first 2 commits, or all 5 commits.

But the tests were not extensive, so this "no diff" results should be considered anecdotal for now.

So I think the choice is between 3 options:

  1. We don't care about correct case-insensitivity with patterns like [[:upper:]] or [@-_] or [H-a] with the local V8.
  2. (only first 2 commits) We care, and for minimal changes of code, only add internal tolower suppport in regcomp2, make "less" use it instead of converting the pattern tolower, but "less" still convert the text tolower for regexec2.
  3. (all 5 commits, and squash the first 4) If 2 above is considered too delicate, just add proper full ICASE support for the local V8, which is a bit more code (in regexp.c), but simple integration in "less", which largely requires only to enable re_handles_caseless for the local V8.

avih added 5 commits February 17, 2026 23:57
regcomp2 is currently not used, but it could be used instead of "less"
converting the pattern to lower-case (for case-insensitive search),
and do a better job at it.

For instance the pattern [@-_] includes A-Z which "less" would want to
match a-z, but tolower of the pattern wouldn't do that. Similarly
other patterns where tolower doesn't work: [[:upper:]], [H-a], etc.

regcomp2 with lower==1 arg doesn't have this issue, because tolower
is applied to each individual char which the pattern would match,
regardless if it's part of the pattern itself or not.

Note that the caller still needs to also convert the text to lowercase
when using regexec[2], in order for the search to be case-insensitive.
With the local V8 and case-INsensitive search:

Before this commit:
- "re_handles_caseless" represents ICASE support of the engine,
  which refers both to the pattern (in regcomp) and text (in regexec).
- Convert the pattern tolower before regcomp.
- Convert the text tolower for regexec2.
- If case-sensitivity changes but the pattern is without A-Z:
  - don't recompile it.

With this commit:
- "re_handles_caseless" is the same as before, but if it's false, we
  now have a new "re_handles_lower" which means that the (V8) engine
  supports internal tolower in regcomp2, but still not in regexec2.
- tolower of the pattern is now handled by the new regcomp2 in V8.
- Text is still converted by "less" tolower before regexec2.
- If case-sensitivity changes but the pattern is without A-Z:
  - recompile the pattern, like with other engines with ICASE support.

This means that patterns like [@-_] or [H-a] now work correctly with
the local V8 regexp.c, while previously they didn't work correctly.

Also, changing of case-sensitivity at runtime now works for such
patterns with the local V8 - and can indeed match different chars.
This reverts commit b2d1120937eb91ad0ff55112d3387b7741e2a53b.

The next commits will add true support for ICASE in the local v8,
so revert the integration of regcomp2 "lower" in "less".
The "icase" arg at regcomp2 now has the semantics of posix REG_ICASE.

regcomp2 still compiles the same program as with lower==1, but now it
also sets a flag at the program that the search should be insensitive.

When it is, regexec[2] creates and works on a tolower copy of the
input, and finally adjusts the match pointers back to the original
string.

regcomp2 is still unused by "less", so this commit is currently no-op.
This integrates the new local V8 support for icase flag into less,
which now follows the code paths where re_handles_caseless==true.
@avih
Copy link
Contributor Author

avih commented Feb 17, 2026

Found a bug:
In the 2nd commit "wire regcomp2(str, lower)...", this:

#ifdef re_handles_lower

should be changed to:

#if re_handles_lower

(it's always defined, only its value changes between TRUE/FALSE - just like re_handles_caseless).

But I'm not updating it for now. I'll update it once it's decided how we proceed with this PR.

@avih
Copy link
Contributor Author

avih commented Feb 18, 2026

should be changed to:

#if re_handles_lower

Actually that's also incorrect. For some reason it's still always entered, even when re_handles_lower is defined as FALSE.

And, this also is always entered, even when re_handles_lower is defined as FALSE:

#if re_handles_lower == TRUE

I think it's related to TRUE defined as LTRUE and FALSE being defined as LFALSE, but both LTRUE and LFALSE are not defined - because they're enums? But TBH I'm not entirely sure.

However, this seems to work for the #if HAVE_REGEXEC2 part (same as the last commit - 7866015 does for the ICASE support):

#if HAVE_REGEXEC2
   /* local regexp.c handles tolower in regcomp2, but not of the text in regexec[2] */
   #define re_handles_lower      TRUE
   #define USE_REGCOMP2
#endif

together with this for the selection:

#ifdef USE_REGCOMP2
	comp = regcomp2(pattern, is_caseless);
#else
 	comp = regcomp(pattern);
#endif

I'm still not updating the commit, because it does enter the correct code paths by default when compiling for windows (it should only be broken if using a non-local V8), but if we decide to use only the first two commits, then it needs to change, maybe to something like above.

And independently, maybe instead of defining HAVE_REGEXEC2 (via configure.ac and the various defines.* files), they should define HAVE_V8_LOCAL, becuase this is what it actually means (HAVE_REGEXEC2 is specific to local V8 - it's not used with any other engine), and in pattern.h define any derivatives of HAVE_V8_LOCAL, like HAVE_REGEXEC2 and/or HAVE_REGCOMP2. Or maybe just check HAVE_V8_LOCAL where appropriate without defining the derivatives. But that's for another time.

@avih
Copy link
Contributor Author

avih commented Feb 18, 2026

should be changed to:

#if re_handles_lower

Actually that's also incorrect. For some reason it's still always entered, even when re_handles_lower is defined as FALSE.

Well, incorrect again. It's actually never entered regardless of whether re_handles_lower is defined as TRUE or FALSE.

I think it's related to TRUE defined as LTRUE ...

Finally something is correct. The rest of that comment is also correct, including the suggested solution.

Here's a test case:

#include <stdio.h>

/* definitions copied from lang.h of "less" */
typedef enum lbool { LFALSE, LTRUE } lbool;

#undef  TRUE
#define TRUE  LTRUE
#undef  FALSE
#define FALSE LFALSE

int main(void) {

#if x
    puts("#if x");
#endif
#if x == TRUE
    puts("#if x == TRUE");
#endif

    if (x)
        puts("(x) is true");
    else
        puts("(x) is false");

    return 0;
}

When compiled as cc -Dx=TRUE test.c it prints:

#if x == TRUE
(x) is true

and when compiled as cc -Dx=FALSE test.c it prints:

#if x == TRUE
(x) is false

So regardless if x is defined as TRUE or FALSE, #if x is never entered, and #if x == TRUE is always entered, and only if (x) (actual C code) behaves correctly depending on how it's defined.

@gwsw
Copy link
Owner

gwsw commented Feb 18, 2026

Thanks, I will try to evaluate this issue and #732 soon. Just FYI, it may be a few days before I get to it, because I have just tested positive for COVID today and am feeling pretty miserable.

@avih
Copy link
Contributor Author

avih commented Feb 18, 2026

Ouch... no worries. Take your time and get well soon!

@avih
Copy link
Contributor Author

avih commented Feb 21, 2026

2. (only first 2 commits) We care, and for minimal changes of code, only add internal tolower suppport in regcomp2, make "less" use it instead of converting the pattern tolower, but "less" still convert the text tolower for regexec2.

I just realized that this solution has a potential pitfall (in addition to the inconvenience of splitting the semantics of re_supports_caseless): the notion of "lower case" should be the same between the new regcomp2 which takes a lower argument, and CVT_TO_LC which less would use with the text for regcomp2. If it's not the same, then it can result in some incorrect matches.

I think it's OK currently with the first 2 commits, because both only do tolower for ASCII A-Z? but I'm not convinced, and also, even if now it's OK, one of them could still potentially change in the future (for instance CVT_TO_LC learns to work on UTF-8 also for non-ASCII codepoints).

So it's something to take into consideration.

The cleanest approach is probably that regexp.c takes care of both the pattern and text, and converts both of them the same way to lower (solution 3 - all 5 commits).

I really need to find a way to benchmark the approaches against eachother in the context of "less" with some good inputs of texts and patterns. So far my method was to hack WIN32getch to return hardcoded f (perform "page down" as fast as possible), together with no-op of win_flush (skip the actual output and SGR conversions), and time it with time less -E MYFILE +/PATTERN, but as noted, the I couldn't identify differences above the noise level. It's possible that my sample inputs were not representative, because I really expected the additional malloc in regcomp2 to have measurable impact on the timing...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants