local V8 regexp.c: improve case-INsensitivity support#738
local V8 regexp.c: improve case-INsensitivity support#738avih wants to merge 5 commits intogwsw:masterfrom
Conversation
regcomp2 is currently not used, but it could be used instead of "less" converting the pattern to lower-case (for case-insensitive search), and do a better job at it. For instance the pattern [@-_] includes A-Z which "less" would want to match a-z, but tolower of the pattern wouldn't do that. Similarly other patterns where tolower doesn't work: [[:upper:]], [H-a], etc. regcomp2 with lower==1 arg doesn't have this issue, because tolower is applied to each individual char which the pattern would match, regardless if it's part of the pattern itself or not. Note that the caller still needs to also convert the text to lowercase when using regexec[2], in order for the search to be case-insensitive.
With the local V8 and case-INsensitive search: Before this commit: - "re_handles_caseless" represents ICASE support of the engine, which refers both to the pattern (in regcomp) and text (in regexec). - Convert the pattern tolower before regcomp. - Convert the text tolower for regexec2. - If case-sensitivity changes but the pattern is without A-Z: - don't recompile it. With this commit: - "re_handles_caseless" is the same as before, but if it's false, we now have a new "re_handles_lower" which means that the (V8) engine supports internal tolower in regcomp2, but still not in regexec2. - tolower of the pattern is now handled by the new regcomp2 in V8. - Text is still converted by "less" tolower before regexec2. - If case-sensitivity changes but the pattern is without A-Z: - recompile the pattern, like with other engines with ICASE support. This means that patterns like [@-_] or [H-a] now work correctly with the local V8 regexp.c, while previously they didn't work correctly. Also, changing of case-sensitivity at runtime now works for such patterns with the local V8 - and can indeed match different chars.
This reverts commit b2d1120937eb91ad0ff55112d3387b7741e2a53b. The next commits will add true support for ICASE in the local v8, so revert the integration of regcomp2 "lower" in "less".
The "icase" arg at regcomp2 now has the semantics of posix REG_ICASE. regcomp2 still compiles the same program as with lower==1, but now it also sets a flag at the program that the search should be insensitive. When it is, regexec[2] creates and works on a tolower copy of the input, and finally adjusts the match pointers back to the original string. regcomp2 is still unused by "less", so this commit is currently no-op.
This integrates the new local V8 support for icase flag into less, which now follows the code paths where re_handles_caseless==true.
|
Found a bug: #ifdef re_handles_lowershould be changed to: #if re_handles_lower
(it's always defined, only its value changes between TRUE/FALSE - just like But I'm not updating it for now. I'll update it once it's decided how we proceed with this PR. |
Actually that's also incorrect. For some reason it's still always entered, even when And, this also is always entered, even when #if re_handles_lower == TRUE
I think it's related to However, this seems to work for the #if HAVE_REGEXEC2
/* local regexp.c handles tolower in regcomp2, but not of the text in regexec[2] */
#define re_handles_lower TRUE
#define USE_REGCOMP2
#endiftogether with this for the selection: #ifdef USE_REGCOMP2
comp = regcomp2(pattern, is_caseless);
#else
comp = regcomp(pattern);
#endifI'm still not updating the commit, because it does enter the correct code paths by default when compiling for windows (it should only be broken if using a non-local V8), but if we decide to use only the first two commits, then it needs to change, maybe to something like above. And independently, maybe instead of defining HAVE_REGEXEC2 (via |
Well, incorrect again. It's actually never entered regardless of whether
Finally something is correct. The rest of that comment is also correct, including the suggested solution. Here's a test case: #include <stdio.h>
/* definitions copied from lang.h of "less" */
typedef enum lbool { LFALSE, LTRUE } lbool;
#undef TRUE
#define TRUE LTRUE
#undef FALSE
#define FALSE LFALSE
int main(void) {
#if x
puts("#if x");
#endif
#if x == TRUE
puts("#if x == TRUE");
#endif
if (x)
puts("(x) is true");
else
puts("(x) is false");
return 0;
}When compiled as and when compiled as So regardless if |
|
Thanks, I will try to evaluate this issue and #732 soon. Just FYI, it may be a few days before I get to it, because I have just tested positive for COVID today and am feeling pretty miserable. |
|
Ouch... no worries. Take your time and get well soon! |
I just realized that this solution has a potential pitfall (in addition to the inconvenience of splitting the semantics of I think it's OK currently with the first 2 commits, because both only do tolower for ASCII A-Z? but I'm not convinced, and also, even if now it's OK, one of them could still potentially change in the future (for instance CVT_TO_LC learns to work on UTF-8 also for non-ASCII codepoints). So it's something to take into consideration. The cleanest approach is probably that regexp.c takes care of both the pattern and text, and converts both of them the same way to lower (solution 3 - all 5 commits). I really need to find a way to benchmark the approaches against eachother in the context of "less" with some good inputs of texts and patterns. So far my method was to hack |
Note: this PR is completely independent of #732 . One does not depend on the other, and if both are merged then their merge order doesn't matter.
Currently there's an issue with the local V8 regexp.c (and other engines which don't support ICASE in regcomp), that when the search is case-insensitive, then patterns like
[H-a]or[@-_](starts just beforeAand ends few chars afterZ) or[[:upper:]](if supported) don't work correctly.With such engines "less" itself converts the pattern and text to lowercase, and while it works for patterns like
[A-Z], it has no effect on the patterns above, and the matched chars do not reflect the case-insensitivity correctly.This PR fixes the problem when using the local V8 regexp.c .
Two solutions are suggested.
Solution 1: Using only the first two commits.
1st commit implements
regcomp2(str, lower)which, whenloweris not 0, does an internaltoloweron the chars whichstrwould match.2nd commit wires this new
regcomp2into less.This seems ideal in the context of "less", because it's very little code, and "less" already supports converting the pattern and/or the text to lowercase,
However, it adds new semantics with
re_handles_lower(name TBD) which means "the engine can do tolower in regcomp2, but not at regexec2".Such semantics is not present in other engines, and while currently it seems that it's integrated correctly, it does "split" the behavior which was previously controled by
re_handles_caseless, and can get a bit dodgy, depending on how much of the code has to be split, possibly in the future too.If this
re_handles_lowersemantics is considered too delicate, then the next commits go the full length of making the local V8 support ICASE properly.Solution 2: squash the first 4 commits, resulting in "add regcomp2 which supports icase flag with the same semantics as posix REG_ICASE".
On the plus side, proper support for ICASE in V8 means that the only change in "less" is to enable
re_supports_caselessfor the local V8, and use its respectiveregcomp2which takes anicaseargument.On the other hand, this means that
regexec2now has more work to do - perform the tolower conversion, withmallocfor the copy, which previously happened in "less".Other than requiring more code, this is also not ideal from a performance point of view, because typically "less" already does a conversion anyway, and "tolower" is only one of the flags during this conversion, so it probably doesn't result in new allocation specifically for the tolower conversion, but, pending various optimizations, it does result in new allocation in
regexec2if the program was compiled as case-insensitive.However, while not ideal, at least so far I couldn't measure performance differences while doing case-insensitive search in "less" between using only the first 2 commits, or all 5 commits.
But the tests were not extensive, so this "no diff" results should be considered anecdotal for now.
So I think the choice is between 3 options:
[[:upper:]]or[@-_]or[H-a]with the local V8.re_handles_caselessfor the local V8.