Authors: Aurèle Barrière and Clément Pit-Claudel.
This is a linear regular expression engine for a subset of JavaScript regexes. The underlying algorithm is an extension of the PikeVM, supporting more JavaScript features. This engine implements the algorithms described in the paper Linear Matching of JavaScript Regular Expressions by the same authors.
In particular, it supports, for the first time with linear time and space complexity:
- nullable JavaScript quantifiers (these have different semantics than in other regex languages, see for instance
(a?b??)*on string "ab") - capture reset, a JavaScript-specific property where capture groups are reset at each quantifier iteration (for instance
((a)|(b))*on string "ab") - all lookarounds (lookahads and lookbehinds), even with capture groups inside
- linear matching of the greedy or nullable plus.
RegElk means Regex Engine with Linear looKarounds. Elks are diagonal walkers, meaning that they reuse their front legs prints for their rear legs to conserve energy, evoking how a PikeVM merges threads reaching the same state to preserve linearity.
Given a regex of size |r| and a string of size |s|, this engine has linear worst-case time complexity in both of them O(|r|*|s|).
While counted quantifiers are supported, they increase the regex size.
For instance, e{4-8} will multiply the size of e 8 times.
However, the greedy plus (+ or {1,}) or the nonnullable lazy plus (as in (ab)+?) are handled without duplication.
The engine also has O(|r|*|s|) space complexity.
If one wants to avoid a string-size dependent space complexity, we provide alternative register data-structures, presenting various time-space complexity tradeoff.
| Time Complexity | Space Complexity | |
|---|---|---|
| List (default) | O(|r|*|s|) |
O(|r|*|s|) |
| Array | O(|r|^2*|s|) |
O(|r|^2) |
| Tree | O(|r|*log(|r|)*|s|) |
O(|r|^2) |
Note however that a O(|r|*|s|) space complexity cannot be avoided when using our linear lookaround algorithm.
| Feature | Example |
|---|---|
| Lookaheads | a(?=(b)), a(?!=b) |
| Lookbehinds | (?<=b)a, (?<!b)a |
| Capture Groups | (a*)b |
| Noncapturing Groups | (?:a*)b |
| Greedy Quantifiers | *, +, ? |
| Lazy Quantifiers | *?, +?, ?? |
| Counted Quantifiers | a{6,12}, a{7,}, a{9}, a{4,5}? |
| Character Classes | [a-z], [^h], [aeiouy] |
| Character Groups | \w, \d, \s, \W, \D, \S |
| Anchors | $, ^ |
| Word Boundaries | \b, \B |
Backreferences are not supported, as they make the matching problem NP-hard. Named capture groups, hexadecimal escapes, unicode escapes, unicode properties and regex flags are not supported yet, although they could be in the future.
You need the following Opam packages. Other version numbers may also work.
- Ocaml 5.0
- ocamlbuild 0.14.1
- Menhir 20220210
- ocaml_intrinsics v0.15.2
- core_bench v0.15.0
- core v0.15.1
- core_unix v0.15.2
- yojson 2.1.0
You also need to install Node.JS and have node in your path.
Build all executables with make.
Make sure to have configured your opam switch so that it has all dependencies listed above.
This creates several executables:
main.nativeis the Ocaml matcherfuzzer.nativeis a fuzzer that compares the OCaml matcher to the Irregexp engine of V8 in Nodetests.nativecontains a battery of tests that should all succeedstats.nativecomputes regex feature usage statistics from corpora of regexesbenchmark.nativeallows you to run benchmarksmatcher.nativeandlinearbaseline.nativeare only used for the benchmarks and you should not run them directly
main.native, fuzzer.native and benchmark.native have command line options that can be printed with the argument --help.
- the main entry point of the engine is in the
main.mlfile - regexes and regex annotation are found in
regex.ml - the extended bytecode NFA representation is defined in
bytecode.ml - compilation from a regex to bytecode is found in
compiler.ml - the NFA simulation algorithm, with all our extensions, is implemented in
interpreter.ml - the CDN plus formulas are defined in
cdn.ml - the oracles used y the lookaround algorithm is defined in
oracle.ml - the three capture registers implementations are defined in
regs.ml - character classes are implemented in
charclasses.ml - the ECMA-style regex parser is defined in
parser_src/
- the differential fuzzer is implemented in
fuzzer.ml - the computation of statistics on regex features usage is defined in
stats.ml - the
scripts_benchdirectory contains scripts called by the benchmarks or the fuzzer to compare the OCaml engine to other engines.
- The linear engine from V8 is called "V8Linear" in the paper. It is sometimes called "Experimental" or "Exp" in the code (as this is the name used by the V8 developers).
- The bytecode instructions
ConsumeandConsumeAnyfrom Figure 4 are replaced by a singleConsumeinstruction inbytecode.ml, which takes as argument either a character or a list of character ranges. - The
Jumpinstruction is calledJmpin the code. SetRegis calledSetRegisterToCP.SetQuantis calledSetQuantToClock.CheckNullis calledCheckNullable.SetNullPlusin the paper is not an independent instruction in the code. InsteadSetQuantToClockencodes bothSetQuantandSetNullPlus: it takes a boolean argument indicating if this quantifier is a nulled plus or not.- The "Balanced Tree" register implementation in the paper is renamed to
Map_Regsin the code.
This is implemented by the functions advance_epsilon and find_match in interpreter.ml.
- In
compiler.ml, line 112, you can see the bytecode compilation of a quantifier and see thatBeginLoopandEndLoopinstructions are inserted. - In
interpreter.ml, the thread booleanexit_allowedencodes in which automata the thread is. See at lines 412 and 417 how the two new instructions are implemented.
- Threads are augmented with clocks in
quant_regs(line 107 ofinterpreter.ml). - The filtering algorithm is implemented at line 285 of
interpreter.ml.
- The oracle table is defined in
oracle.ml. - The first phase is implemented as the
build_oraclefunction, line 651 ofinterpreter.ml. - The second phase is simply the
find_matchfunction defined previously. - The third phase is implemented as the
build_capturefunction, line 680 ofinterpreter.ml.
Switch to the strlb directory for this algorithm, and see the corresponding README.md file.
- The nullability analysis of Section 4.5.1 and Figure 12 starts at line 139 in
regex.ml. - The non-nullable plus case of Section 4.5.2 is implemented at line 78 in
compiler.ml. - For Section 4.5.3, see lines 90 and 101 of
compiler.ml. - The nullability formulas of footnote 8 are defined in
cdn.mland called "CDN formulas".
- The three different register data-structures are defined in
regs.ml:Array_Regs,List_RegsandMap_Regs. - At line 28 of
interpreter.ml, see that the interpreter is parameterized by a register implementation (aREGSmodule). - The benchmark used for Figure 15 is defined as
dsarray,dslistanddstreeinbenchmark_vectors.ml.
- All regex corpora are in the
corpusdirectory. - The
stats.mlfile uses the parser to analyze each regex and see in which categories it belongs to.
- (C1) the benchmark used for Figure 17 is named "Clocks", defined line 117 of
benchmark_vectors. - (C2) the benchmark used for Figure 18 is named "NNPlus", defined line 69.
- (C3) the benchmark used for Figure 19 is named "CDN", defined line 95.
- (C4) the benchmark used for Figure 20 is named "LBstr", defined line 179.
- (C5) the benchmark used for Figure 21 is named "LAreg", defined line 138.
- (C5) the benchmark used for Figure 22 is named "LAstr", defined line 157.
- For every pair of regex and string discussed in the paper, we added this test to our test suite.
- This list is the
paper_testslist at line 290 oftests.ml. - This test suite is executed for each of the three register implementations when running
tests.native.
