RegExp Implementation

Roger Johansson edited this page Jan 14, 2026 · 1 revision

The JsRegExp class (JsTypes/JsRegExp.cs) implements JavaScript regular expressions by translating JS regex syntax to .NET System.Text.RegularExpressions.Regex.


Overview

flowchart LR
    JS[/"JS Pattern + Flags"/] --> Validate((Validate Flags))
    Validate --> Normalize((NormalizePattern))
    Normalize --> Options((Map to RegexOptions))
    Options --> Compile((".NET Regex"))
    Compile --> Test((Test/Exec/Match))

Key Files

| File | Purpose |
| --- | --- |
| JsTypes/JsRegExp.cs | Core RegExp implementation |
| StdLib/RegExpPrototype.cs | Prototype methods (match, replace, split, etc.) |
| RealmState.UpdateRegExpStatics() | Legacy static properties ($1-$9, input, etc.) |

Flag Support

| Flag | Property | RegexOptions | Notes |
| --- | --- | --- | --- |
| g | Global | - | Affects lastIndex behavior |
| i | IgnoreCase | IgnoreCase | Case-insensitive matching |
| m | Multiline | Multiline | ^/$ match line boundaries |
| s | DotAll | - | . matches newlines (handled in pattern) |
| u | Unicode | - | Full Unicode mode (pattern translation) |
| y | Sticky | - | Anchor at lastIndex |
| d | HasIndices | - | Include indices array in results |
| v | UnicodeSets | - | Unicode property escapes (like u) |

Flag Validation

private static void ValidateFlags(string flags)
{
    var seen = new HashSet<char>();
    foreach (var flag in flags)
    {
        if (!seen.Add(flag))
            throw new ParseException($"Duplicate flag '{flag}'");
        
        if (flag is not ('g' or 'i' or 'm' or 'u' or 'y' or 's' or 'd' or 'v'))
            throw new ParseException($"Invalid flag '{flag}'");
        
        // 'u' and 'v' are mutually exclusive
        if ((flag == 'u' && seen.Contains('v')) ||
            (flag == 'v' && seen.Contains('u')))
            throw new ParseException("Flags 'u' and 'v' cannot both be set");
    }
}

Pattern Translation

JavaScript regex syntax differs from .NET in several ways. The NormalizePattern method handles translation.

Translation Flow

flowchart TD
    Input((Input Pattern)) --> Check{Unicode Flag?}
    Check -->|Yes| Unicode((NormalizePattern))
    Check -->|No| Legacy((NormalizeLegacyPattern))
    
    Unicode --> Groups((Collect Group Names))
    Groups --> Parse((Parse + Transform))
    Parse --> Surrogates((Handle Surrogates))
    Surrogates --> Output((Normalized .NET Pattern))
    
    Legacy --> LGroups((Collect Group Names))
    LGroups --> LOctal((Handle Octal Escapes))
    LOctal --> LBackref((Handle Backreferences))
    LBackref --> Output

Unicode Mode Transformations

| JS Syntax | .NET Translation | Reason |
| --- | --- | --- |
| `.` | Complex surrogate-aware pattern | Match full code points, not just BMP |
| `\S` | Surrogate-aware non-whitespace | Include astral code points |
| `\u{1F600}` | `(?:\uD83D\uDE00)` | Surrogate pair for code point > 0xFFFF |
| `[...]` with astral | `(?:[bmp]\|astral\|...)` | Split BMP and astral ranges |
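
The `\u{1F600}` row can be illustrated with a short sketch. The class and method names below are hypothetical and are not taken from JsRegExp; the sketch only assumes the standard `char.ConvertFromUtf32` API and shows how a code point above 0xFFFF could be rendered as an escaped surrogate pair.

```csharp
using System.Text;

public static class SurrogateEscapeSketch
{
    // Hypothetical helper: renders one code point as a .NET regex escape.
    public static string ToRegexEscape(int codePoint)
    {
        // ConvertFromUtf32 returns one UTF-16 unit for BMP code points and a
        // high/low surrogate pair for astral ones. It throws for values in
        // the surrogate range or above 0x10FFFF.
        string utf16 = char.ConvertFromUtf32(codePoint);

        var sb = new StringBuilder();
        foreach (char unit in utf16)
            sb.Append("\\u").Append(((int)unit).ToString("X4"));

        // Group a pair so that a following quantifier applies to the whole
        // code point rather than to the low surrogate alone.
        return utf16.Length == 2 ? $"(?:{sb})" : sb.ToString();
    }
}
```

Under these assumptions, `ToRegexEscape(0x1F600)` yields `(?:\uD83D\uDE00)`, matching the table row, while a BMP code point such as 0x41 yields `\u0041` without a group.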

Unicode Dot Pattern

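private const string UnicodeDotPattern =
    // Never start on the low half of a pair; then any non-surrogate BMP
    // character except line terminators. Surrogates are excluded here so the
    // first alternative cannot split a valid pair.
    @"(?<![\uD800-\uDBFF])(?:[^\n\r\u2028\u2029\uD800-\uDFFF]|" +
    @"[\uD800-\uDBFF][\uDC00-\uDFFF]|" +        // valid surrogate pair
    @"[\uD800-\uDBFF](?![\uDC00-\uDFFF])|" +    // isolated high surrogate
    @"[\uDC00-\uDFFF])";                        // isolated low surrogate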

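The leading lookbehind prevents a match from starting on the low half of a surrogate pair. The alternation then matches, in order: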

  1. Any BMP character except line terminators
  2. Valid surrogate pairs (full astral code points)
  3. Isolated high surrogates (error recovery)
  4. Isolated low surrogates (error recovery)

Surrogate Pair Handling

flowchart LR
    subgraph Input
        High["High Surrogate\n0xD800-0xDBFF"]
        Low["Low Surrogate\n0xDC00-0xDFFF"]
    end
    
    subgraph Validation
        Check{Valid Pair?}
    end
    
    subgraph Output
        CodePoint["Convert to\nCode Point"]
        Error["ParseException"]
    end
    
    High --> Check
    Low --> Check
    Check -->|"High + Low"| CodePoint
    Check -->|"Isolated"| Error

The parser applies this check to raw surrogates found in the pattern:

if (char.IsHighSurrogate(c))
{
    if (i + 1 >= pattern.Length || !char.IsLowSurrogate(pattern[i + 1]))
        throw new ParseException("Invalid unicode escape.");
    
    var cp = char.ConvertToUtf32(c, pattern[i + 1]);
    AppendCodePoint(builder, cp, hasUnicodeFlag, ignoreCase, false);
    i++;
    continue;
}

if (char.IsLowSurrogate(c))
    throw new ParseException("Invalid unicode escape.");

Named Groups

Forward Reference Handling

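JavaScript allows a named backreference to appear before the group it refers to; until that group has captured, the reference matches the empty string. In .NET, a backreference to a group that has not captured never matches, so forward references are rewritten as conditional patterns: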

if (definedSoFar.Contains(normalizedName))
{
    // Backward reference: group already defined
    builder.Append(pattern, i, end - i + 1);
}
else
{
    // Forward reference: conditional to match empty if not yet captured
    builder.Append("(?(");
    builder.Append(normalizedName);
    builder.Append(")\\k<");
    builder.Append(normalizedName);
    builder.Append(">|)");
}

For example, /\k<foo>(?<foo>bar)/u becomes (?(foo)\k<foo>|)(?<foo>bar): the conditional matches the empty string until foo has captured, and behaves as a normal backreference afterwards.
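
The difference can be checked directly against .NET. This is a minimal sketch, not the engine's code: the class name is hypothetical, and it relies on the .NET behaviour that a backreference to a group without a capture never matches when `RegexOptions.ECMAScript` is not set.

```csharp
using System.Text.RegularExpressions;

public static class ForwardReferenceSketch
{
    // Direct port of /\k<foo>(?<foo>bar)/: the backreference is evaluated
    // before 'foo' has captured, so .NET rejects every input.
    public const string Untranslated = @"\k<foo>(?<foo>bar)";

    // Conditional form: use the backreference only once 'foo' has captured,
    // otherwise take the empty alternative.
    public const string Translated = @"(?(foo)\k<foo>|)(?<foo>bar)";

    public static bool Matches(string pattern, string input) =>
        Regex.IsMatch(input, pattern, RegexOptions.CultureInvariant);
}
```

With these patterns, `Matches(Translated, "bar")` succeeds as the JavaScript original would, while `Matches(Untranslated, "bar")` does not.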

Group Name Validation

flowchart TD
    Name((Raw Group Name)) --> Decode((DecodeGroupName))
    Decode --> Runes((List of Runes))
    Runes --> First{First Char?}
    First -->|ID Start| Rest{Rest Chars?}
    First -->|Invalid| Error((ParseException))
    Rest -->|All ID Part| Valid((Normalized Name))
    Rest -->|Invalid| Error

Valid identifier start characters:

  • $, _
  • Unicode categories: Lu, Ll, Lt, Lm, Lo, Nl

Valid identifier part characters:

  • All start characters plus
  • Unicode categories: Nd, Pc, Mn, Mc

Legacy Mode (No Unicode Flag)

Octal Escape Handling

if (allOctal && octalDigits > 0)
{
    var effectiveValue = octalValue;
    var effectiveDigits = octalDigits;
    
    // Cap at 0xFF: drop trailing octal digits until the value fits in one byte
    while (effectiveValue > 0xFF && effectiveDigits > 1)
    {
        effectiveValue >>= 3;
        effectiveDigits--;
    }
    
    AppendCodePoint(builder, effectiveValue, false, ignoreCase, true);
    i = start + effectiveDigits - 1;
    continue;
}
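
A worked example of the capping loop, as a standalone sketch. The helper name is hypothetical, and the input is assumed to contain only the digits 0-7 (at most three of them), which is what the surrounding parser would have collected.

```csharp
public static class OctalEscapeSketch
{
    // Returns the resulting character value and how many digits the escape
    // actually consumes; any remaining digits are ordinary literals.
    public static (int Value, int Digits) Cap(string octalDigits)
    {
        int value = 0;
        foreach (char d in octalDigits)
            value = value * 8 + (d - '0');

        int digits = octalDigits.Length;
        while (value > 0xFF && digits > 1)
        {
            value >>= 3; // shifting by 3 bits removes the last octal digit
            digits--;
        }
        return (value, digits);
    }
}
```

For `\400`, the three digits give 256, which exceeds 0xFF, so the loop drops the final digit: the escape becomes `\40` (value 32, a space) followed by a literal `0`. For `\377` the value is exactly 255 and all three digits are kept.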

Backreference vs Octal Disambiguation

// If it's a valid backreference, use it
if (value > 0 && value <= totalCaptures)
{
    if (value <= captureCount)
        builder.Append($"\\{numText}");  // Backreference
    else
        builder.Append("(?:)");          // Forward ref = empty
    continue;
}

// Otherwise, treat as octal

lastIndex Behavior

The lastIndex property controls where matching starts for global/sticky regexps:

flowchart TD
    Start((Start Match)) --> Check{Global or Sticky?}
    Check -->|No| Zero["Start at 0"]
    Check -->|Yes| GetLast["Get lastIndex"]
    GetLast --> Range{In Range?}
    Range -->|Yes| Match((Attempt Match))
    Range -->|No| Reset["Reset lastIndex = 0"]
    Match --> Success{Match Found?}
    Success -->|Yes| Update["lastIndex = match.end"]
    Success -->|No| Reset
    Update --> Return((Return Result))
    Reset --> Return
    Zero --> Return
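
The Test method follows this flow:

public bool Test(string input)
{
    var tracksLastIndex = Global || Sticky;
    var startIndex = tracksLastIndex ? GetLastIndex() : 0;

    // An out-of-range lastIndex fails the match and resets lastIndex
    if (startIndex > input.Length)
    {
        SetLastIndex(0);
        return false;
    }

    var match = EnsureRegex().Match(input, startIndex);

    // Global and sticky regexps advance lastIndex on success and reset it on failure
    if (tracksLastIndex)
        SetLastIndex(match.Success ? match.Index + match.Length : 0);

    return match.Success;
}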
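
To see how lastIndex evolves across repeated calls, here is a self-contained sketch that follows the flowchart for a global regexp. `GlobalRegExpSketch` is a hypothetical stand-in written against the plain .NET `Regex` class; it is not the engine's JsRegExp and ignores the sticky flag.

```csharp
using System.Text.RegularExpressions;

// Hypothetical stand-in for a regexp created with the g flag.
public sealed class GlobalRegExpSketch
{
    private readonly Regex _regex;

    public int LastIndex { get; set; }

    public GlobalRegExpSketch(string pattern) =>
        _regex = new Regex(pattern, RegexOptions.CultureInvariant);

    public bool Test(string input)
    {
        // Out of range: fail and reset, without attempting a match.
        if (LastIndex > input.Length)
        {
            LastIndex = 0;
            return false;
        }

        Match match = _regex.Match(input, LastIndex);

        // Success moves lastIndex past the match; failure resets it.
        LastIndex = match.Success ? match.Index + match.Length : 0;
        return match.Success;
    }
}
```

Calling `Test("aba")` three times on `new GlobalRegExpSketch("a")` returns true with LastIndex 1, true with LastIndex 3, and then false with LastIndex back at 0, so a fourth call starts again from the beginning of the input.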

Exec Results

The Exec method returns a specialized array with additional properties:

flowchart LR
    Match((Match Object)) --> Array((JsArray))
    Array --> Index["index: match position"]
    Array --> Input["input: original string"]
    Array --> Groups["groups: named captures"]
    Array --> Indices["indices: (if 'd' flag)"]
    
    subgraph Elements
        E0["[0]: full match"]
        E1["[1]: capture 1"]
        EN["[n]: capture n"]
    end
    
    Array --> Elements

CreateMatchArray assembles this structure:

private JsArray CreateMatchArray(Match match, string input)
{
    var result = new JsArray(RealmState);
    
    // Full match + capture groups
    for (var i = 0; i < match.Groups.Count; i++)
    {
        var group = match.Groups[i];
        result.Push(group.Success ? new JsValue(group.Value) : JsValue.Undefined);
    }

    // Add properties
    result.SetProperty("index", (double)match.Index);
    result.SetProperty("input", input);
    result.SetProperty("groups", BuildGroupsObject(match, captureValues));
    
    if (HasIndices)
        result.SetProperty("indices", BuildIndicesArray(match));

    return result;
}

Indices Array (d flag)

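When the d flag is set, the result carries an indices array holding a [start, end] pair for each capture, where end is exclusive. Captures that did not participate in the match are undefined: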

private JsArray BuildIndicesArray(Match match)
{
    var indices = new JsArray(RealmState);
    
    for (var i = 0; i < match.Groups.Count; i++)
    {
        var group = match.Groups[i];
        if (group.Success)
        {
            var pair = new JsArray(RealmState);
            pair.Push((double)group.Index);
            pair.Push((double)(group.Index + group.Length));
            indices.Push(JsValue.FromJsArray(pair));
        }
        else
        {
            indices.Push(JsValue.Undefined);
        }
    }
    
    indices.SetProperty("groups", BuildIndicesGroupsObject(match, regex, indexValues));
    return indices;
}
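
The same start/end computation can be tried without the engine's JsArray type. In this sketch, which uses hypothetical names, a nullable tuple stands in for each JS pair and null stands in for undefined.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IndicesSketch
{
    // One entry per group; entry 0 is the full match.
    public static List<(int Start, int End)?> Build(Match match)
    {
        var indices = new List<(int Start, int End)?>();
        for (var i = 0; i < match.Groups.Count; i++)
        {
            Group group = match.Groups[i];
            if (group.Success)
                indices.Add((group.Index, group.Index + group.Length)); // end is exclusive
            else
                indices.Add(null); // group did not participate
        }
        return indices;
    }
}
```

For `Regex.Match("xb", "(a)|(b)")`, the full match and group 2 both cover positions 1 to 2, while group 1 never participates and is reported as null.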

Character Class Translation

Unicode Character Classes

Character classes compiled with the Unicode flag need special handling for astral code points:

flowchart TD
    Class[/"[abc\u{1F600}]"/] --> Parse((Parse Ranges))
    Parse --> Split{Code Point > 0xFFFF?}
    Split -->|Yes| Astral["Astral Ranges"]
    Split -->|No| BMP["BMP Ranges"]
    BMP --> BuildBMP((BuildBmpClassContent))
    Astral --> BuildAstral((BuildAstralAlternation))
    BuildBMP --> Combine((Combine))
    BuildAstral --> Combine
    Combine --> Result[/"(?:[bmp]|astral1|astral2)"/]

Negated Unicode Classes

// For negated classes like [^abc], use negative lookahead
return $"(?:(?!{disallowed}){AnyCodePointPattern})";
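
A runnable sketch of the lookahead technique follows. The value of AnyCodePointPattern is not shown in this page, so the constant below is an assumed stand-in (a surrogate pair, otherwise any single UTF-16 unit), and the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class NegatedClassSketch
{
    // Assumed stand-in for AnyCodePointPattern.
    private const string AnyCodePoint = @"(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\s\S])";

    // 'disallowed' is a .NET pattern matching every member of the class.
    public static string Negate(string disallowed) =>
        $"(?:(?!{disallowed}){AnyCodePoint})";

    // True when the whole input is exactly one code point accepted by the class.
    public static bool MatchesWhole(string negatedClass, string input) =>
        Regex.IsMatch(input, $"^{negatedClass}$", RegexOptions.CultureInvariant);
}
```

For a JS class like `[^abc\u{1F600}]`, the disallowed pattern could be `(?:[abc]|\uD83D\uDE00)`. The result then accepts `d` and U+1F601 but rejects `a` and U+1F600, because the lookahead inspects the full surrogate pair before any code point is consumed.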

RegexOptions Mapping

var options = RegexOptions.CultureInvariant;

if (IgnoreCase)
    options |= RegexOptions.IgnoreCase;

if (Multiline)
    options |= RegexOptions.Multiline;

// Note: DotAll is handled via pattern transformation, not options

The engine always uses CultureInvariant to ensure consistent matching across cultures.


Static Properties (Legacy)

For compatibility, the engine maintains static regex properties:

RealmState.UpdateRegExpStatics(input, match);

This updates properties like:

  • RegExp.$1 through RegExp.$9
  • RegExp.input / RegExp.$_
  • RegExp.lastMatch / RegExp.$&
  • RegExp.lastParen / RegExp.$+
  • RegExp.leftContext / RegExp.$`
  • RegExp.rightContext / RegExp.$'

Performance Considerations

Lazy Compilation

private Regex? _compiledRegex;

private Regex EnsureRegex()
{
    return _compiledRegex ??= new Regex(_normalizedPattern, _regexOptions);
}

The .NET Regex is constructed only on first use, so creating a RegExp object that is never executed stays cheap.

Pattern Caching

Since pattern normalization is deterministic, the same JS pattern and flags always produce the same .NET pattern. The normalized pattern is stored in _normalizedPattern and reused when the Regex is constructed.


Edge Cases

Kelvin Sign (U+212A)

if (!unicodeMode && ignoreCase && codePoint == 0x212A)
{
    // Kelvin sign should not case-fold to 'K' in legacy mode
    builder.Append("(?-i:\\u212A)");
    return;
}

Empty Pattern

if (string.IsNullOrEmpty(pattern))
    return pattern;

Incomplete Escapes

if (i + 1 >= pattern.Length || IsLineTerminator(pattern[i + 1]))
    throw new ParseException("Invalid regular expression: incomplete escape.");
