RegExp Implementation
The JsRegExp class (JsTypes/JsRegExp.cs) implements JavaScript regular expressions by translating JS regex syntax to .NET System.Text.RegularExpressions.Regex.
```mermaid
flowchart LR
    JS[/"JS Pattern + Flags"/] --> Validate((Validate Flags))
    Validate --> Normalize((NormalizePattern))
    Normalize --> Options((Map to RegexOptions))
    Options --> Compile((".NET Regex"))
    Compile --> Test((Test/Exec/Match))
```
| File | Purpose |
|---|---|
| `JsTypes/JsRegExp.cs` | Core RegExp implementation |
| `StdLib/RegExpPrototype.cs` | Prototype methods (match, replace, split, etc.) |
| `RealmState.UpdateRegExpStatics()` | Legacy static properties ($1-$9, input, etc.) |
| Flag | Property | RegexOptions | Notes |
|---|---|---|---|
| `g` | Global | - | Affects lastIndex behavior |
| `i` | IgnoreCase | IgnoreCase | Case-insensitive matching |
| `m` | Multiline | Multiline | `^`/`$` match line boundaries |
| `s` | DotAll | - | `.` matches newlines (handled in pattern) |
| `u` | Unicode | - | Full Unicode mode (pattern translation) |
| `y` | Sticky | - | Anchor at lastIndex |
| `d` | HasIndices | - | Include indices array in results |
| `v` | UnicodeSets | - | Unicode property escapes (like `u`) |
```csharp
private static void ValidateFlags(string flags)
{
    var seen = new HashSet<char>();
    foreach (var flag in flags)
    {
        // Each flag may appear at most once
        if (!seen.Add(flag))
            throw new ParseException($"Duplicate flag '{flag}'");
        if (flag is not ('g' or 'i' or 'm' or 'u' or 'y' or 's' or 'd' or 'v'))
            throw new ParseException($"Invalid flag '{flag}'");
        // 'u' and 'v' are mutually exclusive
        if ((flag == 'u' && seen.Contains('v')) ||
            (flag == 'v' && seen.Contains('u')))
            throw new ParseException("Flags 'u' and 'v' cannot both be set");
    }
}
```

JavaScript regex syntax differs from .NET in several ways. The `NormalizePattern` method handles the translation.
```mermaid
flowchart TD
    Input((Input Pattern)) --> Check{Unicode Flag?}
    Check -->|Yes| Unicode((NormalizePattern))
    Check -->|No| Legacy((NormalizeLegacyPattern))
    Unicode --> Groups((Collect Group Names))
    Groups --> Parse((Parse + Transform))
    Parse --> Surrogates((Handle Surrogates))
    Surrogates --> Output((Normalized .NET Pattern))
    Legacy --> LGroups((Collect Group Names))
    LGroups --> LOctal((Handle Octal Escapes))
    LOctal --> LBackref((Handle Backreferences))
    LBackref --> Output
```
| JS Syntax | .NET Translation | Reason |
|---|---|---|
| `.` | Complex surrogate-aware pattern | Match full code points, not just BMP |
| `\S` | Surrogate-aware non-whitespace | Include astral code points |
| `\u{1F600}` | `(?:\uD83D\uDE00)` | Surrogate pair for code point > 0xFFFF |
| `[...]` with astral | `(?:[bmp]\|astral\|...)` | Split BMP and astral ranges |
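The `\u{1F600}` row can be illustrated with a small helper. This is a minimal sketch, not the engine's code: `CodePointEscapeSketch` and `ToRegexEscape` are hypothetical names, and the sketch assumes the input is a valid Unicode scalar value (`char.ConvertFromUtf32` throws for surrogate code points and for values above 0x10FFFF).

```csharp
using System;
using System.Text;

public static class CodePointEscapeSketch
{
    // Hypothetical helper: renders one code point as a .NET regex fragment.
    public static string ToRegexEscape(int codePoint)
    {
        // One UTF-16 code unit for BMP code points, two (a surrogate pair) for astral ones.
        string utf16 = char.ConvertFromUtf32(codePoint);
        var sb = new StringBuilder();
        foreach (char unit in utf16)
            sb.Append("\\u").Append(((int)unit).ToString("X4"));
        // Group the pair so that a following quantifier applies to the whole code point.
        return utf16.Length == 2 ? "(?:" + sb + ")" : sb.ToString();
    }
}
```

With these assumptions, `CodePointEscapeSketch.ToRegexEscape(0x1F600)` yields `(?:\uD83D\uDE00)`, matching the table row, and a BMP code point such as `0x41` yields `\u0041`.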
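```csharp
private const string UnicodeDotPattern =
    // Never start a match in the middle of a surrogate pair
    @"(?<![\uD800-\uDBFF])(?:" +
    // Surrogates are excluded here so that the pair alternative below
    // gets to consume both halves of an astral code point
    @"[^\n\r\u2028\u2029\uD800-\uDFFF]|" +
    @"[\uD800-\uDBFF][\uDC00-\uDFFF]|" +
    @"[\uD800-\uDBFF](?![\uDC00-\uDFFF])|" +
    @"[\uDC00-\uDFFF])";
```

This matches:
- Any BMP character except line terminators and surrogate code units
- Valid surrogate pairs (full astral code points)
- Isolated high surrogates (error recovery)
- Isolated low surrogates (error recovery)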
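To see the effect of the four alternatives, the pattern can be run directly against the .NET engine. The following is a sketch under assumptions: `UnicodeDotSketch` and `CountDots` are hypothetical names, and the first alternative excludes surrogate code units (a choice made for this sketch) so that a pair is always consumed as one unit.

```csharp
using System.Text.RegularExpressions;

public static class UnicodeDotSketch
{
    // Assumption of this sketch: surrogates are excluded from the first alternative.
    public const string Dot =
        @"(?<![\uD800-\uDBFF])(?:[^\n\r\u2028\u2029\uD800-\uDFFF]|" +
        @"[\uD800-\uDBFF][\uDC00-\uDFFF]|" +
        @"[\uD800-\uDBFF](?![\uDC00-\uDFFF])|" +
        @"[\uDC00-\uDFFF])";

    // Counts how many "characters" the code-point-aware dot sees in the input.
    public static int CountDots(string input) =>
        Regex.Matches(input, Dot, RegexOptions.CultureInvariant).Count;
}
```

For example, `UnicodeDotSketch.CountDots("a\uD83D\uDE00b")` counts three matches (the emoji is one match spanning two UTF-16 code units), while `CountDots("a\nb")` counts two because the line terminator is skipped.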
```mermaid
flowchart LR
    subgraph Input
        High["High Surrogate\n0xD800-0xDBFF"]
        Low["Low Surrogate\n0xDC00-0xDFFF"]
    end
    subgraph Validation
        Check{Valid Pair?}
    end
    subgraph Output
        CodePoint["Convert to\nCode Point"]
        Error["ParseException"]
    end
    High --> Check
    Low --> Check
    Check -->|"High + Low"| CodePoint
    Check -->|"Isolated"| Error
```
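```csharp
if (char.IsHighSurrogate(c))
{
    // A high surrogate must be followed by a low surrogate
    if (i + 1 >= pattern.Length || !char.IsLowSurrogate(pattern[i + 1]))
        throw new ParseException("Invalid unicode escape.");
    var cp = char.ConvertToUtf32(c, pattern[i + 1]);
    AppendCodePoint(builder, cp, hasUnicodeFlag, ignoreCase, false);
    i++; // skip the low surrogate that was just consumed
    continue;
}
// A low surrogate without a preceding high surrogate is isolated
if (char.IsLowSurrogate(c))
    throw new ParseException("Invalid unicode escape.");
```

JavaScript allows forward references to named groups (a reference that appears before the group's definition) and treats them as matching the empty string. In .NET, a backreference to a group that has not captured anything fails to match, so the translation wraps forward references in a conditional: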
```csharp
if (definedSoFar.Contains(normalizedName))
{
    // Backward reference: group already defined
    builder.Append(pattern, i, end - i + 1);
}
else
{
    // Forward reference: conditional to match empty if not yet captured
    builder.Append("(?(");
    builder.Append(normalizedName);
    builder.Append(")\\k<");
    builder.Append(normalizedName);
    builder.Append(">|)");
}
```

This rewrites the reference in `/\k<foo>(?<foo>bar)/u` to `(?(foo)\k<foo>|)`, a conditional that matches the empty string until `foo` has captured.
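The effect of the conditional can be checked against the .NET engine directly. This is a minimal sketch, assuming the translated pattern for `/\k<foo>(?<foo>bar)/u` is `(?(foo)\k<foo>|)(?<foo>bar)`; `ForwardReferenceSketch` and `Matches` are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class ForwardReferenceSketch
{
    // Assumed .NET form of the JS pattern /\k<foo>(?<foo>bar)/u after translation.
    public const string Translated = @"(?(foo)\k<foo>|)(?<foo>bar)";

    // Anchored match, so the whole input must be consumed.
    public static bool Matches(string input) =>
        Regex.IsMatch(input, "^(?:" + Translated + ")$", RegexOptions.CultureInvariant);
}
```

Here `ForwardReferenceSketch.Matches("bar")` succeeds because the conditional takes its empty branch while `foo` is still uncaptured. The untranslated pattern `^\k<foo>(?<foo>bar)$` does not match `"bar"` in .NET, which is the behavior the rewrite works around.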
```mermaid
flowchart TD
    Name((Raw Group Name)) --> Decode((DecodeGroupName))
    Decode --> Runes((List of Runes))
    Runes --> First{First Char?}
    First -->|ID Start| Rest{Rest Chars?}
    First -->|Invalid| Error((ParseException))
    Rest -->|All ID Part| Valid((Normalized Name))
    Rest -->|Invalid| Error
```
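The validation flow above can be sketched with `System.Text.Rune`. This is an illustration under assumptions rather than the engine's code: `GroupNameSketch` and its methods are hypothetical names, escape decoding is left out, the sketch returns `false` where the engine would throw a `ParseException`, and the start and part rules are taken to be `$`, `_` and the Unicode general categories this page lists for identifiers.

```csharp
using System.Globalization;
using System.Text;

public static class GroupNameSketch
{
    // Start characters: '$', '_' and categories Lu, Ll, Lt, Lm, Lo, Nl.
    private static bool IsIdStart(Rune r)
    {
        if (r.Value == '$' || r.Value == '_') return true;
        switch (Rune.GetUnicodeCategory(r))
        {
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.TitlecaseLetter:
            case UnicodeCategory.ModifierLetter:
            case UnicodeCategory.OtherLetter:
            case UnicodeCategory.LetterNumber:
                return true;
            default:
                return false;
        }
    }

    // Part characters: every start character plus categories Nd, Pc, Mn, Mc.
    private static bool IsIdPart(Rune r)
    {
        if (IsIdStart(r)) return true;
        switch (Rune.GetUnicodeCategory(r))
        {
            case UnicodeCategory.DecimalDigitNumber:
            case UnicodeCategory.ConnectorPunctuation:
            case UnicodeCategory.NonSpacingMark:
            case UnicodeCategory.SpacingCombiningMark:
                return true;
            default:
                return false;
        }
    }

    public static bool IsValidGroupName(string name)
    {
        bool first = true;
        // EnumerateRunes walks code points, so astral letters count as one character.
        foreach (Rune r in name.EnumerateRunes())
        {
            if (first ? !IsIdStart(r) : !IsIdPart(r)) return false;
            first = false;
        }
        return !first; // an empty name is rejected
    }
}
```

Under these rules `IsValidGroupName("$a1")` is accepted, while `"1a"`, `"a-b"` and the empty string are rejected.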
Valid identifier start characters:
- `$`, `_`
- Unicode categories: Lu, Ll, Lt, Lm, Lo, Nl

Valid identifier part characters:
- All start characters
- Unicode categories: Nd, Pc, Mn, Mc
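Legacy (non-Unicode) patterns may also contain octal escapes such as `\101`, whose value is capped at 0xFF by giving back trailing digits. The capping rule can be sketched in isolation as follows. The names `LegacyOctalSketch` and `Parse` are hypothetical, and reading at most three octal digits is an assumption of this sketch.

```csharp
public static class LegacyOctalSketch
{
    // Parses an octal escape whose digits begin at pattern[start] (just after the backslash).
    // Returns the character value and the number of digits consumed.
    public static (int Value, int Digits) Parse(string pattern, int start)
    {
        int value = 0, digits = 0;
        // Assumption: read at most three octal digits.
        while (digits < 3 && start + digits < pattern.Length
               && pattern[start + digits] >= '0' && pattern[start + digits] <= '7')
        {
            value = (value << 3) | (pattern[start + digits] - '0');
            digits++;
        }
        // Cap at 0xFF: drop trailing digits until the value fits.
        while (value > 0xFF && digits > 1)
        {
            value >>= 3;
            digits--;
        }
        return (value, digits);
    }
}
```

For instance, `Parse("377", 0)` gives `(255, 3)`, whereas `Parse("400", 0)` gives `(32, 2)`: the value 0o400 exceeds 0xFF, so the last digit is given back and would be treated as a literal `0`.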
```csharp
if (allOctal && octalDigits > 0)
{
    var effectiveValue = octalValue;
    var effectiveDigits = octalDigits;
    // Cap at 0xFF: give back trailing digits until the value fits
    while (effectiveValue > 0xFF && effectiveDigits > 1)
    {
        effectiveValue >>= 3;
        effectiveDigits--;
    }
    AppendCodePoint(builder, effectiveValue, false, ignoreCase, true);
    i = start + effectiveDigits - 1;
    continue;
}
```

```csharp
// If it's a valid backreference, use it
if (value > 0 && value <= totalCaptures)
{
    if (value <= captureCount)
        builder.Append($"\\{numText}"); // Backreference
    else
        builder.Append("(?:)"); // Forward ref = empty
    continue;
}
// Otherwise, treat as octal
```

The `lastIndex` property controls where matching starts for global/sticky regexps:
```mermaid
flowchart TD
    Start((Start Match)) --> Check{Global or Sticky?}
    Check -->|No| Zero["Start at 0"]
    Check -->|Yes| GetLast["Get lastIndex"]
    GetLast --> Range{In Range?}
    Range -->|Yes| Match((Attempt Match))
    Range -->|No| Reset["Reset lastIndex = 0"]
    Match --> Success{Match Found?}
    Success -->|Yes| Update["lastIndex = match.end"]
    Success -->|No| Reset
    Update --> Return((Return Result))
    Reset --> Return
    Zero --> Return
```
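The flow above can be tried out with a self-contained wrapper around `System.Text.RegularExpressions.Regex`. This is a simplified sketch, not the engine's `JsRegExp`: `GlobalRegexSketch` is a hypothetical class, it takes an already-normalized .NET pattern, and it models only the global flag (sticky anchoring is left out).

```csharp
using System.Text.RegularExpressions;

public sealed class GlobalRegexSketch
{
    private readonly Regex _regex;
    private readonly bool _global;

    // Mirrors the JS lastIndex property (assumed non-negative in this sketch).
    public int LastIndex { get; set; }

    public GlobalRegexSketch(string dotNetPattern, bool global)
    {
        _regex = new Regex(dotNetPattern, RegexOptions.CultureInvariant);
        _global = global;
    }

    public bool Test(string input)
    {
        int start = _global ? LastIndex : 0;
        if (start > input.Length)
        {
            // Out of range: reset and report failure.
            LastIndex = 0;
            return false;
        }
        Match match = _regex.Match(input, start);
        if (_global)
            LastIndex = match.Success ? match.Index + match.Length : 0;
        return match.Success;
    }
}
```

With `new GlobalRegexSketch("a", global: true)` and the input `"aba"`, three successive `Test` calls return `true`, `true`, `false`, leaving `LastIndex` at 1, 3, and then 0. A non-global instance never touches `LastIndex`.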
```csharp
public bool Test(string input)
{
    var tracksLastIndex = Global || Sticky;
    var startIndex = tracksLastIndex ? GetLastIndex() : 0;
    if (startIndex > input.Length)
    {
        // lastIndex past the end of the input: fail and reset
        SetLastIndex(0);
        return false;
    }
    var match = EnsureRegex().Match(input, startIndex);
    // Global and sticky regexps both advance lastIndex on success and reset it on failure
    if (tracksLastIndex)
        SetLastIndex(match.Success ? match.Index + match.Length : 0);
    return match.Success;
}
```

The `Exec` method returns a specialized array with additional properties:
```mermaid
flowchart LR
    Match((Match Object)) --> Array((JsArray))
    Array --> Index["index: match position"]
    Array --> Input["input: original string"]
    Array --> Groups["groups: named captures"]
    Array --> Indices["indices: (if 'd' flag)"]
    subgraph Elements
        E0["[0]: full match"]
        E1["[1]: capture 1"]
        EN["[n]: capture n"]
    end
    Array --> Elements
```
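The data behind the array elements and the `indices` pairs comes straight from .NET's `Match` and `Group` objects. The sketch below shows that mapping without any engine types: `MatchIndicesSketch` and `Indices` are hypothetical names, and `null` stands in for the JS `undefined` of a group that did not participate.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchIndicesSketch
{
    // Returns a [start, end) pair per group; index 0 is the full match.
    public static List<(int Start, int End)?> Indices(Match match)
    {
        var result = new List<(int Start, int End)?>();
        foreach (Group group in match.Groups)
        {
            if (group.Success)
                result.Add((group.Index, group.Index + group.Length));
            else
                result.Add(null); // group did not participate in the match
        }
        return result;
    }
}
```

For `Regex.Match("on 2024-05", @"(\d{4})-(\d{2})|(x)")` this produces `(3, 10)`, `(3, 7)`, `(8, 10)` and `null`: the full match, the two participating captures, and the unused third group.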
```csharp
private JsArray CreateMatchArray(Match match, string input)
{
    var result = new JsArray(RealmState);
    // Full match + capture groups
    for (var i = 0; i < match.Groups.Count; i++)
    {
        var group = match.Groups[i];
        result.Push(group.Success ? new JsValue(group.Value) : JsValue.Undefined);
    }
    // Add properties
    result.SetProperty("index", (double)match.Index);
    result.SetProperty("input", input);
    result.SetProperty("groups", BuildGroupsObject(match, captureValues));
    if (HasIndices)
        result.SetProperty("indices", BuildIndicesArray(match));
    return result;
}
```

When the `d` flag is present, each capture includes start/end positions:
```csharp
private JsArray BuildIndicesArray(Match match)
{
    var indices = new JsArray(RealmState);
    for (var i = 0; i < match.Groups.Count; i++)
    {
        var group = match.Groups[i];
        if (group.Success)
        {
            // [start, end) pair for this capture
            var pair = new JsArray(RealmState);
            pair.Push((double)group.Index);
            pair.Push((double)(group.Index + group.Length));
            indices.Push(JsValue.FromJsArray(pair));
        }
        else
        {
            indices.Push(JsValue.Undefined);
        }
    }
    indices.SetProperty("groups", BuildIndicesGroupsObject(match, regex, indexValues));
    return indices;
}
```

Character classes with the Unicode flag need special handling for astral code points:
```mermaid
flowchart TD
    Class[/"[abc\u{1F600}]"/] --> Parse((Parse Ranges))
    Parse --> Split{Code Point > 0xFFFF?}
    Split -->|Yes| Astral["Astral Ranges"]
    Split -->|No| BMP["BMP Ranges"]
    BMP --> BuildBMP((BuildBmpClassContent))
    Astral --> BuildAstral((BuildAstralAlternation))
    BuildBMP --> Combine((Combine))
    BuildAstral --> Combine
    Combine --> Result[/"(?:[bmp]|astral1|astral2)"/]
```
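The split shown above can be sketched for the simple case of a class made of individual code points. This is an illustration only: `ClassSplitSketch` and `Build` are hypothetical names, ranges and negation are not handled, and the engine's `BuildBmpClassContent` / `BuildAstralAlternation` may work differently.

```csharp
using System.Collections.Generic;
using System.Text;

public static class ClassSplitSketch
{
    private static string Hex(int unit) => "\\u" + unit.ToString("X4");

    // Builds (?:[bmp]|astral1|astral2) from a list of individual code points.
    public static string Build(IEnumerable<int> codePoints)
    {
        var bmp = new StringBuilder();
        var astral = new List<string>();
        foreach (int cp in codePoints)
        {
            if (cp <= 0xFFFF)
            {
                bmp.Append(Hex(cp)); // fits in one UTF-16 code unit
            }
            else
            {
                string pair = char.ConvertFromUtf32(cp); // surrogate pair
                astral.Add(Hex(pair[0]) + Hex(pair[1]));
            }
        }
        var alternatives = new List<string>();
        if (bmp.Length > 0) alternatives.Add("[" + bmp + "]");
        alternatives.AddRange(astral);
        // An empty class can never match, so emit an always-failing lookahead.
        return alternatives.Count == 0 ? "(?!)" : "(?:" + string.Join("|", alternatives) + ")";
    }
}
```

`Build(new[] { 0x61, 0x62, 0x63, 0x1F600 })` returns `(?:[\u0061\u0062\u0063]|\uD83D\uDE00)`. Running that pattern over `"a\uD83D\uDE00z"` finds two matches, the `a` and the emoji as a single two-unit match.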
```csharp
// For negated classes like [^abc], use negative lookahead
return $"(?:(?!{disallowed}){AnyCodePointPattern})";
```

```csharp
var options = RegexOptions.CultureInvariant;
if (IgnoreCase)
    options |= RegexOptions.IgnoreCase;
if (Multiline)
    options |= RegexOptions.Multiline;
// Note: DotAll is handled via pattern transformation, not options
```

The engine always uses `CultureInvariant` to ensure consistent matching across cultures.
For compatibility, the engine maintains static regex properties:

```csharp
RealmState.UpdateRegExpStatics(input, match);
```

This updates properties like:
- `RegExp.$1` through `RegExp.$9`
- `RegExp.input` / `RegExp.$_`
- `RegExp.lastMatch` / `RegExp.$&`
- `RegExp.lastParen` / `RegExp.$+`
- `RegExp.leftContext` / ``RegExp.$` ``
- `RegExp.rightContext` / `RegExp.$'`
```csharp
private Regex? _compiledRegex;

private Regex EnsureRegex()
{
    // Construct the .NET Regex lazily and cache it for later calls
    return _compiledRegex ??= new Regex(_normalizedPattern, _regexOptions);
}
```

The .NET `Regex` is only constructed on first use, which keeps RegExp object creation cheap.
Since pattern normalization is deterministic, the same JS pattern always produces the same .NET pattern. The normalized pattern is stored in _normalizedPattern for reuse.
```csharp
if (!unicodeMode && ignoreCase && codePoint == 0x212A)
{
    // Kelvin sign should not case-fold to 'K' in legacy mode
    builder.Append("(?-i:\\u212A)");
    return;
}
```

```csharp
if (string.IsNullOrEmpty(pattern))
    return pattern;
```

```csharp
if (i + 1 >= pattern.Length || IsLineTerminator(pattern[i + 1]))
    throw new ParseException("Invalid regular expression: incomplete escape.");
```

- JsValue System - How regex results are represented
- Standard Library Architecture - How RegExp prototype methods are defined