Regex

Regex Library for Mojo

regex is a regex library featuring a hybrid DFA/NFA engine architecture that automatically optimizes pattern matching based on complexity.

It aims to provide a similar interface as the re stdlib package while leveraging Mojo's performance capabilities.

Disclaimer ⚠️

This software is in an early stage of development. Even though it is functional, it is not yet feature-complete and may contain bugs. Check the features section below and the TO-DO sections for the current status

Implemented Features

Basic Elements

✅ Literal characters (a, hello)
✅ Wildcard (.) - matches any character except newline
✅ Whitespace (\s) - matches space, tab, newline, carriage return, form feed
✅ Digits (\d) - matches ASCII digits 0-9
✅ Word characters (\w) - matches letters, digits, and underscore [a-zA-Z0-9_]
✅ Escape sequences (\t for tab, \\ for literal backslash)

Character Classes

✅ Character ranges ([a-z], [0-9], [A-Za-z0-9])
✅ Negated ranges ([^a-z], [^0-9])
✅ Mixed character sets ([abc123])
✅ Character ranges within groups ((b|[c-n]))

Quantifiers

✅ Zero or more (*)
✅ One or more (+)
✅ Zero or one (?)
✅ Exact count ({3})
✅ Range count ({2,4})
✅ Minimum count ({2,})
✅ Quantifiers on all elements (characters, wildcards, ranges, groups)

Anchors

✅ Start of string (^)
✅ End of string ($)
✅ Anchors in OR expressions (^na|nb$)

Groups and Alternation

✅ Capturing groups ((abc))
✅ Alternation/OR (a|b)
✅ Complex OR patterns ((a|b), na|nb)
✅ Nested alternations ((b|[c-n]))
✅ Group quantifiers ((a)*, (abc)+)

Engine Features

✅ Hybrid DFA/NFA Architecture - Automatic engine selection for optimal performance
✅ O(n) Performance - DFA engine for simple patterns (literals, basic quantifiers, character classes)
✅ Full Regex Support - NFA engine with backtracking for complex patterns
✅ Pattern Complexity Analysis - Intelligent routing between engines
✅ SIMD Optimization - Vectorized character class matching
✅ Pattern Compilation Caching - Pre-compiled patterns for reuse
✅ Match Position Tracking - Precise start_idx, end_idx reporting
✅ Simple API: match_first(pattern, text) -> Optional[Match]

Installation

Install pixi
Add the Package (at the top level of your project):
```
pixi add mojo-regex
```

Example Usage

from regex import match_first, findall

# Basic literal matching
var result = match_first("hello", "hello world")
if result:
    print("Match found:", result.value().get_match_text())

# Find all matches
var matches = findall("a", "banana")
print("Found", len(matches), "matches:")
for i in range(len(matches)):
    print("  Match", i, ":", matches[i].get_match_text(), "at position", matches[i].start_idx)

# Wildcard and quantifiers
result = match_first(".*@.*", "[email protected]")
if result:
    print("Email found")

# Find all numbers in text using predefined digit class
var numbers = findall("\\d+", "Price: $123, Quantity: 456, Total: $579")
for i in range(len(numbers)):
    print("Number found:", numbers[i].get_match_text())

# Find all word identifiers
var words = findall("\\w+", "variable_name123 and another_var")
for i in range(len(words)):
    print("Word found:", words[i].get_match_text())

# Character ranges
result = match_first("[a-z]+", "hello123")
if result:
    print("Letters:", result.value().get_match_text())

# Groups and alternation
result = match_first("(com|org|net)", "example.com")
if result:
    print("TLD found:", result.value().get_match_text())

# Find all domains in text
var domains = findall("(com|org|net)", "Visit example.com or test.org for more info")
for i in range(len(domains)):
    print("Domain found:", domains[i].get_match_text())

# Anchors
result = match_first("^https?://", "https://example.com")
if result:
    print("Valid URL")

# Complex patterns using predefined character classes
result = match_first("^\\w+@\\w+\\.\\w+$", "[email protected]")
if result:
    print("Valid email format")

# Find all email addresses in text
var emails = findall("\\w+@\\w+\\.\\w+", "Contact [email protected] or [email protected]")
for i in range(len(emails)):
    print("Email found:", emails[i].get_match_text())

# Validate identifiers (programming variable names)
result = match_first("^[a-zA-Z_]\\w*$", "_private_var123")
if result:
    print("Valid identifier")

# === OPTIMIZATION SHOWCASE EXAMPLES ===
# These patterns now benefit from DFA optimization!

# Large alternation (8 branches) - DFA-optimized
var fruit_pattern = "(apple|banana|cherry|date|elderberry|fig|grape|honey)"
var fruit_text = "I love eating apple and banana for breakfast"
var fruits = findall(fruit_pattern, fruit_text)
for i in range(len(fruits)):
    print("Fruit found:", fruits[i].get_match_text())

# Deep nested alternation (depth 4) - DFA-optimized
var nested_pattern = "(?:(?:(?:a|b)|(?:c|d))|(?:(?:e|f)|(?:g|h)))"
var nested_result = match_first(nested_pattern, "Testing with character a")
if nested_result:
    print("Deep nested match:", nested_result.value().get_match_text())

# Literal-heavy alternation (80% threshold) - DFA-optimized
var user_pattern = "(user123|admin456|guest789|root000|test111|demo222)"
var login_text = "Login attempts: user123 failed, admin456 success"
var users = findall(user_pattern, login_text)
for i in range(len(users)):
    print("User found:", users[i].get_match_text())

# Complex phone patterns - DFA-optimized
var phone_pattern = "[0-9]{3}-[0-9]{3}-[0-9]{4}"  # DFA-optimized
var phone_text = "Call us at 555-123-4567 or 800-555-9999"
var phones = findall(phone_pattern, phone_text)
for i in range(len(phones)):
    print("Phone found:", phones[i].get_match_text())

# Complex group (5 children) - DFA-optimized
var complex_pattern = "(hello|world|test|demo|sample)[0-9]{3}[a-z]{2}"
var complex_text = "Found: hello123ab, world456cd, test789ef in data"
var complex_matches = findall(complex_pattern, complex_text)
for i in range(len(complex_matches)):
    print("Complex match:", complex_matches[i].get_match_text())

Performance

The hybrid DFA/NFA architecture provides significant performance benefits:

Pattern Performance Characteristics

Pattern Type	Engine Used	Time Complexity	SIMD Optimization	Example
Literal strings	DFA + SIMD	O(n/w)	String search vectorization	`"hello"`, `"example.com"`
Character classes	DFA + SIMD	O(n/w)	Lookup table vectorization	`"[a-z]+"`, `"[0-9]+"`
Built-in classes	DFA + SIMD	O(n/w)	Pre-built SIMD matchers	`"\d+"`, `"\w+"`
Simple quantifiers	DFA + SIMD	O(n/w)	Vectorized counting	`"a*"`, `"[0-9]{3}"`
Anchors	DFA	O(1)	Position validation	`"^start"`, `"end$"`
Basic groups	DFA/NFA + SIMD	O(n) to O(nm)	Partial vectorization	`"(abc)+"`, `"([a-z]+)"`
Small alternation	DFA + SIMD	O(n/w)	DFA state optimization	`"cat\|dog"`, `"(a\|b\|c)"`
Large alternation	NFA + Prefilter	O(n) to O(nm)	Extended to 8 branches	`"(apple\|banana\|...\|honey)"`
Literal-heavy alternation	NFA + Prefilter	O(n) to O(nm)	80% threshold detection	`"(user123\|admin456\|...)"`
Deep nested groups	NFA + SIMD	O(n) to O(nm)	Depth 4 support	`"(?:(?:(?:a\|b)\|(?:c\|d))\|...)"`
Complex phone patterns	DFA + SIMD	O(n/w)	Now DFA-optimized	US national phone validation
Complex patterns	NFA + SIMD	O(nm) to O(2^n)	Character-level SIMD	Backreferences, lookahead

Where w = SIMD width (typically 16-32 characters processed per instruction)

Building and Testing

# Build the package
./tools/build.sh

# Run tests
./tools/run-tests.sh

# Or run specific test
mojo test -I src/ tests/test_matcher.mojo

# Run benchmarks to see performance including SIMD optimizations
mojo benchmarks/bench_engine.mojo

# Run SIMD-specific tests
mojo test -I src/ tests/test_simd_integration.mojo

TO-DO: Missing Features

High Priority

Medium Priority

Non-greedy quantifiers (*?, +?, ??)
Word boundaries (\b, \B)
Match groups extraction and iteration
Pattern compilation object
Unicode character classes (\p{L}, \p{N})
Multiline mode (^ and $ match line boundaries)
Dot-all mode (. matches newlines)

Advanced Features

Engine Improvements

Optimization Strategy

The regex engine uses a selective optimization approach to extend DFA coverage while avoiding performance regressions:

Pattern Classification System

Patterns are automatically classified into three categories:

SIMPLE: DFA-optimized with O(n) performance - literals, basic quantifiers, character classes, optimized alternations
MEDIUM: Hybrid DFA/NFA approach - complex groups, medium alternations, some phone patterns
COMPLEX: NFA with backtracking - backreferences, lookahead, very complex nesting

Selective Optimization Criteria

Optimizations are applied only when they provide clear benefits:

Extended Alternation Support: Increased from 3→8 branches for DFA optimization
Deep Nesting Tolerance: Groups now supported up to depth 4 (vs. 3 previously)
Literal-Heavy Detection: 80% threshold for alternations with mostly literal branches
Complex Group Support: Up to 5 children in groups (vs. 3 previously)
Early Termination: Analysis functions use selective branching to reduce overhead

Performance Validation

All optimizations undergo rigorous validation:

Cross-language benchmarking against Python and Rust implementations
Regression testing to ensure no performance degradation on existing patterns
Analysis overhead measurement to validate optimization efficiency
Real-world pattern testing with complex phone validation and email patterns

This approach ensures that optimizations extend engine capabilities without compromising the performance characteristics that make the main implementation effective.

Contributing

Contributions are welcome! If you'd like to contribute, please follow the contribution guidelines in the CONTRIBUTING.md file in the repository.

Acknowledgments

Thanks to Claude Code for helping a lot with the implementation and testing of the regex library, and to the Mojo community for their support and feedback.

License

mojo is licensed under the MIT license.

Name		Name	Last commit message	Last commit date
Latest commit History 702 Commits
.github/workflows		.github/workflows
benchmarks		benchmarks
docs		docs
playground		playground
proposals		proposals
src/regex		src/regex
static		static
talks/modular-community-meeting		talks/modular-community-meeting
tests		tests
tools		tools
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
AGENTS.md		AGENTS.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
pixi.lock		pixi.lock
pixi.toml		pixi.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Regex

Disclaimer ⚠️

Implemented Features

Basic Elements

Character Classes

Quantifiers

Anchors

Groups and Alternation

Engine Features

Installation

Example Usage

Performance

Pattern Performance Characteristics

Building and Testing

TO-DO: Missing Features

High Priority

Medium Priority

Advanced Features

Engine Improvements

Optimization Strategy

Pattern Classification System

Selective Optimization Criteria

Performance Validation

Contributing

Acknowledgments

License

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

msaelices/mojo-regex

Folders and files

Latest commit

History

Repository files navigation

Regex

Disclaimer ⚠️

Implemented Features

Basic Elements

Character Classes

Quantifiers

Anchors

Groups and Alternation

Engine Features

Installation

Example Usage

Performance

Pattern Performance Characteristics

Building and Testing

TO-DO: Missing Features

High Priority

Medium Priority

Advanced Features

Engine Improvements

Optimization Strategy

Pattern Classification System

Selective Optimization Criteria

Performance Validation

Contributing

Acknowledgments

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages