Commit 5224e77

ANTLR-based XPathMatcher implementation (#6353)
Replace regex-based XPath parsing with a proper ANTLR grammar for improved correctness, maintainability, and performance.

### Key Changes

- **ANTLR Grammar**: New `XPathLexer.g4` and `XPathParser.g4` provide a formal grammar for the supported XPath subset
- **XPathCompiler**: Compiles XPath expressions into an intermediate representation at match time (lazy DCL pattern)
- **Bottom-up Matching**: Walk the cursor chain instead of building path arrays - more efficient for deep trees
- **Allocation-free Evaluation**: Unified expression evaluation avoids allocations during matching
- **Fail-fast Validation**: Unsupported XPath functions throw at parse time rather than silently failing at match time

### Supported XPath Features

- Absolute/relative paths: `/root/child`, `child/grandchild`
- Descendant-or-self: `//element`
- Wildcards: `/root/*`
- Attributes: `/@attr`, `/@*`
- Predicates: `[@attr='value']`, `[child='value']`, `[1]`, `[last()]`
- Functions: `local-name()`, `contains()`, `starts-with()`, `text()`, `position()`, `last()`, `not()`
- Logical operators: `and`, `or`
- Axes: `parent::`, `self::`, `..`, `.`

### Test Improvements

Tests now verify exact match counts instead of just boolean existence, providing stronger validation of XPath matching behavior.
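The "Bottom-up Matching" item can be illustrated with a small sketch. This is not the code from this commit: `Cursor` and `matches_bottom_up` are hypothetical names, the cursor is reduced to an element name plus a parent link, and only `/`-separated steps with `*` wildcards are handled (no `//`, predicates, or attributes). The point is only that the last step is checked against the current node first and the parent links are followed upward, so no list of ancestor names is ever built.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Cursor:
    """Hypothetical stand-in for a tree cursor: an element name and a parent link."""
    name: str
    parent: Optional["Cursor"] = None


def matches_bottom_up(steps: Sequence[str], cursor: Optional[Cursor]) -> bool:
    """Match an absolute path such as ["root", "*", "child"] against a cursor.

    Steps are compared from last to first while walking parent links, so the
    ancestor chain is never copied into a separate path array.
    """
    node = cursor
    for step in reversed(steps):
        if node is None:
            return False  # ran out of ancestors before running out of steps
        if step != "*" and step != node.name:
            return False  # name mismatch at this level
        node = node.parent
    # An absolute path must account for every ancestor up to the root.
    return node is None
```

A mismatch on the last step rejects the node immediately, which is the common case when most elements in a document do not match the expression.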
1 parent 61b818e commit 5224e77

19 files changed: +7005 -310 lines changed
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
/*
 * Copyright 2025 the original author or authors.
 * <p>
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 * <p>
 * https://www.apache.org/licenses/LICENSE-2.0
 * <p>
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/**
 * XPath lexer for a limited subset of XPath expressions.
 * Supports absolute and relative paths, wildcards, predicates,
 * attribute access, and common XPath functions.
 */
lexer grammar XPathLexer;

// Whitespace
WS : [ \t\r\n]+ -> skip ;

// Path separators
SLASH : '/' ;
DOUBLE_SLASH : '//' ;
AXIS_SEP : '::' ;

// Brackets
LBRACKET : '[' ;
RBRACKET : ']' ;
LPAREN : '(' ;
RPAREN : ')' ;

// Operators
AT : '@' ;
DOTDOT : '..' ; // Longest match wins, so '..' never lexes as two DOT tokens
DOT : '.' ;
COMMA : ',' ;
EQUALS : '=' ;
NOT_EQUALS : '!=' ;
LTE : '<=' ; // Longest match wins over LT
GTE : '>=' ; // Longest match wins over GT
LT : '<' ;
GT : '>' ;
WILDCARD : '*' ;

// Numbers
NUMBER : [0-9]+ ('.' [0-9]+)? ;

// Logical operators (for predicate conditions)
// Declared before NCNAME: on an equal-length match the earlier rule wins
AND : 'and' ;
OR : 'or' ;

// XPath functions
LOCAL_NAME : 'local-name' ;
NAMESPACE_URI : 'namespace-uri' ;

// String literals
STRING_LITERAL
    : '\'' (~['])* '\''
    | '"' (~["])* '"'
    ;

// NCName (Non-Colonized Name) - XML name without colons
// QName (Qualified Name) - this token always carries a prefix; unprefixed names lex as NCNAME
// A prefixed name is longer than its NCNAME prefix, so longest match selects QNAME
QNAME
    : NCNAME_CHARS ':' NCNAME_CHARS
    ;

NCNAME
    : NCNAME_CHARS
    ;

fragment NCNAME_CHARS
    : NAME_START_CHAR NAME_CHAR*
    ;

fragment NAME_START_CHAR
    : [a-zA-Z_]
    | '\u00C0'..'\u00D6'
    | '\u00D8'..'\u00F6'
    | '\u00F8'..'\u02FF'
    | '\u0370'..'\u037D'
    | '\u037F'..'\u1FFF'
    | '\u200C'..'\u200D'
    | '\u2070'..'\u218F'
    | '\u2C00'..'\u2FEF'
    | '\u3001'..'\uD7FF'
    | '\uF900'..'\uFDCF'
    | '\uFDF0'..'\uFFFD'
    ;

fragment NAME_CHAR
    : NAME_START_CHAR
    | '-'
    | '.'
    | [0-9]
    | '\u00B7'
    | '\u0300'..'\u036F'
    | '\u203F'..'\u2040'
    ;
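How this lexer resolves overlapping rules follows from ANTLR 4's rule selection: the rule matching the most input wins, and declaration order only breaks ties between matches of equal length. The sketch below imitates that behaviour with Python's `re` module on a hypothetical ASCII-only subset of the rules above. `RULES` and `tokenize` are illustrative names and are not part of the commit or of ANTLR.

```python
import re

# Hypothetical ASCII-only subset of the lexer rules, in declaration order.
NAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"
RULES = [
    ("SLASH", r"/"),
    ("DOUBLE_SLASH", r"//"),
    ("AXIS_SEP", r"::"),
    ("DOT", r"\."),        # deliberately listed BEFORE DOTDOT
    ("DOTDOT", r"\.\."),
    ("AND", r"and"),
    ("OR", r"or"),
    ("QNAME", NAME + ":" + NAME),
    ("NCNAME", NAME),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def tokenize(text):
    """Return (token_name, lexeme) pairs using longest-match, first-rule-on-tie."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # mirrors `WS -> skip`
            continue
        best_name, best_len = None, 0
        for name, pattern in COMPILED:
            match = pattern.match(text, pos)
            # Strictly longer only: an equal-length match keeps the earlier rule.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

In this sketch `DOT` is listed before `DOTDOT`, yet `..` still comes out as one `DOTDOT` token because the longer match wins. Order does matter for `and`, `or`, `local-name`, and `namespace-uri`: each is also a valid `NCNAME` of the same length, so the keyword rule must be declared first.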
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
WS=1
SLASH=2
DOUBLE_SLASH=3
LBRACKET=4
RBRACKET=5
LPAREN=6
RPAREN=7
AT=8
DOT=9
COMMA=10
EQUALS=11
WILDCARD=12
AND=13
OR=14
LOCAL_NAME=15
NAMESPACE_URI=16
STRING_LITERAL=17
QNAME=18
'/'=2
'//'=3
'['=4
']'=5
'('=6
')'=7
'@'=8
'.'=9
','=10
'='=11
'*'=12
'and'=13
'or'=14
'local-name'=15
'namespace-uri'=16
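The listing above uses the plain `NAME=number` layout of an ANTLR tokens file. Symbolic token names come first, followed by quoted literal aliases that reuse the same type numbers. Below is a minimal sketch of reading that layout; `parse_tokens` and `aliases_consistent` are hypothetical helpers, not part of ANTLR or of this commit, and the sample text is a short excerpt of the listing.

```python
def parse_tokens(text):
    """Split tokens-file lines into ({symbolic name: type}, {literal: type})."""
    names, literals = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the LAST '=' so the literal alias '='=11 parses correctly.
        key, _, value = line.rpartition("=")
        if len(key) >= 2 and key.startswith("'") and key.endswith("'"):
            literals[key[1:-1]] = int(value)
        else:
            names[key] = int(value)
    return names, literals


def aliases_consistent(names, literals):
    """Check that every literal alias points at a type number some name declares."""
    declared = set(names.values())
    return all(number in declared for number in literals.values())


SAMPLE = """\
SLASH=2
DOUBLE_SLASH=3
EQUALS=11
'/'=2
'//'=3
'='=11
"""
```

The listing ends at `QNAME=18`, so tokens that the lexer grammar declares, such as `DOTDOT`, `NUMBER`, and `NCNAME`, have no entry in it. The diff itself does not say why the two differ.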
Lines changed: 207 additions & 0 deletions
@@ -0,0 +1,207 @@
/*
 * Copyright 2025 the original author or authors.
 * <p>
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 * <p>
 * https://www.apache.org/licenses/LICENSE-2.0
 * <p>
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/**
 * XPath parser for a limited subset of XPath expressions.
 *
 * Supports:
 * - Absolute paths: /root/child
 * - Relative paths: child/grandchild
 * - Descendant-or-self: //element
 * - Wildcards: /root/*
 * - Attribute access: /root/@attr, /root/element/@*
 * - Node type tests: /root/element/text(), /root/comment(), etc.
 * - Predicates with conditions: /root/element[@attr='value']
 * - Child element predicates: /root/element[child='value']
 * - Positional predicates: /root/element[1], /root/element[last()]
 * - Parenthesized expressions with predicates: (/root/element)[1], (/root/a)[last()]
 * - XPath functions: local-name(), namespace-uri(), text(), contains(), position(), last(), etc.
 * - Logical operators in predicates: and, or
 * - Multiple predicates: /root/element[@attr='value'][local-name()='element']
 * - Top-level function expressions: contains(/root/element, 'value')
 * - Boolean expressions: not(contains(...)), string-length(...) > 2
 * - Abbreviated syntax: . (self), .. (parent)
 * - Parent axis: parent::node(), parent::element
 */
parser grammar XPathParser;

options { tokenVocab=XPathLexer; }

// Entry point for XPath expression
xpathExpression
    : booleanExpr
    | filterExpr
    | absoluteLocationPath
    | relativeLocationPath
    ;

// Filter expression - parenthesized path with predicates and optional trailing path: (/root/a)[1]/child
filterExpr
    : LPAREN (absoluteLocationPath | relativeLocationPath) RPAREN predicate+ (pathSeparator relativeLocationPath)?
    ;

// Boolean expression (function calls with optional comparison)
booleanExpr
    : functionCall comparisonOp comparand
    | functionCall
    ;

// Comparison operators
comparisonOp
    : EQUALS
    | NOT_EQUALS
    | LT
    | GT
    | LTE
    | GTE
    ;

// Value to compare against
comparand
    : stringLiteral
    | NUMBER
    ;

// Absolute path starting with / or //
absoluteLocationPath
    : SLASH relativeLocationPath?
    | DOUBLE_SLASH relativeLocationPath
    ;

// Relative path (series of steps)
relativeLocationPath
    : step (pathSeparator step)*
    ;

// Path separator between steps
pathSeparator
    : SLASH
    | DOUBLE_SLASH
    ;

// A single step in the path
step
    : axisStep predicate*
    | nodeTest predicate*
    | attributeStep predicate*
    | nodeTypeTest
    | abbreviatedStep
    ;

// Axis step - explicit axis like parent::node()
axisStep
    : axisName AXIS_SEP nodeTest
    ;

// Supported axis names (NCName - no namespace prefix)
axisName
    : NCNAME // parent, ancestor, self, child, etc. - validated at runtime
    ;

// Abbreviated step - . or ..
abbreviatedStep
    : DOTDOT // parent::node()
    | DOT // self::node()
    ;

// Node type test - text(), comment(), node(), processing-instruction()
// Validation of which functions are valid node type tests happens at runtime
nodeTypeTest
    : NCNAME LPAREN RPAREN
    ;

// Attribute step (@attr, @ns:attr, or @*)
attributeStep
    : AT (QNAME | NCNAME | WILDCARD)
    ;

// Node test (element name, ns:element, or wildcard)
nodeTest
    : QNAME
    | NCNAME
    | WILDCARD
    ;

// Predicate in square brackets
predicate
    : LBRACKET predicateExpr RBRACKET
    ;

// Predicate expression (supports and/or)
predicateExpr
    : orExpr
    ;

// OR expression (lowest precedence)
orExpr
    : andExpr (OR andExpr)*
    ;

// AND expression (higher precedence than OR)
andExpr
    : primaryExpr (AND primaryExpr)*
    ;

// Primary expression in a predicate
primaryExpr
    : predicateValue comparisonOp comparand // any value expression with comparison
    | predicateValue // standalone value (last(), position(), number, boolean)
    ;

// A value-producing expression in a predicate
predicateValue
    : functionCall // local-name(), last(), position(), contains(), etc.
    | attributeStep // @attr, @*
    | relativeLocationPath // bar/baz/text()
    | childElementTest // child, *
    | NUMBER // positional predicate [1], [2], etc.
    ;

// XPath function call - unified for both top-level and predicate use
// Function names are matched as NCNames (the core XPath 1.0 functions carry no namespace prefix)
functionCall
    : LOCAL_NAME LPAREN RPAREN
    | NAMESPACE_URI LPAREN RPAREN
    | NCNAME LPAREN functionArgs? RPAREN
    ;

// Function arguments (comma-separated)
functionArgs
    : functionArg (COMMA functionArg)*
    ;

// A single function argument
// Note: functionCall must come before relativeLocationPath
// because both can start with NCNAME; a following '(' marks a function call
functionArg
    : absoluteLocationPath
    | functionCall
    | relativeLocationPath
    | stringLiteral
    | NUMBER
    ;

// Child element test in predicate (element name, ns:element, or wildcard)
childElementTest
    : QNAME
    | NCNAME
    | WILDCARD
    ;

// String literal value
stringLiteral
    : STRING_LITERAL
    ;
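The `orExpr` / `andExpr` / `primaryExpr` layering gives `and` higher precedence than `or` by nesting the rules. A hand-written recursive-descent sketch makes this concrete. It is not generated by ANTLR and not taken from this commit: `evaluate` is a hypothetical helper, and each `primaryExpr` is assumed to have already been reduced to a Python `bool`.

```python
def evaluate(tokens):
    """Evaluate a flat list such as [True, "or", False, "and", False].

    The nesting mirrors the grammar: or_expr -> and_expr -> primary, so
    `and` groups more tightly than `or`.
    """
    pos = 0

    def primary():
        nonlocal pos
        if pos >= len(tokens) or not isinstance(tokens[pos], bool):
            raise ValueError(f"expected a boolean operand at index {pos}")
        value = tokens[pos]
        pos += 1
        return value

    def and_expr():  # andExpr : primaryExpr (AND primaryExpr)*
        nonlocal pos
        result = primary()
        while pos < len(tokens) and tokens[pos] == "and":
            pos += 1
            rhs = primary()
            result = result and rhs
        return result

    def or_expr():  # orExpr : andExpr (OR andExpr)*
        nonlocal pos
        result = and_expr()
        while pos < len(tokens) and tokens[pos] == "or":
            pos += 1
            rhs = and_expr()
            result = result or rhs
        return result

    value = or_expr()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r} at index {pos}")
    return value
```

With this layering, `[True, "or", False, "and", False]` groups as `True or (False and False)`. Strict left-to-right evaluation would instead give `(True or False) and False`.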
