Refactor jsx mode in parser #7751

Merged · 16 commits merged into rescript-lang:master on Aug 6, 2025

Conversation

@nojaf (Member) commented Aug 2, 2025

I was experimenting with the parser related to JSX and noticed that we have a somewhat convoluted mechanism for handling /> and - in identifier names.

First, I created a token dump tool in res_parser, which was previously missing.

Currently, the parser employs a sequence of Scanner.set_jsx_mode p.Parser.scanner and Scanner.pop_mode p.scanner Jsx calls while processing elements to distinguish between parsing JSX and non-JSX code. However, this mode is mainly used in the following cases:

  • Allowing a - inside an identifier. This logic belongs in the parser, but it currently resides in the scanner, which feels inappropriate.
  • Combining a < + / into a </ token. I would prefer using a lookahead for when a < is encountered. This would clarify that it's specific to JSX parsing. Even though LessThanSlash exists, there is still a separate LessThan + Slash check, which makes the code a bit messy.

This is an effort to streamline the JSX mode.
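
To make the second point above concrete, here is a minimal sketch of the kind of lookahead meant. The scanner type, its fields, and peek_slash are hypothetical names used purely for illustration; the actual change works on the scanner in res_scanner.ml and uses a reconsider-style helper shown further down.

(* Hypothetical sketch: peek at the character right after the current '<'.
   If it is '/', the '<' starts a closing tag "</"; otherwise it starts a
   JSX child or a regular less-than. *)
type scanner = {src: string; mutable offset: int}

let peek_slash (s : scanner) : bool =
  s.offset < String.length s.src && s.src.[s.offset] = '/'

With a helper like this the decision can be made at the point where the parser consumes the <, instead of relying on a scanner-wide JSX mode.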

PS: to run the local analysis tests, I had to revert to the legacy clean command. This is for the best until we figure out #7707

| LessThan ->
(* Imagine: <div> <Navbar /> <
* is `<` the start of a jsx-child? <div …
* or is it the start of a closing tag? </div>
* reconsiderLessThan peeks at the next token and
* determines the correct token to disambiguate *)
let token = Scanner.reconsider_less_than p.scanner in
nojaf (Member, Author) commented:

This is what bothers me a bit: there is LessThanSlash above, yet we still need to do the reconsider_less_than call.

let attr_expr = parse_primary_expr ~operand:(parse_atomic_expr p) p in
Some (Parsetree.JSXPropValue ({txt = name; loc}, optional, attr_expr))
| _ -> Some (Parsetree.JSXPropPunning (false, {txt = name; loc})))
(* {...props} *)
| Lbrace -> (
Scanner.pop_mode p.scanner Jsx;
nojaf (Member, Author) commented:

This is rather confusing when you are in a nested jsx scenario:

<div>
  <p>
    {foo}
  </p>
</div>

Popping Jsx from p here is not enough: the Jsx mode pushed for <div> also has to be popped before we are actually out of Jsx mode.
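
To make the nesting problem concrete, here is a toy model of the stacked modes; it is not the scanner's actual code, just an illustration of why a single pop is not enough.

(* Toy model: <div> and <p> each push Jsx, so the single pop performed at the
   '{' of {foo} still leaves the scanner in Jsx mode. *)
type mode = Jsx | Regular

let in_jsx stack = List.mem Jsx stack

let () =
  let stack = [Jsx; Jsx] in    (* pushed for <div> and then <p> *)
  let stack = List.tl stack in (* the pop done for the {foo} child *)
  assert (in_jsx stack)        (* still in Jsx mode *)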

posCursor:[30:12] posNoWhite:[30:11] Found expr:[30:9->32:10]
JSX <di:[30:10->30:12] div[32:6->32:9]=...[32:6->32:9]> _children:None
posCursor:[30:12] posNoWhite:[30:11] Found expr:[30:9->30:12]
JSX <di:[30:10->30:12] > _children:None
nojaf (Member, Author) commented:

This change is because there is a slightly different AST for:

<div>
  <di
</div>

It used to be

[
  structure_item (A.res[1,0+0]..[3,12+6])
    Pstr_eval
    expression (A.res[1,0+0]..[3,12+6])
      Pexp_jsx_container_element "div" (A.res[1,0+1]..[1,0+4])
      jsx_props =
        []
      > [1,0+4]
      jsx_children =
        [
          expression (A.res[2,6+2]..[3,12+6])
            Pexp_jsx_container_element "di" (A.res[2,6+3]..[2,6+5])
            jsx_props =
              [
                div              ]
            > [3,12+5]
            jsx_children =
              []
        ]
]

and now is

[
  structure_item (A.res[1,0+0]..[3,12+6])
    Pstr_eval
    expression (A.res[1,0+0]..[3,12+6])
      Pexp_jsx_container_element "div" (A.res[1,0+1]..[1,0+4])
      jsx_props =
        []
      > [1,0+4]
      jsx_children =
        [
          expression (A.res[2,6+2]..[2,6+5])
            Pexp_jsx_unary_element "di" (A.res[2,6+3]..[2,6+5])
            jsx_props =
              []
        ]
]

I think this is more correct. The unary (new) versus container (old) distinction doesn't matter that much, since neither can really be determined for <di.
However, the container used to have a weird div prop, which it no longer has.

pkg-pr-new bot commented Aug 2, 2025

rescript

npm i https://pkg.pr.new/rescript-lang/rescript@7751

@rescript/darwin-arm64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/darwin-arm64@7751

@rescript/darwin-x64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/darwin-x64@7751

@rescript/linux-arm64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/linux-arm64@7751

@rescript/linux-x64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/linux-x64@7751

@rescript/win32-x64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/win32-x64@7751

commit: 0ea79fd

@nojaf nojaf marked this pull request as ready for review August 2, 2025 12:00
@nojaf nojaf requested a review from cristianoc August 2, 2025 12:00
@nojaf nojaf changed the title Add token dump printer Refactor jsx mode in parser Aug 2, 2025
@nojaf (Member, Author) commented Aug 4, 2025

Hi @cristianoc, sorry for my eagerness, could you take a look at these changes?

@cristianoc (Collaborator) commented:

Are there differences in whitespace behavior? Are these intended (for composite tokens)? Or are there possible ambiguities when single characters are tokenised in isolation?
I'm traveling so I could not take a look in detail, but these are the things that come to mind.

@nojaf (Member, Author) commented Aug 4, 2025

Or possible ambiguities when single characters are tokenised in isolation.

No, actually not, there is no other way the language can encounter </ besides JSX.
So that made me wonder if we needed the jsx mode in the first place.

Safe travels, will ask someone else for a review.
(Don't be shy to take a look at this once you are back 😇)

@zth , @shulhi , @aspeddro any volunteers?

@nojaf nojaf requested review from Copilot and removed request for cristianoc August 4, 2025 18:12
Copilot AI left a comment:

Pull Request Overview

This PR refactors the JSX mode handling in the ReScript parser by removing the convoluted JSX mode mechanism from the scanner and moving JSX-specific logic to the parser. The refactoring streamlines JSX parsing by eliminating the need for Scanner.set_jsx_mode and Scanner.pop_mode calls throughout the codebase, and introduces lookahead functionality for better JSX token handling.

Key changes:

  • Removes JSX mode from the scanner and replaces it with parser-level JSX identifier handling
  • Introduces lookahead functions (peekMinus, peekSlash) for better JSX token disambiguation
  • Adds a token debugger tool for development purposes

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Reviewed files:

  • compiler/syntax/src/res_scanner.ml: removes JSX mode handling and adds lookahead functions for minus and slash characters
  • compiler/syntax/src/res_core.ml: implements JSX identifier parsing in the parser with lookahead-based logic
  • compiler/syntax/src/res_token.ml: removes the LessThanSlash token type
  • compiler/syntax/src/res_token_debugger.ml: adds a new token debugging utility
  • tests/syntax_tests/data/parsing/errors/expressions/expected/jsx.res.txt: updates the expected error message
  • various package.json files: revert to the legacy clean command for analysis tests

Comment on lines 740 to 747
p.token <- token
| Uident txt when Scanner.peekMinus p.scanner ->
let buffer = Buffer.create (String.length txt) in
Buffer.add_string buffer txt;
Parser.next p;
let name = visit buffer |> Buffer.contents in
let token = Token.Uident name in
p.token <- token
Copilot AI commented Aug 5, 2025:

[nitpick] Direct mutation of parser state (p.token <-) breaks encapsulation and makes the code harder to reason about. Consider using a proper parser method or returning the modified token instead of mutating parser state directly.

Suggested change:
-p.token <- token
-| Uident txt when Scanner.peekMinus p.scanner ->
-  let buffer = Buffer.create (String.length txt) in
-  Buffer.add_string buffer txt;
-  Parser.next p;
-  let name = visit buffer |> Buffer.contents in
-  let token = Token.Uident name in
-  p.token <- token
+set_token p token
+| Uident txt when Scanner.peekMinus p.scanner ->
+  let buffer = Buffer.create (String.length txt) in
+  Buffer.add_string buffer txt;
+  Parser.next p;
+  let name = visit buffer |> Buffer.contents in
+  let token = Token.Uident name in
+  set_token p token

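The excerpt above relies on a visit helper that is not shown in this thread. Below is a rough sketch of what such an ident-joining loop could look like; it is an assumption, not the PR's actual code, and the real helper apparently takes only the buffer (so it presumably closes over the parser).

(* Hypothetical sketch: keep consuming "-" followed by an identifier,
   appending both to the buffer, and stop at the first other token. *)
let rec visit (p : Parser.t) (buffer : Buffer.t) : Buffer.t =
  match p.Parser.token with
  | Minus -> (
    Parser.next p;
    match p.Parser.token with
    | Lident txt | Uident txt ->
      Buffer.add_char buffer '-';
      Buffer.add_string buffer txt;
      Parser.next p;
      visit p buffer
    | _ -> buffer)
  | _ -> buffer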


@cristianoc (Collaborator) commented:

Hitting AI with AI

https://chatgpt.com/share/68933fb4-cef4-8011-a04c-a30808a82b89

Jump to the end for the relevant summary

@nojaf (Member, Author) commented Aug 6, 2025

@cristianoc thanks, hit it again 😅
Think it is fine now.

@cristianoc (Collaborator) left a review:

Looks good to go!

@nojaf nojaf enabled auto-merge (squash) August 6, 2025 13:22
@nojaf nojaf merged commit c738f75 into rescript-lang:master Aug 6, 2025
52 of 53 checks passed
@cristianoc (Collaborator) commented:

@nojaf here's one more cleanup one could consider:

Awesome—here’s a no-mutation refactor you can actually drop in and reason about. It consumes tokens and returns structured values; no code rewrites p.token (other than the usual Parser.next p to advance).

Core idea
• Parse JSX tag names via helpers that consume HEAD ( "-" IDENT )* and return the combined name + location (+ shape).
• The parser keeps reading the real token stream; nothing injects synthetic tokens into p.token.

Types

type jsx_ident_kind = [`Lower | `Upper]

type jsx_tag_name =
  | Lower of {name: string; loc: Loc.t} (* "a-b-c" *)
  | QualifiedLower of {path: Longident.t; name: string; loc: Loc.t} (* V.X.y-z *)
  | Upper of {path: Longident.t; loc: Loc.t} (* V.X.Component *)

Helpers (no mutation, consume & return)

(* Read a single ident at the current token, returning its text, loc, and kind. Does not advance. *)
let peek_ident (p : Parser.t) : (string * Loc.t * jsx_ident_kind) option =
  match p.Parser.token with
  | Token.Lident txt -> Some (txt, Parser.loc p, `Lower)
  | Token.Uident txt -> Some (txt, Parser.loc p, `Upper)
  | _ -> None

(* Consume one Lident/Uident, error if none. *)
let expect_ident (p : Parser.t) : (string * Loc.t * jsx_ident_kind) option =
  match peek_ident p with
  | None -> None
  | Some (txt, loc, k) -> Parser.next p; Some (txt, loc, k)

(* Consume ("-" IDENT)*, appending to [buf], updating [last_end], diagnosing a trailing '-'. *)
let rec read_hyphen_chain (p : Parser.t) (buf : Buffer.t) (last_end : Lexing.position ref) : unit =
  match p.Parser.token with
  | Token.Minus ->
    let minus_loc = Parser.loc p in
    Parser.next p; (* after '-' *)
    begin match peek_ident p with
    | Some (txt, loc, _) ->
      Buffer.add_char buf '-';
      Buffer.add_string buf txt;
      last_end := Loc.end_ loc;
      Parser.next p; (* consume ident *)
      read_hyphen_chain p buf last_end
    | None ->
      Parser.err p (Diagnostics.message_at minus_loc "JSX identifier cannot end with a hyphen")
    end
  | _ -> ()

Read local name (a-b-c or X-Y — head determines kind)

let read_local_jsx_name (p : Parser.t) : (string * Loc.t * jsx_ident_kind) option =
  match expect_ident p with
  | None -> None
  | Some (head, head_loc, kind) ->
    let buf = Buffer.create (String.length head + 8) in
    Buffer.add_string buf head;
    let start_pos = Loc.start head_loc in
    let last_end = ref (Loc.end_ head_loc) in
    read_hyphen_chain p buf last_end;
    let name = Buffer.contents buf in
    let loc = Loc.span start_pos !last_end in
    Some (name, loc, kind)

Read qualified-or-local tag name (covers a-b, V.Component, V.x-y)

let read_jsx_tag_name (p : Parser.t) : jsx_tag_name option =
  match peek_ident p with
  | Some (_, _, `Lower) ->
    (* Plain lowercase tag with optional hyphens *)
    read_local_jsx_name p |> Option.map (fun (name, loc, _) -> Lower {name; loc})
  | Some (seg, seg_loc, `Upper) ->
    (* Could be: Upper path ('.' Uident)*, OR QualifiedLower path '.' Lident ('-' ident)* *)
    let start_pos = Loc.start seg_loc in
    let rev_segs = ref [seg] in
    let last_end = ref (Loc.end_ seg_loc) in
    Parser.next p; (* consume first Uident *)

    let rec loop_path () =
      match p.Parser.token with
      | Token.Dot ->
        Parser.next p; (* after '.' *)
        begin match peek_ident p with
        | Some (txt, loc, `Upper) ->
          rev_segs := txt :: !rev_segs;
          last_end := Loc.end_ loc;
          Parser.next p;
          loop_path ()
        | Some (_, _, `Lower) ->
          (* QualifiedLower: path already in rev_segs, now read final lowercase with hyphens *)
          begin match read_local_jsx_name p with
          | Some (lname, l_loc, _) ->
            let path = Longident.of_rev_list (List.rev !rev_segs) in
            let loc = Loc.span start_pos (Loc.end_ l_loc) in
            Some (QualifiedLower {path; name = lname; loc})
          | None -> None
          end
        | None ->
          Parser.err p (Diagnostics.message "expected identifier after '.' in JSX tag name");
          None
        end
      | _ ->
        (* Pure Upper path (component) *)
        let path = Longident.of_rev_list (List.rev !rev_segs) in
        let loc = Loc.span start_pos !last_end in
        Some (Upper {path; loc})
    in
    loop_path ()
  | None -> None

Call sites (no p.token <- … anywhere)

Opening/self-closing tags

(* ... after consuming '<' ... *)
let parse_jsx_opening_or_self_closing_element (p : Parser.t) =
  match read_jsx_tag_name p with
  | None ->
    Parser.err p (Diagnostics.message "expected JSX tag name")
    (* recover… *)
  | Some tag ->
    (* p.token now points to the first token after the name: props or '>' or '/>' *)
    let props = parse_jsx_props p in
    match p.Parser.token with
    | Token.SlashGreater -> Parser.next p; Ast.jsx_self_closing tag props
    | Token.Greater -> Parser.next p; Ast.jsx_opening tag props
    | _ ->
      Parser.err p (Diagnostics.message "expected '>' or '/>' after JSX tag name")
      (* recover… *)

Closing tags (compare names by value, not tokens)

(* ... after consuming '</' ... *)
let parse_jsx_closing_element (p : Parser.t) =
  match read_jsx_tag_name p with
  | None ->
    Parser.err p (Diagnostics.message "expected JSX closing tag name")
    (* recover… *)
  | Some closing ->
    (match p.Parser.token with
    | Token.Greater -> Parser.next p
    | _ -> Parser.err p (Diagnostics.message "expected '>' after closing tag"));
    closing

(* When finishing an element, ensure names match: *)
let names_equal (open_ : jsx_tag_name) (close_ : jsx_tag_name) =
  match open_, close_ with
  | Lower a, Lower b -> String.equal a.name b.name
  | QualifiedLower a, QualifiedLower b ->
    Longident.equal a.path b.path && String.equal a.name b.name
  | Upper a, Upper b -> Longident.equal a.path b.path
  | _ -> false

Props (if you allow data-/aria-, reuse the same local reader)

let read_jsx_prop_name (p : Parser.t) : (string * Loc.t) option =
  match read_local_jsx_name p with
  | Some (name, loc, `Lower) -> Some (name, loc)
  | Some (_, _, `Upper) ->
    Parser.err p (Diagnostics.message "JSX prop names must be lowercase");
    None
  | None -> None

Why this is robust (and what to check)
• No synthetic tokens injected. The only state change is advancing with Parser.next p. The token stream stays faithful to the lexer, which is great for error recovery, incremental parsing, and source maps.
• Progress is explicit. On each -, we Parser.next before looking for the next ident. No chance of stalling on a-b-c.
• Trailing - is diagnosed. We report and stop instead of swallowing it.
• Qualified forms covered. Upper ('.' Upper)* for components; Upper ('.' Upper)* '.' Lower ('-' ident)* for things like <V.x-y>. Pure lowercase tags handled via Lower.
• Clear post-conditions at call sites: after read_jsx_tag_name, p.token is the token after the tag name. That’s easy to reason about.

Edge cases you can (and should) test
• ✅ , <V.x-y></V.x-y>, <V.Component />, nested combos.
• ❌ , >, <V.a-></V.a->> (should all emit the hyphen error).
• Mismatched closing: (should error via names_equal).
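
As a quick, purely illustrative check of the mismatched-closing case against the types above (Loc.none is assumed here as a dummy location; substitute whatever your Loc module provides):

(* Illustrative only: names_equal should accept matching lowercase names
   and reject mismatched ones. *)
let () =
  let lower name = Lower {name; loc = Loc.none} in
  assert (names_equal (lower "a-b") (lower "a-b"));
  assert (not (names_equal (lower "a-b") (lower "a-c")))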

Evaluation

Does it look good?
Yep. This is the “clean” version: no mutation to parser state other than consuming tokens, crisp invariants, and all the correctness issues we discussed are handled in one place. It’s also easy to evolve—if you later decide hyphens are only allowed after lowercase segments, you can enforce that policy right where QualifiedLower is parsed.

Trade-offs:
• Slightly more code than the mutation trick, but much easier to read and test.
• You’ll need tiny plumbing in the AST to carry jsx_tag_name (or convert it immediately to whatever the existing AST nodes expect).

If you want me to adapt this precisely to your Loc/Diagnostics/Ast types (or wire it into your exact parse_jsx_* functions), paste those type signatures and I’ll tailor it.

@nojaf (Member, Author) commented Aug 8, 2025

Hmm, I think I like it. So the plan is to tailor the AST for element names and prop names, thus removing the need for a combined token?
Will check it out sometime.

@nojaf nojaf mentioned this pull request Aug 10, 2025