Encoding flags for parts of an InterpolatedRegularExpressionNode #2669

@eregon

For the regular expression /#{ }\xc2\xa1/e, which comes from test_m17n.rb in the CRuby tests, the flags on the StringNode part are forced_utf8_encoding when the source encoding is UTF-8 and forced_binary_encoding when it is US-ASCII.

I am not sure if this is correct or not.
We are looking at this in TruffleRuby, and honoring those flags causes us to compute the wrong Regexp encoding.
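
For reference, a minimal check of what the interpreter itself reports for this literal, assuming a UTF-8 source file and no magic comment. Given the /e modifier, the Regexp as a whole should come out as EUC-JP with a fixed encoding, regardless of the flag on the StringNode part:

```ruby
# The literal from this issue, in a UTF-8 source file (Ruby's default).
re = /#{ }\xc2\xa1/e

# Encoding of the resulting Regexp object; the /e modifier requests EUC-JP.
puts re.encoding

# Whether the Regexp is pinned to that encoding rather than adapting to
# the string it is matched against.
puts re.fixed_encoding?
```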

I guess ideally, because of the /e, the parts would be "force_eucjp_encoding".
That seems best for avoiding mistakes in consumers.

Or maybe no flags at all, letting the consumer attach the encoding itself. That is more error-prone, but it might be better than (arguably) the "wrong" encoding flag.
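
To illustrate the second option, here is a rough sketch of how a consumer might pick the encoding for a literal part from the regexp's own modifier instead of from the part's flags. The names MODIFIER_ENCODINGS and part_encoding are hypothetical and not part of Prism, and the handling of /n and of multiple modifiers is a simplification, not a model of what CRuby does:

```ruby
# Hypothetical helper, not part of Prism: map a regexp encoding modifier
# to the encoding a consumer would attach to the literal parts.
MODIFIER_ENCODINGS = {
  "e" => Encoding::EUC_JP,
  "s" => Encoding::Windows_31J,
  "u" => Encoding::UTF_8,
  "n" => Encoding::BINARY # simplification: /n with ASCII-only content is not modeled
}.freeze

# Returns the encoding for a literal part: the first encoding modifier found
# wins (an assumption of this sketch), otherwise the source encoding is used.
def part_encoding(modifiers, source_encoding)
  modifier = modifiers.each_char.find { |c| MODIFIER_ENCODINGS.key?(c) }
  modifier ? MODIFIER_ENCODINGS[modifier] : source_encoding
end

# The unescaped bytes of the part from this issue, tagged according to /e.
bytes = "\xC2\xA1".b
part = bytes.dup.force_encoding(part_encoding("e", Encoding::UTF_8))

puts part.encoding
puts part.valid_encoding?
```

With this approach the source encoding (UTF-8 or US-ASCII) only matters when the regexp has no encoding modifier, which is the part that is easy to get wrong in each consumer.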

@kddnewton WDYT?

$ bin/parse -e '/#{ }\xc2\xa1/e'
@ ProgramNode (location: (1,0)-(1,15))
├── locals: []
└── statements:
    @ StatementsNode (location: (1,0)-(1,15))
    └── body: (length: 1)
        └── @ InterpolatedRegularExpressionNode (location: (1,0)-(1,15))
            ├── flags: euc_jp
            ├── opening_loc: (1,0)-(1,1) = "/"
            ├── parts: (length: 2)
            │   ├── @ EmbeddedStatementsNode (location: (1,1)-(1,5))
            │   │   ├── opening_loc: (1,1)-(1,3) = "\#{"
            │   │   ├── statements: ∅
            │   │   └── closing_loc: (1,4)-(1,5) = "}"
            │   └── @ StringNode (location: (1,5)-(1,13))
            │       ├── flags: forced_utf8_encoding
            │       ├── opening_loc: ∅
            │       ├── content_loc: (1,5)-(1,13) = "\\xc2\\xa1"
            │       ├── closing_loc: ∅
            │       └── unescaped: "\\xc2\\xa1"
            └── closing_loc: (1,13)-(1,15) = "/e"
$ bin/parse -e '# encoding: US-ASCII
/#{ }\xc2\xa1/e'
...
AST:
@ ProgramNode (location: (2,0)-(2,15))
├── locals: []
└── statements:
    @ StatementsNode (location: (2,0)-(2,15))
    └── body: (length: 1)
        └── @ InterpolatedRegularExpressionNode (location: (2,0)-(2,15))
            ├── flags: euc_jp
            ├── opening_loc: (2,0)-(2,1) = "/"
            ├── parts: (length: 2)
            │   ├── @ EmbeddedStatementsNode (location: (2,1)-(2,5))
            │   │   ├── opening_loc: (2,1)-(2,3) = "\#{"
            │   │   ├── statements: ∅
            │   │   └── closing_loc: (2,4)-(2,5) = "}"
            │   └── @ StringNode (location: (2,5)-(2,13))
            │       ├── flags: forced_binary_encoding
            │       ├── opening_loc: ∅
            │       ├── content_loc: (2,5)-(2,13) = "\\xc2\\xa1"
            │       ├── closing_loc: ∅
            │       └── unescaped: "\\xc2\\xa1"
            └── closing_loc: (2,13)-(2,15) = "/e"

cc @andrykonchin
