Fix #4816: Apply @ temp operators from left to right #4826

kazarmy · 2025-01-05T06:08:49Z

Your checklist for this pull request

I've read the guidelines for contributing to this repository
I made sure to follow the project's coding style
I've documented or updated the documentation of every function and struct this PR changes. If not so I've explained why.
I've added tests that prove my fix is effective or that my feature works (if possible)
I've updated the rizin book with the relevant information (if needed) Update '@' help book#137

Detailed description

This PR fixes #4816 by applying the @ temp operators (listed at @?) from left to right instead of right to left. This was more work than I expected.

The main issue encountered was that the tmp_*_stmt nodes in the grammar cannot be statements anymore (i.e. they cannot have a _simple_stmt at the front). This disqualifies them from being commands with respect to the substitution functions ts_node_handle_arg() and ts_node_handle_arg_prargs(). I think I've managed to shoehorn in a fix without too much of a fuss.

The tmp_*_stmt nodes in the grammar should be renamed to e.g. tmp_*_op, but this has been deferred due to the mass renaming involved and because the renaming affects autocomplete.

Test plan

All builds are green.

Closing issues

Closes #4816.

wargio

I'm ok with this. makes more sense. wait for another approval before merging.

XVilka

LGTM but wait for @ret2libc first, please

ret2libc

Hey!

I think this is a "big" change as in it completely changes the shell behaviour as it has been so far. That said, I'm fine with it as long as it is well thought and works well!

test/db: are those the only places where the order really matters? I remember we have a lot of tests where we chain several @ together.
parsing commands should avoid global state? i think we don't have any for now
when i think about the shell grammar i always think about it recursively. this change kinda breaks it and makes things (in my view) harder to follow. It might just be I'm used to the right-to-left way though.

More on the last point. The thing is that now the command is structured as <cmd> <@tmp1> <@tmp2> .... while before it was <cmd> <@tmp> recursively. the reason the recursion works well in my mind is that each tmp_stmt is structured in this way: 1) setup 2) run original <cmd> 3) teardown. For example, @ addr does 1) save original addresses and change address to new one 2) run command 3) restore saved addresses. This works well with the previous command structure, but less well imho with the new one (which is the reason why you have some hacks around, I guess, like global child_idx, grandparent, etc.).
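The setup / run / teardown pattern described above can be sketched in a few lines of C. This is only an illustration of the recursive model, not rizin's code: `ToyCore`, `ToyStmt`, `ToyResult` and `toy_run` are hypothetical names, and the "command" merely records the state it observes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified shell state; this is not rizin's RzCore. */
typedef struct {
	uint64_t offset;    /* current seek address */
	uint32_t blocksize; /* current block size */
} ToyCore;

typedef enum { TOY_CMD, TOY_TMP_SEEK, TOY_TMP_BLKSZ } ToyKind;

/* A statement is either a plain command or a temp operator wrapping an inner statement. */
typedef struct ToyStmt {
	ToyKind kind;
	uint64_t value;              /* operand of the temp operator (unused for TOY_CMD) */
	const struct ToyStmt *inner; /* wrapped statement (NULL for TOY_CMD) */
} ToyStmt;

/* Records the state the command observed, standing in for real command output. */
typedef struct {
	uint64_t seen_offset;
	uint32_t seen_blocksize;
} ToyResult;

static void toy_run(ToyCore *core, const ToyStmt *stmt, ToyResult *out) {
	assert(core && stmt && out);
	switch (stmt->kind) {
	case TOY_CMD:
		/* The innermost command sees whatever the enclosing operators set up. */
		out->seen_offset = core->offset;
		out->seen_blocksize = core->blocksize;
		break;
	case TOY_TMP_SEEK: {
		uint64_t saved = core->offset;   /* 1) setup: save and change */
		core->offset = stmt->value;
		toy_run(core, stmt->inner, out); /* 2) run the wrapped statement */
		core->offset = saved;            /* 3) teardown: restore */
		break;
	}
	case TOY_TMP_BLKSZ: {
		uint32_t saved = core->blocksize;
		core->blocksize = (uint32_t)stmt->value;
		toy_run(core, stmt->inner, out);
		core->blocksize = saved;
		break;
	}
	}
}
```

Wrapping a command in a block-size operator and then a seek operator makes the command see both temporary values, and the original state is back in place once the outermost call returns.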

What if whenever you see a tmp_stmt you "rewrite" it in a recursive way like before?

essentially, the user can write px @!5 @x:1234 but then internally rizin parses it as (tmp_blksz_stmt (tmp_hex_stmt (simple_stmt))) so that we can reuse the parsing functions as before? What do you think? The rewriting can probably happen on the fly in the DEFINE_HANDLE_TS_FCN_AND_SYMBOL(tmp_stmt) {.

ret2libc · 2025-01-07T10:40:31Z

librz/core/cmd/cmd.c

+static uint32_t tmp_child_idx = 0;
+
+static TSNode tmp_get_next_node(TSNode cur) {
+	TSNode next = ts_node_next_named_sibling(cur);
+	tmp_child_idx++;
+	if (ts_node_is_null(next)) {
+		next = ts_node_named_child(ts_node_parent(cur), 0);
+		tmp_child_idx = 0;
+	}
+	return next;
+}
+


I don't like this tbh, it makes parallel parsing of commands not safe.

If we ever, ever decide to parse commands in parallel, I think the usual tricks can be applied e.g. make tmp_child_idx a thread-local variable, pass tmp_child_idx around as an argument etc.
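One of the "usual tricks" mentioned here, passing the index around instead of keeping it in a file-level variable, might look roughly like the sketch below. It deliberately avoids the tree-sitter API: `ToyCursor` and `toy_cursor_next` are hypothetical names, and the cursor only models the wrap-around behaviour of the quoted `tmp_get_next_node()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cursor over the named children of one parent node.
 * It stands in for the file-level counter, so each caller owns its position. */
typedef struct {
	uint32_t child_count; /* number of named children of the parent */
	uint32_t child_idx;   /* index of the child currently being visited */
} ToyCursor;

/* Advance to the next sibling, wrapping to child 0 after the last one.
 * Returns the new index. No state outside the cursor is touched. */
static uint32_t toy_cursor_next(ToyCursor *cur) {
	assert(cur && cur->child_count > 0);
	cur->child_idx++;
	if (cur->child_idx >= cur->child_count) {
		cur->child_idx = 0;
	}
	return cur->child_idx;
}
```

Two cursors advanced in an interleaved fashion do not disturb each other, which is the property a shared static counter cannot give. The other option named above, a thread-local variable, would keep the original function signature and mark the counter with C11 `_Thread_local` instead.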

ret2libc · 2025-01-07T10:43:46Z

librz/core/cmd/cmd.c

@@ -3102,7 +3102,8 @@ static bool substitute_args(struct tsr2cmd_state *state, TSNode args, TSNode *ne
 // If do_unwrap is true, then quote unwrapping is always done, else cd is
 // checked. An arg of raw type (this can be determined if cd is available)
 // prevents unescaping and quote unwrapping regardless.
-static RzCmdParsedArgs *ts_node_handle_arg_prargs(struct tsr2cmd_state *state, TSNode command, TSNode arg, uint32_t child_idx, bool do_unwrap, const RzCmdDesc *cd) {
+static RzCmdParsedArgs *ts_node_handle_arg_prargs(struct tsr2cmd_state *state, TSNode command, TSNode arg,


what is the grandparent used for? why is it necessary?
also, why a change in the tmp_*_stmt requires changes here? I think this might be a sign that something is not as it should be.

what is the grandparent used for? why is it necessary?

It's necessary because the following line

rizin/librz/core/cmd/cmd.c

Line 3115 in 14eecce

arg = ts_node_named_child(new_command, child_idx);

assumes that TSNode arg is a child of TSNode command. I need it to go 1 layer deeper, because arg is now a child of a tmp_*_stmt, which is a child of tmp_stmt, which contains the whole list of tmp_*_stmt nodes plus a _simple_stmt at the beginning.

also, why a change in the tmp_*_stmt requires changes here? I think this might be a sign that something is not as it should be.

My opinion on this is that substitution (like for backticks) should be allowed on any node regardless of whether the node is a command, but then I see

rizin/librz/core/cmd/cmd.c

Line 3044 in 14eecce

return ts_parser_parse_string(state->parser, NULL, state->input, strlen(state->input));

and I decided it would be better to just let sleeping dragons lie.

kazarmy · 2025-01-07T12:06:46Z

test/db: are those the only places where the order really matters? I remember we have a lot of tests where we chain several @ together.

Yes but for most of them, the @ order doesn't matter.

parsing commands should avoid global state? i think we don't have any for now

Command parsing currently does use the stack to hold previously visited nodes, so in a sense global state is already used. Ofc if parallel parsing is desired, each thread will get its own stack automatically.

when i think about the shell grammar i always think about it recursively. this change kinda breaks it and makes things (in my view) harder to follow. It might just be I'm used to the right-to-left way though.

I do see that the original code is basically doing (stmt (stmt (stmt ...))) and having one approach to rule them all does simplify matters, but like you implied, lists can only be parsed one way and that way might not be the ideal way.

What if whenever you see a tmp_stmt you "rewrite" it in a recursive way like before?

essentially, the user can write px @!5 @x:1234 but then internally rizin parses it as (tmp_blksz_stmt (tmp_hex_stmt (simple_stmt))) so that we can reuse the parsing functions as before? What do you think? The rewriting can probably happen on the fly in the DEFINE_HANDLE_TS_FCN_AND_SYMBOL(tmp_stmt) {.

I honestly did think hard about the approach you proposed (one reason being that changing 16 different tmp_*_stmt functions the same way is somewhat tedious) but what you propose would involve moving nodes, either at the tree-sitter level (if that's even possible) or at the C level. I read https://tree-sitter.github.io/tree-sitter/using-parsers/3-advanced-parsing.html#editing and it appeared that moving nodes might mess up the connection between nodes and the source text (among other things) so I think the pr approach is the lesser of 2 evils, i.e. it would complicate matters less.

ret2libc · 2025-01-07T12:29:25Z

I personally find that "need it to go 1 layer deeper, because arg is now a child of a tmp_*_stmt which is a child of tmp_stmt which contains the whole list of tmp_*_stmt + a _simple_stmt at the beginning" & co is very hard to follow and introduces some global state (not just global vars, but some global "concept" as well).

WRT editing the input, yeah it might not be trivial indeed.

Going back to the main change: is it really safe and consistent doing it? Like, what about @@= and similar? Those things and everything else so far, I think, have a right-to-left order.

What is the result of something like:

wb 1213 @!5 @@=$$+0 $$+0x10 $$+0x20 @ 0x70

?
if i follow the old mental model, i easily know/understand how operations are executed, but with the new model i'm not sure.

(unrelated, but there's probably a bug where $$ is evaluated after each execution instead of beforehand).

kazarmy · 2025-01-07T13:38:56Z

I personally find that "need it to go 1 layer deeper, because arg is now a child of a tmp_*_stmt which is a child of tmp_stmt which contains the whole list of tmp_*_stmt + a _simple_stmt at the beginning" & co is very hard to follow and introduces some global state (not just global vars, but some global "concept" as well).

Yes, it appears to be a new concept. I plan to rename tmp_*_stmt to tmp_*_op because they are no longer statements. I haven't done it yet because it would add to the PR and I didn't think it was necessary.

Perhaps this is easier to follow. For pd 2 @a:x86, the tree is:

(statements
  (*tmp_stmt*
    (arged_stmt
      (cmd_identifier)
      (args
        (arg
          (arg_identifier))))
    (tmp_arch_stmt
      (*arg*
        (arg_identifier)))))

Substitution needs to be done on the highlighted arg, but I need to pass the highlighted tmp_stmt and not tmp_arch_stmt to ts_node_handle_arg() because it is a command since it has arged_stmt at the beginning. The highlighted tmp_stmt is the grandparent of the highlighted arg.
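The parent/grandparent relationship in that tree can be checked with a toy model. The sketch below does not use tree-sitter's `TSNode`; `ToyNode`, `toy_add_child`, `toy_grandparent` and `toy_arg_from_command` are hypothetical helpers that only mirror the shape of the `pd 2 @a:x86` tree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a syntax-tree node; tree-sitter's TSNode is not used. */
typedef struct ToyNode {
	const char *type;
	struct ToyNode *parent;
	struct ToyNode *children[4];
	size_t n_children;
} ToyNode;

/* Appends child to parent and records the back-pointer. */
static void toy_add_child(ToyNode *parent, ToyNode *child) {
	assert(parent && child && parent->n_children < 4);
	parent->children[parent->n_children++] = child;
	child->parent = parent;
}

/* Returns the parent's parent, or NULL if the node sits too close to the root. */
static ToyNode *toy_grandparent(const ToyNode *node) {
	if (!node || !node->parent) {
		return NULL;
	}
	return node->parent->parent;
}

/* Finds an arg by walking two levels down from the command node:
 * command -> child op_idx (the temp operator) -> child arg_idx (the arg). */
static ToyNode *toy_arg_from_command(const ToyNode *command, size_t op_idx, size_t arg_idx) {
	if (!command || op_idx >= command->n_children) {
		return NULL;
	}
	const ToyNode *op = command->children[op_idx];
	if (arg_idx >= op->n_children) {
		return NULL;
	}
	return op->children[arg_idx];
}
```

With `tmp_stmt` holding `arged_stmt` and `tmp_arch_stmt`, and `tmp_arch_stmt` holding the `arg`, the grandparent of `arg` is `tmp_stmt`, and reaching `arg` from `tmp_stmt` takes two child lookups rather than one.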

WRT editing the input, yeah it might not be trivial indeed.

Oh that's a relief.

Going back to the main change: is it really safe and consistent doing it? Like, what about @@= and similar? Those things and everything else so far, I think, have a right-to-left order.

Since I'm not moving nodes, it's safe and consistent as long as tree-sitter is safe and consistent -- "safe and consistent" here meaning that the semantics is precisely defined, and the parser is deterministic and won't go into an infinite loop.

What is the result of something like:
wb 1213 @!5 @@=$$+0 $$+0x10 $$+0x20 @ 0x70 
?

Its tree is

(statements
  (tmp_stmt
    (iter_offsets_stmt
      (tmp_stmt
        (arged_stmt
          (cmd_identifier)
          (args
            (arg
              (arg_identifier))))
        (tmp_blksz_stmt
          (args
            (arg
              (arg_identifier)))))
      (args
        (arg
          (arg_identifier))
        (arg
          (arg_identifier))
        (arg
          (arg_identifier))))
    (tmp_seek_stmt
      (args
        (arg
          (arg_identifier))))))

which is

((((wb 1213) @!5) @@=$$+0 $$+0x10 $$+0x20) @ 0x70)

which is actually the same as before.

Only if you have something like:

wb 1213 @!5 @ 0x70 @@=$$+0 $$+0x10 $$+0x20

(the @ operators are next to each other) then the tree is

(statements
  (iter_offsets_stmt
    (tmp_stmt
      (arged_stmt
        (cmd_identifier)
        (args
          (arg
            (arg_identifier))))
      (tmp_blksz_stmt
        (args
          (arg
            (arg_identifier))))
      (tmp_seek_stmt
        (args
          (arg
            (arg_identifier)))))
    (args
      (arg
        (arg_identifier))
      (arg
        (arg_identifier))
      (arg
        (arg_identifier)))))

which is

(((wb 1213) @!5 @ 0x70) @@=$$+0 $$+0x10 $$+0x20)

where previously it was

((((wb 1213) @!5) @ 0x70) @@=$$+0 $$+0x10 $$+0x20)
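Where the two groupings can differ in practice is when adjacent operators touch the same piece of state. The sketch below is a toy model under that assumption, not rizin's handlers: `ToyCore`, `ToyOp`, `toy_run_ltr` and `toy_run_rtl` are hypothetical, and every operator is a plain "@ addr" that overwrites the offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t offset; } ToyCore;   /* hypothetical shell state */
typedef struct { uint64_t new_offset; } ToyOp; /* hypothetical "@ addr" operator */

#define TOY_MAX_OPS 8

/* Flat model: apply ops[0..n) left to right, report the offset the command
 * would see, then undo the ops in reverse order to restore the state. */
static uint64_t toy_run_ltr(ToyCore *core, const ToyOp *ops, size_t n) {
	uint64_t saved[TOY_MAX_OPS];
	assert(core && n <= TOY_MAX_OPS);
	for (size_t i = 0; i < n; i++) {
		saved[i] = core->offset;
		core->offset = ops[i].new_offset;
	}
	uint64_t seen = core->offset;
	for (size_t i = n; i > 0; i--) {
		core->offset = saved[i - 1];
	}
	return seen;
}

/* Nested model: the rightmost operator is the outermost wrapper, so it is
 * applied first and the leftmost operator is applied last. */
static uint64_t toy_run_rtl(ToyCore *core, const ToyOp *ops, size_t n) {
	assert(core);
	if (n == 0) {
		return core->offset; /* the command itself */
	}
	uint64_t saved = core->offset;
	core->offset = ops[n - 1].new_offset;
	uint64_t seen = toy_run_rtl(core, ops, n - 1);
	core->offset = saved;
	return seen;
}
```

For two operators with offsets 0x10 and 0x20 written in that order, the flat model lets the rightmost one win while the nested model lets the leftmost one win; both leave the original offset untouched afterwards.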

ret2libc

i'd say... can we just find a way to avoid tmp_child_idx?

kazarmy · 2025-01-08T10:03:23Z

i'd say... can we just find a way to avoid tmp_child_idx?

yes eedef9b

ret2libc

great! thanks a lot ;)

Fix rizinorg#4816: Apply @ temp operators from left to right

27bd823

github-actions bot added rz-test RzCore labels Jan 5, 2025

Fix pdq test

8d8ee45

kazarmy marked this pull request as ready for review January 5, 2025 06:42

kazarmy requested review from ret2libc, wargio, XVilka and thestr4ng3r as code owners January 5, 2025 06:42

wargio approved these changes Jan 5, 2025

XVilka approved these changes Jan 5, 2025

XVilka assigned ret2libc Jan 5, 2025

ret2libc reviewed Jan 7, 2025

ret2libc reviewed Jan 8, 2025

kazarmy added 2 commits January 8, 2025 17:39

Merge branch 'dev' of https://github.com/rizinorg/rizin into @-op-ltr

093331c

Remove tmp_child_idx

eedef9b

kazarmy force-pushed the @-op-ltr branch from 614d1d2 to eedef9b Compare January 8, 2025 10:01

ret2libc approved these changes Jan 8, 2025

kazarmy merged commit 1fafca6 into rizinorg:dev Jan 8, 2025
46 checks passed

This was referenced Jan 14, 2025

grammar: Rename tmp_*_stmt to tmp_*_op #4837

Merged

grammar: Remove prec.right from tmp_*_op rules #4846

Merged

kazarmy mentioned this pull request Feb 4, 2025

Update '@' help rizinorg/book#137

Merged
