Fix #4816: Apply @ temp operators from left to right #4826

Merged
merged 4 commits into from
Jan 8, 2025

kazarmy
Member

@kazarmy kazarmy commented Jan 5, 2025

Your checklist for this pull request

  • I've read the guidelines for contributing to this repository
  • I made sure to follow the project's coding style
  • I've documented or updated the documentation of every function and struct this PR changes. If not so I've explained why.
  • I've added tests that prove my fix is effective or that my feature works (if possible)
  • I've updated the rizin book with the relevant information (if needed) Update '@' help book#137

Detailed description

This PR fixes #4816 by applying the @ temp operators (listed at @?) from left to right instead of from right to left. This was more work than I expected.

The main issue encountered was that the tmp_*_stmt nodes in the grammar cannot be statements anymore (i.e. they cannot have a _simple_stmt at the front). This disqualifies them from being commands as far as the substitution functions ts_node_handle_arg() and ts_node_handle_arg_prargs() are concerned. I think I've managed to shoehorn in a fix without too much fuss.

The tmp_*_stmt nodes in the grammar should be renamed to e.g. tmp_*_op, but this has been deferred due to the mass renaming involved and because the renaming affects autocomplete.

Test plan

All builds are green.

Closing issues

Closes #4816.

@kazarmy kazarmy marked this pull request as ready for review January 5, 2025 06:42
Member

@wargio wargio left a comment

I'm ok with this. makes more sense. wait for another approval before merging.

Member

@XVilka XVilka left a comment

LGTM but wait for @ret2libc first, please

Member

@ret2libc ret2libc left a comment

Hey!

I think this is a "big" change as in it completely changes the shell behaviour as it has been so far. That said, I'm fine with it as long as it is well thought and works well!

  • test/db: are those the only places where the order really matters? I remember we have a lot of tests where we chain several @ together.
  • parsing commands should avoid global state? i think we don't have any for now
  • when i think about the shell grammar i always think about it recursively. this change kinda breaks it and makes things (in my view) harder to follow. It might just be I'm used to the right-to-left way though.

More on the last point. The thing is that now the command is structured as <cmd> <@tmp1> <@tmp2> .... while before it was <cmd> <@tmp> recursively. the reason the recursion works well in my mind is that each tmp_stmt is structured in this way: 1) setup 2) run original <cmd> 3) teardown. For example, @ addr does 1) save original addresses and change address to new one 2) run command 3) restore saved addresses. This works well with the previous command structure, but less well imho with the new one (which is the reason why you have some hacks around, I guess, like global child_idx, grandparent, etc.).

What if whenever you see a tmp_stmt you "rewrite" it in a recursive way like before?

essentially, the user can write px @!5 @x:1234 but then internally rizin parses it as (tmp_blksz_stmt (tmp_hex_stmt (simple_stmt))) so that we can reuse the parsing functions as before? What do you think? The rewriting can probably happen on the fly in the DEFINE_HANDLE_TS_FCN_AND_SYMBOL(tmp_stmt) {.

Comment on lines 3545 to 3556
static uint32_t tmp_child_idx = 0;

static TSNode tmp_get_next_node(TSNode cur) {
	TSNode next = ts_node_next_named_sibling(cur);
	tmp_child_idx++;
	if (ts_node_is_null(next)) {
		next = ts_node_named_child(ts_node_parent(cur), 0);
		tmp_child_idx = 0;
	}
	return next;
}

Member

I don't like this tbh, it makes parallel parsing of commands not safe.

Member Author

If we ever, ever decide to parse commands in parallel, I think the usual tricks can be applied e.g. make tmp_child_idx a thread-local variable, pass tmp_child_idx around as an argument etc.

@@ -3102,7 +3102,8 @@ static bool substitute_args(struct tsr2cmd_state *state, TSNode args, TSNode *ne
 // If do_unwrap is true, then quote unwrapping is always done, else cd is
 // checked. An arg of raw type (this can be determined if cd is available)
 // prevents unescaping and quote unwrapping regardless.
-static RzCmdParsedArgs *ts_node_handle_arg_prargs(struct tsr2cmd_state *state, TSNode command, TSNode arg, uint32_t child_idx, bool do_unwrap, const RzCmdDesc *cd) {
+static RzCmdParsedArgs *ts_node_handle_arg_prargs(struct tsr2cmd_state *state, TSNode command, TSNode arg,
Member

what is the grandparent used for? why is it necessary?
also, why a change in the tmp_*_stmt requires changes here? I think this might be a sign that something is not as it should be.

Member Author

what is the grandparent used for? why is it necessary?

It's necessary because the following line

arg = ts_node_named_child(new_command, child_idx);

assumes that TSNode arg is a child of TSNode command. I need it to go 1 layer deeper, because arg is now a child of a tmp_*_stmt which is a child of tmp_stmt which contains the whole list of tmp_*_stmt + a _simple_stmt at the beginning.

also, why a change in the tmp_*_stmt requires changes here? I think this might be a sign that something is not as it should be.

My opinion on this is that substitution (like for backticks) should be allowed on any node regardless of whether the node is a command, but then I see

return ts_parser_parse_string(state->parser, NULL, state->input, strlen(state->input));

and I decided it would be better to just let sleeping dragons lie.

@kazarmy
Member Author

kazarmy commented Jan 7, 2025

  • test/db: are those the only places where the order really matters? I remember we have a lot of tests where we chain several @ together.

Yes but for most of them, the @ order doesn't matter.

  • parsing commands should avoid global state? i think we don't have any for now

Command parsing currently does use the stack to hold previously visited nodes, so in a sense global state is used. Of course, if parallel parsing is desired, each thread will get its own stack automatically.

  • when i think about the shell grammar i always think about it recursively. this change kinda breaks it and makes things (in my view) harder to follow. It might just be I'm used to the right-to-left way though.

I do see that the original code is basically doing (stmt (stmt (stmt ...))) and having one approach to rule them all does simplify matters, but like you implied, lists can only be parsed one way and that way might not be the ideal way.

What if whenever you see a tmp_stmt you "rewrite" it in a recursive way like before?

essentially, the user can write px @!5 @x:1234 but then internally rizin parses it as (tmp_blksz_stmt (tmp_hex_stmt (simple_stmt))) so that we can reuse the parsing functions as before? What do you think? The rewriting can probably happen on the fly in the DEFINE_HANDLE_TS_FCN_AND_SYMBOL(tmp_stmt) {.

I honestly did think hard about the approach you proposed (one reason being that changing 16 different tmp_*_stmt functions the same way is somewhat tedious), but it would involve moving nodes, either at the tree-sitter level (if that's even possible) or at the C level. I read https://tree-sitter.github.io/tree-sitter/using-parsers/3-advanced-parsing.html#editing and it appeared that moving nodes might break the connection between nodes and the source text (among other things), so I think the PR's approach is the lesser of two evils, i.e. it would complicate matters less.

@ret2libc
Member

ret2libc commented Jan 7, 2025

I personally find that "need it to go 1 layer deeper, because arg is now a child of a tmp_*_stmt which is a child of tmp_stmt which contains the whole list of tmp_*_stmt + a _simple_stmt at the beginning" & co is very hard to follow and introduces some global state (not just global vars, but some global "concept" as well).

WRT editing the input, yeah it might not be trivial indeed.

Going back to the main change: is it really safe and consistent doing it? Like, what about @@= and similar? Those things and everything else so far, I think, has a right-to-left order.

What is the result of something like:

wb 1213 @!5 @@=$$+0 $$+0x10 $$+0x20 @ 0x70 

?
if i follow the old mental model, i easily know/understand how operations are executed, but with the new model i'm not sure.

(unrelated, but there's probably a bug where $$ is evaluated after each execution instead of beforehand).

@kazarmy
Member Author

kazarmy commented Jan 7, 2025

I personally find that "need it to go 1 layer deeper, because arg is now a child of a tmp_*_stmt which is a child of tmp_stmt which contains the whole list of tmp_*_stmt + a _simple_stmt at the beginning" & co is very hard to follow and introduces some global state (not just global vars, but some global "concept" as well).

Yes, it appears to be a new concept. I plan to rename tmp_*_stmt to tmp_*_op because they are no longer statements. I haven't done it yet because it will add to the PR and I didn't think it was necessary.

Perhaps this is easier to follow. For pd 2 @a:x86, the tree is:

(statements
  (*tmp_stmt*
    (arged_stmt
      (cmd_identifier)
      (args
        (arg
          (arg_identifier))))
    (tmp_arch_stmt
      (*arg*
        (arg_identifier)))))

Substitution needs to be done on the highlighted arg, but I need to pass the highlighted tmp_stmt and not tmp_arch_stmt to ts_node_handle_arg() because it is a command since it has arged_stmt at the beginning. The highlighted tmp_stmt is the grandparent of the highlighted arg.

WRT editing the input, yeah it might not be trivial indeed.

Oh that's a relief.

Going back to the main change: is it really safe and consistent doing it? Like, what about @@= and similar? Those things and everything else so far, I think, has a right-to-left order.

Since I'm not moving nodes, it's safe and consistent as long as tree-sitter is safe and consistent -- "safe and consistent" here meaning that the semantics is precisely defined, and the parser is deterministic and won't go into an infinite loop.

What is the result of something like:

wb 1213 @!5 @@=$$+0 $$+0x10 $$+0x20 @ 0x70 

?

Its tree is

(statements
  (tmp_stmt
    (iter_offsets_stmt
      (tmp_stmt
        (arged_stmt
          (cmd_identifier)
          (args
            (arg
              (arg_identifier))))
        (tmp_blksz_stmt
          (args
            (arg
              (arg_identifier)))))
      (args
        (arg
          (arg_identifier))
        (arg
          (arg_identifier))
        (arg
          (arg_identifier))))
    (tmp_seek_stmt
      (args
        (arg
          (arg_identifier))))))

which is

((((wb 1213) @!5) @@=$$+0 $$+0x10 $$+0x20) @ 0x70)

which is actually the same as before.

Only if you have something like:

wb 1213 @!5 @ 0x70 @@=$$+0 $$+0x10 $$+0x20

(the @ operators are next to each other) then the tree is

(statements
  (iter_offsets_stmt
    (tmp_stmt
      (arged_stmt
        (cmd_identifier)
        (args
          (arg
            (arg_identifier))))
      (tmp_blksz_stmt
        (args
          (arg
            (arg_identifier))))
      (tmp_seek_stmt
        (args
          (arg
            (arg_identifier)))))
    (args
      (arg
        (arg_identifier))
      (arg
        (arg_identifier))
      (arg
        (arg_identifier)))))

which is

(((wb 1213) @!5 @ 0x70) @@=$$+0 $$+0x10 $$+0x20)

where previously it was

((((wb 1213) @!5) @ 0x70) @@=$$+0 $$+0x10 $$+0x20)

Member

@ret2libc ret2libc left a comment

i'd say... can we just find a way to avoid tmp_child_idx?

@kazarmy
Member Author

kazarmy commented Jan 8, 2025

i'd say... can we just find a way to avoid tmp_child_idx?

yes eedef9b

Member

@ret2libc ret2libc left a comment

great! thanks a lot ;)

@kazarmy kazarmy merged commit 1fafca6 into rizinorg:dev Jan 8, 2025
46 checks passed
Development

Successfully merging this pull request may close these issues.

Multiple @-modifiers should apply left-to-right, not right-to-left
4 participants