log08.txt

    Well, I think the next thing to do is make the
    interpreter take input from STDIN. With that, I'll be
    able to not only play with it interactively, but also
    redirect and pipe instructions into it, which will mean
    being able to have some regression tests as well.

        [ ] Get input string from 'read' syscall
        [ ] Add some testing (shell script?)

    Testing is one of those things that a lot of people have
    strong opinions about. But I absolutely love having a
    reasonable number of tests in place to let me know that
    I haven't broken something. Tests let me be _more_
    creative and _more_ brave with the code because I can
    try something and know right away whether or not it
    works.

    I have come to _loathe_ manual testing because I've been
    in the Web dev world forever and testing on the web
    SUCKS. In a lot of cases, the state of the art is still
    refreshing a browser and clicking through a bunch of
    pages. When you're used to that, it is a _delight_ to be
    able to easily set up some STDIN/STDOUT tests on a
    command line!!!

    Okay, that's more than enough about that. It's time to
    get real input!

    Right off the bat, I know I'm going to need to handle
    this in three places:

        * get_token
        * eat_spaces
        * quote

    On the plus side, I'm happy with my input functionality
    (especially the string literals) and I don't regret
    how they've turned out. But now's when I have to pay the
    price for three separate methods that read input.

    This is definitely going to test my resolve to stick
    with the "inline all the things" redundant code. But I
    don't think "get_input" or whatever I end up calling it
    will be very long. Might even be under 100 bytes. So
    three copies shouldn't be too bad. :-)

    (When I wrote the above paragraphs, I thought I was
    going to have up to six copies, but I kept realizing
    that most of those places weren't actually reading a
    whole stream of input - they were relying on one of
    these three to do it.)

    Okay, stripped of comments, this is get_input:

            mov ebx, [input_file]
            mov ecx, input_buffer
            mov edx, INPUT_SIZE
            mov eax, SYS_READ
            int 0x80
            cmp eax, INPUT_SIZE
            jge %%done
            mov byte [input_buffer + eax], 0
        %%done:
            mov dword [input_buffer_pos], input_buffer

    It's tiny - just the Linux 'read' syscall to get more
    input into the input_buffer. The only interesting thing
    is that if we read less than an entire buffer's worth,
    it null-terminates the string.
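
    For anyone who reads Python more easily than assembly,
    here's a rough model of the same logic. It's only a
    sketch: 'get_input' and INPUT_SIZE mirror the names in
    the assembly, but the bytearray buffer and the pipe used
    for the demo are assumptions made for illustration.

```python
import os

INPUT_SIZE = 16  # small on purpose, just for the demo

def get_input(fd, buf):
    """Model of get_input: one read() call into buf.

    Returns the number of bytes read. A short read gets a 0
    byte written right after the data, like the assembly.
    """
    data = os.read(fd, INPUT_SIZE)
    n = len(data)
    buf[:n] = data
    if n < INPUT_SIZE:      # same test as 'cmp eax, INPUT_SIZE' + 'jge'
        buf[n] = 0          # null-terminate the short read
    return n

# Demo: feed 6 bytes through a pipe and read them back.
r, w = os.pipe()
os.write(w, b"hello\n")
os.close(w)
buf = bytearray(b"\xff" * INPUT_SIZE)
count = get_input(r, buf)
os.close(r)
print(count, bytes(buf[:count + 1]))
```

    One difference worth knowing: Python's os.read raises an
    exception on error instead of returning -1, so the model
    says nothing about the error path.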

    Now I gotta use it in (at least?) three places. Two of
    them also needed to be updated now that I understand
    what the esi and edi registers are for, ha ha. Anyway,
    here's a typical example from 'eat_spaces':

        cmp esi, input_buffer_end ; need to get more input?
        jl .continue    ; no, keep going
        GET_INPUT_CODE  ; yes, get some
        jmp .reset      ; got more input, reset and continue

    I kept simplifying until I got down to those four lines.
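
    The same "check, refill, reset" shape can be sketched in
    Python. This is a toy model, not the real word: the
    Input class, the 'refills' counter, and the exact set of
    bytes treated as whitespace are all assumptions.

```python
import io

INPUT_SIZE = 16

class Input:
    """Hypothetical stand-in for input_buffer, its position and EOF flag."""
    def __init__(self, stream):
        self.stream = stream
        self.buf = b""
        self.pos = 0
        self.eof = False
        self.refills = 0

    def get_input(self):
        self.buf = self.stream.read(INPUT_SIZE)
        self.pos = 0
        self.refills += 1
        if not self.buf:
            self.eof = True

def eat_spaces(inp):
    """Advance past whitespace, refilling the buffer as needed."""
    while not inp.eof:
        if inp.pos >= len(inp.buf):   # need to get more input?
            inp.get_input()           # yes, get some
            continue                  # got more input, reset and continue
        if inp.buf[inp.pos] not in b" \t\r\n":
            return                    # found a non-space byte
        inp.pos += 1

# 20 spaces do not fit in one 16-byte buffer, so this needs two reads.
inp = Input(io.BytesIO(b" " * 20 + b"print"))
eat_spaces(inp)
print(inp.refills, inp.buf[inp.pos:])
```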

    But does it work? Just a couple dumb mistakes and
    then...

$ mr
hello
Could not find word "hello" while looking in IMMEDIATE  mode.
Exit status: 1

    Wow! I don't know why I typed "hello" as my first live
    input into this thing. But it totally worked. Ha ha, I
    probably don't need an unfound word to be a fatal error
    anymore. :-)

    How about something that *will* work:

$ mr
"Hello world!\n" print
Hello world!
Goodbye.
Exit status: 0

    Yay! My first live "Hello world" in this interpreter!

    I'm still exiting after one line of input. I'll have to
    figure that out. Do I read until an actual EOF character
    is encountered? I can't remember. I'm just super excited
    this works!

    So after all that hand-wringing about having a couple
    copies of this code, how much impact has that actually
    had?

    Here's the relevant bits from 'inspect_all':

get_input: 45 bytes IMMEDIATE COMPILE
get_token: 108 bytes IMMEDIATE
eat_spaces: 80 bytes IMMEDIATE COMPILE
quote: 348 bytes IMMEDIATE COMPILE

    Let's compare with the previous log07.txt results:

get_token: 55 bytes IMMEDIATE 
eat_spaces: 38 bytes IMMEDIATE COMPILE 
quote: 247 bytes IMMEDIATE COMPILE 

    Since I cleaned up some of the words, there wasn't an
    across-the-board increase of 45 * 3 bytes.

    I'd like to see what the grand total has become. And
    I'll probably want to do that often. So I'll make a new
    option in my build.sh script:

  if [[ $1 == 'bytes' ]]
  then
      AWK='/^.*: [0-9]+/ {t=t+$2} END{print "Total bytes:", t}'
      echo 'inspect_all' | ./$F | awk -e "$AWK"
      exit
  fi
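
    For comparison, here's roughly what that awk one-liner
    does, sketched in Python. The function name
    'total_bytes' is made up, and the sample is just the
    four 'inspect_all' lines shown earlier in this log, not
    the interpreter's full output.

```python
import re

def total_bytes(inspect_output):
    """Sum the byte counts from lines like 'name: 45 bytes ...'."""
    total = 0
    for line in inspect_output.splitlines():
        m = re.match(r"^.*: ([0-9]+)", line)
        if m:
            total += int(m.group(1))
    return total

sample = """\
get_input: 45 bytes IMMEDIATE COMPILE
get_token: 108 bytes IMMEDIATE
eat_spaces: 80 bytes IMMEDIATE COMPILE
quote: 348 bytes IMMEDIATE COMPILE
"""
print("Total bytes:", total_bytes(sample))
```

    The awk version adds up the second whitespace-separated
    field; the regex here grabs the digits after the colon,
    which comes to the same thing for lines in this format.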

    Okay, let's see the damage:

$ ./build.sh bytes
Total bytes: 2816

    The last run in log07.txt was 2655 bytes, so the
    difference is:

        2816   (current)
      - 2655   (previous)
      ------
         161

    Ha ha, only 161 bytes difference, and since one of the
    three copies is needed, I only gained 116 bytes of
    "bloat". I think I can live with that on the x86
    platform. :-)
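
    Here's that arithmetic spelled out, along with the
    per-word growth, using only the numbers printed in this
    log. The variable names are just for illustration.

```python
before = {"get_token": 55, "eat_spaces": 38, "quote": 247}
after = {"get_token": 108, "eat_spaces": 80, "quote": 348}
get_input_size = 45

growth = 2816 - 2655                 # total growth since log07.txt
bloat = growth - get_input_size      # minus the one copy that's needed
deltas = {name: after[name] - before[name] for name in before}

print(growth, bloat)
print(deltas)
```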

    Now I gotta figure out how to continue reading after one
    line of input.

    Oh, wait! One last thing. I had also set the input
    buffer to an artificially tiny size so I could make sure
    it was being refilled as needed. I'll add a DEBUG
    statement to see where that's happening.

    The buffer size is 16 bytes.

$ mr
GET_INPUT00000000
"This is a jolly long string to make sure we read plenty into input buffer a couple times.\n" print
GET_INPUT00000000
GET_INPUT00000000
GET_INPUT00000000
GET_INPUT00000000
GET_INPUT00000000
GET_INPUT00000000
This is a jolly long string to make sure we read plenty into input buffer a couple times.
Goodbye.
Exit status: 0

    Okay, perfect, that long line of input required 7 calls
    to 'get_input' to refill the input_buffer. Now I'll set
    it to a reasonable size. I've seen some conflicting
    stuff online, so I'll just take the coward's way out:

        %assign INPUT_SIZE 1024 ; size of input buffer
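
    The count of 7 refills for that long line can be
    sanity-checked with a little arithmetic. This sketch
    assumes each 'read' returns a full 16 bytes until the
    line runs out, and that the line ends with a single
    newline byte.

```python
import math

# The line as typed: the backslash-n inside the quotes is two
# characters, and ENTER adds one real newline at the end.
line = ('"This is a jolly long string to make sure we read plenty '
        'into input buffer a couple times.\\n" print\n')
buffer_size = 16

reads = math.ceil(len(line) / buffer_size)
print(len(line), reads)
```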

    Now to figure out how to keep reading after the first
    line (or token?) of input.

    Okay, so I do need to check the return value from 'read'
    because that's the only way I can know if I've really
    got an EOF instead of just "no more input at this
    moment" - as would be the case between when the user
    hits enter and types the next line of input.

    I also added a new eof global that I can trip as soon as
    any of the 'get_input' instances hits the end of input:

            cmp eax, 0            ; 0=EOF, -1=error
            jge %%normal
            mov dword [input_eof], 1  ; set EOF reached
        %%normal:

dave@cygnus~/meow5$ mr
"Hello world!\n" print
Hello world!

: loud_meow "MEOW!\n" print ;
loud_meow
MEOW!

exit
Exit status: 12

    Heh, that's so cool. I can finally interact with this
    thing for real. But CTRL+D doesn't exit. I had to type
    'exit' to make that happen.

    I'll add a debug to 'get_input' to see what 'read' is
    returning...

dave@cygnus~/meow5$ mr
"goodbye cruel world" print
read bytes: 0000001c
goodbye cruel world      <---- I typed ENTER here
read bytes: 00000001
                         <---- ENTER again here
read bytes: 00000001
read bytes: 00000000     <---- CTRL+D

read bytes: 00000001     <---- ENTER again
exit
read bytes: 00000005
Exit status: 1

    Okay, so I guess I'm not checking the input_eof flag
    correctly in my interpreter loop?

    No! Ha, perhaps you spotted it before I did in the
    assembly snippet? Here it is again:

            cmp eax, 0            ; 0=EOF, -1=error
            jge %%normal
            mov dword [input_eof], 1  ; set EOF reached
        %%normal:

    Silly mistake:

            jge %%normal

    should be

            jg %%normal

    so that 0 will trigger EOF!
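
    A tiny Python model makes the difference easy to see.
    The two functions are hypothetical stand-ins for the two
    versions of the jump; each returns what input_eof would
    end up as for a given 'read' return value in eax.

```python
def eof_flag_jge(eax):
    """Buggy version: 'jge' skips setting the flag when eax >= 0."""
    return 0 if eax >= 0 else 1

def eof_flag_jg(eax):
    """Fixed version: 'jg' skips setting the flag only when eax > 0."""
    return 0 if eax > 0 else 1

# 5 = got some bytes, 0 = EOF, -1 = error
for eax in (5, 0, -1):
    print(eax, eof_flag_jge(eax), eof_flag_jg(eax))
```

    Only the middle case differs: with 'jge', a return value
    of 0 never sets the flag, which matches CTRL+D being
    ignored in the session above.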

    Okay, that pretty much worked. But there's still some
    inelegant code in the interpreter where I feel like I'm
    checking for input too many times and it's somehow still
    not enough.

    I was null-terminating the input and I think I would be
    better off keeping an upper bound (an end pointer) for
    it instead.

    Two nights later: Okay, just about have the kinks worked
    out. I've got two new global variables to keep track of
    the input buffer:

        input_buffer: resb INPUT_SIZE
        input_buffer_pos: resb 4
        input_buffer_end: resb 4  <--- new
        input_eof: resb 4         <--- new

    Now I can check input_eof in any input words and in the
    outer interpreter.

    Okay, I'm stuck in 'eat_spaces'. I'm peppering it with
    DEBUG macro calls to see what's up. esi points to the
    current character in the input buffer (if it's a space,
    we want to advance past it). ebx holds the last
    position filled in the buffer by 'read'.

$ mr
eat_spaces pos: 0804c774
eat_spaces RESET, pos: 0804c774
ES more input! esi: 0804c774
ES more input! ebx: 0804c774
45 234 "hello!" meow             <----- I typed this
read bytes: 00000015
eat_spaces RESET, pos: 0804c774
ES more input! esi: 0804c774
ES more input! ebx: 0804c774
read bytes: 00000000             <----- I typed CTRL+D here
get_input EOF! 00000001
eat_spaces RESET, pos: 0804c774
get_next_token checking for EOF 0804ace2
Goodbye.
Exit status: 0

    Well, that would be a problem. Looks like esi and ebx
    are always the same value. Oops!

    LOL, that's exactly it. I forgot to save the new end of
    buffer pointer in 'get_input'. Here we are:

        mov dword [input_buffer_end], ebx ; save it

    Do you like super verbose logging? You'll love this.
    Here I am printing "hello" and then quitting with
    CTRL+D. It's hard to even find the interaction amidst
    all the noise:

eat_spaces pos: 0804c7d5
eat_spaces RESET, pos: 0804c7d5
eat_spaces looking at char... 0000000a
ES more input! esi: 0804c7d6
ES more input! ebx: 0804c7d6
"hello" print
read bytes: 0000000e
eat_spaces RESET, pos: 0804c7cc
eat_spaces looking at char... 00000022
get_next_token checking for EOF 0804ad12
get_next_token looking at chars. 0804ad12
quote0804c7cc
eat_spaces pos: 0804c7d3
eat_spaces RESET, pos: 0804c7d3
eat_spaces looking at char... 0804c320
eat_spaces looking at char... 0804c370
eat_spaces pos: 0804c7d4
eat_spaces RESET, pos: 0804c7d4
eat_spaces looking at char... 00000070
get_next_token checking for EOF 0804ad12
get_next_token looking at chars. 0804ad12
get_token0804c7d4
helloeat_spaces pos: 0804c7d9
eat_spaces RESET, pos: 0804c7d9
eat_spaces looking at char... 0000000a
ES more input! esi: 0804c7da
ES more input! ebx: 0804c7da
read bytes: 00000000
get_input EOF! 00000001
eat_spaces RESET, pos: 0804c7cc
get_next_token checking for EOF 0804ad12
Goodbye.
Exit status: 0

    But it works. I'll clean this up tomorrow night and see
    if I can add a simple test script.

    Next night: The DEBUGs are cleaned up. Now a couple
    housekeeping things. First, I want to complete that TODO
    item from the last log, a word to print all defined
    words (just the names, not the entire 'inspect' output).
    I think I'll call it 'all'.

        [ ] New word: 'all' to list all current word names

    Well, that was easy:

$ mr
all
all inspect_all inspect ps printmode printnum number decimal bin oct hex radix str2num quote num2str ; return : copystr get_token eat_spaces get_input find is_runcomp get_flags inline print newline strlen exit
Goodbye.
Exit status: 0

    I also added a non-destructive stack printing word last
    log and I never actually got it working. So I'd like to
    fix that.

        [ ] Finish 'ps' (non-destructive stack print)

    And since I have string escape sequences for
    runtime newline printing and NASM can include newlines
    in string literals with backticks, I'd like to remove
    the 'newline' word. I'm only using it in a couple places
    anyway.

        [ ] Remove word 'newline' (replace with `\n`)

    That one was super-easy too. I didn't really need a TODO
    item for it. But it'll feel good to show that checked
    box at the end of the log, so why not?

    Now for that print stack:

$ mr
42 ps
1 4290881940 0 4290881948 4290881964 4290881982
4290882002 4290882040 4290882048 4290882106 ...

    It just keeps going on and on. And then ends in a
    Segmentation fault. So clearly I've got something wrong.

    When the interpreter starts, I save the stack pointer to
    a variable.

        mov dword [stack_start], esp

    I want to do a sanity check, so I'll push two values:

        push dword 555
        push dword 42

    Let's see this in action to confirm how x86 stacks work:

$ mb
Reading symbols from meow5...
(gdb) break 877
Breakpoint 1 at 0x8049f92: file meow5.asm, line 877.
(gdb) r
Starting program: /home/dave/meow5/meow5 

Breakpoint 1, _start () at meow5.asm:877

    Okay, let's see what the stack register currently points
    to (and by using GDB's 'display', this will always print
    after every command):

(gdb) disp $esp
1: $esp = (void *) 0xffffd780
(gdb) disp *(int)$esp
2: *(int)$esp = 1

    I noticed that 1 (one) when I was trying to debug the
    stack before. I have no idea why that's there. That's
    something else to figure out.

    Anyway, we can see the "first" stack address:

        0xffffd780

    And as I push values onto the stack, esp should
    decrement by 4 since the x86 stack writes to memory
    backward. (By the way, I feel a rant about how we
    describe this coming on, stay tuned for that in a
    moment.)

    -------------------------------------------------------
                            NOTE
    -------------------------------------------------------
    By the way, I often manually manipulate these GDB
    sessions here in my logs so that the instruction I'm
    executing shows up right before I start examining
    memory. Sorry if that confuses people who are
    well-versed in GDB and are wondering what the heck is
    going on.
    -------------------------------------------------------

    Now I'll just verify that my stack_start variable indeed
    holds the same value as esp and it points to that '1' at
    the beginning of the stack:

877	    mov dword [stack_start], esp
(gdb) s
1: $esp = (void *) 0xffffd780
2: *(int)$esp = 1
(gdb) x/a (int)stack_start 
0xffffd780:	0x1

    Yup. No surprises so far.

    Now when I push, we should see esp decrement and point
    to the newly pushed value:

879	    push dword 555
(gdb) s
1: $esp = (void *) 0xffffd77c
2: *(int)$esp = 555
880	    push dword 42
(gdb) s
1: $esp = (void *) 0xffffd778
2: *(int)$esp = 42

    Looks good so far!

        0xffffd780 1
        0xffffd77c 555
        0xffffd778 42

    ...I think. I'm really no good at hex calculations in my
    head. Even easy ones. Let's confirm with 'dc', the old
    RPN desk calculator on UNIX systems since forever:

$ dc
16 i 10 o   <--- set input and output base to 16 (get it?)
1A 5 + p
1F          <--- just making sure it's set up okay
D780 p
D780        <--- 0xffffd780
4 - p
D77C        <--- 0xffffd77c
4 - p
D778        <--- 0xffffd778

    dc is crazy. Anyway, those addresses are right. Every
    push subtracts 4 from esp and writes the pushed value to
    that address.
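
    If dc isn't handy, the same check is a couple of lines
    of Python. The starting address is the one from the GDB
    session above; the loop simply subtracts 4 per push.

```python
stack_start = 0xffffd780

# esp before any push, after one push, and after two pushes
for pushes in range(3):
    print(hex(stack_start - 4 * pushes))
```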

    So when I examine the stack area of memory, I should be
    able to subtract 4 from my stack_start variable and see
    each value. When I hit the current value of esp, that's
    the last value on the stack and I'm done:

(gdb) x/d (int)stack_start 
0xffffd780:	1
(gdb) x/d (int)stack_start -4
0xffffd77c:	555
(gdb) x/d (int)stack_start -8
0xffffd778:	42

    Great! So the computer is doing what I think it's doing.
    Always a good sign. :-)

     *****************************************************
     * RANT ALERT * RANT ALERT * RANT ALERT * RANT ALERT * 
     *****************************************************

    Okay, so my issue with how we talk about stacks is the
    use of terms like "top" and "bottom".

    If we start with the stack of plates analogy, it's
    perfectly fine to talk about the top of the stack
    because it makes physical sense:

        =====   <--- top plate
        =====
        =====
        =====

    But where's the "top" of this memory?

        +-----+
        |     | 0x0000
        +-----+
        |     | ...
        +-----+
        |     | 0xFFFF
        +-----+

    Okay, now where's the "top" of this memory?

        +-----+
        |     | 0xFFFF
        +-----+
        |     | ...
        +-----+
        |     | 0x0000
        +-----+

    Where's the "top" of the stack in this memory?

        +-----+
        | === | 0xFFFF  } stack start
        +-===-+         } stack
        | === | ...     } stack
        +-----+
        |     | 0x0000
        +-----+

    And the "top" of the stack in this memory?

        +-----+
        | === | 0x0000  } stack start
        +-===-+         } stack
        | === | ...     } stack
        +-----+
        |     | 0xFFFF
        +-----+

    Or this?

        +-----+
        |     | 0xFFFF
        +-----+
        | === | ...     } stack
        +-===-+         } stack
        | === | 0x0000  } stack start
        +-----+

    Or this?

        +-----+
        |     | 0x0000
        +-----+
        | === | ...     } stack
        +-===-+         } stack
        | === | 0xFFFF  } stack start
        +-----+

    I've seen ALL of these representations over the years
    and the person making the diagram just passes it off
    like their own personal mental model is completely
    obvious.

    This situation is nuts.

    And I know Intel's official docs for x86 use the "top"
    and "bottom" terms. But guess what? Intel's "word" size
    on 64-bit processors is 16 bits, so I think we can
    safely ignore their advice on terminology.

    Personally, I don't picture ANY of the diagrams above.

    Instead, I imagine the stack as horizontal memory and
    the stack grows to the right:

       +--------------------
       | A | B | C | D | E --->
       +--------------------
         ^               ^
        oldest          current

    But you'll notice that I don't say "rightmost" or
    "leftmost". That would be ridiculous. Especially since
    x86 has a stack that grows from a high-numbered address
    to a lower-numbered address. So it's really more like
    this:

                           --------------------+       
                        <--- E | D | C | B | A |       
                           --------------------+       
                             ^               ^
                            0xE4           0xFF
                          (current)      (oldest)

    Anyway, the point is that using directional descriptions
    as if we were all looking at the same physical object is
    super confusing.

    I prefer stack descriptions such as:

        * current / newest / recent
        * older / previous
        * oldest
        * hot vs cold
        * surfaced / buried

    And so on. I'm sure you can think of some better ones.
    Actually, please do.

     *****************************************************
     * RANT ALERT * RANT ALERT * RANT ALERT * RANT ALERT * 
     *****************************************************

    Sorry about that. I do feel better now. So, I've made
    some changes in how I do the stack printing (I needed to
    basically reverse everything I was doing, ha ha) and
    let's see if it works now:

$ mr
ps
1
42 555 97 33
ps
1 42 555 97 33
"Hello $ $ $" print
Hello 33 97 555
ps
1 42
"I put $ on there, but where does the $ come from???\n" print
I put 42 on there, but where does the 1 come from???
Goodbye.
Exit status: 0

    I don't know if that's hard to follow or not? It's
    tempting to make some sort of prompt in the interpreter
    just so it's easier to see the commands I type versus
    the responses.
    
    Anyway, it works great. I just don't understand why
    there's a 1 on the stack when I start?

    I guess it doesn't really matter. It occurs to me that I
    should consider the start of the stack to be the *next*
    available position. I'll update that now.

    From:

        mov dword [stack_start], esp

    To:

        lea eax, [esp - 4]
        mov [stack_start], eax

    Did that fix it?

ps

42 16 ps
42 16
8 ps
42 16 8

    Yup! Now we start with nothing on the stack and adding
    items to the stack only shows those items.
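
    Here's a toy Python model of that walk, from stack_start
    down to esp in steps of 4. It's a sketch of the idea
    only: the Stack class and its dict-based "memory" are
    invented for illustration and say nothing about how the
    real 'ps' word is written.

```python
class Stack:
    """Toy model of a 32-bit x86 stack that grows downward."""
    def __init__(self, esp, initial):
        self.mem = {esp: initial}       # whatever was already at esp
        self.esp = esp
        self.stack_start = esp - 4      # the *next* available slot

    def push(self, value):
        self.esp -= 4
        self.mem[self.esp] = value

    def ps(self):
        """Non-destructive print: oldest to newest, esp unchanged."""
        addrs = range(self.stack_start, self.esp - 4, -4)
        return " ".join(str(self.mem[a]) for a in addrs)

s = Stack(0xffffd780, 1)
print(repr(s.ps()))     # nothing pushed yet
s.push(42)
s.push(16)
print(s.ps())
s.push(8)
print(s.ps())
```

    Because stack_start sits one slot past the mysterious 1,
    the walk never reaches it.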

    Now how about a test script? I'm a big fan of simple
    tests that are just enough to give me the peace of mind
    that I haven't broken anything that used to work.

    One thing that works just fine now that I take input on
    STDIN is piping input:

$ echo "42 13 ps" | ./meow5 
42 13 
Goodbye.

    And I can grep/ag the results to make sure they contain
    what I want.
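
    That pipe-and-grep idea could be sketched as a tiny
    Python harness. The helper names 'run_case' and 'check'
    are made up, and the demo uses a stand-in command that
    echoes STDIN so the sketch runs without the interpreter
    being built.

```python
import subprocess
import sys

def run_case(argv, stdin_text):
    """Pipe stdin_text into a command and return its stdout."""
    result = subprocess.run(argv, input=stdin_text, text=True,
                            capture_output=True, check=False)
    return result.stdout

def check(argv, stdin_text, expected):
    """Return True if expected appears anywhere in the output."""
    return expected in run_case(argv, stdin_text)

# Stand-in command that echoes stdin back. With a built
# interpreter, the call would presumably look like:
#   check(["./meow5"], "42 13 ps\n", "42 13")
echo_cmd = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read())"]
print(check(echo_cmd, "42 13 ps\n", "42 13"))
```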

    But I remember 'expect' from back when I was heavy into
    Tcl. I think I'll give that a shot to interactively
    drive the interpreter and test it.

    Expect is so cool. Here's my whole test script so far:

        #!/usr/bin/expect

        spawn ./meow5

        # Print a string
        send -- "\"Meow\\n\" print\r"
        expect "Meow"

        # Construct meow and test it
        send -- ": meow \"Meow. \" print ;\r"
        send -- "meow\r"
        expect "Meow. "

        # Construct meow5 and test it
        send -- ": meow5 meow meow
                meow meow meow \"\\n\" print ;\r"
        send -- "meow5\r"
        expect "Meow. Meow. Meow. Meow. Meow."

        # Exit (send CTRL+D EOF)
        send -- "\x04"
        expect eof

    The long meow5 definition line has been broken onto the
    next line for this log.

    Here it is running!

$ ./test.exp
spawn ./meow5
"Meow\n" print
Meow
: meow "Meow. " print ;
meow
Meow. : meow5 meow meow meow meow meow "\n" print ;
meow5
Meow. Meow. Meow. Meow. Meow. 
Goodbye.

    I'll add a new alias for it now. (Defined by my "meow"
    function in .bashrc):

        alias mt="./build.sh ; ./test.exp"

    Sweet! That wraps up this log and the goals I had for
    it. I'll add more to the test script as I go. This was
    just to get it started.

    
        [x] Get input string from 'read' syscall
        [x] Finish 'ps' (non-destructive stack print)
        [x] New word: 'all' to list all current word names
        [x] Remove word 'newline' (replace with `\n`)
        [x] Add some testing (expect!)
    
    I think I might make some math words next so I can use
    the language to do basic stuff like add and subtract!