Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pbio/os: Prototype simpler and unified async OS. #298

Open
wants to merge 19 commits into
base: master
Choose a base branch
from
Open

Conversation

laurensvalk
Copy link
Member

Overall, the goals are to reduce complexity, reduce code size, and make it more cross platform.

In this PR, the new approach still drives the original pbio event loop as well, so we can convert one process at a time without breaking things. The charger process is done here as a simple test to illustrate what it looks like. Once all processes have been converted and everything is working, we could drop the Contiki dependency.

Not everything we need is implemented yet, but it is already easy to see that pbio/os is much easier to get into than the full contiki library. Mainly because we only implement what we need, but also due to the simplifications introduced below.

  • Returning errors from protothreads is currently cumbersome. This modification allows protothreads to return error codes. PBIO_ERROR_AGAIN essentially takes the role of PT_YIELDED / PT_WAITING. Everything else indicates completion with the respective error or success.

  • Contiki partially distinguishes (but not really) between waiting and yielding. In practice, both yield and the latter yields at least once. And then both macros exist separately for processes and protothreads so you end up with four variants that are almost the same but subtly different. I think we can do everything with a single set of AWAIT macros.

  • The contiki event timers are more complicated than we need for our use case. Currently, a timer tick first polls the timer process and then posts N events depending on the number of active timers. This is not be safe if we have more timer events than fit in the event queue. Also, etimers use the global PROCESS_CURRENT which requires hacks such as PROCESS_CONTEXT_BEGIN when accessing them from elsewhere.

  • Instead of using various events and polling individual processes throughout, I think we may be able to work with just a single pbio_os_request_poll() function. Then all processes will check their current wait condition. This should not be that much more computationally expensive than checking the needs_poll flag on each process (and it may be less due to overall simplicity).

  • We currently use broadcasting between processes in a few places, which can be hard to follow and risks overflowing the event queue as already stated in a few existing REVISITs around the code. We have recently eliminated this for the uart process. We should be able to do this for the status change too.

  • We can move some of the logic for handling pending events and entering sleep mode to pbio instead of having it in mphalport. Aside from hooks such as enable_irq it becomes platform agnostic. This also brings the virtual hub a bit closer to the embedded hubs. It also unifies the EV3 and NXT with Powered Up.

@laurensvalk
Copy link
Member Author

laurensvalk commented Mar 13, 2025

Example usage:

pbio_error_t pbio_os_wait_example(pbio_os_state_t *state, uint32_t duration) {

    static pbio_os_timer_t timer;

    ASYNC_BEGIN(state);

    printf("Going to wait for %d ms\n", duration);

    // We can return errors, parse args etc.
    if (duration > 2000) {
        return PBIO_ERROR_INVALID_ARG;
    }

    // This is a nice convenience wrapper I always missed.
    AWAIT_MS(state, &timer, duration);

    ASYNC_END(PBIO_SUCCESS);
}

pbio_error_t pbio_os_sub_example(pbio_os_state_t *state) {

    static pbio_os_state_t child;

    static uint32_t counter;

    pbio_error_t err;

    ASYNC_BEGIN(state);

    for (counter = 0; counter < 100; counter++) {

        AWAIT(state, &child, err = pbio_os_wait_example(&child, 100 * counter));

        if (err != PBIO_SUCCESS) {
            printf("Something bad happened!\n");
            return err;
        }
    }

    // Return arbitrary error code. Macro is needed to close up the } bracket just like in Contiki.
    ASYNC_END(PBIO_SUCCESS);
}

// Outermost "process thread" looks the same but must only have the state argument
pbio_error_t pbio_os_example_thread(pbio_os_state_t *state) {

    static pbio_os_state_t child;

    pbio_error_t err;

    ASYNC_BEGIN(state);

    // Await and get/use return value of awaited thread.
    AWAIT(state, &child, err = pbio_os_sub_example(&child));
    if (err != PBIO_SUCCESS) {
        return err;
    }

    ASYNC_END(PBIO_SUCCESS);
}

@laurensvalk
Copy link
Member Author

laurensvalk commented Mar 13, 2025

Similarly, the following:

PT_THREAD(pbdrv_uart_write(struct pt *pt, pbdrv_uart_dev_t *uart, uint8_t *msg, uint8_t length, uint32_t timeout, pbio_error_t *err)) {

    PT_BEGIN(pt);

    if (!msg || !length) {
        *err = PBIO_ERROR_INVALID_ARG;
        PT_EXIT(pt);
    }

    if (uart->tx_buf) {
        *err = PBIO_ERROR_BUSY;
        PT_EXIT(pt);
    }

    uart->tx_buf = msg;
    uart->tx_buf_size = length;
    uart->tx_buf_index = 0;

    if (timeout) {
        etimer_set(&uart->tx_timer, timeout);
    }

    uart->USART->CR1 |= USART_CR1_TXEIE;

    // Await completion or timeout.
    PT_WAIT_UNTIL(pt, uart->tx_buf_index == uart->tx_buf_size || (timeout && etimer_expired(&uart->tx_timer)));

    if (timeout && etimer_expired(&uart->tx_timer)) {
        uart->USART->CR1 &= ~(USART_CR1_TXEIE | USART_CR1_TCIE);
        *err = PBIO_ERROR_TIMEDOUT;
    } else {
        etimer_stop(&uart->tx_timer);
        *err = PBIO_SUCCESS;
    }
    uart->tx_buf = NULL;

    PT_END(pt);
}

Becomes:

pbio_error_t pbdrv_uart_write(pbio_os_state_t *state, pbdrv_uart_dev_t *uart, uint8_t *msg, uint8_t length, uint32_t timeout) {

    ASYNC_BEGIN(state);

    if (!msg || !length) {
        return PBIO_ERROR_INVALID_ARG;
    }

    if (uart->tx_buf) {
        return PBIO_ERROR_BUSY;
    }

    uart->tx_buf = msg;
    uart->tx_buf_size = length;
    uart->tx_buf_index = 0;

    if (timeout) {
        pbio_os_timer_set(&uart->tx_timer, timeout);
    }

    uart->USART->CR1 |= USART_CR1_TXEIE;

    // Await completion or timeout.
    AWAIT_UNTIL(state, uart->tx_buf_index == uart->tx_buf_size || (timeout && pbio_os_timer_is_expired(&uart->tx_timer)));

    uart->tx_buf = NULL;

    if (timeout && pbio_os_timer_is_expired(&uart->tx_timer)) {
        uart->USART->CR1 &= ~(USART_CR1_TXEIE | USART_CR1_TCIE);
        return PBIO_ERROR_TIMEDOUT;
    }

    ASYNC_END(PBIO_SUCCESS);
}

Error handling at the start and towards the end looks quite a bit more elegant. Awaiting on the condition remains basically the same.

@coveralls
Copy link

coveralls commented Mar 13, 2025

Coverage Status

coverage: 56.653% (+0.2%) from 56.477%
when pulling 0663e76 on os
into 962e44b on master.

@laurensvalk
Copy link
Member Author

laurensvalk commented Mar 17, 2025

This also lets us eliminate wrapper threads such as these:

PROCESS_THREAD(pwm_lp50xx_stm32, ev, data) {
    PROCESS_BEGIN();

    // need to allow all drivers to init first
    PROCESS_PAUSE();

    for (;;) {
        for (int i = 0; i < PBDRV_CONFIG_PWM_LP50XX_STM32_NUM_DEV; i++) {
            pbdrv_pwm_lp50xx_stm32_priv_t *priv = &dev_priv[i];
            pbdrv_pwm_lp50xx_stm32_handle_event(priv, ev);
        }
        PROCESS_WAIT_EVENT();
    }

    PROCESS_END();
}

Instead, we can start N processes with the same thread function. In this case that is pbdrv_pwm_lp50xx_stm32_handle_event almost unmodified. (Contiki macros don't not allow this out of the box, which was a limitation I ran into when revising the ports module earlier.)

I've added this example to this PR. It looks like a slightly bigger change, but that is only because I moved init() below the process so we don't have to declare it first like we used to do with Contiki.

@laurensvalk laurensvalk force-pushed the os branch 2 times, most recently from c7394fa to 05946cf Compare March 19, 2025 08:32
@laurensvalk
Copy link
Member Author

Any objections or thoughts @dlech ?

If not, we could merge and gradually update each process without breaking things.

The goal is to:
- Add error return value to all protothreads.
- Reduce complexity and code size.
- Avoid risk of overruning event queue.
- Use single poll flag and no broadcasting between threads.
- Cross-platform handling of events during sleep.
- Make stm32, ev3, nxt all work the same way.

For now the new OS also drives the contiki event loop so we can migrate processes one by one instead of breaking everything at once.
Idling with WFI was previously implemented in the MicroPython HAL. Now that we moved it to pbio/os in a platform agnostic way, we can fix this longstanding open REVISIT note.
This is a simple example to show that we can
convert the processes one at a time.
When using the same thread for multiple processes it can be useful to have a reference to the state that belongs to that instance.

This avoids having to use several cases of CONTAINER_OF to get to the parent of a state, which is not always available.

This also brings it one step closer to pbio tasks, which we might model in this way too.
Similar to PROCESS_PAUSE() but without posting an event to itself like Contiki does.

It is basically the same as awaiting 0ms, but without the need to allocate a timer.

Also restore this for the charger, which used PROCESS_PAUSE() here initially.
Otherwise we don't drive the new processes.
Otherwise we don't drive the new processes.

Fixes the light not blinking at the end because
the newly updated LED process was not driven during
pbsys deinit.
@dlech
Copy link
Member

dlech commented Mar 19, 2025

I haven't had time yet to have a close look at this yet. If you can wait a bit longer, I can try to have a look this weekend.

@laurensvalk
Copy link
Member Author

Sure, no rush -- that's what PRs are for 🙂 Thanks!

@BertLindeman
Copy link
Contributor

BertLindeman commented Mar 20, 2025

I assume this is the correct place to report:
'disconnect from pc does not always switch the hub into "I want to pair" mode'

Running ci-build-3798-v3.6.1-134-g8e25902f on 2025-03-19 on a TechnicHub with nothing on the ports.

All blinking I talk about here is blue.

To reproduce:

  • press the hub button to activate the hub

  • in the UI press connect and select the hu

  • then repeat

    • press in the UI the disconnect button
    • wait for the hub to blink-blink-pause
      this occurs most of the times.
      But it does not always occur:
  1. Sometimes the hub light goes off for a short time and then steady blue for a bit more than a second and then blink-blink-pause
  2. Sometimes the hub stays in constant blue.

Situation 1. occurs more often: 1 in ~4 to 10 attempts
Situation 2. I only saw 3 times since yesterday.

Testing with bluetooth, so I often disconnect for BT to be stable.

I did not succeed in a more stable scenario.

No need to give this a priority, I just wanted to report it.
Hardly any influence on my testing.

Bert

[EDIT]
Similar situation:
hub on / connect to pc / do nothing for 5 to 10 minutes
disconnect via UI
Hub goes shortly to blink-blink-pause ( two times I think) and powers off.

[EDIT 2]
After disconnect and the hub is steady blue: a press on the button makes the saved program run (as expected)
and another press stops the program and I would expect blink-blink-pause but the hub stays steady blue.
So the only way to get it connected is long button press to power-off and press to start.

Copy link
Member

@dlech dlech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This stuff is tricky to get right. We will probably introduce lots of bug/regressions by doing this. But in the end, it will be simpler/nicer. So probably worth it in the long run.

@laurensvalk
Copy link
Member Author

This stuff is tricky to get right. We will probably introduce lots of bug/regressions by doing this. But in the end, it will be simpler/nicer. So probably worth it in the long run.

Thanks for the overal and specific feedback!

@laurensvalk
Copy link
Member Author

I guess this means that statement can never return anything besides PBIO_ERROR_AGAIN or PBIO_SUCCESS? Other error will be ignored, which isn't ideal. Could fix this by getting rid of do/while and introduce a local err variable so that this can be an expression statement. Then we could write err = AWAIT(...); and actually handle errors.

The intention is to use it like this:

pbio_error_t example_thread_func(pbio_os_state_t *state) {

    pbio_error_t err;

    ASYNC_BEGIN(state);

    ...

    AWAIT(state, &child, err = pbdrv_uart_write(&child, device, data, len, timeout));
    if (err != PBIO_SUCCESS) {
        return err;
    }

    ...

    ASYNC_END(PBIO_SUCCESS);
}

The goal here is indeed precisely to handle errors. PBIO_ERROR_AGAIN is "reserved" for async functions, but everything else can be returned. Also, the final ASYNC_END(PBIO_SUCCESS) can return other errors.

The AWAIT_RACE function can be used in the same way (err1 = ...., err2 = ...) so you can find out which completed/errored.

Then we could write err = AWAIT(...);

The example above mostly achieves this, but if there is a clever way to make the macros work differently, we could look at it. Right now they work just like Contiki.

@laurensvalk
Copy link
Member Author

laurensvalk commented Mar 25, 2025

Hmm... polling every processes every 1ms seems like it would make it harder to catch bugs where we aren't properly polling in response to other events.

Yes, this is one of the main design considerations we'll want to think about thoroughly before going ahead.

It simplifies processes a lot when everything is working, but debugging could be harder. Maybe we could have the auto-polling be optional for each thread. Then certain processes can independently (self-)poll as we used to do.

It also means we have to be really careful about using timers in cases where we could be waiting on both a timer and another event a the same time.

This is one of the things that caught me out in the Contiki way of doing things -- if it got to a yield-once-come-back-later check, it could "miss" the event if it had just got it before getting there. In the new proposal, this should be simpler.

What cases are you thinking of that could now be trickier?

@laurensvalk
Copy link
Member Author

Resetting a process sounds dangerous without some kind of cleanup/teardown of the current state. Might be worth mentioning that in the description.

Yeah, this is probably only to be used for certain special cases. It is something I'd like to use for the ports, which has its own handling of safely resetting.

Basically, it "detaches" the LEGO DCM+LUMP process when the port goes into a different mode. It could attach that "none_process" dummy process instead if that mode needs no process. This allows everything to be synchronously handled rather than having process_exit needing several cycles to await other things to end.

@dlech
Copy link
Member

dlech commented Mar 25, 2025

What cases are you thinking of that could now be trickier?

It's been to long, so I don't remember clearly. But I have a vague recollection that a few times during development I would forget to to put in the ev == PROCESS_EVENT_TIMER and only had the etimer_expired(&timer) and things wouldn't work quite right until I remembered to also check ev. But I never really dug into it to see what was actually happening that caused it not to work correctly.

@laurensvalk
Copy link
Member Author

That is one complexities I was hoping to address here. A contiki timer etimer is expired when it is not associated with a process: et->p == PROCESS_NONE. There are multiple ways that this can happen, such as when it hasn't been set, or has been unset elsewhere. Particularly with the etimer objects in our uart drivers. They would only work correctly if set from that very process, as et->p would be set the global CURRENT_PROCESS().

The new timer is just the clock time with a duration.

@laurensvalk
Copy link
Member Author

laurensvalk commented Mar 28, 2025

Instead of PBIO_ERROR_AGAIN, maybe we can use (introduce) PBIO_ERROR_IN_PROGRESS uniquely for this so we don't confuse it with PBIO_ERROR_AGAIN in existing code. That is if we want to model after Linux errors. We can also just have PBIO_YIELDED and not treat it as an error.

@dlech
Copy link
Member

dlech commented Mar 28, 2025

Why not just return a bool if there are only 2 possible return values?

@laurensvalk
Copy link
Member Author

Not sure I follow. The point is to have the ability to return error values from protothreads (all our existing error values). We just need one more value to indicate that there is no error to report yet. I used EAGAIN in this PR, but EINPROGRESS is the closer to the correct meaning, and ensures no existing code that already uses EGAIN gets confused (even if they should never be mixed).

This is the more idiomatic way. The notation if (do_yield_now) {;} stems from Contiki, which can be a bit mysterious in the context of these macros. Its sole purpose to suppress unused variable warnings is not obvious.
Replace specific note about differences with generic one as this is not the full picture.
This makes the invocations a bit longer to fit, but it is probably clearer in the long run.

While we are at it, cut one level of abstraction and specify macros in terms of the switch statement directly.

Also, instead of LC (contiki), use the terminology "checkpoint", which is untuitive and also technically quite close to reality
Can be called from interrupt handlers.
@dlech
Copy link
Member

dlech commented Mar 28, 2025

Everything I saw only returned PBIO_ERROR_AGAIN or PBIO_SUCCESS and you didn't want to pass through the error on AWAIT, so I assumed that we weren't passing any other errors either.

Adding a PBIO_YIELD makes sense though as this is not actually an error condition that should be propagated. Maybe even PBIO_YIELD = -1 to have an easy way to differentiate it from errors - assuming that doesn't break the -fshort-enums optimization.

@laurensvalk
Copy link
Member Author

Errors are passed on everywhere, including in the AWAIT through an assignment expression (unless there is a way to make the macros work another way). The goal is to clean up functions such as:

PT_THREAD(pbdrv_uart_write(struct pt *pt, pbdrv_uart_dev_t *uart, uint8_t *msg, uint8_t length, uint32_t timeout, pbio_error_t *err)) {

    PT_BEGIN(pt);

    if (!msg || !length) {
        *err = PBIO_ERROR_INVALID_ARG;
        PT_EXIT(pt);
    }

    if (uart->tx_buf) {
        *err = PBIO_ERROR_BUSY;
        PT_EXIT(pt);
    }

   ...

to:

pbio_error_t pbdrv_uart_write(pbio_os_state_t *state, pbdrv_uart_dev_t *uart, uint8_t *msg, uint8_t length, uint32_t timeout) {

    PBIO_OS_ASYNC_BEGIN(state);

    if (!msg || !length) {
        return PBIO_ERROR_INVALID_ARG;
    }

    if (uart->tx_buf) {
        return PBIO_ERROR_BUSY;
    }

   ...

The PBIO_OS_ASYNC_END(err) can also return an error, which works well with states that should not nominally be reached as we do in many normal functions. (It just needs to be a macro to squeeze in the extra }, just like Contiki.

@dlech
Copy link
Member

dlech commented Mar 28, 2025

(unless there is a way to make the macros work another way

That is what I was trying to suggest in #298 (comment) 😉

If we drop the do/while, the {} can be used as an expression statement.

#define AWAIT(state, child, expression)  { \
    ASYNC_INIT((child));                   \
    PB_LC_SET(state);                      \
    pbio_error_t _err = expression;        \
    if (_err== PBIO_ERROR_AGAIN) {         \
        return PBIO_ERROR_AGAIN;           \
    }                                      \
    _err;                                  \
}
pbio_error_t err;
...
err = AWAIT(some_protothread())
if (err)
    return err;
...

I guess we would have to introduce a struct with two pbio_error_t values if we wanted to do the same with AWAIT_RACE.

* @param [in] child Protothread state of the child.
* @param [in] statement The statement to await.
*/
#define AWAIT(state, child, statement) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, this need to be an expression, not a statement. And the doc comment needs to say that the expression needs to evaluate to a pbio_error_t. Same applies to the AWAIT_RACE as well.

@laurensvalk
Copy link
Member Author

laurensvalk commented Mar 28, 2025

Very interesting! The expression statement is a nice way of defining a local pbio_error_t that I didn't think of.

Generalizes handling to work better for the virtual hub, which doesn't use uint32_t for the interrupt state.
This was initially removed in #298, but we are restoring it to be sure that it isn't a breaking change.
@laurensvalk
Copy link
Member Author

It would be nice to avoid #includes in mpconfig* if we can.

Although if the header is very specifically crafted for such purpose (with comments that say so in the header itself), then it would be OK. Otherwise, it's easy to break things with future changes that aren't aware that mpconfig* aren't typical header files. However, we would want something just with the hook functions as os.h is way more than we should be including here.

We can fix:

  • the above note
  • having them static inline as they used to be before this PR
  • having it in /lib/pbio
  • 9a6cc19 polluting the existing pbioconfig.h.

all at once by having these hooks in a dedicated pbio_os_config.h, not unlike contiki-conf.h. Anything that shouldn't be a static inline can go in platform.c for the respective platform.

Proceed to make pbio more independent from MicroPython.

Also restore them as static inline as they used to be
prior to #298
which reduces build size.
@laurensvalk
Copy link
Member Author

(unless there is a way to make the macros work another way

That is what I was trying to suggest in #298 (comment) 😉

If we drop the do/while, the {} can be used as an expression statement.

#define AWAIT(state, child, expression)  { \
    ASYNC_INIT((child));                   \
    PB_LC_SET(state);                      \
    pbio_error_t _err = expression;        \
    if (_err== PBIO_ERROR_AGAIN) {         \
        return PBIO_ERROR_AGAIN;           \
    }                                      \
    _err;                                  \
}
pbio_error_t err;
...
err = AWAIT(some_protothread())
if (err)
    return err;
...

I guess we would have to introduce a struct with two pbio_error_t values if we wanted to do the same with AWAIT_RACE.

That looks really nice, but it appears we cannot jump back into expression statements:

error: switch jumps into statement expression

@laurensvalk
Copy link
Member Author

Most points here have been addressed. Thanks for the feedback.

I'll move some of the mp_hal cleanups like the stack info to a new, separate pull request since it isn't really related to this one.

Hmm... polling every processes every 1ms seems like it would make it harder to catch bugs where we aren't properly polling in response to other events.

Yes, this is one of the main design considerations we'll want to think about thoroughly before going ahead.

It simplifies processes a lot when everything is working, but debugging could be harder. Maybe we could have the auto-polling be optional for each thread. Then certain processes can independently (self-)poll as we used to do.

It also means we have to be really careful about using timers in cases where we could be waiting on both a timer and another event a the same time.

This is one of the things that caught me out in the Contiki way of doing things -- if it got to a yield-once-come-back-later check, it could "miss" the event if it had just got it before getting there. In the new proposal, this should be simpler.

What cases are you thinking of that could now be trickier?

This is the slightly more fundamental one that is remaining. I'll think about it some more before we merge this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants