[BUG] the semantics of enter_critical_section is unclear #14593

yamt · 2024-11-01T09:23:45Z

Description / Steps to reproduce the issue

https://github.com/apache/nuttx/blob/master/Documentation/implementation/critical_sections.rst

Without going into the details of the implementation of enter_critical_section() suffice it to say that it (1) disables interrupts on the local CPU and (2) uses spinlocks to assure exclusive access to a code sequence across all CPUs.

as users of enter_critical_section don't seem to have a way to limit the scope,
i guess it implies enter_critical_section works as the single global lock covering the whole system.
is it right?

also, to me it isn't clear if enter_critical_section is supposed to disable interrupts on the remote CPUs.
the comment in irq_csection.c says

Take the CPU IRQ lock and disable interrupts on all CPUs.

on the other hand, other comment in the file says

NOTE 1: Ideally this should disable interrupts on all CPUs, but most
architectures only support disabling interrupts on the local CPU.`

they seem contradicting each other.

On which OS does this issue occur?

[OS: Mac]

What is the version of your OS?

macOS 14.7

NuttX Version

master

Issue Architecture

[Arch: all]

Issue Area

[Area: Api], [Area: Kernel]

Verification

I have verified before submitting the report.

The text was updated successfully, but these errors were encountered:

acassis · 2024-11-01T13:20:30Z

@anchao could you please help to shed some light here?

anchao · 2024-11-04T12:05:38Z

In the current implementation, enter_critical_section() is a global lock used to protect critical sections between multiple CPUs under SMP case. In performance evaluations, the frequent calls to enter_critical_section() under SMP have posed significant challenges for the system real-time capability. @hujun260 's work focuses on optimizing SMP performance on NuttX, and we'll see that many of his contributions are related to critical section protection and interrupt refactor.

Here are a few PRs under review, @acassis @yamt
#14626
#14623

hujun260 · 2024-11-04T12:39:55Z

i guess it implies enter_critical_section works as the single global lock covering the whole system.
is it right?

It can be understood like this.

However, due to the complex of the enter_critical_section function, it can encompass any area,
especially scenarios where context switching occurs within the protected area, which may lead to the failure of protection.
This is a vulnerability of the critical section.

The current recommendation is that, if it is not related to protecting the tcb/group
good programming practices can avoid to use critical sections.

also, to me it isn't clear if enter_critical_section is supposed to disable interrupts on the remote CPUs. the comment in irq_csection.c says

Take the CPU IRQ lock and disable interrupts on all CPUs.

on the other hand, other comment in the file says

NOTE 1: Ideally this should disable interrupts on all CPUs, but most
architectures only support disabling interrupts on the local CPU.`

they seem contradicting each other.

enter_critical_section only disable local CPU interrupts and cannot disable remote CPU interrupts.

yamt · 2024-11-05T09:20:22Z

i guess it implies enter_critical_section works as the single global lock covering the whole system.
is it right?

It can be understood like this.

However, due to the complex of the enter_critical_section function, it can encompass any area, especially scenarios where context switching occurs within the protected area, which may lead to the failure of protection. This is a vulnerability of the critical section.

The current recommendation is that, if it is not related to protecting the tcb/group good programming practices can avoid to use critical sections.

what do you recommend to use instead?
mutex? semaphore? spin lock?

how do you think about introducing a bit higher-level primitives for the kernel? eg. solaris-like adaptive mutex

also, to me it isn't clear if enter_critical_section is supposed to disable interrupts on the remote CPUs. the comment in irq_csection.c says

Take the CPU IRQ lock and disable interrupts on all CPUs.

on the other hand, other comment in the file says

NOTE 1: Ideally this should disable interrupts on all CPUs, but most
architectures only support disabling interrupts on the local CPU.`

they seem contradicting each other.

enter_critical_section only disable local CPU interrupts and cannot disable remote CPU interrupts.

so, the current semantics of enter_critical_section is:

it disables interrupts on the calling CPU.
it does NOT disable interrupts on remote CPUs.
it acquires the system-global, recursive lock, which is automatically released/re-aquired when the current thread blocks/wakes up. (similarly to the big kernel lock in the traditional unix.)

is it right?

hujun260 · 2024-11-05T12:58:57Z

what do you recommend to use instead?
mutex? semaphore? spin lock?

Based on the program logic, deciding which synchronization strategy to use,
I believe that if the protected region executes quickly without context switching,
spinlock is the preferred choice and recursive locks should not be used.

how do you think about introducing a bit higher-level primitives for the kernel? eg. solaris-like adaptive mutex

I don't think adaptive mutex is a good idea.

Using general locks strategy can easily lead to abuse and a decline in code quality.
We need more precise lock behavior, especially in RTOS, where more precise time control
and good programming practices are required.

Just like critical sections in NuttX, which are used everywhere
and significantly degrade performance in smp complex scenarios.

so, the current semantics of enter_critical_section is:

it disables interrupts on the calling CPU.

it does NOT disable interrupts on remote CPUs.

it acquires the system-global, recursive lock, which is automatically released/re-aquired when the current thread blocks/wakes up. (similarly to the big kernel lock in the traditional unix.)

is it right?

yes
Additionally, the old enter_critical_section impl was more complex than the current one,
as it also handling sync pause strategies. We have already made some simplifications.
Looking towards the future, we need to remove the recursive strategy and auto released/re-aquired strategy within the critical section, turning it into a simple spinlock. But this will take a relatively long time,

pussuw · 2024-11-05T14:01:44Z

so, the current semantics of enter_critical_section is:
* it disables interrupts on the calling CPU.

Yes

* it does NOT disable interrupts on remote CPUs.

Yes

* it acquires the system-global, recursive lock, which is automatically released/re-aquired when the current thread blocks/wakes up. (similarly to the big kernel lock in the traditional unix.)

Yes

is it right?

The release/re-acquire logic is a bit hard to follow at least for me. I'm stuck with my SMP integration to MPFS + BUILD_KERNEL because I'm getting random crashes / DEBUGASSERT()s from irq_csection() (as well as deadlocks too). There is something odd going on with this logic.

EDIT: It is somehow tied to the fact that we have two "types" of context switches; From ISR and from thread. There is some kind of a hole in the recursive irqlock handling related to this.

The issue might be the fact that we have a task specific tcb.irqcount (which is IMO the correct way to do the recursion, as critical sections are task specific states) and a global "per-cpu" irqcount g_cpu_nestcount which is supposed to supersede the task specific recursion counter "somehow".

hujun260 · 2024-11-06T02:29:49Z

is it right?

yes

The issue might be the fact that we have a task specific tcb.irqcount (which is IMO the correct way to do the recursion, as critical sections are task specific states) and a global "per-cpu" irqcount g_cpu_nestcount which is supposed to supersede the task specific recursion counter "somehow".

when an interrupt occurs, g_cpu_nestcount is used as the recursive counter because context switching may happen and this_task() will change, and tcp->irqcount cannot be used.

yamt · 2024-11-06T06:06:13Z

what do you recommend to use instead?
mutex? semaphore? spin lock?

Based on the program logic, deciding which synchronization strategy to use, I believe that if the protected region executes quickly without context switching, spinlock is the preferred choice and recursive locks should not be used.

i feel a bare spinlock is too primitive for many cases.

how do you think about introducing a bit higher-level primitives for the kernel? eg. solaris-like adaptive mutex

I don't think adaptive mutex is a good idea.

Using general locks strategy can easily lead to abuse and a decline in code quality. We need more precise lock behavior, especially in RTOS, where more precise time control and good programming practices are required.

although it doesn't need to be adaptive mutex (it was just an example)
it would be nice to have a small set of api, which can cover 90% of use cases.

Just like critical sections in NuttX, which are used everywhere and significantly degrade performance in smp complex scenarios.

well, a typical implementation of adaptive mutex would be far better than a system global critical section
even for a RTOS.

pussuw · 2024-11-06T08:24:55Z

when an interrupt occurs, g_cpu_nestcount is used as the recursive counter because context switching may happen and this_task() will change, and tcp->irqcount cannot be used.

Yes this much I was able to figure out when I tried to remove g_cpu_nestcount.

There is still a hole somewhere in the acquire / release logic as I can see random debugasserts and deadlocks which happen when:

CPUx tries to take the big kernel lock from ISR
CPUy tries to take the big kernel lock from thread

And very likely a context switch is what triggers the issue. This happens more often under high load scenarios, like system bootup. In our case the system boot uses a lot of CPU, it also spawns a lot of processes that start, run and exit almost immediately.

I'll keep investigating.

yamt added the Type: Bug Something isn't working label Nov 1, 2024

github-actions bot added Arch: all Issues that apply to all architectures Area: Api API Issues Area: Kernel Kernel issues OS: Mac Issues related to MacOS (building system, etc) labels Nov 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] the semantics of enter_critical_section is unclear #14593

[BUG] the semantics of enter_critical_section is unclear #14593

yamt commented Nov 1, 2024

acassis commented Nov 1, 2024

anchao commented Nov 4, 2024

hujun260 commented Nov 4, 2024 •

edited

Loading

yamt commented Nov 5, 2024 •

edited

Loading

hujun260 commented Nov 5, 2024 •

edited

Loading

pussuw commented Nov 5, 2024 •

edited

Loading

hujun260 commented Nov 6, 2024

yamt commented Nov 6, 2024

pussuw commented Nov 6, 2024

[BUG] the semantics of enter_critical_section is unclear #14593

[BUG] the semantics of enter_critical_section is unclear #14593

Comments

yamt commented Nov 1, 2024

Description / Steps to reproduce the issue

On which OS does this issue occur?

What is the version of your OS?

NuttX Version

Issue Architecture

Issue Area

Verification

acassis commented Nov 1, 2024

anchao commented Nov 4, 2024

hujun260 commented Nov 4, 2024 • edited Loading

yamt commented Nov 5, 2024 • edited Loading

hujun260 commented Nov 5, 2024 • edited Loading

pussuw commented Nov 5, 2024 • edited Loading

hujun260 commented Nov 6, 2024

yamt commented Nov 6, 2024

pussuw commented Nov 6, 2024

hujun260 commented Nov 4, 2024 •

edited

Loading

yamt commented Nov 5, 2024 •

edited

Loading

hujun260 commented Nov 5, 2024 •

edited

Loading

pussuw commented Nov 5, 2024 •

edited

Loading