Make sure we never context switch while holding VM lock. #735

luke-gruber · 2025-09-10T19:04:00Z

We were seeing errors in our application that looked like:

[BUG] unexpected situation - recordd:1 current:0
/error.c:1097 rb_bug_without_die_internal
/vm_sync.c:275 disallow_reentry
/eval_intern.h:136 rb_ec_vm_lock_rec_check
/eval_intern.h:147 rb_ec_tag_state
/vm.c:2619 rb_vm_exec
/vm.c:1702 rb_yield
/eval.c:1173 rb_ensure

We concluded that a thread was context switching while it held the VM lock. While investigating, we added assertions that a thread never yields to another thread with the VM lock held, and we enabled these VM lock assertions even in single-ractor mode. The assertions failed in a few places, most notably in finalizers: we were running finalizers with the VM lock held, and they were context switching, which caused this issue.
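To make the finalizer problem concrete, here is a minimal single-threaded sketch of one common way to avoid running callbacks under a lock: record them while the lock is held and run them only after it is released. This is an illustration under stated assumptions, not necessarily how this PR changes Ruby. Every name in it (`toy_vm_t`, `toy_defer_finalizer`, `toy_run_deferred_finalizers`, `toy_count_finalizer`) is hypothetical, and the `lock_held` flag stands in for a real lock.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PENDING 8

typedef void (*toy_finalizer_fn)(void *data);

/* Hypothetical stand-in for VM state; not Ruby's real rb_vm_t. */
typedef struct {
    toy_finalizer_fn fn[TOY_MAX_PENDING];
    void *data[TOY_MAX_PENDING];
    size_t count;   /* number of finalizers waiting to run */
    int lock_held;  /* nonzero while the toy "VM lock" is held */
} toy_vm_t;

/* Called with the lock held: only records the finalizer, never runs it.
 * Returns 0 on success, -1 if the pending queue is full. */
static int toy_defer_finalizer(toy_vm_t *vm, toy_finalizer_fn fn, void *data)
{
    assert(vm->lock_held);
    if (vm->count == TOY_MAX_PENDING) return -1;
    vm->fn[vm->count] = fn;
    vm->data[vm->count] = data;
    vm->count++;
    return 0;
}

/* Runs the pending finalizers in the order they were recorded.
 * Refuses to run anything (returns 0) while the lock is still held,
 * because a finalizer may run arbitrary code and context switch. */
static size_t toy_run_deferred_finalizers(toy_vm_t *vm)
{
    size_t i, n;
    if (vm->lock_held) return 0;
    n = vm->count;
    vm->count = 0;
    for (i = 0; i < n; i++) {
        vm->fn[i](vm->data[i]);
    }
    return n;
}

/* Example finalizer: increments the int that data points to. */
static void toy_count_finalizer(void *data)
{
    (*(int *)data)++;
}
```

In this sketch a caller would set `lock_held`, call `toy_defer_finalizer`, clear `lock_held`, and only then call `toy_run_deferred_finalizers`.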

These rules must hold going forward to ensure we don't context switch unexpectedly:

If you hold the VM lock:
* Don't enter the interpreter loop.
* Don't yield to Ruby code.
* Don't call rb_nogvl (it will context switch you and will not unlock the VM lock).
* Don't check your own interrupts; doing so can context switch you.

If you don't have the GVL:
* Don't call rb_ensure/rb_protect, etc. (these are old rules, but it's good to have assertions for them).
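The rules above can be modeled as two predicates on per-thread state. The following is a toy sketch, not Ruby's actual implementation: `toy_thread_t` and the `toy_*` functions are hypothetical names, and the recursion counter is only assumed to resemble the `lock_rec` bookkeeping that the commits in this PR mention.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-thread state; not Ruby's real execution context. */
typedef struct {
    unsigned int vm_lock_rec; /* how many times this thread has taken the VM lock */
    bool has_gvl;             /* whether this thread currently holds the GVL */
} toy_thread_t;

/* The toy VM lock is recursive: each enter must be paired with a leave. */
static void toy_vm_lock_enter(toy_thread_t *th)
{
    th->vm_lock_rec++;
}

static void toy_vm_lock_leave(toy_thread_t *th)
{
    assert(th->vm_lock_rec > 0);
    th->vm_lock_rec--;
}

/* Rules for a VM lock holder: anything that may context switch
 * (entering the interpreter loop, yielding to Ruby code, an
 * rb_nogvl-style call) is allowed only at recursion depth zero. */
static bool toy_may_context_switch(const toy_thread_t *th)
{
    return th->vm_lock_rec == 0;
}

/* Rule for a thread without the GVL: rb_ensure/rb_protect-style
 * calls are allowed only while the GVL is held. */
static bool toy_may_use_protect(const toy_thread_t *th)
{
    return th->has_gvl;
}
```

A real assertion would presumably abort instead of returning a boolean; returning a boolean keeps the sketch easy to exercise.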

luke-gruber · 2025-09-11T19:08:44Z

There's one bug in bigdecimal right now: it crashes under GC.add_stress_to_class(BigDecimal). This is an old issue that was fixed by Matt. I'll try to investigate more later. For now, I'm going to see how this does in the experimental cluster, to check whether it stops the errors and doesn't crash.
cc @jhawthorn

We were seeing errors in our application that looked like:

```
[BUG] unexpected situation - recordd:1 current:0
/error.c:1097 rb_bug_without_die_internal
/vm_sync.c:275 disallow_reentry
/eval_intern.h:136 rb_ec_vm_lock_rec_check
/eval_intern.h:147 rb_ec_tag_state
/vm.c:2619 rb_vm_exec
/vm.c:1702 rb_yield
/eval.c:1173 rb_ensure
```

We concluded that there was context switching going on while a thread held the VM lock. During the investigation into the issue, we added assertions that we never yield to another thread with the VM lock held. We enabled these VM lock assertions even in single ractor mode. These assertions were failing in a few places, but most notably in finalizers. We were running finalizers with the VM lock held, and they were context switching and causing this issue.

These rules must be held going forward to ensure we don't context switch unexpectedly:

If you have the VM lock held,
* Don't enter the interpreter loop.
* Don't yield to ruby code.
* Don't call rb_nogvl (it will context switch you and will not unlock the VM lock).
* You can still check interrupts, but we won't allow context switching

If you don't have the GVL:
* Don't call rb_ensure/rb_protect, etc (these are old rules but good to have assertions for).

This doesn't appear to be a correct fix. We should allow raising NoMemoryError even if we're under the VM lock.

luke-gruber · 2025-09-12T21:40:42Z

The BigDecimal bug has been fixed.

ioquatix · 2025-09-18T01:11:52Z

I wonder if this could be useful: https://github.com/ruby/ruby/blob/0bb6a8bea49fed8ccef0a70aca5f2ea05af94292/vm_core.h#L73-L103

luke-gruber force-pushed the fix_context_switch_while_holding_vm_lock branch 7 times, most recently from 7ca26e8 to 1fa307d · September 11, 2025 16:38

luke-gruber force-pushed the fix_context_switch_while_holding_vm_lock branch from 36d6901 to f8887a8 · September 11, 2025 19:28

luke-gruber added 2 commits September 12, 2025 14:19

Revert changes introduced in 2f6c694

d9e6507

This doesn't appear to be a correct fix. We should allow raising NoMemoryError even if we're under the VM lock.

Add a test that fails without these lock_rec changes.

1e3e0c7
