luke-gruber
We were seeing errors in our application that looked like:

[BUG] unexpected situation - recordd:1 current:0
/error.c:1097 rb_bug_without_die_internal
/vm_sync.c:275 disallow_reentry
/eval_intern.h:136 rb_ec_vm_lock_rec_check
/eval_intern.h:147 rb_ec_tag_state
/vm.c:2619 rb_vm_exec
/vm.c:1702 rb_yield
/eval.c:1173 rb_ensure

We concluded that a context switch was happening while a thread held the VM lock. While investigating, we added assertions that a thread never yields to another thread with the VM lock held, and we enabled these assertions even in single-ractor mode. They failed in a few places, most notably in finalizers: we were running finalizers with the VM lock held, and the finalizers could context switch, which caused this bug.
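One general way to keep callbacks such as finalizers from running under a lock is to queue them while the lock is held and run them once it is fully released. The following is only a toy sketch of that idea, not the change made in this PR and not CRuby code: every `toy_` name, the fixed-size queue, and the recursion counter are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_DEFERRED 16

typedef void (*toy_finalizer_fn)(void *data);

/* Hypothetical toy VM state: a lock recursion depth plus a queue of
 * finalizers that were requested while the lock was held. */
typedef struct {
    unsigned int lock_rec;
    toy_finalizer_fn fns[TOY_MAX_DEFERRED];
    void *args[TOY_MAX_DEFERRED];
    size_t count;
} toy_vm_t;

/* Example finalizer: increments the int that `data` points to. */
static void toy_count_finalizer(void *data)
{
    (*(int *)data)++;
}

static void toy_lock(toy_vm_t *vm)
{
    vm->lock_rec++;
}

/* Run the finalizer immediately if the lock is not held; otherwise queue it.
 * Returns 1 if it ran or was queued, 0 if the queue is full. */
static int toy_run_or_defer(toy_vm_t *vm, toy_finalizer_fn fn, void *arg)
{
    if (vm->lock_rec == 0) {
        fn(arg);
        return 1;
    }
    if (vm->count == TOY_MAX_DEFERRED) return 0;
    vm->fns[vm->count] = fn;
    vm->args[vm->count] = arg;
    vm->count++;
    return 1;
}

/* Release one level of the lock. Once it is fully released, run the queued
 * finalizers, which may now block or "context switch" safely. */
static void toy_unlock(toy_vm_t *vm)
{
    toy_finalizer_fn fns[TOY_MAX_DEFERRED];
    void *args[TOY_MAX_DEFERRED];
    size_t n, i;

    assert(vm->lock_rec > 0);
    if (--vm->lock_rec > 0) return;

    /* Copy the queue first so a finalizer that locks and defers again
     * cannot overwrite entries that have not run yet. */
    n = vm->count;
    for (i = 0; i < n; i++) {
        fns[i] = vm->fns[i];
        args[i] = vm->args[i];
    }
    vm->count = 0;
    for (i = 0; i < n; i++) fns[i](args[i]);
}
```

In this model, a finalizer requested between `toy_lock` and the matching `toy_unlock` does not run until the outermost unlock, so it never executes with the lock held.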

These rules must hold going forward to ensure we don't context switch unexpectedly:

If you hold the VM lock:
* Don't enter the interpreter loop.
* Don't yield to Ruby code.
* Don't call rb_nogvl (it will context switch you and will not unlock the VM lock).
* Don't check your own interrupts, because doing so can context switch you.

If you don't hold the GVL:
* Don't call rb_ensure, rb_protect, etc. (these are old rules, but it is good to have assertions for them).
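
As a rough illustration, the rules above can be modeled as two predicates over per-thread state. This is a toy sketch under stated assumptions, not the assertions added in this PR: the `toy_thread_t` struct and all `toy_` functions are hypothetical and do not correspond to CRuby's real data structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-thread state; NOT the real CRuby execution context. */
typedef struct {
    unsigned int vm_lock_rec; /* how many times this thread holds the VM lock */
    bool has_gvl;             /* whether this thread currently holds the GVL */
} toy_thread_t;

static void toy_vm_lock(toy_thread_t *th)
{
    th->vm_lock_rec++;
}

static void toy_vm_unlock(toy_thread_t *th)
{
    assert(th->vm_lock_rec > 0); /* unlocking an unheld lock is a bug */
    th->vm_lock_rec--;
}

/* Entering the interpreter loop, yielding to Ruby code, and releasing the
 * GVL can all switch threads, so in this model they are allowed only when
 * the VM lock is not held. */
static bool toy_may_context_switch(const toy_thread_t *th)
{
    return th->vm_lock_rec == 0;
}

/* rb_ensure/rb_protect-style calls are modeled as requiring the GVL. */
static bool toy_may_use_protect(const toy_thread_t *th)
{
    return th->has_gvl;
}
```

A caller in this model would check `toy_may_context_switch(th)` immediately before any operation that can switch threads, so a violation is reported at the call site rather than later as a lock-recursion mismatch.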

@luke-gruber luke-gruber force-pushed the fix_context_switch_while_holding_vm_lock branch 7 times, most recently from 7ca26e8 to 1fa307d Compare September 11, 2025 16:38
@luke-gruber
Author

There's one bug in bigdecimal right now: it crashes due to GC.add_stress_to_class(BigDecimal). This is an old issue that was fixed by Matt. I'll try to investigate more later. For now, I'm going to see how this does in the experimental cluster, to check whether it stops the errors and doesn't crash.
cc @jhawthorn

We were seeing errors in our application that looked like:

```
[BUG] unexpected situation - recordd:1 current:0
/error.c:1097 rb_bug_without_die_internal
/vm_sync.c:275 disallow_reentry
/eval_intern.h:136 rb_ec_vm_lock_rec_check
/eval_intern.h:147 rb_ec_tag_state
/vm.c:2619 rb_vm_exec
/vm.c:1702 rb_yield
/eval.c:1173 rb_ensure
```

We concluded that there was context switching going on while a thread
held the VM lock. During the investigation into the issue, we added
assertions that we never yield to another thread with the VM lock held.
We enabled these VM lock assertions even in single ractor mode. These
assertions were failing in a few places, but most notably in finalizers.
We were running finalizers with the VM lock held, and they were context
switching and causing this issue.

These rules must hold going forward to ensure we don't context switch unexpectedly:

If you hold the VM lock:
    * Don't enter the interpreter loop.
    * Don't yield to Ruby code.
    * Don't call rb_nogvl (it will context switch you and will not unlock the VM lock).
    * You can still check interrupts, but we won't allow context switching.

If you don't hold the GVL:
    * Don't call rb_ensure, rb_protect, etc. (these are old rules, but it is good to have assertions for them).
@luke-gruber luke-gruber force-pushed the fix_context_switch_while_holding_vm_lock branch from 36d6901 to f8887a8 Compare September 11, 2025 19:28
This doesn't appear to be a correct fix. We should allow raising
NoMemoryError even if we're under the VM lock.
@luke-gruber
Author

The BigDecimal bug has been fixed.
