
Conversation

tianleq
Collaborator

@tianleq tianleq commented Jul 28, 2025

This PR adds a concurrent Immix plan.

pub fn prepare(&mut self, full_heap: bool) {
if full_heap {
debug_assert!(self.treadmill.is_from_space_empty());
// debug_assert!(self.treadmill.is_from_space_empty());
Collaborator Author

I had to comment out this assertion because a full stop-the-world GC might be triggered during concurrent marking, and this assertion would fail in that case.

@@ -162,4 +162,7 @@ pub trait Collection<VM: VMBinding> {
fn create_gc_trigger() -> Box<dyn GCTriggerPolicy<VM>> {
unimplemented!()
}

/// Inform the VM of concurrent marking status
fn set_concurrent_marking_state(_active: bool) {}
Member

The binding uses this to know whether SATB is active and to decide whether to call the 'soft reference load barrier'. This flag could be stored as a global flag for SATB barriers, and the binding could access the barrier. In that case, we may not need this function.

Collaborator Author

The problem is that the barrier is an opaque pointer from the binding's perspective. If we want to store this flag in the barrier and let the binding access it, then we need to make sure all Rust barrier structs have the same size, and expose that layout to the binding. I do not think that is worthwhile.

Member

@qinsoon qinsoon Jul 29, 2025

Why is it an opaque pointer? It is a dyn Barrier fat pointer and allows Downcast. We can always access the barrier from the binding-side Rust code and store an active flag in a static variable for the C++ code to use. E.g. at the end of a pause, check whether SATB is active; if so, set the static variable and enable the load reference barrier.

But anyway, the proposal above is based on the assumption that we do not want the function fn set_concurrent_marking_state(_active: bool). If it is okay to keep the function, then we don't need the change.
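A minimal sketch of that binding-side idea, just to illustrate the suggestion; SATBBarrier's generic parameter is whatever BarrierSemantics the binding uses, the is_active() accessor is hypothetical, and imports of the mmtk types are omitted:

use std::sync::atomic::{AtomicBool, Ordering};

// Static flag the C++ side of the binding reads to decide whether to run the
// reference load barrier.
pub static SATB_ACTIVE: AtomicBool = AtomicBool::new(false);

// Called by binding-side Rust code at the end of a pause, before resuming mutators.
pub fn publish_satb_state<VM: VMBinding, S: BarrierSemantics<VM = VM>>(mutator: &mut Mutator<VM>) {
    // dyn Barrier supports downcasting, so the binding can recover the concrete type.
    if let Some(satb) = mutator.barrier().downcast_ref::<SATBBarrier<S>>() {
        SATB_ACTIVE.store(satb.is_active(), Ordering::SeqCst); // is_active() is assumed, not an existing API
    }
}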

Member

This call seems necessary. I tried to add an active flag for barriers in MMTk core, setting the flag in the mutator prepare of initial marking and clearing it in the mutator release of final marking. On the binding side, I got the flag from MMTk core and set the binding-side value before resuming mutators. I assumed that we could set/clear the flag at any time during a pause, and that the binding only needs to see the flag before a pause ends. However, after one concurrent GC (initial mark and final mark), I saw segfaults. So my assumption seems not right. Maybe there are some threads that are running during a pause, and the barrier affects those threads.

Collaborator

If a mutator thread is doing IO or running (or blocked in) any native functions that can't touch the heap, the VM may consider it "stopped". So resuming mutators doesn't mean all mutators will see the state change promptly.

This is also the main headache of implementing block_for_gc because one mutator may get blocked for multiple GC cycles.

Collaborator Author

This refactoring looks different from what @wks and I were thinking of. We thought the binding would start looking at the local flag in the barrier instance to find out whether the barrier is active. But in the current implementation, the binding still looks at that global variable; the only change is how its value gets updated.

Member

Right. The core is using local flags, but the binding still uses the global variable. Checking a local flag may not be faster for bindings, as they need to dereference the barrier and then check the value.

I will just leave this as it is. If we do want the binding to check the local flag, I can do that.

Member

I have a related question: the flag is only used for the weak reference load barrier, so why can't we use the log bit for the weak reference load barrier as well?

Collaborator

@wks wks Aug 27, 2025

I have a related question: the flag is only used for the weak reference load barrier, so why can't we use the log bit for the weak reference load barrier as well?

The log bit is for maintaining the "snapshot at the beginning". Because a write operation deletes an edge, the old target, which was part of the SATB, may not be reachable from roots after the write. So the log bit identifies the first modification of an object, and the write barrier records its old target.

But the weak reference load barrier is different. The weak referent was not part of the SATB, but merely loading from the weak field makes the referent strongly reachable. It is analogous to an object allocated during concurrent marking being conservatively considered strongly reachable in this GC. Similarly, a weakly reachable object loaded from a WeakReference is conservatively considered strongly reachable in this GC, too. This is not related to "first modification", so the log bit is not applicable. Concretely, WeakReference.get() modifies neither the WeakReference itself nor the referent; it only modifies the object graph by inserting a strong edge from the roots to the referent. So it is inappropriate to change the log bits of the WeakReference or the referent.
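A rough pseudocode contrast of the two barriers, only to illustrate the point above; every helper here (is_logged, log, children, remember_for_marking, store, satb_active) is a placeholder, not an mmtk-core API:

// SATB write barrier: the log bit gates "first modification"; the deleted (old)
// targets are remembered so the snapshot stays reachable.
fn satb_object_write(obj: ObjectReference, slot: Slot, new_target: ObjectReference) {
    if !is_logged(obj) {
        log(obj);
        for old_target in children(obj) {
            remember_for_marking(old_target);
        }
    }
    store(slot, new_target);
}

// Weak reference load barrier: nothing is modified, so the log bit has nothing to
// gate; the loaded referent is conservatively kept alive while SATB is active.
fn weak_reference_load(referent: ObjectReference) -> ObjectReference {
    if satb_active() {
        remember_for_marking(referent);
    }
    referent
}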

Collaborator

It's a bit confusing to have a mix of local and global. I think it might be easier/simpler to do the global checking on both ends (mmtk-core and binding).

@wks
Collaborator

wks commented Jul 29, 2025

What will happen if a mutator wants to fork() while concurrent marking is in progress? There are subtle interactions between preparing for forking (MMTK::prepare_to_fork), the concurrently running marking task (including the dirty mark bits, etc.), and the WorkerGoal which is currently only designed for triggering GC and triggering "prepare-to-fork".

We may temporarily tell the VM binding that MMTk doesn't currently support forking when using concurrent GC. We may fix it later. One simple solution is postponing the forking until the current GC finishes.

The status quo is that only CRuby and Android need forking. But CRuby will not support concurrent GC in the short term.

@k-sareen
Collaborator

What will happen if a mutator wants to fork() while concurrent marking is in progress?

You can't let this happen. Either the binding needs to ensure that the mutator waits while concurrent marking is active, or you don't let a concurrent GC happen before forking (ART's method of dealing with this).

@wks
Collaborator

wks commented Jul 29, 2025

You can't let this happen. Either the binding needs to ensure that the mutator waits while concurrent marking is active, or you don't let a concurrent GC happen before forking (ART's method of dealing with this).

Agreed. Fortunately, the current API doc for prepare_to_fork says:

    /// This function sends an asynchronous message to GC threads and returns immediately, but it
    /// is only safe for the VM to call `fork()` after the underlying **native threads** of the GC
    /// threads have exited.  After calling this function, the VM should wait for their underlying
    /// native threads to exit in VM-specific manner before calling `fork()`.

So a well-behaving VM binding shall wait for all the GC worker threads (which are created by the binding via VMCollection::spawn_gc_thread anyway) to exit before calling fork(). That's VM-specific, but not hard. Extending this API to support concurrent GC should not require the VM binding to rewrite this part.

@k-sareen
Collaborator

Mentioning it before I forget: we need to rename the GC counter to say that it counts pauses, or add a separate counter that counts pauses.

#[repr(u8)]
#[derive(Debug, PartialEq, Eq, Copy, Clone, NoUninit)]
pub enum Pause {
Full = 1,
Collaborator

Do we have a better name than Full? I think here it means doing a full GC in one STW (as opposed to doing part of a GC, such as initial mark, final mark, or an incremental part of a GC). But it may be confused with full-heap GC (as opposed to nursery GC).

Collaborator Author

Actually, I have the same concern, but I could not find a better term, so I simply left Full here as a placeholder.


fn load_reference(&mut self, _o: ObjectReference) {}

fn object_reference_clone_pre(&mut self, _obj: ObjectReference) {}
Collaborator

This method only has a blank implementation in BarrierSemantics::object_reference_clone_pre, and SATBBarrier::object_reference_clone_pre calls that. What is this method intended for?

Comment on lines 155 to 175
struct SlotIteratorImpl<VM: VMBinding, F: FnMut(VM::VMSlot)> {
f: F,
// should_discover_references: bool,
// should_claim_clds: bool,
// should_follow_clds: bool,
_p: PhantomData<VM>,
}

impl<VM: VMBinding, F: FnMut(VM::VMSlot)> SlotVisitor<VM::VMSlot> for SlotIteratorImpl<VM, F> {
fn visit_slot(&mut self, slot: VM::VMSlot) {
(self.f)(slot);
}
}

pub struct SlotIterator<VM: VMBinding> {
_p: PhantomData<VM>,
}

impl<VM: VMBinding> SlotIterator<VM> {
pub fn iterate(
o: ObjectReference,
Collaborator

We may not really need SlotIterator or SlotIteratorImpl. Note that in mmtk-core/src/vm/scanning.rs we already have SlotVisitor and impl<SL: Slot, F: FnMut(SL)> SlotVisitor<SL> for F {. That's enough for us to pass a lambda (&mut |slot| { ... }) to Scanning::scan_object and use it as the slot_visitor argument.

I think the lxr branch introduced this struct to do some kind of filtering, and multiplexed multiple methods of Scanning, including the newly introduced Scanning::scan_object_with_klass.

However, it is still nice to add a method ObjectReference::iterate_fields. It's convenient.
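For reference, a minimal sketch of what that blanket impl already allows, assuming a worker tls (VMWorkerThread) and an object (ObjectReference) are in scope:

// A plain closure works as the slot visitor thanks to
// impl<SL: Slot, F: FnMut(SL)> SlotVisitor<SL> for F.
let mut visitor = |slot: VM::VMSlot| {
    // process the slot, e.g. load its value and enqueue the target
};
<VM::VMScanning as Scanning<VM>>::scan_object(tls, object, &mut visitor);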

Comment on lines 704 to 706
pub fn iterate_fields<VM: VMBinding, F: FnMut(VM::VMSlot)>(self, f: F) {
SlotIterator::<VM>::iterate(self, f)
}
Collaborator

Since FnMut(VM::VMSlot) already implements trait SlotVisitor, we can directly pass it to scan_object.

Suggested change
pub fn iterate_fields<VM: VMBinding, F: FnMut(VM::VMSlot)>(self, f: F) {
SlotIterator::<VM>::iterate(self, f)
}
pub fn iterate_fields<VM: VMBinding, F: FnMut(VM::VMSlot)>(self, mut f: F) {
<VM::VMScanning as Scanning<VM>>::scan_object(
VMWorkerThread(VMThread::UNINITIALIZED),
self,
&mut f,
)
}

However, this is still imperfect.

  1. Not all VMs support scan_object. CRuby is one notable exception. If we only intend to visit the value of all (non-null) fields without updating (forwarding) any of them, I suggest we introduce another method in the Scanning trait, such as Scanning::enumerate_children. Otherwise, consider invoking Scanning::support_slot_enqueuing and scan_object_and_trace_edges, too.
  2. This method doesn't have tls. For CRuby, I am implementing object scanning by setting a thread-local variable (which is accessed via the tls) as a call-back, calling a C function to visit fields, and the C function will call the call-back to let the Rust code visit the field value. But because this method is called from the barrier, it may not always provide a VMWorkerThread. I suggest the newly introduced method Scanning::enumerate_children to take a VMThread instead of VMWorkerThread as argument.

In summary, I am also suggesting we introduce a method

pub trait Scanning<VM: VMBinding> {
    /// Visit the children of an object.
    ///
    /// This method may be called during mutator time, and is required by concurrent GC. Currently,
    /// we don't support concurrent copying GC, so this method can assume no objects are moved by GC
    /// while this method is running.
    fn enumerate_children(
        tls: VMThread,
        object: ObjectReference,
        child_visitor: &mut impl FnMut(ObjectReference),
    );
}

And if we make that change, then ObjectReference::iterate_fields will call Scanning::enumerate_children instead.
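A sketch of that delegation, assuming the trait method above is added and the visitor now receives child ObjectReferences rather than slots:

impl ObjectReference {
    pub fn iterate_fields<VM: VMBinding>(
        self,
        tls: VMThread,
        visitor: &mut impl FnMut(ObjectReference),
    ) {
        <VM::VMScanning as Scanning<VM>>::enumerate_children(tls, self, visitor);
    }
}

Whether the method keeps the name iterate_fields or is renamed to match the children-visiting semantics is a separate question.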

Collaborator

@wks wks left a comment

The current "WorkerGoal" mechanism should be able to handle the case where mutators trigger another GC between InitialMark and FinalMark. We can remove WorkPacketStage::Initial and WorkPacketStage::ConcurrentSentinel, and use GCWorkScheduler::on_last_parked to transition the GC state from InitialMark to the concurrent marking to FinalMark and finally finish the GC. See inline comments for more details.

notify = true;
}
if notify {
self.wakeup_all_concurrent_workers();
Collaborator

This wakes up other workers. However, the current GC worker considers itself the "last parked worker". So after the on_gc_finished function returns, it will clear the current "goal" (goals.on_current_goal_completed()) and try to handle another "goal" (i.e. a GC request) in self.respond_to_requests(worker, goals). Unless the mutator triggers another GC immediately, the current GC worker thread will find that there are no requests, and park itself (return LastParkedResult::ParkSelf;). If we set MMTK_THREADS=1, it will never wake up again.

So we need to change GCWorkScheduler::on_last_parked so that, if this is the end of the InitialMark STW, it returns LastParkedResult::WakeAll and all GC workers wake up again, including the current GC worker thread itself.

And we shouldn't clear the current "worker goal" if it is the end of InitialMark. The GC workers should still be working towards the "GC" goal. But it needs to be a bit stateful, knowing whether it is (1) during a "full GC", (2) during initial mark, (3) during final mark, or (4) between initial mark and final mark (we may just call that "concurrent tracing"). As long as the GC workers are still working on a goal (regardless of the concrete state), they will not accept other GC requests or requests for forking, which automatically solves the forking problem.

And when concurrent tracing finishes (all the postponed work packets and their children are drained), the last parked worker will reach on_last_parked again. This time, it can immediately start scheduling the FinalMark stage (or should we call it a "super stage" to distinguish it from a work bucket stage, or just call it an STW). When FinalMark finishes, the GC really finishes. GC workers can now accept requests for another GC or requests for forking.
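A rough shape of that flow inside on_last_parked; the phase enum and the schedule_final_mark helper below are illustrative, while LastParkedResult, on_current_goal_completed, and respond_to_requests are the existing scheduler pieces mentioned above:

match self.concurrent_phase {
    Phase::InitialMark => {
        // End of the InitialMark STW: keep the "GC" goal and move on to
        // concurrent tracing; wake every worker, including this one.
        self.concurrent_phase = Phase::ConcurrentTracing;
        LastParkedResult::WakeAll
    }
    Phase::ConcurrentTracing => {
        // All postponed packets and their children are drained: schedule the
        // FinalMark STW and keep working towards the same goal.
        self.schedule_final_mark();
        self.concurrent_phase = Phase::FinalMark;
        LastParkedResult::WakeAll
    }
    Phase::FinalMark | Phase::FullStw => {
        // The GC really finishes: clear the goal and respond to pending
        // requests (another GC, or prepare-to-fork).
        goals.on_current_goal_completed();
        self.respond_to_requests(worker, goals)
    }
}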

@@ -250,3 +260,82 @@ impl<S: BarrierSemantics> Barrier<S::VM> for ObjectBarrier<S> {
}
}
}

pub struct SATBBarrier<S: BarrierSemantics> {
Member

This seems to be a pre-write ObjectBarrier. The current ObjectBarrier only implements the post-write functions, and this implementation implements the pre-write functions.

Member

I don't plan to do this refactoring in this PR.

@wks
Collaborator

wks commented Jul 30, 2025

It looks like we keep newly allocated objects alive by marking the lines (actually, the mark bits of all words in a line are set). This is OK. But what if we need VO bits? The default strategy for maintaining VO bits is copying from the mark bits. Since all words of the line are marked, the VO bits will also be all ones if we blindly copy VO bits from the mark bits. But because post_alloc already sets the VO bits, we just need to keep their VO bits as is.

To properly support VO bits, we need to do two things:

  1. Reset the BumpPointer of ImmixAllocator so that after the InitialMark, all mutators start allocation from empty lines (i.e. make sure a line never contains both objects allocated before InitialMark and objects allocated after InitialMark).
  2. In Block::sweep,
    • if a line only contains objects allocated after InitialMark, we keep the VO bits as is;
    • if it only contains live objects before InitialMark, we copy the VO bits over from mark bits;
    • otherwise the line must be empty. We clear its VO bits.

There are multiple ways to know if a line only contains new objects or old objects.

  • If a line is marked, but the mark bits of a line are all zero, it must be a line that only contains objects allocated after InitialMark. This doesn't need extra metadata. (Update: This is not true. If a mutator adds an edge from a live old object to the new object, the new object will still be reachable.)
  • If a line is marked, and the mark bits are a superset of the VO bits, it must be a line that only contains objects allocated after InitialMark. If a block only contains objects from before the InitialMark, the mark bits must be a subset of the VO bits. Only all-ones is a superset of all possible VO-bit patterns. If the mark bits and VO bits are identical (every word is an individual object, and they are either all live or all new), it doesn't matter whether we copy the mark bits or retain the VO bits. Currently each Line is 256 bytes, corresponding to 32 bits of mark bits or VO bits, so this can be done with a cheap 32-bit bit operation (see the sketch after this list).
  • We introduce another one-bit-per-line metadata to record if it contains objects allocated after InitialMark.
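A sketch of how Block::sweep could combine the superset check from the second bullet with the VO-bit rules above; one Line is 256 bytes, i.e. 32 mark/VO bits, and the per-line metadata accessors here are hypothetical:

fn sweep_line_vo_bits(line: Line) {
    let mark: u32 = load_line_mark_bits(line); // hypothetical metadata accessors
    let vo: u32 = load_line_vo_bits(line);
    if mark == 0 {
        // No live old objects and no new objects: the line is empty, clear its VO bits.
        store_line_vo_bits(line, 0);
    } else if (mark & vo) == vo {
        // Mark bits are a superset of the VO bits: the line only contains objects
        // allocated after InitialMark, so keep the VO bits set by post_alloc as is.
    } else {
        // The line only contains objects that were live before InitialMark:
        // copy the VO bits over from the mark bits.
        store_line_vo_bits(line, mark);
    }
}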

@tianleq
Collaborator Author

tianleq commented Jul 30, 2025

"It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects." This is not true. Both lines and every word within that line are marked. So if one blindly copy mark bits to vo bits, then vo bit will be 1 even if that address is not an object

@qinsoon
Member

qinsoon commented Jul 30, 2025

It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects. This is OK. But what if we need VO bits?

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

@tianleq
Collaborator Author

tianleq commented Jul 30, 2025

It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects. This is OK. But what if we need VO bits?

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

The problem is that we do not want to do the check in the fast path. If we want to mark each individual object, then in the allocation fast path we need to check whether concurrent marking is active and then set the mark bit, whereas the current bulk-setting approach only sets the mark bits in the slow path.
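For illustration, the shape of that slow-path hook could look like the following; the helper names and the exact hook point are hypothetical, the point is only that the work happens when fresh lines are acquired, not on every allocation:

fn on_acquire_lines(start: Address, end: Address, concurrent_marking_active: bool) {
    if concurrent_marking_active {
        // Mark the lines so they survive the next sweep.
        mark_lines_in_range(start, end);
        // Bulk-set the side mark bits for every word in the range, so objects
        // allocated here are treated as live without a fast-path check.
        bulk_set_mark_bits(start, end);
    }
}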

@wks
Collaborator

wks commented Jul 30, 2025

"It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects." This is not true. Both lines and every word within that line are marked. So if one blindly copy mark bits to vo bits, then vo bit will be 1 even if that address is not an object

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

Actually I think for correctness, we must not set mark bits of individual objects when allocating. Suppose there are objects A and B when GC is triggered, and root -> ... -> A -> B. During concurrent marking, a mutator allocated C, and changed the object graph to root -> ... -> A -> C -> B. Then a GC worker visits A for the first time. If C already has the mark bit, GC will not enqueue C, and will not enqueue B, either. It will consider B as dead.

I forgot the SATB barrier. It will remember B when we remove the edge A -> B.

@tianleq
Collaborator Author

tianleq commented Jul 30, 2025

"It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects." This is not true. Both lines and every word within that line are marked. So if one blindly copy mark bits to vo bits, then vo bit will be 1 even if that address is not an object

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

Actually I think for correctness, we must not set mark bits of individual objects when allocating. Suppose there are objects A and B when GC is triggered, and root -> ... -> A -> B. During concurrent marking, a mutator allocated C, and changed the object graph to root -> ... -> A -> C -> B. Then a GC worker visits A for the first time. If C already has the mark bit, GC will not enqueue C, and will not enqueue B, either. It will consider B as dead.

In your case, B will be captured by the SATB barrier, so it will not be considered dead. There is no need to scan those newly allocated objects, because any children of those newly allocated objects must have been alive in the snapshot and are thus guaranteed to be traced.

@qinsoon
Member

qinsoon commented Jul 30, 2025

It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects. This is OK. But what if we need VO bits?

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

The problem is that we do not want to do the check in the fast path. If we want to mark each individual object, then in the allocation fast path we need to check whether concurrent marking is active and then set the mark bit, whereas the current bulk-setting approach only sets the mark bits in the slow path.

The mark bit could be in the header, and we have to set it per object if it is in the header. We can differentiate header mark bit and side mark bit, and deal with each differently.

But bulk-setting mark bits is still a bit hacky -- and this is why we would have issues with VO bits. VO bits are copied from mark bits, assuming mark bits are only set for individual objects.

}

fn object_probable_write_slow(&mut self, obj: ObjectReference) {
crate::plan::tracing::SlotIterator::<VM, _>::iterate_fields(obj, self.tls.0, |s| {
Member

@qinsoon qinsoon Aug 27, 2025

Is this scanning the object? Do we need to call post_scan_object here? @tianleq

Collaborator

This is a write barrier. The object is not being scanned by the GC in the sense that we've marked it gray, etc. So I don't think it's sound to call post_scan_object here.

@qinsoon qinsoon marked this pull request as ready for review August 27, 2025 06:33
@qinsoon qinsoon requested a review from wks August 27, 2025 06:33
@qinsoon
Member

qinsoon commented Aug 27, 2025

@wks I don't have other intended changes for this PR. You can take a look. Also feel free to commit to the PR.

qinsoon and others added 3 commits August 27, 2025 23:44
It is currently a no-op and not needed for SATB barrier.

It was intended for implementing OpenJDK's object cloning pre barrier.
We will reintroduce it into mmtk-core when we implement a plan that
needs such a barrier, and we need to design our API in a VM-agnostic
way.
@wks
Collaborator

wks commented Aug 28, 2025

Currently ConcurrentImmix::collection_required sets the ConcurrentImmix::gc_cause field before triggering GC using GCRequester. But if the user triggers a GC (System.gc()) outside concurrent marking, it bypasses collection_required and calls directly into GCRequester, so the field ConcurrentImmix::gc_cause will still hold a stale value Some(Pause::FinalMark) from the last FinalMark pause. Then the next pause will be FinalMark, but that triggers an assertion error because InitialMark has not happened.

wks added 2 commits August 28, 2025 16:54
We instead use a boolean field `should_do_full_gc` to tell if the user
or the `collection_required` method think it is time to do full GC.  We
let `schedule_collection` decide the actual pause type.
Currently it is unsafe to skip FinalMark and go directly to Full GC.  We
add a comment for that, and postpone full GC request to the next GC
after FinalMark.
@qinsoon
Member

qinsoon commented Aug 28, 2025

Currently ConcurrentImmix::collection_required sets the ConcurrentImmix::gc_cause field before triggering GC using GCRequester. But if the user triggers a GC (System.gc()) outside concurrent marking, it bypasses collection_required and calls directly into GCRequester, so the field ConcurrentImmix::gc_cause will still hold a stale value Some(Pause::FinalMark) from the last FinalMark pause. Then the next pause will be FinalMark, but that triggers an assertion error because InitialMark has not happened.

I changed select_collection_kind to choose what collection to do based on gc_cause in 1368ac5. Before that change, select_collection_kind would recheck all the conditions. Possibly we want to revert that, and remove gc_cause instead (gc_cause was unused).

@wks
Collaborator

wks commented Aug 28, 2025

... Possibly we want to revert that, and remove gc_cause instead (gc_cause was unused).

Yes. I did it. I refactored the code, removed the gc_cause field and let schedule_collection (select_collection_kind is inlined into it for conciseness) once again decide what kind of pause to schedule.

pub struct StopMutators<C: GCWorkContext> {
/// If this is true, we skip creating [`ScanMutatorRoots`] work packets for mutators.
/// By default, this is false.
skip_mutator_roots: bool,
Member

@tianleq suggests that we should just allow a closure as the mutator visitor, instead of having some flags.
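A sketch of that suggestion, with illustrative bounds; the exact closure signature expected by VMCollection::stop_all_mutators may differ:

use std::marker::PhantomData;

pub struct StopMutators<C: GCWorkContext, F>
where
    F: FnMut(&'static mut Mutator<C::VM>) + Send + 'static,
{
    /// Called once per stopped mutator. The default visitor would flush the mutator
    /// and create a ScanMutatorRoots packet, while an InitialMark or FinalMark pause
    /// could pass a different closure instead of toggling a flag.
    mutator_visitor: F,
    phantom: PhantomData<C>,
}

impl<C: GCWorkContext, F> StopMutators<C, F>
where
    F: FnMut(&'static mut Mutator<C::VM>) + Send + 'static,
{
    pub fn new(mutator_visitor: F) -> Self {
        Self { mutator_visitor, phantom: PhantomData }
    }
}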
