Ensure log bits are correctly maintained #1169

Merged · 31 commits · Apr 12, 2025

Conversation

@k-sareen k-sareen commented Jul 17, 2024

This fixes a bug wherein we were erroneously resetting the unlog bits
for objects in the modbuf for full-heap GCs. If we reset the unlog bits
in a full-heap GC, we may end up setting the unlog bit for an object
that has actually died, which can cause problems for future allocations
that reuse its memory.

In major GCs, we now bulk clear unlog bits for mark-sweep space and
immix space during the prepare stage and bulk clear them for copyspace
during the release stage. We clear log bits for large objects during the
sweep. We also set the unlog bits for mature objects when tracing. This
ensures that there are no stale unlog bits and that we don't accidentally
log nursery objects.

This PR also adds debug assertions checking that any object that has
been added to the modbuf is considered mature by MMTk.
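
As a rough illustration of the life cycle described above, here is a minimal, self-contained model of a side log-bit table across a major GC. The bitmap type and helpers are stand-ins for illustration only, not mmtk-core APIs (the real code uses the side-metadata API, e.g. the bzero_metadata calls visible in the diffs below):

// Simplified model: one bool per object slot instead of packed side metadata.
struct SideBitmap {
    bits: Vec<bool>,
}

impl SideBitmap {
    /// Bulk clear a region, analogous to a side-metadata bzero.
    fn bzero(&mut self, start: usize, len: usize) {
        self.bits[start..start + len].fill(false);
    }
    fn set(&mut self, slot: usize) {
        self.bits[slot] = true;
    }
}

/// A major GC, sketched: clear all unlog bits up front, then set them again
/// only for objects the trace proves live (and therefore mature afterwards).
/// Dead objects are never visited, so no stale unlog bits survive the GC.
fn major_gc(unlog: &mut SideBitmap, live_slots: &[usize]) {
    let len = unlog.bits.len();
    unlog.bzero(0, len); // prepare/release: bulk clear
    for &slot in live_slots {
        unlog.set(slot); // tracing: re-set for live mature objects
    }
}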

@k-sareen k-sareen requested a review from wks July 17, 2024 12:44
@k-sareen k-sareen force-pushed the fix/modbuf-global-gc-bug branch from 470787e to ec6ba72 Compare July 17, 2024 12:48
// Scan objects in the modbuf and forward pointers
let modbuf = std::mem::take(&mut self.modbuf);
GCWork::do_work(
&mut ScanObjects::<E>::new(modbuf, false, WorkBucketStage::Closure),
worker,
mmtk,
)
} else {
// Flip the per-object unlogged bits to "unlogged" state for objects inside the bootimage
#[cfg(feature = "vm_space")]
k-sareen (Collaborator, Author):

I didn't know which was better: going through the list, or bulk-setting the unlog bits in vm_space.release(). I thought this was better since it is more general than bulk-setting the unlog bits (in case the VM binding is not using side metadata for the unlog bits).

Member:

I don't get why the VM space needs to be dealt with specially here. Why do we need to set the unlog bits in mature GCs for the VM space? What about other spaces like immortal or LOS? How is the VM space different from those spaces?

k-sareen (Collaborator, Author), Jul 17, 2024:

Hm. So what we want to do is reset the unlog bit only for objects that are considered live. All mature ImmixSpace and CopySpace objects get the unlog bit set when they are copied over, because of CopySemantics::Mature. But objects in the VM space (or immortal space and LOS, like you mentioned) are not copied, so we need to reset their bits iff those objects are live. VM space and immortal space objects are always live, so we need to reset them unconditionally, but for the LOS we need to reset only when the object is live. So really we're doing this reset at the wrong place and time.

One way to fix it would be to reset the unlog bit for mature LOS objects inside its trace_object. And then we can reset the unlog bits for VM space and immortal space objects in the ProcessModBuf work packet.
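
A sketch of that suggestion on a simplified non-moving space (the header layout and names are illustrative, not MMTk's actual representation): the unlog bit is reset inside trace_object, exactly when the object is proven live.

// Simplified per-object state for a non-moving space (LOS, immortal, VM space).
struct ObjectState {
    marked: bool,   // mark bit, valid once tracing reaches the object
    unlogged: bool, // unlog bit: the object is a candidate for logging
}

fn trace_object(obj: &mut ObjectState, full_heap_gc: bool) {
    if !obj.marked {
        obj.marked = true;
        // The object is reachable, hence live and mature; make it a logging
        // candidate again. Dead objects are never traced, so their unlog
        // bits are never touched here.
        if full_heap_gc {
            obj.unlogged = true;
        }
        // ... enqueue obj for scanning ...
    }
}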

Member:

But objects in the VM space (or even immortal and LOS like you mentioned) don't perform copies and so we need to reset their bits iff those objects are live.

Why not do that in the trace_object method for the policy where we know the object is reached? For ImmixSpace and CopySpace, the log bit is set during copying in trace_object.

k-sareen (Collaborator, Author):

Right, that's what I suggested for LOS objects. But it's not clear to me whether ART, for example, will actually end up tracing through all objects in the bootimage or immortal space for a full-heap GC, given their optimization(s). Maybe that's an ART-specific issue though.

k-sareen (Collaborator, Author):

Another alternative could be to create a new work packet to be executed after the transitive closure (or maybe during release?) for major GCs. In that work packet we have all the correct information about whether a given object address is marked, and if it is, we can reset the unlog bit for that object.
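
A sketch of that alternative, modelled with a plain trait rather than mmtk-core's actual GCWork API (names are hypothetical): a packet scheduled after the closure, when mark bits are final, derives each unlog bit from the corresponding mark bit.

trait WorkPacket {
    fn do_work(&mut self);
}

/// Hypothetical packet run after the transitive closure (or during release).
struct ResetUnlogBits<'a> {
    mark_bits: &'a [bool],
    unlog_bits: &'a mut [bool],
}

impl WorkPacket for ResetUnlogBits<'_> {
    fn do_work(&mut self) {
        // Liveness is now known: keep the unlog bit only for marked
        // (live, mature) objects; everything else is cleared.
        for (unlog, &marked) in self.unlog_bits.iter_mut().zip(self.mark_bits) {
            *unlog = marked;
        }
    }
}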

qinsoon (Member), Jul 18, 2024:

Another option is to unconditionally set the log bit in ProcessModBuf (no matter whether it is a nursery or mature GC), and bulk clear log bits when we release memory for dead objects.

In the current code base, for Immix space, we set log bits during tracing if unlog_object_when_traced is true, and clear log bits in prepare if reset_log_bit_in_major_gc is true. Only in Sticky Immix are both flags set to true. The former is needed, as we 'promote' nursery objects in the immix space when we trace them (and we do not have to copy them to 'promote' them). But the latter may not be necessary -- we can bulk clear log bits for the dead objects after a GC. For other spaces, we can always bulk clear log bits after a GC for reclaimed memory.
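
Condensed, the two flags act in different phases. This sketch models where each one takes effect (illustrative code, not the actual ImmixSpace implementation):

struct ImmixSpaceArgs {
    /// Set the unlog bit when an object is traced: in-place "promotion".
    unlog_object_when_traced: bool,
    /// Bulk clear all log bits in the prepare phase of a major GC.
    reset_log_bit_in_major_gc: bool,
}

fn prepare(args: &ImmixSpaceArgs, major_gc: bool, log_bits: &mut [bool]) {
    if major_gc && args.reset_log_bit_in_major_gc {
        log_bits.fill(false); // a bzero over the space's side metadata
    }
}

fn on_trace(args: &ImmixSpaceArgs, unlog_bit: &mut bool) {
    if args.unlog_object_when_traced {
        *unlog_bit = true; // the traced object is treated as mature
    }
}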

k-sareen (Collaborator, Author):

Maybe this warrants a group discussion; we can discuss it in the next group meeting. I'll do a hack for ART in the meantime, since this current PR is probably still incorrect because LOS objects' unlog bits are not being reset.

k-sareen (Collaborator, Author):

What about cases where we recycle mature LOS objects? I think that is broken as well. If the unlog bit was set for an object in a previous GC and it has died since then, and we then allocate a new LOS object at the same address, we have a young object with the unlog bit set, which is also wrong. We need to clear the unlog bit for dead LOS objects in the sweep here.
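
A sketch of that fix on a simplified large-object sweep (illustrative types, not the actual LargeObjectSpace code): the sweep clears the unlog bit of every dead object, so a later allocation recycling the same address starts out logged.

struct LargeObject {
    marked: bool,
    unlogged: bool,
}

fn sweep(objects: &mut Vec<LargeObject>) {
    objects.retain_mut(|obj| {
        if obj.marked {
            obj.marked = false; // reset the mark for the next GC cycle
            true
        } else {
            // Dead object: clear the unlog bit so the recycled address does
            // not hand a young object a stale "unlogged" state.
            obj.unlogged = false;
            false // release the object's memory
        }
    });
}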

k-sareen (Collaborator, Author):

@wks Looks like OpenJDK actually triggers the assertion 😆. I think the assertion is sound, so this is actually an OpenJDK bug. The assertion is not triggered for ART.

@@ -231,6 +231,7 @@ impl<S: BarrierSemantics> Barrier<S::VM> for ObjectBarrier<S> {
target: Option<ObjectReference>,
) {
if self.log_object(src) {
debug_assert!(src.is_live::<S::VM>(), "{} was logged but is not live", src);
Collaborator:

is_live doesn't make sense during mutator time. The liveness of an object is only determined after computing the transitive closure. During mutator time, all objects that can be accessed are, by definition, live.

k-sareen (Collaborator, Author):

Except the unlog bit should only be set for mature objects, so they have to be considered alive, or at the very least marked. I previously had is_marked there, but objects in the bootimage may not actually have their mark bits set, so the assertion was triggering incorrectly.

k-sareen (Collaborator, Author):

On second thought, I actually agree with you and will remove the debug assert from the write-barrier code. I think the debug assert inside the ProcessModBuf work packet is correct, though.

@k-sareen k-sareen force-pushed the fix/modbuf-global-gc-bug branch from 0282ba5 to 4a29049 Compare July 28, 2024 02:41
@k-sareen k-sareen changed the title Only reset unlog bits of objects in modbuf for nursery GCs or if in bootimage Only reset unlog bits of objects in modbuf for nursery GCs Jul 28, 2024
@k-sareen k-sareen requested a review from wks July 29, 2024 04:11
k-sareen (Collaborator, Author):

Just fixed it like we discussed in a previous meeting

@k-sareen k-sareen force-pushed the fix/modbuf-global-gc-bug branch from e26f37c to 0753a49 Compare July 29, 2024 05:16
@@ -275,6 +275,17 @@ impl<VM: VMBinding> VMSpace<VM> {
);
debug_assert!(self.in_space(object));
if self.mark_state.test_and_mark::<VM>(object) {
// Flip the per-object unlogged bits to "unlogged" state for objects inside the
// bootimage
#[cfg(feature = "set_unlog_bits_vm_space")]
k-sareen (Collaborator, Author), Jul 29, 2024:

Perhaps we can remove the conditional-compilation feature here, since objects in the VM space will likely want the unlog bit set, and the feature was intended more for initialization.

k-sareen (Collaborator, Author):

@wks Can you re-review this?

wks (Collaborator), Aug 23, 2024:

If we reset the unlog bits
in a full-heap GC, we may end up setting the unlog bit for an object
that actually died which may cause issues for future allocations.

I assume the "issues for future allocations" means newly allocated (young) objects may have the unlog bit set, may then be logged when written into, and we will retain more objects than necessary.

But I don't think it is sufficient to ensure this by skipping the setting of unlog bits in mature GCs. There may be dead mature objects that were never written to (i.e. still unlogged). When those objects die, their unlog bits will remain set. If future allocations reuse the memory of those objects, some young objects will also have unlog bits set.

If you want to ensure nursery objects never get logged, we may need to (bulk) clear unlog bits for reclaimed regions, like how we clear VO bits. I think this only affects StickyImmix because we never set the unlog bits for objects in the GenCopy or GenImmix nursery in the first place.

k-sareen (Collaborator, Author):

Hence why we do a bulk clear?

k-sareen (Collaborator, Author):

Oh. How odd. This PR doesn't do the bulk clear. I must have messed up when cherry-picking from my branch.

wks (Collaborator), Aug 23, 2024:

Instead of bulk clearing, it is also possible to copy the mark bits over to the unlog bits (and remove ImmixSpaceArgs::unlog_object_when_traced and ImmixSpaceArgs::reset_log_bit_in_major_gc). I think that's a bit like VO bits, where there are two different strategies, and we found that it is faster to just copy over the mark bits. We may refactor the code to unify the handling of unlog bits and VO bits.

p.s. I think the so-called "unlog bit" should really be named "is candidate for logging". Obviously nursery objects are not such candidates.
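
On a simplified model, the "copy the mark bits over" strategy amounts to one bulk copy per region once the closure is done, instead of maintaining unlog bits incrementally through the two flags. With both tables in side metadata this is essentially a memcpy (a sketch, not mmtk-core code, and it assumes the two tables share the same granularity):

/// After the closure, the mark bitmap is exactly the set of live (hence
/// mature) objects, so the unlog bitmap can be derived from it wholesale.
fn derive_unlog_from_marks(mark_bits: &[u8], unlog_bits: &mut [u8]) {
    assert_eq!(mark_bits.len(), unlog_bits.len());
    unlog_bits.copy_from_slice(mark_bits);
}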

k-sareen (Collaborator, Author):

What that suggests is that we need to be more systematic about how we deal with stale metadata values.

This fixes a bug wherein we were erroneously resetting the unlog bits
for objects in the modbuf for full-heap GCs. If we reset the unlog bits
in a full-heap GC, we may end up setting the unlog bit for an object
that actually died which may cause issues for future allocations.

We now set the unlog bits for mature objects when tracing for VM
space (if compiled with `set_unlog_bits_vm_space`), immortal space, and
large object space.

This PR also adds debug assertions checking that any object that has
been added to the modbuf is considered "live" by MMTk, or in other
words, it is a mature object.
@k-sareen k-sareen force-pushed the fix/modbuf-global-gc-bug branch from 5e37a4b to 2b32380 Compare November 29, 2024 00:43
k-sareen (Collaborator, Author), Nov 29, 2024:

@wks What needs to be done to get this PR merged? I think it's very important to merge bug fixes ASAP, since otherwise debugging effort gets duplicated. This is a very, very subtle bug.

EDIT: I can remove the object.is_live() debug assertions if you want, although I still think they're right.

qinsoon (Member), Nov 29, 2024:

If I understand correctly, the issue is that for dead objects we do not clear their unlog bits, so newly allocated objects will have random unlog bits. The obvious fix would be to just clear the unlog bits before new allocation.

The PR tries to maintain the unlog bits before the objects become dead: if an object dies, its unlog bit is 0, and if an object lives, its unlog bit is 1. This doesn't sound like a straightforward solution to the problem. What is the reason for using this solution rather than just clearing the unlog bits?

k-sareen (Collaborator, Author), Nov 29, 2024:

The PR tries to maintain the unlog bits before the objects become dead: if an object dies, its unlog bit is 0, and if an object lives, its unlog bit is 1. This doesn't sound like a straightforward solution to the problem. What is the reason for using this solution rather than just clearing the unlog bits?

Fast-path allocation.

EDIT: It seems like I haven't cherry-picked a further change from here: k-sareen@8f6fcd8

The above commit bulk clears the unlog bits for a global GC and then sets them again while tracing.

k-sareen (Collaborator, Author), Nov 29, 2024:

Really what we want is a meta-module that takes care of all relevant metadata when an object dies, a space is released, or a TLAB is allocated. What we have currently is a hack, since each kind of metadata (like the VO bit or side forwarding bits) has been special-cased. But I feel like that is out of scope for this PR.
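
Sketched as an interface, such a meta-module might look like the trait below. This is entirely hypothetical, not an existing mmtk-core API:

/// One place that knows every metadata table (VO bits, log bits, side
/// forwarding bits, ...) and clears the right ones at each life-cycle event.
trait MetadataLifecycle {
    /// A region is released back to the page resource: bzero everything.
    fn on_region_released(&self, start: usize, size: usize);
    /// A single object is swept dead (e.g. in the LOS).
    fn on_object_dead(&self, addr: usize);
    /// A block/TLAB is handed to an allocator: assert or clear stale bits.
    fn on_block_acquired(&self, start: usize, size: usize);
}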

We now:
 - Clear the unlog bits for dead LOS objects
 - Bulk clear the unlog bits for MarkSweepSpace for full heap GC
 - Set the unlog bits for LOS and MarkSweepSpace objects during trace
   object

This is because we want to ensure the unlog bits are only set for
objects that are alive. We do not reset the unlog bits inside the
`ProcessModBuf` work packet because we do not know what objects are
alive at that moment in time.
wks (Collaborator), Apr 10, 2025:

One more bug: in GenImmix, ImmixSpaceArgs::unlog_object_when_traced is set to false. This means it will not set the unlog bits when tracing objects in the ImmixSpace. However, this PR skips modbuf processing in major GCs. This means that if a live mature object is logged, its unlog bit will remain 0 after the major GC. The object will not be logged again during mutator time after the GC when it is written to, and this will cause its young children to be erroneously considered garbage.

diff --git a/src/plan/generational/immix/global.rs b/src/plan/generational/immix/global.rs
index ba2850b3a4..207fde4f2c 100644
--- a/src/plan/generational/immix/global.rs
+++ b/src/plan/generational/immix/global.rs
@@ -252,7 +252,7 @@ impl<VM: VMBinding> GenImmix<VM> {
                 reset_log_bit_in_major_gc: false,
                 // We don't need to unlog objects at tracing. Instead, we unlog objects at copying.
                 // Any object is moved into the mature space, or is copied inside the mature space. We will unlog it.
-                unlog_object_when_traced: false,
+                unlog_object_when_traced: true,
                 // In GenImmix, young objects are not allocated in ImmixSpace directly.
                 #[cfg(feature = "vo_bit")]
                 mixed_age: false,

Setting unlog_object_when_traced: true is sufficient to fix this problem. We don't need to set reset_log_bit_in_major_gc because we never allocate young objects into ImmixSpace in GenImmix.

It should be unnecessary for StickyImmix, too, because this PR now clears unlog bits when freeing lines. I think I was mistaken in thinking there are stale unlog bits in StickyImmix: since reset_log_bit_in_major_gc == true, there can't be stale unlog bits after a major GC. And there can't be any stale log bits after minor GCs either, because no lines that contain mature objects are freed in minor GCs. We may revert a4dea71, or make the bzero execute only in major GCs.

@wks wks dismissed their stale review April 10, 2025 03:06

There can't be stale unlog bits given that ImmixSpaceArgs::reset_log_bit_in_major_gc is set to true. This shows we really need a centralized framework to clear the metadata bits.

k-sareen (Collaborator, Author):

This shows we really need a centralized framework to clear the metadata bits.

Yes, I totally agree. This is so incredibly confusing that I myself don't like this PR. But we need to have a fix for this bug. The centralized metadata framework is definitely the ideal.

wks (Collaborator), Apr 10, 2025:

I am considering a possibly unlikely scenario: what if a whole chunk of ImmixSpace is reused for another space, say CopySpace? If we allow stale bits to exist, the bits may end up being carried over to a different space. This is impossible for Map64 because the possible address ranges of different spaces never overlap, but it should be possible for Map32, although very unlikely. This affects not only the unlog bits but also any side metadata that allows stale values to exist.

wks (Collaborator), Apr 10, 2025:

I am considering a possibly unlikely scenario: what if a whole chunk of ImmixSpace is reused for another space, say CopySpace? If we allow stale bits to exist, the bits may end up being carried over to a different space. This is impossible for Map64 because the possible address ranges of different spaces never overlap, but it should be possible for Map32, although very unlikely. This affects not only the unlog bits but also any side metadata that allows stale values to exist.

Currently ImmixSpace uses BlockPageResource, which returns pages to the BlockPool and never returns any pages (or chunks) back to the VMMap. This means we are temporarily immune to this problem, but we must reconsider this issue in the future.

k-sareen (Collaborator, Author):

Well, this is something the metadata framework should handle, right? Clear all metadata for a new chunk when it is (re)allocated.

wks (Collaborator), Apr 10, 2025:

I am considering a possibly unlikely scenario: what if a whole chunk of ImmixSpace is reused for another space, say CopySpace? If we allow stale bits to exist, the bits may end up being carried over to a different space. This is impossible for Map64 because the possible address ranges of different spaces never overlap, but it should be possible for Map32, although very unlikely. This affects not only the unlog bits but also any side metadata that allows stale values to exist.

I think this is actually happening for GenCopy. Because the nursery, the from-space, and the to-space are all CopySpace, they do return chunks back to the VMMap, and some stale unlog bits are carried over to the nursery.

wks (Collaborator), Apr 10, 2025:

The following patch is sufficient to fix CopySpace. Just leave no stale log bits behind when releasing. (p.s. I wonder why this has been working for years.)

diff --git a/src/policy/copyspace.rs b/src/policy/copyspace.rs
index ddf8e5bb04..bbc66ccd4d 100644
--- a/src/policy/copyspace.rs
+++ b/src/policy/copyspace.rs
@@ -196,6 +196,12 @@ impl<VM: VMBinding> CopySpace<VM> {
                 side_forwarding_status_table.bzero_metadata(start, size);
             }
 
+            if self.common.needs_log_bit {
+                if let MetadataSpec::OnSide(side) = *VM::VMObjectModel::GLOBAL_LOG_BIT_SPEC {
+                    side.bzero_metadata(start, size);
+                }
+            }
+
             // Clear VO bits because all objects in the space are dead.
             #[cfg(feature = "vo_bit")]
             crate::util::metadata::vo_bit::bzero_vo_bit(start, size);

wks (Collaborator), Apr 10, 2025:

This patch asserts that there are no stale log bits when a bump allocator acquires a page. This is too aggressive, and we probably don't want it to be merged.

diff --git a/src/util/alloc/bumpallocator.rs b/src/util/alloc/bumpallocator.rs
index a436b2e07a..73301bab33 100644
--- a/src/util/alloc/bumpallocator.rs
+++ b/src/util/alloc/bumpallocator.rs
@@ -1,13 +1,13 @@
 use std::sync::Arc;
 
-use crate::util::Address;
+use crate::util::{Address, ObjectReference};
 
 use crate::util::alloc::Allocator;
 
 use crate::policy::space::Space;
 use crate::util::conversions::bytes_to_pages_up;
 use crate::util::opaque_pointer::*;
-use crate::vm::VMBinding;
+use crate::vm::{ObjectModel, VMBinding};
 
 /// Size of a bump allocator block. Currently it is set to 32 KB.
 const BLOCK_SIZE: usize = 8 << crate::util::constants::LOG_BYTES_IN_PAGE;
@@ -209,6 +209,16 @@ impl<VM: VMBinding> BumpAllocator<VM> {
                 block_size,
                 acquired_start
             );
+
+            for addr in (acquired_start.as_usize()..acquired_start.as_usize() + block_size).step_by(8) {
+                let objref = unsafe {
+                    ObjectReference::from_raw_address_unchecked(Address::from_usize(addr))
+                };
+                let unlogged = <VM as VMBinding>::VMObjectModel::GLOBAL_LOG_BIT_SPEC
+                    .is_unlogged::<VM>(objref, std::sync::atomic::Ordering::Relaxed);
+                assert!(!unlogged);
+            }
+
             if !stress_test {
                 self.set_limit(acquired_start, acquired_start + block_size);
                 self.alloc(size, align, offset)

k-sareen (Collaborator, Author):

The following patch is sufficient to fix CopySpace. Just leave no stale log bits behind when releasing. (p.s. I wonder why this has been working for years.)


Thanks. And yes, I wonder the same... This is a very non-trivial issue. We really should prioritize a better internal API for this.

k-sareen (Collaborator, Author):

@wks The assertion failure comes from SemiSpace and it's an assertion for the VO-bit... Sigh.

wks (Collaborator), Apr 11, 2025:

@wks The assertion failure comes from SemiSpace and it's an assertion for the VO-bit... Sigh.

The bug is in MonotonicPageResource::iterate_allocated_regions, particularly this line:

                    let size = if self.pr.cursor().chunk_index() == start.chunk_index() {
                        self.pr.cursor() - start

The intention is that if the cursor of the PR is within the same chunk, it will only scan up to that cursor.

However, there is a corner case: self.pr.cursor() may actually be "the end of the previous chunk", which has the same address as "the start of the current chunk". It will happen if:

  1. The Map32 gives chunk 2 to copyspace0. The chunk range is 0x41400000-0x41800000.
  2. The MonotonicPageResource of copyspace0 allocates from 0x41400000 to 0x41800000, using up chunk 2.
  3. The Map32 gives chunk 1 to copyspace0. The chunk range is 0x41000000-0x41400000.
  4. The MonotonicPageResource of copyspace0 allocates from 0x41000000 to 0x41400000, using up chunk 1, too. The cursor stops at 0x41400000, the end of the 0x41000000-0x41400000 range.
  5. GC starts and evacuates copyspace0. The GC then starts to free all chunks the page resource obtained from Map32.
  6. When visiting the 0x41400000-0x41800000 range, it thinks the cursor 0x41400000 is within that range. (Actually it is not. The cursor is at the end of the previous chunk.)
  7. The GC only clears the metadata from 0x41400000 to the cursor 0x41400000, which is exactly 0 bytes.
  8. The stale VO bits (and other bits, too) in 0x41400000-0x41800000 remain into the next GC and cause the assertion error.

So we should fix the iterator of the MonotonicPageResource.

k-sareen (Collaborator, Author):

Wow. Thank you for catching that. Is the solution just to do (self.pr.cursor() + 1).chunk_index() or what? Seems like a tricky edge case.

wks (Collaborator), Apr 11, 2025:

I made a PR for this. #1299

I just used the simplest solution. If the intent was to optimize for the case where the cursor is within the region, we can simply test whether the cursor is within the region with start < cursor && cursor < end.
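
A sketch of that check on plain addresses (simplified from MonotonicPageResource::iterate_allocated_regions; the function name is illustrative):

/// How many bytes of this region hold allocated objects. The strict test
/// start < cursor && cursor < end excludes the corner case where the cursor
/// equals this region's start because it actually stopped at the end of the
/// previous chunk; in that case the whole region is allocated.
fn allocated_len(start: usize, end: usize, cursor: usize) -> usize {
    if start < cursor && cursor < end {
        cursor - start // the cursor is genuinely inside this region
    } else {
        end - start // the whole region is allocated
    }
}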

@k-sareen k-sareen requested a review from wks April 12, 2025 01:08
wks (Collaborator) left a review:

Just a minor comment problem. It looks good otherwise.

wks (Collaborator) left a review:

LGTM

k-sareen (Collaborator, Author):

@wks Do you know what commit details are used for the rebase commit? The details from the first one? Or does it use the description of the PR? I just wanted to clean up some stuff in the description before I merged it.

wks (Collaborator), Apr 12, 2025:

@wks Do you know what commit details are used for the rebase commit? The details from the first one? Or does it use the description of the PR? I just wanted to clean up some stuff in the description before I merged it.

When we merge this PR, the commit message will use the title of this PR as the first line, followed by the description of this PR (i.e. the content of #1169 (comment)). You can edit them anytime before merging.

@k-sareen k-sareen changed the title Only reset unlog bits of objects in modbuf for nursery GCs Ensure log bits are correctly maintained Apr 12, 2025
k-sareen (Collaborator, Author):

Thank you. I've edited the description and title of the PR. If you think it's sufficient, please feel free to merge it. Thank you again for all your debugging help.

@wks wks added this pull request to the merge queue Apr 12, 2025
Merged via the queue into mmtk:master with commit 973bd02 Apr 12, 2025
31 of 34 checks passed
Labels
PR-extended-testing Run extended tests for the pull request