
Conversation

@smklein (Collaborator) commented Aug 19, 2025

  • Adds nexus_generation to the blueprint, both on Nexus zones and as a top-level field
    • When provisioning a new Nexus zone: if its image matches any existing Nexus zone, use that zone's nexus_generation value
    • Otherwise: choose a generation number higher than all existing instances (see the sketch below)
  • Changes deployment of Nexus zones to proactively provision new zones alongside old ones, rather than doing a replacement.
    • This PR does not implement the handoff process. However, it does permit "new" Nexus zones to expunge old Nexus zones which have an older nexus_generation, if any of the new Nexuses are running.
  • Adds a do_plan_nexus_generation_update method to the planner, which decides when the top-level Nexus generation number should be incremented.
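
For concreteness, here is the generation-selection rule above as a standalone sketch. The function name, map-based shape, and plain-integer Generation are illustrative only, not the PR's actual code:

use std::collections::BTreeMap;

// Illustrative stand-ins for the blueprint's zone image source and
// generation types.
type ImageSource = String;
type Generation = u64;

/// Hypothetical sketch: pick the nexus_generation for a new Nexus zone.
fn choose_nexus_generation(
    existing_zones: &BTreeMap<ImageSource, Generation>,
    new_zone_image: &ImageSource,
) -> Generation {
    if let Some(generation) = existing_zones.get(new_zone_image) {
        // The image matches an existing zone: reuse its generation.
        *generation
    } else {
        // Otherwise, go one past the highest existing generation.
        existing_zones.values().copied().max().map_or(1, |g| g + 1)
    }
}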

Fixes #8853, #8843

@smklein smklein force-pushed the nexus_generation branch 2 times, most recently from 85446aa to 7a3d744 on August 20, 2025 at 20:54
///
/// If `must_have_nexus_zones` is false, then these settings
/// are permitted to use default values.
pub fn sled_add_zone_nexus_internal(
Collaborator Author

There are some tests which want to "add a Nexus zone, even though we don't have existing Nexus zones".

They previously called sled_add_zone_nexus_with_config directly, but I want them to converge on this common pathway as much as possible, to share nexus_generation calculation logic.

To mitigate:

  • This API exposes a must_have_nexus_zones argument, which toggles whether or not we must copy data from existing Nexus zones
  • Most callers will use sled_add_zone_nexus, which uses must_have_nexus_zones = true
  • Callers in test cases that want to spawn Nexuses from nothing can use must_have_nexus_zones = false.

Collaborator

What about instead having nexus_generation be something that the caller always specifies? Then the code paths also won't diverge. I like this for a few reasons (see the sketch after this list):

  • We can keep the existing sled_add_zone_nexus() / sled_add_zone_nexus_with_config() split. I think this was pretty clean -- it clearly separated the two use cases and was very explicit in the second one ("I'm giving you exactly the config that you need"). must_have_nexus_zones confuses me -- what happens if I "must have them" but I don't? What happens if I don't need them but they're there?
  • It allows reconfigurator-cli (and tests) to control this directly. That in turn means people can test the handoff behavior without worrying about the images. (I guess the way I think about this is: the handoff behavior is purely a function of the generation numbers. For deployed systems, the images are used to determine the generation numbers. But that's a planning choice. Everything would work -- and it's probably useful for testing and such -- if someone used the same images everywhere but picked different generation numbers in order to trigger a handoff.)
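
A self-contained sketch of what that split could look like with a caller-supplied generation. The stub types stand in for omicron's real SledUuid, BlueprintZoneImageSource, and Generation types, and the method shapes are invented for illustration:

type SledUuid = u64;
type ImageSource = String;
type Generation = u64;

struct BlueprintBuilder {
    // (image, generation) of each in-service Nexus zone.
    nexus_zones: Vec<(ImageSource, Generation)>,
}

impl BlueprintBuilder {
    /// Convenience path: derives the generation from existing Nexus zones.
    fn sled_add_zone_nexus(&mut self, sled: SledUuid, image: ImageSource) {
        let generation = self.determine_nexus_generation(&image);
        self.sled_add_zone_nexus_with_config(sled, image, generation);
    }

    /// Explicit path: "I'm giving you exactly the config that you need,"
    /// including the generation -- usable directly by tests and
    /// reconfigurator-cli, e.g. to trigger a handoff with identical images.
    fn sled_add_zone_nexus_with_config(
        &mut self,
        _sled: SledUuid,
        image: ImageSource,
        generation: Generation,
    ) {
        self.nexus_zones.push((image, generation));
    }

    fn determine_nexus_generation(&self, image: &ImageSource) -> Generation {
        // Same rule as the earlier sketch: reuse a matching image's
        // generation, otherwise go one past the highest existing one.
        self.nexus_zones
            .iter()
            .find(|(i, _)| i == image)
            .map(|(_, g)| *g)
            .unwrap_or_else(|| {
                1 + self.nexus_zones.iter().map(|(_, g)| *g).max().unwrap_or(0)
            })
    }
}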

Collaborator Author

This seems reasonable. I still want to test some of the determine_nexus_generation error cases -- that's easier to do if I just call the internal function directly -- but that's definitely still possible with your proposal.

Updated in 5ce4870

@smklein smklein force-pushed the nexus_generation branch 4 times, most recently from e688fc6 to 8235551 on August 21, 2025 at 19:50
@smklein smklein marked this pull request as ready for review August 21, 2025 20:32
parent = blueprint;
}

panic!("did not converge after {MAX_PLANNING_ITERATIONS} iterations");
}

struct BlueprintGenerator {
Collaborator Author

I made this struct to help the actual contents of test_nexus_generation_update be easier to write... but after doing so, I'd be kinda on-board to move more tests over to using this explicitly.

IMO it helps make the test much more concise when blueprint generation is a one-liner.
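
For illustration, a toy version of that shape. Everything here is invented; the real struct would presumably hold the planner's actual inputs (example system, planning input, logger):

#[derive(Debug)]
struct Blueprint {
    id: u64,
    comment: String,
}

// Hypothetical test helper: owns the current blueprint so that each
// planning round in a test collapses to a one-liner.
struct BlueprintGenerator {
    blueprint: Blueprint,
}

impl BlueprintGenerator {
    /// Plans a child blueprint from the current one and makes it current.
    /// (A stand-in for running Planner::new_based_on(...).plan().)
    fn plan_new_blueprint(&mut self, comment: &str) -> &Blueprint {
        self.blueprint = Blueprint {
            id: self.blueprint.id + 1,
            comment: comment.to_string(),
        };
        &self.blueprint
    }
}

// In a test, each round then reads as:
//     let bp2 = generator.plan_new_blueprint("add new nexus zones");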

Comment on lines +1836 to +1839
report.set_waiting_on(
    NexusGenerationBumpWaitingOn::NewNexusBringup,
);
return Ok(report);
Collaborator Author

I actually do not have test coverage for this case, and would like to add it before this PR merges. I have struggled to do it through zone manipulation - because Nexus is discretionary, we'll be eager to add the new Nexus zones if we can (and why not? They should wait on boot for handoff).

To force this to happen, I'm thinking I'll need to construct a scenario where we expunge a sled so that we cannot actually place this new Nexus, and observe that the handoff does not occur while we're operating at a reduced capacity.

@davepacheco (Collaborator) left a comment

Thanks, @smklein!

This PR's gotten pretty big and I think it has at least two pretty separable pieces:

  • Adding nexus_generation to the blueprint (in-memory + database). These parts of this PR already look pretty solid to me. That's also all I need for #8875.
  • The planner changes that implement the behaviors around nexus_generation. This is a lot trickier and will take more time to get to ground.

Could you separate these into separate PRs? That'll make them much easier to review and gain confidence in, and it will also unblock the quiesce work sooner.

You could also separate out the change to the way we report "discretionary zones placed", but that's small and simple enough that it's less critical to me.

@@ -109,6 +112,31 @@ const NUM_CONCURRENT_MGS_UPDATES: usize = 1;
/// A receipt that `check_input_validity` has been run prior to planning.
struct InputChecked;

#[derive(Debug)]
#[expect(dead_code)]
Collaborator

This is admittedly the first time I've seen #[expect(dead_code)], but why is it here? It looks like this struct is used?

Contributor

Ah I think it's because we only use it for the Debug impl.

Collaborator Author

These structs were admittedly already here on main -- I was moving them to be usable outside the single do_plan_zone_updates function. (See:

#[derive(Debug)]
#[expect(dead_code)]
struct ZoneCurrentlyUpdating<'a> {
    zone_id: OmicronZoneUuid,
    zone_kind: ZoneKind,
    reason: UpdatingReason<'a>,
}

#[derive(Debug)]
#[expect(dead_code)]
enum UpdatingReason<'a> {
    ImageSourceMismatch {
        bp_image_source: &'a BlueprintZoneImageSource,
        inv_image_source: &'a OmicronZoneImageSource,
    },
    MissingInInventory {
        bp_image_source: &'a BlueprintZoneImageSource,
    },
    ReconciliationError {
        bp_image_source: &'a BlueprintZoneImageSource,
        inv_image_source: &'a OmicronZoneImageSource,
        message: &'a str,
    },
}
)

I believe this is using expect instead of allow as part of the lint-expectation feature stabilized in Rust 1.81, where the attribute both suppresses the lint and warns if the lint ever stops firing.

We definitely are using these fields, because they're emitted to a log via the Debug implementation, but they're otherwise never read directly. I can confirm that removing the attribute results in (unwanted) compiler warnings about the fields never being read -- even though they do end up in logs.
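
A standalone illustration of that behavior (this is stock Rust 1.81+ semantics; the struct is a made-up stand-in):

// The fields are written at construction and reach logs only through the
// derived Debug impl, which the dead_code lint does not count as a read.
#[derive(Debug)]
#[expect(dead_code)]
struct ZonePropagationStatus {
    zone_id: u64,
    message: String,
}

fn main() {
    let status = ZonePropagationStatus {
        zone_id: 7,
        message: "awaiting inventory".to_string(),
    };
    // Debug formatting uses the fields at runtime, but not in a way the
    // lint recognizes -- so without the attribute, rustc warns that the
    // fields are never read.
    println!("status: {status:?}");
    // Unlike #[allow], #[expect] itself warns (unfulfilled_lint_expectations)
    // if dead_code ever stops firing here -- e.g. because the fields later
    // gain direct readers -- flagging the attribute as stale.
}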

Comment on lines 115 to 138
#[derive(Debug)]
#[expect(dead_code)]
struct ZoneCurrentlyUpdating<'a> {
    zone_id: OmicronZoneUuid,
    zone_kind: ZoneKind,
    reason: UpdatingReason<'a>,
}

#[derive(Debug)]
#[expect(dead_code)]
enum UpdatingReason<'a> {
    ImageSourceMismatch {
        bp_image_source: &'a BlueprintZoneImageSource,
        inv_image_source: &'a OmicronZoneImageSource,
    },
    MissingInInventory {
        bp_image_source: &'a BlueprintZoneImageSource,
    },
    ReconciliationError {
        bp_image_source: &'a BlueprintZoneImageSource,
        inv_image_source: &'a OmicronZoneImageSource,
        message: &'a str,
    },
}
Collaborator

Some doc comments might help here. I'm confused about what these are supposed to mean. I would have thought ZoneCurrentlyUpdating with a reason would mean "this zone is being updated and here's why". But then I don't get why MissingInInventory or ReconciliationError would be a reason that a zone would be updating.

Collaborator Author

As mentioned above, I'm refactoring this to use the get_zones_not_yet_propagated_to_inventory function -- but I believe that's why this is reconciliation-focused. These responses are more about "is the inventory in sync with the blueprint" than the more specific question of "has an update completed".

I'll update names away from "update" and more towards "zone propagation" in 55ebd1c

)
})
.collect(),
let image_sources = match zone_kind {
Collaborator

I feel like we could use a comment explaining more context here. Something like:

// Our goal here is to make sure that if we have less redundancy for discretionary zones than needed, we deploy additional ones. For most zones, we only care about the total count of that kind of zone. The way we deploy Nexus means we need the expected count for redundancy for _both_ active zone images.

No pressure to use any of that text -- it's just an example of what felt missing.

let our_image = self.lookup_current_nexus_image();

let mut images = vec![];
if old_image != new_image {
Collaborator

If we replace the Nexus identity in the PlanningInput with the list of which Nexus instances are in charge, then I think the logic here becomes something like the following (sketched in code after this list):

  • always include the new image
  • also include the image for the Nexus instances currently in charge, if it's different
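
That proposal, as a minimal sketch. The names are invented; in_charge_image would come from whatever records which Nexus instances are currently in charge:

type ImageSource = String;

// Hypothetical: which Nexus images need full discretionary redundancy.
fn nexus_image_sources(
    new_image: ImageSource,
    in_charge_image: ImageSource,
) -> Vec<ImageSource> {
    // Always include the new image.
    let mut images = vec![new_image];
    // Also include the image of the Nexus instances currently in charge,
    // if it's different.
    if in_charge_image != images[0] {
        images.push(in_charge_image);
    }
    images
}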

Comment on lines +2342 to +2347
if self.nexus_generation != current_generation {
    return Err(Error::NexusGenerationMismatch {
        expected: current_generation,
        actual: self.nexus_generation,
    });
}
Collaborator

Why do we check this?

Comment on lines +2221 to +2261
ZoneKind::Nexus => {
    // Get the nexus_generation of the zone being considered for shutdown
    let zone_nexus_generation = match &zone.zone_type {
        BlueprintZoneType::Nexus(nexus_zone) => {
            nexus_zone.nexus_generation
        }
        _ => unreachable!("zone kind is Nexus but type is not"),
    };

    let Some(current_gen) = self.lookup_current_nexus_generation()
    else {
        // If we don't know the current Nexus zone ID, or its
        // generation, we can't perform the handoff safety check.
        report.unsafe_zone(
            zone,
            Nexus {
                zone_generation: zone_nexus_generation,
                current_nexus_generation: None,
            },
        );
        return false;
    };

    // It's only safe to shut down if handoff has occurred.
    //
    // That only happens when the current generation of Nexus (the
    // one running right now) is greater than the zone we're
    // considering expunging.
    if current_gen <= zone_nexus_generation {
        report.unsafe_zone(
            zone,
            Nexus {
                zone_generation: zone_nexus_generation,
                current_nexus_generation: Some(current_gen),
            },
        );
        return false;
    }

    true
}
Collaborator

What problem is this trying to prevent?

Comment on lines +127 to +131
/// ID of the currently running Nexus zone
///
/// This is used to identify which Nexus is currently executing the planning
/// operation, which is needed for safe shutdown decisions during handoff.
current_nexus_zone_id: Option<OmicronZoneUuid>,
Collaborator

I'd strongly suggest that instead of putting the current Nexus zone into the planning input, let's put either the currently in-charge Nexus generation or else the set of Nexus instances currently in control. (If you have the blueprint, you can compute either of these from the other.) PlanningInputFromDb could determine this based on the contents of db_metadata_nexus. (That could be done in a separate PR if we want to keep this PR decoupled from the db_metadata_nexus one.)

There are a few reasons for this (a sketch of the suggested shape follows the list):

  • It's confusing to me that this field is both "important" and "optional". What's the semantics of it being None? Does that mean certain planning operations fail or do the wrong thing? Do we just not do any of those operations from the contexts where we're providing None today? On the other hand, if we say this is the set of instances currently in-charge, I'm hoping we can fill in some values here. I took a quick look through the callers that are providing None here and I think they're basically all either using PlanningInputFromDb::assemble (which can get the real value of "who's in charge" from the database) or else are tests that have a blueprint available (so we could have a helper that pulls them out of that blueprint).
  • In terms of comprehensibility of the system: it feels weird to me that the result of planning would depend on who is doing the planning. With this PR, every Nexus in a running system will be providing different input to its planner, which feels like it confuses the aim of determinism in the planning process.
  • It would eliminate quite a lot of call sites where you've had to add None.
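
A sketch of that suggested shape. The field name is invented; the idea is that PlanningInputFromDb::assemble would fill it from db_metadata_nexus:

type Generation = u64;

struct PlanningInput {
    // ...existing planning-input fields...

    /// Generation of the Nexus instances currently in charge, as recorded
    /// in the database. Every Nexus supplies the same value here, keeping
    /// planning deterministic regardless of which instance runs it.
    in_charge_nexus_generation: Generation,
}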

Comment on lines +254 to +256
pub fn set_current_nexus_zone_id(&mut self, id: OmicronZoneUuid) {
    self.current_nexus_zone_id = Some(id);
}
Collaborator

nit: I've thought of these types as PlanningInput being immutable and PlanningInputBuilder being the mutable version. What do you think of having callers do this:

let mut new_builder = planning_input.into_builder();
new_builder.set(/* ... */);
let planning_input = new_builder.build();

It's a little more verbose, but I feel like it preserves the nice property that when you're modifying it, you're working with a builder. The planning input itself remains immutable. (Again, take it or leave it.)
