Conversation

davepacheco
Collaborator

@davepacheco davepacheco commented Aug 20, 2025

Depends on #8863. Fixes #8855 and #8856.

Note: since this branches from 8863, if that branch gets force-pushed, this one will also need to be force-pushed. This may break your incremental review flow.

This PR makes the initial changes required for Nexus quiesce described by RFD 588:

  • Refactors the mechanics of Nexus quiesce to use a new NexusQuiesceHandle so that it can be invoked from a background task.
  • Triggers quiesce from the blueprint_execution background task based on whether the current target blueprint says we should be handing off. (This is what fixes #8855 and #8856.) It does this even if blueprint execution is disabled.
  • Disallows the creation of new sagas until we figure out if we're quiesced or not. In practice, this should happen soon after Nexus startup. It depends on the blueprint_loader and then blueprint_execution background tasks running.
  • Changes the quiesce states to reflect what they will be when we finish RFD 588. "WaitingFor" is changed to "Draining" (e.g., WaitingForSagas -> DrainingSagas) and there's a new RecordingQuiesce state that will cover the period after we've determined we're quiesced and before we've written the database record saying so.
  • Makes it legal to re-assign sagas after becoming "locally drained" (as defined in RFD 588). #8796 explains why this is desirable, and RFD 588 explains how this will eventually result in a safe handoff. However, this PR does not implement #8796 -- more below.
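To make the renamed states concrete, here is a minimal sketch of what the quiesce state machine described above might look like. This is illustrative only: DrainingSagas and RecordingQuiesce come from the description above, but the other variant names and the helper are hypothetical, not taken from the actual code.

```rust
// Hypothetical sketch of the renamed quiesce states described above.
// Only DrainingSagas and RecordingQuiesce are named in the PR description;
// the rest are illustrative.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum QuiesceState {
    /// normal operation; new sagas may be created
    Running,
    /// no new sagas; waiting for in-progress sagas to finish
    /// (previously "WaitingForSagas")
    DrainingSagas,
    /// locally drained; writing the database record saying so
    /// (the new state that this PR transitions through immediately)
    RecordingQuiesce,
    /// fully quiesced
    Quiesced,
}

impl QuiesceState {
    /// Whether new sagas may be created in this state.
    fn sagas_allowed(self) -> bool {
        matches!(self, QuiesceState::Running)
    }
}

fn main() {
    assert!(QuiesceState::Running.sagas_allowed());
    assert!(!QuiesceState::DrainingSagas.sagas_allowed());
    assert!(!QuiesceState::RecordingQuiesce.sagas_allowed());
    println!("ok");
}
```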

This PR does not change the quiesce process to coordinate among the Nexus instances (as described in RFD 588) nor update the db_metadata_nexus table. Both of these are blocked on #8845.

The net result is a little awkward:

  • There's this new RecordingQuiesce state that's basically unused (it is technically used, but we transition through it immediately)
  • Since this PR now allows saga reassignment after becoming locally drained, but does not wait for other Nexus instances to finish draining before entering db-quiesce, it is now possible to drain sagas from this Nexus, have it enter db-quiesce, and then wind up re-assigning sagas to itself, which would then become implicitly abandoned.

I think this is okay as an intermediate state, even on "main", since this cannot be triggered except by an online update, and we know we're going to fix these as part of shipping that.

@davepacheco davepacheco requested a review from jgallagher August 20, 2025 23:16
@davepacheco
Collaborator Author

I think this is almost ready for review, except that I'm going to need to have the test suite wait for the saga quiesce determination to be made before running tests (a lot of tests probably assume they can just go ahead and run sagas right away), and I'm working through some issues trying to do that.

@smklein smklein force-pushed the nexus_generation branch 5 times, most recently from 8235551 to ecd6a00 Compare August 21, 2025 19:53
@davepacheco davepacheco force-pushed the dap/handoff-quiesce-1 branch from 161491a to 127d5a8 Compare August 21, 2025 22:38
@davepacheco
Collaborator Author

As expected, I force-pushed this to sync up with #8863. I gather that branch will not be force-pushed again so I think this one should be stable, too.

I'm still working through testing.

@@ -5016,7 +5016,7 @@ async fn cmd_db_dns_diff(
// Load the added and removed items.
use nexus_db_schema::schema::dns_name::dsl;

-    let added = dsl::dns_name
+    let mut added = dsl::dns_name
Collaborator Author

The changes in this file are so that omdb output doesn't change as a result of unspecified sort order (in this case, of DNS records). This became a problem after I added the "second" Nexus zone to the blueprint (see other comment).
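The fix follows a common pattern for stabilizing tool output: when a query's result order is unspecified, sort the rows before printing. A minimal sketch (not the actual omdb code; the record values are hypothetical):

```rust
// Illustrative sketch, not the actual omdb code: when a query's result
// order is unspecified (as with these DNS records), sort the rows before
// printing so the output is deterministic.
fn sorted_records(mut rows: Vec<(String, String)>) -> Vec<(String, String)> {
    // The binding must be mutable (hence `let added` -> `let mut added`
    // in the diff above) so the rows can be sorted in place.
    rows.sort();
    rows
}

fn main() {
    // Hypothetical records; in omdb these come from the dns_name table.
    let rows = vec![
        ("test-suite-silo.sys".to_string(), "AAAA 100::1".to_string()),
        ("test-suite-silo.sys".to_string(), "A 127.0.0.1".to_string()),
    ];
    for (name, record) in sorted_records(rows) {
        println!("{name} {record}");
    }
}
```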

+ test-suite-silo.sys A 127.0.0.1
+ test-suite-silo.sys (records: 2)
+ A 127.0.0.1
+ AAAA 100::1
Collaborator Author

The extra records here and in other outputs below result from having added a "second" Nexus to the test suite / omicron-dev blueprint (see other comment).

@@ -489,15 +493,21 @@ task: "nat_garbage_collector"

task: "blueprint_loader"
configured period: every <REDACTED_DURATION>m <REDACTED_DURATION>s
- last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
+ last completed activation: <REDACTED ITERATIONS>, triggered by an explicit signal
Collaborator Author

A bunch of the background tasks' output changed because previously there was no target blueprint by the time they ran. The changes in this PR make test suite startup block on the first blueprint having been loaded, which gave a lot of these tasks something to do.

started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
last completion reported error: no blueprint
warning: unknown background task: "crdb_node_id_collector" (don't know how to interpret details: Object {"errors": Array [Object {"err": String("failed to fetch node ID for zone ..........<REDACTED_UUID>........... at http://[::1]:REDACTED_PORT: Communication Error: error sending request for url (http://[::1]:REDACTED_PORT/node/id): error sending request for url (http://[::1]:REDACTED_PORT/node/id): client error (Connect): tcp connect error: Connection refused (os error 146)"), "zone_id": String("..........<REDACTED_UUID>...........")}], "nsuccess": Number(0)})
Collaborator Author

This curious line seems to be more fallout from having a target blueprint loaded by the time omdb queried background task status. The test suite presumably always produced this error because it doesn't start a CockroachDB admin server, but we didn't notice it here before because there wasn't a target blueprint loaded yet, so we didn't try to do anything.

@@ -251,6 +230,27 @@ async fn test_omdb_success_cases(cptestctx: &ControlPlaneTestContext) {
],
// This one should fail because it has no parent.
&["nexus", "blueprints", "diff", &initial_blueprint_id],
// chicken switches: show and set
&["nexus", "chicken-switches", "show", "current"],
Collaborator Author

These got moved later in the sequence because they actually changed the output of omdb nexus blueprints show, which is also tested in this sequence. That had changed because setting this chicken switch enabled the planner, which then went and made a new blueprint.

for task in &[
&self.background_tasks.task_internal_dns_config,
&self.background_tasks.task_internal_dns_servers,
&self.background_tasks.task_external_dns_config,
&self.background_tasks.task_external_dns_servers,
&self.background_tasks.task_external_endpoints,
&self.background_tasks.task_inventory_collection,
&self.background_tasks.task_blueprint_loader,
Collaborator Author

This is appropriate because we just inserted the first blueprint, so it makes sense to activate the blueprint loader so that it loads it.

This is important because now we're blocking saga enablement and so Nexus startup on having loaded and started executing the first blueprint. So without this, it could take quite a while for Nexus to notice there was a blueprint, enable sagas, and complete startup.

Comment on lines -306 to -309
// - each Nexus created for testing gets its own id so they don't see each
// others sagas and try to recover them
Collaborator Author

I think this comment is ancient and (I hope) wrong. I think we always create a fresh context with a fresh database and don't need to generate a new id every time.

Fixing this was important to ensure that the sort order for omdb nexus blueprints show was consistent when run in the context of the test suite. Otherwise, now that there are two Nexus zones, one of which has a fixed uuid, they could flip-flop in their order based on the uuid that got assigned here.

Comment on lines +843 to +845
// Besides the Nexus that we just started, add an entry in the blueprint
// for the Nexus that developers can start using
// nexus/examples/config-second.toml.
Collaborator Author

This is the "second" Nexus instance that I mentioned in other comments here.

We have a useful dev flow for running a second Nexus instance against omicron-dev run-all. I use this a lot, including for this change. Without this change, this PR breaks that flow because the new Nexus doesn't find itself in the blueprint and doesn't know whether it should come up quiesced or not.

This fix is janky, but I think not more so than the rest of the test suite's startup behavior.

@davepacheco
Collaborator Author

Besides CI, I've tested that the manual quiesce process works basically the same way that it did before, the same way I tested it at demo day (using omdb nexus sagas demo-create / demo-complete to control sagas and omdb nexus quiesce show / start to observe and control quiesce state). It's hard to do a fuller test until more of this is done but let me know if there are other tests folks think I should do.

I also tested that if Nexus can't figure out if it's quiesced, it reports that and disallows sagas. Before I fixed the test suite to include this second Nexus in its blueprint, the new Nexus would report:

Aug 21 19:04:01.375 ERRO blueprint execution: failed to determine if this Nexus is quiescing, error: zone a4ef738a-1fb0-47b1-9da2-4919c7ec7c7f does not exist in blueprint, background_task: blueprint_executor, component: BackgroundTasks, component: nexus, component: ServerContext, name: a4ef738a-1fb0-47b1-9da2-4919c7ec7c7f, file: nexus/src/app/background/tasks/blueprint_execution.rs:123

Trying to query its quiesce state reports:

$ ./target/debug/omdb nexus --nexus-internal-url http://[::1]:12223/ quiesce show
note: using Nexus URL http://[::1]:12223/
has not yet determined if it is quiescing
sagas running: 0
database connections held: 0

$ ./target/debug/omdb -w nexus --nexus-internal-url http://[::1]:12223/ sagas demo-create
note: using Nexus URL http://[::1]:12223/
Error: creating demo saga

Caused by:
    Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "0a2e1c34-0b99-4f95-8dd7-d8ead852501a", "content-length": "133", "date": "Thu, 21 Aug 2025 19:03:24 GMT"}; value: Error { error_code: Some("ServiceNotAvailable"), message: "Service Unavailable", request_id: "0a2e1c34-0b99-4f95-8dd7-d8ead852501a" }

which was this:

Aug 21 19:03:24.104 INFO request completed, error_message_external: Service Unavailable, error_message_internal: saga creation is disallowed (unknown yet if we're quiescing), latency_us: 338, response_code: 503, uri: //demo-saga, method: POST, req_id: 0a2e1c34-0b99-4f95-8dd7-d8ead852501a, remote_addr: [::1]:42433, local_addr: [::1]:12223, component: dropshot_internal, name: a4ef738a-1fb0-47b1-9da2-4919c7ec7c7f, file: /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.16.3/src/server.rs:855

and creating sagas fails as it should.


Another test that would be good to add is:

  • it quiesces when the blueprint says so
  • it comes up quiesced when the blueprint says so

@davepacheco davepacheco marked this pull request as ready for review August 21, 2025 23:13
Contributor

@jgallagher jgallagher left a comment

I'm a little nervous about landing this on main without the followup work to coordinate Nexuses because we do want to test online update off of main. But maybe in tests we're not likely to have all that many sagas anyway, and if we end up with implicitly abandoned ones it's similarly not that big of a deal?

Ok(saga_quiesce
.reassign_sagas(async || {
// For any expunged Nexus zones, re-assign in-progress
// sagas to some other Nexus. If this fails for some
Contributor

I realize this comment was here before, but can we refine "some other Nexus" to "ourself"?

*q = new_state;
true
}
_ => {
Contributor

Can we expand _ to list all the cases? I find it hard to reason about // All other cases are ... when they aren't listed.

Collaborator Author

I had it in a previous iteration but I found it less clear. I can put that back and see what it looks like.

/// cannot then re-enable sagas.
pub fn set_quiescing(&self, quiescing: bool) {
self.inner.send_if_modified(|q| {
let new_state = if quiescing {
Contributor

Nit - this is only used in the SagasAllowed::DisallowedUnknown arm below - can we move it into that arm?

);
false
}
_ => {
Contributor

I wonder if this match would be clearer if we nested the ifs in the two branches above, which (I think?) would let us drop this _ entirely; e.g.,

SagasAllowed::Allowed => { 
    // If sagas are currently allowed but we need to quiesce, switch to disallowing them.
    if quiescing {
        // ... all the stuff from above ...
    }
}

Collaborator Author

I think I know what you mean. I think that would make it clearer what the current _ covers (though I tried to explain that with the comment), but I think it would be less clear what the bigger picture is. I think of it like this: there are three interesting cases, and they're represented by the non-_ arms here. With your change, two of the "interesting" cases become sub-branches of two of the arms, and the other two branches within those arms are the very same "uninteresting" case (the quiesce state already matches what we're being asked to do). I don't feel strongly!

Contributor

I think I just really dislike _ and will take nearly any excuse to drop it. 😂
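To make the tradeoff in this thread concrete, here's a minimal sketch of the exhaustive-arms style being suggested. It uses the SagasAllowed::Allowed and SagasAllowed::DisallowedUnknown names visible in the diff plus a hypothetical DisallowedQuiesce variant, and the transition rules are illustrative, not the real logic (only the "quiesce is one-way" rule comes from the comments above):

```rust
// Sketch of the exhaustive-match style discussed above. Allowed and
// DisallowedUnknown appear in the diff; DisallowedQuiesce and the
// transition rules are hypothetical.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SagasAllowed {
    /// haven't yet determined whether we're quiescing
    DisallowedUnknown,
    /// sagas may be created
    Allowed,
    /// quiescing; sagas may never be re-enabled
    DisallowedQuiesce,
}

/// Exhaustive version: every (state, input) pair is listed, so adding a
/// new variant forces this match to be revisited -- no `_` arm to
/// silently absorb it.
fn next_state(current: SagasAllowed, quiescing: bool) -> SagasAllowed {
    match (current, quiescing) {
        (SagasAllowed::DisallowedUnknown, true) => SagasAllowed::DisallowedQuiesce,
        (SagasAllowed::DisallowedUnknown, false) => SagasAllowed::Allowed,
        (SagasAllowed::Allowed, true) => SagasAllowed::DisallowedQuiesce,
        // Already in the requested state: no change.
        (SagasAllowed::Allowed, false) => SagasAllowed::Allowed,
        // Quiesce is one-way: once disallowed for quiesce, stay there.
        (SagasAllowed::DisallowedQuiesce, _) => SagasAllowed::DisallowedQuiesce,
    }
}

fn main() {
    assert_eq!(
        next_state(SagasAllowed::DisallowedUnknown, false),
        SagasAllowed::Allowed
    );
    assert_eq!(
        next_state(SagasAllowed::DisallowedQuiesce, false),
        SagasAllowed::DisallowedQuiesce
    );
    println!("ok");
}
```

The cost, as noted in the thread, is that the "uninteresting" no-change cases get spelled out; the benefit is that the compiler flags this match when a new variant is added.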

@davepacheco
Collaborator Author

I'm a little nervous about landing this on main without the followup work to coordinate Nexuses because we do want to test online update off of main. But maybe in tests we're not likely to have all that many sagas anyway, and if we end up with implicitly abandoned ones it's similarly not that big of a deal?

I think it's even less bad than that. The problem introduced by this PR if we're testing online update is that if we get far enough into the update to start the handoff process and at some point before the handoff completes there's an expunged Nexus zone with sagas assigned to it, then we'd wind up with implicitly abandoned sagas. For this to happen, we must have expunged a Nexus outside the upgrade process with sagas assigned to it. That seems pretty unlikely in our testing, right?

@jgallagher
Contributor

For this to happen, we must have expunged a Nexus outside the upgrade process with sagas assigned to it. That seems pretty unlikely in our testing, right?

Ahh yeah great point; thanks. I'm much less concerned now.

@davepacheco
Collaborator Author

@jgallagher I applied the fix that I think you were suggesting for the failing test. Now we only add the "second" Nexus when running omicron-dev run-all. This did fix the test locally and I'm also able to run the "second Nexus" flow with omicron-dev run-all.

We might run into new problems soon? I'm not sure. With #8845 the new Nexus won't find its db_metadata_nexus record and so won't start up, but maybe that'll work okay because blueprint execution in the first Nexus will create that record. If this becomes a problem, I think we may want to look at creating a more explicit step in the flow where we properly add the Nexus to the system.

At the very least, this change allows this PR to not break things.
