Conversation

davepacheco
Collaborator

@davepacheco davepacheco commented Aug 20, 2025

Depends on #8863. Fixes #8855 and #8856.

Note: since this branches from 8863, if that branch gets force-pushed, this one will also need to be force-pushed. This may break your incremental review flow.

This PR makes the initial changes required for Nexus quiesce described by RFD 588:

  • Refactors the mechanics of Nexus quiesce to use a new NexusQuiesceHandle so that it can be invoked from a background task.
  • Triggers quiesce from the blueprint_execution background task based on whether the current target blueprint says we should be handing off. (This is what fixes #8855 and #8856.) It does this even if blueprint execution is disabled.
  • Disallows the creation of new sagas until we figure out if we're quiesced or not. In practice, this should happen soon after Nexus startup. It depends on the blueprint_loader and then blueprint_execution background tasks running.
  • Changes the quiesce states to reflect what they will be when we finish RFD 588. "WaitingFor" is changed to "Draining" (e.g., WaitingForSagas -> DrainingSagas) and there's a new RecordingQuiesce state that will cover the period after we've determined we're quiesced and before we've written the database record saying so.
  • Makes it legal to re-assign sagas after becoming "locally drained" (as defined in RFD 588). #8796 explains why this is desirable, and RFD 588 explains how this will eventually result in a safe handoff. However, this PR does not implement #8796 -- more below.
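To make the renamed states concrete, here is a minimal sketch of what the quiesce state machine described above might look like. This is illustrative only: DrainingSagas and RecordingQuiesce come from the description above, but the other variant names and the helper are hypothetical, not taken from the actual code.

```rust
// Hypothetical sketch of the renamed quiesce states described above.
// Only DrainingSagas and RecordingQuiesce are named in the PR description;
// the rest are illustrative.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum QuiesceState {
    /// normal operation; new sagas may be created
    Running,
    /// no new sagas; waiting for in-progress sagas to finish
    /// (previously "WaitingForSagas")
    DrainingSagas,
    /// locally drained; writing the database record saying so
    /// (the new state that this PR transitions through immediately)
    RecordingQuiesce,
    /// fully quiesced
    Quiesced,
}

impl QuiesceState {
    /// Whether new sagas may be created in this state.
    fn sagas_allowed(self) -> bool {
        matches!(self, QuiesceState::Running)
    }
}

fn main() {
    assert!(QuiesceState::Running.sagas_allowed());
    assert!(!QuiesceState::DrainingSagas.sagas_allowed());
    assert!(!QuiesceState::RecordingQuiesce.sagas_allowed());
    println!("ok");
}
```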

This PR does not change the quiesce process to coordinate among the Nexus instances (as described in RFD 588) nor update the db_metadata_nexus table. Both of these are blocked on #8845.

The net result is a little awkward:

  • There's this new RecordingQuiesce state that's basically unused (it is technically used, but we transition through it immediately)
  • Since this PR now allows saga reassignment after becoming locally drained, but does not wait for other Nexus instances to finish draining before entering db-quiesce, it is now possible to drain sagas from this Nexus, have it enter db-quiesce, and then wind up re-assigning sagas to itself, which would then become implicitly abandoned.

I think this is okay as an intermediate state, even on "main", since this cannot be triggered except by an online update, and we know we're going to fix these as part of shipping that.

@davepacheco davepacheco requested a review from jgallagher August 20, 2025 23:16
@davepacheco
Collaborator Author

I think this is almost ready for review, except that I'm going to need to have the test suite wait for the saga quiesce determination to be made before running tests (a lot of tests probably assume they can just go ahead and run sagas right away), and I'm working through some issues trying to do that.

@smklein smklein force-pushed the nexus_generation branch 5 times, most recently from 8235551 to ecd6a00 Compare August 21, 2025 19:53
@davepacheco davepacheco force-pushed the dap/handoff-quiesce-1 branch from 161491a to 127d5a8 Compare August 21, 2025 22:38
@davepacheco
Collaborator Author

As expected, I force-pushed this to sync up with #8863. I gather that branch will not be force-pushed again so I think this one should be stable, too.

I'm still working through testing.

@@ -5016,7 +5016,7 @@ async fn cmd_db_dns_diff(
// Load the added and removed items.
use nexus_db_schema::schema::dns_name::dsl;

-    let added = dsl::dns_name
+    let mut added = dsl::dns_name
Collaborator Author

The changes in this file are so that omdb output doesn't change as a result of unspecified sort order (in this case, of DNS records). This became a problem after I added the "second" Nexus zone to the blueprint (see other comment).
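The fix follows a common pattern for stabilizing tool output: when a query's result order is unspecified, sort the rows before printing. A minimal sketch (not the actual omdb code; the record values are hypothetical):

```rust
// Illustrative sketch, not the actual omdb code: when a query's result
// order is unspecified (as with these DNS records), sort the rows before
// printing so the output is deterministic.
fn sorted_records(mut rows: Vec<(String, String)>) -> Vec<(String, String)> {
    // The binding must be mutable (hence `let added` -> `let mut added`
    // in the diff above) so the rows can be sorted in place.
    rows.sort();
    rows
}

fn main() {
    // Hypothetical records; in omdb these come from the dns_name table.
    let rows = vec![
        ("test-suite-silo.sys".to_string(), "AAAA 100::1".to_string()),
        ("test-suite-silo.sys".to_string(), "A 127.0.0.1".to_string()),
    ];
    for (name, record) in sorted_records(rows) {
        println!("{name} {record}");
    }
}
```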

+ test-suite-silo.sys A 127.0.0.1
+ test-suite-silo.sys (records: 2)
+ A 127.0.0.1
+ AAAA 100::1
Collaborator Author

The extra records here and in other outputs below result from having added a "second" Nexus to the test suite / omicron-dev blueprint (see other comment).

@@ -489,15 +493,21 @@ task: "nat_garbage_collector"

task: "blueprint_loader"
configured period: every <REDACTED_DURATION>m <REDACTED_DURATION>s
- last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
+ last completed activation: <REDACTED ITERATIONS>, triggered by an explicit signal
Collaborator Author

A bunch of the background tasks' output changed because previously there was no target blueprint by the time they ran. The changes in this PR make test suite startup block on the first blueprint having been loaded, which gave a lot of these tasks something to do.

started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
last completion reported error: no blueprint
warning: unknown background task: "crdb_node_id_collector" (don't know how to interpret details: Object {"errors": Array [Object {"err": String("failed to fetch node ID for zone ..........<REDACTED_UUID>........... at http://[::1]:REDACTED_PORT: Communication Error: error sending request for url (http://[::1]:REDACTED_PORT/node/id): error sending request for url (http://[::1]:REDACTED_PORT/node/id): client error (Connect): tcp connect error: Connection refused (os error 146)"), "zone_id": String("..........<REDACTED_UUID>...........")}], "nsuccess": Number(0)})
Collaborator Author

This curious line seems to be more fallout from having a target blueprint loaded by the time omdb queried background task status. The test suite presumably always produced this error because it doesn't start a CockroachDB admin server, but we didn't notice it here before because there wasn't a target blueprint loaded yet, so we didn't try to do anything.

@@ -251,6 +230,27 @@ async fn test_omdb_success_cases(cptestctx: &ControlPlaneTestContext) {
],
// This one should fail because it has no parent.
&["nexus", "blueprints", "diff", &initial_blueprint_id],
// chicken switches: show and set
&["nexus", "chicken-switches", "show", "current"],
Collaborator Author

These got moved later in the sequence because they actually changed the output of omdb nexus blueprints show, which is also tested in this sequence. That had changed because setting this chicken switch enabled the planner, which then went and made a new blueprint.

for task in &[
&self.background_tasks.task_internal_dns_config,
&self.background_tasks.task_internal_dns_servers,
&self.background_tasks.task_external_dns_config,
&self.background_tasks.task_external_dns_servers,
&self.background_tasks.task_external_endpoints,
&self.background_tasks.task_inventory_collection,
&self.background_tasks.task_blueprint_loader,
Collaborator Author

This is appropriate because we just inserted the first blueprint, so it makes sense to activate the blueprint loader so that it loads it.

This is important because now we're blocking saga enablement and so Nexus startup on having loaded and started executing the first blueprint. So without this, it could take quite a while for Nexus to notice there was a blueprint, enable sagas, and complete startup.

Comment on lines -306 to -309
// - each Nexus created for testing gets its own id so they don't see each
// others sagas and try to recover them
Collaborator Author

I think this comment is ancient and (I hope) wrong. I think we always create a fresh context with a fresh database and don't need to generate a new id every time.

Fixing this was important to ensure that the sort order for omdb nexus blueprints show was consistent when run in the context of the test suite. Otherwise, now that there are two Nexus zones, one of which has a fixed uuid, they could flip-flop in their order based on the uuid that got assigned here.

Comment on lines +843 to +845
// Besides the Nexus that we just started, add an entry in the blueprint
// for the Nexus that developers can start using
// nexus/examples/config-second.toml.
Collaborator Author

This is the "second" Nexus instance that I mentioned in other comments here.

We have a useful dev flow for running a second Nexus instance against omicron-dev run-all. I use this a lot, including for this change. Without this change, this PR breaks that flow because the new Nexus doesn't find itself in the blueprint and doesn't know whether it should come up quiesced or not.

This fix is janky, but I think not more so than the rest of the test suite's startup behavior.

@davepacheco
Collaborator Author

Besides CI, I've tested that the manual quiesce process works basically the same way that it did before, the same way I tested it at demo day (using omdb nexus sagas demo-create / demo-complete to control sagas and omdb nexus quiesce show / start to observe and control quiesce state). It's hard to do a fuller test until more of this is done but let me know if there are other tests folks think I should do.

I also tested that if Nexus can't figure out if it's quiesced, it reports that and disallows sagas. Before I fixed the test suite to include this second Nexus in its blueprint, the new Nexus would report:

Aug 21 19:04:01.375 ERRO blueprint execution: failed to determine if this Nexus is quiescing, error: zone a4ef738a-1fb0-47b1-9da2-4919c7ec7c7f does not exist in blueprint, background_task: blueprint_executor, component: BackgroundTasks, component: nexus, component: ServerContext, name: a4ef738a-1fb0-47b1-9da2-4919c7ec7c7f, file: nexus/src/app/background/tasks/blueprint_execution.rs:123

Trying to query its quiesce state reports:

$ ./target/debug/omdb nexus --nexus-internal-url http://[::1]:12223/ quiesce show
note: using Nexus URL http://[::1]:12223/
has not yet determined if it is quiescing
sagas running: 0
database connections held: 0

$ ./target/debug/omdb -w nexus --nexus-internal-url http://[::1]:12223/ sagas demo-create
note: using Nexus URL http://[::1]:12223/
Error: creating demo saga

Caused by:
    Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "0a2e1c34-0b99-4f95-8dd7-d8ead852501a", "content-length": "133", "date": "Thu, 21 Aug 2025 19:03:24 GMT"}; value: Error { error_code: Some("ServiceNotAvailable"), message: "Service Unavailable", request_id: "0a2e1c34-0b99-4f95-8dd7-d8ead852501a" }

which was this:

Aug 21 19:03:24.104 INFO request completed, error_message_external: Service Unavailable, error_message_internal: saga creation is disallowed (unknown yet if we're quiescing), latency_us: 338, response_code: 503, uri: //demo-saga, method: POST, req_id: 0a2e1c34-0b99-4f95-8dd7-d8ead852501a, remote_addr: [::1]:42433, local_addr: [::1]:12223, component: dropshot_internal, name: a4ef738a-1fb0-47b1-9da2-4919c7ec7c7f, file: /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.16.3/src/server.rs:855

and creating sagas fails as it should.


Another test that would be good to add is:

  • it quiesces when the blueprint says so
  • it comes up quiesced when the blueprint says so

@davepacheco davepacheco marked this pull request as ready for review August 21, 2025 23:13
Contributor

@jgallagher jgallagher left a comment

I'm a little nervous about landing this on main without the followup work to coordinate Nexuses because we do want to test online update off of main. But maybe in tests we're not likely to have all that many sagas anyway, and if we end up with implicitly abandoned ones it's similarly not that big of a deal?

Ok(saga_quiesce
.reassign_sagas(async || {
// For any expunged Nexus zones, re-assign in-progress
// sagas to some other Nexus. If this fails for some
Contributor

I realize this comment was here before, but can we refine "some other Nexus" to "ourself"?

*q = new_state;
true
}
_ => {
Contributor

Can we expand _ to list all the cases? I find it hard to reason about // All other cases are ... when they aren't listed.

Collaborator Author

I had it in a previous iteration but I found it less clear. I can put that back and see what it looks like.

/// cannot then re-enable sagas.
pub fn set_quiescing(&self, quiescing: bool) {
self.inner.send_if_modified(|q| {
let new_state = if quiescing {
Contributor

Nit - this is only used in the SagasAllowed::DisallowedUnknown arm below - can we move it into that arm?

);
false
}
_ => {
Contributor

I wonder if this match would be clearer if we nested the ifs in the two branches above, which (I think?) would let us drop this _ entirely; e.g.,

SagasAllowed::Allowed => { 
    // If sagas are currently allowed but we need to quiesce, switch to disallowing them.
    if quiescing {
        // ... all the stuff from above ...
    }
}

Collaborator Author

I think I know what you mean. I think that would make it clearer what the current _ covers (though I tried to explain that with the comment), but I think it would be less clear what the bigger picture is. I think of it like this: there are three interesting cases, and they're represented by the non-_ arms here. With your change, two of the "interesting" cases become sub-branches of two of the arms, and the other two branches within those arms are the very same "uninteresting" case (the quiesce state already matches what we're being asked to do). I don't feel strongly!

Contributor

I think I just really dislike _ and will take nearly any excuse to drop it. 😂
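To make the tradeoff in this thread concrete, here's a minimal sketch of the exhaustive-arms style being suggested. It uses the SagasAllowed::Allowed and SagasAllowed::DisallowedUnknown names visible in the diff plus a hypothetical DisallowedQuiesce variant, and the transition rules are illustrative, not the real logic (only the "quiesce is one-way" rule comes from the comments above):

```rust
// Sketch of the exhaustive-match style discussed above. Allowed and
// DisallowedUnknown appear in the diff; DisallowedQuiesce and the
// transition rules are hypothetical.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SagasAllowed {
    /// haven't yet determined whether we're quiescing
    DisallowedUnknown,
    /// sagas may be created
    Allowed,
    /// quiescing; sagas may never be re-enabled
    DisallowedQuiesce,
}

/// Exhaustive version: every (state, input) pair is listed, so adding a
/// new variant forces this match to be revisited -- no `_` arm to
/// silently absorb it.
fn next_state(current: SagasAllowed, quiescing: bool) -> SagasAllowed {
    match (current, quiescing) {
        (SagasAllowed::DisallowedUnknown, true) => SagasAllowed::DisallowedQuiesce,
        (SagasAllowed::DisallowedUnknown, false) => SagasAllowed::Allowed,
        (SagasAllowed::Allowed, true) => SagasAllowed::DisallowedQuiesce,
        // Already in the requested state: no change.
        (SagasAllowed::Allowed, false) => SagasAllowed::Allowed,
        // Quiesce is one-way: once disallowed for quiesce, stay there.
        (SagasAllowed::DisallowedQuiesce, _) => SagasAllowed::DisallowedQuiesce,
    }
}

fn main() {
    assert_eq!(
        next_state(SagasAllowed::DisallowedUnknown, false),
        SagasAllowed::Allowed
    );
    assert_eq!(
        next_state(SagasAllowed::DisallowedQuiesce, false),
        SagasAllowed::DisallowedQuiesce
    );
    println!("ok");
}
```

The cost, as noted in the thread, is that the "uninteresting" no-change cases get spelled out; the benefit is that the compiler flags this match when a new variant is added.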

@davepacheco
Collaborator Author

I'm a little nervous about landing this on main without the followup work to coordinate Nexuses because we do want to test online update off of main. But maybe in tests we're not likely to have all that many sagas anyway, and if we end up with implicitly abandoned ones it's similarly not that big of a deal?

I think it's even less bad than that. The problem introduced by this PR if we're testing online update is that if we get far enough into the update to start the handoff process and at some point before the handoff completes there's an expunged Nexus zone with sagas assigned to it, then we'd wind up with implicitly abandoned sagas. For this to happen, we must have expunged a Nexus outside the upgrade process with sagas assigned to it. That seems pretty unlikely in our testing, right?

@jgallagher
Contributor

For this to happen, we must have expunged a Nexus outside the upgrade process with sagas assigned to it. That seems pretty unlikely in our testing, right?

Ahh yeah great point; thanks. I'm much less concerned now.

@davepacheco
Collaborator Author

@jgallagher I applied the fix that I think you were suggesting for the failing test. Now we only add the "second" Nexus when running omicron-dev run-all. This did fix the test locally and I'm also able to run the "second Nexus" flow with omicron-dev run-all.

We might run into new problems soon? I'm not sure. With #8845 the new Nexus won't find its db_metadata_nexus record and so won't start up, but maybe that'll work okay because blueprint execution in the first Nexus will create that record. If this becomes a problem, I think we may want to look at creating a more explicit step in the flow where we properly add the Nexus to the system.

At the very least, this change allows this PR to not break things.
