start updating quiesce for new Nexus handoff #8875

base: nexus_generation
Conversation
I think this is almost ready for review, except that I think I'm going to need to have the test suite wait for the saga quiesce determination to be made before running the test (a lot of tests probably assume they can just go ahead and run sagas right away), and I'm working through some issues trying to do that.
Force-pushed from 8235551 to ecd6a00
Force-pushed from 161491a to 127d5a8
As expected, I force-pushed this to sync up with #8863. I gather that branch will not be force-pushed again, so I think this one should be stable too. I'm still working through testing.
```diff
@@ -5016,7 +5016,7 @@ async fn cmd_db_dns_diff(
     // Load the added and removed items.
     use nexus_db_schema::schema::dns_name::dsl;

-    let added = dsl::dns_name
+    let mut added = dsl::dns_name
```
The changes in this file are so that `omdb` output doesn't change as a result of unspecified sort order (in this case, of DNS records). This became a problem after I added the "second" Nexus zone to the blueprint (see other comment).
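As a minimal, hypothetical illustration of the idea (sort explicitly rather than relying on whatever order the database returns rows in — the type and function names here are made up, not the actual omdb code):

```rust
// Hypothetical record type standing in for a DNS name row; the real
// omdb code sorts actual query results instead.
#[derive(Debug, PartialEq, Eq)]
struct DnsName {
    name: String,
    records: usize,
}

// Sort by name so that output doesn't depend on the (unspecified)
// order in which the database returns rows.
fn sorted_for_display(mut names: Vec<DnsName>) -> Vec<DnsName> {
    names.sort_by(|a, b| a.name.cmp(&b.name));
    names
}

fn main() {
    let rows = vec![
        DnsName { name: "b.sys".to_string(), records: 1 },
        DnsName { name: "a.sys".to_string(), records: 2 },
    ];
    let sorted = sorted_for_display(rows);
    assert_eq!(sorted[0].name, "a.sys");
    assert_eq!(sorted[1].name, "b.sys");
}
```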
dev-tools/omdb/tests/successes.out (Outdated)
```diff
+ test-suite-silo.sys A 127.0.0.1
+ test-suite-silo.sys (records: 2)
+     A 127.0.0.1
+     AAAA 100::1
```
The extra records here and in other outputs below result from having added a "second" Nexus to the test suite / `omicron-dev` blueprint (see other comment).
```diff
@@ -489,15 +493,21 @@ task: "nat_garbage_collector"

 task: "blueprint_loader"
     configured period: every <REDACTED_DURATION>m <REDACTED_DURATION>s
-    last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
+    last completed activation: <REDACTED ITERATIONS>, triggered by an explicit signal
```
A bunch of the background tasks' output changed because previously there was no target blueprint when they ran. The changes in this PR make test suite startup block on the first blueprint having been loaded, which gave a lot of these tasks something to do.
dev-tools/omdb/tests/successes.out (Outdated)

```
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
last completion reported error: no blueprint
warning: unknown background task: "crdb_node_id_collector" (don't know how to interpret details: Object {"errors": Array [Object {"err": String("failed to fetch node ID for zone ..........<REDACTED_UUID>........... at http://[::1]:REDACTED_PORT: Communication Error: error sending request for url (http://[::1]:REDACTED_PORT/node/id): error sending request for url (http://[::1]:REDACTED_PORT/node/id): client error (Connect): tcp connect error: Connection refused (os error 146)"), "zone_id": String("..........<REDACTED_UUID>...........")}], "nsuccess": Number(0)})
```
This curious line seems to be more fallout from having a target blueprint loaded by the time omdb queried background task status. This presumably always produced this error in the test suite, because the test suite doesn't start a CockroachDB admin server, but we didn't notice it here before because there was no target blueprint loaded yet, so we didn't try to do anything.
```diff
@@ -251,6 +230,27 @@ async fn test_omdb_success_cases(cptestctx: &ControlPlaneTestContext) {
     ],
     // This one should fail because it has no parent.
     &["nexus", "blueprints", "diff", &initial_blueprint_id],
+    // chicken switches: show and set
+    &["nexus", "chicken-switches", "show", "current"],
```
These got moved later in the sequence because they actually changed the output of `omdb nexus blueprints show`, which is also tested in this sequence. That had changed because setting this chicken switch enabled the planner, which then went and made a new blueprint.
```rust
for task in &[
    &self.background_tasks.task_internal_dns_config,
    &self.background_tasks.task_internal_dns_servers,
    &self.background_tasks.task_external_dns_config,
    &self.background_tasks.task_external_dns_servers,
    &self.background_tasks.task_external_endpoints,
    &self.background_tasks.task_inventory_collection,
    &self.background_tasks.task_blueprint_loader,
```
This is appropriate because we just inserted the first blueprint, so it makes sense to activate the blueprint loader so that it loads it.

This is important because we're now blocking saga enablement, and thus Nexus startup, on having loaded and started executing the first blueprint. Without this, it could take quite a while for Nexus to notice there was a blueprint, enable sagas, and complete startup.
```rust
// - each Nexus created for testing gets its own id so they don't see each
//   others sagas and try to recover them
```
I think this comment is ancient and (I hope) wrong. I think we always create a fresh context with a fresh database and don't need to generate a new id every time.

Fixing this was important to ensure that the sort order for `omdb nexus blueprints show` was consistent when run in the context of the test suite. Otherwise, now that there are two Nexus zones, one of which has a fixed uuid, they could flip-flop in their order based on the uuid that got assigned here.
```rust
// Besides the Nexus that we just started, add an entry in the blueprint
// for the Nexus that developers can start using
// nexus/examples/config-second.toml.
```
This is the "second" Nexus instance that I mentioned in other comments here.

We have a useful dev flow for running a second Nexus instance against `omicron-dev run-all`. I use this a lot, including for this change. It's broken by this PR without this change because the new Nexus doesn't find itself in the blueprint and doesn't know if it should come up quiesced or not.

This fix is janky, but I think not more so than the rest of the test suite's startup behavior.
Besides CI, I've tested that the manual quiesce process works basically the same way that it did before, the same way I tested it at demo day. I also tested that if Nexus can't figure out whether it's quiesced, it reports that and disallows sagas. Before I fixed the test suite to include this second Nexus in its blueprint, the new Nexus would report that its quiesce state was unknown when queried, and creating sagas failed as it should.
Another test that would be good to add is:
I'm a little nervous about landing this on main without the followup work to coordinate Nexuses because we do want to test online update off of main. But maybe in tests we're not likely to have all that many sagas anyway, and if we end up with implicitly abandoned ones it's similarly not that big of a deal?
```rust
Ok(saga_quiesce
    .reassign_sagas(async || {
        // For any expunged Nexus zones, re-assign in-progress
        // sagas to some other Nexus. If this fails for some
```
I realize this comment was here before, but can we refine "some other Nexus" to "ourself"?
nexus/src/app/quiesce.rs (Outdated)

```rust
            *q = new_state;
            true
        }
        _ => {
```
Can we expand `_` to list all the cases? I find it hard to reason about `// All other cases are ...` when they aren't listed.
I had it in a previous iteration but I found it less clear. I can put that back and see what it looks like.
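For what it's worth, here is a hedged sketch of what spelling out every arm could look like. The enum and its transitions are simplified guesses for illustration, not the actual code in nexus/types/src/quiesce.rs; the variant names follow the ones mentioned in this review, and I'm assuming that once quiescing begins sagas are never re-enabled:

```rust
// Hypothetical, simplified version of the enum under discussion; the
// real type and its variants may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SagasAllowed {
    Allowed,
    DisallowedQuiesce,
    DisallowedUnknown,
}

// Returns the new state, listing every combination explicitly so that
// the compiler flags this match if a new variant is ever added.
fn next_state(current: SagasAllowed, quiescing: bool) -> SagasAllowed {
    match (current, quiescing) {
        // Sagas are allowed, but we've been asked to quiesce.
        (SagasAllowed::Allowed, true) => SagasAllowed::DisallowedQuiesce,
        // We didn't know our state at startup; now we do.
        (SagasAllowed::DisallowedUnknown, false) => SagasAllowed::Allowed,
        (SagasAllowed::DisallowedUnknown, true) => SagasAllowed::DisallowedQuiesce,
        // Already in the requested state: no change. Once quiescing has
        // begun we never re-enable sagas.
        (SagasAllowed::Allowed, false)
        | (SagasAllowed::DisallowedQuiesce, _) => current,
    }
}

fn main() {
    assert_eq!(
        next_state(SagasAllowed::Allowed, true),
        SagasAllowed::DisallowedQuiesce
    );
    // Quiescing is one-way: asking to un-quiesce changes nothing.
    assert_eq!(
        next_state(SagasAllowed::DisallowedQuiesce, false),
        SagasAllowed::DisallowedQuiesce
    );
}
```

The payoff of avoiding `_` here is that adding a fourth variant turns into a compile error at this match rather than silently falling into the catch-all arm.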
nexus/types/src/quiesce.rs (Outdated)

```rust
    /// cannot then re-enable sagas.
    pub fn set_quiescing(&self, quiescing: bool) {
        self.inner.send_if_modified(|q| {
            let new_state = if quiescing {
```
Nit: this is only used in the `SagasAllowed::DisallowedUnknown` arm below; can we move it into that arm?
nexus/types/src/quiesce.rs (Outdated)

```rust
            );
            false
        }
        _ => {
```
I wonder if this match would be clearer if we nested the `if`s in the two branches above, which (I think?) would let us drop this `_` entirely; e.g.,

```rust
SagasAllowed::Allowed => {
    // If sagas are currently allowed but we need to quiesce, switch to
    // disallowing them.
    if quiescing {
        // ... all the stuff from above ...
    }
}
```
I think I know what you mean. I think that would make it clearer what the current `_` covers (though I tried to explain that with the comment). But I think it would be less clear what the bigger picture is. I think of this like: there are three interesting cases, and they're represented by the non-`_` arms here. With your change, two of the "interesting" cases are actually sub-branches of two of the arms, and the other two branches within those arms are the very same "uninteresting" case (quiesce state already matches what we're being asked to do). I don't feel strongly!
I think I just really dislike `_` and will take nearly any excuse to drop it. 😂
I think it's even less bad than that. The problem introduced by this PR if we're testing online update is that if we get far enough into the update to start the handoff process, and at some point before the handoff completes there's an expunged Nexus zone with sagas assigned to it, then we'd wind up with implicitly abandoned sagas. For this to happen, we must have expunged a Nexus outside the upgrade process with sagas assigned to it. That seems pretty unlikely in our testing, right?
Ahh yeah, great point; thanks. I'm much less concerned now.
@jgallagher I applied the fix that I think you were suggesting for the failing test. Now we only add the "second" Nexus when running `omicron-dev run-all`. We might run into new problems soon? I'm not sure. With #8845 the new Nexus won't find its `db_metadata_nexus` record. At the very least, this change allows this PR to not break things.
Depends on #8863. Fixes #8855 and #8856.
Note: since this branch is based on #8863, if that branch gets force-pushed, this one will also need to be force-pushed. This may break your incremental review flow.
This PR makes the initial changes required for Nexus quiesce described by RFD 588:

- Quiescing is now driven through `NexusQuiesceHandle`. This was so that it can be invoked from a background task.
- Quiesce is triggered by the `blueprint_execution` background task based on whether the current target blueprint says we should be handing off. (This is what fixes "Reconfigurator: Start quiescing if target bp `nexus_generation` is greater than our `nexus_generation`" #8855 and "Nexus Boot: Start quiescing if target bp `nexus_generation` is greater than our `nexus_generation`" #8856.) It does this even if blueprint execution is disabled.
- A quiesce state was renamed (`WaitingForSagas` -> `DrainingSagas`) and there's a new `RecordingQuiesce` state that will cover the period after we've determined we're quiesced and before we've written the database record saying so.

This PR does not change the quiesce process to coordinate among the Nexus instances (as described in RFD 588) nor update the `db_metadata_nexus` table. Both of these are blocked on #8845.

The net result is a little awkward:

- There's a `RecordingQuiesce` state that's basically unused (it is technically used, but we transition through it immediately).

I think this is okay as an intermediate state, even on "main", since this cannot be triggered except by an online update, and we know we're going to fix these as part of shipping that.
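For illustration, the state progression described here might be sketched as follows. Only `DrainingSagas` (renamed from `WaitingForSagas`) and `RecordingQuiesce` are named in this PR; the other variant names, and the strictly linear transitions, are assumptions made for the sketch:

```rust
// Hedged sketch of the quiesce-state progression described above; the
// real type surely carries more data and more states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum QuiesceState {
    Running,          // assumed name for "not quiescing"
    DrainingSagas,    // renamed from WaitingForSagas in this PR
    RecordingQuiesce, // new: quiesced, but DB record not yet written
    Quiesced,         // assumed name for the terminal state
}

// Advance one step. Per the note above, RecordingQuiesce is currently
// transitioned through immediately.
fn advance(s: QuiesceState) -> QuiesceState {
    match s {
        QuiesceState::Running => QuiesceState::DrainingSagas,
        QuiesceState::DrainingSagas => QuiesceState::RecordingQuiesce,
        QuiesceState::RecordingQuiesce => QuiesceState::Quiesced,
        QuiesceState::Quiesced => QuiesceState::Quiesced,
    }
}

fn main() {
    let mut s = QuiesceState::Running;
    while s != QuiesceState::Quiesced {
        s = advance(s);
    }
    assert_eq!(s, QuiesceState::Quiesced);
}
```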