TQ: Implement prepare and commit for initial config #8682

Open
wants to merge 1 commit into
base: main
from

andrewjstone
Contributor

This PR implements the handlers for preparing and committing initial configurations. A new property-based test covers them, along with aborts issued at Nexus when the coordinator for the initial configuration has crashed.

The new property-based test runs all possible nodes in the universe as the system under test (SUT), rather than running only the coordinator. This allows a full deterministic simulation of the protocol and checking of invariants at all nodes. It's also easier to write and understand, as we don't have to capture and mock replies to the coordinator. I had always intended to write this test, but started by modelling the coordinator since I thought it would be easier to build the protocol incrementally that way. However, it appears just as easy to build incrementally with all nodes as the SUT.

The new test does not have a model of the system, since building one is exceedingly hard for such a protocol. Instead, the test checks invariants against the real state of the SUT after every action, and allows peppering in postconditions as necessary for each action or operation.
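To illustrate the "no model, check invariants after every action" shape, here is a minimal sketch using only the standard library. It is not the test in this PR: `NodeState`, `Action`, `apply`, `check_invariants`, and `run` are hypothetical names, the invariant is a simplified stand-in, and a real property-based test would generate the action sequences rather than take a fixed slice.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-node state; not the real trust quorum types.
#[derive(Debug, Default, Clone)]
struct NodeState {
    prepared_epoch: Option<u64>,
    committed_epoch: Option<u64>,
}

/// Hypothetical actions that a test generator could produce.
#[derive(Debug, Clone)]
enum Action {
    Prepare { node: u8, epoch: u64 },
    Commit { node: u8, epoch: u64 },
}

/// Apply one action directly to the state of the nodes under test.
fn apply(nodes: &mut BTreeMap<u8, NodeState>, action: &Action) {
    match *action {
        Action::Prepare { node, epoch } => {
            let state = nodes.entry(node).or_default();
            // Ignore stale prepares for older epochs.
            if state.prepared_epoch.map_or(true, |p| epoch > p) {
                state.prepared_epoch = Some(epoch);
            }
        }
        Action::Commit { node, epoch } => {
            let state = nodes.entry(node).or_default();
            // Only commit if this node has prepared exactly this epoch.
            if state.prepared_epoch == Some(epoch) {
                state.committed_epoch = Some(epoch);
            }
        }
    }
}

/// Simplified invariant: a node never commits an epoch newer than
/// the latest epoch it has prepared.
fn check_invariants(nodes: &BTreeMap<u8, NodeState>) -> Result<(), String> {
    for (id, state) in nodes {
        if let Some(committed) = state.committed_epoch {
            let ok = state.prepared_epoch.map_or(false, |p| p >= committed);
            if !ok {
                return Err(format!("node {id} committed {committed} without a prepare"));
            }
        }
    }
    Ok(())
}

/// Run every action against all nodes, checking invariants after each one.
fn run(actions: &[Action]) -> Result<BTreeMap<u8, NodeState>, String> {
    let mut nodes = BTreeMap::new();
    for action in actions {
        apply(&mut nodes, action);
        check_invariants(&nodes)?;
    }
    Ok(nodes)
}
```

Calling `run` with a sequence such as prepare, commit, then a commit for an unprepared node exercises the check at every step; per-action postconditions could be added inside `apply`'s call site in the same way.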

The Node API has also changed to not worry about time at all; instead, it deals in terms of connections and disconnections. This makes for simpler code IMO, and matches what was done for LRTQ. We always operate over sprockets streams, which run over TLS over TCP, so it makes little sense to model things as if arbitrary packets can get dropped and reordered.
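As a rough sketch of what a time-free, connection-driven API can look like: the node below has no tick or timeout handling and only re-sends an unacknowledged request when a peer (re)connects. The names `Node`, `send`, `on_connect`, `on_disconnect`, and `on_ack` are hypothetical and chosen for illustration; they are not the API in this PR.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical peer identifier.
type PeerId = u8;

/// Hypothetical node that reacts to connection events instead of timers.
#[derive(Debug, Default)]
struct Node {
    /// Peers with a currently established stream.
    connected: BTreeSet<PeerId>,
    /// The latest request to each peer that has not been acknowledged yet.
    unacked: BTreeMap<PeerId, String>,
    /// Messages handed to the transport, in order.
    outbox: Vec<(PeerId, String)>,
}

impl Node {
    /// Record a request; transmit it now only if the peer is connected.
    fn send(&mut self, peer: PeerId, msg: &str) {
        self.unacked.insert(peer, msg.to_string());
        if self.connected.contains(&peer) {
            self.outbox.push((peer, msg.to_string()));
        }
    }

    /// On (re)connection, re-send whatever the peer has not acked.
    /// The stream is reliable and ordered, so no retry timer is needed.
    fn on_connect(&mut self, peer: PeerId) {
        self.connected.insert(peer);
        if let Some(msg) = self.unacked.get(&peer) {
            self.outbox.push((peer, msg.clone()));
        }
    }

    /// On disconnection, just forget the stream; unacked state is kept.
    fn on_disconnect(&mut self, peer: PeerId) {
        self.connected.remove(&peer);
    }

    /// An ack clears the pending request for that peer.
    fn on_ack(&mut self, peer: PeerId) {
        self.unacked.remove(&peer);
    }
}
```

Because delivery within a connected stream is reliable and ordered, the only events a test has to generate in this sketch are connects, disconnects, and message deliveries, which keeps the simulation deterministic.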

As a result of the new proptest and the change in time usage, I've decided to drop the coordinator test altogether. It's too complicated for the value it adds, and moving quickly is a priority.

@andrewjstone andrewjstone requested a review from sunshowers July 24, 2025 16:35
//
// Nexus should only attempt to commit nodes that have acknowledged
// a `Prepare`. The most likely reason that this has occurred
// is that the node has lost its state on the M.2 drives. It can
Contributor Author

I realized that recovery is not actually guaranteed here, as the drive could have been wiped after the node acked the latest configuration but before it rotated keys. A new configuration could then have been issued that doesn't contain the encrypted rack secret for the unrotated keys' epoch. I think this is a rare scenario, and I also think we probably don't need to handle byzantine failure here. Instead, this should probably be an alarm state (similar to what was done in #8062) and a support call if the data on the M.2s is gone.

@andrewjstone andrewjstone force-pushed the tq-commit-and-prepare-ack branch from 771cde7 to 071f1cf Compare July 24, 2025 18:11