Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite and simplify CheckpointExecutor #21234

Merged
merged 8 commits into from
Feb 20, 2025
Merged

Conversation

mystenmark
Copy link
Contributor

@mystenmark mystenmark commented Feb 14, 2025

Stacked on #21232, #21233 and #21235

This simplifies CheckpointExecutor massively, in preparation for optimizations which will be needed for throughput improvements

TODO

  • Antithesis tests - passed an 8 hour test
  • PTN tests - no performance or correctness problems observed
  • sui-single-node-benchmark tests - performance is comparable with old version after dialing concurrency down a bit

Copy link

vercel bot commented Feb 14, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
sui-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Feb 20, 2025 8:18pm
2 Skipped Deployments
Name Status Preview Comments Updated (UTC)
multisig-toolkit ⬜️ Ignored (Inspect) Visit Preview Feb 20, 2025 8:18pm
sui-kiosk ⬜️ Ignored (Inspect) Visit Preview Feb 20, 2025 8:18pm


// Gets the next checkpoint to schedule for execution. If the epoch is already
// completed, returns None.
fn get_next_to_schedule(&self) -> Option<u64> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is u64?

should this be CheckpointSequenceNumber?

#[instrument(level = "error", skip_all, fields(epoch = ?self.epoch_store.epoch()))]
pub async fn run_epoch(self, run_with_range: Option<RunWithRange>) -> StopReason {
let _metrics_guard = mysten_metrics::monitored_scope("CheckpointExecutor::run_epoch");
debug!(?run_with_range, "CheckpointExecutor::run_epoch");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be part of the function level instrumentation above, or why is it a separate debug!?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#[instrument] on its own doesn't produce any logs, it only creates a span.

/// forked, and return when finished.
/// If `run_with_range` is set, execution will stop early.
#[instrument(level = "error", skip_all, fields(epoch = ?self.epoch_store.epoch()))]
pub async fn run_epoch(self, run_with_range: Option<RunWithRange>) -> StopReason {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not something you named this PR, but RunWithRange isn't really a range right? it's just an upper bound?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose so! Going to leave that be for now though

stop_seq,
};

futures::stream::unfold(Some(state), |state| async move {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

up to you but I wonder if this could be written to read more like a straightforward function without needing things wrapped in an Option<State> and so on, if you used async_stream::stream!?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually started off with async_stream and I could go back to it... but I ran into a bug and found that debugging async_stream::stream! to be quite awful. If you expand that macro it's incomprehensible. I could go back to it though, no strong opinion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I could see how that would happen. Fully defer to you on which style is more readable/maintanable

let delta_t = now.duration_since(last_update);
let delta_c = transaction_count - self.transaction_count;
let tps = delta_c as f64 / delta_t.as_secs_f64();
self.tps = self.tps * 0.9 + tps * 0.1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this 0.9 and 0.1 stuff? if it's a moving average can this use simple_moving_average?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an exponential moving average https://en.wikipedia.org/wiki/Exponential_smoothing - I could do an SMA if this proves to be inaccurate. I don't even know what this metric is used for, just trying not to break any dependencies. I checked in PTN that this gives reasonable estimates.

Copy link
Contributor

@aschran aschran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

preemptively approving, i only had nits

@mystenmark mystenmark temporarily deployed to sui-typescript-aws-kms-test-env February 20, 2025 18:49 — with GitHub Actions Inactive
@mystenmark mystenmark temporarily deployed to sui-typescript-aws-kms-test-env February 20, 2025 20:16 — with GitHub Actions Inactive
@mystenmark mystenmark merged commit fc0bad2 into main Feb 20, 2025
47 checks passed
@mystenmark mystenmark deleted the mlogan-ckpt-exec-rewrite branch February 20, 2025 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants