
Pull out cockroach-metrics, add minimal stats to inventory #8426

Merged
merged 46 commits into main from range-in-inventory on Jul 10, 2025

Conversation

smklein (Collaborator) commented on Jun 23, 2025

This PR directly follows the extraction from #8379.

It pulls two of these metrics into inventory, where the reconfigurator will use them in #8441 to decide whether a CockroachDB zone can be safely updated.

smklein marked this pull request as ready for review on June 27, 2025 18:00
smklein (Collaborator, Author) commented on Jun 27, 2025

Interesting; I'm seeing "ranges_underreplicated = 55" in the CI test failure here -- I didn't see that locally, and didn't realize that could happen with a single node. I might need to make this test more lenient.

Base automatically changed from crdb-prometheus to main June 30, 2025 18:34
.await
.context("looking up cockroach addresses")?;

// TODO: Allow a hard-coded option to find the admin interface here.

Contributor:

Should we add cockroach-admin to DNS (so the call above could use ServiceName::CockroachAdmin)? (Not a blocker for this PR certainly, but maybe the cleanest way to address this TODO)?
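
Roughly what that could look like, as a sketch only (the Resolver and ServiceName below are stand-ins for the real internal-DNS types, and the method name is an assumption):

use std::net::SocketAddrV6;

// Stand-ins for the real internal-DNS resolver and service-name enum; the
// actual types, variants, and method names may differ.
enum ServiceName {
    CockroachAdmin,
}

struct Resolver;

impl Resolver {
    async fn lookup_all_socket_v6(
        &self,
        _service: ServiceName,
    ) -> anyhow::Result<Vec<SocketAddrV6>> {
        unimplemented!("placeholder for a DNS lookup")
    }
}

// With a cockroach-admin DNS record, the admin addresses could be resolved
// directly rather than derived from the Cockroach SQL addresses.
async fn cockroach_admin_addresses(
    resolver: &Resolver,
) -> anyhow::Result<Vec<SocketAddrV6>> {
    resolver.lookup_all_socket_v6(ServiceName::CockroachAdmin).await
}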

smklein (Collaborator, Author):

I'll file an issue for this: #8496

Updated references to this in 208b663

})
.collect();

cockroach_admin_client.update_backends(admin_addresses.as_slice()).await;

Contributor:

In the context of the conversation we had yesterday about CRDB safety - should we record the results of these metrics from all nodes instead of the first that responds successfully? Would it be reasonable to treat a failure to fetch metrics from a node as "the cluster may not be healthy"?

smklein (Collaborator, Author):

I'm uncertain about this. There are basically two (or more?) signals about node health we could use:

  1. What do the HTTP endpoints say (e.g., we ask node 2, it says nodes 1-5 are healthy)
  2. What do we infer from querying HTTP endpoints (e.g., we don't hear back from node 2's HTTP endpoint in time)

(Also, arguably: 3. What if one or both of (1) and (2) say "we're healthy", but then requests to that node fail for some other reason?)

What should we do in the case where "the rest of the cluster has identified node N as healthy, but the HTTP server for node N is not responding?"

Beyond a simple failure (e.g., misconfiguring the HTTP server), I'm really unsure how to proceed in this case. If we are using it as a signal, does that mean we need to wait for all CRDB nodes to respond to us to decide if they're healthy?

smklein (Collaborator, Author):

As per our discussion, I updated this PR to check all CockroachDB nodes and store the results for all of them in inventory.

It's worth looking at async fn collect_all_cockroach again -- when we collect data from CRDB nodes, we're doing it by address, but I'm trying to store the information by "node ID".

In the follow-up PR to this one (#8441), I will verify that all nodes are reporting a valid status before we attempt to upgrade.

smklein (Collaborator, Author):

#8441 is now updated, if you want to take a look at usage.

This now requires that at least COCKROACHDB_REDUNDANCY nodes report liveness_livenodes >= COCKROACHDB_REDUNDANCY, with no underreplicated ranges.

If any nodes are missing this info, or return partial info, the update does not proceed.
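
As a rough sketch of that rule (the type, field, and constant names here are illustrative, not the actual #8441 code):

use std::collections::BTreeMap;

// Illustrative per-node stats, as recorded in inventory (names assumed).
struct CockroachMetrics {
    liveness_live_nodes: Option<u64>,
    ranges_underreplicated: Option<u64>,
}

// Assumed value; the real constant lives in the planner's policy.
const COCKROACHDB_REDUNDANCY: usize = 5;

/// Returns true only if enough nodes report a fully healthy view of the cluster.
fn cockroach_safe_to_update(by_node: &BTreeMap<String, CockroachMetrics>) -> bool {
    let healthy_reports = by_node
        .values()
        .filter(|m| match (m.liveness_live_nodes, m.ranges_underreplicated) {
            // A node only counts if it returned both metrics (no partial info)
            // and those metrics show a fully live, fully replicated cluster.
            (Some(live), Some(under)) => {
                live >= COCKROACHDB_REDUNDANCY as u64 && under == 0
            }
            _ => false,
        })
        .count();

    // Require at least COCKROACHDB_REDUNDANCY such reports before proceeding.
    healthy_reports >= COCKROACHDB_REDUNDANCY
}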

jgallagher (Contributor) left a comment:

Thanks, I feel a lot better about recording the results from all the nodes. One nontrivial suggestion about how we find node IDs; happy to chat if I've misunderstood or what I'm suggesting isn't as straightforward as I'm hoping it is.

Comment on lines 35 to 39
// It's important that we have *some* timeout here - currently,
// inventory collection will query all nodes to confirm they're
// responding. However, it's very possible that one node is down,
// and that should not block collection indefinitely.
let timeout_duration = std::time::Duration::from_secs(15);

Contributor:

This is maybe an unreasonable fear, but if any non-inventory bits want to start talking to the cockroach admin server and reach for this, will they get a potentially surprising timeout? Would it be reasonable to take timeout_duration as a parameter to new and force the user of the client to pick?
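
For example, as a sketch (this struct and constructor are placeholders, not the real client API):

use std::net::SocketAddr;
use std::time::Duration;

// Placeholder for the multi-backend cockroach-admin client; the point is just
// that the caller, not the client, decides the per-request timeout.
pub struct CockroachAdminClient {
    backends: Vec<SocketAddr>,
    timeout: Duration,
}

impl CockroachAdminClient {
    pub fn new(timeout: Duration) -> Self {
        Self { backends: Vec::new(), timeout }
    }
}

fn main() {
    // Inventory collection picks an explicit bound so that one down node
    // cannot stall collection indefinitely.
    let _client = CockroachAdminClient::new(Duration::from_secs(15));
}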

smklein (Collaborator, Author):

Done in eef1ee1

Comment on lines 431 to 438
// To access this data, we:
//
// 1. Make the assumption that the IP address of the Cockroach Admin
// server is the same as the Cockroach SQL server.
// 2. Create the mapping of "IP -> Response"
//
// If we find any responses that are in disagreement with each other,
// flag an error and stop the collection.

Contributor:

I think these assumptions are all fine. Maybe there's a cleaner way to do this, though: we already have a background task that collects the node IDs from the admin server. It makes the same assumption that the node ID <=> IP address mapping doesn't change (which I think is fine, given reconfigurator is responsible for allocating IPs, and it won't reuse IPs for new cockroach zones).

To support that, the admin server has a local_node_id() endpoint. Internally to the admin server, it caches its local node ID (since it can't ever change), so it's quick to access once it's been discovered. If I'm reading correctly, we only call fetch_node_status_from_all_nodes() to build the node ID <=> IP mappings so that we can attach node IDs to the results we get from fetch_prometheus_metrics_from_all_nodes(), right?

What if we had the admin server include its local node ID in the response for fetch_prometheus_metrics_from_all_nodes()? Then we wouldn't need to build this mapping at all, because the responses would already tell us which node they were coming from, I think? We'd also be able to drop two kinds of runtime errors here (conflicting node IDs and getting metrics from an unknown IP), although maybe that first one turns into "what if two different admin servers claim to have the same ID" (which maybe found_cockroach_metrics() would have to tell us)?

smklein (Collaborator, Author):

I went ahead and updated this in dae8bdf

For now, rather than changing the crdb-admin API, I'm just querying the "local_node_id" endpoint first, and using that info. Seems functionally equivalent.

I do still think there's value in accessing node_status -- it can help us identify "which nodes" CockroachDB thinks are up/down/etc -- but agreed, we don't need it now if we're getting the node ID a different way.
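
Roughly, the per-node collection now does something like this (a sketch with made-up types and method names, not the real cockroach-admin client API):

use anyhow::Context;
use std::collections::BTreeMap;
use std::net::SocketAddrV6;

// Stand-ins for the generated cockroach-admin client and the parsed metrics.
struct AdminClient {
    addr: SocketAddrV6,
}
struct Metrics;

impl AdminClient {
    async fn local_node_id(&self) -> anyhow::Result<String> {
        unimplemented!()
    }
    async fn prometheus_metrics(&self) -> anyhow::Result<Metrics> {
        unimplemented!()
    }
}

// Ask each admin server for its own node ID first, then key its metrics by
// that ID -- no separate address <=> node-ID mapping is needed.
async fn collect_all_cockroach(
    clients: &[AdminClient],
) -> BTreeMap<String, anyhow::Result<Metrics>> {
    let mut results = BTreeMap::new();
    for client in clients {
        let Ok(node_id) = client.local_node_id().await else {
            // A node that can't even report its ID gets recorded as a
            // collection error by the caller (not shown here).
            continue;
        };
        let metrics = client
            .prometheus_metrics()
            .await
            .with_context(|| format!("fetching metrics from {}", client.addr));
        results.insert(node_id, metrics);
    }
    results
}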

anyhow::anyhow!(
"Failed to get node ID from {}: {}",
addr,
e

Contributor:

I think this will lose the underlying sources from e. Can we switch from .map_err() to .with_context(|| format!("Failed to get node ID from {addr}")) here and below? I believe anyhow then attaches e as the source of the new error. (If I'm misremembering or that doesn't work, we could also use InlineErrorChain when we stringify, to make sure we keep the source chain.)
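
A minimal sketch of what that looks like (fetch_node_id is a made-up helper, not code from this PR):

use anyhow::Context;
use std::net::SocketAddrV6;

async fn fetch_node_id(_addr: SocketAddrV6) -> anyhow::Result<String> {
    unimplemented!("placeholder")
}

async fn node_id_with_context(addr: SocketAddrV6) -> anyhow::Result<String> {
    // .with_context() keeps the original error as the source() of the new one,
    // so the full error chain survives when it is eventually logged.
    fetch_node_id(addr)
        .await
        .with_context(|| format!("Failed to get node ID from {addr}"))
}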

smklein (Collaborator, Author):

Updated in 0211c8f

self.in_progress.found_cockroach_metrics(*node_id, metrics);
// Store results for each successful node using the node ID returned by each node
for (node_id, metrics) in metrics_results {
self.in_progress.found_cockroach_metrics(node_id, metrics);

Contributor:

🎉

smklein enabled auto-merge (squash) on July 10, 2025 17:21
smklein merged commit ae3ca81 into main on Jul 10, 2025
17 checks passed
smklein deleted the range-in-inventory branch on July 10, 2025 19:26
smklein added a commit that referenced this pull request Jul 10, 2025
Updates the reconfigurator to evaluate cockroachdb cluster health before
upgrading zones

Only updates zones if:
- Ranges underreplicated == 0, and
- Live nodes == COCKROACH_REDUNDANCY
- At least COCKROACH_REDUNDANCY nodes are reporting their status
successfully

Builds on #8379 and #8426