Pull out cockroach-metrics, add minimal stats to inventory #8426
Conversation
Interesting; I'm seeing "ranges_underreplicated = 55" on the CI test failure here -- didn't see that locally, didn't actually realize that could happen with a single node. I might need to make this test more lenient.
    .await
    .context("looking up cockroach addresses")?;

// TODO: Allow a hard-coded option to find the admin interface here.
Should we add cockroach-admin to DNS (so the call above could use ServiceName::CockroachAdmin)? (Not a blocker for this PR certainly, but maybe the cleanest way to address this TODO?)
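A minimal sketch of what that could look like, mirroring the lookup in the snippet above; the resolver method and the CockroachAdmin service-name variant are assumptions for illustration, not the actual omicron API:

// Hypothetical: if cockroach-admin were registered in internal DNS, the
// collector could resolve it directly instead of assuming the admin server
// shares an address with the SQL server. `lookup_all_socket_v6` and
// `ServiceName::CockroachAdmin` are illustrative names.
let admin_addresses: Vec<std::net::SocketAddrV6> = resolver
    .lookup_all_socket_v6(ServiceName::CockroachAdmin)
    .await
    .context("looking up cockroach-admin addresses")?;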
    })
    .collect();

cockroach_admin_client.update_backends(admin_addresses.as_slice()).await;
In the context of the conversation we had yesterday about CRDB safety - should we record the results of these metrics from all nodes instead of the first that responds successfully? Would it be reasonable to treat a failure to fetch metrics from a node as "the cluster may not be healthy"?
I'm uncertain about this. There are basically two (or more?) signals about node health we could use:
- What do the HTTP endpoints say (e.g., we ask node 2, it says nodes 1-5 are healthy)
- What do we infer from querying HTTP endpoints (e.g., we don't hear back from node 2's HTTP endpoint in time)
(also arguably, 3. What if one or both of (1), (2) say "we're healthy" but then requests to that node fail for some other reason?)
What should we do in the case where "the rest of the cluster has identified node N as healthy, but the HTTP server for node N is not responding?"
Beyond a simple failure (e.g., misconfiguring the HTTP server), I'm really unsure how to proceed in this case. If we are using it as a signal, does that mean we need to wait for all CRDB nodes to respond to us to decide if they're healthy?
As per our discussion, I updated this PR to now check all cockroachdb nodes, and store them all in inventory.
It's worth looking at the async fn collect_all_cockroach again - when we collect data from CRDB nodes, we're doing it by address, but I'm trying to store information by "node ID".
In the follow-up PR to this one (#8441) I will be verifying that "all nodes are reporting valid status" before we attempt to upgrade.
#8441 is now updated, if you want to take a look at usage.
This now requires that at least COCKROACHDB_REDUNDANCY nodes report "liveness_livenodes >= COCKROACHDB_REDUNDANCY", with no underreplicated ranges.
If any nodes are missing this info, or return partial info, the update does not proceed.
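For reference, a minimal sketch of that check, using illustrative stand-in types for the node ID and per-node metrics (the real version lands in #8441):

use std::collections::BTreeMap;

// Illustrative stand-ins; the real types live elsewhere in omicron.
type NodeId = String;
struct CockroachMetrics {
    liveness_livenodes: Option<u64>,
    ranges_underreplicated: Option<u64>,
}

/// Sketch of the safety check described above: require complete metrics from
/// every node we heard from, and at least COCKROACHDB_REDUNDANCY reports that
/// the cluster has enough live nodes and no under-replicated ranges.
fn cockroach_cluster_looks_healthy(
    metrics_by_node: &BTreeMap<NodeId, CockroachMetrics>,
    redundancy: usize, // COCKROACHDB_REDUNDANCY
) -> bool {
    // Any missing or partial report means the update does not proceed.
    if metrics_by_node.values().any(|m| {
        m.liveness_livenodes.is_none() || m.ranges_underreplicated.is_none()
    }) {
        return false;
    }

    // Count nodes whose view of the cluster meets the bar.
    let healthy_reports = metrics_by_node
        .values()
        .filter(|m| {
            m.liveness_livenodes.unwrap() as usize >= redundancy
                && m.ranges_underreplicated.unwrap() == 0
        })
        .count();

    healthy_reports >= redundancy
}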
Thanks, I feel a lot better about recording the results from all the nodes. One nontrivial suggestion about how we find node IDs; happy to chat if I've misunderstood or what I'm suggesting isn't as straightforward as I'm hoping it is.
cockroach-metrics/src/lib.rs (outdated)
// It's important that we have *some* timeout here - currently,
// inventory collection will query all nodes to confirm they're
// responding. However, it's very possible that one node is down,
// and that should not block collection indefinitely.
let timeout_duration = std::time::Duration::from_secs(15);
This is maybe an unreasonable fear, but - if any non-inventory bits want to start talking to the cockroach admin server and reach for this, will they get a potentially-surprising timeout set? Would it be reasonable to take timeout_duration as a parameter to new and force the user of the client to pick?
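A minimal sketch of the suggested shape, assuming a reqwest-based client; the struct name and constructor here are illustrative, not the actual cockroach-metrics API:

use std::time::Duration;

// Illustrative wrapper; the real client lives in cockroach-metrics.
pub struct CockroachMetricsClient {
    http: reqwest::Client,
    base_url: String,
}

impl CockroachMetricsClient {
    /// Force callers to pick a timeout rather than baking in a default that
    /// only makes sense for inventory collection.
    pub fn new(base_url: &str, timeout: Duration) -> anyhow::Result<Self> {
        let http = reqwest::Client::builder().timeout(timeout).build()?;
        Ok(Self { http, base_url: base_url.to_string() })
    }
}

Inventory collection would then pass its 15-second timeout explicitly, and any other caller would have to make its own choice.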
Done in eef1ee1
nexus/inventory/src/collector.rs (outdated)
// To access this data, we:
//
// 1. Make the assumption that the IP address of the Cockroach Admin
//    server is the same as the Cockroach SQL server.
// 2. Create the mapping of "IP -> Response"
//
// If we find any responses that are in disagreement with each other,
// flag an error and stop the collection.
I think these assumptions are all fine. Maybe there's a cleaner way to do this, though: we already have a background task that collects the node IDs from the admin server. It makes the same assumption that the node ID <=> IP address mapping doesn't change (which I think is fine, given reconfigurator is responsible for allocating IPs, and it won't reuse IPs for new cockroach zones). To support that, the admin server has a local_node_id() endpoint. Internally to the admin server, it caches its local node ID (since it can't ever change), so it's quick to access once it's been discovered.

If I'm reading correctly, we only call fetch_node_status_from_all_nodes() to build the node ID <=> IP mappings so that we can attach node IDs to the results we get from fetch_prometheus_metrics_from_all_nodes(), right? What if we had the admin server include its local node ID in the response for fetch_prometheus_metrics_from_all_nodes()? Then we wouldn't need to build this mapping at all, because the responses would already tell us which node they were coming from, I think? We'd also be able to drop two kinds of runtime errors here (conflicting node IDs and getting metrics from an unknown IP), although maybe that first one turns into "what if two different admin servers claim to have the same ID" (which maybe found_cockroach_metrics() would have to tell us)?
I went ahead and updated this in dae8bdf
For now, rather than changing the crdb-admin API, I'm just querying the "local_node_id" endpoint first, and using that info. Seems functionally equivalent.
I do still think there's value in accessing node_status -- it can help us identify "which nodes" CockroachDB thinks are up/down/etc -- but agreed, we don't need it now if we're getting the node ID a different way.
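A rough sketch of that flow; AdminClient, get_local_node_id, and get_prometheus_metrics are stand-ins for the real cockroach-admin client surface, not its actual API:

use std::collections::BTreeMap;

// Assumed stand-ins for the real client and response types.
struct PrometheusMetrics;
struct AdminClient;
impl AdminClient {
    async fn get_local_node_id(&self) -> anyhow::Result<String> { todo!() }
    async fn get_prometheus_metrics(&self) -> anyhow::Result<PrometheusMetrics> { todo!() }
}

/// Query each admin server for its own node ID first, then its metrics, so
/// results are keyed by node ID without building an IP <=> node ID map.
/// Nodes that fail to respond are skipped rather than failing the whole
/// collection, matching the "store results for each successful node" loop.
async fn collect_all_cockroach(
    clients: &[AdminClient],
) -> BTreeMap<String, PrometheusMetrics> {
    let mut metrics_by_node = BTreeMap::new();
    for client in clients {
        // Ask the node for its own ID; skip it if the admin server is down.
        let Ok(node_id) = client.get_local_node_id().await else {
            continue;
        };
        let Ok(metrics) = client.get_prometheus_metrics().await else {
            continue;
        };
        metrics_by_node.insert(node_id, metrics);
    }
    metrics_by_node
}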
cockroach-metrics/src/lib.rs (outdated)
anyhow::anyhow!(
    "Failed to get node ID from {}: {}",
    addr,
    e
I think this will lose underlying sources from e. Can we switch from .map_err() to .with_context(|| format!("Failed to get node ID from {addr}")) here and below? I believe anyhow then attaches e as the source to itself. (If I'm misremembering or that doesn't work, we could also use InlineErrorChain here when we stringify to make sure we keep the source chain.)
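For reference, the two shapes side by side; fetch_node_id and addr are placeholders here, and the second form relies on anyhow's Context::with_context attaching the original error as the source:

use anyhow::Context;

// Before: formatting `e` into the message flattens the error, so the
// original error is no longer reachable through the source chain.
let node_id = fetch_node_id(addr)
    .await
    .map_err(|e| anyhow::anyhow!("Failed to get node ID from {}: {}", addr, e))?;

// After: anyhow wraps the error and keeps it as the source, so the full
// chain is still available (e.g. when printed with `{:#}`).
let node_id = fetch_node_id(addr)
    .await
    .with_context(|| format!("Failed to get node ID from {addr}"))?;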
Updated in 0211c8f
self.in_progress.found_cockroach_metrics(*node_id, metrics);
// Store results for each successful node using the node ID returned by each node
for (node_id, metrics) in metrics_results {
    self.in_progress.found_cockroach_metrics(node_id, metrics);
🎉
This PR directly follows the extraction from #8379.
It pulls two of these metrics (ranges_underreplicated and liveness_livenodes) into inventory, where they will be used by the reconfigurator to decide if a Cockroach zone can be safely updated in #8441.