
Pull out cockroach-metrics, add minimal stats to inventory #8426

Merged
merged 46 commits into main from range-in-inventory on Jul 10, 2025

Conversation

smklein (Collaborator) commented on Jun 23, 2025

This PR directly follows the extraction from #8379.

It pulls two of these metrics into inventory, where the reconfigurator will use them in #8441 to decide whether a CockroachDB zone can be safely updated.

smklein marked this pull request as ready for review on June 27, 2025 18:00
smklein (Collaborator, Author) commented on Jun 27, 2025

Interesting; I'm seeing "ranges_underreplicated = 55" in the CI test failure here -- I didn't see that locally, and didn't realize that could happen with a single node. I might need to make this test more lenient.

Base automatically changed from crdb-prometheus to main June 30, 2025 18:34
.await
.context("looking up cockroach addresses")?;

// TODO: Allow a hard-coded option to find the admin interface here.

Contributor:

Should we add cockroach-admin to DNS (so the call above could use ServiceName::CockroachAdmin)? (Not a blocker for this PR certainly, but maybe the cleanest way to address this TODO)?
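
Roughly what that could look like, as a sketch only (the Resolver and ServiceName below are stand-ins for the real internal-DNS types, and the method name is an assumption):

use std::net::SocketAddrV6;

// Stand-ins for the real internal-DNS resolver and service-name enum; the
// actual types, variants, and method names may differ.
enum ServiceName {
    CockroachAdmin,
}

struct Resolver;

impl Resolver {
    async fn lookup_all_socket_v6(
        &self,
        _service: ServiceName,
    ) -> anyhow::Result<Vec<SocketAddrV6>> {
        unimplemented!("placeholder for a DNS lookup")
    }
}

// With a cockroach-admin DNS record, the admin addresses could be resolved
// directly rather than derived from the Cockroach SQL addresses.
async fn cockroach_admin_addresses(
    resolver: &Resolver,
) -> anyhow::Result<Vec<SocketAddrV6>> {
    resolver.lookup_all_socket_v6(ServiceName::CockroachAdmin).await
}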

smklein (Collaborator, Author):

I'll file an issue for this: #8496

Updated references to this in 208b663

})
.collect();

cockroach_admin_client.update_backends(admin_addresses.as_slice()).await;

Contributor:

In the context of the conversation we had yesterday about CRDB safety - should we record the results of these metrics from all nodes instead of the first that responds successfully? Would it be reasonable to treat a failure to fetch metrics from a node as "the cluster may not be healthy"?

smklein (Collaborator, Author):

I'm uncertain about this. There are basically two (or more?) signals about node health we could use:

  1. What do the HTTP endpoints say (e.g., we ask node 2, it says nodes 1-5 are healthy)
  2. What do we infer from querying HTTP endpoints (e.g., we don't hear back from node 2's HTTP endpoint in time)

(Also, arguably: 3. What if one or both of (1) and (2) say "we're healthy", but then requests to that node fail for some other reason?)

What should we do in the case where "the rest of the cluster has identified node N as healthy, but the HTTP server for node N is not responding?"

Beyond a simple failure (e.g., misconfiguring the HTTP server), I'm really unsure how to proceed in this case. If we are using it as a signal, does that mean we need to wait for all CRDB nodes to respond to us to decide if they're healthy?

smklein (Collaborator, Author):

As per our discussion, I updated this PR to check all CockroachDB nodes and store the results for all of them in inventory.

It's worth looking at async fn collect_all_cockroach again -- when we collect data from CRDB nodes, we're doing it by address, but I'm trying to store the information by "node ID".

In the follow-up PR to this one (#8441), I will verify that all nodes are reporting a valid status before we attempt to upgrade.

smklein (Collaborator, Author):

#8441 is now updated, if you want to take a look at usage.

This now requires that at least COCKROACHDB_REDUNDANCY nodes report liveness_livenodes >= COCKROACHDB_REDUNDANCY, with no underreplicated ranges.

If any nodes are missing this info, or return partial info, the update does not proceed.
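
As a rough sketch of that rule (the type, field, and constant names here are illustrative, not the actual #8441 code):

use std::collections::BTreeMap;

// Illustrative per-node stats, as recorded in inventory (names assumed).
struct CockroachMetrics {
    liveness_live_nodes: Option<u64>,
    ranges_underreplicated: Option<u64>,
}

// Assumed value; the real constant lives in the planner's policy.
const COCKROACHDB_REDUNDANCY: usize = 5;

/// Returns true only if enough nodes report a fully healthy view of the cluster.
fn cockroach_safe_to_update(by_node: &BTreeMap<String, CockroachMetrics>) -> bool {
    let healthy_reports = by_node
        .values()
        .filter(|m| match (m.liveness_live_nodes, m.ranges_underreplicated) {
            // A node only counts if it returned both metrics (no partial info)
            // and those metrics show a fully live, fully replicated cluster.
            (Some(live), Some(under)) => {
                live >= COCKROACHDB_REDUNDANCY as u64 && under == 0
            }
            _ => false,
        })
        .count();

    // Require at least COCKROACHDB_REDUNDANCY such reports before proceeding.
    healthy_reports >= COCKROACHDB_REDUNDANCY
}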

jgallagher (Contributor) left a comment:

Thanks, I feel a lot better about recording the results from all the nodes. One nontrivial suggestion about how we find node IDs; happy to chat if I've misunderstood or what I'm suggesting isn't as straightforward as I'm hoping it is.

Comment on lines 35 to 39
// It's important that we have *some* timeout here - currently,
// inventory collection will query all nodes to confirm they're
// responding. However, it's very possible that one node is down,
// and that should not block collection indefinitely.
let timeout_duration = std::time::Duration::from_secs(15);

Contributor:

This is maybe an unreasonable fear, but if any non-inventory bits want to start talking to the cockroach admin server and reach for this, will they get a potentially surprising timeout? Would it be reasonable to take timeout_duration as a parameter to new and force the user of the client to pick?
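
For example, as a sketch (this struct and constructor are placeholders, not the real client API):

use std::net::SocketAddr;
use std::time::Duration;

// Placeholder for the multi-backend cockroach-admin client; the point is just
// that the caller, not the client, decides the per-request timeout.
pub struct CockroachAdminClient {
    backends: Vec<SocketAddr>,
    timeout: Duration,
}

impl CockroachAdminClient {
    pub fn new(timeout: Duration) -> Self {
        Self { backends: Vec::new(), timeout }
    }
}

fn main() {
    // Inventory collection picks an explicit bound so that one down node
    // cannot stall collection indefinitely.
    let _client = CockroachAdminClient::new(Duration::from_secs(15));
}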

smklein (Collaborator, Author):

Done in eef1ee1

Comment on lines 431 to 438
// To access this data, we:
//
// 1. Make the assumption that the IP address of the Cockroach Admin
// server is the same as the Cockroach SQL server.
// 2. Create the mapping of "IP -> Response"
//
// If we find any responses that are in disagreement with each other,
// flag an error and stop the collection.

Contributor:

I think these assumptions are all fine. Maybe there's a cleaner way to do this, though: we already have a background task that collects the node IDs from the admin server. It makes the same assumption that the node ID <=> IP address mapping doesn't change (which I think is fine, given reconfigurator is responsible for allocating IPs, and it won't reuse IPs for new cockroach zones).

To support that, the admin server has a local_node_id() endpoint. Internally to the admin server, it caches its local node ID (since it can't ever change), so it's quick to access once it's been discovered. If I'm reading correctly, we only call fetch_node_status_from_all_nodes() to build the node ID <=> IP mappings so that we can attach node IDs to the results we get from fetch_prometheus_metrics_from_all_nodes(), right?

What if we had the admin server include its local node ID in the response for fetch_prometheus_metrics_from_all_nodes()? Then we wouldn't need to build this mapping at all, because the responses would already tell us which node they were coming from, I think? We'd also be able to drop two kinds of runtime errors here (conflicting node IDs and getting metrics from an unknown IP), although maybe that first one turns into "what if two different admin servers claim to have the same ID" (which maybe found_cockroach_metrics() would have to tell us)?

smklein (Collaborator, Author):

I went ahead and updated this in dae8bdf

For now, rather than changing the crdb-admin API, I'm just querying the "local_node_id" endpoint first, and using that info. Seems functionally equivalent.

I do still think there's value in accessing node_status -- it can help us identify "which nodes" CockroachDB thinks are up/down/etc -- but agreed, we don't need it now if we're getting the node ID a different way.
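
Roughly, the per-node collection now does something like this (a sketch with made-up types and method names, not the real cockroach-admin client API):

use anyhow::Context;
use std::collections::BTreeMap;
use std::net::SocketAddrV6;

// Stand-ins for the generated cockroach-admin client and the parsed metrics.
struct AdminClient {
    addr: SocketAddrV6,
}
struct Metrics;

impl AdminClient {
    async fn local_node_id(&self) -> anyhow::Result<String> {
        unimplemented!()
    }
    async fn prometheus_metrics(&self) -> anyhow::Result<Metrics> {
        unimplemented!()
    }
}

// Ask each admin server for its own node ID first, then key its metrics by
// that ID -- no separate address <=> node-ID mapping is needed.
async fn collect_all_cockroach(
    clients: &[AdminClient],
) -> BTreeMap<String, anyhow::Result<Metrics>> {
    let mut results = BTreeMap::new();
    for client in clients {
        let Ok(node_id) = client.local_node_id().await else {
            // A node that can't even report its ID gets recorded as a
            // collection error by the caller (not shown here).
            continue;
        };
        let metrics = client
            .prometheus_metrics()
            .await
            .with_context(|| format!("fetching metrics from {}", client.addr));
        results.insert(node_id, metrics);
    }
    results
}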

anyhow::anyhow!(
"Failed to get node ID from {}: {}",
addr,
e

Contributor:

I think this will lose the underlying sources from e. Can we switch from .map_err() to .with_context(|| format!("Failed to get node ID from {addr}")) here and below? I believe anyhow then attaches e as the source of the new error. (If I'm misremembering or that doesn't work, we could also use InlineErrorChain when we stringify, to make sure we keep the source chain.)
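
A minimal sketch of what that looks like (fetch_node_id is a made-up helper, not code from this PR):

use anyhow::Context;
use std::net::SocketAddrV6;

async fn fetch_node_id(_addr: SocketAddrV6) -> anyhow::Result<String> {
    unimplemented!("placeholder")
}

async fn node_id_with_context(addr: SocketAddrV6) -> anyhow::Result<String> {
    // .with_context() keeps the original error as the source() of the new one,
    // so the full error chain survives when it is eventually logged.
    fetch_node_id(addr)
        .await
        .with_context(|| format!("Failed to get node ID from {addr}"))
}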

smklein (Collaborator, Author):

Updated in 0211c8f

self.in_progress.found_cockroach_metrics(*node_id, metrics);
// Store results for each successful node using the node ID returned by each node
for (node_id, metrics) in metrics_results {
self.in_progress.found_cockroach_metrics(node_id, metrics);

Contributor:

🎉

smklein enabled auto-merge (squash) on July 10, 2025 17:21
smklein merged commit ae3ca81 into main on Jul 10, 2025
17 checks passed
smklein deleted the range-in-inventory branch on July 10, 2025 19:26
smklein added a commit that referenced this pull request Jul 10, 2025
Updates the reconfigurator to evaluate cockroachdb cluster health before
upgrading zones

Only updates zones if:
- Ranges underreplicated == 0, and
- Live nodes == COCKROACH_REDUNDANCY
- At least COCKROACH_REDUNDANCY nodes are reporting their status
successfully

Builds on #8379 and #8426