[nexus] config flag to disable SP ereport ingestion #8709

hawkw · 2025-07-28T22:07:04Z

PR #8296 added the sp_ereport_ingester background task to Nexus for periodically collecting ereports from SPs via MGS. However, the Hubris PR that adds the task which actually responds to these requests from the control plane, oxidecomputer/hubris#2126, won't make it in until after R17. This means that if we release R17 with a control plane that tries to collect ereports and SP firmware that doesn't know how to respond to such requests, the Nexus logs will be littered with 36 log lines like this every 30 seconds:

20:58:04.603Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): client response
    background_task = sp_ereport_ingester
    gateway_url = http://[fd00:1122:3344:108::2]:12225
    result = Ok(Response { url: "http://[fd00:1122:3344:108::2]:12225/sp/sled/29/ereports?limit=255&restart_id=00000000-0000-0000-0000-000000000000", status: 503, headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"} })
20:58:04.603Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): ereport collection: unanticipated MGS request error: Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 29 }: RPC call failed (gave up after 5 attempts)", request_id: "35390a4a-6d3a-4683-be88-217267b46da0" }
    background_task = sp_ereport_ingester
    committed_ena = None
    error = Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 29 }: RPC call failed (gave up after 5 attempts)", request_id: "35390a4a-6d3a-4683-be88-217267b46da0" }
    file = nexus/src/app/background/tasks/ereport_ingester.rs:380
    gateway_addr = [fd00:1122:3344:108::2]:12225
    restart_id = 00000000-0000-0000-0000-000000000000 (ereporter_restart)
    slot = 29
    sp_type = sled
    start_ena = None

Similarly, MGS will log a bunch of noisy complaints about these requests failing.

The consequences of this are really not terrible: it just means we'll be logging a lot of errors. But it seems mildly unfortunate to be constantly trying to do something that's invariably doomed to failure, and then yelling about how it didn't work. So, this commit adds a config flag for disabling the whole thing, which we can turn on in R17's production Nexus config and then turn back off when the Hubris changes make it in. I did this using a config setting, rather than hard-coding it to always be disabled, because there are also integration tests for this stuff, which would break if we disabled it everywhere.
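To make the shape of the change concrete, here is a minimal sketch of a background task gated by a disable flag. The names `EreportIngesterConfig`, `Activation`, and `activate` are hypothetical and are not taken from the Nexus source; the sketch only assumes that the flag lives in the task's config and is checked before any requests are sent to MGS.

```rust
use std::time::Duration;

/// Hypothetical config for the ingester task. The field names are
/// illustrative and are not copied from nexus-config.
#[derive(Debug, Clone)]
struct EreportIngesterConfig {
    /// How often the task is activated.
    period: Duration,
    /// When true, the task does nothing on activation.
    disable: bool,
}

/// What a single activation of the (hypothetical) task did.
#[derive(Debug, PartialEq)]
enum Activation {
    Disabled,
    Collected { sps_polled: usize },
}

/// One activation of the task. A real task would ask MGS for ereports from
/// each SP; this stand-in only counts the SPs it would have polled.
fn activate(config: &EreportIngesterConfig, sp_slots: &[u16]) -> Activation {
    if config.disable {
        // Bail out before issuing any requests, so nothing fails and
        // nothing gets logged.
        return Activation::Disabled;
    }
    Activation::Collected { sps_polled: sp_slots.len() }
}

fn main() {
    let sp_slots = [0u16, 1, 29];

    // Production config for the release: flag turned on.
    let production = EreportIngesterConfig {
        period: Duration::from_secs(30),
        disable: true,
    };
    // Integration-test config: flag left off, so the task still runs.
    let test = EreportIngesterConfig { disable: false, ..production.clone() };

    println!("every {:?}: {:?}", production.period, activate(&production, &sp_slots));
    println!("every {:?}: {:?}", test.period, activate(&test, &sp_slots));
}
```

Keeping the switch in config rather than in code is what lets the production config and the integration-test config differ without a rebuild.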

jgallagher

Changes LGTM; looks like the failing tests are some other configs that need the new field?

(Also just checking: all the "R17"s in the PR description should be "R16" right?)

hawkw · 2025-07-29T17:12:17Z

> Changes LGTM; looks like the failing tests are some other configs that need the new field?

Oh, I see the problem, I thought I had made it default to false but I didn't actually do that. Whoopsie.

> (Also just checking: all the "R17"s in the PR description should be "R16" right?)

Agh, yeah, good catch.
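For illustration, the failure mode discussed above (existing configs breaking because they lack the new field) can be sketched with a tiny hand-rolled loader. This is an assumption-laden stand-in, not the real Nexus config path: `IngesterConfig`, `parse_pairs`, `load`, and the `key = value` format are all hypothetical, and only the standard library is used. The point is that an absent `disable` key falls back to `false` instead of being an error.

```rust
use std::collections::HashMap;

/// Hypothetical task config; names are illustrative only.
#[derive(Debug, PartialEq)]
struct IngesterConfig {
    period_secs: u64,
    disable: bool,
}

/// Split `key = value` lines into a map. This is a stand-in for a real
/// config parser and ignores lines without an `=`.
fn parse_pairs(input: &str) -> HashMap<&str, &str> {
    input
        .lines()
        .filter_map(|line| line.split_once('='))
        .map(|(key, value)| (key.trim(), value.trim()))
        .collect()
}

/// Build the config, treating `period_secs` as required and `disable` as
/// optional with a default of `false`.
fn load(input: &str) -> Result<IngesterConfig, String> {
    let pairs = parse_pairs(input);

    let period_secs = pairs
        .get("period_secs")
        .ok_or("missing required field `period_secs`")?
        .parse::<u64>()
        .map_err(|e| format!("invalid `period_secs`: {e}"))?;

    // Configs written before the flag existed have no `disable` key, so a
    // missing key means "not disabled" rather than a parse failure.
    let disable = match pairs.get("disable") {
        Some(raw) => raw
            .parse::<bool>()
            .map_err(|e| format!("invalid `disable`: {e}"))?,
        None => false,
    };

    Ok(IngesterConfig { period_secs, disable })
}

fn main() {
    // A config that predates the flag still loads.
    println!("{:?}", load("period_secs = 30"));
    // A config that opts in to disabling the task.
    println!("{:?}", load("period_secs = 30\ndisable = true"));
}
```

With a default in place, only the configs that actually want the task disabled need to mention the flag.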

hawkw requested a review from jgallagher July 28, 2025 22:07

jgallagher approved these changes Jul 29, 2025

nexus-config/src/nexus_config.rs (outdated, resolved)

Update nexus-config/src/nexus_config.rs

1459154

Co-authored-by: John Gallagher <[email protected]>

disable flag should default to false

778294f

hawkw enabled auto-merge (squash) July 29, 2025 17:28

hawkw merged commit 46b7f4e into main Jul 29, 2025
16 checks passed

hawkw deleted the eliza/disable-ereports branch July 29, 2025 19:04
