[nexus] config flag to disable SP ereport ingestion #8709

Merged: 3 commits, Jul 29, 2025

Member

@hawkw hawkw commented Jul 28, 2025

PR #8296 added the `sp_ereport_ingester` background task to Nexus for
periodically collecting ereports from SPs via MGS. However, the Hubris
PR adding the Hubris task that actually responds to these requests from
the control plane, oxidecomputer/hubris#2126, won't make it in until
after R17. This means that if we release R17 with a control plane that
tries to collect ereports, and SP firmware that doesn't know how to
respond to such requests, the Nexus logs will be littered with 36 log
lines like this every 30 seconds:

```
20:58:04.603Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): client response
    background_task = sp_ereport_ingester
    gateway_url = http://[fd00:1122:3344:108::2]:12225
    result = Ok(Response { url: "http://[fd00:1122:3344:108::2]:12225/sp/sled/29/ereports?limit=255&restart_id=00000000-0000-0000-0000-000000000000", status: 503, headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"} })
20:58:04.603Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): ereport collection: unanticipated MGS request error: Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 29 }: RPC call failed (gave up after 5 attempts)", request_id: "35390a4a-6d3a-4683-be88-217267b46da0" }
    background_task = sp_ereport_ingester
    committed_ena = None
    error = Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "35390a4a-6d3a-4683-be88-217267b46da0", "content-length": "224", "date": "Mon, 28 Jul 2025 20:58:04 GMT"}; value: Error { error_code: Some("SpCommunicationFailed"), message: "error communicating with SP SpIdentifier { typ: Sled, slot: 29 }: RPC call failed (gave up after 5 attempts)", request_id: "35390a4a-6d3a-4683-be88-217267b46da0" }
    file = nexus/src/app/background/tasks/ereport_ingester.rs:380
    gateway_addr = [fd00:1122:3344:108::2]:12225
    restart_id = 00000000-0000-0000-0000-000000000000 (ereporter_restart)
    slot = 29
    sp_type = sled
    start_ena = None
```

Similarly, MGS will also have a bunch of noisy complaints about these
requests failing.

The consequences of this are really not terrible: it just means we'll be
logging a lot of errors. But it seems mildly unfortunate to be
constantly trying to do something that's invariably doomed to failure,
and then yelling about how it didn't work. So, this commit adds a config
flag for disabling the whole thing, which we can turn on for R17's
production Nexus config and then turn back off when the Hubris changes
make it in. I did this using a config setting, rather than hard-coding
it to always be disabled, because there are also integration tests for
this stuff, which would break if we disabled it everywhere.
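
As a rough illustration of the gating described above, the following is a minimal sketch, not the code from this PR. The names `EreportIngesterConfig`, `disable`, `Activation`, and `activate` are all hypothetical, and the SP slots are plain integers standing in for the real MGS requests. The idea is only that a disabled task returns early, before issuing any request that could fail and be logged.

```rust
/// Hypothetical stand-in for the ingester's settings. The field name
/// `disable` is an assumption, not necessarily what Nexus uses.
#[derive(Debug, Clone, Copy)]
struct EreportIngesterConfig {
    disable: bool,
}

/// Hypothetical summary of what one activation of the task did.
#[derive(Debug, PartialEq)]
enum Activation {
    /// The flag was set, so no SP was contacted.
    Disabled,
    /// Collection was attempted for this many SPs.
    Collected { sps_polled: usize },
}

/// One activation of the background task. When the flag is set, return
/// before issuing any request, so nothing fails and nothing is logged.
fn activate(config: &EreportIngesterConfig, sp_slots: &[u32]) -> Activation {
    if config.disable {
        return Activation::Disabled;
    }
    // Placeholder for the real work (requesting ereports via MGS).
    Activation::Collected { sps_polled: sp_slots.len() }
}

fn main() {
    let sp_slots = [0, 1, 2];
    // Production config for a release whose SP firmware cannot answer yet.
    let production = EreportIngesterConfig { disable: true };
    // Integration tests keep ingestion enabled.
    let test = EreportIngesterConfig { disable: false };
    println!("{:?}", activate(&production, &sp_slots));
    println!("{:?}", activate(&test, &sp_slots));
}
```

Keeping the switch in the config, rather than compiling the task out, is what lets the test config and the production config differ.
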
@hawkw hawkw requested a review from jgallagher July 28, 2025 22:07
Contributor

@jgallagher jgallagher left a comment

Changes LGTM; looks like the failing tests are some other configs that need the new field?

(Also just checking: all the "R17"s in the PR description should be "R16" right?)

Member Author

hawkw commented Jul 29, 2025

> Changes LGTM; looks like the failing tests are some other configs that need the new field?

Oh, I see the problem, I thought I had made it default to false but I didn't actually do that. Whoopsie.

> (Also just checking: all the "R17"s in the PR description should be "R16" right?)

Agh, yeah, good catch.
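
The failing-config problem discussed above comes down to whether a missing field has a default. In a serde-based config this is typically handled with a `#[serde(default)]` attribute. The sketch below shows the same idea using only the standard library, with a hypothetical `key = value` format and a hypothetical `disable` field name; it is not the actual Nexus config parser.

```rust
/// Hypothetical parsed form of the ingester settings; `disable` is an
/// assumed field name.
#[derive(Debug, PartialEq)]
struct EreportIngesterConfig {
    disable: bool,
}

/// Parse simple `key = value` lines. A missing `disable` key falls back
/// to `false`, so config files written before the flag existed still load.
fn parse_config(text: &str) -> Result<EreportIngesterConfig, String> {
    let mut disable = false; // default: ingestion stays enabled
    for line in text.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let (key, value) = line
            .split_once('=')
            .ok_or_else(|| format!("expected `key = value`, got `{line}`"))?;
        if key.trim() == "disable" {
            disable = value
                .trim()
                .parse::<bool>()
                .map_err(|e| format!("invalid `disable` value: {e}"))?;
        }
        // Other keys are ignored in this sketch.
    }
    Ok(EreportIngesterConfig { disable })
}

fn main() {
    // A config written before the flag existed: still parses, flag is off.
    let old_config = "period_secs = 30\n";
    // A config that opts in to disabling ingestion.
    let new_config = "period_secs = 30\ndisable = true\n";
    println!("{:?}", parse_config(old_config));
    println!("{:?}", parse_config(new_config));
}
```

Without the default, every existing config file would need the new key added before it could load, which matches the test failures the review noticed.
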

@hawkw hawkw enabled auto-merge (squash) July 29, 2025 17:28
@hawkw hawkw merged commit 46b7f4e into main Jul 29, 2025
16 checks passed
@hawkw hawkw deleted the eliza/disable-ereports branch July 29, 2025 19:04