-
Notifications
You must be signed in to change notification settings - Fork 207
cpu-seq: Add more ereports #2242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: eliza/derive-ereport
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting this together @hawkw. I have a bunch of thoughts with varying degrees of cogency.
// | ||
// Interrupts | ||
// | ||
#[cbor(rename = "hw.cpu.thermtrip")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been going back and forth with myself on whether to phrase this as specific to amd or not. e.g. hw.cpu.amd.thermtrip
, as we're referring to its thermal trip assertion pin. I think it's probably better to keep this as is.
// | ||
#[cbor(rename = "hw.cpu.thermtrip")] | ||
Thermtrip, | ||
#[cbor(rename = "hw.seq.smerr")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I don't think we should phrase this in terms of the sequencer as this is a fact about the CPU. So probably closer to hw.cpu.smerr
. It's not clear to me again if we want to namespace this or if we'd try to make these similar. This almost feels like hw.cpu.amd.smerr
(though I had to look up what smerr is).
Thermtrip, | ||
#[cbor(rename = "hw.seq.smerr")] | ||
Smerr, | ||
#[cbor(rename = "hw.seq.a0_mapo")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case I don't think we should phrase this as specific to the sequencer, this feels more related to power. So I would think of this as hw.pwr.a0.mapo
. This kind of organization assumes we want to make a class of A0/A2-specific power events. I also think we should consider instead hw.pwr.mapo
with a power domain present as an argument. A small part of me prefers this in case we detect A2 issues or we have a more complex situation with the Metro FPGA.
// | ||
// Initialization failures | ||
// | ||
#[cbor(rename = "hw.cpu.a0_fail.unknown")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get where you're coming from here, but I think I would basically keep this as hw.cpu.unsup
or you know the old callback to enotsup
. I think unknown
here is a bit confusing and it's not clear to me that an explicit a0_fail
subclass makes sense here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, i like unsup
--- i thought about unknown_cpu
but that felt like a few too many characters. I agree that unknown
feels like it could mean "unknown error", which is not what this is supposed to mean. if we dropped a0_fail
, unknown_cpu
seems not too bad...
class: EreportClass::Thermtrip, | ||
version: 0, | ||
report: &HOST_CPU_REFDES, | ||
// TODO(eliza): eventually, it would be nice to include sequencer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What sequencer registers are you imagining for this?
|
||
#[derive(Copy, Clone, Eq, PartialEq, microcbor::Encode, counters::Count)] | ||
pub enum EreportClass { | ||
#[cbor(rename = "hw.cpu.thermtrip")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing these in both Gimlet and Cosmo makes me wonder if there should be shared definitions here even if it has to get compiled into both tasks. That way we're consistent about the data that each contains.
A0Timeout, | ||
#[cbor(rename = "hw.a0_fail.timeout.groupc")] | ||
A0TimeoutGroupC, | ||
#[cbor(rename = "hw.pwr.pmbus.a0_fail.i2c_err")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The class on here is weird to me, but I think we want to figure out the above first. It's not really a PMBus-specific happening here right? It's actually that we couldn't communicate over I2C based on my memory, though that might be off.
record_reg(Addr::FLT_A0_SMSTATUS); | ||
record_reg(Addr::FLT_GROUPB_PG); | ||
record_reg(Addr::FLT_GROUPC_PG); | ||
let seq_status = SeqStatus { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is interesting and suggests that the sequencing stuff may need to be board specific. This almost makes me want to think that this is hw.gimlet.a0.seq.fail
or something. This needs more thought, but it's clear that we can't use the same payload between both. Or that we need some way for consumers to know that the board will change the payload. The reason I have the board here is that it's board-specific.
refdes: &'static str, | ||
rail: &'static str, | ||
#[count(children)] | ||
err: i2c::ResponseCode, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like something that we need to think carefully about. I think there's a desire to change this around a bit right, but this is starting to bake it into a more public API.
#[cbor(flatten)] | ||
refdes: &'static HostCpuRefdes, | ||
#[cbor(flatten)] | ||
coretype: Coretype, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this have coretype but not SP3r1/SP3r2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coretype
here is a Rust struct that actually includes coretype
/sp3r1
/sp3r2
fields; it's flatten
ed so we would actually generate the CBOR:
"coretype": <bool>,
"sp3r1": <bool>,
"sp3r2": <bool>,
It's a separate struct because it gets passed around in a few places here. Naming it Coretype
may have been a mistake.
Depends on #2246
This branch adds new ereports to
drv_gimlet_seq_server
anddrv_cosmo_seq_server
for failures going to A0 and for events thatresult in unexpected power-offs, such as THERMTRIP, sequencer FPGA A0
MAPO, and SMERR assertions. This depends on the
microcbor
derives forCBOR encoding I added in #2246, which make it much easier for a task to
record a variety of ereports represented by different Rust types,
without having to worry too much about the maximum length of the
encoding buffer.
This branch isn't ready to merge yet, as it depends on #2246. However,
I'd love to get reviews on this now, mostly to get some feedback on the
data included in these ereports and the class hierarchy/general message
schemas. I've kinda made these up; if there's more information that
would be useful to include in these, I'd love to get some advice from
@rmustacc and others.