-
Notifications
You must be signed in to change notification settings - Fork 207
Add APOB messages to host_sp_comms #2006
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
task/host-sp-comms/src/main.rs
Outdated
hf.bonus_sector_erase(offset) | ||
.map_err(|err| APOBError::EraseFailed { offset, err })?; | ||
} else { | ||
// Read back the page and confirm that it's all empty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this fail if there's a blip in the IPCC path and the host resends an APOB
request? (I'm not sure what the expectations are for the offsets the host is providing.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has changed a bunch since February; messages should now all be idempotent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some mostly kind of annoying nitpicks.
3c34974
to
593980a
Compare
65a65da
to
ebcfe39
Compare
ded20cd
to
7846e44
Compare
task/host-sp-comms/src/main.rs
Outdated
fn apob_write( | ||
hf: &HostFlash, | ||
mut offset: u64, | ||
data: &[u8], | ||
) -> Result<(), ApobError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we need something in the broader API to discover what the erase size granularity is or at least make sure that we're sending stuff that is page size aligned. This gets to what @jgallagher gets at below. But if the host sent things that wasn't page aligned then we'd erase the entire page because our API is not doing a read-modify-write.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has all changed now – we ensure that the to-be-written page is erased when the state machine starts, and writes are idempotent.
14165d3
to
b0acdc3
Compare
f7e8314
to
4a07f2b
Compare
ded1fa1
to
ad93477
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will probably do another pass later
checksum: 0, // dummy value | ||
}; | ||
out.checksum = out.expected_checksum(); | ||
assert!(out.is_valid()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert
becomes panic, is that the behavior we want here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so; the only way this should panic is if someone has broken the code in a dramatic way (e.g. editing the implementation of is_valid
so that previously valid data is no longer valid).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is also copied from HfRawPersistentData
, so I didn't think about it too much!
/// Either 0 or 1; directly translatable to [`ApobSlot`] | ||
pub slot_select: u32, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple of places where we have unreachable
because of using the u32
, maaaybe an enum would be cleaner and reduce a few checks? Or is the issue this needs to match exactly what the host is expecting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't seen at all by the host, but we're reading / writing this object directly to disk, so we need zerocopy
-friendly types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, zerocopy::TryFromBytes
can be derived for enums (though I'm not sure how annoying that would be to actually use here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be usable, but the docs seem to imply that you shouldn't it when round-tripping through bytes (?!).
I've opened google/zerocopy#2722 to ask for clarification
fn fail(err: drv_hf_api::HfError) { | ||
let mut buffer = [0; hf::idl::INCOMING_SIZE]; | ||
let mut server = hf::idl::FailServer::new(err); | ||
let mut server = hf::FailServer(err); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, do we lose the idl
generation of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah – I switched to finer-grained error types for a few methods, which means that the IDL generator can't make a FailServer
(which assumes a single error type).
drv/cosmo-hf/src/apob.rs
Outdated
pub(crate) fn write( | ||
&mut self, | ||
drv: &mut FlashDriver, | ||
offset: u64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the u64
everywhere? It seems like everything gets converted/checked against u32
anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The initial u64
is sent by the host, but I pushed the u32
conversion upstream into host_sp_comms
.
49afe3d
to
695c7d5
Compare
@mkeeter - here are the IPCC messages which are expected to be seen before the host is finished with APOB. Anything else incoming from the host should trigger the lockdown.
|
efc7039
to
8d35455
Compare
@citrus-it Great, thanks! I've pushed this list to Hubris and to RFD 593. |
8d35455
to
58e761d
Compare
58e761d
to
f883776
Compare
interrupts = {"uart7.irq" = "usart-irq"} | ||
priority = 8 | ||
max-sizes = {flash = 65952, ram = 65536} | ||
max-sizes = {flash = 67648, ram = 65536} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, feels a little unfortunate that this change impacts Gimlet at all, given that (IIUC) the APOB stuff is ~SP5-specific. Is this just due to the addition of new IPCC messages?
actual: u32, | ||
}, | ||
} | ||
ringbuf!(Trace, 16, Trace::None); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
16 entries is a pretty small ringbuf, might we want to also generate counters for the various operations like slot erase/sector erase/persistent data write, and/or the total number of errors?
} | ||
|
||
#[derive(Copy, Clone, PartialEq)] | ||
pub(crate) enum ApobState { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpicky, feel free to ignore: it might be nice to have a doc comment on this referencing the state diagram in https://rfd.shared.oxide.computer/rfd/593?expand=1#_production_strength_implementation ?
// We do a CRC32 of everything except the checksum, which is positioned | ||
// at the end of the struct and is a `u32` | ||
let size = core::mem::size_of::<ApobRawPersistentData>() | ||
- core::mem::size_of::<u32>(); | ||
c.update(&self.as_bytes()[..size]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
turbo nit, feel free to ignore: it occurs to me that this could also be achieved by having two #[repr(C)] #[derive(IntoBytes)]
structs and nesting the "used to calculate CRC32" parts inside the outer struct, like:
#[derive(
Copy, Clone, Eq, PartialEq, IntoBytes, FromBytes, Immutable, KnownLayout,
)]
#[repr(C)]
pub struct ApobRawPersistentData {
pub inner: ApobRawPersistentDataInner,
checksum: u32,
}
#[derive(
Copy, Clone, Eq, PartialEq, IntoBytes, FromBytes, Immutable, KnownLayout,
)]
#[repr(C)]
pub struct ApobRawPersistentDataInner {
oxide_magic: u32,
header_version: u32,
pub monotonic_counter: u64,
pub slot_select: u32,
}
impl ApobRawPersistentData {
// ...
fn expected_checksum(&self) -> u32 {
static CRC: crc::Crc<u32> = crc::Crc::<u32>::new(&crc::CRC_32_ISCSI);
let mut c = CRC.digest();
// We do a CRC32 of everything except the checksum, which is positioned
// at the end of the struct and is a `u32`
c.update(&self.inner.as_bytes());
c.finalize()
}
// ...
}
is this worth it? well...perhaps not!
/// Searches for an active slot in the metadata regions, updating the offset | ||
/// in the FPGA driver if found, and erases unused or invalid slots. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where does the erasure happen in this function? i don't think that happens in set_apob_offset
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: OH it's get_slot
that does that. this naming was shocking and disturbing to me, i would not expect get_thing
to also delete all the other thing
s.
pub dev: HfDevSelect, | ||
hash: HashData, | ||
|
||
// Used in the `apob` module for implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for implementation?
)] | ||
pub enum ApobBeginError { | ||
/// APOB is not implemented on this hardware | ||
NotImplemented = 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ImAGimlet
:)
IdolError, | ||
counters::Count, | ||
)] | ||
pub enum ApobCommitError { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for having separate error types for each operation <3
FPGA images and collateral are generated from: | ||
[this sha](https://github.com/oxidecomputer/quartz/commit/bdc5fb31e1905a1b66c19647fe2d156dd1b97b7b) | ||
[release](https://api.github.com/repos/oxidecomputer/quartz/releases/242278756) | ||
FPGA images and collateral are generated from some hw guy's local build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i assume this is going to be changed before merging this? :)
if let Ok(d) = data.try_into() { | ||
let hash = ApobHash::Sha256(d); | ||
match hf.apob_begin(length, hash) { | ||
Ok(()) => ApobBeginResult::Ok, | ||
Err(ApobBeginError::NotImplemented) => { | ||
ApobBeginResult::NotImplemented | ||
} | ||
Err(ApobBeginError::InvalidState) => { | ||
ApobBeginResult::InvalidState | ||
} | ||
Err(ApobBeginError::BadDataLength) => { | ||
ApobBeginResult::BadDataLength | ||
} | ||
} | ||
} else { | ||
ApobBeginResult::BadHashLength | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
take it or leave it, might be stupid, but we could make this less rightward-drift-y
if let Ok(d) = data.try_into() { | |
let hash = ApobHash::Sha256(d); | |
match hf.apob_begin(length, hash) { | |
Ok(()) => ApobBeginResult::Ok, | |
Err(ApobBeginError::NotImplemented) => { | |
ApobBeginResult::NotImplemented | |
} | |
Err(ApobBeginError::InvalidState) => { | |
ApobBeginResult::InvalidState | |
} | |
Err(ApobBeginError::BadDataLength) => { | |
ApobBeginResult::BadDataLength | |
} | |
} | |
} else { | |
ApobBeginResult::BadHashLength | |
} | |
let Ok(d) = data.try_into() else { | |
return ApobBeginResult::BadHashLength; | |
}; | |
let hash = ApobHash::Sha256(d); | |
match hf.apob_begin(length, hash) { | |
Ok(()) => ApobBeginResult::Ok, | |
Err(ApobBeginError::NotImplemented) => { | |
ApobBeginResult::NotImplemented | |
} | |
Err(ApobBeginError::InvalidState) => { | |
ApobBeginResult::InvalidState | |
} | |
Err(ApobBeginError::BadDataLength) => { | |
ApobBeginResult::BadDataLength | |
} | |
} |
Implements the state machine described in RFD 593
See https://github.com/oxidecomputer/stlouis/issues/707, https://github.com/oxidecomputer/rfd/pull/855 , oxidecomputer/amd-host-image-builder#222