Skip to content

Conversation

mkeeter
Copy link
Collaborator

@mkeeter mkeeter commented Feb 7, 2025

hf.bonus_sector_erase(offset)
.map_err(|err| APOBError::EraseFailed { offset, err })?;
} else {
// Read back the page and confirm that it's all empty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this fail if there's a blip in the IPCC path and the host resends an APOB request? (I'm not sure what the expectations are for the offsets the host is providing.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has changed a bunch since February; messages should now all be idempotent.

@mkeeter mkeeter marked this pull request as draft February 7, 2025 17:08
Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some mostly kind of annoying nitpicks.

Comment on lines 1079 to 1083
fn apob_write(
hf: &HostFlash,
mut offset: u64,
data: &[u8],
) -> Result<(), ApobError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we need something in the broader API to discover what the erase size granularity is or at least make sure that we're sending stuff that is page size aligned. This gets to what @jgallagher gets at below. But if the host sent things that wasn't page aligned then we'd erase the entire page because our API is not doing a read-modify-write.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has all changed now – we ensure that the to-be-written page is erased when the state machine starts, and writes are idempotent.

@mkeeter mkeeter force-pushed the mkeeter/ipcc-apob branch 3 times, most recently from 14165d3 to b0acdc3 Compare September 26, 2025 15:37
@mkeeter mkeeter force-pushed the mkeeter/ipcc-apob branch from f7e8314 to 4a07f2b Compare October 1, 2025 14:00
@mkeeter mkeeter marked this pull request as ready for review October 1, 2025 14:13
@mkeeter mkeeter force-pushed the mkeeter/ipcc-apob branch from ded1fa1 to ad93477 Compare October 1, 2025 14:27
Copy link
Collaborator

@labbott labbott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will probably do another pass later

checksum: 0, // dummy value
};
out.checksum = out.expected_checksum();
assert!(out.is_valid());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert becomes panic, is that the behavior we want here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so; the only way this should panic is if someone has broken the code in a dramatic way (e.g. editing the implementation of is_valid so that previously valid data is no longer valid).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is also copied from HfRawPersistentData, so I didn't think about it too much!

Comment on lines +265 to +267
/// Either 0 or 1; directly translatable to [`ApobSlot`]
pub slot_select: u32,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple of places where we have unreachable because of using the u32, maaaybe an enum would be cleaner and reduce a few checks? Or is the issue this needs to match exactly what the host is expecting?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't seen at all by the host, but we're reading / writing this object directly to disk, so we need zerocopy-friendly types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, zerocopy::TryFromBytes can be derived for enums (though I'm not sure how annoying that would be to actually use here)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be usable, but the docs seem to imply that you shouldn't it when round-tripping through bytes (?!).

I've opened google/zerocopy#2722 to ask for clarification

fn fail(err: drv_hf_api::HfError) {
let mut buffer = [0; hf::idl::INCOMING_SIZE];
let mut server = hf::idl::FailServer::new(err);
let mut server = hf::FailServer(err);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, do we lose the idl generation of this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah – I switched to finer-grained error types for a few methods, which means that the IDL generator can't make a FailServer (which assumes a single error type).

pub(crate) fn write(
&mut self,
drv: &mut FlashDriver,
offset: u64,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the u64 everywhere? It seems like everything gets converted/checked against u32 anyway

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initial u64 is sent by the host, but I pushed the u32 conversion upstream into host_sp_comms.

@mkeeter mkeeter force-pushed the mkeeter/ipcc-apob branch 2 times, most recently from 49afe3d to 695c7d5 Compare October 2, 2025 14:30
@citrus-it
Copy link
Contributor

@mkeeter - here are the IPCC messages which are expected to be seen before the host is finished with APOB. Anything else incoming from the host should trigger the lockdown.

humility: ring buffer task_host_sp_comms::__RINGBUF in host_sp_comms:
   TOTAL VARIANT
      99 Request(ApobRead)
      96 Request(ApobData)
       2 Request(KeyLookup)
       1 Request(GetBootStorageUnit)
       1 Request(GetIdentity)
       1 Request(GetStatus)
       1 Request(AckSpStart)
       1 Request(ApobBegin)
       1 Request(ApobCommit)

@mkeeter mkeeter force-pushed the mkeeter/ipcc-apob branch from efc7039 to 8d35455 Compare October 2, 2025 20:35
@mkeeter
Copy link
Collaborator Author

mkeeter commented Oct 2, 2025

@citrus-it Great, thanks! I've pushed this list to Hubris and to RFD 593.

@mkeeter mkeeter force-pushed the mkeeter/ipcc-apob branch from 8d35455 to 58e761d Compare October 2, 2025 20:40
@mkeeter mkeeter force-pushed the mkeeter/ipcc-apob branch from 58e761d to f883776 Compare October 3, 2025 15:15
interrupts = {"uart7.irq" = "usart-irq"}
priority = 8
max-sizes = {flash = 65952, ram = 65536}
max-sizes = {flash = 67648, ram = 65536}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, feels a little unfortunate that this change impacts Gimlet at all, given that (IIUC) the APOB stuff is ~SP5-specific. Is this just due to the addition of new IPCC messages?

actual: u32,
},
}
ringbuf!(Trace, 16, Trace::None);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

16 entries is a pretty small ringbuf, might we want to also generate counters for the various operations like slot erase/sector erase/persistent data write, and/or the total number of errors?

}

#[derive(Copy, Clone, PartialEq)]
pub(crate) enum ApobState {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpicky, feel free to ignore: it might be nice to have a doc comment on this referencing the state diagram in https://rfd.shared.oxide.computer/rfd/593?expand=1#_production_strength_implementation ?

Comment on lines +309 to +313
// We do a CRC32 of everything except the checksum, which is positioned
// at the end of the struct and is a `u32`
let size = core::mem::size_of::<ApobRawPersistentData>()
- core::mem::size_of::<u32>();
c.update(&self.as_bytes()[..size]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turbo nit, feel free to ignore: it occurs to me that this could also be achieved by having two #[repr(C)] #[derive(IntoBytes)] structs and nesting the "used to calculate CRC32" parts inside the outer struct, like:

#[derive(
    Copy, Clone, Eq, PartialEq, IntoBytes, FromBytes, Immutable, KnownLayout,
)]
#[repr(C)]
pub struct ApobRawPersistentData {
    pub inner: ApobRawPersistentDataInner,
    checksum: u32,
}

#[derive(
    Copy, Clone, Eq, PartialEq, IntoBytes, FromBytes, Immutable, KnownLayout,
)]
#[repr(C)]
pub struct ApobRawPersistentDataInner {
    oxide_magic: u32,
    header_version: u32,
    pub monotonic_counter: u64,
    pub slot_select: u32,
}

impl ApobRawPersistentData {

    // ...

    fn expected_checksum(&self) -> u32 {
        static CRC: crc::Crc<u32> = crc::Crc::<u32>::new(&crc::CRC_32_ISCSI);
        let mut c = CRC.digest();
        // We do a CRC32 of everything except the checksum, which is positioned
        // at the end of the struct and is a `u32`
        c.update(&self.inner.as_bytes());
        c.finalize()
    }
    
    // ...
}

is this worth it? well...perhaps not!

Comment on lines +350 to +351
/// Searches for an active slot in the metadata regions, updating the offset
/// in the FPGA driver if found, and erases unused or invalid slots.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where does the erasure happen in this function? i don't think that happens in set_apob_offset, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: OH it's get_slot that does that. this naming was shocking and disturbing to me, i would not expect get_thing to also delete all the other things.

pub dev: HfDevSelect,
hash: HashData,

// Used in the `apob` module for implementation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for implementation?

)]
pub enum ApobBeginError {
/// APOB is not implemented on this hardware
NotImplemented = 1,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ImAGimlet :)

IdolError,
counters::Count,
)]
pub enum ApobCommitError {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for having separate error types for each operation <3

FPGA images and collateral are generated from:
[this sha](https://github.com/oxidecomputer/quartz/commit/bdc5fb31e1905a1b66c19647fe2d156dd1b97b7b)
[release](https://api.github.com/repos/oxidecomputer/quartz/releases/242278756)
FPGA images and collateral are generated from some hw guy's local build
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i assume this is going to be changed before merging this? :)

Comment on lines +1134 to +1150
if let Ok(d) = data.try_into() {
let hash = ApobHash::Sha256(d);
match hf.apob_begin(length, hash) {
Ok(()) => ApobBeginResult::Ok,
Err(ApobBeginError::NotImplemented) => {
ApobBeginResult::NotImplemented
}
Err(ApobBeginError::InvalidState) => {
ApobBeginResult::InvalidState
}
Err(ApobBeginError::BadDataLength) => {
ApobBeginResult::BadDataLength
}
}
} else {
ApobBeginResult::BadHashLength
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take it or leave it, might be stupid, but we could make this less rightward-drift-y

Suggested change
if let Ok(d) = data.try_into() {
let hash = ApobHash::Sha256(d);
match hf.apob_begin(length, hash) {
Ok(()) => ApobBeginResult::Ok,
Err(ApobBeginError::NotImplemented) => {
ApobBeginResult::NotImplemented
}
Err(ApobBeginError::InvalidState) => {
ApobBeginResult::InvalidState
}
Err(ApobBeginError::BadDataLength) => {
ApobBeginResult::BadDataLength
}
}
} else {
ApobBeginResult::BadHashLength
}
let Ok(d) = data.try_into() else {
return ApobBeginResult::BadHashLength;
};
let hash = ApobHash::Sha256(d);
match hf.apob_begin(length, hash) {
Ok(()) => ApobBeginResult::Ok,
Err(ApobBeginError::NotImplemented) => {
ApobBeginResult::NotImplemented
}
Err(ApobBeginError::InvalidState) => {
ApobBeginResult::InvalidState
}
Err(ApobBeginError::BadDataLength) => {
ApobBeginResult::BadDataLength
}
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants