Shutdown switch zone when sidecar disappears #9181

Nieuwejaar · 2025-10-09T14:31:40Z

Use a device contract to detect when a sidecar goes away.
Trigger a shutdown of the switch zone, so the tofino driver can be detached cleanly.

Nieuwejaar · 2025-10-09T14:45:57Z

This PR should allow us to cleanly handle the disappearance and/or re-appearance of a sidecar. The only known exception is if the transition happens while the switch zone is being set up. See: #9182.

jgallagher · 2025-10-09T18:22:37Z

sled-hardware/src/illumos/mod.rs

+                    let rt = tokio::runtime::Runtime::new().unwrap();
+                    rt.block_on(block_on_switch_zone());


This seems a little sketchy. It looks like we're kinda twisting ourselves in knots to bounce between sync and async, but IIUC we expect this monitor_tofino function to run practically forever, right? (It would only exit when the device disappears?) Should we spawn a real thread instead of using tokio::spawn_blocking? I believe that's what we do for the contract reaper bits on startup.

I don't see any reason not to spawn a new thread rather than a blocking task. That's not going to make this rt.block_on() go away though. The zone API is all async, so I need to jump back into async mode to watch for the zone disappearance. Either that, or write my own blocking version of zoneadm list.

Either that, or write my own blocking version of zoneadm list.

This seems prefereable - spawning a new runtime here (a) shouldn't .unwrap(), but handling errors seems painful and (b) is just using all the defaults, which means it would create a 128-thread worker pool just for this one task. We do already have a blocking zoneadm list: https://docs.rs/zone/latest/zone/struct.Adm.html#method.list_blocking

I think we ought to be one world or the other: either

Make this a thread, and get rid of all of the async, or

Make this monitor itself an async task, and get rid of any blocking calls it makes (which means it could call block_on_switch_zone().await without creating a new runtime)

Fair enough. I'll switch to a blocking zoneadm list.

We can't get rid of the task's blocking calls, because its core operation is a blocking call to ct_event_read_critical(). Unfortunately, despite the name, the underlying system call is ioctl rather than read, so we can't turn this into an AsyncFd and poll() on it.

I think that's the way I'd go, but a couple other possible options if there's some problem with that down the road:

I think (?) we could pass in a handle to the exiting tokio runtime, and use that instead of having to create our own. (I still don't love mixing sync / async in that way, but at least we'd solve the issues with creating a new runtime.)

If we wanted to make this an async task, we could spawn_blocking() the ioctl (and any other blocking calls).

Watch running zones than installed prior to acking tofino removal.

Nieuwejaar · 2025-10-15T15:12:59Z

I just pushed a couple of changes that should make the "tofino has gone away" handling more responsive.
Putting the tofino contract handler and devinfo poller into standalone threads reduces the frequency of that still-unexplained delay in getting SIGCLD when a child process exits. I am unable to reproduce this problem in a multithreaded C program, so I suspect there is some weird tokio interaction going on here.

Cargo.toml

jgallagher · 2025-10-22T18:47:57Z

sled-hardware/src/illumos/mod.rs

+                    // the device to be detached cleanly, which will
+                    // subsequently allow the device to be re-attached cleanly
+                    // if/when the sidecar is powered back on.
+                    block_on_switch_zone(&log);


This feels a lot cleaner as all synchronous code, thanks. I have a couple (possibly related?) concerns:

Fork to call zoneadm list every second in block_on_switch_zone() seems less than ideal - in theory sled-agent knows when it's shutting down the switch zone, so ought to be able to tell this thread when it's gone (as opposed to this thread doing its own monitoring). But that's probably pretty messy, and doing that cleanly might require the nontrivial rework related to sled-agent: HardwareMonitor task gets wedged if it tries to start up an unhealthy switch zone #8970 and switch zone setup should be interruptable #9182. We only go into this loop in the window between receiving CT_DEV_EV_OFFLINE from the kernel and ack'ing that event though, right? So the only sled-agents that go into the block_on_switch_zone() loop are those that previously had an attached tofino and now do not? (When I skimmed over this PR the very first time I misread this and thought we were monitoring the switch zone for the entire duration of the switch zone being up, which was much worse.)

My (very limited) understanding is that this potentially induces a temporary 3-way deadlock pretty similar to xde deadlock due to upcalls and sled-agent forks opte#758 involving fork / ioctl / snapshotting devinfo tree. I don't think we should merge this PR if we think it's still possible to induce that? I'm not sure if the best way forward is OS work to avoid locking or changing sled-agent to avoid one of the three-way issues (e.g., I think @dancrossnyc has mentioned moving the contract waiting into a separate process instead of a separate thread) or some combination of the two.

Yes, having the code that issues the zoneadm halt notify us when the command finishes would be ideal, but I couldn't figure out a clean way to set up that line of communication. In practice, the time from sending the "tofino is gone" message until the zone is shut down is usually only a couple of seconds, so this command generally only gets called 1-3 times. The exception is if the tofino disappears while the zone is still being set up, which is a separate issue.

I'd have to spend some time digging to be sure, but I don't think moving the contract code into a separate process will fix the problem. I believe that any process that attempts to access devinfo will get blocked by the lock held by the kernel code waiting for the contract event to be acked. If we moved the sled-agent code that accesses devinfo into a separate process, I believe that would fix the issue - in the sense that the new process would be blocked rather than sled-agent.

I don't think that's true; the other missing ingredient here is vfork. The problem, as I came to understand it, is that you've got a thread trying to execute vforkx that is trying to quiesce every LWP in the process. However, another LWP is blocked in a synchronous ioctl that's got some kind of hold on the dev tree, but that thread has been put to sleep in order to accommodate a concurrent call to vforkx. Meanwhile, you have yet another LWP in the same process that is spinning trying to make a copy of the devtree, and that is blocked by the hold owned by the sleeping thread blocked in the contract ioctl.

My theory is that, if you remove the vfork from the picture, the lwp in the contracts code won't be blocked at a quiescent point, and will relinquish the hold on the dev tree while it's waiting for an event. I suppose I may be wrong, and it may still deadlock against the devtree snapshotting code, but I kind of feel like that would have been observed before if it were the case: I mean, wouldn't prtconf block in that case? In any event, if we can take the vfork out of the equation here we're reducing the surface area for concurrency problems.

For that matter, I think we should also move the code that is spawning external commands into a separate "runner" process.

I don't know how, or if, it affects the rest of your analysis, but the contract thread is not in an ioctl. It uses one ioctl to get the ID of the contract being executed and a second ioctl to acknowledge that we've done all we need to do to allow the device detach to continue. Between the ioctls, we are free to do as we wish in user space.

The devtree lock is held in the kernel by the devcfg infrastructure, calling e_ddi_offline_notify() -> contract_device_offline()->contract_device_publish()

The reason we haven't seen this block before, in prtconf or anything else, is that the contract is new. That can hold the lock for up to 60 seconds before timing out. If we ack the contract in a timely fashion, that lock gets released well before it hits the timeout. If you try doing a prtconf from the command line while sled-agent is hung here, that prtconf will also hang. I suspect ls /devices will also hang, but I'm much less confident about that.

We could have a dead simple binary that does nothing but wait on the contract, halt the switch zone when the tofino disappears, ack the contract event, and then let sled-agent recover from that. This would work because that binary would never touch the dev tree, and thus wouldn't get blocked on the dev tree lock. This would also sidestep the "uninterruptible switch zone setup" issue. The downside, of course, is that this is appallingly hacky.

So it would appear that, yes, there really is a thread blocked on an ioctl waiting for an event in the contracts code.

Oh! I think you're seeing a different thread waiting on a different contract.

No, that really is the device contract event receive ioctl.

We also have a thread that collects process events to track the completion of jobs being run in zones.

For example:

> fffffcfa3b1cb3c0::findstack -v stack pointer for thread fffffcfa3b1cb3c0 (sled-agent/131): fffff7880dd029f0 [ fffff7880dd029f0 _resume_from_idle+0x12b() ] fffff7880dd02a20 swtch+0x139() fffff7880dd02a90 stop+0x445(7, 0) fffff7880dd02b30 issig_forreal+0x227() fffff7880dd02b50 issig+0x15(0) fffff7880dd02bc0 cv_wait_sig+0x28f(fffffcfa35f1a6c4, fffffcfa30a65518) fffff7880dd02cb0 cte_get_event+0x93(fffffcfa35f1a690, 0, 56d1c00, 0, 0, 1) fffff7880dd02d10 ctfs_endpoint_ioctl+0xe1(fffffcfa35f1a688, 63746502, 56d1c00, fffffcfa40bccbb0, fffffffffc4a9900, 0) fffff7880dd02d50 ctfs_bu_ioctl+0x4e(fffffcfa4230b180, 63746502, 56d1c00, 202001, fffffcfa40bccbb0, fffff7880dd02e18) fffff7880dd02de0 fop_ioctl+0x40(fffffcfa4230b180, 63746502, 56d1c00, 202001, fffffcfa40bccbb0, fffff7880dd02e18) fffff7880dd02f00 ioctl+0x144(a, 63746502, 56d1c00) fffff7880dd02f10 sys_syscall+0x17d() > fffffcfa4230b180::print vnode_t v_path v_path = 0xfffffcfa3df88080 "/system/contract/process/pbundle" ``

Yes. That call is blocked, waiting to be signaled by contract_device_publish. But it is also stopped, due to vfork, and hence, unable to return to user space and initiate shutting down the zone and ACK'ing the detach.

I'm trying to get london back into a state where I can reproduce this. While that's in progresss...

The reason any of this came to light at all, is because I noticed that the zoneadm halt and/or zoneadm uninstall were taking an unreasonable time to be wait()ed on. Neither of those zoneadm commands will be started until after that initial ioctl() returns. It is the completion of that ioctl() that causes us to kick off the zone shutdown. You can see an instance of this in the matrix chat.

It seems. like you've been looking at a variant in which we get hung even before the contract event reaches sled-agent, but that's not the more common case that I've been seeing.

It's the same underlying issue. In that case, the CT_ERECV ioctl returns and initiates the zone shutdown via spawning zoneadm. But contract_device_publish is still waiting for an ACK from user space with a hold on the detaching devinfo tree node; concurrently, another thread comes along and tries to snapshot the devinfo tree (we saw this in some of the earlier truss output you shared) and blocks on the hold owned by the detach code, and there is a thread vfork'ing. The vforking thread quiesces the thread that should generate the ACK, preventing it from running (and thus re-entering the kernel to send the ACK that contract_device_publish is waiting for), which in turn blocks the snapshotting thread, which blocks vfork, which inhibits delivery of the SIGCLD. Eventually the detaching code times out on waiting for the contract ACK, releases its hold on the devinfo node which unblocks the snapshotting LWP, which then quiesces, unblocking the vfork'ing LWP, which subsequently unblocks everything, and the SIGCLD is delivered.

So it would appear that, yes, there really is a thread blocked on an ioctl waiting for an event in the contracts code.

Oh! I think you're seeing a different thread waiting on a different contract.

No, that really is the device contract event receive ioctl.

We also have a thread that collects process events to track the completion of jobs being run in zones.
For example:

> fffffcfa3b1cb3c0::findstack -v stack pointer for thread fffffcfa3b1cb3c0 (sled-agent/131): fffff7880dd029f0 [ fffff7880dd029f0 _resume_from_idle+0x12b() ] fffff7880dd02a20 swtch+0x139() fffff7880dd02a90 stop+0x445(7, 0) fffff7880dd02b30 issig_forreal+0x227() fffff7880dd02b50 issig+0x15(0) fffff7880dd02bc0 cv_wait_sig+0x28f(fffffcfa35f1a6c4, fffffcfa30a65518) fffff7880dd02cb0 cte_get_event+0x93(fffffcfa35f1a690, 0, 56d1c00, 0, 0, 1) fffff7880dd02d10 ctfs_endpoint_ioctl+0xe1(fffffcfa35f1a688, 63746502, 56d1c00, fffffcfa40bccbb0, fffffffffc4a9900, 0) fffff7880dd02d50 ctfs_bu_ioctl+0x4e(fffffcfa4230b180, 63746502, 56d1c00, 202001, fffffcfa40bccbb0, fffff7880dd02e18) fffff7880dd02de0 fop_ioctl+0x40(fffffcfa4230b180, 63746502, 56d1c00, 202001, fffffcfa40bccbb0, fffff7880dd02e18) fffff7880dd02f00 ioctl+0x144(a, 63746502, 56d1c00) fffff7880dd02f10 sys_syscall+0x17d() > fffffcfa4230b180::print vnode_t v_path v_path = 0xfffffcfa3df88080 "/system/contract/process/pbundle" ``

Yes. That call is blocked, waiting to be signaled by contract_device_publish. But it is also stopped, due to vfork, and hence, unable to return to user space and initiate shutting down the zone and ACK'ing the detach.

Ah, waitasec...yeah, you're right here. This ioctl is blocked on the process contract; fd 16 is the device contract. Regardless, the analysis is sound and the conclusion still holds: it's entirely possible for vfork to prevent the CT_ERECV ioctl on the device contract from returning, and a concurrent devinfo tree snapshot to block vfork, thus starving the entire thing. In this case, indeed the code was executing after the event was received, but vfork prevents it from reentering the kernel to send the ACK, and the detach code keeping a hold on the devnode blocks the snapshotting code that blocks the vfork until the timeout expires and everything is unwound.

There is a coredump at /net/catacomb/data/staff/core/omicron-9181.
This is the thread handling the disappearance of the tofino device:

> fffffcfa75e82c00::findstack stack pointer for thread fffffcfa75e82c00 (devctl/1): fffff7881436c450 [ fffff7881436c450 _resume_from_idle+0x12b() ] fffff7881436c480 swtch+0x139() fffff7881436c500 cv_timedwait_hires+0xd7() fffff7881436c550 cv_timedwait+0x52() fffff7881436c590 ct_barrier_wait_for_empty+0x4a() fffff7881436c600 wait_for_acks+0x40() fffff7881436c6e0 contract_device_publish+0x53c() fffff7881436c750 contract_device_offline+0x61() fffff7881436c790 e_ddi_offline_notify+0x34() fffff7881436c800 devi_detach_node+0x5f() fffff7881436c880 unconfig_immediate_children+0x16f() fffff7881436c8d0 ndi_busop_bus_unconfig+0x77() fffff7881436c930 tofino_bus_unconfig+0x67() fffff7881436c9c0 devi_unconfig_common+0x19f() fffff7881436ca10 devi_unconfig_branch+0x69() fffff7881436caa0 ndi_devi_unconfig_one+0xb3() fffff7881436cb00 ndi_devctl_device_offline+0xb2() fffff7881436cb80 ndi_devctl_ioctl+0x95() fffff7881436cbf0 pcie_ioctl+0xa9() fffff7881436cc80 pcieb_ioctl+0xa3() fffff7881436ccc0 cdev_ioctl+0x3f() fffff7881436cd10 spec_ioctl+0x55() fffff7881436cda0 fop_ioctl+0x40() fffff7881436cec0 ioctl+0x144() fffff7881436cf10 sys_syscall32+0x105()

This is the sled-agent thread accessing devinfo:

stack pointer for thread fffffcfa39bb6000 (sled-agent/135): fffff7880c3c7950 [ fffff7880c3c7950 _resume_from_idle+0x12b() ] fffff7880c3c7980 swtch+0x139() fffff7880c3c79b0 cv_wait+0x70() fffff7880c3c79f0 mt_config_fini+0x3b() fffff7880c3c7a30 config_grand_children+0x3c() fffff7880c3c7a80 devi_config_common+0xe6() fffff7880c3c7aa0 ndi_devi_config+0x1a() fffff7880c3c7b00 di_copytree+0xd8() fffff7880c3c7be0 di_snapshot+0x1dc() fffff7880c3c7c10 di_snapshot_and_clean+0x24() fffff7880c3c7cc0 di_ioctl+0x5bf() fffff7880c3c7d00 cdev_ioctl+0x3f() fffff7880c3c7d50 spec_ioctl+0x55() fffff7880c3c7de0 fop_ioctl+0x40() fffff7880c3c7f00 ioctl+0x144() fffff7880c3c7f10 sys_syscall+0x17d()

I believe this is the device contract thread. It has received the remove event, and is doing a zoneadm list waiting for the switch zone to disappear. It may also be the thread that is doing the zoneadm halt. It's probably just luck of the draw as to which thread fork()s first.

stack pointer for thread fffffcfa37b90460 (sled-agent/134): fffff7880ca72d90 [ fffff7880ca72d90 _resume_from_idle+0x12b() ] fffff7880ca72dc0 swtch+0x139() fffff7880ca72df0 cv_wait+0x70() fffff7880ca72e30 holdlwps+0xc7() fffff7880ca72ee0 cfork+0x93() fffff7880ca72f00 forksys+0x6f() fffff7880ca72f10 sys_syscall+0x17d()

This is thread watching for process contract events:

stack pointer for thread fffffcfa3b1cb3c0 (sled-agent/131): fffff7880dd029f0 [ fffff7880dd029f0 _resume_from_idle+0x12b() ] fffff7880dd02a20 swtch+0x139() fffff7880dd02a90 stop+0x445() fffff7880dd02b30 issig_forreal+0x227() fffff7880dd02b50 issig+0x15() fffff7880dd02bc0 cv_wait_sig+0x28f() fffff7880dd02cb0 cte_get_event+0x93() fffff7880dd02d10 ctfs_endpoint_ioctl+0xe1() fffff7880dd02d50 ctfs_bu_ioctl+0x4e() fffff7880dd02de0 fop_ioctl+0x40() fffff7880dd02f00 ioctl+0x144() fffff7880dd02f10 sys_syscall+0x17d()

After walking through the kernel dump, and discussing this mess, with Andy and Dan, my current plan is to move to contract monitoring into a child process. Upon receiving a device removal event, that process will shut down the switch zone, ack the event (allowing the device to be detached), and then notify sled-agent so it can do its internal switch zone cleanup.

With this approach, sled-agent's devinfo probing will still get blocked if invoked while the contract is live, but in the absence of a vfork() no other threads should be affected. That block should only last for as long as it takes to shut down the switch zone, which should only be a second or two.

jgallagher · 2025-10-22T19:05:24Z

illumos-utils/src/contract.rs

+        typ: ContractType,
+        ctid: ctid_t,
+    ) -> Result<Self, crate::ExecutionError> {
+        let path = path(typ, Some(ctid), "ctl");


It looks like when this is called by process_contract_reaper, this constructs the path

format!("/system/contract/process/{}/ctl", ctid)

but the old abandon_contract() function constructed the path

format!("/proc/{}/contracts/{}/ctl", process::id(), ctid)

Are those equivalent?

This module has enough rework of the contract wrapping (and has a fair bit of unsafe) that I'd love to get a couple different reviews on it - maybe @citrus-it could take a look too? I'm happy to review from a purely FFI point of view, but I'm having a hard time comparing the old and new when it's in a different file and has a bunch of structural changes, so am a lot less confident nothing changed accidentally.

Yes, those are equivalent. The /proc entry is a symlink to /system/contract

BRM42220051 # ctstat -i 9744 CTID ZONEID TYPE STATE HOLDER EVENTS QTIME NTIME 9744 31 process owned 25149 0 - - BRM42220051 # ls -l /proc/25149/contracts/ lr-x------ 1 root root 0 Oct 23 05:47 9744 -> /system/contract/process/9744

illumos-utils/src/contract.rs

jgallagher · 2025-10-22T19:11:48Z

illumos-utils/src/contract.rs

+                        cpath.iter().map(|c| *c as u8).collect();
+                    Err(err(format!(
+                        "set_minor to {} in device template",
+                        str::from_utf8(&cpath[..]).unwrap()


cpath is coming from an external source, so it's probably best not to panic here if it somehow contains non-UTF8 data. Maybe String::from_utf8_lossy(cpath) instead?

jgallagher · 2025-10-22T19:19:36Z

illumos-utils/src/contract.rs

+        // We also need to drop any bytes after the terminating NULL.
+        let left = 10;
+        let right = path_buf.iter().position(|a| *a == 0).unwrap();
+        Some(path_buf[left..right + 1].to_vec())


Should this return an Option<CString> instead? It looks like that's what the caller really wants (and ends up allocating a second Vec<u8> to do the conversion). Totally untested but something like

let path = path_buf .iter() .skip(left) .take_while(|c| *c != 0) .map(|c| c as u8); Some(CString::new(path.collect()).unwrap())

citrus-it · 2025-10-24T12:18:10Z

I've opened https://www.illumos.org/issues/17688 to cover the deadlock opportunity afforded by the illumos device contracts system. I think we have a path forward for this change that doesn't involve waiting for that to be fixed in illumos.

Nieuwejaar added 5 commits October 6, 2025 22:34

Handle surprise removal of a sidecar

18731b9

fix linux build

1e819ce

try again

0fd9ae8

cleanup

c61e606

cleanup

df846ca

Nieuwejaar requested a review from jgallagher October 9, 2025 14:31

jgallagher reviewed Oct 9, 2025

View reviewed changes

Nieuwejaar added 2 commits October 14, 2025 10:26

monitor tofino should not drop back into async

ec2a0b7

Move tofino and hardware monitors into their own threads.

3ee0512

Watch running zones than installed prior to acking tofino removal.

jgallagher reviewed Oct 22, 2025

View reviewed changes

citrus-it self-requested a review October 23, 2025 14:51

		let rt = tokio::runtime::Runtime::new().unwrap();
		rt.block_on(block_on_switch_zone());

Shutdown switch zone when sidecar disappears #9181

Are you sure you want to change the base?

Shutdown switch zone when sidecar disappears #9181

Uh oh!

Conversation

Nieuwejaar commented Oct 9, 2025

Uh oh!

Nieuwejaar commented Oct 9, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jgallagher Oct 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Nieuwejaar commented Oct 15, 2025

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Nieuwejaar Oct 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

citrus-it commented Oct 24, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jgallagher Oct 13, 2025 •

edited

Loading

Nieuwejaar Oct 22, 2025 •

edited

Loading