Description
In #9720, we encountered an OS bug that caused IPCC ioctls to hang indefinitely. In sled-agent, we issue these ioctls directly from tokio worker threads, so several worker threads got stuck, and we eventually hit #9619, where the entire runtime blocked even though some worker threads were still parked / idle. #9619 proposes the general workaround of periodically spawning a new task into the runtime, which unsticks a runtime that has stalled because the one thread responsible for I/O is blocked polling the future that woke it up. However, it seems unlikely this would have helped much in the #9720 case. We probably would have only delayed the sled-agent hang, because eventually we would have issued enough IPCC calls to hang all the worker threads.
I'm inclined to say we should treat IPCC calls as "blocking I/O" calls and put them in spawn_blocking. That seems pretty accurate, since we're doing I/O over a UART to the SP (and/or RoT, depending on the IPCC command). But I'm not sure what that would do in a case like #9720: if every IPCC call hangs, would we eventually exhaust the spawn_blocking pool? Presumably sled-agent would remain generally responsive (except in paths that depend on those IPCC calls?), but what would happen in the limit?
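To make the "in the limit" question concrete, here is a minimal, std-only sketch of one possible mitigation: cap how many threads may be stuck in a blocking call at once, give each call a deadline, and fail fast once the cap is reached. This is not sled-agent code. `BoundedBlocking`, `CallError`, and `call` are hypothetical names, and plain threads stand in for the blocking pool so the example has no dependencies. In tokio the rough equivalent would presumably be `spawn_blocking` combined with a semaphore and a timeout. My understanding, which is worth checking against the tokio docs, is that the blocking pool is capped by `max_blocking_threads` and that further `spawn_blocking` closures queue once every thread is busy, so callers would wait on calls that never start.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

/// Why a bounded call did not produce a result (hypothetical type).
#[derive(Debug, PartialEq)]
enum CallError {
    /// Too many earlier calls are still stuck; fail fast instead of piling up.
    TooManyInFlight,
    /// The call did not finish in time. Its thread is left running, because
    /// a thread blocked in a syscall cannot be cancelled from the outside.
    TimedOut,
}

/// Releases one in-flight slot when dropped, including on panic.
struct Slot(Arc<AtomicUsize>);

impl Drop for Slot {
    fn drop(&mut self) {
        self.0.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Hypothetical helper that caps the number of threads that may be blocked
/// in a potentially hanging call at the same time.
struct BoundedBlocking {
    in_flight: Arc<AtomicUsize>,
    max_in_flight: usize,
}

impl BoundedBlocking {
    fn new(max_in_flight: usize) -> Self {
        Self { in_flight: Arc::new(AtomicUsize::new(0)), max_in_flight }
    }

    /// Runs `f` on a dedicated thread and waits up to `timeout` for it.
    fn call<T, F>(&self, timeout: Duration, f: F) -> Result<T, CallError>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        // Reserve a slot; back out if the cap was already reached.
        if self.in_flight.fetch_add(1, Ordering::SeqCst) >= self.max_in_flight {
            self.in_flight.fetch_sub(1, Ordering::SeqCst);
            return Err(CallError::TooManyInFlight);
        }
        let slot = Slot(Arc::clone(&self.in_flight));
        let (tx, rx) = mpsc::channel();
        thread::spawn(move || {
            let out = f();
            // The slot is freed only once the call has actually returned.
            drop(slot);
            // The receiver may have given up already; ignore that error.
            let _ = tx.send(out);
        });
        // Simplification: a panic in `f` disconnects the channel and is
        // also reported as `TimedOut` here.
        rx.recv_timeout(timeout).map_err(|_| CallError::TimedOut)
    }
}
```

With a cap of 2, two hung calls each return `TimedOut` after their deadline, and a third call returns `TooManyInFlight` immediately instead of adding another stuck thread. The stuck threads themselves are never reclaimed, so this bounds the damage rather than fixing the hang. Paths that depend on IPCC would still fail, but they would fail with an error instead of waiting forever.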