Sled-agent hang #9720

@askfongjojo

This was seen on rack2, running a fairly recent build (5eb1337/4e5b80e) that includes #9612. The problem appeared when I attempted to stop two large VMs: one via the Nexus API (vm1-0; the request timed out and the propolis remained), the other via an in-guest shutdown command (vm1-1; the propolis zone was destroyed but a new one came up to replace it).

I was actively using these VMs to test VM-to-VM send/receive throughput with iperf3; a couple of the runs used 32 parallel streams.

Here are the current states of the two VMs, along with the other instances on the same sled (BRM44220005):

root@oxz_switch0:~# omdb db instances --running | egrep 'ID|BRM44220005' 2>/dev/null
ID                                   STATE   INTENT  PROPOLIS_ID                          SLED_ID                              HOST_SERIAL NAME                          
b2ac1cb1-424e-46f7-96f7-afb3311276f7 running running 74063f78-69c1-4cae-9ccc-85152e7dc9e9 f15774c1-b8e5-434f-a493-ec43f96cba06 BRM44220005 ping-sender-00                
ef230468-fbb5-42b7-afdd-7402b6b11408 running running f0a345b4-3691-4166-be94-bb6c569483b5 f15774c1-b8e5-434f-a493-ec43f96cba06 BRM44220005 sugarplum-3                   
04e79415-67da-4a23-97f7-612a3bd166cc running running 63904b13-f67a-4072-8022-c6677f4437ca f15774c1-b8e5-434f-a493-ec43f96cba06 BRM44220005 i2                            
4ba709ae-a878-44de-8532-23c020438bd1 running running 25df0a91-2f0e-4310-b4ca-00bd5a56604e f15774c1-b8e5-434f-a493-ec43f96cba06 BRM44220005 manydisks                     
8d1ec0fb-c39a-4cf8-8dd5-b24841580e41 running running 39faf316-4ff8-4827-af63-bcf9b52aa122 f15774c1-b8e5-434f-a493-ec43f96cba06 BRM44220005 uploader                      
9a02ea6f-d12c-40f4-906e-9c5cda2f03ef running stopped f757204e-4ae5-458f-a588-b3eed7faccb0 f15774c1-b8e5-434f-a493-ec43f96cba06 BRM44220005 vm1-noble-1                   
c488ed75-6125-4545-ab16-7b119603fc01 running stopped 19e169fc-1ffe-4515-94d6-0df5f6879c75 f15774c1-b8e5-434f-a493-ec43f96cba06 BRM44220005 vm1-noble-0 
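
For anyone triaging similar output, the rows of interest above are the ones where STATE and INTENT disagree. The following is only a rough sketch for picking them out of saved `omdb db instances` output. It is not part of omdb; the function name `find_mismatched` is hypothetical, and it assumes the seven whitespace-separated columns shown above with no spaces inside instance names.

```python
def find_mismatched(omdb_output: str) -> list[tuple[str, str, str]]:
    """Return (name, state, intent) for rows whose STATE differs from INTENT.

    Assumes the column order shown in this issue:
    ID STATE INTENT PROPOLIS_ID SLED_ID HOST_SERIAL NAME
    """
    mismatched = []
    for line in omdb_output.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and the header row.
        if len(fields) < 7 or fields[0] == "ID":
            continue
        state, intent, name = fields[1], fields[2], fields[6]
        if state != intent:
            mismatched.append((name, state, intent))
    return mismatched


# Two rows copied from the output above, plus the header.
SAMPLE = """\
ID                                   STATE   INTENT  PROPOLIS_ID                          SLED_ID                              HOST_SERIAL NAME
8d1ec0fb-c39a-4cf8-8dd5-b24841580e41 running running 39faf316-4ff8-4827-af63-bcf9b52aa122 f15774c1-b8e5-434f-a493-ec43f96cba06 BRM44220005 uploader
9a02ea6f-d12c-40f4-906e-9c5cda2f03ef running stopped f757204e-4ae5-458f-a588-b3eed7faccb0 f15774c1-b8e5-434f-a493-ec43f96cba06 BRM44220005 vm1-noble-1
"""

if __name__ == "__main__":
    for name, state, intent in find_mismatched(SAMPLE):
        print(f"{name}: state={state} intent={intent}")
```

Applied to the full listing above, this would flag vm1-noble-0 and vm1-noble-1, the two instances that were asked to stop but are still reported as running.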

It turned out that sled-agent is hung, which explains the instance behavior:

BRM44220005 # ls -l /var/svc/log/oxide-sled-agent\:default.log
-rw-r--r--   1 root     root           0 Jan 26 03:10 /var/svc/log/oxide-sled-agent:default.log

I'm leaving everything as is for live debugging.
