We had a discussion about the us-east loadbalancer getting slow. Initial inspection showed that the network interface was rarely pushing more than 180 Mbps out. Digging deeper, we found that some nginx processes had become stuck in uninterruptible sleep ("D" in ps output). Looking in dmesg after an echo w > /proc/sysrq-trigger showed that they were stuck in the kernel during an fsync.
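For reference, a quick way to spot this state before reaching for sysrq is to list D-state processes directly. A minimal sketch (the awk filter is illustrative, not part of our tooling):

```shell
# List processes currently in uninterruptible sleep ("D" in the ps
# state column). Stuck nginx workers would show up here with state D.
ps -eo state,pid,comm | awk '$1 == "D" { print $2, $3 }'

# Then, as above, dump kernel stacks of all blocked tasks into the
# kernel log and read them back (requires root):
#   echo w > /proc/sysrq-trigger
#   dmesg | tail -n 100
```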
This is a fairly pathological failure, but there are a few things we could do to ameliorate it:

- Increase the number of nginx processes. Right now we only have 2 processes on the loadbalancer; we could probably double this and go up to 2x per core (so 4 processes total) without any harmful effects, which would at least delay the problem in the future. This needs a templating step on the "optimized" nginx config to insert $(($(nproc) * 2)) into the worker_processes directive.

I am loath to do something drastic like auto-rebooting the loadbalancer, because it is supposed to be the piece you don't have to reboot. If this happens again, I'll consider it.
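The templating step above could be as simple as a sed substitution at deploy time. A sketch, assuming a template file nginx.conf.template with a __WORKER_PROCESSES__ placeholder (both names are hypothetical):

```shell
# Compute 2x worker processes per core and substitute the value into
# the "optimized" nginx config. The template filename and placeholder
# token are assumptions for illustration.
workers=$(($(nproc) * 2))
sed "s/__WORKER_PROCESSES__/${workers}/" nginx.conf.template > nginx.conf
```

Note that nginx also accepts worker_processes auto;, which pins one worker per core; a templating step is only needed because we want 2x per core.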