kernel ext4 lockup causing nginx slowdown #177

@staticfloat

We had a discussion about the us-east loadbalancer getting slow. Initial inspection showed that the network interface was rarely achieving more than 180 Mbps out. Digging deeper, we found that some nginx processes had become stuck in "uninterruptible sleep" (D in ps output). Looking in dmesg after an echo w > /proc/sysrq-trigger showed that they were stuck in the kernel during an fsync.
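A minimal sketch of how the stuck workers could be spotted from ps output. The helper name filter_d_state is hypothetical, and the ps invocation in the comment assumes the procps version of ps found on most Linux distributions:

```shell
#!/bin/sh
# Hypothetical helper: keep only rows whose STAT column starts with "D"
# (uninterruptible sleep). Expects ps-style rows on stdin with the state
# in the second column, e.g. "PID STAT WCHAN COMMAND".
filter_d_state() {
    awk '$2 ~ /^D/'
}

# On a live host (assumes procps ps; wchan shows the kernel function the
# process is sleeping in):
#   ps -eo pid,stat,wchan:32,comm | filter_d_state
```

Any nginx worker that stays in this list across several runs is a candidate for the hung-task dump described above.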

This is a fairly pathological failure, but there are a few things we could do to ameliorate it:

  • Oversubscribe the number of nginx processes. Right now we only have 2 processes on the loadbalancer; we could probably double this and go up to 2x per core (so 4 processes total) without any harmful effects, which would at least delay the problem in the future. This needs a templating step on the "optimized" nginx config to insert $(($(nproc) * 2)) into the worker_processes directive.
  • Better monitoring on the instances. Our Grafana server has bitrotted; we should resurrect it so that we can notice the lowered network throughput and increased CPU time, as shown in this Lightsail graph: [image]
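The templating step from the first bullet could look roughly like the sketch below. The function name set_worker_processes and the file names in the usage comment are hypothetical, and it assumes the config already contains a single-line worker_processes directive:

```shell
#!/bin/sh
# Hypothetical helper: rewrite the worker_processes directive in an nginx
# config read from stdin, replacing its value with the given worker count.
# All other lines pass through unchanged.
set_worker_processes() {
    workers="$1"
    sed -E "s/^([[:space:]]*worker_processes[[:space:]]+)[^;]+;/\1${workers};/"
}

# Usage (file names are placeholders):
#   set_worker_processes "$(( $(nproc) * 2 ))" < nginx.conf.in > nginx.conf
```

Running nginx -t against the generated file before reloading would catch a bad substitution.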

I am loath to do something drastic like auto-rebooting the loadbalancer, because it is supposed to be the piece that you don't have to reboot. If this happens again, I'll consider it.
