kernel ext4 lockup causing `nginx` slowdown

We had a discussion about the `us-east` loadbalancer getting slow. Initial inspection showed that the network interface was rarely achieving more than 180 Mbps outbound. Digging deeper, we found that some `nginx` processes had become stuck in "uninterruptible sleep" (`D` in `ps` output). Running `echo w > /proc/sysrq-trigger` and then looking in `dmesg` showed that they were stuck in the kernel during an `fsync`.
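For anyone retracing the diagnosis, here is a rough sketch of how the stuck processes can be spotted. The function names `filter_d_state` and `list_d_state` are made up for illustration, and the `ps` column layout assumes procps on Linux; this is not necessarily the exact command that was run on the loadbalancer.

```shell
# Keep only rows whose STAT column (2nd field) starts with D, i.e.
# uninterruptible sleep. Reads `ps` output (with header line) on stdin,
# so it can also be applied to saved output.
filter_d_state() {
    awk 'NR > 1 && $2 ~ /^D/ { print }'
}

# Show PID, state, kernel wait channel and command name for D-state tasks.
# Assumes procps `ps` (Linux); the wchan:32 width just avoids truncation.
list_d_state() {
    ps -eo pid,stat,wchan:32,comm | filter_d_state
}

# As root, dump stack traces of blocked tasks into the kernel log,
# then read them back:
#   echo w > /proc/sysrq-trigger
#   dmesg | tail -n 100
```

Calling `list_d_state` on a healthy machine should usually print nothing; `nginx` rows that persist across repeated runs would match the symptom described above.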

This is a fairly pathological failure, but there are a few things we could do to ameliorate it:
- [ ] Oversubscribe the `nginx` worker processes. Right now we only have 2 processes on the loadbalancer; we could probably double this to 2x per core (so 4 processes total) without any harmful effects, which would at least delay the problem in the future. This needs a templating step on the "optimized" nginx config to insert `$(($(nproc) * 2))` into the `worker_processes` directive.
- [ ] Better monitoring on the instances. Our Grafana server has bitrotted; we should resurrect it so that we can notice the lowered network throughput and increased CPU time, as shown in this Lightsail graph: <img width="697" alt="image" src="https://github.com/JuliaPackaging/PkgServer.jl/assets/130920/95354b0b-f00f-4481-8c34-24dfd832adb2">
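
The templating step from the first item could look roughly like the sketch below. The function name `render_worker_processes` and the `NPROC` override are hypothetical, and it assumes the config already contains a single-line `worker_processes ...;` directive; the real "optimized" config may need a different approach.

```shell
# Rewrite the worker_processes directive in an nginx config read from stdin
# and print the result to stdout.
#   $1     workers per core (default: 2)
#   NPROC  optional override for the core count (hypothetical, for testing);
#          falls back to `nproc`
render_worker_processes() {
    per_core="${1:-2}"
    cores="${NPROC:-$(nproc)}"
    workers=$((cores * per_core))
    # Replace whatever value the directive currently has, keeping indentation.
    sed -E "s/^([[:space:]]*worker_processes)[[:space:]]+[^;]+;/\1 ${workers};/"
}
```

Usage would be something like `render_worker_processes 2 < nginx.conf.in > nginx.conf`, where both file names are placeholders. On the current 2-core loadbalancer this would yield `worker_processes 4;`.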

I am loath to do something drastic like auto-rebooting the loadbalancer, because it is supposed to be the piece that you _don't have to reboot_. If this happens again, I'll consider it.

kernel ext4 lockup causing `nginx` slowdown #177
