Hydra's database on haumea.nixos.org runs PostgreSQL on ZFS, with zrepl for snapshot-based backups. Every once in a while we see the size of a snapshot jump from <1 GB to 70-120 GB, which results in a full disk.
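For context, per-snapshot space usage can be inspected like this; the dataset name below is a placeholder, not haumea's actual layout:

```sh
# Hypothetical dataset name -- substitute the dataset zrepl snapshots.
DATASET=rpool/safe/postgres

# Space held by each snapshot, oldest first.
zfs list -t snapshot -o name,creation,used,refer -s creation -r "$DATASET"

# Total space currently pinned by snapshots on the dataset.
zfs get usedbysnapshots "$DATASET"
```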
My current working theory (a few quick checks are sketched after this list):

- It is not related to WAL, since the WAL is only ~500 MB in size, and we use a zrepl hook to force a CHECKPOINT before the snapshot gets taken.
- It is likely an index that gets reshuffled (the jobsetevalmembers_pk index is ~60 GB in size).
- We are likely also seeing an effect of write amplification: PostgreSQL uses 8K pages, while ZFS was configured for 16K records (down from its 128K default). Going further down to 8K is not recommended; instead, maybe use 128K records and a dedicated SLOG device?
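Rough checks to confirm or refute the points above; paths, dataset and database names are assumptions based on a typical setup, not taken from haumea's actual configuration:

```sh
# Hypothetical paths/names -- adjust to the actual setup on haumea.
PGDATA=/var/lib/postgresql
DATASET=rpool/safe/postgres

# WAL size (theory: stays around ~500 MB, so WAL is not the culprit).
du -sh "$PGDATA"/pg_wal

# Size of the suspect index.
sudo -u postgres psql hydra -c \
  "SELECT pg_size_pretty(pg_relation_size('jobsetevalmembers_pk'));"

# Record size currently in effect for the database dataset.
zfs get recordsize "$DATASET"
```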
Years ago (ec61098) @edolstra tweaked some vacuuming parameters to 1/100 of the defaults. Maybe we could ease that a bit, as apparently we're suffering from too much vacuuming?
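If the knobs in question are the autovacuum scale factors (an assumption; I haven't checked what ec61098 actually touched), easing them could look roughly like the sketch below. The values are illustrative only, sitting between 1/100-of-default and the PostgreSQL defaults (0.2 vacuum / 0.1 analyze). On NixOS the change would normally go through the PostgreSQL module's settings rather than ALTER SYSTEM.

```sh
# Check the current values first.
sudo -u postgres psql -c \
  "SELECT name, setting FROM pg_settings WHERE name LIKE 'autovacuum%scale_factor';"

# Illustrative, more relaxed values -- not a recommendation.
sudo -u postgres psql -c "ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;"
sudo -u postgres psql -c "ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.02;"
sudo -u postgres psql -c "SELECT pg_reload_conf();"
```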