pg_tde: harden archive/restore helpers to clean up tmpfs on failure #582

leventov · 2025-09-17T03:34:18Z

The pg_tde helper binaries decrypt/reencrypt WAL via a temporary file under TMPFS_DIRECTORY (/dev/shm). On very low‑memory systems this tmpfs can fill, and previous behavior left temporary files/directories behind on failures or abrupt termination, compounding ENOSPC and retries.

Changes:

Always remove the temporary file and directory on normal exit and on
SIGINT/SIGTERM (atexit() + signal handlers).
Check the tmpfs dir capacity before creating the temp file. The temp file size is taken to be:
- In archive (pg_tde_restore_encrypt.c): size = encrypted source file size + 4MB slack.
- In restore (pg_tde_archive_decrypt.c): read xlog_seg_size from pg_control (derived from DEST-PATH
  data dir); fallback to 16MB if unavailable + 4MB slack.
- If the check fails due to insufficient remaining /dev/shm capacity, exit nonzero with a clear error so Postgres will retry later rather than failing mid‑write.

FWIW, the initial condition that has led this failures was a programming mistake on my part rather than pg_tde's defect, but even then, I think it's better to harden the achirve/restore binaries so that they don't exacerbate the problem by cluttering tmpfs.

I've not tested this patch in a realistic environment yet (I indent to do that, but maybe only the next week or so), yet I would like to publish the patch already for review and feedback.

I'm not an expert in C, Postgres, or pgBackRest; the patch is authored by AI coding agent; but I've reviewed it and it makes sense to me.

…ir if the operation fails

it-percona-cla · 2025-09-17T03:34:23Z

All committers have signed the CLA.

jeltz · 2025-09-17T16:12:23Z

Thanks for the PR. This is for sure a real bug and we have reported it in our issue tracking system as PG-1932. As a C developer I am not really a fan of what the AI has coded so we will likely write our own patch for it which I hope you do not mind. Again thanks a lot for the bug report and the PR!

leventov · 2025-09-18T05:12:55Z

@jeltz thanks for looking into this. No, I don't mind at all. Would be curious to learn the proper fix!

jeltz · 2025-10-15T08:57:42Z

Closing this PR since development is moving to https://github.com/percona/pg_tde. Also I do remember this PR, we have just been very busy with work on PostgreSQL 18 support which is what caused us to split the repos. :)

As reported by Roman Leventov in percona/postgres#582 we had an issue where we easily could leak tmeporary files and directories when on failure, especially when the wrapped restore or archive command fails. The solution in this commit is to move the cleanup of temporary files and directories to and atexit() callback plus add a signal handler for SIGINT and SIGTERM. The signal handler makes sure to kill itself after running cleanup so the parent process gets the right error. We have a tiny race condition between creating the temporary directory and registering the signal handler but that does not seem worth fixing since it would just leak a directory, not the WAL file, the race conditions being tiny.

As reported by Roman Leventov in percona/postgres#582 we had an issue where we easily could leak tmeporary files and directories when on failure, especially when the wrapped restore or archive command fails. The solution in this commit is to move the cleanup of temporary files and directories to and atexit() callback plus add a signal handler for SIGINT and SIGTERM. The signal handler makes sure to kill itself after running cleanup so the parent process gets the right error. We have a tiny race condition between creating the temporary directory and registering the signal handler but that does not seem worth fixing since it would just leak a directory, not the WAL file, the race conditions being tiny. With the current hardcoded path of /dev/shm and the TAp test framework testing this seems a pain so no tests are included.

As reported by Roman Leventov in percona/postgres#582 we had an issue where we easily could leak temporary files and directories when on failure, especially when the wrapped restore or archive command fails. The solution in this commit is to move the cleanup of temporary files and directories to and atexit() callback plus add a signal handler for SIGINT and SIGTERM. The signal handler makes sure to kill itself after running cleanup so the parent process gets the right error. We have a tiny race condition between creating the temporary directory and registering the signal handler but that does not seem worth fixing since it would just leak a directory, not the WAL file, the race conditions being tiny. With the current hardcoded path of /dev/shm and the TAP test framework testing this seems a pain so no tests are included.

jeltz · 2025-11-13T11:19:25Z

@leventov Sorry for the silence but we have been very busy with the PostgreSQL 18 release which necessitated splitting the repository into two (this one which now only contains our custom hooks and https://github.com/percona/pg_tde which contains the actual extension).

But now there is a PR, percona/pg_tde#467, for fixing this. It will sadly probably not make it to the first pg_tde release with PG 18 support (scheduled to be called 2.1) since code freeze for that has already happened but it will be released in 2.2, whenever that happens.

leventov · 2025-11-13T15:34:49Z

@jeltz thank you very much!

As reported by Roman Leventov in percona/postgres#582 we had an issue where we easily could leak temporary files and directories when on failure, especially when the wrapped restore or archive command fails. The solution in this commit is to move the cleanup of temporary files and directories to and atexit() callback plus add a signal handler for SIGINT and SIGTERM. The signal handler makes sure to kill itself after running cleanup so the parent process gets the right error. We have a tiny race condition between creating the temporary directory and registering the signal handler but that does not seem worth fixing since it would just leak a directory, not the WAL file, the race conditions being tiny. With the current hardcoded path of /dev/shm and the TAP test framework testing this seems a pain so no tests are included.

pg_tde_archive_decrypt and pg_tde_restore_encrypt to clean up tmpfs d…

843f44a

…ir if the operation fails

jeltz closed this Oct 15, 2025

jeltz mentioned this pull request Nov 13, 2025

PG-1932 Make sure archive and restore comamnds do not leak files percona/pg_tde#467

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

pg_tde: harden archive/restore helpers to clean up tmpfs on failure #582

pg_tde: harden archive/restore helpers to clean up tmpfs on failure #582

leventov commented Sep 17, 2025

Uh oh!

it-percona-cla commented Sep 17, 2025 •

edited

Loading

Uh oh!

jeltz commented Sep 17, 2025 •

edited by atlassian bot

Loading

Uh oh!

leventov commented Sep 18, 2025

Uh oh!

jeltz commented Oct 15, 2025

Uh oh!

jeltz commented Nov 13, 2025 •

edited

Loading

Uh oh!

leventov commented Nov 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

pg_tde: harden archive/restore helpers to clean up tmpfs on failure #582

pg_tde: harden archive/restore helpers to clean up tmpfs on failure #582

Conversation

leventov commented Sep 17, 2025

Uh oh!

it-percona-cla commented Sep 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeltz commented Sep 17, 2025 • edited by atlassian bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

leventov commented Sep 18, 2025

Uh oh!

jeltz commented Oct 15, 2025

Uh oh!

jeltz commented Nov 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

leventov commented Nov 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

it-percona-cla commented Sep 17, 2025 •

edited

Loading

jeltz commented Sep 17, 2025 •

edited by atlassian bot

Loading

jeltz commented Nov 13, 2025 •

edited

Loading