Save and quit on sigint and sigterm #260

Merged
merged 2 commits into from
May 9, 2025

jlamypoirier
Collaborator

✨ Description

Add a signal handler for training that catches interrupts. When a signal is received, the trainer finishes the current batch, then saves a checkpoint and stops gracefully.

This should usually be enough to save and quit within the allotted time, with some exceptions:

  • The validation phase is not interrupted, so a batch that includes validation may not leave enough time to save. Addressing this would require saving a validation state so we can resume mid-validation.
  • Exports may take a while, so a batch that includes an export may not leave enough time to save.

Nevertheless, this PR is an improvement and should be good enough for #241.
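
The behaviour described above (catch the signal, finish the current batch, save a checkpoint, then stop) can be sketched as follows. This is a minimal illustration, not the code merged in this PR: `InterruptHandler`, `train`, `run_batch` and `save_checkpoint` are hypothetical names, and validation and export are left out.

```python
import signal


class InterruptHandler:
    """Hypothetical helper: records SIGINT/SIGTERM so a loop can stop at a safe point."""

    def __init__(self, signals=(signal.SIGINT, signal.SIGTERM)):
        self._signals = signals
        self._previous = {}
        self.interrupted = False

    def __enter__(self):
        for sig in self._signals:
            # signal.signal returns the previous handler; keep it so it can be restored.
            self._previous[sig] = signal.signal(sig, self._handle)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Restore whatever handlers were installed before training started.
        for sig, handler in self._previous.items():
            signal.signal(sig, handler)
        return False

    def _handle(self, signum, frame):
        # Only set a flag here; the training loop decides when to act on it.
        self.interrupted = True


def train(num_batches, run_batch, save_checkpoint):
    """Run batches until done or interrupted; return the number of completed batches."""
    completed = 0
    with InterruptHandler() as handler:
        for step in range(num_batches):
            run_batch(step)  # the current batch always runs to completion
            completed += 1
            if handler.interrupted:
                # Save once the batch is finished, then stop gracefully.
                save_checkpoint(completed)
                break
    return completed
```

Setting a flag instead of saving inside the handler keeps the checkpoint consistent, since the save only happens between batches. Note that `signal.signal` must be called from the main thread.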

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

Collaborator

@tscholak left a comment

LGTM! thanks!

@jlamypoirier marked this pull request as ready for review May 9, 2025 19:34
@jlamypoirier merged commit 98d3969 into main May 9, 2025
2 checks passed
@jlamypoirier deleted the save_on_sigterm branch May 9, 2025 19:46
2 participants