Add more clarity to how to re-run failed AWS GPU jobs #1324
Conversation
When tests or benchmarks fail, display warnings explaining that "Re-run failed jobs" won't work due to ephemeral EC2 runners:
- Error annotations appear at top of job view
- Job summary table in the Summary tab
- ASCII banner in log output

Signed-off-by: Eric Shi <[email protected]>
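As a rough illustration of how such a failure-only step can be wired up (this is a sketch, not the PR's actual script — the step name, wording, and summary table below are invented for illustration):

```yaml
- name: Explain re-run limitation
  if: failure()  # run only when an earlier step in the job failed
  shell: bash
  run: |
    # 1) Error annotation surfaced at the top of the job view
    echo "::error::The EC2 runner is ephemeral; 'Re-run failed jobs' will not work. Use 'Re-run all jobs' instead."
    # 2) Table rendered in the workflow's Summary tab
    cat >> "$GITHUB_STEP_SUMMARY" <<'EOF'
    | Re-run option      | Works on ephemeral runners? |
    | ------------------ | --------------------------- |
    | Re-run failed jobs | No (runner already gone)    |
    | Re-run all jobs    | Yes (fresh runner created)  |
    EOF
    # 3) Banner in the raw log output
    echo "=== EPHEMERAL RUNNER: use 'Re-run all jobs' ==="
```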
📝 Walkthrough

The PR adds failure-handling guidance to two AWS GPU workflow files and restricts one workflow to execute only in the primary repository. New steps that trigger on failure provide users with re-run instructions, while a repository guard ensures GPU tests only run in the specified repository context.

Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Codecov Report: ✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know!
Actionable comments posted: 0
🧹 Nitpick comments (1)
.github/workflows/aws_gpu_tests.yml (1)
121-158: Consider extracting re-run instructions to a composite action.The re-run instructions are duplicated across aws_gpu_tests.yml and aws_gpu_benchmarks.yml. While the duplication provides clarity and keeps each workflow self-contained, you could create a composite action to maintain the instructions in a single location.
Example structure for a composite action
Create `.github/actions/ec2-rerun-instructions/action.yml`:

```yaml
name: 'EC2 Re-run Instructions'
description: 'Display instructions for re-running workflows with ephemeral EC2 runners'
runs:
  using: 'composite'
  steps:
    - name: Re-run instructions
      if: failure()
      shell: bash
      run: |
        # (same script content)
```

Then reference it in both workflows:

```yaml
- name: Re-run instructions
  if: failure()
  uses: ./.github/actions/ec2-rerun-instructions
```
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- .github/workflows/aws_gpu_benchmarks.yml
- .github/workflows/aws_gpu_tests.yml
- .github/workflows/push_aws_gpu.yml
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: shi-eric
Repo: newton-physics/newton PR: 879
File: .gitlab-ci.yml:125-145
Timestamp: 2025-10-04T06:44:52.713Z
Learning: In the Newton project, the "linux-x86_64 test warp nightly" GitLab CI job intentionally runs on every pipeline (not limited to `.test_common` rules) to detect Warp nightly integration issues early, since Warp nightly releases are unpredictable. This design is acceptable because `allow_failure: true` prevents blocking the pipeline.
📚 Learning: 2025-10-04T06:44:52.713Z
Learnt from: shi-eric
Repo: newton-physics/newton PR: 879
File: .gitlab-ci.yml:125-145
Timestamp: 2025-10-04T06:44:52.713Z
Learning: In the Newton project, the "linux-x86_64 test warp nightly" GitLab CI job intentionally runs on every pipeline (not limited to `.test_common` rules) to detect Warp nightly integration issues early, since Warp nightly releases are unpredictable. This design is acceptable because `allow_failure: true` prevents blocking the pipeline.
Applied to files:
.github/workflows/push_aws_gpu.yml
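For context, a minimal sketch of the GitLab job shape this learning describes (the script lines are placeholders, not Newton's actual CI commands):

```yaml
# Hypothetical .gitlab-ci.yml fragment illustrating the pattern above
linux-x86_64 test warp nightly:
  allow_failure: true                # surfaces breakage early without blocking the pipeline
  script:
    - pip install --pre warp-lang    # placeholder: install the latest pre-release build
    - python -m pytest newton/tests  # placeholder test entry point
```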
📚 Learning: 2026-01-04T01:26:02.866Z
Learnt from: shi-eric
Repo: newton-physics/newton PR: 1300
File: .github/workflows/pr_license_check.yml:24-24
Timestamp: 2026-01-04T01:26:02.866Z
Learning: When reviewing code related to CI checks that verify Git SHAs against version tags in GitHub Actions, handle annotated tags by dereferencing to obtain the actual commit SHA. GitHub's endpoint /repos/{owner}/{repo}/git/refs/tags/{tag} returns the tag object SHA for annotated tags; use /repos/{owner}/{repo}/git/tags/{tag_sha} to dereference to the commit SHA, or verify the commit directly in the repository instead of assuming a mismatch. Apply this pattern whenever validating tag-to-commit relationships in workflow checks across files under .github/workflows.
Applied to files:
- .github/workflows/push_aws_gpu.yml
- .github/workflows/aws_gpu_tests.yml
- .github/workflows/aws_gpu_benchmarks.yml
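A minimal sketch of the dereferencing pattern described in the learning above, written as a workflow step using the `gh` CLI (`TAG` and the step name are placeholders):

```yaml
- name: Resolve tag to its commit SHA
  shell: bash
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    # For lightweight tags the ref points straight at a commit;
    # for annotated tags it points at a tag *object* that must be dereferenced.
    obj_type=$(gh api "repos/${GITHUB_REPOSITORY}/git/refs/tags/${TAG}" --jq '.object.type')
    obj_sha=$(gh api "repos/${GITHUB_REPOSITORY}/git/refs/tags/${TAG}" --jq '.object.sha')
    if [ "$obj_type" = "tag" ]; then
      # Annotated tag: fetch the tag object to get the commit it points at
      obj_sha=$(gh api "repos/${GITHUB_REPOSITORY}/git/tags/${obj_sha}" --jq '.object.sha')
    fi
    echo "Tag ${TAG} resolves to commit ${obj_sha}"
```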
⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Run GPU Benchmarks / Run GPU Benchmarks on AWS EC2
- GitHub Check: Run GPU Tests / Run GPU Unit Tests on AWS EC2
- GitHub Check: run-newton-tests / newton-unittests (windows-latest)
- GitHub Check: run-newton-tests / newton-unittests (ubuntu-latest)
🔇 Additional comments (3)
.github/workflows/push_aws_gpu.yml (1)
16-16: LGTM! Repository guard prevents execution in forks.The guard correctly restricts AWS GPU workflows to the primary repository, preventing resource consumption and credential issues in forks.
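The guard is the standard job-level conditional in GitHub Actions; a minimal sketch of the pattern (the job name and steps here are illustrative, not the workflow's actual contents):

```yaml
jobs:
  gpu-tests:
    # Skip entirely in forks so ephemeral EC2 runners are only provisioned upstream
    if: github.repository == 'newton-physics/newton'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
```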
.github/workflows/aws_gpu_benchmarks.yml (1)
134-171: Excellent multi-channel user guidance for ephemeral runners! The implementation provides clear instructions through error annotations, job summary, and log output. The `if: failure()` condition ensures guidance appears whenever the workflow fails, and the heredoc syntax is correct.

.github/workflows/aws_gpu_tests.yml (1)
121-158: LGTM! Consistent re-run guidance across GPU workflows. The implementation matches the pattern in aws_gpu_benchmarks.yml, providing clear user instructions through multiple channels. The `if: failure()` condition ensures guidance appears for any failure, including early-stage issues like checkout or setup failures.
adenzler-nvidia
left a comment
Thank you!
8ccb2a9
Description
As discussed on Slack this week, the AWS GPU jobs use ephemeral runners that are created/destroyed just to service the job in that particular workflow (rather than servicing jobs from any workflow in the repo). This pull request tries to add more visibility into this behavior for developers seeking to re-run failed workflows (e.g. due to flaky test behavior).
The `push_aws_gpu.yml` job also has an added check to make sure the GitHub repo is `newton-physics/newton`.

Newton Migration Guide
Please ensure the migration guide for warp.sim users is up-to-date with the changes made in this PR.
- `docs/migration.rst` is up to date

Before your PR is "Ready for review"
- newton/tests/test_examples.py)
- `pre-commit run -a`

Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.