Add more clarity to how to re-run failed AWS GPU jobs #1324
Conversation
When tests or benchmarks fail, display warnings explaining that "Re-run failed jobs" won't work due to ephemeral EC2 runners:
- Error annotations appear at top of job view
- Job summary table in the Summary tab
- ASCII banner in log output

Signed-off-by: Eric Shi <[email protected]>
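As a rough illustration of how such a failure-only step can be wired up (this is a sketch, not the PR's actual script — the step name, wording, and summary table below are invented for illustration):

```yaml
- name: Explain re-run limitation
  if: failure()  # run only when an earlier step in the job failed
  shell: bash
  run: |
    # 1) Error annotation surfaced at the top of the job view
    echo "::error::The EC2 runner is ephemeral; 'Re-run failed jobs' will not work. Use 'Re-run all jobs' instead."
    # 2) Table rendered in the workflow's Summary tab
    cat >> "$GITHUB_STEP_SUMMARY" <<'EOF'
    | Re-run option      | Works on ephemeral runners? |
    | ------------------ | --------------------------- |
    | Re-run failed jobs | No (runner already gone)    |
    | Re-run all jobs    | Yes (fresh runner created)  |
    EOF
    # 3) Banner in the raw log output
    echo "=== EPHEMERAL RUNNER: use 'Re-run all jobs' ==="
```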
📝 Walkthrough

The PR adds failure-handling guidance to two AWS GPU workflow files and restricts one workflow to execute only in the primary repository. New steps that trigger on failure provide users with re-run instructions, while a repository guard ensures GPU tests only run in the specified repository context.

Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Codecov Report: ✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know!
Actionable comments posted: 0
🧹 Nitpick comments (1)
.github/workflows/aws_gpu_tests.yml (1)
121-158: Consider extracting re-run instructions to a composite action.The re-run instructions are duplicated across aws_gpu_tests.yml and aws_gpu_benchmarks.yml. While the duplication provides clarity and keeps each workflow self-contained, you could create a composite action to maintain the instructions in a single location.
Example structure for a composite action
Create `.github/actions/ec2-rerun-instructions/action.yml`:

```yaml
name: 'EC2 Re-run Instructions'
description: 'Display instructions for re-running workflows with ephemeral EC2 runners'
runs:
  using: 'composite'
  steps:
    - name: Re-run instructions
      if: failure()
      shell: bash
      run: |
        # (same script content)
```

Then reference it in both workflows:

```yaml
- name: Re-run instructions
  if: failure()
  uses: ./.github/actions/ec2-rerun-instructions
```
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- .github/workflows/aws_gpu_benchmarks.yml
- .github/workflows/aws_gpu_tests.yml
- .github/workflows/push_aws_gpu.yml
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: shi-eric
Repo: newton-physics/newton PR: 879
File: .gitlab-ci.yml:125-145
Timestamp: 2025-10-04T06:44:52.713Z
Learning: In the Newton project, the "linux-x86_64 test warp nightly" GitLab CI job intentionally runs on every pipeline (not limited to `.test_common` rules) to detect Warp nightly integration issues early, since Warp nightly releases are unpredictable. This design is acceptable because `allow_failure: true` prevents blocking the pipeline.
📚 Learning: 2025-10-04T06:44:52.713Z
Learnt from: shi-eric
Repo: newton-physics/newton PR: 879
File: .gitlab-ci.yml:125-145
Timestamp: 2025-10-04T06:44:52.713Z
Learning: In the Newton project, the "linux-x86_64 test warp nightly" GitLab CI job intentionally runs on every pipeline (not limited to `.test_common` rules) to detect Warp nightly integration issues early, since Warp nightly releases are unpredictable. This design is acceptable because `allow_failure: true` prevents blocking the pipeline.
Applied to files:
.github/workflows/push_aws_gpu.yml
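For context, a minimal sketch of the GitLab job shape this learning describes (the script lines are placeholders, not Newton's actual CI commands):

```yaml
# Hypothetical .gitlab-ci.yml fragment illustrating the pattern above
linux-x86_64 test warp nightly:
  allow_failure: true                # surfaces breakage early without blocking the pipeline
  script:
    - pip install --pre warp-lang    # placeholder: install the latest pre-release build
    - python -m pytest newton/tests  # placeholder test entry point
```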
📚 Learning: 2026-01-04T01:26:02.866Z
Learnt from: shi-eric
Repo: newton-physics/newton PR: 1300
File: .github/workflows/pr_license_check.yml:24-24
Timestamp: 2026-01-04T01:26:02.866Z
Learning: When reviewing code related to CI checks that verify Git SHAs against version tags in GitHub Actions, handle annotated tags by dereferencing to obtain the actual commit SHA. GitHub's endpoint /repos/{owner}/{repo}/git/refs/tags/{tag} returns the tag object SHA for annotated tags; use /repos/{owner}/{repo}/git/tags/{tag_sha} to dereference to the commit SHA, or verify the commit directly in the repository instead of assuming a mismatch. Apply this pattern whenever validating tag-to-commit relationships in workflow checks across files under .github/workflows.
Applied to files:
- .github/workflows/push_aws_gpu.yml
- .github/workflows/aws_gpu_tests.yml
- .github/workflows/aws_gpu_benchmarks.yml
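A minimal sketch of the dereferencing pattern described in the learning above, written as a workflow step using the `gh` CLI (`TAG` and the step name are placeholders):

```yaml
- name: Resolve tag to its commit SHA
  shell: bash
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    # For lightweight tags the ref points straight at a commit;
    # for annotated tags it points at a tag *object* that must be dereferenced.
    obj_type=$(gh api "repos/${GITHUB_REPOSITORY}/git/refs/tags/${TAG}" --jq '.object.type')
    obj_sha=$(gh api "repos/${GITHUB_REPOSITORY}/git/refs/tags/${TAG}" --jq '.object.sha')
    if [ "$obj_type" = "tag" ]; then
      # Annotated tag: fetch the tag object to get the commit it points at
      obj_sha=$(gh api "repos/${GITHUB_REPOSITORY}/git/tags/${obj_sha}" --jq '.object.sha')
    fi
    echo "Tag ${TAG} resolves to commit ${obj_sha}"
```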
⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Run GPU Benchmarks / Run GPU Benchmarks on AWS EC2
- GitHub Check: Run GPU Tests / Run GPU Unit Tests on AWS EC2
- GitHub Check: run-newton-tests / newton-unittests (windows-latest)
- GitHub Check: run-newton-tests / newton-unittests (ubuntu-latest)
🔇 Additional comments (3)
.github/workflows/push_aws_gpu.yml (1)
16-16: LGTM! Repository guard prevents execution in forks.The guard correctly restricts AWS GPU workflows to the primary repository, preventing resource consumption and credential issues in forks.
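The guard is the standard job-level conditional in GitHub Actions; a minimal sketch of the pattern (the job name and steps here are illustrative, not the workflow's actual contents):

```yaml
jobs:
  gpu-tests:
    # Skip entirely in forks so ephemeral EC2 runners are only provisioned upstream
    if: github.repository == 'newton-physics/newton'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
```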
.github/workflows/aws_gpu_benchmarks.yml (1)
134-171: Excellent multi-channel user guidance for ephemeral runners! The implementation provides clear instructions through error annotations, job summary, and log output. The `if: failure()` condition ensures guidance appears whenever the workflow fails, and the heredoc syntax is correct.

.github/workflows/aws_gpu_tests.yml (1)
121-158: LGTM! Consistent re-run guidance across GPU workflows. The implementation matches the pattern in aws_gpu_benchmarks.yml, providing clear user instructions through multiple channels. The `if: failure()` condition ensures guidance appears for any failure, including early-stage issues like checkout or setup failures.
adenzler-nvidia
left a comment
Thank you!
8ccb2a9
Description
As discussed on Slack this week, the AWS GPU jobs use ephemeral runners that are created/destroyed just to service the job in that particular workflow (rather than servicing jobs from any workflow in the repo). This pull request tries to add more visibility into this behavior for developers seeking to re-run failed workflows (e.g. due to flaky test behavior).
The `push_aws_gpu.yml` job also has an added check to make sure the GitHub repo is `newton-physics/newton`.

Newton Migration Guide
Please ensure the migration guide for warp.sim users is up-to-date with the changes made in this PR.
- `docs/migration.rst` is up to date

Before your PR is "Ready for review"
- newton/tests/test_examples.py)
- `pre-commit run -a`

Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.