
Conversation

@shi-eric
Member

@shi-eric shi-eric commented Jan 11, 2026

Description

As discussed on Slack this week, the AWS GPU jobs use ephemeral runners that are created/destroyed just to service the job in that particular workflow (rather than servicing jobs from any workflow in the repo). This pull request adds more visibility into this behavior for developers seeking to re-run failed workflows (e.g. due to flaky test behavior).

The push_aws_gpu.yml workflow also gains a check that the GitHub repository is newton-physics/newton.
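
For context, such a guard is typically a single `if:` condition on the job. The snippet below is only a sketch with a hypothetical job name and placeholder step, not the contents of push_aws_gpu.yml:

# Sketch of a repository guard; the job name and step are illustrative.
jobs:
  gpu-tests:
    # Skip the job entirely when the workflow runs outside the primary repository.
    if: github.repository == 'newton-physics/newton'
    runs-on: ubuntu-latest
    steps:
      - run: echo "Running in newton-physics/newton only"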

Newton Migration Guide

Please ensure the migration guide for warp.sim users is up-to-date with the changes made in this PR.

  • The migration guide in docs/migration.rst is up to date

Before your PR is "Ready for review"

  • Necessary tests have been added and new examples are tested (see newton/tests/test_examples.py)
  • Documentation is up to date
  • Code passes formatting and linting checks with pre-commit run -a

Summary by CodeRabbit

  • Chores
    • Improved error handling and messaging for GPU benchmark and unit test job failures, with detailed re-run guidance emphasizing proper ephemeral EC2 runner restart procedures
    • Added repository-specific execution guard to restrict GPU workflows to the appropriate repository for enhanced safety and control


When tests or benchmarks fail, display warnings explaining that
"Re-run failed jobs" won't work due to ephemeral EC2 runners.

- Error annotations appear at top of job view
- Job summary table in the Summary tab
- ASCII banner in log output

Signed-off-by: Eric Shi <[email protected]>
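
As a rough sketch of what a step emitting all three of these channels can look like (the step name, annotation wording, and summary table below are illustrative rather than the actual workflow content):

- name: Re-run instructions
  # Run only when an earlier step in this job has failed.
  if: failure()
  shell: bash
  run: |
    # 1. Error annotation shown at the top of the job view.
    echo "::error title=Ephemeral runner::Use 'Re-run all jobs'; 'Re-run failed jobs' cannot reach a destroyed EC2 runner."
    # 2. Markdown table rendered in the Summary tab.
    cat >> "$GITHUB_STEP_SUMMARY" << 'EOF'
    | Re-run option      | Works with ephemeral EC2 runners? |
    | ------------------ | --------------------------------- |
    | Re-run failed jobs | No                                |
    | Re-run all jobs    | Yes                               |
    EOF
    # 3. Plain banner in the raw log output.
    echo "=============================================="
    echo " Re-run ALL jobs, not only the failed ones."
    echo "=============================================="

Writing the table through a quoted heredoc keeps the Markdown literal, so no shell expansion happens inside it.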
@coderabbitai
Contributor

coderabbitai bot commented Jan 11, 2026

📝 Walkthrough

Walkthrough

The PR adds failure-handling guidance to two AWS GPU workflow files and restricts one workflow to execute only in the primary repository. New steps that trigger on failure provide users with re-run instructions, while a repository guard ensures GPU tests only run in the specified repository context.

Changes

AWS GPU Failure Guidance
  • Files: .github/workflows/aws_gpu_benchmarks.yml, .github/workflows/aws_gpu_tests.yml
  • Summary: Added conditional steps that execute on failure, emitting error annotations and writing re-run guidance to job summaries and logs. Both steps advise against "Re-run failed jobs" and promote "Re-run all jobs" for ephemeral EC2 runners. (+38 lines each)

Repository Guard
  • Files: .github/workflows/push_aws_gpu.yml
  • Summary: Added repository-specific condition restricting workflow execution to newton-physics/newton repository only. (+1 line)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

  • newton-physics/newton#1084: Directly related; modifies the same AWS GPU workflow files that were introduced/refactored in that PR.

Suggested reviewers

  • eric-heiden
🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)
  • Description Check: Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check: Passed. The PR title accurately summarizes the main objective: adding clarity to re-run instructions for failed AWS GPU jobs, which aligns with the changeset that adds guidance steps to two workflows.
  • Docstring Coverage: Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.




@shi-eric shi-eric requested a review from eric-heiden January 11, 2026 02:51
@codecov

codecov bot commented Jan 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.


@shi-eric shi-eric added the automation label (Issues related to ci/cd and automation in general) Jan 11, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
.github/workflows/aws_gpu_tests.yml (1)

121-158: Consider extracting re-run instructions to a composite action.

The re-run instructions are duplicated across aws_gpu_tests.yml and aws_gpu_benchmarks.yml. While the duplication provides clarity and keeps each workflow self-contained, you could create a composite action to maintain the instructions in a single location.

Example structure for a composite action

Create .github/actions/ec2-rerun-instructions/action.yml:

name: 'EC2 Re-run Instructions'
description: 'Display instructions for re-running workflows with ephemeral EC2 runners'
runs:
  using: 'composite'
  steps:
    - name: Re-run instructions
      if: failure()
      shell: bash
      run: |
        # (same script content)

Then reference it in both workflows:

- name: Re-run instructions
  if: failure()
  uses: ./.github/actions/ec2-rerun-instructions
📜 Review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 173e042 and d18a324.

📒 Files selected for processing (3)
  • .github/workflows/aws_gpu_benchmarks.yml
  • .github/workflows/aws_gpu_tests.yml
  • .github/workflows/push_aws_gpu.yml
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: shi-eric
Repo: newton-physics/newton PR: 879
File: .gitlab-ci.yml:125-145
Timestamp: 2025-10-04T06:44:52.713Z
Learning: In the Newton project, the "linux-x86_64 test warp nightly" GitLab CI job intentionally runs on every pipeline (not limited to `.test_common` rules) to detect Warp nightly integration issues early, since Warp nightly releases are unpredictable. This design is acceptable because `allow_failure: true` prevents blocking the pipeline.
📚 Learning: 2025-10-04T06:44:52.713Z
Learnt from: shi-eric
Repo: newton-physics/newton PR: 879
File: .gitlab-ci.yml:125-145
Timestamp: 2025-10-04T06:44:52.713Z
Learning: In the Newton project, the "linux-x86_64 test warp nightly" GitLab CI job intentionally runs on every pipeline (not limited to `.test_common` rules) to detect Warp nightly integration issues early, since Warp nightly releases are unpredictable. This design is acceptable because `allow_failure: true` prevents blocking the pipeline.

Applied to files:

  • .github/workflows/push_aws_gpu.yml
📚 Learning: 2026-01-04T01:26:02.866Z
Learnt from: shi-eric
Repo: newton-physics/newton PR: 1300
File: .github/workflows/pr_license_check.yml:24-24
Timestamp: 2026-01-04T01:26:02.866Z
Learning: When reviewing code related to CI checks that verify Git SHAs against version tags in GitHub Actions, handle annotated tags by dereferencing to obtain the actual commit SHA. GitHub's endpoint /repos/{owner}/{repo}/git/refs/tags/{tag} returns the tag object SHA for annotated tags; use /repos/{owner}/{repo}/git/tags/{tag_sha} to dereference to the commit SHA, or verify the commit directly in the repository instead of assuming a mismatch. Apply this pattern whenever validating tag-to-commit relationships in workflow checks across files under .github/workflows.

Applied to files:

  • .github/workflows/push_aws_gpu.yml
  • .github/workflows/aws_gpu_tests.yml
  • .github/workflows/aws_gpu_benchmarks.yml
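
As an aside on the tag-dereferencing learning above, a minimal sketch of that pattern as an Actions step might look as follows. It assumes the gh CLI, the singular get-a-reference endpoint, and a hypothetical tag name; it is not taken from any workflow in this PR:

- name: Resolve tag to commit SHA
  shell: bash
  env:
    GH_TOKEN: ${{ github.token }}
    TAG: v1.2.3  # hypothetical tag name
  run: |
    # A lightweight tag points straight at a commit; an annotated tag points
    # at a tag object that must be dereferenced once more.
    ref_type=$(gh api "repos/${GITHUB_REPOSITORY}/git/ref/tags/${TAG}" --jq '.object.type')
    ref_sha=$(gh api "repos/${GITHUB_REPOSITORY}/git/ref/tags/${TAG}" --jq '.object.sha')
    if [ "${ref_type}" = "tag" ]; then
      # Annotated tag: look up the tag object to find the commit it wraps.
      commit_sha=$(gh api "repos/${GITHUB_REPOSITORY}/git/tags/${ref_sha}" --jq '.object.sha')
    else
      commit_sha="${ref_sha}"
    fi
    echo "Tag ${TAG} resolves to commit ${commit_sha}"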
⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Run GPU Benchmarks / Run GPU Benchmarks on AWS EC2
  • GitHub Check: Run GPU Tests / Run GPU Unit Tests on AWS EC2
  • GitHub Check: run-newton-tests / newton-unittests (windows-latest)
  • GitHub Check: run-newton-tests / newton-unittests (ubuntu-latest)
🔇 Additional comments (3)
.github/workflows/push_aws_gpu.yml (1)

16-16: LGTM! Repository guard prevents execution in forks.

The guard correctly restricts AWS GPU workflows to the primary repository, preventing resource consumption and credential issues in forks.

.github/workflows/aws_gpu_benchmarks.yml (1)

134-171: Excellent multi-channel user guidance for ephemeral runners!

The implementation provides clear instructions through error annotations, job summary, and log output. The if: failure() condition ensures guidance appears whenever the workflow fails, and the heredoc syntax is correct.

.github/workflows/aws_gpu_tests.yml (1)

121-158: LGTM! Consistent re-run guidance across GPU workflows.

The implementation matches the pattern in aws_gpu_benchmarks.yml, providing clear user instructions through multiple channels. The if: failure() condition ensures guidance appears for any failure, including early-stage issues like checkout or setup failures.

@shi-eric shi-eric self-assigned this Jan 11, 2026
Member

@adenzler-nvidia adenzler-nvidia left a comment


Thank you!

@adenzler-nvidia adenzler-nvidia added this pull request to the merge queue Jan 13, 2026
Merged via the queue into newton-physics:main with commit 8ccb2a9 Jan 13, 2026
34 of 37 checks passed
