Job timeouts in "CI - sharktank / Data-dependent Tests" after updating IREE versions #888
Comments
The test that is timing out may be https://github.com/nod-ai/shark-ai/blob/main/sharktank/tests/models/flux/flux_test.py. If that is runnable locally, someone could try debugging it using different versions of the IREE packages.
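For reference, a minimal local bisection sketch, assuming the nightly wheels are published as `iree-base-compiler` / `iree-base-runtime` on the IREE pip release index and that the flux test runs outside CI; the version list and timeout values are placeholders:

```python
import subprocess
import sys

# Hypothetical list of IREE nightly versions to bisect over.
CANDIDATE_VERSIONS = ["3.2.0rc20250120", "3.2.0rc20250124", "3.2.0rc20250129"]

for version in CANDIDATE_VERSIONS:
    # Install matching compiler/runtime wheels; the package names and nightly
    # index URL are assumptions to double-check against the repo's requirements.
    subprocess.run(
        [
            sys.executable, "-m", "pip", "install",
            f"iree-base-compiler=={version}",
            f"iree-base-runtime=={version}",
            "--find-links", "https://iree.dev/pip-release-links.html",
        ],
        check=True,
    )
    # Run only the suspected test file with a hard per-test timeout so a hang
    # shows up as a failure instead of a 6-hour job timeout.
    result = subprocess.run(
        [
            sys.executable, "-m", "pytest",
            "sharktank/tests/models/flux/flux_test.py",
            "--timeout=600", "--timeout-method=thread", "-v",
        ]
    )
    print(f"IREE {version}: pytest exit code {result.returncode}")
```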
Locally, to avoid always using device 0, I made this change (#891) to let me specify another device for the tests (see the device-selection sketch below). With it I am hitting another, earlier error coming from HIP.

The native stack trace is:

Interestingly, this happens only when I run multiple tests one after another. If I run just the offending test, all is well.
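For illustration only (not the actual #891 change), here is a sketch of selecting a non-default HIP device through IREE's Python runtime; the `hip://N` URI form and the `Config(device=...)` keyword are assumptions to verify against the installed `iree.runtime` version:

```python
import iree.runtime as ireert

def make_config(device_uri: str = "hip://0") -> ireert.Config:
    # get_device accepts a device URI; "hip://1" would target the second GPU
    # instead of the default device 0 (URI scheme assumed here).
    device = ireert.get_device(device_uri)
    return ireert.Config(device=device)

config = make_config("hip://1")
```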
The offending IREE change seems to be iree-org/iree#19826. Without it I don't get the error and the CLIP tests run fine. The particular order of IREE runtime calls when tests are ordered in this way seems to cause the problem.

@AWoloszyn, have you encountered something like this before?
The non-reproducibility of the hang was caused by a modification I made to be able to specify a concrete device for the tests. Right now the tests run on device 0. I had missed a place where device 0 was still created but unused. After making only one device used, I was able to hit the hang.

Stack trace:

This is on top of iree-org/iree@1bf7249.
I was able to reproduce the hang more reliably on top of this branch, which has some commits not yet merged into main.
Would it be okay to disable the test that hangs and continue updating the IREE versions in shark-ai, then fix the test (with changes to either shark-ai or IREE) later? Or do we want this to block the updates until it is resolved?
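If disabling is the route taken, a minimal sketch of a skip marker that points back to this issue (the test name below is hypothetical):

```python
import pytest

# Skip the hanging test with a reason referencing this issue, so it is easy to
# find and re-enable once the IREE-side fix lands.
@pytest.mark.skip(reason="Hangs after IREE bump, see nod-ai/shark-ai#888")
def test_flux_transformer_compiled_vs_eager():
    ...
```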
What's the latest status here? shark-ai is still pinned to IREE versions from two weeks ago, and our next stable release across all projects is scheduled for one week from now. We would like to cut a release candidate by ~Wednesday of this week, using all the latest code. We need to continue updating and get visibility into any new issues with the release ASAP.
I'm not exactly up to date on everything @sogartar mentioned, but I'm getting various problems that look like IREE/shortfin compatibility issues. I have no familiarity with the IREE codebase, so all I can do is paste a catalogue of the things I'm encountering. Some of these reproduce on IREE 3.2.0rc20250120 too, though. Here's where I'm tracking them: #904
@ScottTodd, @AWoloszyn is looking into the problem. To the best of my knowledge he has some insights, but it is not resolved yet.
It seems that we have intermittent hangs even with IREE 3.2.0rc20250120. The assumption was that bumping up from that version caused the issue, but it looks like it is present in this and probably earlier versions. |
Logging more information as tests run will help with the triage process for issues like #888.
Logs before: https://github.com/nod-ai/shark-ai/actions/runs/13150995160/job/36698277502#step:6:27
Logs after: https://github.com/nod-ai/shark-ai/actions/runs/13168068091/job/36752792808?pr=917#step:6:29
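For context, one way to get per-test progress into CI logs is a pair of pytest hooks in `conftest.py`; this is a sketch of the general technique, not necessarily what the linked change does:

```python
# conftest.py
import time

def pytest_runtest_logstart(nodeid, location):
    # Print as each test starts and flush immediately, so a hanging test is
    # identifiable from the CI log even if the job later times out.
    print(f"\n[{time.strftime('%H:%M:%S')}] STARTING {nodeid}", flush=True)

def pytest_runtest_logfinish(nodeid, location):
    print(f"[{time.strftime('%H:%M:%S')}] FINISHED {nodeid}", flush=True)
```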
Diff: iree-org/iree@iree-3.2.0rc20250206...iree-3.3.0rc20250211
IREE bump duty engineer this week: @renxida
Note: there are some CI failures, but I double-checked against main and there are no new failures / no new failing tests. Perplexity and benchmarking 8b are failing due to CI migration problems, and sharktank data-dependent tests are failing due to #888.
Auto-generated by GitHub Actions using [`.github/workflows/update_iree_requirement_pins.yml`](https://github.com/nod-ai/shark-ai/blob/main/.github/workflows/update_iree_requirement_pins.yml).
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: shark-pr-automator[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Found one issue with our error propagation for semaphore failures; currently testing whether this is causing the issue.
That error propagation was not causing the issue. Still investigating possible causes. I am starting to suspect something wrong with the kernels, but I want to make sure: this is 100% reproducible for me, and it always fails at exactly the same place.
Closing in on a root cause.
Recent attempts to update the versions of IREE used in shark-ai have resulted in 6-hour job timeouts in the "CI - sharktank / Data-dependent Tests" job, source in this file: https://github.com/nod-ai/shark-ai/blob/main/.github/workflows/ci-sharktank.yml

First observed on version 3.2.0rc20250124 with #867. Logs with IREE version 3.2.0rc20250129 can be seen on #879, for example https://github.com/nod-ai/shark-ai/actions/runs/13035503409/job/36364799644?pr=879

Suggested actions
Other details
I added pytest-timeout in #868, and that seemed to stop tests as expected with a 10-second timeout, but a 600-second timeout is clearly not working. The runner itself could be unhealthy, or the tests could be stalled in a way that avoids the timeout (pytest-xdist and pytest-timeout are sometimes not compatible).
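As a sketch of that direction, pytest-timeout's thread-based method can be forced per test; whether it fires reliably when the hang is inside a native HIP call, and whether it cooperates with the pytest-xdist workers used in CI, are exactly the assumptions to verify (the test name is hypothetical):

```python
import pytest

# The thread method runs a watchdog thread that dumps all stacks and aborts the
# run when the limit expires, so it can still trigger when the test is stuck in
# native code, whereas the signal method needs the interpreter to regain control.
@pytest.mark.timeout(600, method="thread")
def test_flux_transformer_compiled():
    ...
```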