Add mi300 runner for toy_llama tests #19961
base: main
Conversation
- name: hip_task_mi300
  target: target_hip
  gpu: gfx942
  runs-on: linux-mi300-gpu-1
Looks like these runners need GPU drivers, either preinstalled or via Docker?
https://github.com/iree-org/iree/actions/runs/13270996475/job/37050767466?pr=19961#step:8:330
ERROR iree-test-suites/sharktank_models/llama3.1/test_llama.py::test_prefill[hip] - RuntimeError: Error creating driver: iree/runtime/src/iree/hal/drivers/hip/dynamic_symbols.c:160: UNAVAILABLE; HIP runtime library 'amdhip64.dll'/'libamdhip64.so' not available: please ensure installed and in dynamic library search path:
Tried: libamdhip64.so
iree/runtime/src/iree/base/internal/dynamic_library_posix.c:165: NOT_FOUND; failed to load dynamic library (possibly not found on any search path): libamdhip64.so: cannot open shared object file: No such file or directory
See what this other workflow does:
iree/.github/workflows/pkgci_test_amd_mi300.yml
Lines 20 to 25 in e5a5881
jobs:
  test_mi300:
    runs-on: linux-mi300-gpu-1
    container:
      image: rocm/dev-ubuntu-22.04:6.3
      options: --user root --device=/dev/kfd --device=/dev/dri --ipc=host --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined
cc @yamiyysu
Added the line to register a container for the mi300 runner.
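For reference, a minimal sketch of what such a matrix entry might look like; the container image and options here are assumptions carried over from the pkgci_test_amd_mi300.yml snippet quoted above, not necessarily what was committed:

- name: hip_task_mi300
  target: target_hip
  gpu: gfx942
  runs-on: linux-mi300-gpu-1
  # Assumed container registration, mirroring the MI300 workflow quoted above.
  container:
    image: rocm/dev-ubuntu-22.04:6.3
    options: --user root --device=/dev/kfd --device=/dev/dri --ipc=host --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined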
Still missing some deps, like these:
iree/.github/workflows/pkgci_test_amd_mi300.yml
Lines 38 to 41 in e5a5881
- name: "Install dependencies" | |
run: | | |
sudo apt-get update | |
sudo apt-get install -y cmake ninja-build clang lld git |
Trying to install those will run into #19955 though.
Yeah, the latest run hit that error.
# Dynamically assign container for `linux-mi300-gpu-1`
container: ${{ fromJSON(toJSON(matrix.container || {})) }}
There is a syntax error here: https://github.com/iree-org/iree/actions/runs/13272021471
Invalid workflow file: .github/workflows/pkgci.yml#L111
The workflow is not valid. In .github/workflows/pkgci.yml (Line: 111, Col: 11): Error from called workflow iree-org/iree/.github/workflows/pkgci_test_sharktank.yml@29df51be7e68c626460b68d7e9a80f654091adb8 (Line: 25, Col: 16): Unexpected symbol: '{}'. Located at position 37 within expression: fromJSON(toJSON(matrix.container || {}))
Pushed what should be a fix
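For context, the `{}` literal is what the parser rejects: GitHub Actions expressions only support string, number, boolean, and null literals, not object literals. A minimal sketch of one way to express the same fallback without the literal (an assumption about the shape of the fix, not necessarily what was actually pushed):

# Hypothetical rewrite: fromJSON('{}') builds the empty-object fallback at
# evaluation time, since a bare {} literal is not valid expression syntax.
container: ${{ fromJSON(toJSON(matrix.container || fromJSON('{}'))) }}

The empty-object fallback keeps jobs whose matrix entry has no container behaving as before, which seems to be what the original `|| {}` default was aiming at.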
So far so good 🤞
Please also resolve the DCO check: https://iree.dev/developers/general/contributing/#developer-certificate-of-origin
Currently, we only run the toy_llama tests on a w7900. I recently bisected an issue where sharding accuracy was broken for llama in shortfin on mi300 (#19872), and when I run iree-test-suites locally, I get failures. I suspect this job would have been able to catch the bug had it been enabled; if it doesn't, we can determine how to extend the tests to cover the case we saw.
Reference issues for sharded failure: #19948, nod-ai/shark-ai#934
ci-exactly: build_packages,test_sharktank