Add mi300 runner for toy_llama tests #19961

Open
wants to merge 4 commits into
base: main

@stbaione stbaione commented Feb 11, 2025

Currently, the toy_llama tests run only on a w7900. I recently bisected an issue where sharding accuracy was broken for llama in shortfin on mi300 (#19872), and when I run iree-test-suites locally, I get a failure:

================================================= short test summary info ==================================================
PASSED sharktank_models/clip/test_clip.py::test_results_close[bf16-hip]
PASSED sharktank_models/clip/test_clip.py::test_results_close[f32-hip]
PASSED sharktank_models/llama3.1/test_llama.py::test_prefill[hip]
PASSED sharktank_models/llama3.1/test_llama.py::test_decode[hip]
PASSED sharktank_models/llama3.1/test_llama.py::test_prefill_decode[hip]
PASSED sharktank_models/llama3.1/test_llama.py::test_prefill[hip_tp2]
PASSED sharktank_models/llama3.1/test_llama.py::test_prefill_decode[hip_tp2]
FAILED sharktank_models/llama3.1/test_llama.py::test_decode[hip_tp2] - AssertionError: cross entropy outside of tolerance
================================== 1 failed, 7 passed, 8 deselected, 2 warnings in 17.84s ==================================

I suspect that an mi300 runner would have caught the bug had it been enabled. If it would not have, then we can determine how to extend the tests to cover the case we saw.
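For readers unfamiliar with the `cross entropy outside of tolerance` assertion in the summary above, here is a minimal sketch of what such a check can look like. The real check in iree-test-suites is not shown in this thread, so the function names, the reference value, and the tolerance below are all hypothetical.

```python
import math


def mean_cross_entropy(logits_per_position, target_ids):
    """Mean of -log softmax(logits)[target] over all positions."""
    total = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        # Numerically stable log-sum-exp of the logits at this position.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[target]
    return total / len(target_ids)


def assert_cross_entropy_close(logits_per_position, target_ids, expected, atol=1e-2):
    """Hypothetical tolerance check; `expected` and `atol` are assumptions."""
    actual = mean_cross_entropy(logits_per_position, target_ids)
    assert abs(actual - expected) <= atol, "cross entropy outside of tolerance"
    return actual
```

A sharded (`hip_tp2`) run that produces slightly different logits than the unsharded run would pass such a check only while the resulting drift in cross entropy stays inside `atol`.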

Reference issues for sharded failure: #19948, nod-ai/shark-ai#934

ci-exactly: build_packages,test_sharktank

@stbaione stbaione requested a review from ScottTodd as a code owner February 11, 2025 19:31
Comment on lines +38 to +41
- name: hip_task_mi300
  target: target_hip
  gpu: gfx942
  runs-on: linux-mi300-gpu-1
Member

Looks like these runners need GPU drivers, either preinstalled or via Docker?

https://github.com/iree-org/iree/actions/runs/13270996475/job/37050767466?pr=19961#step:8:330

ERROR iree-test-suites/sharktank_models/llama3.1/test_llama.py::test_prefill[hip] - RuntimeError: Error creating driver: iree/runtime/src/iree/hal/drivers/hip/dynamic_symbols.c:160: UNAVAILABLE; HIP runtime library 'amdhip64.dll'/'libamdhip64.so' not available: please ensure installed and in dynamic library search path: 
  Tried: libamdhip64.so
    iree/runtime/src/iree/base/internal/dynamic_library_posix.c:165: NOT_FOUND; failed to load dynamic library (possibly not found on any search path): libamdhip64.so: cannot open shared object file: No such file or directory

See what this other workflow does:

jobs:
  test_mi300:
    runs-on: linux-mi300-gpu-1
    container:
      image: rocm/dev-ubuntu-22.04:6.3
      options: --user root --device=/dev/kfd --device=/dev/dri --ipc=host --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined

cc @yamiyysu
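
For the missing `libamdhip64.so` error above, a quick pre-flight check on the runner (or inside the container) can confirm whether the HIP runtime is loadable before any tests start. This is a minimal sketch using only the Python standard library; the helper name `hip_runtime_available` is hypothetical and is not part of IREE or iree-test-suites.

```python
import ctypes
import ctypes.util


def hip_runtime_available(name="amdhip64"):
    """Return True if a shared library with this name can be loaded."""
    candidates = []
    # Ask the platform's library lookup first (may return None).
    found = ctypes.util.find_library(name)
    if found:
        candidates.append(found)
    # Fall back to the conventional Linux file name from the error message.
    candidates.append(f"lib{name}.so")
    for candidate in candidates:
        try:
            ctypes.CDLL(candidate)
            return True
        except OSError:
            continue
    return False


if __name__ == "__main__":
    print("HIP runtime loadable:", hip_runtime_available())
```

If this prints `False` on the runner, the driver creation error is expected, and the fix is to provide the ROCm libraries, for example through a container image as in the workflow snippet above.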

Author

Added the line to register the container for the mi300 runner.

Member

Still missing some deps, like these:

- name: "Install dependencies"
  run: |
    sudo apt-get update
    sudo apt-get install -y cmake ninja-build clang lld git

Trying to install those will run into #19955 though.

Author

Yeah, latest run hit that error

@stbaione stbaione requested a review from ScottTodd February 11, 2025 21:33
Comment on lines 24 to 25
# Dynamically assign container for `linux-mi300-gpu-1`
container: ${{ fromJSON(toJSON(matrix.container || {})) }}
Member

There is a syntax error here: https://github.com/iree-org/iree/actions/runs/13272021471

Invalid workflow file: .github/workflows/pkgci.yml#L111
The workflow is not valid. In .github/workflows/pkgci.yml (Line: 111, Col: 11): Error from called workflow iree-org/iree/.github/workflows/pkgci_test_sharktank.yml@29df51be7e68c626460b68d7e9a80f654091adb8 (Line: 25, Col: 16): Unexpected symbol: '{}'. Located at position 37 within expression: fromJSON(toJSON(matrix.container || {}))

Author

Pushed what should be a fix

Member

So far so good 🤞
