Add more logging to sharktank data tests. #917

Merged
ScottTodd merged 2 commits into nod-ai:main from data-tests-logging on Feb 6, 2025

Conversation

@ScottTodd ScottTodd (Member) commented Feb 5, 2025

@renxida renxida (Contributor) left a comment

More logs good.

@ScottTodd (Member Author):

(Waiting on CI runners to be available to actually see if this works)

@ScottTodd ScottTodd marked this pull request as ready for review February 6, 2025 00:02
@ScottTodd (Member Author):

Workflow had a few failures that I can only assume are unrelated to the logging changes. Maybe due to other workloads on the machine? (out of memory)

Wed, 05 Feb 2025 23:59:01 GMT FAILED sharktank/tests/models/flux/flux_test.py::FluxTest::testCompareDevIreeF32AgainstHuggingFaceF32 - RuntimeError: Error invoking function: c/runtime/src/iree/hal/drivers/hip/event_semaphore.c:677: ABORTED; the semaphore was aborted; while invoking native function hal.fence.await; while calling import; 
Wed, 05 Feb 2025 23:59:01 GMT FAILED sharktank/tests/models/vae/vae_test.py::VaeSDXLDecoderTest::testVaeIreeVsHuggingFace - RuntimeError: Error invoking function: c/runtime/src/iree/hal/drivers/hip/hip_allocator.c:669: RESOURCE_EXHAUSTED; HIP driver error 'hipErrorOutOfMemory' (2): out of memory; while invoking native function hal.device.queue.alloca; while calling import; 
Wed, 05 Feb 2025 23:59:01 GMT FAILED sharktank/tests/models/vae/vae_test.py::VaeFluxDecoderTest::testVaeIreeVsHuggingFace - RuntimeError: Error invoking function: c/runtime/src/iree/hal/drivers/hip/event_semaphore.c:677: ABORTED; the semaphore was aborted; while invoking native function hal.fence.await; while calling import; 

- dataset.save(output_irpa_file, io_report_callback=logger.info)
+ dataset.save(output_irpa_file, io_report_callback=logger.debug)

@ScottTodd (Member Author):

Ah, maybe this wasn't enough to tune the verbosity: https://github.com/nod-ai/shark-ai/actions/runs/13168068091/job/36752792808?pr=917#step:6:483

Wed, 05 Feb 2025 23:59:01 GMT ----------------------------- Captured stdout call -----------------------------
Wed, 05 Feb 2025 23:59:01 GMT Add PrimitiveTensor(decoder.conv_in.bias, [512], torch.float32)
Wed, 05 Feb 2025 23:59:01 GMT Add PrimitiveTensor(decoder.conv_in.weight, [512, 4, 3, 3], torch.float32)
Wed, 05 Feb 2025 23:59:01 GMT Add PrimitiveTensor(decoder.conv_norm_out.bias, [128], torch.float32)
Wed, 05 Feb 2025 23:59:01 GMT Add PrimitiveTensor(decoder.conv_norm_out.weight, [128], torch.float32)
Wed, 05 Feb 2025 23:59:01 GMT Add PrimitiveTensor(decoder.conv_out.bias, [3], torch.float32)
Wed, 05 Feb 2025 23:59:01 GMT Add PrimitiveTensor(decoder.conv_out.weight, [3, 128, 3, 3], torch.float32)
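
For context, a minimal sketch of what the reviewed change does, assuming (based only on the diff and the captured output above) that dataset.save invokes io_report_callback once per tensor with a human-readable report line; save_sketch and its message format are illustrative, not sharktank's actual implementation:

    import logging

    logger = logging.getLogger("sharktank")

    def save_sketch(tensors, io_report_callback=None):
        # Assumed contract: call the callback once per tensor with a report
        # line like the "Add PrimitiveTensor(...)" output captured above.
        for name, shape, dtype in tensors:
            if io_report_callback is not None:
                io_report_callback(f"Add PrimitiveTensor({name}, {list(shape)}, {dtype})")

    # Passing logger.debug instead of logger.info only changes the level of the
    # emitted records; whether they are shown still depends on the logging setup.
    save_sketch([("decoder.conv_in.bias", (512,), "torch.float32")],
                io_report_callback=logger.debug)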

@renxida (Contributor):

I think pytest's --log-cli-level=info acts as a minimum-log-level filter on what it captures and reports, so level=debug messages get filtered out by pytest even though dataset.save does output them to stdout.
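
A small sketch of the level filtering described here, using only the standard logging module (the INFO threshold stands in for pytest's --log-cli-level=info; nothing below is pytest-specific):

    import logging

    logging.basicConfig(level=logging.INFO)  # comparable threshold to --log-cli-level=info
    logger = logging.getLogger("sharktank")

    logger.info("Add PrimitiveTensor(...)")   # at or above the INFO threshold: shown
    logger.debug("Add PrimitiveTensor(...)")  # below the threshold: filtered out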

@ScottTodd (Member Author):

Oh, I see the same logs on https://github.com/nod-ai/shark-ai/actions/runs/13166355851/job/36801067392?pr=918#step:6:138, though only when the test fails. That can be addressed separately then.

@sogartar sogartar (Contributor) commented Feb 6, 2025

One thing that a lot of tests would benefit from is enabling HIP error logging. It can be enabled with the env var AMD_LOG_LEVEL=1. Maybe put this in the CI env preparation.

For example, the error below may reveal something. Is it again running out of memory?

Wed, 05 Feb 2025 23:59:01 GMT FAILED sharktank/tests/models/flux/flux_test.py::FluxTest::testCompareDevIreeF32AgainstHuggingFaceF32 - RuntimeError: Error invoking function: c/runtime/src/iree/hal/drivers/hip/event_semaphore.c:677: ABORTED; the semaphore was aborted; while invoking native function hal.fence.await; while calling import; 
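
One possible way to wire this up, sketched as a conftest.py snippet rather than a CI workflow change (this assumes the HIP runtime reads AMD_LOG_LEVEL when the test process first initializes it; exporting the variable in the CI job environment would likely have the same effect):

    # conftest.py (sketch): enable HIP's own error logging for the test session.
    import os

    # AMD_LOG_LEVEL=1 asks the HIP runtime to print error-level diagnostics,
    # which may explain ABORTED / RESOURCE_EXHAUSTED failures like the one above.
    os.environ.setdefault("AMD_LOG_LEVEL", "1")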

@renxida renxida (Contributor) commented Feb 6, 2025

> Workflow had a few failures that I can only assume are unrelated to the logging changes. Maybe due to other workloads on the machine? (out of memory)
>
> Wed, 05 Feb 2025 23:59:01 GMT FAILED sharktank/tests/models/flux/flux_test.py::FluxTest::testCompareDevIreeF32AgainstHuggingFaceF32 - RuntimeError: Error invoking function: c/runtime/src/iree/hal/drivers/hip/event_semaphore.c:677: ABORTED; the semaphore was aborted; while invoking native function hal.fence.await; while calling import;
> Wed, 05 Feb 2025 23:59:01 GMT FAILED sharktank/tests/models/vae/vae_test.py::VaeSDXLDecoderTest::testVaeIreeVsHuggingFace - RuntimeError: Error invoking function: c/runtime/src/iree/hal/drivers/hip/hip_allocator.c:669: RESOURCE_EXHAUSTED; HIP driver error 'hipErrorOutOfMemory' (2): out of memory; while invoking native function hal.device.queue.alloca; while calling import;
> Wed, 05 Feb 2025 23:59:01 GMT FAILED sharktank/tests/models/vae/vae_test.py::VaeFluxDecoderTest::testVaeIreeVsHuggingFace - RuntimeError: Error invoking function: c/runtime/src/iree/hal/drivers/hip/event_semaphore.c:677: ABORTED; the semaphore was aborted; while invoking native function hal.fence.await; while calling import;

#918 also fails with "ABORTED; the semaphore was aborted; while invoking native function hal.fence.await; while calling import;"

Going to try adding AMD_LOG_LEVEL=1 too.

Better yet, maybe we should make a separate PR that adds AMD_LOG_LEVEL=1 and have both of our PRs pull from it?

@ScottTodd (Member Author):

> Better yet, maybe we should make a separate PR that adds AMD_LOG_LEVEL=1 and have both of our PRs pull from it?

Yeah, I'm not set up to test that myself, so I'd rather have someone more familiar with the AMDGPU runtime make such a change.

@ScottTodd (Member Author):

Got a clean CI run after some retries: https://github.com/nod-ai/shark-ai/actions/runs/13168068091/job/36802999721?pr=917 . PTAL?

@renxida renxida (Contributor) left a comment

lgtm

@ScottTodd ScottTodd merged commit a40a486 into nod-ai:main Feb 6, 2025
33 checks passed
@ScottTodd ScottTodd deleted the data-tests-logging branch February 6, 2025 23:36
renxida added a commit that referenced this pull request Feb 14, 2025
In the cycle leading up to the 3.2.0 release, we had some trouble triaging CI error reports and routing bugs to the appropriate teams and people. This and #917 are part of addressing that.

The plan is to make shortfin's LLM server log all relevant InferenceExecRequests when something (e.g. KV cache allocations or NaNs in logits) doesn't check out.
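
As a rough illustration of that plan (every name here is hypothetical, not shortfin's actual API; it only sketches the idea of logging the offending request when something looks wrong):

    import logging
    import torch

    logger = logging.getLogger("shortfin.llm")

    def check_exec_request(request_id: str, logits: torch.Tensor) -> None:
        # Hypothetical hook: if the logits contain NaNs, log enough context
        # about the request to route the report to the right people.
        if torch.isnan(logits).any():
            logger.error("NaN logits for request %s (shape=%s, dtype=%s)",
                         request_id, tuple(logits.shape), logits.dtype)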