Add more logging to sharktank data tests. #917

Merged
ScottTodd merged 2 commits into nod-ai:main from data-tests-logging on Feb 6, 2025

Conversation

@ScottTodd ScottTodd (Member) commented Feb 5, 2025

@renxida renxida (Contributor) left a comment

More logs good.

@ScottTodd (Member Author):

(Waiting on CI runners to be available to actually see if this works)

@ScottTodd ScottTodd marked this pull request as ready for review February 6, 2025 00:02
@ScottTodd (Member Author):

Workflow had a few failures that I can only assume are unrelated to the logging changes. Maybe due to other workloads on the machine? (out of memory)

Wed, 05 Feb 2025 23:59:01 GMT FAILED sharktank/tests/models/flux/flux_test.py::FluxTest::testCompareDevIreeF32AgainstHuggingFaceF32 - RuntimeError: Error invoking function: c/runtime/src/iree/hal/drivers/hip/event_semaphore.c:677: ABORTED; the semaphore was aborted; while invoking native function hal.fence.await; while calling import; 
Wed, 05 Feb 2025 23:59:01 GMT FAILED sharktank/tests/models/vae/vae_test.py::VaeSDXLDecoderTest::testVaeIreeVsHuggingFace - RuntimeError: Error invoking function: c/runtime/src/iree/hal/drivers/hip/hip_allocator.c:669: RESOURCE_EXHAUSTED; HIP driver error 'hipErrorOutOfMemory' (2): out of memory; while invoking native function hal.device.queue.alloca; while calling import; 
Wed, 05 Feb 2025 23:59:01 GMT FAILED sharktank/tests/models/vae/vae_test.py::VaeFluxDecoderTest::testVaeIreeVsHuggingFace - RuntimeError: Error invoking function: c/runtime/src/iree/hal/drivers/hip/event_semaphore.c:677: ABORTED; the semaphore was aborted; while invoking native function hal.fence.await; while calling import; 

- dataset.save(output_irpa_file, io_report_callback=logger.info)
+ dataset.save(output_irpa_file, io_report_callback=logger.debug)

@ScottTodd (Member Author):

Ah, maybe this wasn't enough to tune the verbosity: https://github.com/nod-ai/shark-ai/actions/runs/13168068091/job/36752792808?pr=917#step:6:483

Wed, 05 Feb 2025 23:59:01 GMT ----------------------------- Captured stdout call -----------------------------
Wed, 05 Feb 2025 23:59:01 GMT Add PrimitiveTensor(decoder.conv_in.bias, [512], torch.float32)
Wed, 05 Feb 2025 23:59:01 GMT Add PrimitiveTensor(decoder.conv_in.weight, [512, 4, 3, 3], torch.float32)
Wed, 05 Feb 2025 23:59:01 GMT Add PrimitiveTensor(decoder.conv_norm_out.bias, [128], torch.float32)
Wed, 05 Feb 2025 23:59:01 GMT Add PrimitiveTensor(decoder.conv_norm_out.weight, [128], torch.float32)
Wed, 05 Feb 2025 23:59:01 GMT Add PrimitiveTensor(decoder.conv_out.bias, [3], torch.float32)
Wed, 05 Feb 2025 23:59:01 GMT Add PrimitiveTensor(decoder.conv_out.weight, [3, 128, 3, 3], torch.float32)
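
For context, a minimal sketch of what the reviewed change does, assuming (based only on the diff and the captured output above) that dataset.save invokes io_report_callback once per tensor with a human-readable report line; save_sketch and its message format are illustrative, not sharktank's actual implementation:

    import logging

    logger = logging.getLogger("sharktank")

    def save_sketch(tensors, io_report_callback=None):
        # Assumed contract: call the callback once per tensor with a report
        # line like the "Add PrimitiveTensor(...)" output captured above.
        for name, shape, dtype in tensors:
            if io_report_callback is not None:
                io_report_callback(f"Add PrimitiveTensor({name}, {list(shape)}, {dtype})")

    # Passing logger.debug instead of logger.info only changes the level of the
    # emitted records; whether they are shown still depends on the logging setup.
    save_sketch([("decoder.conv_in.bias", (512,), "torch.float32")],
                io_report_callback=logger.debug)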

@renxida (Contributor):

I think pytest's --log-cli-level=info acts as a minimum-log-level filter on what it captures and reports, so level=debug messages get filtered out by pytest even though dataset.save does output them to stdout.
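
A small sketch of the level filtering described here, using only the standard logging module (the INFO threshold stands in for pytest's --log-cli-level=info; nothing below is pytest-specific):

    import logging

    logging.basicConfig(level=logging.INFO)  # comparable threshold to --log-cli-level=info
    logger = logging.getLogger("sharktank")

    logger.info("Add PrimitiveTensor(...)")   # at or above the INFO threshold: shown
    logger.debug("Add PrimitiveTensor(...)")  # below the threshold: filtered out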

@ScottTodd (Member Author):

Oh, I see the same logs on https://github.com/nod-ai/shark-ai/actions/runs/13166355851/job/36801067392?pr=918#step:6:138, though only when the test fails. That can be addressed separately then.

@sogartar sogartar (Contributor) commented Feb 6, 2025

One thing that a lot of tests would benefit from is enabling HIP error logging. It can be enabled with the env var AMD_LOG_LEVEL=1. Maybe put this in the CI env preparation.

For example, the error below may reveal something. Is it again running out of memory?

Wed, 05 Feb 2025 23:59:01 GMT FAILED sharktank/tests/models/flux/flux_test.py::FluxTest::testCompareDevIreeF32AgainstHuggingFaceF32 - RuntimeError: Error invoking function: c/runtime/src/iree/hal/drivers/hip/event_semaphore.c:677: ABORTED; the semaphore was aborted; while invoking native function hal.fence.await; while calling import; 
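
One possible way to wire this up, sketched as a conftest.py snippet rather than a CI workflow change (this assumes the HIP runtime reads AMD_LOG_LEVEL when the test process first initializes it; exporting the variable in the CI job environment would likely have the same effect):

    # conftest.py (sketch): enable HIP's own error logging for the test session.
    import os

    # AMD_LOG_LEVEL=1 asks the HIP runtime to print error-level diagnostics,
    # which may explain ABORTED / RESOURCE_EXHAUSTED failures like the one above.
    os.environ.setdefault("AMD_LOG_LEVEL", "1")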

@renxida renxida (Contributor) commented Feb 6, 2025

> Workflow had a few failures that I can only assume are unrelated to the logging changes. Maybe due to other workloads on the machine? (out of memory)
>
> Wed, 05 Feb 2025 23:59:01 GMT FAILED sharktank/tests/models/flux/flux_test.py::FluxTest::testCompareDevIreeF32AgainstHuggingFaceF32 - RuntimeError: Error invoking function: c/runtime/src/iree/hal/drivers/hip/event_semaphore.c:677: ABORTED; the semaphore was aborted; while invoking native function hal.fence.await; while calling import;
> Wed, 05 Feb 2025 23:59:01 GMT FAILED sharktank/tests/models/vae/vae_test.py::VaeSDXLDecoderTest::testVaeIreeVsHuggingFace - RuntimeError: Error invoking function: c/runtime/src/iree/hal/drivers/hip/hip_allocator.c:669: RESOURCE_EXHAUSTED; HIP driver error 'hipErrorOutOfMemory' (2): out of memory; while invoking native function hal.device.queue.alloca; while calling import;
> Wed, 05 Feb 2025 23:59:01 GMT FAILED sharktank/tests/models/vae/vae_test.py::VaeFluxDecoderTest::testVaeIreeVsHuggingFace - RuntimeError: Error invoking function: c/runtime/src/iree/hal/drivers/hip/event_semaphore.c:677: ABORTED; the semaphore was aborted; while invoking native function hal.fence.await; while calling import;

#918 also fails with "ABORTED; the semaphore was aborted; while invoking native function hal.fence.await; while calling import;"

Going to try adding AMD_LOG_LEVEL=1 too.

Better yet, maybe we should make a separate PR that adds AMD_LOG_LEVEL=1 and have both of our PRs pull from it?

@ScottTodd (Member Author):

> Better yet, maybe we should make a separate PR that adds AMD_LOG_LEVEL=1 and have both of our PRs pull from it?

Yeah, I'm not set up to test that myself, so I'd rather have someone more familiar with the AMDGPU runtime make such a change.

@ScottTodd (Member Author):

Got a clean CI run after some retries: https://github.com/nod-ai/shark-ai/actions/runs/13168068091/job/36802999721?pr=917 . PTAL?

@renxida renxida (Contributor) left a comment

lgtm

@ScottTodd ScottTodd merged commit a40a486 into nod-ai:main Feb 6, 2025
33 checks passed
@ScottTodd ScottTodd deleted the data-tests-logging branch February 6, 2025 23:36
renxida added a commit that referenced this pull request Feb 14, 2025
In the cycle leading up to the 3.2.0 release, we had some trouble triaging CI error reports and routing bugs to the appropriate teams and people. This and #917 are part of addressing that.

The plan is to make shortfin's LLM server log all relevant InferenceExecRequests when something (e.g. KV cache allocations or NaNs in logits) doesn't check out.
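
As a rough illustration of that plan (every name here is hypothetical, not shortfin's actual API; it only sketches the idea of logging the offending request when something looks wrong):

    import logging
    import torch

    logger = logging.getLogger("shortfin.llm")

    def check_exec_request(request_id: str, logits: torch.Tensor) -> None:
        # Hypothetical hook: if the logits contain NaNs, log enough context
        # about the request to route the report to the right people.
        if torch.isnan(logits).any():
            logger.error("NaN logits for request %s (shape=%s, dtype=%s)",
                         request_id, tuple(logits.shape), logits.dtype)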