Add more logging to sharktank data tests. #917
Conversation
More logs good.
(Waiting on CI runners to be available to actually see if this works.)
Ah... that's a lot of log spam: https://github.com/nod-ai/shark-ai/actions/runs/13164091940/job/36749667622?pr=917
Workflow had a few failures that I can only assume are unrelated to the logging changes. Maybe due to other workloads on the machine? (out of memory)
-    dataset.save(output_irpa_file, io_report_callback=logger.info)
+    dataset.save(output_irpa_file, io_report_callback=logger.debug)
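For context, the callback here is just a callable that receives one human-readable message per tensor, so any logger method can be plugged in. A minimal sketch of how such an API behaves (the `save` signature, logger name, and tensor data below are illustrative assumptions, not the real sharktank implementation):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("sharktank.example")  # hypothetical logger name

def save(output_path, io_report_callback=None):
    """Sketch of a save routine that reports each written tensor
    through an optional callback; names and data are illustrative."""
    tensors = {"decoder.conv_in.bias": [512]}  # placeholder metadata
    for name, shape in tensors.items():
        if io_report_callback is not None:
            io_report_callback(f"Add PrimitiveTensor({name}, {shape})")
    # ... actual serialization to output_path would go here ...

# Passing logger.debug keeps the per-tensor lines below an INFO-level
# CI capture; logger.info would surface them.
save("model.irpa", io_report_callback=logger.debug)
```

Because the callback is an ordinary callable, swapping `logger.info` for `logger.debug` changes only which log level the messages are emitted at, not whether they are produced.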
Ah, maybe this wasn't enough to tune the verbosity: https://github.com/nod-ai/shark-ai/actions/runs/13168068091/job/36752792808?pr=917#step:6:483
Wed, 05 Feb 2025 23:59:01 GMT ----------------------------- Captured stdout call -----------------------------
Wed, 05 Feb 2025 23:59:01 GMT Add PrimitiveTensor(decoder.conv_in.bias, [512], torch.float32)
Wed, 05 Feb 2025 23:59:01 GMT Add PrimitiveTensor(decoder.conv_in.weight, [512, 4, 3, 3], torch.float32)
Wed, 05 Feb 2025 23:59:01 GMT Add PrimitiveTensor(decoder.conv_norm_out.bias, [128], torch.float32)
Wed, 05 Feb 2025 23:59:01 GMT Add PrimitiveTensor(decoder.conv_norm_out.weight, [128], torch.float32)
Wed, 05 Feb 2025 23:59:01 GMT Add PrimitiveTensor(decoder.conv_out.bias, [3], torch.float32)
Wed, 05 Feb 2025 23:59:01 GMT Add PrimitiveTensor(decoder.conv_out.weight, [3, 128, 3, 3], torch.float32)
I think pytest's --log-cli-level=info acts as a minimum-log-level filter on what it captures, so the logger.debug output gets filtered out by pytest even though dataset.save still emits it to stdout.
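The filtering behavior described above follows from Python's numeric log levels: DEBUG (10) sits below INFO (20), so an INFO threshold drops debug records before any handler or capture sees them. A minimal self-contained sketch (the logger name and messages are made up):

```python
import logging

# DEBUG (10) < INFO (20): a logger configured at INFO discards DEBUG
# records, which is why switching the callback to logger.debug silences
# the per-tensor lines under an INFO-level capture such as
# pytest's --log-cli-level=info.
logger = logging.getLogger("sharktank_example")  # hypothetical name
logger.setLevel(logging.INFO)

records = []
handler = logging.Handler()
handler.emit = lambda record: records.append(record.getMessage())
logger.addHandler(handler)

logger.debug("Add PrimitiveTensor(...)")  # filtered out: below INFO
logger.info("saved output_irpa_file")     # passes the INFO threshold

print(records)  # only the info-level message survives
```

So raising the test's capture level, rather than lowering the callback's level, would be the other knob for tuning this verbosity.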
Oh, I see the same logs on https://github.com/nod-ai/shark-ai/actions/runs/13166355851/job/36801067392?pr=918#step:6:138, just when the test fails. That can be addressed separately then.
One thing that a lot of tests would benefit from is enabling HIP error logging. It is enabled with the env var AMD_LOG_LEVEL. For example, the error below may reveal something. Is it again running out of memory?
#918 also fails with "ABORTED; the semaphore was aborted; while invoking native function hal.fence.await; while calling import". Going to try adding AMD_LOG_LEVEL=1 too. Better yet, maybe we should make a separate PR with AMD_LOG_LEVEL=1 and have both pull from it?
Yeah, I'm not set up to test that myself, so I'd rather someone more familiar with the AMDGPU runtime make such a change.
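If someone does pick this up, exporting the variable at the workflow or job level would be one approach. A sketch of a GitHub Actions fragment (the placement is an assumption; only the variable itself comes from this thread):

```yaml
env:
  # HIP runtime error logging; 1 enables error reporting,
  # higher values are progressively more verbose.
  AMD_LOG_LEVEL: 1
```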
Got a clean CI run after some retries: https://github.com/nod-ai/shark-ai/actions/runs/13168068091/job/36802999721?pr=917. PTAL?
lgtm
In the cycle leading up to the 3.2.0 release, we had some trouble triaging CI error reports / routing bugs to the appropriate teams / people. This and #917 are part of that. The plan is to make shortfin's llm server log all relevant InferenceExecRequests when something (e.g. kvcache allocations / nans in logits) doesn't check out.
Logging more information as tests run will help with the triage process for issues like #888.
Logs before: https://github.com/nod-ai/shark-ai/actions/runs/13150995160/job/36698277502#step:6:27
Logs after: https://github.com/nod-ai/shark-ai/actions/runs/13168068091/job/36752792808?pr=917#step:6:29