@akashveramd akashveramd commented Jul 19, 2025

This PR fixes the failing test test_sharded_grad_scaler_found_inf, tracked in Jira ticket https://ontrack-internal.amd.com/browse/SWDEV-479939.

The test was failing only on the Navi arch, and only when run with cpu_offload=true. Setting cpu_offload=true makes the grad scaler's optimizer step run on the CPU. That step uses vectorized and scalar operations to find inf values in tensors. On Navi, the vectorized operation seems to run unreliably, perhaps taking longer to execute, which results in the failure. Adding a sleep statement in the vectorized operation lets the test pass.
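
For context, here is a minimal pure-Python sketch of what a found-inf check does conceptually: it unscales each gradient value and raises a flag if any value is non-finite. This is not the PyTorch kernel touched by this PR. The function name `unscale_and_check`, the list-of-lists gradient layout, and the `inv_scale` parameter are assumptions made for illustration only.

```python
import math


def unscale_and_check(grads, inv_scale):
    """Hypothetical helper: unscale gradients in place and report non-finite values.

    Returns 1.0 if any gradient value is inf or NaN, otherwise 0.0.
    """
    found_inf = 0.0
    for grad in grads:
        for i, value in enumerate(grad):
            # Scalar path: inspect one element at a time.
            if not math.isfinite(value):
                found_inf = 1.0
            grad[i] = value * inv_scale
    return found_inf


clean = [[2.0, 4.0], [8.0]]
overflowed = [[2.0, float("inf")], [8.0]]

clean_flag = unscale_and_check(clean, 0.5)
overflow_flag = unscale_and_check(overflowed, 0.5)
```

The test expects the flag to be set when an inf is present in the gradients. A vectorized implementation does the same check several elements at a time, so an unreliable vectorized path could miss the inf and leave the flag unset.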

@akashveramd akashveramd self-assigned this Jul 19, 2025

rocm-repo-management-api bot commented Jul 19, 2025

Jenkins build for 72dc0f072a8a9007122a8747a770a722a7838d4e commit finished as NOT_BUILT
Links: Blue Ocean view / Build artifacts

rocm-repo-management-api bot commented Jul 19, 2025

Jenkins build for 72dc0f072a8a9007122a8747a770a722a7838d4e commit finished as FAILURE
Links: Blue Ocean view / Build artifacts
