Error Handling: propagate status for ReleaseGilAndTransferData and XlaDataToTensors. #9431

Draft: wants to merge 2 commits into base: ysiraichi/status-for-oom-errors

ysiraichi
Collaborator

This PR refactors our error handling by replacing GetValueOrThrow with proper status propagation using absl::StatusOr<T> and XLA_ASSIGN_OR_RETURN macros.

Key Changes:

  • ReleaseGilAndTransferData Function:

    • Updated the function signature to return absl::StatusOr<std::vector<xla::Literal>>.
    • Replaced GetComputationClientOrDie() with GetComputationClient().
    • Utilized XLA_ASSIGN_OR_RETURN for client acquisition and TransferFromDevice calls.
    • Updated callers in tensor_util.cpp and xla_graph_executor.cpp to handle the new StatusOr<T> return type.
  • XlaDataToTensors Function:

    • Modified the function signature to return absl::StatusOr<std::vector<at::Tensor>>.
    • Replaced GetValueOrThrow with XLA_ASSIGN_OR_RETURN for the ReleaseGilAndTransferData call.
    • Updated all callers (including XLATensor::ToTensor, test_xla_sharding.cpp, init_python_bindings.cpp, and xla_backend_impl.cpp) to unwrap the new StatusOr<T> return value with GetValueOrThrow.
    • Added necessary status.h includes to xla_backend_impl.cpp and test_xla_sharding.cpp.

These modifications align with existing status propagation patterns in the codebase, as seen in pjrt_registry.cpp, and maintain API-level backward compatibility while improving internal error handling within the tensor conversion pipeline.
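The propagation pattern described above can be sketched in plain C++. This is a minimal illustration, not the PR's code: `Status`, `StatusOr`, `SKETCH_ASSIGN_OR_RETURN`, `FakeClient`, `TransferData`, and `ToTensors` are all hypothetical stand-ins written for this example, and the real `absl::StatusOr<T>` and `XLA_ASSIGN_OR_RETURN` are richer than what is shown here.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical, minimal stand-in for absl::Status (not the real class).
struct Status {
  std::string message;
};

// Hypothetical, minimal stand-in for absl::StatusOr<T>: holds either a
// value of type T or an error Status.
template <typename T>
class StatusOr {
 public:
  StatusOr(T value) : data_(std::in_place_type<T>, std::move(value)) {}
  StatusOr(Status status)
      : data_(std::in_place_type<Status>, std::move(status)) {}

  bool ok() const { return std::holds_alternative<T>(data_); }
  const Status& status() const {
    assert(!ok());
    return std::get<Status>(data_);
  }
  T& value() {
    assert(ok());
    return std::get<T>(data_);
  }

 private:
  std::variant<T, Status> data_;
};

// Hypothetical macro in the spirit of XLA_ASSIGN_OR_RETURN: evaluate the
// expression, return its error to the caller on failure, otherwise move
// the value into `lhs`.
#define SKETCH_CONCAT_INNER(a, b) a##b
#define SKETCH_CONCAT(a, b) SKETCH_CONCAT_INNER(a, b)
#define SKETCH_ASSIGN_OR_RETURN_IMPL(tmp, lhs, rexpr) \
  auto tmp = (rexpr);                                 \
  if (!tmp.ok()) return tmp.status();                 \
  lhs = std::move(tmp.value())
#define SKETCH_ASSIGN_OR_RETURN(lhs, rexpr)                             \
  SKETCH_ASSIGN_OR_RETURN_IMPL(SKETCH_CONCAT(sketch_tmp_, __LINE__), \
                               lhs, rexpr)

// Hypothetical device client; `fail` simulates a failed transfer.
struct FakeClient {
  bool fail;
  StatusOr<std::vector<int>> TransferFromDevice() const {
    if (fail) return Status{"transfer failed"};
    return std::vector<int>{1, 2, 3};
  }
};

// Plays the role of ReleaseGilAndTransferData: forwards any error upward.
StatusOr<std::vector<int>> TransferData(const FakeClient& client) {
  SKETCH_ASSIGN_OR_RETURN(std::vector<int> literals,
                          client.TransferFromDevice());
  return literals;
}

// Plays the role of XlaDataToTensors: builds on TransferData and
// propagates its status instead of throwing.
StatusOr<std::vector<double>> ToTensors(const FakeClient& client) {
  SKETCH_ASSIGN_OR_RETURN(std::vector<int> literals, TransferData(client));
  std::vector<double> tensors(literals.begin(), literals.end());
  return tensors;
}
```

In this sketch, an error raised at the lowest level (`TransferFromDevice`) reaches the caller of `ToTensors` unchanged, and no intermediate function throws.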

@ysiraichi
Collaborator Author

Blocked until #9429 is merged.

@ysiraichi ysiraichi force-pushed the ysiraichi/propagate-status-for-oom branch from 9d505e7 to 5d4742b Compare July 1, 2025 16:44
@ysiraichi ysiraichi force-pushed the ysiraichi/status-for-oom-errors branch from 247fdf5 to b390a61 Compare July 1, 2025 18:11
@ysiraichi ysiraichi force-pushed the ysiraichi/propagate-status-for-oom branch from 5d4742b to 40a75d7 Compare July 1, 2025 18:11
@ysiraichi ysiraichi force-pushed the ysiraichi/status-for-oom-errors branch from b390a61 to 821c384 Compare July 1, 2025 18:15
@ysiraichi ysiraichi force-pushed the ysiraichi/propagate-status-for-oom branch 2 times, most recently from b0e25da to 97ef4c1 Compare July 3, 2025 14:41
@ysiraichi ysiraichi force-pushed the ysiraichi/status-for-oom-errors branch from 821c384 to 08c5ecd Compare July 3, 2025 14:41
ysiraichi added 2 commits July 3, 2025 12:42
…ansferData`

Modify `ReleaseGilAndTransferData` function to use proper status propagation
instead of `GetValueOrThrow` with `GetComputationClientOrDie`. This improves
error handling by allowing status types to be propagated up the call stack
rather than immediately throwing exceptions.

Changes:
- Update function signature to return `absl::StatusOr<std::vector<xla::Literal>>`
- Replace `GetComputationClientOrDie()` with `GetComputationClient()`
- Use `XLA_ASSIGN_OR_RETURN` macros for both client acquisition and `TransferFromDevice`
- Update callers in tensor_util.cpp and xla_graph_executor.cpp to handle `StatusOr<T>`

This follows the status propagation patterns used elsewhere in the codebase
and aligns with the examples in pjrt_registry.cpp.

Modify `XlaDataToTensors` function to use proper status propagation instead of
`GetValueOrThrow`, and update all callers to handle the new `StatusOr<T>` return type.
This continues the status propagation improvements started with
`ReleaseGilAndTransferData`.

Changes:
- Update `XlaDataToTensors` signature to return `absl::StatusOr<std::vector<at::Tensor>>`
- Replace `GetValueOrThrow` with `XLA_ASSIGN_OR_RETURN` for `ReleaseGilAndTransferData` call
- Update all callers to unwrap the result with the `GetValueOrThrow` wrapper:
  - `XLATensor::ToTensor` in tensor.cpp:515
  - test_xla_sharding.cpp:31
  - init_python_bindings.cpp:2716
  - xla_backend_impl.cpp:95
- Add necessary status.h includes to xla_backend_impl.cpp and test_xla_sharding.cpp

This maintains backward compatibility at the API level while enabling proper
status propagation internally within the tensor conversion pipeline.
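A rough sketch of how unwrapping at the call site keeps the caller's signature unchanged. The names `Result`, `ValueOrThrow`, `ConvertToTensors`, and `ToTensor` are hypothetical stand-ins for this example, and it assumes (from the name only) that `GetValueOrThrow` turns a non-OK status into a C++ exception. The exception type used here is an illustrative choice, not taken from the PR.

```cpp
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, minimal stand-in for absl::StatusOr<T>: holds either a
// value or an error message.
template <typename T>
struct Result {
  std::optional<T> value;
  std::string error;  // Only meaningful when `value` is empty.
  bool ok() const { return value.has_value(); }
};

// Hypothetical stand-in for GetValueOrThrow: returns the value on success
// and throws on error. std::runtime_error is an assumption of this sketch.
template <typename T>
T ValueOrThrow(Result<T> result) {
  if (!result.ok()) throw std::runtime_error(result.error);
  return std::move(*result.value);
}

// Plays the role of the status-returning XlaDataToTensors.
Result<std::vector<double>> ConvertToTensors(bool simulate_failure) {
  if (simulate_failure) return {std::nullopt, "device transfer failed"};
  return {std::vector<double>{1.0, 2.0}, ""};
}

// Plays the role of a caller such as XLATensor::ToTensor: its return type
// is unchanged, so existing users still see a plain value or an exception.
std::vector<double> ToTensor(bool simulate_failure) {
  return ValueOrThrow(ConvertToTensors(simulate_failure));
}
```

The design point is that the status travels as a return value through the internal layers and becomes an exception only at the outermost caller.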
@ysiraichi ysiraichi force-pushed the ysiraichi/propagate-status-for-oom branch from 97ef4c1 to de09876 Compare July 3, 2025 15:42