Ensuring PJRT Client destroy/destructor is called #9675

saarthak-aws · 2025-10-07T03:19:53Z

Re-introducing a static unique_ptr to manage the lifecycle of PjRtComputationClient, ensuring that the destructor / destroy method of PJRT Client is called.

After building with this change and running the reproduction steps mentioned in #9669, I have manually confirmed that the PJRT Client destructor is called:

ubuntu@ip-[redacted]:~$ export TF_CPP_MIN_LOG_LEVEL=0; export TF_CPP_VMODULE="cpu_client=1"; export NEURON_RT_LOG_LEVEL=DEBUG; export PJRT_DEVICE=CPU
(aws_neuronx_venv_pytorch_2_8) ubuntu@ip-172-31-59-9:~$ python -c "import torch_xla; device=torch_xla.device()"
WARNING:root:MASTER_ADDR environment variable is not set, defaulting to localhost
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1759797507.812598  222797 cpu_client.cc:311] PjRtCpuClient created.
I0000 00:00:1759797508.238195  222797 cpu_client.cc:314] PjRtCpuClient destroyed.

ubuntu@ip-[redacted]:~$ pip list | grep torch
torch                     2.9.0a0+git21fec65
torch-xla                 2.9.0+git11590c1

zhanyong-wan · 2025-10-07T15:11:16Z

torch_xla/csrc/runtime/runtime.cpp

  // reference.
-  static const auto& maybe_client =
-      *new absl::StatusOr<ComputationClient*>(InitializeComputationClient());
+  static absl::StatusOr<std::unique_ptr<ComputationClient>> init_result =


This violates Google's C++ style guide: https://google.github.io/styleguide/cppguide.html#Static_and_Global_Variables

For singleton objects, we deliberately do not want their destructors to be called, as that can lead to race conditions at program exit time.

I'm not sure what this PR is trying to achieve. Could you clarify why you want to make sure that the PjRt client dtor is called? Usually we don't destroy singleton objects; we just let the OS reclaim the resources when the process terminates.
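To make the two patterns under discussion concrete, here is a minimal sketch using a toy class. `ToyClient`, `g_destroyed`, `LeakedClient`, and `OwnedClient` are hypothetical names for illustration only, not torch_xla code.

```cpp
#include <memory>

// Counts how many times ~ToyClient has run (illustration only).
int g_destroyed = 0;

// Hypothetical stand-in for a client with a non-trivial destructor.
struct ToyClient {
  ~ToyClient() { ++g_destroyed; }
};

// Pattern A: allocate once and never delete. The static is only a
// reference, so nothing runs for it at exit and the OS reclaims the memory.
ToyClient& LeakedClient() {
  static ToyClient& client = *new ToyClient();
  return client;
}

// Pattern B: the static unique_ptr owns the object, so ~ToyClient runs
// during static destruction at program exit, possibly while other threads
// are still using the object.
ToyClient& OwnedClient() {
  static std::unique_ptr<ToyClient> client = std::make_unique<ToyClient>();
  return *client;
}
```

Both accessors return the same instance on every call while the program is running; they differ only in what happens at exit.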

@zhanyong-wan thanks for the feedback. Could you give an example of the race condition you mentioned and why it was not addressed until v2.8?

The style guide I mentioned noted: "When destructors are trivial, their execution is not subject to ordering at all (they are effectively not "run"); otherwise we are exposed to the risk of accessing objects after the end of their lifetime. Therefore, we only allow objects with static storage duration if they are trivially destructible. Fundamental types (like pointers and int) are trivially destructible, as are arrays of trivially destructible types."

For example, at program exit time there could be long-running threads accessing global variables. If a global variable is destructed, such access is undefined behavior.

As to why it wasn't addressed until v2.8, I don't know the history, but my guess is that we just noticed the potential race and decided to fix it.

In PR #9384, we introduced StatusOr<T> for error handling, which can be trivially destructible when T is trivially destructible. However, looking at PjrtComputationClient's implementation with its explicit destructor and member variables, it appears to not be trivially destructible. Could you shed some light on why we think PjrtComputationClient could be trivially destructible?

@rajkthakur, StatusOr<T> is not trivially destructible, regardless of whether T is trivially destructible. PjrtComputationClient is not trivially destructible and not meant to be. I don't understand what you mean by "we think PjrtComputationClient could be trivially destructible".
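The trivially-destructible distinction can be checked at compile time with standard type traits. The sketch below uses hypothetical stand-ins (`ToyClient`, `ToyStatusOr`) rather than the real torch_xla or Abseil types; `ToyStatusOr` is only assumed to resemble a status-or-value wrapper in that it always carries a non-trivial member.

```cpp
#include <memory>
#include <string>
#include <type_traits>

// Hypothetical class with a user-provided destructor.
struct ToyClient {
  ~ToyClient() {}
};

// Hypothetical, simplified status-or-value wrapper. It always holds a
// std::string, so its destructor is non-trivial for every T.
template <typename T>
struct ToyStatusOr {
  std::string message;
  T value;
};

// Raw pointers and arrays of fundamental types are trivially destructible.
static_assert(std::is_trivially_destructible<ToyClient*>::value,
              "raw pointers are trivially destructible");
static_assert(std::is_trivially_destructible<int[4]>::value,
              "arrays of int are trivially destructible");

// A user-provided destructor, an owning smart pointer, or a non-trivial
// member each make a type non-trivially destructible.
static_assert(!std::is_trivially_destructible<ToyClient>::value,
              "user-provided destructor is not trivial");
static_assert(!std::is_trivially_destructible<
                  std::unique_ptr<ToyClient>>::value,
              "unique_ptr is not trivially destructible");
static_assert(!std::is_trivially_destructible<ToyStatusOr<int>>::value,
              "wrapper with a std::string member is not trivial");
```

Under the style guide rule quoted above, only the first two kinds of type would be allowed as objects with static storage duration.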

Thanks for clarifying. It seems to be a bug that Neuron sometimes hangs if the clean-up is left to the OS. My suggestion would be to root-cause and fix that bug.

Re: the shutdown approach, I don't think we can count on no further access to client_ after the atexit hook is called. The whole point of Google's policy on global variable destruction is that there can be long-running threads after the exit hook is called. Think about the case where someone starts a computation in a long-running thread and then exits. The thread is never joined and thus may still access client_ after the program exit hook.

While we investigate why leaving cleanup to the OS leaves the Neuron backend in a bad state, do you have any thoughts on what would be the correct approach for implementing the Shutdown method?

We would have to leave client_ accessible after we have destroyed the actual xla::PjRtClient (since destruction ends up calling PJRT_Client_Destroy). One way I can think of doing so is to switch to a stub implementation of client_ at this point, so that long-running threads can still access client_ but would get some default behavior. Is that the right approach/pattern?

I think the best course of action is to fix the hang, as implementing Shutdown correctly adds significant complexity to the design.

That said, here's how Shutdown should work if done correctly: it should allow in-flight computation that needs the client to finish, and it should let new computation (if any) that wants to use the client fail to get the client. This means we'll likely need to use a shared_ptr to hold the client (so that in-flight computation can extend its lifespan).

As you can see, this is doable but not trivial. Hence my advice to avoid it.
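A minimal sketch of the shared_ptr-based Shutdown described above, under stated assumptions: `ToyClient` and `ClientHolder` are hypothetical names, and the holder itself is assumed to live in a leaked singleton so that it is never destroyed at exit. This is not the torch_xla design, only an illustration of the ownership idea.

```cpp
#include <memory>
#include <mutex>

// Hypothetical client; Run() stands in for a computation that needs it.
class ToyClient {
 public:
  int Run(int x) const { return x * 2; }
};

// Holds the client behind a mutex. Callers take a shared_ptr copy, which
// keeps the client alive for the duration of their in-flight work.
class ClientHolder {
 public:
  ClientHolder() : client_(std::make_shared<ToyClient>()) {}

  // Returns nullptr once Shutdown() has been called, so new work fails
  // to get the client instead of touching a destroyed object.
  std::shared_ptr<ToyClient> Acquire() {
    std::lock_guard<std::mutex> lock(mu_);
    return client_;
  }

  // Drops the holder's reference. The client is destroyed only after the
  // last in-flight user releases its own shared_ptr.
  void Shutdown() {
    std::lock_guard<std::mutex> lock(mu_);
    client_.reset();
  }

 private:
  std::mutex mu_;
  std::shared_ptr<ToyClient> client_;
};
```

Even in this toy form, every call site has to handle a null client after Shutdown, which hints at the extra complexity mentioned above.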

Maybe a shorter shutdown function could be something like:

Check that the client device is Neuron

Dynamic cast the inner PjRtClient into xla::PjRtCApiClient

Call PJRT_Client_Destroy

(@zhanyong-wan what do you think?)

Notes:

This could work if you keep track (inside the plugin implementation) of whether PJRT_Client_Destroy was already called for the given PJRT_Client, and error out if it was. Otherwise, we are going to get UB.

Since we need to interact with PjRtClient, this will probably need to be added as a new virtual function of ComputationClient

With all that said, I believe the best solution would be to figure out what exactly is causing the hang. On the other hand, it feels like not calling PJRT_Client_Destroy is a bug from the perspective of PJRT semantics (though I couldn't find this stated anywhere).
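The three steps and the double-destroy note above can be sketched roughly as follows. All types here (`ToyPjRtClient`, `ToyCApiClient`, `ToyCpuClient`) and the function `ShutdownIfNeuron` are hypothetical stand-ins that only mimic the shape of the real class hierarchy; no real XLA or PJRT calls are made.

```cpp
#include <string>

// Hypothetical polymorphic base, standing in for the inner client type.
struct ToyPjRtClient {
  virtual ~ToyPjRtClient() = default;
};

// Hypothetical C-API-backed client. DestroyOnce() stands in for the
// plugin-side destroy call and records that it has already happened.
struct ToyCApiClient : ToyPjRtClient {
  bool destroyed = false;
  // Returns true on the first call and false on any later call, so a
  // second destroy is reported instead of causing undefined behavior.
  bool DestroyOnce() {
    if (destroyed) return false;
    destroyed = true;
    return true;
  }
};

// Hypothetical non-C-API client, to show the cast failing.
struct ToyCpuClient : ToyPjRtClient {};

enum class ShutdownResult { kDestroyed, kNotApplicable, kAlreadyDestroyed };

ShutdownResult ShutdownIfNeuron(const std::string& device_type,
                                ToyPjRtClient* client) {
  // Step 1: only act for the Neuron device.
  if (device_type != "NEURON") return ShutdownResult::kNotApplicable;
  // Step 2: downcast to the C-API client; other client types are skipped.
  auto* c_api = dynamic_cast<ToyCApiClient*>(client);
  if (c_api == nullptr) return ShutdownResult::kNotApplicable;
  // Step 3: destroy, guarding against a repeated call.
  return c_api->DestroyOnce() ? ShutdownResult::kDestroyed
                              : ShutdownResult::kAlreadyDestroyed;
}
```

The sketch does not address threads that still hold the client after the destroy call, which is the concern raised earlier in the thread.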

@ysiraichi, let's avoid the hack and fix the hang. It adds significant complexity if different devices have different shutdown logic.

root added 2 commits October 7, 2025 01:02

adding static unique_ptr to manage lifecycle of PjrtClient

7dac2d7

lint

d77ff94

rajkthakur approved these changes Oct 7, 2025

jeffhataws requested review from bhavya01, ghpvnist, qihqi, ysiraichi and zhanyong-wan October 7, 2025 05:19

zhanyong-wan requested changes Oct 7, 2025
zhanyong-wan Oct 7, 2025

jeffhataws Oct 7, 2025

zhanyong-wan Oct 7, 2025

rajkthakur Oct 7, 2025

zhanyong-wan Oct 8, 2025

zhanyong-wan Oct 9, 2025

saarthak-aws Oct 9, 2025

zhanyong-wan Oct 10, 2025

ysiraichi Oct 13, 2025

zhanyong-wan Oct 13, 2025