
Replace xm.mark_step with torch_xla.sync() wherever possible #9070


Open
wants to merge 4 commits into master

Conversation

@ghpvnist (Collaborator) commented May 1, 2025

Fixes #8862.
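For context, the change this PR rolls out across examples and tests is mechanical: calls to xm.mark_step() become torch_xla.sync(). Below is a minimal sketch of the new spelling in a single-device training loop; the model and loop are illustrative, not taken from this repository, and assume a torch_xla build where torch_xla.device() and torch_xla.sync() are available.

```python
import torch
import torch_xla

device = torch_xla.device()  # default XLA device
model = torch.nn.Linear(8, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(3):
    data = torch.randn(4, 8, device=device)
    target = torch.randint(0, 2, (4,), device=device)
    loss = torch.nn.functional.cross_entropy(model(data), target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # Previously: xm.mark_step()
    # New spelling: cut the lazy trace here and launch the pending graph.
    torch_xla.sync()
```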

@tengyifei (Collaborator) left a comment

LGTM modulo one nit on the comment.

I think we have some tests that look for strings (yuck!) and there's a subtle string mismatch.

"""Launches all pending graph operations.

Args:
wait (bool): whether to block the current process until the execution finished.

reset_scope (bool): whether to reset the tracing scope of lazy tensor.
Collaborator

Could you elaborate on what this means? Reading the code, I wasn't sure what "resetting" means. Does it mean that any tracing scope set by the user for the profiler is invalidated? Maybe you could run some simple tests to verify its behavior?
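As context for the question above, here is a small sketch of how the two flags in the quoted docstring would be passed. It assumes the signature matches the docstring and makes no claim about what reset_scope actually does, since that is exactly what is being asked.

```python
import torch
import torch_xla

device = torch_xla.device()
t = torch.ones(2, 2, device=device) + 1

# Default call: launch all pending graph operations without blocking.
torch_xla.sync()

# Per the docstring, wait=True should block the caller until the launched
# execution has finished on the device.
t = t * 3
torch_xla.sync(wait=True)

# reset_scope controls whether the lazy tensor tracing scope is reset after
# the sync; its observable effect (e.g. on profiler scope names) is the open
# question in this thread.
```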

@yaoshiang (Collaborator)

This comment shouldn't hold up this PR, but I want to register that we have an opportunity to rename this again in the future, to make it easier for users to understand what it does without reading documentation. The problem with sync is that it is too close to torch.cuda.synchronize, which has a different meaning, so torch_xla.sync is going to confuse the majority of our users.

What I think we should do is have:

- torch_xla.synchronize, which ONLY waits for all results from the underlying devices to return, similar to torch.cuda.synchronize.
- torch_xla.compile(), which will match the norms of torch developers, who expect a possible graph break at the end of a decorated function.
- torch_xla._barrier(), which will become a documented but not recommended way to indicate that LazyTensor should break the graph.

There are unresolved design questions:
What happens if a synchronize is inside a compile()? We'd have to look at how this works in CUDA and try to keep the semantics parallel.

Next: torch.compile vs torch_xla.compile().

torch.cuda.synchronize(device=None)
Wait for all kernels in all streams on a CUDA device to complete.

Parameters:
device (torch.device or int, optional) – device for which to synchronize. It uses the current device, given by current_device(), if device is None (default).
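To make the proposed split concrete, here is a hedged sketch of what user code could look like under it. torch_xla.synchronize and torch_xla._barrier are hypothetical names taken from the comment, not existing APIs; torch_xla.compile does exist today, and the decorator usage below is sketched from its documentation, though its graph-break semantics are part of the open questions above.

```python
import torch
import torch_xla

device = torch_xla.device()
model = torch.nn.Linear(8, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Existing API: the decorated function is traced as one unit, with a graph
# break expected at the end of the function.
@torch_xla.compile
def train_step(data, target):
    loss = torch.nn.functional.cross_entropy(model(data), target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss

# Hypothetical APIs from the proposal (they do not exist today):
#   torch_xla.synchronize()  # only wait for outstanding device work,
#                            # mirroring torch.cuda.synchronize
#   torch_xla._barrier()     # documented but discouraged explicit graph break
```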

@bhavya01 (Collaborator) commented May 2, 2025

(Quoting @yaoshiang's comment above in full.)

Agree that torch_xla.sync() is not the best naming. Should we explicitly try to educate users by using something like compile_all_lazy_graphs()? torch_xla.barrier is also a good option. My only concern is that a barrier is generally used in the context of threads and processes, and we already have APIs like apply_backward_optimization_barrier. I just want to avoid the term becoming too overloaded.

@tengyifei (Collaborator)

IMO it's well worth having a separate discussion about renaming torch_xla.sync(), but outside of this PR. I agree that sync is also confusing.

Successfully merging this pull request may close these issues.

Replace xm.mark_step with torch_xla.sync() in examples and tests
4 participants