
Conversation

@iwknow
Collaborator

@iwknow iwknow commented Oct 20, 2025

Add helper functions getDefaultXLAGenerator and createXLAGenerator to the XLA random number generator.

These helper functions will be used with the XLA hook later.

Refer to #9159
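
For context, here is a minimal sketch of what these helpers could look like, modeled on the existing CUDA generator helpers in ATen. The namespace, includes, and default arguments below are my assumptions for illustration, not necessarily what this PR implements:

#include <ATen/core/Generator.h>
#include <c10/core/Device.h>

// Hypothetical declarations only; the actual signatures in the PR may differ.
namespace torch_xla {

// Returns the process-wide default generator for the given XLA device.
const at::Generator& getDefaultXLAGenerator(c10::DeviceIndex device_index = -1);

// Creates a fresh, independently seeded generator for the given XLA device.
at::Generator createXLAGenerator(c10::DeviceIndex device_index = -1);

}  // namespace torch_xla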

@iwknow
Collaborator Author

iwknow commented Oct 22, 2025

@qihqi do you know why I always get a build timeout? It builds successfully locally on my machine, and I can't find any clue about the cause in the build log. Please take a look.

@iwknow
Collaborator Author

iwknow commented Oct 24, 2025

@ysiraichi @qihqi can you please take a look? Thanks!

@ysiraichi
Collaborator

do you know why I always get a build timeout?

This might be because your PR is not from a branch on this PyTorch/XLA repository.
That causes the CI not to use the remote cache.
I'm currently working on using the GitHub cache to mitigate that: #9659.

Collaborator

@ysiraichi ysiraichi left a comment

Thank you for the PR. I think it looks good overall.
Could you add a few C++ tests and check that everything is working?
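
For concreteness, a rough sketch of the kind of C++ test being asked for, written against the hypothetical helper declarations sketched in the PR description above (the actual tests added to this PR may look different):

#include <gtest/gtest.h>

#include <ATen/core/Generator.h>
// #include "torch_xla/csrc/xla_generator.h"  // hypothetical header location

TEST(XLAGeneratorSketch, DefaultGeneratorIsStablePerDevice) {
  // Asking twice for the default generator of the same device should return
  // the same underlying generator object.
  const at::Generator& g1 = torch_xla::getDefaultXLAGenerator(0);
  const at::Generator& g2 = torch_xla::getDefaultXLAGenerator(0);
  EXPECT_TRUE(g1 == g2);
}

TEST(XLAGeneratorSketch, CreateGeneratorReturnsFreshInstances) {
  // createXLAGenerator should hand back independent generator objects.
  at::Generator g1 = torch_xla::createXLAGenerator(0);
  at::Generator g2 = torch_xla::createXLAGenerator(0);
  EXPECT_FALSE(g1 == g2);
}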

@ysiraichi ysiraichi mentioned this pull request Oct 28, 2025
@iwknow iwknow requested a review from ysiraichi October 29, 2025 17:51
@iwknow
Collaborator Author

iwknow commented Oct 30, 2025

Strangely, I am no longer able to build //test/cpp:test_xla_generator. The command I use is bazel build //test/cpp:test_xla_generator --experimental_ui_max_stdouterr_bytes=-1 (the last flag lets it print without a length limit). The error I get is:

bazel-out/k8-opt/bin/_solib_k8/_U_A_Atorch_S_S_Clibc10___Ubuild_Slib/libc10.so: error: undefined reference to 'log', version 'GLIBC_2.29'
/usr/local/lib/libpython3.10.so: error: undefined reference to 'sem_clockwait', version 'GLIBC_2.30'
bazel-out/k8-opt/bin/_solib_k8/_U_A_Atorch_S_S_Clibc10___Ubuild_Slib/libc10.so: error: undefined reference to 'pthread_cond_clockwait', version 'GLIBC_2.30'
collect2: error: ld returned 1 exit status
Target //test/cpp:test_xla_generator failed to build

@ysiraichi do you have any clue about the issue? I was previously able to build and run //test/cpp:test_xla_generator.

@ysiraichi
Collaborator

Not sure what happened there.
It looks like you have an old glibc version in your system.
Are you using the docker image?

@iwknow
Collaborator Author

iwknow commented Oct 30, 2025

Not sure what happened there. It looks like you have an old glibc version in your system. Are you using the docker image?

I am using the tpu-contributor dev container. I get the following when I run ldd --version:

ldd (Debian GLIBC 2.31-13+deb11u11) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

Could this be related to the build cache? Or is it incompatible with my PyTorch? Should I sync my local PyTorch version to head? Also, I don't have PyTorch/XLA installed locally as a Python package; I'm not sure if that is related.

@ysiraichi
Collaborator

Try recompiling everything from scratch: PyTorch and PyTorch/XLA (clean the cache).

ysiraichi added a commit that referenced this pull request Oct 31, 2025
PRs from external repositories are timing out on the `_build_torch_xla.yml`
workflow. That's because, in those cases, [the remote cache is
disabled][1], and [the fixed 45 minutes][2] is no longer enough.
See, for example, PR #9682, which fails due to this timeout.

Here's my plan to address this issue:

- Bump the timeout by 5 minutes (this PR)
- Create a disk cache using GitHub cache actions to reduce build time on
PRs from external repositories (see [#9659][3] for more information)

This PR will go through the following steps:

- [x] Reproduce the CI build timeout 
- [x] Bump the timeout by 5 minutes

[1]:
https://github.com/pytorch/xla/blob/df6798dfb931ce7c7fe5bed2447cd1092a5981af/.github/workflows/_build_torch_xla.yml#L36
[2]:
https://github.com/pytorch/xla/blob/df6798dfb931ce7c7fe5bed2447cd1092a5981af/.github/workflows/build_and_test.yml#L44
[3]: #9659
@iwknow
Collaborator Author

iwknow commented Oct 31, 2025


For those who face a similar issue: the root cause was a mismatch between the system-installed and conda-managed toolchain versions. Make sure you use either system gcc + system Python + system glibc, or conda gcc + conda Python + conda sysroot. I was using a mix of them, which caused the undefined reference errors above.

@iwknow
Collaborator Author

iwknow commented Nov 2, 2025

All tests pass and the feedback has been addressed. Please take a look, @ysiraichi.

Collaborator

@ysiraichi ysiraichi left a comment

Thank you for the PR.
Overall it looks good. I left a few minor comments.

Comment on lines 23 to 36
// Ensure PJRT is configured to a CPU backend for tests that touch the PJRT
// runtime.
static void EnsurePjrtCpuBackend() {
  const char* pjrt = std::getenv("PJRT_DEVICE");
  if (pjrt == nullptr || pjrt[0] == '\0') {
    // Use CPU backend with a single device by default.
    setenv("PJRT_DEVICE", "CPU", 1);
  }
  const char* cpu_devices = std::getenv("CPU_NUM_DEVICES");
  if (cpu_devices == nullptr || cpu_devices[0] == '\0') {
    setenv("CPU_NUM_DEVICES", "1", 0);
  }
}

Collaborator

I don't think we need this.
This is already done when running the tests.

Collaborator Author

I think it is actually needed. I run the test with bazel test //test/cpp:test_xla_generator, and it fails without this function. I believe PJRT will be set when you run the test suite through a script; however, it doesn't hurt to have this function if you just want to run this test on its own. I'll keep it as is. Please reopen this comment if you have a different opinion. Thanks.

Collaborator Author

Also, I updated the function to allow overriding the environment variables, which is particularly useful in this test suite.

Collaborator

I still think it's better to error if the environment variables are not set.
Plus, that's how every other C++ test works today.

Collaborator Author

I want to clarify my understanding: you want the environment variables to be set in test/cpp/run_tests.sh instead of as part of the test environment setup process? These two variables are required for this test suite, and the latest implementation (not this version) allows flexible overrides of these two values for more specific test scenarios (for example, some tests with CPU_NUM_DEVICES = 1 and some tests with CPU_NUM_DEVICES = 2). If I understand correctly, setting them in test/cpp/run_tests.sh pretty much fixes one value for all tests, which is less flexible.

Please let me know if my understanding is correct and what you want; I will make the change accordingly. Thank you very much for reviewing!

Collaborator

Yes. That's the correct understanding. Actually, having this kind of flexibility would be nice if done in a centralized way for all C++ tests (instead of having one test that behaves differently).
That said, I don't think it's something worth doing right now.
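
For reference, a hypothetical sketch of what that centralized setup could look like: a global GoogleTest environment registered in a shared test main, so PJRT defaults are applied once for every C++ test binary. The names here are illustrative only, not existing code in this repository.

#include <cstdlib>

#include <gtest/gtest.h>

// Fills in PJRT defaults once, before any test runs, while still respecting
// values already exported by the test runner script.
class PjrtEnvDefaults : public ::testing::Environment {
 public:
  void SetUp() override {
    setenv("PJRT_DEVICE", "CPU", /*overwrite=*/0);
    setenv("CPU_NUM_DEVICES", "1", /*overwrite=*/0);
  }
};

int main(int argc, char** argv) {
  ::testing::InitGoogleTest(&argc, argv);
  // GoogleTest takes ownership of the environment object.
  ::testing::AddGlobalTestEnvironment(new PjrtEnvDefaults());
  return RUN_ALL_TESTS();
}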

@iwknow iwknow requested a review from ysiraichi November 4, 2025 04:59
Collaborator

@ysiraichi ysiraichi left a comment

Thank you for the changes.
The Status-returning functions look great now.
I still have a few comments, though.

}

TEST_F(XLAGeneratorTest, CreateXLAGenerator) {
  EnsurePjrtCpuBackend("CPU", "2");
Collaborator

I don't think this works the way you think it does.
The PjRt client is initialized only once per process. So, changing PJRT_DEVICE and CPU_NUM_DEVICES at, say, the 2nd test won't have any effect.
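
To illustrate the point with a self-contained sketch (this is not the real PjRt code, just an analogy): a client guarded by a function-local static is constructed exactly once per process, so environment changes made by later tests are silently ignored.

#include <cstdlib>
#include <iostream>
#include <string>

struct FakeClient {
  std::string device;
};

// Built on the first call only; every later call reuses the same instance.
static FakeClient& GetClient() {
  const char* env = std::getenv("PJRT_DEVICE");
  static FakeClient client{env ? env : "unset"};
  return client;
}

int main() {
  setenv("PJRT_DEVICE", "CPU", 1);
  std::cout << GetClient().device << "\n";  // prints "CPU"

  setenv("PJRT_DEVICE", "TPU", 1);          // too late: client already exists
  std::cout << GetClient().device << "\n";  // still prints "CPU"
}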

Collaborator Author

You are right. EnsurePjrtCpuBackend should be called before the client is initialized; that is the reason I put it in SetUpTestCase for all the test cases.

Collaborator

The problem is that the client is initialized once per process, and all the test cases run in the same process.
This will not work!

@iwknow iwknow requested a review from ysiraichi November 6, 2025 17:55
Collaborator

@ysiraichi ysiraichi left a comment

The PR is looking great. Thank you for bearing with me!
We are almost there.

* Warning: this function must only be called once!
*/
static absl::Status InitXLAGenVector() {
  static absl::Status init_status = []() {
Collaborator

Let's leak it (see this example).
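
For illustration, one generic way the "leak it" suggestion could look (a sketch only, not the code in this PR; InitImpl stands in for the real one-time initialization):

#include <iostream>

#include "absl/status/status.h"

// Placeholder for the real initialization logic.
static absl::Status InitImpl() { return absl::OkStatus(); }

// The status is heap-allocated on first use and intentionally never freed,
// so no static destructor has to run at process shutdown.
static const absl::Status& InitXLAGenVector() {
  static const absl::Status* init_status = new absl::Status(InitImpl());
  return *init_status;
}

int main() {
  std::cout << InitXLAGenVector().ok() << "\n";  // prints 1
}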

Comment on lines +102 to +103
XLA_RETURN_IF_ERROR(InitXLAGenVector(),
"Failed to initialize XLA generators");
Collaborator

Is this actually needed for CreateXLAGenerator?

Comment on lines +13 to +44
// Ensure PJRT is configured to a CPU backend for tests that touch the PJRT
// runtime. Optionally allow overriding the environment values by passing
// `pjrt_device` and/or `cpu_num_devices`.
static void EnsurePjrtCpuBackend(const char* pjrt_device = nullptr,
                                 const char* cpu_num_devices = nullptr) {
  // PJRT_DEVICE: override if provided, otherwise set default if not present.
  if (pjrt_device != nullptr && pjrt_device[0] != '\0') {
    // Force override of any existing value.
    setenv("PJRT_DEVICE", pjrt_device, 1);
  } else {
    const char* pjrt = std::getenv("PJRT_DEVICE");
    if (pjrt == nullptr || pjrt[0] == '\0') {
      // Use CPU backend with a single device by default.
      setenv("PJRT_DEVICE", "CPU", 1);
    }
  }

  // CPU_NUM_DEVICES: override if provided, otherwise set default if not
  // present.
  if (cpu_num_devices != nullptr && cpu_num_devices[0] != '\0') {
    // Force override of any existing value.
    setenv("CPU_NUM_DEVICES", cpu_num_devices, 1);
  } else {
    const char* cpu_devices = std::getenv("CPU_NUM_DEVICES");
    if (cpu_devices == nullptr || cpu_devices[0] == '\0') {
      // Default to a single CPU device. Preserve existing behavior of not
      // overwriting if already present (use overwrite=0 to match previous
      // semantics).
      setenv("CPU_NUM_DEVICES", "1", 0);
    }
  }
}

Collaborator

As discussed above, let's not do this.
Let's leave the environment variable initialization to the test runner script.

@ysiraichi
Collaborator

Could you leave the review comments unresolved? It's easier to see what's been addressed.
I will resolve them once I check on them.
