
Conversation

@iwknow
Collaborator

@iwknow iwknow commented Oct 20, 2025

Add helper functions getDefaultXLAGenerator and createXLAGenerator to the XLA random number generator.

These helper functions will be used with the XLA hook later.

Refer to #9159
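
For context, here is a minimal sketch of what these helpers could look like, modeled on the existing CUDA generator helpers in ATen. The namespace, includes, and default arguments below are my assumptions for illustration, not necessarily what this PR implements:

#include <ATen/core/Generator.h>
#include <c10/core/Device.h>

// Hypothetical declarations only; the actual signatures in the PR may differ.
namespace torch_xla {

// Returns the process-wide default generator for the given XLA device.
const at::Generator& getDefaultXLAGenerator(c10::DeviceIndex device_index = -1);

// Creates a fresh, independently seeded generator for the given XLA device.
at::Generator createXLAGenerator(c10::DeviceIndex device_index = -1);

}  // namespace torch_xla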

@iwknow
Collaborator Author

iwknow commented Oct 22, 2025

@qihqi do you know why I always get a build timeout? It builds successfully locally on my machine, and I can't find any clue about the cause in the build log. Please take a look.

@iwknow
Collaborator Author

iwknow commented Oct 24, 2025

@ysiraichi @qihqi can you please take a look? Thanks!

@ysiraichi
Collaborator

do you know why I always get a build timeout?

This might be because your PR is not from a branch on this PyTorch/XLA repository.
That causes the CI not to use the remote cache.
I'm currently working on using the GitHub cache to mitigate that: #9659.

Collaborator

@ysiraichi ysiraichi left a comment

Thank you for the PR. I think it looks good overall.
Could you add a few C++ tests and check that everything is working?
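
For concreteness, a rough sketch of the kind of C++ test being asked for, written against the hypothetical helper declarations sketched in the PR description above (the actual tests added to this PR may look different):

#include <gtest/gtest.h>

#include <ATen/core/Generator.h>
// #include "torch_xla/csrc/xla_generator.h"  // hypothetical header location

TEST(XLAGeneratorSketch, DefaultGeneratorIsStablePerDevice) {
  // Asking twice for the default generator of the same device should return
  // the same underlying generator object.
  const at::Generator& g1 = torch_xla::getDefaultXLAGenerator(0);
  const at::Generator& g2 = torch_xla::getDefaultXLAGenerator(0);
  EXPECT_TRUE(g1 == g2);
}

TEST(XLAGeneratorSketch, CreateGeneratorReturnsFreshInstances) {
  // createXLAGenerator should hand back independent generator objects.
  at::Generator g1 = torch_xla::createXLAGenerator(0);
  at::Generator g2 = torch_xla::createXLAGenerator(0);
  EXPECT_FALSE(g1 == g2);
}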

@ysiraichi ysiraichi mentioned this pull request Oct 28, 2025
@iwknow iwknow requested a review from ysiraichi October 29, 2025 17:51
@iwknow
Collaborator Author

iwknow commented Oct 30, 2025

Strangely, I am no longer able to build //test/cpp:test_xla_generator. The command I use is bazel build //test/cpp:test_xla_generator --experimental_ui_max_stdouterr_bytes=-1 (the last flag lets it print without a length limit). The error I get is:

bazel-out/k8-opt/bin/_solib_k8/_U_A_Atorch_S_S_Clibc10___Ubuild_Slib/libc10.so: error: undefined reference to 'log', version 'GLIBC_2.29'
/usr/local/lib/libpython3.10.so: error: undefined reference to 'sem_clockwait', version 'GLIBC_2.30'
bazel-out/k8-opt/bin/_solib_k8/_U_A_Atorch_S_S_Clibc10___Ubuild_Slib/libc10.so: error: undefined reference to 'pthread_cond_clockwait', version 'GLIBC_2.30'
collect2: error: ld returned 1 exit status
Target //test/cpp:test_xla_generator failed to build

@ysiraichi do you have any clue about the issue? I was previously able to build and run //test/cpp:test_xla_generator.

@ysiraichi
Collaborator

Not sure what happened there.
It looks like you have an old glibc version in your system.
Are you using the docker image?

@iwknow
Collaborator Author

iwknow commented Oct 30, 2025

Not sure what happened there. It looks like you have an old glibc version in your system. Are you using the docker image?

I am using the tpu-contributor dev container. I get the following when I run ldd --version:

ldd (Debian GLIBC 2.31-13+deb11u11) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

Could this be related to the build cache? Or is it incompatible with my PyTorch? Should I sync my local PyTorch version to head? Also, I don't have PyTorch/XLA installed locally as a Python package; I'm not sure if that is related.

@ysiraichi
Collaborator

Try recompiling everything from scratch: PyTorch and PyTorch/XLA (clean the cache).

ysiraichi added a commit that referenced this pull request Oct 31, 2025
PRs from external repositories are timing out on the `_build_torch_xla.yml`
workflow. That's because, in those cases, [the remote cache is
disabled][1], and [the fixed 45 minutes][2] is no longer enough.
See, for example, PR #9682, which fails due to this timeout.

Here's my plan to address this issue:

- Bump the timeout by 5 minutes (this PR)
- Create a disk cache using GitHub cache actions to reduce build time on
PRs from external repositories (see [#9659][3] for more information)

This PR will go through the following steps:

- [x] Reproduce the CI build timeout 
- [x] Bump the timeout by 5 minutes

[1]:
https://github.com/pytorch/xla/blob/df6798dfb931ce7c7fe5bed2447cd1092a5981af/.github/workflows/_build_torch_xla.yml#L36
[2]:
https://github.com/pytorch/xla/blob/df6798dfb931ce7c7fe5bed2447cd1092a5981af/.github/workflows/build_and_test.yml#L44
[3]: #9659
@iwknow
Collaborator Author

iwknow commented Oct 31, 2025


For those who face a similar issue: the root cause was a mismatch between the system-installed and conda-managed toolchain versions. Make sure you use either system gcc + system Python + system glibc, or conda gcc + conda Python + conda sysroot. I was using a mix of them, which caused the undefined reference errors above.

@iwknow
Collaborator Author

iwknow commented Nov 2, 2025

All tests pass and the feedback has been addressed. Please take a look, @ysiraichi.

Collaborator

@ysiraichi ysiraichi left a comment

Thank you for the PR.
Overall it looks good. I left a few minor comments.

Comment on lines 23 to 36
// Ensure PJRT is configured to a CPU backend for tests that touch the PJRT
// runtime.
static void EnsurePjrtCpuBackend() {
  const char* pjrt = std::getenv("PJRT_DEVICE");
  if (pjrt == nullptr || pjrt[0] == '\0') {
    // Use CPU backend with a single device by default.
    setenv("PJRT_DEVICE", "CPU", 1);
  }
  const char* cpu_devices = std::getenv("CPU_NUM_DEVICES");
  if (cpu_devices == nullptr || cpu_devices[0] == '\0') {
    setenv("CPU_NUM_DEVICES", "1", 0);
  }
}

Collaborator

I don't think we need this.
This is already done when running the tests.

Collaborator Author

I think it is actually needed. I run the test with bazel test //test/cpp:test_xla_generator, and it fails without this function. I believe PJRT will be set when you run the test suite through a script; however, it doesn't hurt to have this function if you just want to run this test on its own. I'll keep it as is. Please reopen this comment if you have a different opinion. Thanks.

Collaborator Author

Also, I updated the function to allow overriding the environment variables, which is particularly useful in this test suite.

Collaborator

I still think it's better to error if the environment variables are not set.
Plus, that's how every other C++ test works today.

Collaborator Author

I want to clarify my understanding: you want the environment variables to be set in test/cpp/run_tests.sh instead of as part of the test environment setup process? These two variables are required for this test suite, and the latest implementation (not this version) allows flexible overrides of these two values for more specific test scenarios (for example, some tests with CPU_NUM_DEVICES = 1 and some tests with CPU_NUM_DEVICES = 2). If I understand correctly, setting them in test/cpp/run_tests.sh pretty much fixes one value for all tests, which is less flexible.

Please let me know if my understanding is correct and what you want; I will make the change accordingly. Thank you very much for reviewing!

Collaborator

Yes. That's the correct understanding. Actually, having this kind of flexibility would be nice if done in a centralized way for all C++ tests (instead of having one test that behaves differently).
That said, I don't think it's something worth doing right now.
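
For reference, a hypothetical sketch of what that centralized setup could look like: a global GoogleTest environment registered in a shared test main, so PJRT defaults are applied once for every C++ test binary. The names here are illustrative only, not existing code in this repository.

#include <cstdlib>

#include <gtest/gtest.h>

// Fills in PJRT defaults once, before any test runs, while still respecting
// values already exported by the test runner script.
class PjrtEnvDefaults : public ::testing::Environment {
 public:
  void SetUp() override {
    setenv("PJRT_DEVICE", "CPU", /*overwrite=*/0);
    setenv("CPU_NUM_DEVICES", "1", /*overwrite=*/0);
  }
};

int main(int argc, char** argv) {
  ::testing::InitGoogleTest(&argc, argv);
  // GoogleTest takes ownership of the environment object.
  ::testing::AddGlobalTestEnvironment(new PjrtEnvDefaults());
  return RUN_ALL_TESTS();
}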

@iwknow iwknow requested a review from ysiraichi November 4, 2025 04:59
Collaborator

@ysiraichi ysiraichi left a comment

Thank you for the changes.
The Status-returning functions look great now.
I still have a few comments, though.

}

TEST_F(XLAGeneratorTest, CreateXLAGenerator) {
  EnsurePjrtCpuBackend("CPU", "2");
Collaborator

I don't think this works the way you think it does.
The PjRt client is initialized only once per process. So, changing PJRT_DEVICE and CPU_NUM_DEVICES at, say, the 2nd test won't have any effect.
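
To illustrate the point with a self-contained sketch (this is not the real PjRt code, just an analogy): a client guarded by a function-local static is constructed exactly once per process, so environment changes made by later tests are silently ignored.

#include <cstdlib>
#include <iostream>
#include <string>

struct FakeClient {
  std::string device;
};

// Built on the first call only; every later call reuses the same instance.
static FakeClient& GetClient() {
  const char* env = std::getenv("PJRT_DEVICE");
  static FakeClient client{env ? env : "unset"};
  return client;
}

int main() {
  setenv("PJRT_DEVICE", "CPU", 1);
  std::cout << GetClient().device << "\n";  // prints "CPU"

  setenv("PJRT_DEVICE", "TPU", 1);          // too late: client already exists
  std::cout << GetClient().device << "\n";  // still prints "CPU"
}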

Collaborator Author

You are right. EnsurePjrtCpuBackend should be called before the client is initialized; that is the reason I put it in SetUpTestCase for all the test cases.

Collaborator

The problem is that the client is initialized once per process, and all the test cases run in the same process.
This will not work!

@iwknow iwknow requested a review from ysiraichi November 6, 2025 17:55
Collaborator

@ysiraichi ysiraichi left a comment

The PR is looking great. Thank you for bearing with me!
We are almost there.

* Warning: this function must only be called once!
*/
static absl::Status InitXLAGenVector() {
  static absl::Status init_status = []() {
Collaborator

Let's leak it (see this example).
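
For illustration, one generic way the "leak it" suggestion could look (a sketch only, not the code in this PR; InitImpl stands in for the real one-time initialization):

#include <iostream>

#include "absl/status/status.h"

// Placeholder for the real initialization logic.
static absl::Status InitImpl() { return absl::OkStatus(); }

// The status is heap-allocated on first use and intentionally never freed,
// so no static destructor has to run at process shutdown.
static const absl::Status& InitXLAGenVector() {
  static const absl::Status* init_status = new absl::Status(InitImpl());
  return *init_status;
}

int main() {
  std::cout << InitXLAGenVector().ok() << "\n";  // prints 1
}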

Comment on lines +102 to +103
XLA_RETURN_IF_ERROR(InitXLAGenVector(),
"Failed to initialize XLA generators");
Collaborator

Is this actually needed for CreateXLAGenerator?

Comment on lines +13 to +44
// Ensure PJRT is configured to a CPU backend for tests that touch the PJRT
// runtime. Optionally allow overriding the environment values by passing
// `pjrt_device` and/or `cpu_num_devices`.
static void EnsurePjrtCpuBackend(const char* pjrt_device = nullptr,
                                 const char* cpu_num_devices = nullptr) {
  // PJRT_DEVICE: override if provided, otherwise set default if not present.
  if (pjrt_device != nullptr && pjrt_device[0] != '\0') {
    // Force override of any existing value.
    setenv("PJRT_DEVICE", pjrt_device, 1);
  } else {
    const char* pjrt = std::getenv("PJRT_DEVICE");
    if (pjrt == nullptr || pjrt[0] == '\0') {
      // Use CPU backend with a single device by default.
      setenv("PJRT_DEVICE", "CPU", 1);
    }
  }

  // CPU_NUM_DEVICES: override if provided, otherwise set default if not
  // present.
  if (cpu_num_devices != nullptr && cpu_num_devices[0] != '\0') {
    // Force override of any existing value.
    setenv("CPU_NUM_DEVICES", cpu_num_devices, 1);
  } else {
    const char* cpu_devices = std::getenv("CPU_NUM_DEVICES");
    if (cpu_devices == nullptr || cpu_devices[0] == '\0') {
      // Default to a single CPU device. Preserve existing behavior of not
      // overwriting if already present (use overwrite=0 to match previous
      // semantics).
      setenv("CPU_NUM_DEVICES", "1", 0);
    }
  }
}

Collaborator

As discussed above, let's not do this.
Let's leave the environment variable initialization to the test runner script.

@ysiraichi
Collaborator

Could you leave the review comments unresolved? It's easier to see what's been addressed.
I will resolve them once I check on them.
