-
Notifications
You must be signed in to change notification settings - Fork 559
Add helper functions getDefaultXLAGenerator and createXLAGenerator to XLA random number generator #9682
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…` to XLA random number generator
|
@qihqi do you know why do i always get build timeout? it builds successfully locally on my machine. i am not able to find any clue of the cause from the build log. please take a look |
|
@ysiraichi @qihqi can you please take a look. thanks! |
This might be because your PR is not from a branch on this PyTorch/XLA repository. |
ysiraichi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR. I think that it looks good overall.
Could you add a few C++ tests and check everything is working?
|
strangely, i am no longer able to build the @ysiraichi do you have any clue about the issue? i was previously able to build and run |
|
Not sure what happened there. |
i am using the tpu-contributor dev container. I got the following when i run could this be related to build cache? or it is incompatible with my pytorch? should i sync my local pytorch version to the head? Also, i don't have local pytorch-xla installed as a python package. i am not sure if this is related. |
|
Try recompiling everything from scratch: PyTorch and PyTorch/XLA (clean the cache). |
PRs from external repositories are timeouting on `_build_torch_xla.yml` workflow. That's because in those cases, [the remote cache is disabled][1]. In such cases, [the fixed 45 minutes][2] is not enough anymore. See, for example, PR #9682 that fails due to this timeout. Here's my plan to address this issue: - Bump the timeout by 5 minutes (this PR) - Create a disk-cache using GitHub cache actions for reducing build time on PRs from external repositories (see [#9659][3] for more information) This PR will go through the following steps: - [x] Reproduce the CI build timeout - [x] Bump the timeout by 5 minutes [1]: https://github.com/pytorch/xla/blob/df6798dfb931ce7c7fe5bed2447cd1092a5981af/.github/workflows/_build_torch_xla.yml#L36 [2]: https://github.com/pytorch/xla/blob/df6798dfb931ce7c7fe5bed2447cd1092a5981af/.github/workflows/build_and_test.yml#L44 [3]: #9659
for those who face the similar issue, the root cause is the mismatch of the toolchain versions between the system-installed and conda-managed. make sure you use either system gcc+system python+system glibc, or Conda gcc+Conda python+Conda sysroot. For me, i use mix of them and that causes this problem. |
|
all tests passed and feedbacks addressed. please take a look @ysiraichi |
ysiraichi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR.
Overall looks good. I left a few minor comments there.
test/cpp/test_xla_generator.cpp
Outdated
| // Ensure PJRT is configured to a CPU backend for tests that touch the PJRT | ||
| // runtime. | ||
| static void EnsurePjrtCpuBackend() { | ||
| const char* pjrt = std::getenv("PJRT_DEVICE"); | ||
| if (pjrt == nullptr || pjrt[0] == '\0') { | ||
| // Use CPU backend with a single device by default. | ||
| setenv("PJRT_DEVICE", "CPU", 1); | ||
| } | ||
| const char* cpu_devices = std::getenv("CPU_NUM_DEVICES"); | ||
| if (cpu_devices == nullptr || cpu_devices[0] == '\0') { | ||
| setenv("CPU_NUM_DEVICES", "1", 0); | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this.
This is already done when running the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it is actually needed. i run the test using bazel test //test/cpp:test_xla_generator. it fails without this function. I believe pjrt will be set when you run the test suite using some script. however, it doesn't harm to have this function if you just want to run this test along. i keep it as is. please reopen this comment if you have a different opinion. thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, i updated the function to allow override of the environment variables. which is particularly useful in this test suite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think it's better to error if the environment variables are not set.
Plus, that's how every other C++ test works, today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i want to clarify my understanding. you want to set the environment variables in test/cpp/run_tests.sh instead of setting it as a part of the test environment setup process? These two variables are required for this test suite and the latest implementation (not this version) allows flexible override on these two values for more specific test scenarios (for example, some test with CPU_NUM_DEVICES = 1 and some tests with CPU_NUM_DEVICES = 2). if i understand correctly, setting in test/cpp/run_tests.sh is pretty much setting a fixed value for all tests which is less flexible and stable.
please let me know if my understanding is correct and what you want. i will make the change accordingly. thank you very much for reviewing!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. That's the correct understanding. Actually, having this kind of flexibility would be nice if done in a centralized way for all C++ tests (instead of having one test that behaves differently).
That said, I don't think it's something worth worth doing right now.
ysiraichi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the changes.
The Status returning functions look great, now.
I still have a few comments, though.
test/cpp/test_xla_generator.cpp
Outdated
| // Ensure PJRT is configured to a CPU backend for tests that touch the PJRT | ||
| // runtime. | ||
| static void EnsurePjrtCpuBackend() { | ||
| const char* pjrt = std::getenv("PJRT_DEVICE"); | ||
| if (pjrt == nullptr || pjrt[0] == '\0') { | ||
| // Use CPU backend with a single device by default. | ||
| setenv("PJRT_DEVICE", "CPU", 1); | ||
| } | ||
| const char* cpu_devices = std::getenv("CPU_NUM_DEVICES"); | ||
| if (cpu_devices == nullptr || cpu_devices[0] == '\0') { | ||
| setenv("CPU_NUM_DEVICES", "1", 0); | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. That's the correct understanding. Actually, having this kind of flexibility would be nice if done in a centralized way for all C++ tests (instead of having one test that behaves differently).
That said, I don't think it's something worth worth doing right now.
test/cpp/test_xla_generator.cpp
Outdated
| } | ||
|
|
||
| TEST_F(XLAGeneratorTest, CreateXLAGenerator) { | ||
| EnsurePjrtCpuBackend("CPU", "2"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this works the way you think it does.
The PjRt client is initialized only once per process. So, changing PJRT_DEVICE and CPU_NUM_DEVICES at, say, the 2nd test won't have any effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right. EnsurePjrtCpuBackend should be called before the client is initialized, that is the reason i put it in SetUpTestCase for all the test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Problem is: the client is initialized once per process. All the test cases run in the same process.
This will not work!
ysiraichi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR is looking great. Thank you for bearing with me!
We are almost there.
| * Warning: this function must only be called once! | ||
| */ | ||
| static absl::Status InitXLAGenVector() { | ||
| static absl::Status init_status = []() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leak it (see this example).
| XLA_RETURN_IF_ERROR(InitXLAGenVector(), | ||
| "Failed to initialize XLA generators"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this actually needed for CreateXLAGenerator?
| // Ensure PJRT is configured to a CPU backend for tests that touch the PJRT | ||
| // runtime. Optionally allow overriding the environment values by passing | ||
| // `pjrt_device` and/or `cpu_num_devices`. | ||
| static void EnsurePjrtCpuBackend(const char* pjrt_device = nullptr, | ||
| const char* cpu_num_devices = nullptr) { | ||
| // PJRT_DEVICE: override if provided, otherwise set default if not present | ||
| if (pjrt_device != nullptr && pjrt_device[0] != '\0') { | ||
| // Force override of any existing value | ||
| setenv("PJRT_DEVICE", pjrt_device, 1); | ||
| } else { | ||
| const char* pjrt = std::getenv("PJRT_DEVICE"); | ||
| if (pjrt == nullptr || pjrt[0] == '\0') { | ||
| // Use CPU backend with a single device by default. | ||
| setenv("PJRT_DEVICE", "CPU", 1); | ||
| } | ||
| } | ||
|
|
||
| // CPU_NUM_DEVICES: override if provided, otherwise set default if not present | ||
| if (cpu_num_devices != nullptr && cpu_num_devices[0] != '\0') { | ||
| // Force override of any existing value | ||
| setenv("CPU_NUM_DEVICES", cpu_num_devices, 1); | ||
| } else { | ||
| const char* cpu_devices = std::getenv("CPU_NUM_DEVICES"); | ||
| if (cpu_devices == nullptr || cpu_devices[0] == '\0') { | ||
| // Default to a single CPU device. Preserve existing behavior of not | ||
| // overwriting if already present (use overwrite=0 to match previous | ||
| // semantics). | ||
| setenv("CPU_NUM_DEVICES", "1", 0); | ||
| } | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed above, let's not do this.
Let's leave the environment variable initialization to the test runner script.
|
Could you leave the review comments unresolved? It's easier to see what's been addressed. |
Add helper functions getDefaultXLAGenerator and createXLAGenerator to XLA random number generator
These helper functions will be used with XLA hook later.
Refer to #9159