PyTorch Support #493
Comments
@gogasca any idea? I guess we need to upgrade the PyTorch example script. I don't see TonY or GCP being the issue here.
Great observation and I believe you are correct: here it shows the |
@bradmiro would you mind contributing a patch to fix that?
Sure, I can look into this.
@oliverhu are there special considerations for using TonY with PyTorch? The error seems to be in configuring init_process_group properly. The current code is this: https://github.com/linkedin/TonY/blob/master/tony-examples/mnist-pytorch/mnist_distributed.py#L184-L189 Changing the backend to
That should not matter, all of those backends should work 🤔 Have you tried other backends?
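For anyone trying other backends, here is a minimal sketch of how the init_process_group arguments could be assembled and checked before the call. It is not the code from the TonY example: the helper names `build_init_kwargs` and `try_backend` are hypothetical, and the environment variable names (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are PyTorch's conventional ones, assumed here. TonY may export different names, so adjust accordingly.

```python
import os

# Variable names below are PyTorch's conventional ones (an assumption here);
# the launcher you use may export different names.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def build_init_kwargs(env, backend="gloo"):
    """Build keyword arguments for torch.distributed.init_process_group.

    Hypothetical helper: fails early with a readable message if a variable
    is missing, instead of failing inside the rendezvous.
    """
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "backend": backend,  # "gloo", "nccl", or "mpi"
        "init_method": "tcp://{}:{}".format(env["MASTER_ADDR"], env["MASTER_PORT"]),
        "rank": int(env["RANK"]),
        "world_size": int(env["WORLD_SIZE"]),
    }


def try_backend(backend, env=os.environ):
    """Initialize a process group with the given backend, then tear it down."""
    # Imported lazily so the argument check above can run without torch.
    import torch.distributed as dist

    kwargs = build_init_kwargs(env, backend)
    print("init_process_group arguments:", kwargs)
    dist.init_process_group(**kwargs)
    try:
        print("rank", dist.get_rank(), "of", dist.get_world_size())
    finally:
        dist.destroy_process_group()
```

Calling `try_backend("gloo")` and then `try_backend("nccl")` on every worker would show whether the failure depends on the backend or on the rendezvous arguments. Note that "nccl" requires CUDA devices on each worker, while "gloo" also runs on CPU.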
The The |
This might be a PyTorch thing, I can look into it more probably early next week. Unsure about |
Hi there, I'm working on an update to the TonY installation script for GCP Dataproc. While I have been able to update TensorFlow successfully (locally), I cannot get the PyTorch example working. It does not work on 0.4 (the most recent version you explicitly mention supporting) or on 1.7.1, the most recent release. I get the following error:
Latest attempt:
PyTorch 1.7.1
torchvision 0.8.2
TonY 0.4.0
Dataproc 2.0 (Hadoop 3.2.1)
Config:
The cluster has 1 master, 2 workers, and 2 NVIDIA Tesla T4s. However, every configuration I have tried so far results in the same error. Any advice would be greatly appreciated!