train_dist.py incompatible with new PyTorch #3
Comments
Also, when I turned on CUDA in run() with model = model.cuda(rank), I encountered the following error, caused by .cuda(rank), since it assumes I have multiple GPUs: Process Process-4:
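One way around this (a sketch, not code from the repo; `pick_device` is a hypothetical helper) is to map each rank onto the GPUs that are actually visible, falling back to CPU, instead of calling `.cuda(rank)` directly:

```python
import torch

def pick_device(rank):
    # Map a process rank to a device without assuming one GPU per
    # rank: .cuda(rank) fails when rank >= torch.cuda.device_count().
    if torch.cuda.is_available():
        return torch.device(f"cuda:{rank % torch.cuda.device_count()}")
    return torch.device("cpu")
```

Then `model = model.to(pick_device(rank))` works on single-GPU and CPU-only machines as well.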
Same here, with
Workaround:
This is still there.
@marcomilanesio just remove
Perfect!
Line 120 should change to
When I tried to run train_dist.py, I ran into multiple errors that seem to stem from this code being written for older versions of Python and PyTorch.
One issue: torch.utils.data.DataLoader expects batch_size to be an integer.
File "", line 77, in partition_dataset
    partition, batch_size=bsz, shuffle=True)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 179, in __init__
    batch_sampler = BatchSampler(sampler, batch_size, drop_last)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 162, in __init__
    "but got batch_size={}".format(batch_size))
ValueError: batch_size should be a positive integer value, but got batch_size=64.0
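The 64.0 in the error suggests the per-worker batch size is computed with true division (e.g. 128.0 / size), which produces a float. A minimal sketch of the fix, assuming a global batch of 128 as in the tutorial (`per_worker_batch_size` is a hypothetical helper name):

```python
def per_worker_batch_size(global_batch, world_size):
    # Integer division keeps batch_size an int, which newer
    # DataLoader versions require; 128.0 / size would yield
    # a float such as 64.0 and raise the ValueError above.
    return global_batch // world_size
```

Equivalently, wrapping the existing expression in int(...) before passing it to DataLoader also works.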
Another issue; I have no idea what is happening here:
File "", line 85, in average_gradients
dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM, group=0)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 902, in all_reduce
work = group.allreduce([tensor], opts)
AttributeError: 'int' object has no attribute 'allreduce'
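The group=0 argument is the culprit: in newer PyTorch, group must be a ProcessGroup object (the default is dist.group.WORLD), not an integer, so passing 0 makes the library try to call .allreduce on an int. A sketch of a fixed average_gradients, assuming dist.init_process_group has already been called:

```python
import torch
import torch.distributed as dist

def average_gradients(model):
    # Sum each gradient across all workers, then divide by the world
    # size. Omitting `group` uses the default process group
    # (dist.group.WORLD); passing the integer 0 raises the
    # AttributeError shown above on newer PyTorch releases.
    size = float(dist.get_world_size())
    for param in model.parameters():
        if param.grad is None:
            continue
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= size
```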