train_dist.py incompatible with new PyTorch #3

Open
LUKELIEM opened this issue Jul 21, 2019 · 5 comments

@LUKELIEM

When I tried to run train_dist.py, I ran into multiple errors that appear to be caused by the code being written for older versions of Python and PyTorch.

One issue: torch.utils.data.DataLoader expects batch_size to be an integer.

File "", line 77, in partition_dataset
partition, batch_size= bsz, shuffle=True)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 179, in init
batch_sampler = BatchSampler(sampler, batch_size, drop_last)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 162, in init
"but got batch_size={}".format(batch_size))
ValueError: batch_size should be a positive integer value, but got batch_size=64.0

Another issue, and I have no idea what is happening here:

File "", line 85, in average_gradients
dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM, group=0)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 902, in all_reduce
work = group.allreduce([tensor], opts)
AttributeError: 'int' object has no attribute 'allreduce'

@LUKELIEM

LUKELIEM commented Jul 21, 2019

Also, when I turned on CUDA in run():

model = model.cuda(rank)
...
data, target = Variable(data.cuda(rank)), Variable(target.cuda(rank))
...

I encountered the following error, which is caused by .cuda(rank), since it assumes I have multiple GPUs:

Process Process-4:
Traceback (most recent call last):
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "", line 120, in init_processes
fn(rank, size)
File "", line 95, in run
model = model.cuda(rank)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 265, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 199, in _apply
param.data = fn(param.data)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 265, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: invalid device ordinal
/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/ipykernel/main.py:55: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
Process Process-3:
Traceback (most recent call last):
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "", line 120, in init_processes
fn(rank, size)
File "", line 109, in run
average_gradients(model)
File "", line 85, in average_gradients
dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
File "/home/lukeai/miniconda3/envs/py36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 907, in all_reduce
work.wait()
RuntimeError: [/opt/conda/conda-bld/pytorch_1556653183467/work/third_party/gloo/gloo/transport/tcp/pair.cc:572] Connection closed by peer [127.0.0.1]:57517
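
A way around the invalid device ordinal on a single-GPU machine is to map each rank onto a GPU that actually exists instead of using the rank as the device index directly. The sketch below assumes that mapping (the helper name to_device is not from the repo) and uses .to(device) instead of the deprecated Variable wrapper:

import torch

# Sketch only: pick a device per process without assuming one GPU per rank.
# With a single GPU, model.cuda(rank) fails for rank >= 1 ("invalid device ordinal").
def to_device(model, rank):
    if torch.cuda.is_available():
        device = torch.device("cuda:{}".format(rank % torch.cuda.device_count()))
    else:
        device = torch.device("cpu")
    return model.to(device), device

# Usage inside run():
#   model, device = to_device(model, rank)
#   data, target = data.to(device), target.to(device)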

@marcomilanesio

One issue: torch.utils.data.DataLoader expects batch_size to be an integer.
ValueError: batch_size should be a positive integer value, but got batch_size=64.0

Same here, with

In [2]: torch.__version__                                                       
Out[2]: '1.2.0'

Workaround:

train_set = torch.utils.data.DataLoader(partition, batch_size=int(bsz), shuffle=True)
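
The float most likely comes from Python 3's true division when the per-rank batch size is computed (something like 128 / size upstream; not verified against this repo). Integer division avoids the cast entirely:

bsz = 128 // size   # size = world size; 128 assumed to be the global batch size
train_set = torch.utils.data.DataLoader(partition, batch_size=bsz, shuffle=True)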

Another issue, and I have no idea what is happening here:
AttributeError: 'int' object has no attribute 'allreduce'

This is still there.

@weimingwill

@marcomilanesio just remove group=0 in dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM, group=0)
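
With that change the function would look roughly like the sketch below (modelled on the tutorial-style code this script appears to follow, so names may differ). Dropping group falls back to the default process group; passing group=dist.group.WORLD explicitly should also work:

import torch.distributed as dist

# Sketch of average_gradients without the group=0 argument; call it after loss.backward().
def average_gradients(model):
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= world_size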

@FLYING37520

@marcomilanesio just remove group=0 in dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM, group=0)

Perfect!

@LukeLIN-web

LukeLIN-web commented Apr 22, 2022

Line 120 should be changed to epoch_loss += loss.data.item()
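
That is, inside the training loop in run() (a sketch; loss.item() is equivalent, and the surrounding names may not match the script exactly):

epoch_loss = 0.0
for data, target in train_set:
    # ... forward pass, loss.backward(), average_gradients(model), optimizer.step() ...
    epoch_loss += loss.item()   # loss is a 0-dim tensor; item() returns a Python float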
