I had a recent conversation with a user who tried to use DistributedDataParallel with skorch's early stopping, and this would cause the process to hang. My guess is that since DDP workers spawn their own jobs, skorch's early stopping mechanism would stop a worker, but the parent node would not get this information. This leaves the parent waiting for a child that has stopped running.
There may also be an issue with checking the validation loss with DistributedDataParallel, because each worker would have its own loss, and these would need to be gathered to actually compute the loss for a given epoch.
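To make the two problems concrete, here is a minimal sketch of the synchronization an early-stopping step would need under DDP. The collective functions (`all_reduce_mean`, `broadcast`) are plain-Python stand-ins for the real `torch.distributed` collectives, and all names here are hypothetical; the point is only the logic: the epoch loss must be averaged across ranks, and the stop decision must reach every rank so no worker exits while the others block.

```python
def all_reduce_mean(values):
    """Simulated all_reduce: every worker receives the global mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def broadcast(value, world_size):
    """Simulated broadcast of rank 0's decision to all workers."""
    return [value] * world_size

def epoch_step(local_losses, best_loss, patience_left):
    # 1. Average the per-worker validation losses so every rank
    #    sees the same epoch loss, not just its own shard's loss.
    global_losses = all_reduce_mean(local_losses)
    epoch_loss = global_losses[0]

    # 2. Apply the early-stopping rule to the *global* loss ...
    improved = epoch_loss < best_loss
    stop = (not improved) and patience_left <= 1

    # 3. ... and broadcast the decision, so no single worker stops
    #    on its own and leaves the others waiting on a collective.
    decisions = broadcast(stop, world_size=len(local_losses))
    return epoch_loss, decisions

epoch_loss, decisions = epoch_step([0.9, 1.1, 1.0], best_loss=0.8, patience_left=1)
```

Without steps 1 and 3, each rank would evaluate a different loss and could reach a different stop decision, which is exactly the hang described above.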
Thanks for reporting. Do you know how this is solved more generally (say, only using PyTorch without any frameworks)? I could imagine that similar errors can occur easily, given how tricky multi-threading is in general. Unfortunately, I don't have access to a setup to experiment with this.
I know too little to really comment on that. Ideally, I would wish for skorch to get out of the way enough that users can use DistributedDataParallel if they wish to. Regarding barriers, is that something that could be achieved through callbacks?
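One way a callback could work, sketched below with hypothetical names: an early-stopping callback that only stops after all ranks agree. `sync_decision` stands in for a real collective (e.g. an all-reduce over a flag tensor in `torch.distributed`); it is injected here so the logic can be shown without a process group. This is not skorch's actual `EarlyStopping` API, just an illustration of the consensus step.

```python
class SyncedEarlyStopping:
    """Stop training only once every rank agrees the loss has stalled."""

    def __init__(self, patience, sync_decision):
        self.patience = patience
        # Collective taking this rank's bool and returning the global decision.
        self.sync_decision = sync_decision
        self.best = float("inf")
        self.bad_epochs = 0
        self.stopped = False

    def on_epoch_end(self, valid_loss):
        # Track epochs without improvement on the monitored loss.
        if valid_loss < self.best:
            self.best = valid_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        local_stop = self.bad_epochs >= self.patience
        # Every rank enters the collective each epoch, so none is
        # left waiting on a peer that has already exited.
        if self.sync_decision(local_stop):
            self.stopped = True
```

In a single-process setting `sync_decision` can simply be the identity; in a real DDP run it would be a collective so the stop flag is identical on every rank.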