[BUG] Number of communication kernels don't match between workers in run #952

Open
oabuhamdan opened this issue Jun 18, 2024 · 0 comments
Labels
plugin PyTorch Profiler TensorBoard Plugin related

Comments

oabuhamdan commented Jun 18, 2024

The Distributed View is not available, and I think this error is the cause:

E0618 17:14:15.276058 131298609845824 loader.py:150] Number of communication kernels don't match between workers in run: gpu_resnet50_cifar10_ddp_batch512_precision32_nodes3
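
One way to check whether the per-worker traces really differ is to count the communication kernels in each trace file of the run. The following is only a minimal sketch under several assumptions that are not confirmed by this report: that the traces are Chrome-trace JSON files with a top-level `traceEvents` list, that they are named `*.pt.trace.json` (optionally gzipped), that GPU kernel events have category `kernel` or `Kernel`, and that communication kernels can be recognised by `nccl` in their name. The helper names `count_comm_kernels`, `load_trace`, and `summarize_run` are hypothetical, and the plugin's own check in `loader.py` may use different criteria.

```python
import gzip
import json
from collections import Counter
from pathlib import Path


def count_comm_kernels(trace: dict) -> Counter:
    """Count kernel events whose name mentions NCCL, keyed by kernel name."""
    counts = Counter()
    for event in trace.get("traceEvents", []):
        # Assumption: kernel events carry cat "kernel" ("Kernel" in older traces).
        if str(event.get("cat", "")).lower() != "kernel":
            continue
        name = str(event.get("name", ""))
        # Assumption: communication kernels have "nccl" somewhere in their name.
        if "nccl" in name.lower():
            counts[name] += 1
    return counts


def load_trace(path: Path) -> dict:
    """Load a plain or gzipped JSON trace file."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


def summarize_run(run_dir: str) -> dict:
    """Return {trace file name: total NCCL kernel count} for one run directory."""
    totals = {}
    # Assumption: one trace file per worker, named "*.pt.trace.json[.gz]".
    for path in sorted(Path(run_dir).glob("*.pt.trace.json*")):
        totals[path.name] = sum(count_comm_kernels(load_trace(path)).values())
    return totals
```

Calling `summarize_run` on the run directory and comparing the totals across files would show which worker, if any, recorded a different number of communication kernels.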

Data Collection Env:
Python version: 3.11.7
GCC (GCC) 12.2.0
Torch: '2.3.1+cu121'
PyTorch Lightning: '2.3.0'
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.8.2003 (Core)
Release: 7.8.2003
Codename: Core
SLURM environment
CUDA 12.4.1
DeepSpeed 0.14.3

Data Visualization Env:
MacBook Air M2
OS: Version 14.5 (23F79)
tensorboard==2.17.0
tensorboard-data-server==0.7.2
tensorboard_plugin_profile==2.15.1
tensorboardX==2.6.2.2
torch-tb-profiler==0.4.3

@oabuhamdan oabuhamdan changed the title Number of communication kernels don't match between workers in run [BUG] Number of communication kernels don't match between workers in run Jun 18, 2024
@sraikund16 sraikund16 added the plugin PyTorch Profiler TensorBoard Plugin related label Jun 20, 2024