If you experience an NCCL slowdown, the first step is to enable debug logging:
export NCCL_DEBUG=INFO
This will allow you to catch any misconfigurations in the logs, for example if you see:
version `EFA_1.2' not found (required by /opt/amazon/efa/lib/libfabric.so.1)
No plugin found (libnccl-net.so), using internal implementation
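If the job output has been saved to a file, the two messages above can be searched for directly. The following is a minimal sketch, not part of the original report: the function name `check_nccl_log` and the log path in the usage comment are hypothetical, and the matched strings are simply the ones quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: scan a captured NCCL_DEBUG=INFO log for the two
# messages quoted above. Prints one line per problem found and returns
# non-zero if either message is present.
check_nccl_log() {
    nccl_log_status=0
    # Symbol-version mismatch between the plugin and the system libfabric
    if grep -qF "EFA_1.2' not found" "$1"; then
        echo "libfabric version mismatch"
        nccl_log_status=1
    fi
    # NCCL could not load a network plugin and fell back to its built-in transport
    if grep -qF "No plugin found (libnccl-net.so)" "$1"; then
        echo "net plugin not loaded"
        nccl_log_status=1
    fi
    return "$nccl_log_status"
}

# Usage (assumed log path): check_nccl_log train.log
```

A clean log produces no output and a zero exit status, so the check can be dropped into a launch script.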
These messages likely mean the environment is pulling in a build of the aws-ofi-nccl plugin that's not compiled against the system libfabric. You can check this (assuming you're using conda) by running:
conda list | grep -E "nvidia|nccl|cud|torch"
If this shows something like:
nvidia-nccl-cu12 2.19.3 pypi_0 pypi
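To pick out only the NCCL packages that came from PyPI, the `conda list` output can be filtered on its last column. This is a rough sketch that assumes the four-column layout shown above (name, version, build, channel); the function name `find_pip_nccl` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical filter: reads `conda list` output on stdin and prints the name
# and version of any nvidia-nccl package whose channel column is "pypi".
find_pip_nccl() {
    awk '$1 ~ /^nvidia-nccl/ && $NF == "pypi" { print $1, $2 }'
}

# Usage: conda list | find_pip_nccl
```

Any output here indicates an NCCL wheel installed through pip rather than through a conda channel.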
This version is likely being pulled in as a dependency and isn't working properly. You can override this and install aws-ofi-nccl from Amazon Pytorch like so:
The error "version `EFA_1.2' not found" should now disappear from the logs.