If you installed aws-ofi-nccl from conda and your system has a libfabric version older than 1.18.2 together with aws-ofi-nccl 1.9.0, you may see errors such as the following:
[0] NCCL INFO NET/Plugin : dlerror=/opt/amazon/efa/lib/libfabric.so.1: version `FABRIC_1.7' not found (required by /fsx/ubuntu/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp/lib/python3.10/site-packages/torch/lib/../../../../libnccl-net.so) No plugin found (libnccl-net.so), using internal implementation
You can fix this by upgrading to aws-ofi-nccl 1.9.1 or downgrading to aws-ofi-nccl 1.7.4 like so:
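As a rough sketch of the check and the fix described above, the Python helper below tests whether a libfabric/aws-ofi-nccl pair falls in the affected range and builds candidate conda commands for the two fixed versions. The function names, the `conda-forge` channel, and the exact `conda install` invocation are assumptions for illustration, not taken from this issue; adjust them to however the plugin was installed in your environment.

```python
import re

# Assumptions (not from the issue): package name and channel used for the pin.
CONDA_PACKAGE = "aws-ofi-nccl"
CONDA_CHANNEL = "conda-forge"


def parse_version(text):
    """Return the first three numeric components of a version string as a tuple."""
    parts = [int(p) for p in re.findall(r"\d+", text)[:3]]
    parts += [0] * (3 - len(parts))  # pad "1.18" to (1, 18, 0)
    return tuple(parts)


def is_affected(libfabric_version, plugin_version):
    """True for the combination described above: libfabric < 1.18.2 with plugin 1.9.0."""
    return (
        parse_version(libfabric_version) < (1, 18, 2)
        and parse_version(plugin_version) == (1, 9, 0)
    )


def suggested_commands():
    """Candidate commands: upgrade to 1.9.1 or downgrade to 1.7.4."""
    return [
        f"conda install -c {CONDA_CHANNEL} {CONDA_PACKAGE}={version}"
        for version in ("1.9.1", "1.7.4")
    ]


if __name__ == "__main__":
    # Example values only; substitute the versions found on your system.
    if is_affected("1.18.1", "1.9.0"):
        for command in suggested_commands():
            print(command)
```

Only one of the two printed commands needs to be run. After reinstalling, the NCCL log should no longer report `No plugin found (libnccl-net.so)`.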
Fixed in #291