Script to probe the nccl libraries that PyTorch uses #267
cf57d36 to 07a509c
Cancel, or do we move forward with it?
b6f4189 to 6463723
Ready for review.
I'd like to see this give more structured data, i.e. allow it to be used within other Python scripts.
The scripts in this folder disambiguate the exact NCCL libraries that a PyTorch application actually
uses, in the presence of potentially multiple installed versions.

## 1. Motivation
Include usage at the top
By now, hopefully you're convinced on the need for a runtime probe to pinpoint the version of NCCL
loaded by PyTorch.

## 2. Howto
How to
Via Slurm: `srun -l -N1 ./probe-pt-nccl-aws-libs.sh`

```console
$ srun -l -N1 ./probe-pt-nccl-aws-libs.sh
```
Can this be used as a library within another script? That is, I want to use it in my efa-versions.py
to grab the NCCL version, but invoking the shell script and grepping the output is less than ideal.
0: cat /opt/amazon/efa_installed_packages:
0: # EFA installer version: 1.30.0
# Debug packages installed: no
0: # Packages installed:
Can we exclude this?
echo
echo "cat /opt/amazon/efa_installed_packages:"
cat /opt/amazon/efa_installed_packages
remove
My quick 2c: for this purpose, it's best to re-implement the logic in pure Python. Because the bulk of the work is done by Plus, with Python you'll have access to a proper representation of the structured object, and I think this is a better way to get a machine-readable version for Python land. My shell script is meant for human readability, and in its current form it is far from suited to the automation purpose you might have in mind.
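For illustration only, a pure-Python probe along those lines might look like the sketch below. It is not part of this PR: the function names `nccl_libs_from_maps` and `probe` are hypothetical, and the approach (reading `/proc/self/maps` after importing PyTorch, plus `torch.cuda.nccl.version()`) is an assumption about what such a re-implementation could do. It only works on Linux, and an NCCL that is statically linked into PyTorch would not appear as a separate mapped library.

```python
import json
import os
from pathlib import Path


def nccl_libs_from_maps(maps_text):
    """Return sorted, de-duplicated paths of mapped files whose file name mentions 'nccl'."""
    paths = set()
    for line in maps_text.splitlines():
        # Each line of /proc/<pid>/maps is: address perms offset dev inode [pathname]
        fields = line.split(maxsplit=5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        if "nccl" in os.path.basename(path).lower():
            paths.add(path)
    return sorted(paths)


def probe():
    """Hypothetical helper: collect what this process loaded into a plain dict."""
    result = {"torch_version": None, "nccl_version": None, "nccl_libs": []}
    try:
        import torch
        from torch.cuda import nccl
    except Exception:
        return result  # PyTorch not importable here; report empty fields
    result["torch_version"] = torch.__version__
    try:
        version = nccl.version()  # a tuple in recent PyTorch, an int in older releases
        result["nccl_version"] = list(version) if isinstance(version, tuple) else version
    except Exception:
        pass  # for example, a build without NCCL support
    maps = Path("/proc/self/maps")
    if maps.exists():
        result["nccl_libs"] = nccl_libs_from_maps(maps.read_text())
    return result


if __name__ == "__main__":
    print(json.dumps(probe(), indent=2))
```

Another script could then call `probe()` and read the dict directly, or run the file and parse its JSON output, instead of grepping the shell script's human-oriented text.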
361d6ba to 16db56b
16db56b to e815903
44e448e to 1209815
Closing as it's stale.
Issue #, if available: close #252
Description of changes: Probe what PyTorch actually uses for the nccl stacks.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.