add LSF scheduler (#588) #610
Closed
This pull request was exported from Phabricator. Differential Revision: D40184939 |
Codecov Report
@@            Coverage Diff             @@
##             main     #610      +/-   ##
==========================================
+ Coverage   94.75%   94.99%   +0.23%
==========================================
  Files          65       67       +2
  Lines        4287     4551     +264
==========================================
+ Hits         4062     4323     +261
- Misses        225      228       +3
Summary:
Pull Request resolved: pytorch#610

I prototyped the LSF scheduler for torchx. It currently supports native, Docker, and Singularity runtimes and requires a shared filesystem. I confirmed that it works with Gloo and NCCL on small VPC V100 clusters.

Note: the `torchx log` command is available only when the torchx host shares the filesystem with the cluster nodes (e.g., NFS).

In a nutshell, the LSF scheduler translates a torchx request into LSF job submissions (i.e., `bsub`). For distributed apps, it creates multiple `bsub` submissions.

I also added lsf to scripts/component_integration_tests.py. Here is the log output from my three-node LSF cluster, which also contains the dryrun results:
[component_integration_tests.lsf.txt](https://github.com/pytorch/torchx/files/9424891/component_integration_tests.lsf.txt)

Regarding Singularity image compatibility: Singularity already converts Docker images into the Singularity image format automatically, so all we have to do is generate singularity-exec arguments from torchx requests. Note that users still need to add the docker:// prefix to image names if they want to use Docker images.

The following are example commands.

**Example: native hello_world and CLI utils**
```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=native utils.echo --msg hello_world --num_replicas 3
lsf://torchx/echo-pxc3gn5ct061k
$ torchx list -s lsf
$ torchx status lsf://torchx/echo-pxc3gn5ct061k
$ torchx cancel lsf://torchx/echo-pxc3gn5ct061k
$ torchx log --stream stdout lsf://torchx/echo-pxc3gn5ct061k/echo/0
```

**Example: Docker hello_world**
```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=docker utils.echo --image alpine:latest --msg hello_world --num_replicas 3
```

**Example: Singularity hello_world**
```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=singularity utils.echo --image docker://alpine:latest --msg hello_world --num_replicas 3
```

**Example: Docker Distributed**
```
$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=docker,host_network=True" dist.ddp -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
```

**Example: Singularity Distributed**
```
$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=singularity,host_network=True" dist.ddp --image docker://ghcr.io/pytorch/torchx:0.3.0dev0 -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
```

Pull Request resolved: pytorch#588
Reviewed By: anirbanr-fb-r2p, msaroufim, kurman
Differential Revision: D40184939
fbshipit-source-id: d4a4f68d74a2ca12f95f683080c6a00137966ca6
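The handles in the examples above (`lsf://torchx/echo-pxc3gn5ct061k` and, for logs, `lsf://torchx/echo-pxc3gn5ct061k/echo/0`) appear to follow a `scheduler://session/app_id[/role[/replica]]` layout. The following is a minimal, standalone sketch of how such a string could be split apart. It is not torchx's own parser; `ParsedHandle` and `parse_handle` are hypothetical names, and reading the trailing `echo/0` as role name and replica index is an assumption.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse


class ParsedHandle(NamedTuple):
    """Hypothetical container for the pieces of a handle (not a torchx type)."""
    scheduler: str
    session: str
    app_id: str
    role: Optional[str]
    replica: Optional[int]


def parse_handle(handle: str) -> ParsedHandle:
    """Split scheduler://session/app_id[/role[/replica]] into its parts."""
    url = urlparse(handle)
    if not url.scheme or not url.netloc:
        raise ValueError(f"not a valid handle: {handle!r}")
    # Drop empty segments produced by leading or trailing slashes.
    parts = [p for p in url.path.split("/") if p]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unexpected number of path segments: {handle!r}")
    role = parts[1] if len(parts) > 1 else None
    replica = int(parts[2]) if len(parts) > 2 else None
    return ParsedHandle(url.scheme, url.netloc, parts[0], role, replica)


if __name__ == "__main__":
    print(parse_handle("lsf://torchx/echo-pxc3gn5ct061k"))
    print(parse_handle("lsf://torchx/echo-pxc3gn5ct061k/echo/0"))
```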
43ad659 to 5d5b916
Labels
CLA Signed
This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
fb-exported
Summary:
I prototyped the LSF scheduler for torchx. It currently supports native, Docker, and Singularity runtimes and requires a shared filesystem. I confirmed that it works with Gloo and NCCL on small VPC V100 clusters.
Note: the `torchx log` command is available only when the torchx host shares the filesystem with the cluster nodes (e.g., NFS).
In a nutshell, the LSF scheduler translates a torchx request into LSF job submissions (i.e., `bsub`). For distributed apps, it creates multiple `bsub` submissions.
I also added lsf to scripts/component_integration_tests.py. Here is the log output from my three-node LSF cluster, which also contains the dryrun results: component_integration_tests.lsf.txt
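To illustrate the "one request, several `bsub` submissions" idea, here is a rough sketch that expands a role with N replicas into N `bsub` command lines. It is not the code from this PR: the `Role` dataclass, the `bsub_commands` helper, the job naming scheme, and the log file layout under `jobdir` are all assumptions made for illustration. Only the `bsub` options `-J` (job name), `-o` (stdout file), and `-e` (stderr file) are used.

```python
import shlex
from dataclasses import dataclass, field
from typing import List


@dataclass
class Role:
    """Hypothetical stand-in for a torchx role (not the torchx class)."""
    name: str
    entrypoint: str
    args: List[str] = field(default_factory=list)
    num_replicas: int = 1


def bsub_commands(app_id: str, role: Role, jobdir: str) -> List[List[str]]:
    """Build one bsub argument list per replica of the given role."""
    cmds = []
    for replica in range(role.num_replicas):
        # Assumed naming scheme: <app_id>-<role>-<replica index>.
        job_name = f"{app_id}-{role.name}-{replica}"
        cmds.append([
            "bsub",
            "-J", job_name,                     # job name
            "-o", f"{jobdir}/{job_name}.out",   # stdout on the shared filesystem
            "-e", f"{jobdir}/{job_name}.err",   # stderr on the shared filesystem
            role.entrypoint, *role.args,
        ])
    return cmds


if __name__ == "__main__":
    echo = Role(name="echo", entrypoint="echo", args=["hello_world"], num_replicas=3)
    for cmd in bsub_commands("echo-pxc3gn5ct061k", echo, "/mnt/data/torchx"):
        print(shlex.join(cmd))
```

Writing the per-replica output files under a shared `jobdir` is one plausible reason why `torchx log` would need the torchx host to see the same filesystem as the cluster nodes.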
Regarding Singularity image compatibility: Singularity already converts Docker images into the Singularity image format automatically, so all we have to do is generate singularity-exec arguments from torchx requests. Note that users still need to add the docker:// prefix to image names if they want to use Docker images.
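As a hedged sketch of what generating singularity-exec arguments could look like, the snippet below turns an image name, a command, and Docker-style bind mount specs (such as `type=bind,src=/mnt/data/dist,dst=/data`) into a `singularity exec` argument list. The helper names `parse_mount` and `singularity_exec_args` are hypothetical and this is not the PR's implementation. It relies only on the `--bind src:dst` and `--nv` options of `singularity exec`, and passes the image string through unchanged, so a Docker image must already carry the `docker://` prefix.

```python
from typing import Dict, List, Sequence


def parse_mount(spec: str) -> Dict[str, str]:
    """Parse 'type=bind,src=/a,dst=/b' into {'type': 'bind', 'src': '/a', 'dst': '/b'}."""
    return dict(item.split("=", 1) for item in spec.split(","))


def singularity_exec_args(
    image: str,
    command: Sequence[str],
    mounts: Sequence[str] = (),
    use_gpu: bool = False,
) -> List[str]:
    """Build an argument list for `singularity exec` (illustrative only)."""
    args = ["singularity", "exec"]
    if use_gpu:
        args.append("--nv")  # expose NVIDIA GPUs inside the container
    for spec in mounts:
        mount = parse_mount(spec)
        if mount.get("type") != "bind":
            raise ValueError(f"only bind mounts are handled in this sketch: {spec!r}")
        args += ["--bind", f"{mount['src']}:{mount['dst']}"]
    # The image is used as given; Docker images need the docker:// prefix.
    args.append(image)
    args += list(command)
    return args


if __name__ == "__main__":
    print(singularity_exec_args(
        "docker://alpine:latest",
        ["echo", "hello_world"],
        mounts=["type=bind,src=/mnt/data/dist,dst=/data"],
    ))
```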
The following are example commands.
Example: native hello_world and CLI utils
Example: Docker hello_world
Example: Singularity hello_world
Example: Docker Distributed
Example: Singularity Distributed
Pull Request resolved: #588
Reviewed By: msaroufim
Differential Revision: D40184939
Pulled By: msaroufim