add LSF scheduler (#588) #610


Closed · wants to merge 1 commit
Conversation

@d4l3k d4l3k (Member) commented Oct 7, 2022

Summary:
I prototyped the LSF scheduler for torchx. It currently supports native, Docker, and Singularity runtimes, and requires a shared filesystem. I confirmed it works with Gloo and NCCL on small VPC V100 clusters.

Note: the `torchx log` command is available only when the torchx host shares a filesystem with the cluster nodes (e.g., via NFS).

In a nutshell, the LSF scheduler translates a torchx request into LSF job submissions (i.e., `bsub`). For distributed apps, it creates multiple `bsub` submissions. I also added lsf to scripts/component_integration_tests.py. The attached file is the log output from my three-node LSF cluster; you can find dryrun results there.

[component_integration_tests.lsf.txt](https://github.com/pytorch/torchx/files/9424891/component_integration_tests.lsf.txt)
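The one-`bsub`-per-replica translation can be sketched roughly as follows. This is a hypothetical helper: the flag set and naming scheme are illustrative, not what lsf_scheduler.py actually emits, though `-J`, `-o`, and `-e` are standard `bsub` options.

```python
import shlex
from typing import List


def build_bsub_commands(
    app_name: str,
    role_name: str,
    num_replicas: int,
    jobdir: str,
    entrypoint: str,
    args: List[str],
) -> List[str]:
    """Sketch: emit one ``bsub`` submission per replica, writing each
    replica's stdout/stderr under the shared jobdir so ``torchx log``
    can read them back over the shared filesystem."""
    cmds = []
    for replica_id in range(num_replicas):
        job_name = f"{app_name}-{role_name}-{replica_id}"
        cmd = [
            "bsub",
            "-J", job_name,                    # LSF job name
            "-o", f"{jobdir}/{job_name}.out",  # stdout under the shared jobdir
            "-e", f"{jobdir}/{job_name}.err",  # stderr under the shared jobdir
            entrypoint,
        ] + list(args)
        cmds.append(" ".join(shlex.quote(c) for c in cmd))
    return cmds
```

For a three-replica `utils.echo` request this yields three independent LSF jobs, which matches the "multiple `bsub` for distributed apps" design above.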

Regarding Singularity image compatibility, Singularity already automates converting Docker images into the Singularity image format, so all we have to do is generate `singularity exec` arguments from torchx requests. Note that users still need to prefix image names with `docker://` if they want to use Docker images.
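Because Singularity handles the Docker-to-SIF conversion itself when given a `docker://` URI, generating the `singularity exec` argument list is a thin wrapper. A minimal sketch, assuming a hypothetical helper rather than the scheduler's actual code (`--nv` is Singularity's real NVIDIA GPU flag):

```python
from typing import List


def build_singularity_exec(
    image: str, entrypoint: str, args: List[str], gpus: int = 0
) -> List[str]:
    # Singularity accepts docker:// URIs directly and converts the Docker
    # image to its own format on pull, so the image name passes through
    # unchanged; a name without the prefix is treated as a local image.
    cmd = ["singularity", "exec"]
    if gpus > 0:
        cmd.append("--nv")  # enable NVIDIA GPU passthrough
    return cmd + [image, entrypoint] + list(args)
```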

The following are example commands.

**Example: native hello_world and CLI utils**

```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=native utils.echo --msg hello_world --num_replicas 3
lsf://torchx/echo-pxc3gn5ct061k
$ torchx list -s lsf
$ torchx status lsf://torchx/echo-pxc3gn5ct061k
$ torchx cancel lsf://torchx/echo-pxc3gn5ct061k
$ torchx log --stream stdout lsf://torchx/echo-pxc3gn5ct061k/echo/0
```
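The handles passed to `status`, `cancel`, and `log` follow a small scheme, which can be read back with a sketch like this. This is a hypothetical parser, assuming the `lsf://<session>/<app_id>[/<role>/<replica>]` layout visible in the examples above:

```python
from typing import NamedTuple, Optional


class LsfHandle(NamedTuple):
    session: str
    app_id: str
    role: Optional[str] = None
    replica: Optional[int] = None


def parse_handle(handle: str) -> LsfHandle:
    # An app handle is lsf://<session>/<app_id>; log handles append
    # /<role>/<replica_id>, as in the torchx log example above.
    assert handle.startswith("lsf://"), handle
    parts = handle[len("lsf://"):].split("/")
    if len(parts) == 2:
        return LsfHandle(parts[0], parts[1])
    session, app_id, role, replica = parts
    return LsfHandle(session, app_id, role, int(replica))
```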

**Example: Docker hello_world**

```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=docker utils.echo --image alpine:latest --msg hello_world --num_replicas 3
```

**Example: Singularity hello_world**

```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=singularity utils.echo --image docker://alpine:latest --msg hello_world --num_replicas 3
```

**Example: Docker Distributed**

```
$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=docker,host_network=True" dist.ddp -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
```
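The `-j 2x2` shorthand encodes nodes x processes-per-node. A minimal sketch of reading it (illustrative, not dist.ddp's own parsing):

```python
from typing import Tuple


def parse_j(j: str) -> Tuple[int, int]:
    # -j NxM means N nodes with M workers per node, so -j 2x2 launches
    # 2 nodes x 2 workers = 4 processes in total.
    nnodes_s, nproc_s = j.split("x")
    return int(nnodes_s), int(nproc_s)
```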

**Example: Singularity Distributed**

```
$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=singularity,host_network=True" dist.ddp --image docker://ghcr.io/pytorch/torchx:0.3.0dev0 -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
```
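The `--mount` argument is a comma-separated key=value string. A sketch of reading the bind-mount form used in these examples (a hypothetical helper, not torchx's mount parser, and covering only `type=bind`):

```python
from typing import Dict, Tuple


def parse_bind_mount(spec: str) -> Tuple[str, str]:
    # The --mount string is comma-separated key=value pairs, e.g.
    # "type=bind,src=/mnt/data/dist,dst=/data".
    opts: Dict[str, str] = dict(kv.split("=", 1) for kv in spec.split(","))
    assert opts.get("type") == "bind", "only bind mounts sketched here"
    return opts["src"], opts["dst"]
```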

Pull Request resolved: #588

Reviewed By: msaroufim

Differential Revision: D40184939

Pulled By: msaroufim

@facebook-github-bot added the CLA Signed (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) and fb-exported labels on Oct 7, 2022.
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D40184939

@codecov codecov bot commented Oct 7, 2022

Codecov Report

Merging #610 (5d5b916) into main (73b6f09) will increase coverage by 0.23%.
The diff coverage is 99.62%.

@@            Coverage Diff             @@
##             main     #610      +/-   ##
==========================================
+ Coverage   94.75%   94.99%   +0.23%     
==========================================
  Files          65       67       +2     
  Lines        4287     4551     +264     
==========================================
+ Hits         4062     4323     +261     
- Misses        225      228       +3     
| Impacted Files | Coverage Δ |
|---|---|
| torchx/schedulers/__init__.py | 95.23% <ø> (ø) |
| torchx/schedulers/lsf_scheduler.py | 99.61% <99.61%> (ø) |
| torchx/util/shlex.py | 100.00% <100.00%> (ø) |
| torchx/schedulers/docker_scheduler.py | 95.00% <0.00%> (-1.00%) ⬇️ |


Reviewed By: anirbanr-fb-r2p, msaroufim, kurman

fbshipit-source-id: d4a4f68d74a2ca12f95f683080c6a00137966ca6
