
Commit 43ad659

add LSF scheduler (#588)
Summary: I prototyped an LSF scheduler for torchx. It currently supports native, Docker, and Singularity runtimes, and it requires a shared filesystem. I confirmed that it works with Gloo and NCCL on small VPC V100 clusters. Note: the `torchx log` command is available only when the torchx host shares a filesystem (e.g., NFS) with the cluster nodes.

In a nutshell, the LSF scheduler translates a torchx request into LSF job submissions (i.e., `bsub`). For distributed apps, it issues multiple `bsub` submissions.

I also added lsf to scripts/component_integration_tests.py. Here is the log output from my three-node LSF cluster; it includes the dryrun results. [component_integration_tests.lsf.txt](https://github.com/pytorch/torchx/files/9424891/component_integration_tests.lsf.txt)

Regarding Singularity image compatibility: Singularity already converts Docker images into the Singularity image format automatically, so all we have to do is generate singularity-exec arguments from torchx requests. Note that users still need to add the docker:// prefix to image names if they want to use Docker images. The following are example commands.

**Example: native hello_world and CLI utils**
```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=native utils.echo --msg hello_world --num_replicas 3
lsf://torchx/echo-pxc3gn5ct061k
$ torchx list -s lsf
$ torchx status lsf://torchx/echo-pxc3gn5ct061k
$ torchx cancel lsf://torchx/echo-pxc3gn5ct061k
$ torchx log --stream stdout lsf://torchx/echo-pxc3gn5ct061k/echo/0
```

**Example: Docker hello_world**
```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=docker utils.echo --image alpine:latest --msg hello_world --num_replicas 3
```

**Example: Singularity hello_world**
```
$ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=singularity utils.echo --image docker://alpine:latest --msg hello_world --num_replicas 3
```

**Example: Docker Distributed**
```
$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=docker,host_network=True" dist.ddp -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
```

**Example: Singularity Distributed**
```
$ cp scripts/dist_app.py /mnt/data/dist/
$ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=singularity,host_network=True" dist.ddp --image docker://ghcr.io/pytorch/torchx:0.3.0dev0 -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
```

Pull Request resolved: #588
Reviewed By: msaroufim
Differential Revision: D40184939
Pulled By: msaroufim
fbshipit-source-id: 5a13d2ee88b3b5cf1b8e5a3f6786b955d47f21f8
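
The commit message says a torchx request becomes one `bsub` submission per replica, with the command optionally wrapped by a container runtime. The following is a minimal sketch of that idea, not the code in `torchx/schedulers/lsf_scheduler.py`. The function names, the job-name scheme, and the output-file layout under `jobdir` are assumptions made for illustration; only the `bsub` options `-J`, `-o`, and `-e`, plus `docker run --rm` and `singularity exec`, are taken as given.

```python
import shlex
from typing import List, Sequence


def wrap_runtime(runtime: str, image: str, cmd: Sequence[str]) -> List[str]:
    """Wrap a command for the chosen runtime (hypothetical helper)."""
    if runtime == "native":
        return list(cmd)
    if runtime == "docker":
        return ["docker", "run", "--rm", image, *cmd]
    if runtime == "singularity":
        # Docker images need the docker:// prefix, as the commit message notes.
        return ["singularity", "exec", image, *cmd]
    raise ValueError(f"unsupported runtime: {runtime!r}")


def bsub_commands(
    app_id: str,
    role: str,
    num_replicas: int,
    runtime: str,
    image: str,
    cmd: Sequence[str],
    jobdir: str,
) -> List[List[str]]:
    """Build one bsub argument list per replica (assumed naming scheme)."""
    submissions = []
    for replica in range(num_replicas):
        job_name = f"{app_id}-{role}-{replica}"
        submissions.append(
            [
                "bsub",
                "-J", job_name,                       # LSF job name
                "-o", f"{jobdir}/{job_name}.out",     # stdout on the shared filesystem
                "-e", f"{jobdir}/{job_name}.err",     # stderr on the shared filesystem
                *wrap_runtime(runtime, image, cmd),
            ]
        )
    return submissions


if __name__ == "__main__":
    for argv in bsub_commands(
        "echo-demo", "echo", 3, "singularity",
        "docker://alpine:latest", ["echo", "hello_world"], "/mnt/data/torchx",
    ):
        print(shlex.join(argv))
```

Writing stdout and stderr under `jobdir` mirrors the note that `torchx log` only works when the torchx host can read the same filesystem as the cluster nodes.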
1 parent 73b6f09 commit 43ad659

6 files changed: +1147 -1 lines changed


docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -74,6 +74,7 @@ Works With
    schedulers/slurm
    schedulers/ray
    schedulers/aws_batch
+   schedulers/lsf

 .. fbcode::

docs/source/schedulers/lsf.rst

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+IBM Spectrum LSF
+=================
+
+.. automodule:: torchx.schedulers.lsf_scheduler
+
+.. currentmodule:: torchx.schedulers.lsf_scheduler
+
+.. autoclass:: LsfScheduler
+   :members:
+   :show-inheritance:
+
+.. autoclass:: LsfBsub
+   :members:
+
+Reference
+~~~~~~~~~~~~
+
+.. autofunction:: create_scheduler

scripts/component_integration_tests.py

Lines changed: 12 additions & 1 deletion
@@ -51,7 +51,7 @@ def main() -> None:
     torchx_image = "dummy_image"
     dryrun = False

-    if scheduler in ("kubernetes", "local_docker", "aws_batch"):
+    if scheduler in ("kubernetes", "local_docker", "aws_batch", "lsf"):
         try:
             build = build_and_push_image()
             torchx_image = build.torchx_image
@@ -105,6 +105,17 @@ def main() -> None:
             },
             "workspace": f"file://{os.getcwd()}",
         },
+        "lsf": {
+            "providers": [
+                component_provider,
+            ],
+            "image": torchx_image,
+            "cfg": {
+                "runtime": "docker",
+                "jobdir": "/mnt/data/torchx",
+                "host_network": True,
+            },
+        },
     }

     params = run_parameters[scheduler]
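
The `cfg` dictionary added for `lsf` carries the same three options that the commit message passes on the command line as `-cfg "jobdir=/mnt/data/torchx,runtime=docker,host_network=True"`. As a rough illustration of how such a string maps onto that dictionary, here is a standalone sketch. `parse_cfg` is a hypothetical helper written for this note, not torchx's own parser, and its rule of turning the literals `True` and `False` into booleans is an assumption.

```python
from typing import Dict, Union

CfgVal = Union[str, bool]


def parse_cfg(cfg_str: str) -> Dict[str, CfgVal]:
    """Parse 'k1=v1,k2=v2' into a dict (hypothetical, simplified)."""
    cfg: Dict[str, CfgVal] = {}
    for pair in filter(None, cfg_str.split(",")):
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        # Assumption: only the exact literals True/False are treated as booleans.
        if value in ("True", "False"):
            cfg[key] = value == "True"
        else:
            cfg[key] = value
    return cfg


if __name__ == "__main__":
    print(parse_cfg("jobdir=/mnt/data/torchx,runtime=docker,host_network=True"))
```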

torchx/schedulers/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@
     "kubernetes": "torchx.schedulers.kubernetes_scheduler",
     "aws_batch": "torchx.schedulers.aws_batch_scheduler",
     "ray": "torchx.schedulers.ray_scheduler",
+    "lsf": "torchx.schedulers.lsf_scheduler",
 }
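
This hunk registers the new scheduler by mapping the name `lsf` to a module path, and the new docs page lists a `create_scheduler` function in that module. A registry of this shape is typically resolved by importing the module on demand and looking up a factory by name. The sketch below shows that pattern under stated assumptions: `load_factory` is a hypothetical helper, and the demo uses the standard-library `json` module in place of a scheduler module so that it runs without torchx installed.

```python
import importlib
from typing import Any, Callable, Dict


def load_factory(
    registry: Dict[str, str], name: str, attr: str = "create_scheduler"
) -> Callable[..., Any]:
    """Import the module registered under `name` and return its factory.

    Hypothetical helper; the default attribute name follows the
    `create_scheduler` function listed in the docs page above.
    """
    try:
        module_path = registry[name]
    except KeyError:
        raise ValueError(
            f"unknown scheduler {name!r}; known: {sorted(registry)}"
        ) from None
    # The import happens only when the scheduler is requested.
    module = importlib.import_module(module_path)
    return getattr(module, attr)


if __name__ == "__main__":
    # Stand-in registry: "demo" maps to the stdlib json module.
    demo_registry = {"demo": "json"}
    dumps = load_factory(demo_registry, "demo", attr="dumps")
    print(dumps({"scheduler": "lsf"}))
```

Deferring the import keeps optional dependencies of one scheduler from affecting users who never select it.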
