
Commit f534a1a

Add readme in distributed_training (Project-MONAI#1690)

### Checks
- [ ] Avoid including large-size files in the PR.
- [ ] Clean up long text outputs from code cells in the notebook.
- [ ] For security purposes, please check the contents and remove any sensitive info such as user names and private key.
- [ ] Ensure (1) hyperlinks and markdown anchors are working (2) use relative paths for tutorial repo files (3) put figure and graphs in the `./figure` folder
- [ ] Notebook runs automatically `./runner.sh -t <path to .ipynb file>`

Signed-off-by: YunLiu <[email protected]>
1 parent 134195e commit f534a1a

File tree

1 file changed (+39, -0 lines)

@@ -0,0 +1,39 @@
## Data Preparation

Users need to download the Task01_BrainTumour dataset from the MICCAI challenge [Medical Segmentation Decathlon](http://medicaldecathlon.com/) for the following examples.

- Define the directory where the dataset will be downloaded and extracted:
```
root_dir="/path/to/your/directory"  # change this to your desired directory
```
- Download and extract in one line if the directory doesn't exist:
```
[ ! -d "${root_dir}/Task01_BrainTumour" ] && mkdir -p "${root_dir}/Task01_BrainTumour" && wget -qO- "https://msd-for-monai.s3-us-west-2.amazonaws.com/Task01_BrainTumour.tar" | tar -xv -C "${root_dir}"
```
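
After the download finishes, a quick sanity check can confirm the extraction (the expected listing below assumes the standard Medical Segmentation Decathlon archive layout):

```
ls "${root_dir}/Task01_BrainTumour"
# typically: dataset.json  imagesTr  imagesTs  labelsTr
```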

## Multi-GPU Training

Users can set `NUM_GPUS_PER_NODE`, `NUM_NODES`, and `INDEX_CURRENT_NODE`, as well as `DIR_OF_DATA` for the directory of the test dataset.
Then users can execute the following command to start multi-GPU model training:

```
torchrun --nproc_per_node=NUM_GPUS_PER_NODE --nnodes=NUM_NODES brats_training_ddp.py -d DIR_OF_DATA
```
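
For example, on a single machine with 8 GPUs, using the dataset directory defined above (a concrete sketch; this assumes `-d` expects the directory that contains `Task01_BrainTumour`, as in the download step):

```
torchrun --nproc_per_node=8 --nnodes=1 brats_training_ddp.py -d "${root_dir}"
```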

## Multi-Node Training

Let's take two-node (16 GPUs in total) model training as an example. On the primary node (node rank 0), we run the following command.

```
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=PRIMARY_NODE_IP --master_port=1234 brats_training_ddp.py
```
Here, `PRIMARY_NODE_IP` is the IP address of the first node.
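
If you are unsure of the primary node's address, one way to look it up on a Linux host is shown below (an illustrative command, not part of the training script):

```
hostname -I   # prints the host's IP addresses; use the one reachable from the other node
```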

On the second node (node rank 1), we run the following command.

```
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=PRIMARY_NODE_IP --master_port=1234 brats_training_ddp.py
```

Note that the only difference between the two commands is `--node_rank`.
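
As a concrete sketch with hypothetical values (the primary node IP `192.0.2.10` is only an example; replace it with your own node's address, and adding `-d "${root_dir}"` as in the multi-GPU command assumes both nodes have the dataset at the same path):

```
# on the primary node (rank 0)
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=192.0.2.10 --master_port=1234 brats_training_ddp.py -d "${root_dir}"

# on the second node (rank 1)
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=192.0.2.10 --master_port=1234 brats_training_ddp.py -d "${root_dir}"
```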

There may be some delay between executing the two commands on the two nodes, but the first node will always wait for the second, and the two nodes will start and train together. If there is an IP issue in the validation part during model training, please refer to the solution [here](https://discuss.pytorch.org/t/connect-127-0-1-1-a-port-connection-refused/100802/25) to resolve it.
