The program hangs when running distributed training #92

@mumu029

I want to train the model on two servers with one GPU each. After setting up the configuration and starting the run, the program hangs at one point and produces no further output. Training on a single server works fine.

export MASTER_ADDR=192.168.1.12
export MASTER_PORT=17788
export NODE_RANK=0

(py36tr108cu117) (base) cx@v100:~/ViLT-master$ python run.py with data_root=../../data/TrinityMultimodalTrojAI-main/data/clean/ num_gpus=1 num_nodes=2 task_finetune_vqa_randaug per_gpu_batchsize=64 load_path=../../data/model_weight/vilt_200k_mlm_itm.ckpt
WARNING - root - Changed type of config entry "max_steps" from int to NoneType
WARNING - ViLT - No observers have been added to this run
INFO - ViLT - Running command 'main'
INFO - ViLT - Started
Global seed set to 0
INFO - lightning - Global seed set to 0
GPU available: True, used: True
INFO - lightning - GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO - lightning - TPU available: None, using: 0 TPU cores
Using environment variable NODE_RANK for node rank (0).
INFO - lightning - Using environment variable NODE_RANK for node rank (0).
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO - lightning - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using native 16bit precision.
INFO - lightning - Using native 16bit precision.
Missing logger folder: result/finetune_vqa_randaug_seed0_from_vilt_200k_mlm_itm
WARNING - lightning - Missing logger folder: result/finetune_vqa_randaug_seed0_from_vilt_200k_mlm_itm
Global seed set to 0
INFO - lightning - Global seed set to 0
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
INFO - lightning - initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
INFO - root - Added key: store_based_barrier_key:1 to store for rank: 0

The program hangs at this point and prints nothing further.
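
The last log line appears to come from PyTorch's store-based barrier, where each rank registers itself and then waits for the remaining ranks, so a hang here is consistent with the second node never joining. One thing that may be worth ruling out is whether the second server can reach MASTER_ADDR:MASTER_PORT at all. The following is only a minimal sketch of such a check; the helper name `can_connect` and the 3-second timeout are assumptions for illustration and are not part of ViLT or PyTorch Lightning.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds.

    Hypothetical helper for illustration only; it just tests basic TCP reachability.
    """
    try:
        # create_connection resolves the host and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `can_connect("192.168.1.12", 17788)` on the second server while rank 0 is waiting should return True if the port is reachable. A False result would point to a firewall or network issue rather than the training code, though a True result does not by itself prove that the second process was started with the matching MASTER_ADDR, MASTER_PORT, and NODE_RANK=1.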
