Why is there only one GPU being used even when I set --dp 2? #1206
-
I have `export CUDA_VISIBLE_DEVICES=0,1` and both GPUs have 23 GB of memory. However, I get the following error message after I run:

```
python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-si --port=30000 --chat-template=vicuna_v1.1 --dp-size 2 --enable-p2p-check --chunked-prefill-size=16384 --mem-fraction-static 0.7
```

```
[gpu=0] Init nccl begin.
Traceback (most recent call last):
  ...
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 17.56 MiB is free. Process 4097022 has 23.63 GiB memory in use. Of the allocated memory 23.14 GiB is allocated by PyTorch, and 91.81 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
```

After diving into process.py, I found that when `dp_worker_id=0` and `gpu_ids=[0]`, there are 3 processes:

```
dp_size: 2
whether_daemonic: False False
self: <Process name='Process-1' pid=851486 parent=851370 started>
whether_daemonic: False False
whether_daemonic: False False
self: <Process name='Process-2' pid=851490 parent=851370 started>
self: <Process name='Process-1:1' pid=851489 parent=851486 started>
dp_worker_id: 0
gpu_ids: [0]
```

The 3rd process (`Process-1:1`) is a child of the 1st process (`Process-1`). Is that permitted? Should these processes be daemonic?
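For reference, here is a minimal standalone sketch (not sglang code; the function names are illustrative) of the `multiprocessing` rule in question: a non-daemonic process may start its own children, so the `Process-1` → `Process-1:1` pattern above is legal, whereas a daemonic process raises an `AssertionError` when it tries to spawn a child.

```python
import multiprocessing as mp

def grandchild():
    print("grandchild running")

def child():
    # Legal: a non-daemonic process may start its own children,
    # which is exactly the Process-1 -> Process-1:1 pattern above.
    p = mp.Process(target=grandchild)
    p.start()
    p.join()

if __name__ == "__main__":
    parent = mp.Process(target=child)  # daemon defaults to False
    # parent.daemon = True  # would make child() fail with
    # "AssertionError: daemonic processes are not allowed to have children"
    parent.start()
    parent.join()
```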
-
What is your precision? You should be able to run the fp16 version with 23 GB of memory; maybe you are running an fp32 one. You can add `--dtype float16` when you launch the server.
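As a rough weights-only sanity check: a 7B-parameter model takes about 7B × 2 bytes ≈ 14 GB in fp16, which fits in 23.65 GiB, but about 7B × 4 bytes ≈ 28 GB in fp32, which does not. Applying the suggestion to the command from the question would look like this (all flags other than `--dtype` are unchanged from the original):

```
python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-si --port=30000 --chat-template=vicuna_v1.1 --dp-size 2 --enable-p2p-check --chunked-prefill-size=16384 --mem-fraction-static 0.7 --dtype float16
```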