I am trying to deploy a Baichuan2-7B model on a machine with 2 Tesla V100 GPUs. Unfortunately, each V100 has only 16 GB of memory.
I have applied INT8 weight-only quantization, so the resulting engine is about 8 GB. I have also set --world_size to 2 to use 2-way tensor parallelism.
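For reference, this is roughly the build command I used. It is a sketch assuming the TensorRT-LLM Baichuan example's build.py; the model and output paths are placeholders for my local setup:

```bash
# Sketch of the engine build, assuming the flags exposed by
# examples/baichuan/build.py in TensorRT-LLM; paths are placeholders.
python build.py \
    --model_version v2_7b \
    --model_dir ./Baichuan2-7B-Base \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --world_size 2 \
    --output_dir ./baichuan2_7b_trt_engines/int8_tp2
```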
But when I try to start the Triton server, I always get an Out of Memory error. It seems that one full instance is launched on each GPU, and neither GPU has enough memory on its own. I know that the 32 GB of combined memory is enough to deploy the model, since I have already done so on another machine, but I don't know how to deploy the model across the 2 GPUs here.
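In case it matters, this is approximately how I launch the server. It is a sketch assuming the tensorrtllm_backend launch script; the model repository path is my own:

```bash
# Sketch of the Triton launch, assuming tensorrtllm_backend's
# scripts/launch_triton_server.py. My understanding is that
# --world_size 2 should spawn two MPI ranks sharing the TP=2 engine,
# rather than one full copy of the model per GPU.
python3 scripts/launch_triton_server.py \
    --world_size 2 \
    --model_repo ./triton_model_repo
```

Is there something else I need to configure so that the two ranks each hold only their half of the sharded engine?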