I am trying to deploy a Baichuan2-7B model on a machine with 2 Tesla V100 GPUs. Unfortunately, each V100 has only 16 GB of memory.
I have applied INT8 weight-only quantization, so the resulting engine is about 8 GB. I have also set --world_size to 2 to use 2-way tensor parallelism.
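For reference, this is roughly the build command I used. It is a sketch assuming the TensorRT-LLM Baichuan example's build.py; the model and output paths are placeholders for my local setup:

```bash
# Sketch of the engine build, assuming the flags exposed by
# examples/baichuan/build.py in TensorRT-LLM; paths are placeholders.
python build.py \
    --model_version v2_7b \
    --model_dir ./Baichuan2-7B-Base \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --world_size 2 \
    --output_dir ./baichuan2_7b_trt_engines/int8_tp2
```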
But when I try to start the Triton server, I always get an Out of Memory error. It seems that one full instance is launched on each GPU, and neither GPU has enough memory on its own. I know that the 32 GB of combined memory is enough to deploy the model, since I have already done so on another machine, but I don't know how to deploy the model across the 2 GPUs here.
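In case it matters, this is approximately how I launch the server. It is a sketch assuming the tensorrtllm_backend launch script; the model repository path is my own:

```bash
# Sketch of the Triton launch, assuming tensorrtllm_backend's
# scripts/launch_triton_server.py. My understanding is that
# --world_size 2 should spawn two MPI ranks sharing the TP=2 engine,
# rather than one full copy of the model per GPU.
python3 scripts/launch_triton_server.py \
    --world_size 2 \
    --model_repo ./triton_model_repo
```

Is there something else I need to configure so that the two ranks each hold only their half of the sharded engine?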