Optimal EC2 configuration and vLLM settings for max concurrency? #9773
Replies: 3 comments
-
Going a bit further, I see the issue is that the GPU cache is getting to 99% and then pending requests pile up. I guess my question then is: what engine flags would allow me to stretch the GPU cache under concurrent requests?
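For reference, these are the engine args I've found so far that seem to govern KV-cache headroom; the values below are only illustrative, not what we actually run:

```yaml
# Flags from the engine-args docs that appear to control KV-cache headroom.
# Values are illustrative only.
command:
  - --gpu-memory-utilization
  - "0.95"       # fraction of VRAM vLLM claims for weights + KV cache (default 0.9)
  - --max-model-len
  - "8192"       # a shorter max context means fewer KV blocks needed per sequence
  - --max-num-seqs
  - "128"        # cap on sequences scheduled together in one step
  - --swap-space
  - "8"          # GiB of CPU RAM used for sequences preempted out of the GPU cache
```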
-
+1, would really like to know the answer to this.
-
You can look at https://docs.vllm.ai/en/latest/performance/optimization.html, especially the section on preemption and KV-cache usage. For your questions I can only guess, but for speed it is almost always best to squeeze everything onto the GPU if possible; CPU offloading is slower and only makes sense when there is not enough VRAM. The behaviour you describe (KV cache at 99%, requests queuing) looks like the preemption case covered in that article.
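As a rough sketch (the numbers are made up for a 4-GPU box like yours), "squeeze everything onto the GPU" usually means sharding the model and giving vLLM most of the VRAM, while the offloading flags stay off unless the model genuinely does not fit:

```yaml
# Preferred when the model fits across the GPUs: shard it and keep it all in VRAM.
command:
  - --tensor-parallel-size
  - "4"          # split the weights across the 4 GPUs
  - --gpu-memory-utilization
  - "0.95"       # leave only a small safety margin for the runtime
# Fallbacks that trade speed for capacity; skip them if the model fits:
#   --cpu-offload-gb <GiB>   keep part of the weights in CPU RAM
#   --swap-space <GiB>       CPU swap target for preempted sequences
```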
-
Thank you for such a great open source project.
**Use Case**
We're building a chatbot and aiming for consistent, responsive performance under concurrent user load. At around 15 concurrent requests, processing delays reach up to 30 seconds before streaming begins. Streaming speed is good once it starts, but we'd prefer requests to start sooner, even if they then stream more slowly. We're also looking for the optimal vLLM settings for our hardware.
**Configuration**
vLLM running on a g4dn.12xlarge with the AMI "Deep Learning Base Proprietary Nvidia Driver AMI (Amazon Linux 2) Version 61.1".
**Model** is Pixtral70b
docker-compose:
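A minimal sketch of the service (the image tag, cache path, and model id below are placeholders, not our exact values; `--tensor-parallel-size` is set so the model spans all four T4s):

```yaml
# Sketch of the compose service; placeholders are marked in comments.
services:
  vllm:
    image: vllm/vllm-openai:latest          # image tag is a placeholder
    ipc: host                               # shared memory for tensor-parallel workers
    ports:
      - "8000:8000"                         # OpenAI-compatible API
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface   # reuse downloaded weights
    # The image's entrypoint launches the OpenAI API server, so command holds engine args only.
    command:
      - --model
      - <pixtral-model-id>                  # placeholder for the actual HF model id
      - --tensor-parallel-size
      - "4"                                 # shard across the instance's 4 T4 GPUs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4
              capabilities: [gpu]
```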
The instance is underutilized, with CPU utilization of only ~6% of the available 48 vCPUs. When slammed with requests, GPU utilization is near 100% on each of the 4 GPUs, but GPU memory utilization is only around 63% per GPU.
All vLLM settings are at their defaults.
**Questions/Concerns**
The vLLM engine args doc leaves us with some questions:

1. Is `--max-num-batched-tokens=512` too low? Would increasing it help with concurrent request handling?
2. Would `--cpu-offload-gb` or `--swap-space` improve performance?
3. Should we increase `--gpu-memory-utilization` above the default of 0.9?
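To make these concrete, the kind of adjustment we're considering looks like the following; every value is a guess, which is exactly what we'd like a sanity check on:

```yaml
# Candidate tuning changes (all values are guesses pending feedback):
command:
  - --max-num-batched-tokens
  - "2048"       # up from 512; allow more prefill tokens per scheduling step
  - --gpu-memory-utilization
  - "0.95"       # up from the 0.9 default, to grow the KV cache
  - --swap-space
  - "8"          # only if preempting to CPU swap actually helps latency
```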