Optimal EC2 configuration and vLLM settings for max concurrency? #9773
Replies: 3 comments
-
Going a bit further, I see the issue is that the GPU cache is getting to 99% and then pending requests pile up. I guess my question then is: what engine flags would allow me to stretch the GPU cache under concurrent requests?
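For reference, these are the engine args I've found so far that seem to govern KV-cache headroom; the values below are only illustrative, not what we actually run:

```yaml
# Flags from the engine-args docs that appear to control KV-cache headroom.
# Values are illustrative only.
command:
  - --gpu-memory-utilization
  - "0.95"       # fraction of VRAM vLLM claims for weights + KV cache (default 0.9)
  - --max-model-len
  - "8192"       # a shorter max context means fewer KV blocks needed per sequence
  - --max-num-seqs
  - "128"        # cap on sequences scheduled together in one step
  - --swap-space
  - "8"          # GiB of CPU RAM used for sequences preempted out of the GPU cache
```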
-
+1, would really like to know the answer to this.
-
You can look at https://docs.vllm.ai/en/latest/performance/optimization.html, especially the section on preemption and KV-cache usage. For your questions I can only guess, but for speed it is almost always best to squeeze everything onto the GPU if possible; CPU offloading is slower and only makes sense when there is not enough VRAM. The behaviour you describe (KV cache at 99%, requests queuing) looks like the preemption case covered in that article.
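As a rough sketch (the numbers are made up for a 4-GPU box like yours), "squeeze everything onto the GPU" usually means sharding the model and giving vLLM most of the VRAM, while the offloading flags stay off unless the model genuinely does not fit:

```yaml
# Preferred when the model fits across the GPUs: shard it and keep it all in VRAM.
command:
  - --tensor-parallel-size
  - "4"          # split the weights across the 4 GPUs
  - --gpu-memory-utilization
  - "0.95"       # leave only a small safety margin for the runtime
# Fallbacks that trade speed for capacity; skip them if the model fits:
#   --cpu-offload-gb <GiB>   keep part of the weights in CPU RAM
#   --swap-space <GiB>       CPU swap target for preempted sequences
```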
-
Thank you for such a great open source project.
**Use Case**
We're building a chatbot and aiming for consistent, responsive performance under concurrent user load. At around 15 concurrent requests, processing delays reach up to 30 seconds before streaming begins. Streaming speed is good once it starts, but we'd prefer requests to start sooner, even if they then stream more slowly. We're also looking for the optimal vLLM settings for our hardware.
**Configuration**
vLLM running on a g4dn.12xlarge with the AMI "Deep Learning Base Proprietary Nvidia Driver AMI (Amazon Linux 2) Version 61.1".
**Model** is Pixtral70b
docker-compose:
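A minimal sketch of the service (the image tag, cache path, and model id below are placeholders, not our exact values; `--tensor-parallel-size` is set so the model spans all four T4s):

```yaml
# Sketch of the compose service; placeholders are marked in comments.
services:
  vllm:
    image: vllm/vllm-openai:latest          # image tag is a placeholder
    ipc: host                               # shared memory for tensor-parallel workers
    ports:
      - "8000:8000"                         # OpenAI-compatible API
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface   # reuse downloaded weights
    # The image's entrypoint launches the OpenAI API server, so command holds engine args only.
    command:
      - --model
      - <pixtral-model-id>                  # placeholder for the actual HF model id
      - --tensor-parallel-size
      - "4"                                 # shard across the instance's 4 T4 GPUs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4
              capabilities: [gpu]
```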
The instance is underutilized, with CPU utilization of only ~6% of the available 48 vCPUs. When slammed with requests, GPU utilization is near 100% on each of the 4 GPUs, but GPU memory utilization is only around 63% per GPU.
All vLLM settings are at their defaults.
**Questions/Concerns**
The vLLM engine args doc leaves us with some questions:

1. Is `--max-num-batched-tokens=512` too low? Would increasing it help with concurrent request handling?
2. Would `--cpu-offload-gb` or `--swap-space` improve performance?
3. Should we increase `--gpu-memory-utilization` above the default of 0.9?
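To make these concrete, the kind of adjustment we're considering looks like the following; every value is a guess, which is exactly what we'd like a sanity check on:

```yaml
# Candidate tuning changes (all values are guesses pending feedback):
command:
  - --max-num-batched-tokens
  - "2048"       # up from 512; allow more prefill tokens per scheduling step
  - --gpu-memory-utilization
  - "0.95"       # up from the 0.9 default, to grow the KV cache
  - --swap-space
  - "8"          # only if preempting to CPU swap actually helps latency
```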