
Aggressive retry mechanism causing VLLM shared memory exhaustion during streaming #279

@akram


Issue Description

When using the LlamaStack Python client with a VLLM backend, the retry mechanism during streaming responses can lead to VLLM shared memory exhaustion.

System Architecture

VLLM ← LlamaStack ← Client Application

Observed Behavior

  1. An initial streaming request is made and starts receiving chunks
  2. While waiting for further response chunks from VLLM, the LlamaStack client automatically retries the request
  3. Multiple identical requests accumulate, eventually exhausting VLLM's shared memory

Log Evidence

First attempt:

DEBUG 10-10 05:27:15 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 10-10 05:27:16 [metrics.py:386] Avg prompt throughput: 0.0 tokens/s

Second attempt (automatic retry):

INFO 10-10 05:27:20 [logger.py:41] Received request chatcmpl-912e2a2b68f44d46a991b3b37cf7895a

Third attempt (another retry):

INFO 10-10 05:27:51 [logger.py:41] Received request chatcmpl-ea4023145df84ba183683a38983e5977

Final error:

ERROR 10-10 05:28:03 [serving_chat.py:932] vllm.engine.multiprocessing.MQEngineDeadError: Engine loop is not running. Inspect the stacktrace to find the original error: OutOfResources(98304, 65536, 'shared memory').

Current Implementation

The client currently has the following retry behavior:

  1. Default configuration:

     DEFAULT_MAX_RETRIES = 2
     MAX_RETRY_DELAY = 8  # seconds
     INITIAL_RETRY_DELAY = 0.5  # seconds

  2. Retry conditions (from _should_retry):

     # Retry on request timeouts (408)
     # Retry on lock timeouts (409)
     # Retry on rate limits (429)
     # Retry on server errors (502, 503, 504)
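
For reference, this is roughly what the status-based check amounts to. The sketch below is illustrative only: the status codes come from the comments above, but the exact signature of _should_retry is an assumption, not the client's actual source.

    # Illustrative sketch of the retry-eligibility check described above.
    RETRYABLE_STATUS_CODES = {408, 409, 429, 502, 503, 504}

    def _should_retry(status_code: int) -> bool:
        # Retry on request timeouts (408), lock timeouts (409),
        # rate limits (429), and server errors (502, 503, 504).
        return status_code in RETRYABLE_STATUS_CODES

In the streaming failure described above, the original request is still being processed by VLLM when a retry fires, so each retry adds another in-flight request.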

Impact

  • VLLM crashes due to shared memory exhaustion
  • Multiple redundant requests consume resources
  • Potential cascading failures in the system

Suggested Solutions

  1. Add streaming-specific retry configuration:

     class StreamingConfig:
         max_retries: int = 0
         retry_on_timeout: bool = False
         retry_on_server_error: bool = False

  2. Add a header to indicate streaming requests that should bypass retries:

     X-LlamaStack-Streaming: true

  3. Or simply detect streaming requests and disable retries automatically:

     if options.stream:
         options.max_retries = 0

The third option might be the safest, as it prevents potential memory exhaustion by default while still allowing users to explicitly enable retries if needed.
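
To make the third option concrete, here is a minimal sketch of how the check could sit in the request path. FinalRequestOptions, its fields, and effective_max_retries are hypothetical names used only for illustration; they are not taken from the client's source.

    from dataclasses import dataclass

    DEFAULT_MAX_RETRIES = 2  # matches the default cited above

    # Hypothetical stand-in for the client's per-request options object.
    @dataclass
    class FinalRequestOptions:
        stream: bool = False
        max_retries: int = DEFAULT_MAX_RETRIES

    def effective_max_retries(options: FinalRequestOptions) -> int:
        # A partially consumed stream cannot be safely replayed, so the
        # safest default is to never retry streaming requests automatically.
        if options.stream:
            return 0
        return options.max_retries

Non-streaming requests keep the existing retry behavior, while streaming requests default to zero retries.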

Questions

  1. Is there a way to configure retry behavior specifically for streaming requests in the current version?
  2. Are there plans to add more granular retry control?
  3. Is there a recommended way to handle streaming retries differently from regular requests?

Environment

  • LlamaStack Python Client (latest version)
  • VLLM backend
  • Python 3.x
