Issue Description
When using the LlamaStack Python client with VLLM backend, the retry mechanism during streaming responses can lead to VLLM shared memory exhaustion.
System Architecture
VLLM <- LlamaStack <- Client Application
Observed Behavior
- An initial streaming request is made and starts receiving chunks
- While waiting for VLLM response chunks, the LlamaStack client automatically retries the request (a minimal repro sketch follows this list)
- Multiple identical requests accumulate, eventually causing VLLM to run out of shared memory
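For reference, the kind of call that triggers this looks roughly like the sketch below. The base URL and model ID are placeholders, and it assumes the client's inference.chat_completion streaming API; it is not taken from the actual application that hit the error.

```python
from llama_stack_client import LlamaStackClient

# Placeholder endpoint and model; adjust for your deployment.
client = LlamaStackClient(base_url="http://localhost:8321")

# A single streaming chat completion. If chunks are slow to arrive and the
# client decides the request is retryable, it re-issues the same request
# while VLLM is still serving the first one.
stream = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a long story."}],
    stream=True,
)

for chunk in stream:
    print(chunk)
```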
Logs Evidence
First attempt:
DEBUG 10-10 05:27:15 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 10-10 05:27:16 [metrics.py:386] Avg prompt throughput: 0.0 tokens/s
Second attempt (automatic retry):
INFO 10-10 05:27:20 [logger.py:41] Received request chatcmpl-912e2a2b68f44d46a991b3b37cf7895a
Third attempt (another retry):
INFO 10-10 05:27:51 [logger.py:41] Received request chatcmpl-ea4023145df84ba183683a38983e5977
Final error:
ERROR 10-10 05:28:03 [serving_chat.py:932] vllm.engine.multiprocessing.MQEngineDeadError: Engine loop is not running. Inspect the stacktrace to find the original error: OutOfResources(98304, 65536, 'shared memory').
Current Implementation
The client currently has the following retry behavior:
- Default configuration:
```python
DEFAULT_MAX_RETRIES = 2
MAX_RETRY_DELAY = 8  # seconds
INITIAL_RETRY_DELAY = 0.5  # seconds
```
- Retry conditions (from _should_retry):
```python
# Retry on request timeouts (408)
# Retry on lock timeouts (409)
# Retry on rate limits (429)
# Retry on server errors (502, 503, 504)
```
Impact
- VLLM crashes due to shared memory exhaustion
- Multiple redundant requests consume resources
- Potential cascading failures in the system
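For context on how those defaults interact with streaming, the retry behavior described under Current Implementation is roughly equivalent to the sketch below. This is an illustrative reconstruction, not the client's actual code: send_request is a hypothetical stand-in for one HTTP attempt, and the constants simply mirror the defaults listed above.

```python
import time

DEFAULT_MAX_RETRIES = 2
INITIAL_RETRY_DELAY = 0.5  # seconds
MAX_RETRY_DELAY = 8        # seconds

# Status codes listed in _should_retry: request timeouts, lock timeouts,
# rate limits, and upstream server errors.
RETRYABLE_STATUS_CODES = {408, 409, 429, 502, 503, 504}


def request_with_retries(send_request, max_retries=DEFAULT_MAX_RETRIES):
    """Illustrative retry loop with capped exponential backoff.

    send_request performs one attempt and returns (status_code, response).
    Nothing here distinguishes streaming from non-streaming requests, so a
    stream that times out mid-flight is re-sent in full -- which is how
    duplicate requests pile up on the VLLM backend.
    """
    delay = INITIAL_RETRY_DELAY
    status, response = send_request()
    for _ in range(max_retries):
        if status < 400 or status not in RETRYABLE_STATUS_CODES:
            break
        time.sleep(delay)
        delay = min(delay * 2, MAX_RETRY_DELAY)
        status, response = send_request()
    return response
```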
Suggested Solutions
- Add streaming-specific retry configuration:
```python
class StreamingConfig:
    max_retries: int = 0
    retry_on_timeout: bool = False
    retry_on_server_error: bool = False
```
- Add a header to indicate streaming requests that should bypass retries:
```
X-LlamaStack-Streaming: true
```
- Or simply detect streaming requests and disable retries automatically:
```python
if options.stream:
    options.max_retries = 0
```
The third option might be the safest, as it would prevent any potential memory exhaustion issues by default while still allowing users to explicitly enable retries if needed.
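Until something like the above lands, one possible user-side workaround is to disable retries explicitly for streaming calls. The sketch below assumes the client follows the common Stainless-style pattern of exposing with_options(max_retries=...) (and a max_retries constructor argument); if the installed version does not provide these, treat this purely as a hypothetical.

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")  # placeholder URL

# Assumption: with_options(max_retries=0) returns a client copy whose
# requests are never retried, so a slow stream cannot be re-issued.
no_retry_client = client.with_options(max_retries=0)

stream = no_retry_client.inference.chat_completion(
    model_id="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    print(chunk)
```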
Questions
- Is there a way to configure retry behavior specifically for streaming requests in the current version?
- Are there plans to add more granular retry control?
- Is there a recommended way to handle streaming retries differently from regular requests?
Environment
- LlamaStack Python Client (latest version)
- VLLM backend
- Python 3.x