Issue Description
When using the LlamaStack Python client with VLLM backend, the retry mechanism during streaming responses can lead to VLLM shared memory exhaustion.
System Architecture
VLLM <- LlamaStack <- Client Application
Observed Behavior
- An initial streaming request is made and starts receiving chunks
- While waiting for VLLM response chunks, the LlamaStack client automatically retries the request (a minimal repro sketch follows this list)
- Multiple identical requests accumulate, eventually causing VLLM to run out of shared memory
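For reference, the kind of call that triggers this looks roughly like the sketch below. The base URL and model ID are placeholders, and it assumes the client's inference.chat_completion streaming API; it is not taken from the actual application that hit the error.

```python
from llama_stack_client import LlamaStackClient

# Placeholder endpoint and model; adjust for your deployment.
client = LlamaStackClient(base_url="http://localhost:8321")

# A single streaming chat completion. If chunks are slow to arrive and the
# client decides the request is retryable, it re-issues the same request
# while VLLM is still serving the first one.
stream = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a long story."}],
    stream=True,
)

for chunk in stream:
    print(chunk)
```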
Logs Evidence
First attempt:
DEBUG 10-10 05:27:15 [client.py:193] Waiting for output from MQLLMEngine.
DEBUG 10-10 05:27:16 [metrics.py:386] Avg prompt throughput: 0.0 tokens/s
Second attempt (automatic retry):
INFO 10-10 05:27:20 [logger.py:41] Received request chatcmpl-912e2a2b68f44d46a991b3b37cf7895a
Third attempt (another retry):
INFO 10-10 05:27:51 [logger.py:41] Received request chatcmpl-ea4023145df84ba183683a38983e5977
Final error:
ERROR 10-10 05:28:03 [serving_chat.py:932] vllm.engine.multiprocessing.MQEngineDeadError: Engine loop is not running. Inspect the stacktrace to find the original error: OutOfResources(98304, 65536, 'shared memory').
Current Implementation
The client currently has the following retry behavior:
- Default configuration:
```python
DEFAULT_MAX_RETRIES = 2
MAX_RETRY_DELAY = 8  # seconds
INITIAL_RETRY_DELAY = 0.5  # seconds
```
- Retry conditions (from _should_retry):
```python
# Retry on request timeouts (408)
# Retry on lock timeouts (409)
# Retry on rate limits (429)
# Retry on server errors (502, 503, 504)
```
Impact
- VLLM crashes due to shared memory exhaustion
- Multiple redundant requests consume resources
- Potential cascading failures in the system
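For context on how those defaults interact with streaming, the retry behavior described under Current Implementation is roughly equivalent to the sketch below. This is an illustrative reconstruction, not the client's actual code: send_request is a hypothetical stand-in for one HTTP attempt, and the constants simply mirror the defaults listed above.

```python
import time

DEFAULT_MAX_RETRIES = 2
INITIAL_RETRY_DELAY = 0.5  # seconds
MAX_RETRY_DELAY = 8        # seconds

# Status codes listed in _should_retry: request timeouts, lock timeouts,
# rate limits, and upstream server errors.
RETRYABLE_STATUS_CODES = {408, 409, 429, 502, 503, 504}


def request_with_retries(send_request, max_retries=DEFAULT_MAX_RETRIES):
    """Illustrative retry loop with capped exponential backoff.

    send_request performs one attempt and returns (status_code, response).
    Nothing here distinguishes streaming from non-streaming requests, so a
    stream that times out mid-flight is re-sent in full -- which is how
    duplicate requests pile up on the VLLM backend.
    """
    delay = INITIAL_RETRY_DELAY
    status, response = send_request()
    for _ in range(max_retries):
        if status < 400 or status not in RETRYABLE_STATUS_CODES:
            break
        time.sleep(delay)
        delay = min(delay * 2, MAX_RETRY_DELAY)
        status, response = send_request()
    return response
```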
Suggested Solutions
- Add streaming-specific retry configuration:
```python
class StreamingConfig:
    max_retries: int = 0
    retry_on_timeout: bool = False
    retry_on_server_error: bool = False
```
- Add a header to indicate streaming requests that should bypass retries:
```
X-LlamaStack-Streaming: true
```
- Or simply detect streaming requests and disable retries automatically:
```python
if options.stream:
    options.max_retries = 0
```
The third option might be the safest, as it would prevent any potential memory exhaustion issues by default while still allowing users to explicitly enable retries if needed.
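Until something like the above lands, one possible user-side workaround is to disable retries explicitly for streaming calls. The sketch below assumes the client follows the common Stainless-style pattern of exposing with_options(max_retries=...) (and a max_retries constructor argument); if the installed version does not provide these, treat this purely as a hypothetical.

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")  # placeholder URL

# Assumption: with_options(max_retries=0) returns a client copy whose
# requests are never retried, so a slow stream cannot be re-issued.
no_retry_client = client.with_options(max_retries=0)

stream = no_retry_client.inference.chat_completion(
    model_id="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    print(chunk)
```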
Questions
- Is there a way to configure retry behavior specifically for streaming requests in the current version?
- Are there plans to add more granular retry control?
- Is there a recommended way to handle streaming retries differently from regular requests?
Environment
- LlamaStack Python Client (latest version)
- VLLM backend
- Python 3.x