TRT-LLM decoupled streaming mode stops generating tokens after first request #1866

@DovudAsadov

Description

TRT-LLM decoupled streaming mode (used by the Triton cosyvoice2 BLS model for gRPC streaming TTS) stops generating tokens after the first request. The exact same config with decoupled: False works perfectly for unlimited requests.

Reproduction

Environment:

  • GPU: NVIDIA L4 (24GB)
  • Tested on TWO Triton images:
    • soar97/triton-cosyvoice:25.06 (TRT-LLM 0.20.x)
    • nvcr.io/nvidia/tritonserver:25.04-trtllm-python-py3 (TRT-LLM 0.18.2)
  • CosyVoice2-0.5B model
  • Using the unmodified upstream model_repo/cosyvoice2/1/model.py from this repo

Steps:

  1. Build TRT engines and set up model repo per runtime/triton_trtllm/run.sh stages 0-3
  2. Set DECOUPLED_MODE=True (for streaming)
  3. Start Triton server
  4. Send sequential gRPC streaming requests (a minimal client sketch follows this list)
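
For step 4, here is a minimal repro client, assuming the default gRPC port 8001. The "text" tensor name, its shape, and the end-of-stream handling are illustrative placeholders, not the exact cosyvoice2 BLS signature; match them against model_repo/cosyvoice2/1/model.py:

```python
# Sequential decoupled (streaming) requests: request 1 works, request 2
# degrades, request 3+ hangs. The "text" input name is a placeholder.
import queue

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException

responses = queue.Queue()

def on_response(result, error):
    # Invoked once per decoupled response (one audio chunk each).
    responses.put(error if error is not None else result)

with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback=on_response)
    for i in range(5):
        text = np.array([[b"text to synthesize"]], dtype=np.object_)
        inp = grpcclient.InferInput("text", text.shape, "BYTES")
        inp.set_data_from_numpy(text)
        client.async_stream_infer(
            "cosyvoice2", [inp], request_id=str(i),
            enable_empty_final_response=True,  # ask for an explicit end-of-stream response
        )
        chunks = 0
        while True:
            try:
                result = responses.get(timeout=60)
            except queue.Empty:
                print(f"request {i}: stream stalled after {chunks} chunk(s)")
                break
            if isinstance(result, InferenceServerException):
                raise result
            params = result.get_response().parameters
            if "triton_final_response" in params and params["triton_final_response"].bool_param:
                print(f"request {i}: completed with {chunks} chunk(s)")
                break
            chunks += 1
    client.stop_stream()
```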

Config (minimal; the HTTP config with only decoupled changed):

```
# cosyvoice2/config.pbtxt
model_transaction_policy { decoupled: True }

# tensorrt_llm/config.pbtxt
model_transaction_policy { decoupled: True }
```

All other settings are identical to the working HTTP/offline config (same engine, same KV cache, same batch size).

Observed Behavior

| Request | Result |
| ------- | ------ |
| 1 | Works perfectly: 3 s latency, 3 audio chunks, 2.16 s of audio |
| 2 | Degraded: 25 s latency, 1 chunk, 0.04 s of audio (960 samples) |
| 3+ | Dead: TRT-LLM generates 0 tokens, hangs forever |

Triton logs during the stuck state:

```
Active Request Count: 1
Generation Requests: 1
Used KV cache blocks: 5    ← frozen, never grows
Free KV cache blocks: 75   ← plenty available
GPU utilization: 0%        ← no GPU kernels dispatched
```

TRT-LLM's scheduler loop keeps running on the CPU (the iteration counter keeps incrementing) but dispatches zero GPU work. The engine stays permanently stuck until the container is restarted.
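
One way to watch these counters from outside the container is Triton's Prometheus metrics endpoint. A small polling sketch, assuming the default metrics port 8002 (the exact KV-cache metric names vary across TRT-LLM backend versions, so this simply filters on "kv_cache"):

```python
# Print the KV-cache related Prometheus counters every few seconds.
# In the stuck state they stop changing even though a request is "active".
import time
import urllib.request

while True:
    body = urllib.request.urlopen("http://localhost:8002/metrics").read().decode()
    for line in body.splitlines():
        if "kv_cache" in line and not line.startswith("#"):
            print(line)
    print("---")
    time.sleep(5)
```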

What Works

Setting decoupled: False in both the cosyvoice2 and tensorrt_llm configs (HTTP/offline mode) works perfectly for unlimited sequential requests with the exact same engine and model weights; a sketch of that call path follows.
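
For contrast, the working offline path, assuming the default HTTP port 8000; as above, the tensor names are placeholders to be matched against the real BLS signature:

```python
# The non-decoupled (HTTP/offline) path that works: a plain blocking infer()
# against the same model. Tensor names "text"/"audio" are placeholders.
import numpy as np
import tritonclient.http as httpclient

with httpclient.InferenceServerClient("localhost:8000") as client:
    for i in range(100):  # arbitrarily many sequential requests succeed
        text = np.array([[b"text to synthesize"]], dtype=np.object_)
        inp = httpclient.InferInput("text", text.shape, "BYTES")
        inp.set_data_from_numpy(text)
        result = client.infer("cosyvoice2", [inp], request_id=str(i))
        audio = result.as_numpy("audio")  # placeholder output name
```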

Investigation Summary

We systematically isolated the issue:

  1. Different Triton images: same bug on soar97/triton-cosyvoice:25.06 and on the official nvcr.io/nvidia/tritonserver:25.04-trtllm-python-py3
  2. Different TRT engines: same bug with both int8 and bfloat16 engines
  3. Different configs: same bug with the HTTP config (KV cache reuse, CUDA graphs, etc.) and with the gRPC config
  4. Custom model.py ruled out: same bug with the unmodified upstream model_repo/cosyvoice2/1/model.py from this repo
  5. Queue settings: same bug with max_queue_delay_microseconds: 0 (matching run.sh)
  6. Setting decoupled: False works perfectly with the exact same engine, weights, and config

The ONLY variable that causes the failure is decoupled: True.

Expected Behavior

Sequential gRPC streaming requests should work reliably, same as HTTP/offline mode.

System Info

```
GPU: NVIDIA L4 (24GB VRAM)
Triton Server: 2.57.0 (25.04 release)
TRT-LLM: 0.18.2
CosyVoice: commit 8b54619
Model: CosyVoice2-0.5B (bfloat16 TRT engine)
KV cache: 2560 tokens (40 blocks x 64 tokens)
max_batch_size: 4
BLS instance_group count: 1
```
