TRT-LLM decoupled streaming mode stops generating tokens after first request #1866

@DovudAsadov

Description

TRT-LLM decoupled streaming mode (used by the Triton cosyvoice2 BLS model for gRPC streaming TTS) stops generating tokens after the first request. The exact same config with decoupled: False works perfectly for unlimited requests.

Reproduction

Environment:

  • GPU: NVIDIA L4 (24GB)
  • Tested on TWO Triton images:
    • soar97/triton-cosyvoice:25.06 (TRT-LLM 0.20.x)
    • nvcr.io/nvidia/tritonserver:25.04-trtllm-python-py3 (TRT-LLM 0.18.2)
  • CosyVoice2-0.5B model
  • Using the unmodified upstream model_repo/cosyvoice2/1/model.py from this repo

Steps:

  1. Build TRT engines and set up model repo per runtime/triton_trtllm/run.sh stages 0-3
  2. Set DECOUPLED_MODE=True (for streaming)
  3. Start Triton server
  4. Send sequential gRPC streaming requests (a minimal client sketch follows this list)
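
For step 4, here is a minimal repro client, assuming the default gRPC port 8001. The "text" tensor name, its shape, and the end-of-stream handling are illustrative placeholders, not the exact cosyvoice2 BLS signature; match them against model_repo/cosyvoice2/1/model.py:

```python
# Sequential decoupled (streaming) requests: request 1 works, request 2
# degrades, request 3+ hangs. The "text" input name is a placeholder.
import queue

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException

responses = queue.Queue()

def on_response(result, error):
    # Invoked once per decoupled response (one audio chunk each).
    responses.put(error if error is not None else result)

with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback=on_response)
    for i in range(5):
        text = np.array([[b"text to synthesize"]], dtype=np.object_)
        inp = grpcclient.InferInput("text", text.shape, "BYTES")
        inp.set_data_from_numpy(text)
        client.async_stream_infer(
            "cosyvoice2", [inp], request_id=str(i),
            enable_empty_final_response=True,  # ask for an explicit end-of-stream response
        )
        chunks = 0
        while True:
            try:
                result = responses.get(timeout=60)
            except queue.Empty:
                print(f"request {i}: stream stalled after {chunks} chunk(s)")
                break
            if isinstance(result, InferenceServerException):
                raise result
            params = result.get_response().parameters
            if "triton_final_response" in params and params["triton_final_response"].bool_param:
                print(f"request {i}: completed with {chunks} chunk(s)")
                break
            chunks += 1
    client.stop_stream()
```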

Config (minimal; the HTTP config with only decoupled changed):

```
# cosyvoice2/config.pbtxt
model_transaction_policy { decoupled: True }

# tensorrt_llm/config.pbtxt
model_transaction_policy { decoupled: True }
```

All other settings are identical to the working HTTP/offline config (same engine, same KV cache, same batch size).

Observed Behavior

| Request | Result |
| ------- | ------ |
| 1 | Works perfectly: 3 s latency, 3 audio chunks, 2.16 s of audio |
| 2 | Degraded: 25 s latency, 1 chunk, 0.04 s of audio (960 samples) |
| 3+ | Dead: TRT-LLM generates 0 tokens, hangs forever |

Triton logs during the stuck state:

```
Active Request Count: 1
Generation Requests: 1
Used KV cache blocks: 5    ← frozen, never grows
Free KV cache blocks: 75   ← plenty available
GPU utilization: 0%        ← no GPU kernels dispatched
```

TRT-LLM's scheduler loop keeps running on the CPU (the iteration counter keeps incrementing) but dispatches zero GPU work. The engine stays permanently stuck until the container is restarted.
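
One way to watch these counters from outside the container is Triton's Prometheus metrics endpoint. A small polling sketch, assuming the default metrics port 8002 (the exact KV-cache metric names vary across TRT-LLM backend versions, so this simply filters on "kv_cache"):

```python
# Print the KV-cache related Prometheus counters every few seconds.
# In the stuck state they stop changing even though a request is "active".
import time
import urllib.request

while True:
    body = urllib.request.urlopen("http://localhost:8002/metrics").read().decode()
    for line in body.splitlines():
        if "kv_cache" in line and not line.startswith("#"):
            print(line)
    print("---")
    time.sleep(5)
```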

What Works

Setting decoupled: False in both the cosyvoice2 and tensorrt_llm configs (HTTP/offline mode) works perfectly for unlimited sequential requests with the exact same engine and model weights; a sketch of that call path follows.
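
For contrast, the working offline path, assuming the default HTTP port 8000; as above, the tensor names are placeholders to be matched against the real BLS signature:

```python
# The non-decoupled (HTTP/offline) path that works: a plain blocking infer()
# against the same model. Tensor names "text"/"audio" are placeholders.
import numpy as np
import tritonclient.http as httpclient

with httpclient.InferenceServerClient("localhost:8000") as client:
    for i in range(100):  # arbitrarily many sequential requests succeed
        text = np.array([[b"text to synthesize"]], dtype=np.object_)
        inp = httpclient.InferInput("text", text.shape, "BYTES")
        inp.set_data_from_numpy(text)
        result = client.infer("cosyvoice2", [inp], request_id=str(i))
        audio = result.as_numpy("audio")  # placeholder output name
```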

Investigation Summary

We systematically isolated the issue:

  1. Different Triton images: same bug on soar97/triton-cosyvoice:25.06 and on the official nvcr.io/nvidia/tritonserver:25.04-trtllm-python-py3
  2. Different TRT engines: same bug with both int8 and bfloat16 engines
  3. Different configs: same bug with the HTTP config (KV cache reuse, CUDA graphs, etc.) and with the gRPC config
  4. Custom model.py ruled out: same bug with the unmodified upstream model_repo/cosyvoice2/1/model.py from this repo
  5. Queue settings: same bug with max_queue_delay_microseconds: 0 (matching run.sh)
  6. Setting decoupled: False works perfectly with the exact same engine, weights, and config

The ONLY variable that causes the failure is decoupled: True.

Expected Behavior

Sequential gRPC streaming requests should work reliably, same as HTTP/offline mode.

System Info

```
GPU: NVIDIA L4 (24GB VRAM)
Triton Server: 2.57.0 (25.04 release)
TRT-LLM: 0.18.2
CosyVoice: commit 8b54619
Model: CosyVoice2-0.5B (bfloat16 TRT engine)
KV cache: 2560 tokens (40 blocks x 64 tokens)
max_batch_size: 4
BLS instance_group count: 1
```
