I was trying to modify the TRT-LLM code to add something my model needs that is not implemented yet, and I got quite confused by the number of buffer classes that exist. Can you please explain how they are structured? Here is my current understanding, with questions (see the sketch after the list for my mental model):
Runtime buffers - used for storing the values bound to the TRT engine inputs and outputs.
Decoder buffers - if I understand correctly, "decoder" here does not mean an actual transformer decoder, but rather the sampler that samples discrete tokens from the model's logits and does all the beam-search logic.
Decoder inputs and outputs - I still don't understand why we need these if we already have the decoder buffers.
Encoder buffers - why do we need these as well, if the encoder runs in a separate model instance which has its own runtime buffers?
Req slots - what are those?
Also, why does the executor instance hold multiple sets of buffers that split the work into some "mini batches"?
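To make the question concrete, here is a minimal C++ sketch of my current mental model. The top-level names roughly follow the classes I saw while reading the code, but every member, comment, and the guesses about req slots and mini batches are my own assumptions, not the actual TRT-LLM implementation:

```cpp
// Hypothetical sketch of how I currently picture the buffer hierarchy.
// The class names roughly match what I saw in the batch_manager sources,
// but every member and its stated purpose is my guess, not the real layout.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Stand-in for an ITensor / device buffer in the real code.
struct Tensor {
    std::vector<int64_t> shape;
};

// (1) Runtime buffers: one tensor per TRT engine input/output binding.
struct RuntimeBuffers {
    std::map<std::string, Tensor> engineInputs;   // e.g. input_ids
    std::map<std::string, Tensor> engineOutputs;  // e.g. logits
};

// (2) Decoder buffers: state for the "decoder", which as far as I can tell
// is really the sampler: it turns logits into discrete tokens and keeps
// beam-search state; it is not a transformer decoder.
struct DecoderBuffers {
    Tensor newTokens;      // tokens sampled this step
    Tensor cumLogProbs;    // per-beam cumulative log-probs for beam search
    Tensor finishedFlags;  // which sequences/beams are done
};

// (3) Decoder inputs/outputs: per-step views passed into and out of the
// sampler; unclear to me why these exist separately from DecoderBuffers.
struct DecoderInput  { Tensor logits; };
struct DecoderOutput { Tensor tokens; };

// (4) Encoder buffers: also unclear, since the encoder runs as a separate
// model instance with its own RuntimeBuffers.
struct EncoderBuffers {
    Tensor encoderOutput;  // cross-attention input for the decoder engine?
};

int main() {
    // (5) My guess at "req slots": a fixed pool of per-request slots, where
    // a slot index identifies a request's state across iterations.
    std::vector<int> reqSlots(64, -1);  // -1 = free slot

    // (6) The executor seems to keep several RuntimeBuffers instances and
    // split active requests into "mini batches", one buffer set each; I
    // don't understand why (overlapping copy and compute? pipelining?).
    std::vector<RuntimeBuffers> perMiniBatchBuffers(2);
    (void)reqSlots;
    (void)perMiniBatchBuffers;
    return 0;
}
```

If this picture is wrong anywhere, corrections to the sketch would already answer most of my questions.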