
Add composite metrics for kubernetes inference gateway metrics protocol #725


Open · wants to merge 6 commits into main

Conversation

BenjaminBraunDev

In order to integrate Triton Inference Server (specifically with the TensorRT-LLM backend) with the Gateway API Inference Extension, it must adhere to the Gateway's Model Server Protocol. This protocol requires the model server to publish the following Prometheus metrics under a consistent family/label scheme:

  • TotalQueuedRequests
  • KVCacheUtilization

Currently, the TensorRT-LLM backend pipes the following TensorRT-LLM batch manager statistics through as Prometheus metrics:

  • Active Request Count
  • Scheduled Requests
  • Max KV cache blocks
  • Used KV cache blocks

These are realized as the following Prometheus metrics:

nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="active",version="1"}
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="scheduled",version="1"}
...
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="max",model="tensorrt_llm",version="1"}
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="used",model="tensorrt_llm",version="1"}

These existing metrics are sufficient to compose the Gateway metrics by adding the following new composite metrics (see the sketch after the metric listing below):

  • Waiting Requests = Active Request Count - Scheduled Requests
  • Fraction of used KV cache blocks = Used KV cache blocks / Max KV cache blocks

and adding them to the existing Prometheus metric families:

nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="active",version="1"}
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="scheduled",version="1"}
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="waiting",version="1"}
...
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="max",model="tensorrt_llm",version="1"}
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="used",model="tensorrt_llm",version="1"}
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="fraction",model="tensorrt_llm",version="1"}

These new waiting and fraction metrics can then be mapped directly to the metrics in the Gateway protocol, allowing integration with the Gateway's End-Point Picker for load balancing.
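
Concretely, the intended correspondence is as follows (illustrative; the exact metric-name/label configuration for the End-Point Picker should be taken from the Gateway API Inference Extension documentation):

TotalQueuedRequests <- nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="waiting",version="1"}
KVCacheUtilization  <- nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="fraction",model="tensorrt_llm",version="1"}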

@statiraju requested a review from @krishung5 on March 21, 2025.
@BenjaminBraunDev (Author)

Tested these changes manually, both by launching an HTTP server and curling the localhost:8002/metrics endpoint, and with the OpenAI-compatible API server by curling the localhost:9000/metrics endpoint. The new metric labels were present within their respective families, and while sending inference requests they updated correctly with increased load.
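
A minimal sketch of that check, assuming the metrics ports above and the Python requests package (hypothetical helper, not part of this change):

import requests

def check_new_series(url):
    # Scrape the Prometheus endpoint and confirm the new composite series exist.
    body = requests.get(url, timeout=5).text
    assert 'request_type="waiting"' in body
    assert 'kv_cache_block_type="fraction"' in body

for url in ("http://localhost:8002/metrics", "http://localhost:9000/metrics"):
    check_new_series(url)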

@indrajit96 left a comment
@krishung5

@kaiyux Could you advise what would be the approach for external contribution here?

@kaiyux (Collaborator) commented Mar 24, 2025

@kaiyux Could you advise what would be the approach for external contribution here?

Since we have not switched to GitHub development for this repo yet, we'll need someone to integrate the changes into the internal repo, merge and publish them, and then credit the contributor. cc @juney-nvidia @schetlur-nv for visibility.

@krishung5

Thanks @kaiyux! I can help with integrating into the internal repo once the changes are finalized. What steps need to be taken to properly credit the contributor?

@kaiyux (Collaborator) commented Mar 24, 2025

Thanks @kaiyux! I can help with integrating into the internal repo once the changes are finalized. What steps need to be taken to properly credit the contributor?

We do something like the following so that the contributor is marked as a co-author:
[screenshot: example commit crediting the contributor]
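
(For reference, assuming the screenshot shows GitHub's standard co-author convention, the commit message ends with a trailer like the following, with the contributor's real name and email substituted:)

Co-authored-by: Contributor Name <contributor@example.com>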

Feel free to let me know when it gets merged and I can do it.
