
Add composite metrics for kubernetes inference gateway metrics protocol #725


Open · wants to merge 6 commits into main

Conversation

BenjaminBraunDev

In order to integrate Triton Inference Server (specifically with the TensorRT-LLM backend) with the Gateway API Inference Extension, it must adhere to the Gateway's Model Server Protocol. This protocol requires the model server to publish the following Prometheus metrics under a consistent family/label scheme:

  • TotalQueuedRequests
  • KVCacheUtilization

Currently, the TensorRT-LLM backend pipes the following TensorRT-LLM batch manager statistics through as Prometheus metrics:

  • Active Request Count
  • Scheduled Requests
  • Max KV cache blocks
  • Used KV cache blocks

These are realized as the following Prometheus metrics:

nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="active",version="1"}
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="scheduled",version="1"}
...
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="max",model="tensorrt_llm",version="1"}
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="used",model="tensorrt_llm",version="1"}

These existing metrics are sufficient to compose the Gateway metrics by adding the following new composite metrics (see the sketch after the metric listing below):

  • Waiting Requests = Active Request Count - Scheduled Requests
  • Fraction of used KV cache blocks = Used KV cache blocks / Max KV cache blocks

and adding them to the existing Prometheus metric families:

nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="active",version="1"}
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="scheduled",version="1"}
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="waiting",version="1"}
...
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="max",model="tensorrt_llm",version="1"}
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="used",model="tensorrt_llm",version="1"}
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="fraction",model="tensorrt_llm",version="1"}

These new waiting and fraction metrics can then be mapped directly to the metrics in the Gateway protocol, allowing integration with the Gateway's End-Point Picker for load balancing.
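
Concretely, the intended correspondence is as follows (illustrative; the exact metric-name/label configuration for the End-Point Picker should be taken from the Gateway API Inference Extension documentation):

TotalQueuedRequests <- nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="waiting",version="1"}
KVCacheUtilization  <- nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="fraction",model="tensorrt_llm",version="1"}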

@statiraju requested a review from @krishung5 on March 21, 2025.
@BenjaminBraunDev (Author)

Tested these changes manually, both by launching an HTTP server and curling the localhost:8002/metrics endpoint, and with the OpenAI-compatible API server by curling the localhost:9000/metrics endpoint. The new metric labels were present within their respective families, and while sending inference requests they updated correctly with increased load.
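
A minimal sketch of that check, assuming the metrics ports above and the Python requests package (hypothetical helper, not part of this change):

import requests

def check_new_series(url):
    # Scrape the Prometheus endpoint and confirm the new composite series exist.
    body = requests.get(url, timeout=5).text
    assert 'request_type="waiting"' in body
    assert 'kv_cache_block_type="fraction"' in body

for url in ("http://localhost:8002/metrics", "http://localhost:9000/metrics"):
    check_new_series(url)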

@indrajit96 left a comment
@krishung5

@kaiyux Could you advise what would be the approach for external contribution here?

@kaiyux (Collaborator) commented Mar 24, 2025

@kaiyux Could you advise what would be the approach for external contribution here?

Since we have not switched to GitHub development for this repo yet, we'll need someone to integrate the changes into the internal repo, merge and publish them, and then credit the contributor. cc @juney-nvidia @schetlur-nv for visibility.

@krishung5

Thanks @kaiyux! I can help with integrating into the internal repo once the changes are finalized. What steps need to be taken to properly credit the contributor?

@kaiyux (Collaborator) commented Mar 24, 2025

Thanks @kaiyux! I can help with integrating into the internal repo once the changes are finalized. What steps need to be taken to properly credit the contributor?

We do something like the following so that the contributor is marked as a co-author:
[screenshot: example commit crediting the contributor]
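
(For reference, assuming the screenshot shows GitHub's standard co-author convention, the commit message ends with a trailer like the following, with the contributor's real name and email substituted:)

Co-authored-by: Contributor Name <contributor@example.com>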

Feel free to let me know when it gets merged and I can do it.
