
How to pass parameter in ensemble model? #71

Closed
GooVincent opened this issue Nov 1, 2023 · 15 comments

@GooVincent

The normal pipeline for tensorrtllm_backend is preprocessing -> tensorrt_llm -> postprocessing. How can a custom parameter from the request, such as the request token length, be passed through this pipeline?

In my understanding, the tensorrt_llm backend performs the inference itself, so it won't work to add extra input and output parameters to it. The question, then, is: in the ensemble pipeline, how can a parameter be passed from the preprocessing module to the postprocessing module?

Is there any way to solve this?

@byshiue
Collaborator

byshiue commented Nov 1, 2023

I don't think I really get your question. max_tokens (here) is an input parameter, which is similar to the request token length. Can you explain your question in more detail?

@byshiue byshiue self-assigned this Nov 1, 2023
@GooVincent
Author

Well, I understand how to pass a parameter such as REQUEST_INPUT_LEN from the preprocessing model to the tensorrt_llm model:

[screenshot]

But I am confused about how to pass REQUEST_INPUT_LEN from the tensorrt_llm model to the postprocessing model, since the outputs of tensorrt_llm should be fixed by the LLM model. Correct?
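For context, the preprocessing -> tensorrt_llm hop looks roughly like this in the stock ensemble config (an abridged sketch based on all_models/inflight_batcher_llm/ensemble/config.pbtxt; exact tensor names can differ between versions):

```
# Sketch of ensemble_scheduling: the preprocessing step publishes
# REQUEST_INPUT_LEN as an intermediate ensemble tensor, which the
# tensorrt_llm step consumes as input_lengths.
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map {
        key: "QUERY"
        value: "text_input"
      }
      output_map {
        key: "INPUT_ID"
        value: "_INPUT_ID"
      }
      output_map {
        key: "REQUEST_INPUT_LEN"
        value: "_REQUEST_INPUT_LEN"
      }
    },
    {
      model_name: "tensorrt_llm"
      model_version: -1
      input_map {
        key: "input_ids"
        value: "_INPUT_ID"
      }
      input_map {
        key: "input_lengths"
        value: "_REQUEST_INPUT_LEN"
      }
      output_map {
        key: "output_ids"
        value: "_TOKENS_BATCH"
      }
    }
  ]
}
```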

@byshiue
Collaborator

byshiue commented Nov 2, 2023

For the ensemble, you should pass REQUEST_INPUT_LEN through the ensemble to postprocessing, instead of passing it from tensorrt_llm to postprocessing.
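Concretely, that means routing the intermediate tensor the preprocessing step already produces into the postprocessing step as well. A minimal sketch of the ensemble config change, assuming the stock all_models/inflight_batcher_llm layout (the REQUEST_INPUT_LEN input on postprocessing is something you add yourself, not a stock tensor):

```
# ensemble/config.pbtxt (abridged sketch): the same intermediate tensor can
# feed more than one downstream step, so postprocessing can consume it too.
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      # ... existing input_map / output_map entries unchanged ...
      output_map {
        key: "REQUEST_INPUT_LEN"
        value: "_REQUEST_INPUT_LEN"
      }
    },
    # ... tensorrt_llm step unchanged and omitted for brevity ...
    {
      model_name: "postprocessing"
      model_version: -1
      input_map {
        key: "TOKENS_BATCH"
        value: "_TOKENS_BATCH"
      }
      input_map {
        key: "REQUEST_INPUT_LEN"   # new input declared on the postprocessing model
        value: "_REQUEST_INPUT_LEN"
      }
      output_map {
        key: "OUTPUT"
        value: "text_output"
      }
    }
  ]
}
```

The postprocessing model's own config.pbtxt (and its model.py) would also need a matching REQUEST_INPUT_LEN input, as discussed further down in this thread.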

@GooVincent
Author

Is there any guide for this, please? REQUEST_INPUT_LEN is an intermediate result.

@byshiue
Collaborator

byshiue commented Nov 2, 2023

I get your point. As far as I know, there is no way to support such a feature in the tensorrt_llm backend directly, because it would require changing the source code of the batch_manager. You can try to map the output of preprocessing directly to an input of postprocessing. I am not sure whether that is doable; you can ask in the tritonserver repo.

@GooVincent
Author

okay, thanks.

@GooVincent
Author

Is there any way for the ensemble model to know the request token length? I want to cut off the original tokens.

@BasicCoder

Maybe you can reference #95.

@callmezhangchenchenokay

Hi @GooVincent, have you solved it yet?

@callmezhangchenchenokay

callmezhangchenchenokay commented Dec 8, 2023

Here is a demo: you can pass REQUEST_INPUT_LEN out from the middle of the pipeline. I've tested it and it works:

https://github.com/triton-inference-server/tensorrtllm_backend/blob/57de9f572f75f61fe17b668eea1430b030e1b721/all_models/inflight_batcher_llm/postprocessing/config.pbtxt
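For reference, the postprocessing side of such a demo could look roughly like the following. This is a sketch rather than the exact contents of the linked config.pbtxt, and the dtype/dims of REQUEST_INPUT_LEN are assumptions that must mirror whatever the preprocessing model emits:

```
# postprocessing/config.pbtxt (abridged sketch): accept REQUEST_INPUT_LEN
# from the ensemble and re-expose it as an output so the client receives it
# next to the detokenized text.
input [
  {
    name: "TOKENS_BATCH"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "REQUEST_INPUT_LEN"   # must match the ensemble input_map key
    data_type: TYPE_INT32       # assumed; mirror the preprocessing output
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "REQUEST_INPUT_LEN"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
```

In the postprocessing model.py, the new tensor can then be read with pb_utils.get_input_tensor_by_name(request, "REQUEST_INPUT_LEN") and appended to the InferenceResponse alongside OUTPUT. To actually return it to the client, the ensemble's own output section and the postprocessing step's output_map would need matching entries as well.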

@byshiue
Collaborator

byshiue commented Dec 11, 2023

Since the issue is solved, I am closing it. If you still have an issue or question, feel free to ask and we will reopen it.

@byshiue byshiue closed this as completed Dec 11, 2023
@callmezhangchenchenokay

Sorry to interrupt again!

The solution mentioned above requires model_transaction_policy to be set to True, so it can only be used when stream = False.

However, this problem occurs when stream = True:

[screenshot of the error]

So there needs to be a way to export REQUEST_INPUT_LEN when stream = True and model_transaction_policy = True.

@byshiue
Collaborator

byshiue commented Jan 3, 2024

I don't get your point about "setting model_transaction_policy to True". model_transaction_policy is a set of policies, and stream (decoupled mode) is one of them.

@callmezhangchenchenokay

Excuse me, that was my poor wording. What I'm trying to say is:

```
model_transaction_policy {
  decoupled: true
}
```

If decoupled is set to false, REQUEST_INPUT_LEN can be exported; however, when decoupled is set to true, the following error is reported:

[screenshot of the error]

@byshiue
Collaborator

byshiue commented Jan 8, 2024

Could you try the scripts here, https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md#end-to-end-workflow-to-run-llama, to run streaming?
