No white space included in tokens sent back by Llama2 in streaming mode #332
Comments
We are experiencing the same issue.
For now I am using a workaround that is probably not ideal. In the postprocessing script (/postprocessing/1/model.py) I changed the _postprocessing function to return the actual token IDs.
I collect all the token IDs on the client side and then decode the entire sequence, which produces the correct output.
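A rough sketch of that workaround, assuming a _postprocessing(self, tokens_batch, sequence_lengths) layout like the one quoted later in this thread and a Hugging Face tokenizer on the client (the id-serialization format, the Llama-2-7b tokenizer name, and the on_stream_response helper are illustrative assumptions, not the exact code used):

```python
# Server side (postprocessing/1/model.py): return raw token ids instead of
# per-token decoded text, so no whitespace information is lost in transit.
def _postprocessing(self, tokens_batch, sequence_lengths):
    outputs = []
    for beam_tokens, beam_len in zip(tokens_batch, sequence_lengths):
        for tokens, seq_len in zip(beam_tokens, beam_len):
            # Serialize the ids as a space-separated string, e.g. "1 15043 3186".
            outputs.append(" ".join(str(t) for t in tokens[:seq_len]).encode("utf8"))
    return outputs


# Client side: collect the ids from every streamed chunk and decode the whole
# sequence at once, which restores the correct whitespace.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
collected_ids = []

def on_stream_response(text_output: bytes):
    collected_ids.extend(int(t) for t in text_output.decode("utf8").split())
    return tokenizer.decode(collected_ids)  # full text so far, spaces intact
```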
The tokenizers in transformers do not restore this whitespace automatically when decoding a single token. The standard way of going about this is holding tokens in a cache until a space is detected, at which point everything after the space is put back into the cache. The other suggested method decodes the token id into its raw token string instead of plain text and looks for the "▁" symbol. Here is a workaround using the second method:
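The snippet that originally accompanied this comment appears to have been dropped from the thread; below is a sketch of the second method as described, assuming the postprocessing model's _postprocessing(self, tokens_batch, sequence_lengths) signature and its self.tokenizer (elinx's refinement of the same check is quoted further down):

```python
def _postprocessing(self, tokens_batch, sequence_lengths):
    outputs = []
    for beam_tokens, beam_len in zip(tokens_batch, sequence_lengths):
        for tokens, seq_len in zip(beam_tokens, beam_len):
            output = self.tokenizer.decode(tokens[:seq_len])
            # decode() drops the leading space of a lone token, but the raw token
            # string keeps the SentencePiece word-boundary marker "▁".
            token_strings = self.tokenizer.convert_ids_to_tokens(tokens[:seq_len])
            if token_strings and token_strings[0].startswith("▁"):
                output = " " + output
            outputs.append(output.encode("utf8"))
    return outputs
```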
@Shixiaowei02 I can create a PR for this.
Have you tried the tensorrt_llm_bls model?
btw @jfpichlme, how did you build the container and the engines?
Hi enochlev,
Build the container: I have used Option 2 in the tensorrt-llm backend repo to build the docker container. The docker version is "22.04" and the tensorrt_llm git version is TensorRT-LLM backend (#324).
Build the Models: This process now consists of two steps, first a convert_checkpoint step and then a build step.
Combine your Model: The last step is to copy the created engine files to the tensorrt_llm/1/ directory and adapt the config files. You can see the configs of the model in my initial comment. I hope this helps you.
@byshiue I will test the tensorrt_llm_bls module now.
@jfpichlme any luck with bls + streaming? I have the same problem and for some reason can't make my grpc client work with bls.
Hi ekarmazin,
@jfpichlme I kind of got it working with BLS, it does proper output with whitespaces now. But I faced accuracy problems when enabling accumulate_tokens.
@byshiue same issue with the bls model. Spaces are present when accumulate_tokens is true, and missing when false.
@enochlev apologies for the delayed response. Would you still be able to PR the fix you suggested?
Any update on this? |
I will find some time around work this week and push an update.
mark
Mark
Any update?
Mark
@enochlev
token_id_string = self.tokenizer.convert_ids_to_tokens(tokens[:seq_len], skip_special_tokens=True)
# "▁" is the SentencePiece word-boundary marker; if the first token carries it, re-insert the space.
if len(token_id_string) > 0 and len(token_id_string[0]) > 0 and token_id_string[0][0] == "▁":
    output = " " + output
@elinx Really appreciate catching that... I just submitted a PR including your suggestion. It worked in my local environment before I submitted the PR, so it has my approval (if that means anything 😁)
System Info
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
1. Set up Llama2 (7b, 13b, 70b) in streaming mode:
model_config:
preprocessing:
postprocessing:
ensemble:
2. Use the Nvidia client notebook (the install does not work, but downloading langchain_nvidia_trt.llms directly solves the problem)
(I have also written my own grpc client which produces the same output)
3. Send an inference request via gRPC to Triton (a minimal client sketch is below)
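For step 3, here is a minimal Python gRPC streaming client sketch. The model name (ensemble), tensor names (text_input, max_tokens, stream, text_output), and the [1, 1] shapes follow the usual tensorrtllm_backend ensemble config and are assumptions that may need adjusting for your setup:

```python
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

results = queue.Queue()

def callback(q, result, error):
    # Called once per streamed response; stash errors and results alike.
    q.put(error if error is not None else result)

with grpcclient.InferenceServerClient("localhost:8001") as client:
    tensors = {
        "text_input": np.array([["What is machine learning?"]], dtype=object),
        "max_tokens": np.array([[64]], dtype=np.int32),
        "stream": np.array([[True]], dtype=bool),
    }
    inputs = []
    for name, data in tensors.items():
        infer_input = grpcclient.InferInput(name, list(data.shape), np_to_triton_dtype(data.dtype))
        infer_input.set_data_from_numpy(data)
        inputs.append(infer_input)

    client.start_stream(callback=partial(callback, results))
    client.async_stream_infer(model_name="ensemble", inputs=inputs)
    client.stop_stream()  # half-close the stream and wait for remaining responses

# Concatenate the streamed chunks; with the whitespace bug they arrive without spaces.
pieces = []
while not results.empty():
    item = results.get()
    if isinstance(item, Exception):
        raise item
    pieces.append(item.as_numpy("text_output").flatten()[0].decode("utf8"))
print("".join(pieces))
```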
Expected behavior
Produce output tokens including whitespace:
actual behavior
Triton produces output tokens without whitespace:
additional notes
I am not too sure if this is a bug or if I am missing some flag. Any help is highly appreciated.
Model build: