Use llama chat tags in example requests #951
Conversation
"text": "Name the capital of the United States.", | ||
"text": "<|begin_of_text|>Name the capital of the United States.<|eot_id|>", |
Should the server be inserting these itself? Maybe for a specific endpoint? Right now we just have `generate`.
Other projects have endpoints like `chat/completions` that don't include these tags (see the sketch after this list):
- https://docs.sglang.ai/start/send_request.html
- https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-completions-api-with-vllm

Are we expecting shortfin to sit under a frontend like sglang that will insert the tags for us, or should we be putting similar logic in our apps too?
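For comparison, a hedged sketch of the `chat/completions` shape from the quickstarts linked above, where the server applies the chat template and the client sends untagged messages; the host, port, and model id are placeholders:

```python
import requests

# OpenAI-style chat endpoint (as in the vLLM/SGLang quickstarts): the server
# applies the model's chat template, so the client sends plain messages
# without <|begin_of_text|>/<|eot_id|> tags.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",  # placeholder host/port
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model id
        "messages": [
            {"role": "user", "content": "Name the capital of the United States."}
        ],
    },
)
print(response.json())
```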
Yeah, they're called ChatTemplates. It's a feature that we should add. For now, our SGLang serving docs use the llama tags and apply them through the SGLang frontend language.
By default, if a chat_template is not specified, they apply it based on model name/path: https://github.com/sgl-project/sglang/blob/main/python/sglang/lang/backend/runtime_endpoint.py#L48
It shouldn't be a hard thing to add, just not related to this PR, which is aimed at cleaning up the user doc experience.
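A rough sketch of the kind of default selection the linked SGLang code performs; the function name and template ids here are hypothetical, not shortfin's actual API:

```python
# Illustrative only: choose a chat template from the model name/path when
# none is specified, mirroring the SGLang default linked above.
def guess_chat_template(model_path: str) -> str | None:
    name = model_path.lower()
    if "llama-3" in name or "llama3" in name:
        return "llama3"
    return None  # unknown model: apply no template
```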
But do we expect a user to actually send the tags themselves in requests? I'm fine with the quick fix if it reduces confusion, but I want to understand our architecture/strategy more.
I don't expect that; I'll need to create a formal issue for this. The way I see it is basically replicating what SGLang does: we have chat templates for the model variants that we support (just "llama3" for now).
If a user starts a server with a llama model, it'll use the llama3 chat template by default. Otherwise, they can start the server with a specific template (e.g. "none"), and it'll use that.
So, by default, the llama tags will be applied if you were to just spin up the server; you'd have to explicitly specify otherwise. A rough sketch of that default behavior follows.
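A minimal sketch of that default, assuming just the two tags used in this PR's examples (a full llama3 chat template also adds header tags); all names here are hypothetical:

```python
# Minimal sketch: "llama3" wraps the prompt in the tags from this PR's
# examples; "none" leaves the prompt untouched.
TEMPLATES = {
    "llama3": "<|begin_of_text|>{prompt}<|eot_id|>",
    "none": "{prompt}",
}

def apply_chat_template(prompt: str, template: str = "llama3") -> str:
    return TEMPLATES[template].format(prompt=prompt)

# Default: llama tags applied; pass template="none" to opt out explicitly.
print(apply_chat_template("Name the capital of the United States."))
```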
I see that as an issue to add to #921
Good enough for now (though see the discussion)
Use llama chat tags in example requests. More details on this can be found [here](#934 (comment)).