add sglang engine demo for /v1/chat/completions #58366
Code Review
This pull request adds a demo script for using the SGLang engine with Ray Serve. The script is a good starting point, but it has several issues that should be addressed to make it more robust, portable, and aligned with best practices.
My main feedback points are:
- The script contains a hardcoded model path, which makes it not runnable on other machines. This should be made configurable, for example, via an environment variable.
- The FastAPI endpoint implementation has several deviations from the OpenAI API it claims to emulate, such as using GET instead of POST, a non-standard URL structure, and an unstructured response format.
- There are multiple PEP 8 style violations throughout the file, including inconsistent spacing, variable naming, and use of `print` for logging.
I've left specific comments with suggestions to fix these issues. Addressing them will significantly improve the quality and usability of this demo.
```python
self.engine_kwargs = dict(
    model_path = "/scratch2/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct/",
    mem_fraction_static = 0.8,
    tp_size = 8,
)
```
This block has a couple of issues:
- Hardcoded Path: The `model_path` is hardcoded. This makes the demo not portable and will fail on other machines. It's better to use an environment variable. You'll need to add `import os` at the top of the file.
- Style: The dictionary initialization for `engine_kwargs` has some style issues. According to PEP 8, there should be no spaces around the equals sign for keyword arguments. Also, when using the `dict()` constructor, keys are identifiers and should not be quoted as strings.
```diff
- self.engine_kwargs = dict(
-     model_path = "/scratch2/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct/",
-     mem_fraction_static = 0.8,
-     tp_size = 8,
- )
+ self.engine_kwargs = dict(
+     model_path=os.environ.get("MODEL_PATH", "/scratch2/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct/"),
+     mem_fraction_static=0.8,
+     tp_size=8,
+ )
```
```python
@app.get("/v1/chat/completions/{in_prompt}") # serve as OPENAI /v1/chat/completions endpoint
async def root(self, in_prompt: str): # make endpoint async
    # Asynchronous call to a method of that deployment (executes remotely) used remote
    res = await self._engine_handle.chat.remote(in_prompt)
    return "useing Llama-3.1-8B-Instruct for your !", in_prompt, res
```
This endpoint is commented as serving as an OpenAI /v1/chat/completions endpoint, but its implementation differs significantly from the OpenAI API specification:
- It uses `GET` instead of `POST`. Chat completions typically involve longer inputs that are sent in the request body.
- The prompt is passed as a URL path parameter, which is not standard.
- The method name `root` is not very descriptive. A name like `chat_completions` would be more appropriate.
- The response is a tuple `("useing Llama-3.1-8B-Instruct for your !", in_prompt, res)`, which FastAPI will serialize as a JSON array. An OpenAI-compatible response should be a JSON object (e.g., a `ChatCompletion` object).
- There is a typo "useing" in the response string.
- The comments on lines 16 and 17 are redundant and can be removed for cleaner code.
For a demo aiming for OpenAI compatibility, it would be better to align with the standard API. The suggestion below improves naming and the response format. For a more compliant API, consider using POST with a request body model.
```diff
- @app.get("/v1/chat/completions/{in_prompt}") # serve as OPENAI /v1/chat/completions endpoint
- async def root(self, in_prompt: str): # make endpoint async
-     # Asynchronous call to a method of that deployment (executes remotely) used remote
-     res = await self._engine_handle.chat.remote(in_prompt)
-     return "useing Llama-3.1-8B-Instruct for your !", in_prompt, res
+ @app.get("/v1/chat/completions/{in_prompt}") # serve as an OpenAI-like /v1/chat/completions endpoint
+ async def chat_completions(self, in_prompt: str):
+     res = await self._engine_handle.chat.remote(in_prompt)
+     return {"model": "Llama-3.1-8B-Instruct", "prompt": in_prompt, "response": res}
```
```diff
@@ -0,0 +1,51 @@
+import ray
+import requests
```
```python
        self.engine = sglang.Engine(**self.engine_kwargs)

    async def chat(self, message: str):
        print('In SGLangServer CHAT with message', message)
```
Using `print` for logging is generally discouraged in library code. It's better to use the `logging` module, which allows for configurable log levels, formatting, and output streams. You can add `import logging` and `logger = logging.getLogger(__name__)` at the top of the file.
```diff
- print('In SGLangServer CHAT with message', message)
+ logger.info(f'In SGLangServer CHAT with message: {message}')
```
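For completeness, the module-level setup this suggestion assumes (standard library `logging`, placed near the other imports) would look like:

```python
import logging

# Module-level logger for this file; replaces direct print() calls.
logger = logging.getLogger(__name__)
```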
```python
res = await self.engine.async_generate(
    prompt = message,
    stream = False
)
```
Per PEP 8, keyword arguments should not use spaces around the equals sign (`prompt=message`, `stream=False`).
```python
sglangServer = SGLangServer.bind()
my_App = MyFastAPIDeployment.bind(sglangServer)
handle: DeploymentHandle = serve.run(my_App, blocking = True)
```
There are a few style issues here that go against PEP 8 guidelines:
- Variable names `sglangServer` and `my_App` should be `snake_case` (e.g., `sglang_server`, `my_app`).
- There should not be spaces around the equals sign in `blocking = True`.
```diff
- sglangServer = SGLangServer.bind()
- my_App = MyFastAPIDeployment.bind(sglangServer)
- handle: DeploymentHandle = serve.run(my_App, blocking = True)
+ sglang_server = SGLangServer.bind()
+ my_app = MyFastAPIDeployment.bind(sglang_server)
+ handle: DeploymentHandle = serve.run(my_app, blocking=True)
```
```python
def __init__(self):

    self.engine_kwargs = dict(
        model_path = "/scratch2/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct/",
```
Bug: Hardcoded Dev Path Breaks Production Deployment
The hardcoded personal development path `/scratch2/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct/` is specific to one development environment and will not work for other users or in production. This appears to be accidentally committed personal configuration.
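One way to avoid shipping a personal default at all (an illustrative sketch assuming a `MODEL_PATH` environment variable; not part of this PR) is to fail fast when the path is not configured:

```python
import os

# Read the model location from the environment and fail early with a clear
# message instead of falling back to a developer-specific path.
model_path = os.environ.get("MODEL_PATH")
if not model_path:
    raise RuntimeError(
        "Set MODEL_PATH to a local model directory before running this demo."
    )
```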
```python
async def root(self, in_prompt: str): # make endpoint async
    # Asynchronous call to a method of that deployment (executes remotely) used remote
    res = await self._engine_handle.chat.remote(in_prompt)
    return "useing Llama-3.1-8B-Instruct for your !", in_prompt, res
```
Bug: OpenAI API: Incorrect Response Format Causing Failures
The endpoint claims to serve as an OpenAI /v1/chat/completions endpoint (see the comment on line 15), but returns a tuple `("useing Llama-3.1-8B-Instruct for your !", in_prompt, res)` instead of the proper OpenAI chat completions response format. The OpenAI API requires a specific JSON structure with fields like `id`, `object`, `created`, `model`, and `choices`. Returning a tuple will cause any OpenAI-compatible client to fail when parsing the response.
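For reference, a sketch of the `chat.completion` JSON object an OpenAI-compatible client expects, written here as a Python dict with placeholder values (field names follow the public OpenAI API documentation):

```python
# Shape of an OpenAI-style chat completion response (placeholder values).
{
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1700000000,
    "model": "Llama-3.1-8B-Instruct",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "..."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 9, "completion_tokens": 12, "total_tokens": 21},
}
```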