Conversation

@kangwangamd

Thank you for contributing to Ray! 🚀
Please review the Ray Contribution Guide before opening a pull request.

⚠️ Remove these instructions before submitting your PR.

💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete.

Description

Briefly describe what this PR accomplishes and why it's needed.

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@kangwangamd kangwangamd requested a review from a team as a code owner November 3, 2025 03:46

@gemini-code-assist (bot) left a comment

Code Review

This pull request adds a demo script for using the SGLang engine with Ray Serve. The script is a good starting point, but it has several issues that should be addressed to make it more robust, portable, and aligned with best practices.

My main feedback points are:

  • The script contains a hardcoded model path, which prevents it from running on other machines. The path should be made configurable, for example via an environment variable.
  • The FastAPI endpoint deviates from the OpenAI API it claims to emulate in several ways: it uses GET instead of POST, a non-standard URL structure, and an unstructured response format.
  • There are multiple PEP 8 style violations throughout the file, including inconsistent spacing, non-standard variable naming, and use of print instead of the logging module.

I've left specific comments with suggestions to fix these issues. Addressing them will significantly improve the quality and usability of this demo.

Comment on lines +25 to +29
self.engine_kwargs = dict(
    model_path = "/scratch2/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct/",
    mem_fraction_static = 0.8,
    tp_size = 8,
)

critical

This block has a couple of issues:

  1. Hardcoded Path: The model_path is hardcoded, which makes the demo non-portable; it will fail on any other machine. It's better to read the path from an environment variable (this requires adding import os at the top of the file).
  2. Style: According to PEP 8, there should be no spaces around the equals sign for keyword arguments, so the call should read model_path=..., mem_fraction_static=0.8, tp_size=8.
Suggested change
self.engine_kwargs = dict(
    model_path = "/scratch2/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct/",
    mem_fraction_static = 0.8,
    tp_size = 8,
)
self.engine_kwargs = dict(
    model_path=os.environ.get("MODEL_PATH", "/scratch2/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct/"),
    mem_fraction_static=0.8,
    tp_size=8,
)

Comment on lines +15 to +19
@app.get("/v1/chat/completions/{in_prompt}") # serve as OPENAI /v1/chat/completions endpoint
async def root(self, in_prompt: str): # make endpoint async
    # Asynchronous call to a method of that deployment (executes remotely) used remote
    res = await self._engine_handle.chat.remote(in_prompt)
    return "useing Llama-3.1-8B-Instruct for your !", in_prompt, res

high

This endpoint is commented as serving as an OpenAI /v1/chat/completions endpoint, but its implementation differs significantly from the OpenAI API specification:

  • It uses GET instead of POST. Chat completions typically involve longer inputs that are sent in the request body.
  • The prompt is passed as a URL path parameter, which is not standard.
  • The method name root is not very descriptive. A name like chat_completions would be more appropriate.
  • The response is a tuple ("useing Llama-3.1-8B-Instruct for your !", in_prompt, res), which FastAPI will serialize as a JSON array. An OpenAI-compatible response should be a JSON object (e.g., ChatCompletion object).
  • There is a typo "useing" in the response string.
  • The comments on lines 16 and 17 are redundant and can be removed for cleaner code.

For a demo aiming for OpenAI compatibility, it would be better to align with the standard API. The suggestion below improves naming and the response format; for a fully compliant API, consider using POST with a request body model, as sketched after the suggestion.

Suggested change
@app.get("/v1/chat/completions/{in_prompt}") # serve as OPENAI /v1/chat/completions endpoint
async def root(self, in_prompt: str): # make endpoint async
    # Asynchronous call to a method of that deployment (executes remotely) used remote
    res = await self._engine_handle.chat.remote(in_prompt)
    return "useing Llama-3.1-8B-Instruct for your !", in_prompt, res
@app.get("/v1/chat/completions/{in_prompt}")  # serve as an OpenAI-like /v1/chat/completions endpoint
async def chat_completions(self, in_prompt: str):
    res = await self._engine_handle.chat.remote(in_prompt)
    return {"model": "Llama-3.1-8B-Instruct", "prompt": in_prompt, "response": res}

@@ -0,0 +1,51 @@
import ray
import requests

medium

The requests module is imported but never used. It should be removed to keep the code clean.

        self.engine = sglang.Engine(**self.engine_kwargs)

    async def chat(self, message: str):
        print('In SGLangServer CHAT with message', message)

medium

Using print for logging is generally discouraged in library code. It's better to use the logging module, which allows for configurable log levels, formatting, and output streams. You can add import logging and logger = logging.getLogger(__name__) at the top of the file.

Suggested change
        print('In SGLangServer CHAT with message', message)
        logger.info(f'In SGLangServer CHAT with message: {message}')
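
The setup this assumes, placed near the top of the file, would look roughly like the sketch below. The basicConfig call is an optional assumption for standalone runs (Ray Serve normally configures replica logging itself), and the %-style call shows a lazy-formatting variant that defers string construction until the record is actually emitted:

import logging

# Module-level logger, typically defined next to the other imports.
logger = logging.getLogger(__name__)

# Optional when running the file directly; gives the logger a handler and level.
logging.basicConfig(level=logging.INFO)

logger.info("In SGLangServer CHAT with message: %s", "hello")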

Comment on lines +42 to +45
        res = await self.engine.async_generate(
            prompt = message,
            stream = False
        )

medium

According to PEP 8, there should be no spaces around the equals sign for keyword arguments.

Suggested change
        res = await self.engine.async_generate(
            prompt = message,
            stream = False
        )
        res = await self.engine.async_generate(
            prompt=message,
            stream=False
        )

Comment on lines +48 to +50
sglangServer = SGLangServer.bind()
my_App = MyFastAPIDeployment.bind(sglangServer)
handle: DeploymentHandle = serve.run(my_App, blocking = True)

medium

There are a few style issues here that go against PEP 8 guidelines:

  • Variable names sglangServer and my_App should be snake_case (e.g., sglang_server, my_app).
  • There should not be spaces around the equals sign in blocking = True.
Suggested change
sglangServer = SGLangServer.bind()
my_App = MyFastAPIDeployment.bind(sglangServer)
handle: DeploymentHandle = serve.run(my_App, blocking = True)
sglang_server = SGLangServer.bind()
my_app = MyFastAPIDeployment.bind(sglang_server)
handle: DeploymentHandle = serve.run(my_app, blocking=True)
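
Incidentally, the unused requests import flagged above could be repurposed for a quick smoke test from another shell. This sketch assumes Serve's default HTTP address (127.0.0.1:8000) and the GET route from this demo:

from urllib.parse import quote

import requests

# URL-encode the prompt, since this demo passes it as a path parameter.
prompt = quote("Hello there")
resp = requests.get(f"http://127.0.0.1:8000/v1/chat/completions/{prompt}")
print(resp.status_code, resp.json())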

def __init__(self):

    self.engine_kwargs = dict(
        model_path = "/scratch2/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct/",

Bug: Hardcoded Dev Path Breaks Production Deployment

The hardcoded path /scratch2/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct/ is specific to one development environment and will not work for other users or in production. It appears to be accidentally committed personal configuration.

async def root(self, in_prompt: str): # make endpoint async
    # Asynchronous call to a method of that deployment (executes remotely) used remote
    res = await self._engine_handle.chat.remote(in_prompt)
    return "useing Llama-3.1-8B-Instruct for your !", in_prompt, res

Bug: OpenAI API: Incorrect Response Format Causing Failures

The endpoint claims to serve as an OpenAI /v1/chat/completions endpoint (see comment on line 15), but returns a tuple ("useing Llama-3.1-8B-Instruct for your !", in_prompt, res) instead of the proper OpenAI chat completions response format. The OpenAI API requires a specific JSON structure with fields like id, object, created, model, and choices. Returning a tuple will cause any OpenAI-compatible client to fail when parsing the response.
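
To make the expected shape concrete, a minimally OpenAI-shaped chat completion body looks roughly like the sketch below (field values are illustrative; real responses also carry a usage block with token counts):

import time
import uuid

def make_chat_completion_response(model: str, text: str) -> dict:
    # Minimal OpenAI-style chat completion object.
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": "stop",
            }
        ],
    }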

@ray-gardener (bot) added the labels serve (Ray Serve Related Issue), docs (An issue or change related to documentation), llm, and community-contribution (Contributed by the community) on Nov 3, 2025

Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects
