add sglang engine demo for /v1/chat/completions #58366
Code Review
This pull request adds a demo script for using the SGLang engine with Ray Serve. The script is a good starting point, but it has several issues that should be addressed to make it more robust, portable, and aligned with best practices.
My main feedback points are:
- The script contains a hardcoded model path, which makes it not runnable on other machines. This should be made configurable, for example, via an environment variable.
- The FastAPI endpoint implementation has several deviations from the OpenAI API it claims to emulate, such as using GET instead of POST, a non-standard URL structure, and an unstructured response format.
- There are multiple PEP 8 style violations throughout the file, including inconsistent spacing, variable naming, and use of `print` for logging.
I've left specific comments with suggestions to fix these issues. Addressing them will significantly improve the quality and usability of this demo.
```python
self.engine_kwargs = dict(
    model_path = "/scratch2/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct/",
    mem_fraction_static = 0.8,
    tp_size = 8,
)
```
This block has a couple of issues:
- Hardcoded Path: The `model_path` is hardcoded. This makes the demo not portable and will fail on other machines. It's better to use an environment variable. You'll need to add `import os` at the top of the file.
- Style: The dictionary initialization for `engine_kwargs` has some style issues. According to PEP 8, there should be no spaces around the equals sign for keyword arguments. Also, when using the `dict()` constructor, keys are identifiers and should not be quoted as strings.
```diff
- self.engine_kwargs = dict(
-     model_path = "/scratch2/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct/",
-     mem_fraction_static = 0.8,
-     tp_size = 8,
- )
+ self.engine_kwargs = dict(
+     model_path=os.environ.get("MODEL_PATH", "/scratch2/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct/"),
+     mem_fraction_static=0.8,
+     tp_size=8,
+ )
```
```python
@app.get("/v1/chat/completions/{in_prompt}") # serve as OPENAI /v1/chat/completions endpoint
async def root(self, in_prompt: str): # make endpoint async
    # Asynchronous call to a method of that deployment (executes remotely) used remote
    res = await self._engine_handle.chat.remote(in_prompt)
    return "useing Llama-3.1-8B-Instruct for your !", in_prompt, res
```
This endpoint is commented as serving as an OpenAI /v1/chat/completions endpoint, but its implementation differs significantly from the OpenAI API specification:
- It uses `GET` instead of `POST`. Chat completions typically involve longer inputs that are sent in the request body.
- The prompt is passed as a URL path parameter, which is not standard.
- The method name `root` is not very descriptive. A name like `chat_completions` would be more appropriate.
- The response is a tuple `("useing Llama-3.1-8B-Instruct for your !", in_prompt, res)`, which FastAPI will serialize as a JSON array. An OpenAI-compatible response should be a JSON object (e.g., a `ChatCompletion` object).
- There is a typo "useing" in the response string.
- The comments on lines 16 and 17 are redundant and can be removed for cleaner code.
For a demo aiming for OpenAI compatibility, it would be better to align with the standard API. The suggestion below improves naming and the response format. For a more compliant API, consider using POST with a request body model.
```diff
- @app.get("/v1/chat/completions/{in_prompt}") # serve as OPENAI /v1/chat/completions endpoint
- async def root(self, in_prompt: str): # make endpoint async
-     # Asynchronous call to a method of that deployment (executes remotely) used remote
-     res = await self._engine_handle.chat.remote(in_prompt)
-     return "useing Llama-3.1-8B-Instruct for your !", in_prompt, res
+ @app.get("/v1/chat/completions/{in_prompt}") # serve as an OpenAI-like /v1/chat/completions endpoint
+ async def chat_completions(self, in_prompt: str):
+     res = await self._engine_handle.chat.remote(in_prompt)
+     return {"model": "Llama-3.1-8B-Instruct", "prompt": in_prompt, "response": res}
```
```diff
@@ -0,0 +1,51 @@
+import ray
+import requests
```
```python
        self.engine = sglang.Engine(**self.engine_kwargs)

    async def chat(self, message: str):
        print('In SGLangServer CHAT with message', message)
```
Using `print` for logging is generally discouraged in library code. It's better to use the `logging` module, which allows for configurable log levels, formatting, and output streams. You can add `import logging` and `logger = logging.getLogger(__name__)` at the top of the file.
```diff
- print('In SGLangServer CHAT with message', message)
+ logger.info(f'In SGLangServer CHAT with message: {message}')
```
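For completeness, the module-level setup this suggestion assumes (standard library `logging`, placed near the other imports) would look like:

```python
import logging

# Module-level logger for this file; replaces direct print() calls.
logger = logging.getLogger(__name__)
```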
```python
res = await self.engine.async_generate(
    prompt = message,
    stream = False
)
```
Per PEP 8, keyword arguments should not use spaces around the equals sign (`prompt=message`, `stream=False`).
```python
sglangServer = SGLangServer.bind()
my_App = MyFastAPIDeployment.bind(sglangServer)
handle: DeploymentHandle = serve.run(my_App, blocking = True)
```
There are a few style issues here that go against PEP 8 guidelines:
- Variable names `sglangServer` and `my_App` should be `snake_case` (e.g., `sglang_server`, `my_app`).
- There should not be spaces around the equals sign in `blocking = True`.
```diff
- sglangServer = SGLangServer.bind()
- my_App = MyFastAPIDeployment.bind(sglangServer)
- handle: DeploymentHandle = serve.run(my_App, blocking = True)
+ sglang_server = SGLangServer.bind()
+ my_app = MyFastAPIDeployment.bind(sglang_server)
+ handle: DeploymentHandle = serve.run(my_app, blocking=True)
```
```python
def __init__(self):

    self.engine_kwargs = dict(
        model_path = "/scratch2/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct/",
```
Bug: Hardcoded Dev Path Breaks Production Deployment
The hardcoded personal development path `/scratch2/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct/` is specific to one development environment and will not work for other users or in production. This appears to be accidentally committed personal configuration.
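One way to avoid shipping a personal default at all (an illustrative sketch assuming a `MODEL_PATH` environment variable; not part of this PR) is to fail fast when the path is not configured:

```python
import os

# Read the model location from the environment and fail early with a clear
# message instead of falling back to a developer-specific path.
model_path = os.environ.get("MODEL_PATH")
if not model_path:
    raise RuntimeError(
        "Set MODEL_PATH to a local model directory before running this demo."
    )
```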
```python
async def root(self, in_prompt: str): # make endpoint async
    # Asynchronous call to a method of that deployment (executes remotely) used remote
    res = await self._engine_handle.chat.remote(in_prompt)
    return "useing Llama-3.1-8B-Instruct for your !", in_prompt, res
```
Bug: OpenAI API: Incorrect Response Format Causing Failures
The endpoint claims to serve as an OpenAI /v1/chat/completions endpoint (see the comment on line 15), but returns a tuple `("useing Llama-3.1-8B-Instruct for your !", in_prompt, res)` instead of the proper OpenAI chat completions response format. The OpenAI API requires a specific JSON structure with fields like `id`, `object`, `created`, `model`, and `choices`. Returning a tuple will cause any OpenAI-compatible client to fail when parsing the response.
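For reference, a sketch of the `chat.completion` JSON object an OpenAI-compatible client expects, written here as a Python dict with placeholder values (field names follow the public OpenAI API documentation):

```python
# Shape of an OpenAI-style chat completion response (placeholder values).
{
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1700000000,
    "model": "Llama-3.1-8B-Instruct",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "..."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 9, "completion_tokens": 12, "total_tokens": 21},
}
```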