51 changes: 51 additions & 0 deletions python/ray/llm/_internal/serve/engines/sglang/sglang_engine.py
@@ -0,0 +1,51 @@
import ray
import requests
Contributor
medium

The requests module is imported but never used. It should be removed to keep the code clean.

from fastapi import FastAPI
from ray import serve
from ray.serve.handle import DeploymentHandle

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class MyFastAPIDeployment:
    def __init__(self, engine_handle: DeploymentHandle):
        self._engine_handle = engine_handle

    @app.get("/v1/chat/completions/{in_prompt}") # serve as OPENAI /v1/chat/completions endpoint
    async def root(self, in_prompt: str): # make endpoint async
        # Asynchronous call to a method of that deployment (executes remotely) used remote
        res = await self._engine_handle.chat.remote(in_prompt)
        return "useing Llama-3.1-8B-Instruct for your !", in_prompt, res
Contributor
high

This endpoint is commented as serving as an OpenAI /v1/chat/completions endpoint, but its implementation differs significantly from the OpenAI API specification:

  • It uses GET instead of POST. Chat completions typically involve longer inputs that are sent in the request body.
  • The prompt is passed as a URL path parameter, which is not standard.
  • The method name root is not very descriptive. A name like chat_completions would be more appropriate.
  • The response is a tuple ("useing Llama-3.1-8B-Instruct for your !", in_prompt, res), which FastAPI will serialize as a JSON array. An OpenAI-compatible response should be a JSON object (e.g., ChatCompletion object).
  • There is a typo "useing" in the response string.
  • The comments on lines 16 and 17 are redundant and can be removed for cleaner code.

For a demo aiming for OpenAI compatibility, it would be better to align with the standard API. The suggestion below improves the naming and the response format; for a more compliant API, consider using POST with a request body model, as sketched after the suggestion.

Suggested change
    @app.get("/v1/chat/completions/{in_prompt}") # serve as OPENAI /v1/chat/completions endpoint
    async def root(self, in_prompt: str): # make endpoint async
        # Asynchronous call to a method of that deployment (executes remotely) used remote
        res = await self._engine_handle.chat.remote(in_prompt)
        return "useing Llama-3.1-8B-Instruct for your !", in_prompt, res
    @app.get("/v1/chat/completions/{in_prompt}") # serve as an OpenAI-like /v1/chat/completions endpoint
    async def chat_completions(self, in_prompt: str):
        res = await self._engine_handle.chat.remote(in_prompt)
        return {"model": "Llama-3.1-8B-Instruct", "prompt": in_prompt, "response": res}
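For a more OpenAI-like demo, the ingress class could accept a POST body instead. A minimal sketch, assuming a hypothetical ChatRequest Pydantic model; the response shape is illustrative, not the full ChatCompletion object:

from pydantic import BaseModel


class ChatRequest(BaseModel):
    prompt: str
    model: str = "Llama-3.1-8B-Instruct"  # illustrative default, not read from the engine


@serve.deployment
@serve.ingress(app)
class MyFastAPIDeployment:
    def __init__(self, engine_handle: DeploymentHandle):
        self._engine_handle = engine_handle

    @app.post("/v1/chat/completions")
    async def chat_completions(self, request: ChatRequest):
        # Forward the prompt from the validated JSON request body to the engine deployment.
        res = await self._engine_handle.chat.remote(request.prompt)
        return {"model": request.model, "prompt": request.prompt, "response": res}

FastAPI validates the JSON body against ChatRequest, which is closer to how clients call the real /v1/chat/completions endpoint.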


@serve.deployment
class SGLangServer:
    def __init__(self):

        self.engine_kwargs = dict(
            model_path = "/scratch2/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct/",
            mem_fraction_static = 0.8,
            tp_size = 8,
        )
Contributor
critical

This block has a couple of issues:

  1. Hardcoded path: The model_path is hardcoded, which makes the demo non-portable and means it will fail on other machines. It's better to read the path from an environment variable. You'll need to add import os at the top of the file (a sketch of the resulting import block follows the suggestion below).
  2. Style: According to PEP 8, there should be no spaces around the equals sign for keyword arguments, including those passed to the dict() constructor.
Suggested change
        self.engine_kwargs = dict(
            model_path = "/scratch2/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct/",
            mem_fraction_static = 0.8,
            tp_size = 8,
        )
        self.engine_kwargs = dict(
            model_path=os.environ.get("MODEL_PATH", "/scratch2/huggingface/hub/meta-llama/Llama-3.1-8B-Instruct/"),
            mem_fraction_static=0.8,
            tp_size=8,
        )
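For context, the top-of-file import block this suggestion implies would look roughly like the sketch below (it also drops the unused requests import flagged above):

import os

import ray
from fastapi import FastAPI
from ray import serve
from ray.serve.handle import DeploymentHandle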


        try:
            import sglang
        except ImportError as e:
            raise ImportError(
                "SGLang is not installed or failed to import. Please run "
                "`pip install sglang[all]` to install required dependencies."
            ) from e
        self.engine = sglang.Engine(**self.engine_kwargs)

    async def chat(self, message: str):
        print('In SGLangServer CHAT with message', message)
Contributor
medium

Using print for logging is generally discouraged in library code. It's better to use the logging module, which allows for configurable log levels, formatting, and output streams. You can add import logging and logger = logging.getLogger(__name__) at the top of the file.

Suggested change
        print('In SGLangServer CHAT with message', message)
        logger.info(f'In SGLangServer CHAT with message: {message}')
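For completeness, the module-level setup this suggestion relies on is plain standard-library logging, e.g.:

import logging

# Module-level logger used by the deployment methods.
logger = logging.getLogger(__name__)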

Author
This can be ignored.

Contributor
I think the change to logger is reasonable.

        res = await self.engine.async_generate(
            prompt = message,
            stream = False
        )
Comment on lines +43 to +46
Contributor
medium

According to PEP 8, there should be no spaces around the equals sign for keyword arguments.

Suggested change
        res = await self.engine.async_generate(
            prompt = message,
            stream = False
        )
        res = await self.engine.async_generate(
            prompt=message,
            stream=False
        )

        return {"echo": res}

sglangServer = SGLangServer.bind()
my_App = MyFastAPIDeployment.bind(sglangServer)
handle: DeploymentHandle = serve.run(my_App, blocking = True)
Contributor
medium

There are a few style issues here that go against PEP 8 guidelines:

  • Variable names sglangServer and my_App should be snake_case (e.g., sglang_server, my_app).
  • There should not be spaces around the equals sign in blocking = True.
Suggested change
sglangServer = SGLangServer.bind()
my_App = MyFastAPIDeployment.bind(sglangServer)
handle: DeploymentHandle = serve.run(my_App, blocking = True)
sglang_server = SGLangServer.bind()
my_app = MyFastAPIDeployment.bind(sglang_server)
handle: DeploymentHandle = serve.run(my_app, blocking=True)
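
Once serve.run is blocking in this script, the demo route can be exercised from another process, for example (illustrative; assumes Ray Serve's default HTTP address of http://localhost:8000 and the GET route shown above):

import requests

# Query the demo endpoint; in this version the prompt travels as a URL path parameter.
resp = requests.get("http://localhost:8000/v1/chat/completions/Hello")
print(resp.json())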