Commit 3347313

Merge pull request #245 from CerebriumAI/sglang
Sglang Example
2 parents b6b2abf + 6a4887c commit 3347313

8 files changed, +312 -20 lines changed

cerebrium/container-images/custom-web-servers.mdx

Lines changed: 2 additions & 1 deletion
@@ -52,7 +52,8 @@ The configuration requires three key parameters:
 <Info>
 For ASGI applications like FastAPI, include the appropriate server package
 (like `uvicorn`) in your dependencies. After deployment, your endpoints become
-available at `https://api.aws.us-east-1.cerebrium.ai/v4/[project-id]/[app-name]/your/endpoint`.
+available at
+`https://api.aws.us-east-1.cerebrium.ai/v4/[project-id]/[app-name]/your/endpoint`.
 </Info>

 Our [FastAPI Server Example](https://github.com/CerebriumAI/examples) provides a complete implementation.

cerebrium/getting-started/introduction.mdx

Lines changed: 9 additions & 3 deletions
@@ -50,7 +50,11 @@ We can then run this function in the cloud and pass it a prompt.
 cerebrium run main.py::run --prompt "Hello World!"
 ```

-Your should see logs that output the prompt you sent in - this is running in the cloud! Let us now turn this into a scalable REST endpoint.
+Your should see logs that output the prompt you sent in - this is running in the cloud!
+
+Use the `run` functionality for quick code iteration and testing snippets or once-off scripts that require large CPU/GPU in the cloud.
+
+Let us now turn this into a scalable REST endpoint - something we could put in production!

 ### 4. Deploy your app

@@ -60,11 +64,13 @@ Run the following command:
 cerebrium deploy
 ```

-This will turn the function into a callable endpoint that accepts json parameters (prompt) and can scale to 1000s of requests automatically!
+This will turn the function into a callable persistent [endpoint](/cerebrium/endpoints/inference-api). that accepts json parameters (prompt) and can scale to 1000s of requests automatically!

 Once deployed, an app becomes callable through a POST endpoint `https://api.aws.us-east-1.cerebrium.ai/v4/{project-id}/{app-name}/{function-name}` and takes a json parameter, prompt

-Great! You made it! Join our Community [Discord](https://discord.gg/ATj6USmeE2) for support and updates.
+Great! You made it!
+
+Join our Community [Discord](https://discord.gg/ATj6USmeE2) for support and updates.

 ## How It Works

cerebrium/scaling/graceful-termination.mdx

Lines changed: 17 additions & 14 deletions
@@ -15,7 +15,7 @@ When Cerebrium needs to terminate an contanier, we do the following:

 1. Stop routing new requests to the container.
 2. Send a SIGTERM signal to your container.
-3. Waits for `response_grace_period` seconds to elaspse.
+3. Waits for `response_grace_period` seconds to elaspse.
 4. Sends SIGKILL if the container hasn't stopped

 Below is a chart that shows it more eloquently:
@@ -24,30 +24,29 @@ Below is a chart that shows it more eloquently:
 flowchart TD
 A[SIGTERM sent] --> B[Cortex]
 A --> C[Custom Runtime]
-
+
 B --> D[automatically captured]
 C --> E[User needs to capture]
-
+
 D --> F[request finishes]
 D --> G[response_grace_period reached]
-
+
 E --> H[User logic]
-
+
 F --> I[Graceful termination]
 G --> J[SIGKILL]
-
+
 H --> O[Graceful termination]
 H --> G[response_grace_period reached]
-
+
 J --> K[Gateway Timeout Error]
 ```

 If you do not handle SIGTERM in the custom runtime, Cerebrium terminates containers immediately after sending `SIGTERM`, which can interrupt in-flight requests and cause **502 errors**.

-
 ## Example: FastAPI Implementation

-For custom runtimes using FastAPI, implement the [`lifespan` pattern](https://fastapi.tiangolo.com/advanced/events/) to respond to SIGTERM.
+For custom runtimes using FastAPI, implement the [`lifespan` pattern](https://fastapi.tiangolo.com/advanced/events/) to respond to SIGTERM.

 The code below tracks active requests using a counter and prevents new requests during shutdown. When SIGTERM is received, it sets a shutdown flag and waits for all active requests to complete before the application terminates.

@@ -63,11 +62,11 @@ lock = asyncio.Lock()
 @asynccontextmanager
 async def lifespan(app: FastAPI):
     yield # Application startup complete
-
+
     # Shutdown: runs when Cerebrium sends SIGTERM
     global shutting_down
     shutting_down = True
-
+
     # Wait for active requests to complete
     while active_requests > 0:
         await asyncio.sleep(1)
@@ -79,7 +78,7 @@ async def track_requests(request, call_next):
     global active_requests
     if shutting_down:
         raise HTTPException(503, "Shutting down")
-
+
     async with lock:
         active_requests += 1
     try:
@@ -96,12 +95,15 @@ In your Dockerfile:
 ```dockerfile
 ENTRYPOINT ["exec", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
 ```
+
 Or in cerebrium.toml:
+
 ```toml
 [cerebrium.runtime.custom]
 entrypoint = ["fastapi", "run", "app.py", "--port", "8000"]

 ```
+
 In bash scripts:

 ```bash
@@ -111,5 +113,6 @@ exec fastapi run app.py --port ${PORT:-8000}
 Without exec, SIGTERM is sent to the bash script (PID 1) instead of FastAPI, so your shutdown code never runs and Cerebrium force-kills the container after the grace period.

 <Tip>
-Test SIGTERM handling locally before deploying: start your app, send SIGTERM with `Ctrl+C`, and verify you see graceful shutdown logs.
-</Tip>
+Test SIGTERM handling locally before deploying: start your app, send SIGTERM
+with `Ctrl+C`, and verify you see graceful shutdown logs.
+</Tip>
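
The hunks for this file show only fragments of the FastAPI example. As a framework-free illustration of the same counter-and-flag idea, here is a minimal sketch using only `asyncio`. The names `RequestTracker`, `handle`, `drain`, and `demo` are hypothetical, and the `grace_period` deadline is an assumption added for this sketch; the loop shown in the diff waits without a deadline. The lock from the original is left out here because the counter update contains no `await` point.

```python
import asyncio


class RequestTracker:
    """Counts in-flight requests and refuses new ones once shutdown begins."""

    def __init__(self) -> None:
        self.active = 0
        self.shutting_down = False

    async def handle(self, work) -> str:
        # Mirror the middleware: reject new work while shutting down.
        if self.shutting_down:
            return "503 Shutting down"
        self.active += 1
        try:
            return await work()
        finally:
            self.active -= 1

    async def drain(self, grace_period: float, poll: float = 0.05) -> bool:
        """Set the shutdown flag, then wait for in-flight work.

        Returns True if everything finished before the (assumed) deadline.
        """
        self.shutting_down = True
        loop = asyncio.get_running_loop()
        deadline = loop.time() + grace_period
        while self.active > 0:
            if loop.time() >= deadline:
                return False
            await asyncio.sleep(poll)
        return True


async def demo() -> tuple[str, str, bool]:
    tracker = RequestTracker()

    async def slow() -> str:
        await asyncio.sleep(0.2)  # stand-in for a real request
        return "200 OK"

    in_flight = asyncio.create_task(tracker.handle(slow))
    await asyncio.sleep(0)  # let the request start before shutdown begins
    drain_task = asyncio.create_task(tracker.drain(grace_period=2.0))
    await asyncio.sleep(0)  # let drain() set the shutdown flag
    rejected = await tracker.handle(slow)  # arrives during shutdown
    drained = await drain_task
    return await in_flight, rejected, drained


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the request that was already running completes, the one that arrives after the flag is set is refused, and `drain` reports whether the wait finished inside the deadline.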

cerebrium/scaling/scaling-apps.mdx

Lines changed: 7 additions & 1 deletion
@@ -79,7 +79,13 @@ During normal replica operation, this simply corresponds to a request timeout va
 waits for the specified grace period, issues a SIGKILL command if the instance has not stopped, and kills any active requests with a GatewayTimeout error.

 <Note>
-When using the Cortex runtime (default), SIGTERM signals are automatically handled to allow graceful termination of requests. For custom runtimes, you'll need to implement SIGTERM handling yourself to ensure requests complete gracefully before termination. See our [Graceful Termination guide](/cerebrium/scaling/graceful-termination) for detailed implementation examples, including FastAPI patterns for tracking and completing in-flight requests during shutdown.
+When using the Cortex runtime (default), SIGTERM signals are automatically
+handled to allow graceful termination of requests. For custom runtimes, you'll
+need to implement SIGTERM handling yourself to ensure requests complete
+gracefully before termination. See our [Graceful Termination
+guide](/cerebrium/scaling/graceful-termination) for detailed implementation
+examples, including FastAPI patterns for tracking and completing in-flight
+requests during shutdown.
 </Note>

 Performance metrics available through the dashboard help monitor scaling behavior:
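
The termination sequence described in this hunk (SIGTERM, wait for the grace period, then SIGKILL) can be imitated locally. The following is a minimal sketch, assuming a POSIX system. `terminate_with_grace`, `run_case`, and the child script are hypothetical stand-ins for illustration, not Cerebrium APIs. Because `Ctrl+C` in a terminal delivers SIGINT rather than SIGTERM, sending the signal programmatically exercises the SIGTERM path directly.

```python
import subprocess
import sys
import textwrap

# Hypothetical stand-in for a server process. With argument "1" it exits
# cleanly on SIGTERM; otherwise it ignores SIGTERM to model a stuck container.
CHILD = textwrap.dedent("""
    import signal, sys, time

    if sys.argv[1] == "1":
        signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
    else:
        signal.signal(signal.SIGTERM, signal.SIG_IGN)

    print("ready", flush=True)
    while True:
        time.sleep(0.1)
""")


def terminate_with_grace(proc: subprocess.Popen, grace_period: float) -> str:
    """Send SIGTERM, wait up to grace_period seconds, then fall back to SIGKILL."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_period)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()
        return "killed"


def run_case(handle_sigterm: bool, grace_period: float = 2.0) -> str:
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, "1" if handle_sigterm else "0"],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # block until the child has installed its handler
    outcome = terminate_with_grace(proc, grace_period)
    proc.stdout.close()
    return outcome


if __name__ == "__main__":
    print(run_case(handle_sigterm=True))
    print(run_case(handle_sigterm=False, grace_period=0.5))
```

Replacing the child script with a real server command would let the same harness check whether a custom runtime stops within its grace period.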

docs.json

Lines changed: 1 addition & 1 deletion
@@ -111,7 +111,7 @@
 "pages": [
 "v4/examples/gpt-oss",
 "v4/examples/openai-compatible-endpoint-vllm",
-"v4/examples/streaming-falcon-7B"
+"v4/examples/sglang"
 ]
 },
 {

images/sglang-arch.png

236 KB

images/sglang_advertisement.jpg

98.3 KB
