Commit 595e7fb

committed
Updated pre-emption
1 parent 71b606f commit 595e7fb

3 files changed

+23
-82
lines changed

cerebrium/scaling/graceful-termination.mdx

Lines changed: 21 additions & 77 deletions
Original file line numberDiff line numberDiff line change
@@ -9,78 +9,43 @@ Cerebrium runs in a shared, multi-tenant environment. To efficiently scale, opti
99

1010
## Understanding Instance Termination
1111

12-
For both application autoscaling and our own internal node scaling, we will send your application a SIGTERM signal, as a warning to the application that we are intending to shut down this instance. For Cortex applications, this is handled. On custom runtimes, should you wish to gracefully shut down, you will need to catch and handle this signal. Once at least responseGracePeriod has elapsed, we will send your application a SIGKILL signal, terminating the instance immediately.
12+
For both application autoscaling and our own internal node scaling, we will send your application a SIGTERM signal, as a warning to the application that we are intending to shut down this instance. For Cortex applications (Cerebrium's default runtime), this is handled. On custom runtimes, should you wish to gracefully shut down, you will need to catch and handle this signal. Once at least `response_grace_period` has elapsed, we will send your application a SIGKILL signal, terminating the instance immediately.
1313

14-
When Cerebrium needs to terminate an instance:
14+
When Cerebrium needs to terminate a container, we do the following:
1515

16-
1. Stops routing new requests to the instance
17-
2. Sends SIGTERM signal to your container
18-
3. Waits for `response_grace_period` seconds (if configured)
19-
4. Sends SIGKILL if the instance hasn't stopped
16+
1. Stop routing new requests to the container.
17+
2. Send a SIGTERM signal to your container.
18+
3. Wait for `response_grace_period` seconds to elapse.
19+
4. Send SIGKILL if the container hasn't stopped.
20+
21+
The flowchart below illustrates this process:
2022

2123
```mermaid
2224
flowchart TD
2325
A[SIGTERM sent] --> B[Cortex]
24-
A --> C[custom runtime]
26+
A --> C[Custom Runtime]
2527
2628
B --> D[automatically captured]
27-
C --> E[user needs to capture]
28-
29-
D --> F{request finishes}
30-
D --> G{response_grace_period reached}
29+
C --> E[User needs to capture]
3130
32-
E --> H{request busy}
33-
E --> I{container idle}
31+
D --> F[request finishes]
32+
D --> G[response_grace_period reached]
3433
35-
F --> J[graceful termination]
36-
G --> K[SIGKILL]
37-
K --> L[gateway timeout error]
34+
E --> H[User logic]
3835
39-
H --> M[Return 503]
40-
I --> N[Return 200]
36+
F --> I[Graceful termination]
37+
G --> J[SIGKILL]
4138
42-
M --> O{response_grace_period reached}
43-
M --> P{request finishes - mark as}
39+
H --> O[Graceful termination]
40+
H --> G[response_grace_period reached]
4441
45-
O --> Q[SIGKILL]
46-
Q --> R[gateway timeout error]
47-
48-
P --> S[Return 200]
49-
N --> T[graceful termination]
50-
S --> T
42+
J --> K[Gateway Timeout Error]
5143
```
5244

53-
Without `response_grace_period` configured, Cerebrium terminates instances immediately after sending `SIGTERM`, which can interrupt in-flight requests and cause **502 errors**.
54-
55-
<Warning>
56-
If `response_grace_period` is **unset or set to 0**, requests may end abruptly during scale-down or redeploys, resulting in failed responses. Set this to roughly **1.5 × your longest expected request duration**.
57-
</Warning>
58-
59-
60-
```toml
61-
[cerebrium.scaling]
62-
# Example: 300 seconds allows long-running requests to complete
63-
response_grace_period = 300
64-
```
65-
66-
## Runtime Requirements
67-
68-
Cortex runtime (default):
69-
SIGTERM is handled automatically. Configure response_grace_period only — no code changes required.
70-
71-
Custom runtimes (FastAPI, Flask, etc.):
72-
You must implement SIGTERM handling and configure response_grace_period. Both are required.
73-
74-
```toml
75-
[cerebrium.scaling]
76-
response_grace_period = 300
45+
If you do not handle SIGTERM in the custom runtime, Cerebrium terminates containers immediately after sending `SIGTERM`, which can interrupt in-flight requests and cause **502 errors**.
7746

78-
[cerebrium.runtime.custom]
79-
port = 8000
80-
entrypoint = ["fastapi", "run", "app.py"]
81-
```
8247

83-
## FastAPI Implementation
48+
## Example: FastAPI Implementation
8449

8550
For custom runtimes using FastAPI, implement the [`lifespan` pattern](https://fastapi.tiangolo.com/advanced/events/) to respond to SIGTERM.
8651

@@ -124,8 +89,6 @@ async def track_requests(request, call_next):
12489
active_requests -= 1
12590
```
12691

127-
## Critical: Use exec in Entrypoint
128-
12992
Your entrypoint must use exec or SIGTERM won't reach your application:
13093

13194
In your Dockerfile:
@@ -149,23 +112,4 @@ Without exec, SIGTERM is sent to the bash script (PID 1) instead of FastAPI, so
149112

150113
<Tip>
151114
Test SIGTERM handling locally before deploying: start your app, send SIGTERM with `kill -TERM <pid>` (`Ctrl+C` sends SIGINT, not SIGTERM), and verify you see graceful shutdown logs.
152-
</Tip>
153-
154-
## GPU Resource Cleanup
155-
156-
For GPU applications, explicitly release resources during shutdown:
157-
158-
```python
159-
@asynccontextmanager
160-
async def lifespan(app: FastAPI):
161-
model = load_model()
162-
yield
163-
# Cleanup when SIGTERM received
164-
del model
165-
torch.cuda.empty_cache()
166-
```
167-
168-
## Related Resources
169-
170-
For general instance management and scaling configuration, see [Instance Management](https://docs.cerebrium.ai/cerebrium/scaling/scaling-apps#instance-management).
171-
115+
</Tip>
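
For a custom runtime that is not built on FastAPI, the SIGTERM handling this page describes could be sketched with the standard library alone. The following is a minimal illustration, not Cerebrium's or Cortex's implementation: the `GracefulShutdown` class and its method names are hypothetical, and it assumes your server calls `request_started()` and `request_finished()` around each request.

```python
import signal
import threading
import time


class GracefulShutdown:
    """Hypothetical helper: tracks in-flight requests and records SIGTERM."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        # Plain boolean so the signal handler never has to take a lock.
        self.terminating = False

    def install(self):
        # signal.signal() must be called from the main thread.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Keep the handler tiny: only flip the flag.
        self.terminating = True

    def request_started(self):
        with self._lock:
            self._active += 1

    def request_finished(self):
        with self._lock:
            self._active -= 1

    def active(self):
        with self._lock:
            return self._active

    def drain(self, timeout, poll_interval=0.05):
        """Wait until no requests are active or `timeout` seconds pass.

        Returns True if every request finished in time, False otherwise.
        """
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if self.active() == 0:
                return True
            time.sleep(poll_interval)
        return self.active() == 0
```

In use, the main thread would call `install()` at startup, stop accepting new work once `terminating` becomes true, and then call `drain()` with a timeout somewhat shorter than `response_grace_period` so the process can exit on its own before SIGKILL arrives.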

cerebrium/scaling/scaling-apps.mdx

Lines changed: 1 addition & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -79,11 +79,7 @@ During normal replica operation, this simply corresponds to a request timeout va
7979
waits for the specified grace period, issues a SIGKILL command if the instance has not stopped, and kills any active requests with a GatewayTimeout error.
8080

8181
<Note>
82-
When using the cortex runtime (default) the SIGTERM signal is captured and the
83-
app is given a chance to complete requests before being terminated. When using
84-
a custom runtime, it is the responsibility of the user to handle the SIGTERM
85-
signal and ensure that the app is given a chance to complete requests before
86-
being terminated.
82+
When using the Cortex runtime (default), SIGTERM signals are automatically handled to allow graceful termination of requests. For custom runtimes, you'll need to implement SIGTERM handling yourself to ensure requests complete gracefully before termination. See our [Graceful Termination guide](/cerebrium/scaling/graceful-termination) for detailed implementation examples, including FastAPI patterns for tracking and completing in-flight requests during shutdown.
8783
</Note>
8884

8985
Performance metrics available through the dashboard help monitor scaling behavior:

docs.json

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -47,6 +47,7 @@
4747
"group": "Scaling apps",
4848
"pages": [
4949
"cerebrium/scaling/scaling-apps",
50+
"cerebrium/scaling/graceful-termination",
5051
"cerebrium/scaling/batching-concurrency"
5152
]
5253
},
