cerebrium/scaling/graceful-termination.mdx
+21 -77 (21 additions & 77 deletions)
@@ -9,78 +9,43 @@ Cerebrium runs in a shared, multi-tenant environment. To efficiently scale, opti

 ## Understanding Instance Termination

-For both application autoscaling and our own internal node scaling, we will send your application a SIGTERM signal, as a warning to the application that we are intending to shut down this instance. For Cortex applications, this is handled. On custom runtimes, should you wish to gracefully shut down, you will need to catch and handle this signal. Once at least responseGracePeriod has elapsed, we will send your application a SIGKILL signal, terminating the instance immediately.
+For both application autoscaling and our own internal node scaling, we will send your application a SIGTERM signal, as a warning to the application that we are intending to shut down this instance. For Cortex applications (Cerebrium's default runtime), this is handled. On custom runtimes, should you wish to gracefully shut down, you will need to catch and handle this signal. Once at least `response_grace_period` has elapsed, we will send your application a SIGKILL signal, terminating the instance immediately.

-When Cerebrium needs to terminate an instance:
+When Cerebrium needs to terminate a container, we do the following:

-1. Stops routing new requests to the instance
-2. Sends SIGTERM signal to your container
-3. Waits for `response_grace_period` seconds (if configured)
-4. Sends SIGKILL if the instance hasn't stopped
+1. Stop routing new requests to the container.
+2. Send a SIGTERM signal to your container.
+3. Wait for `response_grace_period` seconds to elapse.
+4. Send SIGKILL if the container hasn't stopped.
+
+The flowchart below illustrates this sequence:

 ```mermaid
 flowchart TD
 A[SIGTERM sent] --> B[Cortex]
-A --> C[custom runtime]
+A --> C[Custom Runtime]

 B --> D[automatically captured]
-C --> E[user needs to capture]
-
-D --> F{request finishes}
-D --> G{response_grace_period reached}
+C --> E[User needs to capture]

-E --> H{request busy}
-E --> I{container idle}
+D --> F[request finishes]
+D --> G[response_grace_period reached]

-F --> J[graceful termination]
-G --> K[SIGKILL]
-K --> L[gateway timeout error]
+E --> H[User logic]

-H --> M[Return 503]
-I --> N[Return 200]
+F --> I[Graceful termination]
+G --> J[SIGKILL]

-M --> O{response_grace_period reached}
-M --> P{request finishes - mark as}
+H --> O[Graceful termination]
+H --> G[response_grace_period reached]

-O --> Q[SIGKILL]
-Q --> R[gateway timeout error]
-
-P --> S[Return 200]
-N --> T[graceful termination]
-S --> T
+J --> K[Gateway Timeout Error]
 ```
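
The Custom Runtime branch of the flowchart leaves SIGTERM handling to your own code. As a minimal, framework-agnostic sketch (not Cerebrium's implementation), one way to do this in Python is to record the signal in a flag and stop picking up new work once it is set. The names `shutdown_requested`, `handle_sigterm`, `install_sigterm_handler`, and `serve` are hypothetical and used only for illustration.

```python
import signal
import threading

# Hypothetical names for illustration; none of this is a Cerebrium API.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record that SIGTERM arrived; the serving loop decides when to exit."""
    shutdown_requested.set()


def install_sigterm_handler():
    # signal.signal may only be called from the main thread of the process.
    signal.signal(signal.SIGTERM, handle_sigterm)


def serve(jobs, process):
    """Process jobs in order, but stop taking new ones after SIGTERM.

    A job that has already started always runs to completion, which mirrors
    the "request finishes -> graceful termination" path in the chart.
    Returns the list of jobs that were completed.
    """
    completed = []
    for job in jobs:
        if shutdown_requested.is_set():
            break  # stop accepting new work
        process(job)
        completed.append(job)
    return completed
```

Calling `install_sigterm_handler()` once at startup wires the handler in. The work in progress still has to finish within `response_grace_period`, because SIGKILL cannot be caught or delayed.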

-Without `response_grace_period` configured, Cerebrium terminates instances immediately after sending `SIGTERM`, which can interrupt in-flight requests and cause **502 errors**.
-
-<Warning>
-If `response_grace_period` is **unset or set to 0**, requests may end abruptly during scale-down or redeploys, resulting in failed responses. Set this to roughly **1.5 × your longest expected request duration**.
-</Warning>
-
-
-```toml
-[cerebrium.scaling]
-# Example: 300 seconds allows long-running requests to complete
-response_grace_period = 300
-```
-
-## Runtime Requirements
-
-Cortex runtime (default):
-SIGTERM is handled automatically. Configure response_grace_period only; no code changes required.
-
-Custom runtimes (FastAPI, Flask, etc.):
-You must implement SIGTERM handling and configure response_grace_period. Both are required.
-
-```toml
-[cerebrium.scaling]
-response_grace_period = 300
+If you do not handle SIGTERM in the custom runtime, Cerebrium terminates containers immediately after sending `SIGTERM`, which can interrupt in-flight requests and cause **502 errors**.

-[cerebrium.runtime.custom]
-port = 8000
-entrypoint = ["fastapi", "run", "app.py"]
-```

-## FastAPI Implementation
+## Example: FastAPI Implementation

 For custom runtimes using FastAPI, implement the [`lifespan` pattern](https://fastapi.tiangolo.com/advanced/events/) to respond to SIGTERM.
Your entrypoint must use exec or SIGTERM won't reach your application:

 In your Dockerfile:
@@ -149,23 +112,4 @@ Without exec, SIGTERM is sent to the bash script (PID 1) instead of FastAPI, so

 <Tip>
 Test SIGTERM handling locally before deploying: start your app, send SIGTERM with `kill -TERM <pid>` (`Ctrl+C` sends SIGINT, not SIGTERM), and verify you see graceful shutdown logs.
-</Tip>
-
-## GPU Resource Cleanup
-
-For GPU applications, explicitly release resources during shutdown:
-
-```python
-@asynccontextmanager
-async def lifespan(app: FastAPI):
-    model = load_model()
-    yield
-    # Cleanup when SIGTERM received
-    del model
-    torch.cuda.empty_cache()
-```
-
-## Related Resources
-
-For general instance management and scaling configuration, see [Instance Management](https://docs.cerebrium.ai/cerebrium/scaling/scaling-apps#instance-management).
cerebrium/scaling/scaling-apps.mdx
+1 -5 (1 addition & 5 deletions)
@@ -79,11 +79,7 @@ During normal replica operation, this simply corresponds to a request timeout va
 waits for the specified grace period, issues a SIGKILL command if the instance has not stopped, and kills any active requests with a GatewayTimeout error.

 <Note>
-When using the cortex runtime (default) the SIGTERM signal is captured and the
-app is given a chance to complete requests before being terminated. When using
-a custom runtime, it is the responsibility of the user to handle the SIGTERM
-signal and ensure that the app is given a chance to complete requests before
-being terminated.
+When using the Cortex runtime (default), SIGTERM signals are automatically handled to allow graceful termination of requests. For custom runtimes, you'll need to implement SIGTERM handling yourself to ensure requests complete gracefully before termination. See our [Graceful Termination guide](/cerebrium/scaling/graceful-termination) for detailed implementation examples, including FastAPI patterns for tracking and completing in-flight requests during shutdown.
 </Note>
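
To make "tracking and completing in-flight requests" concrete, here is a small asyncio sketch. It is an illustration under stated assumptions, not the code from the linked guide. `InFlightTracker` and its `drain` method are hypothetical names, and `grace_period` stands in for the time your handler allows itself, which should stay below `response_grace_period`.

```python
import asyncio


class InFlightTracker:
    """Counts active requests so a shutdown hook can wait for them.

    Create the tracker inside a running event loop (for example in an
    application startup hook) so the internal Event binds to that loop.
    """

    def __init__(self):
        self._active = 0
        self._idle = asyncio.Event()
        self._idle.set()  # no requests yet, so we start idle

    async def __aenter__(self):
        self._active += 1
        self._idle.clear()
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self._active -= 1
        if self._active == 0:
            self._idle.set()

    async def drain(self, grace_period):
        """Wait up to `grace_period` seconds for active requests to finish.

        Returns True if everything finished in time, False otherwise.
        """
        try:
            await asyncio.wait_for(self._idle.wait(), timeout=grace_period)
            return True
        except asyncio.TimeoutError:
            return False
```

Each request handler would wrap its work in `async with tracker:`, and the shutdown path would call `await tracker.drain(...)` after SIGTERM. A `False` result means requests were still running when the budget ran out, which is the case that ends in SIGKILL.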

 Performance metrics available through the dashboard help monitor scaling behavior: