Updated pre-emption

milo157 · milo157 · commit 595e7fba4f6a · 2025-10-08T20:15:07.000-04:00
diff --git a/cerebrium/scaling/graceful-termination.mdx b/cerebrium/scaling/graceful-termination.mdx
@@ -9,78 +9,43 @@ Cerebrium runs in a shared, multi-tenant environment. To efficiently scale, opti
 
 ## Understanding Instance Termination
 
-For both application autoscaling and our own internal node scaling, we will send your application a SIGTERM signal, as a warning to the application that we are intending to shut down this instance. For Cortex applications, this is handled. On custom runtimes, should you wish to gracefully shut down, you will need to catch and handle this signal. Once at least responseGracePeriod has elapsed, we will send your application a SIGKILL signal, terminating the instance immediately.
+For both application autoscaling and our own internal node scaling, we send your application a SIGTERM signal as a warning that we intend to shut down this instance. For Cortex applications (Cerebrium's default runtime), this signal is handled automatically. On custom runtimes, you need to catch and handle this signal yourself if you want to shut down gracefully. Once at least `response_grace_period` has elapsed, we send your application a SIGKILL signal, terminating the instance immediately.
 
-When Cerebrium needs to terminate an instance:
+When Cerebrium needs to terminate a container, we do the following:
 
-1. Stops routing new requests to the instance
-2. Sends SIGTERM signal to your container
-3. Waits for `response_grace_period` seconds (if configured)
-4. Sends SIGKILL if the instance hasn't stopped
+1. Stop routing new requests to the container.
+2. Send a SIGTERM signal to your container.
+3. Wait for `response_grace_period` seconds to elapse.
+4. Send a SIGKILL signal if the container hasn't stopped.
+
+The flowchart below illustrates this process:
 
 ```mermaid
 flowchart TD
     A[SIGTERM sent] --> B[Cortex]
-    A --> C[custom runtime]
+    A --> C[Custom Runtime]
     
     B --> D[automatically captured]
-    C --> E[user needs to capture]
-    
-    D --> F{request finishes}
-    D --> G{response_grace_period reached}
+    C --> E[User needs to capture]
     
-    E --> H{request busy}
-    E --> I{container idle}
+    D --> F[request finishes]
+    D --> G[response_grace_period reached]
     
-    F --> J[graceful termination]
-    G --> K[SIGKILL]
-    K --> L[gateway timeout error]
+    E --> H[User logic]
     
-    H --> M[Return 503]
-    I --> N[Return 200]
+    F --> I[Graceful termination]
+    G --> J[SIGKILL]
     
-    M --> O{response_grace_period reached}
-    M --> P{request finishes - mark as}
+    H --> O[Graceful termination]
+    H --> G[response_grace_period reached]
     
-    O --> Q[SIGKILL]
-    Q --> R[gateway timeout error]
-    
-    P --> S[Return 200]
-    N --> T[graceful termination]
-    S --> T
+    J --> K[Gateway Timeout Error]
 ```
 
-Without `response_grace_period` configured, Cerebrium terminates instances immediately after sending `SIGTERM`, which can interrupt in-flight requests and cause **502 errors**.
-
-<Warning>
-If `response_grace_period` is **unset or set to 0**, requests may end abruptly during scale-down or redeploys, resulting in failed responses. Set this to roughly **1.5 × your longest expected request duration**.
-</Warning>
-
-
-```toml
-[cerebrium.scaling]
-# Example: 300 seconds allows long-running requests to complete
-response_grace_period = 300
-```
-
-## Runtime Requirements
-
-Cortex runtime (default):
-SIGTERM is handled automatically. Configure response_grace_period only — no code changes required.
-
-Custom runtimes (FastAPI, Flask, etc.):
-You must implement SIGTERM handling and configure response_grace_period. Both are required.
-
-```toml
-[cerebrium.scaling]
-response_grace_period = 300
+If you do not handle SIGTERM in a custom runtime, Cerebrium terminates containers immediately after sending `SIGTERM`, which can interrupt in-flight requests and cause **502 errors**.
 
-[cerebrium.runtime.custom]
-port = 8000
-entrypoint = ["fastapi", "run", "app.py"]
-```
 
-## FastAPI Implementation
+## Example: FastAPI Implementation
 
 For custom runtimes using FastAPI, implement the [`lifespan` pattern](https://fastapi.tiangolo.com/advanced/events/) to respond to SIGTERM. 
 
@@ -124,8 +89,6 @@ async def track_requests(request, call_next):
             active_requests -= 1
 ```
 
-## Critical: Use exec in Entrypoint
-
 Your entrypoint must use exec or SIGTERM won't reach your application:
 
 In your Dockerfile:
@@ -149,23 +112,4 @@ Without exec, SIGTERM is sent to the bash script (PID 1) instead of FastAPI, so
 
 <Tip>
 Test SIGTERM handling locally before deploying: start your app, send SIGTERM with `Ctrl+C`, and verify you see graceful shutdown logs.
-</Tip>
-
-## GPU Resource Cleanup
-
-For GPU applications, explicitly release resources during shutdown:
-
-```python
-@asynccontextmanager
-async def lifespan(app: FastAPI):
-    model = load_model()
-    yield
-    # Cleanup when SIGTERM received
-    del model
-    torch.cuda.empty_cache()
-```
-
-## Related Resources 
-
-For general instance management and scaling configuration, see [Instance Management](https://docs.cerebrium.ai/cerebrium/scaling/scaling-apps#instance-management).
-
+</Tip>
diff --git a/cerebrium/scaling/scaling-apps.mdx b/cerebrium/scaling/scaling-apps.mdx
@@ -79,11 +79,7 @@ During normal replica operation, this simply corresponds to a request timeout va
 waits for the specified grace period, issues a SIGKILL command if the instance has not stopped, and kills any active requests with a GatewayTimeout error.
 
 <Note>
-  When using the cortex runtime (default) the SIGTERM signal is captured and the
-  app is given a chance to complete requests before being terminated. When using
-  a custom runtime, it is the responsibility of the user to handle the SIGTERM
-  signal and ensure that the app is given a chance to complete requests before
-  being terminated.
+  When using the Cortex runtime (default), SIGTERM signals are automatically handled to allow graceful termination of requests. For custom runtimes, you'll need to implement SIGTERM handling yourself to ensure requests complete gracefully before termination. See our [Graceful Termination guide](/cerebrium/scaling/graceful-termination) for detailed implementation examples, including FastAPI patterns for tracking and completing in-flight requests during shutdown.
 </Note>
 
 Performance metrics available through the dashboard help monitor scaling behavior:
diff --git a/docs.json b/docs.json
@@ -47,6 +47,7 @@
                 "group": "Scaling apps",
                 "pages": [
                   "cerebrium/scaling/scaling-apps",
+                  "cerebrium/scaling/graceful-termination",
                   "cerebrium/scaling/batching-concurrency"
                 ]
               },
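
Note on the custom-runtime path described in `graceful-termination.mdx`: the "User logic" box in the flowchart can be sketched without any framework. This is a minimal standard-library illustration, not Cerebrium's or Cortex's implementation. The class name `GracefulShutdown` and its methods are hypothetical, and the drain timeout is assumed to be chosen below the configured `response_grace_period`.

```python
import signal
import threading
import time


class GracefulShutdown:
    """Hypothetical helper: tracks in-flight requests and drains on SIGTERM."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.draining = False  # set to True once SIGTERM has been received

    def install(self):
        # Register the handler; signal.signal must be called from the main thread.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Keep the handler trivial: only set a flag, do no blocking work here.
        self.draining = True

    def begin_request(self):
        """Return False when draining, so the caller can reject (e.g. with a 503)."""
        with self._lock:
            if self.draining:
                return False
            self._active += 1
            return True

    def end_request(self):
        with self._lock:
            self._active -= 1

    def wait_for_drain(self, timeout, poll=0.05):
        """Block until no requests are active or `timeout` seconds have passed."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            with self._lock:
                if self._active == 0:
                    return True
            time.sleep(poll)
        with self._lock:
            return self._active == 0
```

In a real service, `begin_request` and `end_request` would wrap each request (much like the `track_requests` middleware in the FastAPI example), and the process would exit once `wait_for_drain` returns, ideally before the SIGKILL that follows `response_grace_period`.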
