---
title: "Preemption and Graceful Termination"
description: "Implementing Graceful Termination of Instances by Handling Termination Signals"
---

## Graceful Termination

Cerebrium runs in a shared, multi-tenant environment. To scale efficiently, optimize compute usage, and roll out updates, the platform continuously adjusts its capacity, spinning down nodes and launching new ones as needed; during this process, workloads are migrated to new nodes. In addition, your application has its own metric-based autoscaling criteria that dictate when instances scale or remain active, and instances are replaced during new app deployments. To prevent requests from ending prematurely when an instance is marked for termination, you need to implement graceful termination.

## Understanding Instance Termination

For both application autoscaling and our own internal node scaling, Cerebrium sends your application a SIGTERM signal as a warning that the instance is about to shut down. For Cortex applications, this signal is handled automatically. On custom runtimes, you need to catch and handle the signal yourself if you want to shut down gracefully. Once at least `response_grace_period` seconds have elapsed, Cerebrium sends a SIGKILL signal, terminating the instance immediately.

When Cerebrium needs to terminate an instance, it:

1. Stops routing new requests to the instance
2. Sends SIGTERM to your container
3. Waits for `response_grace_period` seconds (if configured)
4. Sends SIGKILL if the instance hasn't stopped

```mermaid
flowchart TD
    A[SIGTERM sent] --> B[Cortex]
    A --> C[custom runtime]

    B --> D[automatically captured]
    C --> E[user needs to capture]

    D --> F{request finishes}
    D --> G{response_grace_period reached}

    E --> H{request busy}
    E --> I{container idle}

    F --> J[graceful termination]
    G --> K[SIGKILL]
    K --> L[gateway timeout error]

    H --> M[Return 503]
    I --> N[Return 200]

    M --> O{response_grace_period reached}
    M --> P{request finishes}

    O --> Q[SIGKILL]
    Q --> R[gateway timeout error]

    P --> S[Return 200]
    N --> T[graceful termination]
    S --> T
```

Without `response_grace_period` configured, Cerebrium terminates instances immediately after sending `SIGTERM`, which can interrupt in-flight requests and cause **502 errors**.

<Warning>
If `response_grace_period` is **unset or set to 0**, requests may end abruptly during scale-down or redeploys, resulting in failed responses. Set it to roughly **1.5 × your longest expected request duration**.
</Warning>

```toml
[cerebrium.scaling]
# Example: 300 seconds allows long-running requests to complete
response_grace_period = 300
```

## Runtime Requirements

**Cortex runtime (default):**
SIGTERM is handled automatically. Configure `response_grace_period`; no code changes are required.

**Custom runtimes (FastAPI, Flask, etc.):**
You must implement SIGTERM handling and configure `response_grace_period`. Both are required.

```toml
[cerebrium.scaling]
response_grace_period = 300

[cerebrium.runtime.custom]
port = 8000
entrypoint = ["fastapi", "run", "app.py"]
```

## FastAPI Implementation

For custom runtimes using FastAPI, implement the [`lifespan` pattern](https://fastapi.tiangolo.com/advanced/events/) to respond to SIGTERM.

The code below tracks active requests with a counter and rejects new requests during shutdown. When SIGTERM is received, it sets a shutdown flag and waits for all active requests to complete before the application terminates.

```python
from contextlib import asynccontextmanager
import asyncio

from fastapi import FastAPI
from fastapi.responses import JSONResponse

active_requests = 0
shutting_down = False
lock = asyncio.Lock()

@asynccontextmanager
async def lifespan(app: FastAPI):
    yield  # Application startup complete

    # Shutdown: runs when Cerebrium sends SIGTERM
    global shutting_down
    shutting_down = True

    # Wait for in-flight requests to drain before exiting
    while active_requests > 0:
        await asyncio.sleep(1)

app = FastAPI(lifespan=lifespan)

@app.middleware("http")
async def track_requests(request, call_next):
    global active_requests
    if shutting_down:
        # Exceptions raised in middleware bypass FastAPI's exception
        # handlers, so return a response directly instead of raising
        return JSONResponse({"detail": "Shutting down"}, status_code=503)

    async with lock:
        active_requests += 1
    try:
        return await call_next(request)
    finally:
        async with lock:
            active_requests -= 1
```

## Critical: Use exec in Entrypoint

If your entrypoint goes through a shell (a shell-form `ENTRYPOINT` or a wrapper script), it must use `exec`, or SIGTERM is delivered to the shell rather than your application.

In your Dockerfile, prefer the exec form, which already runs your server as PID 1 (no `exec` keyword needed):

```dockerfile
ENTRYPOINT ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Or in `cerebrium.toml`:

```toml
[cerebrium.runtime.custom]
entrypoint = ["fastapi", "run", "app.py", "--port", "8000"]
```

In bash scripts:

```bash
exec fastapi run app.py --port ${PORT:-8000}
```

Without `exec`, SIGTERM is delivered to the bash script (PID 1) instead of your FastAPI process, so your shutdown code never runs and Cerebrium force-kills the container after the grace period.

<Tip>
Test SIGTERM handling locally before deploying: start your app, send SIGTERM with `kill -TERM <pid>` (note that `Ctrl+C` sends SIGINT, not SIGTERM), and verify you see graceful shutdown logs.
</Tip>
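
One way to script that check (a local sketch using a hypothetical stand-in app, not a Cerebrium tool): launch the process, send SIGTERM, and confirm it logs a shutdown message and exits cleanly rather than waiting to be killed.

```python
import signal
import subprocess
import sys
import time

# Hypothetical stand-in for your app: installs a SIGTERM handler,
# prints a log line, and exits 0. Replace with your real entrypoint.
child_code = """
import signal, sys, time
def on_term(signum, frame):
    print("graceful shutdown", flush=True)
    sys.exit(0)
signal.signal(signal.SIGTERM, on_term)
while True:
    time.sleep(0.1)
"""

proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE, text=True,
)
time.sleep(0.5)                 # give the child time to install its handler
proc.send_signal(signal.SIGTERM)
out, _ = proc.communicate(timeout=5)
print(out.strip())              # expect the graceful-shutdown log line
print("exit code:", proc.returncode)
```

An exit code of 0 with your shutdown log present means the handler ran; a nonzero code or missing log suggests the signal never reached your application (for example, because a wrapper script didn't use `exec`).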

## GPU Resource Cleanup

For GPU applications, explicitly release resources during shutdown:

```python
from contextlib import asynccontextmanager

import torch
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    model = load_model()  # your model-loading function
    yield
    # Cleanup when SIGTERM is received
    del model
    torch.cuda.empty_cache()
```

## Related Resources

For general instance management and scaling configuration, see [Instance Management](https://docs.cerebrium.ai/cerebrium/scaling/scaling-apps#instance-management).