Reference
Problem
If an error happens during application startup or runtime, the container logs the error but the pod is not restarted automatically.
This leaves the pod in a broken state, and a manual restart is required.
Error log
INFO: Will watch for changes in these directories: ['/app']
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: Started reloader process [1] using StatReload
DEBUG: Mounted app at /stac
DEBUG: CORS: credentials enabled with wildcard origins, using origin reflection
INFO: CORS: handling locally (allow_origins=['*'], allow_methods=['*'], allow_credentials=True)
INFO: Started server process [8]
INFO: Waiting for application startup.
DEBUG: Appending required conformance for collections filter
DEBUG: Appending required conformance for items filter
INFO: Running upstream server health checks...
INFO: Upstream API 'http://montandon-eoapi-stac:8080/' is healthy
ERROR: Traceback (most recent call last):
File "/usr/local/lib/python3.13/site-packages/starlette/routing.py", line 694, in lifespan
async with self.lifespan_context(app) as maybe_state:
~~~~~~~~~~~~~~~~~~~~~^^^^^
File "/usr/local/lib/python3.13/contextlib.py", line 214, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/site-packages/fastapi/routing.py", line 201, in merged_lifespan
async with original_context(app) as maybe_original_state:
~~~~~~~~~~~~~~~~^^^^^
File "/usr/local/lib/python3.13/contextlib.py", line 214, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/app/src/stac_auth_proxy/lifespan.py", line 141, in lifespan
await check_server_healths(
settings.upstream_url, settings.oidc_discovery_internal_url
)
File "/app/src/stac_auth_proxy/lifespan.py", line 24, in check_server_healths
await check_server_health(url)
File "/app/src/stac_auth_proxy/lifespan.py", line 47, in check_server_health
response.raise_for_status()
File "/usr/local/lib/python3.13/site-packages/httpx/_models.py", line 829, in raise_for_status
raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Server error '502 Bad Gateway' for url 'https://goadmin-stage.ifrc.org/o/.well-known/openid-configuration'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/502
ERROR: Application startup failed. Exiting.
Note
The logs show "ERROR: Application startup failed. Exiting.", but the container does not actually exit: the pod keeps running and appears healthy when checked through the Kubernetes API. This is likely because the server runs under uvicorn's --reload supervisor (the "Started reloader process [1] using StatReload" line above): the worker process that failed startup exits, but the reloader process stays alive as PID 1, so the container never terminates.
Scenario
stac-auth-proxy runs in Azure AKS. When a node scales up or down, both the stac-auth-proxy pod and its go-api dependency pod are recreated at the same time if they were running on the node that was scaled down.
Because the dependent service goadmin-stage.ifrc.org is temporarily unavailable during that window, the startup health check fails; stac-auth-proxy logs the error but does not exit.
The pod then stays in this failed state and does not restart automatically. Recovery only happens if:
- the node scales again, or
- someone manually deletes the pod.
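One way out of this stuck state is for the process to treat a startup failure as fatal and exit with a nonzero code. A minimal sketch, assuming a FastAPI-style lifespan; `check_server_healths` here is a hypothetical stand-in for the real health check (which raises `httpx.HTTPStatusError` on a 502), not the project's actual function:

```python
import asyncio
import sys
from contextlib import asynccontextmanager


async def check_server_healths():
    # Hypothetical stand-in for the real upstream health check, which
    # raises httpx.HTTPStatusError when a dependency returns 502.
    raise RuntimeError("Server error '502 Bad Gateway'")


@asynccontextmanager
async def lifespan(app):
    try:
        await check_server_healths()
    except Exception as exc:
        print(f"ERROR: Application startup failed: {exc}", file=sys.stderr)
        # Exit with a nonzero code instead of leaving a broken process
        # running; with restartPolicy: Always, Kubernetes restarts the pod.
        sys.exit(1)
    yield


async def start():
    async with lifespan(None):
        pass


def run_startup() -> int:
    # Returns the exit code the process would terminate with.
    try:
        asyncio.run(start())
    except SystemExit as exc:
        return exc.code
    return 0
```

Here `run_startup()` returns 1, i.e. the container would terminate with a nonzero exit code that Kubernetes can act on. Note this only helps if the server is not run under a --reload supervisor that outlives the failed worker.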
Possible Solution
- Exit the container when a startup error occurs so Kubernetes can restart the pod automatically.
- Add proper liveness and readiness probes to ensure Kubernetes can detect unhealthy pods and restart them when needed.
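A sketch of what such probes might look like in the deployment manifest. The `/healthz` path and port 8000 are assumptions (the log shows uvicorn on port 8000, but stac-auth-proxy's actual health endpoint may differ):

```yaml
# Sketch only: adjust path/port to the health endpoint the proxy exposes.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /healthz
    port: 8000
  periodSeconds: 10
```

With a failing liveness probe, the kubelet restarts the container even if the process itself never exits, which would also recover the stuck state described above.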