Helm: add Liveness/Readiness for stac-proxy deployment #140

@thenav56

Description

Problem

If an error occurs during application startup or at runtime, the container logs the error but the pod is not restarted automatically.
This leaves the pod in a broken state until it is restarted manually.

Error log

INFO:     Will watch for changes in these directories: ['/app']
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Started reloader process [1] using StatReload
DEBUG:    Mounted app at /stac
DEBUG:    CORS: credentials enabled with wildcard origins, using origin reflection
INFO:     CORS: handling locally (allow_origins=['*'], allow_methods=['*'], allow_credentials=True)
INFO:     Started server process [8]
INFO:     Waiting for application startup.
DEBUG:    Appending required conformance for collections filter
DEBUG:    Appending required conformance for items filter
INFO:     Running upstream server health checks...
INFO:     Upstream API 'http://montandon-eoapi-stac:8080/' is healthy
ERROR:    Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/starlette/routing.py", line 694, in lifespan
    async with self.lifespan_context(app) as maybe_state:
               ~~~~~~~~~~~~~~~~~~~~~^^^^^
  File "/usr/local/lib/python3.13/contextlib.py", line 214, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/fastapi/routing.py", line 201, in merged_lifespan
    async with original_context(app) as maybe_original_state:
               ~~~~~~~~~~~~~~~~^^^^^
  File "/usr/local/lib/python3.13/contextlib.py", line 214, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/src/stac_auth_proxy/lifespan.py", line 141, in lifespan
    await check_server_healths(
        settings.upstream_url, settings.oidc_discovery_internal_url
    )
  File "/app/src/stac_auth_proxy/lifespan.py", line 24, in check_server_healths
    await check_server_health(url)
  File "/app/src/stac_auth_proxy/lifespan.py", line 47, in check_server_health
    response.raise_for_status()
  File "/usr/local/lib/python3.13/site-packages/httpx/_models.py", line 829, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)

httpx.HTTPStatusError: Server error '502 Bad Gateway' for url 'https://goadmin-stage.ifrc.org/o/.well-known/openid-configuration'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/502

ERROR:    Application startup failed. Exiting.

Note

The logs show "ERROR: Application startup failed. Exiting.", but the container does not actually exit: the pod keeps running and appears healthy when checked through the Kubernetes API. A likely cause, judging from the log above, is that uvicorn is started with --reload: the reloader process (PID 1) stays alive even after the server process ([8]) fails during startup, so the container's main process never terminates.
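If the reloader is indeed the reason the container survives a startup failure, one fix is to run uvicorn directly (without --reload) in the container, so a failed lifespan kills PID 1 and Kubernetes restarts the pod. A minimal sketch of the container spec; the module path, container name, and image value are assumptions, not taken from the actual chart:

```yaml
# Hypothetical container spec for the stac-proxy deployment.
# Without --reload, uvicorn itself is PID 1; a startup (lifespan)
# failure exits the process and the pod is restarted per restartPolicy.
containers:
  - name: stac-proxy                     # assumed container name
    image: "{{ .Values.stacProxy.image }}"  # assumed values key
    command:
      - uvicorn
      - stac_auth_proxy.app:app          # assumed ASGI entry point
      - --host
      - "0.0.0.0"
      - --port
      - "8000"
```

With the default restartPolicy: Always, this alone gives crash-loop-with-backoff behavior on the 502 scenario described above, even before probes are added.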

Scenario

stac-auth-proxy runs in Azure AKS. When a node scales up or down, both the stac-auth-proxy pod and its dependency, the go-api pod, are recreated simultaneously if they were running on the node that was scaled down.

Because the dependent service goadmin-stage.ifrc.org is temporarily unavailable during that window, the startup health check fails and stac-auth-proxy logs an error but does not exit.

The pod then stays in this failed state and does not restart automatically. Recovery only happens if:

  • the node scales again, or
  • someone manually deletes the pod.

Possible Solution

  • Exit the container when a startup error occurs so Kubernetes can restart the pod automatically.
  • Add proper liveness and readiness probes to ensure Kubernetes can detect unhealthy pods and restart them when needed.
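For the second bullet, a sketch of what the probes could look like in the Helm deployment template. This is illustrative only: the health endpoint path is an assumption (stac-auth-proxy may expose a different one, or none), and the timings would need tuning for the AKS scale-up window:

```yaml
# Sketch: probes for the stac-proxy container (port 8000 from the log above).
# /healthz is a hypothetical endpoint, not confirmed from the app.
livenessProbe:
  httpGet:
    path: /healthz          # assumed health endpoint
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3       # pod restarted after ~45s of failures
readinessProbe:
  httpGet:
    path: /healthz          # assumed health endpoint
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10
startupProbe:
  httpGet:
    path: /healthz          # assumed health endpoint
    port: 8000
  failureThreshold: 30      # tolerate up to ~5 min of upstream unavailability
  periodSeconds: 10
```

A startupProbe is worth considering here in addition to liveness/readiness: it gives the upstream (goadmin-stage.ifrc.org) time to come back during node scaling without the liveness probe killing the pod prematurely, while still restarting the pod if startup never succeeds.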

@batpad @alukach @pantierra

Metadata


Assignees: no one assigned

Labels

bug (Something isn't working)
