[STORY] Health Endpoint Cluster Awareness #430

@jsbattig

Description

Part of: #408

Story: Health Endpoint Cluster Awareness

[Conversation Reference: "if ONTAP fails we are in emergency mode, we are totally down"]

Story Overview

Objective: Extend the existing health endpoint to report cluster-specific status including node identity, leader/follower role, NFS mount health, PostgreSQL connectivity, and peer node visibility. The load balancer uses this endpoint for routing decisions -- an unhealthy node is removed from the pool.

User Value: Operations teams have clear visibility into each node's cluster status. Load balancers automatically route away from unhealthy nodes. Cluster problems are diagnosed quickly through structured health information.

Acceptance Criteria

AC1: Health Endpoint Reports Node Identity and Role

Scenario: The health response includes cluster membership information.

Given a CIDX node is running in cluster mode
When the health endpoint is called
Then the response includes:
  - node_id: this node's identifier
  - role: "leader" or "follower"
  - storage_mode: "postgres"
  - cluster_mode: true
  - uptime_seconds: time since node started

Technical Requirements:

  • Extend existing /health endpoint response
  • Add cluster section to health response JSON
  • Node ID from config.json
  • Role from LeaderElectionService.is_leader()
  • Storage mode from config
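
The node-identity section could be assembled roughly as follows. This is a minimal sketch, not the actual CIDX implementation: the function name `build_node_section` and the shape of the `config` dict are illustrative assumptions; the real role comes from `LeaderElectionService.is_leader()`.

```python
import time

def build_node_section(config: dict, is_leader: bool, start_time: float) -> dict:
    """Assemble the cluster-identity portion of the /health response.

    config      -- parsed config.json (node_id, storage_mode)
    is_leader   -- result of LeaderElectionService.is_leader()
    start_time  -- time.time() captured at process startup
    """
    return {
        "node_id": config["node_id"],
        "role": "leader" if is_leader else "follower",
        "storage_mode": config.get("storage_mode", "postgres"),
        "uptime_seconds": int(time.time() - start_time),
    }
```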

AC2: NFS Mount Health Check

Scenario: The health endpoint reports NFS mount status.

Given the NFS mount is configured
When the health check runs
Then it reports NFS mount status: healthy or unhealthy
And if the NFS mount is unavailable, the overall health status is UNHEALTHY
And the load balancer removes this node from the pool
And no fallback or degraded mode is attempted

Technical Requirements:

  • NFS check: os.path.ismount(mount_point) + stat test file
  • NFS failure = overall health UNHEALTHY (HTTP 503)
  • Health response includes: nfs_mount: { status: "healthy", mount_point: "/mnt/cidx-shared" }
  • NFS failure message: nfs_mount: { status: "unhealthy", error: "mount point not available" }
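
A sketch of the NFS probe described above, assuming the shared mount at `/mnt/cidx-shared`. The function name and the `.health-probe` test-file name are illustrative assumptions; `ismount()` catches a missing mount, while the `stat()` call catches a mount that exists but hangs or errors.

```python
import os

def check_nfs_mount(mount_point: str) -> dict:
    """Return the nfs_mount health fragment for the /health response."""
    if not os.path.ismount(mount_point):
        return {"status": "unhealthy", "error": "mount point not available"}
    try:
        # stat a known file on the mount to detect a present-but-dead mount
        os.stat(os.path.join(mount_point, ".health-probe"))
    except OSError:
        return {"status": "unhealthy", "error": "mount point not available"}
    return {"status": "healthy", "mount_point": mount_point}
```

Note that a hung NFS mount can block `stat()` indefinitely, so in practice this check would need to run under the endpoint's overall time budget (e.g. in a thread with a timeout).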

AC3: PostgreSQL Connectivity Health Check

Scenario: The health endpoint verifies PostgreSQL is reachable.

Given the server uses PostgreSQL for storage
When the health check runs
Then it executes a lightweight query against PostgreSQL (SELECT 1)
And if PostgreSQL is unreachable, the overall health status is UNHEALTHY
And the response includes connection pool statistics (active, idle, waiting)

Technical Requirements:

  • PostgreSQL check: SELECT 1 via connection pool
  • Timeout: 5 seconds for health check query
  • Pool stats: active connections, idle connections, pool size
  • PostgreSQL failure = overall health UNHEALTHY (HTTP 503)
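
The PostgreSQL probe could be structured like this sketch. The callables `run_query` and `get_pool_stats` are assumptions standing in for the real connection pool (which would execute `SELECT 1` under the 5-second timeout and expose active/idle/size counts):

```python
def check_postgresql(run_query, get_pool_stats) -> dict:
    """Return the postgresql health fragment.

    run_query      -- executes a SQL string via the pool (caller enforces 5 s timeout)
    get_pool_stats -- returns (active, idle, size) for the pool
    """
    try:
        run_query("SELECT 1")
    except Exception as exc:
        return {"status": "unhealthy", "error": str(exc)}
    active, idle, size = get_pool_stats()
    return {
        "status": "healthy",
        "pool_active": active,
        "pool_idle": idle,
        "pool_size": size,
    }
```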

AC4: Peer Node Visibility

Scenario: The health endpoint reports the status of all known cluster nodes.

Given the cluster_nodes table tracks all nodes
When the health endpoint is called
Then the response includes a list of all cluster nodes
And each node entry shows: node_id, hostname, status, last_heartbeat, is_leader
And stale nodes (heartbeat > 30s old) are flagged

Technical Requirements:

  • Query cluster_nodes table for all registered nodes
  • Flag stale nodes: last_heartbeat < NOW() - INTERVAL '30 seconds'
  • Mark current leader based on which node holds the advisory lock
  • Response: peers: [{ node_id, hostname, status, last_heartbeat, stale: bool }]
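
The staleness flag mirrors the SQL predicate `last_heartbeat < NOW() - INTERVAL '30 seconds'`. A sketch of applying it to rows fetched from `cluster_nodes` (the function name is illustrative; rows are assumed to carry timezone-aware datetimes):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(seconds=30)

def flag_stale_peers(rows, now=None):
    """Add a 'stale' flag to each cluster_nodes row.

    rows -- dicts with node_id, hostname, status, last_heartbeat (aware datetime)
    now  -- injectable clock for testing; defaults to current UTC time
    """
    now = now or datetime.now(timezone.utc)
    return [
        {**row, "stale": (now - row["last_heartbeat"]) > STALE_AFTER}
        for row in rows
    ]
```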

AC5: Load Balancer Compatibility

Scenario: The health endpoint returns appropriate HTTP status codes for load balancer health checks.

Given a load balancer polls the health endpoint
When all checks pass
Then HTTP 200 is returned with status: "healthy"
When any critical check fails (NFS, PostgreSQL)
Then HTTP 503 is returned with status: "unhealthy"
And the response body includes which checks failed

Technical Requirements:

  • HTTP 200 = healthy (node should receive traffic)
  • HTTP 503 = unhealthy (node should be removed from pool)
  • Response body always includes detailed check results (regardless of status)
  • Health check execution time under 1 second total
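
The 200/503 decision reduces to: any failed critical check takes the node out of the pool. A minimal sketch (names are illustrative; a missing critical check is deliberately treated as a failure):

```python
CRITICAL_CHECKS = ("nfs_mount", "postgresql")  # failure of either removes the node

def overall_status(checks: dict) -> tuple:
    """Map per-check results to (http_status, status_string).

    checks -- the 'checks' section of the health response; a critical check
              that is absent or not 'healthy' counts as failed.
    """
    failed = [
        name for name in CRITICAL_CHECKS
        if checks.get(name, {}).get("status") != "healthy"
    ]
    return (503, "unhealthy") if failed else (200, "healthy")
```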

AC6: Standalone Mode Compatibility

Scenario: In standalone mode, the health endpoint works without cluster checks.

Given the server is running in standalone mode
When the health endpoint is called
Then it returns the existing health information
And cluster-specific checks (NFS, peers, leader) are not included
And the response includes: cluster_mode: false

Technical Requirements:

  • Cluster checks skipped in standalone mode
  • Existing health response preserved
  • cluster_mode: false in response for standalone
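
Backward compatibility can be kept by building the response additively: start from the existing standalone payload, always stamp `cluster_mode`, and only merge the cluster sections when cluster mode is on. A sketch (function name is an assumption):

```python
def build_health_response(base: dict, cluster_mode: bool, cluster_section=None) -> dict:
    """Extend the existing health payload without breaking standalone clients.

    base            -- the existing standalone health response
    cluster_section -- dict with node/checks/peers keys, used only in cluster mode
    """
    response = dict(base)                      # existing fields preserved as-is
    response["cluster_mode"] = cluster_mode
    if cluster_mode and cluster_section:
        response.update(cluster_section)       # append cluster-only sections
    return response
```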

Implementation Status

  • Core implementation complete
  • Unit tests passing
  • Integration tests passing
  • E2E tests passing
  • Code review approved
  • Manual E2E testing completed
  • Documentation updated

Technical Implementation Details

Health Response Schema (Cluster Mode)

{
    "status": "healthy",
    "cluster_mode": true,
    "node": {
        "node_id": "cidx-node-01",
        "role": "leader",
        "storage_mode": "postgres",
        "uptime_seconds": 3621
    },
    "checks": {
        "nfs_mount": {
            "status": "healthy",
            "mount_point": "/mnt/cidx-shared"
        },
        "postgresql": {
            "status": "healthy",
            "pool_active": 3,
            "pool_idle": 7,
            "pool_size": 10
        },
        "leader_election": {
            "status": "healthy",
            "is_leader": true,
            "lock_held": true
        }
    },
    "peers": [
        {
            "node_id": "cidx-node-01",
            "hostname": "cidx-server-01",
            "status": "active",
            "last_heartbeat": "2026-03-12T10:30:00Z",
            "is_leader": true,
            "stale": false
        },
        {
            "node_id": "cidx-node-02",
            "hostname": "cidx-server-02",
            "status": "active",
            "last_heartbeat": "2026-03-12T10:29:55Z",
            "is_leader": false,
            "stale": false
        }
    ]
}

Health Check Decision Matrix

| Check       | Healthy                                      | Unhealthy                             | Impact                         |
| ----------- | -------------------------------------------- | ------------------------------------- | ------------------------------ |
| NFS mount   | ismount() + stat OK                          | Mount missing or unresponsive         | Node DOWN (503)                |
| PostgreSQL  | SELECT 1 returns                             | Connection failed or timeout          | Node DOWN (503)                |
| Leader lock | Lock held (leader) or not needed (follower)  | Lock connection dropped (leader only) | WARNING (leader demotes)       |
| Peer nodes  | All peers have fresh heartbeat               | Some peers stale                      | INFO (no impact on this node)  |
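
The matrix above can be expressed as data, which keeps the severity policy in one place. This is an illustrative sketch (the mapping and function names are assumptions, not the shipped code); only "down" affects the HTTP status, while "warning" and "info" are surfaced in the response body:

```python
# Failure impact per check: "down" => 503, "warning"/"info" => 200 with details
CHECK_IMPACT = {
    "nfs_mount": "down",
    "postgresql": "down",
    "leader_election": "warning",  # leader demotes itself; followers unaffected
    "peers": "info",               # stale peers do not affect this node
}

def worst_impact(failed_checks):
    """Return the failed check with the most severe impact, or None."""
    order = {"info": 0, "warning": 1, "down": 2}
    return max(failed_checks, key=lambda c: order[CHECK_IMPACT[c]], default=None)
```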

Integration with Existing Health Endpoint

The existing /health endpoint is extended, not replaced. In standalone mode, the response is backward-compatible. In cluster mode, additional sections are appended.

Testing Requirements

  • Automated: Health endpoint returns 200 when all checks pass.
  • Automated: Health endpoint returns 503 when NFS check fails.
  • Automated: Health endpoint returns 503 when PostgreSQL check fails.
  • Automated: Peer node listing includes all registered nodes.
  • Automated: Stale peer detection flags nodes with old heartbeats.
  • Automated: Standalone mode returns cluster_mode: false with no cluster checks.
  • Manual E2E: In cluster mode, verify health endpoint on both leader and follower. Unmount NFS on one node, verify its health returns 503. Restore mount, verify health returns 200.

Definition of Done

  • Health endpoint reports node identity and role
  • NFS mount health check returns 503 on failure
  • PostgreSQL connectivity check returns 503 on failure
  • Peer node visibility with stale detection
  • Load balancer compatible (200/503 HTTP status codes)
  • Standalone mode backward compatible
  • Health check total execution under 1 second
  • All tests pass
