Description
Part of: #408
Story: Health Endpoint Cluster Awareness
[Conversation Reference: "if ONTAP fails we are in emergency mode, we are totally down"]
Story Overview
Objective: Extend the existing health endpoint to report cluster-specific status including node identity, leader/follower role, NFS mount health, PostgreSQL connectivity, and peer node visibility. The load balancer uses this endpoint for routing decisions -- an unhealthy node is removed from the pool.
User Value: Operations teams have clear visibility into each node's cluster status. Load balancers automatically route away from unhealthy nodes. Cluster problems are diagnosed quickly through structured health information.
Acceptance Criteria
AC1: Health Endpoint Reports Node Identity and Role
Scenario: The health response includes cluster membership information.
Given a CIDX node is running in cluster mode
When the health endpoint is called
Then the response includes:
- node_id: this node's identifier
- role: "leader" or "follower"
- storage_mode: "postgres"
- cluster_mode: true
- uptime_seconds: time since node started

Technical Requirements:
- Extend the existing /health endpoint response: add a cluster section to the health response JSON
- Node ID from config.json
- Role from LeaderElectionService.is_leader()
- Storage mode from config
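The requirements above could be assembled roughly as follows. This is a sketch, not the actual implementation: the helper name, config keys, and call site are assumptions, with the role expected to come from `LeaderElectionService.is_leader()`.

```python
import time

# Hypothetical helper; function name and config keys are assumptions.
def build_node_section(config: dict, is_leader: bool, started_at: float) -> dict:
    """Assemble the node identity block for the /health response."""
    return {
        "node_id": config["node_id"],                   # from config.json
        "role": "leader" if is_leader else "follower",  # from leader election
        "storage_mode": config.get("storage_mode", "postgres"),
        "cluster_mode": True,
        "uptime_seconds": int(time.time() - started_at),
    }
```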
AC2: NFS Mount Health Check
Scenario: The health endpoint reports NFS mount status.
Given the NFS mount is configured
When the health check runs
Then it reports NFS mount status: healthy or unhealthy
And if the NFS mount is unavailable, the overall health status is UNHEALTHY
And the load balancer removes this node from the pool
And no fallback or degraded mode is attempted

Technical Requirements:
- NFS check: os.path.ismount(mount_point) plus a stat of a test file
- NFS failure = overall health UNHEALTHY (HTTP 503)
- Healthy response: nfs_mount: { status: "healthy", mount_point: "/mnt/cidx-shared" }
- Failure response: nfs_mount: { status: "unhealthy", error: "mount point not available" }
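A minimal sketch of the NFS check described above; the test-file name is an assumption:

```python
import os

def check_nfs_mount(mount_point: str, test_file: str = ".cidx-health") -> dict:
    """Verify the shared NFS mount is present and responsive.

    A missing or unresponsive mount yields an unhealthy result, which
    makes the overall health UNHEALTHY (HTTP 503) with no fallback.
    """
    if not os.path.ismount(mount_point):
        return {"status": "unhealthy", "error": "mount point not available"}
    try:
        # stat a known file so an unresponsive export is also detected
        os.stat(os.path.join(mount_point, test_file))
    except OSError as exc:
        return {"status": "unhealthy", "error": str(exc)}
    return {"status": "healthy", "mount_point": mount_point}
```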
AC3: PostgreSQL Connectivity Health Check
Scenario: The health endpoint verifies PostgreSQL is reachable.
Given the server uses PostgreSQL for storage
When the health check runs
Then it executes a lightweight query against PostgreSQL (SELECT 1)
And if PostgreSQL is unreachable, the overall health status is UNHEALTHY
And the response includes connection pool statistics (active, idle, waiting)

Technical Requirements:
- PostgreSQL check: SELECT 1 via the connection pool
- Timeout: 5 seconds for the health check query
- Pool stats: active connections, idle connections, pool size
- PostgreSQL failure = overall health UNHEALTHY (HTTP 503)
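One way the SELECT 1 check could look. The pool interface (getconn/putconn plus the stats attributes) is an assumption to adapt to the real pool implementation, and the 5-second budget is assumed to be enforced by a statement or connect timeout configured on the pool itself:

```python
def check_postgresql(pool) -> dict:
    """Run a lightweight SELECT 1 and report pool statistics.

    `pool` is assumed to expose getconn()/putconn() plus active/idle/
    size attributes; any failure maps to overall UNHEALTHY (HTTP 503).
    """
    try:
        conn = pool.getconn()
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                cur.fetchone()
        finally:
            pool.putconn(conn)
    except Exception as exc:
        return {"status": "unhealthy", "error": str(exc)}
    return {
        "status": "healthy",
        "pool_active": pool.active,
        "pool_idle": pool.idle,
        "pool_size": pool.size,
    }
```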
AC4: Peer Node Visibility
Scenario: The health endpoint reports the status of all known cluster nodes.
Given the cluster_nodes table tracks all nodes
When the health endpoint is called
Then the response includes a list of all cluster nodes
And each node entry shows: node_id, hostname, status, last_heartbeat, is_leader
And stale nodes (heartbeat > 30s old) are flagged

Technical Requirements:
- Query the cluster_nodes table for all registered nodes
- Flag stale nodes: last_heartbeat < NOW() - INTERVAL '30 seconds'
- Mark the current leader based on which node holds the advisory lock
- Response: peers: [{ node_id, hostname, status, last_heartbeat, stale: bool }]
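The staleness flag can equivalently be computed in application code after fetching the rows; this sketch assumes the rows come back as dicts with the field names used in this story:

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(seconds=30)

def flag_stale_peers(rows, now=None):
    """Annotate cluster_nodes rows with the `stale` flag.

    `rows` are dicts in the shape of the cluster_nodes table; output
    follows the peers schema in this story.
    """
    now = now or datetime.now(timezone.utc)
    return [
        {
            "node_id": row["node_id"],
            "hostname": row["hostname"],
            "status": row["status"],
            "last_heartbeat": row["last_heartbeat"].isoformat(),
            # stale when the last heartbeat is more than 30 s old
            "stale": now - row["last_heartbeat"] > STALE_AFTER,
        }
        for row in rows
    ]
```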
AC5: Load Balancer Compatibility
Scenario: The health endpoint returns appropriate HTTP status codes for load balancer health checks.
Given a load balancer polls the health endpoint
When all checks pass
Then HTTP 200 is returned with status: "healthy"
When any critical check fails (NFS, PostgreSQL)
Then HTTP 503 is returned with status: "unhealthy"
And the response body includes which checks failed

Technical Requirements:
- HTTP 200 = healthy (node should receive traffic)
- HTTP 503 = unhealthy (node should be removed from pool)
- Response body always includes detailed check results (regardless of status)
- Health check execution time under 1 second total
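The 200/503 mapping above reduces to a small pure function; this is an illustrative sketch, with NFS and PostgreSQL as the critical checks per the story:

```python
CRITICAL_CHECKS = ("nfs_mount", "postgresql")

def overall_health(checks: dict) -> tuple:
    """Map per-check results to the overall status and HTTP code.

    Any failed critical check makes the node unhealthy (503) so the
    load balancer removes it from the pool; the detailed check
    results are returned in the body either way.
    """
    failed = [name for name in CRITICAL_CHECKS
              if checks.get(name, {}).get("status") != "healthy"]
    if failed:
        return "unhealthy", 503
    return "healthy", 200
```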
AC6: Standalone Mode Compatibility
Scenario: In standalone mode, the health endpoint works without cluster checks.
Given the server is running in standalone mode
When the health endpoint is called
Then it returns the existing health information
And cluster-specific checks (NFS, peers, leader) are not included
And the response includes: cluster_mode: false

Technical Requirements:
- Cluster checks skipped in standalone mode
- Existing health response preserved
- cluster_mode: false in the response for standalone
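The branching between the two modes could be kept to a single merge point; the helper name and parameter shape here are assumptions:

```python
def build_health_response(base: dict, cluster_mode: bool, cluster_info=None) -> dict:
    """Extend the existing health payload only when in cluster mode.

    `base` is the existing standalone health response, preserved
    unchanged; `cluster_info` carries the node/checks/peers sections.
    """
    resp = dict(base)
    resp["cluster_mode"] = cluster_mode
    if cluster_mode and cluster_info:
        resp.update(cluster_info)
    return resp
```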
Implementation Status
- Core implementation complete
- Unit tests passing
- Integration tests passing
- E2E tests passing
- Code review approved
- Manual E2E testing completed
- Documentation updated
Technical Implementation Details
Health Response Schema (Cluster Mode)
```json
{
  "status": "healthy",
  "cluster_mode": true,
  "node": {
    "node_id": "cidx-node-01",
    "role": "leader",
    "storage_mode": "postgres",
    "uptime_seconds": 3621
  },
  "checks": {
    "nfs_mount": {
      "status": "healthy",
      "mount_point": "/mnt/cidx-shared"
    },
    "postgresql": {
      "status": "healthy",
      "pool_active": 3,
      "pool_idle": 7,
      "pool_size": 10
    },
    "leader_election": {
      "status": "healthy",
      "is_leader": true,
      "lock_held": true
    }
  },
  "peers": [
    {
      "node_id": "cidx-node-01",
      "hostname": "cidx-server-01",
      "status": "active",
      "last_heartbeat": "2026-03-12T10:30:00Z",
      "is_leader": true,
      "stale": false
    },
    {
      "node_id": "cidx-node-02",
      "hostname": "cidx-server-02",
      "status": "active",
      "last_heartbeat": "2026-03-12T10:29:55Z",
      "is_leader": false,
      "stale": false
    }
  ]
}
```

Health Check Decision Matrix
| Check | Healthy | Unhealthy | Impact |
|---|---|---|---|
| NFS mount | ismount() + stat OK | Mount missing or unresponsive | Node DOWN (503) |
| PostgreSQL | SELECT 1 returns | Connection failed or timeout | Node DOWN (503) |
| Leader lock | Lock held (leader) or not needed (follower) | Lock connection dropped (leader only) | WARNING (leader demotes) |
| Peer nodes | All peers have fresh heartbeat | Some peers stale | INFO (no impact on this node) |
Integration with Existing Health Endpoint
The existing /health endpoint is extended, not replaced. In standalone mode, the response is backward-compatible. In cluster mode, additional sections are appended.
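To stay under the 1-second total budget from AC5, the individual checks could run concurrently rather than sequentially. A sketch, not the actual implementation; note that a timed-out check's thread keeps running in the background, so the executor is shut down without waiting:

```python
from concurrent.futures import ThreadPoolExecutor

def run_checks(check_fns, budget_s: float = 1.0) -> dict:
    """Run all health checks concurrently to bound total latency.

    `check_fns` maps a check name to a zero-arg callable returning a
    result dict. A check that raises or exceeds the budget is
    reported unhealthy.
    """
    executor = ThreadPoolExecutor(max_workers=max(1, len(check_fns)))
    futures = {name: executor.submit(fn) for name, fn in check_fns.items()}
    results = {}
    for name, fut in futures.items():
        try:
            results[name] = fut.result(timeout=budget_s)
        except Exception as exc:
            results[name] = {"status": "unhealthy",
                             "error": str(exc) or type(exc).__name__}
    executor.shutdown(wait=False)  # do not block the response on a hung check
    return results
```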
Testing Requirements
- Automated: Health endpoint returns 200 when all checks pass.
- Automated: Health endpoint returns 503 when NFS check fails.
- Automated: Health endpoint returns 503 when PostgreSQL check fails.
- Automated: Peer node listing includes all registered nodes.
- Automated: Stale peer detection flags nodes with old heartbeats.
- Automated: Standalone mode returns cluster_mode: false with no cluster checks.
- Manual E2E: In cluster mode, verify health endpoint on both leader and follower. Unmount NFS on one node, verify its health returns 503. Restore mount, verify health returns 200.
Definition of Done
- Health endpoint reports node identity and role
- NFS mount health check returns 503 on failure
- PostgreSQL connectivity check returns 503 on failure
- Peer node visibility with stale detection
- Load balancer compatible (200/503 HTTP status codes)
- Standalone mode backward compatible
- Health check total execution under 1 second
- All tests pass