[STORY] Health Endpoint Cluster Awareness #430

@jsbattig

Description

Part of: #408

Story: Health Endpoint Cluster Awareness

[Conversation Reference: "if ONTAP fails we are in emergency mode, we are totally down"]

Story Overview

Objective: Extend the existing health endpoint to report cluster-specific status including node identity, leader/follower role, NFS mount health, PostgreSQL connectivity, and peer node visibility. The load balancer uses this endpoint for routing decisions -- an unhealthy node is removed from the pool.

User Value: Operations teams have clear visibility into each node's cluster status. Load balancers automatically route away from unhealthy nodes. Cluster problems are diagnosed quickly through structured health information.

Acceptance Criteria

AC1: Health Endpoint Reports Node Identity and Role

Scenario: The health response includes cluster membership information.

Given a CIDX node is running in cluster mode
When the health endpoint is called
Then the response includes:
  - node_id: this node's identifier
  - role: "leader" or "follower"
  - storage_mode: "postgres"
  - cluster_mode: true
  - uptime_seconds: time since node started

Technical Requirements:

  • Extend existing /health endpoint response
  • Add cluster section to health response JSON
  • Node ID from config.json
  • Role from LeaderElectionService.is_leader()
  • Storage mode from config
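
The node-identity section could be assembled roughly as follows. This is a minimal sketch, not the actual CIDX implementation: the function name `build_node_section` and the shape of the `config` dict are illustrative assumptions; the real role comes from `LeaderElectionService.is_leader()`.

```python
import time

def build_node_section(config: dict, is_leader: bool, start_time: float) -> dict:
    """Assemble the cluster-identity portion of the /health response.

    config      -- parsed config.json (node_id, storage_mode)
    is_leader   -- result of LeaderElectionService.is_leader()
    start_time  -- time.time() captured at process startup
    """
    return {
        "node_id": config["node_id"],
        "role": "leader" if is_leader else "follower",
        "storage_mode": config.get("storage_mode", "postgres"),
        "uptime_seconds": int(time.time() - start_time),
    }
```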

AC2: NFS Mount Health Check

Scenario: The health endpoint reports NFS mount status.

Given the NFS mount is configured
When the health check runs
Then it reports NFS mount status: healthy or unhealthy
And if the NFS mount is unavailable, the overall health status is UNHEALTHY
And the load balancer removes this node from the pool
And no fallback or degraded mode is attempted

Technical Requirements:

  • NFS check: os.path.ismount(mount_point) + stat test file
  • NFS failure = overall health UNHEALTHY (HTTP 503)
  • Health response includes: nfs_mount: { status: "healthy", mount_point: "/mnt/cidx-shared" }
  • NFS failure message: nfs_mount: { status: "unhealthy", error: "mount point not available" }
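
A sketch of the NFS probe described above, assuming the shared mount at `/mnt/cidx-shared`. The function name and the `.health-probe` test-file name are illustrative assumptions; `ismount()` catches a missing mount, while the `stat()` call catches a mount that exists but hangs or errors.

```python
import os

def check_nfs_mount(mount_point: str) -> dict:
    """Return the nfs_mount health fragment for the /health response."""
    if not os.path.ismount(mount_point):
        return {"status": "unhealthy", "error": "mount point not available"}
    try:
        # stat a known file on the mount to detect a present-but-dead mount
        os.stat(os.path.join(mount_point, ".health-probe"))
    except OSError:
        return {"status": "unhealthy", "error": "mount point not available"}
    return {"status": "healthy", "mount_point": mount_point}
```

Note that a hung NFS mount can block `stat()` indefinitely, so in practice this check would need to run under the endpoint's overall time budget (e.g. in a thread with a timeout).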

AC3: PostgreSQL Connectivity Health Check

Scenario: The health endpoint verifies PostgreSQL is reachable.

Given the server uses PostgreSQL for storage
When the health check runs
Then it executes a lightweight query against PostgreSQL (SELECT 1)
And if PostgreSQL is unreachable, the overall health status is UNHEALTHY
And the response includes connection pool statistics (active, idle, waiting)

Technical Requirements:

  • PostgreSQL check: SELECT 1 via connection pool
  • Timeout: 5 seconds for health check query
  • Pool stats: active connections, idle connections, pool size
  • PostgreSQL failure = overall health UNHEALTHY (HTTP 503)
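
The PostgreSQL probe could be structured like this sketch. The callables `run_query` and `get_pool_stats` are assumptions standing in for the real connection pool (which would execute `SELECT 1` under the 5-second timeout and expose active/idle/size counts):

```python
def check_postgresql(run_query, get_pool_stats) -> dict:
    """Return the postgresql health fragment.

    run_query      -- executes a SQL string via the pool (caller enforces 5 s timeout)
    get_pool_stats -- returns (active, idle, size) for the pool
    """
    try:
        run_query("SELECT 1")
    except Exception as exc:
        return {"status": "unhealthy", "error": str(exc)}
    active, idle, size = get_pool_stats()
    return {
        "status": "healthy",
        "pool_active": active,
        "pool_idle": idle,
        "pool_size": size,
    }
```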

AC4: Peer Node Visibility

Scenario: The health endpoint reports the status of all known cluster nodes.

Given the cluster_nodes table tracks all nodes
When the health endpoint is called
Then the response includes a list of all cluster nodes
And each node entry shows: node_id, hostname, status, last_heartbeat, is_leader
And stale nodes (heartbeat > 30s old) are flagged

Technical Requirements:

  • Query cluster_nodes table for all registered nodes
  • Flag stale nodes: last_heartbeat < NOW() - INTERVAL '30 seconds'
  • Mark current leader based on which node holds the advisory lock
  • Response: peers: [{ node_id, hostname, status, last_heartbeat, stale: bool }]
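
The staleness flag mirrors the SQL predicate `last_heartbeat < NOW() - INTERVAL '30 seconds'`. A sketch of applying it to rows fetched from `cluster_nodes` (the function name is illustrative; rows are assumed to carry timezone-aware datetimes):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(seconds=30)

def flag_stale_peers(rows, now=None):
    """Add a 'stale' flag to each cluster_nodes row.

    rows -- dicts with node_id, hostname, status, last_heartbeat (aware datetime)
    now  -- injectable clock for testing; defaults to current UTC time
    """
    now = now or datetime.now(timezone.utc)
    return [
        {**row, "stale": (now - row["last_heartbeat"]) > STALE_AFTER}
        for row in rows
    ]
```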

AC5: Load Balancer Compatibility

Scenario: The health endpoint returns appropriate HTTP status codes for load balancer health checks.

Given a load balancer polls the health endpoint
When all checks pass
Then HTTP 200 is returned with status: "healthy"
When any critical check fails (NFS, PostgreSQL)
Then HTTP 503 is returned with status: "unhealthy"
And the response body includes which checks failed

Technical Requirements:

  • HTTP 200 = healthy (node should receive traffic)
  • HTTP 503 = unhealthy (node should be removed from pool)
  • Response body always includes detailed check results (regardless of status)
  • Health check execution time under 1 second total
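
The 200/503 decision reduces to: any failed critical check takes the node out of the pool. A minimal sketch (names are illustrative; a missing critical check is deliberately treated as a failure):

```python
CRITICAL_CHECKS = ("nfs_mount", "postgresql")  # failure of either removes the node

def overall_status(checks: dict) -> tuple:
    """Map per-check results to (http_status, status_string).

    checks -- the 'checks' section of the health response; a critical check
              that is absent or not 'healthy' counts as failed.
    """
    failed = [
        name for name in CRITICAL_CHECKS
        if checks.get(name, {}).get("status") != "healthy"
    ]
    return (503, "unhealthy") if failed else (200, "healthy")
```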

AC6: Standalone Mode Compatibility

Scenario: In standalone mode, the health endpoint works without cluster checks.

Given the server is running in standalone mode
When the health endpoint is called
Then it returns the existing health information
And cluster-specific checks (NFS, peers, leader) are not included
And the response includes: cluster_mode: false

Technical Requirements:

  • Cluster checks skipped in standalone mode
  • Existing health response preserved
  • cluster_mode: false in response for standalone
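
Backward compatibility can be kept by building the response additively: start from the existing standalone payload, always stamp `cluster_mode`, and only merge the cluster sections when cluster mode is on. A sketch (function name is an assumption):

```python
def build_health_response(base: dict, cluster_mode: bool, cluster_section=None) -> dict:
    """Extend the existing health payload without breaking standalone clients.

    base            -- the existing standalone health response
    cluster_section -- dict with node/checks/peers keys, used only in cluster mode
    """
    response = dict(base)                      # existing fields preserved as-is
    response["cluster_mode"] = cluster_mode
    if cluster_mode and cluster_section:
        response.update(cluster_section)       # append cluster-only sections
    return response
```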

Implementation Status

  • Core implementation complete
  • Unit tests passing
  • Integration tests passing
  • E2E tests passing
  • Code review approved
  • Manual E2E testing completed
  • Documentation updated

Technical Implementation Details

Health Response Schema (Cluster Mode)

{
    "status": "healthy",
    "cluster_mode": true,
    "node": {
        "node_id": "cidx-node-01",
        "role": "leader",
        "storage_mode": "postgres",
        "uptime_seconds": 3621
    },
    "checks": {
        "nfs_mount": {
            "status": "healthy",
            "mount_point": "/mnt/cidx-shared"
        },
        "postgresql": {
            "status": "healthy",
            "pool_active": 3,
            "pool_idle": 7,
            "pool_size": 10
        },
        "leader_election": {
            "status": "healthy",
            "is_leader": true,
            "lock_held": true
        }
    },
    "peers": [
        {
            "node_id": "cidx-node-01",
            "hostname": "cidx-server-01",
            "status": "active",
            "last_heartbeat": "2026-03-12T10:30:00Z",
            "is_leader": true,
            "stale": false
        },
        {
            "node_id": "cidx-node-02",
            "hostname": "cidx-server-02",
            "status": "active",
            "last_heartbeat": "2026-03-12T10:29:55Z",
            "is_leader": false,
            "stale": false
        }
    ]
}

Health Check Decision Matrix

| Check       | Healthy                                      | Unhealthy                             | Impact                         |
| ----------- | -------------------------------------------- | ------------------------------------- | ------------------------------ |
| NFS mount   | ismount() + stat OK                          | Mount missing or unresponsive         | Node DOWN (503)                |
| PostgreSQL  | SELECT 1 returns                             | Connection failed or timeout          | Node DOWN (503)                |
| Leader lock | Lock held (leader) or not needed (follower)  | Lock connection dropped (leader only) | WARNING (leader demotes)       |
| Peer nodes  | All peers have fresh heartbeat               | Some peers stale                      | INFO (no impact on this node)  |
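
The matrix above can be expressed as data, which keeps the severity policy in one place. This is an illustrative sketch (the mapping and function names are assumptions, not the shipped code); only "down" affects the HTTP status, while "warning" and "info" are surfaced in the response body:

```python
# Failure impact per check: "down" => 503, "warning"/"info" => 200 with details
CHECK_IMPACT = {
    "nfs_mount": "down",
    "postgresql": "down",
    "leader_election": "warning",  # leader demotes itself; followers unaffected
    "peers": "info",               # stale peers do not affect this node
}

def worst_impact(failed_checks):
    """Return the failed check with the most severe impact, or None."""
    order = {"info": 0, "warning": 1, "down": 2}
    return max(failed_checks, key=lambda c: order[CHECK_IMPACT[c]], default=None)
```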

Integration with Existing Health Endpoint

The existing /health endpoint is extended, not replaced. In standalone mode, the response is backward-compatible. In cluster mode, additional sections are appended.

Testing Requirements

  • Automated: Health endpoint returns 200 when all checks pass.
  • Automated: Health endpoint returns 503 when NFS check fails.
  • Automated: Health endpoint returns 503 when PostgreSQL check fails.
  • Automated: Peer node listing includes all registered nodes.
  • Automated: Stale peer detection flags nodes with old heartbeats.
  • Automated: Standalone mode returns cluster_mode: false with no cluster checks.
  • Manual E2E: In cluster mode, verify health endpoint on both leader and follower. Unmount NFS on one node, verify its health returns 503. Restore mount, verify health returns 200.

Definition of Done

  • Health endpoint reports node identity and role
  • NFS mount health check returns 503 on failure
  • PostgreSQL connectivity check returns 503 on failure
  • Peer node visibility with stale detection
  • Load balancer compatible (200/503 HTTP status codes)
  • Standalone mode backward compatible
  • Health check total execution under 1 second
  • All tests pass
