If `ceph health detail` or `ceph -s` shows the error below, then follow this procedure to fix the issue.

Error message:

```text
health: HEALTH_ERR
        Module 'devicehealth' has failed
```
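To see the specific reason reported for the failed module before making changes, `ceph health detail` can also be run. The output below is only an illustration; the reason text after the module name varies from system to system.

```bash
ncn-s001:~ # ceph health detail
HEALTH_ERR Module 'devicehealth' has failed
[ERR] MGR_MODULE_ERROR: Module 'devicehealth' has failed
    Module 'devicehealth' has failed: ...
```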
1. Stop the Ceph `mgr` services via `systemd` on `ncn-s001`, `ncn-s002`, and `ncn-s003`.

   1. Find the `systemd` unit name.

      On each node listed above, run the following:

      ```bash
      ncn-s001:~ # cephadm ls|jq -r '.[]|select(.systemd_unit|contains ("mgr"))|.systemd_unit'
      [email protected]
      ```

   2. Stop the service.

      On each node listed above, run the following:

      ```bash
      ncn-s001:~ # systemctl stop [email protected]
      ```
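   If passwordless SSH as `root` is available between the storage nodes, the two sub-steps above can also be scripted from a single node. The loop below is a minimal sketch, not part of the documented procedure; it assumes `cephadm` and `jq` are installed on each node (they are used interactively in the commands above) and discovers the unit name per node rather than hard-coding it.

   ```bash
   ncn-s001:~ # for node in ncn-s001 ncn-s002 ncn-s003; do
       # Discover the mgr systemd unit on this node, then stop it
       unit=$(ssh "${node}" "cephadm ls | jq -r '.[]|select(.systemd_unit|contains(\"mgr\"))|.systemd_unit'")
       echo "Stopping ${unit} on ${node}"
       ssh "${node}" "systemctl stop ${unit}"
   done
   ```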
2. Remove the Ceph pool containing the corrupt table.

   Run the following commands once, from `ncn-s001`, `ncn-s002`, or `ncn-s003`.

   1. Set the flag to allow pool deletion.

      ```bash
      ncn-s001:~ # ceph config set mon mon_allow_pool_delete true
      ```

   2. Delete the pool.

      ```bash
      ncn-s001:~ # ceph osd pool rm .mgr .mgr --yes-i-really-really-mean-it
      ```

      The output should contain `pool '.mgr' removed`.

   3. Unset the flag to prohibit pool deletion.

      ```bash
      ncn-s001:~ # ceph config set mon mon_allow_pool_delete false
      ```
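   Before restarting the `mgr` services, it can be helpful to confirm that the pool is gone and that the deletion flag has been reset. This quick check is not part of the documented procedure; it only reads cluster state.

   ```bash
   ncn-s001:~ # ceph osd lspools | grep '\.mgr' || echo ".mgr pool removed"
   .mgr pool removed
   ncn-s001:~ # ceph config get mon mon_allow_pool_delete
   false
   ```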
3. Start the Ceph `mgr` services via `systemd` on `ncn-s001`, `ncn-s002`, and `ncn-s003`.

   1. Find the `systemd` unit name.

      On each node listed above, run the following:

      ```bash
      ncn-s001:~ # cephadm ls|jq -r '.[]|select(.systemd_unit|contains ("mgr"))|.systemd_unit'
      [email protected]
      ```

   2. Start the service.

      On each node listed above, run the following:

      ```bash
      ncn-s001:~ # systemctl start [email protected]
      ```
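   To confirm that the unit is actually running on each node, `systemctl is-active` can be checked against the same discovered unit name. This is a minimal sketch, assuming passwordless SSH as `root` between the storage nodes; each line of output should end with `active`.

   ```bash
   ncn-s001:~ # for node in ncn-s001 ncn-s002 ncn-s003; do
       # Discover the mgr systemd unit on this node and report its state
       unit=$(ssh "${node}" "cephadm ls | jq -r '.[]|select(.systemd_unit|contains(\"mgr\"))|.systemd_unit'")
       echo -n "${node} ${unit}: "
       ssh "${node}" "systemctl is-active ${unit}"
   done
   ```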
4. Verify Ceph `mgr` is operational.

   1. Verify the `.mgr` pool was automatically created.

      ```bash
      ncn-s001:~ # ceph osd lspools
      ```

      This will list the pools. Verify that the `.mgr` pool is present. If the cluster is busy, it could take a minute or so for the pool to be created. If the pool is not created, then verify that the `mgr` processes are running using the following step.

   2. Verify that all three `mgr` instances are running.

      ```bash
      ncn-s001:~ # ceph -s
      ```

      There should be three `mgr` processes in the output, similar to the following:

      ```text
        cluster:
          id:     660ccbec-a6c1-11ed-af32-b8599ff91d22
          health: HEALTH_OK

        services:
          mon: 3 daemons, quorum ncn-s001,ncn-s003,ncn-s002 (age 12m)
          mgr: ncn-s001.xufexf(active, since 44s), standbys: ncn-s003.uieiom, ncn-s002.zlhlvs
          mds: 2/2 daemons up, 3 standby, 1 hot standby
          osd: 24 osds: 24 up (since 11m), 24 in (since 11m)
          rgw: 3 daemons active (3 hosts, 1 zones)
      ```
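   Because the `.mgr` pool can take a minute or so to reappear on a busy cluster, a short polling loop such as the sketch below can be used instead of re-running `ceph osd lspools` by hand. It is not part of the documented procedure; it polls every 10 seconds for up to two minutes.

   ```bash
   ncn-s001:~ # for attempt in $(seq 1 12); do
       # Stop polling as soon as the .mgr pool shows up in the pool list
       if ceph osd lspools | grep -q '\.mgr'; then
           echo ".mgr pool is present"
           break
       fi
       echo "Waiting for .mgr pool (attempt ${attempt}/12)..."
       sleep 10
   done
   ```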
5. Additional verification steps.

   Run the following from either a master node or from one of `ncn-s001`, `ncn-s002`, or `ncn-s003`.

   1. Fetch the Ceph Prometheus endpoint.

      ```bash
      ncn-s001:~ # ceph mgr services
      ```

      Expected output:

      IMPORTANT: The below is example output and IP addresses may vary, so make sure that the correct endpoint is obtained from the Ceph cluster.

      ```text
      {
          "dashboard": "https://10.252.1.11:8443/",
          "prometheus": "http://10.252.1.11:9283/"    <--- This is the URL that is needed.
      }
      ```

   2. Curl against the endpoint to dump metrics.

      ```bash
      ncn-s001:~ # curl -s http://10.252.1.11:9283/metrics
      ```

      Expected output:

      ```text
      # HELP ceph_health_status Cluster health status
      # TYPE ceph_health_status untyped
      ceph_health_status 0.0
      # HELP ceph_mon_quorum_status Monitors in quorum
      # TYPE ceph_mon_quorum_status gauge
      ceph_mon_quorum_status{ceph_daemon="mon.ncn-s001"} 1.0
      ceph_mon_quorum_status{ceph_daemon="mon.ncn-s003"} 1.0
      ceph_mon_quorum_status{ceph_daemon="mon.ncn-s002"} 1.0
      # HELP ceph_fs_metadata FS Metadata
      # TYPE ceph_fs_metadata untyped
      ceph_fs_metadata{data_pools="3",fs_id="1",metadata_pool="2",name="cephfs"} 1.0
      ceph_fs_metadata{data_pools="9",fs_id="2",metadata_pool="8",name="admin-tools"} 1.0
      ...
      ```

      This is a small sample of the output. If the `curl` is successful, then the active `mgr` instance is operational, and it will ensure that the standby `mgr` daemons are functional and ready.
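   Because `ceph mgr services` returns JSON, the Prometheus URL can also be extracted with `jq` and checked in a single pass instead of copying it by hand. This is a minimal sketch; it assumes `jq` is installed and that the `prometheus` module appears in the `ceph mgr services` output as shown above. A `ceph_health_status` value of `0.0` corresponds to `HEALTH_OK`.

   ```bash
   ncn-s001:~ # prom_url=$(ceph mgr services | jq -r '.prometheus')
   ncn-s001:~ # curl -s "${prom_url}metrics" | grep '^ceph_health_status '
   ceph_health_status 0.0
   ```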