If `ceph health detail` or `ceph -s` shows the error below, then follow this procedure to fix the issue.

Error message:

```text
health: HEALTH_ERR
        Module 'devicehealth' has failed
```
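To see the specific reason reported for the failed module before making changes, `ceph health detail` can also be run. The output below is only an illustration; the reason text after the module name varies from system to system.

```bash
ncn-s001:~ # ceph health detail
HEALTH_ERR Module 'devicehealth' has failed
[ERR] MGR_MODULE_ERROR: Module 'devicehealth' has failed
    Module 'devicehealth' has failed: ...
```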
1. Stop the Ceph `mgr` services via `systemd` on `ncn-s001`, `ncn-s002`, and `ncn-s003`.

   1. Find the `systemd` unit name.

      On each node listed above, run the following:

      ```bash
      ncn-s001:~ # cephadm ls|jq -r '.[]|select(.systemd_unit|contains ("mgr"))|.systemd_unit'
      [email protected]
      ```

   2. Stop the service.

      On each node listed above, run the following:

      ```bash
      ncn-s001:~ # systemctl stop [email protected]
      ```
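   If passwordless SSH as `root` is available between the storage nodes, the two sub-steps above can also be scripted from a single node. The loop below is a minimal sketch, not part of the documented procedure; it assumes `cephadm` and `jq` are installed on each node (they are used interactively in the commands above) and discovers the unit name per node rather than hard-coding it.

   ```bash
   ncn-s001:~ # for node in ncn-s001 ncn-s002 ncn-s003; do
       # Discover the mgr systemd unit on this node, then stop it
       unit=$(ssh "${node}" "cephadm ls | jq -r '.[]|select(.systemd_unit|contains(\"mgr\"))|.systemd_unit'")
       echo "Stopping ${unit} on ${node}"
       ssh "${node}" "systemctl stop ${unit}"
   done
   ```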
2. Remove the Ceph pool containing the corrupt table.

   Run the following commands once, from `ncn-s001`, `ncn-s002`, or `ncn-s003`.

   1. Set the flag to allow pool deletion.

      ```bash
      ncn-s001:~ # ceph config set mon mon_allow_pool_delete true
      ```

   2. Delete the pool.

      ```bash
      ncn-s001:~ # ceph osd pool rm .mgr .mgr --yes-i-really-really-mean-it
      ```

      The output should contain `pool '.mgr' removed`.

   3. Unset the flag to prohibit pool deletion.

      ```bash
      ncn-s001:~ # ceph config set mon mon_allow_pool_delete false
      ```
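   Before restarting the `mgr` services, it can be helpful to confirm that the pool is gone and that the deletion flag has been reset. This quick check is not part of the documented procedure; it only reads cluster state.

   ```bash
   ncn-s001:~ # ceph osd lspools | grep '\.mgr' || echo ".mgr pool removed"
   .mgr pool removed
   ncn-s001:~ # ceph config get mon mon_allow_pool_delete
   false
   ```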
3. Start the Ceph `mgr` services via `systemd` on `ncn-s001`, `ncn-s002`, and `ncn-s003`.

   1. Find the `systemd` unit name.

      On each node listed above, run the following:

      ```bash
      ncn-s001:~ # cephadm ls|jq -r '.[]|select(.systemd_unit|contains ("mgr"))|.systemd_unit'
      [email protected]
      ```

   2. Start the service.

      On each node listed above, run the following:

      ```bash
      ncn-s001:~ # systemctl start [email protected]
      ```
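   To confirm that the unit is actually running on each node, `systemctl is-active` can be checked against the same discovered unit name. This is a minimal sketch, assuming passwordless SSH as `root` between the storage nodes; each line of output should end with `active`.

   ```bash
   ncn-s001:~ # for node in ncn-s001 ncn-s002 ncn-s003; do
       # Discover the mgr systemd unit on this node and report its state
       unit=$(ssh "${node}" "cephadm ls | jq -r '.[]|select(.systemd_unit|contains(\"mgr\"))|.systemd_unit'")
       echo -n "${node} ${unit}: "
       ssh "${node}" "systemctl is-active ${unit}"
   done
   ```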
4. Verify Ceph `mgr` is operational.

   1. Verify the `.mgr` pool was automatically created.

      ```bash
      ncn-s001:~ # ceph osd lspools
      ```

      This will list the pools. Verify that the `.mgr` pool is present. If the cluster is busy, it could take a minute or so for the pool to be created. If the pool is not created, then verify that the `mgr` processes are running using the following step.

   2. Verify that all three `mgr` instances are running.

      ```bash
      ncn-s001:~ # ceph -s
      ```

      There should be three `mgr` processes in the output, similar to the following:

      ```text
        cluster:
          id:     660ccbec-a6c1-11ed-af32-b8599ff91d22
          health: HEALTH_OK

        services:
          mon: 3 daemons, quorum ncn-s001,ncn-s003,ncn-s002 (age 12m)
          mgr: ncn-s001.xufexf(active, since 44s), standbys: ncn-s003.uieiom, ncn-s002.zlhlvs
          mds: 2/2 daemons up, 3 standby, 1 hot standby
          osd: 24 osds: 24 up (since 11m), 24 in (since 11m)
          rgw: 3 daemons active (3 hosts, 1 zones)
      ```
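   Because the `.mgr` pool can take a minute or so to reappear on a busy cluster, a short polling loop such as the sketch below can be used instead of re-running `ceph osd lspools` by hand. It is not part of the documented procedure; it polls every 10 seconds for up to two minutes.

   ```bash
   ncn-s001:~ # for attempt in $(seq 1 12); do
       # Stop polling as soon as the .mgr pool shows up in the pool list
       if ceph osd lspools | grep -q '\.mgr'; then
           echo ".mgr pool is present"
           break
       fi
       echo "Waiting for .mgr pool (attempt ${attempt}/12)..."
       sleep 10
   done
   ```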
5. Additional verification steps.

   Run the following from either a master node or from one of `ncn-s001`, `ncn-s002`, or `ncn-s003`.

   1. Fetch the Ceph Prometheus endpoint.

      ```bash
      ncn-s001:~ # ceph mgr services
      ```

      Expected output:

      IMPORTANT: The below is example output and IP addresses may vary, so make sure that the correct endpoint is obtained from the Ceph cluster.

      ```text
      {
          "dashboard": "https://10.252.1.11:8443/",
          "prometheus": "http://10.252.1.11:9283/"    <--- This is the URL that is needed.
      }
      ```

   2. Curl against the endpoint to dump metrics.

      ```bash
      ncn-s001:~ # curl -s http://10.252.1.11:9283/metrics
      ```

      Expected output:

      ```text
      # HELP ceph_health_status Cluster health status
      # TYPE ceph_health_status untyped
      ceph_health_status 0.0
      # HELP ceph_mon_quorum_status Monitors in quorum
      # TYPE ceph_mon_quorum_status gauge
      ceph_mon_quorum_status{ceph_daemon="mon.ncn-s001"} 1.0
      ceph_mon_quorum_status{ceph_daemon="mon.ncn-s003"} 1.0
      ceph_mon_quorum_status{ceph_daemon="mon.ncn-s002"} 1.0
      # HELP ceph_fs_metadata FS Metadata
      # TYPE ceph_fs_metadata untyped
      ceph_fs_metadata{data_pools="3",fs_id="1",metadata_pool="2",name="cephfs"} 1.0
      ceph_fs_metadata{data_pools="9",fs_id="2",metadata_pool="8",name="admin-tools"} 1.0
      ...
      ```

      This is a small sample of the output. If the `curl` is successful, then the active `mgr` instance is operational, and it will ensure that the standby `mgr` daemons are functional and ready.
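   Because `ceph mgr services` returns JSON, the Prometheus URL can also be extracted with `jq` and checked in a single pass instead of copying it by hand. This is a minimal sketch; it assumes `jq` is installed and that the `prometheus` module appears in the `ceph mgr services` output as shown above. A `ceph_health_status` value of `0.0` corresponds to `HEALTH_OK`.

   ```bash
   ncn-s001:~ # prom_url=$(ceph mgr services | jq -r '.prometheus')
   ncn-s001:~ # curl -s "${prom_url}metrics" | grep '^ceph_health_status '
   ceph_health_status 0.0
   ```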