Powering off a lustre server, or stopping halo_remote there, results in a crash in halo_management.
Error output:
halo_manager[2584124]: stack backtrace:
halo_manager[2584124]: TODO: handle error here.: Os { code: 111, kind: ConnectionRefused, message: "Connection refused" }
halo_manager[2584124]: thread 'main' panicked at src/host/non_ha.rs:58:18:
Reproducer:
- Generate a cluster configuration
- Start
halo_remote on the lustre servers (in my case via systemd)
- Start
halo_manager on the management node (in my case via systemd)
- Issue
halo status and get expected output describing the state of the cluster
- Stop
halo_remote on one of the nodes
- Wait a few seconds
- Issue
systemctl status halo_manager and see that halo_manager has crashed
Relevant code is one of the to-do items:
52 impl Host {
53 pub async fn observe(&self, cluster: &Cluster) {
54 loop {
55 let client = self
56 .get_client(cluster)
57 .await
58 .expect("TODO: handle error here.");
59
Observed with stack::
* 21394c9 docs: add information on how to get ZFS and Lustre OCF resource agent scripts
* fccb1db docs: add troubleshooting section to admin guide
* d33c948 readme: add HALO_LOG environment variable
Powering off a lustre server, or stopping
halo_remotethere, results in a crash inhalo_management.Error output:
Reproducer:
halo_remoteon the lustre servers (in my case via systemd)halo_manageron the management node (in my case via systemd)halo statusand get expected output describing the state of the clusterhalo_remoteon one of the nodessystemctl status halo_managerand see thathalo_managerhas crashedRelevant code is one of the to-do items:
Observed with stack::