handle down remotes

Powering off a lustre server, or stopping `halo_remote` there, results in a crash in `halo_management`.

Error output:
```
halo_manager[2584124]: stack backtrace:
halo_manager[2584124]: TODO: handle error here.: Os { code: 111, kind: ConnectionRefused, message: "Connection refused" }
halo_manager[2584124]: thread 'main' panicked at src/host/non_ha.rs:58:18:
```

Reproducer:
1. Generate a cluster configuration
2. Start `halo_remote` on the lustre servers (in my case via systemd)
3. Start `halo_manager` on the management node (in my case via systemd)
4. Issue `halo status` and get expected output describing the state of the cluster
5. Stop `halo_remote` on one of the nodes
6. Wait a few seconds
7. Issue `systemctl status halo_manager` and see that `halo_manager` has crashed

Relevant code is one of the to-do items:
```
52 impl Host {
 53     pub async fn observe(&self, cluster: &Cluster) {
 54         loop {
 55             let client = self
 56                 .get_client(cluster)
 57                 .await
 58                 .expect("TODO: handle error here.");
 59 
```

Observed with stack::
```
* 21394c9 docs: add information on how to get ZFS and Lustre OCF resource agent scripts
* fccb1db docs: add troubleshooting section to admin guide
* d33c948 readme: add HALO_LOG environment variable
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

handle down remotes #79

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

handle down remotes #79

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions