handle down remotes #79

@ofaaland

Description

Powering off a Lustre server, or stopping halo_remote on it, crashes halo_manager.

Error output:

halo_manager[2584124]: thread 'main' panicked at src/host/non_ha.rs:58:18:
halo_manager[2584124]: TODO: handle error here.: Os { code: 111, kind: ConnectionRefused, message: "Connection refused" }
halo_manager[2584124]: stack backtrace:

Reproducer:

  1. Generate a cluster configuration
  2. Start halo_remote on the Lustre servers (in my case via systemd)
  3. Start halo_manager on the management node (in my case via systemd)
  4. Issue halo status and get expected output describing the state of the cluster
  5. Stop halo_remote on one of the nodes
  6. Wait a few seconds
  7. Issue systemctl status halo_manager and see that halo_manager has crashed

The relevant code is one of the to-do items (line 58 of src/host/non_ha.rs, matching the panic location above):

52 impl Host {
53     pub async fn observe(&self, cluster: &Cluster) {
54         loop {
55             let client = self
56                 .get_client(cluster)
57                 .await
58                 .expect("TODO: handle error here.");
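
One possible direction, as a rough sketch only: treat a failed connection as an observation ("this remote is down") rather than a fatal error, and keep polling. The code below is not taken from the HALO sources. `RemoteState`, `probe`, and `observe` are hypothetical names, and it uses blocking `std::net` calls instead of the async client returned by `get_client`, purely so that it is self-contained.

```rust
use std::io;
use std::net::{SocketAddr, TcpStream};
use std::thread;
use std::time::Duration;

/// Hypothetical reachability state for a remote (not a HALO type).
#[derive(Debug, PartialEq)]
enum RemoteState {
    Up,
    /// The remote could not be reached; keep the error kind for logging.
    Down(io::ErrorKind),
}

/// Try to connect once and report the outcome instead of panicking.
fn probe(addr: &SocketAddr, timeout: Duration) -> RemoteState {
    match TcpStream::connect_timeout(addr, timeout) {
        Ok(_stream) => RemoteState::Up,
        // ConnectionRefused (OS error 111 on Linux) lands here.
        Err(e) => RemoteState::Down(e.kind()),
    }
}

/// Poll the remote `rounds` times, recording each state.
/// A down remote is logged and retried; the loop never aborts.
fn observe(addr: &SocketAddr, rounds: usize, interval: Duration) -> Vec<RemoteState> {
    let mut history = Vec::with_capacity(rounds);
    for round in 0..rounds {
        let state = probe(addr, Duration::from_millis(500));
        if let RemoteState::Down(kind) = &state {
            eprintln!("remote {addr} is down ({kind:?}); will retry");
        }
        history.push(state);
        // Sleep between rounds, but not after the last one.
        if round + 1 < rounds {
            thread::sleep(interval);
        }
    }
    history
}
```

In the real `observe` loop the equivalent change would presumably be a `match` on the result of `get_client(cluster).await`, recording the host as down and continuing after a delay instead of calling `expect`.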

Observed with the following most recent commits:

* 21394c9 docs: add information on how to get ZFS and Lustre OCF resource agent scripts
* fccb1db docs: add troubleshooting section to admin guide
* d33c948 readme: add HALO_LOG environment variable
