Skip to content

discovery(ansible): slm_manager health-check diagnostic tasks need block/rescue to execute on failure #7919

Description

@mrveiss

Discovery

Noticed during review of PR #7918 (merged in e35b32a).

In autobot-slm-backend/ansible/roles/slm_manager/tasks/main.yml, two diagnostic tasks were added to collect and display SLM backend logs when the health check fails:

- name: "SLM | Collect backend logs on health check failure"
  when: _slm_health is defined and _slm_health is failed
  ...

- name: "SLM | Display backend logs on health check failure"
  when: _slm_health is defined and _slm_health is failed
  ...

Problem

In Ansible, when the uri task's until + retries loop is exhausted, the task is marked as failed and Ansible halts the play (default behavior, no ignore_errors). Because the play is halted at the health check task, the subsequent diagnostic tasks never execute.

The _slm_health is failed condition is correctly written, but it can only be evaluated if Ansible reaches those tasks — which it won't when the health check fails without error handling.

Fix

Wrap the health check and diagnostics in a block/rescue pattern:

- block:
    - name: "SLM | Wait for backend health"
      ansible.builtin.uri:
        url: "http://{{ slm_backend_host }}:{{ slm_backend_port }}/api/health"
        method: GET
        status_code: 200
        timeout: 10
      register: _slm_health
      retries: 120
      delay: 5
      until: _slm_health is succeeded
      tags: ['slm', 'service']

  rescue:
    - name: "SLM | Collect backend logs on health check failure"
      ansible.builtin.shell:
        cmd: >-
          journalctl -u {{ slm_backend_service }} -n 100 --no-pager 2>&1 ||
          tail -100 {{ slm_log_dir }}/slm-backend.log 2>/dev/null ||
          echo "No logs available yet"
      register: _slm_backend_log_output
      changed_when: false
      failed_when: false
      tags: ['slm', 'service', 'debug']

    - name: "SLM | Display backend logs on health check failure"
      ansible.builtin.debug:
        msg: "{{ _slm_backend_log_output.stdout_lines | default([]) }}"
      tags: ['slm', 'service', 'debug']

    - name: "SLM | Fail with diagnostic context"
      ansible.builtin.fail:
        msg: "SLM backend health check failed after {{ 120 * 5 }}s. See logs above."
      tags: ['slm', 'service']

Impact

Low — the primary fix in PR #7918 (timeout increase + connect_timeout) works correctly. This is a diagnostics-only gap. Fresh-install failures are harder to debug without the logs, which is the original intent.

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions