Discovery
Noticed during review of PR #7918 (merged in e35b32a).
In autobot-slm-backend/ansible/roles/slm_manager/tasks/main.yml, two diagnostic tasks were added to collect and display SLM backend logs when the health check fails:
- name: "SLM | Collect backend logs on health check failure"
  when: _slm_health is defined and _slm_health is failed
  ...
- name: "SLM | Display backend logs on health check failure"
  when: _slm_health is defined and _slm_health is failed
  ...
Problem
In Ansible, when the uri task's until + retries loop is exhausted, the task is marked as failed and, by default (no ignore_errors and no rescue section), Ansible stops running further tasks on that host. Because execution stops at the health check task, the subsequent diagnostic tasks never execute.
The _slm_health is failed condition is correctly written, but it can only be evaluated if Ansible reaches those tasks, which it won't when the health check fails without error handling.
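The behavior can be reproduced in isolation with a small stand-alone playbook. This is a minimal sketch, not part of the role: the file name repro.yml, the variable _check, and the always-failing shell command are hypothetical stand-ins for the real uri health check.

```yaml
# repro.yml: hypothetical reproduction, not part of the slm_manager role
- name: Show that tasks after an exhausted retry loop are not reached
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    # Stand-in for the uri health check: fails on every attempt.
    - name: Always-failing check
      ansible.builtin.shell: "exit 1"
      register: _check
      retries: 2
      delay: 1
      until: _check is succeeded
      changed_when: false

    # Same shape as the diagnostic tasks: the condition is valid,
    # but execution for this host has already stopped by this point.
    - name: Diagnostic task that is never reached
      ansible.builtin.debug:
        msg: "Collected diagnostics would be shown here"
      when: _check is defined and _check is failed
```

Running it with ansible-playbook repro.yml should report the first task as failed once the retries are used up, and the debug task should not appear in the output at all.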
Fix
Wrap the health check and diagnostics in a block/rescue pattern:
- block:
    - name: "SLM | Wait for backend health"
      ansible.builtin.uri:
        url: "http://{{ slm_backend_host }}:{{ slm_backend_port }}/api/health"
        method: GET
        status_code: 200
        timeout: 10
      register: _slm_health
      retries: 120
      delay: 5
      until: _slm_health is succeeded
      tags: ['slm', 'service']
  rescue:
    - name: "SLM | Collect backend logs on health check failure"
      ansible.builtin.shell:
        cmd: >-
          journalctl -u {{ slm_backend_service }} -n 100 --no-pager 2>&1 ||
          tail -100 {{ slm_log_dir }}/slm-backend.log 2>/dev/null ||
          echo "No logs available yet"
      register: _slm_backend_log_output
      changed_when: false
      failed_when: false
      tags: ['slm', 'service', 'debug']
    - name: "SLM | Display backend logs on health check failure"
      ansible.builtin.debug:
        msg: "{{ _slm_backend_log_output.stdout_lines | default([]) }}"
      tags: ['slm', 'service', 'debug']
    - name: "SLM | Fail with diagnostic context"
      ansible.builtin.fail:
        msg: "SLM backend health check failed after {{ 120 * 5 }}s. See logs above."
      tags: ['slm', 'service']
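For comparison, a different approach (not the block/rescue fix proposed above) would keep the existing when conditions and let execution continue past the failed health check with ignore_errors. The sketch below is an assumption about how that could look, reusing the task and variable names from this issue; the log collection is shortened to the journalctl call for brevity.

```yaml
# Alternative sketch using ignore_errors instead of block/rescue.
- name: "SLM | Wait for backend health"
  ansible.builtin.uri:
    url: "http://{{ slm_backend_host }}:{{ slm_backend_port }}/api/health"
    method: GET
    status_code: 200
    timeout: 10
  register: _slm_health
  retries: 120
  delay: 5
  until: _slm_health is succeeded
  ignore_errors: true  # keep going so the diagnostic tasks are reached

- name: "SLM | Collect backend logs on health check failure"
  ansible.builtin.shell:
    cmd: journalctl -u {{ slm_backend_service }} -n 100 --no-pager 2>&1
  register: _slm_backend_log_output
  changed_when: false
  failed_when: false
  when: _slm_health is failed

- name: "SLM | Display backend logs on health check failure"
  ansible.builtin.debug:
    msg: "{{ _slm_backend_log_output.stdout_lines | default([]) }}"
  when: _slm_health is failed

# Re-raise the failure so the play does not continue with a dead backend.
- name: "SLM | Fail with diagnostic context"
  ansible.builtin.fail:
    msg: "SLM backend health check failed. See logs above."
  when: _slm_health is failed
```

The trade-off is that every follow-up task needs its own when guard, and forgetting the final fail task would let the play continue silently. The block/rescue version keeps the failure path grouped in one place, which is likely why it is the better fit here.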
Impact
Low. The primary fix in PR #7918 (timeout increase + connect_timeout) works correctly. This is a diagnostics-only gap: without the logs, fresh-install failures remain harder to debug, which is the problem the diagnostic tasks were meant to solve.