Dashboard integration - address coderabitai comments#93
maryamtahhan wants to merge 8 commits into redhat-et:main from
Conversation
Critical and major fixes:
- SSH tunnel cleanup: use pgrep instead of lsof for safer process targeting
- Results path: search for test-metadata.json to extract core_config_name
- Prometheus metrics: apply default filter before multiplication
- Metrics collector: use dynamic timeout based on test duration
- Setup script: add error handling for cd command
- ITL metrics: fix metric name mismatch in Server_Metrics.py dashboard
- Grafana dashboard: replace hardcoded datasource UID with template variable

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Remove hardcoded EC2 hostnames and SSH key path from setup-tunnels.sh
- Fix cleanup-collector-scripts.sh to search /tmp instead of results/llm
- Update pattern from collect_vllm_metrics.py to collect_vllm_metrics_*.py

These were security concerns flagged by CodeRabbit AI review.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Add Docker Compose CLI v2 (docker compose) detection to start/stop-grafana playbooks
- Replace Linux-only port probing (ss/netstat) with portable wait_for in start-grafana
- Fix stop-grafana to use OS-specific compose files matching start-grafana behavior
- Make datasource version bump deterministic instead of random
- Enhance SSH tunnel detection to include hostname for precise matching
- Improve container log collection to detect more failure scenarios
- Fix dashboard-examples README contradictions about vllm-metrics.json source
- Update default results path from ../../../../../results/llm to ../../../../results/llm
- Make grafana README Prometheus config OS-agnostic with platform-specific notes

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…op-grafana container detection
- Revert wait_for-based port detection back to ss/netstat shell commands, which are more reliable
- Fix stop-grafana to detect which container runtime is actually managing containers
- Check for running containers with podman/docker labels before choosing compose tool
- Prevents docker compose from trying to stop podman-managed containers on Linux

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…mpose
- Revert complex container detection logic that wasn't working
- Check podman-compose FIRST before docker tools
- Ensures containers started with podman-compose are stopped with podman-compose
- Keeps Docker CLI v2 support and OS-specific compose file fixes

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Check podman-compose FIRST in start-grafana.yml (matching stop-grafana.yml)
- Ensures both playbooks use the same compose tool
- Prevents issue where containers started with one tool can't be stopped by another
- Priority order now consistent: podman-compose > docker-compose > docker compose CLI v2

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
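The priority order these commits settle on can be sketched as a small selector. This is an illustrative sketch only; the helper name and the simulated "installed tools" arguments are hypothetical and not part of the playbooks:

```shell
# Hypothetical sketch: pick the first available compose tool in the agreed
# priority order. The set of installed tools is simulated via arguments.
choose_compose_tool() {
    for candidate in "podman-compose" "docker-compose" "docker compose"; do
        for installed in "$@"; do
            if [ "$candidate" = "$installed" ]; then
                echo "$candidate"
                return 0
            fi
        done
    done
    return 1
}

# With only docker tooling installed, docker-compose wins over the CLI v2 plugin:
tool_docker_only=$(choose_compose_tool "docker compose" "docker-compose")
# With podman-compose also installed, it takes priority:
tool_all=$(choose_compose_tool "docker compose" "docker-compose" "podman-compose")
echo "$tool_docker_only / $tool_all"
```

Because start and stop would walk the same candidate list, containers started by one tool are always stopped by the same tool.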
The dashboard was changed to use ${DS_PROMETHEUS} template variable which
wasn't working properly. Reverting to hardcoded UID "edx8memhpd9tsa" that
was working in commit 06b70f3.
This fixes the issue where Grafana showed no stats even though:
- Prometheus was successfully scraping vLLM metrics
- Grafana could reach Prometheus
- All services were healthy
The template variable approach needs proper configuration which wasn't in place.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Changed datasource UID from "edx8memhpd9tsa" to "prometheus-vllm" to match the actual provisioned datasource defined in prometheus.yaml. This should fix Grafana not displaying metrics despite Prometheus successfully scraping vLLM.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
📝 Walkthrough
Updates span Ansible playbooks for vLLM metrics collection and Grafana orchestration, Prometheus templates, dashboard scripts, and utility tools. Changes include improved SSH process detection with hostname matching, dynamic compose-tool selection supporting CLI v2, metrics duration parameter propagation, datasource configuration updates, and documentation revisions clarifying vLLM metrics collection architecture.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 inconclusive)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
automation/test-execution/grafana/scripts/setup-tunnels.sh (1)
68-82: ⚠️ Potential issue | 🟠 Major
Keep reverse-tunnel matching consistent across setup/status/stop.
The updated regex at Lines 68 and 82 is correct for the ssh command shape, but Line 125 and Line 150 still use the old `host.*port` order. That causes false DOWN status and missed cleanup for active reverse tunnels.
🔧 Proposed fix
```diff
@@
 PUSHGATEWAY_PORT="${PROMETHEUS_PORT:-9091}"
 VLLM_METRICS_PORT=8000
+REVERSE_PUSH_TUNNEL_PATTERN="ssh.*${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT}.*${LOADGEN_HOSTNAME}"
@@
-pkill -f "ssh.*${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT}.*${LOADGEN_HOSTNAME}" 2>/dev/null || true
+pkill -f "${REVERSE_PUSH_TUNNEL_PATTERN}" 2>/dev/null || true
@@
-if pgrep -f "ssh.*${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT}.*${LOADGEN_HOSTNAME}" > /dev/null; then
+if pgrep -f "${REVERSE_PUSH_TUNNEL_PATTERN}" > /dev/null; then
@@
-if pgrep -f "ssh.*${LOADGEN_HOSTNAME}.*${PUSHGATEWAY_PORT}" > /dev/null; then
+if pgrep -f "${REVERSE_PUSH_TUNNEL_PATTERN}" > /dev/null; then
@@
-pkill -f "ssh.*${LOADGEN_HOSTNAME}.*${PUSHGATEWAY_PORT}" 2>/dev/null || true
+pkill -f "${REVERSE_PUSH_TUNNEL_PATTERN}" 2>/dev/null || true
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@automation/test-execution/grafana/scripts/setup-tunnels.sh` around lines 68 - 82, The reverse-tunnel matching is inconsistent: the setup/status/stop checks must use the same ssh pattern that includes the remote port mapping (${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT}) and the target host (${LOADGEN_HOSTNAME}). Locate the pgrep/pkill occurrences that still match "host.*port" (the status and stop checks around where you verify and kill tunnels) and change their -f regex to the same form used in the setup and verification blocks (match the literal reverse-forward snippet and the LOADGEN host), e.g. use a pattern containing "${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT}.*${LOADGEN_HOSTNAME}" so status detection and cleanup consistently find the same ssh process.
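The shared-pattern approach the fix proposes can be sanity-checked against the shape of the ssh command line. The hostname and exact ssh flags below are hypothetical stand-ins, purely for illustration:

```shell
# Hypothetical reverse-tunnel command line (host and flags are stand-ins).
PUSHGATEWAY_PORT=9091
LOADGEN_HOSTNAME="loadgen.example.internal"
CMDLINE="ssh -N -R ${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT} ec2-user@${LOADGEN_HOSTNAME}"

# The shared pattern from the proposed fix matches this shape...
NEW_PATTERN="ssh.*${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT}.*${LOADGEN_HOSTNAME}"
# ...while the stale host-then-port order does not, because on the real
# command line the port mapping precedes the hostname.
OLD_PATTERN="ssh.*${LOADGEN_HOSTNAME}.*${PUSHGATEWAY_PORT}"

new_match=$(echo "$CMDLINE" | grep -cE "$NEW_PATTERN" || true)
old_match=$(echo "$CMDLINE" | grep -cE "$OLD_PATTERN" || true)
echo "new pattern matches: $new_match, old pattern matches: $old_match"
```

This is why the status and stop checks that still use the old order report DOWN for a tunnel that is actually up.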
🧹 Nitpick comments (1)
automation/test-execution/ansible/publish-existing-results.yml (1)
40-45: Fallback chain in results_path is redundant.
Since core_configuration.name is always set at line 40 (with its own fallback), the subsequent defaults in line 45 (| default(core_config_name) | default(...)) will never be reached. This adds unnecessary complexity. Consider simplifying:
Proposed simplification

```diff
 - name: Build results path
   ansible.builtin.set_fact:
-    results_path: "{{ playbook_dir }}/../../../results/llm/{{ test_model | replace('/', '__') }}/{{ workload_type }}-{{ test_run_id }}/{{ core_configuration.name | default(core_config_name) | default(requested_cores | string + 'cores-numa' + (core_configuration.numa_node | default(0) | string) + '-tp' + (core_configuration.tensor_parallel | default(1) | string)) }}"
+    results_path: "{{ playbook_dir }}/../../../results/llm/{{ test_model | replace('/', '__') }}/{{ workload_type }}-{{ test_run_id }}/{{ core_configuration.name }}"
```
Verify each finding against the current code and only fix it if needed. In `@automation/test-execution/ansible/publish-existing-results.yml` around lines 40 - 45, The results_path template uses a redundant fallback chain because core_configuration.name is always set earlier; simplify the expression by using core_configuration.name with a single fallback to the computed default (requested_cores... with numa_node and tensor_parallel) or to core_config_name if you prefer that as the primary fallback—update the results_path set_fact to reference core_configuration.name | default(core_config_name | default(requested_cores | string + 'cores-numa' + (core_configuration.numa_node | default(0) | string) + '-tp' + (core_configuration.tensor_parallel | default(1) | string))) so you remove the unreachable extra defaults and keep only the necessary fallback order for results_path, referencing results_path, core_configuration.name, core_config_name, and requested_cores to locate the change.
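The unreachable-fallback behavior has a close shell analogue in `${var:-fallback}` expansion. The values below are hypothetical, purely to illustrate why a later default never fires once the first value is already set:

```shell
# core_configuration.name is always set earlier (with its own fallback), so the
# chained Jinja2 default() filters behave like nested ${var:-...} expansions
# where the outermost variable is non-empty.
core_configuration_name="8cores-numa0-tp1"   # always defined by this point
core_config_name="from-test-metadata"        # hypothetical fallback value

results_leaf="${core_configuration_name:-${core_config_name:-computed-default}}"
echo "$results_leaf"
```

The inner fallbacks are dead code: only the first value ever reaches the output.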
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5e177086-50a8-421a-85aa-b0a58d384e36
📒 Files selected for processing (13)
- automation/test-execution/ansible/llm-benchmark-auto.yml
- automation/test-execution/ansible/publish-existing-results.yml
- automation/test-execution/ansible/roles/metrics_publisher/templates/prometheus_metrics.j2
- automation/test-execution/ansible/roles/vllm_metrics_collector/tasks/stop.yml
- automation/test-execution/ansible/start-grafana.yml
- automation/test-execution/ansible/stop-grafana.yml
- automation/test-execution/dashboard-examples/README.md
- automation/test-execution/dashboard-examples/setup.sh
- automation/test-execution/dashboard-examples/vllm_dashboard/pages/2_🖥️_Server_Metrics.py
- automation/test-execution/grafana/README.md
- automation/test-execution/grafana/dashboards/vllm_comprehensive.json
- automation/test-execution/grafana/scripts/setup-tunnels.sh
- automation/utilities/cleanup-collector-scripts.sh
```yaml
# Only kill SSH tunnel processes, not vLLM server or other services
pgrep -f "ssh.*-L.*8000:localhost:8000" | head -1
```
SSH tunnel pattern inconsistent with start-grafana.yml—may match unintended tunnels.
The pattern ssh.*-L.*8000:localhost:8000 here lacks hostname anchoring, while start-grafana.yml (lines 430, 451) uses ssh.*8000:localhost:8000.*{{ dut_hostname }}. On systems with multiple SSH tunnels to different hosts, this could kill the wrong tunnel.
Consider adding hostname anchoring for consistency:
Proposed fix

```diff
 - name: Get tunnel PID
   ansible.builtin.shell: |
-    # Only kill SSH tunnel processes, not vLLM server or other services
-    pgrep -f "ssh.*-L.*8000:localhost:8000" | head -1
+    # Only kill SSH tunnel processes for the DUT, not other tunnels
+    pgrep -f "ssh.*-L.*8000:localhost:8000.*{{ groups['dut'][0] }}" | head -1
   register: tunnel_pid_check
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
# Only kill SSH tunnel processes for the DUT, not other tunnels
pgrep -f "ssh.*-L.*8000:localhost:8000.*{{ groups['dut'][0] }}" | head -1
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@automation/test-execution/ansible/llm-benchmark-auto.yml` around lines 519 -
520, The pgrep pattern "ssh.*-L.*8000:localhost:8000" can match tunnels to other
hosts; update the pgrep invocation used here to include the DUT hostname anchor
(same approach as in start-grafana.yml) so it only matches the specific tunnel
for the target host — for example, change the pattern used by the pgrep call
that searches for the SSH tunnel to include "8000:localhost:8000.*{{
dut_hostname }}" (or the equivalent runtime variable like ${dut_hostname}) so
only the intended SSH process is selected.
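The over-matching risk can be reproduced with grep over simulated process command lines (both hostnames are placeholders): the unanchored pattern hits every 8000:localhost:8000 tunnel, while the hostname-anchored one selects only the DUT's.

```shell
# Two hypothetical SSH tunnels on the same machine; only the first is to the DUT.
TUNNELS='ssh -N -L 8000:localhost:8000 ec2-user@dut.example.internal
ssh -N -L 8000:localhost:8000 ec2-user@other.example.internal'

unanchored=$(printf '%s\n' "$TUNNELS" | grep -cE "ssh.*-L.*8000:localhost:8000")
anchored=$(printf '%s\n' "$TUNNELS" | grep -cE "ssh.*-L.*8000:localhost:8000.*dut\.example\.internal")

echo "unanchored matches: $unanchored"   # both tunnels match
echo "anchored matches: $anchored"       # only the DUT tunnel matches
```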
```diff
+- name: Read current datasource version
+  ansible.builtin.shell: |
+    grep -oP 'version: \K\d+' "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml" || echo "1"
+  register: current_version
+  changed_when: false
+  when: datasource_update_linux.changed or datasource_update_macos.changed
+
 - name: Bump datasource version to force reload
   ansible.builtin.replace:
     path: "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml"
     regexp: "version: \\d+"
-    replace: "version: {{ 99 | random(start=10) }}"
+    replace: "version: {{ current_version.stdout | int + 1 }}"
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Verify if the playbook is expected to run on macOS and check grep capabilities
# Check if there are any Darwin-specific conditions in the playbook
rg -n "Darwin|macos|macOS" automation/test-execution/ansible/start-grafana.yml
# Check if GNU grep might be installed via Homebrew on typical macOS setups
echo "---"
echo "Note: On macOS, GNU grep can be installed via 'brew install grep' as 'ggrep'"
```

Repository: redhat-et/vllm-cpu-perf-eval
Length of output: 1019
Use portable grep syntax to avoid BSD grep -P incompatibility on macOS.
The grep -oP command fails on macOS (which uses BSD grep by default) since the -P flag is GNU-specific. The playbook runs on both Darwin and Linux (see lines 34–35), and the datasource version task executes on macOS when the datasource is updated. Replace with a portable alternative:
Proposed fix

```diff
 - name: Read current datasource version
   ansible.builtin.shell: |
-    grep -oP 'version: \K\d+' "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml" || echo "1"
+    grep -E 'version: [0-9]+' "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml" | sed 's/.*version: //' | tr -d ' ' || echo "1"
   register: current_version
   changed_when: false
   when: datasource_update_linux.changed or datasource_update_macos.changed
```

📝 Committable suggestion
```yaml
- name: Read current datasource version
  ansible.builtin.shell: |
    grep -E 'version: [0-9]+' "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml" | sed 's/.*version: //' | tr -d ' ' || echo "1"
  register: current_version
  changed_when: false
  when: datasource_update_linux.changed or datasource_update_macos.changed

- name: Bump datasource version to force reload
  ansible.builtin.replace:
    path: "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml"
    regexp: "version: \\d+"
    replace: "version: {{ current_version.stdout | int + 1 }}"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@automation/test-execution/ansible/start-grafana.yml` around lines 259 - 270,
The "Read current datasource version" task uses GNU-only grep -P which fails on
macOS; replace the shell command with a portable extractor such as awk to read
the version (e.g. use awk -F': ' '/version:/{print $2; exit}' "{{ grafana_dir
}}/provisioning/datasources/prometheus.yaml" || echo "1") so the task still
registers current_version and keeps changed_when: false and the same when
condition; keep the subsequent "Bump datasource version to force reload" replace
task unchanged (it will use current_version.stdout | int + 1).
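Both portable extractors above (the sed pipeline from the committable suggestion and the awk one-liner from the AI-agent prompt) can be checked against a minimal stand-in for the provisioning file; the file contents here are hypothetical:

```shell
# Minimal hypothetical stand-in for provisioning/datasources/prometheus.yaml.
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
datasources:
  - name: Prometheus
    uid: prometheus-vllm
    version: 3
EOF

# sed-based extractor (from the committable suggestion)
v_sed=$(grep -E 'version: [0-9]+' "$tmpfile" | sed 's/.*version: //' | tr -d ' ' || echo "1")
# awk-based extractor (from the AI-agent prompt)
v_awk=$(awk -F': ' '/version:/{print $2; exit}' "$tmpfile" || echo "1")

echo "sed: $v_sed, awk: $v_awk"
rm -f "$tmpfile"
```

Neither variant depends on GNU extensions, so both behave the same under BSD grep/awk on macOS and GNU tools on Linux.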
```diff
 "datasource": {
   "type": "prometheus",
-  "uid": "edx8memhpd9tsa"
+  "uid": "prometheus-vllm"
```
Avoid hardcoded datasource UIDs in a variable-driven dashboard.
These hardcoded entries bypass ${DS_PROMETHEUS} and create inconsistent behavior (notably in panel id 15, where Line 1366 still uses ${DS_PROMETHEUS}). Keep all targets bound to the same template variable.
🔧 Proposed fix

```diff
-  "uid": "prometheus-vllm"
+  "uid": "${DS_PROMETHEUS}"
```

(Apply at Lines 1250, 1350, and 1463.)
As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Also applies to: 1350-1350, 1463-1463
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@automation/test-execution/grafana/dashboards/vllm_comprehensive.json` at line
1250, Replace hardcoded datasource UIDs ("prometheus-vllm") in the dashboard
JSON with the template variable ${DS_PROMETHEUS} so all panels use the same
datasource variable; locate occurrences of "uid": "prometheus-vllm" and change
them to reference ${DS_PROMETHEUS}, and verify panel id 15 (the panel that still
uses the hardcoded UID) and any other targets use ${DS_PROMETHEUS} in their
target.datasource or datasource fields to ensure consistent, variable-driven
datasource binding.
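A quick grep-based audit along the lines this comment asks for might look like the following. The inline JSON fragment is hypothetical, mimicking the mixed state described around panel id 15:

```shell
# Hypothetical dashboard fragment with one hardcoded UID left over.
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
{"panels": [
  {"id": 14, "datasource": {"type": "prometheus", "uid": "prometheus-vllm"}},
  {"id": 15, "datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"}}
]}
EOF

# Count and list datasource UIDs that bypass the template variable.
hardcoded_count=$(grep -o '"uid": "[^"]*"' "$tmpfile" | grep -vc 'DS_PROMETHEUS')
grep -o '"uid": "[^"]*"' "$tmpfile" | grep -v 'DS_PROMETHEUS'
rm -f "$tmpfile"
```

Running such a check over vllm_comprehensive.json before committing would catch any panel that drifts back to a hardcoded UID.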
This PR addresses the outstanding CodeRabbit AI comments from #71
Summary by CodeRabbit
Bug Fixes
Improvements