Dashboard integration - address coderabitai comments#93
maryamtahhan wants to merge 8 commits into redhat-et:main from
Conversation
Critical and major fixes:
- SSH tunnel cleanup: use pgrep instead of lsof for safer process targeting
- Results path: search for test-metadata.json to extract core_config_name
- Prometheus metrics: apply default filter before multiplication
- Metrics collector: use dynamic timeout based on test duration
- Setup script: add error handling for cd command
- ITL metrics: fix metric name mismatch in Server_Metrics.py dashboard
- Grafana dashboard: replace hardcoded datasource UID with template variable

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Remove hardcoded EC2 hostnames and SSH key path from setup-tunnels.sh
- Fix cleanup-collector-scripts.sh to search /tmp instead of results/llm
- Update pattern from collect_vllm_metrics.py to collect_vllm_metrics_*.py

These were security concerns flagged by CodeRabbit AI review.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Add Docker Compose CLI v2 (docker compose) detection to start/stop-grafana playbooks
- Replace Linux-only port probing (ss/netstat) with portable wait_for in start-grafana
- Fix stop-grafana to use OS-specific compose files matching start-grafana behavior
- Make datasource version bump deterministic instead of random
- Enhance SSH tunnel detection to include hostname for precise matching
- Improve container log collection to detect more failure scenarios
- Fix dashboard-examples README contradictions about vllm-metrics.json source
- Update default results path from ../../../../../results/llm to ../../../../results/llm
- Make grafana README Prometheus config OS-agnostic with platform-specific notes

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…op-grafana container detection
- Revert wait_for-based port detection back to ss/netstat shell commands, which are more reliable
- Fix stop-grafana to detect which container runtime is actually managing containers
- Check for running containers with podman/docker labels before choosing compose tool
- Prevents docker compose from trying to stop podman-managed containers on Linux

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…mpose
- Revert complex container detection logic that wasn't working
- Check podman-compose FIRST before docker tools
- Ensures containers started with podman-compose are stopped with podman-compose
- Keeps Docker CLI v2 support and OS-specific compose file fixes

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Check podman-compose FIRST in start-grafana.yml (matching stop-grafana.yml)
- Ensures both playbooks use the same compose tool
- Prevents issue where containers started with one tool can't be stopped by another
- Priority order now consistent: podman-compose > docker-compose > docker compose CLI v2

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
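The priority order these commits settle on can be sketched as a small selector. This is an illustrative sketch only; the helper name and the simulated "installed tools" arguments are hypothetical and not part of the playbooks:

```shell
# Hypothetical sketch: pick the first available compose tool in the agreed
# priority order. The set of installed tools is simulated via arguments.
choose_compose_tool() {
    for candidate in "podman-compose" "docker-compose" "docker compose"; do
        for installed in "$@"; do
            if [ "$candidate" = "$installed" ]; then
                echo "$candidate"
                return 0
            fi
        done
    done
    return 1
}

# With only docker tooling installed, docker-compose wins over the CLI v2 plugin:
tool_docker_only=$(choose_compose_tool "docker compose" "docker-compose")
# With podman-compose also installed, it takes priority:
tool_all=$(choose_compose_tool "docker compose" "docker-compose" "podman-compose")
echo "$tool_docker_only / $tool_all"
```

Because start and stop would walk the same candidate list, containers started by one tool are always stopped by the same tool.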
The dashboard was changed to use ${DS_PROMETHEUS} template variable which
wasn't working properly. Reverting to hardcoded UID "edx8memhpd9tsa" that
was working in commit 06b70f3.
This fixes the issue where Grafana showed no stats even though:
- Prometheus was successfully scraping vLLM metrics
- Grafana could reach Prometheus
- All services were healthy
The template variable approach needs proper configuration which wasn't in place.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Changed datasource UID from "edx8memhpd9tsa" to "prometheus-vllm" to match the actual provisioned datasource defined in prometheus.yaml. This should fix Grafana not displaying metrics despite Prometheus successfully scraping vLLM.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
📝 Walkthrough
Updates span Ansible playbooks for vLLM metrics collection and Grafana orchestration, Prometheus templates, dashboard scripts, and utility tools. Changes include improved SSH process detection with hostname matching, dynamic compose-tool selection supporting CLI v2, metrics duration parameter propagation, datasource configuration updates, and documentation revisions clarifying vLLM metrics collection architecture.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 inconclusive)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
automation/test-execution/grafana/scripts/setup-tunnels.sh (1)
68-82: ⚠️ Potential issue | 🟠 Major
Keep reverse-tunnel matching consistent across setup/status/stop.
The updated regex at Lines 68 and 82 is correct for the ssh command shape, but Line 125 and Line 150 still use the old `host.*port` order. That causes false DOWN status and missed cleanup for active reverse tunnels.
🔧 Proposed fix
```diff
@@
 PUSHGATEWAY_PORT="${PROMETHEUS_PORT:-9091}"
 VLLM_METRICS_PORT=8000
+REVERSE_PUSH_TUNNEL_PATTERN="ssh.*${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT}.*${LOADGEN_HOSTNAME}"
@@
-pkill -f "ssh.*${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT}.*${LOADGEN_HOSTNAME}" 2>/dev/null || true
+pkill -f "${REVERSE_PUSH_TUNNEL_PATTERN}" 2>/dev/null || true
@@
-if pgrep -f "ssh.*${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT}.*${LOADGEN_HOSTNAME}" > /dev/null; then
+if pgrep -f "${REVERSE_PUSH_TUNNEL_PATTERN}" > /dev/null; then
@@
-if pgrep -f "ssh.*${LOADGEN_HOSTNAME}.*${PUSHGATEWAY_PORT}" > /dev/null; then
+if pgrep -f "${REVERSE_PUSH_TUNNEL_PATTERN}" > /dev/null; then
@@
-pkill -f "ssh.*${LOADGEN_HOSTNAME}.*${PUSHGATEWAY_PORT}" 2>/dev/null || true
+pkill -f "${REVERSE_PUSH_TUNNEL_PATTERN}" 2>/dev/null || true
```

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@automation/test-execution/grafana/scripts/setup-tunnels.sh` around lines 68 - 82, The reverse-tunnel matching is inconsistent: the setup/status/stop checks must use the same ssh pattern that includes the remote port mapping (${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT}) and the target host (${LOADGEN_HOSTNAME}). Locate the pgrep/pkill occurrences that still match "host.*port" (the status and stop checks around where you verify and kill tunnels) and change their -f regex to the same form used in the setup and verification blocks (match the literal reverse-forward snippet and the LOADGEN host), e.g. use a pattern containing "${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT}.*${LOADGEN_HOSTNAME}" so status detection and cleanup consistently find the same ssh process.
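The shared-pattern approach the fix proposes can be sanity-checked against the shape of the ssh command line. The hostname and exact ssh flags below are hypothetical stand-ins, purely for illustration:

```shell
# Hypothetical reverse-tunnel command line (host and flags are stand-ins).
PUSHGATEWAY_PORT=9091
LOADGEN_HOSTNAME="loadgen.example.internal"
CMDLINE="ssh -N -R ${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT} ec2-user@${LOADGEN_HOSTNAME}"

# The shared pattern from the proposed fix matches this shape...
NEW_PATTERN="ssh.*${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT}.*${LOADGEN_HOSTNAME}"
# ...while the stale host-then-port order does not, because on the real
# command line the port mapping precedes the hostname.
OLD_PATTERN="ssh.*${LOADGEN_HOSTNAME}.*${PUSHGATEWAY_PORT}"

new_match=$(echo "$CMDLINE" | grep -cE "$NEW_PATTERN" || true)
old_match=$(echo "$CMDLINE" | grep -cE "$OLD_PATTERN" || true)
echo "new pattern matches: $new_match, old pattern matches: $old_match"
```

This is why the status and stop checks that still use the old order report DOWN for a tunnel that is actually up.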
🧹 Nitpick comments (1)
automation/test-execution/ansible/publish-existing-results.yml (1)
40-45: Fallback chain in results_path is redundant.
Since core_configuration.name is always set at line 40 (with its own fallback), the subsequent defaults in line 45 (| default(core_config_name) | default(...)) will never be reached. This adds unnecessary complexity. Consider simplifying:
Proposed simplification

```diff
 - name: Build results path
   ansible.builtin.set_fact:
-    results_path: "{{ playbook_dir }}/../../../results/llm/{{ test_model | replace('/', '__') }}/{{ workload_type }}-{{ test_run_id }}/{{ core_configuration.name | default(core_config_name) | default(requested_cores | string + 'cores-numa' + (core_configuration.numa_node | default(0) | string) + '-tp' + (core_configuration.tensor_parallel | default(1) | string)) }}"
+    results_path: "{{ playbook_dir }}/../../../results/llm/{{ test_model | replace('/', '__') }}/{{ workload_type }}-{{ test_run_id }}/{{ core_configuration.name }}"
```
Verify each finding against the current code and only fix it if needed. In `@automation/test-execution/ansible/publish-existing-results.yml` around lines 40 - 45, The results_path template uses a redundant fallback chain because core_configuration.name is always set earlier; simplify the expression by using core_configuration.name with a single fallback to the computed default (requested_cores... with numa_node and tensor_parallel) or to core_config_name if you prefer that as the primary fallback—update the results_path set_fact to reference core_configuration.name | default(core_config_name | default(requested_cores | string + 'cores-numa' + (core_configuration.numa_node | default(0) | string) + '-tp' + (core_configuration.tensor_parallel | default(1) | string))) so you remove the unreachable extra defaults and keep only the necessary fallback order for results_path, referencing results_path, core_configuration.name, core_config_name, and requested_cores to locate the change.
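The unreachable-fallback behavior has a close shell analogue in `${var:-fallback}` expansion. The values below are hypothetical, purely to illustrate why a later default never fires once the first value is already set:

```shell
# core_configuration.name is always set earlier (with its own fallback), so the
# chained Jinja2 default() filters behave like nested ${var:-...} expansions
# where the outermost variable is non-empty.
core_configuration_name="8cores-numa0-tp1"   # always defined by this point
core_config_name="from-test-metadata"        # hypothetical fallback value

results_leaf="${core_configuration_name:-${core_config_name:-computed-default}}"
echo "$results_leaf"
```

The inner fallbacks are dead code: only the first value ever reaches the output.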
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5e177086-50a8-421a-85aa-b0a58d384e36
📒 Files selected for processing (13)
- automation/test-execution/ansible/llm-benchmark-auto.yml
- automation/test-execution/ansible/publish-existing-results.yml
- automation/test-execution/ansible/roles/metrics_publisher/templates/prometheus_metrics.j2
- automation/test-execution/ansible/roles/vllm_metrics_collector/tasks/stop.yml
- automation/test-execution/ansible/start-grafana.yml
- automation/test-execution/ansible/stop-grafana.yml
- automation/test-execution/dashboard-examples/README.md
- automation/test-execution/dashboard-examples/setup.sh
- automation/test-execution/dashboard-examples/vllm_dashboard/pages/2_🖥️_Server_Metrics.py
- automation/test-execution/grafana/README.md
- automation/test-execution/grafana/dashboards/vllm_comprehensive.json
- automation/test-execution/grafana/scripts/setup-tunnels.sh
- automation/utilities/cleanup-collector-scripts.sh
```yaml
# Only kill SSH tunnel processes, not vLLM server or other services
pgrep -f "ssh.*-L.*8000:localhost:8000" | head -1
```
SSH tunnel pattern inconsistent with start-grafana.yml—may match unintended tunnels.
The pattern ssh.*-L.*8000:localhost:8000 here lacks hostname anchoring, while start-grafana.yml (lines 430, 451) uses ssh.*8000:localhost:8000.*{{ dut_hostname }}. On systems with multiple SSH tunnels to different hosts, this could kill the wrong tunnel.
Consider adding hostname anchoring for consistency:
Proposed fix

```diff
 - name: Get tunnel PID
   ansible.builtin.shell: |
-    # Only kill SSH tunnel processes, not vLLM server or other services
-    pgrep -f "ssh.*-L.*8000:localhost:8000" | head -1
+    # Only kill SSH tunnel processes for the DUT, not other tunnels
+    pgrep -f "ssh.*-L.*8000:localhost:8000.*{{ groups['dut'][0] }}" | head -1
   register: tunnel_pid_check
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
# Only kill SSH tunnel processes for the DUT, not other tunnels
pgrep -f "ssh.*-L.*8000:localhost:8000.*{{ groups['dut'][0] }}" | head -1
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@automation/test-execution/ansible/llm-benchmark-auto.yml` around lines 519 -
520, The pgrep pattern "ssh.*-L.*8000:localhost:8000" can match tunnels to other
hosts; update the pgrep invocation used here to include the DUT hostname anchor
(same approach as in start-grafana.yml) so it only matches the specific tunnel
for the target host — for example, change the pattern used by the pgrep call
that searches for the SSH tunnel to include "8000:localhost:8000.*{{
dut_hostname }}" (or the equivalent runtime variable like ${dut_hostname}) so
only the intended SSH process is selected.
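The over-matching risk can be reproduced with grep over simulated process command lines (both hostnames are placeholders): the unanchored pattern hits every 8000:localhost:8000 tunnel, while the hostname-anchored one selects only the DUT's.

```shell
# Two hypothetical SSH tunnels on the same machine; only the first is to the DUT.
TUNNELS='ssh -N -L 8000:localhost:8000 ec2-user@dut.example.internal
ssh -N -L 8000:localhost:8000 ec2-user@other.example.internal'

unanchored=$(printf '%s\n' "$TUNNELS" | grep -cE "ssh.*-L.*8000:localhost:8000")
anchored=$(printf '%s\n' "$TUNNELS" | grep -cE "ssh.*-L.*8000:localhost:8000.*dut\.example\.internal")

echo "unanchored matches: $unanchored"   # both tunnels match
echo "anchored matches: $anchored"       # only the DUT tunnel matches
```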
```diff
+- name: Read current datasource version
+  ansible.builtin.shell: |
+    grep -oP 'version: \K\d+' "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml" || echo "1"
+  register: current_version
+  changed_when: false
+  when: datasource_update_linux.changed or datasource_update_macos.changed
+
 - name: Bump datasource version to force reload
   ansible.builtin.replace:
     path: "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml"
     regexp: "version: \\d+"
-    replace: "version: {{ 99 | random(start=10) }}"
+    replace: "version: {{ current_version.stdout | int + 1 }}"
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Verify if the playbook is expected to run on macOS and check grep capabilities
# Check if there are any Darwin-specific conditions in the playbook
rg -n "Darwin|macos|macOS" automation/test-execution/ansible/start-grafana.yml
# Check if GNU grep might be installed via Homebrew on typical macOS setups
echo "---"
echo "Note: On macOS, GNU grep can be installed via 'brew install grep' as 'ggrep'"
```

Repository: redhat-et/vllm-cpu-perf-eval
Length of output: 1019
Use portable grep syntax to avoid BSD grep -P incompatibility on macOS.
The grep -oP command fails on macOS (which uses BSD grep by default) since the -P flag is GNU-specific. The playbook runs on both Darwin and Linux (see lines 34–35), and the datasource version task executes on macOS when the datasource is updated. Replace with a portable alternative:
Proposed fix

```diff
 - name: Read current datasource version
   ansible.builtin.shell: |
-    grep -oP 'version: \K\d+' "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml" || echo "1"
+    grep -E 'version: [0-9]+' "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml" | sed 's/.*version: //' | tr -d ' ' || echo "1"
   register: current_version
   changed_when: false
   when: datasource_update_linux.changed or datasource_update_macos.changed
```

📝 Committable suggestion
```yaml
- name: Read current datasource version
  ansible.builtin.shell: |
    grep -E 'version: [0-9]+' "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml" | sed 's/.*version: //' | tr -d ' ' || echo "1"
  register: current_version
  changed_when: false
  when: datasource_update_linux.changed or datasource_update_macos.changed

- name: Bump datasource version to force reload
  ansible.builtin.replace:
    path: "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml"
    regexp: "version: \\d+"
    replace: "version: {{ current_version.stdout | int + 1 }}"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@automation/test-execution/ansible/start-grafana.yml` around lines 259 - 270,
The "Read current datasource version" task uses GNU-only grep -P which fails on
macOS; replace the shell command with a portable extractor such as awk to read
the version (e.g. use awk -F': ' '/version:/{print $2; exit}' "{{ grafana_dir
}}/provisioning/datasources/prometheus.yaml" || echo "1") so the task still
registers current_version and keeps changed_when: false and the same when
condition; keep the subsequent "Bump datasource version to force reload" replace
task unchanged (it will use current_version.stdout | int + 1).
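Both portable extractors above (the sed pipeline from the committable suggestion and the awk one-liner from the AI-agent prompt) can be checked against a minimal stand-in for the provisioning file; the file contents here are hypothetical:

```shell
# Minimal hypothetical stand-in for provisioning/datasources/prometheus.yaml.
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
datasources:
  - name: Prometheus
    uid: prometheus-vllm
    version: 3
EOF

# sed-based extractor (from the committable suggestion)
v_sed=$(grep -E 'version: [0-9]+' "$tmpfile" | sed 's/.*version: //' | tr -d ' ' || echo "1")
# awk-based extractor (from the AI-agent prompt)
v_awk=$(awk -F': ' '/version:/{print $2; exit}' "$tmpfile" || echo "1")

echo "sed: $v_sed, awk: $v_awk"
rm -f "$tmpfile"
```

Neither variant depends on GNU extensions, so both behave the same under BSD grep/awk on macOS and GNU tools on Linux.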
```diff
 "datasource": {
   "type": "prometheus",
-  "uid": "edx8memhpd9tsa"
+  "uid": "prometheus-vllm"
```
Avoid hardcoded datasource UIDs in a variable-driven dashboard.
These hardcoded entries bypass ${DS_PROMETHEUS} and create inconsistent behavior (notably in panel id 15, where Line 1366 still uses ${DS_PROMETHEUS}). Keep all targets bound to the same template variable.
🔧 Proposed fix

```diff
-  "uid": "prometheus-vllm"
+  "uid": "${DS_PROMETHEUS}"
```

(Apply at Lines 1250, 1350, and 1463.)
As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Also applies to: 1350-1350, 1463-1463
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@automation/test-execution/grafana/dashboards/vllm_comprehensive.json` at line
1250, Replace hardcoded datasource UIDs ("prometheus-vllm") in the dashboard
JSON with the template variable ${DS_PROMETHEUS} so all panels use the same
datasource variable; locate occurrences of "uid": "prometheus-vllm" and change
them to reference ${DS_PROMETHEUS}, and verify panel id 15 (the panel that still
uses the hardcoded UID) and any other targets use ${DS_PROMETHEUS} in their
target.datasource or datasource fields to ensure consistent, variable-driven
datasource binding.
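A quick grep-based audit along the lines this comment asks for might look like the following. The inline JSON fragment is hypothetical, mimicking the mixed state described around panel id 15:

```shell
# Hypothetical dashboard fragment with one hardcoded UID left over.
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
{"panels": [
  {"id": 14, "datasource": {"type": "prometheus", "uid": "prometheus-vllm"}},
  {"id": 15, "datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"}}
]}
EOF

# Count and list datasource UIDs that bypass the template variable.
hardcoded_count=$(grep -o '"uid": "[^"]*"' "$tmpfile" | grep -vc 'DS_PROMETHEUS')
grep -o '"uid": "[^"]*"' "$tmpfile" | grep -v 'DS_PROMETHEUS'
rm -f "$tmpfile"
```

Running such a check over vllm_comprehensive.json before committing would catch any panel that drifts back to a hardcoded UID.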
This PR addresses the outstanding CodeRabbit AI comments from #71
Summary by CodeRabbit
Bug Fixes
Improvements