
Dashboard integration - address coderabitai comments #93

Open
maryamtahhan wants to merge 8 commits into redhat-et:main from maryamtahhan:dashboard-integration-v2

Conversation

@maryamtahhan
Collaborator

@maryamtahhan maryamtahhan commented Apr 3, 2026

This PR addresses the outstanding CodeRabbit AI comments from #71.

Summary by CodeRabbit

  • Bug Fixes

    • Improved SSH tunnel detection for vLLM metrics collection to avoid targeting unrelated processes.
    • Enhanced Prometheus metrics calculations with corrected evaluation order for latency fields.
    • Fixed metrics collection timeout to dynamically adjust based on test duration.
    • Corrected ITL (Inter-Token Latency) delta calculations in dashboard metrics.
    • Better cross-platform support for Docker Compose tool detection (macOS and Linux).
    • Improved Grafana datasource configuration with updated references.
  • Improvements

    • Enhanced error handling in setup scripts for reliability.
    • Updated documentation to clarify vLLM metrics collection flow.

maryamtahhan and others added 8 commits April 3, 2026 08:59
Critical and major fixes:
- SSH tunnel cleanup: use pgrep instead of lsof for safer process targeting
- Results path: search for test-metadata.json to extract core_config_name
- Prometheus metrics: apply default filter before multiplication
- Metrics collector: use dynamic timeout based on test duration
- Setup script: add error handling for cd command
- ITL metrics: fix metric name mismatch in Server_Metrics.py dashboard
- Grafana dashboard: replace hardcoded datasource UID with template variable

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
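The pgrep-based tunnel targeting described above can be sketched as a small shell helper. This is an illustrative sketch, not the playbook's exact task: the host name and port are assumed values, and the function name is invented for the example.

```shell
# Sketch: match the ssh command line carrying the local-forward flag,
# instead of using lsof on the port (which could return vLLM itself or
# any other listener on 8000). Host and port values are illustrative.
find_tunnel_pid() {
    host="$1"
    port="$2"
    # Only ssh processes forwarding this port toward this host match
    pgrep -f "ssh.*-L.*${port}:localhost:${port}.*${host}" | head -1
}

pid=$(find_tunnel_pid "dut.example.com" 8000)
if [ -n "$pid" ]; then
    echo "tunnel pid: $pid"
else
    echo "no tunnel found"
fi
```

Matching on the full command line with `pgrep -f` is what makes this safer than `lsof -i :8000`, which reports whichever process owns the socket.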
- Remove hardcoded EC2 hostnames and SSH key path from setup-tunnels.sh
- Fix cleanup-collector-scripts.sh to search /tmp instead of results/llm
- Update pattern from collect_vllm_metrics.py to collect_vllm_metrics_*.py

These were security concerns flagged by CodeRabbit AI review.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
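A minimal sketch of the revised cleanup behavior, run against a scratch directory rather than the real /tmp; the wildcarded file pattern follows the commit description, and the function name is illustrative.

```shell
# Sketch: remove generated collector scripts from the top level of a
# target directory (the real script targets /tmp). -maxdepth 1 prevents
# descending into unrelated subdirectories, and the wildcard matches the
# per-run collect_vllm_metrics_*.py naming.
cleanup_collector_scripts() {
    target_dir="$1"
    find "$target_dir" -maxdepth 1 -type f -name 'collect_vllm_metrics_*.py' -print -delete
}

# Demonstrate against a scratch directory instead of /tmp
scratch=$(mktemp -d)
touch "$scratch/collect_vllm_metrics_demo.py"
cleanup_collector_scripts "$scratch"
```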
- Add Docker Compose CLI v2 (docker compose) detection to start/stop-grafana playbooks
- Replace Linux-only port probing (ss/netstat) with portable wait_for in start-grafana
- Fix stop-grafana to use OS-specific compose files matching start-grafana behavior
- Make datasource version bump deterministic instead of random
- Enhance SSH tunnel detection to include hostname for precise matching
- Improve container log collection to detect more failure scenarios
- Fix dashboard-examples README contradictions about vllm-metrics.json source
- Update default results path from ../../../../../results/llm to ../../../../results/llm
- Make grafana README Prometheus config OS-agnostic with platform-specific notes

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…op-grafana container detection

- Revert wait_for-based port detection back to ss/netstat shell commands which are more reliable
- Fix stop-grafana to detect which container runtime is actually managing containers
- Check for running containers with podman/docker labels before choosing compose tool
- Prevents docker compose from trying to stop podman-managed containers on Linux

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…mpose

- Revert complex container detection logic that wasn't working
- Check podman-compose FIRST before docker tools
- Ensures containers started with podman-compose are stopped with podman-compose
- Keeps Docker CLI v2 support and OS-specific compose file fixes

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Check podman-compose FIRST in start-grafana.yml (matching stop-grafana.yml)
- Ensures both playbooks use the same compose tool
- Prevents issue where containers started with one tool can't be stopped by another
- Priority order now consistent: podman-compose > docker-compose > docker compose CLI v2

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
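The priority order above can be sketched as a single detection helper shared by start and stop; the function name is illustrative, not the playbooks' actual task structure.

```shell
# Sketch of the compose-tool priority: podman-compose first, then legacy
# docker-compose, then the Docker CLI v2 plugin. Using one ordering in
# both playbooks ensures containers are stopped by the same tool that
# started them.
detect_compose_tool() {
    if command -v podman-compose >/dev/null 2>&1; then
        echo "podman-compose"
    elif command -v docker-compose >/dev/null 2>&1; then
        echo "docker-compose"
    elif docker compose version >/dev/null 2>&1; then
        echo "docker compose"
    else
        echo "no compose tool found" >&2
        return 1
    fi
}
```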
The dashboard was changed to use ${DS_PROMETHEUS} template variable which
wasn't working properly. Reverting to hardcoded UID "edx8memhpd9tsa" that
was working in commit 06b70f3.

This fixes the issue where Grafana showed no stats even though:
- Prometheus was successfully scraping vLLM metrics
- Grafana could reach Prometheus
- All services were healthy

The template variable approach needs proper configuration which wasn't in place.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Changed datasource UID from "edx8memhpd9tsa" to "prometheus-vllm" to match
the actual provisioned datasource defined in prometheus.yaml.

This should fix Grafana not displaying metrics despite Prometheus successfully
scraping vLLM.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
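A quick sanity check for the kind of mismatch fixed here can be sketched as follows; the helper name is invented for illustration, and the expected uid value follows this commit.

```shell
# Sketch: confirm a dashboard JSON references the datasource uid that the
# provisioning config actually defines ("prometheus-vllm" per this commit).
# A mismatch is exactly the failure mode described above: healthy services
# but no stats in Grafana.
check_datasource_uid() {
    dashboard_file="$1"
    expected_uid="$2"
    if grep -q "\"uid\": \"${expected_uid}\"" "$dashboard_file"; then
        echo "uid matches"
    else
        echo "uid mismatch" >&2
        return 1
    fi
}
```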
@coderabbitai

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough

Walkthrough

Updates span Ansible playbooks for vLLM metrics collection and Grafana orchestration, Prometheus templates, dashboard scripts, and utility tools. Changes include improved SSH process detection with hostname matching, dynamic compose-tool selection supporting CLI v2, metrics duration parameter propagation, datasource configuration updates, and documentation revisions clarifying vLLM metrics collection architecture.

Changes

Cohort / File(s) Summary
vLLM Metrics Collection
automation/test-execution/ansible/llm-benchmark-auto.yml, automation/test-execution/ansible/roles/vllm_metrics_collector/tasks/stop.yml, automation/test-execution/ansible/publish-existing-results.yml
Added metrics_collection_duration variable propagation to metrics collector stop call and wait timeout computation. Replaced SSH tunnel PID detection from lsof port-listener lookup to pgrep with SSH local-forward pattern matching. Introduced logic to discover core_config_name from on-disk test-metadata.json and derive core_configuration.name for results path construction.
Grafana Deployment & Tunneling
automation/test-execution/ansible/start-grafana.yml, automation/test-execution/ansible/stop-grafana.yml, automation/test-execution/grafana/scripts/setup-tunnels.sh
Expanded compose-tool detection to include docker compose CLI v2 alongside legacy tools. Updated SSH tunnel process matching patterns to include target hostname constraints. Added dynamic compose-file selection based on platform (docker-compose.macos.yml on Darwin). Changed datasource reload versioning from random bump to deterministic increment.
Prometheus Metrics & Dashboards
automation/test-execution/ansible/roles/metrics_publisher/templates/prometheus_metrics.j2, automation/test-execution/grafana/dashboards/vllm_comprehensive.json, automation/test-execution/dashboard-examples/vllm_dashboard/pages/2_🖥️_Server_Metrics.py
Applied default(0) fallback to latency fields before unit conversion in Prometheus template. Updated hardcoded datasource uid from edx8memhpd9tsa to prometheus-vllm in dashboard panels. Changed ITL computation delta inputs from vllm:time_per_output_token_seconds_* to vllm:request_time_per_output_token_seconds_* metric names.
Documentation & Examples
automation/test-execution/dashboard-examples/README.md, automation/test-execution/grafana/README.md
Updated documentation to reflect vLLM metrics collection via direct /metrics endpoint scraping instead of Prometheus intermediary. Clarified platform-specific host addressing (host.containers.internal for macOS, localhost for Linux) and adjusted default relative results directory path.
Setup & Utility Scripts
automation/test-execution/dashboard-examples/setup.sh, automation/utilities/cleanup-collector-scripts.sh
Added explicit error handling to directory change with fast-fail behavior. Refactored cleanup script to target temporary directory instead of results directory, changed file pattern from collect_vllm_metrics.py to collect_vllm_metrics_*.py, and limited discovery scope to top level via -maxdepth 1.
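The deterministic datasource version bump noted in the Grafana deployment cohort can be sketched like this; the function name and YAML layout are illustrative. awk keeps the read portable across GNU and BSD userlands (BSD grep has no -P).

```shell
# Sketch: read the current "version:" value from a provisioning YAML and
# increment it, rather than writing a random number, so repeated runs
# produce a predictable monotonic sequence.
bump_datasource_version() {
    file="$1"
    current=$(awk -F': ' '/version:/{print $2; exit}' "$file")
    current="${current:-1}"
    next=$((current + 1))
    # sed -i with an attached backup suffix works on both GNU and BSD sed
    sed -i.bak "s/version: ${current}/version: ${next}/" "$file" && rm -f "${file}.bak"
    echo "$next"
}
```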

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Title check: ❓ Inconclusive
Explanation: The PR title 'Dashboard integration - address coderabitai comments' is partially related to the changeset. While the PR objectives state this addresses CodeRabbit AI review comments, the title is vague and doesn't clearly convey the primary changes. The actual work spans multiple systems beyond just 'dashboard integration': SSH tunnel cleanup, metrics collection, Prometheus handling, Ansible playbooks, compose tooling, and documentation updates.
Resolution: Consider a more descriptive title such as 'Improve metrics collection, SSH tunnel detection, and Grafana integration', or focus the PR scope more narrowly if dashboard integration is the primary objective.
✅ Passed checks (2 passed)
Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.



Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
automation/test-execution/grafana/scripts/setup-tunnels.sh (1)

68-82: ⚠️ Potential issue | 🟠 Major

Keep reverse-tunnel matching consistent across setup/status/stop.

The updated regex at Lines 68 and 82 is correct for the ssh command shape, but Line 125 and Line 150 still use the old host.*port order. That causes false DOWN status and missed cleanup for active reverse tunnels.

🔧 Proposed fix
@@
 PUSHGATEWAY_PORT="${PROMETHEUS_PORT:-9091}"
 VLLM_METRICS_PORT=8000
+REVERSE_PUSH_TUNNEL_PATTERN="ssh.*${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT}.*${LOADGEN_HOSTNAME}"
@@
-    pkill -f "ssh.*${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT}.*${LOADGEN_HOSTNAME}" 2>/dev/null || true
+    pkill -f "${REVERSE_PUSH_TUNNEL_PATTERN}" 2>/dev/null || true
@@
-    if pgrep -f "ssh.*${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT}.*${LOADGEN_HOSTNAME}" > /dev/null; then
+    if pgrep -f "${REVERSE_PUSH_TUNNEL_PATTERN}" > /dev/null; then
@@
-    if pgrep -f "ssh.*${LOADGEN_HOSTNAME}.*${PUSHGATEWAY_PORT}" > /dev/null; then
+    if pgrep -f "${REVERSE_PUSH_TUNNEL_PATTERN}" > /dev/null; then
@@
-    pkill -f "ssh.*${LOADGEN_HOSTNAME}.*${PUSHGATEWAY_PORT}" 2>/dev/null || true
+    pkill -f "${REVERSE_PUSH_TUNNEL_PATTERN}" 2>/dev/null || true

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/grafana/scripts/setup-tunnels.sh` around lines 68 -
82, The reverse-tunnel matching is inconsistent: the setup/status/stop checks
must use the same ssh pattern that includes the remote port mapping
(${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT}) and the target host
(${LOADGEN_HOSTNAME}). Locate the pgrep/pkill occurrences that still match
"host.*port" (the status and stop checks around where you verify and kill
tunnels) and change their -f regex to the same form used in the setup and
verification blocks (match the literal reverse-forward snippet and the LOADGEN
host), e.g. use a pattern containing
"${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT}.*${LOADGEN_HOSTNAME}" so
status detection and cleanup consistently find the same ssh process.
🧹 Nitpick comments (1)
automation/test-execution/ansible/publish-existing-results.yml (1)

40-45: Fallback chain in results_path is redundant.

Since core_configuration.name is always set at line 40 (with its own fallback), the subsequent defaults in line 45 (| default(core_config_name) | default(...)) will never be reached. This adds unnecessary complexity.

Consider simplifying:

Proposed simplification
     - name: Build results path
       ansible.builtin.set_fact:
-        results_path: "{{ playbook_dir }}/../../../results/llm/{{ test_model | replace('/', '__') }}/{{ workload_type }}-{{ test_run_id }}/{{ core_configuration.name | default(core_config_name) | default(requested_cores | string + 'cores-numa' + (core_configuration.numa_node | default(0) | string) + '-tp' + (core_configuration.tensor_parallel | default(1) | string)) }}"
+        results_path: "{{ playbook_dir }}/../../../results/llm/{{ test_model | replace('/', '__') }}/{{ workload_type }}-{{ test_run_id }}/{{ core_configuration.name }}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/ansible/publish-existing-results.yml` around lines
40 - 45, The results_path template uses a redundant fallback chain because
core_configuration.name is always set earlier; simplify the expression by using
core_configuration.name with a single fallback to the computed default
(requested_cores... with numa_node and tensor_parallel) or to core_config_name
if you prefer that as the primary fallback—update the results_path set_fact to
reference core_configuration.name | default(core_config_name |
default(requested_cores | string + 'cores-numa' + (core_configuration.numa_node
| default(0) | string) + '-tp' + (core_configuration.tensor_parallel |
default(1) | string))) so you remove the unreachable extra defaults and keep
only the necessary fallback order for results_path, referencing results_path,
core_configuration.name, core_config_name, and requested_cores to locate the
change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@automation/test-execution/ansible/llm-benchmark-auto.yml`:
- Around line 519-520: The pgrep pattern "ssh.*-L.*8000:localhost:8000" can
match tunnels to other hosts; update the pgrep invocation used here to include
the DUT hostname anchor (same approach as in start-grafana.yml) so it only
matches the specific tunnel for the target host — for example, change the
pattern used by the pgrep call that searches for the SSH tunnel to include
"8000:localhost:8000.*{{ dut_hostname }}" (or the equivalent runtime variable
like ${dut_hostname}) so only the intended SSH process is selected.

In `@automation/test-execution/ansible/start-grafana.yml`:
- Around line 259-270: The "Read current datasource version" task uses GNU-only
grep -P which fails on macOS; replace the shell command with a portable
extractor such as awk to read the version (e.g. use awk -F': ' '/version:/{print
$2; exit}' "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml" || echo
"1") so the task still registers current_version and keeps changed_when: false
and the same when condition; keep the subsequent "Bump datasource version to
force reload" replace task unchanged (it will use current_version.stdout | int +
1).

In `@automation/test-execution/grafana/dashboards/vllm_comprehensive.json`:
- Line 1250: Replace hardcoded datasource UIDs ("prometheus-vllm") in the
dashboard JSON with the template variable ${DS_PROMETHEUS} so all panels use the
same datasource variable; locate occurrences of "uid": "prometheus-vllm" and
change them to reference ${DS_PROMETHEUS}, and verify panel id 15 (the panel
that still uses the hardcoded UID) and any other targets use ${DS_PROMETHEUS} in
their target.datasource or datasource fields to ensure consistent,
variable-driven datasource binding.

---

Outside diff comments:
In `@automation/test-execution/grafana/scripts/setup-tunnels.sh`:
- Around line 68-82: The reverse-tunnel matching is inconsistent: the
setup/status/stop checks must use the same ssh pattern that includes the remote
port mapping (${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT}) and the target
host (${LOADGEN_HOSTNAME}). Locate the pgrep/pkill occurrences that still match
"host.*port" (the status and stop checks around where you verify and kill
tunnels) and change their -f regex to the same form used in the setup and
verification blocks (match the literal reverse-forward snippet and the LOADGEN
host), e.g. use a pattern containing
"${PUSHGATEWAY_PORT}:localhost:${PUSHGATEWAY_PORT}.*${LOADGEN_HOSTNAME}" so
status detection and cleanup consistently find the same ssh process.

---

Nitpick comments:
In `@automation/test-execution/ansible/publish-existing-results.yml`:
- Around line 40-45: The results_path template uses a redundant fallback chain
because core_configuration.name is always set earlier; simplify the expression
by using core_configuration.name with a single fallback to the computed default
(requested_cores... with numa_node and tensor_parallel) or to core_config_name
if you prefer that as the primary fallback—update the results_path set_fact to
reference core_configuration.name | default(core_config_name |
default(requested_cores | string + 'cores-numa' + (core_configuration.numa_node
| default(0) | string) + '-tp' + (core_configuration.tensor_parallel |
default(1) | string))) so you remove the unreachable extra defaults and keep
only the necessary fallback order for results_path, referencing results_path,
core_configuration.name, core_config_name, and requested_cores to locate the
change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5e177086-50a8-421a-85aa-b0a58d384e36

📥 Commits

Reviewing files that changed from the base of the PR and between 492d475 and 1657329.

📒 Files selected for processing (13)
  • automation/test-execution/ansible/llm-benchmark-auto.yml
  • automation/test-execution/ansible/publish-existing-results.yml
  • automation/test-execution/ansible/roles/metrics_publisher/templates/prometheus_metrics.j2
  • automation/test-execution/ansible/roles/vllm_metrics_collector/tasks/stop.yml
  • automation/test-execution/ansible/start-grafana.yml
  • automation/test-execution/ansible/stop-grafana.yml
  • automation/test-execution/dashboard-examples/README.md
  • automation/test-execution/dashboard-examples/setup.sh
  • automation/test-execution/dashboard-examples/vllm_dashboard/pages/2_🖥️_Server_Metrics.py
  • automation/test-execution/grafana/README.md
  • automation/test-execution/grafana/dashboards/vllm_comprehensive.json
  • automation/test-execution/grafana/scripts/setup-tunnels.sh
  • automation/utilities/cleanup-collector-scripts.sh

Comment on lines +519 to +520
# Only kill SSH tunnel processes, not vLLM server or other services
pgrep -f "ssh.*-L.*8000:localhost:8000" | head -1

⚠️ Potential issue | 🟡 Minor

SSH tunnel pattern inconsistent with start-grafana.yml—may match unintended tunnels.

The pattern ssh.*-L.*8000:localhost:8000 here lacks hostname anchoring, while start-grafana.yml (lines 430, 451) uses ssh.*8000:localhost:8000.*{{ dut_hostname }}. On systems with multiple SSH tunnels to different hosts, this could kill the wrong tunnel.

Consider adding hostname anchoring for consistency:

Proposed fix
         - name: Get tunnel PID
           ansible.builtin.shell: |
-            # Only kill SSH tunnel processes, not vLLM server or other services
-            pgrep -f "ssh.*-L.*8000:localhost:8000" | head -1
+            # Only kill SSH tunnel processes for the DUT, not other tunnels
+            pgrep -f "ssh.*-L.*8000:localhost:8000.*{{ groups['dut'][0] }}" | head -1
           register: tunnel_pid_check
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-# Only kill SSH tunnel processes, not vLLM server or other services
-pgrep -f "ssh.*-L.*8000:localhost:8000" | head -1
+# Only kill SSH tunnel processes for the DUT, not other tunnels
+pgrep -f "ssh.*-L.*8000:localhost:8000.*{{ groups['dut'][0] }}" | head -1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/ansible/llm-benchmark-auto.yml` around lines 519 -
520, The pgrep pattern "ssh.*-L.*8000:localhost:8000" can match tunnels to other
hosts; update the pgrep invocation used here to include the DUT hostname anchor
(same approach as in start-grafana.yml) so it only matches the specific tunnel
for the target host — for example, change the pattern used by the pgrep call
that searches for the SSH tunnel to include "8000:localhost:8000.*{{
dut_hostname }}" (or the equivalent runtime variable like ${dut_hostname}) so
only the intended SSH process is selected.

Comment on lines +259 to +270
- name: Read current datasource version
ansible.builtin.shell: |
grep -oP 'version: \K\d+' "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml" || echo "1"
register: current_version
changed_when: false
when: datasource_update_linux.changed or datasource_update_macos.changed

- name: Bump datasource version to force reload
ansible.builtin.replace:
path: "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml"
regexp: "version: \\d+"
-  replace: "version: {{ 99 | random(start=10) }}"
+  replace: "version: {{ current_version.stdout | int + 1 }}"

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify if the playbook is expected to run on macOS and check grep capabilities

# Check if there are any Darwin-specific conditions in the playbook
rg -n "Darwin|macos|macOS" automation/test-execution/ansible/start-grafana.yml

# Check if GNU grep might be installed via Homebrew on typical macOS setups
echo "---"
echo "Note: On macOS, GNU grep can be installed via 'brew install grep' as 'ggrep'"

Repository: redhat-et/vllm-cpu-perf-eval

Length of output: 1019


Use portable grep syntax to avoid BSD grep -P incompatibility on macOS.

The grep -oP command fails on macOS (which uses BSD grep by default) since the -P flag is GNU-specific. The playbook runs on both Darwin and Linux (see lines 34–35), and the datasource version task executes on macOS when the datasource is updated. Replace with a portable alternative:

Proposed fix
    - name: Read current datasource version
      ansible.builtin.shell: |
-       grep -oP 'version: \K\d+' "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml" || echo "1"
+       grep -E 'version: [0-9]+' "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml" | sed 's/.*version: //' | tr -d ' ' || echo "1"
       register: current_version
       changed_when: false
       when: datasource_update_linux.changed or datasource_update_macos.changed

Suggested change
     - name: Read current datasource version
       ansible.builtin.shell: |
-        grep -oP 'version: \K\d+' "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml" || echo "1"
+        grep -E 'version: [0-9]+' "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml" | sed 's/.*version: //' | tr -d ' ' || echo "1"
       register: current_version
       changed_when: false
       when: datasource_update_linux.changed or datasource_update_macos.changed

     - name: Bump datasource version to force reload
       ansible.builtin.replace:
         path: "{{ grafana_dir }}/provisioning/datasources/prometheus.yaml"
         regexp: "version: \\d+"
-        replace: "version: {{ 99 | random(start=10) }}"
+        replace: "version: {{ current_version.stdout | int + 1 }}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/ansible/start-grafana.yml` around lines 259 - 270,
The "Read current datasource version" task uses GNU-only grep -P which fails on
macOS; replace the shell command with a portable extractor such as awk to read
the version (e.g. use awk -F': ' '/version:/{print $2; exit}' "{{ grafana_dir
}}/provisioning/datasources/prometheus.yaml" || echo "1") so the task still
registers current_version and keeps changed_when: false and the same when
condition; keep the subsequent "Bump datasource version to force reload" replace
task unchanged (it will use current_version.stdout | int + 1).

"datasource": {
"type": "prometheus",
-  "uid": "edx8memhpd9tsa"
+  "uid": "prometheus-vllm"

⚠️ Potential issue | 🟠 Major

Avoid hardcoded datasource UIDs in a variable-driven dashboard.

These hardcoded entries bypass ${DS_PROMETHEUS} and create inconsistent behavior (notably in panel id 15, where Line 1366 still uses ${DS_PROMETHEUS}). Keep all targets bound to the same template variable.

🔧 Proposed fix
-            "uid": "prometheus-vllm"
+            "uid": "${DS_PROMETHEUS}"

(Apply at Lines 1250, 1350, and 1463.)

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Also applies to: 1350-1350, 1463-1463

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/grafana/dashboards/vllm_comprehensive.json` at line
1250, Replace hardcoded datasource UIDs ("prometheus-vllm") in the dashboard
JSON with the template variable ${DS_PROMETHEUS} so all panels use the same
datasource variable; locate occurrences of "uid": "prometheus-vllm" and change
them to reference ${DS_PROMETHEUS}, and verify panel id 15 (the panel that still
uses the hardcoded UID) and any other targets use ${DS_PROMETHEUS} in their
target.datasource or datasource fields to ensure consistent, variable-driven
datasource binding.

@maryamtahhan maryamtahhan changed the title Dashboard integration v2 Dashboard integration - address coderabitai comments Apr 7, 2026
