
Fix/model matrix validation errors #87

Open
maryamtahhan wants to merge 8 commits into redhat-et:main from maryamtahhan:fix/model-matrix-validation-errors

Conversation

@maryamtahhan
Collaborator

@maryamtahhan maryamtahhan commented Mar 31, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for AWX runtime environments with automatic result path handling
    • Enhanced embedding model metadata with dimensions and maximum sequence length specifications
  • Bug Fixes

    • Improved container execution reliability for benchmark tests
  • Documentation

    • Updated test suite model coverage (now includes 6 LLM models instead of 8)
    • Enhanced deployment and network configuration guidance
    • Clarified platform setup requirements for production-accurate measurements

maryamtahhan and others added 8 commits March 31, 2026 09:11
…rdening

This commit improves the existing Ansible playbook infrastructure for vLLM
CPU performance evaluation with enhanced AWX compatibility, security hardening,
and comprehensive testing.

- Fix type normalization to handle AnsibleUnsafeText from AWX
- Fix allocated_nodes to return integers instead of strings
- Handle empty strings and Jinja2 None conversions properly
- Simplify node eligibility checking and allocation logic
- Improve error messages for better validation feedback
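The AnsibleUnsafeText and Jinja2-None fixes above amount to coercing whatever AWX passes in into plain integers. A minimal Python sketch of that kind of normalization (`normalize_int` is a hypothetical name, not the plugin's actual API):

```python
def normalize_int(value, default=None):
    """Coerce AWX/Jinja2 values into plain ints (illustrative helper).

    AWX wraps extra-vars in AnsibleUnsafeText (a str subclass), and Jinja2
    can serialize an undefined value as '' or the literal string 'None'.
    Treat those as "not provided"; parse everything else as an integer.
    """
    if value is None:
        return default
    s = str(value).strip()          # AnsibleUnsafeText is a str subclass
    if s in ("", "None", "none"):   # empty string / Jinja2 None conversions
        return default
    try:
        return int(s)
    except ValueError:
        raise ValueError(f"expected an integer, got {value!r}")
```

Returning real `int`s here is what lets downstream logic (e.g. `allocated_nodes`) stop mixing string and integer types.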

- Add no_log: true to all tasks handling HF_TOKEN
- Prevent token exposure in container start operations
- Secure environment variable handling in AWX jobs

- Add comprehensive unit tests for cpu_utils filter plugin (598 lines)
- Test coverage for: CPU range conversion, NUMA extraction, multi-NUMA
  allocation, OMP binding, and real-world scenarios
- Support for both pytest and standalone execution
- Add collections/requirements.yml for Ansible collection dependencies
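One behavior those tests exercise, CPU range conversion, can be illustrated with a small standalone function (an assumed implementation for illustration, not the `cpu_utils` code itself): collapsing a list of CPU ids into the cpuset-style range string that tools like podman and numactl accept.

```python
def compress_cpu_list(cpus):
    """Collapse CPU ids into a cpuset-style string, e.g. [0,1,2,3,8,9] -> "0-3,8-9"."""
    ranges = []
    for cpu in sorted(cpus):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu            # extend the current contiguous run
        else:
            ranges.append([cpu, cpu])      # start a new run
    return ",".join(f"{a}-{b}" if a != b else f"{a}" for a, b in ranges)
```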

- Better parameter validation for AWX jobs in concurrent load testing
- AWX detection for result path handling
- Improved NUMA topology detection in core sweep
- Enhanced result path consistency in main benchmark
- Better workload configuration handling
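The AWX result-path handling boils down to an environment-variable check; a Python mirror of the playbook logic (illustrative only, assuming the `../../../results/llm` default and `$HOME/benchmark-results` AWX path described elsewhere in this PR):

```python
import os

def local_results_base(playbook_dir):
    """Pick the results directory the way the playbooks do (sketch).

    AWX sets AWX_JOB_ID in the job environment; on an AWX execution node the
    project checkout is ephemeral, so results land under $HOME instead of
    the repository tree.
    """
    is_awx_job = bool(os.environ.get("AWX_JOB_ID"))
    if is_awx_job:
        return os.path.join(os.environ.get("HOME", "/root"), "benchmark-results")
    return os.path.normpath(
        os.path.join(playbook_dir, "..", "..", "..", "results", "llm")
    )
```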

- Simplify prerequisites section
- Update examples with current best practices
- Clearer workflow documentation

Files changed: 13 files, 772 insertions(+), 323 deletions(-)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Remove undefined 'short_codegen' workload from validation lists to prevent
KeyError during test execution. Add CPU allocation validation to detect
under-allocation early with clear error messages. Fix pytest.raises fallback
to properly suppress exceptions when pytest is unavailable. Pin exact collection
versions for reproducible AWX/CI runs.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Fix inconsistent output filename reference in ansible.md (benchmarks.html → benchmarks.json)
- Improve requested_cores validation to handle non-numeric input safely
- Document requirements files usage (AWX/production vs CLI/development)
- Fix hardcoded log_collection_dest to use centralized local_results_base
- Remove obsolete vLLM configuration fields from test metadata docs
- Improve node capacity validation with sorting and per-node checks in cpu_utils.py
- Refactor duplicate AWX_JOB_ID lookup in llm-core-sweep-auto.yml
- Add NUMA node id type coercion for consistent integer types
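The sorted, per-node capacity checking can be sketched in a few lines (illustrative, not the plugin's exact algorithm): sort nodes by free capacity, take cores greedily, and fail early with a clear message if the topology cannot satisfy the request.

```python
def pick_nodes(numa_nodes, cores_needed):
    """Allocate cores from NUMA nodes, largest free capacity first (sketch)."""
    chosen, remaining = [], cores_needed
    for node in sorted(numa_nodes, key=lambda n: len(n["cpus"]), reverse=True):
        if remaining <= 0:
            break
        take = min(remaining, len(node["cpus"]))
        chosen.append((node["id"], node["cpus"][:take]))
        remaining -= take
    if remaining > 0:
        raise ValueError(
            f"only {cores_needed - remaining} of {cores_needed} requested cores available"
        )
    return chosen
```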

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
The previous commit changed NUMA node ids from strings to integers,
but the selectattr filters were still trying to match them as strings,
causing "No first item, sequence was empty" errors.

Remove the | string filter from selectattr comparisons to match
integer ids correctly.
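The Python equivalent of that mismatch: once node ids are integers, a string-valued comparison (what `selectattr` with a `| string` filter was effectively doing) matches nothing, while the plain integer comparison works.

```python
# Synthetic NUMA node facts with integer ids (as after the previous commit).
nodes = [{"id": 0, "cpus": "0-15"}, {"id": 1, "cpus": "16-31"}]

# String comparison finds no match -> "No first item, sequence was empty"
assert [n for n in nodes if n["id"] == "0"] == []

# Integer comparison matches as intended
assert [n for n in nodes if n["id"] == 0][0]["cpus"] == "0-15"
```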

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Fix allocate_with_fixed_tp to filter nodes by capacity (cpu_utils.py)
- Update Results summary to use local_results_base (llm-benchmark-auto.yml)
- Implement core_sweep_counts parameter handling (llm-benchmark-concurrent-load.yml)
  - Normalize core_sweep_counts to requested_cores for single-element lists
  - Reject multi-element lists with helpful error directing to llm-core-sweep-auto.yml
  - Pass effective_requested_cores to all 3 phases
- Restore model configuration facts in start-llm.yml for metadata collection
  - Set model_kv_cache, model_dtype, model_max_len, vllm_caching_mode
  - Fixes compatibility with get-vllm-config-from-dut.yml assertions
- Add regression tests for serialized TP inputs (test_cpu_utils.py)
  - Test empty string '' behaves like auto-TP
  - Test string 'None' behaves like auto-TP
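In plain Python, the `core_sweep_counts` normalization described above might look like this (a sketch with assumed names, not the playbook's Jinja2 expressions):

```python
def effective_requested_cores(requested_cores=None, core_sweep_counts=None):
    """Resolve the core count for a concurrent-load run (illustrative sketch).

    A single-element core_sweep_counts is treated as requested_cores;
    multi-element sweeps are rejected with a pointer to the sweep playbook.
    """
    if core_sweep_counts:
        if len(core_sweep_counts) > 1:
            raise ValueError(
                "core_sweep_counts with multiple values is not supported here; "
                "use llm-core-sweep-auto.yml for core sweeps"
            )
        return int(core_sweep_counts[0])
    if requested_cores is None:
        raise ValueError("either requested_cores or core_sweep_counts is required")
    return int(requested_cores)
```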

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…ainer start

Addresses two critical issues in vLLM configuration:

1. Per-model overrides (model_kv_cache_space, model_dtype) now properly
   override workload defaults instead of being ignored
2. Added preflight validation of --max-model-len against workload
   requirements (ISL + OSL) to fail early with clear error messages

Both fixes ensure configuration errors are caught early and per-model
settings take precedence over workload fallbacks.
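The preflight validation reduces to a single comparison between the context window and the workload's prompt plus completion lengths; a hedged Python sketch (function and parameter names are illustrative):

```python
def validate_max_model_len(max_model_len, isl, osl):
    """Preflight check: the context window must hold input + output tokens."""
    required = isl + osl
    if max_model_len < required:
        raise ValueError(
            f"--max-model-len {max_model_len} is smaller than the workload's "
            f"ISL+OSL requirement ({isl}+{osl}={required})"
        )
```

This is the same incompatibility that later motivated dropping the OPT models: a 2048-token context cannot serve a 4096-token summarization workload.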

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
This commit fixes two configuration bugs found by smoke tests:

Bug #1: Remove OPT models (opt-125m, opt-1.3b)
- Context length (2048) incompatible with summarization workload (4096)
- Legacy models causing validation failures
- Reduced model count from 8 to 6 LLM models

Changes:
- Removed opt-125m and opt-1.3b from model-matrix.yaml
- Updated README.md: 8 → 6 LLM models
- Updated docs/methodology/overview.md: 8 → 6 LLM models
- Removed OPT Family section from models/models.md
- Removed all OPT entries from model tables and test scenarios
- Removed "Decode-Heavy Models" section (was OPT-only)

Bug #2: Add missing embedding model fields
- granite-embedding-english-r2: added dimensions=384, max_sequence_length=512
- granite-embedding-278m-multilingual: added dimensions=768, max_sequence_length=512

These fixes ensure all model configurations pass validation tests.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

coderabbitai bot commented Mar 31, 2026

📝 Walkthrough

Walkthrough

This pull request updates Ansible playbooks to detect AWX runtime environments and dynamically configure result paths, removes legacy OPT models from the model matrix, refactors CPU core allocation logic with improved NUMA handling, eliminates vLLM configuration metadata collection from playbooks, and adds comprehensive unit test coverage for CPU utilities.

Changes

Cohort / File(s) Summary
AWX Environment Detection & Dynamic Results Paths
automation/test-execution/ansible/llm-benchmark.yml, automation/test-execution/ansible/llm-benchmark-auto.yml, automation/test-execution/ansible/llm-benchmark-concurrent-load.yml, automation/test-execution/ansible/llm-core-sweep-auto.yml
Added is_awx_job fact based on AWX_JOB_ID environment variable and introduced local_results_base to dynamically select results directory ($HOME/benchmark-results for AWX, playbook_dir/../../../results/llm otherwise). Updated all downstream path construction and output messages to use the dynamic base path.
vLLM Configuration Metadata Removal
automation/test-execution/ansible/llm-benchmark.yml, automation/test-execution/ansible/llm-benchmark-auto.yml, automation/test-execution/ansible/roles/vllm_server/tasks/start-llm.yml
Removed collection of vLLM configuration fields (dtype, KV cache size, max model length, caching mode) from playbook metadata and DUT retrieval steps. Replaced with effective-value driven approach using effective_kv_cache_space and effective_max_model_len; model-matrix.yaml lookup removed from start-llm.yml.
CPU Allocation Logic Refactoring
automation/test-execution/ansible/filter_plugins/cpu_utils.py, automation/test-execution/ansible/roles/common/tasks/allocate-cores-from-count.yml, automation/test-execution/ansible/roles/common/tasks/detect-numa-topology.yml
Improved NUMA node handling: normalized node IDs and core counts to numeric types, simplified tensor-parallel value computation, switched to sorted node selection by capacity, and consolidated core allocation branching. Backward-compatibility logic refactored to use numeric node IDs with direct lookups. Consolidated tensor-parallel conditional logic into single inline computation.
HF\_TOKEN Logging Protection
automation/test-execution/ansible/roles/vllm_server/tasks/start-embedding.yml, automation/test-execution/ansible/roles/vllm_server/tasks/start-llm.yml
Added no_log: true to tasks handling environment variables and container operations to suppress sensitive HF_TOKEN from appearing in Ansible logs.
Container Log Handling
automation/test-execution/ansible/roles/benchmark_guidellm/tasks/main.yml
Refactored container log collection: removed intermediate registration/copy steps and now stream logs directly to unique per-test_run_id temporary files on remote host before fetching. Changed container wait failure handling to treat any non-zero RC as failure. Renamed guidellm_wait_timeout to guidellm_wait_timeout_seconds.
Model Matrix Updates
models/llm-models/model-matrix.yaml, models/embedding-models/model-matrix.yaml
Removed legacy OPT models (opt-125m, opt-1.3b) from LLM model matrix (64 lines deleted). Added dimensions and max_sequence_length fields to embedding model definitions (granite-embedding-english-r2, granite-embedding-278m-multilingual).
Documentation Updates
README.md, automation/test-execution/ansible/README.md, automation/test-execution/ansible/ansible.md, docs/methodology/overview.md, models/models.md
Updated model coverage documentation to reflect 6 LLM models (down from 8). Simplified preflight checklist in ansible.md, added "Verify SSH and Network Access" section, updated example model to Llama-3.2-1B. Removed vLLM configuration metadata from metadata features list. Updated model participation tables and KV cache references.
Test Infrastructure
automation/test-execution/ansible/tests/unit/test_cpu_utils.py
Added comprehensive unit test module with 625 lines covering CPU utilities: tests for CPU range compression, NUMA topology extraction, core allocation across various topology scenarios, tensor-parallel selection, and error handling. Includes synthetic NUMA topology helper and fallback pytest implementation.
Ansible Requirements & Configuration
automation/test-execution/ansible/requirements.yml, collections/requirements.yml
Updated development requirements file with clarifying comments and usage guidance. Added new canonical collections/requirements.yml with exact pinned versions (containers.podman@1.9.4, ansible.posix@1.5.4) for AWX/production deployments.
Core Count Validation
automation/test-execution/ansible/llm-benchmark-concurrent-load.yml
Enhanced concurrency input validation: added assertions for requested_cores and core_sweep_counts provision, computed effective_requested_cores from available inputs, and tightened base_workload validation by removing short_codegen. Simplified core passing by always using effective_requested_cores instead of conditional core_sweep_counts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check — ❓ Inconclusive: The title refers to model matrix validation, but the changeset encompasses extensive Ansible playbook refactoring, documentation updates, test utilities, and infrastructure changes beyond just fixing model matrix validation errors. Resolution: consider a more comprehensive title, such as "Refactor playbooks for AWX support and remove legacy OPT models", to better represent the full scope of work.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 95.00%, which is sufficient. The required threshold is 80.00%.


@maryamtahhan maryamtahhan mentioned this pull request Mar 31, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
models/models.md (1)

128-138: ⚠️ Potential issue | 🟡 Minor

Model count mismatch in documentation.

The header states "Current LLM Models (9 total)" but the table only lists 7 models:

  1. Llama-3.2-1B-Instruct
  2. Llama-3.2-3B-Instruct
  3. TinyLlama-1.1B-Chat
  4. granite-3.2-2b-instruct
  5. Qwen/Qwen3-0.6B
  6. Qwen/Qwen2.5-3B-Instruct
  7. openai/gpt-oss-20b

Update the header to reflect the actual count.

📝 Proposed fix
-## Current LLM Models (9 total)
+## Current LLM Models (7 total)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/models.md` around lines 128 - 138, The header "Current LLM Models (9
total)" is incorrect relative to the table rows; update that header to "Current
LLM Models (7 total)" so it matches the listed models (Llama-3.2-1B-Instruct,
Llama-3.2-3B-Instruct, TinyLlama-1.1B-Chat, granite-3.2-2b-instruct,
Qwen/Qwen3-0.6B, Qwen/Qwen2.5-3B-Instruct, openai/gpt-oss-20b) by editing the
header text in the models.md file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@automation/test-execution/ansible/llm-benchmark-auto.yml`:
- Around line 144-146: The display logic in allocate-cores-from-count.yml relies
on checking "requested_tensor_parallel is not defined" to decide between
"(auto-calculated)" and "(user-specified)", but llm-benchmark-auto.yml sets
requested_tensor_parallel (and related requested_omp_* variables) using
default(None), which means the variable is always defined and causes the message
to incorrectly show "(user-specified)". Fix by restoring the original omission
behavior: change the defaults for requested_tensor_parallel,
requested_omp_threads, and requested_omp_bind back to default(omit) in the
llm-benchmark-auto.yml diff so the "is not defined" check in
allocate-cores-from-count.yml works as intended (alternatively, update the
display logic to test for "is not none" if you prefer defined-but-None
semantics).

In `@automation/test-execution/ansible/roles/benchmark_guidellm/tasks/main.yml`:
- Around line 208-214: The wait task "Wait for GuideLLM benchmark to complete
(containerized)" currently aborts the play on podman wait errors; change it to
be non-fatal by removing/setting failed_when: false (keep changed_when: false)
while still registering guidellm_exit_code_container (using timeout {{
guidellm_wait_timeout_seconds }} podman wait guidellm-{{ workload_type }}-{{
core_cfg.name }} and when: use_guidellm_container | bool); then update the
unified failure check after the diagnostic/cleanup steps to evaluate
guidellm_exit_code_container.rc and treat nonzero as a failure only once
logs/inspect/fetch/state: absent have run.
- Around line 239-257: The "Stream GuideLLM container logs directly to file"
task builds an unsafe /tmp filename using test_run_id
(command-injection/arbitrary-write) and leaves files behind; replace this by
creating a managed temp file with ansible.builtin.tempfile (register e.g.
guidellm_log_file), update the shell task to redirect podman logs into
guidellm_log_file.path instead of the interpolated /tmp path, update the "Fetch
GuideLLM logs to controller" ansible.builtin.fetch src to use
guidellm_log_file.path, and add a cleanup task to remove the temp file on the
remote host after fetching to avoid disk leaks.

In `@automation/test-execution/ansible/tests/unit/test_cpu_utils.py`:
- Around line 20-37: The mock pytest.raises currently returns True when no
exception is raised, causing false positives; update the pytest.raises.__exit__
to raise an AssertionError (or appropriate test failure) when exc_type is None
so it fails if no exception occurred, while still suppressing the expected
exception when issubclass(exc_type, self.exc) returns True and letting
unexpected exceptions propagate; modify the __exit__ logic in the pytest.raises
class to: if exc_type is None: raise AssertionError(f"Did not raise
{self.exc}"), elif issubclass(exc_type, self.exc): return True, else return
False.

In `@docs/methodology/overview.md`:
- Line 174: The documentation line that reads "Tests 6 LLM models + 2 embedding
models" is out of sync with the test matrix which lists seven LLMs; update that
text to "Tests 7 LLM models + 2 embedding models" (or otherwise change the
sentence to reflect 7 LLM models) so it matches
models/llm-models/model-matrix.yaml and the listed model names (e.g.,
meta-llama/Llama-3.2-1B-Instruct, meta-llama/Llama-3.2-3B-Instruct,
TinyLlama/TinyLlama-1.1B-Chat-v1.0, ibm-granite/granite-3.2-2b-instruct,
Qwen/Qwen3-0.6B, Qwen/Qwen2.5-3B-Instruct, openai/gpt-oss-20b).

In `@README.md`:
- Line 161: The LLM count line "6 LLM models + 2 embedding models" is
inconsistent with the listed models because gpt-oss-20b is included; update that
summary to "7 LLM models + 2 embedding models" (or remove gpt-oss-20b from the
list if the intent is to keep six) by editing the summary string that contains
"6 LLM models + 2 embedding models" and ensure it matches the presence of
"gpt-oss-20b" in the model list.
- Around line 181-186: The bulleted LLM list and header "LLM Models (6 total):"
are inconsistent with models/llm-models/model-matrix.yaml which includes
gpt-oss-20b; either add a new bullet "gpt-oss-20b - Large open-source (20B)" to
the list and change the header to "LLM Models (7 total):", or keep the existing
bullets but update the header and any summary count (the summary near the "Large
model support" section) to "7" so the README and model-matrix.yaml match; update
the "LLM Models" header and the bulleted list (the lines containing the listed
model names) accordingly.

---

Outside diff comments:
In `@models/models.md`:
- Around line 128-138: The header "Current LLM Models (9 total)" is incorrect
relative to the table rows; update that header to "Current LLM Models (7 total)"
so it matches the listed models (Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct,
TinyLlama-1.1B-Chat, granite-3.2-2b-instruct, Qwen/Qwen3-0.6B,
Qwen/Qwen2.5-3B-Instruct, openai/gpt-oss-20b) by editing the header text in the
models.md file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5bdc56bb-5f27-46f5-89c3-67029da1c3ca

📥 Commits

Reviewing files that changed from the base of the PR and between b4126fb and 338acdd.

📒 Files selected for processing (20)
  • README.md
  • automation/test-execution/ansible/README.md
  • automation/test-execution/ansible/ansible.md
  • automation/test-execution/ansible/filter_plugins/cpu_utils.py
  • automation/test-execution/ansible/llm-benchmark-auto.yml
  • automation/test-execution/ansible/llm-benchmark-concurrent-load.yml
  • automation/test-execution/ansible/llm-benchmark.yml
  • automation/test-execution/ansible/llm-core-sweep-auto.yml
  • automation/test-execution/ansible/requirements.yml
  • automation/test-execution/ansible/roles/benchmark_guidellm/tasks/main.yml
  • automation/test-execution/ansible/roles/common/tasks/allocate-cores-from-count.yml
  • automation/test-execution/ansible/roles/common/tasks/detect-numa-topology.yml
  • automation/test-execution/ansible/roles/vllm_server/tasks/start-embedding.yml
  • automation/test-execution/ansible/roles/vllm_server/tasks/start-llm.yml
  • automation/test-execution/ansible/tests/unit/test_cpu_utils.py
  • collections/requirements.yml
  • docs/methodology/overview.md
  • models/embedding-models/model-matrix.yaml
  • models/llm-models/model-matrix.yaml
  • models/models.md
💤 Files with no reviewable changes (2)
  • automation/test-execution/ansible/README.md
  • models/llm-models/model-matrix.yaml

Comment on lines +144 to +146
requested_tensor_parallel: "{{ hostvars['localhost']['requested_tensor_parallel'] | default(None) }}"
requested_omp_threads: "{{ hostvars['localhost']['requested_omp_threads'] | default(None) }}"
requested_omp_bind: "{{ hostvars['localhost']['requested_omp_bind'] | default(None) }}"


⚠️ Potential issue | 🟡 Minor

Display message may incorrectly indicate "user-specified" tensor parallel.

Changing from default(omit) to default(None) affects the display logic at allocate-cores-from-count.yml line 65, which uses requested_tensor_parallel is not defined to show "(auto-calculated)" vs "(user-specified)". With default(None), the variable is always defined, so the message will incorrectly show "(user-specified)" even when the user didn't provide a value.

This is a minor display issue that doesn't affect functionality, but may confuse users reviewing benchmark output.

Suggested fix in allocate-cores-from-count.yml line 65
-          - "Tensor Parallel: {{ core_allocation_result.tensor_parallel }}{{ ' (auto-calculated)' if requested_tensor_parallel is not defined else ' (user-specified)' }}"
+          - "Tensor Parallel: {{ core_allocation_result.tensor_parallel }}{{ ' (auto-calculated)' if (requested_tensor_parallel is not defined or requested_tensor_parallel == None or requested_tensor_parallel == '') else ' (user-specified)' }}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/ansible/llm-benchmark-auto.yml` around lines 144 -
146, The display logic in allocate-cores-from-count.yml relies on checking
"requested_tensor_parallel is not defined" to decide between "(auto-calculated)"
and "(user-specified)", but llm-benchmark-auto.yml sets
requested_tensor_parallel (and related requested_omp_* variables) using
default(None), which means the variable is always defined and causes the message
to incorrectly show "(user-specified)". Fix by restoring the original omission
behavior: change the defaults for requested_tensor_parallel,
requested_omp_threads, and requested_omp_bind back to default(omit) in the
llm-benchmark-auto.yml diff so the "is not defined" check in
allocate-cores-from-count.yml works as intended (alternatively, update the
display logic to test for "is not none" if you prefer defined-but-None
semantics).

Comment on lines 208 to 214
  - name: Wait for GuideLLM benchmark to complete (containerized)
    ansible.builtin.command:
-     cmd: "timeout {{ guidellm_wait_timeout_seconds | default(guidellm_wait_timeout) }} podman wait guidellm-{{ workload_type }}-{{ core_cfg.name }}"
+     cmd: "timeout {{ guidellm_wait_timeout_seconds }} podman wait guidellm-{{ workload_type }}-{{ core_cfg.name }}"
    register: guidellm_exit_code_container
    changed_when: false
-   failed_when: guidellm_exit_code_container.rc == 124
+   failed_when: guidellm_exit_code_container.rc != 0
    when: use_guidellm_container | bool


⚠️ Potential issue | 🟠 Major

Don’t fail the play before the diagnostic and cleanup tasks run.

Line 213 makes any podman wait error terminal, so the later inspect, log capture, fetch, and state: absent tasks never execute on the cases you most need them. Keep this step non-fatal, then fold guidellm_exit_code_container.rc into the unified failure path after cleanup finishes.

🩹 Minimal change
   register: guidellm_exit_code_container
   changed_when: false
-  failed_when: guidellm_exit_code_container.rc != 0
+  failed_when: false

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  - name: Wait for GuideLLM benchmark to complete (containerized)
    ansible.builtin.command:
      cmd: "timeout {{ guidellm_wait_timeout_seconds }} podman wait guidellm-{{ workload_type }}-{{ core_cfg.name }}"
    register: guidellm_exit_code_container
    changed_when: false
-   failed_when: guidellm_exit_code_container.rc != 0
+   failed_when: false
    when: use_guidellm_container | bool
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/ansible/roles/benchmark_guidellm/tasks/main.yml`
around lines 208 - 214, The wait task "Wait for GuideLLM benchmark to complete
(containerized)" currently aborts the play on podman wait errors; change it to
be non-fatal by removing/setting failed_when: false (keep changed_when: false)
while still registering guidellm_exit_code_container (using timeout {{
guidellm_wait_timeout_seconds }} podman wait guidellm-{{ workload_type }}-{{
core_cfg.name }} and when: use_guidellm_container | bool); then update the
unified failure check after the diagnostic/cleanup steps to evaluate
guidellm_exit_code_container.rc and treat nonzero as a failure only once
logs/inspect/fetch/state: absent have run.

Comment on lines +239 to 257
  - name: Stream GuideLLM container logs directly to file
    ansible.builtin.shell:
-     cmd: "podman logs {{ ('guidellm-' ~ workload_type ~ '-' ~ core_cfg.name) | quote }} 2>&1"
+     cmd: "podman logs {{ ('guidellm-' ~ workload_type ~ '-' ~ core_cfg.name) | quote }} > /tmp/guidellm-{{ workload_type }}-{{ core_cfg.name }}-{{ test_run_id }}.log 2>&1"
    args:
      executable: /bin/bash
    register: guidellm_full_logs
    changed_when: false
    failed_when: false
    when:
      - use_guidellm_container | bool
      - container_exit_code.stdout is defined

- - name: Save GuideLLM logs to temporary file on remote
-   ansible.builtin.copy:
-     content: "{{ guidellm_full_logs.stdout }}"
-     dest: "/tmp/guidellm-{{ workload_type }}-{{ core_cfg.name }}.log"
-     mode: "0644"
-   when:
-     - use_guidellm_container | bool
-     - guidellm_full_logs is defined
-     - guidellm_full_logs.stdout is defined

  - name: Fetch GuideLLM logs to controller
    ansible.builtin.fetch:
-     src: "/tmp/guidellm-{{ workload_type }}-{{ core_cfg.name }}.log"
+     src: "/tmp/guidellm-{{ workload_type }}-{{ core_cfg.name }}-{{ test_run_id }}.log"
      dest: "{{ results_path }}/guidellm.log"
      flat: true
    when:
      - use_guidellm_container | bool
      - guidellm_full_logs is defined
      - guidellm_full_logs.stdout is defined



⚠️ Potential issue | 🔴 Critical

Use a managed temp file here instead of building /tmp/...{{ test_run_id }}.log by hand.

test_run_id is caller-overridable in automation/test-execution/ansible/llm-benchmark-auto.yml, then passed into this role from automation/test-execution/ansible/llm-benchmark.yml. Interpolating it into an unquoted root shell redirection path creates a command-injection/arbitrary-write risk, and the file is never removed after fetch, so repeated runs will leak disk on the DUT/AWX worker.

🔐 Safer pattern
-    cmd: "podman logs {{ ('guidellm-' ~ workload_type ~ '-' ~ core_cfg.name) | quote }} > /tmp/guidellm-{{ workload_type }}-{{ core_cfg.name }}-{{ test_run_id }}.log 2>&1"
+    cmd: >-
+      podman logs {{ ('guidellm-' ~ workload_type ~ '-' ~ core_cfg.name) | quote }}
+      > {{ guidellm_log_file.path | quote }} 2>&1
...
-    src: "/tmp/guidellm-{{ workload_type }}-{{ core_cfg.name }}-{{ test_run_id }}.log"
+    src: "{{ guidellm_log_file.path }}"
- name: Create GuideLLM temp log file
  ansible.builtin.tempfile:
    state: file
    prefix: "guidellm-"
    suffix: ".log"
  register: guidellm_log_file

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/ansible/roles/benchmark_guidellm/tasks/main.yml`
around lines 239 - 257, The "Stream GuideLLM container logs directly to file"
task builds an unsafe /tmp filename using test_run_id
(command-injection/arbitrary-write) and leaves files behind; replace this by
creating a managed temp file with ansible.builtin.tempfile (register e.g.
guidellm_log_file), update the shell task to redirect podman logs into
guidellm_log_file.path instead of the interpolated /tmp path, update the "Fetch
GuideLLM logs to controller" ansible.builtin.fetch src to use
guidellm_log_file.path, and add a cleanup task to remove the temp file on the
remote host after fetching to avoid disk leaks.

Comment on lines +20 to +37
# Mock pytest.raises for fallback tests
class pytest:

class raises:

def __init__(self, exc):
self.exc = exc

def __enter__(self):
return self

def __exit__(self, exc_type, exc_value, traceback):
# Check if exception matches expected type
if exc_type is not None and issubclass(exc_type, self.exc):
# Suppress the exception by returning True
return True
# Let unexpected exceptions propagate
return False


⚠️ Potential issue | 🟡 Minor

Mock pytest.raises doesn't fail when no exception is raised.

The mock silently passes when the expected exception isn't thrown. Real pytest.raises fails if no exception occurs. While the fallback smoke tests (lines 596-624) avoid exception-testing methods, this could cause false passes if someone extends them.

Proposed fix
     class pytest:

         class raises:

             def __init__(self, exc):
                 self.exc = exc
+                self.value = None

             def __enter__(self):
                 return self

             def __exit__(self, exc_type, exc_value, traceback):
-                # Check if exception matches expected type
-                if exc_type is not None and issubclass(exc_type, self.exc):
-                    # Suppress the exception by returning True
+                if exc_type is None:
+                    raise AssertionError(
+                        f"Expected {self.exc.__name__} but no exception was raised"
+                    )
+                if issubclass(exc_type, self.exc):
+                    self.value = exc_value
                     return True
-                # Let unexpected exceptions propagate
                 return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/ansible/tests/unit/test_cpu_utils.py` around lines
20 - 37, The mock pytest.raises currently returns True when no exception is
raised, causing false positives; update the pytest.raises.__exit__ to raise an
AssertionError (or appropriate test failure) when exc_type is None so it fails
if no exception occurred, while still suppressing the expected exception when
issubclass(exc_type, self.exc) returns True and letting unexpected exceptions
propagate; modify the __exit__ logic in the pytest.raises class to: if exc_type
is None: raise AssertionError(f"Did not raise {self.exc}"), elif
issubclass(exc_type, self.exc): return True, else return False.
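The corrected mock can be exercised standalone. This minimal sketch applies the proposed fix and checks both behaviors: an expected exception is captured and suppressed, and the absence of an exception now fails instead of silently passing:

```python
# Fallback pytest.raises mock with the proposed fix applied:
# __exit__ raises AssertionError when no exception occurred.
class pytest:

    class raises:

        def __init__(self, exc):
            self.exc = exc
            self.value = None  # captured exception instance

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            if exc_type is None:
                raise AssertionError(
                    f"Expected {self.exc.__name__} but no exception was raised"
                )
            if issubclass(exc_type, self.exc):
                self.value = exc_value
                return True  # suppress the expected exception
            return False  # let unexpected exceptions propagate


# Expected exception: suppressed and captured on .value
with pytest.raises(ValueError) as ctx:
    raise ValueError("bad input")
assert isinstance(ctx.value, ValueError)

# No exception: the block now fails with AssertionError
failed = False
try:
    with pytest.raises(ValueError):
        pass
except AssertionError:
    failed = True
assert failed
```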

**Goal:** Measure P95 latency and throughput scaling under concurrent load

- Tests 8 LLM models + 2 embedding models
- Tests 6 LLM models + 2 embedding models

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify actual LLM model count in model-matrix.yaml
echo "=== LLM models in model-matrix.yaml ==="
rg -n "full_name:" models/llm-models/model-matrix.yaml | head -20

echo ""
echo "=== Model counts mentioned in docs ==="
rg -n "LLM models" --type=md -C1

Repository: redhat-et/vllm-cpu-perf-eval

Length of output: 1552


Update model count: documentation claims 6 LLM models but model-matrix.yaml contains 7.

The file states "Tests 6 LLM models" (matching README.md), but models/llm-models/model-matrix.yaml lists 7 LLM models:

  • meta-llama/Llama-3.2-1B-Instruct
  • meta-llama/Llama-3.2-3B-Instruct
  • TinyLlama/TinyLlama-1.1B-Chat-v1.0
  • ibm-granite/granite-3.2-2b-instruct
  • Qwen/Qwen3-0.6B
  • Qwen/Qwen2.5-3B-Instruct
  • openai/gpt-oss-20b

Update the count to 7 to match the actual models in the test matrix.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/methodology/overview.md` at line 174, The documentation line that reads
"Tests 6 LLM models + 2 embedding models" is out of sync with the test matrix
which lists seven LLMs; update that text to "Tests 7 LLM models + 2 embedding
models" (or otherwise change the sentence to reflect 7 LLM models) so it matches
models/llm-models/model-matrix.yaml and the listed model names (e.g.,
meta-llama/Llama-3.2-1B-Instruct, meta-llama/Llama-3.2-3B-Instruct,
TinyLlama/TinyLlama-1.1B-Chat-v1.0, ibm-granite/granite-3.2-2b-instruct,
Qwen/Qwen3-0.6B, Qwen/Qwen2.5-3B-Instruct, openai/gpt-oss-20b).


- Concurrency levels: 1, 2, 4, 8, 16, 32
- 8 LLM models + 2 embedding models
- 6 LLM models + 2 embedding models

⚠️ Potential issue | 🟡 Minor

Count appears inconsistent with model list.

The "6 LLM models" count at line 161 should match the actual models listed. If gpt-oss-20b is included (as referenced at line 130), this should be 7 LLM models.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 161, The LLM count line "6 LLM models + 2 embedding
models" is inconsistent with the listed models because gpt-oss-20b is included;
update that summary to "7 LLM models + 2 embedding models" (or remove
gpt-oss-20b from the list if the intent is to keep six) by editing the summary
string that contains "6 LLM models + 2 embedding models" and ensure it matches
the presence of "gpt-oss-20b" in the model list.

Comment on lines +181 to 186
**LLM Models (6 total):**

- Llama-3.2 (1B, 3B) - Prefill-heavy
- TinyLlama-1.1B - Balanced small-scale
- OPT (125M, 1.3B) - Decode-heavy legacy baseline
- Granite-3.2-2B - Balanced enterprise
- Qwen3-0.6B, Qwen2.5-3B - High-efficiency balanced

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if gpt-oss-20b is referenced elsewhere in README
echo "=== gpt-oss-20b references in README ==="
rg -n "gpt-oss" README.md

echo ""
echo "=== All LLM models in model-matrix.yaml ==="
rg "full_name:" models/llm-models/model-matrix.yaml

Repository: redhat-et/vllm-cpu-perf-eval

Length of output: 574


Add gpt-oss-20b to the LLM models list or correct the count.

The README mentions gpt-oss-20b in the "Large model support" feature (line 130) and includes it in models/llm-models/model-matrix.yaml, but it's missing from the model list at lines 183-186. The header claims "(6 total)" but the model matrix contains 7 models. Either include gpt-oss-20b in the bulleted list or update the count and summary at line 161 to reflect the actual number.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 181 - 186, The bulleted LLM list and header "LLM
Models (6 total):" are inconsistent with models/llm-models/model-matrix.yaml
which includes gpt-oss-20b; either add a new bullet "gpt-oss-20b - Large
open-source (20B)" to the list and change the header to "LLM Models (7 total):",
or keep the existing bullets but update the header and any summary count (the
summary near the "Large model support" section) to "7" so the README and
model-matrix.yaml match; update the "LLM Models" header and the bulleted list
(the lines containing the listed model names) accordingly.
