
Fix/model matrix validation errors #87

Open
maryamtahhan wants to merge 8 commits into redhat-et:main from maryamtahhan:fix/model-matrix-validation-errors

Conversation

@maryamtahhan
Collaborator

@maryamtahhan maryamtahhan commented Mar 31, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for AWX runtime environments with automatic result path handling
    • Enhanced embedding model metadata with dimensions and maximum sequence length specifications
  • Bug Fixes

    • Improved container execution reliability for benchmark tests
  • Documentation

    • Updated test suite model coverage (now includes 6 LLM models instead of 8)
    • Enhanced deployment and network configuration guidance
    • Clarified platform setup requirements for production-accurate measurements

maryamtahhan and others added 8 commits March 31, 2026 09:11
…rdening

This commit improves the existing Ansible playbook infrastructure for vLLM
CPU performance evaluation with enhanced AWX compatibility, security hardening,
and comprehensive testing.

- Fix type normalization to handle AnsibleUnsafeText from AWX
- Fix allocated_nodes to return integers instead of strings
- Handle empty strings and Jinja2 None conversions properly
- Simplify node eligibility checking and allocation logic
- Improve error messages for better validation feedback
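The AnsibleUnsafeText and Jinja2-None fixes above amount to coercing whatever AWX passes in into plain integers. A minimal Python sketch of that kind of normalization (`normalize_int` is a hypothetical name, not the plugin's actual API):

```python
def normalize_int(value, default=None):
    """Coerce AWX/Jinja2 values into plain ints (illustrative helper).

    AWX wraps extra-vars in AnsibleUnsafeText (a str subclass), and Jinja2
    can serialize an undefined value as '' or the literal string 'None'.
    Treat those as "not provided"; parse everything else as an integer.
    """
    if value is None:
        return default
    s = str(value).strip()          # AnsibleUnsafeText is a str subclass
    if s in ("", "None", "none"):   # empty string / Jinja2 None conversions
        return default
    try:
        return int(s)
    except ValueError:
        raise ValueError(f"expected an integer, got {value!r}")
```

Returning real `int`s here is what lets downstream logic (e.g. `allocated_nodes`) stop mixing string and integer types.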

- Add no_log: true to all tasks handling HF_TOKEN
- Prevent token exposure in container start operations
- Secure environment variable handling in AWX jobs

- Add comprehensive unit tests for cpu_utils filter plugin (598 lines)
- Test coverage for: CPU range conversion, NUMA extraction, multi-NUMA
  allocation, OMP binding, and real-world scenarios
- Support for both pytest and standalone execution
- Add collections/requirements.yml for Ansible collection dependencies
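One behavior those tests exercise, CPU range conversion, can be illustrated with a small standalone function (an assumed implementation for illustration, not the `cpu_utils` code itself): collapsing a list of CPU ids into the cpuset-style range string that tools like podman and numactl accept.

```python
def compress_cpu_list(cpus):
    """Collapse CPU ids into a cpuset-style string, e.g. [0,1,2,3,8,9] -> "0-3,8-9"."""
    ranges = []
    for cpu in sorted(cpus):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu            # extend the current contiguous run
        else:
            ranges.append([cpu, cpu])      # start a new run
    return ",".join(f"{a}-{b}" if a != b else f"{a}" for a, b in ranges)
```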

- Better parameter validation for AWX jobs in concurrent load testing
- AWX detection for result path handling
- Improved NUMA topology detection in core sweep
- Enhanced result path consistency in main benchmark
- Better workload configuration handling
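The AWX result-path handling boils down to an environment-variable check; a Python mirror of the playbook logic (illustrative only, assuming the `../../../results/llm` default and `$HOME/benchmark-results` AWX path described elsewhere in this PR):

```python
import os

def local_results_base(playbook_dir):
    """Pick the results directory the way the playbooks do (sketch).

    AWX sets AWX_JOB_ID in the job environment; on an AWX execution node the
    project checkout is ephemeral, so results land under $HOME instead of
    the repository tree.
    """
    is_awx_job = bool(os.environ.get("AWX_JOB_ID"))
    if is_awx_job:
        return os.path.join(os.environ.get("HOME", "/root"), "benchmark-results")
    return os.path.normpath(
        os.path.join(playbook_dir, "..", "..", "..", "results", "llm")
    )
```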

- Simplify prerequisites section
- Update examples with current best practices
- Clearer workflow documentation

Files changed: 13 files, 772 insertions(+), 323 deletions(-)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Remove undefined 'short_codegen' workload from validation lists to prevent
KeyError during test execution. Add CPU allocation validation to detect
under-allocation early with clear error messages. Fix pytest.raises fallback
to properly suppress exceptions when pytest is unavailable. Pin exact collection
versions for reproducible AWX/CI runs.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Fix inconsistent output filename reference in ansible.md (benchmarks.html → benchmarks.json)
- Improve requested_cores validation to handle non-numeric input safely
- Document requirements files usage (AWX/production vs CLI/development)
- Fix hardcoded log_collection_dest to use centralized local_results_base
- Remove obsolete vLLM configuration fields from test metadata docs
- Improve node capacity validation with sorting and per-node checks in cpu_utils.py
- Refactor duplicate AWX_JOB_ID lookup in llm-core-sweep-auto.yml
- Add NUMA node id type coercion for consistent integer types
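The sorted, per-node capacity checking can be sketched in a few lines (illustrative, not the plugin's exact algorithm): sort nodes by free capacity, take cores greedily, and fail early with a clear message if the topology cannot satisfy the request.

```python
def pick_nodes(numa_nodes, cores_needed):
    """Allocate cores from NUMA nodes, largest free capacity first (sketch)."""
    chosen, remaining = [], cores_needed
    for node in sorted(numa_nodes, key=lambda n: len(n["cpus"]), reverse=True):
        if remaining <= 0:
            break
        take = min(remaining, len(node["cpus"]))
        chosen.append((node["id"], node["cpus"][:take]))
        remaining -= take
    if remaining > 0:
        raise ValueError(
            f"only {cores_needed - remaining} of {cores_needed} requested cores available"
        )
    return chosen
```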

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
The previous commit changed NUMA node ids from strings to integers,
but the selectattr filters were still trying to match them as strings,
causing "No first item, sequence was empty" errors.

Remove the | string filter from selectattr comparisons to match
integer ids correctly.
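The Python equivalent of that mismatch: once node ids are integers, a string-valued comparison (what `selectattr` with a `| string` filter was effectively doing) matches nothing, while the plain integer comparison works.

```python
# Synthetic NUMA node facts with integer ids (as after the previous commit).
nodes = [{"id": 0, "cpus": "0-15"}, {"id": 1, "cpus": "16-31"}]

# String comparison finds no match -> "No first item, sequence was empty"
assert [n for n in nodes if n["id"] == "0"] == []

# Integer comparison matches as intended
assert [n for n in nodes if n["id"] == 0][0]["cpus"] == "0-15"
```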

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
- Fix allocate_with_fixed_tp to filter nodes by capacity (cpu_utils.py)
- Update Results summary to use local_results_base (llm-benchmark-auto.yml)
- Implement core_sweep_counts parameter handling (llm-benchmark-concurrent-load.yml)
  - Normalize core_sweep_counts to requested_cores for single-element lists
  - Reject multi-element lists with helpful error directing to llm-core-sweep-auto.yml
  - Pass effective_requested_cores to all 3 phases
- Restore model configuration facts in start-llm.yml for metadata collection
  - Set model_kv_cache, model_dtype, model_max_len, vllm_caching_mode
  - Fixes compatibility with get-vllm-config-from-dut.yml assertions
- Add regression tests for serialized TP inputs (test_cpu_utils.py)
  - Test empty string '' behaves like auto-TP
  - Test string 'None' behaves like auto-TP
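In plain Python, the `core_sweep_counts` normalization described above might look like this (a sketch with assumed names, not the playbook's Jinja2 expressions):

```python
def effective_requested_cores(requested_cores=None, core_sweep_counts=None):
    """Resolve the core count for a concurrent-load run (illustrative sketch).

    A single-element core_sweep_counts is treated as requested_cores;
    multi-element sweeps are rejected with a pointer to the sweep playbook.
    """
    if core_sweep_counts:
        if len(core_sweep_counts) > 1:
            raise ValueError(
                "core_sweep_counts with multiple values is not supported here; "
                "use llm-core-sweep-auto.yml for core sweeps"
            )
        return int(core_sweep_counts[0])
    if requested_cores is None:
        raise ValueError("either requested_cores or core_sweep_counts is required")
    return int(requested_cores)
```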

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…ainer start

Addresses two critical issues in vLLM configuration:

1. Per-model overrides (model_kv_cache_space, model_dtype) now properly
   override workload defaults instead of being ignored
2. Added preflight validation of --max-model-len against workload
   requirements (ISL + OSL) to fail early with clear error messages

Both fixes ensure configuration errors are caught early and per-model
settings take precedence over workload fallbacks.
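The preflight validation reduces to a single comparison between the context window and the workload's prompt plus completion lengths; a hedged Python sketch (function and parameter names are illustrative):

```python
def validate_max_model_len(max_model_len, isl, osl):
    """Preflight check: the context window must hold input + output tokens."""
    required = isl + osl
    if max_model_len < required:
        raise ValueError(
            f"--max-model-len {max_model_len} is smaller than the workload's "
            f"ISL+OSL requirement ({isl}+{osl}={required})"
        )
```

This is the same incompatibility that later motivated dropping the OPT models: a 2048-token context cannot serve a 4096-token summarization workload.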

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
This commit fixes two configuration bugs found by smoke tests:

Bug #1: Remove OPT models (opt-125m, opt-1.3b)
- Context length (2048) incompatible with summarization workload (4096)
- Legacy models causing validation failures
- Reduced model count from 8 to 6 LLM models

Changes:
- Removed opt-125m and opt-1.3b from model-matrix.yaml
- Updated README.md: 8 → 6 LLM models
- Updated docs/methodology/overview.md: 8 → 6 LLM models
- Removed OPT Family section from models/models.md
- Removed all OPT entries from model tables and test scenarios
- Removed "Decode-Heavy Models" section (was OPT-only)

Bug #2: Add missing embedding model fields
- granite-embedding-english-r2: added dimensions=384, max_sequence_length=512
- granite-embedding-278m-multilingual: added dimensions=768, max_sequence_length=512

These fixes ensure all model configurations pass validation tests.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

coderabbitai bot commented Mar 31, 2026

📝 Walkthrough

Walkthrough

This pull request updates Ansible playbooks to detect AWX runtime environments and dynamically configure result paths, removes legacy OPT models from the model matrix, refactors CPU core allocation logic with improved NUMA handling, eliminates vLLM configuration metadata collection from playbooks, and adds comprehensive unit test coverage for CPU utilities.

Changes

Cohort / File(s) Summary
AWX Environment Detection & Dynamic Results Paths
automation/test-execution/ansible/llm-benchmark.yml, automation/test-execution/ansible/llm-benchmark-auto.yml, automation/test-execution/ansible/llm-benchmark-concurrent-load.yml, automation/test-execution/ansible/llm-core-sweep-auto.yml
Added is_awx_job fact based on AWX_JOB_ID environment variable and introduced local_results_base to dynamically select results directory ($HOME/benchmark-results for AWX, playbook_dir/../../../results/llm otherwise). Updated all downstream path construction and output messages to use the dynamic base path.
vLLM Configuration Metadata Removal
automation/test-execution/ansible/llm-benchmark.yml, automation/test-execution/ansible/llm-benchmark-auto.yml, automation/test-execution/ansible/roles/vllm_server/tasks/start-llm.yml
Removed collection of vLLM configuration fields (dtype, KV cache size, max model length, caching mode) from playbook metadata and DUT retrieval steps. Replaced with effective-value driven approach using effective_kv_cache_space and effective_max_model_len; model-matrix.yaml lookup removed from start-llm.yml.
CPU Allocation Logic Refactoring
automation/test-execution/ansible/filter_plugins/cpu_utils.py, automation/test-execution/ansible/roles/common/tasks/allocate-cores-from-count.yml, automation/test-execution/ansible/roles/common/tasks/detect-numa-topology.yml
Improved NUMA node handling: normalized node IDs and core counts to numeric types, simplified tensor-parallel value computation, switched to sorted node selection by capacity, and consolidated core allocation branching. Backward-compatibility logic refactored to use numeric node IDs with direct lookups. Consolidated tensor-parallel conditional logic into single inline computation.
HF\_TOKEN Logging Protection
automation/test-execution/ansible/roles/vllm_server/tasks/start-embedding.yml, automation/test-execution/ansible/roles/vllm_server/tasks/start-llm.yml
Added no_log: true to tasks handling environment variables and container operations to suppress sensitive HF_TOKEN from appearing in Ansible logs.
Container Log Handling
automation/test-execution/ansible/roles/benchmark_guidellm/tasks/main.yml
Refactored container log collection: removed intermediate registration/copy steps and now stream logs directly to unique per-test_run_id temporary files on remote host before fetching. Changed container wait failure handling to treat any non-zero RC as failure. Renamed guidellm_wait_timeout to guidellm_wait_timeout_seconds.
Model Matrix Updates
models/llm-models/model-matrix.yaml, models/embedding-models/model-matrix.yaml
Removed legacy OPT models (opt-125m, opt-1.3b) from LLM model matrix (64 lines deleted). Added dimensions and max_sequence_length fields to embedding model definitions (granite-embedding-english-r2, granite-embedding-278m-multilingual).
Documentation Updates
README.md, automation/test-execution/ansible/README.md, automation/test-execution/ansible/ansible.md, docs/methodology/overview.md, models/models.md
Updated model coverage documentation to reflect 6 LLM models (down from 8). Simplified preflight checklist in ansible.md, added "Verify SSH and Network Access" section, updated example model to Llama-3.2-1B. Removed vLLM configuration metadata from metadata features list. Updated model participation tables and KV cache references.
Test Infrastructure
automation/test-execution/ansible/tests/unit/test_cpu_utils.py
Added comprehensive unit test module with 625 lines covering CPU utilities: tests for CPU range compression, NUMA topology extraction, core allocation across various topology scenarios, tensor-parallel selection, and error handling. Includes synthetic NUMA topology helper and fallback pytest implementation.
Ansible Requirements & Configuration
automation/test-execution/ansible/requirements.yml, collections/requirements.yml
Updated development requirements file with clarifying comments and usage guidance. Added new canonical collections/requirements.yml with exact pinned versions (containers.podman@1.9.4, ansible.posix@1.5.4) for AWX/production deployments.
Core Count Validation
automation/test-execution/ansible/llm-benchmark-concurrent-load.yml
Enhanced concurrency input validation: added assertions for requested_cores and core_sweep_counts provision, computed effective_requested_cores from available inputs, and tightened base_workload validation by removing short_codegen. Simplified core passing by always using effective_requested_cores instead of conditional core_sweep_counts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check — ❓ Inconclusive: The title refers to model matrix validation, but the changeset encompasses extensive Ansible playbook refactoring, documentation updates, test utilities, and infrastructure changes beyond just fixing model matrix validation errors. Resolution: consider a more comprehensive title, such as "Refactor playbooks for AWX support and remove legacy OPT models", to better represent the full scope of work.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 95.00%, which is sufficient. The required threshold is 80.00%.


@maryamtahhan maryamtahhan mentioned this pull request Mar 31, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
models/models.md (1)

128-138: ⚠️ Potential issue | 🟡 Minor

Model count mismatch in documentation.

The header states "Current LLM Models (9 total)" but the table only lists 7 models:

  1. Llama-3.2-1B-Instruct
  2. Llama-3.2-3B-Instruct
  3. TinyLlama-1.1B-Chat
  4. granite-3.2-2b-instruct
  5. Qwen/Qwen3-0.6B
  6. Qwen/Qwen2.5-3B-Instruct
  7. openai/gpt-oss-20b

Update the header to reflect the actual count.

📝 Proposed fix
-## Current LLM Models (9 total)
+## Current LLM Models (7 total)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/models.md` around lines 128 - 138, The header "Current LLM Models (9
total)" is incorrect relative to the table rows; update that header to "Current
LLM Models (7 total)" so it matches the listed models (Llama-3.2-1B-Instruct,
Llama-3.2-3B-Instruct, TinyLlama-1.1B-Chat, granite-3.2-2b-instruct,
Qwen/Qwen3-0.6B, Qwen/Qwen2.5-3B-Instruct, openai/gpt-oss-20b) by editing the
header text in the models.md file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@automation/test-execution/ansible/llm-benchmark-auto.yml`:
- Around line 144-146: The display logic in allocate-cores-from-count.yml relies
on checking "requested_tensor_parallel is not defined" to decide between
"(auto-calculated)" and "(user-specified)", but llm-benchmark-auto.yml sets
requested_tensor_parallel (and related requested_omp_* variables) using
default(None), which means the variable is always defined and causes the message
to incorrectly show "(user-specified)". Fix by restoring the original omission
behavior: change the defaults for requested_tensor_parallel,
requested_omp_threads, and requested_omp_bind back to default(omit) in the
llm-benchmark-auto.yml diff so the "is not defined" check in
allocate-cores-from-count.yml works as intended (alternatively, update the
display logic to test for "is not none" if you prefer defined-but-None
semantics).

In `@automation/test-execution/ansible/roles/benchmark_guidellm/tasks/main.yml`:
- Around line 208-214: The wait task "Wait for GuideLLM benchmark to complete
(containerized)" currently aborts the play on podman wait errors; change it to
be non-fatal by removing/setting failed_when: false (keep changed_when: false)
while still registering guidellm_exit_code_container (using timeout {{
guidellm_wait_timeout_seconds }} podman wait guidellm-{{ workload_type }}-{{
core_cfg.name }} and when: use_guidellm_container | bool); then update the
unified failure check after the diagnostic/cleanup steps to evaluate
guidellm_exit_code_container.rc and treat nonzero as a failure only once
logs/inspect/fetch/state: absent have run.
- Around line 239-257: The "Stream GuideLLM container logs directly to file"
task builds an unsafe /tmp filename using test_run_id
(command-injection/arbitrary-write) and leaves files behind; replace this by
creating a managed temp file with ansible.builtin.tempfile (register e.g.
guidellm_log_file), update the shell task to redirect podman logs into
guidellm_log_file.path instead of the interpolated /tmp path, update the "Fetch
GuideLLM logs to controller" ansible.builtin.fetch src to use
guidellm_log_file.path, and add a cleanup task to remove the temp file on the
remote host after fetching to avoid disk leaks.

In `@automation/test-execution/ansible/tests/unit/test_cpu_utils.py`:
- Around line 20-37: The mock pytest.raises currently returns True when no
exception is raised, causing false positives; update the pytest.raises.__exit__
to raise an AssertionError (or appropriate test failure) when exc_type is None
so it fails if no exception occurred, while still suppressing the expected
exception when issubclass(exc_type, self.exc) returns True and letting
unexpected exceptions propagate; modify the __exit__ logic in the pytest.raises
class to: if exc_type is None: raise AssertionError(f"Did not raise
{self.exc}"), elif issubclass(exc_type, self.exc): return True, else return
False.

In `@docs/methodology/overview.md`:
- Line 174: The documentation line that reads "Tests 6 LLM models + 2 embedding
models" is out of sync with the test matrix which lists seven LLMs; update that
text to "Tests 7 LLM models + 2 embedding models" (or otherwise change the
sentence to reflect 7 LLM models) so it matches
models/llm-models/model-matrix.yaml and the listed model names (e.g.,
meta-llama/Llama-3.2-1B-Instruct, meta-llama/Llama-3.2-3B-Instruct,
TinyLlama/TinyLlama-1.1B-Chat-v1.0, ibm-granite/granite-3.2-2b-instruct,
Qwen/Qwen3-0.6B, Qwen/Qwen2.5-3B-Instruct, openai/gpt-oss-20b).

In `@README.md`:
- Line 161: The LLM count line "6 LLM models + 2 embedding models" is
inconsistent with the listed models because gpt-oss-20b is included; update that
summary to "7 LLM models + 2 embedding models" (or remove gpt-oss-20b from the
list if the intent is to keep six) by editing the summary string that contains
"6 LLM models + 2 embedding models" and ensure it matches the presence of
"gpt-oss-20b" in the model list.
- Around line 181-186: The bulleted LLM list and header "LLM Models (6 total):"
are inconsistent with models/llm-models/model-matrix.yaml which includes
gpt-oss-20b; either add a new bullet "gpt-oss-20b - Large open-source (20B)" to
the list and change the header to "LLM Models (7 total):", or keep the existing
bullets but update the header and any summary count (the summary near the "Large
model support" section) to "7" so the README and model-matrix.yaml match; update
the "LLM Models" header and the bulleted list (the lines containing the listed
model names) accordingly.

---

Outside diff comments:
In `@models/models.md`:
- Around line 128-138: The header "Current LLM Models (9 total)" is incorrect
relative to the table rows; update that header to "Current LLM Models (7 total)"
so it matches the listed models (Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct,
TinyLlama-1.1B-Chat, granite-3.2-2b-instruct, Qwen/Qwen3-0.6B,
Qwen/Qwen2.5-3B-Instruct, openai/gpt-oss-20b) by editing the header text in the
models.md file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5bdc56bb-5f27-46f5-89c3-67029da1c3ca

📥 Commits

Reviewing files that changed from the base of the PR and between b4126fb and 338acdd.

📒 Files selected for processing (20)
  • README.md
  • automation/test-execution/ansible/README.md
  • automation/test-execution/ansible/ansible.md
  • automation/test-execution/ansible/filter_plugins/cpu_utils.py
  • automation/test-execution/ansible/llm-benchmark-auto.yml
  • automation/test-execution/ansible/llm-benchmark-concurrent-load.yml
  • automation/test-execution/ansible/llm-benchmark.yml
  • automation/test-execution/ansible/llm-core-sweep-auto.yml
  • automation/test-execution/ansible/requirements.yml
  • automation/test-execution/ansible/roles/benchmark_guidellm/tasks/main.yml
  • automation/test-execution/ansible/roles/common/tasks/allocate-cores-from-count.yml
  • automation/test-execution/ansible/roles/common/tasks/detect-numa-topology.yml
  • automation/test-execution/ansible/roles/vllm_server/tasks/start-embedding.yml
  • automation/test-execution/ansible/roles/vllm_server/tasks/start-llm.yml
  • automation/test-execution/ansible/tests/unit/test_cpu_utils.py
  • collections/requirements.yml
  • docs/methodology/overview.md
  • models/embedding-models/model-matrix.yaml
  • models/llm-models/model-matrix.yaml
  • models/models.md
💤 Files with no reviewable changes (2)
  • automation/test-execution/ansible/README.md
  • models/llm-models/model-matrix.yaml

Comment on lines +144 to +146
requested_tensor_parallel: "{{ hostvars['localhost']['requested_tensor_parallel'] | default(None) }}"
requested_omp_threads: "{{ hostvars['localhost']['requested_omp_threads'] | default(None) }}"
requested_omp_bind: "{{ hostvars['localhost']['requested_omp_bind'] | default(None) }}"


⚠️ Potential issue | 🟡 Minor

Display message may incorrectly indicate "user-specified" tensor parallel.

Changing from default(omit) to default(None) affects the display logic at allocate-cores-from-count.yml line 65, which uses requested_tensor_parallel is not defined to show "(auto-calculated)" vs "(user-specified)". With default(None), the variable is always defined, so the message will incorrectly show "(user-specified)" even when the user didn't provide a value.

This is a minor display issue that doesn't affect functionality, but may confuse users reviewing benchmark output.

Suggested fix in allocate-cores-from-count.yml line 65
-          - "Tensor Parallel: {{ core_allocation_result.tensor_parallel }}{{ ' (auto-calculated)' if requested_tensor_parallel is not defined else ' (user-specified)' }}"
+          - "Tensor Parallel: {{ core_allocation_result.tensor_parallel }}{{ ' (auto-calculated)' if (requested_tensor_parallel is not defined or requested_tensor_parallel == None or requested_tensor_parallel == '') else ' (user-specified)' }}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/ansible/llm-benchmark-auto.yml` around lines 144 -
146, The display logic in allocate-cores-from-count.yml relies on checking
"requested_tensor_parallel is not defined" to decide between "(auto-calculated)"
and "(user-specified)", but llm-benchmark-auto.yml sets
requested_tensor_parallel (and related requested_omp_* variables) using
default(None), which means the variable is always defined and causes the message
to incorrectly show "(user-specified)". Fix by restoring the original omission
behavior: change the defaults for requested_tensor_parallel,
requested_omp_threads, and requested_omp_bind back to default(omit) in the
llm-benchmark-auto.yml diff so the "is not defined" check in
allocate-cores-from-count.yml works as intended (alternatively, update the
display logic to test for "is not none" if you prefer defined-but-None
semantics).

Comment on lines 208 to 214
  - name: Wait for GuideLLM benchmark to complete (containerized)
    ansible.builtin.command:
-     cmd: "timeout {{ guidellm_wait_timeout_seconds | default(guidellm_wait_timeout) }} podman wait guidellm-{{ workload_type }}-{{ core_cfg.name }}"
+     cmd: "timeout {{ guidellm_wait_timeout_seconds }} podman wait guidellm-{{ workload_type }}-{{ core_cfg.name }}"
    register: guidellm_exit_code_container
    changed_when: false
-   failed_when: guidellm_exit_code_container.rc == 124
+   failed_when: guidellm_exit_code_container.rc != 0
    when: use_guidellm_container | bool


⚠️ Potential issue | 🟠 Major

Don’t fail the play before the diagnostic and cleanup tasks run.

Line 213 makes any podman wait error terminal, so the later inspect, log capture, fetch, and state: absent tasks never execute on the cases you most need them. Keep this step non-fatal, then fold guidellm_exit_code_container.rc into the unified failure path after cleanup finishes.

🩹 Minimal change
   register: guidellm_exit_code_container
   changed_when: false
-  failed_when: guidellm_exit_code_container.rc != 0
+  failed_when: false

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  - name: Wait for GuideLLM benchmark to complete (containerized)
    ansible.builtin.command:
      cmd: "timeout {{ guidellm_wait_timeout_seconds }} podman wait guidellm-{{ workload_type }}-{{ core_cfg.name }}"
    register: guidellm_exit_code_container
    changed_when: false
-   failed_when: guidellm_exit_code_container.rc != 0
+   failed_when: false
    when: use_guidellm_container | bool
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/ansible/roles/benchmark_guidellm/tasks/main.yml`
around lines 208 - 214, The wait task "Wait for GuideLLM benchmark to complete
(containerized)" currently aborts the play on podman wait errors; change it to
be non-fatal by removing/setting failed_when: false (keep changed_when: false)
while still registering guidellm_exit_code_container (using timeout {{
guidellm_wait_timeout_seconds }} podman wait guidellm-{{ workload_type }}-{{
core_cfg.name }} and when: use_guidellm_container | bool); then update the
unified failure check after the diagnostic/cleanup steps to evaluate
guidellm_exit_code_container.rc and treat nonzero as a failure only once
logs/inspect/fetch/state: absent have run.

Comment on lines +239 to 257
  - name: Stream GuideLLM container logs directly to file
    ansible.builtin.shell:
-     cmd: "podman logs {{ ('guidellm-' ~ workload_type ~ '-' ~ core_cfg.name) | quote }} 2>&1"
+     cmd: "podman logs {{ ('guidellm-' ~ workload_type ~ '-' ~ core_cfg.name) | quote }} > /tmp/guidellm-{{ workload_type }}-{{ core_cfg.name }}-{{ test_run_id }}.log 2>&1"
    args:
      executable: /bin/bash
    register: guidellm_full_logs
    changed_when: false
    failed_when: false
    when:
      - use_guidellm_container | bool
      - container_exit_code.stdout is defined

- - name: Save GuideLLM logs to temporary file on remote
-   ansible.builtin.copy:
-     content: "{{ guidellm_full_logs.stdout }}"
-     dest: "/tmp/guidellm-{{ workload_type }}-{{ core_cfg.name }}.log"
-     mode: "0644"
-   when:
-     - use_guidellm_container | bool
-     - guidellm_full_logs is defined
-     - guidellm_full_logs.stdout is defined

  - name: Fetch GuideLLM logs to controller
    ansible.builtin.fetch:
-     src: "/tmp/guidellm-{{ workload_type }}-{{ core_cfg.name }}.log"
+     src: "/tmp/guidellm-{{ workload_type }}-{{ core_cfg.name }}-{{ test_run_id }}.log"
      dest: "{{ results_path }}/guidellm.log"
      flat: true
    when:
      - use_guidellm_container | bool
      - guidellm_full_logs is defined
      - guidellm_full_logs.stdout is defined



⚠️ Potential issue | 🔴 Critical

Use a managed temp file here instead of building /tmp/...{{ test_run_id }}.log by hand.

test_run_id is caller-overridable in automation/test-execution/ansible/llm-benchmark-auto.yml, then passed into this role from automation/test-execution/ansible/llm-benchmark.yml. Interpolating it into an unquoted root shell redirection path creates a command-injection/arbitrary-write risk, and the file is never removed after fetch, so repeated runs will leak disk on the DUT/AWX worker.

🔐 Safer pattern
-    cmd: "podman logs {{ ('guidellm-' ~ workload_type ~ '-' ~ core_cfg.name) | quote }} > /tmp/guidellm-{{ workload_type }}-{{ core_cfg.name }}-{{ test_run_id }}.log 2>&1"
+    cmd: >-
+      podman logs {{ ('guidellm-' ~ workload_type ~ '-' ~ core_cfg.name) | quote }}
+      > {{ guidellm_log_file.path | quote }} 2>&1
...
-    src: "/tmp/guidellm-{{ workload_type }}-{{ core_cfg.name }}-{{ test_run_id }}.log"
+    src: "{{ guidellm_log_file.path }}"
- name: Create GuideLLM temp log file
  ansible.builtin.tempfile:
    state: file
    prefix: "guidellm-"
    suffix: ".log"
  register: guidellm_log_file

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/ansible/roles/benchmark_guidellm/tasks/main.yml`
around lines 239 - 257, The "Stream GuideLLM container logs directly to file"
task builds an unsafe /tmp filename using test_run_id
(command-injection/arbitrary-write) and leaves files behind; replace this by
creating a managed temp file with ansible.builtin.tempfile (register e.g.
guidellm_log_file), update the shell task to redirect podman logs into
guidellm_log_file.path instead of the interpolated /tmp path, update the "Fetch
GuideLLM logs to controller" ansible.builtin.fetch src to use
guidellm_log_file.path, and add a cleanup task to remove the temp file on the
remote host after fetching to avoid disk leaks.

Comment on lines +20 to +37
# Mock pytest.raises for fallback tests
class pytest:

class raises:

def __init__(self, exc):
self.exc = exc

def __enter__(self):
return self

def __exit__(self, exc_type, exc_value, traceback):
# Check if exception matches expected type
if exc_type is not None and issubclass(exc_type, self.exc):
# Suppress the exception by returning True
return True
# Let unexpected exceptions propagate
return False


⚠️ Potential issue | 🟡 Minor

Mock pytest.raises doesn't fail when no exception is raised.

The mock silently passes when the expected exception isn't thrown. Real pytest.raises fails if no exception occurs. While the fallback smoke tests (lines 596-624) avoid exception-testing methods, this could cause false passes if someone extends them.

Proposed fix
     class pytest:

         class raises:

             def __init__(self, exc):
                 self.exc = exc
+                self.value = None

             def __enter__(self):
                 return self

             def __exit__(self, exc_type, exc_value, traceback):
-                # Check if exception matches expected type
-                if exc_type is not None and issubclass(exc_type, self.exc):
-                    # Suppress the exception by returning True
+                if exc_type is None:
+                    raise AssertionError(
+                        f"Expected {self.exc.__name__} but no exception was raised"
+                    )
+                if issubclass(exc_type, self.exc):
+                    self.value = exc_value
                     return True
-                # Let unexpected exceptions propagate
                 return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/ansible/tests/unit/test_cpu_utils.py` around lines
20 - 37, The mock pytest.raises currently returns True when no exception is
raised, causing false positives; update the pytest.raises.__exit__ to raise an
AssertionError (or appropriate test failure) when exc_type is None so it fails
if no exception occurred, while still suppressing the expected exception when
issubclass(exc_type, self.exc) returns True and letting unexpected exceptions
propagate; modify the __exit__ logic in the pytest.raises class to: if exc_type
is None: raise AssertionError(f"Did not raise {self.exc}"), elif
issubclass(exc_type, self.exc): return True, else return False.
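The corrected mock can be exercised standalone. This minimal sketch applies the proposed fix and checks both behaviors: an expected exception is captured and suppressed, and the absence of an exception now fails instead of silently passing:

```python
# Fallback pytest.raises mock with the proposed fix applied:
# __exit__ raises AssertionError when no exception occurred.
class pytest:

    class raises:

        def __init__(self, exc):
            self.exc = exc
            self.value = None  # captured exception instance

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            if exc_type is None:
                raise AssertionError(
                    f"Expected {self.exc.__name__} but no exception was raised"
                )
            if issubclass(exc_type, self.exc):
                self.value = exc_value
                return True  # suppress the expected exception
            return False  # let unexpected exceptions propagate


# Expected exception: suppressed and captured on .value
with pytest.raises(ValueError) as ctx:
    raise ValueError("bad input")
assert isinstance(ctx.value, ValueError)

# No exception: the block now fails with AssertionError
failed = False
try:
    with pytest.raises(ValueError):
        pass
except AssertionError:
    failed = True
assert failed
```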

**Goal:** Measure P95 latency and throughput scaling under concurrent load

- Tests 8 LLM models + 2 embedding models
- Tests 6 LLM models + 2 embedding models

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify actual LLM model count in model-matrix.yaml
echo "=== LLM models in model-matrix.yaml ==="
rg -n "full_name:" models/llm-models/model-matrix.yaml | head -20

echo ""
echo "=== Model counts mentioned in docs ==="
rg -n "LLM models" --type=md -C1

Repository: redhat-et/vllm-cpu-perf-eval

Length of output: 1552


Update model count: documentation claims 6 LLM models but model-matrix.yaml contains 7.

The file states "Tests 6 LLM models" (matching README.md), but models/llm-models/model-matrix.yaml lists 7 LLM models:

  • meta-llama/Llama-3.2-1B-Instruct
  • meta-llama/Llama-3.2-3B-Instruct
  • TinyLlama/TinyLlama-1.1B-Chat-v1.0
  • ibm-granite/granite-3.2-2b-instruct
  • Qwen/Qwen3-0.6B
  • Qwen/Qwen2.5-3B-Instruct
  • openai/gpt-oss-20b

Update the count to 7 to match the actual models in the test matrix.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/methodology/overview.md` at line 174, The documentation line that reads
"Tests 6 LLM models + 2 embedding models" is out of sync with the test matrix
which lists seven LLMs; update that text to "Tests 7 LLM models + 2 embedding
models" (or otherwise change the sentence to reflect 7 LLM models) so it matches
models/llm-models/model-matrix.yaml and the listed model names (e.g.,
meta-llama/Llama-3.2-1B-Instruct, meta-llama/Llama-3.2-3B-Instruct,
TinyLlama/TinyLlama-1.1B-Chat-v1.0, ibm-granite/granite-3.2-2b-instruct,
Qwen/Qwen3-0.6B, Qwen/Qwen2.5-3B-Instruct, openai/gpt-oss-20b).


- Concurrency levels: 1, 2, 4, 8, 16, 32
- 8 LLM models + 2 embedding models
- 6 LLM models + 2 embedding models

⚠️ Potential issue | 🟡 Minor

Count appears inconsistent with model list.

The "6 LLM models" count at line 161 should match the actual models listed. If gpt-oss-20b is included (as referenced at line 130), this should be 7 LLM models.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 161, The LLM count line "6 LLM models + 2 embedding
models" is inconsistent with the listed models because gpt-oss-20b is included;
update that summary to "7 LLM models + 2 embedding models" (or remove
gpt-oss-20b from the list if the intent is to keep six) by editing the summary
string that contains "6 LLM models + 2 embedding models" and ensure it matches
the presence of "gpt-oss-20b" in the model list.

Comment on lines +181 to 186
**LLM Models (6 total):**

- Llama-3.2 (1B, 3B) - Prefill-heavy
- TinyLlama-1.1B - Balanced small-scale
- OPT (125M, 1.3B) - Decode-heavy legacy baseline
- Granite-3.2-2B - Balanced enterprise
- Qwen3-0.6B, Qwen2.5-3B - High-efficiency balanced

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if gpt-oss-20b is referenced elsewhere in README
echo "=== gpt-oss-20b references in README ==="
rg -n "gpt-oss" README.md

echo ""
echo "=== All LLM models in model-matrix.yaml ==="
rg "full_name:" models/llm-models/model-matrix.yaml

Repository: redhat-et/vllm-cpu-perf-eval

Length of output: 574


Add gpt-oss-20b to the LLM models list or correct the count.

The README mentions gpt-oss-20b in the "Large model support" feature (line 130) and includes it in models/llm-models/model-matrix.yaml, but it's missing from the model list at lines 183-186. The header claims "(6 total)" but the model matrix contains 7 models. Either include gpt-oss-20b in the bulleted list or update the count and summary at line 161 to reflect the actual number.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 181 - 186, The bulleted LLM list and header "LLM
Models (6 total):" are inconsistent with models/llm-models/model-matrix.yaml
which includes gpt-oss-20b; either add a new bullet "gpt-oss-20b - Large
open-source (20B)" to the list and change the header to "LLM Models (7 total):",
or keep the existing bullets but update the header and any summary count (the
summary near the "Large model support" section) to "7" so the README and
model-matrix.yaml match; update the "LLM Models" header and the bulleted list
(the lines containing the listed model names) accordingly.
