
Container image env var#94

Merged
maryamtahhan merged 4 commits into redhat-et:main from maryamtahhan:container-image-env-var
Apr 10, 2026

Conversation

@maryamtahhan
Collaborator

@maryamtahhan maryamtahhan commented Apr 3, 2026

Summary by CodeRabbit

  • New Features

    • Add environment variables to override container images; runtime now reports which image is used.
  • Documentation

    • Setup guide updated with the new environment variable options for customizing container images.
  • Chores

    • Test infra accepts dynamic workload definitions with clearer validation messages; server startup flow adjusted to skip LLM server for embedding workloads.

@maryamtahhan maryamtahhan requested a review from jharriga April 3, 2026 16:00
@coderabbitai

coderabbitai Bot commented Apr 3, 2026

📝 Walkthrough


Adds environment-variable driven container image overrides (VLLM_CONTAINER_IMAGE, GUIDELLM_CONTAINER_IMAGE, VLLM_BENCH_CONTAINER_IMAGE), updates Ansible group vars to use those env lookups with fallbacks, and replaces fixed workload allowlists with dynamic validation against test_configs.keys(); also expands a debug output to show the GuideLLM image.
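The override pattern described above can be sketched as a group_vars entry. The variable name VLLM_CONTAINER_IMAGE comes from this PR; the surrounding key structure and the default image shown here are assumptions for illustration:

```yaml
# Sketch of the env-driven override (file layout and key nesting assumed).
# lookup('env', ...) returns '' when the variable is unset; passing true as
# the second argument to default() makes the fallback apply in that case too.
vllm_server:
  container_image: "{{ lookup('env', 'VLLM_CONTAINER_IMAGE') | default('docker.io/vllm/vllm-openai-cpu:v0.18.0', true) }}"
```

With this in place, `export VLLM_CONTAINER_IMAGE=...` before running the playbook overrides the default without editing any config file.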

Changes

  • Docs (automation/test-execution/ansible/ansible.md, docs/getting-started.md): Documented/exported the new env vars VLLM_CONTAINER_IMAGE and GUIDELLM_CONTAINER_IMAGE (with defaults).
  • Inventory / Container images (automation/test-execution/ansible/inventory/group_vars/all/infrastructure.yml, automation/test-execution/ansible/inventory/group_vars/all/benchmark-tools.yml): Replaced hardcoded container image strings with Jinja2 `lookup('env', ...)` expressions that fall back to the previous defaults.
  • Playbook: concurrent load (automation/test-execution/ansible/llm-benchmark-concurrent-load.yml): Changed base_workload validation to accept any key in test_configs.keys() and updated the failure message to list supported workloads dynamically.
  • Role: vllm_server (automation/test-execution/ansible/roles/vllm_server/tasks/main.yml, .../start-llm.yml): Replaced the static workload allowlist with dynamic test_configs.keys() checks (excluding embedding), updated fail messages, and adjusted includes so the LLM server starts for non-embedding workloads.
  • Role: benchmark_guidellm (automation/test-execution/ansible/roles/benchmark_guidellm/tasks/main.yml): Extended debug output to display guidellm_cfg.container_image when using container mode; shows "N/A (using host guidellm)" otherwise.
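The dynamic validation described for the playbook and vllm_server role can be sketched as an assert task; the task name and exact fail message here are illustrative, not the PR's literal code:

```yaml
# Sketch: accept any workload defined under test_configs instead of a
# hardcoded allowlist, and list the valid options in the failure message.
- name: Validate workload type
  ansible.builtin.assert:
    that:
      - workload_type in test_configs.keys()
    fail_msg: >-
      Invalid workload_type '{{ workload_type }}'.
      Must be one of: {{ test_configs.keys() | list | sort | join(', ') }}
```

Because the allowed set is derived from test_configs itself, adding a workload to test-workloads.yml makes it valid everywhere without touching the playbook.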

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed — check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed — the title 'Container image env var' directly refers to the main change: adding environment variable support for container images used in test execution.
  • Docstring Coverage: ✅ Passed — no functions found in the changed files to evaluate docstring coverage; skipping docstring coverage check.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
automation/test-execution/ansible/inventory/group_vars/all/benchmark-tools.yml (1)

18-19: Consider using a versioned tag instead of latest for reproducibility.

Using the latest tag can lead to non-reproducible benchmark results if the image is updated between runs. The role's internal default at line 46 uses v0.5.3, creating an inconsistency.

Suggested change for consistency
    # Using GuideLLM official container image
    # Can be overridden with environment variable: export GUIDELLM_CONTAINER_IMAGE=...
-   container_image: "{{ lookup('env', 'GUIDELLM_CONTAINER_IMAGE') | default('ghcr.io/vllm-project/guidellm:latest', true) }}"
+   container_image: "{{ lookup('env', 'GUIDELLM_CONTAINER_IMAGE') | default('ghcr.io/vllm-project/guidellm:v0.5.3', true) }}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@automation/test-execution/ansible/inventory/group_vars/all/benchmark-tools.yml`
around lines 18 - 19, The container_image variable currently defaults to the
unpinned 'ghcr.io/vllm-project/guidellm:latest', which harms reproducibility;
update the default in the container_image definition (while keeping the
GUIDELLM_CONTAINER_IMAGE env lookup override) to use the versioned tag used by
the role (e.g., 'ghcr.io/vllm-project/guidellm:v0.5.3') so container_image and
the role default are consistent and benchmark runs are reproducible.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@automation/test-execution/ansible/llm-benchmark-concurrent-load.yml`:
- Around line 94-97: The validation currently allows any key from test_configs
(via base_workload in test_configs.keys()) which lets invalid types through;
change it to only accept true base workloads (e.g., restrict to ['chat','code'])
by replacing the generic membership test with an explicit allowed list or by
deriving allowed_base_workloads = ['chat','code'] and checking base_workload
against that; also update the fail_msg to list those allowed base workloads and
ensure the later variable-workload logic that references 'chat_var' and
'code_var' remains consistent with this restriction.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 684522b9-f263-44b4-b2e8-f6a2147363b8

📥 Commits

Reviewing files that changed from the base of the PR and between c991528 and a995583.

📒 Files selected for processing (6)
  • automation/test-execution/ansible/ansible.md
  • automation/test-execution/ansible/inventory/group_vars/all/benchmark-tools.yml
  • automation/test-execution/ansible/inventory/group_vars/all/infrastructure.yml
  • automation/test-execution/ansible/llm-benchmark-concurrent-load.yml
  • automation/test-execution/ansible/roles/benchmark_guidellm/tasks/main.yml
  • docs/getting-started.md

@jharriga
Collaborator

jharriga commented Apr 3, 2026

I reviewed this PR in two steps:

  1. "-e $IMAGE" support
     Using this syntax I had success:
     for VLLM_IMAGE in "${image_array[@]}"; do
       -e "VLLM_CONTAINER_IMAGE={'image': '${VLLM_IMAGE}'}"

  2. New workload_type support
     I added a new workload_type to: automation/test-execution/ansible/inventory/group_vars/all/test-workloads.yml

THEN use of this syntax resulted in failure:
llm-benchmark-concurrent-load.yml \
  -e "base_workload=chat_lite" \

TASK [vllm_server : Validate workload type] ************************************
fatal: [vllm-server]: FAILED! => {
    "assertion": "workload_type in ['summarization', 'chat', 'code', 'rag', 'embedding', 'chat_var', 'code_var']",
    "changed": false,
    "evaluated_to": false,
    "msg": "Invalid workload_type 'chat_lite'. Must be one of: summarization, chat, code, rag, embedding, chat_var, code_var"
}

NOTE that I made no edits to vllm-cpu-perf-eval/automation/test-execution/ansible/llm-benchmark-concurrent-load.yml:94

@maryamtahhan
Collaborator Author

Hi John, I think I saw something similar when I just added the workload to the end of the file. But it needs to be in the test_configs section. Was your new definition in that section?
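In other words, because the new validation iterates test_configs.keys(), a custom workload only passes if it is nested under test_configs rather than appended at the top level of test-workloads.yml. A sketch, with assumed field names:

```yaml
# test-workloads.yml (sketch; the field names are assumptions)
test_configs:
  chat_lite:                     # must be nested here to pass validation
    description: "Lightweight chat workload"

# A definition appended at the top level would not be seen by the check:
# chat_lite: {...}   # not in test_configs.keys(), so validation fails
```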

@jharriga
Collaborator

jharriga commented Apr 6, 2026 via email

@jharriga
Collaborator

jharriga commented Apr 7, 2026

Retested using the changes implemented in
[https://github.com//pull/94/commits/770fc59d6e6d7f0dbdff8a854598d89900922d27]

SUCCESS. I was able to add a new workload_type 'chat_lite' and the test ran successfully.
Thank you!

@jharriga
Collaborator

jharriga commented Apr 7, 2026

Hold on, I'm seeing an issue with the "-e $IMAGE" support.
While the Playbook is not complaining about the syntax it does appear that the ENVvar is being ignored.
No matter which container I provide the tests are always running on the DUT with the default container (docker.io/vllm/vllm-openai-cpu:v0.18.0).

Script syntax:
image_array=("public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.18.0")
for VLLM_IMAGE in "${image_array[@]}"; do
  ansible-playbook -i inventory/hosts.yml \
    llm-benchmark-concurrent-load.yml \
    -e "VLLM_CONTAINER_IMAGE={'image': '${VLLM_IMAGE}'}"
done

Ansible Controller
laptop$ ps -efl | grep concurrent
0 S jharriga 151978 151975 3 80 0 - 112578 hrtime 14:29 pts/0 00:00:05 /usr/bin/python3 -P /usr/bin/ansible-playbook -i inventory/hosts.yml llm-benchmark-concurrent-load.yml -e test_model=Qwen/Qwen3-0.6B -e base_workload=chat -e requested_cores=16 -e skip_phase_2=true -e skip_phase_3=true -e skip_prometheus_export=true -e guidellm_max_seconds=600 -e guidellm_rate=[1,4,8] -e VLLM_CONTAINER_IMAGE={'image': 'public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.12.0'}

DUT$ podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
34e78c015506 docker.io/vllm/vllm-openai-cpu:v0.18.0 --model Qwen/Qwen... 4 minutes ago Up 4 minutes vllm-server

@jharriga
Collaborator

jharriga commented Apr 8, 2026

This PR looks good.
Both the image ENVvar and 'workload_type' specifications are now working.
I was incorrectly passing the ENVvar using this syntax:
-e "VLLM_CONTAINER_IMAGE={'image': '${VLLM_IMAGE}'}"

I changed my script to use "export", and now the test runs use the container image designated by the ENVvar.

for VLLM_IMAGE in "${image_array[@]}"; do
export VLLM_CONTAINER_IMAGE="${VLLM_IMAGE}"

maryamtahhan and others added 4 commits April 10, 2026 11:43
Add support for configuring container images via environment variables:
- VLLM_CONTAINER_IMAGE: vLLM server image
- GUIDELLM_CONTAINER_IMAGE: GuideLLM benchmark tool image
- VLLM_BENCH_CONTAINER_IMAGE: vLLM bench tool image

All variables include sensible defaults matching current configuration,
allowing users to easily override images without editing config files.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add container image display to GuideLLM configuration output to match
the vLLM server display, making it easier to verify which image is
being used during test execution.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Replace hardcoded workload type validation with dynamic check against
test_configs.keys(), matching the approach used in llm-benchmark-auto.yml.

This allows users to add custom workloads to test-workloads.yml and
automatically use them in concurrent load testing without modifying
the playbook validation logic.

Changes:
- base_workload validation now checks test_configs.keys()
- variable workload check now dynamic instead of hardcoded list
- Updated documentation to reflect workload flexibility

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Update workload type validation in vllm_server role to dynamically
check against test_configs.keys() instead of hardcoded workload list.
This allows users to add custom workloads to test-workloads.yml
without modifying role code.

Changes:
- main.yml: Validate against test_configs.keys()
- start-llm.yml: Validate non-embedding workloads dynamically
- Improved error messages to show available workloads

Fixes issue where new workloads in test-workloads.yml were rejected
by hardcoded validation, even after concurrent-load playbook was
made dynamic.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
@maryamtahhan maryamtahhan force-pushed the container-image-env-var branch from 770fc59 to 6841dfe on April 10, 2026 10:45

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
automation/test-execution/ansible/roles/vllm_server/tasks/start-llm.yml (1)

24-61: Move workload validation before test_configs[workload_type] access.

Line 26 reads test_configs[workload_type] before the assert at Lines 55-60. If this task file is called directly with an invalid workload_type, execution fails early with a dict-key error and skips your clearer validation message.

Proposed reordering
-- name: Get workload and core configuration
-  ansible.builtin.set_fact:
-    workload_cfg: "{{ test_configs[workload_type] }}"
-    core_cfg: "{{ core_configuration }}"
-    container_cfg: "{{ container_runtime }}"
-
 # ============================================================================
 # Caching Mode Configuration
 # ============================================================================
@@
 # ============================================================================
 # Workload Validation
 # ============================================================================
 
 - name: Validate workload type
   ansible.builtin.assert:
     that:
       - workload_type in test_configs.keys()
       - workload_type != 'embedding'
     fail_msg: "Invalid workload_type: {{ workload_type }}. Must be a non-embedding workload from: {{ test_configs.keys() | list | select('ne', 'embedding') | sort | join(', ') }}"
+
+- name: Get workload and core configuration
+  ansible.builtin.set_fact:
+    workload_cfg: "{{ test_configs[workload_type] }}"
+    core_cfg: "{{ core_configuration }}"
+    container_cfg: "{{ container_runtime }}"
As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automation/test-execution/ansible/roles/vllm_server/tasks/start-llm.yml`
around lines 24 - 61, The workload validation currently runs after the task that
sets workload_cfg using test_configs[workload_type], causing a dict-key error
for invalid inputs; move the "Validate workload type" ansible.builtin.assert
task (the block that checks workload_type in test_configs.keys() and !=
'embedding') so it appears before the "Get workload and core configuration"
ansible.builtin.set_fact (which sets workload_cfg: "{{
test_configs[workload_type] }}"), ensuring the assert runs first and prevents
accessing test_configs with an invalid key.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5b227a5c-4c65-4688-9b9c-dd4444bc0673

📥 Commits

Reviewing files that changed from the base of the PR and between 770fc59 and 6841dfe.

📒 Files selected for processing (8)
  • automation/test-execution/ansible/ansible.md
  • automation/test-execution/ansible/inventory/group_vars/all/benchmark-tools.yml
  • automation/test-execution/ansible/inventory/group_vars/all/infrastructure.yml
  • automation/test-execution/ansible/llm-benchmark-concurrent-load.yml
  • automation/test-execution/ansible/roles/benchmark_guidellm/tasks/main.yml
  • automation/test-execution/ansible/roles/vllm_server/tasks/main.yml
  • automation/test-execution/ansible/roles/vllm_server/tasks/start-llm.yml
  • docs/getting-started.md
✅ Files skipped from review due to trivial changes (3)
  • automation/test-execution/ansible/roles/benchmark_guidellm/tasks/main.yml
  • automation/test-execution/ansible/ansible.md
  • automation/test-execution/ansible/inventory/group_vars/all/infrastructure.yml
🚧 Files skipped from review as they are similar to previous changes (4)
  • automation/test-execution/ansible/llm-benchmark-concurrent-load.yml
  • automation/test-execution/ansible/roles/vllm_server/tasks/main.yml
  • automation/test-execution/ansible/inventory/group_vars/all/benchmark-tools.yml
  • docs/getting-started.md

@maryamtahhan maryamtahhan merged commit 6a75fa9 into redhat-et:main Apr 10, 2026
3 checks passed
@maryamtahhan maryamtahhan deleted the container-image-env-var branch April 10, 2026 11:33
