Conversation

@rajanintel24 rajanintel24 commented Dec 16, 2025

The dry-run feature allows users to copy the vllm-server or vllm-benchmark command-line file to the host machine (i.e., a local directory) without launching the server or the client.

To copy the command-line files, pass the environment variable below:
DRY_RUN=1

In server mode, the server command-line file is saved at /.cd/vllm_server.sh.
In benchmark mode, both command-line files are saved, at /.cd/vllm_server.sh and /.cd/vllm_benchmark.sh.
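The behavior described above can be sketched as a small gate in the entrypoint. This is a minimal illustration, not the PR's exact implementation; the function name and the injectable `run` parameter are assumptions for testability.

```python
import os


def maybe_launch(script_path, run=os.execvp):
    """Dry-run gate: when DRY_RUN=1, keep the generated command-line
    script for inspection on the host and skip launching the process."""
    if os.environ.get("DRY_RUN") == "1":
        print(f"[INFO] Dry run: command line file saved at {script_path}; not launching.")
        return False
    # Normal run: exec the generated script.
    run("bash", ["bash", script_path])
    return True
```

In a normal run the process image is replaced via `execvp`, so nothing after the call executes; in a dry run the function simply returns.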

os.makedirs(self.log_dir, exist_ok=True)
os.execvp("bash", ["bash", self.output_script_path])
if (os.environ.get("DRYRUN_SERVER")=='1' and self.mode=='server') or \
(os.environ.get("DRYRUN_BENCHMARK")=='1' and self.mode=='benchmark'):
Contributor

We could simplify this to:
Just have one DRYRUN env var, and rename it to DRY-RUN.
Since script names are unique, there is no need for subdirectories.
No need for the mode arg then.
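The suggested simplification could be sketched as follows. This is a hedged sketch, spelled DRY_RUN rather than DRY-RUN because hyphens are not valid in POSIX environment variable names (the thread later settles on DRY_RUN):

```python
import os


def is_dry_run():
    # One env var, no dependency on self.mode: the server and
    # benchmark entrypoints would consult the same flag.
    return os.environ.get("DRY_RUN") == "1"
```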

Author

I went for separate dry-run variables for server and benchmark because, in a benchmark dry run, I need to allow the server launch. Docker Compose enforces that the server is launched as a precondition for the benchmark run. With a single dry-run variable, I would need to find another way to allow the server launch and then stop the client launch.

Contributor

@nngokhale nngokhale Dec 16, 2025

There are 2 levels of interaction: 1. What is right for the entrypoint. 2. What is right for Docker Compose.
The entrypoint dry-run option does not need all this complexity.
For Docker Compose, a single dry-run option producing both scripts should be the primary functionality.
As for the Docker Compose server healthcheck, maybe we could alter that for dry-run?

Author

The dry-run implementation was updated to remove the dependency on the run mode.
A single DRYRUN env var is used, renamed to DRY_RUN.

@github-actions
🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

print(f"[INFO] This is a dry run to save the command line file {self.output_script_path}.")
shutil.copy(self.output_script_path, f"/local/{self.mode}/")
print(f"[INFO] The command line file {self.output_script_path} saved at .cd/{self.mode}/{self.output_script_path}")
try:
Contributor

Do we need this?

Author

@rajanintel24 rajanintel24 Dec 16, 2025

If you think it is right, I can move the lines below outside the if condition:

shutil.copy(self.output_script_path, f"/local/{self.mode}/")
print(f"[INFO] The command line file {self.output_script_path} saved at .cd/{self.mode}/{self.output_script_path}")

This would allow the command-line file to be copied in a normal run as well.

Regarding the print statement - I think it is useful information to be logged.

Contributor

I was asking about the Ctrl+C. Why not just exit?

Author

As discussed over the call, Ctrl+C is needed because docker compose restarts the vllm_service after every exit.

The lines below were moved outside the if condition:
shutil.copy(self.output_script_path, f"/local/{self.mode}/")
print(f"[INFO] The command line file {self.output_script_path} saved at .cd/{self.mode}/{self.output_script_path}")

This allows the command-line file to be copied in a normal run as well.
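Put together, the agreed flow could be sketched like this (a minimal sketch; the function name, `dest_dir`, and the injectable `launch` callable are illustrative, not the PR's actual code): the copy happens unconditionally, and only the launch is gated by DRY_RUN.

```python
import os
import shutil


def finalize(output_script_path, dest_dir, launch):
    # Always copy the generated command-line file to the host-mounted
    # directory, in normal runs and dry runs alike.
    os.makedirs(dest_dir, exist_ok=True)
    shutil.copy(output_script_path, dest_dir)
    print(f"[INFO] The command line file {output_script_path} saved at {dest_dir}")
    if os.environ.get("DRY_RUN") == "1":
        print("[INFO] Dry run: skipping launch.")
        return False
    launch()
    return True
```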

Contributor

Looks good


Signed-off-by: Rajan Kumar <[email protected]>
@github-actions

github-actions bot commented Jan 8, 2026: 🚧 CI Blocked (branch is behind the base branch; merge or rebase needed).


docker run -it --rm \
-e MODEL=$MODEL \
-e HF_TOKEN=$HF_TOKEN \
-e http_proxy=$http_proxy \
Contributor

DRY_RUN env is missing

Author

Updated the cmd line

- SYS_NICE
ipc: host
runtime: habana
restart: unless-stopped
Contributor

If we change this to "on-failure", we may not need the dry-run Ctrl+C code.

Author

The restart condition "on-failure" is working; tested with a bad-model-name failure. The dry run does not need Ctrl+C anymore.
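For reference, the Compose change under discussion is a one-line restart policy, replacing the `restart: unless-stopped` shown in the diff above (service name assumed here):

```yaml
services:
  vllm-server:
    restart: on-failure   # was: unless-stopped; a clean dry-run exit no longer triggers a restart
```

With `on-failure`, Compose restarts the container only on a non-zero exit code, so a dry run that exits cleanly stops for good.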

@github-actions

github-actions bot commented Jan 9, 2026: 🚧 CI Blocked (branch is behind the base branch; merge or rebase needed).

Contributor

@nngokhale nngokhale left a comment

LGTM

@PatrykWo PatrykWo self-assigned this Jan 9, 2026
@PatrykWo
Collaborator

PatrykWo commented Jan 13, 2026

@rajanintel24 Execution ends with this error:

vllm-server-1  | Starting script, logging to logs/vllm_server.log
vllm-server-1  | Error: Permission denied. Cannot access 'vllm_server.sh' or write to '/local/'.
vllm-server-1  | [INFO] This is a dry run to save the command line file vllm_server.sh.

@rajanintel24
Author

rajanintel24 commented Jan 13, 2026

@PatrykWo I am unable to reproduce the error you reported with the latest commit.

I am using the commands below to test the branch:

BUILD_ARGS="--build-arg http_proxy --build-arg https_proxy --build-arg no_proxy"
docker build -f Dockerfile.ubuntu.pytorch.vllm -t cmd-dev-rk-1p23 $BUILD_ARGS .

MODEL="meta-llama/Llama-3.1-8B-Instruct" \
HF_TOKEN=$HF_TOKEN \
HABANA_VISIBLE_DEVICES=7 \
DOCKER_IMAGE=cmd-dev-rk-1p23 \
TENSOR_PARALLEL_SIZE=1 \
DRY_RUN=1 \
docker compose up

docker run -it --rm \
  -e MODEL="meta-llama/Llama-3.1-8B-Instruct" \
  -e HF_TOKEN=$HF_TOKEN \
  -e HF_HOME=/mnt/hf_cache \
  -e http_proxy=$http_proxy \
  -e https_proxy=$https_proxy \
  -e no_proxy=$no_proxy \
  --cap-add=sys_nice \
  --ipc=host \
  --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=7 \
  -e VLLM_SKIP_WARMUP=TRUE \
  -e DRY_RUN=1 \
  -p 9001:8000 \
  -v /mnt/hf_cache:/mnt/hf_cache \
  -v ${PWD}:/local \
  --name card-7_vllm-server-rk \
  cmd-dev-rk-1p23

Please share the command lines that produce the reported error.
