container backend: container.lifetime schema, setup_script provisioning, and persistent mode by atnair-amd · Pull Request #195 · ROCm/cvs

atnair-amd · 2026-05-31T23:23:21Z

Summary

Reworks the container execution backend around a single container lifecycle policy and adds in-container provisioning + a persistent (install-then-run) mode.

- Schema: replaces the two-axis container.enabled + container.launch with one tri-valued container.lifetime (external / per_run / persistent). Both legacy keys now hard-error with a migration message (no silent mapping), so a stale flag can't override an explicit lifetime.
- Provisioning: adds container.setup_script, run inside each freshly-launched container before sshd setup (base64-delivered over the existing docker exec channel, with set -o pipefail and a size guard). The packaged default installs openssh-server, so the base image no longer needs sshd baked in.
- Persistent mode: setup_containers attaches when the container runs on every host, cold-starts when it runs on none, and hard-fails on a partial set (relaunching would force-remove and wipe the still-running hosts' overlay). Includes a per-host image-SHA consistency check (cross-host skew or unreadable SHA is a hard error) and an idempotent setup_sshd (skips when sshd is already on 2224). Teardown is a no-op, enabling cvs run install_rvs then cvs run rvs_cvs in separate invocations.
- Launch pull: the container-start docker run timeout is raised 60s -> 900s (CONTAINER_START_TIMEOUT_S) so the orchestrator can pull a multi-GB image on a cold node instead of timing out.
- Docs (cluster-file, run-with-containers, cluster_file README) and cluster_container.json sample updated; enabled/launch references removed.

Test plan

- Unit: test_container.py, test_factory.py, test_docker.py (lifetime resolution, per-lifetime setup/teardown, persistent attach/partial/skew/unreadable-SHA, provisioning dispatch + size guard + pipefail payload, sshd pgrep self-match); 72 pass.
- Lint: ruff check clean on changed files.
- End-to-end on an MI300X node (container orchestrator, lifetime: persistent):
  - install_rvs cold-starts a container, provisions sshd into a no-sshd TheRock image, installs RVS into the overlay; container survives teardown.
  - rvs_cvs attaches to the same container (verified identical container ID + StartedAt, no relaunch), skips sshd setup, and runs rvs from the persisted overlay.
  - Verified the launch path pulls a fresh 61 GB image on a cold node (~9 min) within the new 900s timeout (the old 60s timeout aborted the pull).

…er.lifetime Add _resolve_container_lifetime() and invoke it from OrchestratorConfig.__init__, the single chokepoint that from_configs routes through. container.enabled is removed (hard ValueError if present); container.launch becomes a deprecated alias (true->per_run, false->external) emitting a DeprecationWarning. Default lifetime is per_run. Docstrings rewritten to the new schema.
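A minimal sketch of the resolution rules this commit message describes, written as a standalone function. The name `resolve_container_lifetime`, the error wording, and the rule that an explicit lifetime takes precedence over the launch alias are assumptions made for illustration; the actual `_resolve_container_lifetime()` may differ, and a later commit in this PR turns the launch alias into a hard error.

```python
import warnings

# Values named in the commit message; per_run is the stated default.
VALID_LIFETIMES = ("external", "per_run", "persistent")
DEFAULT_LIFETIME = "per_run"


def resolve_container_lifetime(container_cfg: dict) -> str:
    """Return the lifetime policy for a container config block (illustrative)."""
    # container.enabled is removed outright: fail loudly instead of mapping it.
    if "enabled" in container_cfg:
        raise ValueError("container.enabled was removed; set container.lifetime instead")

    lifetime = container_cfg.get("lifetime")

    # container.launch is a deprecated alias: true -> per_run, false -> external.
    if "launch" in container_cfg:
        warnings.warn(
            "container.launch is deprecated; use container.lifetime",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: an explicit lifetime wins over the alias.
        if lifetime is None:
            lifetime = "per_run" if container_cfg["launch"] else "external"

    if lifetime is None:
        lifetime = DEFAULT_LIFETIME
    if lifetime not in VALID_LIFETIMES:
        raise ValueError(
            f"container.lifetime must be one of {VALID_LIFETIMES}, got {lifetime!r}"
        )
    return lifetime
```

Keeping the resolver a pure function of the config block makes the table-style unit tests mentioned later in this PR straightforward to write.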

DockerRuntime no longer interprets container.launch now that lifetime resolution lives in OrchestratorConfig. Add image_sha_status() to compare a running container's image SHA against the local image tag per host (used by the persistent attach path), and mirror it in the ContainerRuntime protocol and the Enroot stub.

setup_containers verifies-only for external, launches for per_run, and attaches-or-launches for persistent (with a per-host image-SHA check; cross-host SHA skew is a hard error). teardown is a no-op for external/persistent and force-removes for per_run. Extract the launch path into _launch_containers(); add a pgrep precheck so setup_sshd is idempotent on persistent re-runs. Drop the dead container_enabled attribute.
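The per-lifetime behavior described here can be read as a small dispatch table. The sketch below is illustrative only: the action labels and the `actions_for` helper are hypothetical names, not identifiers from this PR.

```python
from typing import Tuple

# What setup_containers does for each lifetime (labels are hypothetical).
SETUP_ACTION = {
    "external": "verify_only",
    "per_run": "launch",
    "persistent": "attach_or_launch",
}

# What teardown does for each lifetime (labels are hypothetical).
TEARDOWN_ACTION = {
    "external": "noop",
    "per_run": "force_remove",
    "persistent": "noop",
}


def actions_for(lifetime: str) -> Tuple[str, str]:
    """Return the (setup, teardown) action pair for a lifetime value."""
    if lifetime not in SETUP_ACTION:
        raise ValueError(f"unknown container lifetime: {lifetime!r}")
    return SETUP_ACTION[lifetime], TEARDOWN_ACTION[lifetime]
```

Only per_run both launches and removes the container; the other two policies leave whatever is running in place at teardown.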

…sample Replace the enabled/launch keys (and their comments) with a single lifetime: per_run plus a comment describing the three policies.

…behavior Migrate the container/factory/docker unit tests off enabled+launch. Add _resolve_container_lifetime table cases (enabled->ValueError, launch->DeprecationWarning, invalid->ValueError, defaults), persistent attach/launch/idempotency, cross-host SHA skew, stale-overlay warn, and setup_sshd idempotency.

Rewrite the cluster-file README/RST and the run-with-containers how-to around the lifetime truth table (external/per_run/persistent), add a persistent attach sequence diagram, drop the removed enabled/stale-name pitfalls, and add the pin-container.name guidance for persistent.

Minimal apt script run inside a freshly-launched container to install openssh-server, so CVS's in-container sshd can start on base images that do not pre-ship it. Short-circuits if sshd is already present.

Add _resolve_container_setup_script alongside the lifetime resolver: a user-supplied path is made absolute and must exist (ValueError at config load otherwise); when absent/null/empty it falls back to the packaged default, which is itself existence-checked so a broken install fails fast rather than as an OSError mid-run.
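A rough sketch of the resolution order described above. The function name, the `packaged_default` parameter, and the use of ValueError for a missing packaged default are assumptions for illustration; the commit message only states that a missing user path raises ValueError and that a missing default fails fast.

```python
import os
from typing import Optional


def resolve_setup_script(user_value: Optional[str], packaged_default: str) -> str:
    """Return an absolute path to the setup script to run (illustrative)."""
    if user_value:
        # User-supplied path: expand ~, make absolute, and require it to exist.
        path = os.path.abspath(os.path.expanduser(user_value))
        if not os.path.isfile(path):
            raise ValueError(f"container.setup_script not found: {path}")
        return path

    # Absent, null, or empty: fall back to the packaged default, and check it
    # now so a broken install fails at config load rather than mid-run.
    default_path = os.path.abspath(packaged_default)
    if not os.path.isfile(default_path):
        # Assumption: the exception type for a missing default is not stated.
        raise ValueError(f"packaged default setup script is missing: {default_path}")
    return default_path
```

Checking both paths at config load keeps every failure in one place, before any container has been launched.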

After a fresh launch (per_run, or persistent cold-start), run the resolved setup_script inside each container via docker exec before setup_sshd, so base images lacking sshd/packages are brought up to spec. external and persistent-attach skip provisioning (idempotent across runs). The script is shipped base64-encoded inline; oversized scripts and read errors fail with a clear message, and every failing host's output is logged.
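One way to picture the inline delivery is as a function that turns the script bytes into a single shell command for docker exec. This is a sketch under assumptions: the function name, the exact cap value (the docs in this PR mention roughly 16 KB raw), and the precise command layout are illustrative, and the `set -o pipefail` prefix shown here comes from a later hardening commit in this PR.

```python
import base64
import shlex

# Assumption: the PR's docs mention a ~16 KB raw cap; the exact value may differ.
MAX_SETUP_SCRIPT_BYTES = 16 * 1024


def build_provision_command(script: bytes, max_bytes: int = MAX_SETUP_SCRIPT_BYTES) -> str:
    """Build a shell command that decodes and runs `script` inside a container."""
    if len(script) > max_bytes:
        raise ValueError(
            f"setup_script is {len(script)} bytes; inline delivery allows at most {max_bytes}"
        )
    # Base64 output uses only [A-Za-z0-9+/=], so it needs no shell quoting.
    payload = base64.b64encode(script).decode("ascii")
    # pipefail makes a missing or failing base64 abort the pipeline instead of
    # letting bash succeed on empty input.
    inner = f"set -o pipefail; echo {payload} | base64 -d | bash"
    return f"bash -c {shlex.quote(inner)}"
```

The resulting string would be handed to whatever already runs commands in the container (for example `docker exec <container> ...`), which is why the image needs both bash and base64 available.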

…arent shell pgrep -f 'sshd.*2224' matched its own parent shell, whose argv contains the pattern, so the precheck always reported sshd already running and skipped starting it (and the post-start validation always passed). Use the [s]shd character-class trick at both sites so they match the real daemon only.
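The self-match can be reproduced without pgrep by applying the same regular expressions to example command lines. The sketch below uses Python's `re.search` as a stand-in for `pgrep -f` matching; the command-line strings are hypothetical examples, not captured output.

```python
import re

NAIVE_PATTERN = r"sshd.*2224"
SAFE_PATTERN = r"[s]shd.*2224"

# Hypothetical command line of the real daemon.
DAEMON_CMDLINE = "/usr/sbin/sshd -D -p 2224"


def wrapper_cmdline(pattern: str) -> str:
    """Hypothetical argv of the shell that runs the precheck with `pattern`."""
    return f"bash -c pgrep -f '{pattern}'"


def matching(pattern: str, cmdlines: list) -> list:
    """Return the command lines a pgrep -f style search would report."""
    return [c for c in cmdlines if re.search(pattern, c)]


# With no daemon running, only the wrapper shell exists.
naive_hits = matching(NAIVE_PATTERN, [wrapper_cmdline(NAIVE_PATTERN)])
safe_hits = matching(SAFE_PATTERN, [wrapper_cmdline(SAFE_PATTERN)])
```

Here `naive_hits` contains the wrapper shell, because its own argv includes the literal text `sshd.*2224`, while `safe_hits` is empty: the argv now contains `[s]shd`, which the character class `[s]` followed by `shd` does not match. Both patterns still match `DAEMON_CMDLINE`.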

Add the optional setup_script key to cluster_container.json and the schema README, and drop the requirement that the image pre-ship openssh-server (now installed at launch by the default setup_script). Note the inline delivery constraints (bash/base64 in the image, ~16 KB raw cap) and that null/omit both mean "use the packaged default".

Update the how-to and cluster-file reference: add the setup_script field, describe the launch-time provisioning step, and soften the "image must contain openssh-server" prerequisite/pitfall.

Test _resolve_container_setup_script: falsy (absent/null/empty) -> packaged default, missing user path raises, relative/tilde paths resolve to abspath, the packaged default is present, and a missing default raises.

…guard Add provisioning coverage (fresh-launch dispatch matrix, size guard, payload bytes, read/oversize failures, all-host failure logging) and pin the [s]shd precheck/validation pattern. Restructure with setUp-based patching and subTest tables in place of the per-method @patch boilerplate.

Post-review hardening of the container persistent-lifetime feature:
- container: branch persistent setup_containers on per-host running state. Attach when the container runs on every host, cold-start when it runs on no host, and hard-fail (no relaunch) when it runs on some hosts but not all -- relaunching force-removes the still-running hosts and destroys their overlay, the opposite of what persistent promises.
- container: add `set -o pipefail` to the inline setup_script delivery so a missing or failing base64 in the image fails loudly instead of silently no-opping and later surfacing as an opaque sshd startup failure.
- container/runtimes: fail the persistent image-SHA check when a host's SHA is unreadable (previously a silent pass), and drop the always-zero exit_code from image_sha_status and the ContainerRuntime protocol since the wrapping echo makes it meaningless.
- factory: remove the container.launch deprecation mapping. launch now hard-errors like the already-removed enabled, so a stale launch flag can never silently override an explicit lifetime.
- runtimes/docker: raise the container-start timeout from 60s to 900s (CONTAINER_START_TIMEOUT_S) so the launch `docker run` can pull a multi-GB image on a cold node instead of timing out at one minute.
- unittests + cluster-file docs: updated for the new contracts.
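The persistent branching and the image-SHA check described in this commit can be sketched as two pure functions over per-host state. The function names, the returned labels, and the exception types are assumptions for illustration; the commit message only says these cases are hard errors.

```python
from typing import Dict, Optional


def decide_persistent_action(running_by_host: Dict[str, bool]) -> str:
    """Return 'attach' or 'cold_start'; raise on a partially running set."""
    if not running_by_host:
        raise ValueError("no hosts given")
    running = sorted(h for h, up in running_by_host.items() if up)
    if len(running) == len(running_by_host):
        return "attach"
    if not running:
        return "cold_start"
    # Partial set: relaunching would force-remove the running hosts' containers
    # and destroy their overlay, so refuse instead.
    stopped = sorted(set(running_by_host) - set(running))
    raise RuntimeError(
        f"container running on {running} but not on {stopped}; refusing to relaunch"
    )


def check_image_shas(sha_by_host: Dict[str, Optional[str]]) -> str:
    """Return the single image SHA shared by all hosts, or raise."""
    unreadable = sorted(h for h, sha in sha_by_host.items() if not sha)
    if unreadable:
        raise RuntimeError(f"could not read image SHA on hosts: {unreadable}")
    distinct = set(sha_by_host.values())
    if len(distinct) != 1:
        raise RuntimeError(f"image SHA differs across hosts: {sha_by_host}")
    return distinct.pop()
```

Treating an unreadable SHA as a failure, not as a pass, means a host whose state cannot be inspected is never assumed to be consistent with the rest.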

`make test` runs `ruff format --check` over the tree. Collapse the multi-line calls/raises that fit the configured line length in container.py and factory.py (from the previous commit) and in test_factory.py (pre-existing drift). Formatting only, no behavior change.

Rename the verify-only container lifetime value from 'external' to 'no_launch', which describes the behavior (CVS never launches the container) rather than an ownership model. Updates the validation tuple, branch checks, removed-field error messages, unit tests, the cluster_container.json sample comment, and the cluster-file docs / README truth tables. The feature is unreleased so no alias is kept.

atnair-amd added 17 commits May 29, 2026 13:59

input/cluster_file: use container.lifetime in cluster_container.json …

c14b914

orchestrators/scripts: add default container provisioning script

0ddd336

docs: document container.setup_script in cluster-file references

8fcfec9

orchestrators/unittests: cover container.setup_script resolution

87b9065

atnair-amd self-assigned this Jun 1, 2026

atnair-amd requested a review from cijohnson June 3, 2026 00:27

atnair-amd wants to merge 17 commits into main from atnair/container-lifetime
