Skip to content

container backend: container.lifetime schema, setup_script provisioning, and persistent mode#195

Open
atnair-amd wants to merge 17 commits into
mainfrom
atnair/container-lifetime
Open

container backend: container.lifetime schema, setup_script provisioning, and persistent mode#195
atnair-amd wants to merge 17 commits into
mainfrom
atnair/container-lifetime

Conversation

@atnair-amd
Copy link
Copy Markdown
Contributor

Summary

Reworks the container execution backend around a single container lifecycle policy and adds in-container provisioning + a persistent (install-then-run) mode.

  • Schema: replaces the two-axis container.enabled + container.launch with one tri-valued container.lifetime (external / per_run / persistent). Both legacy keys now hard-error with a migration message (no silent mapping), so a stale flag can't override an explicit lifetime.
  • Provisioning: adds container.setup_script, run inside each freshly-launched container before sshd setup (base64-delivered over the existing docker exec channel, with set -o pipefail and a size guard). The packaged default installs openssh-server, so the base image no longer needs sshd baked in.
  • Persistent mode: setup_containers attaches when the container runs on every host, cold-starts when it runs on none, and hard-fails on a partial set (relaunching would force-remove and wipe the still-running hosts' overlay). Includes a per-host image-SHA consistency check (cross-host skew or unreadable SHA is a hard error) and an idempotent setup_sshd (skips when sshd is already on 2224). Teardown is a no-op, enabling cvs run install_rvs then cvs run rvs_cvs in separate invocations.
  • Launch pull: the container-start docker run timeout is raised 60s -> 900s (CONTAINER_START_TIMEOUT_S) so the orchestrator can pull a multi-GB image on a cold node instead of timing out.
  • Docs (cluster-file, run-with-containers, cluster_file README) and cluster_container.json sample updated; enabled/launch references removed.

Test plan

  • Unit: test_container.py, test_factory.py, test_docker.py (lifetime resolution, per-lifetime setup/teardown, persistent attach/partial/skew/unreadable-SHA, provisioning dispatch + size guard + pipefail payload, sshd pgrep self-match) — 72 pass.
  • Lint: ruff check clean on changed files.
  • End-to-end on an MI300X node (container orchestrator, lifetime: persistent):
    • install_rvs cold-starts a container, provisions sshd into a no-sshd TheRock image, installs RVS into the overlay; container survives teardown.
    • rvs_cvs attaches to the same container (verified identical container ID + StartedAt, no relaunch), skips sshd setup, and runs rvs from the persisted overlay.
    • Verified the launch path pulls a fresh 61GB image on a cold node (~9 min) within the new 900s timeout (the old 60s timeout aborted the pull).

atnair-amd added 17 commits May 29, 2026 13:59
…er.lifetime

Add _resolve_container_lifetime() and invoke it from OrchestratorConfig.__init__,
the single chokepoint that from_configs routes through. container.enabled is
removed (hard ValueError if present); container.launch becomes a deprecated alias
(true->per_run, false->external) emitting a DeprecationWarning. Default lifetime
is per_run. Docstrings rewritten to the new schema.
DockerRuntime no longer interprets container.launch now that lifetime resolution
lives in OrchestratorConfig. Add image_sha_status() to compare a running
container's image SHA against the local image tag per host (used by the
persistent attach path), and mirror it in the ContainerRuntime protocol and the
Enroot stub.
setup_containers verifies-only for external, launches for per_run, and
attaches-or-launches for persistent (with a per-host image-SHA check; cross-host
SHA skew is a hard error). teardown is a no-op for external/persistent and
force-removes for per_run. Extract the launch path into _launch_containers(); add
a pgrep precheck so setup_sshd is idempotent on persistent re-runs. Drop the dead
container_enabled attribute.
…sample

Replace the enabled/launch keys (and their comments) with a single
lifetime: per_run plus a comment describing the three policies.
…behavior

Migrate the container/factory/docker unit tests off enabled+launch. Add
_resolve_container_lifetime table cases (enabled->ValueError, launch->
DeprecationWarning, invalid->ValueError, defaults), persistent attach/launch/
idempotency, cross-host SHA skew, stale-overlay warn, and setup_sshd idempotency.
Rewrite the cluster-file README/RST and the run-with-containers how-to around the
lifetime truth table (external/per_run/persistent), add a persistent attach
sequence diagram, drop the removed enabled/stale-name pitfalls, and add the
pin-container.name guidance for persistent.
Minimal apt script run inside a freshly-launched container to install
openssh-server, so CVS's in-container sshd can start on base images that
do not pre-ship it. Short-circuits if sshd is already present.
Add _resolve_container_setup_script alongside the lifetime resolver: a
user-supplied path is made absolute and must exist (ValueError at config
load otherwise); when absent/null/empty it falls back to the packaged
default, which is itself existence-checked so a broken install fails fast
rather than as an OSError mid-run.
After a fresh launch (per_run, or persistent cold-start), run the resolved
setup_script inside each container via docker exec before setup_sshd, so
base images lacking sshd/packages are brought up to spec. external and
persistent-attach skip provisioning (idempotent across runs). The script is
shipped base64-encoded inline; oversized scripts and read errors fail with a
clear message, and every failing host's output is logged.
…arent shell

pgrep -f 'sshd.*2224' matched its own parent shell, whose argv contains the
pattern, so the precheck always reported sshd already running and skipped
starting it (and the post-start validation always passed). Use the [s]shd
character-class trick at both sites so they match the real daemon only.
Add the optional setup_script key to cluster_container.json and the schema
README, and drop the requirement that the image pre-ship openssh-server
(now installed at launch by the default setup_script). Note the inline
delivery constraints (bash/base64 in the image, ~16 KB raw cap) and that
null/omit both mean "use the packaged default".
Update the how-to and cluster-file reference: add the setup_script field,
describe the launch-time provisioning step, and soften the "image must
contain openssh-server" prerequisite/pitfall.
Test _resolve_container_setup_script: falsy (absent/null/empty) -> packaged
default, missing user path raises, relative/tilde paths resolve to abspath,
the packaged default is present, and a missing default raises.
…guard

Add provisioning coverage (fresh-launch dispatch matrix, size guard, payload
bytes, read/oversize failures, all-host failure logging) and pin the [s]shd
precheck/validation pattern. Restructure with setUp-based patching and
subTest tables in place of the per-method @patch boilerplate.
Post-review hardening of the container persistent-lifetime feature:

- container: branch persistent setup_containers on per-host running state.
  Attach when the container runs on every host, cold-start when it runs on
  no host, and hard-fail (no relaunch) when it runs on some hosts but not
  all -- relaunching force-removes the still-running hosts and destroys
  their overlay, the opposite of what persistent promises.
- container: add `set -o pipefail` to the inline setup_script delivery so a
  missing or failing base64 in the image fails loudly instead of silently
  no-opping and later surfacing as an opaque sshd startup failure.
- container/runtimes: fail the persistent image-SHA check when a host's SHA
  is unreadable (previously a silent pass), and drop the always-zero
  exit_code from image_sha_status and the ContainerRuntime protocol since
  the wrapping echo makes it meaningless.
- factory: remove the container.launch deprecation mapping. launch now
  hard-errors like the already-removed enabled, so a stale launch flag can
  never silently override an explicit lifetime.
- runtimes/docker: raise the container-start timeout from 60s to 900s
  (CONTAINER_START_TIMEOUT_S) so the launch `docker run` can pull a
  multi-GB image on a cold node instead of timing out at one minute.
- unittests + cluster-file docs: updated for the new contracts.
`make test` runs `ruff format --check` over the tree. Collapse the
multi-line calls/raises that fit the configured line length in
container.py and factory.py (from the previous commit) and in
test_factory.py (pre-existing drift). Formatting only, no behavior change.
Rename the verify-only container lifetime value from 'external' to
'no_launch', which describes the behavior (CVS never launches the
container) rather than an ownership model. Updates the validation
tuple, branch checks, removed-field error messages, unit tests, the
cluster_container.json sample comment, and the cluster-file docs /
README truth tables. The feature is unreleased so no alias is kept.
@atnair-amd atnair-amd self-assigned this Jun 1, 2026
@atnair-amd atnair-amd requested a review from cijohnson June 3, 2026 00:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant