container backend: container.lifetime schema, setup_script provisioning, and persistent mode#195
Open
atnair-amd wants to merge 17 commits into
…er.lifetime Add _resolve_container_lifetime() and invoke it from OrchestratorConfig.__init__, the single chokepoint that from_configs routes through. container.enabled is removed (hard ValueError if present); container.launch becomes a deprecated alias (true->per_run, false->external) emitting a DeprecationWarning. Default lifetime is per_run. Docstrings rewritten to the new schema.
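The resolution rules in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `resolve_container_lifetime` and `VALID_LIFETIMES` are hypothetical names, and the commit message does not spell out what happens when both `launch` and `lifetime` are present, so the sketch simply assumes an explicit `lifetime` wins.

```python
import warnings

# Assumption: the three lifetime values named in this PR at this commit.
VALID_LIFETIMES = ("external", "per_run", "persistent")


def resolve_container_lifetime(container_cfg: dict) -> str:
    """Hypothetical stand-in for _resolve_container_lifetime()."""
    # The removed key is a hard error rather than a silent mapping.
    if "enabled" in container_cfg:
        raise ValueError(
            "container.enabled has been removed; set container.lifetime instead"
        )

    lifetime = container_cfg.get("lifetime")

    # Deprecated alias: true -> per_run, false -> external.
    # Assumption: only consulted when no explicit lifetime is given.
    if lifetime is None and "launch" in container_cfg:
        warnings.warn(
            "container.launch is deprecated; use container.lifetime",
            DeprecationWarning,
            stacklevel=2,
        )
        lifetime = "per_run" if container_cfg["launch"] else "external"

    # Default when nothing is configured.
    if lifetime is None:
        lifetime = "per_run"

    if lifetime not in VALID_LIFETIMES:
        raise ValueError(
            f"container.lifetime must be one of {VALID_LIFETIMES}, got {lifetime!r}"
        )
    return lifetime
```

Calling the resolver from a single constructor keeps every config path subject to the same validation, which is the point of the chokepoint described above.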
DockerRuntime no longer interprets container.launch now that lifetime resolution lives in OrchestratorConfig. Add image_sha_status() to compare a running container's image SHA against the local image tag per host (used by the persistent attach path), and mirror it in the ContainerRuntime protocol and the Enroot stub.
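A per-host SHA comparison of this kind might look like the sketch below. The helper names and the status strings (`match`, `stale`, `unreadable`) are assumptions made for illustration; the real `image_sha_status()` signature is not shown in this PR text. The two `docker` invocations use the standard `inspect` subcommands with Go-template `--format` strings.

```python
import shlex
from typing import Optional


def running_image_cmd(container_name: str) -> str:
    """Shell command that prints the image ID a container was started from."""
    return "docker inspect --format '{{.Image}}' " + shlex.quote(container_name)


def local_image_cmd(image_tag: str) -> str:
    """Shell command that prints the image ID the local tag currently points at."""
    return "docker image inspect --format '{{.Id}}' " + shlex.quote(image_tag)


def sha_status(running_sha: Optional[str], local_sha: Optional[str]) -> str:
    """Classify one host given the output of the two commands above.

    Hypothetical return values:
      "unreadable" - either SHA is missing or empty
      "match"      - the container runs the image the tag points at
      "stale"      - the tag has moved since the container was launched
    """
    running = (running_sha or "").strip()
    local = (local_sha or "").strip()
    if not running or not local:
        return "unreadable"
    return "match" if running == local else "stale"
```

Keeping the comparison separate from command execution makes the classification easy to unit-test without a Docker daemon.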
setup_containers verifies-only for external, launches for per_run, and attaches-or-launches for persistent (with a per-host image-SHA check; cross-host SHA skew is a hard error). teardown is a no-op for external/persistent and force-removes for per_run. Extract the launch path into _launch_containers(); add a pgrep precheck so setup_sshd is idempotent on persistent re-runs. Drop the dead container_enabled attribute.
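The per-lifetime behavior described here amounts to a small truth table. The sketch below encodes it with hypothetical function names and action labels; it reflects this commit only (a later commit in this PR refines the persistent branch and renames `external` to `no_launch`).

```python
def setup_action(lifetime: str, container_running: bool) -> str:
    """Return what setup should do for a given lifetime (illustrative labels)."""
    if lifetime == "external":
        return "verify"  # never launch, only check the container exists
    if lifetime == "per_run":
        return "launch"  # always start fresh
    if lifetime == "persistent":
        # Reuse a running container, otherwise start one.
        return "attach" if container_running else "launch"
    raise ValueError(f"unknown lifetime: {lifetime!r}")


def teardown_action(lifetime: str) -> str:
    """Only per_run containers are removed at the end of a run."""
    if lifetime == "per_run":
        return "force_remove"
    if lifetime in ("external", "persistent"):
        return "noop"
    raise ValueError(f"unknown lifetime: {lifetime!r}")
```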
…sample Replace the enabled/launch keys (and their comments) with a single lifetime: per_run plus a comment describing the three policies.
…behavior Migrate the container/factory/docker unit tests off enabled+launch. Add _resolve_container_lifetime table cases (enabled->ValueError, launch-> DeprecationWarning, invalid->ValueError, defaults), persistent attach/launch/ idempotency, cross-host SHA skew, stale-overlay warn, and setup_sshd idempotency.
Rewrite the cluster-file README/RST and the run-with-containers how-to around the lifetime truth table (external/per_run/persistent), add a persistent attach sequence diagram, drop the removed enabled/stale-name pitfalls, and add the pin-container.name guidance for persistent.
Minimal apt script run inside a freshly-launched container to install openssh-server, so CVS's in-container sshd can start on base images that do not pre-ship it. Short-circuits if sshd is already present.
Add _resolve_container_setup_script alongside the lifetime resolver: a user-supplied path is made absolute and must exist (ValueError at config load otherwise); when absent/null/empty it falls back to the packaged default, which is itself existence-checked so a broken install fails fast rather than as an OSError mid-run.
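A resolver with these semantics could be sketched as below. `resolve_setup_script` and its `packaged_default` parameter are hypothetical; how the real code locates the packaged script is not stated here. Expanding `~` is an assumption consistent with the tilde-path test case mentioned elsewhere in this PR.

```python
import os
from typing import Optional


def resolve_setup_script(user_path: Optional[str], packaged_default: str) -> str:
    """Hypothetical stand-in for _resolve_container_setup_script()."""
    if user_path:
        # User-supplied path: make it absolute and require it to exist,
        # so a typo fails at config load instead of mid-run.
        resolved = os.path.abspath(os.path.expanduser(user_path))
        if not os.path.isfile(resolved):
            raise ValueError(f"container.setup_script not found: {resolved}")
        return resolved

    # Absent, null, or empty: fall back to the packaged default, which is
    # also existence-checked so a broken install fails fast.
    if not os.path.isfile(packaged_default):
        raise ValueError(f"packaged default setup_script missing: {packaged_default}")
    return packaged_default
```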
After a fresh launch (per_run, or persistent cold-start), run the resolved setup_script inside each container via docker exec before setup_sshd, so base images lacking sshd/packages are brought up to spec. external and persistent-attach skip provisioning (idempotent across runs). The script is shipped base64-encoded inline; oversized scripts and read errors fail with a clear message, and every failing host's output is logged.
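Inline base64 delivery over `docker exec` can be illustrated as follows. This is a sketch under assumptions: `build_provision_command` and `MAX_SCRIPT_BYTES` are hypothetical names, the 16 KiB limit approximates the "~16 KB raw cap" mentioned in this PR's docs commit, and the `set -o pipefail` prefix is the guard a later commit in this PR adds.

```python
import base64
import shlex

# Assumption: approximates the "~16 KB raw cap" described in this PR.
MAX_SCRIPT_BYTES = 16 * 1024


def build_provision_command(container_name: str, script_bytes: bytes) -> str:
    """Build a docker exec command that runs script_bytes inside the container."""
    if len(script_bytes) > MAX_SCRIPT_BYTES:
        raise ValueError(
            f"setup_script is {len(script_bytes)} bytes; limit is {MAX_SCRIPT_BYTES}"
        )

    # Base64 output uses only [A-Za-z0-9+/=], so it needs no extra quoting
    # inside the inner shell command.
    payload = base64.b64encode(script_bytes).decode("ascii")

    # pipefail makes a missing or failing base64 binary fail the pipeline
    # instead of feeding empty input to bash.
    inner = f"set -o pipefail; echo {payload} | base64 -d | bash"
    return f"docker exec {shlex.quote(container_name)} bash -c {shlex.quote(inner)}"
```

Shipping the script inline avoids copying a file to every host, at the cost of requiring `bash` and `base64` in the image and keeping the script small enough for a command line.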
…arent shell pgrep -f 'sshd.*2224' matched its own parent shell, whose argv contains the pattern, so the precheck always reported sshd already running and skipped starting it (and the post-start validation always passed). Use the [s]shd character-class trick at both sites so they match the real daemon only.
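The self-match can be reproduced without `pgrep` by applying the same patterns to example command lines. The argv strings below are made up for illustration, and Python's `re` stands in for the extended regex matching that `pgrep -f` performs (the two agree for these simple patterns).

```python
import re

# Illustrative command lines as they would appear in the process table.
shell_argv_naive = "bash -c pgrep -f 'sshd.*2224'"
shell_argv_fixed = "bash -c pgrep -f '[s]shd.*2224'"
daemon_argv = "/usr/sbin/sshd -p 2224"

naive = re.compile(r"sshd.*2224")
fixed = re.compile(r"[s]shd.*2224")

# The naive pattern matches the parent shell, because the shell's own argv
# contains the literal text "sshd.*2224".
naive_hits_shell = naive.search(shell_argv_naive) is not None

# With the character class, the shell's argv contains "[s]shd", which has no
# contiguous "sshd" substring, so the pattern no longer matches it.
fixed_hits_shell = fixed.search(shell_argv_fixed) is not None

# Both patterns still match the real daemon.
naive_hits_daemon = naive.search(daemon_argv) is not None
fixed_hits_daemon = fixed.search(daemon_argv) is not None
```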
Add the optional setup_script key to cluster_container.json and the schema README, and drop the requirement that the image pre-ship openssh-server (now installed at launch by the default setup_script). Note the inline delivery constraints (bash/base64 in the image, ~16 KB raw cap) and that null/omit both mean "use the packaged default".
Update the how-to and cluster-file reference: add the setup_script field, describe the launch-time provisioning step, and soften the "image must contain openssh-server" prerequisite/pitfall.
Test _resolve_container_setup_script: falsy (absent/null/empty) -> packaged default, missing user path raises, relative/tilde paths resolve to abspath, the packaged default is present, and a missing default raises.
…guard Add provisioning coverage (fresh-launch dispatch matrix, size guard, payload bytes, read/oversize failures, all-host failure logging) and pin the [s]shd precheck/validation pattern. Restructure with setUp-based patching and subTest tables in place of the per-method @patch boilerplate.
Post-review hardening of the container persistent-lifetime feature:
- container: branch persistent setup_containers on per-host running state. Attach when the container runs on every host, cold-start when it runs on no host, and hard-fail (no relaunch) when it runs on some hosts but not all -- relaunching force-removes the still-running hosts and destroys their overlay, the opposite of what persistent promises.
- container: add `set -o pipefail` to the inline setup_script delivery so a missing or failing base64 in the image fails loudly instead of silently no-opping and later surfacing as an opaque sshd startup failure.
- container/runtimes: fail the persistent image-SHA check when a host's SHA is unreadable (previously a silent pass), and drop the always-zero exit_code from image_sha_status and the ContainerRuntime protocol since the wrapping echo makes it meaningless.
- factory: remove the container.launch deprecation mapping. launch now hard-errors like the already-removed enabled, so a stale launch flag can never silently override an explicit lifetime.
- runtimes/docker: raise the container-start timeout from 60s to 900s (CONTAINER_START_TIMEOUT_S) so the launch `docker run` can pull a multi-GB image on a cold node instead of timing out at one minute.
- unittests + cluster-file docs: updated for the new contracts.
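The all/none/partial branching and the stricter SHA check can be sketched like this. Function names, action labels, and exception types are hypothetical; the sketch only encodes the decision rules listed above.

```python
from typing import Dict, Optional


def persistent_action(running_by_host: Dict[str, bool]) -> str:
    """Decide attach vs cold start from per-host running state."""
    if not running_by_host:
        raise ValueError("no hosts given")

    running = [host for host, up in running_by_host.items() if up]
    if len(running) == len(running_by_host):
        return "attach"  # running on every host
    if not running:
        return "cold_start"  # running on no host

    # Partial set: relaunching would force-remove the running containers
    # and destroy their overlay, so refuse instead.
    missing = sorted(set(running_by_host) - set(running))
    raise RuntimeError(
        f"persistent container is running on some hosts but not on: {missing}"
    )


def check_image_shas(sha_by_host: Dict[str, Optional[str]]) -> None:
    """Fail on an unreadable SHA or on differing SHAs across hosts."""
    unreadable = sorted(host for host, sha in sha_by_host.items() if not sha)
    if unreadable:
        raise RuntimeError(f"image SHA unreadable on: {unreadable}")
    if len(set(sha_by_host.values())) > 1:
        raise RuntimeError(f"image SHA differs across hosts: {sha_by_host}")
```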
`make test` runs `ruff format --check` over the tree. Collapse the multi-line calls/raises that fit the configured line length in container.py and factory.py (from the previous commit) and in test_factory.py (pre-existing drift). Formatting only, no behavior change.
Rename the verify-only container lifetime value from 'external' to 'no_launch', which describes the behavior (CVS never launches the container) rather than an ownership model. Updates the validation tuple, branch checks, removed-field error messages, unit tests, the cluster_container.json sample comment, and the cluster-file docs / README truth tables. The feature is unreleased so no alias is kept.
Summary
Reworks the container execution backend around a single container lifecycle policy and adds in-container provisioning + a persistent (install-then-run) mode.
- Replaces `container.enabled` + `container.launch` with one tri-valued `container.lifetime` (`external` / `per_run` / `persistent`). Both legacy keys now hard-error with a migration message (no silent mapping), so a stale flag can't override an explicit `lifetime`.
- Adds `container.setup_script`, run inside each freshly-launched container before sshd setup (base64-delivered over the existing `docker exec` channel, with `set -o pipefail` and a size guard). The packaged default installs `openssh-server`, so the base image no longer needs sshd baked in.
- With `lifetime: persistent`, `setup_containers` attaches when the container runs on every host, cold-starts when it runs on none, and hard-fails on a partial set (relaunching would force-remove and wipe the still-running hosts' overlay). Includes a per-host image-SHA consistency check (cross-host skew or unreadable SHA is a hard error) and an idempotent `setup_sshd` (skips when sshd is already on 2224). Teardown is a no-op, enabling `cvs run install_rvs` then `cvs run rvs_cvs` in separate invocations.
- The `docker run` timeout is raised 60s -> 900s (`CONTAINER_START_TIMEOUT_S`) so the orchestrator can pull a multi-GB image on a cold node instead of timing out.
- Docs (`cluster-file`, `run-with-containers`, cluster_file README) and the `cluster_container.json` sample updated; `enabled`/`launch` references removed.

Test plan
- `test_container.py`, `test_factory.py`, `test_docker.py` (lifetime resolution, per-lifetime setup/teardown, persistent attach/partial/skew/unreadable-SHA, provisioning dispatch + size guard + pipefail payload, sshd pgrep self-match): 72 pass.
- `ruff check` clean on changed files.
- With `lifetime: persistent`: `install_rvs` cold-starts a container, provisions sshd into a no-sshd TheRock image, installs RVS into the overlay; container survives teardown. `rvs_cvs` attaches to the same container (verified identical container ID + StartedAt, no relaunch), skips sshd setup, and runs `rvs` from the persisted overlay.