test: validate aleph-vm dev-accelerate (supervisor + migration) on testnet#27
odesenfans wants to merge 19 commits into
Point the CRN deployment at the aleph-vm dev-accelerate branch (PR #977: gRPC supervisor rework, integration suite, directory-based VM migration) so the testnet migration test exercises it end to end before merge. crn-up.sh fetches aleph-vm.debian-12.deb from the branch's CI build (build-deb-package-and-integration-tests.yml); the latest dev-accelerate run is green and produces that artifact. Revert to version: 1.13.0 once aleph-vm#977 lands.
foxpatch-aleph
left a comment
Cleanly pins aleph-vm to the dev-accelerate branch for pre-merge live migration validation. The branch-based deployment path was already wired into crn-up.sh, so this is a simple config toggle. The diff is minimal, well-commented, and correctly calls for reverting to version: "1.13.0" once aleph-vm#977 lands. No correctness, security, or quality issues.
manifesto.yml (line 35): The comment clearly flags this as temporary and explains what to revert to. Good practice.
scheduler 0.1.2 (#188) makes node_watcher poll the v2 executions endpoint, so a gracefully-stopping VM stays observed (present as STOPPED) instead of dropping out of the v1 running-only list and being reported 'missing' for the ~34s graceful-shutdown window. That is what broke test_instance_stop_start on the dev-accelerate aleph-vm (whose stop is a graceful guest shutdown vs the old near-instant teardown). Pairs with aleph-vm's stop=stop_vm change, which keeps the stopped VM in the pool so v2 still lists it.
foxpatch-aleph
left a comment
Clean, minimal configuration PR that pins aleph-vm to the dev-accelerate branch for pre-merge validation of the gRPC supervisor rework and directory-based VM migration. Also bumps the scheduler components to 0.1.2 for graceful-stop tolerance. Both changes are well-documented with clear rationale and revert instructions. No code changes, no logic errors, no security concerns.
aleph-vm#977 (supervisor rework) merged into dev, so the temporary dev-accelerate branch pin is obsolete (that branch will be deleted). Point at dev to keep the migration and graceful-stop tests running against the merged code until it reaches main and a release is cut. Scheduler stays at 0.1.2 (graceful-stop tolerance).
foxpatch-aleph
left a comment
Configuration-only PR: bumps scheduler-rs/scheduler-api to 0.1.2 (fixes the graceful-stop race where a VM in STOPPED state was reported as "missing") and switches aleph-vm from version pin "1.13.0" to branch "dev" (tracks the merged #977 gRPC supervisor + directory-migration work). The comments are thorough, the revert instructions are clear, and the deployment script's branch/version branching logic handles this correctly.
foxpatch-aleph
left a comment
Adds migration test with VLAN-aware VM scheduling and anaconda bootstrap. Core logic is sound but has a VM fault tolerance gap (no recovery on start failure), a port collision risk for concurrent tests, and a fragile Bash CIDR computation. These match the existing codebase risk profile and do not block for a testnet-facing PR.
aleph/vm/vm_pool.py (line 2265): switch_network: if cloned.start() fails, original is already stopped and released. No rollback path. Keep original alive until cloned.start() succeeds, then stop it.
tests/migration/config.py (line 76): Fixed port 4020 for Postgres forwarding collides under concurrent test runs. Use port range or random_port().
deploy-hypervisor.sh (line 483): CIDR extraction awk script fails silently on multi-IP VMs. Add fallback or grep for scope global and guard with set -e.
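The port-collision suggestion above could be sketched as follows. This is a minimal illustration, not the project's code: `random_port` is the name used in the review comment, but its body here is an assumption, and binding to port 0 only asks the OS for a port that is free at that moment.

```python
import socket


def random_port(host: str = "127.0.0.1") -> int:
    """Return a TCP port the OS reports as free right now (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Port 0 tells the kernel to pick any unused port.
        sock.bind((host, 0))
        return sock.getsockname()[1]
```

The port is released when the socket closes, so another process could still claim it before the forwarder binds. Concurrent test runs would need to tolerate a bind failure and retry.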
Validate Phase 2 (gRPC wire + two-service packaging) on testnet. Phase 1 (#980) has merged to dev; phase2 carries it plus the Phase 2 changes.
foxpatch-aleph
left a comment
The manifest version changes are correct and well-documented, but the PR description omits nearly all of the test/ changes: removal of the confidential VM test, removal of the scheduler-observation gate (wait_for_scheduler_observed), removal of CRN re-link in migration finally block, removal of required_port=None from wait_for_dispatched, and a downgrade of aleph-cli from 0.11.1 to 0.11.0. The PR title/description should reflect the full scope of these changes.
manifesto.yml (line 41): nit: the comment references aleph-vm#980 but the branch name says "phase2". Is this the right branch? The PR description mentions "dev-accelerate" but the actual branch is "od/grpc-only-supervisor-phase2". One of these is out of date.
manifesto.yml (line 15): aleph-cli downgraded from 0.11.1 to 0.11.0 without explanation in the PR description or comments. Is this intentional? If the newer CLI is incompatible with the test branch, please document. If this was accidental (leftover from a rebase), it should be reverted.
tests/test_migration.py (line 42): Why remove wait_for_scheduler_observed? The old comment explained a real race: if the unlink happens before the scheduler's node_watcher poll sees the VM on the source CRN, the reschedule cold-starts the VM instead of migrating it — losing the marker. With scheduler 0.1.2's v2 executions endpoint, is this race fully eliminated, or does this make the test flaky by removing the gate? The commit message should justify this removal.
tests/test_migration.py (line 70): Removed the CRN re-link from the finally block. Without it, if this test unlinks the static TEE CRN, the confidential test will fail because the CRN is no longer linked. If the confidential test was removed intentionally (see below), then this is fine, but the re-link removal should still be documented as it's a non-obvious change in shared-state cleanup behavior.
tests/test_confidential.py (line 1): The entire confidential VM test (205 lines+) is deleted but not mentioned in the PR description at all. There's no explanation in the commit message, title, or comments. If this is intentional (e.g., simplifying CI to focus on migration), it needs to be acknowledged. If it was removed accidentally (e.g., the branch was cut before the confidential test was added), this is a regression and should be restored.
…-accelerate # Conflicts: # manifesto.yml
foxpatch-aleph
left a comment
The diff correctly switches aleph-vm from a tagged release to a CI artifact build from a feature branch. The crn-up.sh script properly handles both code paths, and the migration test is well-designed with correct handling of the scheduler race condition. However, the PR description says the branch is 'dev-accelerate' while the actual code pins 'od/grpc-only-supervisor-phase2' — these should be aligned before merging.
manifesto.yml (line 39): PR description references branch "dev-accelerate" but the code pins "od/grpc-only-supervisor-phase2". These should match. Either update the PR description to reflect the actual branch name, or update the manifesto.yml if the branch was renamed.
The static AMD SEV-SNP server runs Ubuntu 24.04 (Python 3.12), but the branch path deployed the debian-12 .deb (cp311), so its vendored pydantic_core native extension failed to import there and the supervisor crash-looped (confidential test: :4020 connection refused). Provision CRN droplets as ubuntu-24.04 (CCN already is) and fetch/deploy the ubuntu-24.04 branch .deb so the Python ABI matches on every host including the TEE server.
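To illustrate matching the .deb to the host distro, here is a minimal sketch that derives a variant string such as ubuntu-24.04 or debian-12 from the contents of /etc/os-release. The function name `deb_variant` is hypothetical, and the actual detection in crn-up.sh may work differently.

```python
def deb_variant(os_release_text: str) -> str:
    """Build '<ID>-<VERSION_ID>' from /etc/os-release contents (hypothetical helper)."""
    fields = {}
    for line in os_release_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in single or double quotes.
        fields[key] = value.strip().strip('"').strip("'")
    return f"{fields['ID']}-{fields['VERSION_ID']}"


if __name__ == "__main__":
    with open("/etc/os-release", encoding="utf-8") as handle:
        print(deb_variant(handle.read()))
```

Selecting the package per host this way keeps the vendored native extensions aligned with that host's system Python.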
foxpatch-aleph
left a comment
The code changes are correct and internally consistent: standardizing all droplets to Ubuntu 24.04, updating the CI artifact name accordingly, and pinning to the development branch for testing. The only issue is a mismatch between the PR description (which says branch: "dev-accelerate") and the actual change (branch: "od/grpc-only-supervisor-phase2"). This should be reconciled before merging.
manifesto.yml (line 39): PR description says this should be "dev-accelerate" but the code sets "od/grpc-only-supervisor-phase2". Please update one or the other to match.
Diagnosing the confidential create/registration on the SEV box: in the two-service split the agent serves /control/* and /v2/about/executions/list, so the confidential allocation result and the awaiting_confidential_init state live in its journal, not the supervisor daemon's. Also switch the TEE journals from -n 5000 to --since: the supervisor crash-loops during the static-server reinstall and blew the line budget before the test window.
foxpatch-aleph
left a comment
Sound PR that migrates from Debian 12 to Ubuntu 24.04 across all CRNs/CCN, fixes the TEE server log collection to use --since instead of -n (addressing the crash-loop noise problem), and captures the new agent service logs. The distro unification simplifies the artifact download path to a single ubuntu-24.04 artifact. No correctness or security issues found — the SSH quoting in the --since parameter is handled correctly via double-quote expansion.
.github/workflows/pr-tests.yml (line 330): Consider using a wider --since window (e.g. '90 min ago') or capturing a start timestamp earlier in the workflow. The 45 min window covers most but not all failure scenarios — a test that fails after minute 45+ of a 60-min timeout workflow could have its tail end of logs cut off. Low risk in practice given the log collection step itself is fast.
manifesto.yml (line 39): PR description says 'dev-accelerate' but the branch pinned here is 'od/grpc-only-supervisor-phase2'. Verify these refer to the same branch (e.g. a rename) or update the PR description to match.
The .deb version is git-describe (commits since tag), so a rebased branch can carry a lower count than the .deb a previous run left on the persistent SEV box. apt aborts with 'Packages were downgraded ... without --allow-downgrades' (exit 100). Add --allow-downgrades to the static-CRN reinstall.
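The downgrade scenario can be illustrated with a small sketch that parses `git describe` output of the form `<tag>-<count>-g<sha>` and compares the commit counts. The helper names are hypothetical, the version strings are invented examples, and real apt ordering follows Debian version comparison rules rather than this simplified check.

```python
import re

# git describe prints "<tag>-<commits since tag>-g<abbreviated sha>".
_DESCRIBE = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<sha>[0-9a-f]+)$")


def parse_describe(describe: str) -> tuple[str, int]:
    """Return (tag, commits since tag); a bare tag counts as zero commits."""
    match = _DESCRIBE.match(describe)
    if match is None:
        return describe, 0
    return match.group("tag"), int(match.group("count"))


def would_downgrade(installed: str, candidate: str) -> bool:
    """True when both builds share a tag and the candidate has fewer commits."""
    installed_tag, installed_count = parse_describe(installed)
    candidate_tag, candidate_count = parse_describe(candidate)
    return installed_tag == candidate_tag and candidate_count < installed_count
```

A rebase that squashes or drops commits lowers the count, which is the case where the reinstall needs --allow-downgrades.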
foxpatch-aleph
left a comment
The PR contains well-reasoned changes (debian-12→ubuntu-24.04 migration, improved TEE log collection, --allow-downgrades for static CRN reinstalls), but manifesto.yml pins to branch "od/supervisor-vmid-identity" while the PR description explicitly says it should pin to "dev-accelerate" — this mismatch needs to be resolved before merging, as it determines which aleph-vm code is actually tested.
manifesto.yml (line 39): Branch name mismatch: the PR description says branch: "dev-accelerate" but the code pins to branch: "od/supervisor-vmid-identity". These are different branches testing different things. Please confirm which branch is intended and update either the code or the PR description.
scripts/crn-up.sh (line 447): Stale comment: says "a static TEE server may not run debian-12 like the DO droplets do" — but the DO droplets now also run Ubuntu 24.04 (changed in this PR). The debian-12 reference is outdated. Consider: "a static TEE server may run a different distro than the DO droplets".
scripts/crn-up.sh (line 75): With version replaced by branch in manifesto.yml, m['components']['aleph-vm']['version'] will KeyError if ever called without ALEPH_VM_VERSION set. Not currently reachable (install_crn only calls this when branch is empty), but consider using m['components']['aleph-vm'].get('version', '') for robustness.
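The suggested guard might look like the sketch below. It assumes the manifest has already been loaded into a dict; `read_vm_pin` is a hypothetical name, not the existing helper in crn-up.sh.

```python
def read_vm_pin(manifest: dict) -> tuple[str, str]:
    """Return (branch, version) for aleph-vm, using '' for whichever key is absent."""
    vm = manifest.get("components", {}).get("aleph-vm", {})
    # .get() avoids the KeyError raised by vm['version'] when only 'branch' is set.
    return vm.get("branch", ""), vm.get("version", "")
```

Returning both fields lets the caller decide which path to take and fail with a clear message if neither is set.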
Rebased #983 (message-free VmExecution) onto dev; validate it on the testnet.
foxpatch-aleph
left a comment
The PR correctly migrates all CRN droplets from debian-12 to ubuntu-24.04, updates the CI artifact name and fallback defaults consistently, switches from version to branch for aleph-vm pinning, and improves TEE log collection (using --since instead of -n, adding aleph-vm-agent.service capture). The --allow-downgrades flag is well-justified for the git describe versioning scheme. No correctness, security, or logic issues were found. The only notable discrepancy is that the PR description references dev-accelerate but the code pins od/execution-message-free (consistent with the commit message). Two stale comments referencing debian-12 remain but do not affect behavior.
manifesto.yml (line 39): PR description says branch: "dev-accelerate" but the code sets branch: "od/execution-message-free". The commit message is consistent with the code, so this appears to be a stale PR description — consider updating it to avoid confusion for reviewers tracking the aleph-vm branch being tested.
scripts/crn-up.sh (line 447): Stale comment: "TEE server may not run debian-12 like the DO droplets do" — DO droplets now run ubuntu-24.04 (changed in this PR) and the TEE server also runs ubuntu-24.04 (per the workflow comment at line 283). The dynamic detection on line 449 still works correctly, but the comment's premise is no longer accurate. Consider updating to: "e.g. a static TEE server may run a different distro than the DO droplets".
scripts/confidential-artifacts.sh (line 23): Pre-existing but now stale: the docstring says "Prefer a debian-12 CRN: its sevctl is built against the oldest glibc" — CRN 0 now runs ubuntu-24.04 (this PR). The workflow comment at pr-tests.yml:283 correctly explains the new rationale (all hosts run Ubuntu 24.04, so glibc baselines match). Consider updating this docstring to match.
Validate PR-1 of the agent/supervisor boundary split (contract layer + import-linter, behavior-neutral) on testnet.
foxpatch-aleph
left a comment
The changes are functionally correct and well-reasoned — the distro migration from debian-12 to ubuntu-24-04 is consistent across all three files, the --allow-downgrades flag addresses a real apt versioning edge case, and the TEE log collection improvements (switching from -n to --since, adding the agent service) are solid. The quoting in the TEE log step is correct upon careful analysis. However, the PR description is stale — it references dev-accelerate/aleph-vm#977 but the code actually pins od/agent-supervisor-contract/aleph-vm#986. Two comments remain stale after the distro migration: crn-up.sh:447 still says 'debian-12 like the DO droplets do' and confidential-artifacts.sh:23-26 still references a 'debian-12 CRN' for SEVCTL_HOST. None of these are blocking.
manifesto.yml (line 39): The PR description says this pins dev-accelerate (aleph-vm#977) but the code pins od/agent-supervisor-contract (aleph-vm#986). The description should be updated to match the actual branch being tested, to avoid confusion when reviewing or reverting.
scripts/crn-up.sh (line 447): Stale comment: says 'debian-12 like the DO droplets do' but the DO droplets now run ubuntu-24-04 (changed in this PR on line 254). Update to reflect the new baseline, e.g. 'a static TEE server may not run ubuntu-24.04 like the DO droplets do'.
scripts/confidential-artifacts.sh (line 23): Not changed in this PR, but the comment here references 'debian-12 CRN' for SEVCTL_HOST, which is now ubuntu-24.04. The corresponding comment in pr-tests.yml:283 was updated in this PR but this one was missed. Consider updating for consistency.
…re origin The test_confidential init-session 'VM not found' failure reproduced 2x on #986 (od/agent-supervisor-contract), whose confidential-path code is import-only vs dev. Run against dev to determine if the breakage is pre-existing (#984/#985) rather than caused by #986. Will revert the pin afterwards.
foxpatch-aleph
left a comment
The code changes are technically sound and internally consistent: the debian-12→ubuntu-24.04 image migration is applied uniformly across pr-tests.yml, crn-up.sh, and the artifact/deb_variant names; the --allow-downgrades flag and --since-based journalctl collection are well-reasoned and clearly documented; the tee-agent.txt capture correctly supplements the existing log artifacts and is included in the upload list. However, there is a significant metadata mismatch: the PR title and description describe pinning dev-accelerate to validate aleph-vm#977 (gRPC supervisor + migration), but the actual single commit pins branch: "dev" with a comment about an isolation run investigating a confidential VM failure on #986. The commit message matches the code, not the PR description. This mismatch should be corrected so reviewers and future bisectors aren't misled. No blocking code issues found.
manifesto.yml (line 36): PR description mismatch: the PR title says "validate aleph-vm dev-accelerate (supervisor + migration) on testnet" and the description discusses aleph-vm#977, but the actual code pins branch: "dev" with a comment about an isolation run for #986. The commit message ("ci: isolation run — point aleph-vm at dev to check confidential failure origin") matches the code, not the PR description. Update the PR title/description to match the actual change, or rebase onto the correct branch if the description is what was intended.
manifesto.yml (line 42): Latent bug: read_vm_version() at scripts/crn-up.sh:75 does print(m['components']['aleph-vm']['version']) which will KeyError when branch is set instead of version. It's not triggered today because install_crn only calls read_vm_version() in the else branch (when branch is empty), but consider adding a .get('version', '') guard for robustness so a future manifesto that accidentally sets both fields doesn't crash confusingly.
.github/workflows/pr-tests.yml (line 330): The "45 min ago" window assumes the test run completes within ~45 minutes. The job timeout is 60 min. If the run is slow (approaching timeout), early supervisor startup logs that are relevant to a failure could fall outside the window. Consider bumping to "60 min ago" to match the job timeout, or making it relative to a step start timestamp. Non-blocking — the tradeoff against crash-loop noise is reasonable and documented.
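One way to make the window relative to a recorded start time is sketched below. It assumes the workflow saves a timestamp early on; the helper names are hypothetical. The timestamp is formatted as UTC with the literal "UTC" suffix that systemd's time parsing accepts.

```python
from datetime import datetime, timezone


def journal_since(start: datetime) -> str:
    """Format a timezone-aware start time as a journalctl --since value in UTC."""
    return start.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")


def journal_command(unit: str, start: datetime) -> list[str]:
    """Build the journalctl argument list for one systemd unit (hypothetical helper)."""
    return ["journalctl", "-u", unit, "--since", journal_since(start), "--no-pager"]
```

For example, `journal_command("aleph-vm-agent.service", start)` yields arguments covering the whole run regardless of how long the tests took.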
…l failure is pre-existing on dev Isolation run (manifesto pinned to dev) failed test_confidential identically (VM not found at init-session), proving the breakage is pre-existing on dev (#984/#985), not caused by #986. Restoring the pin to the PR-1 branch.
foxpatch-aleph
left a comment
The distro migration (debian-12 → ubuntu-24.04) is consistent across pr-tests.yml, crn-up.sh, and manifesto.yml, with the artifact name, default deb_variant, and explanatory comments all updated in lockstep. The --allow-downgrades addition is well-justified by the git-describe version-count semantics on static servers. TEE log collection is improved (--since instead of -n, agent service added). The pin to od/agent-supervisor-contract and the pre-existing confidential test failure are clearly documented. Two minor documentation gaps exist in files not modified by this PR but worth addressing in a follow-up.
scripts/confidential-artifacts.sh (line 23): Stale comment: still says "Prefer a debian-12 CRN: its sevctl is built against the oldest glibc", but this PR changes all CRN droplets to ubuntu-24.04, invalidating that rationale. The workflow comment (pr-tests.yml:283-285) was updated to reflect the new uniform-distro reality — this docstring should be updated to match (e.g. "All CRNs and the TEE server now run Ubuntu 24.04, so any CRN's sevctl glibc baseline matches the CCN").
scripts/tee-reset.sh (line 28): Not modified by this PR, but relevant: if the od/agent-supervisor-contract split introduces aleph-vm-agent.service as a separate systemd unit (as implied by pr-tests.yml:326-329), this reset only stops aleph-vm-supervisor.service and would leave the agent running across CI runs. Consider adding systemctl stop aleph-vm-agent.service here in a follow-up to avoid stale agent state leaking between runs.
…#990) Validate the complete combined stack (tip branch builds a .deb with all four PRs) on testnet integration. Combined-stack GitHub CI is green (PR #991). Known: the confidential SSH test is a pre-existing dev failure (needs the .deb-selection fix + SEV hw); the other 23 are expected to pass.
foxpatch-aleph
left a comment
A well-documented temporary pin that switches the testnet from a tagged aleph-vm release to a feature branch (od/supervisor-vm-id-rename-impl) for integration testing, while simultaneously migrating the base OS from debian-12 to ubuntu-24.04 across all three touch points (workflow droplet image, crn-up.sh provision, deb artifact/variant selection). The distro switch is internally consistent: the CI artifact name, local_deb path, and fallback deb_variant all use ubuntu-24.04. The --allow-downgrades flag is correctly motivated (git describe counts can decrease on rebase) and the switch from -n to --since for TEE log collection is a sound robustness improvement. The added agent log capture aligns with the two-service supervisor/agent split. Shell quoting in the TEE log step is correct: $SSH is intentionally unquoted to word-split into ssh + host, and single quotes around $since are passed through the double-quoted argument to be interpreted by the remote shell. The only discrepancy is cosmetic: the PR description's prose says branch "dev-accelerate" while the actual code pins to "od/supervisor-vm-id-rename-impl" — the code is what matters and is self-consistent.
manifesto.yml (line 48): The PR description's "What" section says this pins to branch: "dev-accelerate", but the actual branch is od/supervisor-vm-id-rename-impl. The description prose appears stale — consider updating it to avoid confusion for future reviewers, though the code itself is correct.
.github/workflows/pr-tests.yml (line 331): Minor style nit: $SSH and $SUDO are unquoted variables used for intentional word-splitting, which shellcheck would flag (SC2086). This is correct and intentional here, but a brief comment noting the deliberate split (as exists for other non-obvious choices in this file) would make the intent explicit for future readers.
foxpatch-aleph
left a comment
This PR cleanly migrates CRN droplets from debian-12 to ubuntu-24.04, pins aleph-vm to the od/fix-confidential-reconcile-teardown branch for confidential init-session validation, and improves TEE log collection (switching from fixed -n line budgets to --since time windows, and adding the aleph-vm-agent service log). The quoting of the since variable through SSH is correct, the --allow-downgrades flag is well-justified for git-describe version comparisons on the static TEE server, and the debian-12→ubuntu-24.04 changes are consistent across pr-tests.yml, crn-up.sh, and manifesto.yml. The PR description mentions dev-accelerate/#977 but the actual code pins to od/fix-confidential-reconcile-teardown/#987 — the commit message confirms #987, so the description is stale. No correctness or security issues found.
scripts/confidential-artifacts.sh (line 23): Stale comment: "Prefer a debian-12 CRN" — all CRNs now run Ubuntu 24.04 per this PR's changes to pr-tests.yml and crn-up.sh. Not in the diff but contextually inconsistent with the rest of the migration.
.github/workflows/pr-tests.yml (line 330): Minor inconsistency: CRN log collection (line 314) still uses -n 500 (fixed line budget) while TEE logs now use --since '45 min ago'. CRN droplets are fresh per-run so crash-loop noise is less of a concern there, but aligning both would be cleaner.
manifesto.yml (line 45): PR description says this pins to dev-accelerate (aleph-vm#977) but the actual branch is od/fix-confidential-reconcile-teardown (aleph-vm#987). The commit message confirms #987. Consider updating the PR description to match.
The generated supervisor.env omitted ALEPH_VM_SUPERVISOR_GRPC_SOCKET, so the agent ran its own in-process VmPool and never dialed the supervisor daemon. The daemon ran idle alongside it. Every prior green #27 validated the in-process path, not the two-process gRPC connector the boundary work is for. Set the socket so the agent talks to the daemon over gRPC. Expect split-mode gaps to surface (endpoints that still reach the in-process pool: backups, restore, confidential, migration, network recreation, GPU reservation, persistent programs).
foxpatch-aleph
left a comment
The PR correctly migrates all CRN droplets from Debian 12 to Ubuntu 24.04, updates the .deb artifact naming, forces gRPC split mode, and improves TEE log collection. The core logic is sound — the DO image slug format (hyphens) correctly differs from the .deb artifact naming (dots matching /etc/os-release), the gRPC socket path is safe relative to tee-reset.sh (reset runs before supervisor start), and the shell quoting for journalctl --since is correct. However, there are two categories of non-blocking issues: (1) the PR description is stale — it says the branch pin is "dev-accelerate" but the code actually pins "od/fix-confidential-reconcile-teardown", and the commit message is about gRPC split mode rather than the confidential fix; (2) two comments still reference debian-12 as the CRN/sevctl baseline, contradicting the updated comments elsewhere in the same PR that correctly state all CRNs now run Ubuntu 24.04.
scripts/crn-up.sh (line 453): Stale comment: "TEE server may not run debian-12 like the DO droplets do" — the DO droplets now run ubuntu-24-04-x64 (line 254), so this reference to debian-12 as the baseline is outdated. The distro-detection logic itself is still valid (the static TEE server may differ from DO droplets), but the example should say ubuntu-24.04.
scripts/confidential-artifacts.sh (line 23): Stale comment: "Prefer a debian-12 CRN: its sevctl is built against the oldest glibc" — this contradicts the updated comment in pr-tests.yml (line 283-285) which correctly states all CRNs and the TEE server now run Ubuntu 24.04. The glibc-baseline reasoning no longer applies since the SEVCTL_HOST (CRN 0) and TEE server share the same distro. Note: the BASE_IMAGE_URL at line 67 (debian-12 guest rootfs) is correctly left unchanged — that's the confidential VM guest OS, not the host.
manifesto.yml (line 45): PR description mismatch: the PR description says this pins to branch "dev-accelerate" (aleph-vm#977), but the actual code pins "od/fix-confidential-reconcile-teardown" (PR #987). The comment block (lines 36-44) describes #987's confidential fix, which matches the code, but the PR title/description should be updated to reflect what the code actually does.
The manifesto pinned od/fix-confidential-reconcile-teardown, deleted after #987 merged, so crn-up.sh could not fetch a build. dev now carries the full boundary stack + the firecracker Configuration vm_id fix, and crn-up.sh sets the gRPC socket. Deploy dev to exercise the two-process connector end-to-end for the first time.
foxpatch-aleph
left a comment
The PR correctly migrates all CRN droplets from debian-12 to ubuntu-24.04, switches the aleph-vm source to the dev branch to validate the gRPC supervisor/agent split, and improves TEE log collection. However, tee-reset.sh is not updated to stop the new aleph-vm-agent.service before wiping state — with the two-service split, the agent will keep running while /var/lib/aleph/vm/* is deleted underneath it, potentially causing the agent to serve stale or broken state to the next CI run. The PR description also references branch "dev-accelerate" while the diff pins "dev".
scripts/tee-reset.sh (line 28): With the two-service split introduced by this PR, the agent (aleph-vm-agent.service) is a separate systemd unit. tee-reset.sh stops the supervisor but not the agent, so after line 28 the agent is still running while line 36 wipes /var/lib/aleph/vm/. The agent will continue serving /control/ and /v2/about/executions/list against stale/missing state until it crashes or the next install restarts it. Add remote "systemctl stop aleph-vm-agent.service 2>/dev/null || true" after the supervisor stop.
scripts/confidential-artifacts.sh (line 23): The comment still says "Prefer a debian-12 CRN: its sevctl is built against the oldest glibc" but this PR moves all CRNs to ubuntu-24.04. The workflow comment at pr-tests.yml:283 was updated to reflect the new uniform distro, but this script's SEVCTL_HOST doc was left stale. Update to match the new reality (all hosts run ubuntu-24.04).
manifesto.yml (line 43): The PR description says branch: "dev-accelerate" but the actual value here is "dev". The inline comment also says "Deploy from dev". Confirm that "dev" is the correct branch name — if the intent was to validate the dev-accelerate branch (aleph-vm#977), this may be the wrong ref.
scripts/crn-up.sh (line 364): ALEPH_VM_SUPERVISOR_GRPC_SOCKET is written unconditionally into the env file template. When this PR is reverted to version: "1.13.0", the socket env var will remain in the template. If 1.13.0 doesn't support the gRPC socket split, this could cause issues on revert. Consider gating this on the branch path (only when local_deb is set) or document that the revert must also remove this line.
Why
Validate the aleph-vm dev-accelerate branch (aleph-vm#977: gRPC supervisor rework, integration test suite, and directory-based VM migration) against the live cross-CRN migration test before merging.
What
Pins manifesto.yml → components.aleph-vm from version: "1.13.0" to branch: "dev-accelerate". crn-up.sh then fetches aleph-vm.debian-12.deb from that branch's build-deb-package-and-integration-tests.yml CI build instead of a tagged release.
The latest dev-accelerate build is green and produces the artifact, so the fetch step will resolve.
What this exercises
tests/test_migration.py: create instance → write marker over SSH → unlink the hosting CRN → wait for scheduler reallocation → verify the marker survived on the new CRN (disk state preserved across migration).
Revert
This is a temporary pin. Revert to version: "1.13.0" once aleph-vm#977 lands.