
[BUG] VMs stopped after automatic selection of hosts in maintenance for VM migrations away from hosts #12045

@nvazquez

Description

Problem

I have found a corner case when hosts enter maintenance mode: CloudStack can select hosts that are already in maintenance mode as destination hosts for the VM migrations away from the host entering maintenance, causing those VMs to be stopped.

I hit the following scenario on a 4.20.1 environment with 4 KVM hosts in a single cluster:

  • Host 1: Enabled, VMs running
  • Host 2: In maintenance mode
  • Host 3: Enabled, VMs running
  • Host 4: In maintenance mode

Example

  • Triggered maintenance mode on host 1 (a sketch of the triggering API call follows this list)
  • The end result was host 1 in maintenance mode as expected; however, one of the previously running VMs had been stopped. After detecting it, I was able to start it manually on host 3.
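
For convenience, here is a minimal sketch of triggering the same prepareHostForMaintenance API call programmatically. The endpoint, API key, secret key, and host UUID are placeholders to be replaced with values from your own environment; the request signing follows CloudStack's documented scheme (alphabetically sorted parameters, lowercased URL-encoded query string, HMAC-SHA1, Base64).

    // Hypothetical reproduction helper -- endpoint, keys and host UUID are placeholders.
    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import java.util.Map;
    import java.util.TreeMap;
    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;

    public class PrepareHostForMaintenance {
        public static void main(String[] args) throws Exception {
            String endpoint = "http://MGMT_SERVER:8080/client/api"; // placeholder
            String apiKey = "YOUR_API_KEY";                         // placeholder
            String secretKey = "YOUR_SECRET_KEY";                   // placeholder
            String hostId = "16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"; // host 1 in this scenario

            // Sorted parameters; CloudStack signs the lowercased, URL-encoded query string.
            Map<String, String> params = new TreeMap<>();
            params.put("apikey", apiKey);
            params.put("command", "prepareHostForMaintenance");
            params.put("id", hostId);
            params.put("response", "json");

            StringBuilder query = new StringBuilder();
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (query.length() > 0) {
                    query.append('&');
                }
                // Note: values here contain no spaces; a full implementation must
                // encode spaces as %20 (URLEncoder uses '+') before signing.
                query.append(e.getKey()).append('=')
                     .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
            }

            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA1"));
            String signature = Base64.getEncoder().encodeToString(
                    mac.doFinal(query.toString().toLowerCase().getBytes(StandardCharsets.UTF_8)));

            String url = endpoint + "?" + query + "&signature="
                    + URLEncoder.encode(signature, StandardCharsets.UTF_8);
            HttpResponse<String> resp = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println(resp.body()); // returns the async job id for the maintenance task
        }
    }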

The VM with id = 9 was running on host 1 and got stopped after CloudStack attempted to migrate it to host 2 (which was in maintenance mode).

I triggered prepare for maintenance on host 1 and observed that the VM migration was scheduled:

2025-11-11 13:30:03,940 DEBUG [c.c.a.ApiServlet] (qtp638169719-14:[ctx-b5ffe9bb, ctx-dd7a138a]) (logid:71054204) ===END===  10.0.3.251 -- GET  id=16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5&command=prepareHostForMaintenance&response=json&sessionkey=kFIXNlVUZjUW_e0ucKeG6uV1Wso
2025-11-11 13:30:03,941 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl$5] (API-Job-Executor-8:[ctx-950aa71a, job-124]) (logid:9199a267) Executing AsyncJob {"accountId":2,"cmd":"org.apache.cloudstack.api.command.admin.host.PrepareForMaintenanceCmd","cmdInfo":"{\"response\":\"json\",\"ctxUserId\":\"2\",\"sessionkey\":\"kFIXNlVUZjUW_e0ucKeG6uV1Wso\",\"httpmethod\":\"GET\",\"ctxStartEventId\":\"271\",\"id\":\"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5\",\"ctxDetails\":\"{\\\"interface com.cloud.host.Host\\\":\\\"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5\\\"}\",\"ctxAccountId\":\"2\",\"uuid\":\"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5\",\"cmdEventType\":\"MAINT.PREPARE\"}","cmdVersion":0,"completeMsid":null,"created":null,"id":124,"initMsid":32989174039467,"instanceId":1,"instanceType":"Host","lastPolled":null,"lastUpdated":null,"processStatus":0,"removed":null,"result":null,"resultCode":0,"status":"IN_PROGRESS","userId":2,"uuid":"9199a267-9780-4407-adf1-e816aa608ae4"}
2025-11-11 13:30:03,952 INFO  [c.c.r.ResourceManagerImpl] (API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267) Maintenance: attempting maintenance of host Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}
....
2025-11-11 13:30:04,101 INFO  [c.c.r.ResourceManagerImpl] (API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267) Maintenance: scheduling migration of VM VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"} from host Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}
2025-11-11 13:30:04,104 WARN  [o.a.c.f.j.AsyncJobExecutionContext] (HA-Worker-4:[ctx-fdd5cf9e, work-18]) (logid:616dae9c) Job is executed without a context, setup psudo job for the executing thread
2025-11-11 13:30:04,104 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (HA-Worker-3:[ctx-7427107b, work-17]) (logid:77866f07) Sync job-129 execution on object VmWorkJobQueue.5
2025-11-11 13:30:04,123 INFO  [c.c.h.HighAvailabilityManagerExtImpl] (API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267) Scheduled migration work of VM VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"} from host Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"} with HAWork HAWork[19-Migration-9-Running-Scheduled]
2025-11-11 13:30:04,124 DEBUG [c.c.h.HighAvailabilityManagerExtImpl] (API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267) Wakeup workers HA
2025-11-11 13:30:04,127 INFO  [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-0:[ctx-934bad64, work-19]) (logid:72bb5199) Processing work HAWork[19-Migration-9-Running-Scheduled]
2025-11-11 13:30:04,129 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (HA-Worker-4:[ctx-fdd5cf9e, work-18]) (logid:616dae9c) Sync job-130 execution on object VmWorkJobQueue.8
2025-11-11 13:30:04,129 INFO  [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-0:[ctx-934bad64, work-19]) (logid:72bb5199) Migration attempt: for VM VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}from host Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}. Starting attempt: 1/5 times.
2025-11-11 13:30:04,130 INFO  [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-0:[ctx-934bad64, work-19]) (logid:72bb5199) Migration attempt: for VM b39f1e40-9576-4553-92b3-6e3e84d372ebfrom host id 1. Starting attempt: 1/5 times.

CloudStack then wrongly selected host 2 as the destination host for the VM with id = 9, even though that host was in maintenance mode, which caused the VM to be stopped:

2025-11-11 13:30:05,730 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) Last host [Host {"id":2,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2","type":"Routing","uuid":"9716940e-7eef-475c-b88c-1f00caa4fe6d"}] of VM [VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}] is UP and has enough capacity. Checking for suitable pools for this host under zone [Zone {"id": "1", "name": "ref-trl-10084-k-Mr8-nicolas-vazquez", "uuid": "c6bb7e3f-388d-4318-8403-04d7f70d3862"}], pod [HostPod {"id":1,"name":"Pod1","uuid":"46052b42-8089-4b24-bd46-ca9fb938964f"}] and cluster [Cluster {id: "1", name: "p1-c1", uuid: "2e2b4642-b4a5-46c1-ba26-ab54880b9b53"}].
2025-11-11 13:30:05,732 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) Checking suitable pools for volume [Volume {"id":10,"instanceId":9,"name":"ROOT-9","uuid":"95ea3693-b4a4-49c4-9b21-88084cc295f2","volumeType":"ROOT"}, ROOT] of VM [VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}].
2025-11-11 13:30:05,733 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) Volume [Volume {"id":10,"instanceId":9,"name":"ROOT-9","uuid":"95ea3693-b4a4-49c4-9b21-88084cc295f2","volumeType":"ROOT"}] of VM [VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}] has pool [StoragePool {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm-pri1","poolType":"NetworkFilesystem","uuid":"162a7420-8b58-3701-bf2e-e27b2bf328ed"}] already specified. Checking if this pool can be reused.
2025-11-11 13:30:05,734 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-11:[ctx-237107a3, job-119/job-130, ctx-406ba4b3]) (logid:506b69a7) Service SecurityGroup is not supported in the network Network {"id": 204, "name": "Net1", "uuid": "52cde633-2f3b-4305-825d-0dc1abca1493", "networkofferingid": 10}
2025-11-11 13:30:05,734 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) Pool [StoragePool {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm-pri1","poolType":"NetworkFilesystem","uuid":"162a7420-8b58-3701-bf2e-e27b2bf328ed"}] of volume [Volume {"id":10,"instanceId":9,"name":"ROOT-9","uuid":"95ea3693-b4a4-49c4-9b21-88084cc295f2","volumeType":"ROOT"}] used by VM [VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}] fits the specified plan. No need to reallocate a pool for this volume.
2025-11-11 13:30:05,735 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) Trying to find a potenial host and associated storage pools from the suitable host/pool lists for this VM
....
2025-11-11 13:30:05,793 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-10:[ctx-5279597a, job-128/job-129, ctx-97af9533]) (logid:12e4154a) Host has enough CPU and RAM available
2025-11-11 13:30:05,793 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-10:[ctx-5279597a, job-128/job-129, ctx-97af9533]) (logid:12e4154a) STATS: Can alloc CPU from host: Host {"id":3,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm3","type":"Routing","uuid":"95e251a6-5ffb-436d-b287-df2a77b80165"}, used: 2500, reserved: 0, actual total: 6000, total with overprovisioning: 12000; requested cpu: 500, alloc_from_last_host?: false, considerReservedCapacity?: true
2025-11-11 13:30:05,793 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-10:[ctx-5279597a, job-128/job-129, ctx-97af9533]) (logid:12e4154a) STATS: Can alloc MEM from host: Host {"id":3,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm3","type":"Routing","uuid":"95e251a6-5ffb-436d-b287-df2a77b80165"}, used: (2.50 GB) 2684354560, reserved: (0 bytes) 0, total: (6.59 GB) 7072067584; requested mem: (512.00 MB) 536870912, alloc_from_last_host?: false, considerReservedCapacity?: true
2025-11-11 13:30:05,796 WARN  [c.c.v.ClusteredVirtualMachineManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) Unable to migrate VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"} to Host {"id":2,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2","type":"Routing","uuid":"9716940e-7eef-475c-b88c-1f00caa4fe6d"} due to [Resource [Host:2] is unreachable: Host 2: Unable to send class com.cloud.agent.api.PrepareForMigrationCommand because agent ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2 is in maintenance mode] com.cloud.exception.AgentUnavailableException: Resource [Host:2] is unreachable: Host 2: Unable to send class com.cloud.agent.api.PrepareForMigrationCommand because agent ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2 is in maintenance mode
        at com.cloud.agent.manager.AgentAttache.checkAvailability(AgentAttache.java:190)
        at com.cloud.agent.manager.ClusteredAgentAttache.checkAvailability(ClusteredAgentAttache.java:80)
        at com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:358)
        at com.cloud.agent.manager.ClusteredAgentAttache.send(ClusteredAgentAttache.java:131)
        at com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:404)
        at com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:570)
        at com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:414)
        at com.cloud.vm.VirtualMachineManagerImpl.migrate(VirtualMachineManagerImpl.java:2798)
        at com.cloud.vm.VirtualMachineManagerImpl.orchestrateMigrateAway(VirtualMachineManagerImpl.java:3497)
        at com.cloud.vm.VirtualMachineManagerImpl.orchestrateMigrateAway(VirtualMachineManagerImpl.java:5522)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at com.cloud.vm.VmWorkJobHandlerProxy.handleVmWorkJob(VmWorkJobHandlerProxy.java:102)
        at com.cloud.vm.VirtualMachineManagerImpl.handleVmWorkJob(VirtualMachineManagerImpl.java:5610)
        at com.cloud.vm.VmWorkJobDispatcher.runJob(VmWorkJobDispatcher.java:99)
        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:652)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:600)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
......
2025-11-11 13:30:05,809 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Stopping","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"} state transited from [Running] to [Stopping] with event [StopRequested]. VM's original host: Host {"id":2,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2","type":"Routing","uuid":"9716940e-7eef-475c-b88c-1f00caa4fe6d"}, new host: Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}, host before state transition: Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}
2025-11-11 13:30:05,810 DEBUG [c.c.v.UserVmManagerImpl] (Work-Job-Executor-12:[c
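
Reading the planner log above, host 2 is accepted through the "last host" path because it is UP and has enough capacity; its resource state (Maintenance) is apparently never consulted, and the mistake only surfaces later, when the agent rejects the PrepareForMigrationCommand and the VM ends up stopped. A minimal illustration of that decision, using stand-in types rather than the actual CloudStack classes:

    // Illustration only -- Status/ResourceState/host values are stand-ins, not CloudStack's classes.
    public class LastHostCheckIllustration {
        enum Status { Up, Down }
        enum ResourceState { Enabled, Disabled, PrepareForMaintenance, Maintenance }

        // What the log above suggests is effectively evaluated today: agent status
        // and capacity are checked, but the resource state is ignored.
        static boolean acceptedAsLastHost(Status status, ResourceState state,
                                          int freeCpuMhz, int requestedCpuMhz) {
            return status == Status.Up && freeCpuMhz >= requestedCpuMhz; // 'state' never consulted
        }

        public static void main(String[] args) {
            // Host 2 from the scenario: resource state Maintenance, but its agent is still Up.
            boolean accepted = acceptedAsLastHost(Status.Up, ResourceState.Maintenance, 3500, 500);
            System.out.println(accepted); // true -> the VM is sent to a host in maintenance
        }
    }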

Versions

CloudStack version 4.20.1
Tested with a single cluster of 4 KVM hosts + NFS cluster-wide storage

The steps to reproduce the bug

Described above

What to do about it?

The migrate-away logic must not consider hosts in maintenance mode (or any other invalid state) as destination hosts.
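
A minimal sketch of such a guard, again with stand-in types rather than CloudStack's real Status and ResourceState classes; the actual fix would belong in the migrate-away / deployment-planning path:

    // Sketch of the proposed guard -- stand-in types, not the actual CloudStack classes.
    public class MigrationDestinationGuard {
        enum Status { Up, Down }
        enum ResourceState { Enabled, Disabled, PrepareForMaintenance, Maintenance }

        // Only hosts that are both Up and Enabled qualify as migrate-away destinations;
        // Maintenance, PrepareForMaintenance, Disabled, etc. are all excluded.
        static boolean isValidMigrationDestination(Status status, ResourceState state) {
            return status == Status.Up && state == ResourceState.Enabled;
        }

        public static void main(String[] args) {
            System.out.println(isValidMigrationDestination(Status.Up, ResourceState.Maintenance)); // false
            System.out.println(isValidMigrationDestination(Status.Up, ResourceState.Enabled));     // true
        }
    }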
