Description
Problem
Found a corner case for hosts entering maintenance mode, in which CloudStack selects hosts that are already in maintenance mode as destination hosts for the VM migrations away from the host, causing those VMs to be stopped.
Had the following scenario on a 4.20.1 environment with 4 KVM hosts on a single cluster:
- Host 1: Enabled, VMs running
- Host 2: In maintenance mode
- Host 3: Enabled, VMs running
- Host 4: In maintenance mode
Example
- Triggered maintenance mode on host 1
- Observed that host 1 ended up in maintenance mode; however, one of the previously running VMs had been stopped. Was able to start it manually on host 3 after detecting it.
The VM with id = 9 was running on host 1 and got stopped after an attempted migration to host 2 (which was in maintenance mode).
Triggered prepare for maintenance on host 1 and observed that the VM migration was scheduled:
2025-11-11 13:30:03,940 DEBUG [c.c.a.ApiServlet] (qtp638169719-14:[ctx-b5ffe9bb, ctx-dd7a138a]) (logid:71054204) ===END=== 10.0.3.251 -- GET id=16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5&command=prepareHostForMaintenance&response=json&sessionkey=kFIXNlVUZjUW_e0ucKeG6uV1Wso
2025-11-11 13:30:03,941 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl$5] (API-Job-Executor-8:[ctx-950aa71a, job-124]) (logid:9199a267) Executing AsyncJob {"accountId":2,"cmd":"org.apache.cloudstack.api.command.admin.host.PrepareForMaintenanceCmd","cmdInfo":"{\"response\":\"json\",\"ctxUserId\":\"2\",\"sessionkey\":\"kFIXNlVUZjUW_e0ucKeG6uV1Wso\",\"httpmethod\":\"GET\",\"ctxStartEventId\":\"271\",\"id\":\"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5\",\"ctxDetails\":\"{\\\"interface com.cloud.host.Host\\\":\\\"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5\\\"}\",\"ctxAccountId\":\"2\",\"uuid\":\"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5\",\"cmdEventType\":\"MAINT.PREPARE\"}","cmdVersion":0,"completeMsid":null,"created":null,"id":124,"initMsid":32989174039467,"instanceId":1,"instanceType":"Host","lastPolled":null,"lastUpdated":null,"processStatus":0,"removed":null,"result":null,"resultCode":0,"status":"IN_PROGRESS","userId":2,"uuid":"9199a267-9780-4407-adf1-e816aa608ae4"}
2025-11-11 13:30:03,952 INFO [c.c.r.ResourceManagerImpl] (API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267) Maintenance: attempting maintenance of host Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}
....
2025-11-11 13:30:04,101 INFO [c.c.r.ResourceManagerImpl] (API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267) Maintenance: scheduling migration of VM VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"} from host Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}
2025-11-11 13:30:04,104 WARN [o.a.c.f.j.AsyncJobExecutionContext] (HA-Worker-4:[ctx-fdd5cf9e, work-18]) (logid:616dae9c) Job is executed without a context, setup psudo job for the executing thread
2025-11-11 13:30:04,104 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (HA-Worker-3:[ctx-7427107b, work-17]) (logid:77866f07) Sync job-129 execution on object VmWorkJobQueue.5
2025-11-11 13:30:04,123 INFO [c.c.h.HighAvailabilityManagerExtImpl] (API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267) Scheduled migration work of VM VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"} from host Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"} with HAWork HAWork[19-Migration-9-Running-Scheduled]
2025-11-11 13:30:04,124 DEBUG [c.c.h.HighAvailabilityManagerExtImpl] (API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267) Wakeup workers HA
2025-11-11 13:30:04,127 INFO [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-0:[ctx-934bad64, work-19]) (logid:72bb5199) Processing work HAWork[19-Migration-9-Running-Scheduled]
2025-11-11 13:30:04,129 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (HA-Worker-4:[ctx-fdd5cf9e, work-18]) (logid:616dae9c) Sync job-130 execution on object VmWorkJobQueue.8
2025-11-11 13:30:04,129 INFO [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-0:[ctx-934bad64, work-19]) (logid:72bb5199) Migration attempt: for VM VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}from host Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}. Starting attempt: 1/5 times.
2025-11-11 13:30:04,130 INFO [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-0:[ctx-934bad64, work-19]) (logid:72bb5199) Migration attempt: for VM b39f1e40-9576-4553-92b3-6e3e84d372ebfrom host id 1. Starting attempt: 1/5 times.
Observed that CloudStack wrongly selected host 2 (which was in maintenance mode) as the destination host for the VM with id = 9, which caused the VM to be stopped:
2025-11-11 13:30:05,730 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) Last host [Host {"id":2,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2","type":"Routing","uuid":"9716940e-7eef-475c-b88c-1f00caa4fe6d"}] of VM [VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}] is UP and has enough capacity. Checking for suitable pools for this host under zone [Zone {"id": "1", "name": "ref-trl-10084-k-Mr8-nicolas-vazquez", "uuid": "c6bb7e3f-388d-4318-8403-04d7f70d3862"}], pod [HostPod {"id":1,"name":"Pod1","uuid":"46052b42-8089-4b24-bd46-ca9fb938964f"}] and cluster [Cluster {id: "1", name: "p1-c1", uuid: "2e2b4642-b4a5-46c1-ba26-ab54880b9b53"}].
2025-11-11 13:30:05,732 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) Checking suitable pools for volume [Volume {"id":10,"instanceId":9,"name":"ROOT-9","uuid":"95ea3693-b4a4-49c4-9b21-88084cc295f2","volumeType":"ROOT"}, ROOT] of VM [VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}].
2025-11-11 13:30:05,733 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) Volume [Volume {"id":10,"instanceId":9,"name":"ROOT-9","uuid":"95ea3693-b4a4-49c4-9b21-88084cc295f2","volumeType":"ROOT"}] of VM [VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}] has pool [StoragePool {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm-pri1","poolType":"NetworkFilesystem","uuid":"162a7420-8b58-3701-bf2e-e27b2bf328ed"}] already specified. Checking if this pool can be reused.
2025-11-11 13:30:05,734 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-11:[ctx-237107a3, job-119/job-130, ctx-406ba4b3]) (logid:506b69a7) Service SecurityGroup is not supported in the network Network {"id": 204, "name": "Net1", "uuid": "52cde633-2f3b-4305-825d-0dc1abca1493", "networkofferingid": 10}
2025-11-11 13:30:05,734 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) Pool [StoragePool {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm-pri1","poolType":"NetworkFilesystem","uuid":"162a7420-8b58-3701-bf2e-e27b2bf328ed"}] of volume [Volume {"id":10,"instanceId":9,"name":"ROOT-9","uuid":"95ea3693-b4a4-49c4-9b21-88084cc295f2","volumeType":"ROOT"}] used by VM [VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}] fits the specified plan. No need to reallocate a pool for this volume.
2025-11-11 13:30:05,735 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) Trying to find a potenial host and associated storage pools from the suitable host/pool lists for this VM
....
2025-11-11 13:30:05,793 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-10:[ctx-5279597a, job-128/job-129, ctx-97af9533]) (logid:12e4154a) Host has enough CPU and RAM available
2025-11-11 13:30:05,793 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-10:[ctx-5279597a, job-128/job-129, ctx-97af9533]) (logid:12e4154a) STATS: Can alloc CPU from host: Host {"id":3,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm3","type":"Routing","uuid":"95e251a6-5ffb-436d-b287-df2a77b80165"}, used: 2500, reserved: 0, actual total: 6000, total with overprovisioning: 12000; requested cpu: 500, alloc_from_last_host?: false, considerReservedCapacity?: true
2025-11-11 13:30:05,793 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-10:[ctx-5279597a, job-128/job-129, ctx-97af9533]) (logid:12e4154a) STATS: Can alloc MEM from host: Host {"id":3,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm3","type":"Routing","uuid":"95e251a6-5ffb-436d-b287-df2a77b80165"}, used: (2.50 GB) 2684354560, reserved: (0 bytes) 0, total: (6.59 GB) 7072067584; requested mem: (512.00 MB) 536870912, alloc_from_last_host?: false, considerReservedCapacity?: true
2025-11-11 13:30:05,796 WARN [c.c.v.ClusteredVirtualMachineManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) Unable to migrate VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"} to Host {"id":2,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2","type":"Routing","uuid":"9716940e-7eef-475c-b88c-1f00caa4fe6d"} due to [Resource [Host:2] is unreachable: Host 2: Unable to send class com.cloud.agent.api.PrepareForMigrationCommand because agent ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2 is in maintenance mode]
com.cloud.exception.AgentUnavailableException: Resource [Host:2] is unreachable: Host 2: Unable to send class com.cloud.agent.api.PrepareForMigrationCommand because agent ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2 is in maintenance mode
at com.cloud.agent.manager.AgentAttache.checkAvailability(AgentAttache.java:190)
at com.cloud.agent.manager.ClusteredAgentAttache.checkAvailability(ClusteredAgentAttache.java:80)
at com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:358)
at com.cloud.agent.manager.ClusteredAgentAttache.send(ClusteredAgentAttache.java:131)
at com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:404)
at com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:570)
at com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:414)
at com.cloud.vm.VirtualMachineManagerImpl.migrate(VirtualMachineManagerImpl.java:2798)
at com.cloud.vm.VirtualMachineManagerImpl.orchestrateMigrateAway(VirtualMachineManagerImpl.java:3497)
at com.cloud.vm.VirtualMachineManagerImpl.orchestrateMigrateAway(VirtualMachineManagerImpl.java:5522)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at com.cloud.vm.VmWorkJobHandlerProxy.handleVmWorkJob(VmWorkJobHandlerProxy.java:102)
at com.cloud.vm.VirtualMachineManagerImpl.handleVmWorkJob(VirtualMachineManagerImpl.java:5610)
at com.cloud.vm.VmWorkJobDispatcher.runJob(VmWorkJobDispatcher.java:99)
at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:652)
at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:600)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
......
2025-11-11 13:30:05,809 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Stopping","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"} state transited from [Running] to [Stopping] with event [StopRequested]. VM's original host: Host {"id":2,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2","type":"Routing","uuid":"9716940e-7eef-475c-b88c-1f00caa4fe6d"}, new host: Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}, host before state transition: Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}
2025-11-11 13:30:05,810 DEBUG [c.c.v.UserVmManagerImpl] (Work-Job-Executor-12:[c
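Note in the DeploymentPlanningManagerImpl log above that host 2 (the VM's last host) passes as "UP and has enough capacity" even though it is in maintenance mode: in CloudStack the agent connection status (Up/Down/Disconnected) is tracked separately from the host's resource state (Enabled/Maintenance/...), so a host in maintenance can still report as UP. A minimal sketch of the distinction, using simplified stand-in types rather than the actual CloudStack classes (`AgentStatus`, `HostInfo`, and both method names are hypothetical):

```java
// Simplified stand-ins for CloudStack's com.cloud.host.Status and
// com.cloud.resource.ResourceState; not the real classes.
enum AgentStatus { UP, DOWN, DISCONNECTED }
enum ResourceState { ENABLED, DISABLED, PREPARE_FOR_MAINTENANCE, MAINTENANCE }

record HostInfo(long id, AgentStatus status, ResourceState resourceState) {}

public class LastHostCheck {
    // The kind of check the log output suggests: agent status only,
    // with resource state never consulted.
    static boolean lastHostLooksUsable(HostInfo host) {
        return host.status() == AgentStatus.UP;
    }

    // The stricter check that would reject host 2 in this scenario.
    static boolean isValidMigrationDestination(HostInfo host) {
        return host.status() == AgentStatus.UP
                && host.resourceState() == ResourceState.ENABLED;
    }

    public static void main(String[] args) {
        HostInfo host2 = new HostInfo(2, AgentStatus.UP, ResourceState.MAINTENANCE);
        System.out.println(lastHostLooksUsable(host2));          // true  -> selected
        System.out.println(isValidMigrationDestination(host2));  // false -> skipped
    }
}
```

Any destination selection based on agent status and capacity alone will happily pick host 2 here.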
Versions
CloudStack version 4.20.1
Tested with a single cluster of 4 KVM hosts + NFS cluster-wide storage
The steps to reproduce the bug
Described above
What to do about it?
The migrate-away logic must not consider hosts in maintenance mode (or any other invalid resource state) as destination hosts.
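As a rough illustration, here is a self-contained sketch of the kind of filter the fix could apply when building the candidate destination list (again with simplified stand-in types; `CandidateHost`, `eligibleDestinations`, and the enum values are hypothetical, and where exactly the filter belongs in the deployment-planning/HA code paths is left to the fix):

```java
import java.util.List;

// Simplified stand-in for CloudStack's com.cloud.resource.ResourceState.
enum ResourceState { ENABLED, DISABLED, PREPARE_FOR_MAINTENANCE, MAINTENANCE, ERROR_IN_MAINTENANCE }

record CandidateHost(long id, String name, ResourceState resourceState) {}

public class MigrateAwayDestinationFilter {
    // Drop every candidate whose resource state does not allow new VMs:
    // only Enabled hosts remain eligible as migration destinations.
    static List<CandidateHost> eligibleDestinations(List<CandidateHost> candidates) {
        return candidates.stream()
                .filter(h -> h.resourceState() == ResourceState.ENABLED)
                .toList();
    }

    public static void main(String[] args) {
        List<CandidateHost> cluster = List.of(
                new CandidateHost(2, "kvm2", ResourceState.MAINTENANCE),
                new CandidateHost(3, "kvm3", ResourceState.ENABLED),
                new CandidateHost(4, "kvm4", ResourceState.MAINTENANCE));
        System.out.println(eligibleDestinations(cluster)); // only host 3 (kvm3) remains
    }
}
```

Filtering on resource state up front would also avoid burning the HA worker's limited migration attempts (5, per the "Starting attempt: 1/5 times" log above) on hosts that can never accept the VM.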