Description
Problem
Found a corner case for hosts entering maintenance mode, in which CloudStack selects hosts that are already in maintenance mode as destination hosts for the VM migrations away from the host, causing those VMs to be stopped.
Had the following scenario on a 4.20.1 environment with 4 KVM hosts on a single cluster:
- Host 1: Enabled, VMs running
- Host 2: In maintenance mode
- Host 3: Enabled, VMs running
- Host 4: In maintenance mode
Example
- Triggered maintenance mode on host 1
- Observed that host 1 ended up in maintenance mode; however, one of the previously running VMs had been stopped. Was able to start it manually on host 3 after detecting it.
The VM with id = 9 was running on host 1 and got stopped after an attempted migration to host 2 (which was in maintenance mode).
Triggered prepare for maintenance on host 1 and observed that the VM migration was scheduled:
2025-11-11 13:30:03,940 DEBUG [c.c.a.ApiServlet] (qtp638169719-14:[ctx-b5ffe9bb, ctx-dd7a138a]) (logid:71054204) ===END=== 10.0.3.251 -- GET id=16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5&command=prepareHostForMaintenance&response=json&sessionkey=kFIXNlVUZjUW_e0ucKeG6uV1Wso
2025-11-11 13:30:03,941 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl$5] (API-Job-Executor-8:[ctx-950aa71a, job-124]) (logid:9199a267) Executing AsyncJob {"accountId":2,"cmd":"org.apache.cloudstack.api.command.admin.host.PrepareForMaintenanceCmd","cmdInfo":"{\"response\":\"json\",\"ctxUserId\":\"2\",\"sessionkey\":\"kFIXNlVUZjUW_e0ucKeG6uV1Wso\",\"httpmethod\":\"GET\",\"ctxStartEventId\":\"271\",\"id\":\"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5\",\"ctxDetails\":\"{\\\"interface com.cloud.host.Host\\\":\\\"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5\\\"}\",\"ctxAccountId\":\"2\",\"uuid\":\"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5\",\"cmdEventType\":\"MAINT.PREPARE\"}","cmdVersion":0,"completeMsid":null,"created":null,"id":124,"initMsid":32989174039467,"instanceId":1,"instanceType":"Host","lastPolled":null,"lastUpdated":null,"processStatus":0,"removed":null,"result":null,"resultCode":0,"status":"IN_PROGRESS","userId":2,"uuid":"9199a267-9780-4407-adf1-e816aa608ae4"}
2025-11-11 13:30:03,952 INFO [c.c.r.ResourceManagerImpl] (API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267) Maintenance: attempting maintenance of host Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}
....
2025-11-11 13:30:04,101 INFO [c.c.r.ResourceManagerImpl] (API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267) Maintenance: scheduling migration of VM VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"} from host Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}
2025-11-11 13:30:04,104 WARN [o.a.c.f.j.AsyncJobExecutionContext] (HA-Worker-4:[ctx-fdd5cf9e, work-18]) (logid:616dae9c) Job is executed without a context, setup psudo job for the executing thread
2025-11-11 13:30:04,104 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (HA-Worker-3:[ctx-7427107b, work-17]) (logid:77866f07) Sync job-129 execution on object VmWorkJobQueue.5
2025-11-11 13:30:04,123 INFO [c.c.h.HighAvailabilityManagerExtImpl] (API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267) Scheduled migration work of VM VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"} from host Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"} with HAWork HAWork[19-Migration-9-Running-Scheduled]
2025-11-11 13:30:04,124 DEBUG [c.c.h.HighAvailabilityManagerExtImpl] (API-Job-Executor-8:[ctx-950aa71a, job-124, ctx-ed4ed961]) (logid:9199a267) Wakeup workers HA
2025-11-11 13:30:04,127 INFO [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-0:[ctx-934bad64, work-19]) (logid:72bb5199) Processing work HAWork[19-Migration-9-Running-Scheduled]
2025-11-11 13:30:04,129 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (HA-Worker-4:[ctx-fdd5cf9e, work-18]) (logid:616dae9c) Sync job-130 execution on object VmWorkJobQueue.8
2025-11-11 13:30:04,129 INFO [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-0:[ctx-934bad64, work-19]) (logid:72bb5199) Migration attempt: for VM VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}from host Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}. Starting attempt: 1/5 times.
2025-11-11 13:30:04,130 INFO [c.c.h.HighAvailabilityManagerExtImpl] (HA-Worker-0:[ctx-934bad64, work-19]) (logid:72bb5199) Migration attempt: for VM b39f1e40-9576-4553-92b3-6e3e84d372ebfrom host id 1. Starting attempt: 1/5 times.
Observed that CloudStack wrongly selected host 2 (which was in maintenance mode) as the destination host for the VM with id = 9, which caused the VM to be stopped:
2025-11-11 13:30:05,730 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) Last host [Host {"id":2,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2","type":"Routing","uuid":"9716940e-7eef-475c-b88c-1f00caa4fe6d"}] of VM [VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}] is UP and has enough capacity. Checking for suitable pools for this host under zone [Zone {"id": "1", "name": "ref-trl-10084-k-Mr8-nicolas-vazquez", "uuid": "c6bb7e3f-388d-4318-8403-04d7f70d3862"}], pod [HostPod {"id":1,"name":"Pod1","uuid":"46052b42-8089-4b24-bd46-ca9fb938964f"}] and cluster [Cluster {id: "1", name: "p1-c1", uuid: "2e2b4642-b4a5-46c1-ba26-ab54880b9b53"}].
2025-11-11 13:30:05,732 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) Checking suitable pools for volume [Volume {"id":10,"instanceId":9,"name":"ROOT-9","uuid":"95ea3693-b4a4-49c4-9b21-88084cc295f2","volumeType":"ROOT"}, ROOT] of VM [VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}].
2025-11-11 13:30:05,733 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) Volume [Volume {"id":10,"instanceId":9,"name":"ROOT-9","uuid":"95ea3693-b4a4-49c4-9b21-88084cc295f2","volumeType":"ROOT"}] of VM [VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}] has pool [StoragePool {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm-pri1","poolType":"NetworkFilesystem","uuid":"162a7420-8b58-3701-bf2e-e27b2bf328ed"}] already specified. Checking if this pool can be reused.
2025-11-11 13:30:05,734 DEBUG [c.c.n.NetworkModelImpl] (Work-Job-Executor-11:[ctx-237107a3, job-119/job-130, ctx-406ba4b3]) (logid:506b69a7) Service SecurityGroup is not supported in the network Network {"id": 204, "name": "Net1", "uuid": "52cde633-2f3b-4305-825d-0dc1abca1493", "networkofferingid": 10}
2025-11-11 13:30:05,734 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) Pool [StoragePool {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm-pri1","poolType":"NetworkFilesystem","uuid":"162a7420-8b58-3701-bf2e-e27b2bf328ed"}] of volume [Volume {"id":10,"instanceId":9,"name":"ROOT-9","uuid":"95ea3693-b4a4-49c4-9b21-88084cc295f2","volumeType":"ROOT"}] used by VM [VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"}] fits the specified plan. No need to reallocate a pool for this volume.
2025-11-11 13:30:05,735 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) Trying to find a potenial host and associated storage pools from the suitable host/pool lists for this VM
....
2025-11-11 13:30:05,793 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-10:[ctx-5279597a, job-128/job-129, ctx-97af9533]) (logid:12e4154a) Host has enough CPU and RAM available
2025-11-11 13:30:05,793 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-10:[ctx-5279597a, job-128/job-129, ctx-97af9533]) (logid:12e4154a) STATS: Can alloc CPU from host: Host {"id":3,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm3","type":"Routing","uuid":"95e251a6-5ffb-436d-b287-df2a77b80165"}, used: 2500, reserved: 0, actual total: 6000, total with overprovisioning: 12000; requested cpu: 500, alloc_from_last_host?: false, considerReservedCapacity?: true
2025-11-11 13:30:05,793 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-10:[ctx-5279597a, job-128/job-129, ctx-97af9533]) (logid:12e4154a) STATS: Can alloc MEM from host: Host {"id":3,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm3","type":"Routing","uuid":"95e251a6-5ffb-436d-b287-df2a77b80165"}, used: (2.50 GB) 2684354560, reserved: (0 bytes) 0, total: (6.59 GB) 7072067584; requested mem: (512.00 MB) 536870912, alloc_from_last_host?: false, considerReservedCapacity?: true
2025-11-11 13:30:05,796 WARN [c.c.v.ClusteredVirtualMachineManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) Unable to migrate VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"} to Host {"id":2,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2","type":"Routing","uuid":"9716940e-7eef-475c-b88c-1f00caa4fe6d"} due to [Resource [Host:2] is unreachable: Host 2: Unable to send class com.cloud.agent.api.PrepareForMigrationCommand because agent ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2 is in maintenance mode]
com.cloud.exception.AgentUnavailableException: Resource [Host:2] is unreachable: Host 2: Unable to send class com.cloud.agent.api.PrepareForMigrationCommand because agent ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2 is in maintenance mode
at com.cloud.agent.manager.AgentAttache.checkAvailability(AgentAttache.java:190)
at com.cloud.agent.manager.ClusteredAgentAttache.checkAvailability(ClusteredAgentAttache.java:80)
at com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:358)
at com.cloud.agent.manager.ClusteredAgentAttache.send(ClusteredAgentAttache.java:131)
at com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:404)
at com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:570)
at com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:414)
at com.cloud.vm.VirtualMachineManagerImpl.migrate(VirtualMachineManagerImpl.java:2798)
at com.cloud.vm.VirtualMachineManagerImpl.orchestrateMigrateAway(VirtualMachineManagerImpl.java:3497)
at com.cloud.vm.VirtualMachineManagerImpl.orchestrateMigrateAway(VirtualMachineManagerImpl.java:5522)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at com.cloud.vm.VmWorkJobHandlerProxy.handleVmWorkJob(VmWorkJobHandlerProxy.java:102)
at com.cloud.vm.VirtualMachineManagerImpl.handleVmWorkJob(VirtualMachineManagerImpl.java:5610)
at com.cloud.vm.VmWorkJobDispatcher.runJob(VmWorkJobDispatcher.java:99)
at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:652)
at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:600)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
......
2025-11-11 13:30:05,809 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-12:[ctx-420f129c, job-117/job-131, ctx-c86f8e0b]) (logid:aee8f755) VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Stopping","type":"User","uuid":"b39f1e40-9576-4553-92b3-6e3e84d372eb"} state transited from [Running] to [Stopping] with event [StopRequested]. VM's original host: Host {"id":2,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm2","type":"Routing","uuid":"9716940e-7eef-475c-b88c-1f00caa4fe6d"}, new host: Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}, host before state transition: Host {"id":1,"name":"ref-trl-10084-k-Mr8-nicolas-vazquez-kvm1","type":"Routing","uuid":"16b3b0e8-8aaf-4e71-9ce0-29c252b15aa5"}
2025-11-11 13:30:05,810 DEBUG [c.c.v.UserVmManagerImpl] (Work-Job-Executor-12:[c
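Note in the DeploymentPlanningManagerImpl log above that host 2 (the VM's last host) passes as "UP and has enough capacity" even though it is in maintenance mode: in CloudStack the agent connection status (Up/Down/Disconnected) is tracked separately from the host's resource state (Enabled/Maintenance/...), so a host in maintenance can still report as UP. A minimal sketch of the distinction, using simplified stand-in types rather than the actual CloudStack classes (`AgentStatus`, `HostInfo`, and both method names are hypothetical):

```java
// Simplified stand-ins for CloudStack's com.cloud.host.Status and
// com.cloud.resource.ResourceState; not the real classes.
enum AgentStatus { UP, DOWN, DISCONNECTED }
enum ResourceState { ENABLED, DISABLED, PREPARE_FOR_MAINTENANCE, MAINTENANCE }

record HostInfo(long id, AgentStatus status, ResourceState resourceState) {}

public class LastHostCheck {
    // The kind of check the log output suggests: agent status only,
    // with resource state never consulted.
    static boolean lastHostLooksUsable(HostInfo host) {
        return host.status() == AgentStatus.UP;
    }

    // The stricter check that would reject host 2 in this scenario.
    static boolean isValidMigrationDestination(HostInfo host) {
        return host.status() == AgentStatus.UP
                && host.resourceState() == ResourceState.ENABLED;
    }

    public static void main(String[] args) {
        HostInfo host2 = new HostInfo(2, AgentStatus.UP, ResourceState.MAINTENANCE);
        System.out.println(lastHostLooksUsable(host2));          // true  -> selected
        System.out.println(isValidMigrationDestination(host2));  // false -> skipped
    }
}
```

Any destination selection based on agent status and capacity alone will happily pick host 2 here.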
Versions
CloudStack version 4.20.1
Tested with a single cluster of 4 KVM hosts + NFS cluster-wide storage
The steps to reproduce the bug
Described above
What to do about it?
The migrate-away logic must not consider hosts in maintenance mode (or any other invalid resource state) as destination hosts.
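As a rough illustration, here is a self-contained sketch of the kind of filter the fix could apply when building the candidate destination list (again with simplified stand-in types; `CandidateHost`, `eligibleDestinations`, and the enum values are hypothetical, and where exactly the filter belongs in the deployment-planning/HA code paths is left to the fix):

```java
import java.util.List;

// Simplified stand-in for CloudStack's com.cloud.resource.ResourceState.
enum ResourceState { ENABLED, DISABLED, PREPARE_FOR_MAINTENANCE, MAINTENANCE, ERROR_IN_MAINTENANCE }

record CandidateHost(long id, String name, ResourceState resourceState) {}

public class MigrateAwayDestinationFilter {
    // Drop every candidate whose resource state does not allow new VMs:
    // only Enabled hosts remain eligible as migration destinations.
    static List<CandidateHost> eligibleDestinations(List<CandidateHost> candidates) {
        return candidates.stream()
                .filter(h -> h.resourceState() == ResourceState.ENABLED)
                .toList();
    }

    public static void main(String[] args) {
        List<CandidateHost> cluster = List.of(
                new CandidateHost(2, "kvm2", ResourceState.MAINTENANCE),
                new CandidateHost(3, "kvm3", ResourceState.ENABLED),
                new CandidateHost(4, "kvm4", ResourceState.MAINTENANCE));
        System.out.println(eligibleDestinations(cluster)); // only host 3 (kvm3) remains
    }
}
```

Filtering on resource state up front would also avoid burning the HA worker's limited migration attempts (5, per the "Starting attempt: 1/5 times" log above) on hosts that can never accept the VM.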