
Conversation


@dinhngtu dinhngtu commented Sep 11, 2025

Add "Clean up cached VM even if specified from UUID", which changes how VMs are cleaned up if specified by UUID.

Aside from the new DMC tests, the Windows tests were also extended to cover scenarios that were previously failure-prone (upgrades, suspend with emulated NVMe, device ID changes, VIF unplug).

Requires WinPV 9.0.9135 or later.

@dinhngtu dinhngtu marked this pull request as ready for review September 11, 2025 17:02
@dinhngtu (Member Author)

Added a rework of test_checkpoint and associated functions to avoid repeated failures on Windows.

@dinhngtu dinhngtu requested review from stormi and ydirson September 18, 2025 11:31
snapshot.revert()


@pytest.mark.small_vm
Member

Would it be useful to also test it with a variety of VMs?

Member Author

Good point, I can mark it multi_vms. But I'm not sure if all of these VMs properly support ballooning.

Member

RHEL 8 & 9 ones don't, if I remember correctly. Can we detect that and skip for uncooperative VMs?

Member Author

Added a check for other/feature-balloon. However, this check won't work correctly on Linux VMs and current XCP-ng WinPV VMs, since in both cases the guest agent insists on setting feature-balloon regardless of driver support. I've added a fix in the WinPV guest agent, but due to this issue, I'll leave it marked small_vm for now.
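A minimal sketch of what such a guest-metrics check could look like (the other/feature-balloon key and the "1" encoding are assumptions based on this thread, not a confirmed API):

```python
def supports_ballooning(guest_metrics_other: dict) -> bool:
    """Heuristic: trust the guest-reported feature-balloon flag.

    Caveat from this thread: Linux guest agents (and current XCP-ng WinPV
    agents) set this flag regardless of actual balloon driver support, so a
    positive answer is not authoritative.
    """
    return guest_metrics_other.get("feature-balloon") == "1"
```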

Contributor

How often does the guest agent currently set feature-balloon despite drivers not being there? Is this a realistic scenario?

Contributor

Isn't that the case with the "recent" RHEL guests mentioned above?

Contributor

Yes, unfortunately this is a problem with all Linux guests. The Linux balloon driver doesn't set feature-balloon, so it's up to the guest agent to do that. I don't know if there's a way to check if the balloon driver is enabled, but at least the Rust agent doesn't do any such checks.

On this point, the Rust agent just mimicked what the XS one does. Worth a Plane (+ GitLab?) ticket?

Member Author
@dinhngtu dinhngtu Oct 2, 2025

Plane card created. OTOH a GitLab ticket can't really be created (IMO) until the current refactor situation is sorted out.

vm.suspend()
wait_for_vm_running_and_ssh_up_without_tools(vm)

def test_toggle_device_id(self, running_unsealed_windows_vm: VM, guest_tools_iso: dict[str, Any]):
Member

What's the objective of this test? I understand we want to make sure the VM still boots after changing the device ID, but why?

Member Author

It's a test of our driver, which after the unplug rework must remain activated even if the device ID changes. It also serves as a proxy for device ID changes if the Windows Update option was toggled. It's not an exact reproduction of the situation, but since we don't yet support the C200 device, it's good enough.

Member

Can you add a comment above the test function?

Member Author

Done. Also moved the device ID assert up one line.

@dinhngtu dinhngtu force-pushed the dnt/windows-tests branch 3 times, most recently from b1670f7 to e58b6c2 on September 19, 2025 13:55
Contributor
@last-genius last-genius left a comment

Great work - both with the new tests and fixing up old tests to be more reliable. Looks good to me from the xapi point of view as a starting point for DMC testing.

Comment on lines +116 to +144
def test_dmc_suspend(self, vm_with_memory_limits: VM):
"""Suspend a VM with DMC enabled."""
vm = vm_with_memory_limits
self.start_dmc_vm(vm)
vm.set_memory_target(MEMORY_TARGET_LOW)
wait_for_vm_balloon_finished(vm)
vm.suspend(verify=True)
vm.resume()
vm.wait_for_vm_running_and_ssh_up()
Contributor

All of the tests here set dynamic-min and dynamic-max to be the same value (oscillating between LOW and HIGH), that's what set_memory_target does. Do we plan on having tests with dynamic-min set lower than dynamic-max (not in this PR, but in the future)? Would be great to test how squeezed redistributes memory between VMs dynamically/how VMs are ballooned down to dynamic-min on migrations (but not on "localhost migrations" anymore).

Member Author

I don't know how such scenarios will behave (i.e. what should we test?) so I'll need your input on that.



dinhngtu commented Oct 2, 2025

Backed out the test_checkpoint rework since the SSH failures on Windows need more investigation. I'll merge later today provided there are no more comments.

Comment on lines +52 to +62
def wait_for_vm_balloon_finished(vm: VM):
memory_target = int(vm.param_get("memory-target"))
Contributor

This seems to be possibly subject to a race condition: nothing seems to ensure that the param cannot get changed behind the test's back and that we indeed get the expected value here. Looks like the target should rather be passed as a parameter to the function.

Member Author

Why would the parameter change behind the test's back?

Contributor
@ydirson ydirson Oct 2, 2025

Well, my comment was not 100% on the spot. But this parameter is RO, so likely based on the dynamic ranges, so the race is rather "how can we be sure that parameter has been set to the value we should be expecting?"
Intuitively, I would expect that the target would be set by squeezed to ensure the dynamic aspect of things - if that's right, it would even be expected that it changes behind our back.

But then, the existence of vm-memory-target-set raises doubts about my interpretation above.

On a different note, vm-memory-target-wait would look like a candidate for replacing DmcMemoryTracker?

I was not able to locate a dedicated doc for the DMC feature, so maybe we need that at some point, and in the meantime I guess more explanations about what we expect and test would help understanding this PR :)

Member Author
@dinhngtu dinhngtu Oct 2, 2025

Indeed that's why vm-memory-target-set was used, and why I wasn't sure of how to test the situation where dynamic-min and dynamic-max are different. (vm-memory-target-set, despite the name, sets both dynamic-min and dynamic-max)

vm-memory-target-wait looks interesting, but it doesn't have a way to bail out. I'm not sure how it reports failure either. Could you give me a quick explanation of how it works, @last-genius? I can either use it directly or replicate its logic here.

Contributor

It waits for `abs(memory_actual - memory_target) <= tolerance` for up to 256 seconds, where tolerance = 1 MiB. Sadly it doesn't have a way to provide the timeout or tolerance parameters, but I can add that if you want to.
The errors it reports are VM_MEMORY_TARGET_WAIT_TIMEOUT and TASK_CANCELLED (which is how you can cancel any task, with xe task-cancel, but it's pretty awkward with xe, much easier with the API directly).

I also wonder why vm-memory-target-wait is hidden from the CLI help (so it's not autocompleted 🤔 )
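The semantics described above can be replicated with a small polling helper; this is a sketch of that logic, not xapi's actual implementation, and the callable parameters and defaults are assumptions based on this thread:

```python
import time

def wait_for_memory_target(get_actual, get_target, tolerance=1 << 20,
                           timeout=256.0, interval=1.0) -> bool:
    """Poll until abs(memory_actual - memory_target) <= tolerance.

    Mirrors the described vm-memory-target-wait behaviour: 1 MiB tolerance,
    up to 256 seconds. Returns False on timeout instead of raising
    VM_MEMORY_TARGET_WAIT_TIMEOUT, so callers can bail out themselves.
    """
    deadline = time.monotonic() + timeout
    while True:
        if abs(get_actual() - get_target()) <= tolerance:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```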

Member Author

Thanks. I've opted to replicate the logic you described in DmcMemoryTracker.

Contributor

To get back to my original concern: so it isn't possible that memory-target gets changed e.g. by squeezed; it is a parameter under the total control of the admin?

Member Author

AIUI it's squeezed itself that sets memory-target, but it is not possible for memory-target to deviate from what we want in this current test (which fixes memory-target by setting the dynamic limits).

Comment on lines 66 to 68
def test_drivers_detected(self, vm_install_test_tools_per_test_class: VM):
def test_vif_replug(self, vm_install_test_tools_per_test_class: VM):
vm = vm_install_test_tools_per_test_class
assert vm.are_windows_tools_working()
Contributor

Wouldn't it make sense to have that assert systematically inside the vm_install_test_tools_per_test_class fixture? I think it would help by having the test in ERROR in case that happens, and not even starting, instead of going FAIL later when the problem is not really with what the test checks.

Then maybe for test_drivers_detected an _unchecked version of the fixture would help so that one test does go FAIL.

Member Author

It sounds like a very roundabout solution for little gain.

Contributor

I agree the gain is not that much, and there are places where we compromise and let the test FAIL when it was in fact not started at all because of such a prereq check, but here it does not seem to me to be much of a problem to get the theoretically-correct status?

Member Author

Moved check to the vm_install_test_tools_per_test_class fixture.

Contributor

Thanks.
However the change went as follows, which causes the symmetric issue on this test: if it does not work, an ERROR will be reported, while for this particular test we want it FAILED instead, since checking that the drivers are there is precisely the topic of the test.

     def test_drivers_detected(self, vm_install_test_tools_per_test_class: VM):
-        vm = vm_install_test_tools_per_test_class
-        assert vm.are_windows_tools_working()
+        pass

This is why I originally suggested this:

Then maybe for test_drivers_detected an _unchecked version of the fixture would help so that one test does go FAIL.

Member Author

Sorry, but I don't think I'll add that in this PR.

@dinhngtu dinhngtu force-pushed the dnt/windows-tests branch 4 times, most recently from 7271c97 to 8027905 on October 2, 2025 16:22
@dinhngtu dinhngtu force-pushed the dnt/windows-tests branch 2 times, most recently from 1dc06f1 to 12fb0d2 on October 22, 2025 08:50
@dinhngtu (Member Author)

Added various fixes for the tools-windows tests. These fixes are coupled with several other fixes in the installer, drivers and guest agent themselves.

@dinhngtu dinhngtu requested review from stormi and ydirson November 3, 2025 14:42
@dinhngtu dinhngtu requested a review from a team as a code owner November 25, 2025 10:42
vif.unplug()
# HACK: Allow some time for the unplug to settle. If not, Windows guests have a tendency to explode.
time.sleep(5)
vif.plug()
Contributor

Don't we want to make sure the VIF was indeed unplugged before replugging it? Maybe just check that a ping won't succeed?

Member Author

Good point, added check for VIF attach status.

Contributor

          for vif in vifs:
+            assert strtobool(vif.param_get("currently-attached"))
             vif.unplug()
             # HACK: Allow some time for the unplug to settle. If not, Windows guests have a tendency to explode.
             time.sleep(5)
             vif.plug()

Can't we also check it is not attached before replugging?
Also, checking the toolstack's opinion is good, but since we're exercising the guest tools/drivers, we may want to perform a direct check on the guest. That was what was behind my suggestion to use ping, but in fact maybe we could also check inside the guest that the interface was ... is it "removed"? (The "ping not responding any more" check may still be a good additional test, as it could help catch bad cases where a different machine would also answer to the same IP.) Maybe checking that it gets "added back" would not be strictly necessary, since after checking the IP we'll be reasonably confident it is really the right machine responding again to SSH.

Member Author

That would not be the goal of this test, which is only to test the network settings offboarding script.

Contributor

Hm, maybe it would be worth detailing a bit more the scope of the test then, when I first read "offboarding" I hesitated to ask for details (and made assumptions that it was a fancy windowish name for getting rid of the VNIC), but it's now clear I should definitely not have refrained 😄

Contributor

We're writing tests, so we should avoid relying on assumptions, and instead verify that any assumption indeed holds; that's how regressions get caught :)

Member Author
@dinhngtu dinhngtu Nov 26, 2025

It's not that assumptions shouldn't be verified. But each test should be doing its own work instead of spreading paranoid checks everywhere, and I don't think it's fair for test contributors to have to add paranoid checks every time they want to test something slightly related.

Contributor

My main point above is that the scope of the test is not clear for someone not knowing about "network settings offboarding script.", that can possibly help to scope the suggestions.

ie, the goal of those "paranoid checks" I suggest is merely to make sure that the test indeed checks what it's supposed to check - and I understand that with an incorrect understanding of the scope my suggestions could be inadequate. I'll wait until I can understand the scope :)

Member Author

My main point above is that the scope of the test is not clear for someone not knowing about "network settings offboarding script.", that can possibly help to scope the suggestions.

This particular test is not about the offboarding script.

ie, the goal of those "paranoid checks" I suggest is merely to make sure that the test indeed checks what it's supposed to check - and I understand that with an incorrect understanding of the scope my suggestions could be inadequate. I'll wait until I can understand the scope :)

Fine, I've added the attachment check.

For me the scope of the test is already clear from its context, i.e. test of the Windows drivers/guest tools. I don't know why we keep reaching an impasse.

Member

I don't have the full context, but in general we've seen that being precise in our expectations at every stage of the test facilitates the debugging of unexpected test failures, not caused by what we were exactly testing, but actually caused by other expectations not being met. I don't think it hurts to add some checks, without making them an absolute prerequisite in the review, though.

In these integration tests, where the system under test is very complex, I don't think it's being paranoid to add a few safeguards to make sure that we're in the state we believe we should be in, when it's a prerequisite for the test itself. The extra effort in writing the tests is an investment that pays off in the long run by reducing the amount of failures and the debugging effort, so we have to be "slightly" paranoid (with reason) when writing the tests.

This is a principle that has guided us when building this test suite with many hands involved, and I'm convinced that we thus avoided situations I have seen elsewhere, where integration tests become so unreliable that failures get ignored, or fixing them takes a frustratingly enormous amount of time each day (and I know we still have progress to make, as we have randomly failed runs that we'd like to avoid).

However I can understand the frustration when the comment comes more than 1 month after the initial review request, and we should strive to be more flexible when the review itself is late, to remove unnecessary frustration caused by the late discussion of relatively minor details.

In this case, I think Yann's assert will simplify debugging in case for some reason something is not attached as it should and was not costly to add. Was it worth the frustration caused by the discussion, maybe not. Let's strive to keep the balance.

Comment on lines +111 to +138
with pytest.raises(SSHCommandFailed, match="lacks the feature"):
vm.suspend()
wait_for_vm_running_and_ssh_up_without_tools(vm)
Contributor
@ydirson ydirson Nov 25, 2025

Why this call to wait_for_vm_running_and_ssh_up_without_tools(vm)? If the VM did not suspend, intuitively I'd think we should not have to wait for anything. OTOH, in case of failure and the VM actually suspending, we can wait all we want and it would not come back all alone, right?

Shouldn't that be something like this?

        with pytest.raises(SSHCommandFailed, match="lacks the feature"):
            vm.suspend()
        else:
            vm.resume()
            wait_for_vm_running_and_ssh_up_without_tools(vm)

(that actually sounds like something that should be restored using a fixture's teardown, but I guess only a snapshot with memory can allow for this, and it would likely be overkill)

Member Author

It's indicated by the name: test_uefi_vm_suspend_refused_without_tools. IOW the suspend must not succeed, even by accident.

Contributor

I'm talking about the case where the test fails - right, my else: block is nuts, aside from having an invalid name, I started it thinking of an except: condition and somewhat switched to a finally: idea for no reason 😓.

So yes, the idea was an except: ending with a raise.

Member Author

That sounds rather odd since pytest.raises already filters for the relevant exception, and lets the rest bubble through.

Member Author
@dinhngtu dinhngtu Nov 25, 2025

After rereading the thread, I think I got your idea; but wait_for_vm_running_and_ssh_up_without_tools(vm) will not be executed if the with clause fails, and it's only there to verify that vm.suspend() doesn't silently succeed even if the exception is fired (which admittedly sounds somewhat unnecessary once it's said out loud).

Contributor

After rereading the thread, I think I got your idea; wait_for_vm_running_and_ssh_up_without_tools(vm) will not be executed if the with clause fails,

Well, depending on what we mean by "fails" for a block that's expected to fail - that would be "if it fails to fail " :)

and it's only there to verify that vm.suspend() doesn't silently succeed even if the exception is fired (which admittedly sounds somewhat unnecessary once it's said out loud).

Indeed my idea was rather to handle the case when the suspend happens and exception is not fired, since that would make the test fail immediately and we would need to resume the VM to leave it in the original state. But then maybe it is unnecessary here too, with a revert-to-snapshot taking care of everything.

But yes there is also this additional edge case you mention.

@dinhngtu dinhngtu force-pushed the dnt/windows-tests branch 5 times, most recently from 4bfec8e to 57a0bac on November 25, 2025 18:10
Comment on lines +21 to +26
class DmcMemoryTracker:
"""
Class to monitor whether a VM has reached its specified memory target.
Intended to replicate the logic used by vm-memory-target-wait:
https://github.com/xcp-ng/xcp-ng-tests/pull/348#discussion_r2399188845
"""
Contributor

We only use wait() as an external API; it could be good to prefix all the internal methods with __ to make it obvious they are not entrypoints, and help a reader start where it makes most sense.

Member Author
@dinhngtu dinhngtu Nov 26, 2025

Made is_memory_target_satisfied private. OTOH poll should remain public.

If CACHE_IMPORTED_VM is specified, the source VM is unconditionally
cloned, even if it was referred to by UUID.

Clean that up during teardown.

Signed-off-by: Tu Dinh <[email protected]>
Otherwise you can't pass a dict[str, str] to host.xe, as mypy complained
here:

lib/vm.py:875: error: Argument 2 to "xe" of "Host" has incompatible type "dict[str, str]"; expected "dict[str, str | bool]"  [arg-type]
lib/vm.py:875: note: "dict" is invariant -- see https://mypy.readthedocs.io/en/stable/common_issues.html#variance
lib/vm.py:875: note: Consider using "Mapping" instead, which is covariant in the value type

Signed-off-by: Tu Dinh <[email protected]>
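The variance issue mypy points at can be illustrated in isolation. The `xe` signature below is a simplified stand-in for `Host.xe`, not its real one; the point is only the parameter type:

```python
from typing import Mapping, Union

# Accepting Mapping (covariant in its value type) instead of dict (invariant)
# lets callers pass a plain dict[str, str] where str-or-bool values are
# allowed: Mapping has no mutating methods, so the callee can never insert a
# bool into the caller's dict, which is what makes the covariance safe.
def xe(command: str, args: Mapping[str, Union[str, bool]]) -> str:
    return command + " " + " ".join(
        f"{k}={str(v).lower() if isinstance(v, bool) else v}"
        for k, v in sorted(args.items())
    )
```

With `args: dict[str, Union[str, bool]]`, mypy rejects a `dict[str, str]` argument for exactly the reason quoted above: `dict` is invariant in its value type.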
These tests verify a VM's responsiveness to memory target changes, and
check for several suspend bugs when DMC is enabled.

Signed-off-by: Tu Dinh <[email protected]>
Remove duplicate test_tools_after_reboot which was no longer used.
Reenable upgrade tests.
Add suspend test with emulated NVMe.
Add device ID toggle test.
Add VIF replug test.

Signed-off-by: Tu Dinh <[email protected]>
These methods help test VIF functionalities and the offboarding process.

Signed-off-by: Tu Dinh <[email protected]>
In some edge cases, Xeniface may not have been initialized after
installation, and so vm.reboot() will not work.

Signed-off-by: Tu Dinh <[email protected]>
This is flaky and needs to be explicitly tested.

Use DNS as a basic, inoffensive setting that won't interfere with VM
operation.

Signed-off-by: Tu Dinh <[email protected]>
RSS enablement is flaky and needs to be explicitly tested.

Signed-off-by: Tu Dinh <[email protected]>
The default timeouts turned out to be insufficient for driver installs
in some cases.

Signed-off-by: Tu Dinh <[email protected]>
Xenvif offboard will reset the NIC, which will cause any running SSH
commands to fail.

Signed-off-by: Tu Dinh <[email protected]>