Clean preloaded disposable remnants on next boot #742
Conversation
Codecov Report
❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #742      +/-   ##
==========================================
- Coverage   70.40%   70.38%   -0.02%
==========================================
  Files          61       61
  Lines       13682    13691       +9
==========================================
+ Hits         9633     9637       +4
- Misses       4049     4054       +5
OpenQA test summary

Complete test suite and dependencies: https://openqa.qubes-os.org/tests/overview?distri=qubesos&version=4.3&build=2025102219-4.3&flavor=pull-requests

Test run included the following:

New failures, excluding unstable (compared to: https://openqa.qubes-os.org/tests/overview?distri=qubesos&version=4.3&build=2025081011-4.3&flavor=update): 12 failed tests
Fixed failures (compared to: https://openqa.qubes-os.org/tests/149225#dependencies): 84 fixed
Unstable tests
Performance tests: 17 performance degradations; 163 remaining performance tests
qubes/vm/dispvm.py (outdated)

    :param bool force: Force cleanup even if enabled, else it might \
        be handled by ``domain-shutdown``.
Leftover from an earlier version? This function doesn't have a force parameter.
Yes, the remnant PR had remnants... It was dealt with by the running check.

If the disposable was preloaded, the in-progress feature would return False; now it checks whether the qube is a preloaded disposable that completed preloading but is not running, to signal an improper shutdown. When calling cleanup, if the qube is not running, "_bare_cleanup()" was skipped in "cleanup()"; now this is handled. There is something that triggers "domain-remove-from-disk" on qubesd start for unnamed disposables; I didn't find the origin, but it is now handled by removing the preloaded disposable from the list.

Fixes: QubesOS/qubes-issues#10326
For: QubesOS/qubes-issues#1512
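For illustration, a minimal sketch of the remnant check described above; the feature names and helper functions are assumptions made for this example, not the code in this PR:

```python
# Sketch only -- not the actual qubes/vm/dispvm.py change. The feature names
# ("preload-dispvm-in-progress", "preload-dispvm-completed") and both helper
# names are assumptions mirroring the description above.

def is_preload_remnant(vm) -> bool:
    """A preloaded disposable that completed preloading but is no longer
    running can only be a leftover from an unclean qubesd stop."""
    in_progress = bool(vm.features.get("preload-dispvm-in-progress", False))
    completed = bool(vm.features.get("preload-dispvm-completed", False))
    return completed and not in_progress and not vm.is_running()


async def clean_remnants_on_start(app) -> None:
    """On qubesd start, clean remnants instead of skipping _bare_cleanup()
    just because the qube is already halted."""
    for vm in list(app.domains):
        if getattr(vm, "auto_cleanup", False) and is_preload_remnant(vm):
            await vm.cleanup()
```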
force-pushed from 5b6d695 to 3e842d7 (Compare)
    running = False
    # Full cleanup will be done automatically if event 'domain-shutdown' is
    # triggered and "auto_cleanup=True".
    if not running or not self.auto_cleanup:
Some strange issue appears when using not running here.
The not running statement was intended to deal with a halted qube (a remnant from the previous boot). I guess it is better to accept a force parameter rather than defaulting to the running state.
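For illustration, one possible shape of such a force parameter, assuming cleanup() kills the qube when it is running and otherwise defers to the domain-shutdown handler (a sketch, not the final patch):

```python
async def cleanup(self, force: bool = False):
    """Clean up the disposable.

    :param bool force: assume the qube is halted (e.g. a remnant from a
        previous boot) instead of trusting is_running(), which proved
        unreliable here.
    """
    running = False if force else self.is_running()
    if running:
        # Triggers 'domain-shutdown'; with auto_cleanup=True that handler
        # performs the full cleanup.
        await self.kill()
    if not running or not self.auto_cleanup:
        await self._bare_cleanup()
```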
But that still doesn't explain the issue... was cleanup called twice?
Some disposable kept after a test has this in /var/log/qubes/qrexec.dispX.log:
2025-10-31 16:27:02.254 qrexec-daemon[2150537]: qrexec-daemon.c:898:parse_policy_response: qrexec-policy-daemon didn't return any data
WARNING:root:warning: !compat-4.0 directive in file /etc/qubes/policy.d/35-compat.policy line 16 is transitional and will be deprecated
Traceback (most recent call last):
  File "/usr/bin/qrexec-policy-exec", line 5, in <module>
    sys.exit(main())
             ~~~~^^
  File "/usr/lib/python3.13/site-packages/qrexec/tools/qrexec_policy_exec.py", line 331, in main
    result = get_result(args)
  File "/usr/lib/python3.13/site-packages/qrexec/tools/qrexec_policy_exec.py", line 275, in get_result
    result_str = asyncio.run(
        handle_request(
        ...<7 lines>...
        )
    )
  File "/usr/lib64/python3.13/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ~~~~~~~~~~^^^^^^
  File "/usr/lib64/python3.13/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/usr/lib64/python3.13/asyncio/base_events.py", line 725, in run_until_complete
    return future.result()
           ~~~~~~~~~~~~~^^
  File "/usr/lib/python3.13/site-packages/qrexec/tools/qrexec_policy_exec.py", line 363, in handle_request
    system_info = utils.get_system_info()
  File "/usr/lib/python3.13/site-packages/qrexec/utils.py", line 164, in get_system_info
    system_info = qubesd_call("dom0", "internal.GetSystemInfo")
  File "/usr/lib/python3.13/site-packages/qrexec/utils.py", line 98, in qubesd_call
    client_socket.connect(socket_path)
    ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory
2025-10-31 16:27:02.370 qrexec-daemon[2150537]: qrexec-daemon.c:1155:connect_daemon_socket: qrexec-policy-exec failed: 1
The messages in the qrexec log are kinda expected, as qubesd is stopped between tests, so qrexec policy can't be evaluated at that time. But the unusual part is that the dispvm shouldn't even be running at that time - all the preloaded ones should be cleaned up by then. Maybe something triggers preloading during the test cleanup stage, and it remains running?
I think I understood. The tests are not deleting everything on tearDown; some qubes are removed in qubes/tests/__init__.py with kill(). The correct way to deal with a disposable is to use cleanup() to make sure that everything is handled.
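As a hedged example of what that could look like in a test, assuming only the existing SystemTestCase scaffolding and DispVM.cleanup() (the class name and qube selection are made up):

```python
import qubes.tests
import qubes.vm.dispvm


class TC_00_Example(qubes.tests.SystemTestCase):  # hypothetical test case
    def tearDown(self):
        # Prefer DispVM.cleanup() over a bare kill(): cleanup() stops the
        # qube if needed and also removes its storage and libvirt domain,
        # so the generic removal in qubes/tests/__init__.py has nothing
        # half-finished left to deal with.
        for vm in list(self.app.domains):
            if isinstance(vm, qubes.vm.dispvm.DispVM) and vm.auto_cleanup:
                self.loop.run_until_complete(vm.cleanup())
        super().tearDown()
```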
I see what you mean - even though test cleanup does remove those VMs later (see _remove_vm_qubes called from remove_vms), dispvms also do that in the domain-shutdown handler, so a dispvm basically gets removed twice.
But something is still missing here: kill will remove the disposable (via the domain-shutdown handler, which is finalized before kill returns), so it will be fully cleaned at this stage. _remove_vm_qubes will then hit a bunch of errors during cleanup because the qube doesn't exist anymore (they are handled with except: pass), but the "logical volume in use" error looks like the VM is still running at that point, not like it was already removed. Similarly, domain-unpaused looks like a new preloaded qube got started in the meantime.
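A toy illustration of that double removal, in plain asyncio rather than Qubes code: an event handler and the caller both run the same cleanup, and without coordination the second attempt operates on a domain that is already gone (or, worse, not yet halted):

```python
import asyncio


class FakeDomain:
    """Stand-in for a disposable; only models the removal bookkeeping."""

    def __init__(self, name):
        self.name = name
        self.removed = False

    async def bare_cleanup(self):
        if self.removed:
            # In qubesd this is roughly where the "doesn't exist anymore"
            # errors (handled with except: pass) or storage errors appear.
            print(f"{self.name}: second removal attempt")
            return
        await asyncio.sleep(0)  # pretend to remove storage
        self.removed = True
        print(f"{self.name}: removed from disk")


async def on_domain_shutdown(dom):
    # auto_cleanup path, fired from the 'domain-shutdown' event after kill()
    await dom.bare_cleanup()


async def remove_vm_from_test(dom):
    # test teardown path (an analogue of _remove_vm_qubes)
    await dom.bare_cleanup()


async def main():
    dom = FakeDomain("disp1234")
    await asyncio.gather(on_domain_shutdown(dom), remove_vm_from_test(dom))


if __name__ == "__main__":
    asyncio.run(main())
```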
Explained in detail in the commit how I understood the issue.
If it is not done in the test, it will be handled by "qubes/tests/__init__.py", which will attempt to kill the domain. If the preloaded disposable was still starting, exceptions will be handled by also attempting to kill the domain. Both methods will trigger "_bare_cleanup()", sometimes indirectly via "cleanup()" or "_auto_cleanup()", but if "_bare_cleanup()" happens on the other call, not the one that called kill, it will not await and will attempt to remove the domain from the disk while it is still running (not completely killed).

For: QubesOS#742
For: QubesOS/qubes-issues#1512
Fixes: QubesOS/qubes-issues#10369
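For illustration, one way such a race could be avoided, assuming an asyncio.Lock stored on the disposable and the existing is_running()/_bare_cleanup() methods (the attribute name and timeout are assumptions, not this PR):

```python
import asyncio


async def safe_bare_cleanup(vm, timeout: float = 60.0):
    """Serialize the two cleanup paths (kill() plus the 'domain-shutdown'
    handler vs. test teardown) and refuse to remove storage while the
    domain is still running."""
    loop = asyncio.get_running_loop()
    async with vm._cleanup_lock:  # assumed asyncio.Lock created in __init__
        deadline = loop.time() + timeout
        while vm.is_running():
            if loop.time() >= deadline:
                raise TimeoutError(
                    f"{vm.name} still running, refusing to remove it from disk"
                )
            await asyncio.sleep(1)
        await vm._bare_cleanup()
```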
If a result has a fixed test alongside an upload failure, the upload failure is skipped. For: QubesOS/qubes-core-admin#742
Relying simply on the domain not being running resulted in storage errors from attempting to remove it while still in use (domain still running). I didn't identify the cause, which is unfortunate, as it now requires an extra parameter.

For: QubesOS#742
For: QubesOS/qubes-issues#1512
Fixes: QubesOS/qubes-issues#10369
How can I simulate stopping qubesd in integration tests? I tried a unit test, firing domain-load, but I couldn't get the magic right.