Skip to content

Conversation

jennagoddard
Copy link
Contributor

@jennagoddard jennagoddard commented Sep 30, 2025

The PR adds support for injecting a LINT1 debug interrupt into the guest VTL0 for both TDX and SNP CVM. This enables investigations into unresponsive guests.

@jennagoddard jennagoddard changed the title Draft: NMI for CVM in OpenHCL NMI for CVM in OpenHCL Oct 7, 2025
@jennagoddard jennagoddard marked this pull request as ready for review October 7, 2025 03:28
@jennagoddard jennagoddard requested a review from a team as a code owner October 7, 2025 03:28
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements NMI support for CVM (Confidential Virtual Machine) in OpenHCL by adding LINT1 interrupt handling capability. The implementation allows for debug interrupt injection through the Guest Emulation Transport (GET) protocol.

Key changes include:

  • Added LINT1 interrupt support to the Local APIC implementation
  • Extended GET protocol with debug interrupt notification capability
  • Implemented NMI masking and suppression logic for hardware-backed CVMs

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
vmm_core/virt_support_apic/src/lib.rs Added LINT1 interrupt support including statistics, work flags, and request handling
vm/devices/get/guest_emulation_transport/src/process_loop.rs Added debug interrupt notification handling and callback storage
vm/devices/get/guest_emulation_transport/src/client.rs Added client method to set debug interrupt callback
vm/devices/get/get_protocol/src/lib.rs Extended protocol with InjectDebugInterruptNotification structure
openhcl/virt_mshv_vtl/src/processor/tdx/mod.rs Added NMI masking support for TDX processors
openhcl/virt_mshv_vtl/src/processor/snp/mod.rs Added NMI suppression logic for SNP processors
openhcl/virt_mshv_vtl/src/processor/mod.rs Added LAPIC state fields for NMI and LINT1 handling
openhcl/virt_mshv_vtl/src/processor/hardware_cvm/mod.rs Added cross-VTL NMI tracking
openhcl/virt_mshv_vtl/src/processor/hardware_cvm/apic.rs Implemented LINT1 and NMI handling with masking support
openhcl/virt_mshv_vtl/src/lib.rs Added assert_debug_interrupt method to trigger LINT1
openhcl/underhill_core/src/worker.rs Connected GET callback to partition's debug interrupt method
Comments suppressed due to low confidence (1)

vm/devices/get/guest_emulation_transport/src/process_loop.rs:1

  • Corrected duplicate word 'the the' to 'the'.
// Copyright (c) Microsoft Corporation.

);
}

// Set the the callback in GET to trigger the debug interrupt.
Copy link

Copilot AI Oct 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected duplicate word 'the the' to 'the'.

Suggested change
// Set the the callback in GET to trigger the debug interrupt.
// Set the callback in GET to trigger the debug interrupt.
``` #Resolved

Copilot uses AI. Check for mistakes.

Copy link

github-actions bot commented Oct 7, 2025

At least one Petri test failed.
#Closed

}

// Trigger the LINT1 interrupt vector on the LAPIC of the BSP.
self.set_debug_interrupt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If no callback has been set we should probably trace a warning or something instead of silently doing nothing?

Copy link
Contributor Author

@jennagoddard jennagoddard Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need- if its that early lint1 is masked and there isn't an expectation LINT1 could be delivered at that time. For early boot diagnostics we'll need to use true NMI

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracing is very low cost. Even if a case is unexpected, that's exactly where a trace could help the most in figuring out something weird down the line. If we ask the VM to do something, and then absolutely nothing happens, that's confusing. If we ask it to do something and it prints a warning saying "i can't do that thing" that's helpful.


// NMI suppression state to prevent duplicate NMI
#[cfg(guest_arch = "x86_64")]
const NMI_SUPPRESS_LINT1_DELIVERED: u32 = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These and nmi_suppression should be combined into a bitfield type

// If a LINT1 NMI has been requested, then it is being delivered now,
// so no further NMIs can be delivered.
self.backing.cvm.lapics[vtl].nmi_suppression &= !NMI_SUPPRESS_LINT1_REQUESTED;
self.backing.cvm.lapics[vtl].nmi_suppression |= NMI_SUPPRESS_LINT1_DELIVERED;
Copy link
Contributor

@smalis-msft smalis-msft Oct 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I missed it, but where do we clear DELIVERED? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DELIVERED isn't cleared. Once LINT1 debug interrupt has been delivered you can't deliver another LINT1 or NMI. The HCL doesn't know when the NMI is complete so its not safe to signal twice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we trace a warning or something then if we get one of these and delivered is already set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the guest is already crashing and hopefully writing a crash dump so no need to add another write at that time

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about tracing being low cost

@smalis-msft
Copy link
Contributor

smalis-msft commented Oct 9, 2025

PR needs a description too. What we're adding and why. #Resolved


/// Returns true if the VP should be woken up to scan the APIC.
#[must_use]
fn request_lint1(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't make sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LINT1 is just a line on the processor that triggers a particular kind of interrupt, based on the APIC's configuration. This is typically but not always configured by the firmware/OS to be NMI. It doesn't make sense for the CPU to get a "LINT1 request" from the APIC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed a way to differentiate between true NMI and LINT1 delivered as NMI. suggestions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to distinguish between those cases?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I absolutely do not want to add any code to virt_support_apic for non-architectural behavior, other than the Hyper-V auto EOI extensions that we have to support. If we must distinguish between these paths then it must be outside of this crate.)

if lapic.activity != MpState::WaitForSipi {
if nmi || lapic.nmi_pending {
if lint1 {
if supports_nmi_masking || !lapic.cross_vtl_nmi_requested {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on this cross_vtl_nmi_requested thing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VTL0-generated NMIs may occur during normal execution e.g. due to KD interaction, and so we don't want these to block crash dumps. But if VTL1 injected an NMI that only occurs due to bug check so don't allow other NMI sources

@jennagoddard
Copy link
Contributor Author

confirmed known issue


In reply to: 3375135354

#[cfg(guest_arch = "x86_64")]
pub fn assert_debug_interrupt(&self, _vtl: u8) {
let bsp_index = VpIndex::new(0);
self.pulse_lint(bsp_index, Vtl::Vtl0, 1)
Copy link
Contributor

@smalis-msft smalis-msft Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just use the vtl parameter? is there any technical reason we need to prevent injecting into vtl1, or is it just we don't expect it to happen today? If we just don't expect it to happen, but we can trivially support it, then why don't we? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The VTL parameter is in the protocol so we can add VTL 2 crash dumps if or when we support the necessary encryption/security measures to do so. From discussion with Andrea, the recommended way to collect a crash dump of VTL1 is to inject NMI into VTL0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that the case for anything that may ever run in VTL 1, or just for today's implementation of Windows SK though? Some other OS (like say Linux) could add their own code that runs in VTL 1 with a different contract at some point right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes though we don't know what the contract would be there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but my point is if supporting injection into VTL 1 is trivial for us to do now, why not just link it up?

/// Trigger the LINT1 interrupt vector on the LAPIC of the BSP.
#[cfg(guest_arch = "x86_64")]
pub fn assert_debug_interrupt(&self, _vtl: u8) {
pub fn assert_debug_interrupt(&self, vtl: u8) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The vtl param should probably be a GuestVtl, and the conversion should happen at a higher level. I guess in the callback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants