forked from torvalds/linux
-
Notifications
You must be signed in to change notification settings - Fork 3
First pass fwctl for Broadcom NICs #4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
gospo
wants to merge
20
commits into
jgunthorpe:fwctl
Choose a base branch
from
gospo:fwctl-brcm-0
base: fwctl
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Create a new 'struct cxl_mailbox' and move all mailbox related bits to it. This allows isolation of all CXL mailbox data in order to export some of the calls to external caller (fwctl) and avoid exporting of CXL driver specific bits such has device states. Link: https://patch.msgid.link/r/[email protected] Signed-off-by: Dave Jiang <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
The code indicates that the min of n_commands and total commands is returned. The comment incorrectly says it's the max(). Correct comment to min(). Link: https://patch.msgid.link/r/[email protected] Signed-off-by: Dave Jiang <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
With 'struct cxl_mailbox' context introduced, the helper functions cxl_query_cmd() and cxl_send_cmd() can take a cxl_mailbox directly rather than a cxl_memdev parameter. Refactor to use cxl_mailbox directly. Link: https://patch.msgid.link/r/[email protected] Signed-off-by: Dave Jiang <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
CXL spec r3.1 8.2.9.6.1 Get Supported Features (Opcode 0500h) The command retrieve the list of supported device-specific features (identified by UUID) and general information about each Feature. The driver will retrieve the feature entries in order to make checks and provide information for the Get Feature and Set Feature command. One of the main piece of information retrieved are the effects a Set Feature command would have for a particular feature. Link: https://patch.msgid.link/r/[email protected] Signed-off-by: Dave Jiang <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
Add cxl-test emulation of Get Supported Features mailbox command. Currently only planning to add the Device Patrol Scrub Control feature as a testing vehicle. CXL spec r3.1 8.2.9.9.11.1 Device Patrol Scrub Control Feature. Link: https://patch.msgid.link/r/[email protected] Signed-off-by: Dave Jiang <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
Add enumeration of Get Feature mailbox command for the kernel to recognize the command being passed in from user space. CXL spec r3.1 8.2.9.6.2 Get Feature (Opcode 0501h) The feature requested is identified by specific UUID. Link: https://patch.msgid.link/r/[email protected] Signed-off-by: Dave Jiang <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
Add enumeration of Set Feature mailbox command for the kernel to recognize the command being passed in from user space. CXL spec r3.1 8.2.9.6.3 Set Feature (Opcode 0502h) The feature requested is identified by specific UUID. Link: https://patch.msgid.link/r/[email protected] Signed-off-by: Dave Jiang <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Dave Jiang <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
Create the class, character device and functions for a fwctl driver to un/register to the subsystem. A typical fwctl driver has a sysfs presence like: $ ls -l /dev/fwctl/fwctl0 crw------- 1 root root 250, 0 Apr 25 19:16 /dev/fwctl/fwctl0 $ ls /sys/class/fwctl/fwctl0 dev device power subsystem uevent $ ls /sys/class/fwctl/fwctl0/device/infiniband/ ibp0s10f0 $ ls /sys/class/infiniband/ibp0s10f0/device/fwctl/ fwctl0/ $ ls /sys/devices/pci0000:00/0000:00:0a.0/fwctl/fwctl0 dev device power subsystem uevent Which allows userspace to link all the multi-subsystem driver components together and learn the subsystem specific names for the device's components. Signed-off-by: Jason Gunthorpe <[email protected]>
Each file descriptor gets a chunk of per-FD driver specific context that allows the driver to attach a device specific struct to. The core code takes care of the memory lifetime for this structure. The ioctl dispatch and design is based on what was built for iommufd. The ioctls have a struct which has a combined in/out behavior with a typical 'zero pad' scheme for future extension and backwards compatibility. Like iommufd some shared logic does most of the ioctl marshalling and compatibility work and tables diatches to some function pointers for each unique iotcl. This approach has proven to work quite well in the iommufd and rdma subsystems. Allocate an ioctl number space for the subsystem. Signed-off-by: Jason Gunthorpe <[email protected]>
Userspace will need to know some details about the fwctl interface being used to locate the correct userspace code to communicate with the kernel. Provide a simple device_type enum indicating what the kernel driver is. Allow the device to provide a device specific info struct that contains any additional information that the driver may need to provide to userspace. Signed-off-by: Jason Gunthorpe <[email protected]>
Requesting a fwctl scope of access that includes mutating device debug data will cause the kernel to be tainted. Changing the device operation through things in the debug scope may cause the device to malfunction in undefined ways. This should be reflected in the TAINT flags to help any debuggers understand that something has been done. Signed-off-by: Jason Gunthorpe <[email protected]>
Add the FWCTL_RPC ioctl which allows a request/response RPC call to device firmware. Drivers implementing this call must follow the security guidelines under Documentation/userspace-api/fwctl.rst The core code provides some memory management helpers to get the messages copied from and back to userspace. The driver is responsible for allocating the output message memory and delivering the message to the device. Signed-off-by: Jason Gunthorpe <[email protected]>
Document the purpose and rules for the fwctl subsystem. Link in kdocs to the doc tree. Nacked-by: Jakub Kicinski <[email protected]> Link: https://lore.kernel.org/r/[email protected] Acked-by: Daniel Vetter <[email protected]> https://lore.kernel.org/r/[email protected] Signed-off-by: Jason Gunthorpe <[email protected]>
mlx5's fw has long provided a User Context concept. This has a long history in RDMA as part of the devx extended verbs programming interface. A User Context is a security envelope that contains objects and controls access. It contains the Protection Domain object from the InfiniBand Architecture and both togther provide the OS with the necessary tools to bind a security context like a process to the device. The security context is restricted to not be able to touch the kernel or other processes. In the RDMA verbs case it is also restricted to not touch global device resources. The fwctl_mlx5 takes this approach and builds a User Context per fwctl file descriptor and uses a FW security capability on the User Context to enable access to global device resources. This makes the context useful for provisioning and debugging the global device state. mlx5 already has a robust infrastructure for delivering RPC messages to fw. Trivially connect fwctl's RPC mechanism to mlx5_cmd_do(). Enforce the User Context ID in every RPC header so the FW knows the security context of the issuing ID. Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
If the device supports User Context then it can support fwctl. Create an auxiliary device to allow fwctl to bind to it. Create a sysfs like: $ ls /sys/devices/pci0000:00/0000:00:0a.0/mlx5_core.fwctl.0/driver -l lrwxrwxrwx 1 root root 0 Apr 25 19:46 /sys/devices/pci0000:00/0000:00:0a.0/mlx5_core.fwctl.0/driver -> ../../../../bus/auxiliary/drivers/mlx5_fwctl.mlx5_fwctl Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
…nds (RFC) Add an fwctl (auxiliary bus) driver to allow sending of CXL feature commands from userspace through as ioctls. Create a driver skeleton for initial setup. FWCTL_INFO will return the commands supported by the fwctl char device as a bitmap of enable commands. fwctl provides a fwctl_ops->fw_rpc() callback in order to issue ioctls to a device. FWCTL_RPC will start by supporting the CXL feature commands: Get Supported Features, Get Feature, and Set Feature. FWCTL_RPC provides 'enum fwctl_rpc_scope' parameter where it indicates the security scope of the call. The Get Supported Features and Get Feature calls can be executed with the scope of FWCTL_RPC_DEBUG_READ_ONLY. The Set Feature call is gated by the effects of the feature reported by Get Supported Features call for the specific feature. Add a software command through FWCTL_RPC in order for the user to retrieve information about the commands that are supported. In this instance only 3 commands are supported: Get Supported Features, Get Feature, and Set Feature. The expected flow of operation is to send the call first with 0 set to the n_commands parameter to indicate query of total commands available. And then a second call provides the number of commands to retrieve with the appropriate amount of memory allocated to store information about the commands. Link: https://patch.msgid.link/r/[email protected] Signed-off-by: Dave Jiang <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
This will link the fwctl subsystem to CXL devices. Signed-off-by: Dave Jiang <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
fwctl is a new subsystem intended to bring some common rules and order to the growing pattern of exposing a secure FW interface directly to userspace. Unlike existing places like RDMA/DRM/VFIO/uacce that are exposing a device for datapath operations fwctl is focused on debugging, configuration and provisioning of the device. It will not have the necessary features like interrupt delivery to support a datapath. This concept is similar to the long standing practice in the "HW" RAID space of having a device specific misc device to manager the RAID controller FW. fwctl generalizes this notion of a companion debug and management interface that goes along with a dataplane implemented in an appropriate subsystem. The need for this has reached a critical point as many users are moving to run lockdown enabled kernels. Several existing devices have had long standing tooling for management that relied on /sys/../resource0 or PCI config space access which is not permitted in lockdown. A major point of fwctl is to define and document the rules that a device must follow to expose a lockdown compatible RPC. Based on some discussion fwctl splits the RPCs into four categories FWCTL_RPC_CONFIGURATION FWCTL_RPC_DEBUG_READ_ONLY FWCTL_RPC_DEBUG_WRITE FWCTL_RPC_DEBUG_WRITE_FULL Where the latter two trigger a new TAINT_FWCTL, and the final one requires CAP_SYS_RAWIO - excluding it from lockdown. The device driver and its FW would be responsible to restrict RPCs to the requested security scope, while the core code handles the tainting and CAP checks. For details see the final patch which introduces the documentation. This series incorporates a version of the mlx5ctl interface previously proposed: https://lore.kernel.org/r/[email protected]/ For this series the memory registration mechanism was removed, but I expect it will come back. It also includes the FWCL driver series from David: https://lore.kernel.org/all/[email protected]/ This is still waiting a 3rd fwctl driver and the CXL side to finish some of its development. The github has the necessary CXL precursor patches. There have been two LWN articles written discussing various aspects of this proposal: https://lwn.net/Articles/955001/ https://lwn.net/Articles/969383/ And a really giant ksummit thread: https://lore.kernel.org/ksummit/[email protected]/ Several have expressed general support for this concept: Broadcom Networking - https://lore.kernel.org/r/Zf2n02q0GevGdS-Z@C02YVCJELVCG Christoph Hellwig - https://lore.kernel.org/r/[email protected]/ Daniel Vetter - https://lore.kernel.org/r/[email protected] Enfabrica - https://lore.kernel.org/r/[email protected]/ NVIDIA Networking Oded Gabbay/Habana - https://lore.kernel.org/r/ZrMl1bkPP-3G9B4N@T14sgabbay. Oracle Linux - https://lore.kernel.org/r/6lakj6lxlxhdgrewodvj3xh6sxn3d36t5dab6najzyti2navx3@wrge7cyfk6nq SuSE/Hannes - https://lore.kernel.org/r/[email protected] Work is ongoing for a robust multi-device open source userspace, currently the mlx5ctl_user that was posted by Saeed has been updated to use fwctl. https://github.com/saeedtx/mlx5ctl.git https://github.com/jgunthorpe/mlx5ctl.git This is on github: https://github.com/jgunthorpe/linux/commits/fwctl v3: - Rebase to v6.11-rc4 - Add a squashed version of David's CXL series as the 2nd driver - Add missing includes - Improve comments based on feedback - Use the kdoc format that puts the member docs inside the struct - Rewrite fwctl_alloc_device() to be clearer - Incorporate all remarks for the documentation v2: https://lore.kernel.org/r/[email protected] - Rebase to v6.10-rc5 - Minor style changes - Follow the style consensus for the guard stuff - Documentation grammer/spelling - Add missed length output for mlx5 get_info - Add two more missed MLX5 CMD's - Collect tags v1: https://lore.kernel.org/r/[email protected] Cc: Andy Gospodarek <[email protected]> Cc: Aron Silverton <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Ahern <[email protected]> Cc: Itay Avraham <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: Jiri Pirko <[email protected]> Cc: Leon Romanovsky <[email protected]> Cc: Leonid Bloch <[email protected]> Cc: Dan Williams <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Greg Kroah-Hartman <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
This patch adds basic support for the fwctl infrastructure. With the approriate tool, the most basic RPC to the bnxt_en firmware can be called. Signed-off-by: Andy Gospodarek <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Oct 14, 2024
On the node of an NFS client, some files saved in the mountpoint of the NFS server were copied to another location of the same NFS server. Accidentally, the nfs42_complete_copies() got a NULL-pointer dereference crash with the following syslog: [232064.838881] NFSv4: state recovery failed for open file nfs/pvc-12b5200d-cd0f-46a3-b9f0-af8f4fe0ef64.qcow2, error = -116 [232064.839360] NFSv4: state recovery failed for open file nfs/pvc-12b5200d-cd0f-46a3-b9f0-af8f4fe0ef64.qcow2, error = -116 [232066.588183] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058 [232066.588586] Mem abort info: [232066.588701] ESR = 0x0000000096000007 [232066.588862] EC = 0x25: DABT (current EL), IL = 32 bits [232066.589084] SET = 0, FnV = 0 [232066.589216] EA = 0, S1PTW = 0 [232066.589340] FSC = 0x07: level 3 translation fault [232066.589559] Data abort info: [232066.589683] ISV = 0, ISS = 0x00000007 [232066.589842] CM = 0, WnR = 0 [232066.589967] user pgtable: 64k pages, 48-bit VAs, pgdp=00002000956ff400 [232066.590231] [0000000000000058] pgd=08001100ae100003, p4d=08001100ae100003, pud=08001100ae100003, pmd=08001100b3c00003, pte=0000000000000000 [232066.590757] Internal error: Oops: 96000007 [#1] SMP [232066.590958] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm vhost_net vhost vhost_iotlb tap tun ipt_rpfilter xt_multiport ip_set_hash_ip ip_set_hash_net xfrm_interface xfrm6_tunnel tunnel4 tunnel6 esp4 ah4 wireguard libcurve25519_generic veth xt_addrtype xt_set nf_conntrack_netlink ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_bitmap_port ip_set_hash_ipport dummy ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs iptable_filter sch_ingress nfnetlink_cttimeout vport_gre ip_gre ip_tunnel gre vport_geneve geneve vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch nf_conncount dm_round_robin dm_service_time dm_multipath xt_nat xt_MASQUERADE nft_chain_nat nf_nat xt_mark xt_conntrack xt_comment nft_compat nft_counter nf_tables nfnetlink ocfs2 ocfs2_nodemanager ocfs2_stackglue iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_ssif nbd overlay 8021q garp mrp bonding tls rfkill sunrpc ext4 mbcache jbd2 [232066.591052] vfat fat cas_cache cas_disk ses enclosure scsi_transport_sas sg acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler ip_tables vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_type1 vfio dm_mirror dm_region_hash dm_log dm_mod nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc fuse xfs libcrc32c ast drm_vram_helper qla2xxx drm_kms_helper syscopyarea crct10dif_ce sysfillrect ghash_ce sysimgblt sha2_ce fb_sys_fops cec sha256_arm64 sha1_ce drm_ttm_helper ttm nvme_fc igb sbsa_gwdt nvme_fabrics drm nvme_core i2c_algo_bit i40e scsi_transport_fc megaraid_sas aes_neon_bs [232066.596953] CPU: 6 PID: 4124696 Comm: 10.253.166.125- Kdump: loaded Not tainted 5.15.131-9.cl9_ocfs2.aarch64 #1 [232066.597356] Hardware name: Great Wall .\x93\x8e...RF6260 V5/GWMSSE2GL1T, BIOS T656FBE_V3.0.18 2024-01-06 [232066.597721] pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [232066.598034] pc : nfs4_reclaim_open_state+0x220/0x800 [nfsv4] [232066.598327] lr : nfs4_reclaim_open_state+0x12c/0x800 [nfsv4] [232066.598595] sp : ffff8000f568fc70 [232066.598731] x29: ffff8000f568fc70 x28: 0000000000001000 x27: ffff21003db33000 [232066.599030] x26: ffff800005521ae0 x25: ffff0100f98fa3f0 x24: 0000000000000001 [232066.599319] x23: ffff800009920008 x22: ffff21003db33040 x21: ffff21003db33050 [232066.599628] x20: ffff410172fe9e40 x19: ffff410172fe9e00 x18: 0000000000000000 [232066.599914] x17: 0000000000000000 x16: 0000000000000004 x15: 0000000000000000 [232066.600195] x14: 0000000000000000 x13: ffff800008e685a8 x12: 00000000eac0c6e6 [232066.600498] x11: 0000000000000000 x10: 0000000000000008 x9 : ffff8000054e5828 [232066.600784] x8 : 00000000ffffffbf x7 : 0000000000000001 x6 : 000000000a9eb14a [232066.601062] x5 : 0000000000000000 x4 : ffff70ff8a14a800 x3 : 0000000000000058 [232066.601348] x2 : 0000000000000001 x1 : 54dce46366daa6c6 x0 : 0000000000000000 [232066.601636] Call trace: [232066.601749] nfs4_reclaim_open_state+0x220/0x800 [nfsv4] [232066.601998] nfs4_do_reclaim+0x1b8/0x28c [nfsv4] [232066.602218] nfs4_state_manager+0x928/0x10f0 [nfsv4] [232066.602455] nfs4_run_state_manager+0x78/0x1b0 [nfsv4] [232066.602690] kthread+0x110/0x114 [232066.602830] ret_from_fork+0x10/0x20 [232066.602985] Code: 1400000d f9403f20 f9402e61 91016003 (f9402c00) [232066.603284] SMP: stopping secondary CPUs [232066.606936] Starting crashdump kernel... [232066.607146] Bye! Analysing the vmcore, we know that nfs4_copy_state listed by destination nfs_server->ss_copies was added by the field copies in handle_async_copy(), and we found a waiting copy process with the stack as: PID: 3511963 TASK: ffff710028b47e00 CPU: 0 COMMAND: "cp" #0 [ffff8001116ef740] __switch_to at ffff8000081b92f4 #1 [ffff8001116ef760] __schedule at ffff800008dd0650 #2 [ffff8001116ef7c0] schedule at ffff800008dd0a00 #3 [ffff8001116ef7e0] schedule_timeout at ffff800008dd6aa0 #4 [ffff8001116ef860] __wait_for_common at ffff800008dd166c #5 [ffff8001116ef8e0] wait_for_completion_interruptible at ffff800008dd1898 torvalds#6 [ffff8001116ef8f0] handle_async_copy at ffff8000055142f4 [nfsv4] torvalds#7 [ffff8001116ef970] _nfs42_proc_copy at ffff8000055147c8 [nfsv4] torvalds#8 [ffff8001116efa80] nfs42_proc_copy at ffff800005514cf0 [nfsv4] torvalds#9 [ffff8001116efc50] __nfs4_copy_file_range.constprop.0 at ffff8000054ed694 [nfsv4] The NULL-pointer dereference was due to nfs42_complete_copies() listed the nfs_server->ss_copies by the field ss_copies of nfs4_copy_state. So the nfs4_copy_state address ffff0100f98fa3f0 was offset by 0x10 and the data accessed through this pointer was also incorrect. Generally, the ordered list nfs4_state_owner->so_states indicate open(O_RDWR) or open(O_WRITE) states are reclaimed firstly by nfs4_reclaim_open_state(). When destination state reclaim is failed with NFS_STATE_RECOVERY_FAILED and copies are not deleted in nfs_server->ss_copies, the source state may be passed to the nfs42_complete_copies() process earlier, resulting in this crash scene finally. To solve this issue, we add a list_head nfs_server->ss_src_copies for a server-to-server copy specially. Fixes: 0e65a32 ("NFS: handle source server reboot") Signed-off-by: Yanjun Zhang <[email protected]> Reviewed-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Oct 23, 2024
Syzkaller reported a lockdep splat:
============================================
WARNING: possible recursive locking detected
6.11.0-rc6-syzkaller-00019-g67784a74e258 #0 Not tainted
--------------------------------------------
syz-executor364/5113 is trying to acquire lock:
ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328
but task is already holding lock:
ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(k-slock-AF_INET);
lock(k-slock-AF_INET);
*** DEADLOCK ***
May be due to missing lock nesting notation
7 locks held by syz-executor364/5113:
#0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
#0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0x153/0x1b10 net/mptcp/protocol.c:1806
#1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
#1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg_fastopen+0x11f/0x530 net/mptcp/protocol.c:1727
#2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
#2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
#2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x5f/0x1b80 net/ipv4/ip_output.c:470
#3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
#3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
#3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x45f/0x1390 net/ipv4/ip_output.c:228
#4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: local_lock_acquire include/linux/local_lock_internal.h:29 [inline]
#4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: process_backlog+0x33b/0x15b0 net/core/dev.c:6104
#5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
#5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
#5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0x230/0x5f0 net/ipv4/ip_input.c:232
torvalds#6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
torvalds#6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328
stack backtrace:
CPU: 0 UID: 0 PID: 5113 Comm: syz-executor364 Not tainted 6.11.0-rc6-syzkaller-00019-g67784a74e258 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:93 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119
check_deadlock kernel/locking/lockdep.c:3061 [inline]
validate_chain+0x15d3/0x5900 kernel/locking/lockdep.c:3855
__lock_acquire+0x137a/0x2040 kernel/locking/lockdep.c:5142
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5759
__raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
_raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
spin_lock include/linux/spinlock.h:351 [inline]
sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328
mptcp_sk_clone_init+0x32/0x13c0 net/mptcp/protocol.c:3279
subflow_syn_recv_sock+0x931/0x1920 net/mptcp/subflow.c:874
tcp_check_req+0xfe4/0x1a20 net/ipv4/tcp_minisocks.c:853
tcp_v4_rcv+0x1c3e/0x37f0 net/ipv4/tcp_ipv4.c:2267
ip_protocol_deliver_rcu+0x22e/0x440 net/ipv4/ip_input.c:205
ip_local_deliver_finish+0x341/0x5f0 net/ipv4/ip_input.c:233
NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
__netif_receive_skb_one_core net/core/dev.c:5661 [inline]
__netif_receive_skb+0x2bf/0x650 net/core/dev.c:5775
process_backlog+0x662/0x15b0 net/core/dev.c:6108
__napi_poll+0xcb/0x490 net/core/dev.c:6772
napi_poll net/core/dev.c:6841 [inline]
net_rx_action+0x89b/0x1240 net/core/dev.c:6963
handle_softirqs+0x2c4/0x970 kernel/softirq.c:554
do_softirq+0x11b/0x1e0 kernel/softirq.c:455
</IRQ>
<TASK>
__local_bh_enable_ip+0x1bb/0x200 kernel/softirq.c:382
local_bh_enable include/linux/bottom_half.h:33 [inline]
rcu_read_unlock_bh include/linux/rcupdate.h:908 [inline]
__dev_queue_xmit+0x1763/0x3e90 net/core/dev.c:4450
dev_queue_xmit include/linux/netdevice.h:3105 [inline]
neigh_hh_output include/net/neighbour.h:526 [inline]
neigh_output include/net/neighbour.h:540 [inline]
ip_finish_output2+0xd41/0x1390 net/ipv4/ip_output.c:235
ip_local_out net/ipv4/ip_output.c:129 [inline]
__ip_queue_xmit+0x118c/0x1b80 net/ipv4/ip_output.c:535
__tcp_transmit_skb+0x2544/0x3b30 net/ipv4/tcp_output.c:1466
tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:6542 [inline]
tcp_rcv_state_process+0x2c32/0x4570 net/ipv4/tcp_input.c:6729
tcp_v4_do_rcv+0x77d/0xc70 net/ipv4/tcp_ipv4.c:1934
sk_backlog_rcv include/net/sock.h:1111 [inline]
__release_sock+0x214/0x350 net/core/sock.c:3004
release_sock+0x61/0x1f0 net/core/sock.c:3558
mptcp_sendmsg_fastopen+0x1ad/0x530 net/mptcp/protocol.c:1733
mptcp_sendmsg+0x1884/0x1b10 net/mptcp/protocol.c:1812
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x1a6/0x270 net/socket.c:745
____sys_sendmsg+0x525/0x7d0 net/socket.c:2597
___sys_sendmsg net/socket.c:2651 [inline]
__sys_sendmmsg+0x3b2/0x740 net/socket.c:2737
__do_sys_sendmmsg net/socket.c:2766 [inline]
__se_sys_sendmmsg net/socket.c:2763 [inline]
__x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2763
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f04fb13a6b9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 01 1a 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffd651f42d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f04fb13a6b9
RDX: 0000000000000001 RSI: 0000000020000d00 RDI: 0000000000000004
RBP: 00007ffd651f4310 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000020000080 R11: 0000000000000246 R12: 00000000000f4240
R13: 00007f04fb187449 R14: 00007ffd651f42f4 R15: 00007ffd651f4300
</TASK>
As noted by Cong Wang, the splat is false positive, but the code
path leading to the report is an unexpected one: a client is
attempting an MPC handshake towards the in-kernel listener created
by the in-kernel PM for a port based signal endpoint.
Such connection will be never accepted; many of them can make the
listener queue full and preventing the creation of MPJ subflow via
such listener - its intended role.
Explicitly detect this scenario at initial-syn time and drop the
incoming MPC request.
Fixes: 1729cf1 ("mptcp: create the listening socket for new port")
Cc: [email protected]
Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=f4aacdfef2c6a6529c3e
Cc: Cong Wang <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Reviewed-by: Matthieu Baerts (NGI0) <[email protected]>
Reviewed-by: Mat Martineau <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Nov 4, 2024
Hou Tao says: ==================== The patch set fixes several issues in bits iterator. Patch #1 fixes the kmemleak problem of bits iterator. Patch #2~#3 fix the overflow problem of nr_bits. Patch #4 fixes the potential stack corruption when bits iterator is used on 32-bit host. Patch #5 adds more test cases for bits iterator. Please see the individual patches for more details. And comments are always welcome. --- v4: * patch #1: add ack from Yafang * patch #3: revert code-churn like changes: (1) compute nr_bytes and nr_bits before the check of nr_words. (2) use nr_bits == 64 to check for single u64, preventing build warning on 32-bit hosts. * patch #4: use "BITS_PER_LONG == 32" instead of "!defined(CONFIG_64BIT)" v3: https://lore.kernel.org/bpf/[email protected]/T/#t * split the bits-iterator related patches from "Misc fixes for bpf" patch set * patch #1: use "!nr_bits || bits >= nr_bits" to stop the iteration * patch #2: add a new helper for the overflow problem * patch #3: decrease the limitation from 512 to 511 and check whether nr_bytes is too large for bpf memory allocator explicitly * patch #5: add two more test cases for bit iterator v2: http://lore.kernel.org/bpf/[email protected] ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Nov 4, 2024
Petr Machata says: ==================== mlxsw: Fixes In this patchset: - Tx header should be pushed for each packet which is transmitted via Spectrum ASICs. Patch #1 adds a missing call to skb_cow_head() to make sure that there is both enough room to push the Tx header and that the SKB header is not cloned and can be modified. - Commit b5b60bb ("mlxsw: pci: Use page pool for Rx buffers allocation") converted mlxsw to use page pool for Rx buffers allocation. Sync for CPU and for device should be done for Rx pages. In patches #2 and #3, add the missing calls to sync pages for, respectively, CPU and the device. - Patch #4 then fixes a bug to IPv6 GRE forwarding offload. Patch #5 adds a generic forwarding test that fails with mlxsw ports prior to the fix. ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Nov 18, 2024
When CONFIG_KASAN_SW_TAGS and CONFIG_KASAN_STACK are enabled, the object_is_on_stack() function may produce incorrect results due to the presence of tags in the obj pointer, while the stack pointer does not have tags. This discrepancy can lead to incorrect stack object detection and subsequently trigger warnings if CONFIG_DEBUG_OBJECTS is also enabled. Example of the warning: ODEBUG: object 3eff800082ea7bb0 is NOT on stack ffff800082ea0000, but annotated. ------------[ cut here ]------------ WARNING: CPU: 0 PID: 1 at lib/debugobjects.c:557 __debug_object_init+0x330/0x364 Modules linked in: CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.0-rc5 #4 Hardware name: linux,dummy-virt (DT) pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : __debug_object_init+0x330/0x364 lr : __debug_object_init+0x330/0x364 sp : ffff800082ea7b40 x29: ffff800082ea7b40 x28: 98ff0000c0164518 x27: 98ff0000c0164534 x26: ffff800082d93ec8 x25: 0000000000000001 x24: 1cff0000c00172a0 x23: 0000000000000000 x22: ffff800082d93ed0 x21: ffff800081a24418 x20: 3eff800082ea7bb0 x19: efff800000000000 x18: 0000000000000000 x17: 00000000000000ff x16: 0000000000000047 x15: 206b63617473206e x14: 0000000000000018 x13: ffff800082ea7780 x12: 0ffff800082ea78e x11: 0ffff800082ea790 x10: 0ffff800082ea79d x9 : 34d77febe173e800 x8 : 34d77febe173e800 x7 : 0000000000000001 x6 : 0000000000000001 x5 : feff800082ea74b8 x4 : ffff800082870a90 x3 : ffff80008018d3c4 x2 : 0000000000000001 x1 : ffff800082858810 x0 : 0000000000000050 Call trace: __debug_object_init+0x330/0x364 debug_object_init_on_stack+0x30/0x3c schedule_hrtimeout_range_clock+0xac/0x26c schedule_hrtimeout+0x1c/0x30 wait_task_inactive+0x1d4/0x25c kthread_bind_mask+0x28/0x98 init_rescuer+0x1e8/0x280 workqueue_init+0x1a0/0x3cc kernel_init_freeable+0x118/0x200 kernel_init+0x28/0x1f0 ret_from_fork+0x10/0x20 ---[ end trace 0000000000000000 ]--- ODEBUG: object 3eff800082ea7bb0 is NOT on stack ffff800082ea0000, but annotated. ------------[ cut here ]------------ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Qun-Wei Lin <[email protected]> Cc: Andrew Yang <[email protected]> Cc: AngeloGioacchino Del Regno <[email protected]> Cc: Casper Li <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chinwen Chang <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Matthias Brugger <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Dec 2, 2024
Use 0 for the values as we use them for the return value on init to keep the test modules simple. This fixes a splat reported do_init_module: 'test_kallsyms_b'->init suspiciously returned 255, it should follow 0/-E convention do_init_module: loading module anyway... CPU: 5 UID: 0 PID: 1873 Comm: modprobe Not tainted 6.12.0-rc3+ #4 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2024.08-1 09/18/2024 Call Trace: <TASK> dump_stack_lvl+0x56/0x80 do_init_module.cold+0x21/0x26 init_module_from_file+0x88/0xf0 idempotent_init_module+0x108/0x300 __x64_sys_finit_module+0x5a/0xb0 do_syscall_64+0x4b/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f4f3a718839 Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff> RSP: 002b:00007fff97d1a9e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 RAX: ffffffffffffffda RBX: 000055b94001ab90 RCX: 00007f4f3a718839 RDX: 0000000000000000 RSI: 000055b910e68a10 RDI: 0000000000000004 RBP: 0000000000000000 R08: 00007f4f3a7f1b20 R09: 000055b94001c5b0 R10: 0000000000000040 R11: 0000000000000246 R12: 000055b910e68a10 R13: 0000000000040000 R14: 000055b94001ad60 R15: 0000000000000000 </TASK> do_init_module: 'test_kallsyms_b'->init suspiciously returned 255, it should follow 0/-E convention do_init_module: loading module anyway... CPU: 1 UID: 0 PID: 1884 Comm: modprobe Not tainted 6.12.0-rc3+ #4 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2024.08-1 09/18/2024 Call Trace: <TASK> dump_stack_lvl+0x56/0x80 do_init_module.cold+0x21/0x26 init_module_from_file+0x88/0xf0 idempotent_init_module+0x108/0x300 __x64_sys_finit_module+0x5a/0xb0 do_syscall_64+0x4b/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7ffaa5d18839 Reported-by: Sami Tolvanen <[email protected]> Reviewed-by: Sami Tolvanen <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]>
1fdf145 to
4032570
Compare
jgunthorpe
pushed a commit
that referenced
this pull request
Apr 2, 2025
The env.pmu_mapping can be leaked when it reads data from a pipe on AMD.
For a pipe data, it reads the header data including pmu_mapping from
PERF_RECORD_HEADER_FEATURE runtime. But it's already set in:
perf_session__new()
__perf_session__new()
evlist__init_trace_event_sample_raw()
evlist__has_amd_ibs()
perf_env__nr_pmu_mappings()
Then it'll overwrite that when it processes the HEADER_FEATURE record.
Here's a report from address sanitizer.
Direct leak of 2689 byte(s) in 1 object(s) allocated from:
#0 0x7fed8f814596 in realloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:98
#1 0x5595a7d416b1 in strbuf_grow util/strbuf.c:64
#2 0x5595a7d414ef in strbuf_init util/strbuf.c:25
#3 0x5595a7d0f4b7 in perf_env__read_pmu_mappings util/env.c:362
#4 0x5595a7d12ab7 in perf_env__nr_pmu_mappings util/env.c:517
#5 0x5595a7d89d2f in evlist__has_amd_ibs util/amd-sample-raw.c:315
torvalds#6 0x5595a7d87fb2 in evlist__init_trace_event_sample_raw util/sample-raw.c:23
torvalds#7 0x5595a7d7f893 in __perf_session__new util/session.c:179
torvalds#8 0x5595a7b79572 in perf_session__new util/session.h:115
torvalds#9 0x5595a7b7e9dc in cmd_report builtin-report.c:1603
torvalds#10 0x5595a7c019eb in run_builtin perf.c:351
torvalds#11 0x5595a7c01c92 in handle_internal_command perf.c:404
torvalds#12 0x5595a7c01deb in run_argv perf.c:448
torvalds#13 0x5595a7c02134 in main perf.c:556
torvalds#14 0x7fed85833d67 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
Let's free the existing pmu_mapping data if any.
Cc: Ravi Bangoria <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Namhyung Kim <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Apr 7, 2025
When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush()
generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC,
which causes the flush_bio to be throttled by wbt_wait().
An example from v5.4, similar problem also exists in upstream:
crash> bt 2091206
PID: 2091206 TASK: ffff2050df92a300 CPU: 109 COMMAND: "kworker/u260:0"
#0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8
#1 [ffff800084a2f820] __schedule at ffff800040bfa0c4
#2 [ffff800084a2f880] schedule at ffff800040bfa4b4
#3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4
#4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc
#5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0
torvalds#6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254
torvalds#7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38
torvalds#8 [ffff800084a2fa60] generic_make_request at ffff800040570138
torvalds#9 [ffff800084a2fae0] submit_bio at ffff8000405703b4
torvalds#10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs]
torvalds#11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs]
torvalds#12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs]
torvalds#13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs]
torvalds#14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs]
torvalds#15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs]
torvalds#16 [ffff800084a2fdb0] process_one_work at ffff800040111d08
torvalds#17 [ffff800084a2fe00] worker_thread at ffff8000401121cc
torvalds#18 [ffff800084a2fe70] kthread at ffff800040118de4
After commit 2def284 ("xfs: don't allow log IO to be throttled"),
the metadata submitted by xlog_write_iclog() should not be throttled.
But due to the existence of the dm layer, throttling flush_bio indirectly
causes the metadata bio to be throttled.
Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes
wbt_should_throttle() return false to avoid wbt_wait().
Signed-off-by: Jinliang Zheng <[email protected]>
Reviewed-by: Tianxiang Peng <[email protected]>
Reviewed-by: Hao Peng <[email protected]>
Signed-off-by: Mikulas Patocka <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Apr 7, 2025
v2:
- Created a single error handling unlock and exit in veth_pool_store
- Greatly expanded commit message with previous explanatory-only text
Summary: Use rtnl_mutex to synchronize veth_pool_store with itself,
ibmveth_close and ibmveth_open, preventing multiple calls in a row to
napi_disable.
Background: Two (or more) threads could call veth_pool_store through
writing to /sys/devices/vio/30000002/pool*/*. You can do this easily
with a little shell script. This causes a hang.
I configured LOCKDEP, compiled ibmveth.c with DEBUG, and built a new
kernel. I ran this test again and saw:
Setting pool0/active to 0
Setting pool1/active to 1
[ 73.911067][ T4365] ibmveth 30000002 eth0: close starting
Setting pool1/active to 1
Setting pool1/active to 0
[ 73.911367][ T4366] ibmveth 30000002 eth0: close starting
[ 73.916056][ T4365] ibmveth 30000002 eth0: close complete
[ 73.916064][ T4365] ibmveth 30000002 eth0: open starting
[ 110.808564][ T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
[ 230.808495][ T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
[ 243.683786][ T123] INFO: task stress.sh:4365 blocked for more than 122 seconds.
[ 243.683827][ T123] Not tainted 6.14.0-01103-g2df0c02dab82-dirty torvalds#8
[ 243.683833][ T123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 243.683838][ T123] task:stress.sh state:D stack:28096 pid:4365 tgid:4365 ppid:4364 task_flags:0x400040 flags:0x00042000
[ 243.683852][ T123] Call Trace:
[ 243.683857][ T123] [c00000000c38f690] [0000000000000001] 0x1 (unreliable)
[ 243.683868][ T123] [c00000000c38f840] [c00000000001f908] __switch_to+0x318/0x4e0
[ 243.683878][ T123] [c00000000c38f8a0] [c000000001549a70] __schedule+0x500/0x12a0
[ 243.683888][ T123] [c00000000c38f9a0] [c00000000154a878] schedule+0x68/0x210
[ 243.683896][ T123] [c00000000c38f9d0] [c00000000154ac80] schedule_preempt_disabled+0x30/0x50
[ 243.683904][ T123] [c00000000c38fa00] [c00000000154dbb0] __mutex_lock+0x730/0x10f0
[ 243.683913][ T123] [c00000000c38fb10] [c000000001154d40] napi_enable+0x30/0x60
[ 243.683921][ T123] [c00000000c38fb40] [c000000000f4ae94] ibmveth_open+0x68/0x5dc
[ 243.683928][ T123] [c00000000c38fbe0] [c000000000f4aa20] veth_pool_store+0x220/0x270
[ 243.683936][ T123] [c00000000c38fc70] [c000000000826278] sysfs_kf_write+0x68/0xb0
[ 243.683944][ T123] [c00000000c38fcb0] [c0000000008240b8] kernfs_fop_write_iter+0x198/0x2d0
[ 243.683951][ T123] [c00000000c38fd00] [c00000000071b9ac] vfs_write+0x34c/0x650
[ 243.683958][ T123] [c00000000c38fdc0] [c00000000071bea8] ksys_write+0x88/0x150
[ 243.683966][ T123] [c00000000c38fe10] [c0000000000317f4] system_call_exception+0x124/0x340
[ 243.683973][ T123] [c00000000c38fe50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
...
[ 243.684087][ T123] Showing all locks held in the system:
[ 243.684095][ T123] 1 lock held by khungtaskd/123:
[ 243.684099][ T123] #0: c00000000278e370 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x50/0x248
[ 243.684114][ T123] 4 locks held by stress.sh/4365:
[ 243.684119][ T123] #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
[ 243.684132][ T123] #1: c000000041aea888 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
[ 243.684143][ T123] #2: c0000000366fb9a8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
[ 243.684155][ T123] #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_enable+0x30/0x60
[ 243.684166][ T123] 5 locks held by stress.sh/4366:
[ 243.684170][ T123] #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
[ 243.684183][ T123] #1: c00000000aee2288 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
[ 243.684194][ T123] #2: c0000000366f4ba8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
[ 243.684205][ T123] #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_disable+0x30/0x60
[ 243.684216][ T123] #4: c0000003ff9bbf18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x138/0x12a0
From the ibmveth debug, two threads are calling veth_pool_store, which
calls ibmveth_close and ibmveth_open. Here's the sequence:
T4365 T4366
----------------- ----------------- ---------
veth_pool_store veth_pool_store
ibmveth_close
ibmveth_close
napi_disable
napi_disable
ibmveth_open
napi_enable <- HANG
ibmveth_close calls napi_disable at the top and ibmveth_open calls
napi_enable at the top.
https://docs.kernel.org/networking/napi.html]] says
The control APIs are not idempotent. Control API calls are safe
against concurrent use of datapath APIs but an incorrect sequence of
control API calls may result in crashes, deadlocks, or race
conditions. For example, calling napi_disable() multiple times in a
row will deadlock.
In the normal open and close paths, rtnl_mutex is acquired to prevent
other callers. This is missing from veth_pool_store. Use rtnl_mutex in
veth_pool_store fixes these hangs.
Signed-off-by: Dave Marquardt <[email protected]>
Fixes: 860f242 ("[PATCH] ibmveth change buffer pools dynamically")
Reviewed-by: Nick Child <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Apr 11, 2025
Commit 7da55c2 ("drm/amd/display: Remove incorrect FP context start") removes the FP context protection of dml2_create(), and it said "All the DC_FP_START/END should be used before call anything from DML2". However, dml2_validate()/dml21_validate() are not protected from their callers, causing such errors: do_fpu invoked from kernel context![#1]: CPU: 10 UID: 0 PID: 331 Comm: kworker/10:1H Not tainted 6.14.0-rc6+ #4 Workqueue: events_highpri dm_irq_work_func [amdgpu] pc ffff800003191eb0 ra ffff800003191e60 tp 9000000107a94000 sp 9000000107a975b0 a0 9000000140ce4910 a1 0000000000000000 a2 9000000140ce49b0 a3 9000000140ce49a8 a4 9000000140ce49a8 a5 0000000100000000 a6 0000000000000001 a7 9000000107a97660 t0 ffff800003790000 t1 9000000140ce5000 t2 0000000000000001 t3 0000000000000000 t4 0000000000000004 t5 0000000000000000 t6 0000000000000000 t7 0000000000000000 t8 0000000100000000 u0 ffff8000031a3b9c s9 9000000130bc0000 s0 9000000132400000 s1 9000000140ec0000 s2 9000000132400000 s3 9000000140ce0000 s4 90000000057f8b88 s5 9000000140ec0000 s6 9000000140ce4910 s7 0000000000000001 s8 9000000130d45010 ra: ffff800003191e60 dml21_map_dc_state_into_dml_display_cfg+0x40/0x1140 [amdgpu] ERA: ffff800003191eb0 dml21_map_dc_state_into_dml_display_cfg+0x90/0x1140 [amdgpu] CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE) PRMD: 00000004 (PPLV0 +PIE -PWE) EUEN: 00000000 (-FPE -SXE -ASXE -BTE) ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7) ESTAT: 000f0000 [FPD] (IS= ECode=15 EsubCode=0) PRID: 0014d010 (Loongson-64bit, Loongson-3C6000/S) Process kworker/10:1H (pid: 331, threadinfo=000000007bf9ddb0, task=00000000cc4ab9f3) Stack : 0000000100000000 0000043800000780 0000000100000001 0000000100000001 0000000000000000 0000078000000000 0000000000000438 0000078000000000 0000000000000438 0000078000000000 0000000000000438 0000000100000000 0000000100000000 0000000100000000 0000000100000000 0000000100000000 0000000000000001 9000000140ec0000 9000000132400000 9000000132400000 ffff800003408000 ffff800003408000 9000000132400000 9000000140ce0000 9000000140ce0000 ffff800003193850 0000000000000001 9000000140ec0000 9000000132400000 9000000140ec0860 9000000140ec0738 0000000000000001 90000001405e8000 9000000130bc0000 9000000140ec02a8 ffff8000031b5db8 0000000000000000 0000043800000780 0000000000000003 ffff8000031b79cc ... Call Trace: [<ffff800003191eb0>] dml21_map_dc_state_into_dml_display_cfg+0x90/0x1140 [amdgpu] [<ffff80000319384c>] dml21_validate+0xcc/0x520 [amdgpu] [<ffff8000031b8948>] dc_validate_global_state+0x2e8/0x460 [amdgpu] [<ffff800002e94034>] create_validate_stream_for_sink+0x3d4/0x420 [amdgpu] [<ffff800002e940e4>] amdgpu_dm_connector_mode_valid+0x64/0x240 [amdgpu] [<900000000441d6b8>] drm_connector_mode_valid+0x38/0x80 [<900000000441d824>] __drm_helper_update_and_validate+0x124/0x3e0 [<900000000441ddc0>] drm_helper_probe_single_connector_modes+0x2e0/0x620 [<90000000044050dc>] drm_client_modeset_probe+0x23c/0x1780 [<9000000004420384>] __drm_fb_helper_initial_config_and_unlock+0x44/0x5a0 [<9000000004403acc>] drm_client_dev_hotplug+0xcc/0x140 [<ffff800002e9ab50>] handle_hpd_irq_helper+0x1b0/0x1e0 [amdgpu] [<90000000038f5da0>] process_one_work+0x160/0x300 [<90000000038f6718>] worker_thread+0x318/0x440 [<9000000003901b8c>] kthread+0x12c/0x220 [<90000000038b1484>] ret_from_kernel_thread+0x8/0xa4 Unfortunately, protecting dml2_validate()/dml21_validate() out of DML2 causes "sleeping function called from invalid context", so protect them with DC_FP_START() and DC_FP_END() inside. Fixes: 7da55c2 ("drm/amd/display: Remove incorrect FP context start") Cc: [email protected] Signed-off-by: Huacai Chen <[email protected]> Tested-by: Dongyan Qian <[email protected]> Reviewed-by: Aurabindo Pillai <[email protected]> Tested-by: Daniel Wheeler <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Apr 14, 2025
If we finds a vq without a name in our input array in
virtio_ccw_find_vqs(), we treat it as "non-existing" and set the vq pointer
to NULL; we will not call virtio_ccw_setup_vq() to allocate/setup a vq.
Consequently, we create only a queue if it actually exists (name != NULL)
and assign an incremental queue index to each such existing queue.
However, in virtio_ccw_register_adapter_ind()->get_airq_indicator() we
will not ignore these "non-existing queues", but instead assign an airq
indicator to them.
Besides never releasing them in virtio_ccw_drop_indicators() (because
there is no virtqueue), the bigger issue seems to be that there will be a
disagreement between the device and the Linux guest about the airq
indicator to be used for notifying a queue, because the indicator bit
for adapter I/O interrupt is derived from the queue index.
The virtio spec states under "Setting Up Two-Stage Queue Indicators":
... indicator contains the guest address of an area wherein the
indicators for the devices are contained, starting at bit_nr, one
bit per virtqueue of the device.
And further in "Notification via Adapter I/O Interrupts":
For notifying the driver of virtqueue buffers, the device sets the
bit in the guest-provided indicator area at the corresponding
offset.
For example, QEMU uses in virtio_ccw_notify() the queue index (passed as
"vector") to select the relevant indicator bit. If a queue does not exist,
it does not have a corresponding indicator bit assigned, because it
effectively doesn't have a queue index.
Using a virtio-balloon-ccw device under QEMU with free-page-hinting
disabled ("free-page-hint=off") but free-page-reporting enabled
("free-page-reporting=on") will result in free page reporting
not working as expected: in the virtio_balloon driver, we'll be stuck
forever in virtballoon_free_page_report()->wait_event(), because the
waitqueue will not be woken up as the notification from the device is
lost: it would use the wrong indicator bit.
Free page reporting stops working and we get splats (when configured to
detect hung wqs) like:
INFO: task kworker/1:3:463 blocked for more than 61 seconds.
Not tainted 6.14.0 #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/1:3 [...]
Workqueue: events page_reporting_process
Call Trace:
[<000002f404e6dfb2>] __schedule+0x402/0x1640
[<000002f404e6f22e>] schedule+0x3e/0xe0
[<000002f3846a88fa>] virtballoon_free_page_report+0xaa/0x110 [virtio_balloon]
[<000002f40435c8a4>] page_reporting_process+0x2e4/0x740
[<000002f403fd3ee2>] process_one_work+0x1c2/0x400
[<000002f403fd4b96>] worker_thread+0x296/0x420
[<000002f403fe10b4>] kthread+0x124/0x290
[<000002f403f4e0dc>] __ret_from_fork+0x3c/0x60
[<000002f404e77272>] ret_from_fork+0xa/0x38
There was recently a discussion [1] whether the "holes" should be
treated differently again, effectively assigning also non-existing
queues a queue index: that should also fix the issue, but requires other
workarounds to not break existing setups.
Let's fix it without affecting existing setups for now by properly ignoring
the non-existing queues, so the indicator bits will match the queue
indexes.
[1] https://lore.kernel.org/all/[email protected]/
Fixes: a229989 ("virtio: don't allocate vqs when names[i] = NULL")
Reported-by: Chandra Merla <[email protected]>
Cc: [email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Tested-by: Thomas Huth <[email protected]>
Reviewed-by: Thomas Huth <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Acked-by: Christian Borntraeger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Heiko Carstens <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Apr 28, 2025
There is a potential deadlock if we do report zones in an IO context, detailed in below lockdep report. When one process do a report zones and another process freezes the block device, the report zones side cannot allocate a tag because the freeze is already started. This can thus result in new block group creation to hang forever, blocking the write path. Thankfully, a new block group should be created on empty zones. So, reporting the zones is not necessary and we can set the write pointer = 0 and load the zone capacity from the block layer using bdev_zone_capacity() helper. ====================================================== WARNING: possible circular locking dependency detected 6.14.0-rc1 torvalds#252 Not tainted ------------------------------------------------------ modprobe/1110 is trying to acquire lock: ffff888100ac83e0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0x38f/0xb60 but task is already holding lock: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}: blk_queue_enter+0x3d9/0x500 blk_mq_alloc_request+0x47d/0x8e0 scsi_execute_cmd+0x14f/0xb80 sd_zbc_do_report_zones+0x1c1/0x470 sd_zbc_report_zones+0x362/0xd60 blkdev_report_zones+0x1b1/0x2e0 btrfs_get_dev_zones+0x215/0x7e0 [btrfs] btrfs_load_block_group_zone_info+0x6d2/0x2c10 [btrfs] btrfs_make_block_group+0x36b/0x870 [btrfs] btrfs_create_chunk+0x147d/0x2320 [btrfs] btrfs_chunk_alloc+0x2ce/0xcf0 [btrfs] start_transaction+0xce6/0x1620 [btrfs] btrfs_uuid_scan_kthread+0x4ee/0x5b0 [btrfs] kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #2 (&fs_info->dev_replace.rwsem){++++}-{4:4}: down_read+0x9b/0x470 btrfs_map_block+0x2ce/0x2ce0 [btrfs] btrfs_submit_chunk+0x2d4/0x16c0 [btrfs] btrfs_submit_bbio+0x16/0x30 [btrfs] btree_write_cache_pages+0xb5a/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #1 (&fs_info->zoned_meta_io_lock){+.+.}-{4:4}: __mutex_lock+0x1aa/0x1360 btree_write_cache_pages+0x252/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}: __lock_acquire+0x2f52/0x5ea0 lock_acquire+0x1b1/0x540 __flush_work+0x3ac/0xb60 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 del_gendisk+0x841/0xa20 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 do_syscall_64+0x93/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: (work_completion)(&(&wb->dwork)->work) --> &fs_info->dev_replace.rwsem --> &q->q_usage_counter(queue)torvalds#16 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&q->q_usage_counter(queue)torvalds#16); lock(&fs_info->dev_replace.rwsem); lock(&q->q_usage_counter(queue)torvalds#16); lock((work_completion)(&(&wb->dwork)->work)); *** DEADLOCK *** 5 locks held by modprobe/1110: #0: ffff88811f7bc108 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #1: ffff8881022ee0e0 (&shost->scan_mutex){+.+.}-{4:4}, at: scsi_remove_host+0x20/0x2a0 #2: ffff88811b4c4378 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #3: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 #4: ffffffffa3284360 (rcu_read_lock){....}-{1:3}, at: __flush_work+0xda/0xb60 stack backtrace: CPU: 0 UID: 0 PID: 1110 Comm: modprobe Not tainted 6.14.0-rc1 torvalds#252 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x6a/0x90 print_circular_bug.cold+0x1e0/0x274 check_noncircular+0x306/0x3f0 ? __pfx_check_noncircular+0x10/0x10 ? mark_lock+0xf5/0x1650 ? __pfx_check_irq_usage+0x10/0x10 ? lockdep_lock+0xca/0x1c0 ? __pfx_lockdep_lock+0x10/0x10 __lock_acquire+0x2f52/0x5ea0 ? __pfx___lock_acquire+0x10/0x10 ? __pfx_mark_lock+0x10/0x10 lock_acquire+0x1b1/0x540 ? __flush_work+0x38f/0xb60 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_lock_release+0x10/0x10 ? mark_held_locks+0x94/0xe0 ? __flush_work+0x38f/0xb60 __flush_work+0x3ac/0xb60 ? __flush_work+0x38f/0xb60 ? __pfx_mark_lock+0x10/0x10 ? __pfx___flush_work+0x10/0x10 ? __pfx_wq_barrier_func+0x10/0x10 ? __pfx___might_resched+0x10/0x10 ? mark_held_locks+0x94/0xe0 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 ? __pfx_bdi_unregister+0x10/0x10 ? up_write+0x1ba/0x510 del_gendisk+0x841/0xa20 ? __pfx_del_gendisk+0x10/0x10 ? _raw_spin_unlock_irqrestore+0x35/0x60 ? __pm_runtime_resume+0x79/0x110 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] ? kernfs_remove_by_name_ns+0xc0/0xf0 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 ? __pfx___mutex_unlock_slowpath+0x10/0x10 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 ? __pfx___do_sys_delete_module.isra.0+0x10/0x10 ? __pfx_slab_free_after_rcu_debug+0x10/0x10 ? kasan_save_stack+0x2c/0x50 ? kasan_record_aux_stack+0xa3/0xb0 ? __call_rcu_common.constprop.0+0xc4/0xfb0 ? kmem_cache_free+0x3a0/0x590 ? __x64_sys_close+0x78/0xd0 do_syscall_64+0x93/0x180 ? lock_is_held_type+0xd5/0x130 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? lockdep_hardirqs_on+0x78/0x100 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? __pfx___call_rcu_common.constprop.0+0x10/0x10 ? kmem_cache_free+0x3a0/0x590 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? __pfx___x64_sys_openat+0x10/0x10 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f436712b68b RSP: 002b:00007ffe9f1a8658 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 00005559b367fd80 RCX: 00007f436712b68b RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00005559b367fde8 RBP: 00007ffe9f1a8680 R08: 1999999999999999 R09: 0000000000000000 R10: 00007f43671a5fe0 R11: 0000000000000206 R12: 0000000000000000 R13: 00007ffe9f1a86b0 R14: 0000000000000000 R15: 0000000000000000 </TASK> Reported-by: Shin'ichiro Kawasaki <[email protected]> CC: <[email protected]> # 6.13+ Tested-by: Shin'ichiro Kawasaki <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Signed-off-by: David Sterba <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
May 13, 2025
When migrating a THP, concurrent access to the PMD migration entry during a deferred split scan can lead to an invalid address access, as illustrated below. To prevent this invalid access, it is necessary to check the PMD migration entry and return early. In this context, there is no need to use pmd_to_swp_entry and pfn_swap_entry_to_page to verify the equality of the target folio. Since the PMD migration entry is locked, it cannot be served as the target. Mailing list discussion and explanation from Hugh Dickins: "An anon_vma lookup points to a location which may contain the folio of interest, but might instead contain another folio: and weeding out those other folios is precisely what the "folio != pmd_folio((*pmd)" check (and the "risk of replacing the wrong folio" comment a few lines above it) is for." BUG: unable to handle page fault for address: ffffea60001db008 CPU: 0 UID: 0 PID: 2199114 Comm: tee Not tainted 6.14.0+ #4 NONE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:split_huge_pmd_locked+0x3b5/0x2b60 Call Trace: <TASK> try_to_migrate_one+0x28c/0x3730 rmap_walk_anon+0x4f6/0x770 unmap_folio+0x196/0x1f0 split_huge_page_to_list_to_order+0x9f6/0x1560 deferred_split_scan+0xac5/0x12a0 shrinker_debugfs_scan_write+0x376/0x470 full_proxy_write+0x15c/0x220 vfs_write+0x2fc/0xcb0 ksys_write+0x146/0x250 do_syscall_64+0x6a/0x120 entry_SYSCALL_64_after_hwframe+0x76/0x7e The bug is found by syzkaller on an internal kernel, then confirmed on upstream. Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/all/[email protected]/ Link: https://lore.kernel.org/all/[email protected]/ Fixes: 84c3fc4 ("mm: thp: check pmd migration entry in common path") Signed-off-by: Gavin Guo <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: Hugh Dickins <[email protected]> Acked-by: Zi Yan <[email protected]> Reviewed-by: Gavin Shan <[email protected]> Cc: Florent Revest <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
May 15, 2025
A warning on driver removal started occurring after commit 9dd05df ("net: warn if NAPI instance wasn't shut down"). Disable tx napi before deleting it in mt76_dma_cleanup(). WARNING: CPU: 4 PID: 18828 at net/core/dev.c:7288 __netif_napi_del_locked+0xf0/0x100 CPU: 4 UID: 0 PID: 18828 Comm: modprobe Not tainted 6.15.0-rc4 #4 PREEMPT(lazy) Hardware name: ASUS System Product Name/PRIME X670E-PRO WIFI, BIOS 3035 09/05/2024 RIP: 0010:__netif_napi_del_locked+0xf0/0x100 Call Trace: <TASK> mt76_dma_cleanup+0x54/0x2f0 [mt76] mt7921_pci_remove+0xd5/0x190 [mt7921e] pci_device_remove+0x47/0xc0 device_release_driver_internal+0x19e/0x200 driver_detach+0x48/0x90 bus_remove_driver+0x6d/0xf0 pci_unregister_driver+0x2e/0xb0 __do_sys_delete_module.isra.0+0x197/0x2e0 do_syscall_64+0x7b/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e Tested with mt7921e but the same pattern can be actually applied to other mt76 drivers calling mt76_dma_cleanup() during removal. Tx napi is enabled in their *_dma_init() functions and only toggled off and on again inside their suspend/resume/reset paths. So it should be okay to disable tx napi in such a generic way. Found by Linux Verification Center (linuxtesting.org). Fixes: 2ac515a ("mt76: mt76x02: use napi polling for tx cleanup") Cc: [email protected] Signed-off-by: Fedor Pchelkin <[email protected]> Tested-by: Ming Yen Hsieh <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Felix Fietkau <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Jun 30, 2025
Before the commit under the Fixes tag below, bnxt_ulp_stop() and bnxt_ulp_start() were always invoked in pairs. After that commit, the new bnxt_ulp_restart() can be invoked after bnxt_ulp_stop() has been called. This may result in the RoCE driver's aux driver .suspend() method being invoked twice. The 2nd bnxt_re_suspend() call will crash when it dereferences a NULL pointer: (NULL ib_device): Handle device suspend call BUG: kernel NULL pointer dereference, address: 0000000000000b78 PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP PTI CPU: 20 UID: 0 PID: 181 Comm: kworker/u96:5 Tainted: G S 6.15.0-rc1 #4 PREEMPT(voluntary) Tainted: [S]=CPU_OUT_OF_SPEC Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017 Workqueue: bnxt_pf_wq bnxt_sp_task [bnxt_en] RIP: 0010:bnxt_re_suspend+0x45/0x1f0 [bnxt_re] Code: 8b 05 a7 3c 5b f5 48 89 44 24 18 31 c0 49 8b 5c 24 08 4d 8b 2c 24 e8 ea 06 0a f4 48 c7 c6 04 60 52 c0 48 89 df e8 1b ce f9 ff <48> 8b 83 78 0b 00 00 48 8b 80 38 03 00 00 a8 40 0f 85 b5 00 00 00 RSP: 0018:ffffa2e84084fd88 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001 RDX: 0000000000000000 RSI: ffffffffb4b6b934 RDI: 00000000ffffffff RBP: ffffa1760954c9c0 R08: 0000000000000000 R09: c0000000ffffdfff R10: 0000000000000001 R11: ffffa2e84084fb50 R12: ffffa176031ef070 R13: ffffa17609775000 R14: ffffa17603adc180 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffffa17daa397000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000b78 CR3: 00000004aaa30003 CR4: 00000000003706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> bnxt_ulp_stop+0x69/0x90 [bnxt_en] bnxt_sp_task+0x678/0x920 [bnxt_en] ? __schedule+0x514/0xf50 process_scheduled_works+0x9d/0x400 worker_thread+0x11c/0x260 ? __pfx_worker_thread+0x10/0x10 kthread+0xfe/0x1e0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2b/0x40 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 Check the BNXT_EN_FLAG_ULP_STOPPED flag and do not proceed if the flag is already set. This will preserve the original symmetrical bnxt_ulp_stop() and bnxt_ulp_start(). Also, inside bnxt_ulp_start(), clear the BNXT_EN_FLAG_ULP_STOPPED flag after taking the mutex to avoid any race condition. And for symmetry, only proceed in bnxt_ulp_start() if the BNXT_EN_FLAG_ULP_STOPPED is set. Fixes: 3c163f3 ("bnxt_en: Optimize recovery path ULP locking in the driver") Signed-off-by: Kalesh AP <[email protected]> Co-developed-by: Michael Chan <[email protected]> Signed-off-by: Michael Chan <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Jul 9, 2025
… context The current use of a mutex to protect the notifier hashtable accesses can lead to issues in the atomic context. It results in the below kernel warnings: | BUG: sleeping function called from invalid context at kernel/locking/mutex.c:258 | in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 9, name: kworker/0:0 | preempt_count: 1, expected: 0 | RCU nest depth: 0, expected: 0 | CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Not tainted 6.14.0 #4 | Workqueue: ffa_pcpu_irq_notification notif_pcpu_irq_work_fn | Call trace: | show_stack+0x18/0x24 (C) | dump_stack_lvl+0x78/0x90 | dump_stack+0x18/0x24 | __might_resched+0x114/0x170 | __might_sleep+0x48/0x98 | mutex_lock+0x24/0x80 | handle_notif_callbacks+0x54/0xe0 | notif_get_and_handle+0x40/0x88 | generic_exec_single+0x80/0xc0 | smp_call_function_single+0xfc/0x1a0 | notif_pcpu_irq_work_fn+0x2c/0x38 | process_one_work+0x14c/0x2b4 | worker_thread+0x2e4/0x3e0 | kthread+0x13c/0x210 | ret_from_fork+0x10/0x20 To address this, replace the mutex with an rwlock to protect the notifier hashtable accesses. This ensures that read-side locking does not sleep and multiple readers can acquire the lock concurrently, avoiding unnecessary contention and potential deadlocks. Writer access remains exclusive, preserving correctness. This change resolves warnings from lockdep about potential sleep in atomic context. Cc: Jens Wiklander <[email protected]> Reported-by: Jérôme Forissier <[email protected]> Closes: OP-TEE/optee_os#7394 Fixes: e057344 ("firmware: arm_ffa: Add interfaces to request notification callbacks") Message-Id: <[email protected]> Reviewed-by: Jens Wiklander <[email protected]> Tested-by: Jens Wiklander <[email protected]> Signed-off-by: Sudeep Holla <[email protected]>
jgunthorpe
added a commit
that referenced
this pull request
Jul 11, 2025
The current algorithm does not work if ACS is turned off, and it is not
clear how this has been missed for so long. Alex suspects a multi-function
modeling of the DSPs may have hidden it, but I could not find a switch
using that scheme.
Most likely people were simply happy they had smaller groups oblivious to
the security problems.
For discussion lets consider a simple topology like the below:
-- DSP 02:00.0 -> End Point A
Root 00:00.0 -> USP 01:00.0 --|
-- DSP 02:03.0 -> End Point B
If ACS is fully activated we expect 00:00.0, 01:00.0, 02:00.0, 02:03.0, A,
B to all have unique single device groups.
If both DSPs have ACS off then we expect 00:00.0 and 01:00.0 to have
unique single device groups while 02:00.0, 02:03.0, A, B are part of one
multi-device group.
If the DSPs have asymmetric ACS, with one fully isolating and one
non-isolating we also expect the above multi-device group result.
Instead the current algorithm always creates unique single device groups
for this topology. It happens because the pci_device_group(DSP)
immediately moves to the USP and computes pci_acs_path_enabled(USP) ==
true and decides the DSP can get a unique group. The pci_device_group(A)
immediately moves to the DSP, sees pci_acs_path_enabled(DSP) == false and
then takes the DSPs group.
The current algorithm has several issues:
1) It implicitly depends on ordering. Since the existing group discovery
only goes in the upstream direction discovering a downstream device
before its upstream will cause the wrong creation of narrower groups.
2) It assumes that if the path from the end point to the root is entirely
ACS isolated then that end point is isolated. This misses cross-traffic
in the asymmetric ACS case.
3) When evaluating a non-isolated DSP it does not check peer DSPs for an
already established group unless the multi-function feature does it.
4) It does not understand the aliasing rule for PCIe to PCI bridges
where the alias is to the subordinate bus. The bridge's RID on the
primary bus is not aliased. This causes the PCIe to PCI bridge to be
wrongly joined to the group with the downstream devices.
As grouping is a security property for VFIO creating incorrectly narrowed
groups is a security problem for the system.
Revise the design to solve these problems.
Explicitly require ordering, or return EPROBE_DEFER if things are out of
order. This avoids silent errors that created smaller groups and solves
problem #1.
Work on busses, not devices. Isolation is a property of the bus, and the
first non-isolated bus should form a group containing all devices
downstream of that bus. If all busses on the path to an end device are
isolated then the end device has a chance to make a single-device group.
Use pci_bus_isolation() to compute the bus's isolation status based on the
ACS flags and technology. pci_bus_isolation() touches a lot of PCI
internals to get the information in the right format.
Add a new flag in the iommu_group to record that the group contains a
non-isolated bus. Any downstream pci_device_group() will see
bus->self->iommu_group is non-isolated and unconditionally join it. This
makes the first non-isolation apply to all downstream devices and solves
problem #2
The bus's non-isolated iommu_group will be stored in either the DSP of
PCIe switch or the bus->self upstream device, depending on the situation.
When storing in the DSP all the DSPs are checked first for a pre-existing
non-isolated iommu_group. When stored in the upstream the flag forces it
to all downstreams. This solves problem #3.
Put the handling of end-device aliases and MFD into pci_get_alias_group()
and only call it in cases where we have a fully isolated path. Otherwise
every downstream device on the bus is going to be joined to the group of
bus->self.
Finally, replace the initial pci_for_each_dma_alias() with a combination
of:
- Directly checking pci_real_dma_dev() and enforcing ordering.
The group should contain both pdev and pci_real_dma_dev(pdev) which is
only possible if pdev is ordered after real_dma_dev. This solves a case
of #1.
- Indirectly relying on pci_bus_isolation() to report legacy PCI busses
as non-isolated, with the enum including the distinction of the PCIe to
PCI bridge being isolated from the downstream. This solves problem #4.
It is very likely this is going to expand iommu_group membership in
existing systems. After all that is the security bug that is being
fixed. Expanding the iommu_groups risks problems for users using VFIO.
The intention is to have a more accurate reflection of the security
properties in the system and should be seen as a security fix. However
people who have ACS disabled may now need to enable it. As such users may
have had good reason for ACS to be disabled I strongly recommend that
backporting of this also include the new config_acs option so that such
users can potentially minimally enable ACS only where needed.
Fixes: 104a1c1 ("iommu/core: Create central IOMMU group lookup/creation interface")
Cc: [email protected]
Signed-off-by: Jason Gunthorpe <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Jul 14, 2025
…ux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.16, take #4 - Gracefully fail initialising pKVM if the interrupt controller isn't GICv3 - Also gracefully fail initialising pKVM if the carveout allocation fails - Fix the computing of the minimum MMIO range required for the host on stage-2 fault - Fix the generation of the GICv3 Maintenance Interrupt in nested mode
jgunthorpe
pushed a commit
that referenced
this pull request
Aug 21, 2025
BPF CI testing report a UAF issue: [ 16.446633] BUG: kernel NULL pointer dereference, address: 000000000000003 0 [ 16.447134] #PF: supervisor read access in kernel mod e [ 16.447516] #PF: error_code(0x0000) - not-present pag e [ 16.447878] PGD 0 P4D 0 [ 16.448063] Oops: Oops: 0000 [#1] PREEMPT SMP NOPT I [ 16.448409] CPU: 0 UID: 0 PID: 9 Comm: kworker/0:1 Tainted: G OE 6.13.0-rc3-g89e8a75fda73-dirty #4 2 [ 16.449124] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODUL E [ 16.449502] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/201 4 [ 16.450201] Workqueue: smc_hs_wq smc_listen_wor k [ 16.450531] RIP: 0010:smc_listen_work+0xc02/0x159 0 [ 16.452158] RSP: 0018:ffffb5ab40053d98 EFLAGS: 0001024 6 [ 16.452526] RAX: 0000000000000001 RBX: 0000000000000002 RCX: 000000000000030 0 [ 16.452994] RDX: 0000000000000280 RSI: 00003513840053f0 RDI: 000000000000000 0 [ 16.453492] RBP: ffffa097808e3800 R08: ffffa09782dba1e0 R09: 000000000000000 5 [ 16.453987] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0978274640 0 [ 16.454497] R13: 0000000000000000 R14: 0000000000000000 R15: ffffa09782d4092 0 [ 16.454996] FS: 0000000000000000(0000) GS:ffffa097bbc00000(0000) knlGS:000000000000000 0 [ 16.455557] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003 3 [ 16.455961] CR2: 0000000000000030 CR3: 0000000102788004 CR4: 0000000000770ef 0 [ 16.456459] PKRU: 5555555 4 [ 16.456654] Call Trace : [ 16.456832] <TASK > [ 16.456989] ? __die+0x23/0x7 0 [ 16.457215] ? page_fault_oops+0x180/0x4c 0 [ 16.457508] ? __lock_acquire+0x3e6/0x249 0 [ 16.457801] ? exc_page_fault+0x68/0x20 0 [ 16.458080] ? asm_exc_page_fault+0x26/0x3 0 [ 16.458389] ? smc_listen_work+0xc02/0x159 0 [ 16.458689] ? smc_listen_work+0xc02/0x159 0 [ 16.458987] ? lock_is_held_type+0x8f/0x10 0 [ 16.459284] process_one_work+0x1ea/0x6d 0 [ 16.459570] worker_thread+0x1c3/0x38 0 [ 16.459839] ? __pfx_worker_thread+0x10/0x1 0 [ 16.460144] kthread+0xe0/0x11 0 [ 16.460372] ? __pfx_kthread+0x10/0x1 0 [ 16.460640] ret_from_fork+0x31/0x5 0 [ 16.460896] ? __pfx_kthread+0x10/0x1 0 [ 16.461166] ret_from_fork_asm+0x1a/0x3 0 [ 16.461453] </TASK > [ 16.461616] Modules linked in: bpf_testmod(OE) [last unloaded: bpf_testmod(OE) ] [ 16.462134] CR2: 000000000000003 0 [ 16.462380] ---[ end trace 0000000000000000 ]--- [ 16.462710] RIP: 0010:smc_listen_work+0xc02/0x1590 The direct cause of this issue is that after smc_listen_out_connected(), newclcsock->sk may be NULL since it will releases the smcsk. Therefore, if the application closes the socket immediately after accept, newclcsock->sk can be NULL. A possible execution order could be as follows: smc_listen_work | userspace ----------------------------------------------------------------- lock_sock(sk) | smc_listen_out_connected() | | \- smc_listen_out | | | \- release_sock | | |- sk->sk_data_ready() | | fd = accept(); | close(fd); | \- socket->sk = NULL; /* newclcsock->sk is NULL now */ SMC_STAT_SERV_SUCC_INC(sock_net(newclcsock->sk)) Since smc_listen_out_connected() will not fail, simply swapping the order of the code can easily fix this issue. Fixes: 3b2dec2 ("net/smc: restructure client and server code in af_smc") Signed-off-by: D. Wythe <[email protected]> Reviewed-by: Guangguan Wang <[email protected]> Reviewed-by: Alexandra Winter <[email protected]> Reviewed-by: Dust Li <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Sep 4, 2025
These iterations require the read lock, otherwise RCU lockdep will splat: ============================= WARNING: suspicious RCU usage 6.17.0-rc3-00014-g31419c045d64 torvalds#6 Tainted: G O ----------------------------- drivers/base/power/main.c:1333 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 5 locks held by rtcwake/547: #0: 00000000643ab418 (sb_writers#6){.+.+}-{0:0}, at: file_start_write+0x2b/0x3a #1: 0000000067a0ca88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x181/0x24b #2: 00000000631eac40 (kn->active#3){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x191/0x24b #3: 00000000609a1308 (system_transition_mutex){+.+.}-{4:4}, at: pm_suspend+0xaf/0x30b #4: 0000000060c0fdb0 (device_links_srcu){.+.+}-{0:0}, at: device_links_read_lock+0x75/0x98 stack backtrace: CPU: 0 UID: 0 PID: 547 Comm: rtcwake Tainted: G O 6.17.0-rc3-00014-g31419c045d64 torvalds#6 VOLUNTARY Tainted: [O]=OOT_MODULE Stack: 223721b3a80 6089eac6 00000001 00000001 ffffff00 6089eac6 00000535 6086e528 721b3ac0 6003c294 00000000 60031fc0 Call Trace: [<600407ed>] show_stack+0x10e/0x127 [<6003c294>] dump_stack_lvl+0x77/0xc6 [<6003c2fd>] dump_stack+0x1a/0x20 [<600bc2f8>] lockdep_rcu_suspicious+0x116/0x13e [<603d8ea1>] dpm_async_suspend_superior+0x117/0x17e [<603d980f>] device_suspend+0x528/0x541 [<603da24b>] dpm_suspend+0x1a2/0x267 [<603da837>] dpm_suspend_start+0x5d/0x72 [<600ca0c9>] suspend_devices_and_enter+0xab/0x736 [...] Add the fourth argument to the iteration to annotate this and avoid the splat. Fixes: 0679963 ("PM: sleep: Make async suspend handle suppliers like parents") Fixes: ed18738 ("PM: sleep: Make async resume handle consumers like children") Signed-off-by: Johannes Berg <[email protected]> Link: https://patch.msgid.link/20250826134348.aba79f6e6299.I9ecf55da46ccf33778f2c018a82e1819d815b348@changeid Signed-off-by: Rafael J. Wysocki <[email protected]>
jgunthorpe
added a commit
that referenced
this pull request
Sep 5, 2025
The current algorithm does not work if ACS is turned off, and it is not
clear how this has been missed for so long. I think it has been avoided
because the kernel command line options to target specific devices and
disable ACS are rarely used.
For discussion lets consider a simple topology like the below:
-- DSP 02:00.0 -> End Point A
Root 00:00.0 -> USP 01:00.0 --|
-- DSP 02:03.0 -> End Point B
If ACS is fully activated we expect 00:00.0, 01:00.0, 02:00.0, 02:03.0, A,
B to all have unique single device groups.
If both DSPs have ACS off then we expect 00:00.0 and 01:00.0 to have
unique single device groups while 02:00.0, 02:03.0, A, B are part of one
multi-device group.
If the DSPs have asymmetric ACS, with one fully isolating and one
non-isolating we also expect the above multi-device group result.
Instead the current algorithm always creates unique single device groups
for this topology. It happens because the pci_device_group(DSP)
immediately moves to the USP and computes pci_acs_path_enabled(USP) ==
true and decides the DSP can get a unique group. The pci_device_group(A)
immediately moves to the DSP, sees pci_acs_path_enabled(DSP) == false and
then takes the DSPs group.
For root-ports a PCIe topology like:
-- Dev 01:00.0
Root 00:00.00 --- Root Port 00:01.0 --|
| -- Dev 01:00.1
|- Dev 00:17.0
Previously would group [00:01.0, 01:00.0, 01:00.1] together if there is no
ACS capability in the root port.
While ACS on root ports is underspecified in the spec, it should still
function as an egress control and limit access to either the MMIO of the
root port itself, or perhaps some other devices upstream of the root
complex - 00:17.0 perhaps in this example.
Historically the grouping in Linux has assumed the root port routes all
traffic into the TA/IOMMU and never bypasses the TA to go to other
functions in the root complex. Following the new understanding that ACS is
required for internal loopback also treat root ports with no ACS
capability as lacking internal loopback as well.
The current algorithm has several issues:
1) It implicitly depends on ordering. Since the existing group discovery
only goes in the upstream direction discovering a downstream device
before its upstream will cause the wrong creation of narrower groups.
2) It assumes that if the path from the end point to the root is entirely
ACS isolated then that end point is isolated. This misses cross-traffic
in the asymmetric ACS case.
3) When evaluating a non-isolated DSP it does not check peer DSPs for an
already established group unless the multi-function feature does it.
4) It does not understand the aliasing rule for PCIe to PCI bridges
where the alias is to the subordinate bus. The bridge's RID on the
primary bus is not aliased. This causes the PCIe to PCI bridge to be
wrongly joined to the group with the downstream devices.
As grouping is a security property for VFIO creating incorrectly narrowed
groups is a security problem for the system.
Revise the design to solve these problems.
Explicitly require ordering, or return EPROBE_DEFER if things are out of
order. This avoids silent errors that created smaller groups and solves
problem #1.
Work on busses, not devices. Isolation is a property of the bus, and the
first non-isolated bus should form a group containing all devices
downstream of that bus. If all busses on the path to an end device are
isolated then the end device has a chance to make a single-device group.
Use pci_bus_isolation() to compute the bus's isolation status based on the
ACS flags and technology. pci_bus_isolation() touches a lot of PCI
internals to get the information in the right format.
Add a new flag in the iommu_group to record that the group contains a
non-isolated bus. Any downstream pci_device_group() will see
bus->self->iommu_group is non-isolated and unconditionally join it. This
makes the first non-isolation apply to all downstream devices and solves
problem #2
The bus's non-isolated iommu_group will be stored in either the DSP of
PCIe switch or the bus->self upstream device, depending on the situation.
When storing in the DSP all the DSPs are checked first for a pre-existing
non-isolated iommu_group. When stored in the upstream the flag forces it
to all downstreams. This solves problem #3.
Put the handling of end-device aliases and MFD into pci_get_alias_group()
and only call it in cases where we have a fully isolated path. Otherwise
every downstream device on the bus is going to be joined to the group of
bus->self.
Finally, replace the initial pci_for_each_dma_alias() with a combination
of:
- Directly checking pci_real_dma_dev() and enforcing ordering.
The group should contain both pdev and pci_real_dma_dev(pdev) which is
only possible if pdev is ordered after real_dma_dev. This solves a case
of #1.
- Indirectly relying on pci_bus_isolation() to report legacy PCI busses
as non-isolated, with the enum including the distinction of the PCIe to
PCI bridge being isolated from the downstream. This solves problem #4.
It is very likely this is going to expand iommu_group membership in
existing systems. After all that is the security bug that is being
fixed. Expanding the iommu_groups risks problems for users using VFIO.
The intention is to have a more accurate reflection of the security
properties in the system and should be seen as a security fix. However
people who have ACS disabled may now need to enable it. As such users may
have had good reason for ACS to be disabled I strongly recommend that
backporting of this also include the new config_acs option so that such
users can potentially minimally enable ACS only where needed.
Fixes: 104a1c1 ("iommu/core: Create central IOMMU group lookup/creation interface")
Cc: [email protected]
Signed-off-by: Jason Gunthorpe <[email protected]>
jgunthorpe
added a commit
that referenced
this pull request
Sep 9, 2025
The current algorithm does not work if ACS is turned off, and it is not
clear how this has been missed for so long. I think it has been avoided
because the kernel command line options to target specific devices and
disable ACS are rarely used.
For discussion lets consider a simple topology like the below:
-- DSP 02:00.0 -> End Point A
Root 00:00.0 -> USP 01:00.0 --|
-- DSP 02:03.0 -> End Point B
If ACS is fully activated we expect 00:00.0, 01:00.0, 02:00.0, 02:03.0, A,
B to all have unique single device groups.
If both DSPs have ACS off then we expect 00:00.0 and 01:00.0 to have
unique single device groups while 02:00.0, 02:03.0, A, B are part of one
multi-device group.
If the DSPs have asymmetric ACS, with one fully isolating and one
non-isolating we also expect the above multi-device group result.
Instead the current algorithm always creates unique single device groups
for this topology. It happens because the pci_device_group(DSP)
immediately moves to the USP and computes pci_acs_path_enabled(USP) ==
true and decides the DSP can get a unique group. The pci_device_group(A)
immediately moves to the DSP, sees pci_acs_path_enabled(DSP) == false and
then takes the DSPs group.
For root-ports a PCIe topology like:
-- Dev 01:00.0
Root 00:00.00 --- Root Port 00:01.0 --|
| -- Dev 01:00.1
|- Dev 00:17.0
Previously would group [00:01.0, 01:00.0, 01:00.1] together if there is no
ACS capability in the root port.
While ACS on root ports is underspecified in the spec, it should still
function as an egress control and limit access to either the MMIO of the
root port itself, or perhaps some other devices upstream of the root
complex - 00:17.0 perhaps in this example.
Historically the grouping in Linux has assumed the root port routes all
traffic into the TA/IOMMU and never bypasses the TA to go to other
functions in the root complex. Following the new understanding that ACS is
required for internal loopback also treat root ports with no ACS
capability as lacking internal loopback as well.
The current algorithm has several issues:
1) It implicitly depends on ordering. Since the existing group discovery
only goes in the upstream direction discovering a downstream device
before its upstream will cause the wrong creation of narrower groups.
2) It assumes that if the path from the end point to the root is entirely
ACS isolated then that end point is isolated. This misses cross-traffic
in the asymmetric ACS case.
3) When evaluating a non-isolated DSP it does not check peer DSPs for an
already established group unless the multi-function feature does it.
4) It does not understand the aliasing rule for PCIe to PCI bridges
where the alias is to the subordinate bus. The bridge's RID on the
primary bus is not aliased. This causes the PCIe to PCI bridge to be
wrongly joined to the group with the downstream devices.
As grouping is a security property for VFIO creating incorrectly narrowed
groups is a security problem for the system.
Revise the design to solve these problems.
Explicitly require ordering, or return EPROBE_DEFER if things are out of
order. This avoids silent errors that created smaller groups and solves
problem #1.
Work on busses, not devices. Isolation is a property of the bus, and the
first non-isolated bus should form a group containing all devices
downstream of that bus. If all busses on the path to an end device are
isolated then the end device has a chance to make a single-device group.
Use pci_bus_isolation() to compute the bus's isolation status based on the
ACS flags and technology. pci_bus_isolation() touches a lot of PCI
internals to get the information in the right format.
Add a new flag in the iommu_group to record that the group contains a
non-isolated bus. Any downstream pci_device_group() will see
bus->self->iommu_group is non-isolated and unconditionally join it. This
makes the first non-isolation apply to all downstream devices and solves
problem #2
The bus's non-isolated iommu_group will be stored in either the DSP of
PCIe switch or the bus->self upstream device, depending on the situation.
When storing in the DSP all the DSPs are checked first for a pre-existing
non-isolated iommu_group. When stored in the upstream the flag forces it
to all downstreams. This solves problem #3.
Put the handling of end-device aliases and MFD into pci_get_alias_group()
and only call it in cases where we have a fully isolated path. Otherwise
every downstream device on the bus is going to be joined to the group of
bus->self.
Finally, replace the initial pci_for_each_dma_alias() with a combination
of:
- Directly checking pci_real_dma_dev() and enforcing ordering.
The group should contain both pdev and pci_real_dma_dev(pdev) which is
only possible if pdev is ordered after real_dma_dev. This solves a case
of #1.
- Indirectly relying on pci_bus_isolation() to report legacy PCI busses
as non-isolated, with the enum including the distinction of the PCIe to
PCI bridge being isolated from the downstream. This solves problem #4.
It is very likely this is going to expand iommu_group membership in
existing systems. After all that is the security bug that is being
fixed. Expanding the iommu_groups risks problems for users using VFIO.
The intention is to have a more accurate reflection of the security
properties in the system and should be seen as a security fix. However
people who have ACS disabled may now need to enable it. As such users may
have had good reason for ACS to be disabled I strongly recommend that
backporting of this also include the new config_acs option so that such
users can potentially minimally enable ACS only where needed.
Fixes: 104a1c1 ("iommu/core: Create central IOMMU group lookup/creation interface")
Cc: [email protected]
Signed-off-by: Jason Gunthorpe <[email protected]>
jgunthorpe
added a commit
that referenced
this pull request
Sep 15, 2025
The current algorithm does not work if ACS is turned off, and it is not
clear how this has been missed for so long. I think it has been avoided
because the kernel command line options to target specific devices and
disable ACS are rarely used.
For discussion consider a simple topology like the below:
-- DSP 02:00.0 -> End Point A
Root 00:00.0 -> USP 01:00.0 --|
-- DSP 02:03.0 -> End Point B
If ACS is fully activated we expect 00:00.0, 01:00.0, 02:00.0, 02:03.0, A,
B to all have unique single device groups.
If both DSPs have ACS off then we expect 00:00.0 and 01:00.0 to have
unique single device groups while 02:00.0, 02:03.0, A, B are part of one
multi-device group.
If the DSPs have asymmetric ACS, with one fully isolating and one
non-isolating we also expect the above multi-device group result.
Instead the current algorithm always creates unique single device groups
for this topology. It happens because the pci_device_group(DSP)
immediately moves to the USP and computes pci_acs_path_enabled(USP) ==
true and decides the DSP can get a unique group. The pci_device_group(A)
immediately moves to the DSP, sees pci_acs_path_enabled(DSP) == false and
then takes the DSP's group.
For Root Ports a PCIe topology like:
-- Dev 01:00.0
Root 00:00.00 --- Root Port 00:01.0 --|
| -- Dev 01:00.1
|- Dev 00:17.0
Previously would group [00:01.0, 01:00.0, 01:00.1] together if there is no
ACS capability in the Root Port.
While ACS on Root Ports is underspecified in the spec, it should still
function as an egress control and limit access to either the BARs of the
Root Port itself, or perhaps some other devices upstream of the Root
Complex - 00:17.0 perhaps in this example.
Historically the grouping in Linux has assumed the Root Port routes all
traffic into the TA/IOMMU and never bypasses the TA to go to other
functions in the Root Complex. Following the new understanding that ACS is
required for internal loopback also treat Root Ports with no ACS
capability as lacking internal loopback as well.
The current algorithm has several issues:
1) It implicitly depends on ordering. Since the existing group discovery
only goes in the upstream direction discovering a downstream device
before its upstream will cause the wrong creation of narrower groups.
2) It assumes that if the path from the end point to the root is entirely
ACS isolated then that end point is isolated. This misses cross-traffic
in the asymmetric ACS case.
3) When evaluating a non-isolated DSP it does not check peer DSPs for an
already established group unless the multi-function feature does it.
4) It does not understand the aliasing rule for PCIe to PCI bridges
where the alias is to the subordinate bus. The bridge's RID on the
primary bus is not aliased. This causes the PCIe to PCI bridge to be
wrongly joined to the group with the downstream devices.
As grouping is a security property for VFIO creating less isolated group
with fewer devices is a security problem for the system.
Revise the design to solve these problems.
Explicitly require probing upstream devices before downstream edvices, or
return EPROBE_DEFER if things are out of order. This avoids silent errors
that created smaller groups and solves problem #1.
Work on buses, not devices. Isolation is a property of the bus, and the
first non-isolated bus should form a group containing all devices
downstream of that bus. If all buses on the path to an end device are
isolated then the end device has a chance to make a single-device group.
Use pci_bus_isolation() to compute the bus's isolation status based on the
ACS flags and technology. pci_bus_isolation() touches a lot of PCI
internals to get the information in the right format.
Add a new flag in the iommu_group to record that the group contains a
non-isolated bus. Any downstream pci_device_group() will see
bus->self->iommu_group is non-isolated and unconditionally join it. This
makes the first non-isolation apply to all downstream devices and solves
problem #2
The bus's non-isolated iommu_group will be stored in either the DSP of
PCIe switch or the bus->self upstream device, depending on the situation.
When storing in the DSP all the DSPs are checked first for a pre-existing
non-isolated iommu_group. When stored in the upstream the flag forces it
to all downstreams. This solves problem #3.
Put the handling of end-device aliases and MFD into pci_get_alias_group()
and only call it in cases where we have a fully isolated path. Otherwise
every downstream device on the bus is going to be joined to the group of
bus->self.
Finally, replace the initial pci_for_each_dma_alias() with a combination
of:
- Directly checking pci_real_dma_dev() and enforcing ordering.
The group should contain both pdev and pci_real_dma_dev(pdev) which is
only possible if pdev is ordered after real_dma_dev. This solves a case
of #1.
- Indirectly relying on pci_bus_isolation() to report legacy PCI buses
as non-isolated, with the enum including the distinction of the PCIe to
PCI bridge being isolated from the downstream. This solves problem #4.
It is very likely this is going to expand iommu_group membership in
existing systems. After all that is the security bug that is being
fixed. Expanding the iommu_groups risks problems for users using VFIO.
The intention is to have a more accurate reflection of the security
properties in the system and should be seen as a security fix. However
people who have ACS disabled may now need to enable it. As such users may
have had good reason for ACS to be disabled I strongly recommend that
backporting of this also include the new config_acs option so that such
users can potentially minimally enable ACS only where needed.
Fixes: 104a1c1 ("iommu/core: Create central IOMMU group lookup/creation interface")
Cc: [email protected]
Reviewed-by: Donald Dutile <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Oct 20, 2025
Since blamed commit, unregister_netdevice_many_notify() takes the netdev
mutex if the device needs it.
If the device list is too long, this will lock more device mutexes than
lockdep can handle:
unshare -n \
bash -c 'for i in $(seq 1 100);do ip link add foo$i type dummy;done'
BUG: MAX_LOCK_DEPTH too low!
turning off the locking correctness validator.
depth: 48 max: 48!
48 locks held by kworker/u16:1/69:
#0: ..148 ((wq_completion)netns){+.+.}-{0:0}, at: process_one_work
#1: ..d40 (net_cleanup_work){+.+.}-{0:0}, at: process_one_work
#2: ..bd0 (pernet_ops_rwsem){++++}-{4:4}, at: cleanup_net
#3: ..aa8 (rtnl_mutex){+.+.}-{4:4}, at: default_device_exit_batch
#4: ..cb0 (&dev_instance_lock_key#3){+.+.}-{4:4}, at: unregister_netdevice_many_notify
[..]
Add a helper to close and then unlock a list of net_devices.
Devices that are not up have to be skipped - netif_close_many always
removes them from the list without any other actions taken, so they'd
remain in locked state.
Close devices whenever we've used up half of the tracking slots or we
processed entire list without hitting the limit.
Fixes: 7e4d784 ("net: hold netdev instance lock during rtnetlink operations")
Signed-off-by: Florian Westphal <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Nov 3, 2025
The original code causes a circular locking dependency found by lockdep. ====================================================== WARNING: possible circular locking dependency detected 6.16.0-rc6-lgci-xe-xe-pw-151626v3+ #1 Tainted: G S U ------------------------------------------------------ xe_fault_inject/5091 is trying to acquire lock: ffff888156815688 ((work_completion)(&(&devcd->del_wk)->work)){+.+.}-{0:0}, at: __flush_work+0x25d/0x660 but task is already holding lock: ffff888156815620 (&devcd->mutex){+.+.}-{3:3}, at: dev_coredump_put+0x3f/0xa0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&devcd->mutex){+.+.}-{3:3}: mutex_lock_nested+0x4e/0xc0 devcd_data_write+0x27/0x90 sysfs_kf_bin_write+0x80/0xf0 kernfs_fop_write_iter+0x169/0x220 vfs_write+0x293/0x560 ksys_write+0x72/0xf0 __x64_sys_write+0x19/0x30 x64_sys_call+0x2bf/0x2660 do_syscall_64+0x93/0xb60 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (kn->active#236){++++}-{0:0}: kernfs_drain+0x1e2/0x200 __kernfs_remove+0xae/0x400 kernfs_remove_by_name_ns+0x5d/0xc0 remove_files+0x54/0x70 sysfs_remove_group+0x3d/0xa0 sysfs_remove_groups+0x2e/0x60 device_remove_attrs+0xc7/0x100 device_del+0x15d/0x3b0 devcd_del+0x19/0x30 process_one_work+0x22b/0x6f0 worker_thread+0x1e8/0x3d0 kthread+0x11c/0x250 ret_from_fork+0x26c/0x2e0 ret_from_fork_asm+0x1a/0x30 -> #0 ((work_completion)(&(&devcd->del_wk)->work)){+.+.}-{0:0}: __lock_acquire+0x1661/0x2860 lock_acquire+0xc4/0x2f0 __flush_work+0x27a/0x660 flush_delayed_work+0x5d/0xa0 dev_coredump_put+0x63/0xa0 xe_driver_devcoredump_fini+0x12/0x20 [xe] devm_action_release+0x12/0x30 release_nodes+0x3a/0x120 devres_release_all+0x8a/0xd0 device_unbind_cleanup+0x12/0x80 device_release_driver_internal+0x23a/0x280 device_driver_detach+0x14/0x20 unbind_store+0xaf/0xc0 drv_attr_store+0x21/0x50 sysfs_kf_write+0x4a/0x80 kernfs_fop_write_iter+0x169/0x220 vfs_write+0x293/0x560 ksys_write+0x72/0xf0 __x64_sys_write+0x19/0x30 x64_sys_call+0x2bf/0x2660 do_syscall_64+0x93/0xb60 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: (work_completion)(&(&devcd->del_wk)->work) --> kn->active#236 --> &devcd->mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&devcd->mutex); lock(kn->active#236); lock(&devcd->mutex); lock((work_completion)(&(&devcd->del_wk)->work)); *** DEADLOCK *** 5 locks held by xe_fault_inject/5091: #0: ffff8881129f9488 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x72/0xf0 #1: ffff88810c755078 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x123/0x220 #2: ffff8881054811a0 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x55/0x280 #3: ffff888156815620 (&devcd->mutex){+.+.}-{3:3}, at: dev_coredump_put+0x3f/0xa0 #4: ffffffff8359e020 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x72/0x660 stack backtrace: CPU: 14 UID: 0 PID: 5091 Comm: xe_fault_inject Tainted: G S U 6.16.0-rc6-lgci-xe-xe-pw-151626v3+ #1 PREEMPT_{RT,(lazy)} Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER Hardware name: Micro-Star International Co., Ltd. MS-7D25/PRO Z690-A DDR4(MS-7D25), BIOS 1.10 12/13/2021 Call Trace: <TASK> dump_stack_lvl+0x91/0xf0 dump_stack+0x10/0x20 print_circular_bug+0x285/0x360 check_noncircular+0x135/0x150 ? register_lock_class+0x48/0x4a0 __lock_acquire+0x1661/0x2860 lock_acquire+0xc4/0x2f0 ? __flush_work+0x25d/0x660 ? mark_held_locks+0x46/0x90 ? __flush_work+0x25d/0x660 __flush_work+0x27a/0x660 ? __flush_work+0x25d/0x660 ? trace_hardirqs_on+0x1e/0xd0 ? __pfx_wq_barrier_func+0x10/0x10 flush_delayed_work+0x5d/0xa0 dev_coredump_put+0x63/0xa0 xe_driver_devcoredump_fini+0x12/0x20 [xe] devm_action_release+0x12/0x30 release_nodes+0x3a/0x120 devres_release_all+0x8a/0xd0 device_unbind_cleanup+0x12/0x80 device_release_driver_internal+0x23a/0x280 ? bus_find_device+0xa8/0xe0 device_driver_detach+0x14/0x20 unbind_store+0xaf/0xc0 drv_attr_store+0x21/0x50 sysfs_kf_write+0x4a/0x80 kernfs_fop_write_iter+0x169/0x220 vfs_write+0x293/0x560 ksys_write+0x72/0xf0 __x64_sys_write+0x19/0x30 x64_sys_call+0x2bf/0x2660 do_syscall_64+0x93/0xb60 ? __f_unlock_pos+0x15/0x20 ? __x64_sys_getdents64+0x9b/0x130 ? __pfx_filldir64+0x10/0x10 ? do_syscall_64+0x1a2/0xb60 ? clear_bhb_loop+0x30/0x80 ? clear_bhb_loop+0x30/0x80 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x76e292edd574 Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d d5 ea 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89 RSP: 002b:00007fffe247a828 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000076e292edd574 RDX: 000000000000000c RSI: 00006267f6306063 RDI: 000000000000000b RBP: 000000000000000c R08: 000076e292fc4b20 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000202 R12: 00006267f6306063 R13: 000000000000000b R14: 00006267e6859c00 R15: 000076e29322a000 </TASK> xe 0000:03:00.0: [drm] Xe device coredump has been deleted. Fixes: 01daccf ("devcoredump : Serialize devcd_del work") Cc: Mukesh Ojha <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Johannes Berg <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Danilo Krummrich <[email protected]> Cc: [email protected] Cc: [email protected] # v6.1+ Signed-off-by: Maarten Lankhorst <[email protected]> Cc: Matthew Brost <[email protected]> Acked-by: Mukesh Ojha <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Nov 3, 2025
…ked_roots() If fs_info->super_copy or fs_info->super_for_commit allocated failed in btrfs_get_tree_subvol(), then no need to call btrfs_free_fs_info(). Otherwise btrfs_check_leaked_roots() would access NULL pointer because fs_info->allocated_roots had not been initialised. syzkaller reported the following information: ------------[ cut here ]------------ BUG: unable to handle page fault for address: fffffffffffffbb0 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 64c9067 P4D 64c9067 PUD 64cb067 PMD 0 Oops: Oops: 0000 [#1] SMP KASAN PTI CPU: 0 UID: 0 PID: 1402 Comm: syz.1.35 Not tainted 6.15.8 #4 PREEMPT(lazy) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), (...) RIP: 0010:arch_atomic_read arch/x86/include/asm/atomic.h:23 [inline] RIP: 0010:raw_atomic_read include/linux/atomic/atomic-arch-fallback.h:457 [inline] RIP: 0010:atomic_read include/linux/atomic/atomic-instrumented.h:33 [inline] RIP: 0010:refcount_read include/linux/refcount.h:170 [inline] RIP: 0010:btrfs_check_leaked_roots+0x18f/0x2c0 fs/btrfs/disk-io.c:1230 [...] Call Trace: <TASK> btrfs_free_fs_info+0x310/0x410 fs/btrfs/disk-io.c:1280 btrfs_get_tree_subvol+0x592/0x6b0 fs/btrfs/super.c:2029 btrfs_get_tree+0x63/0x80 fs/btrfs/super.c:2097 vfs_get_tree+0x98/0x320 fs/super.c:1759 do_new_mount+0x357/0x660 fs/namespace.c:3899 path_mount+0x716/0x19c0 fs/namespace.c:4226 do_mount fs/namespace.c:4239 [inline] __do_sys_mount fs/namespace.c:4450 [inline] __se_sys_mount fs/namespace.c:4427 [inline] __x64_sys_mount+0x28c/0x310 fs/namespace.c:4427 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x92/0x180 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f032eaffa8d [...] Fixes: 3bb17a2 ("btrfs: add get_tree callback for new mount API") CC: [email protected] # 6.12+ Reviewed-by: Daniel Vacek <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Dewei Meng <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Nov 7, 2025
Michael Chan says: ==================== bnxt_en: Bug fixes Patches 1, 3, and 4 are bug fixes related to the FW log tracing driver coredump feature recently added in 6.13. Patch #1 adds the necessary call to shutdown the FW logging DMA during PCI shutdown. Patch #3 fixes a possible null pointer derefernce when using early versions of the FW with this feature. Patch #4 adds the coredump header information unconditionally to make it more robust. Patch #2 fixes a possible memory leak during PTP shutdown. Patch #5 eliminates a dmesg warning when doing devlink reload. ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
jgunthorpe
pushed a commit
that referenced
this pull request
Nov 21, 2025
On completion of i915_vma_pin_ww(), a synchronous variant of dma_fence_work_commit() is called. When pinning a VMA to GGTT address space on a Cherry View family processor, or on a Broxton generation SoC with VTD enabled, i.e., when stop_machine() is then called from intel_ggtt_bind_vma(), that can potentially lead to lock inversion among reservation_ww and cpu_hotplug locks. [86.861179] ====================================================== [86.861193] WARNING: possible circular locking dependency detected [86.861209] 6.15.0-rc5-CI_DRM_16515-gca0305cadc2d+ #1 Tainted: G U [86.861226] ------------------------------------------------------ [86.861238] i915_module_loa/1432 is trying to acquire lock: [86.861252] ffffffff83489090 (cpu_hotplug_lock){++++}-{0:0}, at: stop_machine+0x1c/0x50 [86.861290] but task is already holding lock: [86.861303] ffffc90002e0b4c8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_vma_pin.constprop.0+0x39/0x1d0 [i915] [86.862233] which lock already depends on the new lock. [86.862251] the existing dependency chain (in reverse order) is: [86.862265] -> #5 (reservation_ww_class_mutex){+.+.}-{3:3}: [86.862292] dma_resv_lockdep+0x19a/0x390 [86.862315] do_one_initcall+0x60/0x3f0 [86.862334] kernel_init_freeable+0x3cd/0x680 [86.862353] kernel_init+0x1b/0x200 [86.862369] ret_from_fork+0x47/0x70 [86.862383] ret_from_fork_asm+0x1a/0x30 [86.862399] -> #4 (reservation_ww_class_acquire){+.+.}-{0:0}: [86.862425] dma_resv_lockdep+0x178/0x390 [86.862440] do_one_initcall+0x60/0x3f0 [86.862454] kernel_init_freeable+0x3cd/0x680 [86.862470] kernel_init+0x1b/0x200 [86.862482] ret_from_fork+0x47/0x70 [86.862495] ret_from_fork_asm+0x1a/0x30 [86.862509] -> #3 (&mm->mmap_lock){++++}-{3:3}: [86.862531] down_read_killable+0x46/0x1e0 [86.862546] lock_mm_and_find_vma+0xa2/0x280 [86.862561] do_user_addr_fault+0x266/0x8e0 [86.862578] exc_page_fault+0x8a/0x2f0 [86.862593] asm_exc_page_fault+0x27/0x30 [86.862607] filldir64+0xeb/0x180 [86.862620] kernfs_fop_readdir+0x118/0x480 [86.862635] iterate_dir+0xcf/0x2b0 [86.862648] __x64_sys_getdents64+0x84/0x140 [86.862661] x64_sys_call+0x1058/0x2660 [86.862675] do_syscall_64+0x91/0xe90 [86.862689] entry_SYSCALL_64_after_hwframe+0x76/0x7e [86.862703] -> #2 (&root->kernfs_rwsem){++++}-{3:3}: [86.862725] down_write+0x3e/0xf0 [86.862738] kernfs_add_one+0x30/0x3c0 [86.862751] kernfs_create_dir_ns+0x53/0xb0 [86.862765] internal_create_group+0x134/0x4c0 [86.862779] sysfs_create_group+0x13/0x20 [86.862792] topology_add_dev+0x1d/0x30 [86.862806] cpuhp_invoke_callback+0x4b5/0x850 [86.862822] cpuhp_issue_call+0xbf/0x1f0 [86.862836] __cpuhp_setup_state_cpuslocked+0x111/0x320 [86.862852] __cpuhp_setup_state+0xb0/0x220 [86.862866] topology_sysfs_init+0x30/0x50 [86.862879] do_one_initcall+0x60/0x3f0 [86.862893] kernel_init_freeable+0x3cd/0x680 [86.862908] kernel_init+0x1b/0x200 [86.862921] ret_from_fork+0x47/0x70 [86.862934] ret_from_fork_asm+0x1a/0x30 [86.862947] -> #1 (cpuhp_state_mutex){+.+.}-{3:3}: [86.862969] __mutex_lock+0xaa/0xed0 [86.862982] mutex_lock_nested+0x1b/0x30 [86.862995] __cpuhp_setup_state_cpuslocked+0x67/0x320 [86.863012] __cpuhp_setup_state+0xb0/0x220 [86.863026] page_alloc_init_cpuhp+0x2d/0x60 [86.863041] mm_core_init+0x22/0x2d0 [86.863054] start_kernel+0x576/0xbd0 [86.863068] x86_64_start_reservations+0x18/0x30 [86.863084] x86_64_start_kernel+0xbf/0x110 [86.863098] common_startup_64+0x13e/0x141 [86.863114] -> #0 (cpu_hotplug_lock){++++}-{0:0}: [86.863135] __lock_acquire+0x1635/0x2810 [86.863152] lock_acquire+0xc4/0x2f0 [86.863166] cpus_read_lock+0x41/0x100 [86.863180] stop_machine+0x1c/0x50 [86.863194] bxt_vtd_ggtt_insert_entries__BKL+0x3b/0x60 [i915] [86.863987] intel_ggtt_bind_vma+0x43/0x70 [i915] [86.864735] __vma_bind+0x55/0x70 [i915] [86.865510] fence_work+0x26/0xa0 [i915] [86.866248] fence_notify+0xa1/0x140 [i915] [86.866983] __i915_sw_fence_complete+0x8f/0x270 [i915] [86.867719] i915_sw_fence_commit+0x39/0x60 [i915] [86.868453] i915_vma_pin_ww+0x462/0x1360 [i915] [86.869228] i915_vma_pin.constprop.0+0x133/0x1d0 [i915] [86.870001] initial_plane_vma+0x307/0x840 [i915] [86.870774] intel_initial_plane_config+0x33f/0x670 [i915] [86.871546] intel_display_driver_probe_nogem+0x1c6/0x260 [i915] [86.872330] i915_driver_probe+0x7fa/0xe80 [i915] [86.873057] i915_pci_probe+0xe6/0x220 [i915] [86.873782] local_pci_probe+0x47/0xb0 [86.873802] pci_device_probe+0xf3/0x260 [86.873817] really_probe+0xf1/0x3c0 [86.873833] __driver_probe_device+0x8c/0x180 [86.873848] driver_probe_device+0x24/0xd0 [86.873862] __driver_attach+0x10f/0x220 [86.873876] bus_for_each_dev+0x7f/0xe0 [86.873892] driver_attach+0x1e/0x30 [86.873904] bus_add_driver+0x151/0x290 [86.873917] driver_register+0x5e/0x130 [86.873931] __pci_register_driver+0x7d/0x90 [86.873945] i915_pci_register_driver+0x23/0x30 [i915] [86.874678] i915_init+0x37/0x120 [i915] [86.875347] do_one_initcall+0x60/0x3f0 [86.875369] do_init_module+0x97/0x2a0 [86.875385] load_module+0x2c54/0x2d80 [86.875398] init_module_from_file+0x96/0xe0 [86.875413] idempotent_init_module+0x117/0x330 [86.875426] __x64_sys_finit_module+0x77/0x100 [86.875440] x64_sys_call+0x24de/0x2660 [86.875454] do_syscall_64+0x91/0xe90 [86.875470] entry_SYSCALL_64_after_hwframe+0x76/0x7e [86.875486] other info that might help us debug this: [86.875502] Chain exists of: cpu_hotplug_lock --> reservation_ww_class_acquire --> reservation_ww_class_mutex [86.875539] Possible unsafe locking scenario: [86.875552] CPU0 CPU1 [86.875563] ---- ---- [86.875573] lock(reservation_ww_class_mutex); [86.875588] lock(reservation_ww_class_acquire); [86.875606] lock(reservation_ww_class_mutex); [86.875624] rlock(cpu_hotplug_lock); [86.875637] *** DEADLOCK *** [86.875650] 3 locks held by i915_module_loa/1432: [86.875663] #0: ffff888101f5c1b0 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x104/0x220 [86.875699] #1: ffffc90002e0b4a0 (reservation_ww_class_acquire){+.+.}-{0:0}, at: i915_vma_pin.constprop.0+0x39/0x1d0 [i915] [86.876512] #2: ffffc90002e0b4c8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_vma_pin.constprop.0+0x39/0x1d0 [i915] [86.877305] stack backtrace: [86.877326] CPU: 0 UID: 0 PID: 1432 Comm: i915_module_loa Tainted: G U 6.15.0-rc5-CI_DRM_16515-gca0305cadc2d+ #1 PREEMPT(voluntary) [86.877334] Tainted: [U]=USER [86.877336] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0079.2020.0420.1316 04/20/2020 [86.877339] Call Trace: [86.877344] <TASK> [86.877353] dump_stack_lvl+0x91/0xf0 [86.877364] dump_stack+0x10/0x20 [86.877369] print_circular_bug+0x285/0x360 [86.877379] check_noncircular+0x135/0x150 [86.877390] __lock_acquire+0x1635/0x2810 [86.877403] lock_acquire+0xc4/0x2f0 [86.877408] ? stop_machine+0x1c/0x50 [86.877422] ? __pfx_bxt_vtd_ggtt_insert_entries__cb+0x10/0x10 [i915] [86.878173] cpus_read_lock+0x41/0x100 [86.878182] ? stop_machine+0x1c/0x50 [86.878191] ? __pfx_bxt_vtd_ggtt_insert_entries__cb+0x10/0x10 [i915] [86.878916] stop_machine+0x1c/0x50 [86.878927] bxt_vtd_ggtt_insert_entries__BKL+0x3b/0x60 [i915] [86.879652] intel_ggtt_bind_vma+0x43/0x70 [i915] [86.880375] __vma_bind+0x55/0x70 [i915] [86.881133] fence_work+0x26/0xa0 [i915] [86.881851] fence_notify+0xa1/0x140 [i915] [86.882566] __i915_sw_fence_complete+0x8f/0x270 [i915] [86.883286] i915_sw_fence_commit+0x39/0x60 [i915] [86.884003] i915_vma_pin_ww+0x462/0x1360 [i915] [86.884756] ? i915_vma_pin.constprop.0+0x6c/0x1d0 [i915] [86.885513] i915_vma_pin.constprop.0+0x133/0x1d0 [i915] [86.886281] initial_plane_vma+0x307/0x840 [i915] [86.887049] intel_initial_plane_config+0x33f/0x670 [i915] [86.887819] intel_display_driver_probe_nogem+0x1c6/0x260 [i915] [86.888587] i915_driver_probe+0x7fa/0xe80 [i915] [86.889293] ? mutex_unlock+0x12/0x20 [86.889301] ? drm_privacy_screen_get+0x171/0x190 [86.889308] ? acpi_dev_found+0x66/0x80 [86.889321] i915_pci_probe+0xe6/0x220 [i915] [86.890038] local_pci_probe+0x47/0xb0 [86.890049] pci_device_probe+0xf3/0x260 [86.890058] really_probe+0xf1/0x3c0 [86.890067] __driver_probe_device+0x8c/0x180 [86.890072] driver_probe_device+0x24/0xd0 [86.890078] __driver_attach+0x10f/0x220 [86.890083] ? __pfx___driver_attach+0x10/0x10 [86.890088] bus_for_each_dev+0x7f/0xe0 [86.890097] driver_attach+0x1e/0x30 [86.890101] bus_add_driver+0x151/0x290 [86.890107] driver_register+0x5e/0x130 [86.890113] __pci_register_driver+0x7d/0x90 [86.890119] i915_pci_register_driver+0x23/0x30 [i915] [86.890833] i915_init+0x37/0x120 [i915] [86.891482] ? __pfx_i915_init+0x10/0x10 [i915] [86.892135] do_one_initcall+0x60/0x3f0 [86.892145] ? __kmalloc_cache_noprof+0x33f/0x470 [86.892157] do_init_module+0x97/0x2a0 [86.892164] load_module+0x2c54/0x2d80 [86.892168] ? __kernel_read+0x15c/0x300 [86.892185] ? kernel_read_file+0x2b1/0x320 [86.892195] init_module_from_file+0x96/0xe0 [86.892199] ? init_module_from_file+0x96/0xe0 [86.892211] idempotent_init_module+0x117/0x330 [86.892224] __x64_sys_finit_module+0x77/0x100 [86.892230] x64_sys_call+0x24de/0x2660 [86.892236] do_syscall_64+0x91/0xe90 [86.892243] ? irqentry_exit+0x77/0xb0 [86.892249] ? sysvec_apic_timer_interrupt+0x57/0xc0 [86.892256] entry_SYSCALL_64_after_hwframe+0x76/0x7e [86.892261] RIP: 0033:0x7303e1b2725d [86.892271] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b bb 0d 00 f7 d8 64 89 01 48 [86.892276] RSP: 002b:00007ffddd1fdb38 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 [86.892281] RAX: ffffffffffffffda RBX: 00005d771d88fd90 RCX: 00007303e1b2725d [86.892285] RDX: 0000000000000000 RSI: 00005d771d893aa0 RDI: 000000000000000c [86.892287] RBP: 00007ffddd1fdbf0 R08: 0000000000000040 R09: 00007ffddd1fdb80 [86.892289] R10: 00007303e1c03b20 R11: 0000000000000246 R12: 00005d771d893aa0 [86.892292] R13: 0000000000000000 R14: 00005d771d88f0d0 R15: 00005d771d895710 [86.892304] </TASK> Call asynchronous variant of dma_fence_work_commit() in that case. v3: Provide more verbose in-line comment (Andi), - mention target environments in commit message. Fixes: 7d1c261 ("drm/i915: Take reservation lock around i915_vma_pin.") Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14985 Cc: Andi Shyti <[email protected]> Signed-off-by: Janusz Krzysztofik <[email protected]> Reviewed-by: Sebastian Brzezinka <[email protected]> Reviewed-by: Krzysztof Karas <[email protected]> Acked-by: Andi Shyti <[email protected]> Signed-off-by: Andi Shyti <[email protected]> Link: https://lore.kernel.org/r/[email protected] (cherry picked from commit 648ef13) Signed-off-by: Rodrigo Vivi <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.