
Conversation

kernel-patches-daemon-bpf[bot]

Pull request for series with
subject: bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated.
version: 4
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=996706

q2ven added 5 commits August 28, 2025 18:02
If memcg is enabled, accept() acquires lock_sock() twice for each new
TCP/MPTCP socket in inet_csk_accept() and __inet_accept().

Let's move memcg operations from inet_csk_accept() to __inet_accept().

Note that SCTP somehow allocates a new socket with sk_alloc() in
sk->sk_prot->accept() and clones fields manually, instead of using
sk_clone_lock().

mem_cgroup_sk_alloc() is called for SCTP before __inet_accept(),
so I added the protocol check in __inet_accept(), but this can be
removed once SCTP uses sk_clone_lock().
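
Roughly, the intended shape is the following sketch (illustrative only,
not the actual diff; the SCTP condition stands in for whatever check the
patch ends up using):

  void __inet_accept(struct socket *sock, struct socket *newsock,
                     struct sock *newsk)
  {
          /* runs under lock_sock(newsk) */

          /* Moved here from inet_csk_accept().  SCTP already set up
           * sk->sk_memcg in sk->sk_prot->accept(), so skip it.
           */
          if (mem_cgroup_sockets_enabled &&
              newsk->sk_protocol != IPPROTO_SCTP)
                  mem_cgroup_sk_alloc(newsk);

          /* ... rest of __inet_accept() unchanged ... */
  }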

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
We will store a flag in sk->sk_memcg by bpf_setsockopt() at the
BPF_CGROUP_INET_SOCK_CREATE hook.

BPF_CGROUP_INET_SOCK_CREATE is invoked by __cgroup_bpf_run_filter_sk()
that passes a pointer to struct sock to the bpf prog as void *ctx.

But there is no bpf_func_proto for bpf_setsockopt() that receives
the ctx as a pointer to struct sock.

Also, bpf_getsockopt() will be necessary for a cgroup with multiple
bpf progs running.

Let's add new bpf_setsockopt() and bpf_getsockopt() variants for
BPF_CGROUP_INET_SOCK_CREATE.

Note that inet_create() is not under lock_sock() and has the same
semantics as bpf_lsm_unlocked_sockopt_hooks.
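
For illustration, a minimal cgroup/sock_create program using the new
variant could look like this (SK_BPF_MEMCG_FLAGS and
SK_BPF_MEMCG_SOCK_ISOLATED are the names added later in this series;
the rest is ordinary libbpf boilerplate):

  // SPDX-License-Identifier: GPL-2.0
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  #ifndef SOL_SOCKET
  #define SOL_SOCKET 1
  #endif

  SEC("cgroup/sock_create")
  int isolate_sk_memcg(struct bpf_sock *ctx)
  {
          __u32 flags = SK_BPF_MEMCG_SOCK_ISOLATED;

          /* ctx is the socket being created; the new func_proto
           * accepts it here.
           */
          bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
                         &flags, sizeof(flags));

          return 1;       /* allow socket creation */
  }

  char _license[] SEC("license") = "GPL";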

Signed-off-by: Kuniyuki Iwashima <[email protected]>
We will decouple sockets from the global protocol memory accounting
if sockets have SK_BPF_MEMCG_SOCK_ISOLATED.

This can be flagged (and cleared) at the BPF_CGROUP_INET_SOCK_CREATE
hook by bpf_setsockopt() and is inherited by child sockets.

  u32 flags = SK_BPF_MEMCG_SOCK_ISOLATED;

  bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
                 &flags, sizeof(flags));

bpf_setsockopt(SK_BPF_MEMCG_FLAGS) is supported only in
bpf_unlocked_sock_setsockopt() and not on other hooks, for the
following reasons:

  1. UDP charges memory under sk->sk_receive_queue.lock instead
     of lock_sock()

  2. For TCP child sockets, memory accounting is adjusted only in
     __inet_accept(), to which the sk->sk_memcg allocation is deferred

  3. Modifying the flag after skb is charged to sk requires such
     adjustment during bpf_setsockopt() and complicates the logic
     unnecessarily

We can support other hooks later if a real use case justifies that.

OTOH, bpf_getsockopt() is supported on other hooks, e.g. bpf_iter,
for debugging purposes.

Given sk->sk_memcg can be accessed in the fast path, it would
be preferable to place the flag field in the same cache line as
sk->sk_memcg.

However, struct sock does not have such a 1-byte hole.

Let's store the flag in the lowest bit of sk->sk_memcg and add
a helper to check the bit.

In the next patch, if mem_cgroup_sk_isolated() returns true,
the socket will not be charged to sk->sk_prot->memory_allocated.
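
A minimal sketch of the bit-in-pointer scheme (mem_cgroup_sk_isolated()
is the helper named above; the mask and the untagging helper are
illustrative and may differ from the actual patch):

  #define SK_MEMCG_SOCK_ISOLATED  1UL     /* bit 0 of sk->sk_memcg */

  /* illustrative untagging helper */
  static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
  {
          return (struct mem_cgroup *)((unsigned long)sk->sk_memcg &
                                       ~SK_MEMCG_SOCK_ISOLATED);
  }

  static inline bool mem_cgroup_sk_isolated(const struct sock *sk)
  {
          return (unsigned long)sk->sk_memcg & SK_MEMCG_SOCK_ISOLATED;
  }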

Signed-off-by: Kuniyuki Iwashima <[email protected]>
…ing.

Some protocols (e.g., TCP, UDP) implement memory accounting for socket
buffers and charge memory to per-protocol global counters pointed to by
sk->sk_prot->memory_allocated.

When running under a non-root cgroup, this memory is also charged to the
memcg as "sock" in memory.stat.

Even when a memcg controls memory usage, sockets of such protocols are
still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).

This makes it difficult to accurately estimate and configure appropriate
global limits, especially in multi-tenant environments.

If all workloads were guaranteed to be controlled under memcg, the issue
could be worked around by setting tcp_mem[0~2] to UINT_MAX.

In reality, this assumption does not always hold, and processes not
controlled by memcg lose the seatbelt and can consume memory up to
the global limit, becoming noisy neighbours.

Let's decouple sockets in memcg from the global per-protocol memory
accounting if sockets have SK_BPF_MEMCG_SOCK_ISOLATED in sk->sk_memcg.

This simplifies memcg configuration while keeping the global limits
within a reasonable range.

If mem_cgroup_sk_isolated(sk) returns true, the per-protocol memory
accounting is skipped.
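
Conceptually, a charge site then behaves like the following sketch
(illustrative only; the actual patch modifies the existing charge and
uncharge paths rather than adding a helper like this):

  static void charge_sk_memory(struct sock *sk, int pages)
  {
          /* Isolated sockets are accounted to the memcg only and never
           * touch sk->sk_prot->memory_allocated.
           */
          if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
              mem_cgroup_sk_isolated(sk))
                  return;

          sk_memory_allocated_add(sk, pages);
  }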

In __inet_accept(), we need to reclaim counts that are already charged
for child sockets because we do not allocate sk->sk_memcg until accept().
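
The reclaim could look roughly like this (illustrative; the helper name
and the exact amount computed here are assumptions, not the actual diff):

  static void reclaim_global_charge(struct sock *newsk)
  {
          int amt;

          if (!mem_cgroup_sk_isolated(newsk))
                  return;

          /* Give back what the child's buffered data already charged to
           * the global counter while it was owned by the listener.
           */
          amt = sk_mem_pages(atomic_read(&newsk->sk_rmem_alloc) +
                             newsk->sk_forward_alloc);
          if (amt)
                  sk_memory_allocated_sub(newsk, amt);
  }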

Note that trace_sock_exceed_buf_limit() will always show 0 as accounted
for the isolated sockets, but this can be obtained via memory.stat.

Tested with a script that creates local socket pairs and send()s a
bunch of data without recv()ing.

Setup:

  # mkdir /sys/fs/cgroup/test
  # echo $$ >> /sys/fs/cgroup/test/cgroup.procs
  # sysctl -q net.ipv4.tcp_mem="1000 1000 1000"

Without bpf prog:

  # prlimit -n=524288:524288 bash -c "python3 pressure.py" &
  # cat /sys/fs/cgroup/test/memory.stat | grep sock
  sock 22642688
  # cat /proc/net/sockstat| grep TCP
  TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 5376
  # ss -tn | head -n 5
  State Recv-Q Send-Q Local Address:Port  Peer Address:Port
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53188
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:49972
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53868
  ESTAB 2000   0          127.0.0.1:34479    127.0.0.1:53554
  # nstat | grep Pressure || echo no pressure
  TcpExtTCPMemoryPressures        1                  0.0

With bpf prog in the next patch:

  # bpftool prog load sk_memcg.bpf.o /sys/fs/bpf/sk_memcg type cgroup/sock_create
  # bpftool cgroup attach /sys/fs/cgroup/test cgroup_inet_sock_create pinned /sys/fs/bpf/sk_memcg
  # prlimit -n=524288:524288 bash -c "python3 pressure.py" &
  # cat /sys/fs/cgroup/test/memory.stat | grep sock
  sock 2757468160
  # cat /proc/net/sockstat | grep TCP
  TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 0
  # ss -tn | head -n 5
  State Recv-Q Send-Q  Local Address:Port  Peer Address:Port
  ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:49026
  ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:45630
  ESTAB 110000 0           127.0.0.1:36019    127.0.0.1:44870
  ESTAB 111000 0           127.0.0.1:36019    127.0.0.1:45274
  # nstat | grep Pressure || echo no pressure
  no pressure

Signed-off-by: Kuniyuki Iwashima <[email protected]>
The test does the following for IPv4/IPv6 x TCP/UDP sockets
with/without BPF prog.

  1. Create socket pairs
  2. Send a bunch of data that requires more than 256 pages
  3. Read memory_allocated from the 3rd column in /proc/net/protocols
  4. Check if unread data is charged to memory_allocated

If the BPF prog is attached, memory_allocated should not change,
but we allow a small error (up to 10 pages) in case other processes
on the host use some amount of TCP/UDP memory.
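
As a rough sketch of step 3 (not the actual selftest code), the value
can be pulled out of /proc/net/protocols like this:

  #include <stdio.h>
  #include <string.h>

  /* Return memory_allocated (in pages) for a protocol, e.g. "TCP". */
  static long read_memory_allocated(const char *proto)
  {
          char line[256], name[64];
          long size, sockets, memory, ret = -1;
          FILE *f = fopen("/proc/net/protocols", "r");

          if (!f)
                  return -1;

          while (fgets(line, sizeof(line), f)) {
                  /* name, obj size, sockets in use, memory (pages), ... */
                  if (sscanf(line, "%63s %ld %ld %ld",
                             name, &size, &sockets, &memory) == 4 &&
                      !strcmp(name, proto)) {
                          ret = memory;
                          break;
                  }
          }

          fclose(f);
          return ret;
  }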

At 2., the test actually sends more than 1024 pages because the sysctl
net.core.mem_pcpu_rsv is 256 by default, which means 256 pages are
buffered per CPU before being reported to sk->sk_prot->memory_allocated.

  BUF_SINGLE (1024) * NR_SEND (64) * NR_SOCKETS (64) / 4096
  = 1024 pages

When I reduced it to 512 pages, the following assertion for the
non-isolated case got flaky.

  ASSERT_GT(memory_allocated[1], memory_allocated[0] + 256, ...)

Another contributor to slowness is a 150ms sleep to make sure one RCU
grace period passes, because the UDP receive queue is destroyed after that.

  # time ./test_progs -t sk_memcg
  #370/1   sk_memcg/TCP       :OK
  #370/2   sk_memcg/UDP       :OK
  #370/3   sk_memcg/TCPv6     :OK
  #370/4   sk_memcg/UDPv6     :OK
  #370     sk_memcg:OK
  Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED

  real	0m1.214s
  user	0m0.014s
  sys	0m0.318s

Signed-off-by: Kuniyuki Iwashima <[email protected]>
@kernel-patches-daemon-bpf

Upstream branch: 02614ee
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=996706
version: 4
