| 1 | +--- |
| 2 | +title: optimized kernel-native mode |
| 3 | +authors: |
| 4 | +- "@bitcoffeeiux" |
| 5 | +reviews: |
| 6 | +- |
| 7 | +approves: |
| 8 | +- |
| 9 | + |
| 10 | +create-date: 2025-01-17 |
| 11 | + |
| 12 | +--- |
| 13 | + |
| 14 | +## Kernel-native mode: layer-7 HTTP traffic control with fewer kernel modifications |
| 15 | + |
| 16 | +### Background |
| 17 | + |
| 18 | +Kernel-native mode currently requires a large number of intrusive kernel modifications to implement HTTP-based traffic control. Some of these modifications have a significant impact on the kernel, which makes kernel-native mode difficult to deploy and use in a real production environment. |
| 19 | + |
| 20 | +To resolve this problem, we have reworked the kernel changes required by kernel-native mode, together with the associated kernel module (ko) and eBPF programs. |
| 21 | + |
| 22 | +On kernel 5.10, the number of kernel modifications is limited to four; on kernel 6.6, it is reduced to one. |
| 23 | +We aim to eliminate this last modification as well, so that kernel-native mode can eventually run on an unmodified kernel of version 6.6 or later. |
| 24 | + |
| 25 | +### Overview |
| 26 | + |
| 27 | + |
| 28 | + |
| 29 | +### Changes in kernel |
| 30 | +#### Kernel enhancement 1 |
| 31 | + |
| 32 | +A flag is added to the socket to indicate whether connection establishment should be deferred for that socket. |
| 33 | + |
| 34 | +Solution on kernel 5.10: |
| 35 | +Add a one-bit flag, `bpf_defer_connect`, which shares the same `__u8` bitfield as `defer_connect`, `bind_address_no_port`, and `recverr_rfc4884`. |
| 36 | + |
| 37 | + ```c |
| 38 | + include/net/inet_sock.h |
| 39 | + |
| 40 | + struct inet_sock { |
| 41 | + ... |
| 42 | + __u8 bind_address_no_port:1, |
| 43 | + recverr_rfc4884:1, |
| 44 | + - defer_connect:1; |
| 45 | + + defer_connect:1, |
| 46 | + + bpf_defer_connect:1; |
| 47 | + __u8 rcv_tos; |
| 48 | + ... |
| 49 | + }; |
| 50 | + ``` |
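
As a rough userspace illustration of why this change is cheap, the sketch below mirrors the bitfield layout with two hypothetical structs (`flags_before` and `flags_after`, which are not the kernel's `struct inet_sock`) and checks that a fourth one-bit flag still fits in the same storage. It assumes a GCC- or Clang-style compiler that accepts `unsigned char` bitfields.

```c
#include <assert.h>

/* Hypothetical stand-ins for the flag byte before and after the change.
 * Only the bitfield layout is mirrored; these are NOT the kernel definitions. */
struct flags_before {
    unsigned char bind_address_no_port:1,
                  recverr_rfc4884:1,
                  defer_connect:1;
};

struct flags_after {
    unsigned char bind_address_no_port:1,
                  recverr_rfc4884:1,
                  defer_connect:1,
                  bpf_defer_connect:1; /* new flag, packed next to the others */
};

/* Aborts if adding the new bit changed the storage size. */
void check_flag_layout(void)
{
    assert(sizeof(struct flags_after) == sizeof(struct flags_before));
}

/* Sets only the new flag and reports whether the neighbouring bits stayed clear. */
int new_flag_is_independent(void)
{
    struct flags_after f = {0};

    f.bpf_defer_connect = 1;
    return f.bpf_defer_connect == 1 && f.defer_connect == 0 &&
           f.bind_address_no_port == 0 && f.recverr_rfc4884 == 0;
}
```

Calling `check_flag_layout()` followed by `new_flag_is_independent()` from a small test program is enough to exercise the sketch.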
| 51 | + |
| 52 | +Solution on kernel 6.6: |
| 53 | +No kernel modification is needed. Kernel 6.6 adds the `inet_flags` field of type `unsigned long` to `struct inet_sock`. On a 64-bit platform, this field can hold 64 flags. Currently, 30 flags are used by the kernel, so a spare bit of this field can be used directly to store the new flag. |
| 54 | + |
| 55 | + ```c |
| 56 | + include/net/inet_sock.h |
| 57 | + |
| 58 | + struct inet_sock { |
| 59 | + ... |
| 60 | + #define inet_num sk.__sk_common.skc_num |
| 61 | + |
| 62 | + unsigned long inet_flags; |
| 63 | + __be32 inet_saddr; |
| 64 | + ... |
| 65 | + }; |
| 66 | + ... |
| 67 | + enum { |
| 68 | + INET_FLAGS_PKTINFO = 0, |
| 69 | + INET_FLAGS_TTL = 1, |
| 70 | + INET_FLAGS_TOS = 2, |
| 71 | + ... |
| 72 | + INET_FLAGS_RECVERR_RFC4884 = 10, |
| 73 | + ... |
| 74 | + INET_FLAGS_BIND_ADDRESS_NO_PORT = 18, |
| 75 | + INET_FLAGS_DEFER_CONNECT = 19, |
| 76 | + ... |
| 77 | + INET_FLAGS_RTALERT = 30, |
| 78 | + }; |
| 79 | + ... |
| 80 | + ``` |
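
The idea of claiming a spare bit in an `unsigned long` flag word can be sketched in userspace as follows. Everything prefixed with `demo_` or `DEMO_` is hypothetical: in particular, bit 31 is only an assumed free position, since the text above does not say which bit Kmesh uses, and the plain bit operations here are not atomic, unlike what kernel code would need.

```c
#include <assert.h>
#include <limits.h>

enum {
    DEMO_FLAGS_DEFER_CONNECT     = 19, /* bit number taken from the excerpt above */
    DEMO_FLAGS_RTALERT           = 30, /* bit number taken from the excerpt above */
    DEMO_FLAGS_BPF_DEFER_CONNECT = 31, /* ASSUMPTION: an unused bit chosen for illustration */
};

/* Hypothetical stand-in for the part of the socket that holds the flag word. */
struct demo_sock {
    unsigned long inet_flags;
};

/* Number of flag bits available in the word on this platform. */
unsigned int demo_flag_capacity(void)
{
    return (unsigned int)(sizeof(unsigned long) * CHAR_BIT);
}

/* Non-atomic set of flag bit nr. */
void demo_set_bit(struct demo_sock *sk, int nr)
{
    assert(nr >= 0 && (unsigned int)nr < demo_flag_capacity());
    sk->inet_flags |= 1UL << nr;
}

/* Non-atomic clear of flag bit nr. */
void demo_clear_bit(struct demo_sock *sk, int nr)
{
    assert(nr >= 0 && (unsigned int)nr < demo_flag_capacity());
    sk->inet_flags &= ~(1UL << nr);
}

/* Returns 1 if flag bit nr is set, 0 otherwise. */
int demo_test_bit(const struct demo_sock *sk, int nr)
{
    return (int)((sk->inet_flags >> nr) & 1UL);
}
```

Because the new flag lives in an otherwise unused bit, setting or clearing it leaves the existing flags such as `DEMO_FLAGS_DEFER_CONNECT` untouched.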
| 81 | + |
| 82 | +#### Kernel enhancement 2 |
| 83 | +`bpf_setsockopt` and `bpf_getsockopt` support setting and reading the Upper Layer Protocol (ULP). ULP is a key building block of kernel-native mode: it lets us customize the behavior of sockets managed by Kmesh, so that traffic control logic can be offloaded to the kernel for execution. |
| 84 | + |
| 85 | +Solution on kernel 5.10: |
| 86 | +This capability is integrated into 5.10, following the upstream implementation. |
| 87 | + |
| 88 | +Solution on kernel 6.6: |
| 89 | +No kernel modification is needed; the upstream kernel already supports this capability. |
| 90 | + |
| 91 | +#### Kernel enhancement 3 |
| 92 | +A writable tracepoint is added to `inet_stream_connect` so that the special error code returned by `kmesh_defer_connect` can be rewritten to a normal return value. The error code indicates whether the Kmesh pseudo connection establishment path was taken; without the tracepoint, it would be returned to user mode. |
| 93 | + |
| 94 | +The `defer_connect` flag may make it possible to avoid this problem: set the `defer_connect` flag on the socket in the connect and sendmsg phases and analyze the impact on the socket. If there is no adverse impact, the tracepoint can be deleted. |
| 95 | + |
| 96 | +Solution on kernel 5.10: |
| 97 | +Introduce a new writable tracepoint hook. |
| 98 | + |
| 99 | +Solution on kernel 6.6: |
| 100 | +Introduce a new writable tracepoint hook. |
| 101 | + |
| 102 | +#### Kernel enhancement 4 |
| 103 | +In kernel-native mode, eBPF programs parse user information through a helper function or a kfunc. The ctx parameter passed to it contains an iovec copy of the data sent by the user, which is the data to be parsed. |
| 104 | + |
| 105 | +Solution on kernel 5.10: |
| 106 | +Add the related interfaces to the kernel as helper functions. |
| 107 | + |
| 108 | +Solution on kernel 6.6: |
| 109 | +No kernel modification is required; the kfunc interface is used instead. |
| 110 | + |
| 111 | +### Changes in ko |
| 112 | + |
| 113 | +The kernel module (ko) provides the following functions: |
| 114 | + |
| 115 | +- `kmesh_defer_connect`, a custom connect function registered through the ULP framework. It provides the pseudo connection establishment capability and sets the related socket status. If it detects that the fastopen capability is enabled for the socket, it rolls back the custom sendmsg and epoll functions. |
| 116 | +- `kmesh_defer_sendmsg`, a custom sendmsg function registered through the ULP framework. It provides the deferred sending capability and determines from the socket status whether connection establishment is being deferred. In the deferred phase, it invokes the eBPF program to complete the DNAT and then sets up the deferred connection. |
| 117 | +- A custom epoll function registered through the ULP framework, which solves the problem that epoll cannot be triggered when the socket is in the connected state. |
| 118 | +- The helper function and kfunc implementations, which parse, save, and match layer-7 user information for eBPF programs to support grayscale functions. |
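
The interplay between the pseudo connect, the return-value fix-up, and the deferred sendmsg described above can be sketched as a small userspace simulation. This is only a model under stated assumptions, not the Kmesh implementation: every `demo_` name, the `DEMO_EDEFER` marker value, the addresses, and the path-prefix matching rule are hypothetical and chosen purely for illustration.

```c
#include <assert.h>
#include <string.h>

#define DEMO_EDEFER 1000 /* hypothetical marker code, not a real errno value */

/* Hypothetical socket model: only the state the sketch needs. */
struct demo_sock {
    int  defer;     /* 1 while connection establishment is deferred */
    int  connected; /* 1 once the (simulated) real connection exists */
    char dst[32];   /* current destination address */
};

static void demo_set_dst(struct demo_sock *sk, const char *dst)
{
    strncpy(sk->dst, dst, sizeof(sk->dst) - 1);
    sk->dst[sizeof(sk->dst) - 1] = '\0';
}

/* Stand-in for the eBPF DNAT step: a made-up layer-7 rule on the request path. */
static void demo_dnat(struct demo_sock *sk, const char *payload)
{
    demo_set_dst(sk, strncmp(payload, "GET /v2", 7) == 0 ? "10.0.0.2:80"
                                                         : "10.0.0.1:80");
}

/* Pseudo connect: remembers the target and returns the marker without connecting. */
int demo_defer_connect(struct demo_sock *sk, const char *dst)
{
    assert(sk && dst);
    demo_set_dst(sk, dst);
    sk->defer = 1;
    sk->connected = 0;
    return -DEMO_EDEFER;
}

/* Stand-in for the writable tracepoint: hides the marker code from the caller. */
int demo_fixup_retval(int err)
{
    return err == -DEMO_EDEFER ? 0 : err;
}

/* Deferred sendmsg: the first send runs DNAT and completes the connection. */
int demo_defer_sendmsg(struct demo_sock *sk, const char *payload)
{
    assert(sk && payload);
    if (sk->defer) {
        demo_dnat(sk, payload);
        sk->connected = 1;
        sk->defer = 0;
    }
    return sk->connected ? (int)strlen(payload) : -1;
}
```

In this model the caller sees a successful connect immediately, and the destination is only decided once the first request bytes are available for layer-7 matching.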