bpf: Add kfuncs for read-only string operations #8709
kernel-patches-daemon-bpf[bot] wants to merge 3 commits into bpf-next_base
Conversation
Upstream branch: 9aa8fe2
String operations are commonly used, so this exposes the most common ones to BPF programs. For now, we limit ourselves to operations which do not copy memory around.

Unfortunately, most in-kernel implementations assume that strings are %NUL-terminated, which is not necessarily true in BPF context, and therefore we cannot use them directly. So, we use distinct approaches for the bounded and unbounded variants of the string operations:
- Unbounded variants are open-coded, using __get_kernel_nofault instead of a plain dereference to make them safe.
- Bounded variants use params with the __sz suffix, so safety is assured by the verifier and we can use the in-kernel (potentially optimized) functions.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
The tests use the RUN_TESTS helper, which executes BPF programs with BPF_PROG_TEST_RUN and checks for the expected return value.

Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
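A test in this style might look as follows. This is a hedged sketch built on the usual selftests conventions (SEC("syscall") programs run via BPF_PROG_TEST_RUN, __retval checked by RUN_TESTS); the kfunc declaration is inferred from the series description, not copied from the patches, and the file only builds inside the kernel selftests tree:

```c
/* Illustrative only -- not standalone-compilable. */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include "bpf_misc.h"

/* Assumed kfunc signature for one of the new string kfuncs. */
extern int bpf_strcmp(const char *s1, const char *s2) __ksym;

char str[] = "hello";

SEC("syscall")
__retval(0)	/* RUN_TESTS runs the prog and checks this return value */
int test_strcmp(void *ctx)
{
	return bpf_strcmp(str, "hello");	/* 0 when the strings match */
}

char _license[] SEC("license") = "GPL";
```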
Add a new benchmark using the existing bench infrastructure which
compares performance of bounded and unbounded string kfuncs added in the
previous commits.
Running on x86_64 and arm64, the most significant difference is in the
strlen/strnlen and strstr/strnstr comparisons on arm64:
strlen/strnlen
==============
strlen-1 0.453 ± 0.002M/s (drops 0.000 ± 0.000M/s)
strnlen-1 0.470 ± 0.006M/s (drops 0.000 ± 0.000M/s)
strlen-8 0.459 ± 0.011M/s (drops 0.000 ± 0.000M/s)
strnlen-8 0.451 ± 0.006M/s (drops 0.000 ± 0.000M/s)
strlen-64 0.439 ± 0.007M/s (drops 0.000 ± 0.000M/s)
strnlen-64 0.455 ± 0.006M/s (drops 0.000 ± 0.000M/s)
strlen-512 0.359 ± 0.006M/s (drops 0.000 ± 0.000M/s)
strnlen-512 0.441 ± 0.007M/s (drops 0.000 ± 0.000M/s)
strlen-2048 0.232 ± 0.003M/s (drops 0.000 ± 0.000M/s)
strnlen-2048 0.403 ± 0.005M/s (drops 0.000 ± 0.000M/s)
strlen-4095 0.151 ± 0.001M/s (drops 0.000 ± 0.000M/s)
strnlen-4095 0.362 ± 0.005M/s (drops 0.000 ± 0.000M/s)
strstr/strnstr
==============
strstr-8 0.452 ± 0.005M/s (drops 0.000 ± 0.000M/s)
strnstr-8 0.442 ± 0.006M/s (drops 0.000 ± 0.000M/s)
strstr-64 0.390 ± 0.004M/s (drops 0.000 ± 0.000M/s)
strnstr-64 0.400 ± 0.004M/s (drops 0.000 ± 0.000M/s)
strstr-512 0.228 ± 0.003M/s (drops 0.000 ± 0.000M/s)
strnstr-512 0.256 ± 0.002M/s (drops 0.000 ± 0.000M/s)
strstr-2048 0.095 ± 0.001M/s (drops 0.000 ± 0.000M/s)
strnstr-2048 0.113 ± 0.001M/s (drops 0.000 ± 0.000M/s)
strstr-4095 0.052 ± 0.001M/s (drops 0.000 ± 0.000M/s)
strnstr-4095 0.064 ± 0.001M/s (drops 0.000 ± 0.000M/s)
For strings longer than 64B, the bounded variants are notably faster,
having as much as a 140% performance gain over the unbounded variants
(strnlen for strings of length 4095). The reason is that arm64 has an
optimized implementation of strnlen in assembly which is also used
inside strnstr.
On x86_64, which doesn't have any optimized string operations, there is
still an observable difference in strlen/strnlen and strstr/strnstr,
albeit much smaller than for arm64:
strlen/strnlen
==============
strlen-1 7.021 ± 0.036M/s (drops 0.000 ± 0.000M/s)
strnlen-1 7.000 ± 0.038M/s (drops 0.000 ± 0.000M/s)
strlen-8 6.837 ± 0.011M/s (drops 0.000 ± 0.000M/s)
strnlen-8 6.832 ± 0.064M/s (drops 0.000 ± 0.000M/s)
strlen-64 5.638 ± 0.026M/s (drops 0.000 ± 0.000M/s)
strnlen-64 6.010 ± 0.034M/s (drops 0.000 ± 0.000M/s)
strlen-512 3.322 ± 0.011M/s (drops 0.000 ± 0.000M/s)
strnlen-512 3.449 ± 0.014M/s (drops 0.000 ± 0.000M/s)
strlen-2048 1.390 ± 0.007M/s (drops 0.000 ± 0.000M/s)
strnlen-2048 1.429 ± 0.003M/s (drops 0.000 ± 0.000M/s)
strlen-4095 0.786 ± 0.003M/s (drops 0.000 ± 0.000M/s)
strnlen-4095 0.803 ± 0.002M/s (drops 0.000 ± 0.000M/s)
strstr/strnstr
==============
strstr-8 6.031 ± 0.012M/s (drops 0.000 ± 0.000M/s)
strnstr-8 6.322 ± 0.048M/s (drops 0.000 ± 0.000M/s)
strstr-64 3.221 ± 0.054M/s (drops 0.000 ± 0.000M/s)
strnstr-64 3.059 ± 0.025M/s (drops 0.000 ± 0.000M/s)
strstr-512 0.734 ± 0.006M/s (drops 0.000 ± 0.000M/s)
strnstr-512 0.849 ± 0.004M/s (drops 0.000 ± 0.000M/s)
strstr-2048 0.220 ± 0.004M/s (drops 0.000 ± 0.000M/s)
strnstr-2048 0.246 ± 0.002M/s (drops 0.000 ± 0.000M/s)
strstr-4095 0.104 ± 0.000M/s (drops 0.000 ± 0.000M/s)
strnstr-4095 0.122 ± 0.000M/s (drops 0.000 ± 0.000M/s)
The performance gain of the bounded variants on strings over 64B is
3%-6% for strlen/strnlen and 12%-18% for strstr/strnstr. The likely
explanation is that the unbounded variants use __get_kernel_nofault
instead of a plain dereference, which introduces a small overhead. This
manifests mainly in the above functions, as they iterate over multiple
strings (i.e. use __get_kernel_nofault more).
For the rest of the functions in the benchmark (strchr/strnchr and
strchrnul/strnchrnul), the performance difference is negligible or
within the bounds of statistical error, with the exception of
strchr/strnchr on arm64:
strchr/strnchr
==============
strchr-1 0.475 ± 0.010M/s (drops 0.000 ± 0.000M/s)
strnchr-1 0.469 ± 0.008M/s (drops 0.000 ± 0.000M/s)
strchr-8 0.448 ± 0.011M/s (drops 0.000 ± 0.000M/s)
strnchr-8 0.472 ± 0.006M/s (drops 0.000 ± 0.000M/s)
strchr-64 0.432 ± 0.010M/s (drops 0.000 ± 0.000M/s)
strnchr-64 0.445 ± 0.008M/s (drops 0.000 ± 0.000M/s)
strchr-512 0.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
strnchr-512 0.330 ± 0.005M/s (drops 0.000 ± 0.000M/s)
strchr-2048 0.156 ± 0.002M/s (drops 0.000 ± 0.000M/s)
strnchr-2048 0.186 ± 0.003M/s (drops 0.000 ± 0.000M/s)
strchr-4095 0.094 ± 0.001M/s (drops 0.000 ± 0.000M/s)
strnchr-4095 0.115 ± 0.004M/s (drops 0.000 ± 0.000M/s)
Here, I'm not sure what the reason for the performance benefit is,
possibly a combination of compiler optimizations and
__get_kernel_nofault overhead.
Signed-off-by: Viktor Malik <vmalik@redhat.com>
At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=946777 expired. Closing PR.
Pull request for series with
subject: bpf: Add kfuncs for read-only string operations
version: 3
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=946777