[Deepin-Kernel-SIG] [linux 6.6-y] [Upstream] sync maple-tree patches from mainline #672
Conversation
Reviewer's Guide by Sourcery

This pull request introduces a new function, `__mt_dup()`, to duplicate the maple tree during fork().

Sequence diagram for fork() with mtree_dup:

```mermaid
sequenceDiagram
    participant fork()
    participant dup_mmap()
    participant __mt_dup()
    participant ksm_fork()
    participant khugepaged_fork()
    fork()->>dup_mmap(): dup_mmap(mm, oldmm)
    dup_mmap()->>__mt_dup(): __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL)
    alt __mt_dup fails
        __mt_dup()-->>dup_mmap(): Return error
        dup_mmap()->>ksm_fork(): ksm_fork(mm, oldmm)
        dup_mmap()->>khugepaged_fork(): khugepaged_fork(mm, oldmm)
    else __mt_dup succeeds
        __mt_dup()-->>dup_mmap(): Return success
        dup_mmap()->>ksm_fork(): ksm_fork(mm, oldmm)
        dup_mmap()->>khugepaged_fork(): khugepaged_fork(mm, oldmm)
    end
    dup_mmap()->>fork(): Return result
```
Hey @opsiff - I've reviewed your changes - here's some feedback:
Overall Comments:
- The `compare_tree` function could be simplified by comparing the `ma_root` members directly after comparing `ma_flags`.
- Consider adding a comment to `mas_dup_build` explaining why the error code is stored in `mas->node`.
Here's what I looked at during the review
- 🟡 General issues: 1 issue found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 2 issues found
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
```c
static inline void dup_userfaultfd_complete(struct list_head *l)
{
}
```
suggestion (bug_risk): Inline stub for dup_userfaultfd_fail may conflict.
There is a non-inline implementation of dup_userfaultfd_fail in fs/userfaultfd.c. The inline stub in the header might override or conflict with that implementation. Consider consolidating to a single definition to avoid potential link-time or behavioral inconsistencies.
Suggested implementation:
Make sure the non-inline definition (implementation) of dup_userfaultfd_fail in fs/userfaultfd.c is maintained and that no duplicate declarations exist.
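For reference, the usual way the kernel avoids this kind of clash is to make the header declaration and the inline stub mutually exclusive behind the config guard. A minimal sketch, assuming the stub is only meant for !CONFIG_USERFAULTFD builds (the guard structure here is the standard kernel idiom, not quoted from this diff):

```c
/* include/linux/userfaultfd_k.h -- sketch, not the actual patch */
#ifdef CONFIG_USERFAULTFD
/* Real implementation lives in fs/userfaultfd.c. */
void dup_userfaultfd_fail(struct list_head *fcs);
#else
/* No-op stub for kernels built without userfaultfd support. */
static inline void dup_userfaultfd_fail(struct list_head *l)
{
}
#endif
```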
```c
 * Compares two nodes except for the addresses stored in the nodes.
 * Returns zero if they are the same, otherwise returns non-zero.
 */
static int __init compare_node(struct maple_enode *enode_a,
```
issue (complexity): Consider refactoring the compare_node and build_full_tree functions into smaller helper functions to improve code readability and reduce complexity by removing the need for goto statements and simplifying control flow.
Consider refactoring the heavily interleaved control flow (especially the use of goto statements) by breaking the functions into smaller, well-named helper functions. This can reduce complexity without reverting functionality. For example:

In `compare_node()`: separate the leaf and internal node comparisons into helper functions. For instance:

```c
static int compare_leaf_node(const struct maple_node *a,
			     const struct maple_node *b)
{
	return memcmp(a, b, sizeof(struct maple_node));
}

static int compare_internal_node(struct maple_node *a, struct maple_node *b,
				 enum maple_type type)
{
	void **slots_a = ma_slots(a, type);
	void **slots_b = ma_slots(b, type);

	for (int i = 0; i < mt_slots[type]; i++) {
		if (!slots_a[i] && !slots_b[i])
			break;
		if (!slots_a[i] || !slots_b[i]) {
			pr_err("The number of slots is different.\n");
			return -1;
		}
		((unsigned long *)slots_a)[i] &= MAPLE_NODE_MASK;
		((unsigned long *)slots_b)[i] &= MAPLE_NODE_MASK;
	}
	return memcmp(a, b, sizeof(struct maple_node));
}
```

Then your main function becomes more straightforward:

```c
static int __init compare_node(struct maple_enode *enode_a,
			       struct maple_enode *enode_b)
{
	/* ... [pre-comparison initialization unchanged] ... */
	if (ma_is_leaf(type))
		return compare_leaf_node(&a, &b);
	else
		return compare_internal_node(&a, &b, type);
}
```
In `build_full_tree()`: replace `goto` usage by extracting loops into helper functions. For example, create a helper to perform the "store" operation:

```c
static inline int store_subtree(struct ma_state *mas, struct maple_tree *mt)
{
	unsigned long step;

	mas_subtree_max_range(mas);
	step = mas->last - mas->index;
	if (step < 1)
		return -1;
	mas->last = mas->index + step / 2;
	mas_store_gfp(mas, xa_mk_value(5), GFP_KERNEL);
	return 0;
}
```

Then structure the main loop without gotos:

```c
static __init int build_full_tree(struct maple_tree *mt, unsigned int flags,
				  int height)
{
	MA_STATE(mas, mt, 0, 0);
	int ret = 0, cnt = 1;
	enum maple_type type;

	mt_init_flags(mt, flags);
	mtree_insert_range(mt, 0, ULONG_MAX, xa_mk_value(5), GFP_KERNEL);
	mtree_lock(mt);
	while (1) {
		mas_set(&mas, 0);
		if (mt_height(mt) < height) {
			mas.max = ULONG_MAX;
		} else {
			while (1) {
				mas_dfs_preorder(&mas);
				if (mas_is_none(&mas))
					goto unlock;
				type = mte_node_type(mas.node);
				if (mas_data_end(&mas) + 1 < mt_slots[type]) {
					mas_set(&mas, mas.min);
					break;
				}
			}
		}
		ret = store_subtree(&mas, mt);
		if (ret)
			goto unlock;
		++cnt;
		/* Optionally add a condition to break from the infinite loop. */
	}
unlock:
	mtree_unlock(mt);
	MT_BUG_ON(mt, mt_height(mt) != height);
	return ret;
}
```

These small, focused refactorings encapsulate the convoluted parts into helper functions, making the overall flow clearer and easier to maintain.
```c
 * Author: Peng Zhang <[email protected]>
 */
```
issue (complexity): Consider refactoring the DFS duplication logic to improve clarity by isolating traversal steps into smaller, focused helpers or using a recursive approach where appropriate to reduce cognitive complexity without changing functionality
The DFS duplication logic mixes control flow (including a while (1) loop, manual ascent/descent management, etc.) and helper functions that can make it hard to follow. One way to reduce cognitive complexity without changing functionality is to isolate the traversal steps in small, focused helpers or even switch to a recursive approach when possible.
For example, consider splitting the non-leaf handling and leaf processing into separate inline helpers. This refactor could look like:
```c
static inline int mas_handle_leaf(struct ma_state *mas, struct ma_state *new_mas)
{
	/* When at a leaf and not at the final node, ascend until a subtree change is needed */
	do {
		mas_ascend(mas);
		mas_ascend(new_mas);
	} while (mas->offset == mas_data_end(mas));

	mas->offset++;
	new_mas->offset++;
	return 0; /* You can also return an error code if needed */
}
```

And then in the main duplication routine, refactor the DFS loop to improve clarity:
```c
static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
				 gfp_t gfp)
{
	struct maple_node *node;
	struct maple_pnode *parent = NULL;
	struct maple_enode *root;
	enum maple_type type;

	if (unlikely(mt_attr(mas->tree) != mt_attr(new_mas->tree)) ||
	    unlikely(!mtree_empty(new_mas->tree))) {
		mas_set_err(mas, -EINVAL);
		return;
	}

	root = mas_start(mas);
	if (mas_is_ptr(mas) || mas_is_none(mas))
		goto set_new_tree;

	node = mt_alloc_one(gfp);
	if (!node) {
		new_mas->node = MAS_NONE;
		mas_set_err(mas, -ENOMEM);
		return;
	}

	type = mte_node_type(mas->node);
	root = mt_mk_node(node, type);
	new_mas->node = root;
	new_mas->min = 0;
	new_mas->max = ULONG_MAX;
	root = mte_mk_root(root);

	while (1) {
		mas_copy_node(mas, new_mas, parent);
		if (!mte_is_leaf(mas->node)) {
			/* Allocate child nodes for non-leaf nodes */
			mas_dup_alloc(mas, new_mas, gfp);
			if (unlikely(mas_is_err(mas)))
				return;
		} else {
			/* Process leaf: if duplication complete, break; otherwise, adjust state */
			if (mas->max == ULONG_MAX)
				break;
			mas_handle_leaf(mas, new_mas);
		}

		mas_descend(mas);
		parent = ma_parent_ptr(mte_to_node(new_mas->node));
		mas_descend(new_mas);
		mas->offset = 0;
		new_mas->offset = 0;
	}

	/* Special handling for root parent update */
	mte_to_node(root)->parent = ma_parent_ptr(mas_tree_parent(new_mas));
set_new_tree:
	new_mas->tree->ma_flags = mas->tree->ma_flags;
	rcu_assign_pointer(new_mas->tree->ma_root, root);
}
```

Actionable Steps:
- Extract leaf handling: Isolate the logic for processing leaf nodes (e.g., the ascent loop and offset adjustments) into its own helper function as shown above.
- Refactor the DFS loop: Use the helper(s) to break up the lengthy `while (1)` loop into logically separated steps. This makes the control flow easier to follow.
- Consider a recursive approach: Where safe (e.g., tree depth is bounded), a recursive helper could replace the manual state management. Ensure to measure stack usage if you choose that route.
These changes keep the functionality intact while reducing the interleaving of complex control flow and inline operations.
Force-pushed from 31550d7 to 497b83e
…_bulk()

mainline inclusion
from mainline-v6.7-rc1
category: bugfix

The bulk allocation is iterating through an array and storing enough memory for the entire bulk allocation instead of a single array entry. Only allocate an array element of the size set in the kmem_cache.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: cc86e0c ("radix tree test suite: add support for slab bulk APIs")
Signed-off-by: Liam R. Howlett <[email protected]>
Reported-by: Christophe JAILLET <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Wentao Guan <[email protected]>
mainline inclusion
from mainline-v6.8-rc1
category: performance
Patch series "Introduce __mt_dup() to improve the performance of fork()", v7.
This series introduces __mt_dup() to improve the performance of fork().
During the duplication process of mmap, all VMAs are traversed and
inserted one by one into the new maple tree, causing the maple tree to be
rebalanced multiple times. Balancing the maple tree is a costly
operation. To duplicate VMAs more efficiently, mtree_dup() and __mt_dup()
are introduced for the maple tree. They can efficiently duplicate a maple
tree.
Here are some algorithmic details about {mtree,__mt}_dup(). We perform a
DFS pre-order traversal of all nodes in the source maple tree. During
this process, we fully copy the nodes from the source tree to the new
tree. This involves memory allocation, and when encountering a new node,
if it is a non-leaf node, all its child nodes are allocated at once.
This idea was originally from Liam R. Howlett's Maple Tree Work email,
and I added some of my own ideas to implement it. Some previous
discussions can be found in [1]. For a more detailed analysis of the
algorithm, please refer to the logs for patch [3/10] and patch [10/10].
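As a concrete illustration of the interface shape, here is a minimal usage sketch; the signatures follow the patches in this series, but the error handling and GFP choice are illustrative assumptions:

```c
#include <linux/maple_tree.h>

/* Sketch: copy @src into the (empty) tree @dst. mtree_dup() takes both
 * tree locks itself; __mt_dup() is the variant for callers that already
 * hold the locks. */
static int dup_tree_example(struct maple_tree *src, struct maple_tree *dst)
{
	int ret = mtree_dup(src, dst, GFP_KERNEL);

	/* Expected failures: -ENOMEM, or -EINVAL if @dst was not empty. */
	return ret;
}
```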
There is a "spawn" in byte-unixbench[2], which can be used to test the
performance of fork(). I modified it slightly to make it work with
different numbers of VMAs.
Below are the test results. The first row shows the number of VMAs. The
second and third rows show the number of fork() calls per ten seconds,
corresponding to next-20231006 and this patchset, respectively. The
test results were obtained with CPU binding to avoid scheduler load
balancing that could cause unstable results. There are still some
fluctuations in the test results, but at least they are better than the
original performance.
| VMAs | 21 | 121 | 221 | 421 | 821 | 1621 | 3221 | 6421 | 12821 | 25621 | 51221 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| next-20231006 | 112100 | 76261 | 54227 | 34035 | 20195 | 11112 | 6017 | 3161 | 1606 | 802 | 393 |
| this patchset | 114558 | 83067 | 65008 | 45824 | 28751 | 16072 | 8922 | 4747 | 2436 | 1233 | 599 |
| improvement | 2.19% | 8.92% | 19.88% | 34.64% | 42.37% | 44.64% | 48.28% | 50.17% | 51.68% | 53.74% | 52.42% |
Thanks to Liam and Matthew for the review.
This patch (of 10):
Add two helpers:
1. mt_free_one(), used to free a maple node.
2. mt_attr(), used to obtain the attributes of maple tree.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Mike Christie <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 4f2267b)
Signed-off-by: Wentao Guan <[email protected]>
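A sketch of what the two helpers plausibly look like given their stated purpose; the exact bodies are assumptions rather than quotes from the patch:

```c
/* Free a single maple node back to the slab cache it came from. */
static inline void mt_free_one(struct maple_node *node)
{
	kmem_cache_free(maple_node_cache, node);
}

/* Return the user-set attribute flags of a maple tree. */
static inline unsigned int mt_attr(struct maple_tree *mt)
{
	return mt->ma_flags & ~MT_FLAGS_HEIGHT_MASK;
}
```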
mainline inclusion
from mainline-v6.8-rc1
category: performance
In some cases, nested locks may be needed, so {mtree,mas}_lock_nested is
introduced. For example, when duplicating maple tree, we need to hold the
locks of two trees, in which case nested locks are needed.
At the same time, add the definition of spin_lock_nested() in tools for
testing.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Mike Christie <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit b2472ef)
Signed-off-by: Wentao Guan <[email protected]>
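A sketch of the nested-lock pattern this commit enables, roughly as a tree-duplication path would use it (SINGLE_DEPTH_NESTING is the standard lockdep subclass; the exact call sites are assumptions):

```c
/* Hold both tree locks at once; annotate the second acquisition with a
 * nesting subclass so lockdep does not report a false self-deadlock. */
mtree_lock(new);
mtree_lock_nested(mt, SINGLE_DEPTH_NESTING);
/* ... duplicate nodes from mt into new ... */
mtree_unlock(mt);
mtree_unlock(new);
```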
mainline inclusion
from mainline-v6.8-rc1
category: performance

Introduce interfaces __mt_dup() and mtree_dup(), which are used to duplicate a maple tree. They duplicate a maple tree in Depth-First Search (DFS) pre-order traversal. It uses memcpy() to copy nodes in the source tree and allocate new child nodes in non-leaf nodes. The new node is exactly the same as the source node except for all the addresses stored in it. It will be faster than traversing all elements in the source tree and inserting them one by one into the new tree. The time complexity of these two functions is O(n).

The difference between __mt_dup() and mtree_dup() is that mtree_dup() handles locks internally.

Analysis of the average time complexity of this algorithm:

For simplicity, let's assume that the maximum branching factor of all non-leaf nodes is 16 (in allocation mode, it is 10), and the tree is a full tree.

Under the given conditions, if there is a maple tree with n elements, the number of its leaves is n/16. From bottom to top, the number of nodes in each level is 1/16 of the number of nodes in the level below. So the total number of nodes in the entire tree is given by the sum of n/16 + n/16^2 + n/16^3 + ... + 1. This is a geometric series, and it has log(n) terms with base 16. According to the formula for the sum of a geometric series, the sum of this series can be calculated as (n-1)/15. Each node has only one parent node pointer, which can be considered as an edge. In total, there are (n-1)/15-1 edges.

This algorithm consists of two operations:

1. Traversing all nodes in DFS order.
2. For each node, making a copy and performing necessary modifications to create a new node.

For the first part, DFS traversal will visit each edge twice. Let T(ascend) represent the cost of taking one step upwards, and T(descend) represent the cost of taking one step downwards. Both of them are constants (although mas_ascend() may not be, as it contains a loop, but here we ignore it and treat it as a constant). So the time spent on the first part can be represented as ((n-1)/15-1) * (T(ascend) + T(descend)).

For the second part, each node will be copied, and the cost of copying a node is denoted as T(copy_node). For each non-leaf node, it is necessary to reallocate all child nodes, and the cost of this operation is denoted as T(dup_alloc). The behavior behind memory allocation is complex and not specific to the maple tree operation. Here, we assume that the time required for a single allocation is constant. Since the size of a node is fixed, both of these symbols are also constants. We can calculate that the time spent on the second part is ((n-1)/15) * T(copy_node) + ((n-1)/15 - n/16) * T(dup_alloc).

Adding both parts together, the total time spent by the algorithm can be represented as:

((n-1)/15) * (T(ascend) + T(descend) + T(copy_node) + T(dup_alloc)) - n/16 * T(dup_alloc) - (T(ascend) + T(descend))

Let C1 = T(ascend) + T(descend) + T(copy_node) + T(dup_alloc)
Let C2 = T(dup_alloc)
Let C3 = T(ascend) + T(descend)

Finally, the expression can be simplified as:

((16 * C1 - 15 * C2) / (15 * 16)) * n - (C1 / 15 + C3)

This is a linear function, so the average time complexity is O(n).

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Suggested-by: Liam R. Howlett <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Mike Christie <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit fd32e4e)
Signed-off-by: Wentao Guan <[email protected]>
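Written out, the node-count sum above is (taking n to be a power of 16 under the full-tree assumption):

```latex
\sum_{k=1}^{\log_{16} n} \frac{n}{16^{k}}
  = \frac{16^{\log_{16} n} - 1}{15}
  = \frac{n - 1}{15}
```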
…vior.

mainline inclusion
from mainline-v6.8-rc1
category: bugfix

When kmem_cache_alloc_bulk() fails to allocate, leave the freed pointers in the array. This enables a more accurate simulation of the kernel's behavior and allows for testing potential double-free scenarios.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Mike Christie <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 46c99e2)
Signed-off-by: Wentao Guan <[email protected]>
mainline inclusion
from mainline-v6.8-rc1
category: performance

Add test for mtree_dup(). Test by duplicating different maple trees and then comparing the two trees. Includes tests for duplicating full trees and memory allocation failures on different nodes.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Mike Christie <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit a2587a7)
Signed-off-by: Wentao Guan <[email protected]>
mainline inclusion
from mainline-v6.8-rc1
category: performance

Introduce the new interface mtree_dup() in the documentation.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Mike Christie <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 9bc1d3c)
Signed-off-by: Wentao Guan <[email protected]>
mainline inclusion
from mainline-v6.8-rc1
category: performance

Skip other tests when BENCH is enabled so that performance can be measured in user space.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Mike Christie <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit f670fa1)
Signed-off-by: Wentao Guan <[email protected]>
mainline inclusion
from mainline-v6.8-rc1
category: performance

Updated check_forking() and bench_forking() to use __mt_dup() to duplicate the maple tree.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Mike Christie <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 446e186)
Signed-off-by: Wentao Guan <[email protected]>
mainline inclusion
from mainline-v6.8-rc1
category: performance

When destroying maple tree, preserve its attributes and then turn it into an empty tree. This allows it to be reused without needing to be reinitialized.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Mike Christie <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 8e50d32)
Signed-off-by: Wentao Guan <[email protected]>
mainline inclusion
from mainline-v6.8-rc1
category: performance

In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then directly replacing the entries of VMAs in the new maple tree can result in better performance. __mt_dup() uses DFS pre-order to duplicate the maple tree, so it is efficient.

The average time complexity of __mt_dup() is O(n), where n is the number of VMAs. The proof of the time complexity is provided in the commit log that introduces __mt_dup(). After duplicating the maple tree, each element is traversed and replaced (ignoring the cases of deletion, which are rare). Since it is only a replacement operation for each element, this process is also O(n).

Analyzing the exact time complexity of the previous algorithm is challenging because each insertion can involve appending to a node, pushing data to adjacent nodes, or even splitting nodes. The frequency of each action is difficult to calculate. The worst-case scenario for a single insertion is when the tree undergoes splitting at every level. If we consider each insertion as the worst-case scenario, we can determine that the upper bound of the time complexity is O(n*log(n)), although this is a loose upper bound. However, based on the test data, it appears that the actual time complexity is likely to be O(n).

As the entire maple tree is duplicated using __mt_dup(), if dup_mmap() fails, there will be a portion of VMAs that have not been duplicated in the maple tree. To handle this, we mark the failure point with XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, stop releasing VMAs that have not been duplicated after this point.

There is a "spawn" in byte-unixbench[1], which can be used to test the performance of fork(). I modified it slightly to make it work with different numbers of VMAs.

Below are the test results. The first row shows the number of VMAs. The second and third rows show the number of fork() calls per ten seconds, corresponding to next-20231006 and this patchset, respectively. The test results were obtained with CPU binding to avoid scheduler load balancing that could cause unstable results. There are still some fluctuations in the test results, but at least they are better than the original performance.

| VMAs | 21 | 121 | 221 | 421 | 821 | 1621 | 3221 | 6421 | 12821 | 25621 | 51221 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| next-20231006 | 112100 | 76261 | 54227 | 34035 | 20195 | 11112 | 6017 | 3161 | 1606 | 802 | 393 |
| this patchset | 114558 | 83067 | 65008 | 45824 | 28751 | 16072 | 8922 | 4747 | 2436 | 1233 | 599 |
| improvement | 2.19% | 8.92% | 19.88% | 34.64% | 42.37% | 44.64% | 48.28% | 50.17% | 51.68% | 53.74% | 52.42% |

[1] https://github.com/kdlucas/byte-unixbench/tree/master

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peng Zhang <[email protected]>
Suggested-by: Liam R. Howlett <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Mike Christie <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit d240629)
Signed-off-by: Wentao Guan <[email protected]>
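A sketch of the duplicate-then-replace scheme this commit describes. The real dup_mmap() does far more (locking, file and anon_vma handling, hugepage accounting); this function only shows the control flow, and its name is hypothetical:

```c
static int dup_mmap_sketch(struct mm_struct *mm, struct mm_struct *oldmm)
{
	struct vm_area_struct *mpnt, *tmp;
	VMA_ITERATOR(vmi, mm, 0);
	int retval;

	/* One O(n) bulk copy instead of n rebalancing insertions. */
	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
	if (retval)
		return retval;

	/* mm's tree now holds oldmm's VMA pointers; replace each in place. */
	for_each_vma(vmi, mpnt) {
		tmp = vm_area_dup(mpnt);
		if (!tmp) {
			/* Mark the failure point so exit_mmap() stops here. */
			mas_store(&vmi.mas, XA_ZERO_ENTRY);
			return -ENOMEM;
		}
		tmp->vm_mm = mm;
		vma_iter_store(&vmi, tmp);	/* replacement, no rebalance */
	}
	return 0;
}
```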
mainline inclusion
from mainline-v6.12-rc6
category: bugfix

Patch series "fork: do not expose incomplete mm on fork".

During fork we may place the virtual memory address space into an inconsistent state before the fork operation is complete. In addition, we may encounter an error during the fork operation that indicates that the virtual memory address space is invalidated. As a result, we should not be exposing it in any way to external machinery that might interact with the mm or VMAs, machinery that is not designed to deal with incomplete state.

We specifically update the fork logic to defer khugepaged and ksm to the end of the operation and only to be invoked if no error arose, and disallow uffd from observing fork events should an error have occurred.

This patch (of 2):

Currently on fork we expose the virtual address space of a process to userland unconditionally if uffd is registered in VMAs, regardless of whether an error arose in the fork.

This is performed in dup_userfaultfd_complete() which is invoked unconditionally, and performs two duties - invoking registered handlers for the UFFD_EVENT_FORK event via dup_fctx(), and clearing down userfaultfd_fork_ctx objects established in dup_userfaultfd().

This is problematic, because the virtual address space may not yet be correctly initialised if an error arose. The change in commit d240629 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") makes this more pertinent as we may be in a state where entries in the maple tree are not yet consistent.

We address this by, on fork error, ensuring that we roll back state that we would otherwise expect to clean up through the event being handled by userland and perform the memory freeing duty otherwise performed by dup_userfaultfd_complete().

We do this by implementing a new function, dup_userfaultfd_fail(), which performs the same loop, only decrementing reference counts.

Note that we perform mmgrab() on the parent and child mm's, however userfaultfd_ctx_put() will mmdrop() this once the reference count drops to zero, so we will avoid memory leaks correctly here.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/d3691d58bb58712b6fb3df2be441d175bd3cdf07.1729014377.git.lorenzo.stoakes@oracle.com
Fixes: d240629 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Lorenzo Stoakes <[email protected]>
Reported-by: Jann Horn <[email protected]>
Reviewed-by: Jann Horn <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit f64e67e)
Signed-off-by: Wentao Guan <[email protected]>
mainline inclusion
from mainline-v6.7-rc1
category: bugfix

Patch series "mm/ksm: add fork-exec support for prctl", v4.

A process can enable KSM with the prctl system call. When the process is forked the KSM flag is inherited by the child process. However if the process is executing an exec system call directly after the fork, the KSM setting is cleared. This patch series addresses this problem.

1) Change the mask in coredump.h for execing a new process
2) Add a new test case in ksm_functional_tests

This patch (of 2):

Today we have two ways to enable KSM:

1) madvise system call
   This allows to enable KSM for a memory region for a long time.
2) prctl system call
   This is a recent addition to enable KSM for the complete process. In addition when a process is forked, the KSM setting is inherited.

This change only affects the second case.

One of the use cases for (2) was to support the ability to enable KSM for cgroups. This allows systemd to enable KSM for the seed process. By enabling it in the seed process all child processes inherit the setting. This works correctly when the process is forked. However it doesn't support the fork/exec workflow.

From the previous cover letter:

....
Use case 3:
With the madvise call sharing opportunities are only enabled for the current process: it is a workload-local decision. A considerable number of sharing opportunities may exist across multiple workloads or jobs (if they are part of the same security domain). Only a higher level entity like a job scheduler or container can know for certain if it's running one or more instances of a job. That job scheduler however doesn't have the necessary internal workload knowledge to make targeted madvise calls.
....

In addition it can also be a bit surprising that fork keeps the KSM setting and fork/exec does not.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Stefan Roesch <[email protected]>
Fixes: d7597f5 ("mm: add new api to enable ksm per process")
Reviewed-by: David Hildenbrand <[email protected]>
Reported-by: Carl Klemm <[email protected]>
Tested-by: Carl Klemm <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 3c6f33b)
Signed-off-by: Wentao Guan <[email protected]>
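For context, the prctl interface this series builds on can be exercised from userspace as follows (a minimal sketch; PR_SET_MEMORY_MERGE/PR_GET_MEMORY_MERGE exist since Linux 6.4, and the fallback defines are only for older headers):

```c
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#define PR_GET_MEMORY_MERGE 68
#endif

int main(void)
{
	/* Enable KSM for the whole process; children inherit this on fork,
	 * and with this series it also survives a subsequent exec. */
	if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0))
		perror("PR_SET_MEMORY_MERGE");
	printf("merge-any: %d\n", prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0));
	return 0;
}
```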
mainline inclusion
from mainline-v6.7-rc1
category: bugfix

This adds a new test case to the ksm functional tests to make sure that the KSM setting is inherited by the child process when doing a fork/exec.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Carl Klemm <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 0374af1)
Signed-off-by: Wentao Guan <[email protected]>
mainline inclusion
from mainline-v6.10-rc1
category: bugfix

Patch series "mm/ksm: fix ksm exec support for prctl", v4.

commit 3c6f33b ("mm/ksm: support fork/exec for prctl") inherits MMF_VM_MERGE_ANY flag when a task calls execve(). However, it doesn't create the mm_slot, so ksmd will not try to scan this task. The first patch fixes the issue.

The second patch refactors to prepare for the third patch. The third patch extends the selftests of ksm to verify the deduplication really happens after fork/exec inherits the KSM setting.

This patch (of 3):

commit 3c6f33b ("mm/ksm: support fork/exec for prctl") inherits MMF_VM_MERGE_ANY flag when a task calls execve(). However, it doesn't create the mm_slot, so ksmd will not try to scan this task.

To fix it, allocate and add the mm_slot to ksm_mm_head in __bprm_mm_init() when the mm has MMF_VM_MERGE_ANY flag.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 3c6f33b ("mm/ksm: support fork/exec for prctl")
Signed-off-by: Jinjiang Tu <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Stefan Roesch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 3a9e567)
Signed-off-by: Wentao Guan <[email protected]>
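A sketch of the shape of the fix: a small header helper that exec can call so the inherited flag actually registers the mm with ksmd. The helper name and its exact form are assumptions based on the commit description:

```c
/* include/linux/ksm.h -- sketch; the patch's exact form may differ. */
static inline int ksm_execve(struct mm_struct *mm)
{
	/* Allocate the mm_slot and add it to ksm_mm_head so ksmd scans us. */
	if (test_bit(MMF_VM_MERGE_ANY, &mm->flags))
		return __ksm_enter(mm);
	return 0;
}
```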
mainline inclusion
from mainline-v6.10-rc1
category: misc

Since commit 3c6f33b ("mm/ksm: support fork/exec for prctl"), when a child process is forked, the MMF_VM_MERGE_ANY flag will be inherited in mm_init(). So, it's unnecessary to set the flag in ksm_fork().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jinjiang Tu <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Stefan Roesch <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 7edea4c)
Signed-off-by: Wentao Guan <[email protected]>
mainline inclusion
from mainline-v6.12-rc6
category: bugfix

There is no reason to invoke these hooks early against an mm that is in an incomplete state.

The change in commit d240629 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") makes this more pertinent as we may be in a state where entries in the maple tree are not yet consistent.

Their placement early in dup_mmap() only appears to have been meaningful for early error checking, and since functionally it'd require a very small allocation to fail (in practice 'too small to fail') that'd only occur in the most dire circumstances, meaning the fork would fail or be OOM'd in any case.

Since both khugepaged and KSM tracking are there to provide optimisations to memory performance rather than critical functionality, it doesn't really matter all that much if, under such dire memory pressure, we fail to register an mm with these.

As a result, we follow the example of commit d2081b2 ("mm: khugepaged: make khugepaged_enter() void function") and make ksm_fork() a void function also.

We only expose the mm to these functions once we are done with them and only if no error occurred in the fork operation.

Link: https://lkml.kernel.org/r/e0cb8b840c9d1d5a6e84d4f8eff5f3f2022aa10c.1729014377.git.lorenzo.stoakes@oracle.com
Fixes: d240629 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Lorenzo Stoakes <[email protected]>
Reported-by: Jann Horn <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Jann Horn <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 985da55)
Signed-off-by: Wentao Guan <[email protected]>
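A control-flow sketch of the reordering this commit describes (surrounding dup_mmap() details omitted; variable names are illustrative):

```c
/* At the end of dup_mmap(), after every VMA has been duplicated: */
if (!retval) {
	/* Only a fully constructed mm is exposed to the optimisers. */
	ksm_fork(mm, oldmm);		/* now returns void */
	khugepaged_fork(mm, oldmm);
}
```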
mainline inclusion
from mainline-v6.13-rc4
category: bugfix

If dup_mmap() encounters an issue, currently uprobe is able to access the relevant mm via the reverse mapping (in build_map_info()), and if we are very unlucky with a race window, observe invalid XA_ZERO_ENTRY state which we establish as part of the fork error path.

This occurs because uprobe_write_opcode() invokes anon_vma_prepare() which in turn invokes find_mergeable_anon_vma() that uses a VMA iterator, invoking vma_iter_load() which uses the advanced maple tree API and thus is able to observe XA_ZERO_ENTRY entries added to dup_mmap() in commit d240629 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()").

This change was made on the assumption that only process tear-down code would actually observe (and make use of) these values. However this very unlikely but still possible edge case with uprobes exists and unfortunately does make these observable.

The uprobe operation prevents races against the dup_mmap() operation via the dup_mmap_sem semaphore, which is acquired via uprobe_start_dup_mmap() and dropped via uprobe_end_dup_mmap(), and held across register_for_each_vma() prior to invoking build_map_info() which does the reverse mapping lookup.

Currently these are acquired and dropped within dup_mmap(), which exposes the race window prior to error handling in the invoking dup_mm() which tears down the mm.

We can avoid all this by just moving the invocation of uprobe_start_dup_mmap() and uprobe_end_dup_mmap() up a level to dup_mm() and only release this lock once the dup_mmap() operation succeeds or clean up is done.

This means that the uprobe code can never observe an incompletely constructed mm and resolves the issue in this case.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: d240629 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Lorenzo Stoakes <[email protected]>
Reported-by: [email protected]
Closes: https://lore.kernel.org/all/[email protected]/
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peng Zhang <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 8ac662f)
Signed-off-by: Wentao Guan <[email protected]>
mainline inclusion
from mainline-v6.14-rc1
category: bugfix

If a memory allocation fails during dup_mmap(), the maple tree can be left in an unsafe state for other iterators besides the exit path. All the locks are dropped before the exit_mmap() call (in mm/mmap.c), but the incomplete mm_struct can be reached through (at least) the rmap finding the vmas which have a pointer back to the mm_struct.

Up to this point, there have been no issues with being able to find an mm_struct that was only partially initialised. Syzbot was able to make the incomplete mm_struct fail with recent forking changes, so it has been proven unsafe to use the mm_struct that hasn't been initialised, as referenced in the link below.

Although 8ac662f ("fork: avoid inappropriate uprobe access to invalid mm") fixed the uprobe access, it does not completely remove the race.

This patch sets the MMF_OOM_SKIP to avoid the iteration of the vmas on the oom side (even though this is extremely unlikely to be selected as an oom victim in the race window), and sets MMF_UNSTABLE to avoid other potential users from using a partially initialised mm_struct.

When registering vmas for uprobe, skip the vmas in an mm that is marked unstable. Modifying a vma in an unstable mm may cause issues if the mm isn't fully initialised.

Link: https://lore.kernel.org/all/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Fixes: d240629 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Liam R. Howlett <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peng Zhang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 64c37e1)
Signed-off-by: Wentao Guan <[email protected]>
Conflicts:
	kernel/events/uprobes.c
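A sketch of the flag-setting on the dup_mmap() error path as described above (placement and surrounding cleanup simplified):

```c
if (retval) {
	/* Partially duplicated mm: keep the OOM killer away and tell
	 * other potential users the mm is not safe to walk. */
	set_bit(MMF_OOM_SKIP, &mm->flags);
	set_bit(MMF_UNSTABLE, &mm->flags);
}
```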
Force-pushed from 497b83e to 08dcc40
@sourcery-ai review
Hey @opsiff - I've reviewed your changes - here's some feedback:
Overall Comments:
- Consider adding a comment describing the locking strategy for `mtree_dup`.
- The `compare_tree` function in `tools/testing/radix-tree/maple.c` could be improved by using a `MA_STATE` for both trees simultaneously, instead of calling `mas_dfs_preorder` repeatedly.
Here's what I looked at during the review
- 🟡 General issues: 1 issue found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 1 issue found
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
```c
 * Compares two nodes except for the addresses stored in the nodes.
 * Returns zero if they are the same, otherwise returns non-zero.
 */
static int __init compare_node(struct maple_enode *enode_a,
```
issue (complexity): Consider refactoring compare_node and build_full_tree to use early returns and clearer loop constructs instead of gotos to improve readability and reduce complexity.
In several functions (particularly in `compare_node` and `build_full_tree`), the complexity is increased by nested conditions and nonlocal jumps with `goto`. Consider refactoring those sections to use early returns or clearer loop constructs. For example:
**In `compare_node`:**

Replace the use of `goto cmp` with an early exit for leaf nodes. For example, change

```c
if (ma_is_leaf(type))
	goto cmp;
```

to:

```c
if (ma_is_leaf(type))
	return memcmp(&a, &b, sizeof(struct maple_node));
```

and then do the rest of the comparisons normally. This flattens the control flow.
**In `build_full_tree`:**

You can refactor the inner `while (1)` and `goto store`/`goto unlock` by using flags or breaking out of the loops. For example, refactor like this:
```c
static __init int build_full_tree(struct maple_tree *mt, unsigned int flags,
				  int height)
{
	MA_STATE(mas, mt, 0, 0);
	unsigned long step;
	int ret = 0, cnt = 1;
	enum maple_type type;

	mt_init_flags(mt, flags);
	mtree_insert_range(mt, 0, ULONG_MAX, xa_mk_value(5), GFP_KERNEL);
	mtree_lock(mt);

	while (1) {
		mas_set(&mas, 0);
		if (mt_height(mt) < height) {
			mas.max = ULONG_MAX;
			/* Process store step */
		} else {
			bool found = false;

			while (1) {
				mas_dfs_preorder(&mas);
				if (mas_is_none(&mas)) {
					ret = 0;
					goto unlock;
				}
				type = mte_node_type(mas.node);
				if (mas_data_end(&mas) + 1 < mt_slots[type]) {
					mas_set(&mas, mas.min);
					found = true;
					break;
				}
			}
			if (!found)
				break; /* terminate if processing is done */
		}
		mas_subtree_max_range(&mas);
		step = mas.last - mas.index;
		if (step < 1) {
			ret = -1;
			break;
		}
		step /= 2;
		mas.last = mas.index + step;
		mas_store_gfp(&mas, xa_mk_value(5), GFP_KERNEL);
		++cnt;
	}
unlock:
	mtree_unlock(mt);
	MT_BUG_ON(mt, mt_height(mt) != height);
	return ret;
}
```

This refactoring uses a flag (`found`) and `break` statements rather than interleaving `goto` labels, which helps to document the different stages more clearly.

These small changes can greatly improve readability while preserving all the functionality.
/test-ok
Summary by Sourcery
This pull request introduces the ability to duplicate a maple tree, which is used in the kernel's memory management. It also fixes a number of bugs related to memory management during forking and execve, and improves the performance of forking by using the new maple tree duplication functionality.