Commit 859bee3

pack-objects: support sparse:oid filter with path-walk
The --filter=sparse:<oid> option to 'git pack-objects' allows focusing an object set to a sparse-checkout definition. This reduces the set of matching blobs while retaining all reachable trees. No server currently supports fetching with this filter because it is expensive to compute and reachability bitmaps do not help without a significant effort to extend the bitmap feature to store bitmaps for each supported sparse-checkout definition.

Without focusing on serving fetches and clones with these filters, there are still benefits that could be realized by making this faster. With the sparse index, it's more realistic now than ever to be able to operate a local clone that was bootstrapped by a packfile created with a sparse filter, because the missing trees are not needed to move a sparse-checkout from one commit to another or to view the history of any path in scope. Such clones could perhaps be bootstrapped by partial bundles.

Previously, constructing these sparse packs has been incredibly computationally inefficient. The revision walk that explores which objects are in scope spends a lot of time checking each object to see if it matches the sparse-checkout patterns, causing quadratic behavior (number of objects times number of sparse-checkout patterns). This improves somewhat when using cone-mode sparse-checkout patterns that can use hashtables and prefix matches to determine containment. However, the check per object is still too expensive for most cases.

This is where the path-walk feature comes in. We can proceed as normal by placing objects in bins by path and _then_ check a group of objects all at once. Since sparse:<oid> only restricts blobs, the path-walk must include all reachable trees while using the cone-mode patterns to skip blobs at paths outside the sparse scope. This establishes a baseline for a potential future "treesparse:<oid>" filter that would also restrict trees, but introducing such a new filter is deferred to a later change.

The implementation here is focused around loading the sparse-checkout patterns from the provided object ID and checking that the patterns are indeed cone-mode patterns. We can then load the correct pattern list into the path walk context and use the logic that already exists from bff4555 (backfill: add --sparse option, 2025-02-03), though that feature loads sparse-checkout patterns from the worktree's local settings and also restricts tree objects.

We use a combination of errors and warnings to signal problems during this load. The difference is that errors are likely fatal for the non-path-walk version while the warnings are probably just implementation details for the path-walk version and the 'git pack-objects' command can fall back to the revision walk version.

Now that the SEEN flag is deferred until after pattern checks (from the previous commit), handle the case where a tree with a shared OID appears at both an out-of-cone and in-cone path. When trees are not being pruned (pl_sparse_trees == 0), the path-walk re-walks the tree at the in-cone path so that in-cone blobs within it are discovered.

The new tests in t5317 and t6601 demonstrate this behavior and would fail without these changes.

The performance test p5315 shows the impact of this change when using sparse filters:

    Test                                              HEAD~1    HEAD
    ----------------------------------------------------------------------------
    5315.10: repack (sparse:oid)                      77.98     77.47   -0.7%
    5315.11: repack size (sparse:oid)                 187.5M    187.4M  -0.0%
    5315.12: repack (sparse:oid, --path-walk)         77.91     31.41   -59.7%
    5315.13: repack size (sparse:oid, --path-walk)    187.5M    161.1M  -14.1%

These performance tests were run on the Git repository. The --path-walk feature shows meaningful space savings (14% smaller for sparse packs) and dramatic time savings (60% faster) by leveraging the path-walk's ability to skip blobs outside the sparse scope.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
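
The cost argument in the commit message (one pattern check per object versus one per path) can be pictured with a small toy model. This is only a sketch under stated assumptions: `struct toy_blob`, `path_in_scope`, `count_per_object`, and `count_per_path` are hypothetical names invented here, the "patterns" are plain directory prefixes, and the input is assumed to arrive already grouped by path. None of it is Git's actual API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy object: one blob version found at some path. */
struct toy_blob {
	const char *path;
};

/* Counts prefix comparisons so the two strategies can be compared. */
static size_t pattern_checks;

/* Return 1 if path starts with any of the n directory prefixes. */
static int path_in_scope(const char *path, const char **dirs, size_t n)
{
	assert(path && dirs);
	for (size_t i = 0; i < n; i++) {
		pattern_checks++;
		if (!strncmp(path, dirs[i], strlen(dirs[i])))
			return 1;
	}
	return 0;
}

/* Revision-walk style: every object is tested against the patterns. */
static size_t count_per_object(const struct toy_blob *blobs, size_t nr,
			       const char **dirs, size_t n)
{
	size_t kept = 0;

	for (size_t i = 0; i < nr; i++)
		kept += path_in_scope(blobs[i].path, dirs, n);
	return kept;
}

/*
 * Path-walk style: the input is grouped by path, so the patterns are
 * tested once per distinct path and the answer is reused for the
 * remaining objects in that bin.
 */
static size_t count_per_path(const struct toy_blob *blobs, size_t nr,
			     const char **dirs, size_t n)
{
	size_t kept = 0;
	int in_scope = 0;

	for (size_t i = 0; i < nr; i++) {
		if (!i || strcmp(blobs[i].path, blobs[i - 1].path))
			in_scope = path_in_scope(blobs[i].path, dirs, n);
		kept += in_scope;
	}
	return kept;
}
```

Both functions keep the same blobs; only the number of pattern comparisons differs, and the gap grows with the number of versions stored at each path.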
1 parent c5aca53 commit 859bee3

4 files changed

Lines changed: 335 additions & 4 deletions

builtin/pack-objects.c

Lines changed: 2 additions & 0 deletions
@@ -4777,6 +4777,8 @@ static void get_object_list_path_walk(struct rev_info *revs)
 	result = walk_objects_by_path(&info);
 	trace2_region_leave("pack-objects", "path-walk", revs->repo);
 
+	path_walk_info_clear(&info);
+
 	if (result)
 		die(_("failed to pack objects via path-walk"));
 }

path-walk.c

Lines changed: 77 additions & 4 deletions
@@ -10,6 +10,7 @@
 #include "hex.h"
 #include "list-objects.h"
 #include "list-objects-filter-options.h"
+#include "object-name.h"
 #include "odb.h"
 #include "object.h"
 #include "oid-array.h"
@@ -180,10 +181,6 @@ static int add_tree_entries(struct path_walk_context *ctx,
 			return -1;
 		}
 
-		/* Skip this object if already seen. */
-		if (o->flags & SEEN)
-			continue;
-
 		strbuf_setlen(&path, base_len);
 		strbuf_add(&path, entry.path, entry.pathlen);
 
@@ -194,6 +191,40 @@ static int add_tree_entries(struct path_walk_context *ctx,
 		if (type == OBJ_TREE)
 			strbuf_addch(&path, '/');
 
+		if (o->flags & SEEN) {
+			/*
+			 * A tree with a shared OID may appear at multiple
+			 * paths. Even though we already added this tree to
+			 * the output at some other path, we still need to
+			 * walk into it at this in-cone path to discover
+			 * blobs that were not found at the earlier
+			 * out-of-cone path.
+			 *
+			 * Only do this for paths not yet in our map, to
+			 * avoid duplicate entries when the same tree OID
+			 * appears at the same path across multiple commits.
+			 */
+			if (type == OBJ_TREE && ctx->info->pl &&
+			    ctx->info->pl->use_cone_patterns &&
+			    !ctx->info->pl_sparse_trees &&
+			    !strmap_contains(&ctx->paths_to_lists, path.buf)) {
+				int dtype;
+				enum pattern_match_result m;
+				m = path_matches_pattern_list(path.buf, path.len,
+							      path.buf + base_len,
+							      &dtype,
+							      ctx->info->pl,
+							      ctx->repo->index);
+				if (m != NOT_MATCHED) {
+					add_path_to_list(ctx, path.buf, type,
+							 &entry.oid,
+							 !(o->flags & UNINTERESTING));
+					push_to_stack(ctx, path.buf);
+				}
+			}
+			continue;
+		}
+
 		if (ctx->info->pl) {
 			int dtype;
 			enum pattern_match_result match;
@@ -533,6 +564,48 @@ static int prepare_filters(struct path_walk_info *info,
 		}
 		return 1;
 
+	case LOFC_SPARSE_OID:
+		if (info) {
+			struct object_id sparse_oid;
+			struct repository *repo = info->revs->repo;
+
+			if (info->pl) {
+				warning(_("sparse filter cannot be combined with existing sparse patterns"));
+				return 0;
+			}
+
+			if (repo_get_oid_with_flags(repo,
+						    options->sparse_oid_name,
+						    &sparse_oid,
+						    GET_OID_BLOB)) {
+				error(_("unable to access sparse blob in '%s'"),
+				      options->sparse_oid_name);
+				return 0;
+			}
+
+			CALLOC_ARRAY(info->pl, 1);
+			info->pl->use_cone_patterns = 1;
+
+			if (add_patterns_from_blob_to_list(&sparse_oid, "", 0,
+							   info->pl) < 0) {
+				clear_pattern_list(info->pl);
+				FREE_AND_NULL(info->pl);
+				error(_("unable to parse sparse filter data in '%s'"),
+				      oid_to_hex(&sparse_oid));
+				return 0;
+			}
+
+			if (!info->pl->use_cone_patterns) {
+				clear_pattern_list(info->pl);
+				FREE_AND_NULL(info->pl);
+				warning(_("sparse filter is not cone-mode compatible"));
+				return 0;
+			}
+
+			list_objects_filter_release(options);
+		}
+		return 1;
+
 	default:
 		error(_("object filter '%s' not supported by the path-walk API"),
 		      list_objects_filter_spec(options));
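
The SEEN handling in `add_tree_entries()` above can be pictured with a toy walker. This is a rough sketch under loud assumptions: `struct toy_entry`, `struct toy_object`, `in_cone`, and `walk` are hypothetical names, objects are identified by small integers instead of OIDs, the cone is hard-coded to "root files plus zzz/", and the real code's `paths_to_lists` check that avoids duplicate paths is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { TOY_BLOB, TOY_TREE };

/* Hypothetical toy entry; 'oid' indexes the objects[] table. */
struct toy_entry {
	const char *name;
	int type;
	int oid;
};

struct toy_object {
	int seen;                        /* stands in for the SEEN flag */
	const struct toy_entry *entries; /* children when this is a tree */
	int nr;
};

/* Hard-coded cone: root-level files plus everything under zzz/. */
static int in_cone(const char *path)
{
	return !strchr(path, '/') || !strncmp(path, "zzz/", 4);
}

/*
 * Walk tree 'oid' at path 'base' and return the number of in-cone
 * blobs found. With rewalk == 0, an already-seen tree is skipped
 * outright, which models the behavior before this change.
 */
static int walk(struct toy_object *objects, int oid, const char *base,
		int rewalk)
{
	int found = 0;

	assert(objects && base);
	for (int i = 0; i < objects[oid].nr; i++) {
		const struct toy_entry *e = &objects[oid].entries[i];
		struct toy_object *o = &objects[e->oid];
		char path[256];

		snprintf(path, sizeof(path), "%s%s%s", base, e->name,
			 e->type == TOY_TREE ? "/" : "");

		if (o->seen) {
			/* Shared tree at an in-cone path: look inside again. */
			if (rewalk && e->type == TOY_TREE && in_cone(path))
				found += walk(objects, e->oid, path, rewalk);
			continue;
		}
		if (e->type == TOY_BLOB && !in_cone(path))
			continue; /* skipped, and deliberately not marked seen */

		o->seen = 1;
		if (e->type == TOY_TREE)
			found += walk(objects, e->oid, path, rewalk);
		else
			found++;
	}
	return found;
}
```

With a root tree holding `aaa/` and `zzz/` that share one tree object, plus a root-level file, the walk with `rewalk` enabled finds two blobs and the walk without it finds only one, because the shared blob is only ever reached through the out-of-cone `aaa/` path.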

t/t5317-pack-objects-filter-objects.sh

Lines changed: 125 additions & 0 deletions
@@ -478,4 +478,129 @@ test_expect_success 'verify pack-objects w/ --missing=allow-any' '
 	EOF
 '
 
+# Test that --path-walk produces the same object set as standard traversal
+# when using sparse:oid filters with cone-mode patterns.
+#
+# The sparse:oid filter restricts only blobs, not trees. Both standard
+# and path-walk should produce identical sets of blobs, commits, and trees.
+
+test_expect_success 'setup pw_sparse for path-walk comparison' '
+	git init pw_sparse &&
+	mkdir -p pw_sparse/inc/sub pw_sparse/exc/sub &&
+
+	for n in 1 2
+	do
+		echo "inc $n" >pw_sparse/inc/file$n &&
+		echo "inc sub $n" >pw_sparse/inc/sub/file$n &&
+		echo "exc $n" >pw_sparse/exc/file$n &&
+		echo "exc sub $n" >pw_sparse/exc/sub/file$n &&
+		echo "root $n" >pw_sparse/root$n || return 1
+	done &&
+
+	git -C pw_sparse add . &&
+	git -C pw_sparse commit -m "first" &&
+
+	echo "inc 1 modified" >pw_sparse/inc/file1 &&
+	echo "exc 1 modified" >pw_sparse/exc/file1 &&
+	echo "root 1 modified" >pw_sparse/root1 &&
+	git -C pw_sparse add . &&
+	git -C pw_sparse commit -m "second" &&
+
+	# Cone-mode sparse pattern: include root + inc/
+	printf "/*\n!/*/\n/inc/\n" |
+	git -C pw_sparse hash-object -w --stdin >sparse_oid
+'
+
+test_expect_success 'sparse:oid with --path-walk produces same blobs' '
+	oid=$(cat sparse_oid) &&
+
+	git -C pw_sparse pack-objects --revs --stdout \
+		--filter=sparse:oid=$oid >standard.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C pw_sparse index-pack ../standard.pack &&
+	git -C pw_sparse verify-pack -v ../standard.pack >standard_verify &&
+
+	git -C pw_sparse pack-objects --revs --stdout \
+		--path-walk --filter=sparse:oid=$oid >pathwalk.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C pw_sparse index-pack ../pathwalk.pack &&
+	git -C pw_sparse verify-pack -v ../pathwalk.pack >pathwalk_verify &&
+
+	# Blobs must match exactly
+	grep -E "^[0-9a-f]{40} blob" standard_verify |
+	awk "{print \$1}" | sort >standard_blobs &&
+	grep -E "^[0-9a-f]{40} blob" pathwalk_verify |
+	awk "{print \$1}" | sort >pathwalk_blobs &&
+	test_cmp standard_blobs pathwalk_blobs &&
+
+	# Commits must match exactly
+	grep -E "^[0-9a-f]{40} commit" standard_verify |
+	awk "{print \$1}" | sort >standard_commits &&
+	grep -E "^[0-9a-f]{40} commit" pathwalk_verify |
+	awk "{print \$1}" | sort >pathwalk_commits &&
+	test_cmp standard_commits pathwalk_commits
+'
+
+test_expect_success 'sparse:oid with --path-walk includes all trees' '
+	# The sparse:oid filter restricts only blobs, not trees.
+	# Both standard and path-walk should include the same trees.
+	grep -E "^[0-9a-f]{40} tree" standard_verify |
+	awk "{print \$1}" | sort >standard_trees &&
+	grep -E "^[0-9a-f]{40} tree" pathwalk_verify |
+	awk "{print \$1}" | sort >pathwalk_trees &&
+
+	test_cmp standard_trees pathwalk_trees
+'
+
+# Test the edge case where the same tree/blob OID appears at both an
+# in-cone and out-of-cone path. When sibling directories have identical
+# contents, they share a tree OID. The path-walk defers marking objects
+# SEEN until after checking sparse patterns, so an object at an out-of-cone
+# path can still be discovered at an in-cone path.
+
+test_expect_success 'setup pw_shared for shared OID across cone boundary' '
+	git init pw_shared &&
+	mkdir pw_shared/aaa pw_shared/zzz &&
+	echo "shared content" >pw_shared/aaa/file &&
+	echo "shared content" >pw_shared/zzz/file &&
+	echo "root file" >pw_shared/rootfile &&
+	git -C pw_shared add . &&
+	git -C pw_shared commit -m "aaa and zzz share tree OID" &&
+
+	# Verify they share a tree OID
+	aaa_tree=$(git -C pw_shared rev-parse HEAD:aaa) &&
+	zzz_tree=$(git -C pw_shared rev-parse HEAD:zzz) &&
+	test "$aaa_tree" = "$zzz_tree" &&
+
+	# Cone pattern: include root + zzz/ (not aaa/)
+	printf "/*\n!/*/\n/zzz/\n" |
+	git -C pw_shared hash-object -w --stdin >shared_sparse_oid
+'
+
+test_expect_success 'shared tree OID: --path-walk blobs match standard' '
+	oid=$(cat shared_sparse_oid) &&
+
+	git -C pw_shared pack-objects --revs --stdout \
+		--filter=sparse:oid=$oid >shared_std.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C pw_shared index-pack ../shared_std.pack &&
+	git -C pw_shared verify-pack -v ../shared_std.pack >shared_std_verify &&
+
+	git -C pw_shared pack-objects --revs --stdout \
+		--path-walk --filter=sparse:oid=$oid >shared_pw.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C pw_shared index-pack ../shared_pw.pack &&
+	git -C pw_shared verify-pack -v ../shared_pw.pack >shared_pw_verify &&
+
+	grep -E "^[0-9a-f]{40} blob" shared_std_verify |
+	awk "{print \$1}" | sort >shared_std_blobs &&
+	grep -E "^[0-9a-f]{40} blob" shared_pw_verify |
+	awk "{print \$1}" | sort >shared_pw_blobs &&
+	test_cmp shared_std_blobs shared_pw_blobs
+'
+
 test_done
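
The three-line pattern blob used by these tests (root files in, all directories out, one directory back in) can be read with a very small matcher. The following is a minimal sketch, assuming only that restricted top-level shape: `struct toy_cone`, `toy_cone_parse`, and `toy_cone_has_blob` are hypothetical names, nested cone directories are rejected rather than handled, and this is not how Git's own pattern-list code is structured.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_DIRS 16

// Hypothetical container for the recursively included directories.
struct toy_cone {
	char dirs[TOY_MAX_DIRS][64]; // each stored without the leading slash
	int nr;
};

// Parse pattern text of the restricted form used in the tests above:
// the two preamble lines, then one top-level directory line per
// included directory. Returns 0 on success, or -1 if any line does
// not fit that form (which a caller could treat as "not cone-mode").
static int toy_cone_parse(struct toy_cone *cone, const char *text)
{
	assert(cone && text);
	cone->nr = 0;
	while (*text) {
		size_t len = strcspn(text, "\n");

		if ((len == 2 && !strncmp(text, "/*", 2)) ||
		    (len == 4 && !strncmp(text, "!/*/", 4))) {
			// Preamble: root files in, all directories out.
		} else if (len >= 3 && len < 64 && text[0] == '/' &&
			   text[len - 1] == '/' && cone->nr < TOY_MAX_DIRS &&
			   !memchr(text, '*', len) &&
			   !memchr(text + 1, '/', len - 2)) {
			// Keep the directory name plus its trailing slash.
			memcpy(cone->dirs[cone->nr], text + 1, len - 1);
			cone->dirs[cone->nr++][len - 1] = '\0';
		} else {
			return -1;
		}
		text += len;
		if (*text == '\n')
			text++;
	}
	return 0;
}

// Return 1 if a blob at 'path' (no leading slash) is inside the cone.
static int toy_cone_has_blob(const struct toy_cone *cone, const char *path)
{
	assert(cone && path);
	if (!strchr(path, '/'))
		return 1; // root-level file
	for (int i = 0; i < cone->nr; i++)
		if (!strncmp(path, cone->dirs[i], strlen(cone->dirs[i])))
			return 1;
	return 0;
}
```

Under this reading, the `pw_sparse` pattern keeps `root1`, `root2`, and everything below `inc/`, while blobs below `exc/` fall outside the cone. Trees are not filtered at all, which matches the tests' expectation that tree sets are identical with and without `--path-walk`.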

t/t6601-path-walk.sh

Lines changed: 131 additions & 0 deletions
@@ -590,4 +590,135 @@ test_expect_success 'all, blob:limit=3 filter' '
 	test_cmp_sorted expect out
 '
 
+test_expect_success 'setup sparse filter blob' '
+	# Cone-mode patterns: include root, exclude all dirs, include left/
+	cat >patterns <<-\EOF &&
+	/*
+	!/*/
+	/left/
+	EOF
+	sparse_oid=$(git hash-object -w -t blob patterns)
+'
+
+test_expect_success 'all, sparse:oid filter' '
+	test-tool path-walk --filter=sparse:oid=$sparse_oid -- --all >out &&
+
+	cat >expect <<-EOF &&
+	0:commit::$(git rev-parse topic)
+	0:commit::$(git rev-parse base)
+	0:commit::$(git rev-parse base~1)
+	0:commit::$(git rev-parse base~2)
+	1:tag:/tags:$(git rev-parse refs/tags/first)
+	1:tag:/tags:$(git rev-parse refs/tags/second.1)
+	1:tag:/tags:$(git rev-parse refs/tags/second.2)
+	1:tag:/tags:$(git rev-parse refs/tags/third)
+	1:tag:/tags:$(git rev-parse refs/tags/fourth)
+	1:tag:/tags:$(git rev-parse refs/tags/tree-tag)
+	1:tag:/tags:$(git rev-parse refs/tags/blob-tag)
+	2:blob:/tagged-blobs:$(git rev-parse refs/tags/blob-tag^{})
+	2:blob:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
+	3:tree::$(git rev-parse topic^{tree})
+	3:tree::$(git rev-parse base^{tree})
+	3:tree::$(git rev-parse base~1^{tree})
+	3:tree::$(git rev-parse base~2^{tree})
+	3:tree::$(git rev-parse refs/tags/tree-tag^{})
+	3:tree::$(git rev-parse refs/tags/tree-tag2^{})
+	4:blob:a:$(git rev-parse base~2:a)
+	5:blob:file2:$(git rev-parse refs/tags/tree-tag2^{}:file2)
+	6:tree:a/:$(git rev-parse base:a)
+	7:tree:child/:$(git rev-parse refs/tags/tree-tag:child)
+	8:tree:left/:$(git rev-parse base:left)
+	8:tree:left/:$(git rev-parse base~2:left)
+	9:blob:left/b:$(git rev-parse base~2:left/b)
+	9:blob:left/b:$(git rev-parse base:left/b)
+	10:tree:right/:$(git rev-parse topic:right)
+	10:tree:right/:$(git rev-parse base~1:right)
+	10:tree:right/:$(git rev-parse base~2:right)
+	blobs:6
+	commits:4
+	tags:7
+	trees:13
+	EOF
+
+	test_cmp_sorted expect out
+'
+
+test_expect_success 'topic only, sparse:oid filter' '
+	test-tool path-walk --filter=sparse:oid=$sparse_oid -- topic >out &&
+
+	cat >expect <<-EOF &&
+	0:commit::$(git rev-parse topic)
+	0:commit::$(git rev-parse base~1)
+	0:commit::$(git rev-parse base~2)
+	1:tree::$(git rev-parse topic^{tree})
+	1:tree::$(git rev-parse base~1^{tree})
+	1:tree::$(git rev-parse base~2^{tree})
+	2:blob:a:$(git rev-parse base~2:a)
+	3:tree:left/:$(git rev-parse base~2:left)
+	4:blob:left/b:$(git rev-parse base~2:left/b)
+	5:tree:right/:$(git rev-parse topic:right)
+	5:tree:right/:$(git rev-parse base~1:right)
+	5:tree:right/:$(git rev-parse base~2:right)
+	blobs:2
+	commits:3
+	tags:0
+	trees:7
+	EOF
+
+	test_cmp_sorted expect out
+'
+
+# Demonstrate the SEEN flag ordering issue: when the same tree/blob OID
+# appears at two sibling paths where one is in-cone and the other is
+# out-of-cone, the path-walk must still discover blobs at the in-cone
+# path even when the shared tree OID was first encountered out-of-cone.
+# Since sparse:oid includes all trees, the out-of-cone tree (aaa/) is
+# walked first, and its blob is skipped. The path-walk then re-walks
+# the same tree OID at the in-cone path (zzz/) to find the blob there.
+
+test_expect_success 'setup shared tree OID across cone boundary' '
+	git checkout --orphan shared-tree &&
+	git rm -rf . &&
+	mkdir aaa zzz &&
+	echo "shared content" >aaa/file &&
+	echo "shared content" >zzz/file &&
+	echo "root file" >rootfile &&
+	git add aaa zzz rootfile &&
+	git commit -m "aaa and zzz have same tree OID" &&
+
+	# Verify they really share a tree OID
+	aaa_tree=$(git rev-parse HEAD:aaa) &&
+	zzz_tree=$(git rev-parse HEAD:zzz) &&
+	test "$aaa_tree" = "$zzz_tree" &&
+
+	# Cone pattern: include root + zzz/ (not aaa/)
+	cat >shared-patterns <<-\EOF &&
+	/*
+	!/*/
+	/zzz/
+	EOF
+	shared_sparse_oid=$(git hash-object -w -t blob shared-patterns)
+'
+
+test_expect_success 'sparse:oid with shared tree OID across cone boundary' '
+	test-tool path-walk \
+		--filter=sparse:oid=$shared_sparse_oid \
+		-- shared-tree >out &&
+
+	cat >expect <<-EOF &&
+	0:commit::$(git rev-parse shared-tree)
+	1:tree::$(git rev-parse shared-tree^{tree})
+	2:blob:rootfile:$(git rev-parse shared-tree:rootfile)
+	3:tree:aaa/:$(git rev-parse shared-tree:aaa)
+	4:tree:zzz/:$(git rev-parse shared-tree:zzz)
+	5:blob:zzz/file:$(git rev-parse shared-tree:zzz/file)
+	blobs:2
+	commits:1
+	tags:0
+	trees:3
+	EOF
+
+	test_cmp_sorted expect out
+'
+
 test_done
