@derrickstolee derrickstolee commented Aug 15, 2025

Here's a small sparse index performance update based on a user report.

Thanks,
-Stolee

cc: [email protected]
cc: Elijah Newren [email protected]

When running 'git ls-files' with a pathspec, the index entries get
filtered according to that pathspec before iterating over them in
show_files().  In 7808709 (ls-files: add --sparse option,
2021-12-22), this iteration was prefixed with a check for the '--sparse'
option which allows the command to output directory entries; this
created a pre-loop call to ensure_full_index().

However, when a user runs 'git ls-files' where the pathspec matches
directories that are recursively matched in the sparse-checkout, there
are not any sparse directories that match the pathspec so they would not
be written to the output. The expansion in this case is just a
performance drop for no behavior difference.

Replace this global check to expand the index with a check inside the
loop for a matched sparse directory. If we see one, then expand the
index and continue from the current location. This is safe since the
previous entries in the index did not have any sparse directories and
thus would remain stable in this expansion.

A test in t1092 confirms that this changes the behavior.

Signed-off-by: Derrick Stolee <[email protected]>
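
To illustrate the approach described in the commit message, here is a minimal, self-contained sketch of lazy expansion over a toy index. It is not the code in builtin/ls-files.c: the `toy_*` names, the fixed-size array, and the hard-coded `toy_trees` table (standing in for reading tree objects) are all assumptions made for illustration. A name ending in '/' plays the role of a sparse directory entry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX 16

/* Toy index: sorted names; a trailing '/' marks a sparse directory. */
struct toy_index {
	const char *cache[TOY_MAX];
	size_t cache_nr;
	int expanded; /* set once toy_ensure_full() has run */
};

/* Hypothetical stand-in for the tree objects a real expansion would read. */
static const struct {
	const char *dir;
	const char *files[3]; /* NULL-terminated */
} toy_trees[] = {
	{ "deep/deeper2/", { "deep/deeper2/a", "deep/deeper2/b", NULL } },
	{ "folder1/", { "folder1/a", NULL } },
};

static int toy_is_sparse_dir(const char *name)
{
	size_t len = strlen(name);
	return len > 0 && name[len - 1] == '/';
}

/* Replace every sparse directory with its files; a no-op the second time. */
static void toy_ensure_full(struct toy_index *idx)
{
	const char *full[TOY_MAX];
	size_t nr = 0, i, t, f;

	if (idx->expanded)
		return;
	for (i = 0; i < idx->cache_nr; i++) {
		const char *name = idx->cache[i];

		if (!toy_is_sparse_dir(name)) {
			assert(nr < TOY_MAX);
			full[nr++] = name;
			continue;
		}
		for (t = 0; t < sizeof(toy_trees) / sizeof(toy_trees[0]); t++) {
			if (strcmp(toy_trees[t].dir, name))
				continue;
			for (f = 0; toy_trees[t].files[f]; f++) {
				assert(nr < TOY_MAX);
				full[nr++] = toy_trees[t].files[f];
			}
		}
	}
	memcpy(idx->cache, full, nr * sizeof(full[0]));
	idx->cache_nr = nr;
	idx->expanded = 1;
}

/*
 * Visit every file entry, expanding only when a sparse directory is
 * reached. Returns the number of files visited; *expanded_at receives the
 * loop position that triggered expansion, or (size_t)-1 if none did.
 */
static size_t toy_show_files(struct toy_index *idx, size_t *expanded_at)
{
	size_t i, shown = 0;

	*expanded_at = (size_t)-1;
	for (i = 0; i < idx->cache_nr; i++) {
		if (toy_is_sparse_dir(idx->cache[i])) {
			/* Entries before 'i' are plain files, so they stay put. */
			toy_ensure_full(idx);
			*expanded_at = i;
			if (i >= idx->cache_nr)
				break; /* toy-only guard: unknown dir expanded to nothing */
		}
		shown++; /* idx->cache[i] is now a file: "show" it */
	}
	return shown;
}
```

With the entries `a`, `deep/a`, `deep/deeper2/`, `folder1/`, the sketch triggers expansion at position 2 and the first two entries keep their positions. An index that contains only files is never expanded, which is the case the patch is meant to speed up.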
Member

@dscho dscho left a comment

Nice!

			 * alone.
			 */
			ensure_full_index(repo->index);
			ce = repo->index->cache[i];
Member

I was wondering whether we'd want to avoid calling ensure_full_index() multiple times, but it seems that it returns early if istate->sparse_index == INDEX_EXPANDED, so it's safe.

Author

But also we will never have S_ISSPARSEDIR() true after expansion.

@derrickstolee
Author

/submit

gitgitgadget bot commented Aug 15, 2025

Submitted as [email protected]

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1955/derrickstolee/ls-files-sparse-index-v1

To fetch this version to local tag pr-1955/derrickstolee/ls-files-sparse-index-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1955/derrickstolee/ls-files-sparse-index-v1

gitgitgadget bot commented Aug 26, 2025

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 8/15/2025 12:12 PM, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <[email protected]>
...
> Replace this global check to expand the index with a check inside the
> loop for a matched sparse directory. If we see one, then expand the
> index and continue from the current location. This is safe since the
> previous entries in the index did not have any sparse directories and
> thus would remain stable in this expansion.
...
>     Here's a small sparse index performance update based on a user report.

I know this is small and somewhat niche, but it hasn't had any review
or been picked up in What's Cooking. Could someone please take a look?

Thanks,
-Stolee

gitgitgadget bot commented Aug 26, 2025

On the Git mailing list, Elijah Newren wrote (reply to this):

On Fri, Aug 15, 2025 at 9:13 AM Derrick Stolee via GitGitGadget
<[email protected]> wrote:
>
> From: Derrick Stolee <[email protected]>
>
> When running 'git ls-files' with a pathspec, the index entries get
> filtered according to that pathspec before iterating over them in

When I first read this patch, I missed this part of your commit
message and figured there was no possible way your patch could
actually speed things up.  I verified with your testcase that it
worked, though, and had to step through a debugger to find out what I
was missing.  It's the prune_index() call in cmd_ls_files() that does
this -- but only when the pathspecs provided have some common prefix.
So, it's not unique to when there's a single pathspec as your commit
message claims, and the pointer to prune_index() may have helped save
me some head-scratching in reviewing the patch.

Perhaps this could be clarified here (and made more explicit for folks
like me that gloss over it), something like

    When running 'git ls-files' with pathspecs with a common prefix,
    the index entries get filtered according to that common prefix in
    prune_index() before iterating over them in show_files().
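
As a rough illustration of the pruning step described above, the sketch below computes a common leading-directory prefix for a set of pathspecs and narrows a sorted list of entry names to the run that starts with it. The helper names are hypothetical and the logic is deliberately simplified (literal paths only, no wildcards or pathspec magic, linear scan); it is not Git's prune_index().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: length of the longest leading-directory prefix
 * (ending in '/') shared by all n paths, or 0 if they share none.
 */
static size_t common_dir_prefix_len(const char *const *paths, size_t n)
{
	size_t len, k, last_slash = 0;

	if (n == 0)
		return 0;
	for (len = 0; paths[0][len]; len++) {
		/* A shorter path mismatches at its NUL, so we never read past it. */
		for (k = 1; k < n; k++)
			if (paths[k][len] != paths[0][len])
				return last_slash;
		if (paths[0][len] == '/')
			last_slash = len + 1;
	}
	return last_slash;
}

/*
 * Narrow a sorted array of entry names to the contiguous run that starts
 * with prefix[0..prefix_len). Stores the start of the run in *first and
 * returns its length. Because the input is sorted, matching names are
 * adjacent.
 */
static size_t prune_to_prefix(const char *const *entries, size_t nr,
			      const char *prefix, size_t prefix_len,
			      size_t *first)
{
	size_t lo = 0, hi;

	assert(strlen(prefix) >= prefix_len);
	while (lo < nr && strncmp(entries[lo], prefix, prefix_len) != 0)
		lo++;
	hi = lo;
	while (hi < nr && strncmp(entries[hi], prefix, prefix_len) == 0)
		hi++;
	*first = lo;
	return hi - lo;
}
```

In this toy model, two pathspecs under `deep/deeper1/` leave only the entries below that directory, so a sparse directory such as `folder1/` is never visited by the later loop.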

> show_files().  In 78087097b8 (ls-files: add --sparse option,
> 2021-12-22), this iteration was prefixed with a check for the '--sparse'
> option which allows the command to output directory entries; this
> created a pre-loop call to ensure_full_index().
>
> However, when a user runs 'git ls-files' where the pathspec matches
> directories that are recursively matched in the sparse-checkout, there
> are not any sparse directories that match the pathspec so they would not
> be written to the output. The expansion in this case is just a
> performance drop for no behavior difference.
>
> Replace this global check to expand the index with a check inside the
> loop for a matched sparse directory. If we see one, then expand the
> index and continue from the current location. This is safe since the
> previous entries in the index did not have any sparse directories and
> thus would remain stable in this expansion.
>
> A test in t1092 confirms that this changes the behavior.
>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>     ls-files: conditionally leave index sparse
>
>     Here's a small sparse index performance update based on a user report.
>
>     Thanks, -Stolee
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1955%2Fderrickstolee%2Fls-files-sparse-index-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1955/derrickstolee/ls-files-sparse-index-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/1955
>
>  builtin/ls-files.c                       | 13 ++++++++++---
>  t/t1092-sparse-checkout-compatibility.sh | 13 +++++++++++++
>  2 files changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/builtin/ls-files.c b/builtin/ls-files.c
> index c06a6f33e41..b148607f7a1 100644
> --- a/builtin/ls-files.c
> +++ b/builtin/ls-files.c
> @@ -414,14 +414,21 @@ static void show_files(struct repository *repo, struct dir_struct *dir)
>         if (!(show_cached || show_stage || show_deleted || show_modified))
>                 return;
>
> -       if (!show_sparse_dirs)
> -               ensure_full_index(repo->index);
> -
>         for (i = 0; i < repo->index->cache_nr; i++) {
>                 const struct cache_entry *ce = repo->index->cache[i];
>                 struct stat st;
>                 int stat_err;
>
> +               if (S_ISSPARSEDIR(ce->ce_mode) && !show_sparse_dirs) {
> +                       /*
> +                        * This is the first time we've hit a sparse dir,
> +                        * so expansion will leave the first 'i' entries
> +                        * alone.
> +                        */
> +                       ensure_full_index(repo->index);
> +                       ce = repo->index->cache[i];
> +               }

I see how this is safe.  I didn't understand how it helped performance
until I figured out by stepping through that repo->index->cache_nr is
much less than I expected, because of the prune_index() call that
happened earlier.

>                 construct_fullname(&fullname, repo, ce);
>
>                 if ((dir->flags & DIR_SHOW_IGNORED) &&
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index d8101139b40..b0f691c151a 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -1506,6 +1506,8 @@ test_expect_success 'sparse-index is not expanded' '
>         ensure_not_expanded reset --hard &&
>         ensure_not_expanded restore -s rename-out-to-out -- deep/deeper1 &&
>
> +       ensure_not_expanded ls-files deep/deeper1 &&
> +

Thanks, this testcase is exactly what I needed to figure out what I
was misunderstanding.

>         echo >>sparse-index/README.md &&
>         ensure_not_expanded add -A &&
>         echo >>sparse-index/extra.txt &&
> @@ -1607,6 +1609,17 @@ test_expect_success 'describe tested on all' '
>         test_all_match git describe --dirty
>  '
>
> +test_expect_success 'ls-files filtering and expansion' '
> +       init_repos &&
> +
> +       # This filtering will hit a sparse directory midway
> +       # through the iteration.
> +       test_all_match git ls-files deep &&
> +
> +       # This pathspec will filter the index to only a sparse
> +       # directory.
> +       test_all_match git ls-files folder1
> +'

Looks good.

gitgitgadget bot commented Aug 26, 2025

User Elijah Newren <[email protected]> has been added to the cc: list.

gitgitgadget bot commented Aug 26, 2025

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <[email protected]> writes:

>  	for (i = 0; i < repo->index->cache_nr; i++) {
>  		const struct cache_entry *ce = repo->index->cache[i];
>  		struct stat st;
>  		int stat_err;
>  
> +		if (S_ISSPARSEDIR(ce->ce_mode) && !show_sparse_dirs) {
> +			/*
> +			 * This is the first time we've hit a sparse dir,
> +			 * so expansion will leave the first 'i' entries
> +			 * alone.
> +			 */

In other words,

 (1) we know that the original index entries are sorted

 (2) we are looking at a single directory entry that is sparse, say
     "D/", and ensure_full_index() will expand it (and other later
     entries in the current index).

 (3) we assume that the contents of "D/" will never sort before the
     original location where "D/" used to sit, iow, we do not have
     to rewind to the beginning of index->cache[] array and skip
     what we have already processed.

Having been bitten by the index sort order a number of times, I just
wanted to make sure everybody is on the same page about these assumptions.

> +			ensure_full_index(repo->index);
> +			ce = repo->index->cache[i];

and there is no need to say "again" (redo this round of the loop)
here, as grabbing ce was the only thing the loop did, and we just
replaced the entry for the originally folded "D/" with one for the
first subpath in "D/".  Sounds sensible.
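
Point (3) above can be checked with a few lines of C. The sketch below assumes index names are compared bytewise (modeled here with `strcmp()`) and that a sparse directory entry carries a trailing '/'; the function names are hypothetical. Since "D/" is a proper prefix of every path under it, any such path sorts after "D/" itself, and any neighbour that is not under "D/" differs from it within the first `strlen("D/")` bytes, so it keeps its side.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if `path` lies strictly under directory `dir` (ends in '/'). */
static int is_under(const char *dir, const char *path)
{
	size_t len = strlen(dir);

	return strncmp(dir, path, len) == 0 && path[len] != '\0';
}

/*
 * Given neighbours prev < dir < next in a sorted sparse index (neither
 * neighbour lives under `dir`), check that a path expanded out of `dir`
 * still sorts strictly between them.
 */
static int stays_in_place(const char *prev, const char *dir,
			  const char *child, const char *next)
{
	assert(strcmp(prev, dir) < 0 && strcmp(dir, next) < 0);
	assert(!is_under(dir, prev) && !is_under(dir, next));
	assert(is_under(dir, child));

	return strcmp(prev, child) < 0 && strcmp(child, next) < 0;
}
```

The interesting neighbours are names that differ from the directory only around the slash: `deep/deeper2.txt` sorts before `deep/deeper2/` because '.' is below '/' in ASCII, and `deep/deeper20` sorts after it because '0' is above '/'. A path such as `deep/deeper2/zzz` still lands between the two.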

gitgitgadget bot commented Aug 27, 2025

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 8/26/2025 12:40 PM, Derrick Stolee wrote:
> On 8/15/2025 12:12 PM, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <[email protected]>
> ...
>> Replace this global check to expand the index with a check inside the
>> loop for a matched sparse directory. If we see one, then expand the
>> index and continue from the current location. This is safe since the
>> previous entries in the index did not have any sparse directories and
>> thus would remain stable in this expansion.
> ...
>>     Here's a small sparse index performance update based on a user report.
> 
> I know this is small and somewhat niche, but it hasn't had any review
> or been picked up in What's Cooking. Could someone please take a look?

Thanks, Elijah and Junio for reviewing.

By coincidence, a user reported an issue where the sparse index was
expanded during "git mergetool" and it was due to a pathspec-focused
ls-files subcommand. So maybe it's less niche than I had thought.

Thanks,
-Stolee

gitgitgadget bot commented Aug 28, 2025

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <[email protected]> writes:

> On 8/26/2025 12:40 PM, Derrick Stolee wrote:
>> On 8/15/2025 12:12 PM, Derrick Stolee via GitGitGadget wrote:
>>> From: Derrick Stolee <[email protected]>
>> ...
>>> Replace this global check to expand the index with a check inside the
>>> loop for a matched sparse directory. If we see one, then expand the
>>> index and continue from the current location. This is safe since the
>>> previous entries in the index did not have any sparse directories and
>>> thus would remain stable in this expansion.
>> ...
>>>     Here's a small sparse index performance update based on a user report.
>> 
>> I know this is small and somewhat niche, but it hasn't had any review
>> or been picked up in What's Cooking. Could someone please take a look?
>
> Thanks, Elijah and Junio for reviewing.
>
> By coincidence, a user reported an issue where the sparse index was
> expanded during "git mergetool" and it was due to a pathspec-focused
> ls-files subcommand. So maybe it's less niche than I had thought.

I do not speak for others who did not comment on the thread, but the
reason I did not act on the message was not because I found it niche
or insignificant.  I simply missed the single message in the sea of
other threads.

Queued.  Thanks for pinging.

gitgitgadget bot commented Aug 29, 2025

This patch series was integrated into seen via git@6268258.

@gitgitgadget gitgitgadget bot added the seen label Aug 29, 2025

gitgitgadget bot commented Aug 29, 2025

This branch is now known as ds/ls-files-lazy-unsparse.

gitgitgadget bot commented Aug 29, 2025

This patch series was integrated into seen via git@c683d80.

gitgitgadget bot commented Aug 29, 2025

This patch series was integrated into next via git@a48fee2.

@gitgitgadget gitgitgadget bot added the next label Aug 29, 2025

gitgitgadget bot commented Aug 29, 2025

There was a status update in the "New Topics" section about the branch ds/ls-files-lazy-unsparse on the Git mailing list:

"git ls-files <pathspec>..." should not necessarily have to expand
the index fully if a sparsified directory is excluded by the
pathspec; the code is taught to expand the index on demand to avoid
this.

Will merge to 'next'.
source: <[email protected]>
