Conversation

@tones111 tones111 commented Dec 24, 2025

Motivation

This change resolves several deficiencies with the current url normalization. Tests documenting these deficiencies have been corrected and new tests added to cover additional fetchGit test expectations.

Context

fixes #14852
fixes #14867

The primary new functionality is that SCP-like paths no longer require the user component of the URI authority. This complicates the implementation, as there is now overlap with simple URIs. To avoid processing URIs as SCP paths we now attempt to filter them out, but the regexes for URI and SCP detection are definitely more complicated.

While this effort keeps the URL normalization logic in fixGitURL, it feels like it's poorly duplicating Boost's parsing logic. However, I understand parsing URLs has a lot of edge cases, so leveraging a high-quality external library is likely still the right strategy. These regexes only try to identify enough of a URL's shape to differentiate it from SCP-like paths.
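To make the overlap concrete, here is a minimal sketch of the kind of shape check described above. It is not the code in this PR: `GitRemoteShape`, `isScheme` and `classifyGitRemote` are hypothetical names, and the only rules assumed are the RFC 3986 scheme grammar and git's documented rule that the scp-like syntax is only recognized when there is no slash before the first colon.

```cpp
#include <cassert>
#include <cctype>
#include <string_view>

// Hypothetical classification for illustration; not the Nix implementation.
enum class GitRemoteShape { Url, ScpLike, LocalPath };

// RFC 3986 scheme: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
inline bool isScheme(std::string_view s)
{
    if (s.empty() || !std::isalpha(static_cast<unsigned char>(s[0])))
        return false;
    for (char c : s)
        if (!std::isalnum(static_cast<unsigned char>(c)) && c != '+' && c != '-' && c != '.')
            return false;
    return true;
}

inline GitRemoteShape classifyGitRemote(std::string_view input)
{
    auto colon = input.find(':');
    auto slash = input.find('/');
    // No colon, or a slash before the first colon: treat as a local path.
    if (colon == std::string_view::npos || (slash != std::string_view::npos && slash < colon))
        return GitRemoteShape::LocalPath;
    // "<scheme>://..." has the shape of a proper URL.
    if (isScheme(input.substr(0, colon)) && input.substr(colon + 1, 2) == "//")
        return GitRemoteShape::Url;
    // Anything else with a colon looks like "[user@]host:path".
    return GitRemoteShape::ScpLike;
}
```

With this sketch, `github.com:owner/repo.git` and `git@github.com:owner/repo.git` both come out as `ScpLike`, while `file:/path/to/repo` is also classified as `ScpLike`, which is exactly the kind of overlap with simple URIs that needs extra handling.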

I'm very new to Nix/NixOS. I'd like to confirm this build fixes the nixos-rebuild error motivating this change but can't figure out how to get it to use my custom build. Having it in the path (from nix develop in the source tree) was not sufficient. Is there any documentation on how to do this?

@tones111 tones111 requested a review from edolstra as a code owner December 24, 2025 02:33
@tones111
Author

Looking over the history of libutil/url.cc I've come across #13821 (and its revert #13888). I had initially tried going down a similar path (manually parsing instead of regex) but trying to handle all the corner cases from optional fields got very unwieldy.

It looks like the tests introduced during that effort made it in 2b310ae and more test work has landed since. Hopefully the impact to the tests made here is palatable as it tries to address some known shortcomings.

@xokdvium xokdvium self-assigned this Dec 24, 2025
@xokdvium
Contributor

Thanks! Generally, I'm not a fan of re-adding regexes once again; using those is what got us into this messy situation. Could we look into re-landing #13821? That was mostly reverted because we wanted tests first before functional changes (also, the handling of relative paths was borked). Hardcoding github.com in the regex is definitely not the right approach here.

@tones111
Author

I agree about the regex usage. Putting the implementation aside, would you be willing to (in)validate some of my assumptions with the test changes? I'd like to get the tests nailed down and then will work on simplifying the implementation (will take a longer look at #13821).

Some questions that came up while working with the tests:

  1. Do the changes need to be 100% compatible with the current implementation?
     • The IPV6 test appears to add an additional directory separator at the root level which I don't understand.
     • What about the xxxParsesPoorly cases?
  2. What's the expected behavior for relative paths (file and scp-like)?
     • I don't believe URIs can express relative paths. This was the main justification for special casing github urls. github.com:owner/repo.git is very common and I didn't want to break that case.
     • Changing relative paths to be absolute from the root seems like surprising behavior
  3. For the file scheme I found the Nix tests depend on a default-constructed authority (file:///path) instead of (file:/path) which should be equivalent but requires a std::nullopt authority. Is there any opportunity for a future effort to reconcile the discrepancy or are we constrained to retaining backward compatibility?

The "git+file" and "path with a space" cases were added to cover regressions discovered when running against the full Nix test suite.

Thanks for taking a look over the changes!

@xokdvium
Contributor

Do the changes need to be 100% compatible with the current implementation?

Bug fixes are good. All the test changes aside from relative paths look good to me. We can fix the cases where we produce utterly invalid results that git doesn't understand or that can't reasonably succeed. One thing is that this is used in builtins.fetchGit, so bug fixes would change the behaviour of that language builtin fetcher, but since it's already quite broken today I don't think that's an issue.

The IPV6 test appears to add an additional directory separator at the root level which I don't understand.

This looks like a bug. Nix already discards consecutive slashes in quite a few places in libfetchers (not sure whether that's the case in the git fetcher), so this change seems good to me from the back-compat side of things.
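For illustration only, discarding consecutive slashes in a path can be sketched as below. `collapseSlashes` is a hypothetical helper, not a function from libfetchers, and it is meant to be applied to the path component alone (never to the `://` that follows a scheme).

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: collapse every run of '/' in a path into a single '/'.
inline std::string collapseSlashes(std::string_view path)
{
    std::string out;
    out.reserve(path.size());
    for (char c : path) {
        // Skip a slash that directly follows another slash.
        if (c == '/' && !out.empty() && out.back() == '/')
            continue;
        out.push_back(c);
    }
    return out;
}
```

Under this sketch an extra separator at the root, as in `//home/file`, normalizes to `/home/file`.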

What's the expected behavior for relative paths (file and scp-like)?

I'm not sure. The issue is that we can't distinguish scp-style URLs from legitimate URLs where the part before the colon is a scheme, but git-style URLs shouldn't have that collision. Can you make sure that fixGitURL is only used when we know it must be a git one? It might be that it's misused in a few places (hopefully I'm wrong).

I don't believe URIs can express relative paths. This was the main justification for special casing github urls.

Yeah, I'm not sure how to reconcile this. What are the semantics of this? Use the working directory in the remote? We can stick with fixing everything but that and explicitly ban relative paths for now.

Changing relative paths to be absolute from the root seems like surprising behavior

Yeah, this doesn't sound like a good idea.

For the file scheme I found the Nix tests depend on a default-constructed authority (file:///path)

I think that we prefer to use the empty-authority case rather than the no-authority one. I recall from the RFCs that the second variant is less preferred.
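The difference between the two variants can be sketched with an optional authority. The `Authority` and `Url` structs below are simplified, hypothetical stand-ins (the real ParsedURL stores the path as a list of segments and has more fields). They only show why an empty authority renders as `file:///path` while a missing one renders as `file:/path`.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Simplified, hypothetical stand-ins for illustration only.
struct Authority
{
    std::string host; // empty for "file:///path"
};

struct Url
{
    std::string scheme;
    std::optional<Authority> authority; // std::nullopt means "no authority at all"
    std::string path;
};

inline std::string render(const Url & u)
{
    std::string out = u.scheme + ":";
    // "//" is emitted only when an authority is present, even an empty one.
    if (u.authority)
        out += "//" + u.authority->host;
    return out + u.path;
}
```

Both forms name the same local path, so the choice between them is only about which representation is treated as canonical.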

The "git+file" and "path with a space" cases were added to cover regressions discovered when running against the full Nix test suite.

That is good, thanks! Always nice to be thorough about these things. Technically, Nix URL semantics sit somewhere between the WHATWG spec (spaces and ^ are allowed unencoded) and RFC 3986, but closer to the latter.

@xokdvium
Contributor

xokdvium commented Dec 24, 2025

FWIW, it's only recently that we started doing proper URL parsing in more places; before that it was all stringly typed everywhere, so that recent big rework is surfacing some bugs, specifically in the to_string assertions that validate the invariants required by the spec.

@xokdvium
Contributor

Also can you make sure that we also fix #5958?

@tones111
Author

tones111 commented Dec 26, 2025

Can you make sure that fixGitURL is only used when we know it must be a git one?

I'm not familiar with this code base, but all the fixGitURL invocations appear to be passed git URLs.

What are the semantics of this? Use the working directory in the remote? We can stick with fixing everything but that and explicitly ban relative paths for now.

I'm unable to reason about the semantics either. I like banning relative paths, but that breaks github.com:owner/repo.git. So we either have to allow them generally (rooted at /) or special-case github urls, right? Some clarification on how to handle scp-style relative paths would be helpful.

Also can you make sure that we also fix #5958?

I'm confused here. You reopened it on Oct 20, but this test appears to validate the expected behavior (port number parsed as the authority's port instead of a path component).

        // https://github.com/NixOS/nix/issues/5958
        // Already proper URL with git+ssh
        FixGitURLParam{
            .input = "git+ssh://user@domain:1234/path",
            .expected = "ssh://user@domain:1234/path",
            .parsed =
                ParsedURL{
                    .scheme = "ssh",
                    .authority =
                        ParsedURL::Authority{
                            .host = "domain",
                            .user = "user",
                            .port = 1234,
                        },
                    .path = {"", "path"},
                },
        },

@tones111
Author

I think this is ready for another look. I can rebase+squash if that would be easier to read. Thanks!

@xokdvium
Contributor

IIUC, the primary headache comes from the fact that git itself doesn't really do percent-encoding, but fetchTree does. So you must ensure that everything is percent-encoded when it's converted back into a string, but that it doesn't get re-encoded when it's rendered into a string to pass to git. Special-casing just the spaces doesn't seem like the best approach.

So my suggestion (more concise than above):

  • Do the trick with percentEncode before doing parseURL (making sure that path separators don't get encoded).

  • Check that when we render the URL and pass that to git we don't re-encode.

That should cover all the bases pretty ok-ish. It's not efficient to do this encoding/decoding, but we care about correctness here. fixGitURL should only really care about massaging scp-style URLs into ssh ones without touching the contents at all, and ensuring that ParsedURL gets the contents verbatim.
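A minimal sketch of the first suggestion, assuming a deliberately conservative rule: encode every byte that is not an RFC 3986 unreserved character, but leave `/` alone so path separators survive. `encodePathKeepingSlashes` is a hypothetical name and is not the percentEncode helper mentioned above.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Unreserved characters per RFC 3986 section 2.3.
inline bool isUnreserved(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')
        || c == '-' || c == '.' || c == '_' || c == '~';
}

// Hypothetical helper: percent-encode a raw path, keeping '/' as a separator.
inline std::string encodePathKeepingSlashes(std::string_view path)
{
    static constexpr char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(path.size());
    for (unsigned char c : path) {
        if (isUnreserved(c) || c == '/') {
            out.push_back(static_cast<char>(c));
        } else {
            // Emit "%XX" with the byte value in upper-case hex.
            out.push_back('%');
            out.push_back(hex[c >> 4]);
            out.push_back(hex[c & 0x0F]);
        }
    }
    return out;
}
```

The second suggestion matters because this encoding is not idempotent: running it over an already encoded `/git%20repo` produces `/git%2520repo`, so the string handed to git must be rendered without a second encoding pass.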

@tones111
Author

tones111 commented Jan 1, 2026

Thank you for the feedback/insight. I've moved the path %-encoding into fixGitURL and working with a std::string feels much cleaner. I'm less sure how/where to write tests to validate your second point about decoding before passing to git.

@xokdvium
Contributor

xokdvium commented Jan 5, 2026

@tones111, I'm going to push over your changes if there are no objections.

@xokdvium
Contributor

xokdvium commented Jan 5, 2026

@tones111, also added support for IPv6 and made the percent encoding much less janky. Should be much more consistent now.

@xokdvium
Contributor

xokdvium commented Jan 5, 2026

Some clarification on how to handle scp-style relative paths would be helpful.

Probably the best way is to stick with rewriting to absolute ones for now. I think treating these like actual git (and not git forges), using tilde expansion for the home directory, is best for the time being. Ideally we'd be able to differentiate between ssh URLs and SCP ones, but that poses a bit of an issue with flake lockfiles and back-compat.
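As a rough sketch of what "rewriting to absolute ones" means for scp-style input, the hypothetical `scpToSshUrl` below roots a relative path at `/`. It is not the PR's implementation: it does no percent-encoding, no IPv6 bracket handling and no tilde handling, and it rejects anything that looks like a local path or a `scheme://` URL.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: "[user@]host:path" -> "ssh://[user@]host/path".
inline std::optional<std::string> scpToSshUrl(std::string_view input)
{
    constexpr auto npos = std::string_view::npos;
    auto colon = input.find(':');
    auto slash = input.find('/');
    // Needs a non-empty host part and no slash before the first colon.
    if (colon == npos || colon == 0 || (slash != npos && slash < colon))
        return std::nullopt;

    std::string_view userHost = input.substr(0, colon);
    std::string_view path = input.substr(colon + 1);
    // "scheme://..." is a URL, not scp-style input.
    if (path.substr(0, 2) == "//")
        return std::nullopt;

    std::string url = "ssh://";
    url += userHost;
    // Root relative paths at "/" so the result is a well-formed URL.
    if (path.empty() || path.front() != '/')
        url += '/';
    url += path;
    return url;
}
```

Under this sketch `github.com:owner/repo.git` becomes `ssh://github.com/owner/repo.git`, which keeps the common forge form working at the cost of no longer distinguishing relative from absolute remote paths.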

Comment on lines +547 to +590
/* TODO: What to do about query parameters? Git should pass those to the
 * http(s) remotes. Ignore for now and just pass through. Will fail later. */
Member

I'm confused, is it parsing query params?

Contributor

Yeah but downstream users are broken.

/* HACK: SCP syntax overlaps with file:/path/to/repo. Git itself doesn't recognize it (or rather treats `file` as
 * the host name), but Nix accepts file:/path/to/repo as well as file:///path/to/repo. */
if (schemeOrHost == "file" || schemeOrHost == "git+file") {
    auto res = parseURL(url);
Author

@tones111 tones111 Jan 6, 2026

A potential issue with using parseURL here and not for the absolute path above is that they're no longer equivalent. If I modify this test to use a file scheme the output differs.

existing test:

            .input = "/repos/git repo",
            .expected = "file:///repos/git%20repo",

modified to use file:///

        FixGitURLParam{
            .input = "file:///repos/git repo",
            .expected = "file:///repos/git%20repo",
            .parsed =
                ParsedURL{
                    .scheme = "file",
                    .authority = ParsedURL::Authority{},
                    .path = {"", "repos", "git repo"},
                },
        },

fails with

unknown file: Failure
C++ exception with description "error: 'file:///repos/git repo' is not a valid URL: leftover" thrown in the test body.

The git clone url documentation says they should be equivalent

For local repositories, also supported by Git natively, the following syntaxes may be used:
/path/to/repo.git/
file:///path/to/repo.git/
These two syntaxes are mostly equivalent, except the former implies --local option.

While it is a url, git is able to handle the raw space.

$ git clone file:///home/paul/src/my\ repo my\ repo2
Cloning into 'my repo2'...
<...>
$ cd my\ repo2/
$ git remote -v
origin	file:///home/paul/src/my repo (fetch)
origin	file:///home/paul/src/my repo (push)

Member

I think it is OK for us to be stricter than Git; the most important thing is that when we do succeed, we agree with Git. I think it is OK to fail where Git would succeed, especially because the point of file:// vs plain paths is to be explicit.

@Ericson2314 Ericson2314 force-pushed the fixGitURL branch 2 times, most recently from 766ab55 to 6cce38b Compare January 6, 2026 06:06
@Ericson2314
Member

OK @xokdvium, this is back to you now.

Comment on lines +423 to +429
* When Git encounters a URL of the form <transport>://<address>, where
* <transport> is a protocol that it cannot handle natively, it
* automatically invokes git remote-<transport> with the full URL as the
* second argument. https://git-scm.com/docs/gitremote-helpers. If the
* url doesn't look like it would be accepted by the remote helper,
* treat it as a SCP-style one. Don't do any pct-decoding in that case.
* Schemes supported by git are excluded.
Member

I am not sure this comment is in the right spot now, as this function is about what to do in the non-:// case.

Member

@Ericson2314 Ericson2314 left a comment

$ git clone 'user[2001:db8:1::2]:/home/@file'
Cloning into '@file'...
ssh: Could not resolve hostname user[2001: Name or service not known
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
$ git clone 'user:[2001:db8:1::2]:/home/file'
Cloning into 'file'...
ssh: Could not resolve hostname user: Name or service not known
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
$ git clone 'user:@[2001:db8:1::2]:/home/file'
Cloning into 'file'...
^C⏎  # timeout connecting to IP

@xokdvium from some manual testing of truly insane stuff, I don't think the new algorithm is right yet.
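For reference, a minimal sketch of splitting only the well-formed bracketed form `[user@][host]:path`. `ScpParts` and `splitBracketedScp` are hypothetical names, and the function makes no claim to match git's own parser: it merely requires that the bracket is either at the start or directly preceded by `@`, and that `]` is directly followed by `:`.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical result type for illustration.
struct ScpParts
{
    std::string user; // everything before '@', kept uninterpreted
    std::string host; // contents of the brackets, without '[' and ']'
    std::string path;
};

inline std::optional<ScpParts> splitBracketedScp(std::string_view input)
{
    constexpr auto npos = std::string_view::npos;
    auto open = input.find('[');
    if (open == npos)
        return std::nullopt;

    ScpParts parts;
    if (open > 0) {
        // Anything before the bracket must be a "user@" prefix.
        if (input[open - 1] != '@')
            return std::nullopt;
        parts.user = std::string(input.substr(0, open - 1));
    }

    auto close = input.find(']', open);
    // The closing bracket must be followed by the ':' that starts the path.
    if (close == npos || close + 1 >= input.size() || input[close + 1] != ':')
        return std::nullopt;

    parts.host = std::string(input.substr(open + 1, close - open - 1));
    parts.path = std::string(input.substr(close + 2));
    return parts;
}
```

Of the three inputs tried above, this sketch accepts only `user:@[2001:db8:1::2]:/home/file` (with `user:` kept verbatim as the user part) and rejects the two where the bracket is not preceded by `@`.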

@Ericson2314
Member

I added some WIP additional tests for these crazy cases.

This change resolves several deficiencies with the current url
normalization.  Tests documenting these deficiencies have been corrected
and new tests added to cover additional fetchGit test expectations.

Co-authored-by: Sergei Zimmerman <[email protected]>
Co-authored-by: John Ericson <[email protected]>
Member

@Ericson2314 Ericson2314 left a comment

I made all the code changes I want to make

@Ericson2314 Ericson2314 dismissed their stale review January 7, 2026 21:35

now fixed

@tones111
Author

@xokdvium
Just checking in to make sure this isn't waiting on me. Is there anything left to do for this effort?

Thanks

Development

Successfully merging this pull request may close these issues.

core dump on malformed remote repo
build unable to fetch flake submodule (unexpected authority) using SSH transport

3 participants