Possible to introduce bogus UTF-16 into Gecko

Status: NEW
(Reporter: bzbarsky, Assigned: sicking)
Version: Trunk
Flags: in-testsuite ?
Reported: 13 years ago
Last updated: 9 years ago

(2 attachments)

(Reporter)

Description

13 years ago
I'll attach a simple testcase that introduces bogus UTF-16 sequences (a high surrogate not followed by a low surrogate) into Gecko in three different ways:

1)  CSS character escape
2)  JS character escape
3)  HTML entity reference.

All three cause asserts as we attempt to atomize the bogus UTF-16, etc.
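For reference, the ill-formedness in question can be sketched in a few lines of JS (hypothetical helper, not the attached testcase):

```javascript
// Hypothetical helper: a high surrogate (0xD800-0xDBFF) not followed by a
// low surrogate (0xDC00-0xDFFF), or a low surrogate on its own, makes the
// UTF-16 sequence ill-formed.
function hasLoneSurrogate(s) {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF) {
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return true; // unpaired high
      i++; // well-formed pair, skip the low half
    } else if (c >= 0xDC00 && c <= 0xDFFF) {
      return true; // unpaired low
    }
  }
  return false;
}

console.log(hasLoneSurrogate("\uD800"));       // true: lone high surrogate
console.log(hasLoneSurrogate("\uD800\uDC00")); // false: valid pair (U+10000)
```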

There's also the following comment in our "UCS216ToUnicode" converter:

// XXX : illegal surrogate code points are just passed through !!

but writing a testcase for that didn't seem to be worth the trouble and that might be fodder for a separate bug.

So the question is what we can do to deal with this...  Ideally we'd drop high surrogates not followed by a low surrogate and vice versa.  For CSS, we could maybe drop surrogates altogether, since I don't believe they're valid Unicode chars and CSS escapes escape Unicode chars, not UTF-16 code units.  If we did so, would this break anyone?  Would it violate the CSS spec to do this?  (I think it wouldn't, but I could be wrong.)

For JavaScript, I'm not sure.  http://developer.mozilla.org/en/docs/Core_JavaScript_1.5_Guide:Unicode#Unicode_Escape_Sequences talks about chars, but given that it's a 4-digit (hex) escape sequence, that doesn't cover all of Unicode unless we implicitly assume UTF-16 and surrogate pairs.  That said, I seem to recall that JS only promises UCS-2, not all of Unicode...  I also suspect there are other ways of producing bogus strings in JavaScript than just \u escapes, since you can operate on the individual codepoints in the string.

For HTML, since http://www.w3.org/TR/html401/charset.html#h-5.3.1 talks about "ISO 10646 character number" without any limitations to 4 digits, I think we should be able to do the same thing as for CSS.

Thoughts?
(Reporter)

Comment 1

13 years ago
Created attachment 202945 [details]
Testcase -- WARNING: will crash on shutdown

This shows the three problems; it crashes on shutdown as we try to remove some of the atoms involved from the atom hashtables.
(Reporter)

Comment 2

13 years ago
Created attachment 202946 [details]
Same as text/plain
Comment 3

13 years ago
(In reply to comment #0)

For both CSS and HTML I think we should treat all escaped surrogates as illegal characters, whether they come singly or in proper UTF-16 pairs. We should either drop them altogether or replace them by the replacement character U+FFFD. It seems to me the same issue as UTF-8 characters encoded in a non-shortest form, with the same possibility of security risks from spoofing characters (I agree that spoofing characters in the astral planes is less likely to be an issue, but still...)

> For JavaScript, I'm not sure. 

Whatever the status of UCS-2 versus UTF-16 in JavaScript, lone surrogates will always be illegal characters.
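The U+FFFD replacement option suggested above can be sketched as follows (hypothetical helper, not Gecko code; much later, ES2024 standardized the same behavior as String.prototype.toWellFormed()):

```javascript
// Hypothetical sketch of "replace lone surrogates with U+FFFD":
// well-formed surrogate pairs pass through, unpaired halves become U+FFFD.
function replaceLoneSurrogates(s) {
  let out = "";
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF) {
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        out += s.slice(i, i + 2); // keep well-formed pairs intact
        i++;
      } else {
        out += "\uFFFD"; // unpaired high surrogate
      }
    } else if (c >= 0xDC00 && c <= 0xDFFF) {
      out += "\uFFFD"; // unpaired low surrogate
    } else {
      out += s[i];
    }
  }
  return out;
}

console.log(replaceLoneSurrogates("a\uD800b") === "a\uFFFDb"); // true
```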
(Reporter)

Comment 4

13 years ago
> For both CSS and HTML ....

Excellent.  Sounds good to me.

> Whatever the status of UCS-2 versus UTF-16 in Javascript, lone surrogates will
> always be illegal characters.

Yes, but the problem is detecting them.  Consider someone building a string one character at a time, or using unescapeURI, etc.  The real issue is that in JS a "character" is a single 16-bit unit and the language can operate on these, so it's trivial, in the language, to produce lone surrogates in all sorts of ways....
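To illustrate that point, two of the many escape-free ways JS code can manufacture a lone surrogate (hypothetical examples, written in modern syntax):

```javascript
// JS strings are sequences of 16-bit code units, so lone surrogates can be
// made without ever writing a \u escape.
const fromUnits = String.fromCharCode(0xD83D); // build one code unit at a time
const fromSlice = "\u{1F600}".slice(0, 1);     // slicing splits a surrogate pair

console.log(fromUnits === fromSlice);              // true: same lone high surrogate
console.log(fromUnits.charCodeAt(0).toString(16)); // "d83d"
```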
(Reporter)

Updated

13 years ago
Depends on: 316394
Comment 5

13 years ago
(In reply to comment #0)
> There's also the following comment in our "UCS216ToUnicode" converter:
> 
> // XXX : illegal surrogate code points are just passed through !!
> 

and also in UTF32ToUnicode:

// XXX Do we have to convert surrogate code points to the replacement
// character (0xfffd)?  

> but writing a testcase for that didn't seem to be worth the trouble and that
> might be fodder for a separate bug.

Yes, please :)
(Reporter)

Updated

13 years ago
Depends on: 317216
(Reporter)

Comment 6

13 years ago
Filed bug 317216
(Reporter)

Updated

13 years ago
Depends on: 331773
(Reporter)

Updated

12 years ago
Depends on: 377314

Comment 7

12 years ago
roc said something to me that sounded like he disagreed about this being a bug.

Comment 8

12 years ago
To keep text nodes from containing invalid UTF-16, the following functions might need checks:

* document.createTextNode()
* element.textContent = ...
* textNode.data = ...
* textNode.splitText()
* textNode.insertData(), etc
* The bug 194231 behavior

Updated

12 years ago
Blocks: 377360
(Reporter)

Comment 9

12 years ago
Or we could just throw on invalid UTF-16 in XPConnect... or in the JS engine.  That would help a lot more than whack-a-moling all sorts of DOM methods.

We'd still need to deal with splitText/insertData.  But that's pretty easy -- just have to check for a surrogate pair around the insertion/split point.
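That surrogate-pair check around the insertion/split point could look roughly like this (hypothetical helper name):

```javascript
// Hypothetical check: would splitting or inserting at offset k in s land
// between the two halves of a surrogate pair?
function splitsSurrogatePair(s, k) {
  if (k <= 0 || k >= s.length) return false;
  const hi = s.charCodeAt(k - 1);
  const lo = s.charCodeAt(k);
  return hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF;
}

console.log(splitsSurrogatePair("\u{1F600}", 1)); // true: splits U+1F600
console.log(splitsSurrogatePair("ab", 1));        // false
```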
Comment 10

12 years ago
Shouldn't the DOM spec say what to do with invalid UTF-16?
Comment 11

12 years ago
Yeah, INVALID_CHARACTER_ERR should be thrown. It's not explicitly mentioned in all cases where it's needed, but that's certainly the intent.
Comment 12

12 years ago
We're going to have to whack-a-mole these, so any help coming up with more APIs that need fixing would be useful.

jesse and I came up with these ones as well:

* .innerHTML
* loading an XML file with illegal UTF-16
* .setAttribute along with CSS to get the attribute displayed
* :before/:after and the CSS content property


That said, do we know that this is actually a problem?  I really don't like playing whack-a-mole, so it'd be great if we could fix the places in our code that can't deal with illegal UTF-16 rather than try to prevent it.
Assignee: dougt → jonas
Flags: in-testsuite?
Comment 13

12 years ago
Isn't Johnny saying in comment #11 that we need to whack-a-mole in order to be standards compliant?

We can certainly make text rendering handle invalid UTF-16, as a backstop. I actually don't know of any outstanding bugs in that area.

Comment 14

12 years ago
Once bug 377360 is fixed, I'll be able to test and find out whether text rendering has issues with lonely surrogates and such ;)
(Reporter)

Comment 15

12 years ago
Pretty much any place that uses atoms can't deal with illegal UTF-16.

And if we do the checking in either XPConnect or better yet in the JS engine (e.g. at initial String object creation time), it shouldn't be too whack-a-moly...

I don't know whether JS can actually enforce this, of course, since as far as it's concerned there is nothing outside the BMP... :(

> * :before/:after and the CSS content property

That comes through the CSS parser, which doesn't allow through stuff that's not actually characters....  Unless the CSS starts out encoded in UTF-16 and containing bogus bytes to start with, of course; see the part about the converter in comment 0.
Comment 16

12 years ago
You can't modify :before/:after through CSSOM?

Jst says we can't block this at a JS level.
(Reporter)

Comment 17

12 years ago
> You can't modify :before/:after through CSSOM?

Hmm....  You can, yes.  That basically comes back to the "UTF-16 data that is bogus to start with" case as far as the CSS parser is concerned...

Can we block this in XPConnect?  E.g. have a flag on JSStrings indicating whether they're valid UTF-16, have the JS engine propagate it where possible (e.g. when concatenating two valid strings the concatenation is valid), and have XPConnect check strings that don't have the flag set before passing them into Gecko (and flag them if they're OK)?  Or would that cost too much?
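A toy model of that flag idea (all names hypothetical; real JSString internals look nothing like this):

```javascript
// Toy model: lazily compute a "valid UTF-16" flag and propagate it across
// concatenation. The concatenation of two well-formed strings is itself
// well-formed, since neither can end in a lone high or start with a lone low.
class FlaggedString {
  constructor(s, known = null) {
    this.s = s;
    this.known = known; // true = known valid, null = not yet checked
  }
  isValid() {
    if (this.known === null) {
      // strip well-formed pairs; any surrogate left over is unpaired
      const rest = this.s.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, "");
      this.known = !/[\uD800-\uDFFF]/.test(rest);
    }
    return this.known;
  }
  concat(other) {
    const known = this.known === true && other.known === true ? true : null;
    return new FlaggedString(this.s + other.s, known);
  }
}

const a = new FlaggedString("abc");
a.isValid();                       // true; flag now cached
const b = a.concat(a);
console.log(b.known);              // true: no re-scan needed
console.log(new FlaggedString("\uD800").isValid()); // false
```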
Comment 18

12 years ago
(In reply to comment #13)
> Isn't Johnny saying in comment #11 that we need to whack-a-mole in order to be
> standards compliant?

I'm fine with fixing the DOM APIs as well, I'm just not comfortable with relying on that for security.
Depends on: 448166

Updated

9 years ago
Duplicate of this bug: 377314