316338 - Possible to introduce bogus UTF-16 into Gecko

Reporter

Description

•

19 years ago

I'll attach a simple testcase that introduces bogus UTF16 byte sequences (high surrogate not followed by low surrogate) into Gecko in three different ways:

1) CSS character escape
2) JS character escape
3) HTML entity reference.

All three cause asserts as we attempt to atomize the bogus UTF-16, etc.

There's also the following comment in our "UCS216ToUnicode" converter:

// XXX : illegal surrogate code points are just passed through !!

but writing a testcase for that didn't seem to be worth the trouble and that might be fodder for a separate bug.

So the question is what we can do to deal with this... Ideally we'd drop high surrogates not followed by a low surrogate and vice versa. For CSS, we could maybe drop surrogates altogether, since I don't believe they're valid Unicode chars and CSS escapes escape Unicode chars, not UTF16 things. If we did so, would this break anyone? Would it violate the CSS spec to do this (I think it wouldn't, but I could be wrong).

For JavaScript, I'm not sure. http://developer.mozilla.org/en/docs/Core_JavaScript_1.5_Guide:Unicode#Unicode_Escape_Sequences talks about chars, but given that it's a 4-digit (hex) escape sequence, that doesn't cover all of Unicode unless we implicitly assume UTF-16 and surrogate pairs. That said, I seem to recall that JS only promises UCS-2, not all of Unicode... I also suspect there are other ways of producing bogus strings in JavaScript than just \u escapes, since you can operate on the individual codepoints in the string.

For HTML, since http://www.w3.org/TR/html401/charset.html#h-5.3.1 talks about "ISO 10646 character number" without any limitations to 4 digits, I think we should be able to do the same thing as for CSS.

Thoughts?

Boris Zbarsky [:bzbarsky]

Reporter

Comment 1

•

19 years ago

Attached file Testcase -- WARNING: will crash on shutdown — Details

This shows the three problems; it crashes on shutdown as we try to remove some of the atoms involved from the atom hashtables.

Boris Zbarsky [:bzbarsky]

Reporter

Comment 2

•

19 years ago

Attached file Same as text/plain — Details

Simon Montagu :smontagu

Comment 3

•

19 years ago

(In reply to comment #0)

For both CSS and HTML I think we should treat all escaped surrogates as illegal characters, whether they come singly or in proper UTF-16 pairs. We should either drop them altogether or replace them by the replacement character U+FFFD. It seems to me the same issue as UTF-8 characters encoded in a non-shortest form, with the same possibility of security risks from spoofing characters (I agree that spoofing characters in the astral planes is less likely to be an issue, but still...)

> For JavaScript, I'm not sure. 

Whatever the status of UCS-2 versus UTF-16 in Javascript, lone surrogates will always be illegal characters.

Boris Zbarsky [:bzbarsky]

Reporter

Comment 4

•

19 years ago

> For both CSS and HTML ....

Excellent.  Sounds good to me.

> Whatever the status of UCS-2 versus UTF-16 in Javascript, lone surrogates will
> always be illegal characters.

Yes, but the problem is detecting them.  Consider someone building a string one char at a time.  Or using unescapeURI.  etc, etc.  The real issue is that in JS a "character" is a single 16-bit unit and the language can operate on these.  So it's trivial, in the language, to produce lone surrogates in all sorts of ways....

Boris Zbarsky [:bzbarsky]

Reporter

Updated

•

19 years ago

Depends on: 316394

Simon Montagu :smontagu

Comment 5

•

19 years ago

(In reply to comment #0)
> There's also the following comment in our "UCS216ToUnicode" converter:
> 
> // XXX : illegal surrogate code points are just passed through !!
> 

and also in UTF32ToUnicode:

// XXX Do we have to convert surrogate code points to the replacement
// character (0xfffd)?  

> but writing a testcase for that didn't seem to be worth the trouble and that
> might be fodder for a separate bug.

Yes, please :)

Boris Zbarsky [:bzbarsky]

Reporter

Updated

•

19 years ago

Depends on: 317216

Boris Zbarsky [:bzbarsky]

Reporter

Comment 6

•

19 years ago

Filed bug 317216

Boris Zbarsky [:bzbarsky]

Reporter

Updated

•

18 years ago

Depends on: 331773

Boris Zbarsky [:bzbarsky]

Reporter

Updated

•

17 years ago

Depends on: 377314

Jesse Ruderman

Comment 7

•

17 years ago

roc said something to me that sounded like he disagreed about this being a bug.

Jesse Ruderman

Comment 8

•

17 years ago

To keep text nodes from containing invalid UTF-16, the following functions might need checks:

* document.createTextNode()
* element.textContent = ...
* textNode.data = ...
* textNode.splitText()
* textNode.insertData(), etc
* The bug 194231 behavior

Jesse Ruderman

Updated

•

17 years ago

Blocks: 377360

Boris Zbarsky [:bzbarsky]

Reporter

Comment 9

•

17 years ago

Or we could just thrown on invalid UTF-16 in XPConnect... or in the JS engine.  That would help a lot more than whack-a-moling all sorts of DOM methods.

We'd still need to deal with splitText/insertData.  But that's preatty easy -- just have to check for a surrogate pair around the insertion/split point.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 10

•

17 years ago

Shouldn't the DOM spec say what to do with invalid UTF16?

Johnny Stenback (:jst)

Comment 11

•

17 years ago

Yeah, INVALID_CHARACTER_ERR should be thrown. It's not explicitly mentioned in all cases where it's needed, but that's certainly the intent.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Comment 12

•

17 years ago

We're going to have to whack-a-mole these, so any help with coming up with more APIs that needs fixing would be useful.

jesse and I came up with these ones as well:

* .innerHTML
* loading an XML file with illegal UTF-16
* .setAttribute along with CSS to get the attribute displayed
* :before/:after and the CSS content property


That said, do we know that this is actually a problem. I really don't like to play whack-a-mole so it'd be great if we could fix the places in our code that can't deal with illegal utf-16 rather than to try to prevent it.

Assignee: dougt → jonas

Flags: in-testsuite?

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 13

•

17 years ago

Isn't Johnny saying in comment #11 that we need to whack-a-mole in order to be standards compliant?

We can certainly make text rendering handle invalid UTF16, as a backstop. I actually don't know of any outstanding bugs in that area.

Jesse Ruderman

Comment 14

•

17 years ago

Once bug 377360 is fixed, I'll be able to test and find out whether text rendering has issues with lonely surrogates and such ;)

Boris Zbarsky [:bzbarsky]

Reporter

Comment 15

•

17 years ago

Pretty much any place that uses atoms can't deal with illegal UTF-16.

And if we do the checking in either XPConnect or better yet in the JS engine (e.g. at initial String object creation time), it shouldn't be too whack-a-moly...

I don't know whether JS can actually enforce, this, of course, since as far as it's concerned there is nothing outside the BMP... :(

> * :before/:after and the CSS content property

That comes through the CSS parser.  Which doesn't allow through stuff that's not actually characters....  Unless the CSS starts out encoded in UTF16 and containing bogus bytes to start with, of course; see the part about the converted in comment 0.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Comment 16

•

17 years ago

You can't modify :before/:after through CSSOM?

Jst says we can't block this at a JS level.

Boris Zbarsky [:bzbarsky]

Reporter

Comment 17

•

17 years ago

> You can't modify :before/:after through CSSOM?

Hmm....  You can, yes.  That basically comes back to the "UTF16 data that is bogus to start with" case as far as the CSS parser is concerned...

Can we block this in XPConnect?  e.g. have a flag on JSStrings if they're valid UTF16, have JS engine propagate it if desired (e.g. when concatenating two valid strings the concatenation is valid), and have XPConnect check strings that don't have the flag set before passing them into Gecko (and flag them if they're ok)?  Or would that cost too much?

Jonas Sicking (:sicking) No longer reading bugmail consistently

Comment 18

•

17 years ago

(In reply to comment #13)
> Isn't Johnny saying in comment #11 that we need to whack-a-mole in order to be
> standards compliant?

I'm fine with fixing the DOM APIs as well, I'm just not comfortable with relying on that for security.

Daniel Veditz [:dveditz]

Updated

•

16 years ago

Depends on: 448166

BMO Automation

Updated

•

2 years ago

Severity: normal → S3

BugBot [:suhaib / :marco/ :calixte]

Comment 20

•

2 years ago

The bug assignee is inactive on Bugzilla, so the assignee is being reset.

Assignee: jonas → nobody

Testcase -- WARNING: will crash on shutdown 19 years ago Boris Zbarsky [:bzbarsky] 331 bytes, text/html		Details
Same as text/plain 19 years ago Boris Zbarsky [:bzbarsky] 331 bytes, text/plain		Details