Closed Bug 396127 Opened 17 years ago Closed 16 years ago

When setting input.value (for an HTMLInputElement) the next character is deleted after an invalid unpaired UTF-16 surrogate

Categories

(Core :: DOM: Core & HTML, defect)

All
Linux
defect
Not set
normal

RESOLVED FIXED

People

(Reporter: bugzilla, Assigned: smontagu)

Details

(Whiteboard: [sg:investigate])

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.8.1.3) Gecko/20060601 Firefox/2.0.0.3 (Ubuntu-edgy)
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.8.1.3) Gecko/20060601 Firefox/2.0.0.3 (Ubuntu-edgy)

Normally, Mozilla allows invalid unpaired UTF-16 surrogates into the DOM.  But when setting the "value" attribute of an INPUT element, each invalid surrogate is removed *and so is the following character* (sometimes).

For instance, 'var s = "\ud863_"' creates a string of length 2.  However, if we then set 'someInputElement.value = s', then "someInputElement.value" has length 0.

Deleting invalid surrogates is probably OK, but deleting a valid character is a really bad idea, because if it is some sort of control character (in any domain-specific sense) then the semantics of the string can be changed radically.

For instance, the "@" sign can be removed from an email address, a "/" removed from a URL, or a safe JSON string turned into malicious code (See the contents of the data URL given).

OK, it might be unlikely that a JSON string would be pasted into an input box, but if this behaviour extends to other objects then it could be quite unsafe.
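
To make the risk concrete, here is a minimal sketch that models the effect described above in plain JavaScript. It is not Gecko's code: the function name simulateReportedBug is hypothetical, and it assumes the behaviour is "drop an unpaired high surrogate together with the code unit that follows it", which matches the examples in this report but may not cover every case (the report itself says "sometimes").

```javascript
// Hypothetical model of the behaviour described in the report; this is NOT
// Gecko's code. A high surrogate with no low surrogate after it is dropped
// together with the code unit that follows it.
function simulateReportedBug(str) {
  var out = "";
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      var next = i + 1 < str.length ? str.charCodeAt(i + 1) : NaN;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Valid pair: keep both code units.
        out += str.charAt(i) + str.charAt(i + 1);
      }
      // Skip the following unit whether or not it was a low surrogate.
      i++;
      continue;
    }
    out += str.charAt(i);
  }
  return out;
}

// The "@" that separates local part and domain disappears.
var email = simulateReportedBug("user\ud863@example.com");
```

Under this model, a filter that validates the string before it reaches the input element would see the "@", while a consumer reading it back afterwards would not.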

Reproducible: Always

Steps to Reproduce:
var bad = String.fromCharCode(0xd863); // half a surrogate pair
var str = bad + "hello";
alert("before: " + str); // Prints "before: ?hello"
var elt = document.createElement('input');
elt.value = str; // XXX mozilla removes the surrogate *and* the h
alert("after: " + elt.value); // prints "after: ello"
Summary: When setting input.value (for an HTMLInputElement) deletes next character after an invalid unpaired UTF-16 surrogate → When setting input.value (for an HTMLInputElement) the next character is deleted after an invalid unpaired UTF-16 surrogate
Interesting attack idea.

jst, where else does this behavior occur?  For example, does the same thing happen during HTML parsing for either UTF-8 (when the UTF-8 encodes one of the code points reserved for use as surrogates) or UTF-16?
Trunk behavior is slightly different: we do not remove the surrogate, but we still remove the character after it (or maybe we are treating the second character as part of the "pair"). The "after" alert from comment 0 on trunk is "after: ?ello", where '?' is now a graphical box.

Simon: what is the correct behavior here?
Assignee: nobody → smontagu
Status: UNCONFIRMED → NEW
Ever confirmed: true
Whiteboard: [sg:investigate]
I vote for throwing an exception ;)
I am seeing "before: [fffd]ello" and "after: [fffd]ello" on linux trunk which is at least consistent. I think Dan is right and we are trying to decode 0xd863 and the next character as a surrogate pair. If we don't throw, correct behaviour would be to treat the unpaired surrogate as invalid and resynchronize on the next character, so the expected result is [fffd]hello, which is what we get from data:text/html,<p>&#xd863;hello</p>

See also bug 316338
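
The expected result described above ([fffd]hello) can be sketched in plain JavaScript. This is only an illustration of "treat the unpaired surrogate as invalid and resynchronize on the next character", not the patch from any bug; the helper name replaceLoneSurrogates is hypothetical.

```javascript
// Hypothetical helper (not Gecko code): replace each unpaired UTF-16
// surrogate with U+FFFD and resume scanning at the very next code unit.
function replaceLoneSurrogates(str) {
  var out = "";
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    var isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    var isLow = unit >= 0xDC00 && unit <= 0xDFFF;
    if (isHigh && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Valid pair: copy both code units and skip past the low half.
        out += str.charAt(i) + str.charAt(i + 1);
        i++;
        continue;
      }
    }
    // A lone high or low surrogate becomes U+FFFD; nothing else is consumed.
    out += (isHigh || isLow) ? "\uFFFD" : str.charAt(i);
  }
  return out;
}

var bad = String.fromCharCode(0xd863); // half a surrogate pair
var fixed = replaceLoneSurrogates(bad + "hello");
```

Here fixed keeps all five letters of "hello" and has length 6, because only the invalid code unit is replaced.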
FWIW, fixing this would fix test 68 in Acid3.
There's a patch in bug 421576 which probably fixes this, but you know we don't want to be touching UTF-8 parsing code right now unless there's no way we can avoid it, so I expect it in 4 or perhaps 3.5.
Component: DOM: HTML → DOM: Core & HTML
Fixed by bug 421576
Status: NEW → RESOLVED
Closed: 16 years ago
Depends on: 421576
Resolution: --- → FIXED
Group: core-security