Closed
Bug 172701
Opened 22 years ago
Closed 13 years ago
UTF-8 decoder accepts prohibited characters
Categories
(Core :: Internationalization, defect)
RESOLVED
INVALID
People
(Reporter: jgmyers, Assigned: smontagu)
Details
(Keywords: intl)
Attachments
(1 file)
patch, 726 bytes
The UTF-8 decoder incorrectly accepts the prohibited characters U+FFFE and
U+FFFF. Accepting the former is particularly bad, as it might later be
interpreted as a byteswapped byte order mark.
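For concreteness, the byte sequences at issue (a minimal sketch in Python, not
the Mozilla decoder; Python's own decoder happens to let these through as well):

    # EF BF BE and EF BF BF are structurally well-formed UTF-8 for the
    # non-characters U+FFFE and U+FFFF; the question in this bug is
    # whether a decoder should let them through.
    assert b"\xef\xbf\xbe".decode("utf-8") == "\ufffe"
    assert b"\xef\xbf\xbf".decode("utf-8") == "\uffff"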
Reporter
Comment 1 • 22 years ago
Comment 2 • 22 years ago
UTF-8 can't be byteswapped, and it doesn't matter if there's a BOM or not
(actually, a BOM is quite useful in UTF-8, in terms of opening documents
in an editor). Furthermore:
http://www.unicode.org/unicode/faq/utf_bom.html#1
"Since every Unicode coded character sequence maps to a unique sequence of bytes
in a given UTF, a reverse mapping can be derived. Thus every UTF supports
lossless round tripping: mapping from any Unicode coded character sequence S to
a sequence of bytes and back will produce S again. To ensure round tripping, a
UTF mapping must also map the 16-bit values that are not valid Unicode values
to unique byte sequences. These invalid 16-bit values are FFFE, FFFF, and
unpaired surrogates."
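The round-tripping requirement quoted above can be checked directly (an
illustrative Python sketch, not part of the bug):

    # Lossless round trip: a sequence containing the non-character U+FFFE
    # maps to UTF-8 bytes and back unchanged.
    s = "A\ufffeB"
    assert s.encode("utf-8").decode("utf-8") == s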
Reporter
Comment 3 • 22 years ago
UTF-8 can't be byteswapped, but the UTF-16 it decodes into can be. It is
conceivable that an attacker could bypass an input validator this way. For
example, suppose there is an input validator that prohibits input containing a
'/' character. The attacker encodes their input as the UTF-8 sequence for
U+FFFE followed by the UTF-8 encoding of the byteswapped version of their
desired input. Since the '/' is encoded as the UTF-8 sequence for U+2F00, it
passes through the validator. After the data is decoded into UTF-16, some later
UTF-16 parser treats the U+FFFE as a byteswapped BOM, byteswaps the rest of the
data, and handles it as the attacker intended.
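A hypothetical end-to-end sketch of that bypass (Python for illustration only;
naive_validator and byteswap_utf16 are made-up names, not Mozilla APIs):

    def naive_validator(text: str) -> bool:
        return "/" not in text          # rejects path separators

    def byteswap_utf16(units: bytes) -> bytes:
        # Swap each 16-bit code unit, as a BOM-sniffing parser would.
        return b"".join(units[i + 1:i + 2] + units[i:i + 1]
                        for i in range(0, len(units), 2))

    payload = "/etc/passwd"
    # Byteswap each code unit of the payload ('/' U+002F becomes U+2F00).
    swapped = "".join(chr(((ord(c) & 0xFF) << 8) | (ord(c) >> 8))
                      for c in payload)

    # This only works if the UTF-8 decoder lets U+FFFE through.
    attack = ("\ufffe" + swapped).encode("utf-8")
    decoded = attack.decode("utf-8")
    assert naive_validator(decoded)     # no '/' is visible, so it passes

    # A later consumer serializes to UTF-16, sees 0xFFFE first, and
    # byteswaps the rest, recovering the '/' the validator never saw.
    units = decoded.encode("utf-16-be")
    assert units[:2] == b"\xff\xfe"
    recovered = byteswap_utf16(units[2:]).decode("utf-16-be")
    assert recovered == payload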
Assignee
Comment 4 • 22 years ago
Assigning to myself as intl security contact and cc-ing mstolz, but I think this
is WONTFIX, because the UTF-8 decoder seems to be the wrong place to try to fix
this.
Assignee: yokoyama → smontagu
Reporter
Comment 5 • 22 years ago
Could you explain why you believe the UTF-8 decoder is the wrong place to fix
this? The UTF-8 decoder is the place that has the relevant knowledge and is the
only place where this can be reliably addressed.
Comment 6 • 22 years ago
According to the discussion on the unicode@unicode.org mailing list, the UTF-8
converter SHOULD accept and convert it:
"Unicode 3.0 chapter 3.8 D29 defines this, and the text there and below spells
out that non-characters and the like must be converted as well. The change
since 3.0 only affects single-surrogate code points. Non-characters should not
be exchanged across system boundaries, but the converter does not necessarily
define such a boundary."
from Markus Scherer
Reporter
Comment 7 • 22 years ago
Could you give a more precise reference to this discussion on the mailing list?
I wasn't able to find it quickly with the supplied reference.
Reporter
Comment 8 • 22 years ago
It is clear to me that the members of the Unicode mailing list are unaware of
the security ramifications of permitting the decoding of EF BF BE as U+FFFE. I
have tried sending a message about this to the Unicode list, but it did not
appear in the mailing list archives. I have forwarded the message to Mark
Davis, asking his help in bringing the issue to the Unicode Consortium.
I argue that it is not appropriate to leave open a potential security
vulnerability just to conform to an obscure and ill-considered requirement in a
standard.
Assignee
Comment 9 • 22 years ago
It's clear to me that our current behaviour contradicts the Unicode compliance
requirement. Section C5 in chapter 3 of Unicode 3.0 says "A process shall not
interpret either U+FFFE or U+FFFF as an abstract character," and this is emended
in version 3.1 (http://www.unicode.org/reports/tr27/) to include the values
U+nFFFE and U+nFFFF (where n is from 0 to 0x10) and the values U+FDD0..U+FDEF.
However, I still think it would be wrong to fix this just in the UTF-8 decoder.
We also need to handle the non-characters if they appear as NCRs, or in UTF-16
or UTF-32, or any other encoding which can encode them. I think they should be
handled by the content sinks, especially since a non-character is a fatal error
in XML but can be ignored in HTML.
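The full set enumerated above fits a one-line predicate (an illustrative Python
sketch; is_noncharacter is a hypothetical name, not the proposed fix):

    def is_noncharacter(cp: int) -> bool:
        # U+FDD0..U+FDEF, plus U+nFFFE/U+nFFFF for every plane n in 0..16.
        return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

    assert is_noncharacter(0xFFFE) and is_noncharacter(0x10FFFF)
    assert not is_noncharacter(0xFFFD)  # U+FFFD is the replacement character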
Comment 10 • 22 years ago
FYI, |nsConvertUTF8toUCS2|
(http://lxr.mozilla.org/seamonkey/source/string/obsolete/nsString2.h#570)
replaces the two non-characters with U+FFFD (it has not been updated to include
U+pFFFE and U+pFFFF with p from 1 to 16, or U+FDD0..U+FDEF). Obviously, its
usage pattern (relatively short strings of 'meta' data) is very different from
that of the UTF-8 decoder (used for 'content'), so what one does doesn't have
to be automatically implemented by the other.
Anyway, I agree with Simon that non-characters (for content) have to be dealt
with in the content sinks, so that non-characters arriving in various encodings
are treated consistently.
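A sketch of the replacement behaviour described above, extended to the full
non-character set (Python for illustration; scrub_noncharacters is a
hypothetical name, not the nsConvertUTF8toUCS2 implementation):

    def scrub_noncharacters(text: str) -> str:
        # Replace every non-character with U+FFFD instead of dropping it,
        # so string length and indices are preserved.
        def is_nonchar(cp: int) -> bool:
            return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
        return "".join("\ufffd" if is_nonchar(ord(c)) else c for c in text)

    assert scrub_noncharacters("a\ufffeb") == "a\ufffdb"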
Reporter
Comment 11 • 22 years ago
How is the author of a "content sink" supposed to know to guard against the
attack described in comment 3? It would seem safer to make sure no converter
can emit the U+FFFE non-character.
Updated • 15 years ago
QA Contact: amyy → i18n
Comment 12 • 13 years ago
Doing this (rejecting the non-characters) is prohibited by the UTF-8 decoding
standard.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INVALID