Closed Bug 172701 Opened 22 years ago Closed 13 years ago

UTF-8 decoder accepts prohibited characters

Categories

(Core :: Internationalization, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED INVALID

People

(Reporter: jgmyers, Assigned: smontagu)

References

Details

(Keywords: intl)

Attachments

(1 file)

The UTF-8 decoder incorrectly accepts the prohibited characters U+FFFE and U+FFFF. Accepting the former is particularly bad, as it might later be interpreted as a byteswapped byte order mark.
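For reference, the byte sequences at issue are EF BF BE and EF BF BF. Below is a minimal sketch (not the actual decoder code; the function name is made up for illustration) of how a three-byte UTF-8 sequence assembles into those code points:

#include <cassert>
#include <cstdint>

// Illustrative only: assemble the code point from a 3-byte UTF-8 sequence.
uint32_t DecodeThreeByteUtf8(uint8_t b0, uint8_t b1, uint8_t b2)
{
    return ((b0 & 0x0Fu) << 12) | ((b1 & 0x3Fu) << 6) | (b2 & 0x3Fu);
}

int main()
{
    assert(DecodeThreeByteUtf8(0xEF, 0xBF, 0xBE) == 0xFFFE);  // currently accepted
    assert(DecodeThreeByteUtf8(0xEF, 0xBF, 0xBF) == 0xFFFF);  // currently accepted
    return 0;
}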
Attached patch: proposed fix
UTF-8 can't be byteswapped, and it doesn't matter whether there's a BOM or not (actually, a BOM is quite useful in UTF-8 in terms of opening documents in an editor). Furthermore, http://www.unicode.org/unicode/faq/utf_bom.html#1 says: "Since every Unicode coded character sequence maps to a unique sequence of bytes in a given UTF, a reverse mapping can be derived. Thus every UTF supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF mapping must also map the 16-bit values that are not valid Unicode values to unique byte sequences. These invalid 16-bit values are FFFE, FFFF, and unpaired surrogates."
Keywords: intl
QA Contact: ruixu → ylong
UTF-8 can't be byteswapped, but the UTF-16 it decodes into can be. It is conceivable that an attacker could bypass an input validator this way. For example, suppose there is an input validator that prohibits input containing a '/' character. The attacker encodes their input as the UTF-8 sequence for U+FFFE followed by the UTF-8 encoding of the byteswapped version of their desired input. Since the '/' is encoded as the UTF-8 sequence for U+2F00, it passes through the validator. After the data is decoded into UTF-16, some later UTF-16 parser handles the U+FFFE as a byteswapped BOM and reads the rest of the data as the attacker intended.
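To make the scenario concrete, here is an illustrative sketch (none of these names come from Mozilla code, and the downstream byteswapping parser is an assumption of the attack) showing how a consumer that honours a leading U+FFFE as a byteswapped BOM would recover the '/' that the validator never saw:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical validator: rejects input containing a literal '/'.
bool ContainsSlash(const std::vector<char16_t>& units)
{
    for (char16_t u : units)
        if (u == u'/')
            return true;
    return false;
}

int main()
{
    // What the UTF-8 decoder currently produces from the attacker's bytes
    // EF BF BE E2 BC 80: U+FFFE followed by U+2F00.
    std::vector<char16_t> decoded = {0xFFFE, 0x2F00};

    std::cout << "validator sees '/': " << ContainsSlash(decoded) << '\n';  // prints 0

    // A naive downstream UTF-16 reader that treats a leading U+FFFE as a
    // byteswapped BOM swaps every following unit, turning U+2F00 into U+002F.
    if (!decoded.empty() && decoded[0] == 0xFFFE) {
        for (size_t i = 1; i < decoded.size(); ++i)
            decoded[i] = static_cast<char16_t>((decoded[i] << 8) | (decoded[i] >> 8));
    }
    std::cout << "downstream sees '/': " << (decoded[1] == u'/') << '\n';  // prints 1
    return 0;
}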
Assigning to myself as intl security contact and cc-ing mstolz, but I think this is WONTFIX, because the UTF-8 decoder seems to be the wrong place to try to fix this.
Assignee: yokoyama → smontagu
Could you explain why you believe the UTF-8 decoder is the wrong place to fix this? The UTF-8 decoder is the place that has the relevant knowledge and is the only place where this can be reliably addressed.
Blocks: 182751
According to the discussion on the unicode@unicode.org mailing list, the UTF-8 converter SHOULD accept and convert it: "Unicode 3.0 chapter 3.8 D29 defines this, and the text there and below spells out that non-characters and the like must be converted as well. The change since 3.0 only affects single-surrogate code points. Non-characters should not be exchanged across system boundaries, but the converter does not necessarily define such a boundary." - from Markus Scherer
Could you give a more precise reference to this discussion on the mailing list? I wasn't able to find it quickly with the supplied reference.
It is clear to me that the members of the unicode mailing list are unaware of the security ramifications of permitting the decoding of EF BF BE as U+FFFE. I have tried sending a message about this to the unicode list, but it did not appear in the mailing list archives. I have forwarded the message to Mark Davis asking his help in bringing up the issue to the Unicode Consortium. I argue that it is not appropriate to leave open a potential security vulnerability just to conform to an obscure and ill-considered requirement in a standard.
It's clear to me that our current behaviour contradicts the Unicode compliance requirement. Section C5 in chapter 3 of Unicode 3.0 says "A process shall not interpret either U+FFFE or U+FFFF as an abstract character," and this is emended in version 3.1 (http://www.unicode.org/reports/tr27/) to include the values U+nFFFE and U+nFFFF (where n is from 0 to 0x10) and the values U+FDD0..U+FDEF. However, I still think it would be wrong to fix this just in the UTF-8 decoder. We also need to handle the non-characters if they appear as NCRs, in UTF-16 or UTF-32, or in any other encoding which can encode them. I think they should be handled by the content sinks (especially since a non-character is a fatal error in XML, but can be ignored in HTML).
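For reference, the full noncharacter set from Unicode 3.1 described above (U+FDD0..U+FDEF plus the last two code points of every plane) could be tested with a predicate like the following; this is a hypothetical helper, not code from the tree:

#include <cstdint>

// U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF for every plane n = 0..0x10.
bool IsUnicodeNoncharacter(uint32_t cp)
{
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;
    return cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE;
}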
FYI, |nsConvertUTF8toUCS2| (http://lxr.mozilla.org/seamonkey/source/string/obsolete/nsString2.h#570) replaces the two non-characters U+FFFE and U+FFFF with U+FFFD (it has not been updated to include U+pFFFE/U+pFFFF with p from 1 to 0x10 and U+FDD0..U+FDEF). Obviously, its usage pattern (usually relatively short strings of 'meta' data) is very different from that of the UTF-8 decoder (used for 'content'), so what one does doesn't have to be automatically implemented by the other. Anyway, I agree with Simon that non-characters (in content) have to be dealt with in the content sinks, so that non-characters arriving in various encodings are treated consistently.
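A sketch of the "replace with U+FFFD" behaviour described above, extended to the whole BMP noncharacter list; the function name and scope are illustrative, not the actual nsConvertUTF8toUCS2 code:

#include <string>

// Replace every BMP non-character in place with U+FFFD (REPLACEMENT CHARACTER).
void ReplaceBmpNoncharacters(std::u16string& s)
{
    for (char16_t& u : s) {
        if ((u >= 0xFDD0 && u <= 0xFDEF) || u == 0xFFFE || u == 0xFFFF)
            u = 0xFFFD;
    }
}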
How is the author of a "content sink" supposed to know to guard against the attack described in comment 3? It would seem safer to make sure no converter can emit the U+FFFE noncharacter.
Blocks: 86411
QA Contact: amyy → i18n
Doing this is prohibited by the UTF-8 decoding standard.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INVALID