Closed Bug 172701 Opened 22 years ago Closed 13 years ago

UTF-8 decoder accepts prohibited characters

Categories

(Core :: Internationalization, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED INVALID

People

(Reporter: jgmyers, Assigned: smontagu)

References

Details

(Keywords: intl)

Attachments

(1 file)

The UTF-8 decoder incorrectly accepts the prohibited characters U+FFFE and U+FFFF. Accepting the former is particularly bad, as it might later be interpreted as a byteswapped byte order mark.
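For reference, the byte sequences at issue are EF BF BE and EF BF BF. Below is a minimal sketch (not the actual decoder code; the function name is made up for illustration) of how a three-byte UTF-8 sequence assembles into those code points:

#include <cassert>
#include <cstdint>

// Illustrative only: assemble the code point from a 3-byte UTF-8 sequence.
uint32_t DecodeThreeByteUtf8(uint8_t b0, uint8_t b1, uint8_t b2)
{
    return ((b0 & 0x0Fu) << 12) | ((b1 & 0x3Fu) << 6) | (b2 & 0x3Fu);
}

int main()
{
    assert(DecodeThreeByteUtf8(0xEF, 0xBF, 0xBE) == 0xFFFE);  // currently accepted
    assert(DecodeThreeByteUtf8(0xEF, 0xBF, 0xBF) == 0xFFFF);  // currently accepted
    return 0;
}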
Attached patch: proposed fix
UTF-8 can't be byteswapped, and it doesn't matter whether there's a BOM or not (actually, a BOM is quite useful in UTF-8 in terms of opening documents in an editor). Furthermore, http://www.unicode.org/unicode/faq/utf_bom.html#1 says: "Since every Unicode coded character sequence maps to a unique sequence of bytes in a given UTF, a reverse mapping can be derived. Thus every UTF supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF mapping must also map the 16-bit values that are not valid Unicode values to unique byte sequences. These invalid 16-bit values are FFFE, FFFF, and unpaired surrogates."
Keywords: intl
QA Contact: ruixu → ylong
UTF-8 can't be byteswapped, but the UTF-16 it decodes into can be. It is conceivable that an attacker could bypass an input validator this way. For example, suppose there is an input validator that prohibits input containing a '/' character. The attacker encodes their input as the UTF-8 sequence for U+FFFE followed by the UTF-8 encoding of the byteswapped version of their desired input. Since the '/' is encoded as the UTF-8 sequence for U+2F00, it passes through the validator. After the data is decoded into UTF-16, some later UTF-16 parser handles the U+FFFE as a byteswapped BOM and reads the rest of the data as the attacker intended.
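To make the scenario concrete, here is an illustrative sketch (none of these names come from Mozilla code, and the downstream byteswapping parser is an assumption of the attack) showing how a consumer that honours a leading U+FFFE as a byteswapped BOM would recover the '/' that the validator never saw:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical validator: rejects input containing a literal '/'.
bool ContainsSlash(const std::vector<char16_t>& units)
{
    for (char16_t u : units)
        if (u == u'/')
            return true;
    return false;
}

int main()
{
    // What the UTF-8 decoder currently produces from the attacker's bytes
    // EF BF BE E2 BC 80: U+FFFE followed by U+2F00.
    std::vector<char16_t> decoded = {0xFFFE, 0x2F00};

    std::cout << "validator sees '/': " << ContainsSlash(decoded) << '\n';  // prints 0

    // A naive downstream UTF-16 reader that treats a leading U+FFFE as a
    // byteswapped BOM swaps every following unit, turning U+2F00 into U+002F.
    if (!decoded.empty() && decoded[0] == 0xFFFE) {
        for (size_t i = 1; i < decoded.size(); ++i)
            decoded[i] = static_cast<char16_t>((decoded[i] << 8) | (decoded[i] >> 8));
    }
    std::cout << "downstream sees '/': " << (decoded[1] == u'/') << '\n';  // prints 1
    return 0;
}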
Assigning to myself as intl security contact and cc-ing mstolz, but I think this is WONTFIX, because the UTF-8 decoder seems to be the wrong place to try to fix this.
Assignee: yokoyama → smontagu
Could you explain why you believe the UTF-8 decoder is the wrong place to fix this? The UTF-8 decoder is the place that has the relevant knowledge and is the only place where this can be reliably addressed.
Blocks: 182751
According to the discussion on the unicode@unicode.org mailing list, the UTF-8 converter SHOULD accept and convert it: "Unicode 3.0 chapter 3.8 D29 defines this, and the text there and below spells out that non-characters and the like must be converted as well. The change since 3.0 only affects single-surrogate code points. Non-characters should not be exchanged across system boundaries, but the converter does not necessarily define such a boundary." - from Markus Scherer
Could you give a more precise reference to this discussion on the mailing list? I wasn't able to find it quickly with the supplied reference.
It is clear to me that the members of the unicode mailing list are unaware of the security ramifications of permitting the decoding of EF BF BE as U+FFFE. I have tried sending a message about this to the unicode list, but it did not appear in the mailing list archives. I have forwarded the message to Mark Davis asking his help in bringing up the issue to the Unicode Consortium. I argue that it is not appropriate to leave open a potential security vulnerability just to conform to an obscure and ill-considered requirement in a standard.
It's clear to me that our current behaviour contradicts the Unicode compliance requirement. Section C5 in chapter 3 of Unicode 3.0 says "A process shall not interpret either U+FFFE or U+FFFF as an abstract character," and this is emended in version 3.1 (http://www.unicode.org/reports/tr27/) to include the values U+nFFFE and U+nFFFF (where n is from 0 to 0x10) and the values U+FDD0..U+FDEF. However, I still think it would be wrong to fix this just in the UTF-8 decoder. We also need to handle the non-characters if they appear as NCRs, in UTF-16 or UTF-32, or in any other encoding which can encode them. I think they should be handled by the content sinks (especially since a non-character is a fatal error in XML, but can be ignored in HTML).
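For reference, the full noncharacter set from Unicode 3.1 described above (U+FDD0..U+FDEF plus the last two code points of every plane) could be tested with a predicate like the following; this is a hypothetical helper, not code from the tree:

#include <cstdint>

// U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF for every plane n = 0..0x10.
bool IsUnicodeNoncharacter(uint32_t cp)
{
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;
    return cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE;
}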
FYI, |nsConvertUTF8toUCS2| (http://lxr.mozilla.org/seamonkey/source/string/obsolete/nsString2.h#570) replaces the two non-characters U+FFFE and U+FFFF with U+FFFD (it has not been updated to include U+pFFFE/U+pFFFF with p from 1 to 0x10 and U+FDD0..U+FDEF). Obviously, its usage pattern (usually relatively short strings of 'meta' data) is very different from that of the UTF-8 decoder (used for 'content'), so what one does doesn't have to be automatically implemented by the other. Anyway, I agree with Simon that non-characters (in content) have to be dealt with in the content sinks, so that non-characters arriving in various encodings are treated consistently.
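A sketch of the "replace with U+FFFD" behaviour described above, extended to the whole BMP noncharacter list; the function name and scope are illustrative, not the actual nsConvertUTF8toUCS2 code:

#include <string>

// Replace every BMP non-character in place with U+FFFD (REPLACEMENT CHARACTER).
void ReplaceBmpNoncharacters(std::u16string& s)
{
    for (char16_t& u : s) {
        if ((u >= 0xFDD0 && u <= 0xFDEF) || u == 0xFFFE || u == 0xFFFF)
            u = 0xFFFD;
    }
}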
How is the author of a "content sink" supposed to know to guard against the attack described in comment 3? It would seem safer to make sure no converter can emit the U+FFFE noncharacter.
Blocks: 86411
QA Contact: amyy → i18n
Doing this is prohibited by the UTF-8 decoding standard.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INVALID