Closed Bug 301797 Opened 19 years ago Closed 19 years ago

UTF-8 decoder drops byte on encoding error

Categories

(Core :: Internationalization, defect)

x86
Windows XP
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: udo_richter, Assigned: smontagu)

References

Details

Attachments

(2 files)

On certain UTF-8 documents with encoding errors, the UTF-8 decoder gets confused
and drops one byte exactly 4096 bytes later. The encoding error has to be on a
4096-byte-boundary too. Most probably this a buffer handling error. This affects
XML and HTML files. Testcase follows.

This triggers bug 280442 - A bad downloads.rdf file causes firefox to crash on
download

Some related references:
Bug 174351 - Encoding errors aren't treated as fatal XML errors
bug 231267 - Charset converters should always notify callers of encoding errors
Attached file testcase html
I still need to confirm it, but I think the problem is here:
http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/intl/uconv/src/nsUTF8ToUnicode.cpp&rev=1.8&mark=83#78

The assumption is that in the worst case the number of UTF-16 code units in the
output can be no greater than the number of UTF-8 code units in the input. In
the testcase that assumption doesn't hold: the buffer from offset 0x0000 to
offset 0x0FFF ends with 0xE4, which is assumed to be the first octet of a
multi-octet sequence and is not decoded. Since the buffer from offset 0x1000 to
0x1FFF begins with an ASCII space character, the sequence turns out to be
invalid and is decoded to a replacement character, but the space character is
treated as valid input [1]. The second buffer thus contains 4096 UTF-8 code
units (all in the ASCII range) which decode to 4096 UTF-16 code units plus one
more for the replacement character.

[1] One could argue that this behaviour is incorrect, and the two characters
following 0xe4 should be treated as part of the illegal sequence and ignored,
but let's gloss over that issue for now.
Status: NEW → ASSIGNED
Sounds like a good explanation. I just did a quick test: Inserting one 2-byte
character somewhere inside 0x1000-0x1fff is enough to let the missing character
re-appear. With the 2-byte character, the total number of UTF-16 units is back
at 4096 and GetMaxLength is correct.
Changing GetMaxLength() to |*aDestLength = aSrcLength + 1;| also makes the
missing character reappear. I'll attach a patch shortly.
Attached patch PatchSplinter Review
Attachment #190396 - Flags: superreview?(bzbarsky)
Attachment #190396 - Flags: review?(jshin1987)
Attachment #190396 - Flags: superreview?(bzbarsky) → superreview+
Attachment #190396 - Flags: review?(jshin1987) → review+
Blocks: 280442
Comment on attachment 190396 [details] [diff] [review]
Patch

Asking approval. This is a minimum-risk patch. The error will rarely occur in
practice but it's ugly when it does.
Attachment #190396 - Flags: approval1.8b4?
Sorry, if I had remembered that this causes a crash, I would have phrased my
approval request much more emphatically.
Attachment #190396 - Flags: approval1.8b4? → approval1.8b4+
Checked in.
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: