Closed
Bug 301797
Opened 19 years ago
Closed 19 years ago
UTF-8 decoder drops byte on encoding error
Categories
(Core :: Internationalization, defect)
Tracking
()
RESOLVED
FIXED
People
(Reporter: udo_richter, Assigned: smontagu)
References
Details
Attachments
(2 files)
8.14 KB,
text/html
|
Details | |
1.91 KB,
patch
|
jshin1987
:
review+
bzbarsky
:
superreview+
benjamin
:
approval1.8b4+
|
Details | Diff | Splinter Review |
On certain UTF-8 documents with encoding errors, the UTF-8 decoder gets confused and drops one byte exactly 4096 bytes later. The encoding error has to be on a 4096-byte-boundary too. Most probably this a buffer handling error. This affects XML and HTML files. Testcase follows. This triggers bug 280442 - A bad downloads.rdf file causes firefox to crash on download Some related references: Bug 174351 - Encoding errors aren't treated as fatal XML errors bug 231267 - Charset converters should always notify callers of encoding errors
Reporter | ||
Comment 1•19 years ago
|
||
Assignee | ||
Comment 2•19 years ago
|
||
I still need to confirm it, but I think the problem is here: http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/intl/uconv/src/nsUTF8ToUnicode.cpp&rev=1.8&mark=83#78 The assumption is that in the worst case the number of UTF-16 code units in the output can be no greater than the number of UTF-8 code units in the input. In the testcase that assumption doesn't hold: the buffer from offset 0x0000 to offset 0x0FFF ends with 0xE4, which is assumed to be the first octet of a multi-octet sequence and is not decoded. Since the buffer from offset 0x1000 to 0x1FFF begins with an ASCII space character, the sequence turns out to be invalid and is decoded to a replacement character, but the space character is treated as valid input [1]. The second buffer thus contains 4096 UTF-8 code units (all in the ASCII range) which decode to 4096 UTF-16 code units plus one more for the replacement character. [1] One could argue that this behaviour is incorrect, and the two characters following 0xe4 should be treated as part of the illegal sequence and ignored, but let's gloss over that issue for now.
Status: NEW → ASSIGNED
Reporter | ||
Comment 3•19 years ago
|
||
Sounds like a good explanation. I just did a quick test: Inserting one 2-byte character somewhere inside 0x1000-0x1fff is enough to let the missing character re-appear. With the 2-byte character, the total number of UTF-16 units is back at 4096 and GetMaxLength is correct.
Assignee | ||
Comment 4•19 years ago
|
||
Changing GetMaxLength() to |*aDestLength = aSrcLength + 1;| also makes the missing character reappear. I'll attach a patch shortly.
Assignee | ||
Comment 5•19 years ago
|
||
Attachment #190396 -
Flags: superreview?(bzbarsky)
Attachment #190396 -
Flags: review?(jshin1987)
Updated•19 years ago
|
Attachment #190396 -
Flags: superreview?(bzbarsky) → superreview+
Updated•19 years ago
|
Attachment #190396 -
Flags: review?(jshin1987) → review+
Assignee | ||
Comment 6•19 years ago
|
||
Comment on attachment 190396 [details] [diff] [review] Patch Asking approval. This is a minimum-risk patch. The error will rarely occur in practice but it's ugly when it does.
Attachment #190396 -
Flags: approval1.8b4?
Assignee | ||
Comment 7•19 years ago
|
||
Sorry, if I had remembered that this causes a crash, I would have phrased my approval request much more emphatically.
Updated•19 years ago
|
Attachment #190396 -
Flags: approval1.8b4? → approval1.8b4+
Assignee | ||
Comment 8•19 years ago
|
||
Checked in.
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•