Closed Bug 301797 Opened 19 years ago Closed 19 years ago

UTF-8 decoder drops byte on encoding error

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: udo_richter, Assigned: smontagu)

References

Details

Attachments

(2 files)

testcase html 19 years ago Udo Richter 8.14 KB, text/html		Details
Patch 19 years ago Simon Montagu :smontagu 1.91 KB, patch	jshin1987 : review+ bzbarsky : superreview+ benjamin : approval1.8b4+	Details \| Diff \| Splinter Review

Udo Richter

Reporter

Description

•

19 years ago

On certain UTF-8 documents with encoding errors, the UTF-8 decoder gets confused
and drops one byte exactly 4096 bytes later. The encoding error has to be on a
4096-byte-boundary too. Most probably this a buffer handling error. This affects
XML and HTML files. Testcase follows.

This triggers bug 280442 - A bad downloads.rdf file causes firefox to crash on
download

Some related references:
Bug 174351 - Encoding errors aren't treated as fatal XML errors
bug 231267 - Charset converters should always notify callers of encoding errors

Udo Richter

Reporter

Comment 1

•

19 years ago

Attached file testcase html — Details

Simon Montagu :smontagu

Assignee

Comment 2

•

19 years ago

I still need to confirm it, but I think the problem is here:
http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/intl/uconv/src/nsUTF8ToUnicode.cpp&rev=1.8&mark=83#78

The assumption is that in the worst case the number of UTF-16 code units in the
output can be no greater than the number of UTF-8 code units in the input. In
the testcase that assumption doesn't hold: the buffer from offset 0x0000 to
offset 0x0FFF ends with 0xE4, which is assumed to be the first octet of a
multi-octet sequence and is not decoded. Since the buffer from offset 0x1000 to
0x1FFF begins with an ASCII space character, the sequence turns out to be
invalid and is decoded to a replacement character, but the space character is
treated as valid input [1]. The second buffer thus contains 4096 UTF-8 code
units (all in the ASCII range) which decode to 4096 UTF-16 code units plus one
more for the replacement character.

[1] One could argue that this behaviour is incorrect, and the two characters
following 0xe4 should be treated as part of the illegal sequence and ignored,
but let's gloss over that issue for now.

Status: NEW → ASSIGNED

Udo Richter

Reporter

Comment 3

•

19 years ago

Sounds like a good explanation. I just did a quick test: Inserting one 2-byte
character somewhere inside 0x1000-0x1fff is enough to let the missing character
re-appear. With the 2-byte character, the total number of UTF-16 units is back
at 4096 and GetMaxLength is correct.

Simon Montagu :smontagu

Assignee

Comment 4

•

19 years ago

Changing GetMaxLength() to |*aDestLength = aSrcLength + 1;| also makes the
missing character reappear. I'll attach a patch shortly.

Simon Montagu :smontagu

Assignee

Comment 5

•

19 years ago

Attached patch Patch — Details — Splinter Review

Attachment #190396 - Flags: superreview?(bzbarsky)

Attachment #190396 - Flags: review?(jshin1987)

Boris Zbarsky [:bzbarsky]

Updated

•

19 years ago

Attachment #190396 - Flags: superreview?(bzbarsky) → superreview+

Jungshik Shin

Updated

•

19 years ago

Attachment #190396 - Flags: review?(jshin1987) → review+

Caleb

Updated

•

19 years ago

Blocks: 280442

Simon Montagu :smontagu

Assignee

Comment 6

•

19 years ago

Comment on attachment 190396 [details] [diff] [review]
Patch

Asking approval. This is a minimum-risk patch. The error will rarely occur in
practice but it's ugly when it does.

Attachment #190396 - Flags: approval1.8b4?

Simon Montagu :smontagu

Assignee

Comment 7

•

19 years ago

Sorry, if I had remembered that this causes a crash, I would have phrased my
approval request much more emphatically.

Benjamin Smedberg

Updated

•

19 years ago

Attachment #190396 - Flags: approval1.8b4? → approval1.8b4+

Simon Montagu :smontagu

Assignee

Comment 8

•

19 years ago

Checked in.

Status: ASSIGNED → RESOLVED

Closed: 19 years ago

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Quick Search

UTF-8 decoder drops byte on encoding error

Categories

(Core :: Internationalization, defect)

Tracking

()

People

(Reporter: udo_richter, Assigned: smontagu)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(2 files)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Updated

Updated

Updated

Comment 6

Comment 7

Updated

Comment 8

Attachment

General

Description

File Name

Content Type