Closed
Bug 746900
Opened 13 years ago
Closed 13 years ago
UTF-8 decoder: Implement "Best Practices for Using U+FFFD" from the Unicode standard
Categories
(Core :: Internationalization, defect)
Core
Internationalization
Tracking
()
RESOLVED
FIXED
mozilla20
People
(Reporter: annevk, Assigned: emk)
References
(Blocks 1 open bug, )
Details
Attachments
(1 file, 2 obsolete files)
14.73 KB,
patch
|
smontagu
:
review+
|
Details | Diff | Splinter Review |
Per Unicode the sequence F8 80 80 80 80 80 should yield five U+FFFD code points. 5/6 byte sequences should not be recognized. IE/Safari/Chrome already do this.
I'm updating the Encoding Standard to match and filed a bug on Opera too.
Comment 1•13 years ago
|
||
Do we have a bug for "Rewrite Unicode converters to comply with the Encoding Standard" yet?
Our converters are sad in many ways. In particular, our decoder API has no way to signal the end of stream to flush trailing errors into output as the REPLACEMENT CHARACTER.
Comment 2•13 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #1)
> Do we have a bug for "Rewrite Unicode converters to comply with the Encoding
> Standard" yet?
Filed bug 746911.
Updated•13 years ago
|
OS: Mac OS X → All
Hardware: x86 → All
Comment 3•13 years ago
|
||
(In reply to Anne van Kesteren from comment #0)
> Per Unicode the sequence F8 80 80 80 80 80 should yield five U+FFFD code
> points.
Don't you mean "the sequence F8 80 80 80 80 should yield five U+FFFD code points, and the sequence F8 80 80 80 80 80 should yield six U+FFFD code points"?
Comment 4•13 years ago
|
||
I'm wrong too -- make that "the sequence F8 80 80 80 80 should yield five U+FFFD code points, and the sequence FC 80 80 80 80 80 should yield six U+FFFD code points"
Comment 5•13 years ago
|
||
I don't mind doing this for the sake of interoperability, but I don't in fact see where the Unicode Standard requires it.
The section "Constraints on Conversion Processes" on page 96 of http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf says
|Although a UTF-8 conversion process is required to never consume
|well-formed subsequences as part of its error handling for ill-formed
|subsequences, such a process is not otherwise constrained in how it
|deals with any ill-formed subsequence itself. An ill-formed subsequence
|consisting of more than one code unit could be treated as a single
|error or as multiple errors. For example, in processing the UTF-8 code
|unit sequence <F0 80 80 41>, the only formal requirement mandated by
|Unicode conformance for a converter is that the <41> be processed and
|correctly interpreted as <U+0041>. The converter could return
|<U+FFFD, U+0041>, handling <F0 80 80> as a single error, or
|<U+FFFD, U+FFFD, U+FFFD, U+0041>, handling each byte of <F0 80 80> as a
|separate error, or could take other approaches to signalling <F0 80 80>
|as an ill-formed code unit subsequence.
Assignee | ||
Comment 6•13 years ago
|
||
I've posted a question to ietf-charsets.
Comment 7•13 years ago
|
||
Assignee: nobody → smontagu
Attachment #616643 -
Flags: review?(VYV03354)
Assignee | ||
Comment 8•13 years ago
|
||
Comment on attachment 616643 [details] [diff] [review]
Patch
>- } else if (0xC0 == (0xE0 & (*in))) {
>- // First octet of 2 octet sequence
>+ } else if (0xC0 == (0xE0 & (*in)) && (unsigned char)*in > 0xC1) {
>+ // First octet of 2 octet sequence (excluding illegal values 0xC0/0xC1)
>- } else if (0xF0 == (0xF8 & (*in))) {
>- // First octet of 4 octet sequence
>+ } else if (0xF0 == (0xF8 & (*in)) && (unsigned char)*in < 0xF5) {
>+ // First octet of 4 octet sequence in the legal range 0xF0-0xF4
} else if ((unsigned char)*in < 0xC2) {
res = NS_ERROR_ILLEGAL_INPUT;
} else if ((unsigned char)*in < 0xE0) {
...
} else if ((unsigned char)*in < 0xF0) {
...
} else if ((unsigned char)*in < 0xF5) {
...
} else {
res = NS_ERROR_ILLEGAL_INPUT;
}
would have fewer number of expressions and it would be much easier to understand.
>+ for (var i = 0; i < inStrings3.length; ++i) {
>+ var inStr = inStrings3[i];
>+ testCaseInputStream(inStr, expected3);
>+ }
>+ for (var i = 0; i < inStrings4.length; ++i) {
>+ var inStr = inStrings4[i];
>+ testCaseInputStream(inStr, expected4);
>+ }
>+ for (var i = 0; i < inStrings5.length; ++i) {
>+ var inStr = inStrings5[i];
>+ testCaseInputStream(inStr, expected5);
>+ }
>+ for (var i = 0; i < inStrings6.length; ++i) {
>+ var inStr = inStrings6[i];
>+ testCaseInputStream(inStr, expected6);
>+ }
Could you change inStrings* and expected* to one array? e.g.
[
{ inStrings: ["%e0%80%af","%f0%80%80%af",...], expected: "ABC\ufffdXYZ" },
{ inStrings: [...], expected: "ABC\ufffd\ufffdXYZ" },
...
];
Assignee | ||
Comment 9•13 years ago
|
||
If I read the spec correctly, the current patch doesn't comply the spec yet.
Per utf-8 decoding algorithm,
http://dvcs.w3.org/hg/encoding/raw-file/d3ea478b3c73/Overview.html#utf-8-decoder
byte sequences "\xC0\x80" and "[\xF5-\xF7]\x80\x80\x80" will emit only one replacement character. For example:
C0: Increment the byte pointer (Step 4), set utf-8 bytes needed to 1, utf-8 lower boundary to 0x80, and utf-8 code point to 0 and continue (Step 5).
80: Increment the byte pointer (Step 4), increment utf-8 bytes seen and set utf-8 code point to 0 (Step 7), Let code point be 0 and lower boundary be 0x80 (Step 9), and emit decoder error (Step 12).
Is this an intentional behavior or a spec error?
Comment 10•13 years ago
|
||
(In reply to Masatoshi Kimura [:emk] from comment #9)
> Is this an intentional behavior or a spec error?
I think Anne's standard is inconsistent here: 0xC0, 0xC1, and 0xF5-0xFF are all disallowed in UTF-8 and I think they should be treated the same way.
However, my patch is still not interoperable with Webkit and Trident: AFAICT they always decode each byte in any invalid sequence as U+FFFD, including overlong encodings, incomplete sequences, codepoints over U+10FFFF and surrogate characters. Examples:
c0 af => U+FFFD U+FFFD
e0 80 af => U+FFFD U+FFFD U+FFFD
e8 80 => U+FFFD U+FFFD
f0 80 80 => U+FFFD U+FFFD U+FFFD
f4 90 80 80 => U+FFFD U+FFFD U+FFFD U+FFFD
ed a0 80 => U+FFFD U+FFFD U+FFFD
We would emit a single U+FFFD in all these cases, even with the patch here, and IINM so would Anne's decoding algorithm
Reporter | ||
Comment 11•13 years ago
|
||
I have updated the ranges. I think it is better behavior to emit a single U+FFFD personally. That is how errors are handled everywhere else and I think the Unicode standard recommends (though not requires) that behavior as well.
Assignee | ||
Comment 12•13 years ago
|
||
Comment on attachment 616643 [details] [diff] [review]
Patch
UTF-8 decoder has been changed again to comply with “Best Practices for Using U+FFFD” from the Unicode standard.
Attachment #616643 -
Flags: review?(VYV03354)
Assignee | ||
Comment 13•13 years ago
|
||
Encoder change was required to pass a test such as decode(encode("\ud800")) == "\ufffd". This change is consistent with the WebIDL "convert a DOMString to a sequence of Unicode characters" algorithm.
Assignee: smontagu → VYV03354
Attachment #616643 -
Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #682757 -
Flags: review?(smontagu)
Assignee | ||
Comment 14•13 years ago
|
||
Assignee | ||
Comment 15•13 years ago
|
||
Reducing the number of compares when the first byte is in the range C2..DF
Attachment #682757 -
Attachment is obsolete: true
Attachment #682757 -
Flags: review?(smontagu)
Attachment #682782 -
Flags: review?(smontagu)
Assignee | ||
Comment 16•13 years ago
|
||
Assignee | ||
Updated•13 years ago
|
Comment 17•13 years ago
|
||
Comment on attachment 682782 [details] [diff] [review]
Implement "Best Practices for Using U+FFFD" from the Unicode standard
Review of attachment 682782 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good. I think we now have interoperable error handling in all cases except for the distinction between incomplete sequences that could or could not be the start of a legal sequence.
E.g. with this patch 0xe0 0x80 is decoded as two U+FFFDs, but 0xe0 0xa0 is decoded as a single U+FFFD, and this is correct according to the "Best Practices", but IE and Chrome (Version 22.0.1229.94) decode both of them as two U+FFFDs. Opera 12.11 decodes both of them as one U+FFFD.
Attachment #682782 -
Flags: review?(smontagu) → review+
Assignee | ||
Updated•13 years ago
|
Keywords: checkin-needed
Comment 18•13 years ago
|
||
Flags: in-testsuite+
Keywords: checkin-needed
Comment 19•13 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
Updated•12 years ago
|
Summary: utf-8 decoder is not compliant → UTF-8 decoder: Implement "Best Practices for Using U+FFFD" from the Unicode standard
You need to log in
before you can comment on or make changes to this bug.
Description
•