Closed Bug 746900 Opened 9 years ago Closed 8 years ago

UTF-8 decoder: Implement "Best Practices for Using U+FFFD" from the Unicode standard

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla20

People

(Reporter: annevk, Assigned: emk)

References

(Blocks 1 open bug, )

Details

Attachments

(1 file, 2 obsolete files)

Per Unicode the sequence F8 80 80 80 80 80 should yield five U+FFFD code points. 5/6 byte sequences should not be recognized. IE/Safari/Chrome already do this.

I'm updating the Encoding Standard to match and filed a bug on Opera too.
Do we have a bug for "Rewrite Unicode converters to comply with the Encoding Standard" yet?

Our converters are sad in many ways. In particular, our decoder API has no way to signal the end of stream to flush trailing errors into output as the REPLACEMENT CHARACTER.
(In reply to Henri Sivonen (:hsivonen) from comment #1)
> Do we have a bug for "Rewrite Unicode converters to comply with the Encoding
> Standard" yet?

Filed bug 746911.
OS: Mac OS X → All
Hardware: x86 → All
(In reply to Anne van Kesteren from comment #0)
> Per Unicode the sequence F8 80 80 80 80 80 should yield five U+FFFD code
> points.

Don't you mean "the sequence F8 80 80 80 80 should yield five U+FFFD code points, and the sequence F8 80 80 80 80 80 should yield six U+FFFD code points"?
I'm wrong too -- make that "the sequence F8 80 80 80 80 should yield five U+FFFD code points, and the sequence FC 80 80 80 80 80 should yield six U+FFFD code points"
I don't mind doing this for the sake of interoperability, but I don't in fact see where the Unicode Standard requires it.

The section "Constraints on Conversion Processes" on page 96 of http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf says 

|Although a UTF-8 conversion process is required to never consume 
|well-formed subsequences as part of its error handling for ill-formed 
|subsequences, such a process is not otherwise constrained in how it 
|deals with any ill-formed subsequence itself. An ill-formed subsequence 
|consisting of more than one code unit could be treated as a single 
|error or as multiple errors. For example, in processing the UTF-8 code 
|unit sequence <F0 80 80 41>, the only formal requirement mandated by 
|Unicode conformance for a converter is that the <41> be processed and 
|correctly interpreted as <U+0041>. The converter could return 
|<U+FFFD, U+0041>, handling <F0 80 80> as a single error, or 
|<U+FFFD, U+FFFD, U+FFFD, U+0041>, handling each byte of <F0 80 80> as a 
|separate error, or could take other approaches to signalling <F0 80 80> 
|as an ill-formed code unit subsequence.
I've posted a question to ietf-charsets.
Attached patch Patch (obsolete) — Splinter Review
Assignee: nobody → smontagu
Attachment #616643 - Flags: review?(VYV03354)
Comment on attachment 616643 [details] [diff] [review]
Patch

>-      } else if (0xC0 == (0xE0 & (*in))) {
>-        // First octet of 2 octet sequence
>+      } else if (0xC0 == (0xE0 & (*in)) && (unsigned char)*in > 0xC1) {
>+        // First octet of 2 octet sequence (excluding illegal values 0xC0/0xC1)
>-      } else if (0xF0 == (0xF8 & (*in))) {
>-        // First octet of 4 octet sequence
>+      } else if (0xF0 == (0xF8 & (*in)) && (unsigned char)*in < 0xF5) {
>+        // First octet of 4 octet sequence in the legal range 0xF0-0xF4

 } else if ((unsigned char)*in < 0xC2) {
   res = NS_ERROR_ILLEGAL_INPUT;
 } else if ((unsigned char)*in < 0xE0) {
   ...
 } else if ((unsigned char)*in < 0xF0) {
   ...
 } else if ((unsigned char)*in < 0xF5) {
   ...
 } else {
   res = NS_ERROR_ILLEGAL_INPUT;
 }
would have fewer number of expressions and it would be much easier to understand.

>+    for (var i = 0; i < inStrings3.length; ++i) {
>+	var inStr = inStrings3[i];
>+	testCaseInputStream(inStr, expected3);
>+    }
>+    for (var i = 0; i < inStrings4.length; ++i) {
>+	var inStr = inStrings4[i];
>+	testCaseInputStream(inStr, expected4);
>+    }
>+    for (var i = 0; i < inStrings5.length; ++i) {
>+	var inStr = inStrings5[i];
>+	testCaseInputStream(inStr, expected5);
>+    }
>+    for (var i = 0; i < inStrings6.length; ++i) {
>+	var inStr = inStrings6[i];
>+	testCaseInputStream(inStr, expected6);
>+    }
Could you change inStrings* and expected* to one array? e.g.
 [
  { inStrings: ["%e0%80%af","%f0%80%80%af",...], expected: "ABC\ufffdXYZ" },
  { inStrings: [...], expected: "ABC\ufffd\ufffdXYZ" },
  ...
 ];
If I read the spec correctly, the current patch doesn't comply the spec yet.
Per utf-8 decoding algorithm,
http://dvcs.w3.org/hg/encoding/raw-file/d3ea478b3c73/Overview.html#utf-8-decoder
byte sequences "\xC0\x80" and "[\xF5-\xF7]\x80\x80\x80" will emit only one replacement character. For example:
C0: Increment the byte pointer (Step 4), set utf-8 bytes needed to 1, utf-8 lower boundary to 0x80, and utf-8 code point to 0 and continue (Step 5).
80: Increment the byte pointer (Step 4), increment utf-8 bytes seen and set utf-8 code point to 0 (Step 7), Let code point be 0 and lower boundary be 0x80 (Step 9), and emit decoder error (Step 12).
Is this an intentional behavior or a spec error?
(In reply to Masatoshi Kimura [:emk] from comment #9)
> Is this an intentional behavior or a spec error?

I think Anne's standard is inconsistent here: 0xC0, 0xC1, and 0xF5-0xFF are all disallowed in UTF-8 and I think they should be treated the same way. 

However, my patch is still not interoperable with Webkit and Trident: AFAICT they always decode each byte in any invalid sequence as U+FFFD, including overlong encodings, incomplete sequences, codepoints over U+10FFFF and surrogate characters. Examples:

c0 af             => U+FFFD U+FFFD
e0 80 af          => U+FFFD U+FFFD U+FFFD

e8 80             => U+FFFD U+FFFD
f0 80 80          => U+FFFD U+FFFD U+FFFD

f4 90 80 80       => U+FFFD U+FFFD U+FFFD U+FFFD

ed a0 80          => U+FFFD U+FFFD U+FFFD

We would emit a single U+FFFD in all these cases, even with the patch here, and IINM so would Anne's decoding algorithm
I have updated the ranges. I think it is better behavior to emit a single U+FFFD personally. That is how errors are handled everywhere else and I think the Unicode standard recommends (though not requires) that behavior as well.
Comment on attachment 616643 [details] [diff] [review]
Patch

UTF-8 decoder has been changed again to comply with “Best Practices for Using U+FFFD” from the Unicode standard.
Attachment #616643 - Flags: review?(VYV03354)
Encoder change was required to pass a test such as decode(encode("\ud800")) == "\ufffd". This change is consistent with the WebIDL "convert a DOMString to a sequence of Unicode characters" algorithm.
Assignee: smontagu → VYV03354
Attachment #616643 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #682757 - Flags: review?(smontagu)
Reducing the number of compares when the first byte is in the range C2..DF
Attachment #682757 - Attachment is obsolete: true
Attachment #682757 - Flags: review?(smontagu)
Attachment #682782 - Flags: review?(smontagu)
Comment on attachment 682782 [details] [diff] [review]
Implement "Best Practices for Using U+FFFD" from the Unicode standard

Review of attachment 682782 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. I think we now have interoperable error handling in all cases except for the distinction between incomplete sequences that could or could not be the start of a legal sequence. 

E.g. with this patch 0xe0 0x80 is decoded as two U+FFFDs, but 0xe0 0xa0 is decoded as a single U+FFFD, and this is correct according to the "Best Practices", but IE and Chrome (Version 22.0.1229.94) decode both of them as two U+FFFDs. Opera 12.11 decodes both of them as one U+FFFD.
Attachment #682782 - Flags: review?(smontagu) → review+
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/d9da3418be9d
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
Summary: utf-8 decoder is not compliant → UTF-8 decoder: Implement "Best Practices for Using U+FFFD" from the Unicode standard
You need to log in before you can comment on or make changes to this bug.