Closed Bug 746900 Opened 13 years ago Closed 13 years ago

UTF-8 decoder: Implement "Best Practices for Using U+FFFD" from the Unicode standard

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla20

People

(Reporter: annevk, Assigned: emk)

References

(Blocks 1 open bug,
URL
)

Details

Attachments

(1 file, 2 obsolete files)

Patch 13 years ago Simon Montagu :smontagu 6.30 KB, patch		Details \| Diff \| Splinter Review
Implement "Best Practices for Using U+FFFD" from the Unicode standard 13 years ago Masatoshi Kimura [:emk] 14.56 KB, patch		Details \| Diff \| Splinter Review
Implement "Best Practices for Using U+FFFD" from the Unicode standard 13 years ago Masatoshi Kimura [:emk] 14.73 KB, patch	smontagu : review+	Details \| Diff \| Splinter Review

Anne (:annevk)

Reporter

Description

•

13 years ago

Per Unicode the sequence F8 80 80 80 80 80 should yield five U+FFFD code points. 5/6 byte sequences should not be recognized. IE/Safari/Chrome already do this. I'm updating the Encoding Standard to match and filed a bug on Opera too.

Henri Sivonen (:hsivonen) (on vacation)

Comment 1

•

13 years ago

Do we have a bug for "Rewrite Unicode converters to comply with the Encoding Standard" yet? Our converters are sad in many ways. In particular, our decoder API has no way to signal the end of stream to flush trailing errors into output as the REPLACEMENT CHARACTER.

Henri Sivonen (:hsivonen) (on vacation)

Updated

•

13 years ago

Blocks: encoding

Henri Sivonen (:hsivonen) (on vacation)

Comment 2

•

13 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #1) > Do we have a bug for "Rewrite Unicode converters to comply with the Encoding > Standard" yet? Filed bug 746911.

Henri Sivonen (:hsivonen) (on vacation)

Updated

•

13 years ago

OS: Mac OS X → All

Hardware: x86 → All

Simon Montagu :smontagu

Comment 3

•

13 years ago

(In reply to Anne van Kesteren from comment #0) > Per Unicode the sequence F8 80 80 80 80 80 should yield five U+FFFD code > points. Don't you mean "the sequence F8 80 80 80 80 should yield five U+FFFD code points, and the sequence F8 80 80 80 80 80 should yield six U+FFFD code points"?

Simon Montagu :smontagu

Comment 4

•

13 years ago

I'm wrong too -- make that "the sequence F8 80 80 80 80 should yield five U+FFFD code points, and the sequence FC 80 80 80 80 80 should yield six U+FFFD code points"

Simon Montagu :smontagu

Comment 5

•

13 years ago

I don't mind doing this for the sake of interoperability, but I don't in fact see where the Unicode Standard requires it. The section "Constraints on Conversion Processes" on page 96 of http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf says |Although a UTF-8 conversion process is required to never consume |well-formed subsequences as part of its error handling for ill-formed |subsequences, such a process is not otherwise constrained in how it |deals with any ill-formed subsequence itself. An ill-formed subsequence |consisting of more than one code unit could be treated as a single |error or as multiple errors. For example, in processing the UTF-8 code |unit sequence <F0 80 80 41>, the only formal requirement mandated by |Unicode conformance for a converter is that the <41> be processed and |correctly interpreted as <U+0041>. The converter could return |<U+FFFD, U+0041>, handling <F0 80 80> as a single error, or |<U+FFFD, U+FFFD, U+FFFD, U+0041>, handling each byte of <F0 80 80> as a |separate error, or could take other approaches to signalling <F0 80 80> |as an ill-formed code unit subsequence.

Masatoshi Kimura [:emk]

Assignee

Comment 6

•

13 years ago

I've posted a question to ietf-charsets.

Simon Montagu :smontagu

Comment 7

•

13 years ago

Attached patch Patch (obsolete) — Details — Splinter Review

Assignee: nobody → smontagu

Attachment #616643 - Flags: review?(VYV03354)

Masatoshi Kimura [:emk]

Assignee

Comment 8

•

13 years ago

Comment on attachment 616643 [details] [diff] [review] Patch >- } else if (0xC0 == (0xE0 & (*in))) { >- // First octet of 2 octet sequence >+ } else if (0xC0 == (0xE0 & (*in)) && (unsigned char)*in > 0xC1) { >+ // First octet of 2 octet sequence (excluding illegal values 0xC0/0xC1) >- } else if (0xF0 == (0xF8 & (*in))) { >- // First octet of 4 octet sequence >+ } else if (0xF0 == (0xF8 & (*in)) && (unsigned char)*in < 0xF5) { >+ // First octet of 4 octet sequence in the legal range 0xF0-0xF4 } else if ((unsigned char)*in < 0xC2) { res = NS_ERROR_ILLEGAL_INPUT; } else if ((unsigned char)*in < 0xE0) { ... } else if ((unsigned char)*in < 0xF0) { ... } else if ((unsigned char)*in < 0xF5) { ... } else { res = NS_ERROR_ILLEGAL_INPUT; } would have fewer number of expressions and it would be much easier to understand. >+ for (var i = 0; i < inStrings3.length; ++i) { >+ var inStr = inStrings3[i]; >+ testCaseInputStream(inStr, expected3); >+ } >+ for (var i = 0; i < inStrings4.length; ++i) { >+ var inStr = inStrings4[i]; >+ testCaseInputStream(inStr, expected4); >+ } >+ for (var i = 0; i < inStrings5.length; ++i) { >+ var inStr = inStrings5[i]; >+ testCaseInputStream(inStr, expected5); >+ } >+ for (var i = 0; i < inStrings6.length; ++i) { >+ var inStr = inStrings6[i]; >+ testCaseInputStream(inStr, expected6); >+ } Could you change inStrings* and expected* to one array? e.g. [ { inStrings: ["%e0%80%af","%f0%80%80%af",...], expected: "ABC\ufffdXYZ" }, { inStrings: [...], expected: "ABC\ufffd\ufffdXYZ" }, ... ];

Masatoshi Kimura [:emk]

Assignee

Comment 9

•

13 years ago

If I read the spec correctly, the current patch doesn't comply the spec yet. Per utf-8 decoding algorithm, http://dvcs.w3.org/hg/encoding/raw-file/d3ea478b3c73/Overview.html#utf-8-decoder byte sequences "\xC0\x80" and "[\xF5-\xF7]\x80\x80\x80" will emit only one replacement character. For example: C0: Increment the byte pointer (Step 4), set utf-8 bytes needed to 1, utf-8 lower boundary to 0x80, and utf-8 code point to 0 and continue (Step 5). 80: Increment the byte pointer (Step 4), increment utf-8 bytes seen and set utf-8 code point to 0 (Step 7), Let code point be 0 and lower boundary be 0x80 (Step 9), and emit decoder error (Step 12). Is this an intentional behavior or a spec error?

Simon Montagu :smontagu

Comment 10

•

13 years ago

(In reply to Masatoshi Kimura [:emk] from comment #9) > Is this an intentional behavior or a spec error? I think Anne's standard is inconsistent here: 0xC0, 0xC1, and 0xF5-0xFF are all disallowed in UTF-8 and I think they should be treated the same way. However, my patch is still not interoperable with Webkit and Trident: AFAICT they always decode each byte in any invalid sequence as U+FFFD, including overlong encodings, incomplete sequences, codepoints over U+10FFFF and surrogate characters. Examples: c0 af => U+FFFD U+FFFD e0 80 af => U+FFFD U+FFFD U+FFFD e8 80 => U+FFFD U+FFFD f0 80 80 => U+FFFD U+FFFD U+FFFD f4 90 80 80 => U+FFFD U+FFFD U+FFFD U+FFFD ed a0 80 => U+FFFD U+FFFD U+FFFD We would emit a single U+FFFD in all these cases, even with the patch here, and IINM so would Anne's decoding algorithm

Anne (:annevk)

Reporter

Comment 11

•

13 years ago

I have updated the ranges. I think it is better behavior to emit a single U+FFFD personally. That is how errors are handled everywhere else and I think the Unicode standard recommends (though not requires) that behavior as well.

Masatoshi Kimura [:emk]

Assignee

Comment 12

•

13 years ago

Comment on attachment 616643 [details] [diff] [review] Patch UTF-8 decoder has been changed again to comply with “Best Practices for Using U+FFFD” from the Unicode standard.

Attachment #616643 - Flags: review?(VYV03354)

Masatoshi Kimura [:emk]

Assignee

Comment 13

•

13 years ago

Attached patch Implement "Best Practices for Using U+FFFD" from the Unicode standard (obsolete) — Details — Splinter Review

Encoder change was required to pass a test such as decode(encode("\ud800")) == "\ufffd". This change is consistent with the WebIDL "convert a DOMString to a sequence of Unicode characters" algorithm.

Assignee: smontagu → VYV03354

Attachment #616643 - Attachment is obsolete: true

Status: NEW → ASSIGNED

Attachment #682757 - Flags: review?(smontagu)

Masatoshi Kimura [:emk]

Assignee

Comment 14

•

13 years ago

https://tbpl.mozilla.org/?tree=Try&rev=cdbe5f959e70

Masatoshi Kimura [:emk]

Assignee

Comment 15

•

13 years ago

Attached patch Implement "Best Practices for Using U+FFFD" from the Unicode standard — Details — Splinter Review

Reducing the number of compares when the first byte is in the range C2..DF

Attachment #682757 - Attachment is obsolete: true

Attachment #682757 - Flags: review?(smontagu)

Attachment #682782 - Flags: review?(smontagu)

Masatoshi Kimura [:emk]

Assignee

Comment 16

•

13 years ago

https://tbpl.mozilla.org/?tree=Try&rev=f7abb8908e02

Masatoshi Kimura [:emk]

Assignee

Updated

•

13 years ago

See Also: → https://www.w3.org/Bugs/Public/show_bug.cgi?id=19938

Simon Montagu :smontagu

Comment 17

•

13 years ago

Comment on attachment 682782 [details] [diff] [review] Implement "Best Practices for Using U+FFFD" from the Unicode standard Review of attachment 682782 [details] [diff] [review]: ----------------------------------------------------------------- Looks good. I think we now have interoperable error handling in all cases except for the distinction between incomplete sequences that could or could not be the start of a legal sequence. E.g. with this patch 0xe0 0x80 is decoded as two U+FFFDs, but 0xe0 0xa0 is decoded as a single U+FFFD, and this is correct according to the "Best Practices", but IE and Chrome (Version 22.0.1229.94) decode both of them as two U+FFFDs. Opera 12.11 decodes both of them as one U+FFFD.

Attachment #682782 - Flags: review?(smontagu) → review+

Masatoshi Kimura [:emk]

Assignee

Updated

•

13 years ago

Keywords: checkin-needed

(PTO June 19-July 7) Ryan VanderMeulen

Comment 18

•

13 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/d9da3418be9d

Flags: in-testsuite+

Keywords: checkin-needed

Ed Morley [:emorley]

Comment 19

•

13 years ago

https://hg.mozilla.org/mozilla-central/rev/d9da3418be9d

Status: ASSIGNED → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla20

Jesse Ruderman

Updated

•

12 years ago

Summary: utf-8 decoder is not compliant → UTF-8 decoder: Implement "Best Practices for Using U+FFFD" from the Unicode standard

You need to log in before you can comment on or make changes to this bug.