Open Bug 199090 Opened 22 years ago Updated 2 years ago

UTF8trait and related codes need to be update to reflect 'the' new definition of UTF-8

Categories

(Core :: XPCOM, defect, P3)

defect

Tracking

()

People

(Reporter: jshin1987, Unassigned)

References

Details

(Keywords: intl)

Attachments

(2 files, 1 obsolete file)

This is a spin-off of bug 191541. (see also bug 172701) |UTF8trait| and a few functions (|UTF8InputStream::CountValidUTF8Bytes|, |CalculateUTF8Length|, |ConvertUTF8toUCS2|) that make use of it haven't been updated to the 'new' definition of UTF-8. Unicode wouldn't support codepoints beyond 0x10FFFFF (20.1 bit) and ISO/IEC committed itself not to assign a character beyond that to get ISO 10646 in sync with Unicode. IETF RFC is also being revised to get its definition of UTF-8 in sync (see http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-04.txt). Things that (may) have to be done: - remove Is5byte and Is6byte - adjust Is4byte to exclude the beg. of 4byte UTF-8 sequence for codepoints above 0x10FFFF (plane 17 and higher) - adjust Is2byte to exclude 0xC0 and 0xC1 which begin overlong UTF-8 sequences for U+0000 - U+007F (e.g. 0xC0 0x80 is an overlong representation of U+0000 for which we can just use 0x00) - modify three functions mentioned above accordingly. Things to consider: - performance ramification : Is2byte and Is4byte will get more complicated (on the other hand, ConvertUTF8toUCS2 can get a bit simpler) - would we never utilize '10 or so' bits (5byte and 6byte UTF-8) for 'out-of-band' information? probably not as long as the internal string representation remains UTF-16 which does not offer a way to access codepoints beyond 0x10FFFF - security issue?? (bug 172701)
Attached patch a patch Splinter Review
We need to do a little more tweaking later, but in the meantime this will save us some cycles by excluding 5-6 byte UTF-8 sequences.
Attached patch patch v2 (obsolete) — Splinter Review
This patch is more aggressive than attachment 127987 [details] [diff] [review], but can be slower than that. dbaron, do you think perf. loss is significant with extra conditions in this patch?
Attached patch patch v3Splinter Review
UTF8-ness check is equivalent to |IsUTF8|.
Attachment #127990 - Attachment is obsolete: true
attachment 127999 [details] [diff] [review] is 'more' correct than attachment 127987 [details] [diff] [review], but we may get away with attachment 127987 [details] [diff] [review] if perf. diff. is too significant. any comment?
Status: NEW → ASSIGNED
Comment on attachment 127999 [details] [diff] [review] patch v3 > class UTF8traits >- static PRBool is2byte(char c) { return (c & 0xE0) == 0xC0; } >+ static PRBool is2byte(char c) >+ { return ((c & 0xE0) == 0xC0) && (PRUint8(c) > 0xC1); } This seems odd to me. Does the new UTF-8 standard really say that the behavior has to be different for this one case of overlong sequences? Why should we be treating it differently from the rest of the |minUcs4| handling? > static PRBool is3byte(char c) { return (c & 0xF0) == 0xE0; } >- static PRBool is4byte(char c) { return (c & 0xF8) == 0xF0; } >+ static PRBool is4byte(char c) >+ { return ((c & 0xF8) == 0xF0) && (PRUint8(c) < 0xF5); } Likewise, this seems like rather odd error handling since it will only catch things that are greater than or equal to 0x00140000, not 0x00110000. (In both cases it's the difference between putting in a single 0xFFFD and ignoring the whole string.) > else if ( ucs4 >= PLANE1_BASE ) > { >+ // we still need to exclude the range [0x00110000, 0x00110fff] Yes, but you're excluding it in a very different way (ignoring the whole rest of the string versus a singe 0xFFFD). > if ( ucs4 >= 0x00110000 ) > *out++ = UCS2_REPLACEMENT_CHAR; > else { I guess I should probably look at the new version of UTF-8 myself rather than speculating... Also, are you going to change nsUTF8ToUnicode and nsUnicodeToUTF8?
Please, note that UTF8traits are not only for ConvertUTF8toUTF16 but is also used in a couple of other places. You're right that ConvertUTF8toUTF16 does have its own error handling and range checking. Anyway, I put up two versions, the first of which doesn't have what you think of as odd. > (ignoring the whole rest of the string versus a singe 0xFFFD). I don't understand what you meant by this
> (ignoring the whole rest of the string versus a singe 0xFFFD). Now I see your point. Some overlong sequences are replaced by 'Replacement character' while others are just rejected as invalid. As long as we use UTF8traits ('stateless'), I guess it's hard to avoid that. If we have to choose one, I believe we have to go with the latter (rejecting as invalid and stopping the conversion ) for all cases [1]. There's a problem, though because ConvertUTF8toUTF16 is also used for UTF8InputStream, for which 'quiting the conversion' prematurely might be considered as too severe. However, it would be all right, UTF8InputStream is only used for 'internally' generated stream. An alternative is to do something similar to what's done in |IsUTF8| (bug 191541), but we may have a perf. hit. Below is the UTF-8 syntax in a new IETF draft (as noted in the draft, Unicode 4.0 should be considered the authoritative source) http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-05.txt 4. Syntax of UTF-8 Byte Sequences For the convenience of implementors using ABNF, a definition of UTF-8 in ABNF syntax is given here. A UTF-8 string is a sequence of octets representing a sequence of UCS characters. An octet sequence is valid UTF-8 only if it matches the following syntax, which is derived from the rules for encoding UTF-8 and is expressed in the ABNF of [RFC2234]. UTF8-octets = *( UTF8-char ) UTF8-char = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4 UTF8-1 = %x00-7F UTF8-2 = %xC2-DF UTF8-tail UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) / %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail ) UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) / %xF4 %x80-8F 2( UTF8-tail ) UTF8-tail = %x80-BF
Blocks: 86411
RFC 2279bis is now officially released as RFC 3629 http://www.ietf.org/rfc/rfc3629.txt. The ABNF is unchanged from the draft quoted in comment 7. The authoritative definition referenced by the RFC is in http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf, section D36
QA Contact: scc → string
smontagu or jshin, is this bug still valuable at all?
Assignee: jshin1987 → nobody
Flags: needinfo?(smontagu)
Yes, we do need to fix this still (rather to my surprise -- I thought it had been fixed years ago, and perhaps some of it has been, e.g. in bug 422868 and/or bug 541245, but at the least we still have the obsolete is5byte and is6byte in xpcom/string/public/nsUTF8Utils.h)
Flags: needinfo?(smontagu)
Priority: -- → P3
No assignee, updating the status.
Status: ASSIGNED → NEW
No assignee, updating the status.
No assignee, updating the status.
Component: String → XPCOM
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: