Open
Bug 199090
Opened 22 years ago
Updated 2 years ago
UTF8trait and related codes need to be update to reflect 'the' new definition of UTF-8
Categories
(Core :: XPCOM, defect, P3)
Core
XPCOM
Tracking
()
NEW
People
(Reporter: jshin1987, Unassigned)
References
Details
(Keywords: intl)
Attachments
(2 files, 1 obsolete file)
4.65 KB,
patch
|
Details | Diff | Splinter Review | |
6.57 KB,
patch
|
Details | Diff | Splinter Review |
This is a spin-off of bug 191541. (see also bug 172701)
|UTF8trait| and a few functions (|UTF8InputStream::CountValidUTF8Bytes|,
|CalculateUTF8Length|, |ConvertUTF8toUCS2|)
that make use of it haven't been updated to the 'new' definition of UTF-8.
Unicode wouldn't support codepoints beyond 0x10FFFFF (20.1 bit)
and ISO/IEC committed itself not to assign a character beyond that to get ISO
10646 in sync with Unicode. IETF RFC is also being revised to get its
definition of UTF-8 in sync (see
http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-04.txt).
Things that (may) have to be done:
- remove Is5byte and Is6byte
- adjust Is4byte to exclude the beg. of 4byte UTF-8 sequence
for codepoints above 0x10FFFF (plane 17 and higher)
- adjust Is2byte to exclude 0xC0 and 0xC1 which begin
overlong UTF-8 sequences for U+0000 - U+007F
(e.g. 0xC0 0x80 is an overlong representation of U+0000
for which we can just use 0x00)
- modify three functions mentioned above accordingly.
Things to consider:
- performance ramification : Is2byte and Is4byte will get more complicated
(on the other hand, ConvertUTF8toUCS2 can get a bit simpler)
- would we never utilize '10 or so' bits (5byte and 6byte UTF-8)
for 'out-of-band' information?
probably not as long as the internal string representation remains
UTF-16 which does not offer a way to access codepoints beyond 0x10FFFF
- security issue?? (bug 172701)
Reporter | ||
Comment 1•22 years ago
|
||
We need to do a little more tweaking later, but in the meantime this will save
us some cycles by excluding 5-6 byte UTF-8 sequences.
Reporter | ||
Comment 2•22 years ago
|
||
This patch is more aggressive than attachment 127987 [details] [diff] [review], but can be slower than
that.
dbaron, do you think perf. loss is significant with extra conditions in this
patch?
Reporter | ||
Comment 3•22 years ago
|
||
UTF8-ness check is equivalent to |IsUTF8|.
Attachment #127990 -
Attachment is obsolete: true
Reporter | ||
Comment 4•22 years ago
|
||
attachment 127999 [details] [diff] [review] is 'more' correct than attachment 127987 [details] [diff] [review], but we may get away
with attachment 127987 [details] [diff] [review] if perf. diff. is too significant. any comment?
Status: NEW → ASSIGNED
Comment on attachment 127999 [details] [diff] [review]
patch v3
> class UTF8traits
>- static PRBool is2byte(char c) { return (c & 0xE0) == 0xC0; }
>+ static PRBool is2byte(char c)
>+ { return ((c & 0xE0) == 0xC0) && (PRUint8(c) > 0xC1); }
This seems odd to me. Does the new UTF-8 standard really say that the behavior
has to be different for this one case of overlong sequences? Why should we be
treating it differently from the rest of the |minUcs4| handling?
> static PRBool is3byte(char c) { return (c & 0xF0) == 0xE0; }
>- static PRBool is4byte(char c) { return (c & 0xF8) == 0xF0; }
>+ static PRBool is4byte(char c)
>+ { return ((c & 0xF8) == 0xF0) && (PRUint8(c) < 0xF5); }
Likewise, this seems like rather odd error handling since it will only catch
things that are greater than or equal to 0x00140000, not 0x00110000.
(In both cases it's the difference between putting in a single 0xFFFD and
ignoring the whole string.)
> else if ( ucs4 >= PLANE1_BASE )
> {
>+ // we still need to exclude the range [0x00110000, 0x00110fff]
Yes, but you're excluding it in a very different way (ignoring the whole rest
of the string versus a singe 0xFFFD).
> if ( ucs4 >= 0x00110000 )
> *out++ = UCS2_REPLACEMENT_CHAR;
> else {
I guess I should probably look at the new version of UTF-8 myself rather than
speculating...
Also, are you going to change nsUTF8ToUnicode and nsUnicodeToUTF8?
Reporter | ||
Comment 6•22 years ago
|
||
Please, note that UTF8traits are not only for ConvertUTF8toUTF16 but is also
used in a couple of other places. You're right that ConvertUTF8toUTF16 does
have its own error handling and range checking. Anyway, I put up two versions,
the first of which doesn't have what you think of as odd.
> (ignoring the whole rest of the string versus a singe 0xFFFD).
I don't understand what you meant by this
Reporter | ||
Comment 7•22 years ago
|
||
> (ignoring the whole rest of the string versus a singe 0xFFFD).
Now I see your point. Some overlong sequences are replaced by 'Replacement
character' while others are just rejected as invalid. As long as we use UTF8traits
('stateless'), I guess it's hard to avoid that. If we have to choose one, I believe we have
to go with the latter (rejecting as invalid and stopping the conversion ) for all cases [1].
There's a problem, though because ConvertUTF8toUTF16 is also used for
UTF8InputStream, for which 'quiting the conversion' prematurely might be considered
as too severe. However, it would be all right, UTF8InputStream is only used for
'internally' generated stream.
An alternative is to do something similar to what's done in |IsUTF8| (bug 191541), but
we may have a perf. hit.
Below is the UTF-8 syntax in a new IETF draft (as noted in the draft, Unicode 4.0
should be considered the authoritative source)
http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-05.txt
4. Syntax of UTF-8 Byte Sequences
For the convenience of implementors using ABNF, a definition of UTF-8
in ABNF syntax is given here.
A UTF-8 string is a sequence of octets representing a sequence of UCS
characters. An octet sequence is valid UTF-8 only if it matches the
following syntax, which is derived from the rules for encoding UTF-8
and is expressed in the ABNF of [RFC2234].
UTF8-octets = *( UTF8-char )
UTF8-char = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1 = %x00-7F
UTF8-2 = %xC2-DF UTF8-tail
UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
%xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
%xF4 %x80-8F 2( UTF8-tail )
UTF8-tail = %x80-BF
Comment 8•21 years ago
|
||
RFC 2279bis is now officially released as RFC 3629
http://www.ietf.org/rfc/rfc3629.txt. The ABNF is unchanged from the draft quoted
in comment 7.
The authoritative definition referenced by the RFC is in
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf, section D36
Updated•18 years ago
|
QA Contact: scc → string
Comment 9•12 years ago
|
||
smontagu or jshin, is this bug still valuable at all?
Assignee: jshin1987 → nobody
Flags: needinfo?(smontagu)
Comment 10•12 years ago
|
||
Yes, we do need to fix this still (rather to my surprise -- I thought it had been fixed years ago, and perhaps some of it has been, e.g. in bug 422868 and/or bug 541245, but at the least we still have the obsolete is5byte and is6byte in xpcom/string/public/nsUTF8Utils.h)
Flags: needinfo?(smontagu)
Updated•12 years ago
|
Priority: -- → P3
Comment 12•6 years ago
|
||
No assignee, updating the status.
Comment 13•6 years ago
|
||
No assignee, updating the status.
Assignee | ||
Updated•4 years ago
|
Component: String → XPCOM
Updated•2 years ago
|
Severity: normal → S3
You need to log in
before you can comment on or make changes to this bug.
Description
•