Closed Bug 191541 Opened 22 years ago Closed 21 years ago

we need IsUTF8 /CouldBeUTF8 /CannotBeUTF8

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.4alpha

People

(Reporter: jshin1987, Assigned: jag+mozilla)

References

Details

(Keywords: intl)

Attachments

(1 file, 2 obsolete files)

This is a spin-off of bug 57011 and bug 162765.
(see also
http://bugzilla.mozilla.org/buglist.cgi?query_format=&short_desc_type=allwordssubstr&short_desc=&long_desc_type=substring&long_desc=IsUTF8&bug_file_loc_type=allwordssubstr&bug_file_loc=&status_whiteboard_type=allwordssubstr&status_whiteboard=&keywords_type=allwords&keywords=&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=RESOLVED&bug_status=VERIFIED&bug_status=CLOSED&emailassigned_to1=1&emailtype1=exact&email1=&emailassigned_to2=1&emailreporter2=1&emailqa_contact2=1&emailtype2=exact&email2=&bugidtype=include&bug_id=&votes=&changedin=&chfieldfrom=&chfieldto=Now&chfieldvalue=&cmdtype=doit&order=Bug+Number&field0-0-0=noop&type0-0-0=noop&value0-0-0=
)

In several places in Mozilla code, we need to check
if a given CString is a valid UTF-8 string. At the moment,
something like the following 

cstr.Equals(NS_ConvertUCS2toUTF8(NS_ConvertUTF8toUCS2(cstr)))

is used. (see, for instance,
http://lxr.mozilla.org/seamonkey/source/mailnews/compose/src/nsSmtpService.cpp#107)

As Conrad suggested in bug 57011 and implemented
in attachment 17866 [details] [diff] [review], it'd be nice to have IsUTF8().
If we're worried that non-UTF8 strings are mistaken
to be in UTF-8 (which I think rarely happens in
*real-life*, as opposed to strings concocted
to fool IsUTF8(),  non-UTF-8 strings encoded in legacy
encodings), we can name it CannotBeUTF8() and
invert the return value (or just name it 
CouldBeUTF8()). 

FYI, glib has
g_utf8_validate()(http://developer.gnome.org/doc/API/2.0/glib/glib-Unicode-Manipulation.html#g-utf8-validate)


I also found that |nsCRT| class also has |IsAscii|. Is it an accidental 
duplication or was it done on purpose? 

In short, I think it's good to add IsUTF8/CannotBeUTF8/CouldBeUTF8
in attachment 17866 [details] [diff] [review] by Conrad.
adding some i18n people and 'intl' keyword.
Keywords: intl
There's a bug on this already, with a patch and stuff.  I can't remember who was
working on it, though... Neil, maybe?
Whiteboard: DUPEME
bz: not one of mine
Boris,
attachment 17866 [details] [diff] [review] also includes a patch. Have you found out
which is another bug on this? I couldn't find with Bugzilla. 
I haven't; in the end it does not matter as long as we have this functionality
and actually use it in the dozens of places it's needed....
I ran a RE search on the entire tree and found only a few places
where UTF8ness check is done. However, I think it's a good
idea to check in attachment 17866 [details] [diff] [review] (with a possible name change)
because I'm pretty sure it'll find more customers once  in 
the tree(see bug 16276). If not for any other reasons, I'd love to 
see this checked in to avoid being bombarded with 
tens of pop-up windows with warning when I tried a debug build
for MS Windows. 
Attached patch a patch (obsolete) — Splinter Review
I modified attachment 17866 [details] [diff] [review] to filter out 
overlong UTF-8 sequences ( e.g. 0xC0 0x80 for U+0000)
'UTF-8 representation of surrogate codepoints' and
non-characters ( 0xFFFE, 0xFFFF in all planes).
With this change, this patch is equivalent
to |str.Equals(NS_ConvertUCS2toUTF8(NS_ConvertUTF8toUCS2(str)))|
in terms of 'features' and is more likely to 
detect non-UTF-8 string. 

In the patch, I only included one customer of this
method(nsTextToSubURI.cpp), but there are a couple
of more (see bug 162765).
Comment on attachment 117690 [details] [diff] [review]
a patch

asking r/sr to move this forward.
Attachment #117690 - Flags: superreview?(bzbarsky)
Attachment #117690 - Flags: review?(nhotta)
I won't be able to sr this for at least 2 weeks....
Comment on attachment 117690 [details] [diff] [review]
a patch

there are several experts on CC. Pls,  pick this up and review it if you can.
Attachment #117690 - Flags: superreview?(bzbarsky) → superreview?(alecf)
Comment on attachment 117690 [details] [diff] [review]
a patch

Jungshik, please ask Simon for a review. He was working on this issue recently.
Attachment #117690 - Flags: review?(nhotta) → review?(smontagu)
Comment on attachment 117690 [details] [diff] [review]
a patch

some comments:
+      // for each chunk of |aString|...

this should have a bit less indentation

+    for ( aString.BeginReading(iter); iter != done_reading; iter.advance(
PRInt32(fragmentLength) ) )
+      {

if your style of spaces after/before parens and indented { on the next line
really the current style of the file?

+#if 0
why is this #if 0?
Naoki, thank you for transfering review.


> +      // for each chunk of |aString|...
> this should have a bit less indentation
> is your style of spaces after/before parens and indented { on the next line
> really the current style of the file?

Both indentation styles are consistently used in the file. See, for instance,
 
http://lxr.mozilla.org/seamonkey/source/string/src/nsReadableUtils.cpp#570

BTW, I'll replace C++-style comment before |IsUTF8| with C-style comment
as is done elsewhere in the file. 

> +#if 0
> why is this #if 0?

  5-6 bytes long UTF-8 sequences are not supposed to exist in practically all
cases (Unicode has only 20.1bit codespaces - upto 0x10FFFF - and
ISO JTC1/SC2/WG2 committed itself not to assign any character beyond
that limit - plane 16. A new IETF RFC has been in preparation to reflect
this. see http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-04.txt
). Still, there's a possibility that some people
produce them (for what? ask them :-)) so that I wasn't 100% sure whether 
to take that out or not. Excluding them  not only speeds things up a bit 
but also makes |IsUTF8| less likely to mistake non-UTF8 for UTF-8.
 
Attached patch a new patch (obsolete) — Splinter Review
Skimming over rfc2279-rev draft, 
I realized that I forgot to filter out 4byte UTF-8
sequences for codepoints above 0x10FFFF. I also
replaced bitwise-and and comparison with comparison
(which is possible because |c| is now PRUint8). 
Besides, I removed #if 0'd code for 5-6 bytes-long seq.
Attachment #117690 - Attachment is obsolete: true
Attachment #117784 - Flags: superreview?(alecf)
Attachment #117784 - Flags: review?(smontagu)
Comment on attachment 117690 [details] [diff] [review]
a patch

sorry for spamming
Attachment #117690 - Flags: superreview?(alecf)
Attachment #117690 - Flags: review?(smontagu)
I'm not sure about excluding the ..FFFE and ..FFFF non-characters. They aren't
legal codepoints, but are they illegal UTF-8? There was discussion of this
question in bug 172701.
Thank you for the ref. to bug 172701. It reminded me of additional
non-characters that could be rather hard to throw away in the
current scheme. 

I'm not sure of excluding non-chars, either. 
(FYI, a successor-to-be of RFC 2279 mentioned
in my prev. comment doesn't regard
non-chars as invalid.). I filtered them
out because I wanted to preserve the behavior of
|str.Equals(nsConvertUCS2toUTF8(nsConvertUTF8toUCS2(str)))|
for which  |IsUTF8| is a replacement.  

The context for bug 172701 appears different from the one for this bug.
The former is about how to and where to validate UTF-8 (or content
encoded in it). |IsUTF8| (or |CouldBeUTF8| ) is not a generic
bullet-proof validator/invalidator (so that we don't
have to worry about Unicode compliance as much) but  rather
a heuristic 'device' to tell relatively short non-UTF-8 strings 
from UTF-8 strings that we resort to when other options are exhausted
(no external tag present, etc). |IsUTF8| is useful
for URI, filename, mail header, http header, ftp list, etc.
When |IsUTF8| returns false, the input string is interpreted
as a fallback encoding(e.g. parent document encoding, locale
charset, etc). 

Given this, it'd be extremly rare that non-chars are included
in UTF-8 strings for cases listed above. On the other hand,
someone with malicious intent may come up with a string
valid in both a legacy encoding and UTF-8
(if we allow non-chars in UTF-8) that has some security
implication if interpreted as the legacy encoding
while benign if interpreted as UTF-8. However,
there may as well the other way around (benign
in the legacy encoding while security-risk in
UTF-8).    
 
 So, I'm really not sure, but I'm a bit inclined to
preserve the behavior of what |IsUTF8| intends
to replace. 
Comment on attachment 117784 [details] [diff] [review]
a new patch

>+                if ( c <= 0xBF ) // [80-BF] when it's not expected.
>+                  return PR_FALSE;

Why not make this test
		  if ( c <= 0xC1 )
and then you can save the overlong test from the 2 bytes case:

>+                    if ( 0 == (0x1e & c) ) // overlong : [C0-C1][80-BF]
>+                      return PR_FALSE;


>+		else if ( c <= 0xDF ) // 2 bytes UTF8

Nit: remove the tabs from this line

r=smontagu with those changes
Attachment #117784 - Flags: review?(smontagu) → review+
Thank you for r, Simon. Alec, can I get your sr? 
Marking this as blocking bug 162765. It can go in
w/o this, but |IsUTF8| would be much nicer than
converting back and forth bet. UTF-8 and UTF-16.
Blocks: 162765
Whiteboard: DUPEME
Target Milestone: --- → mozilla1.4alpha
Comment on attachment 117784 [details] [diff] [review]
a new patch

+		 if ( c <= 0xBF ) // [80-BF] when it's not expected.
+		   return PR_FALSE;
+		else if ( c <= 0xDF ) // 2 bytes UTF8
+		   {
+		     state = 1;
+		     if ( 0 == (0x1e & c) ) // overlong : [C0-C1][80-BF]
+		       return PR_FALSE;
+		   }
+		 else if ( c <= 0xEF ) // 3 bytes UTF8
+

please use UTF8Traits::is3byte() and so forth..
Attachment #117784 - Flags: superreview?(alecf) → superreview-
Uses UTF8traits::isXXX where approrpiate per Alec's comment.

BTW, it seems like UTF8traits, |UTF8InputStream::CountValidUTF8Bytes| and
|CalculateUTF8Length| need to be updated.
Attachment #117784 - Attachment is obsolete: true
Comment on attachment 118366 [details] [diff] [review]
a new patch with UTF8traits::

assuming Simon's r is
carried over...
Attachment #118366 - Flags: superreview?(alecf)
Comment on attachment 118366 [details] [diff] [review]
a new patch with UTF8traits::

what about this line:
+		 else if ( c <= 0xF4 ) // 4 bytes UTF8 (note the bound is not
0xF7)

sr=alecf with that line fixed (or explain why you can't use it in a comment)

how should each of those things be updated? to reflect all this stuff with
surrogates and such?

I will gladly review any additional changes to UTF8Traits if you want to
include them here.
Attachment #118366 - Flags: superreview?(alecf) → superreview+
Comment on attachment 118366 [details] [diff] [review]
a new patch with UTF8traits::

what about this line:
+		 else if ( c <= 0xF4 ) // 4 bytes UTF8 (note the bound is not
0xF7)

sr=alecf with that line fixed (or explain why you can't use it in a comment)

how should each of those things be updated? to reflect all this stuff with
surrogates and such?

I will gladly review any additional changes to UTF8Traits if you want to
include them here.
> + else if ( c <= 0xF4 ) // 4 bytes UTF8 (note the bound is not 0xF7)

> sr=alecf with that line fixed (or explain why you can't use it in a commen

Thank you for sr. Currently, UTF8traits::Is4byte(c) includes [0xF5-0xF7] as well
as [0xF0-0xF4], which is why I commented that 'note the bound ......'.
[0xF5-0xF7] begins  4byte seq. for codepoints beyond 10FFFFF. I'll make
it more clear.
 

> how should each of those things be updated? to reflect all this stuff with
> surrogates and such?

  I think UTF8traits::Is[56]byte have to be removed and Is4byte
has to be revised to filter out [0xF5-0xF7].  Besides, Is2byte
may have to be revised to exclude [0xC0-0xC1] (only
used for illegal overlong UTF-8 seq.). Because UTF8traits
only deals with a single 'char' at a time, it's not possible to filter out
all overlong/surrogate cases.  UTF8traits::isXXX are used
in three places (I didn't count ConvertUTF8toUCS2 in nsString2.h in
my prev. comment) and they can be updated accordingly. In case of 
ConvertUTF8toUCS2, it has its own overlong/non-char/surrogate/plane-17 or higher
detectors. In two other places, not detecting all those cases wouldn't be critical.

> I will gladly review any additional changes to UTF8Traits if you want to
> include them here.

 How about taking care of them in a  separate bug? Alternatively,
we can just keep this open to deal with the issue after landing attachment
118366 [details] [diff] [review]. (expecting Is4byte() to be updated as proposed above, I'll comment that
|c <= 0xF4| in attachment 118366 [details] [diff] [review] should be replaced by Is4byte when 
UTF8traits is updated.)

ah, I am beginning to understand
can you file a new bug on the issues we just discussed? (I don't think I
understand the issues well enough yet to file it myself) 

with the comments on the 4-byte sequences, can you reference the new bug that
you file, so that people reading the code can understand the issues? thanks!

it sounds like this is ready to go (With smontagu's review, and the comment update)
Thank you, Alec. Patch was just checked in.
I filed bug 199090 and added a comment referring to it in the patch.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Component: String → XPCOM
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: