Closed Bug 183156 Opened 22 years ago Closed 18 years ago

replace 'UCS2' in function/class/method names with 'UTF16' and add some new suppl. methods(if necessary)

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: jshin1987, Assigned: jshin1987)

Details

Attachments

(4 files, 2 obsolete files)

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021008
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021008

A spin-off from bug 183048 and bug 182877.

Mozilla's internal string representation has always been UTF-16 (not UCS-2),
but functions/classes/methods that deal with UTF-16 are all named with
UCS2. This is pretty well known to I18N people, but some new developers
or those who don't work usually on I18N aspect may get a false notion
that |String| classes only contain UCS-2.  To dispell this misconception,
we need to rename all |*UCS2*|  |*UTF16*| along with changes
in documentation.

I'm not sure of how to proceed. Definining macros to map old name
to new names is one way, but could well worsen the situation 
because then it may be thought even by those familiar with
Unicode that functions with UCS2 in their names are different
from (work differently from) functions with UTF16 in their names. 
I don't know how 'relicencing' (which requires changing every
single file in Mozilla tree) was done. Can that method be applied
here? 



Reproducible: Always

Steps to Reproduce:
1.
2.
3.
Confirming.  ;)
Status: UNCONFIRMED → NEW
Ever confirmed: true
As a first step, I'm gonna change new-born (Copy|Append)UCS2toUTF8 and
(Append|Copy)UTF8toUCS2 (bug 87677). Nobody has used them so far.
Assignee: jaggernaut → jshin
Comment on attachment 124971 [details] [diff] [review]
a patch as just mentioned

asking for r/sr.
Attachment #124971 - Flags: superreview?(alecf)
Attachment #124971 - Flags: review?(jaggernaut)
Comment on attachment 124971 [details] [diff] [review]
a patch as just mentioned

sr=alecf
Attachment #124971 - Flags: superreview?(alecf) → superreview+
Attachment #124971 - Flags: review?(jaggernaut) → review+
I wouldn't say that Mozilla's internal string representation has always been
UTF-16. There was a time where, in converting from UTF8, we would ignore
anything that would result in a character that would take more than 2 bytes to
store. I do however agree with this bug, we should move to naming our classes
|*UTF16*| instead of |*UCS2*|, since that is the case now.

Begin by making the changes inside string/, but keeping typedefs / inline
forwarders for the old classes and functions. Once that is landed, put a
notification up on tinderbox and in newsgroups that we're going to make this
change and people should start using these new names in their new code / fix 'em
up as they touch code, and that we'll do a tree wide sweep a week after that
(we'll have to coordinate this with drivers).

You could do the tree wide sweep using some sed inline replace script(s), dbaron
might be able to help you with that. If not, I could have some fun with this,
it's been a while :-)
Fix checked into trunk.
Thanks for r/sr and for the tip. I'll begin to work in string/.
Attached patch a patch for string/ (obsolete) — Splinter Review
the second step ! a patch for string/.
Comment on attachment 125162 [details] [diff] [review]
a patch for string/

asking r/sr. I've made a debug build on Win32 with this patch and it works
fine.
Attachment #125162 - Flags: superreview?(alecf)
Attachment #125162 - Flags: review?(peterv)
Comment on attachment 125162 [details] [diff] [review]
a patch for string/

in NS_ConvertToString()

-    nsAutoString aUCS2string;
-    aUCS2string.AssignWithConversion(anASCIIstring);
-    return aUCS2string;
+    nsAutoString aUTF16string;
+    aUTF16string.AssignWithConversion(anASCIIstring);
+    return aUTF16string;

this looks like the old code was named wrong anyway.. either way this document
is looking older and cruftier every day :)

As for the rest of the patch - it looks ok, but I'd want to make sure that
stuff like the typedefs from the NS_Convert* classes actually work on all 3
major platforms... so please test this on more than just win32.

sr=alecf but you need the string module owner's review, not peterv's..
Attachment #125162 - Flags: superreview?(alecf)
Attachment #125162 - Flags: superreview+
Attachment #125162 - Flags: review?(peterv)
Attachment #125162 - Flags: review?(jaggernaut)
alecf, thanks for sr. 

I made a new patch because dbaron's ToNewUTF8 patch moved things around in
string. There's nothing new (in terms of actual content).
Can you transfer your sr to this patch? 

I've just tested it on linux (RedHat 8.0, g++ 3.2)  and it works fine. To make
sure, I clobbered and am now rebuilding it with the latest trunk on Linux.
Windows and Linux (and hopefully other Unix compilers wouldn't be different in
handling 'typedef'ed class'. ...) Who is the best person to ask it for testing
on Mac,  sfraser? 

I'll leave out the part of the obsolete string-guide that talks about old
functions. BTW, why don't you check in your new doc in the tree? Or, is it not
done on purpose? Sometimes, off-line doc. may be useful (when one's net
connection is down :-))
Attachment #125162 - Attachment is obsolete: true
adding mkaply and sfraser to ask for help with OS/2 and Mac. Can you test
attachment 125407 [details] [diff] [review] on OS/2 and Mac (classic and OS X)? timeless, can you test it
under BeOS? Thank you. 
Sorry, I don't have cycles to test on Mac.
Comment on attachment 125407 [details] [diff] [review]
a new patch (against the current trunk : adapting to ToNewUTF8 patch by dbaron)

The typedef thing should work, we've done it before with typedef |const
nsAString nsAReadableString|.

Sorry for not getting to this earlier. I had a partial review comment ready but
got pulled into something else.


>Index: string/obsolete/nsString.h
>===================================================================
>@@ -422,32 +422,32 @@
> // NS_DEF_DERIVED_STRING_OPERATOR_PLUS(nsCAutoString, char)
> 
> /**
>- * A helper class that converts a UCS2 string to UTF8
>+ * A helper class that converts a UTF-16 string to UTF8

Should we scan our string code comments and docs to make sure we use "UTF-8"
everywhere?

I didn't see |typedef NS_ConvertUTF16toUTF8 NS_ConvertUCS2toUTF8|


Should we perhaps move both typedefs all the way down in the header in a /*
Backward compatibility */ section?


>Index: string/public/nsReadableUtils.h
>===================================================================
>@@ -82,10 +89,10 @@
>    * Returns a new |char| buffer containing a zero-terminated copy of |aSource|.
>    *
>    * Allocates and returns a new |char| buffer which you must free with |nsMemory::Free|.
>-   * Performs a encoding conversion by converting 16-bit wide characters down to UTF8 encoded 8-bits wide string copying |aSource| to your new buffer.
>+   * Performs a encoding conversion from UTF-16 string made of PRUnichar's to UTF-8 made of char's copying |aSource| to your new buffer.

"Performs an encoding ... an UTF-16 ... PRUnichars ... an UTF-8 string ...
chars ..."

And please wrap that line to 80 characters.

I'm wondering if we should leave out the "made of PRUnichars" and "made of
chars". It's not relevant what data type we internally use to store the
characters in.

>    * The new buffer is zero-terminated, but that may not help you if |aSource| contains embedded nulls.
>    *
>-   * @param aSource a 16-bit wide string
>+   * @param aSource a UTF-16 string (made of PRUnichar's)

"an UTF-16"

Index: string/public/nsUTF8Utils.h
===================================================================
@@ -206,6 +206,8 @@
       PRBool mErrorEncountered;
   };

+typedef ConvertUTF8toUTF16 ConvertUTF8toUCS2;
+
 /**
  * A character sink (see |copy_string| in nsAlgorithm.h) for computing
  * the length of a UTF-8 string.

This typedef should be unnecessary if you fix the one use of it in
/xpcom/io/nsUnicharInputStream.cpp

>Index: string/doc/string-guide.html
>===================================================================
>RCS file: /cvsroot/mozilla/string/doc/string-guide.html,v
>retrieving revision 1.11
>diff -u -r1.11 string-guide.html
>--- string/doc/string-guide.html	11 Mar 2003 23:26:18 -0000	1.11
>+++ string/doc/string-guide.html	11 Jun 2003 07:57:25 -0000

>+or <span class="code">char</span> string literals.  In UTF-16, characters occupy one 16bit code unit ( BMP character ) or two 16bit code units 
>+(non-BMP characters).

Inconsistent spacing near "()" (I prefer no extra space). Should that be
plural: "(BMP characters)" to match up with the other one? Should we explain
what BMP stands for, or perhaps linkify these, or wrap them in <abbr
title="explanation">BMP</abbr>?
Attachment #125407 - Flags: review-
> "an UTF-16"
jag: actually our source tree uses "a utf" a lot more than "an utf"

that's probably ok because the rule isn't strictly about which letter it is but
more about which sound is present. 'a you tee eff'
I removed 'typedef for ConvertUTF8toUCS2'  as suggested and took care of other
issues raised.

> I didn't see |typedef NS_ConvertUTF16toUTF8 NS_ConvertUCS2toUTF8|

Ooops. That's dropped off while hand-editing my patch after testing but before
uploading. I meant to get rid of ConvertUTF16toUTF8 (that is not used anywhere
outside string/), but deleted NS_ConvertUTF16toUTF8, instead. 

> more about which sound is present. 'a you tee eff'

That's what every foreign student taking TOEFL is expected to know by ETS :-)
Attachment #125407 - Attachment is obsolete: true
Comment on attachment 125514 [details] [diff] [review]
a new patch addressing jag's concerns

Just to save reviewers some time: diff. between this and the previous one is
mostly along the line of jag's comment. In addition, I found  a completely
wrong description (apparently cut'n'pasted from the comment for ToNewCString)
of ToNewUTF8String(aAString) and fixed it. I also removed  typedef's for
ConvertUTF8toUTF16  and ConvertUTF16toUTF8.
Attachment #125514 - Flags: superreview?(alecf)
Attachment #125514 - Flags: review?(jaggernaut)
Comment on attachment 125162 [details] [diff] [review]
a patch for string/

clearing obsolete reviews
(but jag if you're reading this, there is a newer version of this patch for you
to review!)
Attachment #125162 - Flags: review?(jaggernaut)
Comment on attachment 125514 [details] [diff] [review]
a new patch addressing jag's concerns

Heh, I could've sworn I learned it the "no exceptions" way. I guess I've
forgotten a thing or two from my English classes.

Nice catch on the "characters" -> "code units". is it okay to use "bytes" for
UTF-8 strings, or should we use "code units" there too? (Or perhaps "encoding
units"?)
Attachment #125514 - Flags: review?(jaggernaut) → review+
How about just 16-bit units in stead, that's what the W3C specs talk about when
talking about UTF16 "characters".
Thank you for review and comment.

Unicode 4.0 glossary (final draft) has the following
(http://www.unicode.org/book/preview/glossary.pdf : I had to type this so that
any typo is mine)

Code Unit: The minimial bit combination that can represent a unit of encoded
text for processing or interchange. The Unicode standard uses 8-bit code units
in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and
32-bit code units in the UTF-32 ......

Given this, I'd rather use '16-bit code units' instead of '16-bit units'.(I'll
insert '-' because I found that '-' is used in some other comments in string/)
As for UTF-8, can we just get away with bytes instead of '8-bit code units'?
There are some architectures on which a byte doesn't have 8 bits, but Mozilla
doesn't run on them ...
Status: NEW → ASSIGNED
I just realized that my patch 'rewrites' old messages  of scc in
string-guide.html  What shall I do? Can I 'html-ize' it somehow? The document is
obsolete, anyway... 
How about this? I thought abuot using <ins>, <del> or a couple of CSS tricks,
but just adding a short note seems to be  best. 
 
<p>
  Here are the email answers I have yet to format into the FAQ.
  Some of the URLs may be out-dated or moved.
  The messages are in order from oldest to newest.
</p>
+ <p span="editnote">In June, 2003, these emails were modified
+ by Jungshik Shin to better reflect what is stored in 'wide' string
+ classes (UTF-16 string instead of UCS-2)  and what        
+ related methods do as a part of the patch for <a href=
+ "http://bugzilla.mozilla.org/show_bug.cgi?id=183156" 
+ title="replace UCS2 in function/class/method names with UTF16">bug 183156</a>.
+ Therefore, they're a little different from  the original emails
+ written by <a href="http://ScottCollins.net/">Scott Collins</a>
</p>
Just to see how it works, I ran the following in the root dir. of my tree 
after updating my tree to the trunk (2003-06-14 around 6pm? US PDT) 

LC_ALL=C find . -name "*.h" -o -name "*.cpp" -o -name "*.idl" |  \
   xargs perl -i.ucs2_backup -p                         \
     -e 's/(Copy|NS_LossyConvert|NS_Convert)UCS2to(UTF8|ASCII)/$1UTF16to$2/g, \
         s/(Copy|NS_Convert)(ASCII|UTF8)toUCS2/$1$2toUTF16/g'

(I could have used 'GNU sed 4.x' with '-i' option with '$1' replacing '\1' in 
RE or manually made back-up files. I also searched for strings to replace in 
*pl, *sh, and *py files that may auto-generate cpp/h files and there's none 
found.)

The above took 86 sec (user time: system time was ~ 10 sec) on my box.

Then, I clobbered and rebuilt(debug). Everything went well. I didn't enable 
everything and the test was done on Linux so that  this test wouldn't have 
detected if there's something wrong in parts not built (non-Linux and not-
enabled code paths/files). However, I think it's pretty safe to assume that 
it'll work on other platforms. 

After the replacement, I also searched '(NS_Conver|Cop).*UCS2' and found a 
typo in widget/src/photon/nsTextHelper.cpp

http://lxr.mozilla.org/seamonkey/source/widget/src/photon/nsTextHelper.cpp#101

UTF8 was mistyped as 'UFT8' in that line. 

 
the string doc is obsolete, but the changes you made are just fine - no need for
any special ins/del tags or anything. CVS can keep track of document history.
the string doc is obsolete, but the changes you made are just fine - no need for
any special ins/del tags or anything. CVS can keep track of document history.
Alec, can you put sr stamp for attachment 125514 [details] [diff] [review] as it is, then? (my point was
that I feel a bit uncomfortable  rewriting scc's email messaegs quoted verbatim
in the document 3 years after they're written.)
Comment on attachment 125514 [details] [diff] [review]
a new patch addressing jag's concerns

sr=alecf - sorry, I didn't mean that it was delaying my sr, I just hadn't
gotten around to it yet.
Attachment #125514 - Flags: superreview?(alecf) → superreview+
Thanks, alecf. Fix got checked in several hours ago. 
Now we have to move on to next round. What newsgroups shall I cross-post the
notice to? How about putting it up at tinderbox?

post to n.p.m.xpcom and n.p.m.announce - asa should approve it - a few weeks ago
Asa and I agreed that using n.p.m.announce to announce major API changes, etc,
was a good idea... this is a good place to start.
adding jag(string owner) and asa (for carpool) to CC
This is an additional patch to rename CopyUTF16toASCII as LossyCopyUTF16toASCII
suggested by jag off-line. I think it's a good idea to make it in line with
NS_LossyConvertUTF16toASCII.
Comment on attachment 126602 [details] [diff] [review]
a patch : CopyUTF16toASCII -> LossyCopyUTF16toASCII

r+sr=jag
Attachment #126602 - Flags: superreview+
Attachment #126602 - Flags: review+
note to myself: Camino has a bunch of '*.mm' files with '*UCS2to*' and
'*toUCS2'. So I have to modify 'find statement' in comment #24 to include '*mm'
files. 
ok, this patch changes all callers to use the UTF16 functions and removes the UCS2 ones.

the essential part here was done by:
perl -pi -e 's/NS_ConvertASCIItoUCS2/NS_ConvertASCIItoUTF16/g;
             s/NS_ConvertUTF8toUCS2/NS_ConvertUTF8toUTF16/g;
             s/NS_LossyConvertUCS2toASCII/NS_LossyConvertUTF16toASCII/g;
             s/NS_ConvertUCS2toUTF8/NS_ConvertUTF16toUTF8/g;
             s/CopyASCIItoUCS2/CopyASCIItoUTF16/g;
             s/CopyUCS2toASCII/LossyCopyUTF16toASCII/g' $*
on a clean MOZ_CO_MODULES=all tree.
Attachment #210428 - Flags: superreview?(darin)
Attachment #210428 - Flags: review?(darin)
I'll post an announcement to m.d.t.xpcom and m.d.general when checking this in.
Attachment #210428 - Flags: superreview?(darin)
Attachment #210428 - Flags: superreview+
Attachment #210428 - Flags: review?(darin)
Attachment #210428 - Flags: review+
checked in. that should make this bug fixed.
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Component: String → XPCOM
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: