Closed
Bug 183156
Opened 22 years ago
Closed 18 years ago
replace 'UCS2' in function/class/method names with 'UTF16' and add some new suppl. methods(if necessary)
Categories
(Core :: XPCOM, defect)
Core
XPCOM
Tracking
()
RESOLVED
FIXED
People
(Reporter: jshin1987, Assigned: jshin1987)
Details
Attachments
(4 files, 2 obsolete files)
2.64 KB,
patch
|
peterv
:
review+
alecf
:
superreview+
|
Details | Diff | Splinter Review |
34.55 KB,
patch
|
jag+mozilla
:
review+
alecf
:
superreview+
|
Details | Diff | Splinter Review |
1.82 KB,
patch
|
jag+mozilla
:
review+
jag+mozilla
:
superreview+
|
Details | Diff | Splinter Review |
917.21 KB,
patch
|
darin.moz
:
review+
darin.moz
:
superreview+
|
Details | Diff | Splinter Review |
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021008 Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021008 A spin-off from bug 183048 and bug 182877. Mozilla's internal string representation has always been UTF-16 (not UCS-2), but functions/classes/methods that deal with UTF-16 are all named with UCS2. This is pretty well known to I18N people, but some new developers or those who don't work usually on I18N aspect may get a false notion that |String| classes only contain UCS-2. To dispell this misconception, we need to rename all |*UCS2*| |*UTF16*| along with changes in documentation. I'm not sure of how to proceed. Definining macros to map old name to new names is one way, but could well worsen the situation because then it may be thought even by those familiar with Unicode that functions with UCS2 in their names are different from (work differently from) functions with UTF16 in their names. I don't know how 'relicencing' (which requires changing every single file in Mozilla tree) was done. Can that method be applied here? Reproducible: Always Steps to Reproduce: 1. 2. 3.
Assignee | ||
Comment 2•21 years ago
|
||
As a first step, I'm gonna change new-born (Copy|Append)UCS2toUTF8 and (Append|Copy)UTF8toUCS2 (bug 87677). Nobody has used them so far.
Assignee: jaggernaut → jshin
Assignee | ||
Comment 3•21 years ago
|
||
Assignee | ||
Comment 4•21 years ago
|
||
Comment on attachment 124971 [details] [diff] [review] a patch as just mentioned asking for r/sr.
Attachment #124971 -
Flags: superreview?(alecf)
Attachment #124971 -
Flags: review?(jaggernaut)
Comment 5•21 years ago
|
||
Comment on attachment 124971 [details] [diff] [review] a patch as just mentioned sr=alecf
Attachment #124971 -
Flags: superreview?(alecf) → superreview+
Updated•21 years ago
|
Attachment #124971 -
Flags: review?(jaggernaut) → review+
Comment 6•21 years ago
|
||
I wouldn't say that Mozilla's internal string representation has always been UTF-16. There was a time where, in converting from UTF8, we would ignore anything that would result in a character that would take more than 2 bytes to store. I do however agree with this bug, we should move to naming our classes |*UTF16*| instead of |*UCS2*|, since that is the case now. Begin by making the changes inside string/, but keeping typedefs / inline forwarders for the old classes and functions. Once that is landed, put a notification up on tinderbox and in newsgroups that we're going to make this change and people should start using these new names in their new code / fix 'em up as they touch code, and that we'll do a tree wide sweep a week after that (we'll have to coordinate this with drivers). You could do the tree wide sweep using some sed inline replace script(s), dbaron might be able to help you with that. If not, I could have some fun with this, it's been a while :-)
Assignee | ||
Comment 7•21 years ago
|
||
Fix checked into trunk. Thanks for r/sr and for the tip. I'll begin to work in string/.
Assignee | ||
Comment 8•21 years ago
|
||
the second step ! a patch for string/.
Assignee | ||
Comment 9•21 years ago
|
||
Comment on attachment 125162 [details] [diff] [review] a patch for string/ asking r/sr. I've made a debug build on Win32 with this patch and it works fine.
Attachment #125162 -
Flags: superreview?(alecf)
Attachment #125162 -
Flags: review?(peterv)
Comment 10•21 years ago
|
||
Comment on attachment 125162 [details] [diff] [review] a patch for string/ in NS_ConvertToString() - nsAutoString aUCS2string; - aUCS2string.AssignWithConversion(anASCIIstring); - return aUCS2string; + nsAutoString aUTF16string; + aUTF16string.AssignWithConversion(anASCIIstring); + return aUTF16string; this looks like the old code was named wrong anyway.. either way this document is looking older and cruftier every day :) As for the rest of the patch - it looks ok, but I'd want to make sure that stuff like the typedefs from the NS_Convert* classes actually work on all 3 major platforms... so please test this on more than just win32. sr=alecf but you need the string module owner's review, not peterv's..
Attachment #125162 -
Flags: superreview?(alecf)
Attachment #125162 -
Flags: superreview+
Attachment #125162 -
Flags: review?(peterv)
Attachment #125162 -
Flags: review?(jaggernaut)
Assignee | ||
Comment 11•21 years ago
|
||
alecf, thanks for sr. I made a new patch because dbaron's ToNewUTF8 patch moved things around in string. There's nothing new (in terms of actual content). Can you transfer your sr to this patch? I've just tested it on linux (RedHat 8.0, g++ 3.2) and it works fine. To make sure, I clobbered and am now rebuilding it with the latest trunk on Linux. Windows and Linux (and hopefully other Unix compilers wouldn't be different in handling 'typedef'ed class'. ...) Who is the best person to ask it for testing on Mac, sfraser? I'll leave out the part of the obsolete string-guide that talks about old functions. BTW, why don't you check in your new doc in the tree? Or, is it not done on purpose? Sometimes, off-line doc. may be useful (when one's net connection is down :-))
Attachment #125162 -
Attachment is obsolete: true
Assignee | ||
Comment 12•21 years ago
|
||
adding mkaply and sfraser to ask for help with OS/2 and Mac. Can you test attachment 125407 [details] [diff] [review] on OS/2 and Mac (classic and OS X)? timeless, can you test it under BeOS? Thank you.
Comment 13•21 years ago
|
||
Sorry, I don't have cycles to test on Mac.
Comment 14•21 years ago
|
||
Comment on attachment 125407 [details] [diff] [review] a new patch (against the current trunk : adapting to ToNewUTF8 patch by dbaron) The typedef thing should work, we've done it before with typedef |const nsAString nsAReadableString|. Sorry for not getting to this earlier. I had a partial review comment ready but got pulled into something else. >Index: string/obsolete/nsString.h >=================================================================== >@@ -422,32 +422,32 @@ > // NS_DEF_DERIVED_STRING_OPERATOR_PLUS(nsCAutoString, char) > > /** >- * A helper class that converts a UCS2 string to UTF8 >+ * A helper class that converts a UTF-16 string to UTF8 Should we scan our string code comments and docs to make sure we use "UTF-8" everywhere? I didn't see |typedef NS_ConvertUTF16toUTF8 NS_ConvertUCS2toUTF8| Should we perhaps move both typedefs all the way down in the header in a /* Backward compatibility */ section? >Index: string/public/nsReadableUtils.h >=================================================================== >@@ -82,10 +89,10 @@ > * Returns a new |char| buffer containing a zero-terminated copy of |aSource|. > * > * Allocates and returns a new |char| buffer which you must free with |nsMemory::Free|. >- * Performs a encoding conversion by converting 16-bit wide characters down to UTF8 encoded 8-bits wide string copying |aSource| to your new buffer. >+ * Performs a encoding conversion from UTF-16 string made of PRUnichar's to UTF-8 made of char's copying |aSource| to your new buffer. "Performs an encoding ... an UTF-16 ... PRUnichars ... an UTF-8 string ... chars ..." And please wrap that line to 80 characters. I'm wondering if we should leave out the "made of PRUnichars" and "made of chars". It's not relevant what data type we internally use to store the characters in. > * The new buffer is zero-terminated, but that may not help you if |aSource| contains embedded nulls. > * >- * @param aSource a 16-bit wide string >+ * @param aSource a UTF-16 string (made of PRUnichar's) "an UTF-16" Index: string/public/nsUTF8Utils.h =================================================================== @@ -206,6 +206,8 @@ PRBool mErrorEncountered; }; +typedef ConvertUTF8toUTF16 ConvertUTF8toUCS2; + /** * A character sink (see |copy_string| in nsAlgorithm.h) for computing * the length of a UTF-8 string. This typedef should be unnecessary if you fix the one use of it in /xpcom/io/nsUnicharInputStream.cpp >Index: string/doc/string-guide.html >=================================================================== >RCS file: /cvsroot/mozilla/string/doc/string-guide.html,v >retrieving revision 1.11 >diff -u -r1.11 string-guide.html >--- string/doc/string-guide.html 11 Mar 2003 23:26:18 -0000 1.11 >+++ string/doc/string-guide.html 11 Jun 2003 07:57:25 -0000 >+or <span class="code">char</span> string literals. In UTF-16, characters occupy one 16bit code unit ( BMP character ) or two 16bit code units >+(non-BMP characters). Inconsistent spacing near "()" (I prefer no extra space). Should that be plural: "(BMP characters)" to match up with the other one? Should we explain what BMP stands for, or perhaps linkify these, or wrap them in <abbr title="explanation">BMP</abbr>?
Attachment #125407 -
Flags: review-
Comment 15•21 years ago
|
||
> "an UTF-16"
jag: actually our source tree uses "a utf" a lot more than "an utf"
that's probably ok because the rule isn't strictly about which letter it is but
more about which sound is present. 'a you tee eff'
Assignee | ||
Comment 16•21 years ago
|
||
I removed 'typedef for ConvertUTF8toUCS2' as suggested and took care of other issues raised. > I didn't see |typedef NS_ConvertUTF16toUTF8 NS_ConvertUCS2toUTF8| Ooops. That's dropped off while hand-editing my patch after testing but before uploading. I meant to get rid of ConvertUTF16toUTF8 (that is not used anywhere outside string/), but deleted NS_ConvertUTF16toUTF8, instead. > more about which sound is present. 'a you tee eff' That's what every foreign student taking TOEFL is expected to know by ETS :-)
Attachment #125407 -
Attachment is obsolete: true
Assignee | ||
Comment 17•21 years ago
|
||
Comment on attachment 125514 [details] [diff] [review] a new patch addressing jag's concerns Just to save reviewers some time: diff. between this and the previous one is mostly along the line of jag's comment. In addition, I found a completely wrong description (apparently cut'n'pasted from the comment for ToNewCString) of ToNewUTF8String(aAString) and fixed it. I also removed typedef's for ConvertUTF8toUTF16 and ConvertUTF16toUTF8.
Attachment #125514 -
Flags: superreview?(alecf)
Attachment #125514 -
Flags: review?(jaggernaut)
Comment 18•21 years ago
|
||
Comment on attachment 125162 [details] [diff] [review] a patch for string/ clearing obsolete reviews (but jag if you're reading this, there is a newer version of this patch for you to review!)
Attachment #125162 -
Flags: review?(jaggernaut)
Comment 19•21 years ago
|
||
Comment on attachment 125514 [details] [diff] [review] a new patch addressing jag's concerns Heh, I could've sworn I learned it the "no exceptions" way. I guess I've forgotten a thing or two from my English classes. Nice catch on the "characters" -> "code units". is it okay to use "bytes" for UTF-8 strings, or should we use "code units" there too? (Or perhaps "encoding units"?)
Attachment #125514 -
Flags: review?(jaggernaut) → review+
Comment 20•21 years ago
|
||
How about just 16-bit units in stead, that's what the W3C specs talk about when talking about UTF16 "characters".
Assignee | ||
Comment 21•21 years ago
|
||
Thank you for review and comment. Unicode 4.0 glossary (final draft) has the following (http://www.unicode.org/book/preview/glossary.pdf : I had to type this so that any typo is mine) Code Unit: The minimial bit combination that can represent a unit of encoded text for processing or interchange. The Unicode standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 ...... Given this, I'd rather use '16-bit code units' instead of '16-bit units'.(I'll insert '-' because I found that '-' is used in some other comments in string/) As for UTF-8, can we just get away with bytes instead of '8-bit code units'? There are some architectures on which a byte doesn't have 8 bits, but Mozilla doesn't run on them ...
Status: NEW → ASSIGNED
Assignee | ||
Comment 22•21 years ago
|
||
I just realized that my patch 'rewrites' old messages of scc in string-guide.html What shall I do? Can I 'html-ize' it somehow? The document is obsolete, anyway...
Assignee | ||
Comment 23•21 years ago
|
||
How about this? I thought abuot using <ins>, <del> or a couple of CSS tricks, but just adding a short note seems to be best. <p> Here are the email answers I have yet to format into the FAQ. Some of the URLs may be out-dated or moved. The messages are in order from oldest to newest. </p> + <p span="editnote">In June, 2003, these emails were modified + by Jungshik Shin to better reflect what is stored in 'wide' string + classes (UTF-16 string instead of UCS-2) and what + related methods do as a part of the patch for <a href= + "http://bugzilla.mozilla.org/show_bug.cgi?id=183156" + title="replace UCS2 in function/class/method names with UTF16">bug 183156</a>. + Therefore, they're a little different from the original emails + written by <a href="http://ScottCollins.net/">Scott Collins</a> </p>
Assignee | ||
Comment 24•21 years ago
|
||
Just to see how it works, I ran the following in the root dir. of my tree after updating my tree to the trunk (2003-06-14 around 6pm? US PDT) LC_ALL=C find . -name "*.h" -o -name "*.cpp" -o -name "*.idl" | \ xargs perl -i.ucs2_backup -p \ -e 's/(Copy|NS_LossyConvert|NS_Convert)UCS2to(UTF8|ASCII)/$1UTF16to$2/g, \ s/(Copy|NS_Convert)(ASCII|UTF8)toUCS2/$1$2toUTF16/g' (I could have used 'GNU sed 4.x' with '-i' option with '$1' replacing '\1' in RE or manually made back-up files. I also searched for strings to replace in *pl, *sh, and *py files that may auto-generate cpp/h files and there's none found.) The above took 86 sec (user time: system time was ~ 10 sec) on my box. Then, I clobbered and rebuilt(debug). Everything went well. I didn't enable everything and the test was done on Linux so that this test wouldn't have detected if there's something wrong in parts not built (non-Linux and not- enabled code paths/files). However, I think it's pretty safe to assume that it'll work on other platforms. After the replacement, I also searched '(NS_Conver|Cop).*UCS2' and found a typo in widget/src/photon/nsTextHelper.cpp http://lxr.mozilla.org/seamonkey/source/widget/src/photon/nsTextHelper.cpp#101 UTF8 was mistyped as 'UFT8' in that line.
Comment 25•21 years ago
|
||
the string doc is obsolete, but the changes you made are just fine - no need for any special ins/del tags or anything. CVS can keep track of document history.
Comment 26•21 years ago
|
||
the string doc is obsolete, but the changes you made are just fine - no need for any special ins/del tags or anything. CVS can keep track of document history.
Assignee | ||
Comment 27•21 years ago
|
||
Alec, can you put sr stamp for attachment 125514 [details] [diff] [review] as it is, then? (my point was that I feel a bit uncomfortable rewriting scc's email messaegs quoted verbatim in the document 3 years after they're written.)
Comment 28•21 years ago
|
||
Comment on attachment 125514 [details] [diff] [review] a new patch addressing jag's concerns sr=alecf - sorry, I didn't mean that it was delaying my sr, I just hadn't gotten around to it yet.
Attachment #125514 -
Flags: superreview?(alecf) → superreview+
Assignee | ||
Comment 29•21 years ago
|
||
Thanks, alecf. Fix got checked in several hours ago. Now we have to move on to next round. What newsgroups shall I cross-post the notice to? How about putting it up at tinderbox?
Comment 30•21 years ago
|
||
post to n.p.m.xpcom and n.p.m.announce - asa should approve it - a few weeks ago Asa and I agreed that using n.p.m.announce to announce major API changes, etc, was a good idea... this is a good place to start.
Assignee | ||
Comment 31•21 years ago
|
||
adding jag(string owner) and asa (for carpool) to CC
Assignee | ||
Comment 32•21 years ago
|
||
This is an additional patch to rename CopyUTF16toASCII as LossyCopyUTF16toASCII suggested by jag off-line. I think it's a good idea to make it in line with NS_LossyConvertUTF16toASCII.
Comment 33•21 years ago
|
||
Comment on attachment 126602 [details] [diff] [review] a patch : CopyUTF16toASCII -> LossyCopyUTF16toASCII r+sr=jag
Attachment #126602 -
Flags: superreview+
Attachment #126602 -
Flags: review+
Assignee | ||
Comment 34•21 years ago
|
||
note to myself: Camino has a bunch of '*.mm' files with '*UCS2to*' and '*toUCS2'. So I have to modify 'find statement' in comment #24 to include '*mm' files.
Comment 35•18 years ago
|
||
ok, this patch changes all callers to use the UTF16 functions and removes the UCS2 ones. the essential part here was done by: perl -pi -e 's/NS_ConvertASCIItoUCS2/NS_ConvertASCIItoUTF16/g; s/NS_ConvertUTF8toUCS2/NS_ConvertUTF8toUTF16/g; s/NS_LossyConvertUCS2toASCII/NS_LossyConvertUTF16toASCII/g; s/NS_ConvertUCS2toUTF8/NS_ConvertUTF16toUTF8/g; s/CopyASCIItoUCS2/CopyASCIItoUTF16/g; s/CopyUCS2toASCII/LossyCopyUTF16toASCII/g' $* on a clean MOZ_CO_MODULES=all tree.
Attachment #210428 -
Flags: superreview?(darin)
Attachment #210428 -
Flags: review?(darin)
Comment 36•18 years ago
|
||
I'll post an announcement to m.d.t.xpcom and m.d.general when checking this in.
Updated•18 years ago
|
Attachment #210428 -
Flags: superreview?(darin)
Attachment #210428 -
Flags: superreview+
Attachment #210428 -
Flags: review?(darin)
Attachment #210428 -
Flags: review+
Comment 37•18 years ago
|
||
checked in. that should make this bug fixed.
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Updated•3 years ago
|
Component: String → XPCOM
You need to log in
before you can comment on or make changes to this bug.
Description
•