Closed Bug 183156 Opened 23 years ago Closed 20 years ago

replace 'UCS2' in function/class/method names with 'UTF16' and add some new suppl. methods(if necessary)

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: jshin1987, Assigned: jshin1987)

Details

Attachments

(4 files, 2 obsolete files)

a patch as just mentioned 22 years ago Jungshik Shin 2.64 KB, patch	peterv : review+ alecf : superreview+	Details \| Diff \| Splinter Review
a patch for string/ 22 years ago Jungshik Shin 25.65 KB, patch	alecf : superreview+	Details \| Diff \| Splinter Review
a new patch (against the current trunk : adapting to ToNewUTF8 patch by dbaron) 22 years ago Jungshik Shin 29.93 KB, patch	jag+mozilla : review-	Details \| Diff \| Splinter Review
a new patch addressing jag's concerns 22 years ago Jungshik Shin 34.55 KB, patch	jag+mozilla : review+ alecf : superreview+	Details \| Diff \| Splinter Review
a patch : CopyUTF16toASCII -> LossyCopyUTF16toASCII 22 years ago Jungshik Shin 1.82 KB, patch	jag+mozilla : review+ jag+mozilla : superreview+	Details \| Diff \| Splinter Review
replace UCS2 with UTF16 20 years ago Christian :Biesinger (don't email me, ping me on IRC) 917.21 KB, patch	darin.moz : review+ darin.moz : superreview+	Details \| Diff \| Splinter Review

Jungshik Shin

Assignee

Description

•

23 years ago

User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021008 Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021008 A spin-off from bug 183048 and bug 182877. Mozilla's internal string representation has always been UTF-16 (not UCS-2), but functions/classes/methods that deal with UTF-16 are all named with UCS2. This is pretty well known to I18N people, but some new developers or those who don't work usually on I18N aspect may get a false notion that |String| classes only contain UCS-2. To dispell this misconception, we need to rename all |*UCS2*| |*UTF16*| along with changes in documentation. I'm not sure of how to proceed. Definining macros to map old name to new names is one way, but could well worsen the situation because then it may be thought even by those familiar with Unicode that functions with UCS2 in their names are different from (work differently from) functions with UTF16 in their names. I don't know how 'relicencing' (which requires changing every single file in Mozilla tree) was done. Can that method be applied here? Reproducible: Always Steps to Reproduce: 1. 2. 3.

Boris Zbarsky [:bzbarsky]

Comment 1

•

23 years ago

Confirming. ;)

Status: UNCONFIRMED → NEW

Ever confirmed: true

Jungshik Shin

Assignee

Comment 2

•

22 years ago

As a first step, I'm gonna change new-born (Copy|Append)UCS2toUTF8 and (Append|Copy)UTF8toUCS2 (bug 87677). Nobody has used them so far.

Assignee: jaggernaut → jshin

Jungshik Shin

Assignee

Comment 3

•

22 years ago

Attached patch a patch as just mentioned — Details — Splinter Review

Jungshik Shin

Assignee

Comment 4

•

22 years ago

Comment on attachment 124971 [details] [diff] [review] a patch as just mentioned asking for r/sr.

Attachment #124971 - Flags: superreview?(alecf)

Attachment #124971 - Flags: review?(jaggernaut)

Alec Flett

Comment 5

•

22 years ago

Comment on attachment 124971 [details] [diff] [review] a patch as just mentioned sr=alecf

Attachment #124971 - Flags: superreview?(alecf) → superreview+

Peter Van der Beken [:peterv]

Updated

•

22 years ago

Attachment #124971 - Flags: review?(jaggernaut) → review+

jag (Peter Annema)

Comment 6

•

22 years ago

I wouldn't say that Mozilla's internal string representation has always been UTF-16. There was a time where, in converting from UTF8, we would ignore anything that would result in a character that would take more than 2 bytes to store. I do however agree with this bug, we should move to naming our classes |*UTF16*| instead of |*UCS2*|, since that is the case now. Begin by making the changes inside string/, but keeping typedefs / inline forwarders for the old classes and functions. Once that is landed, put a notification up on tinderbox and in newsgroups that we're going to make this change and people should start using these new names in their new code / fix 'em up as they touch code, and that we'll do a tree wide sweep a week after that (we'll have to coordinate this with drivers). You could do the tree wide sweep using some sed inline replace script(s), dbaron might be able to help you with that. If not, I could have some fun with this, it's been a while :-)

Jungshik Shin

Assignee

Comment 7

•

22 years ago

Fix checked into trunk. Thanks for r/sr and for the tip. I'll begin to work in string/.

Jungshik Shin

Assignee

Comment 8

•

22 years ago

Attached patch a patch for string/ (obsolete) — Details — Splinter Review

the second step ! a patch for string/.

Jungshik Shin

Assignee

Comment 9

•

22 years ago

Comment on attachment 125162 [details] [diff] [review] a patch for string/ asking r/sr. I've made a debug build on Win32 with this patch and it works fine.

Attachment #125162 - Flags: superreview?(alecf)

Attachment #125162 - Flags: review?(peterv)

Alec Flett

Comment 10

•

22 years ago

Comment on attachment 125162 [details] [diff] [review] a patch for string/ in NS_ConvertToString() - nsAutoString aUCS2string; - aUCS2string.AssignWithConversion(anASCIIstring); - return aUCS2string; + nsAutoString aUTF16string; + aUTF16string.AssignWithConversion(anASCIIstring); + return aUTF16string; this looks like the old code was named wrong anyway.. either way this document is looking older and cruftier every day :) As for the rest of the patch - it looks ok, but I'd want to make sure that stuff like the typedefs from the NS_Convert* classes actually work on all 3 major platforms... so please test this on more than just win32. sr=alecf but you need the string module owner's review, not peterv's..

Attachment #125162 - Flags: superreview?(alecf)

Attachment #125162 - Flags: superreview+

Attachment #125162 - Flags: review?(peterv)

Attachment #125162 - Flags: review?(jaggernaut)

Jungshik Shin

Assignee

Comment 11

•

22 years ago

Attached patch a new patch (against the current trunk : adapting to ToNewUTF8 patch by dbaron) (obsolete) — Details — Splinter Review

alecf, thanks for sr. I made a new patch because dbaron's ToNewUTF8 patch moved things around in string. There's nothing new (in terms of actual content). Can you transfer your sr to this patch? I've just tested it on linux (RedHat 8.0, g++ 3.2) and it works fine. To make sure, I clobbered and am now rebuilding it with the latest trunk on Linux. Windows and Linux (and hopefully other Unix compilers wouldn't be different in handling 'typedef'ed class'. ...) Who is the best person to ask it for testing on Mac, sfraser? I'll leave out the part of the obsolete string-guide that talks about old functions. BTW, why don't you check in your new doc in the tree? Or, is it not done on purpose? Sometimes, off-line doc. may be useful (when one's net connection is down :-))

Attachment #125162 - Attachment is obsolete: true

Jungshik Shin

Assignee

Comment 12

•

22 years ago

adding mkaply and sfraser to ask for help with OS/2 and Mac. Can you test attachment 125407 [details] [diff] [review] on OS/2 and Mac (classic and OS X)? timeless, can you test it under BeOS? Thank you.

Simon Fraser [no longer active]

Comment 13

•

22 years ago

Sorry, I don't have cycles to test on Mac.

jag (Peter Annema)

Comment 14

•

22 years ago

Comment on attachment 125407 [details] [diff] [review] a new patch (against the current trunk : adapting to ToNewUTF8 patch by dbaron) The typedef thing should work, we've done it before with typedef |const nsAString nsAReadableString|. Sorry for not getting to this earlier. I had a partial review comment ready but got pulled into something else. >Index: string/obsolete/nsString.h >=================================================================== >@@ -422,32 +422,32 @@ > // NS_DEF_DERIVED_STRING_OPERATOR_PLUS(nsCAutoString, char) > > /** >- * A helper class that converts a UCS2 string to UTF8 >+ * A helper class that converts a UTF-16 string to UTF8 Should we scan our string code comments and docs to make sure we use "UTF-8" everywhere? I didn't see |typedef NS_ConvertUTF16toUTF8 NS_ConvertUCS2toUTF8| Should we perhaps move both typedefs all the way down in the header in a /* Backward compatibility */ section? >Index: string/public/nsReadableUtils.h >=================================================================== >@@ -82,10 +89,10 @@ > * Returns a new |char| buffer containing a zero-terminated copy of |aSource|. > * > * Allocates and returns a new |char| buffer which you must free with |nsMemory::Free|. >- * Performs a encoding conversion by converting 16-bit wide characters down to UTF8 encoded 8-bits wide string copying |aSource| to your new buffer. >+ * Performs a encoding conversion from UTF-16 string made of PRUnichar's to UTF-8 made of char's copying |aSource| to your new buffer. "Performs an encoding ... an UTF-16 ... PRUnichars ... an UTF-8 string ... chars ..." And please wrap that line to 80 characters. I'm wondering if we should leave out the "made of PRUnichars" and "made of chars". It's not relevant what data type we internally use to store the characters in. > * The new buffer is zero-terminated, but that may not help you if |aSource| contains embedded nulls. > * >- * @param aSource a 16-bit wide string >+ * @param aSource a UTF-16 string (made of PRUnichar's) "an UTF-16" Index: string/public/nsUTF8Utils.h =================================================================== @@ -206,6 +206,8 @@ PRBool mErrorEncountered; }; +typedef ConvertUTF8toUTF16 ConvertUTF8toUCS2; + /** * A character sink (see |copy_string| in nsAlgorithm.h) for computing * the length of a UTF-8 string. This typedef should be unnecessary if you fix the one use of it in /xpcom/io/nsUnicharInputStream.cpp >Index: string/doc/string-guide.html >=================================================================== >RCS file: /cvsroot/mozilla/string/doc/string-guide.html,v >retrieving revision 1.11 >diff -u -r1.11 string-guide.html >--- string/doc/string-guide.html 11 Mar 2003 23:26:18 -0000 1.11 >+++ string/doc/string-guide.html 11 Jun 2003 07:57:25 -0000 >+or char string literals. In UTF-16, characters occupy one 16bit code unit ( BMP character ) or two 16bit code units >+(non-BMP characters). Inconsistent spacing near "()" (I prefer no extra space). Should that be plural: "(BMP characters)" to match up with the other one? Should we explain what BMP stands for, or perhaps linkify these, or wrap them in <abbr title="explanation">BMP</abbr>?

Attachment #125407 - Flags: review-

timeless

Comment 15

•

22 years ago

> "an UTF-16" jag: actually our source tree uses "a utf" a lot more than "an utf" that's probably ok because the rule isn't strictly about which letter it is but more about which sound is present. 'a you tee eff'

Jungshik Shin

Assignee

Comment 16

•

22 years ago

Attached patch a new patch addressing jag's concerns — Details — Splinter Review

I removed 'typedef for ConvertUTF8toUCS2' as suggested and took care of other issues raised. > I didn't see |typedef NS_ConvertUTF16toUTF8 NS_ConvertUCS2toUTF8| Ooops. That's dropped off while hand-editing my patch after testing but before uploading. I meant to get rid of ConvertUTF16toUTF8 (that is not used anywhere outside string/), but deleted NS_ConvertUTF16toUTF8, instead. > more about which sound is present. 'a you tee eff' That's what every foreign student taking TOEFL is expected to know by ETS :-)

Attachment #125407 - Attachment is obsolete: true

Jungshik Shin

Assignee

Comment 17

•

22 years ago

Comment on attachment 125514 [details] [diff] [review] a new patch addressing jag's concerns Just to save reviewers some time: diff. between this and the previous one is mostly along the line of jag's comment. In addition, I found a completely wrong description (apparently cut'n'pasted from the comment for ToNewCString) of ToNewUTF8String(aAString) and fixed it. I also removed typedef's for ConvertUTF8toUTF16 and ConvertUTF16toUTF8.

Attachment #125514 - Flags: superreview?(alecf)

Attachment #125514 - Flags: review?(jaggernaut)

Alec Flett

Comment 18

•

22 years ago

Comment on attachment 125162 [details] [diff] [review] a patch for string/ clearing obsolete reviews (but jag if you're reading this, there is a newer version of this patch for you to review!)

Attachment #125162 - Flags: review?(jaggernaut)

jag (Peter Annema)

Comment 19

•

22 years ago

Comment on attachment 125514 [details] [diff] [review] a new patch addressing jag's concerns Heh, I could've sworn I learned it the "no exceptions" way. I guess I've forgotten a thing or two from my English classes. Nice catch on the "characters" -> "code units". is it okay to use "bytes" for UTF-8 strings, or should we use "code units" there too? (Or perhaps "encoding units"?)

Attachment #125514 - Flags: review?(jaggernaut) → review+

Johnny Stenback (:jst)

Comment 20

•

22 years ago

How about just 16-bit units in stead, that's what the W3C specs talk about when talking about UTF16 "characters".

Jungshik Shin

Assignee

Comment 21

•

22 years ago

Thank you for review and comment. Unicode 4.0 glossary (final draft) has the following (http://www.unicode.org/book/preview/glossary.pdf : I had to type this so that any typo is mine) Code Unit: The minimial bit combination that can represent a unit of encoded text for processing or interchange. The Unicode standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 ...... Given this, I'd rather use '16-bit code units' instead of '16-bit units'.(I'll insert '-' because I found that '-' is used in some other comments in string/) As for UTF-8, can we just get away with bytes instead of '8-bit code units'? There are some architectures on which a byte doesn't have 8 bits, but Mozilla doesn't run on them ...

Status: NEW → ASSIGNED

Jungshik Shin

Assignee

Comment 22

•

22 years ago

I just realized that my patch 'rewrites' old messages of scc in string-guide.html What shall I do? Can I 'html-ize' it somehow? The document is obsolete, anyway...

Jungshik Shin

Assignee

Comment 23

•

22 years ago

How about this? I thought abuot using <ins>, <del> or a couple of CSS tricks, but just adding a short note seems to be best. Here are the email answers I have yet to format into the FAQ. Some of the URLs may be out-dated or moved. The messages are in order from oldest to newest. + In June, 2003, these emails were modified + by Jungshik Shin to better reflect what is stored in 'wide' string + classes (UTF-16 string instead of UCS-2) and what + related methods do as a part of the patch for <a href= + "http://bugzilla.mozilla.org/show_bug.cgi?id=183156" + title="replace UCS2 in function/class/method names with UTF16">bug 183156</a>. + Therefore, they're a little different from the original emails + written by <a href="http://ScottCollins.net/">Scott Collins</a>

Jungshik Shin

Assignee

Comment 24

•

22 years ago

Just to see how it works, I ran the following in the root dir. of my tree after updating my tree to the trunk (2003-06-14 around 6pm? US PDT) LC_ALL=C find . -name "*.h" -o -name "*.cpp" -o -name "*.idl" | \ xargs perl -i.ucs2_backup -p \ -e 's/(Copy|NS_LossyConvert|NS_Convert)UCS2to(UTF8|ASCII)/$1UTF16to$2/g, \ s/(Copy|NS_Convert)(ASCII|UTF8)toUCS2/$1$2toUTF16/g' (I could have used 'GNU sed 4.x' with '-i' option with '$1' replacing '\1' in RE or manually made back-up files. I also searched for strings to replace in *pl, *sh, and *py files that may auto-generate cpp/h files and there's none found.) The above took 86 sec (user time: system time was ~ 10 sec) on my box. Then, I clobbered and rebuilt(debug). Everything went well. I didn't enable everything and the test was done on Linux so that this test wouldn't have detected if there's something wrong in parts not built (non-Linux and not- enabled code paths/files). However, I think it's pretty safe to assume that it'll work on other platforms. After the replacement, I also searched '(NS_Conver|Cop).*UCS2' and found a typo in widget/src/photon/nsTextHelper.cpp http://lxr.mozilla.org/seamonkey/source/widget/src/photon/nsTextHelper.cpp#101 UTF8 was mistyped as 'UFT8' in that line.

Alec Flett

Comment 25

•

22 years ago

the string doc is obsolete, but the changes you made are just fine - no need for any special ins/del tags or anything. CVS can keep track of document history.

Alec Flett

Comment 26

•

22 years ago

the string doc is obsolete, but the changes you made are just fine - no need for any special ins/del tags or anything. CVS can keep track of document history.

Jungshik Shin

Assignee

Comment 27

•

22 years ago

Alec, can you put sr stamp for attachment 125514 [details] [diff] [review] as it is, then? (my point was that I feel a bit uncomfortable rewriting scc's email messaegs quoted verbatim in the document 3 years after they're written.)

Alec Flett

Comment 28

•

22 years ago

Comment on attachment 125514 [details] [diff] [review] a new patch addressing jag's concerns sr=alecf - sorry, I didn't mean that it was delaying my sr, I just hadn't gotten around to it yet.

Attachment #125514 - Flags: superreview?(alecf) → superreview+

Jungshik Shin

Assignee

Comment 29

•

22 years ago

Thanks, alecf. Fix got checked in several hours ago. Now we have to move on to next round. What newsgroups shall I cross-post the notice to? How about putting it up at tinderbox?

Alec Flett

Comment 30

•

22 years ago

post to n.p.m.xpcom and n.p.m.announce - asa should approve it - a few weeks ago Asa and I agreed that using n.p.m.announce to announce major API changes, etc, was a good idea... this is a good place to start.

Jungshik Shin

Assignee

Comment 31

•

22 years ago

adding jag(string owner) and asa (for carpool) to CC

Jungshik Shin

Assignee

Comment 32

•

22 years ago

Attached patch a patch : CopyUTF16toASCII -> LossyCopyUTF16toASCII — Details — Splinter Review

This is an additional patch to rename CopyUTF16toASCII as LossyCopyUTF16toASCII suggested by jag off-line. I think it's a good idea to make it in line with NS_LossyConvertUTF16toASCII.

jag (Peter Annema)

Comment 33

•

22 years ago

Comment on attachment 126602 [details] [diff] [review] a patch : CopyUTF16toASCII -> LossyCopyUTF16toASCII r+sr=jag

Attachment #126602 - Flags: superreview+

Attachment #126602 - Flags: review+

Jungshik Shin

Assignee

Comment 34

•

22 years ago

note to myself: Camino has a bunch of '*.mm' files with '*UCS2to*' and '*toUCS2'. So I have to modify 'find statement' in comment #24 to include '*mm' files.

Christian :Biesinger (don't email me, ping me on IRC)

Comment 35

•

20 years ago

Attached patch replace *UCS2* with *UTF16* — Details — Splinter Review

ok, this patch changes all callers to use the UTF16 functions and removes the UCS2 ones. the essential part here was done by: perl -pi -e 's/NS_ConvertASCIItoUCS2/NS_ConvertASCIItoUTF16/g; s/NS_ConvertUTF8toUCS2/NS_ConvertUTF8toUTF16/g; s/NS_LossyConvertUCS2toASCII/NS_LossyConvertUTF16toASCII/g; s/NS_ConvertUCS2toUTF8/NS_ConvertUTF16toUTF8/g; s/CopyASCIItoUCS2/CopyASCIItoUTF16/g; s/CopyUCS2toASCII/LossyCopyUTF16toASCII/g' $* on a clean MOZ_CO_MODULES=all tree.

Attachment #210428 - Flags: superreview?(darin)

Attachment #210428 - Flags: review?(darin)

Christian :Biesinger (don't email me, ping me on IRC)

Comment 36

•

20 years ago

I'll post an announcement to m.d.t.xpcom and m.d.general when checking this in.

Darin Fisher

Updated

•

20 years ago

Attachment #210428 - Flags: superreview?(darin)

Attachment #210428 - Flags: superreview+

Attachment #210428 - Flags: review?(darin)

Attachment #210428 - Flags: review+

Christian :Biesinger (don't email me, ping me on IRC)

Comment 37

•

20 years ago

checked in. that should make this bug fixed.

Status: ASSIGNED → RESOLVED

Closed: 20 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

5 years ago

Component: String → XPCOM

You need to log in before you can comment on or make changes to this bug.