Ref. to UCS-2 have to be replaced by UTF-16 in String documents



16 years ago
6 years ago


(Reporter: jshin1987, Assigned: jshin1987)





(1 attachment, 1 obsolete attachment)

35.07 KB, patch
Details | Diff | Splinter Review


16 years ago
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021008
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021008

This is a spin-off from bug 182877. Currently, in many places at
and in Mozilla source tree, UCS-2 is used where UTF-16 has to be used. 

As far as I know, Mozilla internal string representation always has been UTF-16
(not UCS-2). In the early days of Mozilla development,  it's debated which
internal representation to use for Unicode characters, UTF-8, UTF-16, UTF-32. 
Although I was for UTF-32, it's decided that UTF-16 be used because
it takes less memory than UTF-32 for BMP characters and non-BMP characters
can be supported without much difficulty when need for that arose. 
It was never UCS-2. Because UCS-2 cannot represent the full repertoire of
Unicode/ISO 10646 ("20.1bit" character sets. U+000000 to U+10FFFF) and
Mozilla as a web browser has to support the full repertoire of Unicode/ISO 10646.

Until recently non-BMP codepoints didn't get filled and Mozilla
developers could get away with a misconception that nsAString and related
string classes are UCS-2.    (Most, if not all, I18N people are aware of
that, but some may not). This misconception is, unfortunately,
strengthened by some unfortunate choices of function/class/method
names (e.g. nsUTF8ToUCS2) and string-related documents
refering to UCS-2 where UTF-16 has to be refered. 

As a web browser and application, it can be safely said that virtually
all aspects of Mozilla have some I18N/L10N elements and
string related functions/methods/class are arguably the most
important element in I18N work.  Therefore,
even non-I18N engineers/contributors to Mozilla projects should
be well informed that Mozilla's internal string representation
is NOT UCS-2 BUT UTF-16.  

To faciliate this 'education', some string related documents
and function/class/methods annotations have to be revised.
For instance, it has to be clearly documented that 
'Convert()' method for nsIUnicodeDecoder returns
not the number of Unicode characters but the number
of 'code units' (16bit PRUint16) in UTF-16. For BMP only 
strings, they're identical but if there are non-BMP characters 
in strings, they're different.  In terms of functionality,
I believe all of string conversion methods/class are UTF-16
clean (that is, methods/class/functions like
 nsUTF8ToUCS2 correctly work for strings with  non-BMP chars.).

In addition to documentation changes, it might be a good idea
(some wouldn't agree...) to replace UCS2 in the names of
fucntions/methods/classes with UTF16 unless they're specifically
for UCS-2 only. 

Reproducible: Always

Steps to Reproduce:

Comment 1

16 years ago
Adding Boris (who raised the issue in bug 182877) and some people in I18N and
string development. 

> All other classes simply have "String" in their name and refer to double-byte
> strings, most commonly encoded in UCS2. For example: nsAFlatString is an
abstract class for 
> storing double-byte characters, 

As an example of what can be clarified, the following may have to be added to
the above.

 It has to be noted that  using 'doublte-byte' for PRUnichar does not mean that
classes contain only characters in BMP(Basic Multilingual Plane: U+0000 - U+FFFF).
Neither does it mean that 'String' classes use UCS-2 as their 'encoding'. 
 They can represent non-BMP characters (characters in plane 1 - plane 16,
U+10000 - U+10FFFF) with two 'double byte characters'(a surrogate
pair) representing a single non-BMP character.   This way of representing
Unicode characters is called UTF-16 which is compatible with UCS-2
as far as BMP character is concerned. However, this compatibility
with UCS-2 should not be construed as an execuse to mistakenly
believe that 'String' classes are always in UCS-2.  Among implication
of this is that in any method that deals with  'length' of String,
'length' or 'number of character' does not mean the number
of Unicode characters but mean the number of UTF-16 code units
(the number of 'double byte characters').    They're identical
for BMP characters but are different for non-BMP characters.

BTW, I filed this under doc. component because most , if not all,
string related functions work correctly for UTF-16 although
their names have 'UCS2'. For instance, I believe(haven't checked)
nsConvertUTF8ToUCS2 return a surrogate pair
given a NON-BMP character. I'll check it and others
indeed handle UTF-16 correctly. 
That seems like a reasonable approach if we also clarify the comments on
nsIUnicodeDecoder/Encoder as to what "number of unicode chars" means for non-BMP

Do we need a "number of chars" function in nsReadableUtils in addition to the
.Count() we have on nsAString?  Such a function would have to assume UCS-16 as
the encoding, of course...

Comment 3

16 years ago
I'll help out in updating the string documents, but I really don't want to have
too much about encoding details in strings. The string documents are more about
how to properly use the classes than they are about conversion. I think a
seperate appendix describing encoding and conversion would be good. 

Comment 4

16 years ago
Erh, what .Count() on nsA{C}String?

In fact, to avoid making (more) assumptions about the storage encoding of a
nsA{C}String it'd be good to make UTF8Count(), UTF16Count() be global functions
in nsReadableUtils.h (which we really should rename to nsStringUtils.h one of
these days).

Comment 5

15 years ago
Created attachment 139377 [details] [diff] [review]

There are some rough edges to smooth out. Why don't you take a look and

Comment 6

15 years ago
overall, these changes look good - I say lets do it.

Comment 7

14 years ago
Created attachment 155168 [details] [diff] [review]

I'm sorry I forgot about this for a while. The document has since been updated
by Darin. Here's a new patch.
Assignee: endico → jshin
Attachment #139377 - Attachment is obsolete: true

Comment 8

14 years ago
Comment on attachment 155168 [details] [diff] [review]

Somehow I don't get 'r/sr' request fields. 
Darin, can you take a look and see if it looks fine. If you're Ok, I'll check
this in.
cc'ing darin so he actually knows about the last comment ;)


Comment 10

14 years ago
cbie didn't add darin to cc although he wrote he's doing that ;-)

darin, can you take a look? I've seen people still use 'UCS2' in new check-ins.
(see also bug 183156). This change will cut down that usage 

Comment 11

14 years ago
Comment on attachment 155168 [details] [diff] [review]

looks good to me.

Comment 12

14 years ago
> Somehow I don't get 'r/sr' request fields. 

yeah, this module doesn't require 'r/sr' ... go ahead and check-in your changes :)

Comment 13

14 years ago
thanks for taking a look. I've just checked the patch in.
Last Resolved: 14 years ago
Resolution: --- → FIXED
Component: Mozilla Developer → Documentation Requests
Product: Documentation → Mozilla Developer Center
Component: Documentation Requests → Documentation
Product: Mozilla Developer Network → Mozilla Developer Network
Component: Documentation → General
Product: Mozilla Developer Network → Developer Documentation
You need to log in before you can comment on or make changes to this bug.