Closed Bug 183048 (opened 22 years ago, closed 20 years ago)

Ref. to UCS-2 have to be replaced by UTF-16 in String documents

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: jshin1987, Assigned: jshin1987)

Attachments

(1 file, 1 obsolete file)

draft, 21 years ago, Jungshik Shin, 32.27 KB, patch
update, 20 years ago, Jungshik Shin, 35.07 KB, patch

Jungshik Shin

Assignee

Description

22 years ago

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021008
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021008

This is a spin-off from bug 182877. Currently, in many places at www.mozilla.org
and in the Mozilla source tree, UCS-2 is used where UTF-16 should be used.

As far as I know, Mozilla's internal string representation has always been UTF-16
(not UCS-2). In the early days of Mozilla development, it was debated which
internal representation to use for Unicode characters: UTF-8, UTF-16, or UTF-32.
Although I was for UTF-32, it was decided that UTF-16 be used because
it takes less memory than UTF-32 for BMP characters, and non-BMP characters
could be supported without much difficulty when the need for that arose.
It was never UCS-2, because UCS-2 cannot represent the full repertoire of
Unicode/ISO 10646 (a "20.1-bit" character set, U+0000 to U+10FFFF) and
Mozilla as a web browser has to support the full repertoire of Unicode/ISO 10646.

Until recently, non-BMP code points had not been assigned, so Mozilla
developers could get away with the misconception that nsAString and related
string classes are UCS-2. (Most, if not all, I18N people are aware that they
are not, but others may not be.) This misconception is, unfortunately,
reinforced by some unfortunate choices of function/class/method
names (e.g. nsUTF8ToUCS2) and by string-related documents
referring to UCS-2 where UTF-16 is meant.

Because Mozilla is a web browser and application, it can safely be said that
virtually all aspects of it have some I18N/L10N element, and
string-related functions/methods/classes are arguably the most
important element in I18N work. Therefore,
even non-I18N engineers/contributors to Mozilla projects should
be well informed that Mozilla's internal string representation
is NOT UCS-2 BUT UTF-16.

To facilitate this 'education', some string-related documents
and function/class/method annotations have to be revised.
For instance, it has to be clearly documented that the
'Convert()' method of nsIUnicodeDecoder returns
not the number of Unicode characters but the number
of 'code units' (16-bit PRUint16) in UTF-16. For BMP-only
strings the two are identical, but if there are non-BMP characters
in a string, they differ. In terms of functionality,
I believe all of the string conversion methods/classes are UTF-16
clean (that is, methods/classes/functions like
nsUTF8ToUCS2 work correctly for strings with non-BMP chars).
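
To make the code-unit versus character distinction concrete, here is a rough
sketch in standard C++. It is not Mozilla code: std::u16string stands in for
the Mozilla string classes, and CountCodePoints is a made-up name used only
for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count Unicode characters (code points) in a UTF-16 string.
// A well-formed surrogate pair (high unit followed by low unit) counts once;
// an unpaired surrogate is counted as one character.
// NOTE: CountCodePoints is a hypothetical helper, not an existing Mozilla API.
std::size_t CountCodePoints(const std::u16string& s) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < s.size(); ++i) {
    const char16_t unit = s[i];
    const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    if (isHigh && i + 1 < s.size() &&
        s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
      ++i;  // skip the low surrogate: the pair is a single character
    }
    ++count;
  }
  return count;
}
```

For a string such as u"a\U00010400", size() reports three code units while
CountCodePoints reports two characters. For BMP-only strings the two numbers
agree.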

In addition to documentation changes, it might be a good idea
(some wouldn't agree...) to replace UCS2 in the names of
functions/methods/classes with UTF16 unless they're specifically
for UCS-2 only.
 

Reproducible: Always

Jungshik Shin

Assignee

Comment 1

22 years ago

Adding Boris (who raised the issue in bug 182877) and some people in I18N and
string development. 

> All other classes simply have "String" in their name and refer to double-byte
> (PRUnichar) strings, most commonly encoded in UCS2. For example: nsAFlatString
> is an abstract class for storing double-byte characters,

As an example of what can be clarified, the following may have to be added to
the above.

It has to be noted that using 'double-byte' for PRUnichar does not mean that
'String' classes contain only characters in the BMP (Basic Multilingual Plane:
U+0000 - U+FFFF). Neither does it mean that 'String' classes use UCS-2 as their
'encoding'. They can represent non-BMP characters (characters in plane 1 -
plane 16, U+10000 - U+10FFFF), with two 'double-byte characters' (a surrogate
pair) representing a single non-BMP character. This way of representing
Unicode characters is called UTF-16, which is compatible with UCS-2
as far as BMP characters are concerned. However, this compatibility
with UCS-2 should not be construed as an excuse to mistakenly
believe that 'String' classes are always in UCS-2. One implication
of this is that, in any method that deals with the 'length' of a String,
'length' or 'number of characters' does not mean the number
of Unicode characters but the number of UTF-16 code units
(the number of 'double-byte characters'). The two are identical
for BMP characters but differ for non-BMP characters.
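
The surrogate-pair arithmetic can be sketched in a few lines of standard C++.
This is only an illustration, not code from the tree: AppendUtf16 is a
made-up name, and char16_t/std::u16string stand in for PRUnichar and the
'String' classes.

```cpp
#include <cassert>
#include <string>

// Append the UTF-16 encoding of one Unicode scalar value to `out`.
// Returns the number of 16-bit code units appended (1 or 2), or 0 if `cp`
// is not a valid scalar value (a surrogate or above U+10FFFF).
// NOTE: AppendUtf16 is a hypothetical helper used only for illustration.
int AppendUtf16(char32_t cp, std::u16string& out) {
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
  if (cp < 0x10000) {
    // BMP character: one code unit, identical to its UCS-2 value.
    out.push_back(static_cast<char16_t>(cp));
    return 1;
  }
  // Non-BMP character: subtract 0x10000, leaving a 20-bit value, then put
  // the top 10 bits in a high surrogate and the low 10 bits in a low one.
  const char32_t v = cp - 0x10000;
  out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
  out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
  return 2;
}
```

With this sketch, U+0041 appends the single unit 0x0041, while U+10400
appends the pair 0xD801 0xDC00, so a string holding just that one character
has a 'length' of 2.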

BTW, I filed this under the doc. component because most, if not all,
string-related functions work correctly for UTF-16 although
their names have 'UCS2'. For instance, I believe (haven't checked) that
nsConvertUTF8ToUCS2 returns a surrogate pair
given a non-BMP character. I'll check that it and the others
do indeed handle UTF-16 correctly.
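
For reference, this is roughly what a UTF-16 clean converter is expected to
do with a 4-byte UTF-8 sequence. The sketch below is standalone standard C++
written for this comment, not the actual Mozilla converter; Utf8ToUtf16 is a
hypothetical name, and whether the real classes behave this way still needs
to be verified as noted above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Convert UTF-8 to UTF-16, appending to `out`.
// Returns false on malformed input (`out` may then hold a partial result).
// NOTE: hypothetical helper for illustration, not a Mozilla API.
bool Utf8ToUtf16(const std::string& in, std::u16string& out) {
  // Smallest code point that legitimately needs a sequence of each length.
  static const char32_t kMin[5] = {0, 0, 0x80, 0x800, 0x10000};
  std::size_t i = 0;
  while (i < in.size()) {
    const unsigned char b0 = static_cast<unsigned char>(in[i]);
    char32_t cp;
    std::size_t len;
    if (b0 < 0x80)                { cp = b0;        len = 1; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
    else return false;  // stray continuation byte or invalid lead byte
    if (i + len > in.size()) return false;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char b = static_cast<unsigned char>(in[i + k]);
      if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong forms, surrogates, and values above U+10FFFF.
    if (cp < kMin[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      return false;
    }
    if (cp < 0x10000) {
      out.push_back(static_cast<char16_t>(cp));
    } else {
      const char32_t v = cp - 0x10000;  // non-BMP: emit a surrogate pair
      out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    i += len;
  }
  return true;
}
```

Feeding it the bytes F0 90 90 80 (U+10400 in UTF-8) yields the two code units
0xD801 0xDC00. A UCS-2-only converter would have no way to represent that
character.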

Boris Zbarsky [:bzbarsky]

Comment 2

22 years ago

That seems like a reasonable approach if we also clarify the comments on
nsIUnicodeDecoder/Encoder as to what "number of unicode chars" means for non-BMP
things...

Do we need a "number of chars" function in nsReadableUtils in addition to the
.Count() we have on nsAString?  Such a function would have to assume UTF-16 as
the encoding, of course...

Alec Flett

Comment 3

22 years ago

I'll help out in updating the string documents, but I really don't want to have
too much about encoding details in strings. The string documents are more about
how to properly use the classes than they are about conversion. I think a
separate appendix describing encoding and conversion would be good.

jag (Peter Annema)

Comment 4

22 years ago

Erh, what .Count() on nsA{C}String?

In fact, to avoid making (more) assumptions about the storage encoding of an
nsA{C}String it'd be good to make UTF8Count() and UTF16Count() global functions
in nsReadableUtils.h (which we really should rename to nsStringUtils.h one of
these days).
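
As a rough idea of what such a free function could look like (a sketch only:
the signature is a guess, std::string stands in for nsACString, and the input
is assumed to be well-formed UTF-8):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical free function: count the characters in a string that is
// assumed to hold well-formed UTF-8. Every character contributes exactly one
// byte that is not a continuation byte (10xxxxxx), so counting those bytes
// gives the character count without decoding anything.
std::size_t UTF8Count(const std::string& s) {
  std::size_t count = 0;
  for (unsigned char b : s) {
    if ((b & 0xC0) != 0x80) ++count;
  }
  return count;
}
```

Keeping it a global function means the string class itself never has to
claim an encoding; the caller states the assumption by choosing which
counting function to call. A UTF16Count() counterpart could follow the same
idea by not counting the low surrogate of each pair.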

Jungshik Shin

Assignee

Comment 5

21 years ago

Attached patch: draft (obsolete)

There are some rough edges to smooth out. Why don't you take a look and
comment?

Alec Flett

Comment 6

21 years ago

Overall, these changes look good. I say let's do it.

Jungshik Shin

Assignee

Comment 7

20 years ago

Attached patch: update

I'm sorry I forgot about this for a while. The document has since been updated
by Darin. Here's a new patch.

Assignee: endico → jshin

Attachment #139377 - Attachment is obsolete: true

Status: NEW → ASSIGNED

Jungshik Shin

Assignee

Comment 8

20 years ago

Comment on attachment 155168 (update)

Somehow I don't get 'r/sr' request fields.
Darin, can you take a look and see if it looks fine? If you're OK with it, I'll
check this in.

Christian :Biesinger (don't email me, ping me on IRC)

Comment 9

20 years ago

cc'ing darin so he actually knows about the last comment ;)

Jungshik Shin

Assignee

Comment 10

20 years ago

cbie didn't add darin to cc although he wrote he's doing that ;-)

Darin, can you take a look? I've seen people still use 'UCS2' in new check-ins
(see also bug 183156). This change will cut down on that usage.

Darin Fisher

Comment 11

20 years ago

Comment on attachment 155168 (update)

looks good to me.

Darin Fisher

Comment 12

20 years ago

> Somehow I don't get 'r/sr' request fields. 

yeah, this module doesn't require 'r/sr' ... go ahead and check-in your changes :)

Jungshik Shin

Assignee

Comment 13

20 years ago

thanks for taking a look. I've just checked the patch in.

Status: ASSIGNED → RESOLVED

Closed: 20 years ago

Resolution: --- → FIXED

Christian :Biesinger (don't email me, ping me on IRC)

Updated

20 years ago

URL: http://www.mozilla.org/projects/xpcom...

Reed Loden [:reed]

Updated

18 years ago

Component: Mozilla Developer → Documentation Requests

Product: Documentation → Mozilla Developer Center

Nobody; OK to take it and work on it

Updated

12 years ago

Component: Documentation Requests → Documentation

John Karahalis [:openjck]

Updated

12 years ago

Component: Documentation → General

Product: Mozilla Developer Network → Developer Documentation
