Closed
Bug 183048
Opened 22 years ago
Closed 20 years ago
Ref. to UCS-2 have to be replaced by UTF-16 in String documents
Categories
(Developer Documentation Graveyard :: General, defect)
Developer Documentation Graveyard
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jshin1987, Assigned: jshin1987)
References
()
Details
Attachments
(1 file, 1 obsolete file)
35.07 KB,
patch
|
Details | Diff | Splinter Review |
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021008 Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021008 This is a spin-off from bug 182877. Currently, in many places at www.mozilla.org and in Mozilla source tree, UCS-2 is used where UTF-16 has to be used. As far as I know, Mozilla internal string representation always has been UTF-16 (not UCS-2). In the early days of Mozilla development, it's debated which internal representation to use for Unicode characters, UTF-8, UTF-16, UTF-32. Although I was for UTF-32, it's decided that UTF-16 be used because it takes less memory than UTF-32 for BMP characters and non-BMP characters can be supported without much difficulty when need for that arose. It was never UCS-2. Because UCS-2 cannot represent the full repertoire of Unicode/ISO 10646 ("20.1bit" character sets. U+000000 to U+10FFFF) and Mozilla as a web browser has to support the full repertoire of Unicode/ISO 10646. Until recently non-BMP codepoints didn't get filled and Mozilla developers could get away with a misconception that nsAString and related string classes are UCS-2. (Most, if not all, I18N people are aware of that, but some may not). This misconception is, unfortunately, strengthened by some unfortunate choices of function/class/method names (e.g. nsUTF8ToUCS2) and string-related documents refering to UCS-2 where UTF-16 has to be refered. As a web browser and application, it can be safely said that virtually all aspects of Mozilla have some I18N/L10N elements and string related functions/methods/class are arguably the most important element in I18N work. Therefore, even non-I18N engineers/contributors to Mozilla projects should be well informed that Mozilla's internal string representation is NOT UCS-2 BUT UTF-16. To faciliate this 'education', some string related documents and function/class/methods annotations have to be revised. For instance, it has to be clearly documented that 'Convert()' method for nsIUnicodeDecoder returns not the number of Unicode characters but the number of 'code units' (16bit PRUint16) in UTF-16. For BMP only strings, they're identical but if there are non-BMP characters in strings, they're different. In terms of functionality, I believe all of string conversion methods/class are UTF-16 clean (that is, methods/class/functions like nsUTF8ToUCS2 correctly work for strings with non-BMP chars.). In addition to documentation changes, it might be a good idea (some wouldn't agree...) to replace UCS2 in the names of fucntions/methods/classes with UTF16 unless they're specifically for UCS-2 only. Reproducible: Always Steps to Reproduce: 1. 2. 3.
Assignee | ||
Comment 1•22 years ago
|
||
Adding Boris (who raised the issue in bug 182877) and some people in I18N and string development. > All other classes simply have "String" in their name and refer to double-byte (PRUnichar) > strings, most commonly encoded in UCS2. For example: nsAFlatString is an abstract class for > storing double-byte characters, As an example of what can be clarified, the following may have to be added to the above. It has to be noted that using 'doublte-byte' for PRUnichar does not mean that 'String' classes contain only characters in BMP(Basic Multilingual Plane: U+0000 - U+FFFF). Neither does it mean that 'String' classes use UCS-2 as their 'encoding'. They can represent non-BMP characters (characters in plane 1 - plane 16, U+10000 - U+10FFFF) with two 'double byte characters'(a surrogate pair) representing a single non-BMP character. This way of representing Unicode characters is called UTF-16 which is compatible with UCS-2 as far as BMP character is concerned. However, this compatibility with UCS-2 should not be construed as an execuse to mistakenly believe that 'String' classes are always in UCS-2. Among implication of this is that in any method that deals with 'length' of String, 'length' or 'number of character' does not mean the number of Unicode characters but mean the number of UTF-16 code units (the number of 'double byte characters'). They're identical for BMP characters but are different for non-BMP characters. BTW, I filed this under doc. component because most , if not all, string related functions work correctly for UTF-16 although their names have 'UCS2'. For instance, I believe(haven't checked) nsConvertUTF8ToUCS2 return a surrogate pair given a NON-BMP character. I'll check it and others indeed handle UTF-16 correctly.
Comment 2•22 years ago
|
||
That seems like a reasonable approach if we also clarify the comments on nsIUnicodeDecoder/Encoder as to what "number of unicode chars" means for non-BMP things... Do we need a "number of chars" function in nsReadableUtils in addition to the .Count() we have on nsAString? Such a function would have to assume UCS-16 as the encoding, of course...
Comment 3•22 years ago
|
||
I'll help out in updating the string documents, but I really don't want to have too much about encoding details in strings. The string documents are more about how to properly use the classes than they are about conversion. I think a seperate appendix describing encoding and conversion would be good.
Comment 4•22 years ago
|
||
Erh, what .Count() on nsA{C}String? In fact, to avoid making (more) assumptions about the storage encoding of a nsA{C}String it'd be good to make UTF8Count(), UTF16Count() be global functions in nsReadableUtils.h (which we really should rename to nsStringUtils.h one of these days).
Assignee | ||
Comment 5•21 years ago
|
||
There are some rough edges to smooth out. Why don't you take a look and comment?
Comment 6•21 years ago
|
||
overall, these changes look good - I say lets do it.
Assignee | ||
Comment 7•20 years ago
|
||
I'm sorry I forgot about this for a while. The document has since been updated by Darin. Here's a new patch.
Assignee | ||
Comment 8•20 years ago
|
||
Comment on attachment 155168 [details] [diff] [review] update Somehow I don't get 'r/sr' request fields. Darin, can you take a look and see if it looks fine. If you're Ok, I'll check this in.
Comment 9•20 years ago
|
||
cc'ing darin so he actually knows about the last comment ;)
Assignee | ||
Comment 10•20 years ago
|
||
cbie didn't add darin to cc although he wrote he's doing that ;-) darin, can you take a look? I've seen people still use 'UCS2' in new check-ins. (see also bug 183156). This change will cut down that usage
Comment 11•20 years ago
|
||
Comment on attachment 155168 [details] [diff] [review] update looks good to me.
Comment 12•20 years ago
|
||
> Somehow I don't get 'r/sr' request fields.
yeah, this module doesn't require 'r/sr' ... go ahead and check-in your changes :)
Assignee | ||
Comment 13•20 years ago
|
||
thanks for taking a look. I've just checked the patch in.
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Updated•20 years ago
|
Updated•18 years ago
|
Component: Mozilla Developer → Documentation Requests
Product: Documentation → Mozilla Developer Center
Updated•12 years ago
|
Component: Documentation Requests → Documentation
Updated•12 years ago
|
Component: Documentation → General
Product: Mozilla Developer Network → Developer Documentation
You need to log in
before you can comment on or make changes to this bug.
Description
•