Closed Bug 206811 Opened 21 years ago Closed 21 years ago

xp_iconv in nsNativeCharsetUtils.cpp should use UTF-16 if available

Categories: Core :: XPCOM, defect
Severity: normal
Status: RESOLVED FIXED
Target Milestone: mozilla1.4final
People

(Reporter: jshin1987, Assigned: jshin1987)

References

Details

(Keywords: intl)

Attachments

(2 files, 5 obsolete files)

xp_iconv in nsNativeCharsetUtils.cpp uses UCS-2, but a PRUnichar array is not a
string of UCS-2 code points but a UTF-16 string. [1] Therefore, we should try
UTF-16 before trying UCS-2.
At least glibc, Solaris, HP/UX 10/11 and Tru64 [2] use 'UTF-16' for the
native-endian UTF-16. I haven't yet figured out what AIX and IRIX do about it. [3]

[1] 
http://lxr.mozilla.org/seamonkey/source/xpcom/io/nsNativeCharsetUtils.cpp#177
[2]
http://docs.sun.com/db/doc/806-5584/6jej8rb0u?a=view
http://h30097.www3.hp.com/docs/porting/tru64-to-hpux/CHP18NXX.HTM

[3]http://publibn.boulder.ibm.com/doc_link/en_US/a_doc_lib/aixbman/admnconc/convert.htm
Attached patch a patch (obsolete) — Splinter Review
Comment on attachment 124018 [details] [diff] [review]
a patch

>+static const char *UTF_16_NAMES[] = {
>+    "UTF-16",     
>     "UCS-2",
>     "UCS2",
>     "UCS_2",

this patch makes good sense to me, but do we need to worry about
other variants of UTF-16, such as UTF16, UTF_16, utf-16, utf16,
and utf_16?  i know it almost seems ridiculous, but i can't really
say for sure.

also, how about replacing all instances of "ucs2" with "utf16"
e.g., ucs2_to_isolatin1 -> utf16_to_isolatin1
As for replacing all instances of "ucs2" with "utf16", I'll do that.

As for UTF-16 variants, I'm not sure. It seems like every platform that supports
UTF-16 uses "UTF-16". However, I guess it doesn't hurt to try other variants
because it's done only once at the init, right? 
darin, 
Can you also comment on my posting 
'news://news.mozilla.org:119/bajml3$fvr1@ripley.netscape.com'? I thought I was
ringing a false alarm, but I just realized that Linux does not use iconv(3) but
uses wcrtomb and mbrtowc, assuming that wchar_t holds a UTF-16 string, of which
I'm not sure.
Status: NEW → ASSIGNED
actually, i'm not sure why we'd ever want to prefer wcrtomb/mbrtowc over iconv
on linux.  i know that under OSX, nl_langinfo does not seem to exist, but the
nsNativeCharsetUtils methods aren't invoked under OSX IIRC.
On Linux it is *never* true that wchar_t == UCS2 or UTF-16.  With the
-fshort-wchar option one can make gcc define wchar_t as a two-byte value but
this doesn't magically change the runtime.  In fact, when using this option and the
normal runtime for wide character handling you guarantee certain death for your
program.

Everybody who uses this option must be aware of this.  And s/he must be aware of
the fact that the fake wchar_t type so created violates ISO C.  UTF-16 is not a
valid encoding for wchar_t, and UCS2 is obviously insufficient.

If somebody insists on using -fshort-wchar s/he must make sure to never use any
of the wide character interfaces in the runtime.  It is possible to use iconv()
but even then the pseudo encoding name "WCHAR_T" must not be used.  It is then
necessary to explicitly use "UTF-16" or so.
Darin and I decided to switch over to iconv for Linux after discussing the
issue off-line (newsgroups/mail) as well as here with Ulrich and Bruno.
Attachment #124018 - Attachment is obsolete: true
Before leaving for the vacation, Darin wrote to me the following:
<quote>
jshin: i am going to be away on vacation for the next 2 weeks, in which
case i most likely will not get a chance to make this change in time for
1.4 final.  if you think we should try to make this change for 1.4 final
(sounds to me like we should), then you ought to get in touch with
drivers@mozilla.org to discuss it with them.  be sure to run some
performance tests (start up time is probably going to be most sensitive)
so they can be aware of all the implications of this change.  thanks!
</quote>

I agree with him that we should fix this both in trunk and 1.4f branch because
1.4f branch is likely to be a long-lived one. I measured the start-up time with
and without the patch (attachment 124092 [details] [diff] [review]) and found that the build with the
patch is slightly faster than the one without (0.095 vs 0.097 on a 10-run
average; the std. deviation is virtually zero unless I wait long enough between runs).
Attachment #124092 - Flags: superreview?(alecf)
Attachment #124092 - Flags: review?(ftang)
Comment on attachment 124092 [details] [diff] [review]
a new patch (including switching to iconv under Linux)

looks good. sr=alecf
Attachment #124092 - Flags: superreview?(alecf) → superreview+
Setting the target to 1.4final. Frank, can you review? 
Target Milestone: --- → mozilla1.4final
Comment on attachment 124092 [details] [diff] [review]
a new patch (including switching to iconv under Linux)

asking smontagu for r.
Attachment #124092 - Flags: review?(ftang) → review?(smontagu)
Comment on attachment 124092 [details] [diff] [review]
a new patch (including switching to iconv under Linux)

I can't say I know enough about iconv to review this properly, but since (from
comment 7) it looks like people who do know have signed off on the approach, I
am happy to give r=smontagu
Attachment #124092 - Flags: review?(smontagu) → review+
This appears to have increased codesize by 3.4k on various linux platforms. Was
this expected; are we statically linking the iconv libraries?
> This appears to have increased codesize by 3.4k on various linux platforms.

Thank you for alerting me about the codesize I missed. 

I realize that USE_ICONV has more code than USE_STDCONV. 

 We can cut the increase in half by NOT defining
ENABLE_UTF8_FALLBACK_SUPPORT. glibc's iconv(3) certainly doesn't need it; it's
there for iconv(3) on Solaris, which has to go through UTF-8 for most
combinations of to/from codesets.

Removing 'inline' from xp_iconv_reset helped very little (100 bytes), while
removing 'inline' from xp_iconv_open and xp_iconv didn't change the size at all
(they are too complex to be inlined with gcc 3.x at -O1).

I can squeeze out an additional ~300 bytes by removing unnecessary array elements. The
final numbers are:

 - old       : 898056 bytes
 - current   : 902504 bytes   : +4448
 - w/ patch  : 899528 bytes   : +1472

Attached patch a patch to reduce the size (obsolete) — Splinter Review
This patch doesn't look pretty because it's littered with '#if
!defined(__linux)'. Anyway, with this patch, I cut down the code size increase
(libxpcom.so on ix86) to 1.2kB.
I forgot to mention that I removed 'inline' in xp_iconv_reset, which saved me
only about 190 bytes. Because I removed the function call overhead on Linux in
other places (plain iconv_open() is used in place of xp_iconv_open), I don't
think there's any slow down in the startup. If it's a worry, perhaps we can just
put back inline for 190 byte size increase (on ix86, gcc 3.x, -O1...) 

BTW, I assumed that a Linux binary is always compiled AND run with glibc 2.x or
GNU libiconv, which is very unlikely to break and which enabled me to use
'UTF-16' and 'ISO-8859-1' directly in iconv_open(). Benjamin, I guess you're
concerned with embedded devices. Is my assumption valid there?
jshin, I am not actually an embedder. I was just made very conscious of codesize
issues while working with alecf, and looking at tbox it seemed a small change had
large codesize implications. If that was unexpected, I just wanted to bring it
to your attention. I don't have any particular need/desire for a fix.
But I'd like to see this fix go in if in fact the extra code isn't needed on
linux... however, is there any way to detect at runtime what systems do and
don't need it, as opposed to choosing linux vs. not linux? i.e. is there some
kind of test we can add to configure.in to detect this case?
> is there any way to detect at runtime what systems do and
> don't need it, as opposed to choosing linux vs. not linux? i.e. is 
> there some kind of test we can add to configure.in to detect this 
> case?

There may be. Adding seawood to CC.

We might be able to detect the codeset names for UTF-16, UTF-8, and ISO-8859-1
with some tweaking in configure.in. Then we could reduce the codesize on all
Unix(-like) platforms. When I raised this possibility for another program,
some people had reservations because what's detected at compile time doesn't
always hold at run time: iconv(3) is also available as a standalone libiconv,
and people often install it on non-Linux systems if they're not satisfied with
the stock version.

On Linux (barring possible exceptions on embedded devices unknown to me; adding
dougt for this), the possibility is almost zero because there's no reason to
replace the stock iconv(3) with something else. It just occurred to me that some
embedded (Linux) devices use a lightweight C library other than glibc. I'm not
sure whether it lacks iconv(3) altogether, in which case we don't have to worry
because 'HAVE_ICONV' is set to false. A tricky case is a vanilla iconv(3) that
doesn't support UTF-16.

BTW, Ulrich, is there any known glibc (configuration option) that doesn't
support UTF-16 but supports UCS-2? This question is also relevant to native
uconv (http://lxr.mozilla.org/mozilla/source/intl/uconv/native/).

If all this is regarded as too much headache for about 800 bytes, I think we
can just go with #ifdef'ing out the 'UTF-8 fallback' code for Linux. That'll save us
about 2.4kB (out of the 4.5kB introduced by my check-in) instead of 3.2kB.
I think this checkin caused blocker bug 208809
> BTW, Ulrich, is there any known glibc (configuration option) that doesn't
> support UTF-16 but supports UCS-2?

UCS-2 handling is built in libc, UTF-16 is a module which can be removed.
>If all these are regarded as too much headache for about 800bytes, I think we
>can just go with #ifdef'ing out 'UTF-8 fallback' code for Linux. That'll save us
>about 2.4kB (out of 4.5kB introduced by my check-in) instead of 3.2kB. 

yeah, something like that is probably sufficient.

as for the other tweaks, since you have done them... maybe worth making the
changes, but in general there are bigger fish to fry ;-)

i think ulrich's comment suggests that we should fallback on UCS-2 if no UTF-16
converter exists.
Blocks: 208809
This is the first of two patches. Here I'm explicitly checking whether we need
byte swapping and byte-swapping if necessary. I had to cast away constness,
which can be dangerous. I also disabled UTF8FALLBACK on Linux but had to put
back the dummy conversion for the BOM check, because glibc 2.2.x also puts a BOM
in front of UTF-16 output and it's also necessary to check the endianness of
UTF-16. With this, the size reduction compared with the trunk is 1.3kB.
Here I'm following Ulrich's suggestion in bug 208809. Instead of using UTF-16
(which is 'native' endian on recent glibc but is BE in old glibc regardless of
the native endianness), UTF-16LE/UTF-16BE are used depending on the endianness.
One problem with this is that UTF-16LE/BE are not supported by glibc 2.1.x (at
least they're not in glibc 2.1.3 on RH 6.2 ). As a fallback, I'm using
'UNICODELITTLE' and 'UNICODEBIG' that are present both in glibc 2.1.x and
recent glibc.

The size reduction is about 2.8kB. Although we can't support non-BMP on Linux
with old glibc with this patch, I think this patch is better than attachment
125292 [details] [diff] [review]. There are a couple of reasons. Firstly, I don't feel comfortable casting
away const (I guess it can crash Mozilla). Secondly, it's smaller and faster (no need
to byte-swap). With attachment 125292 [details] [diff] [review], we might be byte-swapping twice, in
glibc and in our code.

BTW, I blocked the dummy conversion to get rid of the BOM on Linux (because we
now explicitly specify the endianness), but that saves us only about 100 bytes.
To be safe, I'd rather keep that code in.
> Instead of using UTF-16
> (which is 'native' endian on recent glibc but is BE in old glibc regardless of
> the native endianness),

That's wrong.  If you use UTF-16 and you provide no BOM big endian is used.  But
it is not correct to use UTF-16 without BOM.  So using the explicit names is
*always* correct, regardless of platform.
> That's wrong.  If you use UTF-16 and you provide no BOM big endian is used. 
But it is not correct to use UTF-16 without BOM.  

  Let's put aside difference in interpreting UTF-16 and focus on how different
versions of glibc behave.  I didn't think about converting from UTF-16. 

1. In recent glibc, when converting to 'UTF-16', a BOM indicating the native
endianness is prepended to the output on the first invocation, and the *native*
endianness is used.

2. When converting from UTF-16, the absence of a BOM is regarded as indicating BE
regardless of the native endianness.

3. In old glibc, no BOM was prepended to the output, and BE (not the native
endianness) is used everywhere when converting to UTF-16.

#1 is what I observed and #3 was inferred from bz's test result in bug 208809. 

As for #2, is that what you meant? My test with glibc 2.2.93 on ix86 tells me a
different story. With or without BOM, "\x20\x21" is interpreted as U+2120 by
iconv when converting _from_ "UTF-16". That is, both '\xff\xfe\x20\x21' and
'\x20\x21' gave me the same result (interpreted as LE). Has there been a change
recently (2.3.x)?  

> So using the explicit names is *always* correct, regardless of 
> platform.

 No doubt about that. As I wrote, one problem with that is that UTF-16LE and
UTF-16BE are not available in old glibc. Anyway, using UNICODELITTLE and
UNICODEBIG as fallback at the cost of losing non-BMP support is reasonable.
Glibc has /usr/include/byteswap.h which has optimized swapping 
functions. You should use them where available.
Comment on attachment 125292 [details] [diff] [review]
a new patch to make it work on RH 6.2

This is neither well tested nor safe (it crashes on bz's RH 6.2). Let's go with the
other one.
Attachment #125292 - Attachment is obsolete: true
> using UNICODELITTLE and UNICODEBIG as fallback at the cost of 
> losing non-BMP support in old glibc is reasonable.

Darin (Simon, and Alec) , what do you think? I think attachment 125293 [details] [diff] [review] (except
for not so pretty '__linux') is a reasonable compromise. If you agree, pls give
r/sr to the patch so that I can remove the blocker (bug 208809).  
I've done some tests (direct and indirect) and found that UTF-16LE and UTF-16BE
work under glibc 2.2.x, Solaris 8 or newer, and Tru64 (Compaq - HP) V5. Under
Tru64 (both V4 and V5), the meaning of UTF-16 depends on the environment
variable ICONV_BYTEORDER, which is yet another reason to use UTF-16LE (because
alpha is LE). By default, 'UTF-16' on Tru64 is interpreted as LE (the platform
endianness). Unfortunately, Tru64 V4 doesn't support 'UTF-16LE', so it'd break
Mozilla if ICONV_BYTEORDER is set to 'big-endian', but there's not much we can
do other than providing a shell script wrapper. Solaris 7 and IRIX 6 (both BE;
MIPS can run as LE as well, but SGI machines are BE) interpret UTF-16 as BE
and emit UTF-16BE for 'UTF-16'.
As mentioned earlier, for glibc 2.1.x, we need to use UNICODELITTLE
and UNICODEBIG because 'UTF-16' is always interpreted as BE by glibc 2.1.x and
UTF-16LE/UTF-16BE are not supported. 

Based on these and the fact that we have used UCS-2 and variants so far (AIX,
HP/UX and other Unix), I think we can use the following for  
UTF_16_NAMES[]

+#if defined(IS_LITTLE_ENDIAN)
+    "UTF-16LE",
+#if defined(__linux)
+    "UNICODELITTLE",
+#endif
+    "UCS-2LE",
+#else
+    "UTF-16BE",
+#if defined(__linux)
+    "UNICODEBIG",
+#endif
+    "UCS-2BE",
+#endif
     "UTF-16",
     "UCS-2",
     "UCS2",
jshin: that sounds reasonable to me.
Attached patch a new patch doing what I wrote (obsolete) — Splinter Review
Darin, thanks. Here's the patch implementing that. Can I get review? Alec and
Simon, please feel free to review :-)

This is basically the same as attachment 125293 [details] [diff] [review] except that UTF16_NAMES[] is
changed as I wrote and I kept the block removing BOM in the first invocation of
iconv(3) for linux (that is, I got rid of '#if !defined(__linux)' around that
code.).
Attachment #124889 - Attachment is obsolete: true
Attachment #125293 - Attachment is obsolete: true
Just in case we need to refer to them later, let me add a note that I used
attachment 125288 [details] (to bug 208809) and attachment 125485 [details] (to bug 209048;
for Solaris, which doesn't like 4 NULLs in iconv, one needs to do what Mozilla does
in nsNativeCharsetUtils.cpp along with the 'const char **' change) to try iconv(3)
on a few platforms.
Comment on attachment 125560 [details] [diff] [review]
a new patch doing what I wrote


>+     * We also use it to determine whether or not 'UTF-16'/'UCS-2' is 
>+     * interpreted as native endian. If it's interpreted as BE on LE,
>+     * we need byte-swapping.

The above comment should have not been included.  I got rid of it in my tree.

I've been waiting for bug 209048 to be resolved. Now that it can be fixed by a
Solaris patch, I'm asking for r/sr.
Attachment #125560 - Flags: superreview?(darin)
Attachment #125560 - Flags: review?(smontagu)
Comment on attachment 125560 [details] [diff] [review]
a new patch doing what I wrote

>Index: xpcom/io/nsNativeCharsetUtils.cpp

>+// just in case... but we  know for sure that iconv(3) in glibc for Linux

nit:		    "but we know for"

>+// doesn't need this.
>+#if !defined(__linux)

is there a GLIBC specified #define that we should test instead?


>-// PRUnichar[] is NOT a UCS-2 array BUT for UTF-16 string. Therefore, we
>-// have to use UTF-16 with iconv(3) on platforms where it's supported.
>-// We could list 'UTF-16' name variants, but all platforms known (to me) to 
>-// support UTF-16 in iconv(3) uses 'UTF-16'. Let me know (jshin) if there's an
>-// exception. (bug 206811)
>+/* 


>+ * PRUnichar[] is NOT a UCS-2 array BUT for a UTF-16 string. Therefore, we

nit: PRUnichar[] is NOT a UCS-2 array BUT a UTF-16 array.

>+ * have to use UTF-16 with iconv(3) on platforms where it's supported.
>+ * However, the way UTF-16 and UCS-2 are  interpreted varies across platforms 

nit: However, the way UTF-16 and UCS-2 are interpreted varies across platforms 


>+ * and  implementations of iconv(3). On Tru64, it also depends on the environment
>+ * variable.  To avoid the trouble arising  from byte-swapping 

nit: variable.	To avoid the trouble arising from byte-swapping 


> static const char *UTF_16_NAMES[] = {
>+#if defined(IS_LITTLE_ENDIAN)
>+    "UTF-16LE",
>+#if defined(__linux)
>+    "UNICODELITTLE",
>+#endif
>+    "UCS-2LE",
...
>     "UTF-16",
>     "UCS-2",

so it's never the case that we want to try "UTF-16" before "UCS-2LE"?


>+     * the first use of the iconv converter. The same is the case 
>+     * case of glibc 2.2.9x and Tru64 V5 (see bug 208809)

nit: "case case"


>+     * when 'UTF-16' is used. However, we use 'UTF-16LE/BE' in both
>+     * cases, instead  so that we should be safe. But just in case...

nit:   * cases, instead so that we should be safe. But just in case...

>      * This dummy conversion gets rid of the BOMs and fixes bugid 153562.

nit: s/bugid/bug/
Attachment #125560 - Flags: superreview?(darin) → superreview-
> so it's never the case that we want to try "UTF-16" before "UCS-2LE"?

I guess what's in 125560 is not 100% correct, but safer than the following.    

--------------
static const char *UTF_16_NAMES[] = {
#if defined(IS_LITTLE_ENDIAN)
    "UTF-16LE",
#if defined(__linux)
    "UNICODELITTLE",
#endif
#else
    "UTF-16BE",
#if defined(__linux)  // or if defined(__GLIBC__)
    "UNICODEBIG",
#endif
#endif
    "UTF-16",
#if defined(IS_LITTLE_ENDIAN)
    "UCS-2LE",
#else
    "UCS-2BE",
#endif
    "UCS-2",
---------------

In reality, I think there's no difference unless there's an iconv implementation
that works the way glibc 2.1.x does. That is, we're in trouble if the
implementation interprets UTF-16 as UTF-16BE on a LE machine  but does not
recognize 'UTF-16LE'. Do you know of any implementation that does? 

Anyway, which one do you like?  

>>+#if !defined(__linux)

>is there a GLIBC specified #define that we should test instead?

  Perhaps '__GLIBC__' or '__GNU_LIBRARY__' (deprecated)? 
>In reality, I think there's no difference unless there's an iconv 
>implementation that works the way glibc 2.1.x does. That is, we're in trouble 
>if the implementation interprets UTF-16 as UTF-16BE on a LE machine  but does 
>not recognize 'UTF-16LE'. Do you know of any implementation that does? 

no, i don't know of any such implementations...


>Anyway, which one do you like?  

perhaps your first way was better since it'd be safer... i defer to your best
judgement.  we can of course always fix it if we discover a problem on some
platform ;-)


>>>+#if !defined(__linux)
>
>>is there a GLIBC specified #define that we should test instead?
>
>  Perhaps '__GLIBC__' or '__GNU_LIBRARY__' (deprecated)? 

yeah, let's use __GLIBC__ instead since it's possible someone on another OS
might be using glibc as well.
I decided to do the same as in attachment 125560 [details] [diff] [review]. So in terms of the actual
code, this is the same as attachment 125560 [details] [diff] [review] except that __GLIBC__ is used
instead of __linux.
Attachment #125560 - Attachment is obsolete: true
Attachment #125560 - Flags: review?(smontagu)
Attachment #127768 - Flags: superreview?(darin)
Attachment #127768 - Flags: review?(smontagu)
Comment on attachment 127768 [details] [diff] [review]
a new patch with nits taken care of and  __linux replaced by __GLIBC__

sr=darin
Attachment #127768 - Flags: superreview?(darin) → superreview+
Ping :-) Simon, can I get r? Or shall I ask someone else? 
Probably better ask someone else since, as I said before in comment 12, I really
know almost nothing about iconv.
Comment on attachment 127768 [details] [diff] [review]
a new patch with nits taken care of and  __linux replaced by __GLIBC__

Ulrich, can you review? You're the authority on iconv :-)
Attachment #127768 - Flags: review?(smontagu) → review?(drepper)
Comment on attachment 127768 [details] [diff] [review]
a new patch with nits taken care of and  __linux replaced by __GLIBC__

The patch looks fine.  glibc will never use the BOM if the endianness is
explicitly requested but it probably doesn't hurt.

If the style is OK I see no problems.
r=drepper
Attachment #127768 - Flags: review?(drepper) → review+
Comment on attachment 127768 [details] [diff] [review]
a new patch with nits taken care of and  __linux replaced by __GLIBC__

Fix checked into the trunk. Thank you all.

Now I'm asking for a1.4 as  Darin and I discussed (see comment #8). 

This patch (along with attachment 124092 [details] [diff] [review]) makes nsNativeCharsetUtils on
Unix (including Linux) UTF-16-safe. Some problems in attachment 124092 [details] [diff] [review] have
been identified and addressed in this patch, so I believe the risk is
pretty low. Linux with glibc 2.2.x or later and with old glibc 2.1.x were both
tested. Solaris 8/9 is fine (Solaris 7 needs either an iconv patch from Sun or
GNU iconv, which was already release-noted). Other Unix should be all right because
nsNativeCharsetUtils.cpp (with or without this patch) has a few 'defense
measures' against various possibilities. I also tested the 'iconv(3)' APIs on Tru64
and HP/UX 11 and they work as assumed here.

Considering that 1.4 will be a long-lived release (like 1.0), I guess we can't just
assume that nobody will use non-BMP characters in filenames and other places.

On Linux, there was a little worry about the start-up time, but there's
virtually no difference (or slightly faster with this patch than without).

Given all this, I think it's safe to land this, combined with attachment 124092 [details] [diff] [review],
on the 1.4 branch.
Attachment #127768 - Flags: approval1.4.x?
Comment on attachment 127768 [details] [diff] [review]
a new patch with nits taken care of and  __linux replaced by __GLIBC__

We don't have many testing resources and we're short on time. This doesn't
appear to be a high-profile problem for users so it's not 1.4 material.
Attachment #127768 - Flags: approval1.4.x? → approval1.4.x-
sorry it's late. marking as fixed. 
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED