Closed Bug 184120 Opened 22 years ago Closed 22 years ago

UTF-32LE/UTF-32BE converters don't support non-BMP characters

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: jshin1987, Assigned: smontagu)

References

()

Details

(Keywords: intl)

Attachments

(1 file, 4 obsolete files)

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3a) Gecko/20021205
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3a) Gecko/20021205

UTF-32LE/UTF-32BE converters convert non-BMP characters (plane 1 thru
plane 16) to Non-character (U+FFFD) instead of a pair of 
surrogate codepoints.  'PRUnichar*' in Mozilla is UTF-16 (not UCS-2)
and nsAString is a 'container' for 'PRUnichar*'. Therefore, when
converting Non-BMP characters in UTF-32LE/UTF-32BE to  'Unicode'(UTF-16), 
a pair of  surrogate codepoints have to be emitted. At the moment,
uCheckAndScanAlways4Byte(Swap) () in uscan.c doesn't. 


Reproducible: Always

Steps to Reproduce:
1. Under MS Windows, run mozilla to view the page at the URL field above 
2. It's possible to reproduce this under Linux, but my patches 
    for bug 182877 have to be applied. 
3.

Actual Results:  
Non-BMP characters came out all with question marks

Expected Results:  
They have to be rendered exactly the same way as
http://jshin.net/i18n/utftest/plane1.utf16le.html
sorry that I misslected the componet. Resetting it to i18n and adding some people.

Patch coming soon. 
Component: Browser-General → Internationalization
Keywords: intl
Summary: UTF-32LE/UTF-32BE converters don't Non-BMP characters → UTF-32LE/UTF-32BE converters don't support non-BMP characters
.
Assignee: asa → smontagu
QA Contact: asa → ylong
Status: UNCONFIRMED → NEW
Ever confirmed: true
Attached patch a patch (obsolete) — Splinter Review
Attachment #108977 - Flags: review?(smontagu)
changing the URL field now that I modified my UTF-16/32 test pages to test both
BMP(non-US-ASCII) and non-BMP characters in a single page. 
Somehow, VC++ doesn't like (LE|BE)_STRING_TO_UCS4 macro in the prev. patch. 
I changed them to make VC++ happy. g++ works either way.
Attachment #108977 - Attachment is obsolete: true
Attachment #109080 - Flags: review?(smontagu)
Comment on attachment 109080 [details] [diff] [review]
a slightly modified patch(with xml file change for macbuild)

>+    else                       // plane 17 and higher
>+      *dest++ = 0xfffd;

Please comment this more fully, explaining that plane 17 and higher are
undefined in UTF32 and that 0xfffd is the replacement character. 

With that, r=smontagu
Attachment #109080 - Flags: review?(smontagu) → review+
Simon, thanks for r. I made a change suggested.
Can you carry over your r to this attachment if that's required? (I can't do
that myself.
How can I get that prev.? )  I'll just go ahead for sr tomorrow morning because
the change  is only in comment)
Attachment #109080 - Attachment is obsolete: true
Comment on attachment 109272 [details] [diff] [review]
with a more thourohgh comment about plane 17 and higher

This patch is identical
to the one I got r from Simon
except for a little change
in comment so thatI think his r
is still valid.
Attachment #109272 - Flags: superreview?(dbaron)
Comment on attachment 109272 [details] [diff] [review]
with a more thourohgh comment about plane 17 and higher

transferring r=smontagu
Attachment #109272 - Flags: review+
Comment on attachment 109272 [details] [diff] [review]
with a more thourohgh comment about plane 17 and higher

>Index: ucvlatin/Makefile.in
>-		nsUCS4BEToUnicode.cpp \
>-		nsUCS4LEToUnicode.cpp \
>+		nsUTF32ToUnicode.cpp \

You're planning to cvs remove the removed files, right?  That's not in the
patch.


>Index: ucvlatin/nsUTF32ToUnicode.cpp
>+#define LE_STRING_TO_UCS4(s)                                       \
>+        (PRUint8(*(s)) | (PRUint8(*((s) + 1)) << 8) |              \
>+         (PRUint8(*((s) + 2)) << 16) | (PRUint8(*((s) + 3)) << 24))
>+
>+#define BE_STRING_TO_UCS4(s)                                       \
>+        (PRUint8(*((s) + 3)) | (PRUint8(*((s) + 2)) << 8) |         \
>+         (PRUint8(*((s) + 1)) << 16) | (PRUint8(*(s)) << 24))

Perhaps you could use IS_LITTLE_ENDIAN and IS_BIG_ENDIAN to make each one of
these closer to a no-op half the time?	(See nsIStreamBufferAccess.idl for an
example.)

>+nsUTF32LEToUnicode::nsUTF32LEToUnicode() : nsUTF32ToUnicode()
>+{
>+}

If the constructor is empty, how about just omitting it?  (Same for BE.)

>+NS_IMPL_ISUPPORTS_INHERITED0(nsUnicodeToUTF32BE, nsUnicodeToUTF32);

INHERITED0 is needed only in very rare cases (separately inheriting an
interface that's already implemented by another ancestor), right?  This can be
omitted.

>+
>+nsUnicodeToUTF32BE::nsUnicodeToUTF32BE() : nsUnicodeToUTF32() 
>+{
>+  NS_INIT_ISUPPORTS();

NS_INIT_ISUPPORTS is a no-op and not needed anymore.  (I have a bug on removing
the existing ones...)

Then the same empty constructor issue as before.

More in a bit...
Comment on attachment 109272 [details] [diff] [review]
with a more thourohgh comment about plane 17 and higher

>Index: ucvlatin/nsUTF32ToUnicode.cpp

>+static nsresult ConvertCommon(const char * aSrc, 

>+  nsresult rv = NS_OK;  // conversion result

This could be declared later, since it's not used for a bit.

However, I wonder if the two if blocks before the loop could
be merged into the rest of the loop.  It doesn't seem like it
would be too hard, although I haven't thought the whole thing
through:

>+  if (*aState > *aSrcLength) 
      ...

>+  // prev. run left a partial UTF-32 seq. 
>+  if (*aState > 0)
      ... [although quoted below]

Here, within the second |if| block above, you haven't checked
|destEnd|.  You should at least have an assertion that 
|*aDestLength >= 2|.


>+  for ( ; src < srcEnd && dest < destEnd; src += 4)
  ...
>+      *dest++= PRUnichar((ucs4 >> 10) + 0xd7c0); 
>+      *dest++= PRUnichar(ucs4 & 0x3ffL | 0xdc00);

The |dest < destEnd| check isn't sufficient if you're
going to add two characters in some cases.  You could
overflow the buffer.
Comment on attachment 109272 [details] [diff] [review]
with a more thourohgh comment about plane 17 and higher

>+    if (!(ucs4 >> 16))    // BMP  

It seems a little clearer to use |if (!(ucs4 & 0xffff))|.

>+    {
>+      *dest++= PRUnichar(ucs4);

Should you check that the input isn't in the surrogate ranges here (and replace
with 0xfffd, perhaps?), since failing to check could allow an incorrectly
encoded thing that looks like a surrogate into the data?  Or do all our
routines that handle UTF-16 correctly handle such errors?

>+    }
>+    else if (ucs4 < 0x110000L)  // plane 1 through plane 16 
>+    {
>+      if (destEnd - dest < 2) 

Oh, never mind about my comment about checking |destEnd|.  I missed this
before.

>+  return ConvertCommon(aSrc, aSrcLength, aDest, aDestLength, &mState, 
>+                       mBufferInc, PR_FALSE);


>Index: ucvlatin/nsUnicodeToUTF32.cpp
>+#define UCS4_TO_LE_STRING(u, s) memcpy((s), &(u), 4)

Could this just be |*((PRUint32*)s) = u| ?  (And likewise for BE?)

I still need to look at ConvertCommon and FinishCommon in the Encoder...
Comment on attachment 109272 [details] [diff] [review]
with a more thourohgh comment about plane 17 and higher

sr=dbaron with my above comments
Attachment #109272 - Flags: superreview?(dbaron) → superreview+
David, can you take a look at this and carry over your sr
if there's no problem?
Attachment #109272 - Attachment is obsolete: true
Thanks for reviewing, David.

>+    {
>+      *dest++= PRUnichar(ucs4);

> Should you check that the input isn't in the surrogate ranges here (and > replace
> with 0xfffd, perhaps?), since failing to check could allow an 
> incorrectly
> encoded thing that looks like a surrogate into the data?  

  It's not very clear what to do in such a case. (UAX #19 doesn't
say anything about what to do when converting UTF-32 to other
encoding forms). I'll ask around. In the meantime, I added a comment.

> Or do all our
> routines that handle UTF-16 correctly handle such errors?

  They're supposed to check for unpaired surrogate code points.
In all cases I'm aware of, our routines do the 'right thing'.
Therefore, I guess we can get away with just passing
thru all codepoints below 0x10000 for the time being. 

>+    {
>+      *dest++= PRUnichar(ucs4);


> However, I wonder if the two if blocks before the loop could
> be merged into the rest of the loop.

  My first patch(which I haven't uploaded here) did just that, but
after looking at UTF8 en/decoder, I changed my mind because
it seemed to make it less clear what's going on without
much gain. 

BTW, I'm away from my place and build environment. I managed to
compile my new patch on a Linux box accessed remotely, but haven't
been able to test it (nsconv somehow didn't work). I'll try to
test it soon, but it may take a while.
fixed stupid mistakes made in the prev. patch. 
I also managed to test it.
Attachment #110661 - Attachment is obsolete: true
Attachment #112242 - Flags: superreview?(dbaron)
Comment on attachment 112242 [details] [diff] [review]
a new patch addr. David's concerns

sr=dbaron.  Sorry for the delay.
Attachment #112242 - Flags: superreview?(dbaron) → superreview+
Comment on attachment 112242 [details] [diff] [review]
a new patch addr. David's concerns

Thanks David for sr.

Now asking for a to 1.3b.
This patch adds non-BMP(>=
U+10000) character support for UTF-32.
UTF-32 is rarely used
in practice so that the real-life risk is very low even there's something wrong
with this patch. On the other hand, it'd be 
nice to boast that Mozilla 1.3 now supports
UTF-32 thoroughly
(meaning it supports
non-BMP in UTF-32).
I tested the patch on Linux RH 8 with the test page given at the URL. The
previous patch(almost identical) was tested under both Linux and Win32. I also
tested
the encoding/decoding
of a few UTF-32 
text files
with 'nsconv' utilty program in intl/uconv/test.
Attachment #112242 - Flags: approval1.3b?
Comment on attachment 112242 [details] [diff] [review]
a new patch addr. David's concerns

a=asa (on behalf of drivers) for checkin to 1.3beta.
Attachment #112242 - Flags: approval1.3b? → approval1.3b+
Patch was checked in last weekend. Sorry I forgot
to mark this as resolved/fixed. Gonna do now.  Thank you all.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Attachment #108977 - Flags: review?(smontagu)
*** Bug 7644 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: