Closed
Bug 184120
Opened 22 years ago
Closed 22 years ago
UTF-32LE/UTF-32BE converters don't support non-BMP characters
Categories
(Core :: Internationalization, defect)
Core
Internationalization
Tracking
()
RESOLVED
FIXED
People
(Reporter: jshin1987, Assigned: smontagu)
References
()
Details
(Keywords: intl)
Attachments
(1 file, 4 obsolete files)
60.55 KB,
patch
|
dbaron
:
superreview+
asa
:
approval1.3b+
|
Details | Diff | Splinter Review |
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3a) Gecko/20021205 Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3a) Gecko/20021205 UTF-32LE/UTF-32BE converters convert non-BMP characters (plane 1 thru plane 16) to Non-character (U+FFFD) instead of a pair of surrogate codepoints. 'PRUnichar*' in Mozilla is UTF-16 (not UCS-2) and nsAString is a 'container' for 'PRUnichar*'. Therefore, when converting Non-BMP characters in UTF-32LE/UTF-32BE to 'Unicode'(UTF-16), a pair of surrogate codepoints have to be emitted. At the moment, uCheckAndScanAlways4Byte(Swap) () in uscan.c doesn't. Reproducible: Always Steps to Reproduce: 1. Under MS Windows, run mozilla to view the page at the URL field above 2. It's possible to reproduce this under Linux, but my patches for bug 182877 have to be applied. 3. Actual Results: Non-BMP characters came out all with question marks Expected Results: They have to be rendered exactly the same way as http://jshin.net/i18n/utftest/plane1.utf16le.html
Reporter | ||
Comment 1•22 years ago
|
||
sorry that I misslected the componet. Resetting it to i18n and adding some people. Patch coming soon.
Component: Browser-General → Internationalization
Keywords: intl
Summary: UTF-32LE/UTF-32BE converters don't Non-BMP characters → UTF-32LE/UTF-32BE converters don't support non-BMP characters
Updated•22 years ago
|
Status: UNCONFIRMED → NEW
Ever confirmed: true
Reporter | ||
Comment 3•22 years ago
|
||
Reporter | ||
Updated•22 years ago
|
Attachment #108977 -
Flags: review?(smontagu)
Reporter | ||
Comment 4•22 years ago
|
||
changing the URL field now that I modified my UTF-16/32 test pages to test both BMP(non-US-ASCII) and non-BMP characters in a single page.
Reporter | ||
Comment 5•22 years ago
|
||
Somehow, VC++ doesn't like (LE|BE)_STRING_TO_UCS4 macro in the prev. patch. I changed them to make VC++ happy. g++ works either way.
Attachment #108977 -
Attachment is obsolete: true
Reporter | ||
Updated•22 years ago
|
Attachment #109080 -
Flags: review?(smontagu)
Assignee | ||
Comment 6•22 years ago
|
||
Comment on attachment 109080 [details] [diff] [review] a slightly modified patch(with xml file change for macbuild) >+ else // plane 17 and higher >+ *dest++ = 0xfffd; Please comment this more fully, explaining that plane 17 and higher are undefined in UTF32 and that 0xfffd is the replacement character. With that, r=smontagu
Attachment #109080 -
Flags: review?(smontagu) → review+
Reporter | ||
Comment 7•22 years ago
|
||
Simon, thanks for r. I made a change suggested. Can you carry over your r to this attachment if that's required? (I can't do that myself. How can I get that prev.? ) I'll just go ahead for sr tomorrow morning because the change is only in comment)
Attachment #109080 -
Attachment is obsolete: true
Reporter | ||
Comment 8•22 years ago
|
||
Comment on attachment 109272 [details] [diff] [review] with a more thourohgh comment about plane 17 and higher This patch is identical to the one I got r from Simon except for a little change in comment so thatI think his r is still valid.
Attachment #109272 -
Flags: superreview?(dbaron)
Assignee | ||
Comment 9•22 years ago
|
||
Comment on attachment 109272 [details] [diff] [review] with a more thourohgh comment about plane 17 and higher transferring r=smontagu
Attachment #109272 -
Flags: review+
Comment on attachment 109272 [details] [diff] [review] with a more thourohgh comment about plane 17 and higher >Index: ucvlatin/Makefile.in >- nsUCS4BEToUnicode.cpp \ >- nsUCS4LEToUnicode.cpp \ >+ nsUTF32ToUnicode.cpp \ You're planning to cvs remove the removed files, right? That's not in the patch. >Index: ucvlatin/nsUTF32ToUnicode.cpp >+#define LE_STRING_TO_UCS4(s) \ >+ (PRUint8(*(s)) | (PRUint8(*((s) + 1)) << 8) | \ >+ (PRUint8(*((s) + 2)) << 16) | (PRUint8(*((s) + 3)) << 24)) >+ >+#define BE_STRING_TO_UCS4(s) \ >+ (PRUint8(*((s) + 3)) | (PRUint8(*((s) + 2)) << 8) | \ >+ (PRUint8(*((s) + 1)) << 16) | (PRUint8(*(s)) << 24)) Perhaps you could use IS_LITTLE_ENDIAN and IS_BIG_ENDIAN to make each one of these closer to a no-op half the time? (See nsIStreamBufferAccess.idl for an example.) >+nsUTF32LEToUnicode::nsUTF32LEToUnicode() : nsUTF32ToUnicode() >+{ >+} If the constructor is empty, how about just omitting it? (Same for BE.) >+NS_IMPL_ISUPPORTS_INHERITED0(nsUnicodeToUTF32BE, nsUnicodeToUTF32); INHERITED0 is needed only in very rare cases (separately inheriting an interface that's already implemented by another ancestor), right? This can be omitted. >+ >+nsUnicodeToUTF32BE::nsUnicodeToUTF32BE() : nsUnicodeToUTF32() >+{ >+ NS_INIT_ISUPPORTS(); NS_INIT_ISUPPORTS is a no-op and not needed anymore. (I have a bug on removing the existing ones...) Then the same empty constructor issue as before. More in a bit...
Comment on attachment 109272 [details] [diff] [review] with a more thourohgh comment about plane 17 and higher >Index: ucvlatin/nsUTF32ToUnicode.cpp >+static nsresult ConvertCommon(const char * aSrc, >+ nsresult rv = NS_OK; // conversion result This could be declared later, since it's not used for a bit. However, I wonder if the two if blocks before the loop could be merged into the rest of the loop. It doesn't seem like it would be too hard, although I haven't thought the whole thing through: >+ if (*aState > *aSrcLength) ... >+ // prev. run left a partial UTF-32 seq. >+ if (*aState > 0) ... [although quoted below] Here, within the second |if| block above, you haven't checked |destEnd|. You should at least have an assertion that |*aDestLength >= 2|. >+ for ( ; src < srcEnd && dest < destEnd; src += 4) ... >+ *dest++= PRUnichar((ucs4 >> 10) + 0xd7c0); >+ *dest++= PRUnichar(ucs4 & 0x3ffL | 0xdc00); The |dest < destEnd| check isn't sufficient if you're going to add two characters in some cases. You could overflow the buffer.
Comment on attachment 109272 [details] [diff] [review] with a more thourohgh comment about plane 17 and higher >+ if (!(ucs4 >> 16)) // BMP It seems a little clearer to use |if (!(ucs4 & 0xffff))|. >+ { >+ *dest++= PRUnichar(ucs4); Should you check that the input isn't in the surrogate ranges here (and replace with 0xfffd, perhaps?), since failing to check could allow an incorrectly encoded thing that looks like a surrogate into the data? Or do all our routines that handle UTF-16 correctly handle such errors? >+ } >+ else if (ucs4 < 0x110000L) // plane 1 through plane 16 >+ { >+ if (destEnd - dest < 2) Oh, never mind about my comment about checking |destEnd|. I missed this before. >+ return ConvertCommon(aSrc, aSrcLength, aDest, aDestLength, &mState, >+ mBufferInc, PR_FALSE); >Index: ucvlatin/nsUnicodeToUTF32.cpp >+#define UCS4_TO_LE_STRING(u, s) memcpy((s), &(u), 4) Could this just be |*((PRUint32*)s) = u| ? (And likewise for BE?) I still need to look at ConvertCommon and FinishCommon in the Encoder...
Comment on attachment 109272 [details] [diff] [review] with a more thourohgh comment about plane 17 and higher sr=dbaron with my above comments
Attachment #109272 -
Flags: superreview?(dbaron) → superreview+
Reporter | ||
Comment 14•22 years ago
|
||
David, can you take a look at this and carry over your sr if there's no problem?
Attachment #109272 -
Attachment is obsolete: true
Reporter | ||
Comment 15•22 years ago
|
||
Thanks for reviewing, David. >+ { >+ *dest++= PRUnichar(ucs4); > Should you check that the input isn't in the surrogate ranges here (and > replace > with 0xfffd, perhaps?), since failing to check could allow an > incorrectly > encoded thing that looks like a surrogate into the data? It's not very clear what to do in such a case. (UAX #19 doesn't say anything about what to do when converting UTF-32 to other encoding forms). I'll ask around. In the meantime, I added a comment. > Or do all our > routines that handle UTF-16 correctly handle such errors? They're supposed to check for unpaired surrogate code points. In all cases I'm aware of, our routines do the 'right thing'. Therefore, I guess we can get away with just passing thru all codepoints below 0x10000 for the time being. >+ { >+ *dest++= PRUnichar(ucs4); > However, I wonder if the two if blocks before the loop could > be merged into the rest of the loop. My first patch(which I haven't uploaded here) did just that, but after looking at UTF8 en/decoder, I changed my mind because it seemed to make it less clear what's going on without much gain. BTW, I'm away from my place and build environment. I managed to compile my new patch on a Linux box accessed remotely, but haven't been able to test it (nsconv somehow didn't work). I'll try to test it soon, but it may take a while.
Reporter | ||
Comment 16•22 years ago
|
||
fixed stupid mistakes made in the prev. patch. I also managed to test it.
Attachment #110661 -
Attachment is obsolete: true
Reporter | ||
Updated•22 years ago
|
Attachment #112242 -
Flags: superreview?(dbaron)
Comment on attachment 112242 [details] [diff] [review] a new patch addr. David's concerns sr=dbaron. Sorry for the delay.
Attachment #112242 -
Flags: superreview?(dbaron) → superreview+
Reporter | ||
Comment 18•22 years ago
|
||
Comment on attachment 112242 [details] [diff] [review] a new patch addr. David's concerns Thanks David for sr. Now asking for a to 1.3b. This patch adds non-BMP(>= U+10000) character support for UTF-32. UTF-32 is rarely used in practice so that the real-life risk is very low even there's something wrong with this patch. On the other hand, it'd be nice to boast that Mozilla 1.3 now supports UTF-32 thoroughly (meaning it supports non-BMP in UTF-32). I tested the patch on Linux RH 8 with the test page given at the URL. The previous patch(almost identical) was tested under both Linux and Win32. I also tested the encoding/decoding of a few UTF-32 text files with 'nsconv' utilty program in intl/uconv/test.
Attachment #112242 -
Flags: approval1.3b?
Comment 19•22 years ago
|
||
Comment on attachment 112242 [details] [diff] [review] a new patch addr. David's concerns a=asa (on behalf of drivers) for checkin to 1.3beta.
Attachment #112242 -
Flags: approval1.3b? → approval1.3b+
Reporter | ||
Comment 20•22 years ago
|
||
Patch was checked in last weekend. Sorry I forgot to mark this as resolved/fixed. Gonna do now. Thank you all.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Updated•21 years ago
|
Attachment #108977 -
Flags: review?(smontagu)
Reporter | ||
Comment 21•21 years ago
|
||
*** Bug 7644 has been marked as a duplicate of this bug. ***
You need to log in
before you can comment on or make changes to this bug.
Description
•