Closed Bug 101998 Opened 23 years ago Closed 23 years ago

implement GB18030 to Unicode supplement planes conversion

Categories

(Core :: Internationalization, defect)

x86
Windows 95
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla0.9.9

People

(Reporter: ftang, Assigned: ftang)

Details

(Keywords: intl)

Attachments

(1 file, 2 obsolete files)

We implement GB18030 conversion for the BMP, but not for the supplementary planes.
Here is more information.

From:        Kenneth Whistler <kenw@sybase.com>   1:02 PM
 Subject:    Re: GB18030
     To:     ftang@netscape.com
    CC:      unicode@unicode.org, kenw@sybase.com
Frank,

> You don't need to explain the concept of GB18030 to me.
> The question I have is about the detailed mapping
> information.

Now, now, there's no need to get snippy with me. It sounded
like you were unclear from the kinds of questions you were
asking.

> I look at
> http://oss.software.ibm.com/cvs/icu/charset/data/xml/gb-18030-2000.xml .
> 
> It is interesting that the mapping between U+10000 and U+10FFFF was checked
> in only 5 weeks ago, in version 1.3
> 
>   <range uFirst="10000" uLast="10FFFF"
>          bFirst="90 30 81 30" bLast="E3 32 9A 35"
>          bMin="81 30 81 30" bMax="FE 39 FE 39"/>
> 

> Is the U+10000 - U+10FFFF mapping between Unicode and GB18030 specified
> in the GB18030 standard itself? can someone fax me that page ? Thanks.

Unfortunately, I don't have the revised and corrected version of
the standard to hand.

But on p. 5, clause 7.3 of the original GB 18030-2000, it states (in
Chinese):

"From 0x90308130 to 0xE339FE39, altogether 1058400 code points, correspond
to GB 13000's 16 supplementary planes..."

If you look at the ICU specification, bFirst="90 30 81 30" and
bLast="E3 32 9A 35" corresponds to:

83 "groups" (90..E2) of GB 18030:    83 x 10 x 1260 = 1045800 code points
 2 "planes" (E3 30..31) of GB 18030:       2 x 1260 =    2520 code points
25 "rows"   (E3 32 81..99) of GB 18030:        25 x 10 =  250 code points
 6 "cells"  (E3 32 9A 30..35) of GB 18030:                  6 code points
                                             Total    1048576 code points

And 1048576 code points = 16 x 65536 code points = 16 planes of 10646.

So GB 18030 and ICU agree. Start at 0x90308130 and lay out all the
rest of the Unicode supplementary code points in order.

--Ken
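The layout Ken describes can be sketched in a few lines of C++. This is only an illustration of the arithmetic in the email, not code from the Mozilla patch: the function name EncodeSupplementary is hypothetical, and it assumes the caller passes a code point in U+10000..U+10FFFF.

```cpp
#include <array>
#include <cstdint>

// Sanity check of the count in the email: groups + planes + rows + cells.
static_assert(83 * 10 * 1260 + 2 * 1260 + 25 * 10 + 6 == 16 * 65536,
              "90 30 81 30 .. E3 32 9A 35 covers exactly 16 planes");

// Hypothetical helper (not from the patch). Lays out U+10000..U+10FFFF
// in order, starting at GB18030 90 30 81 30.
// Assumes 0x10000 <= cp <= 0x10FFFF; no range check is done here.
inline std::array<std::uint8_t, 4> EncodeSupplementary(std::uint32_t cp) {
  // Offset of the code point within the 16 supplementary planes.
  const std::uint32_t idx = cp - 0x10000;
  // Byte 4 cycles through 0x30..0x39 (10 values), byte 3 through
  // 0x81..0xFE (126 values), byte 2 through 0x30..0x39 (10 values),
  // and byte 1 counts up from 0x90.
  return {{static_cast<std::uint8_t>(0x90 + idx / 12600),
           static_cast<std::uint8_t>(0x30 + (idx / 1260) % 10),
           static_cast<std::uint8_t>(0x81 + (idx / 10) % 126),
           static_cast<std::uint8_t>(0x30 + idx % 10)}};
}
```

With this layout, U+10000 comes out as 90 30 81 30 and U+10FFFF as E3 32 9A 35, matching bFirst and bLast in the ICU data quoted above.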
reassign to ftang
Assignee: yokoyama → ftang
Attached patch add surrogate support here. (obsolete) — Splinter Review
Brian@Sun, can you review and test the patch?
Excuse me, is there any information or utility for testing the converter used in
Mozilla?
Switching qa contact to teruko for now.
Keywords: intl
QA Contact: andreasb → teruko
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.7
Attached patch v2 of the patch. (obsolete) — Splinter Review
The v2 patch still has some problems with the encoder. I think the decoder is fine.
add shanjian to the cc list.
Target Milestone: mozilla0.9.7 → mozilla0.9.6
Blocks: 104056
I lost my Perl script that generates the test data and need some time to
rewrite it. Changing this to m0.9.7.
Target Milestone: mozilla0.9.6 → mozilla0.9.7
move to 0.9.8
Target Milestone: mozilla0.9.7 → mozilla0.9.8
Mass-moving unimportant m9.8 bugs to m9.9 for now.
Target Milestone: mozilla0.9.8 → mozilla0.9.9
I wrote some Perl test code to cover the whole range. It looks like, with the
current patch, we have at least the following problems:

From Unicode to GB18030 conversion:
a. Converting U+9FA6 - U+9FFF to 82358F33 - 82359832 does not work; it converts
to 0000 instead.

b. Converting surrogate pairs to 90xxxxxx is not working yet.
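The endpoints in problem (a) can be cross-checked with a small sketch. It assumes the U+9FA6 - U+9FFF range maps contiguously onto the 4-byte range given above, which is what the comment implies. All three function names are hypothetical and are not taken from the patch or from the Perl test script.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; not taken from the Mozilla patch.

// Position of a packed 4-byte GB18030 sequence (e.g. 0x82358F33) in the
// linear ordering that starts at 0x81308130.
inline std::uint32_t LinearFromBytes(std::uint32_t packed) {
  const std::uint32_t b1 = (packed >> 24) & 0xFF;
  const std::uint32_t b2 = (packed >> 16) & 0xFF;
  const std::uint32_t b3 = (packed >> 8) & 0xFF;
  const std::uint32_t b4 = packed & 0xFF;
  return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 +
         (b4 - 0x30);
}

// Inverse of LinearFromBytes.
inline std::uint32_t BytesFromLinear(std::uint32_t index) {
  const std::uint32_t b4 = 0x30 + index % 10;
  const std::uint32_t b3 = 0x81 + (index / 10) % 126;
  const std::uint32_t b2 = 0x30 + (index / 1260) % 10;
  const std::uint32_t b1 = 0x81 + index / 12600;
  return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
}

// Encodes U+9FA6..U+9FFF by offsetting from 0x82358F33, assuming the range
// is contiguous. Returns 0 for any code point outside that range.
inline std::uint32_t EncodeGapRange(std::uint32_t cp) {
  if (cp < 0x9FA6 || cp > 0x9FFF) return 0;
  return BytesFromLinear(LinearFromBytes(0x82358F33u) + (cp - 0x9FA6));
}
```

Under that assumption, the first and last code points of the range land on 82358F33 and 82359832, so the two ranges in the comment have the same length.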
Attachment #51127 - Attachment is obsolete: true
Attachment #51631 - Attachment is obsolete: true
I am very, very confident about this patch.
It passes my stress test and round-trip unit test.
OK, let us walk through the patch here.
1. nsGBKConvUtil.cpp
a. If we hit a surrogate char, return false to save time
(neither table indexing nor linear search).
b. In the case of U+AFXX, some characters cannot be mapped to
the two-byte range of GB18030. Therefore, we need to return false
if the map is 0, to indicate that an additional table lookup is needed in such
cases. This solves the problem I mentioned earlier.
2. nsGBKToUnicode.cpp

a. Change the logic so that it checks the byte range and hands the bytes to the
surrogate decoder if they fit into that range.
b. Add the algorithm-based 4-byte to surrogate mapping.
3. nsGBKToUnicode.h

Add DecodeToSurrogate for both the GBK function and the GB18030 one.

4. nsUnicodeToGBK.cpp

a. Add the algorithm-based surrogate to 4-byte mapping.
b. Check the surrogate range to tune performance.
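As a rough illustration of items 2 and 4, the decoder direction and the surrogate range check might look like the sketch below. The names and signatures are hypothetical; the actual DecodeToSurrogate in the patch may be shaped quite differently. The byte range used is the one from Ken's email (90 30 81 30 to E3 32 9A 35).

```cpp
#include <cstdint>

// Hypothetical sketch; names and signatures are not from the Mozilla patch.

// Cheap range checks an encoder can run before any table lookup.
inline bool IsHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool IsLowSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Combines a valid surrogate pair into a code point in U+10000..U+10FFFF.
// Assumes IsHighSurrogate(hi) and IsLowSurrogate(lo) both hold.
inline std::uint32_t CodePointFromSurrogates(char16_t hi, char16_t lo) {
  return 0x10000 + ((static_cast<std::uint32_t>(hi) - 0xD800) << 10) +
         (static_cast<std::uint32_t>(lo) - 0xDC00);
}

// Decodes a 4-byte sequence in 90 30 81 30 .. E3 32 9A 35 into a UTF-16
// surrogate pair. Returns false if the bytes fall outside that range, so the
// caller can fall back to the BMP tables.
inline bool FourBytesToSurrogates(const std::uint8_t b[4], char16_t* hi,
                                  char16_t* lo) {
  if (b[0] < 0x90 || b[0] > 0xE3 || b[1] < 0x30 || b[1] > 0x39 ||
      b[2] < 0x81 || b[2] > 0xFE || b[3] < 0x30 || b[3] > 0x39) {
    return false;
  }
  // Linear offset from 90 30 81 30, which corresponds to U+10000.
  const std::uint32_t idx =
      (((b[0] - 0x90u) * 10 + (b[1] - 0x30u)) * 126 + (b[2] - 0x81u)) * 10 +
      (b[3] - 0x30u);
  if (idx > 0xFFFFF) return false;  // past E3 32 9A 35, above U+10FFFF
  *hi = static_cast<char16_t>(0xD800 + (idx >> 10));
  *lo = static_cast<char16_t>(0xDC00 + (idx & 0x3FF));
  return true;
}
```

Because the mapping is purely arithmetic, no lookup table is needed for the supplementary planes, and the surrogate checks let the encoder skip the BMP tables entirely for these characters.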


To test this, we can load http://warp.mcom.com/u/ftang/utf8test/gb18030.cgi
I ran the normal test; it displays characters as I expected.
I also ran a round-trip test (I will check the test script into intl/uconv/tests
later). It has no problems.
I also tested buffer boundaries with http://warp.mcom.com/u/ftang/utf8test/buffer.cgi


shanjian, please r= this one.

Brian Yuan, you are welcome to review this too.
I will send you my test CGI.
Thanks Frank, I will have a look.

Brian.
@@ -348,6 +436,7 @@
       break;
     }
   }
+afterwhileloop:
   *aDestLength = iDestLength;
   *aSrcLength = iSrcLength;
   return res;

If "afterwhileloop" is not used anywhere else, you might want to remove it. 
r=shanjian after that. 
afterwhileloop is not used by any goto statement, but I refer to it from a comment
on a break statement so that the logic is easier to understand.
Attachment #68338 - Flags: review+
Comment on attachment 68338 [details] [diff] [review]
v3 of the patch. fix unicode to gb18030 problem

sr=alecf
Attachment #68338 - Flags: superreview+
QA Contact: teruko → ylong
Switching qa contact to ylong@netscape.com.
Fixed and checked in.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
No longer blocks: 104056