Closed
Bug 101998
Opened 23 years ago
Closed 23 years ago
implement GB18030 to Unicode supplement planes conversion
Categories
(Core :: Internationalization, defect)
Tracking
()
RESOLVED
FIXED
mozilla0.9.9
People
(Reporter: ftang, Assigned: ftang)
Details
(Keywords: intl)
Attachments
(1 file, 2 obsolete files)
12.24 KB, patch | shanjian: review+ | alecf: superreview+
We implement GB18030 conversion for the BMP, but not for the supplementary planes. Here is more information.

From: Kenneth Whistler <kenw@sybase.com> 1:02 PM
Subject: Re: GB18030
To: ftang@netscape.com
CC: unicode@unicode.org, kenw@sybase.com

Frank,

> You don't need to explain to me
> the concept of GB18030. The question I have is about details mapping
> information.

Now, now, there's no need to get snippy with me. It sounded like you were unclear from the kinds of questions you were asking.

> I look at
> http://oss.software.ibm.com/cvs/icu/charset/data/xml/gb-18030-2000.xml .
>
> It is interesting that the mapping between U+10000 and U+10FFFF is check
> in only 5 weeks ago in the version 1.3
>
> | 30910: <range uFirst="10000" uLast="10FFFF"
> bFirst="90 30 81 30" bLast="E3 32 9A 35" bMin="81 30 81 30" bMax="FE 39
> FE 39"/>
>
> Is the U+10000 - U+10FFFF mapping between Unicode and GB18030 specified
> in the GB18030 standard itself? can someone fax me that page ? Thanks.

Unfortunately, I don't have the revised and corrected version of the standard to hand. But on p. 5, clause 7.3 of the original GB 18030-2000, it states (in Chinese):

"From 0x90308130 to 0xE339FE39, altogether 1058400 code points, correspond to GB 13000's 16 supplementary planes..."

If you look at the ICU specification, bFirst="90 30 81 30" and bLast="E3 32 9A 35" corresponds to:

83 "groups" (90..E2) of GB 18030: 83 x 10 x 1260 = 1045800 code points
2 "planes" (E3 30..31) of GB 18030: 2 x 1260 = 2520 code points
25 "rows" (E3 32 81..99) of GB 18030: 25 x 10 = 250 code points
6 "cells" (E3 32 9A 30..35) of GB 18030: 6 code points
Total 1048576 code points

And 1048576 code points = 16 x 65536 code points = 16 planes of 10646.

So GB 18030 and ICU agree. Start at 0x90308130 and lay out all the rest of the Unicode supplementary code points in order.

--Ken
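The "lay out in order" rule from Ken's mail can be sketched in C++ as follows. This is only an illustration of the arithmetic, not the code from the patch; the function names LinearIndex and FourBytesToSupplementary are hypothetical. It assumes bytes 1 and 3 range over 0x81..0xFE and bytes 2 and 4 over 0x30..0x39, as given by bMin and bMax in the ICU range quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Linear position of a four-byte GB18030 sequence, counted from 81 30 81 30.
// Bytes 1 and 3 take 126 values (0x81..0xFE); bytes 2 and 4 take 10 values
// (0x30..0x39). Returns -1 for a malformed sequence.
inline int32_t LinearIndex(uint8_t b1, uint8_t b2, uint8_t b3, uint8_t b4) {
  if (b1 < 0x81 || b1 > 0xFE || b3 < 0x81 || b3 > 0xFE) return -1;
  if (b2 < 0x30 || b2 > 0x39 || b4 < 0x30 || b4 > 0x39) return -1;
  return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 +
         (b4 - 0x30);
}

// Hypothetical helper (not Mozilla's API): maps 90 30 81 30 .. E3 32 9A 35
// onto U+10000 .. U+10FFFF in order. Returns -1 outside that range.
inline int32_t FourBytesToSupplementary(uint8_t b1, uint8_t b2, uint8_t b3,
                                        uint8_t b4) {
  const int32_t first = LinearIndex(0x90, 0x30, 0x81, 0x30);
  const int32_t last = LinearIndex(0xE3, 0x32, 0x9A, 0x35);
  // The range must hold exactly 16 planes of 65536 code points.
  assert(last - first + 1 == 16 * 65536);
  const int32_t idx = LinearIndex(b1, b2, b3, b4);
  if (idx < first || idx > last) return -1;
  return 0x10000 + (idx - first);
}
```

With this sketch, 90 30 81 30 maps to U+10000 and E3 32 9A 35 maps to U+10FFFF, which matches the endpoints in the ICU range.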
Assignee
Comment 2•23 years ago
Comment 3•23 years ago
Brian@Sun, can you review and test the patch?
Comment 4•23 years ago
Excuse me, is there any information or utility for testing the converter used in Mozilla?
Comment 5•23 years ago
Switching qa contact to teruko for now.
Keywords: intl
QA Contact: andreasb → teruko
Assignee
Updated•23 years ago
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla0.9.7
Assignee
Comment 6•23 years ago
Assignee
Comment 7•23 years ago
The v2 patch still has some problems with the encoder. I think the decoder is fine.
Assignee
Comment 8•23 years ago
Adding shanjian to the CC list.
Assignee
Updated•23 years ago
Target Milestone: mozilla0.9.7 → mozilla0.9.6
Assignee
Comment 9•23 years ago
I lost the Perl script that generates the test data, so I need some time to rewrite it. Changing this to m0.9.7.
Target Milestone: mozilla0.9.6 → mozilla0.9.7
Assignee
Comment 11•23 years ago
Mass-moving unimportant m0.9.8 bugs to m0.9.9 for now.
Target Milestone: mozilla0.9.8 → mozilla0.9.9
Assignee
Comment 12•23 years ago
I wrote some Perl test code to cover the whole range. With the current patch, we have at least the following problems in the Unicode to GB18030 conversion:
a. Converting U+9FA6 - U+9FFF to 82358f33 - 82359832 does not work; it converts to 0000 instead.
b. Converting surrogate pairs to 90xxxxx is not working yet.
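The expected output for case (a) can be reproduced with a short C++ sketch. It assumes that U+9FA6 - U+9FFF is laid out linearly starting at 82 35 8F 33, which is consistent with the two endpoints quoted above (both ranges span 90 code points). The function name EncodeTail9FA6 is hypothetical and is not taken from the patch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: encodes U+9FA6..U+9FFF as four GB18030 bytes,
// assuming a linear layout that starts at 82 35 8F 33.
// Returns false for code units outside that range.
inline bool EncodeTail9FA6(uint16_t u, std::array<uint8_t, 4>& out) {
  if (u < 0x9FA6 || u > 0x9FFF) return false;
  // Linear position of 82 35 8F 33, counted from 81 30 81 30.
  const uint32_t base =
      (((0x82 - 0x81) * 10 + (0x35 - 0x30)) * 126 + (0x8F - 0x81)) * 10 +
      (0x33 - 0x30);
  uint32_t idx = base + (u - 0x9FA6u);
  // Peel off the four "digits": 10, 126, 10, then the lead byte.
  out[3] = static_cast<uint8_t>(0x30 + idx % 10);
  idx /= 10;
  out[2] = static_cast<uint8_t>(0x81 + idx % 126);
  idx /= 126;
  out[1] = static_cast<uint8_t>(0x30 + idx % 10);
  idx /= 10;
  assert(idx == 1);  // every code point in this range has lead byte 0x82
  out[0] = static_cast<uint8_t>(0x81 + idx);
  return true;
}
```

A test script along the lines described in the comment could call this for every code unit in the range and compare the result against the converter's output, so that a 0000 result shows up as a mismatch.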
Assignee
Updated•23 years ago
Attachment #51127 - Attachment is obsolete: true
Assignee
Updated•23 years ago
Attachment #51631 - Attachment is obsolete: true
Assignee
Comment 13•23 years ago
I am very, very confident about this patch. It passes my stress test and round-trip unit test.
Assignee
Comment 14•23 years ago
OK, let us walk through the patch here.

1. nsGBKConvUtil.cpp
   a. If we hit a surrogate char, return false to save performance (neither table indexing nor linear search).
   b. In the case of U+AFXX, some characters cannot be mapped to the two-byte range of GB18030. Therefore we need to return false if the map is 0, to indicate that an additional table lookup is needed in such cases. This solves the problem I mentioned earlier.
2. nsGBKToUnicode.cpp
   a. Change the logic so that it checks the byte range and hands the input to the surrogate decoder if it fits into that range.
   b. Add the algorithm-based four-byte to surrogate mapping.
3. nsGBKToUnicode.h
   Add DecodeToSurrogate for both the GBK function and the GB18030 one.
4. nsUnicodeToGBK.cpp
   a. Add the algorithm-based surrogate to four-byte mapping.
   b. Check the surrogate range to tune performance.

To test this, we can load http://warp.mcom.com/u/ftang/utf8test/gb18030.cgi
I ran the normal test; it displays characters as I expected.
I also ran a round-trip test (I will check the test script into intl/uconv/tests later). It shows no problems.
I also tested buffer boundaries with http://warp.mcom.com/u/ftang/utf8test/buffer.cgi

shanjian, please r= this one.
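The algorithm-based mappings in items 2b and 4a can be sketched in C++ roughly as follows. This is an illustration under stated assumptions, not the code from the patch: the function names are hypothetical, and the only facts taken from this bug are the range 90 30 81 30 .. E3 32 9A 35 and its in-order layout over U+10000 .. U+10FFFF.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Linear positions of 90 30 81 30 and E3 32 9A 35, counted from 81 30 81 30.
constexpr uint32_t kFirst = ((0x90 - 0x81) * 10) * 126 * 10;  // 189000
constexpr uint32_t kLast = kFirst + 0xFFFFF;                  // 1237575

// Hypothetical encoder sketch: UTF-16 surrogate pair -> four GB18030 bytes.
inline bool SurrogatePairToFourBytes(uint16_t high, uint16_t low,
                                     std::array<uint8_t, 4>& out) {
  // Cheap range check first, so non-surrogate input is rejected early.
  if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF)
    return false;
  const uint32_t offset = ((high - 0xD800u) << 10) | (low - 0xDC00u);
  assert(offset <= 0xFFFFF);
  uint32_t idx = kFirst + offset;
  out[3] = static_cast<uint8_t>(0x30 + idx % 10);
  idx /= 10;
  out[2] = static_cast<uint8_t>(0x81 + idx % 126);
  idx /= 126;
  out[1] = static_cast<uint8_t>(0x30 + idx % 10);
  idx /= 10;
  out[0] = static_cast<uint8_t>(0x81 + idx);
  return true;
}

// Hypothetical decoder sketch: four GB18030 bytes -> UTF-16 surrogate pair.
inline bool FourBytesToSurrogatePair(const std::array<uint8_t, 4>& in,
                                     uint16_t& high, uint16_t& low) {
  // Lead byte check decides whether the surrogate path applies at all.
  if (in[0] < 0x90 || in[0] > 0xE3) return false;
  if (in[1] < 0x30 || in[1] > 0x39 || in[2] < 0x81 || in[2] > 0xFE ||
      in[3] < 0x30 || in[3] > 0x39)
    return false;
  const uint32_t idx =
      (((in[0] - 0x81u) * 10 + (in[1] - 0x30u)) * 126 + (in[2] - 0x81u)) * 10 +
      (in[3] - 0x30u);
  if (idx > kLast) return false;  // E3 32 9A 36 and above are not mapped
  const uint32_t offset = idx - kFirst;
  high = static_cast<uint16_t>(0xD800 + (offset >> 10));
  low = static_cast<uint16_t>(0xDC00 + (offset & 0x3FF));
  return true;
}
```

Encoding a pair and decoding the result should return the original pair, which is the property a round-trip test like the one described above would exercise.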
Assignee
Comment 15•23 years ago
Brian Yuan, you are welcome to review this too. I will send you my test CGI.
Comment 16•23 years ago
Thanks Frank, I will have a look. Brian.
Comment 17•23 years ago
@@ -348,6 +436,7 @@
       break;
     }
   }
+afterwhileloop:
   *aDestLength = iDestLength;
   *aSrcLength = iSrcLength;
   return res;

If "afterwhileloop" is not used anywhere else, you might want to remove it. r=shanjian after that.
Assignee
Comment 18•23 years ago
afterwhileloop is not used by any goto statement, but I refer to it from a comment on a break statement so that the logic is easier to understand.
Updated•23 years ago
Attachment #68338 - Flags: review+
Comment 19•23 years ago
Comment on attachment 68338
v3 of the patch. fix unicode to gb18030 problem

sr=alecf
Attachment #68338 - Flags: superreview+
Updated•23 years ago
QA Contact: teruko → ylong
Comment 20•23 years ago
Switching qa contact to ylong@netscape.com.
Assignee
Comment 21•23 years ago
Fixed and checked in.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED