Unicode to GBK converter not working for some GBK chars

VERIFIED FIXED in M16

Status

Core
Internationalization
P3
normal
VERIFIED FIXED
18 years ago
18 years ago

People

(Reporter: yueheng.xu, Assigned: yueheng.xu)

Tracking

Trunk

Details

Attachments

(3 attachments)

(Assignee)

Description

18 years ago
Mr. Xianping Ge reported a Unicode to GBK converter bug to me in a few
email messages. I am relaying them here for the Bugzilla record.


=============================== Msg 2 =====================

In message <7DAA70BEB463D211AC3E00A0C96B7AB20359EFEA@orsmsx41.jf.intel.com>,
"Xu, Yueheng" writes:
>	Dear Mr. Ge,
>
>	Thank you for the time you spent on this. Did you test your local
>build on both the Windows and Linux platforms?

I tested it on Linux. I do not have a Windows box.

>	I have not touched that code for months since the last check-in, and now
>I am busy with other things. Let's test
>	them thoroughly offline before we check them in. Do you have a test
>page that contains characters
>	that are in GBK but not in GB2312?
>
> Can you send me those test files (or put them on a publicly accessible web
> server) so I can verify the
> problems you mentioned and also verify that any changes we make are effective?

http://www.ics.uci.edu/~xge/clinux/gbk-test-files/
  Tao-Hua-Yuan.gbk.html
  zhu-rong-ji.gbk.html
  xcin.gbk.html

Talk to you later.

 -- Xianping
 xge@ics.uci.edu
=============================== Msg 1 =====================

>> The problem is that Mozilla does not load my "gb13000.1993-1" font
>> for rendering GBK text. It seems to get the glyphs from GB2312, Japanese,
>> etc, and cannot display the character 'Rong(2)' in 'Premier Zhu Rongji'.
>> Any suggestion?
>> 
>As for GBK support in Mozilla, it is already working; you just need to set
>charset to x-gbk in your HTML page's meta tag and it will work.
>If it is not working, blame Frank Tang ( ftang@netscape.com ).  See the
>email paragraph below.
>
>
>GB2312V2 - for now "windows-936"
>HZ- "HZ-GB-2312"
>GBK- "x-gbk"

I spent last night figuring this out. I removed all other Han fonts (GB,
Big5, Japanese, etc.) and Mozilla loaded my "gb13000.1993-1" font, but it still
had problems correctly rendering a GBK (or even a simple GB2312) file. I looked
into some of your files (under mozilla/intl/uconv/ucvcn) and found some
typos. Attached is the patch (based on M11) for your consideration;
can you merge it into the main CVS if it's OK?

Here is a short description of the patches:
1. Associate X charset encoding "gb13000.1993-1" with mime type "x-gbk":
  gfx/src/gtk/nsFontMetricsGTK.cpp
  gfx/src/xlib/nsFontMetricsXlib.cpp

2. Add a menu item for GBK (close to the menuitem GB):
  editor/ui/composer/content/editorOverlay.xul
  editor/ui/composer/locale/en-US/editorOverlay.dtd
  mailnews/base/resources/locale/en-US/messenger.dtd
  mailnews/compose/resources/content/messengercompose.xul
  mailnews/compose/resources/locale/en-US/messengercompose.dtd
  xpfe/browser/resources/content/navigatorOverlay.xul
  xpfe/browser/resources/locale/en-US/navigator.dtd

3. Associate Linux locale zh_CN.GBK with "x-gbk".
  intl/uconv/src/unixcharset.properties

4. Some typos and bugs corrected. (New bugs introduced? :-)
  intl/uconv/ucvcn/nsGB2312ToUnicodeV2.cpp
  intl/uconv/ucvcn/nsGBKToUnicode.cpp
  intl/uconv/ucvcn/nsUnicodeToGB2312V2.cpp
  intl/uconv/ucvcn/nsUnicodeToGBK.cpp
  intl/uconv/ucvcn/nsUnicodeToHZ.cpp

 The "corrections" for these files:
   - change the starting value of the GBK right byte from 0x41 to 0x40.
   - change the "row size" from (0x00FE - 0x0080) to 0x00BF (== 0xFE - 0x3F).
   - remove the conflict between the variables named i in the outer and
     inner blocks.
   - The results of (i / 0x00BF + 0x0081)
                    (i % 0x00BF + 0x0040)
     should not be (or need not be) OR'ed with 0x80.
     For GBK, OR'ing the right byte may be wrong, e.g. when it lies
     between 0x40 and 0x80.
   - big-endian problem. The first byte of a Uint16 is not
     the least significant byte on a big-endian machine:
     + #if 0  // This will run into trouble for big-endian machines
       pSrcDBCode = (DByte *)pSrc;
       *aDest = pSrcDBCode->leftbyte;
     + #else
     + *aDest= (unsigned char)(*pSrc);
     + #endif
   - struct packing. In some places, you alias (using a pointer) the
     two bytes in a char array with a DByte struct. The two bytes in
     a struct may or may not be packed tightly; there might be a hole
     if the two bytes are 4-byte aligned. This does not seem to be
     a problem with GCC, but personally I think the alternative
     (working directly with aDest[0] = 'x', aDest[1] = 'x') is
     simpler and more robust. 5 years ago, I wrote some C code on
     SCO Unix, and it took me a long time to find the bug introduced
     by struct packing/padding.
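
The corrections listed above can be sketched roughly as follows. This is only an illustration and not the code from the attached patch: the constants (row size 0x00BF, left-byte base 0x0081, right-byte base 0x0040) come from the description, while every function and type name is hypothetical. It shows the index arithmetic without the |0x80 step, and byte splitting by shift and mask rather than by aliasing a struct.

```cpp
#include <cstdint>

// Constants taken from the description above; all names are hypothetical.
constexpr unsigned kRowSize   = 0x00BF;  // right bytes 0x40..0xFE, i.e. 0xFE - 0x3F
constexpr unsigned kLeadBase  = 0x0081;  // first GBK left (lead) byte
constexpr unsigned kTrailBase = 0x0040;  // first GBK right (trail) byte

struct GbkBytes {
  unsigned char lead;
  unsigned char trail;
};

// Map a linear table index to a GBK byte pair. Neither byte is OR'ed with
// 0x80: a right byte below 0x80 (for example 0x46) is a valid GBK value.
constexpr GbkBytes IndexToGbk(unsigned index) {
  return GbkBytes{static_cast<unsigned char>(index / kRowSize + kLeadBase),
                  static_cast<unsigned char>(index % kRowSize + kTrailBase)};
}

// Inverse mapping, useful for checking the arithmetic.
constexpr unsigned GbkToIndex(unsigned char lead, unsigned char trail) {
  return (lead - kLeadBase) * kRowSize + (trail - kTrailBase);
}

// Split a 16-bit code with a shift and a mask instead of aliasing a
// two-byte struct over it; the result does not depend on the machine's
// byte order or on struct padding.
constexpr unsigned char HighByte(std::uint16_t code) {
  return static_cast<unsigned char>(code >> 8);
}
constexpr unsigned char LowByte(std::uint16_t code) {
  return static_cast<unsigned char>(code & 0xFF);
}

// Write the pair straight into the destination buffer (dest[0], dest[1]).
inline void WriteGbk(unsigned index, unsigned char* dest) {
  const GbkBytes b = IndexToGbk(index);
  dest[0] = b.lead;
  dest[1] = b.trail;
}
```

With these definitions, index 0 maps to the byte pair 0x81 0x40, index 0xBE to 0x81 0xFE (the end of the first row), and index 0xBF to 0x82 0x40 (the start of the second row). OR'ing a right byte such as 0x46 with 0x80 would turn it into 0xC6, which is a different code.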

 -- Xianping
 xge@ics.uci.edu

Comment 1

18 years ago
Created attachment 6401 [details] [diff] [review]
nsFontMetricsGTK.diff

Comment 2

18 years ago
Created attachment 6402 [details] [diff] [review]
nsFontMetricsXlib.diff

Comment 3

18 years ago
Created attachment 6404 [details] [diff] [review]
nsUnicodeToGBK.diff
(Assignee)

Comment 4

18 years ago
Code is ready for check-in, pending review from Ftang@netscape.com
Status: NEW → ASSIGNED

Comment 5

18 years ago
erik-
Please review the changes from yueheng.xu@intel.com:

      Created an attachment (id=6404) nsUnicodeToGBK.diff - r=ftang
      Created an attachment (id=6402) nsFontMetricsXlib.diff <- erik please
      review. I don't think this makes sense, since no code refers to GBK here.
      Created an attachment (id=6401) nsFontMetricsGTK.diff <- erik please
      review. I think it is reasonable, but we should let erik approve this.

Comment 6

18 years ago
Can we mark this M16?

Comment 7

18 years ago
As far as I know, the Xlib version is not being maintained/developed any more,
so there is no need to check in that fix. Also, I agree with Frank that it is
incomplete anyway.

The other one (for GTK) is fine.

Comment 8

18 years ago
Can you check this in by 5/16? If so, please mark this M16.
(Assignee)

Comment 9

18 years ago
I will check the fix in this week.
Target Milestone: --- → M16
(Assignee)

Comment 10

18 years ago
The fix was checked in last night. You will probably need a GBK font installed
to test it. On the test page zhu-rong-ji.gbk.html listed below, a GBK-enabled
browser should render Mr. Zhu Rong-Ji's name correctly, whereas a GB2312-only
browser will miss the character 'Rong'.

http://www.ics.uci.edu/~xge/clinux/gbk-test-files/
  Tao-Hua-Yuan.gbk.html
  zhu-rong-ji.gbk.html
  xcin.gbk.html
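
The difference between a GBK-enabled and a GB2312-only decoder described in this comment can be approximated with a byte-range check. This is a minimal sketch and not Mozilla code. It assumes the commonly documented ranges: GB2312 in its EUC-CN form uses left bytes 0xA1 to 0xF7 and right bytes 0xA1 to 0xFE, while GBK uses left bytes 0x81 to 0xFE and right bytes 0x40 to 0xFE except 0x7F. All names are hypothetical, and the check looks only at ranges, so unassigned cells inside a range are still classified by that range.

```cpp
// Hypothetical classification of a two-byte sequence by byte range only.
enum class DoubleByteKind { Invalid, GB2312, GBKOnly };

constexpr bool InRange(unsigned char b, unsigned char lo, unsigned char hi) {
  return b >= lo && b <= hi;
}

constexpr DoubleByteKind Classify(unsigned char lead, unsigned char trail) {
  return (InRange(lead, 0xA1, 0xF7) && InRange(trail, 0xA1, 0xFE))
             ? DoubleByteKind::GB2312    // reachable by a GB2312-only decoder
         : (InRange(lead, 0x81, 0xFE) && InRange(trail, 0x40, 0xFE) &&
            trail != 0x7F)
             ? DoubleByteKind::GBKOnly   // needs a GBK-aware decoder
             : DoubleByteKind::Invalid;
}
```

A pair whose right byte is below 0x80, such as 0xE9 0x46, falls in the GBKOnly class, which is the kind of character a GB2312-only browser would fail to render.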
Status: ASSIGNED → RESOLVED
Last Resolved: 18 years ago
Resolution: --- → FIXED

Comment 11

18 years ago
I verified this in the 2000-05-31-08 build.
Status: RESOLVED → VERIFIED