Closed Bug 182089 Opened 22 years ago Closed 22 years ago

Extended hkscs.uf and hkscs.ut for HKSCS-2001 support

Categories

(Core :: Internationalization, enhancement)

enhancement
Not set
normal

VERIFIED FIXED

People

(Reporter: anthony, Assigned: shanjian)

Details

(Keywords: intl, Whiteboard: New hkscs.uf/hkscs.ut and source files are attached, courtesy of ThizLinux Laboratory Ltd.)

Attachments

(11 files, 3 obsolete files)

72.45 KB, image/png
95.15 KB, image/png
290.07 KB, text/plain (shanjian: review+, blizzard: superreview+)
89.72 KB, text/plain (shanjian: review+, blizzard: superreview+)
126.90 KB, text/plain
85.56 KB, text/plain
2.63 KB, text/plain
66.73 KB, image/png
71.03 KB, image/png
75.82 KB, image/png
30.08 KB, text/plain
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; zh-TW; rv:1.1) Gecko/20020913 Debian/1.1-1
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; zh-TW; rv:1.1) Gecko/20020913 Debian/1.1-1

The current HKSCS support in Mozilla is limited to HKSCS-1999.  HKSCS-2001 has
now been out for nearly a year, and support for it is maturing on various
operating systems (GNU/Linux, MS Windows, etc.), so Mozilla should support it too.

Reproducible: Always

Steps to Reproduce:
1. Open a HKSCS test page, e.g.:
http://www.thizlinux.com/~anthony/hkscs/Big5-HKSCS-2001-test-81xx.html

Actual Results:  
Note that (Big5 0x8C40-0x8C7E, 0x8CA1-0x8CFE, 0x8D40-0x8D5F), new in HKSCS-2001,
are not properly displayed in the current Mozilla 1.1 or 1.2 beta as of November
2002.

Expected Results:  
The 116 newly added characters in HKSCS-2001 (Big5 0x8C40-0x8C7E, 0x8CA1-0x8CFE,
0x8D40-0x8D5F) should be properly displayed.  Also, HKSCS-2001 has moved some
characters from the Unicode Private Use Area into the Unicode CJK Ideographs
areas, so the hkscs.uf and hkscs.ut tables should be updated with the latest
HKSCS-2001 data provided by the Hong Kong SAR Government.
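
For anyone building a quick test of the ranges quoted above, here is a minimal Python sketch that enumerates the two-byte code positions they cover. The function names are made up for illustration. Note that the ranges as written span more positions than the 116 characters mentioned, so this lists code positions, not assigned characters.

```python
# Big5 ranges quoted in this report as new in HKSCS-2001 (inclusive bounds).
NEW_RANGES = [(0x8C40, 0x8C7E), (0x8CA1, 0x8CFE), (0x8D40, 0x8D5F)]


def code_positions(ranges=NEW_RANGES):
    """Yield every two-byte code position covered by the given ranges."""
    for start, end in ranges:
        for code in range(start, end + 1):
            yield code


def as_bytes(code):
    """Return the raw lead and trail bytes for a two-byte code position."""
    return bytes([code >> 8, code & 0xFF])


positions = list(code_positions())
print(len(positions))                 # how many code positions the ranges span
print(as_bytes(positions[0]).hex())   # raw bytes of the first position
```

The raw bytes can be written into a test page served as Big5-HKSCS; whether a glyph appears still depends on the installed font.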
Whiteboard: Patch (contributed by ThizLinux Laboratory) already created; will upload soon
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: intl
Yay, it seems the patch works.  :-)

Of course, I have an HKSCS-2001 font installed on my system (bmin00l.ttf,
Arphic Mingti Light B5, with HKSCS-2001, version 2.50), otherwise the
characters probably won't show.

The screenshots were taken on my Debian GNU/Linux system at home.  I'll test it
tomorrow at work with our Thiz Linux system.  :-)

The patch was generated as follows:

1. I wrote a Perl script to change Gavin Ho's HKSCS-1999 hkscs.ut and hkscs.uf
back to source mapping tables (à la "xscii.txt")
2. I wrote another Perl script (modified from previous scripts I used for glibc
and Qt) to try to re-create the same "xscii.txt" tables using HKSCS-1999 data.
3. Generated new xscii.ut.txt and xscii.uf.txt files from HKSCS-2001 data.
4. Ran "tou" and "fromu" binaries to generate the new hkscs.ut and hkscs.uf
files.

If you are interested in these scripts, I'll be glad to attach them too.  I
have to admit, they may be a bit quick-and-dirty.  ;-)

Cheers,

Anthony Fok
on behalf of ThizLinux Laboratory Ltd., Hong Kong
http://www.thizlinux.com/
I shall never upload a compressed patch file to Bugzilla again.  :-)
(oops!)  I will upload hkscs.uf and hkscs.ut as plain text instead.
Attachment #107514 - Attachment is obsolete: true
Attachment #107517 - Attachment mime type: application/octet-stream → text/plain
This script currently discards everything above U+FFFF, hence hkscs.uf contains
no mapping from "CJK Ideographs Extension B" (U+20000 and above).  My question
is, does Mozilla support beyond the Basic Multilingual Plane?  If so, could
fromu, tou, and related helper libraries be modified to support U+20000+?  
If not, are there future plans for such support?
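
To illustrate the limitation described above, here is a rough Python sketch (not the attached Perl script) of a filter that formats Big5/Unicode pairs as table source lines and sets aside anything above U+FFFF. The function name is hypothetical, the "0xXXXX 0xYYYY" line format is assumed from the excerpts quoted elsewhere in this bug, and the Plane 2 pairing in the sample data is invented purely for illustration.

```python
def to_table_lines(pairs, max_code_point=0xFFFF):
    """Format (big5, unicode) pairs as '0xXXXX 0xYYYY' lines.

    Pairs whose Unicode value lies above max_code_point are not written;
    they are returned separately so the caller can see what was discarded.
    """
    kept, dropped = [], []
    for big5, ucs in pairs:
        if ucs > max_code_point:
            dropped.append((big5, ucs))
        else:
            kept.append(f"0x{big5:04X} 0x{ucs:04X}")
    return kept, dropped


sample = [
    (0x8942, 0xF3A2),   # pair quoted in this bug
    (0xC6CD, 0x2F33),   # pair quoted in this bug
    (0x8C40, 0x20021),  # HYPOTHETICAL pairing, only to exercise the filter
]
kept, dropped = to_table_lines(sample)
print("\n".join(kept))
print(f"{len(dropped)} mapping(s) above U+FFFF discarded")
```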

I am curious because the ITSD (IT Services Department of the HKSAR Government)
wishes systems/applications to eventually support the full HKSCS-2001 (with
ISO/IEC 10646-1:2000 and ISO/IEC 10646-2:2001 support) to minimize and
eventually eliminate the dependency on the Private User Area, and yes, we
GNU/Linux vendors in Hong Kong get asked by ITSD about this every now and then.
 ;-)

By the way, no, I haven't tested this patch on MS Windows (not a Windows
developer myself), although I suspect it would work just fine on Windows
NT/2000/XP.  Not so sure about Windows 98, as I recall that there were existing
bug reports about missing chars in Windows 98?  Dunno.  :-)

Cheers, and thanks again!  :-)

Anthony
Whiteboard: Patch (contributed by ThizLinux Laboratory) already created; will upload soon → New hkscs.uf/hkscs.ut and source files are attached, courtesy of ThizLinux Laboratory Ltd.
>does Mozilla support beyond the Basic Multilingual Plane?

Yes, on Windows and Mac, and there is work in progress for support on Linux (bug
127713)

Shanjian and ftang, who can look at the attachments and assess them?
Oops, my previous "parse-mozilla-encoding-table.pl" was buggy:
I forgot to remove an "exit", among other things.  Bugs crept in when I tried to
clean up the script somewhat before my initial upload and didn't test it.  My
apologies.  :-)

By the way, may I go ahead and start to revise big5.ut and big5.uf too?
Reading the Mozilla zh_TW L10n web site (Francis Lin), I noticed that Taiwanese
users were forced to use Big5-HKSCS in order to display Japanese
hiragana/katakana characters.  Yes, this has to do with the numerous Big5
variations.  I intend to revise big5.{uf,ut} to bring them in line with the
upcoming Big5 standard being worked on by the OpenI18n-big5 subgroup.  I'll
file another bug report when I have something ready.  :-)
Attachment #107525 - Attachment is obsolete: true
Take it for myself. 
Assignee: smontagu → shanjian
Thanks, Anthony, for your great work. Is there a test page which contains a full
enumeration of the characters in HKSCS? Once we have verified with such a test
page, we can go ahead and check it in. I can verify this with a Windows build;
Frank or Naoki can take a look on Mac.

>> My question
>> is, does Mozilla support beyond the Basic Multilingual Plane?  If so, could
>> fromu, tou, and related helper libraries be modified to support U+20000+?
>> If not, are there future plans for such support?
How does HKSCS plan to support non-BMP characters? For GB18030, surrogates are
encoded in such a way that algorithmic conversion is possible. If the same is
true for HKSCS, we need to change the code to support that.
Hello Shanjian,

Thanks for looking into this report.  Yes, a complete enumeration of Big5-HKSCS
(including Big5 proper) is available at:

http://www.thizlinux.com/~anthony/hkscs/Big5-HKSCS-2001-test-81xx.html
http://www.thizlinux.com/~anthony/hkscs/Big5-HKSCS-2001-test-90xx.html
http://www.thizlinux.com/~anthony/hkscs/Big5-HKSCS-2001-test-A0xx.html
http://www.thizlinux.com/~anthony/hkscs/Big5-HKSCS-2001-test-B0xx.html
http://www.thizlinux.com/~anthony/hkscs/Big5-HKSCS-2001-test-C0xx.html
http://www.thizlinux.com/~anthony/hkscs/Big5-HKSCS-2001-test-D0xx.html
http://www.thizlinux.com/~anthony/hkscs/Big5-HKSCS-2001-test-E0xx.html
http://www.thizlinux.com/~anthony/hkscs/Big5-HKSCS-2001-test-F0xx.html

Note that B5+8140-B5+87FE are supposed to be blank because they are either
unassigned or reserved for end-user defined characters.  There are some other
scattered "blank spots", which are normal.

If you would rather have the enumeration list in some other form, please don't
hesitate to let me know.

These pages display properly if I use <html> or <html lang="zh-TW">, but not
<html lang="zh-HK">, perhaps because Mozilla does not recognize "zh-HK" and
therefore picked the wrong fontset to display the page such that not all
characters are displayed.  It would be nice to fix this too, perhaps by making
zh-HK an alias of zh-TW for the purpose of font selection.

By the way, there are a few other minor issues, e.g. XFree86 uses "big5hkscs-0"
rather than "hkscs-1", etc.  I have a rough, incomplete todo list:

    mozilla/intl/uconv/src/unixcharset.properties:
        There is no zh_HK.Big5-HKSCS, and no "big5-hkscs"!

    mozilla/gfx/src/qt/nsFontMetricsQT.cpp:
    mozilla/gfx/src/gtk/nsFontMetricsGTK.cpp:
    mozilla/gfx/src/xlib/nsFontMetricsXlib.cpp:
    mozilla/gfx/src/xprint/nsFontMetricsXlib.cpp:
        static nsFontCharSetInfoXlib HKSCS =
          { "hkscs-1", DoubleByteConvert, 1,
              TT_OS2_CPR1_CHINESE_SIMP, 0 };
        # Shouldn't that be TT_OS2_CPR1_CHINESE_TRAD?
        # See mozilla/gfx/src/x11shared/nsFreeType.h for #define

        static nsFontCharSetMapXlib gCharSetMap[] =
        # add big5hkscs-0 to the list

        (It seems these files have been reorganized in Mozilla 1.2)

    mozilla/gfx/src/gtk/nsFT2FontCatalog.cpp:
        May need to add range for HKSCS?  Dunno.  Maybe not.

Should I file a new bug report for these miscellaneous issues, or should I add
information to this bug report?  Many thanks!  :-)

Anthony
Hello again,

I forgot to mention how HKSCS uses Plane 2 in ISO 10646-2:2001.  Note that HKSCS
is not intended to cover the whole Unicode range (U+0000 to U+10FFFF) like
GB18030 does.  Rather, it adds characters used in Hong Kong to Big5 and to the
ISO 10646 standard.  HKSCS contains 1651 characters in Plane 2 (discrete values
ranging from U+20021 to U+2F9D4), so no, they cannot be calculated
algorithmically; they have to be looked up in a table instead.
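
To make the surrogate question concrete: the Big5-to-Unicode step needs a table, but once a Plane 2 code point is known, splitting it into a UTF-16 surrogate pair is standard arithmetic. A minimal Python sketch, with a helper name chosen only for illustration:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
    return high, low


# The two end points of the Plane 2 span mentioned above.
for cp in (0x20021, 0x2F9D4):
    high, low = to_surrogate_pair(cp)
    print(f"U+{cp:X} -> 0x{high:04X} 0x{low:04X}")
```

A table format limited to 16-bit values would therefore need either two code units per entry or a wider value type to carry these mappings.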

For more information, please see the table at:

http://www.info.gov.hk/digital21/chi/hkscs/download/big5-iso.txt

Cheers,

Anthony
Please file separate bugs for separate issues. 

I filed bug 183037 for zh-HK issue. 

I don't think our existing table converter will work for surrogates. We may have
to change the code of converter for hkscs. But let's keep that one as another
issue, and file another bug for it.
I downloaded HKSCS-2001 from
http://www.info.gov.hk/digital21/eng/hkscs/download.html. With your test page, I
did see great improvement. But I found a few suspicious things, and I am not sure
whether they are problems or not.

With your test page, many code points are still displayed as "?". Are those
characters unassigned, or is this a problem? Code points like 8942, 8944, 8a42,
and 8a75, just to name a few.

The glyphs for code points 9648 and 9649 seem very suspicious. I don't know what
the correct glyphs look like, so it may not be a problem.
Hello Shanjian,

These three screenshots are attached for comparison.  On Linux systems,
characters like 8942, 8944, 8a42, 8a75 etc. show as "blanks" instead of
question marks.  These are unassigned codepoints in HKSCS (probably for
compatibility with HKSCS's ancestor GCCS "Government Common Character Set"),
and they are mapped to PUA instead.  For example, in the source files that I
used to feed to "fromu" and "tou":

    $ grep -i 8942 mozilla-xscii-hkscs-2001-u?.txt
    mozilla-xscii-hkscs-2001-uf.txt:0x8942  0xF3A2
    mozilla-xscii-hkscs-2001-ut.txt:0x8942  0xF3A2

I suppose rendering them as "blanks" or as "?" question marks is equally
acceptable.  I suppose MS Windows (or at least the font you're using) uses the
question mark to mean "missing glyph", whereas on my Linux system with Arphic
font, the blank is shown instead.
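
As a rough illustration of the check described above, the following Python sketch parses one mapping line in the format shown by the grep output and tests whether the Unicode side falls in the BMP Private Use Area (U+E000 to U+F8FF). The function names are hypothetical.

```python
PUA_START, PUA_END = 0xE000, 0xF8FF  # BMP Private Use Area bounds


def parse_mapping_line(line):
    """Parse a '0xXXXX  0xYYYY' line into (big5, unicode) integers."""
    big5_text, ucs_text = line.split()
    return int(big5_text, 16), int(ucs_text, 16)


def is_private_use(code_point):
    """Return True if the code point lies in the BMP Private Use Area."""
    return PUA_START <= code_point <= PUA_END


big5, ucs = parse_mapping_line("0x8942  0xF3A2")
print(f"Big5 0x{big5:04X} -> U+{ucs:04X}, private use: {is_private_use(ucs)}")
```

A font without a glyph at such a private-use code point will fall back to its own "missing glyph" shape, which would explain the blank versus "?" difference between systems.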

By the way, just to make sure, do you see each of these characters rendered as
_one_ single question mark, or _two_ question marks?  ("one" is correct, "two"
is not.)

As for 9648 and 9649, they are rendered correctly on my system, as
"gold+rhino" and "sheep+rhino".

If you are interested, you may download the HKSCS-2001 (with all the correct
glyphs listed in tables) in PDF format at:

    http://www.info.gov.hk/digital21/eng/hkscs/document.html
or  http://www.info.gov.hk/digital21/chi/hkscs/document.html

Cheers,

Anthony
I have finally verified the 2 tables (hkscs.uf, hkscs.ut) on Windows. What
delayed me were several characters defined in the PUA, namely 0x9648, 9649, 965f
and 96a3. Those code points were used by the "Times New Roman" font on my
machine, so the glyphs displayed were incorrect. After I set the HKSCS font for
Big5, everything seems fine.

Anthony, would you mind if I check in the 2 Perl scripts in the
/mozilla/intl/uconv/tools directory? They don't need r/sr/a since they are
not in the build, but I do want to see them there for future reference.

I will go ahead and drive this bug in.
Status: NEW → ASSIGNED
Attachment #107517 - Flags: superreview?(blizzard)
Attachment #107517 - Flags: review+
Comment on attachment 107517 [details]
hkscs.uf that supports HKSCS-2001

rs=blizzard on this auto-generated code
Attachment #107517 - Flags: superreview?(blizzard) → superreview+
Hello all,

I cleaned up the script somewhat and have finally grouped the different output
options into different subroutines.  But yes, I still should add command-line
options when I have time.  :-)

Note that if you generate a new hkscs-uf.txt with this script, it will differ
slightly from the old one:

--- mozilla-xscii-hkscs-2001-uf.txt	2002-11-26 15:05:44.000000000 +0800
+++ /tmp/uf.txt 2002-12-10 23:52:09.000000000 +0800
@@ -219,7 +219,31 @@
 0xC8EF 0x2ED7
 0xC8F0 0x2EDE
 0xC8F1 0x2EE3
+0xC6BF 0x2F02
+0xC6C0 0x2F03
+0xC6C1 0x2F05
+0xC6C2 0x2F07
+0xC6C3 0x2F0C
+0xC6C4 0x2F0D
+0xC6C5 0x2F0E
+0xC6C6 0x2F13
+0xC6C7 0x2F16
+0xC6C8 0x2F19
+0xC6C9 0x2F1B
+0xC6CA 0x2F22
+0xC6CB 0x2F27
+0xC6CC 0x2F2E
 0xC6CD 0x2F33
+0xC6CE 0x2F34
+0xC6CF 0x2F35
+0xC6D0 0x2F39
+0xC6D1 0x2F3A
+0xC6D2 0x2F41
+0xC6D3 0x2F46
+0xC6D4 0x2F67
+0xC6D5 0x2F68
+0xC6D6 0x2FA1
+0xC6D7 0x2FAA
 0xC6E0 0x3005
 0xC6E1 0x3006
 0xC6E2 0x3007

Those are the additional mappings from the "Kangxi Radicals" area, not
specified in the HKSCS-2001 standard, but what the OpenI18N Big5 subgroup
intends to add to the upcoming TW-BIG5 standard.  (You can see in the Perl
scripts that I added those mappings one by one manually.  I added them to
generate the CharMapML tables for the TW-BIG5 and Big5-HKSCS discussion.)

I think we don't need to worry about this minor detail in Mozilla as yet.
These additions, if finalized by the OpenI18N-Big5 subgroup, could be added
later.  There _will_ be a HKSCS-2003 or HKSCS-2004 anyhow.  :-)  Or perhaps
these additions will be added to big5.uf instead.  We'll see.

Cheers,

Anthony
Attachment #107527 - Attachment is obsolete: true
Hello Shanjian,

Oh, sorry, I forgot to answer your question.  Yes, please feel free to check my
Perl scripts in.  That's why I attached them to this bug in the first place.  It
took me quite a while to figure out how Gavin Ho generated the HKSCS-1999 data
because he didn't detail his procedures for generating the data.

So yes, having these new Perl scripts checked in would be helpful to us or to
whoever intends to revise hkscs.{uf,ut} next time around.  :-)

Thanks again,

Cheers,

Anthony
Comment on attachment 107519 [details]
hkscs.ut that supports HKSCS-2001

forgot this part.
Attachment #107519 - Flags: superreview?(blizzard)
Attachment #107519 - Flags: review+
Comment on attachment 107519 [details]
hkscs.ut that supports HKSCS-2001

rs=blizzard on auto-generated code
Attachment #107519 - Flags: superreview?(blizzard) → superreview+
Attachment #107517 - Flags: approval1.0.x?
Attachment #107517 - Flags: approval1.0.x?
fix checked in. 
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Verified that it displays properly on the 12-16 trunk build on WinXP, Linux
RH7.2 and Mac 10.2.2 with the data mentioned in comment #14.
Status: RESOLVED → VERIFIED