Big5 Unicode Mapping Table Update

RESOLVED FIXED in mozilla1.8.1beta1

Status

RESOLVED FIXED
14 years ago
12 years ago

People

(Reporter: piaip, Assigned: smontagu)

Tracking

({fixed1.8.1})

Trunk
mozilla1.8.1beta1
fixed1.8.1
Points:
---
Bug Flags:
blocking1.8rc1 -
blocking1.8.1 +

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(4 attachments)

(Reporter)

Description

14 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.8b5) Gecko/20050921 Firefox/1.4
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.8b5) Gecko/20050921 Firefox/1.4

The Big5 (the most popular charset for Traditional Chinese) to Unicode mapping
table used in the Mozilla source tree was last touched by bug #9686.
However, the table should be updated again, for the following reasons.

First, please allow me to explain the brief history of #9686 and the Big5 variants.
There are many Big5 variants (or extensions) currently in use.
Windows has its own table named "CP950", which is widely used, but it lacks
some Unicode mappings, such as Japanese hiragana/katakana, that are included in
other Big5 variants and already used in many files, web pages, and documents.
Mozilla's Big5 table was similar to CP950 before #9686.
So that is mainly what we did in bug #9686: add these mappings and correct some
wrong mappings.

The most important Big5 variants are (ordered by number of mappings, from least
to most):
- CP950      (used by Windows)
- Big5-2003  (now the official standard of the Taiwan government)
- UAO        (Unicode-At-On, an unofficial variant that tries to add most CJK Unihan)
  P.S.: UAO is installed by many people in Taiwan. It was almost compatible with
        Big5-2003, although the latest version is slightly incompatible with
        Big5-2003 and Big5-HKSCS.
  A comparison table of the Big5 variants and their code pages can be found on
  Big5-2003's introduction page: http://www.cns11643.gov.tw/web/big5/ (in Chinese,
  sorry)

The table currently used by Mozilla* is very similar to Big5-2003.

The problem is: if a user browsing non-Big5 pages (e.g., SJIS or UTF-8)
copies some characters that are not in CP950 (e.g., Japanese hiragana) and pastes
them into a Big5 website, then other users with a pure CP950 environment (e.g., a
Japanese user on Japanese Windows and Internet Explorer) cannot see these characters
correctly. They will mostly get a blank display. But if we use the real CP950 table,
then these characters will be encoded in HTML entity form, so that everybody (even
with the original CP950 + IE) can read them correctly.

So I'd like to suggest the following changes:
(1) Unicode -> Big5 should use the original CP950 table for maximum compatibility.
(2) Big5 -> Unicode can use Big5-2003, or even UAO.
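
The effect of change (1) can be sketched roughly as follows. This is only an
illustration, not Mozilla code: TOY_FROMU is a one-entry, hand-written stand-in
for the real Unicode -> Big5 table (big5.uf), and encode_for_form is a
hypothetical name. Any character missing from the table is written out as a
numeric character reference instead of Big5 bytes.

```python
# Toy stand-in for a strict CP950 Unicode -> Big5 table (NOT the real big5.uf).
TOY_FROMU = {
    "\u4e2d": b"\xa4\xa4",  # U+4E2D is Big5 0xA4A4
}


def encode_for_form(text, fromu=TOY_FROMU):
    """Encode text for submission to a Big5 page (hypothetical helper).

    ASCII passes through, characters found in the table become Big5 bytes,
    and everything else becomes a numeric character reference (&#NNNNN;).
    """
    out = bytearray()
    for ch in text:
        if ord(ch) < 0x80:
            out += ch.encode("ascii")
        elif ch in fromu:
            out += fromu[ch]
        else:
            out += "&#{};".format(ord(ch)).encode("ascii")
    return bytes(out)


# U+3042 (HIRAGANA LETTER A) is not in the toy table, so it is escaped.
print(encode_for_form("\u4e2d\u3042"))  # b'\xa4\xa4&#12354;'
```

A reader whose system only knows CP950 can still render the escaped form,
because the browser resolves the reference to the Unicode character itself.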

P.S.: Does anyone know where to get the "fromu" and "tou" tools that are required
      to generate the new Mozilla Big5 tables?


Reproducible: Always

Steps to Reproduce:
1. Browse a SJIS or UTF-8 web page and copy Japanese hiragana/katakana characters
2. Find a Big5 website with text area forms (e.g., a phpBB forum), paste and submit
3. Browse the result page with non-Mozilla browsers (e.g., IE or Opera) on a
non-Big5-2003 system (e.g., Windows, or unpatched Linux)
Actual Results:  
Non-Mozilla browsers show blank characters

Expected Results:  
Should be Japanese hiragana/katakana characters (in &#12345; HTML entity form)

CP950 Unicode Mapping Table (from Unicode.org):
http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
Big5-2003 Unicode Mapping Table:
http://moztw.org/docs/big5/big5-2003.txt

Comment 1

14 years ago
It's good to know that Big5 has been standardized by the Taiwanese government.
Something similar to what you suggested is done for a couple of encodings
Mozilla supports. Before going further, let me ask you a question. Is the
character repertoire of CP950 a subset of that of Big5-2003? Moreover, do
characters in the intersection of the two have exactly the same code point
assignments in CP950 and Big5-2003? Well, I can check them out myself, but I'm
being lazy here, thinking you'll be able to answer them more quickly.
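
Both questions can be checked mechanically once the two tables are loaded. A
minimal sketch follows. It assumes both files use the
"0xBIG5 <TAB> 0xUNICODE <TAB> #comment" layout of the Unicode.org vendor
tables (the layout of the Big5-2003 file is an assumption), and the function
names and two-row samples are made up for illustration.

```python
def parse_mapping(text):
    """Parse '0xA140<TAB>0x3000<TAB>#comment' lines into {big5: unicode}."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank, comment-only, or lead-byte-only line
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def compare(a, b):
    """Return (repertoire of a is a subset of b, conflicting assignments)."""
    is_subset = set(a.values()) <= set(b.values())
    conflicts = {
        code: (a[code], b[code])
        for code in a.keys() & b.keys()
        if a[code] != b[code]
    }
    return is_subset, conflicts


# Two-row samples for illustration only, not the full tables.
CP950_SAMPLE = "0xA140\t0x3000\t#IDEOGRAPHIC SPACE\n0xA156\t0x2013\t#EN DASH\n"
BIG5_2003_SAMPLE = "0xA140\t0x3000\n0xA156\t0x2015\n"

subset, conflicts = compare(parse_mapping(CP950_SAMPLE),
                            parse_mapping(BIG5_2003_SAMPLE))
print(subset, {hex(k): [hex(v) for v in pair] for k, pair in conflicts.items()})
```

Run against the full files, an empty conflict dictionary together with a true
subset flag would mean the answer to both questions is "yes".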


OS: Windows XP → All
Hardware: PC → All
(Reporter)

Comment 2

14 years ago
We (Mozilla Taiwan) are currently making new tables and asking members to
test them. We'll try our best to complete these in a few days, and
we do hope they can land on the Mozilla 1.8 branch.

(In reply to comment #1)
> Is the character repertoire of CP950 a subset of that of Big5-2003? 
> Do characters in the intersection of two have exactly the same code point
> assignments in CP950 and Big5-2003?

I'm afraid that the answer may be "No".
Big5-2003 is a superset of CP950 in most cases, but there are differences in the
Symbols section.
9 characters in this section look the same but have different Unicode values.
I mean, they look almost the same (you may check these at Unicode.org:
 http://www.unicode.org/charts/unihan.html), for example:
  Big5 0xA156:  Big5-2003 U+2015,  CP950 U+2013
So we will need to put both U+2015 and U+2013 in the "fromu" table.

The other differing symbols are:
  Big5    Big5-2003  CP950
  0xA1C2  U+203E     U+00AF
  0xA2A4  U+2501     U+2550
  0xA2A5  U+251D     U+255E
  0xA2A6  U+253F     U+256A
  0xA2A7  U+2525     U+2561
  0xA2CC  U+3038     U+5341
  0xA2CD  U+3039     U+5344
  0xA2CE  U+303A     U+5345
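
The point about putting both code points in "fromu" can be sketched like this.
It is only a toy model, assuming a decode table that holds one Unicode value
per Big5 code and an encode table that may hold several Unicode values for the
same Big5 code. The build_tables name and its prefer_cp950 switch are
hypothetical, and only the 0xA156 row is used.

```python
# Big5 code -> (Big5-2003 value, CP950 value), taken from the 0xA156 row.
DASH_ROW = {0xA156: (0x2015, 0x2013)}


def build_tables(rows, prefer_cp950=True):
    """Build toy 'tou' (Big5 -> Unicode) and 'fromu' (Unicode -> Big5) dicts."""
    tou = {}
    fromu = {}
    for big5, (b2003, cp950) in rows.items():
        # Decoding must pick exactly one Unicode value per Big5 code.
        tou[big5] = cp950 if prefer_cp950 else b2003
        # Encoding accepts either look-alike and maps it to the same code.
        fromu[b2003] = big5
        fromu[cp950] = big5
    return tou, fromu


tou, fromu = build_tables(DASH_ROW)
print(hex(tou[0xA156]), hex(fromu[0x2015]), hex(fromu[0x2013]))
```

Whether the same treatment suits every row is a separate question: for the
last three rows the CP950 value is an ordinary ideograph that may already have
its own Big5 code, so a blanket reverse mapping might not be wanted there.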

BTW, UAO uses only the unused (user private area) part of CP950, so CP950
IS exactly a subset of UAO. UAO is also designed to be compatible with Big5-2003
(but since it is a superset of CP950, it has the same problem in the Symbols section),
so our plan now is to make a "tou" (Big5 to Unicode) table based on Big5-2003
plus the compatible UAO mappings.
(Reporter)

Comment 4

14 years ago
The draft versions of the resulting tables are:
http://moztw.org/docs/big5/table/moz18-b2u.txt
http://moztw.org/docs/big5/table/moz18-u2b.txt

I'll attach big5.ut and big5.uf after we have completed and
verified several tests.
(Reporter)

Comment 5

14 years ago
It seems that the new table works fine for most people. The only special
case is Hong Kong users (Hong Kong uses Big5, but with its own
extension named Big5-HKSCS, which Mozilla also supports as a
separate charset).

Although Mozilla has a "Big5-HKSCS" charset, IE has none (only Big5),
so many web pages still declare themselves as plain "Big5".
For all non-HK users, the only way to see HKSCS characters in Mozilla is to set the
charset to Big5-HKSCS, so they won't be bothered by the new table. This also applies
to HK users who installed Big5 extensions that do not change the system font.

So exactly who will be affected? Those who installed Microsoft HKSCS (which changes
both the system NLS table and the system font) and browse Big5-HKSCS pages (which
declare only "Big5" in their content-type meta directive) without setting the charset
to Big5-HKSCS. Because MS HKSCS changes the system font, it puts the HK character
glyphs in the font's user private area (following the mappings of the original Big5).
So whether or not the program converts the multibyte text to the correct Unicode
values, the user can always "see" the correct glyphs ("see" only, because they are
actually different Unicode values when copied, pasted, or written to disk).

This may be the only issue with the new table. If we want to be fully compatible,
we can change the UAO mappings in the user private area back to Big5-2003. However,
since there is still Big5-HKSCS, maybe this is not necessary...
(Reporter)

Comment 6

14 years ago
We've decided that it should be OK to apply the UAO extension table. Here are the
reasons:

1. Mozilla DOES have a Big5-HKSCS charset.
2. Many websites that support both ANSI text and HTML modes (e.g., a website
   providing telnet/SSH services and a newsgroup service) already use the UAO charset.
   A user can always successfully browse Big5-HKSCS pages with Mozilla without
   the HKSCS extension installed on his PC, but a user cannot browse UAO pages even
   with the UAO extension installed.

Because the conflict comes from wrong meta information (charset=Big5) on those
Big5-HKSCS pages, we believe a better solution to this issue is to provide a
preference to determine "how to select which Big5 uconv to use", or an extension
that converts all charset=big5 meta requests to big5-hkscs.
   
(Reporter)

Comment 9

14 years ago
The final version of diff file for new Big5 table
Attachment #198205 - Flags: review?(smontagu)
(Reporter)

Comment 10

14 years ago
The final version of new table [with Big5-2003+UAO] of big5.ut
Attachment #198206 - Flags: review?(smontagu)
(Reporter)

Comment 11

14 years ago
Please use attachments 198205 and 198206 to patch in the new Big5 tables.
They have already been tested in several unofficial builds of Firefox.

The big5.uf (Unicode -> Big5) table is based on strict CP950. All mappings
to the user private area and buggy areas are eliminated and follow CP950.

The big5.ut (Big5 -> Unicode) table is based on CP950 plus Big5-2003 (i.e.,
mappings that conflict between Big5-2003 and CP950 still follow CP950 for
compatibility, to make the table a complete superset of CP950). For the user
private area, the mappings follow Big5-2003 and are overridden by the UAO 2.41
extension.
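
The layering described for big5.ut can be sketched as a dictionary merge with a
fixed precedence. This is a toy model under stated assumptions: build_tou is a
hypothetical name, "user private area" is approximated as "any Big5 code that
CP950 does not define", and the 0xFA40 entries are placeholder values rather
than rows from the real Big5-2003 or UAO tables.

```python
def build_tou(cp950, big5_2003, uao):
    """Merge three Big5 -> Unicode dicts into one decode table (toy model)."""
    table = dict(big5_2003)   # start from Big5-2003
    table.update(cp950)       # on conflicts, follow CP950
    for code, ucs in uao.items():
        if code not in cp950:  # UAO only overrides codes CP950 leaves undefined
            table[code] = ucs
    return table


cp950 = {0xA156: 0x2013}
big5_2003 = {0xA156: 0x2015, 0xFA40: 0xE000}  # 0xFA40 value is a placeholder
uao = {0xFA40: 0x5159}                        # placeholder, not a real UAO row

merged = build_tou(cp950, big5_2003, uao)
print({hex(k): hex(v) for k, v in merged.items()})
```

With this precedence every CP950 mapping survives unchanged, which is what
makes the merged table a superset of CP950.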
(Reporter)

Updated

14 years ago
Attachment #198205 - Flags: review?(smontagu) → review?
(Reporter)

Updated

14 years ago
Attachment #198206 - Flags: review?(smontagu) → review?
(Reporter)

Comment 12

14 years ago
One more comment. If you worry about compatibility, please at least
commit big5.uf (attachment 198205) as soon as possible, because this bug is
affecting more and more users recently and we really do hope it gets committed
before the upcoming Fx1.5. Is this possible?

big5.ut (b->u) is more of an "improvement" that changes a lot,
while big5.uf (u->b) is basically the original Big5/CP950, so it is almost harmless
in every respect and is a real "bug fix". However, we still do wish big5.ut
to be committed at the same time.

The files have been tested by several volunteers for a while and should be OK
for most users.
(Reporter)

Comment 13

14 years ago
(In reply to comment #6)
> Because the conflict comes from wrong meta information (charset=Big5) on those
> Big5-HKSCS pages, we believe a better solution to this issue is to provide a
> preference to determine "how to select which Big5 uconv to use", or an extension
> that converts all charset=big5 meta requests to big5-hkscs.
   This can be solved by writing big5=BIG5-HKSCS in res/charsetalias.properties.
   Maybe we can split off Big5-UAO as an independent locale (because it does not have
   an official name in IANA yet), but the current approach seems good enough for now.

   For an HKSCS user in the situation mentioned in comment #5, a solution is to
   modify res/charsetalias.properties. (This may be achieved by an XPI.)
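
   How such an alias line changes charset selection can be modelled roughly as
   below. This is not Mozilla's implementation, only a sketch of a
   key=value alias lookup. The parse_aliases and resolve_charset names are
   hypothetical, and the sample text contains just the one line suggested above.

```python
def parse_aliases(text):
    """Parse 'alias=Canonical' lines, ignoring blanks and '#' comments."""
    aliases = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        aliases[key.strip().lower()] = value.strip()
    return aliases


def resolve_charset(label, aliases):
    """Map a page's declared charset label to the converter name to use."""
    return aliases.get(label.strip().lower(), label)


# Hypothetical excerpt holding only the line proposed in this comment.
SAMPLE = "# redirect plain Big5 pages to the HKSCS converter\nbig5=BIG5-HKSCS\n"

aliases = parse_aliases(SAMPLE)
print(resolve_charset("Big5", aliases))   # BIG5-HKSCS
print(resolve_charset("UTF-8", aliases))  # UTF-8
```

   With the alias in place, a page that declares charset=Big5 is decoded with
   the Big5-HKSCS converter, while every other label is left alone.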
(Reporter)

Comment 14

14 years ago
(In reply to comment #13)
>    This can be solved by writing big5=BIG5-HKSCS in res/charsetalias.properties.
>    For an HKSCS user in the situation mentioned in comment #5, a solution is to
>    modify res/charsetalias.properties. (This may be achieved by an XPI.)

  A sample XPI demonstrating this solution can be found at
  http://moztw.org/dls/xpi/hkscs.xpi
(Reporter)

Updated

14 years ago
Attachment #198205 - Attachment description: cvs diff for /intl/uconv/ucvtw/big5.uf [fromu] → (patchset) cvs diff for /intl/uconv/ucvtw/big5.uf [fromu]
(Reporter)

Updated

14 years ago
Attachment #198206 - Attachment description: cvs diff for /intl/uconv/ucvtw/big5.ut [tou], Big5-2003+UAO → (patchset) cvs diff for /intl/uconv/ucvtw/big5.ut [tou], Big5-2003+UAO
(Reporter)

Comment 15

14 years ago
The patches have been tested by Taiwan users for a while (in unofficial
community builds), so they should be stable enough to be committed for 1.8 and trunk.
Flags: blocking1.8rc1?
Too late in the game to block on non-critical changes.
Flags: blocking1.8rc1? → blocking1.8rc1-

Updated

14 years ago
Flags: blocking1.9a1?
Flags: blocking1.8.1?
(Reporter)

Updated

14 years ago
Flags: blocking1.9a1?
Flags: blocking1.8.1?

Updated

14 years ago
Flags: blocking1.8.1?

Comment 17

13 years ago
I wonder if Mozilla can apply the Big5-2003 + UAO patch to Firefox 2.0?
Leaving this problem unsolved will just continue to inconvenience Chinese users.
(Assignee)

Comment 18

13 years ago
Comment on attachment 198205 [details] [diff] [review]
(patchset) cvs diff for /intl/uconv/ucvtw/big5.uf [fromu]

I can't assess these patches codepoint by codepoint, but I am happy to accept them based on comments 12 and 15. Auto-generated table patches in intl don't need super-review, but I'd like jshin's approval before checking in.
Attachment #198205 - Flags: review?(smontagu)
Attachment #198205 - Flags: review?(jshin1987)
Attachment #198205 - Flags: review+
(Assignee)

Updated

13 years ago
Attachment #198206 - Flags: review?(smontagu)
Attachment #198206 - Flags: review?(jshin1987)
Attachment #198206 - Flags: review+

Comment 19

13 years ago
Thanks a lot, smontagu!
I hope these patches can be committed before the official release of Firefox 2.0.

Comment 20

13 years ago
Comment on attachment 198205 [details] [diff] [review]
(patchset) cvs diff for /intl/uconv/ucvtw/big5.uf [fromu]

Sorry for the long delay.

I'll edit big5.uf and big5.ut to add the URLs of the conversion tables you used.

LXR will point back to this bug, so we could do without them, but it is still nice to have them.
Attachment #198205 - Flags: review?(jshin1987) → review+

Comment 21

13 years ago
Comment on attachment 198206 [details] [diff] [review]
(patchset) cvs diff for /intl/uconv/ucvtw/big5.ut [tou], Big5-2003+UAO

r=jshin
Attachment #198206 - Flags: review?(jshin1987) → review+

Comment 22

13 years ago
Thank you, jshin!
BTW, apart from Big5-2003, there is a bug in the Big5-HKSCS table...
The one that Mozilla uses is too old.
The Hong Kong government updated the Big5-HKSCS table in 2004 on its official site...
I hope Mozilla can fix this bug as well.
Here is the table released by the HK government:
http://www.info.gov.hk/digital21/chi/hkscs/download/hkscs-2004-big5-iso.txt

For more information about the update, please go to
http://www.info.gov.hk/digital21/eng/hkscs/mapping_table.html
Whiteboard: [checkin needed]
Target Milestone: --- → mozilla1.8.1beta1
Attachment #198205 - Flags: approval1.8.1+
Attachment #198206 - Flags: approval1.8.1+
(Assignee)

Comment 23

13 years ago
Checked in to trunk
(Assignee)

Comment 24

13 years ago
Checked in to MOZILLA_1_8_BRANCH. BTW, I added links to the conversion tables as suggested in comment 20 to all checkins.
Status: NEW → RESOLVED
Last Resolved: 13 years ago
Keywords: fixed1.8.1
Resolution: --- → FIXED
Whiteboard: [checkin needed]