Closed Bug 310299 Opened 15 years ago Closed 15 years ago
Big5 Unicode Mapping Table Update
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.8b5) Gecko/20050921 Firefox/1.4
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.8b5) Gecko/20050921 Firefox/1.4

The Big5 (the most popular charset for Traditional Chinese) to Unicode mapping table in the Mozilla source tree was last touched by bug #9686. The table should be updated again, for the following reasons.

First, a brief history of #9686 and the Big5 variants. Many Big5 variants (or extensions) are currently in use. Windows has its own table, "CP950", which is widely used but lacks some Unicode mappings, such as Japanese hiragana/katakana, that are included in other Big5 variants and already used in many files, web pages and documents. Mozilla's Big5 table was similar to CP950 before #9686, so what we mainly did in bug #9686 was add these mappings and correct some wrong ones.

The most important Big5 variants are (ordered by number of mappings, from least to most):
- CP950 (used by Windows)
- Big5-2003 (now the official standard of the Taiwan government)
- UAO (Unicode-At-On, an unofficial variant that tries to add most CJK Unihan characters)

P.S.: UAO is installed by many people in Taiwan. It was almost compatible with Big5-2003, although the latest version is slightly incompatible with Big5-2003 and Big5-HKSCS.

A comparison table of the Big5 variants and their code pages can be found on Big5-2003's introduction page: http://www.cns11643.gov.tw/web/big5/ (in Chinese, sorry)

The table currently used by Mozilla is very similar to Big5-2003. The problem is this: if a user browsing a non-Big5 page (e.g., SJIS or UTF-8) copies some characters that are not in CP950 (e.g., Japanese hiragana) and pastes them into a Big5 website, then other users with a pure CP950 environment (e.g., a Japanese user on Japanese Windows with Internet Explorer) cannot see these characters correctly. They will mostly get a blank display. If we used the real CP950 table instead, those characters would be encoded in HTML entity form, so everybody (even with original CP950 + IE) could read them correctly.

So I'd like to suggest the following changes:
(1) Unicode -> Big5 should use the original CP950 table for maximum compatibility.
(2) Big5 -> Unicode can use Big5-2003, or even UAO.

P.S.: Does anyone know where to get the "fromu" and "tou" tools that are required to generate a new Mozilla Big5 table?

Reproducible: Always

Steps to Reproduce:
1. Browse a SJIS or UTF-8 web page and copy some Japanese hiragana/katakana characters.
2. Find a Big5 website with a text area form (e.g., a phpBB forum), paste, and submit.
3. Browse the result page with a non-Mozilla browser (e.g., IE or Opera) on a non-Big5-2003 system (e.g., Windows, or unpatched Linux).

Actual Results: Non-Mozilla browsers show blank characters.

Expected Results: Should be Japanese hiragana/katakana characters (in &#12345; HTML entity form).

CP950 Unicode Mapping Table (from Unicode.org): http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
Big5-2003 Unicode Mapping Table: http://moztw.org/docs/big5/big5-2003.txt
It's good to know that Big5 has been standardized by the Taiwanese government. Something similar to what you suggested is already done for a couple of encodings Mozilla supports. Before going further, let me ask you a question. Is the character repertoire of CP950 a subset of that of Big5-2003? Moreover, do the characters in the intersection of the two have exactly the same code point assignments in CP950 and Big5-2003? I could check this myself, but I'm being lazy here, thinking you'll be able to answer more quickly.
OS: Windows XP → All
Hardware: PC → All
We (Mozilla Taiwan) are currently building new tables and asking members to test them. We'll try our best to complete this in a few days, and we do hope it can land on the Mozilla 1.8 branch.

(In reply to comment #1)
> Is the character repertoire of CP950 a subset of that of Big5-2003?
> Do characters in the intersection of two have exactly the same code point
> assignments in CP950 and Big5-2003?

I'm afraid the answer may be "No". Big5-2003 is a superset of CP950 in most cases, but there are differences in the Symbols section. Nine characters in this section look the same but have different Unicode values. I mean, they look almost the same (you may check these at Unicode.org: http://www.unicode.org/charts/unihan.html). For example, Big5 0xA156 is U+2015 in Big5-2003 but U+2013 in CP950, so we will need to put both U+2015 and U+2013 in the "fromu" table.

The full list of differing symbols is:

Big5     Big5-2003  CP950
0xA156   U+2015     U+2013
0xA1C2   U+203E     U+00AF
0xA2A4   U+2501     U+2550
0xA2A5   U+251D     U+255E
0xA2A6   U+253F     U+256A
0xA2A7   U+2525     U+2561
0xA2CC   U+3038     U+5341
0xA2CD   U+3039     U+5344
0xA2CE   U+303A     U+5345

BTW, UAO uses only the unused (user private area) part of CP950, so CP950 IS exactly a subset of UAO. UAO is also designed to be compatible with Big5-2003 (but since it is a superset of CP950, it has the same problem in the Symbols section), so our plan now is to make a "tou" (Big5 to Unicode) table based on Big5-2003 plus the compatible UAO mappings.
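As a minimal sketch of what "putting both code points in the fromu table" means for 0xA156: both Unicode values are made to encode to the same Big5 code. The helper name `add_aliases` and the plain-dict table are assumptions for illustration, not the format of big5.uf. The helper deliberately keeps any entry that already exists, since a code point may already have a Big5 slot of its own; the comment above only spells out the 0xA156 pair, so only that pair is applied here.

```python
def add_aliases(fromu: dict, big5: int, code_points: list) -> None:
    """Map every listed Unicode code point to the same Big5 code.

    Existing entries are left untouched, so a code point that already
    has a Big5 slot keeps it.
    """
    for cp in code_points:
        fromu.setdefault(cp, big5)


# Unicode code point -> Big5 code (a toy stand-in for the fromu table).
fromu = {}
add_aliases(fromu, 0xA156, [0x2015, 0x2013])

for cp, big5 in sorted(fromu.items()):
    print(f"U+{cp:04X} -> 0x{big5:04X}")
```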
Big5<->Unicode Mapping Tables
(all presented in [big5-value unicode-value] format; b2u = toU = Big5->Unicode, u2b = fromU = Unicode->Big5)

CP950
http://moztw.org/docs/big5/table/cp950-b2u.txt
http://moztw.org/docs/big5/table/cp950-u2b.txt

Big5-2003
http://moztw.org/docs/big5/table/big5_2003-b2u.txt
http://moztw.org/docs/big5/table/big5_2003-u2b.txt

UAO 2.41
http://moztw.org/docs/big5/table/uao241-b2u.txt
http://moztw.org/docs/big5/table/uao241-u2b.txt
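Tables like these can be compared mechanically to answer the subset question from comment #1. The sketch below assumes each line holds two whitespace-separated hexadecimal fields (Big5 value, then Unicode value) and that `#` starts a comment, as in Unicode.org's CP950.TXT; the exact layout of the moztw.org files has not been checked. The function names are hypothetical, and the sample rows are toy excerpts based on the differences listed earlier in this bug, not the real files.

```python
def parse_table(text: str) -> dict:
    """Parse 'big5 unicode' lines of hex values into a dict."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split columns
        if len(fields) < 2:
            continue  # blank line, comment, or unmapped code
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def compare(a: dict, b: dict):
    """Return (codes of a missing from b, codes mapped differently)."""
    missing = sorted(k for k in a if k not in b)
    differing = sorted(k for k in a if k in b and a[k] != b[k])
    return missing, differing


cp950_sample = "0xA156 0x2013\n0xA1C2 0x00AF\n0xA440 0x4E00  # toy excerpt\n"
b2003_sample = "0xA156 0x2015\n0xA1C2 0x203E\n0xA440 0x4E00\n"

missing, differing = compare(parse_table(cp950_sample), parse_table(b2003_sample))
print("missing:", [hex(k) for k in missing])
print("differing:", [hex(k) for k in differing])
```

An empty `missing` list together with a non-empty `differing` list corresponds to the answer given above: the repertoire is covered, but some shared codes are assigned different Unicode values.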
The draft version of the resulting tables is:
http://moztw.org/docs/big5/table/moz18-b2u.txt
http://moztw.org/docs/big5/table/moz18-u2b.txt

I'll attach big5.ut and big5.uf after we have completed and verified several tests.
It seems that the new table works fine for most people. The only special case is Hong Kong users. (Hong Kong uses Big5 with its own extension, named Big5-HKSCS, which Mozilla also supports as a separate charset.) Although Mozilla has a "Big5-HKSCS" charset, IE has none (only Big5), so many web pages still describe themselves as "Big5" only.

For all non-HK users, the only way to see HKSCS characters in Mozilla is to set the charset to Big5-HKSCS, so they won't be bothered by the new table. This also applies to HK users who installed Big5 extensions that do not change the system font.

So exactly who will be affected? Those who installed Microsoft HKSCS (which changes both the system NLS table and the system font) and who browse Big5-HKSCS pages (which declare only "Big5" in their content-type meta directive) without setting the charset to Big5-HKSCS. Because MS HKSCS changes the system font, it puts the HK character glyphs in the font's user private area (following the mappings of the original Big5). So whether or not the program converts the multibyte text to the correct Unicode, the user can always "see" the correct glyphs ("see" only, because the Unicode values are actually different when copied, pasted or written to disk).

This may be the only issue with the new table. If we want to be fully compatible, we can change the UAO mappings in the user private area back to Big5-2003. However, since there is still Big5-HKSCS, maybe this is not necessary...
We've decided that it should be OK to apply the UAO extension table. Here are the reasons:
1. Mozilla DOES have a Big5-HKSCS charset.
2. Many sites that support both ANSI text and HTML modes (e.g., a website providing telnet/SSH services and a newsgroup service) already use the UAO charset.

A user can always successfully browse Big5-HKSCS pages in Mozilla without the HKSCS extension installed on his PC, but a user cannot browse UAO pages even with the UAO extension installed. Because the conflict comes from wrong meta information (charset=Big5) on those Big5-HKSCS pages, we believe a better solution to this issue is to provide a preference to determine "how to select which Big5 uconv to use", or an extension that converts all charset=big5 meta requests to big5-hkscs.
The final version of diff file for new Big5 table
The final version of new table [with Big5-2003+UAO] of big5.ut
Please use attachment 198205 and attachment 198206 to patch in the new Big5 table. They have already been tested in several unofficial builds of Firefox.

The big5.uf (Unicode->Big5) table is based on strict CP950. All mappings to the user private area and to buggy areas have been eliminated, and the table follows CP950.

The big5.ut (Big5->Unicode) table is based on CP950 plus Big5-2003 (i.e., mappings that conflict between Big5-2003 and CP950 still follow CP950 for compatibility, making the table a complete superset of CP950). For the user private area, the mappings follow Big5-2003, overridden by the UAO 2.41 extension.
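One possible reading of this layering for big5.ut, written as a sketch: CP950 wins wherever it maps a Big5 code outside the Unicode Private Use Area, and elsewhere UAO overrides Big5-2003. The function `build_tou`, the dict-based tables, and the use of U+E000..U+F8FF as the test for "user private area" are all assumptions made for illustration; the attached patches were not necessarily generated this way.

```python
# Unicode Private Use Area in the BMP (U+E000..U+F8FF).
PUA = range(0xE000, 0xF900)


def build_tou(cp950: dict, big5_2003: dict, uao: dict) -> dict:
    """Layer three Big5 -> Unicode tables into one."""
    tou = {}
    for big5 in sorted(set(cp950) | set(big5_2003) | set(uao)):
        cp = cp950.get(big5)
        if cp is not None and cp not in PUA:
            tou[big5] = cp               # conflicts follow CP950
        elif big5 in uao:
            tou[big5] = uao[big5]        # UAO overrides Big5-2003
        elif big5 in big5_2003:
            tou[big5] = big5_2003[big5]  # Big5-2003 fills the rest
        else:
            tou[big5] = cp               # only CP950's private-use mapping exists
    return tou


# The 0xA156 conflict from this bug: the merged table keeps CP950's value.
merged = build_tou({0xA156: 0x2013}, {0xA156: 0x2015}, {})
print(f"0xA156 -> U+{merged[0xA156]:04X}")
```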
One more comment. If you are worried about compatibility, please at least commit big5.uf (attachment 198205) as soon as possible, because it has been bugging more and more users recently and we really hope it is committed before the upcoming Fx1.5. Is this possible? big5.ut (b->u) is more of an "improvement" that changes a lot, while big5.uf (u->b) is basically the original Big5/CP950, so it is almost harmless in every respect and is a real "bug fix". However, we still wish for big5.ut to be committed at the same time. The files have been tested by several volunteers for a while and should be OK for most users.
(In reply to comment #6)
> Because the conflict comes from wrong meta information (charset=Big5) for those
> Big5-HKSCS pages, we believe a better solution to this issue is to provide a
> preference to determine "how to select which Big5 uconv to use", or an extension
> that converts all charset=big5 meta request to big5-hkscs.

This can be solved by writing big5=BIG5-HKSCS in res/charsetalias.properties.

Maybe we can split Big5-UAO off as an independent locale (because it does not have an official IANA name yet), but the current approach seems good enough for now. For an HKSCS user in the situation mentioned in comment #5, a solution is to modify res/charsetalias.properties. (This may be achieved by an XPI.)
(In reply to comment #13)
> This can be solved by writing big5=BIG5-HKSCS in res/charsetalias.properties
> For a HKSCS user in the situation mentioned in comment #5, a solution is to
> modify res/charsetalias.properties. (this may be achieved by an XPI.)

A sample XPI demonstrating this solution can be found at http://moztw.org/dls/xpi/hkscs.xpi
Attachment #198205 - Attachment description: cvs diff for /intl/uconv/ucvtw/big5.uf [fromu] → (patchset) cvs diff for /intl/uconv/ucvtw/big5.uf [fromu]
Attachment #198206 - Attachment description: cvs diff for /intl/uconv/ucvtw/big5.ut [tou], Big5-2003+UAO → (patchset) cvs diff for /intl/uconv/ucvtw/big5.ut [tou], Big5-2003+UAO
The patches have been tested by Taiwan users for a while (in unofficial community builds), so they should be stable enough to be committed for 1.8 and trunk.
Too late in the game to block on non-critical changes.
Flags: blocking1.8rc1? → blocking1.8rc1-
I wonder if Mozilla can apply the Big5-2003 + UAO patch to Firefox 2.0? Leaving this problem unsolved will just continue to inconvenience Chinese users.
Attachment #198205 - Flags: review? → review?(smontagu)
Attachment #198206 - Flags: review? → review?(smontagu)
Comment on attachment 198205 (patchset) cvs diff for /intl/uconv/ucvtw/big5.uf [fromu] I can't assess these patches code point by code point, but I am happy to accept them based on comments 12 and 15. Auto-generated table patches in intl don't need super-review, but I'd like jshin's approval before checking in.
Thanks a lot smontagu! I hope these patches can be committed before the official release of Firefox 2.0.
Comment on attachment 198205 (patchset) cvs diff for /intl/uconv/ucvtw/big5.uf [fromu] Sorry for the long delay. I'll edit big5.uf and big5.ut to add the URLs of the conversion tables you used. lxr will point back at this bug, so we could do without that, but it is still nice to have.
Attachment #198205 - Flags: review?(jshin1987) → review+
Comment on attachment 198206 (patchset) cvs diff for /intl/uconv/ucvtw/big5.ut [tou], Big5-2003+UAO r=jshin
Attachment #198206 - Flags: review?(jshin1987) → review+
Thank you jshin! BTW, apart from Big5-2003, there is a bug about the Big5-HKSCS table: the one that Mozilla uses is too old. The Hong Kong government updated the Big5-HKSCS table in 2004 on its official site, and I hope Mozilla can fix this as well. Here is the table released by the HK government: http://www.info.gov.hk/digital21/chi/hkscs/download/hkscs-2004-big5-iso.txt For more information about the update, please go to http://www.info.gov.hk/digital21/eng/hkscs/mapping_table.html
Whiteboard: [checkin needed]
Target Milestone: --- → mozilla1.8.1beta1
Checked in to trunk
Checked in to MOZILLA_1_8_BRANCH. BTW, I added links to the conversion tables as suggested in comment 20 to all checkins.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Whiteboard: [checkin needed]