Closed Bug 108136 Opened 23 years ago Closed 22 years ago

Shift_JIS conversion problem on MacOS9, OS/2

Categories

(Core :: Internationalization, defect, P2)

Tracking

VERIFIED FIXED
mozilla1.2beta

People

(Reporter: shom, Assigned: smontagu)

Details

(Keywords: intl)

Attachments

(7 files, 2 obsolete files)

The internal mapping table for Japanese is now based entirely on CP932 (bug 54135). MacOS9 and OS/2 use different mapping tables, so some characters are converted incorrectly when Mozilla passes its internal UCS2 codes to native OS functions that handle UCS2.
PROBLEM:
test page: http://rh.vinelinux.org/~shom/sjisprob.html
the problem on MacOS9: http://bugzilla.mozilla.gr.jp/showattachment.cgi?attach_id=364
the problem on OS/2: http://bugzilla.mozilla.gr.jp/showattachment.cgi?attach_id=367
RELATED BUGS: bug 35166, bug 58637, bug 33162, bug 65991
SOLUTIONS:
i) Convert the internal UCS2 codes to codes compatible with the OS native mapping every time an OS function that handles UCS2 is called. SO HARD?
ii) Implement a dual mapping method in the conversion tables. VERY HARD, I think.
iii) Make separate tables for the Shift_JIS variants. Currently the Japanese:UCS2 conversion table is generated from CP932.txt with mkjpconv.pl (bug 54135). Since this tool can generate other mapping tables (e.g. from APPLE_JAPANESE.txt), it is easy to make Shift_JIS(MacOS9) and Shift_JIS(OS2), or Shift_JIS(IBM943). This solution has another advantage: it can handle platform-dependent characters without Unicode sequences (surrogate pairs?).
teruko: can you confirm. ->nhotta
Assignee: yokoyama → nhotta
*** This bug has been confirmed by popular vote. ***
Status: UNCONFIRMED → NEW
Ever confirmed: true
Reassign to ftang.
Assignee: nhotta → ftang
Status: NEW → ASSIGNED
what will happen if we don't fix this?
Priority: -- → P4
Mozilla cannot handle many vendor-specific Shift JIS kanji chars (I know NC4 can). # CP932 contains MS-specific kanji chars, so Windows can handle them :b
In addition, legal chars in JIS X 0208 have conversion problems. [reported in bugzilla-jp <http://bugzilla.mozilla.gr.jp/show_bug.cgi?id=868>]
test page: http://rh.vinelinux.org/~shom/sjisprob2.html
* OS/2
Four SJIS chars (0x815c,0x8160,0x8161,0x817c) have problems.
screenshot: http://bugzilla.mozilla.gr.jp/showattachment.cgi?attach_id=538
screenshot after re-entering the '?' chars and submitting: http://bugzilla.mozilla.gr.jp/showattachment.cgi?attach_id=539
- display problem: (0x815c,0x8160,0x8161,0x817c) are displayed as '?' in the page body, bookmark titles, tabs, and JavaScript alerts. In the titlebar they are displayed as ' '.
- query send problem: when one of (0x815c,0x8160,0x8161,0x817c) is entered in an INPUT type=text / TEXTAREA, the chars following it are truncated. (http://bugzilla.mozilla.gr.jp/showattachment.cgi?attach_id=539)
- compose problem: (0x815c,0x8160,0x8161,0x817c) become &#8212; &#12316; &#8214; &#8722; in the saved page.
- mail/news send problem: (0x815c,0x8160,0x8161,0x817c) are treated as illegal, so the message cannot be sent. If the alert is ignored, 0x815c becomes '--' and the others become '?'.
* Mac OS 9 (and probably Mac OS X)
(0x815c,0x8160,0x8161,0x817c,0x8191,0x8192,0x81ca) have problems.
- query send problem: when one of (0x815c,0x8160,0x8161,0x817c,0x8191,0x8192,0x81ca) is entered in an INPUT type=text / TEXTAREA, the chars following it are truncated.
- mail/news send problem: (0x815c,0x8160,0x8161,0x817c,0x8191,0x8192,0x81ca) are treated as illegal, so the message cannot be sent. If the alert is ignored, 0x815c becomes '--' and the others become '?'.
- bookmark problem: bookmark titles containing (0x815c,0x8160,0x8161,0x817c,0x8191,0x8192,0x81ca) are displayed as blank in the OS menubar.
Blocks: 157673
One of the top problems reported by the Mozilla Japanese group. Not sure how to solve it yet. May need to break it down into separate tasks.
Keywords: intl, nsbeta1+
Priority: P4 → P2
Target Milestone: --- → mozilla1.2beta
Kohei Ichioka has made a patch for this bug. http://www5a.biglobe.ne.jp/~expf/ucvja.tar.gz This file contains readme.txt which explains how to apply the patch. And chado has made a Mac build based on this patch. ftp://download.sourceforge.jp/wazilla/996/Wazilla-mac-1.1-2156c.sea.bin
Adding mkaply to Cc. Tarball in comment 8 contains the patch for OS/2, but Kohei Ichioka hasn't tested it. He doesn't have OS/2. Can you review the patch and test it?
Severity: normal → critical
In the original report:
> MacOS9 and OS/2 have another mapping table
Does the problem exist on MacOSX, or is this specific to MacOS9?
Can anyone attach a patch using cvs diff -u to this bug?
Attached file gzipped patch (obsolete)
* Change the Japanese to Unicode conversion rule
pref("intl.jis0208.map", "Apple") uses the MacJapanese conversion rule.
pref("intl.jis0208.map", "IBM943") uses the IBM943 conversion rule.
* Dual mapping for the Unicode to Japanese conversion rule
CP932 , Apple , IBM943    SJIS   (JIS)
U+2015, U+2014, U+2014 -> 0x815C (01-29)
U+FF5E, U+301C, U+301C -> 0x8160 (01-33)
U+2225, U+2016, U+2016 -> 0x8161 (01-34)
U+FF0D, U+2212, U+2212 -> 0x817C (01-61)
U+FFE0, U+00A2, U+FFE0 -> 0x8191 (01-81)
U+FFE1, U+00A3, U+FFE1 -> 0x8192 (01-82)
U+FFE2, U+00AC, U+FFE2 -> 0x81CA (02-44)
U+FFE4, U+FFE4, U+00A6 -> 0xEEFA (92-92)
U+FFE4, U+FFE4, U+00A6 -> 0xFA55
mozilla/intl/uconv/tools/jamap.pl creates the maps.
mozilla/intl/uconv/ucvja/japanese.map is the map for Japanese to Unicode.
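To make the dual-mapping idea concrete, here is a minimal sketch, not the code from the patch. It assumes dual mapping means the Unicode to Shift_JIS encoder accepts the code point of every variant for a given Shift_JIS code. The names DualMapRow, kDualMapRows, BuildDualEncodeMap, and EncodeToSjis are hypothetical, and the two U+FFE4/U+00A6 rows are left out because this simple map cannot send one code point to two Shift_JIS codes.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// One row of the dual-mapping list: the Unicode code point each variant
// uses for the same Shift_JIS double-byte code.
struct DualMapRow {
    char16_t cp932;
    char16_t apple;
    char16_t ibm943;
    std::uint16_t sjis;
};

// Rows copied from the comment above. The 0xEEFA and 0xFA55 rows are
// omitted (see the note before this block).
const DualMapRow kDualMapRows[] = {
    {0x2015, 0x2014, 0x2014, 0x815C},
    {0xFF5E, 0x301C, 0x301C, 0x8160},
    {0x2225, 0x2016, 0x2016, 0x8161},
    {0xFF0D, 0x2212, 0x2212, 0x817C},
    {0xFFE0, 0x00A2, 0xFFE0, 0x8191},
    {0xFFE1, 0x00A3, 0xFFE1, 0x8192},
    {0xFFE2, 0x00AC, 0xFFE2, 0x81CA},
};

// Hypothetical helper: build a Unicode -> Shift_JIS map that accepts
// every variant's code point for a given Shift_JIS code.
std::map<char16_t, std::uint16_t> BuildDualEncodeMap() {
    std::map<char16_t, std::uint16_t> encode;
    for (const DualMapRow& row : kDualMapRows) {
        assert(row.sjis != 0);  // 0 is reserved below for "no mapping"
        encode[row.cp932] = row.sjis;
        encode[row.apple] = row.sjis;
        encode[row.ibm943] = row.sjis;
    }
    return encode;
}

// Hypothetical helper: returns the Shift_JIS code, or 0 when unmapped.
std::uint16_t EncodeToSjis(char16_t ch) {
    static const std::map<char16_t, std::uint16_t> encode = BuildDualEncodeMap();
    const auto it = encode.find(ch);
    return it == encode.end() ? 0 : it->second;
}
```

With such an encoder, text decoded under any of the three rules encodes back to the same Shift_JIS bytes, so only the decoding direction needs a per-variant choice.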
Matsumoto-san, does the problem exist on MacOSX, or is this specific to MacOS9?
I don't know. But I think MacOSX uses the same conversion rule as MacOS9 for backward compatibility.
Could you give us a patch instead of an application/x-gzip attachment?
Attached patch patch #1/3
Attached patch patch #2/3
Attached patch patch #3/3
Attached patch patch #1/4
Attached patch patch #2/4
Attached patch patch #3/4
Attached patch patch #4/4 (obsolete)
The patch id=98510-98512 is incomplete. id=102147-102150 is the actual patch.
Attachment #102147 - Flags: review+
Attachment #102148 - Flags: review+
Attachment #102149 - Flags: review+
Attachment #102150 - Flags: review+
Attachment #98078 - Attachment is obsolete: true
Changed QA contact to ylong@netscape.com.
QA Contact: teruko → ylong
Comment on attachment 102147 patch #1/4 sr=alecf
Comment on attachment 102148 patch #2/4 sr=alecf
Attachment #102148 - Flags: superreview+
Comment on attachment 102149 patch #3/4 sr=alecf
Attachment #102149 - Flags: superreview+
Comment on attachment 102150 patch #4/4
What does this notation mean?
+ const PRUint16 (*mMapIndex)[128];
This seems a little confusing; how about const PRUint16* mMapIndex[128]? Though actually, are you storing a pointer to a 128 bit array? I think this is a misuse of this type, and what you might really want is PRUint16** mMapIndex.
Also, storing the per-platform value in prefs seems unnecessary... I mean, the value is never going to change, right? Why not just #ifdef the code? Prefs should only be used when the value is going to be changed... the per-platform pref stuff is for when you want the DEFAULT value of the pref to vary based on the platform, but you still expect the user to change it later.
Attachment #102150 - Attachment is obsolete: true
mMapIndex is actually a pointer to an array of 128 PRUint16 values. It points to the first item of gIndex, gCP932Index, or gIBM943Index.
const PRUint16 gIndex[2][128];
const PRUint16 gCP932Index[2][128];
const PRUint16 gIBM943Index[2][128];
If I use PRUint16** mMapIndex, I must use extra variables:
const PRUint16 *const gIndex[2] = { gIndex1, gIndex2 };
const PRUint16 gIndex1[128] = { ... }
const PRUint16 gIndex2[128] = { ... }
...
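The difference between the two declarations can be sketched as follows. This is an illustration only: PRUint16 is aliased to std::uint16_t as a stand-in for the NSPR type, the table contents are made up, and LookupArrayPtr, LookupPtrPtr, and gIndexRows are hypothetical names that do not come from the patch.

```cpp
#include <cassert>
#include <cstdint>

using PRUint16 = std::uint16_t;  // stand-in for NSPR's PRUint16

// Placeholder tables with the shapes named in the comment above; the
// contents are made up for illustration.
const PRUint16 gIndex[2][128] = {{10}, {20}};
const PRUint16 gCP932Index[2][128] = {{11}, {21}};

// Form 1: pointer to an array of 128 PRUint16. A [2][128] table decays
// to this type, so no extra variables are needed.
PRUint16 LookupArrayPtr(const PRUint16 (*index)[128], int row, int col) {
    assert(col >= 0 && col < 128);
    return index[row][col];
}

// Form 2: pointer to pointer. Each table needs separately named rows
// plus an extra array of row pointers.
const PRUint16 gIndex1[128] = {10};
const PRUint16 gIndex2[128] = {20};
const PRUint16* const gIndexRows[2] = {gIndex1, gIndex2};

PRUint16 LookupPtrPtr(const PRUint16* const* index, int row, int col) {
    assert(col >= 0 && col < 128);
    return index[row][col];
}

// Note: `const PRUint16* p[128]` (no parentheses) declares an array of
// 128 pointers, which is a different type from both forms above.
static_assert(sizeof(gIndex[0]) == 128 * sizeof(PRUint16),
              "each row holds 128 PRUint16 values");
```

Both forms index the same way at the call site; the trade-off is that the pointer-to-pointer form adds one more array per table and one more indirection per lookup.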
reassign to smontagu for landing
Assignee: ftang → smontagu
Status: ASSIGNED → NEW
Kohei, can you attach a new version of attachment 102150 addressing alecf's comments? I'm assuming that all 4 attachments need to be checked in together.
In some cases, users will want to change the conversion table.
On Unix, the suitable conversion table depends on the installed fonts, so it is not fixed at compile time.
As another case, suppose a Macintosh Mozilla user has trouble with a web site and contacts the site's engineer, who uses a Windows machine and does not have a Macintosh. The engineer will want to look into the conversion behavior on his Windows machine. (In Japan, troubles related to character conversion occur often.)
If a Windows Mozilla user values compatibility with Java programs over the appearance on the screen, the user will want to use the standard conversion table instead of the Windows (CP932) conversion table.
Re Comment 14: this happens also on Mac OS X.
Comment on attachment 106482 patch #4/4 using PRUint16** mMapIndex
Transferring r=ftang and requesting sr
Attachment #106482 - Flags: superreview?(alecf)
Attachment #106482 - Flags: review+
Comment on attachment 106482 patch #4/4 using PRUint16** mMapIndex
I thought I had commented about this earlier: (maybe it was another bug?) Why are we using prefs to choose the charset on a per-platform basis - can't we do this with #ifdefs? I guess I'm trying to understand the situation where the user will be changing this value? If this isn't going to be changed by the user, then we shouldn't add more dependencies on prefs.
The patch looks ok, but I'm going to hold off on my sr= until this is explained..
See #33 and...
Japanese "Shift JIS" has many variants. Many Japanese Shift JIS pages are labeled with the "Shift_JIS" charset, but some of them are actually Shift_JIS, others are Windows-31J, and others are Apple Japanese, IBM943C, etc. They share the same "encoding" (Shift JIS), but each has its own "charset" and Unicode mapping rules. We Japanese, especially web developers, sometimes want to use the right one for each case.
case-1) vendor-specific Shift JIS characters problem
Until now, Windows-specific chars could not be displayed on Mac/UNIX, nor Mac-specific chars on Windows/UNIX. Now, if we change the charset at runtime, we can see them via iso10646-1 glyph mapping (at the cost of finding glyphs). Especially on UNIX, some users want to use only "Shift_JIS" characters because the cost of searching iso10646 font glyphs is so large, but others want to see "Windows-31J"-specific chars because many web pages (and some mails) use them with "charset=Shift_JIS". IMHO, the best solution is to provide charset/mapping rules for each major variant of Shift JIS, and to let the rule used for "Shift JIS" be specified at runtime. (In addition, the ISO-2022-JP that is compatible with Windows-31J, which many Windows mailers generate, is different from the ISO-2022-JP that is compatible with Shift_JIS, i.e. the JIS spec.)
case-2) Unicode conversion problem on XML with charset=UTF-8
Each Shift JIS variant has its own mapping rules for Unicode. Unfortunately they are not compatible with each other, so in effect there are Shift_JIS-, Windows-31J-, and Apple Japanese-compatible flavors of UTF-8. For example, XML with "charset=UTF-8" that was converted or generated from Shift JIS data by an XML processor using the "Windows-31J/CP932" mapping rules (I think Microsoft products do this) will not be usable on other systems. This problem has not surfaced so far, but it may grow as XML with "charset=UTF-8" comes into wider use.
Comment on attachment 106482 patch #4/4 using PRUint16** mMapIndex
ok, that seems like a reasonable explanation. sr=alecf
By the way, you should learn to use "cvs diff" - you don't need to keep two separate trees around.
Attachment #106482 - Flags: superreview?(alecf) → superreview+
Comment on attachment 102147 patch #1/4
setting sr=alecf per comment 26
Attachment #102147 - Flags: superreview+
Fix checked in.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
The test page: http://rh.vinelinux.org/~shom/sjisprob2.html and http://rh.vinelinux.org/~shom/sjisprob.html are displayed fine on 11-26 trunk build / Mac 9.2.1. Mark as verified as fixed.
Status: RESOLVED → VERIFIED
Attachment #102147 - Flags: approval1.0.x?
Attachment #102148 - Flags: approval1.0.x?
Attachment #102149 - Flags: approval1.0.x?
Attachment #106482 - Flags: approval1.0.x?
No longer blocks: 157673
Depends on: 180372
I'm trying to understand the fix to this bug. At first glance, it seems fundamentally incorrect. From the look of this patch, we're treating incoming content from the web differently depending on platform, so that some characters work on some platforms and some on others. If that's true, it's simply wrong, and should be undone. Was the real problem here that when some platforms use something they call Shift_JIS as their native character encoding (e.g., for the filesystem), they mean different things? If that's the case, then we should call those different things different names, have encoders/decoders for all of them, and fix up the name when determining what the filesystem/native encoding is. Or am I misunderstanding what this fix did?
> Was the real problem here that when some platforms use something they call
> Shift_JIS as their native character encoding (e.g., for the filesystem), they
> mean different things?
Yes, this is the crux of the problem. The differences, however, are limited to a small number of characters. But these characters are often used, too. Now, do we want a full table of encoders/decoders for Mac, Windows, OS/2, etc.? Or do we handle only this small number of characters differently?
Vendors are clear about the differences in their technical specs and even use different names, though some of the names are quite similar. The major problem is the web pages and the way Mozilla used to treat pages that are determined to be in Shift_JIS. On web pages, there is only one dominant name used, i.e. Shift_JIS. We have only one encoding name, i.e. Shift_JIS, in the Character Coding menu, to reflect the overwhelming reality that over 65% of Japanese web pages use it. (The remaining pages use either EUC-JP or ISO-2022-JP.)
It would be nearly impossible to persuade web developers to use different names at this point; this single name has been familiar to most web surfers for over 15 years. Can browser users tolerate different names for encodings that have been treated for so many years as the same Shift_JIS thing (except for a small number of characters)?
Are the pages on the Web in the standard version of Shift_JIS, the Windows version, or the OS/2 version? If they're in the Windows version, then perhaps we should treat "Shift_JIS" as the Windows version of Shift_JIS on all platforms? Treating it as the Windows version only on Windows seems problematic, since it could cause pages to work on Windows and fail on other platforms, which is exactly what we don't want -- and why there should NOT be platform differences at this level of the code.
> Are the pages on the Web in the standard version of Shift_JIS, the Windows
> version, or the OS/2 version?
In reality we cannot tell which, because by now people are used to minor glyph shape differences. A good place to begin is this image showing the differences between Mac and Windows: http://bugzilla.mozilla.gr.jp/attachment.cgi?id=364&action=view
* The leftmost column shows the Shift_JIS codepoints
* The middle column shows the glyphs used by Mac Japanese and the corresponding Unicode points
* The rightmost column shows the Windows glyphs and the corresponding Unicode points
You can see that the same Shift_JIS codepoints lead to slightly different glyph shapes on the 2 platforms. But except for
Shift_JIS 0x007e (overline)
Shift_JIS 0x815F (reverse solidus)
all the others look remarkably alike in glyph shape. Users really don't care about these minor glyph differences. As for the overline and reverse solidus characters, after so many years of seeing these 2 codepoints rendered with different glyph shapes, users now regard the two separate glyphs on different OSes as **cognitively** equivalent. So the glyph shapes are not an issue here. And given this situation, for all practical purposes, Shift_JIS pages on the web can be considered platform-independent.
The real problem happens when Mozilla has to convert internal Unicode points back to OS native encodings. We had been using only the Windows mapping table before this bug got fixed: ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
Take the wave dash character, which is used a lot in mail and web logs in Japanese. This is Shift_JIS 0x8160. On Windows, it maps to \uFF5E. Now on Mac, if we need to convert this to the native encoding, there is no \uFF5E codepoint in the Mac Japanese mapping table: ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT
Hence users see a question mark there on Mac.
Now, if we had converted Shift_JIS 0x8160 to \u301C on Mac in the first place, that would have solved this roundtrip problem for Mac. I believe the current code now takes care of this issue.
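The wave dash roundtrip described above can be sketched as a toy model. It covers only Shift_JIS 0x8160 and assumes, as the comment states, that the Mac Japanese table contains U+301C but not U+FF5E. SjisVariant, DecodeSjis, EncodeForMac, and RoundTripOnMac are hypothetical names, not Mozilla or OS functions.

```cpp
#include <cassert>
#include <cstdint>

enum class SjisVariant { CP932, Apple };

// Decode Shift_JIS 0x8160 (wave dash) under the chosen variant. Only this
// one code is modeled; any other double-byte code returns U+FFFD.
char16_t DecodeSjis(std::uint16_t sjis, SjisVariant variant) {
    assert(sjis > 0xFF);  // the sketch handles double-byte codes only
    if (sjis != 0x8160) return char16_t(0xFFFD);
    return variant == SjisVariant::CP932 ? char16_t(0xFF5E) : char16_t(0x301C);
}

// Toy model of converting Unicode back to the Mac native encoding: the
// table knows U+301C but not U+FF5E, so U+FF5E falls back to '?'.
std::uint16_t EncodeForMac(char16_t ch) {
    if (ch == 0x301C) return 0x8160;
    return 0x3F;  // '?'
}

// Decode with the given rule, then hand the result to the Mac encoder.
std::uint16_t RoundTripOnMac(std::uint16_t sjis, SjisVariant variant) {
    return EncodeForMac(DecodeSjis(sjis, variant));
}
```

In this model, decoding with the CP932 rule yields '?' after the roundtrip on Mac, while decoding with the Apple rule returns the original 0x8160.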
Sorry, I should have said that the first part of conversion from OS encoding -> Unicode created the real problem because we used to use only the Windows mapping. The roundtrip is also a problem. By the way, this type of problem would not have occurred if we lived only in the world of native encodings. The need for conversion to/from Unicode is what exposes this problem so clearly.