Shift_JIS conversion problem on MacOS9, OS/2

VERIFIED FIXED in mozilla1.2beta

Status

P2
critical
VERIFIED FIXED
18 years ago
16 years ago

People

(Reporter: shom, Assigned: smontagu)

Tracking

({intl})

Trunk
mozilla1.2beta

Details

Attachments

(7 attachments, 2 obsolete attachments)

(Reporter)

Description

18 years ago
The internal mapping table for Japanese is now based entirely on CP932 (bug 54135).
MacOS9 and OS/2 have another mapping table, so some characters are converted
incorrectly when Mozilla passes its internal UCS2 codes to native OS functions
that handle UCS2.

PROBLEM:

testpage: http://rh.vinelinux.org/~shom/sjisprob.html

a problem on MacOS9
 http://bugzilla.mozilla.gr.jp/showattachment.cgi?attach_id=364

a problem on OS/2
 http://bugzilla.mozilla.gr.jp/showattachment.cgi?attach_id=367

RELATED BUGS : bug 35166, bug 58637, bug 33162, bug 65991

SOLUTIONs:

i) Convert internal UCS2 codes to the codes the native OS expects every time an
OS function that takes UCS2 is called. TOO HARD?

ii) Implement dual mapping in the conversion tables. VERY HARD, I think.

iii) Make separate tables for the Shift_JIS variants. Currently the Japanese:UCS2
conversion table is generated from CP932.txt with mkjpconv.pl (bug 54135). Since
this tool can generate other mapping tables (e.g. from APPLE_JAPANESE.txt), it is
easy to make Shift_JIS(MacOS9) and Shift_JIS(OS2) -- or Shift_JIS(IBM943). This
solution has another advantage: platform-dependent characters can be handled
without Unicode sequences (surrogate pairs?).

Comment 1

17 years ago
teruko: can you confirm.
->nhotta
Assignee: yokoyama → nhotta

Comment 2

17 years ago
*** This bug has been confirmed by popular vote. ***
Status: UNCONFIRMED → NEW
Ever confirmed: true

Comment 3

17 years ago
Reassign to ftang.

Comment 4

17 years ago
Reassign to ftang.

Assignee: nhotta → ftang

Updated

17 years ago
Status: NEW → ASSIGNED

Comment 5

17 years ago
what will happen if we don't fix this?

Updated

17 years ago
Priority: -- → P4
(Reporter)

Comment 6

17 years ago
Many vendor-specific Shift JIS kanji chars cannot be handled (I know NC4 can).
# CP932 contains MS-specific kanji chars, so they can be handled on Windows :b

Legal chars in JIS X 0208 also have conversion problems.

[reported in bugzilla-jp <http://bugzilla.mozilla.gr.jp/show_bug.cgi?id=868>]

testpage : http://rh.vinelinux.org/~shom/sjisprob2.html

* OS/2 

Four SJIS chars (0x815c,0x8160,0x8161,0x817c) have problems.

screen shot 
  http://bugzilla.mozilla.gr.jp/showattachment.cgi?attach_id=538
screen shot after re-entering the '?' chars and submitting
  http://bugzilla.mozilla.gr.jp/showattachment.cgi?attach_id=539

 - display problem
   (0x815c,0x8160,0x8161,0x817c) are displayed as '?'
   in the page body, bookmark titles, tabs and JavaScript alerts,
   and as ' ' in the title bar.

 - query send problem
   When one of (0x815c,0x8160,0x8161,0x817c) is entered in an INPUT type=text /
   TEXTAREA, the chars following it are truncated.
   (http://bugzilla.mozilla.gr.jp/showattachment.cgi?attach_id=539)

 - compose problem
   (0x815c,0x8160,0x8161,0x817c) become &#8212; &#12316; &#8214; &#8722;
   in the saved page.

 - mail/news send problem
   (0x815c,0x8160,0x8161,0x817c) are treated as illegal, so the message cannot
   be sent. If the alert is ignored, 0x815c becomes '--' and the others
   become '?'.
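
As a cross-check on the compose problem above, the four decimal character references are the Apple/IBM943 code points listed in comment 12 (U+2014, U+301C, U+2016, U+2212), not the CP932 ones. A minimal sketch of that arithmetic, assuming a hypothetical helper NcrToCodePoint that is not part of Mozilla:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not Mozilla code): parse a decimal numeric character
// reference such as "&#8212;" and return its code point, or 0 when the text
// is not of the form "&#<digits>;".
std::uint32_t NcrToCodePoint(const std::string& ncr) {
  if (ncr.size() < 4 || ncr.compare(0, 2, "&#") != 0 ||
      ncr[ncr.size() - 1] != ';') {
    return 0;
  }
  std::uint32_t value = 0;
  // Walk the characters between "&#" and the trailing ';'.
  for (std::string::size_type i = 2; i + 1 < ncr.size(); ++i) {
    const char c = ncr[i];
    if (c < '0' || c > '9') return 0;
    value = value * 10 + static_cast<std::uint32_t>(c - '0');
  }
  return value;
}

// The four references from the compose problem, paired with the SJIS code
// each one was saved in place of.
void CheckComposeReferences() {
  assert(NcrToCodePoint("&#8212;") == 0x2014);   // 0x815c
  assert(NcrToCodePoint("&#12316;") == 0x301C);  // 0x8160
  assert(NcrToCodePoint("&#8214;") == 0x2016);   // 0x8161
  assert(NcrToCodePoint("&#8722;") == 0x2212);   // 0x817c
}
```

This is consistent with the OS supplying the JIS-style code points while the encoder only knew the CP932 ones, so the characters fell back to character references.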


* Mac OS 9 (and probably Mac OS X)

 (0x815c,0x8160,0x8161,0x817c,0x8191,0x8192,0x81ca) have problems.

 - query send problem
   When one of (0x815c,0x8160,0x8161,0x817c,0x8191,0x8192,0x81ca) is entered
   in an INPUT type=text / TEXTAREA, the chars following it are truncated.

 - mail/news send problem
   (0x815c,0x8160,0x8161,0x817c,0x8191,0x8192,0x81ca) are treated as illegal,
   so the message cannot be sent.
   If the alert is ignored, 0x815c becomes '--' and the others become '?'.

 - bookmark problem
   Bookmark titles containing (0x815c,0x8160,0x8161,0x817c,0x8191,0x8192,0x81ca)
   are displayed as blank in the OS menu bar.

Updated

17 years ago
Blocks: 157673

Comment 7

17 years ago
One of the top problems reported by the Mozilla Japanese group. Not sure how to
solve it yet. May need to break it down into different tasks.
Keywords: intl, nsbeta1+
Priority: P4 → P2
Target Milestone: --- → mozilla1.2beta

Comment 8

17 years ago
Kohei Ichioka has made a patch for this bug.

http://www5a.biglobe.ne.jp/~expf/ucvja.tar.gz

This file contains readme.txt which explains how to apply the patch.

And chado has made a Mac build based on this patch.
ftp://download.sourceforge.jp/wazilla/996/Wazilla-mac-1.1-2156c.sea.bin

Comment 9

17 years ago
Adding mkaply to Cc.

Tarball in comment 8 contains the patch for OS/2, but Kohei Ichioka
hasn't tested it. He doesn't have OS/2. Can you review the patch and
test it?
Severity: normal → critical

Comment 10

17 years ago
In the original report,
>MacOS9 and OS/2 have another mapping table
Does the problem exist for MacOSX or is this specific to MacOS9?

Comment 11

17 years ago
Can anyone attach a patch using cvs diff -u to this bug?
(Reporter)

Comment 12

17 years ago
Created attachment 98078 [details]
gzipped patch

* change Japanese to Unicode conversion rule
 pref("intl.jis0208.map", "Apple") uses the MacJapanese conversion rule.
 pref("intl.jis0208.map", "IBM943") uses the IBM943 conversion rule.

* dual mapping for Unicode to Japanese conversion rule
 CP932 ,Apple ,IBM943 -> SJIS  (JIS)
 U+2015,U+2014,U+2014 -> 0x815C(01-29)
 U+FF5E,U+301C,U+301C -> 0x8160(01-33)
 U+2225,U+2016,U+2016 -> 0x8161(01-34)
 U+FF0D,U+2212,U+2212 -> 0x817C(01-61)
 U+FFE0,U+00A2,U+FFE0 -> 0x8191(01-81)
 U+FFE1,U+00A3,U+FFE1 -> 0x8192(01-82)
 U+FFE2,U+00AC,U+FFE2 -> 0x81CA(02-44)
 U+FFE4,U+FFE4,U+00A6 -> 0xEEFA(92-92)
 U+FFE4,U+FFE4,U+00A6 -> 0xFA55

mozilla/intl/uconv/tools/jamap.pl creates maps.
mozilla/intl/uconv/ucvja/japanese.map is the map for Japanese to Unicode.
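
The dual-mapping rule in the table above can be sketched roughly as follows. This is an illustration only, not the code in the patch: the type and function names are hypothetical, only the first seven rows are included (the two U+FFE4/U+00A6 rows are left out because two SJIS codes share them), and a linear search stands in for the real index tables.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Column order follows the table: CP932, Apple, IBM943.
enum Variant { kCP932 = 0, kApple = 1, kIBM943 = 2 };

struct DualMapRow {
  std::uint16_t sjis;        // Shift_JIS code
  std::uint16_t unicode[3];  // Unicode value per Variant
};

const DualMapRow kDualMap[] = {
  {0x815C, {0x2015, 0x2014, 0x2014}},
  {0x8160, {0xFF5E, 0x301C, 0x301C}},
  {0x8161, {0x2225, 0x2016, 0x2016}},
  {0x817C, {0xFF0D, 0x2212, 0x2212}},
  {0x8191, {0xFFE0, 0x00A2, 0xFFE0}},
  {0x8192, {0xFFE1, 0x00A3, 0xFFE1}},
  {0x81CA, {0xFFE2, 0x00AC, 0xFFE2}},
};
const std::size_t kDualMapLen = sizeof(kDualMap) / sizeof(kDualMap[0]);

// Japanese -> Unicode: the selected variant decides the result.
// Returns 0 for codes that are not in this small table.
std::uint16_t SjisToUnicode(std::uint16_t sjis, Variant v) {
  assert(v >= kCP932 && v <= kIBM943);
  for (std::size_t i = 0; i < kDualMapLen; ++i) {
    if (kDualMap[i].sjis == sjis) return kDualMap[i].unicode[v];
  }
  return 0;
}

// Unicode -> Japanese: every variant's code point is accepted, so text that
// was decoded or typed under any of the three rules can still be encoded.
// Returns 0 for code points that are not in this small table.
std::uint16_t UnicodeToSjis(std::uint16_t u) {
  for (std::size_t i = 0; i < kDualMapLen; ++i) {
    for (int v = 0; v < 3; ++v) {
      if (kDualMap[i].unicode[v] == u) return kDualMap[i].sjis;
    }
  }
  return 0;
}
```

For example, SjisToUnicode(0x8160, kApple) gives U+301C, while both U+301C and U+FF5E encode back to 0x8160.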

Comment 13

17 years ago
Matsumoto san,
Does the problem exist for MacOSX or is this specific to MacOS9?
(Reporter)

Comment 14

17 years ago
I don't know.
But I think MacOSX uses the same conversion rule as MacOS9 for backward
compatibility.

Comment 15

17 years ago
Could you give us a plain patch instead of an application/x-gzip attachment?
(Reporter)

Comment 16

17 years ago
Created attachment 98510 [details] [diff] [review]
patch #1/3
(Reporter)

Comment 17

17 years ago
Created attachment 98511 [details] [diff] [review]
patch #2/3
(Reporter)

Comment 18

17 years ago
Created attachment 98512 [details] [diff] [review]
patch #3/3

Comment 23

17 years ago
The patch id=98510-98512 is incomplete.
id=102147-102150 is the actual patch.

Updated

17 years ago
Attachment #102147 - Flags: review+

Updated

17 years ago
Attachment #102148 - Flags: review+

Updated

17 years ago
Attachment #102149 - Flags: review+

Updated

17 years ago
Attachment #102150 - Flags: review+

Updated

17 years ago
Attachment #98078 - Attachment is obsolete: true

Comment 25

17 years ago
Changed QA contact to ylong@netscape.com.
QA Contact: teruko → ylong

Comment 26

17 years ago
Comment on attachment 102147 [details] [diff] [review]
patch #1/4

sr=alecf

Comment 27

17 years ago
Comment on attachment 102148 [details] [diff] [review]
patch #2/4

sr=alecf
Attachment #102148 - Flags: superreview+

Comment 28

17 years ago
Comment on attachment 102149 [details] [diff] [review]
patch #3/4

sr=alecf
Attachment #102149 - Flags: superreview+

Comment 29

17 years ago
Comment on attachment 102150 [details] [diff] [review]
patch #4/4

what does this notation mean?
+ const PRUint16 (*mMapIndex)[128];

this seems a little confusing, how about

const PRUint16* mMapIndex[128]?

Though actually are you storing a pointer to a 128-element array? I think this is a
misuse of this type and what you might really want is PRUint16**  mMapIndex?

Also, storing the per-platform value in prefs seems unnecessary... I mean, the value
is never going to change, right? Why not just #ifdef the code?

Prefs should only be used when the value is going to be changed... the
per-platform pref stuff is when you want the DEFAULT value of the pref to vary
based on the platform, but you still expect the user to change it later.
Attachment #102150 - Attachment is obsolete: true

Comment 30

17 years ago
mMapIndex is actually a pointer to an array of 128 PRUint16 values.
It points to the first item of gIndex, gCP932Index, or gIBM943Index.

const PRUint16 gIndex[2][128];
const PRUint16 gCP932Index[2][128];
const PRUint16 gIBM943Index[2][128];

If I use PRUint16** mMapIndex, I must add extra variables:
const PRUint16 gIndex1[128] = {
  ...
};
const PRUint16 gIndex2[128] = {
  ...
};
const PRUint16 *const gIndex[2] = { gIndex1, gIndex2 };
...
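
The difference between the two declarations can be sketched as follows. This is a self-contained illustration, not the patch code: PRUint16 is replaced by a local typedef, and gDemoIndex, gDemoRows and the lookup functions are hypothetical names with made-up table values.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint16_t PRUint16;  // stand-in for the NSPR typedef

// Made-up 2 x 128 table; unspecified entries are zero.
const PRUint16 gDemoIndex[2][128] = { {1, 2, 3}, {10, 20, 30} };

// Form 1: pointer to an array of 128 PRUint16. A 2-D table converts to this
// type directly, so no helper variable is needed.
const PRUint16 (*gByArray)[128] = gDemoIndex;

// Form 2: pointer to pointer. This needs an extra array of row pointers,
// which is the "extra variables" cost described above.
const PRUint16* const gDemoRows[2] = { gDemoIndex[0], gDemoIndex[1] };
const PRUint16* const* gByPointer = gDemoRows;

PRUint16 LookupByArray(int row, int col) {
  assert(row >= 0 && row < 2 && col >= 0 && col < 128);
  return gByArray[row][col];  // one load: the row offset is computed
}

PRUint16 LookupByPointer(int row, int col) {
  assert(row >= 0 && row < 2 && col >= 0 && col < 128);
  return gByPointer[row][col];  // two loads: row pointer, then the value
}
```

Both forms are indexed the same way at the call site. Note that const PRUint16* mMapIndex[128] is a third, different type: an array of 128 pointers.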

Comment 31

17 years ago
reassign to smontagu for landing
Assignee: ftang → smontagu
Status: ASSIGNED → NEW
(Assignee)

Comment 32

17 years ago
Kohei, can you attach a new version of attachment 102150 addressing alecf's
comments? I'm assuming that all 4 attachments need to be checked in together.

Comment 33

17 years ago
In some cases, users will want to change the conversion table.

On Unix, the suitable conversion table depends on the installed fonts,
so it cannot be fixed at compile time.

As another case, suppose a Macintosh Mozilla user has trouble with
a web site and contacts the site's engineer, who uses
a Windows machine and does not have a Macintosh.
The engineer will want to examine the conversion behavior
on his Windows machine.
(In Japan, troubles related to character conversion occur often.)

If a Windows Mozilla user attaches more importance to compatibility
with Java programs than to the appearance on screen,
the user will want to use the standard conversion table instead of
the Windows (CP932) conversion table.
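
The trade-off between a runtime pref and an #ifdef can be sketched like this. It is an illustration under assumptions, not the patch code: the function and enum names are hypothetical, falling back to CP932 for an unset or unknown value is my assumption, and the XP_* macros are used only as examples of per-platform build flags. The pref values "Apple" and "IBM943" are the ones listed in comment 12.

```cpp
#include <cassert>
#include <cstring>

enum JisMapVariant { kMapCP932, kMapApple, kMapIBM943 };

// Runtime choice: interpret the value of the "intl.jis0208.map" pref.
// Assumption: an unset or unknown value falls back to CP932.
JisMapVariant VariantFromPref(const char* prefValue) {
  if (prefValue != 0) {
    if (std::strcmp(prefValue, "Apple") == 0) return kMapApple;
    if (std::strcmp(prefValue, "IBM943") == 0) return kMapIBM943;
  }
  return kMapCP932;
}

// Compile-time choice: fixed per build, so it cannot follow the installed
// fonts on Unix or be switched for debugging.
JisMapVariant VariantFromPlatform() {
#if defined(XP_MAC) || defined(XP_MACOSX)
  return kMapApple;
#elif defined(XP_OS2)
  return kMapIBM943;
#else
  return kMapCP932;
#endif
}

// The second scenario above: an engineer on Windows selects the Mac rule
// at runtime to reproduce a Mac user's report.
void DemoPrefOverride() {
  assert(VariantFromPref("Apple") == kMapApple);
}
```

With the pref, the per-platform part reduces to shipping a different default value for each platform.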

Comment 34

17 years ago
Created attachment 106482 [details] [diff] [review]
patch #4/4 using PRUint16**  mMapIndex

Comment 35

17 years ago
Re Comment 14: this happens also on Mac OS X.
(Assignee)

Comment 36

17 years ago
Comment on attachment 106482 [details] [diff] [review]
patch #4/4 using PRUint16**  mMapIndex

Transferring r=ftang and requesting sr
Attachment #106482 - Flags: superreview?(alecf)
Attachment #106482 - Flags: review+

Comment 37

17 years ago
Comment on attachment 106482 [details] [diff] [review]
patch #4/4 using PRUint16**  mMapIndex

I thought I had commented about this earlier: (maybe it was another bug?) 
Why are we using prefs to choose the charset on a per-platform basis - can't we
do this with #ifdefs? I guess I'm trying to understand the situation where the
user will be changing this value? If this isn't going to be changed by the
user, then we shouldn't add more dependencies on prefs.

The patch looks ok, but I'm going to hold off on my sr= until this is
explained..
(Reporter)

Comment 38

17 years ago
see #33 and...

Japanese "Shift JIS" has many variants. Many Japanese pages are labeled with the
"Shift_JIS" charset, but some of them really are Shift_JIS, others are
Windows-31J, and others are Apple Japanese, IBM943C, etc.

They share the same "encoding" (Shift JIS), but each has its own "charset" and
Unicode mapping rules. We Japanese, especially web developers, sometimes need to
tell them apart and use the right one.


case-1) vendor specific Shift JIS characters problem

Until now, Windows-specific chars could not be displayed on Mac/UNIX (nor
Mac-specific ones on Windows/UNIX). Now, if we change the charset at runtime, we
can see them via iso10646-1 glyph mapping (at the cost of searching for glyphs).

Especially on UNIX, some users want to use only "Shift_JIS" characters 
because the cost of searching iso10646 font glyphs is so large, but others want
to see "Windows-31J" specific chars because many web pages (and some mails) use
them with "charset=Shift_JIS".

IMHO, the best solution is to provide a charset/mapping rule for each major
variant of Shift JIS, and to let us specify at runtime which rule is used as
"Shift JIS". (In addition, the ISO-2022-JP that is compatible with Windows-31J,
which many Windows mailers generate, differs from the ISO-2022-JP that is
compatible with Shift_JIS, i.e. the JIS spec.)


case-2) Unicode conversion problem on XML with charset=UTF-8

Each Shift JIS variant has its own mapping rules for Unicode. Unfortunately they
are not compatible with each other, so in effect there are Shift_JIS-,
Windows-31J- and Apple Japanese-compatible flavors of UTF-8.

For example, XML with "charset=UTF-8" that was converted or generated from Shift
JIS data by an XML processor using the "Windows-31J/CP932" mapping rules (I think
Microsoft products do this) will not be usable on other systems.

This problem has not surfaced so far, but it may grow as XML with
"charset=UTF-8" comes into wider use.

Comment 39

17 years ago
Comment on attachment 106482 [details] [diff] [review]
patch #4/4 using PRUint16**  mMapIndex

ok, that seems like a reasonable explanation. sr=alecf

By the way, you should learn to use "cvs diff" - you don't need to keep two
separate trees around.
Attachment #106482 - Flags: superreview?(alecf) → superreview+
(Assignee)

Comment 40

17 years ago
Comment on attachment 102147 [details] [diff] [review]
patch #1/4

setting sr=alecf per comment 26
Attachment #102147 - Flags: superreview+
(Assignee)

Comment 41

17 years ago
Fix checked in.
Status: NEW → RESOLVED
Last Resolved: 17 years ago
Resolution: --- → FIXED

Comment 42

17 years ago
The test pages:
http://rh.vinelinux.org/~shom/sjisprob2.html and
http://rh.vinelinux.org/~shom/sjisprob.html
are displayed fine on 11-26 trunk build / Mac 9.2.1.

Mark as verified as fixed.
Status: RESOLVED → VERIFIED

Updated

16 years ago
Attachment #102147 - Flags: approval1.0.x?

Updated

16 years ago
Attachment #102148 - Flags: approval1.0.x?

Updated

16 years ago
Attachment #102149 - Flags: approval1.0.x?

Updated

16 years ago
Attachment #106482 - Flags: approval1.0.x?

Updated

16 years ago
No longer blocks: 157673

Updated

16 years ago
Depends on: 180372

Comment 43

16 years ago
I'm trying to understand the fix to this bug.  At first glance, it seems
fundamentally incorrect.  From the look of this patch, we're treating incoming
content from the web differently depending on platform, so that some characters
work on some platforms and some on others.  If that's true, it's simply wrong,
and should be undone.

Was the real problem here that when some platforms use something they call
Shift_JIS as their native character encoding (e.g., for the filesystem), they
mean different things?  If that's the case, then we should call those different
things different names, have encoders/decoders for all of them, and fix up the
name when determining what the filesystem/native encoding is.

Or am I misunderstanding what this fix did?

Comment 44

16 years ago
> Was the real problem here that when some platforms use something they call
> Shift_JIS as their native character encoding (e.g., for the filesystem), they
> mean different things? 

Yes, this is the crux of the problem. The differences, however, are 
limited to a small number of characters. But these characters are 
often used, too. Now, do we want a full table of encoders/decoders
for Mac, Windows, OS/2, etc.? Or do we handle only these small
number of characters differently?

Vendors are clear about differences in their technical specs and even
use different names though some are quite similar in naming. 

The major problem is the web pages and the way Mozilla used to treat pages 
that are determined to be in Shift_JIS. On web pages, there is only
one dominant name used, i.e. Shift_JIS. We only have one encoding
name, i.e. Shift_JIS, in the Character Coding menu to reflect the
overwhelming reality that over 65% of Japanese web pages use it. (The remaining
pages use either EUC-JP or ISO-2022-JP.)

It would be nearly impossible to persuade web developers to use different
names at this point -- it's been over 15 years with this single familiar
name to most web surfers. Can browser users tolerate different names for
encodings that have been treated for so many years as the same Shift_JIS
thing (except for a small number of characters)?

Comment 45

16 years ago
Are the pages on the Web in the standard version of Shift_JIS, the Windows
version, or the OS/2 version?  If they're in the Windows version, then perhaps
we should treat "Shift_JIS" as the Windows version of Shift_JIS on all
platforms?  Treating it as the Windows version only on Windows seems
problematic, since it could cause pages to work on Windows and fail on other
platforms, which is exactly what we don't want -- and why there should NOT be
platform differences at this level of the code.

Comment 46

16 years ago
> Are the pages on the Web in the standard version of Shift_JIS, the Windows
> version, or the OS/2 version?

In reality we cannot tell which, because by now people are used to
minor glyph shape differences. A good place to begin is this image
showing the differences between Mac and Windows:

http://bugzilla.mozilla.gr.jp/attachment.cgi?id=364&action=view

* The leftmost column shows: Shift_JIS codepoints
* The middle column shows glyphs used by Mac Japanese & corresponding Unicode
 points
* The rightmost column shows Windows glyphs & corresponding Unicode points

You can see that the same Shift_JIS codepoints lead to slightly
different glyph shapes between the 2 platforms. But except for

Shift_JIS 0x007e (overline)
Shift_JIS 0x815F (reverse solidus)

all others look remarkably alike in glyph shapes. Users really don't care 
about these minor glyph differences. As for the overline and reverse solidus
characters, after so many years of seeing these 2 codepoints
rendered with different glyph shapes, users now regard the two separate
glyphs on different OS's as **cognitively** equivalent. 

So the glyph shapes are not an issue here. And given this situation,
for all practical purposes, Shift_JIS pages on the web can be
considered platform-independent.

The real problem happens when Mozilla has to convert internal Unicode
points back to OS native encodings. We had been using only the Windows 
mapping table before this bug got fixed:

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

Take the wave dash character, which is used a lot in mail and web logs in
Japanese. This is Shift_JIS: 0x8160. 

On Windows, it maps to \uFF5E.
Now on Mac, if we need to convert this to the native encoding, there is
no \uFF5E codepoint in the Mac Japanese mapping table:

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT

hence users see a question mark there on Mac.
Now if we had converted Shift_JIS 0x8160 to \u301C on Mac in the first place,
that would solve this roundtrip problem for Mac. I believe the current
code now takes care of this issue.
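
The wave dash round trip described above can be sketched as follows. This is a toy model covering that one character: the function names are hypothetical, and MacNativeEncode merely stands in for a native Mac OS converter that knows U+301C but not U+FF5E.

```cpp
#include <cassert>
#include <cstdint>

enum Table { kWindowsCP932, kMacJapanese };

// Shift_JIS 0x8160 -> Unicode, using the values quoted above:
// U+FF5E under the Windows table, U+301C under the Mac Japanese table.
std::uint16_t DecodeWaveDash(Table t) {
  return t == kWindowsCP932 ? 0xFF5E : 0x301C;
}

// Stand-in for the native Mac encoder: U+301C maps back to 0x8160, and
// anything it does not know becomes '?' (0x3F).
std::uint16_t MacNativeEncode(std::uint16_t u) {
  const std::uint16_t kQuestionMark = 0x003F;
  return u == 0x301C ? 0x8160 : kQuestionMark;
}

// True when Shift_JIS 0x8160 survives decode + native encode on Mac.
bool RoundTripsOnMac(Table decodeTable) {
  return MacNativeEncode(DecodeWaveDash(decodeTable)) == 0x8160;
}

void DemoRoundTrip() {
  assert(!RoundTripsOnMac(kWindowsCP932));  // old behavior: user sees '?'
  assert(RoundTripsOnMac(kMacJapanese));    // decoding with the Mac table
}
```

In this model the fix is purely on the decoding side: choosing the table that matches the platform's native converter makes the round trip succeed.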

Comment 47

16 years ago
Sorry, I should have said that the first part of conversion
from OS encoding -> Unicode created the real problem because
we used to use only the Windows mapping. The roundtrip is also
a problem.

By the way, this type of problem would not have occurred if we 
lived only in the world of native encodings. The need for conversion
to/from Unicode is what exposes this problem so clearly.