Closed
Bug 35166
Opened 26 years ago
Closed 25 years ago
[regression] Shift_JIS 0x8160 shows as "?" in form submission
Categories
(Core :: Internationalization, defect, P1)
VERIFIED
FIXED
People
(Reporter: momoi, Assigned: ftang)
Details
(Whiteboard: nsbeta3+)
** Observed with 3/24/2000 M15 b1 build **
The above character shows as a question mark "?"
when you use a form to submit a JPN string under the Shift_JIS
encoding.
** Observed with 3/28/2000 M15 trunk build **
It is worse here: form submission on Windows shows "?"
for several Shift_JIS characters I tried in the form.
http://kaze:8000/formecho.html
You can use any form echo page, but here is one available quickly.
Comment 1•26 years ago (Reporter)
I suspect that the problem with the 3/24/2000 M15 b1 build
is a different one than the regression for other Shift_JIS
characters.
Comment 2•26 years ago (Reporter)
When using the above form echo page, set the encoding to Shift_JIS
first.
Comment 3•26 years ago (Reporter)
To clarify what I described above, the original problem
of the Shift_JIS character was reported for Netscape PR1.
But upon checking on a more recent M15 build, I see that
no Shift_JIS characters are submitted correctly.
Please check to see if this latter problem is still there in
the latest M15 build.
My concern is that when the latter problem is resolved, the
original problem may still be there.
Cc'd jbetak who was working on another form and charset conversion related bug.
Added [regression] to summary because of this comment from momoi on
2000-04-08 18:48:
To clarify what I described above, the original problem
of the Shift_JIS character was reported for Netscape PR1.
But upon checking on a more recent M15 build, I see that
no Shift_JIS characters are submitted correctly.
Summary: Shift_JIS 0x8160 shows as "?" in form submission → [regression] Shift_JIS 0x8160 shows as "?" in form submission
Comment 5•26 years ago (Assignee)
Why is this problem reported to erik?
I double-checked jbetak's fix for 29062. That change has some problems (so I
reopened it), but nothing looks wrong with non-file-upload form posting.
Comment 6•26 years ago
The character is:
SJIS    JIS     Unicode  Name
0x8160  0x2141  U+301C   WAVE DASH
There may be differences between the various Unicode converters used in the
industry. For example, we may be using Microsoft's converter when the character
is entered via the keyboard, and then using our own converter when drawing the
character. (This bug concerns form submission == keyboard input.)
Re-assigning to Frank, our Unicode converter expert.
Assignee: erik → ftang
Comment 7•26 years ago
Correction for my comment above: We don't use a Unicode converter in the font
engine when the platform is any Windows version other than Japanese Win95.
I also just noticed that MS Gothic does not have a glyph for U+301C. Perhaps
Microsoft expects people to use a different Unicode for WAVE DASH. The Unicode
3.0 book says that the industry has settled on FULLWIDTH TILDE (U+FF5E) for
JIS 1-33 (0x2141).
So maybe we need to change our JIS table(s) to convert to and from U+FF5E (not
U+301C).
Is this really a regression? Did earlier builds handle Shift_JIS 0x8160 in
form submissions?
Comment 10•26 years ago
A significant contribution has recently been sent to W3C, that includes a table
of characters that are converted to Unicode differently by various vendors. See:
http://www.w3.org/TR/japanese-xml/#ambiguity_of_yen
Comment 11•25 years ago
http://www.w3.org/TR/japanese-xml/#ambiguity_of_yen writes:
... x-sjis-cp932 is the only conversion table which provides
peculiar mapping of 0x8160(WAVE DASH), 0x8161(DOUBLE VERTICAL
LINE), 0x817C(MINUS SIGN), 0x8191(CENT SIGN), 0x8192(POUND SIGN)
and 0x81CA(NOT SIGN).
...
where "x-sjis-cp932" is the Unicode Consortium conversion table for
Microsoft CP932.
If we change the mapping to correspond to cp932, what happens to Mac?
It seems like CP932 is the most NON-standard. It seems like
x-sjis-unicode-0.9 (Shift-JIS (version 0.9)) or even x-sjis-jisx0221-1995
(derived from JIS X0221:1995) is more standard. But maybe MS CP932 is
more pervasive???
Whichever mapping we choose, could we use transliteration tables to
fallback to the other mappings for font rendering?
But is round-tripping also a problem for the editor, forms input, etc.?
Is the solution to support multiple SJIS converters? But IANA only
defines one Shift_JIS, so how would we distinguish between them? (And
the one IANA defines mentions MS in the Source description.)
From http://www.isi.edu/in-notes/iana/assignments/character-sets:
Name: Shift_JIS (preferred MIME name)
MIBenum: 17
Source: A Microsoft code that extends csHalfWidthKatakana to include
kanji by adding a second byte when the value of the first
byte is in the ranges 81-9F or E0-EF.
Alias: MS_Kanji
Alias: csShiftJIS
Comment 12•25 years ago
Anyone have a comment about my recent comment on SJIS conversion?
Also I tried http://kaze:8000/formecho.html with 2000062708 build on USWin95,
but when I hit the "Submit Query" button, I get the security warning dialog,
I hit OK, the barber pole status cycles a bit, and nothing happens. I tried
setting the View|Character Coding to Latin1 and SJIS. 4.72 works.
Are forms working at all?
Comment 13•25 years ago (Reporter)
The form submission thing is working on 6/26 build, but
apparently not on today's or yesterday's builds.
As to the differences between Mac and Windows, a quick
check indicates that on Windows we have this problem,
but on Mac we don't, meaning that the JPN wave dash maps to
U+301C and comes back as Shift_JIS 0x8160.
So the conversion is already different between Mac and Windows.
Would making a change on Windows affect Mac? If so, why?
Comment 14•25 years ago
We currently have one SJIS converter, so if we change it to use the cp932
mapping, we break others (e.g., Mac). I checked the mapping tables on
www.unicode.org and the Mac uses the same mappings as SJIS 0.9:
ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/SHIFTJIS.TXT
for the 6 codepoints previously mentioned:
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
0x8160 0xFF5E #FULLWIDTH TILDE
0x8161 0x2225 #PARALLEL TO
...
0x817C 0xFF0D #FULLWIDTH HYPHEN-MINUS
...
0x8191 0xFFE0 #FULLWIDTH CENT SIGN
0x8192 0xFFE1 #FULLWIDTH POUND SIGN
...
0x81CA 0xFFE2 #FULLWIDTH NOT SIGN
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT
0x8160 0x301C # WAVE DASH
0x8161 0x2016 # DOUBLE VERTICAL LINE
...
0x817C 0x2212 # MINUS SIGN
...
0x8191 0x00A2 # CENT SIGN
0x8192 0x00A3 # POUND SIGN
...
0x81CA 0x00AC # NOT SIGN
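The two columns above can be reproduced without downloading the tables. Below is a minimal Python sketch, assuming Python's built-in `cp932` codec follows the Microsoft mapping and its `shift_jis` codec follows the JIS X 0208 style mapping. The helper name `compare_mappings` is made up for illustration, and none of this is Mozilla's converter code.

```python
import unicodedata

# The six Shift_JIS byte pairs discussed in this bug.
PROBLEM_CODES = [b"\x81\x60", b"\x81\x61", b"\x81\x7c",
                 b"\x81\x91", b"\x81\x92", b"\x81\xca"]


def compare_mappings(codes=PROBLEM_CODES):
    """Return (sjis_hex, cp932_char, shift_jis_char) for each byte pair."""
    rows = []
    for raw in codes:
        as_cp932 = raw.decode("cp932")      # Microsoft-style table
        as_sjis = raw.decode("shift_jis")   # JIS X 0208-style table
        rows.append((raw.hex().upper(), as_cp932, as_sjis))
    return rows


if __name__ == "__main__":
    # Print both Unicode values and their names side by side.
    for sjis_hex, ms, jis in compare_mappings():
        print(f"0x{sjis_hex}  U+{ord(ms):04X} {unicodedata.name(ms)}"
              f"  |  U+{ord(jis):04X} {unicodedata.name(jis)}")
```

Running it prints one line per byte pair, with the cp932 result on the left and the shift_jis result on the right.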
Comment 15•25 years ago (Reporter)
There is a note about the \u301C "wave dash" character
in the Unicode 3.0 book (p.568).
It says that
"This character was encoded to match JIS C 6226-1978 1-33 "wave dash".
Subsequent revisions of JIS standard and industry practice have settled
on JIS 1-33 as being the fullwidth tilde character --> FF5E."
The wavy dash is now \u3030.
So at least for this particular character, CP932 seems to reflect
the more commonly accepted mapping.
Comment 16•25 years ago (Reporter)
I've looked at Mac OS 9's Unicode mapping info for the font
Osaka. The wave dash is assigned \u301C, but \uFF5E also
has the same character, though you cannot select and copy the
one at \uFF5E, the implication being that the one for \uFF5E is
linked to \u301C. So, at least for Mac OS 9, changing to
CP932 probably will not break the mapping for this particular
character with regard to font selection.
Comment 17•25 years ago
I'm not so worried about font selection. I think if that's the only problem,
we could handle that with transliteration fallbacks.
I'm more worried about roundtripping. What happens when the editor converts
a SJIS document to Unicode and then writes it back out? Will the data
survive the roundtrip if it's cp932 or SJIS v0.9? The same issue would
affect HTML forms submissions.
Frank,
Under http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvja/
I see these files: sjis.uf, sjis.ut and cp932.uf.
And under http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvja/, I find:
nsUnicodeToSJIS.cpp, line 29 -- #include "sjis.uf"
nsSJIS2Unicode.cpp, line 29 -- #include "sjis.ut"
But I could not find (via lxr) cp932.uf used anywhere. And why isn't there
a cp932.ut?
p.s. I found another reference on the SJIS mess, "Clarification of existing
charsets" by MURATA Makoto (murata@apsdc.ksp.fujixerox.co.jp):
http://www19.w3.org/Archives/Public/ietf-charsets/1998JulSep/0036.html
and it implies that cp932 should correspond to this IANA entry:
>Name: Windows-31J
>MIBenum: 2024
>Source: Windows Japanese. A further extension of csShiftJIS
> to include several OEM-specific kanji extensions.
> Like csShiftJIS, it adds a second byte when the value
> of the first byte is in the ranges 81-9F or E0-EF.
> PCL Symbol Set id: 19K
>Alias: csWindows31J
but I bet all cp932 pages are labeled as "Shift_JIS", not "Windows-31J"!
Comment 18•25 years ago (Reporter)
All of this discussion seems to point to a need to
support multiple codeset tables under the rubric
of "Shift_JIS". Something like:
1. For processing input, use a vendor/OS specific codeset to
convert to Unicode. (We know which platform we are
running on, don't we? The same for output handling.)
2. For output use a vendor/OS specific codeset to
convert from Unicode. Or alternatively use CERs for HTML
and XML output just for those problem characters.
(cf. http://www.w3.org/TR/japanese-xml/#sjis)
Given the chaotic situation, as a practical implementation
issue it does not seem possible to use a vendor-neutral
Shift_JIS table. (Though there may be situations where
a vendor-neutral table is more desirable.)
Such suggestions are contained in the document cited in
Murata's document, mentioned above by Erik. E.g.,
http://www.opengroup.or.jp/jvc/cde/ucs-conv-e.html#ch2_3
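Points 1 and 2 can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `platform_sjis_codec` and `encode_with_ncr` are hypothetical helper names, the platform check is a guess at what "vendor/OS specific" could mean, and Python's `xmlcharrefreplace` error handler (which writes decimal numeric character references) stands in for the character-reference fallback suggested for HTML and XML output.

```python
import sys


def platform_sjis_codec() -> str:
    """Guess a Shift_JIS flavor from the platform (illustrative assumption)."""
    return "cp932" if sys.platform.startswith("win") else "shift_jis"


def encode_with_ncr(text: str, codec: str) -> bytes:
    """Encode to a Shift_JIS flavor, writing unmappable characters as NCRs."""
    return text.encode(codec, errors="xmlcharrefreplace")


if __name__ == "__main__":
    sample = "A\uff5eB"  # contains FULLWIDTH TILDE
    print(encode_with_ncr(sample, platform_sjis_codec()))
```

The character reference only helps for markup output. It does not help plain form submission, where the receiving side may not interpret references.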
Comment 19•25 years ago
Keep in mind that although we only have one Shift-JIS table, Mozilla sometimes
calls the OS's converter. For example, when we receive keyboard input, Mozilla
calls Windows's Shift-JIS -> Unicode converter (on Windows only, of course).
This could lead to problems if an HTML form is pre-populated with some text and
the user enters additional text. The pre-populated text would be converted with
Mozilla's table, and the user's text would be converted with Windows's table,
leading to 2 different Unicodes for the same Shift-JIS. When you submit that
form, Mozilla uses its own table, and one of those Unicodes might get garbled.
Maybe this means that we should always use the same table.
Also, although it is theoretically possible to use transliteration to solve some
of these problems, we currently don't have transliteration on Windows at all,
and even on Unix, where we do transliterate, it only transliterates to ASCII.
A more general Unicode -> Unicode transliteration will require more work in the
font engine(s).
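The pre-populated form scenario can be simulated in a few lines of Python. This is a sketch, not Mozilla code: it assumes Python's `shift_jis` codec can stand in for the browser's own table and `cp932` for the Windows converter, and `submit_form` is a made-up helper name.

```python
BROWSER_TABLE = "shift_jis"   # assumption: stands in for the browser's own table
OS_TABLE = "cp932"            # assumption: stands in for the Windows converter


def submit_form(prefilled: bytes, typed: bytes) -> bytes:
    """Decode two inputs with different tables, then encode with one table."""
    text = prefilled.decode(BROWSER_TABLE) + typed.decode(OS_TABLE)
    # Characters the browser table cannot map come out as "?" (0x3F).
    return text.encode(BROWSER_TABLE, errors="replace")


if __name__ == "__main__":
    # The same Shift_JIS bytes arrive twice: once pre-populated, once typed.
    print(submit_form(b"\x81\x60", b"\x81\x60"))
```

Both inputs start as the same two bytes, but only the copy decoded with the browser's own table survives the submission. The typed copy is the one that gets garbled.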
Comment 20•25 years ago
Pasting in my email from when bugzilla was down this morning:
Subject: Re: [Bug 35166] Changed - [regression] Shift_JIS 0x8160 shows
as "?" in form submission
Date: Fri, 30 Jun 2000 02:29:57 -0700
From: Bob Jung <bobj@netscape.com>
Organization: Netscape Communications Corporation
To: Katsuhiko Momoi <momoi@netscape.com>
CC: momoi@netscape.com, teruko@netscape.com,
ftang@netscape.com,msanz@netscape.com
References: <200006300853.e5U8rJ903156@lounge.mozilla.org>
Bugzilla is down right now.
+ 1. For processing input, use a vendor/OS specific codeset to
+ convert to Unicode. (We know which platform we are
+ running on, don't we? The same for output handling.)
Yes, for keyboard input (or copy/pasting, saving plaintext, etc.), we could
assume Windows uses cp932, MacOS pre-9 uses SJIS v0.9, etc.
+ 2. For output use a vendor/OS specific codeset to
+ convert from Unicode. Or alternatively use CERs for HTML
+ and XML output just for those problem characters.
Yes, when the editor creates SJIS HTML, it could use NCRs (not CERs).
But you still have not addressed the problem I keep raising:
- What happens when I edit an existing SJIS HTML document with these
characters?
- or submit a SJIS HTML form input with predefined text containing
these characters?
I assume all flavors of SJIS HTML use the same IANA charset label.
If we use the wrong SJIS converter, then the roundtrip will fail and we will
corrupt the data.
I guess we just have to pick one flavor of SJIS and the others will lose.
And I guess we'll probably bend over to MS again and use cp932...
Hopefully these characters are not used much...
Comment 21•25 years ago
Bob and I ran some tests this morning, and here is what we found:
Shift-JIS 0x8160 is converted to Unicode U+FF5E (probably), while
Unicode U+301C is converted to Shift-JIS 0x8160, and
Unicode U+FF5E is converted to Shift-JIS 0x3F (question mark)
So it looks like our Shift-JIS "to" and "from" tables (*.ut and *.uf) are
inconsistent. We can probably make our Unicode->SJIS converter convert both
U+301C and U+FF5E to 0x8160, but we need to choose between 301C and FF5E when
converting from 8160 to Unicode. Should we simply take the market leader's side?
I.e. Microsoft's CP932's FF5E?
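To make the inconsistency concrete, here is a small Python model of the three results above. The one-entry dictionaries are hypothetical stand-ins for the real *.ut and *.uf tables, and the function names are invented for this sketch.

```python
# Hypothetical one-entry tables mirroring the test results above.
SJIS_TO_UNICODE = {0x8160: 0xFF5E}   # decode direction (the *.ut role)
UNICODE_TO_SJIS = {0x301C: 0x8160}   # encode direction (the *.uf role)
QUESTION_MARK = 0x3F                 # substitute for unmappable input


def from_unicode(code_point, table=UNICODE_TO_SJIS):
    """Encode one code point, falling back to '?' when it is not in the table."""
    return table.get(code_point, QUESTION_MARK)


def round_trips(sjis, encode_table=UNICODE_TO_SJIS):
    """True if decoding and re-encoding gives back the original SJIS code."""
    code_point = SJIS_TO_UNICODE[sjis]
    return from_unicode(code_point, encode_table) == sjis


# Possible repair: accept both Unicode variants when encoding.
PATCHED_UNICODE_TO_SJIS = {**UNICODE_TO_SJIS, 0xFF5E: 0x8160}

if __name__ == "__main__":
    print(round_trips(0x8160))
    print(round_trips(0x8160, PATCHED_UNICODE_TO_SJIS))
```

Adding the second entry to the encode table repairs the round trip whichever Unicode value the decode table ends up using, so the choice between 301C and FF5E only affects the decode direction.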
Comment 22•25 years ago (Reporter)
Can we experiment and see what happens on Mac and Linux when
we convert from SJIS 0x8160 to \uFF5E? I'm hoping that this will
not break Mac.
Comment 23•25 years ago
>Can we experiment and see what happens on Mac and Linux when
>we convert from SJIS 0x8160 to \uFF5E? I'm hoping that this will
>not break Mac.
Kat, you can see the effects of FF5E by simply including the hex NCR &#xFF5E; in
HTML forms and/or paragraphs, e.g. to check whether or not it displays correctly.
Comment 24•25 years ago (Reporter)
I looked at current Win, Mac and Linux builds for the display of
\uFF5E. Here's what I found:
Mac: displays it with the same glyph as \u301C.
Linux: Displays it OK but a slightly different glyph from \u301C.
Win: Displays it OK and the \u301C glyph is different but still
similar in appearance. This platform uses the most different
glyphs for both, however.
On these platforms, the glyphs are quite similar between these
2 codepoints.
Comment 25•25 years ago
Reassigning to Cata to fix the SJIS-to-Unicode conversion tables.
As long as the SJIS flavors are mapping 1-to-N in the Unicode direction
(I assume that's true...), there should not be roundtrip problems because
there will be no ambiguity when mapping back from Unicode to the SJIS flavor.
For the original problem reported, it seems like the "generic"
Unicode-to-SJIS converter should map both
(1) Unicode 0xFF5E (FULLWIDTH TILDE) to SJIS 0x8160
and
(2) Unicode 0x301C (WAVE DASH) to SJIS 0x8160
We are doing (2), but not (1). The converter is currently mapping
Unicode 0xFF5E to Shift-JIS 0x3F (question mark)
We should look at the other 5 codepoints and make similar changes.
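A rough Python sketch of a "generic" encoder that accepts both variants for all six codepoints is shown below. It assumes Python's `cp932` codec as the target table and folds the JIS-style variants onto the CP932 ones before encoding. The names `VARIANT_TO_CP932` and `encode_sjis_lenient` are invented for this illustration.

```python
# Fold the JIS-style code points onto the CP932-style ones (six pairs).
VARIANT_TO_CP932 = str.maketrans({
    "\u301c": "\uff5e",  # WAVE DASH            -> FULLWIDTH TILDE
    "\u2016": "\u2225",  # DOUBLE VERTICAL LINE -> PARALLEL TO
    "\u2212": "\uff0d",  # MINUS SIGN           -> FULLWIDTH HYPHEN-MINUS
    "\u00a2": "\uffe0",  # CENT SIGN            -> FULLWIDTH CENT SIGN
    "\u00a3": "\uffe1",  # POUND SIGN           -> FULLWIDTH POUND SIGN
    "\u00ac": "\uffe2",  # NOT SIGN             -> FULLWIDTH NOT SIGN
})


def encode_sjis_lenient(text: str) -> bytes:
    """Encode to Shift_JIS bytes, treating both Unicode variants as equal."""
    return text.translate(VARIANT_TO_CP932).encode("cp932")


if __name__ == "__main__":
    print(encode_sjis_lenient("\u301c") == encode_sjis_lenient("\uff5e"))
```

The folding is applied only at encode time, so it does not change how the characters are stored or compared elsewhere.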
[Kinda separate from this bug...
We may still have problems in how to treat the converted codepoints for
the various SJIS flavors. Do we treat the variant mapped codepoints as
equivalents? Is that the correct thing to do? Here's the list:
FULLWIDTH TILDE equivalent to WAVE DASH
PARALLEL TO equivalent to DOUBLE VERTICAL LINE
FULLWIDTH HYPHEN-MINUS equivalent to MINUS SIGN
FULLWIDTH CENT SIGN equivalent to CENT SIGN
FULLWIDTH POUND SIGN equivalent to POUND SIGN
FULLWIDTH NOT SIGN equivalent to NOT SIGN
If we do, will parsers or other code that use any of the above as meta
characters do the wrong thing? But probably parsers will only use
"ASCII" codepoints for meta characters (e.g., HYPHEN-MINUS (0x002D) and
not either FULLWIDTH HYPHEN-MINUS or MINUS SIGN).
Also, mapping these codepoints as equivalent would affect all data not
just codepoints converted from a SJIS flavor. Is that OK?]
Updated•25 years ago (Assignee)
Keywords: correctness, nsbeta3
Comment 26•25 years ago (Assignee)
shanjian- please take a look at this
Comment 27•25 years ago (Assignee)
Reassigning to myself.
Assignee: cata → ftang
Status: ASSIGNED → NEW
Comment 29•25 years ago
When used in mail compose, the conversion error causes the wrong charset alert
to come up.
A Japanese mailing list which I subscribe to always uses this character like this:
~~*~~Paragraph Title~~*~~
paragraph text
~~~~~~~~~~~
Replying to or forwarding (inline) that mail always alerts the user, although the
mail does not contain non-Japanese characters.
Comment 30•25 years ago (Assignee)
nsbeta3+ per bug meeting. P1. We should also make sure our converter
round-trips: UnicodeToShift_JIS(Shift_JIStoUnicode(shiftjis)) == shiftjis
Priority: P3 → P1
Whiteboard: nsbeta3+
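The round-trip requirement can be expressed as a small check. The sketch below uses Python's built-in codecs as stand-ins for the Mozilla converters, which is an assumption, and `roundtrip_failures` is a hypothetical helper name.

```python
def roundtrip_failures(codec, codes):
    """Return the byte sequences that do not survive decode followed by encode."""
    failures = []
    for raw in codes:
        try:
            back = raw.decode(codec).encode(codec)
        except UnicodeError:
            back = None  # undecodable or unencodable counts as a failure
        if back != raw:
            failures.append(raw)
    return failures


if __name__ == "__main__":
    wave_dash = [b"\x81\x60"]
    for codec in ("cp932", "shift_jis"):
        print(codec, roundtrip_failures(codec, wave_dash))
```

A check like this only tests one table against itself. The failure in this bug came from pairing a decode table and an encode table that disagreed, so the same check should be run with the pair of tables that actually ships.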
Comment 31•25 years ago (Assignee)
Fixed and checked in. Used the new table generated from CP932.TXT for both the
Unicode-to-SJIS and Unicode-to-JIS0208 converters.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Comment 32•25 years ago (Reporter)
There is a mail problem related to this bug.
After checking it with an HTML form, reassign to me
for verification of the mail-side issue.
Comment 33•25 years ago
I tested this in the 2000-08-17-08 Win32, Mac, and Linux builds.
This works fine in the Win32 and Linux builds. In the Mac build, this character is
displayed as "?" (bug 49380).
Changed QA contact to momoi@netscape.com.
QA Contact: teruko → momoi
Comment 34•25 years ago (Reporter)
** Checked with 9/11/2000 Win32 build **
The mail side of the problem was the warning that comes
up when trying to reply to the original msg containing this
character.
This problem no longer occurs as it now maps to an existing
Shift_JIS character.
Marking the fix verified.
Status: RESOLVED → VERIFIED