Closed Bug 35166 Opened 26 years ago Closed 25 years ago

[regression] Shift_JIS 0x8160 shows as "?" in form submission

Categories

(Core :: Internationalization, defect, P1)

x86
Windows NT
defect

VERIFIED FIXED

People

(Reporter: momoi, Assigned: ftang)

Details

(Whiteboard: nsbeta3+)

** Observed with 3/24/2000 M15 b1 build **
The above character shows as a question mark "?" when you use a form to submit a JPN string under Shift_JIS encoding.
** Observed with 3/28/2000 M15 trunk build **
It is worse here: form submission on Windows shows "?" for several Shift_JIS characters I tried in the form. http://kaze:8000/formecho.html You can use any form echo page, but here is one that is quickly available.
I suspect that the problem with the 3/24/2000 M15 b1 build is a different one from the regression for other Shift_JIS characters.
When using the above form echo page, set the encoding to Shift_JIS first.
To clarify what I described above, the original problem of the Shift_JIS character was reported for Netscape PR1. But upon checking on a more recent M15 build, I see that no Shift_JIS characters are submitted correctly. Please check to see if this latter problem is still there in the latest M15 build. My concern is that when the latter problem is resolved, the original problem may still be there.
Cc'd jbetak, who was working on another form and charset conversion related bug. Added [regression] to summary because of this comment from momoi on 2000-04-08 18:48: To clarify what I described above, the original problem of the Shift_JIS character was reported for Netscape PR1. But upon checking on a more recent M15 build, I see that no Shift_JIS characters are submitted correctly.
Summary: Shift_JIS 0x8160 shows as "?" in form submission → [regression] Shift_JIS 0x8160 shows as "?" in form submission
Why was this problem reported to erik? I double-checked jbetak's fix for 29062. That change has some problems (so I reopened it). But nothing looks wrong with non-file-upload form posting.
The character is:

SJIS    JIS     Unicode  Name
0x8160  0x2141  U+301C   WAVE DASH

There may be differences between the various Unicode converters used in the industry. For example, we may be using Microsoft's converter when the character is entered via the keyboard, and then using our own converter when drawing the character. (This bug concerns form submission == keyboard input.) Re-assigning to Frank, our Unicode converter expert.
Assignee: erik → ftang
Correction for my comment above: We don't use a Unicode converter in the font engine when the platform is any Windows version other than Japanese Win95. I also just noticed that MS Gothic does not have a glyph for U+301C. Perhaps Microsoft expects people to use a different Unicode for WAVE DASH. The Unicode 3.0 book says that the industry has settled on FULLWIDTH TILDE (U+FF5E) for JIS 1-33 (0x2141). So maybe we need to change our JIS table(s) to convert to and from U+FF5E (not U+301C).
Converter problem; reassigning to bobj.
Assignee: ftang → bobj
Is this really a regression? Did earlier build handle Shift_JIS 0x8160 in forms submissions?
Status: NEW → ASSIGNED
A significant contribution has recently been sent to W3C, that includes a table of characters that are converted to Unicode differently by various vendors. See: http://www.w3.org/TR/japanese-xml/#ambiguity_of_yen
Target Milestone: --- → M17
http://www.w3.org/TR/japanese-xml/#ambiguity_of_yen writes:

... x-sjis-cp932 is the only conversion table which provides peculiar mapping of 0x8160(WAVE DASH), 0x8161(DOUBLE VERTICAL LINE), 0x817C(MINUS SIGN), 0x8191(CENT SIGN), 0x8192(POUND SIGN) and 0x81CA(NOT SIGN). ...

where "x-sjis-cp932" is the Unicode Consortium conversion table for Microsoft CP932.

If we change the mapping to correspond to cp932, what happens to Mac? It seems like CP932 is the most NON-standard. It seems like x-sjis-unicode-0.9 (Shift-JIS (version 0.9)) or even x-sjis-jisx0221-1995 (derived from JIS X0221:1995) is more standard. But maybe MS CP932 is more pervasive???

Whichever mapping we choose, could we use transliteration tables to fall back to the other mappings for font rendering? But is round-tripping also a problem for the editor, forms input, etc.?

Is the solution to support multiple SJIS converters? But IANA only defines one Shift_JIS, so how would we distinguish between them? (And the one IANA defines mentions MS in the Source description.) From http://www.isi.edu/in-notes/iana/assignments/character-sets:

Name: Shift_JIS (preferred MIME name)
MIBenum: 17
Source: A Microsoft code that extends csHalfWidthKatakana to include kanji by adding a second byte when the value of the first byte is in the ranges 81-9F or E0-EF.
Alias: MS_Kanji
Alias: csShiftJIS
Anyone have a comment about my recent comment on SJIS conversion? Also I tried http://kaze:8000/formecho.html with 2000062708 build on USWin95, but when I hit the "Submit Query" button, I get the security warning dialog, I hit OK, the barber pole status cycles a bit, and nothing happens. I tried setting the View|Character Coding to Latin1 and SJIS. 4.72 works. Are forms working at all?
The form submission thing is working on 6/26 build, but apparently not on today's or yesterday's builds. As to the differences between Mac and Windows, a quick check indicates that on Windows we have this problem, but on Mac we don't, meaning that the JPN wave dash maps to U+301C and comes back as Shift_JIS 0x8160. So the conversion is already different between Mac and Windows, could it affect Mac by making a change on Windows? If so, why?
We currently have one SJIS converter, so if we change it to use the cp932 mapping, we break others (e.g., Mac). I checked the mapping tables on www.unicode.org and the Mac uses the same mappings as SJIS 0.9 (ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/SHIFTJIS.TXT) for the 6 codepoints previously mentioned:

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
0x8160 0xFF5E #FULLWIDTH TILDE
0x8161 0x2225 #PARALLEL TO
...
0x817C 0xFF0D #FULLWIDTH HYPHEN-MINUS
...
0x8191 0xFFE0 #FULLWIDTH CENT SIGN
0x8192 0xFFE1 #FULLWIDTH POUND SIGN
...
0x81CA 0xFFE2 #FULLWIDTH NOT SIGN

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT
0x8160 0x301C # WAVE DASH
0x8161 0x2016 # DOUBLE VERTICAL LINE
...
0x817C 0x2212 # MINUS SIGN
...
0x8191 0x00A2 # CENT SIGN
0x8192 0x00A3 # POUND SIGN
...
0x81CA 0x00AC # NOT SIGN
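To make the disagreement between the two vendor tables easier to see, here is a minimal Python sketch. It is only an illustration: the two dictionaries contain just the six entries quoted above from CP932.TXT and JAPANESE.TXT, and the names `CP932`, `APPLE_JAPANESE` and `differing_mappings` are made up for this example, not part of any Mozilla code.

```python
import unicodedata

# Entries copied from the CP932.TXT and JAPANESE.TXT excerpts quoted above.
# Keys are Shift_JIS codes, values are Unicode code points.
CP932 = {
    0x8160: 0xFF5E, 0x8161: 0x2225, 0x817C: 0xFF0D,
    0x8191: 0xFFE0, 0x8192: 0xFFE1, 0x81CA: 0xFFE2,
}
APPLE_JAPANESE = {
    0x8160: 0x301C, 0x8161: 0x2016, 0x817C: 0x2212,
    0x8191: 0x00A2, 0x8192: 0x00A3, 0x81CA: 0x00AC,
}


def differing_mappings(table_a, table_b):
    """Return (sjis, unicode_a, unicode_b) for each shared code that maps differently."""
    return [
        (sjis, table_a[sjis], table_b[sjis])
        for sjis in sorted(table_a.keys() & table_b.keys())
        if table_a[sjis] != table_b[sjis]
    ]


if __name__ == "__main__":
    for sjis, ms_cp, apple_cp in differing_mappings(CP932, APPLE_JAPANESE):
        print(
            f"0x{sjis:04X}  U+{ms_cp:04X} {unicodedata.name(chr(ms_cp))}"
            f"  vs  U+{apple_cp:04X} {unicodedata.name(chr(apple_cp))}"
        )
```

Running the same comparison over the full mapping files would show whether any other codes differ; the sketch only covers the excerpt quoted in this comment.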
There is a note about the \u301C "wave dash" character in the Unicode 3.0 book (p.568). It says that "This character was encoded to match JIS C 6226-1978 1-33 "wave dash". Subsequent revisions of JIS standard and industry practice have settled on JIS 1-33 as being the fullwidth tilde character --> FF5E." The wavy dash is now \u3030. So at least for this particular character, CP932 seems to reflect the more commonly accepted mapping.
I've looked at Mac OS9's Unicode mapping info for the font Osaka. The wave dash is assigned \u301C, but \uFF5E also has the same character, though you cannot select and copy the one at \uFF5E, the implication being that the one for \uFF5E is linked to \u301C. So at least for Mac OS9, changing to CP932 probably will not break the mapping for this particular character in regard to font selection.
I'm not so worried about font selection. I think if that's the only problem, we could handle that with transliteration fallbacks. I'm more worried about roundtripping. What happens when the editor converts a SJIS document to Unicode and then writes it back out? Will the data survive the roundtrip if it's cp932 or SJIS v0.9? The same issue would affect HTML forms submissions.

Frank,
Under http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvja/ I see these files: sjis.uf, sjis.ut and cp932.uf. And under http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvja/, I find:
nsUnicodeToSJIS.cpp, line 29 -- #include "sjis.uf"
nsSJIS2Unicode.cpp, line 29 -- #include "sjis.ut"
But I could not find (via lxr) cp932.uf used anywhere. And why isn't there a cp932.ut?

p.s. I found another reference on the SJIS mess, "Clarification of existing charsets" by MURATA Makoto (murata@apsdc.ksp.fujixerox.co.jp): http://www19.w3.org/Archives/Public/ietf-charsets/1998JulSep/0036.html and it implies that cp932 should correspond to this IANA entry:
>Name: Windows-31J
>MIBenum: 2024
>Source: Windows Japanese. A further extension of csShiftJIS
> to include several OEM-specific kanji extensions.
> Like csShiftJIS, it adds a second byte when the value
> of the first byte is in the ranges 81-9F or E0-EF.
> PCL Symbol Set id: 19K
>Alias: csWindows31J
but I bet all cp932 pages are labeled as "Shift_JIS", not "Windows-31J"!
All of this discussion seems to point to a need to support multiple codeset tables under the rubric of "Shift_JIS". Something like:
1. For processing input, use a vendor/OS specific codeset to convert to Unicode. (We know which platform we are running on, don't we? The same for output handling.)
2. For output use a vendor/OS specific codeset to convert from Unicode. Or alternatively use CERs for HTML and XML output just for those problem characters. (cf. http://www.w3.org/TR/japanese-xml/#sjis)
Given the chaotic situation, as a practical implementation issue it does not seem possible to use a vendor-neutral Shift_JIS table. (Though there may be situations where a vendor-neutral table is more desirable.) Such suggestions are contained in the document cited in Murata's document suggested above by Erik. E.g., http://www.opengroup.or.jp/jvc/cde/ucs-conv-e.html#ch2_3
Keep in mind that although we only have one Shift-JIS table, Mozilla sometimes calls the OS's converter. For example, when we receive keyboard input, Mozilla calls Windows's Shift-JIS -> Unicode converter (on Windows only, of course). This could lead to problems if an HTML form is pre-populated with some text and the user enters additional text. The pre-populated text would be converted with Mozilla's table, and the user's text would be converted with Windows's table, leading to 2 different Unicodes for the same Shift-JIS. When you submit that form, Mozilla uses its own table, and one of those Unicodes might get garbled. Maybe this means that we should always use the same table. Also, although it is theoretically possible to use transliteration to solve some of these problems, we currently don't have transliteration on Windows at all, and even on Unix, where we do transliterate, it only transliterates to ASCII. A more general Unicode -> Unicode transliteration will require more work in the font engine(s).
Pasting in my email from when bugzilla was down this morning:

Subject: Re: [Bug 35166] Changed - [regression] Shift_JIS 0x8160 shows as "?" in form submission
Date: Fri, 30 Jun 2000 02:29:57 -0700
From: Bob Jung <bobj@netscape.com>
Organization: Netscape Communications Corporation
To: Katsuhiko Momoi <momoi@netscape.com>
CC: momoi@netscape.com, teruko@netscape.com, ftang@netscape.com, msanz@netscape.com
References: <200006300853.e5U8rJ903156@lounge.mozilla.org>

Bugzilla is down right now.

+ 1. For processing input, use a vendor/OS specific codeset to
+ convert to Unicode. (We know which platform we are
+ running on, don't we? The same for output handling.)

Yes, for keyboard input (or copy/pasting, saving plaintext, etc.), we could assume Windows uses cp932, MacOS pre-9 uses SJIS v0.9, etc.

+ 2. For output use a vendor/OS specific codeset to
+ convert from Unicode. Or alternatively use CERs for HTML
+ and XML output just for those problem characters.

Yes, when the editor creates SJIS HTML, it could use NCRs (not CERs).

But you still have not addressed the problem I keep raising:
- What happens when I edit an existing SJIS HTML document with these characters?
- or submit a SJIS HTML form input with predefined text containing these characters?

I assume all flavors of SJIS HTML use the same IANA charset label. If we use the wrong SJIS converter, then the roundtrip will fail and we will corrupt the data. I guess we just have to pick one flavor of SJIS and the others will lose. And I guess we'll probably bend over to MS again and use cp932... Hopefully these characters are not used much...
Bob and I ran some tests this morning, and here is what we found:

Shift-JIS 0x8160 is converted to Unicode U+FF5E (probably), while
Unicode U+301C is converted to Shift-JIS 0x8160, and
Unicode U+FF5E is converted to Shift-JIS 0x3F (question mark)

So it looks like our Shift-JIS "to" and "from" tables (*.ut and *.uf) are inconsistent. We can probably make our Unicode->SJIS converter convert both U+301C and U+FF5E to 0x8160, but we need to choose between 301C and FF5E when converting from 8160 to Unicode. Should we simply take the market leader's side? I.e. Microsoft's CP932's FF5E?
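The NCR idea from the email above can be sketched with Python's standard codecs. This is an illustration only: Python's `shift_jis` codec stands in for whichever SJIS table the editor would use, and `encode_with_ncr_fallback` is a hypothetical helper name, not an existing Mozilla function.

```python
import html


def encode_with_ncr_fallback(text, charset="shift_jis"):
    """Encode text for HTML output, writing unencodable characters as numeric
    character references (NCRs) instead of replacing them with "?"."""
    return text.encode(charset, errors="xmlcharrefreplace")


if __name__ == "__main__":
    # U+FF5E FULLWIDTH TILDE is not in Python's shift_jis table, so it
    # becomes an NCR; U+301C WAVE DASH is encoded directly.
    encoded = encode_with_ncr_fallback("\uff5e\u301c")
    print(encoded)
    # An HTML consumer recovers the original characters from the NCR.
    print(html.unescape(encoded.decode("shift_jis")) == "\uff5e\u301c")
```

Note that this only protects output that is parsed as HTML or XML. It does not help with the roundtrip concern raised in the email for documents that already contain the raw SJIS bytes.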
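The inconsistency found in these tests can be modelled in a few lines of Python. The two one-entry tables below are hypothetical stand-ins that mirror the reported behaviour; they are not the contents of the real *.ut and *.uf files.

```python
FALLBACK = 0x3F  # "?" is emitted when a code point has no Shift_JIS mapping

# Hypothetical one-entry tables mirroring the behaviour reported above.
SJIS_TO_UNICODE = {0x8160: 0xFF5E}   # the "to Unicode" direction
UNICODE_TO_SJIS = {0x301C: 0x8160}   # the "from Unicode" direction


def roundtrip_failures(to_unicode, from_unicode):
    """Return (sjis, unicode, sjis_after_roundtrip) for codes that do not survive."""
    failures = []
    for sjis, code_point in sorted(to_unicode.items()):
        back = from_unicode.get(code_point, FALLBACK)
        if back != sjis:
            failures.append((sjis, code_point, back))
    return failures


if __name__ == "__main__":
    for sjis, code_point, back in roundtrip_failures(SJIS_TO_UNICODE, UNICODE_TO_SJIS):
        print(f"0x{sjis:04X} -> U+{code_point:04X} -> 0x{back:02X}")
```

With these tables the only failure is 0x8160, which comes back as 0x3F. Adding a U+FF5E entry to the "from Unicode" table makes the failure list empty.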
Can we experiment and see what happens on Mac and Linux when we convert from SJIS 0x8160 to \uFF5E? I'm hoping that this will not break Mac.
>Can we experiment and see what happens on Mac and Linux when >we convert from SJIS 0x8160 to \uFF5E? I'm hoping that this will >not break Mac. Kat, you can see the effects of FF5E by simply including the hex NCR &#xFF5E; in HTML forms and/or paragraphs. E.g. whether or not it displays correctly.
I looked at current Win, Mac and Linux builds for the display of \uFF5E. Here's what I found: Mac: displays it with the same glyph as \u301C. Linux: Displays it OK but a slightly different glyph from \u301C. Win: Displays it OK and the \u301C glyph is different but still similar in appearance. This platform uses the most different glyphs for both, however. On these platforms, the glyphs are quite similar between these 2 codepoints.
Assignee: bobj → cata
Status: ASSIGNED → NEW
Reassigning to Cata to fix the SJIS-to-Unicode conversion tables.

As long as the SJIS flavors are mapping 1-to-N in the Unicode direction (I assume that's true...), there should not be roundtrip problems because there will be no ambiguity when mapping back from Unicode to the SJIS flavor.

For the original problem reported, it seems like the "generic" Unicode-to-SJIS converter should map both
(1) Unicode 0xFF5E (FULLWIDTH TILDE) to SJIS 0x8160 and
(2) Unicode 0x301C (WAVE DASH) to SJIS 0x8160
We are doing (2), but not (1). The converter is currently mapping Unicode 0xFF5E to Shift-JIS 0x3F (question mark).

We should look at the other 5 codepoints and make similar changes.

[Kinda separate from this bug... We may still have problems in how to treat the converted codepoints for the various SJIS flavors. Do we treat the variant mapped codepoints as equivalents? Is that the correct thing to do? Here's the list:
FULLWIDTH TILDE equivalent to WAVE DASH
PARALLEL TO equivalent to DOUBLE VERTICAL LINE
FULLWIDTH HYPHEN-MINUS equivalent to MINUS SIGN
FULLWIDTH CENT SIGN equivalent to CENT SIGN
FULLWIDTH POUND SIGN equivalent to POUND SIGN
FULLWIDTH NOT SIGN equivalent to NOT SIGN
If we do, will parsers or other code that use any of the above as meta characters do the wrong thing? But probably parsers will only use "ASCII" codepoints for meta characters (e.g., HYPHEN-MINUS (0x002D) and not either FULLWIDTH HYPHEN-MINUS or MINUS SIGN). Also, mapping these codepoints as equivalent would affect all data, not just codepoints converted from a SJIS flavor. Is that OK?]
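A rough Python sketch of the "map both variants to the same SJIS code" idea follows. It assumes the Unicode-to-SJIS table can be treated as a simple dictionary; `EQUIVALENT_NAMES` and `add_variant_mappings` are illustrative names, and the pairs are the ones listed in the comment above.

```python
import unicodedata

# The equivalence pairs listed above, by Unicode character name.
EQUIVALENT_NAMES = [
    ("FULLWIDTH TILDE", "WAVE DASH"),
    ("PARALLEL TO", "DOUBLE VERTICAL LINE"),
    ("FULLWIDTH HYPHEN-MINUS", "MINUS SIGN"),
    ("FULLWIDTH CENT SIGN", "CENT SIGN"),
    ("FULLWIDTH POUND SIGN", "POUND SIGN"),
    ("FULLWIDTH NOT SIGN", "NOT SIGN"),
]


def add_variant_mappings(unicode_to_sjis):
    """Return a copy of the table in which, for each pair, a missing variant
    is mapped to the same Shift_JIS code as the variant already present."""
    extended = dict(unicode_to_sjis)
    for name_a, name_b in EQUIVALENT_NAMES:
        cp_a = ord(unicodedata.lookup(name_a))
        cp_b = ord(unicodedata.lookup(name_b))
        if cp_a in extended and cp_b not in extended:
            extended[cp_b] = extended[cp_a]
        elif cp_b in extended and cp_a not in extended:
            extended[cp_a] = extended[cp_b]
    return extended


if __name__ == "__main__":
    table = add_variant_mappings({0x301C: 0x8160})
    print({f"U+{cp:04X}": f"0x{sjis:04X}" for cp, sjis in sorted(table.items())})
```

The change is many-to-one only in the Unicode-to-SJIS direction, so it does not settle which single code point 0x8160 should decode to.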
Status: NEW → ASSIGNED
Target Milestone: M17 → M19
Keywords: correctness, nsbeta3
shanjian- please take a look at this
Reassigning to myself.
Assignee: cata → ftang
Status: ASSIGNED → NEW
Marking as assigned.
Status: NEW → ASSIGNED
When used in mail compose, the conversion error causes the wrong charset alert to come up. A Japanese mailing list which I subscribe to always uses this character like this:
~~*~~Paragraph Title~~*~~
paragraph text
~~~~~~~~~~~
Replying to or forwarding (inline) that mail always alerts the user, although the mail does not contain non-Japanese characters.
nsbeta3+ per bug meeting. P1. We should also make sure our converter round-trips: UnicodeToShift_JIS(Shift_JIStoUnicode(shiftjis)) == shiftjis
Priority: P3 → P1
Whiteboard: nsbeta3+
Fixed and checked in. Used the new table generated from CP932.TXT for both the Unicode-to-SJIS and Unicode-to-JIS0208 converters.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
There is a mail problem related to this bug. After checking it with an HTML form, re-assign to me to verify the mail-side issue.
I tested this in 2000-08-17-08 Win32, Mac, and Linux build. This works fine in Win32 and Linux build. In Mac build, this character is displayed as "?". -bug 49380 Changed QA contact to momoi@netscape.com.
QA Contact: teruko → momoi
** Checked with 9/11/2000 Win32 build ** The mail side of the problem was the warning that comes up when trying to reply to the original msg containing this character. This problem no longer occurs as it now maps to an existing Shift_JIS character. Marking the fix verified.
Status: RESOLVED → VERIFIED