Closed Bug 245770 Opened 16 years ago Closed 16 years ago

backslash rendered as yen in japanese locale

Categories

(Core :: Internationalization, defect)

defect
Not set

Tracking

()

VERIFIED FIXED

People

(Reporter: glandium, Assigned: jshin1987)

Details

(Keywords: fixed-aviary1.0, fixed1.7.5)

Attachments

(3 files, 3 obsolete files)

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; ja-JP; rv:1.6) Gecko/20040602 Firefox/0.8
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; ja-JP; rv:1.6) Gecko/20040602 Firefox/0.8

In the attached page, both the backslash and the yen character are rendered as
the yen character if the locale is Japanese, even though the page is definitely
encoded in UTF-8, which provides two different code points for these characters.
The rendering is correct in the English locale.

Reproducible: Always
Steps to Reproduce:
Attached file testcase
Assignee: firefox → smontagu
Component: General → Internationalization
Product: Firefox → Browser
QA Contact: firefox.general → amyy
Version: unspecified → Trunk
Well, this is the notorious Yen vs. reverse solidus (backslash) issue.
It's well known and was done this way on purpose somewhere in the 'layout' code.
The same was the case for the Korean locale (with the WON sign), but I persuaded ftang
to get rid of that years ago (see
http://lxr.mozilla.org/seamonkey/source/layout/html/base/src/nsTextTransformer.cpp#816
and bug 88050). I'd like to remove that for Japanese, too, but ftang wanted to
keep it.

The problem is that we really don't know what 0x5c in Shift_JIS and EUC-JP
represents. An alternative to what we're doing is to replace 'back slash' with
'Yen' only when the locale is Japanese and the doc. charset is one of legacy
Japanese character encodings so that UTF-* would preserve the distinction
between back slash and Yen. I have yet to check whether this is feasible. 
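
A minimal sketch of the alternative described above, not the actual Gecko code. The function names `IsLegacyJapaneseCharset` and `ShouldShowBackslashAsYen` are hypothetical, and the charset list is assumed to be the three legacy encodings named in this bug.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <string>

// Hypothetical helper: lower-case an ASCII charset label so that
// "Shift_JIS" and "shift_jis" compare equal.
static std::string ToLowerAscii(std::string label) {
  std::transform(label.begin(), label.end(), label.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  return label;
}

// Hypothetical helper: true only for the legacy Japanese encodings
// discussed in this bug (assumed list, no alias resolution).
bool IsLegacyJapaneseCharset(const std::string& charset) {
  static const std::array<const char*, 3> kLegacy = {
      "shift_jis", "euc-jp", "iso-2022-jp"};
  const std::string lowered = ToLowerAscii(charset);
  return std::any_of(kLegacy.begin(), kLegacy.end(),
                     [&lowered](const char* name) { return lowered == name; });
}

// Replace backslash with Yen only when the locale is Japanese AND the
// document uses a legacy Japanese encoding; UTF-* documents keep U+005C.
bool ShouldShowBackslashAsYen(const std::string& localeLang,
                              const std::string& charset) {
  return localeLang == "ja" && IsLegacyJapaneseCharset(charset);
}
```

With this predicate, a UTF-8 page viewed under a Japanese locale would keep its backslashes, which is the behavior the reporter asks for.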

For better tracking, assigning to myself.
Assignee: smontagu → jshin
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Linux → All
Hardware: PC → All
According to bug 4238 comment 29, the original code in nsTextTransformer was
designed to apply only to legacy Japanese charsets, but I don't see from a quick
look at the patches there how this was supposed to happen.
It seems like nsPresContext::UpdateCharSet() in attachment 15040 was written to
apply JA-specific 'transformer' only to documents in Japanese legacy encodings.
[1]  I need to take a closer look as to why it's also applied to UTF-8 documents
under Japanese locale. 

[1]
http://lxr.mozilla.org/seamonkey/source/layout/base/src/nsPresContext.cpp#719
Then, there's probably an issue with EUC-JP as well... because in EUC-JP, 0x5C
is not *necessarily* the yen symbol...
See http://sources.redhat.com/ml/libc-alpha/2000-10/msg00190.html for details.

(for instance, converting the attached testcase back and forth to euc-jp through
iconv gives two backslashes, converting back and forth to shift-jis gives two
yen symbols)
(In reply to comment #5)
> Then, there's probably an issue with EUC-JP as well... because in EUC-JP, 0x5C
> is not *necessarily* the yen symbol...

 Sure, I'm very well aware of the problem. I'd rather remove the
'transformation' altogether, as I wrote in bug 88050. 
(In reply to comment #4)
>I need to take a closer look as to why it's also applied to UTF-8 documents
> under Japanese locale. 

I think the problem is here:
http://lxr.mozilla.org/seamonkey/source/intl/locale/src/nsLanguageAtomService.cpp#249
which comes from bug 39570.
We discussed this problem on Bugzilla-jp.
http://bugzilla.mozilla.gr.jp/show_bug.cgi?id=3595

Opinions on this problem are divided among Japanese people as well.
1. Mozilla should never replace 0x5C with U+5A.
2. Mozilla should replace 0x5C with U+5A only in documents encoded in Shift_JIS,
EUC-JP, or ISO-2022-JP.
3. The user should be able to choose whether to replace it or not.

We have a question: why does Mozilla replace it?
WinIE and Opera do nothing special for 0x5C.
Sorry.
U+5A -> U+A5.
(In reply to comment #8)

Thanks for your input.

> 1. Mozilla should never replace 0x5C with U+5A.
> 2. Mozilla should replace 0x5C with U+5A only in documents encoded in Shift_JIS,
> EUC-JP, or ISO-2022-JP.
> 3. The user should be able to choose whether to replace it or not.
> 
> We have a question: why does Mozilla replace it?
> WinIE and Opera do nothing special for 0x5C.

See bug 39570 and bug 88050. Anyway, I'm strongly in favor of option #1 because
unless Mozilla suddenly acquires 'near-human' intelligence :-), it's all but
impossible to tell which character the author of a document meant by 0x5c, 'back
slash' or Yen, so I think it's better to just leave it as it is.



 
However, we think this is a bug in the UTF-* case.
If the OS locale is Japanese, 0x5C is always replaced with U+A5.
Even if an element in a UTF-* encoded document has :lang(en), Mozilla replaces
it in that element.
This behavior is wrong.
The replacement of 0x5C with U+A5 should happen only in elements that have :lang(ja).

And this behavior has a problem.
In the HTML spec, the document character set is ISO 10646.
In other words, \ and ¥(or ¥) are different characters.

In source code (e.g., Perl), 0x5C is replaced with U+A5 as well, which is troublesome.
In this case, the user can't display a backslash.
However, Japanese fonts usually have the yen sign glyph at U+5C.

I think that the best choice of the behavior is #3 of comment 8.
(In reply to comment #11)
> The replacement of 0x5C with U+A5 should happen only in elements that have :lang(ja).

 I don't think that's the case. There is NOT Japanese Unicode. Neither is there
non-Japanese Unicode. There's only one Unicode and 'U+005C' is 'Reverse Solidus'
period. Note that what you wrote above is different from option 3 and option 2. 

> However, Japanese fonts usually have the yen sign glyph at U+5C.

 Having a 'YEN' sign glyph for U+005C is clearly a bug in those fonts. Microsoft
should fix the bug in their fonts. Anyway, that's beside the point here.
 
> I think that the best choice of the behavior is #3 of comment 8.

If we do that, what should be the default? 


> I don't think that's the case. There is NOT Japanese Unicode. 
> Neither is there non-Japanese Unicode.
Yes.
What I meant to say is this:
If Mozilla replaces 0x5C with U+A5 in UTF-*, 0x5C should be replaced only in
elements that have :lang(ja).
Mozilla should not replace it in other elements.
I think this behavior is a quirk for the Japanese-language environment.
In other words, this behavior should not apply to documents written in other
languages, even when they are displayed on a system whose locale is Japanese.

> Microsoft should fix their bug in their fonts.

That is not realistic.
It would be fine for Unicode applications.
But applications that use the native code page (CP932) would then be unable to
display the yen sign.

> If we do that, what should be the default? 

I think the default should be NOT to replace.
momoi-san expressed the same opinion at
http://bugzilla.mozilla.gr.jp/show_bug.cgi?id=3595#c69 ,
because the most widely used UA is WinIE, and WinIE doesn't have this behavior.
(In reply to comment #13)
> > I don't think that's the case. There is NOT Japanese Unicode. 
> > Neither is there non-Japanese Unicode.
> Yes.
> What I meant to say is this:
> If Mozilla replaces 0x5C with U+A5 in UTF-*, 0x5C should be replaced only in
> elements that have :lang(ja).
> Mozilla should not replace it in other elements. ....
> even when they are displayed on a system whose locale is Japanese.

  It sounds to me that what you wrote above is equivalent to saying that there
are two versions of Unicode, Japanese and non-Japanese. Note that the reporter
of this bug wants to get back his backslash even when the locale is JA (if the
document is in UTF-8)
 
> > Microsoft should fix their bug in their fonts.
> 
> That is not realistic.
> It would be fine for Unicode applications.
> But applications that use the native code page (CP932) would then be unable to
> display the yen sign.

It's realistic and possible. See my posting to the Unicode mailing list at
http://www.unicode.org/mail-arch/unicode-ml/y2002-m10/0340.html (use
'unicode-ml' and 'unicode' as the username and password)
 
> > If we do that, what should be the default? 
> 
> I think the default should be NOT to replace.
> momoi-san expressed the same opinion at
> http://bugzilla.mozilla.gr.jp/show_bug.cgi?id=3595#c69 ,
> because the most widely used UA is WinIE, and WinIE doesn't have this behavior.

Ok. That's easy enough. I'll do that over the weekend.
Status: NEW → ASSIGNED
momoi-san:
If you have an opinion, please comment here.
I hope you will.

> It sounds to me that what you wrote above is equivalent to saying that there
> are two versions of Unicode, Japanese and non-Japanese.

As far as the rendered characters are concerned, yes, that is what I said.
But as far as the encoding is concerned, I did not say that.
# sorry. I cannot speak English well.

> See my posting to the Unicode mailing list

# Sorry. I have not yet read it.
In Japan, many people recognize 0x5C as the yen sign.
If this is fixed in the fonts, many Japanese people will find it strange.
E.g., for many Japanese people, the Windows path separator is the yen sign, not
the backslash.
Attached patch patch (obsolete)
I added layout.enable_japanese_specific_transform pref. entry. It's false by
default.
Comment on attachment 150722
patch

asking for r/sr. 

I also got rid of 'eLanguageSpecificTransformType_Korean' because it's not used
anywhere.
Attachment #150722 - Flags: superreview?(dbaron)
Attachment #150722 - Flags: review?(smontagu)
Comment on attachment 150722
patch

If the pref is set to true, will backslash still be replaced by Yen on Japanese
locale even in UTF-8 documents or when it is specified by \?
We also recognize the problem.
We don't want 0x5C to be replaced with U+A5 in UTF-* documents.

However, there may also be people who want the replacement
(their environment may be the reason).
So we need this behavior for Japanese documents.
Attached patch a new patch (obsolete)
smontagu's comment prompted me to make the pref
(layout.enable_japanese_specific_transform) only effective when the character
encoding is one of Japanese legacy encodings (EUC-JP, Shift_JIS, ISO-2022-JP).
Attachment #150722 - Attachment is obsolete: true
Attachment #150722 - Flags: superreview?(dbaron)
Attachment #150722 - Flags: review?(smontagu)
I've just realized that this is likely to result in a regression. A better patch
is coming up soon.
It bothers me that this is still not identical to any of the options described
in comment 8. Doesn't option 3 mean that when the pref is set, replacement will
take place in all documents whatever the encoding?
Attached patch yet another patch (obsolete)
Instead of changing the behavior of 'UpdateCharset', I changed the condition for
activating the Japanese-specific transform. It's now activated only when all of
the following three conditions are satisfied:

  1. mLangGroup is ja
  2. the pref. entry is true
  3. charset is not one of the Unicode encodings

Actually, the check for the 3rd condition is not robust enough because the raw
charset name of a Unicode encoding does not always begin with 'UTF-'. I could
invoke the charset alias resolution routine, but it seems expensive for little
gain. Simon, what do you think?
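
The three-condition check above can be sketched roughly as follows. This is an illustration under stated assumptions, not the patch itself: `HasUtfPrefix` and `UseJapaneseTransform` are hypothetical names, and plain `std::string` stands in for the real Gecko string and atom types.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Case-insensitive test that the raw charset name starts with "UTF-".
// This mirrors a prefix test in spirit only; no alias resolution is done.
bool HasUtfPrefix(const std::string& charset) {
  static const char kPrefix[] = "utf-";
  const std::size_t kLen = sizeof(kPrefix) - 1;
  if (charset.size() < kLen) {
    return false;
  }
  for (std::size_t i = 0; i < kLen; ++i) {
    if (std::tolower(static_cast<unsigned char>(charset[i])) != kPrefix[i]) {
      return false;
    }
  }
  return true;
}

// Hypothetical stand-in for the activation test: all three conditions
// (lang group is ja, pref is on, charset is not "UTF-*") must hold.
bool UseJapaneseTransform(const std::string& langGroup, bool prefEnabled,
                          const std::string& charset) {
  return langGroup == "ja" && prefEnabled && !HasUtfPrefix(charset);
}
```

The weakness mentioned above shows up with a Unicode label that lacks the prefix. Taking the IANA name ISO-10646-UCS-2 purely as an illustrative input, `UseJapaneseTransform("ja", true, "ISO-10646-UCS-2")` returns true in this sketch, so the transform would be applied to a Unicode document.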
Attachment #150726 - Attachment is obsolete: true
Simon:

When we discussed this, our conclusion was that the replacement should not occur
in documents in non-Japanese encodings.
The reason is that when we view a document written in another language,
we don't want the replacement.
Instead of that, why don't you remove the code in LookupCharset that starts with
   if (langGroup == mUnicode) {
     langGroup = GetLocaleLanguageGroup(&res);
and do that in nsPresContext::UpdateCharSet() after setting the transform type?
Why do we want a pref?  What's the right thing to do?
Note that the condition 1 in comment 23 might not work always, because of bug
234485. (my guess)
(In reply to comment #26)
> Why do we want a pref?  What's the right thing to do?

 There's no clear-cut answer because what '0x5c' means in
Shift_JIS/EUC-JP/ISO-2022-JP is always ambiguous. I'd rather remove the
replacement altogether, but some Japanese users want to keep that behavior for
documents in one of legacy Japanese encodings (but not in documents in Unicode.
There's no ambiguity at all in the identity of U+005C.) and the consensus among
Japanese mozilla users is that we need a pref which is off by default. 

Nakano-san answered Simon's questions (actually, I did - unless my memory is
failing me-, too, but it seems like my answer got thrown away). 

re comment #25: I've just made that change. I'll upload it later today or
tomorrow after testing it.

re comment #27: Even with that fixed, it wouldn't work. The behavior only
depends on whether the current document encoding is Japanese or not and the
value of the pref.  That is, xml:lang and lang don't play any role here. That
shouldn't matter much because I don't think there are many documents in the wild
with 'lang=xx or xml:lang=xx' (where xx is not ja/ja_JP) that are encoded in
Shift_JIS/EUC-JP/ISO-2022-JP. For Unicode-encoded documents, we want to leave
U+005C alone no matter what so that we don't have to worry about them. 
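
To make the resulting behavior concrete, here is a small sketch of the display-side effect, assuming the decision depends only on the pref and on whether the document is in a legacy Japanese encoding. `TransformSettings` and `DisplayText` are hypothetical names for illustration; the real code works on Gecko's text transformer rather than on `std::u16string`.

```cpp
#include <string>

// Hypothetical settings bundle; in the real patch these values come from
// the pref service and the document charset, not from a struct like this.
struct TransformSettings {
  bool prefEnabled = false;        // layout.enable_japanese_specific_transform, off by default
  bool legacyJapaneseDoc = false;  // document is in Shift_JIS, EUC-JP, or ISO-2022-JP
};

// Returns the text as it would be displayed: U+005C becomes U+00A5 only
// when both the pref and the legacy-encoding condition hold.
std::u16string DisplayText(const std::u16string& text,
                           const TransformSettings& settings) {
  std::u16string out = text;
  if (!settings.prefEnabled || !settings.legacyJapaneseDoc) {
    return out;  // default configuration and Unicode documents are untouched
  }
  for (char16_t& ch : out) {
    if (ch == u'\\') {
      ch = char16_t(0x00A5);  // YEN SIGN
    }
  }
  return out;
}
```

Note that lang and xml:lang do not appear anywhere in the sketch, matching the point that they play no role in the decision.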
Attached patch a new patch
changed per Simon's comment
Attachment #150730 - Attachment is obsolete: true
Comment on attachment 150786
a new patch 

r=smontagu. We will want to release note this change in behaviour.
Attachment #150786 - Flags: review+
Comment on attachment 150786
a new patch 

Thanks for r.
Asking for sr.
Attachment #150786 - Flags: superreview?(dbaron)
Comment on attachment 150786
a new patch 

I'm not convinced about the need for the pref, but sr=dbaron.
Attachment #150786 - Flags: superreview?(dbaron) → superreview+
checked in to the trunk
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Comment on attachment 150786
a new patch 

asking for a1.7.1 (considering that 1.7.* branch will be long-lived, I think we
have to make it consistent with 1.8 and later). I'll ask for aviary 1.0
approval separately.

risk : very low
affected users: anyone using Mozilla under Japanese locale and anyone  viewing
documents in legacy Japanese encodings.

effect: turn off, by default (a pref. was added to turn it on), the replacement
of '0x5c' (backslash) with Yen Sign in documents in legacy Japanese encodings.
In documents in any other encodings, U+005C is preserved regardless of the
pref. value.
Attachment #150786 - Flags: approval1.7.1?
What are all the other changes in this patch related to Korean?
Virtually nothing. I just got rid of what should have been removed a long time
ago. They haven't been used  since bug 88050 was fixed. 
Comment on attachment 150786
a new patch 

a=mkaply for 1.7.1
Attachment #150786 - Flags: approval1.7.1? → approval1.7.1+
I tested on 2004062109-trunk/WinXP.
This patch works fine.

Thank you to everyone involved in this bug.
I'm sorry I didn't realize that there were some changes (including
deCOMization) between the 1.7 branch and the trunk in files affected by my patch.
I'll make a separate patch for the 1.7 branch and ask for r/sr/a. 

BTW, I found the following document about the conversion between Japanese
encodings and Unicode. It also talks about backslash vs Yen problem.

http://www.w3.org/TR/japanese-xml/
Is the patch applied on the aviary branch? With a checkout from 2 days ago,
the bug is still there.
This bug still occurs on Firefox 1.0 PR.
Jungshik Shin, when will the fix be applied to Firefox or Mozilla 1.7.x?
I'm sorry I haven't gotten back to this earlier. Due to some changes in the
language atom service and nsPresContext, attachment 150786 can't be applied to
the 1.7/av 1.0 branches. 

This is rather similar to attachment 150730. I was sorta forced to take this
approach.
(In reply to comment #43)
> This is rather similar to attachment 150730.
Jungshik Shin, do you mean that Comment #25 from Simon can be ignored for the 1.7
branch and the Aviary branch?
(In reply to comment #44)
> (In reply to comment #43)
> > This is rather similar to attachment 150730.
> Jungshik Shin, do you mean that Comment #25 from Simon can be ignored for the 1.7
> branch and the Aviary branch?

As an end-user, you are not likely to see any problem. Simon's comment #25 is
about avoiding the following test for Unicode encoding forms,
|nsCRT::strncasecmp(aCharSet, "UTF-", 4)|, which is not as robust as we want it
to be. I'd love to address it for the branch, but I couldn't come up with a
clean way, so I ended up falling back to a less robust alternative. 

Comment on attachment 160035
1.7 branch and aviary 1.0 patch

I thought I had asked for r/sr, but apparently I haven't. This is basically the
same as what's been committed to the trunk except that the test for 'Unicode'
encoding form is less robust (Simon's comment #25 was not addressed in this
patch) because I couldn't find a clean way to do that in the branch.
Attachment #160035 - Flags: superreview?(dbaron)
Attachment #160035 - Flags: review?(smontagu)
Attachment #160035 - Flags: review?(smontagu) → review+
Attachment #160035 - Flags: superreview?(dbaron) → superreview+
Comment on attachment 160035
1.7 branch and aviary 1.0 patch

asking for approval to branches.
The previous patch was already approved for the branch check-in, but it turned
out that the branch needs a different patch.
Attachment #160035 - Flags: approval1.7.x?
Attachment #160035 - Flags: approval-aviary?
Comment on attachment 160035
1.7 branch and aviary 1.0 patch

a=asa for branches checkins.
Attachment #160035 - Flags: approval1.7.x?
Attachment #160035 - Flags: approval1.7.x+
Attachment #160035 - Flags: approval-aviary?
Attachment #160035 - Flags: approval-aviary+
Jshin, has your patch been checked in to Firefox?

Problem still occurs on both "Firefox 1.0 PR release build" and "Firefox 1.0 RC1
release build" (I tested on Win-2K).
Rough changelog of Firefox 1.0RC also does not include this bug.
 ( http://www.mozilla.org/projects/firefox/qa/changelog-rc1.html )

Since this bug's severity is neither blocker nor critical, it cannot be a
blocker for Firefox 1.0.
However, we Japanese would be happy if this bug were fixed in Firefox 1.0 in
addition to the Mozilla trunk.
Sorry and thanks. Somehow I landed the patch only in 1.7 branch. I've just asked
for the approval for aviary-1.0 checkin (it seems like I need a new approval) 
This bug still occurs on Firefox 1.0 Release Candidate 2 (Release build, Win-2K).
Jshin, will the fix not be applied to the final Firefox 1.0 release? 
I'm waiting for the re-approval. 
Comment on attachment 160035
1.7 branch and aviary 1.0 patch

a=asa for aviary checkin.
checked into the av-1.0 branch
Verified with Firefox nightly latest-trunk build(Win32,ZIP).
> Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8a5) Gecko/20041104 Firefox/0.9.1+

Jshin, thanks for your effort.
verifying per Wada's comment
Status: RESOLVED → VERIFIED
Not fixed in View source window with 2004-11-07-12-0.11/Win32.
What font is used to render view-source? If you use one of *broken* Japanese
truetype fonts shipped with Japanese Windows, you can't tell because in those
fonts, the glyph for U+005C (Reverse Solidus/Backslash) has the shape of
'Japanese Yen'. I tried to persuade MS engineers to fix this font issue, but
failed. 

http://www.unicode.org/mail-arch/unicode-ml/y2002-m10/0340.html (username :
unicode, password: unicode-ml)
Do we still need this pref? Do people actually use it? It's annoying to have special code just for this.