Open Bug 121193 Opened 20 years ago Updated 12 years ago

Need a way to let user specify language for unicode encoded webpages.

Categories

(Core :: Internationalization, defect)

x86
Linux
defect
Not set
normal

Tracking

()

Future

People

(Reporter: jg, Assigned: jshin1987)

References

()

Details

(Keywords: intl)

Attachments

(1 file)

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.7) Gecko/20011226
BuildID:    2001122617

Messenger displays Japanese UTF-8 emails fine when using Helvetica fonts.
However, in Mozilla, when I use Babelfish to translate some words I don't know,
the punctuation characters "." and "," are displayed wrongly: the full stop
looks like a small circle and the comma is angled forward, not back.

The formatting displays them as if they were ', i.e. at the top of the line.  I
have seen this in other programs; the problem is that Japanese can run
vertically down the page as well as left -> right. The other time I saw this
error, the text was rotated 90 degrees to the left as well.

So, to sum up, please check that punctuation is supported in vertical text, and
that the horizontal left -> right formatting is fixed.

Possibly this is a problem with X, but as Messenger works fine I suspect it's
Mozilla.

If you would like to test, check this.

BEGIN しく。 また、  END of Japanese. It means nothing, if you are trying to
work it out. :)



Reproducible: Always
Steps to Reproduce:
1. Type some text with Japanese full stops and commas
2. Paste it into babelfish.altavista.com

Actual Results:  comma and full stop at top of line

Expected Results:  comma should be at bottom and so should full stop


Unless in the vertical Japanese mode, please don't stop this.
Changing component from browser-general to internationalization.
Component: Browser-General → Internationalization
reassign for real
Assignee: asa → yokoyama
QA Contact: doronr → ruixu
I could not reproduce the problem on my W2K-Ja.

I pasted the given Japanese text
BEGIN しく。 また、  END  (note: the text between BEGIN and END)
into babelfish.altavista.com and got
BEGIN It does, the く.  In addition, END  

reporter: I think I need a screen shot. Can you attach the image?
ruixu: can you verify this? is this Linux only?

The comma and full stop are displayed as if they were in vertical text, but the
text runs left->right, so they should rest at the bottom, on the line.
jg: thanks for the screen shot.  It looks like this happens only on the Linux
platform.  I didn't see the problem on my Win machine.

shanjian: I am not sure about the Linux fonts, so I am assigning this to you.
Have you seen a similar problem before?  It looks familiar to me.....
Assignee: yokoyama → shanjian
Reporter: 
Could you please provide us with more detailed repro steps and more information 
about your system environment, e.g. your Linux version and language, your
working locale, etc.? Do you still see the same problem on the latest build?
Thank you.
The glyphs for Unicode U+3002 (CJK full stop) and U+3001 (CJK comma) in Korean fonts look 
like these. For Unicode-encoded webpages, we cannot determine the language just from the 
encoding. The font search list is somewhat random. If a Korean font is tried first, these 
2 characters will be rendered this way. 

Web page authors can specify the language through the use of the "lang" attribute. We really 
need a mechanism for the end user to fill in this piece of information when necessary. But 
until bug 115121 is fixed, we can't do much about it.
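
A minimal sketch of the lookup described above, for illustration only: it is not Mozilla's code. It assumes the language hint is taken from the page's "lang" attribute first, then from a user-supplied override (the mechanism this bug asks for), then from the locale. The font names and the `user_override` parameter are hypothetical.

```python
# Hypothetical font lists per language; the names are placeholders, not real
# Mozilla or X11 font identifiers.
FONTS_BY_LANG = {
    "ja": ["japanese-gothic"],
    "ko": ["korean-gothic"],
    "zh": ["chinese-song"],
}


def resolve_language(page_lang=None, user_override=None, locale_lang=None):
    """Return the first language hint that is set, or None if there is none."""
    for hint in (page_lang, user_override, locale_lang):
        if hint:
            # Keep only the primary language subtag: "ja_JP" or "ja-JP" -> "ja".
            return hint.split("-")[0].split("_")[0].lower()
    return None


def font_search_order(lang):
    """Put the fonts for `lang` first, then every other font as a fallback."""
    preferred = list(FONTS_BY_LANG.get(lang, []))
    fallback = [font
                for other, fonts in FONTS_BY_LANG.items() if other != lang
                for font in fonts]
    return preferred + fallback


# A UTF-8 page with no lang attribute, viewed under a Japanese locale:
lang = resolve_language(page_lang=None, user_override=None, locale_lang="ja_JP")
print(lang, font_search_order(lang))
```

With no hint at all, the order falls back to whatever order the table happens to have, which is the "somewhat random" case in which a Korean font can be tried first.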
Status: UNCONFIRMED → ASSIGNED
Depends on: 115121
Ever confirmed: true
Summary: UTF-8 Japanese incorrectly displayed → Need a way to let user specify language for unicode encoded webpages.
Target Milestone: --- → Future
Well, it looks like we are waiting on
http://bugzilla.mozilla.org/show_bug.cgi?id=115121

This is my current system setup. It has changed since I reported the bug, but
the problem is still the same, so I will list the differences. I wanted to get
kinput2 working, so I made some modifications.


$ locale
LANG=ja_JP
LC_CTYPE=ja
LC_NUMERIC=en_GB
LC_TIME=en_GB
LC_COLLATE=en_GB
LC_MONETARY=en_GB
LC_MESSAGES=en_GB
LC_PAPER="ja_JP"
LC_NAME="ja_JP"
LC_ADDRESS="ja_JP"
LC_TELEPHONE="ja_JP"
LC_MEASUREMENT="ja_JP"
LC_IDENTIFICATION="ja_JP"
LC_ALL=

What was different before was LC_CTYPE=en_GB:en, if I remember correctly.
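
For reference, a small sketch of how the effective value of each locale category follows from the listing above, using the POSIX precedence LC_ALL, then the specific LC_* variable, then LANG. The helper names are hypothetical, and how Mozilla itself reads the locale is not shown here.

```python
def effective_locale(env, category):
    """Resolve one locale category using the POSIX order LC_ALL, LC_*, LANG."""
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name, "")
        if value:  # an empty string counts as unset
            return value
    return "C"


def language_of(locale_name):
    """Strip territory, codeset and modifier: 'ja_JP.eucJP' -> 'ja'."""
    return locale_name.split(".")[0].split("@")[0].split("_")[0]


# Only the variables explicitly set in the listing above. Values that `locale`
# prints in double quotes are implied by LANG rather than set directly.
env = {
    "LANG": "ja_JP",
    "LC_CTYPE": "ja",
    "LC_NUMERIC": "en_GB",
    "LC_TIME": "en_GB",
    "LC_COLLATE": "en_GB",
    "LC_MONETARY": "en_GB",
    "LC_MESSAGES": "en_GB",
    "LC_ALL": "",
}

for category in ("LC_CTYPE", "LC_MESSAGES", "LC_PAPER"):
    value = effective_locale(env, category)
    print(category, value, language_of(value))
```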

Please note this bug is only present in Mozilla, not Messenger. That must be
doing something correctly; perhaps you can compare the UTF-8 stuff?

My OS is Linux Mandrake 8.1, English install.

Other things I have modified when trying to get Japanese to work are

http://www.mandrakeforum.org/article.php?sid=1420&lang=en

and an English guide for Japanese as a second language
http://www.math.wisc.edu/~stefanss/japanese/index.html

It still does not work after the modifications.

The Unicode fonts tested in Mozilla are wadalab-gothic-jisx0208, 1983-0
and helvetica. It still displays incorrectly, though.

JG

JG, Thank you for the detailed information.

We are able to reproduce this problem with recent builds on RedHat JA Linux 7.1. 
When copying your test text between BEGIN and END to the site 
http://babelfish.altavista.com, the problem seems to happen only if the encoding 
is set to Korean in the Mozilla browser. We also tried the UTF-8, Japanese and 
Chinese encodings, and the text is displayed correctly with those.

Could you please check what your browser encoding is? Thank you.
Keywords: intl
Hello
I just did some more tests.
I have never visited a Korean site to my knowledge, and I have not set it to
Korean in the View->Character coding.. menu.

I tested the following encodings (the Babelfish site was only an example of a
site with a box to type into; I did these tests on this Mozilla bug report page!)

UTF-8:      Bug present
Shift_JIS:  NOT
EUC_JP:     NOT
ISO-2022-JP:  NOT  (is this some windows format?)

So it appears that it is a UTF-8 only bug, with Mozilla thinking the text is
Korean, as someone suggested. But as Unicode has separate punctuation for each
language (I think), I don't know why it would display a Korean full stop and
comma.

I believe it is the same bug I experienced on Windows with some English text
editors. I had all text rotated 90 degrees anti-clockwise, thus the full stop
was at the top. This is due to Japanese going top->bottom (then moving left
across the page) as well as left->right.

I don't know if my idea is correct.

Today Mozilla locked up when I was writing a Japanese email; the whole thing
crashed. Just when writing this it locked up for 10 secs before coming back. I
won't report a bug as I doubt I can replicate it.

I don't know who wrote the UTF-8 code for top->bottom text, but could someone
add them to the CC list?

JG
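
On the question of separate punctuation per language: in Unicode each of the two characters is a single code point that Japanese, Chinese and Korean text all share, so the encoded text itself carries no language information. The short check below uses only Python's standard library to show the code points, their names, and their bytes in the encodings tested above plus EUC-KR. It is an illustration, not part of the browser.

```python
import unicodedata

CODECS = ("shift_jis", "euc_jp", "iso2022_jp", "euc_kr")

for ch in "\u3002\u3001":
    # One code point and one name per character, whatever the language.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          "utf-8:", ch.encode("utf-8").hex())
    for codec in CODECS:
        raw = ch.encode(codec)
        # Each legacy encoding decodes back to the very same code point.
        assert raw.decode(codec) == ch
        print(f"  {codec:<10} {raw.hex()}")
```

Since the Japanese and the Korean encodings both map to the same two code points, a UTF-8 page gives the renderer nothing to choose a Japanese or a Korean font by.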

JG, Thanks a lot for the updates!

Here is the summary:
1. On JA RedHat Linux 7.1, JA MacOS 9.1 and EN MacOS X:
    When the encoding is set to Korean and a JA IME is used, the problem appears, 
    but this combination is unlikely to happen with real users. 
2. On EN Linux Mandrake 8.1:
    When the encoding is set to UTF-8 and a JA IME is used, the problem appears. 
    It is really bad in this case. 

It can even be reproduced using the "Additional Comments" box in the bug report.
QA Contact: ruixu → ylong
Hello,
Updates are fine :) Thank you for checking this bug, and hopefully fixing it!

My browser is set to UTF-8 by default, so it's any site with a text box in
Mozilla. Note that Messenger is fine; what is it doing differently?

JG
I need to clarify something here. 
1) this problem should only exist with unicode encoding (like UTF-8).
2) Changing encoding from UTF8 to something else will cause garbled display, 
   unless the character is represented in NCR, (like —).
3) We can't do much before 115121 is fixed. For Unicode encodings, our best 
   guess is the locale language. That is to say, you should see the 2 chars 
   rendered in a Japanese font if you are running in a Japanese locale. (We may 
   still have some minor issues.) If you see a Korean font used under a Japanese 
   locale inside the browser, and you do have a Japanese font, please file a bug. 

I will keep this bug for the issue mentioned in summary. 
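
Point 2 can be illustrated with a short sketch, using Python's standard library only. It shows the general effect of decoding UTF-8 bytes under another charset, not the browser's own code path: the raw bytes come out garbled, while a numeric character reference (NCR) is plain ASCII and still resolves to the same character.

```python
import html

full_stop = "\u3002"  # ideographic full stop

# The raw UTF-8 bytes, misread as Shift_JIS, no longer give the same character.
utf8_bytes = full_stop.encode("utf-8")
misread = utf8_bytes.decode("shift_jis", errors="replace")
print(repr(misread), misread == full_stop)

# An NCR is pure ASCII, so it passes through the wrong decoder unchanged and
# can still be resolved to the intended code point afterwards.
ncr = "&#x3002;"
survived = ncr.encode("ascii").decode("shift_jis")
restored = html.unescape(survived)
print(survived, restored == full_stop)
```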

> this problem should only exist with unicode encoding (like UTF-8).

Per our discussion, there is only one case of this kind: 
UTF-8 on EN Linux Mandrake 8.1

but it is not reproducible with UTF-8 on Linux RedHat 7.1



I could reproduce it on RH7.2. It depends on font availability and 
resolving order. Even if you don't see it on a certain platform, the problem 
still exists. It will happen in a different scenario.
> I could reproduce it on RH7.2. It depends on font availability and 
> resolving order. Even if you don't see it on a certain platform, the problem 
> still exists. It will happen in a different scenario.

  I agree. I was able to reproduce it on my RH 7.1. I launched Mozilla
under a Japanese locale, but still saw Korean glyphs take precedence
over Japanese glyphs for the two characters in question (with Character coding
set to UTF-8). 

  BTW, these two characters (ideographic full stop and ideographic comma)
should NOT have been unified in Unicode/ISO 10646 with the full stop for
vertical writing and the comma for vertical writing, which are present ONLY in
KS X 1001 (at 1-2 and 1-3, respectively). Apparently, the annotations to the two
characters at 1-2 and 1-3 in KS X 1001 were overlooked and they were falsely
identified with the ideographic full stop and ideographic comma in JIS 
(and perhaps in the corresponding PRC and ROC standards). I'll raise the issue
on the Unicode mailing list.
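
The row-cell positions mentioned above can be read off the EUC bytes, where each byte is the row or cell number plus 0xA0. The sketch below relies on the euc_kr and euc_jp mapping tables bundled with Python, which record only the code point mapping and say nothing about the vertical-writing annotations; `row_cell` is a hypothetical helper.

```python
def row_cell(ch, codec):
    """Return the (row, cell) of `ch` in a 94x94 set, read from its EUC bytes."""
    raw = ch.encode(codec)
    if len(raw) != 2:
        raise ValueError(f"{ch!r} is not a two-byte character in {codec}")
    return raw[0] - 0xA0, raw[1] - 0xA0


for codec in ("euc_kr", "euc_jp"):
    for ch in "\u3001\u3002":
        row, cell = row_cell(ch, codec)
        print(f"{codec}: U+{ord(ch):04X} at {row}-{cell}")
```

Both tables place the two characters in row 1, which is how the KS X 1001 and JIS characters ended up mapped to the same Unicode code points.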

Any progress on resolving this bug? I can see it now in 2002 03 11 15
Related bug (the opposite case) is bug 122779 (closed as of now, but
to be reopened). 
Confirming still present in
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021029, build 2002102908

Could a priority or milestone be set on this one? I think it's rather more
important than "future".

Regards

JG
Keywords: mozilla1.3
We can't do much before xml:lang is resolved. After that, there is still a lot
of work to do, so I don't see this problem being resolved in the near future. The
workaround is to use the "lang" attribute in the HTML web page.
Depends on: 41978
bug 41978 (xml:lang) was fixed, but there is
still a lot of work to do to fix this (including
adding a langObserver similar to charsetObserver),
as Shanjian wrote.

IMHO, it should be evangelized that every 
Unicode web page specifies the lang pseudo-element
for html and xml:lang for xml. Even with that,
there's a bug (bug 204586) to fix (which is easier to fix
than this) with languages/scripts that should
benefit most from using Unicode (that is,
those scripts/languages for which Unicode is the
first and only widely accepted character set).
Depends on: 234599
No longer depends on: 41978
(In reply to comment #21)

> IMHO, it should be evangelized that every 
> Unicode web page specifies the lang pseudo-element
> for html

You mean lang attribute?  I was just thinking the same thing myself.  But in the
meantime...

(In reply to comment #7)

> Web page authors can specify the language through the use of "lang" attribute.
> We really need a mechanism for end user to fill in this piece of information
> when necessary.

Yes please!  This bites me all the time.  I need to set LANG (or LC_CTYPE and
LC_MESSAGES) to ja_JP in order to enable Japanese text input, but then all
Unicode pages lacking language tags are rendered using a Japanese font, which
looks terrible for English text (because it's fixed-width).  I haven't been able
to find a way to make the default language English without disabling Japanese
input.  Having separate control over the default language for various purposes
(UI input, UI output, page rendering) would be very helpful.  Being able to
change the language in a single window would be even better.  Maybe via a menu,
analogous to the menu that allows the charset to be overridden for one window.

By the way, this is with Firefox 0.9.3 on Linux.
It's not that I can work on this at the moment (it requires a lot of changes),
but I have a slightly higher chance of being able to work on this than shanjian
does, and it helps me track this one better.
Assignee: shanjian → jshin
Status: ASSIGNED → NEW
QA Contact: amyy → i18n