Open Bug 121193 Opened 20 years ago Updated 12 years ago
Need a way to let user specify language for unicode encoded webpages
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.7) Gecko/20011226
BuildID: 2001122617

Messenger displays Japanese UTF-8 emails fine when using Helvetica fonts. However, in Mozilla, when I use Babelfish to translate some words I don't know, the punctuation characters "." and "," are displayed wrongly: the full stop looks like a small circle and the comma is angled forward, not back. The formatting displays them as if they were an apostrophe ('), i.e. at the top of the line.

I have seen this in other programs. The problem is that Japanese can run vertically (top to bottom) as well as left -> right; the other time I saw this error, the text was also rotated 90 degrees to the left. So, to sum up, please check that punctuation is supported in vertical text, and that the horizontal left -> right formatting is fixed. Possibly this is a problem with X, but as Messenger works fine I suspect it's Mozilla.

If you would like to test, check this:
BEGIN しく。 また、 END
(End of the Japanese; it means nothing, in case you are trying to work it out. :)

Reproducible: Always
Steps to Reproduce:
1. Type some text with Japanese full stops and commas.
2. Paste it into babelfish.altavista.com.
3.

Actual Results: Comma and full stop at the top of the line.
Expected Results: The comma and the full stop should both be at the bottom of the line. The exception is the vertical Japanese mode; please don't break that.
Changing component from browser-general to internationalization.
Component: Browser-General → Internationalization
reassign for real
Assignee: asa → yokoyama
QA Contact: doronr → ruixu
I failed to see the problem on my W2K-Ja. I pasted the given Japanese text
BEGIN しく。 また、 END
(note: only the text between BEGIN and END) into babelfish.altavista.com and got
BEGIN It does, the く. In addition, END
reporter: I think I need a screenshot. Can you attach the image?
ruixu: can you verify this? Is this Linux only?
This comma and full stop are displayed as if they were in vertical text, but the text runs left -> right, so they should rest at the bottom, on the line.
jg: thanks for the screenshot. It looks like this happens only on the Linux platform. I didn't see the problem on my Win machine. shanjian: I am not sure about the Linux fonts, so I am assigning this to you. Have you seen a similar problem before? It looks familiar to me.....
Assignee: yokoyama → shanjian
Reporter: Could you please provide us more detailed repro steps and more information about your system environment, e.g. your Linux version and language, your working locale, etc.? Do you still see the same problem on the latest build? Thank you.
The glyphs for Unicode U+3002 (CJK full stop) and U+3001 (CJK comma) in Korean fonts look like this. For Unicode-encoded web pages, we cannot determine the language from the encoding alone. The font search list is somewhat random. If a Korean font is tried first, these 2 characters will be rendered this way. Web page authors can specify the language through the "lang" attribute. We really need a mechanism for the end user to supply this piece of information when necessary. But until bug 115121 is fixed, we can't do much about it.
Status: UNCONFIRMED → ASSIGNED
Depends on: 115121
Ever confirmed: true
Summary: UTF-8 Japanese incorrectly displayed → Need a way to let user specify language for unicode encoded webpages.
Target Milestone: --- → Future
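For reference, the two characters discussed in the comment above can be inspected with a short Python sketch. It uses only the standard `unicodedata` module; the helper name `describe` is illustrative and not part of any Mozilla code. The point it makes is that the code points and their UTF-8 bytes carry no language information, so the renderer has to get the language from somewhere else.

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, UTF-8 bytes) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8"))


# The two punctuation marks from the test text "しく。 また、".
for ch in ("\u3002", "\u3001"):
    code_point, name, utf8 = describe(ch)
    print(code_point, name, utf8.hex())
```

The same bytes are produced whether the surrounding text is Japanese, Korean, or Chinese, which is why a "lang" hint is needed to choose a font.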
Well, it looks like we are waiting on http://bugzilla.mozilla.org/show_bug.cgi?id=115121

This is my current system setup. It has changed since I reported the bug, but the problem is still the same, so I will list the differences. I wanted to get kinput2 working, so I made some modifications.

$ locale
LANG=ja_JP
LC_CTYPE=ja
LC_NUMERIC=en_GB
LC_TIME=en_GB
LC_COLLATE=en_GB
LC_MONETARY=en_GB
LC_MESSAGES=en_GB
LC_PAPER="ja_JP"
LC_NAME="ja_JP"
LC_ADDRESS="ja_JP"
LC_TELEPHONE="ja_JP"
LC_MEASUREMENT="ja_JP"
LC_IDENTIFICATION="ja_JP"
LC_ALL=

What was different before was LC_CTYPE=en_GB:en, if I remember correctly. Please note this bug is only present in Mozilla (the browser), not Messenger, which must be doing something correctly; perhaps you can compare the UTF-8 handling?

My OS is Linux Mandrake 8.1, English install. The other things I modified when trying to get Japanese to work are described at http://www.mandrakeforum.org/article.php?sid=1420&lang=en and in an English guide for Japanese as a second language, http://www.math.wisc.edu/~stefanss/japanese/index.html. It still does not work after the modifications.

The Unicode fonts tested in Mozilla are wadalab-gothic-jisx0208, 1983-0 and helvetica; it still displays incorrectly, though.
JG
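As a rough illustration of how a language could be derived from the locale settings listed above, here is a minimal Python sketch. It follows the usual POSIX precedence (LC_ALL, then the category variable, then LANG, with empty values treated as unset). Whether Mozilla consults LC_CTYPE rather than another category is an assumption, and both function names are hypothetical.

```python
import os


def effective_ctype_locale(env=None):
    """Return the locale name that governs LC_CTYPE.

    POSIX precedence: LC_ALL, then LC_CTYPE, then LANG. Empty values
    count as unset. Falls back to "C" when nothing is set.
    """
    env = os.environ if env is None else env
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:
            return value
    return "C"


def language_of(locale_name):
    """Strip modifier, codeset, and territory: 'ja_JP.UTF-8' -> 'ja'."""
    for sep in ("@", ".", "_"):
        locale_name = locale_name.split(sep, 1)[0]
    return locale_name


# Values taken from the reporter's `locale` output (assumed LC_CTYPE lookup).
reported = {"LANG": "ja_JP", "LC_CTYPE": "ja", "LC_ALL": ""}
print(language_of(effective_ctype_locale(reported)))
```

With the earlier setting the reporter recalls, LC_CTYPE=en_GB:en, the same lookup would yield an English guess instead of a Japanese one.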
JG, Thank you for the detailed information. We are able to reproduce this problem with recent builds on RedHat JA Linux 7.1. When copying your test text between BEGIN and END to the site http://babelfish.altavista.com, the problem seems to happen only if the encoding is set to Korean in the Mozilla browser. We also tried the UTF-8, Japanese and Chinese encodings, and the text is displayed correctly with those. Could you please check what your browser encoding is? Thank you.
Hello, I just did some more tests. I have never visited a Korean site to my knowledge, and I have not set the encoding to Korean in the View->Character coding.. menu. I tested the following encodings (the Babelfish site was only an example of a site with a box to type into; I did these tests on this Mozilla bug report page!):

UTF-8: Bug present
Shift_JIS: NOT
EUC_JP: NOT
ISO-2022-JP: NOT (is this some Windows format?)

So it appears that it is a UTF-8-only bug, with Mozilla thinking the text is Korean, as someone suggested. But as Unicode has separate punctuation for each language (I think), I don't know why it would display a Korean full stop and comma. I believe it is the same bug I experienced on Windows with some English text editors: I had all text rotated 90 degrees anti-clockwise, so the full stop was at the top. This is due to Japanese going top->bottom (then moving left across the page) as well as left->right. I don't know if my idea is correct.

Today Mozilla locked up when I was writing a Japanese email; the whole thing crashed. Just now, while writing this, it locked up for 10 secs before coming back. I won't report a bug, as I doubt I can replicate it. I don't know who did the UTF-8 code for top->bottom text, but could someone add them to the CC list?
JG
JG, Thanks a lot for the updates! Here is the summary:
1. On JA RedHat Linux 7.1, JA MacOS 9.1 and EN MacOS X: When the encoding is set to Korean and a JA IME is used, the problem will appear, but this combination is unlikely to happen with real users.
2. On EN Linux Mandrake 8.1: When the encoding is set to UTF-8 and a JA IME is used, the problem will appear; it is really bad in this case. It can even be reproduced using the "Additional Comments" box in the bug report.
QA Contact: ruixu → ylong
Hello, Updates are fine :) Thank you for checking this bug, and hopefully fixing it! My browser is set to UTF-8 by default, so it happens on any site with a text box in Mozilla. Note that Messenger is fine; what is it doing differently?
JG
I need to clarify something here.
1) This problem should only exist with Unicode encodings (like UTF-8).
2) Changing the encoding from UTF-8 to something else will cause garbled display, unless the characters are represented as NCRs (like &#8212;).
3) We can't do much before 115121 is fixed. For Unicode encodings, our best guess is the locale language. That is to say, you should see the 2 chars rendered in a Japanese font if you are running in a Japanese locale. (We may still have some minor issues.) If you see a Korean font used under a Japanese locale inside the browser, and you do have a Japanese font, please file a bug.
I will keep this bug for the issue mentioned in the summary.
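Point 2 above can be illustrated with a small Python sketch: raw UTF-8 bytes come out garbled when read under another encoding, while a numeric character reference (NCR) is plain ASCII and survives. The choice of EUC-KR as the wrong encoding is only an example, and the function names are hypothetical.

```python
import html


def misread(text, wrong_encoding="euc_kr"):
    """Encode as UTF-8, then decode under a different encoding."""
    return text.encode("utf-8").decode(wrong_encoding, errors="replace")


def as_ncr(text):
    """Write every character as a decimal numeric character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)


full_stop = "\u3002"  # ideographic full stop

# Raw bytes reinterpreted as EUC-KR no longer give the original character.
print(repr(misread(full_stop)))

# The NCR form is ASCII-only, so the wrong decoder leaves it intact and
# the reference still resolves to the original character.
print(html.unescape(misread(as_ncr(full_stop))) == full_stop)
```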
> this problem should only exist with unicode encoding (like UTF-8).
Per our discussion, there is only one such case: UTF-8 on EN Linux Mandrake 8.1. It is not reproducible with UTF-8 on Linux RedHat 7.1.
I could reproduce it on RH7.2. It depends on font availability and resolving order. Even if you don't see it on a certain platform, the problem still exists. It will happen in a different scenario.
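The dependence on font availability and resolving order can be sketched with a toy model in Python. The font names, their coverage sets, and the `pick_font` function are all hypothetical; this is not how Mozilla's font code is written, only an illustration of first-match fallback with and without a language hint.

```python
from typing import Optional

# Hypothetical installed fonts: (name, language, characters covered).
FONTS = [
    ("korean-font", "ko", {"\u3001", "\u3002"}),
    ("japanese-font", "ja", {"\u3001", "\u3002", "\u3057", "\u304f"}),
]


def pick_font(ch: str, lang: Optional[str] = None, fonts=FONTS) -> Optional[str]:
    """Return the first font that covers `ch`, preferring fonts for `lang`."""
    # Stable sort: fonts matching `lang` move to the front, and the
    # remaining fonts keep their original search order.
    ordered = sorted(fonts, key=lambda f: f[1] != lang) if lang else fonts
    for name, _lang, coverage in ordered:
        if ch in coverage:
            return name
    return None


print(pick_font("\u3002"))        # no hint: whichever font is searched first
print(pick_font("\u3002", "ja"))  # with a language hint
```

In this model, changing the order of `FONTS` changes the result of the unhinted lookup, which matches the observation that the bug shows up on some systems and not others.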
> I could reproduce it on RH7.2. It depends on font availability and
> resolving order. Even if you don't see it on a certain platform, the problem
> still exists. It will happen in a different scenario.
I agree. I was able to reproduce it on my RH 7.1. I launched Mozilla under a Japanese locale, but still saw Korean glyphs take precedence over Japanese glyphs for the two characters in question (with Character coding set to UTF-8).
BTW, these two characters (ideographic full stop and ideographic comma) should NOT have been unified in Unicode/ISO 10646 with the full stop for vertical writing and the comma for vertical writing, which are present ONLY in KS X 1001 (at 1-2 and 1-3, respectively). Apparently, the annotations to the two characters at 1-2 and 1-3 in KS X 1001 were overlooked, and they were falsely identified with the ideographic full stop and ideographic comma in JIS (and perhaps in the corresponding PRC and ROC standards). I'll raise the issue on the Unicode mailing list.
Any progress on resolving this bug? I can see it now in 2002 03 11 15
Related bug (the opposite case) is bug 122779 (closed as of now, but to be reopened).
Confirming this is still present in Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021029, build 2002102908. Could a priority or milestone be set on this one? I think it's rather more important than "Future".
Regards
JG
We can't do much before xml:lang is resolved. After that, there is still a lot of work to do, so I don't see this problem being resolved in the near future. The workaround is to use the "lang" attribute in the HTML web page.
Depends on: 41978
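For page authors applying the "lang" workaround mentioned above, a quick way to check whether a page declares its language is sketched below in Python, using only the standard `html.parser` module. The class and function names are illustrative, and the check only looks at the root html element.

```python
from html.parser import HTMLParser


class LangFinder(HTMLParser):
    """Record the lang attribute of the first <html> start tag, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.seen_html = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "html" and not self.seen_html:
            self.seen_html = True
            self.lang = dict(attrs).get("lang")


def declared_lang(markup):
    """Return the declared language of an HTML document, or None."""
    finder = LangFinder()
    finder.feed(markup)
    return finder.lang


print(declared_lang('<html lang="ja"><body>しく。 また、</body></html>'))
print(declared_lang("<html><body>しく。 また、</body></html>"))
```

A page for which this returns None leaves the browser to guess the language, which is the situation this bug describes.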
Bug 41978 (xml:lang) was fixed, but there is still a lot of work to do to fix this (including adding a langObserver similar to charsetObserver), as Shanjian wrote. IMHO, it should be evangelized that every Unicode web page specifies the lang pseudo-element for HTML and xml:lang for XML. Even with that, there's a bug (bug 204586) to fix (which is easier to fix than this one) with languages/scripts that should benefit most from using Unicode (that is, those scripts/languages for which Unicode is the first and only widely accepted character set).
(In reply to comment #21)
> IMHO, it should be evangelized that every
> Unicode web page specifies lang pseudo-element
> for html
You mean the lang attribute? I was just thinking the same thing myself. But in the meantime...
(In reply to comment #7)
> Web page authors can specify the language through the use of "lang" attribute. We really
> need a mechanism for end user to fill in this piece of information when necessary.
Yes please! This bites me all the time. I need to set LANG (or LC_CTYPE and LC_MESSAGES) to ja_JP in order to enable Japanese text input, but then all Unicode pages lacking language tags are rendered using a Japanese font, which looks terrible for English text (because it's fixed-width). I haven't been able to find a way to make the default language English without disabling Japanese input.
Having separate control over the default language for various purposes (UI input, UI output, page rendering) would be very helpful. Being able to change the language in a single window would be even better, maybe via a menu analogous to the one that allows the charset to be overridden for one window.
By the way, this is with Firefox 0.9.3 on Linux.
Not that I can work on this at the moment (it requires a lot of changes), but I have a slightly higher chance of being able to work on this than shanjian, and it helps me track this one better.
Assignee: shanjian → jshin
Status: ASSIGNED → NEW