Closed Bug 122779 Opened 23 years ago Closed 13 years ago

Use fonts according to Content-Language/lang attribute in Unicode page

Categories

(Core :: Internationalization, enhancement)

enhancement
Not set
normal

RESOLVED DUPLICATE of bug 547267
Future

People

(Reporter: dzy, Assigned: jshin1987)

Details

(Keywords: intl)

Attachments

(1 file, 3 obsolete files)

Mozilla uses only one set of fonts to render the text in Unicode pages. If some
glyphs are not available in the chosen font, Mozilla or the host OS tries to
render them with other fonts.

I think it would be better if we could set the font to use for each language, and
Mozilla chose the font to render with according to the Content-Language header or
lang attribute in Unicode pages.
QA Contact: ruixu → ylong
push to future. 
Status: UNCONFIRMED → NEW
Ever confirmed: true
Target Milestone: --- → Future
Keywords: intl
->shanjian
Assignee: yokoyama → shanjian
We now honor lang attribute. Could you verify?
Status: NEW → ASSIGNED
This has been fixed. 

*** This bug has been marked as a duplicate of 105199 ***
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → DUPLICATE
Verified as dup.  Please reopen if you disagree.
Status: RESOLVED → VERIFIED
'lang' attribute in html doc is honored in font selection
(as Shanjian wrote, it's fixed in bug 105199), 
but 'Content-Language' specification in HTTP header
doesn't seem to be honored by Mozilla. 

Try the following three pages under non-SC locale
(e.g. JA, TC or KO) and compare the results.

  http://jshin.net/moztest/zh-CN.utf8.html 
  http://jshin.net/moztest/zh-CN2.utf8.html
  http://jshin.net/moztest/zh-CN3.utf8.html

I set up htaccess file in my server in such a way
that my web server emits 'Content-Language: zh-CN'
in http header for the first file. 
The second file has meta-tag to specify Content-Language.
In the third one, 'lang' attribute is explicitly used. 

Only the third one is rendered with a SC font
while the first two pages are rendered with
multiple fonts ( I tried them under KO locale
so that KO and SC fonts were mixed to make
the result look like a 'ransom' note.)

I suggest that this bug be reopened with
the following summary line: 'Content-Language
has to be referred to in font selection for a UTF-8
page'.

The C-L HTTP header field is specified in HTTP/1.1 (RFC 2616), section 14.12
(http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html)
Attached patch v1 patch (obsolete) — Splinter Review
With this patch, the first test case works as intended.
It's rendered with a single SC font when Mozilla
is run under Korean locale. The second test still
doesn't work. I thought C-L specified in meta-tag
gets parsed, but I may have been wrong. 
The third test case works well as it should(ie.
rendered with a single SC font)
However, with this patch, the default font size got smaller
for the third case. 

The patch for bug 98929 laid the foundation for
this fix, but the 'lang' set by the patch for 98929
gets 'sort of' masked/shadowed by nsPresContext::UpdateCharset().
My patch reads nsDocument::mLanguage (set by
the patch for 98929 from the Content-Language HTTP header)
and uses it to set nsPresContext::mLanguage and langGroup
for Unicode-encoded pages (UTF-8 and other UTFs).

My patch doesn't change the behavior of Mozilla for
html docs encoded in non-Unicode legacy encodings 
(ISO-8859-x, EUC-JP/KR, GB2312, Big5, KOI8-R/U, CP12xx,etc).
For those documents, langGroup is still derived from
the encoding(charset). 

UpdateCharset() may not be the best place to do this
and I'm open to suggestions for a better place.
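
A minimal sketch of the selection rule described in this comment, written in Python for brevity rather than as the actual C++ patch. The two lookup tables and the function names are hypothetical and only illustrative; falling back to 'x-unicode' when no usable Content-Language is present is an assumption of the sketch, not something the patch is confirmed to do.

```python
# Hypothetical tables: the real charset and language mappings live in
# Mozilla's own data files and are not reproduced here.
CHARSET_TO_LANG_GROUP = {
    "gb2312": "zh-CN",
    "big5": "zh-TW",
    "euc-jp": "ja",
    "euc-kr": "ko",
    "iso-8859-1": "x-western",
}

LANGUAGE_TO_LANG_GROUP = {
    "zh-cn": "zh-CN",
    "zh-tw": "zh-TW",
    "ja": "ja",
    "ko": "ko",
    "en": "x-western",
}


def is_unicode_charset(charset):
    """True for UTF-8, UTF-16 and the other UTF-* encodings."""
    return charset.strip().lower().startswith("utf-")


def pick_lang_group(charset, content_language=None):
    """Choose a langGroup for font selection.

    Unicode pages consult Content-Language; legacy encodings keep
    deriving the langGroup from the charset alone.
    """
    if is_unicode_charset(charset):
        key = (content_language or "").strip().lower()
        # Assumption: with no usable language, stay on the generic group.
        return LANGUAGE_TO_LANG_GROUP.get(key, "x-unicode")
    return CHARSET_TO_LANG_GROUP.get(charset.strip().lower(), "x-western")
```

With these tables, a UTF-8 page served with 'Content-Language: zh-CN' gets 'zh-CN', while a GB2312 page keeps 'zh-CN' from its charset even if the header says 'ja'.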
Attached patch v2 patch w/ some tightening (obsolete) — Splinter Review
I tightened up some loose ends. When intl.accept.languages
was missing in prefs.js (and the C-L header field was absent
from the HTTP header), mContentLanguage was not
set explicitly and had a 'random' value. Now it's set to
a NULL string explicitly.

The reason the third case got rendered with a smaller
size font with my patch than otherwise turned out 
to be that I didn't have 'intl.accept.languages' 
in prefs.js and that resulted in langGroup 
being set to 'x-western', which has a smaller 
default font size than 'zh-CN' in my preference.

As for the second case, C-L in metatag is not
recognized yet. It has to be filed as a separate
bug. 

In summary, this patch does all it can do for the moment.
Reopen the bug for content/language problem as suggested by jshin. 
Status: VERIFIED → REOPENED
Resolution: DUPLICATE → ---
That's a very good job. Your patch certainly makes sense. I have 2 questions.
1, What about the priority between document charset and C-L? For example, a
document encoded in GB2312, but C-L specifies japanese?
2, When C-L is misspelled, should default language be used?

I suggest to try content language in all situation and use it if one is found,
otherwise fallback to existing code. 

jshin, can I reassign the bug to you or I have to act like a proxy? 
Status: REOPENED → ASSIGNED
Shanjian,
Thank you for your comment and glad that you like it.
 
> I have 2 questions.
> 1, What about the priority between document charset and C-L? 
> For example, a
> document encoded in GB2312, but C-L specifies japanese?

  I thought about it and decided to leave those a bit edge
cases alone and to work only on UTF-* cases.
 Currently, mContentLanguage is obtained from
two different sources, C-L header and intl.accept.languages.
If mContentLanguage is from C-L header (or meta-tag: not
yet implemented), I think Mozilla should respect the author's
intent for cases where langGroup deduced from the encoding
is different from that specified in C-L. However, if it's
obtained from intl.accept.languages, I'm afraid we'd better
stick to the one deduced from the encoding(charset). 
For instance, I have 'ko,en-US' in intl.accept.languages.
It's fine to use 'ko' for UTF-8 page without C-L header.
However, it doesn't make sense to use 'ko' for GB2312 encoded
pages without C-L header. 

  One way to work around this issue is add mContentLangSource
(a la mCharacterSetSource) to nsDocument class so that
we can differentiate between mContentLanguage obtained
from C-L header(and meta-tag when implemented.
this has to be done at nsHTMLDocument class, though) and 
intl.accept.languages.

  Do you think it's worth pursuing? 
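
One way the mContentLangSource idea could look, sketched in Python rather than in nsDocument's C++. The enum values, their relative order (in particular HTTP header above meta-tag) and the helper names are assumptions made for illustration; nothing here is taken from an actual patch.

```python
from enum import IntEnum


class LangSource(IntEnum):
    """Where the content language came from, least authoritative first."""
    NONE = 0
    ACCEPT_LANGUAGES = 1  # user pref intl.accept.languages
    META_TAG = 2          # author-declared in the document
    HTTP_HEADER = 3       # author-declared by the server


class Document:
    def __init__(self):
        self.content_language = ""
        self.content_lang_source = LangSource.NONE

    def set_content_language(self, language, source):
        """Record the language unless a more authoritative one is set."""
        if source >= self.content_lang_source:
            self.content_language = language
            self.content_lang_source = source
            return True
        return False


def language_for_fonts(doc, charset_is_ambiguous):
    """Return the language to use, or None to fall back to the charset.

    Author-declared languages always win. A language taken from
    intl.accept.languages is used only when the charset does not map
    to a single language (UTF-8, ISO-8859-1 and the like).
    """
    if doc.content_lang_source >= LangSource.META_TAG:
        return doc.content_language
    if (charset_is_ambiguous
            and doc.content_lang_source == LangSource.ACCEPT_LANGUAGES):
        return doc.content_language
    return None
```

Under this sketch, 'ko' from intl.accept.languages is ignored for a GB2312 page but used for a UTF-8 page without a C-L header.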

> 2, When C-L is misspelled, should default language be used?

  For misspelled C-L, langGroup is set to x-western by
nsLangAtom::LookupLanguage(). It appears harmless for
UTF-* documents although not most desirable. For documents
encoded in non-Unicode encodings, it can do some damages,
but currently my patch doesn't deal with them as I wrote
in my answer to your first question. Do you think 
that it's necessary to modify LookupLanguage() to accept
an _optional_ argument (charset) and set langGroup
to 'x-unicode' instead of ('x-western') if charset is 
one of UTF-*'s and aLanguage argument is unknown/misspelled? 
We can go even further and make LookupLanguage()
to set different default langGroup for different
charset (of course, this should be optional.)

  Alternative is to do some check in the caller, but ...



> I suggest to try content language in all situation and use it 
> if one is found, otherwise fallback to existing code.

  I explained some issues with doing this in the above.
Can you tell me what you think of them? 
   

> jshin, can I reassign the bug to you or I have to act like a proxy? 

  Yes, you can reassign it to me.

BTW, here's a different problem. When I took up this bug,
I tried to solve it by making Mozilla behave 
as if there were 'lang' attribute in <body> or <html>
of which value is obtained from C-L http header
as below:

<html lang="zh-TW">
<head>....</head>
<body>
...

or 

<html> ... 
<body lang="zh-TW">

That is, I wanted to set lang(pseudo-class) in the
very root of style resolution, but I couldn't figure
out how and came up with modifying UpdateCharset(),
instead.  I'd like to hear your opinion on this
approach compared with my present patch. 

Another BTW, in my patch (attachment 95594), there's
a mistake: 'end-1' is used where just 'end' should be used
in calling Substring().
>   I thought about it and decided to leave those a bit edge
> cases alone and to work only on UTF-* cases.
>  Currently, mContentLanguage is obtained from
> two different sources, C-L header and intl.accept.languages.
> If mContentLanguage is from C-L header (or meta-tag: not
> yet implemented), I think Mozilla should respect the author's
> intent for cases where langGroup deduced from the encoding
> is different from that specified in C-L. However, if it's
> obtained from intl.accept.languages, I'm afraid we'd better
> stick to the one deduced from the encoding(charset). 
Agreed. I didn't realize that mContentLanguage can originate from
accept languages. That makes things complicated.

>  One way to work around this issue is add mContentLangSource
> (a la mCharacterSetSource) to nsDocument class so that
> we can differentiate between mContentLanguage obtained
> from C-L header(and meta-tag when implemented.
> this has to be done at nsHTMLDocument class, though) and 
> intl.accept.languages.
Who will use the mContentLanguage besides what you are doing here? Is it
possible to let mContentLanguage be originated from only one source (ie, C-L
header) or take into consideration of charset when deciding mContentLanguage?
Using your example (accept-lang = ko, charset = gb2312), I don't think setting
mContentLanguage to Ko can lead to any reasonable result anywhere.

>  For misspelled C-L, langGroup is set to x-western by
> nsLangAtom::LookupLanguage(). It appears harmless for
> UTF-* documents although not most desirable. For documents
> encoded in non-Unicode encodings, it can do some damages,
> but currently my patch doesn't deal with them as I wrote
> in my answer to your first question. Do you think 
> that it's necessary to modify LookupLanguage() to accept
> an _optional_ argument (charset) and set langGroup
> to 'x-unicode' instead of ('x-western') if charset is 
> one of UTF-*'s and aLanguage argument is unknown/misspelled? 
> We can go even further and make LookupLanguage()
> to set different default langGroup for different
> charset (of course, this should be optional.)
I suggest treat misspelled C-L as no C-L, ie. fall back to charset.

> That is, I wanted to set lang(pseudo-class) in the
> very root of style resolution, but I couldn't figure
> out how and came up with modifying UpdateCharset(),
> instead.  I'd like to hear your opinion on this
> approach compared with my present patch. 

I strongly favor your current approach. lang attribute from tags can still
override the default one. 
give it to jshin.
Assignee: shanjian → jshin
Status: ASSIGNED → NEW
> Who will use the mContentLanguage besides what you are doing here? 

  It's used in content/html/style/src/nsCSSStyleSheet.cpp
to select lang-based selectors(??). The idea of referencing
intl.accept.languages probably arose to handle cases
where the charset/encoding can't be mapped to a single unique language.
That includes UTF-8 (x-unicode) and ISO-8859-1 (x-western).
Try http://jshin.net/moztest/lang.latin1.html (with intl.accept.languages
= "de", "fr", "fr,de" and "de,ko,en-US"). The way it's used
in nsCSSStyleSheet.cpp is different from the way it's used
in nsPresContext.cpp, though. In the former case, if there are
multiple elements in intl.accept.languages and multiple
lang-based selectors in CSS, it seems that the last lang-based
selector in CSS that matches one of the languages specified in
intl.accept.languages takes effect (that is, the order in which languages
are specified in intl.accept.languages does not matter).

However, there should be
very few documents with something like 'q:lang(de)' in CSS but without
explicit use of the 'lang' attribute in html elements (here it's 'q').
Because of this, along with the not-so-intuitive way of choosing the
lang-based selector when multiple langs are present in intl.accept.languages
(as described above), I have some reservations about the usefulness of obtaining
mContentLanguage from intl.accept.languages.   It also has to be
noted, though, that the C-L HTTP header can have multiple languages listed
(however strange it may sound).
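
For reference, HTTP/1.1 defines the Content-Language value as a comma-separated list of language tags (for example 'mi, en'). A small Python sketch of splitting such a value follows; the function names are hypothetical, and picking the first tag for font selection is only one possible policy, assumed here for illustration.

```python
def parse_content_language(header_value):
    """Split a Content-Language value into its language tags."""
    tags = []
    for item in header_value.split(","):
        tag = item.strip()
        if tag:  # skip empty list elements such as "en,,de"
            tags.append(tag)
    return tags


def primary_language(header_value):
    """Assumed policy: use the first listed tag, or None if there is none."""
    tags = parse_content_language(header_value)
    return tags[0] if tags else None
```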

> Is it
> possible to let mContentLanguage be originated from only one source (ie, C-L
> header) 

  Yes, it's possible. It's easy (I just have to take out
a part of the patch for bug 98929: attachment 48982),
but the question is whether or not to do it. I am inclined to
take that out for the reason given above, but I'd like to hear
from Ulrich, who added it in his patch for bug 98929,
before going ahead.

  Another aspect that may make things complicated in the future
is that once bug 121193 is fixed, we  may have yet another way 
to obtain the value for mContentLanguage. With this,
mContentLanguage becomes almost like mCharSet in terms
of the number of sources where its value can come from
: C-L HTTP header, meta-tag, user setting via UI
(like the character coding menu) and the user pref. value
in intl.accept.languages (settable via Pref|Language).
Of course, when bug 121193 is fixed, we probably
have to remove the last one (intl.accept.languages).

>>Do you think 
>> that it's necessary to modify LookupLanguage() to accept
>> an _optional_ argument (charset) and set langGroup
>> to 'x-unicode' instead of ('x-western') if charset is 
>> one of UTF-*'s and aLanguage argument is unknown/misspelled? 
>> We can go even further and make LookupLanguage()
>> to set different default langGroup for different
>> charset (of course, this should be optional.)

> I suggest treat misspelled C-L as no C-L, ie. fall back to charset.

  Now it does. Instead of making nsLookupLanguage()
take an optional third argument (I don't know how
to specify an optional argument with a default
value in XPCOM IDL), I modified it to return
NS_ERROR_LANGATOM_UNKNOWN_LANG (its severity bit
is still 0 so that NS_SUCCEEDED() results in 'true')
instead of NS_OK when mContentLanguage has
an unknown/misspelled language. In UpdateCharset()
in nsPresContext.cpp, the return value is checked
and acted upon accordingly.
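
The idea of a status that still counts as success but carries extra information can be modelled roughly as below. This is a simplified Python model, not XPCOM: the numeric values, the table and the function names are hypothetical, and only the "severity bit clear means success" rule is taken from the description above.

```python
SEVERITY_BIT = 1 << 31   # set on real failures
OK = 0
UNKNOWN_LANG = 0x0001    # hypothetical code; severity bit is clear

# Hypothetical language -> langGroup table.
KNOWN_LANGUAGES = {"zh-cn": "zh-CN", "ja": "ja", "ko": "ko"}


def succeeded(status):
    """Model of a success check: true whenever the severity bit is clear."""
    return (status & SEVERITY_BIT) == 0


def lookup_language(language):
    """Return (status, langGroup); unknown tags still yield 'x-western'."""
    group = KNOWN_LANGUAGES.get(language.strip().lower())
    if group is None:
        return UNKNOWN_LANG, "x-western"
    return OK, group


def lang_group_for(language, charset_lang_group):
    """Treat a misspelled language as absent and fall back to the charset."""
    status, group = lookup_language(language)
    if not succeeded(status) or status == UNKNOWN_LANG:
        return charset_lang_group
    return group
```

Callers that only test for success keep working unchanged, while a caller that cares can compare against the specific code.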

> I strongly favor your current approach. lang attribute 
> from tags can still override the default one.
 
  All right. I agree with you that lang attrib. can
override the default det. from C-L.  

Attached patch a new patch (obsolete) — Splinter Review
Addressing some of Shanjian's concerns.
I haven't yet taken out the code to obtain mC-L from intl.accept.languages.
mC-L is still only checked for UTF-* cases, but it's easy
to check mC-L for all charsets. However, that is contingent on
what we decide to do with intl.accept.languages as a source
of mC-L.
sorry for spamming. I forgot to include nsLanguageAtomService.cpp
Attachment #95528 - Attachment is obsolete: true
Attachment #95594 - Attachment is obsolete: true
Attachment #95718 - Attachment is obsolete: true
> I modified nsLookupLanguage() to return 
> NS_ERROR_LANGATOM_UNKNOWN_LANG (its severity bit
> is still 0 so that NS_SUCCEEDED() results in 'true')
> instead of NS_OK when mContentLanguage has
> an unknown/misspelled language.

  Because of caching in nsLookupLanguage(), the second time
it's handed an invalid/misspelled value in aLanguage,
it sets aResult to 'x-western' and returns NS_OK
instead of NS_ERROR_LANGATOM_UNKNOWN_LANG. To work around
this, we have to record in the cache the
fact that the language was set to 'x-western' because
of an invalid/misspelled aLanguage. How? One way is to add a third field
to nsILanguageAtom, but I have little idea
whether that's allowed/desirable, because it involves
changes in xpcom(?) and the use for that value
is pretty rare(?).
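
A rough Python illustration of the "third field" idea: each cache entry remembers whether the language was actually recognised, so repeated lookups of a misspelled tag report the same thing as the first one. The class and its fields are hypothetical and do not mirror nsILanguageAtom.

```python
class LanguageCache:
    """Caches lookups together with a flag saying if the tag was known."""

    def __init__(self, known_languages):
        self._known = {k.lower(): v for k, v in known_languages.items()}
        self._cache = {}  # tag -> (langGroup, is_known)

    def lookup(self, language):
        key = language.strip().lower()
        entry = self._cache.get(key)
        if entry is None:
            group = self._known.get(key)
            if group is None:
                # Unknown tag: default group, but remember it was a fallback.
                entry = ("x-western", False)
            else:
                entry = (group, True)
            self._cache[key] = entry
        return entry
```

Without the flag, a cached 'x-western' entry for a misspelled tag is indistinguishable from a genuine 'x-western' result on the second call.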

See also bug 163271 which led me to discover this problem. 

BTW, I don't know why I can't accept this bug. Bugzilla doesn't
show an 'accept' button. I've accepted some bugs in the past...
Keywords: mozilla1.3, patch
Status: NEW → ASSIGNED
http://www.faqs.org/rfcs/rfc3282.html

Adding an RFC link for Content-Language. 
QA Contact: amyy → i18n
What is the status of this bug? Was the patch merged or is it now outdated? Is the bug still valid?
Fixed by bug 547267.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago → 13 years ago
Resolution: --- → DUPLICATE