Closed Bug 234485 Opened 21 years ago Closed 11 years ago

xml:lang is not used for font selection or hyphenation

Categories

(Core :: CSS Parsing and Computation, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla24

People

(Reporter: glandium, Assigned: dbaron)

References

Details

(Keywords: intl)

Attachments

(5 files)

User-Agent:       
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040210 Firefox/0.8

In xhtml 1.1, the lang attribute has been deprecated. Therefore, using xml:lang
is the normal way of telling which language is used in an element.
Even if xml:lang is recognized on the styling part (:lang() selector works
properly), the fonts used for rendering are still those for the default lang.

Reproducible: Always
Steps to Reproduce:
1.
2.
3.
Note that if the content-type given is text/html, xml:lang is not "recognized"
(:lang() stops working for xml:lang'ed elements)
Also note that as the lang attribute is deprecated, it should not be there,
it's only there for demonstration purpose.
I'm also attaching an image showing the rendering result.
The fonts strangeness comes from the fact that my Western font contains
hiragana and katakana, but not kanjis, which means that when rendering kanjis,
the font rendering engines fallbacks to an other font, which doesn't render in
the same way, therefore creating crappy display.
In the correctly rendered paragraph, all is rendered using the Japanese font.
No relationship to view rendering (that would be painting).

The lang attr is mapped into the pseudo-style "mLang" thingie at
http://lxr.mozilla.org/seamonkey/source/content/html/content/src/nsGenericHTMLElement.cpp#2915

We could modify this to also map xml:lang, but we should really be doing that
for all elements, not just HTML ones....  And we don't really want to duplicate
the code in nsCSSStyleSheet too much.  Maybe we could push up GetLang to
nsIContent (it's already on nsIDOMHTMLElement, but....) or something?
Assignee: roc → dbaron
Status: UNCONFIRMED → NEW
Component: Layout: View Rendering → Style System (CSS)
Ever confirmed: true
Putting it on nsIContent sounds like a good solution to me (though maybe we
should call it something like GetContentLang to avoid 'hides' warnings). At
least if the stylesystem would be able to use that without a performance hit.
I though we fixed this in bug 35768(bug 91190 comment 13 and
bug 35768 - attachment 95593 [details]), but apparently we didn't as far as the font
selection is concerned.
  
Keywords: intl
OS: Linux → All
Hardware: PC → All
The real problem is that bug 115121 was marked duplicate incorrectly.

sicking, I don't really see how putting it on nsIContent will slow things down
much (I guess it's an extra virtual function call, but compared to the actual
work that needs doing -- getting attrs and all -- that seems minor).
bz, I suspected so, but i thought i'd ask. I look forward for a patch ;-)
Not likely to come from here any time soon, unfortunately. Should be a matter of
implementing GetContentLanguage by copying the CSS code, then making CSS use it
and making attr mapping use it too.

Would need to do attr mapping on nsGenericElement for that, of course...
This bug is confusing at first read. I had to go back to the earlier bugs to
clear things up in my mind. Some extra comments to clarify it a bit from the
font selection pespective:

The font sub-system gets the language from nsFrame.cpp's SetFontFromStyle().
http://lxr.mozilla.org/seamonkey/source/layout/html/base/src/nsFrame.cpp#401

Thus the bug also means that nsStyleVisibility::mLanguage is not reliable. It is
only half of the equation. It does not walk up the tree to find the language.
|SetFontFromStyle| seemed to assume that the language would have been fully
resolved already in nsRuleNode, but that's not the case.

The style system's nsCSSStyleSheet has its own way of getting/processing the
lang. It is quite comprehensive. It also has the advantage that it is lazy (no
extra work unless the :lang() pseudo is encountered).

Apart for adding a nsGenericElement::GetContentLang [along the same lines as
nsGenericElement::GetBaseURI which is already there for xml:base], another way
to fix the bug might be in |SetFontFromStyle|.

But probably the fix is somewhere between the two. That is
1) add nsGenericElement::GetContentLang 
2) do the genuflexions for the attr mapping of xml:lang and the HTML's lang in
nsGenericElement, to honor the precedence.
3) make |SetFontFromStyle| do the right thing.
There's more involved than just SetFontFromStyle -- see
http://lxr.mozilla.org/seamonkey/source/layout/html/base/src/nsHTMLReflowState.cpp#2153
if nothing else.

I'd say the first thing we need here is a clear description of what exactly this
code is trying to achieve... First off, if we add this GetLanguageGroup method,
why store this stuff in the style system?  SetFontFromStyle and the reflow state
could just call the method on the frame's mContent directly, no?
I think the original reason why |SetFontFromStyle| was made is because people
kept ommitting to set the language when passing a font to GFX.

>First off, if we add this GetLanguageGroup method, why store this stuff in the
>style system?

It doesn't have to be saved in the style system, if that is deemed unnecessary.

>SetFontFromStyle and the reflow state could just call the method on the frame's
>mContent directly, no?

Sure. That's what I meant by 'do the right thing' -- whatever is appropriate
after getting the basics.
Probably, it also has to be taken into account that 'Content-Language' (http
header and meta declaration) can set the 'overall' language(s) of a document.
Besides, there's been a talk of adding 'View | Set Language' (for the manual
overriding of the language of a document currently in view) although we may not
do that in the near future. 
Yep. The style system's nsCSSStyleSheet is quite comprehensive and includes the
case of the 'Content-Language' HTTP header.
(In reply to comment #9)
> Thus the bug also means that nsStyleVisibility::mLanguage is not reliable. It
> is only half of the equation. It does not walk up the tree to find the
> language.

Actually, it does.  The values in the visibility struct are all inherited by
default, so if a node doesn't have an explicit language set it automatically
inherits the parent's.  The root node's language is inited from
nsIPresContext::GetLanguage() if it's not set explicitly.  So we should just
keep this in the style system -- it's simpler that way...  The method on
nsIContent should just return the language on that node, if any, not walk up
parents.  The tree-walking code should stay in the style resolution code.

(In reply to comment #12)
> Probably, it also has to be taken into account that 'Content-Language' (http
> header and meta declaration) can set the 'overall' language(s) of a document.

Which just means that we should init the prescontext's mLanguage based on the
document's GetContentLanguage() returns, I would think... (except for the mess
with getting the latter out of prefs).  Right now, the context's mLanguage is
set based on the document's character set (see nsPresContext::UpdateCharSet).
>The method on nsIContent should just return the language on that node, if any,
>not walk up parents.  The tree-walking code should stay in the style resolution
>code.

I wonder, then, if it is necessary to add such a dummy GetLang to nsIContent.
I had a look and nsDocument::RetrieveRelevantHeaders() [which feeds the
document's GetContentLanguage()] looks up the prefs too.
Oh, I assumed that nsIContent::GetContentLang would walk up the parentchain. If
that's not the case then we should make that clear in the function-name, or add
an |PRBool aCheckParentChain| argument.
Oh, except attribute mapping is entirely the wrong mechanism.  The language
should come from somewhere in content rather than the style data so that you
*can* map xml:lang easily.
*** Bug 41978 has been marked as a duplicate of this bug. ***
Yeah, I guess we could cache the language on the content node instead of caching
it in the style.. the one nice thing with caching in style is that it's cheaper
because of style context sharing, maybe (maybe).
I kinda also like the saving being achieved via the style system. Why is/was
that important to have nsIContent::GetContentLang()? Is is going to supersede
the one in nsIDOMHTMLElement (or, has is the ambiguity going to be resolved)?
Wouldn't it be possible to simply map the xml:lang in nsGenericElement?
s/has is/how is/
> Why is/was that important to have nsIContent::GetContentLang()?

To avoid the duplication of the "what's the lang set on this content node" logic
(the fact that you have to check xml:lang and then if and only if the node is
HTML check the lang attribute in the null namespace).  That way the next time
the (X)HTML WG decides to make some half-assed decisions we will only have to
change one place in the code...

It's not _that_ important.  Just nice-to-have.
What about the ambiguity with nsIDOMHTMLElement? 

Someone does nsIDOMHTMLElement.setlang(), but gets something different with
nsIDOMHTMLElement.getlang() because there is an old xml:lang that takes precedence?

or, the deal is that there will be two different results with GetLang() and
GetContentLang(), albeit funny? 
> Someone does nsIDOMHTMLElement.setlang(), but gets something different with
> nsIDOMHTMLElement.getlang() because there is an old xml:lang that takes
> precedence?

Didn't I mention "half-baked" and "(X)HTML WG" in the same sentence already?
I looked at what it will take to fix this, and here are the results of my
investigation:

1) It is not possible presently to leverage on the mapping code to map xml:lang
to style. This is because the |MapAttributesInto| business is sooo
HTML-specific... How did I forget that?!? It turned out that this is precisely
the same issue that has blocked bug 69409 for MathML attributes.

2) While it is possible to fetch the lang attribute everytime, it is much nicer
to just map it into the style data due to these primary reasons:
  a) the lang string is actually converted to a langGroupAtom that groups many
languages/scripts. Given a top-element with lang, e.g., <html lang>, there is
only one such atom in one visibility struct of the style system. The atom is
fetched rapidly whenever/wherever it is neeeded. Plus, the
HasAttributeDependentStyle stuff is kept in sync. (This BTW has to be expanded
to include the namespace as righly pointed out in an XXX comment there). 
  b) Take the lang out of the style system, and for every element in the doc,
one will have to walk back to <html>, convert it to an atom, deal with errors in
children (e.g., an unknown lang) rather than just inheriting the most recent
known lang as can be neatly done with the style system at style resolution
(albeit this is not done yet).
  c) Trying to map xml:lang to the style data provides a nice examplar for you
guys to really appreciate why bug 69409 has been blocked all this time :-)

3) Probably this is the most startling observation: despite the recent rework
that Jonas did (which was mostly for space-efficiency) the attribute mapping
code remains over-engineered... What is going on with this |nsMappedAttributes|
class which is cloned, etc? Actually, the mapping functions are _constant_ for
each tag and I have this feeling that a table-driven approach could do the job
pretty well, while providing a way to finally cater for MathML attributes. This
is not the best place to discuss it, but since things are so fresh in mind, I
may as well put them down for the record. Consider the first-level table:

/*static*/
{
  {noneNS_ID, &MapHTMLAttributesInto}.
  {xmlNS_ID, &MapXMLAttributesInto},
#ifdef MOZ_MATHML
  {mathNS_ID, &MapMathMLAttributesInto},
#endif
#ifdef MOZ_SVG
  {svgNS_ID, &MapSVGAttributesInto}
#endif
}

and then the second-level tables

static void 
MapHTMLAttributesInto(...)
{
  /*static*/
  {nsHTMLAtoms:div, &MapDivAttributesInto}
  {nsHTMLAtoms:img, &MapImgAttributesInto}
  etc..
}

static void 
MapXMLAttributesInto(...)
{
  /*static*/
  {nsXMLAtoms:lang, &MapLangAttributesInto}
}
etc

Then, propagate the change so that the Style System calls the mapping function
based on the namespace directly, reflecting into the attribute style sheet of
the current document, rather than going to the element in a rather complex and
convoluted manner at present (to get the same _function_ for each tag, clone, etc).

Doing so will suppress the middle-man and the over-engineering that is
happening. It is somewhat reminiscent of what alecf did to convert
HasAttributeDependentStyle into a table-driven code.
> This is because the |MapAttributesInto| business is sooo HTML-specific...

With sicking's changes, this should be pretty simple to fix.

> What is going on with this |nsMappedAttributes| class which is cloned, etc?

If nothing else, it's an nsIStyleRule implementation and the thing the nodes in
the ruletree point to.

> Actually, the mapping functions are _constant_

I'm not sure what this means, exactly...

> based on the namespace directly

Wouldn't that break for XML nodes in the null namespace vs non-XML HTML nodes?

More to the point, what will be the nsIStyleRule objects involved?  The current
cloning, etc is because we actually share nsMappedAttributes objects.  So if you
have 10 <font size="3"> tags we only have one nsMappedAttributes object for
them.  The speed and footprint savings in the ruletree from this sharing are
pretty noticeable.....
> > This is because the |MapAttributesInto| business is sooo HTML-specific...
>
> With sicking's changes, this should be pretty simple to fix.

I hope it is in the plan. I have been waiting for that to happen.

> > Actually, the mapping functions are _constant_
>
> I'm not sure what this means, exactly...

C.f. |GetAttributeMappingFunction|, it returns a function based on the tag. It
is currently designed so that different instances of the object could return
different things. But in practice, different instances return the same thing and
this suggests that a static tag-based table-driven approach is appropriate.

> >based on the namespace directly
>
>Wouldn't that break for XML nodes in the null namespace vs non-XML HTML nodes?

What do you mean? The behavior won't be any different from what is there at the
moment. Attributes are either mappable or they aren't, based on the
(attrNamespaceID, attrName).

Take <p xml:lang>, it will boil down to MapHTMLAttributesInto, which should
include a call to MapXmlLangAttr(p_content, xmlNS_ID, nsHTMLAtoms::lang,
to_ruleData) if the doc is XHTML.

> have 10 <font size="3"> tags

The management is pretty involved at the moment. With the table-driven approach,
there would just be the hard-coded tables as illustrated earlier. The question
is whether that won't work and why.
The style system requires that the style data come from implementations of
nsIStyleRule that meet the requirements in 

http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/content/base/public/nsIStyleRule.h&rev=3.22&mark=54-82#54
So back to the original issue: How do we want to map xml:lang into style? Do we
want to use the attribute-mapping code and do the same thing as we do for
"lang", or do we want to add GetContentLang to nsIContent?

One problem with using attribute-mapping code is that parts of it currently
can't deal with namespaced mapped attributes. Mostly the functions on
nsIStyledContent only takes an localname-atom argument rather then
localname-atom and namespaceid.

We might need this for other attributes too, but so far neither svg, xul or html
needs this. Don't know about mathml, but i don't think it has any namespaced
attributes at all.
> Mostly the functions on nsIStyledContent only takes an localname-atom 

That (and the HasAttributeDependentStyle code) have needed fixing for a while. 
We should just do it, imo.

Doing attribute-mapping from nsGenericElement here seems like the way to go.
HasAttributeDependentStyle is just an optimization.  It wouldn't surprise me if
the cost of passing an extra 32-bit parameter to that function and checking it
were higher than the benefits gained.
(Also, if this code is to be rewritten, the rules on presentational attributes
in CSS 2.1 should be considered.)
Fair enough.  Not much call to change HasAttributeDependentStyle, I guess.  ;)
In fact, xml:lang does not seem to be recognised as an alternative to lang at
all. Even with xml:lang set to something, firefox does not recognise the text to
be in that language.
Sorry about that stupid comment, I'm new to this interface. Won't happen again.
Please just ignore it.
Blocks: 293511
Assignee: dbaron → nobody
QA Contact: ian → style-system
Hi all!

I've been bitten by this bug on this page:

http://www.shlomifish.org/humour/TheEnemy/The-Enemy-rev4.html

This is after fixing it according to the input of the good people on #mozilla.il. Given enough will, I'd also like to work on a patch for that.

Regards,

      Shlomi Fish

See: http://www.shlomifish.org/meta/FAQ/#xhtml_11 for why I'm using XHTML 1.1 with a content-type of "text/html".
xml:lang is ignored in text/html content, as it should be.  That is, you're not seeing this bug.
Summary: xml:lang is not used for font selection → xml:lang is not used for font selection or hyphenation
Any progress with this issue? I'm writing a web app that outputs application/xhtml+xml for Firefox and I would like to use automatic hyphenation for the content. The goal is to do public release within a year. Should I put a workaround in the code for Firefox (for example, repeat the value of xml:lang in the attribute "lang") or will this get fixed soon enough?
I would recommend not using application/xhtml+xml, period.  A number of new HTML5 features work much worse with that MIME type than with text/html.  And yes, you should probably just put both xml:lang= and lang= in your markup if you do want to keep using XHTML.  I certainly don't see us rewriting our attribute code around this edge case (which is what would have to happen) anytime soon.
If so, would it make sense to mark this "WONTFIX"?
This actually doesn't even need attribute mapping.  We can make a really simple rule class that stores a string for the language attribute, and have a hashtable of them indexed by the string (so that we only have one per string, so the rule tree works well).  (We just need it not to apply when content->IsInHTMLDocument().)
(In reply to David Baron [:dbaron] (don't cc:, use needinfo? instead) from comment #44)
> (We just need it not to apply when
> content->IsInHTMLDocument().)

Actually that's not the case:


[2013-05-29 19:08:30] <annevk> that's actually also a bug for HTML
[2013-05-29 19:08:44] <dbaron> what is?
[2013-05-29 19:08:45] <annevk> HTML requires xml:lang to work (when properly namespaced in the DOM)
[2013-05-29 19:08:58] <dbaron> even in HTML documents?
[2013-05-29 19:09:00] <annevk> (only possible through scripting)
[2013-05-29 19:09:04] <dbaron> ah, ok
[2013-05-29 19:09:09] <dbaron> then we shouldn't condition it
[2013-05-29 19:09:20] <dbaron> (I'm going to just paste this IRC into the bug if that's ok with you)
[2013-05-29 19:09:30] <annevk> yes
annevk also pointed out the relevant part of the spec:
http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#language

"To determine the language of a node, user agents must look at the nearest ancestor element (including the element itself if the node is an element) that has a lang attribute in the XML namespace set or is an HTML element and has a lang in no namespace attribute set. That attribute specifies the language of the node (regardless of its value).

If both the lang attribute in no namespace and the lang attribute in the XML namespace are set on an element, user agents must use the lang attribute in the XML namespace, and the lang attribute in no namespace must be ignored for the purposes of determining the element's language."

IIRC, it’s the HTML parser that doesn’t recognize the namespace prefix syntax and doesn’t put an xml:lang attribute in the right namespace, but it’s still possible to set namespaced attributes through the DOM.
Assignee: nobody → dbaron
Status: NEW → ASSIGNED
The code in nsHTMLStyleSheet implements LangRule to map xml:lang into
style and the code to manage its uniqueness.

The change to nsGenericHTMLElement fixes the mapping of the HTML lang
attribute to do cascading the way all other rule mapping does so that
the cascading works correctly.

The tests test that the correct style language is used for hyphenation
by copying over a set of hyphenation reftests that check its basic
response to languages.  There are no specific tests for font selection,
but font selection is known to use the same language data from style.

I verified manually (see other attachments to bug) that the rule
uniqueness is being managed correctly.
Attachment #755476 - Flags: review?(bzbarsky)
With this patch, loading attachment 755478 [details] gave the output:
Making new lang rule 0x2944840 for language en
Making new lang rule 0x28ba810 for language fr
Making new lang rule 0x3ea2dc0 for language en-US
which is the expected result (only 3 lines, no duplicate languages).
Though, actually, I think I should ditch all the reftests except for -1, -4, -11{a,b}. and -12{a,b}.

Boris, does that make sense to you?
Comment on attachment 755476 [details] [diff] [review]
Map xml:lang attribute into style so that it's used for font selection and hyphenation.

r=me, though it seems like this table can grow without bound for a given document as long as people keep adding new xml:lang values and then removing them...

Also, why do you want to ditch the reftests you want to ditch?
Attachment #755476 - Flags: review?(bzbarsky) → review+
Because some of them were really testing other hyphenation things rather than testing that xml:lang works, which I think only really needs the tests I listed.  It feels kinda like a waste of machine time to run more than that.  Though I suppose some of them test other interesting characteristics of xml:lang, so it's vaguely useful to have the extras.

As far as the memory use thing... I'm at least hoping that people will use xml:lang for its intended use, and the number of languages used in a document will be reasonably limited.  If people really want to drive up memory usage, they can always create lots of content nodes and *not* get rid of them.
Eh, I guess I'll keep the tests.  (I'm going to add "-xmllang" to the filenames, though, to make the difference a little clearer rather than just .html vs. .xhtml.)
https://hg.mozilla.org/mozilla-central/rev/0a1944211b61
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla24
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: