Closed Bug 524107 Opened 10 years ago Closed 10 years ago

pass value of "lang" attribute from DOM to gfx text code

Categories

(Core :: Layout: Text and Fonts, defect)

defect
Not set

Tracking

()

RESOLVED FIXED

People

(Reporter: jfkthame, Assigned: jfkthame)

References

Details

Attachments

(2 files, 11 obsolete files)

104.24 KB, patch
Details | Diff | Splinter Review
86.61 KB, patch
Details | Diff | Splinter Review
Currently, the gfx font/text code has access to the langGroup as part of the style information. However, to properly support language-sensitive text rendering, we actually need the lang attribute itself, as langGroup loses many important distinctions and is not an appropriate abstraction for this purpose.

This is needed in order to fix bug 24139 in text back-ends (e.g., harfbuzz; also Uniscribe, Pango) that allow a specific OpenType language system to be used.
OS: Mac OS X → All
Hardware: x86 → All
Right now, the point where we turn it from as-it's-specified to a language group is nsRuleNode::ComputeVisibilityData; nsStyleVisibility now holds only mLangGroup.
This patch adds a 'language' field to gfxFontStyle, handled directly parallel to 'langGroup'.

By itself, this patch has no effect on behavior, but the coming Harfbuzz back-end will use the language field to select language-specific rendering options in OpenType fonts.

Flagging for sr? in addition to r? as this cuts across modules and interfaces.

(Eventually, I'd like to see langGroup go away; it seems like a leftover from the world of multiple 8-bit encodings and fonts, and isn't really the right model now. But that's a separate issue, not addressed here.)
Assignee: nobody → jfkthame
Attachment #424048 - Flags: superreview?(roc)
Attachment #424048 - Flags: review?(dbaron)
+    sc1->GetStyleVisibility()->mLanguage == sc2->GetStyleVisibility()->mLanguage &&
     sc1->GetStyleVisibility()->mLangGroup == sc2->GetStyleVisibility()->mLangGroup &&

If the mLanguages are the same, aren't we guaranteed that the mLangGroup is also the same? Maybe we could assert that here and skip the mLangGroup check?

Looks like nsSVGGlyphFrame::EnsureTextRun should be getting language info from nsStyleVisibility, not off the prescontext. Fix it here or file a bug for that.

A quick dig through our language-setting code suggests that we're not using the HTTP Content-Language header to determine the language in nsStyleVisibility. Basically we honor the "lang" attribute on elements, and if there are none of those in scope, we use a language guessed from the charset in nsPresContext::UpdateCharSet. On the other hand, langMatches in nsCSSRuleProcessor does something completely different that uses doc->GetContentLanguage() (which uses the header value). We need a bug on fixing this up ... probably nsStyleVisibility::nsStyleVisibility should set the initial language and langGroup based on doc->GetContentLanguage() if that's non-empty, otherwise guess the language and langgroup based on the charset.
Comment on attachment 424048 [details] [diff] [review]
patch v1: add 'language' field to the gfxFontStyle

Looks good to me. Switching dbaron to sr to reduce his workload.
Attachment #424048 - Flags: superreview?(roc)
Attachment #424048 - Flags: superreview?(dbaron)
Attachment #424048 - Flags: review?(dbaron)
Attachment #424048 - Flags: review+
Actually, I've changed my mind. Instead of passing around the langGroup, we should just pass around the language and when we need the langGroup, we can obtain it from the language. That could be made super-fast, since we could just hash on the nsIAtom* of the language. We could even cache the last language -> langGroup mapping if necessary. This would keep things simpler and smaller.
It does mean that we'll have to monkey around with nsLanguageAtomService::LookupCharSet to return a language instead of a langGroup, which is a bit annoying, but we need to do that anyway to get a language to use with gfxFontStyle.
Attachment #424048 - Flags: superreview?(dbaron)
(In reply to comment #6)
> It does mean that we'll have to monkey around with
> nsLanguageAtomService::LookupCharSet to return a language instead of a
> langGroup, which is a bit annoying, but we need to do that anyway to get a
> language to use with gfxFontStyle.

Hmm. The idea of getting a language from a charset is a bit weird, as there's generally no way to know the actual language. (This just serves to illustrate how our old langGroup is really linked more to legacy encodings than to actual languages.)

I suppose we could choose a "representative" language for each charset, so (for example) iso-8859-1 maps to English, iso-8859-2 maps to something like Czech, etc. But this seems very hackish, and would be bad if/when we use language for anything else besides choosing font preferences - e.g., hyphenation. I think it would be better to have "pseudo-language" codes (which in many cases could be the existing langGroup codes such as x-western, x-cyrillic, etc) that can map to fonts but will not be mistaken for actual language codes.
(In reply to comment #7)
> (In reply to comment #6)
> > It does mean that we'll have to monkey around with
> > nsLanguageAtomService::LookupCharSet to return a language instead of a
> > langGroup, which is a bit annoying, but we need to do that anyway to get a
> > language to use with gfxFontStyle.
> 
> Hmm. The idea of getting a language from a charset is a bit weird, as there's
> generally no way to know the actual language. (This just serves to illustrate
> how our old langGroup is really linked more to legacy encodings than to actual
> languages.)

Right, it's just a way to make an informed guess. I think we should use that guess for shaping and font selection, but not for other purposes such as CSS language selector matching.

> I suppose we could choose a "representative" language for each charset, so (for
> example) iso-8859-1 maps to English, iso-8859-2 maps to something like Czech,
> etc. But this seems very hackish, and would be bad if/when we use language for
> anything else besides choosing font preferences - e.g., hyphenation. I think it
> would be better to have "pseudo-language" codes (which in many cases could be
> the existing langGroup codes such as x-western, x-cyrillic, etc) that can map
> to fonts but will not be mistaken for actual language codes.

I like that idea.
Tackling this in two stages:

(1) Replace langGroup members with language throughout, and remove the mapping from language to langGroup in nsStyleVisibility. The "language" fields will contain whatever language was specified in markup, or inferred from the charset or other sources (in which case they're really langGroup codes, usually). We'll map language->langGroup only at the point of looking for font preferences. We have "dummy" mappings from langGroup -> langGroup, so in the case where "language" was actually an inferred langGroup already, this will still work ok.

(2) Change gfxFontStyle's language field from an nsCString to an nsIAtom*. We already have language as atom within layout, so it's pointless to convert back to strings within gfx.

As a followup bug, I think we should also:

(3) Replace our custom langGroup identifiers with BCP47-based coding (will affect the font properties in prefs). This will enable us to consider moving from legacy charset-oriented organization of font preferences to something based on script and region, and will let us make a cleaner distinction between "real" language codes (such as those from HTML lang attributes) and "language group" or script codes such as we infer from charsets.

(Unless someone argues that this is a terrible idea, I expect I'll file a separate bug to address it.)
This patch implements part 1, maintaining the original language code in the gfxFontStyle and mapping to langGroup only when needed to look up font preferences.
Attachment #424048 - Attachment is obsolete: true
Attachment #426858 - Flags: review?(roc)
Part 2, store the language as an atom in the gfxFontStyle rather than using nsCString. This should optimize comparison/lookup operations.
Attachment #426860 - Flags: review?(roc)
-        if (!mLocaleLangGroup) {
-            mLocaleLangGroup = do_GetAtom("x-western");
+        if (!mLocaleLanguage) {
+            mLocaleLanguage = do_GetAtom("en");

Why is "en" better than just continuing to use x-western as a pseudo-language here?

+    if (langGroup.IsEmpty())
+        langGroup.Assign("x-unicode"); // XXX or should use "x-user-def"?

{}

+        nsIAtom* lang = NS_NewAtom(mStyle.language);

nsCOMPtr (although I know it's temporary)

+    if (NS_SUCCEEDED(rv))
+      group->GetUTF8String(&langGroup);

{}

   nsInterfaceHashtable<nsStringHashKey, nsIAtom> mLangs;
+  nsInterfaceHashtable<nsISupportsHashKey, nsIAtom> mLangToGroup;

Why do we need both of these? Don't they contain roughly the same data, a mapping of language to langgroup? Can we replace mLangs with mLangToGroup?

+  nsCOMPtr<nsIAtom> mPrevLanguage;
+  nsCOMPtr<nsIAtom> mPrevLangGroup;

Do we need this optimization? A pointer-to-pointer hash lookup should be very fast, I would have thought.

 static PRBool IsChineseJapaneseLangGroup(nsIFrame* aFrame)

Rename this to IsChineseOrJapanese?

I'd like to have two bugs filed:
1) nsSVGGlyphFrame should get the language from the style of its frame, not the prescontext
2) the language in nsStyleVisibility should be initialized with the language obtained from nsIDocument::GetContentLanguage (which is the language specified by Content-Language). In general we need to ensure we follow this part of the HTML5 spec:
http://www.whatwg.org/specs/web-apps/current-work/#language
3) I think probably we shouldn't be using the locale language for anything. The places where we use nsIRenderingContext APIs with no passed-in language are probably errors, and we should probably get rid of them.
+    nsIAtom *langGroupAtom = NS_NewAtom(aLangGroup);
+
+    nsresult rv = gfxPlatform::GetPlatform()->GetFontList(langGroupAtom, generic, fontList);
+
+    langGroupAtom->Release();

Why not nsCOMPtr<nsIAtom> and do_GetAtom?

+    nsCAutoString key;
+    aLangGroup->ToUTF8String(key);

Make it nsCAutoString?

+        language = NS_NewPermanentAtom("x-unicode");
+        nsIAtom *language = NS_NewPermanentAtom("en"); // TODO: get the correct language?
+        AppendGenericFontFromPref(aFonts, NS_NewPermanentAtom("x-unicode"), nsnull);

Here, and probably all the other places you're using NS_NewPermanentAtom, we should have an atom table like nsGkAtoms or nsWidgetAtoms.

+    nsresult rv = GetFontListInternal(fonts, langGroupStr);

We don't need to do it all in one patch, but perhaps we should convert more language parameters from strings to atoms?
(In reply to comment #12)
> -        if (!mLocaleLangGroup) {
> -            mLocaleLangGroup = do_GetAtom("x-western");
> +        if (!mLocaleLanguage) {
> +            mLocaleLanguage = do_GetAtom("en");
> 
> Why is "en" better than just continuing to use x-western as a pseudo-language
> here?

It's not; I've reverted that. (I think it came from when I was aiming to always have a real language code in 'language', but we can't do that.)

>    nsInterfaceHashtable<nsStringHashKey, nsIAtom> mLangs;
> +  nsInterfaceHashtable<nsISupportsHashKey, nsIAtom> mLangToGroup;
> 
> Why do we need both of these? Don't they contain roughly the same data, a
> mapping of language to langgroup? Can we replace mLangs with mLangToGroup?

Yes, it's redundant. We still had a couple of places in font code that use LookupLanguage, which is string-based. But I've reimplemented that based on the atom hash, so the duplication is gone now.

> +  nsCOMPtr<nsIAtom> mPrevLanguage;
> +  nsCOMPtr<nsIAtom> mPrevLangGroup;
> 
> Do we need this optimization? A pointer-to-pointer hash lookup should be very
> fast, I would have thought.

Well, a single comparison that is true 90-95% of the time (based on browsing with an instrumented build) would still be faster than even the best hash lookup.

But in practice GetLanguageGroup is not called so much that it really matters, so I've dropped this.

>  static PRBool IsChineseJapaneseLangGroup(nsIFrame* aFrame)
> 
> Rename this to IsChineseOrJapanese?

Yes, that's better. I've also revised the implementation so that it will work with variations on the language codes (e.g., accept an undifferentiated "zh" code in addition to the country-specific ones).
(In reply to comment #13)

> +    nsCAutoString key;
> +    aLangGroup->ToUTF8String(key);
> 
> Make it nsCAutoString?

Huh? It is....

> Here, and probably all the other places you're using NS_NewPermanentAtom, we
> should have an atom table like nsGkAtoms or nsWidgetAtoms.

I've created gfxLangAtoms. Yes, much nicer.

> +    nsresult rv = GetFontListInternal(fonts, langGroupStr);
> 
> We don't need to do it all in one patch, but perhaps we should convert more
> language parameters from strings to atoms?

We may want to do some more, but there are certain points where we still need to extract the string values (e.g., to construct preference keys, or to pass to APIs like fontconfig or pango). So we have to decide in each case where to make the change. I think that's just something to keep in mind as we continue to re-work font/language/preference code.
Attachment #426858 - Attachment is obsolete: true
Attachment #427343 - Flags: review?(roc)
Attachment #426858 - Flags: review?(roc)
Attachment #426860 - Attachment is obsolete: true
Attachment #427344 - Flags: review?(roc)
Attachment #426860 - Flags: review?(roc)
Attachment #427343 - Flags: review?(roc)
Attachment #427344 - Flags: review?(roc)
Sorry for the bugspam - accidentally posted a broken version of the patches as v.2. These should be correct now, I hope!
Attachment #427344 - Attachment is obsolete: true
Attachment #427406 - Flags: review?(roc)
(In reply to comment #15)
> (In reply to comment #13)
> > +    nsCAutoString key;
> > +    aLangGroup->ToUTF8String(key);
> > 
> > Make it nsCAutoString?
> 
> Huh? It is....

Oops!

> > Here, and probably all the other places you're using NS_NewPermanentAtom, we
> > should have an atom table like nsGkAtoms or nsWidgetAtoms.
> 
> I've created gfxLangAtoms. Yes, much nicer.

Can we call it gfxAtoms? That's simpler and we can more easily use it for other things later.

+static nsIAtom *gUnicodeRangeToLangGroupAtomTable[] =
+  gfxLangAtoms::x_cyrillic,
+  gfxLangAtoms::el,
+  gfxLangAtoms::tr,
+  gfxLangAtoms::he,

er, does this actually work? These atom values are not initialized when this array is set up, I would have thought. One way to work around this is to make it an nsIAtom** pointing to the addresses where the static atoms will be filled in.
One other comment about the atom comments above:  best to consider NS_NewAtom to be deprecated:  use do_GetAtom (and nsCOMPtr) instead.
Attachment #427406 - Attachment is obsolete: true
Attachment #427447 - Flags: review?(roc)
Attachment #427406 - Flags: review?(roc)
(In reply to comment #20)

> Can we call it gfxAtoms?

Sure.

> +static nsIAtom *gUnicodeRangeToLangGroupAtomTable[] =
> +  gfxLangAtoms::x_cyrillic,
> +  gfxLangAtoms::el,
> +  gfxLangAtoms::tr,
> +  gfxLangAtoms::he,
> 
> er, does this actually work? These atom values are not initialized when this
> array is set up, I would have thought.

Oops. That is indeed a problem. Fixed, as suggested.

Also removed the remaining use of NS_NewAtom from the patch. (There are still some existing uses in the context; I haven't touched those for now.)
One more update to Part 1 - there was still a stray usage of NS_NewPermanentAtom, now replaced by the proper do_GetAtom.
Attachment #427447 - Flags: superreview?(matspal)
Attachment #427461 - Flags: superreview?(matspal)
(In reply to comment #12)
> I'd like to have two bugs filed:
> 1) nsSVGGlyphFrame should get the language from the style of its frame, not the
> prescontext

This is now Bug 546813.

> 2) the language in nsStyleVisibility should be initialized with the language
> obtained from nsIDocument::GetContentLanguage (which is the language specified
> by Content-Language). In general we need to ensure we follow this part of the
> HTML5 spec:
> http://www.whatwg.org/specs/web-apps/current-work/#language

And this is Bug 547267.
Attachment #427403 - Attachment is obsolete: true
Attachment #427461 - Attachment is obsolete: true
Attachment #428585 - Flags: superreview?(matspal)
Attachment #427461 - Flags: superreview?(matspal)
Attachment #427447 - Attachment is obsolete: true
Attachment #428586 - Flags: superreview?(matspal)
Attachment #427447 - Flags: superreview?(matspal)
Attachment #428586 - Attachment description: part 2, v2.2: store language as an nsIAtom* in gfxFontStyle - updated for current trunk, fixed Linux issue found on tryserver → part 2, v2.3: store language as an nsIAtom* in gfxFontStyle - updated for current trunk, fixed Linux issue found on tryserver
Comment on attachment 428585 [details] [diff] [review]
part 1, v2.3: replace langGroup with language in gfxFontStyle - merged with current trunk

sr=mats with the following nits

In gfx/src/thebes/nsThebesDeviceContext.cpp
>      // XXX figure out why aLangGroup is NULL sometimes
> -    if (!aLangGroup) {
> -        aLangGroup = mLocaleLangGroup;
> +    //      -> see nsPageFrame.cpp:511
> +    //if (!aLangGroup) {
> +    //    aLangGroup = mLocaleLangGroup;
> +    //}
> +
> +    // XXX ditto for aLanguage
> +    if (!aLanguage) {
> +        aLanguage = mLocaleLanguage;

I'm confused by the reference to non-existent 'aLangGroup'.
Couldn't you write this as:

> +    // XXX figure out why aLanguage is NULL sometimes
> +    //      -> see nsPageFrame.cpp:511
> +    if (!aLanguage) {
> +        aLanguage = mLocaleLanguage;


In gfx/thebes/src/gfxFont.cpp:
In part 1:
>  gfxFontStyle::gfxFontStyle() :
>      style(FONT_STYLE_NORMAL), systemFont(PR_TRUE), printerFont(PR_FALSE), 
>      familyNameQuirks(PR_FALSE), weight(FONT_WEIGHT_NORMAL),
>      stretch(NS_FONT_STRETCH_NORMAL), size(DEFAULT_PIXEL_FONT_SIZE),
> -    langGroup(NS_LITERAL_CSTRING("x-western")), sizeAdjust(0.0f)
> +    language(NS_LITERAL_CSTRING("en")), sizeAdjust(0.0f)

In part 2:
> -    language(NS_LITERAL_CSTRING("en")), sizeAdjust(0.0f)
> +    language(nsnull), sizeAdjust(0.0f)

nsnull?  why not gfxAtoms::en or gfxAtoms::x_western?
(should be consistent with the other gfxFontStyle ctor?)

> gfxTextRun::Dump
> +        nsCString lang;
> +        style->language->ToUTF8String(lang);

nsCAutoString


In gfx/thebes/src/gfxPangoFonts.cpp
> -        if (!gLangService) {
> -            CallGetService(NS_LANGUAGEATOMSERVICE_CONTRACTID, &gLangService);
> -        }

Are we guarranted to always reach the conitional code in
gfxFontGroup::ForEachFontInternal that sets gLangService before we reach
gfxPangoFontGroup::MakeFontSet ?
Attachment #428585 - Flags: superreview?(matspal) → superreview+
Comment on attachment 428586 [details] [diff] [review]
part 2, v2.3: store language as an nsIAtom* in gfxFontStyle - updated for current trunk, fixed Linux issue found on tryserver

sr=mats

In gfx/thebes/src/gfxAtomList.h  gfxAtoms.cpp  gfxAtoms.h
> + * Portions created by the Initial Developer are Copyright (C) 2005

2010

In gfx/thebes/src/gfxPangoFonts.cpp:
> +GuessPangoLanguage(nsIAtom *aLangGroup)
...
> +    nsCString lg;
> +    aLangGroup->ToUTF8String(lg);

nsCAutoString

In gfx/thebes/src/gfxPlatform.cpp:
> +gfxPlatform::GetFontPrefLangFor(nsIAtom *aLang)
...
> +    nsCString lang;
> +    aLang->ToUTF8String(lang);

nsCAutoString
Attachment #428586 - Flags: superreview?(matspal) → superreview+
Blocks: 502906
(In reply to comment #28)
> nsnull?  why not gfxAtoms::en or gfxAtoms::x_western?
> (should be consistent with the other gfxFontStyle ctor?)

Yes, better to default to gfxAtoms::x_western consistently.

(I want to re-examine our collection of language codes and groups, and how we use them, but that's a separate issue.)

> Are we guarranted to always reach the conitional code in
> gfxFontGroup::ForEachFontInternal that sets gLangService before we reach
> gfxPangoFontGroup::MakeFontSet ?

I thought the answer was "yes", but on re-examining, that turns out to be a mistake. So I've restored the code to get the service if necessary.
Part 1: http://hg.mozilla.org/mozilla-central/rev/fe3c9f571228
Part 2: http://hg.mozilla.org/mozilla-central/rev/4c60c40075e9
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
It seems that it causes some changes in font fallback (especially for Traditional Chinese characters), and some crash like Bug 548535?
Yes, bug 548535 is a regression from this - I'll be looking into it.

If you're seeing changes in font fallback behavior, please file a report with specific testcases so we can look into it.
(In reply to comment #34)
> It seems that it causes some changes in font fallback (especially for
> Traditional Chinese characters)

Is this specifically on Windows? If so, I think I understand what's happening, and how to fix it.
True. It seems that it is Windows-only.
(Fortunately I have Win7/Mac/Linux that are all running the nightly build)
(In reply to comment #37)
> True. It seems that it is Windows-only.
> (Fortunately I have Win7/Mac/Linux that are all running the nightly build)

I've filed bug 548608 for this. If you have samples/testcases you could add there, please do.
Depends on: 548608
Depends on: 548978
gfxFontStyle doesn't keep references to the language nsIAtom.
What ensures that the pointer remains valid?

nsRuleNode::ComputeVisibilityData ref-counts the result of do_GetAtom.
Is there something to ensure that languages are permanent/static atoms?
Depends on: 597212
Duplicate of this bug: 382710
You need to log in before you can comment on or make changes to this bug.