Closed Bug 1857742 Opened 8 months ago Closed 7 months ago

:lang() fails to understand some valid BCP 47 language tags since FF 114

Categories

(Core :: CSS Parsing and Computation, defect)

Firefox 114
defect

Tracking

()

RESOLVED FIXED
121 Branch
Tracking Status
firefox-esr115 --- fixed
firefox119 --- wontfix
firefox120 --- wontfix
firefox121 --- fixed

People

(Reporter: 747.neutron, Assigned: jfkthame)

References

(Regression)

Details

(Keywords: regression)

Attachments

(10 files)

1017 bytes, text/html
Details
33.47 KB, image/png
Details
120.78 KB, image/png
Details
113.79 KB, image/png
Details
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/118.0

Steps to reproduce:

While I'm not sure whether it is HTML or CSS parser to blame, but the :lang() pseudo-class apparently refuses to parse some valid BCP 47 tags, while it allows some other invalid ones.

As shown in my attached file, the behavior leads me to believe that Firefox actually recognizes Unicode language identifier as defined in UTS #35, instead of BCP 47.

Unicode language and locale identifiers inherit the design and the repertoire of subtags from [BCP47] Language Tags. There are some extensions and restrictions made for the use of the Unicode locale identifier in CLDR:

  • It does not allow for the full syntax of [BCP47]:
    • No extlang subtags are allowed (as in the BCP 47 canonical form, see BCP 47 Section 4.5 and Section 3.1.7)
    • No irregular BCP 47 legacy language tags (marked as “Type: grandfathered” in BCP 47) are allowed (these are all deprecated in BCP 47)
    • A tag must not start with the subtag "x": thus a privateuse (eg x-abc) can only be after a language subtag, like "und"
  • It allows for certain semantic additions and constraints:
    • Certain codes that are private-use in BCP 47 and ISO are given semantics by LDML
    • Each macrolanguage has an identified primary encompassed language, which is treated as an alias for the macrolanguage, and thus is replaced when canonicalizing (as allowed by BCP 47, see Section 4.1.2)
  • It allows certain syntax for backwards compatibility (not BCP 47-compatible):
    • The "_" character for field separator characters, as well as the "-" used in [BCP47] (however, the canonical form is with "-")
    • The subtag "root" to indicate the generic locale used as the parent of all languages in the CLDR data model ("und" can be used instead)
    • The language tag may begin with a script subtag rather than a language subtag. This is specialized use only, and not required for CLDR conformance.

I think it is a bug, because according to the HTML and CSS standards, their values should be handled as a BCP 47 language tag.

The lang attribute (in no namespace) specifies the primary language for the element's contents and for any of the element's attributes that contain text. Its value must be a valid BCP 47 language tag, or the empty string.

The :lang() pseudo-class, which accepts a comma-separated list of one or more language ranges [...] An element’s content language matches a language range if, when represented in BCP 47 syntax [BCP47], it matches that language range in an extended filtering operation per [RFC4647] Matching of Language Tags (section 3.3.2).

As I tested on Browserling, the behavior was introduced at Firefox 114. Chrome 117 is not affected by this issue.

Actual results:

Open the attachment with the browser.

  • The browser does not render the iw-ase-jpan-basiceng paragraph in boldface, even though it is a valid BCP 47 tag that should match iw.
  • The browser does render the zh_gb_oxendict paragraph in boldface, even though it is not a valid BCP 47 tag (and thus should not match zh).
  • The browser does not render the en-gb-oed and i-navajo paragraphs in boldface, even though they are valid (grandfathered) BCP 47 tags.
  • The browser does not render the x-lojban paragraph in boldface, even though it is a valid BCP 47 private tag.

Expected results:

Each paragraph should be rendered as how the text says.

Rendering on Google Chrome 117 (which I believe is correct).

Rendering on FF 113, Windows 10, according to Browserling

Rendering on FF 114, Windows 10, according to Browserling

The Bugbug bot thinks this bug should belong to the 'Core::Internationalization' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Internationalization
Product: Firefox → Core
Component: Internationalization → CSS Parsing and Computation
Keywords: regression
Regressed by: 1121792
Version: Firefox 118 → Firefox 114

:jfkthame, since you are the author of the regressor, bug 1121792, could you take a look? Also, could you set the severity field?

For more information, please visit BugBot documentation.

Flags: needinfo?(jfkthame)

Yeah, I think the report is correct... basically, the issue is that we're using unic_langid_new to parse the tags, which (as its name suggests) is parsing them as Unicode Language Identifiers, and that's not (quite) the same thing as a BCP47 language tag.

Doesn't look like the unic-lang-id crate knows anything about BCP47 tags, so we'll need to find another way to handle these more correctly. :(

Severity: -- → S3
Flags: needinfo?(jfkthame)

Thank you for your input. Does "crate" mean Rust crate? If it is just a crate problem, I'd suggest simply replacing it with something like oxilangtag. (Sorry if I'm saying something completely pointless; I know nothing about the internal logic.)

[marking 'confirmed' per comment 6, and ticking 'affected' flags for releases since the regression]

Status: UNCONFIRMED → NEW
Ever confirmed: true

(In reply to 747.neutron from comment #7)

Thank you for your input. Does "crate" mean Rust crate? If it is just a crate problem, I'd suggest simply replacing it with something like oxilangtag. (Sorry if I'm saying something completely pointless; I know nothing about the internal logic.)

Unfortunately oxilangtag only handles parsing, but doesn't support the matching functionality we'd need.

Its "parent" language_tags might work; it supports a matches() method that I think provides what we'd need. It's a substantially more heavyweight crate, though; not sure exactly how much it would bloat Gecko to include it, but this may be a concern.

(In reply to 747.neutron from comment #1)

Created attachment 9357249 [details]
:lang() tester rendering on GC 117

Rendering on Google Chrome 117 (which I believe is correct).

Looking into this further, I'm not convinced the Chrome rendering of these examples is correct in all cases.

In particular, I think that the element with lang="x-lojban" should be italic but NOT bold. This is because it will match p[lang|=x] (the value begins with x followed by a hyphen), but it should NOT match p:lang(x), because x by itself is not a well-formed language tag at all. x-lojban is a private-use tag; it can match itself, but is not parseable as primary language x with some kind of extension.

Similarly, I also don't believe lang="art-lojban" should be bold, because I don't think it should match p:lang(art). This is because art-lojban is not a normal BCP47 tag; it's a "grandfathered" tag that IMO should be matched in its entirety. Alternatively, it could be canonicalized to its preferred modern form, but that would be jbo, not art. So there's no justification for p:lang(art) to match it.

So I'm looking into a potential fix for Firefox, but it will not match Chrome on these two examples. (But please let me know if my analysis here is faulty; it's entirely possible that I have overlooked something.)

Flags: needinfo?(747.neutron)

To improve the behavior here, I think we can try using oxilangtag to parse the tags (as it will correctly parse all BCP47 tags, unlike unic-lang-id which is focused on Unicode language identifiers, and rejects some valid BCP47 tags). Then we'll need to provide a matching function, but that's not too onerous -- we can even base it on the method in rust-language-tags, though we don't need all the extra machinery from that crate (the IANA registry lists, etc).

The one deviation from the standard that I think we probably want to keep is accepting underscore as a subtag separator, in addition to the hyphen that BCP47 expects. This was implemented based on observed usage on the web, so dropping it is likely to have some (though perhaps minor) compatibility implications. (If we do want to consider that, let's do it in a separate bug.)

Assignee: nobody → jfkthame
Status: NEW → ASSIGNED
Pushed by jkew@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b5d3d5d74c79
patch 1 - Vendor the oxilangtag crate into third_party/rust. r=supply-chain-reviewers,dholbert
https://hg.mozilla.org/integration/autoland/rev/2720d1aca77a
patch 2 - Use oxilangtag rather than unic_langid to parse lang tags for nsStyleUtil::LangTagCompare. r=layout-reviewers,dholbert,TYLin
https://hg.mozilla.org/integration/autoland/rev/77f410f31503
patch 3 - Add some more :lang()-matching reftests. r=layout-reviewers,dholbert
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/43241 for changes under testing/web-platform/tests

(In reply to Jonathan Kew [:jfkthame] from comment #10)

In particular, I think that the element with lang="x-lojban" should be italic but NOT bold. This is because it will match p[lang|=x] (the value begins with x followed by a hyphen), but it should NOT match p:lang(x), because x by itself is not a well-formed language tag at all. x-lojban is a private-use tag; it can match itself, but is not parseable as primary language x with some kind of extension.

Similarly, I also don't believe lang="art-lojban" should be bold, because I don't think it should match p:lang(art). This is because art-lojban is not a normal BCP47 tag; it's a "grandfathered" tag that IMO should be matched in its entirety. Alternatively, it could be canonicalized to its preferred modern form, but that would be jbo, not art. So there's no justification for p:lang(art) to match it.

Thank you for your careful inspection, but I don't think language tag matching is working like that. Let's take a look at specs.

(1) HTML Standard 3.2.6.2:

The lang attribute (in no namespace) specifies the primary language for the element's contents and for any of the element's attributes that contain text. Its value must be a valid BCP 47 language tag, or the empty string.

(2) CSS selector spec 7.2:

The :lang() pseudo-class, which accepts a comma-separated list of one or more language ranges, represents an element whose content language is one of the languages listed in its argument. Each language range in :lang() must be a valid CSS <ident> or <string>. (Thus language ranges containing asterisks, for example, must be either correctly escaped or quoted as strings, e.g. :lang(*-Latn) or :lang("*-Latn").)

[...]

An element’s content language matches a language range if, when represented in BCP 47 syntax [BCP47], it matches that language range in an extended filtering operation per [RFC4647] Matching of Language Tags (section 3.3.2). For this purpose, a wildcard language range ("*") does not match elements whose language is not tagged (e.g. lang=""), but does match elements whose language is tagged as undetermined (lang=und). The matching is performed ASCII case-insensitively. The language range does not need to be a valid language code to perform this comparison.

(3) RFC 4647 3.2:

Extended filtering compares extended language ranges to language
tags. Each extended language range in the language priority list is
considered in turn, according to priority. A language range matches
a particular language tag if each respective list of subtags matches.
To determine a match:

  1. Split both the extended language range and the language tag being
    compared into a list of subtags by dividing on the hyphen (%x2D)
    character. Two subtags match if either they are the same when
    compared case-insensitively or the language range's subtag is the
    wildcard '*'.

  2. Begin with the first subtag in each list. If the first subtag in
    the range does not match the first subtag in the tag, the overall
    match fails. Otherwise, move to the next subtag in both the
    range and the tag.

  3. While there are more subtags left in the language range's list:

    A. If the subtag currently being examined in the range is the
    wildcard ('*'), move to the next subtag in the range and
    continue with the loop.

    B. Else, if there are no more subtags in the language tag's
    list, the match fails.

    C. Else, if the current subtag in the range's list matches the
    current subtag in the language tag's list, move to the next
    subtag in both lists and continue with the loop.

    D. Else, if the language tag's subtag is a "singleton" (a single
    letter or digit, which includes the private-use subtag 'x')
    the match fails.

    E. Else, move to the next subtag in the language tag's list and
    continue with the loop.

  4. When the language range's list has no more subtags, the match
    succeeds.

(4) RFC 4647 2.2:

extended-language-range = (1*8ALPHA / "*")
                          *("-" (1*8alphanum / "*"))

(5) RFC 4647 4.1

Most matching schemes make no attempt to process the semantic meaning
of the subtags. The language range is compared, in a case-
insensitive manner, to each language tag being matched, using basic
string processing. Users SHOULD select language ranges that are
well-formed, valid language tags according to [RFC4646] (substituting
wildcards as appropriate in extended language ranges).

(6) RFC 4647 2

There are different types of language range, whose specific
attributes vary according to their application. Language ranges are
similar to language tags: they consist of a sequence of subtags
separated by hyphens. In a language range, each subtag MUST either
be a sequence of ASCII alphanumeric characters or the single
character '*' (%x2A, ASTERISK). The character '*' is a "wildcard"
that matches any sequence of subtags. The meaning and uses of
wildcards vary according to the type of language range.

(7) RFC 5646 2.2:

  • "Subtag" refers to a specific section of a tag, delimited by a
    hyphen, such as the subtags 'zh', 'Hant', and 'CN' in the tag "zh-
    Hant-CN". Examples of subtags in this document are enclosed in
    single quotes ('Hant').

[...]

  • The single-letter subtag 'x' introduces a sequence of private use
    subtags. The interpretation of any private use subtag is defined
    solely by private agreement and is not defined by the rules in
    this section or in any standard or registry defined in this
    document.
  1. Despite what you put into the lang attribute in HTML being required to be a valid language tag (see (1)), what you put into CSS :lang() pseudo-class only needs to be a comma-separated list of language ranges, which do "not need to be a valid language code to perform this comparison" (see (2)). I was perhaps unclear about this fact in my first comment.

  2. (2) further states that the "language tag" (of attribute value) and the "language ranges" (of pseudo-class value) should be matched using RFC 4647's extended filtering. The extended filtering is defined as matching of "extended language ranges to language tags" (see (3)), where extended language range is defined as (4). An extended language range is split by hyphen into list of subtags and matched against likewise split list of subtags from the language tag using the algorithm described in (3).

  3. Following the algorithm in (3), a language range consists of a single subtag x or art always succeeds to match a language tag whose first subtag is identical to itself. Thus x (as range) should match x-lojban (as tag) and art (as range) should match art-lojban. (Please double check it just in case.) RFC 4647 also states that "Users SHOULD select language ranges that are well-formed, valid language tags" but it is not REQUIRED (as in (5)).

  4. But is a single letter x really a "subtag"? Yes, it is、 according to RFC 4647 (see (6)), and RFC 5646 that defines language tags also agrees with the definition (see (7); technically, RFC 4647 only refers to a legacy RFC 4646, but this section is practically unchanged). So x is a subtag in this discussion, and, while not a well-formed language tag as a whole, a valid language range as a whole.

So, while I don't deny that your more semantically strict matching of language tags is superior, it would entail update of one or more standards cited above.

The one deviation from the standard that I think we probably want to keep is accepting underscore as a subtag separator, in addition to the hyphen that BCP47 expects. This was implemented based on observed usage on the web, so dropping it is likely to have some (though perhaps minor) compatibility implications. (If we do want to consider that, let's do it in a separate bug.)

I see. I didn't know about the actual usage data and you are probably right. Anyway it does not formally contradict with (2), because it says nothing about when "element’s content language" is not "represented in BCP 47 syntax".

Flags: needinfo?(747.neutron)

Thank you for this detailed analysis, that's really helpful (but will take some time to study carefully...).

For now, the patch that has just landed here will resolve a number of the issues noted, where we were failing to match correctly, but it sounds like some further handling may still be needed for cases where the range given in the :lang() selector is not a valid BCP47 tag.

I'll file a followup bug about this, so that we can better keep track of the issues/changes being made.

Blocks: 1865482
Upstream PR merged by moz-wptsync-bot
Flags: in-testsuite+

Is this something we need to uplift to ESR115? Not sure what the real-world impact of this bug is. The patches would need some rebasing, however.

Flags: needinfo?(jfkthame)

It's probably worth taking, given the lifespan that ESR115 still has. The bug here will result in CSS failing to match the lang attribute correctly in some cases -- it'll only affect a small minority of pages, but cases where (for instance) we end up rendering with a Japanese font because we failed to match a Chinese language tag properly (or vice versa) would be pretty annoying for users, and it'd be totally unclear what they could do about it.

I'll have a look at what rebasing is needed; leaving ni? for now.

I've rebased the patches to esr115, but unfortunately esr115 doesn't want to build on my local machine so I have not tested this locally; and ./mach try is also not working for me when on esr115. I'll try from a different machine tomorrow....

(Note that if we uplift this, we should also take the followup bug 1865482, which corrects the behavior implemented here to more fully follow the spec.)

Flags: needinfo?(jfkthame)

Comment on attachment 9367542 [details]
Bug 1857742 - patch 1 [esr115] - Vendor the oxilangtag crate into third_party/rust.

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Incorrect language tag processing prevents some lang attributes matching as expected, which may result in wrong display of content (e.g. using a Japanese font in place of Chinese).
  • User impact if declined: Bad display of some content, resulting in culturally wrong presentation and sometimes impacting readability/usability.
  • Fix Landed on Version: 121
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Change is localized to CSS lang attribute matching, so no wider impact; new code relies on a safe Rust parsing implementation.
Attachment #9367542 - Flags: approval-mozilla-esr115?
Attachment #9367543 - Flags: approval-mozilla-esr115?
Attachment #9367544 - Flags: approval-mozilla-esr115?

Comment on attachment 9367542 [details]
Bug 1857742 - patch 1 [esr115] - Vendor the oxilangtag crate into third_party/rust.

Approved for 115.6esr.

Attachment #9367542 - Flags: approval-mozilla-esr115? → approval-mozilla-esr115+
Attachment #9367543 - Flags: approval-mozilla-esr115? → approval-mozilla-esr115+
Attachment #9367544 - Flags: approval-mozilla-esr115? → approval-mozilla-esr115+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: