Closed Bug 1643536 Opened 5 years ago Closed 3 years ago

Firefox treats some html lang="en" pages as "other writing systems" instead of Roman alphabet [due to invalid lang tag on https://www.aclu.org/]

Categories

(Web Compatibility :: Site Reports, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: erwinm, Unassigned)

Details

(Keywords: webcompat:site-wait)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:77.0) Gecko/20100101 Firefox/77.0

Steps to reproduce:

I use Andika for Roman and Cyrillic alphabets, Skeirs for Other Writing Systems.

I visited https://www.aclu.org/news/free-speech/police-are-attacking-journalists-at-protests-were-suing/

Actual results:

It displayed in Skeirs.

Checking the page source, it's html lang="en" and also uses JavaScript.

Expected results:

It should display in Andika.

The alphabet switch is a bit awkward.

See also bug 1633627

Bugbug thinks this bug should belong to this component, but please revert this change in case of error.

Component: Untriaged → DOM: Core & HTML
Product: Firefox → Core
Component: DOM: Core & HTML → Layout: Text and Fonts
See Also: → 1633627

(In reply to MarjaE from comment #0)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:77.0) Gecko/20100101 Firefox/77.0

Steps to reproduce:

I use Andika for Roman and Cyrillic alphabets, Skeirs for Other Writing Systems.

I visited https://www.aclu.org/news/free-speech/police-are-attacking-journalists-at-protests-were-suing/

Actual results:

It displayed in Skeirs.

Checking the page source, it's html lang="en" and also uses JavaScript.

Looking at the page in the Inspector, I see that the <html> tag actually has an attribute lang="en en", which is invalid and therefore ignored.

(The original page that the server delivers does seem to have lang="en", so I presume something in its JavaScript subsequently updates this (along with lots of other changes -- e.g. I see that it adds a style attribute to the root element), and does so incorrectly. Indeed, the original page has <html lang="en" data-n-head="%7B%22lang%22:%7B%221%22:%22en%22%7D%7D">, where the data-n-head attribute would decode to {"lang":{"1":"en"}}, which looks very much like it could be getting used by a loading script to add the spurious extra en. I guess they should either remove the original lang attribute, if they're always relying on this being added by script, or make the script smarter so that it replaces the existing value instead of appending to it and creating an invalid tag.)
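The decoding described above can be checked with a few lines of Python. This is only a sketch: the `append_value` and `replace_value` helpers are hypothetical stand-ins for whatever the site's loading script does, which I have not inspected, so treat the "append" behaviour as the presumed cause rather than a confirmed one.

```python
import json
from urllib.parse import unquote

# Value of the data-n-head attribute as delivered by the server.
raw = "%7B%22lang%22:%7B%221%22:%22en%22%7D%7D"

decoded = unquote(raw)      # percent-decoding yields a JSON string
data = json.loads(decoded)  # parse it into a dict
print(decoded)              # {"lang":{"1":"en"}}


def append_value(existing, value):
    """Hypothetical buggy update: append to whatever is already there."""
    return f"{existing} {value}".strip()


def replace_value(existing, value):
    """Hypothetical fixed update: overwrite the existing value."""
    return value


server_lang = "en"  # lang attribute in the original markup
print(append_value(server_lang, data["lang"]["1"]))   # en en
print(replace_value(server_lang, data["lang"]["1"]))  # en
```

If the script does behave like `append_value`, either dropping the server-side lang attribute or switching to a replace-style update would avoid the invalid "en en" value.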

I see the same lang="en en" when inspecting the loaded page in Chrome, so it does look like this is a site bug rather than a Firefox bug.

Component: Layout: Text and Fonts → Desktop
Product: Core → Web Compatibility
Summary: Firefox treats some html lang="en" pages as "other writing systems" instead of Roman alphabet → Firefox treats some html lang="en" pages as "other writing systems" instead of Roman alphabet [due to invalid lang tag on https://www.aclu.org/]
Version: 77 Branch → unspecified
Severity: -- → S2
Priority: -- → P3

I sent a message to an engineering leader via LinkedIn.

(I got a reply that the bug would be passed along to the right folks.)

I've tried reporting similar errors to webcompat, but they close the reports as unable to reproduce, even though I am unable to avoid the errors.

The ACLU page where this was originally reported no longer seems to show the invalid lang tag problem as described in comment 2, so I guess that has been fixed. @MarjaE, if you're seeing similar problems again please indicate the specific page (or pages) involved.

Flags: needinfo?(erwinm)

Here's one: http://www.digitalattic.org/home/war/vegetius/

View Page Info shows content-language English.

Flags: needinfo?(erwinm)

The issue that MarjaE is talking about is
https://github.com/webcompat/web-bugs/issues/68809

(In reply to MarjaE from comment #7)

Here's one: http://www.digitalattic.org/home/war/vegetius/

View Page Info shows content-language English.

This page ends up with the Other Writing Systems font preference because it does not have a lang=en (or equivalent) attribute.

Here's the beginning of the document, from View Source:

<!-- HEADER -->

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
   
<head>
<title>The Military Institutions of the Romans (De Re Militari)</title>

<meta http-equiv="content-Type" content="text/html;charset=utf-8">
<meta http-equiv="content-Language" content="English">

<meta name="author" content="Mads Brevik">
<meta name="Robots" content="all">

<meta name="description" content="'De Re Militari' by Flavius Vegetius Renatus">
<meta name="keywords" content="flavius vegetius renatus, de re militari">

<link href="/favicon.ico" rel="shortcut icon">
<link href="/home/_include/attic.css" type="text/css" rel="stylesheet">

</head>

<!-- BODY -->

<body>

<div class="Body">
<div id="Mainmenu">

	<div class="MainTitle">
		<a href="/" style="color: black">Digital Attic</a>
	</div>

	<div class="TopMenuLink">
		<a href="/home/read/asoiaf/">A Song of Ice and Fire</a>&nbsp;:&nbsp;<a href="/home/war/">Warfare</a>
	</div>

</div>

Note the absence of any lang attribute. What Page Info shows (content-Language: English) comes from a meta http-equiv tag, not from a lang attribute. Despite its name, that meta tag is not the correct way to declare the language of the document itself; its purpose is slightly different (see https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Language).

In addition, even if this were the relevant place to declare the document language, it wouldn't work as intended because it literally says "English" as the value (which Page Info dutifully shows); but the value is supposed to be a formally-defined language tag such as "en-US" that can be parsed in a well-defined way for processing. The string "English" is not a defined language tag and so would be ignored anyway.

So to sum up: the page does not declare its document language because of two authoring errors: trying to use an arbitrary language name rather than a well-formed language tag, and putting it in the http-equiv="content-Language" meta tag rather than as an HTML lang attribute. As a result, Firefox ends up using the Other Writing Systems font preference to resolve sans-serif for this content.
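For anyone wanting to check a page for the same problem, here is a minimal sketch using Python's standard `html.parser`. The `LangDeclarations` class and its attribute names are made up for this illustration, and the sample markup is an abbreviated copy of the head quoted above. It only reports what the markup declares; it does not model how Firefox picks a font.

```python
from html.parser import HTMLParser


class LangDeclarations(HTMLParser):
    """Collect the language declarations a page makes (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.html_lang = None              # value of <html lang="...">, if any
        self.meta_content_language = None  # value of the http-equiv meta tag, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # tag and attribute names arrive lower-cased
        if tag == "html" and "lang" in attrs:
            self.html_lang = attrs["lang"]
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "content-language":
            self.meta_content_language = attrs.get("content")


# Abbreviated copy of the document head shown in this comment.
PAGE_HEAD = """<!-- HEADER -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<head>
<title>The Military Institutions of the Romans (De Re Militari)</title>
<meta http-equiv="content-Type" content="text/html;charset=utf-8">
<meta http-equiv="content-Language" content="English">
</head>
<body>
"""

parser = LangDeclarations()
parser.feed(PAGE_HEAD)
print(parser.html_lang)              # None
print(parser.meta_content_language)  # English
```

The page never opens an `<html>` element with a lang attribute, so `html_lang` stays `None`, while the meta tag carries the free-form string "English".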

Also occurs on Project Gutenberg:

https://www.gutenberg.org/ebooks/search/?query=test+search&submit_search=Go%21

Where line 11 of the page source reads:

--><html lang="en_US">

(In reply to MarjaE from comment #10)

Also occurs on Project Gutenberg:

https://www.gutenberg.org/ebooks/search/?query=test+search&submit_search=Go%21

Where line 11 of the page source reads:

--><html lang="en_US">

In this case, the issue is that "en_US" is not a well-formed language tag; it should be "en-US".

See for example https://datatracker.ietf.org/doc/html/rfc5646#section-2.
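As a rough illustration of the difference, the sketch below checks a tag against a deliberately simplified subset of the RFC 5646 grammar: a 2 to 3 letter language, an optional 4-letter script, and an optional region. The pattern and the function name `is_simple_language_tag` are my own; the full grammar also allows longer primary subtags, extlangs, variants, extensions, private-use and grandfathered tags, all of which are omitted here.

```python
import re

# Simplified subset of the RFC 5646 "langtag" production (assumption:
# only language, optional script, optional region are considered).
SIMPLE_LANGTAG = re.compile(
    r"[A-Za-z]{2,3}"                  # primary language, e.g. "en"
    r"(?:-[A-Za-z]{4})?"              # optional script, e.g. "Hant"
    r"(?:-(?:[A-Za-z]{2}|[0-9]{3}))?" # optional region, e.g. "US" or "419"
)


def is_simple_language_tag(tag):
    """Return True if the whole string matches the simplified pattern."""
    return SIMPLE_LANGTAG.fullmatch(tag) is not None


for candidate in ("en-US", "en_US", "en en", "zh-Hant-TW"):
    print(candidate, is_simple_language_tag(candidate))
```

Under this check "en-US" and "zh-Hant-TW" pass, while "en_US" (underscore separator) and "en en" (space) are rejected, because subtags may only be separated by a hyphen.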

Although the problem in comment 10 is really a website error, it appears that both Blink and WebKit recognize such "broken" lang tags for the purpose of font selection. So I've filed bug 1757578 to propose making Firefox do the same, in the interests of compatibility.
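To make the idea of lenient handling concrete, here is a hypothetical sketch of what such a fallback could look like. It is not how Blink or WebKit implement it, nor what bug 1757578 will necessarily do; the function name `lenient_primary_language` and the rule of treating "_" as "-" are assumptions made purely for illustration.

```python
def lenient_primary_language(raw):
    """Return the lower-cased primary language subtag of a possibly
    malformed tag, or None if nothing usable can be extracted.

    Assumption for this sketch: an underscore is treated as if the
    author had written a hyphen.
    """
    candidate = raw.strip().replace("_", "-")
    primary = candidate.split("-", 1)[0].lower()
    # Accept only 2 or 3 ASCII letters, the usual shape of a language code.
    if 2 <= len(primary) <= 3 and primary.isascii() and primary.isalpha():
        return primary
    return None


for raw in ("en_US", "en-US", "en en", "English"):
    print(repr(raw), lenient_primary_language(raw))
```

With this rule "en_US" and "en-US" both resolve to "en", so a font preference keyed on the language could still be chosen, while "en en" and "English" yield `None` and would fall back to the default.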

See Also: → 1757578

It seems that the HTML tag now has the correctly formed attribute, "en-US", so I am unable to reproduce the issue.

https://prnt.sc/dS1COaNW8qLR

Marja, is the issue reproducible on your side?

Tested with:

Browser / Version: Firefox Release 102.0 (64-bit)/ Firefox Nightly 104.0a1 (2022-06-28) (64-bit)
Operating System: Mac OSX Catalina 10.15.7

Flags: needinfo?(erwinm)
Status: UNCONFIRMED → NEW
Ever confirmed: true

Yes, it currently works on my side.

Flags: needinfo?(erwinm)

Thanks for the update. I will be closing this issue.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED