Closed Bug 673087 Opened 13 years ago Closed 3 years ago

XML declaration in text/html not used as an internal character encoding declaration (due to WHATWG HTML compliance)

Categories

(Core :: DOM: HTML Parser, defect)

defect
Not set
normal

RESOLVED FIXED
89 Branch
Tracking Status
firefox89 --- fixed

People

(Reporter: radek, Assigned: hsivonen)

Attachments

(1 file)

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0
Build ID: 20110615151330

Steps to reproduce:

Open http://m.gsmweb.cz/
Environment: Firefox 5, Windows 7, system-default encoding windows-1250

Headers sent:
GET /m/ HTTP/1.1
Host: gsmweb.cz
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: cs,en-us;q=0.7,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: windows-1250,utf-8;q=0.7,*;q=0.7
Connection: keep-alive

(Accept-Charset specifies both windows-1250 and utf-8)



Actual results:

Website displays in windows-1250 instead of utf-8.

Headers received:
HTTP/1.1 200 OK
Date: Thu, 21 Jul 2011 12:18:52 GMT
Server: Apache/2.2.18 (FreeBSD) mod_ssl/2.2.18 OpenSSL/0.9.8n DAV/2 PHP/5.3.6 with Suhosin-Patch
X-Powered-By: PHP/5.3.6
Content-Length: 1808
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html

(no encoding specified in headers received)
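The observation above can be checked mechanically. Below is a minimal Python sketch that reads the charset parameter out of a Content-Type value using the standard library's `email.message.Message`. The helper name `charset_from_content_type` is made up for illustration and is not part of any browser code.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the lowercased charset parameter, or None.
    return msg.get_content_charset()


# The response in this report carries no charset parameter.
print(charset_from_content_type("text/html"))
# For comparison, a response that does label its encoding.
print(charset_from_content_type("text/html; charset=UTF-8"))
```

With no charset in the header, the browser has to fall back on something inside the document or on its own defaults.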


Expected results:

Website should display in UTF-8. The site has an XML prolog declaring UTF-8, so the page encoding is UTF-8:
<?xml version="1.0" encoding="UTF-8"?>

Google Chrome and Opera on Windows display the page correctly. Firefox and Internet Explorer both break the encoding.
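To illustrate what "breaking the encoding" looks like here, this is a small Python sketch that takes UTF-8 bytes and decodes them once as windows-1250 and once as UTF-8. The sample sentence is an arbitrary Czech pangram chosen for illustration, not text taken from the site.

```python
# Czech sample text; any non-ASCII text would show the same effect.
original = "Příliš žluťoučký kůň"

# What the server sends: the text encoded as UTF-8 bytes.
wire_bytes = original.encode("utf-8")

# What the user sees if the browser assumes windows-1250.
# errors="replace" covers byte values that windows-1250 leaves unassigned.
misread = wire_bytes.decode("windows-1250", errors="replace")

# What the user sees if the browser honors the declared UTF-8.
correct = wire_bytes.decode("utf-8")

print(misread)
print(correct)
```

Each accented letter occupies two bytes in UTF-8, so the windows-1250 reading shows two unrelated characters in its place.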
OS: Other → Windows 7
Hardware: All → x86
Reproducible on:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0a1) Gecko/20110720 Firefox/8.0a1

> Website displays in windows-1250 instead of utf-8.
The issue is visible. Bug could be set as NEW if not a dupe.
(In reply to comment #0)

> Headers received:
...
> Content-Type: text/html

I'm guessing the problem might be the Content-Type header; should the page be served as application/xhtml+xml instead, in order for the XML (including its encoding declaration) to be handled properly?
I cannot reproduce in FF5, Win XP, Slovak. The page displays fine as UTF-8.
But comment 2 may be right. The XML header could be ignored because of text/html type. Maybe add a <meta http-equiv="content-type"> declaration with charset for safety?
aceman: you can't always force the site developer to do what you want. In this case, he does not want to modify the (X)HTML source.

Also, the page displays fine when loaded from cache (e.g. load the page, close Firefox, reopen Firefox with the same page loaded from cache); in that case the XML prolog isn't ignored.
It displayed fine for me on first load. Page Info says it is in UTF-8.

My request headers have:
Accept-Charset:ISO-8859-2,utf-8;q=0.7,*;q=0.7

AndreiD says he can confirm it. Can it be Win 7 only?
I tested on the following Nightlies:
-> Mac 10.6: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:8.0a1) Gecko/20110724 Firefox/8.0a1
-> Ubuntu 11.04: Mozilla/5.0 (X11; Linux i686; rv:8.0a1) Gecko/20110724 Firefox/8.0a1
-> Windows 7:  Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0a1) Gecko/20110724 Firefox/8.0a1

And the page loaded with ISO-8859-1 on the first run.
However, I can get the page displayed correctly by manually changing the character encoding to UTF-8.
But this leaves the question open: why did Firefox not recognize the character encoding from the very beginning, although it's defined in the source?
OK, I can see it when I set Character encoding -> Autodetect -> Off. Then my default encoding (set in Options, Content) of ISO-8859-2 is used (when the page doesn't specify otherwise). That produced bad characters in the page. Confirming.
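The behaviour described in the comment above can be modelled as a simple precedence order. This is a deliberately simplified Python sketch and not Gecko's code: the function `pick_encoding` and its parameters are hypothetical, and the real decision involves more inputs than shown here.

```python
from typing import Callable, Optional


def pick_encoding(
    declared: Optional[str],
    fallback: str,
    detector: Optional[Callable[[bytes], Optional[str]]] = None,
    data: bytes = b"",
) -> str:
    """Simplified model: declared encoding, then autodetection, then user default."""
    if declared:
        # The page (or its HTTP header) named an encoding the parser recognizes.
        return declared
    if detector is not None:
        # Autodetect is switched on; use its guess if it produces one.
        guess = detector(data)
        if guess:
            return guess
    # Autodetect is off or inconclusive: the user's default encoding applies.
    return fallback


# Autodetect off, nothing recognized as a declaration: the default wins.
print(pick_encoding(None, "ISO-8859-2"))
# Autodetect on (stand-in detector that always guesses UTF-8).
print(pick_encoding(None, "ISO-8859-2", detector=lambda _data: "UTF-8"))
```

In this model the XML declaration is not counted as a declaration for text/html, which is why the outcome depends on the autodetect setting and the locale default.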
Status: UNCONFIRMED → NEW
Component: General → HTML: Parser
Ever confirmed: true
OS: Windows 7 → All
Product: Firefox → Core
QA Contact: general → parser
Hardware: x86 → All
This is the correct behavior per the HTML5 spec.  This page is not XML, so the XML encoding is just garbage bytes as far as the parser is concerned.

Gecko's old parser used to look at the <?xml?> bit even in HTML, but the HTML5-compliant parser does not.
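For readers unfamiliar with what "looking at the <?xml?> bit" involves, here is a rough Python sketch of sniffing an encoding label from an XML declaration at the start of a byte stream. It is an illustration only: the regular expression, the 1024-byte window, and the name `xml_declaration_encoding` are assumptions made for this example, and neither the old Gecko parser nor any specification is claimed to work exactly this way.

```python
import re
from typing import Optional

# Assumed pattern: an XML declaration at the very start of the stream that
# carries an encoding pseudo-attribute in single or double quotes.
_XML_DECL = re.compile(
    rb"""^<\?xml\s[^>]*?\bencoding\s*=\s*(["'])([A-Za-z0-9._:-]+)\1"""
)


def xml_declaration_encoding(data: bytes) -> Optional[str]:
    """Return the encoding label from a leading XML declaration, or None."""
    # Only inspect a bounded prefix; the window size is an arbitrary choice.
    match = _XML_DECL.match(data[:1024])
    if match is None:
        return None
    return match.group(2).decode("ascii")


page = b'<?xml version="1.0" encoding="UTF-8"?>\n<html><body>...</body></html>'
print(xml_declaration_encoding(page))
print(xml_declaration_encoding(b"<!DOCTYPE html><html></html>"))
```

A parser that skips this step treats the declaration as ordinary bytes and moves on to its other encoding sources.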

Henri, are we tracking this fallout somewhere?  We've had several reports about this issue now...
Assignee: nobody → czech
Component: HTML: Parser → Czech
Product: Core → Tech Evangelism
QA Contact: parser → czech
Version: 5 Branch → unspecified
OK, can you also explain comment 4? I can confirm that - loading the page again after FF restart PROBABLY loads it from cache and SOMETIMES causes the page to display fine in UTF-8. What about that inconsistency?
(In reply to comment #8)
> Henri, are we tracking this fallout somewhere? 

No, we aren't.

> We've had several reports about this issue now...

This is maybe the second report that I've noticed.

Clearly, there's some tension between HTML5 using IE behavior as a proxy for reasoning about what existing content needs and existing "mobile" content that never targeted IE.

Why does Chrome decode the page as UTF-8? Is Chrome willfully violating HTML5?
> No, we aren't.

We need to be.

> This is maybe the second report that I've noticed.

I think I've seen 3 or 4 at this point.

> Why does Chrome decode the page as UTF-8?

The page has no encoding specified anywhere, so the browser can do whatever heuristics it wants, no?  Certainly we do in the charset autodetect mode...

aceman, I can't explain comment 4; Henri, any idea what's up with that?
Problem not found
Mozilla/5.0 (X11; Linux i686; rv:23.0) Gecko/20100101 Firefox/23.0
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
http://m.gsmweb.cz/ specifically no longer shows this bug because it sends a header Content-Type: application/xhtml+xml, but the underlying issue remains.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
In that case please move this bug out of Tech Evangelism - bugs in this component are about specific websites that should be asked to change their code in some way to achieve better cross-browser compatibility. If the site changed and fixed the problem, the bug is no longer relevant for us.
Status: REOPENED → NEW
Assignee: czech → nobody
Component: Czech → HTML: Parser
Product: Tech Evangelism → Core
Version: unspecified → Trunk
Summary: Website charset encoding mismatch (utf8 in xml header vs. cp1250 displayed) → XML declaration in text/html not used as an internal character encoding declaration (due to WHATWG HTML compliance)
For what it's worth, I think we should change both our behavior and the HTML spec here.  See bug 1280556 comment 4 and bug 1280556 comment 6.

I agree that we should change behavior now that the situation with Edge is what it is.

Assignee: nobody → hsivonen
Status: NEW → ASSIGNED

https://treeherder.mozilla.org/#/jobs?repo=try&revision=de621f9f05fe7885f9f8182c85e5ad0e59adb591

The number of try runs here shows how brittle the interaction between the internal encoding declarations and the menu is. I'm eager to implement bug 1687635, which would allow this code to be simplified a lot.

Attachment #9208049 - Attachment description: Bug 673087 - Honor encoding declared via XML declaration in text/html. → WIP: Bug 673087 - Honor encoding declared via XML declaration in text/html.
Attachment #9208049 - Attachment description: WIP: Bug 673087 - Honor encoding declared via XML declaration in text/html. → Bug 673087 - Honor encoding declared via XML declaration in text/html.
Pushed by VYV03354@nifty.ne.jp:
https://hg.mozilla.org/integration/autoland/rev/8452d5b62778
Honor encoding declared via XML declaration in text/html. r=emk
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/28183 for changes under testing/web-platform/tests
Regressions: 1700363
Regressions: 1700338
Status: ASSIGNED → RESOLVED
Closed: 11 years ago → 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 89 Branch
Upstream PR merged by moz-wptsync-bot
Regressions: 1700508