Closed Bug 1280556 Opened 9 years ago Closed 9 years ago

Encoding detection mismatch on http://www.idpf.org/epub/pgt/

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

RESOLVED DUPLICATE of bug 673087

People

(Reporter: annevk, Unassigned)

Details

Attachments

(1 file)

Chrome manages to detect UTF-8 somehow.
Could you please explain more? I didn't see any obvious problem.
(In reply to Masatoshi Kimura [:emk] from comment #1)
> Could you please explain more? I didn't see any obvious problem.

I see windows-1252 and "Copyright © 2011, 2012 International Digital Publishing Forum™"
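For readers unfamiliar with the symptom: the sketch below shows what typically happens to the footer text when UTF-8 bytes are decoded as windows-1252. It assumes the page is served as UTF-8 with no declared encoding, which this bug implies but does not show; the byte string is built locally for illustration and is not taken from the actual page.

```python
# Illustrative only: the real page bytes are not quoted in this bug.
text = "Copyright © 2011, 2012 International Digital Publishing Forum™"

# Assumption: the server sends these UTF-8 bytes without charset information.
utf8_bytes = text.encode("utf-8")

# Decoding with the right encoding round-trips the text.
as_utf8 = utf8_bytes.decode("utf-8")

# Decoding the same bytes as windows-1252 turns each multi-byte
# sequence into several unrelated characters.
as_cp1252 = utf8_bytes.decode("cp1252")

print(as_utf8)
print(as_cp1252)
```

The second line printed contains "Â©" and "â„¢" in place of "©" and "™", which is the usual sign that a browser picked windows-1252 for a UTF-8 document.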
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE
Edge shows this site as UTF-8 too. IE11 behaves like Firefox. Given Edge, WebKit, and Blink agreeing, we may just want to change the HTML spec and our behavior....
Flags: needinfo?(hsivonen)
Flags: needinfo?(annevk)
> Given Edge, WebKit, and Blink agreeing, we may just want to change the HTML spec and our behavior....

What would the change to the HTML spec be? You don’t mean a requirement to use the encoding in the XML declaration? That’s not what Edge, WebKit, and Blink are doing is it? I thought their behavior was just from doing their own heuristics, as you mention in https://bugzilla.mozilla.org/show_bug.cgi?id=673087#c11

> The page has no encoding specified anywhere, so the browser can do whatever heuristics it wants, no?

…and if so, that’s not something that would require a change to the HTML spec, right?
> You don’t mean a requirement to use the encoding in the XML declaration?

That's exactly what I mean, yes. Obviously its priority wrt other sources of encoding information would need to be sorted out.

> That’s not what Edge, WebKit, and Blink are doing is it?

That's _precisely_ what they are doing. Here's a simple testcase in case you want to black-box test this. This document, given no other encoding information (e.g. from file://):

    <!DOCTYPE html>
    <script>
    document.write(document.charset);
    </script>
    Some text.

shows "windows-1252" in Chrome and Edge and "ISO-8859-1" in Safari. In the case of Chrome and Safari, both are US localizations on US-localized Mac OS; in the case of Edge I'm running it via BrowserStack, but I assume it's equivalent (US localization on US-localized operating system).

On the other hand, this document:

    <?xml version="1.0" encoding="KOI8-R"?>
    <!DOCTYPE html>
    <script>
    document.write(document.charset);
    </script>
    Some text.

shows "KOI8-R" in Chrome/Safari and "koi8-r" in Edge. Neither document contains any non-ASCII characters that could be used in any meaningful heuristics, so all three engines are in fact using the encoding in the XML declaration. Also, note that this is not a case of "xml declaration just means UTF-8".

Of course in the case of Blink/WebKit you can just look at their source too. For example, see the comment at https://chromium.googlesource.com/chromium/src.git/+/9f7c5f2/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp#305 and the code that follows.

> I thought their behavior was just from doing their own heuristics

You thought wrong.

I should add that the actual parsing of the XML declaration in Blink/WebKit does differ from that in Edge. For example, this document:

    <?xml oxencoding="KOI8-R" version="1.0"?>
    <!DOCTYPE html>
    <script>
    document.write(document.charset);
    </script>
    Some text.

comes up "KOI8-R" in Chrome and Safari but "windows-1252" in Edge. So does this document:

    <?xml version="encoding = 'KOI8-R'"?>
    <!DOCTYPE html>
    <script>
    document.write(document.charset);
    </script>
    Some text.

The WebKit/Blink result is not surprising given the behavior of the function at https://chromium.googlesource.com/chromium/src.git/+/9f7c5f2/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp#154 but seems unlikely to be required for web compat, at least in terms of its treatment of "oxencoding". I can't speak to the space-skipping or control-char-skipping aspects, though.

This document:

    <?xml encoding = 'KOI8-R'?>
    <!DOCTYPE html>
    <script>
    document.write(document.charset);
    </script>
    Some text.

comes up "KOI8-R" in all of Chrome, Safari, and Edge.

On a more general note, if we have an area of non-interop, and sufficient interop problems that a major browser engine feels like it needs to change its behavior, that's a pretty good indicator that the spec needs to define things better. So this does in fact require a change to the HTML spec in my opinion: the spec is not matching reality.
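To make the difference between the two observed behaviors concrete, here is a rough Python sketch of two ways one might sniff the encoding from an XML declaration. It is not the Blink/WebKit code and not Edge's algorithm (which is unknown from this bug): `sniff_lenient` and `sniff_strict` are hypothetical names, and the two functions are only models chosen so that they reproduce the black-box results reported above. Neither normalizes the label's case.

```python
import re
from typing import Optional


def _declaration_body(head: bytes) -> Optional[str]:
    """Return the text between '<?xml' and '?>' if the input starts with one."""
    if not head.startswith(b"<?xml"):
        return None
    end = head.find(b"?>")
    if end == -1:
        return None
    return head[5:end].decode("ascii", errors="replace")


def sniff_lenient(head: bytes) -> Optional[str]:
    """Hypothetical lenient model: find the substring 'encoding' anywhere."""
    body = _declaration_body(head)
    if body is None:
        return None
    # Matches even inside another attribute's name or value.
    match = re.search(r"encoding\s*=\s*(['\"])(.*?)\1", body)
    return match.group(2) if match else None


# One pseudo-attribute: whitespace, name, '=', quoted value.
_ATTR = re.compile(r"\s+([A-Za-z]+)\s*=\s*(?:\"([^\"]*)\"|'([^']*)')")


def sniff_strict(head: bytes) -> Optional[str]:
    """Hypothetical strict model: tokenize attributes, require name == 'encoding'."""
    body = _declaration_body(head)
    if body is None:
        return None
    pos = 0
    while True:
        match = _ATTR.match(body, pos)
        if match is None:
            return None
        name = match.group(1)
        value = match.group(2) if match.group(2) is not None else match.group(3)
        if name == "encoding":
            return value
        pos = match.end()


# The test documents from this comment, reduced to their first line.
cases = [
    b"<!DOCTYPE html>",
    b"<?xml version=\"1.0\" encoding=\"KOI8-R\"?>",
    b"<?xml oxencoding=\"KOI8-R\" version=\"1.0\"?>",
    b"<?xml version=\"encoding = 'KOI8-R'\"?>",
    b"<?xml encoding = 'KOI8-R'?>",
]
for case in cases:
    print(case, sniff_lenient(case), sniff_strict(case))
```

Under these assumptions the lenient model returns "KOI8-R" for every declaration above, while the strict model returns nothing for the "oxencoding" and `version="encoding = ..."` cases, where a browser would then fall back to its default such as windows-1252.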
I filed https://github.com/whatwg/html/issues/1438 against HTML. I suggest we fix this as part of bug 673087 since that's the older bug?
Flags: needinfo?(annevk)
That's probably fine, yes.
(In reply to Boris Zbarsky [:bz] from comment #4)
> Given Edge, WebKit, and Blink agreeing, we may just want to change the HTML
> spec and our behavior....

I agree. Let's fix our behavior over at bug 673087 once we have spec text to implement.
Flags: needinfo?(hsivonen)