Closed Bug 1280556 Opened 8 years ago Closed 8 years ago

Encoding detection mismatch on http://www.idpf.org/epub/pgt/

Categories

(Core :: Internationalization, defect)

RESOLVED DUPLICATE of bug 673087

People

(Reporter: annevk, Unassigned)


Chrome manages to detect UTF-8 somehow.
Could you please explain more? I didn't see any obvious problem.
(In reply to Masatoshi Kimura [:emk] from comment #1)
> Could you please explain more? I didn't see any obvious problem.

I see windows-1252 and "Copyright © 2011, 2012 International Digital Publishing Forum™"
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE
Edge shows this site as UTF-8 too.  IE11 behaves like Firefox.

Given Edge, WebKit, and Blink agreeing, we may just want to change the HTML spec and our behavior....
Flags: needinfo?(hsivonen)
Flags: needinfo?(annevk)
> Given Edge, WebKit, and Blink agreeing, we may just want to change the HTML spec and our behavior....

What would the change to the HTML spec be? You don’t mean a requirement to use the encoding in the XML declaration? That’s not what Edge, WebKit, and Blink are doing is it? I thought their behavior was just from doing their own heuristics, as you mention in https://bugzilla.mozilla.org/show_bug.cgi?id=673087#c11

> The page has no encoding specified anywhere, so the browser can do whatever heuristics it wants, no?  

…and if so, that’s not something that would require a change to the HTML spec, right?
> You don’t mean a requirement to use the encoding in the XML declaration?

That's exactly what I mean, yes.  Obviously its priority wrt other sources of encoding information would need to be sorted out.

> That’s not what Edge, WebKit, and Blink are doing is it?

That's _precisely_ what they are doing.  Here's a simple testcase in case you want to black-box test this.  This document, given no other encoding information (e.g. from file://):

  <!DOCTYPE html>
  <script>
  document.write(document.charset);
  </script>
  Some text.

shows "windows-1252" in Chrome and Edge and "ISO-8859-1" in Safari.  Chrome and Safari are both US localizations on US-localized Mac OS; I'm running Edge via BrowserStack, but I assume it's equivalent (US localization on a US-localized operating system).  On the other hand, this document:

  <?xml version="1.0" encoding="KOI8-R"?>
  <!DOCTYPE html>
  <script>
  document.write(document.charset);
  </script>
  Some text.

shows "KOI8-R" in Chrome/Safari and "koi8-r" in Edge.  Neither document contains any non-ASCII characters that could be used in any meaningful heuristics, so all three engines are in fact using the encoding in the XML declaration.  Also, note that this is not a case of "xml declaration just means UTF-8".
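
As a rough illustration of the behavior described above, here is a minimal JavaScript (Node.js) sketch that honors an encoding label from a leading XML declaration and otherwise falls back to a default.  The function names, the 1024-byte prefix limit, and the windows-1252 fallback are assumptions made for illustration; this is not any engine's actual code.

```javascript
// Hypothetical helper, not taken from any browser engine.
// Looks for an XML declaration at the very start of the byte stream and
// returns the label of its "encoding" pseudo-attribute, or null if absent.
function sniffXmlDeclarationEncoding(bytes) {
  // Inspect only a bounded prefix (the 1024-byte limit is an assumption).
  // latin1 maps each byte to one code unit, so ASCII markup stays intact.
  const head = Buffer.from(bytes.subarray(0, 1024)).toString("latin1");
  if (!head.startsWith("<?xml")) return null;
  const end = head.indexOf("?>");
  if (end === -1) return null;
  const decl = head.slice(5, end);
  // Require whitespace before "encoding" and a matching pair of quotes.
  const m = /\sencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1/.exec(decl);
  return m ? m[2] : null;
}

// The fallback mirrors the US-locale default reported above (an assumption).
function pickEncoding(bytes, fallback = "windows-1252") {
  return sniffXmlDeclarationEncoding(bytes) ?? fallback;
}

const plainDoc = Buffer.from("<!DOCTYPE html>\nSome text.", "latin1");
const xmlDeclDoc = Buffer.from(
  '<?xml version="1.0" encoding="KOI8-R"?>\n<!DOCTYPE html>\nSome text.',
  "latin1"
);
console.log(pickEncoding(plainDoc));   // windows-1252
console.log(pickEncoding(xmlDeclDoc)); // KOI8-R
```

Where such a check would rank relative to other sources of encoding information is left open here, as it is in the discussion.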

Of course in the case of Blink/WebKit you can just look at their source too.  For example, see the comment at https://chromium.googlesource.com/chromium/src.git/+/9f7c5f2/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp#305 and the code that follows.

> I thought their behavior was just from doing their own heuristics

You thought wrong.

I should add that the actual parsing of the XML declaration in Blink/WebKit does differ from that in Edge.  For example, this document:

  <?xml oxencoding="KOI8-R" version="1.0"?>
  <!DOCTYPE html>
  <script>
  document.write(document.charset);
  </script>
  Some text.

comes up "KOI8-R" in Chrome and Safari but "windows-1252" in Edge.  So does this document:

  <?xml version="encoding    =  'KOI8-R'"?>
  <!DOCTYPE html>
  <script>
  document.write(document.charset);
  </script>
  Some text.

The WebKit/Blink result is not surprising given the behavior of the function at https://chromium.googlesource.com/chromium/src.git/+/9f7c5f2/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp#154, but that behavior seems unlikely to be required for web compat, at least in terms of its treatment of "oxencoding".  I can't speak to the space-skipping or control-char-skipping aspects, though.  This document:

  <?xml encoding    =  'KOI8-R'?>
  <!DOCTYPE html>
  <script>
  document.write(document.charset);
  </script>
  Some text.

comes up "KOI8-R" in all of Chrome, Safari, and Edge.
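
For illustration, the Chrome/Safari results on the three documents above are consistent with a lenient scan that looks for the substring "encoding" anywhere inside the declaration, without checking what precedes it.  The JavaScript sketch below is a hypothetical reconstruction of that idea, not the WebKit/Blink source; the name lenientEncodingScan and the exact whitespace handling are assumptions.

```javascript
// Hypothetical lenient scan; NOT copied from WebKit/Blink.
// Finds the first occurrence of "encoding" anywhere in the XML declaration,
// then expects optional whitespace, "=", optional whitespace, and a quoted value.
function lenientEncodingScan(text) {
  if (!text.startsWith("<?xml")) return null;
  const end = text.indexOf("?>");
  if (end === -1) return null;
  const decl = text.slice(0, end);

  let pos = decl.indexOf("encoding");
  if (pos === -1) return null;
  pos += "encoding".length;

  const skipSpace = () => {
    while (pos < decl.length && /\s/.test(decl[pos])) pos++;
  };

  skipSpace();
  if (decl[pos] !== "=") return null;
  pos++;
  skipSpace();

  const quote = decl[pos];
  if (quote !== '"' && quote !== "'") return null;
  const close = decl.indexOf(quote, pos + 1);
  if (close === -1) return null;
  return decl.slice(pos + 1, close);
}

// The three declarations from the test documents above.
const declarations = [
  '<?xml oxencoding="KOI8-R" version="1.0"?>',
  "<?xml version=\"encoding    =  'KOI8-R'\"?>",
  "<?xml encoding    =  'KOI8-R'?>",
];
for (const d of declarations) {
  console.log(lenientEncodingScan(d)); // KOI8-R for each of the three
}
```

Because the scan never checks the character before "encoding", it matches inside "oxencoding" and inside the quoted version value.  A stricter parser that requires a standalone attribute name would return no label for the first two declarations, which lines up with the "windows-1252" result reported for Edge.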

On a more general note, if we have an area of non-interop, and sufficient interop problems that a major browser engine feels like it needs to change its behavior, that's a pretty good indicator that the spec needs to define things better.  So this does in fact require a change to the HTML spec in my opinion: the spec is not matching reality.
I filed https://github.com/whatwg/html/issues/1438 against HTML. I suggest we fix this as part of bug 673087 since that's the older bug?
Flags: needinfo?(annevk)
That's probably fine, yes.
(In reply to Boris Zbarsky [:bz] from comment #4)
> Given Edge, WebKit, and Blink agreeing, we may just want to change the HTML
> spec and our behavior....

I agree. Let's fix our behavior over at bug 673087 once we have spec text to implement.
Flags: needinfo?(hsivonen)