Open Bug 1176827 Opened 5 years ago Updated 3 years ago

Character encoding in Reader Mode is broken if page content without specified encoding is loaded off network with XHR (rather than using 'guessed' character encoding for existing document) - affects reloads, bookmarks, undo close tab, etc.

Categories

(Toolkit :: Reader Mode, defect, P3)

38 Branch
Unspecified
Linux
defect

Tracking

()

People

(Reporter: mnowak, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [reader-mode-readability-algorithm])

User Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0
Build ID: 20150609094003

Steps to reproduce:

Switched http://users.ox.ac.uk/~kgroup/tools/fieldwork.shtml to Reader Mode (got: "The character encoding of the HTML document was not declared. The document will render with garbled text in some browser configurations if the document contains characters from outside the US-ASCII range. The character encoding of the page must be declared in the document or in the transfer protocol." in Console) then hit Reload.


Actual results:

The text is now with undefined characters:

"Studying animals in the lab has the advantage that observation conditions can�to some degree at least�be controlled by the researcher." The character "—" is displayed as "�".

And:

"Likewise, only careful monitoring of wild populations across the island can reveal whether these crows indeed possess a �culture� of tool technology." "‘" & "’" → "�"


Expected results:

Correct character should be preserved.
WFM with FF38 on Win 7. Maybe specific to Linux.
OS: Unspecified → Linux
I can reproduce it on Windows XP, probably not OS-specific.
Priority: -- → P2
Summary: Character encoding in Reader Mode is broken after page without character encoding set is reload → Character encoding in Reader Mode is broken if page content without specified encoding is loaded off network with XHR (rather than using 'guessed' character encoding for existing document) - affects reloads, bookmarks, undo close tab, etc.
Duplicate of this bug: 1176921
Duplicate of this bug: 1178167
Status: UNCONFIRMED → NEW
Ever confirmed: true
Whiteboard: [reader-mode-readability-algorithm]
Blocks: 1286221
The root cause is that we get the document[1] in wrong encoding format from the `_downloadDocument` method[2]. It supported to be "ISO-8859-1", but the `XMLHttpRequest` object there get it in "UTF-8".

So somehow we might need to re-encode the document into correct encoding format, here is "ISO-8859-1" after it's downloaded. But it seems to spend too much resource to do that.

Or we could use `xhr.overrideMimeType` to switch to correct encoding before download it. But I think there might be no way to get the correct encoding format before download the document.

So maybe we should just need to re-encode the document after download it? What do you think, Gijs? Do you have any better idea?

[1]: http://users.ox.ac.uk/~kgroup/tools/fieldwork.shtml
[2]: http://searchfox.org/mozilla-central/source/toolkit/components/reader/ReaderMode.jsm#236-310
Flags: needinfo?(gijskruitbosch+bugs)
(In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #5)
> The root cause is that we get the document[1] in wrong encoding format from
> the `_downloadDocument` method[2]. It supported to be "ISO-8859-1", but the
> `XMLHttpRequest` object there get it in "UTF-8".
> 
> So somehow we might need to re-encode the document into correct encoding
> format, here is "ISO-8859-1" after it's downloaded. But it seems to spend
> too much resource to do that.

I don't understand this paragraph.

> Or we could use `xhr.overrideMimeType` to switch to correct encoding before
> download it. But I think there might be no way to get the correct encoding
> format before download the document.

Right, there is no way to know if the server does not send a header with the encoding in (which it doesn't) nor includes an equivalent <meta> tag (which it also doesn't). See https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest/HTML_in_XMLHttpRequest#Character_encoding .

> So maybe we should just need to re-encode the document after download it?
> What do you think, Gijs? Do you have any better idea?

We could use the arraybuffer responseType to get the raw binary data from the XHR. But I don't know that we expose an easy way of reusing whatever method the browser is using to go from the raw data to a document. I looked at using fetch() instead, and as far as I can tell that has the same problem. You can use XHR or fetch() to get a blob, then get a blob URI for that, then load that in a frame, but I don't see a way of making the frame not make any external requests (though it happens to work with this particular page because its references are relative, that won't be the case in general, and if you manipulate the DOM to have an absolute-ref'd <img> it still loads it).

So, in order to do better here, we'd need to solve at least one of these problems:
1) find a way of applying the html character encoding logic to xhr/fetch input
2) find a way of loading a blob'd html document in a(n offscreen) frame while not allowing it to run any script or load any images.

Both of these are things to ask a platform engineer - I'm not aware of solutions. Perhaps you know someone in Taipei who can help?

TBH, considering this only happens with pages that specify no character encoding at all, I'm not convinced this should stay P2, especially if we can't find answers to either (1) or (2).
Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(evan)
> We could use the arraybuffer responseType to get the raw binary data from
> the XHR. But I don't know that we expose an easy way of reusing whatever
> method the browser is using to go from the raw data to a document. I looked
> at using fetch() instead, and as far as I can tell that has the same
> problem. You can use XHR or fetch() to get a blob, then get a blob URI for
> that, then load that in a frame, but I don't see a way of making the frame
> not make any external requests (though it happens to work with this
> particular page because its references are relative, that won't be the case
> in general, and if you manipulate the DOM to have an absolute-ref'd <img> it
> still loads it).
> 
> So, in order to do better here, we'd need to solve at least one of these
> problems:
> 1) find a way of applying the html character encoding logic to xhr/fetch
> input
> 2) find a way of loading a blob'd html document in a(n offscreen) frame
> while not allowing it to run any script or load any images.
> 
> Both of these are things to ask a platform engineer - I'm not aware of
> solutions. Perhaps you know someone in Taipei who can help?
I don't know either.

I looked at `page-worker`[1] to see if it can help the number 2 problem, "find a way of loading a blob'd html document in a(n offscreen) frame while not allowing it to run any script or load any images". Looks like we can use it to forbid executing script but seems no way to forbid loading images or other resources.

> TBH, considering this only happens with pages that specify no character
> encoding at all, I'm not convinced this should stay P2, especially if we
> can't find answers to either (1) or (2).
Agree, this don't need to stay P2. If OK, let's change the priority. I think we could focus other P2 bugs first.

[1]: https://developer.mozilla.org/en-US/Add-ons/SDK/High-Level_APIs/page-worker
Flags: needinfo?(evan) → needinfo?(gijskruitbosch+bugs)
(In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #7)
> > TBH, considering this only happens with pages that specify no character
> > encoding at all, I'm not convinced this should stay P2, especially if we
> > can't find answers to either (1) or (2).
> Agree, this don't need to stay P2. If OK, let's change the priority. I think
> we could focus other P2 bugs first.

Yes. Note that I think the perf regression which we just merged to aurora should really be treated as P1. We can't ship that.
Flags: needinfo?(gijskruitbosch+bugs)
(In reply to :Gijs Kruitbosch from comment #8)
> (In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #7)
> > > TBH, considering this only happens with pages that specify no character
> > > encoding at all, I'm not convinced this should stay P2, especially if we
> > > can't find answers to either (1) or (2).
> > Agree, this don't need to stay P2. If OK, let's change the priority. I think
> > we could focus other P2 bugs first.
> 
> Yes. Note that I think the perf regression which we just merged to aurora
> should really be treated as P1. We can't ship that.

Let's set the importance of this bug as P3.
Priority: P2 → P3
Duplicate of this bug: 1378684
You need to log in before you can comment on or make changes to this bug.