Closed Bug 801097 Opened 7 years ago Closed 7 years ago
Reader mode title shows HTML character encodings
I noticed this when looking at the image on http://www.technostarry.com/android/how-to-enable-and-use-reader-mode-in-firefox-for-android/ STR: 1) Go to http://mrintech.com/5223/free-paid-responsive-wordpress-themes-templates/ 2) Click the reader icon 3) Add the page to the reading list 4) Look at the reading list Expected results: Shows "Best Free & Paid ..." Actual results: Shows "Best Free & Paid ..."
This seems to work.
Assignee: nobody → markcapella
Status: NEW → ASSIGNED
Attachment #707103 - Flags: review?(bnicholson)
Comment on attachment 707103 [details] [diff] [review] Patch (v1) Thanks for looking at this. I'm not sure we want a catch-all fix since it will decode page titles that shouldn't necessarily be decoded. For instance, search "&" on Google; the page title should show as "& - Google Search" (I'm using Firefox desktop as a model for the correct behavior). However, with this patch, it appears as "& - Google Search". I think it would be better to handle this from Gecko, so you shouldn't have to touch the Java code at all. I think you should just be able to decode the title before returning article here: http://mxr.mozilla.org/mozilla-central/source/mobile/android/chrome/content/browser.js#7834
Attachment #707103 - Flags: review?(bnicholson) → review-
yah ... I wondered about just placing the decode into here: http://mxr.mozilla.org/mozilla-central/source/mobile/android/chrome/content/Readability.js#147 But I'm not extremely proficient at JS ... is there something built-in to the language to do that? (decodeURI() ? )
(In reply to Mark Capella [:capella] from comment #4) > yah ... I wondered about just placing the decode into here: > http://mxr.mozilla.org/mozilla-central/source/mobile/android/chrome/content/ > Readability.js#147 > > But I'm not extremely proficient at JS ... is there something built-in to > the language to do that? (decodeURI() ? ) Unfortunately, no. One technique I've seen used in the past is creating an element, setting its innerHTML, then using its textContent: let el = document.createElement("span"); el.innerHTML = encodedHtml; let decodedHtml = el.textContent; One concern I have about doing this here is that the document in browser.js is privileged, so we'd need to be careful inserting arbitrary HTML into one of its elements. I assume this should be safe as long as the element isn't actually appended to the document, but we should have a security peer take a look if we were to use this approach. If we were to do it this way, the element creation code would need to be in browser.js since Readability.js doesn't have access to a real DOM document (it uses http://mxr.mozilla.org/mozilla-central/source/mobile/android/chrome/content/JSDOMParser.js). Another possibility would be to create a mapping of encoded strings -> decoded strings, then do a replace on the title. But it'd be difficult to be completely thorough as there's quite a few character entities (see http://www.w3.org/TR/html4/sgml/entities.html).
We could try to use nsIParserUtils.convertToPlainText IDL: http://mxr.mozilla.org/mozilla-central/source/parser/html/nsIParserUtils.idl#104 xample: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/feeds/FeedProcessor.js#618
Thanks to mfinkle for the tip about convertToPlainText() ... this patch fixs things nicely for the problem page reported in comment #0 ... However... google search for |&| remains problematic ... the new screenshot shows how it's treated in the current nightly's readermode, and after my patch is applied ... seems we substitute one set of problems for another. I'm starting to wonder if this falls in the category of intentional page design ... and since we can't read the authors minds we render as best as we can ... (ala cantfix) ? (maybe I'm tired and missing something obvious)
Comment on attachment 707656 [details] [diff] [review] Patch (v2) (In reply to Mark Capella [:capella] from comment #7) > Created attachment 707656 [details] [diff] [review] > Patch (v2) > > Thanks to mfinkle for the tip about convertToPlainText() ... this patch fixs > things nicely for the problem page reported in comment #0 ... > Nice, convertToPlainText() is exactly what we need. > However... google search for |&| remains problematic ... the new > screenshot shows how it's treated in the current nightly's readermode, and > after my patch is applied ... seems we substitute one set of problems for > another. > > I'm starting to wonder if this falls in the category of intentional page > design ... and since we can't read the authors minds we render as best as we > can ... (ala cantfix) ? > > (maybe I'm tired and missing something obvious) So here's the problem: We use JSDOMParser.js for reader mode parses. It's a custom DOM parser that doesn't include certain things we often take for granted when using the DOM such as HTML decoding. If you search "&" on Google in desktop Firefox, you'll notice that the <title> element is "&amp; - Google Search", but document.title will be "& - Google Search" (the same thing shown in the title bar). The same is true for the textContent property of any element in a page; if the HTML contents of that element include an encoded entity, the textContent property of that element will be the decoded entity. Note that you can get the raw contents by using the element's innerHTML property. However, JSDOMParser.js doesn't decode any HTML. Using this Google example, document.title will give you "&amp; - Google Search". In JSDOMParser.js, textContent is identical to innerHTML. When we get back a result from the reader worker, we're getting an article whose title is not decoded (since the article was parsed using JSDOMParser.js). When we save the article in the database , send the article information to Java , or set the document title of about:reader , we should use the decoded title since none of these things would expect encoded HTML. Your patch does this. But in the case of , we're putting the article title back into a page by setting the innerHTML of an element in the about:reader document. When you set the innerHTML, its encoded entities will be decoded when they are shown on the page (as you see in your screenshot in comment 8). In your patch, you've already decoded the title, so rather than setting the innerHTML at , you should be setting the textContent.  http://mxr.mozilla.org/mozilla-central/source/mobile/android/chrome/content/browser.js#7612  http://mxr.mozilla.org/mozilla-central/source/mobile/android/chrome/content/aboutReader.js#204  http://mxr.mozilla.org/mozilla-central/source/mobile/android/chrome/content/aboutReader.js#462  http://mxr.mozilla.org/mozilla-central/source/mobile/android/chrome/content/aboutReader.js#461
Attachment #707656 - Flags: feedback+
Comment on attachment 708074 [details] [diff] [review] Patch (v3) Review of attachment 708074 [details] [diff] [review]: ----------------------------------------------------------------- You have f=mfinkle in your commit message, but I didn't see mfinkle give feedback+ anywhere. It's not a big deal since it's just for feedback, but we normally don't put f= in the commit message unless someone has explicitly flagged a patch as feedback+. ::: mobile/android/chrome/content/browser.js @@ +7802,5 @@ > + let result = Cc["@mozilla.org/parserutils;1"].getService(Ci.nsIParserUtils). > + convertToPlainText(article.title, > + Ci.nsIDocumentEncoder.OutputSelectionOnly | Ci.nsIDocumentEncoder.OutputAbsoluteLinks, > + 0); > + article.title = result; Nit: The spacing here seems a bit weird, especially with the 0 being on its own line. Maybe you can change it to something like this? let flags = Ci.nsIDocumentEncoder.OutputSelectionOnly | Ci.nsIDocumentEncoder.OutputAbsoluteLinks; article.title = Cc["@mozilla.org/parserutils;1"].getService(Ci.nsIParserUtils) .convertToPlainText(article.title, flags, 0);
Attachment #708074 - Flags: review?(bnicholson) → review+
Great, thanks ... I'll fix both, |f=| was a hat-tip for nsIParserUtils.convertToPlainText (I think I picked up the habit outside the mobile group)
Oh, and I used a temp var |result| since here: http://mxr.mozilla.org/mozilla-central/source/parser/html/nsIParserUtils.idl#97 says C++ callers are allowed ... blah blah blah ... wondered if that implied JS would have issues?
(In reply to Mark Capella [:capella] from comment #14) > Oh, and I used a temp var |result| since here: > http://mxr.mozilla.org/mozilla-central/source/parser/html/nsIParserUtils. > idl#97 > says C++ callers are allowed ... blah blah blah ... wondered if that implied > JS would have issues? It shouldn't - that just means C++ can use the same string object without creating a new one (not relevant for JS).
ever the cautious https://tbpl.mozilla.org/?tree=Try&rev=f6c2a94fba07
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 21
This issue is not reproducible on the latest Nightly. Closing bug as verified fixed on Nightly 21.0a1(2013-02-07) using HTC Desire Z(Android 2.3.3)
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.