Closed Bug 1173548 Opened 9 years ago Closed 8 years ago

Reader View picks wrong direction for some RTL pages

Categories

(Toolkit :: Reader Mode, defect, P2)

38 Branch
defect

Tracking

()

RESOLVED FIXED
mozilla53
Tracking Status
firefox53 --- fixed

People

(Reporter: mehdi, Assigned: evanxd)

References

(Blocks 1 open bug)

Details

(Keywords: rtl, Whiteboard: [reader-mode-readability-algorithm])

Attachments

(2 files)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0 Iceweasel/38.0.1 Build ID: 20150525043631 Steps to reproduce: Go to this page: http://mehdix.ir/rtl-feed.html click the reader icon. The ouput's direction is left to right. If it shows correctly (rtl) for you, don't stop. Try this page: http://www.bbc.com/persian/sport/2015/06/150610_world_cup_2026_bidding_delayed and press the reader icon, the direction is false. However on this page the direction is showing correctly: https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D9%87%D8%A8%D9%88%D8%B7_%D8%B9%D9%84%D9%89_%D8%B3%D8%B7%D8%AD_%D8%A7%D9%84%D9%82%D9%85%D8%B1 Actual results: Just browsing different pages with my updated iceweasel 38.0.1 on debian jessie that I see the direction is wrong for most RTL pages that I visit. Expected results: The content should appear RTL if the main page is also RTL. I change direction in css having `body { direction: rtl; }` not `dir="rtl"` as html attributed.
Component: Untriaged → Reader Mode
Keywords: rtl
Product: Firefox → Toolkit
(In reply to Mehdi Sadeghi from comment #0) > I change > direction in css having `body { direction: rtl; }` not `dir="rtl"` as html > attributed. This is not a good idea -- both the HTML spec [1] and the CSS spec [2] recommend setting direction with the HTML dir attribute and not the CSS direction property. That said, I don't know exactly how Reader View chooses a direction, and we may well be able to improve things here. [1] https://html.spec.whatwg.org/multipage/dom.html#the-dir-attribute:the-dir-attribute-18 [2] http://www.w3.org/TR/css-writing-modes-3/#bidi
(In reply to Simon Montagu :smontagu from comment #1) > (In reply to Mehdi Sadeghi from comment #0) > > I change > > direction in css having `body { direction: rtl; }` not `dir="rtl"` as html > > attributed. > > This is not a good idea -- both the HTML spec [1] and the CSS spec [2] > recommend setting direction with the HTML dir attribute and not the CSS > direction property. > > That said, I don't know exactly how Reader View chooses a direction, and we > may well be able to improve things here. > > [1] > https://html.spec.whatwg.org/multipage/dom.html#the-dir-attribute:the-dir- > attribute-18 > [2] http://www.w3.org/TR/css-writing-modes-3/#bidi Setting direction with the HTML dir attribute solved my problem. I change the status to RESOLVED.
Status: UNCONFIRMED → RESOLVED
Closed: 9 years ago
Resolution: --- → INVALID
I still want to investigate whether we can improve behaviour on sites like http://www.bbc.com/persian/sport/2015/06/150610_world_cup_2026_bidding_delayed where there is no dir attribute on the body, but all the content is contained in a <div> with dir="rtl".
Status: RESOLVED → REOPENED
Ever confirmed: true
Resolution: INVALID → ---
Web designers should use dir attribute instead of CSS direction property, but there are sites out there not conforming to the recommendation. We can still do better on this by preserving the text direction. A very safe implementation is to get computed style of the extracted elements, and set this to the dir attribute. This would guarantee the text direction is correct, but takes more time. DOM distiller is a similar project, but for Chromium. Their implementation is here: https://github.com/chromium/dom-distiller/search?utf8=%E2%9C%93&q=cloneSubtreeRetainDirection
(In reply to Wei-Yin Chen from comment #5) > A very safe implementation is to get computed style of the extracted > elements, and set this to the dir attribute. This would guarantee the text > direction is correct, but takes more time. DOM distiller is a similar > project, but for Chromium. Their implementation is here: > https://github.com/chromium/dom-distiller/ > search?utf8=%E2%9C%93&q=cloneSubtreeRetainDirection This sounds sensible to me. This is broken on both desktop and Android Firefox. From the opening bug description, http://mehdix.ir/rtl-feed.html is broken for me and http://www.bbc.com/persian/sport/2015/06/150610_world_cup_2026_bidding_delayed works correctly. Another site that is broken for me: http://www.nrg.co.il/online/1/ART2/749/875.html?hp=1&cat=402&loc=1
Felipe, can CLD or maybe Intl help and determine directionality of text (or maybe just language, and can we make an inference based on that, if no direction is set and we're confident we're looking at Arabic/Hebrew/something-else-we-know-is-rtl ?)
Flags: needinfo?(felipc)
The charset autodetection code could help find the page language, but I am not sure if this component is not obsolete in Firefox. It should be better to set direction by checking the page direction and not the language, even that it will break the directionality of (very!) old webpages that have never used the direction directive. Since the problem is about pages that are written with RTL directionality, I guess that such webpages should have most of its content marked with dir=rtl or direction:rtl. RTL mode should be triggered if there is RTL content on the page, or if more than a given percentage of the page content is actually RTL.
(In reply to Amir Aharoni from comment #6) > (In reply to Wei-Yin Chen from comment #5) > > A very safe implementation is to get computed style of the extracted > > elements, and set this to the dir attribute. This would guarantee the text > > direction is correct, but takes more time. DOM distiller is a similar > > project, but for Chromium. Their implementation is here: > > https://github.com/chromium/dom-distiller/ > > search?utf8=%E2%9C%93&q=cloneSubtreeRetainDirection > > This sounds sensible to me. > > This is broken on both desktop and Android Firefox. > > From the opening bug description, http://mehdix.ir/rtl-feed.html is broken > for me and > http://www.bbc.com/persian/sport/2015/06/ > 150610_world_cup_2026_bidding_delayed works correctly. I'm not sure how it was back when I posted this comment, but now it's definitely the other way: * correct: http://mehdix.ir/rtl-feed.html * broken: http://www.bbc.com/persian/sport/2015/06/150610_world_cup_2026_bidding_delayed
(In reply to Amir Aharoni from comment #11) > * correct: http://mehdix.ir/rtl-feed.html Direction set on the HTML tag. <html dir="rtl" lang="fa"> > * broken: > http://www.bbc.com/persian/sport/2015/06/ > 150610_world_cup_2026_bidding_delayed Direction set on the content (and on <head> - why?!) <div class="direction" dir="rtl"> Broken: http://www.nrg.co.il/online/1/ART2/749/875.html?hp=1&cat=402&loc=1 - Direction set on CSS: #page-wrap {direction: rtl;} Correct: https://hwzone.co.il/%D7%94%D7%A6%D7%A6%D7%94-%D7%9C%D7%9E%D7%95%D7%A6%D7%A8%D7%99-%D7%94%D7%A2%D7%99%D7%93%D7%9F-%D7%94%D7%97%D7%93%D7%A9-%D7%A9%D7%9C-amd-%D7%A9%D7%99%D7%92%D7%99%D7%A2%D7%95-%D7%91%D7%A9%D7%A0%D7%AA-201/ - Direction set on HTML: <html dir="rtl" lang="he-IL">
Yeah, the CLD (LanguageDetector.jsm) should be able to provide a locale from a given chunk of text, which could them be run through nsIChromeRegistry.isLocaleRTL, which checks the "ui.direction.*" prefs to determine if it's in the list of RTL locales. The problem is that there could be text with mixed languages which will never look right one way or the other. I think that Reader Mode should get the computed style for the `direction` style on each chunk of text that it decided to keep, and use that information. Does Reader Mode have a direct knowledge of the DOM elements it's keeping? Or does the Readability API only give back the filtered plain text?
Flags: needinfo?(felipc)
(In reply to :Felipe Gomes (needinfo me!) from comment #13) > Does Reader Mode have a direct knowledge of the DOM elements it's keeping? No. It doesn't even have more than the HTML of the original page in some cases, so "computed style" isn't available, because there's no CSS except inline styles. So really, attempting to rely on the direction: style is hard/impossible without completely rearchitecturing the whole thing. We might be able to make the common case of clicking to enter reader mode slightly better by storing directionality from the loaded document, but even that is non-trivial. And even if we do that it won't always work because e.g. from session restore / refresh / back/fwd navigation, we won't necessarily have this information (we just XHR the HTML). Hence asking for other options.
(In reply to :Gijs Kruitbosch from comment #14) > No. It doesn't even have more than the HTML of the original page in some > cases, so "computed style" isn't available, because there's no CSS except > inline styles. Does it means that the reader mode is not aware of stylesheets at all? Because it may make it difficult to guess direction that is set on the stylesheets.
(In reply to Tomer Cohen :tomer from comment #15) > (In reply to :Gijs Kruitbosch from comment #14) > > No. It doesn't even have more than the HTML of the original page in some > > cases, so "computed style" isn't available, because there's no CSS except > > inline styles. > Does it means that the reader mode is not aware of stylesheets at all? Yes. > Because it may make it difficult to guess direction that is set on the > stylesheets. Not just "may" - it *does* make it *impossible* to know. There's a good reason the specs suggest using the "dir" attribute instead of CSS, see comment #1. It's a semantics rather than just a styling change anyway. Anyway, this is the web we get and so we should try and deal with it, somehow.
(In reply to Amir Aharoni from comment #11) > I'm not sure how it was back when I posted this comment, but now it's > definitely the other way: > * correct: http://mehdix.ir/rtl-feed.html It is correct because I have updated the website to reflect changes described in comment #1.
Priority: -- → P2
Status: REOPENED → NEW
Whiteboard: [reader-mode-readability-algorithm]
(In reply to :Gijs Kruitbosch from comment #16) > (In reply to Tomer Cohen :tomer from comment #15) > > (In reply to :Gijs Kruitbosch from comment #14) > > > No. It doesn't even have more than the HTML of the original page in some > > > cases, so "computed style" isn't available, because there's no CSS except > > > inline styles. > > Does it means that the reader mode is not aware of stylesheets at all? > > Yes. > > > Because it may make it difficult to guess direction that is set on the > > stylesheets. > > Not just "may" - it *does* make it *impossible* to know. There's a good > reason the specs suggest using the "dir" attribute instead of CSS, see > comment #1. It's a semantics rather than just a styling change anyway. Gijs, Since it deals with the "Direction set on CSS" case impossibly, should we set "WONTFIX" here? Or we should focus on the "Direction set on HTML" cases here? If we want to deal with the "Direction set on HTML" cases, somehow we might need to find a good way to get possible "dir" attribute in a web page to instead of current solution, `doc.documentElement.getAttribute("dir")`. For example, check the "dir" attribute of result HTML text selected by Reader Mode algorithm from a web page, parent(parent's parent...) node of the result HTML text, head tag, and body tag. Any thought? [1]: http://searchfox.org/mozilla-central/source/toolkit/components/reader/Readability.js#666
Flags: needinfo?(gijskruitbosch+bugs)
Is it possible for the reader mode to read two blocks with different directionality? <div dir="ltr">foo</div> <div dir="rtl">foo2</div> Also, it may be valuable to recognize directionality of text without markup (tags like RLI/FSI/PDI may indicate that), but then you'd need similar code to one layout uses to interpret the direction from the chunk.
(In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #19) > Since it deals with the "Direction set on CSS" case impossibly, should we > set "WONTFIX" here? Or we should focus on the "Direction set on HTML" cases > here? > > If we want to deal with the "Direction set on HTML" cases, somehow we might > need to find a good way to get possible "dir" attribute in a web page to > instead of current solution, `doc.documentElement.getAttribute("dir")`. For > example, check the "dir" attribute of result HTML text selected by Reader > Mode algorithm from a web page, parent(parent's parent...) node of the > result HTML text, head tag, and body tag. We should do two things: 1) look for a "dir" attribute on the ancestors of the elements we take out of the original DOM, not just on the document element. 2) for cases where that does not obtain a dir attribute, in the reader page JS/HTML (not in the readability code itself, where we can do the other stuff), abstract the language detection that we're doing for narrate anyway out, and make it more accessible, then use that to determine a fallback rtl/ltr-ness as per comment #13 .
Flags: needinfo?(gijskruitbosch+bugs)
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #20) > Is it possible for the reader mode to read two blocks with different > directionality? > > <div dir="ltr">foo</div> > <div dir="rtl">foo2</div> This question does not make sense to me. What does "read" mean? If the dir attribute is included in the markup reader mode extracts from a page, we don't touch it, and things should display fine. The problem is with dir attributes that are in some non-extracted ancestor of the content that we do extract, but not on the <body>/<html> tags, and pages that use CSS only to indicate directionality. So, what you're talking about is not currently a problem. > Also, it may be valuable to recognize directionality of text without markup > (tags like RLI/FSI/PDI may indicate that), but then you'd need similar code > to one layout uses to interpret the direction from the chunk. Why would we need to do layout's job for it? We don't specifically set dir=ltr if we don't find any indication to do that, and the content still has the same text in it. If layout already does not automatically determine it should be displayed RTL then I don't see why we would be able to do any better.
(In reply to :Gijs Kruitbosch (PTO Nov. 17-20 inclusive) from comment #21) > (In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #19) > > Since it deals with the "Direction set on CSS" case impossibly, should we > > set "WONTFIX" here? Or we should focus on the "Direction set on HTML" cases > > here? > > > > If we want to deal with the "Direction set on HTML" cases, somehow we might > > need to find a good way to get possible "dir" attribute in a web page to > > instead of current solution, `doc.documentElement.getAttribute("dir")`. For > > example, check the "dir" attribute of result HTML text selected by Reader > > Mode algorithm from a web page, parent(parent's parent...) node of the > > result HTML text, head tag, and body tag. > > We should do two things: > 1) look for a "dir" attribute on the ancestors of the elements we take out > of the original DOM, not just on the document element. Got it. > 2) for cases where that does not obtain a dir attribute, in the reader page > JS/HTML (not in the readability code itself, where we can do the other > stuff), abstract the language detection that we're doing for narrate anyway > out, and make it more accessible, then use that to determine a fallback > rtl/ltr-ness as per comment #13 . One question here, what do you mean "make it more accessible"? In my understanding, we could do language detection for the `article.content` variable in the `_showContent` method[1]. And the language detection sample code is like below[2] ``` LanguageDetector.detectLanguage(sampleText).then(result => { resolve(result.confident ? result.language : null); }); ``` Looks like we can just reuse `LanguageDetector.jsm` without changing code for it, right? [1]: http://searchfox.org/mozilla-central/source/toolkit/components/reader/AboutReader.jsm#769-805 [2]: http://searchfox.org/mozilla-central/source/toolkit/components/narrate/Narrator.jsm#39-41
Flags: needinfo?(gijskruitbosch+bugs)
(In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #23) > (In reply to :Gijs Kruitbosch (PTO Nov. 17-20 inclusive) from comment #21) > > 2) for cases where that does not obtain a dir attribute, in the reader page > > JS/HTML (not in the readability code itself, where we can do the other > > stuff), abstract the language detection that we're doing for narrate anyway > > out, and make it more accessible, then use that to determine a fallback > > rtl/ltr-ness as per comment #13 . > One question here, what do you mean "make it more accessible"? > > In my understanding, we could do language detection for the > `article.content` variable in the `_showContent` method[1]. And the language > detection sample code is like below[2] > ``` > LanguageDetector.detectLanguage(sampleText).then(result => { > resolve(result.confident ? result.language : null); > }); > ``` > > Looks like we can just reuse `LanguageDetector.jsm` without changing code > for it, right? > > [1]: > http://searchfox.org/mozilla-central/source/toolkit/components/reader/ > AboutReader.jsm#769-805 > [2]: > http://searchfox.org/mozilla-central/source/toolkit/components/narrate/ > Narrator.jsm#39-41 I mean that we should call LanguageDetector.jsm in reader mode code instead of in narrator code, and the result be saved somewhere where it is accessible to the narrator code - the narrator code can depend on the reader mode code but not vice versa.
Flags: needinfo?(gijskruitbosch+bugs)
Assignee: nobody → evan
> 1) look for a "dir" attribute on the ancestors of the elements we take out of the original DOM, not just on the document element. Fixed the first problem: https://github.com/evanxd/readability/commit/d2bc22aa775f7d6e3bc512bb7991e09f9cb71910
Fixed some issues of the patch and sent a pull request for fixing problem one: https://github.com/mozilla/readability/pull/320
> 2) for cases where that does not obtain a dir attribute, in the reader page JS/HTML (not in the readability code itself, where we can do the other stuff), abstract the language detection that we're doing for narrate anyway out, and make it more accessible, then use that to determine a fallback rtl/ltr-ness as per comment #13 . Since we need to write another patch to fix problem two, we should file a new bug and fix it there.
Depends on: 1318605
Blocks: 1318605
No longer depends on: 1318605
Hi Gijs, I've updated readability from github repo, includes fix for Bug 1173548 and Bug 1255978. Could you take a look and help land it? Thanks.
Comment on attachment 8814827 [details] No bug - update readability from github repo, includes fix for Bug 1173548 and Bug 1255978, https://reviewboard.mozilla.org/r/95940/#review95918
Attachment #8814827 - Flags: review?(gijskruitbosch+bugs) → review+
Pushed a new try[1] to ensure `tc-M-V` test is good. In previous run, it is all failed for unknown reason. Once it's good, let's land it. [1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8fdcbe4c14af
(In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #37) > Pushed a new try[1] to ensure `tc-M-V` test is good. In previous run, it is > all failed for unknown reason. Once it's good, let's land it. > > [1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8fdcbe4c14af Those are tier-2 and those oranges are not your fault.
Pushed by gijskruitbosch@gmail.com: https://hg.mozilla.org/integration/autoland/rev/e9ff40069326 No bug - update readability from github repo, includes fix for Bug 1173548 and Bug 1255978, r=Gijs
Status: NEW → RESOLVED
Closed: 9 years ago8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
QA Whiteboard: [good first verify]
I have reproduced this bug with Nightly 41.0a1 (2015-06-10) on WIndows 7, 64 Bit! This bg's fix is verified with latest Beta! Build ID : 20170403072723 User Agent : Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0 [bugday-20170412]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: