The default bug view has changed. See this FAQ.

Reader View picks wrong direction for some RTL pages

RESOLVED FIXED in Firefox 53

Status

()

Toolkit
Reader Mode
P2
normal
RESOLVED FIXED
2 years ago
19 days ago

People

(Reporter: Mehdi Sadeghi, Assigned: evanxd)

Tracking

(Blocks: 1 bug, {rtl})

38 Branch
mozilla53
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox53 fixed)

Details

(Whiteboard: [reader-mode-readability-algorithm])

MozReview Requests

()

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(2 attachments)

(Reporter)

Description

2 years ago
Created attachment 8620601 [details]
Bildschirmfoto vom 2015-06-10 23:37:25.png

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0 Iceweasel/38.0.1
Build ID: 20150525043631

Steps to reproduce:

Go to this page: http://mehdix.ir/rtl-feed.html
click the reader icon. The ouput's direction is left to right. If it shows correctly (rtl) for you, don't stop. Try this page: http://www.bbc.com/persian/sport/2015/06/150610_world_cup_2026_bidding_delayed and press the reader icon, the direction is false. However on this page the direction is showing correctly: https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D9%87%D8%A8%D9%88%D8%B7_%D8%B9%D9%84%D9%89_%D8%B3%D8%B7%D8%AD_%D8%A7%D9%84%D9%82%D9%85%D8%B1


Actual results:

Just browsing different pages with my updated iceweasel 38.0.1 on debian jessie that I see the direction is wrong for most RTL pages that I visit.


Expected results:

The content should appear RTL if the main page is also RTL. I change direction in css having `body { direction: rtl; }` not `dir="rtl"` as html attributed.

Updated

2 years ago
Component: Untriaged → Reader Mode
Keywords: rtl
Product: Firefox → Toolkit
(In reply to Mehdi Sadeghi from comment #0)
> I change
> direction in css having `body { direction: rtl; }` not `dir="rtl"` as html
> attributed.

This is not a good idea -- both the HTML spec [1] and the CSS spec [2] recommend setting direction with the HTML dir attribute and not the CSS direction property.

That said, I don't know exactly how Reader View chooses a direction, and we may well be able to improve things here.

[1] https://html.spec.whatwg.org/multipage/dom.html#the-dir-attribute:the-dir-attribute-18
[2] http://www.w3.org/TR/css-writing-modes-3/#bidi
(Reporter)

Comment 2

2 years ago
(In reply to Simon Montagu :smontagu from comment #1)
> (In reply to Mehdi Sadeghi from comment #0)
> > I change
> > direction in css having `body { direction: rtl; }` not `dir="rtl"` as html
> > attributed.
> 
> This is not a good idea -- both the HTML spec [1] and the CSS spec [2]
> recommend setting direction with the HTML dir attribute and not the CSS
> direction property.
> 
> That said, I don't know exactly how Reader View chooses a direction, and we
> may well be able to improve things here.
> 
> [1]
> https://html.spec.whatwg.org/multipage/dom.html#the-dir-attribute:the-dir-
> attribute-18
> [2] http://www.w3.org/TR/css-writing-modes-3/#bidi

Setting direction with the HTML dir attribute solved my problem. I change the status to RESOLVED.
Status: UNCONFIRMED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → INVALID
I still want to investigate whether we can improve behaviour on sites like http://www.bbc.com/persian/sport/2015/06/150610_world_cup_2026_bidding_delayed where there is no dir attribute on the body, but all the content is contained in a <div> with dir="rtl".
Status: RESOLVED → REOPENED
Ever confirmed: true
Resolution: INVALID → ---

Comment 4

2 years ago
This page also doesn't support RTL in desktop reader mode:
http://hwzone.co.il/%D7%94%D7%A6%D7%A6%D7%94-%D7%9C%D7%9E%D7%95%D7%A6%D7%A8%D7%99-%D7%94%D7%A2%D7%99%D7%93%D7%9F-%D7%94%D7%97%D7%93%D7%A9-%D7%A9%D7%9C-amd-%D7%A9%D7%99%D7%92%D7%99%D7%A2%D7%95-%D7%91%D7%A9%D7%A0%D7%AA-201/

Kubuntu 15.04 64bit, Firefox 39.

Comment 5

2 years ago
Web designers should use dir attribute instead of CSS direction property, but there are sites out there not conforming to the recommendation. We can still do better on this by preserving the text direction.

A very safe implementation is to get computed style of the extracted elements, and set this to the dir attribute. This would guarantee the text direction is correct, but takes more time. DOM distiller is a similar project, but for Chromium. Their implementation is here:
https://github.com/chromium/dom-distiller/search?utf8=%E2%9C%93&q=cloneSubtreeRetainDirection

Comment 6

a year ago
(In reply to Wei-Yin Chen from comment #5)
> A very safe implementation is to get computed style of the extracted
> elements, and set this to the dir attribute. This would guarantee the text
> direction is correct, but takes more time. DOM distiller is a similar
> project, but for Chromium. Their implementation is here:
> https://github.com/chromium/dom-distiller/
> search?utf8=%E2%9C%93&q=cloneSubtreeRetainDirection

This sounds sensible to me.

This is broken on both desktop and Android Firefox.

From the opening bug description, http://mehdix.ir/rtl-feed.html is broken for me and http://www.bbc.com/persian/sport/2015/06/150610_world_cup_2026_bidding_delayed works correctly.

Another site that is broken for me: http://www.nrg.co.il/online/1/ART2/749/875.html?hp=1&cat=402&loc=1

Updated

9 months ago
Duplicate of this bug: 1232445

Updated

9 months ago
Duplicate of this bug: 1199961

Comment 9

9 months ago
Felipe, can CLD or maybe Intl help and determine directionality of text (or maybe just language, and can we make an inference based on that, if no direction is set and we're confident we're looking at Arabic/Hebrew/something-else-we-know-is-rtl ?)
Flags: needinfo?(felipc)

Comment 10

9 months ago
The charset autodetection code could help find the page language, but I am not sure if this component is not obsolete in Firefox. It should be better to set direction by checking the page direction and not the language, even that it will break the directionality of (very!) old webpages that have never used the direction directive. 

Since the problem is about pages that are written with RTL directionality, I guess that such webpages should have most of its content marked with dir=rtl or direction:rtl. RTL mode should be triggered if there is RTL content on the page, or if more than a given percentage of the page content is actually RTL.

Comment 11

9 months ago
(In reply to Amir Aharoni from comment #6)
> (In reply to Wei-Yin Chen from comment #5)
> > A very safe implementation is to get computed style of the extracted
> > elements, and set this to the dir attribute. This would guarantee the text
> > direction is correct, but takes more time. DOM distiller is a similar
> > project, but for Chromium. Their implementation is here:
> > https://github.com/chromium/dom-distiller/
> > search?utf8=%E2%9C%93&q=cloneSubtreeRetainDirection
> 
> This sounds sensible to me.
> 
> This is broken on both desktop and Android Firefox.
> 
> From the opening bug description, http://mehdix.ir/rtl-feed.html is broken
> for me and
> http://www.bbc.com/persian/sport/2015/06/
> 150610_world_cup_2026_bidding_delayed works correctly.

I'm not sure how it was back when I posted this comment, but now it's definitely the other way:
* correct: http://mehdix.ir/rtl-feed.html
* broken: http://www.bbc.com/persian/sport/2015/06/150610_world_cup_2026_bidding_delayed

Comment 12

9 months ago
(In reply to Amir Aharoni from comment #11)
> * correct: http://mehdix.ir/rtl-feed.html
Direction set on the HTML tag. <html dir="rtl" lang="fa">

> * broken:
> http://www.bbc.com/persian/sport/2015/06/
> 150610_world_cup_2026_bidding_delayed

Direction set on the content (and on <head> - why?!) <div class="direction" dir="rtl">

Broken: http://www.nrg.co.il/online/1/ART2/749/875.html?hp=1&cat=402&loc=1

- Direction set on CSS: #page-wrap {direction: rtl;}

Correct: https://hwzone.co.il/%D7%94%D7%A6%D7%A6%D7%94-%D7%9C%D7%9E%D7%95%D7%A6%D7%A8%D7%99-%D7%94%D7%A2%D7%99%D7%93%D7%9F-%D7%94%D7%97%D7%93%D7%A9-%D7%A9%D7%9C-amd-%D7%A9%D7%99%D7%92%D7%99%D7%A2%D7%95-%D7%91%D7%A9%D7%A0%D7%AA-201/

- Direction set on HTML: <html dir="rtl" lang="he-IL">
Yeah, the CLD (LanguageDetector.jsm) should be able to provide a locale from a given chunk of text, which could them be run through nsIChromeRegistry.isLocaleRTL, which checks the "ui.direction.*" prefs to determine if it's in the list of RTL locales.

The problem is that there could be text with mixed languages which will never look right one way or the other. I think that Reader Mode should get the computed style for the `direction` style on each chunk of text that it decided to keep, and use that information. 

Does Reader Mode have a direct knowledge of the DOM elements it's keeping? Or does the Readability API only give back the filtered plain text?
Flags: needinfo?(felipc)

Comment 14

9 months ago
(In reply to :Felipe Gomes (needinfo me!) from comment #13)
> Does Reader Mode have a direct knowledge of the DOM elements it's keeping?

No. It doesn't even have more than the HTML of the original page in some cases, so "computed style" isn't available, because there's no CSS except inline styles. So really, attempting to rely on the direction: style is hard/impossible without completely rearchitecturing the whole thing. We might be able to make the common case of clicking to enter reader mode slightly better by storing directionality from the loaded document, but even that is non-trivial. And even if we do that it won't always work because e.g. from session restore / refresh / back/fwd navigation, we won't necessarily have this information (we just XHR the HTML). Hence asking for other options.

Comment 15

9 months ago
(In reply to :Gijs Kruitbosch from comment #14)
> No. It doesn't even have more than the HTML of the original page in some
> cases, so "computed style" isn't available, because there's no CSS except
> inline styles. 
Does it means that the reader mode is not aware of stylesheets at all? Because it may make it difficult to guess direction that is set on the stylesheets.

Comment 16

9 months ago
(In reply to Tomer Cohen :tomer from comment #15)
> (In reply to :Gijs Kruitbosch from comment #14)
> > No. It doesn't even have more than the HTML of the original page in some
> > cases, so "computed style" isn't available, because there's no CSS except
> > inline styles. 
> Does it means that the reader mode is not aware of stylesheets at all?

Yes.

> Because it may make it difficult to guess direction that is set on the
> stylesheets.

Not just "may" - it *does* make it *impossible* to know. There's a good reason the specs suggest using the "dir" attribute instead of CSS, see comment #1. It's a semantics rather than just a styling change anyway.

Anyway, this is the web we get and so we should try and deal with it, somehow.
(Reporter)

Comment 17

9 months ago
(In reply to Amir Aharoni from comment #11)
> I'm not sure how it was back when I posted this comment, but now it's
> definitely the other way:
> * correct: http://mehdix.ir/rtl-feed.html

It is correct because I have updated the website to reflect changes described in comment #1.

Updated

9 months ago
Priority: -- → P2

Updated

9 months ago
Status: REOPENED → NEW
Whiteboard: [reader-mode-readability-algorithm]

Updated

9 months ago
Blocks: 1286221

Updated

8 months ago
Duplicate of this bug: 1288203
(Assignee)

Comment 19

4 months ago
(In reply to :Gijs Kruitbosch from comment #16)
> (In reply to Tomer Cohen :tomer from comment #15)
> > (In reply to :Gijs Kruitbosch from comment #14)
> > > No. It doesn't even have more than the HTML of the original page in some
> > > cases, so "computed style" isn't available, because there's no CSS except
> > > inline styles. 
> > Does it means that the reader mode is not aware of stylesheets at all?
> 
> Yes.
> 
> > Because it may make it difficult to guess direction that is set on the
> > stylesheets.
> 
> Not just "may" - it *does* make it *impossible* to know. There's a good
> reason the specs suggest using the "dir" attribute instead of CSS, see
> comment #1. It's a semantics rather than just a styling change anyway.
Gijs,

Since it deals with the "Direction set on CSS" case impossibly, should we set "WONTFIX" here? Or we should focus on the "Direction set on HTML" cases here?

If we want to deal with the "Direction set on HTML" cases, somehow we might need to find a good way to get possible "dir" attribute in a web page to instead of current solution, `doc.documentElement.getAttribute("dir")`. For example, check the "dir" attribute of result HTML text selected by Reader Mode algorithm from a web page, parent(parent's parent...) node of the result HTML text, head tag, and body tag.

Any thought?

[1]: http://searchfox.org/mozilla-central/source/toolkit/components/reader/Readability.js#666
Flags: needinfo?(gijskruitbosch+bugs)
Is it possible for the reader mode to read two blocks with different directionality?

<div dir="ltr">foo</div>
<div dir="rtl">foo2</div>

Also, it may be valuable to recognize directionality of text without markup (tags like RLI/FSI/PDI may indicate that), but then you'd need similar code to one layout uses to interpret the direction from the chunk.

Comment 21

4 months ago
(In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #19)
> Since it deals with the "Direction set on CSS" case impossibly, should we
> set "WONTFIX" here? Or we should focus on the "Direction set on HTML" cases
> here?
> 
> If we want to deal with the "Direction set on HTML" cases, somehow we might
> need to find a good way to get possible "dir" attribute in a web page to
> instead of current solution, `doc.documentElement.getAttribute("dir")`. For
> example, check the "dir" attribute of result HTML text selected by Reader
> Mode algorithm from a web page, parent(parent's parent...) node of the
> result HTML text, head tag, and body tag.

We should do two things:
1) look for a "dir" attribute on the ancestors of the elements we take out of the original DOM, not just on the document element.
2) for cases where that does not obtain a dir attribute, in the reader page JS/HTML (not in the readability code itself, where we can do the other stuff), abstract the language detection that we're doing for narrate anyway out, and make it more accessible, then use that to determine a fallback rtl/ltr-ness as per comment #13 .
Flags: needinfo?(gijskruitbosch+bugs)

Comment 22

4 months ago
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #20)
> Is it possible for the reader mode to read two blocks with different
> directionality?
> 
> <div dir="ltr">foo</div>
> <div dir="rtl">foo2</div>

This question does not make sense to me. What does "read" mean?

If the dir attribute is included in the markup reader mode extracts from a page, we don't touch it, and things should display fine. The problem is with dir attributes that are in some non-extracted ancestor of the content that we do extract, but not on the <body>/<html> tags, and pages that use CSS only to indicate directionality. So, what you're talking about is not currently a problem.

> Also, it may be valuable to recognize directionality of text without markup
> (tags like RLI/FSI/PDI may indicate that), but then you'd need similar code
> to one layout uses to interpret the direction from the chunk.

Why would we need to do layout's job for it? We don't specifically set dir=ltr if we don't find any indication to do that, and the content still has the same text in it. If layout already does not automatically determine it should be displayed RTL then I don't see why we would be able to do any better.
(Assignee)

Comment 23

4 months ago
(In reply to :Gijs Kruitbosch (PTO Nov. 17-20 inclusive) from comment #21)
> (In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #19)
> > Since it deals with the "Direction set on CSS" case impossibly, should we
> > set "WONTFIX" here? Or we should focus on the "Direction set on HTML" cases
> > here?
> > 
> > If we want to deal with the "Direction set on HTML" cases, somehow we might
> > need to find a good way to get possible "dir" attribute in a web page to
> > instead of current solution, `doc.documentElement.getAttribute("dir")`. For
> > example, check the "dir" attribute of result HTML text selected by Reader
> > Mode algorithm from a web page, parent(parent's parent...) node of the
> > result HTML text, head tag, and body tag.
> 
> We should do two things:
> 1) look for a "dir" attribute on the ancestors of the elements we take out
> of the original DOM, not just on the document element.
Got it.

> 2) for cases where that does not obtain a dir attribute, in the reader page
> JS/HTML (not in the readability code itself, where we can do the other
> stuff), abstract the language detection that we're doing for narrate anyway
> out, and make it more accessible, then use that to determine a fallback
> rtl/ltr-ness as per comment #13 .
One question here, what do you mean "make it more accessible"?

In my understanding, we could do language detection for the `article.content` variable in the `_showContent` method[1]. And the language detection sample code is like below[2]
```
LanguageDetector.detectLanguage(sampleText).then(result => {
  resolve(result.confident ? result.language : null);
});
```

Looks like we can just reuse `LanguageDetector.jsm` without changing code for it, right?

[1]: http://searchfox.org/mozilla-central/source/toolkit/components/reader/AboutReader.jsm#769-805
[2]: http://searchfox.org/mozilla-central/source/toolkit/components/narrate/Narrator.jsm#39-41
Flags: needinfo?(gijskruitbosch+bugs)

Comment 24

4 months ago
(In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #23)
> (In reply to :Gijs Kruitbosch (PTO Nov. 17-20 inclusive) from comment #21)
> > 2) for cases where that does not obtain a dir attribute, in the reader page
> > JS/HTML (not in the readability code itself, where we can do the other
> > stuff), abstract the language detection that we're doing for narrate anyway
> > out, and make it more accessible, then use that to determine a fallback
> > rtl/ltr-ness as per comment #13 .
> One question here, what do you mean "make it more accessible"?
> 
> In my understanding, we could do language detection for the
> `article.content` variable in the `_showContent` method[1]. And the language
> detection sample code is like below[2]
> ```
> LanguageDetector.detectLanguage(sampleText).then(result => {
>   resolve(result.confident ? result.language : null);
> });
> ```
> 
> Looks like we can just reuse `LanguageDetector.jsm` without changing code
> for it, right?
> 
> [1]:
> http://searchfox.org/mozilla-central/source/toolkit/components/reader/
> AboutReader.jsm#769-805
> [2]:
> http://searchfox.org/mozilla-central/source/toolkit/components/narrate/
> Narrator.jsm#39-41

I mean that we should call LanguageDetector.jsm in reader mode code instead of in narrator code, and the result be saved somewhere where it is accessible to the narrator code - the narrator code can depend on the reader mode code but not vice versa.
Flags: needinfo?(gijskruitbosch+bugs)
(Assignee)

Comment 25

4 months ago
Wrote WIP patch[1] to fixed the RTL issue of the BBC webpage[2] and add tests for it.

[1]: https://github.com/evanxd/readability/commit/f7b50429e00350da65dcdc770bb31e47e94c3861
[2]: http://www.bbc.com/persian/sport/2015/06/150610_world_cup_2026_bidding_delayed
(Assignee)

Updated

4 months ago
Assignee: nobody → evan
(Assignee)

Comment 26

4 months ago
> 1) look for a "dir" attribute on the ancestors of the elements we take out of the original DOM, not just on the document element.

Fixed the first problem: https://github.com/evanxd/readability/commit/d2bc22aa775f7d6e3bc512bb7991e09f9cb71910
(Assignee)

Comment 27

4 months ago
Fixed some issues of the patch and sent a pull request for fixing problem one: https://github.com/mozilla/readability/pull/320
(Assignee)

Comment 28

4 months ago
> 2) for cases where that does not obtain a dir attribute, in the reader page JS/HTML (not in the readability code itself, where we can do the other stuff), abstract the language detection that we're doing for narrate anyway out, and make it more accessible, then use that to determine a fallback rtl/ltr-ness as per comment #13 .

Since we need to write another patch to fix problem two, we should file a new bug and fix it there.
(Assignee)

Updated

4 months ago
Depends on: 1318605
(Assignee)

Updated

4 months ago
Blocks: 1318605
No longer depends on: 1318605
(Assignee)

Comment 29

4 months ago
Updated patch for comments: https://github.com/mozilla/readability/pull/320/commits/bdbeb18c45bce0ef57e9b66203d3fd61a8aced55
(Assignee)

Comment 30

4 months ago
Updated patch for comments: https://github.com/mozilla/readability/pull/320/commits/9ddcd14378ad38a31e186c81c6e5e859cb3f04e4
(Assignee)

Comment 31

4 months ago
Updated patch for comments: https://github.com/mozilla/readability/pull/320/commits/cf6e6708bf349942f9e7e0712db6a15f9e4a3499
(Assignee)

Comment 32

4 months ago
Patch is landed on GitHub repo: https://github.com/mozilla/readability/commit/af0aa5c59fe26484ea444204a9275f05d713a038
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
(Assignee)

Comment 35

4 months ago
Hi Gijs,

I've updated readability from github repo, includes fix for Bug 1173548 and Bug 1255978. Could you take a look and help land it? Thanks.

Comment 36

4 months ago
mozreview-review
Comment on attachment 8814827 [details]
No bug - update readability from github repo, includes fix for Bug 1173548 and Bug 1255978,

https://reviewboard.mozilla.org/r/95940/#review95918
Attachment #8814827 - Flags: review?(gijskruitbosch+bugs) → review+
(Assignee)

Comment 37

4 months ago
Pushed a new try[1] to ensure `tc-M-V` test is good. In previous run, it is all failed for unknown reason. Once it's good, let's land it.

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8fdcbe4c14af
(Assignee)

Comment 38

4 months ago
Pushed a new try[1] to debug the `tc-M-V` failures[2].

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c1df7241e32054b3048b08922c7f8e6d42267b01
[2]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8fdcbe4c14af&filter-resultStatus=testfailed

Comment 39

4 months ago
(In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #37)
> Pushed a new try[1] to ensure `tc-M-V` test is good. In previous run, it is
> all failed for unknown reason. Once it's good, let's land it.
> 
> [1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8fdcbe4c14af

Those are tier-2 and those oranges are not your fault.

Comment 40

4 months ago
Pushed by gijskruitbosch@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/e9ff40069326
No bug - update readability from github repo, includes fix for Bug 1173548 and Bug 1255978, r=Gijs

Comment 41

4 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/e9ff40069326
Status: NEW → RESOLVED
Last Resolved: 2 years ago4 months ago
status-firefox53: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
QA Whiteboard: [good first verify]
You need to log in before you can comment on or make changes to this bug.