Run the Readability script on the main thread (instead of a web worker)

NEW
Unassigned

Status

()

Firefox
General
3 years ago
3 years ago

People

(Reporter: David Bruant, Unassigned)

Tracking

38 Branch
x86
Linux
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

3 years ago
For context, see https://github.com/mozilla/readability/issues/77

In a nutshell, the Readability script is fast (apparently ~20ms) and can run on the main thread. It cannot run directly on a real DOM Document, because it expects a fake DOM as the one created by JSDOMParser, but this is getting fixed [1].
Running it on the main thread would save the burden to write an HTML-compliant parser in JavaScript (JSDOMParser currently fails on broken/complex HTML like https://github.com/mozilla/readability/issues/77#issuecomment-86962219 )

The dependency is set so that this task can happen after Readability has shipped (planned for Firefox 38 AAUI)

[1] https://github.com/mozilla/readability/pull/80
(In reply to David Bruant from comment #0)
> For context, see https://github.com/mozilla/readability/issues/77
> 
> In a nutshell, the Readability script is fast (apparently ~20ms) and can run
> on the main thread. It cannot run directly on a real DOM Document, because
> it expects a fake DOM as the one created by JSDOMParser, but this is getting
> fixed [1].

… Until we spot other problems :/

And note that the linked "fix" is simply removing comments from the test inputs, not fixing JSDOMParser itself. 

Side note, if we want to clone the current document so we avoid mutations right from the page itself, we could simply:

> var doc = new DOMParser().parseFromString(document.firstElementChild.outerHTML, "text/html")

Yeah, that sounds a little nasty, but we'd get a proper HTMLDocument object to feed Readability with. I suspect this being slower and/or impact end user experience compared to delegating the work to a worker, though on my (rather fast) box, this feels perfectly fine with a few sample websites of mine.
(Reporter)

Comment 2

3 years ago
> if we want to clone the current document so we avoid mutations right from the page itself

Why would it be necessary? If Readability runs within a single event loop turn, no need to worry about mutations at all, because within your event loop turn, no one can be touching the DOM (except in pathological cases of getters poisoned by content... which can probably avoided by a Jetpack-like setup)
Runtime manipulations of DOM that would affect Readability also feels already fairly unlikely, no?


> var doc = new DOMParser().parseFromString(document.firstElementChild.outerHTML, "text/html")

If cloning is actually necessary, what about document.cloneNode(true) ?
(Reporter)

Comment 3

3 years ago
Unscientific benchmark [1] tells me cloneNode is about 2x faster than outerHTML+parseFromString on this Bugzilla bug page

[1] https://gist.github.com/DavidBruant/e05fa653ce4b5033f8f9

Comment 4

3 years ago
There is no way to stop script running if we keep the same document loaded, which is another issue that would need to be addressed.
(Reporter)

Comment 5

3 years ago
(In reply to :Gijs Kruitbosch (away 26-30/3) from comment #4)
> There is no way to stop script running if we keep the same document loaded,
> which is another issue that would need to be addressed.

I do not understand your comment and how it relates to the topic at hand. Can you give an example of the issue you're seeing (preferably with some content code)?

In any case, given you have chrome privileges, I know at least 2 ways to stop scripts, one is what NoScript addons use (no idea what that is :-) ) and the other, probably more gentle is the Debugger API.
But stopping scripts seems unnecessary for the reason I gave (the Readability content extracting function can run within a single event loop turn also called "task" in spec terminology).

Comment 6

3 years ago
(In reply to David Bruant from comment #5)
> (In reply to :Gijs Kruitbosch (away 26-30/3) from comment #4)
> > There is no way to stop script running if we keep the same document loaded,
> > which is another issue that would need to be addressed.
> 
> I do not understand your comment and how it relates to the topic at hand.
> Can you give an example of the issue you're seeing (preferably with some
> content code)?

The reasoning is pretty simple: if you have a website like your average JS-enabled webmail (gmail, yahoo mail, outlook for web, whatever), and it tries to modify the DOM, and you then change the DOM lots, chances are the scripts will start breaking and/or doing "random" stuff to the reader-ized page (or, for instance, prevent the page from being readerized). The same goes for more-likely-to-be-readerized pages like weblogs with dynamic comments sections, etc.

So the issue is that we can't predict what the webpage will do and the results of applying it to the reader-ized DOM aren't going to be good.

cloneNode and appending into a new DOM should take care of this - or using the stuff described in bug 998456 comment 20, maybe? - it's just something we need to think about (and/or test!), that's all.
You need to log in before you can comment on or make changes to this bug.