Closed Bug 780519 Opened 13 years ago Closed 13 years ago

Store all reader mode parsed articles in cache

Categories

(Firefox for Android Graveyard :: Reader View, defect, P3)

Platform: ARM / Android

Tracking

(Not tracked)

VERIFIED WONTFIX

People

(Reporter: bnicholson, Assigned: bnicholson)

Details

Attachments

(1 file)

Bug 779796 changed Readability checks to do a full parse for every page. We should take advantage of this by storing all parsed pages in the cache. This would let us avoid repeatedly parsing pages the user visits frequently, and avoid re-parsing every page when navigating back/forward. Of course, caching the article for every page will quickly fill up the cache, so we should have a maximum capacity with some sort of eviction policy. An LRU cache would probably work well here.
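For reference, the LRU cache mentioned above could be sketched like this (a hypothetical illustration, not the actual patch; it relies on JavaScript `Map` preserving insertion order):

```javascript
// Minimal LRU sketch (hypothetical; not the actual patch). A Map preserves
// insertion order, so re-inserting an entry on access keeps the oldest
// (least-recently-used) entry at the front.
class LRUCache {
  constructor(maxSize) {
    this.maxSize = maxSize;
    this.map = new Map();
  }
  get(url) {
    if (!this.map.has(url)) return null;
    let article = this.map.get(url);
    // Move the entry to the most-recently-used position.
    this.map.delete(url);
    this.map.set(url, article);
    return article;
  }
  set(url, article) {
    this.map.delete(url);
    this.map.set(url, article);
    if (this.map.size > this.maxSize) {
      // Evict the least-recently-used entry (first key in insertion order).
      this.map.delete(this.map.keys().next().value);
    }
  }
}
```

A real implementation would persist entries to disk rather than hold parsed articles in memory, but the eviction logic is the same idea.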
This patch stores every article result in the cache. When the cache is full, the oldest article is automatically removed. This is accomplished by adding another index, time, to the store. Getting an article from the DB automatically updates that article's time.

For now, only the number of non-persistent articles is limited (that is, if you explicitly add an item to your reading list, that article is persistent and does not count toward the cache total). To differentiate persistent from non-persistent articles, persistent articles are stored with a time of 0.

There's still a bit of cleanup to be done here, but I just wanted to be sure that we agree on this approach before I spend any more time on it. A few things that are missing from the complete patch:
* See if some of the boilerplate code can be removed
* Store max article capacity in some pref, not hard-coded
* Add an observer to automatically purge extra cache entries when the pref changes
* Check for overflow when the cache is first init'd (in case something failed prior to quitting previously)
* Also check for cache overflow when an article is removed from the reading list (removing an article from the reading list now simply changes the article's time from 0 -> Date.now(), which means the non-persistent article count has just increased by 1)
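The time-as-persistence-flag scheme described above could look roughly like this (hypothetical names and cap; the real patch uses the DB's time index rather than an in-memory Map):

```javascript
// Hypothetical sketch of the time-indexed scheme: persistent (reading-list)
// articles get time 0 and are never evicted; everything else gets Date.now(),
// and the oldest non-persistent entries are dropped once the cap is exceeded.
const MAX_NONPERSISTENT = 50; // would live in a pref, not hard-coded

function addArticle(store, url, article, persistent) {
  store.set(url, { article, time: persistent ? 0 : Date.now() });
  purgeOverflow(store);
}

function purgeOverflow(store) {
  // Persistent articles (time === 0) never count toward the cap.
  let nonPersistent = [...store.entries()].filter(([, e]) => e.time !== 0);
  // Sort oldest-first and evict until we are back under the cap.
  nonPersistent.sort((a, b) => a[1].time - b[1].time);
  while (nonPersistent.length > MAX_NONPERSISTENT) {
    store.delete(nonPersistent.shift()[0]);
  }
}
```

Note that removing an item from the reading list would then just mean rewriting its time from 0 to Date.now() and running the same purge.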
Assignee: nobody → bnicholson
Status: NEW → ASSIGNED
Attachment #649896 - Flags: feedback?(lucasr.at.mozilla)
Comment on attachment 649896 [details] [diff] [review]
WIP - Store all Readability results in cache

Review of attachment 649896 [details] [diff] [review]:
-----------------------------------------------------------------

Hmm, this feels like an overly complicated change for a set of not-so-frequent use cases. How often do you re-read an article in reader mode? Very rarely, I think. The back/forward case is maybe a slightly more relevant one, but we could probably solve that by simply mem-caching the parsed article for the next/previous pages in session history. We cache a page you want to read later mostly to support offline reading, not for performance reasons. If you want to take advantage of the full parsing, I'd prefer a "memory cache" solution that improves the UX for the cases you're talking about within the current browsing session.
Attachment #649896 - Flags: feedback?(lucasr.at.mozilla) → feedback-
I did consider a memory cache at first, but storing multiple fully parsed pages in memory could increase our (already high) memory footprint very quickly. Granted, our disk usage is also high, but disk space is generally cheaper.

Also, we'd lose the memory cache every time the browser is closed or OOM-killed (and OOMs would be more likely if we kept these in memory). That means we'd have to re-parse every page for a session restore, and pages the user visits frequently would need to be parsed repeatedly between browser sessions.

There are downsides to using the disk cache as well - the main one, I think, being that we don't currently have any form of expiration. Without expiration, articles that change will never be updated unless (a) the user clears the cache, or (b) the user navigates to 50 other pages before looking at the article again.

CC'ing kats and gbrown. I think they're familiar with the browser disk/memory caches, which share some overlapping concepts, so their feedback could be useful here. Perhaps there's even a way to save these articles using the existing cache API?
(In reply to Brian Nicholson (:bnicholson) from comment #3)
> I did consider a memory cache at first, but storing multiple fully-parsed
> pages in memory could increase our (already high) memory footprint very
> quickly. Granted our disk usage is also high, but disk space is generally
> cheaper.
>
> Also, we'd lose our memory cache every time the browser is closed or OOM'ed
> (and OOMs would be more likely if we kept these in memory). This means we'd
> have to re-parse every page for a session restore. And pages that the user
> visits frequently will need to be parsed repeatedly between browser sessions.

If you simply want to ensure we don't re-parse pages when we get OOM-killed while viewing something in reader mode, you could cache only the currently viewed articles so that we restore quickly once Fennec resumes. No need to keep caching articles viewed in reader mode indefinitely. This means we'd only have a couple of cached articles at any given time (no need to write complicated expiration code and all). You can identify those cached articles with a special URL or something.

I really want to avoid adding all this complexity to the reader code just to avoid re-parsing articles if we get killed. I'd be more inclined to accept it if the current performance was *really* getting in the way, UX-wise.

I think we'll be fine on the 2 most important reader use cases with this change:

1. While in a page, enter reader mode (I assume we'll simply be using the result of the full readability check, which runs in the background, so this should be pretty fast already, without any disk cache, right?)
2. View from the reading list (the page is cached for offline usage, so it should be immediate as well)
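The cache-only-the-current-article alternative suggested here is indeed much simpler. A minimal sketch (hypothetical function names, keyed per tab):

```javascript
// Sketch of the suggestion above (hypothetical): cache only the article
// currently being viewed in reader mode, keyed per tab, so an OOM-kill
// restore is fast without any eviction or expiration machinery.
const currentArticles = new Map(); // tabId -> parsed article

function onEnterReaderMode(tabId, article) {
  currentArticles.set(tabId, article);
}

function onLeaveReaderMode(tabId) {
  // Nothing else to purge: at most one cached article per tab.
  currentArticles.delete(tabId);
}

function getRestoredArticle(tabId) {
  return currentArticles.get(tabId) || null;
}
```

The trade-off, as the following comments discuss, is that back/forward navigation to earlier pages still requires a full re-parse.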
(In reply to Lucas Rocha (:lucasr) from comment #4)
> I really want to avoid adding all this complexity to the reader code just to
> avoid re-parsing articles if we get killed. I'd be more inclined to accept
> it if the current performance was *really* getting on the way, UX-wise.

This wouldn't be just to handle kills - it would also allow for faster back/forward navigation. We could do this with a memory cache, but storing them to disk will reduce memory pressure.

> I think we'll be fine on the 2 most important Reader use cases this change:
>
> 1. While in a page, enter reader mode (I assume we'll be simply using the
> result of the full readability check, which runs in background, so this
> should be pretty fast already, without any disk cache, right?)
> 2. View from reading list (page is cached for offline usage so it should be
> immediate as well)

What about back/forward navigation? If we only save the most recent article per tab, and we don't have a memory cache, we'll have to re-parse pages in the bfcache.

We'd also have to deal with removing entries for the previous page every time we go to a new page. Or what if we crashed and don't do a session restore? We'd have to go through the cache to find and remove the stale articles at startup or some other point. It seems like we'd still have a lot of the same complexity without the benefits of faster back/forward navigation.
(In reply to Brian Nicholson (:bnicholson) from comment #5)
> (In reply to Lucas Rocha (:lucasr) from comment #4)
> > I really want to avoid adding all this complexity to the reader code just to
> > avoid re-parsing articles if we get killed. I'd be more inclined to accept
> > it if the current performance was *really* getting on the way, UX-wise.
>
> This wouldn't be just to handle kills - it would also allow for faster
> back/forward navigation. We could do this with a memory cache, but storing
> them to disk will reduce memory pressure.
>
> > I think we'll be fine on the 2 most important Reader use cases this change:
> >
> > 1. While in a page, enter reader mode (I assume we'll be simply using the
> > result of the full readability check, which runs in background, so this
> > should be pretty fast already, without any disk cache, right?)
> > 2. View from reading list (page is cached for offline usage so it should be
> > immediate as well)
>
> What about back/forward navigation? If we only save the most recent article
> per tab, and we don't have a memory cache, we'll have to re-parse pages in
> the bfcache.
>
> We'd also have to deal with removing entries from the previous page every
> time we go to a new page. Or what if we crashed, and we don't do a session
> restore? We'll have to go through the cache to find and remove the stale
> articles at startup or some other point. It seems like we'd still have a lot
> of the same complexity without the benefits of faster back/forward.

All these issues won't exist if you index those cached articles with a stable key. But, again, I think you're overestimating the value of caching articles for the back/forward case. In practice, the "slowness" of re-parsing on back/forward would only be annoying if you're quickly (or often) going back and forward in session history for some reason - which I think is not a common enough case to worry about.
Adding code to manage a cache of every article viewed in reader mode really smells like premature optimization to me. I'll definitely back this idea if we end up getting user or QA feedback that the slowness is an issue somehow.
I certainly agree that we don't want to pre-optimize for issues we don't know exist yet. Moving to a worker thread will reduce performance/blocking issues. I'm not sure how valuable it is to make the check be instant.
I've seen several pages that take a long time for the reader icon to appear, so we could definitely benefit from improving the performance here. Remember that with bug 779796, we'll be doing a full parse for every page, so the readability "check" (which is actually a full parse) will take more time than it does right now. For example, going to http://en.wikipedia.org/wiki/List_of_j%C5%8Dy%C5%8D_kanji on the single-core Galaxy S takes a whopping 60s+ (!) for the icon to appear. Navigating away from that page and going back causes it to happen all over again, which is painful. Even on pages where the wait isn't that long, we could still save lots of CPU cycles (and phone battery) by saving these pages to disk.

At the very least, we should probably save the parse result as a boolean somewhere. That would make OOM restores and back/forward navigation just as fast (unless we're actually restoring or navigating back/forward to an about:reader page - that would require re-parsing).

Lucas and I discussed this a bit, and here's a summary:

* Another option would be to attach the article to the tab, so only the most recent article is available. This makes clicking the reader mode button instant, but going back/forward, restoring from OOM, etc. will require re-parsing.
* If we keep this LRU disk cache implementation, expiration can be implemented by checking the date when the reader icon is clicked. If a certain amount of time has passed, re-fetch the article. If the user never clicks the reader icon, it doesn't matter if articles are stale (since they never actually see them).
* Since there are proven cases where the cache considerably improves performance, this may be worth having. But since greater complexity = greater instability, we should wait for other pending reader bugs (particularly bug 779796) to land and bake for a while first.
This will allow people to get a feel for reader mode performance firsthand, and it will give us the opportunity to build some use cases for comparison (so this will no longer be "pre-optimizing", just "optimizing" ;-) ).
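The lazy-expiration idea from the summary above (check staleness only when the user actually opens reader mode) could be sketched as follows (hypothetical names and threshold):

```javascript
// Sketch of the lazy-expiration idea (hypothetical constants): staleness is
// only checked when the user clicks the reader icon, and the article is
// re-fetched if the cached copy is older than some threshold.
const MAX_AGE_MS = 24 * 60 * 60 * 1000; // e.g. one day; would be a pref

function getFreshArticle(cached, refetch) {
  // Persistent reading-list entries (time 0) are always kept for offline use.
  if (cached && (cached.time === 0 || Date.now() - cached.time < MAX_AGE_MS)) {
    return cached.article;
  }
  return refetch(); // stale or missing: parse the page again
}
```

This keeps stale entries on disk, but since they are only ever shown after this check, the user never sees outdated content.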
OK, my next feedback is:

* Keep it simple. Complication will kill it.
* Make it something we can easily disable via a preference. There may be phones (ARMv6) where we want to minimize all data and memory usage, even at the expense of performance.
Priority: -- → P3
The about:reader page correctly re-uses the DOM document on back now and avoids re-parsing articles completely when the document is cached. Furthermore, about:reader opens immediately when entering reader mode (see bug 782421). I think we don't need to cache any extra articles anymore. Closing.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX
Closing bug as verified wontfix due to comment #10.
Status: RESOLVED → VERIFIED
Product: Firefox for Android → Firefox for Android Graveyard