Closed Bug 1007409 Opened 10 years ago Closed 10 years ago

Create public API to add new items to the reader mode cache

Categories

(Firefox for Android Graveyard :: Reader View, defect)

ARM
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Firefox 36

People

(Reporter: Margaret, Assigned: Margaret, Mentored)

References

Details

Attachments

(2 files, 3 obsolete files)

For my pocket panel add-on, I tried to hack the existing Reader logic to add new articles to the cache, but I ran into performance problems because the API was not designed for this.

I want a public reader mode API that takes a URL, downloads and parses a document, and adds the resulting article to the cache.
Mentor: margaret.leibovic
Whiteboard: [mentor=margaret][lang=js][bad first bug] → [lang=js][bad first bug]
Blocks: 1046913
Blocks: readerv2
No longer blocks: 1046913
This bug will help us work towards downloading/caching articles for users when they add URLs to the reading list. This will also help us create offline versions of content if we sync reading list URLs.
Assignee: nobody → margaret.leibovic
Mentor: margaret.leibovic → nobody
Whiteboard: [lang=js][bad first bug]
Based off a conversation rnewman and I had on IRC, I decided to try caching reading list articles in files, instead of in indexedDB. The hope is that in the longer term, we could actually store HTML files, as opposed to reconstructing the articles from this JSON. And by getting rid of indexedDB (or at least the way we're currently using it), it should hopefully be easier to insert multiple articles into the cache at the same time.

The main outstanding issue in this patch is figuring out where to actually store these things for real. This is working, but it will only cache one article at at time :) I suppose we'll also need to work out an eviction scheme to make sure we don't fill up your whole disk with reading list articles.

What do you guys think of this approach?

(FYI, this is based on top of my patch for bug 1085685).
Attachment #8508362 - Flags: feedback?(rnewman)
Attachment #8508362 - Flags: feedback?(bnicholson)
Comment on attachment 8508362 [details] [diff] [review]
WIP - Cache reading list articles in files, not indexedDB

Review of attachment 8508362 [details] [diff] [review]:
-----------------------------------------------------------------

Mostly questions to think about. Unqualified f+ on shifting things out of indexeddb.

::: mobile/android/chrome/content/Reader.js
@@ -4,5 @@
>   * You can obtain one at http://mozilla.org/MPL/2.0/. */
>  "use strict";
>  
>  let Reader = {
> -  // Version of the cache database schema

Wild and crazy note: even file-based schemas benefit from versioning. At some point you'll have to upgrade from flat files to descriptors+cache or something.

@@ +98,5 @@
>          }
>  
> +        this._getArticleFromCache(urlWithoutRef).then(article => {
> +        // If the article is already in reading list, bail
> +          NativeWindow.toast.show(Strings.browser.GetStringFromName("readingList.duplicate"), "short");

Maybe invert this to addArticle(...).then(success, failure)? The key concept is to be asking for what you want to do, and handling failure on the side.

(I sometimes find it helps if I explicitly do `let failure = (foo) => { bar...}`, then phrase the promise calls in terms of `success` and `failure`.)

Alternatively, maybe make the 'error' case less of an error -- either we found an article or we didn't, but it wasn't an error. An error is that something went wrong: malformed URI, no directory, whatever. Null checks aren't always the wrong answer.

Just some modeling thoughts to mull over; don't let me bias you one way or the other.

@@ +258,5 @@
>  
> +  _storeArticleInCache: function Reader_storeArticleInCache(article) {
> +    // TODO: Get this file name from reading list provider?
> +    let fileName = "article1.json";
> +    let path = OS.Path.join(OS.Constants.Path.profileDir, fileName);

Some questions to think about:

* Should cached RL items be in the Android cache dir, or in the profile dir? Which will make it easier to manage tons of data? For illustration, I have 502MB of data in my Pocket app on Android.

* Should you be using the URL to compute the destination? If so, should you be stripping fragments?

* Can this cache be shared between profiles?

* Presumably we'll have a state -- either now or later -- where we'll have added something to our reading list, but we no longer have it on disk. Shouldn't we be grabbing this path out of a DB, so that it's definitive? What should we do when the files are missing?
Attachment #8508362 - Flags: feedback?(rnewman) → feedback+
Comment on attachment 8508362 [details] [diff] [review]
WIP - Cache reading list articles in files, not indexedDB

Review of attachment 8508362 [details] [diff] [review]:
-----------------------------------------------------------------

Agree with Richard's comments. Not sure how far you're planning on going in terms of refactoring this API, but I was pretty deep into a similar endeavor a few weeks ago, so I just filed bug 1087722 and dumped my WIP there in case it's useful.

::: mobile/android/chrome/content/Reader.js
@@ +99,5 @@
>  
> +        this._getArticleFromCache(urlWithoutRef).then(article => {
> +        // If the article is already in reading list, bail
> +          NativeWindow.toast.show(Strings.browser.GetStringFromName("readingList.duplicate"), "short");
> +        }, e => {

Promise rejections should be used for exceptional cases, whenever you would throw in synchronous code. IMO, failing to get an article from the cache wouldn't be considered exceptional since it simply may not have been added previously. So I'd change _getArticleFromCache to first do an exists check, then resolve with null on non-existence (the function succeeds, it just found no article). I think it still makes sense to reject for any I/O or JSON errors, though.
Attachment #8508362 - Flags: feedback?(bnicholson) → feedback+
This is getting closer. Few questions to address:

* Where to store cached files? Right now I'm storing them in the root of the profile directory, probably not what we want.
* When to get rid of cached files? We should come up with some eviction plan, although perhaps we should do that in a separate bug.

Also makes me realize we should have some migration in place that cleans up the indexedDB cache. Do you think users will be upset if we don't migrate their existing cached articles? Maybe we should do that...

You will also be pleased to know that there are no more .bind calls in this file ;)
Attachment #8508362 - Attachment is obsolete: true
Attachment #8514680 - Flags: feedback?(rnewman)
Attachment #8514680 - Flags: feedback?(bnicholson)
Comment on attachment 8514680 [details] [diff] [review]
WIP v2 - Cache reading list articles in files, not indexedDB (

Review of attachment 8514680 [details] [diff] [review]:
-----------------------------------------------------------------

::: mobile/android/chrome/content/Reader.js
@@ +90,5 @@
>      }
>    },
>  
>    _addTabToReadingList: function(tabID) {
> +    let uri = BrowserApp.getTabForId(tabID).browser.currentURI;

Is it possible for any of this to fail, incidentally? We've got a long chain here.

@@ +151,5 @@
>  
>      let request = { url: url, callbacks: [callback] };
>      this._requests[url] = request;
>  
> +    this.log("parseDocumentFromURL: " + url);

This can probably die now.

@@ +171,5 @@
> +
> +      // Article hasn't been found in the cache DB, we need to
> +      // download the page and parse the article out of it.
> +      this._downloadAndParseDocument(url, request);
> +    });

You should add a rejection handler.

@@ +225,5 @@
>        callback(null);
>      }
>    },
>  
> +  getArticleFromCache: function Reader_getArticleFromCache(uri) {

Comment that this returns a promise, and what it resolves and rejects.

@@ +227,5 @@
>    },
>  
> +  getArticleFromCache: function Reader_getArticleFromCache(uri) {
> +    let fileName = this._toHash(uri.specIgnoringRef);
> +    let path = OS.Path.join(OS.Constants.Path.profileDir, fileName);

Can we put these in a subdirectory?

@@ +240,5 @@
>    },
>  
>    storeArticleInCache: function Reader_storeArticleInCache(article) {
> +    let fileName = this._toHash(article.url);
> +    let path = OS.Path.join(OS.Constants.Path.profileDir, fileName);

Maybe break these two lines out into _toHashedPath(url)?

@@ +437,5 @@
> +    delete this._unicodeConverter;
> +    this._unicodeConverter = Cc["@mozilla.org/intl/scriptableunicodeconverter"]
> +                              .createInstance(Ci.nsIScriptableUnicodeConverter);
> +    this._unicodeConverter.charset = "utf8";
> +    return this._unicodeConverter 

s/ $/;

@@ +448,5 @@
> +   * @return The base64 representation of the MD5 hash.
> +   */
> +  _toHash: function(url) {
> +    let value = this._unicodeConverter.convertToByteArray(url);
> +    this._cryptoHash.init(this._cryptoHash.MD5);

One might consider using SHA-1 here, even though this isn't intended as a crypto hash. (If only we had non-crypto hashes available!)

@@ +450,5 @@
> +  _toHash: function(url) {
> +    let value = this._unicodeConverter.convertToByteArray(url);
> +    this._cryptoHash.init(this._cryptoHash.MD5);
> +    this._cryptoHash.update(value, value.length);
> +    return this._cryptoHash.finish(true);

Make sure the base64 it returns isn't ==-padded! It shouldn't be, 'cos it'll be a fixed-length hash, but worth checking.
Attachment #8514680 - Flags: feedback?(rnewman) → feedback+
(In reply to Richard Newman [:rnewman] from comment #6)

> @@ +448,5 @@
> > +   * @return The base64 representation of the MD5 hash.
> > +   */
> > +  _toHash: function(url) {
> > +    let value = this._unicodeConverter.convertToByteArray(url);
> > +    this._cryptoHash.init(this._cryptoHash.MD5);
> 
> One might consider using SHA-1 here, even though this isn't intended as a
> crypto hash. (If only we had non-crypto hashes available!)

SHA-1 gave me trouble because it included / characters, which naturally confused the file paths!

> @@ +450,5 @@
> > +  _toHash: function(url) {
> > +    let value = this._unicodeConverter.convertToByteArray(url);
> > +    this._cryptoHash.init(this._cryptoHash.MD5);
> > +    this._cryptoHash.update(value, value.length);
> > +    return this._cryptoHash.finish(true);
> 
> Make sure the base64 it returns isn't ==-padded! It shouldn't be, 'cos it'll
> be a fixed-length hash, but worth checking.

Hm, it does look like they are padded. Is there a way to get rid of this without manually removing it?
I decided to task-ify this to make the file I/O more asynchronous. This stores the cache files in a "readercache" directory in the profile directory.

I'll file a follow-up to clean up the indexedDB cache the user would have laying around. And maybe try migrating the cache as well, although I feel like I would rather just start over, since we haven't been doing anything to prune this cache, so there might be a bunch of old stuff in there :/

I want to try to write some tests for this. It shouldn't be hard to write a simple JavascriptTest that sanity checks the storeArticleInCache/getArticleFromCache/removeArticleFromCache APIs.
Attachment #8514680 - Attachment is obsolete: true
Attachment #8514680 - Flags: feedback?(bnicholson)
Attachment #8515354 - Flags: review?(rnewman)
Attached patch WIP - Test (obsolete) — Splinter Review
This test doesn't do much of anything right now, but it's just timing out without passing.

mfinkle, can you tell if I'm doing anything wrong? Poking into the global window to get the Reader object seems sketchy, but I see we're doing that in other tests.
Attachment #8515379 - Flags: feedback?(mark.finkle)
Comment on attachment 8515379 [details] [diff] [review]
WIP - Test

Actually, this works fine for me now. Not sure what was up with my build before!

I'm going to add some more testcases in here before asking for review.
Attachment #8515379 - Flags: feedback?(mark.finkle)
I'm pretty pleased with this. I added some basic test coverage for getArticleFromCache/storeArticleInCache/removeArticleFromCache, and I also added a test for parseDocumentFromURL. We'll probably want to use this in order to download and cache articles in the background as URLs are added to the reading list.
Attachment #8515379 - Attachment is obsolete: true
Attachment #8516312 - Flags: review?(rnewman)
Comment on attachment 8515354 [details] [diff] [review]
Cache reading list articles in files, not indexedDB

Review of attachment 8515354 [details] [diff] [review]:
-----------------------------------------------------------------

::: mobile/android/chrome/content/Reader.js
@@ +237,5 @@
> +   * @resolve JS object representing the article, or null if no article is found.
> +   * @rejects OS.File.Error
> +   */
> +  getArticleFromCache: function (uri) {
> +    return Task.spawn(function* () {

Actually, I can just use Task.async here, and then I don't need the bind call below. Updating my patch to do this.
Comment on attachment 8516312 [details] [diff] [review]
Test for reading list cache

Great basic test!
(In reply to :Margaret Leibovic from comment #5)

> * Where to store cached files? Right now I'm storing them in the root of the
> profile directory, probably not what we want.

We probably want to use a subfolder. Also, if we decide to move this to the Android cache folder, we should still use profiles. The HTTP cache2 system already uses the Android cache for it's storage and it uses a mirrored profile structure.

> * When to get rid of cached files? We should come up with some eviction
> plan, although perhaps we should do that in a separate bug.

Separate bug. We should also evict based on file space triggers from Android.
(In reply to :Margaret Leibovic from comment #7)

> > Make sure the base64 it returns isn't ==-padded! It shouldn't be, 'cos it'll
> > be a fixed-length hash, but worth checking.
> 
> Hm, it does look like they are padded. Is there a way to get rid of this
> without manually removing it?

CommonUtils.encodeBase64URL has a way to remove the padding:
http://mxr.mozilla.org/mozilla-central/source/services/common/utils.js#85
(In reply to Mark Finkle (:mfinkle) from comment #16)
> (In reply to :Margaret Leibovic from comment #7)
> 
> > > Make sure the base64 it returns isn't ==-padded! It shouldn't be, 'cos it'll
> > > be a fixed-length hash, but worth checking.
> > 
> > Hm, it does look like they are padded. Is there a way to get rid of this
> > without manually removing it?
> 
> CommonUtils.encodeBase64URL has a way to remove the padding:
> http://mxr.mozilla.org/mozilla-central/source/services/common/utils.js#85

A c++ impl of base64 encoding:
http://mxr.mozilla.org/mozilla-central/source/toolkit/identity/nsIIdentityCryptoService.idl#47
Comment on attachment 8516312 [details] [diff] [review]
Test for reading list cache

Review of attachment 8516312 [details] [diff] [review]:
-----------------------------------------------------------------

Excellent tests!

::: mobile/android/base/tests/testReadingListCache.js
@@ +1,3 @@
> +// -*- indent-tabs-mode: nil; js-indent-level: 2 -*-
> +/* This Source Code Form is subject to the terms of the Mozilla Public
> + * License, v. 2.0. If a copy of the MPL was not distributed with this file,

Nit: incorrect template; "file," goes on the next line.

https://www.mozilla.org/MPL/headers/

@@ +48,5 @@
> +
> +  let article = yield Reader.getArticleFromCache(ARTICLE_URI);
> +  do_check_eq(article, null);
> +
> +  run_next_test();

I was under the impression that you shouldn't be calling run_next_test from inside add_task. See Bug 972093.

@@ +56,5 @@
> +  let Reader = Services.wm.getMostRecentWindow("navigator:browser").Reader;
> +
> +  Reader.parseDocumentFromURL(ARTICLE.url, function parseCallback(article) {
> +    checkArticle(article);
> +    run_next_test();

It's fine for add_test, of course.
Attachment #8516312 - Flags: review?(rnewman) → review+
Comment on attachment 8515354 [details] [diff] [review]
Cache reading list articles in files, not indexedDB

Review of attachment 8515354 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good! Some possible improvements inline.

::: mobile/android/chrome/content/Reader.js
@@ +98,5 @@
> +    if (!tab) {
> +      Cu.reportError("Can't add tab to reading list because no tab found for ID: " + tabID);
> +      return;
> +    }
> +    let uri = tab.browser.currentURI;

Do we care about checking this? E.g., non-null, doesn't point to something silly like an about: page?

@@ +173,4 @@
>  
> +      if (!this._requests) {
> +        this.log("Reader has been destroyed, abort");
> +        return;

If we can't run callbacks here, shouldn't we be doing this check before line 170 tries to?

@@ +475,3 @@
>  
> +    let hash = CommonUtils.encodeBase32(this._cryptoHash.finish(false));
> +    let fileName = hash.substring(0, hash.indexOf("="));

.json?
Attachment #8515354 - Flags: review?(rnewman) → review+
Blocks: 1087722
(In reply to Richard Newman [:rnewman] from comment #18)
> Comment on attachment 8516312 [details] [diff] [review]
> Test for reading list cache
> 
> Review of attachment 8516312 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Excellent tests!
> 
> ::: mobile/android/base/tests/testReadingListCache.js
> @@ +1,3 @@
> > +// -*- indent-tabs-mode: nil; js-indent-level: 2 -*-
> > +/* This Source Code Form is subject to the terms of the Mozilla Public
> > + * License, v. 2.0. If a copy of the MPL was not distributed with this file,
> 
> Nit: incorrect template; "file," goes on the next line.

Copy/pasted this from another file... I'll update this from the real template.

> https://www.mozilla.org/MPL/headers/
> 
> @@ +48,5 @@
> > +
> > +  let article = yield Reader.getArticleFromCache(ARTICLE_URI);
> > +  do_check_eq(article, null);
> > +
> > +  run_next_test();
> 
> I was under the impression that you shouldn't be calling run_next_test from
> inside add_task. See Bug 972093.

I assumed I copy/pasted this, but a quick audit of the tree reveals we don't make this mistake elsewhere. Good catch :)
(In reply to Richard Newman [:rnewman] from comment #19)
 
> ::: mobile/android/chrome/content/Reader.js
> @@ +98,5 @@
> > +    if (!tab) {
> > +      Cu.reportError("Can't add tab to reading list because no tab found for ID: " + tabID);
> > +      return;
> > +    }
> > +    let uri = tab.browser.currentURI;
> 
> Do we care about checking this? E.g., non-null, doesn't point to something
> silly like an about: page?

I was just maintaining the current behavior here, but yes, that would probably be reasonable, especially because we already do a similar check in parseDocumentFromTab. I'm going to punt on this for now, and I can deal with it as part of bug 1087722. Doing some refactoring for that bug is really exposing some logic inconsistencies.

> @@ +173,4 @@
> >  
> > +      if (!this._requests) {
> > +        this.log("Reader has been destroyed, abort");
> > +        return;
> 
> If we can't run callbacks here, shouldn't we be doing this check before line
> 170 tries to?

This logic is kinda messy, so I'm going to err on the side of keeping it the way it is. I'm working on changing this over in bug 1087722, and I'm running into some issues, so I think it's fragile.

> @@ +475,3 @@
> >  
> > +    let hash = CommonUtils.encodeBase32(this._cryptoHash.finish(false));
> > +    let fileName = hash.substring(0, hash.indexOf("="));
> 
> .json?

Sounds good.
Depends on: 1093869
https://hg.mozilla.org/mozilla-central/rev/7075ce619635
https://hg.mozilla.org/mozilla-central/rev/3084582bbe17
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → Firefox 36
Depends on: 1107588
Depends on: 1152575
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: