Closed Bug 1127165 Opened 7 years ago Closed 7 years ago

Breakdown: Reading List storage on desktop

Categories

(Firefox Graveyard :: Reading List, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: adw, Assigned: adw)

References

(Blocks 1 open bug)

Details

Reading List on desktop needs local storage.  We could use SQLite.  We probably don't want to use Places tables because the Places schemas aren't right for this.  But maybe we keep some new tables inside places.sqlite just to avoid the overhead of another DB connection.  Or we could use JSON on disk.

Then we'll need an API to talk to the storage.  Possibly it could live in one of the Places JSMs, but I think maybe it shouldn't.

Android already has storage for Reading List, but of course it's in Java and specific to Android.  But its schema should be useful.  Richard, could you point me to that code, please?

Also:

> rnewman: right now we only store the original URL on Android
> rnewman: we'll probably keep both either (a) in case the destination
>          disappears, and the original still works -- e.g., in the case of
>          being resolved to a login page, or the original being a bit.ly link
> rnewman: or (b) to detect collisions without having to resolve the URL
> rnewman: e.g., if you add the same URL from twitter in two places
Flags: qe-verify-
Flags: firefox-backlog+
(In reply to Richard Newman [:rnewman] from comment #1)
> Here's the provider.
> 
> http://mxr.mozilla.org/mozilla-central/source/mobile/android/base/db/
> ReadingListProvider.java
> 
> It shares storage with browser.db. Here's the CREATE TABLE.
> 
> http://mxr.mozilla.org/mozilla-central/source/mobile/android/base/db/
> BrowserDatabaseHelper.java#314

You may not want to be *too* inspired by this, since we'll probably want to update the Android schema to include new features from the service API that we don't currently have (e.g. is_article, favorite, etc.).

Speaking on which, rnewman, is there an Android bug to fix our "length" field to actually be "word count"? Right now we just store the number of characters:
http://mxr.mozilla.org/mozilla-central/source/toolkit/components/reader/content/Readability.js#1520

(We should also update that in Readability.js, since that code could also be used by desktop.)
(In reply to :Margaret Leibovic from comment #2)

> You may not want to be *too* inspired by this, since we'll probably want to
> update the Android schema to include new features from the service API that
> we don't currently have (e.g. is_article, favorite, etc.).

Yeah, I gave thorough caveats over video :D

 
> Speaking on which, rnewman, is there an Android bug to fix our "length"
> field to actually be "word count"?

Not yet! I'll file one later today.

We should also take some time to formalize exactly what we mean by word count, so that various clients can get approximately the same value.
Filed Bug 1127451, marked it as tracking 38.
(In reply to Drew Willcoxon :adw from comment #0)
> Reading List on desktop needs local storage.  We could use SQLite.
> ... Or we could use JSON on disk.
...

(In reply to :Margaret Leibovic from comment #2)
> You may not want to be *too* inspired by this, since we'll probably want to
> update the Android schema to include new features from the service API that
> we don't currently have (e.g. is_article, favorite, etc.).

A possibly naive option might be to use JSON in SQLite.  We could use a scheme where fields we want to have as indexes are real columns, and "the rest" are stored in a single column storing JSON.  This could mean zero schema changes are needed for many new fields and would still allow a relatively simple migration process if we decide an existing fields need to be "promoted" to a real column.
honestly, unless you need to do complex queries, you should not use Sqlite.
It actually depend on the avg number of reading lists items you expect an user to have and the expiration strategy you plan on that.
Reading the whole data (or fragmenting the data keeping large parts on disk and only a lookup json in memory) could be cheaper.
If you plan to do some sort of caching after fetching the data, than it could really make sense to do a one-shot read of a json (indeed doesn't make any sense to have a database if you read all the data and manipulate it in memory most of the time).

Off-hand I'd not store the excerpt in the database or in the lookup json, unless you plan a UI where multiple excerpt are shown at the same time (in such a case opening file handles would be expensive) or you want to provide search through that (and still, you would then need FTS with a good tokenizer to do proper search). It may cause bad and pointless database growth and fragmentation.
Basically, you should split info you need for multiple-entries UIs (memory lookup stuff) from info you need when a specific entry is requested (load on demand). That should help figuring if you need a db, a json, a db with json or a db with external jsons.

That said, if the analysis ends up figuring we can't pay the memory cost or we expect hundreds/thousands of entries, both a separate db or places.sqlite addition are possible. On the plus side of places, we already store urls and they already have a guid and a title, on the minus side, it would make the service dependent on another one.
For active users, we expect hundreds or thousands of entries. For inactive users, zero.

I believe the current mocks include the excerpt in the list view, and it should eventually be searchable.
Wanted to make note that UX is interested in a couple future features just in case they impact the choice of storage today.

A quick search, seen here in safari, probably just the title, url, and description.
A read / unread filter for items.
http://cl.ly/image/3Y3G3A0G1l2p

A grouping by dates, a similar feature seen here in Spartan.  This may also help with bug 1124153 as a way to scale the list.
http://cl.ly/image/1Q2d3W2v1I2L

I don't have a good time frame on these features right now, just giving the heads up that they are under consideration.
Which of the attributes in the data model do we need to store locally?  https://github.com/mozilla-services/readinglist/wiki/API-Design-proposal#data-model

The Android schema doesn't have resolved_url, resolved_title, favorite, is_article, added_by, stored_on, marked_read_by, marked_read_on, read_position.  Some of these are new attributes (comment 2) that the client would clearly want, but what about these three: stored_on, marked_read_by, marked_read_on?
rnewman is working on a patch to update the schema on the Android client, he would be the best person to answer that question.
Flags: needinfo?(rnewman)
Yeah, the Android schema doesn't have those yet because we haven't revved it since starting the syncing work. I'll be getting to that in the next few days; today if I'm very lucky.


In theory you can omit some fields that you don't use, with some caveats:

* Adding them later would mean re-downloading everything from the server.
* Some (e.g., resolved_url) are required for conflict resolution, and the client might get confused by server responses if it's blind to them.
* Even if you don't store them, you're required to send them in some requests. For example, when marking a record as read you must send marked_read_by, and when inserting you must send added_by.

The easiest thing is to support all fields, which is what we'll be doing on Android.


I anticipate that we won't yet be _using_ read_position or favorite, but we'll store and persist them. We won't be displaying the attribution fields (_by, _on), but we'll be storing them for later use.

resolved_url *will* be used, but it'll require some changes to the fetcher, so I'd expect that a week later than the schema change itself.
Flags: needinfo?(rnewman)
Bug 1130461 tracks the Android schema change.
(In reply to Richard Newman [:rnewman] from comment #7)
> I believe the current mocks include the excerpt in the list view, and it
> should eventually be searchable.

I think you should re-evaluate making large text fields searchable (especially if you plan to search them through a LIKE, that sucks for l10n), that means you should not do that unless you plan a separate FTS db (that could even come later).
And then evaluate whether it's worth to store them in the db or not, depending on how much you feel about showing previewa of the text for each entry at the same time in a list view.
(In reply to Marco Bonardo [::mak] (needinfo? me) from comment #14)
> (In reply to Richard Newman [:rnewman] from comment #7)
> > I believe the current mocks include the excerpt in the list view, and it
> > should eventually be searchable.
> 
> I think you should re-evaluate making large text fields searchable

UX only wants the title, url, and description fields to be text searchable.  If it is useful we could  limit the size of the description fields.

The sidebar itself isn't a great interface for search results so its unlikely they would want search matching highlights, however I'm sure they want the results to show the title, url, and descriptions as they normally do.
(In reply to Bryan Clark (Firefox PM) [:clarkbw] from comment #15)

> > I think you should re-evaluate making large text fields searchable
> 
> UX only wants the title, url, and description fields to be text searchable. 
> If it is useful we could  limit the size of the description fields.

FWIW, the description is already limited to 200 words.
Thoughts on the below?

I'm leaning toward just using SQLite, but see below.  I don't think it really matters whether it's in new tables in places.sqlite or a separate DB, so I'd say a separate DB.  According to about:memory, a connection to a new database is only 0.01 MB.  (That number would grow as it's used of course...)

Normally I would think that it's out of the question to use a single JSON file on disk.  Especially if an article's excerpt is stored along with its other data.  But even if not, because the non-excerpt data is not extraordinarily tiny.  We don't control the number of articles that the user puts into it, unless we prune it to a reasonably small size, which I haven't heard anything about.

But, the measurements below indicate that it only costs a couple of megabytes to keep 10,000 (smallish) articles in memory.  That's not bad.  If we want to bet that only a small percentage of users will ever accumulate that many articles over the lifetime of the reading list feature, maybe that's OK.

To lighten the memory footprint, we could use sharded JSON on disk.  N articles to a file, ordered by date (which date?); or a calendar month to a file.  Bryan mentions ordering by date, the mockups I've seen don't specify an ordering at all, and Bryan also mentions searching.  So I'm not sure how good of a fit sharding by date would be.  Say it's a great fit.  What future UX does it make hard to implement?  What does it lock us into?  Same goes for sharding with some other ordering.

* * *

I tried to measure the impact of keeping articles, as simple JS objects, in memory, including their excerpts.  For each N, I created a sandbox in the browser console and evaled some script in it that created a global articles array and filled it with N number of articles.

Each article had properties corresponding to all the attributes in the server data model.  The values of properties in an article were: for URL, title, and GUID properties, new strings of 36 chars (the length of a 32-char hex GUID with hyphens); for excerpt properties, new strings of 200*2 = 400 chars (200 single-char words + 200 spaces, i.e., the smallest excerpt possible (I think?)); for the rest, boolean or number values.

So note that those properties are on the lower boundary of what the typical article might look like, in terms of size.

I used about:memory and named sandboxes to determine how much memory each sandbox used.

Interestingly, it seemed to matter if I created the strings as String objects or string primitives.  Using String objects (new String("some property") instead of "some property") resulted in higher memory usage.  To be clear, I created a different string for each and every string property by concatenating a fixed string to some counters, but wrapping the resulting concatenations in `new String()` resulted in higher memory usage.

An empty sandbox: 0.02 MB.  The MB in this table are relative to that:

# of articles     MB  MB using new String()
-------------  -----  ---------------------
      100,000  18.59  55.97
       10,000   2.02   5.66
        1,000   0.23   0.7
          100   0.07   0.7

One thing I'm not sure about is the memory, GC, and jank impact of creating huge strings of JSON every now and then (when writing and reading the file(s)).  If we do use JSON, I'd really like to (de)serialize it off the main thread.

* * *

Thumbnails are a separate question.  I think we definitely should not keep all thumbnails in memory.  The existing toolkit thumbnail service keeps them on disk, a file per thumbnail, so I guess we could do that.  The mockups show thumbnails as focusing on a specific image or part of the page, which is not how the toolkit thumbnail service works.  So we'd need to write a new thumbnailer unless we don't mind using what the toolkit service provides, simple snapshots of the entire page.
(In reply to Drew Willcoxon :adw from comment #17)

> I'm leaning toward just using SQLite, but see below.  I don't think it
> really matters whether it's in new tables in places.sqlite or a separate DB,
> so I'd say a separate DB.  According to about:memory, a connection to a new
> database is only 0.01 MB.  (That number would grow as it's used of course...) 

I suspect that -- as with most of the features we build that go through this decision process and end up in sqlite -- we pretty much end up wanting to keep flexibility and a somewhat predictable perf envelope, and that pushes us to a DB.

I'd like to hear opinions from gps here (given that he maintains Sqlite.jsm, which is what we'd use).

It's also worth noting that we'll be encapsulating storage anyway, so our decision shouldn't be binding for all eternity. If even the login manager storage layer can be swapped out for JSON (except on Android!) then we can do the same here.


> One thing I'm not sure about is the memory, GC, and jank impact of creating
> huge strings of JSON every now and then (when writing and reading the
> file(s)).  If we do use JSON, I'd really like to (de)serialize it off the
> main thread.

This was something of an issue in the design of FHR; shuffling stuff to workers actually wasn't worth it, because as I recall JSON serialization to get across the boundary was itself a bottleneck! Maybe that's changed, and we no longer require serialization to hand things off to workers, which would be nice. Greg might know more.


> So we'd need to write a new
> thumbnailer unless we don't mind using what the toolkit service provides,
> simple snapshots of the entire page.

I'd be inclined to drop the term "thumbnail", precisely because it implies that it's a miniature version of the page as a whole, and of course that's not very interesting for a reader page.

(The thumbnail of the original page also isn't all that useful, because we're discarding the original layout and styling, so it's less meaningful to a user. They might never have seen the original page!)

So we'll be using a "pull image" or header from the page. Perhaps we should use terminology like "ident image"?
Flags: needinfo?(gps)
(In reply to Drew Willcoxon :adw from comment #17)
> Thumbnails are a separate question.  I think we definitely should not keep
> all thumbnails in memory.  The existing toolkit thumbnail service keeps them
> on disk, a file per thumbnail.

IIRC, these are all stored in a single directory on disk, and an "expiry" process simply enumerates that directory looking for items to delete (not on the main-thread, but still...).  This might make the existing thumbnail storage unsuitable for a large-ish number of thumbnails.  It also implies something will need to ensure the thumbnails on disk and the data in the DB be kept in sync, and if we can solve that simply, I'd wonder why we can't keep the real data on disk as .json and have the DB only stick with stuff necessary for UX (ie, for searching, sorting, etc)
(In reply to Mark Hammond [:markh] from comment #19)
> I'd wonder why we can't
> keep the real data on disk as .json and have the DB only stick with stuff
> necessary for UX (ie, for searching, sorting, etc)

That's what I was suggesting. I didn't say to not use a db for any reason, I said to keep large data (binaries, thumbnails, text blobs) out of the db and think thrice before adding an index on a TEXT field (unless you are building an FTS index, or the text is expected to be less than 50 chars, or the table is expected to have less than 100 rows).
(In reply to Richard Newman [:rnewman] from comment #18)
> I suspect that -- as with most of the features we build that go through this
> decision process and end up in sqlite -- we pretty much end up wanting to
> keep flexibility and a somewhat predictable perf envelope, and that pushes
> us to a DB.
> 
> It's also worth noting that we'll be encapsulating storage anyway, so our
> decision shouldn't be binding for all eternity.

Yes, I agree.

> So we'll be using a "pull image" or header from the page.

Good point, yeah.  I don't know if "pull image" is a widely recognized term like "pull quote" is, but if not the analogy certainly gives me a useful way of thinking about it.

(In reply to Mark Hammond [:markh] from comment #19)
> It also implies something will need to ensure the thumbnails on disk and the
> data in the DB be kept in sync, and if we can solve that simply, I'd wonder
> why we can't keep the real data on disk as .json and have the DB only stick
> with stuff necessary for UX (ie, for searching, sorting, etc)

I like the idea of strategically using a SQLite DB like you're suggesting, but I think that we just cannot have the entire store in a JSON blob, not when the size of the data is unbounded, for memory footprint and serialization (Richard's comment) reasons.  (And "sharded" JSON files have the caveats mentioned above.)

About the pull images -- if they're not in the DB, and they're not in individual files on disk, then where are they?  I'm thinking they're in the DB, actually.  Keeping them in a DB separate from the main DB would avoid the fragmentation problem, in the main DB, that Marco's talking about.  Worth it?

I need to close this bug by the end of the day since the iteration is ending.  I'm thinking the resulting broken-down bug will just be about creating a single SQLite DB and a JSM to provide access to it.
Blocks: 1131362
Breakdown complete: bug 1131362.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
No longer blocks: 1131362
Depends on: 1131362
You over-estimate me. I haven't touched Sqlite.jsm in months, possibly even a year. I just happened to write it because the raw storage APIs were a bazooka pointed at the feet of any poor soul who came near. If you do use SQLite from JavaScript, you should absolutely use Sqlite.jsm. My favorite feature is how it can shrink connection memory usage after the connection has been unused for a period of time.

I'm pretty sure people like Yoric will tell you that new Firefox features should be implemented as much in workers as possible - stay off the main thread. But I've been out of the Firefox development game for a few months. Maybe things have changed with e10s. Best to ask Yoric, Vladan, etc for a performance review.
Flags: needinfo?(gps)
Blocks: 1132074
Mass change of ReadingList bugs, moving to their own component. Filter bugspam on the following quote:

“Reading is to the mind what exercise is to the body.”
― Joseph Addison
Component: General → Reading List
Product: Firefox → Firefox Graveyard
You need to log in before you can comment on or make changes to this bug.