Closed Bug 947745 Opened 11 years ago Closed 10 years ago

Cap maximum size of seer.sqlite

Categories

Product/Component: Core :: Networking
Version: 27 Branch
Type: defect
Priority: Not set
Severity: critical

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla29
Tracking Status:
firefox27 + disabled
firefox28 + disabled
firefox29 + fixed
fennec 27+ ---

People

(Reporter: rnewman, Assigned: u408661)

References

Details

(Whiteboard: [MemShrink:P3][no-nag])

Attachments

(3 files, 1 obsolete file)

This is particularly important on Android. Burning 20MB+ of a user's limited memory is not OK. But it's also not OK to use half a gig of disk space on desktop.

The size on Android should reflect (a) device capability, (b) system low storage notifications.

Ideally the DB on Android should be stored in cache space, so that the system can automatically delete the DB entirely.
Whiteboard: [MemShrink]
Blocks: fatfennec
(In reply to Richard Newman [:rnewman] from comment #0)
> This is particularly important on Android. Burning 20MB+ of a user's limited
> memory is not OK. But it's also not OK to use half a gig of disk space on
> desktop.

I agree that we shouldn't ship a 20MB memory usage regression on Android. Firefox/Android OOMs enough as it is. We can't afford more.
(In reply to Brian Smith (:briansmith, was :bsmith; Please NEEDINFO? me if you want a response) from comment #1)

> I agree that we shouldn't ship a 20MB memory usage regression on Android.
> Firefox/Android OOMs enough as it is. We can't afford more.

This is storage, not RAM, but the point still stands -- Firefox already consumes more storage space than most any other app (222MB on my phone!), and we need to make that better, not worse.

Our storage usage is enough to make a mockery of our support of 2.x devices, because very few of them have enough storage to make Firefox's footprint tenable.

(And yes, there's a RAM cost, too -- a sqlite connection and all that comes with it.)
(In reply to Richard Newman [:rnewman] from comment #2)
> (In reply to Brian Smith (:briansmith, was :bsmith; Please NEEDINFO? me if
> you want a response) from comment #1)
> 
> > I agree that we shouldn't ship a 20MB memory usage regression on Android.
> > Firefox/Android OOMs enough as it is. We can't afford more.
> 
> This is storage, not RAM, but the point still stands -- Firefox already
> consumes more storage space than most any other app (222MB on my phone!),
> and we need to make that better, not worse.
> 
> Our storage usage is enough to make a mockery of our support of 2.x devices,
> because very few of them have enough storage to make Firefox's footprint
> tenable.
> 
> (And yes, there's a RAM cost, too -- a sqlite connection and all that comes
> with it.)

Make it an option to make the file bigger for people with 4/8/16/32GB phones?
(In reply to Grant from comment #3)

> Make it an option to make the file bigger for people with 4/8/16/32GB phones?

"The size on Android should reflect (a) device capability…"

Most implementation strategies for this would be configurable, anyway.
The value of the feature is high, so let's try to get a quick fix and evaluate it to see if it's appropriate for 27. We can always pref off the feature if it's too late or risky.
Assignee: nobody → hurley
Over in the comments on bug 881804, it was mentioned that at least one user has a 400+MB seer.sqlite. I fully admit that my browsing habits probably don't match those of others, but my seer.sqlite (which has existed longer than any other by virtue of the fact that I developed the feature) is only around 12MB on my most-used machine, so if possible, I'd like to get a look at one of these huge seer.sqlite files to see if I can figure out what's going on that would make the file grow so large.


Jim, if you're comfortable, can you make your seer.sqlite file available to me? You can use my bugmail address, or replace the domain with "mozilla.com" if that makes you more comfortable. Email may not work well (since the file is so large), so if you can put it on a (password-protected) website, that would probably be best.
Flags: needinfo?(jmjeffery)
fwiw my desktop seer.sqlite is 150MB.. I'm not loving the thought of sharing my browsing history for the last few months - but maybe this helps?

sqlite> select count(*) from moz_hosts;
1032
sqlite> select count(*) from moz_pages;
21229
sqlite> select count(*) from moz_redirects;
1182
sqlite> select count(*) from moz_seer_version;
1
sqlite> select count(*) from moz_startup_pages;
137
sqlite> select count(*) from moz_startups;
1
sqlite> select count(*) from moz_subhosts;
11500
sqlite> select count(*) from moz_subresources;
226959

Each of those quarter-million moz_subresources rows has a big URL. Hashes for the path, perhaps?
(In reply to Patrick McManus [:mcmanus] from comment #7)
> fwiw my desktop seer.sqlite is 150MB.. I'm not loving the thought of sharing
> my browsing history for the last few months - but maybe this helps?

Thanks, Pat. The counts are helpful, and are pretty much in line with the sizing of mine (your seer.sqlite is ~10x the size of mine, and your tables have ~10x the number of rows mine do). I was hoping that doing a vacuum might help, but it didn't change the size of my seer.sqlite at all, and given the seemingly linear relationship in your data, I don't think it will help with larger files. I'll work up a patch to prune old data from the file.

Jim, given Pat's data, I don't see any need for your seer.sqlite at this point, so I'm clearing the needinfo. You're still welcome to share, but my guess is that it won't provide any more enlightening data.
Flags: needinfo?(jmjeffery)
so can we get rid of some of those uris? at the leaf level? That's where the bulk of the byte count is and we're not prefetching resources. (though of course we will eventually.. but maybe a hash-promote-to-full-url scheme could be used when we get there).. just thinking out loud.
(In reply to Patrick McManus [:mcmanus] from comment #9)
> so can we get rid of some of those uris? at the leaf level? That's where the
> bulk of the byte count is and we're not prefetching resources. (though of
> course we will eventually.. but maybe a hash-promote-to-full-url scheme
> could be used when we get there).. just thinking out loud.

That sounds like about what I was thinking of.
Fwiw, I also have a 120MB seer.sqlite, and would be happy to provide data about it as needed, but not the whole file...
Mine is 80MB; it's mostly SQL being an inefficient storage format, especially if you index strings a lot (SQLite indexes are copies of the columns you are indexing).

Vacuuming saved ~5MB.

If I dump the db to text format it's 45MB; it compresses down to 7MB with zip.

Here are some ballpark figures on indexes I saw in the db:
SELECT count(*), sum(length(id) +length(origin)) from moz_subhosts;
982, 330K
SELECT count(*), sum(length(id)) from moz_subhosts;
8K, 34K
SELECT count(*), sum(length(uri)) from moz_startup_pages;
60, 1.5K
SELECT count(*), sum(length(id)+length(uri)) from moz_pages;
13651, 4.4MB
SELECT count(*), sum(length(uri)) from moz_pages;
~same as above
SELECT count(*), sum(length(id)) from moz_subresources;
173K, 932K
SELECT count(*), sum(length(pid)+length(uri)) from moz_redirects;
1.5K, 490K

For anything perf- or footprint-sensitive, one should not use SQLite. Naively, you could reimplement the database as a bunch of snappy-compressed log files, literally read them every time you look something up, and that would probably still perform better than a fragmented db.

However, to me this feels like something that should live in cache metadata.
Attached file cleanup_seer.py
Here's a python script that runs my first idea of how to clean up the seer db. If those with large seer.sqlite files would be willing to run this on A COPY of their seer.sqlite and report back with sizes both before and after running the script, I would appreciate it. Just do

cp <profile>/seer.sqlite <home>/seer_copy.sqlite
ls -lh <home>/seer_copy.sqlite
python cleanup_seer.py <home>/seer_copy.sqlite
ls -lh <home>/seer_copy.sqlite
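
For those curious, a hedged sketch of the kind of SQL such a cleanup pass might run (the cutoff and table choice here are illustrative assumptions; the attached script is the authoritative logic):

  -- Hypothetical age-based prune: drop hosts not loaded since some cutoff,
  -- then vacuum so the freed pages are actually returned to the filesystem.
  DELETE FROM moz_hosts WHERE last_load <= :cutoff;
  VACUUM;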
One more optional cleanup algorithm (which has a big effect, even on my tiny seer.sqlite file). Same deal as above if anyone is willing to test.
(In reply to Nicholas Hurley [:hurley] from comment #13)
> Created attachment 8344904 [details]
> cleanup_seer.py
> 
> Here's a python script that runs my first idea of how to clean up the seer
> db. If those with large seer.sqlite files would be willing to run this on A
> COPY of their seer.sqlite and report back with sizes both before and after
> running the script, I would appreciate it. Just do
> 
> cp <profile>/seer.sqlite <home>/seer_copy.sqlite
> ls -lh <home>/seer_copy.sqlite
> python cleanup_seer.py <home>/seer_copy.sqlite
> ls -lh <home>/seer_copy.sqlite



Size of seer.sqlite before 
59M 

Size of seer.sqlite after running cleanup_seer.py
55M
Both took ~5 seconds to run on my 153MB input. The first one reduced it to 139MB, the second one to 47MB. I didn't confirm that either was usable afterwards.
(In reply to Nicholas Hurley [:hurley] from comment #14)
> Created attachment 8344918 [details]
> cleanup_seer_aggressive.py
> 
> One more optional cleanup algorithm (which has a big effect, even on my tiny
> seer.sqlite file). Same deal as above if anyone is willing to test.

Size of seer.sqlite before 
59MB 

Size of seer.sqlite after running cleanup_seer_aggressive.py
20MB
• I regularly vacuum my SQLite databases, but I'd rather delete altogether the ones that don't store "permanent" information (bookmarks, form info, etc.).
• I dislike files that grow indefinitely and don't have an option to be cleaned (e.g. in the Ctrl+Shift+Del menu).
• Although appropriate, "seer" sounds extremely ominous and should be rebranded so users aren't spooked with security concerns.
Aggressive cleanup dropped my phone's DB from 21MB to 6.5MB.

Switching one month to one week dropped that to 4MB.

One day dropped it to 3.4MB, so most of the overhead is in the remainder (e.g., 130 Amazon product image URLs):

moz_subresources: 6401
moz_subhosts:     5071
moz_hosts:         644
moz_redirects:     512
Can't we just atomize the urls if we're trying to save space? I don't know much about how you're searching these indexes but unless you need 'between' or 'like' or operators like that then you can just put the urls in a separate table and use joins.
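
For illustration, a rough sketch of the atomization idea in SQLite; the table and column names here are hypothetical, not a real schema change:

  -- Hypothetical interning table: each distinct URL is stored once.
  CREATE TABLE moz_uris (
    id INTEGER PRIMARY KEY,
    uri TEXT UNIQUE NOT NULL
  );
  -- moz_subresources would then hold a small integer (uri_id) instead of the
  -- full string, and lookups become a join:
  SELECT s.pid, u.uri
    FROM moz_subresources s
    JOIN moz_uris u ON u.id = s.uri_id;

An index over the integer column is far smaller than an index over full URL strings, which is where the space savings come from.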
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #20)
> Can't we just atomize the urls if we're trying to save space? I don't know
> much about how you're searching these indexes but unless you need 'between'
> or 'like' or operators like that then you can just put the urls in a
> separate table and use joins.

My seer.sqlite database is 47MB.  39.8% of that is used for the SUBRESOURCE_PID_URI_INDEX index.  So, yes, if the strings were atomized, the size of that index would drop to nearly nothing and the overall size of the database would be almost halved.  More so, perhaps, if you store the URI strings as compressed BLOBs.

If you atomize strings, you can still do range queries, but you would have to do them using a join.  And there would be a performance hit.

Another option is to take advantage of the new WITHOUT ROWID tables introduced in SQLite 3.8.2.  You'd have to land the latest SQLite first, though, obviously.  See http://www.sqlite.org/withoutrowid.html for further information about WITHOUT ROWID tables.  There are advantages and disadvantages here so I'm not sure it is worth the tradeoff, but it is something to consider.

You can download the "sqlite3_analyzer.exe" utility program from http://www.sqlite.org/download.html and then run "sqlite3_analyzer seer.sqlite" to get a nice analysis of how much space each table and index is using.

One final note:  The seer.sqlite database uses lots of AUTOINCREMENT.  AUTOINCREMENT in SQLite is not like AUTOINCREMENT in MySQL.  See http://www.sqlite.org/autoinc.html for details.  There is space and time overhead associated with AUTOINCREMENT.  Are you sure you need it here?
Oops, missed a huge index in my last comment:
SELECT count(*), sum(length(uri)+length(pid))  from moz_subresources order by length(uri) desc;
"174684","26829192"

A lot of those subresources are 5-7K XHR URLs. Your timestamps are in microseconds, which adds a few percent to the db size.
I'm the one that was reporting the huge 437MB+ seer.sqlite file.

I grabbed these numbers using the SQLite Manager Addon:

moz_hosts	    752 records 2 indexes
moz_pages	 116032 records 2 indexes
moz_redirects	  25991 records 2 indexes
moz_startup_pages   142 records 1 index
moz_subhosts	   7345 records 2 indexes
moz_subresources 537275 records 2 indexes


I am not a coder nor do I understand scripts.  I don't know how to use the scripts attached.  I tried pasting the code in the 'developer scratchpad tool', but it keeps giving me an error:


/*
Exception: missing keyword 'from' after import specifier set
*/

I'd like to help, but I need some education.  If you don't need anything from me that's fine, just posting the information.

My file was created on Nov 10, 2013, and it's grown to this size already. Yes, I visit a lot of pages; most are repeat reloads of pages in a forum, along with having TBPL open all the time, plus CNN and nbcnews, and various bugs from time to time to check on progress and look at 'new bugs filed'.
Removing MemShrink tag because this seems to be about disk space.  If about:memory is showing high memory usage for seer.sqlite, please re-tag.
Whiteboard: [MemShrink]
│     │  ├───2,179,168 B (00.16%) -- seer.sqlite
│     │  │   ├──2,080,816 B (00.15%) ── cache-used
│     │  │   ├─────87,488 B (00.01%) ── stmt-used
│     │  │   └─────10,864 B (00.00%) ── schema-used

Second only to webappsstore.sqlite (Bug 857888 related) and Places.

Is that a concern, njn?
2MB isn't so much on desktop. What's it like on Android and B2G?
Whiteboard: [MemShrink:P3]
Interestingly enough, even though my seer.sqlite file is 437MB as noted above, my
about:memory readings come pretty close to matching those in comment #25:

──2,150,840 B (00.59%) -- seer.sqlite
│   │  │   ├──2,074,088 B (00.57%) ── cache-used
│   │  │   ├─────69,856 B (00.02%) ── stmt-used
│   │  │   └──────6,896 B (00.00%) ── schema-used
(In reply to Nicholas Nethercote [:njn] from comment #26)
> 2MB isn't so much on desktop. What's it like on Android and B2G?

Also 2MB on Android. We have about 8MB of sqlite connections on my device, with webappsstore and seer being the two biggest culprits.

I would love to get that down to 200K, matching form history/signons/etc.
While I appreciate (and support) the idea of reducing sqlite connection memory overhead, can we please leave this bug for its initial stated purpose, capping the on-disk size of seer.sqlite? Thanks.
Nick, can you just collect less data, generalize it more, or make the prediction a bit less precise while saving a lot of space?
I'm working on a patch to limit the size. Right now it "works" in that it ends up deleting everything the first time the cleanup routine runs, so that's obviously non-optimal (except from a disk space usage standpoint). Forward progress, though!
Given the potential risk of a change like this, we will be looking for this to land before January 13 so there is enough time to back it out if need be.
Attached patch seer_cleanup.patch (obsolete) — Splinter Review
This builds on the outstanding patch from bug 945779, so please review that one first.

It's pretty self-explanatory. We have a limit (bikeshedding on that value is, to a limited extent, OK) on how much space the db can take. When we hit that limit, we delete to 10% below that limit (to give us room to grow for a bit before having to clean up again). If for some reason the smart way of doing that fails, we'll just trash everything and start from scratch.

For android, I'm not doing anything smart yet (like introspecting amount of free space on device or listening to low-space notifications), those can be done in follow-up bugs (since we want to get something in ASAP). I chose a (seemingly) small value for the seer db on android (15MB), but we can cut that even further if necessary.
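
For orientation, a hedged sketch in SQL of the eviction flow described above (the real logic lives in Seer.cpp; the LIMIT value is a placeholder):

  -- Once the db exceeds the cap, evict the least-recently-hit subresources...
  DELETE FROM moz_subresources WHERE id IN
      (SELECT id FROM moz_subresources ORDER BY last_hit ASC LIMIT :limit);
  -- ...drop pages that no longer reference any subresources...
  DELETE FROM moz_pages WHERE id NOT IN
      (SELECT DISTINCT(pid) FROM moz_subresources);
  -- ...and vacuum to shrink the file on disk.
  VACUUM;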
Attachment #8346174 - Flags: review?(rnewman)
Attachment #8346174 - Flags: review?(honzab.moz)
Blocks: 881804
No longer depends on: 881804
Comment on attachment 8346174 [details] [diff] [review]
seer_cleanup.patch

Review of attachment 8346174 [details] [diff] [review]:
-----------------------------------------------------------------

Overall thoughts:

* This needs some way for us to determine
  (a) how frequently users are hitting these limits, 
  (b) if they're thrashing (time between cleanups), 
  (c) if the cleanup is "failing" (i.e., it's falling back to blowing away the whole DB), 
  (d) whether this cleanup work is crazy expensive (which it looks like it is, given that it involves running a bunch of deletes and then a vacuum).

Telemetry is probably the answer to that.

* There might be a cheap first-pass cleanup you can do: something like "drop all hosts with `loads` less than the median, all subresources with `hits` less than the median". Something to think about, though you probably already did. (A sketch follows these notes.)

* This definitely warrants a test or three, particularly with the foreign key oddity you reported and worked around.
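
For illustration only: SQLite has no built-in median aggregate, so a median-style prune would need to approximate one, e.g. via ORDER BY/LIMIT/OFFSET. A hypothetical sketch:

  -- Drop hosts whose load count is below the (approximate) median.
  DELETE FROM moz_hosts
   WHERE loads < (SELECT loads FROM moz_hosts
                   ORDER BY loads
                   LIMIT 1
                   OFFSET (SELECT COUNT(*) FROM moz_hosts) / 2);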

::: mobile/android/app/mobile.js
@@ +102,5 @@
>  pref("network.buffer.cache.count", 24);
>  pref("network.buffer.cache.size",  16384);
>  
> +// predictive actions
> +pref("network.seer.max-db-size", 15728640); // bytes

This should really be on the order of 2MB. That might require alterations to your 10% threshold in order to avoid thrashing.

Does the Seer ship on b2g?

::: netwerk/base/src/Seer.cpp
@@ +100,5 @@
>  const char SEER_MAX_QUEUE_SIZE_PREF[] = "network.seer.max-queue-size";
>  const uint32_t SEER_MAX_QUEUE_SIZE_DEFAULT = 50;
>  
> +const char SEER_MAX_DB_SIZE_PREF[] = "network.seer.max-db-size";
> +const int32_t SEER_MAX_DB_SIZE_DEFAULT = 150 * 1024 * 1024;

..._DEFAULT_BYTES

@@ +2166,5 @@
>    nsresult rv = EnsureInitStorage();
>    RETURN_IF_FAILED(rv);
>  
>    mDB->ExecuteSimpleSQL(NS_LITERAL_CSTRING("DELETE FROM moz_redirects;"));
>    mDB->ExecuteSimpleSQL(NS_LITERAL_CSTRING("DELETE FROM moz_startup_pages;"));

Is there a reason why all of these statements are executing in their own transactions?

@@ +2179,3 @@
>  
>    // Go ahead and ensure this is flushed to disk
>    NewTransaction();

Sidenote: you also need to vacuum here to reclaim the now-blank pages.

@@ +2197,5 @@
>  }
>  
> +// For whatever reason, sqlite doesn't like to do DELETEs on a table that
> +// provides a foreign key for another table when any of the rows affected by the
> +// DELETE are not actually referenced as foreign keys. So, we provide these

Can you explain what "doesn't like" means in this context?

Figuring this out might buy us some simplicity here.

@@ +2236,5 @@
> +    return;
> +  }
> +
> +  int64_t dbFileSize;
> +  nsresult rv = mDBFile->GetFileSize(&dbFileSize);

I did a little poking around in nsLocalFile.

It doesn't look like, from a cursory inspection, we set the file instance to dirty on write, and so we'll never re-stat, and so this file size will never change. That means this won't work correctly.

(That's even more true if we don't vacuum and auto-vacuum doesn't get triggered!)

As such, I think you need a test for this:

* Populate the seer DB
* Read the file size using this mechanism
* Trigger a prune
* Ensure that the reported file size using the same mechanism is now smaller.

Alternatively, if it *does* (i.e., that test passes), then we're stat'ing the file every single time we call MaybeScheduleCleanup, which seems non-optimal.

Is there a reason why you're not using page_size * page_count instead of raw disk bytes?
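
For reference, both quantities are standard SQLite pragmas, so the size could be computed without a stat():

  PRAGMA page_count;  -- pages currently in the database file
  PRAGMA page_size;   -- bytes per page; the product approximates the file size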

@@ +2257,5 @@
> +
> +void
> +Seer::CleanupOrigins(PRTime now)
> +{
> +  PRTime cutoff = now - ONE_MONTH;

This seems like something that should be platform-conditional. I doubt mobile platforms will want to keep a month's worth of hosts data around under any circumstances.

@@ +2333,5 @@
> +  if (NS_FAILED(rv)) {
> +    return;
> +  }
> +
> +  int64_t evictionCutoff = static_cast<int64_t>(mMaxDBSize) * 0.90;

Lift that magic constant out. And I might suggest that it be smaller to avoid frequent cleanups -- 0.75? 0.5?

@@ +2344,5 @@
> +  PRTime now = PR_Now();
> +  CleanupOrigins(now);
> +  CleanupStartupPages(now);
> +
> +  VacuumDatabase();

That we have to do two vacuums here kinda sucks. So it goes! :/

@@ +2352,5 @@
> +    return;
> +  }
> +
> +  bool canDelete = true;
> +  while (canDelete && (dbFileSize >= evictionCutoff)) {

Return before this loop if dbFileSize is already below evictionCutoff. We won't delete any subresources, so we might as well skip another DELETE and VACUUM.

@@ +2369,5 @@
> +    int32_t subresourcesToDelete = static_cast<int32_t>(percentNeeded * subresourceCount);
> +    if (!subresourcesToDelete) {
> +      // We're getting pretty close to nothing here, anyway, so we may as well
> +      // just trash it all
> +      subresourcesToDelete = subresourceCount;

Probably worth splitting this into a separate SQL query, rather than just setting the limit. Much quicker to run

  DELETE FROM moz_subresources

than the version with the subquery.


Alternatively, do *all* the work in a subquery:

  DELETE FROM moz_subresources WHERE id IN (
    ... 
    LIMIT ((SELECT COUNT(id) FROM moz_subresources) * 0.90)
  )

@@ +2375,5 @@
> +
> +    nsCOMPtr<mozIStorageStatement> deleteStatement = mStatements.GetCachedStatement(
> +        NS_LITERAL_CSTRING("DELETE FROM moz_subresources WHERE id IN "
> +                           "(SELECT id FROM moz_subresources ORDER BY "
> +                           "last_hit ASC LIMIT :limit);"));

/me wonders why we don't build with SQLITE_ENABLE_UPDATE_DELETE_LIMIT.
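
For context: when SQLite is compiled with SQLITE_ENABLE_UPDATE_DELETE_LIMIT, ORDER BY/LIMIT are accepted directly on DELETE, which would replace the subquery above. Illustrative only, since the comment implies our build lacks the option:

  DELETE FROM moz_subresources
   ORDER BY last_hit ASC
   LIMIT :limit;  -- requires SQLITE_ENABLE_UPDATE_DELETE_LIMIT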
Attachment #8346174 - Flags: review?(rnewman) → feedback+
Depends on: 945779
tracking-fennec: ? → 27+
(In reply to Richard Newman [:rnewman] from comment #34)
> * This needs some way for us to determine
>   (a) how frequently users are hitting these limits, 
>   (b) if they're thrashing (time between cleanups), 
>   (c) if the cleanup is "failing" (i.e., it's falling back to blowing away
> the whole DB), 
>   (d) whether this cleanup work is crazy expensive (which it looks like it
> is, given that it involves running a bunch of deletes and then a vacuum).
> 
> Telemetry is probably the answer to that.

Seems sensible.

> * There might be a cheap first-pass cleanup you can do: something like "drop
> all hosts with `loads` less than the median, all subresources with `hits`
> less than the median". Something to think about, though you probably already
> did.

That's effectively what the main cleanup is, anyway. Subresources totally dominate the space used (89% in my db). I should probably drop the first vacuum and subsequent space check and just proceed with deleting from subresources (and the related pages) without doing the extra work. The other cleanup can just be there to kick in a little extra space at the end.

> This should really be on the order of 2MB. That might require alterations to
> your 10% threshold in order to avoid thrashing.

Sounds fine.

> Does the Seer ship on b2g?

Indeed it does, I need to fix that pref up, too.

> ..._DEFAULT_BYTES

Check.

> Is there a reason why all of these statements are executing in their own
> transactions?

Unless ExecuteSimpleSQL always runs in a separate transaction, they're not. EnsureInitStorage() starts a transaction, which persists until it's committed, either manually or through a timer-fired event (which is what bug 945779 is all about).
> Sidenote: you also need to vacuum here to reclaim the now-blank pages.

That's bug 948448 (hoping to land all of these in quick succession).

> Can you explain what "doesn't like" means in this context?

Will add to the comment, but for a quick answer - refuses to delete, and throws a "Foreign key constraint failed" error.

> Figuring this out might buy us some simplicity here.

Indeed, it might. The one thing I wonder is if it's not some artifact of my own personal seer.sqlite, which has existed through all the development of this, and did not always enforce foreign keys on all tables, so there may be some funkiness there that doesn't exist in any other seer.sqlite.

> It doesn't look like, from a cursory inspection, we set the file instance to
> dirty on write, and so we'll never re-stat, and so this file size will never
> change. That means this won't work correctly.

We definitely re-stat, as I cleaned up a database locally with no problem. No infinite loops or anything :)

> Is there a reason why you're not using page_size * page_count instead of raw
> disk bytes?

Not yet :) If the two are comparable in size (I can test locally), I'll gladly avoid doing stat()s

> This seems like something that should be platform-conditional. I doubt
> mobile platforms will want to keep a month's worth of hosts data around
> under any circumstances.

Sensible. Let's say a week for mobile?

> Lift that magic constant out. And I might suggest that it be smaller to
> avoid frequent cleanups -- 0.75? 0.5?

Seems reasonable. I'm inclined to make it even more flexible - make it a pref, with a low value (say .5) for fennec/b2g, and a higher value (.75, or even my existing 0.9) for desktop, since desktop will have more space to use to begin with.

> Return before this loop if dbFileSize is already below evictionCutoff. We
> won't delete any subresources, so we might as well skip another DELETE and
> VACUUM.

Makes sense.

> Probably worth splitting this into a separate SQL query, rather than just
> setting the limit. Much quicker to run
> 
>   DELETE FROM moz_subresources
> 
> than the version with the subquery.

Will do.
(In reply to Nicholas Hurley [:hurley] from comment #35)

> > Is there a reason why all of these statements are executing in their own
> > transactions?
> 
> Unless ExecuteSimpleSQL always runs in a separate transaction, they're not.
> EnsureInitStorage() starts a transaction, which persists until it's
> committed, either manually or through a timer-fired event (which is what bug
> 945779 is all about).

Ah, very good. Missed that in switching between files and patch.


> > Figuring this out might buy us some simplicity here.
> 
> Indeed, it might. The one thing I wonder is if it's not some artifact of my
> own personal seer.sqlite, which has existed through all the development of
> this, and did not always enforce foreign keys on all tables, so there may be
> some funkiness there that doesn't exist in any other seer.sqlite.

Perhaps test with a fresh profile (or, indeed, in a test), and just add a "drop and start over" fallback if it only fails on your profile?

Alternatively, I'd bet that sqlite is right here, and we're doing something funky with the order of deletes...


> > Is there a reason why you're not using page_size * page_count instead of raw
> > disk bytes?
> 
> Not yet :) If the two are comparable in size (I can test locally), I'll
> gladly avoid doing stat()s

They should be. All we're doing here is removing stuff from pages, then vacuuming to reduce the free list, which compacts the file on disk by eliminating the free pages. The disk size should be a multiple of the page size.


> Sensible. Let's say a week for mobile?

WFM.


> Seems reasonable. I'm inclined to make it even more flexible - make it a
> pref, with a low value (say .5) for fennec/b2g, and a higher value (.75, or
> even my existing 0.9) for desktop, since desktop will have more space to use
> to begin with.

Sounds good -- will reduce bouncing and increase the likelihood that we'll do a quick DELETE FROM moz_subresources rather than a subquery.
Based on bug 948569, this is no longer an issue on aurora 28 or beta 27, though I don't know the proper flags to set to indicate as such.
Whiteboard: [MemShrink:P3] → [MemShrink:P3][no-nag]
Comment on attachment 8346174 [details] [diff] [review]
seer_cleanup.patch

Review of attachment 8346174 [details] [diff] [review]:
-----------------------------------------------------------------

In general I still wonder why you are collecting such an enormous amount of data just to predict how many DNS lookups and TCP preconnects to deploy.  Do you really need all subresources to be in the database?

Please address also Richard's comments.

::: netwerk/base/src/Seer.cpp
@@ +896,5 @@
>  
>      Telemetry::AccumulateTimeDelta(Telemetry::SEER_PREDICT_WORK_TIME,
>                                     startTime);
>  
> +    gSeer->MaybeScheduleCleanup();

Why doesn't your commit timer do this?

@@ +2198,5 @@
>  
> +// For whatever reason, sqlite doesn't like to do DELETEs on a table that
> +// provides a foreign key for another table when any of the rows affected by the
> +// DELETE are not actually referenced as foreign keys. So, we provide these
> +// functions to disable foreign keys temporarily for just that situation.

The purpose of foreign keys is to make exactly this kind of deletion simple, so this must work.  Aren't you missing something in the foreign key declaration that would allow this?

@@ +2204,5 @@
> +Seer::DisableForeignKeys()
> +{
> +  CommitTransaction();
> +  mDB->ExecuteSimpleSQL(NS_LITERAL_CSTRING("PRAGMA foreign_keys = OFF;"));
> +  BeginTransaction();

And why BeginTransaction after and not before?

@@ +2248,5 @@
> +
> +  mCleanupScheduled = true;
> +
> +  nsRefPtr<SeerCleanupEvent> event = new SeerCleanupEvent();
> +  rv = NS_DispatchToCurrentThread(event);

Use your mIOThread here.

@@ +2260,5 @@
> +{
> +  PRTime cutoff = now - ONE_MONTH;
> +
> +  nsCOMPtr<mozIStorageStatement> deleteOrigins = mStatements.GetCachedStatement(
> +      NS_LITERAL_CSTRING("DELETE FROM moz_hosts WHERE last_load <= :cutoff"));

having index?
Attachment #8346174 - Flags: review?(honzab.moz) → review-
>I chose a (seemingly) small value for the seer db on android (15MB), but we can cut that even further if necessary.

Our Android users generally start filing bugs if the *total* profile storage exceeds 20MB, as they're using the browser on phones with only a few hundred megs free, shared between all apps and (due to security concerns) our profile data. One of the most recurring, consistent complaints about Firefox is that it's a pig due to our huge profiles.

What I'm saying is, aim an order of magnitude lower for the low-end.

Is the SQLite memory cache size for this thing hard-limited? I'm seeing 2M on desktop right now, but I don't know if that's intentional or if it can end up more with the right dataset.
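
(For context, SQLite's page cache is capped per connection via a pragma; the value below is an example, not necessarily what Firefox sets:)

  PRAGMA cache_size = -2048;  -- negative values cap the cache in KiB (~2MB here)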
(In reply to Gian-Carlo Pascutto (:gcp) from comment #39)

> What I'm saying is, aim an order of magnitude lower for the low-end.

My first review (Comment 34) asked for this to be a 2MB cap.

(At that point, of course, the SQLite connection "weighs" as much as the database itself, which implies we should switch to a better representation, unless it's smart enough to cache everything in RAM.)

Eventually this needs to just drop the whole DB and maybe disable the feature when we're low on storage. If I have 30MB free on my phone, there's no way that burning 2MB of that (and 2MB of connection mem) is worth any incremental speedup.
Status: NEW → ASSIGNED
>(At that point, of course, the SQLite connection "weighs" as much as the database itself, which implies we should switch to a better representation, unless it's smart enough to cache everything in RAM.)

about:memory claims most of that 2M is cache, so yes, at that point the database is mostly in RAM (the connection overhead is surely way less). I notice there are a bunch of SQLite things that end up close to 2M (except for Places), so this is likely a default we compile in somewhere.
Blocks: 948448
(In reply to Honza Bambas (:mayhemer) from comment #38) 
> In general I still wonder why you are collecting such an enormous amount of
> data just to predict how many DNS lookups and TCP preconnects to deploy.  Do
> you really need all subresources to be in the database?

For The Future! :) It's for when we start getting even fancier with resource prefetch; it was easier to build that in right at the beginning than to have to update the schema later (especially since we knew we were going to be doing prefetch at some point in the not-so-distant future).

> @@ +2260,5 @@
> > +{
> > +  PRTime cutoff = now - ONE_MONTH;
> > +
> > +  nsCOMPtr<mozIStorageStatement> deleteOrigins = mStatements.GetCachedStatement(
> > +      NS_LITERAL_CSTRING("DELETE FROM moz_hosts WHERE last_load <= :cutoff"));
> 
> having index?

Not sure what you mean here, but I think you're asking to have an index on moz_hosts.last_load, is that correct?
(In reply to Nicholas Hurley [:hurley] from comment #42)
> (In reply to Honza Bambas (:mayhemer) from comment #38) 
> > In general I still wonder why you are collecting such an enormous amount
> > of data just to predict how many DNS lookups and TCP preconnects to
> > deploy.  Do you really need all subresources to be in the database?
> 
> For The Future! :) It's for when we start getting even fancier with resource
> prefetch; it was easier to build that in right at the beginning than to have
> to update the schema later (especially since we knew we were going to be
> doing prefetch at some point in the not-so-distant future).

So, you are collecting a huge number of URLs, right?  Interesting to do something like that for a feature we don't have and that is not even designed (AFAIK).

> 
> > @@ +2260,5 @@
> > > +{
> > > +  PRTime cutoff = now - ONE_MONTH;
> > > +
> > > +  nsCOMPtr<mozIStorageStatement> deleteOrigins = mStatements.GetCachedStatement(
> > > +      NS_LITERAL_CSTRING("DELETE FROM moz_hosts WHERE last_load <= :cutoff"));
> > 
> > having index?
> 
> Not sure what you mean here, but I think you're asking to have an index on
> moz_hosts.last_load, is that correct?

Yep.
(In reply to Honza Bambas (:mayhemer) from comment #43)

> > For The Future! :) 

Speaking for Fennec: there is essentially no feature important enough to warrant speculatively using additional storage space, let alone a feature that isn't even in the works.

We are *actively* working to reduce the amount of storage space we use.

I would suggest either eliminating the recording of data that isn't directly delivering significant user benefit in the current release, or doing so conditionally in a way that we can turn off on non-desktop platforms.

By all means keep the schema the same, if you want to avoid a migration later. Just don't do all of the writes.

As a point of reference: the feasible upper bound for this database is ~2MB. I'd like it even lower. We support phones that have ~50MB of storage space _total_.
Sorry, I didn't explain the situation very well. The data (specifically, relationships between specific URLs) IS being used for higher-accuracy predictions when we load a page that we've already loaded once. What we aren't using it for yet is actually prefetching the resources.
Attached patch patch v2Splinter Review
New patch, with review comments addressed. Honza, once I get r+ from you, I'll seek r+ from android and b2g folks for the pref changes on their stuff (and thus the sizing of the seer db on their stuff).
Attachment #8346174 - Attachment is obsolete: true
Attachment #8361274 - Flags: review?(honzab.moz)
(In reply to Nicholas Hurley [:hurley] from comment #45)
> Sorry, I didn't explain the situation very well. The data (specifically,
> relationships between specific URLs) IS being used for higher-accuracy
> predictions when we load a page that we've already loaded once. What we
> aren't using it for yet is actually prefetching the resources.

Ok, so you remember the whole URL just to know what DNS and TCP to prefetch?

Also, what is your plan for resource prefetch?  Do you want to immediately start loading all resources the page ever requested, or just render-blocking stuff such as CSS and blocking JS?  Any prioritization, or tokenizing to prevent a flood?  This is all you will probably be limited by.  Hence collecting everything is, IMO, just wasteful.

Not a discussion for this bug, though I still think that a 150MB database will cause more slowness than speedup.  And with everything you store, a 2MB database, on the other hand, will not be able to hold enough useful information for the seer to work well anyway.

Any data on what fits in the 2MB?

...
(In reply to Honza Bambas (:mayhemer) from comment #47)
> Also, what is your plan for resource prefetch?  Do you want to immediately
> start loading all resources the page ever requested, or just render-blocking
> stuff such as CSS and blocking JS?  Any prioritization, or tokenizing to
> prevent a flood?  This is all you will probably be limited by.  Hence
> collecting everything is, IMO, just wasteful.

That's still to be determined, largely. There will, of course, be prioritization and safety built in (my current feeling is that we would never prefetch ANYTHING that wasn't loaded last time the page was loaded). So yes, there are optimizations that can be made to store less data. However, we should move that discussion into another bug (or perhaps a thread on dev-tech-network), and keep this bug limited to the mechanics of limiting space used (regardless of how efficient or not the on-disk storage is). Similarly, if we want to do something different on mobile (forgo the possibility of prefetching entirely and limit ourselves to preconnect and preresolve in order to save space), I'm not against that, but let's have a different bug for that.

> Any data on what fits in the 2mb?

On my desktop profile (standard disclaimer: my browsing habits may not match yours or anyone else's, etc, etc), it's about 2 days worth of information. For my mobile browsing habits, this would probably increase to at least a week's worth of data (my current mobile device isn't rooted, so I can't get at the db file to check).
(In reply to Nicholas Hurley [:hurley] from comment #48)
> (my current mobile device isn't rooted, so I can't get at the db file to check).

https://addons.mozilla.org/en-US/android/addon/copy-profile could probably help you there.
I would tentatively suggest that on mobile the big wins would come from preconnect and preresolve only. If it's a recently visited page, its resources should mostly be in the disk cache anyway. If it's not recent, we'll have dropped those entries from the seer DB!

That implies that the seer DB should be storing only page => resource hostname mappings, which should be significantly more compact than every single resource URL the browser has seen since it last wiped the DB.

(I presume you've measured that the overhead of a ton of DB writes is outweighed by the page load improvements on desktop.)
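
A hedged sketch of what such a page-to-hostname mapping could look like; this schema is entirely hypothetical, not the current seer schema:

  -- Hypothetical compact table: one row per (page, subresource hostname) pair.
  CREATE TABLE moz_page_hosts (
    pid INTEGER NOT NULL REFERENCES moz_pages(id) ON DELETE CASCADE,
    host TEXT NOT NULL,          -- hostname only, no path or query string
    hits INTEGER DEFAULT 0,
    last_hit INTEGER DEFAULT 0,
    PRIMARY KEY (pid, host)
  );

Since hostnames repeat heavily across pages and carry no path data, this would stay far smaller than storing every full subresource URL.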
(In reply to Richard Newman [:rnewman] from comment #50)
> I would tentatively suggest that on mobile the big wins would come from
> preconnect and preresolve only. If it's a recently visited page, its
> resources should mostly be in the disk cache anyway. If it's not recent,
> we'll have dropped those entries from the seer DB!

While true, we can also use the full URIs for prefetching things from cache. Regardless, I've filed bug 960834 for this.

> (I presume you've measured that the overhead of a ton of DB writes is
> outweighed by the page load improvements on desktop.)

Indeed. See bug 881804 comment 103 for the joyous numbers.
Analyzer ran on my m-c working profile:

*** All indices ***************************************************************

Percentage of total database......................  54.0%    

*** Index SUBRESOURCE_PID_URI_INDEX of table MOZ_SUBRESOURCES *****************

Percentage of total database......................  38.3%    

Nick, could we somehow think of removing this index?  Or is it vital?
(In reply to Honza Bambas (:mayhemer) from comment #52)
> Analyzer ran on my m-c working profile:
> 
> *** All indices
> ***************************************************************
> 
> Percentage of total database......................  54.0%    
> 
> *** Index SUBRESOURCE_PID_URI_INDEX of table MOZ_SUBRESOURCES
> *****************
> 
> Percentage of total database......................  38.3%    
> 
> Nick, could we somehow think of removing this index?  Or is it vital?

Right now, the (pid, uri) tuple on moz_subresources is used for updating information about already-known subresources when we are asked to by the image loader, script loader, etc. Removing that index would almost certainly regress at least some of the CPU usage advances made in bug 945779. I would rather have less useful information in the db than have the seer thread be the second-hottest thread during pageload.
Let me also say, I'm not against restructuring the schema of the db (or even switching away from sqlite if it comes to that) to reduce disk usage in general, but I don't want to block this bug (about having a limiting mechanism at all) on work to come up with a more ideal storage system.
Comment on attachment 8361274 [details] [diff] [review]
patch v2

Review of attachment 8361274 [details] [diff] [review]:
-----------------------------------------------------------------

I don't know whether this is OK or not... hard to say. All this new code reminds me of what we do in the cache right now.  My original idea was to have this residing much more efficiently in the HTTP cache; now we have another, separate caching system.

This huge, complex machine based on sqlite makes me feel like we have created a beast that's hard to control.

Maybe not, I don't know.

r+ to move forward

::: modules/libpref/src/init/all.js
@@ +1266,5 @@
>  pref("network.seer.preconnect-min-confidence", 90);
>  pref("network.seer.preresolve-min-confidence", 60);
>  pref("network.seer.redirect-likely-confidence", 75);
>  pref("network.seer.max-queue-size", 50);
> +pref("network.seer.max-db-size", 157286400); // bytes

Really, 150MB?  It seems a bit too much to me.

@@ +1267,5 @@
>  pref("network.seer.preresolve-min-confidence", 60);
>  pref("network.seer.redirect-likely-confidence", 75);
>  pref("network.seer.max-queue-size", 50);
> +pref("network.seer.max-db-size", 157286400); // bytes
> +pref("network.seer.preserve", 80); // percentage

Some comments explaining what this is would be good.

::: netwerk/base/src/Seer.cpp
@@ +2397,5 @@
> +  nsresult rv = mIOThread->Dispatch(event, NS_DISPATCH_NORMAL);
> +  if (NS_FAILED(rv)) {
> +    mCleanupScheduled = false;
> +  }
> +  Telemetry::Accumulate(Telemetry::SEER_CLEANUP_SCHEDULED, mCleanupScheduled);

What if the cleanup runs faster and drops the flag first? :)  I wouldn't rely on probability here; I've seen many things happen...

@@ +2400,5 @@
> +  }
> +  Telemetry::Accumulate(Telemetry::SEER_CLEANUP_SCHEDULED, mCleanupScheduled);
> +}
> +
> +#ifndef ANDROID

And b2g?  Shouldn't this be a pref?

@@ +2420,5 @@
> +  mozStorageStatementScoper scopedOrigins(deleteOrigins);
> +
> +  nsresult rv = deleteOrigins->BindInt32ByName(NS_LITERAL_CSTRING("cutoff"),
> +                                               cutoff);
> +  RETURN_IF_FAILED(rv);

oh, how much I don't like this macro...

@@ +2534,5 @@
> +    } else {
> +      nsCOMPtr<mozIStorageStatement> deleteStatement = mStatements.GetCachedStatement(
> +          NS_LITERAL_CSTRING("DELETE FROM moz_subresources WHERE id IN "
> +                            "(SELECT id FROM moz_subresources ORDER BY "
> +                            "last_hit ASC LIMIT :limit);"));

sqlite> explain query plan DELETE FROM moz_subresources WHERE id IN (SELECT id FROM moz_subresources ORDER BY last_hit ASC LIMIT 1);
0|0|0|SEARCH TABLE moz_subresources USING INTEGER PRIMARY KEY (rowid=?) (~25 rows)
0|0|0|EXECUTE LIST SUBQUERY 0
0|0|0|SCAN TABLE moz_subresources (~1000000 rows)  
0|0|0|USE TEMP B-TREE FOR ORDER BY   <- probably not good


However:

sqlite> CREATE INDEX IF NOT EXISTS moz_subresources_hit_index ON moz_subresources (last_hit);
sqlite> explain query plan DELETE FROM moz_subresources WHERE id IN (SELECT id FROM moz_subresources ORDER BY last_hit ASC LIMIT 1);
0|0|0|SEARCH TABLE moz_subresources USING INTEGER PRIMARY KEY (rowid=?) (~25 rows)
0|0|0|EXECUTE LIST SUBQUERY 0
0|0|0|SCAN TABLE moz_subresources USING COVERING INDEX moz_subresources_hit_index (~1000000 rows)


Depends on whether we want another index, because the two last_hit indexes add some 6% to the overall size of the database...

@@ +2557,5 @@
> +
> +      // Now we clean up pages that no longer reference any subresources
> +      rv = mDB->ExecuteSimpleSQL(
> +          NS_LITERAL_CSTRING("DELETE FROM moz_pages WHERE id NOT IN "
> +                            "(SELECT DISTINCT(pid) FROM moz_subresources);"));

nit: indent
Attachment #8361274 - Flags: review?(honzab.moz) → review+
(In reply to Nicholas Hurley [:hurley] from comment #54)
> Let me also say, I'm not against restructuring the schema of the db (or even
> switching away from sqlite if it comes to that) to reduce disk usage in
> general, but I don't want to block this bug (about having a limiting
> mechanism at all) on work to come up with a more ideal storage system.

Agreed.  I just didn't have a better place to put these ideas.
(In reply to Nicholas Hurley [:hurley] from comment #53)
> 
> Right now, the (pid, uri) tuple on moz_subresources is used for updating
> information about already-known subresources when we are asked to by the
> image loader, script loader, etc. Removing that index would almost certainly
> regress at least some of the CPU usage advances made in bug 945779. I would
> rather have less useful information in the db than have the seer thread be
> the second-hottest thread during pageload.

Consider changing the schema to the following:

CREATE TABLE moz_subresources (
  id integer NOT NULL,
  pid INTEGER NOT NULL,
  uri TEXT NOT NULL,
  hits INTEGER DEFAULT 0,
  last_hit INTEGER DEFAULT 0,
  PRIMARY KEY(pid, uri),
  FOREIGN KEY(pid) REFERENCES moz_pages(id) ON DELETE CASCADE
) WITHOUT ROWID;
CREATE INDEX subresource_id_index ON moz_subresources (id);

The change converts the PRIMARY KEY from the "id" field to the combination of "pid,uri", thus obviating the need for an index on pid,uri, and saving the corresponding space.  You'll lose autoincrement on the ID column - I don't know if that causes problems for you or not.

Note that in order to achieve the space savings, you must use the WITHOUT ROWID keywords at the end of the CREATE TABLE statement.  That feature is only available in SQLite 3.8.2 and later.
(In reply to Honza Bambas (:mayhemer) from comment #56)
> Agreed.  I just didn't have a better place to put these ideas.

We now have such a place - bug 961209
(In reply to Honza Bambas (:mayhemer) from comment #55)
> Really, 150MB?  It seems a bit too much to me.

It may very well be. At least we have a limit now (and it's preffable!) :) Unless you feel very strongly, let's leave it as is for now, and we can bikeshed a better value on dev-tech-network or in a followup (your choice).

> Some comments explaining what this is would be good.

Done.

> What if the cleanup runs faster and drops the flag first? :)  I wouldn't
> rely on probability here; I've seen many things happen...

Good point. Fixed.

> > +#ifndef ANDROID
> 
> And b2g?  Shouldn't this be a pref?

ANDROID is defined on b2g, as well (hence the earlier usage in this file of #if defined(ANDROID) && !defined(MOZ_WIDGET_GONK) or something like that). We can make this preffable in a followup.

> Depends on whether we want another index, because the two last_hit indexes
> add some 6% to the overall size of the database...

This is probably less important, given how rare (we hope) cleanup is. Let's reevaluate once we have telemetry on how often cleanup is happening.

> nit: indent

Fixed.

As with bug 945779, one last try run (to be sure) before landing.
try run: https://tbpl.mozilla.org/?tree=Try&rev=26849f449f14

https://hg.mozilla.org/integration/mozilla-inbound/rev/5bf92bd8d7ed

Sheriffs - on the off chance this gets backed out, please also back out bug 948448.
https://hg.mozilla.org/mozilla-central/rev/5bf92bd8d7ed
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
Depends on: 966469