Closed Bug 813765 Opened 12 years ago Closed 12 years ago

Device performance goes to hell around the time we poll for updates

Categories

(Firefox OS Graveyard :: General, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-basecamp:+, b2g18 fixed)

RESOLVED FIXED
B2G C3 (12dec-1jan)
blocking-basecamp +
Tracking Status
b2g18 --- fixed

People

(Reporter: cjones, Assigned: philikon)

References

Details

I'm not able to provide very precise STR atm.  However, on yesterday's inbound+gaia I noticed that device performance would absolutely tank while polling for updates.  The worst were O(seconds) full-screen pauses.

This is with a clean flash of a "user" build, no other apps installed.

Suggest bb+ to understand the root cause, and we can go from there.
Component: Gaia → Gaia::System
Assignee: nobody → philipp
blocking-basecamp: ? → +
Priority: -- → P3
I think Brian has been looking at reducing main-thread IO in the cache code. Though I wasn't aware that we're doing sync IO through sqlite from the http code.

Is this the appcache code maybe?
This is almost certainly the update-polling code, which I assume has to interact with appcache.

Beyond that, I have no idea what actual database we're mutating.  It's strange to me that unlink() is showing up in the samples ...
I'll check the appcache code what all we do on the main thread and what not.
Is that profile reliable? It also says that nsHttpChannel::OnDataAvailable calls nsWindowMemoryReporter::CollectReports...
storage::Statement::ExecuteStep is a manually-added sample label (and everything above it in the "stack", I think) so it can definitely be trusted.
fabrice reports this not being slow when testing on unagi.  I just updated my decommissioned dogfood phone and can confirm the device stays usable.

I filed this bug based on testing on otoro, which has worse storage and a crappier fs than unagi.  So please test on otoro.
Keywords: qawanted
Peter, the experience here is so bad I think we should consider bumping this up to the highest priority perf item.
This is so horrifically bad that we're going to have trouble convincing anyone to put a phone in stores.
Severity: normal → critical
Priority: P3 → P1
I can confirm this on Otoro. See the same characteristics in the profile.

https://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDiskCacheDeviceSQL.cpp is full of synchronous SQL a la statement->ExecuteStep(). As far as I can tell, this is all executed on the main thread. Also, each statement will trigger an implicit sqlite transaction, hence the many unlink()s. In an ideal world, there'd be much fewer transactions, perhaps just one for the entire app update.

I instrumented the methods of nsOfflineCacheDevice [1] with more labels. Here's two profile runs with that:

http://people.mozilla.com/~bgirard/cleopatra/#report=fd0aab2993060a41ab7223bc27f040d33bf4f05f
http://people.mozilla.com/~bgirard/cleopatra/#report=af0f03b71ccec25d3eb2146a02568ecd766cf1a3

We see that the bulk of the time is spent in nsOfflineCacheDevice::UpdateEntrySize(). If I read the correctly, this is called somewhat frequently as data is received over the HTTP pipe. So one potential remedy would be to throttle this call.

[1] https://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDiskCacheDeviceSQL.cpp
Philipp, thanks for your investigation!

Thinker, do you think your patch from bug 771420 can fix this?

Also CC Michal as he may know how to optimize (or even remove?) this call on the device consumer level.
See Also: → 771420
(In reply to Honza Bambas (:mayhemer) from comment #11)
> Thinker, do you think your patch from bug 771420 can fix this?

Actually, it *seems* like we just need to do bug 809947. IOW, hook up the hosted app update check to the new API introduced in bug 751754. But please correct me if I'm wrong.
If the regression is not caused by actual appcache updates (i.e. download of all the resources listed in a manifest) and is just caused by update checks (i.e. only a check whether manifest has changed on the server) then yes, bug 809947 is all we need here (safe and simple).
Why is this not proxied to the cache thread?
(In reply to Chris Jones [:cjones] [:warhammer] from comment #0)
> I'm not able to provide very precise STR atm.  However, on yesterday's
> inbound+gaia I noticed that device performance would absolutely tank while
> polling for updates.  The worst were O(seconds) full-screen pauses.
> 
> This is with a clean flash of a "user" build, no other apps installed.
> 
> Suggest bb+ to understand the root cause, and we can go from there.

Since all pre-installed applications are packaged, are they still going through appcache?
Not all pre-installed apps are packaged. At least the marketplace is not.

But yes, almost all apps will be packaged and thus won't be affected by this bug.
(In reply to Andreas Gal :gal from comment #14)
> Why is this not proxied to the cache thread?

Since app cache SQL device doesn't have any thread.  I'm going to work on appcache internal code optimization soon after DOM storage is done.
(In reply to Jonas Sicking (:sicking) from comment #16)
> Not all pre-installed apps are packaged. At least the marketplace is not.
> 
> But yes, almost all apps will be packaged and thus won't be affected by this
> bug.

Almost all apps that come with the phone are packaged, yes. But as the user installs more apps, that ratio can be expected to change, I'd say. We're already seeing horrible performance with just 3 (!) hosted apps installed.
A couple of observations:

* We use the webapp manifest's etag to determine whether we should update the app or not. However, for the Marketplace(s), the server doesn't send us an etag, so every time we do an update check, we think the app has been updated. Do we require the server to implement etags? If so, then we should fix the Marketplace. Although I think we should perhaps be lenient and do an If-Modified-Since query with a timestamp if we don't have an etag on file (this will require us to remember the manifest's download timestamp).

* An update check where we don't actually need to download anything seems to work and looks pretty sane, although there are three UPDATE queries that perhaps could be collapsed into one. Optimizing for this case would make sense since it's going to be a pretty common one (most of the time when the phone checks for updates, there will not be any.) Of course, ideally we shouldn't even get here due to the previous point.

12-04 23:12:04.932   105   105 I Gecko   : -*-*- Webapps.jsm : checkForUpdate for https://marketplace.firefox.com/telefonica/manifest.webapp
12-04 23:12:04.932   105   105 I Gecko   : -*-*- Webapps.jsm : Getting manifest at https://marketplace.firefox.com/telefonica/manifest.webapp, etag: null
12-04 23:12:06.473   105   105 I Gecko   : -*-*- Webapps.jsm : Downloaded manifest at https://marketplace.firefox.com/telefonica/manifest.webapp, status is 200, etag is null
12-04 23:12:06.483   105   105 I Gecko   : -*-*- Webapps.jsm : updateHostedApp marketplace
12-04 23:12:06.503   105   105 I Gecko   : -*-*- Webapps.jsm : startOfflineCacheDownload https://marketplace.firefox.com/telefonica/manifest.webapp
12-04 23:12:06.503   105   105 I Gecko   : -*-*- Webapps.jsm : Creating AppcacheObserver for https://marketplace.firefox.com/telefonica/ - updating
12-04 23:12:17.804   105   105 I Gecko   : -*-*- Webapps.jsm : Offline cache state change for https://marketplace.firefox.com/telefonica/ : 2
12-04 23:12:17.894   105   105 I GeckoStorage: Statement::ExecuteStep(): INSERT INTO moz_cache (ClientID, Key, MetaData, Generation, DataSize, FetchCount, LastFetched, LastModified, ExpirationTime) VALUES(?,?,?,?,?,?,?,?,?);
12-04 23:12:17.944   105   105 I GeckoCache: nsOfflineCacheDevice::OnDataSizeChange(https%3A//marketplace.firefox.com/telefonica/appcache/manifest.appcache%231022+f|0000001354662726|8:https://marketplace.firefox.com/telefonica/appcache/manifest.appcache, 0)
12-04 23:12:17.944   105   105 I GeckoCache: nsOfflineCacheDevice::UpdateEntrySize(https%3A//marketplace.firefox.com/telefonica/appcache/manifest.appcache%231022+f|0000001354662726|8:https://marketplace.firefox.com/telefonica/appcache/manifest.appcache, 0)
12-04 23:12:17.944   105   105 I GeckoStorage: Statement::ExecuteStep(): UPDATE moz_cache SET DataSize = ? WHERE ClientID = ? AND Key = ?;
12-04 23:12:17.974   105   105 I GeckoCache: nsCacheEntryDescriptor::RequestDataSizeChange(5032)
12-04 23:12:17.974   105   105 I GeckoCache: nsOfflineCacheDevice::OnDataSizeChange(https%3A//marketplace.firefox.com/telefonica/appcache/manifest.appcache%231022+f|0000001354662726|8:https://marketplace.firefox.com/telefonica/appcache/manifest.appcache, 5032)
12-04 23:12:17.974   105   105 I GeckoCache: nsOfflineCacheDevice::UpdateEntrySize(https%3A//marketplace.firefox.com/telefonica/appcache/manifest.appcache%231022+f|0000001354662726|8:https://marketplace.firefox.com/telefonica/appcache/manifest.appcache, 5032)
12-04 23:12:17.974   105   105 I GeckoStorage: Statement::ExecuteStep(): UPDATE moz_cache SET DataSize = ? WHERE ClientID = ? AND Key = ?;
12-04 23:12:17.994   105   105 I GeckoStorage: Statement::ExecuteStep(): UPDATE moz_cache SET MetaData = ?, DataSize = ?, FetchCount = ?, LastFetched = ?, LastModified = ?, ExpirationTime = ? WHERE ClientID = ? AND Key = ?;
12-04 23:12:18.084   105   105 E GeckoConsole: Offline cache doesn't need to update, URL=https://marketplace.firefox.com/telefonica/appcache/manifest.appcache

* When we download a file because it has changed, we do the same thing as observed in the previous log: we continuosly UPDATE the data size, presumably as we receive data. Over the course of one file download, it's not uncommon to fire more than 5 UPDATE queries. Judging from my profiles in comment 10, this is what's killing us. We should look into collapsing these queries, or doing them off the main thread.

12-03 22:29:09.313   105   105 I GeckoStorage: Statement::ExecuteStep(): INSERT INTO moz_cache (ClientID, Key, MetaData, Generation, DataSize, FetchCount, LastFetched, LastModified, ExpirationTime) VALUES(?,?,?,?,?,?,?,?,?);
12-03 22:29:09.373   105   105 I GeckoCache: nsOfflineCacheDevice::OnDataSizeChange(https%3A//marketplace-dev.allizom.org/telefonica/appcache/manifest.appcache%231025+f|0000001354573748|0:https://marketplace-dev-cdn.allizom.org/media/img/mkt/grain.png?d99a08c, 0)
12-03 22:29:09.373   105   105 I GeckoCache: nsOfflineCacheDevice::UpdateEntrySize(https%3A//marketplace-dev.allizom.org/telefonica/appcache/manifest.appcache%231025+f|0000001354573748|0:https://marketplace-dev-cdn.allizom.org/media/img/mkt/grain.png?d99a08c, 0)
12-03 22:29:09.373   105   105 I GeckoStorage: Statement::ExecuteStep(): UPDATE moz_cache SET DataSize = ? WHERE ClientID = ? AND Key = ?;
12-03 22:29:09.403   105   105 I GeckoCache: nsCacheEntryDescriptor::RequestDataSizeChange(16384)
12-03 22:29:09.403   105   105 I GeckoCache: nsOfflineCacheDevice::OnDataSizeChange(https%3A//marketplace-dev.allizom.org/telefonica/appcache/manifest.appcache%231025+f|0000001354573748|0:https://marketplace-dev-cdn.allizom.org/media/img/mkt/grain.png?d99a08c, 16384)
12-03 22:29:09.403   105   105 I GeckoCache: nsOfflineCacheDevice::UpdateEntrySize(https%3A//marketplace-dev.allizom.org/telefonica/appcache/manifest.appcache%231025+f|0000001354573748|0:https://marketplace-dev-cdn.allizom.org/media/img/mkt/grain.png?d99a08c, 16384)
12-03 22:29:09.403   105   105 I GeckoStorage: Statement::ExecuteStep(): UPDATE moz_cache SET DataSize = ? WHERE ClientID = ? AND Key = ?;
12-03 22:29:09.433   105   105 I GeckoStorage: Statement::ExecuteStep(): SELECT Sum(DataSize) from moz_cache;
12-03 22:29:09.433   105   105 I GeckoCache: nsCacheEntryDescriptor::RequestDataSizeChange(16384)
12-03 22:29:09.433   105   105 I GeckoCache: nsOfflineCacheDevice::OnDataSizeChange(https%3A//marketplace-dev.allizom.org/telefonica/appcache/manifest.appcache%231025+f|0000001354573748|0:https://marketplace-dev-cdn.allizom.org/media/img/mkt/grain.png?d99a08c, 16384)
12-03 22:29:09.433   105   105 I GeckoCache: nsOfflineCacheDevice::UpdateEntrySize(https%3A//marketplace-dev.allizom.org/telefonica/appcache/manifest.appcache%231025+f|0000001354573748|0:https://marketplace-dev-cdn.allizom.org/media/img/mkt/grain.png?d99a08c, 32768)
12-03 22:29:09.433   105   105 I GeckoStorage: Statement::ExecuteStep(): UPDATE moz_cache SET DataSize = ? WHERE ClientID = ? AND Key = ?;
12-03 22:29:09.463   105   105 I GeckoStorage: Statement::ExecuteStep(): SELECT Sum(DataSize) from moz_cache;
12-03 22:29:09.463   105   105 I GeckoCache: nsCacheEntryDescriptor::RequestDataSizeChange(14729)
12-03 22:29:09.463   105   105 I GeckoCache: nsOfflineCacheDevice::OnDataSizeChange(https%3A//marketplace-dev.allizom.org/telefonica/appcache/manifest.appcache%231025+f|0000001354573748|0:https://marketplace-dev-cdn.allizom.org/media/img/mkt/grain.png?d99a08c, 14729)
12-03 22:29:09.463   105   105 I GeckoCache: nsOfflineCacheDevice::UpdateEntrySize(https%3A//marketplace-dev.allizom.org/telefonica/appcache/manifest.appcache%231025+f|0000001354573748|0:https://marketplace-dev-cdn.allizom.org/media/img/mkt/grain.png?d99a08c, 47497)
12-03 22:29:09.463   105   105 I GeckoStorage: Statement::ExecuteStep(): UPDATE moz_cache SET DataSize = ? WHERE ClientID = ? AND Key = ?;
12-03 22:29:09.493   105   105 I Gecko   : -*-*- Webapps.jsm : Offline cache state change for https://marketplace-dev.allizom.org/telefonica/ : 8
12-03 22:29:09.493   105   105 I Gecko   : -*-*- Webapps.jsm : Offlinecache setStatus to updating for https://marketplace-dev.allizom.org/telefonica/
12-03 22:29:09.503   105   105 I GeckoStorage: Statement::ExecuteStep(): UPDATE moz_cache SET MetaData = ?, DataSize = ?, FetchCount = ?, LastFetched = ?, LastModified = ?, ExpirationTime = ? WHERE ClientID = ? AND Key = ?;
12-03 22:29:09.543   105   105 I Gecko   : -*-*- Webapps.jsm : Offline cache state change for https://marketplace-dev.allizom.org/telefonica/ : 6
12-03 22:29:09.543   105   105 I Gecko   : -*-*- Webapps.jsm : Offlinecache setStatus to updating for https://marketplace-dev.allizom.org/telefonica/
Adding Rob, Potch, and Wil - Since there's some observations and discussions on marketplace, app updates, and appcache.
Thanks Jason.  Bug 818311 filed for adding an etag to the marketplace.
(In reply to Philipp von Weitershausen [:philikon] from comment #19)
> * We use the webapp manifest's etag to determine whether we should update
> the app or not. However, for the Marketplace(s), the server doesn't send us
> an etag, so every time we do an update check, we think the app has been
> updated. Do we require the server to implement etags? If so, then we should
> fix the Marketplace. Although I think we should perhaps be lenient and do an
> If-Modified-Since query with a timestamp if we don't have an etag on file
> (this will require us to remember the manifest's download timestamp).

Bug 818311 was filed to add etags to Marketplace, which is fine. This is the first I've heard of adding etags to hosted app manifests. We do have etags on our packaged app's mini-manifests.

If etags affect the performance of updates it seems like a bad idea to assume all hosted apps will make this extra effort.

If we want them to, we should document so. There's no reference of etags here:
https://developer.mozilla.org/en-US/docs/Apps/Manifest#Serving_manifests
(In reply to Philipp von Weitershausen [:philikon] from comment #19)
> A couple of observations:
> 
> * We use the webapp manifest's etag to determine whether we should update
> the app or not. However, for the Marketplace(s), the server doesn't send us
> an etag, so every time we do an update check, we think the app has been
> updated. Do we require the server to implement etags? If so, then we should
> fix the Marketplace. Although I think we should perhaps be lenient and do an
> If-Modified-Since query with a timestamp if we don't have an etag on file
> (this will require us to remember the manifest's download timestamp).
> 
> * An update check where we don't actually need to download anything seems to
> work and looks pretty sane, although there are three UPDATE queries that
> perhaps could be collapsed into one. Optimizing for this case would make
> sense since it's going to be a pretty common one (most of the time when the
> phone checks for updates, there will not be any.) Of course, ideally we
> shouldn't even get here due to the previous point.

Bug 771420 is supposed to a solution of this issue by avoiding any update if nothing is changed.

> 
> * When we download a file because it has changed, we do the same thing as
> observed in the previous log: we continuosly UPDATE the data size,
> presumably as we receive data. Over the course of one file download, it's
> not uncommon to fire more than 5 UPDATE queries. Judging from my profiles in
> comment 10, this is what's killing us. We should look into collapsing these
> queries, or doing them off the main thread.

Again! Bug 771420.  In my first implementation for that bug, it does not do any update until last moment, all files had been downloaded.  But, it is a big change, and it is a bit out of scope of bug 771420, so it was dropped.  @honzab, how do you think?  Should we open a new bug for that, or resolve this issue in bug 771420.

Bug 771420, now, seems to get higher priority.  I will shift back to bug 771420.
Depends on: 771420
I think we have few simpler thinks to do here first:
- make the server providing web app manifests work with ETags (that is something every web server provides automatically, it just may not be on by default, and is a normal and transparent way to optimize things), bug 818311
- use the checkForUpdate API to check for updates, that API is not doing any UPDATEs as well, bug 809947 (with r? patch now)

I personally would rather make nsDiskCacheDeviceSQL work fully asynchronously on a background thread (change its API) instead of what the current patch in bug 771420 does.  That is concept/work I would like to coordinate on with Thinker after I finish my work on DOM storage (couple of weeks).
Depends on: 818311, 809947
Important note to every one here: *please be careful when using the word "update", there are a) web apps that update and there are b) offline cache manifests (and content) that update.*  It is not always clear about which of the two updates is being spoken of.  Thanks.
(In reply to Honza Bambas (:mayhemer) from comment #24)
> I think we have few simpler thinks to do here first:
> - make the server providing web app manifests work with ETags (that is
> something every web server provides automatically, it just may not be on by
> default, and is a normal and transparent way to optimize things), bug 818311

But it's easy for any web author to forget to implement etags and therefore ruin the performance of the device for every FxOS user. Regularly. That's not ok.

> - use the checkForUpdate API to check for updates, that API is not doing any
> UPDATEs as well, bug 809947 (with r? patch now)

Great!

> I personally would rather make nsDiskCacheDeviceSQL work fully
> asynchronously on a background thread (change its API) instead of what the
> current patch in bug 771420 does.  That is concept/work I would like to
> coordinate on with Thinker after I finish my work on DOM storage (couple of
> weeks).

I agree that this is a good long-term proper fix. But we need remedies that work for v1 (now in mozilla-beta). Can we do something smaller than Thinker's patch that would be acceptable to land as a short-term remedy?
If SQLite update queries on the main thread are the culprit here then bug 809947 should definitely help.

Most servers support etagging by default:
http://httpd.apache.org/docs/2.2/mod/core.html#fileetag (see the default values)
http://technet.microsoft.com/en-us/library/ee619764%28v=ws.10%29.aspx

And there are also other ways to prevent net requests at all or provide info to the server to get just 304.

For app cache manifests we plan to make requests more unconditional, so it means to communicate with the server even more.

Let's see how this improves after those two bugs get fixed.
(In reply to Honza Bambas (:mayhemer) from comment #27)
> If SQLite update queries on the main thread are the culprit here then bug
> 809947 should definitely help.

I can confirm that with bug 809947, update checks are much much better. They feel almost instant with a good wifi connection and there seems to be no mainthread disk I/O anymore. So the typical case of the app not changing much or at all is now covered. But the fact that web authors (who might as well push updates daily) can arbitrarily ruin our device's performance like this makes me very uneasy.

> Most servers support etagging by default:
> http://httpd.apache.org/docs/2.2/mod/core.html#fileetag (see the default
> values)
> http://technet.microsoft.com/en-us/library/ee619764%28v=ws.10%29.aspx
> 
> And there are also other ways to prevent net requests at all or provide info
> to the server to get just 304.

Correct, e.g. the If-Modified-Since which I suggested earlier.

> For app cache manifests we plan to make requests more unconditional, so it
> means to communicate with the server even more.

Hmm ok. I'm not going to question the reasoning behind this without having any background information, but consider me being suspicious ;). That said, network I/O is typically off thread and not nearly as bad as disk I/O on these devices.
Target Milestone: --- → B2G C3 (12dec-1jan)
Looks like this can be resolved but we need a new bug for "the fact that web authors (who might as well push updates daily) can arbitrarily ruin our device's performance like this makes me very uneasy."

philikon - If you agree, can you resolve this bug and file the follow-up bug for the app specific update issue with any details that you have from your investigation?
Sure ok. Filed bug 822833, resolving this one.
Status: NEW → RESOLVED
Closed: 12 years ago
Depends on: 822833
Resolution: --- → FIXED
Component: General → Gaia
Component: Gaia → General
You need to log in before you can comment on or make changes to this bug.