Closed
Bug 813765
Opened 12 years ago
Closed 12 years ago
Device performance goes to hell around the time we poll for updates
Categories
(Firefox OS Graveyard :: General, defect, P1)
Tracking
(blocking-basecamp:+, b2g18 fixed)
Tracking | Status | |
---|---|---|
b2g18 | --- | fixed |
People
(Reporter: cjones, Assigned: philikon)
References
Details
I'm not able to provide very precise STR atm. However, on yesterday's inbound+gaia I noticed that device performance would absolutely tank while polling for updates. The worst were O(seconds) full-screen pauses.
This is with a clean flash of a "user" build, no other apps installed.
Suggest bb+ to understand the root cause, and we can go from there.
Updated•12 years ago
|
Component: Gaia → Gaia::System
Updated•12 years ago
|
Assignee: nobody → philipp
blocking-basecamp: ? → +
Priority: -- → P3
Reporter | ||
Comment 1•12 years ago
|
||
We're doing synchronous main-thread disk IO, in the master b2g process, on behalf of sqlite.
http://people.mozilla.com/~bgirard/cleopatra/#report=5e149f1329c89bc64684b6d4ca79c61f6d34cd5b&filter=%255B%257B%2522type%2522%253A%2522RangeSampleFilter%2522%252C%2522start%2522%253A607%252C%2522end%2522%253A1889%257D%255D&selection=%255B%2522Startup%253A%253AXRE_Main%2522%252C%2522Input%253A%253AnsInputStreamPump%253A%253AOnInputStreamReady%2522%252C%2522Input%253A%253AnsInputStreamPump%253A%253AOnStateTransfer%2522%252C%2522network%253A%253AnsHttpChannel%253A%253AOnDataAvailable%2522%252C%2522storage%253A%253AStatement%253A%253AExecuteStep%2522%255D
Fucking awesome!
I think Brian has been looking at reducing main-thread IO in the cache code. Though I wasn't aware that we're doing sync IO through sqlite from the http code.
Is this the appcache code maybe?
Reporter | ||
Comment 3•12 years ago
|
||
This is almost certainly the update-polling code, which I assume has to interact with appcache.
Beyond that, I have no idea what actual database we're mutating. It's strange to me that unlink() is showing up in the samples ...
Comment 4•12 years ago
|
||
I'll check the appcache code what all we do on the main thread and what not.
Is that profile reliable? It also says that nsHttpChannel::OnDataAvailable calls nsWindowMemoryReporter::CollectReports...
Reporter | ||
Comment 6•12 years ago
|
||
storage::Statement::ExecuteStep is a manually-added sample label (and everything above it in the "stack", I think) so it can definitely be trusted.
Reporter | ||
Comment 7•12 years ago
|
||
fabrice reports this not being slow when testing on unagi. I just updated my decommissioned dogfood phone and can confirm the device stays usable.
I filed this bug based on testing on otoro, which has worse storage and a crappier fs than unagi. So please test on otoro.
Keywords: qawanted
Reporter | ||
Comment 8•12 years ago
|
||
Peter, the experience here is so bad I think we should consider bumping this up to the highest priority perf item.
Reporter | ||
Comment 9•12 years ago
|
||
This is so horrifically bad that we're going to have trouble convincing anyone to put a phone in stores.
Severity: normal → critical
Priority: P3 → P1
Component: Gaia::System → General
Assignee | ||
Comment 10•12 years ago
|
||
I can confirm this on Otoro. See the same characteristics in the profile.
https://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDiskCacheDeviceSQL.cpp is full of synchronous SQL a la statement->ExecuteStep(). As far as I can tell, this is all executed on the main thread. Also, each statement will trigger an implicit sqlite transaction, hence the many unlink()s. In an ideal world, there'd be much fewer transactions, perhaps just one for the entire app update.
I instrumented the methods of nsOfflineCacheDevice [1] with more labels. Here's two profile runs with that:
http://people.mozilla.com/~bgirard/cleopatra/#report=fd0aab2993060a41ab7223bc27f040d33bf4f05f
http://people.mozilla.com/~bgirard/cleopatra/#report=af0f03b71ccec25d3eb2146a02568ecd766cf1a3
We see that the bulk of the time is spent in nsOfflineCacheDevice::UpdateEntrySize(). If I read the correctly, this is called somewhat frequently as data is received over the HTTP pipe. So one potential remedy would be to throttle this call.
[1] https://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDiskCacheDeviceSQL.cpp
Comment 11•12 years ago
|
||
Philipp, thanks for your investigation!
Thinker, do you think your patch from bug 771420 can fix this?
Also CC Michal as he may know how to optimize (or even remove?) this call on the device consumer level.
See Also: → 771420
Assignee | ||
Comment 12•12 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #11)
> Thinker, do you think your patch from bug 771420 can fix this?
Actually, it *seems* like we just need to do bug 809947. IOW, hook up the hosted app update check to the new API introduced in bug 751754. But please correct me if I'm wrong.
Comment 13•12 years ago
|
||
If the regression is not caused by actual appcache updates (i.e. download of all the resources listed in a manifest) and is just caused by update checks (i.e. only a check whether manifest has changed on the server) then yes, bug 809947 is all we need here (safe and simple).
Comment 14•12 years ago
|
||
Why is this not proxied to the cache thread?
Comment 15•12 years ago
|
||
(In reply to Chris Jones [:cjones] [:warhammer] from comment #0)
> I'm not able to provide very precise STR atm. However, on yesterday's
> inbound+gaia I noticed that device performance would absolutely tank while
> polling for updates. The worst were O(seconds) full-screen pauses.
>
> This is with a clean flash of a "user" build, no other apps installed.
>
> Suggest bb+ to understand the root cause, and we can go from there.
Since all pre-installed applications are packaged, are they still going through appcache?
Not all pre-installed apps are packaged. At least the marketplace is not.
But yes, almost all apps will be packaged and thus won't be affected by this bug.
Comment 17•12 years ago
|
||
(In reply to Andreas Gal :gal from comment #14)
> Why is this not proxied to the cache thread?
Since app cache SQL device doesn't have any thread. I'm going to work on appcache internal code optimization soon after DOM storage is done.
Assignee | ||
Comment 18•12 years ago
|
||
(In reply to Jonas Sicking (:sicking) from comment #16)
> Not all pre-installed apps are packaged. At least the marketplace is not.
>
> But yes, almost all apps will be packaged and thus won't be affected by this
> bug.
Almost all apps that come with the phone are packaged, yes. But as the user installs more apps, that ratio can be expected to change, I'd say. We're already seeing horrible performance with just 3 (!) hosted apps installed.
Assignee | ||
Comment 19•12 years ago
|
||
A couple of observations:
* We use the webapp manifest's etag to determine whether we should update the app or not. However, for the Marketplace(s), the server doesn't send us an etag, so every time we do an update check, we think the app has been updated. Do we require the server to implement etags? If so, then we should fix the Marketplace. Although I think we should perhaps be lenient and do an If-Modified-Since query with a timestamp if we don't have an etag on file (this will require us to remember the manifest's download timestamp).
* An update check where we don't actually need to download anything seems to work and looks pretty sane, although there are three UPDATE queries that perhaps could be collapsed into one. Optimizing for this case would make sense since it's going to be a pretty common one (most of the time when the phone checks for updates, there will not be any.) Of course, ideally we shouldn't even get here due to the previous point.
12-04 23:12:04.932 105 105 I Gecko : -*-*- Webapps.jsm : checkForUpdate for https://marketplace.firefox.com/telefonica/manifest.webapp
12-04 23:12:04.932 105 105 I Gecko : -*-*- Webapps.jsm : Getting manifest at https://marketplace.firefox.com/telefonica/manifest.webapp, etag: null
12-04 23:12:06.473 105 105 I Gecko : -*-*- Webapps.jsm : Downloaded manifest at https://marketplace.firefox.com/telefonica/manifest.webapp, status is 200, etag is null
12-04 23:12:06.483 105 105 I Gecko : -*-*- Webapps.jsm : updateHostedApp marketplace
12-04 23:12:06.503 105 105 I Gecko : -*-*- Webapps.jsm : startOfflineCacheDownload https://marketplace.firefox.com/telefonica/manifest.webapp
12-04 23:12:06.503 105 105 I Gecko : -*-*- Webapps.jsm : Creating AppcacheObserver for https://marketplace.firefox.com/telefonica/ - updating
12-04 23:12:17.804 105 105 I Gecko : -*-*- Webapps.jsm : Offline cache state change for https://marketplace.firefox.com/telefonica/ : 2
12-04 23:12:17.894 105 105 I GeckoStorage: Statement::ExecuteStep(): INSERT INTO moz_cache (ClientID, Key, MetaData, Generation, DataSize, FetchCount, LastFetched, LastModified, ExpirationTime) VALUES(?,?,?,?,?,?,?,?,?);
12-04 23:12:17.944 105 105 I GeckoCache: nsOfflineCacheDevice::OnDataSizeChange(https%3A//marketplace.firefox.com/telefonica/appcache/manifest.appcache%231022+f|0000001354662726|8:https://marketplace.firefox.com/telefonica/appcache/manifest.appcache, 0)
12-04 23:12:17.944 105 105 I GeckoCache: nsOfflineCacheDevice::UpdateEntrySize(https%3A//marketplace.firefox.com/telefonica/appcache/manifest.appcache%231022+f|0000001354662726|8:https://marketplace.firefox.com/telefonica/appcache/manifest.appcache, 0)
12-04 23:12:17.944 105 105 I GeckoStorage: Statement::ExecuteStep(): UPDATE moz_cache SET DataSize = ? WHERE ClientID = ? AND Key = ?;
12-04 23:12:17.974 105 105 I GeckoCache: nsCacheEntryDescriptor::RequestDataSizeChange(5032)
12-04 23:12:17.974 105 105 I GeckoCache: nsOfflineCacheDevice::OnDataSizeChange(https%3A//marketplace.firefox.com/telefonica/appcache/manifest.appcache%231022+f|0000001354662726|8:https://marketplace.firefox.com/telefonica/appcache/manifest.appcache, 5032)
12-04 23:12:17.974 105 105 I GeckoCache: nsOfflineCacheDevice::UpdateEntrySize(https%3A//marketplace.firefox.com/telefonica/appcache/manifest.appcache%231022+f|0000001354662726|8:https://marketplace.firefox.com/telefonica/appcache/manifest.appcache, 5032)
12-04 23:12:17.974 105 105 I GeckoStorage: Statement::ExecuteStep(): UPDATE moz_cache SET DataSize = ? WHERE ClientID = ? AND Key = ?;
12-04 23:12:17.994 105 105 I GeckoStorage: Statement::ExecuteStep(): UPDATE moz_cache SET MetaData = ?, DataSize = ?, FetchCount = ?, LastFetched = ?, LastModified = ?, ExpirationTime = ? WHERE ClientID = ? AND Key = ?;
12-04 23:12:18.084 105 105 E GeckoConsole: Offline cache doesn't need to update, URL=https://marketplace.firefox.com/telefonica/appcache/manifest.appcache
* When we download a file because it has changed, we do the same thing as observed in the previous log: we continuosly UPDATE the data size, presumably as we receive data. Over the course of one file download, it's not uncommon to fire more than 5 UPDATE queries. Judging from my profiles in comment 10, this is what's killing us. We should look into collapsing these queries, or doing them off the main thread.
12-03 22:29:09.313 105 105 I GeckoStorage: Statement::ExecuteStep(): INSERT INTO moz_cache (ClientID, Key, MetaData, Generation, DataSize, FetchCount, LastFetched, LastModified, ExpirationTime) VALUES(?,?,?,?,?,?,?,?,?);
12-03 22:29:09.373 105 105 I GeckoCache: nsOfflineCacheDevice::OnDataSizeChange(https%3A//marketplace-dev.allizom.org/telefonica/appcache/manifest.appcache%231025+f|0000001354573748|0:https://marketplace-dev-cdn.allizom.org/media/img/mkt/grain.png?d99a08c, 0)
12-03 22:29:09.373 105 105 I GeckoCache: nsOfflineCacheDevice::UpdateEntrySize(https%3A//marketplace-dev.allizom.org/telefonica/appcache/manifest.appcache%231025+f|0000001354573748|0:https://marketplace-dev-cdn.allizom.org/media/img/mkt/grain.png?d99a08c, 0)
12-03 22:29:09.373 105 105 I GeckoStorage: Statement::ExecuteStep(): UPDATE moz_cache SET DataSize = ? WHERE ClientID = ? AND Key = ?;
12-03 22:29:09.403 105 105 I GeckoCache: nsCacheEntryDescriptor::RequestDataSizeChange(16384)
12-03 22:29:09.403 105 105 I GeckoCache: nsOfflineCacheDevice::OnDataSizeChange(https%3A//marketplace-dev.allizom.org/telefonica/appcache/manifest.appcache%231025+f|0000001354573748|0:https://marketplace-dev-cdn.allizom.org/media/img/mkt/grain.png?d99a08c, 16384)
12-03 22:29:09.403 105 105 I GeckoCache: nsOfflineCacheDevice::UpdateEntrySize(https%3A//marketplace-dev.allizom.org/telefonica/appcache/manifest.appcache%231025+f|0000001354573748|0:https://marketplace-dev-cdn.allizom.org/media/img/mkt/grain.png?d99a08c, 16384)
12-03 22:29:09.403 105 105 I GeckoStorage: Statement::ExecuteStep(): UPDATE moz_cache SET DataSize = ? WHERE ClientID = ? AND Key = ?;
12-03 22:29:09.433 105 105 I GeckoStorage: Statement::ExecuteStep(): SELECT Sum(DataSize) from moz_cache;
12-03 22:29:09.433 105 105 I GeckoCache: nsCacheEntryDescriptor::RequestDataSizeChange(16384)
12-03 22:29:09.433 105 105 I GeckoCache: nsOfflineCacheDevice::OnDataSizeChange(https%3A//marketplace-dev.allizom.org/telefonica/appcache/manifest.appcache%231025+f|0000001354573748|0:https://marketplace-dev-cdn.allizom.org/media/img/mkt/grain.png?d99a08c, 16384)
12-03 22:29:09.433 105 105 I GeckoCache: nsOfflineCacheDevice::UpdateEntrySize(https%3A//marketplace-dev.allizom.org/telefonica/appcache/manifest.appcache%231025+f|0000001354573748|0:https://marketplace-dev-cdn.allizom.org/media/img/mkt/grain.png?d99a08c, 32768)
12-03 22:29:09.433 105 105 I GeckoStorage: Statement::ExecuteStep(): UPDATE moz_cache SET DataSize = ? WHERE ClientID = ? AND Key = ?;
12-03 22:29:09.463 105 105 I GeckoStorage: Statement::ExecuteStep(): SELECT Sum(DataSize) from moz_cache;
12-03 22:29:09.463 105 105 I GeckoCache: nsCacheEntryDescriptor::RequestDataSizeChange(14729)
12-03 22:29:09.463 105 105 I GeckoCache: nsOfflineCacheDevice::OnDataSizeChange(https%3A//marketplace-dev.allizom.org/telefonica/appcache/manifest.appcache%231025+f|0000001354573748|0:https://marketplace-dev-cdn.allizom.org/media/img/mkt/grain.png?d99a08c, 14729)
12-03 22:29:09.463 105 105 I GeckoCache: nsOfflineCacheDevice::UpdateEntrySize(https%3A//marketplace-dev.allizom.org/telefonica/appcache/manifest.appcache%231025+f|0000001354573748|0:https://marketplace-dev-cdn.allizom.org/media/img/mkt/grain.png?d99a08c, 47497)
12-03 22:29:09.463 105 105 I GeckoStorage: Statement::ExecuteStep(): UPDATE moz_cache SET DataSize = ? WHERE ClientID = ? AND Key = ?;
12-03 22:29:09.493 105 105 I Gecko : -*-*- Webapps.jsm : Offline cache state change for https://marketplace-dev.allizom.org/telefonica/ : 8
12-03 22:29:09.493 105 105 I Gecko : -*-*- Webapps.jsm : Offlinecache setStatus to updating for https://marketplace-dev.allizom.org/telefonica/
12-03 22:29:09.503 105 105 I GeckoStorage: Statement::ExecuteStep(): UPDATE moz_cache SET MetaData = ?, DataSize = ?, FetchCount = ?, LastFetched = ?, LastModified = ?, ExpirationTime = ? WHERE ClientID = ? AND Key = ?;
12-03 22:29:09.543 105 105 I Gecko : -*-*- Webapps.jsm : Offline cache state change for https://marketplace-dev.allizom.org/telefonica/ : 6
12-03 22:29:09.543 105 105 I Gecko : -*-*- Webapps.jsm : Offlinecache setStatus to updating for https://marketplace-dev.allizom.org/telefonica/
Comment 20•12 years ago
|
||
Adding Rob, Potch, and Wil - Since there's some observations and discussions on marketplace, app updates, and appcache.
Comment 21•12 years ago
|
||
Thanks Jason. Bug 818311 filed for adding an etag to the marketplace.
Comment 22•12 years ago
|
||
(In reply to Philipp von Weitershausen [:philikon] from comment #19)
> * We use the webapp manifest's etag to determine whether we should update
> the app or not. However, for the Marketplace(s), the server doesn't send us
> an etag, so every time we do an update check, we think the app has been
> updated. Do we require the server to implement etags? If so, then we should
> fix the Marketplace. Although I think we should perhaps be lenient and do an
> If-Modified-Since query with a timestamp if we don't have an etag on file
> (this will require us to remember the manifest's download timestamp).
Bug 818311 was filed to add etags to Marketplace, which is fine. This is the first I've heard of adding etags to hosted app manifests. We do have etags on our packaged app's mini-manifests.
If etags affect the performance of updates it seems like a bad idea to assume all hosted apps will make this extra effort.
If we want them to, we should document so. There's no reference of etags here:
https://developer.mozilla.org/en-US/docs/Apps/Manifest#Serving_manifests
Comment 23•12 years ago
|
||
(In reply to Philipp von Weitershausen [:philikon] from comment #19)
> A couple of observations:
>
> * We use the webapp manifest's etag to determine whether we should update
> the app or not. However, for the Marketplace(s), the server doesn't send us
> an etag, so every time we do an update check, we think the app has been
> updated. Do we require the server to implement etags? If so, then we should
> fix the Marketplace. Although I think we should perhaps be lenient and do an
> If-Modified-Since query with a timestamp if we don't have an etag on file
> (this will require us to remember the manifest's download timestamp).
>
> * An update check where we don't actually need to download anything seems to
> work and looks pretty sane, although there are three UPDATE queries that
> perhaps could be collapsed into one. Optimizing for this case would make
> sense since it's going to be a pretty common one (most of the time when the
> phone checks for updates, there will not be any.) Of course, ideally we
> shouldn't even get here due to the previous point.
Bug 771420 is supposed to a solution of this issue by avoiding any update if nothing is changed.
>
> * When we download a file because it has changed, we do the same thing as
> observed in the previous log: we continuosly UPDATE the data size,
> presumably as we receive data. Over the course of one file download, it's
> not uncommon to fire more than 5 UPDATE queries. Judging from my profiles in
> comment 10, this is what's killing us. We should look into collapsing these
> queries, or doing them off the main thread.
Again! Bug 771420. In my first implementation for that bug, it does not do any update until last moment, all files had been downloaded. But, it is a big change, and it is a bit out of scope of bug 771420, so it was dropped. @honzab, how do you think? Should we open a new bug for that, or resolve this issue in bug 771420.
Bug 771420, now, seems to get higher priority. I will shift back to bug 771420.
Comment 24•12 years ago
|
||
I think we have few simpler thinks to do here first:
- make the server providing web app manifests work with ETags (that is something every web server provides automatically, it just may not be on by default, and is a normal and transparent way to optimize things), bug 818311
- use the checkForUpdate API to check for updates, that API is not doing any UPDATEs as well, bug 809947 (with r? patch now)
I personally would rather make nsDiskCacheDeviceSQL work fully asynchronously on a background thread (change its API) instead of what the current patch in bug 771420 does. That is concept/work I would like to coordinate on with Thinker after I finish my work on DOM storage (couple of weeks).
Updated•12 years ago
|
Comment 25•12 years ago
|
||
Important note to every one here: *please be careful when using the word "update", there are a) web apps that update and there are b) offline cache manifests (and content) that update.* It is not always clear about which of the two updates is being spoken of. Thanks.
Assignee | ||
Comment 26•12 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #24)
> I think we have few simpler thinks to do here first:
> - make the server providing web app manifests work with ETags (that is
> something every web server provides automatically, it just may not be on by
> default, and is a normal and transparent way to optimize things), bug 818311
But it's easy for any web author to forget to implement etags and therefore ruin the performance of the device for every FxOS user. Regularly. That's not ok.
> - use the checkForUpdate API to check for updates, that API is not doing any
> UPDATEs as well, bug 809947 (with r? patch now)
Great!
> I personally would rather make nsDiskCacheDeviceSQL work fully
> asynchronously on a background thread (change its API) instead of what the
> current patch in bug 771420 does. That is concept/work I would like to
> coordinate on with Thinker after I finish my work on DOM storage (couple of
> weeks).
I agree that this is a good long-term proper fix. But we need remedies that work for v1 (now in mozilla-beta). Can we do something smaller than Thinker's patch that would be acceptable to land as a short-term remedy?
Comment 27•12 years ago
|
||
If SQLite update queries on the main thread are the culprit here then bug 809947 should definitely help.
Most servers support etagging by default:
http://httpd.apache.org/docs/2.2/mod/core.html#fileetag (see the default values)
http://technet.microsoft.com/en-us/library/ee619764%28v=ws.10%29.aspx
And there are also other ways to prevent net requests at all or provide info to the server to get just 304.
For app cache manifests we plan to make requests more unconditional, so it means to communicate with the server even more.
Let's see how this improves after those two bugs get fixed.
Assignee | ||
Comment 28•12 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #27)
> If SQLite update queries on the main thread are the culprit here then bug
> 809947 should definitely help.
I can confirm that with bug 809947, update checks are much much better. They feel almost instant with a good wifi connection and there seems to be no mainthread disk I/O anymore. So the typical case of the app not changing much or at all is now covered. But the fact that web authors (who might as well push updates daily) can arbitrarily ruin our device's performance like this makes me very uneasy.
> Most servers support etagging by default:
> http://httpd.apache.org/docs/2.2/mod/core.html#fileetag (see the default
> values)
> http://technet.microsoft.com/en-us/library/ee619764%28v=ws.10%29.aspx
>
> And there are also other ways to prevent net requests at all or provide info
> to the server to get just 304.
Correct, e.g. the If-Modified-Since which I suggested earlier.
> For app cache manifests we plan to make requests more unconditional, so it
> means to communicate with the server even more.
Hmm ok. I'm not going to question the reasoning behind this without having any background information, but consider me being suspicious ;). That said, network I/O is typically off thread and not nearly as bad as disk I/O on these devices.
Updated•12 years ago
|
Target Milestone: --- → B2G C3 (12dec-1jan)
Comment 29•12 years ago
|
||
Looks like this can be resolved but we need a new bug for "the fact that web authors (who might as well push updates daily) can arbitrarily ruin our device's performance like this makes me very uneasy."
philikon - If you agree, can you resolve this bug and file the follow-up bug for the app specific update issue with any details that you have from your investigation?
Assignee | ||
Comment 30•12 years ago
|
||
Sure ok. Filed bug 822833, resolving this one.
Updated•12 years ago
|
Component: General → Gaia
Updated•12 years ago
|
status-b2g18:
--- → fixed
Component: Gaia → General
You need to log in
before you can comment on or make changes to this bug.
Description
•