Closed Bug 729330 Opened 13 years ago Closed 10 years ago

Addon-manager jank: RELEASE SAVEPOINT 'default' is #1 main thread SQL query >100ms on nightly

Categories

(Toolkit :: Add-ons Manager, defect)

Platform: x86 Windows 7
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED

People

(Reporter: taras.mozilla, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [Snappy:P1] c=startup_addons u= p=)

Attachments

(2 files)

And as far as we can tell, XPIProvider.jsm is the caller.
P1 until we can prove it isn't. Might be happening on startup.
Whiteboard: [Snappy] → [Snappy:P1]
(In reply to Taras Glek (:taras) from comment #1)
> P1 until we can prove it isn't. Might be happening on startup.

It'll be happening on startup whenever a new extension is installed or an extension is uninstalled, but not normally. This is likely a symptom of us using transactions around all the DB operations (RELEASE SAVEPOINT commits the transaction) and doing synchronous writes to the DB, which is very difficult to fix, particularly for the startup case.
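To illustrate why RELEASE SAVEPOINT is where the cost shows up, here is a minimal sketch of savepoint-based nested transactions, assuming a mozIStorageConnection named db; the function names are illustrative and this is not the actual XPIProvider.jsm code:

  // Sketch only: nested "transactions" implemented with SQLite savepoints.
  let transactionCount = 0;

  function beginTransaction(db) {
    // Nested SAVEPOINTs with the same name are allowed; only the first one
    // actually opens a write transaction.
    db.executeSimpleSQL("SAVEPOINT 'default'");
    transactionCount++;
  }

  function commitTransaction(db) {
    transactionCount--;
    // Releasing an inner savepoint is cheap. Releasing the outermost one
    // (transactionCount == 0) commits the whole transaction, and with
    // synchronous writes that commit blocks the main thread.
    db.executeSimpleSQL("RELEASE SAVEPOINT 'default'");
  }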
This might be difficult to fix, but this is the most common slow SQL operation (1.5-2x more common than the next one). The distribution (in milliseconds) of slow SQL reports here is: 100.0, 174.0, 237.0, 373.0, 878.0, 2239.0. This means half of the [>100ms] users are spending >237ms on this, and a small number of users are spending 2239ms. We should do this in a more efficient manner.
(In reply to Taras Glek (:taras) from comment #3)
> We should do this in a more efficient manner.

I don't disagree, but it requires rewriting quite a lot of code and only has an impact when users make changes to their add-ons.
This seems really bad if you don't run Firefox often. I just ran into about 9 seconds of cumulative jank on the main thread (soon after startup) to do with updating Firefox and doing add-on update checks. To reproduce: have some outdated add-ons and a Firefox update ready, then reboot the computer. Look at the slow SQL section of about:telemetry and at MOZ_SQLITE_OTHER_SYNC_MAIN_THREAD_MS.
Summary: RELEASE SAVEPOINT 'default' is #1 main thread query >100ms on nightly → Addon-manager jank: RELEASE SAVEPOINT 'default' is #1 main thread query >100ms on nightly
Blair: as a first step, could you provide some guidance for adding telemetry probes to measure the impact of addons manager main-thread queries? Are there easy things we could measure to get more insight into the scope of the problem?
Summary: Addon-manager jank: RELEASE SAVEPOINT 'default' is #1 main thread query >100ms on nightly → Addon-manager jank: RELEASE SAVEPOINT 'default' is #1 main thread SQL query >100ms on nightly
Measuring that might be harder than I first thought :\

SAVEPOINT works by only writing data to disk when the *outer-most* SAVEPOINT is released. So AFAIK we can't usefully measure nested transactions - and there are cases where there will be many of those. We'll have to be smarter about when we start transactions to be able to get any granular data - trying to make sure we have as many outer transactions as possible, and reducing the number of nested transactions. For example: currently, if the DB schema has changed since the last startup, then rebuilding all that data is wrapped in one big outer transaction. (Although creating the schema is a separate outer transaction.)

As for doing the actual measurements:
* Add a name parameter to XPIDatabase.commitTransaction()
* Measure the cost of executing RELEASE SAVEPOINT
* Add that value to a telemetry probe, based on the name parameter

(Or we could add the name to beginTransaction, but we'd need to convert transactionCount to a stack of names.)

Another thing that would be useful is to make sure we're not doing stupid things like adding data and then updating it right after, or adding it and then removing it right after. A debug log of all executed SQL would be useful in figuring that out (and would generally be useful when debugging code that affects the DB).
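As a rough illustration of the measurement idea above - a sketch only, not a patch; recordSavepointTime is a hypothetical helper and the telemetry probe itself would still need to be defined:

  // Sketch: time each RELEASE SAVEPOINT and attribute it to the name passed
  // in by the caller.
  commitTransaction: function XPIDB_commitTransaction(aTxName) {
    this.transactionCount--;
    if (this.initialized) {
      let start = Date.now();
      this.getStatement("releaseSavepoint", "RELEASE SAVEPOINT 'default'")
          .execute();
      // Only the outermost RELEASE actually writes to disk, so nested
      // transactions will mostly report ~0 here (see the caveat above).
      recordSavepointTime(aTxName, Date.now() - start);  // hypothetical helper
    }
  },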
Blocks: StorageJank
I see this query taking time during xpcshell tests, too: "slowSQLStartup":{"mainThread":{"RELEASE SAVEPOINT 'default'":[2,328],"UPDATE addon SET userDisabled=:userDisabled, appDisabled=:appDisabled, softDisabled=:softDisabled, pendingUninstall=:pendingUninstall, applyBackgroundUpdates=:applyBackgroundUpdates WHERE internal_id=:internal_id":[1,187],"UPDATE addon SET active=:active WHERE internal_id=:internal_id":[1,182]},"otherThreads":{}} So in addition to slowing down the user experience, we're hurting our own dev experience too. :(
To clarify, the above means that 328 + 187 + 182 milliseconds of the xpcshell run were spent executing SQL. This happens on every xpcshell invocation, so it's reasonable to assume that there is about a second of add-on manager overhead here.
(In reply to Taras Glek (:taras) from comment #9)
> To clarify, the above means that 328 + 187 + 182 milliseconds of the
> xpcshell run were spent executing SQL. This happens on every xpcshell
> invocation, so it's reasonable to assume that there is about a second of
> add-on manager overhead here.

That's really odd, I didn't think the add-ons manager was initialised during normal xpcshell runs - you have to explicitly call it to start it up. Is it possible these numbers are only coming from the add-ons manager tests themselves?
(In reply to Dave Townsend (:Mossop) from comment #10)
> (In reply to Taras Glek (:taras) from comment #9)
> > To clarify, the above means that 328 + 187 + 182 milliseconds of the
> > xpcshell run were spent executing SQL. This happens on every xpcshell
> > invocation, so it's reasonable to assume that there is about a second of
> > add-on manager overhead here.
>
> That's really odd, I didn't think the add-ons manager was initialised during
> normal xpcshell runs - you have to explicitly call it to start it up. Is it
> possible these numbers are only coming from the add-ons manager tests
> themselves?

Oh, I guess the telemetry xpcshell tests also start up the add-ons manager and that's probably where you're seeing this data come from, right?
(In reply to Dave Townsend (:Mossop) from comment #11)
> Oh, I guess the telemetry xpcshell tests also start up the add-ons manager
> and that's probably where you're seeing this data come from, right?

Yes, they do. It's good to know this cost isn't paid for every xpcshell invocation.
Depends on: asyncXPIProvider
I'm trying to understand whether this slow SQL query, observed during startup, is actually causing relevant issues or not.

I've seen that some telemetry data has been provided on this bug, but if I understand correctly, it applies only to the subset of users that actually experienced the slow query. This means users who just updated the browser, or had add-ons that needed actions related to installation or updates.

Do we already have data about how many times this slow SQL happens, compared to the global number of startups? I'd expect this to be a really small percentage. And in fact it might also only happen in cases where the system is already overloaded, as suggested in the comments here:

https://blog.mozilla.org/vdjeric/2012/05/14/more-telemetry-from-super-slow-startups/

Moving this query to a separate thread requires some work, and unless we want to do it for the sake of removing all the main-thread SQL in the codebase, it's probably a bit safer to leave all the startup queries synchronous (to prevent hangs if an uncaught exception occurs in asynchronous code).

Just for reference, in the "test_startup.js" xpcshell test on my computer, this SQL query takes from 8ms to 11ms to commit the transaction.
(In reply to Paolo Amadini [:paolo] from comment #13)
> I'm trying to understand whether this slow SQL query, observed during
> startup, is actually causing relevant issues or not.
>
> I've seen that some telemetry data has been provided on this bug, but if I
> understand correctly, it applies only to the subset of users that actually
> experienced the slow query. This means users who just updated the browser,
> or had add-ons that needed actions related to installation or updates.
>
> Do we already have data about how many times this slow SQL happens, compared
> to the global number of startups? I'd expect this to be a really small
> percentage.

As I mentioned, this will only happen after the user restarts after changing their extensions in some way (installing, uninstalling, enabling, disabling), or during some Firefox upgrades. In the worst case, if we change the DB schema, we completely rebuild the database from scratch. So yes, I'd agree that it is likely a very small percentage of the startups.
Just because this doesn't happen all the time does not make it low priority. Last I checked, this was the most frequent case of main-thread slow SQL. As is common with I/O, it tends to micro-benchmark better than it is likely to actually behave (especially if it's in the disk cache). In this case, these expensive I/O operations happen during startup when a lot of other I/O is going on, so they are almost guaranteed to exhibit worse performance than in xpcshell. Unfortunately the slow SQL dashboard is down at the moment; I'll post an updated list of queries, prioritized by frequency, as soon as I can get the data.
(In reply to Taras Glek (:taras) from comment #15)
> Unfortunately the slow SQL dashboard is down at the moment; I'll post an
> updated list of queries, prioritized by frequency, as soon as I can get the
> data.

Is this frequency data a percentage compared to the global number of startups (as opposed to the total number of times the code path is actually hit)?

Do you think this piece of data is relevant to inform a decision here, or that we should take action even if this affects a very small number of startups, at the cost of producing startup hangs in some (probably equally rare) cases?
(In reply to Paolo Amadini [:paolo] from comment #16)
> Is this frequency data a percentage compared to the global number of
> startups (as opposed to the total number of times the code path is actually
> hit)?

It compares various causes of slow SQL main-thread jank. We should remove all of it. The only decision to make is what jank to focus on first :)

> Do you think this piece of data is relevant to inform a decision here, or
> that we should take action even if this affects a very small number of
> startups, at the cost of producing startup hangs in some (probably equally
> rare) cases?

We should not introduce hangs.
Here is some data. By frequency, the add-on manager features twice in the top 5 janks. By duration, it features twice in the top dozen. If you combine the two add-on manager janks (RELEASE SAVEPOINT + untracked addons.sqlite), it's our #2 problem behind DOM storage (which we are fixing). However, if this is hard to fix, I would be happy to work on other problems with similar cumulative impact (i.e. download manager + form history). Whichever bug-fixing path reduces the most jank fastest is good by me :)
Forgot to mention, the above data is from the release channel. Add-on jank is more prominent on the development channels due to frequent upgrades.
I've just posted a patch in bug 699839 comment 4 to move all the SQL queries to the asynchronous thread. Using an approach based on the Task.jsm module makes it possible to solve the problem without big rewrites, despite the many special cases and conditionals handled by the code. There may be other risk considerations to weigh, so we'll discuss this possible approach there, in bug 699839.
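Roughly, the Task.jsm approach lets the existing sequential logic stay in place while the statements themselves run off the main thread. A loose sketch - not the actual patch from bug 699839, and promiseExecute is a hypothetical wrapper around the asynchronous statement execution API:

  Components.utils.import("resource://gre/modules/Task.jsm");

  function setAddonDisabledState(aStatement, aParams) {
    return Task.spawn(function () {
      // The statement executes on the storage async thread; the generator
      // resumes on the main thread once the write has completed, so the
      // surrounding control flow keeps its original sequential shape.
      yield promiseExecute(aStatement, aParams);  // hypothetical helper
    });
  }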
This patch adds an identifier to the SQLite statements that create, commit and roll back savepoints, to try to pin down which operations are causing the slow SQL.
Attachment #717921 - Flags: review?(bmcbride)
Comment on attachment 717921 [details] [diff] [review]
Add an identifier to the SQLite savepoint statements

Review of attachment 717921 [details] [diff] [review]:
-----------------------------------------------------------------

I don't know how useful this will be, given what I described in comment 7. But I guess a little info is better than none.

::: toolkit/mozapps/extensions/XPIProviderUtils.js
@@ +421,5 @@
>  * until the outermost transaction is committed. Transactions can be started
>  * even when the database is not yet open in which case they will be started
>  * when the database is first opened.
>  */
> +  beginTransaction: function XPIDB_beginTransaction(txName) {

Convention for this code is to name arguments as aWhatever. Also, this should null-check that argument and throw/log an error - I can imagine it will be easy to accidentally forget to pass in a name otherwise.

@@ +426,3 @@
>    if (this.initialized)
> +    this.getStatement("createSavepoint_" + txName,
> +                      "SAVEPOINT /* " + txName + " */ 'default'")

Any reason why you're adding the new name as a comment here, and keeping the actual savepoint name as 'default' instead of changing that?

@@ +433,5 @@
>  /**
>   * Commits the most recent transaction. The data may still be rolled back if
>   * an outer transaction is rolled back.
>   */
> +  commitTransaction: function XPIDB_commitTransaction(txName) {

commitTransaction and rollbackTransaction shouldn't need to take an argument - it makes the calling code needlessly complex and error-prone. Should instead convert transactionCount to an array, and add the transaction name to it in beginTransaction, so it can be popped here and in rollbackTransaction.
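For clarity, a sketch of what that suggestion might look like - illustrative only, not the landed code; ERROR() is assumed to be the module's logging helper:

  transactionNames: [],

  beginTransaction: function XPIDB_beginTransaction(aName) {
    if (!aName)
      ERROR("beginTransaction called without a transaction name");
    if (this.initialized)
      this.getStatement("createSavepoint", "SAVEPOINT 'default'").execute();
    this.transactionNames.push(aName || "unknown");
  },

  commitTransaction: function XPIDB_commitTransaction() {
    let name = this.transactionNames.pop();
    if (this.initialized)
      this.getStatement("releaseSavepoint", "RELEASE SAVEPOINT 'default'").execute();
    // "name" identifies the operation when attributing the cost of this
    // release (e.g. in a log message or telemetry probe).
  },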
Attachment #717921 - Flags: review?(bmcbride) → review-
Whiteboard: [Snappy:P1] → [Snappy:P1] c=startup_addons u= p=
Bug 853388 is the alternate solution to this.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX
WONTFIX doesn't seem accurate, let's use this to track that we got rid of all the main-thread impact.
Status: RESOLVED → REOPENED
Depends on: 853388
No longer depends on: asyncXPIProvider
Resolution: WONTFIX → ---
Slow SQL telemetry for Nightly shows that the rate of slow SQL operations on addons.sqlite and extensions.sqlite is dropping off; we'll see what happens over the next week or so as more of the pre-Aug 12 nightlies get upgraded. The startup time metrics (SIMPLE_MEASURES_XPI_STARTUP_END and SIMPLE_MEASURES_SESSIONRESTORED) haven't changed much; we'll need to watch and see if those go down as we get more sessions that don't include the database rebuild.
I am assuming this has been fixed by no longer using SQLite in XPIProvider. Regardless, I don't see RELEASE SAVEPOINT anywhere in the codebase anymore.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 10 years ago
Resolution: --- → FIXED