Last Comment Bug 729330 - Addon-manager jank: RELEASE SAVEPOINT 'default' is #1 main thread SQL query >100ms on nightly
: Addon-manager jank: RELEASE SAVEPOINT 'default' is #1 main thread SQL query >...
Status: RESOLVED FIXED
[Snappy:P1] c=startup_addons u= p=
:
Product: Toolkit
Classification: Components
Component: Add-ons Manager (show other bugs)
: unspecified
: x86 Windows 7
: -- normal with 3 votes (vote)
: ---
Assigned To: Nobody; OK to take it and work on it
:
:
Mentors:
Depends on: 853388
Blocks: StorageJank
  Show dependency treegraph
 
Reported: 2012-02-21 14:52 PST by (dormant account)
Modified: 2015-04-09 02:39 PDT (History)
28 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Screenshot of sql dash ordered by cumulative time (43.17 KB, image/png)
2012-08-30 10:19 PDT, (dormant account)
no flags Details
Add an identifier to the SQLITE savepoint statements (14.51 KB, patch)
2013-02-25 09:13 PST, :Irving Reid (No longer working on Firefox)
blair: review-
Details | Diff | Splinter Review

Description (dormant account) 2012-02-21 14:52:59 PST
And as far as we can tell XPIProvider.jsm is caller.
Comment 1 (dormant account) 2012-02-23 12:05:34 PST
P1 until we can prove it isn't. Might be happening on startup.
Comment 2 Dave Townsend [:mossop] 2012-02-23 12:26:26 PST
(In reply to Taras Glek (:taras) from comment #1)
> P1 until we can prove it isn't. Might be happening on startup.

It'll be happening on startup whenever a new extension is installed or an extension is uninstalled, but not normally.

This is likely a symptom of using transactions around all the DB operations (RELEASE SAVEPOINT) commits the transaction, and us doing sync writes to the DB, which is a giant difficulty to fix, particularly for the startup case.
Comment 3 (dormant account) 2012-02-23 14:24:08 PST
This might be difficult to fix, but this is the most common slow sql operation(1.5-2x more common than the next one).

The distribution(in milliseconds) of slow sql reports here is:
100.0,174.0,237.0,373.0,878.0,2239.0

This means half of the [>100ms] users are spending >237ms on this is and some small number of users are spending 2239ms.

We should do this in a more efficient manner.
Comment 4 Dave Townsend [:mossop] 2012-02-23 14:31:54 PST
(In reply to Taras Glek (:taras) from comment #3)
> We should do this in a more efficient manner.

I don't disagree, but it requires rewriting quite a lot of code and only has an impact when users make changes to their add-ons.
Comment 5 (dormant account) 2012-02-28 16:50:41 PST
This seems really bad if you don't run firefox often. I just ran into about 9seconds of cumulative jank on the main thread(soon after startup) to do with updating firefox + doing addon update checks.

To reproduce have some outdated addons, a firefox update ready, then reboot the computer. Look at about:telemetry slow sql and MOZ_SQLITE_OTHER_SYNC_MAIN_THREAD_MS
Comment 6 :Gavin Sharp [email: gavin@gavinsharp.com] 2012-03-23 14:30:24 PDT
Blair: as a first step, could you provide some guidance for adding telemetry probes to measure the impact of addons manager main-thread queries? Are there easy things we could measure to get more insight into the scope of the problem?
Comment 7 Blair McBride [:Unfocused] (UNAVAILABLE) 2012-05-22 21:06:37 PDT
Measuring that might be harder than I first thought :\

SAVEPOINT works by only writing data to disk when the *outer-most* SAVEPOINT is released. So AFAIK we can't usefully measure nested transactions - and there are cases where there will be many of those. We'll have to be smarter about when we start transactions to be able to get any granular data - trying to make sure we have as many outer transaction as possible, and reduce the amount of nested transactions.

For example: Currently, if the DB schema has changed since the last startup, then rebuilding all that data is wrapped in one big outer transaction. (Although, creating the schema is a separate outer transaction.)

As for doing the actual measurements:
* Add a name parameter to XPIDatabase.commitTransation()
* Measure cost of executing RELEASE SAVEPOINT
* Add that value to a telemetry probe, based on the name parameter
(Or could add the name to beginTransaction, but we'd need to convert transactionCount to a stack of names.)

Another thing that would be useful is to just make sure we're not doing stupid things like adding data, then updating it right after. Or adding it, then removing it right after. A debug log of all executed SQL would be useful in figuring that out (and generally be useful when debugging code that affects the DB).
Comment 8 Nathan Froyd [:froydnj] 2012-05-29 12:35:49 PDT
I see this query taking time during xpcshell tests, too:

"slowSQLStartup":{"mainThread":{"RELEASE SAVEPOINT 'default'":[2,328],"UPDATE addon SET userDisabled=:userDisabled, appDisabled=:appDisabled, softDisabled=:softDisabled, pendingUninstall=:pendingUninstall, applyBackgroundUpdates=:applyBackgroundUpdates WHERE internal_id=:internal_id":[1,187],"UPDATE addon SET active=:active WHERE internal_id=:internal_id":[1,182]},"otherThreads":{}}

So in addition to slowing down the user experience, we're hurting our own dev experience too. :(
Comment 9 (dormant account) 2012-05-29 15:11:03 PDT
To clarify, above mean that 328 + 187 + 182 milliseconds of xpcshell were spent executing sql. This happens on every xpcshell invocation, it's reasonable to assume that there about a second of addon manager overhead here.
Comment 10 Dave Townsend [:mossop] 2012-05-30 10:16:09 PDT
(In reply to Taras Glek (:taras) from comment #9)
> To clarify, above mean that 328 + 187 + 182 milliseconds of xpcshell were
> spent executing sql. This happens on every xpcshell invocation, it's
> reasonable to assume that there about a second of addon manager overhead
> here.

That's really odd, I didn't think the add-ons manager was initialised during normal xpcshell runs, you have to explicitly call it to start it up. Is it possible these numbers are only coming from the add-ons manager tests themselves?
Comment 11 Dave Townsend [:mossop] 2012-05-30 10:29:20 PDT
(In reply to Dave Townsend (:Mossop) from comment #10)
> (In reply to Taras Glek (:taras) from comment #9)
> > To clarify, above mean that 328 + 187 + 182 milliseconds of xpcshell were
> > spent executing sql. This happens on every xpcshell invocation, it's
> > reasonable to assume that there about a second of addon manager overhead
> > here.
> 
> That's really odd, I didn't think the add-ons manager was initialised during
> normal xpcshell runs, you have to explicitly call it to start it up. Is it
> possible these numbers are only coming from the add-ons manager tests
> themselves?

Oh, I guess the telemetry xpcshell tests also start up the add-ons manager and that's probably where you're seeing this data come from right?
Comment 12 Nathan Froyd [:froydnj] 2012-05-30 11:35:57 PDT
(In reply to Dave Townsend (:Mossop) from comment #11)
> Oh, I guess the telemetry xpcshell tests also start up the add-ons manager
> and that's probably where you're seeing this data come from right?

Yes, they do.  It's good to know this cost isn't paid for every xpcshell invocation.
Comment 13 :Paolo Amadini 2012-08-28 10:03:04 PDT
I'm trying to understand whether this slow SQL query, observed during startup,
is actually causing relevant issues or not.

I've seen that some telemetry data has been provided on this bug, but if I
understand correctly, it applies only to the subset of users that actually
experienced the slow query. This means users who just updated the browser,
or had add-ons that needed actions related to installation or updates.

Do we already have data about how many times this slow SQL happens, compared to
the global number of startups? I'd expect this to be a really small percentage.
And in fact it might also only happen in case the systems is already overloaded,
as suggested in the comments here:

https://blog.mozilla.org/vdjeric/2012/05/14/more-telemetry-from-super-slow-startups/

Moving this query to a separate thread requires some work, and unless we want to
do it for the sake of removing all the main thread SQL in the codebase, it's
probably a bit safer to leave all the startup queries synchronous (to prevent
hangs if an uncaught exception occurs in asynchronous code).

Just for reference, in the "test_startup.js" xpcshell test on my computer, this
SQL query takes from 8ms to 11ms to commit the transaction.
Comment 14 Dave Townsend [:mossop] 2012-08-28 10:12:03 PDT
(In reply to Paolo Amadini [:paolo] from comment #13)
> I'm trying to understand whether this slow SQL query, observed during
> startup,
> is actually causing relevant issues or not.
> 
> I've seen that some telemetry data has been provided on this bug, but if I
> understand correctly, it applies only to the subset of users that actually
> experienced the slow query. This means users who just updated the browser,
> or had add-ons that needed actions related to installation or updates.
> 
> Do we already have data about how many times this slow SQL happens, compared
> to
> the global number of startups? I'd expect this to be a really small
> percentage.

As I mentioned, this will only happen after the user restarts after changing their extensions in some way (installing, uninstalling, enabling, disabling), or during some Firefox upgrades. In the worst case if we change the DB schema we completely rebuild the database from scratch. So yes I'd agree that it is likely a very small percentage of the startups.
Comment 15 (dormant account) 2012-08-28 10:28:14 PDT
Just because this doesn't happen all the time, does not make it low priority. Last I checked this was the most frequent case of main thread slow-sql. As is common with IO, it tends to micro-benchmark better than it is likely to actually behave(especially if it's in disk cache).
In this case, these expensive IO operations happen during startup when a lot of other IO is going on, so they are almost guaranteed to exhibit worse performance than xpcshell.
Unfortunately the slowsql dashboard is down at the moment, I'll post an updated prioritized(by frequency) list of queries as soon as I can get the data.
Comment 16 :Paolo Amadini 2012-08-28 10:37:33 PDT
(In reply to Taras Glek (:taras) from comment #15)
> Unfortunately the slowsql dashboard is down at the moment, I'll post an
> updated prioritized(by frequency) list of queries as soon as I can get the
> data.

Is this frequency data a precentage compared to the global number of startups
(as opposed to the total number of times the code path is actually hit)?

Do you think this piece of data is relevant to inform a decision here, or that we
should take action even if this affects very small number of startups, at the cost
of producing startup hangs in some (probably equally rare) cases?
Comment 17 (dormant account) 2012-08-28 10:42:18 PDT
(In reply to Paolo Amadini [:paolo] from comment #16)
> (In reply to Taras Glek (:taras) from comment #15)
> > Unfortunately the slowsql dashboard is down at the moment, I'll post an
> > updated prioritized(by frequency) list of queries as soon as I can get the
> > data.
> 
> Is this frequency data a precentage compared to the global number of startups
> (as opposed to the total number of times the code path is actually hit)?

It compares various causes of slowsql main thread jank. We should remove all of it. The only decision to make is what jank to focus on first :)

> 
> Do you think this piece of data is relevant to inform a decision here, or
> that we
> should take action even if this affects very small number of startups, at
> the cost
> of producing startup hangs in some (probably equally rare) cases?

We should not introduce hangs.
Comment 18 (dormant account) 2012-08-30 10:19:02 PDT
Created attachment 656934 [details]
Screenshot of sql dash ordered by cumulative time

Here is some data. By frequency addon manager features twice in top 5 janks. By duration it features twice in top dozen.
If you combine the 2 addon manager janks(release savepoint + untracked addons.sqlite), it's our #2 problem behind dom storage(which we are fixing).

However if this is hard to fix, i would be happy to work on other problems with similar cumulative impact(ie download manager + form history). Which ever bugfixing path reduces most jank fastest is good by me :)
Comment 19 (dormant account) 2012-08-30 10:27:36 PDT
Forgot to mention, above data is from the release channel. Addon jank is more prominent on devel channel due to frequent upgrades.
Comment 20 :Paolo Amadini 2012-08-30 11:23:28 PDT
I've just posted a patch in bug 699839 comment 4, to move all the SQL queries to
the asynchronous thread. Using an approach based on the Task.jsm module makes it
possible to solve the problem without big rewrites, despite the many special cases
and conditionals handled by the code. There may be other risk consideration to do,
so we'll discuss this possible approach there, in bug 699839.
Comment 21 :Irving Reid (No longer working on Firefox) 2013-02-25 09:13:15 PST
Created attachment 717921 [details] [diff] [review]
Add an identifier to the SQLITE savepoint statements

This patch adds an identifier to the SQLITE statements to create, commit and roll back savepoints, to try and pin down which operations are causing the slow SQL operations.
Comment 22 Blair McBride [:Unfocused] (UNAVAILABLE) 2013-02-27 03:23:10 PST
Comment on attachment 717921 [details] [diff] [review]
Add an identifier to the SQLITE savepoint statements

Review of attachment 717921 [details] [diff] [review]:
-----------------------------------------------------------------

I don't know how useful this will be, given what I described in comment 7. But I guess a little info is better than none.

::: toolkit/mozapps/extensions/XPIProviderUtils.js
@@ +421,5 @@
>     * until the outermost transaction is committed. Transactions can be started
>     * even when the database is not yet open in which case they will be started
>     * when the database is first opened.
>     */
> +  beginTransaction: function XPIDB_beginTransaction(txName) {

Convention for this code is to name arguments as aWhatever.

Also, should null-check that argument, and throw/log an error - I can imagine it will be easy to accidentally forget to pass in a name otherwise.

@@ +426,3 @@
>      if (this.initialized)
> +      this.getStatement("createSavepoint_" + txName,
> +                        "SAVEPOINT /* " + txName + " */ 'default'")

Any reason why you're adding the new name as a comment here, and keeping the actual savepoint name as 'default' instead of changing that?

@@ +433,5 @@
>    /**
>     * Commits the most recent transaction. The data may still be rolled back if
>     * an outer transaction is rolled back.
>     */
> +  commitTransaction: function XPIDB_commitTransaction(txName) {

commitTransation and rollbackTransation shouldn't need to take an argument - it makes the calling code needlessly complex and error-prone. Should instead convert transactionCount to an array, and add the transaction name to it in beginTransation, so it can be popped here and in rollbackTransation.
Comment 23 Blair McBride [:Unfocused] (UNAVAILABLE) 2013-08-12 02:39:06 PDT
Bug 853388 is the alternate solution to this.
Comment 24 :Gavin Sharp [email: gavin@gavinsharp.com] 2013-08-12 11:14:50 PDT
WONTFIX doesn't seem accurate, let's use this to track that we got rid of all the main-thread impact.
Comment 25 :Irving Reid (No longer working on Firefox) 2013-08-21 09:09:21 PDT
Slow SQL telemetry for Nightly shows that the rate of slow SQL operations on addons.sqlite and extensions.sqlite is dropping off; we'll see what happens over the next week or so as more of the pre-Aug 12 nightlies get upgraded.

The startup time metrics (SIMPLE_MEASURES_XPI_STARTUP_END and SIMPLE_MEASURES_SESSIONRESTORED) haven't changed much; we'll need to watch and see if those go down as we get more sessions that don't include the database rebuild.
Comment 26 Marco Bonardo [::mak] 2015-04-09 02:39:23 PDT
I am assuming this has been fixed by not using anymore Sqlite in XPIProvider.
Regardless I don't see anymore RELEASE SAVEPOINT anywhere in the codebase.

Note You need to log in before you can comment on or make changes to this bug.