Closed Bug 699839 (asyncXPIProvider) Opened 13 years ago Closed 11 years ago

Investigate avoiding synchronous Storage APIs in XPIProvider.jsm

Categories

(Toolkit :: Add-ons Manager, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: mak, Assigned: Paolo)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 1 obsolete file)

this module is still doing synchronous Storage work, if possible it should be avoided, or workarounded so that there is no risk of thread contention.
Alias: asyncXPIProvider
Cc'ing Blair for his input.

Do we know how much IO happens, where, and whether it's causing real-world problems?
Currently all writes to the database are synchronous. A number of reads are also synchronous, but those generally only(!) happen during startup.

We read and write to the database either during startup when changes have been made to the application or the set of extensions installed, or during runtime when the add-ons manager is opened or when add-ons are installed. I don't have any metrics for how much IO that actually is.

The largest problem here is startup. During startup the add-ons manager is called and we must wait until the add-ons manager has finished taking care of any changes and determined which extensions are meant to be active. Currently we do this by ensuring all of that code uses synchronous reads and writes to the DB.

The runtime code might be easier to switch to pure async. My main concern with doing that is test coverage. A lot of the DB code there is used by startup too which gives us a lot of test coverage. Making two copies of essentially the same function, one async and one sync for startup means that we lose some of that.

We could consider making the startup code full async as well. It wouldn't actually save us much time I think, the main thread would just have to spin an event loop till the add-ons manager finished its thing and said to proceed with startup, but it would allow us keep the code sharing described above to make changing the runtime code safer. The risk here is that if something goes wrong in the add-ons manager code and it fails to send the signal to continue startup, the event loop keeps on spinning and the user sees Firefox never start and a stuck process.

One final option for startup might be a hybrid approach. We could do the database accesses that we need to figure out the active add-ons synchronously (I /think/ these would only be reads), then set up the writes to go asynchronously after we've continued startup. If those writes fail for some reason then we end up in an inconsistent state, but probably no worse than we already do.

A probably very simple way to improve startup straight away would be to use the fact that any DB access during startup is wrapped in a single transaction. Currently we commit the transaction synchronously, but we could probably equally do it asynchronously (except it'd probably break all the tests that assume it completes immediately!).
Comment 2 is a great explanation, I've read the code in the light of it and it
makes perfect sense. Solving this problem is key to avoiding the slow SQL on
startup reported in bug 729330.

The idea of completing the database writes on a separate thread, while startup
goes on, would probably solve the issue in most cases, as long as there are no
more database reads required during subsequent startup phases.

The fact is that SQLite handles transactions per thread, thus I don't think we
can commit a main thread transaction on a separate thread. To commit on a
separate thread, we would need the writes to be executed on that thread as well.

As suggested, we have two options. We may keep the reads on the main thread,
group the writes, and enqueue them at the end of the startup process as
asynchronous Storage statements. Or we can just do all the reads and writes
as asynchronous statements on the same Storage connection (thus sharing the
same asynchronous Storage thread and transaction).

The first approach still requires the database schema creation or migration to
be completed on the main thread when required, before starting the reads. With
the second approach we might do both the schema handling and the writes in the
same transaction on the worker thread.

I'd prefer the second approach, using asynchronous statements everywhere reduces
duplication and possible contention, and it aligns with global Snappy goals. The
risk of ending up in inconsistent state exists with both approaches, and the risk
of hanging on startup while waiting for the provider to complete its operation
can be reduced by carefully written exception handling.

There is a bit of work here, but I can take this bug. Comments?
Blocks: 729330
Attached patch Prrof of concept. (obsolete) — Splinter Review
This is a draft of a solution that moves all the queries and transactions of
XPIProvider to the asynchronous thread. This basically will get rid of all
of the add-on manager's SQL-related jank documented in bug 729330, forever,
without risk of thread contention between sync and async queries.

This took just one day to complete, and as of now passes 258 tests and fails 2.
Exception propagation works the same as before, there is just one error code
that changes when using an asynchronous statement instead of a synchronous one.

The main factors that should be evaluated here are, firstly, if people who
routinely work on the Add-ons Manager code are comfortable with using the
Task.jsm approach (which is based on <http://taskjs.org>), and secondly what
is the impact of using nested event loops in the specific cases.

Unfortunately, XPIProvider adheres to an interface called by AddonsManager as
well as by add-ons themselves. In many places (especially property getters),
we cannot be asynchronous without spinning an inner event loop, whatever
approach we choose (unless we change the consumer interface).

This patch is just a proof of concept (it has duplicate helper functions,
complex one-liners to be removed, no comments, and so on). For clarity, I've
not re-indented pieces of code that are now wrapped in a task function.
Attachment #656959 - Flags: feedback?(dtownsend+bugmail)
You'll want to change your uses of createStatement to createAsyncStatement where you are moving to only calling executeAsync.  createStatement acquires the database connection lock on the main thread.

Since Sync also rather nefariously spins nested event loops, you may want to make sure that the interaction between Sync and the addon-manager cannot cause nested event loop wedging.  (Also, there will be extremely high karma costs to so many nested event loop spinnings!)
Comment on attachment 656959 [details] [diff] [review]
Prrof of concept.

Review of attachment 656959 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not ok with introducing new event loops at runtime
Attachment #656959 - Flags: feedback?(dtownsend+bugmail) → feedback-
(In reply to Dave Townsend (:Mossop) from comment #6)
> I'm not ok with introducing new event loops at runtime

I was in fact looking for feedback from you on the specific cases, to understand
what methods were actually called at runtime, and if we actually needed to wait
for the outcome of the runtime call or not. In some cases, we could just enqueue
the database operations without waiting for the result.

Only in those cases where we actually need to wait for a result, we need to
change the outer interface, adding a callback. I'm also looking for feedback
on how these interface changes could look like, if needed.

Here is a breakdown of the cases I found in the attached patch:

* XPIProvider methods that don't already have a callback: addonChanged,
  updateAddonAppDisabledStates, updateAllAddonDisabledStates, uninstallAddon,
  cancelUninstallAddon. Are they supposed to return only when completed, or
  throw synchronously? If not, we can return early and complete asynchronously.
  Of course, changing these means changing other providers also.

* UpdateChecker methods: onUpdateCheckComplete. I suppose this is already
  expected to do its job asynchronously?

* AddonInstall methods: startInstall. What about this one?

* Addon setters: applyBackgroundUpdates, syncGUID, userDisabled, softDisabled.
  Since we have a memory cache, it seems we could just update the in-memory
  representation of the object synchronously and do a best-effort approach to
  write to the database.

* Addon getters: There are a lot of these that worried me, but now I realize
  they're all populated when we read the add-on asynchronously from the
  database. We can then remove the getter code completely.
(In reply to Paolo Amadini [:paolo] from comment #7)
> (In reply to Dave Townsend (:Mossop) from comment #6)
> > I'm not ok with introducing new event loops at runtime
> 
> I was in fact looking for feedback from you on the specific cases, to
> understand
> what methods were actually called at runtime, and if we actually needed to
> wait
> for the outcome of the runtime call or not. In some cases, we could just
> enqueue
> the database operations without waiting for the result.
> 
> Only in those cases where we actually need to wait for a result, we need to
> change the outer interface, adding a callback. I'm also looking for feedback
> on how these interface changes could look like, if needed.
> 
> Here is a breakdown of the cases I found in the attached patch:
> 
> * XPIProvider methods that don't already have a callback: addonChanged,
>   updateAddonAppDisabledStates, updateAllAddonDisabledStates, uninstallAddon,
>   cancelUninstallAddon. Are they supposed to return only when completed, or
>   throw synchronously? If not, we can return early and complete
> asynchronously.
>   Of course, changing these means changing other providers also.

uninstallAddon and cancelUninstallAddon can be made asynchronous, though I'm sure there are some tests that make bad assumptions there. Likewise updateAllAddonDisabledStates and it looks like updateAddonAppDisabledStates is fine too.

addonChanged is a complexity here, it should be possible to make it async but the code is complex and fragile. We want to remove it soon, Blair is working on doing that elsewhere but I don't know how far off that is.

> * UpdateChecker methods: onUpdateCheckComplete. I suppose this is already
>   expected to do its job asynchronously?

Yes

> * AddonInstall methods: startInstall. What about this one?

Operates asynchronously currently IIRC

> * Addon setters: applyBackgroundUpdates, syncGUID, userDisabled,
> softDisabled.
>   Since we have a memory cache, it seems we could just update the in-memory
>   representation of the object synchronously and do a best-effort approach to
>   write to the database.

Yep

> * Addon getters: There are a lot of these that worried me, but now I realize
>   they're all populated when we read the add-on asynchronously from the
>   database. We can then remove the getter code completely.

Right, these shouldn't be a problem
(In reply to Dave Townsend (:Mossop) from comment #8)
> addonChanged is a complexity here, it should be possible to make it async
> but the code is complex and fragile. We want to remove it soon, Blair is
> working on doing that elsewhere but I don't know how far off that is.

Fine! I'll handle all the other cases and leave addonChanged alone for now, trying
to have a patch that passes all tests.
This is another proof of concept, this time spinning an inner event loop only
on startup and shutdown. The patch looks ugly, it should be carefully
refactored once we determine that we'll apply this approach.

The only open point is about Sync. Currently, the syncGUID setter is expected
to throw when a duplicate GUID is assigned. Can we change that in Sync so that
we call a function that returns the outcome asynchronously? (Andrew, in
comment 5 you mentioned interaction with Sync, do you have any insight about
the case discussed here?)

This passes all the 260 xpcshell tests, don't know about other harnesses.
Some test cases are affected by the changes, as expected sometimes we must
wait for the setters to complete their processing before checking state.

Also, in some tests we shutdown the Add-on Manager while handling an Add-on
Manager event. This causes a timeout while waiting for pending events to
complete on shutdown. We simply ignore that case in tests.
Attachment #656959 - Attachment is obsolete: true
Attachment #658134 - Flags: feedback?(dtownsend+bugmail)
(In reply to Paolo Amadini [:paolo] from comment #10)
> (Andrew, in
> comment 5 you mentioned interaction with Sync, do you have any insight about
> the case discussed here?)

No insight on the syncGUID thing, sorry.
(In reply to Paolo Amadini [:paolo] from comment #10)
> The only open point is about Sync. Currently, the syncGUID setter is expected
> to throw when a duplicate GUID is assigned. Can we change that in Sync so
> that
> we call a function that returns the outcome asynchronously? (Andrew, in
> comment 5 you mentioned interaction with Sync, do you have any insight about
> the case discussed here?)

gps should answer this.
(In reply to Dave Townsend (:Mossop) from comment #12)
> (In reply to Paolo Amadini [:paolo] from comment #10)
> > The only open point is about Sync. Currently, the syncGUID setter is expected
> > to throw when a duplicate GUID is assigned. Can we change that in Sync so
> > that
> > we call a function that returns the outcome asynchronously? (Andrew, in
> > comment 5 you mentioned interaction with Sync, do you have any insight about
> > the case discussed here?)
> 
> gps should answer this.

Sync can handle it. Either ping me when you need a Sync patch or ask me for review of the Sync code.
Comment on attachment 658134 [details] [diff] [review]
Proving we can avoid nested event loops at runtime

Review of attachment 658134 [details] [diff] [review]:
-----------------------------------------------------------------

(Ugh, please punch me in the face if I ever take this long again.)

Spent some time with this, and I'm warming to the idea. Looks like it could be cleaned up a bit once the major issues are sorted, to help keep it readable and maintainable. And the patch itself would really benefit from being split up into parts, so its more manageable (at the very least, split off things like changes to the Addon getters, which could even go in a separate bug).

I do have some concerns:
* Still spinning the event loop. There's a history of other areas of code doing this in the past, and causing various subtle issues. Would like to have someone that better knows the background and implications to comment on this (bsmedberg maybe?).
* Manually counting numPendingStatements seems error-prone, and hard to verify its done right. And in general, the fact that waitPending() is now needed in some tests makes me think those tests are no longer representative of real-world usage. I think that needs to be completely removed, and those tests changed to rely only on existing events (eg, onDisabled, when setting userDisabled=true). That's going to be needed anyway, for the likes of bug 771441.

::: toolkit/mozapps/extensions/XPIProvider.jsm
@@ +4980,5 @@
>    startInstall: function AI_startInstall() {
> +    Task.spawn(function XPI_task_startInstall() {
> +
> +    // Prevent re-entrance
> +    yield XPIDatabase.task_asyncWaitPending();

Is this really needed? Can it be worked around?

::: toolkit/mozapps/extensions/XPIProviderUtils.js
@@ +577,5 @@
>     */
> +  task_openConnection: function XPIDB_openConnection(aRebuildOnError, aForceOpen, aAlways) {
> +    if (this.connectionAttempted && !aAlways) {
> +      // The database is already open.
> +      throw new Task.Result({});

This seems racey, if we're needing to (re-)create the schema.

@@ +949,4 @@
>  
>      // Any errors in here should rollback the transaction
>      try {
> +      yield task_executeStatement(this.getStatement("schema1",

Is there any specific need to change these to be memorized? Only in extremely rare cases will they ever be used more than once, so memorizing them is mostly just a waste of memory.
Assignee: nobody → paolo.mozmail
Status: NEW → ASSIGNED
Comment on attachment 658134 [details] [diff] [review]
Proving we can avoid nested event loops at runtime

Review of attachment 658134 [details] [diff] [review]:
-----------------------------------------------------------------

I basically agree with everything Unfocused said. I think spinning the event loop in startup and shutdown should be relatively safe (particularly if we can ensure the shutdown one doesn't trigger any events that outside code might respond to) but we should find ways around the other event loops. The tests should certainly be changed to just wait for the appropriate events to happen, we have simple patterns for doing that or I wouldn't be averse to spinning a loops in the tests to wait if that makes things simpler.

I flip-flop a little on whether the "task_" prefix is a good idea or not, mostly I'm leaning towards being ok with it but it's Blair's call on that at the end of the day anyway.
(In reply to Dave Townsend (:Mossop) from comment #15)
> I flip-flop a little on whether the "task_" prefix is a good idea or not,
> mostly I'm leaning towards being ok with it

Yea. I figure we're better off being obvious about which functions run as tasks and which don't - a "task_" prefix does that the best (albeit at some cost to prettiness and code autocomplete).
Attachment #658134 - Flags: feedback?(dtownsend+bugmail) → feedback+
Paulo, Taras has asked me to look into bug 729330. Is there anything I can do to help move this patch along, or are there other aspects of the problem I could work on?
Flags: needinfo?(paolo.mozmail)
(In reply to Irving Reid (:irving) from comment #17)
> Paulo, Taras has asked me to look into bug 729330. Is there anything I can
> do to help move this patch along, or are there other aspects of the problem
> I could work on?

Well, both :-) Thanks for looking into this!

For context, during the last Snappy work week we discussed about ways to get rid
of background main-thread SQLite usage while browsing, for Downloads and Add-ons
in particular. I'm currently working on the Downloads part, for which we decided
to avoid the downloads.sqlite back-end entirely, now that download history is
integrated into Places. We'll store in-progress download metadata in a JSON file.

For Add-ons, we also discussed whether it made sense to go for a faster back-end
than SQLite, given the very long startup times and jank observed on some systems
when the list of add-ons needed to be updated (that's basically bug 729330).
This seems like a good idea in general because it reduces random-access I/O on
startup, though the amount of work required, compared to just moving SQLite
usage to a background thread, should be better evaluated with Blair and Dave.

What has become clear, while working on Add-ons, Downloads, and other parts of
the code base, is that we're ending up doing either off-main-thread pure file
I/O (OS.File), or off-main-thread SQLite access, that block startup or shutdown.
Sometimes we're special-casing startup and shutdown, by duplicating code to run
on the main thread, but it looks like this approach isn't scalable as the number
of components using OS.File or async storage grow.

So, regardless of the approach here, one of the "other aspects of the problem"
becomes putting in place a module to handle (in a very simple way) asynchronous
startup and shutdown operations, with a safety watchdog that is able to bypass
the operations if a hang occurs for any reason. Startup and shutdown operations
should be very fast in any case, but we still can't rely on an exit(0) approach
that could prevent recent changes from being persisted. I've found bug 518683
about shutdown, though the same watchdog could handle startup as well. Writing
a module for this should be a relatively quick and self-contained task.

You can also feel free to take over the patch and/or continue discussion about
which approach is better for the Add-ons Manager.
Flags: needinfo?(paolo.mozmail)
We're investingating a re-write of the XPIProvider storage back-end in bug 853388.
Seems like this is WONTFIX given that new direction.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: