Closed Bug 836493 Opened 11 years ago Closed 11 years ago

fix race condition with ServiceMainThreadInitializer in mozStorageService.cpp by enforcing consumers to first get a service handle on main-thread.

Categories

(Toolkit :: Storage, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla23

People

(Reporter: froydnj, Assigned: froydnj)

References

Details

Attachments

(3 files, 3 obsolete files)

As described in bug 798366 comment 19, it's possible for a non-main thread to get a handle on the storage service and then start creating connections before the must-run-on-main-thread bits have actually run.  We need to fix this race condition before we fix bug 798366.
Here's an ugly patch which I think fixes the race condition.  Raw PR_ATOMIC_*
calls and event loop spinning are not my favorite things, but I don't think
there's a better way to ensure that we only initialize things once.  I would
be happy to hear if you have any better ideas.
Attachment #708575 - Flags: review?(mak77)
Comment on attachment 708575 [details] [diff] [review]
fix race condition in initializing mozilla::storage::Service

During the work week I sit down with Paolo and tried to evaluate the patch and alternatives. We think this approach is a bit too much "hackish" by relying on empty loops and pending events spinnning, we have come up with an alternative idea that wold be using a Monitor, I'll try to convert it to a patch shortly to evaluate it.
Attachment #708575 - Flags: review?(mak77)
Attached patch wip (obsolete) — Splinter Review
We are still investigating this, the idea is to always initialize on main-thread, using a monitor to wait for the main-thread initialization to be completed. This version is based on the assumption GetSingleton is protected by the Components Manager monitor.
Comment on attachment 710015 [details] [diff] [review]
wip

IIUC, it looks like the component manager's monitor isn't held over getSingleton/CreateInstance:

http://mxr.mozilla.org/mozilla-central/source/xpcom/components/nsComponentManager.cpp#1232

Also, can we please minimize using NS_ADDREF and friends on gService?  A simple:

nsRefPtr<Service> service = new Service(...);
if (NS_SUCCEEDED(service->initialize()))
  service.forget(&gService);

is IMHO much clearer.
Well, I didn't touch the addrefing from the original code, while I agree we should not be using direct addrefing, that looks like scope-creep for this bug. May do that when we finalize a solution though.

I'm not sure whether getSingleton needs more protection, since it is intended to also initialize threadsafe services it looks crazy if it's not protected at all. Btw, the only problematic part in the patch is creating the static Monitor, from that point on we are protected by its lock. Ideas are welcome.
(In reply to Marco Bonardo [:mak] from comment #5)
> I'm not sure whether getSingleton needs more protection, since it is
> intended to also initialize threadsafe services it looks crazy if it's not
> protected at all. Btw, the only problematic part in the patch is creating
> the static Monitor, from that point on we are protected by its lock. Ideas
> are welcome.

It looks like GetService and friends have some crazy code to prevent multiple threads from getting to CreateInstance simultaneously:

http://mxr.mozilla.org/mozilla-central/source/xpcom/components/nsComponentManager.cpp#1195

At least, that's what I *think* it does...

If that's the case, then we shouldn't need any locking/monitors/etc. whatsoever.  The simple fix of making the event synchronously dispatched should work just fine.
We think that dispatch_sync causes some tricky spinning loops that in the end don't guarantee anything, a monitor is much safer in this case.

http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThread.cpp#410
(In reply to Marco Bonardo [:mak] from comment #7)
> We think that dispatch_sync causes some tricky spinning loops that in the
> end don't guarantee anything, a monitor is much safer in this case.

"that in the end don't guarantee anything"?  How do they not guarantee anything?  Please justify.

You own the code, not me, but I'd rather go with a one-line change that uses a common construct from elsewhere in the code base vs. a rather more complicated fix that uses an unnecessary locking scheme.
First, spinning has its own set of issues associated (see the various hard to debug crashes and bugs caused by spinning). But mostly, in this case, you may have other threads or runnables trying to use the service, and by spinning you may end up causing them to find gService defined, even if we have not finished initialization yet. It's hard to figure what may be around when you start spinning, so it's also hard to preemptively define it "safe".
The unnecessary locking schema is actually a pretty common thread synchronization schema in most other software and languages, fwiw.
Adding more people to the discussion, since they have some good experience with both Storage and threads.
(In reply to Marco Bonardo [:mak] from comment #9)
> First, spinning has its own set of issues associated (see the various hard
> to debug crashes and bugs caused by spinning). But mostly, in this case, you
> may have other threads or runnables trying to use the service, and by
> spinning you may end up causing them to find gService defined, even if we
> have not finished initialization yet.

Two points:

1) comment 6 suggests that we don't have to worry about other threads contending with us because the component manager takes care of that.

2) Finding gService defined wouldn't be a problem if the code used a proper initialization scheme like comment 4 suggests. ;)

> The unnecessary locking schema is actually a pretty common thread synchronization schema 
> in most other software and languages, fwiw.

Yes, I know.  I'm saying that you don't need it in this instance, because there are easier ways of accomplishing the same thing.
(In reply to Nathan Froyd (:froydnj) from comment #11)
> 1) comment 6 suggests that we don't have to worry about other threads
> contending with us because the component manager takes care of that.

when you spin I don't think you can rely on the proper order of scheduling, nor on the fact the monitor in the components manager works as expected. I'm scared of dispatch_sync more than I'm of doing "more than needed" to ensure safety.

> 2) Finding gService defined wouldn't be a problem if the code used a proper
> initialization scheme like comment 4 suggests. ;)

That means it may find it undefined and try to define it, though, or cause a failure by returning a null handle. Why should we risk?
Can we just require any users of storage to get a reference to the service on the main thread before they spin up any background threads?  We enforce through hard asserts/aborts if we see ourselves getting initialized off the main thread.

It seems like there is always going to be the foot-gun of event-loop spinning, or the potential for dead-lock where the main thread ends up joining on the async thread that is blocking on something happening on the main thread.

While the not-my-problem approach I propose is somewhat of a cop-out, since JS consumers must exist on the main thread to be able to access mozStorage, we are really only talking about a bounded/finite number of C++ consumers.
(In reply to Andrew Sutherland (:asuth) from comment #13)
> Can we just require any users of storage to get a reference to the service
> on the main thread before they spin up any background threads?

We could, though, if we can fix it, it's less issues we are going to hit with add-ons totally ignoring our request. Even if we abort, for the user is still a Firefox crash on startup that will block his browser forever.

Plus, I think we would just move the issue to indexedDB code, I think it takes the service handle off main-thread? (to be verified)

> It seems like there is always going to be the foot-gun of event-loop
> spinning, or the potential for dead-lock where the main thread ends up
> joining on the async thread that is blocking on something happening on the
> main thread.

Are you thinking of shutdown on init? Otherwise I'm not sure why the threads should join at this point, not even sure if it's possible, since the monitor blocks the other thread.
So after talking with Ben, he said indexedDB has a similar functionality, it has a handle on main-thread and then moves to the other thread, so it should be easy to make it manage something like comment 13.
Though, I still think a runtime abort is excessive and may be problematic for the reasons expressed on comment 14 (part 1).

Throwing would be fine though.
Ben suggested to make Init() fail if it's invoked off main-thread, some consumers may have to change but at least we won't crash.

Considered we mostly care about few in-tree cpp consumers, if this solution is fine with them, I'm fine with it.
(In reply to Marco Bonardo [:mak] from comment #15)
> Though, I still think a runtime abort is excessive and may be problematic
> for the reasons expressed on comment 14 (part 1).

Right, punitive user crashes that they can't do anything about are horrible and we shouldn't do that.  My suggestion is primarily just to make sure that C++ users who don't read the docs get a handy hint in the form of their tests aborting with an error message about the rules rather than debug spew or silence.  So whatever the only-in-debug-builds/only-in-test-builds case is.  Binary add-ons that don't read the docs are more troublesome, but ideally those add-on developers will be running debug builds themselves, or something like that.
(In reply to Marco Bonardo [:mak] from comment #15)
> Throwing would be fine though.
> Ben suggested to make Init() fail if it's invoked off main-thread, some
> consumers may have to change but at least we won't crash.

If no Storage objects that depend on the service are createable without getting
a reference to the service first, I agree it should be not too much to ask of
consumers to initialize Storage on the main thread before spawning their own
worker threads.
Ok, I think we reached an agreement.
Nathan, are you still interested in fixing this, according to the decided route?
Summary: fix race condition with ServiceMainThreadInitializer in mozStorageService.cpp → fix race condition with ServiceMainThreadInitializer in mozStorageService.cpp by enforcing consumers to first get a service handle on main-thread.
(In reply to Marco Bonardo [:mak] from comment #18)
> Ok, I think we reached an agreement.
> Nathan, are you still interested in fixing this, according to the decided
> route?

Sure, I can do that.  The comment thread so far has been talking about changing Init() to fail on a non-main thread.  Is there an advantage to doing that vs. simply making getSingleton--when the service hasn't already been initialized, of course--fail on a non-main thread?  The two options seem semantically equivalent, but the latter is slightly more efficient.
I think the result is the same, unless I missed anything.
Assignee: nobody → nfroyd
Status: NEW → ASSIGNED
This trivial patch just does the work to ensure we're getting the first
reference on the main thread.  For simplicity, we'll eliminate
ServiceMainThreadInitializer in the next patch.
Attachment #708575 - Attachment is obsolete: true
Attachment #710015 - Attachment is obsolete: true
Attachment #711870 - Flags: review?(mak77)
Running basic tests indicates this is not sufficient; xpcom/tests/TestSettingsAPI hangs with several messages about being unable to open the DB, presumably because IndexedDB is attempting to open a DB off the main thread.  And being able to do that requires accessing the storage service, which means that the first reference is being gotten off the main thread, which causes problems.

In fact, IndexedDB explicitly asserts that grabbing the storage service happens off the main thread:

http://mxr.mozilla.org/mozilla-central/source/dom/indexedDB/OpenDatabaseHelper.cpp#1736

I think Ben suggested starting the storage service from manifest files; maybe that's the only viable way out here. :(
(In reply to Nathan Froyd (:froydnj) from comment #23)
> I think Ben suggested starting the storage service from manifest files;
> maybe that's the only viable way out here. :(

Ben clearly said it would not be an issue to change indexedDB to first get a handle on main-thread. That means indexedDB needs changes, so clearly this bug alone can't be enough, we need a bug to make indexedDB get the mainthread handle first.
Flags: needinfo?(bent.mozilla)
Just having IDBFactory grab a reference to mozIStorageService in its constructor is sufficient to at least make the test failure go away.  I don't know whether that's what Ben had in mind, or whether he intended the grab to be put someplace else.
Comment on attachment 711870 [details] [diff] [review]
part 1: ensure that the initial reference to mozStorageService is obtained on the main thread

Review of attachment 711870 [details] [diff] [review]:
-----------------------------------------------------------------

There is some risk add-on may be using this off mainthread, and after this change they will cause a startup crash if they don't error check do_GetService. They should commonly do error checking, or at least I hope :/

I suggest cross-posting to mozilla.dev.extensions and Planet Mozilla about this behavior change.
While these notifications are hardly enough, we should at least try to do all the possible to ensure knowledge about the change.

::: storage/src/mozStorageService.cpp
@@ +321,5 @@
>    }
>  
> +  // The first reference to the storage service must be obtained on the
> +  // main thread.
> +  NS_ENSURE_TRUE(NS_IsMainThread(), nullptr);

you should add a @note to mozIStorageService.idl about this fact, and the same note should be added to MDN page https://developer.mozilla.org/en-US/docs/XPCOM_Interface_Reference/mozIStorageService
Attachment #711870 - Flags: review?(mak77) → review+
(In reply to Marco Bonardo [:mak] from comment #26)
> I suggest cross-posting to mozilla.dev.extensions and Planet Mozilla about
> this behavior change.
> While these notifications are hardly enough, we should at least try to do
> all the possible to ensure knowledge about the change.

Do you think it's worth delaying landing this until after next week's uplift in an attempt to give people the maximum amount of time to fix things?  Obviously there's the IndexedDB issue to address...maybe other changes required will naturally push the fix out a week, but in case they don't?

> ::: storage/src/mozStorageService.cpp
> @@ +321,5 @@
> >    }
> >  
> > +  // The first reference to the storage service must be obtained on the
> > +  // main thread.
> > +  NS_ENSURE_TRUE(NS_IsMainThread(), nullptr);
> 
> you should add a @note to mozIStorageService.idl about this fact, and the
> same note should be added to MDN page
> https://developer.mozilla.org/en-US/docs/XPCOM_Interface_Reference/
> mozIStorageService

Will do.
Comment on attachment 711871 [details] [diff] [review]
part 2: eliminate now-unnecessary ServiceMainThreadInitializer

Review of attachment 711871 [details] [diff] [review]:
-----------------------------------------------------------------

you should probably also change this test, make it so it checks we fail if trying to init the service on a background thread

http://mxr.mozilla.org/mozilla-central/source/storage/test/test_service_init_background_thread.cpp

::: storage/src/mozStorageService.cpp
@@ +514,5 @@
>    }
>  
>    // Set the default value for the toolkit.storage.synchronous pref.  It will be
>    // updated with the user preference on the main thread.
>    sSynchronousPref = PREF_TS_SYNCHRONOUS_DEFAULT;

The comment above and this line can probably go, since we just read the pref below

@@ +516,5 @@
>    // Set the default value for the toolkit.storage.synchronous pref.  It will be
>    // updated with the user preference on the main thread.
>    sSynchronousPref = PREF_TS_SYNCHRONOUS_DEFAULT;
>  
> +  // Register for xpcom-shutdown so we can cleanup after ourselves.  The

please add a MOZ_ASSERT(NS_IsMainThread()); at the beginning of initialize(), just in case...
Attachment #711871 - Flags: review?(mak77) → review+
(In reply to Nathan Froyd (:froydnj) from comment #27)
> Do you think it's worth delaying landing this until after next week's uplift
> in an attempt to give people the maximum amount of time to fix things? 
> Obviously there's the IndexedDB issue to address...maybe other changes
> required will naturally push the fix out a week, but in case they don't?

could be wise, yes.
Apart indexedDB, I just noticed nsOfflineCacheDevice is defined as threadsafe isupports, and looks like it supports being initialized off mainthread:
http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/nsDiskCacheDeviceSQL.cpp#1119
Honza, is comment 30 actually true? would it be hard to have a mainthread initializer part in that component?
Flags: needinfo?(honzab.moz)
Here's a patch to do the necessary IndexedDB bits.
Attachment #713533 - Flags: review?(bent.mozilla)
Comment on attachment 713533 [details] [diff] [review]
part 0a - make sure IndexedDB attempts to initialize the storage service on the main thread

Review of attachment 713533 [details] [diff] [review]:
-----------------------------------------------------------------

This would work but I think you can do it more cleanly in IndexedDatabaseManager::GetOrCreate(). It's a singleton already so you won't have to do the static stuff again.

Maybe somewhere around here: http://mxr.mozilla.org/mozilla-central/source/dom/indexedDB/IndexedDatabaseManager.cpp#416
Attachment #713533 - Flags: review?(bent.mozilla)
Flags: needinfo?(bent.mozilla)
Ah, OK, that does seem like a better way to do it.
Attachment #713533 - Attachment is obsolete: true
Attachment #713951 - Flags: review?(bent.mozilla)
Comment on attachment 713951 [details] [diff] [review]
part 0a - make sure IndexedDB attempts to initialize the storage service on the main thread

Review of attachment 713951 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!
Attachment #713951 - Flags: review?(bent.mozilla) → review+
(In reply to Marco Bonardo [:mak] (Away 14-17 Feb) from comment #31)
> Honza, is comment 30 actually true? would it be hard to have a mainthread
> initializer part in that component?

I strongly believe that nsOfflineCacheDevice::Init() is only called on the main thread, ever.  This initiation happens only through nsIApplicationCacheService service instantiation, and all consumers seems to be only main-thread.

I think the thread reference in nsOfflineCacheDevice is just some left over.
Flags: needinfo?(honzab.moz)
A debug-only all-unit-tests Try run looks pretty green, so barring any objections, I'm going to commit the patches attached to this bug in the middle of next week after the uplift.

https://tbpl.mozilla.org/?tree=Try&rev=e6dfcac7995c
(In reply to Honza Bambas (:mayhemer) from comment #36)
> (In reply to Marco Bonardo [:mak] (Away 14-17 Feb) from comment #31)
> > Honza, is comment 30 actually true? would it be hard to have a mainthread
> > initializer part in that component?
> 
> I strongly believe that nsOfflineCacheDevice::Init() is only called on the
> main thread, ever.  This initiation happens only through
> nsIApplicationCacheService service instantiation, and all consumers seems to
> be only main-thread.
> 
> I think the thread reference in nsOfflineCacheDevice is just some left over.

It may be worth to add a MOZ_ASSERT to check for main-thread there then, or even an actual failure. We can't risk that someone adds a non-main-thread caller in future.
Depends on: 842611
(In reply to Marco Bonardo [:mak] (Away Feb 22) from comment #38)
> (In reply to Honza Bambas (:mayhemer) from comment #36)
> > (In reply to Marco Bonardo [:mak] (Away 14-17 Feb) from comment #31)
> > > Honza, is comment 30 actually true? would it be hard to have a mainthread
> > > initializer part in that component?
> > 
> > I strongly believe that nsOfflineCacheDevice::Init() is only called on the
> > main thread, ever.  This initiation happens only through
> > nsIApplicationCacheService service instantiation, and all consumers seems to
> > be only main-thread.
> > 
> > I think the thread reference in nsOfflineCacheDevice is just some left over.
> 
> It may be worth to add a MOZ_ASSERT to check for main-thread there then, or
> even an actual failure. We can't risk that someone adds a non-main-thread
> caller in future.

So, according bug 842611 and two try runs, nsOfflineCacheDevice::Init may be called on off-main thread:
https://tbpl.mozilla.org/?tree=Try&rev=5e93e389ebb6
https://tbpl.mozilla.org/?tree=Try&rev=02e6ee31092a

I'll try to put that assert just before opening the DB to try ones more.
(In reply to Honza Bambas (:mayhemer) from comment #39)
> So, according bug 842611 and two try runs, nsOfflineCacheDevice::Init may be
> called on off-main thread:
> https://tbpl.mozilla.org/?tree=Try&rev=5e93e389ebb6
> https://tbpl.mozilla.org/?tree=Try&rev=02e6ee31092a
> 
> I'll try to put that assert just before opening the DB to try ones more.

yes, that was my fear, this fix may have to wait for nsOfflineCacheDevide to be changed, provided that's feasible.
I landed part 0 to unblock the main-thread-only prefs work.

https://hg.mozilla.org/integration/mozilla-inbound/rev/cf3772a6f4e9
Whiteboard: [leave open]
Looks like bug 842611 just landed, that means this change is now unblocked.

Nathan, may we proceed here?
Flags: needinfo?(nfroyd)
(In reply to Marco Bonardo [:mak] from comment #43)
> Looks like bug 842611 just landed, that means this change is now unblocked.
> 
> Nathan, may we proceed here?

Yup, was going to rebase these patches (I think part 2 will need some changes for some recent patches by Ben) and push them to try today, just to make sure we haven't introduced any other storage service refs that need fixing up.
Flags: needinfo?(nfroyd)
Small fixup in part 2 to account for the new page size pref.  Pushed:

remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/5727d59ae208
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/76913343f8a6

Pushed to try (mochitests, xpcshell) to make sure it didn't break things and everything was green:

https://tbpl.mozilla.org/?tree=Try&rev=3713c426ca54
I assume the leave open can go away now, it was set when khuey pushed the first part.
Whiteboard: [leave open]
https://hg.mozilla.org/mozilla-central/rev/5727d59ae208
https://hg.mozilla.org/mozilla-central/rev/76913343f8a6
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
Blocks: 832677
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: