crash in nsIEventTarget::Dispatch

RESOLVED FIXED in Firefox 46

Status

()

defect
--
critical
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: tracy, Assigned: asuth)

Tracking

({crash, topcrash-win})

Trunk
mozilla46
Unspecified
Windows NT
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox46 fixed)

Details

(crash signature)

Attachments

(3 attachments, 2 obsolete attachments)

This bug was filed from the Socorro interface and is 
report bp-d0964002-f04e-4156-99bf-d7e0f2160104.
=============================================================

Reporting this Windows only startup crash against Nightly (46) because it has inched in to the top 10 there. It is present on builds from 42betas through 45 in relatively lower volume.

I looked through a sample of a dozen reports on Nightly and the common add-ons in all of them are loop and pocket.
I think I might see the problem...
I think we just need some more liberal logic in the error checking in AsyncClose.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=fc1504fbf5a4
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
Attachment #8705155 - Flags: review?(bugmail)
Okay, so the problem as I understand it in this case is that:
- Although this is a startup crash, XPCOM is shutting down.
- AsyncClose is being invoked as part of an orderly shutdown by the cookie database.  Based on a quick persusal of the cookie code, this is likely not a redundant AsyncClose call because all AsyncClose calls are followed-up by calls to CleanupDefaultDBConnection() that nulls out the connection.
- The database is opened synchronously so we expect mDBConn to be non-null.
- getAsyncExecutionTarget() is probably returning null because there wasn't a thread but it's trying to create a new one via NS_NewThread and that failed.  (The other possibility is that mAsyncExecutionThreadShuttingDown is true, but that would only happen if someone called Close() or got past the crash-point in this code.  The cookie code seems to depend on destructor-invoked Close() only, and I think we've ruled out multiple calls to AsyncClose.  Also, multiple calls to AsyncClose would result in mDBConn being nulled out, resulting in the early return happening.)
- Therefore the issue is that the (mDBConn && !asyncThread) is true, and this leads to a crash because you can't dispatch to a null dispatch target.

So this fix will definitely fix the problem because there's no way for us to get to crash-town if we don't have the dispatch target.

But the question is whether there's fallout if we're now letting !mDBConn fast-path us out.  Because as the comment block above the conditional explains, we can end up in the situation (!mDBConn && asyncThread), and it intentionally wants us to not exit early.  So this fix would break that.  (And I think that reason is that the only place we call shutdownAsyncThread is as part of the AsyncCloseConnection runnable.)

I'm not going to do the boolean matrix, but I think the summary is that we really want out of AsyncClose is to dispatch AsyncCloseConnection if there is a live async execution thread.  If there is not then if there is an mDBConn then we can just depend on the destructor to failsafe close things down.

So my counter-proposal is we change this block:

  // It's possible to get here with a null mDBConn but a non-null async
  // execution target if OpenAsyncDatabase failed somehow, so don't exit early
  // in that case.
  nsIEventTarget *asyncThread = getAsyncExecutionTarget();

  if (!mDBConn && !asyncThread)
    return NS_ERROR_NOT_INITIALIZED;

to look something more like:

  {
    MutexAutoLock lockedScope(sharedAsyncExecutionMutex);
    // (Check the alive flag rather than mAsyncExecutionThread because the flag tracks whether
    // a shutdown has been dispatched and so better reflects the imminent state.)
    if (!mAsyncExecutionThreadIsAlive) {
      // If there is no async execution thread, then there is no need to do any special async
      // shutdown.  Do a normal close.  This will handle !mDBConn.
      return Close();
    }
  }

Does that seem reasonable?
Component: Networking: Cookies → Storage
Flags: needinfo?(bkelly)
Product: Core → Toolkit
Comment on attachment 8705155 [details] [diff] [review]
Fail storage connection close if async thread is null. r=asuth

r=asuth with a revised fix that's basically what I propose, otherwise let's do another round please.
Attachment #8705155 - Flags: review?(bugmail) → review+
Uh, and the reason we'd stop using getAsyncExecutionTarget() is because it can make the async execution thread come to life if we're not in shutdown.  And it's silly to start up the thread just to shut it down again.  (I assume we used that initially because we didn't have the other cool life-cycle flags in place.)
Uh, and of course, don't actually call Close() with the mutex held.  Do the Close() outside the locked scope by saving off the variable first.  Assuming callers are obeying our "only touch the connection on one thread at a time" rule and all that, the variable should not be able to change state on the way out.  (Plus, Close is smart enough to not explode the world if called concurrently.)
(And the variable you'd save off is asyncThread from mAsyncExecutionThread.  Sorry for all the updates; I'm only now eating my life/brain-giving hot pocket foodstuff.)
Something like this?

I'm a little worried here because this is a more complex patch and I don't have a clear test for this particular condition.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=78fabc171b41
Attachment #8705155 - Attachment is obsolete: true
Flags: needinfo?(bkelly)
Attachment #8705272 - Flags: review?(bugmail)
Comment on attachment 8705272 [details] [diff] [review]
Perform storage close synchronously if async thread is shutdown. r=asuth

Review of attachment 8705272 [details] [diff] [review]:
-----------------------------------------------------------------

I agree on the complexity, but the simple patch was too simple and regressed thread shutdown.  Test proposals below.

But yes, I suppose the one wrinkle with this patch (with the action item addressed) is that we won't invoke the callback and we won't return an error in the case of (mDBConn && !asyncThread).  We probably could handle this better.

If we switch back to using getAsyncExecutionTarget(), then in the non-shutdown case we get the callback properly invoked at the cost of some wasteful thread startup/shutdown (while theoretically reducing patch complexity).  In the shutdown case, we don't invoke the callback.

If we keep the logic (with the AI addressed) and add a dispatch of the callback listener to the current thread, then we get the callback invoked in all cases.

I suppose the most correct thing to do is the latter thing.  I suppose we could also just complete the shutdown and not invoke the callback, but return an error so we make it the caller's problem.  (The cookie service doesn't check the return code, so we would have passed the buck but not improved things.)


I think we (and by we I mean you :) can write a C++ test to cover the current crasher (mDBConn && !asyncThread) situation like so:
- Create synchronous database connection.
- Call nsThreadManager::Shutdown() (I think only C++ code can do this?)
- Call conn->AsyncClose()
- (Things would crash without the fix at this point)
- Ensure that the connection actually got shutdown due to Close being shunted to.

In terms of the (!mDBConn && asyncThread) case, I'm not sure if we have coverage already, but that test would look like:
- Create an async connection so the db open happens on an async thread.  But make that fail somehow.  Maybe tell it gibberish like to open a text file or a nonexistent filename but using flags that require it to already exist (if we can do that?).
- Get a reference to the async thread.  (I think we expose this directly now, but if not, I think I had some C++ helper hacks in place.)
- Call AsyncClose, make sure we don't throw an error.  Make sure the callback happens and the thread shuts down.  (If the thread is shutdown dispatch will fail, but also possible I guess is just seeing if calling shutdown on it a second time results in NS_ERROR_UNEXPECTED).

I think the (mDBConn && asyncThread) scenario is covered already, and (!mDBConn && !asyncThread) is already sufficiently fine since we will return an error.


I'm clearing the review flag because we probably do want roundtripped consensus before this is landed.  (Also, I'm very sensitive to the scope creep here, but I'm also not (yet) a platform engineer ;)

::: storage/mozStorageConnection.cpp
@@ +1224,1 @@
>      return NS_ERROR_NOT_INITIALIZED;

This "if (!mDBConn)" check and early return has to go.  If there is an asyncThread we absolutely want to dispatch AsyncCloseConnection to it so that any runnables posted to it get a chance to complete before we try and shut it down.  (There could be an async database open on that thread that hasn't run yet, etc.)  I checked the AsyncCloseRunnable's path and consistent with that comment about "It's possible to get here...", the runnable is smart enough to not explode if (!mDBConn) when it gets invoked.
Attachment #8705272 - Flags: review?(bugmail)
Andrew, do you want to steal this bug?  I think you understand the issues better than I do.  I thought it was a simple change, but clearly its trickier.
Flags: needinfo?(bugmail)
Stealing per IRC discussion.  :mak, I see you have a review queue of 12 right now so I'm going to have :bkelly review to optimize for time since this is a topcrasher.  If you would like to review instead, please speak up.  But since I do agree with :bkelly's complexity concerns and in general the async interactions have been bad news for us, I'm going to run with making sure we have test coverage for the permutations on AsyncClose which should hopefully make everyone feel better about any changes.
Assignee: bkelly → bugmail
Flags: needinfo?(bugmail)
(In reply to Andrew Sutherland [:asuth] from comment #9)
> - Call nsThreadManager::Shutdown() (I think only C++ code can do this?)

I think I ran into linkage problems with trying to do this, presumably due to MOZILLA_INTERNAL_API and reaching into XPCOM internals generally being a massively warranty-voiding approach.  The less-horrible alternative is to try and replace the nsIThreadManager service, but I'm not really up on linkage, and NS_NewThread might be using internal API too in storage's case, defeating that hack.  I think I'm giving up on this since the crash path is pretty clear by inspection and making a complicated/brittle C++ test for a control-flow path we're completely eliminating is not really helping anyone.

So I'm switching to adding just JS tests with branch coverage.
One design question for cleaning this all up is whether it's appropriate to call the callback listener in all cases or only if a close is actually happening.  An argument can be made that asyncClose should have the following semantics:
a) It's safely idempotent.  Attempting to close an unopened connection should not return errors.  (The errors, if checked, might be handy for misuse cases in the normal case... but in edge cases like this bug, blowing up the cleanup/shutdown code might not be a net win.)
b) Likewise, the callback should be invoked in all cases because if someone is calling AsyncClose, it's their intent to close the database and they're clearly assuming it's open.  Since the whole point of an async close callback is to know when the close has safely occurred, always invoking the callback is consistent with that.

Consumer-wise, we've largely got:
- nsCookieService in C++.  Does not pay attention to the return value of AsyncClose.  Does use a listener.  The listener is for database rebuilding and notifying the observer service, apparently only for test purposes.
- Places in C++.  Does not pay attention to the return value of AsyncClose.  Does use the listener to clear a shutdown blocker.  I'm far from a places expert, but it does seem like the shutdown blocker is added well in advance of the database being opened.  But then places does synchronously go to a lot of work to open the database... so it probably does end up opened.
- Sqlite.jsm.  Technically, it does not pay attention to the return value of AsyncClose... but since XPConnect does and XPConnect will explode if we return any error, uh, bad things will happen with errors.  It does use the close listener for cleaning up a shutdown blocker, but it definitely only adds the blocker after the open has completed.

Anywho, ni'ing :mak for input on whether it's appropriate to always invoke the callback or only if a close was completed.  (Noting that it's possible for the close to happen synchronously on the main thread in this bug's case.  My plan for that is just to dispatch it to the main thread so the event loop ordering is effectively the same.)  If I don't hear anything by when I finish the test tomorrow, I'll assume that all callbacks all the time is the way to go.
Flags: needinfo?(mak77)
Two existing storage tests uses task.jsm add_task style tests and they included some duplicated code.  Since that's cleaner and the future and gDBConn is historically messy for asyncClose tests, I moved the helper code into the place for helpers.  This lets the fix patch have cleaner tests without being super distracting.  The tests pass before and after this migration.
Attachment #8706238 - Flags: review?(bkelly)
Because of the potential for undesirable main-thread jank I went back to us always using getAsyncExecutionTarget().  So the fix is basically just that we notice when the thread didn't exist or get created and we do the dispatch and switchover to Close().  As such, the changes to AsyncClose are quite minimal.

There are however some non-trivial comments because there can be a lot going on in the system and I never ever ever want to have to reason this stuff out again.  And hopefully this will help save other people too.

Note that because we will create the async execution thread and, as noted in a prior comment, there is no way for our xpcshell to reasonably cause thread creation to fail, our xpcshell tests cannot trigger that fallback Close() path.  *HOWEVER*, the change back to getAsyncExecutionTarget() is somewhat recent, so I did run the tests with AsyncClose using a "don't create the async thread if it's not already created" heuristic and it worked.  And I also commented out the sync dispatch to ensure it caused the test to hang, etc.  Between that and the minimal control-flow changes, I'm pretty confident about all this.

Test-wise, note that I pulled out the existing asyncClose tests from test_connection_executeAsync.js to put them in our new test_connection_asyncClose.js since they provide a lot of the test coverage.  I tried to augment their comments too, and you probably do want to make sure they seem sane.

I'm clearing the NI on :mak per my previous comments and especially since I took out the potential for inducing avoidable main-thread I/O.
Attachment #8705272 - Attachment is obsolete: true
Flags: needinfo?(mak77)
Attachment #8706240 - Flags: review?(bkelly)
I got over-zealous when migrating the asyncClose cases from test_connection_executeAsync.js resulting in us not actually calling asyncClose there at all, which led to some oranges.  I've added back asyncClose and modernized the test a little bit.  More cleanup is of course possible but not particularly necessary in my mind.

I'm planning to fold this patch into part 2 when landing to avoid making non-push-batch bisection happier.  Posting it separately since you might already have looked at the patch and who trusts interdiff?
Attachment #8706578 - Flags: review?(bkelly)
Comment on attachment 8706238 [details] [diff] [review]
part 1: move some promisey test helpers into head_storage.js

Review of attachment 8706238 [details] [diff] [review]:
-----------------------------------------------------------------

Trusting nothing changed internal to these functions and it was just a move.
Attachment #8706238 - Flags: review?(bkelly) → review+
Comment on attachment 8706240 [details] [diff] [review]
part 2: perform close synchronously if unable to create an async thread, tests

Review of attachment 8706240 [details] [diff] [review]:
-----------------------------------------------------------------

Thorough comments and tests.  Looks good to me, but I'm not an expert in the corner cases here.

::: storage/mozStorageConnection.cpp
@@ +1267,5 @@
> +    if (completeEvent) {
> +      // Closing the database is more important than returning an error code
> +      // about a failure to dispatch, especially because all existing native
> +      // callers ignore our return value.
> +      (void)NS_DispatchToMainThread(completeEvent);

nit: I think gecko style these days is Unused << NS_DispatchToMainThread().
nit: Use completeEvent.forget() here.
Attachment #8706240 - Flags: review?(bkelly) → review+
Attachment #8706578 - Flags: review?(bkelly) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/d6a85731d2aa518779451ebf8583815151565c56
Bug 1237601 - Centralize storage xpcshell promise helpers (support patch). r=bkelly

https://hg.mozilla.org/integration/mozilla-inbound/rev/42ee91882a503bfb9e216f19111dc935bd400caf
Bug 1237601 - Perform storage close synchronously if async thread cannot be started. r=bkelly
https://hg.mozilla.org/mozilla-central/rev/d6a85731d2aa
https://hg.mozilla.org/mozilla-central/rev/42ee91882a50
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
thanks for taking care of this, I just had a chance to look at the patches, they look good.

Just as a nit side note, we should not be using anymore Promise.defer() in new code, it's deprecated, we should rather use new Promise().
You need to log in before you can comment on or make changes to this bug.