Closed Bug 1320533 Opened 3 years ago Closed 3 years ago

Crash in shutdownhang | mozilla::CondVar::Wait | mozilla::DataStorage::WaitForReady | mozilla::DataStorage::GetAll

Categories

(Core :: XPCOM, defect, critical)

51 Branch
All
Windows
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla53
Tracking Status
firefox50 --- unaffected
firefox51 --- fixed
firefox52 --- fixed
firefox53 --- fixed

People

(Reporter: philipp, Assigned: ehsan)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-1c6e150b-c3e5-4409-afd3-adbf42161127.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	ntdll.dll 	KiFastSystemCallRet 	
1 	ntdll.dll 	NtWaitForSingleObject 	
2 	kernel32.dll 	WaitForSingleObjectEx 	
3 	kernel32.dll 	WaitForSingleObject 	
4 	nss3.dll 	PR_WaitCondVar 	nsprpub/pr/src/threads/combined/prucv.c:525
5 	xul.dll 	mozilla::CondVar::Wait(unsigned int) 	obj-firefox/dist/include/mozilla/CondVar.h:79
6 	xul.dll 	mozilla::DataStorage::WaitForReady() 	security/manager/ssl/DataStorage.cpp:428
7 	xul.dll 	mozilla::DataStorage::GetAll(nsTArray<mozilla::dom::DataStorageItem>*) 	security/manager/ssl/DataStorage.cpp:499
8 	xul.dll 	mozilla::dom::ContentParent::RecvReadDataStorageArray(nsString const&, nsTArray<mozilla::dom::DataStorageItem>*) 	dom/ipc/ContentParent.cpp:2411
9 	xul.dll 	mozilla::dom::PContentParent::OnMessageReceived(IPC::Message const&, IPC::Message*&) 	obj-firefox/ipc/ipdl/PContentParent.cpp:8297
10 	xul.dll 	mozilla::ipc::MessageChannel::DispatchSyncMessage(IPC::Message const&, IPC::Message*&) 	ipc/glue/MessageChannel.cpp:1633
11 	xul.dll 	mozilla::ipc::MessageChannel::DispatchMessageW(IPC::Message&&) 	ipc/glue/MessageChannel.cpp:1596
12 	xul.dll 	mozilla::ipc::MessageChannel::OnMaybeDequeueOne() 	ipc/glue/MessageChannel.cpp:1567

this shutdown hang signature is regressing in numbers since firefox 51 builds on windows. it's mid-volume - in 51.0b this is accounting for 0.14% of the crash volume in the browser process...
Hi :kanru,
Can you help shed some light here?
Flags: needinfo?(kchen)
Flags: needinfo?(ehsan)
Hmm, the only theory I can come up with for this is that since we're during shutdown, the xpcom-shutdown event has been dispatched, and our worker thread has been shut down here: <http://searchfox.org/mozilla-central/source/security/manager/ssl/DataStorage.cpp#886>, so we're waiting for a read that never finishes.  If this theory is correct, we need to check mShuttingDown and bail out early from GetAll().

David, what do you think?
Flags: needinfo?(ehsan) → needinfo?(dkeeler)
I think that would only happen if thread shutdown cancels runnables that were dispatched to the thread but haven't been run at that point. My understanding is rather that thread shutdown processes all such runnables. However, if we failed to dispatch the Reader runnable in the first place, it looks like it gets leaked ( https://dxr.mozilla.org/mozilla-central/rev/741a720c98cdb92c229376be0badbf036f653bff/xpcom/threads/nsThread.cpp#718 ), and since DataStorage can only proceed after the Reader runnable's destructor is run, maybe that's what's happening?
Flags: needinfo?(dkeeler)
Milan,
Can you help find someone who can work on this like determine which component is the right one at least?
Flags: needinfo?(milan)
We can start with XPCOM, unless Nathan thinks otherwise.
Component: General → XPCOM
Flags: needinfo?(milan) → needinfo?(nfroyd)
Hm, so https://crash-stats.mozilla.com/report/index/1b0a0931-a7b7-46e4-ba4e-39f002161130 has a much nicer stack on the main thread.  We are in xpcom-shutdown, as shown by the ContentParent::Observe frame.

So Ehsan's theory sounds correct.  But David points out in comment 4 that we have this Reader runnable:

https://dxr.mozilla.org/mozilla-central/source/security/manager/ssl/DataStorage.cpp#167

that is supposed to indicate that we are ready by setting DataStorage::mReady:

https://dxr.mozilla.org/mozilla-central/source/security/manager/ssl/DataStorage.cpp#187

and that Reader runnable gets dispatched from here in AsyncReadData:

https://dxr.mozilla.org/mozilla-central/source/security/manager/ssl/DataStorage.cpp#395

AsyncReadData is only ever called from here in Init:

https://dxr.mozilla.org/mozilla-central/source/security/manager/ssl/DataStorage.cpp#89

Init() must be called on the main thread, so if the Init() call happened before we received xpcom-shutdown, we have a couple of cases to consider:

1) We couldn't create mWorkerThread (we create a separate thread for *every* DataStorage instance?!).  mReady would always be false, and we'd hang as in the stack above, waiting for an event that never got dispatched.  (This seems pretty unlikely, especially in the example above: there's a *lot* of virtual memory free.)

2) mWorkerThread was created, and therefore we called AsyncReadData.  Perhaps we couldn't Dispatch the Reader runnable in AsyncReadData...unlikely.  Even if we did, though, we'd run Reader's destructor and set mReady to true.  So we can't hang in this scenario.

3) OK, so we dispatched Reader.  What's the state of our worker thread?  If we haven't called mWorkerThread->Shutdown(), then mWorkerThread is still running, and mWorkerThread has a Reader runnable.  We guaranteed that Reader runnable will run at some point, so we can't hang in this scenario.  (It's *possible* that mWorkerThread might not be scheduled for long enough that our shutdownhang killer would get invoked...?)

4) mWorkerThread has been shutdown.  Shutting down an xpcom thread drains its event queue, so the Reader runnable would definitely have run, thereby setting mReady to true.  So we can't hang in this scenario.

OK, the above says we can't be here.  But what about this alternative scenario:

1) Start xpcom-shutdown.
2) DataStorage::Observe runs, which clears sDataStorages.
3) ContentParent::Observe runs, as in the above stack.
4) We eventually get to ContentParent::RecvReadDataStorageArray:

https://hg.mozilla.org/releases/mozilla-aurora/annotate/55b599ca3530/dom/ipc/ContentParent.cpp#l2424

which tries DataStorage::Get on a given filename:

https://dxr.mozilla.org/mozilla-central/source/security/manager/ssl/DataStorage.cpp#59

We can't find the DataStorage for the given filename--because we've cleared sDataStorages already--so we return a new DataStorage.  (Note that DataStorage::Get will happily reallocate sDataStorages if it's been deallocated, and will happily hand out new DataStorages even after DataStorage::Observe has seen xpcom-shutdown.  That seems bad.)  ContentParent::RecvReadDataStorageArray obviously thinks that whatever DataStorage object it receives has already been Init()'d...but that's not the case with this one.  And so we call GetAll() on this newly-allocated, non-Init()'d DataStorage object, but there's nothing to set DataStorage::mReady to true, so we wind up waiting forever.

Sprinkling around a couple MOZ_DIAGNOSTIC_ASSERTs would help confirm this scenario (e.g. MOZ_DIAGNOSTIC_ASSERT(mInitCalled) in DataStorage::WaitForReader).

Without knowing more intricacies of how this thing works, I think that's a pretty likely scenario (or #3 above, where mWorkerThread just can't get scheduled).  That means the problem is probably somewhere in DataStorage (more robust shutdown mechanisms?  more robust ::Get()?) or possibly in ContentParent.  Does that sound plausible?
Flags: needinfo?(nfroyd)
David, WDYT about comment 7?
Flags: needinfo?(dkeeler)
Nathan, I think your theory is the most likely, it sort of fills in a few holes that I had in my theory.

Here are a few extra thoughts:

* DataStorage::Get() can't really return null post shutdown, because the same code path also handles profile-before-change, but there is also GetIfExists(), which should be used by callers which want to deal with this possibility.
* The hypothesis you brought up can easily be tested with a MOZ_DIAGNOSTIC_ASSERT(!mShutdownPending) inside ContentParent::RecvReadDataStorageArray(), but even trying that is silly...  Let's walk through what happens:

1. We're in ContentParent::Observer for xpcom-shutdown.  We're assuming DataStorage::Observe has been called before.
2. We call ShutdownProcess() <http://searchfox.org/mozilla-central/rev/4f5f9f3222f35fa4635ea96a0c963c130854ef19/dom/ipc/ContentParent.cpp#2318>
3. It sets mShutdownPending to true <http://searchfox.org/mozilla-central/source/dom/ipc/ContentParent.cpp#1212>.  If that branch isn't taken because SendShutdown fails, our IPC channel is already broken and we'd be crashing soon anyways.
4. Then we get to processing our events <http://searchfox.org/mozilla-central/rev/4f5f9f3222f35fa4635ea96a0c963c130854ef19/dom/ipc/ContentParent.cpp#2324>
5. Which takes us to ContentParent::RecvReadDataStorageArray as per <https://crash-stats.mozilla.com/report/index/1b0a0931-a7b7-46e4-ba4e-39f002161130>
6. So at this point, mShutdownPending *has* to be true.

But even ignoring the hang in this bug, the whole purpose of RecvReadDataStorageArray is to setup the DataStorage object in the child process for it to be used later.  If we're shutting down, then that's all wasted work which we should just avoid for performance reasons, even if we're not hanging here.

Therefore, I'm going to write a patch to fix that issue.  When we land that we shall see if that addresses all of the hangs with this signature.
Assignee: nobody → ehsan
Comment on attachment 8820775 [details] [diff] [review]
Avoid setting up DataStorage in the child process when we're shutting down, hopefully fixing a shutdown hang

Review of attachment 8820775 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM.  Ideally it would be nice to signal the shutdown back to the child to avoid firing more observers there, but no big deal.
Attachment #8820775 - Flags: review?(bkelly) → review+
Blocks: 1215723
Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d7ea845356af
Avoid setting up DataStorage in the child process when we're shutting down, hopefully fixing a shutdown hang; r=bkelly
(In reply to Nathan Froyd [:froydnj] from comment #7)
> Sprinkling around a couple MOZ_DIAGNOSTIC_ASSERTs would help confirm this
> scenario (e.g. MOZ_DIAGNOSTIC_ASSERT(mInitCalled) in
> DataStorage::WaitForReader).

I filed bug 1325119 for this, in the interest of not polluting this bug where we may want to backport patches...
See Also: → 1325120
(In reply to Nathan Froyd [:froydnj] from comment #7)
... 
> we create a separate thread for *every* DataStorage instance?!

Unfortunately, yes. (DataStorage used to be a singleton, basically, but then its scope grew.)

...

> OK, the above says we can't be here.  But what about this alternative
> scenario:
> 
> 1) Start xpcom-shutdown.
> 2) DataStorage::Observe runs, which clears sDataStorages.
> 3) ContentParent::Observe runs, as in the above stack.
> 4) We eventually get to ContentParent::RecvReadDataStorageArray:
> 
> https://hg.mozilla.org/releases/mozilla-aurora/annotate/55b599ca3530/dom/ipc/
> ContentParent.cpp#l2424
> 
> which tries DataStorage::Get on a given filename:
> 
> https://dxr.mozilla.org/mozilla-central/source/security/manager/ssl/
> DataStorage.cpp#59
> 
> We can't find the DataStorage for the given filename--because we've cleared
> sDataStorages already--so we return a new DataStorage.  (Note that
> DataStorage::Get will happily reallocate sDataStorages if it's been
> deallocated, and will happily hand out new DataStorages even after
> DataStorage::Observe has seen xpcom-shutdown.  That seems bad.)

This seems plausible (as well as something we'd want to fix anyway).
Flags: needinfo?(dkeeler)
(In reply to David Keeler [:keeler] (use needinfo?) from comment #14)
> (In reply to Nathan Froyd [:froydnj] from comment #7)
> ... 
> > we create a separate thread for *every* DataStorage instance?!
> 
> Unfortunately, yes. (DataStorage used to be a singleton, basically, but then
> its scope grew.)

That is bug 1325178 now.
https://hg.mozilla.org/mozilla-central/rev/d7ea845356af
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
Flags: needinfo?(kchen)
Please request Aurora/Beta approval on this when you get a chance.
Flags: needinfo?(ehsan)
Comment on attachment 8820775 [details] [diff] [review]
Avoid setting up DataStorage in the child process when we're shutting down, hopefully fixing a shutdown hang

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1215723.
[User impact if declined]: We may hang in the parent process is short lived Firefox instances (for example when we're quickly restarting immediately after startup) and eventually get killed by the shutdown hang monitor.
[Is this code covered by automated tests?]: No, it's not very easy to add an automated test for this.
[Has the fix been verified in Nightly?]:  I looked at crash-stats, and it seems like this hang happens on Aurora and Beta a lot more than it does on Nightly, so it's hard to say whether this fix has been effective before backporting.  Note that we have come up with a possible theory behind this hang, and there may be more fixed needed after this patch if the crashes don't stop.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: The fix itself is super simple.
[Why is the change risky/not risky?]: The patch just makes us return early in response to an IPC call if we're shutting down.  The operation wouldn't finish if we attempted it anyways.
[String changes made/needed]: None.
Flags: needinfo?(ehsan)
Attachment #8820775 - Flags: approval-mozilla-beta?
Attachment #8820775 - Flags: approval-mozilla-aurora?
Crash Signature: [@ shutdownhang | mozilla::CondVar::Wait | mozilla::DataStorage::WaitForReady | mozilla::DataStorage::GetAll] → [@ shutdownhang | mozilla::CondVar::Wait | mozilla::DataStorage::WaitForReady | mozilla::DataStorage::GetAll] [@ shutdownhang | mozilla::CondVar::Wait | mozilla::DataStorage::WaitForReady | mozilla::dom::ContentParent::RecvReadDataStorageArray]
Comment on attachment 8820775 [details] [diff] [review]
Avoid setting up DataStorage in the child process when we're shutting down, hopefully fixing a shutdown hang

hopefully avoid a shutdown hang in aurora52
Attachment #8820775 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8820775 [details] [diff] [review]
Avoid setting up DataStorage in the child process when we're shutting down, hopefully fixing a shutdown hang

Fix for a top crash, Beta51+
Attachment #8820775 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.