Closed Bug 1248750 Opened 8 years ago Closed 8 years ago

spike in nightly crashes at mozilla::ipc::MessageListener::IntentionalCrash

Categories

(Core :: IPC, defect, P1)

Unspecified
Windows NT
defect

Tracking

()

VERIFIED FIXED
mozilla47
Tracking Status
e10s + ---
firefox46 --- unaffected
firefox47 + verified

People

(Reporter: dbaron, Assigned: billm)

References

Details

(Keywords: crash, topcrash, Whiteboard: btpp-active)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-d6de845d-03be-4cd0-b796-b28f42160216.
=============================================================

Following https://hg.mozilla.org/mozilla-central/rev/cbdcef6d6e01 there's been a spike in crashes at [@ mozilla::ipc::MessageListener::IntentionalCrash].

The stack signatures vary a bit, but they all seem to involve recursion where, in the middle of sending an IPC message, we recur into sending the same IPC message again.


e.g., in:
bp-d6de845d-03be-4cd0-b796-b28f42160216

0 	xul.dll 	mozilla::ipc::MessageListener::IntentionalCrash() 	ipc/glue/MessageLink.h
1 	xul.dll 	mozilla::ipc::MessageChannel::CancelTransaction(int) 	ipc/glue/MessageChannel.cpp
2 	xul.dll 	mozilla::ipc::MessageChannel::Send(IPC::Message*, IPC::Message*) 	ipc/glue/MessageChannel.cpp
3 	xul.dll 	mozilla::dom::PBrowserChild::SendGetWidgetNativeData(unsigned __int64*) 	obj-firefox/ipc/ipdl/PBrowserChild.cpp
4 	xul.dll 	mozilla::widget::PuppetWidget::GetNativeData(unsigned int) 	widget/PuppetWidget.cpp
...
37 	xul.dll 	mozilla::ipc::MessageChannel::Send(IPC::Message*, IPC::Message*) 	ipc/glue/MessageChannel.cpp
38 	xul.dll 	mozilla::dom::PBrowserChild::SendGetWidgetNativeData(unsigned __int64*) 	obj-firefox/ipc/ipdl/PBrowserChild.cpp
39 	xul.dll 	mozilla::widget::PuppetWidget::GetNativeData(unsigned int) 	widget/PuppetWidget.cpp

(which seems to be the most common variation)


or in bp-aed03712-923e-4419-b508-5b6622160216 :

0 	xul.dll 	mozilla::ipc::MessageListener::IntentionalCrash() 	ipc/glue/MessageLink.h
1 	xul.dll 	mozilla::ipc::MessageChannel::CancelTransaction(int) 	ipc/glue/MessageChannel.cpp
2 	xul.dll 	mozilla::ipc::MessageChannel::Send(IPC::Message*, IPC::Message*) 	ipc/glue/MessageChannel.cpp
3 	xul.dll 	mozilla::dom::PContentChild::SendLoadPlugin(unsigned int const&, nsresult*, unsigned int*) 	obj-firefox/ipc/ipdl/PContentChild.cpp
4 	xul.dll 	mozilla::plugins::PluginModuleContentParent::LoadModule(unsigned int, nsPluginTag*) 	dom/plugins/ipc/PluginModuleParent.cpp
...
37 	xul.dll 	mozilla::ipc::MessageChannel::Send(IPC::Message*, IPC::Message*) 	ipc/glue/MessageChannel.cpp
38 	xul.dll 	mozilla::dom::PContentChild::SendLoadPlugin(unsigned int const&, nsresult*, unsigned int*) 	obj-firefox/ipc/ipdl/PContentChild.cpp
39 	xul.dll 	mozilla::plugins::PluginModuleContentParent::LoadModule(unsigned int, nsPluginTag*) 	dom/plugins/ipc/PluginModuleParent.cpp


or in bp-a231a285-1e93-4520-9a74-9f3282160211 :

0 	xul.dll 	mozilla::ipc::MessageListener::IntentionalCrash() 	ipc/glue/MessageLink.h
1 	xul.dll 	mozilla::ipc::MessageChannel::CancelTransaction(int) 	ipc/glue/MessageChannel.cpp
2 	xul.dll 	mozilla::ipc::MessageChannel::Send(IPC::Message*, IPC::Message*) 	ipc/glue/MessageChannel.cpp
3 	xul.dll 	mozilla::dom::PContentChild::SendClipboardHasType(nsTArray<nsCString> const&, int const&, bool*) 	obj-firefox/ipc/ipdl/PContentChild.cpp
4 	xul.dll 	nsClipboardProxy::HasDataMatchingFlavors(char const**, unsigned int, int, bool*) 	widget/nsClipboardProxy.cpp
...
42 	xul.dll 	mozilla::ipc::MessageChannel::Send(IPC::Message*, IPC::Message*) 	ipc/glue/MessageChannel.cpp
43 	xul.dll 	mozilla::dom::PContentChild::SendClipboardHasType(nsTArray<nsCString> const&, int const&, bool*) 	obj-firefox/ipc/ipdl/PContentChild.cpp
44 	xul.dll 	nsClipboardProxy::HasDataMatchingFlavors(char const**, unsigned int, int, bool*) 	widget/nsClipboardProxy.cpp




A full query of such crashes on Firefox nightly on Windows since January 15:
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&platform=Windows&date=%3E%3D2016-01-15&signature=mozilla%3A%3Aipc%3A%3AMessageListener%3A%3AIntentionalCrash#aggregations
shows that they started appearing on January 29, and have been showing up at about 30 per day since February 2.  (I'm not quite sure what the delay was in there...)
Flags: needinfo?(wmccloskey)
I suspect bug 1191145 is what caused this signature to spike. It definitely made it more likely that we would get into this situation. I'm still trying to figure out if we should add the script blockers back in a more correct way or if I can find a more complete solution that avoids these intentional crashes entirely.
Flags: needinfo?(wmccloskey)
Assignee: nobody → wmccloskey
Priority: -- → P1
Whiteboard: dom-triaged btpp-fixnow
Are things going alright here, Bill? Do you need a hand?
Flags: needinfo?(wmccloskey)
I've got a work in progress patch that I'm debugging.
Flags: needinfo?(wmccloskey)
Comment on attachment 8724990 [details] [diff] [review]
patch

Submitted before I was ready...

This patch removes the intentional crash in the following situation: there is a race between the child sending a normal priority sync message C1 and the parent sending a high priority sync message P1; in this case, P1 is dispatched in the child first; if P1's transaction is canceled, then we have difficulty deciding what to do about C1, so we crash.

One option would be to cancel C1 along with P1. The problem is that we send a lot of normal prio sync messages in Gecko, and the code isn't written to handle a failure of this sort. I guess you'd want it to retry somehow, but that's not what happens now.

The other option is to allow C1 to proceed. That also causes some tricky issues because the child is now unblocked even through it's awaiting a sync message reply. The stack looks like this:
  Send C1
  ProcessPendingRequests
  Handle P1
  ...some Gecko code...
  CancelCurrentTransaction
When the cancellation completes, the parent is unblocked immediately and the child is free to return to the Gecko code that triggered the cancellation. One concrete thing that can go wrong is that the child sends another sync message C2. It will block waiting for a reply to C2, but it will get the C1 reply first and crash.

It turns out this isn't really that hard to handle. We just need to allow the child to receive the C1 reply and store it somewhere. When we get back to the stack frame for C1, we can deal with the reply at that time.

To store the reply, I expanded AutoEnterTransaction to be a more full-featured way of storing IPC state. Each AutoEnterTransaction object is stored in an explicit stack that's accessible from the channel. Rather than using a bunch of channel-wide state fields, we can consult the stack to figure out what messages we're sending or handling. It's also a convenient place to store replies. I was able to get rid of a lot of ad hoc state fields this way.
Attachment #8724990 - Attachment is patch: true
Attachment #8724990 - Flags: review?(dvander)
Whiteboard: dom-triaged btpp-fixnow → btpp-active
Comment on attachment 8724990 [details] [diff] [review]
patch

Review of attachment 8724990 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, I really like the new AutoEnterTransaction. Just a few questions.

::: ipc/glue/MessageChannel.cpp
@@ +328,5 @@
> +
> +    void Cancel() {
> +        AutoEnterTransaction *cur = mChan->mTransactionStack;
> +        MOZ_RELEASE_ASSERT(cur == this);
> +        while (cur && cur->mPriority != IPC::Message::PRIORITY_NORMAL) {

Does this mean we'll cancel messages from a different transaction?

@@ +396,5 @@
> +            MOZ_RELEASE_ASSERT(IsComplete());
> +        } else {
> +            MOZ_RELEASE_ASSERT(mNext);
> +            mNext->HandleReply(aMessage);
> +        }

It's a little weird to use recursion here... it's almost certainly safe but it'd be easier to follow if flattened.

@@ +821,5 @@
>      // Deferring in the parent only sort of breaks message ordering. When the
>      // child's message comes in, we can pretend the child hasn't quite
>      // finished sending it yet. Since the message is sync, we know that the
>      // child hasn't moved on yet.
> +    return mSide == ParentSide && aMsg.transaction_id() != CurrentHighPriorityTransaction();

Why does this specifically need the high priority transaction?

@@ +1094,3 @@
>  
> +    bool nest = stackTop && stackTop->Priority() == IPC::Message::PRIORITY_HIGH;
> +    int32_t transaction = nest ? stackTop->TransactionID() : seqno;

Why do we need to take the most recent high-priority transaction #, rather than use the sequence number as before?

@@ +2305,5 @@
>  
> +bool
> +MessageChannel::IsInTransaction() const
> +{
> +    MonitorAutoLock lock(*mMonitor);

Taking the lock here is a little weird, but since it's only ever called on the main thread from outside IPC I guess it's safe.
(In reply to David Anderson [:dvander] from comment #7)
> ::: ipc/glue/MessageChannel.cpp
> @@ +328,5 @@
> > +
> > +    void Cancel() {
> > +        AutoEnterTransaction *cur = mChan->mTransactionStack;
> > +        MOZ_RELEASE_ASSERT(cur == this);
> > +        while (cur && cur->mPriority != IPC::Message::PRIORITY_NORMAL) {
> 
> Does this mean we'll cancel messages from a different transaction?

Yes. If two high priority messages race with each other, then we dispatch in the child first. If that transaction is cancelled, then we cancel everything, including the other high prio message. On the other side, this message is removed from the queue here:
https://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#2175

I considered doing the same thing in this case as we do when a normal priority message races with a high priority message. Maybe it would work. It just seemed like more effort without much reward.

> @@ +396,5 @@
> > +            MOZ_RELEASE_ASSERT(IsComplete());
> > +        } else {
> > +            MOZ_RELEASE_ASSERT(mNext);
> > +            mNext->HandleReply(aMessage);
> > +        }
> 
> It's a little weird to use recursion here... it's almost certainly safe but
> it'd be easier to follow if flattened.

Yeah, I can change that.

> @@ +821,5 @@
> >      // Deferring in the parent only sort of breaks message ordering. When the
> >      // child's message comes in, we can pretend the child hasn't quite
> >      // finished sending it yet. Since the message is sync, we know that the
> >      // child hasn't moved on yet.
> > +    return mSide == ParentSide && aMsg.transaction_id() != CurrentHighPriorityTransaction();
> 
> Why does this specifically need the high priority transaction?

I guess this is kind of confusing. The idea is that CurrentHighPriorityTransaction() returns the current transaction and asserts that that transaction is for a high priority message. So it's like a more restrictive version of mCurrentTransaction.

I did this because I was worried about all the places where we used mCurrentTransaction and how they would be affected by a normal priority sync message still on the stack after the current transaction is canceled. To work around that, I found that most users of the current transaction only run if we're known to be in a high priority transaction, in which case there's already something above the normal priority sync message on the stack. So I added this extra-asserty thing to ensure that.

> @@ +1094,3 @@
> >  
> > +    bool nest = stackTop && stackTop->Priority() == IPC::Message::PRIORITY_HIGH;
> > +    int32_t transaction = nest ? stackTop->TransactionID() : seqno;
> 
> Why do we need to take the most recent high-priority transaction #, rather
> than use the sequence number as before?

Previously we used mCurrentTransaction as the transaction ID. That was set by AutoEnterTransaction, which would set mCurrentTransaction to the sequence number of the message if there wasn't already an existing transaction.

This doesn't work now. Imagine that there's already a normal priority sync send on the stack. It's sitting there because a high priority message that raced with it got canceled. Now we want to send another sync message. The new sync message should have a new transaction, so we can't use the transaction of the old normal priority sync message.

I thought about different ways to do this and decided to do it very directly: if the top thing on the stack is for a high priority message, then our call should nest inside that and we use the same transaction ID. Otherwise we need a new transaction ID (so we use the seqno of the message we're sending).

(You might sense a theme here that any uses of mCurrentTransaction are problematic after this patch. It's not clear what the current transaction is when you have one of these stale sync messages on the stack that's been sent but that we're not waiting on.)

> @@ +2305,5 @@
> >  
> > +bool
> > +MessageChannel::IsInTransaction() const
> > +{
> > +    MonitorAutoLock lock(*mMonitor);
> 
> Taking the lock here is a little weird, but since it's only ever called on
> the main thread from outside IPC I guess it's safe.

Yeah, we didn't take the lock before and it seemed racy. In practice I don't think it matters, but I just wanted to avoid a bug report from TSan.
Comment on attachment 8724990 [details] [diff] [review]
patch

Sorry for the late review, thanks for explaining. It is probably worth including some of those comments in the code (especially the one about transaction IDs).
Attachment #8724990 - Flags: review?(dvander) → review+
https://hg.mozilla.org/mozilla-central/rev/ab56b147647f
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Depends on: 1254471
Verified on nightly based on:
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&platform=Windows&date=%3E%3D2016-02-01&signature=mozilla%3A%3Aipc%3A%3AMessageListener%3A%3AIntentionalCrash#aggregations
which shows that the last crashes were in the build from March 6th.  (The nightlies from March 7th had problems building and ended up with very few users, but there were also no crashes in the nightlies from March 8th.)
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.