Closed
Bug 1248750
Opened 8 years ago
Closed 8 years ago
spike in nightly crashes at mozilla::ipc::MessageListener::IntentionalCrash
Categories
(Core :: IPC, defect, P1)
Tracking
VERIFIED
FIXED
mozilla47
Flag | Tracking | Status
---|---|---
e10s | + | ---
firefox46 | --- | unaffected
firefox47 | + | verified
People
(Reporter: dbaron, Assigned: billm)
References
Details
(Keywords: crash, topcrash, Whiteboard: btpp-active)
Crash Data
Attachments
(1 file)
45.70 KB, patch
dvander: review+
This bug was filed from the Socorro interface and is report bp-d6de845d-03be-4cd0-b796-b28f42160216.
=============================================================

Following https://hg.mozilla.org/mozilla-central/rev/cbdcef6d6e01 there's been a spike in crashes at [@ mozilla::ipc::MessageListener::IntentionalCrash]. The stack signatures vary a bit, but they all seem to involve recursion where, in the middle of sending an IPC message, we recur into sending the same IPC message again. e.g., in bp-d6de845d-03be-4cd0-b796-b28f42160216:

0  xul.dll  mozilla::ipc::MessageListener::IntentionalCrash()  ipc/glue/MessageLink.h
1  xul.dll  mozilla::ipc::MessageChannel::CancelTransaction(int)  ipc/glue/MessageChannel.cpp
2  xul.dll  mozilla::ipc::MessageChannel::Send(IPC::Message*, IPC::Message*)  ipc/glue/MessageChannel.cpp
3  xul.dll  mozilla::dom::PBrowserChild::SendGetWidgetNativeData(unsigned __int64*)  obj-firefox/ipc/ipdl/PBrowserChild.cpp
4  xul.dll  mozilla::widget::PuppetWidget::GetNativeData(unsigned int)  widget/PuppetWidget.cpp
...
37 xul.dll  mozilla::ipc::MessageChannel::Send(IPC::Message*, IPC::Message*)  ipc/glue/MessageChannel.cpp
38 xul.dll  mozilla::dom::PBrowserChild::SendGetWidgetNativeData(unsigned __int64*)  obj-firefox/ipc/ipdl/PBrowserChild.cpp
39 xul.dll  mozilla::widget::PuppetWidget::GetNativeData(unsigned int)  widget/PuppetWidget.cpp

(which seems to be the most common variation), or in bp-aed03712-923e-4419-b508-5b6622160216:

0  xul.dll  mozilla::ipc::MessageListener::IntentionalCrash()  ipc/glue/MessageLink.h
1  xul.dll  mozilla::ipc::MessageChannel::CancelTransaction(int)  ipc/glue/MessageChannel.cpp
2  xul.dll  mozilla::ipc::MessageChannel::Send(IPC::Message*, IPC::Message*)  ipc/glue/MessageChannel.cpp
3  xul.dll  mozilla::dom::PContentChild::SendLoadPlugin(unsigned int const&, nsresult*, unsigned int*)  obj-firefox/ipc/ipdl/PContentChild.cpp
4  xul.dll  mozilla::plugins::PluginModuleContentParent::LoadModule(unsigned int, nsPluginTag*)  dom/plugins/ipc/PluginModuleParent.cpp
...
37 xul.dll  mozilla::ipc::MessageChannel::Send(IPC::Message*, IPC::Message*)  ipc/glue/MessageChannel.cpp
38 xul.dll  mozilla::dom::PContentChild::SendLoadPlugin(unsigned int const&, nsresult*, unsigned int*)  obj-firefox/ipc/ipdl/PContentChild.cpp
39 xul.dll  mozilla::plugins::PluginModuleContentParent::LoadModule(unsigned int, nsPluginTag*)  dom/plugins/ipc/PluginModuleParent.cpp

or in bp-a231a285-1e93-4520-9a74-9f3282160211:

0  xul.dll  mozilla::ipc::MessageListener::IntentionalCrash()  ipc/glue/MessageLink.h
1  xul.dll  mozilla::ipc::MessageChannel::CancelTransaction(int)  ipc/glue/MessageChannel.cpp
2  xul.dll  mozilla::ipc::MessageChannel::Send(IPC::Message*, IPC::Message*)  ipc/glue/MessageChannel.cpp
3  xul.dll  mozilla::dom::PContentChild::SendClipboardHasType(nsTArray<nsCString> const&, int const&, bool*)  obj-firefox/ipc/ipdl/PContentChild.cpp
4  xul.dll  nsClipboardProxy::HasDataMatchingFlavors(char const**, unsigned int, int, bool*)  widget/nsClipboardProxy.cpp
...
42 xul.dll  mozilla::ipc::MessageChannel::Send(IPC::Message*, IPC::Message*)  ipc/glue/MessageChannel.cpp
43 xul.dll  mozilla::dom::PContentChild::SendClipboardHasType(nsTArray<nsCString> const&, int const&, bool*)  obj-firefox/ipc/ipdl/PContentChild.cpp
44 xul.dll  nsClipboardProxy::HasDataMatchingFlavors(char const**, unsigned int, int, bool*)  widget/nsClipboardProxy.cpp

A full query of such crashes on Firefox nightly on Windows since January 15:
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&platform=Windows&date=%3E%3D2016-01-15&signature=mozilla%3A%3Aipc%3A%3AMessageListener%3A%3AIntentionalCrash#aggregations
shows that they started appearing on January 29, and have been showing up at about 30 per day since February 2. (I'm not quite sure what the delay was in there...)
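The re-entrant pattern in these stacks can be sketched with a toy model: a sync Send() that, while blocked, dispatches an incoming message whose handler issues the same sync Send() again. This is illustrative only, not Gecko code; ToyChannel, kMaxNesting, and the handler callback are invented for this sketch, and the depth limit only mimics the ~40-frame recursion visible in the crash reports.

```cpp
#include <cassert>
#include <functional>

// Toy model (not Gecko code) of the re-entrant sync-send pattern above.
struct ToyChannel {
    int nesting = 0;   // how many Send() frames are on the stack
    int cancels = 0;   // how often we gave up, as CancelTransaction() does

    static constexpr int kMaxNesting = 40;

    // aHandler models the Gecko code that runs while Send() is blocked and
    // may re-enter Send() on the same channel.
    bool Send(const std::function<void(ToyChannel&)>& aHandler) {
        if (nesting >= kMaxNesting) {
            // In the real code this is roughly where CancelTransaction()
            // reaches MessageListener::IntentionalCrash().
            ++cancels;
            return false;
        }
        ++nesting;
        aHandler(*this);  // dispatch work received while blocked
        --nesting;
        return true;
    }
};
```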
Flags: needinfo?(wmccloskey)
Updated•8 years ago
tracking-e10s:
--- → ?
Assignee
Comment 1•8 years ago
I suspect bug 1191145 is what caused this signature to spike. It definitely made it more likely that we would get into this situation. I'm still trying to figure out if we should add the script blockers back in a more correct way or if I can find a more complete solution that avoids these intentional crashes entirely.
Flags: needinfo?(wmccloskey)
Assignee
Updated•8 years ago
Assignee: nobody → wmccloskey
Reporter
Updated•8 years ago
Updated•8 years ago
Priority: -- → P1
Updated•8 years ago
Whiteboard: dom-triaged btpp-fixnow
Comment 2•8 years ago
Are things going alright here, Bill? Do you need a hand?
Flags: needinfo?(wmccloskey)
Comment 3•8 years ago

Tracked as this is a top-crash.
Assignee
Comment 4•8 years ago
I've got a work-in-progress patch that I'm debugging.
Flags: needinfo?(wmccloskey)
Assignee
Comment 5•8 years ago
Assignee
Comment 6•8 years ago
Comment on attachment 8724990 [details] [diff] [review]
patch

Submitted before I was ready...

This patch removes the intentional crash in the following situation: there is a race between the child sending a normal priority sync message C1 and the parent sending a high priority sync message P1; in this case, P1 is dispatched in the child first. If P1's transaction is canceled, then we have difficulty deciding what to do about C1, so we crash.

One option would be to cancel C1 along with P1. The problem is that we send a lot of normal prio sync messages in Gecko, and the code isn't written to handle a failure of this sort. I guess you'd want it to retry somehow, but that's not what happens now.

The other option is to allow C1 to proceed. That also causes some tricky issues because the child is now unblocked even though it's awaiting a sync message reply. The stack looks like this:

Send C1
  ProcessPendingRequests
    Handle P1
      ...some Gecko code...
        CancelCurrentTransaction

When the cancellation completes, the parent is unblocked immediately and the child is free to return to the Gecko code that triggered the cancellation. One concrete thing that can go wrong is that the child sends another sync message C2. It will block waiting for a reply to C2, but it will get the C1 reply first and crash.

It turns out this isn't really that hard to handle. We just need to allow the child to receive the C1 reply and store it somewhere. When we get back to the stack frame for C1, we can deal with the reply at that time.

To store the reply, I expanded AutoEnterTransaction to be a more full-featured way of storing IPC state. Each AutoEnterTransaction object is stored in an explicit stack that's accessible from the channel. Rather than using a bunch of channel-wide state fields, we can consult the stack to figure out what messages we're sending or handling. It's also a convenient place to store replies. I was able to get rid of a lot of ad hoc state fields this way.
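The AutoEnterTransaction stack described above can be sketched roughly as follows. This is an illustrative model, not the actual Gecko class: ToyChannel, FrameFor, and the member names are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Sketch: an RAII guard that links itself onto a per-channel stack so the
// channel can consult the stack of in-flight transactions instead of
// channel-wide state fields. Each frame also has a slot to stash a reply
// that arrives while the frame is not the one currently being waited on.
enum class Priority { Normal, High };

struct ToyChannel;

struct AutoEnterTransaction {
    AutoEnterTransaction(ToyChannel* aChan, int32_t aId, Priority aPrio);
    ~AutoEnterTransaction();

    ToyChannel* mChan;
    int32_t mTransactionId;
    Priority mPriority;
    bool mHaveReply = false;      // a reply stored for this frame, if any
    AutoEnterTransaction* mNext;  // next-older frame on the stack
};

struct ToyChannel {
    AutoEnterTransaction* mTransactionStack = nullptr;

    // Find the frame a reply belongs to, walking from newest to oldest.
    AutoEnterTransaction* FrameFor(int32_t aId) {
        for (AutoEnterTransaction* t = mTransactionStack; t; t = t->mNext) {
            if (t->mTransactionId == aId) {
                return t;
            }
        }
        return nullptr;
    }
};

AutoEnterTransaction::AutoEnterTransaction(ToyChannel* aChan, int32_t aId,
                                           Priority aPrio)
    : mChan(aChan), mTransactionId(aId), mPriority(aPrio),
      mNext(aChan->mTransactionStack) {
    mChan->mTransactionStack = this;  // push on construction
}

AutoEnterTransaction::~AutoEnterTransaction() {
    // Frames unwind in strict LIFO order with the C++ stack.
    assert(mChan->mTransactionStack == this);
    mChan->mTransactionStack = mNext;  // pop on destruction
}
```

Because each frame is addressable from the channel, a reply that arrives for an older frame (the C1 situation above) can be stored in that frame and handled when the stack unwinds back to it.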
Attachment #8724990 - Attachment is patch: true
Attachment #8724990 - Flags: review?(dvander)
Updated•8 years ago
Whiteboard: dom-triaged btpp-fixnow → btpp-active
Comment 7•8 years ago

Comment on attachment 8724990 [details] [diff] [review]
patch

Review of attachment 8724990 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, I really like the new AutoEnterTransaction. Just a few questions.

::: ipc/glue/MessageChannel.cpp
@@ +328,5 @@
> +
> +    void Cancel() {
> +        AutoEnterTransaction *cur = mChan->mTransactionStack;
> +        MOZ_RELEASE_ASSERT(cur == this);
> +        while (cur && cur->mPriority != IPC::Message::PRIORITY_NORMAL) {

Does this mean we'll cancel messages from a different transaction?

@@ +396,5 @@
> +        MOZ_RELEASE_ASSERT(IsComplete());
> +    } else {
> +        MOZ_RELEASE_ASSERT(mNext);
> +        mNext->HandleReply(aMessage);
> +    }

It's a little weird to use recursion here... it's almost certainly safe but it'd be easier to follow if flattened.

@@ +821,5 @@
>     // Deferring in the parent only sort of breaks message ordering. When the
>     // child's message comes in, we can pretend the child hasn't quite
>     // finished sending it yet. Since the message is sync, we know that the
>     // child hasn't moved on yet.
> +    return mSide == ParentSide && aMsg.transaction_id() != CurrentHighPriorityTransaction();

Why does this specifically need the high priority transaction?

@@ +1094,3 @@
>
> +    bool nest = stackTop && stackTop->Priority() == IPC::Message::PRIORITY_HIGH;
> +    int32_t transaction = nest ? stackTop->TransactionID() : seqno;

Why do we need to take the most recent high-priority transaction #, rather than use the sequence number as before?

@@ +2305,5 @@
>
> +bool
> +MessageChannel::IsInTransaction() const
> +{
> +    MonitorAutoLock lock(*mMonitor);

Taking the lock here is a little weird, but since it's only ever called on the main thread from outside IPC I guess it's safe.
Assignee
Comment 8•8 years ago
(In reply to David Anderson [:dvander] from comment #7)
> ::: ipc/glue/MessageChannel.cpp
> @@ +328,5 @@
> > +
> > +    void Cancel() {
> > +        AutoEnterTransaction *cur = mChan->mTransactionStack;
> > +        MOZ_RELEASE_ASSERT(cur == this);
> > +        while (cur && cur->mPriority != IPC::Message::PRIORITY_NORMAL) {
>
> Does this mean we'll cancel messages from a different transaction?

Yes. If two high priority messages race with each other, then we dispatch in the child first. If that transaction is cancelled, then we cancel everything, including the other high prio message. On the other side, this message is removed from the queue here:
https://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#2175

I considered doing the same thing in this case as we do when a normal priority message races with a high priority message. Maybe it would work. It just seemed like more effort without much reward.

> @@ +396,5 @@
> > +        MOZ_RELEASE_ASSERT(IsComplete());
> > +    } else {
> > +        MOZ_RELEASE_ASSERT(mNext);
> > +        mNext->HandleReply(aMessage);
> > +    }
>
> It's a little weird to use recursion here... it's almost certainly safe but
> it'd be easier to follow if flattened.

Yeah, I can change that.

> @@ +821,5 @@
> >     // Deferring in the parent only sort of breaks message ordering. When the
> >     // child's message comes in, we can pretend the child hasn't quite
> >     // finished sending it yet. Since the message is sync, we know that the
> >     // child hasn't moved on yet.
> > +    return mSide == ParentSide && aMsg.transaction_id() != CurrentHighPriorityTransaction();
>
> Why does this specifically need the high priority transaction?

I guess this is kind of confusing. The idea is that CurrentHighPriorityTransaction() returns the current transaction and asserts that that transaction is for a high priority message. So it's like a more restrictive version of mCurrentTransaction.
I did this because I was worried about all the places where we used mCurrentTransaction and how they would be affected by a normal priority sync message still on the stack after the current transaction is canceled. To work around that, I found that most users of the current transaction only run if we're known to be in a high priority transaction, in which case there's already something above the normal priority sync message on the stack. So I added this extra-asserty thing to ensure that.

> @@ +1094,3 @@
> >
> > +    bool nest = stackTop && stackTop->Priority() == IPC::Message::PRIORITY_HIGH;
> > +    int32_t transaction = nest ? stackTop->TransactionID() : seqno;
>
> Why do we need to take the most recent high-priority transaction #, rather
> than use the sequence number as before?

Previously we used mCurrentTransaction as the transaction ID. That was set by AutoEnterTransaction, which would set mCurrentTransaction to the sequence number of the message if there wasn't already an existing transaction.

This doesn't work now. Imagine that there's already a normal priority sync send on the stack. It's sitting there because a high priority message that raced with it got canceled. Now we want to send another sync message. The new sync message should have a new transaction, so we can't use the transaction of the old normal priority sync message.

I thought about different ways to do this and decided to do it very directly: if the top thing on the stack is for a high priority message, then our call should nest inside that and we use the same transaction ID. Otherwise we need a new transaction ID (so we use the seqno of the message we're sending).

(You might sense a theme here that any uses of mCurrentTransaction are problematic after this patch. It's not clear what the current transaction is when you have one of these stale sync messages on the stack that's been sent but that we're not waiting on.)
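The nesting rule described above is small enough to sketch. StackTop and PickTransactionId are invented names for illustration; the real code inspects the top AutoEnterTransaction frame on the channel's stack.

```cpp
#include <cassert>
#include <cstdint>

// Sketch: nest inside the transaction on top of the stack only if that frame
// is high priority; otherwise (empty stack, or a stale normal-priority sync
// send left behind by a cancellation) start a fresh transaction using the
// new message's own sequence number.
struct StackTop {
    bool exists;         // is there any frame on the stack?
    bool highPriority;   // is the top frame a high-priority transaction?
    int32_t transactionId;
};

int32_t PickTransactionId(const StackTop& aTop, int32_t aSeqno) {
    bool nest = aTop.exists && aTop.highPriority;
    return nest ? aTop.transactionId : aSeqno;
}
```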
> @@ +2305,5 @@
> >
> > +bool
> > +MessageChannel::IsInTransaction() const
> > +{
> > +    MonitorAutoLock lock(*mMonitor);
>
> Taking the lock here is a little weird, but since it's only ever called on
> the main thread from outside IPC I guess it's safe.

Yeah, we didn't take the lock before and it seemed racy. In practice I don't think it matters, but I just wanted to avoid a bug report from TSan.
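The Cancel() walk discussed above — cancel every frame from the top of the stack down until a normal-priority frame is reached, so a racing high-priority transaction is canceled along with the current one — can be modeled on a plain vector. Illustrative only; the real code walks the intrusive AutoEnterTransaction stack.

```cpp
#include <cassert>
#include <vector>

// Model of the Cancel() walk: stack.back() is the newest frame; returns how
// many frames get canceled before the walk stops at a normal-priority frame.
enum class Prio { Normal, High };

int CancelFromTop(const std::vector<Prio>& stack) {
    int canceled = 0;
    for (auto it = stack.rbegin(); it != stack.rend(); ++it) {
        if (*it == Prio::Normal) {
            break;  // the normal-priority sync send is left on the stack
        }
        ++canceled;
    }
    return canceled;
}
```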
Comment on attachment 8724990 [details] [diff] [review]
patch

Sorry for the late review, thanks for explaining. It is probably worth including some of those comments in the code (especially the one about transaction IDs).
Attachment #8724990 - Flags: review?(dvander) → review+
Comment 11•8 years ago
Backout: https://hg.mozilla.org/integration/mozilla-inbound/rev/806c404edf06
Comment 13•8 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/ab56b147647f
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Reporter
Comment 14•8 years ago
Verified on nightly based on:
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&platform=Windows&date=%3E%3D2016-02-01&signature=mozilla%3A%3Aipc%3A%3AMessageListener%3A%3AIntentionalCrash#aggregations
which shows that the last crashes were in the build from March 6th. (The nightlies from March 7th had problems building and ended up with very few users, but there were also no crashes in the nightlies from March 8th.)
Status: RESOLVED → VERIFIED