Closed Bug 1331193 Opened 3 years ago Closed 3 years ago

[e10s] Crash in mozilla::dom::TabChild::RecvAsyncMessage

Categories

(Core :: DOM: Core & HTML, defect, critical)

51 Branch
All
Windows
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla53
Tracking Status
firefox50 --- wontfix
firefox51 blocking fixed
firefox52 --- fixed
firefox53 --- fixed

People

(Reporter: philipp, Assigned: bkelly)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(2 files, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-e141f5ec-7648-4006-8f23-da7042170114.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::dom::TabChild::RecvAsyncMessage(nsString const&, nsTArray<mozilla::jsipc::CpowEntry>&&, IPC::Principal const&, mozilla::dom::ClonedMessageData const&) 	dom/ipc/TabChild.cpp:2365
1 	xul.dll 	mozilla::dom::PBrowserChild::OnMessageReceived(IPC::Message const&) 	obj-firefox/ipc/ipdl/PBrowserChild.cpp:2894
2 	xul.dll 	mozilla::dom::PContentChild::OnMessageReceived(IPC::Message const&) 	obj-firefox/ipc/ipdl/PContentChild.cpp:6521
3 	xul.dll 	mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&) 	ipc/glue/MessageChannel.cpp:1662
4 	xul.dll 	mozilla::ipc::MessageChannel::DispatchMessageW(IPC::Message&&) 	ipc/glue/MessageChannel.cpp:1600
5 	xul.dll 	mozilla::ipc::MessageChannel::OnMaybeDequeueOne() 	ipc/glue/MessageChannel.cpp:1567
6 	xul.dll 	mozilla::detail::RunnableMethodImpl<bool ( mozilla::ipc::MessageChannel::*)(void), 0, 1>::Run() 	obj-firefox/dist/include/nsThreadUtils.h:768
7 	xul.dll 	mozilla::ipc::MessageChannel::DequeueTask::Run() 	obj-firefox/dist/include/mozilla/ipc/MessageChannel.h:559
8 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1067

this crash in the content process has been around for a while, but spiked up in volume in beta-nightly prerelease versions on 2017-01-13.
[Tracking Requested - why for this release]:
this is a late regression in 51.0b14 and causing nearly 5% of content crashes there at the moment.
#5 topcrash in Aurora 20170113004016. The crash address is usually (always?) 0x3e.

billm, any ideas?
Flags: needinfo?(wmccloskey)
Track 51+ as the crash starts to grow in 51.0b14.
The volume in beta14 is so high that I am afraid that it is going to reach unacceptable level in release. Updating the tracking value.
The only obvious DOM-looking change I see common to both pushlog ranges in #c1 is bug 1329989. Any chance it's related, Ben?
Flags: needinfo?(bkelly)
Blassey noted during the channel meeting today that this is strongly correlated to the addons cohort on 51 as well, so it could be an addon update causing this.
I don't think its related, but I can take a stab at fixing it.  I don't think this code correctly handles the mMessageManager getting cycle collected.
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
Flags: needinfo?(bkelly)
Olli, this tries to avoid nullptr de-refs in TabChild when mMessageManager has been cleared.  I *think* this can happen if the mTabChildGlobal has been CC'd but the TabChild actor is still being held alive by IPC.

I also considered calling mMessageManager->Disconnect() in the TabChildGlobal CC unlink, but not sure if that is needed.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=842fa114f2c99df72747294e86929db2d3be3663
Attachment #8827566 - Flags: review?(bugs)
adding rhelmer to look into comment 7....

who could look into a crash stat showing some correlation with add-ons?  https://crash-stats.mozilla.com/report/index/e141f5ec-7648-4006-8f23-da7042170114#tab-correlation last line is (63.06% in signature vs 19.22% overall) e10s_cohort = addons-set51alladdons-test [63.06% vs 42.67% if process_type = content].   

the "extensions" correlated are all system add-ons (pocket, automatic update service, webcompat, e10sroll-out, and default theme) and present with or without the experiment.
Flags: needinfo?(rhelmer)
So IPC keeps a reference to TabChild and DeallocPBrowserChild releases it, meaning that CC can't really unlink before DeallocPBrowserChild.
Comment on attachment 8827566 [details] [diff] [review]
Make TabChild handle a nullptr mMessageManager. r=smaug

Add MOZ_DIAGNOSTIC_ASSERT as discussed on IRC.

This still puzzles me.
Attachment #8827566 - Flags: review?(bugs) → review+
Added diagnostic assertions.
Attachment #8827566 - Attachment is obsolete: true
Attachment #8827604 - Flags: review+
I can't think of why this would happen. I agree with Olli that the CC issue shouldn't be possible. We could take Ben's patch and it would probably make the crash go away, but it's a little scary that we don't know what's happening.

The only two possibilities I can think of are:
1. We've received a message after TabChild::ActorDestroy has run, or
2. We've received a message before the message manager has been initialized.

#1 seems pretty unlikely. If that happened, lots of things would be horribly broken. We could add a release assertion for it pretty easily though. We'd just need a state variable to track if ActorDestroy ran yet or not.

#2 seems a little more likely to me. We know we have an mTabChildGlobal, so InitTabChildGlobal must have run. If this line fails, it is possible that we could get into this situation:
http://searchfox.org/mozilla-central/rev/790b2cb423ea1ecb5746722d51633caec9bab95a/dom/ipc/TabChild.cpp#2509
We could add a release assertion there and see if it fires.
I'll add a MOZ_DIAGNOSTIC_ASSERT after InitTabChildGlobal().  Thanks.
(In reply to Ben Kelly [:bkelly] from comment #15)
> I'll add a MOZ_DIAGNOSTIC_ASSERT after InitTabChildGlobal().  Thanks.

After talking on IRC I think the plan is to keep the original patch, but also add a nightly P2 patch that tries to fix the init failure path.  We will only save mTabChildGlobal if we know the initialization has succeeded.  This will avoid inconsistent state.
As discussed on IRC this moves the assignment to mTabChildGlobal to take place after the conditional failure checks.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=33c30680013d5c7dcb751dcb69a6d7fc76612aae
Attachment #8827621 - Flags: review?(wmccloskey)
Attachment #8827621 - Flags: review?(wmccloskey) → review+
Flags: needinfo?(wmccloskey)
Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d75f6cff73c8
Make TabChild handle a nullptr mMessageManager. r=smaug
https://hg.mozilla.org/integration/mozilla-inbound/rev/013e646a2f7b
P2 Don't leave TabChild in inconsistent state if init fails. r=billm
Comment on attachment 8827604 [details] [diff] [review]
Make TabChild handle a nullptr mMessageManager. r=smaug

This is for this patch only.  Do not uplift P2 as its a higher risk.

Approval Request Comment
[Feature/Bug causing the regression]: Unclear what the cause is, but we are getting a high volume of nullptr de-ref crashes on beta.
[User impact if declined]: Crashes
[Is this code covered by automated tests?]: Our automated test suite does not trigger this crash.
[Has the fix been verified in Nightly?]: We don't have STR so we can't manually verify.
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: Minimal risk
[Why is the change risky/not risky?]: This patch tries to mitigate the problem by checking for nullptr here.
[String changes made/needed]: None
Attachment #8827604 - Flags: approval-mozilla-beta?
Attachment #8827604 - Flags: approval-mozilla-aurora?
Comment on attachment 8827604 [details] [diff] [review]
Make TabChild handle a nullptr mMessageManager. r=smaug

Top crash, blocking issue, let's try this for 51 RC2. 
We have to land the changes for 51 on m-r as well as m-b. (beta->release merge was yesterday)
Attachment #8827604 - Flags: approval-mozilla-release+
Attachment #8827604 - Flags: approval-mozilla-beta?
Attachment #8827604 - Flags: approval-mozilla-beta+
Attachment #8827604 - Flags: approval-mozilla-aurora?
Attachment #8827604 - Flags: approval-mozilla-aurora+
Needs rebasing for Aurora.
Flags: needinfo?(rhelmer) → needinfo?(bkelly)
https://hg.mozilla.org/mozilla-central/rev/d75f6cff73c8
https://hg.mozilla.org/mozilla-central/rev/013e646a2f7b
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
Pushed to beta on closed tree:

https://hg.mozilla.org/releases/mozilla-beta/rev/df64cba2a9edf6b8046f9e7e63dc959bff63c419

Ryan, can you merge this to release?  I think this is what you suggested.
Flags: needinfo?(bkelly) → needinfo?(ryanvm)
Wes, if you're around and ryan hasn't already done it, I'd like this on m-r as well so we can get it into the RC2 build.
Flags: needinfo?(wkocher)
I'll get it.
Flags: needinfo?(wkocher)
Flags: needinfo?(ryanvm)
Crash Signature: [@ mozilla::dom::TabChild::RecvAsyncMessage] → [@ mozilla::dom::TabChild::RecvAsyncMessage] [@ <name omitted> | mozilla::dom::TabChild::RecvAsyncMessage ]
Looks good so far.  I don't see any RecvAsyncMessage() crashes in the RC2 build.
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.