Closed Bug 1331193 Opened 3 years ago Closed 3 years ago
[e10s] Crash in mozilla::dom::Tab
Child::Recv Async Message
3.35 KB, patch
|Details | Diff | Splinter Review|
1.40 KB, patch
|Details | Diff | Splinter Review|
This bug was filed from the Socorro interface and is report bp-e141f5ec-7648-4006-8f23-da7042170114. ============================================================= Crashing Thread (0) Frame Module Signature Source 0 xul.dll mozilla::dom::TabChild::RecvAsyncMessage(nsString const&, nsTArray<mozilla::jsipc::CpowEntry>&&, IPC::Principal const&, mozilla::dom::ClonedMessageData const&) dom/ipc/TabChild.cpp:2365 1 xul.dll mozilla::dom::PBrowserChild::OnMessageReceived(IPC::Message const&) obj-firefox/ipc/ipdl/PBrowserChild.cpp:2894 2 xul.dll mozilla::dom::PContentChild::OnMessageReceived(IPC::Message const&) obj-firefox/ipc/ipdl/PContentChild.cpp:6521 3 xul.dll mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&) ipc/glue/MessageChannel.cpp:1662 4 xul.dll mozilla::ipc::MessageChannel::DispatchMessageW(IPC::Message&&) ipc/glue/MessageChannel.cpp:1600 5 xul.dll mozilla::ipc::MessageChannel::OnMaybeDequeueOne() ipc/glue/MessageChannel.cpp:1567 6 xul.dll mozilla::detail::RunnableMethodImpl<bool ( mozilla::ipc::MessageChannel::*)(void), 0, 1>::Run() obj-firefox/dist/include/nsThreadUtils.h:768 7 xul.dll mozilla::ipc::MessageChannel::DequeueTask::Run() obj-firefox/dist/include/mozilla/ipc/MessageChannel.h:559 8 xul.dll nsThread::ProcessNextEvent(bool, bool*) xpcom/threads/nsThread.cpp:1067 this crash in the content process has been around for a while, but spiked up in volume in beta-nightly prerelease versions on 2017-01-13.
this would be the pushlog for 52.0a2 20170113004016 - 1 day: https://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?fromchange=1abb4b193740653f1161d86f38dbb79287e7d69b&tochange=1e322cd9c7410917d4c5f1fa5b2cd78eb4a79ec6 and from 51.0b13 to 14: https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_51_0b13_RELEASE&tochange=FIREFOX_51_0b14_RELEASE
[Tracking Requested - why for this release]: this is a late regression in 51.0b14 and causing nearly 5% of content crashes there at the moment.
#5 topcrash in Aurora 20170113004016. The crash address is usually (always?) 0x3e. billm, any ideas?
The volume in beta14 is so high that I am afraid that it is going to reach unacceptable level in release. Updating the tracking value.
The only obvious DOM-looking change I see common to both pushlog ranges in #c1 is bug 1329989. Any chance it's related, Ben?
Blassey noted during the channel meeting today that this is strongly correlated to the addons cohort on 51 as well, so it could be an addon update causing this.
I don't think its related, but I can take a stab at fixing it. I don't think this code correctly handles the mMessageManager getting cycle collected.
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
Olli, this tries to avoid nullptr de-refs in TabChild when mMessageManager has been cleared. I *think* this can happen if the mTabChildGlobal has been CC'd but the TabChild actor is still being held alive by IPC. I also considered calling mMessageManager->Disconnect() in the TabChildGlobal CC unlink, but not sure if that is needed. https://treeherder.mozilla.org/#/jobs?repo=try&revision=842fa114f2c99df72747294e86929db2d3be3663
Attachment #8827566 - Flags: review?(bugs)
adding rhelmer to look into comment 7.... who could look into a crash stat showing some correlation with add-ons? https://crash-stats.mozilla.com/report/index/e141f5ec-7648-4006-8f23-da7042170114#tab-correlation last line is (63.06% in signature vs 19.22% overall) e10s_cohort = addons-set51alladdons-test [63.06% vs 42.67% if process_type = content]. the "extensions" correlated are all system add-ons (pocket, automatic update service, webcompat, e10sroll-out, and default theme) and present with or without the experiment.
So IPC keeps a reference to TabChild and DeallocPBrowserChild releases it, meaning that CC can't really unlink before DeallocPBrowserChild.
Comment on attachment 8827566 [details] [diff] [review] Make TabChild handle a nullptr mMessageManager. r=smaug Add MOZ_DIAGNOSTIC_ASSERT as discussed on IRC. This still puzzles me.
Attachment #8827566 - Flags: review?(bugs) → review+
Added diagnostic assertions.
I can't think of why this would happen. I agree with Olli that the CC issue shouldn't be possible. We could take Ben's patch and it would probably make the crash go away, but it's a little scary that we don't know what's happening. The only two possibilities I can think of are: 1. We've received a message after TabChild::ActorDestroy has run, or 2. We've received a message before the message manager has been initialized. #1 seems pretty unlikely. If that happened, lots of things would be horribly broken. We could add a release assertion for it pretty easily though. We'd just need a state variable to track if ActorDestroy ran yet or not. #2 seems a little more likely to me. We know we have an mTabChildGlobal, so InitTabChildGlobal must have run. If this line fails, it is possible that we could get into this situation: http://searchfox.org/mozilla-central/rev/790b2cb423ea1ecb5746722d51633caec9bab95a/dom/ipc/TabChild.cpp#2509 We could add a release assertion there and see if it fires.
I'll add a MOZ_DIAGNOSTIC_ASSERT after InitTabChildGlobal(). Thanks.
(In reply to Ben Kelly [:bkelly] from comment #15) > I'll add a MOZ_DIAGNOSTIC_ASSERT after InitTabChildGlobal(). Thanks. After talking on IRC I think the plan is to keep the original patch, but also add a nightly P2 patch that tries to fix the init failure path. We will only save mTabChildGlobal if we know the initialization has succeeded. This will avoid inconsistent state.
As discussed on IRC this moves the assignment to mTabChildGlobal to take place after the conditional failure checks. https://treeherder.mozilla.org/#/jobs?repo=try&revision=33c30680013d5c7dcb751dcb69a6d7fc76612aae
Attachment #8827621 - Flags: review?(wmccloskey)
Attachment #8827621 - Flags: review?(wmccloskey) → review+
Pushed by firstname.lastname@example.org: https://hg.mozilla.org/integration/mozilla-inbound/rev/d75f6cff73c8 Make TabChild handle a nullptr mMessageManager. r=smaug https://hg.mozilla.org/integration/mozilla-inbound/rev/013e646a2f7b P2 Don't leave TabChild in inconsistent state if init fails. r=billm
Comment on attachment 8827604 [details] [diff] [review] Make TabChild handle a nullptr mMessageManager. r=smaug This is for this patch only. Do not uplift P2 as its a higher risk. Approval Request Comment [Feature/Bug causing the regression]: Unclear what the cause is, but we are getting a high volume of nullptr de-ref crashes on beta. [User impact if declined]: Crashes [Is this code covered by automated tests?]: Our automated test suite does not trigger this crash. [Has the fix been verified in Nightly?]: We don't have STR so we can't manually verify. [Needs manual test from QE? If yes, steps to reproduce]: No [List of other uplifts needed for the feature/fix]: None [Is the change risky?]: Minimal risk [Why is the change risky/not risky?]: This patch tries to mitigate the problem by checking for nullptr here. [String changes made/needed]: None
Comment on attachment 8827604 [details] [diff] [review] Make TabChild handle a nullptr mMessageManager. r=smaug Top crash, blocking issue, let's try this for 51 RC2. We have to land the changes for 51 on m-r as well as m-b. (beta->release merge was yesterday)
Needs rebasing for Aurora.
Flags: needinfo?(rhelmer) → needinfo?(bkelly)
Pushed to beta on closed tree: https://hg.mozilla.org/releases/mozilla-beta/rev/df64cba2a9edf6b8046f9e7e63dc959bff63c419 Ryan, can you merge this to release? I think this is what you suggested.
Flags: needinfo?(bkelly) → needinfo?(ryanvm)
Wes, if you're around and ryan hasn't already done it, I'd like this on m-r as well so we can get it into the RC2 build.
I'll get it.
Crash Signature: [@ mozilla::dom::TabChild::RecvAsyncMessage] → [@ mozilla::dom::TabChild::RecvAsyncMessage] [@ <name omitted> | mozilla::dom::TabChild::RecvAsyncMessage ]
Looks good so far. I don't see any RecvAsyncMessage() crashes in the RC2 build.
You need to log in before you can comment on or make changes to this bug.