Closed
Bug 1389759
Opened 7 years ago
Closed 6 years ago
Crash in mozilla::ipc::MessageChannel::CommonThreadOpenInit
Categories
(Core :: Graphics: Layers, defect, P3)
Tracking
()
RESOLVED
FIXED
mozilla58
People
(Reporter: philipp, Assigned: aosmond)
Details
(Keywords: crash, regression, Whiteboard: [gfx-noted])
Crash Data
Attachments
(2 files)
12.37 KB,
patch
|
dvander
:
review+
Sylvestre
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
2.00 KB,
patch
|
rhunt
:
review+
lizzard
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is report bp-74ecfb98-8232-4d19-ade5-0034f0170811. ============================================================= Crashing Thread (30), Name: Compositor Frame Module Signature Source 0 xul.dll mozilla::ipc::MessageChannel::CommonThreadOpenInit(mozilla::ipc::MessageChannel*, mozilla::ipc::Side) ipc/glue/MessageChannel.cpp:840 1 xul.dll mozilla::ipc::MessageChannel::OnOpenAsSlave(mozilla::ipc::MessageChannel*, mozilla::ipc::Side) ipc/glue/MessageChannel.cpp:826 2 xul.dll mozilla::detail::RunnableMethodImpl<mozilla::ipc::MessageChannel*, void ( mozilla::ipc::MessageChannel::*)(mozilla::ipc::MessageChannel*, mozilla::ipc::Side), 0, 0, mozilla::ipc::MessageChannel*, mozilla::ipc::Side>::Run() xpcom/threads/nsThreadUtils.h:1172 3 xul.dll MessageLoop::RunTask(already_AddRefed<nsIRunnable>) ipc/chromium/src/base/message_loop.cc:452 4 xul.dll MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask&&) ipc/chromium/src/base/message_loop.cc:460 5 xul.dll MessageLoop::DoWork() ipc/chromium/src/base/message_loop.cc:535 6 xul.dll base::MessagePumpForUI::DoRunLoop() ipc/chromium/src/base/message_pump_win.cc:210 7 xul.dll base::MessagePumpWin::Run(base::MessagePump::Delegate*) ipc/chromium/src/base/message_pump_win.h:80 8 xul.dll MessageLoop::RunHandler() ipc/chromium/src/base/message_loop.cc:319 9 xul.dll MessageLoop::Run() ipc/chromium/src/base/message_loop.cc:299 10 xul.dll base::Thread::ThreadMain() ipc/chromium/src/base/thread.cc:181 11 xul.dll `anonymous namespace'::ThreadFunc ipc/chromium/src/base/platform_thread_win.cc:28 12 kernel32.dll BaseThreadInitThunk 13 ntdll.dll RtlUserThreadStart crash reports with this signature are getting more frequent since firefox 56. Correlations for Firefox Nightly (not usre how valid, they are from a handful of affected installations last week): (100.0% in signature vs 00.44% overall) "D3D11 Layers-" in app_notes = true (100.0% in signature vs 00.51% overall) "D3D11 Layers?" in app_notes = true (100.0% in signature vs 00.61% overall) GFX_ERROR "[D3D11] failed to get compositor device." = true (100.0% in signature vs 00.62% overall) GFX_ERROR "[D3D11] Failed to init compositor with reason: " = true (100.0% in signature vs 00.96% overall) reason = EXCEPTION_ACCESS_VIOLATION_WRITE (100.0% in signature vs 01.01% overall) GFX_ERROR "[D3D11] " = true (100.0% in signature vs 04.73% overall) Module "winsta.dll" = true (100.0% in signature vs 11.79% overall) Module "wininet.dll" = true
Comment 1•7 years ago
|
||
I looked at a few of these, and the crashes were all in the compositor thread, so maybe it is graphics related. Plus the correlations at the end of comment 0 are very graphics-y.
Component: IPC → Graphics: Layers
Comment 2•7 years ago
|
||
This looks like a the MessageChannel is a nullptr object.. any ideas nical?
Flags: needinfo?(nical.bugzilla)
Whiteboard: [gfx-noted]
Updated•7 years ago
|
Priority: -- → P3
Comment 3•7 years ago
|
||
Sorry, this one totally fell under the radar. I don't know this code very well, but I would be surpried that the 'this' pointer of the MessageChannel be null because it is the receiver of a runnable method in which we check that the pointer is not null before calling the function: https://hg.mozilla.org/releases/mozilla-beta/annotate/da6760885a24/xpcom/threads/nsThreadUtils.h#l1172 Also the offset of mWorkerLoop should be closer to 0x38 than 0x198 (I think). I think that the crash happens when trying to fetch or access the current message loop which is a typical shutdown scenario, so I looked at the stack of the main thread in a bunch of reports from the same stack and they are all in > [...] > mozilla::gfx::GPUProcessManager::OnProcessUnexpectedShutdown(mozilla::gfx::GPUProcessHost*) > mozilla::gfx::GPUProcessHost::OnChannelClosed() Which can't be a coincidence. So while I don't know this code or the GPU process code, my (wild) guess is that we are shutting down the GPU process while trying to open a channel/protocol pair to talk to it at the same time. I guess we might want to do something like first making sure nothing can initiate any sort of communication with the GPU process and its threads before going ahead and destroying the threads.
Flags: needinfo?(nical.bugzilla) → needinfo?(dvander)
Assignee | ||
Updated•7 years ago
|
Flags: needinfo?(aosmond)
comment #4 had some good leads - indeed, |this| is not null, and the offset is 0x48 on x64. In fact |this| is 0x150 (making the crashing address at 0x198). So, probably the MessageChannel was embedded in a null pointer. Probably what happened is we got here [1]. We returned null to the caller at [2], which does not perform a null check (and probably shouldn't, we want it to be infallible). parent->GetIPCChannel() returns &mChannel, so if parent is null, we'll get a bogus address. Does that seem likely, Andrew? [1] http://searchfox.org/mozilla-central/rev/d08b24e613cac8c9c5a4131452459241010701e0/gfx/layers/ipc/CompositorManagerParent.cpp#35 [2] http://searchfox.org/mozilla-central/rev/d08b24e613cac8c9c5a4131452459241010701e0/gfx/layers/ipc/CompositorManagerChild.cpp#42
Flags: needinfo?(dvander)
Assignee | ||
Comment 5•7 years ago
|
||
I can get on board with that analysis -- nice catch. It is puzzling how we get there though, because sInstance should only be initialized if we previously called CompositorManagerParent::CreateSameProcess which in turn is only called by CompositorManagerChild::InitSameProcess. This is also the only call path that CompositorManagerParent::sInstance is populated. This shouldn't exist because mozilla::gfx::GPUProcessManager::OnProcessUnexpectedShutdown being called implies that the GPU process died, and we are switching to a combined UI/GPU process where we *do* create it, supposedly for the first time.
Flags: needinfo?(aosmond)
Assignee | ||
Updated•7 years ago
|
Assignee: nobody → aosmond
Assignee | ||
Comment 6•7 years ago
|
||
Some I have rejigged the CompositorManagerChild/Parent setup/teardown a bit, but I'm not sure exactly what is causing the problem. 1) Stop using the GPU process ID as my identifier. Using the process token should be much more reliable (0 when it is the same process). 2) Ensure that when we call CompositorManagerChild::OnGPUProcessLost, that we only care if the process token matches. If there was a misordering of events, we could teardown a perfectly valid instance. 3) If CompositorManagerChild::InitSameProcess fails, let's not keep around the objects. Instead we only save the parent and child to their respective static instance pointers if it completes successfully. try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b27d33b79ec7688759a2146190e14c44d1b45e7e
Assignee | ||
Updated•7 years ago
|
Attachment #8911221 -
Flags: review?(dvander)
Assignee | ||
Comment 7•7 years ago
|
||
Oh and the real try is: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5f975d752dead6ad90ded95a9b37fbfd11afb08d
Comment on attachment 8911221 [details] [diff] [review] Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail., v1 Review of attachment 8911221 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/ipc/GPUProcessManager.cpp @@ +520,5 @@ > GPUProcessManager::OnProcessUnexpectedShutdown(GPUProcessHost* aHost) > { > MOZ_ASSERT(mProcess && mProcess == aHost); > > + CompositorManagerChild::OnGPUProcessLost(aHost->GetProcessToken()); I got really confused wondering why this was needed for CompositorManagerChild, but not ImageBridgeChild. It looks like we never actually forget the old ImageBridgeChild singleton in the parent process, so it is never recreated. (Probably not that serious, it would only affect video decoding in the UI after a GPU process crash.) ::: gfx/layers/ipc/CompositorManagerChild.cpp @@ +37,5 @@ > { > MOZ_ASSERT(NS_IsMainThread()); > if (NS_WARN_IF(IsInitialized(base::GetCurrentProcId()))) { > MOZ_ASSERT_UNREACHABLE("Already initialized same process"); > return false; I would prefer to see this return false go away, and the function return void. We don't use the return value and don't handle it, so it'd be better to hard assert. Or gfxDevCrash.
Attachment #8911221 -
Flags: review?(dvander) → review+
Pushed by aosmond@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/6ed0903b5242 Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail.
Comment 10•7 years ago
|
||
Pushed by aosmond@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/e3e936deef86 Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail. r=dvander
Assignee | ||
Comment 11•7 years ago
|
||
(In reply to Pulsebot from comment #9) > Pushed by aosmond@gmail.com: > https://hg.mozilla.org/integration/mozilla-inbound/rev/6ed0903b5242 > Ensure we tear down CompositorManagerChild correctly if init or the GPU > process fail. I backed this push out because although I had updated the patch due to the r+ feedback, I failed to land the patch containing those changes.
Comment 12•7 years ago
|
||
Backout by aosmond@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/78681a30c037 Backed out changeset 6ed0903b5242 because an old version of the patch was landed.
Comment 13•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/e3e936deef86
Status: NEW → RESOLVED
Closed: 7 years ago
status-firefox58:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Comment 14•7 years ago
|
||
What do we want to do wrt 57?
Assignee | ||
Comment 15•7 years ago
|
||
(In reply to Sylvestre Ledru [:sylvestre] from comment #14) > What do we want to do wrt 57? The volume is relatively low. I'd like to soak it for a few days or a week before considering uplifting to confirm it is fixed.
Assignee | ||
Comment 16•7 years ago
|
||
Comment on attachment 8911221 [details] [diff] [review] Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail., v1 Approval Request Comment [Feature/Bug causing the regression]: Bug 1389021 [User impact if declined]: May experience very low volume crash in the content process. [Is this code covered by automated tests?]: Yes. [Has the fix been verified in Nightly?]: Yes, it has been landed for a few days with no issues. The crash rate on nightly is very, very low so while I believe there is a good chance it will fix it on beta, I cannot confirm at this time. [Needs manual test from QE? If yes, steps to reproduce]: No. [List of other uplifts needed for the feature/fix]: N/A [Is the change risky?]: No. [Why is the change risky/not risky?]: It is relatively small and confined. It only tweaks how we handle certain errors, making them more resilient to introducing another error. [String changes made/needed]: N/A
Attachment #8911221 -
Flags: approval-mozilla-beta?
Comment 17•7 years ago
|
||
Comment on attachment 8911221 [details] [diff] [review] Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail., v1 Fix a crash which has a decent volume on release. Taking it. Should be in 57b5
Attachment #8911221 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment 18•7 years ago
|
||
bugherder uplift |
https://hg.mozilla.org/releases/mozilla-beta/rev/b7b894f5fad8
Comment 19•7 years ago
|
||
(In reply to Andrew Osmond [:aosmond] from comment #16) > [Is this code covered by automated tests?]: Yes. > [Has the fix been verified in Nightly?]: Yes, it has been landed for a few > days with no issues. The crash rate on nightly is very, very low so while I > believe there is a good chance it will fix it on beta, I cannot confirm at > this time. > [Needs manual test from QE? If yes, steps to reproduce]: No. Setting qe-verify- based on Andrew's assessment on manual testing needs and the fact that this fix has automated coverage.
Flags: qe-verify-
Reporter | ||
Comment 20•7 years ago
|
||
there are still crashes with this signature in 57.0b5: https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=beta&build_id=%3E%3D20171002181526&signature=mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3ACommonThreadOpenInit&date=%3E%3D2017-10-01#reports are they still related and/or should we file a new bug for this?
Flags: needinfo?(aosmond)
Assignee | ||
Comment 21•7 years ago
|
||
Still happening on nightly too it seems.
Status: RESOLVED → REOPENED
Flags: needinfo?(aosmond)
Resolution: FIXED → ---
Updated•7 years ago
|
Comment 22•6 years ago
|
||
Too late for 58, but this still seems to happen on 59.
Reporter | ||
Comment 23•6 years ago
|
||
i've just chatted with an affected user (bp-1e4e8e24-f782-4c54-9165-8c5461180130), where it turned out to be an issue with malicious software being present on the system, that could be resolved with a scan of malwarebytes.
Comment 24•6 years ago
|
||
Philipp, do you have the name of the malware? Thanks
Reporter | ||
Comment 25•6 years ago
|
||
unfortunately not (and i have no way to reach out to them again, since it was irc). will remember to ask the next time someone is coming along...!
Assignee | ||
Comment 26•6 years ago
|
||
Funny, I just found a way to make this seemingly impossible crash happen at my desk, although I had to force it down a particular race condition path with code. Confirmation bias perhaps, but it matched my long held suspicions with how we are using / updating the process token stored in GPUProcessManager. Maybe the malicious software interferes with the timing?
Assignee | ||
Comment 27•6 years ago
|
||
So here is the failing chain of events bringing us here: 1) Bring up a GPU process. Sets GPUProcessManager::mProcessToken = X. CompositorManagerChild already got torn down. 2) Enter GPUProcessManager::EnsureCompositorManagerChild(). 3) Enter GPUProcessManager::EnsureGPUReady(). 3) Enter GPUChild::EnsureGPUReady. It fails due to the sync message failing. Note that this failure does not tear down the GPU process yet. 4) Exit GPUProcessManager::EnsureGPUReady(), return false. 5) Since there is no GPU ready, we call CompositorManagerChild::InitSameProcess() with process token X. This succeeds. A probably short time later: 6) The GPU process is properly lost. GPUChild::ActorDestroy is called, and GPUProcessManager::DestroyProcess() follows. Sets GPUProcessManager::mProcessToken = 0. A probably short time later: 7) Enter GPUProcessManager::EnsureCompositorManagerChild(). GPU process is not ready, we gave up. 8) We call CompositorManagerChild::InitSameProcess() with process token 0. 9) We already initialized, and so CompositorManagerParent::CreateSameProcess() fails. 10) Crash follows as per the trace. We intend the the process token 0 be used when we abandon using the GPU process.
Assignee | ||
Comment 28•6 years ago
|
||
Worked out to be surprisingly simple. try (just the fix): https://treeherder.mozilla.org/#/jobs?repo=try&revision=51775baef0f7ca63be34c036bb06c6a69c70cb27 try (fix + forcing GPU process to fail in a way that triggers the race condition): https://treeherder.mozilla.org/#/jobs?repo=try&revision=ca2b35f0e17e5247f91327c2634972adf3175462
Assignee | ||
Comment 29•6 years ago
|
||
Comment on attachment 8946668 [details] [diff] [review] 0001-Bug-1389759-Fix-GPUProcessManager-EnsureGPUReady-how.patch, v1 There are a handful of seemingly unrelated failures in the try run with my hack to make the GPU process fail, but the regular try seems fine.
Attachment #8946668 -
Flags: review?(rhunt)
Updated•6 years ago
|
Attachment #8946668 -
Flags: review?(rhunt) → review+
Comment 30•6 years ago
|
||
Pushed by aosmond@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/ecf18629444a Fix GPUProcessManager::EnsureGPUReady how did not guarantee the GPU process state is consistent. r=rhunt
Comment 31•6 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/ecf18629444a
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
status-firefox60:
--- → fixed
Resolution: --- → FIXED
Comment 32•6 years ago
|
||
Should we uplift to beta?
Assignee | ||
Comment 33•6 years ago
|
||
Comment on attachment 8946668 [details] [diff] [review] 0001-Bug-1389759-Fix-GPUProcessManager-EnsureGPUReady-how.patch, v1 Approval Request Comment [Feature/Bug causing the regression]: Bug 1389021 [User impact if declined]: May continue to crash UI process on startup for some users. [Is this code covered by automated tests?]: The method modified is tested but no tests are known to trigger the race condition required to crash. [Has the fix been verified in Nightly?]: Nightly has a very low crash rate compared to beta and release, so it is difficult to assess, other than it does not appear to have made things worse. [Needs manual test from QE? If yes, steps to reproduce]: No. [List of other uplifts needed for the feature/fix]: None. [Is the change risky?]: No. [Why is the change risky/not risky?]: We will always crash if we hit the code path modified by the patch. It is unlikely the patch can make things worse compared to crashing the UI process. [String changes made/needed]: None.
Attachment #8946668 -
Flags: approval-mozilla-beta?
Comment 34•6 years ago
|
||
Comment on attachment 8946668 [details] [diff] [review] 0001-Bug-1389759-Fix-GPUProcessManager-EnsureGPUReady-how.patch, v1 Let's uplift this for 59 beta 7 and see if the crash volume goes down.
Attachment #8946668 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment 35•6 years ago
|
||
uplift |
https://hg.mozilla.org/releases/mozilla-beta/rev/7b20f617fd15
You need to log in
before you can comment on or make changes to this bug.
Description
•