Crash in mozilla::ipc::MessageChannel::CommonThreadOpenInit

RESOLVED FIXED in Firefox 57

Status

()

Core
Graphics: Layers
P3
critical
RESOLVED FIXED
2 months ago
14 days ago

People

(Reporter: philipp, Assigned: aosmond, NeedInfo)

Tracking

({crash, regression})

56 Branch
mozilla58
All
Windows
crash, regression
Points:
---
Bug Flags:
qe-verify -

Firefox Tracking Flags

(firefox-esr52 wontfix, firefox55 wontfix, firefox56 wontfix, firefox57 fixed, firefox58 fixed)

Details

(Whiteboard: [gfx-noted], crash signature)

Attachments

(1 attachment)

(Reporter)

Description

2 months ago
This bug was filed from the Socorro interface and is 
report bp-74ecfb98-8232-4d19-ade5-0034f0170811.
=============================================================

Crashing Thread (30), Name: Compositor
Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::ipc::MessageChannel::CommonThreadOpenInit(mozilla::ipc::MessageChannel*, mozilla::ipc::Side) 	ipc/glue/MessageChannel.cpp:840
1 	xul.dll 	mozilla::ipc::MessageChannel::OnOpenAsSlave(mozilla::ipc::MessageChannel*, mozilla::ipc::Side) 	ipc/glue/MessageChannel.cpp:826
2 	xul.dll 	mozilla::detail::RunnableMethodImpl<mozilla::ipc::MessageChannel*, void ( mozilla::ipc::MessageChannel::*)(mozilla::ipc::MessageChannel*, mozilla::ipc::Side), 0, 0, mozilla::ipc::MessageChannel*, mozilla::ipc::Side>::Run() 	xpcom/threads/nsThreadUtils.h:1172
3 	xul.dll 	MessageLoop::RunTask(already_AddRefed<nsIRunnable>) 	ipc/chromium/src/base/message_loop.cc:452
4 	xul.dll 	MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask&&) 	ipc/chromium/src/base/message_loop.cc:460
5 	xul.dll 	MessageLoop::DoWork() 	ipc/chromium/src/base/message_loop.cc:535
6 	xul.dll 	base::MessagePumpForUI::DoRunLoop() 	ipc/chromium/src/base/message_pump_win.cc:210
7 	xul.dll 	base::MessagePumpWin::Run(base::MessagePump::Delegate*) 	ipc/chromium/src/base/message_pump_win.h:80
8 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:319
9 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc:299
10 	xul.dll 	base::Thread::ThreadMain() 	ipc/chromium/src/base/thread.cc:181
11 	xul.dll 	`anonymous namespace'::ThreadFunc 	ipc/chromium/src/base/platform_thread_win.cc:28
12 	kernel32.dll 	BaseThreadInitThunk 	
13 	ntdll.dll 	RtlUserThreadStart

crash reports with this signature are getting more frequent since firefox 56.

Correlations for Firefox Nightly (not usre how valid, they are from a handful of affected installations last week):
(100.0% in signature vs 00.44% overall) "D3D11 Layers-" in app_notes = true
(100.0% in signature vs 00.51% overall) "D3D11 Layers?" in app_notes = true
(100.0% in signature vs 00.61% overall) GFX_ERROR "[D3D11] failed to get compositor device." = true
(100.0% in signature vs 00.62% overall) GFX_ERROR "[D3D11] Failed to init compositor with reason: " = true
(100.0% in signature vs 00.96% overall) reason = EXCEPTION_ACCESS_VIOLATION_WRITE
(100.0% in signature vs 01.01% overall) GFX_ERROR "[D3D11] " = true
(100.0% in signature vs 04.73% overall) Module "winsta.dll" = true
(100.0% in signature vs 11.79% overall) Module "wininet.dll" = true
I looked at a few of these, and the crashes were all in the compositor thread, so maybe it is graphics related. Plus the correlations at the end of comment 0 are very graphics-y.
Component: IPC → Graphics: Layers
This looks like a the MessageChannel is a nullptr object.. any ideas nical?
Flags: needinfo?(nical.bugzilla)
Whiteboard: [gfx-noted]
Priority: -- → P3
Sorry, this one totally fell under the radar.

I don't know this code very well, but I would be surpried that the 'this' pointer of the MessageChannel be null because it is the receiver of a runnable method in which we check that the pointer is not null before calling the function: https://hg.mozilla.org/releases/mozilla-beta/annotate/da6760885a24/xpcom/threads/nsThreadUtils.h#l1172
Also the offset of mWorkerLoop should be closer to 0x38 than 0x198 (I think).

I think that the crash happens when trying to fetch or access the current message loop which is a typical shutdown scenario, so I looked at the stack of the main thread in a bunch of reports from the same stack and they are all in

> [...]
> mozilla::gfx::GPUProcessManager::OnProcessUnexpectedShutdown(mozilla::gfx::GPUProcessHost*)
> mozilla::gfx::GPUProcessHost::OnChannelClosed()

Which can't be a coincidence.

So while I don't know this code or the GPU process code, my (wild) guess is that we are shutting down the GPU process while trying to open a channel/protocol pair to talk to it at the same time. I guess we might want to do something like first making sure nothing can initiate any sort of communication with the GPU process and its threads before going ahead and destroying the threads.
Flags: needinfo?(nical.bugzilla) → needinfo?(dvander)
Flags: needinfo?(aosmond)
comment #4 had some good leads - indeed, |this| is not null, and the offset is 0x48 on x64. In fact |this| is 0x150 (making the crashing address at 0x198). So, probably the MessageChannel was embedded in a null pointer.

Probably what happened is we got here [1]. We returned null to the caller at [2], which does not perform a null check (and probably shouldn't, we want it to be infallible). parent->GetIPCChannel() returns &mChannel, so if parent is null, we'll get a bogus address.

Does that seem likely, Andrew?

[1] http://searchfox.org/mozilla-central/rev/d08b24e613cac8c9c5a4131452459241010701e0/gfx/layers/ipc/CompositorManagerParent.cpp#35
[2] http://searchfox.org/mozilla-central/rev/d08b24e613cac8c9c5a4131452459241010701e0/gfx/layers/ipc/CompositorManagerChild.cpp#42
Flags: needinfo?(dvander)
I can get on board with that analysis -- nice catch. It is puzzling how we get there though, because sInstance should only be initialized if we previously called CompositorManagerParent::CreateSameProcess which in turn is only called by CompositorManagerChild::InitSameProcess. This is also the only call path that CompositorManagerParent::sInstance is populated. This shouldn't exist because mozilla::gfx::GPUProcessManager::OnProcessUnexpectedShutdown being called implies that the GPU process died, and we are switching to a combined UI/GPU process where we *do* create it, supposedly for the first time.
Flags: needinfo?(aosmond)
Assignee: nobody → aosmond
Created attachment 8911221 [details] [diff] [review]
Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail., v1

Some I have rejigged the CompositorManagerChild/Parent setup/teardown a bit, but I'm not sure exactly what is causing the problem.

1) Stop using the GPU process ID as my identifier. Using the process token should be much more reliable (0 when it is the same process).

2) Ensure that when we call CompositorManagerChild::OnGPUProcessLost, that we only care if the process token matches. If there was a misordering of events, we could teardown a perfectly valid instance.

3) If CompositorManagerChild::InitSameProcess fails, let's not keep around the objects. Instead we only save the parent and child to their respective static instance pointers if it completes successfully.

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b27d33b79ec7688759a2146190e14c44d1b45e7e
Attachment #8911221 - Flags: review?(dvander)
Oh and the real try is: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5f975d752dead6ad90ded95a9b37fbfd11afb08d
Comment on attachment 8911221 [details] [diff] [review]
Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail., v1

Review of attachment 8911221 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/ipc/GPUProcessManager.cpp
@@ +520,5 @@
>  GPUProcessManager::OnProcessUnexpectedShutdown(GPUProcessHost* aHost)
>  {
>    MOZ_ASSERT(mProcess && mProcess == aHost);
>  
> +  CompositorManagerChild::OnGPUProcessLost(aHost->GetProcessToken());

I got really confused wondering why this was needed for CompositorManagerChild, but not ImageBridgeChild. It looks like we never actually forget the old ImageBridgeChild singleton in the parent process, so it is never recreated. (Probably not that serious, it would only affect video decoding in the UI after a GPU process crash.)

::: gfx/layers/ipc/CompositorManagerChild.cpp
@@ +37,5 @@
>  {
>    MOZ_ASSERT(NS_IsMainThread());
>    if (NS_WARN_IF(IsInitialized(base::GetCurrentProcId()))) {
>      MOZ_ASSERT_UNREACHABLE("Already initialized same process");
>      return false;

I would prefer to see this return false go away, and the function return void. We don't use the return value and don't handle it, so it'd be better to hard assert. Or gfxDevCrash.
Attachment #8911221 - Flags: review?(dvander) → review+

Comment 9

22 days ago
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6ed0903b5242
Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail.

Comment 10

22 days ago
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e3e936deef86
Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail. r=dvander
(In reply to Pulsebot from comment #9)
> Pushed by aosmond@gmail.com:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/6ed0903b5242
> Ensure we tear down CompositorManagerChild correctly if init or the GPU
> process fail.

I backed this push out because although I had updated the patch due to the r+ feedback, I failed to land the patch containing those changes.

Comment 12

22 days ago
Backout by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/78681a30c037
Backed out changeset 6ed0903b5242 because an old version of the patch was landed.

Comment 13

22 days ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/e3e936deef86
Status: NEW → RESOLVED
Last Resolved: 22 days ago
status-firefox58: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
What do we want to do wrt 57?
status-firefox56: affected → wontfix
(In reply to Sylvestre Ledru [:sylvestre] from comment #14)
> What do we want to do wrt 57?

The volume is relatively low. I'd like to soak it for a few days or a week before considering uplifting to confirm it is fixed.
Comment on attachment 8911221 [details] [diff] [review]
Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail., v1

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1389021
[User impact if declined]: May experience very low volume crash in the content process.
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]: Yes, it has been landed for a few days with no issues. The crash rate on nightly is very, very low so while I believe there is a good chance it will fix it on beta, I cannot confirm at this time.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: N/A
[Is the change risky?]: No.
[Why is the change risky/not risky?]: It is relatively small and confined. It only tweaks how we handle certain errors, making them more resilient to introducing another error.
[String changes made/needed]: N/A
Attachment #8911221 - Flags: approval-mozilla-beta?
Comment on attachment 8911221 [details] [diff] [review]
Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail., v1

Fix a crash which has a decent volume on release. Taking it.
Should be in 57b5
Attachment #8911221 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment 18

17 days ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/b7b894f5fad8
status-firefox57: affected → fixed
(In reply to Andrew Osmond [:aosmond] from comment #16)
> [Is this code covered by automated tests?]: Yes.
> [Has the fix been verified in Nightly?]: Yes, it has been landed for a few
> days with no issues. The crash rate on nightly is very, very low so while I
> believe there is a good chance it will fix it on beta, I cannot confirm at
> this time.
> [Needs manual test from QE? If yes, steps to reproduce]: No.

Setting qe-verify- based on Andrew's assessment on manual testing needs and the fact that this fix has automated coverage.
Flags: qe-verify-
(Reporter)

Comment 20

14 days ago
there are still crashes with this signature in 57.0b5:
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=beta&build_id=%3E%3D20171002181526&signature=mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3ACommonThreadOpenInit&date=%3E%3D2017-10-01#reports

are they still related and/or should we file a new bug for this?
Flags: needinfo?(aosmond)
You need to log in before you can comment on or make changes to this bug.