Closed Bug 1389759 Opened 7 years ago Closed 6 years ago

Crash in mozilla::ipc::MessageChannel::CommonThreadOpenInit

Categories

(Core :: Graphics: Layers, defect, P3)

56 Branch
All
Windows
defect

Tracking


RESOLVED FIXED
mozilla58
Tracking Status
firefox-esr52 --- wontfix
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- fixed
firefox60 --- fixed

People

(Reporter: philipp, Assigned: aosmond)

Details

(Keywords: crash, regression, Whiteboard: [gfx-noted])

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is report bp-74ecfb98-8232-4d19-ade5-0034f0170811.
=============================================================

Crashing Thread (30), Name: Compositor
Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::ipc::MessageChannel::CommonThreadOpenInit(mozilla::ipc::MessageChannel*, mozilla::ipc::Side) 	ipc/glue/MessageChannel.cpp:840
1 	xul.dll 	mozilla::ipc::MessageChannel::OnOpenAsSlave(mozilla::ipc::MessageChannel*, mozilla::ipc::Side) 	ipc/glue/MessageChannel.cpp:826
2 	xul.dll 	mozilla::detail::RunnableMethodImpl<mozilla::ipc::MessageChannel*, void ( mozilla::ipc::MessageChannel::*)(mozilla::ipc::MessageChannel*, mozilla::ipc::Side), 0, 0, mozilla::ipc::MessageChannel*, mozilla::ipc::Side>::Run() 	xpcom/threads/nsThreadUtils.h:1172
3 	xul.dll 	MessageLoop::RunTask(already_AddRefed<nsIRunnable>) 	ipc/chromium/src/base/message_loop.cc:452
4 	xul.dll 	MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask&&) 	ipc/chromium/src/base/message_loop.cc:460
5 	xul.dll 	MessageLoop::DoWork() 	ipc/chromium/src/base/message_loop.cc:535
6 	xul.dll 	base::MessagePumpForUI::DoRunLoop() 	ipc/chromium/src/base/message_pump_win.cc:210
7 	xul.dll 	base::MessagePumpWin::Run(base::MessagePump::Delegate*) 	ipc/chromium/src/base/message_pump_win.h:80
8 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:319
9 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc:299
10 	xul.dll 	base::Thread::ThreadMain() 	ipc/chromium/src/base/thread.cc:181
11 	xul.dll 	`anonymous namespace'::ThreadFunc 	ipc/chromium/src/base/platform_thread_win.cc:28
12 	kernel32.dll 	BaseThreadInitThunk 	
13 	ntdll.dll 	RtlUserThreadStart

Crash reports with this signature are getting more frequent since Firefox 56.

Correlations for Firefox Nightly (not sure how valid; they are from a handful of affected installations last week):
(100.0% in signature vs 00.44% overall) "D3D11 Layers-" in app_notes = true
(100.0% in signature vs 00.51% overall) "D3D11 Layers?" in app_notes = true
(100.0% in signature vs 00.61% overall) GFX_ERROR "[D3D11] failed to get compositor device." = true
(100.0% in signature vs 00.62% overall) GFX_ERROR "[D3D11] Failed to init compositor with reason: " = true
(100.0% in signature vs 00.96% overall) reason = EXCEPTION_ACCESS_VIOLATION_WRITE
(100.0% in signature vs 01.01% overall) GFX_ERROR "[D3D11] " = true
(100.0% in signature vs 04.73% overall) Module "winsta.dll" = true
(100.0% in signature vs 11.79% overall) Module "wininet.dll" = true
I looked at a few of these, and the crashes were all in the compositor thread, so maybe it is graphics related. Plus the correlations at the end of comment 0 are very graphics-y.
Component: IPC → Graphics: Layers
This looks like the MessageChannel is a null object... any ideas, nical?
Flags: needinfo?(nical.bugzilla)
Whiteboard: [gfx-noted]
Sorry, this one totally fell under the radar.

I don't know this code very well, but I would be surprised if the 'this' pointer of the MessageChannel were null, because it is the receiver of a runnable method in which we check that the pointer is not null before calling the function: https://hg.mozilla.org/releases/mozilla-beta/annotate/da6760885a24/xpcom/threads/nsThreadUtils.h#l1172
Also the offset of mWorkerLoop should be closer to 0x38 than 0x198 (I think).

I think that the crash happens when trying to fetch or access the current message loop which is a typical shutdown scenario, so I looked at the stack of the main thread in a bunch of reports from the same stack and they are all in

> [...]
> mozilla::gfx::GPUProcessManager::OnProcessUnexpectedShutdown(mozilla::gfx::GPUProcessHost*)
> mozilla::gfx::GPUProcessHost::OnChannelClosed()

Which can't be a coincidence.

So while I don't know this code or the GPU process code, my (wild) guess is that we are shutting down the GPU process while trying to open a channel/protocol pair to talk to it at the same time. I guess we might want to do something like first making sure nothing can initiate any sort of communication with the GPU process and its threads before going ahead and destroying the threads.
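
To illustrate the kind of guard I mean, here is a tiny standalone sketch with made-up names (this is not the actual Gecko GPU-process code): a flag that is flipped before the compositor/GPU threads are destroyed and checked before any new channel open is dispatched.

#include <atomic>
#include <cstdio>

// Hypothetical gate, invented for illustration: once shutdown has begun,
// no new "open channel" work may be dispatched to the compositor thread.
class GpuChannelGate {
  std::atomic<bool> mShuttingDown{false};

public:
  // Called before the GPU process threads are torn down.
  void BeginShutdown() { mShuttingDown.store(true, std::memory_order_release); }

  // Callers check this before posting an "open channel" runnable.
  bool MayOpenChannel() const {
    return !mShuttingDown.load(std::memory_order_acquire);
  }
};

int main() {
  GpuChannelGate gate;
  if (gate.MayOpenChannel()) {
    std::puts("dispatching channel-open runnable");
  }
  gate.BeginShutdown();
  if (!gate.MayOpenChannel()) {
    std::puts("shutdown in progress, channel open suppressed");
  }
}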
Flags: needinfo?(nical.bugzilla) → needinfo?(dvander)
Flags: needinfo?(aosmond)
Comment #4 had some good leads - indeed, |this| is not null, and the offset is 0x48 on x64. In fact |this| is 0x150 (making the crashing address 0x198). So the MessageChannel was probably embedded in an object whose pointer was null.

Probably what happened is we got here [1]. We returned null to the caller at [2], which does not perform a null check (and probably shouldn't; we want it to be infallible). parent->GetIPCChannel() returns &mChannel, so if parent is null, we'll get a bogus address.

Does that seem likely, Andrew?

[1] http://searchfox.org/mozilla-central/rev/d08b24e613cac8c9c5a4131452459241010701e0/gfx/layers/ipc/CompositorManagerParent.cpp#35
[2] http://searchfox.org/mozilla-central/rev/d08b24e613cac8c9c5a4131452459241010701e0/gfx/layers/ipc/CompositorManagerChild.cpp#42
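
For reference, a tiny standalone illustration of that pointer arithmetic (FakeParent and FakeChannel are made-up stand-ins, not the real Gecko classes): if |parent| is null, &parent->mChannel is just the member offset added to a null pointer, and the later write to mWorkerLoop lands at a small address like 0x198.

#include <cstddef>
#include <cstdio>

struct FakeChannel {
  char mOtherChannelState[0x48];  // illustrative padding; mWorkerLoop ends up at 0x48
  void* mWorkerLoop;
};

struct FakeParent {
  char mOtherState[0x150];        // illustrative padding; mChannel ends up at 0x150
  FakeChannel mChannel;
  FakeChannel* GetIPCChannel() { return &mChannel; }  // never checks |this|
};

int main() {
  // Calling GetIPCChannel() on a null FakeParent* is undefined behaviour, but in
  // practice the compiler just adds the member offset to the null pointer. The
  // offsets below reproduce the addresses seen in the crash report.
  std::printf("bogus channel address: 0x%zx\n", offsetof(FakeParent, mChannel));
  std::printf("crashing write address: 0x%zx\n",
              offsetof(FakeParent, mChannel) + offsetof(FakeChannel, mWorkerLoop));
  return 0;
}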
Flags: needinfo?(dvander)
I can get on board with that analysis -- nice catch. It is puzzling how we get there though, because sInstance should only be initialized if we previously called CompositorManagerParent::CreateSameProcess, which in turn is only called by CompositorManagerChild::InitSameProcess. This is also the only call path in which CompositorManagerParent::sInstance is populated. It shouldn't exist yet, because mozilla::gfx::GPUProcessManager::OnProcessUnexpectedShutdown being called implies that the GPU process died and we are switching to a combined UI/GPU process, where we *do* create it, supposedly for the first time.
Flags: needinfo?(aosmond)
Assignee: nobody → aosmond
So I have rejigged the CompositorManagerChild/Parent setup/teardown a bit, but I'm not sure exactly what is causing the problem.

1) Stop using the GPU process ID as my identifier. Using the process token should be much more reliable (0 when it is the same process).

2) Ensure that when we call CompositorManagerChild::OnGPUProcessLost, we only tear down if the process token matches. If there were a misordering of events, we could otherwise tear down a perfectly valid instance.

3) If CompositorManagerChild::InitSameProcess fails, don't keep the objects around. Instead, only save the parent and child to their respective static instance pointers if it completes successfully (both (2) and (3) are sketched below).

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b27d33b79ec7688759a2146190e14c44d1b45e7e
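
Roughly, (2) and (3) amount to the following standalone sketch (made-up names and simplified types; the real CompositorManagerChild/Parent code differs):

#include <cstdint>

class SketchCompositorManagerChild {
  static SketchCompositorManagerChild* sInstance;
  uint64_t mProcessToken;

  explicit SketchCompositorManagerChild(uint64_t aProcessToken)
    : mProcessToken(aProcessToken) {}

  bool OpenSameProcess() { return true; }  // stand-in for the real channel setup

public:
  // (3) Only publish the singleton if initialization fully succeeded, so a
  //     failed init leaves no half-constructed instance behind.
  static bool InitSameProcess(uint64_t aProcessToken) {
    auto* child = new SketchCompositorManagerChild(aProcessToken);
    if (!child->OpenSameProcess()) {
      delete child;
      return false;
    }
    sInstance = child;
    return true;
  }

  // (2) Only tear down if the token of the lost GPU process matches ours;
  //     a misordered notification for an older process is ignored.
  static void OnGPUProcessLost(uint64_t aProcessToken) {
    if (sInstance && sInstance->mProcessToken == aProcessToken) {
      delete sInstance;
      sInstance = nullptr;
    }
  }
};

SketchCompositorManagerChild* SketchCompositorManagerChild::sInstance = nullptr;

int main() {
  SketchCompositorManagerChild::InitSameProcess(/* aProcessToken */ 0);
  SketchCompositorManagerChild::OnGPUProcessLost(42);  // stale token: ignored
  SketchCompositorManagerChild::OnGPUProcessLost(0);   // matching token: torn down
}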
Attachment #8911221 - Flags: review?(dvander)
Comment on attachment 8911221 [details] [diff] [review]
Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail., v1

Review of attachment 8911221 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/ipc/GPUProcessManager.cpp
@@ +520,5 @@
>  GPUProcessManager::OnProcessUnexpectedShutdown(GPUProcessHost* aHost)
>  {
>    MOZ_ASSERT(mProcess && mProcess == aHost);
>  
> +  CompositorManagerChild::OnGPUProcessLost(aHost->GetProcessToken());

I got really confused wondering why this was needed for CompositorManagerChild, but not ImageBridgeChild. It looks like we never actually forget the old ImageBridgeChild singleton in the parent process, so it is never recreated. (Probably not that serious; it would only affect video decoding in the UI after a GPU process crash.)

::: gfx/layers/ipc/CompositorManagerChild.cpp
@@ +37,5 @@
>  {
>    MOZ_ASSERT(NS_IsMainThread());
>    if (NS_WARN_IF(IsInitialized(base::GetCurrentProcId()))) {
>      MOZ_ASSERT_UNREACHABLE("Already initialized same process");
>      return false;

I would prefer to see this return false go away, and the function return void. We don't use the return value and don't handle it, so it'd be better to hard assert. Or gfxDevCrash.
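
Something along these lines, as a standalone sketch (a hypothetical free function stands in for CompositorManagerChild::InitSameProcess; in Gecko this would be MOZ_RELEASE_ASSERT or gfxDevCrash rather than a plain assert):

#include <cassert>

static bool sInitializedSameProcess = false;

// Returns void; callers ignore the result anyway, so fail loudly instead.
void InitSameProcessSketch() {
  assert(!sInitializedSameProcess && "Already initialized same process");
  sInitializedSameProcess = true;
  // ... proceed with initialization ...
}

int main() {
  InitSameProcessSketch();
  // A second call would now abort in debug builds instead of silently
  // returning false:
  // InitSameProcessSketch();
}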
Attachment #8911221 - Flags: review?(dvander) → review+
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6ed0903b5242
Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail.
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e3e936deef86
Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail. r=dvander
(In reply to Pulsebot from comment #9)
> Pushed by aosmond@gmail.com:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/6ed0903b5242
> Ensure we tear down CompositorManagerChild correctly if init or the GPU
> process fail.

I backed this push out because although I had updated the patch due to the r+ feedback, I failed to land the patch containing those changes.
Backout by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/78681a30c037
Backed out changeset 6ed0903b5242 because an old version of the patch was landed.
https://hg.mozilla.org/mozilla-central/rev/e3e936deef86
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
What do we want to do wrt 57?
(In reply to Sylvestre Ledru [:sylvestre] from comment #14)
> What do we want to do wrt 57?

The volume is relatively low. I'd like to soak it for a few days or a week before considering uplifting to confirm it is fixed.
Comment on attachment 8911221 [details] [diff] [review]
Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail., v1

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1389021
[User impact if declined]: May experience very low volume crash in the content process.
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]: Yes, it has been landed for a few days with no issues. The crash rate on nightly is very, very low so while I believe there is a good chance it will fix it on beta, I cannot confirm at this time.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: N/A
[Is the change risky?]: No.
[Why is the change risky/not risky?]: It is relatively small and confined. It only tweaks how we handle certain errors, making them more resilient to introducing another error.
[String changes made/needed]: N/A
Attachment #8911221 - Flags: approval-mozilla-beta?
Comment on attachment 8911221 [details] [diff] [review]
Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail., v1

Fix a crash which has a decent volume on release. Taking it.
Should be in 57b5
Attachment #8911221 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Andrew Osmond [:aosmond] from comment #16)
> [Is this code covered by automated tests?]: Yes.
> [Has the fix been verified in Nightly?]: Yes, it has been landed for a few
> days with no issues. The crash rate on nightly is very, very low so while I
> believe there is a good chance it will fix it on beta, I cannot confirm at
> this time.
> [Needs manual test from QE? If yes, steps to reproduce]: No.

Setting qe-verify- based on Andrew's assessment on manual testing needs and the fact that this fix has automated coverage.
Flags: qe-verify-
There are still crashes with this signature in 57.0b5:
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=beta&build_id=%3E%3D20171002181526&signature=mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3ACommonThreadOpenInit&date=%3E%3D2017-10-01#reports

Are they still related, and/or should we file a new bug for this?
Flags: needinfo?(aosmond)
Still happening on nightly too it seems.
Status: RESOLVED → REOPENED
Flags: needinfo?(aosmond)
Resolution: FIXED → ---
Too late for 58, but this still seems to happen on 59.
I've just chatted with an affected user (bp-1e4e8e24-f782-4c54-9165-8c5461180130); it turned out to be an issue with malicious software present on the system, which could be resolved with a Malwarebytes scan.
Philipp, do you have the name of the malware?
Thanks
Unfortunately not (and I have no way to reach out to them again, since it was IRC). I will remember to ask the next time someone comes along!
Funny, I just found a way to make this seemingly impossible crash happen at my desk, although I had to force it down a particular race condition path with code. Confirmation bias perhaps, but it matched my long held suspicions with how we are using / updating the process token stored in GPUProcessManager. Maybe the malicious software interferes with the timing?
So here is the failing chain of events bringing us here:

1) Bring up a GPU process. Sets GPUProcessManager::mProcessToken = X. CompositorManagerChild already got torn down.
2) Enter GPUProcessManager::EnsureCompositorManagerChild().
3) Enter GPUProcessManager::EnsureGPUReady().
4) Enter GPUChild::EnsureGPUReady(). It fails because the sync message fails. Note that this failure does not tear down the GPU process yet.
5) Exit GPUProcessManager::EnsureGPUReady(), returning false.
6) Since no GPU process is ready, we call CompositorManagerChild::InitSameProcess() with process token X. This succeeds.

Probably a short time later:

7) The GPU process is properly lost. GPUChild::ActorDestroy is called, and GPUProcessManager::DestroyProcess() follows. This sets GPUProcessManager::mProcessToken = 0.

Probably a short time later:

8) Enter GPUProcessManager::EnsureCompositorManagerChild(). The GPU process is not ready, so we give up on it.
9) We call CompositorManagerChild::InitSameProcess() with process token 0.
10) We are already initialized, so CompositorManagerParent::CreateSameProcess() fails.
11) The crash follows as per the trace.

We intend process token 0 to be used only when we have abandoned using the GPU process.
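
A minimal standalone sketch of the direction of the fix (made-up names; the real GPUProcessManager logic is more involved and the landed patch may differ in detail): tear the process down as soon as the readiness check fails, so the stored token can never describe a GPU process that is effectively dead, and step 6 above can never run with a stale token X.

#include <cstdint>
#include <cstdio>

class SketchGPUProcessManager {
  uint64_t mProcessToken = 0;  // 0 means "no GPU process; use in-process compositor"
  bool mProcessAlive = false;

  bool SendEnsureGPUReady() { return false; }  // pretend the sync message failed

  void DestroyProcess() {
    mProcessAlive = false;
    mProcessToken = 0;
  }

public:
  void LaunchProcess(uint64_t aToken) {
    mProcessAlive = true;
    mProcessToken = aToken;
  }

  // Returns true only when the GPU process is usable. On failure it also
  // resets the token, keeping the token and the process state consistent for
  // callers such as EnsureCompositorManagerChild().
  bool EnsureGPUReady() {
    if (!mProcessAlive) {
      return false;
    }
    if (!SendEnsureGPUReady()) {
      DestroyProcess();
      return false;
    }
    return true;
  }

  uint64_t ProcessToken() const { return mProcessToken; }
};

int main() {
  SketchGPUProcessManager manager;
  manager.LaunchProcess(/* aToken */ 0x1234);
  if (!manager.EnsureGPUReady()) {
    // With this sketch, the token is already 0 here, so the fallback
    // in-process compositor manager is initialized with a consistent token.
    std::printf("falling back with token 0x%llx\n",
                static_cast<unsigned long long>(manager.ProcessToken()));
  }
}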
Comment on attachment 8946668 [details] [diff] [review]
0001-Bug-1389759-Fix-GPUProcessManager-EnsureGPUReady-how.patch, v1

There are a handful of seemingly unrelated failures in the try run with my hack to make the GPU process fail, but the regular try seems fine.
Attachment #8946668 - Flags: review?(rhunt)
Attachment #8946668 - Flags: review?(rhunt) → review+
Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ecf18629444a
Fix GPUProcessManager::EnsureGPUReady, which did not guarantee the GPU process state is consistent. r=rhunt
https://hg.mozilla.org/mozilla-central/rev/ecf18629444a
Status: REOPENED → RESOLVED
Closed: 7 years ago6 years ago
Resolution: --- → FIXED
Should we uplift to beta?
Comment on attachment 8946668 [details] [diff] [review]
0001-Bug-1389759-Fix-GPUProcessManager-EnsureGPUReady-how.patch, v1

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1389021
[User impact if declined]: May continue to crash UI process on startup for some users.
[Is this code covered by automated tests?]: The method modified is tested but no tests are known to trigger the race condition required to crash.
[Has the fix been verified in Nightly?]: Nightly has a very low crash rate compared to beta and release, so it is difficult to assess, other than it does not appear to have made things worse.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No.
[Why is the change risky/not risky?]: We will always crash if we hit the code path modified by the patch. It is unlikely the patch can make things worse compared to crashing the UI process.
[String changes made/needed]: None.
Attachment #8946668 - Flags: approval-mozilla-beta?
Comment on attachment 8946668 [details] [diff] [review]
0001-Bug-1389759-Fix-GPUProcessManager-EnsureGPUReady-how.patch, v1

Let's uplift this for 59 beta 7 and see if the crash volume goes down.
Attachment #8946668 - Flags: approval-mozilla-beta? → approval-mozilla-beta+