1389759 - Crash in mozilla::ipc::MessageChannel::CommonThreadOpenInit

Reporter

Description

•

7 years ago

This bug was filed from the Socorro interface and is 
report bp-74ecfb98-8232-4d19-ade5-0034f0170811.
=============================================================

Crashing Thread (30), Name: Compositor
Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::ipc::MessageChannel::CommonThreadOpenInit(mozilla::ipc::MessageChannel*, mozilla::ipc::Side) 	ipc/glue/MessageChannel.cpp:840
1 	xul.dll 	mozilla::ipc::MessageChannel::OnOpenAsSlave(mozilla::ipc::MessageChannel*, mozilla::ipc::Side) 	ipc/glue/MessageChannel.cpp:826
2 	xul.dll 	mozilla::detail::RunnableMethodImpl<mozilla::ipc::MessageChannel*, void ( mozilla::ipc::MessageChannel::*)(mozilla::ipc::MessageChannel*, mozilla::ipc::Side), 0, 0, mozilla::ipc::MessageChannel*, mozilla::ipc::Side>::Run() 	xpcom/threads/nsThreadUtils.h:1172
3 	xul.dll 	MessageLoop::RunTask(already_AddRefed<nsIRunnable>) 	ipc/chromium/src/base/message_loop.cc:452
4 	xul.dll 	MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask&&) 	ipc/chromium/src/base/message_loop.cc:460
5 	xul.dll 	MessageLoop::DoWork() 	ipc/chromium/src/base/message_loop.cc:535
6 	xul.dll 	base::MessagePumpForUI::DoRunLoop() 	ipc/chromium/src/base/message_pump_win.cc:210
7 	xul.dll 	base::MessagePumpWin::Run(base::MessagePump::Delegate*) 	ipc/chromium/src/base/message_pump_win.h:80
8 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:319
9 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc:299
10 	xul.dll 	base::Thread::ThreadMain() 	ipc/chromium/src/base/thread.cc:181
11 	xul.dll 	`anonymous namespace'::ThreadFunc 	ipc/chromium/src/base/platform_thread_win.cc:28
12 	kernel32.dll 	BaseThreadInitThunk 	
13 	ntdll.dll 	RtlUserThreadStart

crash reports with this signature are getting more frequent since firefox 56.

Correlations for Firefox Nightly (not usre how valid, they are from a handful of affected installations last week):
(100.0% in signature vs 00.44% overall) "D3D11 Layers-" in app_notes = true
(100.0% in signature vs 00.51% overall) "D3D11 Layers?" in app_notes = true
(100.0% in signature vs 00.61% overall) GFX_ERROR "[D3D11] failed to get compositor device." = true
(100.0% in signature vs 00.62% overall) GFX_ERROR "[D3D11] Failed to init compositor with reason: " = true
(100.0% in signature vs 00.96% overall) reason = EXCEPTION_ACCESS_VIOLATION_WRITE
(100.0% in signature vs 01.01% overall) GFX_ERROR "[D3D11] " = true
(100.0% in signature vs 04.73% overall) Module "winsta.dll" = true
(100.0% in signature vs 11.79% overall) Module "wininet.dll" = true

Andrew McCreight [:mccr8]

Comment 1

•

7 years ago

I looked at a few of these, and the crashes were all in the compositor thread, so maybe it is graphics related. Plus the correlations at the end of comment 0 are very graphics-y.

Component: IPC → Graphics: Layers

Bas Schouten (:bas.schouten)

Comment 2

•

7 years ago

This looks like a the MessageChannel is a nullptr object.. any ideas nical?

Flags: needinfo?(nical.bugzilla)

Whiteboard: [gfx-noted]

Milan Sreckovic [:milan] (needinfo for best results)

Updated

•

7 years ago

Priority: -- → P3

Nicolas Silva [:nical]

Comment 3

•

7 years ago

Sorry, this one totally fell under the radar.

I don't know this code very well, but I would be surpried that the 'this' pointer of the MessageChannel be null because it is the receiver of a runnable method in which we check that the pointer is not null before calling the function: https://hg.mozilla.org/releases/mozilla-beta/annotate/da6760885a24/xpcom/threads/nsThreadUtils.h#l1172
Also the offset of mWorkerLoop should be closer to 0x38 than 0x198 (I think).

I think that the crash happens when trying to fetch or access the current message loop which is a typical shutdown scenario, so I looked at the stack of the main thread in a bunch of reports from the same stack and they are all in

> [...]
> mozilla::gfx::GPUProcessManager::OnProcessUnexpectedShutdown(mozilla::gfx::GPUProcessHost*)
> mozilla::gfx::GPUProcessHost::OnChannelClosed()

Which can't be a coincidence.

So while I don't know this code or the GPU process code, my (wild) guess is that we are shutting down the GPU process while trying to open a channel/protocol pair to talk to it at the same time. I guess we might want to do something like first making sure nothing can initiate any sort of communication with the GPU process and its threads before going ahead and destroying the threads.

Flags: needinfo?(nical.bugzilla) → needinfo?(dvander)

Andrew Osmond [:aosmond] (he/him)

Assignee

Updated

•

7 years ago

Flags: needinfo?(aosmond)

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 4

•

7 years ago

comment #4 had some good leads - indeed, |this| is not null, and the offset is 0x48 on x64. In fact |this| is 0x150 (making the crashing address at 0x198). So, probably the MessageChannel was embedded in a null pointer.

Probably what happened is we got here [1]. We returned null to the caller at [2], which does not perform a null check (and probably shouldn't, we want it to be infallible). parent->GetIPCChannel() returns &mChannel, so if parent is null, we'll get a bogus address.

Does that seem likely, Andrew?

[1] http://searchfox.org/mozilla-central/rev/d08b24e613cac8c9c5a4131452459241010701e0/gfx/layers/ipc/CompositorManagerParent.cpp#35
[2] http://searchfox.org/mozilla-central/rev/d08b24e613cac8c9c5a4131452459241010701e0/gfx/layers/ipc/CompositorManagerChild.cpp#42

Flags: needinfo?(dvander)

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 5

•

7 years ago

I can get on board with that analysis -- nice catch. It is puzzling how we get there though, because sInstance should only be initialized if we previously called CompositorManagerParent::CreateSameProcess which in turn is only called by CompositorManagerChild::InitSameProcess. This is also the only call path that CompositorManagerParent::sInstance is populated. This shouldn't exist because mozilla::gfx::GPUProcessManager::OnProcessUnexpectedShutdown being called implies that the GPU process died, and we are switching to a combined UI/GPU process where we *do* create it, supposedly for the first time.

Flags: needinfo?(aosmond)

Andrew Osmond [:aosmond] (he/him)

Assignee

Updated

•

7 years ago

Assignee: nobody → aosmond

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 6

•

7 years ago

Attached patch Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail., v1 — Details — Splinter Review

Some I have rejigged the CompositorManagerChild/Parent setup/teardown a bit, but I'm not sure exactly what is causing the problem.

1) Stop using the GPU process ID as my identifier. Using the process token should be much more reliable (0 when it is the same process).

2) Ensure that when we call CompositorManagerChild::OnGPUProcessLost, that we only care if the process token matches. If there was a misordering of events, we could teardown a perfectly valid instance.

3) If CompositorManagerChild::InitSameProcess fails, let's not keep around the objects. Instead we only save the parent and child to their respective static instance pointers if it completes successfully.

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b27d33b79ec7688759a2146190e14c44d1b45e7e

Andrew Osmond [:aosmond] (he/him)

Assignee

Updated

•

7 years ago

Attachment #8911221 - Flags: review?(dvander)

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 7

•

7 years ago

Oh and the real try is: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5f975d752dead6ad90ded95a9b37fbfd11afb08d

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 8

•

7 years ago

Comment on attachment 8911221 [details] [diff] [review]
Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail., v1

Review of attachment 8911221 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/ipc/GPUProcessManager.cpp
@@ +520,5 @@
>  GPUProcessManager::OnProcessUnexpectedShutdown(GPUProcessHost* aHost)
>  {
>    MOZ_ASSERT(mProcess && mProcess == aHost);
>  
> +  CompositorManagerChild::OnGPUProcessLost(aHost->GetProcessToken());

I got really confused wondering why this was needed for CompositorManagerChild, but not ImageBridgeChild. It looks like we never actually forget the old ImageBridgeChild singleton in the parent process, so it is never recreated. (Probably not that serious, it would only affect video decoding in the UI after a GPU process crash.)

::: gfx/layers/ipc/CompositorManagerChild.cpp
@@ +37,5 @@
>  {
>    MOZ_ASSERT(NS_IsMainThread());
>    if (NS_WARN_IF(IsInitialized(base::GetCurrentProcId()))) {
>      MOZ_ASSERT_UNREACHABLE("Already initialized same process");
>      return false;

I would prefer to see this return false go away, and the function return void. We don't use the return value and don't handle it, so it'd be better to hard assert. Or gfxDevCrash.

Attachment #8911221 - Flags: review?(dvander) → review+

Pulsebot

Comment 9

•

7 years ago

Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6ed0903b5242
Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail.

Pulsebot

Comment 10

•

7 years ago

Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e3e936deef86
Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail. r=dvander

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 11

•

7 years ago

(In reply to Pulsebot from comment #9)
> Pushed by aosmond@gmail.com:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/6ed0903b5242
> Ensure we tear down CompositorManagerChild correctly if init or the GPU
> process fail.

I backed this push out because although I had updated the patch due to the r+ feedback, I failed to land the patch containing those changes.

Pulsebot

Comment 12

•

7 years ago

Backout by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/78681a30c037
Backed out changeset 6ed0903b5242 because an old version of the patch was landed.

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 13

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/e3e936deef86

Status: NEW → RESOLVED

Closed: 7 years ago

status-firefox58: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla58

Sylvestre Ledru [:Sylvestre]

Comment 14

•

7 years ago

What do we want to do wrt 57?

status-firefox56: affected → wontfix

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 15

•

7 years ago

(In reply to Sylvestre Ledru [:sylvestre] from comment #14)
> What do we want to do wrt 57?

The volume is relatively low. I'd like to soak it for a few days or a week before considering uplifting to confirm it is fixed.

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 16

•

7 years ago

Comment on attachment 8911221 [details] [diff] [review]
Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail., v1

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1389021
[User impact if declined]: May experience very low volume crash in the content process.
[Is this code covered by automated tests?]: Yes.
[Has the fix been verified in Nightly?]: Yes, it has been landed for a few days with no issues. The crash rate on nightly is very, very low so while I believe there is a good chance it will fix it on beta, I cannot confirm at this time.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: N/A
[Is the change risky?]: No.
[Why is the change risky/not risky?]: It is relatively small and confined. It only tweaks how we handle certain errors, making them more resilient to introducing another error.
[String changes made/needed]: N/A

Attachment #8911221 - Flags: approval-mozilla-beta?

Sylvestre Ledru [:Sylvestre]

Comment 17

•

7 years ago

Comment on attachment 8911221 [details] [diff] [review]
Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail., v1

Fix a crash which has a decent volume on release. Taking it.
Should be in 57b5

Attachment #8911221 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Ryan VanderMeulen [:RyanVM]

Comment 18

•

7 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/b7b894f5fad8

status-firefox57: affected → fixed

Andrei Vaida [:avaida]

Comment 19

•

7 years ago

(In reply to Andrew Osmond [:aosmond] from comment #16)
> [Is this code covered by automated tests?]: Yes.
> [Has the fix been verified in Nightly?]: Yes, it has been landed for a few
> days with no issues. The crash rate on nightly is very, very low so while I
> believe there is a good chance it will fix it on beta, I cannot confirm at
> this time.
> [Needs manual test from QE? If yes, steps to reproduce]: No.

Setting qe-verify- based on Andrew's assessment on manual testing needs and the fact that this fix has automated coverage.

Flags: qe-verify-

[:philipp]

Reporter

Comment 20

•

7 years ago

there are still crashes with this signature in 57.0b5:
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=beta&build_id=%3E%3D20171002181526&signature=mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3ACommonThreadOpenInit&date=%3E%3D2017-10-01#reports

are they still related and/or should we file a new bug for this?

Flags: needinfo?(aosmond)

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 21

•

7 years ago

Still happening on nightly too it seems.

Status: RESOLVED → REOPENED

Flags: needinfo?(aosmond)

Resolution: FIXED → ---

Julien Cristau [:jcristau] (back April 22)

Updated

•

7 years ago

status-firefox57: fixed → wontfix

status-firefox58: fixed → affected

status-firefox59: --- → ?

Julien Cristau [:jcristau] (back April 22)

Comment 22

•

6 years ago

Too late for 58, but this still seems to happen on 59.

status-firefox58: affected → wontfix

status-firefox59: ? → affected

[:philipp]

Reporter

Comment 23

•

6 years ago

i've just chatted with an affected user (bp-1e4e8e24-f782-4c54-9165-8c5461180130), where it turned out to be an issue with malicious software being present on the system, that could be resolved with a scan of malwarebytes.

Sylvestre Ledru [:Sylvestre]

Comment 24

•

6 years ago

Philipp, do you have the name of the malware?
Thanks

[:philipp]

Reporter

Comment 25

•

6 years ago

unfortunately not (and i have no way to reach out to them again, since it was irc). will remember to ask the next time someone is coming along...!

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 26

•

6 years ago

Funny, I just found a way to make this seemingly impossible crash happen at my desk, although I had to force it down a particular race condition path with code. Confirmation bias perhaps, but it matched my long held suspicions with how we are using / updating the process token stored in GPUProcessManager. Maybe the malicious software interferes with the timing?

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 27

•

6 years ago

So here is the failing chain of events bringing us here:

1) Bring up a GPU process. Sets GPUProcessManager::mProcessToken = X. CompositorManagerChild already got torn down.
2) Enter GPUProcessManager::EnsureCompositorManagerChild().
3) Enter GPUProcessManager::EnsureGPUReady().
3) Enter GPUChild::EnsureGPUReady. It fails due to the sync message failing. Note that this failure does not tear down the GPU process yet.
4) Exit GPUProcessManager::EnsureGPUReady(), return false.
5) Since there is no GPU ready, we call CompositorManagerChild::InitSameProcess() with process token X. This succeeds.

A probably short time later:

6) The GPU process is properly lost. GPUChild::ActorDestroy is called, and GPUProcessManager::DestroyProcess() follows. Sets GPUProcessManager::mProcessToken = 0.

A probably short time later:

7) Enter GPUProcessManager::EnsureCompositorManagerChild(). GPU process is not ready, we gave up.
8) We call CompositorManagerChild::InitSameProcess() with process token 0.
9) We already initialized, and so CompositorManagerParent::CreateSameProcess() fails.
10) Crash follows as per the trace.

We intend the the process token 0 be used when we abandon using the GPU process.

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 28

•

6 years ago

Attached patch 0001-Bug-1389759-Fix-GPUProcessManager-EnsureGPUReady-how.patch, v1 — Details — Splinter Review

Worked out to be surprisingly simple.

try (just the fix): https://treeherder.mozilla.org/#/jobs?repo=try&revision=51775baef0f7ca63be34c036bb06c6a69c70cb27

try (fix + forcing GPU process to fail in a way that triggers the race condition): https://treeherder.mozilla.org/#/jobs?repo=try&revision=ca2b35f0e17e5247f91327c2634972adf3175462

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 29

•

6 years ago

Comment on attachment 8946668 [details] [diff] [review]
0001-Bug-1389759-Fix-GPUProcessManager-EnsureGPUReady-how.patch, v1

There are a handful of seemingly unrelated failures in the try run with my hack to make the GPU process fail, but the regular try seems fine.

Attachment #8946668 - Flags: review?(rhunt)

Ryan Hunt [:rhunt] (on leave until early May)

Updated

•

6 years ago

Attachment #8946668 - Flags: review?(rhunt) → review+

Pulsebot

Comment 30

•

6 years ago

Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ecf18629444a
Fix GPUProcessManager::EnsureGPUReady how did not guarantee the GPU process state is consistent. r=rhunt

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 31

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/ecf18629444a

Status: REOPENED → RESOLVED

Closed: 7 years ago → 6 years ago

status-firefox60: --- → fixed

Resolution: --- → FIXED

Marco Castelluccio [:marco]

Comment 32

•

6 years ago

Should we uplift to beta?

Andrew Osmond [:aosmond] (he/him)

Assignee

Comment 33

•

6 years ago

Comment on attachment 8946668 [details] [diff] [review]
0001-Bug-1389759-Fix-GPUProcessManager-EnsureGPUReady-how.patch, v1

Approval Request Comment
[Feature/Bug causing the regression]: Bug 1389021
[User impact if declined]: May continue to crash UI process on startup for some users.
[Is this code covered by automated tests?]: The method modified is tested but no tests are known to trigger the race condition required to crash.
[Has the fix been verified in Nightly?]: Nightly has a very low crash rate compared to beta and release, so it is difficult to assess, other than it does not appear to have made things worse.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No.
[Why is the change risky/not risky?]: We will always crash if we hit the code path modified by the patch. It is unlikely the patch can make things worse compared to crashing the UI process.
[String changes made/needed]: None.

Attachment #8946668 - Flags: approval-mozilla-beta?

Liz Henry (:lizzard) (relman/hg->git project)

Comment 34

•

6 years ago

Comment on attachment 8946668 [details] [diff] [review]
0001-Bug-1389759-Fix-GPUProcessManager-EnsureGPUReady-how.patch, v1

Let's uplift this for 59 beta 7 and see if the crash volume goes down.

Attachment #8946668 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Ryan VanderMeulen [:RyanVM]

Comment 35

•

6 years ago

uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/7b20f617fd15

status-firefox59: affected → fixed

Ensure we tear down CompositorManagerChild correctly if init or the GPU process fail., v1 7 years ago Andrew Osmond [:aosmond] (he/him) 12.37 KB, patch	dvander : review+ Sylvestre : approval-mozilla-beta+	Details \| Diff \| Splinter Review
0001-Bug-1389759-Fix-GPUProcessManager-EnsureGPUReady-how.patch, v1 6 years ago Andrew Osmond [:aosmond] (he/him) 2.00 KB, patch	rhunt : review+ lizzard : approval-mozilla-beta+	Details \| Diff \| Splinter Review