Closed Bug 1125848 Opened 9 years ago Closed 9 years ago

crash in mozilla::layers::PCompositorParent::Lookup(int)

Categories

(Core :: Graphics: Layers, defect)

All
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla40
Tracking Status
firefox37 + fixed
firefox38 + fixed
firefox39 --- fixed
firefox40 --- fixed

People

(Reporter: wsmwk, Assigned: nical)

References

()

Details

(Keywords: crash, Whiteboard: [gfx-noted][mozmill])

Crash Data

Attachments

(4 files)

In general I am crashing like crazy since updating from 2014-12-29 nightly.
Related to bug 1120331 and/or my Bug 1125847 ?

I did help | about  using 2015-01-25 nightly on my old PC.
report bp-4284d694-252a-44cc-afca-7a4f42150126.
=============================================================

 0 	xul.dll	mozilla::layers::PCompositorParent::Lookup(int)	obj-firefox/ipc/ipdl/PCompositorParent.cpp
1 	xul.dll	mozilla::layers::PCompositorParent::OnMessageReceived(IPC::Message const&)	obj-firefox/ipc/ipdl/PCompositorParent.cpp
2 	xul.dll	mozilla::ipc::MessageChannel::OnMaybeDequeueOne()	ipc/glue/MessageChannel.cpp
3 	xul.dll	RunnableMethod<mozilla::ipc::MessageChannel, void ( mozilla::ipc::MessageChannel::*)(void), Tuple0>::Run()	ipc/chromium/src/base/task.h
4 	xul.dll	MessageLoop::DoWork()	ipc/chromium/src/base/message_loop.cc
5 	xul.dll	base::MessagePumpForUI::DoRunLoop()	ipc/chromium/src/base/message_pump_win.cc
6 	xul.dll	base::MessagePumpWin::Run(base::MessagePump::Delegate*)	ipc/chromium/src/base/message_pump_win.h
7 	xul.dll	MessageLoop::RunHandler()	ipc/chromium/src/base/message_loop.cc
8 	xul.dll	MessageLoop::Run()	ipc/chromium/src/base/message_loop.cc
9 	xul.dll	base::Thread::ThreadMain()	ipc/chromium/src/base/thread.cc
10 	xul.dll	`anonymous namespace'::ThreadFunc(void*)	ipc/chromium/src/base/platform_thread_win.cc
Component: IPC → Graphics: Layers
And earlier - doing  tools | addons | check for updates - I crashed @ IDMap<mozilla::ipc::IProtocol>::Remove(int) bp-7ee7712d-bacd-4766-b5f0-e04fb2150126

0 	xul.dll	IDMap<mozilla::ipc::IProtocol>::Remove(int)	ipc/chromium/src/base/id_map.h
1 	xul.dll	mozilla::layers::PTextureParent::DestroySubtree(mozilla::ipc::IProtocolManager<mozilla::ipc::IProtocol>::ActorDestroyReason)	obj-firefox/ipc/ipdl/PTextureParent.cpp
2 	xul.dll	mozilla::layers::PTextureParent::Send__delete__(mozilla::layers::PTextureParent*)	obj-firefox/ipc/ipdl/PTextureParent.cpp
3 	xul.dll	mozilla::layers::PTextureParent::OnMessageReceived(IPC::Message const&)	obj-firefox/ipc/ipdl/PTextureParent.cpp
4 	xul.dll	mozilla::layers::PCompositorParent::OnMessageReceived(IPC::Message const&)	obj-firefox/ipc/ipdl/PCompositorParent.cpp
Lots of tabs open.
I tried to do about:sessionstore - not e10s
bp-546aa438-94c8-4812-99e0-1ee6c2150126
timeless also crashed. bp-cdc06738-707b-46a3-b621-224802150128
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0 ID:20150204030234 CSet: 0c2f7434c325 just crashed here too.

Report ID 	Date Submitted
bp-531e0f96-207b-4d6e-a22d-537ec2150204
	2/4/2015	9:46 AM

Let me know if there's anything worth collecting from this system.
Were any of the crashes with e10s enabled or all without?
Flags: needinfo?(vseerror)
Whiteboard: [gfx-noted]
Crashed again today on Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0 ID:20150209030204 CSet: 3436787a82d0

Report ID 	Date Submitted
bp-94e2c433-a263-484d-bf7a-545f02150209
	2/9/2015	9:47 AM

e10s is disabled due to https://bugzilla.mozilla.org/show_bug.cgi?id=1087007 or a more virulent variant of it.
[Tracking Requested - why for this release]:
This is the #5 crash on Dev Edition 37 at this point with >2% of all crashes.


This signature is 100% correlated to D3D10Warp.dll, is that helpful?
Also see https://crash-stats.mozilla.com/report/list?signature=mozilla%3A%3Alayers%3A%3APCompositorParent%3A%3ALookup%28int%29 for more reports and data.
Given the D3D10Warp.dll and Win7 correlations, this is likely to be the same issue as bug 1120331 which was recently fixed. Let's give it a couple days to confirm this goes away from crash-stats.
This may not be the same as bug 1120331 given that our tests with Nightly are still crashing in that stack trace on Feb 14th and 15th, with the fix landed on mozilla-central on Feb 13th.

Here one example from yesterday: bp-b0474132-b1c9-4073-9dde-302342150216
OS: Windows NT → Windows 7
Whiteboard: [gfx-noted] → [gfx-noted][mozmill]
(In reply to David Major [:dmajor] (UTC+13) from comment #10)
> Given the D3D10Warp.dll and Win7 correlations, this is likely to be the same
> issue as bug 1120331 which was recently fixed. Let's give it a couple days
> to confirm this goes away from crash-stats.

(In reply to Henrik Skupin (:whimboo) from comment #11)
> This may not be the same as bug 1120331 given that our tests with Nightly
> are still crashing in that stack trace on Feb 14th and 15th, with the fix
> landed on mozilla-central on Feb 13th.

Ah, that's too bad. Nical, any ideas for this one?
Flags: needinfo?(nical.bugzilla)
Adding Sotaro also
Flags: needinfo?(sotaro.ikeda.g)
(In reply to David Major [:dmajor] (UTC+13) from comment #12)
> Ah, that's too bad. Nical, any ideas for this one?

nical is PTO until end of February.
Flags: needinfo?(sotaro.ikeda.g)
From Bug 1133426 - Comment 14, there is a case that CompositorChild and CompositorParent are destroyed without correct shutdown.
(In reply to Sotaro Ikeda [:sotaro] from comment #15)
> From Bug 1133426 - Comment 14, there is a case that CompositorChild and
> CompositorParent are destroyed without correct shutdown.

Bug 1133426 Comment 14
Assignee: nobody → nical.bugzilla
ABORT: mismatched CxxStackFrame ctor/dtors: file c:\builds\moz2_slave\m-cen-w32-ntly-000000000000000\build\src\ipc\glue\MessageChannel.cpp, line 1825)
That assertion almost always indicates a UAF condition.
This is currently the #2 topcrash for 37 with 3.53% of crashes, but it's not showing up at all in the topcrash list for 38 or 39. Almost all these crashes are for buildid 2015022318.  

So something must have fixed this in 38.  If we could figure out what it is, it would be good to uplift to 37.
We want to be careful here; Feb 23 is around the time the trains changed, and when 38 would have gone to e10s off.  If it is a race condition, with use after free, it may very well be just the timing change.
This will shield us against messages being sent after PCompositor::SendWillStop, and crash debug builds when it happens (Although we won't see this assertion on beta, but it'll help this kind of issue not sneaking into nightly in the future).

I am not 100% certain that this is the cause of the problem but it's the only thing that came to mind so far and if proven wrong we'll at least have a big suspect out of the equation.
Flags: needinfo?(nical.bugzilla)
Attachment #8573253 - Flags: review?(sotaro.ikeda.g)
Comment on attachment 8573253 [details] [diff] [review]
Prevent messages to be sent after the shutdown of PCompositorChild has started

Looks good.
Attachment #8573253 - Flags: review?(sotaro.ikeda.g) → review+
Comment on attachment 8573253 [details] [diff] [review]
Prevent messages to be sent after the shutdown of PCompositorChild has started

Approval Request Comment
[Feature/regressing bug #]:
[User impact if declined]: High crash volume on beta
[Describe test coverage new/current, TreeHerder]:
[Risks and why]: very low, only adds sanity checks and early returns in situations where we would crash otherwise.
[String/UUID change made/needed]:
Attachment #8573253 - Flags: approval-mozilla-beta?
Attachment #8573253 - Flags: approval-mozilla-aurora?
Comment on attachment 8573253 [details] [diff] [review]
Prevent messages to be sent after the shutdown of PCompositorChild has started

Confirmed with Milan on irc that we should push this up to Beta quickly to test. Let's test the results of this crash fix in Beta 3. Beta+ Aurora+
Attachment #8573253 - Flags: approval-mozilla-beta?
Attachment #8573253 - Flags: approval-mozilla-beta+
Attachment #8573253 - Flags: approval-mozilla-aurora?
Attachment #8573253 - Flags: approval-mozilla-aurora+
https://hg.mozilla.org/mozilla-central/rev/1c85afc8c189
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #29)
> Backed out from beta for crashes/asserts on at least OSX.
> https://hg.mozilla.org/releases/mozilla-beta/rev/e4f2ee87f064
> 
> https://treeherder.mozilla.org/logviewer.html#?job_id=288208&repo=mozilla-
> beta

That's great news! Now we know that something is trying to send PCompositorChild messages too late while the actor is being destroyed, which would cause the symptoms of this bug, and that it happens on beta but not on trunk, just like this bug.
Flags: needinfo?(nical.bugzilla)
(In reply to Nicolas Silva [:nical] from comment #30)
> That's great news! Now we know that something is trying to send
> PCompositorChild messages too late while the actor is being destroyed, which
> would cause the symptoms of this bug, and that it happens on beta but not on
> trunk, just like this bug.

After a closer look, the assertion we hit means that a CompositorChild is being destroyed without going through the usual SendWillStop/SendStop destruction sequence (rather than a message being sent at the wrong time).
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Re-landed without the assertion to see if it can show us something more interesting.
https://hg.mozilla.org/releases/mozilla-beta/rev/a082a6e1cd99
Target Milestone: mozilla39 → ---
Basically, every time something clears the LayerManager of a widget without making sure the PCompositor pair is properly destroyed, we risk that a subsequent call to GetLayerManager will replace and destroy the current CompositorParent, regardless of whether there are messages in flight.

This also removes the SetLayersAcceleration madness which could cause the layer manager to be destroyed without properly destroying the PCompositor pair.
Attachment #8574015 - Flags: review?(sotaro.ikeda.g)
Comment on attachment 8574015 [details] [diff] [review]
Reduce the likelyhood of a CompositorParent to be destroyed without the proper shutdown sequence

Review of attachment 8574015 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.
Attachment #8574015 - Flags: review?(sotaro.ikeda.g) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/0556bae4c05e
Whiteboard: [gfx-noted][mozmill] → [gfx-noted][mozmill] [leave-open]
This should simplify things slightly. Rather than assigning null to its mLayerManager, the widget should call DestroyLayerManager which will ensure that the CompositorChild/Parent are dealt with properly. Also make sure that in the unlikely scenario where we still call CreateCompositor without having deinitialized the PCompositor pair, we take care of that first (but assert in debug builds).
Attachment #8574720 - Flags: review?(sotaro.ikeda.g)
Comment on attachment 8574015 [details] [diff] [review]
Reduce the likelyhood of a CompositorParent to be destroyed without the proper shutdown sequence

Approval Request Comment
[Feature/regressing bug #]:
[User impact if declined]: crashes
[Describe test coverage new/current, TreeHerder]:
[Risks and why]: low risk. Simple patch.
[String/UUID change made/needed]:
Attachment #8574015 - Flags: approval-mozilla-beta?
Attachment #8574015 - Flags: approval-mozilla-aurora?
Comment on attachment 8574720 [details] [diff] [review]
move creation/destruction logic into CompositorChild.cpp, make sure the CompositorChild is always deinitialized if the LayerManager is cleared

Review of attachment 8574720 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.
Attachment #8574720 - Flags: review?(sotaro.ikeda.g) → review+
Comment on attachment 8574720 [details] [diff] [review]
move creation/destruction logic into CompositorChild.cpp, make sure the CompositorChild is always deinitialized if the LayerManager is cleared

Approval Request Comment
[Feature/regressing bug #]:
[User impact if declined]: crashes
[Describe test coverage new/current, TreeHerder]:
[Risks and why]: low, simple patch.
[String/UUID change made/needed]:
Attachment #8574720 - Flags: approval-mozilla-release?
Attachment #8574720 - Flags: approval-mozilla-beta?
Landed the 3rd patch. The version that was reviewed was based on top of beta, so I had to adapt a few things before landing it on inbound.

https://hg.mozilla.org/integration/mozilla-inbound/rev/622f0877bc3f
sorry had to back this out for e10s bustage like https://treeherder.mozilla.org/logviewer.html#?job_id=7405550&repo=mozilla-inbound
Flags: needinfo?(nical.bugzilla)
I made a silly mistake with refcounting, fixed and relanded: https://hg.mozilla.org/integration/mozilla-inbound/rev/79eab0a3960e
Flags: needinfo?(nical.bugzilla)
Attachment #8574720 - Flags: approval-mozilla-release? → approval-mozilla-aurora?
Comment on attachment 8574015 [details] [diff] [review]
Reduce the likelyhood of a CompositorParent to be destroyed without the proper shutdown sequence

I reviewed this bug with Milan. There is definitely more risk here than either of us would like to see but the payoff in terms of fixing the crash is pretty big. We're going to take this fix in Beta 5 in order to try and verify it. If there is notable fallout (not expected but definitely a possibility), we'll back this out in Beta 6. Beta+ Aurora+
Attachment #8574015 - Flags: approval-mozilla-beta?
Attachment #8574015 - Flags: approval-mozilla-beta+
Attachment #8574015 - Flags: approval-mozilla-aurora?
Attachment #8574015 - Flags: approval-mozilla-aurora+
Attachment #8574720 - Flags: approval-mozilla-beta?
Attachment #8574720 - Flags: approval-mozilla-beta+
Attachment #8574720 - Flags: approval-mozilla-aurora?
Attachment #8574720 - Flags: approval-mozilla-aurora+
landed on beta, haven't had time to prepare the aurora version and it's late so I'll do it tomorrow.

https://hg.mozilla.org/releases/mozilla-beta/rev/e15e6597d699
https://hg.mozilla.org/releases/mozilla-beta/rev/81009105d11d
Should this bug be closed now, or is there more landing on m-c still?
Flags: needinfo?(nical.bugzilla)
Target Milestone: --- → mozilla39
Nical - This fix has missed Beta 5 because of the backout. In the hope that this can land cleanly for Beta 6 (probably our last chance to take this patch), can you please try to fix up the patch tomorrow and execute a try run before uplifting to m-b again?
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #47)
> Should this bug be closed now, or is there more landing on m-c still?

Let's keep it open until we can confirm that the bug is fixed (I can't reproduce it).
Flags: needinfo?(nical.bugzilla)
The landing of the patch caused a new top-crasher for nightly builds. See bug 1142939.
I think that patch 1 and 2 are safe to keep (patch 2 should be relanded on beta without patch 3). With patch 3 there something subtle about the destruction of CompositorChild that I don't quite understand yet and that makes it way trickier than I thought it would be. Additionally the destruction of CompositorChild has changed in non-trivial ways between m-c and beta. So I'll give up on patch 3 which is too risky (tried to salvage bits of it but try push came out as colorful as a christmas tree).
As soon as inbound reopens I'll back it out from central, and as soon as I get a green try from patch 2 on beta without patch 3 I'll uplift patch 2.

Side note: All of this showed that currently if ComposiorChild's destruction "works" it's just luck. For instanmce if you assert in the destructor that the object has gone through the safe destruction hand-shake, a lot of tests start burning randomly. We should take the time to fix CompositorChild's shutdown (again) when things settle a bit. This includes removing this extraordinarily footgun-friendly sCompositor global that makes you think it's a singleton but in fact not really.
(In reply to Henrik Skupin (:whimboo) from comment #51)
> The landing of the patch caused a new top-crasher for nightly builds. See
> bug 1142939.

To be more explicit given that multiple patches are attached to this bug... the regressor is https://hg.mozilla.org/mozilla-central/rev/79eab0a3960e, which hasn't been landed on aurora and got backed out from beta 5! So no new top crasher for the next beta release.
(In reply to Henrik Skupin (:whimboo) from comment #53)
> To be more explicit given that multiple patches are attached to this bug...
> the regressor is https://hg.mozilla.org/mozilla-central/rev/79eab0a3960e,
> which hasn't been landed on aurora and got backed out from beta 5! So no new
> top crasher for the next beta release.

I rechecked the the change. It might hit the problem of Bug 1133426 Comment 18. 

ClientLayerManager::~ClientLayerManager() calls DidComposite(). In some cases, it trigger LayerManager re-creation. The re-creation triggers new CompositorChild and CompositorParent creation. If previous CompositorChild and CompositorParent pointer is held only by member. They lacks reference count, and they are destructed without correct shutdown.

Therefore, Bug 1133426 fix hold pointers of CompositorChild and CompositorParent to temporary pointers. the above change remove it.
This patch might fix the problem that happened before back-out.
A treeherder build (winXP opt build) works, so I think the backout fixes it
Nical - Are you still working on this? Should we take a fix in Beta 7 on Thursday?
Flags: needinfo?(nical.bugzilla)
(In reply to Lawrence Mandel [:lmandel] (use needinfo) from comment #58)
> Nical - Are you still working on this? Should we take a fix in Beta 7 on
> Thursday?

re-landed part 2 on beta and aurora
https://hg.mozilla.org/releases/mozilla-beta/rev/45897d27ef82
https://hg.mozilla.org/releases/mozilla-aurora/rev/8c310e0dbcc3

There is too much risk with part 3 which still breaks with sotaro's fix. Beyond part 2, things won't make it in time for beta. Still looking into it, though.
Flags: needinfo?(nical.bugzilla)
Fx37 is as fixed as it's going to be.
Hey :nical, getting a bit confused - which of the patches landed (maybe we can tag them with checkin)?  Has anything landed that hasn't been uplifted?  Are there any more patches coming?
Flags: needinfo?(nical.bugzilla)
Attachment #8574720 - Flags: checkin-
Flags: needinfo?(nical.bugzilla)
Attachment #8574015 - Flags: checkin+
Attachment #8573253 - Flags: checkin+
(In reply to Milan Sreckovic [:milan] from comment #61)
> Hey :nical, getting a bit confused - which of the patches landed (maybe we
> can tag them with checkin)?  Has anything landed that hasn't been uplifted? 
> Are there any more patches coming?

Good idea, I got confused a few times myself about what had been landed/uplidted here.

The first two patches (marked checkin+) have landed in central + aurora + beta (and by now it's in release I guess), while the third patch did not land anywhere (got backed out a bunch of times, for different reasons depending on where it landed, even with the fourth patch that Sotaro made).

Patch should come, well, because we need them to, since the problem isn't entirely fixed, but I don't understand 100% what's going on at the moment and how much we need to improve the destruction of CompositorChild to get into a place where things work well here.
Landed a modified version of the 3rd patch + Sotaro's fix (after green try push), let's see how this goes in the wild.

https://hg.mozilla.org/integration/mozilla-inbound/rev/295db120bf12
crash-stats shows no crashes with this signature on gecko 38+.
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
Whiteboard: [gfx-noted][mozmill] [leave-open] → [gfx-noted][mozmill]
Target Milestone: mozilla39 → mozilla40
Depends on: 1156833
Depends on: 1180688
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: