Closed Bug 1120331 Opened 5 years ago Closed 5 years ago

crash in mozilla::ipc::MessageChannel::Send(IPC::Message*)

Categories

(Core :: DOM: Content Processes, defect, critical)

All
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla38
Tracking Status
e10s - ---
firefox36 --- unaffected
firefox37 + fixed
firefox38 + fixed

People

(Reporter: mihaelav, Assigned: nical)

References

()

Details

(4 keywords, Whiteboard: [mozmill])

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is 
report bp-bfffaee3-98f7-4f40-8dfc-0878b2150112.
=============================================================

We had this crash when running our endurance testrun.
Report: http://mozmill-daily.blargon7.com/#/endurance/report/5bb4668cfa6af53a0ba505835d4a8f3e
Hm I thought I filed this signature yesterday but I can't find it.

Since it's a read crash and not a breakpoint, I don't think it's bug 1114034. Since these are all Win7SP1 and there is Layers code on the stack, I think it's more likely bug 1120252.
Well I'm wrong since there are no more crashes for bug 1120252 but this signature is still present on the latest nightly. The Win7SP1 correlation is still awfully suspicious though.
Ran into the same issue when using the following builds:
* Crash: https://crash-stats.mozilla.com/report/index/331d4c28-2ae9-4d2e-9e02-c9c322150113
* Build: https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2015-01-10-03-02-01-mozilla-central/

Once installed, I opened fx right after and it crashed automatically within a few seconds of launching. Was using the following system:

- Win 7 Pro SP1 x64 (VM which is completely updated)
Crash Reason 	EXCEPTION_ACCESS_VIOLATION_READ
Crash Address 	0x69696956

First 10 frames:

0 	xul.dll 	mozilla::ipc::MessageChannel::Send(IPC::Message*) 	ipc/glue/MessageChannel.cpp
1 	xul.dll 	mozilla::layers::PLayerTransactionChild::SendClearCachedResources() 	obj-firefox/ipc/ipdl/PLayerTransactionChild.cpp
2 	xul.dll 	mozilla::layers::ClientLayerManager::ClearCachedResources(mozilla::layers::Layer*) 	gfx/layers/client/ClientLayerManager.cpp
3 	xul.dll 	mozilla::layers::ClientLayerManager::~ClientLayerManager() 	gfx/layers/client/ClientLayerManager.cpp
4 	xul.dll 	mozilla::layers::ClientLayerManager::`scalar deleting destructor'(unsigned int) 	
5 	xul.dll 	mozilla::layers::LayerManager::Release() 	gfx/layers/Layers.h
6 	xul.dll 	mozilla::layers::CompositorChild::Destroy() 	gfx/layers/ipc/CompositorChild.cpp
7 	xul.dll 	nsBaseWidget::DestroyCompositor() 	widget/nsBaseWidget.cpp
8 	xul.dll 	nsWindow::OnPaint(HDC__*, unsigned int) 	widget/windows/nsWindowGfx.cpp
9 	xul.dll 	nsWindow::ProcessMessage(unsigned int, unsigned int&, long&, long*) 	widget/windows/nsWindow.cpp
10 	xul.dll 	nsWindow::WindowProcInternal(HWND__*, unsigned int, unsigned int, long) 	widget/windows/nsWindow.cpp

Jim, I have to add that our Mozmill tests are disabling e10s, so this might be unrelated to multi processes.
OS: Windows NT → Windows 7
Hardware: x86 → All
Could this be more fallout of bug 1107718? In comment 3 I decided against it but now I am second-guessing myself.
Flags: needinfo?(bugmail.mozilla)
I think it's quite possible. Bug 1120252 was just a band-aid for one of the many code paths that might be affected by bug 1107718.
Flags: needinfo?(bugmail.mozilla)
What's 0x696969.. address signify?  0x5a5a5a5a is UAF, but I don't know what this one is...
It's trying to do pointer math on some 5a5a values.

7040dff1 837c0ef400      cmp     dword ptr [esi+ecx-0Ch],0 ds:002b:69696956=????????

Where ecx=5a5a5a5a and esi=0f0f0f08 (which was obtained a few lines before as 0x5a5a5a5a*0x14)
On Win64 the multiplier is 0x18 so given the delta, it's an array of some structure that contains one pointer. I am guessing it's |mThat.mCxxStackFrames|.
If you figure out that this is e10s specific, please reflag for triage.
Am I interpreting this right: if mCxxStackFrames is full of 0x5a, then the CxxStackFrame's |mThat| is dead, which in turn means the MessageChanel is dead? What does that mean?
Flags: needinfo?(bent.mozilla)
This sounds like something related to bug 1065536...

Adding mccr8 and nical.

Calling ActorDestroy manually (as is done in CompositorChild::Shutdown()) doesn't properly destroy/dealloc sub-protocol actors. It seems very possible to me that a subprotocol actor (PLayerTransactionChild) is still alive at the time the CompositorChild is deleted, and it doesn't expect its mChannel member (which is actually owned by the top-level actor, PCompositorChild) to suddenly be deleted like this.
Flags: needinfo?(bent.mozilla)
Yeah, if we have no test case, backing out part 3 there seems like it would be worth a shot.
Would bug 1065536 explain why this spiked with the 1/9 build? The timing seems off.
Dunno. But the current behavior seems clearly broken to me
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #16)
> Dunno. But the current behavior seems clearly broken to me

I see what you mean about it being broken, so I backed the patch out, but I think it can't be causing this crash because we're not going to run the code I added in non-DEBUG builds, AFAICT, due to the pre-xpcom-shutdown QuickExit().
Are we in the process of crashing and then somehow we call nsWindow::OnPaint()?  The presence of CallWindowProcCrashProtected in there looks suspicious to me.
The crash stack in bug 1120485 comment 0 also had CallWindowProcCrashProtected in it; the two bugs are probably dupes of each other.
¡Hola!

Just crashed like this:

Report ID 	Date Submitted
bp-16faaf93-4136-48db-b734-fdba12150115	1/15/2015	9:27 AM

The symptoms on this profile FWIW, e10s is completely busted here since Monday's Nightly and https://bugzilla.mozilla.org/show_bug.cgi?id=1120457 I believe all tabs became blank with a black/gray stopped spinner in the center upon disabling e10s this and https://bugzilla.mozilla.org/show_bug.cgi?id=1121101 started random crashes.

Let me know if there's anything you need me to collect on this end.

¡Gracias!
Alex
Flags: needinfo?(mihaela.velimiroviciu)
Flags: needinfo?(mihaela.velimiroviciu)
I get this pretty consistently these days and it usually happens when I open fx right after it's been installed on a new system (VM). If I wait anywhere between 10-15 seconds before opening the newly installed fx, the crashes are cut down by at least half during startup.
[Tracking Requested - why for this release]: #1 crash on aurora and nightly in 1-day data
Can't let this one sit. Any ideas? Can I dig anything out of crash dumps that would help you? Can Kamil collect any data from his repro? Does the win7 correlation give any clues?
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(bent.mozilla)
This has failed a lot in our mozmill testruns over the weekend, on Windows different versions, with nightly and aurora. There are different tests when it crashes so nothing I could do to reproduce it fast, can't investigate now. I submitted all reports.

Should I add a .dmp file, will that help you?
(In reply to Andreea Matei [:AndreeaMatei] from comment #24)
> This has failed a lot in our mozmill testruns over the weekend, on Windows
> different versions, with nightly and aurora. There are different tests when
> it crashes so nothing I could do to reproduce it fast, can't investigate
> now. I submitted all reports.
> 
> Should I add a .dmp file, will that help you?

yes please do!
Attached file dmp file
This is the most recent one from today, happened on win 7 64 bit with nightly.
Crash report: https://crash-stats.mozilla.com/report/index/84ac97ce-cfcf-40de-911b-c30552150120
As best I can tell mCompositorChild is dead when we get here: http://mxr.mozilla.org/mozilla-central/source/widget/nsBaseWidget.cpp#182

Given that mCompositorChild is a nsRefPtr I think we still have a reference counting problem somewhere. Andrew, can we try backing out other patches that might have eaten an extra ref?
Flags: needinfo?(bent.mozilla)
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #27)
> Given that mCompositorChild is a nsRefPtr I think we still have a reference
> counting problem somewhere. Andrew, can we try backing out other patches
> that might have eaten an extra ref?

I backed out part 2 of bug 1065536.  (Part 3 was already gone.)
Ben pointed out that my patch removed a release (not an addref), and is probably dead code anyways, so I wouldn't expect the backout to do anything.  But at least this will clear my good name. ;)
Bug 1107718 still looks suspicious to me, as it landed on the right day to cause this, and also is on the crash stack.  Can we back that out?
Blocks: 1107718
I agree that bug 1107718 looks quite likely...

But we still needed to back out part 3 of bug 1065536, that was definitely not correct.
Andreea, is it possible to collect a full-heap dmp file from the mozmill machines?
Flags: needinfo?(andreea.matei)
> Bug 1107718 still looks suspicious to me, as it landed on the right day to
> cause this, and also is on the crash stack.  Can we back that out?

Bas, what do you think? That change has come up as a suspect in several bugs this week.
Flags: needinfo?(bas)
Bug 1107718 would have fixed a lot of crashes, but it may have consolidated a long tail of various crashes and replaced it by some, less frequent ones. The total crash frequency has gone down, I've communicated this over a couple of channels but I guess not widely enough. It's important to look at the -total- crashes. As bug 1107718 likely brought those down. i.e. this will occur occasionally in a situation where before bug 1107718 we would simply -always- crash, just always in a random signature that made less sense.
Flags: needinfo?(bas)
The observation of multiple QA folks on this bug is that FF is being especially crashy lately, thanks to this signature. I don't see how that agrees with the statement that total crashes have been reduced.
David, we could try, could you give us some instructions on how to get a full-heap dmp file?
Flags: needinfo?(andreea.matei)
It depends how you're collecting these .dmp files currently. Are you using a debugger? Or the crash reporter?

If you're using the crash reporter, set the environment variable MOZ_CRASHREPORTER_FULLDUMP=1 before starting the browser. When a crash happens, you won't see the crash reporter. You'll need to manually find the most recent .dmp file in your profile\minidumps directory.

Note that the file will be very large, and will contain all Firefox memory, possibly including passwords and personal information. If the automation machines contain any sensitive data then please send the file privately.
From what I've discussed with Henrik, there's no sensitive data on the machines, so here's the file:
https://www.dropbox.com/s/sv9ehbcpuzurx7f/625949f3-bb34-4799-9538-4faf0469b594.dmp?dl=0

It has 257MB and I see only memory blocks in it.
(In reply to Andreea Matei [:AndreeaMatei] from comment #38)

Can you post the build id also?
This was with Aurora, BuildID=20150121004011
I can somewhat reliably crash by typing about:whatever, eg about:crashes.
bp-9ff36e82-e249-4edb-ad41-25d872150126 bp-c7511d65-993c-44df-bbc5-399162150126
Similar crashes with different signature Bug 1125847, Bug 1125848 
 
If I instead get to crashes ID by clicking help | troubleshooting then no crash - at least not so far.


My first crash today was GetFileAttributesW bp-92379ec6-82f1-493f-948a-1f3512150126, but I don't see that it is related.
The never ending black on white spinners were back this morning so I disabled e10s and upon restart Nightly crashed when loading my work's ServiceNow instance.

Report ID 	Date Submitted
bp-2820f799-9572-48d9-8d71-ad9302150127	1/27/2015	9:33 AM
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0 ID:20150128030203 CSet: b2b10231606b is crash happy with this signature today =(

Report ID 	Date Submitted
bp-04f531b3-e823-4eaf-afb1-a612f2150128	1/28/2015	11:54 AM
bp-c887b286-7cf3-4845-a2e8-082a32150128	1/28/2015	11:46 AM
bp-a89ee360-3390-404a-979a-f9b052150128	1/28/2015	11:17 AM
This has 100% correlation with D3D10Warp.dll. Relevant? Or simply a consequence of the fact that it's Win7-only?
¡Hola David!

Might it be related to https://bugzilla.mozilla.org/show_bug.cgi?id=1121101 ?

When I don't get this crash I get that other one.

Please https://www.youtube.com/watch?v=XP4clbHc4Xg ! =D
Flags: needinfo?(dmajor)
Yes it's very likely. By chance, those crashes had valid memory at the affected addresses, so the process survived just a little bit longer.
Flags: needinfo?(dmajor)
I've confirmed from Andreea's dump file that the MessageChannel is dead, but I don't know IPC well enough to interpret what that means...
I think comment 27 is still right: we have a reference counting problem. When the CompositorChild is destroyed its MessageChannel dies with it. Any sub-protocol actors (ShadowLayerForwarder holds one it seems) that are still alive will then be using a dangling MessageChannel pointer. If the CompositorChild was closed properly then I think we'd take this branch and avoid the crash: https://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/ShadowLayers.cpp#880
¡Hola David, Ben!

FWIW all the crashes I checked in about:crashes point to http://hg.mozilla.org/mozilla-central/annotate/29b05d283b00/ipc/glue/MessageChannel.cpp#l530

Is that relevant or I'm stating something obvious?
Flags: needinfo?(bent.mozilla)
Keywords: topcrashtopcrash-win
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #49)
> I think comment 27 is still right: we have a reference counting problem.
> When the CompositorChild is destroyed its MessageChannel dies with it. Any
> sub-protocol actors (ShadowLayerForwarder holds one it seems) that are still
> alive will then be using a dangling MessageChannel pointer. If the
> CompositorChild was closed properly then I think we'd take this branch and
> avoid the crash:
> https://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/ShadowLayers.
> cpp#880


I am getting a bit lost in the ipc stuff, in which situation can we destroy the CompositorChild without giving it a chance to force-destroy all of its managees (in which case we have the IPCOpen safe guard)?

Oh, or could this mean that we create another CompositorChild, attach it to the same channel, and then the first CompositorChild, when destroying itself would close the channel and leave something broken for the second one?
Flags: needinfo?(nical.bugzilla)
(In reply to Nicolas Silva [:nical] from comment #52)
> 

(answering my own questions)

> I am getting a bit lost in the ipc stuff, in which situation can we destroy
> the CompositorChild without giving it a chance to force-destroy all of its
> managees (in which case we have the IPCOpen safe guard)?

Looks like we need to make sure in nsBaseWidget::CreateCompositor to explitly close the previous channel if we already had a pcompositor pair when creating a new one. This should make sure DestroySubtree is propagated correctly.  

> 
> Oh, or could this mean that we create another CompositorChild, attach it to
> the same channel, and then the first CompositorChild, when destroying itself
> would close the channel and leave something broken for the second one?

Looked closer and I don't see any place where we attach a CompositorChild to a channel that we didn't just create.
It's odd, that we crash sending an ipdl message in a stack where we just sent a bunch of others like SendWillStop in nsBaseWidget::DestroyCompositor.

Also, before nulling the reference to the ClientLayerManager that will trigger the destructor which will call ClearCachedResource and crash, we call Destroy on the ClientLayerManager, which also calls ClearCachedResource. So we just sent the message without crashing (we should not send it twice but I don't see how this could cause the crash).
Based on my previous comment. This patch will at least make us not send the message twice, but I am worried that it might just move the crash to a different stack, so we should either keep investigating without landing it or land it on on a branch where we can quickly observe the effects and be on the lookout for another top-crash appearing if this one vanishes with the patch.

I still don't understand what's causing the issue.
Attachment #8559166 - Flags: review?(sotaro.ikeda.g)
Comment on attachment 8559166 [details] [diff] [review]
Don't send the ClearCachedResources message twice

Review of attachment 8559166 [details] [diff] [review]:
-----------------------------------------------------------------

This patch is OK. But I have no idea about whether this patch could fix the problem.
Current gfx layer code seems not have a problem.
Attachment #8559166 - Flags: review?(sotaro.ikeda.g) → review+
Could someone please land this on Nightly?

This crash is rather frequent for me.

Report ID 	Date Submitted
bp-adfb94b5-89e1-4722-958c-360cb2150210
	2/10/2015	7:37 AM
bp-7c50d5a3-2885-4e04-b5f6-c50662150129
	1/29/2015	7:47 AM
> or land it on on a branch where we can quickly observe the effects and be on
> the lookout for another top-crash appearing if this one vanishes with the
> patch.

The joy of being the #1 crash is that we'll certainly see it if the signature moves :)
Flags: needinfo?(bent.mozilla)
:nical, should we land this?
Flags: needinfo?(bas) → needinfo?(nical.bugzilla)
Blocks: 1126268
to add some perspective, this is 30% of crashes for 37.0a2, which in just over a week will propagate to beta.
(In reply to Milan Sreckovic [:milan] from comment #59)
> :nical, should we land this?

yes
Flags: needinfo?(nical.bugzilla)
https://hg.mozilla.org/mozilla-central/rev/7d4f76e0411b
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla38
(In reply to Jim Mathies [:jimm] from comment #65)
> Looks like this has helped a great deal, I count just one report thus far
> for the 14th build - 

OIK, this query agrees with you:

https://crash-stats.mozilla.com/search/?product=Firefox&release_channel=nightly&build_id=>20150214000000&signature=%24mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3ASend&_facets=build_id&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform

This one shows a different signature though that contains mozilla::ipc::MessageChannel::Send, is that related or a different issue?

https://crash-stats.mozilla.com/search/?product=Firefox&release_channel=nightly&build_id=>20150214000000&signature=~mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3ASend&_facets=build_id&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform
Hrm, seems like the plain > sign breaks the links in Bugzilla. :(
We had lots of those crashes with our Mozmill test on the Windows 7 machines with Nightly builds last week. But over the weekend no single crash in this signature for Nightly! Only Aurora builds are still crashing. So at least from our perspective this patch fixed our crashes.
Assignee: nsilva → nical.bugzilla
That's good news! Let's get this on Aurora.
Flags: needinfo?(nical.bugzilla)
Actually this may have caused bug 1133426...
Flags: needinfo?(nical.bugzilla)
It would be great if we can get this backported to mozilla-aurora soon. This is the most occurring crash in our Mozmill test suite for Aurora builds those days. Thanks.
(In reply to Henrik Skupin (:whimboo) from comment #71)
> It would be great if we can get this backported to mozilla-aurora soon. This
> is the most occurring crash in our Mozmill test suite for Aurora builds
> those days. Thanks.

I'm pretty sure we didn't actually fix this. The signature merely moved to bug 1133426.
This signature is still the #1 top crash in 37.0a2 and is usually a startup crash.
Do we need to uplift this fix to 37 in addition to fixing bug 1133426?
Flags: needinfo?(dmajor)
It is better to look if uplift of Bug 1133426 fixed the problem. It was just uplifted to beta.
Thanks Sotaro. With bug 1133426 now on Beta 37, I'm marking this bug as fixed for 37.
Flags: needinfo?(dmajor)
Flags: qe-verify+
Here is the current volume by Beta version:
* 37.0b1: 10899 crashes
* 37.0b2: 1 crash
* 37.0b3: 3 crashes
* 37.0b4: 10 crashes
* 37.0b5: 4 crashes

Based on these volumes I'm not sure this is 100% fixed but it certainly seems to be lessened for those on later Beta versions. However, based on the extreme volume on 37.0b1 is this possibly preventing people from upgrading?

Henrik, as an aside, can you report back if this has been resolved in your Mozmill testruns?
Flags: needinfo?(hskupin)
We haven't seen any crashes for this signature since the fix landed for Firefox Beta. That's all what I can say.
Flags: needinfo?(hskupin)
If you expand the date range back to the start of 37b, it gets even more dramatic:
20150223185154 	260135
20150302192546 	2
20150305191659 	5
20150309191715 	10
20150312193711 	6

Given that, I'm completely happy with the fix from my perspective. From a quick spot check those other crashes are just noise, different address patterns from the ones seen here.
Thanks Henrik and David. I'm dropping the QA verification request here based on your data.
Flags: qe-verify+ → qe-verify-
You need to log in before you can comment on or make changes to this bug.