crash in mozalloc_abort(char const* const) | NS_DebugBreak | mozilla::ipc::MessageChannel::DebugAbort(char const*, int, char const*, char const*, bool) | mozilla::ipc::MessageChannel::~MessageChannel() | mozilla::layers::CompositorParent::Release()

RESOLVED INCOMPLETE

Status

()

defect
P3
critical
RESOLVED INCOMPLETE
5 years ago
9 months ago

People

(Reporter: mihaelav, Unassigned)

Tracking

({crash})

unspecified
mozilla48
x86
Windows 7
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox38 affected, firefox48 affected, firefox49 affected, firefox50 affected, firefox51 affected, firefox52 ?)

Details

(Whiteboard: [mozmill][gfx-noted], crash signature, )

Attachments

(3 attachments, 1 obsolete attachment)

This bug was filed from the Socorro interface and is 
report bp-62b61d2f-ab57-4c6b-93a7-ab40c2150112.
=============================================================

This crashed while running our mozmill functional suite.
Report: http://mozmill-daily.blargon7.com/#/functional/report/c0d2f578c4e131660e84b4e0e211031b
If this happens again please let us know. It would be good to have an idea of how frequently this is encountered.

Also, do you know if the mozmill suite loads pages with plugins? Stuff in the crash stack makes me think it might be related to plugin crashing.
Flags: needinfo?(mihaela.velimiroviciu)
Whiteboard: gfx-noted
Weird stuff happening here. Looks like we are asking a LayerManager to a widget that used to have one, then lost it somehow, and since it doesn't have a LayerManager anymore, it thinks it doesn't have a compositor protocol, except it does and things go bad when it reassigns the previous compositor without properly shutting it down.
Flags: needinfo?(mihaela.velimiroviciu)
The crash appeared in a test which doesn't use plugins, but Flash plugin was used in one of the previous tests
Flags: needinfo?(mihaela.velimiroviciu)
Flags: needinfo?(mihaela.velimiroviciu)
All crashes so far on crashstats were happening under Windows 7. There are 45 crashes in the last 7 days, but none before. So this might be a new regression. The first buildid for this crash is 20150109030224.


Crash Reason 	EXCEPTION_BREAKPOINT
Crash Address 	0x74011e28

Here the list of the first 10 frames:

0 	mozalloc.dll 	mozalloc_abort(char const* const) 	memory/mozalloc/mozalloc_abort.cpp
1 	xul.dll 	NS_DebugBreak 	xpcom/base/nsDebugImpl.cpp
2 	xul.dll 	mozilla::ipc::MessageChannel::DebugAbort(char const*, int, char const*, char const*, bool) 	ipc/glue/MessageChannel.cpp
3 	xul.dll 	mozilla::ipc::MessageChannel::~MessageChannel() 	ipc/glue/MessageChannel.cpp
4 	xul.dll 	mozilla::layers::CompositorParent::Release() 	gfx/layers/ipc/CompositorParent.h
5 	mozalloc.dll 	mozalloc.dll@0xfff 	
6 	xul.dll 	nsBaseWidget::CreateCompositor(int, int) 	widget/nsBaseWidget.cpp
7 	xul.dll 	nsBaseWidget::CreateCompositor() 	widget/nsBaseWidget.cpp
8 	xul.dll 	nsWindow::GetLayerManager(mozilla::layers::PLayerTransactionChild*, mozilla::layers::LayersBackend, nsIWidget::LayerManagerPersistence, bool*) 	widget/windows/nsWindow.cpp
9 	xul.dll 	nsIWidget::GetLayerManager(bool*) 	widget/nsIWidget.h
10 	xul.dll 	PresShell::Paint(nsView*, nsRegion const&, unsigned int) 	layout/base/nsPresShell.cpp
OS: Windows NT → Windows 7
Maybe related to one of the changes landed between Jan 8th and 9th?

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=70de2960aa87&tochange=b3f84cf78dc2

There is at least bug 1109873, which seems to have changed code for layers.
Whiteboard: gfx-noted → [mozmill}[gfx-noted]
Whiteboard: [mozmill}[gfx-noted] → [mozmill][gfx-noted]
Crashed again in /testSecurity/testSSLStatusAfterRestart.js|testSSLStatusAfterRestart.js
(first time was in /testAwesomeBar/testEscapeAutocomplete.js|testEscapeAutocomplete.js), win 7 x86.
Report: http://mozmill-daily.blargon7.com/#/remote/report/569158a0e5d1513e2dbac88f3360f4bc
(In reply to Henrik Skupin (:whimboo) from comment #4)
> All crashes so far on crashstats were happening under Windows 7. There are
> 45 crashes in the last 7 days, but none before. So this might be a new
> regression. The first buildid for this crash is 20150109030224.

This sounds very similar to bug 1120252 and probably has the same root cause. It will be interesting to see if the patch that I landed on that bug fixes this or if it's just a band-aid that prevents some of the symptoms.

(In reply to Henrik Skupin (:whimboo) from comment #5)
> Maybe related to one of the changes landed between Jan 8th and 9th?
> 
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=70de2960aa87&tochange=b3f84cf78dc2
> 
> There is at least bug 1109873, which seems to have changed code for layers.

I would consider it unlikely that that bug causes this though.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #7)
> (In reply to Henrik Skupin (:whimboo) from comment #5)
> > Maybe related to one of the changes landed between Jan 8th and 9th?
> > 
> > https://hg.mozilla.org/mozilla-central/
> > pushloghtml?fromchange=70de2960aa87&tochange=b3f84cf78dc2

In fact given this range and considering it's windows only, I'm going to pin it on bug 1107718.
Blocks: 1107718
Flags: needinfo?(bas)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #7)
> It will be interesting to see if the patch that I landed on that bug
> fixes this or if it's just a band-aid that prevents some of the symptoms.

There are more crashes on the latest nightly which includes my fix. See e.g. https://crash-stats.mozilla.com/report/index/fa07958a-3091-49e6-8f58-1fb2c2150113
Forgot Bas is on PTO. CC'ing milan for FYI since this seems like a potentially bad regression.
We did expect bug 1107718 to consolidate some crashes, in that a long tail of random signature crashes may all end up together now.  This could be it.  Did the overall number of crashes increase, do we have ways of tracking that?
Or, it could be a regression from bug 1107718 or something else, I just wanted to raise the chance that we're just seeing things better now.
That might be the case. CC'ing kairo in case he can help us analyze the crash data more in this respect.

I'm still thinking that the landing did regress something, because the crash observed in bug 1120252 (which I believe to also have the same root cause) was a very high-volume topcrash on windows. I would be surprised if that was the result of long-tail crashes all ending up in the same bucket. The stack there also seemed to point to some sort of corruption in the CompositorParent data structures.

nical, does your analysis in comment 2 sound like it might conceivably be expected behaviour as a result of bug 1107718?
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #12)
> nical, does your analysis in comment 2 sound like it might conceivably be
> expected behaviour as a result of bug 1107718?

It is definitely possible: in the patch in that bug, there is a code path where we null out nsWindow's LayerManager, which could cause what I said in comment 2.
(In reply to Nicolas Silva [:nical] from comment #14)
> (In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #12)
> > nical, does your analysis in comment 2 sound like it might conceivably be
> > expected behaviour as a result of bug 1107718?
> 
> It is definitely possible: in the patch in that bug, there is a code path
> where we null out nsWindow's LayerManager, which could cause what I said in
> comment 2.

This all sounds like a very likely scenario, and similarly I do believe this crash will have come in that place of another amount of (equal or larger) other crashes.
Flags: needinfo?(bas)
Another from Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0 ID:20150205030205 CSet: 34a66aaaca81

Report ID 	Date Submitted
bp-00d2345c-57a1-4751-97fc-746972150205
	2/5/2015	11:56 AM
Crash Signature: [@ mozalloc_abort(char const* const) | NS_DebugBreak | mozilla::ipc::MessageChannel::DebugAbort(char const*, int, char const*, char const*, bool) | mozilla::ipc::MessageChannel::~MessageChannel() | mozilla::layers::CompositorParent::Release()] → [@ mozalloc_abort(char const* const) | NS_DebugBreak | mozilla::ipc::MessageChannel::DebugAbort(char const*, int, char const*, char const*, bool) | mozilla::ipc::MessageChannel::~MessageChannel() | mozilla::layers::CompositorParent::Release()] [@ mozallo…
This signature is fairly high on the Firefox 47.0a2 Top Crash list, e.g.
bp-91de2290-03db-46a2-a97d-d03892160318
It's a MOZ_RELEASE_ASSERT now:
http://hg.mozilla.org/releases/mozilla-aurora/annotate/bf6d2a4b1031/ipc/glue/MessageChannel.cpp#l522
Crash Signature: mozilla::layers::CompositorParent::Release] → mozilla::layers::CompositorParent::Release] [@ mozilla::ipc::MessageChannel::~MessageChannel | mozilla::layers::PCompositorParent::~PCompositorParent ]
Mason, can you add GetLastError or whatever is appropriate here, so that we know what the error was in play when we hit this assert?  (Even if the error code is device specific, it may shine some light on this.)
Flags: needinfo?(mchang)
Flags: needinfo?(mchang)
Attachment #8738016 - Flags: review?(milan)
Attachment #8738016 - Flags: review?(milan) → review+
https://hg.mozilla.org/mozilla-central/rev/9ae0a0f6a9ee
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Whoops, leave open since the patch was just logging to see what's going on. The patch didn't actually fix the crash.
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
Hello Mason, this is a top-crash on Fx47 (Beta). It jumped up 18 spots and is #10 right now. Have we been able to use the additional logging to root cause this? The recent up spike is concerning and I am wondering if there a patch in the works that can we uplift to Beta47. Thanks!
Flags: needinfo?(mchang)
I misspoke, this jumped up 57(!!) spots and is at #12 atm.
(In reply to Ritu Kothari (:ritu) from comment #25)
> Hello Mason, this is a top-crash on Fx47 (Beta). It jumped up 18 spots and
> is #10 right now. Have we been able to use the additional logging to root
> cause this? The recent up spike is concerning and I am wondering if there a
> patch in the works that can we uplift to Beta47. Thanks!

I haven't seen this crash on 48 as this was the only place where we added logging. All the reports are only on 47. I wonder if this was fixed in 48?
Flags: needinfo?(mchang)
(In reply to Mason Chang [:mchang] from comment #27)
> 
> I haven't seen this crash on 48 as this was the only place where we added
> logging. All the reports are only on 47. I wonder if this was fixed in 48?

If that is the case, would you guys have a clue what fix from 48 can be cherry-picked and uplifted to 47?
Flags: needinfo?(milan)
Flags: needinfo?(mchang)
Nothing fixed it, I'd say, here's a 48 crash :)

https://crash-stats.mozilla.com/report/index/cdfc19a0-d9eb-4614-b4e9-9d1842160429

Error 6 is ERROR_INVALID_HANDLE.  We could reset the mHandle to nullptr, just in case we're trying to close the channel multiple times? Should be safe enough, and we can see if it does anything to the crash rate.
Flags: needinfo?(milan)
Assignee: nobody → mchang
Crash Signature: mozilla::layers::CompositorParent::Release] [@ mozilla::ipc::MessageChannel::~MessageChannel | mozilla::layers::PCompositorParent::~PCompositorParent ] → mozilla::layers::CompositorParent::Release] [@ mozilla::ipc::MessageChannel::~MessageChannel | mozilla::layers::PCompositorParent::~PCompositorParent ] [@ mozilla::ipc::MessageChannel::~MessageChannel | mozilla::layers::PCompositorBridgeParent::~PCompo…
Flags: needinfo?(mchang)
Posted patch closeHandle.patch (obsolete) — Splinter Review
Just a check to see if we still have an mEvent. If we don't, don't try to close again. It's odd, I'm not sure if we actually share the handle with anyone else though since I don't see it referenced anywhere. Might be closed on the other side first?
Attachment #8747861 - Flags: review?(milan)
Attachment #8747861 - Attachment is obsolete: true
Attachment #8747861 - Flags: review?(milan)
Attachment #8747883 - Flags: review?(milan)
Attachment #8747883 - Flags: review?(milan) → review+
This still seems to be happening[1], and I'm not really sure why we would get an invalid handle. I wonder if this is happening because another function that's using mEvent is already failing[2]

[1] https://crash-stats.mozilla.com/report/index/1eb8a3b6-dcae-4a16-83f1-27d0f2160509
[2] "It is usually not necessary to call CloseHandle if a function that uses a handle fails with ERROR_INVALID_HANDLE, because this error usually indicates that the handle is already invalidated." - https://msdn.microsoft.com/en-us/library/windows/desktop/ms724211(v=vs.85).aspx
Should we just ignore an error if we get this error then?
See Also: → 1237458
This originally landed in 47, bug 1237458, which changed the asserts from debug asserts to release asserts. Do you know if it's ok to ignore this error Bill? The assert in question is at [1], which is failing with an invalid handle.

[1] https://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#517
Flags: needinfo?(wmccloskey)
Log some uses of mEvent in MessageChannel and ensure that they succeed. Then we can see if some other function is setting the handle to an invalid state.
Attachment #8750476 - Flags: review?(milan)
Comment on attachment 8750476 [details] [diff] [review]
Log uses of mEvent

Review of attachment 8750476 [details] [diff] [review]:
-----------------------------------------------------------------

It's useful to have different messages in each of the gfxDevCrash, so that we can tell them apart once this is in beta/release where there is no crash, and we can't tell from the stack which one of them triggered.
Attachment #8750476 - Flags: review?(milan) → review+
(In reply to Milan Sreckovic [:milan] from comment #38)
> Comment on attachment 8750476 [details] [diff] [review]
> Log uses of mEvent
> 
> Review of attachment 8750476 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> It's useful to have different messages in each of the gfxDevCrash, so that
> we can tell them apart once this is in beta/release where there is no crash,
> and we can't tell from the stack which one of them triggered.

Updated the patch to do that. Try looks ok I think - https://treeherder.mozilla.org/#/jobs?repo=try&revision=8b4ca9b63e8b&selectedJob=20653975 . Lots of orange on inbound as well, so I think this is ok.
Flags: needinfo?(wmccloskey)
No crashes with the extra information added in comment 41.
Interestingly, 96.08% of the reports with signature 'mozilla::ipc::MessageChannel::~MessageChannel | mozilla::layers::PCompositorBridgeParent::~PCompositorBridgeParent' have a Spanish locale
(either es-MX or es-ES).
(In reply to Marco Castelluccio [:marco] from comment #43)
> Interestingly, 96.08% of the reports with signature
> 'mozilla::ipc::MessageChannel::~MessageChannel |
> mozilla::layers::PCompositorBridgeParent::~PCompositorBridgeParent' have a
> Spanish locale
> (either es-MX or es-ES).

New report about this issue on SUMO: https://support.mozilla.org/en-US/questions/1144682

And I guess the reporter is probably Spanish.
(In reply to Loic from comment #44)
> (In reply to Marco Castelluccio [:marco] from comment #43)
> > Interestingly, 96.08% of the reports with signature
> > 'mozilla::ipc::MessageChannel::~MessageChannel |
> > mozilla::layers::PCompositorBridgeParent::~PCompositorBridgeParent' have a
> > Spanish locale
> > (either es-MX or es-ES).
> 
> New report about this issue on SUMO:
> https://support.mozilla.org/en-US/questions/1144682
> 
> And I guess the reporter is probably Spanish.

The locale of the Firefox build he's using is 'en-US', but his name looks Spanish. Perhaps there's some Spanish website that is causing this to happen.

Looking at the URLs, the 'gob.mx' sites are pretty frequent.
(In reply to Marco Castelluccio [:marco] from comment #45)
> The locale of the Firefox build he's using is 'en-US', but his name looks
> Spanish. Perhaps there's some Spanish website that is causing this to happen.

He replied he is using es-MX locale.
(In reply to Loic from comment #46)
> (In reply to Marco Castelluccio [:marco] from comment #45)
> > The locale of the Firefox build he's using is 'en-US', but his name looks
> > Spanish. Perhaps there's some Spanish website that is causing this to happen.
> 
> He replied he is using es-MX locale.

Interesting, his crash reports are with an en-US Firefox build (see 'useragent_locale' in the Metadata tab).
Assignee: mchang → nobody
The leave-open keyword is there and there is no activity for 6 months.
:davidb, maybe it's time to close this bug?
Flags: needinfo?(dbolter)
Yeah I think so.
Status: REOPENED → RESOLVED
Closed: 3 years ago9 months ago
Flags: needinfo?(dbolter)
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.