Closed Bug 1133426 Opened 5 years ago Closed 5 years ago

Crash in mozilla::layers::CompositorChild::Destroy()

Categories

(Core :: Graphics: Layers, defect, critical)

38 Branch
All
Windows 7
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla38
Tracking Status
firefox37 + verified
firefox38 + verified

People

(Reporter: whimboo, Assigned: sotaro, NeedInfo)

References

()

Details

(Keywords: crash, topcrash-win, Whiteboard: [mozmill][tbird crash])

Crash Data

Attachments

(2 files, 2 obsolete files)

Running one of our endurance tests with Mozmill we see lots of crashes on Windows 7 machines with the signature like bp-e739b432-19f7-4797-8bcc-077552150216.

Crash Reason 	EXCEPTION_ACCESS_VIOLATION_READ
Crash Address 	0x0

First 10 frames of the stack:

0 	xul.dll 	mozilla::layers::CompositorChild::Destroy() 	gfx/layers/ipc/CompositorChild.cpp
1 	xul.dll 	nsBaseWidget::DestroyCompositor() 	widget/nsBaseWidget.cpp
2 	xul.dll 	nsWindow::OnPaint(HDC__*, unsigned int) 	widget/windows/nsWindowGfx.cpp
3 	xul.dll 	nsWindow::ProcessMessage(unsigned int, unsigned int&, long&, long*) 	widget/windows/nsWindow.cpp
4 	xul.dll 	nsWindow::WindowProcInternal(HWND__*, unsigned int, unsigned int, long) 	widget/windows/nsWindow.cpp
5 	xul.dll 	CallWindowProcCrashProtected 	xpcom/base/nsCrashOnException.cpp
6 	xul.dll 	nsWindow::WindowProc(HWND__*, unsigned int, unsigned int, long) 	widget/windows/nsWindow.cpp
7 	user32.dll 	InternalCallWinProc 	
8 	user32.dll 	GetRealWindowOwner 	
9 	user32.dll 	DispatchClientMessage 	
10 	user32.dll 	__fnDWORD

The endurance test we are running is http://hg.mozilla.org/qa/mozmill-tests/file/default/firefox/tests/endurance/testBookmarks_OpenAndClose/test1.js

This is only happening for Firefox 38 so far, no single crash with Firefox 37.

Crash details as reported by crash-stats

Windows 7 	99.52 %	207
Windows 8.1 	0.48 %	1
This started in build 20150214030238 and has been the #1 nightly crash since then.

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=2f5c5ec1a24b&tochange=a7c177546ca0
The regression range includes:
	7d4f76e0411b	Nicolas Silva — Bug 1120331 - "crash in mozilla::ipc::MessageChannel::Send(IPC::Message*)". r=sotaro

And again this one is all Win7 and all D3D10Warp.dll. Seems like that crash just moved signatures. Nical do you agree?
Flags: needinfo?(nical.bugzilla)
nical is PTO until end of February. I am going take a bug.
Assignee: nobody → sotaro.ikeda.g
(In reply to David Major [:dmajor] (UTC+13) from comment #1)
> This started in build 20150214030238 and has been the #1 nightly crash since
> then.
> 
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=2f5c5ec1a24b&tochange=a7c177546ca0
> The regression range includes:
> 	7d4f76e0411b	Nicolas Silva — Bug 1120331 - "crash in
> mozilla::ipc::MessageChannel::Send(IPC::Message*)". r=sotaro
> 
> And again this one is all Win7 and all D3D10Warp.dll. Seems like that crash
> just moved signatures. Nical do you agree?

It seems just moved the signature. But the change seems to make clear that it is not a problem of ClientLayerManager.
This bug just bite me on Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0 ID:20150218030228 CSet: 9696d1c4b3ba sans e10s.

Report ID 	Date Submitted
bp-eaeaefd8-63cb-42f5-92de-ed7ba2150218
	2/18/2015	11:39 AM
me bp-7ad742b2-3c5c-43f9-b304-f54492150218 thunderbird 20150216030224
I had just mouse clicked on a different folder

FWIW, bp-c434f094-8506-4eed-bc18-a668b2150217 comment " always crashing when opening down the url line (most visited links) or try to copy an url with mouse, not ctrl + c"
Keywords: topcrash-win
Whiteboard: [mozmill] → [mozmill][tbird crash]
re comment 3 - for the 3 user email addresses with the most crashes, each with a long history of mozilla::ipc::MessageChannel::Send crashes, each user's rate of crashing for mozilla::layers::CompositorChild::Destroy is the SAME as for mozilla::ipc::MessageChannel::Send(IPC::Message*) for a similar time period
Alex, since you're seeing this (and the other IPC crash) frequently -- is there any common pattern? Is it more frequent on some websites? Or after specific actions that you do?
Flags: needinfo?(alex_mayorga)
Sadly I haven't fount any rhythm or rime as to what causes the crashes.

It seems to have something to do with mousing into the "Awesome bar".

Here are my crashes from the past week or so FWIW:

Report ID 	Date Submitted
bp-7784571b-3787-4578-b302-1462f2150219
	2/19/2015	10:54 AM
bp-889f82eb-4905-4163-83c4-c7f8d2150219
	2/19/2015	10:39 AM
bp-fbf1e11a-a312-4875-bf34-cc38e2150219
	2/19/2015	10:38 AM
bp-7102fcb2-68fd-499a-a14d-fde2c2150219
	2/19/2015	10:38 AM
bp-05734df1-cb76-45f0-ad8e-021e32150219
	2/19/2015	10:18 AM
bp-eaeaefd8-63cb-42f5-92de-ed7ba2150218
	2/18/2015	11:39 AM
bp-2a63e10c-8353-4488-ac76-9b10f2150213
	2/13/2015	4:33 PM
bp-be5088b3-ca30-4d91-8401-2d26f2150213
	2/13/2015	4:31 PM
bp-5c3c2e9c-36c2-48ed-a986-f5f562150213
	2/13/2015	4:30 PM
bp-f364e215-bf49-4826-8aa6-1be1e2150213
	2/13/2015	11:19 AM
bp-1f003cbd-078a-42fc-a1c6-f38a02150212
	2/12/2015	2:49 PM
bp-0efe8407-73b4-47c2-aea7-aa5162150212
	2/12/2015	12:01 PM

This Nightly is updated almost daily and the system runs on VMware (VMCI sockets STREAM : 0 : 1 : c:\Program Files\VMware\VMware Tools\VSock SDK\bin\win32\vsocklib.dll) and it is accessed through http://www.citrix.com/go/receiver.html

Kindly let me know if there's anything I could provide to help solve this bug.
Flags: needinfo?(alex_mayorga) → needinfo?(dmajor)
I locally modified the code as to often call nsBaseWidget::DestroyCompositor() from nsWindow::OnPaint(). By this change, I could reproduce the almost same crash.
(In reply to alex_mayorga from comment #8)
> Kindly let me know if there's anything I could provide to help solve this bug.

Sotaro knows this code area better than me... Can Alex do anything to help?
Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(dmajor)
(In reply to alex_mayorga from comment #8)
> Sadly I haven't fount any rhythm or rime as to what causes the crashes.
> 
> It seems to have something to do with mousing into the "Awesome bar".

nsWindow::OnPaint() is called when drawing browser's chrome part. Therefore, "Awesome bar" seems to affect to the crash. nsBaseWidget::DestroyCompositor() is called when gfxWindowsPlatform::DidRenderingDeviceReset() detects rendering device's reset. It might say that D3D Device on the machine might be unstable. Then, it might be that graphic driver might be unstable.
Flags: needinfo?(sotaro.ikeda.g)
gfxWindowsPlatform::DidRenderingDeviceReset() is the following.

https://dxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxWindowsPlatform.cpp#1143
call stack of a crash that I re-produced by locally modifying the code. This bug's crash and Bug 1120331 seems to be caused by same problem.
nsBaseWidget::DestroyCompositor() causes ClientLayerManager's destrutor. And it causes nsWindow::GetLayerManager() call. It request new LayerManager creation, new CompositorChild's and new CompositorParent creation. And also current CompositorChild's destruction and current CompositorParent's destruction without correct IPC shutdown.
Comment on attachment 8566917 [details] [diff] [review]
patch - Remove DidComposite() from ClientLayerManager's destrutor

matt.woodrow, can we remove DidComposite() from ClientLayerManager's destrutor? It trigger LayerManager creation and causes the problem like Comment 14.
Attachment #8566917 - Flags: review?(matt.woodrow)
I don't think this is ok.

This happens when the layer manager has promised the refresh driver that it's doing a composite, but we're destroying the layer manager before it completes.

We lie and send the DidComposite message anyway, so that the refresh driver doesn't hang waiting for the response.

I think we should call mTransactionIdAllocator->RevokeTransactionId(mLatestTransactionId); instead, but this might still result in LayerManager creation.

Why are we creating a new layer manager if we've just destroyed the last one? What triggered the destruction?

And why does that crash?
(In reply to Matt Woodrow (:mattwoodrow) from comment #17)
> I don't think this is ok.
> 
> This happens when the layer manager has promised the refresh driver that
> it's doing a composite, but we're destroying the layer manager before it
> completes.
> 
> We lie and send the DidComposite message anyway, so that the refresh driver
> doesn't hang waiting for the response.
> 
> I think we should call
> mTransactionIdAllocator->RevokeTransactionId(mLatestTransactionId); instead,
> but this might still result in LayerManager creation.
> 
> Why are we creating a new layer manager if we've just destroyed the last
> one? What triggered the destruction?

RenderingDeviceReset detect in nsWindow::OnPaint() triggers compositor's destruction. The destruction is done by nsBaseWidget::DestroyCompositor(). Comment 13, Comment 14 might help to understand it.

https://dxr.mozilla.org/mozilla-central/source/widget/windows/nsWindowGfx.cpp#194
 
> And why does that crash?

And DidComposite() from ClientLayerManager causes current using CompositorChild and CompositorParent destruction without correct shutdown. It creates dangling pointers.
Attachment #8566917 - Flags: review?(matt.woodrow)
(In reply to Matt Woodrow (:mattwoodrow) from comment #17)
> I don't think this is ok.
> 
> I think we should call
> mTransactionIdAllocator->RevokeTransactionId(mLatestTransactionId); instead,
> but this might still result in LayerManager creation.

Yes, it resulted to LayerManager creation and caused the crash.
This seems better than previous one.
Attachment #8566917 - Attachment is obsolete: true
(In reply to Sotaro Ikeda [:sotaro] from comment #20)
> Created attachment 8566949 [details] [diff] [review]
> patch - Care about new CompositorChild and CompositorParent creation.
> 
> This seems better than previous one.

The patch at least seems to fix the problem.
I can perfectly reproduce this crash with our Mozmill tests. Just found the time to check that locally. 

Here the steps:
1. Clone http://hg.mozilla.org/qa/mozmill-tests
2. Install Mozmill 2.0.10: https://pypi.python.org/pypi/mozmill
3. Run the tests via `mozmill -b %path/to/firefox% -m mozmill-tests/firefox/tests/functional/testPrivateBrowsing/manifest.ini`

On Windows 7 the tests always when executing the test testPrivateDownloadPanel.js.
whimboo, can you check if attachment 8566949 [details] [diff] [review] fix the problem?
Flags: needinfo?(hskupin)
I run this build a couple of times with the tests and yes, the crash is gone! Yay! Thanks!
Flags: needinfo?(hskupin)
Good! Thanks for the confirmation.
Comment on attachment 8566949 [details] [diff] [review]
patch - Care about new CompositorChild and CompositorParent creation.

Jeff, can you look into the patch?
Attachment #8566949 - Flags: review?(jmuizelaar)
Sotaro, how about this instead?
Attachment #8566949 - Flags: review?(jmuizelaar)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #28)
> Created attachment 8567129 [details] [diff] [review]
> How about this instead
> 
> Sotaro, how about this instead?

Yhea, it is simpler than mine.
Jeff, as we talked, can you check the correctness of what the patch is doing?
Flags: needinfo?(jmuizelaar)
Comment on attachment 8567129 [details] [diff] [review]
How about this instead

Nical, can do you a post commit review of this?
Flags: needinfo?(jmuizelaar)
Attachment #8567129 - Flags: review?(nical.bugzilla)
Attachment #8566949 - Attachment is obsolete: true
[Tracking Requested - why for this release]: Firefox 37 is indeed affected given that this is just a signature change from bug 1120331. So it would be great to get this fixed for Firefox 37 before we have the first beta. If not I will have to turn off all Mozmill tests on Windows 7 for beta, given that this would fail horribly on us while testing all locales.
https://hg.mozilla.org/mozilla-central/rev/b28fd46a1174
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla38
Between this and bug 1120331, what's the status of this issue on Beta 37?
Given how huge this was on Aurora 37, I'm really worried about shipping betas with this crash.
Flags: needinfo?(sotaro.ikeda.g)
Comment on attachment 8567129 [details] [diff] [review]
How about this instead

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 1107718
User impact if declined: crash could happen.
Testing completed: locally tested and comment 25.
Risk to taking this patch (and alternatives if risky): relatively low risk.
String or UUID changes made by this patch: none.
Flags: needinfo?(sotaro.ikeda.g)
Attachment #8567129 - Flags: approval-mozilla-b2g37?
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #36)
> Between this and bug 1120331, what's the status of this issue on Beta 37?
> Given how huge this was on Aurora 37, I'm really worried about shipping
> betas with this crash.

It could be uplifted to Aurora 37. I just put "Approval Request".
(In reply to Sotaro Ikeda [:sotaro] from comment #38)
> It could be uplifted to Aurora 37. I just put "Approval Request".

You asked for it to be uplifted to B2G only. What you you want for 37 right now is approval-beta?
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #39)
> (In reply to Sotaro Ikeda [:sotaro] from comment #38)
> > It could be uplifted to Aurora 37. I just put "Approval Request".
> 
> You asked for it to be uplifted to B2G only. What you you want for 37 right
> now is approval-beta?

Sorry, it was my mistake. Uplift for b2g is not necessary.
Comment on attachment 8567129 [details] [diff] [review]
How about this instead

Approval Request Comment
[Feature/regressing bug #]: Bug 1107718
[User impact if declined]: crash could happen.
[Describe test coverage new/current, TreeHerder]: unchanged
[Risks and why]: locally tested and comment 25.
[String/UUID change made/needed]: none
Attachment #8567129 - Flags: approval-mozilla-b2g37? → approval-mozilla-beta?
Do we need the patch from bug 1120331 in addition to the fix that has been nommed for uplift here?
Flags: needinfo?(sotaro.ikeda.g)
Comment on attachment 8567129 [details] [diff] [review]
How about this instead

Thrilled to see the source of this crash identified and the confirmation of the fix in comment 25. Let's get this onto Beta and confirm with crash data from Beta 2. Beta+
Attachment #8567129 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Lawrence Mandel [:lmandel] (use needinfo) from comment #42)
> Do we need the patch from bug 1120331 in addition to the fix that has been
> nommed for uplift here?

We do not need the patch from bug 1120331. It actually does not fix the crash.
Flags: needinfo?(sotaro.ikeda.g)
Attachment #8567129 - Flags: review?(nical.bugzilla) → review+
Flags: needinfo?(nical.bugzilla)
I have re-enabled our Mozmill tests for Firefox beta. None of our testruns on Windows 7 showed a crash anymore. The same applies for the other branches. Marking as verified fixed. Thanks!
Status: RESOLVED → VERIFIED
I'm sorry to say but something has been re-introduced this crash with yesterdays Nightly builds. I will file a new bug for it.
100% reproducible per above (ok, 4 for 4):
about:config; add New boolean media.peerconnection.video.vp9_enabled = true
https://mozilla.github.io/webrtc-landing/pc_test.html
Start
NI to make sure Sotaro/Matt see this
Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(matt.woodrow)
It is a regression of Bug 1125848. Possible cause of change was just backed out. Bug 1125848 Comment 55.
Flags: needinfo?(sotaro.ikeda.g)
(In reply to Sotaro Ikeda [:sotaro] from comment #51)
> It is a regression of Bug 1125848. Possible cause of change was just backed
> out. Bug 1125848 Comment 55.

jesup, can you confirm if the backout fix the problem?
Flags: needinfo?(rjesup)
Flags: needinfo?(rjesup)
You need to log in before you can comment on or make changes to this bug.