Closed Bug 637367 Opened 13 years ago Closed 13 years ago

Firefox 4.0b12 crash [@ DrawPlugin ]

Categories

(Core :: Graphics, defect)

x86
macOS
defect
Not set
critical

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- Macaw+
status2.0 --- .1-fixed

People

(Reporter: marcia, Assigned: smichaud)

Details

(Keywords: crash, topcrash, Whiteboard: [approved-patches-landed])

Crash Data

Attachments

(3 files, 3 obsolete files)

Seen while reviewing crash stats.  http://tinyurl.com/4atn9ct links to the crashes which are all Mac. So far fairly low volume with 98 crashes since we shipped on Friday and ranks as #2 top Mac crash.


Frame 	Module 	Signature [Expand] 	Source
0 	XUL 	DrawPlugin 	
1 	XUL 	mozilla::layers::ImageLayerOGL::RenderLayer 	gfx/layers/opengl/ImageLayerOGL.cpp:473
2 	XUL 	mozilla::layers::ContainerLayerOGL::RenderLayer 	gfx/layers/opengl/ContainerLayerOGL.cpp:297
3 	XUL 	mozilla::layers::ContainerLayerOGL::RenderLayer 	gfx/layers/opengl/ContainerLayerOGL.cpp:297
4 	XUL 	mozilla::layers::LayerManagerOGL::Render 	gfx/layers/opengl/LayerManagerOGL.cpp:611
5 	XUL 	mozilla::layers::LayerManagerOGL::EndTransaction 	gfx/layers/opengl/LayerManagerOGL.cpp:427
6 	XUL 	mozilla::layers::LayerManagerOGL::EndEmptyTransaction 	gfx/layers/opengl/LayerManagerOGL.cpp:401
7 	XUL 	PresShell::Paint 	layout/base/nsPresShell.cpp:6173
8 	XUL 	nsViewManager::RenderViews 	view/src/nsViewManager.cpp:458
9 	XUL 	nsViewManager::Refresh 	view/src/nsViewManager.cpp:424
10 	XUL 	nsViewManager::DispatchEvent 	view/src/nsViewManager.cpp:925
11 	XUL 	HandleEvent 	view/src/nsView.cpp:161
12 	XUL 	nsChildView::DispatchEvent 	widget/src/cocoa/nsChildView.mm:1807
13 	XUL 	nsChildView::DispatchWindowEvent 	widget/src/cocoa/nsChildView.mm:1817
14 	XUL 	-[ChildView drawRect:inContext:] 	widget/src/cocoa/nsChildView.mm:2874
15 	XUL 	-[ChildView drawRect:] 	widget/src/cocoa/nsChildView.mm:2780
16 	AppKit 	AppKit@0x100cf8 	
17 	AppKit 	AppKit@0xfbf6a 	
18 	AppKit 	AppKit@0x73c5ff 	
19 	AppKit 	AppKit@0xfb81f 	
20 	Foundation 	Foundation@0x16bd9 	
21 	AppKit 	AppKit@0x801aff 	
22 	AppKit 	AppKit@0xfe4ce 	
23 	libSystem.B.dylib 	libSystem.B.dylib@0xa2d8 	
24 	libSystem.B.dylib 	libSystem.B.dylib@0xa2d8 	
25 	AppKit 	AppKit@0x753197 	
26 	AppKit 	AppKit@0x2379a 	
27 	CoreFoundation 	CoreFoundation@0xbc24 	
28 	CoreFoundation 	CoreFoundation@0x104c7 	
29 	CoreFoundation 	CoreFoundation@0xfcd6 	
30 	CoreFoundation 	CoreFoundation@0xfb2e 	
31 	CoreFoundation 	CoreFoundation@0x24814 	
32 	CoreFoundation 	CoreFoundation@0x24688 	
33 	Foundation 	Foundation@0x14dbb 	
34 	AppKit 	AppKit@0xfee59 	
35 	libSystem.B.dylib 	libSystem.B.dylib@0xa2d8 	
36 	AppKit 	AppKit@0x753197 	
37 	AppKit 	AppKit@0x2379a 	
38 	CoreFoundation 	CoreFoundation@0xbc24 	
39 	CoreFoundation 	CoreFoundation@0x13e1a7 	
40 	CoreFoundation 	CoreFoundation@0xfcd6
It is #1 top crasher on Mac OS X in 4.0b12.
blocking2.0: --- → ?
Keywords: topcrash
Almost assuredly blocking; bsmedberg/roc/josh - can you see if this is obviously caused by a change in b12?
Discussed in triage today - bsmedberg speculatively assigned this to matt, I'm just making it so.
Assignee: nobody → matt.woodrow+bugzilla
blocking2.0: ? → final+
Whiteboard: [hardblocker]
I haven't been able to reproduce the crash, so taking a guess here.

The crash address in the linked report is 0x50, in a debug build (x64) mInstanceOwner is 0x60 bytes into nsObjectFrame. Assuming this 16 byte difference is due to debugging code, then it's likely that aObjectFrame was null and we are crashing inside UpdateImageLayer with |this| == null.
Attachment #516009 - Flags: review?(roc)
http://hg.mozilla.org/mozilla-central/rev/7f63b0b1f7c0
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
It still crashes in 4.0b13pre/20110302:
bp-e9de1716-c989-4679-abd3-f289a2110302
bp-c3eaad11-ea15-40f8-ae7b-b29162110302
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
mostly seems to affect 10.6.3 and beyond with highest numbers on 10.6.6.

os verions mar1-3
   1 Mac OS X 10.6.3 10D573
   2 Mac OS X 10.6.4 10F569
   4 Mac OS X 10.6.5 10H574
 147 Mac OS X 10.6.6 10J567
   1 Mac OS X 10.7.0 11A390
That means: our users keep up to date, and this only affects OOPP.
not to much in comments.  a couple that indicate clicking on links at google analytics
there is a tiny residual crash on 3.[0-6].x with this signature, and the new volume regression seems to have gone undetected in 4.0.x builds until beta 12 was released.  here are counts of all the reports with version an build id info since the beginning of the year.

   1 3.0.15 2009101600
   1 3.0.5  2008120121
   1 3.0.7  2009021906
   1 3.6    20100115132715
   7 3.6.13 20101203074432
 298 4.0b12 20110222205430
   2 4.0b13pre 20110302030357
(In reply to comment #10)
> That means: our users keep up to date, and this only affects OOPP.

actually about 1/2 our mac users stay up-to-date if crash counts across mac users are an indication of the distribution of versions in use.  crash counts for mar1-2

18071 Mac OS X 10.6
 9445 Mac OS X 10.5
 7143 Mac OS X 10.4
   67 Mac OS X 10.7
    1 Mac OS X 10.9
I tried to repro this with my Google Analytics account but no luck so far. I emailed someone from Mozilla who crashed and asked if they could give me STR.
What is the volume of this crash now?
About 100 crashes per day, it's the #1/#2 topcrash *on mac*, ~70 overall. 

It seems to correlate pretty highly with having multiple cores, but that may be an accident.

I'm currently wondering if something within the rendering stack is being re-entered by plugin code, which modifies the DOM at an unsafe time during painting. If so, there are two possibilities:

1) We are resolving RPC races incorrectly. This has caused topcrashes in the past, but we haven't made any changes to the race resolution or IPDL protocols involved with mac drawing in any of the recent changes.

2) The plugin is just making unsafe calls. There's nothing we can do about this except implement async painting on mac as we do on other platforms.
> What is the volume of this crash now?
It was happening only in Beta 12, after the patch landing, it happens in nightlies:
20110302: 2
20110301: 0 (patch)
20110228: 0
20110227: 0
20110226: 0
Drivers decided to move this to .x based on previous comments. Please renominate if it spikes up to many many more crashes.
blocking2.0: final+ → .x+
Whiteboard: [hardblocker] → [hardblocker][approved-patches-landed]
Whiteboard: [hardblocker][approved-patches-landed] → [approved-patches-landed]
IMO, this should probably hardblock the final release of 4.0 and not be deferred to a .x release. I encounter it in the first RC1 candidate build around once every-other-day on my Mac (dual core). It is a rather annoying, invasive issue that probably shouldn't be inflicted on uses in a general release.
Nick, we haven't been able to reproduce this. What are you doing at the time of the crash? Are you navigating to or away from a page with plugins? Any particular page?
A couple of crash stacks show these crashes taking place at the
folllowing line:

http://hg.mozilla.org/releases/mozilla-2.0/annotate/5f8f494a4c29/layout/generic/nsObjectFrame.cpp#l5870

So we should try null-checking mInstance at the top of
nsPluginInstanceOwner::DoCocoaEventDrawRect().  Or maybe we should
null-check both mInstance and mObjectFrame, as
nsPluginInstanceOwner::Paint() does.

https://crash-stats.mozilla.com/report/index/f447eada-9345-4931-b3e5-32e642110311
https://crash-stats.mozilla.com/report/index/001e6ef3-7f69-4ea4-b83f-2c12b2110311
Steven, can you put up your proposed patch today? I'd like to get something landed on trunk to verify that it fixes the crashes so we can potentially include it in a RC2 if that happens.
Assignee: matt.woodrow+bugzilla → smichaud
No problem.  It should take me another hour or so.
Attached patch Potential fixSplinter Review
Attachment #518830 - Flags: review?(benjamin)
Attachment #518830 - Flags: review?(benjamin) → review+
Comment on attachment 518830 [details] [diff] [review]
Potential fix

I'd like to get this trivial null-check on trunk now so that we can check crash-stats over the weekend for possible inclusion in RC2, if that happens.
Attachment #518830 - Flags: approval2.0?
Attachment #518830 - Flags: approval2.0? → approval2.0+
Comment on attachment 518830 [details] [diff] [review]
Potential fix

https://crash-stats.mozilla.com/report/index/b4f1ed60-15ab-46e3-a8cf-874f62110313

Oops, my patch doesn't seem to have fixed these crashes.

As this stack shows, it seems that
nsPluginInstanceOwner::DoCocoaEventDrawRect() is being called on a
deleted nsPluginInstanceOwner instance.

I'm not sure what we can do about this (without being able to
reproduce the crashes), though I'll take a closer look tomorrow ... if
no-one else beats me to it.
Attached patch Another potential fix (obsolete) — Splinter Review
> As this stack shows, it seems that
> nsPluginInstanceOwner::DoCocoaEventDrawRect() is being called on a
> deleted nsPluginInstanceOwner instance.

Looking through the code, I don't see how an nsObjectFrame object
could have a non-zero mInstanceOwner that's also invalid (since
mInstanceOwner holds a strong reference).  So now I think it's the
nsObjectFrame object itself that's been deleted.

Here's a patch that should, I think, prevent
nsObjectFrame::UpdateImageLayer() from being called (from DrawPlugin
and MacIOSurfaceImageOGL::Update()) on a deleted nsObjectFrame object.
Attachment #519289 - Flags: review?(benjamin)
Attachment #519289 - Flags: review?(benjamin) → review?(roc)
Comment on attachment 519289 [details] [diff] [review]
Another potential fix

I've thought of a better way to do this.

New patch coming up.
Attachment #519289 - Attachment is obsolete: true
Attachment #519289 - Flags: review?(roc)
Attachment #519429 - Flags: review?(roc)
Why do you need the DestroyCallback?

A simpler way to do this would probably be to let the DrawPlugin callback data be a *strong* reference to the nsPluginInstanceOwner. Then if DrawPlugin gets called after the nsObjectFrame has been destroyed, its mObjectFrame will be null and we can bail out.
> Why do you need the DestroyCallback?

I was afraid there might be several calls to the UpdateCallback from
the MacIOSurfaceImage created in
nsPluginInstanceOwner::SetCurrentImage(), or none at all.  In that
case I wouldn't be able to delete the reference passed by
SetUpdateCallback() in DrawPlugin().  This is true regardless of what
kind of strong reference I pass in SetUpdateCallback() (whether it's
an nsWeakFrame or an nsPluginInstanceOwner).

Now that I look again, I see that SetCurrentImage() is called from
nsObjectFrame::UpdateImageLayer() (which creates a new
MacIOSurfaceImage).  Which I guess means there will be at most one
call to the UpdateCallback from any MacIOSurfaceImage.  But in
principle it might still be possible that the UpdateCallback will
never be called -- in which case the strong reference passed in
SetUpdateCallback() would be leaked.

Is it guaranteed that there's exactly one call to the UpdateCallback
from every MacIOSurfaceImage on which SetUpdateCallback() is called?
> A simpler way to do this would probably be to let the DrawPlugin
> callback data be a *strong* reference to the nsPluginInstanceOwner.

I could do this, but I need to know where I can release this strong
reference.  So I need to know your response to my previous comment.
Oh right, so you do need the destroy callback. Sorry! The other part of what I said still holds, use a strong pointer to the nsPluginInstanceOwner instead of an nsWeakFrame.

Instead of adding a destroy callback, the best approach might be to add a LayerUserDataSet to Image so we can have user-data on Images. Then the strong reference can be held by a LayerUserData attached to the Image and will automatically be removed when the Image dies. But I'll understand if you don't want to make that change.
(and I don't have any other need for Image user-data right now.)
> Instead of adding a destroy callback, the best approach might be to
> add a LayerUserDataSet to Image so we can have user-data on
> Images. Then the strong reference can be held by a LayerUserData
> attached to the Image and will automatically be removed when the
> Image dies. But I'll understand if you don't want to make that
> change.

I'm completely unfamiliar with that code, so it should probably be
someone else who does this.

> The other part of what I said still holds, use a strong pointer to
> the nsPluginInstanceOwner instead of an nsWeakFrame

Will do.  New patch coming up.
Attachment #519429 - Attachment is obsolete: true
Attachment #519518 - Flags: review?(roc)
Attachment #519429 - Flags: review?(roc)
Comment on attachment 519518 [details] [diff] [review]
Another potential fix, rev2 (follow Roc's suggestion)

This looks good, but in nsPluginInstanceOwner::PrepareToStop, please clear out the callbacks on the image and unref 'this', to make sure we don't leak.
> This looks good, but in nsPluginInstanceOwner::PrepareToStop, please
> clear out the callbacks on the image and unref 'this', to make sure
> we don't leak.

I don't understand.  Won't this bring back the original bug?

Won't OnDestroyImage() always be called from the MacIOSurfaceImageOGL
destructor, and won't this prevent any leaks?
Maybe I should call aContainer->SetCurrentImage(nsnull) from PrepareToStop()?
Oops, I see this is already happening.
(In reply to comment #40)
> Maybe I should call aContainer->SetCurrentImage(nsnull) from PrepareToStop()?

That won't remove the callbacks on the image though, if there happens to be a reference to the image keeping it alive.

(In reply to comment #39)
> > This looks good, but in nsPluginInstanceOwner::PrepareToStop, please
> > clear out the callbacks on the image and unref 'this', to make sure
> > we don't leak.
> 
> I don't understand.  Won't this bring back the original bug?

No, because if the callbacks have been cleared, we won't try to use a dangling nsObjectFrame pointer.

> Won't OnDestroyImage() always be called from the MacIOSurfaceImageOGL
> destructor, and won't this prevent any leaks?

Not if there's a cycle involving the image and the pluginInstanceOwner. (I actually don't think there is, but I want to make sure it doesn't matter.)
OK.

Come to think of it, clearing the image's update callback in
nsPluginInstanceOwner::PrepareToStop() should be enough by itself to
fix these crashes:  It should stop DrawPlugin ever being called after
the nsObjectFrame it wants to reference has been deleted.  I can get
rid of the destroy callback and my other changes.

I'll prepare a new patch that does only this.  How does that sound to
you?
(Following up comment #43)

Now I see why my suggestion won't work -- there might have been many
other images created besides the one that happens to be current when
PrepareToStop() is called.

I'll just add calls to clear the callbacks in PrepareToStop() to my
previous patch.
How's this?
Attachment #519518 - Attachment is obsolete: true
Attachment #519548 - Flags: review?(roc)
Attachment #519518 - Flags: review?(roc)
-        oglImage->SetCallback(&DrawPlugin, mObjectFrame);
+        NS_ADDREF_THIS();
+        oglImage->SetUpdateCallback(&DrawPlugin, this);
+        oglImage->SetDestroyCallback(&OnDestroyImage);

Ooh, I just realized that we need to be sure to only addref once per image here. So only do this block if the image doesn't already have a destroy-callback.
+      oglImage->SetUpdateCallback(nsnull, nsnull);
+      oglImage->SetDestroyCallback(nsnull);
+      // If we have a current image here, its destructor hasn't yet been
+      // called, so OnDestroyImage() can't yet have been called.  So we need
+      // to do ourselves what OnDestroyImage() would have done.
+      NS_RELEASE_THIS();

And it's probably worth checking here and only doing this block if there is a destroy-callback set. Then the invariant is --- if there's a destroy-callback, we need to release 'this'.
> Ooh, I just realized that we need to be sure to only addref once per
> image here.

That's already taken care of -- every call to
nsIPluginInstance_MOZILLA_2_0_BRANCH::GetImage() creates a new image.

> And it's probably worth checking here and only doing this block if
> there is a destroy-callback set.

I assume that the only way for an image to have been created and made
current is via a call to nsPluginInstanceOwner::SetCurrentImage().  So
in every case a destroy-callback would have already been set.
Comment on attachment 519548 [details] [diff] [review]
Another potential fix, rev3 (follow Roc's suggestion)

ok!
Attachment #519548 - Flags: review?(roc) → review+
Comment on attachment 519548 [details] [diff] [review]
Another potential fix, rev3 (follow Roc's suggestion)

> ok!

Thanks.  Whew :-)

I'd like to get this on the trunk reasonably soon, so that we can find
out if it does actually fix this bug -- which is after all the #1
topcrasher on the Mac.

The patch isn't trivial.  But it's quite simple, and should be
low-risk.
Attachment #519548 - Flags: approval2.0?
>> Ooh, I just realized that we need to be sure to only addref once
>> per image here.
>
> That's already taken care of -- every call to
> nsIPluginInstance_MOZILLA_2_0_BRANCH::GetImage() creates a new
> image.

I've added a comment to this effect to
nsPluginInstanceOwner::SetCurrentImage(), which I'll include when I
land my patch.
Nick: If you can give us the info in Comment 20 as well as the flash version you are using, that would be great. Our QA team has not yet been able to reproduce the crash.  

(In reply to comment #19)
> IMO, this should probably hardblock the final release of 4.0 and not be
> deferred to a .x release. I encounter it in the first RC1 candidate build
> around once every-other-day on my Mac (dual core). It is a rather annoying,
> invasive issue that probably shouldn't be inflicted on uses in a general
> release.
Also exactly what steps you're doing at the time of the crash: opening a new tab, closing a tab, are there multiple youtube videos playing, or whatever is actually happening.
If anyone does figure out how to reproduce these crashes, here's a tryserver build to test with, made using my rev3 patch (the one Roc r+ed):

http://ftp.mozilla.org/pub/mozilla.org/firefox/tryserver-builds/smichaud@pobox.com-0183b48015a3/tryserver-macosx64/firefox-4.0b13pre.en-US.mac.dmg
I believe I now have some STR this bug, but so far I have only been able to generate a crash once using the latest trunk:

1. Download and install http://www.adobe.com/products/contribute/. The key seems to be Adobe Contribute Toolbar 6.0 true {01A8CA0A-4C96-465b-A49B-65C46FAD54F9} which gets installed as an addon into Firefox as part of this install. You will have to register and install the trial version.
2. Click on this link: http://consumerist.com/2011/03/debtor-turns-entire-suv-into-anti-bofa-collage.html

https://crash-stats.mozilla.com/report/index/bp-72f232c1-a1a6-4fd6-b844-44cb02110316 is my report.
The second time I crashed I got a different stack, which is the one from Bug 636707. In this case I clicked on a link on consumerist.com in the "Related posts" section to generate the crash - https://crash-stats.mozilla.com/report/index/bp-49aa6d03-b11a-4462-a829-e96212110316.
I just reproduced the crash on a different machine. This time my steps were similar to Comment 55, except I did the following:

1. Load http://consumerist.com
2. In the Adobe Contribute toolbar, select "Edit in Contribute" - I had the program running in the background at the time.
3. Click on a "Related Posts" link.
4. Crash.
Can we get Steven's patch on the trunk? That way we can see if at least the trunk crashes go away. I still haven't been able to reproduce this crash consistently.
Comment on attachment 519548 [details] [diff] [review]
Another potential fix, rev3 (follow Roc's suggestion)

a=beltzner to land on mozilla-central to see if it repairs the issue
Attachment #519548 - Flags: approval2.0? → approval2.0+
Comment on attachment 519548 [details] [diff] [review]
Another potential fix, rev3 (follow Roc's suggestion)

Landed on trunk:
http://hg.mozilla.org/mozilla-central/rev/4da037fe27b0

It should take 2-3 days to tell whether or not this helps.
Comment on attachment 519548 [details] [diff] [review]
Another potential fix, rev3 (follow Roc's suggestion)

I also needed to land a breakage patch :-(
http://hg.mozilla.org/mozilla-central/rev/59f8f339653f
(Following up comment #61)

For Windows and Linux.
Of late I had more luck crashing after installing Test Pilot 1.1 and simply clicking links on http://consumerist.com/. Using Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b13pre) Gecko/20110318 Firefox/4.0b13pre.

https://crash-stats.mozilla.com/report/index/bp-d0c28bde-aeca-4052-9419-dfe402110318 is one of the reports.

Adding Jono to the bug.
There haven't been any of these crashes since my latest patch landed.  So it looks like this bug is fixed!
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
I just hit this crash using Firefox 4.0 final. Is this patch supposed to be in the final release?
No, it did not make it.
I would like to try and get this in the first .x release if possible. For a Mac only crash, it's pretty high up on the overall list.
I'd be happy to land my latest patch on the 2.0.x branch (or its
equivalent) -- once that's been created, and I have approval.
Steven, Marcia -- have we tested Steven's patch on a profile with Test Pilot 1.1 installed?
I haven't.
This is in the top 15 for 4.0 even if it's only a mac crash. We have fixed it so I would like to see it in 4.0.1.
Technically my rev3 patch (the one that fixed this bug) already has approval2.0+, and there exists a mozilla-2.0 branch that seems live (http://hg.mozilla.org/releases/mozilla-2.0/).  Should I just go ahead and land my patch there?
blocking2.0: .x+ → Macaw
Approved for mozilla2.0 repository, a=dveditz for release-drivers
I hit what looks like this same crash today with FF4 watching a video on youtube:

https://crash-stats.mozilla.com/report/index/0dcb1b52-4be5-49d0-b9d1-0b15e2110428

Do you want a new bug or reopen this?
Benjamin tells me on irc that this is expected for 4.0 vs. 4.0.1, please ignore.
Crash Signature: [@ DrawPlugin ]
I see a similar crash on Firefox 5 beta in 64 bit mode.

https://crash-stats.mozilla.com/report/index/bp-9a14c30b-1a52-4c08-a571-233df2110613
(In reply to comment #77)

Your crash is unrelated (though it's true both problems happen under a
call to nsObjectFrame.cpp's DrawPlugin()).

Your bug is actually two bugs:

1) A plugin process appears to have hung or crashed.

2) The browser process has crashed while attempting to time out a
   failed call to the plugin process.

To find related bugs, you need to search on "WriteMinidump" in the
summary or in a comment.  Just now I've found a bunch of seemingly
related bugs with "WriteMinidump" in the summary, but I don't know if
any of them is actually your bug.  Take a look for yourself, and
decide whether you want to open a new bug.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: