Closed Bug 785348 Opened 13 years ago Closed 13 years ago

crash in nsTreeBodyFrame::BuildDisplayList mainly with Java plugin

Categories

(Core :: Layout, defect)

18 Branch
x86
Windows 7
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla20
Tracking Status
firefox17 + unaffected
firefox18 + verified
firefox19 --- verified
firefox20 --- verified

People

(Reporter: scoobidiver, Assigned: roc)

References

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(3 files)

It first appeared in 17.0a1/20120821135628. The regression range is: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=c676b554c7bb&tochange=1b51c7bf1e05 Many crashes contain Java in their stack trace. It might be a regression from bug 341604. Signature nsTreeBodyFrame::BuildDisplayList(nsDisplayListBuilder*, nsRect const&, nsDisplayListSet const&) More Reports Search UUID 6fac8c31-60f5-405a-ba2b-958662120823 Date Processed 2012-08-23 23:01:25 Uptime 121 Install Age 1.3 hours since version was first installed. Install Time 2012-08-23 21:42:30 Product Firefox Version 17.0a1 Build ID 20120822030558 Release Channel nightly OS Windows NT OS Version 6.1.7601 Service Pack 1 Build Architecture x86 Build Architecture Info AuthenticAMD family 16 model 5 stepping 3 Crash Reason EXCEPTION_ACCESS_VIOLATION_READ Crash Address 0x0 App Notes AdapterVendorID: 0x10de, AdapterDeviceID: 0x0615, AdapterSubsysID: 11453842, AdapterDriverVersion: 9.18.13.479 D2D? D2D+ DWrite? DWrite+ D3D10 Layers? D3D10 Layers+ Processor Notes This dump is too long and has triggered the automatic truncation routine EMCheckCompatibility True Adapter Vendor ID 0x10de Adapter Device ID 0x0615 Total Virtual Memory 4294836224 Available Virtual Memory 3571474432 System Memory Use Percentage 20 Available Page File 22636716032 Available Physical Memory 10236624896 Frame Module Signature Source 0 xul.dll nsTreeBodyFrame::BuildDisplayList layout/xul/base/src/tree/src/nsTreeBodyFrame.cpp:2780 1 xul.dll nsIFrame::BuildDisplayListForChild layout/generic/nsFrame.cpp:2179 2 xul.dll nsBoxFrame::BuildDisplayList layout/xul/base/src/nsBoxFrame.cpp:1338 3 xul.dll nsIFrame::BuildDisplayListForChild layout/generic/nsFrame.cpp:2179 4 xul.dll nsBoxFrame::BuildDisplayList layout/xul/base/src/nsBoxFrame.cpp:1338 5 xul.dll nsIFrame::BuildDisplayListForChild layout/generic/nsFrame.cpp:2179 6 xul.dll nsBoxFrame::BuildDisplayList layout/xul/base/src/nsBoxFrame.cpp:1338 7 xul.dll nsIFrame::BuildDisplayListForChild layout/generic/nsFrame.cpp:2179 8 xul.dll nsBoxFrame::BuildDisplayList layout/xul/base/src/nsBoxFrame.cpp:1338 9 xul.dll nsMenuPopupFrame::BuildDisplayList layout/xul/base/src/nsMenuPopupFrame.cpp:1758 10 xul.dll nsIFrame::BuildDisplayListForStackingContext layout/generic/nsFrame.cpp:1882 11 xul.dll nsLayoutUtils::PaintFrame layout/base/nsLayoutUtils.cpp:1729 12 xul.dll PresShell::Paint layout/base/nsPresShell.cpp:5337 13 xul.dll nsViewManager::ProcessPendingUpdatesForView view/src/nsViewManager.cpp:438 14 xul.dll nsViewManager::ProcessPendingUpdatesForView view/src/nsViewManager.cpp:406 15 xul.dll nsViewManager::ProcessPendingUpdates view/src/nsViewManager.cpp:1217 16 xul.dll nsRefreshDriver::Notify layout/base/nsRefreshDriver.cpp:421 17 xul.dll nsTimerImpl::Fire xpcom/threads/nsTimerImpl.cpp:476 18 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:624 19 xul.dll NS_ProcessPendingEvents_P obj-firefox/xpcom/build/nsThreadUtils.cpp:170 20 xul.dll nsBaseAppShell::NativeEventCallback widget/xpwidgets/nsBaseAppShell.cpp:97 21 xul.dll nsAppShell::EventWindowProc widget/windows/nsAppShell.cpp:96 ... More reports at: https://crash-stats.mozilla.com/report/list?signature=nsTreeBodyFrame%3A%3ABuildDisplayList%28nsDisplayListBuilder*%2C+nsRect+const%26%2C+nsDisplayListSet+const%26%29
A longer stack like https://crash-stats.mozilla.com/report/index/7cbab8b9-6b29-4e7f-8a83-d15542120829 seems to show what is going on here. We are stopping a java plugin, the java plugin code somehow re-enters our event loop and after that anything can go wrong because we are in an unsafe intermediate state because we are in the middle of destroying stuff. Is there anything we can do to protect ourselves against what the java plugin is doing here?
The only thing I can think of is to add API to nsBaseAppShell to completely ignore calls to NativeEventCallback temporarily, and use that API from nsNPAPIPluginInstance::Stop around the call to pluginFunctions->destroy.
... and maybe other users of NS_TRY_SAFE_CALL_RETURN.
I noticed that nsObjectLoadingContent::DoStopPlugin has a delayed stop option that dispatches a runnable to finish the plugin stop process when we get back to the event loop. Is that an option?
It looks like bug 90268 changed the behaviour here. My reading of the code says that in this situation before bug 90268 we'd call StopPluginInternal with aDelayedStop = true from nsObjectFrame::Destroy. But after bug 90268 the only way aDelayedStop can be true is we are on Windows and we are stopping a real audio plugin.
This is green on try server, whatever that means.
Attachment #657533 - Flags: feedback?
Attachment #657533 - Flags: feedback? → feedback?(joshmoz)
Attachment #657533 - Flags: feedback?(jschoenick)
Comment on attachment 657533 [details] [diff] [review] delayed stop most StopPluginInstance calls There are known issues with delayed stop on plugins, such as protochains being thrown out of sync in certain race conditions. We really should make delayed stop the default for all plugins, but I think bug 767635 should be fixed first, otherwise I suspect we'll see a lot of regressions. If this crash is significant, we could try something more conservative like forcing delayed-stop for java (in addition to realplugin)
Attachment #657533 - Flags: feedback?(jschoenick) → feedback-
Attachment #657533 - Flags: feedback?(joshmoz)
Ok, thanks for the info. This problem is probably causing other crashes too, so it would be nice to fix this general issue.
Even if we do delayed stop, I think it would also be worth doing what I said in comment #2 as well.
It's #23 top browser crasher in 17.0a2 and #18 in 18.0a1.
Keywords: topcrash
Is it agreed with comment 0 that bug 341604 caused this? If not, is this a regression from something else? Should we be looking for a backout here instead of a forward fix?
(In reply to Lukas Blakk [:lsblakk] from comment #11) > Is it agreed with comment 0 that bug 341604 caused this? If not, is this a > regression from something else? Should we be looking for a backout here > instead of a forward fix? I don't think it was bug 341604. I don't think there is a backout that will fix this. I think this is a problem we've had for a long time. I think the reason this signature has jumped up recently is that we've fixed other crashes where we used to crash in the same situation. But instead of hitting those we now hit this.
(In reply to Timothy Nikkel (:tn) from comment #12) > I think the reason this signature has jumped up recently is that we've fixed > other crashes where we used to crash in the same situation. In that case, that crash fix should be in the regression range of comment 0. It's not bug 700583 hat has been uplifted to 16.0 since and I don't see other crash fixes. Here are a few comments: "exiting" "Clicked "Restart to update"" "Firefox crashed right after I closed it... So no harm done." "Reloading to apply update"
(In reply to Scoobidiver from comment #14) > It stopped in 17.0a2/20120918. The Aurora working range is: > http://hg.mozilla.org/releases/mozilla-aurora/ > pushloghtml?fromchange=88b15f553474&tochange=7676a9a06403 Nothing pops out in that.
(In reply to Scoobidiver from comment #13) > Here are a few comments: > "exiting" > "Clicked "Restart to update"" > "Firefox crashed right after I closed it... So no harm done." > "Reloading to apply update" From that I would guess a page with a java plugin is loaded and they are closing Firefox (either normally or to update).
(In reply to Timothy Nikkel (:tn) from comment #16) > (In reply to Scoobidiver from comment #13) > > Here are a few comments: > > "exiting" > > "Clicked "Restart to update"" > > "Firefox crashed right after I closed it... So no harm done." > > "Reloading to apply update" > > From that I would guess a page with a java plugin is loaded and they are > closing Firefox (either normally or to update). Juan - can you try reproducing exiting/updating with a Java plugin installed on Firefox 18?
Keywords: qawanted
QA Contact: jbecerra
Updating the version as per comment 14.
Version: 17 Branch → 18 Branch
I haven't been able to reproduce this on Win7 (32 and 64bit) and latest Java installed, trying a combination of scenarios where: * I have a demo applet running and I exit Aurora * I have a demo applet running and I update to the latest Aurora * I have a demo applet running as well as Flash and try exiting or updating * I have click-to-play plugin prefs enabled and try the above * I have social preference enabled and try the above So far no luck.
The four comments talk about closing Firefox or restarting after update.
(In reply to Scoobidiver from comment #13) > (In reply to Timothy Nikkel (:tn) from comment #12) > > I think the reason this signature has jumped up recently is that we've fixed > > other crashes where we used to crash in the same situation. > In that case, that crash fix should be in the regression range of comment 0. > It's not bug 700583 hat has been uplifted to 16.0 since and I don't see > other crash fixes. > > Here are a few comments: > "exiting" > "Clicked "Restart to update"" > "Firefox crashed right after I closed it... So no harm done." > "Reloading to apply update" I've tried to reproduce this issue as you mentioned and in a few different steps too, but I couldn't reproduce this problem. I've done tests on theese builds: Mozilla/5.0 (Windows NT 6.1; rv:18.0) Gecko/18.0 Firefox/18.0 (20121020042013) Mozilla/5.0 (Windows NT 6.1; rv:18.0) Gecko/18.0 Firefox/18.0 (20121031042011) Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/19.0 Firefox/19.0 (20121121042014)
(In reply to Scoobidiver from comment #20) > The four comments talk about closing Firefox or restarting after update. I wasn't able to reproduce on FF 18b1 on Win XP, Win 7 32/64 with j7u7, j7u9.
Also no sign of crashes on 18b2 after the update.
roc - we're not able to reproduce here. Has anything changed in this code in the 18 timeframe? Can we perform some code investigation?
Assignee: nobody → roc
The fix range in comment #14 suggests the fixing patch was http://hg.mozilla.org/releases/mozilla-aurora/rev/cbb5d4bf1617. However, FF18 doesn't have that patch, and comment #23 says this is fixed on FF18. So I don't know. If crash-stats says this isn't happening on FF18 or later, can we just mark this WFM?
It still happening at least as recently at 2012-12-02 nightly builds.
Yes, it's the #5 browser topcrash on 18.0b2, so surely still happening. :(
These two stacks are particularly interesting: https://crash-stats.mozilla.com/report/index/136e6266-d1b6-48d9-b5fe-82a562121205 https://crash-stats.mozilla.com/report/index/064c1ab0-332f-4e50-8c39-f27f52121205 We're in DoStopPlugin (Java I guess) while closing the window and it's spinning the native event loop, so we reenter our painting code during document destruction :-(. Looks like nsTreeBodyFrame::BuildDisplayList crashes because GetContent()->GetCurrentDoc() returns null, since the element has been unbound from the document. Not surprising. We could wallpaper that easily enough. But we should probably do something at a higher level to avoid doing work inside a DoStopPlugin call. jschoenick, any ideas?
We should ideally always be queuing an event to stop plugins rather than doing it synchronously, but this codepath is fairly broken currently. I opened bug 818785 for this, but it depends on non-trivial work in a few other bugs. If we think this is only happening at document teardown, a intermediate solution might be to try to do asynchronous stopping of java only, and only at document destruction. This works around most of the issues I know of for delayed stop, which involve re-using the same tag for new plugin instances or re-entrance due to DOM manipulation. We also run java in-process still, which might contribute, but I'm not sure what the issues are there.
(In reply to John Schoenick [:johns] from comment #30) > We should ideally always be queuing an event to stop plugins rather than > doing it synchronously, but this codepath is fairly broken currently. I > opened bug 818785 for this, but it depends on non-trivial work in a few > other bugs. I agree. The two patches I've just attached should help work around the problem. They might, however, just shift problems to other crash signatures. I think parts 1 and 2 would be worth having on trunk. Part 2 could be easily landed on branches, it's very safe ... but it might not help much. (In reply to John Schoenick [:johns] from comment #30) > If we think this is only happening at document teardown, a intermediate > solution might be to try to do asynchronous stopping of java only, and only > at document destruction. This works around most of the issues I know of for > delayed stop, which involve re-using the same tag for new plugin instances > or re-entrance due to DOM manipulation. Can you prepare a patch for that? I think we should land a patch for that on trunk in case we need it later.
Comment on attachment 689077 [details] [diff] [review] Part 1: Track when we've called into plugin code. While we're in plugin code, never run the refresh driver >dom/plugins/base/nsPluginSafety.h > #define NS_TRY_SAFE_CALL_VOID(fun, pluginInst) \ > PR_BEGIN_MACRO \ >- PRIntervalTime startTime = PR_IntervalNow(); \ >+ PRIntervalTime startTime = NS_NotifyBeginPluginCall(); \ >+ NS_NotifyBeginPluginCall(); \ oops >layout/base/nsRefreshDriver.cpp >+ NS_WARNING("Refresh driver should not run during plugin call!"); NS_ERROR
Comment on attachment 689078 [details] [diff] [review] Part 2: nsTreeBodyFrame might as well use OwnerDoc(), it's a tiny bit more efficient and can never be null I think this should be a GetContent()->GetCurrentDoc() null-check instead. I think we shouldn't even be here if it's null. Perhaps also returning an error code would be the right thing to do (although I haven't checked what consumers does in that case).
Attachment #689078 - Flags: review?(matspal) → review-
(In reply to Mats Palmgren [:mats] from comment #34) > >dom/plugins/base/nsPluginSafety.h > > #define NS_TRY_SAFE_CALL_VOID(fun, pluginInst) \ > > PR_BEGIN_MACRO \ > >- PRIntervalTime startTime = PR_IntervalNow(); \ > >+ PRIntervalTime startTime = NS_NotifyBeginPluginCall(); \ > >+ NS_NotifyBeginPluginCall(); \ > > oops Fixed locally. > >layout/base/nsRefreshDriver.cpp > >+ NS_WARNING("Refresh driver should not run during plugin call!"); > > NS_ERROR I'm not sure about this. The problem is that it can be triggered by buggy plugins so doesn't necessarily mean we have a bug in our code. But I don't feel strongly about it so I'll change it to NS_ERROR if you confirm. (In reply to Mats Palmgren [:mats] from comment #35) > I think this should be a GetContent()->GetCurrentDoc() null-check instead. > I think we shouldn't even be here if it's null. That's true, but I think patch 1 or async stop should stop us from reaching here. This patch is really just cleanup (with a possible side effect of reducing our crashes). I wouldn't want to make every frame caller of GetContent()->GetCurrentDoc() to have to check for null. > Perhaps also returning an error code would be the right thing to do > (although I haven't checked what consumers does in that case). I don't think we should return an error code; handling this as an error would lead to unwanted complexity.
Looking the crash stack in bp-064c1ab0-332f-4e50-8c39-f27f52121205 it seems to me that the design error is in DocumentViewerImpl::Destroy (in stack frame #76): http://hg.mozilla.org/releases/mozilla-beta/annotate/6eb5edeb9f52/layout/base/nsDocumentViewer.cpp#l1616 I think we should somehow "turn off" the refresh driver for this pres shell before calling mDocument->Destroy(). We're about to destroy it anyway on line 1649. Can we do something like that instead of part 1?
(In reply to Mats Palmgren [:mats] from comment #37) > Looking the crash stack in bp-064c1ab0-332f-4e50-8c39-f27f52121205 > it seems to me that the design error is in DocumentViewerImpl::Destroy > (in stack frame #76): > http://hg.mozilla.org/releases/mozilla-beta/annotate/6eb5edeb9f52/layout/ > base/nsDocumentViewer.cpp#l1616 > > I think we should somehow "turn off" the refresh driver for this pres shell > before calling mDocument->Destroy(). We're about to destroy it anyway on > line 1649. Can we do something like that instead of part 1? Maybe. What if we destroy a subdocument and that spins a nested event loop in a plugin Stop? The toplevel refresh driver could still run. We'd need to make sure it couldn't find the document that's in the process of being destroyed. That sounds a bit hairy. Also, it seems to me like an indirect way of addressing this problem. On platforms that don't have plugins, or when we aren't running any plugins, that will be unnecessary code and it won't be easy to tell why we're doing it that way.
So I think part 1 is a more robust and more direct approach to addressing this problem.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #36) > > NS_ERROR > > I'm not sure about this. The problem is that it can be triggered by buggy > plugins so doesn't necessarily mean we have a bug in our code. But I don't > feel strongly about it so I'll change it to NS_ERROR if you confirm. Yes please - I think it would be valuable to us to know which plugins are buggy and in which situations. It'll also be valuable to the developers of the buggy plugins to know that they are doing something wrong. We can always remove it if it turns out to be too noisy.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #38) > The toplevel refresh driver could still run. We'd need to > make sure it couldn't find the document that's in the process of being > destroyed. That sounds a bit hairy. Isn't that what nsIPresShell::IsPaintingSuppressed() is for? > Also, it seems to me like an indirect way of addressing this problem. On > platforms that don't have plugins, or when we aren't running any plugins, > that will be unnecessary code and it won't be easy to tell why we're doing > it that way. Ignoring paint requests for the associated pres shell during nsDocument::Destroy() would be a performance win as well as make the code more robust (not just for plugins). A little experiment doing that: https://hg.mozilla.org/try/rev/5b305ae8a901 I think that's still too complicated, so I think we should simply destroy the pres shell *before* destroying the document: https://hg.mozilla.org/try/rev/169e2acc9c46 now it's clear that any abuse from a plugin Stop will not cause problems for any frame/pres shell methods, because it does not exist. This seems like a more natural order of destruction to me.
Comment on attachment 689077 [details] [diff] [review] Part 1: Track when we've called into plugin code. While we're in plugin code, never run the refresh driver I don't particularly like this patch (because it wallpapers a situation we shouldn't allow in the first place), but I can live with it. I guess it's less risky than destroying the pres shell first. So, r=mats with the points in comment 34 fixed.
Attachment #689077 - Flags: review?(matspal) → review+
(In reply to Mats Palmgren [:mats] from comment #41) > I think that's still too complicated, so I think we should simply > destroy the pres shell *before* destroying the document: > https://hg.mozilla.org/try/rev/169e2acc9c46 > now it's clear that any abuse from a plugin Stop will not cause problems > for any frame/pres shell methods, because it does not exist. This seems > like a more natural order of destruction to me. I like that patch!
Great, then I think it's something we could try on trunk. I think it's got higher risk for regressions though, so perhaps it's not suitable for branches. Fwiw, it's green on Try so far and I did test it briefly with Flash and Java on Linux, but not on other platforms so far.
Comment on attachment 689078 [details] [diff] [review] Part 2: nsTreeBodyFrame might as well use OwnerDoc(), it's a tiny bit more efficient and can never be null I think this patch is unnecessary, since part 1 aims to make this not occur. If we are wrong, then it's a safe null-pointer crash and I'd prefer that over this patch.
If this sticks on m-c, let's get this into beta 4 (going to build tomorrow EOD PT).
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
(In reply to Mats Palmgren [:mats] from comment #44) > Great, then I think it's something we could try on trunk. I think it's got > higher risk for regressions though, so perhaps it's not suitable for > branches. I agree.
Comment on attachment 689077 [details] [diff] [review] Part 1: Track when we've called into plugin code. While we're in plugin code, never run the refresh driver [Approval Request Comment] Bug caused by (feature/regressing bug #): unknown User impact if declined: crashiness Testing completed (on m-c, etc.): just landed on m-c Risk to taking this patch (and alternatives if risky): we could try taking patch 2 in this bug as wallpaper on branches (if Mats agrees). This patch is straightforward but not totally without risk --- it's possible that there are plugins (or Flash apps) that try to do something like open an alert() from within plugin code, which this patch would break. However I think it's safe to say only crazy situations like that would be affected by this patch. String or UUID changes made by this patch: none
Attachment #689077 - Flags: approval-mozilla-beta?
Attachment #689077 - Flags: approval-mozilla-aurora?
Comment on attachment 689077 [details] [diff] [review] Part 1: Track when we've called into plugin code. While we're in plugin code, never run the refresh driver Please go ahead with uplift, beta 4 will go to build tomorrow.
Attachment #689077 - Flags: approval-mozilla-beta?
Attachment #689077 - Flags: approval-mozilla-beta+
Attachment #689077 - Flags: approval-mozilla-aurora?
Attachment #689077 - Flags: approval-mozilla-aurora+
We can still crash when we respond to an OS sent paint event like this https://crash-stats.mozilla.com/report/index/4977b912-5ae2-4630-90ec-890d42121215
Considering crashes with this signature still reproduce on the latest Aurora, Nightly, and Beta, https://crash-stats.mozilla.com/report/list?signature=nsTreeBodyFrame%3A%3ABuildDisplayList%28nsDisplayListBuilder*%2C+nsRect+const%26%2C+nsDisplayListSet+const%26%29 is there a reliable way QA can verify what was fixed for this bug?
Whiteboard: [qa?]
It's #271 top browser crasher w/o hangs in 18.0b4, and #27 in 19.0a2 so the majority of crashes are fixed. I filed bug 823041 for remaining crashes based on the crash report of comment 53.
Scoobidiver, considering the follow-up bug, and comment 53 and 54, do you think it's safe to mark this verified fixed?
Keywords: qawanted
Whiteboard: [qa?]
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #56) > Scoobidiver, considering the follow-up bug, and comment 53 and 54, do you > think it's safe to mark this verified fixed? Yes, it is.
Status: RESOLVED → VERIFIED
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #43) > (In reply to Mats Palmgren [:mats] from comment #41) > > I think that's still too complicated, so I think we should simply > > destroy the pres shell *before* destroying the document: > > https://hg.mozilla.org/try/rev/169e2acc9c46 > > now it's clear that any abuse from a plugin Stop will not cause problems > > for any frame/pres shell methods, because it does not exist. This seems > > like a more natural order of destruction to me. > > I like that patch! Mats, did you ever file a bug to land this change on trunk? I think we should if we can.
Depends on: 827042
(In reply to Timothy Nikkel (:tn) from comment #58) > Mats, did you ever file a bug to land this change on trunk? I think we > should if we can. I figured we could use the follow-up bug Scoobidiver filed, bug 823041. (sorry for not communicating my intentions)
Depends on: 832963
Depends on: 829557
Depends on: 836778
Depends on: 850191
Depends on: 860052
Depends on: 860252
Depends on: 860490
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: