Last Comment Bug 488850 - Crash [@ nsRootBoxFrame::RemoveFrame][@ nsFrame::Destroy][@@0x0 | nsFrameList::DestroyFrames()] with popup, menuitem and body onload removing window
: Crash [@ nsRootBoxFrame::RemoveFrame][@ nsFrame::Destroy][@@0x0 | nsFrameList...
Status: RESOLVED FIXED
[sg:critical?][3.6.x] deleted shell obj.
: crash, testcase, verified1.9.1, verified1.9.2
Product: Core
Classification: Components
Component: ImageLib (show other bugs)
: Trunk
: x86 Windows XP
: -- critical (vote)
: ---
Assigned To: Boris Zbarsky [:bz]
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2009-04-17 08:05 PDT by Martijn Wargers [:mwargers] (not working for Mozilla)
Modified: 2011-06-13 10:01 PDT (History)
14 users (show)
roc: wanted1.9.2+
mbeltzner: blocking1.9.0.19-
dveditz: wanted1.9.0.x+
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
needed
.2-fixed
needed
.9-fixed


Attachments
zipped up testcase (592 bytes, application/java-archive)
2009-04-17 08:05 PDT, Martijn Wargers [:mwargers] (not working for Mozilla)
no flags Details
Proposed fix (2.03 KB, patch)
2010-02-25 19:10 PST, Boris Zbarsky [:bz]
joe: review+
vladimir: superreview+
Details | Diff | Splinter Review
1.9.2/1.9.1 merge (2.17 KB, patch)
2010-03-08 11:42 PST, Boris Zbarsky [:bz]
mbeltzner: approval1.9.2.2+
mbeltzner: approval1.9.1.9+
Details | Diff | Splinter Review

Description Martijn Wargers [:mwargers] (not working for Mozilla) 2009-04-17 08:05:56 PDT
Created attachment 373323 [details]
zipped up testcase

See zipped up testcase, to reproduce the crash:
Unzip the testcase, then open the file named 'parentframe.htm'.
It should crash after 500ms (after a reload).

It also crashes Firefox3.0.7, so marking security sensitive for now.

http://crash-stats.mozilla.com/report/index/049381bf-35dc-4616-a4d9-146bc2090417?p=1
0  	 	@0x0  	
1 	xul.dll 	nsRootBoxFrame::RemoveFrame 	layout/xul/base/src/nsRootBoxFrame.cpp:209
2 	xul.dll 	nsCSSFrameConstructor::ContentRemoved 	layout/base/nsCSSFrameConstructor.cpp:7234
3 	xul.dll 	PresShell::ContentRemoved 	layout/base/nsPresShell.cpp:4895
4 	xul.dll 	nsGenericElement::doRemoveChildAt 	content/base/src/nsGenericElement.cpp:3377
5 	xul.dll 	nsDocument::RemoveChildAt 	content/base/src/nsDocument.cpp:3252
6 	xul.dll 	nsGenericElement::doRemoveChild 	content/base/src/nsGenericElement.cpp:3984
7 	xul.dll 	nsDocument::RemoveChild 	content/base/src/nsDocument.cpp:5462
8 	xul.dll 	nsIDOMNode_RemoveChild 	obj-firefox/js/src/xpconnect/src/dom_quickstubs.cpp:4183
9 	js3250.dll 	js_Interpret 	js/src/jsinterp.cpp:5147
10 	js3250.dll 	js_Invoke 	js/src/jsinterp.cpp:1388
11 	js3250.dll 	js_InternalInvoke 	js/src/jsinterp.cpp:1441
12 	js3250.dll 	JS_CallFunctionValue 	js/src/jsapi.cpp:5196
13 	xul.dll 	nsJSContext::CallEventHandler 	dom/base/nsJSEnvironment.cpp:2007
14 	xul.dll 	nsJSEventListener::HandleEvent 	dom/src/events/nsJSEventListener.cpp:247
15 	xul.dll 	nsEventListenerManager::HandleEventSubType 	content/events/src/nsEventListenerManager.cpp:1090
16 	xul.dll 	nsEventListenerManager::HandleEvent 	content/events/src/nsEventListenerManager.cpp:1187
17 	xul.dll 	nsEventTargetChainItem::HandleEventTargetChain 	content/events/src/nsEventDispatcher.cpp:315
18 	xul.dll 	nsEventDispatcher::Dispatch 	content/events/src/nsEventDispatcher.cpp:508
19 	xul.dll 	nsImageBoxFrameEvent::Run 	layout/xul/base/src/nsImageBoxFrame.cpp:130
20 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:510
21 	xul.dll 	nsBaseAppShell::Run 	widget/src/xpwidgets/nsBaseAppShell.cpp:170
22 	xul.dll 	nsAppStartup::Run 	toolkit/components/startup/src/nsAppStartup.cpp:192
23 	nspr4.dll 	PR_GetEnv 	
24 	firefox.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp:110
25 	firefox.exe 	firefox.exe@0x21a7 	
26 	kernel32.dll 	BaseProcessStart 

Firefox3.0.7 has a different stacktrace:
http://crash-stats.mozilla.com/report/index/a1305891-19df-43f0-9aa4-8de482090417?p=1
0  	xul.dll  	nsFrame::Destroy  	 mozilla/layout/generic/nsFrame.cpp:500
1 	xul.dll 	nsFrameList::DestroyFrames 	mozilla/layout/generic/nsFrameList.cpp:67
2 	xul.dll 	nsContainerFrame::Destroy 	mozilla/layout/generic/nsContainerFrame.cpp:257
3 	xul.dll 	nsFrameList::DestroyFrames 	mozilla/layout/generic/nsFrameList.cpp:67
4 	xul.dll 	nsContainerFrame::Destroy 	mozilla/layout/generic/nsContainerFrame.cpp:257
5 	xul.dll 	nsFrameList::DestroyFrames 	mozilla/layout/generic/nsFrameList.cpp:67
6 	xul.dll 	nsContainerFrame::Destroy 	mozilla/layout/generic/nsContainerFrame.cpp:257
7 	xul.dll 	nsFrameList::DestroyFrames 	mozilla/layout/generic/nsFrameList.cpp:67
8 	xul.dll 	nsContainerFrame::Destroy 	mozilla/layout/generic/nsContainerFrame.cpp:257
9 	xul.dll 	nsFrameList::DestroyFrames 	mozilla/layout/generic/nsFrameList.cpp:67
10 	xul.dll 	nsContainerFrame::Destroy 	mozilla/layout/generic/nsContainerFrame.cpp:257
11 	xul.dll 	nsFrameList::DestroyFrames 	mozilla/layout/generic/nsFrameList.cpp:67
12 	xul.dll 	nsContainerFrame::Destroy 	mozilla/layout/generic/nsContainerFrame.cpp:257
13 	xul.dll 	nsLineBox::DeleteLineList 	mozilla/layout/generic/nsLineBox.cpp:340
14 	xul.dll 	nsBlockFrame::Destroy 	mozilla/layout/generic/nsBlockFrame.cpp:301
15 	xul.dll 	nsFrameList::DestroyFrames 	mozilla/layout/generic/nsFrameList.cpp:67
16 	xul.dll 	nsContainerFrame::Destroy 	mozilla/layout/generic/nsContainerFrame.cpp:257
17 	xul.dll 	nsBoxFrame::RemoveFrame 	mozilla/layout/xul/base/src/nsBoxFrame.cpp:1025
18 	xul.dll 	nsRootBoxFrame::RemoveFrame 	mozilla/layout/xul/base/src/nsRootBoxFrame.cpp:212
19 	xul.dll 	nsCSSFrameConstructor::ContentRemoved 	mozilla/layout/base/nsCSSFrameConstructor.cpp:9644
20 	xul.dll 	PresShell::ContentRemoved 	mozilla/layout/base/nsPresShell.cpp:4724
etc...
Comment 1 Samuel Sidler (old account; do not CC) 2009-07-23 15:25:39 PDT
dbaron: Is this your realm? If not, can you assign to the appropriate person?
Comment 2 Brandon Sterne (:bsterne) 2009-10-08 15:39:57 PDT
On Mac trunk this is a null deref crash.  Different code path than Windows, or has trunk just changed since this bug was filed?

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000000
0x1114ca78 in nsFrame::Destroy (this=0x1616580) at /build/m-c/src/layout/generic/nsFrame.cpp:441
441	  shell->NotifyDestroyingFrame(this);
(gdb) p shell
$1 = (class nsIPresShell *) 0x0
Comment 3 David Baron :dbaron: ⌚️UTC+2 (mostly busy through August 4; review requests must explain patch) 2009-11-10 17:17:18 PST
(In reply to comment #1)
> dbaron: Is this your realm? If not, can you assign to the appropriate person?

Hmmm... just assigning bugs to me doesn't work if I don't see the bugmail (which I often don't).

In any case, I can reproduce the crash (although it's a different stack), but it doesn't crash under valgrind.
Comment 4 David Baron :dbaron: ⌚️UTC+2 (mostly busy through August 4; review requests must explain patch) 2009-11-11 11:57:33 PST
(In reply to comment #1)
> dbaron: Is this your realm? If not, can you assign to the appropriate person?

To answer the question:  not really, but I'm not sure who's more appropriate.

In any case, this looks like a null dereference in both of Martijn's stacks (I actually see the latter one on trunk now; the first one might have been something that was on trunk for a bit and then fixed), so I'm not sure why it's security-sensitive.

It's also pretty timing sensitive and thus hard to debug.  But it seems like the problem is that the pres context in question has gotten its SetShell(null) call, so the mShell that we try to dereference is null.
Comment 5 Brandon Sterne (:bsterne) 2009-11-12 09:11:30 PST
Unhiding per comment 2 and comment 4.
Comment 6 Daniel Veditz [:dveditz] 2009-11-12 11:26:58 PST
re-hiding. The 3.0.x crash might be a different bug, but the testcase in this bug exhibits a potentially critical crash on that branch.
Comment 7 Daniel Veditz [:dveditz] 2009-11-12 13:29:30 PST
To expand upon comment 6, shell is a deleted object (0xdddddddd in a windows debug build) when we call
   shell->NotifyDestroyingFrame(this);

In an opt build it may be a null in this testcase, but that doesn't mean a different testcase couldn't get a more useful value there.

In 3.5.6pre I'm getting a slightly different stack than in comment 0 but like comment 0 its still trying to execute 0x0 which is scarier than merely dereferencing it.
bp-f96a19bc-2d64-4c0d-aad3-9a8d02091112
Comment 8 Daniel Veditz [:dveditz] 2009-11-13 11:57:31 PST
In a windows 3.5.6pre debug build I get the same stack and deleted shell as 3.0.x
Comment 9 Timothy Nikkel (:tnikkel) 2009-11-13 15:50:16 PST
The menuitem in crash1.xul with an image gets an nsImageBoxFrame, it sets up a nsImageBoxFrameEvent to run when it is loaded that dispatches the load event. This calls the onload handler for the popup, we get a ContentRemoved on the root element and we start destroying frames. We get to nsImageBoxFrame::Destroy, which calls CancelAndForgetObserver on its imgImageRequest, this is the last thing blocking the load of the document, so it triggers the document's onload handler. So we remove the subdocument frame from the parent document and this nulls out the presshell on the subdocument's prescontext. When we get back to nsImageBoxFrame::Destroy it eventually calls nsFrame::Destroy where our presshell is null.
Comment 10 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2009-11-13 18:55:24 PST
(In reply to comment #9)
> The menuitem in crash1.xul with an image gets an nsImageBoxFrame, it sets up a
> nsImageBoxFrameEvent to run when it is loaded that dispatches the load event.
> This calls the onload handler for the popup, we get a ContentRemoved on the
> root element and we start destroying frames. We get to
> nsImageBoxFrame::Destroy, which calls CancelAndForgetObserver on its
> imgImageRequest, this is the last thing blocking the load of the document, so
> it triggers the document's onload handler.

Clearly something between nsImageBoxFrame and the triggering of onload should be asynchronous so onload doesn't fire during frame destruction.
Comment 11 Timothy Nikkel (:tnikkel) 2009-11-14 02:30:58 PST
Changing the CancelAndForgetObserver call in nsImageBoxFrame::Destroy to a Cancel call and moving the nulling out of the frame of mListener to before the Cancel call would seem like the right way to fix this. Or am I missing something?

Potentially every other call of CancelAndForgetObserver could have this same problem.
Comment 12 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2009-11-14 02:57:05 PST
(In reply to comment #11)
> Changing the CancelAndForgetObserver call in nsImageBoxFrame::Destroy to a
> Cancel call and moving the nulling out of the frame of mListener to before the
> Cancel call would seem like the right way to fix this. Or am I missing
> something?

Sounds good to me. I don't think that nulling out needs to be moved, though.

> Potentially every other call of CancelAndForgetObserver could have this same
> problem.

Maybe. I'm not sure why CancelAndForgetObserver would want to cancel synchronously, but then I don't know this code at all.
Comment 13 Timothy Nikkel (:tnikkel) 2009-11-14 03:05:30 PST
The nulling out of the frame is because I think we can still get notifications sent to the listener object.
Comment 14 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2009-11-14 03:21:09 PST
Oh, you mean synchronously during the Cancel() call? OK.
Comment 15 Timothy Nikkel (:tnikkel) 2009-11-14 03:27:45 PST
No, actually, asynchronously from the Cancel call.
Comment 16 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2009-11-14 18:31:08 PST
If it's asynchronous then we don't need to move the nulling out, it's fine for it to be where it is after the Cancel() call.
Comment 17 Timothy Nikkel (:tnikkel) 2009-11-15 01:41:47 PST
I guess you're right if we don't have to worry about synchronous notifications.
Comment 18 Boris Zbarsky [:bz] 2009-11-15 13:15:59 PST
> Or am I missing something?

I think you are, and the something is that imagelib holds a raw pointer weak ref to mListener, which means that once nsImageBoxFrame::~nsImageBoxFrame happens it'll be holding a pointer to a dead object and then will proceed to asynchronously call virtual methods on it... We should really just fix that with some cycle collection smarts in imagelib, imho.

But given that, you really do need to CancelAndForgetObserver().  The fact that this ends up synchronously removing from the loadgroup is really bad; it needs to be guaranteeing that doesn't happen.  Joe, Bobby, can we fix that?
Comment 19 David Baron :dbaron: ⌚️UTC+2 (mostly busy through August 4; review requests must explain patch) 2010-01-25 13:21:41 PST
bzbarsky said he'd be able to get to this in a week or two, I think, and he's a much better owner for it than I am.
Comment 20 Boris Zbarsky [:bz] 2010-02-25 14:12:49 PST
Moving over to the right component.  There's no sane way to fix this in layout; it needs to happen in imagelib.
Comment 21 Boris Zbarsky [:bz] 2010-02-25 19:10:44 PST
Created attachment 429049 [details] [diff] [review]
Proposed fix

This is about minimal.  It fixes the crash I see on Mac.  Martijn, can you try it on Windows?
Comment 22 Joe Drew (not getting mail) 2010-02-25 22:44:01 PST
(In reply to comment #18)
> But given that, you really do need to CancelAndForgetObserver().  The fact that
> this ends up synchronously removing from the loadgroup is really bad; it needs
> to be guaranteeing that doesn't happen.  Joe, Bobby, can we fix that?

Sorry I somehow missed this before.

As you proved, we *can* fix it. We have the technology. My question is: why should removing from the load group be asynchronous when everything else in that method is synchronous, by design?

More broadly stated: how in the world can we get this *right*, aside from hooking imagelib up to the cycle collector?
Comment 23 Boris Zbarsky [:bz] 2010-02-25 23:25:46 PST
> why should removing from the load group be asynchronous when everything else in
> that method is synchronous, by design?

The main thing that needs to be synchronous is setting the observer to null (since it's about to die).  The rest of the synchronicity is due to not wanting to break imagelib api, and therefore needing to deliver notifications before nulling out the observer.  A key part here is that the observer is what receives these notifications and thus has control over how they're handled.

The loadgroup removal is NOT part of the imagelib API; it's part of the necko API.  And necko defines that Cancel() on an nsIRequest is asynchronous.  More importantly, though, the loadgroup removal can trigger arbitrary code to run that is not under the control of the observer, while we know for a fact that use of the CancelAndForgetObserver API means the observer is not in a good state (in the middle of a destructor, for one thing!).

> More broadly stated: how in the world can we get this *right*,

Hooking up to cycle collection is one approach, as you say.  The other approach, which I think might be even better, is switching imagelib to an API more similar to Necko's, where the image would not keep a pointer to its observer for longer than absolutely necessary.  In particular, if animations were pulled by consumers, not pushed by imagelib, we would not need long-term references to observers.  All this modulo multipart jpegs, which we would need to do something with, of course, since right now they're exposed in imagelib API as multiple separate image loads, not as animations, right?
Comment 24 Joe Drew (not getting mail) 2010-02-25 23:37:55 PST
(In reply to comment #23)
> The loadgroup removal is NOT part of the imagelib API; it's part of the necko
> API.  And necko defines that Cancel() on an nsIRequest is asynchronous.  More
> importantly, though, the loadgroup removal can trigger arbitrary code to run
> that is not under the control of the observer, while we know for a fact that
> use of the CancelAndForgetObserver API means the observer is not in a good
> state (in the middle of a destructor, for one thing!).

OK, I totally buy this!
 
> if animations
> were pulled by consumers, not pushed by imagelib, we would not need long-term
> references to observers.

This is probably possible. Because images don't change size, no reflow would be necessary; nsRefreshDriver could tell layout to redraw, and then layout could just get imagelib to draw the current frame.

>  All this modulo multipart jpegs, which we would need
> to do something with, of course, since right now they're exposed in imagelib
> API as multiple separate image loads, not as animations, right?

Right. Which is a bit silly, but before nsRefreshDriver there was really no way to force redraws, I guess.
Comment 25 Martijn Wargers [:mwargers] (not working for Mozilla) 2010-02-26 01:17:23 PST
(In reply to comment #21)
> Created an attachment (id=429049) [details]
> Proposed fix
> 
> This is about minimal.  It fixes the crash I see on Mac.  Martijn, can you try
> it on Windows?

Yes, that seems to fix it, thanks.
Comment 26 Mike Beltzner [:beltzner, not reading bugmail] 2010-03-03 13:55:20 PST
Joe: can you expedite review for this blocking bug?
Comment 27 Joe Drew (not getting mail) 2010-03-05 13:45:28 PST
Vlad, if you can't review this before Sunday, just switch the review request to roc - he can do it on Monday, aka North America's Sunday. :)
Comment 28 Mike Beltzner [:beltzner, not reading bugmail] 2010-03-08 10:35:37 PST
This doesn't need to be fixed in 1.9.0.19 unless it's also fixed in 1.9.1.9. Since it hasn't landed on trunk yet, that's unlikely, so no longer blocking for 1.9.0.19.
Comment 29 Boris Zbarsky [:bz] 2010-03-08 11:30:52 PST
Gah.  The only reason this hasn't landed on trunk yet is because I was waiting on vlad's review and of course I got no notification that it had happened...  I'm landing it on trunk right now.
Comment 30 Boris Zbarsky [:bz] 2010-03-08 11:35:23 PST
Pushed http://hg.mozilla.org/mozilla-central/rev/463ffd8cf98e
Comment 31 Boris Zbarsky [:bz] 2010-03-08 11:41:15 PST
That said, on 1.9.0 this looks like it needs a totally different fix, since CancelAndForgetRequest doesn't exist there.  Joe?
Comment 32 Boris Zbarsky [:bz] 2010-03-08 11:42:25 PST
Created attachment 431158 [details] [diff] [review]
1.9.2/1.9.1 merge

Requesting approvals for what I think are the relevant branches...
Comment 33 Joe Drew (not getting mail) 2010-03-08 12:38:53 PST
I'm going to call this status1.9.0-wontfix, because fixing it will introduce more risk than not fixing it (since it would involve backporting all the Cancel*() fixes that bz and I have cooked up over the past two years).
Comment 34 Mike Beltzner [:beltzner, not reading bugmail] 2010-03-08 13:51:58 PST
b+a=beltzner for 1.9.2.2 and 1.9.1.9
Comment 36 Al Billings [:abillings] 2010-03-12 17:28:14 PST
Verified for 1.9.1.9 using attached testcase and Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.9pre) Gecko/20100311 Shiretoko/3.5.9pre (.NET CLR 3.5.30729). It no longer crashes as it does on 1.9.1.8.

Verified for 1.9.2 with Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.2pre) Gecko/20100311 Namoroka/3.6.2pre (.NET CLR 3.5.30729).

Note You need to log in before you can comment on or make changes to this bug.