The default bug view has changed. See this FAQ.

Crash [@ nsRootBoxFrame::RemoveFrame][@ nsFrame::Destroy][@@0x0 | nsFrameList::DestroyFrames()] with popup, menuitem and body onload removing window

RESOLVED FIXED

Status

()

Core
ImageLib
--
critical
RESOLVED FIXED
8 years ago
6 years ago

People

(Reporter: Martijn Wargers (dead), Assigned: bz)

Tracking

(4 keywords)

Trunk
x86
Windows XP
crash, testcase, verified1.9.1, verified1.9.2
Points:
---
Bug Flags:
wanted1.9.2 +
blocking1.9.0.19 -
wanted1.9.0.x +

Firefox Tracking Flags

(blocking1.9.2 needed, status1.9.2 .2-fixed, blocking1.9.1 needed, status1.9.1 .9-fixed)

Details

(Whiteboard: [sg:critical?][3.6.x] deleted shell obj., crash signature)

Attachments

(3 attachments)

(Reporter)

Description

8 years ago
Created attachment 373323 [details]
zipped up testcase

See zipped up testcase, to reproduce the crash:
Unzip the testcase, then open the file named 'parentframe.htm'.
It should crash after 500ms (after a reload).

It also crashes Firefox3.0.7, so marking security sensitive for now.

http://crash-stats.mozilla.com/report/index/049381bf-35dc-4616-a4d9-146bc2090417?p=1
0  	 	@0x0  	
1 	xul.dll 	nsRootBoxFrame::RemoveFrame 	layout/xul/base/src/nsRootBoxFrame.cpp:209
2 	xul.dll 	nsCSSFrameConstructor::ContentRemoved 	layout/base/nsCSSFrameConstructor.cpp:7234
3 	xul.dll 	PresShell::ContentRemoved 	layout/base/nsPresShell.cpp:4895
4 	xul.dll 	nsGenericElement::doRemoveChildAt 	content/base/src/nsGenericElement.cpp:3377
5 	xul.dll 	nsDocument::RemoveChildAt 	content/base/src/nsDocument.cpp:3252
6 	xul.dll 	nsGenericElement::doRemoveChild 	content/base/src/nsGenericElement.cpp:3984
7 	xul.dll 	nsDocument::RemoveChild 	content/base/src/nsDocument.cpp:5462
8 	xul.dll 	nsIDOMNode_RemoveChild 	obj-firefox/js/src/xpconnect/src/dom_quickstubs.cpp:4183
9 	js3250.dll 	js_Interpret 	js/src/jsinterp.cpp:5147
10 	js3250.dll 	js_Invoke 	js/src/jsinterp.cpp:1388
11 	js3250.dll 	js_InternalInvoke 	js/src/jsinterp.cpp:1441
12 	js3250.dll 	JS_CallFunctionValue 	js/src/jsapi.cpp:5196
13 	xul.dll 	nsJSContext::CallEventHandler 	dom/base/nsJSEnvironment.cpp:2007
14 	xul.dll 	nsJSEventListener::HandleEvent 	dom/src/events/nsJSEventListener.cpp:247
15 	xul.dll 	nsEventListenerManager::HandleEventSubType 	content/events/src/nsEventListenerManager.cpp:1090
16 	xul.dll 	nsEventListenerManager::HandleEvent 	content/events/src/nsEventListenerManager.cpp:1187
17 	xul.dll 	nsEventTargetChainItem::HandleEventTargetChain 	content/events/src/nsEventDispatcher.cpp:315
18 	xul.dll 	nsEventDispatcher::Dispatch 	content/events/src/nsEventDispatcher.cpp:508
19 	xul.dll 	nsImageBoxFrameEvent::Run 	layout/xul/base/src/nsImageBoxFrame.cpp:130
20 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:510
21 	xul.dll 	nsBaseAppShell::Run 	widget/src/xpwidgets/nsBaseAppShell.cpp:170
22 	xul.dll 	nsAppStartup::Run 	toolkit/components/startup/src/nsAppStartup.cpp:192
23 	nspr4.dll 	PR_GetEnv 	
24 	firefox.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp:110
25 	firefox.exe 	firefox.exe@0x21a7 	
26 	kernel32.dll 	BaseProcessStart 

Firefox3.0.7 has a different stacktrace:
http://crash-stats.mozilla.com/report/index/a1305891-19df-43f0-9aa4-8de482090417?p=1
0  	xul.dll  	nsFrame::Destroy  	 mozilla/layout/generic/nsFrame.cpp:500
1 	xul.dll 	nsFrameList::DestroyFrames 	mozilla/layout/generic/nsFrameList.cpp:67
2 	xul.dll 	nsContainerFrame::Destroy 	mozilla/layout/generic/nsContainerFrame.cpp:257
3 	xul.dll 	nsFrameList::DestroyFrames 	mozilla/layout/generic/nsFrameList.cpp:67
4 	xul.dll 	nsContainerFrame::Destroy 	mozilla/layout/generic/nsContainerFrame.cpp:257
5 	xul.dll 	nsFrameList::DestroyFrames 	mozilla/layout/generic/nsFrameList.cpp:67
6 	xul.dll 	nsContainerFrame::Destroy 	mozilla/layout/generic/nsContainerFrame.cpp:257
7 	xul.dll 	nsFrameList::DestroyFrames 	mozilla/layout/generic/nsFrameList.cpp:67
8 	xul.dll 	nsContainerFrame::Destroy 	mozilla/layout/generic/nsContainerFrame.cpp:257
9 	xul.dll 	nsFrameList::DestroyFrames 	mozilla/layout/generic/nsFrameList.cpp:67
10 	xul.dll 	nsContainerFrame::Destroy 	mozilla/layout/generic/nsContainerFrame.cpp:257
11 	xul.dll 	nsFrameList::DestroyFrames 	mozilla/layout/generic/nsFrameList.cpp:67
12 	xul.dll 	nsContainerFrame::Destroy 	mozilla/layout/generic/nsContainerFrame.cpp:257
13 	xul.dll 	nsLineBox::DeleteLineList 	mozilla/layout/generic/nsLineBox.cpp:340
14 	xul.dll 	nsBlockFrame::Destroy 	mozilla/layout/generic/nsBlockFrame.cpp:301
15 	xul.dll 	nsFrameList::DestroyFrames 	mozilla/layout/generic/nsFrameList.cpp:67
16 	xul.dll 	nsContainerFrame::Destroy 	mozilla/layout/generic/nsContainerFrame.cpp:257
17 	xul.dll 	nsBoxFrame::RemoveFrame 	mozilla/layout/xul/base/src/nsBoxFrame.cpp:1025
18 	xul.dll 	nsRootBoxFrame::RemoveFrame 	mozilla/layout/xul/base/src/nsRootBoxFrame.cpp:212
19 	xul.dll 	nsCSSFrameConstructor::ContentRemoved 	mozilla/layout/base/nsCSSFrameConstructor.cpp:9644
20 	xul.dll 	PresShell::ContentRemoved 	mozilla/layout/base/nsPresShell.cpp:4724
etc...
Flags: wanted1.9.0.x?
Flags: blocking1.9.0.11?
(Reporter)

Updated

8 years ago
Summary: Crash [@ nsRootBoxFrame::RemoveFrame] with popup, menuitem and boy onload removing window → Crash [@ nsRootBoxFrame::RemoveFrame] with popup, menuitem and body onload removing window
(Reporter)

Updated

8 years ago
Flags: blocking1.9.2?
Flags: wanted1.9.1.x+
Flags: wanted1.9.0.x?
Flags: wanted1.9.0.x+
Flags: blocking1.9.0.12?
Whiteboard: [sg:critical?]
dbaron: Is this your realm? If not, can you assign to the appropriate person?
Assignee: nobody → dbaron
status1.9.1: --- → wanted
Flags: wanted1.9.1.x+
Flags: blocking1.9.2? → wanted1.9.2+
On Mac trunk this is a null deref crash.  Different code path than Windows, or has trunk just changed since this bug was filed?

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000000
0x1114ca78 in nsFrame::Destroy (this=0x1616580) at /build/m-c/src/layout/generic/nsFrame.cpp:441
441	  shell->NotifyDestroyingFrame(this);
(gdb) p shell
$1 = (class nsIPresShell *) 0x0
(In reply to comment #1)
> dbaron: Is this your realm? If not, can you assign to the appropriate person?

Hmmm... just assigning bugs to me doesn't work if I don't see the bugmail (which I often don't).

In any case, I can reproduce the crash (although it's a different stack), but it doesn't crash under valgrind.
(In reply to comment #1)
> dbaron: Is this your realm? If not, can you assign to the appropriate person?

To answer the question:  not really, but I'm not sure who's more appropriate.

In any case, this looks like a null dereference in both of Martijn's stacks (I actually see the latter one on trunk now; the first one might have been something that was on trunk for a bit and then fixed), so I'm not sure why it's security-sensitive.

It's also pretty timing sensitive and thus hard to debug.  But it seems like the problem is that the pres context in question has gotten its SetShell(null) call, so the mShell that we try to dereference is null.
Unhiding per comment 2 and comment 4.
Group: core-security
Whiteboard: [sg:critical?] → [sg:dos] null deref
Attachment #373323 - Attachment mime type: application/zip → application/java-archive
re-hiding. The 3.0.x crash might be a different bug, but the testcase in this bug exhibits a potentially critical crash on that branch.
Group: core-security
Whiteboard: [sg:dos] null deref → [sg:critical?] deleted mem in 3.0.x, null deref in newer branches?
Summary: Crash [@ nsRootBoxFrame::RemoveFrame] with popup, menuitem and body onload removing window → Crash [@ nsRootBoxFrame::RemoveFrame][@ nsFrame::Destroy] with popup, menuitem and body onload removing window
To expand upon comment 6, shell is a deleted object (0xdddddddd in a windows debug build) when we call
   shell->NotifyDestroyingFrame(this);

In an opt build it may be a null in this testcase, but that doesn't mean a different testcase couldn't get a more useful value there.

In 3.5.6pre I'm getting a slightly different stack than in comment 0 but like comment 0 its still trying to execute 0x0 which is scarier than merely dereferencing it.
bp-f96a19bc-2d64-4c0d-aad3-9a8d02091112
Summary: Crash [@ nsRootBoxFrame::RemoveFrame][@ nsFrame::Destroy] with popup, menuitem and body onload removing window → Crash [@ nsRootBoxFrame::RemoveFrame][@ nsFrame::Destroy][@@0x0 | nsFrameList::DestroyFrames()] with popup, menuitem and body onload removing window
Whiteboard: [sg:critical?] deleted mem in 3.0.x, null deref in newer branches? → [sg:critical?] deleted mem in 3.0.x, executing @0x0 in newer branches
In a windows 3.5.6pre debug build I get the same stack and deleted shell as 3.0.x
blocking1.9.1: --- → ?
Flags: blocking1.9.0.17?
Whiteboard: [sg:critical?] deleted mem in 3.0.x, executing @0x0 in newer branches → [sg:critical?] deleted shell obj.
The menuitem in crash1.xul with an image gets an nsImageBoxFrame, it sets up a nsImageBoxFrameEvent to run when it is loaded that dispatches the load event. This calls the onload handler for the popup, we get a ContentRemoved on the root element and we start destroying frames. We get to nsImageBoxFrame::Destroy, which calls CancelAndForgetObserver on its imgImageRequest, this is the last thing blocking the load of the document, so it triggers the document's onload handler. So we remove the subdocument frame from the parent document and this nulls out the presshell on the subdocument's prescontext. When we get back to nsImageBoxFrame::Destroy it eventually calls nsFrame::Destroy where our presshell is null.
(In reply to comment #9)
> The menuitem in crash1.xul with an image gets an nsImageBoxFrame, it sets up a
> nsImageBoxFrameEvent to run when it is loaded that dispatches the load event.
> This calls the onload handler for the popup, we get a ContentRemoved on the
> root element and we start destroying frames. We get to
> nsImageBoxFrame::Destroy, which calls CancelAndForgetObserver on its
> imgImageRequest, this is the last thing blocking the load of the document, so
> it triggers the document's onload handler.

Clearly something between nsImageBoxFrame and the triggering of onload should be asynchronous so onload doesn't fire during frame destruction.
Changing the CancelAndForgetObserver call in nsImageBoxFrame::Destroy to a Cancel call and moving the nulling out of the frame of mListener to before the Cancel call would seem like the right way to fix this. Or am I missing something?

Potentially every other call of CancelAndForgetObserver could have this same problem.
(In reply to comment #11)
> Changing the CancelAndForgetObserver call in nsImageBoxFrame::Destroy to a
> Cancel call and moving the nulling out of the frame of mListener to before the
> Cancel call would seem like the right way to fix this. Or am I missing
> something?

Sounds good to me. I don't think that nulling out needs to be moved, though.

> Potentially every other call of CancelAndForgetObserver could have this same
> problem.

Maybe. I'm not sure why CancelAndForgetObserver would want to cancel synchronously, but then I don't know this code at all.
The nulling out of the frame is because I think we can still get notifications sent to the listener object.
Oh, you mean synchronously during the Cancel() call? OK.
No, actually, asynchronously from the Cancel call.
If it's asynchronous then we don't need to move the nulling out, it's fine for it to be where it is after the Cancel() call.
I guess you're right if we don't have to worry about synchronous notifications.
> Or am I missing something?

I think you are, and the something is that imagelib holds a raw pointer weak ref to mListener, which means that once nsImageBoxFrame::~nsImageBoxFrame happens it'll be holding a pointer to a dead object and then will proceed to asynchronously call virtual methods on it... We should really just fix that with some cycle collection smarts in imagelib, imho.

But given that, you really do need to CancelAndForgetObserver().  The fact that this ends up synchronously removing from the loadgroup is really bad; it needs to be guaranteeing that doesn't happen.  Joe, Bobby, can we fix that?
blocking1.9.1: ? → .7+
Flags: blocking1.9.0.17? → blocking1.9.0.17+
Whiteboard: [sg:critical?] deleted shell obj. → [sg:critical?][3.6.x] deleted shell obj.
bzbarsky said he'd be able to get to this in a week or two, I think, and he's a much better owner for it than I am.
Assignee: dbaron → bzbarsky
blocking1.9.1: .8+ → needed
Flags: blocking1.9.0.18+ → blocking1.9.0.19+
Moving over to the right component.  There's no sane way to fix this in layout; it needs to happen in imagelib.
Component: Layout → ImageLib
QA Contact: layout → imagelib
Created attachment 429049 [details] [diff] [review]
Proposed fix

This is about minimal.  It fixes the crash I see on Mac.  Martijn, can you try it on Windows?
Attachment #429049 - Flags: review?(joe)
(In reply to comment #18)
> But given that, you really do need to CancelAndForgetObserver().  The fact that
> this ends up synchronously removing from the loadgroup is really bad; it needs
> to be guaranteeing that doesn't happen.  Joe, Bobby, can we fix that?

Sorry I somehow missed this before.

As you proved, we *can* fix it. We have the technology. My question is: why should removing from the load group be asynchronous when everything else in that method is synchronous, by design?

More broadly stated: how in the world can we get this *right*, aside from hooking imagelib up to the cycle collector?
> why should removing from the load group be asynchronous when everything else in
> that method is synchronous, by design?

The main thing that needs to be synchronous is setting the observer to null (since it's about to die).  The rest of the synchronicity is due to not wanting to break imagelib api, and therefore needing to deliver notifications before nulling out the observer.  A key part here is that the observer is what receives these notifications and thus has control over how they're handled.

The loadgroup removal is NOT part of the imagelib API; it's part of the necko API.  And necko defines that Cancel() on an nsIRequest is asynchronous.  More importantly, though, the loadgroup removal can trigger arbitrary code to run that is not under the control of the observer, while we know for a fact that use of the CancelAndForgetObserver API means the observer is not in a good state (in the middle of a destructor, for one thing!).

> More broadly stated: how in the world can we get this *right*,

Hooking up to cycle collection is one approach, as you say.  The other approach, which I think might be even better, is switching imagelib to an API more similar to Necko's, where the image would not keep a pointer to its observer for longer than absolutely necessary.  In particular, if animations were pulled by consumers, not pushed by imagelib, we would not need long-term references to observers.  All this modulo multipart jpegs, which we would need to do something with, of course, since right now they're exposed in imagelib API as multiple separate image loads, not as animations, right?
(In reply to comment #23)
> The loadgroup removal is NOT part of the imagelib API; it's part of the necko
> API.  And necko defines that Cancel() on an nsIRequest is asynchronous.  More
> importantly, though, the loadgroup removal can trigger arbitrary code to run
> that is not under the control of the observer, while we know for a fact that
> use of the CancelAndForgetObserver API means the observer is not in a good
> state (in the middle of a destructor, for one thing!).

OK, I totally buy this!
 
> if animations
> were pulled by consumers, not pushed by imagelib, we would not need long-term
> references to observers.

This is probably possible. Because images don't change size, no reflow would be necessary; nsRefreshDriver could tell layout to redraw, and then layout could just get imagelib to draw the current frame.

>  All this modulo multipart jpegs, which we would need
> to do something with, of course, since right now they're exposed in imagelib
> API as multiple separate image loads, not as animations, right?

Right. Which is a bit silly, but before nsRefreshDriver there was really no way to force redraws, I guess.
(Reporter)

Comment 25

7 years ago
(In reply to comment #21)
> Created an attachment (id=429049) [details]
> Proposed fix
> 
> This is about minimal.  It fixes the crash I see on Mac.  Martijn, can you try
> it on Windows?

Yes, that seems to fix it, thanks.
Joe: can you expedite review for this blocking bug?
Whiteboard: [sg:critical?][3.6.x] deleted shell obj. → [sg:critical?][3.6.x] deleted shell obj.[needs review jdrew]
Attachment #429049 - Flags: superreview?(vladimir)
Attachment #429049 - Flags: review?(joe)
Attachment #429049 - Flags: review+
Vlad, if you can't review this before Sunday, just switch the review request to roc - he can do it on Monday, aka North America's Sunday. :)
Attachment #429049 - Flags: superreview?(vladimir) → superreview+
This doesn't need to be fixed in 1.9.0.19 unless it's also fixed in 1.9.1.9. Since it hasn't landed on trunk yet, that's unlikely, so no longer blocking for 1.9.0.19.
Flags: blocking1.9.0.19+ → blocking1.9.0.19-
Gah.  The only reason this hasn't landed on trunk yet is because I was waiting on vlad's review and of course I got no notification that it had happened...  I'm landing it on trunk right now.
Pushed http://hg.mozilla.org/mozilla-central/rev/463ffd8cf98e
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
That said, on 1.9.0 this looks like it needs a totally different fix, since CancelAndForgetRequest doesn't exist there.  Joe?
Created attachment 431158 [details] [diff] [review]
1.9.2/1.9.1 merge

Requesting approvals for what I think are the relevant branches...
Attachment #431158 - Flags: approval1.9.2.3?
Attachment #431158 - Flags: approval1.9.2.2?
Attachment #431158 - Flags: approval1.9.1.9?
Attachment #431158 - Flags: approval1.9.1.10?
blocking1.9.2: --- → ?
status1.9.2: --- → ?
I'm going to call this status1.9.0-wontfix, because fixing it will introduce more risk than not fixing it (since it would involve backporting all the Cancel*() fixes that bz and I have cooked up over the past two years).
Whiteboard: [sg:critical?][3.6.x] deleted shell obj.[needs review jdrew] → [sg:critical?][3.6.x] deleted shell obj.
Attachment #431158 - Flags: approval1.9.2.3?
Attachment #431158 - Flags: approval1.9.2.2?
Attachment #431158 - Flags: approval1.9.2.2+
Attachment #431158 - Flags: approval1.9.1.9?
Attachment #431158 - Flags: approval1.9.1.9+
Attachment #431158 - Flags: approval1.9.1.10?
b+a=beltzner for 1.9.2.2 and 1.9.1.9
blocking1.9.2: ? → needed
status1.9.2: ? → wanted
Pushed:
  http://hg.mozilla.org/releases/mozilla-1.9.1/rev/17b136b70289
  http://hg.mozilla.org/releases/mozilla-1.9.2/rev/88f1e55a2c3e
status1.9.1: wanted → .9-fixed
status1.9.2: wanted → .2-fixed
Verified for 1.9.1.9 using attached testcase and Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.9pre) Gecko/20100311 Shiretoko/3.5.9pre (.NET CLR 3.5.30729). It no longer crashes as it does on 1.9.1.8.

Verified for 1.9.2 with Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.2pre) Gecko/20100311 Namoroka/3.6.2pre (.NET CLR 3.5.30729).
Keywords: verified1.9.1, verified1.9.2
Group: core-security
Crash Signature: [@ nsRootBoxFrame::RemoveFrame] [@ nsFrame::Destroy] [@@0x0 | nsFrameList::DestroyFrames()]
You need to log in before you can comment on or make changes to this bug.