Closed Bug 71141 Opened 24 years ago Closed 23 years ago

crash with dangling GC root from nsXULElement::mScriptObject

Categories

(Core :: XUL, defect)

x86
Windows 2000
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla0.8.1

People

(Reporter: waterson, Assigned: waterson)

References

()

Details

(Keywords: crash)

Attachments

(7 files)

I've been able to (pretty reliably) reproduce a crash that appears to be caused 
by a dangling GC root from nsXULElement::mScriptObject. Here's how:

1. Go to www.webmonkey.com in Mozilla.
2. Choose File/Edit to open the page in Composer.
3. Repeat (2) as necessary, occasionally closing some of the windows.

Adding belt-and-braces removal of the GC root in nsXULElement's dtor seems to 
fix the problem. Patch to follow...
N.B. that I whacked XPC's JS runtime service to support weak references so that 
I could hack this a bit more cleanly.
Status: NEW → ASSIGNED
Keywords: crash
Target Milestone: --- → mozilla0.8.1
jpatel: I'm thinking this might be related to the JS Mark GC thing topcrash.

  ftp://ftp.mozilla.org/pub/data/crash-data/ns6analysis.html#js_MarkGCThing
Looks good to me, r/sr=brendan@mozilla.org.

/be
r=jband

From your comment, I gather that we are going to sometimes be calling 
JS_RemoveRootRT twice for the same root. Is that going to be happening a lot? I 
guess it is only a minor perf hit. I was also a little concerned about the 
danger of bringing the JSRuntimeService back from the dead - with a different 
JSRuntime! - but it looks like the shutdown code in the service manager will 
always protect us from that.
I thought about this a bit more, and think that maybe it's worth trying to solve 
the problem instead of band-aid around it. I added the above debug code, and 
then found this rather puzzling stack trace:

nsDebug::Assertion(const char * 0x0129d000, const char * 0x0129cfe8,
                   const char * 0x0129cfa8, int 549)
  line 254 + 13 bytes
nsXULElement::~nsXULElement()
  line 549 + 35 bytes
nsXULElement::`scalar deleting destructor'(unsigned int 1) + 15 bytes
nsXULElement::Release(nsXULElement * const 0x05bbc7a0)
  line 687 + 157 bytes
nsJSUtils::nsGenericFinalize(JSContext * 0x05ab6660, JSObject * 0x00f69408)
  line 437 + 12 bytes
FinalizeXULElement(JSContext * 0x05ab6660, JSObject * 0x00f69408)
  line 291 + 13 bytes
js_FinalizeObject(JSContext * 0x05ab6660, JSObject * 0x00f69408)
  line 1662 + 274 bytes
js_GC(JSContext * 0x05ab6660, unsigned int 0)
  line 1218 + 11 bytes
js_ForceGC(JSContext * 0x05ab6660)
  line 945 + 11 bytes
js_DestroyContext(JSContext * 0x05ab6660, int 2)
  line 224 + 9 bytes
JS_DestroyContext(JSContext * 0x05ab6660)
  line 882 + 11 bytes
nsJSContext::~nsJSContext()
  line 392 + 13 bytes
[...]

If I understand this correctly, there's a XUL element that's being finalized 
even though it's still rooted. Do we finalize all JS objects associated with a 
context when the context is destroyed or something? Is this a GC bug? Or is my 
patch just bogus and I've dropped something on the floor here?
There's no way that JS is finalizing objects that are rooted - whether 
JSContexts are going away or not. We'd have noticed that! Are you *sure* there 
is not other code that is unrooting the object (and not setting your debug 
flag)?
brendan and I debugged this for a while. It turns out that there are *two* JS 
objects created for single DOM element. This happens because the generated DOM 
glue assumes (e.g., in NS_NewScriptXULElement) that asking for the script object 
for the parent *won't* recursively result in us asking for the child's script 
object again. XBL's ``onattach'' event handler can break this assumption, and 
that's exactly what happens in this case.

We discovered this by adding a latch in nsXULElement's GetScriptObject() method 
(``mCreatingScriptObject''), and then assert-botching if we re-enter while 
building the script object.
You two should be aware of this.
I suggested to waterson that we change the glue code generated by idlc so
NS_NewScriptXULElement and the rest all detect the recursive set.  One way to do
this: set *aReturn = nsnull; at the top of the function, before QI'ing aParent.

Then test whether *aReturn became non-null during that call.  The rest of the
calls out from NS_NewScriptXULElement look ok to me -- NS_InitXULElementClass
can't do any GetScriptObject on aSupports, and the QI from aSupports to
nsIXULElement shouldn't.  But hyatt should comment -- who knows where XBL might
lurk?  Cc'ing jst too, as we're talking about a deathbed hack to idlc.

/be
brendan: I think that we actually need to do this in 
nsXULElement::GetScriptObject() (et. al), otherwise we'll run the 
onbindingattached handler twice for the child element.
(Also, relying on aReturn pointing to the same place seems sort of scary.)
waterson: right, we need to do this in several places, and we do need some
well-known storage (however veneered by interface methods) that latches things
so we don't double-create.

Reording code, given all the layers and generality, won't guarantee that this
problem is vanquished.  I think we need some memory-based mechanism.  Anyone
have a better idea?

/be
Summary: crash with danling GC root from nsXULElement::mScriptObject → crash with dangling GC root from nsXULElement::mScriptObject
I think this should do it:

1. Instead of creating script object directly into mScriptObject (or
   slots->mScriptObject), use stack variable. After creating script object,
   check to see if mScriptObject (or slots->mScriptObject) was set, indicating
   that we re-entered. If so, disconnect JSObject from native object, and
   ``discard'' redundant JSObject.

   This is a bit unsightly (adding JS_SetPrivate() goop to unlink native from
   JSObject), but I consider this a somewhat temporary hack given that the
   XPConnect'd DOM will make this all obsolete.

   N.B. that I fixed nsGenericElement, too, just in case somebody starts
   using XBL on HTML elements.

2. Add check for null script object to XUL element: if null, don't JS GC root
   the null pointer.

3. Add debug-only assert-botch code to notice if a XUL element is still
   rooted in dtor.

4. Unrelated gratuitous cleanup of XUL script proto manhandling of roots. Make
   nsJSRuntimeServiceImpl support weak references.

r=/sr=, anyone?
*** Bug 70959 has been marked as a duplicate of this bug. ***
jst: brendan suggested that we could avoid this problem by enqueuing XBL's 
attach handler stuff and processing later, after script objects have been 
created. Dunno if you're going to need to wrestle with this problem while 
XPConnectifying the DOM...I'd be interested in what you thought...
r/sr=brendan@mozilla.org, looks good to me.

/be
Doesn't this patch introduce a leak (which is better than a crash, but still...)
in the case where we do re-enter GetScriptObject()? We create the script object,
check if we re-ented and if we did we do a ::JS_SetPrivate(cx, newobject,
nsnull), this will make the finalizer for the new script object *not* do the
release on the native pointer, as it usually does, don't we need a (ew, this is
uggly, but...) this->Release(); call (or maybe ::JS_GetPrivate(cx,
newobject)->Release()) after ::JS_SetPrivate() to match the AddRef() that's done
by factory->NewScriptElement()? Or did I overlook something here?

As to what this will mean in the new XPConnectified DOM world I have no idea
yet, we need to discuss this with jband once XBL actually works with an
XPConnected DOM.
jst's smart, and stuff!

/be
Ok, I opted for leaving *out* the changes to xpconnect (changing 
nsJSRuntimeServiceImpl to support weak references), and left the XUL proto 
element root management stuff alone. I started seeing bizarre shutdown crashes 
(from PREF_Shutdown(), through JS_GC). I suspect jband fears may have been 
realized, and a different runtime got created at shutdown.
Even in the new world, XBL will need a hook for when a DOM element is first
being exposed to JS.  This is needed for display: none elements, since I can't
afford to waste time resolving style on display: none elts to determine what
binding to attach.  I have to do this lazily and it has to be done before the
object is exposed to script.
waterson: I'm neutral on weakref support in nsJSRuntimeServiceImpl. I think it 
is safe in and of itself. Your prefs shutdown crash is probably bug 71237. I 
think that one is due to prefs trying to kill its JSContext late w/o regard to 
whether ot not the nsJSRuntimeServiceImpl has already shutdown and killed the 
one and only JSRuntime. It may be that changing your code to not hold a strong 
reference to the nsJSRuntimeServiceImpl exacerbated the problem by increasing 
the likelihood that the nsJSRuntimeServiceImpl would die earler. But, I don't 
think we are seeing the problem of a zombie nsJSRuntimeServiceImpl.
Fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Component: XP Toolkit/Widgets: XUL → XUL
QA Contact: jrgmorrison → xptoolkit.widgets
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: