Closed Bug 71141 Opened 25 years ago Closed 25 years ago

crash with dangling GC root from nsXULElement::mScriptObject

Categories

(Core :: XUL, defect)

x86
Windows 2000
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla0.8.1

People

(Reporter: waterson, Assigned: waterson)

References

()

Details

(Keywords: crash)

Attachments

(7 files)

I've been able to (pretty reliably) reproduce a crash that appears to be caused by a dangling GC root from nsXULElement::mScriptObject. Here's how: 1. Go to www.webmonkey.com in Mozilla. 2. Choose File/Edit to open the page in Composer. 3. Repeat (2) as necessary, occasionally closing some of the windows. Adding belt-and-braces removal of the GC root in nsXULElement's dtor seems to fix the problem. Patch to follow...
N.B. that I whacked XPC's JS runtime service to support weak references so that I could hack this a bit more cleanly.
Status: NEW → ASSIGNED
Keywords: crash
Target Milestone: --- → mozilla0.8.1
jpatel: I'm thinking this might be related to the JS Mark GC thing topcrash. ftp://ftp.mozilla.org/pub/data/crash-data/ns6analysis.html#js_MarkGCThing
Looks good to me, r/sr=brendan@mozilla.org. /be
r=jband From your comment, I gather that we are going to sometimes be calling JS_RemoveRootRT twice for the same root. Is that going to be happening a lot? I guess it is only a minor perf hit. I was also a little concerned about the danger of bringing the JSRuntimeService back from the dead - with a different JSRuntime! - but it looks like the shutdown code in the service manager will always protect us from that.
I thought about this a bit more, and think that maybe it's worth trying to solve the problem instead of band-aid around it. I added the above debug code, and then found this rather puzzling stack trace: nsDebug::Assertion(const char * 0x0129d000, const char * 0x0129cfe8, const char * 0x0129cfa8, int 549) line 254 + 13 bytes nsXULElement::~nsXULElement() line 549 + 35 bytes nsXULElement::`scalar deleting destructor'(unsigned int 1) + 15 bytes nsXULElement::Release(nsXULElement * const 0x05bbc7a0) line 687 + 157 bytes nsJSUtils::nsGenericFinalize(JSContext * 0x05ab6660, JSObject * 0x00f69408) line 437 + 12 bytes FinalizeXULElement(JSContext * 0x05ab6660, JSObject * 0x00f69408) line 291 + 13 bytes js_FinalizeObject(JSContext * 0x05ab6660, JSObject * 0x00f69408) line 1662 + 274 bytes js_GC(JSContext * 0x05ab6660, unsigned int 0) line 1218 + 11 bytes js_ForceGC(JSContext * 0x05ab6660) line 945 + 11 bytes js_DestroyContext(JSContext * 0x05ab6660, int 2) line 224 + 9 bytes JS_DestroyContext(JSContext * 0x05ab6660) line 882 + 11 bytes nsJSContext::~nsJSContext() line 392 + 13 bytes [...] If I understand this correctly, there's a XUL element that's being finalized even though it's still rooted. Do we finalize all JS objects associated with a context when the context is destroyed or something? Is this a GC bug? Or is my patch just bogus and I've dropped something on the floor here?
There's no way that JS is finalizing objects that are rooted - whether JSContexts are going away or not. We'd have noticed that! Are you *sure* there is not other code that is unrooting the object (and not setting your debug flag)?
brendan and I debugged this for a while. It turns out that there are *two* JS objects created for single DOM element. This happens because the generated DOM glue assumes (e.g., in NS_NewScriptXULElement) that asking for the script object for the parent *won't* recursively result in us asking for the child's script object again. XBL's ``onattach'' event handler can break this assumption, and that's exactly what happens in this case. We discovered this by adding a latch in nsXULElement's GetScriptObject() method (``mCreatingScriptObject''), and then assert-botching if we re-enter while building the script object.
You two should be aware of this.
I suggested to waterson that we change the glue code generated by idlc so NS_NewScriptXULElement and the rest all detect the recursive set. One way to do this: set *aReturn = nsnull; at the top of the function, before QI'ing aParent. Then test whether *aReturn became non-null during that call. The rest of the calls out from NS_NewScriptXULElement look ok to me -- NS_InitXULElementClass can't do any GetScriptObject on aSupports, and the QI from aSupports to nsIXULElement shouldn't. But hyatt should comment -- who knows where XBL might lurk? Cc'ing jst too, as we're talking about a deathbed hack to idlc. /be
brendan: I think that we actually need to do this in nsXULElement::GetScriptObject() (et. al), otherwise we'll run the onbindingattached handler twice for the child element.
(Also, relying on aReturn pointing to the same place seems sort of scary.)
waterson: right, we need to do this in several places, and we do need some well-known storage (however veneered by interface methods) that latches things so we don't double-create. Reording code, given all the layers and generality, won't guarantee that this problem is vanquished. I think we need some memory-based mechanism. Anyone have a better idea? /be
Summary: crash with danling GC root from nsXULElement::mScriptObject → crash with dangling GC root from nsXULElement::mScriptObject
I think this should do it: 1. Instead of creating script object directly into mScriptObject (or slots->mScriptObject), use stack variable. After creating script object, check to see if mScriptObject (or slots->mScriptObject) was set, indicating that we re-entered. If so, disconnect JSObject from native object, and ``discard'' redundant JSObject. This is a bit unsightly (adding JS_SetPrivate() goop to unlink native from JSObject), but I consider this a somewhat temporary hack given that the XPConnect'd DOM will make this all obsolete. N.B. that I fixed nsGenericElement, too, just in case somebody starts using XBL on HTML elements. 2. Add check for null script object to XUL element: if null, don't JS GC root the null pointer. 3. Add debug-only assert-botch code to notice if a XUL element is still rooted in dtor. 4. Unrelated gratuitous cleanup of XUL script proto manhandling of roots. Make nsJSRuntimeServiceImpl support weak references. r=/sr=, anyone?
*** Bug 70959 has been marked as a duplicate of this bug. ***
jst: brendan suggested that we could avoid this problem by enqueuing XBL's attach handler stuff and processing later, after script objects have been created. Dunno if you're going to need to wrestle with this problem while XPConnectifying the DOM...I'd be interested in what you thought...
r/sr=brendan@mozilla.org, looks good to me. /be
Doesn't this patch introduce a leak (which is better than a crash, but still...) in the case where we do re-enter GetScriptObject()? We create the script object, check if we re-ented and if we did we do a ::JS_SetPrivate(cx, newobject, nsnull), this will make the finalizer for the new script object *not* do the release on the native pointer, as it usually does, don't we need a (ew, this is uggly, but...) this->Release(); call (or maybe ::JS_GetPrivate(cx, newobject)->Release()) after ::JS_SetPrivate() to match the AddRef() that's done by factory->NewScriptElement()? Or did I overlook something here? As to what this will mean in the new XPConnectified DOM world I have no idea yet, we need to discuss this with jband once XBL actually works with an XPConnected DOM.
jst's smart, and stuff! /be
Ok, I opted for leaving *out* the changes to xpconnect (changing nsJSRuntimeServiceImpl to support weak references), and left the XUL proto element root management stuff alone. I started seeing bizarre shutdown crashes (from PREF_Shutdown(), through JS_GC). I suspect jband fears may have been realized, and a different runtime got created at shutdown.
Even in the new world, XBL will need a hook for when a DOM element is first being exposed to JS. This is needed for display: none elements, since I can't afford to waste time resolving style on display: none elts to determine what binding to attach. I have to do this lazily and it has to be done before the object is exposed to script.
waterson: I'm neutral on weakref support in nsJSRuntimeServiceImpl. I think it is safe in and of itself. Your prefs shutdown crash is probably bug 71237. I think that one is due to prefs trying to kill its JSContext late w/o regard to whether ot not the nsJSRuntimeServiceImpl has already shutdown and killed the one and only JSRuntime. It may be that changing your code to not hold a strong reference to the nsJSRuntimeServiceImpl exacerbated the problem by increasing the likelihood that the nsJSRuntimeServiceImpl would die earler. But, I don't think we are seeing the problem of a zombie nsJSRuntimeServiceImpl.
Fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Component: XP Toolkit/Widgets: XUL → XUL
QA Contact: jrgmorrison → xptoolkit.widgets
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: