Closed Bug 28570 Opened 25 years ago Closed 25 years ago

circular reference leak: nsJSContext and nsXULPrototypeDocument

Categories

(Core :: XUL, defect, P3)

x86
Linux
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: dbaron, Assigned: waterson)

References

Details

(Keywords: memory-leak)

Attachments

(9 files)

Chris - this is the bug I mentioned to you in email last weekend, and I'm
assigning it to you as you said I could.

DESCRIPTION:  An nsJSContext and an nsXULPrototypeDocument both seem to be
leaking because they're holding references to each other.  Both of these objects
are probably holding on to a good bit of other stuff as well.

STEPS TO REPRODUCE:
 * start mozilla
 * hit "Exit" in the profile manager (I'm 90% sure that's how I got the refcount
balancer logs I'll attach)

ADDITIONAL INFORMATION:

I'll attach a refcount balancer log for both objects, made with
--ignore-balanced.  The log for the nsXULPrototypeDocument is huge, and I
attempted to match up the addrefs and the corresponding releases with numbers at
the left edge.  The one with "??" at the left edge is the one that I **think**
is leaking (which would mean this is a case of circular reference).
Status: NEW → ASSIGNED
Target Milestone: M15
I think the cause of this leak is that nsXULPrototypeDocument::GetScriptObject
returns "this" as the script object.  I don't think that's what's supposed to
happen, because nsJSContext::InitContext creates a reference to the script
object.  In other words, I don't think a single class is intended to implement
both nsIScriptObjectOwner and nsIScriptGlobalObject (or something like that).
I think my previous comment is wrong.
It might be worth comparing the way the JSContext is handled in this bug and the
way it's handled in the webshell:
http://lxr.mozilla.org/seamonkey/source/docshell/base/nsDocShell.cpp#2185
(created in EnsureScriptEnvironment)
http://lxr.mozilla.org/seamonkey/source/webshell/src/nsWebShell.cpp#624
(destroyed in ~nsWebShell)
I have a patch that fixes this problem and dramatically reduces the leaks when
clicking "Exit" in the profile manager (to 2352 bytes, from 12668 bytes).

However, if I don't click exit in the profile manager, it hangs during the
destruction of the JSContext that used to be leaked, with the following stack
trace:
#0  0x404ce1bb in __sigsuspend (set=0xbfffed94)
    at ../sysdeps/unix/sysv/linux/sigsuspend.c:48
#1  0x4024dfe0 in pthread_cond_wait (cond=0x8185d14, mutex=0x8185cb8)
    at restart.h:49
#2  0x4022e958 in PR_WaitCondVar (cvar=0x8185d10, timeout=4294967295)
    at ptsynch.c:360
#3  0x4008ef01 in js_GC (cx=0x82ade78) at jsgc.c:743
#4  0x4008eda5 in js_ForceGC (cx=0x82ade78) at jsgc.c:678
#5  0x4006c3f9 in js_DestroyContext (cx=0x82ade78, gcmode=JS_FORCE_GC)
    at jscntxt.c:179
#6  0x400615b3 in JS_DestroyContext (cx=0x82ade78) at jsapi.c:794
#7  0x40315ae8 in nsJSContext::~nsJSContext (this=0x82b7fe8, __in_chrg=3)
    at nsJSEnvironment.cpp:185
#8  0x40315c91 in nsJSContext::Release (this=0x82b7fe8)
    at nsJSEnvironment.cpp:194
#9  0x402879da in nsCOMPtr<nsIScriptContext>::assign_assuming_AddRef (
    this=0x82cb178, newPtr=0x0) at ../../dist/include/nsCOMPtr.h:416
#10 0x40287a20 in nsCOMPtr<nsIScriptContext>::assign_with_AddRef (
    this=0x82cb178, rawPtr=0x0) at ../../dist/include/nsCOMPtr.h:787
#11 0x402908c7 in nsCOMPtr<nsIScriptContext>::operator= (this=0x82cb178, 
    rhs=0x0) at ../../dist/include/nsCOMPtr.h:526
#12 0x40c6e695 in nsXULPDGlobalObject::SetContext (this=0x82cb170, 
    aContext=0x0) at nsXULPrototypeDocument.cpp:437
#13 0x40c6d8ad in nsXULPrototypeDocument::~nsXULPrototypeDocument (
    this=0x8282bc0, __in_chrg=3) at nsXULPrototypeDocument.cpp:174
#14 0x40c6da20 in nsXULPrototypeDocument::Release (this=0x8282bc0)
    at nsXULPrototypeDocument.cpp:180
#15 0x40c88664 in nsCOMPtr<nsIXULPrototypeDocument>::~nsCOMPtr (
    this=0x827ff50, __in_chrg=2) at ../../../dist/include/nsCOMPtr.h:434
#16 0x40c37643 in nsXULDocument::~nsXULDocument (this=0x827fd70, __in_chrg=3)
    at nsXULDocument.cpp:505
#17 0x40c37e44 in nsXULDocument::Release (this=0x827fd70)
    at nsXULDocument.cpp:598

I'll attach the patch in a minute.  Any ideas on what's going on?  Is it a
problem with the fix or an existing problem?

Ignoring the crash, does the patch look like the right thing to do in the first
place?
Actually, it's not a "crash" (as I called it),  it just hangs (in thread stuff).
The cause of the hang seems to be the thread-safety code in JS_GC (see
http://lxr.mozilla.org/seamonkey/source/js/src/jsgc.c#682 ), which doesn't, if I
understand it correctly, account for the possibility that there could be garbage
collection occurring on the same thread but for a different JS context.

I neglected to look farther down the stack trace I included above.  It's all
within a JSContext::~JSContext on a different object:

#0  0x404ce1bb in __sigsuspend (set=0xbfffed94)
    at ../sysdeps/unix/sysv/linux/sigsuspend.c:48
#1  0x4024dfe0 in pthread_cond_wait (cond=0x8185d14, mutex=0x8185cb8)
    at restart.h:49
#2  0x4022e958 in PR_WaitCondVar (cvar=0x8185d10, timeout=4294967295)
    at ptsynch.c:360
#3  0x4008ef01 in js_GC (cx=0x8332b40) at jsgc.c:743
#4  0x4008eda5 in js_ForceGC (cx=0x8332b40) at jsgc.c:678
#5  0x4006c3f9 in js_DestroyContext (cx=0x8332b40, gcmode=JS_FORCE_GC)
    at jscntxt.c:179
#6  0x400615b3 in JS_DestroyContext (cx=0x8332b40) at jsapi.c:794
#7  0x40315ae8 in nsJSContext::~nsJSContext (this=0x8200d50, __in_chrg=3)
    at nsJSEnvironment.cpp:185
#8  0x40315c91 in nsJSContext::Release (this=0x8200d50)
    at nsJSEnvironment.cpp:194
#9  0x402879da in nsCOMPtr<nsIScriptContext>::assign_assuming_AddRef (
    this=0x82b7ff0, newPtr=0x0) at ../../dist/include/nsCOMPtr.h:416

...

#28 0x400b0dda in js_FinalizeObject (cx=0x82850a0, obj=0x839ae30)
    at jsobj.c:1387
#29 0x4008f567 in js_GC (cx=0x82850a0) at jsgc.c:891
#30 0x4008eda5 in js_ForceGC (cx=0x82850a0) at jsgc.c:678
#31 0x4006c3f9 in js_DestroyContext (cx=0x82850a0, gcmode=JS_FORCE_GC)
    at jscntxt.c:179
#32 0x400615b3 in JS_DestroyContext (cx=0x82850a0) at jsapi.c:794
#33 0x40315ae8 in nsJSContext::~nsJSContext (this=0x8285070, __in_chrg=3)
    at nsJSEnvironment.cpp:185
#34 0x40315c91 in nsJSContext::Release (this=0x8285070)
    at nsJSEnvironment.cpp:194
#35 0x402879da in nsCOMPtr<nsIScriptContext>::assign_assuming_AddRef (
    this=0x820aa04, newPtr=0x0) at ../../dist/include/nsCOMPtr.h:416
#36 0x40287a20 in nsCOMPtr<nsIScriptContext>::assign_with_AddRef (
    this=0x820aa04, rawPtr=0x0) at ../../dist/include/nsCOMPtr.h:787
#37 0x402908c7 in nsCOMPtr<nsIScriptContext>::operator= (this=0x820aa04, 
    rhs=0x0) at ../../dist/include/nsCOMPtr.h:526
#38 0x4027790a in nsWebShell::~nsWebShell (this=0x820a970, __in_chrg=3)
    at nsWebShell.cpp:630
#39 0x40277f64 in nsWebShell::Release (this=0x820a970) at nsWebShell.cpp:730
To see what would happen, I commented out a few lines of jsgc.c (the check that
there wasn't JS GC going on on another context (which said on another thread) so
that the JS GC would be allowed, and then I crashed in the outer JS GC.  To be
more specific, this:

Index: jsgc.c
===================================================================
RCS file: /cvsroot/mozilla/js/src/jsgc.c,v
retrieving revision 3.18
diff -u -r3.18 jsgc.c
--- jsgc.c	2000/02/04 02:01:41	3.18
+++ jsgc.c	2000/02/27 20:54:40
@@ -737,6 +737,9 @@
 	JS_NOTIFY_REQUEST_DONE(rt);
     }
 
+#if 0
+    /* Is this necessary?  We've already checked to make sure there
+     * isn't GC on this context. */
     /* If another thread is already in GC, don't attempt GC; wait instead. */
     if (rt->gcLevel > 0) {
 	while (rt->gcLevel > 0)
@@ -746,6 +749,7 @@
 	JS_UNLOCK_GC(rt);
 	return;
     }
+#endif
 
     /* No other thread is in GC, so indicate that we're now in GC. */
     rt->gcLevel = 1;


led to this:

#0  __libc_free (mem=0xdadadada) at malloc.c:2914
#1  0x40061c7f in JS_free (cx=0x8284ce0, p=0xdadadada) at jsapi.c:1042
#2  0x400b0daa in js_FinalizeObject (cx=0x8284ce0, obj=0x839af18)
    at jsobj.c:1392
#3  0x4008f507 in js_GC (cx=0x8284ce0) at jsgc.c:895
#4  0x4008eda5 in js_ForceGC (cx=0x8284ce0) at jsgc.c:678
#5  0x4006c3f9 in js_DestroyContext (cx=0x8284ce0, gcmode=JS_FORCE_GC)
    at jscntxt.c:179
#6  0x400615b3 in JS_DestroyContext (cx=0x8284ce0) at jsapi.c:794
#7  0x40315ae8 in nsJSContext::~nsJSContext (this=0x8284cb0, __in_chrg=3)
    at nsJSEnvironment.cpp:185
#8  0x40315c91 in nsJSContext::Release (this=0x8284cb0)
    at nsJSEnvironment.cpp:194
#9  0x402879da in nsCOMPtr<nsIScriptContext>::assign_assuming_AddRef (
    this=0x820aa0c, newPtr=0x0) at ../../dist/include/nsCOMPtr.h:416
#10 0x40287a20 in nsCOMPtr<nsIScriptContext>::assign_with_AddRef (
    this=0x820aa0c, rawPtr=0x0) at ../../dist/include/nsCOMPtr.h:787
#11 0x402908c7 in nsCOMPtr<nsIScriptContext>::operator= (this=0x820aa0c, 
    rhs=0x0) at ../../dist/include/nsCOMPtr.h:526
#12 0x4027790a in nsWebShell::~nsWebShell (this=0x820a978, __in_chrg=3)
    at nsWebShell.cpp:630
#13 0x40277f64 in nsWebShell::Release (this=0x820a978) at nsWebShell.cpp:730


I think fixing this requires someone who knows something about the JS code...
It looks to me like nsJSContext::~nsJSContext should probably use 
JS_DestroyContextNoGC instead of JS_DestroyContext. There is altogether too much 
GC'ing going on anyway. nsJSContext has its own method (nsJSContext::GC) for 
forcing a GC. It should use that as necessary. nsJSContext::GC might want to 
also add a re-entrance guard to its GC method.

It think it is broken for embedding code to be doing nested GC on the same 
thread and JSContext. I don't think the engine should have to guard against 
that (I could be wrong).

Also note that http://bugzilla.mozilla.org/show_bug.cgi?id=13350 gets into the 
need to use JS_BeginRequest/JS_EndRequest to protect against GC on one thread 
while other threads are doing stuff.
But it's not GC on the same JSContext.  It's GC on two different JSContexts, and
the engine doesn't seem to like that.
Blocks: 25911
Blocks: 21056
For reference, the differences in the leaks when clicking exit in the profile
manager are:
http://dbaron.student.harvard.edu/u/david/leaks/myfixes/28570-patch-after.txt
http://dbaron.student.harvard.edu/u/david/leaks/myfixes/28570-patch-before.txt
(I have a few other small leak fixes in my tree that could account for some
differences.  I last updated my tree this morning.)
The "differences" I mention are not differences between my before and after, but
between my data and what you might see.  (Although the things in my tree are
most likely entirely insignificant...)
I'm suggesting something like... (in nsJSEnvironment)

   ::JS_SetGlobalObject(mContext, nsnull);
-  ::JS_DestroyContext(mContext);
+  ::JS_DestroyContextNoGC(mContext);
+  /* do our own GC that has re-entrance protection */
+  GC();

... and ...

 NS_IMETHODIMP
 nsJSContext::GC()
 {
+  /* XXX ASSUME single threaded calling. */
+  /* Protect against re-entry */
+  static PRBool busy = PR_FALSE;
+  if (busy)
+    return NS_OK;
+  busy = PR_TRUE;
   ::JS_GC(mContext);
+  busy = PR_FALSE;
   return NS_OK;
 }

But what happens if a JSContext is destroyed but never garbage-collected?
(/me rushes in yelling about things he hardly knows about. added brendan and 
vidur to wipe my chin.)

I don't think JavaScript "contexts" have any relevance to the garbage collector 
at all (so worrying that a "context won't be garbage collected" is a red 
herring). I believe that the scope of the garbage collector spans the entire JS 
"runtime", which may include many contexts.

I'm afraid of trying to explain what a JS "context" is for fear that I'll just 
confuse the issue with my partial understanding. (brendan? vidur? jband?)
That makes sense.  jband's patch works.  I'm not sure why I didn't just try it
earlier...

Anyway, I'll attach the complete diffs of what I have in my tree, which works. 
I made one additional minor change to the code in nsXULPrototypeDocument.cpp. 
Waterson, could you take a close look at those changes and see if they're
right?  I realize that the contents of nsXULPrototypeDocument.cpp should
probably be put in a different order, but it's a lot easier to see what was
changed when that's not done.
At first blush, these patches seem fine to me. I'm gonna try running with them 
for a bit to make sure nothing evil jumps out.

vidur/brendan: could you review the changes to nsJSEnvironment?

I'm trying to think of a situation where the nsXULPDGlobalObject could outlive 
the nsXULPrototypeDocument, and thus risk returning a dangling pointer when 
calling GetGlobalObjectOwner? (If this is in fact a problem, we could solve it 
by using nsWeakRef or hand-coding something to null the back-pointer in 
nsXULPrototypeDocument's dtor.)

And yes, I agree, the code should be re-arranged. :-)
I think that problem about the dangling pointer is the one thing I changed
between the two versions of the patch, other than adding the JS stuff.  Note
that the destructor for the XULPrototypeDocument now has a
mGlobalObject->SetGlobalObjectOwner(nsnull);
for just that possibility, even though mGlobalObject will most likely be
destroyed almost immediately afterward.  Is that sufficient?
Oops! You're right. That's perfect.
****WARNING****

Do NOT checkin the code I suggested that does a nsJSContext::GC() *after* 
calling ::JS_DestroyContextNoGC(mContext). This makes BAD things happen because 
nsJSContext::GC() uses 'mContext' which JS has just been destroyed. Duh!

We could get by with just not doing a GC here - there are other calls to GC. But 
we are really holding a timebomb here.

We need to fix the reentrance protection in js_GC. (As dbaron pointed out) in 
the JS_THREADSAFE case we just don't deal correctly with nested JS calls on the 
same thread but on different JSContexts. The check "if (cx->gcActive)" just 
doesn't cut it. This fix is going to have to be very carefully made because we 
don't want to screwup the fancy dancing in the "gc called on multiple threads" 
stuff with its wait and notify code.

I'll think about this. Anyone want to propose a fix?
Damn, bugzilla mail got through my sabbatical mail filters.

All that bad old gc interlocking code comes from server-land, where jawahar and 
I developed it long ago, and where there is exactly one context per thread.  The 
client, OTOH, uses a context per DOM window, yet DOM windows are accessed by one 
thread only, so worlds collide.

Even for server-type embeddings, there is a stupid, longstanding AB-BA deadlock 
between js_DestroyContext and js_GC (see the XXX commment in jscntxt.c); 
chouck@geocast.com ran into this but worked around it as others have, by using 
JS_DestroyContextNoGC (which looks like it was added as a hack to avoid fixing 
the deadlock).

It seems to me any fix here must distinguish thread from context, using a new 
JSRuntime state variable, say void *gcThread; to point at the thread currently 
running the GC.  I leave that as an exercise for one of you non-sabbaticalized 
JS guys.  And please fix the js_DestroyContext vs. js_GC deadlock while you're 
at it, ok?

/be



...Looking for reviewers....

My proposed patch to JS engine is attached above. The avoidance of re-entry on 
the same thread seemed pretty straightforward. As brendan suggested, I also 
include a fix for the potential deadlock when js_DestroyContext does js_GC and 
some other thread is waiting in js_GC for the first thread's request to end. My 
fix is a punt to return early from js_GC in the thread doing the 
js_DestroyContext. Comment on that appreciated. 

r=brendan@mozilla.org, looks good (only one nit: nested assignment to 
currentThread in jsgc.c is a little funky, might pull it out to an assignment 
statement right before the if).  Thanks,

/be
Yeah, sometimes the urge to write clever code is too much for me. Patch with nit 
fixed is attached. I'll check it in when the tree opens after the beta1 branch.
Thanks.
It turns out that either:
 * I did a really horrible job testing my fix
 * something else changed since I wrote it
and it's impossible to click on links in the browser with my fix.  I really have
no idea what's causing this.  It doesn't seem like it's because objects aren't
leaked, because putting an NS_ADDREF_THIS in the constructor for
nsXULPrototypeDocument doesn't fix things.

I really have no idea what's causing this.  I imagine that there could be
something that's trying to QueryInterface an nsIScriptGlobalObjectOwner to an
nsIScriptGlobalObject, or vice versa, or nsIScriptGlobalObject to something
else, and that's now failing.  However, I don't know what it is, or where to
look.

I guess the thing to try is writing QueryInterface functions for these two
classes that return the other object in certain cases.  Is this a good thing to
do?  Or could the problem be somewhere else?
Hacking the QueryInterface to return the other object fixes the problem.  Is
that bad style?

Right now I have it as a mix of code and macros, which I'll need to clean up,
unless there's an easy way to write a queryinterface like this...
Except that with this fix it's leaked again... back to the refcount balancer...
I just checked in the fixes to the JS engine to protect against the gc 
reentrance problem.
*IGNORE* - massive spam changing open XPToolkit bug's QA contact to
jrgm@netscape.com
QA Contact: paulmac → jrgm
Fix checked in 2000-03-15 17:11 PST.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Component: XP Toolkit/Widgets: XUL → XUL
QA Contact: jrgmorrison → xptoolkit.widgets
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: