Closed Bug 398219 Opened 12 years ago Closed 12 years ago

function objects cloned by XPConnect keep hidden window alive late into shutdown

Categories

(Core :: XPConnect, defect, P2)

x86
Linux
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: dbaron, Assigned: igor)

References

Details

(Keywords: memory-leak)

Attachments

(3 files, 4 obsolete files)

I was investigating why the NoScript extension leaks on shutdown, and I came across something that seems like a rather general problem.

I have a cycle collector trace that looks like this:

nsCycleCollector: JS Object (BackstagePass) 0xb733a7e0 was not collected due to 1
  external references (643 total - 642 known)
  An object expected to be garbage could be reached from it by the path:
    JS Object (BackstagePass) 0xb733a7e0
    JS Object (Object) 0xb692fd20
    JS Object (Object) 0xb692c700
    JS Object (XPCWrappedNative_NoHelper) 0xb7f5b9e0
    JS Object (Function - removeObserver) 0xb0b23320
    JS Object (Function - removeObserver) 0xb0b25bc0
    JS Object (ChromeWindow) 0xb35d0260
    nsGlobalWindow 0x8ab742c

This is bad -- something in a JS component keeping a window alive.  (It's not so bad since it's the hidden window.)

From a JS_DumpGCHeap right before this, I have the following GC traces that provide a little bit more information:

0xb0b23360 Function b0b27858      via nsXPCWrappedJS[nsIFactory,0x8875418].mJSObj(0xb692fd20 Object)._instance(0xb692c700 Object).caps(0xb7f5b8c0 XPCWrappedNative_NoHelper).removeObserver

0xb0b27858 function removeObserver via nsXPCWrappedJS[nsIFactory,0x8875418].mJSObj(0xb692fd20 Object)._instance(0xb692c700 Object).caps(0xb7f5b8c0 XPCWrappedNative_NoHelper).removeObserver(0xb0b23360 Function).private

0xb0b25bc0 Function b0b27858      via nsXPCWrappedJS[nsIFactory,0x8875418].mJSObj(0xb692fd20 Object)._instance(0xb692c700 Object).caps(0xb7f5b8c0 XPCWrappedNative_NoHelper).removeObserver(0xb0b23360 Function).object


What's going on here is that "caps" is an XPConnect wrapper for an nsIPrefBranch2 (representing preferences under "capability.policy.").  Its removeObserver property is a cloned function object -- presumably created by a call to JS_CloneFunctionObject from xpc_CloneFunctionObject from DefinePropertyIfFound, which clones the function object cached on an XPCNativeMember.

However, cloned functions in JS keep the original function alive, and in this case the original function's parent is the global of the safe JS context at the time we first used the function, which is the hidden window.  (But by this point in shutdown, the hidden window is no longer the safe JS context -- we've created a new safe JS Context.)

(The window being alive, in turn, prevents nsLayoutStatics from shutting down, which keeps many services, including nsXPConnect alive.  This eventually gets partly cleaned up at shutdown, although not enough to destroy the window -- then again, it may not be the actual root cause of the leak, but I'm reasonably confident (thanks to DEBUG_CC output) that it's why the leak entrains the hidden window and thus everything held by nsLayoutStatics.)


I'm not entirely sure that something is really wrong here, but it seems like if this weren't the case we'd probably have an easier time shutting down without leaking.
The "cloned function" concept seems to be causing us all kinds of problems :(
The reason the cloning keeps the original function alive is that the cloning, while sharing JSFunction between the clone and the original function object, keeps JSFunction.object pointing to the original. That pointer is used for 2 purposes:

1. The object is used as a hashtable to store argument name mappings through so-called hidden properties.

2. SpiderMonkey exposes JSFunction through public API and allows to access the function object through JSFunction.

So to avoid keeping alive the original function object together with its parent chain the following has to be done:

1. The storage of hidden properties has to be moved into JSFunction itself.

2. JSFunction can be renamed into JSFunctionInternal and JSFunction defined as an alias for JSObject preserving API compatibility.

I will try to see how much efforts is necessary for the first item.
It's true that the very old (1995, pre-dating SpiderMonkey) API design decision is causing us stress. If we can equate public JSFunction and JSObject pointer, and make the private state (suggest it still be called JSFunction -- the public type is opaque and used only to type-tag pointers) not link to any object, great.

If you can get rid of SPROP_IS_DUPLICATE and related code, even better. Just don't add too much mechanism that duplicates the scope/property-tree, and don't break any torture tests that synthesize functions with lots (approaching the 64K limit) of locals/formals. Thanks,

/be
It turned it is not that simple to move the storage for argument and variable names out of hidden properties to eventually address the leak. So I will split that process into separated bugs that can be tested and landed separately.
Assignee: nobody → igor
Depends on: 398609
I filed bug 398609 with the intention to hide the details of argument and variable name lookup for the functions behind tailored API.
Just to be clear -- from a Firefox/Gecko perspective, I don't think this bug is particularly important.
(In reply to comment #6)
> Just to be clear -- from a Firefox/Gecko perspective, I don't think this bug is
> particularly important.

On the other hand resolving it should reduce the memory footprint for the browser through better data structures in SM. Plus it will simply SM code as the patch for bug 398609 witnesses.

I *think* I'm seeing this in bug 388573 and bug 401345. Any chance we could fix this in 1.9?
(In reply to comment #8)
> I *think* I'm seeing this in bug 388573 and bug 401345. Any chance we could fix
> this in 1.9?
> 

Yes: after landing bug 398609 and bug 399544 the fix would be straightforward.
Depends on: 399544
No longer depends on: 398609
Flags: blocking1.9?
Moving to B+, P2 based on the fact that this might help address the extension leaks.  If that is an incorrect reading of the comments please adjust...
Flags: blocking1.9? → blocking1.9+
Priority: -- → P2
Keywords: mlk
Depends on: 404935
Attached patch v1 (obsolete) — Splinter Review
[The patch should be applied on top of the patch from bug 404935 comment 7.]

This is more of a workaround rather a proper solution. The idea behind the patch is to clear for to-be-cloned functions and regexps proto and parent slots. These slots are not used in any case so clearing them allows to break the references to the variable object which is used during compilation.

This patch makes it very explicit that there is no need to create JSObject wrapper for JSFunction created during the compilation. Unfortunately even with the bug 399544 fixed the compiler still have to create the useless wrappers for API compatibility. The problem with the plan described in comment 2 is newScriptHook which takes JSFunction and from which an implementation may call JS_GetFunctionObject.
Attachment #289954 - Flags: review?(brendan)
The patch from comment 11 should be applied on top of the patch from bug 404935 comment 10.
(In reply to comment #11)
> This patch makes it very explicit that there is no need to create JSObject
> wrapper for JSFunction created during the compilation. Unfortunately even with
> the bug 399544 fixed the compiler still have to create the useless wrappers for
> API compatibility. The problem with the plan described in comment 2 is
> newScriptHook which takes JSFunction and from which an implementation may call
> JS_GetFunctionObject.     

How about we break API compatibility by adding a leading cx parameter to JS_GetFunctionObject and making it fallible -- making it lazily construct the function's object when called?

Or to keep API compatibility (but we don't need to for Mozilla 2 at least, and we probably could get away with select breakage where justified even for 1.9), let it return null but add a new JS_GetOrCreateFunctionObject, or some such?

/be
(In reply to comment #13)
> > The problem with the plan described in comment 2 is
> > newScriptHook which takes JSFunction and from which an implementation may call
> > JS_GetFunctionObject.     
...
> Or to keep API compatibility...,
> let it return null but add a new JS_GetOrCreateFunctionObject, or some such?

I see an incompatible change in the behavior of the API that still allows code to compile is worse than the change that would force to alter the exist code. 

Since the problem here comes from newScriptHook, I would suggest to change that not to taker JSFunction at all. Since the bug 379410 has already shown the deficiency of the current API, the change should address that bug as well and making at least one user of SM happier as a result. 
Attached patch v1b (obsolete) — Splinter Review
This is a CVS diff version of the patch.
Attachment #289954 - Attachment is obsolete: true
Attachment #290720 - Flags: review?(brendan)
Attachment #289954 - Flags: review?(brendan)
Comment on attachment 290720 [details] [diff] [review]
v1b


> static JSFunction *
> NewCompilerFunction(JSContext *cx, JSTreeContext *tc, JSAtom *atom,
>                     JSBool lambda)
> {
>     JSObject *parent;
>+    JSFunction *fun;
> 
>     parent = (tc->flags & TCF_IN_FUNCTION) ? tc->fun->object : cx->fp->varobj;
>-    return js_NewFunction(cx, NULL, NULL, 0,
>-                          JSFUN_INTERPRETED | (lambda ? JSFUN_LAMBDA : 0),
>-                          parent, atom);
>+    fun = js_NewFunction(cx, NULL, NULL, 0,
>+                         JSFUN_INTERPRETED | (lambda ? JSFUN_LAMBDA : 0),

Might be worth reparameterizing the few (formerly one) functions in jsparse.c that take JSBool lambda, to take uintN lambda, which is JSFUN_LAMBDA or 0, to avoid ?: here. Depends on how much it costs for callers to synthesize the flag constant vs. 1 in the "lambda" case.

r/a=me with that, or not if it turns out to lose in code size.

/be
Attachment #290720 - Flags: review?(brendan)
Attachment #290720 - Flags: review+
Attachment #290720 - Flags: approval1.9+
Attached patch v2 (obsolete) — Splinter Review
This is an alternative to v1 that passes 0 or JSFUN_LAMBDA, not JS_FALSE or JS_TRUE to the function parsing code. This gives the same amount of code with GcC at -O, but it is slightly more readable. To emphasis that the type of lambda is numerical the patch uses explicit compare against 0 when testing if the function is a lambda.
Attachment #291131 - Flags: review?(brendan)
Attachment #291131 - Flags: review?(brendan)
Attachment #291131 - Flags: review+
Attachment #291131 - Flags: approval1.9+
Whiteboard: [needs landing]
I checked in the patch from comment 17 to the cVS trunk:

Checking in jsfun.c;
/cvsroot/mozilla/js/src/jsfun.c,v  <--  jsfun.c
new revision: 3.241; previous revision: 3.240
done
Checking in jsinterp.c;
/cvsroot/mozilla/js/src/jsinterp.c,v  <--  jsinterp.c
new revision: 3.392; previous revision: 3.391
done
Checking in jsparse.c;
/cvsroot/mozilla/js/src/jsparse.c,v  <--  jsparse.c
new revision: 3.314; previous revision: 3.313
done
Checking in jsregexp.c;
/cvsroot/mozilla/js/src/jsregexp.c,v  <--  jsregexp.c
new revision: 3.170; previous revision: 3.169
done
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Igor, I think this caused the following regressions:

ecma/extensions/15-1.js
MyObject.prototype.__proto__ == Object.prototype expected: true actual: false reason: wrong value

ecma/extensions/15-1.js
TestCase.prototype.__proto__ == Object.prototype expected: true actual: false reason: wrong value

ecma_2/extensions/instanceof-002.js
pat.__proto__.__proto__.__proto__.__proto__ == Object.prototype expected: true actual: false reason: wrong value

ecma_2/extensions/instanceof-002.js
pat.__proto__.__proto__.__proto__.__proto__.__proto__ == null expected: true actual: false reason: wrong value

js1_3/extensions/proto_10.js
pat.__proto__.__proto__.__proto__.__proto__ == Object.prototype expected: true actual: false reason: wrong value

js1_3/extensions/proto_10.js
pat.__proto__.__proto__.__proto__.__proto__.__proto__ == null expected: true actual: false reason: wrong value


I backed out the check in.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch v3Splinter Review
The new version removes the code that tries to shares function.prototype between the clones to address the regressions above. The sharing effectively leads to the same kind of reference problems that this bug is about as it stores in a shared JSFunction.object an object that depends of the compilation scope.
Attachment #290720 - Attachment is obsolete: true
Attachment #291131 - Attachment is obsolete: true
Attachment #292825 - Flags: review?(brendan)
Attached patch delta between v2 and v3 (obsolete) — Splinter Review
Comment on attachment 292825 [details] [diff] [review]
v3

Sorry, should have remembered this misbegotten code in fun_resolve.

The test will need adjusting still, right?

/be
Attachment #292825 - Flags: review?(brendan)
Attachment #292825 - Flags: review+
Attachment #292825 - Flags: approval1.9+
(In reply to comment #22)
> Created an attachment (id=292826) [details]
> delta between v2 and v3

Looks corrupted, nearly empty with some quilt mumbling at the start.

/be
Here is a file for real
Attachment #292826 - Attachment is obsolete: true
I checked in the patch from comment 21 to the CVS trunk:

http://bonsai.mozilla.org/cvsquery.cgi?module=PhoenixTinderbox&branch=HEAD&cvsroot=%2Fcvsroot&date=explicit&mindate=1198070280&maxdate=1198070451&who=igor%mir2.org
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → FIXED
So it seems like the underlying bug here isn't really fixed -- I'm still seeing a window (presumably the hidden window) being kept alive through functions -- there's just one additional step in the GC traces now, so they look like this:

nsCycleCollector: JS Object (BackstagePass) 0x2aaaafc9d500 was not collected due to external references
  An object expected to be garbage could be reached from it by the path:
    JS Object (BackstagePass) 0x2aaaafc9d500
    JS Object (XPCWrappedNative_NoHelper) 0x2aaab5443880
    JS Object (Function - newURI) 0x2aaab6d4c400
    JS Object 0x2aaabab1b5a0
    JS Object (Function - newURI) 0x2aaaba3ae0c0
    JS Object (ChromeWindow) 0x2aaab6d52300
    nsGlobalWindow 0xb67f70
 (In reply to comment #28)

Any hints about what is exactly that JS Object 0x2aaabab1b5a0?
It's a JS function (an object traced with JSTRACE_FUNCTION).
(In reply to comment #30)
> It's a JS function (an object traced with JSTRACE_FUNCTION).

This is strange as it means that there is a link from the clone to the function. I will check it an file a new bug.
I'm definitely still seeing this too. See for example:

0xa734f20 ChromeWindow 15e66f60  via /Users/peterv/Library/Application Support/Firefox-debug/Profiles/debug/extensions/{0538E3E3-7E9B-4d49-8831-A227C80A7AD3}/components/nsForecastfox.js(0x16ee7e20 BackstagePass).gHelpersBundle(0xf3a1ec0 XPCWrappedNative_NoHelper).GetStringFromName(0x15de6e40 Function).private(0x16edf6f0 function).object(0x16ee9680 Function).__parent__

I tried the hack in this attachment and a bunch of extension leaks go away, so we definitely need to figure out how to fix this. STOBJ_GET_PARENT(fun->object) and STOBJ_GET_PROTO(fun->object) are still true in xpc_CloneJSFunction, wasn't that what the patches in this bug were trying to change? I'm pretty sure this is not the right fix though. Igor, when is it ok to set parent and proto of fun->object to null?
Another way I can fix those leaks is by setting proto and parent of funobj to null in XPCNativeMember::Resolve (http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/js/src/xpconnect/src/xpcwrappednativeinfo.cpp&rev=1.16&mark=195#180).
Depends on: 412491
I filed bug 412491 on the problem noted in comment 32.
You need to log in before you can comment on or make changes to this bug.