Closed Bug 383269 Opened 13 years ago Closed 13 years ago

leak of elements when quitting with 2 empty windows

Categories

(Core :: JavaScript Engine, defect, P1)

defect

Tracking

()

VERIFIED FIXED
mozilla1.9alpha6

People

(Reporter: dbaron, Assigned: brendan)

References

Details

(Keywords: memory-leak)

Attachments

(5 files, 2 obsolete files)

I've been looking into simple cases where we're leaking significant graphs of objects, and have found the following.

Steps to reproduce:
 1. start in a fresh firefox profile
 2. switch the preference in preferences -> Main -> Startup -> When Firefox Starts to "Show my windows and tabs from last time"
 3. open two new windows (which should be blank)
 4. close the other window, so you have only 2 blank windows
 5. do File | Quit
 6. start up again on that profile, with leak logging this tie
 7. do File | Quit

(I see the bug intermittently with a single empty window, but reliably with 2 empty windows.)

I see a bunch of objects leaked.  The cycle collector says it didn't collect one of the documents or JS contexts because (among a bunch of other reasons) there were outstanding references to an nsGenericElement that it couldn't account for, which according to the refcount balancer is a wrapped native.  JS_DumpHeap says the reason we didn't collect the wrapped native is the GC trace:

0x16986c0 XULElement 16aae90     via ...(0x185e740 Object).keys_(0x185e780 Array).0(0x185e6c0 Function).__parent__(0x185e580 Call).__proto__(0x17e8fc0 Call).arguments(0x19225c0 Object).1(0x1921900 Object).phishingWarden_(0x18e7640 Object).tabbrowser_(0x1698780 XULElement).__parent__(0x1698740 XULElement).__parent__(0x1698700 XULElement).__parent__

I'm not sure what the "..." at the beginning is.
Maybe related to bug 380837?
(In reply to comment #0)
> 0x16986c0 XULElement 16aae90     via ...(0x185e740 Object).keys_(0x185e780
> Array).0(0x185e6c0 Function).__parent__(0x185e580 Call).__proto__(0x17e8fc0
> Call).arguments(0x19225c0 Object).1(0x1921900 Object).phishingWarden_(0x18e7640
> Object).tabbrowser_(0x1698780 XULElement).__parent__(0x1698740
> XULElement).__parent__(0x1698700 XULElement).__parent__
> 
> I'm not sure what the "..." at the beginning is.

The first number in ...(0x185e740 Object) gives the address that can be matched against the first column of the output. Do you have it in your output? If so, it should allow to recover the full chain.
Ah, I was thinking that, but I didn't look for it in the first column, but instead at the end, so I didn't find it.  It's:

0x185e740 Object 1442e14         via /home/dbaron/builds/trunk-cc/obj/firefox-debugopt/dist/bin/components/nsSafebrowsingApplication.js(0x17b0b80 BackstagePass).gDataProvider(0x181c2c0 Object).prefs_(0x181c300 Object).observers_(0x181c340 Object).browser.safebrowsing.dataProvider
Ah, the pref observers are leaking.  We should null out PROT_PhishingWarden.prefs_ in PROT_PhishingWarden.prototype.shutdown.  This probably fixes bug 380837 too.
It seems scary to me that you have references into a specific document from a JS component.  Shouldn't something be ensuring those references are dropped when the document goes away, rather than waiting until shutdown?
PROT_PhishingWarden.prototype.shutdown gets called on XUL window unload.  It's just poorly named.
It looks (from the source) like gDataProvider is a PROT_Application, not a PROT_PhishingWarden.
er, sorry, a PROT_DataProvider.
Assignee: dbaron → nobody
Flags: blocking-firefox3?
I'm not sure how gDataProvider.prefs_ gets an observer containing PROT_PhishingWarden or XUL objects.  The pref observers are all within a small scope:
http://mxr.mozilla.org/seamonkey/source/browser/components/safebrowsing/content/globalstore.js
This GC trace looks like a class variable that should be an instance variable, but I can't find the problem...
(In reply to comment #9)
> I'm not sure how gDataProvider.prefs_ gets an observer containing
> PROT_PhishingWarden or XUL objects.

Comment 0 (cited in comment 2) shows a reference path.

> The pref observers are all within a small scope:
> http://mxr.mozilla.org/seamonkey/source/browser/components/safebrowsing/content/globalstore.js

If I paste the path in comment 3 to the one in comment 0, and wrap lines, I get:

  .../dist/bin/components/nsSafebrowsingApplication.js(0x17b0b80 BackstagePass)
  .gDataProvider(0x181c2c0 Object)
  .prefs_(0x181c300 Object)
  .observers_(0x181c340 Object)
  .browser.safebrowsing.dataProvider.keys_(0x185e780 Array)
  .0(0x185e6c0 Function)
  .__parent__(0x185e580 Call)
  .__proto__(0x17e8fc0 Call)
  .arguments(0x19225c0 Object)
  .1(0x1921900 Object)
  .phishingWarden_(0x18e7640 Object)
  .tabbrowser_(0x1698780 XULElement)
  .__parent__(0x1698740 XULElement)
  .__parent__(0x1698700 XULElement)
  .__parent__

Any old Lispers around will recognize the perils of environment capture. There's a function at index 0 in the keys_ array held by gDataProvider.prefs_.observers_ &c that is a closure: it entrains an object via its arguments array, at index 1 (the second actual arg). This is an Object with a .phishingWarden_ property referencing an Object containing a .tabbrowser_ and the rest is history.

/be
(In reply to comment #11)
>   .browser.safebrowsing.dataProvider.keys_(0x185e780 Array)

This should be two lines:
   .browser.safebrowsing.dataProvider(0x185e740 Object)
   .keys_(0x185e780 Array)

> the rest is history.

It's still not clear; the thing added as an observer/listener (a PROT_Controller, 0x1921900, presumably mangled by BindToObject into 0x185e6c0, although I don't really understand what structure BindToObject should produce) was added to something else.
Yeah, it's unclear to me how a PROT_Controller became an argument to PROT_DataProvider.prefs_.  In particular, what's __parent__ or __parent__.__proto__ on a Function?  The functions added to keys_ should all have |this| (PROT_DataProvider) at arguments[1].

FWIW, it seems like this is pretty straight forward to fix: break the chain during window unload by nulling out tabbrowser_ or phishingWarden_.
(In reply to comment #13)
> Yeah, it's unclear to me how a PROT_Controller became an argument to
> PROT_DataProvider.prefs_.  In particular, what's __parent__ or
> __parent__.__proto__ on a Function?

Since __parent__ in this case is a Call object, it means a closure that holds all its environment as Brendan ponted out.

(In reply to comment #3)
> Ah, I was thinking that, but I didn't look for it in the first column, but
> instead at the end, so I didn't find it.  It's:
> 
> 0x185e740 Object 1442e14         via
> /home/dbaron/builds/trunk-cc/obj/firefox-debugopt/dist/bin/components/nsSafebrowsingApplication.js(0x17b0b80
> BackstagePass).gDataProvider(0x181c2c0 Object).prefs_(0x181c300
> Object).observers_(0x181c340 Object).browser.safebrowsing.dataProvider
> 

But what XPCOM references /home/dbaron/builds/trunk-cc/obj/firefox-debugopt/dist/bin/components/nsSafebrowsingApplication.js(0x17b0b80
 BackstagePass)?

I.e. what is is the full cycle and why the cycle collector can not detect it?
It's not a cycle.  That's the global scope in a JS component, which is a root, and should be.

Brendan thinks this is a JS engine bug, and gave me a long explanation yesterday which I asked him to put in the bug.
I am concerned that something is polluting a prototype object's 'arguments' slot, entraining junk from the "last get" on a Call object that's defunct, keeping stuff alive that should not be. Igor, see the __proto__ of the Call object, which is also a Call object (therefore a Call prototype?). More in a bit.

/be
To David Baron: 

Can you attach the whole trace dump?
(In reply to comment #17)
> I am concerned that something is polluting a prototype object's 'arguments'
> slot, entraining junk from the "last get" on a Call object that's defunct,
> keeping stuff alive that should not be. Igor, see the __proto__ of the Call
> object, which is also a Call object (therefore a Call prototype?). More in a
> bit.

Yes, this is strange. Looks like explicit __proto__ manipulation or a bug somewhere.
(In reply to comment #19)
> Yes, this is strange. Looks like explicit __proto__ manipulation or a bug
> somewhere.

__proto__ is not defined in a Call object or its prototype -- see js_InitCallClass.

/be
This looks like a bug in SpiderMonkey. Here is a simple js shell session with dumpHeap output which shows that somehow the arguments object is managed to be stored in Call.__prototype__.
This is indeed a bug in SpiderMonkey. When the second line of the example above is excuted,

function outer() {  var x = arguments; return function inner() { return x }; }
var f = outer("abc");

js_PutCallObject is eventually called, http://lxr.mozilla.org/seamonkey/source/js/src/jsfun.c#625.

At line 646 that in turn calls js_GetProperty(cx, callobj, argsid, &aval) which invokes js_NativeGet, http://lxr.mozilla.org/seamonkey/source/js/src/jsobj.c#3412

At line 3445 that in turns calls 
    LOCKED_OBJ_SET_SLOT(pobj, slot, *vp); 
since cx->runtime->propertyRemovals == sample is true. And this exactly the line where Call.__prototype__.arguments is assigned.
Here is a test case for the bug. Since __proto__ censors Call objects, it relies on the generators finally invoked by GC so when 380469 is fixed, it will stop working.

~/m/trunk/mozilla/js/src> cat ~/m/y.js
function outer() {  var x = arguments; return function inner() { return x }; }

var gen_was_closed = false;

function gen()
{
  try {
    yield 1;
  } finally {
    gen_was_closed = true;
  }
}

var g = gen();
g.next();

var f = outer(g);
f = null;
g = null;
gc();
gc();
if (!gen_was_closed)
  throw "leaked arguments object";
~/m/trunk/mozilla/js/src> js -v 170 ~/m/y.js
before 27696, after 27696, break 08bd8000
before 27696, after 27696, break 08bd8000
uncaught exception: leaked arguments object
Attached file test case for the js shell (obsolete) —
Here is a test case for js shell that does not relies on generators and just measures the performance of GC to detect the leak.
The new test case that passes with the bug fixed.
Attachment #267550 - Attachment is obsolete: true
The bug happens because arguments property for Class objects is unshared but has getter and setter defined. It means that the corresponding JSScopeProperty.slot points to a valid slot and this is the slot that caches the last getter result after each execution of call_GetProperty. 

I am not sure where the logic is broken , but it seems strange in js_NativeGet to cache the getter result when obj != pobj.
The arguments property of call objects must be assignable -- you can rebind arguments to 42, e.g. Hence its lack of JSPROP_SHARED. We could try starting it out as JSPROP_SHARED in the Call prototype, and then overriding it in a Call instance for a function that mutates it with an unshared (slot-ful) property. Taking this bug.

/be
Assignee: nobody → general
Component: Phishing Protection → JavaScript Engine
Flags: blocking-firefox3?
OS: Linux → All
Priority: -- → P1
Product: Firefox → Core
QA Contact: phishing.protection → general
Hardware: PC → All
Target Milestone: --- → mozilla1.9alpha6
Flags: blocking1.9?
(In reply to comment #27)
> The arguments property of call objects must be assignable -- you can rebind
> arguments to 42, e.g. Hence its lack of JSPROP_SHARED. 

But is exactly the purpose of assigning a slot with a getter result in js_NativeGet when obj != pobj? I do not see how this can be useful. 
(In reply to comment #28)
> (In reply to comment #27)
> > The arguments property of call objects must be assignable -- you can rebind
> > arguments to 42, e.g. Hence its lack of JSPROP_SHARED. 
> 
> But is exactly the purpose of assigning a slot with a getter result in
> js_NativeGet when obj != pobj? I do not see how this can be useful.

If the property has a slot, then the getter must be a JSPropertyOp. The assumption is that the slot was allocated only for the object (pobj) that owns the scope (the     JS_ASSERT(scope->object == pobj)) that contains the property (sprop).

IOW, you can't safely index into obj->dslots[sprop->slot], obj could delegate to pobj but not allocate dslots at all, or big enough to hold that slot, or with the slot available for that property (instead of being claimed by another).

This kind of entrainment is why JSPROP_SHARED was added. SHARED, PERMANENT prototype properties do not entrain, but appear to be directly contained by each delegating object (obj.hasOwnProperty('foo') is true, even though 'foo' is bound in pobj's scope). Hence the idea for a fix suggested in comment 27.

/be
Assignee: general → brendan
(In reply to comment #29)
> > But is exactly the purpose of assigning a slot with a getter result in
> > js_NativeGet when obj != pobj? I do not see how this can be useful.
> 
> If the property has a slot, then the getter must be a JSPropertyOp. 
...
> IOW, you can't safely index into obj->dslots[sprop->slot],
...
But then js_NativeGet should assert that pobj == slot in the case of valid slots. I.e. if object does not want the shared slot, then it must ensure that the slot is created  in the resolve hook.

And this is exactly what call_resolve missed to do.

Also AFAICS for sprop with getter and slot == -1 there is no need to lock the object the second time in js_NativeGet as sprop->slot can not change to something valid during the execution of getter.
Blocks: 383952
Blocks: 383954
Blocks: 383217
Blocks: 384311
(In reply to comment #30)
> But then js_NativeGet should assert that pobj == slot

pobj == obj, right?

> in the case of valid
> slots. I.e. if object does not want the shared slot, then it must ensure that
> the slot is created  in the resolve hook.

The API is old and there are legitimate uses of slot-ful (not JSPROP_SHARED) prototype properties, so I am loath to add such an assert.

> And this is exactly what call_resolve missed to do.

Patch in a second with a simpler fix.

> Also AFAICS for sprop with getter and slot == -1 there is no need to lock the
> object the second time in js_NativeGet as sprop->slot can not change to
> something valid during the execution of getter.

js_NativeGet's caller expects the (pobj's) scope to be locked on return.

/be
Status: NEW → ASSIGNED
Attached patch proposed fixSplinter Review
Igor, both testcases from this bug that you provided pass for me without this patch, so I'm not sure how to test in the shell. But it seems clear the rogue call that gets the current frame's argsobj and stuffs it into Call prototype's slot for the proto-property named 'arguments' is the one in js_PutCallObject.

The interaction with override bits is delicate, so I'm trying this minimal patch.

/be
Attachment #268309 - Flags: review?(igor)
Comment on attachment 268309 [details] [diff] [review]
proposed fix

Right, the removal of js_GetProperty call is an effective cure for the problem.
Attachment #268309 - Flags: review?(igor) → review+
Fixed on trunk:

js/src/jsfun.c 3.187

/be
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
The last patch is not enough since one has to prevent any call js_GetProperty("arguments"). So here is a new test case that demonstrates the leak even with the patch committed:

~/m/trunk/mozilla/js/src $ cat ~/s/t2.js 
function outer() {
    var x;
    with ({}) {
        x = arguments;
    }
    return function inner() { return x };
}

var gen_was_closed = false;

function gen()
{
    try {
        yield 1;
    } finally {
        gen_was_closed = true;
    }
}

var g = gen();
g.next();

var f = outer(g);
f = null;
g = null;
gc();
gc();
if (!gen_was_closed)
    throw "leaked arguments object";
~/m/trunk/mozilla/js/src $ ./Linux_All_DBG.OBJ/js -v 170 ~/s/t2.js
before 27696, after 27696, break 08832000
before 27696, after 27696, break 08832000
uncaught exception: leaked arguments object
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I think the best way to fix the bug would be to remove arguments and __callee__ from the Call.prototype and instead add them as necessary in call_resolve.
Duplicate of this bug: 380837
This is a test case for js shell that does not use generators and exposes the problem even with the last fix committed. Its core is essentially:

function f(obj) {
    with (obj)
        return arguments;
}

This triggers js_NativeGet for arguments through name lookup causing the leak through the prototype  object.
Using the resolve hook is a good idea. We can avoid all the overriding tests too by not resolving when assigning.

/be
Status: REOPENED → ASSIGNED
Attached patch fix, part 2 (obsolete) — Splinter Review
The first fix was a fine change on its own. This is the call_resolve addition. We still need the override flag magic in order to handle

js> function f(){
  let a=arguments; print(a[0]); arguments=42; print(arguments); return a
}
js> f(33)[0]
33
42
33

/be
Attachment #268550 - Flags: review?(igor)
Comment on attachment 268550 [details] [diff] [review]
fix, part 2

With this change there is no point to keep arguments defined in the prototype.
Oops, forgot to lose that JSPropertySpec. Thanks,

/be
Attachment #268550 - Attachment is obsolete: true
Attachment #268559 - Flags: review?(igor)
Attachment #268550 - Flags: review?(igor)
Attachment #268559 - Flags: review?(igor) → review+
Fixed harder on trunk:

js/src/jsfun.c 3.188

/be
Status: ASSIGNED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
Depends on: 384851
No longer blocks: 381992
/cvsroot/mozilla/js/tests/ecma_3/Function/arguments-002.js,v  <--  arguments-002.js
initial revision: 1.1
/cvsroot/mozilla/js/tests/js1_5/GC/regress-383269-01.js,v  <--  regress-383269-01.js
initial revision: 1.1
Checking in js1_5/GC/regress-383269-02.js;
/cvsroot/mozilla/js/tests/js1_5/GC/regress-383269-02.js,v  <--  regress-383269-02.js
initial revision: 1.1

arguments-002.js fails in the trunk with 
Expected value '33,42:33', Actual value '33,[object Object]:33'
Flags: in-testsuite+
Depends on: 396584
verified fixed 1.9.0 linux/mac*/windows js1_5/GC/regress-383269-0[12].js.

filing bug 396584 for ecma_3/Function/arguments-002.js
Status: RESOLVED → VERIFIED
Igor: I get occasional failures on windows vms with js1_5/GC/regress-383269-01.js where the times will be similar to base_gc_time=0, last_gc_time=15. The problem with the windows vms are that they have variable performance depending upon the load being imposed on the host by the other vms.

I noticed that js1_5/GC/regress-383269-01.js tests time > (base_time + 1) * 3 while js1_5/GC/regress-383269-02.js tests time > (base_time + 10) * 3. Should the test in -01, be the same as in -02? Is there another adjustment that can be made to make this more reliable?
update test to handle uncaught exception
http://hg.mozilla.org/mozilla-central/rev/81d6d6ba55e1
http://hg.mozilla.org/tracemonkey/rev/25a8deec5823

/cvsroot/mozilla/js/tests/performance-1.8.0.tests,v  <--  performance-1.8.0.tests
new revision: 1.4; previous revision: 1.3

/cvsroot/mozilla/js/tests/performance-1.8.1.tests,v  <--  performance-1.8.1.tests
new revision: 1.4; previous revision: 1.3

/cvsroot/mozilla/js/tests/performance-1.9.0.tests,v  <--  performance-1.9.0.tests
new revision: 1.4; previous revision: 1.3

/cvsroot/mozilla/js/tests/performance-1.9.1.tests,v  <--  performance-1.9.1.tests
new revision: 1.2; previous revision: 1.1

/cvsroot/mozilla/js/tests/performance-1.9.2.tests,v  <--  performance-1.9.2.tests
new revision: 1.2; previous revision: 1.1

/cvsroot/mozilla/js/tests/performance.tests,v  <--  performance.tests
new revision: 1.3; previous revision: 1.2
You need to log in before you can comment on or make changes to this bug.