Last Comment Bug 893012 - We leak a ContentParent when its corresponding process is SIGKIL'ed
: We leak a ContentParent when its corresponding process is SIGKIL'ed
Status: RESOLVED FIXED
[MemShrink:P1]
: meta
Product: Core
Classification: Components
Component: IPC (show other bugs)
: Trunk
: All All
: -- normal (vote)
: ---
Assigned To: Justin Lebar (not reading bugmail)
:
Mentors:
Depends on: 739681 811636 887075 893172 893242 894081 894135 894147 894191 894597 894669 895933 896755 896773 896825 897701 897707
Blocks: 880940 886217 889261
  Show dependency treegraph
 
Reported: 2013-07-12 08:27 PDT by Justin Lebar (not reading bugmail)
Modified: 2013-11-01 12:14 PDT (History)
24 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
leo+
fixed


Attachments
Part 1: Don't leak BrowserElementParent due to an event listener on the window which contains it. (4.19 KB, patch)
2013-07-13 08:43 PDT, Justin Lebar (not reading bugmail)
no flags Details | Diff | Splinter Review
cc edges from comment 6 (373.96 KB, application/x-xz)
2013-07-14 22:57 PDT, Justin Lebar (not reading bugmail)
no flags Details
gc edges from comment 6 (986.63 KB, application/x-xz)
2013-07-14 22:58 PDT, Justin Lebar (not reading bugmail)
no flags Details
Memory Reports And DMD After 16 hour (2.32 MB, application/x-compressed-tar)
2013-07-17 17:57 PDT, jeremy.kim.leo [:jeremykimleo]
no flags Details
Memory Reports And DMD And GC/CC log After 33 hour (8.75 MB, application/x-compressed-tar)
2013-07-17 18:00 PDT, jeremy.kim.leo [:jeremykimleo]
no flags Details
memory report before testing. (5.43 MB, application/x-compressed-tar)
2013-07-23 17:58 PDT, jeremy.kim.leo [:jeremykimleo]
no flags Details
memory report after testing for 14 hour. (7.69 MB, application/x-compressed-tar)
2013-07-23 18:00 PDT, jeremy.kim.leo [:jeremykimleo]
no flags Details
memory reports (get_about_memory.py --minimize output) from unagi after 7 STRs (as outlined in comment 31) (3.18 MB, application/zip)
2013-07-24 16:44 PDT, John Hammink
no flags Details
memory report before testing. (5.55 MB, application/x-compressed-tar)
2013-07-25 19:28 PDT, jeremy.kim.leo [:jeremykimleo]
no flags Details
memory report after testing for 3 hour. (5.03 MB, application/x-compressed-tar)
2013-07-25 19:30 PDT, jeremy.kim.leo [:jeremykimleo]
no flags Details
memory report after testing for 15 hours (5.12 MB, application/x-compressed-tar)
2013-07-25 19:33 PDT, jeremy.kim.leo [:jeremykimleo]
no flags Details

Description Justin Lebar (not reading bugmail) 2013-07-12 08:27:28 PDT
STR:

 * Send SIGKIL to a B2G child process
 * ContentParent::ActorDestroy runs, but ContentParent::ShutDownProcess does not, and neither does the ContentParent destructor.

This is at least part of the cause of bug 889261 and friends.
Comment 1 Benjamin Smedberg AWAY UNTIL 2-AUG-2016 [:bsmedberg] 2013-07-12 08:36:47 PDT
If ShutDownProcess is supposed to be called, it probably needs a slightly different name. But yes, ActorDestroy for the root actor is responsible for making sure that it triggers the actions necessary to destroy that actor, because the IPC layer doesn't own it and is therefore not responsible for destroying it as it does for all the child actors.
Comment 2 Justin Lebar (not reading bugmail) 2013-07-12 11:36:06 PDT
We leak the ContentParent because its nsFrameLoader keeps a strong ref to it in in |mChildHost|.

Still figuring out why we leak the frame loader.
Comment 3 Ben Turner (not reading bugmail, use the needinfo flag!) 2013-07-12 13:10:02 PDT
Smells cycle-y.
Comment 4 Justin Lebar (not reading bugmail) 2013-07-12 13:31:51 PDT
We leak the frameloader because we leak BrowserElementParent.  We leak BEP in part because it registers an event listener on its window.  I fixed that and it seems like we still leak the BEP, and I'm not sure yet what's going on.

But I'm going to open a separate bug for closing the ContentParent's channel upon ActorDestroy.  That should take care of the many leaked messages, and it's a prerequisite for not leaking ContentParent.
Comment 5 Justin Lebar (not reading bugmail) 2013-07-12 18:50:01 PDT
With the patch from bug 893172, if I remove the mChildHost ref in nsFrameLoader, my memory reporter from bug 893242 shows that we no longer leak the ContentParent.

So that's good confirmation.  But removing the mChildHost doesn't really fix the problem, since we're still leaking an nsFrameLoader.
Comment 6 Justin Lebar (not reading bugmail) 2013-07-13 08:09:50 PDT
Andrew (or someone), can you help me understand what this means?

The nsFrameLoader I care about is 0x476a8ec0:

> $ python find_roots.py cc-edges.14646.log 0x476a8ec0
> Parsing cc-edges.14646.log. Done loading graph. Reversing graph. Done.
> 
> 0x44e66600 [JS Object (XPCWrappedNative_NoHelper)]
>     --[js::GetObjectPrivate(obj)]-> 0x4721baf0 [XPCWrappedNative]
>     --[mIdentity]-> 0x476a8ec0 [nsFrameLoader]
> 
>     Root 0x44e66600 is a marked GC object.

So far so good; it's held alive only by JS.

$ python find_roots.py gc-edges.14646.log 0x44e66600
Parsing gc-edges.14646.log. Done loading graph. Reversing graph. Done.

> via resource://gre/modules/BrowserElementParent.jsm :
> 0x4391f0a0 [FakeBackstagePass <no private>]
>     --[BrowserElementPromptService]-> 0x4393b2b0 [Object <no private>]
>     --[_browserElementParentMap]-> 0x4393c020 [WeakMap 46ebffd0]
>     --[WeakMap entry]-> 0x47481b80 [Object <no private>]
>     --[_frameLoader]-> 0x44e66600 [XPCWrappedNative_NoHelper 4721baf0]
> 
> via resource://gre/modules/BrowserElementPromptService.jsm :
> 0x4391f0b0 [FakeBackstagePass <no private>]
>     --[BrowserElementPromptService]-> 0x4393b2b0 [Object <no private>]
>     --[_browserElementParentMap]-> 0x4393c020 [WeakMap 46ebffd0]
>     --[WeakMap entry]-> 0x47481b80 [Object <no private>]
>     --[_frameLoader]-> 0x44e66600 [XPCWrappedNative_NoHelper 4721baf0]
> 
> via XPCWrappedNative::mFlatJSObject :
> 0x47455940 [ContentFrameMessageManager 460381f0]
>     --[BrowserElementPromptService]-> 0x4394e0a0 [Proxy <no private>]
>     --[private]-> 0x4393b2b0 [Object <no private>]
>     --[_browserElementParentMap]-> 0x4393c020 [WeakMap 46ebffd0]
>     --[WeakMap entry]-> 0x47481b80 [Object <no private>]
>     --[_frameLoader]-> 0x44e66600 [XPCWrappedNative_NoHelper 4721baf0]
> 
> via jar:file:///system/b2g/omni.ja!/components/Webapps.js :
> 0x4391fa20 [FakeBackstagePass <no private>]
>     --[BrowserElementPromptService]-> 0x4393b2b0 [Object <no private>]
>     --[_browserElementParentMap]-> 0x4393c020 [WeakMap 46ebffd0]
>     --[WeakMap entry]-> 0x47481b80 [Object <no private>]
>     --[_frameLoader]-> 0x44e66600 [XPCWrappedNative_NoHelper 4721baf0]
> 
> via XPCWrappedNative::mFlatJSObject :
> 0x47455040 [ContentFrameMessageManager 469ea100]
>     --[BrowserElementPromptService]-> 0x47725460 [Proxy <no private>]
>     --[private]-> 0x4393b2b0 [Object <no private>]
>     --[_browserElementParentMap]-> 0x4393c020 [WeakMap 46ebffd0]
>     --[WeakMap entry]-> 0x47481b80 [Object <no private>]
>     --[_frameLoader]-> 0x44e66600 [XPCWrappedNative_NoHelper 4721baf0]
> 
> via nsXPCWrappedJS::mJSObj :
> 0x43944100 [Object <no private>]
>     --[_bepMap]-> 0x4393c5a0 [WeakMap 460380a0]
>     --[WeakMap entry]-> 0x47481b80 [Object <no private>]
>     --[_frameLoader]-> 0x44e66600 [XPCWrappedNative_NoHelper 4721baf0]
> 
> via Preserved wrapper :
> 0x43b72580 [HTMLIFrameElement <no private>]
>     --[setVisible]-> 0x43b84150 [Proxy <no private>]
>     --[private]-> 0x47781f40 [Function]
>     --[fun_callscope]-> 0x47481d30 [Call <no private>]
>     --[**UNKNOWN SLOT 1**]-> 0x47781f00 [Function BrowserElementParent/defineMethod]
>     --[fun_callscope]-> 0x47481b50 [Call <no private>]
>     --[self]-> 0x47481b80 [Object <no private>]
>     --[_frameLoader]-> 0x44e66600 [XPCWrappedNative_NoHelper 4721baf0]

The only non-weak entry here is the last one.  The <iframe mozbrowser> has an expando called setVisible, and that function holds a ref to |self|, which is BrowserElementParent, which holds a ref to _frameLoader.

That makes sense.  There are a few other expandos on <iframe mozbrowser> which are functions whose scopes also hold a ref to the BrowserElementParent, but I presume they're being elided here.

So now, why is the iframe alive?

> $ python find_roots.py gc-edges.14646.log 0x43b72580
> Parsing gc-edges.14646.log. Done loading graph. Reversing graph. Done.
> 
> via Preserved wrapper :
> 0x43b72580 [HTMLIFrameElement <no private>]

Hm...what's a preserved wrapper?  The cc log isn't any more help.

> $ python find_roots.py cc-edges.14646.log 0x43b72580
> Parsing cc-edges.14646.log. Done loading graph. Reversing graph. Done.
> 
> 0x43b72580 [JS Object (HTMLIFrameElement)]
> 
>     Root 0x43b72580 is a marked GC object.

I tried to look at the object in gdb, but 0x43b72580 does not appear to be a
mozilla::dom::HTMLIFrameElement; its vtable pointers don't line up with
anything gdb knows about, and its refcnt is 1136019944.  Maybe that pointer is
to one of the object's nsISupports's, or something.

Anyway, I'm not sure what to do now.  I could make the expandos on the iframe
hold weak refs to the BrowserElementParent, but then we'd still be leaking the
iframe.

Any ideas?
Comment 7 Justin Lebar (not reading bugmail) 2013-07-13 08:21:23 PDT
Ah, I misunderstood those weakmap refs.

They're refs where the BEP is the /value/, which is a strong ref.

However, these are all keyed on the iframe, so when the iframe goes away, the BEP should too.  So we're still left with the question of why this iframe is alive.
Comment 8 Justin Lebar (not reading bugmail) 2013-07-13 08:43:11 PDT
Created attachment 775147 [details] [diff] [review]
Part 1: Don't leak BrowserElementParent due to an event listener on the window which contains it.

Without this, both the window and the iframe hold the BEP alive.  With this
patch, only the iframe seems to hold the BEP alive (per the above comments),
which is what we want.  But we still leak the iframe, it seems.
Comment 9 Andrew McCreight [:mccr8] 2013-07-13 09:18:49 PDT
Could you upload the GC and CC logs you are talking about?  I can look at them.

0x43b72580 is a JS Object (HTMLIFrameElement), which means it is the JS reflector for the HTMLIFrameElement, not the element itself.  This is an all-traces log (hopefully), so you want to look up the reflector 0x43b72580 in the CC log, and that should contain a pointer to the C++ object.

When a JS reflector (aka wrapper in the context of preserved wrapper) is created for a C++ object, it owns the C++ object, but not vice versa.  If you set an expando on the reflector (or a few other things), we want to ensure the reflector lives as long as the C++ object, so we make the C++ object own its reflector.  That's called a "preserved wrapper".
Comment 10 Andrew McCreight [:mccr8] 2013-07-13 09:30:57 PDT
While a preserved wrapper on a C++ object makes it own its reflector, it shouldn't cause its reflector to be marked black (which means the CC won't collect it), except as a CC optimization (if we decide the C++ object is definitely alive, then if the reflector is preserved, the reflector is definitely alive), so I suspect this is an instance where AllTraces is failing to deoptimize, perhaps in mozilla::dom::TraceBlackJS (I filed bug 893378 for that).

The good news though is that this is really just find_roots getting confused, the information should still be in the log.  In general, you might need to manually root around (no pun intended) in the log, but here I think it should suffice to edit the log, and replace the [gc.marked] next to the entry for 0x43b72580 with [gc].  This will void your warranty (eg it may mess stuff up, including but not limited to things this reflector owns), but I think it should be okay, because nothing else is pointing to the reflector.
Comment 11 Justin Lebar (not reading bugmail) 2013-07-14 22:57:25 PDT
Created attachment 775469 [details]
cc edges from comment 6
Comment 12 Justin Lebar (not reading bugmail) 2013-07-14 22:58:01 PDT
Created attachment 775471 [details]
gc edges from comment 6
Comment 13 Justin Lebar (not reading bugmail) 2013-07-14 23:05:20 PDT
Okay, with the suggested change to the log from comment 10, I now get

> $ python cc/find_roots.py cc-edges.log 0x43b72580
> Parsing cc-edges.log. Done loading graph. Reversing graph. Done.
> 
> 0x43b725b0 [JS Object (HTMLDivElement)]
>     --[UnwrapDOMObject(obj)]-> 0x476a8dd0 [FragmentOrElement (xhtml) div id='appframe0' class='appWindow homescreen active' app://system.gaiamobile.org/index.html]
>     --[mAttrsAndChildren[i]]-> 0x46b9a880 [FragmentOrElement (xhtml) iframe app://system.gaiamobile.org/index.html]
>     --[Preserved wrapper]-> 0x43b72580 [JS Object (HTMLIFrameElement)]
> 
>     Root 0x43b725b0 is a marked GC object.
> 
> 0x43b72610 [JS Object (HTMLDivElement)]
>     --[UnwrapDOMObject(obj)]-> 0x47b472e0 [FragmentOrElement (xhtml) div class='screenshot-overlay' app://system.gaiamobile.org/index.html]
>     --[GetParent()]-> 0x476a8dd0 [FragmentOrElement (xhtml) div id='appframe0' class='appWindow homescreen active' app://system.gaiamobile.org/index.html]
>     --[mAttrsAndChildren[i]]-> 0x46b9a880 [FragmentOrElement (xhtml) iframe app://system.gaiamobile.org/index.html]
>     --[Preserved wrapper]-> 0x43b72580 [JS Object (HTMLIFrameElement)]
> 
>     Root 0x43b72610 is a marked GC object.
> 
> 0x47756d90 [JS Object (Proxy)]
>     --[private]-> 0x43b72580 [JS Object (HTMLIFrameElement)]
> 
>     Root 0x47756d90 is a marked GC object.
> 
> 0x43b84010 [JS Object (Call)]
>     --[iframe]-> 0x43b72580 [JS Object (HTMLIFrameElement)]
> 
>     Root 0x43b84010 is a marked GC object.
> 
> 0x43b6eeb0 [JS Object (Object)]
>     --[iframe]-> 0x43b72580 [JS Object (HTMLIFrameElement)]
> 
>     Root 0x43b6eeb0 is a marked GC object.

I'm not sure what the last three mean, but the first is indicating that we're holding a div element alive, and that div contains the iframe.  I think the second is saying that this first div contains a second div, and that second div holds alive the first one by virtue of its parent pointer.  It's not clear to me whether this rules out the fact that the second div is held alive by the first one (by virtue of being its child); that is, I don't know if the second div is interesting.
Comment 14 Justin Lebar (not reading bugmail) 2013-07-14 23:07:12 PDT
Looking for gc roots of the first div element appears to show that Gaia is leaking the world.

> $ python ~/code/moz/heapgraph/gc/find_roots.py gc-edges.log 0x43b725b0
> Parsing gc-edges.log. Done loading graph. Reversing graph. Done.
> 
> via nsXPCWrappedJS::mJSObj :
> 0x43b6dac0 [Function AppError/<]
>     --[fun_callscope]-> 0x43b6eee0 [Call <no private>]
>     --[self]-> 0x43b6ef10 [Object <no private>]
>     --[app]-> 0x43b6eeb0 [Object <no private>]
>     --[frame]-> 0x43b725b0 [HTMLDivElement <no private>]
> 
> via nsXPCWrappedJS::mJSObj :
> 0x43b6db40 [Function apploaded]
>     --[fun_callscope]-> 0x43b84010 [Call <no private>]
>     --[homescreenFrame]-> 0x43b725b0 [HTMLDivElement <no private>]
Comment 15 Justin Lebar (not reading bugmail) 2013-07-14 23:38:27 PDT
In Gaia's apps/system/js/window.js:

> window.AppError = function AppError(app) {
>   var self = this;
>   this.app = app;
>   this.app.frame.addEventListener('mozbrowsererror', function(evt) {
>     if (evt.detail.type != 'other')
>       return;
>
>     console.warn(
>       'app of [' + self.app.origin + '] got a mozbrowsererror event.');
>
>     if (self.injected) {
>       self.update();
>     } else {
>       self.render();
>     }
>     self.show();
>     self.injected = true;
>   });
>   return this;
> };

We register one mozbrowsererror listener for every frame, but we never remove the listener.  The mozbrowsererror listener's closure holds alive self, which is |this|, which holds alive |this.app|, which holds alive |frame|, which holds alive our iframe.  Sheesh.

So that's just gonna leak.

In Gaia's apps/system/js/window_manager.js:

> app.frame.addEventListener(type, function apploaded(e) {
>   e.target.removeEventListener(e.type, apploaded, true);
>
>   var evt = document.createEvent('CustomEvent');
>   evt.initCustomEvent('apploadtime', true, false, {
>     time: parseInt(Date.now() - iframe.dataset.start),
>     type: (e.type == 'appopen') ? 'w' : 'c'
>   });
>   iframe.dispatchEvent(evt);
> }, true);

This one is a bit weird.  If this closure holds alive homescreenFrame (which exists in the closure's scope but is not referenced inside the closure), does that mean the closure is heavyweight?  If so, why is it heavyweight?

Let us pause to consider how incredibly difficult it is to write JS that doesn't leak.
Comment 16 (inactive after 6/18) Alive Kuo [:alive] 2013-07-14 23:57:14 PDT
(In reply to Justin Lebar [:jlebar] from comment #15)
> In Gaia's apps/system/js/window.js:
> 
> We register one mozbrowsererror listener for every frame, but we never
> remove the listener.  The mozbrowsererror listener's closure holds alive
> self, which is |this|, which holds alive |this.app|, which holds alive
> |frame|, which holds alive our iframe.  Sheesh.
> 
> So that's just gonna leak.

Does "removeEventListener() before the element is removed" help?
Do you suggest that we should remove event listener before an element is removed from DOM tree?

> 
> In Gaia's apps/system/js/window_manager.js:
> 
> > app.frame.addEventListener(type, function apploaded(e) {
> >   e.target.removeEventListener(e.type, apploaded, true);
> >
> >   var evt = document.createEvent('CustomEvent');
> >   evt.initCustomEvent('apploadtime', true, false, {
> >     time: parseInt(Date.now() - iframe.dataset.start),
> >     type: (e.type == 'appopen') ? 'w' : 'c'
> >   });
> >   iframe.dispatchEvent(evt);
> > }, true);
> 
> This one is a bit weird.  If this closure holds alive homescreenFrame (which
> exists in the closure's scope but is not referenced inside the closure),
> does that mean the closure is heavyweight?  If so, why is it heavyweight?


This one is not invented by me...however I don't get your point here.
Which line is the heavyweight point you mean? Or you are saying the whole WindowManager closure is something bad?

> 
> Let us pause to consider how incredibly difficult it is to write JS that
> doesn't leak.

I agree.
Comment 17 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-07-15 04:48:50 PDT
(In reply to Justin Lebar [:jlebar] from comment #15)
> In Gaia's apps/system/js/window_manager.js:
> 
> > app.frame.addEventListener(type, function apploaded(e) {
> >   e.target.removeEventListener(e.type, apploaded, true);
> >
> >   var evt = document.createEvent('CustomEvent');
> >   evt.initCustomEvent('apploadtime', true, false, {
> >     time: parseInt(Date.now() - iframe.dataset.start),
> >     type: (e.type == 'appopen') ? 'w' : 'c'
> >   });
> >   iframe.dispatchEvent(evt);
> > }, true);
> 
> This one is a bit weird.  If this closure holds alive homescreenFrame (which
> exists in the closure's scope but is not referenced inside the closure),
> does that mean the closure is heavyweight?  If so, why is it heavyweight?

I believe there are situations where a closure can keep alive things that appear to be unnecessary without being heavyweight.  Let's poke at this tomorrow (today? :-P) in the debugger, and rope in some JS folks if we need them.

> Let us pause to consider how incredibly difficult it is to write JS that
> doesn't leak.

*le sigh*
Comment 18 Boris Zbarsky [:bz] 2013-07-15 07:06:54 PDT
> So that's just gonna leak.

Cycle collector should collect this cycle.  Why doesn't it?  Or is the point that "self" is actually a live window all along or something?
Comment 19 Justin Lebar (not reading bugmail) 2013-07-15 09:52:59 PDT
> I don't get your point here.
> Which line is the heavyweight point you mean? Or you are saying the whole WindowManager 
> closure is something bad?

In our JS engine, a closure can either be "heavyweight" or "lightweight".

The idea is that a closure can access variables in its containing scope, so if you keep the closure alive, it has to keep alive the variables from the containing scope that it can access.

Normally the JS engine parses the text of the closure, understands which variables the closure might access, and keeps them all alive.  When we can do this, we call the closure lightweight.

But sometimes we can't determine statically which variables the closure might keep alive.  For example, if the closure calls eval(), the eval'ed code might access anything.  In this case, we make the closure "heavyweight" and it keeps alive everything in its containing scope.

All of this is to say, I don't see why this particular closure should be heavyweight.  But also, it doesn't access homescreenFrame, so if it's lightweight, I don't see why it's holding that variable alive.  So obviously my explanation is incomplete in some way.  :)  khuey is looking into it.
Comment 20 Justin Lebar (not reading bugmail) 2013-07-15 10:34:36 PDT
> Does "removeEventListener() before the element is removed" help?

In this case, yes.  Alternatively, we could register just one event listener and use the target of the event to get the iframe.

> Do you suggest that we should remove event listener before an element is removed from DOM 
> tree?

I'm not suggesting this as a general rule.  The issue is that this particular event listener holds a reference to the DOM element in question.  Therefore if we don't remove the event listener when we remove the DOM element from the tree, then the listener leaks the DOM element.
Comment 21 Boris Zbarsky [:bz] 2013-07-15 10:38:09 PDT
Again: it shouldn't.  The cycle collector knows about event listeners!
Comment 22 Justin Lebar (not reading bugmail) 2013-07-15 10:39:26 PDT
(In reply to Boris Zbarsky (:bz) from comment #18)
> > So that's just gonna leak.
> 
> Cycle collector should collect this cycle.  Why doesn't it?  Or is the point
> that "self" is actually a live window all along or something?

Confusingly enough, |frame| is not the iframe.  I think it is an element in the window that stays alive.  So I don't think there's a cycle to collect.

However, if we registered the event listener on |this.app.iframe|, that might let the CC collect the closure...
Comment 23 Boris Zbarsky [:bz] 2013-07-15 10:43:17 PDT
Oh, so app.frame sticks around because someone is explicitly keeping it alive? Then yeah, its listeners will leak whatever they close over...
Comment 24 (inactive after 6/18) Alive Kuo [:alive] 2013-07-15 13:23:24 PDT
(In reply to Boris Zbarsky (:bz) from comment #23)
> Oh, so app.frame sticks around because someone is explicitly keeping it
> alive? Then yeah, its listeners will leak whatever they close over...

app.frame is the container element, a DIV, of the mozbrowser iframe.

<div class="app-window">
  <iframe mozbrowser></iframe>
</div>

So we should use app.iframe to attach the event listener and need not to do remove event listener but just remove the whole frame element and then GC would work correctly? Correct me if I'm wrong. If yes, I'm going to fix this case.

(In reply to Justin Lebar [:jlebar] from comment #19)
> > I don't get your point here.
> > Which line is the heavyweight point you mean? Or you are saying the whole WindowManager 
> > closure is something bad?
> 
> In our JS engine, a closure can either be "heavyweight" or "lightweight".
> 
> The idea is that a closure can access variables in its containing scope, so
> if you keep the closure alive, it has to keep alive the variables from the
> containing scope that it can access.
> 
> Normally the JS engine parses the text of the closure, understands which
> variables the closure might access, and keeps them all alive.  When we can
> do this, we call the closure lightweight.
> 
> But sometimes we can't determine statically which variables the closure
> might keep alive.  For example, if the closure calls eval(), the eval'ed
> code might access anything.  In this case, we make the closure "heavyweight"
> and it keeps alive everything in its containing scope.
> 
> All of this is to say, I don't see why this particular closure should be
> heavyweight.  But also, it doesn't access homescreenFrame, so if it's
> lightweight, I don't see why it's holding that variable alive.  So obviously
> my explanation is incomplete in some way.  :)  khuey is looking into it.

Thanks for explanation. I'll try to find out why this is relevant to |homescreenFrame| variable from my point of view.
Comment 25 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-07-15 13:25:21 PDT
(In reply to Alive Kuo [:alive] from comment #24)
> (In reply to Justin Lebar [:jlebar] from comment #19)
> > > I don't get your point here.
> > > Which line is the heavyweight point you mean? Or you are saying the whole WindowManager 
> > > closure is something bad?
> > 
> > In our JS engine, a closure can either be "heavyweight" or "lightweight".
> > 
> > The idea is that a closure can access variables in its containing scope, so
> > if you keep the closure alive, it has to keep alive the variables from the
> > containing scope that it can access.
> > 
> > Normally the JS engine parses the text of the closure, understands which
> > variables the closure might access, and keeps them all alive.  When we can
> > do this, we call the closure lightweight.
> > 
> > But sometimes we can't determine statically which variables the closure
> > might keep alive.  For example, if the closure calls eval(), the eval'ed
> > code might access anything.  In this case, we make the closure "heavyweight"
> > and it keeps alive everything in its containing scope.
> > 
> > All of this is to say, I don't see why this particular closure should be
> > heavyweight.  But also, it doesn't access homescreenFrame, so if it's
> > lightweight, I don't see why it's holding that variable alive.  So obviously
> > my explanation is incomplete in some way.  :)  khuey is looking into it.
> 
> Thanks for explanation. I'll try to find out why this is relevant to
> |homescreenFrame| variable from my point of view.

I'm digging into this with the help of some JS engine folks.  Hopefully I'll have an answer soon.
Comment 26 Justin Lebar (not reading bugmail) 2013-07-15 13:25:42 PDT
Yes, I think switching to app.iframe would work.  I'm testing a patch now, and I'll send a review your way, if that sounds OK to you.
Comment 27 (inactive after 6/18) Alive Kuo [:alive] 2013-07-15 13:26:16 PDT
(In reply to Justin Lebar [:jlebar] from comment #26)
> Yes, I think switching to app.iframe would work.  I'm testing a patch now,
> and I'll send a review your way, if that sounds OK to you.

Sure, thank you!
Comment 28 Justin Lebar (not reading bugmail) 2013-07-15 17:43:40 PDT
Comment on attachment 775147 [details] [diff] [review]
Part 1: Don't leak BrowserElementParent due to an event listener on the window which contains it.

I'm moving this into its own bug (bug 894147), since the Gaia patches also have their own bugs.

That way, if there are any other surprise fixes we need here, we can continue to attach them off this bug.
Comment 29 Justin Lebar (not reading bugmail) 2013-07-15 17:56:25 PDT
Steps for testing:

1) Apply patches from bug 894081, bug 894135, bug 894147, and bug 893242.  I'm working on landing these, but it will take a day or two longer.

2) Load the browser app, go to example.com

3) Load the gallery app

4) adb shell b2g-ps

5) adb shell kill -9 PIDS

  where PIDS is equal to the list of all PIDs from step 4 above, not including the main b2g process.  For example, if adb shell b2g-ps gives me

>APPLICATION      USER     PID   PPID  VSIZE  RSS     WCHAN    PC         NAME
>b2g              root      2810  158   160808 71504 ffffffff 400f8430 S /system/b2g/b2g
>Usage            app_2863  2863  2810  64392  24084 ffffffff 40020430 S /system/b2g/plugin-container
>Homescreen       app_2882  2882  2810  69212  26684 ffffffff 40059430 S /system/b2g/plugin-container
>Gallery          app_2962  2962  2810  84700  25088 ffffffff 400dd430 S /system/b2g/plugin-container
>Browser          app_3004  3004  2810  64072  22204 ffffffff 4002e430 S /system/b2g/plugin-container
>(Preallocated a  root      3068  2810  62996  21152 ffffffff 400e3430 S /system/b2g/plugin-container

Then I would run

> $ adb shell kill -9 2863 2882 2962 3004 3068

6) Wait ten seconds, then run

  $ B2G/tools/get_about_memory.py --minimize

where B2G is the root B2G checkout (from https://github.com/mozilla-b2g/B2G).

7) In the resulting memory report, search for "content-parent".

If this bug is fixed, you should see only one or two content-parent entries:
One for the homescreen, and maybe one for the preallocated process.  Neither
should have pid=-1, and neither should have a closed channel.

If this bug is not fixed, you will see content-parent entries with pid=-1 or
closed channels.  You may see content-parent entries for Browser, Gallery, or
Usage, and you may see multiple content-parent entries for the Homescreen.

When I checked with these STR a few minutes ago, everything seemed fixed.  But
now I'm seeing things which indicate that this bug may not be fixed.  Please
stand by.
Comment 30 Justin Lebar (not reading bugmail) 2013-07-15 18:00:31 PDT
Here's an example of a bad memory report, indicating that the bug still exists.

> 0 (100.0%) -- queued-ipc-messages
> ├──0 (100.0%) ── content-parent((Preallocated), pid=3809, open channel, 0x46dd8000, refcnt=14)
> ├──0 (100.0%) ── content-parent(Homescreen, pid=-1, closed channel, 0x43208c00, refcnt=1)
> ├──0 (100.0%) ── content-parent(Homescreen, pid=3756, open channel, 0x45bef800, refcnt=15)
> └──0 (100.0%) ── content-parent(Usage, pid=-1, closed channel, 0x46f0a800, refcnt=1)

We want it instead to look like this.

> 0 (100.0%) -- queued-ipc-messages
> ├──0 (100.0%) ── content-parent((Preallocated), pid=3809, open channel, 0x46dd8000, refcnt=14)
> └──0 (100.0%) ── content-parent(Homescreen, pid=3756, open channel, 0x45bef800, refcnt=15)
Comment 31 Justin Lebar (not reading bugmail) 2013-07-15 18:50:17 PDT
It turns out that we "leak" one or two ContentParents under some circumstances
(*).  This isn't a big deal -- each one is relatively small.  What's important
is that the number of ContentParent objects we leak is bounded by a small
constant.

So to test for this, I need to amend the steps from comment 29 a bit.

A) Do step 1
B) Repeat steps 2, 3, 4, 5 five times.
C) Do steps 6 and 7
D) When looking through the memory report, we pass if there are at most two
   ContentParent entries with pid=-1 and/or a closed channel.  Otherwise we
   fail.

Let me know if this isn't clear.

(*) In our testing, it looks like BrowserElementPanning.js can leak one
    ContentParent, via its pointerDownTarget member.

    It also looks like Gaia leaks one ContentParent when you kill the Usage
    app.  It gets a mozbrowsererror when the process crashes, but it apparently
    does not remove the old, dead iframe until it wants to show the Usage app
    again.  These may be bugs (particularly the second one), but I don't think
    they're a big deal.
Comment 32 Justin Lebar (not reading bugmail) 2013-07-15 19:07:27 PDT
One thing I should add is that I haven't tested any of this in b2g18 yet.  All of my experiments have been on trunk.

If you're going to test on trunk, be sure that you have bug 887075; without it, we leak like crazy.  As far as I know, this isn't a problem on b2g18.
Comment 33 jeremy.kim.leo [:jeremykimleo] 2013-07-17 17:57:29 PDT
Created attachment 777511 [details]
Memory Reports And DMD After 16 hour

this log applied Bug 894081, Bug 894135, Bug 894147, Bug 893242
this was taken after 16 hour.
Comment 34 jeremy.kim.leo [:jeremykimleo] 2013-07-17 18:00:32 PDT
Created attachment 777512 [details]
Memory Reports And DMD And GC/CC log After 33 hour

this log applied Bug 894081, Bug 894135, Bug 894147, Bug 893242
this was taken after 33 hour.
Comment 35 Justin Lebar (not reading bugmail) 2013-07-17 18:07:16 PDT
Thank you; this is very useful information.
Comment 36 Justin Lebar (not reading bugmail) 2013-07-17 18:08:14 PDT
Next time you collect this information, it would be helpful if you'd collect it with get_about_memory.py --minimize.

--minimize forces a GC right before collecting the information.  This helps reduce noise in the test results.
Comment 37 Justin Lebar (not reading bugmail) 2013-07-17 18:15:29 PDT
> Memory Reports And DMD After 16 hour

Also next time it would be helpful to have a GC/CC log both before and after.  We'll work with what we have until you can get us better data, but if that's at all possible, please post the new data to the bug when you get it.
Comment 38 Justin Lebar (not reading bugmail) 2013-07-17 18:40:19 PDT
Judging by the DMD logs, there's definitely a ContentParent leak still remaining here.

But even if we fix that, it's not going to fix the whole problem.  About 50% of the increase in memory usage is due to JS; that is likely a completely unrelated problem.  To diagnose that, we really need before and after GC/CC logs.

My working theory is that maybe ContentParent::ActorDestroy isn't being run.  That would definitely cause us to leak.  I think we're going to try to catch this in a debugger tomorrow.
Comment 39 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-07-17 18:45:05 PDT
Just to be sure for when we try to reproduce this, the tests from comment 33 and 34 are run on b2g-18 and gaia v1-train (plus the patches)?
Comment 40 Justin Lebar (not reading bugmail) 2013-07-17 18:51:49 PDT
By the way, I just pushed a fix to the get_about_memory.py script that should fix getting memory reports from phones running b2g18.

If you git fetch && git merge in the root B2G repository, you should get this new version; hopefully it will work for you!

https://github.com/mozilla-b2g/B2G/commit/1f777756bfd1a80381208ef9851315ce1cec3dc3
Comment 41 jeremy.kim.leo [:jeremykimleo] 2013-07-17 22:50:02 PDT
Base Revision : AU_LINUX_GECKO_ICS_STRAWBERRY.01.01.00.019.149
gecko : 499c1f8ea7ad0cdaa7086214278e2944b8a2ea33
gaia : 0d5a9a7577f16b6a72a982148c6f509ee1714ea2

and add those patches.
+ Bug 894081, Bug 894135, Bug 894147, Bug 894191
+ Bug 893242 : except 
Part 3, v2: Add a memory reporter counting ContentParents and the number of IPC messages they have outstanding. 
Part 1.5: Make sContentParents a StaticAutoPtr.
Comment 42 Andrew McCreight [:mccr8] 2013-07-19 09:36:41 PDT
Digging in the logs some more, from the 33 hour CC/GC logs, I see around a thousand BEPs, but I also see about a thousand iframe elements.  These iframes are not present in the CC log, so it seems like maybe the CC log it not an AllTraces log, so we can't tell why the event listeners that are holding the iframes alive are being held alive.  But I guess the good news here is that it is a similar problem to one we've looked into before.  Kyle is trying to reproduce the problem locally.
Comment 43 Justin Lebar (not reading bugmail) 2013-07-19 10:11:48 PDT
> so it seems like maybe the CC log it not an AllTraces log

Ack, I must have forgotten to mark bug 890135 as leo+.  Sorry.  :-/
Comment 44 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-07-19 17:46:31 PDT
We were able to reproduce more ContentParent leaks today.  These latest leaks appear to be distinct from everything we've fixed so far.  We were able to partially address them, but we ran into a bug in our GC heap dumping code where we're not dumping the entire heap (!?!?).  We'll pick up on Monday trying to fix the tool so we can see what leaks remain.
Comment 45 dkumar 2013-07-22 11:25:24 PDT
Can you please provide steps to verify this fix?- to perform blackbox testing.
Comment 46 jeremy.kim.leo [:jeremykimleo] 2013-07-23 17:58:47 PDT
Created attachment 780106 [details]
memory report before testing.

to gather more debugging infomation. this version applied more patchs.
- Bug 893242, Bug 895933, Bug 890135

and this is log of before testing after boot up. 
memory-report with minimize
dmd
gc/cc log.
Comment 47 jeremy.kim.leo [:jeremykimleo] 2013-07-23 18:00:51 PDT
Created attachment 780109 [details]
memory report after testing for 14 hour.

and this is log of after testing for 14 hour
memory-report with minimize
dmd
gc/cc log.
Comment 48 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-07-23 20:42:33 PDT
Thanks, jeremy, that's really useful.

The log from comment 47 has 827 (yes, really) leaked ContentParent objects.  Tomorrow I think we will have patches that fix all the ContentParent leaks we have found so far.  Hopefully after you retest with those we'll see a much small (maybe even 0!) number.
Comment 49 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-07-24 16:30:45 PDT
Hi Jeremy,

B2G-18 tip and gaia v1-train tip plus the patches from bug 897701, bug 896825, and bug 896755 fix all of the ContentParent leaks that we have been able to reproduce.  It would be great if leo could test with those patches.
Comment 50 John Hammink 2013-07-24 16:44:07 PDT
Created attachment 780685 [details]
memory reports (get_about_memory.py --minimize output) from unagi after 7 STRs (as outlined in comment 31)
Comment 51 John Hammink 2013-07-24 16:45:49 PDT
For comparison, I also checked this on the latest nightly on Unagi after 7 Trials

Gecko  http://hg.mozilla.org/releases/mozilla-b2g18/rev/91fa9bf5e638
Gaia   a64f37c3327e4cf2105a16eedaf68727bb2adbde
BuildID 20130724070208
Version 18.0
Comment 52 John Hammink 2013-07-24 16:46:58 PDT
It seems that, at least there, the bug was still occurring.
Comment 53 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-07-24 16:51:25 PDT
If you weren't on tip and don't have the patches from comment 49 that's expected.
Comment 54 Andre Graziani (:graziani) 2013-07-25 16:18:29 PDT
Hi Justin,

This is Andre from Leo.
The bug 811636 is observed in Leo as well.
After hours of test, the output channel's queue starts accumulating again.
The child associated to that channel no longer exists, and the main process seems to hold both sides of the pipe. So no channel error is dispatched.
This is exactly what bug 811636 fixes.
I tested that patch in Leo, and the result was positive: no messages accumulating in the output queue.
So I am putting the current issue as dependent of that, since that causes the ContentParent to leak.
Comment 55 jeremy.kim.leo [:jeremykimleo] 2013-07-25 19:28:59 PDT
Created attachment 781414 [details]
memory report before testing.

this version applied more patchs.
- Bug 897701, Bug 896825, Bug 896755

and this is log of before testing after boot up. 
memory-report with minimize
dmd
gc/cc log.
Comment 56 jeremy.kim.leo [:jeremykimleo] 2013-07-25 19:30:24 PDT
Created attachment 781419 [details]
memory report after testing for 3 hour.

and this is log of after testing for 3 hour
memory-report with minimize
dmd
gc/cc log.
Comment 57 jeremy.kim.leo [:jeremykimleo] 2013-07-25 19:33:58 PDT
Created attachment 781424 [details]
memory report after testing for 15 hours

and this is log of after testing for 15 hours
memory-report with minimize
dmd
gc/cc log.

awesome! there are only 3 ContentParent objects. it looks very greats.
Comment 58 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-07-25 21:38:25 PDT
That is great news.  No leaked ContentParent objects after 15 hours, and RSS has only increased by 1.5 MB in 12 hours.

Jeremy if leo is happy with these results I'd like to close this bug as RESOLVED FIXED.  But if you want to do more QA testing on these patches we can wait until you are done.
Comment 59 Andrew McCreight [:mccr8] 2013-07-26 10:51:06 PDT
It is also worth noting that in the main process, from 3 hours to 15 hours, the only real increase in explicit memory (around 0.36mb) is from DMD itself.
Comment 60 leo.bugzilla.gecko 2013-07-28 19:07:13 PDT
sorry for delay because of weekends. 
first of all, thanks for efforts of all. 
i agree with Comment 58. it looks clear leaked issue caused by ContentParents.
let's change status of this as RESOLVED-FIXED. 

in additionally, our official QA testing will start on next monday(05-AUG-2013) for a week.
it maybe, the official results is possible to take after two weeks.'
Comment 61 Kyle Huey [:khuey] (khuey@mozilla.com) 2013-07-28 19:47:18 PDT
(In reply to leo.bugzilla.gecko from comment #60)
> sorry for delay because of weekends. 

No problem.  Have to take a break sometime ;-)

> first of all, thanks for efforts of all. 
> i agree with Comment 58. it looks clear leaked issue caused by
> ContentParents.
> let's change status of this as RESOLVED-FIXED. 

Great!

> in additionally, our official QA testing will start on next
> monday(05-AUG-2013) for a week.
> it maybe, the official results is possible to take after two weeks.'

Looking forward to those results.
Comment 62 leo.bugzilla.gecko 2013-07-28 22:23:09 PDT
Kyle, i have some question.
currently, i found an abnormal behavior. 
after 3~4 days testing this, b2g has some delay when i do something on target.
(for examples, delayed launch application / delayed response as my action)

i checked 'adb shell top -m 10'
----------------------------------------------------------------------------------------------
User 88%, System 11%, IOW 0%, IRQ 0%
User 275 + Nice 0 + Sys 35 + Idle 0 + IOW 0 + IRQ 0 + SIRQ 0 = 310

  PID PR CPU% S  #THR     VSS     RSS PCY UID      Name
  140  0  92% R    57 240296K  82372K  fg root     /system/b2g/b2g
  163  0   5% S     6   5908K    436K  fg root     /system/bin/sensord
 9099  0   1% R     1   1060K    452K  fg root     top
   79  0   0% S     1      0K      0K  fg root     mmcqd/0
 8909  0   0% S     1      0K      0K  fg root     kworker/0:2
    8  0   0% S     1      0K      0K  fg root     suspend
    9  0   0% S     1      0K      0K  fg root     sync_supers
   10  0   0% S     1      0K      0K  fg root     bdi-default
   11  0   0% S     1      0K      0K  fg root     kblockd
   12  0   0% S     1      0K      0K  fg root     khubd
----------------------------------------------------------------------------------------------
sometimes, b2g process has 92% of CPU per 20 seconds.

so i guess that b2g is enabled some log to gather memory-report/dmd/gc/cc log.
is it possible to cause some delays? (after 3days, cc log is 14mbytes, gc log is 20mbytes)
(if it is right, how i can disable to log data?)

if i builds user-mode, is it disabled logging? (currently, i'm used userdebug-mode.)
Comment 63 Justin Lebar (not reading bugmail) 2013-07-28 22:34:30 PDT
There is no way to disable logging, but having logging enabled causes very little overhead, because Firefox has to maintain the data we use to generate the logs, even if we're not using the logs.  We have this sort of logging enabled on all desktop Firefox builds.

What you are seeing is very unlikely related to gc/cc/about:memory logs.
Comment 64 Justin Lebar (not reading bugmail) 2013-07-28 22:47:55 PDT
You should use a profiler to see why Firefox is spinning.  I'm not sure how a userdebug-mode build is made, but if you have debug symbols, you could attach GDB, hit ctrl-c, and observe the backtrace.  Do this a bunch of times and you might understand why the build is spinning.

It might also be worth trying to use SPS, but that will probably only be helpful if we're spending time running JavaScript.  Also I'm not sure you can use SPS without first rebooting the device, which isn't what you want to do.

https://developer.mozilla.org/en-US/docs/Performance/Profiling_with_the_Built-in_Profiler#Profiling_Boot_to_Gecko_%28with_a_real_device%29

This issue is probably worth a separate bug; I can help you make sure you have the right people cc'ed to help you with this issue.
Comment 65 leo.bugzilla.gecko 2013-07-28 23:18:17 PDT
ok. thanks Justin.

After attach GDB, When b2g has very high cpu, i hits ctrl-c, at this point, almost, b2g is doing to collect cycle memory. so, i doubt cc logging has a overhead.
and also, when i reboots to enable profile, because this will be cleared, so i can't do this yet.

i understood, i will file a bug as new after more investigation.
Comment 66 Justin Lebar (not reading bugmail) 2013-07-28 23:42:55 PDT
Ah, long cycle collection pauses!  That is interesting.

Please cc me, :mccr8, and :smaug on the bug.  Please also attach the gc and cc logs.
Comment 67 Ryan VanderMeulen [:RyanVM] 2013-07-29 07:15:03 PDT
OK to mark this as status-b2g18:fixed? If so, please do :)

Note You need to log in before you can comment on or make changes to this bug.