Bug 669545 - coffeekup.org zombie compartment due to web workers
Status: RESOLVED WORKSFORME
Whiteboard: [MemShrink:P1]
Keywords: mlk, testcase
Product: Core
Classification: Components
Component: XPConnect
Version: unspecified
Hardware: All All
Priority: -- Severity: normal
Target Milestone: ---
Assigned To: Alon Zakai (:azakai)
URL: http://coffeekup.org
Depends on: new-web-workers
Blocks: ZombieCompartments, 674679
Reported: 2011-07-05 20:37 PDT by Nicholas Nethercote [:njn] (on vacation until July 11)
Modified: 2011-07-27 16:19 PDT
CC: 22 users


Attachments
partially reduced testcase (232.14 KB, application/octet-stream)
2011-07-14 11:55 PDT, Alon Zakai (:azakai)
weak refs in nsDOMWorkerPool (4.22 KB, patch)
2011-07-14 17:56 PDT, Alon Zakai (:azakai)
bent.mozilla: feedback-
refcount tree (201.80 KB, text/plain)
2011-07-15 18:47 PDT, Alon Zakai (:azakai)
proto leak info (2.11 KB, text/plain)
2011-07-18 13:43 PDT, Alon Zakai (:azakai)
CC graph root finder script (1.57 KB, text/plain)
2011-07-20 17:41 PDT, Alon Zakai (:azakai)
XPCOM_MEM_LEAK_LOG (37.69 KB, text/plain)
2011-07-21 09:44 PDT, Alon Zakai (:azakai)
Reduced testcase (748 bytes, text/html)
2011-07-26 19:00 PDT, Alon Zakai (:azakai)

Description Nicholas Nethercote [:njn] (on vacation until July 11) 2011-07-05 20:37:29 PDT
FF8 trunk build, not-quite-new profile, no add-ons enabled.

Steps to reproduce:
- Open about:memory?verbose in one tab
- Open coffeekup.org in another tab, then close it
- Hit "minimize memory usage" multiple times in the first tab
- Wail when the coffeekup.org compartment doesn't disappear.

I tried waiting 10 or so minutes and minimizing memory usage again, just in case it was like TBPL in bug 668871, but it didn't help.

Here's the compartment stats:

│  ├───9,953,435 B (11.30%) -- compartment(http://coffeekup.org/)
│  │   ├──3,809,280 B (04.32%) -- gc-heap
│  │   │  ├──1,529,272 B (01.74%) -- arena-unused
│  │   │  ├──1,425,224 B (01.62%) -- objects
│  │   │  ├────600,512 B (00.68%) -- shapes
│  │   │  ├────176,896 B (00.20%) -- strings
│  │   │  ├─────55,056 B (00.06%) -- arena-padding
│  │   │  ├─────22,320 B (00.03%) -- arena-headers
│  │   │  └──────────0 B (00.00%) -- xml
│  │   ├──2,293,760 B (02.60%) -- mjit-code
│  │   ├──1,540,752 B (01.75%) -- tjit-data
│  │   │  ├──1,392,752 B (01.58%) -- allocators-main
│  │   │  └────148,000 B (00.17%) -- allocators-reserve
│  │   ├────785,977 B (00.89%) -- scripts
│  │   ├────655,360 B (00.74%) -- tjit-code
│  │   ├────520,608 B (00.59%) -- mjit-data
│  │   ├────257,904 B (00.29%) -- object-slots
│  │   └─────89,794 B (00.10%) -- string-chars
Comment 1 Alon Zakai (:azakai) 2011-07-13 17:41:50 PDT
Did reduction of the JS on this website, until it became clear that a Worker is crucial for this leak to happen. (The website uses a Worker inside the bespin editor, to do syntax highlighting.)

khuey reminded me that bent has a patch that rewrites much of the Worker code (bug 649537). Sadly though that patch doesn't fix this.

What complicates figuring this bug out is that the leak doesn't always happen, and it appears to be sensitive to small things in the JS code, like whether the Worker is created inside a closure or not.
Comment 2 Jesse Ruderman 2011-07-14 01:01:24 PDT
Alon, can you attach your partially-reduced testcase? I wonder if adding a GC or CC call somewhere would make it deterministic.
Comment 3 Alon Zakai (:azakai) 2011-07-14 11:55:59 PDT
Created attachment 545965 [details]
partially reduced testcase

Here is the partially reduced testcase. Much simpler than the original, but still very complicated.

Needs a webserver to run (|python -m SimpleHTTPServer 8000| will work).
Comment 4 Alon Zakai (:azakai) 2011-07-14 15:34:17 PDT
Refcount logging appears to confirm that the problem is in Worker code. cc'ing bent, any help understanding what is going on here would be much appreciated.

It looks like nsDOMWorker::InitializeInternal calls nsDOMThreadService::RegisterWorker, which creates an nsDOMWorkerPool. The nsDOMWorkerPool holds a strong nsCOMPtr<nsIDocument> mParentDocument reference to the document (which is what leaks).

nsDOMWorkerPool is not an XPCOM object and that reference to the document is unknown to the cycle collector, so my current guess is that it should be freed manually, but for some reason that isn't happening. Starting to read the worker code now.
Comment 5 Alon Zakai (:azakai) 2011-07-14 16:28:02 PDT
I verified that zeroing out the document in nsDOMWorkerPool clears up the leak, as expected, confirming the cause of this bug.

No idea yet what a proper fix would look like here though.
Comment 6 Alon Zakai (:azakai) 2011-07-14 17:56:38 PDT
Created attachment 546062 [details] [diff] [review]
weak refs in nsDOMWorkerPool

I don't know if this makes sense or not, but here is a patch that uses a weak ref in nsDOMWorkerPool for the document.

With this, we no longer leak the compartment. However, XPCOM_MEM_LEAK_LOG indicates we do leak an nsWeakReference, I assume the one in the nsDOMWorkerPool, which means we also leak the nsDOMWorkerPool silently. So the bigger issue is, what is in charge of the lifecycle of that object, and why isn't it working here.
Comment 7 Olli Pettay [:smaug] (high review load, please consider other reviewers) 2011-07-15 02:51:28 PDT
(In reply to comment #6)
 I assume the one in the
> nsDOMWorkerPool, which means we also leak the nsDOMWorkerPool silently.

I don't understand the "also leak the nsDOMWorkerPool silently".
Either you leak or not.
Comment 8 Olli Pettay [:smaug] (high review load, please consider other reviewers) 2011-07-15 02:55:23 PDT
Comment on attachment 546062 [details] [diff] [review]
weak refs in nsDOMWorkerPool


>+  if (aDocument) {
>+    nsCOMPtr<nsISupportsWeakReference> document = do_QueryInterface(aDocument);
>+    nsIWeakReference *weakRef;
>+    nsresult rv = document->GetWeakReference(&weakRef);
>+    if (NS_SUCCEEDED(rv))
>+      mParentDocument = weakRef;
>+  }
This leaks the nsIWeakReference.
GetWeakReference AddRefs, and so does the assignment to mParentDocument.
Comment 9 Ben Turner (not reading bugmail, use the needinfo flag!) 2011-07-15 05:41:12 PDT
Live workers require that the parent document stay alive, weak references are the wrong solution here.

Can you explain to me a little more about what's going on here?
Comment 10 Ben Turner (not reading bugmail, use the needinfo flag!) 2011-07-15 05:42:25 PDT
Comment on attachment 546062 [details] [diff] [review]
weak refs in nsDOMWorkerPool

Oh, also, nsIWeakReference is not threadsafe.
Comment 11 Boris Zbarsky [:bz] (Out June 25-July 6) 2011-07-15 06:14:17 PDT
> This leaks nsIWeakReference.

Yeah, this is why do_GetWeakReference exists.
Comment 12 Alon Zakai (:azakai) 2011-07-15 09:50:29 PDT
(In reply to comment #7)
> (In reply to comment #6)
>  I assume the one in the
> > nsDOMWorkerPool, which means we also leak the nsDOMWorkerPool silently.
> 
> I don't understand the "also leak the nsDOMWorkerPool silently".
> Either you leak or not.

Sorry for not being clear, I meant that I suspected it was leaked, but it did not show up in the leak log. Given the last few comments though, I think I was wrong, so please ignore that comment of mine.

(In reply to comment #9)
> Live workers require that the parent document stay alive, weak references
> are the wrong solution here.
> 
> Can you explain to me a little more about what's going on here?

The situation here is that the parent document is dead, or rather should be dead - the tab containing it is closed. But it is kept alive by the worker pool because of a worker inside that tab.

I don't really understand the lifecycle here. Should a worker be able to keep a document alive if otherwise it should be cleaned up?
Comment 13 Alon Zakai (:azakai) 2011-07-15 11:22:05 PDT
Debugging through the worker code, it looks like

* When creating the page's first worker, the nsDOMWorkerPool is put in the hashtable of pools, which addrefs it.
* The worker addrefs the pool it belongs to.
* Two additional workers are created on this page, each of them addrefing the pool.
* When the tab is closed, nsDOMThreadService::CancelWorkersForGlobal is called, which leads to the pool being removed from the hashtable, releasing it.
* The workers get Cancel() called on them, but they never release their holds on the pool, keeping it alive.

I can't find a place in the code where the workers release their holds on the pool. So I think that will only happen when they are destroyed. Which means that the next question is why the workers are not destroyed.

However, is there a reason why the workers do not release their hold on the pool when they are cancelled?
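The addref/release sequence traced above can be illustrated with a toy refcount simulation. This is a sketch only: the class and method names below are hypothetical stand-ins, not the real Gecko classes, and it models just the bookkeeping described in the bullets.

```python
# Toy model of the refcount flow described above.
# All names here are illustrative, not the actual Gecko API.

class RefCounted:
    def __init__(self, name):
        self.name = name
        self.refcnt = 0
        self.destroyed = False

    def addref(self):
        self.refcnt += 1

    def release(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            self.destroyed = True  # the real object would be freed here

pool = RefCounted("nsDOMWorkerPool")

# The pool is put in the hashtable of pools, which addrefs it.
pool.addref()

# Three workers are created on the page; each addrefs the pool.
workers = [RefCounted("worker%d" % i) for i in range(3)]
for w in workers:
    pool.addref()

# Tab closed: CancelWorkersForGlobal removes the pool from the
# hashtable, releasing that reference...
pool.release()

# ...but the cancelled workers never release their holds, so the
# pool (and the document it pins) stays alive.
print(pool.refcnt)     # 3
print(pool.destroyed)  # False
```

Running this shows the pool stuck at refcount 3 after the tab is gone, which matches the behavior observed with refcount logging.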
Comment 14 Alon Zakai (:azakai) 2011-07-15 18:47:25 PDT
Created attachment 546268 [details]
refcount tree

Still investigating why the workers are not destroyed here. The situation seems a little complicated; analyzing the refcount tree with --ignore has turned out to be misleading (it hid branches that appeared balanced but were actually balanced against other allocations).

I am manually auditing each addref and release currently, without --ignore. So far I ruled out most of the straightforward stuff as valid. What remains is mainly some scary-looking (180+ frames deep...) xpconnect wrapping stuff.

Attached is the refcount tree (partially annotated by me, comments begin with XXX) if anyone wants to take a look too.
Comment 15 Ben Turner (not reading bugmail, use the needinfo flag!) 2011-07-15 23:28:30 PDT
Yeah, I think you're right, the problem is that the workers aren't going away. Once they do the pool and document should follow.
Comment 16 Andrew McCreight (PTO-ish through 6-29) [:mccr8] 2011-07-16 05:05:00 PDT
Zombie workers!
Comment 17 Alon Zakai (:azakai) 2011-07-18 13:43:21 PDT
Created attachment 546620 [details]
proto leak info

Looks like the problem is in the js engine or xpconnect. Comparing the non-leak case to the leaking one, this attachment is what is missing. That is, there is an XPCWrappedNativeProto that is not cleaned up, and it holds on to the worker (which holds on to the worker pool, which holds on to the document).
Comment 18 Alon Zakai (:azakai) 2011-07-19 17:03:58 PDT
I am pretty much stuck at this point, as far as I can tell there is nothing wrong in the Worker code, and I am getting confused in xpconnect. Help would be greatly appreciated.

From what I can tell, the nsDOMWorker creates an xpconnect wrapper for itself. That XPCWrappedNative has an XPCWrappedNativeProto which has a strong ref on the nsDOMWorker. That keeps the nsDOMWorker alive, which keeps the worker pool alive, which keeps the document alive, and we leak the entire tab even after it is closed.

Now, the XPCWrappedNative code says

 *  Wrapped Native lifetime management is messy!
 *
 *  - At creation we push the refcount to 2 (only one of which is owned by
 *    the native caller that caused the wrapper creation).
 *  - During the JS GC Mark phase we mark any wrapper with a refcount > 1.
 *  - The *only* thing that can make the wrapper get destroyed is the
 *    finalization of mFlatJSObject. And *that* should only happen if the only
 *    reference is the single extra (internal) reference we hold.

I see that the XPCWrappedNative has one remaining ref, so it sounds like that should be the single extra internal one mentioned there? But the XPCWrappedNative is not getting cleaned up for some reason.

On the other hand, the XPCWrappedNativeProto code says

    // This native object lives as long as its associated JSObject - killed
    // by finalization of the JSObject (or explicitly if Init fails).

I assume the associated JSObject is the nsDOMWorker itself? I haven't found that in the code, but don't have any other guesses. That seems problematic however, since the XPCWrappedNativeProto (a non cycle collected object) is keeping the nsDOMWorker alive, and that comment seems to imply that the nsDOMWorker keeps the XPCWrappedNativeProto alive, which is a loop the cycle collector is not aware of. So I must be misunderstanding something.
Comment 19 Alon Zakai (:azakai) 2011-07-19 18:13:41 PDT
Further investigation shows that NS_CYCLE_COLLECTION_CLASSNAME(XPCWrappedNative)::Traverse is *not* called on the leaked XPCWrappedNative that we care about (the worker's). This seems suspicious, because the code in Traverse looks like it is responsible for enforcing the logic quoted in the previous comment, namely that refcount == 1 is special.
Comment 20 Alon Zakai (:azakai) 2011-07-20 17:41:48 PDT
Created attachment 547304 [details]
CC graph root finder script

Today's update:

* Looks like XPCWrappedNatives are expected to have at least one ref, an 'internal' one that is from their internal JS object (or some other form of self-ref, I am not sure). XPCWrappedNatives are usually *not* CC'd normally, as their refs normally never reach 0. When they reach a refcount of 1, that doesn't mean they should be collected. Only when they have a refcount of 1, then when their internal JS Object is destroyed - then they are cleaned up, directly from that destruction code, and not from the cycle collector. (Or, I suspect that if the internal JSObject dies while they have additional refs, they will live while those additional refs live - and eventually be collected normally.)

* So, having a single ref for the important XPCWrappedNative, and it not being cleaned up, means that the internal JSObject is leaking.

* Attached is a python script (maybe someone else will find it useful) that parses the CC log. It prints out the roots for a specific object, that is, what is holding it alive.

* Running that script on the leaking page, during the leak (that is: close the tab, run GC/CC multiple times, see the page refuse to vanish from about:memory) shows that the root for the XPCWrappedNative's internal JSObject is... drumroll... the page's nsDocument, through a convoluted path of length 64 (there might be other paths too). The convoluted path is probably due to the convoluted JS code, specifically the CommonJS Tiki module dependency system.

So:

1. As we saw before, the worker keeps the worker pool alive, which keeps the document alive.

2. The document is a root because the cycle collector knows about 4 of its 5 references. We know what that additional reference is, it is the worker pool, which is not cycle collected.

3. The document keeps the worker's wrapper's internal JSObject alive, which keeps the worker alive.

I am a little tired at the moment, so maybe I am missing something, but this looks like an untangleable loop that the cycle collector cannot fix. Which would seem to explain the leak.

However, a remaining mystery is why this does *not* happen in my attempted reduced testcases. They also create a worker and make the document link to it, but things work out somehow.

So there must be some additional element in this system, that helps out. For example, perhaps when the page is unloaded its document manually detaches its child objects or something like that? That would prevent a leak like this, but it's just a random wild guess, there could be any number of other possibilities. But whatever the additional element, it works normally but fails in this leaking page.
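The attached root-finder script itself is not reproduced on this page, but the core idea — walk the CC graph edges backwards from a target object until you hit nodes nothing else points at — can be sketched as follows. The log format used here is a simplified stand-in for illustration, not the exact cycle-collector log syntax, and `find_roots` is a hypothetical name.

```python
from collections import defaultdict, deque

def find_roots(log_text, target):
    """Return the roots (nodes with no incoming edges) that can reach
    `target` in a cycle-collector-style graph dump.

    Assumed simplified format: a node line '0xADDR [rc=N] ClassName'
    followed by its outgoing edges as '> 0xADDR edge-name'.
    """
    parents = defaultdict(set)  # child addr -> set of parent addrs
    names = {}
    current = None
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parents[line.split()[1]].add(current)
        else:
            current = line.split()[0]
            names[current] = line.split()[-1]
            parents.setdefault(current, set())

    # BFS backwards from the target; anything reached that has no
    # parents of its own is a root holding the target alive.
    roots, seen, queue = [], {target}, deque([target])
    while queue:
        node = queue.popleft()
        if not parents[node]:
            roots.append(names.get(node, node))
        for p in parents[node]:
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return roots

log = """
0x1 [rc=5] nsDocument
> 0x2 listener
0x2 [rc=1] nsEventListener
> 0x3 worker-js-object
0x3 [rc=1] JSObject
"""
print(find_roots(log, "0x3"))  # ['nsDocument']
```

On a graph shaped like the one described above, such a script reports the document as the root keeping the worker's JSObject alive.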
Comment 21 Olli Pettay [:smaug] (high review load, please consider other reviewers) 2011-07-21 02:15:29 PDT
Cycle collector sure should be taught about the mParentDocument in nsDOMWorkerPool.

What all are we leaking? What does XPCOM_MEM_LEAK_LOG say?
Comment 22 Alon Zakai (:azakai) 2011-07-21 07:20:28 PDT
(In reply to comment #21)
> Cycle collector sure should be taught about the mParentDocument in
> nsDOMWorkerPool.
> 

Yes, one solution here would be to make the nsDOMWorkerPool cycle collected and tell it about mParentDocument. However I am still puzzled why we don't leak on pages other than this one, for some reason that solution is not normally needed. I want to fully understand that before doing anything.

> What all are we leaking? What does XPCOM_MEM_LEAK_LOG say?

We are leaking the entire tab I think, the document holds on to basically everything, so XPCOM_MEM_LEAK_LOG is very long. If it's useful I can paste the entire output of XPCOM_MEM_LEAK_LOG in a few hours when I get to work.
Comment 23 Peter Van der Beken [:peterv] 2011-07-21 07:45:52 PDT
(In reply to comment #22)
> Yes, one solution here would be to make the nsDOMWorkerPool cycle collected

We can't add threadsafe objects to the CC.
Comment 24 Olli Pettay [:smaug] (high review load, please consider other reviewers) 2011-07-21 08:51:59 PDT
Yeah, I was thinking of adding mParentDocument to the CC using some
non-thread-safe helper object.

XPCOM_MEM_LEAK_LOG can be useful.
Comment 25 Alon Zakai (:azakai) 2011-07-21 09:44:17 PDT
Created attachment 547420 [details]
XPCOM_MEM_LEAK_LOG

Here is the XPCOM_MEM_LEAK_LOG.

Interesting about the threading issues with cycle collection, I wasn't aware of that. Can either of you perhaps give an example of a class that does what we might need here, to help me understand how this stuff works?
Comment 26 Alon Zakai (:azakai) 2011-07-21 17:24:23 PDT
bent, I think we can resolve this bug by having the workers release their hold on the pool when they are Killed. Any reason that would be a bad thing to do?

I still can't figure out why this leak doesn't happen on other web pages. The internal JS Object of the Worker is garbage collected in the JS engine somehow. I haven't been able to figure out from JS_DumpHeap what is going on there, though (for example it isn't even obvious to me what is considered a root in that dump).
Comment 27 Alon Zakai (:azakai) 2011-07-22 16:56:22 PDT
Hacking the JS engine a bit, I think I can see enough of the JS GC heap. When the leak happens, the Worker is not cleaned up due to an nsXPCWrappedJS'd nsIDOMEventListener, and looking through the CC graph with find_leaks.py, what keeps that event listener alive is the Document.

This confirms the analysis of the leak from before, and suggests that the solution might be what I asked about in comment 26 (detach the pool from the worker when the worker is killed).

However, the mystery remains as to why this leak does *not* happen in other web pages. I've been looking for code that does something to prevent this kind of situation, like say detaching event handlers when a window's docshell is set to null, but can't find anything so far. Does anyone know?
Comment 28 Alon Zakai (:azakai) 2011-07-25 18:28:40 PDT
Some more notes as I continue to investigate this:

* The hold on the Worker JS object changes when nsGlobalWindow::CleanUp does

  mInnerWindowHolder = nsnull;

This is exactly the object at the beginning of the chain holding on to the Worker in the JS GC dump log. At this point, the GC dump log indicates that the object is still held on to by the same chain, but with |XPCWrappedNative::mFlatJSObject(... Window)| replaced by |machine stack(... Window)|, which makes sense. So far the leaking and non-leaking cases are basically the same.

* When the stack unwinds enough (back up to |nsGlobalWindow::SetDocShell(0)|), in the non-leaking case the object is no longer held on to, and will eventually be collected. However, in the leaking case there is another hold on the object, this time beginning with |nsXPCWrappedJS[nsIDOMEventListener,...]|. Next step is to start investigating what is going on there.

I also noticed that in the leaking case, but not the non-leaking case, the Worker object is used through two JS contexts and not just one (that is, when printing the JS GC dump log for all the contexts, the Worker appears in two contexts' logs). No idea if this is significant in any way.
Comment 29 Alon Zakai (:azakai) 2011-07-26 19:00:59 PDT
Created attachment 548657 [details]
Reduced testcase

Here is a properly reduced testcase. (Warning: that will leak 10MB of memory if you load it.)

The issue is event listeners on Nodes in the page, and the worker code. More specifically,

* The worker code, as explained before, creates a cycle with the document, the worker (both JS object and XPCWrappedNative), and the worker pool.

* Something must be done to untangle that loop. It turns out that nsGlobalWindow, when it is told to clean itself up - before it is actually deleted, in FreeInnerObjects - will do various useful things to prevent leaks. One is to detach from the Window JS Object. Another is to call Disconnect() on its nsEventListenerManager, which detaches event listeners on the window. (Figuring out how the event listeners relate to their XPCWrappedJS objects turned out to involve some tricky XPConnect things, thanks go to khuey for explaining that stuff to me.)

* If an event listener is created for a *Node* in the page, and not on the window itself, then it will leak in this case. Nothing tells it to Disconnect() its event listener manager, and that holds on to the whole page.

To summarize, we do *not* leak if the reduced testcase has

  window.addEventListener(

but we do with

  document.getElementById('the_div').addEventListener(

because the former cleans up its event listeners while the latter does not.

So, this leak might happen without workers as well, if other code creates such cycles that assume event listeners will be detached manually. I don't know how to estimate the chance of that though.

I am thinking that the way to go here is

* Call Disconnect for all Nodes in the page, and not just the global window itself, when the window is told to clean itself up. Does that make sense?

* Make DOMWorkers release their holds on their Pool when Killed. This is not strictly needed with the former item, but seems safer, and I don't immediately see a problem with doing it?
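The second proposal — having workers drop their pool reference when killed — can be sketched with a small refcount model. As before, these class and method names are hypothetical illustrations, not the real worker classes.

```python
# Toy illustration of why releasing the pool on Kill() breaks the
# leak: once every worker drops its hold, the pool's refcount can
# reach zero and the document it pins is released. Hypothetical names.

class Pool:
    def __init__(self):
        self.refcnt = 0
        self.alive = True

    def addref(self):
        self.refcnt += 1

    def release(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            self.alive = False  # document reference dropped here

class Worker:
    def __init__(self, pool):
        self.pool = pool
        pool.addref()

    def kill(self):
        # Proposed change: release the hold on the pool when killed,
        # instead of waiting for the worker's own destruction.
        if self.pool is not None:
            self.pool.release()
            self.pool = None

pool = Pool()
workers = [Worker(pool) for _ in range(3)]
for w in workers:
    w.kill()

print(pool.alive)  # False: pool freed even though the worker
                   # objects themselves are still reachable
```

With the release in `kill()`, the pool no longer outlives the tab even if the worker objects are temporarily kept alive by the document cycle.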
Comment 30 Nicholas Nethercote [:njn] (on vacation until July 11) 2011-07-26 19:17:57 PDT
Nice work, azakai.  I give you a gold star.
Comment 31 Boris Zbarsky [:bz] (Out June 25-July 6) 2011-07-26 19:50:49 PDT
We could disconnect on everything in the document tree, but things not in the tree would still be able to form that cycle.  And it might well give the wrong behavior to do that.

If we can break the cycle on the DOMWorker side, that seems like the way to go.
Comment 32 Olli Pettay [:smaug] (high review load, please consider other reviewers) 2011-07-27 02:41:16 PDT
Yep, if worker code keeps document alive and doesn't add the
reference to document to CC, that is the problem to fix.

Nodes' event listeners are cycle collected, so if there is a cycle
which CC can detect, everything should be released properly.
The problem is that in this case CC doesn't know about one reference.
Comment 33 Alon Zakai (:azakai) 2011-07-27 11:03:33 PDT
Heh, looks like the leak has been fixed by bent's landing of the latest version of the new workers code :) I tried the patch from two weeks ago when I started working on this bug and it didn't fix the leak back then, so the fix must be in one of the updates since.

I am still worried though about the DOM stuff. We are manually disconnecting event listeners on the window but not on child nodes, which seems inconsistent. I mean, is there a situation where it is correct to disconnect the window's listeners but *not* child nodes? khuey suggested that I not disconnect the window, and see what if anything leaks, doing that now.
Comment 34 Kyle Huey [:khuey] (khuey@mozilla.com) 2011-07-27 11:08:38 PDT
(In reply to comment #33)
> I am still worried though about the DOM stuff. We are manually disconnecting
> event listeners on the window but not on child nodes, which seems
> inconsistent. I mean, is there a situation where it is correct to disconnect
> the window's listeners but *not* child nodes? khuey suggested that I not
> disconnect the window, and see what if anything leaks, doing that now.

Can you file a separate MemShrink bug (maybe clone this one, so we get the CC list) for that?  It might be worth asking Jesse to fuzz a build without the window event listener disconnect too.
Comment 35 Olli Pettay [:smaug] (high review load, please consider other reviewers) 2011-07-27 13:08:38 PDT
(In reply to comment #33)
> I am still worried though about the DOM stuff. We are manually disconnecting
> event listeners on the window but not on child nodes, which seems
> inconsistent.
We are manually disconnecting event listeners from nodes, if unlinking
hasn't done that already.
See nsNodeUtils::LastRelease. I know that happens only during last release, but
unlinking should just do the right thing anyway.
Comment 36 Alon Zakai (:azakai) 2011-07-27 14:03:20 PDT
Created a clone, bug 674679, to continue investigation of the DOM issues that arose here. Let's continue the discussion there.
Comment 37 Nicholas Nethercote [:njn] (on vacation until July 11) 2011-07-27 16:13:03 PDT
(In reply to comment #33)
> Heh, looks like the leak has been fixed by bent's landing of the latest
> version of the new workers code :)

I just tried and it's fixed for me too.  Thanks, bent!
