Last Comment Bug 790201 - Nuke social worker sandbox on shutdown
: Nuke social worker sandbox on shutdown
Status: VERIFIED FIXED
[Fx17]
: mlk
Product: Firefox
Classification: Client Software
Component: SocialAPI (show other bugs)
: Trunk
: All All
: -- normal (vote)
: Firefox 19
Assigned To: Mark Hammond [:markh]
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2012-09-11 05:32 PDT by Jared Wein [:jaws] (please needinfo? me)
Modified: 2012-12-04 15:12 PST (History)
9 users (show)
markh: in‑testsuite+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
+
verified
verified
verified


Attachments
Patch (5.70 KB, patch)
2012-09-13 04:33 PDT, :Felipe Gomes (needinfo me!)
markh: review+
Details | Diff | Splinter Review
Felipe's patch modified with hacks to get tests passing (4.71 KB, patch)
2012-10-11 23:49 PDT, Mark Hammond [:markh]
no flags Details | Diff | Splinter Review
New saner version (4.47 KB, patch)
2012-10-12 16:23 PDT, Mark Hammond [:markh]
no flags Details | Diff | Splinter Review
note to self - you still can't rush patches on Saturdays! (4.48 KB, patch)
2012-10-12 17:18 PDT, Mark Hammond [:markh]
no flags Details | Diff | Splinter Review
Rebased and new comments re this.sandbox (4.70 KB, patch)
2012-10-15 00:10 PDT, Mark Hammond [:markh]
felipc: review+
gavin.sharp: feedback+
Details | Diff | Splinter Review
Patch changes for reload (2.42 KB, patch)
2012-10-17 17:49 PDT, Mark Hammond [:markh]
mixedpuppy: review+
felipc: review+
Details | Diff | Splinter Review
Rollup patch for aurora/beta (8.65 KB, patch)
2012-10-25 12:58 PDT, :Felipe Gomes (needinfo me!)
felipc: review+
lukasblakk+bugs: approval‑mozilla‑aurora+
lukasblakk+bugs: approval‑mozilla‑beta+
Details | Diff | Splinter Review

Description Jared Wein [:jaws] (please needinfo? me) 2012-09-11 05:32:18 PDT
After enabling the Social API, there are reports from people that the heap-unclassified is growing to a very large percentage.

This needs to be investigated and we need to make sure that Firefox is not the source of any of these memory leaks.

As a side note, by design these social API panels have a lifetime equal to the browser window they reside in, so references may not be removed until shutdown in the majority of cases.
Comment 1 :Felipe Gomes (needinfo me!) 2012-09-13 04:33:56 PDT
Created attachment 660789 [details] [diff] [review]
Patch

This fixes most of the issues that I found while analyzing the code this week. Not everything is necessary but I believe they all increase correctness. Testing with Jared's demo provider I don't see any window leaks during runtime, and the nukeSandbox call gets rid of the shutdown leak.
Comment 2 :Felipe Gomes (needinfo me!) 2012-09-13 04:39:15 PDT
Another thing that we need to look into more (but I don't see it making any difference), is that if I call nukeSandbox in the middle of a session (not on shutdown), this message listener (http://mxr.mozilla.org/mozilla-central/source/toolkit/components/social/MessagePortWorker.js#99) being dispatched (not the JS code of the function -- the js stack is empty at the time) will throw "can't access dead object" errors.  So this mean something along the way there has a reference back to something that existed in the sandbox, which might or might not be a problem
Comment 3 Mark Hammond [:markh] 2012-09-13 04:41:37 PDT
Comment on attachment 660789 [details] [diff] [review]
Patch

Drive-by:

+    Services.obs.addObserver(function cleanupSandbox () {
+      Services.obs.removeObserver(cleanupSandbox, "xpcom-shutdown");
+      Cu.nukeSandbox(sandbox);
+    }, "xpcom-shutdown", false);
+

Would it be better to call terminateWorker() in that notification (or maybe that is too late for the other DOM cleanup)?

But that looks great (and I learned about nukeSandbox :)
Comment 4 :Gavin Sharp [email: gavin@gavinsharp.com] 2012-09-13 05:27:46 PDT
Comment on attachment 660789 [details] [diff] [review]
Patch

I don't think the event handler removals are necessary.

We should get a better bug on file about the need to unwrap for XHR, that WFM one isn't going anywhere :)
Comment 5 :Gavin Sharp [email: gavin@gavinsharp.com] 2012-09-13 09:29:39 PDT
(In reply to :Felipe Gomes from comment #2)
> Another thing that we need to look into more (but I don't see it making any
> difference), is that if I call nukeSandbox in the middle of a session (not
> on shutdown), this message listener
> (http://mxr.mozilla.org/mozilla-central/source/toolkit/components/social/
> MessagePortWorker.js#99) being dispatched (not the JS code of the function
> -- the js stack is empty at the time) will throw "can't access dead object"
> errors.  So this mean something along the way there has a reference back to
> something that existed in the sandbox, which might or might not be a problem

Not sure I follow exactly, but if you nuke the sandbox while the provider panels are still active, their use of mozSocial.getWorker or ports they've previously obtained to the worker will throw, right? Is that what you're seeing?
Comment 6 Mark Hammond [:markh] 2012-09-16 18:53:54 PDT
Comment on attachment 660789 [details] [diff] [review]
Patch

In general this looks good to me.  I'm happy with the event listeners being removed and as Gavin didn't explicitly say to remove that code, I'll leave that up to you.

In the same vein though, I think it would be nice if the xpcom shutdown observer was removed in terminateWorker - once terminateWorker has been called (which should happen in normal usage), the shutdown code is no longer necessary.  Another alternative might be to have a single observer that iterates over _workerCache and nukes all sandboxes it finds.  This really is a nit though as we don't expect many workers to be created and terminated in a normal run, so again, I'll leave that to your judgement.
Comment 7 :Felipe Gomes (needinfo me!) 2012-10-05 18:07:46 PDT
I've removed the removeEventListeners as they don't seem to be necessary at this poing, and I'll continue investigating

(In reply to :Gavin Sharp (use gavin@gavinsharp.com for email) from comment #4)
> We should get a better bug on file about the need to unwrap for XHR, that
> WFM one isn't going anywhere :)

Filed bug 798660


https://hg.mozilla.org/integration/mozilla-inbound/rev/e9237d1ae5ff
Comment 8 Phil Ringnalda (:philor) 2012-10-05 20:29:05 PDT
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/9b9678df7672 because something in the push was hitting "browser_frameworker.js | sub-test testEarlyClose failed: [Exception... "Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIXPCComponents_Utils.nukeSandbox]" nsresult: "0x80070057 (NS_ERROR_ILLEGAL_VALUE)" location: "JS frame :: resource://gre/modules/FrameWorker.jsm :: terminate :: line 239" data: no]" and "browser_frameworker.js | check that websockets worked - Got FAILED calling WebSocket constructor: TypeError: WebSocket is not a constructor, expected ok" and I didn't know which (though I presume this one), and didn't know interconnectedness among them.
Comment 9 Mark Hammond [:markh] 2012-10-11 23:49:25 PDT
Created attachment 670701 [details] [diff] [review]
Felipe's patch modified with hacks to get tests passing

I had a play with this and can reproduce the errors.  Some of the errors (eg, "Can't access dead object") are caused by the nukeSandbox being done before the hidden iframe has unloaded (infact, in at least 1 test we terminate the worker before the documentelementinserted (sp?) event has fired, so this.sandbox is null)

I attempted to solve this by doing all the termination in an unload handler for the document, which solved some of the problems but still leaves a number more related directly to the removal of the XPCNativeWrapper.unwrap - it causes the WebSocket test to outright fail, and most other tests to dump obscure exceptions and c++ stack traces - so something is still very wrong there.  These obscure assertions and WebSocket failure happens even when the nukeSandbox call is commented out.

This new patch is obviously not ready to be checked in (see the:

        // XXX - markh - this is necessary to avoid lots of obscure assertions

comment and commented out code around it - but it does pass the tests and doesn't fill the log with assertions.  I *know* WebSocket is one thing that needs it, but I'm not sure what others do - I've just re-enabled the unwrapping for all - and I probably won't get back to this until Monday.

There is also a change to toolkit/components/social/test/browser/worker_relative.js which looks like a genuine problem in the test which starts happening due to the worker termination waiting for the unload event)

*sob*
Comment 10 Mark Hammond [:markh] 2012-10-12 16:23:57 PDT
Created attachment 670979 [details] [diff] [review]
New saner version

It appears many of those assertions were caused by the patch adding XMLHttpRequest to the sandbox twice - once the "normal" version and once the unwrapped version.  By not doing that and by adding WebSocket to the "whitelist" those assertions and test failures go away.

I figured that instead of the xpcom shutdown observer, we might as well *always* add an unload event handler that does the cleanup, and terminate simply removes the node from the document triggering that unload.  App shutdown should also cause the unload to fire, meaning it gets cleaned up in that case too.  By referencing |sandbox| in the unload handler, we no longer need to store the sandbox in |this|.

Assuming this looks OK, we should file another bug for the need to unwrap WebSocket like we already have done for XMLHttpRequest.

Like my previous version of the patch, this also includes a fix for a legitimate test problem that wasn't seen before due to the timing characteristics of the previous shutdown - this shutdown process actually happens a little later after the worker window has completely finished doing its thing and unload fires.
Comment 11 :Gavin Sharp [email: gavin@gavinsharp.com] 2012-10-12 16:37:23 PDT
Comment on attachment 670979 [details] [diff] [review]
New saner version

>diff --git a/toolkit/components/social/FrameWorker.jsm b/toolkit/components/social/FrameWorker.jsm

>+    workerWindow.addEventListener("uload", function unloadListener() {
>+      workerWindow.removeEventListener("iload", unloadListener);

uload/iload? o_O

>   terminate: function terminate() {

>+    if (!this.url in workerCache) {

"!" binds tighter than "in", so you're testing "false in workerCache" here AFAICT (i.e. you'll never hit this).

I wonder whether moving the destruction of the ports to during unload cause trouble for any port-close handlers that try to make use of window-dependent APIs?
Comment 12 Mark Hammond [:markh] 2012-10-12 17:18:02 PDT
Created attachment 670994 [details] [diff] [review]
note to self - you still can't rush patches on Saturdays!

Fixed the obvious typos in the event handler and the "!" thang...
Comment 13 Mark Hammond [:markh] 2012-10-14 17:56:21 PDT
FWIW, I just built 17 from the beta tree and without this patch a shutdown reports:

 => mFreeCount:           77674  --  LEAKED 2799 !!!
 => mAdoptFreeCount:       4708  --  LEAKED 4 !!!

and with the patch it reports no leaks.  On trunk, no string leaks are reported either way.  So this patch has a noticeable effect on 17 and as such may still be a candidate for uplift.
Comment 14 Mark Hammond [:markh] 2012-10-15 00:10:20 PDT
Created attachment 671331 [details] [diff] [review]
Rebased and new comments re this.sandbox

Rebased to avoid a conflict on m-c and as discussed on IRC, added a comment about why storing the sandbox in |this| is no longer necessary.
Comment 15 :Felipe Gomes (needinfo me!) 2012-10-15 10:32:52 PDT
Comment on attachment 671331 [details] [diff] [review]
Rebased and new comments re this.sandbox

Review of attachment 671331 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/social/FrameWorker.jsm
@@ +118,4 @@
>                       'location'];
>      workerAPI.forEach(function(fn) {
>        try {
> +        // Bug 756173 - XHR has issues in a sandbox and need

update this bug number with the new filed about the issue: bug 798660. I'll update that bug to also mention WebSocket

@@ +252,5 @@
> +          Cu.reportError("FrameWorker: failed to close pending port. " + ex);
> +        }
> +      }
> +      worker.pendingPorts = null;
> +  

nit: whitespace on blank line
Comment 16 :Gavin Sharp [email: gavin@gavinsharp.com] 2012-10-15 15:07:51 PDT
Comment on attachment 671331 [details] [diff] [review]
Rebased and new comments re this.sandbox

Looks good to me.
Comment 17 Mark Hammond [:markh] 2012-10-16 22:11:21 PDT
This patch is failing on try with:

TEST-INFO | chrome://mochitests/content/browser/toolkit/components/social/test/browser/browser_workerAPI.js | Console message: [JavaScript Error: "TypeError: pending is null" {file: "resource://gre/modules/FrameWorker.jsm" line: 215}]
TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/toolkit/components/social/test/browser/browser_workerAPI.js | Test timed out

https://tbpl.mozilla.org/php/getParsedLog.php?id=16183641&tree=Try

which implies we are executing a 'load' handler after the 'unload' handler has set the array to null.  While the patch could avoid this by setting the arrays back to an empty array instead of null, that smells too much - we probably should work out why...
Comment 18 Mark Hammond [:markh] 2012-10-17 17:49:56 PDT
Created attachment 672616 [details] [diff] [review]
Patch changes for reload

The reload of workers broke the original patch.  Attaching modifications to the original patch necessary to get the nukeSandbox changes working with reload.

Note this is *on top of* the other patch - I'll fold them into 1 for checkin, but figured this might make the reload specific changes easier to review.

Ideally Shane *and* Felipe could review - Shane as he understands the reload code best, and Felipe so I can carry the original r+ forward.

Try at https://tbpl.mozilla.org/?tree=Try&rev=bcd7c250ba0a
Comment 20 Ed Morley [:emorley] 2012-10-18 10:34:37 PDT
https://hg.mozilla.org/mozilla-central/rev/b872b207397d
Comment 21 :Felipe Gomes (needinfo me!) 2012-10-25 12:58:47 PDT
Created attachment 675253 [details] [diff] [review]
Rollup patch for aurora/beta

[Approval Request Comment]
Bug caused by (feature/regressing bug #): social api
User impact if declined: better clean-up on shutdown
Testing completed (on m-c, etc.): landed on m-c last week
Risk to taking this patch (and alternatives if risky): touches social-only worker reload/shutdown
String or UUID changes made by this patch: none
Comment 22 Lukas Blakk [:lsblakk] use ?needinfo 2012-10-25 13:35:22 PDT
Comment on attachment 675253 [details] [diff] [review]
Rollup patch for aurora/beta

needed for social-api, been on m-c a week, approving.
Comment 24 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-12-04 15:12:55 PST
Marking verified fixed since this landed with tests. Please add verifyme keyword if there's something you need QA to test.

Note You need to log in before you can comment on or make changes to this bug.