Nuke social worker sandbox on shutdown

VERIFIED FIXED in Firefox 17

Status

()

Firefox
SocialAPI
VERIFIED FIXED
5 years ago
5 years ago

People

(Reporter: jaws, Assigned: markh)

Tracking

({mlk})

Trunk
Firefox 19
Points:
---
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(firefox17+ verified, firefox18 verified, firefox19 verified)

Details

(Whiteboard: [Fx17])

Attachments

(3 attachments, 4 obsolete attachments)

After enabling the Social API, there are reports from people that the heap-unclassified is growing to a very large percentage.

This needs to be investigated and we need to make sure that Firefox is not the source of any of these memory leaks.

As a side note, by design these social API panels have a lifetime equal to the browser window they reside in, so references may not be removed until shutdown in the majority of cases.
Whiteboard: [Fx17]
Created attachment 660789 [details] [diff] [review]
Patch

This fixes most of the issues that I found while analyzing the code this week. Not everything is necessary but I believe they all increase correctness. Testing with Jared's demo provider I don't see any window leaks during runtime, and the nukeSandbox call gets rid of the shutdown leak.
Attachment #660789 - Flags: review?(gavin.sharp)
Another thing that we need to look into more (but I don't see it making any difference), is that if I call nukeSandbox in the middle of a session (not on shutdown), this message listener (http://mxr.mozilla.org/mozilla-central/source/toolkit/components/social/MessagePortWorker.js#99) being dispatched (not the JS code of the function -- the js stack is empty at the time) will throw "can't access dead object" errors.  So this mean something along the way there has a reference back to something that existed in the sandbox, which might or might not be a problem
(Assignee)

Comment 3

5 years ago
Comment on attachment 660789 [details] [diff] [review]
Patch

Drive-by:

+    Services.obs.addObserver(function cleanupSandbox () {
+      Services.obs.removeObserver(cleanupSandbox, "xpcom-shutdown");
+      Cu.nukeSandbox(sandbox);
+    }, "xpcom-shutdown", false);
+

Would it be better to call terminateWorker() in that notification (or maybe that is too late for the other DOM cleanup)?

But that looks great (and I learned about nukeSandbox :)
Comment on attachment 660789 [details] [diff] [review]
Patch

I don't think the event handler removals are necessary.

We should get a better bug on file about the need to unwrap for XHR, that WFM one isn't going anywhere :)
(In reply to :Felipe Gomes from comment #2)
> Another thing that we need to look into more (but I don't see it making any
> difference), is that if I call nukeSandbox in the middle of a session (not
> on shutdown), this message listener
> (http://mxr.mozilla.org/mozilla-central/source/toolkit/components/social/
> MessagePortWorker.js#99) being dispatched (not the JS code of the function
> -- the js stack is empty at the time) will throw "can't access dead object"
> errors.  So this mean something along the way there has a reference back to
> something that existed in the sandbox, which might or might not be a problem

Not sure I follow exactly, but if you nuke the sandbox while the provider panels are still active, their use of mozSocial.getWorker or ports they've previously obtained to the worker will throw, right? Is that what you're seeing?
(Assignee)

Comment 6

5 years ago
Comment on attachment 660789 [details] [diff] [review]
Patch

In general this looks good to me.  I'm happy with the event listeners being removed and as Gavin didn't explicitly say to remove that code, I'll leave that up to you.

In the same vein though, I think it would be nice if the xpcom shutdown observer was removed in terminateWorker - once terminateWorker has been called (which should happen in normal usage), the shutdown code is no longer necessary.  Another alternative might be to have a single observer that iterates over _workerCache and nukes all sandboxes it finds.  This really is a nit though as we don't expect many workers to be created and terminated in a normal run, so again, I'll leave that to your judgement.
Attachment #660789 - Flags: review?(gavin.sharp) → review+
I've removed the removeEventListeners as they don't seem to be necessary at this poing, and I'll continue investigating

(In reply to :Gavin Sharp (use gavin@gavinsharp.com for email) from comment #4)
> We should get a better bug on file about the need to unwrap for XHR, that
> WFM one isn't going anywhere :)

Filed bug 798660


https://hg.mozilla.org/integration/mozilla-inbound/rev/e9237d1ae5ff
tracking-firefox17: --- → ?
Summary: Memory leak when Social API is enabled → Nuke social worker sandbox on shutdown
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/9b9678df7672 because something in the push was hitting "browser_frameworker.js | sub-test testEarlyClose failed: [Exception... "Component returned failure code: 0x80070057 (NS_ERROR_ILLEGAL_VALUE) [nsIXPCComponents_Utils.nukeSandbox]" nsresult: "0x80070057 (NS_ERROR_ILLEGAL_VALUE)" location: "JS frame :: resource://gre/modules/FrameWorker.jsm :: terminate :: line 239" data: no]" and "browser_frameworker.js | check that websockets worked - Got FAILED calling WebSocket constructor: TypeError: WebSocket is not a constructor, expected ok" and I didn't know which (though I presume this one), and didn't know interconnectedness among them.

Updated

5 years ago
tracking-firefox17: ? → +
(Assignee)

Comment 9

5 years ago
Created attachment 670701 [details] [diff] [review]
Felipe's patch modified with hacks to get tests passing

I had a play with this and can reproduce the errors.  Some of the errors (eg, "Can't access dead object") are caused by the nukeSandbox being done before the hidden iframe has unloaded (infact, in at least 1 test we terminate the worker before the documentelementinserted (sp?) event has fired, so this.sandbox is null)

I attempted to solve this by doing all the termination in an unload handler for the document, which solved some of the problems but still leaves a number more related directly to the removal of the XPCNativeWrapper.unwrap - it causes the WebSocket test to outright fail, and most other tests to dump obscure exceptions and c++ stack traces - so something is still very wrong there.  These obscure assertions and WebSocket failure happens even when the nukeSandbox call is commented out.

This new patch is obviously not ready to be checked in (see the:

        // XXX - markh - this is necessary to avoid lots of obscure assertions

comment and commented out code around it - but it does pass the tests and doesn't fill the log with assertions.  I *know* WebSocket is one thing that needs it, but I'm not sure what others do - I've just re-enabled the unwrapping for all - and I probably won't get back to this until Monday.

There is also a change to toolkit/components/social/test/browser/worker_relative.js which looks like a genuine problem in the test which starts happening due to the worker termination waiting for the unload event)

*sob*
(Assignee)

Comment 10

5 years ago
Created attachment 670979 [details] [diff] [review]
New saner version

It appears many of those assertions were caused by the patch adding XMLHttpRequest to the sandbox twice - once the "normal" version and once the unwrapped version.  By not doing that and by adding WebSocket to the "whitelist" those assertions and test failures go away.

I figured that instead of the xpcom shutdown observer, we might as well *always* add an unload event handler that does the cleanup, and terminate simply removes the node from the document triggering that unload.  App shutdown should also cause the unload to fire, meaning it gets cleaned up in that case too.  By referencing |sandbox| in the unload handler, we no longer need to store the sandbox in |this|.

Assuming this looks OK, we should file another bug for the need to unwrap WebSocket like we already have done for XMLHttpRequest.

Like my previous version of the patch, this also includes a fix for a legitimate test problem that wasn't seen before due to the timing characteristics of the previous shutdown - this shutdown process actually happens a little later after the worker window has completely finished doing its thing and unload fires.
Attachment #660789 - Attachment is obsolete: true
Attachment #670701 - Attachment is obsolete: true
Attachment #670979 - Flags: feedback?(gavin.sharp)
Attachment #670979 - Flags: feedback?(felipc)
Comment on attachment 670979 [details] [diff] [review]
New saner version

>diff --git a/toolkit/components/social/FrameWorker.jsm b/toolkit/components/social/FrameWorker.jsm

>+    workerWindow.addEventListener("uload", function unloadListener() {
>+      workerWindow.removeEventListener("iload", unloadListener);

uload/iload? o_O

>   terminate: function terminate() {

>+    if (!this.url in workerCache) {

"!" binds tighter than "in", so you're testing "false in workerCache" here AFAICT (i.e. you'll never hit this).

I wonder whether moving the destruction of the ports to during unload cause trouble for any port-close handlers that try to make use of window-dependent APIs?
(Assignee)

Comment 12

5 years ago
Created attachment 670994 [details] [diff] [review]
note to self - you still can't rush patches on Saturdays!

Fixed the obvious typos in the event handler and the "!" thang...
Attachment #670979 - Attachment is obsolete: true
Attachment #670979 - Flags: feedback?(gavin.sharp)
Attachment #670979 - Flags: feedback?(felipc)
Attachment #670994 - Flags: feedback?(gavin.sharp)
Attachment #670994 - Flags: feedback?(felipc)
(Assignee)

Comment 13

5 years ago
FWIW, I just built 17 from the beta tree and without this patch a shutdown reports:

 => mFreeCount:           77674  --  LEAKED 2799 !!!
 => mAdoptFreeCount:       4708  --  LEAKED 4 !!!

and with the patch it reports no leaks.  On trunk, no string leaks are reported either way.  So this patch has a noticeable effect on 17 and as such may still be a candidate for uplift.
(Assignee)

Comment 14

5 years ago
Created attachment 671331 [details] [diff] [review]
Rebased and new comments re this.sandbox

Rebased to avoid a conflict on m-c and as discussed on IRC, added a comment about why storing the sandbox in |this| is no longer necessary.
Attachment #670994 - Attachment is obsolete: true
Attachment #670994 - Flags: feedback?(gavin.sharp)
Attachment #670994 - Flags: feedback?(felipc)
Attachment #671331 - Flags: review?(gavin.sharp)
Attachment #671331 - Flags: review?(felipc)
Comment on attachment 671331 [details] [diff] [review]
Rebased and new comments re this.sandbox

Review of attachment 671331 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/social/FrameWorker.jsm
@@ +118,4 @@
>                       'location'];
>      workerAPI.forEach(function(fn) {
>        try {
> +        // Bug 756173 - XHR has issues in a sandbox and need

update this bug number with the new filed about the issue: bug 798660. I'll update that bug to also mention WebSocket

@@ +252,5 @@
> +          Cu.reportError("FrameWorker: failed to close pending port. " + ex);
> +        }
> +      }
> +      worker.pendingPorts = null;
> +  

nit: whitespace on blank line
Attachment #671331 - Flags: review?(felipc) → review+
Comment on attachment 671331 [details] [diff] [review]
Rebased and new comments re this.sandbox

Looks good to me.
Attachment #671331 - Flags: review?(gavin.sharp) → feedback+
(Assignee)

Comment 17

5 years ago
This patch is failing on try with:

TEST-INFO | chrome://mochitests/content/browser/toolkit/components/social/test/browser/browser_workerAPI.js | Console message: [JavaScript Error: "TypeError: pending is null" {file: "resource://gre/modules/FrameWorker.jsm" line: 215}]
TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/toolkit/components/social/test/browser/browser_workerAPI.js | Test timed out

https://tbpl.mozilla.org/php/getParsedLog.php?id=16183641&tree=Try

which implies we are executing a 'load' handler after the 'unload' handler has set the array to null.  While the patch could avoid this by setting the arrays back to an empty array instead of null, that smells too much - we probably should work out why...
(Assignee)

Comment 18

5 years ago
Created attachment 672616 [details] [diff] [review]
Patch changes for reload

The reload of workers broke the original patch.  Attaching modifications to the original patch necessary to get the nukeSandbox changes working with reload.

Note this is *on top of* the other patch - I'll fold them into 1 for checkin, but figured this might make the reload specific changes easier to review.

Ideally Shane *and* Felipe could review - Shane as he understands the reload code best, and Felipe so I can carry the original r+ forward.

Try at https://tbpl.mozilla.org/?tree=Try&rev=bcd7c250ba0a
Attachment #672616 - Flags: review?(mixedpuppy)
Attachment #672616 - Flags: review?(felipc)
Attachment #672616 - Flags: review?(mixedpuppy) → review+
Attachment #672616 - Flags: review?(felipc) → review+
(Assignee)

Comment 19

5 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/b872b207397d
(Assignee)

Updated

5 years ago
Flags: in-testsuite+
https://hg.mozilla.org/mozilla-central/rev/b872b207397d
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 19
Assignee: felipc → mhammond
Created attachment 675253 [details] [diff] [review]
Rollup patch for aurora/beta

[Approval Request Comment]
Bug caused by (feature/regressing bug #): social api
User impact if declined: better clean-up on shutdown
Testing completed (on m-c, etc.): landed on m-c last week
Risk to taking this patch (and alternatives if risky): touches social-only worker reload/shutdown
String or UUID changes made by this patch: none
Attachment #675253 - Flags: review+
Attachment #675253 - Flags: approval-mozilla-beta?
Attachment #675253 - Flags: approval-mozilla-aurora?
Comment on attachment 675253 [details] [diff] [review]
Rollup patch for aurora/beta

needed for social-api, been on m-c a week, approving.
Attachment #675253 - Flags: approval-mozilla-beta?
Attachment #675253 - Flags: approval-mozilla-beta+
Attachment #675253 - Flags: approval-mozilla-aurora?
Attachment #675253 - Flags: approval-mozilla-aurora+
https://hg.mozilla.org/releases/mozilla-aurora/rev/0d836526cff7
https://hg.mozilla.org/releases/mozilla-beta/rev/6a3c51c617a5
status-firefox17: --- → fixed
status-firefox18: --- → fixed
status-firefox19: --- → fixed
Marking verified fixed since this landed with tests. Please add verifyme keyword if there's something you need QA to test.
Status: RESOLVED → VERIFIED
status-firefox17: fixed → verified
status-firefox18: fixed → verified
status-firefox19: fixed → verified
You need to log in before you can comment on or make changes to this bug.