Closed Bug 1195295 Opened 4 years ago Closed 4 years ago

content-sessionStore.js sends a sync message to the parent in SyncHandler.init

Categories

(Firefox :: Session Restore, defect, P1)

defect

Tracking

()

RESOLVED FIXED
Firefox 47
Tracking Status
firefox45 --- wontfix
firefox46 --- wontfix
firefox47 --- fixed

People

(Reporter: mconley, Assigned: mconley)

References

(Blocks 1 open bug)

Details

Attachments

(5 files)

Here is a profile:

http://people.mozilla.org/~bgirard/cleopatra/#report=48152b29cb5d69ce48546326cd96d55b92b241af&filter=[{%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A3394296,%22end%22%3A3394941}]&selection=0,5575,5576,5577,5578,5291,5579,8,9,10,11,12,13,14,15,16,17,20,21,22,23,24,25,26,726,727,728,729,730,731,5621,6793,6374,6794,1497,1498,1499,80,6831,6832,81,113,79,134,135,168,6815,5778,5779,5779,6821,6822,4289,4290,4167,4168

If you filter on content-sessionStore.js for the content process, you can see a sync message going up to the parent process in SyncHandler.init. According to the comment above SyncHandler, this will probably go away once we get async APIs for window and tab closing. Since I'm pretty sure tab closing no longer uses CPOWs, I guess we're just waiting on window closing, which is bug 1171708.
With CPOWs removed on window closing (bug 1171708) and shutdown (bug 1177310) we don't need the SyncHandler anymore, correct. I'm looking forward to that day :)
Assignee: nobody → mconley
Priority: -- → P1
Status: NEW → ASSIGNED
Bug 1195295 - Remove SessionStore's SyncHandler since all tab and window flushing is now async. r?ttaubert
Attachment #8694367 - Flags: review?(ttaubert)
Comment on attachment 8694367 [details]
MozReview Request: Bug 1195295 - Remove SessionStore's SyncHandler since all tab and window flushing is now async. r=ttaubert

https://reviewboard.mozilla.org/r/26733/#review24183

::: browser/components/sessionstore/TabState.jsm:62
(Diff revision 1)
> +  browserReady(browser) {
> +    // Even if we've already got the browser in _latestMessageID,
> +    // we reset the latest ID to 0 because the browser has reloaded
> +    // the content script, and therefore reset its own internal notion
> +    // of the most recently sent update ID.
>      this._latestMessageID.set(browser.permanentKey, 0);
>    },

We don't need to keep this and can get rid of |_latestMessageID|.

(more on that below)

::: browser/components/sessionstore/TabState.jsm:74
(Diff revision 1)
> +    if (!this._latestMessageID.has(browser.permanentKey)) {
> +      // We've never seen this browser before, so initialize the latest
> +      // seen ID at 0 so that we can ensure that we only ever handle
> +      // updates with increasing IDs.
> +      this._latestMessageID.set(browser.permanentKey, 0);
> +    }
> +
>      // Only ever process messages that have an ID higher than the last one we
>      // saw. This ensures we don't use stale data that has already been received
>      // synchronously.
>      if (id > this._latestMessageID.get(browser.permanentKey)) {
>        this._latestMessageID.set(browser.permanentKey, id);
>        TabStateCache.update(browser, data);
>      }

We don't actually need this with the MessageQueue simplifications, and we can remove the check in |update()| now too. Just always call |TabStateCache.update()| directly.

::: browser/components/sessionstore/content/content-sessionStore.js
(Diff revision 1)
> -  flush: function (id) {
> -    MessageQueue.flush(id);
> -  },

MessageQueue.flush() can be removed, nothing uses it anymore.

MessageQueue.send() doesn't need the |sync| and |id| options anymore. It should never use |sendRpcMessage|.

The whole MessageQueue can be simplified as we don't need to account for sync flushes while we're waiting for an async update message to arrive. We don't need |_lastUpdated| anymore. |send()| simply iterates over |_data| and clears it. Guess that means we can remove |_id| as well.
Attachment #8694367 - Flags: review?(ttaubert)
Comment on attachment 8694367 [details]
MozReview Request: Bug 1195295 - Remove SessionStore's SyncHandler since all tab and window flushing is now async. r=ttaubert

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/26733/diff/1-2/
Attachment #8694367 - Flags: review?(ttaubert)
Comment on attachment 8694367 [details]
MozReview Request: Bug 1195295 - Remove SessionStore's SyncHandler since all tab and window flushing is now async. r=ttaubert

https://reviewboard.mozilla.org/r/26733/#review24411

LGTM if try agrees.

::: browser/components/sessionstore/content/content-sessionStore.js
(Diff revision 2)
> -    let sync = options && options.sync;
> -    let startID = (options && options.id) || this._id;

The function comment needs to be adjusted, |sync| and |id| references should be removed.
Attachment #8694367 - Flags: review?(ttaubert) → review+
Comment on attachment 8694367 [details]
MozReview Request: Bug 1195295 - Remove SessionStore's SyncHandler since all tab and window flushing is now async. r=ttaubert

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/26733/diff/2-3/
Attachment #8694367 - Attachment description: MozReview Request: Bug 1195295 - Remove SessionStore's SyncHandler since all tab and window flushing is now async. r?ttaubert → MozReview Request: Bug 1195295 - Remove SessionStore's SyncHandler since all tab and window flushing is now async. r=ttaubert
Comment on attachment 8694367 [details]
MozReview Request: Bug 1195295 - Remove SessionStore's SyncHandler since all tab and window flushing is now async. r=ttaubert

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/26733/diff/3-4/
Bug 1195295 - Fix browser_tab_detach_restore.js and browser_popupNotifications_4.js r?felipe

These tests seemed to rely on the synchronous infrastructure that the
now defunct SyncHandler for SessionStore provided. This moves them
to safer, more modern alternatives.
Attachment #8697176 - Flags: review?(felipc)
Attachment #8697176 - Flags: review?(felipc) → review+
Comment on attachment 8697176 [details]
MozReview Request: Bug 1195295 - Fix browser_tab_detach_restore.js and browser_popupNotifications_4.js r?felipe

https://reviewboard.mozilla.org/r/27527/#review24823
Comment on attachment 8694367 [details]
MozReview Request: Bug 1195295 - Remove SessionStore's SyncHandler since all tab and window flushing is now async. r=ttaubert

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/26733/diff/4-5/
Comment on attachment 8697176 [details]
MozReview Request: Bug 1195295 - Fix browser_tab_detach_restore.js and browser_popupNotifications_4.js r?felipe

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/27527/diff/1-2/
Comment on attachment 8697424 [details]
MozReview Request: Bug 1195295 - Remove race-y SDK test that relied on slow content-sessionStore.js load. r=Mossop

Mossop r+'d the SDK test removal over IRC.
Attachment #8697424 - Flags: review+
https://hg.mozilla.org/integration/fx-team/rev/18eec0849a1f53377704d2602593561ef631a50a
Bug 1195295 - Remove SessionStore's SyncHandler since all tab and window flushing is now async. r=ttaubert

https://hg.mozilla.org/integration/fx-team/rev/5d5885a15ca7ec5649ab207e39a1ba589d3985bb
Bug 1195295 - Fix browser_tab_detach_restore.js and browser_popupNotifications_4.js r=felipe

https://hg.mozilla.org/integration/fx-team/rev/05b98fe2ac4f8d4aade367623f2931b652f543d0
Bug 1195295 - Remove race-y SDK test that relied on slow content-sessionStore.js load. r=Mossop
https://hg.mozilla.org/integration/fx-team/rev/fff7c04182f9f5ca24e9656e1458efa100b016cc
Bug 1195295 - Follow-up: remove other race-y SDK test that I failed to remove in 05b98fe2ac4f. r=Mossop
Huh. Alright. I'll poke at this in bug 1220517 - thanks for letting me know, philor.
Flags: needinfo?(mconley)
Backing out commit 18eec0849a1f fixes the permaorange on Aurora (bug 1234697):
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=557502e5b88a
I backed out 18eec0849a1f for causing intermittent leaks. If it sticks, I'll get Aurora approval for backing it out there.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Approval Request Comment
[Feature/regressing bug #]: this bug
[User impact if declined]: I'm not sure what the actual user impact would be, but this patch is causing a permanent orange on TreeHerder.
[Describe test coverage new/current, TreeHerder]: There are tests for this feature. I've already backed this patch out on inbound, though the tests haven't finished yet, so I'm not sure if it will stick.
[Risks and why]: It will reintroduce whatever session store issue this was supposed to fix, but this bug was around for a while before it was fixed so hopefully it isn't too bad. It is also possible some other changes depend on this, but hopefully nothing too bad if my landing is okay on inbound.
[String/UUID change made/needed]: none
Attachment #8701604 - Flags: approval-mozilla-aurora?
Comment on attachment 8701604 [details] [diff] [review]
back out 18eec0849a1f for causing permaleaks on Aurora

Apparently the chunking works differently on Aurora, so I'm not 100% sure yet this fixes the leak. I'll reland or something if I was wrong, but I'm clearing the Aurora flag for now.

backout: https://hg.mozilla.org/integration/mozilla-inbound/rev/76d888a191cf
Attachment #8701604 - Flags: approval-mozilla-aurora?
Flags: needinfo?(mconley)
Comment on attachment 8701604 [details] [diff] [review]
back out 18eec0849a1f for causing permaleaks on Aurora

Approval Request Comment (see comment 25)
Attachment #8701604 - Flags: approval-mozilla-aurora?
I triggered bc7 for my Aurora try run, and it is green, so hopefully the backout fixes bug 1234697.
bummer we had to back this out- we had a perf win in the linux64 sessionrestore tests which showed up as a regression in talos when we backed this out.  maybe we can fix this later and get the win back:)
Comment on attachment 8701604 [details] [diff] [review]
back out 18eec0849a1f for causing permaleaks on Aurora

ok, let's try that. thanks
Attachment #8701604 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Backout needed on aurora.
Keywords: checkin-needed
I'll wait for :ting to sort out things in bug 1220517 before attempting to reland this.
Flags: needinfo?(mconley)
(In reply to Joel Maher (:jmaher) from comment #29)
> bummer we had to back this out- we had a perf win in the linux64
> sessionrestore tests which showed up as a regression in talos when we backed
> this out.  maybe we can fix this later and get the win back:)

This is great to hear. We expected that and that was partly the reason we wanted to remove it. Still great to confirm :)
Target Milestone: Firefox 45 → ---
The reason why attachment 8694367 [details] causes leak easily was explained in bug 1220517 comment 53, which it creates a lot more CC logs. From Try [1] [2], the number of processing count of CCAnalyzer without and with the patch:

  Without:  03:01:17     INFO -  CCAnalyzer.processingCount=28998
  With:     07:15:57     INFO -  CCAnalyzer.processingCount=98373

I downloaded the cc/gc edge logs [3] from both Try runs. The only thing I can tell now is there're unknown edges to keep some DOMEventTargetHelper alive, which point to the functions in content-sessionStore.js when the patch is applied. For example:

  ting@ting-pc:~/w/fx/tools/heapgraph/cc$ python find_roots.py ~/Desktop/cc-edges.2168.log 133C1C20
  Parsing /home/ting/Desktop/cc-edges.2168.log. Done loading graph. Reversing graph. Done.

  19B73200 [DOMEventTargetHelper ]
      --[mAnonymousGlobalScopes[i]]--> 266E26E0 [JS Object (Block)]
      --[enclosing_environment]--> 266E2680 [JS Object (NonSyntacticVariablesObject)]
      --[MessageQueue]--> 133C1C20 [JS Object (Object)]

      Root 19B73200 is a ref counted object with 1 unknown edge(s).
      known edges:
         1EE80E50 [XPCWrappedNative (ContentFrameMessageManager)] --[mIdentity]--> 19B73200

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=aa52faa7074a
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=487f96a26495
[3] https://goo.gl/lhaB08
I think I know what's going on here - I think I'm supposed to be clearing the functions in the MessageQueue data map after a flush, and that stuff got removed.

Let me see if that clears the orange. Gimme a few minutes to rebase and push.
I'm afraid that theory was bunk. Exploring other scenarios now.
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #39)
> I'm afraid that theory was bunk. Exploring other scenarios now.

I tried [1] to clear |_data| also in the early return [2] and I can't see the functions from content-sessionStore.js, but CCAnalyzer processed more entries (>100000).

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=8709210e62c0&selectedJob=16035233
[2] https://dxr.mozilla.org/mozilla-central/source/browser/components/sessionstore/content/content-sessionStore.js#754
(In reply to Ting-Yu Chou [:ting] from comment #36)
> The reason why attachment 8694367 [details] causes leak easily was explained
> in bug 1220517 comment 53, which it creates a lot more CC logs. From Try [1]
> [2], the number of processing count of CCAnalyzer without and with the patch:
> 
>   Without:  03:01:17     INFO -  CCAnalyzer.processingCount=28998
>   With:     07:15:57     INFO -  CCAnalyzer.processingCount=98373
> 
> I downloaded the cc/gc edge logs [3] from both Try runs. The only thing I
> can tell now is there're unknown edges to keep some DOMEventTargetHelper
> alive, which point to the functions in content-sessionStore.js when the
> patch is applied. For example:
> 
>   ting@ting-pc:~/w/fx/tools/heapgraph/cc$ python find_roots.py
> ~/Desktop/cc-edges.2168.log 133C1C20
>   Parsing /home/ting/Desktop/cc-edges.2168.log. Done loading graph.
> Reversing graph. Done.
> 
>   19B73200 [DOMEventTargetHelper ]
>       --[mAnonymousGlobalScopes[i]]--> 266E26E0 [JS Object (Block)]
>       --[enclosing_environment]--> 266E2680 [JS Object
> (NonSyntacticVariablesObject)]
>       --[MessageQueue]--> 133C1C20 [JS Object (Object)]
> 
>       Root 19B73200 is a ref counted object with 1 unknown edge(s).
>       known edges:
>          1EE80E50 [XPCWrappedNative (ContentFrameMessageManager)]
> --[mIdentity]--> 19B73200
> 
> [1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=aa52faa7074a
> [2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=487f96a26495
> [3] https://goo.gl/lhaB08

Andrew, do you think you could look at these logs and try to figure out what's going on?
Flags: needinfo?(continuation)
mconley looked at them a little bit with some of my guidance. I'm not sure if he came to any conclusions. I'll try to look later in the week.
I have been poking at this without too much success, though I think I've made some maybe interesting discoveries.

I didn't get too far examining the GC / CC logs, primarily because I wasn't exactly sure what I was looking for, patterns or otherwise.

What I ended up doing was breaking up my patch into 5 atomic chunks to see which one would cause the orange to rear its head. After breaking up the chunks, I landed them on try, one on top of the other, and here's what I got:

Chunk 1: https://treeherder.mozilla.org/#/jobs?repo=try&revision=bd0c471ec83e
Chunk 2: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e1c170a77644 CULPRIT
Chunk 3: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a0f9c2f1ec43
Chunk 4: https://treeherder.mozilla.org/#/jobs?repo=try&revision=54e9ee9cbf6f
Chunk 5: https://treeherder.mozilla.org/#/jobs?repo=try&revision=54e9ee9cbf6f

So the second chunk, when overlaid on top of the first, seems to introduce the issue. I thought that was useful.

Then, I started to break Chunk 2 down into smaller bits, again overlaying them on top of one another to see which chunk causes the orange to appear.

Remove _latestMessageID: https://treeherder.mozilla.org/#/jobs?repo=try&revision=61c5c0b25240
Make setSyncHandler a no-op: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1dc6b98cdfd6
Don’t respond to setupSyncHandler message: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ba95840be976
Don’t listen for setupSyncHandler on closed:  https://treeherder.mozilla.org/#/jobs?repo=try&revision=67bc80826d4d
Don’t listen for setupSyncHandler at all: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f129ec2ca7f6

This one is interesting, because making setSyncHandler a no-op seems to make the orange appear, but _not responding_ to the SessionStore:setupSyncHandler message seems to make the orange _much_ more frequent, though I'm not certain why.

So I feel like I'm narrowing in on it, but I honestly have _no idea_ how these little chunks might be introducing a need for so much cycle collection.
The test browser/base/content/test/general/browser_syncui.js can reproduce the issue (comment 36) consistently.
Doing a try build with MOZ_CC_ALL_TRACES=1 and MOZ_CC_LOG_ALL=1 resulted in green tests. :/

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a1d8688865fd&selectedJob=16437082
I looked at this a little bit. The DETH in comment 36 is whatever is implementing ContentFrameMessageManager. You can tell because an XPCWN (ContentFrameMessageManager) has an mIdentity field that points to it, so that's the underlying native for the XPCWN. I think. I was looking at a different set of objects, JS Functions for SpecialPowersAPI.prototype.* methods, but they were also being held alive by ContentFrameMessageManager, through mAnonymousGlobalScopes. I'm not sure why all of this JS is being entrained there.
Flags: needinfo?(continuation)
So one thing I've noticed looking at some of my earlier try runs, is that there are a lot more occurences of Promises throwing with messages like "A coding exception was thrown in a Promise resolution callback." in my green runs than there are in my orange runs.

I found some of the tests that were triggering those Promise exceptions, and disabled them ham-fistedly to see if that'd make any kind of impact.

Here's a try build with just one of the tests disabled:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=868c894fd7a4

I might be imagining it, but it looks like things have gotten slightly better. I wonder if somehow the mysterious anonymous scopes are these in-flight Promises that haven't yet resolved or rejected.

I've disabled more Promise-throwing tests in this push:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=1536ac8ea9f4
From ting's log in comment 36, without the patch, there are 52 content script scopes (I think that is what mAnonymousGlobalScopes is). With the patch, there are 87. It looks like they are never cleared until the message manager goes away.
I'm trying to write a patch that will let us see what those content scripts are.
Hey ting, I've been trying to reproduce this locally, and you mentioned in comment 44 that browser_syncui.js would do the trick, but I'm not seeing the nsBaseAppShell leak on my Windows machine in a debug build when running that test.

Or did you mean that when we run that test, we seem to increase the amount of cycle collection that we do?
Flags: needinfo?(janus926)
(In reply to Andrew McCreight [:mccr8] from comment #53)
> From ting's log in comment 36, without the patch, there are 52 content
> script scopes (I think that is what mAnonymousGlobalScopes is). With the
> patch, there are 87. It looks like they are never cleared until the message
> manager goes away.

It is odd to see more anon global scopes, but it is expected that whatever scopes we have, stay there until the end.
Might be useful to log in nsMessageManagerScriptExecutor::LoadScriptInternal which all scripts get loaded to anon scopes.
I applied the first two patches from comment 43 and ran tests:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a5f6f4f7690d (without patches from this bug)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b55a77320e3a (with patches)

A diff of the frame scripts seen by the cycle collector between two bc7 runs:

--- /dev/fd/63	2016-02-09 19:24:31.065413771 -0500
+++ /dev/fd/62	2016-02-09 19:24:31.105414148 -0500
@@ -1,10 +1,16 @@
+chrome://browser/content/content.js
 chrome://browser/content/content-sessionStore.js
 chrome://browser/content/content-UITour.js
+chrome://browser/content/tab-content.js
+chrome://global/content/browser-content.js
 chrome://global/content/manifestMessages.js
 chrome://mochikit/content/tests/BrowserTestUtils/content-task.js
 chrome://mochikit/content/tests/BrowserTestUtils/content-utils.js
 chrome://mochikit/content/tests/SimpleTest/AsyncUtilsContent.js
 chrome://satchel/content/formSubmitListener.js
+chrome://specialpowers/content/MozillaLogger.js
+chrome://specialpowers/content/specialpowersAPI.js
+chrome://specialpowers/content/specialpowers.js
 data:,(function
 mGlobal
 mListenerManager

(The data:,(function URL comes from browser_windowactivation.js, in case you were wondering.)

So it looks like we are entraining some fairly significant scripts with the patches applied.

Apologies for the dumb question, but why are we seeing this problem in non-e10s mochitests?  Everything should be in the same process in that case, right...?
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #55)
> Or did you mean that when we run that test, we seem to increase the amount
> of cycle collection that we do?

Yes, this is what I meant.
Flags: needinfo?(janus926)
(In reply to Nathan Froyd [:froydnj] from comment #57)
> 
> A diff of the frame scripts seen by the cycle collector between two bc7 runs:
Perhaps CC isn't too interesting here, but comment 53. Why we end up getting so much more frame scripts with the patch?
And which ones (urls) we have with the patch?

> Apologies for the dumb question, but why are we seeing this problem in
> non-e10s mochitests?  Everything should be in the same process in that case,
> right...?
Not sure if I understand the question, but...
non-e10s uses in-process message managers so that the setup can be as similar as possible to the e10s case.
(In reply to Olli Pettay [:smaug] (high review load) from comment #59)
> (In reply to Nathan Froyd [:froydnj] from comment #57)
> > 
> > A diff of the frame scripts seen by the cycle collector between two bc7 runs:
> Perhaps CC isn't too interesting here, but comment 53. Why we end up getting
> so much more frame scripts with the patch?
> And which ones (urls) we have with the patch?

All the URLs present in that diff are the ones we load; we load them multiple times, so that accounts for the 52 vs. 87 number in comment 53.  To be unambiguously clear, we're only loading:

chrome://browser/content/content-sessionStore.js
chrome://browser/content/content-UITour.js
chrome://global/content/manifestMessages.js
chrome://mochikit/content/tests/BrowserTestUtils/content-task.js
chrome://mochikit/content/tests/BrowserTestUtils/content-utils.js
chrome://mochikit/content/tests/SimpleTest/AsyncUtilsContent.js
chrome://satchel/content/formSubmitListener.js
data:,(function

without the patch and:

chrome://browser/content/content.js
chrome://browser/content/content-sessionStore.js
chrome://browser/content/content-UITour.js
chrome://browser/content/tab-content.js
chrome://global/content/browser-content.js
chrome://global/content/manifestMessages.js
chrome://mochikit/content/tests/BrowserTestUtils/content-task.js
chrome://mochikit/content/tests/BrowserTestUtils/content-utils.js
chrome://mochikit/content/tests/SimpleTest/AsyncUtilsContent.js
chrome://satchel/content/formSubmitListener.js
chrome://specialpowers/content/MozillaLogger.js
chrome://specialpowers/content/specialpowersAPI.js
chrome://specialpowers/content/specialpowers.js
data:,(function

with the patch.
The ContentFrameMessageManager was marked gray during GC, but without the patch, it will be unmarked to black when FixWeakMappingGrayBitsTracer::trace() is called for the WeakMap |_syncHandlers|:

  https://dxr.mozilla.org/mozilla-central/source/xpcom/base/nsCycleCollector.cpp#3791
  https://dxr.mozilla.org/mozilla-central/source/xpcom/base/nsCycleCollector.cpp#3478

which winds up calling JS::UnmarkGrayGCThingRecursively() and unmark everything SyncHandler instances reference to:

  https://dxr.mozilla.org/mozilla-central/source/xpcom/base/CycleCollectedJSRuntime.cpp#269

This is why there're much less CC participants without the patch. But I don't know should CC check all these things when the patch is applied, it's a lot of them.

So it's not there're more anon global scopes, it's just they are marked gray or black. For example, from the log of comment 36, without the patch:

cc-edges.2548.log:
  18AAA600 [rc=1] DOMEventTargetHelper 
  > 18083900 mAnonymousGlobalScopes[i]
  > 180B4340 mAnonymousGlobalScopes[i]
  > 18C0AEA0 mAnonymousGlobalScopes[i]
  > 18C2C020 mAnonymousGlobalScopes[i]
  > 18C2CE40 mAnonymousGlobalScopes[i]
  > 194697E0 mAnonymousGlobalScopes[i]
  > 1B153C00 mAnonymousGlobalScopes[i]

gc-edges.2548.log:
  18083860 B mAnonymousGlobalScopes[i]
  18083900 G mAnonymousGlobalScopes[i]
  18083980 B mAnonymousGlobalScopes[i]
  18083F00 B mAnonymousGlobalScopes[i]
  180B41C0 B mAnonymousGlobalScopes[i]
  180B4340 G mAnonymousGlobalScopes[i]
  183D7940 B mAnonymousGlobalScopes[i]
  18C0A360 B mAnonymousGlobalScopes[i]
  18C0AEA0 G mAnonymousGlobalScopes[i]
  18C2C020 G mAnonymousGlobalScopes[i]
  18C2CE40 G mAnonymousGlobalScopes[i]
  194697E0 G mAnonymousGlobalScopes[i]
  1B153C00 G mAnonymousGlobalScopes[i]
That's very interesting! Olli, is there something that would be holding alive these message managers we could check, and then mark these mAnonymousGlobalScopes black?
Flags: needinfo?(bugs)
We do call nsInProcessTabChildGlobal::MarkForCC() normally and that should optimize out mAnonymousGlobalScopes from CC graph.
Are we possibly leaking a nsInProcessTabChildGlobal temporarily?

If anyone has STR (not tryserver run) to get mAnonymousGlobalScopes showing up in the graph, I'd be happy to look at it.

But I'll upload a patch which _might_ help here anyhow.
Flags: needinfo?(bugs)
(In reply to Olli Pettay [:smaug] (high review load) from comment #63)
> But I'll upload a patch which _might_ help here anyhow.

I managed to upload it to wrong bug
bug 1220517#c78
(In reply to Olli Pettay [:smaug] (high review load) from comment #63)
> Are we possibly leaking a nsInProcessTabChildGlobal temporarily?

I am also wondering.

When is the expected timing of destroying nsInProcessTabChildGlobal after calling nsInProcessTabChildGlobal::Disconnect()? There're some survived 3 rounds of GC/CC from CCAnalyzer.
Just doing some guesswork here - I think this is the lifetime of an nsInProcessTabChildGlobal -

Creation: https://dxr.mozilla.org/mozilla-central/rev/6ea654cad929c9bedd8a4161a182b6189fbeae6a/dom/base/nsFrameLoader.cpp#2607

When the FrameLoader is destroyed, and once all messages from the message manager have been dispatched / serviced, we enter here:

https://dxr.mozilla.org/mozilla-central/rev/6ea654cad929c9bedd8a4161a182b6189fbeae6a/dom/base/nsFrameLoader.cpp#1522

Which calls ::Disconnect.

Then, _I think_, when CC starts, one of the phases it goes through involves marking message managers for CC:

https://dxr.mozilla.org/mozilla-central/rev/6ea654cad929c9bedd8a4161a182b6189fbeae6a/dom/base/nsCCUncollectableMarker.cpp#447

It might be worth logging how many times we call nsFrameLoader::Destroy, nsInProcessTabChildGlobal::Disconnect, and how many times we call the nsInProcessTabChildGlobal destructor...
(In reply to Olli Pettay [:smaug] (high review load) from comment #63)
> If anyone has STR (not tryserver run) to get mAnonymousGlobalScopes showing
> up in the graph, I'd be happy to look at it.

nsInProcessTabChildGlobal::MarkForCC() does not get called for every instance, I can see it by running:

  ./mach mochitest browser/base/content/test/general/browser_syncui.js

:mconley's patch here is not required.

Note some instances (and nsFrameLoader instances) are destructed very late when nsCycleCollector::Shutdown(). I don't know if it is expected.
That is not expected.

nsInProcessTabChildGlobal::MarkForCC() is called for those objects which are still bound to the frameloader, in other words when the relevant xul:browser is still in DOM tree.

If we have nsInProcessTabChildGlobal after that in CC graph, it means some kind of leak.
Good thing is that apparently those are released during shutdown so analyzing the leak should be easier. Need to track who is calling Release.
Does it matter that nsInProcessTabChildGlobal's CC traverse doesn't call NS_IMPL_CYCLE_COLLECTION_TRAVERSE_SCRIPT_OBJECTS?  I don't think adding that in affects mconley's patches, but it might make debugging easier...?
DOMEventTargetHelper has NS_IMPL_CYCLE_COLLECTION_TRAVERSE_SCRIPT_OBJECTS
(In reply to Olli Pettay [:smaug] (high review load) from comment #69)
> nsInProcessTabChildGlobal::MarkForCC() is called for those objects which are
> still bound to the frameloader, in other words when the relevant xul:browser
> is still in DOM tree.
I could clarify this. MarkForCC does not happen during shutdown CC, even if InProcessTabChildGlobal was in DOM tree.
Also, at some point CC needs to traverse/unlink InProcessTabChildGlobal, so it is expected to see those objects
in CC graph after closing a tab. But after that they should get deleted.


Is there a testcase showing that after closing a window or a tab an instance of InProcessTabChildGlobal stays in the CC graph?
And if there is, a way to debug that is to capture stack traces for each nsInProcessTabChildGlobal::Release calls. Some of them probably shows what is keeping the object alive too long.
(In reply to Olli Pettay [:smaug] (HIGH REVIEW LOAD) from comment #72)
> (In reply to Olli Pettay [:smaug] (high review load) from comment #69)
> > nsInProcessTabChildGlobal::MarkForCC() is called for those objects which are
> > still bound to the frameloader, in other words when the relevant xul:browser
> > is still in DOM tree.
> I could clarify this. MarkForCC does not happen during shutdown CC, even if
> InProcessTabChildGlobal was in DOM tree.
> Also, at some point CC needs to traverse/unlink InProcessTabChildGlobal, so
> it is expected to see those objects
> in CC graph after closing a tab. But after that they should get deleted.
> 
> 
> Is there a testcase showing that after closing a window or a tab an instance
> of InProcessTabChildGlobal stays in the CC graph?

I believe so - according to ting, that's browser/base/content/test/general/browser_syncui.js (with or without my patch).
How do I run that so that browser isn't shutdown automatically, but the relevant tabs have been removed?

./mach mochitest browser/base/content/test/general/browser_syncui.js just runs the test and closes the browser.
Ok, 

#0  0x00007f5e3339a6a0 in nsInProcessTabChildGlobal::Release() (this=0x7f5e062954f0) at /home/smaug/mozilla/hg/inbound2/dom/base/nsInProcessTabChildGlobal.cpp:177
#1  0x00007f5e3339a6f9 in non-virtual thunk to nsInProcessTabChildGlobal::Release() () at /home/smaug/mozilla/hg/inbound2/dom/base/nsInProcessTabChildGlobal.cpp:175
#2  0x00007f5e3337de6d in nsCOMPtr<nsIInProcessContentFrameMessageManager>::assign_assuming_AddRef(nsIInProcessContentFrameMessageManager*) (this=0x7f5e061eea08, aNewPtr=0x0)
    at /home/smaug/mozilla/hg/inbound2/ff_build/dist/include/nsCOMPtr.h:380
#3  0x00007f5e3337dd43 in nsCOMPtr<nsIInProcessContentFrameMessageManager>::assign_with_AddRef(nsISupports*) (this=0x7f5e061eea08, aRawPtr=0x0)
    at /home/smaug/mozilla/hg/inbound2/ff_build/dist/include/nsCOMPtr.h:1066
#4  0x00007f5e333529ef in nsCOMPtr<nsIInProcessContentFrameMessageManager>::operator=(nsIInProcessContentFrameMessageManager*) (this=0x7f5e061eea08, aRhs=0x0)
    at /home/smaug/mozilla/hg/inbound2/ff_build/dist/include/nsCOMPtr.h:578
#5  0x00007f5e33350a79 in ImplCycleCollectionUnlink<nsIInProcessContentFrameMessageManager>(nsCOMPtr<nsIInProcessContentFrameMessageManager>&) (aField=...)
    at /home/smaug/mozilla/hg/inbound2/ff_build/dist/include/nsCOMPtr.h:1045
#6  0x00007f5e33324a3d in nsFrameLoader::cycleCollection::Unlink(void*) (this=0x7f5e3b0034e8 <nsFrameLoader::_cycleCollectorGlobal>, p=0x7f5e061ee9d0)
    at /home/smaug/mozilla/hg/inbound2/dom/base/nsFrameLoader.cpp:131

looks like a problematic release call. Why do we have frameloader pointing to the tabchildglobal still so late.
My guess is currently that testing framework, or test does something unexpected so that the MarkForCC optimization can't kick in.
Depends on: 1249226
I've spotted two places leak nsFrameLoader, not sure yet are they all the leaks. Separated this to bug 1249226.
SUCCESS!

Along with the patches that smaug landed on inbound earlier today for bug 1249439 and bug 1249451, as well as this patch he supplied over IRC:

https://hg.mozilla.org/try/rev/43ea3c4fdb9d

We've got ourselves a green try build!

https://treeherder.mozilla.org/#/jobs?repo=try&revision=73570b27131d

Huzzah! Wow! Thanks so much everybody!

I'm doing a second try build[1] without the patch from smaug just to see if the other patches that smaug landed today ended up fixing things on their own.

Hey smaug - assuming that the last patch (the one that clears the mMessageManager and mChildMessageManager) is helping to clear the orange here, is that patch for another bug, or should we make it part of this one?

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=db90213e0e76
ni? for comment 79.
Flags: needinfo?(bugs)
I think it would make sense to land that nsFrameLoader patch in a separate bug. It tends to simplify regression tracking when landing somewhat unrelated patches in separate bugs.
Flags: needinfo?(bugs)
Depends on: 1249795
Looking at the trybuild without bug 1249795, smaug's patch seems, at least by eyeballing, to make the orange far less likely. So I guess I'll wait for that to land before (finally) landing this thing.
if that is the case, we should definitely fix also bug 1249226. Not for this bug, but to just not leak frameloaders by keeping them alive in JS.
It is so easy to leak in JS :/
https://hg.mozilla.org/integration/fx-team/rev/7eb1382d6cef49a27f7ced970a2b135f0f20f014
Bug 1195295 - Remove SessionStore's SyncHandler since all tab and window flushing is now async. r=ttaubert
No longer depends on: 1249226
https://hg.mozilla.org/mozilla-central/rev/7eb1382d6cef
Status: REOPENED → RESOLVED
Closed: 4 years ago4 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 47
is there anything need to be done for 45/46 ?
Flags: needinfo?(sledru)
Flags: needinfo?(mconley)
Nothing needs to be done for 45. For 46, uplifting is contingent on the dependencies for the frameloader / browser leaks also being uplifted, which we might want for the leaks they address.

IMO, we might want to uplift the leak fixes just because they're leak fixes.

mccr8, as someone who deals with leaks pretty regularly, do you think it's worth uplifting those leak fixes to aurora? Or should we just let them ride the trains as per usual?

The actual removal of the SyncHandler is not something that I think warrants uplift.
Flags: needinfo?(mconley) → needinfo?(continuation)
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #88)
> mccr8, as someone who deals with leaks pretty regularly, do you think it's
> worth uplifting those leak fixes to aurora? Or should we just let them ride
> the trains as per usual?

These don't seem like very severe leaks, if they are only showing up in our testing, and only in a few tests. I don't entirely understand the patches, though, so the patch authors may disagree.
Flags: needinfo?(continuation)
bug 1249795 is a bit risky, or not risk-free at least, so give it at least couple of days testing time on Nightly. 
bug 1249439 should be quite safe.
"Nothing needs to be done for 45" I like when people are talking to me like that :p
Flags: needinfo?(sledru)
(In reply to Sylvestre Ledru [:sylvestre] from comment #91)
> "Nothing needs to be done for 45" I like when people are talking to me like
> that :p

:)

Note however that we _do_ still have that sync message still being sent in 45 and 46. I just don't think we want to uplift given the risk of the patches that this one relies on in order to cause oranges on CI.
You need to log in before you can comment on or make changes to this bug.