Fix leaks that occur when nsGlobalWindowOuter and nsDocShell have the same lifetime

RESOLVED FIXED in Firefox 68

Status

()

defect
P2
normal
RESOLVED FIXED
4 months ago
3 months ago

People

(Reporter: mccr8, Assigned: mccr8)

Tracking

(Blocks 1 bug)

unspecified
mozilla68
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Fission Milestone:M2, firefox68 fixed)

Details

(Whiteboard: [MemShrink:P2])

Attachments

(1 attachment)

For Fission, Nika said that it would be useful to tie together these objects more completely. They are conceptually the same thing, and they each hold a strong reference to the other, but they each have a method where the strong pointer to the other is nulled out, so their life times are not yet identical.

Nika looked at this before, but was stymied by leaks, so I said I'd take a look. She said that you can't just stop nulling out the strong refs in the close/detach methods, because various places use that to check the close/detached state, but that I could add some new strong references between the two, so I have done that.

There are a number of intermittent leaks that seem to fail fairly often, but I haven't managed to reproduce them locally yet in my non-optimized build.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4daf16356a2c59502a0a4e2c40e0779a7c566c40

A number of the leaking tests involve postMessage, so maybe I can look around that code to try to figure something out.

PostMessageEvent is the only thing outside of docshell and nsGlobalWindowOuter with a field of type RefPtr<nsGlobalWindowOuter>, which feels suspicious.

I'm able to reliably trigger this with:
./mach mochitest -f plain --headless --run-until-failure dom/tests/mochitest/whatwg/

The downside is that it takes 20 minutes to run. Running that directory once does not trigger the issue. I haven't tested it to see if you can run it a few times to trigger the issue.

So far what I have is that the windows that show up in the CC graph at shutdown are being held alive by a BrowsingContext, which is being held by child BrowsingContexts. I'm not sure exactly what is holding those contexts alive. There's a little cycle between the browsing context, a doc shell, and an outerwindow, but no other heap references. It is possible that there are stack references, or heap references that go away after the shutdown CC, so that they are not present for the very late in shutdown DMD heap dump. I found a cache of BrowsingContexts in a static variable, but that should go away before the shutdown CC.

--repeat 10 works instead of --run-until-failure, and takes about 10 minutes. (repeat 5 didn't work)

It turns out nothing is holding those window-browsingcontext-docshell clusters alive at shutdown. If I addref and release every outer window right before the shutdown CC (to make sure they are put into the purple buffer), then there are no leaks. I have no idea how that could happen, aside from bizarre scenarios where you move around refcounted pointers without touching the refcount using forget() or move(), but generally our code is so AddRef/Release heavy that's not possible.

I did a run that captured every CC, but that didn't turn up anything interesting. I'll try to think of some additional logging that might help.

Priority: -- → P2

I looked back at my run that captured every CC, and one of the leaking outer windows appears in only a single CC log. In that log, the BrowsingContext for the outer window has a refcount of 2: the window and the docshell. The three are held alive because of an extra reference to the docshell.

In the very next log, that BrowsingContext still has a refcount of 2, but now there is only one known reference: the docshell. The extra reference on the docshell has gone away. At shutdown, the three still point to each other in the ways that you would expect.

My theory is that this happens:
a) In between CCs, we decide that the outer window is definitely alive.
b) The outer window becomes garbage for some reason.
c) We run the CC. The window is not freed, because we marked it as being alive.
d) No non-garbage references to the window exist any more, so we never consider it again and we leak.

I think this is a bug in our CC optimizations: if we decide an object is definitely alive before the start of a CC, we need to add it to the purple buffer after the end of that CC, so we can guarantee that it will be considered the next time we CC. (I think it is not possible to "de-optimize" a window if it is, eg, no longer focused, because we decide other objects must also be alive once we have decided that a window or document is alive.)

I think this has not been a common problem in practice because this is not an issue for objects that have wrappers, as we always consider objects with wrappers in the CC. Boris said that he thought that an outer window would only not have a wrapper after close was called on it, and the current behavior is that we manually break the document-window cycle when close is called on it, so this won't be an issue before the patch here.

I hacked up a patch that created a kind of "uncollectable purple buffer" that is an array of strong references to anything we call MarkUncollectableForCCGeneration() on. That array gets cleared at the end of a CC, which adds them into the purple buffer. However, this did not fix the leak, so either something else is optimizing them out of the graph, or I'm wrong entirely. I'll try a broader approach that calls Root()/Unroot() on everything that we optimize out of the graph.

Fission Milestone: --- → M2

I tried my second idea for a new purple buffer, but that didn't fix the leak either.

Changing nsGlobalWindowOuter::IsBlackForCC() to always return false does fix the leak, so I still think I'm on the right track in terms of what is going wrong.

Whiteboard: [MemShrink]
Whiteboard: [MemShrink] → [MemShrink:P2]

I looked at this some more, and my previous analysis seems to be wrong. The reason that the outer windows are being kept out of the CC graph in the last CC that their BrowsingContext occurs in is actually that the window's wrapper is marked black.

There are a few odd things here.

  • The wrapper stays black until what I think is the first shutdown GC, when it becomes gray and I guess gets collected? But if the wrapper is gray, why is the C++ object for the outer window not in the CC graph?
  • Why is the C++ object for the outer window never showing up in any later CCs? If it still has a pointer to its reflector, it should at least be considered for inclusion. (I using a hacked up version of the can skip for the window that will still add the window to the graph, but just not free it, if the can skip would have returned true.)
  • In this one log I'm looking at, it seems like the path holding the window reflector alive in the CC where the outerwindow is in the present is:

via Preserved wrapper :
0x16cc7eb73f70 [HTMLDocument <no private>]
--[group]--> 0x266bdb8b36a0 [object_group]
--[group_global]--> 0x3b610b6b9d80 [Window <no private>]
--[SpecialPowers]--> 0x16cc7eb4ad00 [Proxy <no private>]
--[group]--> 0x3b610b64ba60 [object_group]
--[group_global]--> 0xfc42c556740 [Window <no private>]
--[CLASS_OBJECT(Object), CLASS_OBJECT(Function), UNKNOWN SLOT 169]--> 0x1a412531eb40 [Proxy <no private>]

I think this means that the outer window is associated with the global for a SpecialPowers object, which seems weird.

Peter, any ideas why we might end up collecting the wrapper for a SpecialPowers outer window, but not the outer window?

Flags: needinfo?(peterv)
Status: NEW → ASSIGNED

I think I might have figured a little bit of something out here, by looking a little more at the inner windows involved here, instead of the outer windows.

In CC log 62, there's an inner window for dom/tests/mochitest/whatwg/postMessage_origin_helper.xhtml that gets found to be garbage. It shows up again in CC logs 63, 64, and 68, but it clearly has been unlinked, so something is going wrong on the JS side that is keeping its reflector alive.

In GC log 63, the inner window is being held alive like this:
0x266bdb8e9600 [Promise <no private>]
--[group]--> 0x773f263f520 [object_group]
--[group_global]--> 0xfc42c556c40 [Window <no private>]
--[SpecialPowers]--> 0x266bdb8e96c0 [Proxy <no private>]
--[group]--> 0xb334768ba90 [object_group]
--[group_global]--> 0xfc42c556a60 [Window <no private>]

0xfc42c556c40 is the reflector for the inner window for http://example.org:8000/tests/dom/tests/mochitest/whatwg/postMessage_helper.html, and it is found to be garbage in CC log 63. Moreover, it did not exist at all in log 62.

Looking at testing/mochitest/tests/SimpleTest/SimpleTest.js, it seems like the SpecialPowers object is set by digging around in the parent and openers of the current window. The opener, at least, is a weak reference. So I think we CC the window, then before we GC it, special powers grabs a weak reference, then turns it into a strong reference, and then we have a window sitting around in an unlinked state. I'm not exactly sure how we go from there to leaking the windows forever, but it isn't too surprising that it might happen.

I'll see if I can improve things by calling ClearWeakReferences in the outer window's unlink method, like we do for XPCWJS. Technically this could have web visible behavior, but anything that depends on this is probably not going to have a good time anyways. (This sounds vaguely related to bug 1543537...)

Flags: needinfo?(peterv)

Just jamming ClearWeakReferences into the inner and outer window unlinks did not fix the issue. I'll have to see if that affected anything at all in the graphs.

Adding ClearWeakReferences did help, in that it fixed the inner window leaks, but not the outer window leaks.

On some further investigation, I think another issue is that nsGlobalWindowOuter::IsBlackForCC() returns true if the wrapper for the outer window is black, but the wrapper for the outer window is actually a proxy for the inner window, so it does not own the outer window. I think I can add an additional check that mInnerWindow is true. The reasoning there is that if the proxy is black then it will be alive, which holds the inner window alive, which will have a strong reference to the outer window. I'll see if this actually helps.

To rephrase comment 9, the sequence of steps is something like:

CC n:
outer window is alive (held by docshell)
inner window is garbage (and gets unlinked)
inner window wrapper/outer window proxy are marked gray

CC n+1:
outer window is alive (optimized out due to black outer window proxy)
inner window is already unlinked, found to be garbage again
inner window wrapper/outer window proxy are marked black, due to SpecialPowers object.

Some time later, the outer window proxy gets destroyed. I think that because it does not actually release a reference to the outer window, the cycle collector does not reconsider the outer window, and we leak it.

My patch stops us from removing the outer window from the CC graph in CC n+1, because we don't optimize based on the outer window proxy if the inner window has been unlinked (that seems to be the only case where the inner window doesn't point to the outer window).

It still seems like a bug that we're creating additional references to the wrapper of an unlinked inner window. There must be some additional weak reference we should clear out in the inner window's unlink method, but I'm not sure how to find it. There's no longer even any obvious symptom for it once I have fixed the leak. Maybe I can add an unlinked flag, and check it at various points.

My patches seem quite non-leaky, at least:
https://treeherder.mozilla.org/#/jobs?repo=try&tier=1&revision=80b1f21beb57483090782dabe704a2f43823e12d

I'll make this bug about fixing the leaks, and the work about making them actually tied together can be in another bug.

Summary: Tie together lifetime of nsGlobalWindowOuter and nsDocShell → Fix leaks that occur when nsGlobalWindowOuter and nsDocShell have the same lifetime

Most wrapper cached C++ objects are held alive by their wrapper. The
cycle collector takes advantage of this in many classes and ignores
the C++ object if the wrapper is marked black.

However, this is not true for the outer window's wrapper. Instead, the
outer window's wrapper keeps the inner window alive. The inner window
usually keeps its outer window alive, but not after it has been
unlinked. For reasons I do not yet understand, the outer window's
wrapper can be kept alive after the inner window it is a proxy for is
unlinked.

This patch fixes the cycle collector optimization for the outer window
by only applying it if the outer window still has a weak reference to
the inner window, which it will until the inner no longer holds the
outer alive. This in turn fixes, or at least helps fix, window leaks
seen intermittently when the lifetime of outer windows and docshells
are tied together.

This patch only includes the CC optimization fix. I'm hoping that is enough to fix these leaks, but I'm not sure. I'll work on figuring out why we end up with references to a SpecialPower object in an unlinked window separately.

Whiteboard: [MemShrink:P2] → [4/17] Patch posted for peterv's review [MemShrink:P2]
See Also: → 1545441
Pushed by amccreight@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3073770e06f1
Take indirection into account for the CC optimizations for the outer window wrapper. r=peterv
Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68
Whiteboard: [4/17] Patch posted for peterv's review [MemShrink:P2] → [MemShrink:P2]
Blocks: 1548077
You need to log in before you can comment on or make changes to this bug.