Closed Bug 1264161 Opened 8 years ago Closed 8 years ago

Leaking many MB of heap-textures each time you open and close a Firefox window with e10s enabled

Categories

(Core :: Graphics, defect, P1)

Unspecified
macOS
defect

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
e10s m9+ ---
firefox47 --- fixed
firefox48 --- fixed

People

(Reporter: bzbarsky, Assigned: kats)

References

Details

(Whiteboard: [gfx-noted][MemShrink:P1])

Attachments

(1 file, 1 obsolete file)

Over the last few days I've commonly been getting the "our of application memory" dialog on Mac.  I just tried looking at about:memory for my nightly, and it's showing about 2GB in the main process used by explicit/gfx/heap-textures.

Is this expected?  If this is a cache of some sort, can I somehow flush it to see what the effect is on memory usage of the nightly process and the kernel?
Flags: needinfo?(milan)
Oh, and I have 36 windows containing 132 tabs between them, though 82 of those tabs are not actually loaded.
Boris, I imagine you need a long enough workflow that we can't do a mozregression on it?  Anthony, can you see if you can reproduce this as a regression?

Looking at the last few days, there were a few bugs where we tried to improve the memory situation, it's possible one of those introduced a regression.

* Bug 1259785 - we fixed a canvas memory leak
* Bug 1262328 - handling memory pressure
* Bug 1262846 - we updated a font-related library
Flags: needinfo?(milan) → needinfo?(anthony.s.hughes)
OS: Unspecified → Mac OS X
Yes, I don't know offhand how to get into this state.  :(

Also of note:

1)  This is a 2016-04-07 nightly.  The previous nightly I was using was probably another week or two older
    than that.
2)  I just re-measured and now I'm at 2.8GB of gfx/heap-textures, according to about:memory....
(In reply to Milan Sreckovic [:milan] from comment #2)
> Boris, I imagine you need a long enough workflow that we can't do a
> mozregression on it?  Anthony, can you see if you can reproduce this as a
> regression?

I'm not seeing this on my profile. I'm sitting at 16.72MB in gfx/heap-textures on my profile.
Flags: needinfo?(anthony.s.hughes)
OK, so I just tried restarting the browser (so now I'm on the 2016-04-08 nightly).  With my session restored but without any browsing so far other than the things I already head loaded, I'm at ~590MB of gfx/heap-textures.

Every time I open/close a window gfx/heap-textures goes up by 1MB or so.  Minimizing memory usage via about:memory sometimes seems to reclaim this memory (or at least other gfx/heap-textures memory?), and sometimes not....
(In reply to Boris Zbarsky [:bz] from comment #5)
> 
> Every time I open/close a window gfx/heap-textures goes up by 1MB or so. 
> Minimizing memory usage via about:memory sometimes seems to reclaim this
> memory (or at least other gfx/heap-textures memory?), and sometimes not....

I think I see this on nightly, but Markus does not.  Anthony, what about you?
Flags: needinfo?(anthony.s.hughes)
Whiteboard: [gfx-noted]
No I don't see this at all. I'm using Firefox Nightly 48.0a1 20160413030239 on Mac OS 10.11.5 Beta (15F18b).
Flags: needinfo?(anthony.s.hughes)
And now in that same session I mentioned in comment 5, I'm at 1.35GB for gfx/heap-textures.  I have been browsing around with it in the meantime, obviously, opening/closing windows, etc.

If I minimize memory usage, that number changes to .... 1.416GB (!).  And now I minimized again and it's at 1.37GB.
One other thing. Under "Other Measurements", there are these two entries:

12,829.90 MB ── gfx-textures
15,854.78 MB ── gfx-textures-peak

I'm not sure what those mean.  explicit/gfx/heap-textures is currently around 1.6GB; do we also have non-heap textures?
Nicolas, mccr8 suggested asking you to take a look.
Flags: needinfo?(nical.bugzilla)
I can also reproduce the heap-textures measurement going up and staying up when you open and close windows on OSX.
Whiteboard: [gfx-noted] → [gfx-noted][MemShrink]
Let's start with Jeff first.
Flags: needinfo?(nical.bugzilla) → needinfo?(jmuizelaar)
I just got the "Your system has run out of application memory" popup again.  At this time, I have, in my parent process:

3,923.12 MB (100.0%) -- explicit
├──2,773.27 MB (70.69%) -- gfx
│  ├──2,773.07 MB (70.69%) ── heap-textures
....
17,718.42 MB ── gfx-textures
18,957.02 MB ── gfx-textures-peak

Note that those last two numbers are now larger than my available RAM (16GB), if that means anything...
Assignee: nobody → bgirard
Flags: needinfo?(jmuizelaar)
Any chance this is related to switching to skia for content rendering on OS X in bug 1207332?

bz, can you try setting the preference "gfx.content.azure.backends" to "cg", restarting Firefox and seeing if things clear up?
Flags: needinfo?(bzbarsky)
(In reply to Eric Rahm [:erahm] from comment #14)
> Any chance this is related to switching to skia for content rendering on OS
> X in bug 1207332?
> 
> bz, can you try setting the preference "gfx.content.azure.backends" to "cg",
> restarting Firefox and seeing if things clear up?

Sorry for the noise, I was able to repro locally. Going back to cg didn't fix anything. This appears to be an e10s only bug, I can't repro with e10s disabled.
tracking-e10s: --- → ?
Flags: needinfo?(bzbarsky)
Whiteboard: [gfx-noted][MemShrink] → [gfx-noted][MemShrink:P1]
I ended up getting sucked into this thinking it might be related to bug 1262088.

We're leaking nsBaseWidget, and along with it the layer manager, compositor and any number of textures.

The problem is that nsBaseWidget::ConfigureAPZCTreeManager() takes a reference (inside the ChromeProcessController), and then SetControllerForLayerTree stores it in sIndirectLayerTrees.

Nothing calls DeallocateLayerTreeId for root layer tree id, so this entry exists as long as the process does.

Kats, want to fix this?

The threading model on this code confuses me a fair bit. It looks like the only use of mWidget within ChromeProcessController is to access views and pres shell, which would only be safe on the main thread.
SetControllerForLayerTree explicitly posts the messages to the compositor thread though, and sIndirectLayerTrees is a structure usually used by the compositor for associating child process content with the chrome process window.
Flags: needinfo?(bugmail.mozilla)
Some users open and close a lot of windows, so this could be contributing to memory problems. about:memory reports from OOM crashes in the main process could be checked to confirm that.
Blocks: e10s-oom
Summary: Fairly large amount of memory used for textures → Leaking 1MB of heap-textures when opening and closing a Firefox window
Summary: Leaking 1MB of heap-textures when opening and closing a Firefox window → Leaking 1MB of heap-textures each time you open and close a Firefox window
My experience was we're leaking a lot more than 1MiB.
Summary: Leaking 1MB of heap-textures each time you open and close a Firefox window → Leaking many MB of heap-textures each time you open and close a Firefox window with e10s enabled
I can look into this, sure. I'm wondering how none of our automated tests are picking this up...
Assignee: bgirard → bugmail.mozilla
Flags: needinfo?(bugmail.mozilla)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #19)
> I'm wondering how none of our automated tests are picking this up...

Our main leak checking runs late in shutdown. I'd guess that this memory gets cleaned up when you close the main window. AreWeSlimYet opens pages in new tabs, not new windows, so it wouldn't detect this.
I'm going to guess that 47 is also affected.
So it looks like in the gtk and windows widgets, the nsWindow::Destroy override triggers destruction of the compositor by calling DestroyLayerManager (windows) or DestroyCompositor (gtk). nsChildView.mm on OS X, and Android's nsWindow don't appear to do this. This leaves sIndirectLayerTrees with a dangling entry for the root layer id, which holds the ChromeProcessController which holds the widget, as well as a bunch of other stuff.

I added this bit of code to nsBaseWidget::OnDestroy():

+  if (mCompositorBridgeParent) {
+    const CompositorBridgeParent::LayerTreeState* state = CompositorBridgeParent::GetIndirectShadowTree(mCompositorBridgeParent->RootLayerTreeId());
+    if (state && state->mController) {
+      state->mController->Destroy();
+    }
+  }

which fixed the problem by clearing the root layer id entry from sIndirectLayerTrees on widget Destroy(). However I think a better fix is to just call DestroyCompositor() or DestroyLayerManager(), either in the nsBaseWidget::OnDestroy function or in the OS X and Android subclasses. I'm unsure if the ordering of stuff here matters, the nsChildView::Destroy function has a comment and a lock [1] that seems to imply we can't be destroying the compositor there. Markus, do you know what that's about?

[1] http://mxr.mozilla.org/mozilla-central/source/widget/cocoa/nsChildView.mm?rev=327c61df0bae#624
Flags: needinfo?(mstange)
(In reply to Andrew McCreight [:mccr8] from comment #20)
> Our main leak checking runs late in shutdown. I'd guess that this memory
> gets cleaned up when you close the main window. AreWeSlimYet opens pages in
> new tabs, not new windows, so it wouldn't detect this.

Yes, you're right. During shutdown CompositorBridgeParent::ShutDown is called, which clears the entire sIndirectLayerTrees table, and frees up everything.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #22)
> I'm
> unsure if the ordering of stuff here matters, the nsChildView::Destroy
> function has a comment and a lock [1] that seems to imply we can't be
> destroying the compositor there. Markus, do you know what that's about?

Actually looking at it closer, it should be safe to destroy the compositor. The lock is to prevent the composition interleaving with shutdown.
Flags: needinfo?(mstange)
For future reference, the shutdown procedure goes like this:

1. nsWindow::Destroy is called
2. Somewhere during that call, DestroyCompositor/DestroyLayerManager is called
3. This calls CompositorBridgeChild::Destroy, which sends a WillClose to CompositorBridgeParent and then schedules a DeferredDestroy
4. The RecvWillClose on the parent side clears out a bunch of layer managers
5. The DeferredDestroy closes the IPC channel, which triggers ActorDestroy on the parent side
6. CompositorBridgeParent::ActorDestroy clears the mRootLayerID entry from the sIndirectLayerTrees map, which clears the widget pointer
7. Stuff is destroyed.

On OS X, step 2 wasn't happening. After I fixed that I noticed that step 6 wasn't happening either because mLayerManager was getting nulled out in step 4. And that seems to affect Linux as well, and probably windows too? So maybe this bug affects all platforms. Anyway, I fixed up the code to follow the shutdown procedure above and it seems to work. Try push and patches coming shortly.
Attached patch Patch (obsolete) — Splinter Review
Try push including this patch at https://treeherder.mozilla.org/#/jobs?repo=try&revision=05e96ebb3278
Attachment #8743931 - Flags: review?(matt.woodrow)
Priority: -- → P1
(In reply to Matt Woodrow (:mattwoodrow) from comment #16)
> The threading model on this code confuses me a fair bit. It looks like the
> only use of mWidget within ChromeProcessController is to access views and
> pres shell, which would only be safe on the main thread.
> SetControllerForLayerTree explicitly posts the messages to the compositor
> thread though, and sIndirectLayerTrees is a structure usually used by the
> compositor for associating child process content with the chrome process
> window.

Also for completeness I looked over this code and I think it's fine. You're right that mWidget can only be used on the main thread, and I think all the functions in ChromeProcessController that touch the mWidget do so from the main thread. The ChromeProcessController itself does get passed around and touched on the compositor thread as well though.
Attachment #8743931 - Flags: review?(matt.woodrow) → review+
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #27)

> Also for completeness I looked over this code and I think it's fine. You're
> right that mWidget can only be used on the main thread, and I think all the
> functions in ChromeProcessController that touch the mWidget do so from the
> main thread. The ChromeProcessController itself does get passed around and
> touched on the compositor thread as well though.

That's good to know. It would be nice if this were structured in a way that this was obvious, as it seems like we could easily add a bug in the future here. That can go on my wish list though :)
Looks like on OS X the patch causes an assertion failure in some tests. I can repro locally, investigating.
I'm not entirely sure why the assertion failure is happening, and I don't know that code too well. As best I can tell the early shutdown of the compositor/layermanager on the parent side triggers a premature shutdown on the client side as well? At this point the ShadowLayerForwarder hasn't finished emptying out its shmems and so we get the Assertion failure: mUsedShmems.empty() in ShadowLayers.cpp. I'm not sure why this doesn't already happen on Linux/Windows, maybe this is some Mac-specific gfx code.

I'm going to try to sidestep this entirely and go with a simpler fix that is similar to comment 22 which carries less risk. Try push with that at https://treeherder.mozilla.org/#/jobs?repo=try&revision=09c77d899151
Attached patch Patch v2Splinter Review
Attachment #8743931 - Attachment is obsolete: true
Attachment #8744367 - Flags: review?(botond)
Comment on attachment 8744367 [details] [diff] [review]
Patch v2

Review of attachment 8744367 [details] [diff] [review]:
-----------------------------------------------------------------

::: widget/nsBaseWidget.h
@@ +523,5 @@
>    RefPtr<CompositorBridgeChild> mCompositorBridgeChild;
>    RefPtr<CompositorBridgeParent> mCompositorBridgeParent;
>    RefPtr<mozilla::CompositorVsyncDispatcher> mCompositorVsyncDispatcher;
>    RefPtr<APZCTreeManager> mAPZC;
> +  RefPtr<GeckoContentController> mGeckoContentController;

Can we call this mRootGeckoContentController, to reflect the fact that this is the GeckoContentController for the root layer tree ID?
Attachment #8744367 - Flags: review?(botond) → review+
Based on IRC discussion I renamed it to mRootContentController. The try push also revealed a missing null check which I added (again, got r+ over IRC). Latest try push is at https://treeherder.mozilla.org/#/jobs?repo=try&revision=5c60a5753876&group_state=expanded&exclusion_profile=false - still going, but looking green on all the tests that were blowing up before. I'm going to go ahead and land this.
https://hg.mozilla.org/mozilla-central/rev/3d22a2e2a594
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Is this something you feel we can uplift to 47 at some point?
Flags: needinfo?(bugmail.mozilla)
Yeah. I'd like to have somebody (bz?) verify it first before requesting uplift. It's in today's nightly (Apr 25).
Flags: needinfo?(bugmail.mozilla) → needinfo?(bzbarsky)
Depends on: 1267246
That's the same failure mode I was seeing in the try push in comment 26; the new patch didn't seem to have that problem. Weird. I don't know what to do about that - maybe file a new bug, and Nical can look into it? It's in code that I'm not really familiar with and might be a pre-existing issue exposed by this fix.
(Mid-aired, thanks for filing the bug)
Will check once I update to a build with the fix.  I thought I had, but hadn't realized the lag between inbound and m-c in this case.  :(  Sadly, updating is a 10-20 minute process that interrupts the world, so I don't do it every day.
Depends on: 1268169
Hello, would it be possible to uplift this fix to 47.0beta? I think, that this fix could solve this Bug 1190170. 
I tested the last 46.0beta and I ended up with 1 GB per main process after 1 day of usage and this fix could solve the problem or at least mitigate it. 
Thank you for feedback.
Yeah I might as well request uplift while we wait for verification.
Comment on attachment 8744367 [details] [diff] [review]
Patch v2

Approval Request Comment
[Feature/regressing bug #]: APZ
[User impact if declined]: opening a new window and closing it leaks some of the structures from that window. opening and closing many windows will accumulate these memory leaks, resulting in high memory usage counts.
[Describe test coverage new/current, TreeHerder]: tested locally only, waiting for verification from somebody. We don't appear to have leak-checking in automation that can be run on a per-window basis; it just runs late during shutdown after the leaked objects have been released, so it didn't catch this problem.
[Risks and why]: medium-ish risk. One risk factor is that it might expose other pre-existing bugs (e.g. bug 1268169, bug 1267246) that were masked by this problem. Another risk factor is that the code is a little fragile and not super well-understood. I did the best I could in making the patch low-risk, but more bake time will be better. Still, it's a high-reward patch for trains where APZ is enabled.
[String/UUID change made/needed]: none
Attachment #8744367 - Flags: approval-mozilla-beta?
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #43)
> Yeah I might as well request uplift while we wait for verification.

I tested Nightly 2016-04-25 and results were good. Uptime > 2 days and ~400MB per main process consumed.
Comment on attachment 8744367 [details] [diff] [review]
Patch v2

Verified on Nightly, though this is a bit risky, the memleak is severe enough to uplift quicky (20% of beta users have e10s on by default), Beta47+

If we find regressions, we can consider backing it out and leaving things as is.
Attachment #8744367 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
This doesn't apply as cleanly to beta as I would like. The main code dependency that is bug 1215265, if we can get that uplifted it would make me happier. There's also bug 1263200 but that's a one-liner that's trivial to uplift. If we can't uplift bug 1215265 I'll probably need to retest the patch on a local beta build to make sure it's doing what it's supposed to.
Nical is reluctant to uplift bug 1215265 so I rebased around it and verified locally that the rebased patch fixes the leak. I also did a try push at https://treeherder.mozilla.org/#/jobs?repo=try&revision=5260a53ebba0 which looks ok. I'll land the rebased version shortly.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #49)
> https://hg.mozilla.org/releases/mozilla-beta/rev/c70884434329

Does it mean that the fix is part of 47.0beta? If yes, I'm looking forward to test it if it helps or not. Actually I have 1.3 GB per main process, which is really tough :-)
Should be in 47b2, yes.
Yeah, the next beta build should go out today or tomorrow I believe. Please do test it and let us know if it helps. Thanks!
I've been browsing with a nightly with this patch for some time now.  heap-textures is holding steady at about 600MB (instead of ballooning to multiple GB), gfx-textures holding steady at about 9GB (instead of ballooning to 16-20GB).  Haven't had the "run out of application memory" problem.  Looks good.  ;)
Flags: needinfo?(bzbarsky)
(In reply to Boris Zbarsky [:bz] from comment #53)
> I've been browsing with a nightly with this patch for some time now. 
> heap-textures is holding steady at about 600MB (instead of ballooning to
> multiple GB), gfx-textures holding steady at about 9GB (instead of
> ballooning to 16-20GB).  Haven't had the "run out of application memory"
> problem.  Looks good.  ;)

Am I the only one who thinks 9GB of gfx-textures is still way too high? Or is this a caching thing and since we have the memory we just go for it?
You're not the only one :)  Right now, some caching is being done per tab.  We have work in progress to switch that to per window.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #52)
> Yeah, the next beta build should go out today or tomorrow I believe. Please
> do test it and let us know if it helps. Thanks!

Finally I'm happy with 47.0b2. This uplift of your fix really, really helped. I have uptime more than 3 days and spent only ~325 MB per main process. It is really GREAT. Thank you.
Filed https://github.com/mozilla/areweslimyet/issues/125 for what I think is a way to prevent this from happening again.
You need to log in before you can comment on or make changes to this bug.