Closed Bug 1033679 Opened 10 years ago Closed 10 years ago

suspected memory leak in full screen on OSX

Categories

(Core :: Layout, defect)

30 Branch
x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla35

People

(Reporter: lonnen, Assigned: tnikkel)

References

Details

(Whiteboard: [MemShrink:P2])

Attachments

(7 files, 2 obsolete files)

The TVs around the offices are in full screen mode constantly and intermittently every 4-14 days the boxes run out of application memory and suggest killing Firefox. Earlier this week we were able to get an about:memory report, which contained the following: 

```
WARNING: the following values are negative or unreasonably large.

    gfx-surface-quartz
    page-faults-soft 

This indicates a defect in one or more memory reporters. The invalid values are highlighted. 

```

Terminating Firefox released the memory. I've attached the about:memory dump.

I took a sub-OOM machine and captured screen caps of memory usage before and after killing Firefox. I've attached those as well. Freed memory jumps disproportionately to application memory. 

Both of these machines are 4GB mac minis. There are no plug-ins. They spend all day in full screen (the browser full-screen-api, not the OSX level full screen) displaying content in an iframe.
This is occurring in a current FF install
Version: 32 Branch → 30 Branch
What page are they displaying?

The memory report has 56GB (!) of heap unclassified.
Whiteboard: [MemShrink]
The browser was displaying the client page of Corsica[0], which is intended to run all the time in full screen mode, displaying rotating content in iframes (mostly, sometimes it doesn't use iframes).

[0]: https://github.com/mozilla/corsica
Is there a way that someone could try to reproduce to investigate?
If you're on the appropriate networks, you can get the client here http://airmozilla-ops2.corpdmz.scl3.mozilla.com/. We'll probably need to make some tweaks to give your client some content.

If not, we can set up an instance you can look at that mirrors the "production" instance above.

You can ping me (mythmon) on irc if you want to explore either of these options. We've tried to reproduce it ourselves, but haven't been able to figure anything out.
Hmm, would it be possible to use a special build of Firefox that has DMD (https://wiki.mozilla.org/Performance/MemShrink/DMD) enabled in the "production" environment? Then when the problem happens again we can get a DMD report that will tell us what is actually in that 56gb of heap unclassified.
Sure. I'll get it set up on monday.
Blocks: 993214
I made you a try server build with DMD enabled at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/tnikkel@gmail.com-4fc291a89c09/try-macosx64-debug/firefox-33.0a1.en-US.mac64.dmg  It also has debug enabled (to get a working DMD build right now requires debug on mac) so it might be a little slower, but only a little because optimizations are enabled. These are the steps I used to run that build with DMD. First copy the app to a read-write folder (just mounting the disk image and running from there won't work) so that it can write the output file in the app directory. Then in a terminal cd to FirefoxNightlyDebug.app/Contents/MacOS/ and then use this command line

DYLD_INSERT_LIBRARIES=/path/to/FirefoxNightlyDebug.app/Contents/MacOS/libdmd.dylib LD_LIBRARY_PATH=/path/to/FirefoxNightlyDebug.app/Contents/MacOS/ DMD=1 ./firefox

When you load about:memory now there is a button to save dmd output. It will save it to out.dmd in the app directory. Let me know if this works out.
Thank you!

On Monday we'll install that on the boxes and get back to you with when the error reoccurs. Sometimes we see it every day, sometimes it takes 2 weeks. I'll also load it on our staging box and increase the rate at which content cycles through that to see if we can repro it sooner.
Depends on: 1035215
Attached file out.dmd (obsolete) —
Here is the result from 12 hours or so on one of the TVs.
Comment on attachment 8452812 [details]
out.dmd

Thank you!

There is one stack responsible for about 1GB of heap-unclassified, 85% of unreported. So we have a clear target here.

I don't think running DMD builds will give us any more information here (it's already done it's job excellently) so you can go back to normal builds if you'd like.
rgr that. Let me know if we should start running it again; I'll leave the builds on the box. In the meanwhile we're likely to move to periodic browser reboots to keep from the OS from putting up a shame modal.
Actually I don't think I trust the symbols in the stacks from the DMD output. :(

Is it a lot of work to create an instance that I can connect to via the Mozilla VPN in a local build of Firefox?
I resymbolicated the log using the crash reporter symbols of the tryserver build. The stacks appear to be correct now.
Attachment #8452812 - Attachment is obsolete: true
(In reply to Timothy Nikkel (:tn) from comment #16)
> Actually I don't think I trust the symbols in the stacks from the DMD
> output. :(
> 
> Is it a lot of work to create an instance that I can connect to via the
> Mozilla VPN in a local build of Firefox?

Nevermind this request, Markus has saved the day!
The DMD output points the finger squarely at frames from images. These should already be covered by existing about:memory reporters, but only for images that are registered with an imgLoader. In a freshly started fairly simple browser I got about 3 RasterImage's that aren't registered with an imgLoader (of about 60 total images). So this may be why; figuring out why all images don't get registered with an imgLoader.
tn: any ideas why the imgLoader isn't seeing some images?
Whiteboard: [MemShrink] → [MemShrink:P2]
No, not yet. That was my next thing to investigate but I've been busy with other bugs.
I don't think it impacts the general pop of users as heavily as it impacts the office TVs, but it's internally visible. Leads to tweets: https://twitter.com/dolske/status/492755870971727872

Going to try and work around it with periodic browser reboots. Other problems getting Firefox and puppet to play nice together have held that up, though.
Assignee: nobody → dougt
I've actually had some time to look into this lately.

Using the image cache to get a list of all images in memory is just not what is was designed for. Anytime we load an image and find that we can't use the image that we have in the image cache we will overwrite the entry for that URI in the image cache, but we have to leave the old image around (a new image element referencing the same URI shouldn't change an existing image). So we will need to change how we do image memory reporting in order to catch all images.
And I was planning to do just that.
Flags: needinfo?(tnikkel)
Assignee: dougt → tnikkel
Flags: needinfo?(tnikkel)
I made a build that should now report all images to about:memory (I'll get that patch landed in another bug)
https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/tnikkel@gmail.com-5b46bd0f48ef/

(The build should be there in about an hour.)

Chris could you test this on one machine? If my guess is correct instead of seeing heap-unclassified grow we should see images/stale grow. An about:memory dump should then tell us where to look for the problem.
Flags: needinfo?(chris.lonnen)
Sure thing. I'll put this on a machine in the AM and will post the dump at EOD.
Installed on ambient4. Will get back to you in a few hours with the result.
Flags: needinfo?(chris.lonnen)
Here's the memory dump after a few hours. I'll leave it running until I hear otherwise. Let me know if you need anything else from it.
Thanks again! That points the finger squarely at images that aren't stored in the image cache, just like I suspected. The interesting thing is that they are mostly in the "unused" bin, meaning they have no proxies, so they likely aren't being referenced from any page anymore, but something is still holding a ref to them to keep them alive.

I'll debug locally to see why we keep unused images like this alive (if I can reproduce such a case). You can go back to running a regular build.
The patch for the images that aren't stored in the image cache to be reported in about:memory is in bug 1059654.

I've reproduced locally with Chris' help. Making progress, but no patch yet.
In the next part we need to get the refresh driver (via the presshell) from the ImageLoader, and the ImageLoader only has a pointer to the document.
Attachment #8484387 - Flags: review?(dholbert)
iframes use the refresh driver of the root content document, so they don't go away when the iframe goes away. We were failing to remove image requests from the refresh driver when tearing down the whole document because we skip notifying per frame when tearing down the whole document as a perf optimization.
Attachment #8484392 - Flags: review?(khuey)
Comment on attachment 8484387 [details] [diff] [review]
Part 1. When tearing down a document move the call to ImageLoader::ClearFrames to before we clear out the presshell pointer on the document.

>diff --git a/layout/base/nsPresShell.cpp b/layout/base/nsPresShell.cpp
>@@ -1176,16 +1176,25 @@ PresShell::Destroy()
>+    // Do this before the mDocument->DeleteShell() call so the ImageLoader can
>+    // still find the shell and hence refresh driver to remove image requests
>+    // from it.

The phrase "...and hence refresh driver to remove..." doesn't make sense.

Maybe you mean "...and get the refresh driver to remove...", or something like that.


> void
> PresShell::SetIgnoreFrameDestruction(bool aIgnore)
> {
>-  if (mDocument) {
>-    // We need to tell the ImageLoader to drop all its references to frames
>-    // because they're about to go away and it won't get notifications of that.
>-    mDocument->StyleImageLoader()->ClearFrames();
>-  }
>   mIgnoreFrameDestruction = aIgnore;

Hmm. I'm a little uneasy decoupling these things -- mIgnoreFrameDestruction being toggled & the StyleImageLoader()->ClearFrames() call.  Elsewhere, we seem to depend on them being tied -- e.g. this chunk of patch-context:

> void
> PresShell::NotifyDestroyingFrame(nsIFrame* aFrame)
> {
>   if (!mIgnoreFrameDestruction) {
>     mDocument->StyleImageLoader()->DropRequestsForFrame(aFrame);

In the current codebase, we know we're safe skipping the DropRequestsForFrame() call, if the flag is set, because the flag is only set when we've cleared the frame<-->request maps.  That's a bit fuzzier with your change, though.

I don't know the precise order of destruction steps here, but it seems like it'd be better to do one of the following:
 (1) Move the flag-tweak along with the ClearFrames() call (and perhaps just drop SetIgnoreFrameDestruction() altogether)
...or...
 (2) Make PresShell::Destroy just call SetIgnoreFrameDestruction(), and
   (2a) ...drop FrameManager's call to SetIgnoreFrameDestruction, if we're very sure that PresShell::Destroy's invocation must always happen first
     or
   (2b) ... make SetIgnoreFrameDestruction() a no-op if its flag has already been set (so that nsFrameManager.cpp can still call it)
...or...
 (3) Don't mess with SetIgnoreFrameDestruction/ClearFrames, and instead just create a new function to deregister the requests (in part 2)
...or maybe something else.

I think I'm leaning towards (3)... Though it seems perhaps a bit odd to keep the maps around after unregistering the requests.  What do you think?
option (4), from IRC: Just pass the PresShell pointer (or better, the PresContext pointer) directly into the existing ClearFrames() call.

Then we don't need patch 1 here at all.
Attachment #8484387 - Attachment is obsolete: true
Attachment #8484387 - Flags: review?(dholbert)
Attachment #8484392 - Attachment description: Part 2. Remove image requests from the refresh driver when we clear all frames from the image loader. → Remove image requests from the refresh driver when we clear all frames from the image loader.
Comment on attachment 8484488 [details] [diff] [review]
This implements dholbert's review comment, to be folded into main patch

Review of attachment 8484488 [details] [diff] [review]:
-----------------------------------------------------------------

I'll admit to not understanding the shutdown sequence terribly well here, but this certainly looks sane.
Attachment #8484488 - Flags: review?(khuey) → review+
Comment on attachment 8484488 [details] [diff] [review]
This implements dholbert's review comment, to be folded into main patch

Make sure this is what Daniel had in mind.
Attachment #8484488 - Flags: review?(dholbert)
Comment on attachment 8484488 [details] [diff] [review]
This implements dholbert's review comment, to be folded into main patch

[sorry for the delay; I thought I'd already submitted this, but it was sitting in a tab waiting to be submitted.]

Looks good. Just one verbage nit:

> void
> ImageLoader::DropDocumentReference()
> {
>-  ClearFrames();
>+  // It's okay if GetPresContext returns null here because the presshell
>+  // pointer on the document is null as that means the presshell has already
>+  // been destroyed, and it also calls ClearFrames when it is destroyed.
>+  ClearFrames(GetPresContext());

The "It's ok...because" language is a bit hard to follow here. (It sounds like you're saying "It's ok, because [blah].", but really "because" is still part of the thing that you're describing as being OK.)

Maybe reword as:
  // It's okay if GetPresContext returns null here (due to the presshell
  // pointer on the document being null), as that means [...]

r=me with that or something similar
Attachment #8484488 - Flags: review?(dholbert) → review+
Comment on attachment 8484392 [details] [diff] [review]
Remove image requests from the refresh driver when we clear all frames from the image loader.

>+++ b/layout/style/ImageLoader.h
>@@ -88,16 +88,20 @@ private:
>   nsPresContext* GetPresContext();
> 
>   void DoRedraw(FrameSet* aFrameSet);
> 
>   static PLDHashOperator
>   SetAnimationModeEnumerator(nsISupports* aKey, FrameSet* aValue,
>                              void* aClosure);
> 
>+  static PLDHashOperator
>+  DeregisterRequestEnumerator(nsISupports* aKey, FrameSet* aValue,
>+                             void* aClosure);

Nit: Fix indentation here------^
Component: General → Layout
Product: Firefox → Core
https://hg.mozilla.org/mozilla-central/rev/c72942afe172
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: