window.open()/.close() memory leak

RESOLVED FIXED in Firefox 39

Status

()

defect
P1
critical
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: m1, Assigned: kats)

Tracking

({regression})

unspecified
mozilla39
ARM
Gonk (Firefox OS)
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking-b2g:2.2+, firefox37 wontfix, firefox38 wontfix, firefox39 fixed, b2g-v2.2 fixed, b2g-master fixed)

Details

(Whiteboard: [caf priority: p1][CR 801548][MemShrink:P2])

Attachments

(3 attachments, 2 obsolete attachments)

+++ This bug was initially created as a clone of Bug #998504 +++

A memory leak is observed on b2g v2.2 in content process of an app that repeatedly invokes window.open()/close().

USS of the content process grows until the LMK kills the process after a couple hours.

STR:
* Run the test app at bug 964386 attachment 8366455 [details] on a Flame device w/ a recent b2g v2.2 build.
Can we get memory reports after every 100 or so open/close pairs?  (Not sure how many open/close cycles it takes; maybe every 10 or every 1000 would be better)  GC/CC logs from those times would also be helpful.
Flags: needinfo?(mvines)

Updated

4 years ago
Whiteboard: [MemShrink]
Absolutely, please run the test app from the other bug to reproduce this issue yourself.
Flags: needinfo?(mvines)
(In reply to Michael Vines [:m1] [:evilmachines] from comment #2)
> Absolutely, please run the test app from the other bug to reproduce this
> issue yourself.

I don't have a Flame to test apps on, nor any idea how to load apps on it.  In both of the previous bugs, it looked like logs etc. were provided to help diagnose the memory issues, so that's why I asked.
Got it.  I'm sure somebody from Moz QA can help.  This isn't the first time (or second time) we have regressed here so it's surprising that this still isn't covered as a part of the normal Moz b2g test plan.
I have a Flame, and I've gotten it working.  It seems to be leaking 2 /dev/ashmem fds per iteration (my thanks to the SystemMemoryReporter open-fds tree).  It may be interesting to see what happens to it after about 2¼ hours of runtime, when it hits its fd limit.
Whiteboard: [MemShrink] → [CR 801548][MemShrink]
Whiteboard: [CR 801548][MemShrink] → [caf priority: p1][CR 801548][MemShrink]
blocking-b2g: --- → 2.2+
Dave, you fixed something like this back in 1.3, any thoughts?
Flags: needinfo?(huseby)
This seems to be where the two fd leaks are coming from:

(gdb) bt
#0  mozilla::ipc::SharedMemoryBasic::SharedMemoryBasic (this=0xb24231a0, aHandle=...)
    at ../../../gecko/ipc/glue/SharedMemoryBasic_android.cpp:47
#1  0xb5133510 in mozilla::layers::CompositorChild::SharedFrameMetricsData::SharedFrameMetricsData
    (this=0xb2468290, metrics=..., handle=..., aAPZCId=<optimized out>)
    at ../../../gecko/gfx/layers/ipc/CompositorChild.cpp:236
#2  0xb5133596 in mozilla::layers::CompositorChild::RecvSharedCompositorFrameMetrics (
    this=0xb3d7f8e0, metrics=..., handle=..., aAPZCId=@0xbefcbe60: 79)
    at ../../../gecko/gfx/layers/ipc/CompositorChild.cpp:209
#3  0xb4f4063e in mozilla::layers::PCompositorChild::OnMessageReceived (this=0xb3d7f8e0, __msg=...)
    at PCompositorChild.cpp:898
#4  0xb4f0203c in mozilla::ipc::MessageChannel::DispatchAsyncMessage (this=0xb3d7f910, aMsg=...)
    at ../../../gecko/ipc/glue/MessageChannel.cpp:1218

(gdb) bt
#0  mozilla::ipc::SharedMemoryBasic::SharedMemoryBasic (this=0xb24627a0, aHandle=...)
    at ../../../gecko/ipc/glue/SharedMemoryBasic_android.cpp:47
#1  0xb4f03f78 in mozilla::CrossProcessMutex::CrossProcessMutex (this=0xb2cffc20, aHandle=...)
    at ../../../gecko/ipc/glue/CrossProcessMutex_posix.cpp:89
#2  0xb5133544 in mozilla::layers::CompositorChild::SharedFrameMetricsData::SharedFrameMetricsData
    (this=0xb2468290, metrics=..., handle=..., aAPZCId=<optimized out>)
    at ../../../gecko/gfx/layers/ipc/CompositorChild.cpp:238
#3  0xb5133596 in mozilla::layers::CompositorChild::RecvSharedCompositorFrameMetrics (
    this=0xb3d7f8e0, metrics=..., handle=..., aAPZCId=@0xbefcbe60: 79)
    at ../../../gecko/gfx/layers/ipc/CompositorChild.cpp:209
#4  0xb4f4063e in mozilla::layers::PCompositorChild::OnMessageReceived (this=0xb3d7f8e0, __msg=...)
    at PCompositorChild.cpp:898
#5  0xb4f0203c in mozilla::ipc::MessageChannel::DispatchAsyncMessage (this=0xb3d7f910, aMsg=...)
    at ../../../gecko/ipc/glue/MessageChannel.cpp:1218

So, here:

    229 CompositorChild::SharedFrameMetricsData::SharedFrameMetricsData(
    230     const ipc::SharedMemoryBasic::Handle& metrics,
    231     const CrossProcessMutexHandle& handle,
    232     const uint32_t& aAPZCId) :
    233     mMutex(nullptr),
    234     mAPZCId(aAPZCId)
    235 {
==> 236   mBuffer = new ipc::SharedMemoryBasic(metrics);
    237   mBuffer->Map(sizeof(FrameMetrics));
==> 238   mMutex = new CrossProcessMutex(handle);
    239   MOZ_COUNT_CTOR(SharedFrameMetricsData);
    240 }

Called from here:

    209   SharedFrameMetricsData* data = new SharedFrameMetricsData(metrics, handle, aAPZCId);
    210   mFrameMetricsTable.Put(data->GetViewID(), data);
I guess the removed SharedFrameMetricsData isn't getting destroyed at [1]? The destructor for that class should clear the fds if it's running.

[1] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/CompositorChild.cpp?rev=15dea2b0929b#340
mFrameMetricsTable.Count() seems to be increasing without bound.  This looks like a misbalance of SendSharedCompositorFrameMetrics and SendReleaseSharedCompositorFrameMetrics in the parent.
This, in AsyncPanZoomController::GetSharedFrameMetricsCompositor:

    // |state| may be null here if the CrossProcessCompositorParent has already been destroyed.

It is, and this seems to leak the child-side APZC metrics table entry.  At this point it might be better for someone who know actually knows this code to look at it.
Component: Performance → Graphics: Layers
Product: Firefox OS → Core
I can take it. Thanks for tracking it down!
Assignee: nobody → bugmail.mozilla
Component: Graphics: Layers → Panning and Zooming
Whiteboard: [caf priority: p1][CR 801548][MemShrink] → [caf priority: p1][CR 801548][MemShrink:P2]
Try push with the above patches: https://treeherder.mozilla.org/#/jobs?repo=try&revision=694bd7a3bf9f

Basically what's happening here is that there is a single child process, with a single CompositorChild instance, which keeps allocating and destroying PLayerTransactions (once with each window open/close). Each PLayerTransaction has a new layers id and so each window open/close creates a new LayerTreeState in the table on the parent-side.

However, the SharedFrameMetrics code doesn't deal with the case where the LayerTreeState for a layers id is removed without the child process itself getting removed. In this scenario it might end up (racy behaviour here) leaking the entries in the mFrameMetricsTable on the child side corresponding to the destroyed PLayerTransaction.

The patches I posted track basically find these dangling SharedFrameMetrics objects on the client side at the time that the PLayerTransaction is destroyed, and remove them. I verified locally that the code is getting hit although I didn't verify that the fd's where actually getting closed. I assume that happens though because it seems to be happening in the normal flow path where we remove these objects via an explicit Release call.

Note that the explicit Release call is still needed, because we want to destroy the SharedFrameMetrics objects when their corresponding APZC goes away, which can happen for reasons other than destroying the PLayerTransaction (for example, if the page content changes so that the scrollframe frame is no longer).

Will flag patches for review once the try results are in, but if anybody wants to try these patches to verify it fixes the memory leak please feel free to do so.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #15)
> Will flag patches for review once the try results are in, but if anybody
> wants to try these patches to verify it fixes the memory leak please feel
> free to do so.

I've rebased them to v2.2 and confirmed they fix the file descriptor leak.

I haven't reproduced the memory leak mentioned in comment #0; I conjectured that it would occur when the file descriptor table filled up with leaked ashmem instances, but I didn't let it run that long.  I'm currently trying to see if I can reproduce it more easily by lowering the file descriptor limit.
(In reply to Jed Davis [:jld] from comment #16)
> I've rebased them to v2.2 and confirmed they fix the file descriptor leak.

Thanks! The try push also appears green. There was a static analysis build fail which I fixed locally; I'll update the part 1 patch with that and flag for review. Would you prefer if I moved these patches to another bug, if they don't fix the memory leak this bug was filed about?
(In reply to Michael Vines [:m1] [:evilmachines] from comment #0)
> A memory leak is observed on b2g v2.2 in content process of an app that
> repeatedly invokes window.open()/close().
> 
> USS of the content process grows until the LMK kills the process after a
> couple hours.

How sure are we that this is what's going on?  I've seen the content process exit due to IPC channel failure after its fd table is filled, but I haven't seen a significant USS increase.  (There will be some increase due to the “leaked” objects under the SharedFrameMetricsData, but that shouldn't be very large relative to the phone memory given that the leak can happen at most 4000ish times.)
Flags: needinfo?(mvines)
If you pass me the rebased patch for v2.2 I can give it a shot here
Flags: needinfo?(mvines)
The three patches, in git-format-patch format, rebased to mozillaorg/v2.2.
Attachment #8572878 - Flags: feedback?(mvines)
Thanks Jed.  I've pulled attachment 8572878 [details] [diff] [review] into our trees and will let automation take a stab at it overnight.
Flags: needinfo?(mvines)
Comment on attachment 8572878 [details] [diff] [review]
Patches, rebased for v2.2, in Git format.

Tests are pretty happy with this patch on v2.2, thanks so much guys.

(FTR: in some devices this bug showed up as a VSS leak, in others as a USS leak.  I suspect device differences in display composition with the GPU vs. HWC)
Flags: needinfo?(mvines)
Attachment #8572878 - Flags: feedback?(mvines) → feedback+
Updated to make the constructor explicit so that it doesn't fail the static analysis build.
Attachment #8572700 - Attachment is obsolete: true
Attachment #8573258 - Flags: review?(nical.bugzilla)
Attachment #8572701 - Flags: review?(nical.bugzilla)
Attachment #8572702 - Flags: review?(nical.bugzilla)
Attachment #8572702 - Flags: feedback?(rbarker)
Attachment #8572701 - Flags: review?(nical.bugzilla) → review+
Attachment #8572702 - Flags: review?(nical.bugzilla) → review+
Attachment #8573258 - Flags: review?(nical.bugzilla) → review+
Flags: needinfo?(huseby)
Hey kats, can you please request approval‑mozilla‑b2g37? on what needs to get uplifted here?
Flags: needinfo?(bugmail.mozilla)
Yeah, I wanted to give it a few days to settle just in case. I'll make a rollup and request approval.
Flags: needinfo?(bugmail.mozilla)
Comment on attachment 8572878 [details] [diff] [review]
Patches, rebased for v2.2, in Git format.

Actually the patches apply cleanly to 2.2, so I'd rather just uplift them individually rather than do a rollup.
Attachment #8572878 - Attachment is obsolete: true
Comment on attachment 8573258 [details] [diff] [review]
Part 1 - Make LayerTransactionChild track the layers id

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): progressive painting + APZ
User impact if declined: in some cases after opening and closing a window in the child process we will leak some file descriptors. Eventually after prolonged use this can cause an OOM crash
Testing completed: patches verified to fix the problem on master and 2.2 (see comment 22)
Risk to taking this patch (and alternatives if risky): fairly low risk; the first two patches just add a new field and then the last patch with the fix is pretty small. patches apply cleanly, this code hasn't been touched a lot.
String or UUID changes made by this patch: none
Attachment #8573258 - Flags: approval-mozilla-b2g37?
Attachment #8572701 - Flags: approval-mozilla-b2g37?
Attachment #8572702 - Flags: feedback?(rbarker) → approval-mozilla-b2g37?
Attachment #8572701 - Flags: approval-mozilla-b2g37? → approval-mozilla-b2g37+
Attachment #8572702 - Flags: approval-mozilla-b2g37? → approval-mozilla-b2g37+
Attachment #8573258 - Flags: approval-mozilla-b2g37? → approval-mozilla-b2g37+

Updated

4 years ago

Updated

4 years ago
No longer blocks: CAF-v3.0-FL-metabug
You need to log in before you can comment on or make changes to this bug.