[META] window.open()/.close() memory leak

RESOLVED FIXED in 2.0 S1 (9may)

Status

defect
P1
blocker
RESOLVED FIXED
5 years ago
4 years ago

People

(Reporter: m1, Unassigned)

Tracking

(5 keywords)

unspecified
2.0 S1 (9may)
ARM
Gonk (Firefox OS)
Bug Flags:
in-moztrap -

Firefox Tracking Flags

(blocking-b2g:1.4+, b2g-v1.4 fixed, b2g-v2.0 fixed)

Details

(Whiteboard: [c=memory p= s= u=1.4] [cr 651835][MemShrink:P1][ETA 5/2])

Attachments

(2 attachments)

A memory leak is observed on v1.4 in both the main process and content process of an app that repeatedly invokes window.open()/close().

USS/RSS are fine, but VSS blows up.  It seems to be a /dev/ashmem-related fd leak, as |watch ls -l /proc/<b2g_pid>/fd/| shows thousands of open file descriptors right before memory is exhausted and the content process is killed.

STR:
* Run the test app at bug 964386 attachment 8366455 [details] on a QRD device.

This leak does not reproduce with a v1.3 Gecko/Gaia on the same device/gonk/testapp.
Keywords: mlk, perf, regression
Whiteboard: [cr 651835] → [cr 651835][MemShrink]
Keywords: footprint
Can someone on the QC side grab an about:memory report when this bug reproduces?
An about:memory report is not helpful in this case, as the leak isn't in normal heap this time.  Looks like it's a file descriptor leak.
May I know whether this is confirmed as a CS blocker? Is this a hard 1.4 blocker? 
Thanks.
Flags: needinfo?(praghunath)
Flags: needinfo?(mvines)
Absolutely.  Without this bug fixed we'll never get close to the stability goals.
Flags: needinfo?(mvines)
(In reply to Kevin Hu [:khu] from comment #3)
> May I know whether this is confirmed as a CS blocker? Is this a
> hard 1.4 blocker? 
> Thanks.

Yes, Kevin, confirmed blocker.
Flags: needinfo?(praghunath)
Is it possible to get a regression range here?
Flags: needinfo?(mvines)
(In reply to Michael Vines [:m1] [:evilmachines] from comment #0)
> USS/RSS are fine, but VSS blows up.  It seems to be a /dev/ashmem-related fd
> leak, as |watch ls -l /proc/<b2g_pid>/fd/| shows thousands of open file
> descriptors right before memory is exhausted and the content process is
> killed.

Can we get an about:memory dump at several points during the test app's execution?

My guess is fallout from bug 748958 + bug 962670, since AFAIK the last one is the only thing making real use of /dev/ashmem.
(In reply to Andrew Overholt [:overholt] from comment #6)
> Is it possible to get a regression range here?

Does not reproduce on v1.3.  Please assign a Mozilla engineer to debug further and fix, thanks!
Flags: needinfo?(mvines)
FWIW I was hoping for something more recent than when 1.4 branched to Aurora, since the fact that it does not reproduce on 1.3 already gives us at least that range.
I'll see if I can reproduce and start investigating on my nexus4 tomorrow.
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
Thanks, Ben.  Just watch the Vss from |adb shell b2g-procrank| grow for the b2g and corresponding content process.  Also notice that /proc/<pid>/fd/ grows into the thousands for both.

Looking back at some automation logs it looks like the rough regression range is: 
* ftp://ftp.mozilla.org/pub/mozilla.org/b2g/manifests/nightly/1.4.0/2014/03/2014-03-13-16
* ftp://ftp.mozilla.org/pub/mozilla.org/b2g/manifests/nightly/1.4.0/2014/03/2014-03-09-16

This doesn't reproduce on a Buri, so please try with one of the newer 8x10-based devices.
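The check described above (watching the fd count climb for the b2g and content processes) can be roughed out in a script. This is a minimal sketch, assuming the fd counts have already been sampled once per open/close cycle, for example with something like |adb shell ls /proc/<pid>/fd | wc -l|. The function names and the `min_growth` threshold are hypothetical and not part of any Mozilla tooling.

```python
def fd_growth_per_cycle(samples):
    """Average change in open-fd count between consecutive samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    return (samples[-1] - samples[0]) / (len(samples) - 1)


def looks_like_leak(samples, min_growth=1.0):
    """Flag a run whose fd count grows by at least min_growth per cycle.

    min_growth is an arbitrary illustrative threshold, not a tuned value.
    """
    return fd_growth_per_cycle(samples) >= min_growth


# Illustrative samples only: one fd count per window.open()/close() cycle.
print(looks_like_leak([574, 590, 606]))
print(looks_like_leak([574, 575, 574, 575]))
```

A steady positive slope across many cycles is the signal of interest; a count that wobbles around a constant value is ordinary churn.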
Reproduced on my open-c.
I don't see any obvious leaks diff'ing those two memory reports.  Some slight increase in strong observers, but not enough for 40 cycles.

The fact that this happens on newer devices, but not buri suggests gfx fence fd issues.
I don't have convincing evidence yet that this is related to gfx fences, but adding Sotaro in case he has any suggestions on how to rule them out.

Also, the fact that we don't leak during normal painting and app usage suggests that might not be it.
The fd's are leaking in the parent process as well.  I think that's consistent with gfx buffers, but could also indicate IPC resources or something.
See Also: → 999473
Some more observations:

1) The leak still occurs if we remove 'attention' from window.open().
2) If the screen is off, the window.open() continues to fire from js, but we do not leak.
3) If the screen is on, but locked, then we do leak.

Also, if I leave the screen off for a while and then activate it, I get this for each window activation while asleep:

E/GeckoConsole( 3759): [JavaScript Error: "TypeError: Argument 1 of Node.removeChild is not an object." {file: "app://system.gaiamobile.org/js/popup_manager.js" line: 92}]
Would connecting ashmem to DMD, assuming that makes any sense, help get data about this?
(In reply to Jed Davis [:jld] from comment #19)
> Would connecting ashmem to DMD, assuming that makes any sense, help get data
> about this?

Unfortunately DMD crashed when I tried enabling it on this device.
Looking at the long listing of the process fd directory, this certainly appears to be a fence issue:

lrwx------ root     root              2014-04-22 15:45 626 -> anon_inode:dmabuf
lrwx------ root     root              2014-04-22 15:45 627 -> anon_inode:dmabuf
lr-x------ root     root              2014-04-22 15:45 628 -> anon_inode:sync_fence
lr-x------ root     root              2014-04-22 15:45 629 -> anon_inode:sync_fence

bkelly@lenir:/srv/gaia-master/apps/system/js$ adb shell ls -l /proc/9779/fd | grep fence | wc -l
247
bkelly@lenir:/srv/gaia-master/apps/system/js$ adb shell ls -l /proc/9779/fd | grep dmabuf | wc -l
508
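The two grep counts above can also be gathered in a single pass over the listing. This is a minimal sketch, assuming the output of |adb shell ls -l /proc/<pid>/fd| has been captured as text in the format shown above; `count_fd_targets` is a hypothetical helper name.

```python
from collections import Counter


def count_fd_targets(listing: str) -> Counter:
    """Count fd symlink targets in captured `ls -l /proc/<pid>/fd` output."""
    counts = Counter()
    for line in listing.splitlines():
        # Each fd entry ends with "<fd> -> <target>"; other lines are skipped.
        _, arrow, target = line.partition(" -> ")
        if arrow:
            counts[target.strip()] += 1
    return counts


sample = """\
lrwx------ root     root              2014-04-22 15:45 626 -> anon_inode:dmabuf
lrwx------ root     root              2014-04-22 15:45 627 -> anon_inode:dmabuf
lr-x------ root     root              2014-04-22 15:45 628 -> anon_inode:sync_fence
lr-x------ root     root              2014-04-22 15:45 629 -> anon_inode:sync_fence
"""

for target, n in count_fd_targets(sample).most_common():
    print(n, target)
```

Grouping by target makes it easy to see which kind of descriptor dominates without knowing in advance what to grep for.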
Sotaro, Peter, Sushil,

Do any of you have suggestions on how to track down or rule out a graphics related fence leak?

We don't leak under normal painting, just when opening/closing iframes within an existing child process.  Does that hit a different code path for our locking?

Thanks!
Flags: needinfo?(sushilchauhan)
Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(pchang)
Here's a profile I've been using to keep my bearings on the overall process.

 http://people.mozilla.org/~bgirard/cleopatra/#report=d1c7dc5412800059414908d6ef9d37c13aa3acc5
It seems that for each window open/close cycle we are constructing 8 GrallocTextureClientOGL objects, but only destructing 3 of them.
Correction, only two are being destructed.  Some instrumentation to show what I am seeing:

I/Gecko   (16445): ### ### TabChild::BrowserFrameProvideWindow() start, fds:574
I/Gecko   (16445): ### ### TabChild::BrowserFrameProvideWindow() end, fds:574
E/GeckoConsole(16445): Content JS LOG at app://windowtest.gaiamobile.org/js/window_test.js:19 in am_set: In set
E/GeckoConsole(16445): [JavaScript Warning: "No meta-viewport tag found. Please explicitly specify one to prevent unexpected behavioural changes in future versions. For more help https://developer.mozilla.org/en/docs/Mozilla/Mobile/Viewport_meta_tag" {file: "app://windowtest.gaiamobile.org/helloworld.html" line: 0}]
I/Gecko   (16445): ### ### GrallocTextureClientOGL() start, count:171 fds:574
I/Gecko   (16445): ### ### GrallocTextureClientOGL() end, count:171 fds:574
I/Gecko   (16445): ### ### GrallocTextureClientOGL() start, count:172 fds:576
I/Gecko   (16445): ### ### GrallocTextureClientOGL() end, count:172 fds:576
E/GeckoConsole(16445): Content JS LOG at app://windowtest.gaiamobile.org/js/helloworld.js:5 in rv_init: In helloworld page
I/Gecko   (16445): ### ### GrallocTextureClientOGL() start, count:173 fds:578
I/Gecko   (16445): ### ### GrallocTextureClientOGL() end, count:173 fds:578
I/Gecko   (16445): ### ### GrallocTextureClientOGL() start, count:174 fds:581
I/Gecko   (16445): ### ### GrallocTextureClientOGL() end, count:174 fds:581
I/Gecko   (16445): ### ### GrallocTextureClientOGL() start, count:175 fds:583
I/Gecko   (16445): ### ### GrallocTextureClientOGL() end, count:175 fds:583
I/Gecko   (16445): ### ### GrallocTextureClientOGL() start, count:176 fds:585
I/Gecko   (16445): ### ### GrallocTextureClientOGL() end, count:176 fds:585
I/Gecko   (16445): ### ### GrallocTextureClientOGL() start, count:177 fds:587
I/Gecko   (16445): ### ### GrallocTextureClientOGL() end, count:177 fds:587
I/Gecko   (16445): ### ### GrallocTextureClientOGL() start, count:178 fds:589
I/Gecko   (16445): ### ### GrallocTextureClientOGL() end, count:178 fds:589
I/Gecko   (16445): ### ### ~GrallocTextureClientOGL() start, count:178 fds:591
I/Gecko   (16445): ### ### ~GrallocTextureClientOGL() should not deallocate
I/Gecko   (16445): ### ### ~GrallocTextureClientOGL() end, count:178 fds:591
I/Gecko   (16445): ### ### ~GrallocTextureClientOGL() start, count:177 fds:591
I/Gecko   (16445): ### ### ~GrallocTextureClientOGL() should not deallocate
I/Gecko   (16445): ### ### ~GrallocTextureClientOGL() end, count:177 fds:591
I/Gecko   (16445): ### ### TabChild::DestroyWindow() start, fds:587
I/Gecko   (16445): ### ### TabChild::DestroyWindow() destroy base window, fds:587
I/Gecko   (16445): ### ### TabChild::DestroyWindow() destroy widget, fds:587
I/Gecko   (16445): ### ### TabChild::DestroyWindow() destroy remote frame, fds:587
I/Gecko   (16445): ### ### TabChild::DestroyWindow() end, fds:590
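Tallying constructor and destructor lines by hand gets tedious over many cycles. The following is a minimal sketch of automating it, assuming the ad-hoc "### ###" instrumentation format shown in the log above (which is local debugging output, not a standard Gecko log format). `summarize_cycle` is a hypothetical helper name.

```python
import re

# Matches the "end" line of a constructor or, with a leading "~", a destructor.
CTOR_DTOR_END = re.compile(r"### ### (~?)GrallocTextureClientOGL\(\) end")
FDS = re.compile(r"fds:(\d+)")


def summarize_cycle(log: str):
    """Return (constructed, destroyed, fd_delta) for one captured log excerpt."""
    constructed = destroyed = 0
    fd_samples = []
    for line in log.splitlines():
        match = CTOR_DTOR_END.search(line)
        if match:
            if match.group(1):
                destroyed += 1
            else:
                constructed += 1
        fds = FDS.search(line)
        if fds:
            fd_samples.append(int(fds.group(1)))
    # Net change in open fds between the first and last instrumented line.
    fd_delta = fd_samples[-1] - fd_samples[0] if fd_samples else 0
    return constructed, destroyed, fd_delta


sample = "\n".join([
    "I/Gecko   (16445): ### ### GrallocTextureClientOGL() start, count:171 fds:574",
    "I/Gecko   (16445): ### ### GrallocTextureClientOGL() end, count:171 fds:574",
    "I/Gecko   (16445): ### ### ~GrallocTextureClientOGL() start, count:171 fds:576",
    "I/Gecko   (16445): ### ### ~GrallocTextureClientOGL() end, count:171 fds:576",
])
print(summarize_cycle(sample))
```

Run over the full excerpt above, this tally should match the counts given in the comment: 8 objects constructed and 2 destructed in the cycle.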

Comment 26

5 years ago
(In reply to Ben Kelly [:bkelly] from comment #22)
> Sotaro, Peter, Sushil,
> 
> Do any of you have suggestions on how to track down or rule out a graphics
> related fence leak?
> 

Hi Ben,

If you have the set-up ready, can you quickly check by reverting these 2 patches:
1. https://bugzilla.mozilla.org/show_bug.cgi?id=986253
2. https://bugzilla.mozilla.org/show_bug.cgi?id=974152

Try reverting this too, only if above 2 do not help:
https://bugzilla.mozilla.org/show_bug.cgi?id=977880
Flags: needinfo?(sushilchauhan)

Comment 27

5 years ago
Sotaro,

I checked with the |adb shell lsof | grep "sync"| command. There is a huge increase in sync_fence fd counts when the Settings app is launched, and this count does not decrease even when the user returns to the Home Screen.
Reverting your patch (https://bugzilla.mozilla.org/show_bug.cgi?id=977880) fixes it.
Flags: needinfo?(sotaro.ikeda.g)
Whiteboard: [cr 651835][MemShrink] → [cr 651835][MemShrink:P1]

Comment 28

5 years ago
Reverting the patch mentioned in Comment 27 hugely reduces the sync_fence fd count. Reverting the 2 patches below then reduces the sync_fence fd count further. Use cases: launch and exit the Settings or Video app:
1. https://bugzilla.mozilla.org/show_bug.cgi?id=986253
2. https://bugzilla.mozilla.org/show_bug.cgi?id=974152
I am going to investigate this today.
Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(sotaro.ikeda.g)
Thanks Sotaro!  Mind if I pass the assignment to you for now?  Feel free to pass back if it ends up not being in your court.
Assignee: bkelly → sotaro.ikeda.g
Component: Performance → Graphics: Layers
Product: Firefox OS → Core
Version: unspecified → 30 Branch
The lifetime of a Fence is tied to the lifetime of its TextureHost/TextureClient. My current assumption is that a TextureHost/TextureClient leak also causes a Fence leak.

Updated

5 years ago
Whiteboard: [cr 651835][MemShrink:P1] → [cr 651835][MemShrink:P1][c= p= s= u=]

Updated

5 years ago
Whiteboard: [cr 651835][MemShrink:P1][c= p= s= u=] → [cr 651835][MemShrink:P1][c=memory p= s= u=]
I confirmed the increase in GrallocTextureClientOGL objects on a v1.4 nexus-4, but it does not happen on a master nexus-4.
Flags: needinfo?(sotaro.ikeda.g)
With tiling disabled, the increase in GrallocTextureClientOGL objects did not happen.
It is now clear that the GrallocTextureClientOGL leak is related to Bug 982339.
Depends on: 982339
(In reply to Sotaro Ikeda [:sotaro] from comment #34)
> It is now clear that the GrallocTextureClientOGL leak is related to Bug 982339.

To apply Bug 982339 on b2g v1.4, Bug 985302 is also necessary.
Depends on: 985302
After applying Bug 982339 and Bug 985302, I no longer saw the file descriptor count increase. However, while the test was running, the app stopped working because of a pipe error on IPC.
(In reply to Sotaro Ikeda [:sotaro] from comment #36)
> After applying Bug 982339 and Bug 985302, I no longer saw the file descriptor
> count increase. However, while the test was running, the app stopped working
> because of a pipe error on IPC.

The same happens on master b2g.
Depends on: 1000525
(In reply to Sotaro Ikeda [:sotaro] from comment #37)
> (In reply to Sotaro Ikeda [:sotaro] from comment #36)
> > After applying Bug 982339 and Bug 985302, I no longer saw the file descriptor
> > count increase. However, while the test was running, the app stopped working
> > because of a pipe error on IPC.
> 
> The same happens on master b2g.

Created Bug 1000525 to handle the problem in Comment 36.
Flags: needinfo?(pchang)

Updated

5 years ago
Whiteboard: [cr 651835][MemShrink:P1][c=memory p= s= u=] → [c=memory p= s= u=1.4] [cr 651835][MemShrink:P1]
Sotaro,

Can you please confirm that bug 1000525 fixes the issue here?

What would be pending once bug 1000525 lands?
Flags: needinfo?(sotaro.ikeda.g)
Depends on: 1004191
We also need bug 1004191.
Flags: needinfo?(sotaro.ikeda.g)
Whiteboard: [c=memory p= s= u=1.4] [cr 651835][MemShrink:P1] → [c=memory p= s= u=1.4] [cr 651835][MemShrink:P1][ETA 5/2]
With Bug 982339 and Bug 1004191, the memory leak around gfx layers seems to be fixed, but there are still leaks in other areas. While running the test, I still saw the WindowTest app's VSS and RSS increase. One thing I noticed is that TabChild is leaking during the test: TabChild's destructor was not called.
(In reply to Sotaro Ikeda [:sotaro] from comment #41)
> One thing I noticed is
> that TabChild is leaking during the test: TabChild's destructor was not called.

I confirmed that TabChild::RecvDestroy() is called.
Depends on: 1004630
(In reply to Sotaro Ikeda [:sotaro] from comment #41)
> With Bug 982339 and Bug 1004191, the memory leak around gfx layers seems to be
> fixed, but there are still leaks in other areas. While running the test, I
> still saw the WindowTest app's VSS and RSS increase. One thing I noticed is
> that TabChild is leaking during the test: TabChild's destructor was not called.

Bug 1004630 was created for the TabChild leak.
Component: Graphics: Layers → Performance
Product: Core → Firefox OS
Version: 30 Branch → unspecified
A separate bug has been filed for each of the gfx leaks, and all of them are close to being fixed. Returning this bug's component to Firefox OS Performance.
Assignee: sotaro.ikeda.g → nobody
I did not see the leak around gfx layers after locally applying the ongoing fixes.
(In reply to Sotaro Ikeda [:sotaro] from comment #45)
> I did not see the leak around gfx layers after locally applying the ongoing
> fixes.

Hi Sotaro,

Which local ongoing fixes are we using?

Please add bug numbers. Can this blocker be closed now?
Flags: needinfo?(milan)
This can't be closed until the blocking bugs are closed.  One remaining graphics bug, close to landing, is bug 1004191, but there is also DOM bug 1004630 that's needed.
Flags: needinfo?(milan)
No longer depends on: 1004630
Changing the title to include META so that we are clear about why it isn't assigned to anybody specific.
Keywords: meta
Summary: window.open()/.close() memory leak → [META] window.open()/.close() memory leak

Updated

5 years ago
Depends on: 1004630
(In reply to Milan Sreckovic [:milan] from comment #40)
> We also need bug 1004191.

Not anymore.
No longer depends on: 1004191
Let's see if we can close this.  All the dependent bugs have landed on all relevant branches (note that bug 1004630 didn't need an uplift to b2g30, as it was introduced in 31).  So, this should now be fixed in 1.4 (30), and 2.0 (32) (also at 31, but there is no b2g version for that.)

With that in mind, if we can repeat the original test, it would help us confirm the fix, or see if there is something we missed.  Michael, can you arrange for that?
Flags: needinfo?(mvines)
I just checked our automation and this test is now green again over here so looks like we're done.  Thanks!
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Flags: needinfo?(mvines)
Resolution: --- → FIXED
Target Milestone: --- → 2.0 S1 (9may)
Flags: in-moztrap?(ychung)
Flags: in-moztrap?(ychung) → in-moztrap?(rmead)
No STR is present from which to create a test case for this bug.
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
Flags: in-moztrap?(rmead)
Flags: in-moztrap-
See Also: → 1137875