Closed Bug 981871 Opened 6 years ago Closed 6 years ago

b2g process take large amount of memory and causes OOM

Categories

(Firefox OS Graveyard :: Performance, defect, P1, critical)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:1.3+, b2g-v1.3 fixed, b2g-v1.3T fixed, b2g-v1.4 fixed, b2g-v2.0 fixed)

RESOLVED FIXED
1.4 S5 (11apr)
blocking-b2g 1.3+
Tracking Status
b2g-v1.3 --- fixed
b2g-v1.3T --- fixed
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed

People

(Reporter: tkundu, Assigned: bkelly)

References

Details

(Keywords: perf, Whiteboard: [caf priority: p1][CR 615380][MemShrink:P1][c=memory p=4 s= u=1.3])

Attachments

(15 files)

789.39 KB, text/plain
Details
603.98 KB, text/plain
Details
8.36 MB, application/x-bzip
Details
8.01 MB, application/x-bzip
Details
7.92 MB, application/x-bzip
Details
9.00 KB, application/vnd.ms-excel
Details
2.78 MB, application/x-bzip
Details
4.87 MB, application/x-bzip
Details
6.13 MB, application/x-bzip
Details
5.74 MB, application/x-bzip
Details
6.17 MB, application/x-bzip
Details
4.96 MB, application/x-bzip
Details
4.70 MB, application/x-bzip
Details
4.87 MB, application/x-bzip
Details
1.25 KB, text/plain
Details
We are seeing that b2g process is taking >~80MB memory and it does not decrease memroy  even if we keep device idle. 

We can see OOM quickly after 1 day of stability testing in 256 MB device. For 512MB device, we can see big memory growth which is unexpected and it does not goes away if we keep device idle.

I attached gecko memory report for both unaffected and affected 512MB device.
Curious - are you also seeing bug 980559 with the device idle when reproducing this bug? I'm wondering if the two issues are related to each other.
Keywords: perf
Whiteboard: [CR 615380] → [CR 615380][MemShrink]
I can debug it further quickly if anyone can point me in right direction from gecko memory report.(In reply to Jason Smith [:jsmith] from comment #3)
> Curious - are you also seeing bug 980559 with the device idle when
> reproducing this bug? I'm wondering if the two issues are related to each
> other.

No. I didn't see that . I can debug it further quickly if anyone can point me in right direction from gecko memory report.
blocking-b2g: --- → 1.3?
A quick diff between the two attached reports shows some nasty stuff going on: heap-unclassified is *significantly* larger and there's also a lot additional overhead in the allocator. There are also other differences that might simply point to different usage scenarios, when were the two about:memory dumps grabbed? Also, did you try minimizing the memory to see if the extra growth went away? You can do it when grabbing the about:memory data by passing the '--minimize' option to get_about_memory.py
(In reply to Gabriele Svelto [:gsvelto] from comment #5)
> A quick diff between the two attached reports shows some nasty stuff going on: heap-unclassified is *significantly* larger and there's also a lot additional overhead in the allocator.
How do we analyze it more ?

> There are also other differences that might simply point to different usage scenarios, when were the two about:memory dumps grabbed? 
One is grabbed after 1.5 days of stability testing and another is grabbed just after rebooting device and launch camera, sms, dialer app.

> Also, did you try minimizing the memory to see if the extra growth went away? 
Sorry  I didn't try it yet. Its a good point. I will try and update soon
Flags: needinfo?(gsvelto)
Please note "affected 512MB device memory-reports-1394428057.json" is collected after keeping device idle for 20 mins. But b2g process memory usage didn't go down even if we keep it idle. So I think that ''--minimize' option will not help here. But still I will try to do this in my next stability run (it will take at least 1 day to confirm it).
Let me see if I can reproduce and get a DMD report.  Any particular STR?
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
blocking-b2g: 1.3? → 1.3+
(In reply to Tapas Kumar Kundu from comment #6)
> How do we analyze it more ?

You need a DMD build to figure out where memory is going, you can find instructions here:

https://wiki.mozilla.org/Performance/MemShrink/DMD

> One is grabbed after 1.5 days of stability testing and another is grabbed
> just after rebooting device and launch camera, sms, dialer app.

What happens during the stability testing? Which apps are run and how frequently?


(In reply to Tapas Kumar Kundu from comment #7)
> So I think that ''--minimize' option will not help here.

It will force delivery of memory pressure events to all components so it's likely to flush even caches that are normally not flushed as part of the regular GC/CC cycles.
Flags: needinfo?(gsvelto)
Tapas indicated on IRC last night that his build has DMD enable.  I gave him the get_about_memory.py command to get them.  On my machine I use:

  MOZ_IGNORE_NUWA_PROCESS=1 ./tools/get_about_memory.py --no-gc-cc-log --gecko-objdir /srv/mozilla-central/objdir-gonk-hamachi-m-c

The important thing is to set the --gecko-objdir if you are using an external gecko directory.

He's going to try to collect some reports and post here.
Whiteboard: [CR 615380][MemShrink] → [CR 615380][MemShrink][c=memory p=4 s= u=]
It takes some time to reproduce this issue again .But I will update here soon with DMD logs.
Memory growth report for 1st One hour of stability testing. It has DMD logs.
It contains additional  memory growth report which we got after collecting [1] for same device . I uploaded twice because I have limitation in uploading big files in bugzilla

[1] 8389611: memory_growth_in_1st_hours.tar.bz2
dmesg, logcat during full 2 hour 20mins stability testing for [1] [2]

[1]  attachment 8389611 [details]: memory_growth_in_1st_hours.tar.bz2
[2]  attachment 8389613 [details]: memory_growth_in_next_1hour_20mins.tar.bz2
Attached file mem_growth.xls
(In reply to Ben Kelly [:bkelly] from comment #10) 
>   MOZ_IGNORE_NUWA_PROCESS=1 ./tools/get_about_memory.py --no-gc-cc-log
> --gecko-objdir /srv/mozilla-central/objdir-gonk-hamachi-m-c
> 

Hi I used same command and collected memory growth for it. STR may be very difficult for you. I am still running stability testing on test device. I will upload another memory report once I see huge USS memory growth for b2g (ex: 100MB).

STR:
1) Run camera, camcorder, call, sms, video recording, wifi on/off, airplane on/off, bluetooth on/off, airplane on/off for 48hours on device. Please note that this issue comes even if we don't use telephony call/sms.
2) Check available memory on device after 48 hours

I followed above STR but I killed all apps every 20mins and collected memory report so that I can see actual growth of b2g within 2 hours 20mins. 

I can upload another gecko memory report with DMD logs which will contain big growth for b2g . But it will take sometime like >1 day to reproduce. Please let me know if you want me to do any experiment/collect any logs. I can reproduce it consistently on my test device.

I also attached one excel sheet which just summarizes memory growth data from 
[1] [2] and shows that which is growing fast in gecko memory report [1][2].

[1] attachment 8389611 [details]: memory_growth_in_1st_hours.tar.bz2
[2] attachment 8389613 [details]: memory_growth_in_next_1hour_20mins.tar.bz2
Flags: needinfo?(bkelly)
I am going to move this over to Runtime to get it out of the General component. This should probably be in a new "Performance" component.
Component: General → Runtime
Thanks Tapas, I'm looking at these now.  From what you have said, though, so far these do not include the huge increase yet.

Can you say what device you are testing with?  Also, what gecko revision was this built from?

Thanks again for collecting these!
Flags: needinfo?(bkelly) → needinfo?(tkundu)
I see a few media stacks that are new, but nothing out of control.  I suspect its from the keyboard audio sounds.

I also see the stack below increasing over time, but also not to a ridiculous amount.  Its unclear if this is normal.  Note, this has a large amount of allocation slop, so could explain the degradation in allocation overhead as well.

We'll have to see what later DMD reports show, though.

Unreported: 7 blocks in stack trace record 35 of 897
 57,344 bytes (32,424 requested / 24,920 slop)
 0.19% of the heap (16.26% cumulative);  0.52% of unreported (43.69% cumulative)
 Allocated at
   replace_malloc /local/mnt/workspace/tkundu/tip_6thMarch/gecko/memory/replace/dmd/DMD.cpp:1247 (0xb6f41e6c libdmd.so+0x3e>
   _cairo_scaled_font_allocate_glyph /local/mnt/workspace/tkundu/tip_6thMarch/gecko/gfx/cairo/cairo/src/cairo-scaled-font.c>
   _cairo_scaled_font_glyph_device_extents /local/mnt/workspace/tkundu/tip_6thMarch/gecko/gfx/cairo/cairo/src/cairo-scaled->
   _cairo_composite_rectangles_init_for_glyphs /local/mnt/workspace/tkundu/tip_6thMarch/gecko/gfx/cairo/cairo/src/cairo-com>
   _cairo_image_surface_glyphs /local/mnt/workspace/tkundu/tip_6thMarch/gecko/gfx/cairo/cairo/src/cairo-image-surface.c:407>
   _cairo_surface_show_text_glyphs /local/mnt/workspace/tkundu/tip_6thMarch/gecko/gfx/cairo/cairo/src/cairo-surface.c:2769 >
   _cairo_gstate_show_text_glyphs /local/mnt/workspace/tkundu/tip_6thMarch/gecko/gfx/cairo/cairo/src/cairo-gstate.c:1992 (0>
   _moz_cairo_show_glyphs /local/mnt/workspace/tkundu/tip_6thMarch/gecko/gfx/cairo/cairo/src/cairo.c:3533 (0xb61f28bc libxu>
   ~_Vector_base /local/mnt/workspace/tkundu/tip_6thMarch/gecko/build/stlport/stlport/stl/_vector.h:87 (0xb625e64e libxul.s>
   GlyphBufferAzure::Flush(mozilla::gfx::DrawTarget*, gfxTextContextPaint*, mozilla::gfx::ScaledFont*, DrawMode, bool, mozi>
   gfxFont::Draw(gfxTextRun*, unsigned int, unsigned int, gfxContext*, DrawMode, gfxPoint*, gfxFont::Spacing*, gfxTextConte>
   ~nsTArray /local/mnt/workspace/tkundu/tip_6thMarch/out/target/product/msm8610/obj/objdir-gecko/dist/include/nsTArray.h:1>
   gfxTextRun::Draw(gfxContext*, gfxPoint, DrawMode, unsigned int, unsigned int, gfxTextRun::PropertyProvider*, double*, gf>
   DrawTextRun /local/mnt/workspace/tkundu/tip_6thMarch/gecko/layout/generic/nsTextFrame.cpp:5928 (0xb5f43610 libxul.so+0xc>
   nsTextFrame::DrawTextRun(gfxContext*, gfxPoint const&, unsigned int, unsigned int, PropertyProvider&, unsigned int, doub>
   nsTextFrame::DrawText(gfxContext*, gfxRect const&, gfxPoint const&, gfxPoint const&, unsigned int, unsigned int, Propert>
   ~nsTextPaintStyle /local/mnt/workspace/tkundu/tip_6thMarch/gecko/layout/generic/nsTextFrame.cpp:306 (0xb5f4c92a libxul.s>
   ~gfxContextAutoDisableSubpixelAntialiasing /local/mnt/workspace/tkundu/tip_6thMarch/out/target/product/msm8610/obj/objdi>
   mozilla::FrameLayerBuilder::PaintItems(nsTArray<mozilla::FrameLayerBuilder::ClippedDisplayItem>&, nsIntRect const&, gfxC>
   mozilla::FrameLayerBuilder::DrawThebesLayer(mozilla::layers::ThebesLayer*, gfxContext*, nsIntRegion const&, mozilla::lay>
   nsIntRegion /local/mnt/workspace/tkundu/tip_6thMarch/out/target/product/msm8610/obj/objdir-gecko/dist/include/nsRegion.h>
   mozilla::layers::ClientThebesLayer::PaintThebes() /local/mnt/workspace/tkundu/tip_6thMarch/gecko/gfx/layers/client/Clien>
   mozilla::RefPtr<mozilla::layers::ContentClient>::operator->() const /local/mnt/workspace/tkundu/tip_6thMarch/out/target/>
   mozilla::layers::ClientContainerLayer::RenderLayer() /local/mnt/workspace/tkundu/tip_6thMarch/gecko/gfx/layers/client/Cl>
That stack in comment 18 is probably normal.  Cairo appears to continue allocating glyph pages as you fill up its cache.

I think we need the DMD reports after the large memory spike occurs.

Thanks again for collecting the reports Tapas.
Component: Runtime → Performance
(In reply to Ben Kelly [:bkelly] from comment #19)
> That stack in comment 18 is probably normal.  Cairo appears to continue
> allocating glyph pages as you fill up its cache.
> 
> I think we need the DMD reports after the large memory spike occurs.
> 
> Thanks again for collecting the reports Tapas.

Here it is .I collected it after 13 hours of stability testing. It shows b2g has USS 91MB and we have DMD reports for it. 

I killed all open apps, stopped stability testing and kept device idle for 10 mins before collecting logs.

Can you please analyze it and let me know next steps from here. 
Thanks for your help
Flags: needinfo?(tkundu) → needinfo?(bkelly)
Looks like we have issues in DeviceStorageFile.  I'll need to dig in further.
Flags: needinfo?(bkelly)
More specifically, DeviceStorageFile::CollectFiles() in nsDeviceStorage.cpp around line 1089.
Hmm, something crazy going on with the marionette listener and the system.gaiamobile.org compartments as well.

In about-memory-0:

36 (100.0%) -- js-main-runtime-compartments
├──30 (83.33%) ++ system
└───6 (16.67%) -- user
    ├──1 (02.78%) ── app://keyboard.gaiamobile.org/index.html#en
    ├──1 (02.78%) ── app://system.gaiamobile.org/index.html
    ├──1 (02.78%) ── moz-nullprincipal:{2acec792-84c1-40e4-8208-01822af32d50}
    ├──1 (02.78%) ── moz-nullprincipal:{5905c070-64dd-4b5a-9df5-5d0d8929a9dc}
    ├──1 (02.78%) ── moz-nullprincipal:{f73933d4-0ee0-4fee-bc30-ac3e0f5b6aed}
    └──1 (02.78%) ── null-principal

In about-memory-7:

44 (100.0%) -- js-main-runtime-compartments
├──30 (68.18%) ++ system
└──14 (31.82%) -- user
   ├───8 (18.18%) ── app://system.gaiamobile.org/index.html, [anonymous sandbox] (from: chrome://marionette/content/marionette-listener.js:354) [8]
   ├───1 (02.27%) ── app://keyboard.gaiamobile.org/index.html#en
   ├───1 (02.27%) ── app://system.gaiamobile.org/index.html
   ├───1 (02.27%) ── moz-nullprincipal:{09132c3c-f3fb-4e6d-886e-be3df33ab97f}
   ├───1 (02.27%) ── moz-nullprincipal:{2acec792-84c1-40e4-8208-01822af32d50}
   ├───1 (02.27%) ── moz-nullprincipal:{5905c070-64dd-4b5a-9df5-5d0d8929a9dc}
   └───1 (02.27%) ── null-principal

In about memory 39:

86 (100.0%) -- js-main-runtime-compartments
├──56 (65.12%) -- user
│  ├──49 (56.98%) ── app://system.gaiamobile.org/index.html, [anonymous sandbox] (from: chrome://marionette/content/marionette-listener.js:354) [49]
│  ├───1 (01.16%) ── app://keyboard.gaiamobile.org/index.html#en
│  ├───1 (01.16%) ── app://system.gaiamobile.org/index.html
│  ├───1 (01.16%) ── moz-nullprincipal:{0e1c3d05-773a-4056-9f9e-8b2626260262}
│  ├───1 (01.16%) ── moz-nullprincipal:{5114e49d-04d1-44b4-a321-da7233d0346a}
│  ├───1 (01.16%) ── moz-nullprincipal:{6092f843-41c8-427d-beb4-ed042654201a}
│  ├───1 (01.16%) ── moz-nullprincipal:{d9a101c0-be76-46af-80c9-6b4688266635}
│  └───1 (01.16%) ── null-principal
└──30 (34.88%) ++ system
For #comment 17:

I am using v1.3 gaia/gecko and MSM8610 WVGA device. I collected DMD logs for following gaia/gecko revision. And I am sure that it is also present in latest v1.3 tip.

gecko revision: https://www.codeaurora.org/cgit/quic/lf/b2g/mozilla/gecko/commit/?h=mozilla/v1.3&id=fcf711bbb0affff0925cb133c82073998bb46abf

gaia revision: https://www.codeaurora.org/cgit/quic/lf/b2g/mozilla/gaia/commit/?h=mozilla/v1.3&id=e8bf7326ce5ed63bc8ef8b2cff1eba094887b9bf
It seems like we are leaking sandboxes when |msg.json.newSandbox| is set.

Tapas, Malini, do you know if this is setting just started getting used or if there are other changes recently around this.  I can investigate a bit more tonight.
Flags: needinfo?(tkundu)
Flags: needinfo?(mdas)
(In reply to Ben Kelly [:bkelly] from comment #25)
> It seems like we are leaking sandboxes when |msg.json.newSandbox| is set.
> 
> Tapas, Malini, do you know if this is setting just started getting used or
> if there are other changes recently around this.  I can investigate a bit
> more tonight.

This has been around for a while, and we've been setting newSandbox to true by default for all js tests, and any test that uses execute_script or execute_async_script.

We don't explicitly delete the older sandbox created, we reassign the variable we use instead (https://mxr.mozilla.org/mozilla-central/source/testing/marionette/marionette-listener.js#589), I don't see any other places we store the sandbox, so I'm not sure where it would be leaking at first glance.
Flags: needinfo?(mdas)
Well, there are a number of anonymous functions here which could hang onto it in their closures:

  https://mxr.mozilla.org/mozilla-central/source/testing/marionette/marionette-listener.js#360
And "for a while", I mean at least a year. Is this a new memory issue?
 Malini Das has answered it . So clearing NI for me :)
Flags: needinfo?(tkundu)
(In reply to Ben Kelly [:bkelly] from comment #27)
> Well, there are a number of anonymous functions here which could hang onto
> it in their closures:
> 
>  
> https://mxr.mozilla.org/mozilla-central/source/testing/marionette/marionette-
> listener.js#360

Interesting, I missed that. I'm not sure if this will cause it to persist, I'm not that familiar with how the references are counted. How can I go about checking this?
So in order to figure out why we are leaking the marionette sandbox objects we probably need the GC and CC logs.  Unfortunately I stupidly gave a command line which disables these.

Tapas, if you still have the device in the high memory state, can you run this command:


  MOZ_IGNORE_NUWA_PROCESS=1 ./tools/get_about_memory.py --gecko-objdir /srv/mozilla-central/objdir-gonk-hamachi-m-c

Note, I removed the --no-gc-cc-log flag.
Flags: needinfo?(tkundu)
Is there a chance that this doesn't happen in production and is an automation artifact?
Yes that's where we are trending it seems.  But even if so, without a fix we can't stress the device enough to meet the goal.
Sure, we can totally fix this, just trying to understand the risk.
(In reply to Ben Kelly [:bkelly] from comment #31)
> So in order to figure out why we are leaking the marionette sandbox objects
> we probably need the GC and CC logs.  Unfortunately I stupidly gave a
> command line which disables these.
> 
> Tapas, if you still have the device in the high memory state, can you run
> this command:
> 
> 
>   MOZ_IGNORE_NUWA_PROCESS=1 ./tools/get_about_memory.py --gecko-objdir
> /srv/mozilla-central/objdir-gonk-hamachi-m-c
> 
> Note, I removed the --no-gc-cc-log flag.

Unfortunately, Device is not in same state now. I need to reproduce it again and I will update here soon.
Tapas, what kind of test is this? Is this a monkey test with random touch inputs?
(In reply to Tapas Kumar Kundu from comment #35)
> Unfortunately, Device is not in same state now. I need to reproduce it again
> and I will update here soon.

From comment 23 we were already starting to leak some sandboxes even from your about-memory-7.  So a memory report from a few hours in may be enough.  Look for lines like this with numbers greater than 1 in the memory report:

   ├───8 (18.18%) ── app://system.gaiamobile.org/index.html, [anonymous sandbox] (from: chrome://marionette/content/marionette-listener.js:354) [8]
See Also: → 979173
(In reply to Andreas Gal :gal from comment #36)
> Tapas, what kind of test is this? Is this a monkey test with random touch
> inputs?
Andreas, These are not monkey tests. These are stability stress tests where some set of tests are performed over and over again involving various applications simultaneously.
Ok thanks thats very useful to know to diagnose this.
(In reply to Ben Kelly [:bkelly] from comment #37)

Here it is . This is after 1 hour 10mins. I am still trying to reproduce very high memory growth (80MB - 100MB) of b2g process .I will upload it here soon. Please let me know if you want me to do any experiment/collect any new logs.
Flags: needinfo?(bkelly)
It appears that the tests are somehow adding a |test1| attribute to the system app's global window.  This attribute then contains a |songs| array which references back to the sandbox.

Tapas, Inder, is the source for these tests available anywhere?  Do you see any code that might be doing "test1 = " instead of "var test1 = "?

Here is one of the object chains referencing from analyzing the gc log:

    via mCallback :
    0xb1d649d0 [Object <no private>]
        --[handleEvent]--> 0xb1d38fe0 [Function exports.KeyboardHelper.handleEvent]
        --[lazyScript]--> 0xb1d6a1f0 [lazyscript]
        --[enclosingScope]--> 0xb1d38b60 [Function]
        --[script]--> 0xb1dbc808 [script app://system.gaiamobile.org/shared/js/keybo]
        --[objects[45]]--> 0xb1d3f200 [Function kh_getLayouts]
        --[script]--> 0xb1dcc448 [script app://system.gaiamobile.org/shared/js/keybo]
        --[objects[0]]--> 0xb1d3f220 [Function withApps]
        --[script]--> 0xb1dcce98 [script app://system.gaiamobile.org/shared/js/keybo]
        --[objects[0]]--> 0xb1d3f240 [Function eachApp]
        --[script]--> 0xb1dccf10 [script app://system.gaiamobile.org/shared/js/keybo]
        --[objects[1]]--> 0xb1d3f260 [Function eachApp/<]
        --[fun_callscope]--> 0xb1d2a0d0 [Window ad2bbbe0]
        --[test1]--> 0xb1d3bbb0 [Proxy <no private>]
        --[private]--> 0xab8e5fd0 [Object <no private>]
        --[songs]--> 0xa65000b0 [Array <no private>]
        --[type]--> 0xab8ed2a0 [type_object]
        --[type_proto]--> 0xaa09e7c0 [Array <no private>]
        --[shape]--> 0xaa0a3a48 [shape]
        --[base]--> 0xa8ce8178 [base_shape]
        --[base]--> 0xa8cb3ba0 [base_shape]
        --[parent]--> 0xa659d430 [Sandbox acce1e64]
Flags: needinfo?(bkelly) → needinfo?(ikumar)
Little historic anecdote: "test1 = " vs "var test1 = " infamously lead to browsers massively optimizing global variable access during the earlier days of the browser performance wars because in a tight loop the author of the sunspider benchmark accidentally forgot a "var" declaration and was using global variables as a result. Global variables live on the window/global object (x = 1; window.x === x) and are notoriously difficult to optimize in the compiler due to this aliasing. Modules in the next edition of JavaScript will remove the window object from the global scope, finally putting an end to what is most likely the sharpest edge of JavaScript.
I think the test1 issue needs to be fixed, but what I find suspicious is that we also started having issues with b2g-endurance tests leaking sandboxes recently as well.  See bug 976584.

There it appears to be a separate set of references holding on to the Sandbox, but did we really create two leaks almost simultaneously?  Or is something else going on here?
Depends on: 976584
Flags: needinfo?(tkundu)
> Tapas, Inder, is the source for these tests available anywhere?  Do you see
> any code that might be doing "test1 = " instead of "var test1 = "?
> 
Ben, unfortunately the test scripts can't be shared but we did find test1 variables in global scope. Tapas will provide the details soon. He is fixing the scripts and trying to see if it fixes this issue. As you said, it looks like we may still have some more issues in bug 976584.
Flags: needinfo?(ikumar)
Hi ben,

We are doing something like this: 

1) Call Init() function from python script client.execute_async_script('return init()' ) to initialize song list

pseudo-code: 
var global_struct = {
    songs: [],
    currIndex: 0,
    timedOut: false,
    onEnded: false
};

function init() {
 a) Detect audio files on sdcard and push all audio song data in  global_struct Array using
    global_struct.songs.push(songData);
 b) Initialize test1 variable as 
    window.wrappedJSObject.test1 = global_struct;
    This creates intializes test1 variable . we this variable to access  global_struct later
}

2) Call get_No_Of_Songs() from python script using client.execute_async_script('return get_No_Of_Songs()')
 a) Use window.wrappedJSObject.test1 to access global_struct variable in javascropt code and return number of songs. For ex:
    function getNoOfSongs() {
      return window.wrappedJSObject.test1.songs.length;
    }


Can you please suggest best way to do it without any leak in current v1.3 gecko framework?
Flags: needinfo?(bkelly)
Please note that we are doing similar activities like #comment 45 for music , camcorder etc again and again for 24-48 hours. Please suggest the best alternatives to avoid mem leak. Thanks a lot
So its still not clear what changed recently, but let me address your question.  I'm starting to wonder why an array would be keeping a reference back to the Sandbox.  As far as I can tell it looks like this:

  songs.type.type_proto.shape.base.base.parent === sandbox

Anyway, to your question, I think there are a couple avenues you could take.

1) If you want to keep state between tests on the device then perhaps don't use a separate sandbox for each script execution.  To do this use |client.execute_async_script(new_sandbox=False)|.

2) Alternatively, to maintain the safety of script sandboxes between tests, keep the state information in your python driver by doing:
a) Pass current state in using |client.execute_async_script(script_args=stateObj)|.
b) Return the updated state object back to the python driver with |marionetteScriptFinished(stateObj)|.

However, I have never written a marionette script, so I may be wrong.

Malini, Jonathon, do you have any recommendations for Tapas on the best way to keep state across multiple marionette async script executions?
Flags: needinfo?(mdas)
Flags: needinfo?(jgriffin)
Flags: needinfo?(bkelly)
No longer depends on: 976584
So I realized bug 976584 is mozilla-central and this is v1.3.  That and the fact that the gc/cc logs look completely different mean they are probably unrelated.
(In reply to Ben Kelly [:bkelly] from comment #47)
> I'm starting to wonder why an array would be keeping a reference
> back to the Sandbox.  As far as I can tell it looks like this:
> 
>   songs.type.type_proto.shape.base.base.parent === sandbox

Kyle Huey informs me that this is normal when array is created with a prototype owned by the sandbox.
Every JavaScript object conceptually links to the global object (window, sandbox) it was created within. This is necessary because every object has a prototype and the prototype object points to function objects which have a scope chain that ends with the global object.
If I understand the problem correctly, it looks like you're creating an implicit reference between the sandbox and the window object by doing:

var foo = {/*something*/}; // belongs to sandbox
window.wrappedJSObject.test1 = foo;  // belongs to window

Normally, a sandbox is cleaned up at the end of execute_script, but this prevents that from happening.

The suggestions that Ben gave are good; you could avoid creating multiple sandboxes in the first place by using 'execute_async_script(new_sandbox=False)', although there are cases where a new sandbox will be created for you anyway, such as when switching between OOP frames.

You could also maintain state on the Python client side, using the method described by Ben.

You could also store everything on 'window', instead of relying on variables defined in the sandbox:

window.wrappedJSObject.test1 = { songs:[], etc. };

i.e., avoid the 'var global_struct = ' declaration.
Flags: needinfo?(jgriffin)
(In reply to Jonathan Griffin (:jgriffin) from comment #51)

> You could also store everything on 'window', instead of relying on variables
> defined in the sandbox:
> 
> window.wrappedJSObject.test1 = { songs:[], etc. };
> 
> i.e., avoid the 'var global_struct = ' declaration.

Thanks for your help. I am fixing scripts using "window.wrappedJSObject.test1" approach and re-running this. I will update here soon.
(In reply to Jonathan Griffin (:jgriffin) from comment #51)
> You could also store everything on 'window', instead of relying on variables
> defined in the sandbox:
> 
> window.wrappedJSObject.test1 = { songs:[], etc. };
> 
> i.e., avoid the 'var global_struct = ' declaration.

Is this really something we support and recommend?  The impression I got from the #memshrink channel was that sticking stuff on wrappedJSObject was discouraged.

Tapas, if you do go this way I recommend you put a large comment about the problem we encountered here so that it doesn't accidentally get added back in the future.  :-)
Whiteboard: [CR 615380][MemShrink][c=memory p=4 s= u=] → [CR 615380][MemShrink][c=memory p=4 s= u=1.3]
We do support it, but whether it's recommended or not, I'd definitely defer to the people on #memshrink.
Flags: needinfo?(mdas)
Yea, we recommend keeping the state in the python driver if you are using separate sandboxes.  Its too easy to accidentally leak something across the sandbox boundary with |window.wrappedJSObject|.

So Tapas, it would be great if you guys could move to keeping state in python in the long term.
(In reply to Ben Kelly [:bkelly] from comment #55)
> Yea, we recommend keeping the state in the python driver if you are using
> separate sandboxes.  Its too easy to accidentally leak something across the
> sandbox boundary with |window.wrappedJSObject|.
> 
> So Tapas, it would be great if you guys could move to keeping state in
> python in the long term.

Thanks ben. I followed your suggestion and I am running stability test now. I will update you soon with data from that run . 

But I have some data for big memory growth in b2g for #comment 31 (with gc/cc logs) . Please note that it contains sandbox leaks. I just want to confirm whether you can find any other mem leak apart from sandbox leak by looking at this data.
Flags: needinfo?(bkelly)
Hi Ben,

I fixed sandbox issue in our script. But still I am seeing linear increase in b2g process USS over time . I generated this report after 8 hours of stability testing and b2g has reached 69MB in b2g-procrank. I killed all app and then taken memory report with 

|MOZ_IGNORE_NUWA_PROCESS=1 ./device/qcom/b2g_common/mozilla-b2g/tools/get_about_memory.py --minimize --gecko-objdir=./out/target/product/msm8610/obj/objdir-gecko|

I am still running stability testing. Can you please take a look in this report and suggest what is causing b2g to reach 69 within 8 hours ? I will update another report after seeing 80-100MB in b2g USS.
So the good news is it appears that we are not leaking sandboxes any more:

38 (100.0%) -- js-main-runtime-compartments
├──30 (78.95%) ++ system
└───8 (21.05%) -- user
    ├──2 (05.26%) ── app://system.gaiamobile.org/index.html, [anonymous sandbox] (from: chrome://marionette/content/marionette-listener.js:354) [2]

I do see a lot of child-process-manager references, though:

8,500 (100.0%) -- message-manager
└──8,500 (100.0%) -- referent
   ├──7,890 (92.82%) -- child-process-manager
   │  ├──5,760 (67.76%) -- weak
   │  │  ├──5,274 (62.05%) ── dead
   │  │  └────486 (05.72%) ── alive
   │  └──2,130 (25.06%) ── strong

And a lot of inner-window-destroyed observer references:

3,572 (100.0%) -- observer-service-suspect
├──3,066 (85.83%) ── referent(topic=inner-window-destroyed)
├────169 (04.73%) ── referent(topic=file-watcher-update)
├────169 (04.73%) ── referent(topic=volume-state-changed)
└────168 (04.70%) ── referent(topic=disk-space-watcher)

Let me see if I can find these in the gc/cc logs.
Flags: needinfo?(bkelly)
Hi Ben,

I attached a gecko  memory-report as I mentioned in #comment 57 . It shows that b2g USS Is 77MB and It is collected after 10 hours stability testing. It also does not have any sandbox memleak issue from our scripts. Please have a look in this and you can compare this with memory-report in #comment 57 to see what has increased between this period. Please suggest.
Thanks Tapas.  I did a diff between the two and it appears its those observer/message managers continuing to grow.  I think this is causing us to fragment memory space and hold on to a lot of unused GC arenas.

Delta:

2,321 (100.0%) -- message-manager
└──2,321 (100.0%) -- referent
   ├──2,175 (93.71%) -- child-process-manager
   │  ├──1,440 (62.04%) ── weak/dead
   │  └────735 (31.67%) ── strong
   └────146 (06.29%) ── global-manager/weak/dead

1,168 (100.0%) -- observer-service
└──1,168 (100.0%) -- referent
   ├────849 (72.69%) ── strong
   └────319 (27.31%) -- weak
        ├──305 (26.11%) ── dead
        └───14 (01.20%) ── alive

1,166 (100.0%) -- observer-service-suspect
├──1,004 (86.11%) ── referent(topic=inner-window-destroyed)
├─────54 (04.63%) ── referent(topic=disk-space-watcher)
├─────54 (04.63%) ── referent(topic=file-watcher-update)
└─────54 (04.63%) ── referent(topic=volume-state-changed)
It looks like some js api is not releasing things properly when a window is closed.

A few suspects:
https://mxr.mozilla.org/mozilla-central/source/dom/datastore/DataStore.jsm#96 (pretty sure this one doesn't unregister the observer)

https://mxr.mozilla.org/mozilla-central/source/toolkit/devtools/server/actors/webgl.js#370 (not sure when/how is stopListening() called)
https://mxr.mozilla.org/mozilla-central/source/toolkit/devtools/server/actors/storage.js#803 (may be ok also)

https://mxr.mozilla.org/mozilla-central/source/toolkit/devtools/DevToolsExtensions.jsm#46 (very suspiscious)
Depends on: 985042
I see 7682 SettingsManager objects.  Its unclear to me how these could be getting created if its only ever referenced via navigator.mozSettings.

Tapas, are you by any chance using mozSettings in your test script?  If so, could you show us the relevant code?
Flags: needinfo?(tkundu)
(In reply to Ben Kelly [:bkelly] from comment #62)
> I see 7682 SettingsManager objects.  Its unclear to me how these could be
> getting created if its only ever referenced via navigator.mozSettings.
> 
> Tapas, are you by any chance using mozSettings in your test script?  If so,
> could you show us the relevant code?

Yes, we are using mozSettings object to enable/disable wifi, bluetooth, airplane .

Here is the pseudo-code for bluetooth enabling function:

// Get SettingsManager object using webapi

   settings = window.navigator.mozSettings;
   if(!settings) {
      log("SettingsManager is null")
      marionetteScriptFinished(1);
      return;
   }
   try {
      // Enable Bluetooth
      bluetoothSet = {"bluetooth.enabled":true}
      set_lock =  settings.createLock();
      set_Req = set_lock.set(bluetoothSet);
      set_Req.onsuccess = function onsuccess(){
         console.log(set_Req.settingName);
         marionetteScriptFinished(0);
         return;
      };
      set_Req.onerror  = function onerror(){
         console.log(set_Req.settingName);
         marionetteScriptFinished(1);
         return
      };
   } catch(e) {
        log("Exception while enabling airplane mode:" , e.message)
        marionetteScriptFinished(1);
        return;
   }

Ben, Do we need to free up this SettingsManger object explicitly ? It should be taken care by GC . right ? 

We are enabling/disabling bluetooth, airplane, wifi by simple modification of above function. Please suggest next step here. By the way, we are not using any global variable anymore.
Flags: needinfo?(tkundu) → needinfo?(bbbkelly)
Flags: needinfo?(bbbkelly) → needinfo?(bkelly)
We talked in IRC and requested that Tapas try replacing:

   settings = window.navigator.mozSettings;

With

   var settings = window.navigator.mozSettings;

Right now, however, we are not confident that will solve the problem we see with the excessive SettingsManager objects.  Even something like this should only be creating one SettingsManager.
Flags: needinfo?(bkelly)
Whiteboard: [CR 615380][MemShrink][c=memory p=4 s= u=1.3] → [CR 615380][MemShrink:P1][c=memory p=4 s= u=1.3]
(In reply to Tapas Kumar Kundu from comment #63)
> (In reply to Ben Kelly [:bkelly] from comment #62)
> > I see 7682 SettingsManager objects.  Its unclear to me how these could be
> > getting created if its only ever referenced via navigator.mozSettings.
> > 
> > Tapas, are you by any chance using mozSettings in your test script?  If so,
> > could you show us the relevant code?
> 
> Yes, we are using mozSettings object to enable/disable wifi, bluetooth,
> airplane .
> 
> Here is the pseudo-code for bluetooth enabling function:
> 
> // Get SettingsManager object using webapi
> 
>    settings = window.navigator.mozSettings;
>    if(!settings) {
>       log("SettingsManager is null")
>       marionetteScriptFinished(1);
>       return;
>    }
>    try {
>       // Enable Bluetooth
>       bluetoothSet = {"bluetooth.enabled":true}
>       set_lock =  settings.createLock();
>       set_Req = set_lock.set(bluetoothSet);

Please don't use globals like this. All of this causes possible leaks. Please use vars everywhere for everything. Nothing should ever be an implicit global. Any global should be explicit via window.name. Its important that you fix this for all your test scripts since this is very brittle.

>       set_Req.onsuccess = function onsuccess(){
>          console.log(set_Req.settingName);
>          marionetteScriptFinished(0);
>          return;
>       };
>       set_Req.onerror  = function onerror(){
>          console.log(set_Req.settingName);
>          marionetteScriptFinished(1);
>          return
>       };
>    } catch(e) {
>         log("Exception while enabling airplane mode:" , e.message)
>         marionetteScriptFinished(1);
>         return;
>    }
> 
> Ben, Do we need to free up this SettingsManger object explicitly ? It should
> be taken care by GC . right ? 
> 
> We are enabling/disabling bluetooth, airplane, wifi by simple modification
> of above function. Please suggest next step here. By the way, we are not
> using any global variable anymore.
Yeah we have script cleanup to do real bad it seems.  Thanks alot for pointing this out and all the debug help earlier!  

/me blushes
Ben and I expect that there are issues to fix somewhere even after cleaning up those global variables.  That doesn't really explain all the symptoms we're seeing.
Agreed. The globals make it harder to track those down though. Datastore definitely looks broken. No idea why our leak tests didn't catch that.

/me blushes, too
We're bad at tracking small things like these, especially since wrapper cutting means the DataStore won't entrain a Window in any of our automated tests (which run with content privileges).
Also I'm pretty convinced now that strong observers are an antipattern we should get rid of.
Bug please :)
Tapas, you might also want to add 'use strict' to your various scripts to help track down these accidental globals and prevent them in the future.

  https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Functions_and_function_scope/Strict_mode

Just thought I would mention that as its saved me from this issue in the past.  :-)
(In reply to Ben Kelly [:bkelly] from comment #73)
> Tapas, you might also want to add 'use strict' to your various scripts to
> help track down these accidental globals and prevent them in the future.
> 
>  
> https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/
> Functions_and_function_scope/Strict_mode
> 
> Just thought I would mention that as its saved me from this issue in the
> past.  :-)

Sure. I will run with these corrections and update memory report here soon
I think bug 985827 is the root cause of the thousands of copies of the SettingsManager we have floating around.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #75)
> I think bug 985827 is the root cause of the thousands of copies of the
> SettingsManager we have floating around.

Thanks for your help. I attached memory report with all global variable and sandbox memleak fixes in our script  but this stability memory report does not contain fix from bug 985197 , bug 985827 and bug 985042 .  This memory report is collected after 5 hours of run and b2g has became 61MB.

I am trying fix from both bug 985042 and bug 985827 in new stability test. I think that bug 985197 has no fix available yet. 


I was thinking whether attached memory report can help you to find if we have any other leak issue in this bug or not.
Flags: needinfo?(khuey)
Flags: needinfo?(bkelly)
I don't think it's worth spending time on analyzing logs that don't have the fix for bug 985827.  It's a serious enough leak that the results will be quite different with it fixed.
Flags: needinfo?(khuey)
Flags: needinfo?(bkelly)
Blocks: 985042
No longer depends on: 985042
No longer blocks: 985042
Depends on: 985042
Can we start running this test again now that bugs 985042 and 985827 have landed on 1.3?
Flags: needinfo?(tkundu)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #78)
> Can we start running this test again now that bugs 985042 and 985827 have
> landed on 1.3?

Sure. Running it now.. I will update asap here. Thanks a lot.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #78)
> Can we start running this test again now that bugs 985042 and 985827 have
> landed on 1.3?

I am still running this test on device . This snapshot is collected after  9 hours of run. And It has fix from above bugs. I also make sure that there is no memleak in our script anymore.

I will upload another snapshot once device passes 24 hours in stability testing.
How does 56 MB after 9 hours compare to earlier?  Is that substantially better than we were doing?

I do see that there aren't 3000 SettingsManagers in the heap, so we fixed that issue at least.

Do we have an idea of what "acceptable" for 1.3 is here?
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #81)
> How does 56 MB after 9 hours compare to earlier?  Is that substantially
> better than we were doing?

Looking at past memory reports for this bug we ranged from ~36MB initially up to ~80+MB for later captures.

I guess we need another memory report here after some time period to see if we're stable at 56MB or continuing to degrade.
(In reply to Ben Kelly [:bkelly] from comment #82)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #81)
> > How does 56 MB after 9 hours compare to earlier?  Is that substantially
> > better than we were doing?

Yes . I agree. But I want to check whether it is completely fixed or not.

> Looking at past memory reports for this bug we ranged from ~36MB initially
> up to ~80+MB for later captures.
> 
> I guess we need another memory report here after some time period to see if
> we're stable at 56MB or continuing to degrade.

Here is another snapshot after 18 hours. I am still running test on that device . I will upload another report soon.
Can you please look into my last comment ?
Flags: needinfo?(khuey)
Flags: needinfo?(bkelly)
Yep, its on my todo list today.  I think Kyle may be traveling.
Skimming the logs, I see in the 56MB DMD log:

Unreported: ~40 blocks in stack trace record 5 of 825
 ~163,720 bytes (~163,720 requested / ~0 slop)
 0.53% of the heap (8.52% cumulative);  1.54% of unreported (25.02% cumulative)
 Allocated at
   replace_malloc /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/memory/replace/dmd/DMD.cpp:1247 (0xb6f9ae6c libdmd.so+0x3e6c)
   qc_loc_fw::OutMemoryStreamImpl::expand(unsigned int) /local/mnt/workspace/tkundu/tip_25thMarch/vendor/qcom/proprietary/gps-noship/util/native/base_util/src/memorystream.cpp:139 (0xb30f4d9c libqc_b2g_location.so+0x1ed9c)
   qc_loc_fw::OutMemoryStreamImpl::append(void const*, unsigned int) /local/mnt/workspace/tkundu/tip_25thMarch/vendor/qcom/proprietary/gps-noship/util/native/base_util/src/memorystream.cpp:170 (0xb30f4df2 libqc_b2g_location.so+0x1edf2)
   int qc_loc_fw::OutPostcardImpl::add<qc_loc_fw::SimpleNode<(unsigned short)18, unsigned char const> >(char const*, qc_loc_fw::SimpleNode<(unsigned short)18, unsigned char const> const&) /local/mnt/workspace/tkundu/tip_25thMarch/vendor/qcom/proprietary/gps-noship/util/native/base_util/src/postcard.cpp:347 (0xb30f2d20 libqc_b2g_location.so+0x1cd20)
   qc_loc_fw::OutPostcardImpl::addBool(char const*, bool const&) /local/mnt/workspace/tkundu/tip_25thMarch/vendor/qcom/proprietary/gps-noship/util/native/base_util/src/postcard.cpp:205 (0xb30f2d8c libqc_b2g_location.so+0x1cd8c)
   OSAgent::FillStatusUpdateMessage(qc_loc_fw::OutPostcard*) /local/mnt/workspace/tkundu/tip_25thMarch/vendor/qcom/proprietary/b2g_location/LocationService/OSAgent.cpp:775 (0xb30ef790 libqc_b2g_location.so+0x19790)
   SendStatusUpdate /local/mnt/workspace/tkundu/tip_25thMarch/vendor/qcom/proprietary/b2g_location/LocationService/OSAgent.cpp:820 (0xb30ef9a4 libqc_b2g_location.so+0x199a4)
   nsBaseHashtable<nsStringHashKey, nsAutoPtr<UpdateSubscriber>, UpdateSubscriber*>::s_EnumStub(PLDHashTable*, PLDHashEntryHdr*, unsigned int, void*) /local/mnt/workspace/tkundu/tip_25thMarch/out/target/product/msm8610/obj/objdir-gecko/dist/include/nsBaseHashtable.h:401 (0xb30eeb8c libqc_b2g_location.so+0x18b8c)
   PL_DHashTableEnumerate /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/xpcom/glue/pldhash.cpp:632 (0xb30f5c70 libqc_b2g_location.so+0x1fc70)
   OSAgent::TriggerStatusUpdate(bool) /local/mnt/workspace/tkundu/tip_25thMarch/vendor/qcom/proprietary/b2g_location/LocationService/OSAgent.cpp:842 (0xb30efbea libqc_b2g_location.so+0x19bea)
   OSAgent::NotifyMozSettingChangeBool(nsAString const&, bool) /local/mnt/workspace/tkundu/tip_25thMarch/vendor/qcom/proprietary/b2g_location/LocationService/OSAgent.cpp:1083 (0xb30efc9a libqc_b2g_location.so+0x19c9a)
   NS_InvokeByIndex /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp:165 (0xb4e85964 libxul.so+0x368964)
   CallMethodHelper::Call() /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/xpconnect/src/XPCWrappedNative.cpp:1909 (0xb536247e libxul.so+0x84547e)
   XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/xpconnect/src/XPCWrappedNativeJSOps.cpp:1301 (0xb5364226 libxul.so+0x847226)
   CallJSNative /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/src/jscntxtinlines.h:220 (0xb5c74800 libxul.so+0x1157800)
   Interpret /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/src/vm/Interpreter.cpp:2511 (0xb5c6c91a libxul.so+0x114f91a)
   RunScript /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/src/vm/Interpreter.cpp:420 (0xb5c745e8 libxul.so+0x11575e8)
   RunScript /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/src/vm/Interpreter.cpp:389 (0xb5c7478c libxul.so+0x115778c)
   js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value*, JS::MutableHandle<JS::Value>) /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/src/vm/Interpreter.cpp:519 (0xb5c75de2 libxul.so+0x1158de2)
   JS_CallFunctionValue(JSContext*, JSObject*, JS::Value, unsigned int, JS::Value*, JS::Value*) /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/src/jsapi.cpp:5002 (0xb5bd4500 libxul.so+0x10b7500)
   nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS*, unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/xpconnect/src/XPCWrappedJSClass.cpp:1413 (0xb535f20a libxul.so+0x84220a)
   nsXPCWrappedJS::CallMethod(unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/xpconnect/src/XPCWrappedJS.cpp:479 (0xb535d160 libxul.so+0x840160)
   PrepareAndDispatch /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/xpcom/reflect/xptcall/src/md/unix/xptcstubs_arm.cpp:107 (0xb4e861ec libxul.so+0x3691ec)
   SharedStub libgcc2.c:0 (0xb4e85980 libxul.so+0x368980)

and in the 63MB DMD log:

Unreported: ~80 blocks in stack trace record 3 of 841
 ~327,440 bytes (~327,440 requested / ~0 slop)
 0.89% of the heap (6.56% cumulative);  2.70% of unreported (19.85% cumulative)
 Allocated at
   replace_malloc /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/memory/replace/dmd/DMD.cpp:1247 (0xb6f9ae6c libdmd.so+0x3e6c)
   qc_loc_fw::OutMemoryStreamImpl::expand(unsigned int) /local/mnt/workspace/tkundu/tip_25thMarch/vendor/qcom/proprietary/gps-noship/util/native/base_util/src/memorystream.cpp:139 (0xb30f4d9c libqc_b2g_location.so+0x1ed9c)
   qc_loc_fw::OutMemoryStreamImpl::append(void const*, unsigned int) /local/mnt/workspace/tkundu/tip_25thMarch/vendor/qcom/proprietary/gps-noship/util/native/base_util/src/memorystream.cpp:170 (0xb30f4df2 libqc_b2g_location.so+0x1edf2)
   int qc_loc_fw::OutPostcardImpl::add<qc_loc_fw::SimpleNode<(unsigned short)18, unsigned char const> >(char const*, qc_loc_fw::SimpleNode<(unsigned short)18, unsigned char const> const&) /local/mnt/workspace/tkundu/tip_25thMarch/vendor/qcom/proprietary/gps-noship/util/native/base_util/src/postcard.cpp:347 (0xb30f2d20 libqc_b2g_location.so+0x1cd20)
   qc_loc_fw::OutPostcardImpl::addBool(char const*, bool const&) /local/mnt/workspace/tkundu/tip_25thMarch/vendor/qcom/proprietary/gps-noship/util/native/base_util/src/postcard.cpp:205 (0xb30f2d8c libqc_b2g_location.so+0x1cd8c)
   OSAgent::FillStatusUpdateMessage(qc_loc_fw::OutPostcard*) /local/mnt/workspace/tkundu/tip_25thMarch/vendor/qcom/proprietary/b2g_location/LocationService/OSAgent.cpp:775 (0xb30ef790 libqc_b2g_location.so+0x19790)
   SendStatusUpdate /local/mnt/workspace/tkundu/tip_25thMarch/vendor/qcom/proprietary/b2g_location/LocationService/OSAgent.cpp:820 (0xb30ef9a4 libqc_b2g_location.so+0x199a4)
   nsBaseHashtable<nsStringHashKey, nsAutoPtr<UpdateSubscriber>, UpdateSubscriber*>::s_EnumStub(PLDHashTable*, PLDHashEntryHdr*, unsigned int, void*) /local/mnt/workspace/tkundu/tip_25thMarch/out/target/product/msm8610/obj/objdir-gecko/dist/include/nsBaseHashtable.h:401 (0xb30eeb8c libqc_b2g_location.so+0x18b8c)
   PL_DHashTableEnumerate /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/xpcom/glue/pldhash.cpp:632 (0xb30f5c70 libqc_b2g_location.so+0x1fc70)
   OSAgent::TriggerStatusUpdate(bool) /local/mnt/workspace/tkundu/tip_25thMarch/vendor/qcom/proprietary/b2g_location/LocationService/OSAgent.cpp:842 (0xb30efbea libqc_b2g_location.so+0x19bea)
   OSAgent::NotifyMozSettingChangeBool(nsAString const&, bool) /local/mnt/workspace/tkundu/tip_25thMarch/vendor/qcom/proprietary/b2g_location/LocationService/OSAgent.cpp:1083 (0xb30efc9a libqc_b2g_location.so+0x19c9a)
   NS_InvokeByIndex /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp:165 (0xb4e85964 libxul.so+0x368964)
   CallMethodHelper::Call() /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/xpconnect/src/XPCWrappedNative.cpp:1909 (0xb536247e libxul.so+0x84547e)
   XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/xpconnect/src/XPCWrappedNativeJSOps.cpp:1301 (0xb5364226 libxul.so+0x847226)
   CallJSNative /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/src/jscntxtinlines.h:220 (0xb5c74800 libxul.so+0x1157800)
   Interpret /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/src/vm/Interpreter.cpp:2511 (0xb5c6c91a libxul.so+0x114f91a)
   RunScript /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/src/vm/Interpreter.cpp:420 (0xb5c745e8 libxul.so+0x11575e8)
   RunScript /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/src/vm/Interpreter.cpp:389 (0xb5c7478c libxul.so+0x115778c)
   js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value*, JS::MutableHandle<JS::Value>) /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/src/vm/Interpreter.cpp:519 (0xb5c75de2 libxul.so+0x1158de2)
   JS_CallFunctionValue(JSContext*, JSObject*, JS::Value, unsigned int, JS::Value*, JS::Value*) /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/src/jsapi.cpp:5002 (0xb5bd4500 libxul.so+0x10b7500)
   nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS*, unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/xpconnect/src/XPCWrappedJSClass.cpp:1413 (0xb535f20a libxul.so+0x84220a)
   nsXPCWrappedJS::CallMethod(unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/js/xpconnect/src/XPCWrappedJS.cpp:479 (0xb535d160 libxul.so+0x840160)
   PrepareAndDispatch /local/mnt/workspace/tkundu/tip_19thMarch_jb32/gecko/xpcom/reflect/xptcall/src/md/unix/xptcstubs_arm.cpp:107 (0xb4e861ec libxul.so+0x3691ec)
   SharedStub libgcc2.c:0 (0xb4e85980 libxul.so+0x368980)

which represents a growth of ~160K.  I don't know whether that's a leak, or whether that memory needs to be kept around or not.
(In reply to Nathan Froyd (:froydnj) from comment #86)
> which represents a growth of ~160K.  I don't know whether that's a leak, or whether that memory needs to be kept around or not

Looks like very little only 160KB growth in 9 hours !!. I still asked another person (in our internal team) to check it.
Looking at about:memory diffs, the things that stand out are:

- A ~1.5MB image increase:

├──1.65 MB (24.35%) ── images/content/used/raw

- A ~1.5MB heap-unclassified increase:

├──1.52 MB (22.40%) ── heap-unclassified

- A ~660KB heap waste increase:

├──0.66 MB (09.69%) ── heap-overhead/waste

- ~500KB unused-gc-things increase (!)

├──0.68 MB (10.01%) -- js-non-window
│  ├──0.68 MB (10.01%) -- zones/zone(0xNNN)
│  │  ├──0.57 MB (08.33%) ── unused-gc-things [3]

Andrew suggested trying to visualize stuff in the GC log to see why this is.
There's also ~2MB of layout and DOM growth:

│  │  ├──1.96 MB (28.88%) -- active/window(app://system.gaiamobile.org/index.html)
│  │  │  ├──0.94 MB (13.88%) -- layout
│  │  │  │  ├──0.63 MB (09.28%) -- frames
│  │  │  │  │  ├──0.29 MB (04.25%) ── nsHTMLScrollFrame
│  │  │  │  │  ├──0.16 MB (02.34%) ── nsBlockFrame
│  │  │  │  │  ├──0.09 MB (01.37%) ++ (2 tiny)
│  │  │  │  │  └──0.09 MB (01.32%) ── nsTextFrame
│  │  │  │  ├──0.15 MB (02.19%) ── pres-contexts
│  │  │  │  ├──0.14 MB (02.05%) ── line-boxes
│  │  │  │  └──0.02 MB (00.36%) ── pres-shell
│  │  │  ├──0.87 MB (12.88%) -- dom
│  │  │  │  ├──0.72 MB (10.67%) ── element-nodes
│  │  │  │  ├──0.15 MB (02.17%) ── text-nodes
│  │  │  │  └──0.00 MB (00.04%) ++ (2 tiny)

This seems unusual.  Is something holding on to some iframes or something and keeping these things alive?
Here's the top strings in the 8 hour log:
47168 :: 352 x '18588450210 Error Invalid Number. Please re-send using a valid 10 digit mobile number or valid short'
15000 :: 375 x 'app://sms.gaiamobile.org/manifest.webapp'
12950 :: 370 x 'app://sms.gaiamobile.org/index.html'
4037 :: 367 x '+1121611611'

Here they are from the second log:
101840 :: 760 x '18588450210 Error Invalid Number. Please re-send using a valid 10 digit mobile number or valid short'
31920 :: 798 x 'app://sms.gaiamobile.org/manifest.webapp'
27755 :: 793 x 'app://sms.gaiamobile.org/index.html'
8690 :: 790 x '+1121611611'

As you can see, we're holding onto hundreds of more copies of these strings.  I'll try to figure out what is holding them alive.
This is the most common path that is holding alive one of these strings.  There are 802 paths that look like this.  Eg these strings are held alive by many paths.

via mCallback :[Object] --[element]--> [HTMLDivElement] --[type]--> [type_object] --[type_proto]--> [HTMLDivElementPrototype] --[type]--> [type_object] --[type_proto]--> [HTMLElementPrototype] --[type]--> [type_object] --[type_proto]--> [ElementPrototype] --[type]--> [type_object] --[type_proto]--> [NodePrototype] --[constructor]--> [Function] --[shape]--> [shape] --[base]--> [base_shape] --[parent]--> [Window ae0e9970] --[SpecialPowers]--> [Proxy] --[private]--> [Object] --[type]--> [type_object] --[type_proto]--> [Object] --[type]--> [type_object] --[type_proto]--> [Object] --[hasPermission]--> [Function SpecialPowersAPI.prototype.hasPermission] --[fun_callscope]--> [ContentFrameMessageManager ad8c6610] --[BrowserElementPromptService]--> [Proxy] --[private]--> [Object] --[_browserElementParentMap]--> [WeakMap ad8c6460] --[WeakMap entry value]--> [Object] --[_frameElement]--> [Proxy] --[private]--> [HTMLIFrameElement] --[type]--> [type_object] --[type_proto]--> [HTMLIFrameElementPrototype] --[constructor]--> [Function] --[type]--> [type_object] --[type_proto]--> [Function] --[type]--> [type_object] --[type_proto]--> [Function] --[prototype]--> [ElementPrototype] --[type]--> [type_object] --[type_proto]--> [NodePrototype] --[shape]--> [shape] --[parent]--> [shape] --[parent]--> [shape] --[base]--> [base_shape] --[parent]--> [ChromeWindow b00cbe20] --[AlertsHelper]--> [Object] --[_listeners]--> [Object] --[app-notif-{4d5a50bf-0d23-41a4-87ed-cb85f100be18}]--> [Object] --[text]--> [string <length 134> 18588450210 Error Invalid Numb]

This is one of the callbacks that is holding alive a thing:
0xae198c40 [rc=3] FragmentOrElement (xhtml) input app://system.gaiamobile.org/index.html
> 0xaefb9060 mNodeInfo
> 0xaefb5fb0 GetParent()
> 0xaab3ddc0 [via hash] mListenerManager

The mCallback is an event listener or something.
Regarding comment 86, there is a leak in the Location code here:
OutPostcard* network_info = OutPostcard::createInstance();

This leak should be happening only when a network change happens, working on the fix.
Here's the most common types of stuff in the CC graph before and after.  we have about 500 more of a number of types of nodes.  another 1800 callback objects. etc.
The increase in DOM nodes all appear to be coming from notifications.  Does the device have 800+ pending notifications at this point?
Flags: needinfo?(khuey)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #94)
> The increase in DOM nodes all appear to be coming from notifications.  Does
> the device have 800+ pending notifications at this point?

please let me know if I can run any command to confirm it.
Flags: needinfo?(khuey)
Can you attach a screenshot or a photograph of the home screen of the device?
Flags: needinfo?(khuey)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #96)
> Can you attach a screenshot or a photograph of the home screen of the device?

I can see 1177 in top left corner of device's homescreen . Do you mean this number as notification numbers ? Please let me know if you still need a screenshot.
Flags: needinfo?(khuey)
(In reply to Tapas Kumar Kundu from comment #97)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #96)
> > Can you attach a screenshot or a photograph of the home screen of the device?
> 
> I can see 1177 in top left corner of device's homescreen . Do you mean this
> number as notification numbers ? Please let me know if you still need a
> screenshot.

Yes, that's the notification count. That's a lot ;)
(In reply to Fabrice Desré [:fabrice] from comment #98)
> (In reply to Tapas Kumar Kundu from comment #97)
> > (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #96)
> > > Can you attach a screenshot or a photograph of the home screen of the device?
> > 
> > I can see 1177 in top left corner of device's homescreen . Do you mean this
> > number as notification numbers ? Please let me know if you still need a
> > screenshot.
> 
> Yes, that's the notification count. That's a lot ;)

Basically device is sending sms but the recipient device does not have SMS enabled. So it is generating some notifications. is this a memory leak or we need to change our script for stability testing so that it sends SMS to only SMS enabled mobile ? please confirm us.
Flags: needinfo?(fabrice)
(In reply to Tapas Kumar Kundu from comment #99)

> Basically device is sending sms but the recipient device does not have SMS
> enabled. So it is generating some notifications. is this a memory leak or we
> need to change our script for stability testing so that it sends SMS to only
> SMS enabled mobile ? please confirm us.

I think it's expected that we keep the notification in this situation. I would send the sms to a sms enabled mobile instead.
Flags: needinfo?(fabrice)
The alternative would be to have your test clear the notification once in a while.
We should probably consider having a cap on the number of notifications of this type, or maybe only one per number.
Flags: needinfo?(khuey)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #102)
> We should probably consider having a cap on the number of notifications of
> this type, or maybe only one per number.

I filed bug 989182 for this.  Regardless, I don't think that we should block on sending hundreds of SMSs to a number that does not support SMS, so I would prefer that you changed the test to send them to a number that can receive SMS.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #103)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #102)
> > We should probably consider having a cap on the number of notifications of
> > this type, or maybe only one per number.
> 
> I filed bug 989182 for this.  Regardless, I don't think that we should block
> on sending hundreds of SMSs to a number that does not support SMS, so I
> would prefer that you changed the test to send them to a number that can
> receive SMS.

yeah I will do that change in our testing. But it is not real fix. Real fix should come from bug 989182.

@inder: can you please suggest for bug 989182 ? Should we block it for v1.4 ?
Flags: needinfo?(ikumar)
I looked around for other orphaned DOM nodes and noticed that there is an orphaned subtree at 0xae346dd0.  This subtree has an iframe in it.  It's held alive by 

0xb2ff3180 [nsGlobalWindow #3]
    --[mListenerManager]--> 0xb6b36220 [nsEventListenerManager]
    --[mListeners event=onmozContentEvent listenerType=2 [i]]--> 0xafb701c8 [nsX
PCWrappedJS (nsIDOMEventListener)]
    --[mJSObj]--> 0xaa5aa200 [JS Object (Function - contentEvent)]
    --[fun_callscope]--> 0xaa274970 [JS Object (Call)]
    --[callback]--> 0xaa5aa1e0 [JS Object (Function - ContentPermissionPrompt.pr
ototype.delegatePrompt/]
    --[fun_callscope]--> 0xa90ed790 [JS Object (Call)]
    --[callback]--> 0xaa5aa1c0 [JS Object (Function - onCallback)]
    --[fun_callscope]--> 0xa90ed740 [JS Object (Call)]
    --[frame]--> 0xac55c490 [JS Object (Proxy)]
    --[private]--> 0xabcd3850 [JS Object (Proxy)]
    --[private]--> 0xabcd3610 [JS Object (HTMLIFrameElement)]
    --[UnwrapDOMObject(obj)]--> 0xafe94400 [FragmentOrElement (xhtml) iframe id=
'browser1463' class='browser hidden' app://system.gaiamobile.org/index.html]
    --[[via hash] mListenerManager]--> 0xafe94e80 [nsEventListenerManager]
    --[mListeners event=onappclose listenerType=3 [i]]--> 0xafb7be20 [CallbackOb
ject]
    --[mCallback]--> 0xabcea620 [JS Object (Function - onclose)]
    --[**UNKNOWN SLOT 0**]--> 0xabcb2d90 [JS Object (Object)]
    --[fadeOverlay]--> 0xabcd39d0 [JS Object (HTMLDivElement)]
    --[UnwrapDOMObject(obj)]--> 0xae4065b0 [FragmentOrElement (xhtml) div class=
'fade-overlay' app://system.gaiamobile.org/index.html]
    --[GetParent()]--> 0xae346dd0 [FragmentOrElement (xhtml) div id='appframe146
2' class='appWindow fullscreen-app render' app://system.gaiamobile.org/index.htm
l]

Where the root nsGlobalWindow #3 is the root window of the system app.  What's happening here is that we are prompting (it appears to be the camera app prompting for geolocation) and the prompt is never answered.  I wonder why we're not shutting down prompts when the app is closed ...
> @inder: can you please suggest for bug 989182 ? Should we block it for v1.4 ?

Yes, that's a good to have fix. Please mark it for 1.4 CS blocking.
Flags: needinfo?(ikumar)
I am running another stability run with suggestion from #comment 103 .  I will update here soon with that result. 

kyle, are you creating a separate bug-id for #comment 105 ?
Flags: needinfo?(tkundu)
Flags: needinfo?(khuey)
Sounds like you all found the issues in the last run.  Clearing my ni for now.
Flags: needinfo?(bkelly)
I can split off 105, but it's a minor issue.  Don't think we need to block on it.

Is the current level of memory usage acceptable for 1.3?  I think we've found all the low-hanging fruit here ...
Flags: needinfo?(khuey)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #109)
> I can split off 105, but it's a minor issue.  Don't think we need to block
> on it.

Kyle, agree that we don't need to block on it but can you please open a bugzilla bug anyways? This is an issue and needs to be fixed as well.

> 
> Is the current level of memory usage acceptable for 1.3?  I think we've
> found all the low-hanging fruit here ...

@tapas can provide the details but yes it looks like we are done here.
Highest b2g memory usage is less than <70MB for 48 hours run. I confirmed it with latest stability run.
Thanks everyone for all the help with this. We solved some real good issues both in testing and in framework with this bug.
FYI, there is still a kernel SLAB growth we are observing which is not getting reclaimed. Tapas is working with Kernel team on this. We will update you all if we find any issues from there.
But, for now I am marking this bug resolved.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → 1.4 S5 (11apr)
Flags: in-moztrap-
Whiteboard: [CR 615380][MemShrink:P1][c=memory p=4 s= u=1.3] → [caf priority: p1][CR 615380][MemShrink:P1][c=memory p=4 s= u=1.3]
You need to log in before you can comment on or make changes to this bug.